Improving Soil Information with 
Generative and Machine Learning 
Models 
 
Will data augmentation and machine learning help improve the quality of soil 
data predictions over a large area?  
 
Master’s thesis in Applied Data Science 
 
 
 
Ebrahim Jahanshiri 
 
 
 
 
 
 
 
 
DEPARTMENT OF COMPUTER SCIENCE 
GOTHENBURG UNIVERSITY AND CHALMERS UNIVERSITY OF TECHNOLOGY  
Gothenburg, Sweden 2022 
www.chalmers.se/www.gu.se 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 ii 
 
 
 
 
 
 
               Master’s thesis 2022 
 
 
 
 
 
 
 
 
Improving Soil Information with Generative and Machine Learning Models 
 
 
 
 
Ebrahim Jahanshiri 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Department of Computer Science and Engineering 
Chalmers University of Technology 
University of Gothenburg 
Gothenburg, Sweden 2022 
 iii 
 
 
 
 
 
 
 
 
Supervisor at CSE:   Dr. Yinan Yu   Department of Computer Science  
Examiner:    Prof. Peter Damaschke  Department of Computer Science  
 
 
Improving Soil Information with Generative and Machine Learning Models 
Ebrahim Jahanshiri 
 
 
 
 
 
 
 
 
© EBRAHIM JAHANSHIRI, 2022.  
Master’s Thesis 2022 
Department of Computer Science and Engineering 
Chalmers University of Technology and University of Gothenburg  
SE-412 96 Gothenburg 
Telephone +46 31 772 1000  
 
 
Cover: prediction surface for pH for Sri Lanka 
 
 
 iv 
 
 
 
 
 
 
Improving Soil Information with Generative and Machine Learning Models 
Master’s Thesis 2022 
Department of Computer Science and Engineering 
Chalmers University of Technology and University of Gothenburg  
SE-412 96 Gothenburg 
Telephone +46 31 772 1000  
 
Abstract  
 
Soil data observations are among the most difficult data to collect. Low sample density along with the 
high cost of sampling has made current soil information that are usually presented as maps, unusable 
for detailed applications such as modelling earth system dynamics, crop modelling, natural hazards 
prediction and climate change impacts. In this research, a new method for data augmentation of already 
available data was used to improve the accuracy levels of current soil maps across the country of Sri 
Lanka. Soil profile datasets were collected from custodians of national soil profile data and combined 
with data that were mined from published sources. Moreover, two extra datasets were developed to 
augment the original sample size. A series of pseudo-samples were generated by analysing over 600 
images obtained from NASA MODIS MCD34A4 product, to locate Sand dunes with a high percentage 
of sand and low organic carbon. Another dataset was created by training and validation of a spatial 
generative adversarial neural network (SpaceGAN) that can mimic the original distribution of soil 
properties. The augmented datasets were used to predict the values of soil elements at unsampled 
locations with machine learning models. The performance of successful models was then compared 
with different levels of data augmentation. Results showed that data augmentation, particularly with 
the generated data from SpaceGAN can enhance the knowledge of some soil elements and could be 
explored as a viable option to improve the overall accuracy of soil information.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Keywords: data augmentation, soil data, mapping, generative adversarial models, random forest 
 
 
 
 v 
 
 
 
 
 
 
Contents 
Abstract ..................................................................................................................................... v 
Contents ................................................................................................................................... vi 
List of Figures ....................................................................................................................... viii 
List of Tables ........................................................................................................................... ix 
1. Introduction ...................................................................................................................... 1 
1.1. Soil data .................................................................................................................................1 
1.2. Why we need better soil data ..............................................................................................1 
1.3. Problem statement................................................................................................................1 
1.4. Contribution .........................................................................................................................2 
2. Theory ............................................................................................................................... 4 
2.1. Improving soil data quality .................................................................................................4 
2.1.1. Published sources .............................................................................................................................4 
2.1.2. Pseudo-samples ................................................................................................................................5 
2.1.3. Data augmentation with deep learning .............................................................................................5 
2.1.4. Covariate features.............................................................................................................................6 
2.2. State of the art in soil mapping ...........................................................................................6 
2.2.1. Deterministic interpolation models ..................................................................................................7 
2.2.2. Stochastic interpolation models .......................................................................................................7 
2.2.3. Machine Learning models ................................................................................................................7 
3. Methods ............................................................................................................................. 9 
3.1. Data ........................................................................................................................................9 
3.1.1. Observed data ...................................................................................................................................9 
3.1.2. Pseudo-observations .......................................................................................................................11 
3.1.3. GAN data generation......................................................................................................................12 
3.1.4. Covariate features...........................................................................................................................13 
3.1.5. Training, testing, and prediction datasets .......................................................................................14 
3.1.6. Coordinate reference system ..........................................................................................................15 
3.2. Analysis methods ................................................................................................................15 
3.2.1. Machine learning models ...............................................................................................................15 
3.2.2. Evaluation ......................................................................................................................................16 
3.3. Software...............................................................................................................................16 
4. Results ............................................................................................................................. 17 
4.1. Tuning data augmentation model .....................................................................................17 
4.2. Exploratory analysis ..........................................................................................................19 
4.3. Modelling results ................................................................................................................20 
4.4. Model performance ............................................................................................................22 
4.5. Final maps ...........................................................................................................................23 
5. Discussion........................................................................................................................ 25 
5.1. Data augmentation .............................................................................................................25 
5.2. Mode collapse......................................................................................................................27 
5.3. Accuracy assessment ..........................................................................................................28 
 vi 
 
 
 
 
 
 
5.4. Predictions and features ....................................................................................................29 
5.5. Data usability ......................................................................................................................32 
5.5.1. Modern soil data .............................................................................................................................32 
5.5.2. Machine learning interpretability dilemma ....................................................................................33 
5.5.3. Automation .....................................................................................................................................33 
6. Conclusion ...................................................................................................................... 34 
7. Future work .................................................................................................................... 36 
8. References ....................................................................................................................... 37 
Appendix ................................................................................................................................. 42 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 vii 
 
 
 
 
 
 
List of Figures  
 
 
 
FIGURE 1: LOCATION OF OBSERVATION DATA PROFILES AND LITERATURE 
DATA .............................................................................................................................. 10 
FIGURE 2: THE LOCATION OF PSEUDO SAMPLES THAT WERE DERIVED FROM 
MODIS SATELLITE DATA ON 2ND SEPTEMBER 2021 ............................................ 11 
FIGURE 3: PSEUDO-SAMPLES OVERLAID ON GOOGLE SATELLITE IMAGES ....... 12 
FIGURE 4: SPATIAL ANALYSIS STEPS FOR DERIVING RANDOM SAMPLES ......... 12 
FIGURE 5: GRID OF 13000 POINTS FOR PREDICTION AND MAPPING ...................... 13 
FIGURE 6: DATASET DEVELOPMENT PROCESS FLOW............................................... 15 
FIGURE 7: TRAINING AND VALIDATION STEPS........................................................... 16 
FIGURE 8: LOSS, RMSE, AND MIE PERFORMANCE FOR MODEL SELECTION ....... 19 
FIGURE 9: MODEL PERFORMANCE AGAINST AUGMENTED DATA, RF: RANDOM 
FOREST, MLP: MULTI-LAYER PERCEPTRON ........................................................ 21 
FIGURE 10: FINAL MAPS CREATED WITH PREDICTED DATA AT THE GRIDS 
COORDINATES ............................................................................................................. 24 
FIGURE 11: NORMAL DISPERSION AROUND THE MEAN AND LOW SD FOR PH 
(TOP) AND CEC (BOTTOM) ........................................................................................ 26 
FIGURE 12: SPATIAL DISTRIBUTION OF GENERATED DATA BASED ON TWO 
GROUPS OF TREATMENTS ........................................................................................ 28 
FIGURE 13: ORIGINALLY OBSERVED AND GENERATED DATA OVERLAID ON 
THE PREDICTION MAP ............................................................................................... 29 
FIGURE 14: FEATURE IMPORTANCE OF PH ELEMENT PREDICTIONS .................... 31 
FIGURE 15: COMPARISON BETWEEN OBSERVED AND GLOBAL SOIL DATA FOR 
PH (ADAPTED FROM (WIMALASIRI ET AL., 2020C) ............................................. 32 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 viii 
 
 
 
 
 
 
List of Tables 
 
 
TABLE 1: CHARACTERISTICS OF SOIL PROPERTIES USED IN THIS RESEARCH .... 9 
TABLE 2: CHARACTERISTICS OF OPTIMUM SPACEGAN GENERATOR AND 
DISCRIMINATOR NETWORKS................................................................................... 17 
TABLE 3: TARGET VARIABLE STATISTICS ACROSS THREE TYPES OF DATA ..... 20 
TABLE 4: CHARACTERISTICS OF SUCCESSFUL MODELS FOR ALL ELEMENTS 
(ST TRAINING STRATEGY) ........................................................................................ 22 
TABLE 5: COMPARISON OF MODEL ACCURACY WITH PREVIOUS STUDY (ST 
STRATEGY) ................................................................................................................... 23 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 ix 
 
1. Introduction  
 
1.1. Soil data  
 
High-quality soil information is required for many environmental and Earth sciences, where 
functions of soil such as water preservation, housing micro-organisms, and nourishing plants’ 
roots are studied (Chesworth, 2007). Although soil depth can reach up to a few hundred meters 
(Richter and Markewitz, 1995), it is its first few layers that are usually of interest. Soils are 
created from parent material for a period of several thousand years, and all the environmental 
phenomena including but not limited to climate and organisms can affect the soil types in 
different ways. All these interactions within the soil provides a natural body that has many 
functions, from being a source of minerals, organic material, nutrients, and water, to sustaining 
plant growth and food production. Soil is also part of cultures and habitats and supports man-
made structures. Data of observational samples (or pedons) that are taken by digging deep into 
the soils are rarely provided to user communities. Instead, soil information is available through 
soil maps that are aggregations of similar soils delineated into units or polygons that are 
provided as printed maps with different colours representing different soil types.  
 
1.2. Why we need better soil data 
 
Soil information plays an important role in human life by impacting our decisions related to 
the environment. General soil knowledge is needed for decision-making at the larger scales, 
where limited general information is enough to devise a strategy or recommend a course of 
action. An example of such decisions is determining optimal land use and suitability of crops 
at the national level and providing a general recommendation about what crops can grow in a 
region (Wong, 1974). Likewise, soil data are important for understanding environmental risks 
and hazards such as pollution, floods, and global warming. However, detailed soil information 
is also needed at the smaller scales for understanding soil-crop relationships, yield forecasting, 
and water management (Dobos and Hengl, 2009). Also, many environmental models require 
specific information about a particular soil property such as texture, water holding capacity, 
degree of acidity and alkalinity, soil electrical charge, and so on. Nonetheless, the conventional 
soil data can no longer cater to the needs of such models (MacMillan, 2019) often at the level 
of individual soil elements (see section 3.1.1 for a list of these properties). Therefore, as more 
and more data are needed, it is important to improve the soil data quality. 
 
1.3. Problem statement 
 
Increasing demands for quantitative soil data have led to a shift from providing aggregated data 
in the form of associations of soil series maps to per-property soil data inventories or maps. In 
this regard, soil data are increasingly mapped for a specific property across different scales. 
This has led to a growing interest in developing spatial and non-spatial data structures which 
could be used for the storage and representation of soil data. Also, methods that were previously 
developed in computer science to predict and interpolate information are currently used to 
develop maps of soil properties. A major problem is uncertainty in modelling and accuracy of 
data. The source of uncertainty in soil data could be from the low number of input sample data 
and data diversity. Also, available data are rarely collected following a sampling design that is 
statistically based (identically and independently distributed). This also leads to the existence 
of large gaps or space between samples that leads to lower accuracy of predictions in areas 
 
 
 
 
 
where sample density is low. Other issues with current sample data include error propagation 
and uncertainty in associated data such as variation in depths, time of sampling, and coordinate 
errors that is inherent to sample analysis in soil survey methodologies (MacMillan, 2019).  
 
1.4. Contribution 
 
Using de-facto and state-of-the-art techniques to generate maps of continuous soil properties 
is not a new field. A challenge for mapping, however, is its accuracy and uncertainty at 
locations where sample density is low and local relationships between the target variable and 
the features (covariates) cannot be explained (Hengl et al., 2017a). Therefore, the objectives of 
this research were to: 
1- Increase sample size by acquiring more soil data by applying deep learning models to 
the distribution of already available soil property data; 
2- Use machine learning models to evaluate the accuracy of predictions using different 
combinations of both observed and generated data; 
3- Discuss the strengths and weaknesses of using data augmentation, dissemination, and 
publication of the soil maps for their benefit to the public.  
The main hypotheses for this research were the following: 
1- Both data augmentation and machine learning methods improve the quality of 
predictions.  
2- Machine learning methods provide a more robust approximation of real spatial patterns 
from data without the need for any statistical assumptions.  
 
To solve the data issues, more sample information from 3 sources was added: 1) data that is 
mined from published articles that contain analysis of soil data, 2) expert-based or pseudo-
observations that are obtained by analysing satellite data, and 3) data augmentation using deep 
neural network models. Together, all data represent two types of information: 1) soil data 
observations and 2) artificially generated data.  
A feature dataset containing 157 data layers from different sources such as satellite images, 
climate, soil, and topography was also created. All these features are representative of 
environmental factors that are responsible for soil variation across space and time.  
The observational data were then used to train a Conditional Generative Adversarial Net 
(CGAN) that is adapted to work for data with coordinates (Latitude, Longitude) and spatial 
neighbours. A major problem with the output was the lack of diversity in the generated data. 
This problem was somehow tackled with data standardisation and oversampling of the tails of 
the input distribution.  
The training data was then combined with the generated data at different percentages (5%, 
10%, 20%, etc.) to test the impact of generated data on the final outputs of a random forest 
model. For each soil element, an optimum data augmentation mixture was selected based on 
the lowest root mean square error. The final results showed an improvement over de facto 
statistical models that were used in the previous studies using additional observed data and 
machine learning models. Successful models were used to predict values over a grid of 13,000 
points spread over the entire country of Sri Lanka to produce final maps. Visual interpretation 
of the final maps showed that machine learning models were able to detect and map general 
 2 
 
 
 
 
 
 
spatial patterns without any pre-training that is usually required for the statistically-based 
models. More work needs to be done in the future to optimise the models, both for generating 
data for soils and optimising the final prediction and maps. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 3 
 
 
 
 
 
 
2. Theory 
 
Semi-detailed soil maps provide information for planning at the national or regional levels. 
These maps are traditionally developed by surveying the soils and aggregating observations or 
descriptions of soils based on samples or observations about the type, colour, and other 
physicochemical characteristics of the soils. However, such data lose “usability” when detailed 
information is required at smaller scales where informed decisions are crucial and risk levels 
are high such as crop and water management. This is due to many factors related to the scale 
of soil maps (failure to capture local variation) and forms of data availability (map polygons), 
or funding limitations. For example, in a case where an estimate of crop yield is required to be 
determined under current and future climates, soil data accuracy can have a severe impact on 
the conclusions that are made based on using those data (Wimalasiri et al., 2020c).  
 
Studies have shown the impact of soil data quality on the results of environmental modelling 
applications such as landslide susceptibility, desertification, and climate change (Stoorvogel et 
al., 2017). One of the most important requirements of these models is high-resolution 
quantitative data. An example of such variables are the amount of carbon and or biomass in the 
soil and standardised quantities such as pH, bulk density, categories such as soil classes, 
probabilities such as the occurrence of stones beneath the soil (R-horizon), and others. Such 
variables are now directly measured through soil profile samples obtained by surveying soils 
in the laboratory (Legros, 2006). These measurements assign a value to a quantitative variable 
at a location. The so-called primary data that are collected directly from the field can also be 
used to develop secondary information. For example, soil carbon storage can be predicted from 
the amount of carbon that was available within samples.  
 
2.1. Improving soil data quality 
 
Efforts have been made in recent years to improve the quality of soil information by taking 
advantage of state of the art in big environmental data and machine learning techniques. Maps 
of spatial variability of soil data can be improved by using soil data that are published in 
research articles and literature studies. Further methods are used to fill the gaps in soil data by 
utilising remote sensing data. These methods are particularly useful for determining the 
location of Sand dunes and rock outcrops where a particular soil element such as Sand is at the 
highest quantity or in areas where there is no soil available. In recent years, machine learning 
models have also been used for the prediction of soil properties. Generative models can 
particularly be useful in providing extra information needed for soil mapping.  
 
2.1.1. Published sources 
 
As soil sampling remains to be a costly process, soil series maps are rarely updated at regional 
and national scales. On the other hand, as soil data is required for many environmental 
applications such as agriculture, soil samples that are collected locally are rarely available 
publicly or seldom published separately in online databases (Batjes et al., 2020). Furthermore, 
soil data are collected for a variety of purposes such as construction. Also, many published 
studies may contain soil information that are the result of soil sampling at the local scale. An 
example of such studies is agronomy studies that look at the performance of crops at a 
particular location (soil type) in response to specific treatments such as fertiliser and irrigation. 
Data mining from literature could provide a viable option for increasing sample size, 
 4 
 
 
 
 
 
 
particularly at locations where no sample profile data is available or data mining from the 
published literature (Sun et al., 2020). 
 
2.1.2. Pseudo-samples  
 
A few studies have also looked at improving sample size (and therefore improving the accuracy 
of predictions) with the addition of pseudo-samples derived from remote sensing data (Hengl 
et al., 2014; Li et al., 2020; Shangguan et al., 2017a). Some samples can be artificially 
generated by applying expert knowledge. For example, Shangguan et al., (2017) used satellite 
data to distinguish the areas where no soil is available (outcrops, hilly and rocky areas) to add 
a set of pseudo-samples to their dataset of soil depth. This increased the sample size and 
coverage and hence the quality of the spatial predictions and maps. Such data mining can also 
be applied to other soil elements such as texture, organic matter, and bulk density to improve 
the quality of estimations. As (Dewitte et al., 2012) discussed, although using satellite remote 
sensing results cannot be the only resource for deriving soil information, they would improve 
our understanding of the surface soil properties.  
 
2.1.3. Data augmentation with deep learning 
 
In recent years, machine learning techniques have been used to generate new data for image 
analysis and map generation. The adversarial nature of Generative Adversarial Nets (GAN) is 
the most important feature of these models (Creswell et al., 2018). A blunt “generator” that 
learns from the original data distribution, generates new data from a latent noise. A 
“discriminator” will then be trained to recognise the fake data from real data. Both networks 
are trained simultaneously, and the training stops when a criterion is met. GAN models have 
been successfully used for generating images and maps before. For example, Li et al., (2020) 
used GAN-based image translation models to generate new data from the distribution of 
already available data. Likewise, Spick et al., (2019) used a modified spatial GAN method to 
improve the detail of elevation data that was previously available at a low resolution only. 
GAN models can be used to develop new datasets that can better resemble the distribution of 
prior soil data and improve the overall sample density.  
 
Given the geospatial nature of soil data, particularly the characteristics of such data that is 
inherently non-iid (independently or identically distributed), a specific model for data 
augmentation, particularly for spatial data has been developed to cater to the need to train for 
data with coordinates (Klemmer et al., 2019). The correlation between data points can be used 
to generate new data using the distribution of known soil elements using features such as 
coordinates of the sample data (Latitude and Longitude), spatial neighbourhood matrices, and 
other soil variables as extra features. The converging mechanism for training the neural 
network model could be based on some spatial correlation structures or simply a measurement 
of error. 
 
The conditional GAN considers the spatial correlation that exists within geospatial data, as a 
mechanism for convergence of the generator G that takes random noise as input to generate 
synthetic data and a discriminator D that does a binary classification. To carry a conditional 
GAN for spatial point data, a local indicator of spatial correlation called Moran’s I (Anselin, 
1995) is computed for each generated set against the training set.  
 
To derive Moran’s I, a weight matrix containing information about neighbours of each point i 
should be assigned. The weight matrix can be defined in such a way that gives more weight to 
 5 
 
 
 
 
 
 
the neighbouring location than far locations. The main assumption here is that the target 
variable, which in this case is one of the soil elements (bulk density, pH, Organic carbon, etc.) 
is distributed over the space following a known spatial process. A realisation of the spatial 
process is the present samples 𝑁 = {1,2, … , 𝑛} . For each point in the dataset, a set of 
neighbours can be defined that are considered to systematically influence the value of each 
target point. The weight value 𝑤𝑖,𝑗 are assigned to the neighbours, (js) of point i, following a 
standard criterion to define nearest neighbours, 𝒩𝑖 = {𝑗𝜖𝑁|∃𝑖 ∋ 𝑁 ∶  𝑤𝑖,𝑗 ≠ 0} . I is then 
calculated for each i: 
𝑛
𝑦𝑖 − ?̅?
𝐼𝑖 = (𝑛 − 1) 𝑛 ∑ 𝑤𝑖𝑗(𝑦𝑖 − ?̅?) ∑𝑗=1,𝑗≠1(𝑦𝑖 − ?̅?)
2
𝑗=1,𝑗≠1
 
where ?̅? is the mean of the target variable and 𝑤𝑖𝑗  is the components of the weight matrix. A 
stopping mechanism for convergence that looks for minimisation of Moran’s I mean absolute 
error is defined as (Klemmer et al., 2019): 
𝑛
𝑀𝐼𝐸 = ∑ |𝐼(𝑦𝑖) − 𝐼(?̂?𝑖)| 
𝑖=1
Where 𝐼(?̂?𝑖) is the Moran’s I calculated from the generated sample. Klemmer et al., (2019) 
also implemented the root mean square error (RMSE) criteria and compared it with the original 
riteria (MIE).   
 
2.1.4. Covariate features 
 
Spatial prediction models for soil utilise information from an extensive array of covariate 
features for their prediction. Whether the prediction method is a simple regression model or a 
highly sophisticated quantile regressions forest, highly correlated variables with the target 
variable play an important role in the prediction. The main justification for such correlation is 
the interactions that are usually attributed to soil structure and properties with different 
environmental conditions. The variation in the soil can be correlated to different conditions in 
climate, organisms, relief, age of parent material, and other factors (Minasny and McBratney, 
1999). The geo-environmental features can range from climatic variables (temperature, 
precipitation) to topographical features derived from digital elevation models (DEMs) and soil 
physicochemical and compositional properties (Lalitha et al., 2021; MacMillan, 2019; 
Moghaddam et al., 2020; Shangguan et al., 2017b).  
 
2.2. State of the art in soil mapping 
 
The main goal of soil prediction models is to reduce the bias and variance in predictions. 
However, different models utilise different pathways for predictions that are based on widely 
different assumptions. For example, conventional statistical models assume that patterns of 
spatial variation exist within the soil data because nearer observations tend to resemble similar 
values. This is based on Tobler’s first law of geography that entails similar objects are closer 
to each other than far objects (Waters, 2018). Other models tend to ignore such relationships 
and try to use only mechanistic models to arrive at a prediction. On the other hand, machine 
learning models can be developed to understand the complex relationship between target 
variables and the covariates within the context of linear or non-linear models. 
 
 6 
 
 
 
 
 
 
The overall interpolation and prediction of soil data have been traditionally made with the help 
of physically based, statistical, and machine learning models (Hengl et al., 2017b; Kuriakose 
et al., 2009; MacMillan, 2019). However, in recent years it is repeatedly shown that machine 
learning models and in particular, tree-based variants are very successful in producing 
acceptable predictions for soil properties. Statistical methods such as kriging and deterministic 
approaches such as Inverse Distance Weighting (IDW) have also been used to develop and 
compare the results of prediction. Kriging is particularly useful in modelling the stochastic part 
of the variation when it is combined with other models such as random forest. For example 
Szabó et al., (2019) used an ensemble of kriging with random forest models to improve the 
mapping accuracy of soil hydraulic properties. 
 
2.2.1. Deterministic interpolation models  
 
Prediction or interpolation models based on soil observation data (augmented average of 
neighbouring locations) are based on methods such as Theisen polygons and IDW. These 
methods consider the position of samples to create a mathematical relationship for the 
prediction at unknown locations. For example Karlsson et al., (2014) used four methods,  linear 
regression (LR) with topographical variables as independent variables, IDW, Trigonometric 
approach (TA) and a simple regolith model (SRM) with outcrops supplied from a digital 
elevation model (DEM) to predict soil depth for three locations in Stockholm County, Sweden 
(Tyresö, Österåker, Vallentuna). The results showed that depending on the sample size, both 
IDW and LR can perform well, while physically based models such as SRM can perform well 
if the drilling data are available but very sparse.  
 
2.2.2. Stochastic interpolation models  
 
Methods that include inherent spatial variability patterns defined as ‘spatial autocorrelation’ 
are collectively called geostatistics (Goovaerts and Goovaerts, 1997). The main premise of 
geostatistics is to provide an estimate within a reasonable scoping area that the data covers for 
unsampled locations based on a model of variation that is developed as a function of distance. 
The idea here is that phenomena such as soil depth can show randomness in their behaviours 
across space since there are too many unknown factors that affect the value at each location. In 
this case, at location 𝐱, a value 𝑍(𝐱)  is drawn randomly from some probability distribution 
with its cumulative distribution function (CDF) and probability distribution function (PDF). 
The true realisation  of random function (actual values), 𝑍(𝐱1), 𝑍(𝐱2), … , 𝑍(𝐱𝑖)  at locations 
𝐱𝑖 , 𝑖 = 1, … 𝑛 constitute a regionalised variable (Webster and Oliver, 2007). Therefore, in the 
case of spatial data, the randomness could be double (both values at each location and the 
locations themselves).  The values within regionalised variables are not independent and closer 
locations generally exhibit similar values (Jahanshiri et al., 2015). Until recently, the statistical 
methods were the de facto methods for the prediction of spatial data. Methods such as kriging, 
are still used for the prediction of soil properties at smaller scales.  
 
2.2.3. Machine Learning models 
 
Random forest uses fully grown decision trees (if-then questions about labelled data or a 
threshold in the case of continuous variables) to create an ensemble that will then vote for best-
representing classes (Breiman, 2001). Random forest is shown to be the most accurate method 
for prediction when compared exhaustively with other methods (Fernández-Delgado et al., 
2014). The gradient boosting technique (Chen and Guestrin, 2016) primarily reduces bias and 
 7 
 
 
 
 
 
 
variance by giving more weight to badly classified observations. Both of the above-mentioned 
methods have been used in soil mapping (Hengl et al., 2017a) with various degrees of success.  
 
Other methods such as artificial machine learning (ANN) methods have also been used recently 
to develop maps of spatial variability of soils. It is reported that ANN can also detect spatial 
patterns and predict the values at unsampled locations like statistician methods. For example 
Bouasria et al., (2022) investigated the use ANNs in the prediction of soil organic matter (OC) 
and report an improvement over the linear regression method. The use of covariate features, 
particularly those related to terrain are essential for any machine learning model that is dealing 
with the soil data prediction (Bagheri bodaghabadi et al., 2015) 
 
The use of machine learning models for soil mapping is on the rise and as (Heuvelink and 
Webster, 2022) argue: “they are here to stay”. One reason for their popularity is that more and 
more data are made available and the extent to which environmental data can be used as 
features or independent variables has made their use even more plausible. Machine learning 
models are inherently non-spatial, that is, they are not trained to generalise any pattern other 
than those that they can discover themselves. This will make them even more powerful as they 
are no longer bound by the limits of traditional statistics, where a model of spatial prediction 
is needed to be constructed before any prediction can be made and normality of input variables 
is also considered in the analysis. 
 
 
 
 
 
 
 
 
 8 
 
3. Methods 
 
In this section, detail of procedures and methods that were followed to create training and 
evaluation datasets for the experiments are discussed. Also, the main methods used to examine 
the main hypotheses of the project (section 1.4) are presented.  
 
3.1. Data 
 
3.1.1. Observed data  
 
A dataset containing soil property data across Sri Lanka was made available to the author by 
Prof. Ranjith B. Mapa, Emeritus Professor and former Chair in Soil Science, University of 
Peradeniya (Wimalasiri et al., 2020a). This dataset has already been used to develop soil 
property maps with limited accuracy due to its very limited size. Each observation contains 
data from different soil elements originally collected through SRICANSOL project that was 
implemented by the Soil Science Society of Sri Lanka (SSSSL) in collaboration with Canadian 
Society of Soil Sciences (CSSS) (Mapa, 2020). Table 1 shows the characteristics of soil 
properties (henceforth called elements) used in this research.  
 
Table 1: Characteristics of soil properties used in this research 
Element Explanation Unit 
Latitude, Longitude Location of the soil information. Important because soils Degrees (or 
(or X and Y) form similar patterns over space.  meters) 
Sand, Silt, Clay The proportion of mineral particles in the soil. Sands (0.05 - % 
2 mm), silt (0.002 - 0.05 mm) and clay (< 0.002 mm). 
Together provide information about soil fertility, biological 
processes, etc.  
BD Bulk density or the amount of pore space available within the g cm-3 
soil. It indicates how much air is available within soils.  
pH Degree of acidity and alkalinity that impacts the chemical Scale of 0 to 
properties of soil. These chemical properties will in turn 14 
determine many other factors such as the fertility of the soils 
EC Electrical conductivity is the ability of soil water to transmit milliSiemens 
electricity. It can be used to show the degree of salinity of per meter (mS 
soils and nutrient availability.  m-1) 
CEC Cation exchange capacity is the ability of soil to hold centimoles 
exchangeable positive cations such as Na, K, Mg, etc. These per kilogram 
minerals have a direct impact on the soil fertility of of soil 
negatively charges soils.  (cmol(+) kg-
1) 
OC Organic carbon content in the soil. The more organic carbon % 
in the soil the more fertile it becomes.  
VMC0.33, VMC15 Volumetric Moisture Content at two different pressure (Kilo mm mm-1 
Pascal). Indicators of the capacity of soils for holding water 
at two different levels, field capacity and permanent wilting 
point 
 
 
 
 
 
 
Although the observed soil profile dataset is a valuable source of information to understand 
soils across the country, the current sample size (n = 120) is simply not enough for modern 
earth system models that require detailed soil information (M Suhari et al., 2022).  
 
As soil samples are usually collected at different depths, comparing values, and making 
inferences based on the spatial patterns of such data is a challenge. To overcome this problem, 
data at different depths are harmonised to make them available for further analysis (Hengl et 
al., 2014; Sulaeman et al., 2013; Wimalasiri et al., 2020c). The soil properties were 
standardised for 5 standard depths: 0-5 cm, 5-15 cm, 15-30 cm, 30-60 cm and 60-100cm.  
 
To increase the density of soil data, data mining through literature analysis was carried out by 
extracting soil data that are available in published scientific articles.  Data from 183 published 
studies between 2006 and 2021 were collected manually and organised in collaboration with 
the University of Sabaragamuwa in Sri Lanka (Thathsarani, 2022).  Collected data include bulk 
density (gcm−3), particle density, porosity, moisture, pH, organic carbon content (%), organic 
matter content, electrical conductivity, cation exchange capacity (CEC) (cmol + kg–1), clay 
(%), sand (%) and silt content (%),  texture classes, soil colour,  volumetric moisture content 
(VMC) at 0.33 bars level (drainage upper limit – mm mm−1), VMC at 15 bars level (wilting 
point – mm mm−1. However as different studies were done based on different objectives, not 
all data were available for all the above-mentioned variables. For simplicity and to alleviate 
computational issues, only values at the top layers (0-5 cm) where chosen. Also, as data within 
published articles are collected using different objectives, a data harmonisation exercise was 
carried out to standardise their depths.  
 
Figure 1 shows the location of the additional 180 samples that were collected by literature 
analysis. The density of samples was increased from 1 sample per 1 square kilometre to 4 
samples per square kilometre across the country.  As soil data are collected from different 
depths they were standardised across different depths using the methodology that was 
developed by Sulaeman et al., (2013) (see section 3.1.1).   
 
 
 
 
Figure 1: Location of observation data profiles and literature data 
 10 
 
 
 
 
 
 
3.1.2. Pseudo-observations 
 
This data augmentation technique could improve the density of sample data by adding more 
information, particularly in areas where sample density is low. Data from the National 
Aeronautics and Space Administration of the U.S. (NASA) MODIS (MODerate Resolution 
Imaging Spectroradiometer) satellite was used for this purpose. The MCD43A4  (band 7 
infrared) data product contains information about the Earth’s surface reflectance or albedo 
(Schaaf, 2015).  
 
To identify likely locations that can be used as pseudo-samples, 600 images covering Sri Lanka 
over 2021 were analysed. Initial analysis revealed that the best image date to derive the 
locations was 2nd September 2021 due to its wide range of albedo values. It is worth 
mentioning that the MCD43A4 product is already cleared from clouds. A survey of literature 
revealed that the typical albedo value for Sand dunes is 0.38 (Ashkenazy and Shilo, 2018; 
Tetzlaff, 1983).  About 150 samples or 5% of the observed soil data as per recommendation of 
(Shangguan et al., 2017b) were selected using a script that was developed in R statistical 
language (R Core Team., 2020). The selected areas were mostly coastal (Figure 2). 
 
 
Figure 2: The location of pseudo samples that were derived from MODIS satellite data 
on 2nd September 2021 
To reduce the number of clusters in the data and to improve the quality, all generated samples 
were examined with an overlay analysis on Google Earth (Figure 3), and coinciding samples 
with land-uses or clustering samples were removed (Buchhorn et al., 2020). To simplify the 
process, a high value of sand 70% was given to these data points. Also, due to low amount of 
organic matter in sand dunes, the value of organic carbon for those locations was set to 0.  
 
 
 
 11 
 
 
 
 
 
 
 
Figure 3: Pseudo-samples overlaid on Google satellite images 
 
3.1.3. GAN data generation  
 
To improve the sample size, and sample coverage, geographical analysis was used to develop 
a random sample to fill the gaps in observed and literature data.  A circular buffer of 10 km 
was first generated for each sample location (Figure 4 A). The buffer areas were then joined 
together using a vector analysis operation in geographic information system (GIS) software 
(QGIS.org, 2022) to derive empty areas within the empty polygons (green areas Figure 4 B). 
A random sample of 300 points was then generated within the area that was under-represented 
by original samples. The location of these samples will be used to generate values for the soil 
elements (Table 1). The same coordinates were also used to derive the covariates features 
(section 3.1.4).  
 
A B C 
 
Figure 4: Spatial analysis steps for deriving random samples 
 
 
 12 
 
 
 
 
 
 
Data generation was carried out using a special deep learning algorithm that has recently been 
developed to use conditional generative adversarial nets (CGANs) that considers the spatial 
dependence of observations (see section 2.1.3). To perform the data augmentation, target 
variables (observed data) were organised based on coordinates so that data import is possible. 
The code for conditional GAN was adapted based on code from the Space-GAN repository 
(Klemmer, 2021). The algorithm and code were adopted to train neural nets based on spatial 
coordinates (Latitude and Longitude), nearest neighbour values, and the target variables (Table 
1). The main selection indicators are Mean Moran’s I Error (MIE) and RMSE. The functions 
were run several times to ensure the best hyperparameters are selected for each soil element.   
 
3.1.4. Covariate features 
 
Environmental processes and factors that can influence soil genesis are climates, vegetation, 
and topography (section 2.1.4). In order to capture these influences as much as possible, an 
extended dataset of more than 150 features containing layers from different environmental 
datasets that are available on Google Earth Engine (GEE) (Gorelick et al., 2017) was obtained. 
Two sets of coordinates were developed and uploaded to the GEE; a dataset containing 
locations of the observation and augmented data (Figure 1and Figure 4C) and the other, a 
dataset of locations for a grid of more than 13,000 grid points that will then be used for mapping 
(Figure 5).  
 
 
Figure 5: Grid of 13000 points for prediction and mapping 
 
The following layers were extracted using the geedataextract library that was developed in 
Python: 
1- Topography layers; elevation, aspect, and slope from NASA’s highest-resolution 
digital elevation model of the Earth (Farr et al., 2007). 
2- Climate and soil-related layer averages from TerraClimate monthly averages from 2010 
to 2020 (Abatzoglou et al., 2018). The layers include the following information: 
a. Actual evapotranspiration (mm) 
b. Climatic water deficit (mm) 
c. Palmer Drought Severity Index 
d. Reference evapotranspiration (Penman-Montieth, mm) 
e. Precipitation (mm) 
f. Runoff (mm) 
 13 
 
 
 
 
 
 
g. Soil moisture (mm) 
h. Downward surface shortwave radiation (W/m2) 
i. Snow water equivalent (mm) 
j. Min temperature (deg C) 
k. Max temperature (deg C) 
l. Vapor pressure (kPa) 
m. Vapor pressure deficit (kPa) 
n. Wind-speed at 10m (m/s) 
3- Normalized Difference Vegetation Index from USGS Landsat 4,5,7,& 8 Surface 
Reflectance Tier 1 products for the period 2013 to 2021 (U.S. Geological Survey, 
2022). 
 
A total of 156 features were created from the abovementioned list for both training-testing 
and grid dataset and made ready for further analysis.  
 
3.1.5. Training, testing, and prediction datasets 
 
The following procedure was followed to develop training, testing, and prediction datasets: 
 
1- The set consisting of observation data was divided randomly into a training (60%) and 
testing set (40%) using train-test split function in Scikit-learn package (Pedregosa et 
al., 2011).  
2- The training set were combined with augmented data to create a series of training sets. 
These sets will include 0 to 100% of augmented data. These training sets were used to 
train the prediction models.  
3- The testing set was then used to evaluate the result of the training.  
 
Figure 6 shows the entire procedure for developing datasets for the analysis.  
 
 14 
 
 
 
 
 
 
 
Figure 6: Dataset development process flow 
 
3.1.6. Coordinate reference system 
 
The main coordinate reference system (CRS) was WGS 84 -- WGS84 - World Geodetic System 
1984 with Geodetic Parameter Dataset (EPSG) of 4326 (“WGS 84 - EPSG:4326,” 2022), which 
is the official coordinate system for most of the global applications such as Global Positioning 
System (GPS) and satellite data. To perform spatial analysis, the map layer CRS were projected 
to Universal Transverse Mercator (UTM) zone 44N with EPSG of 32644 (“WGS 84 / UTM 
zone 44N - EPSG,” 2022).  
 
3.2. Analysis methods 
 
The next step is to utilize the training datasets to evaluate machine learning models for the 
prediction of select soil properties. The aim here is to improve the prediction by minimising 
the error and improving the extrapolation ability (Meyer, 2016) of the prediction models.  
 
3.2.1. Machine learning models 
 
Machine learning models, particularly tree-based algorithms such as random forest and 
gradient boosting are more successful than de facto statistical methods such as linear (or 
logistic) regression (Hengl et al., 2017a). Newly developed methods for determining accuracy 
(particularly by generating uncertainty maps) such as quantile regression (Kasraei et al., 2021) 
can also be examined to evaluate the uncertainty of predictions. Figure 7 shows the training 
and analysis steps that will be taken in this research.  
 
 15 
 
 
 
 
 
 
 
 
Figure 7: Training and validation steps 
 
Following the training and evaluation procedure, the successful method was used to predict the 
value of soil properties over a grid across the country of Sri Lanka. The final maps will follow 
the style and categories of the published maps (Wimalasiri et al., 2020a).  
 
3.2.2. Evaluation  
 
To evaluate and compare the results, both standard error analysis and cross-validation are 
chosen instead (MacMillan, 2019). The reason for the inclusion of cross-validation is twofold: 
cross-validation is the de-facto method for evaluating the performance of spatial models and it 
is also a suitable method to alleviate overfitting issues in machine learning methods. The 
statistics that were chosen to compare the model fitting were coefficient of determination (R2) 
and root mean square error (RMSE) as described by (Shangguan et al., 2017b).  
 
3.3. Software 
 
Many software packages are available for spatial data manipulation, prediction, and mapping. 
The focus of this research was on open-source software. Both R (R Core Team., 2020) and 
Python (Van Rossum and Drake Jr, 1995) were used to develop and automate data pre-
processing and analysis code. Basic exploratory data analysis and visualisations were done 
with R. The capabilities of machine learning in Python were used for testing a few machine 
learning algorithms (random forest and neural networks). Finally, mapping, and final 
visualisations was done within R. Geographic analysis was performed using an open-source 
GIS software (QGIS.org). 
 16 
 
4. Results  
 
In this section, statistics and main results are presented. Also, general trends in the outputs are 
identified and discussed and impetus is provided for the discussion section.  
  
4.1. Tuning data augmentation model 
 
All the training sets of the observations (Figure 1) were used to train a SpaceGAN model for 
each soil element. The trained models were then used to predict value for each target variable 
at the randomly selected points in Figure 4. These randomly selected points were subsequently 
used for training machine learning models and accuracy assessment (section 3.1.3). 
 
For each target variable within the first depth (0-5 cm), a SpaceGAN model was trained based 
on the coordinates of each training set as well as the target variable. Since the scale of the target 
variable is continuous, a distance-based neighbourhood weight matrix was developed with the 
Python Spatial Analysis Library Core (Rey and Anselin, 2007) using all three variables 
(latitude, longitude, target variable) for each target variable (Klemmer, 2021). To improve the 
results and avoid mode collapse (Klemmer, 2022), a series of pre-processing steps were done. 
It was determined that removing outliers that are distanced 1.5 times the interquartile range 
(Q3-Q1) was deemed to be a useful strategy. Normalisation and scaling were not very helpful 
due to back-transformation needs for the predicted data. Other measures to avoid the mode 
collapse problem are discussed in section 5.2. Table 2 shows the characteristics of the 
successful models that were used to generate data for all target variables.  
 
Table 2: Characteristics of optimum SpaceGAN generator and discriminator networks 
Target variables BD, CEC, Clay, EC, OC, pH, 
Sand, Silt, VMC0.33 and 
VMC15 
Neighbour features Target variables + Coordinates 
(Longitude, Latitude) 
Average epochs 3000 (2X Generator vs 
Discriminator) 
Batch size  100 
Latent noise input  Gaussian 
Generator Linear (2 times size of 
neighbour features, 50), ReLU, 
Linear (50, target dimension) 
Discriminator Linear (2 times size of 
neighbour features, 50), Tanh, 
Linear (50, 1), Sigmoid 
Learning rate 0.01 (both generator and 
discriminator) 
Neighbours  50 
Weight matrix Distance based 
Loss  SGD (stochastic gradient 
descent) 
Model selection metrics RMSE, MIE 
 
 
 
 
 
 
Training a SpaceGAN model using a small sample proved to be a non-trivial task. After many 
iterations and tuning of the parameters, the best parameters that would produce acceptable 
results in terms of accuracy were selected for further analysis. According to (Klemmer, 2022), 
this type of periodical loss is to be expected, particularly for smaller datasets. The structure that 
are described in Table 2 was successfully applied to all the elements except for Sand, which 
needed more training to generate loss function. Although MIE criteria seemed to be robust for 
some of the target variables, model selection based on the minimum RMSE proved to be a 
better choice for data augmentation for the majority of data elements.  
 
Three major treatments for GAN modelling were developed to improve the results and avoid 
the mode collapse phenomena (Klemmer, 2022; Pei et al., 2021): 
 
1- Tuning the GAN model based on different learning rates.  
2- Training Discriminator l times more than Generator (where l is an even number). 
3- Oversampling the tails of the original data distribution during training. The training 
batch was divided by two, and half of the data would be sampled specifically from 
the tails of the distribution (below Quantile 1 and beyond Quantile 3).  
4- Outlier removal: removing data points that are beyond the interquartile range of 
data points (Q1 or Q3 + 1.5 IQR where IQR = Q3 – Q1). 
5- Normalisation and lognormal transformation  
 
Based on the results of the training experiment, two groups of treatments were identified that 
offered the best results in terms of training statistics and spatial distribution of generated 
samples.  
- OT: learning rate (0.01), training Generator 2 times and training Discriminator 1 time, 
and outlier removal (methods 1, 2, 4 above) 
- ST: learning rate (0.01), training Generator 2 times and training Discriminator 1 time, 
outlier removal and oversampling tails (methods 1, 2, 3, and 4 above).  
 
Using any other treatment such as normalisation and transformation did not produce 
satisfactory results in terms of MIE, RMSE, and loss reduction. Figure 8 shows the loss and 
model performance for the successful models based on the ST treatment strategy.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 18 
 
 
 
 
 
 
BD CEC
  
Clay EC 
  
OC pH 
  
Sand Silt 
  
VMC0.33 VMC15 
  
Figure 8: Loss, RMSE, and MIE performance for model selection 
 
 
4.2. Exploratory analysis 
 
Table 3 shows basic statistics of observed and generated samples. Although generated data 
ranges are within the boundaries of the observed data, their statistics tend to move towards the 
mean of the observed data distribution and stay around the mean with little variation. Except 
for BD, Sand and Silt, generated mean values are smaller than those of the real data (OBS). 
Also, data dispersion around the mean is also lower for all the elements except BD. Without 
any exception, the minimum values for all the elements are higher than the observational data 
and their maximum is lower than the observations’ maximum. This is another indicator that 
generated data is respecting the boundaries of plausible values. The values for pseudo-samples 
for OC and Sand remain constant because they are assigned values (section 2.1.2) and therefore 
their mean is constant, and their standard deviation is 0.  
 
 
 
 19 
 
 
 
 
 
 
Table 3: Target variable statistics across three types of data 
  TYPE BD CEC CLAY EC OC pH SAND SILT VMC15 VMC0.33 
GEN 1.47 9.89 15.26 194.75 0.64 6.05 73.205 14.78 0.12 0.2 
MEAN 
OBS 1.42 16.83 19.2 516.62 1.45 6.34 67.79 11.53 0.14 0.2 
GEN 0.23 1.66 1.88 40.47 0.22 0.3 9.72 1.53 0.01 0.01 
SD 
OBS 0.23 37.53 13.67 2619.75 1.71 1.13 18.48 6.82 0.07 0.08 
GEN 0.97 5.02 10.97 85.45 0 5.4 51.39 11.1 0.09 0.16 
MIN 
OBS 0.62 0.91 0.1 0.02 0.04 3.1 1.68 0 0.02 0.03 
GEN 1.86 15.07 20.46 310.08 1.79 6.98 92.17 19.33 0.15 0.24 
MAX 
OBS 1.982 396 53.548 29100 24.558 8.584 99 41 0.437 0.553 
COUN GEN 299 299 299 299 329 299 329 299 299 299 
T OBS 428 428 428 428 428 428 428 428 428 428 
 
Note:   
- Generated samples (GEN) for OC and Sand include Pseudo-samples.  
- Observed (OBS) data are from (Thathsarani, 2022) 
 
4.3. Modelling results 
 
Except for pH and VMC (both 0.33 and 15 level), the response to various degrees of data 
augmentation: mixing observational data with the synthetic data (GAN and Pseudo samples). 
The most successful model for all the soil data elements was random forest except for Clay. 
Different levels of data augmentation had different effects on the root mean square error 
(RMSE) and coefficient of determination values (Figure 9). This could be due to the random 
nature of sampling from augmented data. Even though a certain amount of data augmentation 
could help reduce RMSE, the overall coefficient of determination for EC is very low and it 
didn’t help to improve the results.  
 
The cross-validation score (CV score) is a measure of performance for repeated one-out sample 
prediction, and it is a good measure when the number of samples is limited. The CV score 
based on R2 does not show any specific pattern of increase or decrease. However, it shows that 
for most models, there is a gain by increasing the number of samples and therefore improve 
the accuracy of predictions during training. They are all positive numbers except for Clay, EC, 
and CEC.  
 
 
 
 
 20 
 
 
 
 
 
 
  
  
  
 
  
  
Figure 9: Model performance against augmented data, RF: random forest, MLP: multi-
layer perceptron 
 
 
 
 21 
 
 
 
 
 
 
4.4. Model performance 
 
Table 4 shows the characteristics of successful models, the percent of data augmentation that 
lead to those results, and the Coefficient of Determination (CoD) for all the predictions using 
the ST treatment group (section 3.1.3 and Figure 12). For all models, a certain degree of data 
augmentation would lead to a better prediction over the baseline observed dataset (data 
augmentation = 0%).  
 
The highest data augmentation requirement is for Clay and the lowest are for pH and Sand 
(with pseudo-samples). CoD for pH is the highest while for OC it is negative, meaning that 
even with the addition of 75% augmented data to the original dataset, there was no significant 
improvement with machine learning models for prediction. However, this result is still better 
than the baseline data (without any augmentation).  The addition of uniform pseudo-samples 
did not have any significant impact on the improvement of results for Sand and OC as the 
coefficient of determination did not change dramatically.   
 
 
Table 4: Characteristics of successful models for all elements (ST training strategy) 
ELEMENT MODEL % DATA COD 
AUGMENTATION 
BD RF 20 0.28 
CEC RF 70 0.26 
CLAY MLP 90 0.23 
EC RF 45 0.02 
OC-WP RF 75 -1.03 
OC-NP RF 10 -0.78 
PH RF 5 0.56 
SAND-WP RF 5 0.38 
SAND-NP RF 90 0.38 
SILT RF 95 0.22 
VMC0.33 RF 10 0.28 
VMC15 RF 45 0.19 
Note: RF: Random Forest, WP: with pseudo-samples, NP: no pseudo-samples, COD: coefficient of 
determination  
 
Table 5 shows the final model CV score and machine learning RMSE in comparison with the 
results of the previous study (Wimalasiri et al., 2020b) that used statistical and mechanistic 
models for prediction using baseline observation data. Except for BD, CEC, and Clay, addition 
of observed literature data and machine learning has improved the results of modelling.  
 
It is noteworthy that since no distinction is made between the type of observed data (direct filed 
survey data from SRICANSOL project and those that are collected recently from the literature), 
comparing the results of the present analysis with those of the previous study shown in Table 
5 might not be meaningful as it is unclear at this point what is the main reason for the 
improvement (addition of literature data or machine learning models or both). Therefore, 
further research might be needed to ascertain direct impact of data and models on the accuracy 
of the outputs.  
 
 
 22 
 
 
 
 
 
 
 
Table 5: Comparison of model accuracy with previous study (ST strategy) 
ELEMENT CVSCOR ML RMSE PREVIOUS METHOD 
RMSE 
BD 0.11 0.201 0.184 EMK 
CEC -36.1 47.303 14.949 EMK 
CLAY -0.25 11.369 10.838 RBF 
EC -104.39 1884.86 -  
OC-WP 0.03 1.211 
2.649 EMK 
OC-NP 0.09 1.133 
PH 0.46 0.736 0.770 EMK 
SAND-WP 0.18 14.767 
15.970 RBF 
SAND-NP 0.17 14.69 
SILT 0.05 5.361 8.159 RBF 
VMC0.33 0.15 0.066 0.090 EMK 
VMC15 0.3 0.061 0.082 EMK 
Note: CVSCOR: cross validation score, ML: machine learning, RMSE: root mean square error, EMK: 
empirical bayes kriging, RBF: radial basis function.  
Note: previous RMSE are achieved without the use of literature data 
 
 
4.5. Final maps 
 
A grid of more than 13,000 points (Figure 5) was used to predict the values for target variables 
and develop final maps based on the machine learning models. The threshold between classes 
(colours) was defined based on the histogram of the input. For each target variable, 6 classes 
were identified based on the range and the number of bins (Sturges, 1926). A uniform colour 
chart was assigned to different data intervals to create distinguishable patterns for the final 
maps.  
 
Although the resulting maps are not very smooth for most of the maps presented in Figure 10, 
machine learning models were still able to detect the general patterns without the need for any 
statistical modelling or assumptions. For Clay, a general improvement can be seen in terms of 
uniform patterns the model performance was not superior compared to the previous study. 
 
 
 
 
 
 23 
 
 
 
 
 
 
   
  
 
   
  
 
Figure 10: Final maps created with predicted data at the grids coordinates 
 
 
 
 
 
 
 
 24 
 
 
5. Discussion  
 
In this section analysis results are discussed concerning the main objectives and the hypotheses 
of the research. The strengths and limitations of the results and their implication for real-world 
applications are also reviewed.  
  
5.1. Data augmentation 
 
Two main datasets that were used for this research were: 
 
1- Observation data that are a combination of samples from previous surveys and those 
that were extracted from the literature (section 3.1.1).  
2- Augmented data or artificial samples that were identified based on surface reflectance 
(Sand and OC only) and generated data using Space-GAN (all variables).  
 
Soil sampling is a resource-consuming endeavour that is usually carried out at regional or 
national scales or across smaller areas to fulfil a specific objective. As sampling density for 
national and regional scales is usually lower than those that are collected for smaller areas, there 
is a possibility to combine data from different scales to create better datasets (Batjes et al., 
2020). Unlike large-scale sampling data that are usually available in different formats, small-
scale sample data are rarely shared with the researchers, except for those that are published in 
research articles. In this research, a dataset of previously collected and harmonised soil samples 
that were combined with the observed datasets that were collected from published articles. 
Furthermore, an augmented dataset was developed using satellite remote sensing analysis and 
Generative Adversarial Nets.  
 
A survey of literature reveals that only two studies have augmented their training data with 
pseudo-samples derived from remote sensing images (Hengl et al., 2017b; Shangguan et al., 
2017b). Both studies advise that expert-based samples should only be between 1-5% of the 
training data. Therefore, the number of pseudo-samples was kept at a minimum in this study. 
Pseudo-samples were only applied to Sand and organic carbon content (OC) in this research.  
 
To date, no study has investigated the possibility of the addition of generated data using GAN 
to improve the quality of the predictions and no guideline is available to follow in this case. 
Therefore, to test the impact of generated data, a large set of 300 samples were generated for 
all target variables to quantify the degree of improvements with the addition of 0 to 100% of 
the augmented data. The density of observed data is 1 sample per 155 sqkm whilst with the 
addition of the generated samples, the density increased were improved to approximately 1 
sample per 90 sqkm (Figure 1 and Figure 2).  
 
Except for the minimum value of OC, the minimum for all generated data is higher than the 
observed values and the maximum is lower than the observed maximum value in Table 2. The 
 
 
 
 
 
mean values for all elements except for BD, pH, and VMC (both 0.33 and 15 bars level drainage 
upper limit and permanent wilting point –mm mm–1) are lower than the true (observed data) 
mean. This shows that the generated data tends to be more conservative and acknowledges the 
range of the observational data. However, this might be related to the problem of mode collapse 
that is also associated with the generative models (Bau et al., 2019; Pei et al., 2021). This 
problem is also highlighted in Figure 11 for pH and Sand. The generated (fake) data covers 
almost all values for pH but this case is not true for Sand values. Another important aspect of 
generated data is low diversity, as for all the variables (except for BD), the standard deviation 
is much lower than the original data.  
 
 
 
 
Figure 11: Normal dispersion around the mean and low SD for pH (top) and CEC 
(bottom) 
 
 26 
 
 
 
 
 
 
5.2. Mode collapse 
 
Different strategies for mode collapse were used to improve the results of the output data. One 
major problem was to improve the diversity while keeping the output of GAN model within 
the plausible range of the original data. Section 4.1 described major treatments that were tested 
to identify the best GAN model for training all the output variables. The two groups of 
treatments; OT and ST have three strategies in common.   
 
The learning rate is an important regularisation factor in training neural network models. The 
major issue of small samples can be overcome with higher learning from the beginning of the 
training routine. In this research, the network structures are simple and therefore, reducing the 
learning rate beyond the threshold 0.01 (Table 2) did not produce satisfactory results (Li et al., 
2019).  
 
Removing outliers improved the results by reducing the diversity in the original data while 
releasing the GAN model to try to learn sub-population distributions. Similar to statistical 
modelling, evidence-based pre-processing techniques can have a great impact on the results of 
machine learning models (Fan et al., 2021).  
 
Machine learning models are not inherently trained to detect patterns based on prior 
assumptions. Indeed, it has been reported that batch or mini-batch normalisation can improve 
the results of image classification (Blanch et al., 2019). However, for the present case, batch 
normalisation did not produce satisfactory results due to two reasons: 
 
1- Small scale variation that is present within some of the target variables (e.g., BD 
and CEC) 
2- Complications with inverse normalisation due to small variation in the 
observational dataset.  
 
Another major solution that proved useful in this research was unbalanced training of the input 
data (Klemmer, 2022). In this case, either the Discriminator or Generator are trained with more 
iterations. The most successful approach to unbalanced training was training the Generator 
twice and Discriminator only one time. This could help the Generator, gather more information 
about the original data while avoiding the need for hard classification using the Discriminator.  
 
A major distinguisher between the two treatments OT and ST is over-sampling the tails of the 
training distribution. Although outlier removal reduces the burden of the Generator to generate 
more diverse outputs within the reasonable range of the input data, oversampling the tails of 
the distributions, particularly for highly skewed data, leads to better results. Indeed, the results 
that are discussed in Tables 3-4 were achieved using this method. However, a major issue that 
arises, in this case, is the spatial distribution of the output data. Figure 12 shows an exemplar 
spatial distribution of generated data versus original data using OT and ST treatments. The 
proximity of generated data and the original data is not explainable for the case of ST as it is 
compared to OT. Particularly for BD, the model assigns lower values even in the vicinity of 
the higher values in the original set that can perhaps be attributed to the attributes such as 
elevation and low bulk density in lowlands. However, ST can capture more variation in data 
compared to OT treatment which is probably the reason behind its slightly better results 
(section 4.1).  
 
 
 27 
 
 
 
 
 
 
 
 
Figure 12: Spatial distribution of generated data based on two groups of treatments 
 
5.3. Accuracy assessment 
 
The addition of GAN and pseudo-samples have improved the representativeness of the samples 
by increasing sample density and even though the results are sub-par for most of the variables, 
a degree of data augmentation has helped with improving the results (section 4.3). Model 
performances shown in Figure 9 demonstrate a random behaviour (except for EC, pH, and 
VMC33). This could be due to randomness in batch selection during the sampling. Although 
the density is higher with the addition of generated samples, the sample size is still considered 
low for the area of the country (close to 65,000 square km). This could be the reason for the 
random behaviour of the model outputs.  
 
A major issue with low sample size is representativeness and sampling scheme. There are two 
important properties of the dataset used in this study: 
 
1- The original dataset does not follow any objective design. That is, the locations are 
not random. This violates the iid assumptions and this could lead to biased results. 
This problem is solved with consideration of “spatial processes” (Webster and Oliver, 
2001) where randomness can also be defined at each generated output.  
2- The locations of the generated data however are randomly selected. This will resolve 
the iid assumption since samples are selected following a simple random sampling 
scheme (SRS), the chance of the models producing biased results is lower (even with 
the presence of original convenient sampling). 
 
In principal machine learning models can be used for prediction using any type of data, 
regardless of their sampling design (OpenGeoHub, 2022) and this gives them a better 
advantage to be used for spatial data prediction. However, the main issue with machine learning 
models is the sample size, which poses a challenge when the sample cannot be considered 
representative.   
 
For most of the target variables, the random forest model tends to produce the best results for 
ST treatment particularly when there is a non-linear relationship between covariate features and 
the target variables and this confirms the results from the previously published studies (Hengl 
et al., 2017a; Shangguan et al., 2017c). This also holds for the OT treatment results (Appendix 
 28 
 
 
 
 
 
 
A 1 and A 2). However, the results of modelling based on OT do not differ meaningfully from 
those of ST and both agree with previous studies.  
 
5.4. Predictions and features 
 
One of the most important goals for spatial prediction using any model is to be able to map and 
visualise the results.  Also, it is important for any map that visualises the soil spatial variability 
to showcase some patterns of homogeneity based on the intuitive notion of the first law of 
geography. Keeping in mind that machine learning models are not inherently ‘trained’ or 
‘supervised’ to detect any predefined spatial pattern, the output maps showcase some of those 
patterns that correspond to the previously published maps (Wimalasiri et al., 2020b). One of 
the important drawbacks of the maps is the presence of artefacts or areas of discontinuity. This 
could be because in those areas where there is low sample density or gaps, machine learning 
models perform unexpectedly and create these artefacts. However, artefacts could be due to 
different colour scales that were chosen in Figure 10. Figure 13 shows the observation data 
overlaid on the predicted surface map. Although the prediction surface misses some obvious 
values, it captures the general patterns of variation.    
 
 
Figure 13: Originally observed and generated data overlaid on the prediction map 
 
 29 
 
 
 
 
 
 
Using an extended array of coordinates is not always a necessary element for good prediction. 
Figure 14 shows the importance for all 157 features that were used for modelling of pH element 
(the best model for pH used 5% generated data for an RF model). This shows that the amount 
of evapotranspiration, vegetation, and elevation are among the most important features that can 
predict the values of pH more accurately.  
 
 
 
 
 
 
 
 
 30 
 
 
 
 
 
 
 
aet: Actual evapotranspiration (mm), def: climatic water deficit (mm), pdsi: Palmer Drought Severity Index, pet: reference 
evapotranspiration (penman-montieth, mm), pr: precipitation (mm), ro: runoff (mm), soil: soil moisture (mm), srad: 
Downward surface shortwave radiation (W/m2), swe: snow water equivalent (mm), tmmn:min temperature (deg C), tmmx: 
max temperature (deg C), vap: vapor pressure (kPa), vpd: vapor pressure deficit (kPa), vs: Wind-speed at 10m (m/s), nvdi: 
normalized difference vegetation Index 
 
Figure 14: Feature importance of pH element predictions 
 31 
 
 
 
 
 
 
5.5. Data usability 
 
5.5.1. Modern soil data 
 
Soil knowledge is needed for modern decision-making in many sectors from agriculture to 
construction. The recent development in three areas has made it possible to provide better soil 
data for the users.: 
 
1- Big environmental data:  there are extensive datasets available for use from satellite 
images to surface maps of temperature and rainfall statistics that can be used to build a 
strong covariate feature for prediction.   
2- The development of machine learning models particularly ensemble models have 
contributed to the development and visualisation of large-scale prediction of 
environmental data, particularly soil maps.  
3- Soil data are now increasingly made available to the public, despite the barriers that 
exist in terms of copyright and ownership. This will establish a strong feedback loop 
between users and producers of data.   
 
The combination of all three abovementioned advances would gradually help improve the 
quality of soil data. Such improvements are not possible without an understanding of the spatial 
relationship between features and machine learning models. All the above mentioned elements 
have been used in this research, however, as previous studies have shown, more work needs to 
be done to improve the results of maps. Figure 15 shows an example of comparison maps 
between two openly accessible datasets (SoilGrids and OpenLandMap). The local data provide 
much better information in comparison to the global datasets. This shows the importance of 
using local data in prediction and mapping of the soil elements.  
 
Figure 15: Comparison between observed and global soil data for pH (adapted from 
(Wimalasiri et al., 2020c) 
 
 32 
 
 
 
 
 
 
5.5.2. Machine learning interpretability dilemma 
 
A major issue with the use of machine learning models in soil mapping is the apparent lack of 
transparency, interpretability, and explainability (Rentschler, 2021; Wadoux et al., 2019). 
Models such as neural networks are increasingly promoted as ‘black-box’ machines. In this 
context, further analysis and ‘overlays’ of data such as the one presented in Figure 13 seem to 
be indispensable. Also, methods such as feature importance that can be done for de facto models 
such as random forest can play a key role in developing better models (Figure 14).  
 
5.5.3. Automation  
 
Recent advances in big data and automation are now being built into automated systems that 
are suitable for re-tuning the algorithms and providing new versions of datasets as new samples 
are made available. Such automated systems are already envisaged for data providers such as 
SoilGrids.org who aim to serve the public with open-access datasets. However, such data 
systems can now be easily developed at the national level for the need of local users that are 
still ‘reluctant to share their data’. In this case, machine learning models can be advantageous 
since their algorithms are more streamlined compared to statistically-based models such as 
regression kriging where detailed variogram modelling is required.  
 33 
 
 
6. Conclusion  
   
Soil information plays a key role in improving our understanding of environmental processes 
that impact our world. Major and global issues such as land degradation, climate change, and 
food security require detailed soil data. However, obtaining soil sample data is time and 
resource consuming and quite often, the available data does not meet the demand. More 
accurate soil data that capture the spatial variability of soil properties can aid scientists in many 
fields in better understanding our natural environment. It can also lead to the development of 
solutions for sustainable use of limited land resources for the benefit of food and nutrition 
security of a growing population. For example, the use of better soil data can lead to the 
development of better prediction models for crop yield in different soils across the country. 
This information would also benefit policy-makers to accurately plan for food and nutrition 
security by analysing the likely performance of a specific crop for current and future climatic 
conditions. Better soil data can also benefit other disciplines such as construction, geology, 
ecology, and other Earth sciences to fine-tune their models to the soil variation that exists on 
the ground and derive better estimates of their target variables. In this research, a combination 
of data augmentation and mapping mechanisms using novel approaches to data generation and 
analysis were used to enhance the soil maps at the national scale for the country of Sri Lanka.  
 
Data from different sources were combined to examine the main hypotheses of this research. 
Although observed soil sample data are always difficult to collect and analyse, they are not 
always shared with the public or made available to the researchers. Observed data in the 
literature can be a solution. However, the result of the survey of the literature showed that those 
data are also very limited and imbalanced in nature. Therefore, there is no choice but to use 
different methods for data augmentation. One way is to use satellite data and expert judgement 
to add more training samples (using pseudo-samples). However, the result of analysis using 
surface reflectance showed that although they can be a good source of information for some 
soil element such as (Sand and OC), their contribution is still very limited.  
 
Data generation using newly developed deep learning methods such as Generative Adversarial 
nets (GAN) provides a good solution for the augmentation of soil data. Specific GAN models 
can be trained to understand and detect spatial patterns based on some degree of spatial 
proximity measure such as Moran’s I indicator of spatial correlation (section 3.1.3).  
 
Different strategies were used to tackle the data generation problem using GAN models. The 
main issue was the lack of diversity for generated data or the problem of mode collapse. The 
result showed that a combination of pre-processing techniques such as outlier removal in 
combination with unbalanced training of generator and discriminator and over sampling the 
tails of distribution can be helpful.  
 
 
 
 
 
 
The result of data generation for a set of randomly sleeted coordinates showcasing areas that 
are gaps in the analysis showed that using generated data could provide a solution for the 
scarcity of soil sample data. The degree of data augmentation was vastly different among 
different soil elements; however, it is possible to fine-tune specific models for each element. 
The overall results were satisfactory for OC, pH, Sand, Silt, and VMC. But not satisfactory for 
BD, CEC, and Clay.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 35 
 
7. Future work  
 
One of the main limitations of the current work is that no distinction is made between observed 
data that are collected in the field and those that are added by literature survey. A more 
comprehensive investigation is therefore needed in the future to conclusively arrive at the 
notion that both addition of survey literature data and machine learning models can improve 
the prediction results.  
 
A major issue with artificial data augmentation is finding the right place to sample. In this 
research, a simple random sampling (SRC) scheme was used to generate samples using GAN 
architecture. However, other sampling schemes can be used for data augmentation. For 
example, Latin Hypercube Sampling (LHS) and newly developed Feature Space Coverage 
Sampling (FSCS) could provide a more representative sample for spatial modelling (Brus, 
2019).  
 
Another aspect that could be improved in the future is increasing the number of generated 
samples. However, a blunt increase of sample number might have direct consequences in the 
mapping phase where spatial patterns of generated samples influence those of the real data. 
There should also be better strategies for tackling the problem of mode collapse. For example, 
many Generator nets could be trained for a smaller number of Discriminators. Also, a better 
data normalisation strategy (Minmax scaling or non-normal transformation for highly skewed 
data) could be used.  
 
It is also important to develop element-specific models as using a generic model (uniform 
hyperparameters) that was used for both data generation using GAN and prediction using 
machine learning models lead to mixed results. For example, a specific model and data pipeline 
could be developed for soil structural elements (Sand, Silt, Clay, and BD) while for chemical 
properties such as pH, EC and CEC. The results of such modelling should also be explored.  
 
Publication of data, codes, and results is also an important aspect that will be considered in this 
research. Previous experience with the growth of open-source software has triggered the 
reproducibility movement in many scientific disciplines. Publication of data in the open access 
repositories will not only expose the methods used in this research to the public but also will 
pave the way for the development of better datasets in the future (Musker et al., 2018). 
 
8. References  
 
Abatzoglou, J.T., Dobrowski, S.Z., Parks, S.A., Hegewisch, K.C., 2018. TerraClimate, a 
high-resolution global dataset of monthly climate and climatic water balance from 
1958–2015. Sci Data 5, 170191. https://doi.org/10.1038/sdata.2017.191 
Anselin, L., 1995. Local Indicators of Spatial Association—LISA. Geographical Analysis 27, 
93–115. https://doi.org/10.1111/j.1538-4632.1995.tb00338.x 
Ashkenazy, Y., Shilo, E., 2018. Sand Dune Albedo Feedback. Geosciences 8, 82. 
https://doi.org/10.3390/geosciences8030082 
Bagheri bodaghabadi, M., Martínez-casasnovas, J., Salehi, M.H., Mohammadi, J., 
Esfandiarpoor borujeni, I., Toomanian, N., Gandomkar, A., 2015. Digital Soil 
Mapping Using Artificial Neural Networks and Terrain-Related Attributes. 
Pedosphere 25, 580–591. https://doi.org/10.1016/S1002-0160(15)30038-2 
Batjes, N.H., Ribeiro, E., Oostrum, A. van, 2020. Standardised soil profile data to support 
global mapping and modelling (WoSIS snapshot 2019). Earth System Science Data 
12, 299–320. https://doi.org/10.5194/essd-12-299-2020 
Bau, D., Zhu, J.-Y., Wulff, J., Peebles, W., Strobelt, H., Zhou, B., Torralba, A., 2019. Seeing 
What a GAN Cannot Generate. Presented at the Proceedings of the IEEE/CVF 
International Conference on Computer Vision, pp. 4502–4511. 
Blanch, M.G., Mrak, M., Smeaton, A.F., O’Connor, N.E., 2019. End-to-End Conditional 
GAN-based Architectures for Image Colourisation, in: 2019 IEEE 21st International 
Workshop on Multimedia Signal Processing (MMSP). Presented at the 2019 IEEE 
21st International Workshop on Multimedia Signal Processing (MMSP), pp. 1–6. 
https://doi.org/10.1109/MMSP.2019.8901712 
Bouasria, A., Ibno Namr, K., Rahimi, A., Ettachfini, E.M., Rerhou, B., 2022. Evaluation of 
Landsat 8 image pansharpening in estimating soil organic matter using multiple linear 
regression and artificial neural networks. Geo-spatial Information Science 0, 1–12. 
https://doi.org/10.1080/10095020.2022.2026743 
Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32. 
https://doi.org/10.1023/A:1010933404324 
Brus, D.J., 2019. Sampling for digital soil mapping: A tutorial supported by R scripts. 
Geoderma 338, 464–480. https://doi.org/10.1016/j.geoderma.2018.07.036 
Buchhorn, M., Lesiv, M., Tsendbazar, N.-E., Herold, M., Bertels, L., Smets, B., 2020. 
Copernicus Global Land Cover Layers—Collection 2. Remote Sensing 12, 1044. 
https://doi.org/10.3390/rs12061044 
Chen, T., Guestrin, C., 2016. XGBoost: A Scalable Tree Boosting System, in: Proceedings of 
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data 
Mining, KDD ’16. Association for Computing Machinery, New York, NY, USA, pp. 
785–794. https://doi.org/10.1145/2939672.2939785 
Chesworth, W., 2007. Encyclopedia of Soil Science. Springer Science & Business Media. 
Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., Bharath, A.A., 2018. 
Generative Adversarial Networks: An Overview. IEEE Signal Processing Magazine 
35, 53–65. https://doi.org/10.1109/MSP.2017.2765202 
Dewitte, O., Jones, A., Elbelrhiti, H., Horion, S., Montanarella, L., 2012. Satellite remote 
sensing for soil mapping in Africa: An overview. Progress in Physical Geography: 
Earth and Environment 36, 514–538. https://doi.org/10.1177/0309133312446981 
 
 
 
 
 
Dobos, E., Hengl, T., 2009. Chapter 20 Soil Mapping Applications, in: Hengl, Tomislav, 
Reuter, H.I. (Eds.), Developments in Soil Science, Geomorphometry. Elsevier, pp. 
461–479. https://doi.org/10.1016/S0166-2481(08)00020-2 
Fan, C., Chen, M., Wang, X., Wang, J., Huang, B., 2021. A Review on Data Preprocessing 
Techniques Toward Efficient and Reliable Knowledge Discovery From Building 
Operational Data. Frontiers in Energy Research 9. 
https://doi.org/10.3389/fenrg.2021.652801 
Farr, T.G., Rosen, P.A., Caro, E., Crippen, R., Duren, R., Hensley, S., Kobrick, M., Paller, 
M., Rodriguez, E., Roth, L., Seal, D., Shaffer, S., Shimada, J., Umland, J., Werner, 
M., Oskin, M., Burbank, D., Alsdorf, D., 2007. The Shuttle Radar Topography 
Mission. Reviews of Geophysics 45. https://doi.org/10.1029/2005RG000183 
Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D., 2014. Do we need hundreds of 
classifiers to solve real world classification problems? J. Mach. Learn. Res. 15, 3133–
3181. 
Goovaerts, P., Goovaerts, D. of C. and E.E.P., 1997. Geostatistics for Natural Resources 
Evaluation. Oxford University Press. 
Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., Moore, R., 2017. Google 
Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of 
Environment 202, 18–27. https://doi.org/10.1016/j.rse.2017.06.031 
Hengl, T., Jesus, J.M. de, MacMillan, R.A., Batjes, N.H., Heuvelink, G.B.M., Ribeiro, E., 
Samuel-Rosa, A., Kempen, B., Leenaars, J.G.B., Walsh, M.G., Gonzalez, M.R., 2014. 
SoilGrids1km — Global Soil Information Based on Automated Mapping. PLOS ONE 
9, e105992. https://doi.org/10.1371/journal.pone.0105992 
Hengl, T., Mendes de Jesus, J., Heuvelink, G.B.M., Ruiperez Gonzalez, M., Kilibarda, M., 
Blagotić, A., Shangguan, W., Wright, M.N., Geng, X., Bauer-Marschallinger, B., 
Guevara, M.A., Vargas, R., MacMillan, R.A., Batjes, N.H., Leenaars, J.G.B., Ribeiro, 
E., Wheeler, I., Mantel, S., Kempen, B., 2017a. SoilGrids250m: Global gridded soil 
information based on machine learning. PLOS ONE 12, e0169748. 
https://doi.org/10.1371/journal.pone.0169748 
Hengl, T., Mendes de Jesus, J., Heuvelink, G.B.M., Ruiperez Gonzalez, M., Kilibarda, M., 
Blagotić, A., Shangguan, W., Wright, M.N., Geng, X., Bauer-Marschallinger, B., 
Guevara, M.A., Vargas, R., MacMillan, R.A., Batjes, N.H., Leenaars, J.G.B., Ribeiro, 
E., Wheeler, I., Mantel, S., Kempen, B., 2017b. SoilGrids250m: Global gridded soil 
information based on machine learning. PLOS ONE 12, e0169748. 
https://doi.org/10.1371/journal.pone.0169748 
Heuvelink, G.B.M., Webster, R., 2022. Spatial statistics and soil mapping: A blossoming 
partnership under pressure. Spatial Statistics 50. 
https://doi.org/10.1016/j.spasta.2022.100639 
Jahanshiri, E., bin Mohamed Shariff, A.R., Amiri, F., Soom, M.A.M., Wayayokb, A., 
Buyonga, T., Pradhan, B., 2015. Spatial soil analysis using geostatistical analysis and 
map Algebra. Arabian Journal of Geosciences 8, 9775–9788. 
Karlsson, C.S.J., Jamali, I.A., Earon, R., Olofsson, B., Mörtberg, U., 2014. Comparison of 
methods for predicting regolith thickness in previously glaciated terrain, Stockholm, 
Sweden. Geoderma 226–227, 116–129. 
https://doi.org/10.1016/j.geoderma.2014.03.003 
Kasraei, B., Heung, B., Saurette, D.D., Schmidt, M.G., Bulmer, C.E., Bethel, W., 2021. 
Quantile regression as a generic approach for estimating uncertainty of digital soil 
maps produced from machine-learning. Environmental Modelling & Software 144, 
105139. https://doi.org/10.1016/j.envsoft.2021.105139 
Klemmer, K., 2022. How to train a SpaceGAN for soil data augmentation. 
 38 
 
 
 
 
 
 
Klemmer, K., 2021. SpaceGAN - A generative adverserial net for geospatial point data. 
Klemmer, K., Koshiyama, A., Flennerhag, S., 2019. Augmenting correlation structures in 
spatial data using deep generative models. arXiv:1905.09796 [cs, stat]. 
Kuriakose, S.L., Devkota, S., Rossiter, D.G., Jetten, V.G., 2009. Prediction of soil depth 
using environmental variables in an anthropogenic landscape, a case study in the 
Western Ghats of Kerala, India. CATENA 79, 27–38. 
https://doi.org/10.1016/j.catena.2009.05.005 
Lalitha, M., Dharumarajan, S., Suputhra, A., Kalaiselvi, B., Hegde, R., Reddy, RS., Prasad, 
CR.S., Harindranath, CS., Dwivedi, BS., 2021. Spatial prediction of soil depth using 
environmental covariates by quantile regression forest model. Environ Monit Assess 
193, 660. https://doi.org/10.1007/s10661-021-09348-9 
Legros, J.-P., 2006. Mapping of the Soil. Science Publishers. 
Li, J., Chen, Z., Zhao, X., Shao, L., 2020. MapGAN: An Intelligent Generation Model for 
Network Tile Maps. Sensors (Basel) 20, 3119. https://doi.org/10.3390/s20113119 
Li, Y., Wei, C., Ma, T., 2019. Towards Explaining the Regularization Effect of Initial Large 
Learning Rate in Training Neural Networks, in: Advances in Neural Information 
Processing Systems. Curran Associates, Inc. 
M Suhari, T.A.S.T., Wimalasiri, E.M., Jahanshiri, E., 2022. Paving the way for more accurate 
earth system modelling in Malaysia. IOP Conf. Ser.: Earth Environ. Sci. 1016, 
012031. https://doi.org/10.1088/1755-1315/1016/1/012031 
MacMillan, T.H. and R.A., 2019. Predictive Soil Mapping with R. 
Mapa, R., 2020. SRICANSOL Project – Phase III- Development of a land resource 
information system for land evaluation of Sri Lanka. 
https://doi.org/10.13140/RG.2.2.18230.32322 
Meyer, H., 2016. Machine-learning based modelling of spatio-temporal environmental data 
(using R) 56. 
Minasny, B., McBratney, Alex.B., 1999. A rudimentary mechanistic model for soil 
production and landscape development. Geoderma 90, 3–21. 
https://doi.org/10.1016/S0016-7061(98)00115-3 
Moghaddam, D.D., Rahmati, O., Panahi, M., Tiefenbacher, J., Darabi, H., Haghizadeh, A., 
Haghighi, A.T., Nalivan, O.A., Tien Bui, D., 2020. The effect of sample size on 
different machine learning models for groundwater potential mapping in mountain 
bedrock aquifers. CATENA 187, 104421. 
https://doi.org/10.1016/j.catena.2019.104421 
Musker, R., Tumeo, J., Schaap, B., Parr, M., 2018. GODAN’s Impact 2014-2018 – 
Improving Agriculture, Food and Nutrition with Open Data. F1000Research 7, 1328. 
https://doi.org/10.7490/f1000research.1115970.1 
OpenGeoHub, 2022. Spatial sampling and resampling for Predictive Mapping with Machine 
Learning: A tutorial in R. MLearning.ai. URL https://medium.com/mlearning-
ai/spatial-sampling-and-resampling-for-predictive-mapping-with-machine-learning-a-
tutorial-in-r-99f71555bc43 (accessed 5.30.22). 
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., 
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., 
Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine Learning in 
Python. Journal of Machine Learning Research 12, 2825–2830. 
Pei, S., Da Xu, R.Y., Xiang, S., Meng, G., 2021. Alleviating Mode Collapse in GAN via 
Diversity Penalty Module. 
QGIS.org, 2022. QGIS Geographic Information System. QGIS Association. 
R Core Team., 2020. R: A language and environment for statistical computing. R Foundation 
for Statistical Computing, Vienna, Austria. 
 39 
 
 
 
 
 
 
Rentschler, T., 2021. Explainable machine learning in soil mapping: Peeking into the black 
box (Dissertation). Universität Tübingen. https://doi.org/10.15496/publikation-57960 
Rey, S.J., Anselin, L., 2007. PySAL: A Python Library of Spatial Analytical Methods. The 
Review of Regional Studies 37, 5–27. 
Richter, D.D., Markewitz, D., 1995. How Deep Is Soil?: Soil, the zone of the earth’s crust 
that is biologically active, is much deeper than has been thought by many ecologists. 
BioScience 45, 600–609. https://doi.org/10.2307/1312764 
Schaaf, C., 2015. MODIS BRDF/Albedo/NBAR Product MCD43 [WWW Document]. URL 
https://www.umb.edu/spectralmass/terra_aqua_modis/v006/introduction (accessed 
4.23.22). 
Shangguan, W., Hengl, T., Jesus, J.M. de, Yuan, H., Dai, Y., 2017a. Mapping the global 
depth to bedrock for land surface modeling. Journal of Advances in Modeling Earth 
Systems 9, 65–88. https://doi.org/10.1002/2016MS000686 
Shangguan, W., Hengl, T., Jesus, J.M. de, Yuan, H., Dai, Y., 2017b. Mapping the global 
depth to bedrock for land surface modeling. Journal of Advances in Modeling Earth 
Systems 9, 65–88. https://doi.org/10.1002/2016MS000686 
Shangguan, W., Hengl, T., Mendes de Jesus, J., Yuan, H., Dai, Y., 2017c. Mapping the 
global depth to bedrock for land surface modeling. Journal of Advances in Modeling 
Earth Systems 9, 65–88. https://doi.org/10.1002/2016MS000686 
Spick, R.J., Cowling, P., Walker, J.A., 2019. Procedural Generation using Spatial GANs for 
Region-Specific Learning of Elevation Data, in: 2019 IEEE Conference on Games 
(CoG). Presented at the 2019 IEEE Conference on Games (CoG), IEEE, London, 
United Kingdom, pp. 1–8. https://doi.org/10.1109/CIG.2019.8848120 
Stoorvogel, J.J., Bakkenes, M., Temme, A.J.A.M., Batjes, N.H., ten Brink, B.J.E., 2017. S-
World: A Global Soil Map for Environmental Modelling. Land Degradation & 
Development 28, 22–33. https://doi.org/10.1002/ldr.2656 
Sturges, H.A., 1926. The Choice of a Class Interval. Journal of the American Statistical 
Association 21, 65–66. https://doi.org/10.1080/01621459.1926.10502161 
Sulaeman, Y., Minasny, B., McBratney, A.B., Sarwani, M., Sutandi, A., 2013. Harmonizing 
legacy soil data for digital soil mapping in Indonesia. Geoderma 192, 77–85. 
https://doi.org/10.1016/j.geoderma.2012.08.005 
Sun, T., Wang, Y., Hui, D., Jing, X., Feng, W., 2020. Vertical distributions of soil microbial 
biomass carbon: a global dataset. Data in Brief 32, 106147. 
https://doi.org/10.1016/j.dib.2020.106147 
Szabó, B., Szatmári, G., Takács, K., Laborczi, A., Makó, A., Rajkai, K., Pásztor, L., 2019. 
Mapping soil hydraulic properties using random-forest-based pedotransfer functions 
and geostatistics. Hydrology and Earth System Sciences 23, 2615–2635. 
https://doi.org/10.5194/hess-23-2615-2019 
Tetzlaff, G., 1983. Albedo of the Sahara, Satellite Meas. of Radiation Budget Parameters. 
Thathsarani, S.P.K., 2022. STRENGTHENING THE NATION-WIDE OPEN 3D SOIL 
DATABASE OF SRI LANKA USING DATA MINING. Sabaragamuwa University 
of Sri Lanka, Belihuloya, Sri Lanka. 
U.S. Geological Survey, 2022. Landsat Surface Reflectance Products [WWW Document]. 
URL https://www.usgs.gov/landsat-missions/landsat-surface-reflectance (accessed 
5.23.22). 
UTM zone 44N - EPSG:32644 [WWW Document], 2022. URL http://epsg.io (accessed 
5.23.22). 
Van Rossum, G., Drake Jr, F.L., 1995. Python reference manual. Centrum voor Wiskunde en 
Informatica Amsterdam. 
 40 
 
 
 
 
 
 
Wadoux, A.M.J.-C., Padarian, J., Minasny, B., 2019. Multi-source data integration for soil 
mapping using deep learning. SOIL 5, 107–119. https://doi.org/10.5194/soil-5-107-
2019 
Waters, N., 2018. Tobler’s First Law of Geography, in: International Encyclopedia of 
Geography. John Wiley & Sons, Ltd, pp. 1–15. 
https://doi.org/10.1002/9781118786352.wbieg1011.pub2 
Webster, R., Oliver, M.A., 2007. Geostatistics for environmental scientists. John Wiley & 
Sons. 
Webster, R., Oliver, M.A., 2001. Geostatistics for environmental scientists (Statistics in 
Practice). 
WGS 84 - EPSG:4326 [WWW Document], 2022. URL https://epsg.io/4326 (accessed 
5.23.22). 
Wimalasiri, E.M., Jahanshiri, E., Suhairi, T.A.S.T.M., Mapa, R.B., 2020a. Nation-wide open 
soil property maps of up to 100 cm depth for Sri Lanka 1. 
https://doi.org/10.17632/5sc7njfcyn.1 
Wimalasiri, E.M., Jahanshiri, E., Suhairi, T.A.S.T.M., Mapa, R.B., Karunaratne, A.S., 
Vidhanarachchi, L.P., Udayangani, H., Nizar, N.M.M., Azam-Ali, S.N., 2020b. The 
first version of nation-wide open 3D soil database for Sri Lanka. Data Brief 33, 
106342. https://doi.org/10.1016/j.dib.2020.106342 
Wimalasiri, E.M., Jahanshiri, E., Suhairi, T.A.S.T.M., Udayangani, H., Mapa, R.B., 
Karunaratne, A.S., Vidhanarachchi, L.P., Azam-Ali, S.N., 2020c. Basic Soil Data 
Requirements for Process-Based Crop Models as a Basis for Crop Diversification. 
Sustainability 12, 7781. https://doi.org/10.3390/su12187781 
Wong, I.F.T., 1974. Soil-crop suitability classification for Peninsular Malaysia. Division of 
Agriculture, Ministry of Agriculture and Fisheries. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 41 
 
Appendix  
 
A 1: Characteristics of successful models for all elements (OT training strategy) 
ELEMENT MODEL % DATA COD 
AUGMENTATION 
BD RF 20 0.06 
CEC RF 70 0.26 
CLAY MLP 55 0.25 
EC RF 100 0.01 
OC-WP RF 80 -0.69 
OC-NP - -  
PH RF 5 0.56 
SAND-WP RF 5 0.37 
SAND-NP - -  
SILT MLP 60 0.23 
VMC0.33 RF 25 0.27 
VMC15 RF 20 0.20 
 
 
 
A 2: Comparison of model accuracy with previous study (OT strategy) 
ELEMENT CVSCOR ML RMSE PREVIOUS METHOD 
RMSE 
BD 0.06 0.218 0.184 EMK 
CEC -32.11 47.529 14.949 EMK 
CLAY -0.24 11.24 10.838 RBF 
EC -118.29 1895.97 -  
OC-WP 0.05 1.105 2.649 EMK 
OC-NP - - 
PH 0.47 0.74 0.770 EMK 
SAND-WP 0.15 14.85 15.970 RBF 
SAND-NP - - 
SILT -0.46 5.30 8.159 RBF 
VMC0.33 0.16 0.066 0.090 EMK 
VMC15 0.31 0.060 0.082 EMK