Improving Soil Information with Generative and Machine Learning Models Will data augmentation and machine learning help improve the quality of soil data predictions over a large area? Master’s thesis in Applied Data Science Ebrahim Jahanshiri DEPARTMENT OF COMPUTER SCIENCE GOTHENBURG UNIVERSITY AND CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2022 www.chalmers.se/www.gu.se ii Master’s thesis 2022 Improving Soil Information with Generative and Machine Learning Models Ebrahim Jahanshiri Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2022 iii Supervisor at CSE: Dr. Yinan Yu Department of Computer Science Examiner: Prof. Peter Damaschke Department of Computer Science Improving Soil Information with Generative and Machine Learning Models Ebrahim Jahanshiri © EBRAHIM JAHANSHIRI, 2022. Master’s Thesis 2022 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: prediction surface for pH for Sri Lanka iv Improving Soil Information with Generative and Machine Learning Models Master’s Thesis 2022 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Abstract Soil data observations are among the most difficult data to collect. Low sample density along with the high cost of sampling has made current soil information that are usually presented as maps, unusable for detailed applications such as modelling earth system dynamics, crop modelling, natural hazards prediction and climate change impacts. In this research, a new method for data augmentation of already available data was used to improve the accuracy levels of current soil maps across the country of Sri Lanka. Soil profile datasets were collected from custodians of national soil profile data and combined with data that were mined from published sources. Moreover, two extra datasets were developed to augment the original sample size. A series of pseudo-samples were generated by analysing over 600 images obtained from NASA MODIS MCD34A4 product, to locate Sand dunes with a high percentage of sand and low organic carbon. Another dataset was created by training and validation of a spatial generative adversarial neural network (SpaceGAN) that can mimic the original distribution of soil properties. The augmented datasets were used to predict the values of soil elements at unsampled locations with machine learning models. The performance of successful models was then compared with different levels of data augmentation. Results showed that data augmentation, particularly with the generated data from SpaceGAN can enhance the knowledge of some soil elements and could be explored as a viable option to improve the overall accuracy of soil information. Keywords: data augmentation, soil data, mapping, generative adversarial models, random forest v Contents Abstract ..................................................................................................................................... v Contents ................................................................................................................................... vi List of Figures ....................................................................................................................... viii List of Tables ........................................................................................................................... ix 1. Introduction ...................................................................................................................... 1 1.1. Soil data .................................................................................................................................1 1.2. Why we need better soil data ..............................................................................................1 1.3. Problem statement................................................................................................................1 1.4. Contribution .........................................................................................................................2 2. Theory ............................................................................................................................... 4 2.1. Improving soil data quality .................................................................................................4 2.1.1. Published sources .............................................................................................................................4 2.1.2. Pseudo-samples ................................................................................................................................5 2.1.3. Data augmentation with deep learning .............................................................................................5 2.1.4. Covariate features.............................................................................................................................6 2.2. State of the art in soil mapping ...........................................................................................6 2.2.1. Deterministic interpolation models ..................................................................................................7 2.2.2. Stochastic interpolation models .......................................................................................................7 2.2.3. Machine Learning models ................................................................................................................7 3. Methods ............................................................................................................................. 9 3.1. Data ........................................................................................................................................9 3.1.1. Observed data ...................................................................................................................................9 3.1.2. Pseudo-observations .......................................................................................................................11 3.1.3. GAN data generation......................................................................................................................12 3.1.4. Covariate features...........................................................................................................................13 3.1.5. Training, testing, and prediction datasets .......................................................................................14 3.1.6. Coordinate reference system ..........................................................................................................15 3.2. Analysis methods ................................................................................................................15 3.2.1. Machine learning models ...............................................................................................................15 3.2.2. Evaluation ......................................................................................................................................16 3.3. Software...............................................................................................................................16 4. Results ............................................................................................................................. 17 4.1. Tuning data augmentation model .....................................................................................17 4.2. Exploratory analysis ..........................................................................................................19 4.3. Modelling results ................................................................................................................20 4.4. Model performance ............................................................................................................22 4.5. Final maps ...........................................................................................................................23 5. Discussion........................................................................................................................ 25 5.1. Data augmentation .............................................................................................................25 5.2. Mode collapse......................................................................................................................27 5.3. Accuracy assessment ..........................................................................................................28 vi 5.4. Predictions and features ....................................................................................................29 5.5. Data usability ......................................................................................................................32 5.5.1. Modern soil data .............................................................................................................................32 5.5.2. Machine learning interpretability dilemma ....................................................................................33 5.5.3. Automation .....................................................................................................................................33 6. Conclusion ...................................................................................................................... 34 7. Future work .................................................................................................................... 36 8. References ....................................................................................................................... 37 Appendix ................................................................................................................................. 42 vii List of Figures FIGURE 1: LOCATION OF OBSERVATION DATA PROFILES AND LITERATURE DATA .............................................................................................................................. 10 FIGURE 2: THE LOCATION OF PSEUDO SAMPLES THAT WERE DERIVED FROM MODIS SATELLITE DATA ON 2ND SEPTEMBER 2021 ............................................ 11 FIGURE 3: PSEUDO-SAMPLES OVERLAID ON GOOGLE SATELLITE IMAGES ....... 12 FIGURE 4: SPATIAL ANALYSIS STEPS FOR DERIVING RANDOM SAMPLES ......... 12 FIGURE 5: GRID OF 13000 POINTS FOR PREDICTION AND MAPPING ...................... 13 FIGURE 6: DATASET DEVELOPMENT PROCESS FLOW............................................... 15 FIGURE 7: TRAINING AND VALIDATION STEPS........................................................... 16 FIGURE 8: LOSS, RMSE, AND MIE PERFORMANCE FOR MODEL SELECTION ....... 19 FIGURE 9: MODEL PERFORMANCE AGAINST AUGMENTED DATA, RF: RANDOM FOREST, MLP: MULTI-LAYER PERCEPTRON ........................................................ 21 FIGURE 10: FINAL MAPS CREATED WITH PREDICTED DATA AT THE GRIDS COORDINATES ............................................................................................................. 24 FIGURE 11: NORMAL DISPERSION AROUND THE MEAN AND LOW SD FOR PH (TOP) AND CEC (BOTTOM) ........................................................................................ 26 FIGURE 12: SPATIAL DISTRIBUTION OF GENERATED DATA BASED ON TWO GROUPS OF TREATMENTS ........................................................................................ 28 FIGURE 13: ORIGINALLY OBSERVED AND GENERATED DATA OVERLAID ON THE PREDICTION MAP ............................................................................................... 29 FIGURE 14: FEATURE IMPORTANCE OF PH ELEMENT PREDICTIONS .................... 31 FIGURE 15: COMPARISON BETWEEN OBSERVED AND GLOBAL SOIL DATA FOR PH (ADAPTED FROM (WIMALASIRI ET AL., 2020C) ............................................. 32 viii List of Tables TABLE 1: CHARACTERISTICS OF SOIL PROPERTIES USED IN THIS RESEARCH .... 9 TABLE 2: CHARACTERISTICS OF OPTIMUM SPACEGAN GENERATOR AND DISCRIMINATOR NETWORKS................................................................................... 17 TABLE 3: TARGET VARIABLE STATISTICS ACROSS THREE TYPES OF DATA ..... 20 TABLE 4: CHARACTERISTICS OF SUCCESSFUL MODELS FOR ALL ELEMENTS (ST TRAINING STRATEGY) ........................................................................................ 22 TABLE 5: COMPARISON OF MODEL ACCURACY WITH PREVIOUS STUDY (ST STRATEGY) ................................................................................................................... 23 ix 1. Introduction 1.1. Soil data High-quality soil information is required for many environmental and Earth sciences, where functions of soil such as water preservation, housing micro-organisms, and nourishing plants’ roots are studied (Chesworth, 2007). Although soil depth can reach up to a few hundred meters (Richter and Markewitz, 1995), it is its first few layers that are usually of interest. Soils are created from parent material for a period of several thousand years, and all the environmental phenomena including but not limited to climate and organisms can affect the soil types in different ways. All these interactions within the soil provides a natural body that has many functions, from being a source of minerals, organic material, nutrients, and water, to sustaining plant growth and food production. Soil is also part of cultures and habitats and supports man- made structures. Data of observational samples (or pedons) that are taken by digging deep into the soils are rarely provided to user communities. Instead, soil information is available through soil maps that are aggregations of similar soils delineated into units or polygons that are provided as printed maps with different colours representing different soil types. 1.2. Why we need better soil data Soil information plays an important role in human life by impacting our decisions related to the environment. General soil knowledge is needed for decision-making at the larger scales, where limited general information is enough to devise a strategy or recommend a course of action. An example of such decisions is determining optimal land use and suitability of crops at the national level and providing a general recommendation about what crops can grow in a region (Wong, 1974). Likewise, soil data are important for understanding environmental risks and hazards such as pollution, floods, and global warming. However, detailed soil information is also needed at the smaller scales for understanding soil-crop relationships, yield forecasting, and water management (Dobos and Hengl, 2009). Also, many environmental models require specific information about a particular soil property such as texture, water holding capacity, degree of acidity and alkalinity, soil electrical charge, and so on. Nonetheless, the conventional soil data can no longer cater to the needs of such models (MacMillan, 2019) often at the level of individual soil elements (see section 3.1.1 for a list of these properties). Therefore, as more and more data are needed, it is important to improve the soil data quality. 1.3. Problem statement Increasing demands for quantitative soil data have led to a shift from providing aggregated data in the form of associations of soil series maps to per-property soil data inventories or maps. In this regard, soil data are increasingly mapped for a specific property across different scales. This has led to a growing interest in developing spatial and non-spatial data structures which could be used for the storage and representation of soil data. Also, methods that were previously developed in computer science to predict and interpolate information are currently used to develop maps of soil properties. A major problem is uncertainty in modelling and accuracy of data. The source of uncertainty in soil data could be from the low number of input sample data and data diversity. Also, available data are rarely collected following a sampling design that is statistically based (identically and independently distributed). This also leads to the existence of large gaps or space between samples that leads to lower accuracy of predictions in areas where sample density is low. Other issues with current sample data include error propagation and uncertainty in associated data such as variation in depths, time of sampling, and coordinate errors that is inherent to sample analysis in soil survey methodologies (MacMillan, 2019). 1.4. Contribution Using de-facto and state-of-the-art techniques to generate maps of continuous soil properties is not a new field. A challenge for mapping, however, is its accuracy and uncertainty at locations where sample density is low and local relationships between the target variable and the features (covariates) cannot be explained (Hengl et al., 2017a). Therefore, the objectives of this research were to: 1- Increase sample size by acquiring more soil data by applying deep learning models to the distribution of already available soil property data; 2- Use machine learning models to evaluate the accuracy of predictions using different combinations of both observed and generated data; 3- Discuss the strengths and weaknesses of using data augmentation, dissemination, and publication of the soil maps for their benefit to the public. The main hypotheses for this research were the following: 1- Both data augmentation and machine learning methods improve the quality of predictions. 2- Machine learning methods provide a more robust approximation of real spatial patterns from data without the need for any statistical assumptions. To solve the data issues, more sample information from 3 sources was added: 1) data that is mined from published articles that contain analysis of soil data, 2) expert-based or pseudo- observations that are obtained by analysing satellite data, and 3) data augmentation using deep neural network models. Together, all data represent two types of information: 1) soil data observations and 2) artificially generated data. A feature dataset containing 157 data layers from different sources such as satellite images, climate, soil, and topography was also created. All these features are representative of environmental factors that are responsible for soil variation across space and time. The observational data were then used to train a Conditional Generative Adversarial Net (CGAN) that is adapted to work for data with coordinates (Latitude, Longitude) and spatial neighbours. A major problem with the output was the lack of diversity in the generated data. This problem was somehow tackled with data standardisation and oversampling of the tails of the input distribution. The training data was then combined with the generated data at different percentages (5%, 10%, 20%, etc.) to test the impact of generated data on the final outputs of a random forest model. For each soil element, an optimum data augmentation mixture was selected based on the lowest root mean square error. The final results showed an improvement over de facto statistical models that were used in the previous studies using additional observed data and machine learning models. Successful models were used to predict values over a grid of 13,000 points spread over the entire country of Sri Lanka to produce final maps. Visual interpretation of the final maps showed that machine learning models were able to detect and map general 2 spatial patterns without any pre-training that is usually required for the statistically-based models. More work needs to be done in the future to optimise the models, both for generating data for soils and optimising the final prediction and maps. 3 2. Theory Semi-detailed soil maps provide information for planning at the national or regional levels. These maps are traditionally developed by surveying the soils and aggregating observations or descriptions of soils based on samples or observations about the type, colour, and other physicochemical characteristics of the soils. However, such data lose “usability” when detailed information is required at smaller scales where informed decisions are crucial and risk levels are high such as crop and water management. This is due to many factors related to the scale of soil maps (failure to capture local variation) and forms of data availability (map polygons), or funding limitations. For example, in a case where an estimate of crop yield is required to be determined under current and future climates, soil data accuracy can have a severe impact on the conclusions that are made based on using those data (Wimalasiri et al., 2020c). Studies have shown the impact of soil data quality on the results of environmental modelling applications such as landslide susceptibility, desertification, and climate change (Stoorvogel et al., 2017). One of the most important requirements of these models is high-resolution quantitative data. An example of such variables are the amount of carbon and or biomass in the soil and standardised quantities such as pH, bulk density, categories such as soil classes, probabilities such as the occurrence of stones beneath the soil (R-horizon), and others. Such variables are now directly measured through soil profile samples obtained by surveying soils in the laboratory (Legros, 2006). These measurements assign a value to a quantitative variable at a location. The so-called primary data that are collected directly from the field can also be used to develop secondary information. For example, soil carbon storage can be predicted from the amount of carbon that was available within samples. 2.1. Improving soil data quality Efforts have been made in recent years to improve the quality of soil information by taking advantage of state of the art in big environmental data and machine learning techniques. Maps of spatial variability of soil data can be improved by using soil data that are published in research articles and literature studies. Further methods are used to fill the gaps in soil data by utilising remote sensing data. These methods are particularly useful for determining the location of Sand dunes and rock outcrops where a particular soil element such as Sand is at the highest quantity or in areas where there is no soil available. In recent years, machine learning models have also been used for the prediction of soil properties. Generative models can particularly be useful in providing extra information needed for soil mapping. 2.1.1. Published sources As soil sampling remains to be a costly process, soil series maps are rarely updated at regional and national scales. On the other hand, as soil data is required for many environmental applications such as agriculture, soil samples that are collected locally are rarely available publicly or seldom published separately in online databases (Batjes et al., 2020). Furthermore, soil data are collected for a variety of purposes such as construction. Also, many published studies may contain soil information that are the result of soil sampling at the local scale. An example of such studies is agronomy studies that look at the performance of crops at a particular location (soil type) in response to specific treatments such as fertiliser and irrigation. Data mining from literature could provide a viable option for increasing sample size, 4 particularly at locations where no sample profile data is available or data mining from the published literature (Sun et al., 2020). 2.1.2. Pseudo-samples A few studies have also looked at improving sample size (and therefore improving the accuracy of predictions) with the addition of pseudo-samples derived from remote sensing data (Hengl et al., 2014; Li et al., 2020; Shangguan et al., 2017a). Some samples can be artificially generated by applying expert knowledge. For example, Shangguan et al., (2017) used satellite data to distinguish the areas where no soil is available (outcrops, hilly and rocky areas) to add a set of pseudo-samples to their dataset of soil depth. This increased the sample size and coverage and hence the quality of the spatial predictions and maps. Such data mining can also be applied to other soil elements such as texture, organic matter, and bulk density to improve the quality of estimations. As (Dewitte et al., 2012) discussed, although using satellite remote sensing results cannot be the only resource for deriving soil information, they would improve our understanding of the surface soil properties. 2.1.3. Data augmentation with deep learning In recent years, machine learning techniques have been used to generate new data for image analysis and map generation. The adversarial nature of Generative Adversarial Nets (GAN) is the most important feature of these models (Creswell et al., 2018). A blunt “generator” that learns from the original data distribution, generates new data from a latent noise. A “discriminator” will then be trained to recognise the fake data from real data. Both networks are trained simultaneously, and the training stops when a criterion is met. GAN models have been successfully used for generating images and maps before. For example, Li et al., (2020) used GAN-based image translation models to generate new data from the distribution of already available data. Likewise, Spick et al., (2019) used a modified spatial GAN method to improve the detail of elevation data that was previously available at a low resolution only. GAN models can be used to develop new datasets that can better resemble the distribution of prior soil data and improve the overall sample density. Given the geospatial nature of soil data, particularly the characteristics of such data that is inherently non-iid (independently or identically distributed), a specific model for data augmentation, particularly for spatial data has been developed to cater to the need to train for data with coordinates (Klemmer et al., 2019). The correlation between data points can be used to generate new data using the distribution of known soil elements using features such as coordinates of the sample data (Latitude and Longitude), spatial neighbourhood matrices, and other soil variables as extra features. The converging mechanism for training the neural network model could be based on some spatial correlation structures or simply a measurement of error. The conditional GAN considers the spatial correlation that exists within geospatial data, as a mechanism for convergence of the generator G that takes random noise as input to generate synthetic data and a discriminator D that does a binary classification. To carry a conditional GAN for spatial point data, a local indicator of spatial correlation called Moran’s I (Anselin, 1995) is computed for each generated set against the training set. To derive Moran’s I, a weight matrix containing information about neighbours of each point i should be assigned. The weight matrix can be defined in such a way that gives more weight to 5 the neighbouring location than far locations. The main assumption here is that the target variable, which in this case is one of the soil elements (bulk density, pH, Organic carbon, etc.) is distributed over the space following a known spatial process. A realisation of the spatial process is the present samples 𝑁 = {1,2, … , 𝑛} . For each point in the dataset, a set of neighbours can be defined that are considered to systematically influence the value of each target point. The weight value 𝑤𝑖,𝑗 are assigned to the neighbours, (js) of point i, following a standard criterion to define nearest neighbours, 𝒩𝑖 = {𝑗𝜖𝑁|∃𝑖 ∋ 𝑁 ∶ 𝑤𝑖,𝑗 ≠ 0} . I is then calculated for each i: 𝑛 𝑦𝑖 − ?̅? 𝐼𝑖 = (𝑛 − 1) 𝑛 ∑ 𝑤𝑖𝑗(𝑦𝑖 − ?̅?) ∑𝑗=1,𝑗≠1(𝑦𝑖 − ?̅?) 2 𝑗=1,𝑗≠1 where ?̅? is the mean of the target variable and 𝑤𝑖𝑗 is the components of the weight matrix. A stopping mechanism for convergence that looks for minimisation of Moran’s I mean absolute error is defined as (Klemmer et al., 2019): 𝑛 𝑀𝐼𝐸 = ∑ |𝐼(𝑦𝑖) − 𝐼(?̂?𝑖)| 𝑖=1 Where 𝐼(?̂?𝑖) is the Moran’s I calculated from the generated sample. Klemmer et al., (2019) also implemented the root mean square error (RMSE) criteria and compared it with the original riteria (MIE). 2.1.4. Covariate features Spatial prediction models for soil utilise information from an extensive array of covariate features for their prediction. Whether the prediction method is a simple regression model or a highly sophisticated quantile regressions forest, highly correlated variables with the target variable play an important role in the prediction. The main justification for such correlation is the interactions that are usually attributed to soil structure and properties with different environmental conditions. The variation in the soil can be correlated to different conditions in climate, organisms, relief, age of parent material, and other factors (Minasny and McBratney, 1999). The geo-environmental features can range from climatic variables (temperature, precipitation) to topographical features derived from digital elevation models (DEMs) and soil physicochemical and compositional properties (Lalitha et al., 2021; MacMillan, 2019; Moghaddam et al., 2020; Shangguan et al., 2017b). 2.2. State of the art in soil mapping The main goal of soil prediction models is to reduce the bias and variance in predictions. However, different models utilise different pathways for predictions that are based on widely different assumptions. For example, conventional statistical models assume that patterns of spatial variation exist within the soil data because nearer observations tend to resemble similar values. This is based on Tobler’s first law of geography that entails similar objects are closer to each other than far objects (Waters, 2018). Other models tend to ignore such relationships and try to use only mechanistic models to arrive at a prediction. On the other hand, machine learning models can be developed to understand the complex relationship between target variables and the covariates within the context of linear or non-linear models. 6 The overall interpolation and prediction of soil data have been traditionally made with the help of physically based, statistical, and machine learning models (Hengl et al., 2017b; Kuriakose et al., 2009; MacMillan, 2019). However, in recent years it is repeatedly shown that machine learning models and in particular, tree-based variants are very successful in producing acceptable predictions for soil properties. Statistical methods such as kriging and deterministic approaches such as Inverse Distance Weighting (IDW) have also been used to develop and compare the results of prediction. Kriging is particularly useful in modelling the stochastic part of the variation when it is combined with other models such as random forest. For example Szabó et al., (2019) used an ensemble of kriging with random forest models to improve the mapping accuracy of soil hydraulic properties. 2.2.1. Deterministic interpolation models Prediction or interpolation models based on soil observation data (augmented average of neighbouring locations) are based on methods such as Theisen polygons and IDW. These methods consider the position of samples to create a mathematical relationship for the prediction at unknown locations. For example Karlsson et al., (2014) used four methods, linear regression (LR) with topographical variables as independent variables, IDW, Trigonometric approach (TA) and a simple regolith model (SRM) with outcrops supplied from a digital elevation model (DEM) to predict soil depth for three locations in Stockholm County, Sweden (Tyresö, Österåker, Vallentuna). The results showed that depending on the sample size, both IDW and LR can perform well, while physically based models such as SRM can perform well if the drilling data are available but very sparse. 2.2.2. Stochastic interpolation models Methods that include inherent spatial variability patterns defined as ‘spatial autocorrelation’ are collectively called geostatistics (Goovaerts and Goovaerts, 1997). The main premise of geostatistics is to provide an estimate within a reasonable scoping area that the data covers for unsampled locations based on a model of variation that is developed as a function of distance. The idea here is that phenomena such as soil depth can show randomness in their behaviours across space since there are too many unknown factors that affect the value at each location. In this case, at location 𝐱, a value 𝑍(𝐱) is drawn randomly from some probability distribution with its cumulative distribution function (CDF) and probability distribution function (PDF). The true realisation of random function (actual values), 𝑍(𝐱1), 𝑍(𝐱2), … , 𝑍(𝐱𝑖) at locations 𝐱𝑖 , 𝑖 = 1, … 𝑛 constitute a regionalised variable (Webster and Oliver, 2007). Therefore, in the case of spatial data, the randomness could be double (both values at each location and the locations themselves). The values within regionalised variables are not independent and closer locations generally exhibit similar values (Jahanshiri et al., 2015). Until recently, the statistical methods were the de facto methods for the prediction of spatial data. Methods such as kriging, are still used for the prediction of soil properties at smaller scales. 2.2.3. Machine Learning models Random forest uses fully grown decision trees (if-then questions about labelled data or a threshold in the case of continuous variables) to create an ensemble that will then vote for best- representing classes (Breiman, 2001). Random forest is shown to be the most accurate method for prediction when compared exhaustively with other methods (Fernández-Delgado et al., 2014). The gradient boosting technique (Chen and Guestrin, 2016) primarily reduces bias and 7 variance by giving more weight to badly classified observations. Both of the above-mentioned methods have been used in soil mapping (Hengl et al., 2017a) with various degrees of success. Other methods such as artificial machine learning (ANN) methods have also been used recently to develop maps of spatial variability of soils. It is reported that ANN can also detect spatial patterns and predict the values at unsampled locations like statistician methods. For example Bouasria et al., (2022) investigated the use ANNs in the prediction of soil organic matter (OC) and report an improvement over the linear regression method. The use of covariate features, particularly those related to terrain are essential for any machine learning model that is dealing with the soil data prediction (Bagheri bodaghabadi et al., 2015) The use of machine learning models for soil mapping is on the rise and as (Heuvelink and Webster, 2022) argue: “they are here to stay”. One reason for their popularity is that more and more data are made available and the extent to which environmental data can be used as features or independent variables has made their use even more plausible. Machine learning models are inherently non-spatial, that is, they are not trained to generalise any pattern other than those that they can discover themselves. This will make them even more powerful as they are no longer bound by the limits of traditional statistics, where a model of spatial prediction is needed to be constructed before any prediction can be made and normality of input variables is also considered in the analysis. 8 3. Methods In this section, detail of procedures and methods that were followed to create training and evaluation datasets for the experiments are discussed. Also, the main methods used to examine the main hypotheses of the project (section 1.4) are presented. 3.1. Data 3.1.1. Observed data A dataset containing soil property data across Sri Lanka was made available to the author by Prof. Ranjith B. Mapa, Emeritus Professor and former Chair in Soil Science, University of Peradeniya (Wimalasiri et al., 2020a). This dataset has already been used to develop soil property maps with limited accuracy due to its very limited size. Each observation contains data from different soil elements originally collected through SRICANSOL project that was implemented by the Soil Science Society of Sri Lanka (SSSSL) in collaboration with Canadian Society of Soil Sciences (CSSS) (Mapa, 2020). Table 1 shows the characteristics of soil properties (henceforth called elements) used in this research. Table 1: Characteristics of soil properties used in this research Element Explanation Unit Latitude, Longitude Location of the soil information. Important because soils Degrees (or (or X and Y) form similar patterns over space. meters) Sand, Silt, Clay The proportion of mineral particles in the soil. Sands (0.05 - % 2 mm), silt (0.002 - 0.05 mm) and clay (< 0.002 mm). Together provide information about soil fertility, biological processes, etc. BD Bulk density or the amount of pore space available within the g cm-3 soil. It indicates how much air is available within soils. pH Degree of acidity and alkalinity that impacts the chemical Scale of 0 to properties of soil. These chemical properties will in turn 14 determine many other factors such as the fertility of the soils EC Electrical conductivity is the ability of soil water to transmit milliSiemens electricity. It can be used to show the degree of salinity of per meter (mS soils and nutrient availability. m-1) CEC Cation exchange capacity is the ability of soil to hold centimoles exchangeable positive cations such as Na, K, Mg, etc. These per kilogram minerals have a direct impact on the soil fertility of of soil negatively charges soils. (cmol(+) kg- 1) OC Organic carbon content in the soil. The more organic carbon % in the soil the more fertile it becomes. VMC0.33, VMC15 Volumetric Moisture Content at two different pressure (Kilo mm mm-1 Pascal). Indicators of the capacity of soils for holding water at two different levels, field capacity and permanent wilting point Although the observed soil profile dataset is a valuable source of information to understand soils across the country, the current sample size (n = 120) is simply not enough for modern earth system models that require detailed soil information (M Suhari et al., 2022). As soil samples are usually collected at different depths, comparing values, and making inferences based on the spatial patterns of such data is a challenge. To overcome this problem, data at different depths are harmonised to make them available for further analysis (Hengl et al., 2014; Sulaeman et al., 2013; Wimalasiri et al., 2020c). The soil properties were standardised for 5 standard depths: 0-5 cm, 5-15 cm, 15-30 cm, 30-60 cm and 60-100cm. To increase the density of soil data, data mining through literature analysis was carried out by extracting soil data that are available in published scientific articles. Data from 183 published studies between 2006 and 2021 were collected manually and organised in collaboration with the University of Sabaragamuwa in Sri Lanka (Thathsarani, 2022). Collected data include bulk density (gcm−3), particle density, porosity, moisture, pH, organic carbon content (%), organic matter content, electrical conductivity, cation exchange capacity (CEC) (cmol + kg–1), clay (%), sand (%) and silt content (%), texture classes, soil colour, volumetric moisture content (VMC) at 0.33 bars level (drainage upper limit – mm mm−1), VMC at 15 bars level (wilting point – mm mm−1. However as different studies were done based on different objectives, not all data were available for all the above-mentioned variables. For simplicity and to alleviate computational issues, only values at the top layers (0-5 cm) where chosen. Also, as data within published articles are collected using different objectives, a data harmonisation exercise was carried out to standardise their depths. Figure 1 shows the location of the additional 180 samples that were collected by literature analysis. The density of samples was increased from 1 sample per 1 square kilometre to 4 samples per square kilometre across the country. As soil data are collected from different depths they were standardised across different depths using the methodology that was developed by Sulaeman et al., (2013) (see section 3.1.1). Figure 1: Location of observation data profiles and literature data 10 3.1.2. Pseudo-observations This data augmentation technique could improve the density of sample data by adding more information, particularly in areas where sample density is low. Data from the National Aeronautics and Space Administration of the U.S. (NASA) MODIS (MODerate Resolution Imaging Spectroradiometer) satellite was used for this purpose. The MCD43A4 (band 7 infrared) data product contains information about the Earth’s surface reflectance or albedo (Schaaf, 2015). To identify likely locations that can be used as pseudo-samples, 600 images covering Sri Lanka over 2021 were analysed. Initial analysis revealed that the best image date to derive the locations was 2nd September 2021 due to its wide range of albedo values. It is worth mentioning that the MCD43A4 product is already cleared from clouds. A survey of literature revealed that the typical albedo value for Sand dunes is 0.38 (Ashkenazy and Shilo, 2018; Tetzlaff, 1983). About 150 samples or 5% of the observed soil data as per recommendation of (Shangguan et al., 2017b) were selected using a script that was developed in R statistical language (R Core Team., 2020). The selected areas were mostly coastal (Figure 2). Figure 2: The location of pseudo samples that were derived from MODIS satellite data on 2nd September 2021 To reduce the number of clusters in the data and to improve the quality, all generated samples were examined with an overlay analysis on Google Earth (Figure 3), and coinciding samples with land-uses or clustering samples were removed (Buchhorn et al., 2020). To simplify the process, a high value of sand 70% was given to these data points. Also, due to low amount of organic matter in sand dunes, the value of organic carbon for those locations was set to 0. 11 Figure 3: Pseudo-samples overlaid on Google satellite images 3.1.3. GAN data generation To improve the sample size, and sample coverage, geographical analysis was used to develop a random sample to fill the gaps in observed and literature data. A circular buffer of 10 km was first generated for each sample location (Figure 4 A). The buffer areas were then joined together using a vector analysis operation in geographic information system (GIS) software (QGIS.org, 2022) to derive empty areas within the empty polygons (green areas Figure 4 B). A random sample of 300 points was then generated within the area that was under-represented by original samples. The location of these samples will be used to generate values for the soil elements (Table 1). The same coordinates were also used to derive the covariates features (section 3.1.4). A B C Figure 4: Spatial analysis steps for deriving random samples 12 Data generation was carried out using a special deep learning algorithm that has recently been developed to use conditional generative adversarial nets (CGANs) that considers the spatial dependence of observations (see section 2.1.3). To perform the data augmentation, target variables (observed data) were organised based on coordinates so that data import is possible. The code for conditional GAN was adapted based on code from the Space-GAN repository (Klemmer, 2021). The algorithm and code were adopted to train neural nets based on spatial coordinates (Latitude and Longitude), nearest neighbour values, and the target variables (Table 1). The main selection indicators are Mean Moran’s I Error (MIE) and RMSE. The functions were run several times to ensure the best hyperparameters are selected for each soil element. 3.1.4. Covariate features Environmental processes and factors that can influence soil genesis are climates, vegetation, and topography (section 2.1.4). In order to capture these influences as much as possible, an extended dataset of more than 150 features containing layers from different environmental datasets that are available on Google Earth Engine (GEE) (Gorelick et al., 2017) was obtained. Two sets of coordinates were developed and uploaded to the GEE; a dataset containing locations of the observation and augmented data (Figure 1and Figure 4C) and the other, a dataset of locations for a grid of more than 13,000 grid points that will then be used for mapping (Figure 5). Figure 5: Grid of 13000 points for prediction and mapping The following layers were extracted using the geedataextract library that was developed in Python: 1- Topography layers; elevation, aspect, and slope from NASA’s highest-resolution digital elevation model of the Earth (Farr et al., 2007). 2- Climate and soil-related layer averages from TerraClimate monthly averages from 2010 to 2020 (Abatzoglou et al., 2018). The layers include the following information: a. Actual evapotranspiration (mm) b. Climatic water deficit (mm) c. Palmer Drought Severity Index d. Reference evapotranspiration (Penman-Montieth, mm) e. Precipitation (mm) f. Runoff (mm) 13 g. Soil moisture (mm) h. Downward surface shortwave radiation (W/m2) i. Snow water equivalent (mm) j. Min temperature (deg C) k. Max temperature (deg C) l. Vapor pressure (kPa) m. Vapor pressure deficit (kPa) n. Wind-speed at 10m (m/s) 3- Normalized Difference Vegetation Index from USGS Landsat 4,5,7,& 8 Surface Reflectance Tier 1 products for the period 2013 to 2021 (U.S. Geological Survey, 2022). A total of 156 features were created from the abovementioned list for both training-testing and grid dataset and made ready for further analysis. 3.1.5. Training, testing, and prediction datasets The following procedure was followed to develop training, testing, and prediction datasets: 1- The set consisting of observation data was divided randomly into a training (60%) and testing set (40%) using train-test split function in Scikit-learn package (Pedregosa et al., 2011). 2- The training set were combined with augmented data to create a series of training sets. These sets will include 0 to 100% of augmented data. These training sets were used to train the prediction models. 3- The testing set was then used to evaluate the result of the training. Figure 6 shows the entire procedure for developing datasets for the analysis. 14 Figure 6: Dataset development process flow 3.1.6. Coordinate reference system The main coordinate reference system (CRS) was WGS 84 -- WGS84 - World Geodetic System 1984 with Geodetic Parameter Dataset (EPSG) of 4326 (“WGS 84 - EPSG:4326,” 2022), which is the official coordinate system for most of the global applications such as Global Positioning System (GPS) and satellite data. To perform spatial analysis, the map layer CRS were projected to Universal Transverse Mercator (UTM) zone 44N with EPSG of 32644 (“WGS 84 / UTM zone 44N - EPSG,” 2022). 3.2. Analysis methods The next step is to utilize the training datasets to evaluate machine learning models for the prediction of select soil properties. The aim here is to improve the prediction by minimising the error and improving the extrapolation ability (Meyer, 2016) of the prediction models. 3.2.1. Machine learning models Machine learning models, particularly tree-based algorithms such as random forest and gradient boosting are more successful than de facto statistical methods such as linear (or logistic) regression (Hengl et al., 2017a). Newly developed methods for determining accuracy (particularly by generating uncertainty maps) such as quantile regression (Kasraei et al., 2021) can also be examined to evaluate the uncertainty of predictions. Figure 7 shows the training and analysis steps that will be taken in this research. 15 Figure 7: Training and validation steps Following the training and evaluation procedure, the successful method was used to predict the value of soil properties over a grid across the country of Sri Lanka. The final maps will follow the style and categories of the published maps (Wimalasiri et al., 2020a). 3.2.2. Evaluation To evaluate and compare the results, both standard error analysis and cross-validation are chosen instead (MacMillan, 2019). The reason for the inclusion of cross-validation is twofold: cross-validation is the de-facto method for evaluating the performance of spatial models and it is also a suitable method to alleviate overfitting issues in machine learning methods. The statistics that were chosen to compare the model fitting were coefficient of determination (R2) and root mean square error (RMSE) as described by (Shangguan et al., 2017b). 3.3. Software Many software packages are available for spatial data manipulation, prediction, and mapping. The focus of this research was on open-source software. Both R (R Core Team., 2020) and Python (Van Rossum and Drake Jr, 1995) were used to develop and automate data pre- processing and analysis code. Basic exploratory data analysis and visualisations were done with R. The capabilities of machine learning in Python were used for testing a few machine learning algorithms (random forest and neural networks). Finally, mapping, and final visualisations was done within R. Geographic analysis was performed using an open-source GIS software (QGIS.org). 16 4. Results In this section, statistics and main results are presented. Also, general trends in the outputs are identified and discussed and impetus is provided for the discussion section. 4.1. Tuning data augmentation model All the training sets of the observations (Figure 1) were used to train a SpaceGAN model for each soil element. The trained models were then used to predict value for each target variable at the randomly selected points in Figure 4. These randomly selected points were subsequently used for training machine learning models and accuracy assessment (section 3.1.3). For each target variable within the first depth (0-5 cm), a SpaceGAN model was trained based on the coordinates of each training set as well as the target variable. Since the scale of the target variable is continuous, a distance-based neighbourhood weight matrix was developed with the Python Spatial Analysis Library Core (Rey and Anselin, 2007) using all three variables (latitude, longitude, target variable) for each target variable (Klemmer, 2021). To improve the results and avoid mode collapse (Klemmer, 2022), a series of pre-processing steps were done. It was determined that removing outliers that are distanced 1.5 times the interquartile range (Q3-Q1) was deemed to be a useful strategy. Normalisation and scaling were not very helpful due to back-transformation needs for the predicted data. Other measures to avoid the mode collapse problem are discussed in section 5.2. Table 2 shows the characteristics of the successful models that were used to generate data for all target variables. Table 2: Characteristics of optimum SpaceGAN generator and discriminator networks Target variables BD, CEC, Clay, EC, OC, pH, Sand, Silt, VMC0.33 and VMC15 Neighbour features Target variables + Coordinates (Longitude, Latitude) Average epochs 3000 (2X Generator vs Discriminator) Batch size 100 Latent noise input Gaussian Generator Linear (2 times size of neighbour features, 50), ReLU, Linear (50, target dimension) Discriminator Linear (2 times size of neighbour features, 50), Tanh, Linear (50, 1), Sigmoid Learning rate 0.01 (both generator and discriminator) Neighbours 50 Weight matrix Distance based Loss SGD (stochastic gradient descent) Model selection metrics RMSE, MIE Training a SpaceGAN model using a small sample proved to be a non-trivial task. After many iterations and tuning of the parameters, the best parameters that would produce acceptable results in terms of accuracy were selected for further analysis. According to (Klemmer, 2022), this type of periodical loss is to be expected, particularly for smaller datasets. The structure that are described in Table 2 was successfully applied to all the elements except for Sand, which needed more training to generate loss function. Although MIE criteria seemed to be robust for some of the target variables, model selection based on the minimum RMSE proved to be a better choice for data augmentation for the majority of data elements. Three major treatments for GAN modelling were developed to improve the results and avoid the mode collapse phenomena (Klemmer, 2022; Pei et al., 2021): 1- Tuning the GAN model based on different learning rates. 2- Training Discriminator l times more than Generator (where l is an even number). 3- Oversampling the tails of the original data distribution during training. The training batch was divided by two, and half of the data would be sampled specifically from the tails of the distribution (below Quantile 1 and beyond Quantile 3). 4- Outlier removal: removing data points that are beyond the interquartile range of data points (Q1 or Q3 + 1.5 IQR where IQR = Q3 – Q1). 5- Normalisation and lognormal transformation Based on the results of the training experiment, two groups of treatments were identified that offered the best results in terms of training statistics and spatial distribution of generated samples. - OT: learning rate (0.01), training Generator 2 times and training Discriminator 1 time, and outlier removal (methods 1, 2, 4 above) - ST: learning rate (0.01), training Generator 2 times and training Discriminator 1 time, outlier removal and oversampling tails (methods 1, 2, 3, and 4 above). Using any other treatment such as normalisation and transformation did not produce satisfactory results in terms of MIE, RMSE, and loss reduction. Figure 8 shows the loss and model performance for the successful models based on the ST treatment strategy. 18 BD CEC Clay EC OC pH Sand Silt VMC0.33 VMC15 Figure 8: Loss, RMSE, and MIE performance for model selection 4.2. Exploratory analysis Table 3 shows basic statistics of observed and generated samples. Although generated data ranges are within the boundaries of the observed data, their statistics tend to move towards the mean of the observed data distribution and stay around the mean with little variation. Except for BD, Sand and Silt, generated mean values are smaller than those of the real data (OBS). Also, data dispersion around the mean is also lower for all the elements except BD. Without any exception, the minimum values for all the elements are higher than the observational data and their maximum is lower than the observations’ maximum. This is another indicator that generated data is respecting the boundaries of plausible values. The values for pseudo-samples for OC and Sand remain constant because they are assigned values (section 2.1.2) and therefore their mean is constant, and their standard deviation is 0. 19 Table 3: Target variable statistics across three types of data TYPE BD CEC CLAY EC OC pH SAND SILT VMC15 VMC0.33 GEN 1.47 9.89 15.26 194.75 0.64 6.05 73.205 14.78 0.12 0.2 MEAN OBS 1.42 16.83 19.2 516.62 1.45 6.34 67.79 11.53 0.14 0.2 GEN 0.23 1.66 1.88 40.47 0.22 0.3 9.72 1.53 0.01 0.01 SD OBS 0.23 37.53 13.67 2619.75 1.71 1.13 18.48 6.82 0.07 0.08 GEN 0.97 5.02 10.97 85.45 0 5.4 51.39 11.1 0.09 0.16 MIN OBS 0.62 0.91 0.1 0.02 0.04 3.1 1.68 0 0.02 0.03 GEN 1.86 15.07 20.46 310.08 1.79 6.98 92.17 19.33 0.15 0.24 MAX OBS 1.982 396 53.548 29100 24.558 8.584 99 41 0.437 0.553 COUN GEN 299 299 299 299 329 299 329 299 299 299 T OBS 428 428 428 428 428 428 428 428 428 428 Note: - Generated samples (GEN) for OC and Sand include Pseudo-samples. - Observed (OBS) data are from (Thathsarani, 2022) 4.3. Modelling results Except for pH and VMC (both 0.33 and 15 level), the response to various degrees of data augmentation: mixing observational data with the synthetic data (GAN and Pseudo samples). The most successful model for all the soil data elements was random forest except for Clay. Different levels of data augmentation had different effects on the root mean square error (RMSE) and coefficient of determination values (Figure 9). This could be due to the random nature of sampling from augmented data. Even though a certain amount of data augmentation could help reduce RMSE, the overall coefficient of determination for EC is very low and it didn’t help to improve the results. The cross-validation score (CV score) is a measure of performance for repeated one-out sample prediction, and it is a good measure when the number of samples is limited. The CV score based on R2 does not show any specific pattern of increase or decrease. However, it shows that for most models, there is a gain by increasing the number of samples and therefore improve the accuracy of predictions during training. They are all positive numbers except for Clay, EC, and CEC. 20 Figure 9: Model performance against augmented data, RF: random forest, MLP: multi- layer perceptron 21 4.4. Model performance Table 4 shows the characteristics of successful models, the percent of data augmentation that lead to those results, and the Coefficient of Determination (CoD) for all the predictions using the ST treatment group (section 3.1.3 and Figure 12). For all models, a certain degree of data augmentation would lead to a better prediction over the baseline observed dataset (data augmentation = 0%). The highest data augmentation requirement is for Clay and the lowest are for pH and Sand (with pseudo-samples). CoD for pH is the highest while for OC it is negative, meaning that even with the addition of 75% augmented data to the original dataset, there was no significant improvement with machine learning models for prediction. However, this result is still better than the baseline data (without any augmentation). The addition of uniform pseudo-samples did not have any significant impact on the improvement of results for Sand and OC as the coefficient of determination did not change dramatically. Table 4: Characteristics of successful models for all elements (ST training strategy) ELEMENT MODEL % DATA COD AUGMENTATION BD RF 20 0.28 CEC RF 70 0.26 CLAY MLP 90 0.23 EC RF 45 0.02 OC-WP RF 75 -1.03 OC-NP RF 10 -0.78 PH RF 5 0.56 SAND-WP RF 5 0.38 SAND-NP RF 90 0.38 SILT RF 95 0.22 VMC0.33 RF 10 0.28 VMC15 RF 45 0.19 Note: RF: Random Forest, WP: with pseudo-samples, NP: no pseudo-samples, COD: coefficient of determination Table 5 shows the final model CV score and machine learning RMSE in comparison with the results of the previous study (Wimalasiri et al., 2020b) that used statistical and mechanistic models for prediction using baseline observation data. Except for BD, CEC, and Clay, addition of observed literature data and machine learning has improved the results of modelling. It is noteworthy that since no distinction is made between the type of observed data (direct filed survey data from SRICANSOL project and those that are collected recently from the literature), comparing the results of the present analysis with those of the previous study shown in Table 5 might not be meaningful as it is unclear at this point what is the main reason for the improvement (addition of literature data or machine learning models or both). Therefore, further research might be needed to ascertain direct impact of data and models on the accuracy of the outputs. 22 Table 5: Comparison of model accuracy with previous study (ST strategy) ELEMENT CVSCOR ML RMSE PREVIOUS METHOD RMSE BD 0.11 0.201 0.184 EMK CEC -36.1 47.303 14.949 EMK CLAY -0.25 11.369 10.838 RBF EC -104.39 1884.86 - OC-WP 0.03 1.211 2.649 EMK OC-NP 0.09 1.133 PH 0.46 0.736 0.770 EMK SAND-WP 0.18 14.767 15.970 RBF SAND-NP 0.17 14.69 SILT 0.05 5.361 8.159 RBF VMC0.33 0.15 0.066 0.090 EMK VMC15 0.3 0.061 0.082 EMK Note: CVSCOR: cross validation score, ML: machine learning, RMSE: root mean square error, EMK: empirical bayes kriging, RBF: radial basis function. Note: previous RMSE are achieved without the use of literature data 4.5. Final maps A grid of more than 13,000 points (Figure 5) was used to predict the values for target variables and develop final maps based on the machine learning models. The threshold between classes (colours) was defined based on the histogram of the input. For each target variable, 6 classes were identified based on the range and the number of bins (Sturges, 1926). A uniform colour chart was assigned to different data intervals to create distinguishable patterns for the final maps. Although the resulting maps are not very smooth for most of the maps presented in Figure 10, machine learning models were still able to detect the general patterns without the need for any statistical modelling or assumptions. For Clay, a general improvement can be seen in terms of uniform patterns the model performance was not superior compared to the previous study. 23 Figure 10: Final maps created with predicted data at the grids coordinates 24 5. Discussion In this section analysis results are discussed concerning the main objectives and the hypotheses of the research. The strengths and limitations of the results and their implication for real-world applications are also reviewed. 5.1. Data augmentation Two main datasets that were used for this research were: 1- Observation data that are a combination of samples from previous surveys and those that were extracted from the literature (section 3.1.1). 2- Augmented data or artificial samples that were identified based on surface reflectance (Sand and OC only) and generated data using Space-GAN (all variables). Soil sampling is a resource-consuming endeavour that is usually carried out at regional or national scales or across smaller areas to fulfil a specific objective. As sampling density for national and regional scales is usually lower than those that are collected for smaller areas, there is a possibility to combine data from different scales to create better datasets (Batjes et al., 2020). Unlike large-scale sampling data that are usually available in different formats, small- scale sample data are rarely shared with the researchers, except for those that are published in research articles. In this research, a dataset of previously collected and harmonised soil samples that were combined with the observed datasets that were collected from published articles. Furthermore, an augmented dataset was developed using satellite remote sensing analysis and Generative Adversarial Nets. A survey of literature reveals that only two studies have augmented their training data with pseudo-samples derived from remote sensing images (Hengl et al., 2017b; Shangguan et al., 2017b). Both studies advise that expert-based samples should only be between 1-5% of the training data. Therefore, the number of pseudo-samples was kept at a minimum in this study. Pseudo-samples were only applied to Sand and organic carbon content (OC) in this research. To date, no study has investigated the possibility of the addition of generated data using GAN to improve the quality of the predictions and no guideline is available to follow in this case. Therefore, to test the impact of generated data, a large set of 300 samples were generated for all target variables to quantify the degree of improvements with the addition of 0 to 100% of the augmented data. The density of observed data is 1 sample per 155 sqkm whilst with the addition of the generated samples, the density increased were improved to approximately 1 sample per 90 sqkm (Figure 1 and Figure 2). Except for the minimum value of OC, the minimum for all generated data is higher than the observed values and the maximum is lower than the observed maximum value in Table 2. The mean values for all elements except for BD, pH, and VMC (both 0.33 and 15 bars level drainage upper limit and permanent wilting point –mm mm–1) are lower than the true (observed data) mean. This shows that the generated data tends to be more conservative and acknowledges the range of the observational data. However, this might be related to the problem of mode collapse that is also associated with the generative models (Bau et al., 2019; Pei et al., 2021). This problem is also highlighted in Figure 11 for pH and Sand. The generated (fake) data covers almost all values for pH but this case is not true for Sand values. Another important aspect of generated data is low diversity, as for all the variables (except for BD), the standard deviation is much lower than the original data. Figure 11: Normal dispersion around the mean and low SD for pH (top) and CEC (bottom) 26 5.2. Mode collapse Different strategies for mode collapse were used to improve the results of the output data. One major problem was to improve the diversity while keeping the output of GAN model within the plausible range of the original data. Section 4.1 described major treatments that were tested to identify the best GAN model for training all the output variables. The two groups of treatments; OT and ST have three strategies in common. The learning rate is an important regularisation factor in training neural network models. The major issue of small samples can be overcome with higher learning from the beginning of the training routine. In this research, the network structures are simple and therefore, reducing the learning rate beyond the threshold 0.01 (Table 2) did not produce satisfactory results (Li et al., 2019). Removing outliers improved the results by reducing the diversity in the original data while releasing the GAN model to try to learn sub-population distributions. Similar to statistical modelling, evidence-based pre-processing techniques can have a great impact on the results of machine learning models (Fan et al., 2021). Machine learning models are not inherently trained to detect patterns based on prior assumptions. Indeed, it has been reported that batch or mini-batch normalisation can improve the results of image classification (Blanch et al., 2019). However, for the present case, batch normalisation did not produce satisfactory results due to two reasons: 1- Small scale variation that is present within some of the target variables (e.g., BD and CEC) 2- Complications with inverse normalisation due to small variation in the observational dataset. Another major solution that proved useful in this research was unbalanced training of the input data (Klemmer, 2022). In this case, either the Discriminator or Generator are trained with more iterations. The most successful approach to unbalanced training was training the Generator twice and Discriminator only one time. This could help the Generator, gather more information about the original data while avoiding the need for hard classification using the Discriminator. A major distinguisher between the two treatments OT and ST is over-sampling the tails of the training distribution. Although outlier removal reduces the burden of the Generator to generate more diverse outputs within the reasonable range of the input data, oversampling the tails of the distributions, particularly for highly skewed data, leads to better results. Indeed, the results that are discussed in Tables 3-4 were achieved using this method. However, a major issue that arises, in this case, is the spatial distribution of the output data. Figure 12 shows an exemplar spatial distribution of generated data versus original data using OT and ST treatments. The proximity of generated data and the original data is not explainable for the case of ST as it is compared to OT. Particularly for BD, the model assigns lower values even in the vicinity of the higher values in the original set that can perhaps be attributed to the attributes such as elevation and low bulk density in lowlands. However, ST can capture more variation in data compared to OT treatment which is probably the reason behind its slightly better results (section 4.1). 27 Figure 12: Spatial distribution of generated data based on two groups of treatments 5.3. Accuracy assessment The addition of GAN and pseudo-samples have improved the representativeness of the samples by increasing sample density and even though the results are sub-par for most of the variables, a degree of data augmentation has helped with improving the results (section 4.3). Model performances shown in Figure 9 demonstrate a random behaviour (except for EC, pH, and VMC33). This could be due to randomness in batch selection during the sampling. Although the density is higher with the addition of generated samples, the sample size is still considered low for the area of the country (close to 65,000 square km). This could be the reason for the random behaviour of the model outputs. A major issue with low sample size is representativeness and sampling scheme. There are two important properties of the dataset used in this study: 1- The original dataset does not follow any objective design. That is, the locations are not random. This violates the iid assumptions and this could lead to biased results. This problem is solved with consideration of “spatial processes” (Webster and Oliver, 2001) where randomness can also be defined at each generated output. 2- The locations of the generated data however are randomly selected. This will resolve the iid assumption since samples are selected following a simple random sampling scheme (SRS), the chance of the models producing biased results is lower (even with the presence of original convenient sampling). In principal machine learning models can be used for prediction using any type of data, regardless of their sampling design (OpenGeoHub, 2022) and this gives them a better advantage to be used for spatial data prediction. However, the main issue with machine learning models is the sample size, which poses a challenge when the sample cannot be considered representative. For most of the target variables, the random forest model tends to produce the best results for ST treatment particularly when there is a non-linear relationship between covariate features and the target variables and this confirms the results from the previously published studies (Hengl et al., 2017a; Shangguan et al., 2017c). This also holds for the OT treatment results (Appendix 28 A 1 and A 2). However, the results of modelling based on OT do not differ meaningfully from those of ST and both agree with previous studies. 5.4. Predictions and features One of the most important goals for spatial prediction using any model is to be able to map and visualise the results. Also, it is important for any map that visualises the soil spatial variability to showcase some patterns of homogeneity based on the intuitive notion of the first law of geography. Keeping in mind that machine learning models are not inherently ‘trained’ or ‘supervised’ to detect any predefined spatial pattern, the output maps showcase some of those patterns that correspond to the previously published maps (Wimalasiri et al., 2020b). One of the important drawbacks of the maps is the presence of artefacts or areas of discontinuity. This could be because in those areas where there is low sample density or gaps, machine learning models perform unexpectedly and create these artefacts. However, artefacts could be due to different colour scales that were chosen in Figure 10. Figure 13 shows the observation data overlaid on the predicted surface map. Although the prediction surface misses some obvious values, it captures the general patterns of variation. Figure 13: Originally observed and generated data overlaid on the prediction map 29 Using an extended array of coordinates is not always a necessary element for good prediction. Figure 14 shows the importance for all 157 features that were used for modelling of pH element (the best model for pH used 5% generated data for an RF model). This shows that the amount of evapotranspiration, vegetation, and elevation are among the most important features that can predict the values of pH more accurately. 30 aet: Actual evapotranspiration (mm), def: climatic water deficit (mm), pdsi: Palmer Drought Severity Index, pet: reference evapotranspiration (penman-montieth, mm), pr: precipitation (mm), ro: runoff (mm), soil: soil moisture (mm), srad: Downward surface shortwave radiation (W/m2), swe: snow water equivalent (mm), tmmn:min temperature (deg C), tmmx: max temperature (deg C), vap: vapor pressure (kPa), vpd: vapor pressure deficit (kPa), vs: Wind-speed at 10m (m/s), nvdi: normalized difference vegetation Index Figure 14: Feature importance of pH element predictions 31 5.5. Data usability 5.5.1. Modern soil data Soil knowledge is needed for modern decision-making in many sectors from agriculture to construction. The recent development in three areas has made it possible to provide better soil data for the users.: 1- Big environmental data: there are extensive datasets available for use from satellite images to surface maps of temperature and rainfall statistics that can be used to build a strong covariate feature for prediction. 2- The development of machine learning models particularly ensemble models have contributed to the development and visualisation of large-scale prediction of environmental data, particularly soil maps. 3- Soil data are now increasingly made available to the public, despite the barriers that exist in terms of copyright and ownership. This will establish a strong feedback loop between users and producers of data. The combination of all three abovementioned advances would gradually help improve the quality of soil data. Such improvements are not possible without an understanding of the spatial relationship between features and machine learning models. All the above mentioned elements have been used in this research, however, as previous studies have shown, more work needs to be done to improve the results of maps. Figure 15 shows an example of comparison maps between two openly accessible datasets (SoilGrids and OpenLandMap). The local data provide much better information in comparison to the global datasets. This shows the importance of using local data in prediction and mapping of the soil elements. Figure 15: Comparison between observed and global soil data for pH (adapted from (Wimalasiri et al., 2020c) 32 5.5.2. Machine learning interpretability dilemma A major issue with the use of machine learning models in soil mapping is the apparent lack of transparency, interpretability, and explainability (Rentschler, 2021; Wadoux et al., 2019). Models such as neural networks are increasingly promoted as ‘black-box’ machines. In this context, further analysis and ‘overlays’ of data such as the one presented in Figure 13 seem to be indispensable. Also, methods such as feature importance that can be done for de facto models such as random forest can play a key role in developing better models (Figure 14). 5.5.3. Automation Recent advances in big data and automation are now being built into automated systems that are suitable for re-tuning the algorithms and providing new versions of datasets as new samples are made available. Such automated systems are already envisaged for data providers such as SoilGrids.org who aim to serve the public with open-access datasets. However, such data systems can now be easily developed at the national level for the need of local users that are still ‘reluctant to share their data’. In this case, machine learning models can be advantageous since their algorithms are more streamlined compared to statistically-based models such as regression kriging where detailed variogram modelling is required. 33 6. Conclusion Soil information plays a key role in improving our understanding of environmental processes that impact our world. Major and global issues such as land degradation, climate change, and food security require detailed soil data. However, obtaining soil sample data is time and resource consuming and quite often, the available data does not meet the demand. More accurate soil data that capture the spatial variability of soil properties can aid scientists in many fields in better understanding our natural environment. It can also lead to the development of solutions for sustainable use of limited land resources for the benefit of food and nutrition security of a growing population. For example, the use of better soil data can lead to the development of better prediction models for crop yield in different soils across the country. This information would also benefit policy-makers to accurately plan for food and nutrition security by analysing the likely performance of a specific crop for current and future climatic conditions. Better soil data can also benefit other disciplines such as construction, geology, ecology, and other Earth sciences to fine-tune their models to the soil variation that exists on the ground and derive better estimates of their target variables. In this research, a combination of data augmentation and mapping mechanisms using novel approaches to data generation and analysis were used to enhance the soil maps at the national scale for the country of Sri Lanka. Data from different sources were combined to examine the main hypotheses of this research. Although observed soil sample data are always difficult to collect and analyse, they are not always shared with the public or made available to the researchers. Observed data in the literature can be a solution. However, the result of the survey of the literature showed that those data are also very limited and imbalanced in nature. Therefore, there is no choice but to use different methods for data augmentation. One way is to use satellite data and expert judgement to add more training samples (using pseudo-samples). However, the result of analysis using surface reflectance showed that although they can be a good source of information for some soil element such as (Sand and OC), their contribution is still very limited. Data generation using newly developed deep learning methods such as Generative Adversarial nets (GAN) provides a good solution for the augmentation of soil data. Specific GAN models can be trained to understand and detect spatial patterns based on some degree of spatial proximity measure such as Moran’s I indicator of spatial correlation (section 3.1.3). Different strategies were used to tackle the data generation problem using GAN models. The main issue was the lack of diversity for generated data or the problem of mode collapse. The result showed that a combination of pre-processing techniques such as outlier removal in combination with unbalanced training of generator and discriminator and over sampling the tails of distribution can be helpful. The result of data generation for a set of randomly sleeted coordinates showcasing areas that are gaps in the analysis showed that using generated data could provide a solution for the scarcity of soil sample data. The degree of data augmentation was vastly different among different soil elements; however, it is possible to fine-tune specific models for each element. The overall results were satisfactory for OC, pH, Sand, Silt, and VMC. But not satisfactory for BD, CEC, and Clay. 35 7. Future work One of the main limitations of the current work is that no distinction is made between observed data that are collected in the field and those that are added by literature survey. A more comprehensive investigation is therefore needed in the future to conclusively arrive at the notion that both addition of survey literature data and machine learning models can improve the prediction results. A major issue with artificial data augmentation is finding the right place to sample. In this research, a simple random sampling (SRC) scheme was used to generate samples using GAN architecture. However, other sampling schemes can be used for data augmentation. For example, Latin Hypercube Sampling (LHS) and newly developed Feature Space Coverage Sampling (FSCS) could provide a more representative sample for spatial modelling (Brus, 2019). Another aspect that could be improved in the future is increasing the number of generated samples. However, a blunt increase of sample number might have direct consequences in the mapping phase where spatial patterns of generated samples influence those of the real data. There should also be better strategies for tackling the problem of mode collapse. For example, many Generator nets could be trained for a smaller number of Discriminators. Also, a better data normalisation strategy (Minmax scaling or non-normal transformation for highly skewed data) could be used. It is also important to develop element-specific models as using a generic model (uniform hyperparameters) that was used for both data generation using GAN and prediction using machine learning models lead to mixed results. For example, a specific model and data pipeline could be developed for soil structural elements (Sand, Silt, Clay, and BD) while for chemical properties such as pH, EC and CEC. The results of such modelling should also be explored. Publication of data, codes, and results is also an important aspect that will be considered in this research. Previous experience with the growth of open-source software has triggered the reproducibility movement in many scientific disciplines. Publication of data in the open access repositories will not only expose the methods used in this research to the public but also will pave the way for the development of better datasets in the future (Musker et al., 2018). 8. References Abatzoglou, J.T., Dobrowski, S.Z., Parks, S.A., Hegewisch, K.C., 2018. TerraClimate, a high-resolution global dataset of monthly climate and climatic water balance from 1958–2015. Sci Data 5, 170191. https://doi.org/10.1038/sdata.2017.191 Anselin, L., 1995. Local Indicators of Spatial Association—LISA. Geographical Analysis 27, 93–115. https://doi.org/10.1111/j.1538-4632.1995.tb00338.x Ashkenazy, Y., Shilo, E., 2018. Sand Dune Albedo Feedback. Geosciences 8, 82. https://doi.org/10.3390/geosciences8030082 Bagheri bodaghabadi, M., Martínez-casasnovas, J., Salehi, M.H., Mohammadi, J., Esfandiarpoor borujeni, I., Toomanian, N., Gandomkar, A., 2015. Digital Soil Mapping Using Artificial Neural Networks and Terrain-Related Attributes. Pedosphere 25, 580–591. https://doi.org/10.1016/S1002-0160(15)30038-2 Batjes, N.H., Ribeiro, E., Oostrum, A. van, 2020. Standardised soil profile data to support global mapping and modelling (WoSIS snapshot 2019). Earth System Science Data 12, 299–320. https://doi.org/10.5194/essd-12-299-2020 Bau, D., Zhu, J.-Y., Wulff, J., Peebles, W., Strobelt, H., Zhou, B., Torralba, A., 2019. Seeing What a GAN Cannot Generate. Presented at the Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4502–4511. Blanch, M.G., Mrak, M., Smeaton, A.F., O’Connor, N.E., 2019. End-to-End Conditional GAN-based Architectures for Image Colourisation, in: 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP). Presented at the 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), pp. 1–6. https://doi.org/10.1109/MMSP.2019.8901712 Bouasria, A., Ibno Namr, K., Rahimi, A., Ettachfini, E.M., Rerhou, B., 2022. Evaluation of Landsat 8 image pansharpening in estimating soil organic matter using multiple linear regression and artificial neural networks. Geo-spatial Information Science 0, 1–12. https://doi.org/10.1080/10095020.2022.2026743 Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324 Brus, D.J., 2019. Sampling for digital soil mapping: A tutorial supported by R scripts. Geoderma 338, 464–480. https://doi.org/10.1016/j.geoderma.2018.07.036 Buchhorn, M., Lesiv, M., Tsendbazar, N.-E., Herold, M., Bertels, L., Smets, B., 2020. Copernicus Global Land Cover Layers—Collection 2. Remote Sensing 12, 1044. https://doi.org/10.3390/rs12061044 Chen, T., Guestrin, C., 2016. XGBoost: A Scalable Tree Boosting System, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16. Association for Computing Machinery, New York, NY, USA, pp. 785–794. https://doi.org/10.1145/2939672.2939785 Chesworth, W., 2007. Encyclopedia of Soil Science. Springer Science & Business Media. Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., Bharath, A.A., 2018. Generative Adversarial Networks: An Overview. IEEE Signal Processing Magazine 35, 53–65. https://doi.org/10.1109/MSP.2017.2765202 Dewitte, O., Jones, A., Elbelrhiti, H., Horion, S., Montanarella, L., 2012. Satellite remote sensing for soil mapping in Africa: An overview. Progress in Physical Geography: Earth and Environment 36, 514–538. https://doi.org/10.1177/0309133312446981 Dobos, E., Hengl, T., 2009. Chapter 20 Soil Mapping Applications, in: Hengl, Tomislav, Reuter, H.I. (Eds.), Developments in Soil Science, Geomorphometry. Elsevier, pp. 461–479. https://doi.org/10.1016/S0166-2481(08)00020-2 Fan, C., Chen, M., Wang, X., Wang, J., Huang, B., 2021. A Review on Data Preprocessing Techniques Toward Efficient and Reliable Knowledge Discovery From Building Operational Data. Frontiers in Energy Research 9. https://doi.org/10.3389/fenrg.2021.652801 Farr, T.G., Rosen, P.A., Caro, E., Crippen, R., Duren, R., Hensley, S., Kobrick, M., Paller, M., Rodriguez, E., Roth, L., Seal, D., Shaffer, S., Shimada, J., Umland, J., Werner, M., Oskin, M., Burbank, D., Alsdorf, D., 2007. The Shuttle Radar Topography Mission. Reviews of Geophysics 45. https://doi.org/10.1029/2005RG000183 Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D., 2014. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15, 3133– 3181. Goovaerts, P., Goovaerts, D. of C. and E.E.P., 1997. Geostatistics for Natural Resources Evaluation. Oxford University Press. Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., Moore, R., 2017. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment 202, 18–27. https://doi.org/10.1016/j.rse.2017.06.031 Hengl, T., Jesus, J.M. de, MacMillan, R.A., Batjes, N.H., Heuvelink, G.B.M., Ribeiro, E., Samuel-Rosa, A., Kempen, B., Leenaars, J.G.B., Walsh, M.G., Gonzalez, M.R., 2014. SoilGrids1km — Global Soil Information Based on Automated Mapping. PLOS ONE 9, e105992. https://doi.org/10.1371/journal.pone.0105992 Hengl, T., Mendes de Jesus, J., Heuvelink, G.B.M., Ruiperez Gonzalez, M., Kilibarda, M., Blagotić, A., Shangguan, W., Wright, M.N., Geng, X., Bauer-Marschallinger, B., Guevara, M.A., Vargas, R., MacMillan, R.A., Batjes, N.H., Leenaars, J.G.B., Ribeiro, E., Wheeler, I., Mantel, S., Kempen, B., 2017a. SoilGrids250m: Global gridded soil information based on machine learning. PLOS ONE 12, e0169748. https://doi.org/10.1371/journal.pone.0169748 Hengl, T., Mendes de Jesus, J., Heuvelink, G.B.M., Ruiperez Gonzalez, M., Kilibarda, M., Blagotić, A., Shangguan, W., Wright, M.N., Geng, X., Bauer-Marschallinger, B., Guevara, M.A., Vargas, R., MacMillan, R.A., Batjes, N.H., Leenaars, J.G.B., Ribeiro, E., Wheeler, I., Mantel, S., Kempen, B., 2017b. SoilGrids250m: Global gridded soil information based on machine learning. PLOS ONE 12, e0169748. https://doi.org/10.1371/journal.pone.0169748 Heuvelink, G.B.M., Webster, R., 2022. Spatial statistics and soil mapping: A blossoming partnership under pressure. Spatial Statistics 50. https://doi.org/10.1016/j.spasta.2022.100639 Jahanshiri, E., bin Mohamed Shariff, A.R., Amiri, F., Soom, M.A.M., Wayayokb, A., Buyonga, T., Pradhan, B., 2015. Spatial soil analysis using geostatistical analysis and map Algebra. Arabian Journal of Geosciences 8, 9775–9788. Karlsson, C.S.J., Jamali, I.A., Earon, R., Olofsson, B., Mörtberg, U., 2014. Comparison of methods for predicting regolith thickness in previously glaciated terrain, Stockholm, Sweden. Geoderma 226–227, 116–129. https://doi.org/10.1016/j.geoderma.2014.03.003 Kasraei, B., Heung, B., Saurette, D.D., Schmidt, M.G., Bulmer, C.E., Bethel, W., 2021. Quantile regression as a generic approach for estimating uncertainty of digital soil maps produced from machine-learning. Environmental Modelling & Software 144, 105139. https://doi.org/10.1016/j.envsoft.2021.105139 Klemmer, K., 2022. How to train a SpaceGAN for soil data augmentation. 38 Klemmer, K., 2021. SpaceGAN - A generative adverserial net for geospatial point data. Klemmer, K., Koshiyama, A., Flennerhag, S., 2019. Augmenting correlation structures in spatial data using deep generative models. arXiv:1905.09796 [cs, stat]. Kuriakose, S.L., Devkota, S., Rossiter, D.G., Jetten, V.G., 2009. Prediction of soil depth using environmental variables in an anthropogenic landscape, a case study in the Western Ghats of Kerala, India. CATENA 79, 27–38. https://doi.org/10.1016/j.catena.2009.05.005 Lalitha, M., Dharumarajan, S., Suputhra, A., Kalaiselvi, B., Hegde, R., Reddy, RS., Prasad, CR.S., Harindranath, CS., Dwivedi, BS., 2021. Spatial prediction of soil depth using environmental covariates by quantile regression forest model. Environ Monit Assess 193, 660. https://doi.org/10.1007/s10661-021-09348-9 Legros, J.-P., 2006. Mapping of the Soil. Science Publishers. Li, J., Chen, Z., Zhao, X., Shao, L., 2020. MapGAN: An Intelligent Generation Model for Network Tile Maps. Sensors (Basel) 20, 3119. https://doi.org/10.3390/s20113119 Li, Y., Wei, C., Ma, T., 2019. Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks, in: Advances in Neural Information Processing Systems. Curran Associates, Inc. M Suhari, T.A.S.T., Wimalasiri, E.M., Jahanshiri, E., 2022. Paving the way for more accurate earth system modelling in Malaysia. IOP Conf. Ser.: Earth Environ. Sci. 1016, 012031. https://doi.org/10.1088/1755-1315/1016/1/012031 MacMillan, T.H. and R.A., 2019. Predictive Soil Mapping with R. Mapa, R., 2020. SRICANSOL Project – Phase III- Development of a land resource information system for land evaluation of Sri Lanka. https://doi.org/10.13140/RG.2.2.18230.32322 Meyer, H., 2016. Machine-learning based modelling of spatio-temporal environmental data (using R) 56. Minasny, B., McBratney, Alex.B., 1999. A rudimentary mechanistic model for soil production and landscape development. Geoderma 90, 3–21. https://doi.org/10.1016/S0016-7061(98)00115-3 Moghaddam, D.D., Rahmati, O., Panahi, M., Tiefenbacher, J., Darabi, H., Haghizadeh, A., Haghighi, A.T., Nalivan, O.A., Tien Bui, D., 2020. The effect of sample size on different machine learning models for groundwater potential mapping in mountain bedrock aquifers. CATENA 187, 104421. https://doi.org/10.1016/j.catena.2019.104421 Musker, R., Tumeo, J., Schaap, B., Parr, M., 2018. GODAN’s Impact 2014-2018 – Improving Agriculture, Food and Nutrition with Open Data. F1000Research 7, 1328. https://doi.org/10.7490/f1000research.1115970.1 OpenGeoHub, 2022. Spatial sampling and resampling for Predictive Mapping with Machine Learning: A tutorial in R. MLearning.ai. URL https://medium.com/mlearning- ai/spatial-sampling-and-resampling-for-predictive-mapping-with-machine-learning-a- tutorial-in-r-99f71555bc43 (accessed 5.30.22). Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830. Pei, S., Da Xu, R.Y., Xiang, S., Meng, G., 2021. Alleviating Mode Collapse in GAN via Diversity Penalty Module. QGIS.org, 2022. QGIS Geographic Information System. QGIS Association. R Core Team., 2020. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 39 Rentschler, T., 2021. Explainable machine learning in soil mapping: Peeking into the black box (Dissertation). Universität Tübingen. https://doi.org/10.15496/publikation-57960 Rey, S.J., Anselin, L., 2007. PySAL: A Python Library of Spatial Analytical Methods. The Review of Regional Studies 37, 5–27. Richter, D.D., Markewitz, D., 1995. How Deep Is Soil?: Soil, the zone of the earth’s crust that is biologically active, is much deeper than has been thought by many ecologists. BioScience 45, 600–609. https://doi.org/10.2307/1312764 Schaaf, C., 2015. MODIS BRDF/Albedo/NBAR Product MCD43 [WWW Document]. URL https://www.umb.edu/spectralmass/terra_aqua_modis/v006/introduction (accessed 4.23.22). Shangguan, W., Hengl, T., Jesus, J.M. de, Yuan, H., Dai, Y., 2017a. Mapping the global depth to bedrock for land surface modeling. Journal of Advances in Modeling Earth Systems 9, 65–88. https://doi.org/10.1002/2016MS000686 Shangguan, W., Hengl, T., Jesus, J.M. de, Yuan, H., Dai, Y., 2017b. Mapping the global depth to bedrock for land surface modeling. Journal of Advances in Modeling Earth Systems 9, 65–88. https://doi.org/10.1002/2016MS000686 Shangguan, W., Hengl, T., Mendes de Jesus, J., Yuan, H., Dai, Y., 2017c. Mapping the global depth to bedrock for land surface modeling. Journal of Advances in Modeling Earth Systems 9, 65–88. https://doi.org/10.1002/2016MS000686 Spick, R.J., Cowling, P., Walker, J.A., 2019. Procedural Generation using Spatial GANs for Region-Specific Learning of Elevation Data, in: 2019 IEEE Conference on Games (CoG). Presented at the 2019 IEEE Conference on Games (CoG), IEEE, London, United Kingdom, pp. 1–8. https://doi.org/10.1109/CIG.2019.8848120 Stoorvogel, J.J., Bakkenes, M., Temme, A.J.A.M., Batjes, N.H., ten Brink, B.J.E., 2017. S- World: A Global Soil Map for Environmental Modelling. Land Degradation & Development 28, 22–33. https://doi.org/10.1002/ldr.2656 Sturges, H.A., 1926. The Choice of a Class Interval. Journal of the American Statistical Association 21, 65–66. https://doi.org/10.1080/01621459.1926.10502161 Sulaeman, Y., Minasny, B., McBratney, A.B., Sarwani, M., Sutandi, A., 2013. Harmonizing legacy soil data for digital soil mapping in Indonesia. Geoderma 192, 77–85. https://doi.org/10.1016/j.geoderma.2012.08.005 Sun, T., Wang, Y., Hui, D., Jing, X., Feng, W., 2020. Vertical distributions of soil microbial biomass carbon: a global dataset. Data in Brief 32, 106147. https://doi.org/10.1016/j.dib.2020.106147 Szabó, B., Szatmári, G., Takács, K., Laborczi, A., Makó, A., Rajkai, K., Pásztor, L., 2019. Mapping soil hydraulic properties using random-forest-based pedotransfer functions and geostatistics. Hydrology and Earth System Sciences 23, 2615–2635. https://doi.org/10.5194/hess-23-2615-2019 Tetzlaff, G., 1983. Albedo of the Sahara, Satellite Meas. of Radiation Budget Parameters. Thathsarani, S.P.K., 2022. STRENGTHENING THE NATION-WIDE OPEN 3D SOIL DATABASE OF SRI LANKA USING DATA MINING. Sabaragamuwa University of Sri Lanka, Belihuloya, Sri Lanka. U.S. Geological Survey, 2022. Landsat Surface Reflectance Products [WWW Document]. URL https://www.usgs.gov/landsat-missions/landsat-surface-reflectance (accessed 5.23.22). UTM zone 44N - EPSG:32644 [WWW Document], 2022. URL http://epsg.io (accessed 5.23.22). Van Rossum, G., Drake Jr, F.L., 1995. Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam. 40 Wadoux, A.M.J.-C., Padarian, J., Minasny, B., 2019. Multi-source data integration for soil mapping using deep learning. SOIL 5, 107–119. https://doi.org/10.5194/soil-5-107- 2019 Waters, N., 2018. Tobler’s First Law of Geography, in: International Encyclopedia of Geography. John Wiley & Sons, Ltd, pp. 1–15. https://doi.org/10.1002/9781118786352.wbieg1011.pub2 Webster, R., Oliver, M.A., 2007. Geostatistics for environmental scientists. John Wiley & Sons. Webster, R., Oliver, M.A., 2001. Geostatistics for environmental scientists (Statistics in Practice). WGS 84 - EPSG:4326 [WWW Document], 2022. URL https://epsg.io/4326 (accessed 5.23.22). Wimalasiri, E.M., Jahanshiri, E., Suhairi, T.A.S.T.M., Mapa, R.B., 2020a. Nation-wide open soil property maps of up to 100 cm depth for Sri Lanka 1. https://doi.org/10.17632/5sc7njfcyn.1 Wimalasiri, E.M., Jahanshiri, E., Suhairi, T.A.S.T.M., Mapa, R.B., Karunaratne, A.S., Vidhanarachchi, L.P., Udayangani, H., Nizar, N.M.M., Azam-Ali, S.N., 2020b. The first version of nation-wide open 3D soil database for Sri Lanka. Data Brief 33, 106342. https://doi.org/10.1016/j.dib.2020.106342 Wimalasiri, E.M., Jahanshiri, E., Suhairi, T.A.S.T.M., Udayangani, H., Mapa, R.B., Karunaratne, A.S., Vidhanarachchi, L.P., Azam-Ali, S.N., 2020c. Basic Soil Data Requirements for Process-Based Crop Models as a Basis for Crop Diversification. Sustainability 12, 7781. https://doi.org/10.3390/su12187781 Wong, I.F.T., 1974. Soil-crop suitability classification for Peninsular Malaysia. Division of Agriculture, Ministry of Agriculture and Fisheries. 41 Appendix A 1: Characteristics of successful models for all elements (OT training strategy) ELEMENT MODEL % DATA COD AUGMENTATION BD RF 20 0.06 CEC RF 70 0.26 CLAY MLP 55 0.25 EC RF 100 0.01 OC-WP RF 80 -0.69 OC-NP - - PH RF 5 0.56 SAND-WP RF 5 0.37 SAND-NP - - SILT MLP 60 0.23 VMC0.33 RF 25 0.27 VMC15 RF 20 0.20 A 2: Comparison of model accuracy with previous study (OT strategy) ELEMENT CVSCOR ML RMSE PREVIOUS METHOD RMSE BD 0.06 0.218 0.184 EMK CEC -32.11 47.529 14.949 EMK CLAY -0.24 11.24 10.838 RBF EC -118.29 1895.97 - OC-WP 0.05 1.105 2.649 EMK OC-NP - - PH 0.47 0.74 0.770 EMK SAND-WP 0.15 14.85 15.970 RBF SAND-NP - - SILT -0.46 5.30 8.159 RBF VMC0.33 0.16 0.066 0.090 EMK VMC15 0.31 0.060 0.082 EMK