A critical issue in digital soil mapping (DSM) is the selection of data sampling method for model training. One emerging approach applies instance selection to reduce the size of the dataset by drawing only relevant samples in order to obtain a representative subset that is still large enough to preserve relevant information, but small enough to be easily handled by learning algorithms. Although there are suggestions to distribute data sampling as a function of the soil map unit (MU) boundaries location, there are still contradictions among research recommendations for locating samples either closer or more distant from soil MU boundaries. A study was conducted to evaluate instance selection methods based on spatially-explicit data collection using location in relation to soil MU boundaries as the main criterion. Decision tree analysis was performed for modeling digital soil class mapping using two different sampling schemes: a) selecting sampling points located outside buffers near soil MU boundaries, and b) selecting sampling points located within buffers near soil MU boundaries. Data was prepared for generating classification trees to include only data points located within or outside buffers with widths of 60, 120, 240, 360, 480, and 600m near MU boundaries. Instance selection methods using both spatial selection of methods was effective for reduced size of the dataset used for calibrating classification tree models, but failed to provide advantages to digital soil mapping because of potential reduction in the accuracy of classification tree models.
soil survey; data sampling; model training