1. Introduction
Indonesia covers an area of approximately 1.89 million km² and holds abundant natural resources. This potential is both an advantage and a challenge for Indonesia. The natural resources have contributed substantially to Indonesia’s development; however, managing them is a major challenge because of the country’s vast extent and because it consists of almost seventeen thousand islands connected by territorial waters.
To manage such an expansive country, reliable and accurate data are required, and spatial data are one essential type. Spatial development planning has become a prevalent topic in Indonesia following the government’s implementation of new strategies in development planning, including spatial-based development planning. For these purposes, large-scale topographic maps are needed.
Large-scale topographic maps provide critical information for municipal and regional planning and management [
1]. In the rapidly evolving landscape of urban development, the use of large-scale topographic maps has become increasingly crucial as a foundational element for spatial-based applications in urban planning. These high-resolution maps provide a detailed and comprehensive representation of the physical landscape and serve as an invaluable tool for a wide range of urban planning and management tasks [
2]. Large-scale topographic maps of urban areas have the potential to provide valuable data for various applications. Urban planners and engineers can use these data for analytical and management purposes. Topographic maps are becoming more available to mapping agencies and municipal governments in developing countries, opening new opportunities to increase data supply for urban management applications [
3]. A topographic map is also needed as a base map, serving as a reference for the compilation of thematic maps in various sectors. A common base map prevents overlaps and conflicts between sectors caused by the use of maps with different references. This follows the ‘One Map Policy’ regulated in Presidential Regulation No. 9 of 2016 on the Acceleration of the Implementation of the One Map Policy. The policy has been implemented for medium-scale maps (1:50,000 and 1:25,000), and the government now intends to extend it to large-scale maps (1:5000). This means that topographic maps at a scale of 1:5000 are needed; however, they are currently available for only roughly 3% of Indonesia’s total area, as seen in
Figure 1.
Large-scale (1:5000) topographic map production was started by the Geospatial Information Agency, the national mapping authority of Indonesia, in 2013, using acquisition technologies suited to large-scale mapping: aerial photography (2013–2016), a combination of aerial photography and Lidar (2016–2020), and orthorectified high-resolution satellite imagery (2016–2020). Since 2013, the Geospatial Information Agency has made a consistent effort to expand its large-scale topographic map (1:5000) production capacity, which was only approximately 562.39 km² in 2013. The years 2019 and 2020 saw a very notable increase in this capacity, with approximately 12,138.43 km² and 13,118.87 km² mapped, respectively. Nonetheless, at this rate it would still take more than a century to map the entire country. Even if the forest area, which covers about 49.7% of the territory, is excluded, completion would still take more than 60 years (Figure 2).
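As a back-of-envelope check of these completion-time estimates, the figures quoted above (approximate national area, the 2020 annual mapping capacity, and the forest share) can be combined directly; the variable names below are ours:

```python
# Back-of-envelope check of the completion-time estimates cited above.
total_area_km2 = 1_890_000        # approximate area of Indonesia
annual_capacity_km2 = 13_118.87   # 2020 production capacity
forest_fraction = 0.497           # quoted forest share of the territory

# Years to map everything at the 2020 rate:
years_full = total_area_km2 / annual_capacity_km2

# Years to map the non-forest area only:
years_excluding_forest = (total_area_km2 * (1 - forest_fraction)
                          / annual_capacity_km2)
```

The first ratio exceeds 100 years and the second exceeds 60 years, consistent with the durations stated in the text.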
Based on Indonesian Law No. 4 of 2011 on geospatial information, Article 7, Indonesian topographic maps consist of eight themes: hypsography, land cover, buildings and public facilities, transportation and utilities, administrative boundaries, hydrography, coastlines, and toponyms. Because the topographic map production process is quite complicated, it is necessary to conduct research on its various aspects before comprehensive implementation. This research focuses on the production of land cover data.
Land cover refers to the observable bio-physical layer on the Earth’s surface. Traditional methods to obtain land cover data have often relied on hand-crafted features and rule-based approaches, which may be labor-intensive and constrained in their capacity to discern the intricate patterns inherent in remote sensing data [
4]. Recently, deep learning approaches have arisen as a formidable instrument for land cover classification, leveraging the ability of deep neural networks to learn hierarchical representations from data automatically [
5]. Deep learning models have exhibited enhanced efficacy relative to conventional methods, particularly in the face of challenges such as limited resolution, high dimensionality, and noise acquisition [
6]. A primary advantage of deep learning in land cover classification is its proficiency in managing complicated, high-dimensional remote sensing data. Deep convolutional neural networks (CNNs), in particular, have shown great promise in this domain, as they can capture spatial and spectral features from satellite imagery. Additionally, deep learning models can be trained in an end-to-end fashion, reducing the need for manual feature engineering and enabling more scalable and automated solutions [
7].
A multitude of studies have investigated the application of deep learning to land cover classification using a variety of remote sensing data sources, including multispectral, hyperspectral, and high-resolution satellite imagery. These studies have shown that deep learning approaches can effectively capture the spatial and spectral information present in remote sensing data, leading to improved classification accuracy and the ability to handle complex land cover patterns. For example, a study proposed a deep learning architecture called MRFusion that combines panchromatic (PAN) and multispectral (MS) satellite imagery to improve land cover mapping. The authors demonstrated that their approach outperformed traditional fusion methods and standalone classification models, highlighting the potential of deep learning for multi-source data integration in remote sensing [
8]. Another study reviewed the state-of-the-art of deep learning methodologies for land cover classification, discussing how deep learning techniques have been adapted to tackle challenges such as limited resolution, high dimensionality, and noise acquisition. The authors concluded that deep learning offers significant advantages over traditional methods, particularly in terms of its ability to automatically learn relevant features from the data [
9].
One key advantage of deep learning for land cover classification is the ability to automatically learn feature representations from the data, rather than relying on hand-crafted features [
5]. This allows deep learning models to adapt to the specific characteristics of the remote sensing data and the land cover classes of interest [
10]. Moreover, the hierarchical nature of deep neural networks enables the capture of multi-scale spatial and spectral features, which can be crucial for accurate land cover mapping [
11]. Another benefit of deep learning is its potential to address challenges such as limited training data, missing or noisy data, and class imbalance. Recent research has explored techniques like transfer learning, data augmentation, and multitask learning to improve the performance of deep learning models in these scenarios. Despite the promising results, the adoption of deep learning for land cover classification in operational settings is still limited. Further research is needed to address challenges such as the interpretability of deep learning models, the availability of labeled training data, and the computational requirements of these approaches [
12].
The objective of this research is to improve the extraction of land cover from very-high-resolution remote sensing data using a deep learning approach. To accomplish this, several image processing procedures are executed to derive a suitable band combination as input to the deep learning model. In addition, a post-classification regularization procedure is carried out to improve the quality of the result.
2. Related Work
The rapid advancements in remote sensing and satellite technology have significantly increased the availability of high-quality, very-high-resolution satellite imagery, offering invaluable insights into land cover characteristics. However, processing and extracting meaningful information from the sheer volume and complexity of these data remain challenging. Deep learning techniques have emerged as a promising solution to these challenges, demonstrating remarkable capabilities in tasks such as image classification and object detection [
10,
13].
A key challenge in land cover mapping from satellite imagery is the lack of consistent global coverage and the complexities involved in preparing seamless image mosaics [
14]. Advances in remote sensing and information technology have facilitated high-resolution, operational land cover mapping and monitoring on a global scale [
14]. Deep learning-based methods have shown great promise in this domain, offering improvements over traditional approaches in terms of accuracy and efficiency [
10,
13].
Recent studies have investigated the application of convolutional neural networks (CNNs) for land cover mapping, showcasing their ability to classify land cover types accurately from satellite imagery. These studies underline the potential of deep learning to automate and enhance the land cover mapping process, addressing challenges such as large data volumes and the need for timely and accurate training and validation datasets. The integration of remote sensing and deep learning is poised to drive the next generation of global land cover characterization, mapping, and monitoring. Deep learning-based methods demonstrate potential for improved generalization and robustness in land cover classification, addressing the diversity of remote sensing data and the requirement for comprehensive training datasets [
10,
13,
14,
15].
The proposed approach builds upon recent advancements in deep learning for remote sensing applications, particularly in the use of the U-Net architecture for semantic segmentation tasks. U-Net has proven highly effective in extracting land cover features from satellite imagery, and its modifications have been widely applied in building segmentation and land cover mapping. These studies provide a strong foundation for the current investigation, demonstrating the versatility and accuracy of U-Net in capturing spatial and contextual information essential for land cover classification.
Furthermore, recent research highlights the potential benefits of integrating multitask learning into the segmentation process. Multitask learning leverages the unique characteristics of specific features, such as the shape and boundary information of buildings, to improve the accuracy and robustness of land cover extraction. The integration of this approach has shown promise in enhancing the precision of building detection and segmentation in remote sensing applications [
16]. These findings underscore the relevance of combining advanced neural architectures such as U-Net with multitask learning to effectively address the challenges of land cover classification.
Despite substantial advances in deep learning and its application to remote sensing, numerous crucial gaps in the literature still exist. One notable disadvantage is the inadequate integration of multispectral indices and supplementary information, which have the potential to considerably improve land cover classification accuracy. Multispectral indices, such as the normalized difference vegetation index (NDVI), can provide additional spectral information that helps discriminate between comparable land cover types. However, many existing research studies depend solely on raw spectral bands, limiting their ability to fully leverage these additional indices.
Another notable gap is the insufficient emphasis on using deep learning-based technologies to support the production of large-scale topographic maps. While deep learning has been widely utilized for land cover mapping and feature extraction, direct application to support the production of large-scale topographic maps is still in its infancy. Topographic mapping involves the accurate representation of both natural and man-made features, requiring high precision and comprehensive integration of diverse data sources. Current studies often focus on specific features, such as vegetation or urban areas, without addressing the broader scope of topographic map production.
Our research intends to close these gaps by introducing multispectral indices and supplementary datasets (e.g., pan-sharpening, texture information, PCA) into the deep learning framework, hence improving the model’s ability to handle various and complicated land cover types. While these techniques have been reported individually, our approach combines them in a systematic way, optimized for very-high-resolution imagery in extensive regions. Furthermore, we extend the application of deep learning to support the automated production of large-scale topographic maps, leveraging advanced architectures like U-Net. By doing so, we aim to contribute to the development of more robust, scalable, and versatile approaches for land cover classification and topographic mapping, bridging the gap between research and practical applications.
3. Study Area, Materials, and Methods
3.1. Study Area
Mataram is situated on Lombok Island in Indonesia. The city serves as the administrative center of Nusatenggara Barat Province. The urban landscape of the city is predominantly characterized by developed regions, agricultural land, and plantations. This research is conducted in the city depicted in
Figure 3.
3.2. Materials
This study utilizes very-high-resolution satellite imagery to obtain land cover data. The imagery consists of Pleiades data, which include a panchromatic band with a spatial resolution of 0.5 m and multispectral bands (green, blue, red, and near-infrared) with a spatial resolution of 2 m. The acquisition of Pleiades image data took place in 2015.
To assess the accuracy of the land cover classification results, it is necessary to have reference data. This study utilizes land cover extracted through a manual interpretation of the Pleiades remote sensing data as reference data. Indonesian topographic maps were used as references in the manual interpretation of the reference data. The topographic maps were generated in 2017 using the photogrammetry technique.
3.3. Methods
In this research, our aim was to leverage advanced image processing techniques and deep learning models to achieve accurate land cover classification using high-resolution Pleiades imagery to support the production of large-scale topographic maps. The research methodology was organized into a systematic sequence of steps to ensure precision and clarity. The general flow of the research is represented in a flowchart in
Figure 4 that provides a visual summary of the process.
The workflow begins with data preparation, where the raw satellite imagery is preprocessed and enhanced through pan-sharpening to improve spatial resolution while preserving spectral fidelity. Subsequently, advanced techniques such as NDVI calculation, texture analysis using Gray-Level Co-occurrence Matrix (GLCM), and principal component analysis (PCA) are applied to extract meaningful features from the imagery.
The next phase employs a deep learning approach using the U-Net architecture with a ResNet34 backbone, which enables precise pixel-wise land cover classification. The classified output undergoes post-processing through morphological image processing techniques to refine the results by removing anomalies and enhancing feature clarity. Subsequently, the performance of the classification is evaluated by comparing the pre- and post-morphological processing accuracy. The final stage involves an analysis of capacity against established topographic mapping standards.
This study introduces a novel approach to deep learning-based land cover classification using very-high-resolution Pleiades satellite imagery. It combines established techniques like NDVI and texture analysis with a U-Net architecture and post-classification enhancements within a unified pipeline designed for efficient land cover extraction. The research emphasizes practical application and scalability, addressing challenges often overlooked in previous studies. This methodology offers significant advancements for accelerating large-scale topographic map production, particularly in regions like Indonesia where timely and accurate geospatial data are crucial for various applications, including urban planning, environmental management, disaster response, and infrastructure development. Furthermore, this approach promotes wider adoption of automated and scalable geospatial solutions in similar settings.
3.3.1. Data Preparation
The Pleiades image comprises one panchromatic band with a resolution of 0.5 m and four multispectral bands with a resolution of 2 m each. The preparation stage involves performing the pan-sharpening procedure on the Pleiades imagery to obtain 0.5 m resolution multispectral data, as shown in
Figure 5. Pan-sharpening is a technique in digital image processing that has gained considerable interest in the field of remote sensing and geospatial analysis, as it provides a means to enhance the spatial resolution of multispectral satellite or aerial photos without compromising the spectral information. This process involves the fusion of a high-resolution panchromatic (Pan) image with a lower-resolution multispectral (MS) image, resulting in a composite image with both high spatial and spectral resolution [
17].
Pan-sharpening offers several key benefits that make it an essential tool across a wide range of remote sensing applications. One of its main benefits is improved geometric correction and feature enhancement of multispectral images. By enhancing the spatial resolution, pan-sharpening can reveal intricate details and characteristics that are imperceptible in the original multispectral data, enabling more precise geometric correction and improved identification of specific objects or land cover types [18]. It can also bring out features that are not easily discernible at the lower multispectral resolution, such as roads, buildings, and other man-made structures. Another significant advantage of pan-sharpening is that it retains the spectral information of the multispectral data while enhancing the spatial resolution [19]. In contrast to basic resampling strategies, pan-sharpening methods, such as the principal component analysis (PCA), hue–intensity–saturation (HIS), Brovey, and wavelet techniques, aim to preserve the radiometric accuracy and spectral fidelity of the original multispectral data [20]. This is crucial for applications that require accurate spectral information, such as land cover classification, mineral mapping, and vegetation analysis.
This study employed high-quality Bayesian pan-sharpening [20]. In this method, a least squares approach approximates the relationship between the gray values of the original multispectral, panchromatic, and fused images to achieve the most accurate color representation, and statistical methodologies provide a uniform, automated fusion procedure. This methodology addresses the two primary issues encountered in image fusion, namely color distortion and operator/data dependence, which occurred in previous methods [20].
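The study itself uses Bayesian pan-sharpening [20]. As a simpler, illustrative stand-in for the general ratio-based fusion idea, a minimal Brovey-transform sketch is shown below; the function name and toy data are our assumptions, not the study's implementation:

```python
import numpy as np

def brovey_pansharpen(ms, pan, eps=1e-6):
    """Brovey-transform sketch: rescale each multispectral band by the
    ratio of the panchromatic intensity to the mean band intensity.

    ms  : (H, W, B) multispectral bands resampled to the pan grid
    pan : (H, W) panchromatic band
    """
    intensity = ms.mean(axis=2)
    return ms * (pan / (intensity + eps))[..., None]

# Toy 4 x 4 scene with four bands in reflectance-like units.
rng = np.random.default_rng(0)
ms = rng.uniform(0.1, 1.0, size=(4, 4, 4))
pan = rng.uniform(0.1, 1.0, size=(4, 4))
fused = brovey_pansharpen(ms, pan)
```

By construction, the per-pixel mean of the fused bands closely matches the panchromatic band, which is how the spatial detail is injected; unlike the Bayesian method, this simple ratio approach does not explicitly control color distortion.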
3.3.2. NDVI Calculation
The normalized difference vegetation index (NDVI) is a commonly utilized remote sensing index that has become an invaluable tool in the field of vegetation analysis and monitoring. The NDVI assesses vegetation by evaluating the distinction between the red band, absorbed by plants, and the near-infrared (NIR) band, which is strongly reflected by vegetation [
21].
This study utilized the NDVI band to augment the model’s sensitivity to vegetation-covered areas. The NDVI is computed as NDVI = (NIR − Red) / (NIR + Red).
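The per-pixel NDVI computation can be sketched in a few lines of NumPy (the small epsilon guarding against division by zero is our addition):

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """NDVI = (NIR - Red) / (NIR + Red), computed per pixel."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)

# Vegetation reflects strongly in NIR and absorbs red, so NDVI is high;
# water absorbs NIR, so NDVI is negative.
veg = ndvi([0.50], [0.05])
water = ndvi([0.02], [0.05])
```

High positive values indicate dense vegetation, values near zero indicate bare surfaces, and negative values typically indicate water.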
3.3.3. Texture Analysis
Texture analysis is an essential component of image processing and computer vision, as it plays an important part in various applications such as image segmentation, object recognition, remote sensing, and medical imaging. Texture refers to the arrangement of pixel intensities or the differences in pixel values inside a certain area. The variation in tone and spatial relationships among adjacent pixels can be utilized to distinguish between various regions or objects within an image [
22].
There are two basic approaches to texture analysis: the structured approach and the statistical approach. The structured approach to texture analysis is centered around the identification and modeling of the fundamental patterns or recurring structures present in the image [
23]. This approach commonly employs methodologies such as Fourier analysis, wavelet transform, and Gabor filters to extract distinctive features that describe the periodic or predictable qualities of the texture. Conversely, the statistical method for texture analysis is based on the statistical distribution of pixel intensities in the image. This approach entails the computation of diverse statistical parameters, including mean, variance, skewness, and kurtosis, to depict the texture attributes [
24].
The Gray-Level Co-occurrence Matrix (GLCM) is a commonly employed method for texture analysis [
25,
26,
27]. The GLCM is a statistical technique that examines the spatial correlations among pixels inside an image. It describes the process of measuring the arrangement of pixel values that occur together at a specific distance or spatial connection [
26,
28]. The GLCM quantifies how often a pixel with one gray level occurs in relation to a pixel with another gray level under a given spatial relationship, providing information about the frequency of these co-occurrences.
The GLCM is a robust tool in remote sensing and land cover mapping, as it can accurately differentiate between land cover categories by analyzing their distinct textural characteristics. The typical workflow entails converting the image to grayscale, quantizing the pixel values into bins, and computing the GLCM for a particular window size and spatial relationship. The resulting matrix can then be used to produce statistical measures that describe local texture characteristics, which in turn can be employed in land cover classification algorithms [
29].
The GLCM can be used to derive various statistical features that describe the texture of the image, such as homogeneity, dissimilarity, contrast, etc. [
27,
30]. This study uses five statistical texture features: homogeneity, dissimilarity, contrast, entropy, and the Angular Second Moment (ASM). Homogeneity measures the uniformity of the image, dissimilarity measures the variation in the image, contrast measures local intensity variations, entropy measures the randomness of the image, and ASM measures the image’s regularity.
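As a hedged illustration (not the study's exact implementation), the GLCM for a single pixel offset and the five statistics above can be computed directly in NumPy:

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8, symmetric=True):
    """Gray-Level Co-occurrence Matrix for a single pixel offset (dx, dy).

    img is an integer image with values in [0, levels).
    Returns the normalized co-occurrence probabilities p(i, j).
    """
    h, w = img.shape
    m = np.zeros((levels, levels), dtype=np.float64)
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            m[img[y, x], img[y + dy, x + dx]] += 1
    if symmetric:
        m = m + m.T
    return m / m.sum()

def texture_features(p):
    """The five GLCM statistics used in this study."""
    i, j = np.indices(p.shape)
    d = np.abs(i - j).astype(np.float64)
    nz = p[p > 0]
    return {
        "homogeneity": float(np.sum(p / (1.0 + d ** 2))),
        "dissimilarity": float(np.sum(p * d)),
        "contrast": float(np.sum(p * d ** 2)),
        "entropy": float(-np.sum(nz * np.log(nz))),
        "ASM": float(np.sum(p ** 2)),
    }

# A perfectly uniform patch has maximal homogeneity/ASM and zero
# dissimilarity, contrast, and entropy.
flat = np.zeros((8, 8), dtype=int)
feats = texture_features(glcm(flat))
```

In practice, libraries such as scikit-image provide optimized GLCM routines; the explicit loop above is kept only to make the co-occurrence counting visible.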
An essential factor in computing the GLCM is the selection of the matrix dimension, as this choice can have a substantial influence on the retrieved texture characteristics [
30]. To address this issue, semi-variance analysis, a geostatistical method, has been suggested for identifying the most suitable GLCM matrix dimension. Semi-variance analysis investigates the spatial autocorrelation of pixel intensities, offering insight into the appropriate scale and resolution for texture analysis. By examining the semi-variance of the pixel intensities, one may determine the GLCM matrix size that accurately captures the essential texture features. This enables a more informed choice of the GLCM matrix dimension, resulting in enhanced texture-based classification and analysis [
27]. Consider two pixel values, Z(xi) and Z(xi + h), where h represents a vector with a defined direction and distance. For the N(h) pixel pairs in an image separated by h, the semi-variance γ(h) is defined as

γ(h) = 1 / (2N(h)) × Σ [Z(xi) − Z(xi + h)]²,

where the sum runs over all N(h) pixel pairs.
The semi-variance typically increases with lag before leveling off. The maximum level of semi-variance reached is called the sill, and the lag at which the sill is reached is called the range. The range can serve as the most advantageous window size. Similar techniques have been successfully applied in previous studies [
31].
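The empirical semi-variance along one direction can be sketched as follows; the stripe-pattern example (our construction) shows how the sill and range emerge for a texture with a known period:

```python
import numpy as np

def semivariance(img, max_lag):
    """Empirical semi-variance gamma(h) along the x-direction:
    gamma(h) = 1 / (2 N(h)) * sum over pairs [Z(x) - Z(x + h)]^2."""
    img = img.astype(np.float64)
    return np.array([0.5 * np.mean((img[:, h:] - img[:, :-h]) ** 2)
                     for h in range(1, max_lag + 1)])

# A binary stripe pattern that repeats every 4 pixels: the semi-variance
# peaks (sill) at lag 2 and drops back to zero at the full period, lag 4.
img = np.tile([0, 0, 1, 1], (8, 4))   # 8 x 16 striped image
g = semivariance(img, 6)
```

The lag at which `g` first reaches its sill would then be taken as the candidate GLCM window size.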
3.3.4. Principal Component Analysis
Principal component analysis (PCA) is a commonly employed multivariate statistical method that seeks to convert a group of interrelated variables into a reduced collection of independent variables known as principal components. These components capture the bulk of the variability included in the original data [
32]. The method has been utilized in several domains, such as data exploration [
33], assessment of small enterprises’ competitiveness [
34], and non-destructive quality control of rice by image analysis [
35].
Principal component analysis can be employed in texture analysis to decrease the dimensionality of the feature space by selecting the most significant texture features [
35]. Utilizing this technique can be quite advantageous when dealing with a substantial quantity of texture bands, as it can effectively mitigate the problem of the curse of dimensionality and enhance the efficiency of future analysis or classification tasks.
In order to evaluate the implementation of principal component analysis on 5 texture bands, the initial step would include extracting the pertinent texture characteristics from the image data. The features may encompass measurements of contrast, homogeneity, energy, and other statistical attributes of the texture [
35]. After obtaining the feature vectors, one can apply principal component analysis to determine the principal components that capture the majority of the variance in the data [
33].
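A minimal PCA sketch over a stack of texture bands, using an eigen-decomposition of the band covariance matrix, might look as follows; the synthetic five-band stack is our illustrative assumption, not the study's data:

```python
import numpy as np

def pca_reduce(stack, n_components):
    """Project an (H, W, B) band stack onto its leading principal components."""
    h, w, b = stack.shape
    x = stack.reshape(-1, b)
    x = x - x.mean(axis=0)
    cov = np.cov(x, rowvar=False)        # B x B band covariance
    vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]       # re-sort descending
    comps = vecs[:, order[:n_components]]
    explained = vals[order][:n_components] / vals.sum()
    return (x @ comps).reshape(h, w, n_components), explained

# Five strongly correlated synthetic "texture bands": one shared pattern
# plus small independent noise, so one component captures most variance.
rng = np.random.default_rng(1)
base = rng.normal(size=(16, 16))
stack = np.stack([s * base + rng.normal(scale=0.05, size=base.shape)
                  for s in (1.0, 0.9, 0.8, 1.1, 1.2)], axis=-1)
reduced, explained = pca_reduce(stack, 1)
```

Because the five bands are redundant, a single principal component retains nearly all of the variance, which is exactly the dimensionality reduction benefit described above.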
3.3.5. Deep Learning Approach
In the last ten years, the field of remote sensing has undergone a significant change due to the introduction of deep learning techniques, which have transformed the approach to land cover classification. In the past, land cover mapping predominantly relied on standard machine learning algorithms such as support vector machines and decision trees. Recent breakthroughs in deep learning have demonstrated exceptional performance across a range of geographic data processing tasks, including object detection, image classification, and scene interpretation [
7].
A significant benefit of utilizing deep learning in land cover classification is its capacity to extract more resilient and distinguishing characteristics from the input data, resulting in enhanced classification accuracy. This is especially apparent when dealing with intricate or diverse land cover types, as conventional approaches may have difficulties in accurately representing the subtle details and spatial arrangements [
5].
The fast progress in remote sensing technology and the availability of high-resolution satellite and aerial images have resulted in a growing need for effective and precise ways to classify land cover. An effective method in this field is the utilization of deep learning methods. Among the popular deep learning architectures, U-Net, SegNet, and DeepLab have garnered considerable attention for their applications in this domain. The U-Net architecture, initially proposed for biomedical image segmentation [
36], has gained popularity in remote sensing due to its ability to effectively capture both local and global features [
37]. SegNet, on the other hand, is a symmetrical encoder–decoder network designed for efficient semantic segmentation [
15]. DeepLab, a deep convolutional neural network, has also shown promising results in various remote sensing applications.
The key advantage of U-Net is its unique encoder–decoder structure with skip connections, which allows for the preservation of spatial and contextual information throughout the network [
16], the capability to capture both local and global properties, and the efficient utilization of limited training data [
38]. This feature is particularly beneficial for land cover classification, where the accurate delineation of object boundaries is crucial [
16,
37].
In contrast, SegNet’s symmetric architecture may not be as effective in capturing the necessary hierarchical features required for robust land cover classification [
15]. Additionally, the DeepLab architecture, while powerful, may not fully capitalize on the spatial relationships inherent in remote sensing data, potentially leading to suboptimal performance in certain land cover classification tasks.
Several studies have highlighted the superior performance of U-Net over SegNet and DeepLab in land cover classification applications. For example, the paper entitled “Effective Building Extraction From High-Resolution Remote Sensing Images with Multitask-Driven Deep Learning” [
16] demonstrated that a modified U-Net architecture outperformed both SegNet and DeepLab in building extraction from high-resolution remote sensing images. Similarly, the work titled “Building Segmentation with Inception-UNet and Classical Methods” showed that the U-Net-based approach achieved better results than classical image processing methods, with the added advantage of being end-to-end trainable [
37].
In addition to SegNet and DeepLab, there are more recent deep learning architectures, such as UNet3+. UNet3+ extends the UNet architecture by introducing full-scale skip connections. These connections aggregate multi-resolution features from encoder and decoder layers, enhancing the model’s ability to capture fine-grained details and global context simultaneously [
39]. While the U-Net3+ model has been proposed as an evolution of the original U-Net, the advantages of U-Net in large-scale land cover classification remain significant. U-Net’s ability to effectively capture both local and global features, as well as its efficient use of skip connections, have made it a versatile choice for a wide range of satellite imagery segmentation applications [
40,
41]. One of the key advantages of U-Net over U-Net3+ is its computational efficiency. U-Net’s streamlined architecture allows for faster training and inference times, making it a more practical choice for large-scale land cover classification tasks that often require processing vast amounts of satellite imagery [
40,
41]. Furthermore, U-Net’s skip connections facilitate the preservation of spatial information, which is crucial for accurately delineating land cover boundaries, a critical requirement in large-scale mapping applications [
41]. UNet’s simpler design makes it easier to train and optimize, especially when working with diverse and large-scale land cover datasets. The added complexity of UNet3+ can sometimes lead to challenges in hyperparameter tuning or training stability.
This study employed the U-Net architecture, a type of deep convolutional neural network, to train on and classify the Pleiades images. The U-Net architecture consists of two primary elements: an encoder (contracting path) and a decoder (expanding path), as shown in Figure 6. The ResNet34 model serves as the backbone of the contracting path. It employs convolutional blocks followed by max-pool downsampling to transform the input data into feature representations at varying levels. To accomplish dense classification, the expanding path applies convolution and up-sampling to transfer semantic information from the feature representations back to high-resolution pixel space [
42].
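The flow of a tile through this encoder–decoder structure can be illustrated by tracking spatial resolution alone. The sketch below assumes the standard ResNet34 stage strides (a stride-2 7×7 convolution, a stride-2 max-pool, then four residual stages with strides 1, 2, 2, 2); layer names and the five decoder steps are illustrative, not taken from the paper.

```python
# Sketch of feature-map resolutions in a ResNet34-backed U-Net for a
# 512x512 input tile. Stage strides follow the standard ResNet34 layout;
# channel counts are omitted for simplicity.

def encoder_resolutions(size):
    """Spatial size after each step of the contracting path."""
    stages = [("conv1 (stride 2)", 2), ("maxpool", 2),
              ("layer1", 1), ("layer2", 2), ("layer3", 2), ("layer4", 2)]
    sizes = []
    for name, stride in stages:
        size //= stride
        sizes.append((name, size))
    return sizes

def decoder_resolutions(size, steps=5):
    """The expanding path doubles the resolution back to the input size."""
    return [size * 2 ** (i + 1) for i in range(steps)]

enc = encoder_resolutions(512)          # bottleneck ends at 16x16
dec = decoder_resolutions(enc[-1][1])   # 32, 64, 128, 256, 512
```

The skip connections of U-Net link each encoder resolution to the matching decoder resolution, which is what preserves the spatial detail needed for land cover boundaries.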
This study classified six land cover categories: bare land, agricultural land, plantation, building, water bodies, and roads. The training dataset consists of 1440 tiles collected from six distinct sites within the Mataram city region. Each training site covers an area of 0.25 square kilometers, and each tile measures 512 × 512 pixels. The labeled images were generated using land cover data obtained from topographic maps. Before being used as label data, a review was conducted to verify the agreement between the land cover depicted on the topographic map and the actual appearance captured in the image, taking into account the different data sources and collection periods of the two datasets. Where discrepancies were found, the labels were adjusted first so that they align with the visual appearance in the Pleiades image.
Table 1 displays the distribution of each land cover class in the training data. The data indicate an imbalance in the class distribution: roads and water bodies account for only 3.73% and 4.17% of the data, respectively. The remaining classes are distributed rather evenly, with building accounting for 25.56%, agricultural land for 27.78%, bare land for 12.96%, and plantation for 25.80%.
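One common remedy for such imbalance is to weight the loss by inverse class frequency. The sketch below uses the percentages from Table 1; the paper does not state that class weighting was applied, so this is purely illustrative.

```python
# Inverse-frequency class weights from the Table 1 distribution.
# Underrepresented classes (road, water bodies) receive the largest
# weights; this is an illustrative remedy, not the paper's method.
fractions = {
    "bare land": 0.1296, "agricultural land": 0.2778, "plantation": 0.2580,
    "building": 0.2556, "water bodies": 0.0417, "road": 0.0373,
}
n_classes = len(fractions)
weights = {c: 1.0 / (n_classes * f) for c, f in fractions.items()}
```

With this normalization, a perfectly balanced dataset would give every class a weight of 1.0.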
Before establishing the number of epochs to employ in the training process, an initial experiment was undertaken to identify the most effective epoch count. An epoch refers to one complete forward and backward pass of the dataset through the neural network. The batch size was set to 2, meaning that two training samples are processed simultaneously. Increasing the batch size can enhance the model’s performance, but it requires more memory. The learning rate was determined automatically: the optimal learning rate was derived from the learning curve during the training procedure. The U-Net model utilizes ResNet34 as its backbone. The backbone model serves as a form of transfer learning [
11]. ResNet34 is a conventional residual network, composed of 34 layers, that has been pre-trained on the ImageNet dataset, which consists of over 1 million images.
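Automatic learning-rate selection is often done with a learning-rate range test: sweep the rate geometrically, record the loss, and pick the rate where the loss falls fastest. The sketch below uses a synthetic loss curve in place of real mini-batch losses; whether the training library used exactly this procedure is an assumption.

```python
import numpy as np

# Minimal learning-rate range test: pick the rate at the steepest
# descent of the loss curve. The loss values here are synthetic; in
# practice each value would come from one mini-batch of real training.
def suggest_lr(lrs, losses):
    slopes = np.gradient(np.asarray(losses), np.log10(np.asarray(lrs)))
    return lrs[int(np.argmin(slopes))]   # most negative slope

lrs = np.logspace(-6, -1, 50)
# Synthetic curve: flat at tiny rates, steep improvement mid-range,
# divergence at large rates.
losses = (1.0 - 0.8 / (1 + np.exp(-(np.log10(lrs) + 3.5) * 4))
          + np.where(lrs > 1e-2, 50 * (lrs - 1e-2), 0.0))
best = suggest_lr(lrs, losses)
```

On this curve the suggested rate falls in the mid-range, well below the divergence region.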
Once the model training process was completed, the model was then employed to generate the land cover within the designated testing location. The testing location spans 9.18 km by 6.90 km within Mataram city.
Figure 7 displays the site of the testing area. An assessment was conducted after the deep learning classification process to determine the accuracy of the classification result.
3.3.6. Morphological Image Processing
Morphological image processing is a method that is designed to analyze the shape, or morphology, of image features. Imperfections or anomalies may occasionally be present during the process of image segmentation or classification. This approach is widely employed to eliminate such imperfections or abnormalities [
43]. Morphological image processing resembles spatial filtering: a structuring element is moved across every pixel in the input image to generate the corresponding pixel in the output image. The two most fundamental morphological operations are erosion and dilation. Erosion decreases the thickness of all features in the image, potentially splitting joined features or stripping away protrusions. Dilation, on the other hand, increases the thickness of all features, repairing breaks and sharpening feature definition prior to vectorization [
44].
Combining dilations and erosions yields further morphological operations, namely opening and closing. Closing consists of a dilation followed by an erosion with the same structuring element. It can smooth jagged lines in the image and bridge narrow gaps between prominent elements in the image foreground. Opening consists of an erosion followed by a dilation, again with the same structuring element. It can remove fine lines or small details from the image. This study employs morphological image processing to remove undesired features or noise that arise after the deep learning classification. Following the morphological processing, an assessment is conducted to determine the accuracy of the result, which is compared against the accuracy obtained from the initial deep learning classification.
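These four operations can be sketched in a few lines for the binary case with a 3×3 square structuring element. This is a plain NumPy illustration of the definitions above, not the paper's implementation; libraries such as scipy.ndimage provide the same operations ready-made.

```python
import numpy as np

# Binary morphology with a 3x3 square structuring element.
def _shift_stack(img):
    p = np.pad(img, 1)
    return np.stack([p[i:i + img.shape[0], j:j + img.shape[1]]
                     for i in range(3) for j in range(3)])

def erode(img):    # a pixel survives only if its whole neighborhood is set
    return _shift_stack(img).min(axis=0)

def dilate(img):   # a pixel is set if any neighbor is set
    return _shift_stack(img).max(axis=0)

def opening(img):  # erosion then dilation: removes fine details/noise
    return dilate(erode(img))

def closing(img):  # dilation then erosion: bridges narrow gaps
    return erode(dilate(img))

img = np.zeros((9, 9), dtype=np.uint8)
img[2:7, 2:7] = 1      # a solid block ...
img[0, 0] = 1          # ... plus an isolated noise pixel
cleaned = opening(img)
```

Opening removes the isolated pixel while leaving the solid block (apart from slight boundary regularization), which is exactly the noise-removal behavior used after classification.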
4. Results and Discussion
4.1. Texture Analysis and Principal Component Analysis Result
The kernel size was determined through semi-variance analysis, conducted on subsets generated from the six land cover classes in the study area: bare land, building, plantation, agricultural land, road, and water. Two samples were taken for each land cover class.
Figure 8 illustrates the variograms for building samples 1 and 2. The kernel size for texture analysis can be determined from the lag distance at which the semi-variance value levels off. The experimental variogram is represented by the black line, whereas the spherical model variogram is represented by the blue line. A summary of the semi-variance analysis, using the spherical model variogram, for all land cover classes is presented in
Table 2 and
Table 3.
Table 2 presents the summary of the semi-variance analysis results obtained from a spherical model variogram in four directions (0°, 45°, 90°, and 135°) for the six land cover classes. The frequency of the ranges (lag distances), which determines the kernel size, is derived from the data in
Table 2 and displayed in
Table 3. Based on this semi-variance analysis using the spherical model variogram, a kernel size of 5 × 5 was selected for texture analysis.
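The experimental semi-variance underlying these variograms is γ(h) = ½ · mean[(z(x+h) − z(x))²]. The sketch below computes it along one direction for a synthetic image subset with short-range correlation; the smoothing window and subset size are illustrative, not the paper's data.

```python
import numpy as np

# Empirical semi-variance along the 0-degree direction of an image
# subset; the kernel size is read off where gamma(h) levels out (the
# range of the fitted spherical model).
def semivariogram(z, max_lag):
    z = np.asarray(z, dtype=float)
    gammas = []
    for h in range(1, max_lag + 1):
        d = z[:, h:] - z[:, :-h]          # horizontal lag-h pixel pairs
        gammas.append(0.5 * np.mean(d ** 2))
    return np.array(gammas)

rng = np.random.default_rng(0)
# Synthetic subset: noise smoothed with a 5-pixel window, so the
# semi-variance should rise and then plateau after a few lags.
noise = rng.normal(size=(64, 80))
kernel = np.ones(5) / 5.0
z = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, noise)
gamma = semivariogram(z, max_lag=10)
```

The lag at which `gamma` flattens plays the role of the range in Table 2, and hence of the kernel size chosen for texture analysis.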
Figure 9 displays instances of the GLCM textures derived from five formulas.
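A GLCM texture is computed from the co-occurrence matrix of gray-level pairs at a given offset. The sketch below builds a symmetric, normalized GLCM and evaluates two common Haralick measures; which five formulas the study actually used is defined earlier in the paper, so contrast and homogeneity here are only representative examples on a toy image.

```python
import numpy as np

# Gray-level co-occurrence matrix (GLCM) for one offset, plus two
# representative Haralick texture measures.
def glcm(img, levels, dx=1, dy=0, symmetric=True):
    m = np.zeros((levels, levels))
    a = img[:img.shape[0] - dy, :img.shape[1] - dx]
    b = img[dy:, dx:]
    np.add.at(m, (a.ravel(), b.ravel()), 1)   # count co-occurring pairs
    if symmetric:
        m = m + m.T
    return m / m.sum()                        # normalize to probabilities

def contrast(p):
    i, j = np.indices(p.shape)
    return np.sum(p * (i - j) ** 2)

def homogeneity(p):
    i, j = np.indices(p.shape)
    return np.sum(p / (1.0 + np.abs(i - j)))

img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
p = glcm(img, levels=4)
```

In the full workflow, such measures are evaluated in a moving 5 × 5 kernel (the size chosen by the semi-variance analysis) to produce one texture band per formula.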
Following the texture analysis, the PCA procedure was implemented, and the first and second principal components were retained.
Figure 10 illustrates examples of the first and second PCA bands. Using a small fraction of the training data, an experiment was conducted to determine whether the PCA bands or the original texture bands were the better input for the deep learning classification model. The first training set consists of the five original texture bands, whereas the second training set consists of the two principal component analysis (PCA) bands. The outcome is observable in
Table 4 and
Table 5.
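The reduction from five texture bands to two PCA bands can be sketched as an eigen-decomposition of the band covariance matrix. The five-band stack below is synthetic (scaled copies of one pattern plus noise) and stands in for the real GLCM textures; keeping two components follows the paper.

```python
import numpy as np

# PCA over a stack of texture bands: flatten each band to a column,
# center, and project onto the leading eigenvectors of the covariance.
def pca_bands(stack, n_components=2):
    h, w, nb = stack.shape
    X = stack.reshape(-1, nb).astype(float)
    X -= X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)             # ascending eigenvalues
    order = np.argsort(vals)[::-1][:n_components]
    comps = X @ vecs[:, order]
    explained = vals[order] / vals.sum()
    return comps.reshape(h, w, n_components), explained

rng = np.random.default_rng(1)
base = rng.normal(size=(32, 32))
# Five correlated "texture bands": scaled copies of one pattern + noise.
stack = np.stack([base * s + 0.1 * rng.normal(size=base.shape)
                  for s in (1.0, 0.8, 0.6, 0.5, 0.4)], axis=-1)
pcs, explained = pca_bands(stack, 2)
```

Because GLCM texture bands are strongly correlated, the first component typically captures most of the variance, which is why two PCA bands can replace five texture bands with little information loss.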
The precision of the model trained with PCA bands surpasses that of the model trained with texture bands for bare land, building, plantation, agricultural land, and water; only road shows the opposite effect. The largest precision gain, up to 2.8%, is observed for bare land. The recall of the model using PCA bands surpasses that of the model using texture bands for building, plantation, agricultural land, road, and water; only bare land shows the opposite effect. The largest recall gain, 10.0%, is observed for road. The F1-score of the model trained with PCA bands surpasses that of the model trained with texture bands for all land cover classes, with road showing the largest increase of 8.7%.
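The three metrics compared above are derived per class from a confusion matrix. The sketch below computes them for a made-up 3-class matrix; the paper's actual counts are in its tables.

```python
import numpy as np

# Per-class precision, recall, and F1-score from a confusion matrix
# (rows = ground truth, columns = prediction). Counts are illustrative.
def per_class_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)   # correct / all predicted as class
    recall = tp / cm.sum(axis=1)      # correct / all truly in class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

cm = np.array([[90,  5,  5],
               [10, 80, 10],
               [ 0, 10, 90]])
precision, recall, f1 = per_class_metrics(cm)
```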
The impact of incorporating an NDVI band and two PCA-GLCM bands on precision, recall, and F1-score may be observed by examining the numbers presented in
Table 6,
Table 7 and
Table 8.
Table 6 shows that the inclusion of NIR, NDVI, and PCA bands, in addition to the RGB bands, leads to higher precision values across all land cover classes. Similar patterns can be observed for the recall and F1-score. Overall, using the near-infrared (NIR), normalized difference vegetation index (NDVI), and principal component analysis (PCA) bands enhances the quality of the results, as evidenced by the rise in precision, recall, and F1-score metrics. After conducting tests with a limited number of training samples, it was established that the optimal choice was to utilize seven bands: R, G, B, NIR, NDVI, 1st PCA, and 2nd PCA.
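The NDVI band in this seven-band stack is the standard ratio (NIR − R) / (NIR + R). A minimal sketch, with illustrative reflectance values rather than actual Pleiades pixels:

```python
import numpy as np

# NDVI from the red and near-infrared bands: (NIR - R) / (NIR + R).
# A small epsilon guards against division by zero over dark pixels.
def ndvi(nir, red, eps=1e-9):
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)

red = np.array([[0.10, 0.30], [0.05, 0.25]])
nir = np.array([[0.50, 0.35], [0.45, 0.25]])
v = ndvi(nir, red)   # vegetation -> high NDVI, bare/built -> near zero
```

High NDVI separates plantation and agricultural land from buildings and bare land, which is why adding it helps the classes reported in Table 6.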
4.2. Deep Learning Classification Result
An experiment was conducted utilizing a subset of the training dataset to determine the ideal number of epochs. The training method utilized a total of 415 tiles from the training data as input. The training process was conducted seven times, with each iteration consisting of 25, 50, 100, 200, 300, 400, and 500 epochs.
Figure 11 displays the validation results obtained by using 10% of the training set for each epoch.
The F1-score of the validation for each land cover class exhibits a substantial increase from epoch 25 to epoch 200. Beyond epoch 200, the F1-score plateaus, showing only modest improvement.
Table 9 displays the correlation between F1-score and the quantity of epochs, along with the duration for each epoch. After reaching epoch 200, the inclusion of more epochs and the subsequent rise in time does not result in a substantial improvement in the F1-score. Based on the upward trend of the F1-score value and the duration of each epoch, it is determined that 200 epochs is the most advantageous value.
In order to assess the accuracy of the classification result, a confusion matrix is employed. The ground truth data refer to the data collected through the visual interpretation of the Pleiades images. The test region spans an approximate area of 63.3 square kilometers and contains about 240,000,000 pixels of ground truth data. The extremely limited amount of training data available for roads and water bodies, which account for only 3% to 4% of the total, renders their results incomparable to those of other land cover classes. Thus, the evaluation of classification accuracy solely concentrates on four specific land cover classes: bare land, building, agricultural land, and plantation.
The ground truth data consist of around 25% bare land, 22% building, 19% plantation, and 34% agricultural land. The number of random sample points used for evaluation exceeds 24,000,000, about 10% of the ground truth pixels, indicating that the sample points are distributed relatively uniformly and consistently within the test location. The quantity of random sample points for each land cover category is displayed in
Table 10. The assessment is conducted within four distinct land cover categories: bare land, buildings, plantations, and agricultural land.
The result of the assessment is displayed in the confusion matrix presented in
Table 11. The deep learning classification achieved a kappa value of 0.79 and an overall accuracy of 84% at the testing location. The producer accuracy values vary from 79% to 95%, while the user accuracy values vary from 81% to 89%, as indicated in
Table 11.
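Overall accuracy and Cohen's kappa follow directly from the confusion matrix: accuracy is the trace over the total, and kappa corrects the observed agreement for the agreement expected by chance. The matrix below is illustrative, chosen only to land near the reported range of values.

```python
import numpy as np

# Overall accuracy and Cohen's kappa from a confusion matrix
# (rows = ground truth, columns = prediction). Counts are illustrative.
def overall_accuracy(cm):
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

def cohens_kappa(cm):
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    po = np.trace(cm) / n                                 # observed agreement
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2   # chance agreement
    return (po - pe) / (1 - pe)

cm = np.array([[850,  50,  60,  40],
               [ 40, 880,  40,  40],
               [ 50,  30, 870,  50],
               [ 30,  40,  60, 870]])
oa = overall_accuracy(cm)
kappa = cohens_kappa(cm)
```

Kappa is always at or below the overall accuracy, since it discounts the agreement that would occur by chance alone.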
4.3. Morphological Processing Result
Morphological image processing is used to further regularize the classification results. This post-classification regularization procedure eliminates the undesirable features generated during the classification process. Undesirable features may include a small cluster of pixels, a minor discrepancy across land cover categories, or slight protrusions.
Figure 12 illustrates the application of the post-classification regularization procedure in the testing area. In
Figure 12, the isolated pixel clusters within blue circles 1 and 2 were merged into larger groups with more regular shapes.
A confusion matrix is employed to evaluate the outcome of the post-classification regularization procedure.
Table 12 displays the confusion matrix for the land cover data generated by the post-classification regularization procedure. The kappa coefficient of the new land cover is 0.81 and the overall accuracy is 86%. The findings indicate that employing post-classification regularization procedures can boost the accuracy of classification outcomes achieved through the deep learning approach by around 2%. The initial kappa values and overall accuracy of land cover classification using the deep learning approach were 0.79 and 84%, respectively. Following the post-classification regularization procedure, the kappa and overall accuracy scores rose to 0.81 and 86%, respectively.
The final result of the land cover classification using the deep learning approach can be seen in
Figure 13. These results can then be integrated with other map features, such as road networks and contour lines, to produce a topographic map. For applications that do not require a full set of topographic map layers, these results can also be used directly as land cover data.
4.4. Capacity of Deep Learning Classification
The production capacity of topographic maps in Indonesia is regulated by the Regulation of the Geospatial Information Agency No. 11 of 2018 concerning Technical Analysis of the Implementation of Geospatial Information Production. This regulation governs the productivity factors, production capacity, and resources required in the production process of Indonesian topographic maps. Based on this regulation, the production process of Indonesian topographic maps consists of technical stages, namely preparation, stereo-plotting, topology and polygon processing, digital elevation model (DEM) processing, contour and spot-height processing, field surveys, and data synchronization [
45]. The capacity of each stage can be seen in
Table 13. For example, in the stereo-plotting stage, an operator needs 24 days to finish one sheet of an Indonesian topographic map at a scale of 1:5000. One sheet of an Indonesian topographic map at a scale of 1:5000 covers an area of 75 arcsec × 75 arcsec (2.3 km × 2.3 km) or about 5.29 square kilometers.
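The sheet dimensions quoted above can be verified with simple spherical geometry: near the equator, one arcsecond of arc corresponds to about 30.9 m, so 75 arcsec is roughly 2.3 km. The mean Earth radius used below is the usual approximate value.

```python
import math

# Quick check of the 1:5000 sheet size: 75 arcsec x 75 arcsec near the
# equator gives roughly a 2.3 km x 2.3 km square, about 5.29 km^2.
R = 6371.0                          # mean Earth radius, km (approximate)
arcsec = math.radians(1 / 3600)     # one arcsecond in radians
side_km = 75 * arcsec * R           # sheet side near the equator
area_km2 = side_km ** 2
```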
The stereo-plotting stage is the extraction of map features from aerial photogrammetric data. The retrieved map features comprise hydrographic, road network, building, vegetation, and hypsographic features. The production capacity of each topographic map feature is presented in
Table 14. The production capacity of this topographic map feature is derived from empirical data obtained from eight topographic mapping projects conducted in 2019 by the Geospatial Information Agency, encompassing 1156 topographic map sheets at a scale of 1:5000 [
46].
Table 14 indicates that the stereo-plotting capacity of building and vegetation features is 38% of the total stereo-plotting stages, equating to 9 days/sheet/operator.
The experiments conducted in this research generated building and vegetation features using the deep learning approach. The duration necessary for generating building and vegetation features utilizing the deep learning approach is presented in
Table 15.
Table 15 reveals that the model training phase is conducted only once at the outset, following the preparation of the training dataset. However, this phase needs to be prepared carefully so that the resulting model is compatible with the area designated for mapping; the selection of the training dataset is essential to guarantee this compatibility. The value of 17 h in the table is the time required to train the model for 200 epochs on a training dataset of 1440 seven-band tiles.
Upon acquiring a compatible model, the classification process is expedited and simplified.
Table 15 indicates that the time needed to acquire land cover data, specifically buildings and vegetation, is 43 min (deep learning classification plus morphological image processing) per sheet of a 1:5000 scale topographic map. This is a substantial improvement over the 9 days needed with the stereo-plotting approach. Furthermore, reliance on operator working hours can be reduced, as the classification process can be executed on a computer outside designated working hours; the operator merely needs to launch it and review the results the following working day.
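The magnitude of this gain can be put in rough figures. The 8-hour working day is our assumption here (the text states only "9 days/sheet/operator"), so the resulting factor is an order-of-magnitude estimate, not a reported result.

```python
# Rough speedup of the deep learning workflow over stereo-plotting for
# building and vegetation features. The 8-hour working day is an
# assumption, not stated in the text.
stereo_min = 9 * 8 * 60        # 9 working days per sheet, in minutes
dl_min = 43                    # classification + morphology, per sheet
speedup = stereo_min / dl_min  # on the order of a hundredfold
```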
5. Discussion
The inclusion of texture information has been shown to yield significant improvements in land cover classification accuracy, with an average increase of 12.1% [
47]. This can be attributed to the additional spatial context information provided by texture features [
47]. The use of near-infrared, NDVI, and PCA bands also contributed to the enhanced classification performance. Our research demonstrates the effectiveness of deep learning for land cover classification and its potential to revolutionize topographic map creation. By using texture analysis, principal component analysis, and optimized feature combinations (including near-infrared, NDVI, and PCA bands), this study achieved substantial improvements in classification accuracy (precision, recall, and F1-score). Preprocessing steps like semi-variance analysis and PCA proved to be crucial for optimizing deep learning model input.
Deep learning has emerged as a powerful technique for land cover classification from remote sensing data, offering state-of-the-art performance [
48]. Advances in deep learning algorithms have led to significant improvements in accuracy, with one study reporting a 26.43% increase over previous results using an artificial neural network that achieved 97.01% accuracy, outperforming traditional machine learning classifiers [
49]. Our experiments showed that 200 training epochs achieved stable, high accuracy while minimizing computational effort. The model reached 84% accuracy and a kappa of 0.79, which further improved to 86% and 0.81, respectively, after applying morphological processing to reduce noise and inconsistencies in land cover classifications. This demonstrates the effectiveness of morphological processing for refining results.
Comparing traditional stereo-plotting to a deep learning method for creating building and vegetation features on topographic maps reveals significant efficiency gains through automation. The deep learning approach drastically reduced production time from 9 days to 43 min per map sheet, greatly improving productivity, minimizing reliance on operator labor, and increasing scheduling flexibility.
Despite promising results, this study had limitations. Limited training data for roads and water bodies affected the comparability of the classification results for these features. Future research should incorporate more balanced datasets and explore transfer learning to improve model generalization for underrepresented classes.
6. Conclusions
Analysis showed that using principal component analysis improved results, with the first two PCA bands outperforming the original five texture bands. This highlights the effectiveness of dimensionality reduction techniques like PCA for improving input features for classification. Deep learning with a U-Net architecture achieved good land cover classification results (kappa of 0.79 and 84% overall accuracy), demonstrating the robustness of this method for extracting land cover information from very-high-resolution satellite imagery.
Applying morphological image processing after classification further refines the results from deep learning. This post-processing step increased the kappa value and overall accuracy to 0.81 and 86%, respectively (a 2% improvement). It effectively reduces noise and inconsistencies, leading to more reliable land cover data. This highlights the importance of post-processing for high-quality results.
This study’s application of deep learning to extract land cover features (buildings and vegetation) from very-high-resolution satellite imagery offers a transformative approach to accelerate large-scale topographic map production. The impact of this approach is particularly evident in the dramatic increase in efficiency. What previously took 9 days using traditional stereo-plotting methods to acquire building and vegetation data for a single 1:5000 scale topographic map sheet now takes less than an hour using the deep learning method. This represents a remarkable improvement in productivity, significantly reducing operator working hours. The automated nature of the deep learning process allows for continuous operation (24 h a day) on a computer, requiring operators only for initiation and occasional monitoring.
The implications of this study extend beyond immediate improvements in mapping efficiency, showcasing the feasibility of automated, scalable approaches for large-scale topographic mapping and setting the stage for a wider adoption of deep learning technologies in geospatial applications. The ability to rapidly process high-resolution imagery significantly reduces production time and costs, enabling mapping agencies to generate up-to-date maps more frequently and cover larger areas, effectively addressing national and regional mapping requirements. This efficiency and scalability are crucial for various applications, including urban planning, disaster response, environmental monitoring, and resource management, all of which rely on accurate and timely geospatial information. Additionally, reduced reliance on operator time not only frees up human resources for critical tasks like quality control, data analysis, and map interpretation but also revolutionizes the mapping industry, making it more responsive to evolving needs and facilitating more comprehensive and detailed mapping projects.
The research acknowledges the limitations of its training dataset, specifically the need for more diverse, high-quality data with balanced representation across land cover classes. Because the study area had limited land cover variety, the current work only considered four classes (buildings, bare land, agricultural land, and plantations). Future research should prioritize expanding the training data to include more land cover types and diverse geographical regions to improve the model’s generalizability and adaptability. A more comprehensive classification system and testing the method’s robustness across different regions and datasets are also recommended. Further exploration of integrating auxiliary datasets could improve classification accuracy and address limitations related to specific land cover types, ultimately advancing the application of deep learning for large-scale topographic mapping.
Additionally, future work should explore incorporating additional experiments with newer deep learning architectures, such as U-Net3+ and transformer-based models. These advanced architectures may offer improvements, particularly in capturing fine-grained details and complex spatial patterns in land cover classification tasks. Comparing these models quantitatively with the current approach will provide deeper insight into their strengths and limitations and may identify more effective methods for large-scale applications.