1. Introduction
In the remote sensing context, the purpose of image classification is to extract meaningful information (land cover categories) from an image [1]. This process is generally performed via a supervised or unsupervised method [2]. In supervised classification, the algorithm is first trained using ground data and then applied to classify the image pixels. Unsupervised algorithms use only the information contained in the image to classify pixels, without requiring any training data [3,4]. As such algorithms do not require training samples, they are easy to apply, with minimum human intervention and cost [5,6]. The algorithms use the pixel values through a set of statistical rules or cost functions to label pixels in the feature space [7]. The process is typically performed via an iterative mechanism that minimises the spectral distances (similarities) between pixel values and the calculated cluster centres in the feature space, regardless of their locations in the image. This causes a speckled appearance (also known as salt-and-pepper noise), in which isolated pixels or small regions of pixels appear in the scene.
The above drawback is typically addressed by incorporating the spatial information between pixels into the clustering process [1]. To determine the neighbouring pixels, the algorithms use a fixed-size window or segmented objects [8]. In the fixed-size window structure, the algorithm, e.g., a hidden Markov model [1] or textural information [9,10], applies the spectral–spatial information extracted from a fixed-size window centred on each pixel to label the image pixels. These methods show better classification outcomes than conventional clustering algorithms. However, the clustering results are not always adequate, because there is no specific rule to determine the optimum size and shape of these local windows [8,11]. As a result, the classified maps may contain significant amounts of misclassification, especially when there are heterogeneous or complex features in a scene, e.g., roads or buildings. The main assumption of these methods is that objects have similar geometric structures, so a fixed neighbourhood distance can model all possible spatial interactions between and within objects in a scene.
To tackle this limitation, algorithms usually utilise image segmentation to extract spatial information, which is then specified by the geometry of the segmented objects. These methods employ different techniques, such as graph-cut-based segmentation [3], edge features [12], statistical region merging (SRM) [13], or robust fuzzy c-means (RFCM) [14], to create segments and extract spatial information from images. This structure allows them to increase the accuracy of the output maps without setting any fixed neighbourhood distances. Despite the advantages that these methods offer, such as flexible neighbourhood distances, there is no unique solution to the image segmentation problem [15,16]. Segmentation is usually controlled by parameters, e.g., scale and colour weight, that are typically determined by trial and error [17]. This can lead to poor results when the algorithm uses an over- or under-segmented map to extract the spatial information.
This issue is typically addressed via a hierarchical network of segmented or classified images generated at different scales in the image or feature space [18,19,20,21,22]. For example, Gençtav et al. [19] proposed a hierarchical segmentation algorithm to determine segments. They used the homogeneity and circularity of segments to formulate the spatial and spectral relationship between small segments at the lower levels of the hierarchy and meaningful segments at the higher levels. Hesheng and Baowei [20] updated the fuzzy c-means (FCM) function to integrate spatial information at different scales into spectral information, labelling pixels in the feature space via an iterative process. Kurtz et al. [18] used segmented images at different resolutions to classify images at different levels of spatial detail in an urban area. Fang et al. [21] implemented a multi-model deep learning framework, formulated based on an over-segmentation map and semantic labelling, to classify an image. The method utilised an iterative mechanism with a fixed geometry to merge segments and relabel pixels to produce a realistic land cover map. In contrast, Yin et al. [22] proposed a top-down structure using graphs for an unsupervised hierarchical segmentation process. The spatial connectivity between pixels was formulated via nodes and edges, and the algorithm used the average intensity of each region to tune the weight of each edge. These methods showed that they can generate better clustering maps even when there are complex features in an image. However, there is no unique setting to formulate the concept of scale and the hierarchical structure between objects at different scales [23]. In other words, the spatial relationships between objects in the scene can be sensitive to the parameters defined by a human expert, which reduces the flexibility of the neighbourhood system.
Some advanced classification algorithms use a cyclic mechanism of image segmentation and classification to build a flexible neighbourhood system, extract spatial information, and reduce noise. These approaches iterate between image segmentation and classification to integrate spatial and spectral information, incorporate expert knowledge, and reduce noise within objects in an image [24,25,26,27]. This approach allows the algorithms to alter the geometry of objects during the classification process. For example, Baatz et al. [24] used a set of geometric operators to enable segments to change their geometry during classification. Hofmann et al. [27] proposed a classification method that enables objects to negotiate at the object and pixel levels to change their geometry. This structure is mainly applied by supervised approaches, as they require training samples and expert knowledge.
All the above methods have two main geometric characteristics in common when creating a flexible neighbourhood system. First, they construct the internal and external boundaries of an object in an image separately. This is because the geometry applied by these methods lacks the complexity necessary to take advantage of topological relationships between and within objects in a scene, e.g., car and street, or chimney and roof. For example, to represent a street object that includes cars, the algorithm segments the street into multiple parts and then forms the street object by merging the segments, regardless of the spatial relationship between the cars and the street. Second, they use a segmentation process to define the initial geometry of objects. Thus, the geometric changes are restricted to the object level, not the pixel level. For example, when the initial geometries of forest and shadow objects are formed in a scene, the forest object cannot capture a shadow pixel located on the boundary between the two classes, and vice versa.
To overcome the above drawbacks, we propose a dynamic and unified geometry constructed on Vector Agents (VAs). The VAs are a distinctive type of Geographic Automata (GA) [28] that can determine their own geometry and state and interact with each other and their environment in a dynamic fashion [29]. The dynamic structure of the VAs enables them to support a flexible neighbourhood system in which objects, rather than a human expert, determine their neighbourhood distances. The method also applies a unified geometric structure that allows the interior (holes) and exterior boundaries of objects to be modelled simultaneously in a geographical area, i.e., a scene represented by remote sensing images. This geometry gives the VAs the power to automatically identify and remove isolated pixels and regions in the image when they lie within objects. The proposed method distinguishes itself from other classification methods based on the following spatial capabilities:
- Construct and change the interior and exterior geometry of objects in an image simultaneously;
- Describe the topological relationships between objects in the image;
- Support geometric changes of objects at the pixel and object levels with minimum human intervention;
- Remove salt-and-pepper noise using the geometry of objects in the image.
The remainder of this paper is organised as follows: Section 2 describes the structure of the VA model, Section 3 presents the clustering results of the proposed method, and Section 4 discusses the experimental results. Finally, Section 5 concludes this paper.
3. Experiments and Results
To study the performance of the VA model, three high-resolution remotely sensed satellite images were used. They are described below, along with the details of the clustering methods and the metrics applied to measure the effectiveness of the proposed method.
3.1. Datasets
The proposed approach was experimentally tested on three hyperspectral images: the Pavia Centre and Pavia University datasets collected by the ROSIS sensor, and an AVIRIS scene of Salinas Valley, California (Figure 8). The number of spectral bands is 102 for Pavia Centre, 103 for Pavia University, and 224 for Salinas Valley. For computational efficiency, four bands of each image were selected using the PCA function in MATLAB. In the first experiment, a 199 × 199 subset with a pixel size of 1.3 m was cut from the Pavia Centre image (Figure 8a). The image includes four information classes: water (C1), tree (C2), bare soil (C3), and bridge (C4) (Figure 8b).
For the second experiment, we used a 199 × 199 subset cut from the Pavia University image with a pixel size of 1.3 m (Figure 8c). It covers an urban area including five information classes: buildings and asphalt (C1), shadow (C2), bare soil (C3), meadows (C4), and painted metal sheets (C5) (Figure 8d). In the third experiment, a hyperspectral AVIRIS subset of 198 × 198 pixels was applied to test the VA method (Figure 8e). The geometric resolution is 3.7 m. The scene covers an agricultural zone in California containing seven information classes: Vinyard_untrained (C1), Brocoli_green_weeds_2 (C2), Grapes_untrained (C3), Fallow_rough_plow (C4), Fallow_smooth (C5), Stubble (C6), and Celery (C7) (Figure 8f).
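The PCA-based band reduction described above (performed in MATLAB in our experiments) can be sketched in NumPy as follows; the cube dimensions here are illustrative, not the actual dataset sizes:

```python
import numpy as np

def pca_reduce(cube, n_components=4):
    """Project a hyperspectral cube (rows, cols, bands) onto its first
    principal components, keeping n_components 'bands'."""
    rows, cols, bands = cube.shape
    X = cube.reshape(-1, bands).astype(float)
    X -= X.mean(axis=0)                      # centre each band
    # Eigendecomposition of the band-to-band covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1]        # sort by descending variance
    components = eigvecs[:, order[:n_components]]
    return (X @ components).reshape(rows, cols, n_components)

# A small random cube with 103 bands (as in Pavia University).
cube = np.random.default_rng(1).random((10, 10, 103))
reduced = pca_reduce(cube, n_components=4)
```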
3.2. Image Clustering
To evaluate the performance of the proposed approach in removing noise from the classified images, we compared the results of the VA model with those of conventional classification approaches in two modes.
- i.
Unsupervised
We assume that the only available information is the number of clusters and that there is no semantic information. In this setting, a classical k-means algorithm is applied, and the spatial information is imposed on the algorithm at two different stages, pre- and post-classification, as follows:
The algorithm first utilises a density function to group pixels together in the feature space, using a combination of spectral similarity and spatial proximity between pixels. It then applies the k-means algorithm to label pixels based on a new vector determined by the pixel values and the features associated with their corresponding regions in the segmented image.
The algorithm uses a fixed three-by-three window to replace labels in the k-means classified image based on the majority of their contiguous neighbouring pixels. The number of neighbours is set to eight to retain edges.
- ii.
Semi-supervised
The algorithms use a limited number of training samples at two levels, pixel and object, to train the SVM model and classify the images.
We used the SVM method described in Section 2 to label the pixels without using spatial information. Training samples were manually selected from the ground truth data to train the SVM.
The method first uses the MSS algorithm to segment the image. The SVM model is then applied to label segments. In this scenario, we manually selected the training objects.
The algorithm employs the MRS function, which is formulated based on a combination of spectral homogeneity and shape homogeneity to segment the image. Like the MSS method, the training objects are manually selected to train the SVM classifier.
In the above methods, the segmentation parameters are manually defined. Since there are no rules for determining these parameters, we conducted several segmentation experiments to find parameter values that generate segments with less speckle.
3.3. Evaluation Metrics
To evaluate the results quantitatively, the VA objects are compared with their corresponding reference objects in the ground truth maps. We use the following metrics to assess the accuracy of the VA maps [35]:
True Positive (TP), False Positive (FP), and False Negative (FN) are correctly detected pixels, wrongly detected pixels, and unrecognised pixels, respectively.
We also evaluate the spatial connectivity and fragmentation of the objects in the classified maps. The Perimeter/Area (P/A) ratio of classified objects is applied to assess the local neighbourhood connectivity between pixels within clusters in the classified maps.
3.4. Results
We first used the k-means algorithm to cluster the images. From the classified maps, 15 pixels were randomly selected for each cluster using Equation (1), with its parameter set to 1. The VA model then applies the selected samples to train its SVM classifier. The parameters of the k-means and SVM algorithms were set as defined in Section 2. The trained VAs are then added to the image to extract clusters from the images (Figure 9).
For the accuracy assessment across the different methods, the initial clusters were mapped into information classes. This is because, in unsupervised classification, each information class may contain more than one cluster. For example, in the Pavia University dataset, the numbers of clusters and information classes were set to seven and five, respectively.
5. Conclusions
Conventional k-means methods use pixels in isolation to cluster an image. Because of this, they cannot deal with limitations such as salt-and-pepper noise. This drawback is usually addressed by combining spectral and spatial information. To extract spatial information around each pixel, classification algorithms generally apply a static geometry formulated based on a local fixed window or an irregular polygon. The properties of this geometry are generally defined by an expert user. The algorithm then applies this spatial information, together with the spectral information of the pixels, to classify the image or relabel pixels. The primary assumption in this structure is that there is a unique mathematical formula that can formulate the spatial connectivity between all features in an image. If this assumption is violated, e.g., when there are complex or heterogeneous features in a scene, the clustering results may contain significant amounts of misclassification.
In this paper, we presented a geometry-led approach, constructed based on the VA model, to remove salt-and-pepper noise in unsupervised image classification without setting any geometric parameters, e.g., scale. In the presented algorithm, we applied a unified and dynamic geometry, instead of a predefined geometry or a hierarchical structure, to create a truly flexible neighbourhood system for extracting spatial information and removing speckle noise. The experimental results demonstrated the desirable performance of the VA model. For example, the P/A values in Table 2, Table 4 and Table 6 highlight that the VAs increase the spatial connectivity between pixels and provide a better visual appearance by simultaneously modelling the exterior and interior boundaries of objects in the images. The results in Table 1, Table 3 and Table 5 also indicate the better performance of the VA model in removing noise compared with the MSS and MRS methods.
For future research, we plan to improve the performance of the proposed method by reducing its processing time. This can be achieved by adding learning capabilities to the VAs so that they find the shortest route when determining boundary lines, which can help the algorithm save memory and reduce simulation time. Another direction for future research is to adapt the proposed method for object extraction from remotely sensed imagery, e.g., road extraction.