Article

DCCaps-UNet: A U-Shaped Hyperspectral Semantic Segmentation Model Based on the Depthwise Separable and Conditional Convolution Capsule Network

College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(12), 3177; https://doi.org/10.3390/rs15123177
Submission received: 15 May 2023 / Revised: 16 June 2023 / Accepted: 16 June 2023 / Published: 19 June 2023

Abstract

Traditional hyperspectral image semantic segmentation algorithms cannot fully utilize spatial information or achieve efficient segmentation with limited sample data. To solve these problems, a U-shaped hyperspectral semantic segmentation model (DCCaps-UNet) based on a depthwise separable and conditional convolution capsule network is proposed in this study. The whole network is an encoding–decoding structure. In the encoding part, image features are first fully extracted and fused; in the decoding part, images are then reconstructed by upsampling. In the encoding part, a dilated convolutional capsule block is proposed to fully acquire spatial information and deep features and to reduce the computational cost of dynamic routing using a conditional sliding window. A depthwise separable block is constructed to replace the common convolution layers in the traditional capsule network and efficiently reduce network parameters. After principal component analysis (PCA) dimension reduction and patch preprocessing, the proposed model was experimentally tested on the Indian Pines and Pavia University public hyperspectral image datasets. The obtained segmentation results for various ground objects were analyzed and compared with those obtained with other semantic segmentation models. The proposed model performed better than the other semantic segmentation methods and achieved higher segmentation accuracy with the same samples. The Dice coefficients reached 0.9989 and 0.9999, and the OA values reached 99.92% and 100%, respectively, verifying the effectiveness of the proposed model.

1. Introduction

In recent years, with the continuous development of hyperspectral imaging technology [1], the analysis and processing of hyperspectral data [2,3,4,5] have become a research hotspot in many fields. A hyperspectral image contains hundreds to thousands of spectral bands for a single pixel and therefore provides a large amount of spatial and spectral information, which gives it important application value in fields such as remote sensing [6,7], environmental monitoring [8,9], agriculture [10,11,12], and medicine [13,14]. Semantic segmentation is one of the main tasks in hyperspectral image application research; it assigns each pixel in the image to a specific land-cover category according to its spatial semantic information. However, the complex data structure and high information redundancy of hyperspectral images make this task challenging. Although traditional image segmentation methods [15,16,17,18,19,20,21] are relatively simple to apply, they mostly rely on handcrafted features, and it is difficult for them to achieve satisfactory performance. Therefore, establishing an efficient semantic segmentation model for hyperspectral images is of great research significance.
With the rapid development of deep learning, convolutional neural networks (CNNs) have been widely used in semantic segmentation [22,23]. In 2015, Hu et al. [24] first used a 1D-CNN to extract spectral information from hyperspectral images, providing a new way to apply CNNs to hyperspectral image classification. However, extracting only spectral information leads to problems such as “spectral confusion,” which can significantly reduce model accuracy. Researchers therefore realized that extracting spatial features from hyperspectral images is also essential. In references [25,26,27], 2D-CNNs are used to extract spatial features. However, the small sample size and large number of channels of hyperspectral images tend to cause overfitting and reduce generalization ability. Extracting spatial and spectral features simultaneously has thus become the mainstream approach. According to how spatial and spectral information are extracted, these methods can be roughly divided into two types. The first extracts spatial and spectral information separately and then fuses them into new joint spatial–spectral features. Reference [28] designed an end-to-end CNN that splits into two branches at the initial stage to extract spectral and spatial features, respectively. Liu et al. [29] introduced a Siamese convolutional neural network (Siamese-CNN) containing a two-channel CNN to extract spatial–spectral information. Reference [30] combines a CNN with Markov random fields, first using the CNN to extract joint spatial–spectral information and then using Markov random fields to extract more refined spatial information. Li et al. [31] proposed a dual-channel CNN based on automatic clustering, which first reduces inter-class variance in the spectral dimension through automatic clustering and then uses a CNN to extract spectral information. The second type uses a 3D-CNN to extract spatial and spectral features simultaneously. Reference [32] was the first to use three-dimensional convolution kernels to extract the joint spatial–spectral information of hyperspectral images. Xu et al. [33] proposed an octave convolution network (Octave-CNN), which decomposes feature maps into low-frequency and high-frequency maps, effectively reducing the redundancy of spatial information. Ghaderizadeh et al. [34] combined 2D-CNN and 3D-CNN while introducing depthwise 3D and depthwise 2D convolutions, which significantly reduced the number of required samples and the model optimization time. Although significant progress has been made in the semantic segmentation of hyperspectral images, several problems still need to be explored.
The first problem concerns data partitioning and segmentation accuracy. Researchers usually use sliding windows to divide hyperspectral images into patches, including random patch partitioning [35] and non-overlapping sliding with n × n windows [36]. However, the choice of patch window size affects precision. Small patches cannot fully reflect the spatial and spectral characteristics of hyperspectral data or cover the whole analysis object, so accuracy may be low; conversely, large patches provide richer information but at the cost of computing speed. Choosing a suitable patch size is therefore important, and ensuring accuracy while minimizing the amount of computation remains a challenge.
The second problem is the insufficient utilization of spatial information. CNNs are not invariant to rotation or scale changes. The simple convolutions in an ordinary CNN only capture feature information at a given position in the feature map, so their expressive ability is limited when processing spatial information. To address this problem, Sabour et al. [37] proposed capsule networks (CapsNet), which represent features in vector form to better capture the spatial relations between features. Paoletti et al. were the first to apply capsule networks to hyperspectral images, achieving better results than traditional CNN models. Sun et al. [38] proposed a hyperspectral semantic segmentation model based on a cube capsule network and extended multiscale attribute features, which can effectively extract complex spatial features with more details. However, the dynamic routing in capsule networks generates a large number of parameters during training, leading to slow training.
Lastly, there is the problem of small sample sizes in hyperspectral images. Due to the high cost of labeling hyperspectral image samples, it is difficult to obtain a large amount of training data. Long et al. [39] proposed the fully convolutional network (FCN) for semantic segmentation, replacing the fully connected layers in VGG-16 with convolutional layers and successfully extending semantic segmentation from the image level to the pixel level. Zou et al. [36] applied FCNs to hyperspectral images for the first time and proposed a spectral–spatial 3D fully convolutional network (SS3FCN) to extract spectral–spatial information and semantic information simultaneously, achieving high accuracy. Although it performs pixel-level segmentation, it is still limited when classifying small samples. Similarly, U-Net [40], a network suitable for pixel-level semantic segmentation, has achieved good results in this field. Compared with CNNs, U-Net requires fewer training samples and is well suited to small-sample scenarios. In reference [41], the U-Net network was used as the basic architecture for hyperspectral image semantic segmentation and a PSE-UNet model was proposed. Soucy et al. [42] proposed a clustering-ensemble U-Net hyperspectral semantic segmentation model, which combines extracted features through an ensemble method and achieves higher segmentation accuracy. However, U-Net-based methods are still rarely used in hyperspectral semantic segmentation, and there is room for improvement in the algorithm structure.
To solve these problems, a DCCaps-UNet model for hyperspectral image semantic segmentation is proposed in this paper. Considering the characteristics of hyperspectral images, an end-to-end training method is adopted to alleviate the problem of small samples. During the encoding process, two blocks are constructed to extract spectral information and spatial information, respectively. This allows for better learning of different feature representations, improves computational efficiency, and can adapt to scale changes that may occur in hyperspectral images. The main contributions of this paper are as follows:
Firstly, we construct a new semantic segmentation model for hyperspectral images, DCCaps-UNet, which uses an encoding–decoding structure to capture and transmit features at different levels. A new loss function is defined to optimize the model, which effectively alleviates the small-sample problem in hyperspectral images. The model can achieve high segmentation accuracy under different scale changes.
Secondly, a depthwise separable block is proposed to extract spectral features from images. It adopts pointwise convolution followed by depthwise convolution, making the computation of the entire block more efficient. The use of residual connections improves the model’s ability to express details while avoiding problems such as gradient vanishing and overfitting.
Thirdly, a dilated convolution capsule block is constructed to extract spatial features, with the features represented as vectors to obtain denser spatial information. In this block, a new dynamic routing mechanism is proposed. By using dilated convolution in the capsules, a larger receptive field is obtained and the capsule weights can be adjusted better. By applying conditional constraint windows, the number of capsules involved in dynamic routing is reduced, and weight sharing between capsules of the same type is utilized to reduce the training burden and effectively decrease the number of model parameters.

2. Theory and Method

2.1. Data Preprocessing and Partitioning

2.1.1. Principal Component Analysis (PCA)

Hyperspectral data usually have multiple bands. PCA can be used to reduce the dimensions of the original data and eliminate redundant information. In this paper, PCA is used to reduce the spectral dimensions of original hyperspectral data to 30 while maintaining the complete spatial dimensions (width and height) of the original cube and preserving the main features.
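To make this preprocessing step concrete, the following minimal sketch (our own illustration, not the authors' released code) uses scikit-learn to reduce a hyperspectral cube to 30 principal components while keeping the spatial dimensions; the function name and the use of scikit-learn are assumptions.

```python
# Minimal sketch (not the authors' code): reduce the spectral dimension of a
# hyperspectral cube to 30 principal components while keeping width and height.
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(cube: np.ndarray, n_components: int = 30) -> np.ndarray:
    """cube: (H, W, B) hyperspectral cube; returns an (H, W, n_components) cube."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b)                      # every pixel becomes a B-dimensional sample
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(h, w, n_components)
```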

2.1.2. Cross-Validation

Cross-validation allows every sample in the dataset to be used for both training and validation, reduces the bias caused by an uneven distribution of data samples, improves the robustness and accuracy of the model, and helps avoid overfitting and underfitting. In this paper, 5-fold cross-validation is used: the dataset is divided into 5 sets, one set is selected as the test set each time, and the remaining 4 sets are used as training sets. The 5 test results are averaged to obtain the final result.
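A minimal sketch of this 5-fold scheme is shown below; the `evaluate_fold` hook is a hypothetical placeholder for training and testing the model on one split.

```python
# Minimal sketch of 5-fold cross-validation: each fold serves once as the test
# set, and the five test results are averaged (evaluate_fold is hypothetical).
import numpy as np
from sklearn.model_selection import KFold

def five_fold_average(patches, labels, evaluate_fold):
    scores = []
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(patches):
        score = evaluate_fold(patches[train_idx], labels[train_idx],
                              patches[test_idx], labels[test_idx])
        scores.append(score)
    return float(np.mean(scores))   # final result = average of the 5 test results
```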

2.1.3. Patch

In this paper, images are preprocessed by overlapping region division, and the testing accuracy of the semantic segmentation model is improved by using spatial neighborhood information. Through overlapping patch sliding windows, the hyperspectral image is divided into multiple sub-patches, each of which contains all channels of the HSI. Finally, the data are cut into pixel blocks of size n × n, which are used as the input of the DCCaps-UNet network.
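The sketch below illustrates this overlapping patch extraction; the reflect padding at the image border is our assumption, since the paper only specifies overlapping n × n windows over all channels.

```python
# Minimal sketch of overlapping patch extraction: one n x n x C sub-patch is
# taken around every pixel of the PCA-reduced cube (border padding is assumed).
import numpy as np

def extract_patches(cube: np.ndarray, n: int = 11) -> np.ndarray:
    """cube: (H, W, C); returns (H * W, n, n, C) patches, one per pixel."""
    pad = n // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    h, w, c = cube.shape
    patches = np.empty((h * w, n, n, c), dtype=cube.dtype)
    k = 0
    for i in range(h):
        for j in range(w):
            patches[k] = padded[i:i + n, j:j + n, :]
            k += 1
    return patches
```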

2.2. Feature Encoding

2.2.1. Spectral Feature Extraction–Depthwise Separable Convolutional Block (PDR-Block)

Depthwise separable convolution is a convolution operation that separates spatial convolution from channel convolution. In this paper, a new depthwise separable block is defined for extracting image spectral features, as shown in Figure 1. The order of the traditional depthwise separable convolution is changed: feature extraction is first carried out with pointwise convolution (1 × 1 convolution) and then with 3 × 3 depthwise convolution. In this way, feature learning pays more attention to the correlation between channels, and the amount of model computation is significantly reduced. By using residual connections to link the original input with subsequent layers, the vanishing gradient problem is avoided. The LeakyReLU activation function and a BN layer are introduced to normalize the data. In this paper, two consecutive depthwise separable blocks are used to extract the shallow features of hyperspectral images and output the shallow feature map.
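A minimal Keras sketch of such a block is given below, assuming a 1 × 1 projection on the shortcut so that the residual addition matches channel counts; the filter counts and layer-ordering details are assumptions rather than the exact released implementation.

```python
# Minimal Keras sketch of a PDR-style block: pointwise (1x1) convolution first,
# then 3x3 depthwise convolution, BN + LeakyReLU, and a residual connection.
from tensorflow.keras import layers

def pdr_block(x, filters, stride=1):
    shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)  # match shapes
    y = layers.Conv2D(filters, 1, padding="same")(x)                  # pointwise convolution
    y = layers.DepthwiseConv2D(3, strides=stride, padding="same")(y)  # depthwise convolution
    y = layers.BatchNormalization()(y)
    y = layers.LeakyReLU()(y)
    return layers.Add()([y, shortcut])                                # residual connection
```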

2.2.2. Spatial Feature Extraction–Dilated Convolution Capsule Block with a Conditional Window

Unlike traditional CNN models, a capsule network uses vectorized capsule neurons instead of scalar neurons and adopts a dynamic routing algorithm instead of pooling, thereby overcoming the loss of spatial information and the weak relationships between features caused by the pooling operation in CNNs. In hyperspectral images, the spatial–spectral information and the positional relations of pixel vectors are key factors for segmentation. Therefore, this paper uses a capsule network to extract deeper features and retain the spatial characteristics of objects.
In this paper, dynamic routing with a conditional window is proposed. By adjusting the coupling coefficient between the shallow capsules and the deep capsules, a dynamic connection between them is allowed. Figure 2 shows the structure diagram of the dilated convolution capsule block with a conditional window. The shallow capsule layer (Layer $L$) contains multiple capsule groups $N^{(L)} = \{ n_1^{(L)}, n_2^{(L)}, \ldots, n_n^{(L)} \mid n \in \mathbb{N} \}$. Each capsule group contains $w^{(L)} \times h^{(L)}$ capsules, $A = \{ a_{11}, a_{12}, \ldots, a_{1 w^{(L)}}, \ldots, a_{h^{(L)} 1}, \ldots, a_{w^{(L)} h^{(L)}} \}$, and each capsule has length $d^{(L)}$.
In this paper, some shallow capsules are selected through constraint windows, and a dilated conditional constraint window is defined for all the capsules in $n_i^{(L)}$. Dilated convolution is used to create the constraint windows so as to obtain a larger receptive field and denser spatial features. The size of the window is $m_1 \times m_2 \times d^{(L)}$, where $m_1$ and $m_2$ indicate the size of the two-dimensional convolution kernel, and the dilation rate is set to 3. The value of a deep capsule is determined by the capsules $u_i^{(L)}$ passing through the constrained window. $u_{i \mid (k_1-1)s_1+(m_1-1)r+1,\, (k_2-1)s_2+(m_2-1)r+1,\, d^{(L)}}$ denotes the capsule passing through the constrained window at $(k_1, k_2)$, where $s_1$ and $s_2$ indicate the sliding step sizes of the constrained window and $r$ is its dilation rate, which should meet the following conditions:
$(k_1 - 1) s_1 + (m_1 - 1) r + 1 < m_1$
$(k_2 - 1) s_2 + (m_2 - 1) r + 1 < m_2$
In the dynamic routing process between two capsule layers, a weight matrix $W_{ij}$ is first introduced and its parameters are shared among capsules of the same type. With this structure, the number of parameters to be optimized is reduced so as to prevent overfitting. The weight matrix $W_{ij}$ is multiplied by capsule $u_i^{(L)}$ to obtain the prediction vector $\hat{u}_{ij}$:
$\hat{u}_{ij \mid k_1 k_2} = W_{ij} \times u_{i \mid (k_1-1)s_1+(m_1-1)r+1,\, (k_2-1)s_2+(m_2-1)r+1,\, d^{(L)}} + b$
where $b$ is a learnable bias and $\hat{u}_{ij \mid k_1 k_2}$ is a linear combination of the output vectors of Layer $L$. The coupling coefficient $c_{ij \mid k_1 k_2}$ between the two capsule layers is calculated with the softmax function and iteratively updated through the routing logit $B_{ij \mid k_1 k_2}$, whose initial value is 0. The weighted summation of the prediction vectors $\hat{u}_{ij \mid k_1 k_2}$ with the coupling coefficients gives $S_{j \mid k_1 k_2}$, as expressed in Equation (4):
$S_{j \mid k_1 k_2} = \sum_i c_{ij \mid k_1 k_2} \, \hat{u}_{ij \mid k_1 k_2}$
After the vector $S_{j \mid k_1 k_2}$ is processed with the squashing activation function, the output vector $v_{j \mid k_1 k_2}^{(L+1)}$ is obtained. The squashing function is expressed as follows:
$v_{j \mid k_1 k_2}^{(L+1)} = \dfrac{\| S_{j \mid k_1 k_2} \|^2}{1 + \| S_{j \mid k_1 k_2} \|^2} \cdot \dfrac{S_{j \mid k_1 k_2}}{\| S_{j \mid k_1 k_2} \|}$
The scalar product of the obtained deep capsule $v_{j \mid k_1 k_2}^{(L+1)}$ and the corresponding prediction vector is taken as the agreement (consistency) measure. This scalar product is added to the routing logit to update the coupling coefficients as follows:
$B_{ij \mid k_1 k_2} = B_{ij \mid k_1 k_2} + v_{j \mid k_1 k_2}^{(L+1)} \cdot \hat{u}_{ij \mid k_1 k_2}$
In this block, the input data first pass through the PrimaryCapslayer (routing = 1), which receives the extracted primary features and encapsulates them into vector form while fully capturing spatial features. The convolution kernel has a size of 5 and the dilation rate is 3, and every 16-dimensional vector is encapsulated as a capsule. The output then passes successively through three capsule layers (DConvCapslayer, routing = 3): the capsule weights are calculated with the dynamic routing mechanism, the coupling coefficients between consecutive layers are continuously updated, and the vectors are passed to the next DConvCapslayer; the output is in compressed (squashed) vector form. Each element represents the probability that the corresponding feature is present in the input data and contains spatial information. In the capsule block, the capsule dimension of the last layer is equal to the number of categories in the image, and the output shape is H × W × N × S, where N is the number of capsule categories and S is the size of the capsule.
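The following TensorFlow sketch shows the squashing function and the routing update of the equations above in their simplest dense form; the tensor layout and the absence of the conditional window are simplifications we introduce for illustration.

```python
# Minimal TensorFlow sketch of the squashing function and dynamic routing
# (dense form, without the conditional window, for illustration only).
import tensorflow as tf

def squash(s, axis=-1, eps=1e-7):
    norm_sq = tf.reduce_sum(tf.square(s), axis=axis, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / tf.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, iterations=3):
    """u_hat: prediction vectors with shape (batch, n_in, n_out, dim)."""
    b = tf.zeros(tf.shape(u_hat)[:3])                      # routing logits B, initialised to 0
    for _ in range(iterations):
        c = tf.nn.softmax(b, axis=2)                       # coupling coefficients
        s = tf.reduce_sum(c[..., None] * u_hat, axis=1)    # weighted sum over input capsules
        v = squash(s)                                      # squashed output capsules
        b += tf.reduce_sum(u_hat * v[:, None], axis=-1)    # agreement update of the logits
    return v
```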

2.3. DCCaps-UNet Model Structure

The DCCaps-UNet proposed in this paper adopts an encoding–decoding structure. The encoding part is composed of the depthwise separable block (PDR-block) and the capsule block with the conditional window for information extraction. The depthwise separable block extracts the spectral (shallow) features of the image, and the spatial (deep) features are encoded by the dilated capsule block, so that the image features are fully acquired. The decoding part uses upsampling to restore the details of the image layer by layer, thus restoring the feature map to the original image size.
Figure 3 shows the network structure of the DCCaps-UNet model, which is composed of 2 PDR-blocks, 1 capsule block, 3 transposed convolutional layers, 2 fusion layers, and a softmax layer. Image information is first processed with convolution layers and then a batch normalization layer to accelerate training and improve model performance. The LeakyReLU activation function is introduced to optimize the model.
In the encoding part, an image of size H × W is first passed through the PDR-blocks to extract shallow features while expanding the number of channels; the resulting feature map is half the size of the original image. The feature map of size H/2 × W/2 × 64 is then reshaped and fed into the capsule module to further extract deep features. The PrimaryCapslayer converts the features into vectors of shape H/2 × W/2 × 4 × 16, and the number of capsules in that layer is H/2 × W/2 × 16. After passing through the DConvCapslayers, the final output shape is H/8 × W/8 × N × S; the specific structural parameters are listed in Table 1.
In the decoding part, the feature map is restored through upsampling, skip connections, and convolution layers; upsampling is carried out three times in total, using transposed convolution. The features from the encoding part are fused via concatenation so as to combine feature information at different levels and restore the size of the feature map.
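A minimal Keras sketch of one such decoding stage is shown below; the exact filter counts and where batch normalization sits relative to the concatenation are assumptions (Table 1 lists the actual shapes).

```python
# Minimal Keras sketch of one decoder stage: transposed-convolution upsampling,
# BN + LeakyReLU, concatenation with the matching encoder feature map, and a
# 3x3 convolution to merge the fused features.
from tensorflow.keras import layers

def decoder_stage(x, skip, filters):
    x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)  # upsample x2
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    if skip is not None:
        x = layers.Concatenate()([x, skip])    # skip connection with encoder features
    return layers.Conv2D(filters, 3, padding="same")(x)
```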

2.4. Loss Function

To improve the model, the Margin loss and Dice loss are summed to obtain a new loss function for supervised learning. The Margin loss in the capsule block is defined as:
$Loss_M = T_c \max(0,\, m^+ - \| v_c \|)^2 + \lambda (1 - T_c) \max(0,\, \| v_c \| - m^-)^2$
where $T_c$ indicates whether category $c$ is present ($T_c = 1$ when the object exists); $m^+ = 0.9$, $m^- = 0.1$, and $\lambda$ is generally set to 0.5; and $\| v_c \|$ represents the probability that the capsule unit belongs to this category.
The Dice loss is used to evaluate the training results. It measures the overlap between the prediction and the ground truth and can alleviate the problem of sample imbalance. The Dice loss is expressed as follows:
$Loss_D = 1 - \dfrac{2TP}{2TP + FP + FN} = 1 - \dfrac{2 |X \cap Y|}{|X| + |Y|}$
where TP, FP, and FN represent the numbers of correctly classified target pixels, background pixels falsely classified as target, and target pixels falsely classified as background, respectively; X represents the prediction set of the target class; and Y represents the set of true labels.
The sum of Margin loss and Dice loss is taken as the loss function of this model. In the training process, parameter ε is used to balance the influences of the two functions:
$Loss = \varepsilon \, Loss_M + (1 - \varepsilon) \, Loss_D$
where ε is a floating-point number between 0 and 1.
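A minimal sketch of this combined loss is given below as plain TensorFlow functions; the per-batch reductions and the default ε = 0.5 are our assumptions, not values fixed by the paper.

```python
# Minimal sketch of the combined loss: Loss = eps * Margin + (1 - eps) * Dice.
import tensorflow as tf

def margin_loss(y_true, v_len, m_pos=0.9, m_neg=0.1, lam=0.5):
    """y_true: one-hot labels (batch, classes); v_len: capsule lengths (batch, classes)."""
    pos = y_true * tf.square(tf.maximum(0.0, m_pos - v_len))
    neg = lam * (1.0 - y_true) * tf.square(tf.maximum(0.0, v_len - m_neg))
    return tf.reduce_mean(tf.reduce_sum(pos + neg, axis=-1))

def dice_loss(y_true, y_pred, smooth=1e-7):
    inter = tf.reduce_sum(y_true * y_pred)
    return 1.0 - (2.0 * inter + smooth) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)

def combined_loss(y_true, y_pred, eps=0.5):
    return eps * margin_loss(y_true, y_pred) + (1.0 - eps) * dice_loss(y_true, y_pred)
```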

3. Experimental Results and Analysis

The experiments in this paper involve data preprocessing, dataset division, DCCaps-UNet model training, performance evaluation, and segmentation prediction, as shown in Figure 4. First, the original hyperspectral data are preprocessed and the dataset is randomly divided into training, validation, and test sets. Then, the proposed model is trained and evaluated. Finally, the trained model is used to segment and predict hyperspectral images.

3.1. Dataset

3.1.1. Indian Pines Dataset (Dataset A)

The Indian Pines dataset (referred to as A in this paper) [43] was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over a test site in northwestern Indiana. The annotated image has a size of 145 × 145 pixels, a wavelength range of 0.4–2.5 µm, 16 land-cover classes, and 220 spectral bands. After 20 bands affected by water vapor were excluded, the remaining 200 bands were used in the experiments.

3.1.2. Pavia University Dataset (Dataset B)

The Pavia University dataset (referred to as B in this paper) [44] was captured by the ROSIS-03 sensor over the University of Pavia in northern Italy. The dataset generally uses images generated from 103 spectral bands, with a size of 610 × 340 (207,400 pixels in total), of which 42,776 pixels are labeled with nine types of ground objects. Figure 5 shows the false-color images and ground-truth maps of the two datasets. The data can be obtained from the Hyperspectral Remote Sensing Scenes repository [44].

3.2. Evaluation Indexes

In order to quantitatively evaluate segmentation effects, DCCaps-UNet was evaluated with different indexes, including overall accuracy (OA), average accuracy (AA), kappa coefficient, Dice coefficient, and mean intersection over union (mIoU). The indexes are calculated as:
$Dice = \dfrac{2TP}{(TP + FN) + (TP + FP)}$
$mIoU = \dfrac{1}{k+1} \sum_{i=0}^{k} \dfrac{P_{ii}}{\sum_{j=0}^{k} P_{ij} + \sum_{j=0}^{k} P_{ji} - P_{ii}}$
where $P_{ij}$ represents the number of pixels of category $i$ predicted as category $j$; $P_{ii}$ corresponds to true positives (TP), $P_{ij}$ ($i \neq j$) to false negatives (FN), and $P_{ji}$ ($i \neq j$) to false positives (FP). It is assumed that there are $k+1$ classes in total ($k$ target classes and 1 background class). The larger these indexes are, the better the segmentation effect.
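For completeness, the sketch below computes Dice and mIoU from a confusion matrix P, following the definitions above; it is an illustrative helper rather than the evaluation code used by the authors.

```python
# Minimal sketch: Dice and mIoU from a confusion matrix P, where P[i, j] is the
# number of pixels of class i predicted as class j.
import numpy as np

def dice_and_miou(P: np.ndarray, eps: float = 1e-12):
    tp = np.diag(P).astype(float)
    fn = P.sum(axis=1) - tp       # class-i pixels predicted as some other class
    fp = P.sum(axis=0) - tp       # other-class pixels predicted as class i
    dice = np.mean(2 * tp / (2 * tp + fp + fn + eps))
    miou = np.mean(tp / (tp + fp + fn + eps))
    return dice, miou
```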

3.3. Experimental Setting

In this paper, the input pixel blocks had a size of N × N × 30 and were randomly divided into five training/test folds for k-fold cross-validation. Model training was carried out with the open-source deep learning frameworks Keras and TensorFlow, using the custom loss function and the Adam optimizer under the following conditions: batch size of 256, initial learning rate of 0.001, weight decay rate of 1 × 10⁻³, and 100 training epochs.

3.4. Experimental Results and Analysis

To further evaluate the performance of the proposed model, the U-Net [40], SS3FCN [36], HybridSN [45], SpectralNet [46], and JigsawHSI [47] algorithms were selected for comparison. U-Net is the original semantic segmentation network, and this paper also builds its model on this architecture. SS3FCN is a three-dimensional spectral–spatial fully convolutional network based on the classical semantic segmentation network FCN. HybridSN is currently the most widely used hyperspectral image classification network; it effectively extracts joint spatial–spectral features by combining 2D-CNN and 3D-CNN. SpectralNet and JigsawHSI are the two models with the highest reported accuracy for hyperspectral image classification: SpectralNet is a wavelet CNN, a variant of the 2D CNN, used for multi-resolution HSI classification, and JigsawHSI is a hyperspectral classification model based on the Inception structure. Table 2 and Table 3 show the precision comparison for the different categories of the two datasets.
As shown in Table 2, U-Net yielded many misclassified targets on dataset A, and its segmentation accuracy was not satisfactory for any category. In the results of the SS3FCN network, the classification accuracy of some categories with fewer samples was relatively low. The HybridSN network made slightly more errors in the Soybean-mintill category but achieved ideal accuracy in the other categories. The OA values of HybridSN, SpectralNet, JigsawHSI, and the proposed model are all higher than 99%, with good segmentation effects.
The proposed algorithm achieved higher segmentation accuracy for all kinds of ground objects, and its segmentation effect was generally better than that of the other algorithms, except for slight misclassification in the Corn-mintill and Corn categories. This slight difference might be ascribed to the interaction between the background and these categories. Its AA value is 0.12% lower than that of SpectralNet, a very small difference, while its OA and kappa coefficients are higher than those of SpectralNet. On the other overall indicators, the OA and kappa coefficients exceed those of all comparison algorithms, achieving higher overall segmentation accuracy.
This contrast between the proposed algorithm and comparison algorithms was more significant in the results of dataset B. DCCaps-UNet and JigsawHSI achieved 100% in each category and in the overall indicator (Table 3). For dataset B, DCCaps-UNet also yielded satisfactory segmentation accuracy and better segmentation results than other segmentation algorithms in terms of OA, AA, and kappa.
To visually demonstrate the segmentation performance of the proposed algorithm, a visual comparison was made (Figure 6 and Figure 7), together with the corresponding label images. The visualization results of the delineated areas are compared below.
On dataset A, the classification results of DCCaps-UNet generally contained little noise and few misclassified pixels. The segmentation results of U-Net and SS3FCN were not satisfactory: many misclassified points appeared in the delineated areas, so the overall result maps looked fragmented. The segmentation results of JigsawHSI were relatively clean, but some errors still existed, especially for the pixels in the areas delineated by red squares in Figure 6. The segmentation results of the proposed algorithm were the closest to the original labels and were smoother. On dataset B, U-Net and SS3FCN could correctly classify most pixels, but a large number of scattered spots remained. HybridSN and SpectralNet produced fewer spots and only a few misclassifications. The segmentation results of DCCaps-UNet were almost completely correct, and noise points were barely observed. The final results were basically consistent with the label maps for the different categories of ground objects, and the identified boundaries were also relatively clear. The segmentation results were consistent with those in Table 2 and Table 3, indicating that the proposed algorithm performed well on different types of ground objects.
To display the segmentation results of the different algorithms more clearly, data statistics and comparative analysis were performed with two indexes commonly used to evaluate semantic segmentation: mIoU and the Dice coefficient. Figure 8 shows the variations of the mIoU and Dice coefficients with training epoch.
Through statistics and comparative analysis of the experimental results, it was found that the proposed model achieved significant segmentation results. The mIoU values of DCCaps-UNet were higher than those of the other five algorithms overall and converged faster (Figure 8). After 100 epochs, the mIoU values were in the range of 0.9 to 1.0 and almost reached 1. The Dice coefficient values of DCCaps-UNet were generally located toward the higher end; the Dice coefficient increased significantly during the first 35 training epochs and converged in the subsequent 65 epochs (Figure 8c). The contour became obviously wider when it reached 0.99 and above, suggesting a relatively concentrated interval (Figure 8d).
Figure 9 shows the distribution of the mIoU values more clearly. The model tended to converge after about 50 iterations. In Figure 9b, most scattered points are distributed at the top. The DCCaps-UNet model showed the smallest difference between correct and incorrect classification results. The changes in the Dice coefficient (Figure 8c) indicated that DCCaps-UNet performed better than the other algorithms. Its data distribution interval was also higher than that of the other algorithms, and the Dice coefficient almost reached 1. The Dice coefficient of the proposed algorithm was higher than that of the comparison algorithms, reaching 0.9986, which suggests almost completely correct segmentation. Compared with the other five networks, DCCaps-UNet improved all the indexes.

4. Discussion

4.1. The Influence of Each Module on the Model Performance

To verify the effectiveness of each module of the proposed algorithm and of the adopted loss function, ablation experiments were conducted on dataset B. The influences of the depthwise separable block and the capsule block were compared to analyze the encoding part. In addition, the new loss function was introduced to replace the common cross-entropy loss function. The overall effects of the model before and after combining each module were compared. The experimental settings are provided in Table 4.
The segmentation effect was not satisfactory when only the U-Net network architecture was used (Figure 10). After the introduction of the capsule block, however, all indexes were significantly improved: Dice increased by 0.1408 and mIoU increased by 0.1552. The segmentation effect after the introduction of the depthwise separable block was then tested. After both modules were introduced, the whole algorithm achieved the best segmentation performance (mIoU = 0.9992, Dice = 0.9987), indicating the effectiveness of each module. The introduction of the combined loss function led to a further increase in mIoU. After the introduction of the depthwise separable block, the capsule block, and the combined loss function, the mIoU and Dice values increased by 0.9 and 0.74, respectively. The experimental results showed that the proposed method, DCCaps-UNet, significantly improves semantic segmentation performance.
Figure 11 shows the training and validation losses under different training epochs. As the number of iterations increased, the training loss and validation loss of the model decreased gradually, and problems such as underfitting, overfitting, and gradient vanishing or explosion did not occur. The model with the highest performance on the validation set was retained. As can be seen from the figure, the model converges very quickly, and it converges faster on dataset A than on dataset B.

4.2. Impact of Patch Size on Model Accuracy

The choice of patch size is an important factor affecting the accuracy of the model. If the size chosen is too small, the model will not be trained correctly; if the size chosen is too large, it will increase the training time of the model. Therefore, in order to choose the best patch size, we chose different sized windows for comparison. Figure 12 shows the accuracy of the two datasets for different patch sizes.
As can be seen from Figure 12a, when the patch size increases to 11, the model has the highest OA value on dataset A. AA reaches its highest value when the patch size increases to 23, but the OA is then smaller than at a patch size of 11. It can be seen from Figure 12b that the accuracy of the model is highest at a patch size of 13. After that, the accuracy on the Pavia University dataset decreases slightly as the patch size continues to increase. As can be seen from the figure, the model achieves good results for various patch sizes. It can therefore adapt to patches of different scales and has good segmentation ability in small-sample scenarios.

4.3. Model Run Time Comparison

Table 5 compares the training time, test time, and number of parameters of the models. DCCaps-UNet shows significant advantages on both datasets. On dataset A, all of its metrics are the best. On dataset B, its training time is slightly longer than that of SpectralNet, but, as mentioned earlier, its accuracy is higher than that of SpectralNet. It is worth noting that on dataset B, while achieving the same accuracy as JigsawHSI with a similar training time, DCCaps-UNet has roughly 1/64 of its parameters. This also validates the superiority of our model from another perspective.
According to the experimental results and various indexes, DCCaps-UNet has good performance and is an effective advanced hyperspectral image semantic segmentation method.

5. Conclusions

In this paper, DCCaps-UNet, a U-shaped hyperspectral semantic segmentation model based on a depthwise separable and conditional convolution capsule network, is proposed for hyperspectral image semantic segmentation. The model is based on the U-Net encoder–decoder structure and first defines a depthwise separable block to extract spectral features, which applies pointwise convolution followed by depthwise convolution and enhances the model’s generalization ability. The capsule block is then used to extract spatial features, retaining the spatial structural characteristics of objects and making the processing of small-sample scenes more accurate. A new dynamic routing mechanism with a conditional window is proposed to connect the input and output capsules and adjust their weights. Finally, the extracted features are uniformly encoded and fed into the decoding network for image reconstruction.
Compared with the other five algorithms on the two datasets, the experimental results showed that the DCCaps-UNet model greatly improves segmentation accuracy and segmentation quality relative to currently used algorithms. The model converges quickly, can adapt to multi-scale patch sizes, and outperforms models of comparable accuracy in terms of parameter count and running time. In addition, the spatial information of hyperspectral images is used effectively, reducing the training time of the model. By applying the U-Net architecture to hyperspectral images, the segmentation accuracy for hyperspectral images with small samples is effectively improved. DCCaps-UNet provides a new method for precision agriculture, urban planning, vegetation monitoring, mineral exploration, water body identification, and national defense.
In the future, we will further explore the correlation between multi-band features, investigate pixel-level segmentation models that do not rely on dimensionality reduction, and study methods of multivariate data fusion, so as to enhance the representation ability of the model for images of different scales. We hope that the proposed method will serve as a strong foundation for future research on hyperspectral image semantic segmentation.

Author Contributions

Conceptualization, S.W. and L.G.; methodology, S.W.; software, S.W. and Y.L.; formal analysis, M.L. and S.W.; writing—original draft preparation, S.W.; writing—review and editing, L.G., M.L., S.W. and H.H.; visualization, S.W., M.L. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 51663001, 52063002, 42061067.

Data Availability Statement

The codes are available free of charge at GitHub. (https://github.com/wsq0923/dccaps-unet.git, accessed on 16 June 2023).

Acknowledgments

The authors thank the anonymous reviewers and editors for their valuable comments and constructive suggestions.

Conflicts of Interest

No potential conflict of interest was reported by authors.

References

  1. Goetz, A.F. Three decades of hyperspectral remote sensing of the earth: A personal view. Remote Sens. Environ. 2009, 113, S5–S16. [Google Scholar] [CrossRef]
  2. Chen, L.; Chen, S.; Guo, X. Multilayer nmf for blind unmixing of hyperspectral imagery with additional constraints. Photogramm. Eng. Remote Sens. 2017, 83, 307–316. [Google Scholar] [CrossRef]
  3. Zhao, C.; Zhang, L.; Cheng, B. A local mahalanobis-distance method based on tensor decomposition for hyperspectral anomaly detection. Geocarto Int. 2019, 34, 490–503. [Google Scholar] [CrossRef]
  4. Li, C.; Wang, Y.; Zhang, X.; Gao, H.; Yang, Y.; Wang, J. Deep belief network for spectral-spatial classification of hyperspectral remote sensor data. Sensors 2019, 19, 204. [Google Scholar] [CrossRef] [Green Version]
  5. Kale, K.V.; Solankar, M.M.; Nalawade, D.B.; Dhumal, R.K.; Gite, H.R. A research review on hyperspectral data processing and analysis algorithms. Proc. Natl. Acad. Sci. India Sect. A-Phys. Sci. 2017, 87, 541–555. [Google Scholar] [CrossRef]
  6. Govender, M.; Chetty, K.; Bulcock, H.J.W.S.A. A review of hyperspectral remote sensing and its application in vegetation and water resource studies. Water Sa 2007, 33, 145–151. [Google Scholar] [CrossRef] [Green Version]
  7. Terentev, A.; Dolzhenko, V.; Fedotov, A.; Eremenko, D. Current state of hyperspectral remote sensing for early plant disease detection: A review. Sensors 2022, 22, 757. [Google Scholar] [CrossRef]
  8. Qin, P.; Cai, Y.; Wang, X. Small waterbody extraction with improved u-net using zhuhai-1 hyperspectral remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  9. Hennessy, A.; Clarke, K.; Lewis, M. Generative adversarial network synthesis of hyperspectral vegetation data. Remote Sens. 2021, 13, 2243. [Google Scholar] [CrossRef]
  10. Lu, B.; Dao, P.D.; Liu, J.; He, Y.; Shang, J. Recent advances of hyperspectral imaging technology and applications in agriculture. Remote Sens. 2020, 12, 2659. [Google Scholar] [CrossRef]
  11. Ma, D.; Maki, H.; Neeno, S.; Zhang, L.; Wang, L.; Jin, J. Application of non-linear partial least squares analysis on prediction of biomass of maize plants using hyperspectral images. Biosyst. Eng. 2020, 200, 40–54. [Google Scholar] [CrossRef]
  12. Zhou, W.; Yang, H.; Xie, L.; Li, H.; Huang, L.; Zhao, Y.; Yue, T. Hyperspectral inversion of soil heavy metals in three-river source region based on random forest model. Catena 2021, 202, 105222. [Google Scholar] [CrossRef]
  13. Wei, X.; Li, W.; Zhang, M.; Li, Q. Measurement, Medical hyperspectral image classification based on end-to-end fusion deep neural network. IEEE Trans. Instrum. Meas. 2019, 68, 4481–4492. [Google Scholar] [CrossRef]
  14. Cui, R.; Yu, H.; Xu, T.; Xing, X.; Cao, X.; Yan, K.; Chen, J. Deep learning in medical hyperspectral images: A review. Sensors 2022, 22, 9790. [Google Scholar] [CrossRef]
  15. Chakraborty, R.; Sushil, R.; Garg, M.L. Hyper-spectral image segmentation using an improved pso aided with multilevel fuzzy entropy. Multimed. Tools Appl. 2019, 78, 34027–34063. [Google Scholar] [CrossRef]
  16. Ismail, M.J.A. Segment-based clustering of hyperspectral images using tree-based data partitioning structures. Algorithms 2020, 13, 330. [Google Scholar] [CrossRef]
  17. Noyel, G.; Angulo, J.; Jeulin, D. Morphological segmentation of hyperspectral images. Image Anal. Ster. 2007, 26, 101–109. [Google Scholar] [CrossRef] [Green Version]
  18. Mercier, G.; Derrode, S.; Lennon, M. Hyperspectral image segmentation with markov chain model. In Proceedings of the IEEE International Geoscience & Remote Sensing Symposium 2003, Toulouse, France, 21–25 July 2003. [Google Scholar]
  19. Acito, N.; Corsini, G.; Diani, M. An unsupervised algorithm for hyperspectral image segmentation based on the gaussian mixture model. In Proceedings of the 2003 IEEE International Geoscience and Remote Sensing Symposium, (IGARSS ’03), Toulouse, France, 21–25 July 2003. [Google Scholar]
  20. Li, J.; Bioucas-Dias, J.M.; Plaza, A. Hyperspectral image segmentation using a new bayesian approach with active learning. IEEE Trans. Geosci. Remote Sens. 2011, 49, 3947–3960. [Google Scholar] [CrossRef] [Green Version]
  21. Li, J.; Bioucas-Dias, J.M.; Plaza, A. Spectral-spatial hyperspectral image segmentation using subspace multinomial logistic regression and markov random fields. IEEE Trans. Geosci. Remote Sens. 2012, 50, 809–823. [Google Scholar] [CrossRef]
  22. Wu, Y.; Lin, L.; Wang, J.; Wu, S. Application of semantic segmentation based on convolutional neural network in medical images. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi J. Biomed. Eng. 2020, 37, 533–540. [Google Scholar] [CrossRef]
  23. Geng, Q.; Zhou, Z.; Cao, X. Survey of recent progress in semantic image segmentation with cnns. Sci. China-Inf. Sci. 2018, 61, 051101. [Google Scholar] [CrossRef] [Green Version]
  24. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep convolutional neural networks for hyperspectral image classification. J. Sens. 2015, 2015, 258619. [Google Scholar] [CrossRef] [Green Version]
  25. Makantasis, K.; Karantzalos, K.; Doulamis, A.; Doulamis, N. Deep supervised learning for hyperspectral data classification through convolutional neural networks. In Proceedings of the Geoscience & Remote Sensing Symposium 2015, Milan, Italy, 26–31 July 2015. [Google Scholar]
  26. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [Google Scholar] [CrossRef] [Green Version]
  27. Song, W.; Li, S.; Fang, L.; Lu, T. Hyperspectral image classification with deep feature fusion network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3173–3184. [Google Scholar] [CrossRef]
  28. Lee, H.; Kwon, H. Going deeper with contextual cnn for hyperspectral image classification. IEEE Trans. Image Process. 2017, 26, 4843–4855. [Google Scholar] [CrossRef] [Green Version]
  29. Liu, B.; Yu, X.; Zhang, P.; Yu, A.; Fu, Q.; Wei, X. Supervised deep feature extraction for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2017, 56, 1909–1921. [Google Scholar] [CrossRef]
  30. Cao, X.; Zhou, F.; Xu, L.; Meng, D.; Xu, Z.; Paisley, J. Hyperspectral image classification with markov random fields and a convolutional neural network. IEEE Trans. Image Process. 2018, 27, 2354–2367. [Google Scholar] [CrossRef] [Green Version]
  31. Li, Y.; Xu, Q.; Li, W.; Nie, J. Automatic clustering-based two-branch cnn for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 7803–7816. [Google Scholar] [CrossRef]
  32. Li, Y.; Zhang, H.; Shen, Q. Spectral–spatial classification of hyperspectral imagery with 3d convolutional neural network. Remote Sens. 2017, 9, 67. [Google Scholar]
  33. Xu, Q.; Xiao, Y.; Wang, D.; Luo, B. Csa-mso3dcnn: Multiscale octave 3d cnn with channel and spatial attention for hyperspectral image classification. Remote Sens. 2020, 12, 188. [Google Scholar] [CrossRef] [Green Version]
  34. Ghaderizadeh, S.; Abbasi-Moghadam, D.; Sharifi, A.; Zhao, N.; Tariq, A. Hyperspectral image classification using a hybrid 3d-2d convolutional neural networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 7570–7588. [Google Scholar] [CrossRef]
  35. Nalepa, J.; Myller, M.; Kawulok, M. Validating hyperspectral image segmentation. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1264–1268. [Google Scholar] [CrossRef] [Green Version]
  36. Zou, L.; Zhu, X.; Wu, C.; Liu, Y.; Qu, L. Spectral–spatial exploration for hyperspectral image classification via the fusion of fully convolutional networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 659–674. [Google Scholar] [CrossRef]
  37. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic routing between capsules. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  38. Sun, L.; Song, X.; Guo, H.; Zhao, G.; Wang, J. Patch-wise semantic segmentation for hyperspectral images via a cubic capsule network with emap features. Remote Sens. 2021, 13, 3497. [Google Scholar] [CrossRef]
  39. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; Volume 39, pp. 640–651. [Google Scholar]
  40. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Cham, Switzerland, 18 November 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  41. Li, J.; Wang, H.; Zhang, A.; Liu, Y. Semantic segmentation of hyperspectral remote sensing images based on PSE-UNet model. Sensors 2022, 22, 9678. [Google Scholar] [CrossRef] [PubMed]
  42. Soucy, N.; Sekeh, S.Y. CEU-Net: Ensemble semantic segmentation of hyperspectral images using clustering. J. Big Data 2023, 10, 43. [Google Scholar] [CrossRef]
  43. Baumgardner, M.F.; Biehl, L.L.; Landgrebe, D.A. 220 Band Aviris Hyperspectral Image Data Set: June 12, 1992 Indian Pine Test Site 3; Purdue University Research Repository: West Lafayette, IN, USA, 2015. [Google Scholar] [CrossRef]
  44. University of Pavia Dataset. Available online: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Pavia_University_scene (accessed on 16 June 2023).
  45. Roy, S.K.; Krishna, G.; Dubey, S.R.; Chaudhuri, B.B. Hybridsn: Exploring 3-d–2-d cnn feature hierarchy for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2020, 17, 277–281. [Google Scholar] [CrossRef] [Green Version]
  46. Chakraborty, T.; Trehan, U. Spectralnet: Exploring spatial-spectral waveletcnn for hyperspectral image classification. arXiv 2021, arXiv:2104.00341. [Google Scholar]
  47. Moraga, J.; Duzgun, H.S. JigsawHSI: A network for hyperspectral image classification. arXiv 2022. [Google Scholar] [CrossRef]
Figure 1. PDR-block structure diagram.
Figure 2. Structure diagram of the capsule block with a dilated conditional constraint window.
Figure 3. Network structure diagram of DCCaps-UNet model.
Figure 4. Overall flowchart of the experiment.
Figure 5. False-color images and real label diagrams of the datasets used in this paper.
Figure 6. Visual comparison of segmentation results for dataset A.
Figure 7. Visual comparison of segmentation results (a–g) for dataset B.
Figure 8. Comparison of segmentation effects of dataset A. (a) Comparative heat map of the segmentation results of different categories obtained with six algorithms; (b) color band (the ordinates 1–6, respectively, indicate six models and the main distribution and interval ranges of the data are shown in the upper right corner); (c) comparison graph of OA, AA, and kappa coefficients; and (d) main distributions of the resulting data.
Figure 9. mIoU and Dice coefficient results of dataset B. Graphs (a–c) are the same as in Figure 8. (d) Histogram of the Dice coefficient.
Figure 10. Comparison of ablation results.
Figure 11. Training and validation loss curves.
Figure 12. Accuracy of different patch sizes for two datasets. (a) Results for different patch sizes for the Indian Pines dataset. (b) Results for different patch sizes for the Pavia University dataset.
Table 1. Network structure parameters of DCCaps-UNet model.
Type | Layer | Kernel/Stride | Shape
 | Input | – | (32, 32, 30)
Encoder | Conv_block | 3 × 3/1 | (32, 32, 64)
 | Conv pw | 1 × 1/1 | (32, 32, 32)
 | Conv dw | 3 × 3/1 | (32, 32, 32)
 | BN+LeakyReLU | – | (32, 32, 32)
 | Conv pw | 1 × 1/1 | (16, 16, 128)
 | Conv dw | 3 × 3/2 | (16, 16, 128)
 | BN+LeakyReLU | – | (16, 16, 128)
 | PrimaryCapslayer | 5 × 5/1 (routing = 1) | (16, 16, 8, 16)
 | DConvCapslayer1 | 5 × 5/2 (routing = 3) | (8, 8, 4, 16)
 | DConvCapslayer2 | 5 × 5/1 (routing = 3) | (8, 8, 4, 32)
 | DConvCapslayer3 | 1 × 1/1 (routing = 3) | (4, 4, 1, 16)
Decoder | Transpose Conv1 | 3 × 3/1 | (8, 8, 64)
 | BN+LeakyReLU | – | (8, 8, 64)
 | Conv | 3 × 3/1 | (8, 8, 64)
 | Transpose Conv2 | 3 × 3/1 | (16, 16, 64)
 | BN+LeakyReLU | – | (16, 16, 64)
 | Conv | 3 × 3/1 | (16, 16, 64)
 | Concatenate1 | – | (16, 16, 128)
 | Transpose Conv3 | 3 × 3/1 | (32, 32, 128)
 | BN+LeakyReLU | – | (32, 32, 128)
 | Concatenate2 | – | (32, 32, 256)
 | Conv+softmax | 1 × 1/1 | (None, 1, 16)
Table 2. Comparison of segmentation results of different categories in dataset A.
Class | U-Net | SS3FCN | HybridSN | SpectralNet | JigsawHSI | DCCaps-UNet
Alfalfa | 84.53 | 90.25 | 100.0 | 100 | 100 | 100
Corn-notill | 48.79 | 98.31 | 100 | 100 | 100 | 100
Corn-mintill | 99.85 | 88.4 | 99.2 | 100 | 100 | 99.25
Corn | 74.72 | 97.25 | 100 | 100 | 100 | 99.51
Grass-pasture | 87.34 | 94.9 | 100 | 99.6 | 100 | 100
Grass-trees | 97.78 | 95.99 | 100 | 100 | 99.74 | 100
Grass-pasture-mowed | 80.68 | 97.9 | 100 | 100 | 100 | 100
Hay-windrowed | 44.25 | 86.94 | 100 | 100 | 100 | 100
Oats | 37.04 | 95.14 | 100 | 100 | 100 | 100
Soybean-notill | 78.55 | 96.63 | 100 | 100 | 100 | 100
Soybean-mintill | 60.24 | 90.32 | 96.8 | 100 | 100 | 100
Soybean-clean | 73.83 | 79.45 | 100 | 98.16 | 100 | 100
Wheat | 92.16 | 75.93 | 100 | 100 | 100 | 100
Woods | 100 | 83.7 | 100 | 100 | 100 | 100
Buildings-Grass-Trees-Drives | 72.77 | 79.65 | 100 | 100 | 100 | 100
Stone-Steel-Towers | 100 | 92.42 | 100 | 100 | 100 | 100
OA (%) | 77.03 | 93.28 | 99.75 | 99.86 | 99.74 | 99.92
AA (%) | 79.5 | 92.74 | 99.71 | 99.98 | 99.7 | 99.86
kappa | 0.7383 | 0.8998 | 0.9963 | 0.9984 | 0.9811 | 0.9985
Table 3. Comparison of segmentation results of different categories in dataset B.
Class | U-Net | SS3FCN | HybridSN | SpectralNet | JigsawHSI | DCCaps-UNet
Asphalt | 94.06 | 97.48 | 100 | 100 | 100 | 100
Meadows | 98.67 | 90.86 | 100 | 100 | 100 | 100
Gravel | 78.3 | 98.75 | 100 | 100 | 100 | 100
Trees | 96.01 | 84.81 | 100 | 100 | 100 | 100
Painted metal sheets | 100 | 94.28 | 100 | 100 | 100 | 100
Bare soil | 99.8 | 83.59 | 100 | 100 | 100 | 100
Bitumen | 79.56 | 81.68 | 100 | 99.9 | 100 | 100
Self-blocking bricks | 99.58 | 88.84 | 99.5 | 100 | 100 | 100
Shadows | 99.25 | 88.68 | 100 | 100 | 100 | 100
OA (%) | 96.09 | 93.79 | 99.98 | 99.99 | 100 | 100
AA (%) | 93.47 | 92.1 | 99.98 | 99.98 | 100 | 100
kappa | 0.9480 | 0.8989 | 0.9997 | 0.9998 | 1.00 | 1.00
Table 4. Setting of ablation experiments.
Foundation Model Architecture | PDR-Block | Capsule-Block | Combination Loss | Cross Entropy Loss | Name
2D-UNet | × | × | × | √ | Base
2D-UNet | √ | × | × | √ | Base-PDR
2D-UNet | × | √ | √ | × | Base-Caps-myloss
2D-UNet | √ | √ | × | √ | Base-PDR-Caps
2D-UNet | √ | √ | √ | × | Base-PDR-Caps-myloss
Note: √ means the module is included; × means it is not.
Table 5. Comparison table of parameter quantity and running time of each model.
Model | Dataset | Parameters | Training Time (s) | Test Time (s)
U-Net | A | 7,759,554 | 8200.23 | 16.2
U-Net | B | 7,765,442 | 5200.25 | 12.98
SS3FCN | A | 846,778 | 6336.12 | 10.31
SS3FCN | B | 821,864 | 8371.32 | 11.34
HybridSN | A | 5,503,108 | 5707.86 | 8.73
HybridSN | B | 5,366,853 | 5312.18 | 9.58
SpectralNet | A | 6,805,072 | 4785.97 | 6.21
SpectralNet | B | 6,194,133 | 3964.56 | 5.05
JigsawHSI | A | 20,864,246 | 4874.86 | 5.38
JigsawHSI | B | 21,073,014 | 4335.24 | 5.93
DCCaps-UNet | A | 321,024 | 3933.87 | 5.74
DCCaps-UNet | B | 312,953 | 4273.74 | 5.93
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
