1. Introduction
With the rapid development of modern technology, hyperspectral imaging has been widely used in many fields, such as geology [1], ecology [2], geomorphology [3], atmospheric science [4] and forensic science [5], beyond its traditional home on remote sensing satellites and airborne platforms. Hyperspectral sensors capture hundreds of narrow, contiguous spectral bands, from visible to infrared wavelengths, reflected or emitted from the scene. The resulting 3D hyperspectral images (HSIs) offer high spectral resolution and fine spatial resolution of the captured scene, allowing more detailed information to be obtained about the objects under study. However, the high spectral dimensionality poses several challenges for interpretation and analysis. (1) Radiometric noise in some bands limits the precision of image processing [6]. (2) Redundant bands reduce the quality of image analysis, since adjacent spectral bands are often correlated and not all bands are valuable for image processing [7]. (3) These redundant bands also incur substantial computational and storage costs [8]. (4) The Hughes phenomenon arises, that is, classification performance deteriorates as data dimensionality grows when training samples are limited [9]. These issues make dimensionality reduction (DR) an essential task in hyperspectral image processing.
Many classic algorithms have been used for HSI DR, such as principal component analysis (PCA) [10], Laplacian eigenmaps (LE) [11], locally linear embedding (LLE) [11], isometric feature mapping (ISOMAP) [12] and linear discriminant analysis (LDA) [13]. Although based on different concepts, these classical algorithms all attempt to explore and maintain the relationships among samples in HSIs, which improves the separability of the low-dimensional features. However, several problems arise when they are applied to HSI DR. Firstly, ISOMAP, LE and LLE suffer from the out-of-sample problem. To address this issue, locality preserving projection (LPP) [14] and neighborhood preserving embedding (NPE) [15] were proposed. Nevertheless, LPP, NPE, PCA and LDA are linear transformations, which are ill-suited for HSIs because HSIs, derived from the complex light scattering of natural objects, are inherently nonlinear [16]. In addition, none of these classical algorithms extracts spatial features, even though spatial information has been shown to substantially improve HSI representation. Moreover, these algorithms capture only shallow features of HSIs via a single mapping and cannot extract deep, complex features iteratively.
In recent years, deep learning, one of the most popular families of learning algorithms, has been applied to various fields; it can yield more nonlinear and more abstract deep representations of data through multiple processing layers [17]. Spatial feature extraction is generally achieved by convolutional neural networks (CNNs), which exploit sets of trainable filters to capture local spatial features from receptive fields, but usually require supervised information. Many studies have applied CNNs to HSIs [18]. Paoletti et al. [19] proposed a new deep convolutional neural network for fast hyperspectral image classification. Zhong et al. [20] proposed a supervised spectral-spatial residual network for HSIs on the basis of 3D convolutional layers. Han et al. [21] proposed a different-scale two-stream convolutional network for HSIs. These CNN-based methods can extract superior hyperspectral image features for classification, but they generally require sufficient labeled samples for supervised learning. In fact, labeling every pixel in an HSI is arduous and time-consuming, generally requiring a human expert. As a result, labeled samples for HSIs are scarce and limited, and even unavailable in some scenarios. To address this issue, a few unsupervised CNN-based methods have been proposed for HSIs. Mou et al. [22] proposed a deep residual conv-deconv network for unsupervised spectral-spatial feature learning. Zhang et al. [23] proposed a novel modified generative adversarial network for unsupervised feature extraction in HSIs. Recently, Zhang et al. [24] proposed a symmetric all-convolutional-network-based unsupervised feature extraction method for HSIs. However, these unsupervised CNN-based approaches are usually based on data reconstruction and lack an explicit pursuit of discriminability, which is usually the primary goal of DR.
To overcome the drawbacks mentioned above, we propose an unsupervised deep fully convolutional embedding network (DFCEN) for dimensionality reduction of HSIs. Different from conventional CNN-based networks, DFCEN replaces the fixed down-sampling (up-sampling) of pooling layers with the learnable parameters of convolutional (deconvolutional) layers to improve the validity of the representation. Meanwhile, the parameter sharing of convolutional layers is conducive to the extraction of spatial features and reduces the number of parameters compared with fully-connected layers. For convenience of explanation, DFCEN can be divided into two parts: a convolutional subnetwork that encodes high-dimensional data into a low-dimensional space, and a deconvolutional subnetwork that recovers the low-dimensional features back to the original high-dimensional data. Accordingly, the network structure of DFCEN lays a foundation for unsupervised learning.
To address the shortcoming of the above unsupervised CNN-based approaches, we introduce a specific learning task of enhancing feature discriminability into DFCEN. Considering both the completeness and the discriminability of the low-dimensional data, we design a novel objective function containing two terms: a reconstruction term and an embedding term for the specific learning task. The former makes the low-dimensional features preserve completeness and the original intrinsic information in HSIs. The key point of the latter is how to design a specific learning task that enhances the discriminability and separability of the low-dimensional features. The relationships among samples are of considerable value; they are central to the classical DR algorithms described above and have been shown to be conducive to HSI DR. In this paper, the DR concepts of two classical algorithms, LLE and LE, serve as references for the specific learning task in the embedding term. Furthermore, in order to balance the contributions of the two terms to DR, an adjustable trade-off parameter is added to the objective function. In addition, in order to reduce the training time, we utilize convolutional autoencoders (CAEs) for pretraining to obtain good initial learning parameters for DFCEN.
Specifically, the contributions of this paper are as follows.
An end-to-end symmetric fully convolutional network, DFCEN, is proposed for HSI DR, which is the foundation of unsupervised learning. In addition, owing to the symmetry of DFCEN, symmetrical layers in the convolutional and deconvolutional subnetworks have the same structure. Therefore, these two subnetworks can share the same pretraining parameters, which saves pretraining time.
A novel objective function with two terms, each constraining different layers, is designed for DFCEN. This allows DFCEN to explore not only completeness but also discriminability, in contrast to previous unsupervised CNN-based approaches.
This is the first work to introduce LLE and LE into an unsupervised fully convolutional network, which simultaneously solves their out-of-sample, linear transformation and spatial feature extraction problems. In addition, other DR concepts can also be implemented in the embedding term as long as they can be expressed in the form of an objective function.
Due to the limited training samples, inherent complexity and presence of noise bands in HSIs, DFCEN, as an unsupervised network, is sensitive to the input data. Therefore, a preprocessing strategy of removing noise bands is adopted, which is shown to effectively improve the DFCEN representation of HSIs.
This paper is organized as follows. Section 2 introduces the background and related works. The proposed deep fully convolutional embedding network is described in detail in Section 3. Section 4 presents the experimental results on three datasets that demonstrate the superiority of the proposed DR method. Section 5 concludes the paper.
3. The Proposed Method
In this section, we introduce our proposed method in detail. The flowchart is shown in Figure 2. Owing to changes in atmospheric conditions, occlusion caused by clouds, changes in lighting and other environmental disturbances, some noise bands in HSIs increase the difficulty of feature extraction and classification. As an unsupervised network, DFCEN is sensitive to these noisy spectral bands because of the limited training samples and the complex intrinsic features of HSIs. For this reason, a simple band selection based on mutual information is first adopted to identify and remove the noise bands. Then, the relationships among samples are obtained for the specific learning task, which in this paper is based on LLE and LE. Next, training samples specifically suited to DFCEN are generated through data preprocessing. Afterwards, DFCEN learns from the training samples and the relationships among samples. Eventually, the low-dimensional features from DFCEN are classified by classifiers.
3.1. Data Preprocessing
Data preprocessing includes data standardization, data denoising and data expansion. Data standardization scales the pixel values of each spectral band to 0∼1, since it is not appropriate to directly process the raw HSI data with large pixel values. Data denoising selects and removes the noisy spectral bands that may disturb feature extraction and classification. Mutual information (MI) can evaluate the contribution of each band to classification [8]. Besides, due to its simplicity of calculation, MI is adopted to search for bands that contribute little to classification, which are treated as noise bands. Each band in an HSI is considered as a random variable. Its probability distribution function can be estimated as $p_j(l) \approx h_j(l)/(MN)$, where $h_j(l)$ represents the gray-level histogram of the $j$th band with $MN$ pixels. The joint probability distribution of any two bands in an HSI is estimated by $p_{i,j}(l_1, l_2) \approx h_{i,j}(l_1, l_2)/(MN)$, where $h_{i,j}$ is the joint gray-level histogram of the $i$th and $j$th bands.
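As a concrete illustration, the MI between two bands can be estimated from exactly these normalized (joint) gray-level histograms. The sketch below, in Python with NumPy, is a minimal version under stated assumptions: the uniform quantization into `levels` gray levels and the bin count itself are illustrative choices, not values taken from the paper.

```python
import numpy as np

def band_mutual_information(band_i, band_j, levels=64):
    """Estimate MI between two bands from their gray-level histograms.

    Probabilities are estimated as normalized histograms (p ~ h / (M*N)),
    mirroring the estimation described in the text. The quantization into
    `levels` gray levels is an assumption made for illustration.
    """
    # Quantize each band to `levels` gray levels.
    qi = np.digitize(band_i, np.linspace(band_i.min(), band_i.max(), levels))
    qj = np.digitize(band_j, np.linspace(band_j.min(), band_j.max(), levels))
    joint, _, _ = np.histogram2d(qi.ravel(), qj.ravel(), bins=levels)
    pij = joint / joint.sum()            # joint distribution p_{i,j}
    pi = pij.sum(axis=1, keepdims=True)  # marginal of band i
    pj = pij.sum(axis=0, keepdims=True)  # marginal of band j
    nz = pij > 0                         # avoid log(0)
    return float((pij[nz] * np.log(pij[nz] / (pi @ pj)[nz])).sum())
```

A band compared with itself yields a high MI value, while a band compared with independent noise yields a value near zero, which is the property exploited to flag low-MI noise bands.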
Figure 3 shows the MI values of each band in the three datasets. As we can see, the two lines fluctuate almost identically. For this reason, we can find and remove the noise bands with low MI in an unsupervised way, according to the red dotted line. For raw HSI data $X \in \mathbb{R}^{M \times N \times B}$, where $M$ and $N$ are the spatial dimensions and $B$ is the raw number of spectral bands, the corresponding de-noised data can be expressed as $\tilde{X} \in \mathbb{R}^{M \times N \times b}$, where $b$ is the number of bands after removing the noise bands and $b \le B$. In practice, we removed 30 noise bands for the Indian Pines dataset, 0 bands for the Pavia University dataset and 8 bands for the Salinas dataset. In order to further verify the validity of removing the noise bands before DFCEN, we take the Indian Pines dataset as an example and compare the classification accuracy of different dimensionality reduction algorithms before and after removing the noise bands. In Table 1, NBS means that the algorithm acts directly on the raw data, while BS denotes removing the noise bands before the dimensionality reduction algorithm. It can be seen from Table 1 that, for the two neural-network-based unsupervised methods, DFCEN and SAE, removing the noise bands is conducive to improving classification accuracy. It also slightly improves the other dimensionality reduction algorithms.
Spatial features have been proven to be beneficial for improving the representation of HSIs and increasing interpretation accuracy [35,36]. For each pixel, its neighboring pixels carry some of the most important spatial information, which is fed to DFCEN in the form of a neighborhood window centered on that pixel. With this in mind, the input data size of DFCEN is designed as $s \times s \times b$, where $s$ is the size of the neighborhood window and $b$ is the number of bands. However, the neighborhood windows of pixels at the image boundary are incomplete. These boundary pixels cannot be ignored, since our goal is to reduce the dimensionality of every pixel in the HSI. It is also inappropriate to simply fill the neighborhood windows of boundary pixels with 0. To deal with this problem better, we implement a data expansion strategy based on the Manhattan distance to fill the neighborhood windows of the boundary pixels.
Figure 4 shows the process of expanding the data by two layers, where the dark color is the original data and the light color is the filled data. For a pixel $x_i$ in a de-noised HSI $\tilde{X}$ (with $n = MN$ pixels), its neighborhood window is a training sample $t_i \in \mathbb{R}^{s \times s \times b}$ that is fed to the proposed DFCEN. As a result, a training sample set $T = \{t_i\}_{i=1}^{n}$ with $n$ samples can be generated from a de-noised HSI $\tilde{X}$.
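The windowing step can be sketched as follows. This is a minimal version: mirror (reflect) padding stands in for the Manhattan-distance expansion of Figure 4, whose exact fill rule at corners may differ, and the function name is of course hypothetical.

```python
import numpy as np

def make_training_samples(hsi, s):
    """Cut an s-by-s neighborhood window around every pixel of a de-noised
    HSI of shape (M, N, b), yielding n = M*N training samples of size
    s x s x b. Boundary windows are completed by mirror padding, used here
    as a simple stand-in for the Manhattan-distance expansion."""
    r = s // 2
    padded = np.pad(hsi, ((r, r), (r, r), (0, 0)), mode="reflect")
    M, N, b = hsi.shape
    samples = np.empty((M * N, s, s, b), dtype=hsi.dtype)
    for i in range(M):
        for j in range(N):
            # window centered on pixel (i, j); padding shifts indices by r
            samples[i * N + j] = padded[i:i + s, j:j + s, :]
    return samples
```

Each returned window is centered on its pixel, so the central entry of sample `i * N + j` equals the original pixel `hsi[i, j]`.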
3.2. Structure of DFCEN
DFCEN is composed of convolutional and deconvolutional layers, with no pooling or fully-connected layers. Accordingly, DFCEN can be divided into two parts: a convolutional subnetwork and a deconvolutional subnetwork. In the convolutional subnetwork, the input data are propagated through multiple convolutional layers to a perception layer; in the deconvolutional subnetwork, this perception layer is propagated through multiple deconvolutional layers to an output layer whose size is the same as that of the input layer.
Figure 5 shows the network structure of DFCEN. The red box gives the name and structure of each layer, while the green box gives the names of the learning parameters and the filter sizes. It is worth emphasizing that DFCEN is a symmetric, end-to-end network whose number of layers can be set or changed according to the specific data or task. For the sake of explanation, we take the 7-layer DFCEN shown in Figure 5 as an example and describe its network structure in detail below.
In the convolutional subnetwork, firstly, a training sample $t_i \in \mathbb{R}^{s \times s \times b}$ is fed to DFCEN, where $b$ is also the number of channels of the input layer. Secondly, the output of the input layer is sent to the first convolutional layer $C_1$ through $n_1$ filters of size $k_1 \times k_1 \times b$. The output of $C_1$ contains $n_1$ feature maps, which are then transmitted to the second convolutional layer $C_2$ via $n_2$ filters of size $k_2 \times k_2 \times n_1$. Next, $n_2$ feature maps are obtained after $C_2$ is activated, and these are then sent to the last convolutional layer $C_3$ by $d$ filters of size $k_3 \times k_3 \times n_2$. The last convolutional layer $C_3$ is also the central layer of the whole DFCEN. Eventually, the low-dimensional feature of concern is generated after applying the activation function to $C_3$.
In the deconvolutional subnetwork, the low-dimensional feature from $C_3$ (which is also the output of the convolutional subnetwork) is up-sampled layer by layer through multiple deconvolutional layers. At first, the output of $C_3$ is sent to the first deconvolutional layer $D_1$ with $n_2$ filters of size $k_3 \times k_3 \times d$. Then, $n_2$ feature maps are obtained after the activation function and transferred to the second deconvolutional layer $D_2$ through $n_1$ filters of size $k_2 \times k_2 \times n_2$. Next, after activating $D_2$, $n_1$ feature maps are obtained and transferred to the last deconvolutional layer $D_3$ (which is also the output layer of the whole DFCEN) with $b$ filters of size $k_1 \times k_1 \times n_1$. In the end, the output of the whole DFCEN is generated after $D_3$ is activated, whose size is the same as that of the input of DFCEN.
In fact, the characteristics of DFCEN lie in the sizes and numbers of its filters (learning parameters), which are identical for symmetrical layers in the convolutional and deconvolutional subnetworks. This rule also applies to the number and size of the feature maps of each layer. In particular, the numbers of feature maps satisfy
$b \ge n_1 \ge n_2 \ge d$,
where $d$ is the target dimension of dimensionality reduction and $b$ is the dimension of the input data. Meanwhile, the sizes of the feature maps satisfy
$s > s_1 > s_2 > s_3 = 1$,
where $s$ is the size of the input data and the size $s_3$ of $C_3$ must be 1, since it represents the low-dimensional features of one pixel. For this reason, the size of the filter between $C_3$ and its preceding layer must be the same as the size of that preceding layer. In Figure 5, the preceding layer of $C_3$ is $C_2$. In brief, DFCEN is a symmetric fully convolutional network with a central layer of size 1, where the convolutional subnetwork reduces the dimensionality and size of the data layer by layer, while the deconvolutional subnetwork restores them layer by layer. Therefore, the network structure ensures that feature extraction in DFCEN is an unsupervised process as long as the embedding term in the objective function does not require any class label information.
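The shrink-to-1-then-restore behavior can be traced with simple size arithmetic. The sketch below assumes stride-1 "valid" convolutions mirrored by transposed convolutions; the example channel counts (200 input bands, 64/32 intermediate maps, target dimension 10) and the 3x3 kernels are hypothetical values chosen only to illustrate the symmetry constraints.

```python
def dfcen_shapes(s, b, channels, kernels):
    """Trace (spatial size, feature-map count) pairs through a symmetric
    fully convolutional network built from 'valid' convolutions (stride 1)
    and their mirrored transposed convolutions. `channels` lists the
    per-layer feature-map counts of the convolutional subnetwork (last
    entry = target dimension d); `kernels` lists the spatial filter sizes.
    """
    sizes = [(s, b)]
    size = s
    for k, c in zip(kernels, channels):
        size = size - k + 1          # a valid convolution shrinks the map
        sizes.append((size, c))
    # the deconvolutional subnetwork mirrors kernels and channels in reverse
    for k, c in zip(reversed(kernels), list(reversed(channels[:-1])) + [b]):
        size = size + k - 1          # a transposed convolution grows it back
        sizes.append((size, c))
    return sizes
```

For instance, a 7x7 window over 200 bands with three 3x3 convolutions reaches a 1x1 central layer, and the mirrored deconvolutions restore the 7x7x200 output, matching the symmetry described above.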
3.3. Objective Function of DFCEN
As discussed in Section 1, DFCEN supports not only unsupervised feature extraction based on data reconstruction, but also task-specific learning that is conducive to dimensionality reduction and classification. The objective function of DFCEN consists of two terms: an embedding term for the specific learning task and a reconstruction term. The embedding term can be changed or designed according to a specific concept or task, and is dedicated to improving the discriminative ability of the low-dimensional features. As shown in Figure 5, the embedding term constrains the low-dimensional output of the central layer $C_3$, so it only acts on the parameter updates of the convolutional subnetwork. For a training sample set $T = \{t_i\}_{i=1}^{n}$, the output of $C_3$ in Figure 5 is expressed as
$z_i = \sigma(W_3 * \sigma(W_2 * \sigma(W_1 * t_i + b_1) + b_2) + b_3),$
where $\Theta_c = \{W_1, b_1, W_2, b_2, W_3, b_3\}$ denotes the learning parameters in the convolutional subnetwork, $*$ denotes the 2D convolution and $\sigma(\cdot)$ is the activation function. $Z = \{z_i\}_{i=1}^{n}$ is also the low-dimensional representation produced by DFCEN.
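The nested convolution-then-activation mapping of the convolutional subnetwork can be sketched with a plain-loop "valid" convolution. This is an illustrative sketch only: the sigmoid activation and the example filter shapes are assumptions, and a real implementation would use a deep learning framework rather than Python loops.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_layer(a_prev, Wf, bias):
    """One 'valid' convolutional layer: a_l = sigma(W_l * a_{l-1} + b_l).
    a_prev: (H, W, C_in); Wf: (k, k, C_in, C_out); bias: (C_out,).
    The sigmoid activation is an assumption for illustration."""
    k = Wf.shape[0]
    H, Wd, _ = a_prev.shape
    out = np.empty((H - k + 1, Wd - k + 1, Wf.shape[3]))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = a_prev[i:i + k, j:j + k, :]
            # contract the k x k x C_in patch against every filter
            out[i, j] = np.tensordot(patch, Wf, axes=3) + bias
    return sigmoid(out)
```

Chaining three such layers on a 7x7 input window with 3x3 filters collapses it to the 1x1 central-layer feature, i.e., the low-dimensional representation of one pixel.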
In order to enhance the separability and discriminability of the low-dimensional features, we explore and maintain the relationships among samples as the specific learning task. In this paper, LLE and LE, two classical manifold learning algorithms, are introduced into the embedding term of DFCEN.
3.3.1. LLE-Based Embedding Term
LLE aims at preserving, in the mapping space, the original reconstruction relationship between each sample and its neighbors; it assumes that each sample can be reconstructed by a linear combination of its neighborhood samples. The linear reconstruction is described in Equation (2), and the original reconstruction coefficients $W$ can be calculated according to Equation (3). For an HSI dataset $\{x_i\}_{i=1}^{n}$, the relationship coefficients form a matrix $W = [w_{ij}] \in \mathbb{R}^{n \times n}$. Since the coefficient matrix $W$ only characterizes the relationship between each sample and its $k$ nearest neighbors, it can also be described as $w_{ij} = 0$ unless $x_j$ is among the $k$ nearest neighbors of $x_i$. The number of selected neighbors $k$ is much smaller than the total number of samples $n$, namely $k \ll n$. Therefore, the relationship coefficient matrix $W$ is a sparse matrix.
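The sparse reconstruction coefficients can be computed per sample by solving a small constrained least-squares problem over its k nearest neighbors, as in standard LLE. The sketch below is one common formulation; the Gram-matrix regularizer `reg` is a usual numerical stabilization that the text does not spell out, and the function name is hypothetical.

```python
import numpy as np

def lle_weights(Xf, k=10, reg=1e-3):
    """LLE reconstruction coefficients: each row of W holds the weights
    that best reconstruct a sample from its k nearest neighbors, with
    the weights summing to one. Entries for non-neighbors stay zero,
    so W is sparse."""
    n = Xf.shape[0]
    W = np.zeros((n, n))
    d2 = ((Xf[:, None, :] - Xf[None, :, :]) ** 2).sum(-1)  # squared distances
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]        # skip the sample itself
        D = Xf[nbrs] - Xf[i]                     # neighbors centered on x_i
        G = D @ D.T                              # local Gram matrix
        G += reg * max(np.trace(G), 1.0) * np.eye(k)  # stabilize (assumption)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                 # enforce sum-to-one
    return W
```

Each row then contains at most k nonzero entries and sums to one, matching the sparsity and normalization properties described above.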
Referring to LLE, the embedding term should constrain the low-dimensional representation to maintain the original reconstruction relationships. Hence, for a training sample set $T$, the LLE-based embedding term can be defined as
$L_E(\Theta_c) = \sum_{i=1}^{n} \Big\| z_i - \sum_{j=1}^{n} w_{ij} z_j \Big\|_F^2,$
where $w_{ij}$ is the original reconstruction coefficient calculated according to Equation (11), which is a constant for the LLE-based embedding term; $z_i$ is the output of $C_3$ in DFCEN; $\Theta_c$ denotes the learning parameters in the convolutional subnetwork; $n$ is the number of training samples in $T$; and $\|\cdot\|_F^2$ is the square of the Frobenius norm, i.e., the sum of the squares of all elements.
3.3.2. LE-Based Embedding Term
LE is to construct the relationship among samples with local angels and reconstruct the local structure and features in the low-dimensional space. An adjacency graph based on the Euclidean distance is constructed to characterize the relationship among samples, which is also called the weight matrix and defined in Equation (
6). When the sample
does not belong to the nearest
k neighbor samples of the sample
, the weight coefficient
between the samples
and
is 0. In fact, for a HSI dataset
, due to
, the adjacency graph matrix
M is also a sparse matrix. In practice, LE hopes that samples that are related to each other (the points connected in the adjacency graph) are as close as possible in the low-dimensional space, which is described in a formula in Equation (
5).
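Building the sparse adjacency matrix can be sketched as follows. Heat-kernel weights on k-nearest-neighbor pairs are assumed here (a standard LE choice); the text itself only requires that non-neighbor pairs receive zero weight, so the exact weighting of Equation (6) may differ.

```python
import numpy as np

def le_adjacency(Xf, k=10, sigma=1.0):
    """LE adjacency (weight) matrix: nonzero only for k-nearest-neighbor
    pairs, symmetrized so the graph is undirected. The heat-kernel
    weights and `sigma` are assumptions made for illustration."""
    n = Xf.shape[0]
    d2 = ((Xf[:, None, :] - Xf[None, :, :]) ** 2).sum(-1)
    Mw = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]       # k nearest neighbors of x_i
        Mw[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * sigma ** 2))
    return np.maximum(Mw, Mw.T)                 # symmetrize the graph
```

The result is symmetric with a zero diagonal and only O(nk) nonzero entries, matching the sparsity argument above.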
Referring to LE, for samples that are related in the original space, the embedding term should constrain their low-dimensional representations to be as close as possible. As a result, for a training sample set $T$, the LE-based embedding term can be defined as
$L_E(\Theta_c) = \sum_{i=1}^{n} \sum_{j=1}^{n} m_{ij} \, \| z_i - z_j \|_F^2,$
where $m_{ij}$ is the adjacency graph coefficient in the original space, which is also a constant.
3.3.3. Reconstruction Term
As shown in Figure 5, the reconstruction term constrains the output of the whole DFCEN, so it acts on the updates of all learning parameters. The reconstruction term ensures that the low-dimensional features can be restored to the input data. For a training sample set $T$, the output of DFCEN in Figure 5 is expressed as
$\hat{t}_i = \sigma(\hat{W}_3 \circledast \sigma(\hat{W}_2 \circledast \sigma(\hat{W}_1 \circledast z_i + \hat{b}_1) + \hat{b}_2) + \hat{b}_3),$
where $\Theta = \{\Theta_c, \Theta_d\}$ represents all learning parameters in DFCEN and $\Theta_d = \{\hat{W}_1, \hat{b}_1, \hat{W}_2, \hat{b}_2, \hat{W}_3, \hat{b}_3\}$ are the parameters in the deconvolutional subnetwork; $\circledast$ denotes the 2D deconvolution and $\sigma(\cdot)$ is the activation function; $z_i$ is the output of the convolutional subnetwork.
The reconstruction term aims at maintaining the original intrinsic information by restoring the low-dimensional features to the original input data. After the low-dimensional representation $z_i$ is propagated through the multiple deconvolutional layers, the reconstructed data $\hat{t}_i$ is obtained. The reconstruction term minimizes the error between the reconstructed data and the original input data. For a training sample set $T$, the reconstruction term can be described as
$L_R(\Theta) = \sum_{i=1}^{n} \| \hat{t}_i - t_i \|_F^2,$
where $\hat{t}_i$ is the output of DFCEN and $\Theta$ denotes all learning parameters.
3.3.4. Objective Function
The embedding and reconstruction terms have been introduced above. The embedding term constrains the low-dimensional output of the central layer to maintain the original sample relationships, while the reconstruction term ensures that the low-dimensional features can be reconstructed back to the high-dimensional input data. To balance the effects of these two terms on dimensionality reduction, a trade-off parameter is added to the objective function. As a result, for a training sample set $T$, the objective function of DFCEN can be described as
$L(\Theta) = L_R(\Theta) + \lambda \, L_E(\Theta_c),$
where $\lambda$ is an adjustable trade-off parameter, $L_R(\Theta)$ is the reconstruction term and $L_E(\Theta_c)$ is the embedding term.
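The two-term objective can be sketched numerically as below, written with the LLE-style embedding residual (features minus their weighted reconstruction). The per-sample averaging is an assumption about normalization, and the argument names are hypothetical.

```python
import numpy as np

def dfcen_objective(T_in, T_out, Z, W, lam):
    """Two-term objective: reconstruction term + lam * embedding term.

    T_in/T_out: original and reconstructed sample sets; Z: central-layer
    features (one row per sample); W: fixed relationship coefficients
    (LLE-style here); lam: the trade-off parameter. Averaging over the
    sample count is an assumption for illustration."""
    rec = ((T_out - T_in) ** 2).sum() / len(T_in)   # reconstruction term
    emb = ((Z - W @ Z) ** 2).sum() / len(T_in)      # embedding term
    return rec + lam * emb
```

With perfect reconstruction and a coefficient matrix that reproduces every feature exactly, both terms vanish; any reconstruction error contributes positively regardless of lam.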
3.4. Learning of DFCEN
The learning of DFCEN optimizes the network parameters $\Theta$ according to the objective function formulated in Equation (16). In this paper, we adopt the gradient descent method to optimize the learning parameters. The update formula for $\Theta$ is expressed as $\Theta \leftarrow \Theta - \eta \, \frac{\partial L}{\partial \Theta}$, where $\eta$ is the learning rate and $\frac{\partial L}{\partial \Theta}$ is the partial derivative of the objective function with respect to $\Theta$, which has the form
$\frac{\partial L}{\partial \Theta} = \frac{\partial L_R}{\partial \Theta} + \lambda \, \frac{\partial L_E}{\partial \Theta_c}.$
In the following, we calculate these two partial derivatives separately. For a training sample $t_i$, the partial derivative from the reconstruction term can be formulated as
$\frac{\partial L_R}{\partial \Theta} = 2\,(\hat{t}_i - t_i)\,\frac{\partial \hat{t}_i}{\partial \Theta},$
where $\frac{\partial \hat{t}_i}{\partial \Theta}$ is the partial derivative of the output layer (also the last layer) with respect to all network parameters $\Theta$. For the 7-layer DFCEN shown in Figure 5, $\Theta_c = \{W_l, b_l\}_{l=1}^{3}$ are the parameters in the convolutional subnetwork, while $\Theta_d = \{\hat{W}_l, \hat{b}_l\}_{l=1}^{3}$ are those in the deconvolutional subnetwork. For $\Theta_c$, the partial derivative with respect to the $l$th layer parameters $W_l$ can be calculated as
$\frac{\partial \hat{t}_i}{\partial W_l} = \mathrm{rot180}(a_{l-1}) * \big(\delta_l \odot \sigma'(u_l)\big),$
where $a_{l-1}$ denotes the feature maps of the $(l-1)$th layer, $u_l$ is the pre-activation of the $l$th layer of DFCEN and $\delta_l$ is the error signal back-propagated to the $l$th layer. When $l = 1$, $a_0$ is the input data $t_i$. The derivation process can be consulted in [37]. $\mathrm{rot180}(\cdot)$ represents a rotation of 180 degrees, $*$ is a 2D convolution and $\sigma'(\cdot)$ is the derivative function of the activation function. For $\Theta_d$, the partial derivative is calculated in the same form as
$\frac{\partial \hat{t}_i}{\partial \hat{W}_l} = \mathrm{rot180}(\hat{a}_{l-1}) \circledast \big(\hat{\delta}_l \odot \sigma'(\hat{u}_l)\big),$
where $\circledast$ is a 2D deconvolution.
The embedding term is only responsible for updating the parameters $\Theta_c$ in the convolutional subnetwork. For a training sample set $T$, the partial derivative of the LLE-based embedding term with respect to $\Theta_c$ can be formulated as
$\frac{\partial L_E}{\partial \Theta_c} = \sum_{i=1}^{n} 2\Big(z_i - \sum_{j} w_{ij} z_j\Big)\Big(\frac{\partial z_i}{\partial \Theta_c} - \sum_{j} w_{ij} \frac{\partial z_j}{\partial \Theta_c}\Big).$
Here $w_{ij}$ is a constant, and $\frac{\partial z_i}{\partial \Theta_c}$ is the partial derivative of the central layer $C_3$ with respect to the parameters $\Theta_c$ in the convolutional subnetwork, which can be expressed in the form of Equation (19). The partial derivative of the LE-based embedding term can be formulated as
$\frac{\partial L_E}{\partial \Theta_c} = \sum_{i=1}^{n}\sum_{j=1}^{n} 2\, m_{ij}\,(z_i - z_j)\Big(\frac{\partial z_i}{\partial \Theta_c} - \frac{\partial z_j}{\partial \Theta_c}\Big),$
where $m_{ij}$ is also a constant.
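With respect to the low-dimensional features themselves (before the chain rule pushes the gradient into the network parameters), the LE-based term has the well-known graph-Laplacian gradient, which is easy to verify by finite differences. The sketch below checks only this feature-level derivative, not the full back-propagation; the function names are hypothetical.

```python
import numpy as np

def le_loss(Z, M):
    """LE embedding term: sum_ij m_ij * ||z_i - z_j||^2."""
    n = len(Z)
    return sum(M[i, j] * ((Z[i] - Z[j]) ** 2).sum()
               for i in range(n) for j in range(n))

def le_grad_wrt_Z(Z, M):
    """Gradient of le_loss w.r.t. the features Z for a symmetric M:
    4 (D - M) Z, where D = diag(M 1) and D - M is the graph Laplacian."""
    D = np.diag(M.sum(axis=1))
    return 4.0 * (D - M) @ Z
```

A central-difference check against `le_loss` confirms the closed form, which is a convenient sanity test when implementing the embedding-term gradient.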
In order to reduce the training time, we use convolutional autoencoders (CAEs) to pretrain the network and obtain good initial parameters. Owing to the symmetry of DFCEN, the parameter structure between the layers in the convolutional subnetwork is the same as that between the corresponding layers in the deconvolutional subnetwork. For this reason, symmetrical layers of the two subnetworks can be initialized with the same parameters. Consequently, the 7-layer DFCEN shown in Figure 5 only requires 3 CAEs for pretraining, which saves pretraining time. Figure 6 shows the pretraining process, where the second CAE can only be trained after the first CAE has been trained, and so on. The parameters in Figure 6, corresponding to the parameters in Figure 5, initialize DFCEN. The activation function of the CAEs is the same as that of DFCEN.
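The greedy, layer-by-layer pretraining with shared (tied) parameters can be illustrated in miniature. The sketch below uses a linear tied-weight autoencoder on flattened features instead of a real CAE, purely to show the shared-parameter idea: the encoder weight initializes one convolutional layer and its transpose initializes the matching deconvolutional layer, and the encoded output feeds the next pretraining stage. All names and hyperparameters are assumptions.

```python
import numpy as np

def pretrain_pair(X, d_out, lr=0.05, epochs=300, seed=0):
    """Greedy pretraining of one encoder/decoder pair, sketched as a
    linear tied-weight autoencoder (a stand-in for one CAE of Figure 6).
    Returns the shared weights W, the encoded features that feed the
    next stage, and the loss history."""
    rng = np.random.default_rng(seed)
    n, f = X.shape
    W = 0.01 * rng.standard_normal((f, d_out))
    losses = []
    for _ in range(epochs):
        H = X @ W                     # encode
        R = H @ W.T                   # decode with the tied weights
        E = R - X
        losses.append(float((E ** 2).sum() / n))
        G = (X.T @ E + E.T @ X) @ W   # gradient of ||X W W^T - X||^2_F
        W -= lr * G / n
    return W, X @ W, losses
```

Stages are trained sequentially: the features returned by one stage become the input of the next, matching the one-after-another training order shown in Figure 6.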