1. Introduction
Hyperspectral (HS) images have been widely used in remote sensing, biomedical and environmental monitoring, and other fields [1,2,3,4,5] due to their rich spatial and spectral information. However, such plentiful information is accompanied by a rapid growth in data volume, and requires dense sampling and long imaging times for HS image acquisition. Compressed sensing (CS) theory reconstructs the original signal from sub-Nyquist sampled measurements [6,7,8], which can effectively reduce the density and duration of data acquisition, and is thus becoming increasingly popular in HS imaging [5,9,10,11,12].
The coded aperture tunable filter (CATF) spectral imager is a compressive spectral imaging system based on liquid crystal tunable filters (LCTFs), which can effectively improve the spectral and spatial resolution over traditional LCTF-based spectral imagers without changing the structures of the LCTFs and detectors [13]. The CATF spectral imager modulates the spatial and spectral domains simultaneously to obtain three-dimensionally encoded compressive spectral images. Spatial encoding is implemented by digital micromirror devices (DMDs), which can load arbitrary coded patterns, making it feasible to improve system performance simply through coding optimization [14,15]. Here, the LCTF serves as a spectral modulator rather than as a conventional narrowband filter. Under the CS framework, by designing the coded patterns on the DMD and precisely measuring the transmission function of the LCTF, HS images with higher spatial resolution than the detector and more spectral bands than the LCTF can be reconstructed.
Like other computational imaging systems, the CATF spectral imager in essence shifts effort from data acquisition to reconstruction. Conventional solutions use iterative algorithms to reconstruct HS images, such as the popular TwIST [16], GPSR [17], and GAP-TV [18]. Yet, reconstructing an HS image with 196 spectral bands can take over 5 h on a CPU. Such long running times greatly hamper the application of the CATF spectral imager to real-time imaging. Moreover, iterative algorithms are rather sensitive to variations in the sensing matrix (the product of the measurement matrix and the sparse transform basis). However, the measurement matrix in real optical systems can hardly achieve the same effect as designed, and signals are usually not strictly sparse on the chosen basis [19,20,21]. These issues further weaken the practicability of iterative algorithms for reconstructing real HS images.
In recent years, advanced reconstruction algorithms for HS images have also emerged [22,23,24,25]. Compared with popular iterative algorithms, these algorithms take the characteristics of HS images into account and achieve better reconstruction results. However, such elaborate algorithms usually contain multiple hyperparameters, which reduces their transferability across different imaging models. In addition, they are still fundamentally iteration-based and thus cannot escape the limitations of iterative algorithms.
In the past decade, with the advent of deep learning, convolutional neural networks (CNNs) have achieved huge success in fields such as computer vision and pattern recognition [26,27,28,29,30,31,32,33]. Recently, CNNs have also offered promising solutions for reconstructing measurements from compressed HS imaging, demonstrating superiority in reconstruction quality and speed over traditional iterative algorithms. HyperReconNet [34] is a pioneering spectral image reconstruction network for coded aperture snapshot spectral imaging (CASSI) systems, which optimizes the coded aperture entries as network parameters to obtain both optimized coded apertures and reconstructed spectral images. Other reconstruction networks [35,36,37,38] have been proposed for CASSI systems, but suffer from limited spectral resolution in the reconstructed results. DeepCubeNet [39] incorporates pseudo-inverse operators and 3D convolutions to perform spectral reconstruction for compressive sensing miniature ultra-spectral imagers (CS-MUSI), achieving reconstructed images of high spectral resolution. However, all previous methods are designed for specific imaging systems and cannot be applied to systems such as the CATF spectral imager, whose measurements are compressed in both the spectral and spatial dimensions.
To overcome the limitations of iterative algorithms, this paper makes the first attempt to design a CNN-based framework for three-dimensional compressed HS data reconstruction. Instead of directly applying a well-established CNN backbone to forcibly map compressed measurements to the original HS data and training the network as a black box with limited insight from the CS domain, we pay more attention to network design to achieve a more reasonable and interpretable solution. If we regard optical imaging and reconstruction as processes of encoding and decoding, respectively, then a simple yet effective way to reconstruct is to reverse the imaging process step by step. Following this intuition, we propose a backtracking reconstruction network (BTR-Net) to reconstruct three-dimensional compressed HS data for the CATF spectral imager.
Concretely, BTR-Net performs spatial initialization, spatial enhancement, spectral initialization, and spatial–spectral enhancement in a step-wise manner through multiple built-in subnetworks. The spatial initialization subnet exploits channel–spatial relations to extend the spatial resolution of the input compressed data. The spatial enhancement subnet enriches spatial details through residual learning [40]. The spectral initialization subnet captures long-range dependencies among sampled spectra to increase the spectral resolution. The spatial–spectral enhancement subnet lifts image quality from the spatial and spectral perspectives collaboratively to obtain the final reconstructed results.
For evaluation, we conduct both quantitative and qualitative comparisons of BTR-Net with widely used iterative algorithms, together with robustness tests under varying levels of noise. Experimental results demonstrate that BTR-Net achieves higher reconstruction quality and stronger noise resistance, while running two orders of magnitude faster than iterative algorithms. In addition, we build a real optical system and verify the effectiveness of BTR-Net on real data.
The rest of this paper is organized as follows.
Section 2 introduces the imaging model of CATF spectral imager.
Section 3 develops a CNN-based backtracking reconstruction framework for the CATF spectral imager.
Section 4 shows the performance of proposed framework.
Section 5 discusses the parameter settings.
Section 6 presents our conclusion.
2. CATF Spectral Imager
Figure 1 shows the schematic of the CATF spectral imager, with its optical structure presented in the dashed box. The light reflected from the object is modulated by the LCTF and the DMD in turn, and received by the detector. Concretely, the imaging lens first focuses the reflected light on the LCTF, which modulates the object's spectral information by adjusting the amplitudes of the selected channel transmission functions. The spectrally modulated scene is then imaged onto the DMD by the first relay lens to undergo spatial modulation with a random coded aperture pattern. Finally, the compressed measurements are projected onto the detector by the second relay lens.
Unlike traditional imaging systems, the detector here receives compressed measurements that need to be further reconstructed by algorithms to obtain the final HS images. The original HS cube is denoted by $\mathbf{F} \in \mathbb{R}^{N_x \times N_y \times N_\lambda}$, with a spatial resolution of $N_x \times N_y$ and $N_\lambda$ spectral bands. Multiple sampling in both the spectral and spatial dimensions is required to achieve accurate HS data reconstruction. Let L denote the spectral sampling number (i.e., the number of LCTF channels) and K denote the spatial sampling number (i.e., the number of coded patterns on the DMD). Then, the compressed measurements acquired on the detector can be expressed as $\mathbf{G} \in \mathbb{R}^{M_x \times M_y \times LK}$, where $M_x \times M_y$ denotes the dimension of the detector. Since the spatial resolution of the detector is usually much smaller than that of the coded aperture, the scaling factor is $R = N_x/M_x = N_y/M_y$. Note that we only consider the case where the coded aperture matches the detector, i.e., R is an integer.
Mathematically, let $\mathbf{f} \in \mathbb{R}^{N_x N_y N_\lambda}$ denote the vectorized representation of the original HS cube. The process of obtaining the vectorized compressed measurements $\mathbf{g} \in \mathbb{R}^{M_x M_y L K}$ on the detector can be formulated as:

$\mathbf{g} = \mathbf{H}\mathbf{f}$,    (1)

where $\mathbf{H} \in \mathbb{R}^{M_x M_y L K \times N_x N_y N_\lambda}$ is the measurement matrix of the system, which can be regarded as the product of a spectral and a spatial measurement matrix determined by the transmission functions of the LCTF and the patterns of the coded aperture, respectively.
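To make the acquisition model concrete, the following NumPy sketch simulates the measurement process for a toy cube: spectral modulation by LCTF transmittance curves, spatial coding by DMD patterns, and binning onto a low-resolution detector. The `acquire` helper, the array sizes, and the sampling numbers are illustrative assumptions, not the actual system parameters.

```python
import numpy as np

def acquire(cube, transmittances, patterns, R):
    """Toy forward model: spectral modulation by LCTF transmittance curves,
    spatial coding by DMD patterns, and R x R binning onto the detector.
    cube: (Nx, Ny, Nlam); transmittances: (L, Nlam); patterns: (K, Nx, Ny).
    Returns measurements of shape (Nx//R, Ny//R, L*K)."""
    Nx, Ny, _ = cube.shape
    out = []
    for t in transmittances:                     # one LCTF channel at a time
        ms = (cube * t).sum(axis=2)              # integrate over wavelength
        for p in patterns:                       # one DMD pattern at a time
            coded = ms * p
            # binning models the low-resolution detector (scaling factor R)
            out.append(coded.reshape(Nx // R, R, Ny // R, R).sum(axis=(1, 3)))
    return np.stack(out, axis=2)

rng = np.random.default_rng(0)
cube = rng.random((8, 8, 16))      # toy HS subcube
T = rng.random((4, 16))            # L = 4 transmittance curves
P = rng.random((2, 8, 8))          # K = 2 coded patterns
g = acquire(cube, T, P, R=2)
print(g.shape)                     # (4, 4, 8)
```

Since every step is linear in the cube, the whole pipeline is equivalent to a single matrix acting on the vectorized cube, which is exactly the role of the measurement matrix in Equation (1).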
The massive information contained in HS images inevitably results in high computational cost for reconstruction, posing a great challenge to reconstructing whole HS images with either iterative algorithms or deep networks. Hence, we adopt a block-compressed sensing (BCS) framework [41] to alleviate the computational complexity. Assume that the original HS cube is spatially divided into several $B \times B \times N_\lambda$-sized subcubes, each of which corresponds to a $(B/R) \times (B/R)$-sized region on the detector. The measurement matrix of the CATF spectral imager can then be denoted as:

$\mathbf{H} = \mathrm{diag}(\mathbf{H}^B, \mathbf{H}^B, \ldots, \mathbf{H}^B)$,    (2)

where $\mathbf{H}^B$ is the measurement matrix for each subcube, which can be written in the following Kronecker product form:

$\mathbf{H}^B = \mathbf{H}^B_{spe} \otimes \mathbf{H}^B_{spa}$,    (3)

where $\mathbf{H}^B_{spe}$ and $\mathbf{H}^B_{spa}$ are the spectral and the spatial measurement matrix for the subcube, respectively. Based on the above analysis, Equation (1) can be decomposed into a number of subproblems:

$\mathbf{g}_i = \mathbf{H}^B \mathbf{f}_i$,    (4)

where $\mathbf{g}_i$ denotes the vectorized compressed measurements for subcube $\mathbf{f}_i$. Iterative reconstruction algorithms rely on the sparsity of HS images and thus transform the solution of Equation (4) into an optimization problem under the $\ell_1$ norm:

$\min_{\boldsymbol{\alpha}} \|\boldsymbol{\alpha}\|_1 \quad \mathrm{s.t.} \quad \|\mathbf{g}_i - \mathbf{H}^B \boldsymbol{\Psi} \boldsymbol{\alpha}\|_2 \leq \epsilon$,    (5)

where $\boldsymbol{\Psi}$ is the sparse basis of the sub-HS cube, $\boldsymbol{\alpha}$ is the sparse coefficient vector, and $\epsilon$ is the reconstruction error bound.

In addition, Equation (4) can also be solved as a total variation (TV) minimization problem:

$\min_{\mathbf{f}_i} \|\mathbf{f}_i\|_{TV} \quad \mathrm{s.t.} \quad \mathbf{g}_i = \mathbf{H}^B \mathbf{f}_i$,    (6)

where $\|\mathbf{f}_i\|_{TV}$ denotes the TV norm of $\mathbf{f}_i$.
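The Kronecker product form of the subcube measurement matrix can be sanity-checked numerically in a few lines. In this NumPy sketch, `Phi` and `C` are illustrative stand-ins for the spectral and spatial measurement matrices, and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
L, n_lam = 3, 6        # LCTF channels x spectral bands (toy sizes)
m, n = 4, 9            # detector pixels x coded-aperture pixels per block
Phi = rng.random((L, n_lam))    # stand-in spectral measurement matrix
C = rng.random((m, n))          # stand-in spatial measurement matrix

H = np.kron(Phi, C)             # Kronecker product form of the subcube matrix
F = rng.random((n_lam, n))      # subcube arranged as bands x pixels
g = H @ F.ravel()               # subproblem measurement

# Kronecker identity: (Phi kron C) vec(F) = vec(Phi F C^T)
assert np.allclose(g, (Phi @ F @ C.T).ravel())
print(H.shape)                  # (12, 54)
```

The identity shows why the factorized form is useful in practice: applying the two small factors separately avoids ever forming the large Kronecker matrix.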
3. Methodology
In this section, we first briefly describe the design inspiration and representation for network-based reconstruction, and then elaborate on the design of the proposed BTR-Net. The BCS framework is also used in the design of BTR-Net, and the superscript B is omitted for notational simplicity.
3.1. Design Inspiration and Representation
The intention of the proposed reconstruction network is to reverse the imaging process of the spectral imager step by step.
Figure 2 illustrates the overall workflow of BTR-Net with the example of acquiring and reconstructing compressed one-pixel measurements on the detector.
For the acquisition process, the input HS subcube $\mathbf{F} \in \mathbb{R}^{B \times B \times N_\lambda}$, with a spatial resolution of $B \times B$ and $N_\lambda$ spectral bands, is first spectrally modulated by the LCTF with L channels to produce a multispectral (MS) image $\mathbf{F}_{ms} \in \mathbb{R}^{B \times B \times L}$. The MS image is further spatially encoded by the DMD with K different coding patterns to obtain shrunken measurements on a $(B/R) \times (B/R)$-sized detector region, resulting in the modulated output $\mathbf{G} \in \mathbb{R}^{(B/R) \times (B/R) \times LK}$.

Conversely, the reconstruction process aims to learn a reverse mapping. It first maps the compressed measurement $\mathbf{G}$ to an MS image by spanning the spatial resolution (spatial initialization) and enriching fine-grained details (spatial enhancement); it then extends the spectral resolution (spectral initialization) and jointly promotes the image quality spatially and spectrally (spatial–spectral enhancement), leading to the final reconstructed result $\hat{\mathbf{F}}$.
3.2. BTR-Net Architecture
Figure 3 shows the network architecture of the proposed BTR-Net. It is composed of four subnetworks (spatial initialization, spatial enhancement, spectral initialization, and spatial–spectral enhancement subnets), mapping the compressed measurement to the original HS data step by step in an interpretable and unified manner. In the following, data sizes in the BTR-Net are given in the order batch size × channels × height × width.

Spatial Initialization Subnet: This subnet aims to acquire the spatial initialization $\mathbf{X}_{si}$ from the compressed measurement $\mathbf{G}$ by spanning the spatial resolution. Its function can be formulated as:

$\mathbf{X}_{si} = PS(\sigma(W_2 * \sigma(W_1 * \mathcal{R}(\mathbf{G}))))$,    (7)

where $\mathcal{R}$ represents a reshape operation, $W_1$ and $W_2$ are the weights to be trained for the first and second convolution layers, respectively, and each convolution layer is followed by a ReLU activation $\sigma$. $PS$ is a periodic shuffling operator called the sub-pixel convolution layer, first introduced in [30]. More specifically, $\mathcal{R}$ merges the spectral measurements of the input data $\mathbf{G}$ into the batch-size dimension (i.e., $1 \times LK \times \frac{B}{R} \times \frac{B}{R} \rightarrow LK \times 1 \times \frac{B}{R} \times \frac{B}{R}$), so that the following operations can focus on the spatial information. The subsequent convolutional layers extract features at low spatial resolution and ensure that the number of feature maps fed to $PS$ is $r^2$ (where $r = R$). Finally, $PS$ rearranges the $r^2$ features with resolution $\frac{B}{R} \times \frac{B}{R}$ into a single image (i.e., $LK \times R^2 \times \frac{B}{R} \times \frac{B}{R} \rightarrow LK \times 1 \times B \times B$).
Spatial Enhancement Subnet: This subnet is designed to obtain the predicted MS image $\mathbf{X}_{ms}$ from the spatial initialization $\mathbf{X}_{si}$ by feature refinement. Mathematically, it learns the following function:

$\mathbf{X}_{ms} = \mathcal{R}^{-1}(f_{se}(\mathbf{X}_{si}))$,    (8)

where $\mathcal{R}^{-1}$ performs the reverse operation of $\mathcal{R}$ (i.e., $LK \times 1 \times B \times B \rightarrow 1 \times LK \times B \times B$). $f_{se}$ takes the spatial initialization $\mathbf{X}_{si}$ as input and contains $N$ residual learning blocks (Resblocks) to be learned. The $n$th Resblock is defined as:

$x_n = x_{n-1} + F(x_{n-1}; W_n)$,    (9)

where $W_n$ represents the weights to be trained for the $n$th residual mapping $F$. Each Resblock takes the output of the previous Resblock as input and adds the learned residual mapping to this input to form its output. The structure of the Resblock is designed with reference to the setting in [42] and contains three convolutional layers. The residual mapping is formulated as:

$F(x) = W^{(3)} * \sigma(W^{(2)} * \sigma(W^{(1)} * x))$,    (10)

where $x$ represents the input and $W^{(i)}$ is the weight of the $i$th convolutional layer of the residual mapping. The first two convolutional layers are followed by the ReLU activation $\sigma$.
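A minimal PyTorch sketch of such a Resblock is given below; the channel width and the class name are illustrative assumptions, and the reflect padding anticipates the padding choice discussed at the end of this section.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block as described: three 3x3 convolutions, ReLU after
    the first two, and an identity skip from input to output."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, padding_mode="reflect"),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, padding_mode="reflect"),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, padding_mode="reflect"),
        )

    def forward(self, x):
        return x + self.body(x)   # output = input + learned residual

x = torch.randn(1, 16, 8, 8)      # toy feature map (16 channels assumed)
y = ResBlock(16)(x)
print(y.shape)                    # torch.Size([1, 16, 8, 8])
```

Because each block only learns a residual on top of the identity, stacking several of them refines details without degrading the initialization that is passed through the skip connections.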
Spectral Initialization Subnet: This subnet is designed to obtain an initialization of the HS image, $\mathbf{X}_{hs}$, by extending the spectral resolution of the predicted MS image $\mathbf{X}_{ms}$, which can be formulated as:

$\mathbf{X}_{hs} = \sigma(W_{sp} * \mathbf{X}_{ms})$,    (11)

where $W_{sp}$ represents the weights to be trained. We generate $N_\lambda$ feature maps with a convolutional layer to preliminarily reconstruct the spectra (i.e., the channel dimension is extended to $N_\lambda$). As in the previous designs, a ReLU activation $\sigma$ is added.
Spatial–Spectral Enhancement Subnet: This subnet jointly promotes the image quality of the initialized HS image $\mathbf{X}_{hs}$ spatially and spectrally, resulting in the final reconstructed HS image $\hat{\mathbf{F}}$. Its function can be expressed as:

$\hat{\mathbf{F}} = s(W_{ss} * \mathbf{X}_{hs})$,    (12)

where $W_{ss}$ represents the weights to be trained. A Sigmoid activation $s$ is added to limit the output to the range between 0 and 1.
It is worth noting that feeding the divided image blocks into the network is likely to cause distinct block artifacts in the reassembled reconstructed images if zero-padding is used. Reflect-padding replaces the zeros with mirrored pixel values from the feature map, so that the convolution results at the edges are not pulled down. We therefore use reflect-padding in the padding operation of each convolution layer, which effectively mitigates the artifacts caused by block-wise processing [43].
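A one-dimensional toy example illustrates why reflect-padding helps at block edges: with zero-padding, the filtered value at the boundary is pulled toward zero, while reflect-padding keeps it near the local signal level. The 3-tap averaging kernel is illustrative.

```python
import numpy as np

row = np.array([5.0, 4.0, 3.0])       # toy signal near a block edge
kernel = np.ones(3) / 3.0             # simple 3-tap averaging filter

zero_pad = np.convolve(row, kernel, mode="same")   # implicit zero-padding
refl = np.pad(row, 1, mode="reflect")              # [4, 5, 4, 3, 4]
reflect_pad = np.convolve(refl, kernel, mode="valid")

# Edge value: 3.0 with zero-padding vs about 4.33 with reflect-padding
print(round(zero_pad[0], 2), round(reflect_pad[0], 2))   # 3.0 4.33
```

The zero-padded edge result (3.0) is well below the true local level (around 4.5), which is exactly the darkening that shows up as block artifacts when the reconstructed blocks are tiled back together.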
3.3. Loss Function
We optimize the network parameters by minimizing the pixel-wise mean squared error (MSE), i.e.,

$\mathcal{L}(\Theta) = \|\hat{\mathbf{F}} - \mathbf{F}\|_2^2$,    (13)

where $\Theta$ denotes the trainable parameters of BTR-Net, and $\hat{\mathbf{F}}$ and $\mathbf{F}$ represent the HS image predicted by BTR-Net and the original HS image, respectively.
4. Results
We trained the networks from scratch for 30 epochs on a single NVIDIA GeForce GTX 1080 with a fixed batch size and initial learning rate, gradually decreasing the learning rate by an order of magnitude after every 10 epochs. We used two Resblocks in BTR-Net throughout the experiments.
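The training schedule can be sketched in PyTorch as follows. The Adam optimizer and the 1e-3 initial learning rate are illustrative assumptions (the text fixes only 30 epochs and a tenfold decay every 10 epochs), and the tiny convolutional model is a placeholder for BTR-Net.

```python
import torch

# Placeholder model; the optimizer type and initial rate are assumptions.
model = torch.nn.Conv2d(3, 3, 3, padding=1, padding_mode="reflect")
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# Tenfold decay every 10 epochs, matching the schedule described above.
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)
loss_fn = torch.nn.MSELoss()    # pixel-wise MSE loss (Section 3.3)

for epoch in range(30):
    # per-batch: opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()
    sched.step()                # advance the decay schedule once per epoch

print(sched.get_last_lr())      # approximately [1e-06] after three decays
```

After three decays the learning rate has dropped by three orders of magnitude, which is what stabilizes the final epochs of training.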
4.1. Dataset and Evaluation Metrics
We carried out experiments on a public HS dataset, the BGU iCVL Hyperspectral Image Dataset [44]. This dataset consists of HS images with 519 spectral bands ranging from 400 to 1000 nm, with a spectral interval of about 1.25 nm. We only used the 196 bands from 488 to 730 nm, to make the spectral range of the input HS images consistent with that of the LCTF. We randomly selected 32 HS images for training and 8 for testing, and normalized the pixel values of all images to [0, 1]. Through the blocking operation, more than 12,000 image pairs were obtained for network training. When generating the input data, the following key points should be noted: (i) the original HS images were divided into $B \times B$-sized image blocks, from which input cubes were extracted by spectral filtering and spatial coding; (ii) L real measured LCTF transmittance curves were utilized to simulate the spectral filtering; and (iii) instead of using the same coded patterns for every image block, K random coded patterns (random matrices with values between 0 and 1) were generated for each image block to simulate the spatial coding.
For a comprehensive evaluation of the reconstructed results, we adopted mean peak signal to noise ratio (MPSNR), mean structural similarity index measure (MSSIM), mean relative absolute error (MRAE), and mean spectral angle mapper (MSAM) as evaluation metrics. The lower the MRAE and MSAM, or the larger the MSSIM and MPSNR, the better the reconstructed images.
The MPSNR, which measures the difference between two images, is defined as:

$\mathrm{MPSNR} = \frac{1}{N_\lambda} \sum_{i=1}^{N_\lambda} 10 \log_{10} \left( \frac{1}{\mathrm{MSE}_i} \right)$,    (14)

where $N_\lambda$ denotes the number of spectral bands, and $\mathrm{MSE}_i$ is the MSE between the reconstructed and the original HS image at the $i$th spectral band (the peak value is 1 because the images are normalized).

The MSSIM, which evaluates the structural similarity between the reconstructed and the original images, is defined as:

$\mathrm{MSSIM} = \frac{1}{N_\lambda} \sum_{i=1}^{N_\lambda} \frac{(2\mu_{x_i}\mu_{y_i} + C_1)(2\sigma_{x_i y_i} + C_2)}{(\mu_{x_i}^2 + \mu_{y_i}^2 + C_1)(\sigma_{x_i}^2 + \sigma_{y_i}^2 + C_2)}$,    (15)

where $x_i$ (with mean $\mu_{x_i}$ and variance $\sigma_{x_i}^2$) and $y_i$ (with mean $\mu_{y_i}$ and variance $\sigma_{y_i}^2$) denote the reconstructed and the original HS image at the $i$th spectral band, respectively. $\sigma_{x_i y_i}$ is the covariance of $x_i$ and $y_i$, and $C_1$ and $C_2$ are two hyperparameters.

The MRAE, which describes the proportion of the reconstruction error of each pixel to the original value, is defined as:

$\mathrm{MRAE} = \frac{1}{N_\lambda W H} \sum_{i=1}^{N_\lambda} \sum_{u=1}^{W} \sum_{v=1}^{H} \frac{\left| x_i(u,v) - y_i(u,v) \right|}{y_i(u,v)}$,    (16)

where $x_i(u,v)$ and $y_i(u,v)$ denote the points at the $i$th spectral band with spatial coordinates $(u,v)$ on the reconstructed and the original HS image, respectively, and $W$ and $H$ denote the spatial resolution of the HS image.

The MSAM, which calculates the average angle between the spectra of the reconstructed and the original images across all spatial positions, is defined as:

$\mathrm{MSAM} = \frac{1}{W H} \sum_{u=1}^{W} \sum_{v=1}^{H} \arccos \left( \frac{\langle \mathbf{x}(u,v), \mathbf{y}(u,v) \rangle}{\|\mathbf{x}(u,v)\|_2 \, \|\mathbf{y}(u,v)\|_2} \right)$,    (17)

where $\mathbf{x}(u,v)$ and $\mathbf{y}(u,v)$ denote the reconstructed and the original spectral vectors at spatial position $(u,v)$.
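Two of these metrics can be computed with a short NumPy sketch; the function names are ours, and the peak value of 1.0 assumes the [0, 1] normalization used above.

```python
import numpy as np

def mpsnr(rec, ref, peak=1.0):
    """Mean PSNR over spectral bands; rec/ref are (H, W, bands) arrays."""
    mse = ((rec - ref) ** 2).mean(axis=(0, 1))       # per-band MSE
    return float(np.mean(10.0 * np.log10(peak ** 2 / mse)))

def msam(rec, ref, eps=1e-12):
    """Mean spectral angle (in radians) over all spatial positions."""
    dot = (rec * ref).sum(axis=2)
    norms = np.linalg.norm(rec, axis=2) * np.linalg.norm(ref, axis=2)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)    # guard arccos domain
    return float(np.arccos(cos).mean())

ref = np.random.default_rng(2).random((4, 4, 8))     # toy 4 x 4 x 8 cube
print(round(mpsnr(ref + 0.1, ref), 1))   # 20.0 (uniform error of 0.1)
print(msam(0.5 * ref, ref) < 1e-5)       # True (scaling preserves the angle)
```

Note that MSAM is insensitive to a global intensity scaling of the spectrum, which is why it complements MPSNR as a spectral-fidelity measure.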
4.2. Comparison with Iterative Algorithms
We compare the proposed BTR-Net with popular iterative algorithms, including TwIST [16], GPSR [17], and GAP-TV [18]. TwIST and GPSR aim to find the sparse solution of the HS data, as in Equation (5), with the DCT basis as the sparse basis. GAP-TV is a TV-based algorithm.
Table 1 provides quantitative comparisons of the reconstructed results from our BTR-Net and the iterative algorithms. BTR-Net outperforms the three iterative algorithms on every metric for each testing image (without noise). For instance, BTR-Net gains more than 4 dB over the iterative algorithms in terms of average MPSNR.
Table 2 shows the running time required by each method to reconstruct an HS image with 196 spectral bands. BTR-Net demonstrates a significant decrease in running time compared with the iterative algorithms. Specifically, BTR-Net runs two orders of magnitude faster than the iterative algorithms on a CPU. Moreover, BTR-Net supports GPU acceleration, which makes it run about 38 times faster than on a CPU.
Figure 4 shows qualitative comparisons of the reconstructed results in RGB projection. The red, green and blue channels of the RGB image are taken from three spectral bands of the HS image at 660 nm, 550 nm and 500 nm, respectively. For a clear comparison of reconstructed details, the image region in red square is enlarged at the lower left corner of the RGB image. The RGB images indicate that BTR-Net is superior to conventional iterative algorithms in color reproduction and detail recovery.
Figure 5 compares the spectral curves reconstructed by the four methods. The second and third columns draw the spectra of two representative pixels whose positions are marked on the RGB image in the first column, where the x-axis and y-axis represent wavelength and normalized intensity, respectively. The SAM is labeled in the legend to evaluate the quality of the reconstructed spectra. The spectra suggest that BTR-Net performs well in spectrum reconstruction, while the conventional iterative algorithms perform poorly at the edges of the spectrum.
To further demonstrate the wavelength-dependent performance variation, we quantitatively compare the results at different wavelengths (taking Scene 1 as an example), as shown in Figure 6. The proposed BTR-Net delivers excellent global performance and stable reconstruction across wavelengths. By comparison, the reconstructed results of the iterative algorithms change significantly with wavelength, probably because the spectra are not strictly sparse on the given sparse basis.
We studied the noise immunity of BTR-Net by adding white Gaussian noise to the compressive measurements (i.e., input data of BTR-Net). The network is trained in the absence of noise and tested in the presence of noise. The three iterative algorithms were also tested by adding noise to the compressive measurements.
Table 3 compares the experimental results with noise levels of 40 dB, 30 dB and 20 dB. Taking Scene 1 as an example,
Figure 7 compares the visual quality of these four methods at different noise levels, and
Figure 8 shows the performance of the reconstructed spectra. The proposed BTR-Net surpasses the iterative algorithms in both reconstruction performance and noise resistance. Concretely, the results of BTR-Net are almost impervious to the addition of 40 dB and 30 dB noise; although 20 dB noise degrades the performance of BTR-Net, the results remain acceptable. By contrast, the performance of TwIST declines noticeably with 30 dB noise, while GPSR and GAP-TV degrade but remain acceptable. At a noise level of 20 dB, TwIST and GPSR can hardly reconstruct the data at all; the noise immunity of GAP-TV is better than that of TwIST and GPSR, but its reconstruction results are still unsatisfactory with 20 dB noise.
4.3. Real Experiments
We constructed an experimental prototype of the CATF spectral imager, as shown in
Figure 9. The testbed consisted of a fiber ring illuminator (Thorlabs FRI61F50 and OSL2), an imaging lens with a focal length of 50 mm, a visible LCTF with a range of 500 nm–710 nm, two relay lenses with a focal length of 75 mm, a DMD (Texas Instruments DLP9500), and a monochromatic CMOS camera (Basler acA2040-90 um).
We took a set of compressive measurements with the prototype as the input data of the trained BTR-Net. Consistent with the parameter settings in the simulations, the same numbers of LCTF channels and random coded patterns were used in the real experiments, along with the same scaling factor. The object occupied a region on the DMD corresponding to a smaller region on the detector, and the measurements were spatially divided into image blocks to meet the requirements of BTR-Net on the input data size.
In the real experiments, we used the CATF spectral imager to obtain the measurements, but could not obtain the ground truth of the corresponding HS data. This made it impossible to train BTR-Net on a dataset acquired in the real experiments, so we applied the BTR-Net trained in simulations to the real experimental data. This was challenging, since BTR-Net was not trained with real data and the performance of the measurement matrix in the real experiments was degraded compared with the matrix used to generate the training data.
To compare the results qualitatively,
Figure 10 shows the RGB projections of the reconstructed results from the real experimental data, including three iterative algorithms and the proposed BTR-Net. It can be seen that the colors of the RGB images are similar, which reflects that the spectral reconstruction capabilities of the four methods are comparable. Perceptually, the BTR-Net is superior to iterative algorithms in detail recovery.
Figure 11 draws the reflection spectra of two representative points whose positions are marked on the RGB image, where the x-axis and y-axis represent wavelength and reflectivity, respectively. The PSNR is labeled in the legend to evaluate the quality of the reconstructed spectra. The original reflection spectra were measured by a grating spectrometer (Ocean Optics Maya2000pro). The results show that the spectra reconstructed by BTR-Net are in the best agreement with the original spectra at P1, while the four methods show comparable spectral reconstruction performance at the background point P2.
In the real experiments, the iterative algorithms tend to find highly sparse solutions in order to combat noise (that is, the number of non-zero coefficients of $\boldsymbol{\alpha}$ in Equation (5) is extremely small). Their reconstructed results therefore lose a large proportion of the high-frequency components, resulting in a lack of spatial details and relatively smooth spectra. Our BTR-Net learned the characteristics of the HS dataset during training; thus, the details in its RGB projections are richer and the spectra contain more fluctuations. The results demonstrate the feasibility of the proposed BTR-Net, although they are not as good as in the simulations.
6. Conclusions
To overcome the drawbacks of traditional iteration-based algorithms, we proposed a backtracking reconstruction network, BTR-Net, to solve the reconstruction problem in three-dimensional compressed HS imaging. We decomposed the imaging process of the CATF spectral imager into steps and designed a series of subnetworks to reverse these steps. Specifically, we built four subnetworks in sequence (a spatial initialization subnet, a spatial enhancement subnet, a spectral initialization subnet, and a spatial–spectral enhancement subnet) to obtain a reverse mapping from compressed measurements to HS data. Experimental results show that the proposed BTR-Net outperforms traditional iteration-based algorithms in reconstruction performance and running speed, while exhibiting strong noise resistance.
The BTR-Net shows obvious advantages over iterative algorithms, yet there are several aspects requiring further study:
1. Multiple reshaping operations are adopted in BTR-Net, which means that the size of the input data is strictly fixed once the network is trained. Follow-up work will be directed towards the design of a fully convolutional network capable of accepting input data of variable dimensions.
2. BTR-Net takes compressed measurements as inputs to reconstruct the HS data, which ignores characteristics of the imaging system itself, such as the transmittance curves of the LCTF and the coded patterns of the DMD. Further effort is needed to design a network that rationally utilizes this prior information to make the network more interpretable.