1. Introduction
Spectral imaging offers benefits for a variety of applications throughout computer vision. On an abstract level, all applications that can profit from spectral imaging belong to either color science or remote sensing. The terms spectral, multi-spectral and hyper-spectral are broadly used and not well defined. In fact, multiple, distinct definitions exist for the same terms depending on the field, e.g., remote sensing or colorimetry. The major commonality is that spectral imaging is a generic term covering both multi-spectral and hyper-spectral imaging. Hyper-spectral imaging makes it possible to measure the continuous spectrum. In contrast, multi-spectral imaging merely samples the spectrum at a higher resolution than RGB devices, since it utilizes more than three channels. The greatest ambiguity lies in what is considered a sufficient representation of a continuous spectrum.
Although spectral imaging techniques have been researched for decades and multiple, distinct systems have been proposed, there is always a tradeoff between spatial resolution, temporal resolution, spectral resolution and cost. On the most abstract level, all spectral imaging systems can be subdivided into two classes: scanning and snapshot. Scanning techniques perform a true hyper-spectral image acquisition by either scanning the spatial domain, e.g., push-broom systems, or the spectral domain, e.g., filter-wheel or liquid crystal tunable filter (LCTF) based cameras. The major drawback of such systems is that they are restricted to static scenes.
Snapshot spectral imaging explicitly aims at achieving a high temporal resolution, which usually comes at the cost of a reduced spectral and/or spatial resolution. Such systems should therefore not be considered hyper-spectral imaging systems, but rather multi-spectral systems, since the precise spectral signature must first be computed from non-ideally sampled data points. Snapshot spectral imaging is thus closely related to computational spectral imaging. Computational spectral imaging gained particular interest in recent years due to the continuously growing capacity of technology that fosters modern, learning based algorithms, especially deep learning. Computed tomography imaging spectrometers (CTIS) [1,2,3] employ dedicated gratings to disperse spectral stimuli in multiple directions. As a result, multiple dispersed images are captured by the imaging sensor, where each dispersed image might be viewed as a two-dimensional projection of the three-dimensional spectral cube. Although CTIS are in theory capable of a high temporal resolution, the post-processing effort is immense. Coded aperture snapshot spectral imaging (CASSI) [4,5] can be viewed as a further development of CTIS. CASSI is based on compressive sensing and offers an interesting solution, but has yet to achieve high reconstruction quality. In analogy to modern RGB imaging devices, there is the comparably new development of multi-spectral color filter arrays (MSCFA). Although they have long been considered theoretically [6], their manufacturing is not trivial, leading to effects such as a varying spectral sensitivity across the sensor or viewing angle instabilities. There are also drawbacks regarding the spatial resolution.
Spectral reconstruction from RGB, or spectral super-resolution (SSR), is a computational spectral imaging approach which, in contrast to dedicated spectral imaging systems, only requires cheap camera technology in the form of conventional end-consumer devices. It therefore promises an affordable spectral imaging solution for mass market adoption. The recovery of spectral signatures from RGB signals is a heavily underconstrained problem that can hardly be solved by classical signal processing techniques. Known SSR approaches almost exclusively rely on machine learning, with the underlying premise that the computer may find patterns that are beyond human understanding. To this end, machine learning requires large datasets for training. In the context of SSR, such a dataset ideally consists of corresponding pairs of RGB and spectral images. Real-world image pairs (RGB and spectral) are usually not available. However, there exist large datasets of spectral images only. It is current practice to simulate the corresponding RGB images from the spectral data. Such generated image pairs lay the foundation not only for training modern, data driven approaches, but also for evaluating them.
Pioneering work on SSR utilized radial basis function networks [7] and sparse coding [8,9,10]. Today, they are outperformed by modern convolutional networks, which have continuously demonstrated superior results. The recent NTIRE 2020 challenge on spectral reconstruction from RGB [11] as well as its predecessor in 2018 [12] provide a concise overview of the state-of-the-art methods. In summary, the best performing methods in terms of spectral reconstruction quality exclusively consist of complex convolutional neural networks [13,14,15,16,17,18,19]. A major limitation of deep learning based approaches for SSR is their susceptibility to a varying brightness, i.e., signal scale. This was first noted by Lin et al. [20], who investigated the potential scale invariance of the leading entries of the NTIRE 2018 challenge on spectral reconstruction. Scale invariance was found to be essentially non-existent. An additional analysis of scale invariance was conducted for the 2020 challenge on spectral reconstruction from RGB [11]. The leading methods were once again shown to be susceptible to changes in brightness, although their general robustness had increased. However, the improved robustness resulted from a superior, more general database that implicitly included varying scales.
Changes in signal scale occur frequently in practical applications. The most intuitive example is a varying exposure time. A more critical application relying on brightness invariance is video surveillance, in particular the case of moving objects under a fixed light source. Concrete examples are video based patient monitoring in a clinical setting or car driver monitoring. If the monitored person moves such that their relative position to the camera/light source changes, the spectral stimuli incident on the sensing device will change in scale (neglecting effects such as specular reflections). Such movements should obviously not cause inherently different spectral reconstructions, which could potentially lead to false alerts.
To the authors' knowledge, there exist no published articles on achieving brightness invariance for deep learning based SSR. Yet, utilizing deep learning for SSR is desirable since it forms the state-of-the-art in terms of spectral reconstruction quality. So far, the only known way to approach the issue is from the data side, by either employing data augmentation [20] or directly utilizing more complex databases [11]. If it is possible to ensure that the training data is diverse and covers all possibly occurring signal intensities, brightness invariance can be learned by the network from the data itself. However, obtaining spectral databases of such diversity requires an immense effort and is often not feasible. When spectra occur whose intensities are lower or higher than what is known from the training data, the signal processing becomes unstable.
Within this work, we aim to provide a general solution for deep learning methods that guarantees scale invariance for SSR at all times, independent of the training data. The contributions of this work are:
- We investigate a modern sparse coding technique to highlight why it is scale invariant by construction: only the signal vector directions are of relevance, not the signal magnitude.
- We transfer the gained insights to reformulate the prediction goal for deep learning based SSR.
- As a result, we propose a fundamental deep learning based approach for SSR that is invariant to scale.
3. Experiments
The NTIRE 2020 dataset [11] was utilized for evaluation, as it forms the newest and largest spectral dataset to date. It consists of 450 spectral images ranging from 400 to 700 nm in 10 nm steps and with a spatial resolution of 482 × 512 pixels. All images were captured outdoors in Israel, mostly in bright daylight. Exemplary images of the dataset are shown in Figure 2a and Figure 4. For better human interpretability, the spectral images were rendered in sRGB assuming CIE D50 illumination. The dataset was subdivided into three splits: training, validation and testing. Our test set equals the official validation split of the challenge to allow for easy comparability. Corresponding RGB images were computed from all spectral images for the three scenarios ideal, clean and real utilizing the CIE 1964 standard observer. It should be noted that the original challenge was subdivided into the two tracks “Clean” and “RealWorld”. The “Clean” track equals our scenario “clean”, in which the RGB images are only disturbed by quantization. The “RealWorld” track is similar to our scenario “real”. However, the camera sensitivity was not disclosed within the challenge track. Since knowing the camera sensitivity is essential for our workflow, we computed our own “real” images as described in the scenario by assuming a known camera response function (CIE 1964 standard observer).
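To make the RGB simulation concrete, the following minimal sketch shows how such scenario images could be generated from a spectral cube. It reflects our reading of the three scenarios (ideal: noiseless projection through the camera sensitivity; clean: additional quantization; real: additional sensor noise before quantization); the function name, the 8-bit depth and the Gaussian noise level are illustrative assumptions, not the exact pipeline.

```python
import torch

def simulate_rgb(cube: torch.Tensor, ssf: torch.Tensor, scenario: str = "ideal",
                 bits: int = 8, noise_std: float = 0.5) -> torch.Tensor:
    """Project a spectral cube (C, H, W) to RGB via a sensitivity matrix ssf (3, C).

    Our reading of the scenarios: 'ideal' is the noiseless projection,
    'clean' adds quantization only, and 'real' adds sensor noise before
    quantization (the Gaussian term and the 8-bit depth are placeholders)."""
    rgb = torch.einsum("kc,chw->khw", ssf, cube)          # ideal camera signals
    if scenario == "ideal":
        return rgb
    levels = 2 ** bits - 1
    rgb = rgb / rgb.max() * levels                        # map to [0, levels]
    if scenario == "real":
        rgb = rgb + noise_std * torch.randn_like(rgb)     # placeholder noise model
    return rgb.clamp(0, levels).floor()                   # quantization by truncation
```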
We adopted the most common error metrics to report results. The mean relative absolute error (MRAE) established itself not only as one of the standard evaluation metrics for the spectral reconstruction quality, but also as the go-to loss function for SSR:

$$\mathrm{MRAE} = \frac{100}{MNC} \sum_{m=1}^{M} \sum_{n=1}^{N} \sum_{c=1}^{C} \frac{\left| I_{mnc} - \hat{I}_{mnc} \right|}{I_{mnc}},$$

where $M$ and $N$ respectively denote the image width and height, $C$ the number of spectral channels, $I$ the ground-truth spectral image and $\hat{I}$ its reconstruction. In contrast to most other published research, we report the MRAE in percent, because we believe that it offers better readability; hence the scaling factor of 100. Additionally, the root mean squared error (RMSE) and the spectral angle mapper (SAM) in degrees are considered.
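For reference, the three metrics can be implemented in a few lines of PyTorch. This is a minimal sketch under the definitions above; the epsilon guard and the tensor layout (B, C, H, W) are our choices, not the official challenge code.

```python
import torch
import torch.nn.functional as F

def mrae(gt: torch.Tensor, rec: torch.Tensor) -> torch.Tensor:
    """Mean relative absolute error in percent (hence the factor of 100)."""
    return 100.0 * torch.mean(torch.abs(gt - rec) / gt.clamp_min(1e-8))

def rmse(gt: torch.Tensor, rec: torch.Tensor) -> torch.Tensor:
    """Root mean squared error over all pixels and bands."""
    return torch.sqrt(torch.mean((gt - rec) ** 2))

def sam(gt: torch.Tensor, rec: torch.Tensor) -> torch.Tensor:
    """Spectral angle mapper in degrees; tensors have shape (B, C, H, W)."""
    cos = F.cosine_similarity(gt, rec, dim=1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).mean()
```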
3.1. Networks and Training Details
In order to evaluate the proposed methodology, distinct network architectures from the current state-of-the-art are considered: the HSCNN-R [17], an adapted UNet [18], the adaptive weight attention network (AWAN) [15] and the pixel-aware deep function-mixture network (FMNet) [13]. The respective code is publicly available for all individual network architectures. Since all implementations are available in Python and based upon the deep learning framework PyTorch, the proposed workflow was likewise implemented in Python/PyTorch and allows us to exchange the different network architectures on a modular basis. In following such an approach, it is possible to rely on a single, unified workflow for training and evaluation. All methods but the modified UNet [18] rely on ensemble strategies to further push their performance, mostly self-ensemble and model-ensemble.
Although ensemble methods allow a model's performance to be optimized further, they do not add anything to the focus of this work while significantly increasing the computational overhead. We therefore perform all experiments without ensemble methods.
There are multiple ways to configure the different network architectures, in particular due to the ensemble strategies applied by the different authors. Since we do not consider ensemble strategies but instead focus on a single network configuration, the precise configuration for every neural network as well as further relevant implementation details are summarized below:
UNet
There only exists the single configuration proposed in [18]; no model ensemble methods were employed.
HSCNN-R
This network architecture is a residual network that was optimized for SSR. The main network configuration as proposed in [17], having 64 filters in each layer and a total of 16 residual blocks, was utilized. The weights were initialized according to the algorithm proposed by He et al. [23].
AWAN
We considered the fully stacked configuration consisting of the basic AWAN network combined with the patch-level second-order non-local (PSNL) and the adaptive weighted channel attention (AWCA) modules as reported in [15]. The number of dual residual attention blocks (DRAB) is set to 8 with 200 output channels. The AWCA module uses the reduction ratio reported in [15] and the PSNL module an r-value of 8.
FMNet
The utilized configuration [13] consists of two FM blocks, with each block containing three basis functions. Each basis function as well as the mixing functions are formed by two convolutional blocks having 64 feature maps. The initial learning rate is halved every 20 epochs and the training ends after 100 epochs.
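The described schedule maps directly onto PyTorch's built-in StepLR scheduler; in the sketch below, the optimizer, the stand-in model and the initial learning rate are placeholders (the actual values are listed in Table 1).

```python
import torch

model = torch.nn.Conv2d(3, 31, 3, padding=1)   # stand-in for the FMNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # lr is a placeholder
# Halve the learning rate every 20 epochs, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(100):                        # training ends after 100 epochs
    ...                                         # one training epoch
    scheduler.step()
```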
Finally, Table 1 provides an overview of additional hyperparameters per method that are required for training. Using these settings, each network is trained in each scenario from scratch, both in its original form (SSR) and within our proposed scale-invariant SSR (SISSR) approach. The training details therefore remain identical for both SSR and SISSR training and do not need any modification.
3.2. Results and Discussion
Table 2 offers a first concise overview of the reconstruction results. When considering all methods in their original form, they perform as would be expected from known benchmarks. The reconstruction results are generally best within the ideal scenario and worst within the real scenario, due to the increasing amount of disturbances introduced in the RGB image creation. Of particular interest for this work is the performance of all neural networks when applied within our scale-invariant approach. The influence of our methodology can be summarized as a significant performance boost within the ideal scenario, while showing a detrimental effect in terms of average reconstruction error within both the clean and the real scenario. The larger the disturbances within the RGB images, the bigger the performance gap between SSR and SISSR. This result is consistent across all network architectures.
An intuitive explanation for the observed results can be found. If the input RGB image is subjected to disturbances, any deficits in the RGB signals will obviously lead to a deterioration of the spectral reconstruction. This is under the assumption that the signal processing is unable to fully compensate for poor image quality, i.e., that the neural networks implicitly learn to denoise RGB images. However, our proposed scale propagation additionally introduces a way for the input RGB image to bypass the neural network in order to correctly propagate any signal scale. In following such an approach, not only the signal scale but also errors within the RGB images are directly propagated to the spectral domain. This conclusion can even be visualized.
Figure 5 displays the spectral reconstruction errors per pixel in terms of MRAE for an exemplary image from our test set in the clean scenario. The RGB images within this scenario are only disturbed by quantization noise. Thus, the disturbances within the RGB signals become more severe the darker the image regions are. The spectral reconstruction can therefore be expected to suffer for dark pixels, especially when using our SISSR approach. This is directly visible in Figure 5, for example in the top right corner of all images. All models already have trouble accounting for the high noise levels in their original form. Within the SISSR approach, the image noise is additionally propagated to the spectral domain, leading to even worse reconstruction results. For noiseless image regions, the SISSR approach has a varying effect on the spectral reconstruction quality: for some regions, the results are better, for others, worse. On average, the spectral reconstruction quality increases for well illuminated image regions. This is best validated by the fact that a significant performance increase is consistently achieved in the ideal scenario for all networks when used in our SISSR approach.
From the discussed results, the question arises as to how different noise levels in the RGB image affect the spectral reconstruction. For this purpose, we carried out an adapted evaluation in which only pixel positions where the RGB signal magnitude is greater than a threshold value contribute towards the average spectral reconstruction errors. Figure 6 displays the achieved reconstruction results for all methods in the clean track over the threshold value. Additionally, the chosen threshold values can be directly converted into associated signal-to-noise ratios (SNR). The worst SNR associated with a threshold value can be approximated by the mean quantization error for the bit just above the threshold. For example, if the threshold value were 12, the worst quantization errors are introduced within the signal range [12, 13], which is entirely clipped to 12. By assuming the simplification that all ideal RGB signal magnitudes are uniformly distributed, threshold values are converted to an associated SNR. Due to the involved simplifications in computing the SNR, the provided results are not perfectly accurate, but they are more than sufficient to get a general idea of the influence of the SNR on the reconstruction quality. The higher the SNR, the better the reconstruction results. However, all neural networks in their original form are somewhat robust against low SNR values and consistently yield a comparable performance; only minor benefits are gained from better signals. In contrast, the proposed method shows a stronger dependence on low noise levels. Its performance consistently increases with an increasing SNR until it surpasses the original workflow. For all network architectures but AWAN, the break-even point for SSR and SISSR is at an SNR of approximately 35 dB. The AWAN network is, in comparison to the other considered architectures, by far the most complex, as it comprises the largest number of trainable parameters.
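The threshold-to-SNR conversion can be sketched as follows; the mean quantization error of 0.5 and the logarithmic form are our reading of the approximation described above, not necessarily the exact formula used for Figure 6.

```python
import math

def threshold_to_snr_db(threshold: float, mean_quant_error: float = 0.5) -> float:
    """Approximate worst-case SNR for a given RGB magnitude threshold,
    assuming uniformly distributed ideal signals and a mean quantization
    error of half a quantization step just above the threshold."""
    return 20.0 * math.log10(threshold / mean_quant_error)

# e.g., threshold_to_snr_db(12) is roughly 27.6 dB; the 35 dB break-even
# point discussed above then corresponds to a threshold of roughly 28.
```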
It is again stressed that the proposed SISSR approach completely removes any scale from the input data before it is processed by the neural network, which therefore has less information available. Considering the pure task of spectral reconstruction, neglecting additional, implicit functionality of neural networks such as denoising, there is no decrease in performance, indicating that the camera signal scale is irrelevant for spectral signal recovery. In fact, a major performance gain was observed in the ideal scenario, which can be attributed to the more restricted solution space of our proposed approach. However, noisy RGB images have a stronger effect on the SISSR approach due to its limited denoising capabilities.
3.3. Brightness Invariance and Ablation Study
The most important advantage of the proposed approach is its inherent robustness with respect to changes in image brightness. In order to evaluate the robustness of all methods regarding changes in image brightness, all spectral images within the test set as well as the corresponding RGB images were scaled by a scalar s. Utilizing the scaled RGB images, all previously trained networks were again tasked with spectral signal recovery. The average reconstruction results for SSR at different scales are shown in Table 3. In contrast, the different scales do not affect the proposed workflow (SISSR): the results already reported for SISSR in Table 2 are independent of any changes in image brightness and may serve as comparison. This is due to the explicit normalization conducted within the SISSR workflow that completely removes different brightness levels. For a convenient comparison of all results, Table 3 is printed next to Table 2. When considering a halving, s = 0.5, or doubling, s = 2, the spectra recovered by SSR get worse on average, but for the majority of spectra, the results are still reasonable in terms of their general shape. Assuming stronger differences in scale, the spectral reconstructions become unstable and thus unreliable. Only the results achieved for SSR at scale values of s = 0.5 and s = 2 in the real scenario can be interpreted as about equal to SISSR in terms of averaged metrics over the test images. When comparing SSR and SISSR at different scale values, SISSR outperforms SSR in most cases. For a better intuition on the results reported in Table 3, Figure 7 visualizes, by way of example, the achieved results for the UNet architecture in the three distinct scenarios for both SSR and the proposed SISSR approach. The average spectral reconstruction error is plotted over different scaling factors in logarithmic scale. The dotted, constant line represents the proposed SISSR approach. In contrast, SSR appears in the shape of a parabola and is therefore limited to signal scales close to one. Finally, Figure 8 shows distinct reconstructed spectra for the AWAN network, which overall performed best at different scales. It can be observed that, in particular for strong changes in image brightness, the shape of the spectra recovered by SSR collapses, whereas SISSR remains robust.
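The brightness-invariance test itself amounts to a simple loop over scale factors. The following sketch uses the mrae metric defined in Section 3; the function name and the listed scale values are illustrative, not the exact set used for Table 3.

```python
import torch

@torch.no_grad()
def evaluate_at_scales(model, rgb, gt, scales=(0.1, 0.5, 1.0, 2.0, 10.0)):
    """Scale the spectral ground truth and the corresponding RGB input by s
    and recompute the reconstruction error (cf. Table 3)."""
    return {s: mrae(s * gt, model(s * rgb)).item() for s in scales}
```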
The proposed workflow for scale invariance (SISSR) consists of two steps (see the sketch below):
- (1) Training a CNN to only predict the shape of the spectra, up to scale, from the RGB input.
- (2) Adjusting the brightness (scale) of the recovered spectra to match the input in post-processing, i.e., Equations (12) and (13).
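The following PyTorch sketch illustrates how these two steps can wrap an arbitrary SSR backbone. It is our minimal reading of the workflow: the L2 input normalization and the closed-form least-squares scale (backprojecting the predicted spectrum into camera RGB space, standing in for Equations (12) and (13)) are assumptions, and all names are ours.

```python
import torch
import torch.nn as nn

class SISSRWrapper(nn.Module):
    """Hypothetical sketch of the scale-invariant SSR workflow:
    (1) normalize the RGB input so the CNN only sees signal directions,
    (2) rescale the predicted spectrum so that its backprojection into
        camera RGB space matches the observed signal."""

    def __init__(self, backbone: nn.Module, cam_sensitivity: torch.Tensor):
        super().__init__()
        self.backbone = backbone                  # any SSR CNN, e.g., UNet/AWAN
        # camera response, shape (3, C): maps a C-band spectrum to RGB
        self.register_buffer("S", cam_sensitivity)

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W)
        eps = 1e-8
        # step (1): explicit normalization layer removes the per-pixel scale
        norm = rgb.norm(dim=1, keepdim=True).clamp_min(eps)
        shape_pred = self.backbone(rgb / norm)    # (B, C, H, W), spectrum up to scale
        # step (2): scale propagation by backprojection into camera space;
        # alpha minimizes || alpha * S @ s_hat - rgb ||^2 per pixel
        back = torch.einsum("kc,bchw->bkhw", self.S, shape_pred)
        alpha = (back * rgb).sum(dim=1, keepdim=True) / \
                (back * back).sum(dim=1, keepdim=True).clamp_min(eps)
        return alpha * shape_pred
```

Whether the normalization uses the L2 norm or another magnitude measure is an implementation detail; the essential point is that the backbone only ever sees direction information, while the scale is re-attached from the observed RGB signal.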
The question may arise whether both steps are indeed necessary for achieving scale invariance, since step two can be applied independently of step one. Therefore, an ablation study was considered. All spectra as recovered by the standalone neural networks (SSR) are subsequently subjected to the second processing step of brightness adjustment, i.e., they are post-processed in the same way as within SISSR. This approach is referred to as SSR-N. The major difference between SISSR and SSR-N is that for SISSR the CNNs were trained to predict the shape of the spectra up to an arbitrary scale due to the explicit normalization layer, whereas for SSR-N, the CNNs were trained to predict the precise spectra, as there is no normalization layer. SSR-N can therefore be interpreted as a baseline SISSR has to outperform. The results for SSR-N are shown in Table 2 to allow for a direct comparison to previous results. Indeed, SSR-N is consistently inferior to SISSR in terms of spectral reconstruction quality, yielding the conclusion that only considering the post-processing in form of step two is insufficient. However, in contrast to SSR, SSR-N is robust to changes in image brightness, as can be observed in Figure 7. This demonstrates the explicit need to normalize the input RGB in order to enforce a shape of the recovered spectra that only depends on the direction of the RGB input vectors. Without normalization, the CNN may in the worst case recover two spectra of distinct shape when observing separate highly saturated colors that only differ in brightness, such as a dark red and a bright red. Such behavior is not reasonable from a physical perspective. Subsequently adjusting the scale of the ill-shaped spectra to better approximate the observed signal in RGB space is not beneficial.
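In code terms, SSR-N differs from the sketch after the two-step list only in that the backbone input is not normalized; a hypothetical sketch reusing the same scale adjustment:

```python
import torch

def ssr_n(backbone, S, rgb, eps: float = 1e-8):
    """SSR-N as we read it: an unmodified SSR backbone followed by the
    post-hoc scale adjustment of step (2) only, with no input normalization."""
    spec = backbone(rgb)                                   # trained to predict exact spectra
    back = torch.einsum("kc,bchw->bkhw", S, spec)          # backproject to RGB space
    alpha = (back * rgb).sum(1, keepdim=True) / \
            (back * back).sum(1, keepdim=True).clamp_min(eps)
    return alpha * spec
```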
Finally, the argument can be made that the RGB normalization layer might not be necessary when a scale invariant loss function is considered. Examples are the spectral angular error, the spectral information divergence [24] or spectral derivative based loss functions [25]. It should be noted that from a purely theoretical point of view there is no guarantee that training a network with a loss function that is invariant to changes in brightness will make the fully trained network invariant to changes in brightness. The exploration of these loss functions is, however, beyond the scope of this manuscript and might be interesting for future research.
4. Conclusions
A new method was proposed for deep learning based spectral super-resolution which, in contrast to previous work, allows any deep learning based spectral reconstruction algorithm to gain the important property of brightness/signal scale invariance. The proposed approach is based on the assumption that only the directions of both RGB and spectral signals are of relevance for the recovery of spectral signals, not the actual signal magnitude. This enables a better generalization and offers spectral reconstructions that are not only more reliable, but also better understandable by humans for practical applications. Analogous to sparse coding based techniques, signal scale propagation is achieved by backprojection into camera signal space. Consequently, emphasis is implicitly placed on predicting the general shape of the spectra. The proposed approach does not affect the inference time, since the additional processing steps are computationally extremely lightweight in comparison to a modern CNN architecture.
It was demonstrated that a significant performance gain can be observed when considering ideal signals, suggesting that the proposed approach limits the solution space of neural networks in a physically meaningful way. However, it was found that the proposed approach is more susceptible to noise than utilizing neural networks alone, trained in the classical end-to-end fashion. Although the proposed approach remains stable under the presence of noise, the averaged reconstruction quality in terms of metrics such as the mean relative absolute error is worse. An analysis was provided on how different SNR values of the input RGB image affect the spectral reconstruction quality. The break-even point between neural networks as a stand-alone and in conjunction with our approach lies at an SNR of approximately 35 dB. For higher noise levels, stand-alone neural networks perform better; for lower noise levels, the proposed approach provides superior performance.
The most significant advantage of the proposed approach is its complete robustness regarding changes in image brightness. While all network architectures alone fail to robustly offer spectral reconstructions under varying brightness levels, such differences do not affect the proposed workflow. The proposed workflow even outperforms the stand-alone networks at higher noise levels as soon as the image brightness/exposure differs from the training data. It should be noted that our brightness invariant approach is most valuable when the training set is small and does not cover all signal intensities that might occur in practice, since it achieves a better generalization. Should the training data be diverse and all signal intensities of the test set fall within the ranges known from the training set, brightness invariance can be learned by a network from the data itself.