**Novel Air Temperature Measurement Using Midwave Hyperspectral Fourier Transform Infrared Imaging in the Carbon Dioxide Absorption Band**

#### **Sungho Kim**

Department of Electronic Engineering, Yeungnam University, 280 Daehak-Ro, Gyeongsan, Gyeongbuk 38541, Korea; sunghokim@ynu.ac.kr; Tel.: +82-53-810-3530

Received: 15 May 2020; Accepted: 4 June 2020; Published: 8 June 2020

**Abstract:** Accurate visualization of the air temperature distribution is useful for various thermal analyses in fields such as human health and local-area heat transfer. This paper presents a novel approach to measuring air temperature from midwave hyperspectral Fourier transform infrared (FTIR) imaging in the carbon dioxide absorption band (4.25–4.35 μm). The proposed visual air temperature (VisualAT) measurement is based on the observation that the carbon dioxide band shows zero transmissivity even at short distances. Based on analysis of the radiative transfer equation in this band, only the path radiance produced by the air temperature survives. The brightness temperature of the received radiance provides the raw air temperature; a spectral average followed by spatial median and Gaussian filtering produces the final air temperature images. Experimental results on a database acquired with a midwave extended FTIR system (Telops, Quebec City, QC, Canada) from February to July 2018 show a mean absolute error of 1.25 K for a temperature range of 2.6–26.4 °C.

**Keywords:** air temperature; spatial measurement; FTIR; MWIR; carbon dioxide absorption

#### **1. Introduction**

How accurately can we measure and visualize air temperature remotely using thermal sensing? Air temperature is an important meteorological factor, which has a wide range of applications in fields like human health [1], virus propagation [2], growth and reproduction of plants [3], climate change [4], and hydrology [5].

Air temperature can be measured in numerous ways, using contact sensors and remote sensors. Contact sensor-based methods include thermistors, thermocouples, and mercury thermometers [6]. Thermistors are temperature-sensitive resistors that undergo predictable changes in resistance in response to changes in temperature; this resistance is measured and converted to a temperature reading in Celsius, Fahrenheit, or Kelvin. A thermocouple consists of two dissimilar electrical conductors forming an electrical junction, which produces a temperature-dependent voltage that can be converted to a temperature [7]. A mercury thermometer consists of liquid in a glass rod with a very thin tube in it; the mercury (or red-colored alcohol) inside the tube expands when the temperature rises. These sensors should be located in the shade to measure air temperature: if the sun shines directly on the thermometer, it heats the liquid and produces a temperature higher than the true air temperature. In addition, measuring outdoor air temperature takes time (at least several minutes for the liquid to expand). Furthermore, measuring the spatial distribution of air temperature would require hundreds of thousands of contact sensors.

Thermal remote sensing approaches exploit the relationship between land surface temperature (LST) and near-surface air temperature. Land surface radiance is measured by thermal infrared sensors mounted on a satellite or an airborne platform, and LST is retrieved via temperature-emissivity separation (TES) [8,9]. Surface air temperature can then be estimated by the temperature-vegetation index (TVX), thermodynamics, or regression methods. The TVX is based on the assumption that a thick vegetation canopy approximates air temperature [10,11]; it is useful only where vegetation cover is high. The thermodynamics-based method uses the energy balance between the LST and the surface environment, such as air and water [12,13]; it provides good air temperature measurements but requires many input parameters. The last type is data-based regression between air temperature and LST, ranging from linear regression models [14] to nonlinear models, notably deep learning methods [15]. Such machine learning models have reported successful air temperature estimation. One recent deep learning-based method, a five-layer deep belief network (DBN), showed promising air temperature estimation by establishing the relationship between ground-station air temperature and multi-source data (remote sensing radiance, socioeconomic data, and assimilation data) [15]. Although the deep learning method estimates air temperature quite accurately, it requires huge amounts of data to train the multi-layered deep neural architecture.

The above-mentioned approaches are not suitable for instantaneously measuring and visualizing the air temperature of a viewing area. Contact sensors require several minutes and a huge number of units to measure spatial air temperature [16]. Non-contact (remote) sensors, such as thermal infrared (TIR) imagers, require LST and regression with huge amounts of training data. Even when trained correctly, the regression-based approach is sensitive to many spatio-temporal conditions [17]. In addition, this approach is only suitable for aerial sensing from satellites or airborne platforms [18].

In this paper, our research focuses on how to measure the spatial air temperature of the viewing area and visualize it instantaneously for environmental monitoring of the surveillance area. The key idea is to use up-welling information in the carbon dioxide absorption band (4.25–4.35 μm) with a midwave Fourier transform infrared (FTIR) imager. Most temperature estimation research in TES and LST generates up-welling information using the moderate-resolution atmospheric radiance and transmittance model (MODTRAN) for atmospheric correction (compensation) [19–21]. Harig proposed passive sensing of pollutant clouds by longwave FTIR to find the optimal SNR [22]; his work focused on detecting cloud targets against ground backgrounds, not on air temperature. In this paper, air temperature is measured and visualized accurately through careful analysis of the upwelling (path) spectral radiance in the proposed visual air temperature (VisualAT) method.

The contributions of this paper can be summarized as follows. First, VisualAT measures spatial air temperature accurately (mean absolute error [MAE]: 1.25 K). Second, VisualAT visualizes the distribution of air temperature with high spatial resolution. Third, it measures and visualizes air temperature instantaneously. Finally, the proposed VisualAT can be used for various outdoor air temperature measurement applications, such as health monitoring, weather monitoring, and thermal surveillance of a local area.

The remainder of this paper is organized as follows. Section 2 introduces the materials for FTIR analysis, and Section 3 explains the proposed VisualAT method, including the radiative transfer equation. Section 4 analyzes VisualAT for temperature measurement applications, considering a range of environmental changes. The paper concludes in Section 5.

#### **2. Materials for FTIR Analysis**

#### *2.1. Outdoor Hyperspectral Data Acquisition System*

Figure 1 presents a measurement scenario in an outdoor environment consisting of a scene and a sensor system. A painted target in front of a sky-and-sea background is 78 m away from the observation laboratory. MWIR hyperspectral images were acquired with the Telops Hyper-Cam MWE model [23]. It provides calibrated spectral radiance images with high spatial and spectral resolution from a Michelson interferometer in the shortwave-to-midwave band (1.5–5.6 μm). The spatial image resolution is 320 × 240 pixels, and the spectral resolution is up to 0.25 cm⁻¹. The noise equivalent spectral radiance (NESR) is 7 nW/(cm²·sr·cm⁻¹), and the radiometric accuracy is approximately 2 K. The field of view is 6.5 × 5.1 deg.

**Figure 1.** Measurement scenario in an outdoor environment: the FTIR is located in an air-conditioned room, and the AWS is installed on the roof. The background scene is 78 m away from the camera.

The objective of this research is to estimate air temperature spatially and to visualize the temperature distribution. A cropped hypercube provides 128 × 200 × 374 (width × height × bands) data. A hyperspectral imager (HSI) database was recorded daily from February to July 2018. Recording was done three times a day (10:30, 13:30, and 15:30 h), four times an hour, for a total of 12 recordings per day. In addition, an automatic weather station (AWS) recorded environmental information such as the date, time, air temperature, humidity, air pressure, and visibility. The measured air temperatures were used to evaluate the temperature estimation accuracy of the proposed VisualAT method.

#### *2.2. FTIR Data Acquisition Interface*

The selection of the CO2 absorption band is critical in the proposed VisualAT method. An initial baseline band range was selected using the GUI shown in Figure 2. This midwave FTIR spectral image software platform was developed for hypercube image display and spectral profile analysis, with adjustable wavelength parameters and units. The top left image in Figure 2 represents a spectral image in a specific band, and the lower graph shows a spectral profile at a selected point (indicated by +). The top right image in Figure 2 is a broad-band image over a selected spectral range.

#### *2.3. Radiometric Calibration of FTIR Imaging*

The wavenumber and radiometric calibration of midwave hyperspectral data must be done accurately to measure air temperature. The technical details are described in [24]; one full set of data, including the raw IR spectrum, internal calibration, and temperature extraction, is described in the following paragraphs. A Michelson interferometer produces interferograms by moving a mirror. Figure 3a presents an interferogram image at optical path difference (OPD) ID 500 (of 1186 in total). Figure 3b gives an example of the whole interferogram at pixel (60, 60). Figure 3c shows the spectrum extracted by applying the fast Fourier transform (FFT) to the interferogram in Figure 3b; the *y*-axis unit in Figure 3c is spectral intensity in arbitrary units. The wavelength calibration is performed using the HeNe laser (wavelength *λ* = 632.8 nm). Figure 3d shows the wavenumber-calibrated spectrum.

**Figure 2.** Midwave FTIR image analysis software platform. A spectral image at a selected wavelength can be visualized interactively.

**Figure 3.** Spectral calibration process: (**a**) interferogram image at optical path difference (OPD) ID = 500 (ZPD); (**b**) interferogram at a pixel ((row, col) = (60, 60)); (**c**) fast Fourier transform (FFT) results; (**d**) wavenumber calibration results.

The next step is radiometric calibration using two blackbodies. The Hyper-Cam MWE provides spectral radiance data using two built-in blackbodies (hot and cold) [25]. Figure 4 shows the measured interferogram images, spectra (arbitrary units), and calculated spectral radiances for the hot (95 °C) and cold (25 °C) blackbodies. Figure 5a,b shows the estimated gain and offset magnitudes at pixel (60, 60), respectively. Figure 5c,d presents the calculated spectral radiance in units of wavenumber and wavelength, respectively.

The amount of spectral radiance energy can be converted into an equivalent brightness temperature [26]. By inverting Planck's law (Equation (7) below) written in wavenumber form, the temperature *T* [K] can be obtained as

$$T = \frac{hc\tilde{\nu}/k}{\ln\left[2hc^2\tilde{\nu}^3/L_S(\tilde{\nu}) + 1\right]}.\tag{1}$$

#### *2.4. MODTRAN Simulator*

The moderate-resolution atmospheric radiance and transmittance model (MODTRAN, http://modtran.spectral.com/) is used in this paper to simulate atmospheric transmittance and path thermal radiance. Figure 6 shows a GUI interface for simulating spectral path radiance by setting geometric and atmospheric parameters.

**Figure 4.** Blackbody spectrum and spectral radiance extraction for radiometric calibration: (**a**) interferogram image of a hot blackbody (95 °C); (**b**) measured spectrum of a hot blackbody; (**c**) calculated spectral radiance at the hot temperature; (**d**) interferogram image of a cold blackbody (25 °C); (**e**) measured spectrum of a cold blackbody; (**f**) calculated spectral radiance at the cold temperature.

**Figure 5.** Radiometric calibration and spectral radiance extraction: (**a**) estimated gain magnitude; (**b**) estimated offset magnitude; (**c**) spectral radiance vs. wavenumber; (**d**) spectral radiance vs. wavelength.

**Figure 6.** MODTRAN simulation environment for path thermal calculation.

#### **3. Proposed Visual Air Temperature Measurement Method**

#### *3.1. Derivation of the Radiative Transfer Equation*

Figure 7 shows the air temperature measurement scenario. Air temperature in the VisualAT measurement is derived from the radiative transfer equation, Equation (2); we adopt the radiative transfer equation used in MODTRAN [27]. In general, the at-sensor received radiance in the midwave infrared (MWIR) region consists of opaque object-emitted radiance, reflected downwelling radiance, and total atmospheric path radiance (thermal + solar components).

$$L_{obs}(\lambda) = \tau(\lambda) \left[ \varepsilon(\lambda) L_{obj}(\lambda, T_{obj}) + (1 - \varepsilon(\lambda)) \left(L_s^{\downarrow}(\lambda) + L_t^{\downarrow}(\lambda)\right) \right] + L_s^{\uparrow}(\lambda) + L_t^{\uparrow}(\lambda) \tag{2}$$

**Figure 7.** Operational concept of visual air temperature (VisualAT) measurement using the passive open path Fourier transform infrared (FTIR) imaging system.

$L_{obs}(\lambda)$ is the at-sensor radiance; $\lambda$ is the wavelength; $\varepsilon(\lambda)$ is the spectral object surface emissivity; $L_{obj}(\lambda, T_{obj})$ is the spectral radiance of the object, assumed a blackbody following the Planck function at object surface temperature $T_{obj}$. $L_s^{\downarrow}(\lambda)$ and $L_t^{\downarrow}(\lambda)$ represent the spectral downwelling solar radiance and thermal irradiance, respectively; $\tau(\lambda)$ is the spectral atmospheric transmittance; and $L_s^{\uparrow}(\lambda)$ and $L_t^{\uparrow}(\lambda)$ are the spectral upwelling solar and thermal path radiance, respectively, reaching the sensor.

According to the MODTRAN simulation in the MWIR band, the spectral transmittance of the carbon dioxide (CO2) band (4.25–4.35 μm) decreases abruptly with distance, as shown in Figure 8. The average transmittance in the CO2 band is 0.5, 0.13, 0.03, 0.005, 0.0001, and 0 at 1 m, 5 m, 10 m, 20 m, 50 m, and 100 m, respectively. If we consider only the CO2 band ($\lambda_{CO_2}$, normally 4.25–4.35 μm) with a minimum 20 m object distance, the transmittance $\tau(\lambda_{CO_2})$ can be regarded as 0, which leads to Equation (3). An MWIR FTIR camera then receives only the upwelling path solar and thermal radiances in the $\lambda_{CO_2}$ band.

$$L_{obs}(\lambda_{CO_2}) = L_s^{\uparrow}(\lambda_{CO_2}) + L_t^{\uparrow}(\lambda_{CO_2}) \tag{3}$$

According to the MWIR radiometric characteristics [28], the contribution of the solar path radiance $L_s^{\uparrow}(\lambda_{CO_2})$ from air scattering is very small, even in very dry conditions (less than 2% at 5 μm) [28]. Ignoring the first term, we can simplify Equation (3) into Equation (4):

$$L_{obs}(\lambda_{CO_2}) = L_t^{\uparrow}(\lambda_{CO_2}) \tag{4}$$

Thermal upwelling is defined in Equation (5):

$$L_t^{\uparrow}(\lambda_{CO_2}) = (1 - \tau(\lambda_{CO_2}))\,B(\lambda_{CO_2}, T_{air})\tag{5}$$

**Figure 8.** Spectral transmittance according to object-camera distance in the MWIR band. Note the abrupt absorption at 4.25–4.35 μm.

Since the spectral transmittance in the CO2 band is 0 ($\tau(\lambda_{CO_2}) = 0$), the final form is approximated in Equation (6):

$$L_{obs}(\lambda_{CO_2}) \simeq B(\lambda_{CO_2}, T_{air}) \tag{6}$$

where $B(\lambda_{CO_2}, T_{air})$ denotes the spectral radiance [W/(m²·sr·μm)] of a blackbody (Planck's law [29]), and $T_{air}$ is the air temperature in kelvin [K] of the atmosphere between the object and the camera sensor. The spectral radiation of the atmosphere is modeled as a blackbody [30–32]. Atmospheric path radiance can be described in different ways, but the simplest is to model the particles as blackbodies [32]. $B(\lambda_{CO_2}, T_{air})$ is defined in Equation (7):

$$B(\lambda_{CO_2}, T_{air}) = \frac{2hc^2}{\lambda_{CO_2}^5 \left(e^{hc/(\lambda_{CO_2} k T_{air})} - 1\right)}\tag{7}$$

where *h* denotes Planck's constant, *c* is the speed of light, and *k* is the Boltzmann constant.

The amount of spectral radiance energy can be converted into an equivalent brightness temperature [33]. By inverting Equation (7), temperature $T_{air}$ [K] can be obtained as follows:

$$T_{air}(\lambda_{CO_2}) = \frac{hc}{\lambda_{CO_2} k \ln\left(2hc^2/\left(\lambda_{CO_2}^5 B(\lambda_{CO_2}, T_{air})\right) + 1\right)}.\tag{8}$$
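To make Equations (7) and (8) concrete, the following is a minimal Python sketch of the forward Planck radiance and its brightness-temperature inversion; the function names and the sample wavelength are illustrative choices, not part of the original implementation.

```python
import numpy as np

# Physical constants (SI units)
h = 6.62607015e-34   # Planck constant [J*s]
c = 2.99792458e8     # speed of light [m/s]
k = 1.380649e-23     # Boltzmann constant [J/K]

def planck_radiance(wavelength_m, T_air):
    """Spectral radiance B(lambda, T) of a blackbody, Eq. (7) [W/(m^2*sr*m)]."""
    return 2.0 * h * c**2 / (wavelength_m**5 * (np.exp(h * c / (wavelength_m * k * T_air)) - 1.0))

def brightness_temperature(wavelength_m, radiance):
    """Invert Planck's law to recover temperature, Eq. (8) [K]."""
    return h * c / (wavelength_m * k * np.log(2.0 * h * c**2 / (wavelength_m**5 * radiance) + 1.0))

# Round-trip check at 4.31 um (a CO2 absorption band used in the paper)
lam = 4.31e-6
L = planck_radiance(lam, 293.15)          # radiance of 20 degC air
print(brightness_temperature(lam, L))     # ~293.15 K
```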

#### *3.2. VisualAT: Proposed Visual Air Temperature Measurement*

Figure 9 summarizes the overall processing flow of the proposed VisualAT method. The first row represents the three steps in spectral brightness air temperature extraction using Equation (8), described as follows.


The second row of Figure 9 represents the image processing for visual air temperature image generation, explained in the following steps.

(4) A raw temperature image is extracted via a pixel-wise temperature mean along the spectral axis. It still appears noise-like, containing salt-and-pepper noise and thermal noise, which are removed by consecutive spatial 2D median filtering and Gaussian filtering [34]; the Gaussian filtering reduces the spatial thermal noise. The empirically tuned kernel size of the median filter is 10 × 15, and the sigma of the Gaussian filter is set to 2.
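As a rough illustration of this post-processing chain, the following Python sketch applies the spectral mean, median filter, and Gaussian filter to a hypothetical brightness-temperature cube; the array shapes and the SciPy-based implementation are assumptions, not the authors' code.

```python
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter

def visualat_postprocess(bt_cube):
    """bt_cube: brightness temperature [K], shape (height, width, n_co2_bands).

    Returns the final visual air temperature image per Section 3.2.
    """
    # Pixel-wise mean along the spectral axis -> raw temperature image
    raw = bt_cube.mean(axis=2)
    # 2D median filter (empirically tuned 10 x 15 kernel) removes
    # salt-and-pepper noise such as dead pixels
    med = median_filter(raw, size=(10, 15))
    # Gaussian smoothing (sigma = 2) suppresses residual thermal noise
    return gaussian_filter(med, sigma=2.0)

# Example with a synthetic cube: 200 x 128 pixels, 3 CO2 bands near 288 K
cube = 288.0 + np.random.normal(0.0, 0.5, size=(200, 128, 3))
air_temp_image = visualat_postprocess(cube)
```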

**Figure 9.** Overall processing flow of visual air temperature (VisualAT) measurement method.

Figure 10a shows the spectral brightness temperature obtained by applying Equation (1) to the calibrated spectral radiance at pixel (60, 60). Figure 10b shows the enlarged brightness temperature around the CO2 band (4.29–4.34 μm). Figure 10c represents the spatial raw temperature distribution of the CO2 band image. The final visual air temperature image (Figure 10e) is acquired through the median and Gaussian filtering processes.

**Figure 10.** Brightness temperature extraction and VisualAT results: (**a**) brightness temperature; (**b**) enlarged brightness temperature with CO2 band region; (**c**) CO2 band image; (**d**) median filtered image; (**e**) Gaussian filtered temperature image.

#### *3.3. Analysis of Air Temperature Measurement*

The proof of radiometric air temperature measurement and visualization is possible through the mathematical derivations in Equations (2)–(8). However, validating the measurement experimentally is challenging, because it would require a huge dark room with controllable air temperature. In this subsection, we analyze the properties of the VisualAT method by demonstrating extreme air temperature measurements and by visualizing thermal air flow. Figure 11 demonstrates air temperature measurement and visualization using hot summer and cold winter data. The upper row of images represents the visual temperature extraction process and the camera's internal temperature information for the hot summer data: the ground truth air temperature is 26.4 °C and the estimated temperature is 25.6 °C. Likewise, the lower row represents the same process for the cold winter data: the ground truth air temperature is 2.6 °C and the estimated temperature is 2.52 °C. The internal camera temperatures of the IR lens, front wall, beam splitter, etc. are approximately 29 °C; they do not affect the air temperature estimation, because hot/cold blackbody-based radiometric calibration removes the effect of stray light before each measurement.

The proposed VisualAT can also visualize thermal air flow using consecutive FTIR hypercubes acquired over a very short period (1 s). Figure 12 shows the air flow directions indicated by the curves and arrows. Because the sea wind is strong, the thermal air flow changes dynamically within a very short time; this result is therefore another indirect proof of imaging variations in air temperature.

**Figure 11.** Indirect proof of air temperature measurement by extreme weather conditions: Hot summer-(**a**) broad-band image, (**b**) CO2 band image, (**c**) median filtered image, (**d**) Gaussian filtered image, (**e**) camera internal temperature information; Cold winter-(**f**) broad-band image, (**g**) CO2 band image, (**h**) median filtered image, (**i**) Gaussian filtered image, (**j**) camera internal temperature information.

**Figure 12.** Indirect proof of air temperature measurement by air flow visualization. The arrows indicate the directions of air flow.

#### *3.4. Signal Analysis of Air Temperature Monitoring*

It is important to analyze how the IR radiation from different parts of the probed air contributes to the received thermal signal. Figure 13 represents the geometric relationship between an atmospheric plane and a pixel. For an atmospheric plane at distance $R$, the probing area $A$ is $a \times b$, determined by the instantaneous field of view (IFOV). The received thermal flux at the pixel detector is $\Phi_\lambda = (1 - \tau(\lambda_{CO_2})) \cdot L_\lambda(T_{air}) \cdot A \cdot \Omega \cdot \tau_o$, where $\Omega$ is the solid angle and $\tau_o$ is the lens transmittance [35]. Because air particles are regarded as blackbodies [32], the emissivity is taken as 1. Using basic geometrical relationships, the final form becomes $\Phi_\lambda = (1 - \tau(\lambda_{CO_2})) \cdot L_\lambda(T_{air}) \cdot A_{IFOV} \cdot \Omega_{IFOV} \cdot \tau_o$, where $A_{IFOV}$ denotes the pixel area and $\Omega_{IFOV}$ represents the solid angle of the IFOV. The received thermal flux is thus strongly related to $(1 - \tau(\lambda_{CO_2}))$. The atmospheric transmittance simulation was conducted with MODTRAN 4.0, and Figure 14a represents the result: as the distance increases up to 20 m, the thermal contribution of the air particles increases. This analysis is confirmed by the MODTRAN-based path thermal simulation shown in Figures 14b and 6. Therefore, we can conclude that the received air flux is dominated by the air temperature within the first 20 m of distance.
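As a numerical illustration of this relationship, the sketch below evaluates the received flux $\Phi_\lambda$ for a hypothetical pixel; the detector pitch, IFOV, and lens transmittance are illustrative assumptions only, while the distance-to-transmittance pairs follow the MODTRAN averages quoted in Section 3.1.

```python
import numpy as np

h, c, k = 6.62607015e-34, 2.99792458e8, 1.380649e-23

def planck_radiance(lam, T):
    """Blackbody spectral radiance [W/(m^2*sr*m)]."""
    return 2 * h * c**2 / (lam**5 * (np.exp(h * c / (lam * k * T)) - 1))

# Illustrative sensor parameters (assumptions, not values from the paper)
A_ifov = (30e-6) ** 2        # pixel area for a 30 um detector pitch [m^2]
Omega_ifov = (0.35e-3) ** 2  # solid angle for a ~0.35 mrad IFOV [sr]
tau_o = 0.9                  # lens transmittance

L_air = planck_radiance(4.31e-6, 293.15)   # 20 degC air at 4.31 um

# CO2-band transmittance vs. distance (MODTRAN averages quoted in Section 3.1)
for dist, tau in [(1, 0.5), (5, 0.13), (10, 0.03), (20, 0.005), (50, 1e-4)]:
    flux = (1 - tau) * L_air * A_ifov * Omega_ifov * tau_o  # per unit wavelength [W/m]
    print(f"{dist:3d} m: received flux ~ {flux:.3e}")
```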

**Figure 13.** Geometrical model of received thermal flux in a pixel.

**Figure 14.** (**a**) (1-atmospheric transmittance) vs. distance (MODTRAN-based simulation at CO2 absorption band (4.29 μm)), (**b**) path thermal radiance vs. distance (MODTRAN-based simulation).

#### *3.5. Performance Metric*

The mean absolute error (MAE) metric was used to evaluate the performance of the VisualAT method in predicting air temperature. The MAE metric is defined in Equation (9):

$$MAE = \frac{1}{N} \sum_{k=1}^{N} \left| T_{air}^k - T_{GT}^k \right| \tag{9}$$

where $T_{air}^k$ denotes the $k$-th air temperature predicted by the VisualAT method, and $T_{GT}^k$ is the corresponding air temperature measured by the AWS, as shown in Figure 15.
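In code, the metric is a one-liner; the values below are the summer and winter examples reported in Section 3.3, used purely for illustration.

```python
import numpy as np

T_pred = np.array([25.6, 2.52])       # VisualAT estimates [degC] (Section 3.3)
T_gt = np.array([26.4, 2.6])          # AWS ground truth [degC] (Section 3.3)
mae = np.mean(np.abs(T_pred - T_gt))  # Eq. (9)
print(f"MAE = {mae:.2f} K")           # differences in degC equal kelvin
```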


**Figure 15.** AWS information: date and time; wind speed, maximum wind speed, average wind direction, and maximum wind direction; air temperature, humidity, and pressure; and visibility.

Figure 16 shows the experimental environment. Figure 16a is the outdoor environment acquired with a visible band camera, and Figure 16b shows a recorded broad-band image obtained by summing all the spectral band images (1.5–5.6 μm). Figure 16c represents the MODTRAN-based spectral transmittance at the object distance of 78 m, where the average transmittance of the CO2 band is 0.0001.

**Figure 16.** An example of experiment environment: (**a**) outdoor environment acquired by a visible band camera, (**b**) broad-band infrared image, and (**c**) spectral transmittance in MWIR band.

#### *3.6. Parameter Analysis*

A midwave hyperspectral image database was prepared for various evaluations; hypercube images from 49 days were valid during the acquisition period (February to July 2018). In the first evaluation, the effect of the CO2 band range on air temperature estimation accuracy was examined. As summarized in Figure 17, baseline bands were selected by visual inspection using the GUI shown in Figure 2. The top row of images in Figure 17 shows the spectral images corresponding to specific wavelengths. The visual selection criterion was whether the spectral image looks noise-like over the whole image area: if the atmospheric transmittance is 0, the received radiance consists only of thermal path (air) radiance. The initially selected bands were therefore 4.22, 4.27, 4.29, 4.31, and 4.34 μm. The MAE of the baseline band set was 1.32 K. When the lower band limit was reduced to 4.20 μm, the MAE became 1.29 K; when it was raised to 4.27, 4.29, and 4.31 μm individually, the corresponding MAEs were 1.26, 1.25, and 1.27 K, respectively. From these experiments, the lower band limit can be set at 4.29 μm. On the other hand, raising the upper band limit to 4.36 μm with the lower limit at 4.29 μm increased the MAE to 1.29 K. Therefore, we can conclude that the optimal CO2 absorption bands are 4.29, 4.31, and 4.34 μm; these bands were used in the following experiments.

**Figure 17.** CO2 band range selection results: the baseline band is selected by visual inspection, and the optimal band range is selected by minimizing the mean absolute error (MAE) metric.

The SNR can be improved further through the 2D median filter and 2D Gaussian filter. The 2D median filter is necessary to remove dead pixels, as shown in the top left of Figure 18, where the effects of median filter sizes from [1 1] to [15 15] are visualized. Even a [3 3] median filter removes the salt-and-pepper noise effectively, and a larger filter size extracts a larger-scale structure of the air temperature distribution. The histogram distribution after the 2D median filter is Gaussian, consistent with a thermal noise distribution. A true temperature signal with Gaussian noise can be estimated by a Gaussian smoothing filter (an unbiased, consistent linear estimator) [36]. The effects of sigma ($\sigma$) are displayed in Figure 19, where a [3 3] median filter was applied first; note that a larger $\sigma$ extracts a larger-scale structure of the air temperature distribution. The selection of the median filter size and Gaussian filter parameter depends on the application images: if the salt-and-pepper noise is strong, a larger median filter with weaker Gaussian smoothing is suitable; if it is weak, a smaller median filter with stronger Gaussian smoothing is recommended. In this paper, we used a kernel size of 10 × 15 for the median filter and $\sigma = 2$ for the Gaussian filter, because the salt-and-pepper noise is spread horizontally and the thermal noise is high.

The VisualAT method depends on the spectral radiance of the atmosphere. Sensitivity analysis and error propagation are useful for understanding the properties of VisualAT. The simulation is conducted by adding spectral radiance noise to Equation (7) and estimating the air temperature from Equation (8). Figure 20 shows the sensitivity analysis results: the ground truth air temperature is 20 °C, and the estimated temperature increases linearly with the spectral noise level. At the FTIR camera noise level (NESR, $7 \times 10^{-5}$ W/(m²·sr·cm⁻¹)), the estimated temperature is 21.08 °C.
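A minimal version of this noise-propagation experiment can be sketched as follows; the noise levels swept here are illustrative fractions of the true radiance, with only the underlying Equations (7) and (8) taken from the paper.

```python
import numpy as np

h, c, k = 6.62607015e-34, 2.99792458e8, 1.380649e-23
lam = 4.31e-6                     # CO2 band wavelength [m]

def planck(lam, T):               # Eq. (7)
    return 2 * h * c**2 / (lam**5 * (np.exp(h * c / (lam * k * T)) - 1))

def invert(lam, L):               # Eq. (8)
    return h * c / (lam * k * np.log(2 * h * c**2 / (lam**5 * L) + 1))

T_true = 293.15                   # 20 degC ground truth
L_true = planck(lam, T_true)
rng = np.random.default_rng(0)
for noise_frac in (0.0, 0.01, 0.02, 0.05):   # illustrative noise levels
    L_noisy = L_true + rng.normal(0, noise_frac * L_true, size=10000)
    T_est = invert(lam, L_noisy).mean()
    print(f"noise {noise_frac:4.0%}: estimated T = {T_est - 273.15:.2f} degC")
```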

**Figure 18.** Spatial effects of 2D median filter size and histogram distribution.

**Figure 19.** Spatial effects of 2D Gaussian filter with *σ*.

**Figure 20.** Spectral noise and estimated air temperature analysis.

#### **4. Experiment Results**

Radiometric accuracy should be evaluated to validate the usefulness of the proposed VisualAT method. Although VisualAT presents measured temperature spatially, a global mean temperature per image was extracted for comparison with the ground truth air temperature recorded by the AWS. Figure 21 shows a quantitative evaluation graph comparing the air temperature estimated by the proposed VisualAT method with the ground truth air temperature for the 49-day dataset (acquisition time: 15:30 h). The MAE was 1.25 K, which is accurate considering the remote sensing method and dynamic weather changes; the correlation coefficient between the VisualAT-based air temperature and the ground truth was 0.97. Figure 22 shows representative visual air temperature images measured using the VisualAT method. To the best of our knowledge, there is no existing method against which to compare these results; this is the first trial of instantaneous air temperature measurement and visualization in a ground-based approach.

**Figure 21.** Quantitative evaluation graph of the proposed VisualAT method for the whole database acquired from February to July 2018.

**Figure 22.** Examples of air temperature images measured by the proposed VisualAT method for different months (February 2018 to July 2018).

The dependency of the temperature estimation accuracy on environmental conditions is important for checking the robustness of the proposed VisualAT. Figure 23 summarizes the relationships between the estimation error [°C] and other factors: (a) air temperature [°C], (b) humidity [%], (c) air pressure [hPa], (d) visibility [m], and (e) long-wave thermal radiation [W/m²]. The correlation coefficient (R) is annotated on the results. According to the results, the estimation error has negative relationships with air temperature, air pressure, and visibility, and a positive relationship with humidity; that is, a more accurate temperature can be estimated when the humidity is lower. The long-wave thermal radiation showed no specific relationship. This dependency analysis is for reference only, because the data points are widely scattered.

The key advantage of the proposed VisualAT method is that it measures visual air temperature instantly; scanning one hypercube takes approximately one second. Figure 24 shows consecutive air temperature images acquired at 20-min intervals on 6 March 2018. The color axis was fixed with the MATLAB caxis([7.83 9.83]) function for a fair temperature reading. The flow of heat flux over one hour can be observed.

**Figure 23.** Dependency of the estimation error on environmental factors: (**a**) estimation error vs. air temperature; (**b**) estimation error vs. humidity; (**c**) estimation error vs. pressure; (**d**) estimation error vs. visibility; (**e**) estimation error vs. long-wave thermal radiation.

**Figure 24.** VisualAT-based dynamic temperature images measured at different times (20-min. intervals): Temporal variation of air temperature distribution can be found.

Figure 25 shows the visual air temperature images measured at night (21:29 h) and the next early morning (09:23 h). Despite strong sea fog, VisualAT could still produce air temperature images.

**Figure 25.** Example of visual air temperature images at night and in the morning: (left) night time air temperature image at 21:29; (right) early morning temperature image at 09:23.

Figure 26 shows VisualAT-based remote air temperature measurement and visualization in a sea environment. The top left shows a broad-band image integrating the 1.5–5.6 μm hypercube, and the top right is the corresponding visible band image. The bottom image is the air temperature distribution measured by the VisualAT method. The average temperature was 16.73 °C, and relatively hot and cold regions can be identified. Figure 27 presents various air temperature visualization results in a maritime environment, where different air temperature distributions can be found.

Through the various analyses and evaluations, the proposed VisualAT proves to be a novel method for measuring spatial air temperature and visualizing its distribution instantly. The CO2 concentration affects the atmospheric transmittance [37], which is related to the measured air volume. The approach is useful when a CO2 band image is available, but it has limitations, such as an expensive sensor system and a relatively small measurement air volume (20 m distance times the FOV). If there is an object within the measurement volume, the accuracy of the air temperature measurement can be degraded. Partial artifacts can appear when the assumption of zero atmospheric transmittance is broken. If the radiometric calibration is not perfect, the internal camera device temperature can distort the air temperature distribution.

**Figure 26.** Remote air temperature image visualization example from an outdoor sea environment: (top left) broad-band image; (top right) visible band image; (bottom) VisualAT-based air temperature image.

**Figure 27.** Various remote air temperature image visualization examples in a maritime environment: (**a**) broadband images; (**b**) corresponding VisualAT images.

#### **5. Discussion and Conclusions**

It is very challenging to measure air temperature contactlessly and instantly. This paper proposed a novel air temperature measurement and visualization method called VisualAT, which uses a midwave FTIR in the carbon dioxide absorption band. We found that spatial air temperature can be measured radiometrically by deriving the radiative transfer equation (RTE). The first physical atmospheric property is that the atmospheric transmittance in the CO2 band (4.25–4.35 μm) is 0.005 at 20 m, which removes the effects of object and downwelling radiation. The second is that the solar upwelling component is very small in the CO2 band. These two properties reduce the received radiance at the sensor to simply the blackbody radiance at the air temperature. The remaining image processing steps, namely spectral temperature averaging, spatial median filtering, and Gaussian smoothing, produce the final visual temperature image. The proposed VisualAT method is the first to measure and visualize air temperature remotely and instantly: there is no contact sensor, no measurement delay, and no learning for regression. Based on long-term outdoor experiments (February to July 2018, 49 valid days), the proposed VisualAT method showed an MAE of 1.25 K for a temperature range of 2.6–26.4 °C, which is relatively accurate considering all weather conditions. The measurement error has a positive correlation with humidity (R = 5.6%), negative correlations with air temperature (R = −9.3%), air pressure (R = −13.1%), and visibility (R = −5.6%), and no relationship with long-wave radiance (R = −0.5%). The long-range outdoor test validated the feasibility of visual air temperature measurement and visualization. According to the various experiments, VisualAT measures air temperature correctly if there is no hot object within 20 m and a proper CO2 absorption band is used. Furthermore, precise radiometric calibration should be run before each air temperature measurement to remove stray light. If these conditions are satisfied, the proposed VisualAT method can be applied to spatial air temperature monitoring in fields such as human health, virus propagation, plant growth, climate change, and hydrology. In future work, we will use the air temperature information for detecting remote thermal objects.

**Funding:** This research was funded by the 2020 Yeungnam University Research Grants and Agency for Defense Development (UE191095FD). The APC was funded by MOTIE (P0008473).

**Acknowledgments:** This work was supported by the 2020 Yeungnam University Research Grants. This study was supported by the Agency for Defense Development (UE191095FD). In addition, this paper was supported by Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE)(P0008473, HRD Program for Industrial Innovation).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### **AT2ES: Simultaneous Atmospheric Transmittance-Temperature-Emissivity Separation Using Online Upper Midwave Infrared Hyperspectral Images**

**Sungho Kim <sup>1,\*</sup>, Jungsub Shin <sup>2</sup> and Sunho Kim <sup>2</sup>**

**Citation:** Kim, S.; Shin, J.; Kim, S. *AT*2*ES*: Simultaneous Atmospheric Transmittance-Temperature-Emissivity Separation Using Online Upper Midwave Infrared Hyperspectral Images. *Remote Sens.* **2021**, *13*, 1249. https://doi.org/10.3390/rs13071249

Received: 1 March 2021; Accepted: 23 March 2021; Published: 25 March 2021


**Abstract:** This paper presents a novel method for atmospheric transmittance-temperature-emissivity separation (*AT*2*ES*) using online midwave infrared hyperspectral images. Temperature and emissivity separation (TES) is a well-known problem in the remote sensing domain. However, previous approaches apply a MODTRAN-based atmospheric correction process before TES in the longwave infrared band. Simultaneous online atmospheric transmittance-temperature-emissivity separation starts with an approximation of the radiative transfer equation in the upper midwave infrared band. The highest atmospheric transmittance band is used to estimate surface temperature, assuming highly emissive materials. The lowest atmospheric transmittance band (the CO2 absorption band) is used to estimate air temperature. Through onsite hyperspectral data regression, atmospheric transmittance is obtained from the y-intercept, and emissivity is separated using the observed radiance, the separated object temperature, the air temperature, and the atmospheric transmittance. The advantage of the proposed method is that it is the first attempt at simultaneous *AT*2*ES*, performing online separation without any prior knowledge or pre-processing. Midwave Fourier transform infrared (FTIR)-based outdoor experimental results validate the feasibility of the proposed *AT*2*ES* method.

**Keywords:** atmospheric transmittance; temperature; emissivity; separation; midwave infrared; hyperspectral images

#### **1. Introduction**

The concept of temperature and emissivity separation (TES) was originally developed by Gillespie et al. for Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) satellite data analysis [1,2]. Currently, TES is an important research topic in infrared remote sensing applications. Separated temperature can be used to estimate land surface temperature for the study of climate change [3,4]. Emissivity information is useful for mineral composition analysis [5], vegetative cover mapping [6], and object material classification [7].

The scope of this paper is to apply TES online on a flying platform such as an unmanned aerial vehicle. The most critical issue is how to achieve the atmospheric correction that removes the effects of path radiance and atmospheric transmittance in real time, without any prior information or pre-processing. Historically, the original TES method for ASTER satellite images used an atmospherically corrected dataset in five multispectral long wave infrared (LWIR) bands [1]. Li et al. compared six methods for extracting relative emissivity spectra from atmospherically corrected multiple spectral bands [3]. Yong et al. tried to estimate atmospheric transmittance in LWIR bands without TES [8]. Payan and Royer further analyzed the applicability and sensitivity of six TES methods [2]. Borel and Tuttle improved TES by using MODTRAN 5-based atmospheric transmittance [9]. Wang et al. also applied MODTRAN to perform atmospheric correction in a thermal airborne spectrographic imager (TASI) [10]. Adler-Golden et al. adopted simulated atmospheric parameters from the MODTRAN5 model for TES [11]. Wang et al. used the atmospheric transmittance calculated by MODTRAN for TES of Landsat-8 sensor data [12]. Pivovarník et al. improved TES by adopting smoothing in emissivity estimation, where atmospheric correction was made using MODTRAN with a mid-latitude summer atmosphere [4].

The previous works have three limitations. First, most require atmospheric correction pre-processing using MODTRAN or prior knowledge, whereby the atmospheric transmittance, downwelling, and upwelling data are generated for TES. Second, TES is conducted offline; such an approach is impractical for real-time TES on a flying platform, because atmospheric conditions change dynamically in time and space. Third, most TES techniques use LWIR databases such as ASTER and TASI.

In this paper, a novel simultaneous atmospheric transmittance-temperature-emissivity separation (*AT*2*ES*) method is proposed for online applications, based on the following key ideas. First, the radiative transfer equation (RTE) is approximated by considering the physical properties of the upper midwave infrared band (4.2–5.6 μm). Second, the highest and lowest atmospheric transmittance bands are selected: the former is used to estimate surface temperature, and the latter (the CO2 absorption band, 4.2–4.4 μm) is used to estimate air temperature. Through a data regression process, the atmospheric transmittance is estimated from the y-intercept and the air temperature. Emissivity is then separated using the observed radiance, the separated object temperature, the air temperature, and the atmospheric transmittance.

Therefore, the main contributions are summarized as follows.


The remainder of this paper is organized as follows. Section 2 explains the proposed *AT*2*ES* method, including the basics of the radiative transfer equation in the upper MWIR band. Section 3 analyzes *AT*2*ES* using a synthetic dataset and outdoor remote sensing data. The paper concludes in Section 4.

#### **2. Proposed** *AT*2*ES* **Method**

*2.1. Basics of the Radiative Transfer Equation*

Figure 1 shows hyperspectral imaging in an outdoor environment, consisting of the target, a midwave infrared-Fourier transform infrared (MWIR-FTIR) camera, the sun, and the atmosphere. Observed spectral radiance can be derived from the radiative transfer equation, Equation (1); Romaniello et al. adopted the radiative transfer equation used in MODTRAN [13]. In general, the at-sensor received radiance $L_{obs}(\lambda)$ in the MWIR region consists of opaque object-emitted radiance, reflected downwelling radiance, and total atmospheric path radiance (thermal + solar components).

$$L_{obs}(\lambda) = \tau(\lambda) \left[ \varepsilon(\lambda) L_{tg}(\lambda, T_{tg}) + (1 - \varepsilon(\lambda)) \left(L_s^{\downarrow}(\lambda) + L_t^{\downarrow}(\lambda)\right) \right] + L_s^{\uparrow}(\lambda) + L_t^{\uparrow}(\lambda) \tag{1}$$

$L_{obs}(\lambda)$ is the at-sensor radiance; $\lambda$ is the wavelength; $\varepsilon(\lambda)$ is the spectral object surface emissivity; $L_{tg}(\lambda, T_{tg})$ is the spectral radiance of the object, assumed a blackbody following the Planck function at object surface temperature $T_{tg}$. $L_s^{\downarrow}(\lambda)$ and $L_t^{\downarrow}(\lambda)$ represent the spectral downwelling solar radiance and thermal irradiance, respectively; $\tau(\lambda)$ is the spectral atmospheric transmittance; and $L_s^{\uparrow}(\lambda)$ and $L_t^{\uparrow}(\lambda)$ are the spectral upwelling solar and thermal path radiance, respectively, reaching the sensor. Observed spectral radiance $L_{obs}(\lambda)$ is acquired by applying the Fourier transform to the interferogram from the Michelson interferometer, followed by hot-cold blackbody-based radiometric calibration [14].

**Figure 1.** Operational concept of *AT*2*ES* using a passive open path Fourier transform infrared imaging system. Notation: $L_s^{\downarrow}(\lambda)$ and $L_t^{\downarrow}(\lambda)$ represent the spectral downwelling solar radiance and thermal irradiance, respectively; $L_s^{\uparrow}(\lambda)$ and $L_t^{\uparrow}(\lambda)$ are the spectral upwelling solar and thermal path radiance, respectively, reaching the sensor.

#### *2.2. Proposed Approximation of the RTE in the Upper MWIR Band*

Figure 2 visualizes the fractions of total radiance according to the radiometric characteristics at the top of atmosphere (TOA): path thermal $L_t^{\uparrow}(\lambda)$, path reflectance-solar $L_s^{\uparrow}(\lambda)$, surface reflectance-solar $L_s^{\downarrow}(\lambda)$, surface reflectance-infrared $L_t^{\downarrow}(\lambda)$, and surface-emitted $L_{tg}(\lambda, T_{tg})$. The lower MWIR band (3.0–4.2 μm) shows a large fraction of surface reflectance-solar, meaning the received radiance strongly depends on the reflected solar energy. However, the contribution of surface reflectance-solar radiance $L_s^{\downarrow}(\lambda)$ is reduced to only 1% to 4% in the upper MWIR band (4.2–5.6 μm), even in very dry conditions [15]. Figures 3 and 4 show the simulation process for surface reflected-solar and surface emitted-object radiance, with the portion of surface reflected-solar. According to the simulation, the average portion of surface reflected-solar is 0.65%, which introduces negligible error. In addition, surface-reflected downwelling thermal radiance $L_t^{\downarrow}(\lambda)$ and path reflectance-solar radiance $L_s^{\uparrow}(\lambda)$ are negligible compared to surface-emitted radiance $L_{tg}(\lambda, T_{tg})$ and path thermal radiance $L_t^{\uparrow}(\lambda)$.

**Figure 2.** Fractional distribution of spectral radiance in the MWIR band.

**Figure 3.** (**a**) Generation of surface reflected-solar, (**b**) generation of surface emitted-object. (1st row) solar radiance, object radiance, (2nd row) surface reflectivity, emissivity, and (3rd row) surface reflected-solar, surface emitted-object.

If we ignore the surface reflectance-solar, surface reflectance-infrared, and path reflectance-solar terms, we can simplify Equation (1) into Equation (2):

$$L_{obs}(\lambda) = \tau(\lambda)\varepsilon(\lambda)L_{tg}(\lambda, T_{tg}) + L_t^{\uparrow}(\lambda) \tag{2}$$

Thermal upwelling $L_t^{\uparrow}(\lambda)$ is defined in Equation (3) [9]:

$$L_t^{\uparrow}(\lambda) = (1 - \tau(\lambda)) L_{BB}(\lambda, T_{air}) \tag{3}$$

where $L_{BB}(\lambda, T_{air})$ denotes the spectral radiance [W/(m²·sr·μm)] of a blackbody (Planck's law [16]), and $T_{air}$ is the air temperature in kelvin [K] of the atmosphere between the object and the camera sensor. The spectral radiation of the atmosphere is modeled as a blackbody [17–19]. Atmospheric path radiance can be described in different ways, but the simplest is to model the particles as blackbodies [19]. $L_{BB}(\lambda, T_{air})$ is defined in Equation (4):

$$L_{BB}(\lambda, T_{air}) = \frac{2hc^2}{\lambda^5 \left(e^{hc/(\lambda k T_{air})} - 1\right)}\tag{4}$$

where *h* denotes Planck's constant, *c* is the speed of light, and *k* is the Boltzmann constant. Therefore, the final form of the proposed approximated RTE is given in Equation (5):

$$L_{obs}(\lambda) = \tau(\lambda)\varepsilon(\lambda)L_{BB}(\lambda, T_{tg}) + (1 - \tau(\lambda))L_{BB}(\lambda, T_{air}) \tag{5}$$

where $L_{tg}(\lambda, T_{tg})$ was changed to $L_{BB}(\lambda, T_{tg})$ for notational consistency. The proposed RTE is valid in the upper MWIR band (4.2–5.6 μm) with 1–4% radiance uncertainty.

**Figure 4.** Calculation of the portion of surface reflected-solar: (**a**) surface reflected-solar + surface emitted-object, (**b**) portion of surface reflected-solar [%], (**c**) enlarged view in the upper MWIR band.

#### *2.3. Details of the AT*2*ES Process*

Given the approximated RTE model in Equation (5), two unknown temperature parameters ($T_{tg}$, $T_{air}$) must be separated to estimate the spectral atmospheric transmittance $\tau(\lambda)$ and spectral emissivity $\varepsilon(\lambda)$ given $L_{obs}(\lambda)$. Figure 5 summarizes the overall *AT*2*ES* process, which consists of six blocks: brightness temperature (BT) extraction, $T_{air}$ separation, $T_{tg}$ separation, regression, $\tau(\lambda)$ separation, and $\varepsilon(\lambda)$ separation. The BT extraction block converts the spectral radiance $L_{obs}(\lambda)$ to brightness temperature units. The band range is limited to the upper MWIR band (4.2–5.6 μm) in order to use the approximate RTE model introduced in the previous subsection. Brightness temperature $BT(\lambda)$ is used in the $T_{air}$ and $T_{tg}$ separation blocks. The regression block estimates the slope $a(\lambda)$ and intercept $b(\lambda)$ parameters from the observed spectral radiance $L_{obs}(\lambda)$ and target spectral radiance $L_{BB}(\lambda, T_{tg})$. Atmospheric transmittance $\tau(\lambda)$ and target emissivity $\varepsilon(\lambda)$ are then separated using these parameters and the air radiance $L_{BB}(\lambda, T_{air})$. Each module is explained in the following paragraphs.

**Figure 5.** Proposed simultaneous *AT*2*ES* flow.

*Brightness temperature module*: The amount of spectral radiance energy can be converted into an equivalent brightness temperature [20]. By inverting Equation (4), the brightness temperature $BT(\lambda)$ [K] can be obtained as follows:

$$BT(\lambda) = \frac{hc}{\lambda k \ln\left(2hc^2/\left(\lambda^5 L_{BB}(\lambda, T)\right) + 1\right)}.\tag{6}$$

Figure 6 shows an example of brightness temperature extraction from an observed spectral radiance. The remote spectral radiance shows a complicated shape depending on the surface emissivity, atmospheric transmittance, and path radiance. Brightness temperature is the temperature a blackbody in thermal equilibrium with its surroundings would need in order to duplicate the observed intensity of a gray-body object at a specific frequency or wavelength. Because the spectral radiance provides the radiance energy at each wavelength, Equation (6) can calculate the corresponding brightness temperature at each wavelength. Note that a higher brightness temperature is extracted when the atmospheric transmittance and surface emissivity are closer to 1.

**Figure 6.** Example of brightness temperature extraction from spectral radiance: (**a**) the observed sample spectral radiance [W/(m²·sr·μm)], and (**b**) the converted brightness temperature [°C].

$T_{air}$ *separation module*: According to the MODTRAN simulation in the MWIR band, the spectral transmittance of the carbon dioxide (CO2) band (4.20–4.35 μm) decreases abruptly with distance [21]. The average transmittance in the CO2 band is 0.13, 0.03, 0.005, 0.0001, and 0 at 5 m, 10 m, 20 m, 50 m, and 100 m, respectively. Figure 7 demonstrates the atmospheric transmittance at the 50 m distance in the upper MWIR band; note that the atmospheric transmittance is 0.0001 in the CO2 absorption band. If we consider only the CO2 band ($\lambda_{CO_2}$ = 4.20–4.35 μm) with a minimum 20 m object distance, the transmittance $\tau(\lambda_{CO_2})$ can be regarded as 0, which leads to Equation (7), derived from Equation (5). An MWIR-FTIR camera receives only the upwelling path thermal radiance in the $\lambda_{CO_2}$ band.

$$L_{obs}(\lambda_{CO_2}) = L_{BB}(\lambda_{CO_2}, T_{air}) \tag{7}$$

Therefore, $T_{air}$ can be obtained by applying a mean operation to Equation (6) over $\lambda_{CO_2}$. The final form of the air temperature separation is shown in Equation (8). Figure 8 illustrates an air temperature map image obtained by applying the brightness temperature extraction method to the CO2 absorption band (4.31 μm). A representative air temperature value can be estimated using a spatial and spectral average filter over the CO2 band range.

$$T_{air} = \mathrm{mean}(BT(\lambda_{CO_2})) \tag{8}$$

**Figure 7.** Atmospheric transmittance at the 50 m distance, and the characteristics of the CO2 absorption band (4.20–4.35 μm).

**Figure 8.** Air temperature map extraction using spectral radiance in the CO2 absorption band: (**a**) the air temperature map at 4.31 μm, (**b**) the brightness temperature profile at the cross point in (**a**).

$T_{tg}$ *separation module*: The remote target temperature separation process requires two assumptions: there must be a high atmospheric transmittance band, and there must be a high emissivity band. These assumptions can be satisfied because the working distance is within 100 m, and most natural and paint materials show high emissivity. Figure 9 shows that the maximal transmittance is above 0.992 within a 100 m distance under a clear sky. The average atmospheric transmittance is 0.72 (50 m), 0.66 (100 m), 0.49 (500 m), and 0.41 (1000 m) under the 1976 US Standard Atmosphere model. If the moisture content is 3 times higher (tropical model), the corresponding average atmospheric transmittance is 0.65 (50 m), 0.57 (100 m), 0.39 (500 m), and 0.31 (1000 m); the reduction rate is 13.6% (50 m), 9.7% (100 m), 20.4% (500 m), and 20.4% (1000 m).

The spectral emissivity of the representative materials (paint, grass, asphalt, and concrete) is at least 0.9 as shown in Figure 10.

**Figure 9.** Atmospheric transmittance distribution, and the maximum values based on object distance.

**Figure 10.** Emissivity distributions of various materials in the upper MWIR band.

In these environmental conditions, there is an optimal band with a high maximum $\tau(\lambda_{opt})\varepsilon(\lambda_{opt})$ of 0.9 or more. Therefore, Equation (5) can be reduced to Equation (9) with a maximum 10% margin of error. Target temperature $T_{tg}$ can be obtained by applying the brightness temperature extraction to Equation (9). In practical terms, the optimal band $\lambda_{opt}$ is unknown a priori, because we have no information on object distances and material types. However, the problem can be bypassed by applying the max operation to Equation (6). The final form of the target temperature separation is shown in Equation (10), where $\lambda_{high}$ = 4.35–5.60 μm, the complement of the CO2 absorption band. The calculated target temperature (33.6 °C) is the blue circle overlaid in Figure 6.

$$L_{obs}(\lambda_{opt}) = L_{BB}(\lambda_{opt}, T_{tg}) \tag{9}$$

$$T_{tg} = \max(BT(\lambda_{high}))\tag{10}$$
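The two separation rules in Equations (8) and (10) reduce to simple reductions over a brightness-temperature spectrum; the sketch below illustrates them on a hypothetical BT array, with the band masks as assumptions following the text.

```python
import numpy as np

# Hypothetical brightness-temperature spectrum BT(lambda) in kelvin,
# sampled on a wavelength grid over the upper MWIR band (4.2-5.6 um)
wavelengths = np.linspace(4.2e-6, 5.6e-6, 300)
bt = np.random.default_rng(1).normal(300.0, 2.0, wavelengths.size)  # placeholder

co2_mask = (wavelengths >= 4.20e-6) & (wavelengths <= 4.35e-6)
high_mask = wavelengths > 4.35e-6              # complement: 4.35-5.60 um

T_air = bt[co2_mask].mean()    # Eq. (8): mean BT over the CO2 band
T_tg = bt[high_mask].max()     # Eq. (10): max BT over the high-transmittance band
```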

*Regression module*: The proposed approximate RTE, Equation (5), can be rewritten by replacing coefficients as follows:

$$L_{obs}(\lambda) = a(\lambda)L_{BB}(\lambda, T_{tg}) + b(\lambda) \tag{11}$$

where $a(\lambda) = \tau(\lambda)\varepsilon(\lambda)$ and $b(\lambda) = (1 - \tau(\lambda))L_{BB}(\lambda, T_{air})$. The slope $a(\lambda)$ and intercept $b(\lambda)$ can be estimated by regression between $L_{obs}(\lambda)$ and $L_{BB}(\lambda, T_{tg})$, as shown in Figure 11. Hyperspectral data points are obtained from different areas at the same distance. Each observed spectrum provides the BT from which $T_{tg}$ is separated by maximization, as explained above. Figure 12 shows the regressed coefficients for each wavelength.

**Figure 11.** Examples of linear regression between $L_{obs}(\lambda)$ and $L_{BB}(\lambda, T_{tg})$ for representative bands: (**a**) $\lambda$ = 4.568 μm, (**b**) $\lambda$ = 4.8039 μm, (**c**) $\lambda$ = 4.9432 μm, (**d**) $\lambda$ = 5.3294 μm.

**Figure 12.** Examples of slope *a*(*λ*) and y-intercept *b*(*λ*) coefficients in linear regression.
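A minimal per-wavelength regression can be sketched as follows; the synthetic inputs are stand-ins for the hyperspectral samples described above, and only Equation (11) itself comes from the paper.

```python
import numpy as np

# L_obs: (n_samples, n_bands) observed radiances from different image areas
# L_bb:  (n_samples, n_bands) blackbody radiances at each sample's separated T_tg
n_samples, n_bands = 200, 300
rng = np.random.default_rng(2)
L_bb = rng.uniform(1.0, 3.0, (n_samples, n_bands))           # placeholder values
a_true, b_true = 0.6, 0.8                                    # illustrative tau*eps and path term
L_obs = a_true * L_bb + b_true + rng.normal(0, 0.01, L_bb.shape)

# Least-squares fit of Eq. (11) independently at every wavelength
a = np.empty(n_bands)
b = np.empty(n_bands)
for j in range(n_bands):
    a[j], b[j] = np.polyfit(L_bb[:, j], L_obs[:, j], deg=1)  # slope, intercept
```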

$\tau(\lambda)$ *separation module*: Because $b(\lambda) = (1 - \tau(\lambda))L_{BB}(\lambda, T_{air})$, the atmospheric transmittance $\tau(\lambda)$ can be calculated from $b(\lambda)$ and $L_{BB}(\lambda, T_{air})$ as follows:

$$\tau(\lambda) = 1 - \frac{b(\lambda)}{L_{BB}(\lambda, T_{air})} \tag{12}$$

The atmospheric temperature $T_{air}$ provides the blackbody radiation, and the y-intercept $b(\lambda)$ is separated through the linear regression. Figure 13 (top chart) shows an example of separated atmospheric transmittance using Equation (12).

$\varepsilon(\lambda)$ *separation module*: In Equation (5), the spectral emissivity $\varepsilon(\lambda)$ can be separated using the atmospheric transmittance $\tau(\lambda)$, the object temperature $T_{tg}$, and the observed spectral radiance $L_{obs}(\lambda)$, as in Equation (13). Because each sample has its own spectral emissivity, a representative spectral emissivity profile can be obtained via the sample mean. Figure 13 (bottom) shows an example of separated emissivity using Equation (13).

$$\varepsilon(\lambda) = \frac{L_{obs}(\lambda) - b(\lambda)}{\tau(\lambda) L_{BB}(\lambda, T_{tg})} \tag{13}$$
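Continuing the regression sketch, Equations (12) and (13) close the separation. The sketch below assumes `a`, `b`, `L_obs`, and `n_bands` from the regression sketch and `T_air`, `T_tg` from the separation sketch; the Planck helper is repeated for completeness.

```python
import numpy as np

h, c, k = 6.62607015e-34, 2.99792458e8, 1.380649e-23

def planck_radiance(lam, T):
    """Blackbody spectral radiance, Eq. (4)."""
    return 2 * h * c**2 / (lam**5 * (np.exp(h * c / (lam * k * T)) - 1))

lam_grid = np.linspace(4.2e-6, 5.6e-6, n_bands)   # wavelength of each band [m]

tau = 1.0 - b / planck_radiance(lam_grid, T_air)  # Eq. (12): transmittance
# Eq. (13): per-sample emissivity, then the sample mean as the representative profile
eps_samples = (L_obs - b) / (tau * planck_radiance(lam_grid, T_tg))
eps = eps_samples.mean(axis=0)
```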

**Figure 13.** The top chart shows the separated atmospheric transmittance, and the bottom chart the separated emissivity of a sample plane.

#### **3. Experimental Results**

*3.1. Experiments Using Synthetic Hyperspectral Datasets*

In the first experiment, synthetic hyperspectral data were generated for parameter analysis using Equation (5). The four critical parameters are object temperature *Ttg*, air temperature *Tair*, emissivity *ε*(*λ*), and atmospheric transmittance *τ*(*λ*). Figure 14 demonstrates the synthetic spectrum generation flow for observed signal *Lobs*(*λ*). Figure 14a is the grass spectrum downloaded from the ECOSTRESS library on 15 August 2020 (https://ecostress.jpl.nasa.gov/) [22]. Figure 14b presents the spectral blackbody radiance of an object with temperature *Ttg* = 30 ◦C. Figure 14c is the emitted object radiance obtained by multiplying Figure 14a,b. The observed spectral radiance in Figure 14f was generated by applying the atmospheric transmittance in Figure 14d to the emitted object radiance and adding the path radiance in Figure 14e.

**Figure 14.** Synthetic spectrum generation flow: (**a**) grass emissivity, (**b**) object radiance, (**c**) emitted object radiance, (**d**) atmospheric transmittance, (**e**) path radiance, and (**f**) observed radiance.

Through the generation process, 200 observed spectra were generated, as seen in Figure 15a. As a baseline dataset, Gaussian noise was added with the following parameters: *στ* = 0.0001, *σTtg* = 1, *σTair* = 0.0001, and *σε* = 0.0001, where *στ* denotes the standard deviation of atmospheric transmittance, *σTtg* denotes the standard deviation of object temperature, *σTair* is the standard deviation of air temperature, and *σε* is the standard deviation of object emissivity. The *σTair* is set to 0.0001 to consider only the effect of *σTtg*; a value of 0.0001 is the minimal numerical value for simulation purposes. Figure 15b shows an example of a brightness temperature profile converted from an original spectral radiance. The maximum value is regarded as the object temperature, and each separated sample's temperature is displayed in Figure 15c. Each brightness temperature in the CO2 band provides a candidate air temperature, as shown in Figure 15d. The average of the distribution is regarded as the final air temperature. In this baseline dataset, the separated air temperature is 29.99 ◦C.
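The generation flow of Figures 14 and 15 can be sketched in the same vein; it reuses `planck`, `lam`, and `co2` from the first snippet, the flat emissivity and transmittance profiles below are stand-ins for the ECOSTRESS and MODTRAN curves, and the four noise levels are the baseline values quoted above.

```python
# Baseline synthetic-data generation (200 spectra), reusing planck(), lam, co2.
rng = np.random.default_rng(1)
sig_tau, sig_Ttg, sig_Tair, sig_eps = 1e-4, 1.0, 1e-4, 1e-4
eps0 = np.full(lam.size, 0.95)   # stand-in for the ECOSTRESS grass emissivity
tau0 = np.full(lam.size, 0.72)   # stand-in for MODTRAN transmittance
tau0[co2] = 0.0                  # the CO2 band is opaque at short range

spectra = np.empty((200, lam.size))
for n in range(200):
    tau = tau0 + rng.normal(0.0, sig_tau, lam.size)
    eps = eps0 + rng.normal(0.0, sig_eps, lam.size)
    T_tg_n = 303.15 + rng.normal(0.0, sig_Ttg)    # 30 C object temperature
    T_air_n = 303.15 + rng.normal(0.0, sig_Tair)
    # approximate RTE, Eq. (5): emitted object term plus path radiance
    spectra[n] = tau * eps * planck(lam, T_tg_n) + (1.0 - tau) * planck(lam, T_air_n)
```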

The left side of Figure 16 presents the estimated slope and intercept coefficients for the upper MWIR band, and the right side shows an example of linear regression at *λ* = 5.6 μm, indicating the slope and intercept. Final separation of atmospheric transmittance and emissivity is achieved by applying Equations (12) and (13) to the separated parameters and the observed spectrum, as shown in Figure 17. In this case, the mean absolute error (MAE [23]) of spectral atmospheric transmittance is 0.013, and that of spectral emissivity is 0.015.

**Figure 15.** Temperature separation from synthetic spectra: (**a**) generated synthetic data (200 spectra), (**b**) brightness temperature and peak value for a sample spectrum, (**c**) the distribution of separated object temperatures, and (**d**) the distribution of separated atmospheric temperatures using the CO2 absorption band.

**Figure 16.** Data regression from synthetic spectra: (**left**) slope coefficient and y-intercept coefficient, and (**right**) data regression example for the 5.6 μm wavelength.

**Figure 17.** Separated atmospheric transmittance and emissivity: (**a**) a comparison of spectral atmospheric transmittance between the proposed method and ground truth, and (**b**) a comparison of spectral emissivity between the proposed method and ground truth.

It is important to analyze the effects of noise in simultaneous four-parameter (*Ttg*, *Tair*, *τ*(*λ*),*ε*(*λ*)) separation. The MAE performance metric is used to check the trend. If *σTtg* varies from 0.5 to 4.0, the MAEs of the four parameters are shown in Figure 18. As the object surface temperature variation increases, the error in emissivity and air temperature increases. On the other hand, the atmospheric transmittance separation error is reduced.

**Figure 18.** Parameter separation performance according to object temperature noise (*σTtg* ): (**a**) MAE of *τ*,*ε*, and (**b**) MAE of *Ttg*, *Tair*.

If *σTair* varies from 0.0001 to 2.0, the MAEs of the four parameters are as shown in Figure 19. As the air temperature noise increases, the error in atmospheric transmittance, object temperature, and air temperature increases. On the other hand, emissivity separation error has almost no relation to air temperature noise.

**Figure 19.** Parameter separation performance based on air temperature noise (*σTair* ): (**a**) MAE of *τ*,*ε*, and (**b**) MAE of *Ttg*, *Tair*.

If *στ* varies from 0.0001 to 0.0008, the MAEs of the four parameters are as shown in Figure 20. As the atmospheric transmittance noise increases, the error in atmospheric transmittance increases. On the other hand, other parameter separation errors have almost no relation to atmospheric transmittance noise.

**Figure 20.** Parameter separation performance based on atmospheric transmittance noise (*στ*): (**a**) MAE of *τ*,*ε*, and (**b**) MAE of *Ttg*, *Tair*.

Finally, if *σε* varies from 0.0001 to 0.1, the MAEs of the four parameters are as shown in Figure 21. As emissivity noise increases, the error in atmospheric transmittance and emissivity increases. In a small noise interval (0.0001–0.01), the error in air temperature increases sharply. Object temperature separation errors have almost no relation to emissivity noise.

To verify the approximation of the RTE in Equation (2), the effect of path reflectance-solar on air temperature estimation was analyzed, as shown in Figure 22a. The portion of path reflectance-solar was varied from 0 to 0.5%, and the corresponding temperature error ranged from 0 to 0.138 ◦C. Likewise, the effect of surface reflectance-infrared on target temperature estimation was analyzed, as shown in Figure 22b. In this case, the effect is even more negligible due to the small reflectivity (0.05 for grass) in the upper MWIR band.

**Figure 21.** Parameter separation performance based on emissivity noise (*σε*): (**a**) MAE of *τ*,*ε*, and (**b**) MAE of *Ttg*, *Tair*.

**Figure 22.** (**a**) Effect of path reflectance-solar in air temperature estimation, (**b**) effect of surface reflectance-infrared in target temperature estimation.

#### *3.2. Experiments Using Real Hyperspectral Datasets*

In the second experiment, the feasibility of the proposed *AT*2*ES* was validated for practical applications. Figure 23 shows the hyperspectral data acquisition environment and the data sampling points for evaluation. MWIR hyperspectral images were acquired with the Telops Hyper-Cam MWE model [24]. It provides calibrated spectral radiance images with high spatial and spectral resolution from a Michelson interferometer in the shortwave to midwave band (1.5–5.6 μm). Spatial image resolution was 320 × 240, with spectral resolution up to 0.25 cm−1. The noise equivalent spectral radiance (NESR) was 7 nW/(cm<sup>2</sup>·sr·cm<sup>−1</sup>), and the radiometric accuracy was approximately 2 K. The field of view was 6.5 × 5.1 deg.

In this paper, only the upper MWIR band (4.2–5.6 μm) was used, where our approximate RTE model is valid. Although a top-down aerial surveillance scenario is ideal, we chose a ground-based side-looking scenario because the Telops MWE camera is too large and heavy to mount on an airborne platform. Note that a narrow horizontal region was selected in order to satisfy the assumption of common atmospheric transmittance. In addition, there were 450 grass samples and 450 asphalt samples.

**Figure 23.** Outdoor field test environment and hyperspectral data acquisition scenario.

Our proposed *AT*2*ES* method can simultaneously extract four parameters: *Tair*, *Ttg*, *τ*(*λ*), *ε*(*λ*). According to the experimental results, the estimated *Tair* was 20.8 ◦C, which is 0.5 ◦C lower than the reference air temperature provided by the Korea Meteorological Administration (21.3 ◦C). In addition, the estimated *Ttg* was 21.8 ◦C. The ground truth for grass temperature is hard to measure due to weak leaves and complex structures. Normally, grass temperature is almost the same as air temperature in a thermal equilibrium state [25]. In general, grass has high albedo and high emissivity (>0.95). High albedo prevents solar energy absorption, and high emissivity absorbs the thermal energy radiated by the nearby air. Under windless conditions, the assumption that grass temperature is almost the same as air temperature is reasonable. However, if the wind is strong, evapotranspiration from the grass is an important factor that leads to a lower grass temperature [25].

Figure 24 shows the estimated spectral atmospheric transmittance and emissivity, compared with MODTRAN and the ECOSTRESS grass library. In the MODTRAN simulation, object distance was set to 50 m in a mid-latitude spring environment. Note that *AT*2*ES* can estimate spectral atmospheric transmittance quite accurately, as shown in Figure 24a. In the emissivity comparison, sample No. VH351 (Bromus diandrus) from the ECOSTRESS spectral library was chosen because it was most similar to our grass region. Considering the complex grass composition, *AT*2*ES* estimated a similar emissivity profile, as shown in Figure 24b. Figure 25 visualizes the spectral estimation error of *τ*(*λ*), *ε*(*λ*). The MAEs of atmospheric transmittance and emissivity were 0.087 and 0.063, respectively. Note that large errors were generated around low atmospheric transmittance bands.

**Figure 24.** Grass region: Comparison of atmospheric transmittance and emissivity estimation by the proposed *AT*2*ES*: (**a**) spectral atmospheric transmittance comparison with MODTRAN, and (**b**) spectral emissivity comparison with the ECOSTRESS library.

**Figure 25.** Grass region: Estimation error of spectral atmospheric transmittance and emissivity by the proposed *AT*2*ES*.

In an asphalt region, the estimated *Ttg* was 41.4 ◦C. Ground truth for asphalt temperature was hard to measure due to the bumpy structure. In general, solar radiance energy (visible/near IR) is converted to longwave thermal energy, so the asphalt temperature is higher than the air temperature. FTIR imaging was done at 13:39 h on 21 May 2020.

Figure 26 shows the estimated spectral atmospheric transmittance and emissivity, compared with MODTRAN and the ECOSTRESS library. The MODTRAN simulation was the same as in the grass experiment. Note that *AT*2*ES* can estimate spectral atmospheric transmittance quite accurately, as shown in Figure 26a. In the emissivity comparison, sample ID 0095UUUASP (Paving Asphalt) from the ECOSTRESS spectral library was chosen because it was most similar to our asphalt region. As shown in Figure 26b, *AT*2*ES* estimated a similar emissivity profile considering the complex asphalt composition, but with some emissivity offset. Figure 27 visualizes the spectral estimation error of *τ*(*λ*), *ε*(*λ*). The MAE for emissivity was 0.041. Note that, as in the grass experiment, large errors were generated around low atmospheric transmittance bands.

Interestingly, if we add an object temperature offset of 2 ◦C to *Ttg*, the estimated emissivity moves upward, as shown in Figure 28a, with the same emissivity profile shape. Figure 28b shows the estimation error profile of atmospheric transmittance and emissivity. The MAE of emissivity was reduced from 0.041 to 0.023. From this additional test, we find that the proposed *AT*2*ES* estimates a lower object temperature for low-emissivity material, which leads to an emissivity profile with an offset. Improving *AT*2*ES* for low-emissivity objects is a future research direction.

**Figure 26.** Asphalt region: Comparison of atmospheric transmittance and emissivity estimation by the proposed *AT*2*ES* and MODTRAN: (**a**) spectral atmospheric transmittance, and (**b**) spectral emissivity.

**Figure 27.** Asphalt region: Estimation error in spectral atmospheric transmittance and emissivity by the proposed *AT*2*ES*.

**Figure 28.** Asphalt region: Control of the emissivity offset by adding an object temperature offset: (**a**) spectral emissivity, and (**b**) estimation error in spectral atmospheric transmittance and emissivity.

#### **4. Conclusions**

Temperature emissivity separation (TES) is an important research topic in the remote sensing community. Most approaches use atmospheric correction to remove the atmospheric transmittance, downwelling, and upwelling terms generated by MODTRAN. However, atmospheric conditions change from time to time and from region to region. This paper presents *AT*2*ES*, a novel method to separate atmospheric transmittance, temperature, and emissivity simultaneously without the aid of an offline MODTRAN simulation.

The key idea is based on the radiative transfer properties of the upper MWIR band (4.2–5.6 μm), where the downwelling and solar upwelling components are negligible (1–4%) for a high-emissivity surface (above 0.9) at a 100 m distance. From the proposed approximate RTE, the *AT*2*ES* algorithm can separate four parameters simultaneously. Air temperature is extracted from the brightness temperature in the CO2 absorption band (4.20–4.35 μm). The object surface temperature is obtained by applying the max operation to the brightness temperature, excluding the CO2 absorption band. Given observed spectral radiance samples and an object temperature, regression between object blackbody radiance and observed radiance provides the slope and intercept. In particular, spectral atmospheric transmittance is separated using the y-intercept and air blackbody radiance. The separated atmospheric transmittance is the same for all samples, but each sample has a different emissivity under the same atmospheric transmittance. Therefore, each spectral emissivity is calculated using the separated parameters, and the average operation provides a representative spectral emissivity profile for a certain region.

The first experiment, using synthetic spectra, examined the effects of noise on the four parameters. Object surface temperature error directly affects spectral emissivity and air temperature. Air temperature error affects atmospheric transmittance, object temperature, and air temperature. Atmospheric transmittance error directly affects the estimation of atmospheric transmittance. Object emissivity error also affects atmospheric transmittance. The second experiment was based on an outdoor dataset to check the feasibility of the proposed *AT*2*ES*. In grass region samples, the separated temperature parameters were very close to the measured temperatures. The separated spectral atmospheric transmittance and emissivity were similar to the MODTRAN profiles; this is due to the high emissivity of grass regions. In the asphalt region, the estimated emissivity was somewhat higher than the ECOSTRESS profile due to a lower object temperature estimate. If the object temperature was increased by 2 ◦C, the spectral emissivity was consistent with the spectral library. Therefore, a future research direction is an improved *AT*2*ES* method for low-emissivity materials.

**Author Contributions:** The contributions were distributed between authors as follows: S.K. (Sungho Kim) wrote the text of the manuscript and programmed the hyperspectral *AT*2*ES* method using upper MWIR-FTIR data. J.S. and S.K. (Sungho Kim) provided the midwave infrared hyperspectral database and the operational scenario, performed the in-depth discussion of the related literature, and confirmed the accuracy experiments that are exclusive to this paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by ADD grant number UE191095FD, 2021 Yeungnam University Research Grants, and NRF (NRF-2018R1D1A3B07049069).

**Acknowledgments:** This study was supported by the Agency for Defense Development (UE191095FD). This work was supported by the 2021 Yeungnam University Research Grants. This research was also supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1D1A3B07049069).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

<sup>1.</sup> Gillespie, A.; Rokugawa, S.; Matsunaga, T.; Cothern, J.S.; Hook, S.; Kahle, A.B. A temperature and emissivity separation algorithm for Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) images. *IEEE Trans. Geosci. Remote Sens.* **1998**, *36*, 1113–1126. [CrossRef]


### *Article* **Hyperspectral Nonlinear Unmixing by Using Plug-and-Play Prior for Abundance Maps**

**Zhicheng Wang 1,2, Lina Zhuang 3, Lianru Gao 1,***∗***, Andrea Marinoni 4, Bing Zhang 1,2 and Michael K. Ng <sup>5</sup>**


Received: 20 October 2020; Accepted: 11 December 2020; Published: 16 December 2020

**Abstract:** Spectral unmixing (SU) aims at decomposing a mixed pixel into basic components, called endmembers, with corresponding abundance fractions. The linear mixing model (LMM) and nonlinear mixing models (NLMMs) are the two main classes used to solve the SU problem. This paper proposes a new nonlinear unmixing method based on the generalized bilinear model, which is one of the NLMMs. Since retrieving the endmembers' abundances represents an ill-posed inverse problem, prior knowledge of abundances has been investigated by conceiving regularization techniques (e.g., sparsity, total variation, group sparsity, and low rankness) so as to restrict the solution space and thus achieve reliable estimates. All the regularizations mentioned above can be interpreted as denoising of abundance maps. In this paper, instead of investing effort in designing more powerful regularizations of abundances, we use the plug-and-play prior technique, that is, we directly use a state-of-the-art denoiser, which is conceived to exploit the spatial correlation of abundance maps and nonlinear interaction maps. The numerical results on simulated data and a real hyperspectral dataset show that the proposed method can improve the estimation of abundances dramatically compared with state-of-the-art nonlinear unmixing methods.

**Keywords:** hyperspectral imagery; plug-and-play; denoising; nonlinear unmixing

#### **1. Introduction**

Hyperspectral remote sensing imaging is a combination of imaging technology and spectral technology. It can obtain two-dimensional spatial information and spectral information of target objects simultaneously [1–3]. Benefiting from the rich spectral information, hyperspectral images (HSIs) can be used to identify materials precisely. Hence, HSIs have been playing a key role in earth observation and are used in many fields, including mineral exploration, water pollution, and vegetation [3–9]. However, due to low spatial resolution, mixed pixels always exist in HSIs, and they are one of the main reasons that preclude the widespread use of HSIs in precise target detection and classification applications. It is therefore necessary to develop unmixing techniques [2,3,10–14], and the rich band information of hyperspectral images allows us to design effective solutions to the mixed-pixel problem. Hyperspectral unmixing (HU) is the process of obtaining the basic components (called endmembers) and their corresponding component ratios (called abundance fractions). Spectral unmixing can be divided into linear unmixing (LU) and nonlinear unmixing (NLU) [2,3]. LU assumes that photons only interact with one material and there is no interaction between materials. Usually, linear mixing only happens in macro scenarios. NLU assumes that photons interact with a variety of materials, including infinite mixtures and bilinear mixtures. For NLU, various models have been proposed to describe the mixing of pixels, taking into account the more complex reflections in the scene. Specifically, they are the generalized bilinear model (GBM) [15], the polynomial post-nonlinear model (PPNM) [16], the multilinear mixing model (MLM) [17], the p-linear model [18], the multiharmonic postnonlinear mixing model (MHPNMM) [19], the nonlinear non-negative matrix factorization (NNMF) [20], and so on. Although different kinds of nonlinear models have been proposed to improve the accuracy of the abundance results, they are always limited by the endmember extraction algorithm. Meanwhile, complex models often lead to excessive computing costs. The LMM has been widely used to address the LU problem, while the GBM is the most popular model among the NLMMs for NLU. NLU is a more challenging problem than LU, and we mainly focus on NLU in this paper.

Prior information on the abundances has been exploited for spectral unmixing. Different regularizations (such as sparsity, total variation, and low rankness) have been imposed on the abundances to improve the accuracy of abundance estimation.

In sparse unmixing methods, the sparsity prior of the abundance matrix is exploited as a regularization term [21–23]. To produce a sparser solution, group sparsity regularization was imposed on the abundance matrix [24]. Meanwhile, the sparsity prior is also considered on the interaction abundance matrix, because the interaction abundance matrix is much sparser than the abundance matrix [25]. In order to capture the spatial structure of the data, the low-rank representation of the abundance matrix was used in References [25–28].

Spatial correlation in abundance maps has also been taken advantage of for spectral unmixing. By reorganizing the abundance vector as a two-dimensional matrix (the height and width of the HSI, respectively), we can obtain an abundance map of the *i*th endmember. In order to make full use of the spatial information of abundance maps, the total variation (TV) of abundance maps was proposed to enhance the spatial smoothness of the abundances [28–31]. Low-rank representation of abundance maps was newly introduced to LU in Reference [32].

However, it is worth mentioning that all the regularizations mentioned above provide a priori information about abundances. Specifically, the sparse regularization promotes sparse abundances. Total variation holds the view that each abundance map is piecewise smooth. Low-rank regularization enforces the abundance maps to be low-rank. Furthermore, when solving a regularized optimization problem using ADMM, the subproblem composed of a data fidelity term and a regularization term is the so-called "Moreau proximal operator" or "denoising operator" [33–36].

The plug-and-play technique is a flexible framework that allows imaging models to be combined with state-of-the-art priors or denoising models [37]. This technique has been successfully used to solve inverse problems of images, such as image inpainting [38,39], compressive sensing [40], and super-resolution [41,42]. Instead of investing effort in designing more powerful regularizations on abundances, we directly use a prior from a state-of-the-art denoiser as the regularization, which is conceived to exploit the spatial correlation of abundance maps. We thus apply the plug-and-play technique to the field of spectral unmixing, especially hyperspectral nonlinear unmixing. In particular, since NLU is a challenging problem in HU, it is expected that such a powerful tool can improve the accuracy of abundance inversion efficiently.

This paper exploits the spatial correlation of abundance maps through a plug-and-play technique. We tested two of the best single-band denoising algorithms, namely the block-matching and 3D filtering method (BM3D) [43] and the denoising convolutional neural network (DnCNN) [44].

The main contributions of this article are summarized as follows.


The rest of the article is structured as follows. Section 2 introduces the related works and the proposed plug-and-play prior based hyperspectral nonlinear unmixing framework. Experimental results and analysis for the synthetic data are illustrated in Section 3. The real hyperspectral dataset experiments and analysis are described in Section 4. Section 5 concludes the paper.

#### **2. Nonlinear Unmixing Problem**

#### *2.1. Related Works*

#### 2.1.1. Symbols and Definitions

We first introduce the notation and definitions used in the paper. An $n$th-order tensor is denoted using Euler script letters, for example, $\mathcal{Q} \in \mathbb{R}^{k_1\times k_2\times\dots\times k_i\times\dots\times k_n}$, where $k_i$ is the size of dimension $i$. Hence, an HSI can be naturally represented as a third-order tensor, $\mathcal{T} \in \mathbb{R}^{k_1\times k_2\times k_3}$, which consists of $k_1 \times k_2$ pixels and $k_3$ spectral bands. Three further definitions related to tensors are given as follows.

**Definition 1.** *The dimensions of a tensor are called its modes:* $\mathcal{Q} \in \mathbb{R}^{k_1\times k_2\times\dots\times k_i\times\dots\times k_n}$ *has n modes. For a third-order tensor* $\mathcal{T} \in \mathbb{R}^{k_1\times k_2\times k_3}$*, by fixing one mode, we can obtain the corresponding sub-arrays, called slices (for example,* $\mathcal{T}_{:,:,i}$*).*

**Definition 2.** *The 3-mode product is denoted as* $\mathcal{G} = \mathcal{Q} \times_3 X \in \mathbb{R}^{k_1\times k_2\times j}$ *for a tensor* $\mathcal{Q} \in \mathbb{R}^{k_1\times k_2\times k_3}$ *and a matrix* $X \in \mathbb{R}^{j\times k_3}$*.*

**Definition 3.** *Given a matrix* $\mathbf{A} \in \mathbb{R}^{k_1\times k_2}$ *and a vector* $\mathbf{c} \in \mathbb{R}^{l_1}$*, their outer product, denoted* $\mathbf{A} \circ \mathbf{c}$*, is a tensor with dimensions* $(k_1, k_2, l_1)$ *and entries* $(\mathbf{A}\circ\mathbf{c})_{i_1,i_2,j_1} = \mathbf{A}_{i_1,i_2}\,\mathbf{c}_{j_1}$*.*
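A small numpy illustration of Definitions 1–3 may help fix the notation; all shapes here are hypothetical.

```python
import numpy as np

k1, k2, k3, j, l1 = 4, 5, 6, 3, 7
Q = np.random.rand(k1, k2, k3)            # a third-order tensor
slice_i = Q[:, :, 0]                      # Definition 1: a slice, fixing mode 3

X = np.random.rand(j, k3)
G = np.tensordot(Q, X, axes=([2], [1]))   # Definition 2: 3-mode product, (k1, k2, j)

A = np.random.rand(k1, k2)
c = np.random.rand(l1)
T = A[:, :, None] * c[None, None, :]      # Definition 3: outer product, (k1, k2, l1)
```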

#### 2.1.2. Nonlinear Model: GBM

A general expression of nonlinear mixing models, considering the second-order photon interactions between different endmembers, is given as follows:

$$\mathbf{y} = \mathbf{C}\mathbf{a} + \sum_{i=1}^{R-1}\sum_{j=i+1}^{R} b_{i,j}\,\mathbf{c}_i \odot \mathbf{c}_j + \mathbf{n}, \tag{1}$$

where $\mathbf{y} \in \mathbb{R}^{L\times 1}$ is a pixel with $L$ spectral bands. $\mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_R] \in \mathbb{R}^{L\times R}$, $\mathbf{a} = [a_1, a_2, \dots, a_R]^T \in \mathbb{R}^{R\times 1}$, and $\mathbf{n} \in \mathbb{R}^{L\times 1}$ represent the mixing matrix containing the spectral signatures of $R$ endmembers, the fractional abundance vector, and the white Gaussian noise, respectively. The nonlinear coefficient $b_{i,j}$ controls the nonlinear interaction between the materials, and $\odot$ is the Hadamard (element-wise) product. With different specific definitions of $b_{i,j}$, there are several well-known mixture models, such as the GBM [15], FM [1], and PPNM [16].

To satisfy the physical assumptions and overcome the limitations of the FM [1], the GBM redefines the parameter $b_{i,j}$ as $b_{i,j} = \gamma_{i,j} a_i a_j$. Meanwhile, the abundance non-negativity constraint (ANC) and the abundance sum-to-one constraint (ASC) are satisfied as follows:

$$\begin{aligned} a_i &\ge 0,\quad \sum_{i=1}^{R} a_i = 1, \\ 0 &< \gamma_{i,j} < 1,\quad \forall\, i < j, \\ \gamma_{i,j} &= 0,\quad \forall\, i \ge j. \end{aligned} \tag{2}$$
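As a concrete reading of Equations (1) and (2), the following sketch simulates one GBM pixel with synthetic endmembers; the abundances and $\gamma$ coefficients are made up but satisfy the ANC, the ASC, and the constraints on $\gamma$.

```python
import numpy as np

rng = np.random.default_rng(0)
L_bands, R = 224, 3
C = rng.random((L_bands, R))            # R synthetic endmember signatures
a = np.array([0.5, 0.3, 0.2])           # abundances: ANC and ASC hold
gamma = {(0, 1): 0.4, (0, 2): 0.2, (1, 2): 0.6}   # gamma_ij for i < j only

y = C @ a                               # linear part of Eq. (1)
for (i, j), g in gamma.items():
    # GBM interaction: b_ij = gamma_ij * a_i * a_j, Hadamard product c_i (.) c_j
    y += g * a[i] * a[j] * (C[:, i] * C[:, j])
y += rng.normal(0.0, 1e-3, L_bands)     # additive white Gaussian noise
```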

The spectral mixing model for *N* pixels can be written in matrix form:

$$\mathbf{Y} = \mathbf{C}\mathbf{A} + \mathbf{M}\mathbf{B} + \mathbf{N},\tag{3}$$

where $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_N] \in \mathbb{R}^{L\times N}$, $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_N] \in \mathbb{R}^{R\times N}$, $\mathbf{M} \in \mathbb{R}^{L\times R(R-1)/2}$, $\mathbf{B} \in \mathbb{R}^{R(R-1)/2\times N}$, and $\mathbf{N} \in \mathbb{R}^{L\times N}$ represent the observed hyperspectral image matrix, the fractional abundance matrix with $N$ abundance vectors (the columns of $\mathbf{A}$), the bilinear interaction endmember matrix, the nonlinear interaction abundance matrix, and the white Gaussian noise matrix, respectively.

Equations (1) and (3) both model the hyperspectral image as a two-dimensional matrix for processing, thus destroying the internal spatial structure of the data and resulting in poor abundance inversion. However, given that hyperspectral images can be naturally represented as a third-order tensor, we rewrite the GBM based on a tensor representation of the original hyperspectral image cube. The hyperspectral image cube $\mathcal{Y} \in \mathbb{R}^{n_{row}\times n_{col}\times L}$ can be expressed in the following format:

$$\mathcal{Y} = \mathcal{A} \times_3 \mathbf{C} + \mathcal{B} \times_3 \mathbf{M} + \mathcal{N}, \tag{4}$$

where $\mathcal{A} \in \mathbb{R}^{n_{row}\times n_{col}\times R}$, $\mathcal{B} \in \mathbb{R}^{n_{row}\times n_{col}\times R(R-1)/2}$, and $\mathcal{N} \in \mathbb{R}^{n_{row}\times n_{col}\times L}$ denote the abundance cube corresponding to $R$ endmembers, the nonlinear interaction abundance cube, and the white Gaussian noise cube, respectively.
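The tensor form in Equation (4) maps directly onto `np.tensordot`; the following continues the numpy illustration above, with hypothetical cube sizes.

```python
import numpy as np

n_row, n_col, L_bands, R = 64, 64, 224, 3
P = R * (R - 1) // 2
A = np.random.rand(n_row, n_col, R)
A /= A.sum(axis=2, keepdims=True)                 # enforce the ASC per pixel
B = 0.1 * np.random.rand(n_row, n_col, P)
C = np.random.rand(L_bands, R)                    # endmember matrix
M = np.random.rand(L_bands, P)                    # bilinear interaction endmembers

# Eq. (4): Y = A x3 C + B x3 M + N, each term a 3-mode product (Definition 2)
Y = (np.tensordot(A, C, axes=([2], [1]))
     + np.tensordot(B, M, axes=([2], [1]))
     + 1e-3 * np.random.randn(n_row, n_col, L_bands))
```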

This work aims to solve a supervised unmixing problem, that is, to estimate the abundances, $\mathcal{A}$, and nonlinear coefficients, $\mathcal{B}$, given the spectral signatures of the endmembers, $\mathbf{C}$, which are known beforehand.

#### *2.2. Motivation*

In this paper, we apply, for the first time, the plug-and-play technique to the unmixing problem, specifically to the abundance maps and interaction abundance maps, to enhance the accuracy of the estimated abundances. The plug-and-play technique can be used as prior information instead of other convex regularizers [21,22].

The performance of this method is constrained by the denoiser. Two state-of-the-art denoisers, BM3D and DnCNN, were chosen as the prior information for the abundance maps [43,44]. BM3D is a well-known nonlocal patch-based denoiser, which removes noise in a natural image by taking advantage of the high spatial correlation of similar patches in the image. As geographic hyperspectral data, the materials in HSIs tend to be spatially dependent, so it is very easy to find similar patches in the images. Meanwhile, the spatial distribution of a single material tends to be aggregated rather than purely random. The texture structure of abundance maps can be illustrated with the example given in Figure 1. The unmixing of a San Diego Airport image of size 160 × 140 pixels was carried out. The first row in Figure 1 shows the abundance map of 'Ground & road' estimated by FCLS [45] following an endmember estimation step (vertex component analysis (VCA) [46]). As shown in Figure 1, we can find many similar patches (marked with small yellow squares) in the abundance map. Hence, this nonlocal patch-based denoiser can be used on the abundance maps.

**Figure 1.** Denoising an abundance map in San Diego Airport image using BM3D and DnCNN denoisers.

With the development of deep learning, convolutional neural network (CNN)-based denoising methods have achieved good results. Specifically, a deep network structure can effectively learn the features of images. Hence, in this paper, we also chose a well-known CNN-based denoiser, DnCNN (shown in Figure 1), as a prior for the abundance maps. DnCNN can handle zero-mean Gaussian noise with unknown standard deviation, and residual learning is adopted to separate the noise from noisy observations. Therefore, this method can effectively capture the texture structure of abundance maps.
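Any off-the-shelf image denoiser can play the role of the plugged prior; as an accessible stand-in for BM3D/DnCNN, the sketch below wraps scikit-image's nonlocal-means filter (the noise level `h` and the synthetic abundance map are assumptions, not values from the paper).

```python
from functools import partial
import numpy as np
from skimage.restoration import denoise_nl_means

# A nonlocal-means denoiser standing in for BM3D/DnCNN as the plugged prior.
denoiser = partial(denoise_nl_means, h=0.05, patch_size=7, patch_distance=11)

# Synthetic abundance map with the size of the San Diego Airport subimage.
abundance_map = np.clip(
    0.5 + 0.1 * np.random.default_rng(0).standard_normal((160, 140)), 0.0, 1.0)
smoothed = denoiser(abundance_map)   # one plug-and-play denoising step
```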

#### *2.3. Proposed Method: Unmixing with Nonnegative Tensor Factorization and Plug-and-Play Prior*

To better represent the structure of abundance maps, mixing model (4) can be equivalently written as

$$\mathcal{Y} = \sum_{i=1}^{R} \mathcal{A}_{:,:,i} \circ \mathbf{c}_i + \sum_{j=1}^{R(R-1)/2} \mathcal{B}_{:,:,j} \circ \mathbf{m}_j + \mathcal{N}, \tag{5}$$

where $\mathcal{A}_{:,:,i} \in \mathbb{R}^{n_{row}\times n_{col}}$, $\mathbf{c}_i \in \mathbb{R}^{L\times 1}$, $\mathcal{B}_{:,:,j} \in \mathbb{R}^{n_{row}\times n_{col}}$, and $\mathbf{m}_j \in \mathbb{R}^{L\times 1}$ denote the $i$th abundance slice, the $i$th endmember vector, the $j$th interaction abundance slice, and the $j$th interaction endmember vector, respectively. Model (5) is depicted in Figure 2.

**Figure 2.** The representation of the generalized bilinear model using the tensor-based framework.

To take full advantage of the abundance maps' prior, we propose a new unmixing method based on the **P**lug-and-**P**lay (**PnP**) framework of abundance maps and **N**onnegative **T**ensor **F**actorization, termed **PnP-NTF**, which aims to solve the following optimization problem:

$$\begin{aligned} \min_{\substack{\mathcal{A}_{:,:,i}\ge 0,\ i=1,2,\dots,R \\ \mathcal{B}_{:,:,j}\ge 0,\ j=1,2,\dots,R(R-1)/2}}\ & \frac{1}{2}\left\|\mathcal{Y}-\sum_{i=1}^{R}\mathcal{A}_{:,:,i}\circ\mathbf{c}_i-\sum_{j=1}^{R(R-1)/2}\mathcal{B}_{:,:,j}\circ\mathbf{m}_j\right\|_F^2 + \lambda_1\sum_{i=1}^{R}\Psi\left(\mathcal{A}_{:,:,i}\right) + \lambda_2\sum_{j=1}^{R(R-1)/2}\Psi\left(\mathcal{B}_{:,:,j}\right) \\ \text{s.t.}\ & \sum_{i=1}^{R}\mathcal{A}_{:,:,i}=\mathbf{1}_{n_{row}}\mathbf{1}_{n_{col}}^{T} \end{aligned} \tag{6}$$

where $\|\mathcal{X}\|_F$ denotes the Frobenius norm, which returns the square root of the sum of the absolute squares of the elements. The symbol $\Psi(\cdot)$ represents the plugged state-of-the-art denoiser, and $\mathbf{1}_d$ represents a vector whose components are all one and whose dimension is given by its subscript.

#### *2.4. Optimization Procedure*

The optimization problem in (6) can be solved using the alternating direction method of multipliers (ADMM) [47]. To use the ADMM, (6) is first converted into an equivalent form by introducing auxiliary variables $\mathbf{V}_i$, $\mathbf{E}_j$ to replace $\mathcal{A}_{:,:,i}$ $(i = 1, \dots, R)$ and $\mathcal{B}_{:,:,j}$ $(j = 1, \dots, R(R-1)/2)$. The formulation is as follows:

$$\begin{aligned} \min_{\mathcal{A}_{:,:,i}\ge 0,\ \mathcal{B}_{:,:,j}\ge 0}\ & \frac{1}{2}\left\|\mathcal{Y}-\sum_{i=1}^{R}\mathcal{A}_{:,:,i}\circ\mathbf{c}_i-\sum_{j=1}^{R(R-1)/2}\mathcal{B}_{:,:,j}\circ\mathbf{m}_j\right\|_F^2 + \lambda_1\sum_{i=1}^{R}\Psi\left(\mathbf{V}_i\right) + \lambda_2\sum_{j=1}^{R(R-1)/2}\Psi\left(\mathbf{E}_j\right) \\ \text{s.t.}\ & \begin{cases} \mathcal{A}_{:,:,i}=\mathbf{V}_i, & i=1,2,\dots,R \\ \mathcal{B}_{:,:,j}=\mathbf{E}_j, & j=1,2,\dots,R(R-1)/2 \\ \sum_{i=1}^{R}\mathcal{A}_{:,:,i}=\mathbf{1}_{n_{row}}\mathbf{1}_{n_{col}}^{T} \end{cases} \end{aligned} \tag{7}$$

By using the Lagrangian function, (7) can be reformulated as:

$$\begin{aligned} \mathcal{L}(\mathcal{A}_{:,:,i},\mathcal{B}_{:,:,j},\mathbf{V}_i,\mathbf{E}_j,\mathbf{D}_i,\mathbf{H}_j,\mathbf{G}) =\ & \frac{1}{2}\left\|\mathcal{Y}-\sum_{i=1}^{R}\mathcal{A}_{:,:,i}\circ\mathbf{c}_i-\sum_{j=1}^{R(R-1)/2}\mathcal{B}_{:,:,j}\circ\mathbf{m}_j\right\|_F^2 + \lambda_1\sum_{i=1}^{R}\Psi\left(\mathbf{V}_i\right) \\ & + \lambda_2\sum_{j=1}^{R(R-1)/2}\Psi\left(\mathbf{E}_j\right) + \frac{\mu}{2}\sum_{i=1}^{R}\left\|\mathcal{A}_{:,:,i}-\mathbf{V}_i-\mathbf{D}_i\right\|_F^2 \\ & + \frac{\mu}{2}\sum_{j=1}^{R(R-1)/2}\left\|\mathcal{B}_{:,:,j}-\mathbf{E}_j-\mathbf{H}_j\right\|_F^2 + \frac{\mu}{2}\left\|\sum_{i=1}^{R}\mathcal{A}_{:,:,i}-\mathbf{1}_{n_{row}}\mathbf{1}_{n_{col}}^{T}-\mathbf{G}\right\|_F^2 \end{aligned} \tag{8}$$

where $\mathbf{D}_i$, $\mathbf{H}_j$, and $\mathbf{G}$ are scaled dual variables [48], and $\mu$ is the penalty parameter. The variables $\mathcal{A}_{:,:,i}$, $\mathcal{B}_{:,:,j}$, $\mathbf{V}_i$, $\mathbf{E}_j$, $\mathbf{D}_i$, $\mathbf{H}_j$, $\mathbf{G}$ are updated sequentially, as shown in Algorithm 1. The solution of each subproblem is detailed below.

#### **Algorithm 1:** The Proposed PnP-NTF Unmixing Method.

**Input:** Hyperspectral imagery cube: $\mathcal{Y}$; endmember matrices: $\mathbf{C}$, $\mathbf{M}$; Iterations = 1000

**Output:** Abundance map cube: $\mathcal{A}$

1. **for** $k = 1$; $k <$ Iterations; $k{+}{+}$ **do**
2. &emsp;Update each abundance map slice $\mathcal{A}^{k+1}_{:,:,i}$ by the closed form in Equation (10);
3. &emsp;Update each nonlinear map slice $\mathcal{B}^{k+1}_{:,:,j}$ by the closed form in Equation (12);
4. &emsp;Update each auxiliary variable $\mathbf{V}^{k+1}_{i} = \mathbf{PnP}(\tilde{\mathbf{V}}_{i})$ (Equation (14));
5. &emsp;Update each auxiliary variable $\mathbf{E}^{k+1}_{j} = \mathbf{PnP}(\tilde{\mathbf{E}}_{j})$ (Equation (16));
6. &emsp;Update $\mathbf{D}^{k+1}_{i} = \mathbf{D}^{k}_{i} - (\mathcal{A}^{k+1}_{:,:,i} - \mathbf{V}^{k+1}_{i})$;
7. &emsp;Update $\mathbf{H}^{k+1}_{j} = \mathbf{H}^{k}_{j} - (\mathcal{B}^{k+1}_{:,:,j} - \mathbf{E}^{k+1}_{j})$;
8. &emsp;Update $\mathbf{G}^{k+1} = \mathbf{G}^{k} - (\sum_{i=1}^{R} \mathcal{A}^{k+1}_{:,:,i} - \mathbf{1}_{n_{row}} \mathbf{1}_{n_{col}}^{T})$;
9. **end**
10. **return** $\mathcal{A}$

#### 1. Updating of A

The optimization problem for $\mathcal{A}_{:,:,i}$ is

$$\begin{aligned} \mathcal{A}^{k+1}_{:,:,i} = \arg\min_{\mathcal{A}^{k}_{:,:,i}}\ & \frac{1}{2}\left\|\mathcal{Y}-\sum_{i=1}^{R}\mathcal{A}^{k}_{:,:,i}\circ\mathbf{c}_i-\sum_{j=1}^{R(R-1)/2}\mathcal{B}^{k}_{:,:,j}\circ\mathbf{m}_j\right\|_F^2 \\ & + \frac{\mu}{2}\left\|\mathcal{A}^{k}_{:,:,i}-\mathbf{V}^{k}_i-\mathbf{D}^{k}_i\right\|_F^2 + \frac{\mu}{2}\left\|\sum_{i=1}^{R}\mathcal{A}^{k}_{:,:,i}-\mathbf{1}_{n_{row}}\mathbf{1}_{n_{col}}^{T}-\mathbf{G}^{k}\right\|_F^2 \\ =\ & \frac{1}{2}\sum_{b=1}^{L}\left\|\mathcal{O}_{:,:,b}-\mathcal{A}^{k}_{:,:,i}\,c_{i_b}\right\|_F^2 + \frac{\mu}{2}\left\|\mathcal{A}^{k}_{:,:,i}-\mathbf{V}^{k}_i-\mathbf{D}^{k}_i\right\|_F^2 \\ & + \frac{\mu}{2}\left\|\mathcal{A}^{k}_{:,:,i}+\tilde{\mathbf{A}}-\mathbf{1}_{n_{row}}\mathbf{1}_{n_{col}}^{T}-\mathbf{G}^{k}\right\|_F^2, \end{aligned} \tag{9}$$

where $\mathcal{O} = \mathcal{Y} - \sum_{i'\ne i} \mathcal{A}^{k}_{:,:,i'}\circ\mathbf{c}_{i'} - \sum_{j=1}^{R(R-1)/2}\mathcal{B}^{k}_{:,:,j}\circ\mathbf{m}_j \in \mathbb{R}^{n_{row}\times n_{col}\times L}$, and $\mathcal{O}_{:,:,b}$ is the $b$th slice. Meanwhile, $\tilde{\mathbf{A}} = \sum_{i'\ne i}\mathcal{A}^{k}_{:,:,i'} \in \mathbb{R}^{n_{row}\times n_{col}}$, and $\mathbf{c}_i = [c_{i_1}, c_{i_2}, \dots, c_{i_b}, \dots, c_{i_L}]^T \in \mathbb{R}^{L\times 1}$ is the $i$th endmember. Hence, the solution for $\mathcal{A}_{:,:,i}$ can be derived as follows:

$$\mathcal{A}^{k+1}_{:,:,i} = \left(\sum_{b=1}^{L} c_{i_b}c_{i_b}^{T} + 2\mu\mathbf{I}\right)^{-1}\left(\sum_{b=1}^{L}\mathcal{O}_{:,:,b}\,c_{i_b}^{T} + \mu\left(\mathbf{V}^{k}_i+\mathbf{D}^{k}_i+\mathbf{1}_{n_{row}}\mathbf{1}_{n_{col}}^{T}+\mathbf{G}^{k}-\tilde{\mathbf{A}}\right)\right). \tag{10}$$

#### 2. Updating of B

The optimization problem for $\mathcal{B}_{:,:,j}$ is

$$\begin{aligned} \mathcal{B}^{k+1}_{:,:,j} = \arg\min_{\mathcal{B}^{k}_{:,:,j}}\ & \frac{1}{2}\left\|\mathcal{Y}-\sum_{i=1}^{R}\mathcal{A}^{k+1}_{:,:,i}\circ\mathbf{c}_i-\sum_{j=1}^{R(R-1)/2}\mathcal{B}^{k}_{:,:,j}\circ\mathbf{m}_j\right\|_F^2 + \frac{\mu}{2}\left\|\mathcal{B}^{k}_{:,:,j}-\mathbf{E}^{k}_j-\mathbf{H}^{k}_j\right\|_F^2 \\ =\ & \frac{1}{2}\sum_{b=1}^{L}\left\|\mathcal{K}_{:,:,b}-\mathcal{B}^{k}_{:,:,j}\,m_{j_b}\right\|_F^2 + \frac{\mu}{2}\left\|\mathcal{B}^{k}_{:,:,j}-\mathbf{E}^{k}_j-\mathbf{H}^{k}_j\right\|_F^2, \end{aligned} \tag{11}$$

where $\mathcal{K} = \mathcal{Y} - \sum_{i=1}^{R} \mathcal{A}^{k+1}_{:,:,i}\circ\mathbf{c}_i - \sum_{j'\ne j}\mathcal{B}^{k}_{:,:,j'}\circ\mathbf{m}_{j'} \in \mathbb{R}^{n_{row}\times n_{col}\times L}$, and $\mathcal{K}_{:,:,b}$ is the $b$th slice. Meanwhile, $\mathbf{m}_j = [m_{j_1}, m_{j_2}, \dots, m_{j_b}, \dots, m_{j_L}]^T \in \mathbb{R}^{L\times 1}$ is the $j$th interaction endmember. Hence, the solution for $\mathcal{B}_{:,:,j}$ can be derived as follows:

$$\mathcal{B}^{k+1}_{:,:,j} = \left(\sum_{b=1}^{L} m_{j_b}m_{j_b}^{T} + \mu\mathbf{I}\right)^{-1}\left(\sum_{b=1}^{L}\mathcal{K}_{:,:,b}\,m_{j_b}^{T} + \mu\left(\mathbf{E}^{k}_j+\mathbf{H}^{k}_j\right)\right). \tag{12}$$

#### 3. Updating of **V**

The optimization problem for **V***<sup>i</sup>* is

$$\begin{aligned} \mathbf{V}^{k+1}_i = \arg\min_{\mathbf{V}^{k}_i}\ & \lambda_1\Psi\left(\mathbf{V}^{k}_i\right) + \frac{\mu}{2}\left\|\mathcal{A}^{k+1}_{:,:,i}-\mathbf{V}^{k}_i-\mathbf{D}^{k}_i\right\|_F^2 \\ =\ & \frac{1}{2}\left\|\tilde{\mathbf{V}}_i-\mathbf{V}^{k}_i\right\|_F^2 + \frac{\lambda_1}{\mu}\Psi\left(\mathbf{V}^{k}_i\right), \end{aligned} \tag{13}$$

where $\tilde{\mathbf{V}}_i = \mathcal{A}^{k+1}_{:,:,i} - \mathbf{D}^{k}_i \in \mathbb{R}^{n_{row}\times n_{col}}$. Sub-problem (13) can be solved by applying the **PnP** framework to $\tilde{\mathbf{V}}_i$; then $\mathbf{V}^{k+1}_i$ can be calculated as

$$\mathbf{V}^{k+1}_i = \mathbf{PnP}(\tilde{\mathbf{V}}_i). \tag{14}$$

#### 4. Updating of **E**

The optimization problem for **E***<sup>j</sup>* is

$$\begin{aligned} \mathbf{E}^{k+1}_j = \arg\min_{\mathbf{E}^{k}_j}\ & \lambda_2\Psi\left(\mathbf{E}^{k}_j\right) + \frac{\mu}{2}\left\|\mathcal{B}^{k+1}_{:,:,j}-\mathbf{E}^{k}_j-\mathbf{H}^{k}_j\right\|_F^2 \\ =\ & \frac{1}{2}\left\|\tilde{\mathbf{E}}_j-\mathbf{E}^{k}_j\right\|_F^2 + \frac{\lambda_2}{\mu}\Psi\left(\mathbf{E}^{k}_j\right), \end{aligned} \tag{15}$$

where $\tilde{\mathbf{E}}_j = \mathcal{B}^{k+1}_{:,:,j} - \mathbf{H}^{k}_j \in \mathbb{R}^{n_{row}\times n_{col}}$. Sub-problem (15) can be solved via the **PnP** framework applied to $\tilde{\mathbf{E}}_j$; then $\mathbf{E}^{k+1}_j$ can be expressed as

$$\mathbf{E}^{k+1}_j = \mathbf{PnP}(\tilde{\mathbf{E}}_j). \tag{16}$$

#### 5. Updating of **D**

$$\mathbf{D}^{k+1}_i = \mathbf{D}^{k}_i - (\mathcal{A}^{k+1}_{:,:,i} - \mathbf{V}^{k+1}_i). \tag{17}$$

#### 6. Updating of **H**

$$\mathbf{H}^{k+1}_j = \mathbf{H}^{k}_j - (\mathcal{B}^{k+1}_{:,:,j} - \mathbf{E}^{k+1}_j). \tag{18}$$

#### 7. Updating of **G**

$$\mathbf{G}^{k+1} = \mathbf{G}^{k} - \left(\sum_{i=1}^{R} \mathcal{A}^{k+1}_{:,:,i} - \mathbf{1}_{n_{row}}\mathbf{1}_{n_{col}}^{T}\right). \tag{19}$$
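Putting the seven updates together, Algorithm 1 can be prototyped compactly in Python. The following is a minimal sketch, not the authors' implementation: it replaces the FCLS initialization with a uniform one, folds the prior weights $\lambda_1$, $\lambda_2$ into the strength of the supplied `denoiser` callable, and omits any stopping tolerance.

```python
import numpy as np

def pnp_ntf(Y, C, M, denoiser, mu=8e-3, iters=100):
    """Minimal PnP-NTF sketch. Y: (n_row, n_col, L) cube; C: (L, R) endmembers;
    M: (L, P) bilinear endmembers with P = R(R-1)/2; denoiser: image -> image."""
    n_row, n_col, _ = Y.shape
    R, P = C.shape[1], M.shape[1]
    A = np.full((n_row, n_col, R), 1.0 / R)      # uniform init (FCLS in the paper)
    B = np.zeros((n_row, n_col, P))
    V, E = A.copy(), B.copy()
    D, H = np.zeros_like(A), np.zeros_like(B)
    G = np.zeros((n_row, n_col))
    ones = np.ones((n_row, n_col))

    def recon():
        return (np.tensordot(A, C, axes=([2], [1]))
                + np.tensordot(B, M, axes=([2], [1])))

    for _ in range(iters):
        for i in range(R):                        # Eq. (10): abundance slices
            O = Y - recon() + A[:, :, [i]] * C[:, i]      # residual without term i
            A_tilde = A.sum(axis=2) - A[:, :, i]
            rhs = (np.tensordot(O, C[:, i], axes=([2], [0]))
                   + mu * (V[:, :, i] + D[:, :, i] + ones + G - A_tilde))
            A[:, :, i] = rhs / (C[:, i] @ C[:, i] + 2.0 * mu)
        for j in range(P):                        # Eq. (12): interaction slices
            K = Y - recon() + B[:, :, [j]] * M[:, j]
            rhs = (np.tensordot(K, M[:, j], axes=([2], [0]))
                   + mu * (E[:, :, j] + H[:, :, j]))
            B[:, :, j] = rhs / (M[:, j] @ M[:, j] + mu)
        for i in range(R):                        # Eq. (14): plug-and-play prior
            V[:, :, i] = denoiser(A[:, :, i] - D[:, :, i])
        for j in range(P):                        # Eq. (16): plug-and-play prior
            E[:, :, j] = denoiser(B[:, :, j] - H[:, :, j])
        D -= A - V                                # Eqs. (17)-(19): dual updates
        H -= B - E
        G -= A.sum(axis=2) - ones
    return A, B
```

With the nonlocal-means `denoiser` from the snippet in Section 2.2, a call such as `A_hat, B_hat = pnp_ntf(Y, C, M, denoiser)` runs the loop end to end; BM3D or a pretrained DnCNN would simply replace that callable.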

#### **3. Experiments and Analysis on Synthetic Data**

In this section, we illustrate the performance of the proposed PnP-NTF framework with two state-of-the-art denoising methods, namely BM3D and DnCNN, for abundance estimation. We compare the proposed method with several advanced algorithms that address the GBM, including the gradient descent algorithm (GDA) [49], the semi-nonnegative matrix factorization (semi-NMF) algorithm [50], and the subspace unmixing with low-rank attribute embedding algorithm (SULoRA) [11]. Specifically, the GDA method is a benchmark that solves the GBM pixel by pixel, and semi-NMF speeds it up and reduces the time cost. Meanwhile, the semi-NMF-based method can consider partial spatial information of the image. SULoRA is a general subspace unmixing method that jointly estimates subspace projections and abundances, and can model the raw subspace with low-rank attribute embedding. All of the experiments were conducted in MATLAB R2018b on a desktop with 16 GB RAM and an Intel Core i5-8400 CPU @ 2.80 GHz.

In order to quantify the effect of the proposed method numerically, three widely used metrics are employed: the root-mean-square error (RMSE) of the abundances, the image reconstruction error (RE), and the average spectral angle mapper (aSAM). Specifically, the RMSE quantifies the difference between the estimated abundances $\hat{\mathcal{A}}$ and the true abundances $\mathcal{A}$ as follows:

$$RMSE = \sqrt{\frac{1}{R \times N} \| \mathcal{A} - \hat{\mathcal{A}} \|\_F^2}. \tag{20}$$

The RE measures the difference between the observation $\mathcal{Y}$ and its reconstruction $\hat{\mathcal{Y}}$ as follows:

$$RE = \sqrt{\frac{1}{N \times L} \| \mathcal{Y} - \widehat{\mathcal{Y}} \|\_F^2}. \tag{21}$$

The aSAM quantifies the average spectral angle between the estimated $i$th spectral vector $\hat{\mathbf{y}}_i$ and the observed $i$th spectral vector $\mathbf{y}_i$. The aSAM is defined as follows:

$$aSAM = \frac{1}{N} \sum\_{i=1}^{N} \arccos\left(\frac{\mathbf{y}\_i^T \cdot \hat{\mathbf{y}}\_i}{||\mathbf{y}\_i|| \, ||\hat{\mathbf{y}}\_i||}\right). \tag{22}$$
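For reference, the three metrics translate directly into numpy; the cube shapes follow the tensor notation of Section 2 (a hypothetical sketch, not the authors' evaluation code).

```python
import numpy as np

def rmse(A_true, A_hat):
    """Eq. (20): RMSE over an (n_row, n_col, R) abundance cube, N = n_row * n_col."""
    R = A_true.shape[-1]
    N = A_true.shape[0] * A_true.shape[1]
    return np.sqrt(np.sum((A_true - A_hat) ** 2) / (R * N))

def reconstruction_error(Y, Y_hat):
    """Eq. (21): RE over an (n_row, n_col, L) image cube."""
    L = Y.shape[-1]
    N = Y.shape[0] * Y.shape[1]
    return np.sqrt(np.sum((Y - Y_hat) ** 2) / (N * L))

def asam(Y, Y_hat):
    """Eq. (22): average spectral angle between observed and estimated spectra."""
    y = Y.reshape(-1, Y.shape[-1])
    y_hat = Y_hat.reshape(-1, Y_hat.shape[-1])
    cos = (np.sum(y * y_hat, axis=1)
           / (np.linalg.norm(y, axis=1) * np.linalg.norm(y_hat, axis=1)))
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))
```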

#### *3.1. Data Generation*

In the simulated experiments, the synthetic data were generated similarly to References [32,51]; the specific process is as follows:


#### *3.2. Evaluation of the Methods*

With the simulated data generated by the previous steps, we created a series of noisy images with SNRs = {15, 20, 30} dB to evaluate the performance of the proposed method and compare it with the other methods.

#### 3.2.1. Parameter Setting

To compare all the algorithms fairly, the parameters of all compared methods were hand-tuned to their optimal values. Specifically, FCLS was used to initialize the abundance information in all methods (including the proposed method). Note that a direct comparison with FCLS unmixing results is unfair; FCLS serves as a benchmark showing the impact of using a linear unmixing method on nonlinearly mixed images. The GDA is considered the benchmark for solving the GBM. The tolerances for stopping the iterations in GDA, semi-NMF, and SULoRA were set to 1 × 10−6. For the proposed PnP-NTF framework, the parameters to be adjusted are divided into two parts: the parameters of the chosen denoiser and the penalty parameter *μ*. For BM3D, the standard deviation of additive white Gaussian noise *σ* was searched from 0 to 255 with a step of 25, and the block size used for hard-thresholding (HT) filtering was set to 8. The parameters of DnCNN are the same as in Reference [44]. Meanwhile, the penalty parameter *μ* was set to 8 × 10−3, and the tolerance for stopping the iterations was set to 1 × 10−6.

#### 3.2.2. Comparison of Methods under Different Gaussian Noise Levels

In our experiments, we generated three images of size 64 × 64 × 224, that is, 4096 pixels and 224 bands. More specifically, 'Scene1' was generated by the GBM, and 'Scene2' was generated by the PPNM. 'Scene3' is a mixture of 'Scene1' and 'Scene2': half of the pixels in 'Scene3' were generated by the GBM and the others by the PPNM [50]. 'Scene1' is used to evaluate the efficiency of the proposed method in handling GBM-based mixtures, while 'Scene2' and 'Scene3' are used to evaluate the robustness of the proposed method to mixtures based on different mixing models.

For the proposed method and the other methods, the abundances were initialized with the same method, that is, FCLS. In a supervised nonlinear unmixing problem, the spectral vectors of the endmembers are known a priori. Considering that the accuracy of abundance inversion depends on the quality of the endmember signals, we used the true endmembers in the experiments for a fair comparison.

Table 1 quantifies the corresponding results for the three evaluation indicators (RMSE, RE, and aSAM) in detail on 'Scene1'. As seen from Table 1, the proposed PnP-NTF framework with the advanced denoisers provides superior unmixing results compared with the other methods. Specifically, we tested two state-of-the-art denoisers, namely BM3D and DnCNN, and both of them obtained the best performance. The RMSE, RE, and aSAM reached their minimum values with the proposed PnP-NTF frameworks (shown in bold), which shows that the efficiency of the proposed methods is superior to the other state-of-the-art methods. Figure 3 shows the results of the proposed algorithm and the other algorithms under the three indexes (RMSE, RE, and aSAM). For the different noise levels in 'Scene1', the proposed methods yield superior performance in all indexes. We can also see from the histograms in Figures 4–6 that the proposed methods obtain the minimum RMSEs in all scenes.


**Table 1.** Evaluation Results in 'Scene1' with different signal-to-noise ratios (SNRs) and time cost (s).

To evaluate the robustness of the proposed methods against model error, we generated 'Scene2' and 'Scene3' of size 64 × 64 × 224. As shown in Tables 2 and 3, the proposed methods obtained the best estimates of abundances in terms of RMSE, RE, and aSAM (shown in bold). We cannot provide a proof of convergence of the proposed algorithm, but the experimental results show that it converges when plugged with BM3D and DnCNN (shown in Figures 7 and 8).

**Figure 3.** Unmixing performance in terms of root-mean-square error (RMSE) (**a**), reconstruction error (RE) (**b**), and average of spectral angle mapper (aSAM) (**c**) in the simulated 'Scene1' with different Gaussian Noise Levels.



**Table 2.** Evaluation Results in 'Scene2' with different SNRs and time cost (s).

**Table 3.** Evaluation Results in 'Scene3' with different SNRs and time cost (s).


**Figure 4.** Evaluation results of RMSE with the proposed method and state-of-the-art methods on 'Scene1'.

**Figure 5.** Evaluation results of RMSE with the proposed method and state-of-the-art methods on 'Scene2'.

**Figure 6.** Evaluation results of RMSE with the proposed method and state-of-the-art methods on 'Scene3'.

**Figure 7.** Iterations of RE with BM3D.

**Figure 8.** Iterations of RE with DnCNN.

#### 3.2.3. Comparison of Methods under Denoised Abundance Maps

We conducted a series of experiments to show the difference between the proposed methods and the conventional unmixing methods (FCLS, GDA, and semi-NMF) when the computed abundance maps are denoised afterwards by BM4D. The results in Tables 4–6 show that the denoised abundance maps provided by FCLS, GDA, and semi-NMF obtain better results than the corresponding original abundance maps. However, the proposed methods directly use a state-of-the-art denoiser as the regularization, which exploits the spatial correlation of the abundance maps. The results show that using a plug-and-play prior for the abundance maps and interaction abundance maps can enhance the accuracy of the estimated abundances efficiently.

**Table 4.** Evaluation result of denoised abundance in 'Scene1' with different SNRs.



**Table 5.** Evaluation result of denoised abundance in 'Scene2' with different SNRs.

**Table 6.** Evaluation result of denoised abundance in 'Scene3' with different SNRs.


#### **4. Experiments and Analysis on Real Dataset**

In this section, we use two real hyperspectral datasets to validate the performance of the proposed methods. Due to the lack of ground truth abundances in the real scenes, the RE in (21) and the aSAM in (22) were used to test the unmixing performance of all methods. The convergence of the proposed methods on the two real hyperspectral datasets is shown in Figure 9.

**Figure 9.** Iterations of RE with the proposed methods on two real hyperspectral datasets: (**a**) number of iterations on San Diego Airport, (**b**) number of iterations on Washington DC Mall.

#### *4.1. San Diego Airport*

The first real dataset is the 'San Diego Airport' image, which was captured by AVIRIS over San Diego. The original image of size 400 × 400 includes 224 spectral channels in the wavelength range of 370 nm to 2510 nm. After removing bands affected by water vapor absorption, 189 bands are kept. For our experiments, a subimage of size 160 (rows) × 140 (columns) (shown in Figure 10a) was chosen as the test image. The selected scene mainly contains five endmembers, namely 'Roof', 'Grass', 'Ground and Road', 'Tree', and 'Other' [52].

**Figure 10.** Hyperspectral images (HSIs) used for our experiments: (**a**) sub-image of San Diego Airport data, (**b**) sub-image of Washington DC Mall data.

The subimage we chose has been studied in Reference [52]. The VCA [46] method was used to estimate the endmembers. Meanwhile, FCLS was used to initialize the abundances in all methods. The ASC constraint in the semi-NMF was set to 0.1. The two state-of-the-art denoisers embedded in the proposed PnP-NTF framework were tested. For the BM3D denoiser, the standard deviation of the noise was hand-tuned. For the DnCNN denoiser, its parameters were set in the same way as in Reference [44]. The penalty parameter *μ* was set to 1 × 10−4. The tolerance for stopping the iterations was set to 1 × 10−6 for all algorithms.

Table 7 shows the performance of different unmixing methods in terms of RE and aSAM on the San Diego Airport image. Our proposed method obtains the best results. Figure 11 shows the estimated abundance maps obtained by all methods. Focusing on the abundance maps of 'Ground and Road', we can see that the roof area is regarded as a mixture of 'Roof' and 'Ground and Road' in the unmixing results of the FCLS, GDA, semi-NMF, and SULoRA methods. In fact, the roof area only contains the endmember 'Roof'. The unmixing results of the proposed PnP-NTF-DnCNN/BM3D are more reasonable.

Furthermore, Figure 12 shows the distribution of the RE on the San Diego Airport image. The bright areas in Figure 12 indicate large errors in the reconstructed images. The error map shows that FCLS performed worst, because FCLS can only handle the linear information and ignores the nonlinear information in the image. Meanwhile, semi-NMF performed better than GDA, because GDA is a pixel-based algorithm that does not take any spatial information into consideration. Our method, exploiting the self-similarity of abundance maps, performs better than the other methods.

**Figure 11.** Estimated abundance maps comparison between the proposed algorithm and state-of-the-art algorithms on the San Diego Airport.

**Figure 12.** RE distribution maps comparison between the proposed algorithm and state-of-the-art algorithms on the San Diego Airport.

**Table 7.** Evaluation Results with the RE, aSAM and cost time (s) on the San Diego Airport.


#### *4.2. Washington DC Mall*

The second real dataset is the 'Washington DC Mall' image, which was acquired by the HYDICE sensor over Washington DC, USA. The original image of size 1208 × 307 includes 210 spectral bands, with a spatial resolution of 3 m. After removing bands corrupted by water vapor absorption, 191 bands are kept. There are seven endmembers in the image: 'Roof', 'Grass', 'Road', 'Trail', 'Water', 'Shadow', and 'Tree' [52]. We chose a subimage with 256 × 256 pixels for the experiments, called sub-DC (shown in Figure 10b). Hysime [53] was first used to estimate the number of endmembers, and then VCA was used to extract the spectral information of the endmembers. The extracted endmembers were named 'Roof1', 'Roof2', 'Grass', 'Road', 'Tree', and 'Trail'.

The parameters in the comparison methods were manually tuned to obtain optimal performance. The parameter setting of our methods was the same as that used for the 'San Diego Airport' image.

Table 8 shows the results of the proposed methods and the state-of-the-art methods on the 'Washington DC Mall' image. The proposed methods obtained the best results in terms of RE and aSAM. Figures 13 and 14 show the estimated abundance maps and the error maps, respectively. In Figure 14, the proposed methods show much smaller errors in the reconstructed images.

**Table 8.** Evaluation results in terms of the RE, aSAM, and cost time (s) on the Washington DC Mall.

| Scenario | Metric | FCLS | GDA | SULoRA | Semi-NMF | PnP-NTF-BM3D (Proposed) | PnP-NTF-DnCNN (Proposed) |
|---|---|---|---|---|---|---|---|
| Washington DC Mall | RE | 0.0156 | 0.0154 | 0.0152 | 0.0120 | **0.0099** | **0.0099** |
| | aSAM | 0.1020 | 0.1015 | 0.0880 | 0.0837 | 0.0623 | **0.0621** |
| | Time | 17 | 670 | **10** | 43 | 1163 | 585 |

**Figure 13.** Estimated abundance maps comparison between the proposed algorithm and state-of-the-art algorithms on Washington DC Mall data.

**Figure 14.** RE distribution maps comparison between the proposed algorithm and state-of-the-art algorithms on Washington DC Mall data.

#### **5. Conclusions**

In this paper, we propose a new hyperspectral nonlinear unmixing framework that takes advantage of the spatial correlation (i.e., self-similarity) of abundance maps through a plug-and-play technique. The self-similarity of abundance maps is imposed on our objective function, which is solved by ADMM with denoiser-based regularization. We tested two state-of-the-art denoising methods (BM3D and DnCNN). In the experiments with simulated and real data, the proposed methods obtained more accurate abundance estimates than state-of-the-art methods. Furthermore, we tested the proposed method with five endmembers and obtained better results than the other methods. However, as the number of endmembers grows, the difficulty of unmixing also increases, which is a direction for our future research.

**Author Contributions:** Conceptualization, L.G. and Z.W.; methodology, L.Z. and M.K.N.; software, Z.W.; validation, Z.W., L.Z., L.G. and M.K.N.; formal analysis, B.Z.; investigation, L.Z.; resources, B.Z.; writing—original draft preparation, Z.W.; writing—review and editing, A.M., M.K.N. and B.Z.; visualization, Z.W.; supervision, L.G. and L.Z.; project administration, B.Z.; funding acquisition, L.G., L.Z. and A.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the National Natural Science Foundation of China under Grant 42030111 and in part by the National Natural Science Foundation of China under Grant 42001287. A. Marinoni's work was supported in part by the Centre for Integrated Remote Sensing and Forecasting for Arctic Operations (CIRFA) and the Research Council of Norway (RCN Grant no. 237906).

**Acknowledgments:** The authors would like to thank Naoto Yokoya for providing the semi-NMF code for our comparison experiment. Yuntao Qian provided the abundance and endmember data used in some of the experiments with synthetic data.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Technical Note* **A Particle Swarm Optimization Based Approach to Pre-tune Programmable Hyperspectral Sensors**

**Bikram Pratap Banerjee 1,2 and Simit Raval 2,\***


**Abstract:** Identification of optimal spectral bands often involves collecting in-field spectral signatures followed by thorough analysis. Such rigorous field sampling exercises are tedious, cumbersome, and often impractical on challenging terrain, which is a limiting factor for programmable hyperspectral sensors mounted on unmanned aerial vehicles (UAV-hyperspectral systems), requiring a pre-selection of optimal bands when mapping new environments with new target classes with unknown spectra. An innovative workflow has been designed and implemented to simplify the process of in-field spectral sampling and its real-time analysis for the identification of optimal spectral wavelengths. The band selection optimization workflow involves particle swarm optimization with minimum estimated abundance covariance (PSO-MEAC) for the identification of a set of bands most appropriate for UAV-hyperspectral imaging in a given environment. The criterion function, MEAC, greatly simplifies the in-field spectral data acquisition process by requiring a few target class signatures and not requiring extensive training samples for each class. The metaheuristic method was tested on an experimental site with diversity in vegetation species and communities. The optimal set of bands was found to suitably capture the spectral variations between target vegetation species and communities. The approach streamlines the pre-tuning of wavelengths in programmable hyperspectral sensors in mapping applications. This will additionally reduce the total flight time in UAV-hyperspectral imaging, as obtaining information for an optimal subset of wavelengths is more efficient, and requires less data storage and computational resources for post-processing the data.

**Keywords:** evolutionary computation; heuristic algorithms; machine learning; unmanned aerial vehicles (UAVs); vegetation mapping; upland swamps; mine environment

#### **1. Introduction**

Hyperspectral technology is a promising tool for the remote detection and monitoring of targets. A hyperspectral sensor measures electromagnetic radiation reflected from the target in a large number of narrow spectral bands. The inherent objective in target classification and assessment using hyperspectral data is to utilize its high spectral resolution [1]. However, the large dimensionality of hyperspectral data often gives rise to the Hughes phenomenon, the curse of dimensionality [2]. The problem is a combined consequence of the high correlations among adjacent bands and the inability of the applied algorithm to process the high-dimensional data. The problem is paramount in spectrally complex environments such as wetlands and swamps with many diverse species to be monitored [1,3,4]. While a common remote sensing data processing solution involves the application of dimensionality reduction techniques or the selection of suitable narrowbands in a post-acquisition step, a hardware-based solution involves the use of programmable hyperspectral sensors as a pre-acquisition step. Programmable hyperspectral sensors typically involve a snapshot-based scanning mechanism, unlike general point or line scanning-type systems, which are non-programmable and acquire a continuous spectrum over the operable wavelength region.

**Citation:** Banerjee, B.P.; Raval, S. A Particle Swarm Optimization Based Approach to Pre-tune Programmable Hyperspectral Sensors. *Remote Sens.* **2021**, *13*, 3295. https://doi.org/ 10.3390/rs13163295

Academic Editor: Meiping Song

Received: 20 July 2021 Accepted: 18 August 2021 Published: 20 August 2021


Several such programmable hyperspectral sensors have been developed in recent times and are increasingly being used in UAV-based remote sensing applications [5–7]. A hardware-based method, such as Fabry–Pérot interferometer (FPI) technology, acquires reflected electromagnetic radiation in pre-selected optimal narrowbands and is programmed by changing the air gap between the internal tuneable mirrors [8]. This method has the additional benefit of efficient mapping of the environment through the selection of only the spectral features of interest, which is particularly crucial in high-resolution mapping applications using unmanned aerial vehicles (UAVs), which have limited flight times. The technology is relatively new compared to traditional pushbroom-type hyperspectral sensors, and existing works involving the FPI have used either (1) a set of bands for generating vegetation indices (VIs), herein referred to as *indices-based* criteria [7,9,10], or (2) a set of bands identified through rigorous experimental testing, herein referred to as *knowledge-based* criteria [11,12] of narrowband selection. *Indices-based* criteria for band selection have the potential to assess the condition and/or estimate the yield of the vegetation [7,9]; however, they are not principally suited for multi-target classification, since the spectral variations of the target endmembers present within the scene are subjective. Furthermore, the efficacy of the *indices-based* narrowband selection approach for vegetation quality or condition assessment is also subject to the characteristic reflectance of the target, and the traditional list of indices does not always ensure the best results for different vegetation communities or species. The *knowledge-based* approach requires a thorough understanding of the spectral variability among the targets present over the area, which is usually attained through intensive in-situ sampling and is not always realizable over difficult terrain or in scenarios requiring urgent mapping. Therefore, it is important to adopt a *data-driven* methodology for programmable hyperspectral sensors to estimate appropriate narrowbands for scene classification or assessment. Minet et al. [13] proposed an approach to adaptively maximize the contrast between targets by employing a genetic algorithm (GA)-based optimization of the positions and linewidths of a limited number of filters in an FPI for military applications. However, this method is unsuitable for thematic applications of remote sensing.

Different *data-driven* strategies have been proposed for the selection of optimal bands for traditional remote sensing applications. A sub-optimal search strategy utilizing constrained local extremes in a discrete binary space to select hyper-dimensional features was presented in [14]. Becker et al. [3] used a second-derivative approximation to identify the spectral locations of inflection. A band selection method using the correlations among bands based on mutual information (MI) and deterministic annealing optimization was also employed [15]. Becker et al. [4] proposed a classification-based assessment of optimal spectral band selection techniques (derivative, magnitude, fixed interval, and derivative histogram), using the spectral angle mapper (SAM) as a classifier. A GA-based wrapper method using a support vector machine (SVM) was proposed for the classification of hyperspectral images [16]. A double parallel feedforward neural network based on radial basis functions was used for dimensionality reduction [17]. Principal component analysis for identifying optimal bands to discriminate wetland plant species was presented in [1]. A semi-supervised band clustering approach for dimensionality reduction was developed [18]. A particle swarm optimization (PSO)-based dimensionality reduction approach to improving SVM-based classification was suggested in [19]. Li et al. [20] and Pal et al. [21] presented a hybrid band selection strategy based on a GA-SVM wrapper to search for optimal band subsets. A method of band selection based on spectral shape similarity analysis was put forward in [22]. Methods nesting a traditional single loop of PSO (1PSO) inside an outer PSO loop, termed 2PSO, have been found to improve the overall optimization performance in certain applications, at the expense of computational cost [23]. Su et al. [23] implemented 1PSO and 2PSO with minimum estimated abundance covariance (MEAC) [24], among other techniques, for the evaluation of optimal bands. Ghamisi et al. [25] presented a feature selection approach based on the hybridization of a GA and PSO with an SVM classifier as the fitness function. Accuracies achieved by an optimized band selection method are influenced by the characteristics of the input dataset, as the search strategy depends on the classes present and their spectral profiles. Therefore, these methods need to be tested on benchmark datasets; a comprehensive evaluation of this kind is reported in [23]. However, all these existing optimal band identification studies involving *data-driven* methods were applied to traditional hyperspectral datasets after acquisition, and are yet to be used with a hardware-based solution to pre-tune hyperspectral sensors to acquire the optimal bands.

In this study, for the first time, an in-field *data-driven* approach to pre-tune a snapshot-type UAV-hyperspectral sensor was devised for remote sensing applications. The method employs PSO with minimum estimated abundance covariance (MEAC), similarly to [23], where it was applied in a post-processing stage for waveband selection after hyperspectral dataset acquisition. The significant benefits are: (1) it is an efficient approach to identifying the optimal bands in-field before the survey; (2) it does not require many spectral samples per class, which is particularly an issue over difficult terrain when trying to establish a spectral library; and (3) the system works even when the number of observed samples is less than the total number of potential hyperspectral bands to select from, which is an important issue with other dimensionality reduction methods, such as principal component analysis (PCA). Programmable UAV-hyperspectral sensors have increasingly been used in applications such as environmental mapping, precision agriculture, phenotyping, and forestry [12,26,27]. Identification of optimal wavelengths remains crucial for mapping vegetation communities, phenotyping functional plant traits, and identifying vegetation under biotic or abiotic stress. Our method aims to resolve these functional challenges by improving the capture of the spectral representation of an environment through a UAV-hyperspectral survey.

The rest of the paper is arranged as follows. The Materials and Methods section describes the experimental framework. The theoretical background of the PSO-MEAC approach is described in relation to the elements of the proposed application. In the Results and Discussion section, we present the results of using the PSO-MEAC method for optimal band selection at the experimental site. In addition, the performance of the *data-driven* PSO-MEAC approach is evaluated against the traditional *indices-based* approach for feature selection and mapping. Finally, concluding remarks are provided in the Conclusions section.

#### **2. Materials and Methods**

This section details the study area, the ground-based hyperspectral sensing system, the processing of the hyperspectral data, the workflow for identifying optimal bands in the field, and the method for UAV-hyperspectral surveying and assessment.

#### *2.1. The Area Used for the Experiment*

The test site is an upland swamp area above an underground coal mine within the temperate highland peat swamp on sandstone (THPSS) in New South Wales, southwest of the city of Sydney, Australia (34°21′24.0″S, 150°51′51″E). The area is located in Wollongong. The focus was laid on spectrally diverse vegetation communities in critically endangered ecosystems distributed in the Blue Mountains, Lithgow, Southern Highlands, and Bombala regions of New South Wales, Australia [28]. The NSW National Parks and Wildlife Service (NPWS) classifies the upland swamp complexes into five major vegetation communities: Banksia Thicket, Cyperoid Heath, Fringing Eucalypt Woodland, Restioid Heath, and Sedgeland [29]. The site has occasional thick vegetation cover and steep gradients, making parts of it inaccessible.

#### *2.2. Hyperspectral Set-Up for Ground Based Sampling*

The spectra of the target classes in the environment were measured with the visible-infrared snapshot hyperspectral (FPI) sensor (Rikola, Senop Optronics, Kangasala, Finland) with a separate data acquisition computer. In this mode of operation, the sensor acquires the maximum number of wavelength bands possible, i.e., 380 bands at 1 nm spectral steps between 500 and 880 nm. With a focal length of 9 mm and a field-of-view (FOV) of 36.5 × 36.5 degrees, the sensor acquires 1010 × 1010 spatial channels in the snapshot imaging mode. In contrast, in the standalone on-board UAV-based data acquisition mode, the sensor records a set of 15 programmed wavelength bands in 1010 × 1010 pixel format, i.e., up to a total of 16 megapixels of storage per hypercube. The sensor also acquires solar irradiance measurements using an irradiance sensor for radiometric calibration, and positional measurements using a global positioning system (GPS) for geometric corrections (Figure 1). All sensors were installed on a handheld mount for hyperspectral imaging. An Android mobile phone was also installed on the sensor mount and paired to the data acquisition computer with a video telemetry feed over a WiFi link to provide a real-time view of the scene, which was useful for bringing the target vegetation into focus before the collection of hyperspectral data (Figure 1a). Additionally, a real-time feed of goniometric measurements (roll and pitch) from the mobile phone's accelerometer was relayed to the screen of the data acquisition computer to monitor the planimetric setting of the captured hypercubes using the FPI sensor (Figure 1b).

**Figure 1.** The setup for ground-based hyperspectral data acquisition using a Fabry–Pérot interferometer (FPI) sensor (Rikola, Senop Optronics, Kangasala, Finland), an irradiance sensor, a global positioning system (GPS), and an Android phone as a goniometer on a portable handheld sensor mount: (**a**) top-side view, (**b**) bottom-side view, and (**c**) in-field hyperspectral data acquisition with a data acquisition computer. The system was used for the collection of in-field data for rapidly identifying optimal hyperspectral wavelengths, for applications in aerial (UAV-hyperspectral) data acquisition.

The simple design of the handheld hyperspectral imaging system was important for carrying it around in regions with dense shrub-type vegetation cover (Figure 1c). The hyperspectral data were acquired with a downward nadir orientation over the shrub-type swamp vegetation, at a distance of approximately 0.5 m from the top of the canopy (Figure 1c). In this study, the FPI sensor was used as a tool for in-field spectral acquisition to demonstrate an independent form of operation. Nevertheless, the field spectral measurements could also be obtained from other spectroradiometers, such as the ASD FieldSpec3 (Analytical Spectral Devices, Boulder, CO, USA). However, special care should be taken to establish proper radiometric calibration to remove any inter-sensor response mismatch; this is addressed here by using the same FPI sensor for both the in-field spectral data collection for identifying the optimal bands and the later UAV-hyperspectral data acquisition.

For identifying the optimal bands through PSO-MEAC, the hyperspectral measurements were collected for a total of three target vegetation classes, covering eight upland swamp species, including Grass tree (*Xanthorrhoea resinosa*), Pouched coral fern (*Gleichenia dicarpa*), and Sedgeland complex (*Empodisma minus*, *Gymnoschoenus sphaerocephalus*, *Lepidosperma limicola*, *Lepidosperma neesii*, *Leptocarpus tenax*, and *Schoenus brevifolius*). In addition, spectral measurements were also collected for background vegetation, which contained a mixture of other species present in small patches and not selected in this study. Finally, a background bare-earth spectrum was also collected. To obtain a properly un-mixed spectrum for a single species, field sampling was performed over a region of interest with local homogeneity.

#### *2.3. In-Field Ground-Based Hyperspectral Data Processing*

Vegetation in an upland swamp environment is highly diverse, and species can exist in homogeneous and heterogeneous patches. Data collected through the portable handheld FPI system exhibited minor spectral misalignments due to unavoidable handheld movement of the sensor and slight movements of the canopy caused by wind. This happened because the data in the FPI sensor were acquired in a snapshot, bandwise manner with a small delay between bands [26]. The hyperspectral bands were aligned using a previously developed band alignment workflow described in [26]. The data were first flat-field corrected using dark current removal and a white calibration panel; then they were converted to reflectance measurements using calibration coefficients previously computed with an integrating sphere [7]. A band-averaged hyperspectral signal was calculated from the hypercube and used in the optimal band identification workflow. The spectrum was further treated with a Savitzky–Golay [30] smoothing filter with a polynomial order of 3 and a frame length of 17 to remove spectral noise. A PSO with MEAC as the criterion function was employed to identify the suitable bands in the field; the details of the theory of operation are given in Section 2.4. The entire process of spectral signature retrieval and the PSO-MEAC workflow for suitable band identification was implemented as MATLAB (ver. 9.5) routines, and a graphical user interface (GUI) was designed for user-friendly and seamless operation in the field.
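The smoothing step can be reproduced with SciPy's Savitzky–Golay filter. The authors implemented their workflow in MATLAB; this Python sketch uses the stated polynomial order of 3 and frame length of 17, with random data standing in for a real band-averaged spectrum.

```python
import numpy as np
from scipy.signal import savgol_filter

# Band-averaged spectrum: 380 bands at 1 nm steps between 500 and 880 nm.
# Random values stand in for a real field measurement here.
spectrum = np.random.rand(380)

# Smoothing as described in the text: polynomial order 3, frame length 17.
smoothed = savgol_filter(spectrum, window_length=17, polyorder=3)
```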

#### *2.4. Optimal Band Identification Using PSO-MEAC*

Particle swarm optimization (PSO) was originally used to simulate the social behaviour (movement and interaction) of organisms (*particles*) in a flock of birds or a school of fish [31]. It has since been adopted as a robust metaheuristic computational method to improve the selection of candidate solutions for an optimization problem. The optimization operates iteratively over a swarm of candidate solutions with a criterion function as a given measure of quality. In our approach, the selected sets of bands are called *particles*, and a recursive update of the bands is called a *velocity*. The particle position *xid* denotes the selected band subset of size *k*, and the velocity *vid* denotes the update for the selected band. A particle updates [31] as shown in Equation (1).

$$v_{id} = \omega \times v_{id} + c_1 \times r_1 \times \left(p_{id} - x_{id}\right) + c_2 \times r_2 \times \left(p_{gd} - x_{id}\right) \tag{1}$$

$$x_{id} = x_{id} + v_{id} \tag{1}$$

where *pid* is the historically best local solution; *pgd* is the historically best global solution among all the particles; *c*1 and *c*2 control the contributions from the local and global solutions, respectively; *r*1 and *r*2 are independent random variables between 0 and 1; and *ω* is the inertia weight, included to improve convergence performance.

New velocities and positions (*vid* and *xid* on the left-hand side of Equation (1)) are updated for each particle from the existing parameters and the cost criterion at every iteration (Figure 2). The iterative process aims to minimize the underlying criterion function.
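A minimal Python sketch of the update in Equation (1), applied to band-subset selection, is given below. The handling of duplicate band indices (rounding plus an infinite penalty) and the acceleration constants `c1`/`c2` are assumptions, as the paper does not report them, and the authors' actual implementation is a MATLAB GUI module.

```python
import numpy as np

def pso(cost, n_bands=380, k=15, n_particles=100, omega=0.98,
        c1=2.0, c2=2.0, n_iter=500, seed=0):
    """Minimal PSO over band subsets of size k (Equation (1)).
    `cost` maps an array of k distinct integer band indices to a scalar,
    e.g., the MEAC criterion. Positions are kept continuous and rounded
    for evaluation; subsets with repeated bands are penalized."""
    rng = np.random.default_rng(seed)

    def evaluate(xi):
        idx = np.unique(np.clip(xi, 0, n_bands - 1).astype(int))
        return cost(idx) if len(idx) == k else np.inf

    # Distinct random band subsets as the initial particle positions.
    x = np.stack([rng.permutation(n_bands)[:k]
                  for _ in range(n_particles)]).astype(float)
    v = np.zeros_like(x)
    p_best, p_cost = x.copy(), np.array([evaluate(xi) for xi in x])
    g_best = p_best[p_cost.argmin()].copy()

    for _ in range(n_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        # Equation (1): velocity update, then position update.
        v = omega * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
        x = x + v
        c = np.array([evaluate(xi) for xi in x])
        improved = c < p_cost
        p_best[improved], p_cost[improved] = x[improved], c[improved]
        g_best = p_best[p_cost.argmin()].copy()

    return np.unique(np.clip(g_best, 0, n_bands - 1).astype(int))
```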

**Figure 2.** The method for the PSO-MEAC system. The algorithm initializes a set of particles or a combination of bands; at each iteration, the cost function (MEAC) associated with individual particles is computed; trajectories of the particles are projected towards the particle with the best solution; the loop is exited after the specified number of iterations is reached. The particle with a minimum cost function is identified as the optimal solution.

In a traditional supervised classification, where representative class signatures are known through exhaustive field surveying, the band-selection process can be greatly simplified. However, in an aerial survey to determine suitable wavelength bands for a programmable UAV-hyperspectral system, such an exhaustive exercise is tedious, cumbersome, and not always possible. Therefore, MEAC was used as a criterion function in PSO, as it requires only class signatures and no training samples. The efficacy of this technique has been previously evaluated against other existing optimization methods by Su et al. [23] for feature selection on traditional hyperspectral datasets (airborne and satellite).

Assuming there are *p* classes present over the area in which the samples were collected, the endmember matrix can be written as *S* = [*s*1, *s*2, ..., *sp*]. According to Yang et al. [19], with linear mixing of the endmembers, a pixel *r* can be expressed as in Equation (2):

$$\mathbf{r} = \mathbf{S}\boldsymbol{\alpha} + \mathbf{n} \tag{2}$$

where *α* = [*a*1, *a*2, ..., *ap*]<sup>T</sup> is the abundance vector and *n* is uncorrelated noise with *E*(*n*) = 0 and *Cov*(*n*) = *σ*<sup>2</sup>*I* (*I* is the identity matrix).

Usually, the actual number of classes (*p*) is greater than the number of known class signatures (*q*); i.e., *q* < *p*. Hence, the uncorrelated noise will have *Cov*(*n*) = *σ*<sup>2</sup>Σ, where Σ is the noise covariance matrix. Therefore, the abundance vector becomes the weighted least-squares solution, as in Equation (3):

$$\hat{\boldsymbol{\alpha}} = \left(\mathbf{S}^T \boldsymbol{\Sigma}^{-1} \mathbf{S}\right)^{-1} \mathbf{S}^T \boldsymbol{\Sigma}^{-1} \mathbf{r} \tag{3}$$

with the first-order moment being *E*(*α̂*) = *α* and the second-order moment being *Cov*(*α̂*) = *σ*<sup>2</sup>(*S*<sup>T</sup>Σ<sup>−1</sup>*S*)<sup>−1</sup>.

The analysis demonstrates that when all the classes are known, the remaining noise can be modelled as independent Gaussian noise. For this application, where meeting such sampling criteria was difficult and unknown classes were present, noise whitening was applied first. Yang et al. [19] and Su et al. [23] performed optimal band selection on traditional hyperspectral datasets and used all the pixels for the background noise (Σ) estimation. In this case, the background pixels' noise was calculated using the background class spectra and bare-earth spectra collected through ground-based sampling. The background-plus-noise covariance is denoted as Σ*b*+*n*; this estimate was used in this study. The estimate for the unknown class pixels is based on the likelihood of an unknown class (or a class of no interest) being present around the sampled class of interest. In scenes where all endmembers belong to known classes (or target classes of interest), the noise estimate Σ*b*+*n* is not required, but this is an unlikely condition in a spectrally complex swamp environment [7].

The identified optimal bands should allow minimal deviations of *α*ˆ from actual *α* [23]. With the partially known classes, the criterion function is equivalent to minimizing the trace of the covariance, as in Equation (4):

$$\underset{\Phi_S}{\arg\min}\left\{\mathrm{trace}\left[\left(\mathbf{S}^T \boldsymbol{\Sigma}_{b+n}^{-1} \mathbf{S}\right)^{-1}\right]\right\} \tag{4}$$

where Φ*S* is the selected band subset. The resulting band selection algorithm is referred to as the MEAC method [23].

The optimizer returns a suitably identified set of wavelength bands with the lowest cost criterion values (Equation (4)), upon successful completion of the PSO-MEAC algorithmic iterations (Figure 2).
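Equation (4) translates into only a few lines of code. The sketch below assumes `S` is a (bands × classes) signature matrix and `sigma_bn` the background-plus-noise covariance estimated from the field spectra; again, Python stands in for the authors' MATLAB implementation.

```python
import numpy as np

def meac_cost(S, sigma_bn, band_subset):
    """MEAC criterion (Equation (4)): trace of the abundance covariance
    restricted to the selected bands. S is (bands, classes); sigma_bn is
    the background-plus-noise covariance (bands, bands)."""
    idx = np.asarray(band_subset)
    Ssub = S[idx, :]
    Sig = sigma_bn[np.ix_(idx, idx)]
    M = Ssub.T @ np.linalg.inv(Sig) @ Ssub
    return np.trace(np.linalg.inv(M))
```

The inner inverse exists as long as the selected bands keep *S*<sup>T</sup>Σ<sup>−1</sup>*S* full rank, i.e., at least as many distinct bands as class signatures are selected.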

#### *2.5. UAV-Hyperspectral Survey and Assessment*

After the identification of a set of optimal bands through the *data-driven* PSO-MEAC approach, the FPI hyperspectral sensor was programmed to acquire data in the selected narrow wavelength bands. A UAV-hyperspectral mission was carried out in pre-planned waypoint acquisition mode with 85% forward and 75% lateral overlap from a flying altitude of 50 m. The sensor exposure time was set at 10 ms per band to provide good radiometric image quality for the existing illumination conditions. The UAV-hyperspectral survey was performed within two hours of local solar noon and in clear weather conditions with no clouds. This was done to avoid both the effect of significant illumination variations and shadows cast by clouds during the aerial image acquisition. However, because the experimental site is situated at a latitude in the southern hemisphere (34°21′24.0″S, 150°51′51″E) where the sun projects a shallow incidence angle, the issue of shadows projected by trees and other tall vegetation was unavoidable. In addition to the *data-driven* PSO-MEAC-tuned survey, another aerial survey was performed with an *indices-based* [7] wavelength selection approach, using the same UAV flight characteristics and sensor exposure configuration. A band stabilization workflow was adopted to co-register spatial shifts between bands in the hypercubes from both aerial acquisition modes [26]. Further, the regular radiometric, illumination adjustment, mosaicking, and geometric correction procedures for the hypercubes were carried out [7]. The UAV-hyperspectral orthomosaics achieved a high spatial resolution of 2 cm in ground sampling distance.

A supervised support vector machine (SVM) classifier was used to classify the hyperspectral datasets into their constituent classes. The SVM is an efficient kernel-based machine learning classifier suitable for high-dimensional feature spaces and is widely used in classifying hyperspectral datasets [32–34]. The classification was performed as an evaluation step to compare the efficacy of the wavelengths identified through the *data-driven* PSO-MEAC and *indices-based* approaches. As the fundamental objective in this study was simply to evaluate the two methods, and not to achieve superior classification accuracies, involving more complex classification algorithms was deemed unnecessary. Standard parameter settings (a radial basis function with a kernel gamma of 0.167, a penalty parameter of 100, and a pyramid level of 5) were used for the SVM classification. The overall and individual class classification accuracies were computed using the ground truth samples.

For evaluating the efficacy of the PSO-MEAC-identified bands through classification, a total of 120 ground truth measurements were collected for shrub-type swamp vegetation through a rigorous field survey, and 120 ground truth polygons were identified through visual interpretation of the high-resolution hyperspectral data. The sampled ground-based (120) and image-based (120) polygons were randomly divided into 1:1 mutually exclusive sets of training and test samples, i.e., 60 ground-based and 60 image-based polygons in each of the training and test groups. The ground truth training set was used to train the SVM classifier, and the test samples were used to compute the overall accuracy (OA), kappa (κ), and confusion matrix to evaluate the classification accuracies. The spectral data from the training and test sample polygons were obtained from the UAV-hyperspectral datasets in the corresponding *data-driven* PSO-MEAC and *indices-based* modes.
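As an illustration, the reported SVM configuration maps onto scikit-learn as follows. This is an assumption about tooling (the 'pyramid level' option suggests ENVI-style software and has no direct scikit-learn analogue), and random arrays stand in for the polygon spectra.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, cohen_kappa_score

rng = np.random.default_rng(0)
# Stand-ins for per-pixel spectra (15 selected bands) and labels drawn
# from the six classes; real inputs would come from the ground-truth
# and image-interpreted polygons described above.
X_train, y_train = rng.random((600, 15)), rng.integers(0, 6, 600)
X_test, y_test = rng.random((600, 15)), rng.integers(0, 6, 600)

# Reported settings: RBF kernel, gamma = 0.167, penalty parameter C = 100.
clf = SVC(kernel="rbf", gamma=0.167, C=100).fit(X_train, y_train)
pred = clf.predict(X_test)
print("OA:", accuracy_score(y_test, pred))
print("kappa:", cohen_kappa_score(y_test, pred))
```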

#### **3. Results and Discussion**

This section details the results and discussion of optimal band selection using the *data-driven* PSO-MEAC workflow and its evaluation against the *indices-based* approach.

#### *3.1. Optimal Band Identification Using PSO-MEAC*

The PSO-based optimal band identification workflow determines a list of suitable bands according to the MEAC cost criterion. The PSO-MEAC workflow was executed with a population size of 100, an inertia weight of 0.98, and a maximum of 500 iterations. A total of 15 bands, i.e., *k* = 15, were identified, based on the maximum band capacity of the FPI sensor for the on-board UAV data acquisition mode in an un-binned setting (1010 × 1010 pixels).
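Tying this run configuration back to the hypothetical `pso` and `meac_cost` sketches from Section 2.4, an equivalent run would look roughly like this (stand-in data; the authors' implementation is a MATLAB GUI module):

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.random((380, 4))        # stand-in: 380-band signatures for 4 classes
sigma_bn = 0.01 * np.eye(380)   # stand-in background-plus-noise covariance

# Population size 100, inertia weight 0.98, 500 iterations, k = 15 bands.
best_bands = pso(lambda b: meac_cost(S, sigma_bn, b),
                 n_bands=380, k=15, n_particles=100,
                 omega=0.98, n_iter=500)
print(best_bands)
```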

The selected combination of bands is re-configured at every iteration to minimize the cost function (Figure 2). A new combination of bands is designated optimal if it achieves the best (or minimum) cost. To analyze the performance of the in-field optimal band identification and sensor tuning using the PSO-MEAC approach, a set of internally computed parameters (criterion cost and index of runs) was logged at every iteration (Figure 3). The PSO-MEAC approach determines the suitable combination of bands (or band-index) using the cost criterion (Equation (4)). The reduction of the best cost value traces the learning curve of the optimization workflow (Figure 3a). At every iteration, the cost associated with the previous band-index is compared with that of the new band-index. A record of these parameters reveals the process of convergence to the desired solution by the implemented metaheuristic workflow. A measure of the final cost and a plot of the identified optimal band combination are also produced. It can be seen that, using the PSO-MEAC method, better (i.e., smaller) values of the cost criterion can be achieved. Each iteration may produce slightly different band combinations according to the cost criterion, as shown by the plot of the index of runs in Figure 3b. The final cost of the PSO-MEAC was −7.7 × 10⁻⁹. At this stage, the identified band indices were 56, 88, 101, 119, 151, 172, 211, 217, 251, 284, 303, 326, 341, 360, and 380 (Figure 3c). The corresponding FPI wavelengths were 555.33, 587.21, 600.34, 618.21, 650.39, 671.02, 710.12, 716.11, 750.19, 783.46, 802.35, 825.28, 840.15, 859.53, and 880.43 nm, with respective FWHMs of 9.81, 10.62, 9.88, 12.17, 10.78, 11.77, 9.78, 9.61, 9.58, 10.60, 10.56, 10.49, 13.69, 13.12, and 13.27 nm.

**Figure 3.** Optimal band selection: (**a**) a plot demonstrating the variation of the cost criterion with the PSO-MEAC iterations, (**b**) bands selected in each iteration, and (**c**) a plot of the identified optimal bands overlaid on the class spectra. The cost criterion was progressively minimized with the number of iterations. The variations of band position with the index of runs in every iteration provide insights into the functioning of PSO-MEAC. Overall, the PSO-MEAC-identified bands are well distributed over the key wavebands with maximal variation between inter-class reflectance.

The PSO-MEAC workflow uses a complex high-dimensional search strategy, producing several intermediate local and global combinations of bands, so the final solution may not be the same for every execution. Previous implementations of PSO-MEAC [23] focused on minimising the number of bands in optimal configurations, which is suitable for dimensionality reduction in traditional airborne or satellite hyperspectral imaging, where a complete set of bands has already been acquired. In the proposed method, the number of bands to be identified is predefined by the user, which makes it important to use the FPI sensor to its fullest potential (i.e., the hypercube band capacity at the desired spectral binning) to acquire the maximum possible information in the optimal configuration. To evaluate the computational complexity, the PSO-MEAC workflow was programmed in MATLAB (ver. 9.5) and implemented as a GUI module running on a portable field data acquisition computer with a 1.5 GHz processor and 512 MB of memory. The module took roughly 4 to 5 min for every 500 iterations with the selected number of class samples. This demonstrates the operational efficiency of the system, despite its complex search hierarchy, and makes it usable for pre-tuning the programmable FPI sensor for optimised wavelength selection in a UAV-hyperspectral survey.

Acquisition and identification of optimal bands using the characteristic spectral signatures of individual swamp species have traditionally been performed using the separability of the spectrum at the respective wavelength bands. In this study, the employed PSO-MEAC-based search strategy automatically analyses and identifies wavelength bands based on the maximum separability of the reflectance using the MEAC cost criterion function. The field spectrum collected for each shrub-type vegetation species is shown in Figure 3c, and the identified wavelength band positions are shown using a set of superimposed vertical lines. Our approach was implemented using a GUI-based interface on a portable data acquisition computer, which enabled rapid analysis of spectral signatures and identification of suitable wavelength bands. The developed technique and tools were found to be efficient in a field environment during surveying.

#### *3.2. Classification*

The comparative evaluation between the *data-driven* PSO-MEAC and *indices-based* wavelength tuning approaches was performed using an SVM classifier. Two dedicated datasets (*data-driven* PSO-MEAC and *indices-based*) were collected from the swamp. The scene primarily comprised three shrub-type vegetation classes (i.e., grass trees, pouched coral ferns, and Sedgeland complex) and two tree-type vegetation classes (i.e., black sheoak and eucalyptus). A small portion of the area was bare of vegetation cover and was treated as a separate "bare earth" class. Therefore, a total of six classes were used in the classification-based comparative evaluation. The optimal bands identified using the *data-driven* PSO-MEAC approach produced better results than the *indices-based* approach with the SVM classifier. Combining the optimal bands identified using the *data-driven* PSO-MEAC with the SVM classifier produced an overall accuracy of 85.16% and a kappa coefficient of 0.73, whereas the *indices-based* approach produced an overall accuracy of 76.54% and a kappa coefficient of 0.67. The comparative classification maps for the *data-driven* PSO-MEAC and *indices-based* approaches produced using the SVM classifier are shown in Figure 4.

**Figure 4.** Classification map of the swamp site's vegetation classes and species produced using a support vector machine classifier with (**a**) *data-driven* (PSO-MEAC) optimal band identification and (**b**) *indices-based* band selection.

The producer's accuracy, or error of omission, refers to the conditional probability that a certain land cover on the ground is correctly mapped, whereas the user's accuracy, or error of commission, refers to the conditional probability that a pixel labeled as a certain land-cover class in the map actually belongs to that class [35]. The producer's and user's accuracies for each class with the best classification method, *data-driven* PSO-MEAC, are shown in Table 1. With the exception of the "grass tree" class, the accuracy for each class was satisfactory overall (>70%), particularly when differentiating between swamp-type (Sedgeland complex) and non-swamp-type (*Eucalyptus*) vegetation. The results also indicate the potential of the process for distinguishing certain critical non-swamp-type terrestrial species (black sheoak and bracken fern) within the swamp environment. Increases in the proportions of these terrestrial species in a swamp indicate changes in the swamp hydrology, whereas no changes in the proportions of terrestrial species (or changes within equilibrium limits) indicate stable hydrology and peat moisture levels. These results, therefore, demonstrate the usefulness of the method for directly mapping the changes induced in a swamp environment by fluctuations of the groundwater level.


**Table 1.** Evaluation of the classification accuracy achieved using the *data-driven* (PSO-MEAC) method against *indices-based* band selection.

#### **4. Conclusions**

Identification of optimal bands for vegetation monitoring has been an ongoing research problem in hyperspectral remote sensing. The issue is significant in spectrally complex environments with diverse vegetation species, such as swamps and wetlands. Extensive surveys and post-processing solutions have been recurrently used in different swamp-type environments. This study presents an innovative approach for the rapid in-field identification of spectrally significant wavelength bands. The developed method was employed to tune a programmable hyperspectral sensor before UAV-borne surveys. The method was implemented through a metaheuristic workflow based on particle swarm optimization (PSO), with minimum estimated abundance covariance (MEAC) as the cost criterion function. A portable in-field hyperspectral signature collection system was devised using the tuneable FPI hyperspectral sensor. The set-up improved the collection of class spectra and background noise spectra, which were then used to identify the optimal band configuration. The method identifies the optimal bands based on representative class spectral signatures, avoiding the requirement of extensive in-field sampling. Additionally, the method works even when the number of sample observations is less than the total number of potential hyperspectral bands, which is not possible with other dimensionality reduction methods, such as PCA. The method was successfully tested to identify a set of optimal bands for maximizing the spectral differentiation of swamp-type vegetation species and communities. The algorithm could be tuned to incorporate vegetation trait retrieval by changing the criterion function. The approach would be valuable in environmental mapping, precision agriculture, phenotyping, and forestry for estimating phenotypic traits such as chlorophyll content, photosynthetic capacity, and biomass, and for studying vegetation under different treatments or biotic and abiotic stresses.

**Author Contributions:** B.P.B. and S.R. conceived of the experiment. B.P.B. conducted the experiments, data analysis, and writing of the original draft. S.R. conducted project administration, manuscript review, and editing. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data are not available due to non-disclosure agreements.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Residual Augmented Attentional U-Shaped Network for Spectral Reconstruction from RGB Images**

**Jiaojiao Li †, Chaoxiong Wu \*, Rui Song †, Yunsong Li and Weiying Xie**

The State Key Laboratory of Integrated Service Networks, Xidian University, Xi'an 710000, China; jjli@xidian.edu.cn (J.L.); rsong@xidian.edu.cn (R.S.); ysli@mial.xidian.edu.cn (Y.L.); wyxie@xidian.edu.cn (W.X.) **\*** Correspondence: cxwu@stu.xidian.edu.cn; Tel.: +86-155-2960-9856

† These authors contributed equally to this work.

**Abstract:** Deep convolutional neural networks (CNNs) have been successfully applied to spectral reconstruction (SR) and acquired superior performance. Nevertheless, the existing CNN-based SR approaches integrate hierarchical features from different layers indiscriminately, lacking an investigation of the relationships of intermediate feature maps, which limits the learning power of CNNs. To tackle this problem, we propose a deep residual augmented attentional u-shape network (RA2UN) with several double improved residual blocks (DIRB) instead of paired plain convolutional units. Specifically, a trainable spatial augmented attention (SAA) module is developed to bridge the encoder and decoder to emphasize the features in the informative regions. Furthermore, we present a novel channel augmented attention (CAA) module embedded in the DIRB to rescale adaptively and enhance residual learning by using first-order and second-order statistics for stronger feature representations. Finally, a boundary-aware constraint is employed to focus on the salient edge information and recover more accurate high-frequency details. Experimental results on four benchmark datasets demonstrate that the proposed RA2UN network outperforms the state-of-the-art SR methods under quantitative measurements and perceptual comparison.

**Keywords:** spectral reconstruction; residual augmented attentional u-shape network; spatial augmented attention; channel augmented attention; boundary-aware constraint

#### **1. Introduction**

Hyperspectral imaging systems can record the actual scene spectra over a large set of narrow spectral bands [1]. In contrast to ordinary cameras, which record only the reflectance or transmittance of three spectral bands (i.e., Red, Green, and Blue), hyperspectral spectrometers encode hyperspectral images (HSIs) by obtaining a continuous spectrum at each pixel of the object. The abundant spectral signatures are beneficial to many computer vision tasks, such as face recognition [2], image classification [3,4], and object tracking [5].

Traditional scanning HSI acquisition systems rely on either 1D line or 2D plane scanning (e.g., whiskbroom [6], pushbroom [7], or variable-filter technology [8]) to encode the spectral information of the scene. Whiskbroom imaging devices apply mirrors and fiber optics to collect reflected hyperspectral signals point by point. The subsequent pushbroom HSI acquisition systems capture HSIs with dispersive optical elements and light-sensitive sensors in a line-by-line scanning manner. As for variable-filter imaging equipment, it senses each scene point multiple times, each time in a different spectral band. In fact, the scanning operation of these devices is extremely time-consuming, which severely limits the application of HSIs under dynamic conditions.

To make HSI acquisition of dynamic scenes feasible, scan-free or snapshot hyperspectral technologies have been explored, e.g., coded aperture snapshot spectral imagers [9], mosaic [10], and light-field [11] systems. A computed-tomography imaging spectrometer converts a three-dimensional object cube into multiplexed two-dimensional projections, and these data can later be used to reconstruct the hyperspectral cube computationally [12,13].

**Citation:** Li, J.; Wu, C.; Song, R.; Li, Y.; Xie, W. Residual Augmented Attentional U-Shaped Network for Spectral Reconstruction from RGB Images. *Remote Sens.* **2021**, *13*, 115. https://doi.org/10.3390/rs13010115

Received: 8 October 2020 Accepted: 29 December 2020 Published: 31 December 2020


The coded aperture snapshot spectral imager uses compressed sensing advances to achieve snapshot spectral imaging, and an iterative algorithm is used to reconstruct the data cube [9,14]. A novel hyperspectral imaging system combines a stereo camera to perform accurate HSI measurements through geometrical alignment, radiometric calibration, and normalization [10]. However, these systems depend on post-processing with huge computational complexity and record HSIs with decreased spatial and spectral resolution. Meanwhile, the deployment of these facilities remains prohibitively expensive and complicated.

Due to the limitations of scanning and snapshot hyperspectral systems, spectral reconstruction (SR) from ubiquitous RGB images has attracted extensive attention and research as an alternative solution; i.e., given an RGB image, the corresponding HSI with higher spectral resolution is recovered by fulfilling a three-to-many mapping directly. Obviously, SR is an ill-posed problem. Early work on SR leveraged sparse coding or shallow learning models to rebuild HSI data [15–19]. Nguyen et al. [15] trained a shallow radial basis function network that leveraged RGB white-balancing to normalize the scene illuminations and then recover the scene reflectance spectra. Later, Robles-Kelly [16] extracted a set of reflectance properties from the training set and obtained convolutional features using sparse coding to perform spectral reconstruction. Typically, Arad [17] and Aeschbacher et al. [19] exploited potential HSI priors to create an over-complete sparse dictionary of hyperspectral signatures and corresponding RGB projections, which facilitated the subsequent reconstruction of the HSIs. More recently, with the aid of low-rank constraints, Zhang et al. [20] proposed to make full use of the high-dimensionality structure of the desired HSI to boost the reconstruction quality. Unfortunately, these methods only model low-level and simple correlations between RGB images and hyperspectral signals, which limits their expressive ability and leads to poor performance in challenging situations. Accordingly, it is indispensable to further improve the quality of the reconstructed HSIs for SR.

Recently, witnessing the great success of CNNs in the field of hyperspectral spatial super-resolution [21,22], numerous CNN-based algorithms have been widely explored for the SR task [23–28]. For example, Galliani et al. [23] modified a high-performance network originally designed for semantic segmentation to learn the statistics of natural image spectra and generated finely resolved HSIs from the RGB inputs. This is a milestone work, since it introduced deep learning into the SR task for the first time. To promote research on SR, the NTIRE 2018 challenge on spectral reconstruction from RGB images was organized, the first SR challenge [29]. Meanwhile, a great number of excellent approaches were proposed in this competition [30–34]. Impressively, Shi et al. [34] designed a deep HSCNN-R network consisting of multiple residual blocks and acquired promising performance, which was developed from their previous HSCNN model [25]. Stiebel et al. [30] investigated a lightweight U-net and added a simple pre-processing layer to enhance the quality of recovery in a real-world scenario. Not long ago, the second SR challenge, NTIRE 2020 on spectral reconstruction from RGB images [35], was successfully held and a new dataset was released, which further promoted the development of CNN-based SR methods [36–41] as well as more recent works [42–45]. To explore the interdependencies among intermediate features and the camera spectral sensitivity prior, Li et al. [36] proposed an adaptive weighted attention network and incorporated the discrepancies of the RGB images and HSIs into the loss function. As the winning method on the "Real World" track of the second SR competition, Zhao et al. [37] organized a 4-level hierarchical regression network with a PixelShuffle layer for inter-level interaction. Hang et al. [44] designed a decomposition model to reconstruct HSIs and a self-supervised network to fine-tune the reconstruction results. Li et al. [45] presented a hybrid 2D–3D deep residual attentional network to take full advantage of the spatial–spectral context information. These two SR challenges are divided into "Clean" and "Real World" tracks. The "Clean" track aims to recover HSIs from noise-free RGB images created by a known camera response function, while the "Real World" one requires participants to rebuild the HSIs from JPEG-compressed RGB images obtained by an unknown camera response function. It is worth noting that the camera response functions for the same tracks of the two challenges are different. Also, to provide a more accurate simulation of physical camera systems, the NTIRE 2020 "Real World" track is updated with additional simulated camera noise and a demosaicing operation.

Attention mechanisms have been a useful tool in a variety of tasks, for instance, image captioning [46], classification [47,48], single image super-resolution [49–51], and person re-identification [52]. Chen et al. [46] proposed SCA-CNN, which incorporated spatial and channel-wise attention for image captioning. Dai et al. [50] presented a deep second-order attention network that explores second-order statistics of features rather than first-order ones (e.g., global average pooling) [47]. Zhang et al. [53] proposed an effective relation-aware global attention module that captures global structural information for better attention learning. Only a few very recent methods for SR [36,37,45] have considered a channel-wise attention mechanism, and only with first-order statistics.

Compared with the previous sparse recovery and shallow mapping methods, the end-to-end training paradigm and discriminative representational learning of CNNs bring considerable improvements to SR. However, the existing CNN-based SR approaches focus only on realizing the RGB-to-HSI mapping by means of designing deeper and wider network frameworks, which integrates hierarchical features from different layers without distinction and fails to explore the feature correlations of intermediate layers, thus hindering the expression capacity of CNNs. In fact, not all spatial regions of a feature map are equally important in the SR task, and the feature responses of different channels also play different roles in SR performance. Additionally, most CNN-based SR models do not consider the problem of spectral aliasing at edge positions, resulting in relatively low performance.

To address these issues, a deep residual augmented attentional u-shape network (RA2UN) is proposed for SR. Concretely, the backbone of the proposed network is stacked with several double improved residual blocks (DIRB) rather than paired plain convolutional units to extract increasingly abstract feature representations through powerful residual learning. Moreover, we develop a novel spatial augmented attention (SAA) module to bridge the encoder and decoder, which is employed to selectively highlight the features in informative regions and boost the spatial feature representations. To model the interdependencies among channels of intermediate feature maps, a trainable channel augmented attention (CAA) module embedded in the DIRB is presented to adaptively recalibrate channel-wise feature responses by exploiting first-order and second-order statistics. Such CAA modules make the network dynamically focus on useful features and further strengthen the intrinsic residual learning of DIRBs. Finally, we establish a boundary-aware constraint to guide the network to pay close attention to salient information at boundary locations, which can alleviate spectral aliasing at edge positions and recover more accurate edge details.

In summary, the main contributions of this paper are as follows:

• A deep residual augmented attentional u-shape network (RA2UN) is proposed for SR, whose backbone is stacked with double improved residual blocks (DIRB) instead of paired plain convolutional units, together with a spatial augmented attention (SAA) module bridging the encoder and decoder to emphasize the features in informative regions.

• A channel augmented attention (CAA) module embedded in the DIRB is presented to adaptively recalibrate channel-wise feature responses and enhance residual learning by using first-order and second-order statistics for stronger feature expression.

• A boundary-aware constraint is established to guide the network to focus on the salient edge information, which is helpful to alleviate spectral aliasing at the edge position and preserve more accurate high-frequency details.

#### **2. Materials and Methods**

#### *2.1. The Proposed RA*2*UN Network*

Figure 1 gives an illustration of our proposed RA2UN network. The backbone architecture mainly consists of several DIRB blocks. The SAA module bridges the corresponding DIRB counterparts between the encoder and decoder, and a CAA module is embedded in each DIRB. Within each DIRB, batch normalization layers are not used, since the normalization operation can limit the network's ability to learn spatial dependencies and spectral distributions. Meanwhile, we adopt the Parametric Rectified Linear Unit (PReLU) instead of ReLU as the activation function to introduce more nonlinear representation and obtain stronger robustness. The entire DIRB is formulated as

$$y = \rho(R(x, \mathcal{W}_{l,1}) + x) \tag{1}$$

$$z = \rho(R(y, \mathcal{W}\_{l,2}) + y) \tag{2}$$

where *x* and *z* denote the input and output of the DIRB block, and *y* is the output of the first residual unit of the DIRB block. *W*<sub>*l*,1</sub> and *W*<sub>*l*,2</sub> represent the weight matrices of the first and second residual units of the *l*-th DIRB block. *R*(·) denotes the residual mapping to be learned, which comprises two convolutional layers and one PReLU function. *ρ* is the PReLU function. Our proposed RA2UN keeps the same spatial resolution of feature maps throughout the model, which maintains plentiful spatial detail for recovering an accurate spectrum from the RGB inputs. The specific parameter settings for the backbone framework are given in Table 1. It can be seen that the output size of each DIRB of our RA2UN is not decreased in the encoding and decoding parts, i.e., we remove the down-sampling operation, which would lose partial spatial details and fail to retain the original pixel information as the network goes deeper, inevitably reducing the accuracy of SR. In the encoder section, a single convolutional layer is first employed to extract shallow features from the input images. Then several DIRBs are stacked for deep feature extraction. Finally, we perform the final reconstruction via one convolutional layer.
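To make the DIRB structure concrete, the following is a minimal PyTorch sketch of Equations (1) and (2); the 3 × 3 kernel size and fixed channel width are assumptions, as the equations do not specify them, and the CAA module that the full network embeds in each DIRB (Section 2.3) is omitted here.

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    """One improved residual unit: rho(R(x) + x), with R = conv-PReLU-conv."""
    def __init__(self, channels=64, kernel_size=3):  # width/kernel are assumptions
        super().__init__()
        pad = kernel_size // 2
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size, padding=pad),
            nn.PReLU(),
            nn.Conv2d(channels, channels, kernel_size, padding=pad),
        )
        self.rho = nn.PReLU()  # activation applied after the identity skip

    def forward(self, x):
        return self.rho(self.residual(x) + x)

class DIRB(nn.Module):
    """Double improved residual block: two stacked residual units, Eqs. (1)-(2)."""
    def __init__(self, channels=64):
        super().__init__()
        self.unit1 = ResidualUnit(channels)  # Eq. (1)
        self.unit2 = ResidualUnit(channels)  # Eq. (2)

    def forward(self, x):
        return self.unit2(self.unit1(x))
```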

**Figure 1.** Network architecture of the proposed RA2UN network. The input of the RA2UN network is an RGB image and the output is the corresponding reconstructed HSI. The detailed network and parameter settings are given in Table 1.

**Table 1.** Parameter settings for the backbone framework of our proposed RA2UN. (·) stands for the dimensions of the convolutional kernels (input channels, kernel size<sup>2</sup>, filter number). The stride and padding of the convolution kernels are set to 1. The dimensions of the feature map are denoted by *C* × *H* × *W* (*H* = *W*), where C, H and W denote the channel, height and width of the feature map. {·} indicates the DIRB block. The four rows in the kernels column denote the dimensions of the four convolutional kernels of each DIRB block. [·] is the improved residual unit.


#### *2.2. Spatial Augmented Attention Module*

In general, the information carried by different spatial regions of a feature map is not equally important in the SR task. To focus more attention on the features in informative regions, an SAA module is designed between the encoder and the decoder, which can effectively boost the interaction and fusion between low-level and high-level features. The specific diagram of the SAA module is displayed in Figure 2. Our proposed SAA module consists of paired symmetric and asymmetric convolutional groups. The asymmetric convolutions use 1D horizontal and vertical kernels (i.e., 1 × 3 and 3 × 1 sizes), which not only strengthen the square convolution kernels but also capture multi-directional contextual information to obtain discriminative spatial dependencies.

**Figure 2.** The overview of spatial augmented attention module. ⊕ denotes the element-wise summation.

Given an intermediate feature map denoted as **F** = [**f**<sub>1</sub>, **f**<sub>2</sub>, ··· , **f**<sub>*c*</sub>, ··· , **f**<sub>*C*</sub>] containing *C* feature maps with spatial size *H* × *W*, we firstly feed **F** to the parallel paired symmetric and asymmetric convolutional groups

$$\mathbf{C}_1 = \rho(Conv_{1,2}^{3 \times 1}(\rho(Conv_{1,1}^{1 \times 3}(\mathbf{F})))) \tag{3}$$

$$\mathbf{C}_2 = \rho(Conv_{2,2}^{1 \times 3}(\rho(Conv_{2,1}^{3 \times 1}(\mathbf{F})))) \tag{4}$$

$$\mathbf{C}_3 = \rho(Conv_{3,2}^{3 \times 3}(\rho(Conv_{3,1}^{3 \times 3}(\mathbf{F})))) \tag{5}$$

where *ρ* denotes the PReLU activation function. *Conv*<sup>1×3</sup><sub>1,1</sub>(·), *Conv*<sup>3×1</sup><sub>2,1</sub>(·) and *Conv*<sup>3×3</sup><sub>3,1</sub>(·) project the feature **F** ∈ *R*<sup>C×H×W</sup> to a lower size *R*<sup>C/t×H×W</sup> along the channel dimension. Then the next convolution layers *Conv*<sup>3×1</sup><sub>1,2</sub>(·), *Conv*<sup>1×3</sup><sub>2,2</sub>(·) and *Conv*<sup>3×3</sup><sub>3,2</sub>(·) map the low-dimensional features to the multi-direction spatial feature descriptors **C**<sub>1</sub>, **C**<sub>2</sub>, **C**<sub>3</sub> ∈ *R*<sup>1×H×W</sup>, which contain rich contextual information. Besides, this design adds only a small number of parameters and little computational burden. To compute the spatial attention, the feature descriptors are summed and normalized to [0, 1] through a sigmoid activation *σ*

$$\mathbf{A}_s(\mathbf{F}) = \sigma(\mathbf{C}_1 + \mathbf{C}_2 + \mathbf{C}_3) \tag{6}$$

where **A**<sub>*s*</sub>(**F**) ∈ *R*<sup>1×H×W</sup> represents the spatial attention, which encodes the degree of importance of each spatial position of the original feature **F** and determines which spatial locations should be emphasized. Finally, we perform the element-wise multiplication ⊗ between **A**<sub>*s*</sub>(**F**) and **F**

$$\mathbf{F}^{s} = \mathbf{A}_s(\mathbf{F}) \otimes \mathbf{F} \tag{7}$$

where **F**<sup>*s*</sup> is the refined feature. During this processing, the spatial attention values are broadcast along the channel dimension. The SAA module bridges the encoder and decoder to selectively highlight the features in the important regions and boost the spatial feature representations.
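A minimal PyTorch sketch of the SAA module following Equations (3)–(7) is given below; the channel-reduction ratio *t* and the padding scheme that keeps the spatial size fixed are assumptions beyond what the equations state.

```python
import torch
import torch.nn as nn

class SAA(nn.Module):
    """Spatial augmented attention, Eqs. (3)-(7): paired asymmetric (1x3/3x1)
    and symmetric (3x3) branches produce a 1 x H x W attention map."""
    def __init__(self, channels, t=4):
        super().__init__()
        mid = channels // t  # channel reduction by the ratio t

        def branch(k1, p1, k2, p2):
            return nn.Sequential(
                nn.Conv2d(channels, mid, k1, padding=p1), nn.PReLU(),
                nn.Conv2d(mid, 1, k2, padding=p2), nn.PReLU(),
            )

        self.b1 = branch((1, 3), (0, 1), (3, 1), (1, 0))  # Eq. (3): 1x3 then 3x1
        self.b2 = branch((3, 1), (1, 0), (1, 3), (0, 1))  # Eq. (4): 3x1 then 1x3
        self.b3 = branch((3, 3), (1, 1), (3, 3), (1, 1))  # Eq. (5): 3x3 twice

    def forward(self, f):
        a = torch.sigmoid(self.b1(f) + self.b2(f) + self.b3(f))  # Eq. (6)
        return a * f  # Eq. (7): attention broadcast along the channel dimension
```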

#### *2.3. Channel Augmented Attention Module*

In contrast to the preceding SAA module, which extracts the inter-spatial relationships of features, our CAA module attempts to explore the inter-channel dependencies of features for SR. To obtain a more powerful learning capability of the network, we present a novel CAA module that models interdependencies between channels by jointly using first-order and second-order statistics for stronger feature representations (see Figure 3).

**Figure 3.** The overview of channel augmented attention module. ⊕ denotes the element-wise summation.

We first aggregate spatial information of the feature map **F** ∈ *R*<sup>C×H×W</sup> (**F** = [**f**<sub>1</sub>, **f**<sub>2</sub>, ··· , **f**<sub>*c*</sub>, ··· , **f**<sub>*C*</sub>], **f**<sub>*c*</sub> ∈ *R*<sup>H×W</sup>) by using global average pooling

$$\mathbf{s}_c^{\mathrm{1st}} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathbf{f}_c(i, j) \tag{8}$$

where **s**<sup>1st</sup><sub>*c*</sub> denotes the *c*-th element of the first-order channel descriptor **S**<sup>1st</sup> ∈ *R*<sup>C</sup> and **f**<sub>*c*</sub>(*i*, *j*) is the response at location (*i*, *j*) of the *c*-th feature map **f**<sub>*c*</sub>. As for the second-order channel descriptor, we reshape the feature map **F** ∈ *R*<sup>C×H×W</sup> to a feature matrix **D** ∈ *R*<sup>C×n</sup>, *n* = *H* × *W*, and compute the sample covariance matrix

$$\mathbf{X} = \mathbf{D} \bar{\mathbf{I}} \mathbf{D}^{T} \tag{9}$$

where **Ī** = (1/*n*)(**I** − (1/*n*)**1**), and **X** ∈ *R*<sup>C×C</sup>, **X** = [**x**<sub>1</sub>, **x**<sub>2</sub>, ··· , **x**<sub>*c*</sub>, ··· , **x**<sub>*C*</sub>], **x**<sub>*c*</sub> ∈ *R*<sup>1×C</sup>. **I** and **1** represent the *n* × *n* identity matrix and the matrix of all ones, respectively. Then the *c*-th dimension of the second-order statistics **S**<sup>2nd</sup> ∈ *R*<sup>C</sup> is formulated as

$$\mathbf{s}_c^{\mathrm{2nd}} = \frac{1}{C} \sum_{i=1}^{C} \mathbf{x}_c(i) \tag{10}$$

where **s**<sup>2nd</sup><sub>*c*</sub> denotes the *c*-th element of the second-order channel descriptor **S**<sup>2nd</sup> ∈ *R*<sup>C</sup> and **x**<sub>*c*</sub>(*i*) is the *i*-th value of the *c*-th row **x**<sub>*c*</sub>. To make use of the aggregated information **S**<sup>1st</sup> and **S**<sup>2nd</sup>, both descriptors are fed into a shared multi-layer perceptron (MLP) followed by a sigmoid function to generate the channel attention. The MLP consists of two fully connected (FC) layers and a PReLU non-linearity, where the output dimension of the first FC layer is *R*<sup>C/r</sup> and the output size of the second one is *R*<sup>C</sup>. *r* is the reduction ratio. In summary, the channel attention map is given by

$$\mathbf{A}_c(\mathbf{F}) = \sigma(FC_2(\rho(FC_1(\mathbf{S}^{\mathrm{1st}}))) + FC_2(\rho(FC_1(\mathbf{S}^{\mathrm{2nd}})))) \tag{11}$$

where *FC*<sub>1</sub>(·) and *FC*<sub>2</sub>(·) denote the two FC layers. **A**<sub>*c*</sub>(**F**) ∈ *R*<sup>C</sup> denotes the channel attention recording the importance of and interdependencies among channels, which is used to rescale the original input feature **F**

$$\mathbf{F}^{c} = \mathbf{A}_c(\mathbf{F}) \otimes \mathbf{F} \tag{12}$$

where ⊗ is element-wise multiplication and the channel attention values can be copied along the spatial dimension according to the broadcast mechanism. Inserted into the DIRB block, the CAA module can recalibrate channel-wise feature responses adaptively and enhance residual learning.
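The CAA computation of Equations (8)–(12) can be sketched in PyTorch as follows; note that multiplying by **Ī** = (1/*n*)(**I** − (1/*n*)**1**) is algebraically equivalent to centering each channel before the outer product, which is the cheaper form used here.

```python
import torch
import torch.nn as nn

class CAA(nn.Module):
    """Channel augmented attention, Eqs. (8)-(12): first- and second-order
    channel descriptors share one MLP to produce the channel attention."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),  # FC_1, output size C/r
            nn.PReLU(),
            nn.Linear(channels // r, channels),  # FC_2, output size C
        )

    def forward(self, f):
        b, c, h, w = f.shape
        n = h * w
        s1 = f.mean(dim=(2, 3))               # Eq. (8): global average pooling
        d = f.reshape(b, c, n)
        dc = d - d.mean(dim=2, keepdim=True)  # centering, equivalent to D*Ibar
        x = dc @ dc.transpose(1, 2) / n       # Eq. (9): covariance, (b, c, c)
        s2 = x.mean(dim=2)                    # Eq. (10): row means of X
        a = torch.sigmoid(self.mlp(s1) + self.mlp(s2))  # Eq. (11)
        return a.view(b, c, 1, 1) * f         # Eq. (12)
```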

#### *2.4. Boundary-Aware Constraint*

In the process of hyperspectral imaging, spectral aliasing tends to occur at edge positions, so the reconstruction accuracy of the boundary spectrum is low. To alleviate this spectral aliasing and recover more accurate high-frequency details of HSIs, we establish a boundary-aware constraint to guide the training process of the proposed RA2UN:

$$l = l_m + \tau l_b \tag{13}$$

$$l_m = \frac{1}{N} \sum_{p=1}^{N} \left( \left| \mathbf{I}_{HSI}^{(p)} - \mathbf{I}_{SR}^{(p)} \right| / \mathbf{I}_{HSI}^{(p)} \right) \tag{14}$$

$$l_b = \frac{1}{N} \sum_{p=1}^{N} \left( \left| \mathbf{B}(\mathbf{I}_{HSI}^{(p)}) - \mathbf{B}(\mathbf{I}_{SR}^{(p)}) \right| \right) \tag{15}$$

where *l<sub>m</sub>* represents the mean relative absolute error (MRAE) loss term that minimizes the numerical error between ground truths and the reconstructed results, and *l<sub>b</sub>* denotes the boundary-aware constraint component that simultaneously leads the network to focus on the salient edge information. *τ* is a weighting parameter. *N* is the total number of pixels. **I**<sup>(*p*)</sup><sub>*HSI*</sub> and **I**<sup>(*p*)</sup><sub>*SR*</sub> denote the *p*-th pixel value of the ground truth **I**<sub>*HSI*</sub> and the spectral reconstructed result **I**<sub>*SR*</sub>. **B**(·) represents the edge extraction function. To be specific, **B**(·) first performs Gaussian filtering to eliminate the influence of noise and then adopts the Prewitt operator [54] to obtain the boundaries of the ground truths and the reconstructed results. The Gaussian filtering kernel is [[0.0751, 0.1238, 0.0751], [0.1238, 0.2042, 0.1238], [0.0751, 0.1238, 0.0751]] with sigma 1.0. The Prewitt operators are [[−1.0, 0.0, 1.0], [−1.0, 0.0, 1.0], [−1.0, 0.0, 1.0]] and [[−1.0, −1.0, −1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]] in the x and y directions, respectively. To better observe the effect of edge extraction, we visualize several example images in Figure 4: the first row shows several original images from the NTIRE2020 dataset, and the second row displays the effect of edge extraction. From the mathematical perspective, compared with the single MRAE loss term *l<sub>m</sub>*, the compound loss function *l* shrinks the space of possible three-to-many mapping functions for the ill-posed SR problem and helps avoid falling into a local minimum, yielding more accurate spectral recovery, as will be demonstrated in Section 4.1. Finally, *τ* is empirically set to 1.0 in the proposed network.
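With the kernels quoted above, the compound loss of Equations (13)–(15) can be sketched as follows; combining the two Prewitt responses into a gradient magnitude is an assumption, as the text does not state how the x and y responses are merged.

```python
import torch
import torch.nn.functional as F

# Kernels given in the text: 3x3 Gaussian (sigma = 1.0) and Prewitt operators
GAUSS = torch.tensor([[0.0751, 0.1238, 0.0751],
                      [0.1238, 0.2042, 0.1238],
                      [0.0751, 0.1238, 0.0751]]).view(1, 1, 3, 3)
PREWITT_X = torch.tensor([[-1., 0., 1.]] * 3).view(1, 1, 3, 3)
PREWITT_Y = torch.tensor([[-1., -1., -1.],
                          [ 0.,  0.,  0.],
                          [ 1.,  1.,  1.]]).view(1, 1, 3, 3)

def edge_map(img):
    """B(.): Gaussian smoothing, then Prewitt gradient magnitude, per band."""
    b, c, h, w = img.shape
    flat = img.reshape(b * c, 1, h, w)
    smooth = F.conv2d(flat, GAUSS.to(img), padding=1)
    gx = F.conv2d(smooth, PREWITT_X.to(img), padding=1)
    gy = F.conv2d(smooth, PREWITT_Y.to(img), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12).reshape(b, c, h, w)

def compound_loss(sr, hsi, tau=1.0, eps=1e-8):
    """l = l_m + tau * l_b, Eqs. (13)-(15)."""
    l_m = torch.mean(torch.abs(hsi - sr) / (hsi + eps))        # MRAE term
    l_b = torch.mean(torch.abs(edge_map(hsi) - edge_map(sr)))  # boundary term
    return l_m + tau * l_b
```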

**Figure 4.** The first row (**a**–**d**) shows several original images from the NTIRE2020 dataset. The second row (**e**–**h**) displays the effect of edge extraction and the white lines represent boundary information.

#### **3. Experiments Setting**

#### *3.1. Datasets and Implementations*

In this paper, we evaluate the proposed RA2UN on four benchmark datasets, i.e., the NTIRE2018 "Clean" and "Real World" tracks and the NTIRE2020 "Clean" and "Real World" tracks. Following the competition instructions, the NTIRE2018 dataset contains 256 natural HSIs for the official training set and 5 + 10 additional images for the official validation and testing sets, each of size 1392 × 1300. All images have 31 spectral bands (400–700 nm at roughly 10 nm increments). The NTIRE2020 dataset consists of 450 images for the official training set, 10 images for the official validation set and 20 images for the official testing set, with 31 bands from 400 nm to 700 nm at 10 nm steps. Each band has a size of 512 × 482. The NTIRE2020 datasets were collected with a Specim IQ mobile hyperspectral camera. The Specim IQ camera is a stand-alone, battery-powered, push-broom spectral imaging system the size of a conventional SLR camera (207 × 91 × 74 mm), which can operate independently without the need for an external power source or computer controller. The NTIRE2018 datasets were acquired using a Specim PS Kappa DX4 hyperspectral camera and a rotary stage for spatial scanning.

For the dataset settings, due to the confidentiality of ground truth HSIs for the official testing sets of both SR contests, we choose the official validation set as the final testing set and randomly select several images from the official training set as the final validation set in this paper. The rest of the official training set is adopted as the final training set. Specifically, the NTIRE2020 final validation set contains 10 HSIs: "ARAD\_HS\_0079", "ARAD\_HS\_0089", "ARAD\_HS\_0255", "ARAD\_HS\_0304", "ARAD\_HS\_0363", "ARAD\_HS\_0372", "ARAD\_HS\_0387", "ARAD\_HS\_0422", "ARAD\_HS\_0434" and "ARAD\_HS\_0446". The NTIRE2018 final validation set comprises 5 HSIs: "BGU\_HS\_00001", "BGU\_HS\_00036", "BGU\_HS\_00204", "BGU\_HS\_00209" and "BGU\_HS\_00225".

During the training process, we crop 64 × 64 RGB and HSI sample pairs from the original NTIRE2020 and NTIRE2018 datasets. The batch size of our model is 16 and Adam [55] is chosen as the parameter optimization algorithm, with *β*<sub>1</sub> = 0.9, *β*<sub>2</sub> = 0.99 and *ε* = 10<sup>−8</sup>. The parameter *t* of the SAA module is 4 and the reduction ratio *r* of the CAA module is 16. The learning rate is initialized as 1.2 × 10<sup>−4</sup> and a polynomial function with power = 1.5 is set as the decay policy. We stop network training at 100 epochs. The proposed RA2UN network is implemented in the PyTorch framework on an NVIDIA 2080Ti GPU.
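As a rough illustration, the reported optimizer and decay settings translate into PyTorch as below; the placeholder network, the per-iteration stepping of the scheduler, and tying the schedule length to the 5.76 × 10<sup>5</sup> iterations reported in Table 2 are all assumptions of this sketch.

```python
import torch
import torch.nn as nn

net = nn.Conv2d(3, 31, 3, padding=1)  # placeholder for RA2UN (RGB -> 31 bands)
optimizer = torch.optim.Adam(net.parameters(), lr=1.2e-4,
                             betas=(0.9, 0.99), eps=1e-8)
max_iters = 576_000  # 5.76 x 10^5 iterations (cf. Table 2)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda it: max(0.0, 1.0 - it / max_iters) ** 1.5)  # polynomial decay, power 1.5
# In the training loop (sketch), scheduler.step() is called once per iteration.
```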

#### *3.2. Evaluation Metrics*

To objectively test the results of our proposed method on the NTIRE2020 and NTIRE2018 datasets, the mean relative absolute error (MRAE), root mean square error (RMSE), and spectral angle mapper (SAM) are adopted as metrics. The MRAE and RMSE are provided by the challenge, where MRAE is chosen as the ranking criterion rather than RMSE to avoid overweighting errors in the higher brightness region of the test image. The SAM is employed to measure the spectral quality. The MRAE, RMSE and SAM are defined as follows

$$MRAE = \frac{1}{N} \sum_{p=1}^{N} \left( \left| \mathbf{I}_{HSI}^{(p)} - \mathbf{I}_{SR}^{(p)} \right| / \mathbf{I}_{HSI}^{(p)} \right) \tag{16}$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{p=1}^{N} \left( \mathbf{I}_{HSI}^{(p)} - \mathbf{I}_{SR}^{(p)} \right)^2} \tag{17}$$

$$SAM = \frac{1}{M} \sum_{v=1}^{M} \arccos\!\left( \left\langle \mathbf{I}_{HSI}^{(v)}, \mathbf{I}_{SR}^{(v)} \right\rangle / \left( \|\mathbf{I}_{HSI}^{(v)}\|_2 \, \|\mathbf{I}_{SR}^{(v)}\|_2 \right) \right) \tag{18}$$

where **I**<sup>(*p*)</sup><sub>*HSI*</sub> and **I**<sup>(*p*)</sup><sub>*SR*</sub> denote the *p*-th pixel value of the ground truth and the reconstructed HSI. ⟨**I**<sup>(*v*)</sup><sub>*HSI*</sub>, **I**<sup>(*v*)</sup><sub>*SR*</sub>⟩ represents the dot product of the *v*-th spectral vectors of the ground truth and the reconstructed HSI. ‖·‖<sub>2</sub> is the *l*<sub>2</sub> norm. *N* is the total number of pixels and *M* is the total number of spectral vectors. A smaller MRAE, RMSE or SAM indicates better performance.
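For reference, a NumPy sketch of the three metrics in Equations (16)–(18) might look as follows; the small *eps* terms guard against division by zero and are not part of the definitions.

```python
import numpy as np

def mrae(gt, rec, eps=1e-8):
    """Mean relative absolute error, Eq. (16); gt, rec are (H, W, B) arrays."""
    return np.mean(np.abs(gt - rec) / (gt + eps))

def rmse(gt, rec):
    """Root mean square error, Eq. (17)."""
    return np.sqrt(np.mean((gt - rec) ** 2))

def sam(gt, rec, eps=1e-8):
    """Spectral angle mapper, Eq. (18), averaged over all spectral vectors."""
    g = gt.reshape(-1, gt.shape[-1])
    r = rec.reshape(-1, rec.shape[-1])
    dot = np.sum(g * r, axis=1)
    norms = np.linalg.norm(g, axis=1) * np.linalg.norm(r, axis=1) + eps
    cos = np.clip(dot / norms, -1.0, 1.0)  # guard against rounding error
    return np.mean(np.arccos(cos))
```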

#### **4. Experimental Results and Discussions**

#### *4.1. Discussion on the Proposed RA2UN: Ablation Study*

In order to demonstrate the effectiveness of the SAA module, the CAA module and the boundary-aware constraint, we conduct an ablation study on the NTIRE2020 "Clean" track dataset. The results are summarized in Table 2. *R<sub>a</sub>* refers to the baseline network without any attention module, trained with the MRAE loss term *l<sub>m</sub>* alone. In Table 2, the baseline reaches MRAE = 0.03668.

**Table 2.** Ablation study on the final validation set of the NTIRE2020 "Clean" track dataset. We record the best MRAE values within 5.76 × 10<sup>5</sup> iterations.


**Spatial Augmented Attention Module.** Firstly, we add only the SAA module to the basic model in *R<sub>b</sub>* and observe a decline in MRAE. This implies that the SAA module is helpful for emphasizing the features in important regions and boosting the spatial feature representations. The results of *R<sub>e</sub>* and *R<sub>f</sub>* further prove the effectiveness of the SAA module when the CAA module is employed or the boundary-aware constraint is established, respectively.

**Channel Augmented Attention Module.** As elaborated in Section 2.3, a CAA module is developed to explore feature interdependencies among channels. Compared with the baseline, *R<sub>c</sub>* achieves a 7.42% decrease in the MRAE value. The reason may be that the CAA module can adaptively recalibrate channel-wise feature responses and realize a more powerful learning capability of the network. Compared with the results of *R<sub>b</sub>* and *R<sub>d</sub>*, the results of *R<sub>e</sub>* and *R<sub>g</sub>* further demonstrate the superiority of the CAA module, respectively.

**Boundary-aware Constraint.** In contrast to the baseline experiment *R<sub>a</sub>*, *R<sub>d</sub>* is optimized with both the MRAE loss term *l<sub>m</sub>* and the boundary-aware constraint *l<sub>b</sub>*. The result of *R<sub>d</sub>* indicates that the boundary-aware constraint is helpful for recovering more accurate HSIs. Furthermore, the results of *R<sub>f</sub>*, *R<sub>g</sub>* and *R<sub>h</sub>* all verify the effectiveness of the boundary-aware constraint. In particular, we obtain the best MRAE value with the two modules and the boundary-aware constraint combined in *R<sub>h</sub>*.

#### *4.2. Results of SR and Analysis*

In this study, we compare the proposed RA2UN against six existing methods: Arad [17], Galliani [23], Yan [26], Stiebel [30], HSCNN-R [34] and HRNet [37]. Among them, Arad is an early SR approach based on sparse recovery, while the others are based on CNNs. For a fair comparison, all models are retrained on the final training set, model selection is performed on the final validation set, and evaluation is carried out on the final testing set for the two tracks of the NTIRE2020 and NTIRE2018 datasets. The quantitative results on the final test sets of the NTIRE2020 and NTIRE2018 "Clean" and "Real World" tracks are listed in Tables 3 and 4. Since its camera response function is unknown, Arad is only suitable for the "Clean" tracks. It can be seen that our RA2UN achieves the best results under the MRAE, RMSE and SAM metrics on all tracks. For the ranking metric MRAE, the proposed method achieves relative reductions of 14.02%, 6.89%, 14.21% and 1.27% over the second best results on the corresponding datasets. In addition, we obtain the smallest SAM values, which indicates that our reconstructed HSIs have better spectral quality.

We also show visual comparisons of five selected bands for different example images of the final test set in Figures 5–8. The ground truth, our results and the error images are displayed from top to bottom. The error images are heat maps of the MRAE between the ground truth and the recovered HSI; the bluer the displayed color, the better the reconstructed spectrum. As can be seen, our approach yields better recovery results and has less reconstruction error than the other competitors. In addition, the spectral response curves of four selected spatial points are plotted in Figure 9. The red line is our result and the black one denotes the ground-truth spectrum; the rest are the results of the comparison methods. Clearly, the reconstructed results of RA2UN are much closer to the ground-truth spectra than the others.

**Figure 5.** Visual comparison of the five selected bands on the "ARAD\_HS\_0455" image from the final testing set of the NTIRE2020 "Clean" track. Best viewed on screen.

**Figure 6.** Visual comparison of the five selected bands on the "ARAD\_HS\_0451" image from the final testing set of the NTIRE2020 "Real World" track. Best viewed on screen.

**Table 3.** The quantitative results of final test set of NTIRE2020 "Clean" and "Real World" tracks. The best and second best results are **bold** and underlined.


**Figure 7.** Visual comparison of the five selected bands on the "BGU\_HS\_00265" image from the final testing set of the NTIRE2018 "Clean" track. Best viewed on screen.

**Table 4.** The quantitative results of final test set of NTIRE2018 "Clean" and "Real World" tracks. The best and second best results are **bold** and underlined.


**Figure 8.** Visual comparison of the five selected bands on the "BGU\_HS\_00259" image from the final testing set of the NTIRE2018 "Real World" track. Best viewed on screen.

**Figure 9.** Spectral response curves of several selected spatial points from the reconstructed HSIs. (**a**,**b**) are for the NTIRE2020 "Clean" and "Real World" tracks, respectively. (**c**,**d**) are for the NTIRE2018 "Clean" and "Real World" tracks, respectively.

#### **5. Conclusions**

In this paper, we propose a novel RA2UN network for SR. Concretely, the backbone of the RA2UN network consists of several DIRB blocks instead of paired plain convolutional units. To boost the spatial feature representations, a trainable SAA module is developed to selectively highlight the features in important regions. Furthermore, we present a novel CAA module that adaptively recalibrates channel-wise feature responses by exploiting first-order and second-order statistics to enhance the learning capacity of the network. To find a better solution, an additional boundary-aware constraint is built to guide the network to learn salient information at edge locations and recover more accurate details. Extensive experiments on challenging benchmarks demonstrate the superiority of our RA2UN network in terms of both numerical and visual measurements.

**Author Contributions:** J.L. and C.W. conceived and designed the study; W.X. performed the experiments; R.S. shared part of the experiment data; J.L. and Y.L. analyzed the data; C.W. and J.L. wrote the paper. R.S. and W.X. reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the National Key Research and Development Program of China under Grant (no. 2018AAA0102702), the National Nature Science Foundation of China (no. 61901343), the Science and Technology on Space Intelligent Control Laboratory (no. ZDSYS-2019-03), the China Postdoctoral Science Foundation (no. 2017M623124) and the China Postdoctoral Science Special Foundation (no. 2018T111019). The project was also partially supported by the Open Research Fund of CAS Key Laboratory of Spectral Imaging Technology (no. LSIT201924W) and the Fundamental Research Funds for the Central Universities JB190107. It was also partially supported by the National Nature Science Foundation of China (no. 61571345, 61671383, 91538101, 61501346 and 61502367), the Yangtze River Scholar Bonus Schemes (no. CJT160102), the Ten Thousand Talent Program, and the 111 project (B08038).

**Acknowledgments:** The authors would like to thank the anonymous reviewers and associate editor for their valuable comments and suggestions to improve the quality of the paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Linear and Non-Linear Models for Remotely-Sensed Hyperspectral Image Visualization**

#### **Radu-Mihai Coliban \*, Maria Marincaș, Cosmin Hatfaludi and Mihai Ivanovici**

Electronics and Computers Department, Transilvania University of Brașov, 500036 Brașov, Romania; maria.marincas@student.unitbv.ro (M.M.); cosmin.hatfaludi@student.unitbv.ro (C.H.); mihai.ivanovici@unitbv.ro (M.I.)

**\*** Correspondence: coliban.radu@unitbv.ro

Received: 29 June 2020; Accepted: 30 July 2020; Published: 2 August 2020

**Abstract:** The visualization of hyperspectral images still constitutes an open question and may have an important impact on the consequent analysis tasks. The existing techniques fall mainly in the following categories: band selection, PCA-based approaches, linear approaches, approaches based on digital image processing techniques and machine/deep learning methods. In this article, we propose the usage of a linear model for color formation, to emulate the image acquisition process by a digital color camera. We show how the choice of spectral sensitivity curves has an impact on the visualization of hyperspectral images as RGB color images. In addition, we propose a non-linear model based on an artificial neural network. We objectively assess the impact and the intrinsic quality of the hyperspectral image visualization from the point of view of the amount of information and complexity: (i) in order to objectively quantify the amount of information present in the image, we use the color entropy as a metric; (ii) for the evaluation of the complexity of the scene we employ the color fractal dimension, as an indication of detail and texture characteristics of the image. For comparison, we use several state-of-the-art visualization techniques. We present experimental results on visualization using both the linear and non-linear color formation models, in comparison with four other methods and report on the superiority of the proposed non-linear model.

**Keywords:** hyperspectral imaging; visualization; color formation models

#### **1. Introduction**

Hyperspectral imaging captures high-resolution spectral information covering the visible and the infrared wavelength spectra, and thus can provide a high-level understanding of the land cover objects [1]. It is used in a wide variety of applications, such as agriculture [2,3], forest management [4,5], geology [6,7] and military/defense applications [8,9]. Human interaction with hyperspectral images is very important for image interpretation and analysis as the visualization is very often the first step in an image analysis chain [10]. However, displaying a hyperspectral image poses the problem of reducing the large number of bands to just three color RGB channels in order for it to be rendered on a monitor, with the information being meaningful from a human point of view. In order to address this problem, a series of hyperspectral image visualization techniques have been developed, which can be included in the following broad categories: band selection, PCA-based approaches, linear approaches, approaches based on digital image processing techniques and machine/deep learning methods.

Band selection methods consist of a mechanism of picking three spectral channels from the hyperspectral image and mapping them as the red, green and blue channels in the color composite. Commercial geospatial image analysis software products such as ENVI [11] offer the possibility to visualize a hyperspectral image by manually selecting the three channels to be displayed. More complex unsupervised band selection approaches have been developed, based on the one-bit transform (1BT) [12], normalized information (NI) [13], linear prediction (LP) or the minimum endmember abundance covariance (MEAC) [14].

Another family of hyperspectral visualization techniques consists of methods that use principal component analysis (PCA) for dimension reduction of the data. A straightforward visualization technique is to map a set of three principal components (usually the first three) to the R, G and B channels of the color image [15]. Other methods use PCA as part of a more complex approach. For instance, the method presented in [16] is an interactive visualization technique based on PCA, followed by convex optimization. The authors of [17] obtain the color composite by fusing the spectral bands with saliency maps obtained before and after applying PCA. In [1], the image is first decomposed into two different layers (base and detail) through edge-preserving filtering; dimension reduction is achieved through PCA applied on the base layer and a weighted averaging-based fusion on the detail layer, with the final result being a combination of the two layers.

In the case of the linear method described in [18,19], the values of each output color channel are computed as projections of the hyperspectral pixel values on a vector basis. Examples of such bases include one consisting of a stretched version of the CIE 1964 color matching functions (CMFs), a constant-luma disc basis or an unwrapped cosine basis.

A set of hyperspectral image visualization approaches are based on digital image processing techniques. In [20], dimension reduction is achieved using multidimensional scaling, followed by detail enhancement using a Laplacian pyramid. The approach presented in [21] uses the averaging method in order to reduce the number of bands to 9; a decolorization algorithm is then applied on groups of three adjacent channels, which produces the final color image. The technique described in [22] is based on t-distributed stochastic neighbor embedding (t-SNE) and bilateral filtering. The method in [23] is also based on bilateral filtering, together with high dynamic range (HDR) processing techniques, while in [24] a pairwise-distances-analysis-driven visualization technique is described.

Machine/deep learning-based methods used for hyperspectral image visualization generally rely on a geographically-matched RGB image, either obtained through band selection or captured by a color image sensor. Approaches include constrained manifold learning [25], a method based on self-organizing maps [26], a moving least squares framework [10], a technique based on a multichannel pulse-coupled neural network [27] or methods based on convolutional neural networks (CNNs) [28,29].

In this paper, our goal is to produce natural-looking visualization results (i.e., depicting colors close to the real ones in the scene) with the highest possible amount of information and complexity. We propose the usage of a linear color formation model based on a widely-used linear model in colorimetry, based on spectral sensitivity curves. We study the impact on visualization of the choice of spectral sensitivity curves and the amount of overlapping between them, which induces the correlation between the three color channels used for visualization. Besides Gaussian functions, we use spectral sensitivity functions of digital camera sensors, the main idea behind the approach being to emulate the result of capturing the scene with a consumer-grade digital camera sensor instead of a hyperspectral one. Alternatively, we also developed a non-linear visualization method based on an artificial neural network, trained using the spectral signatures of a 24-sample color checker, also often used in colorimetry. By using the proposed approaches, we address the following question: what is the impact of the choice of visualization technique on the amount of information and complexity of a scene? The amount of information in a hyperspectral image should be preserved as much as possible after the visualization. The entropy is often used to measure the amount of information contained by a signal [30] and is one of the metrics that are used for the objective assessment of the visualization result [10,21,31]. The complexity of a scene is related to the texture and object characteristics preservation in the process of visualization. The color fractal dimension is a multi-scale measure capable of globally assessing the complexity of a color image, which can be useful to evaluate both the amount of detail and the object-level content in the image. We perform both a qualitative and a quantitative evaluation (using color entropy and color fractal dimension) of the described techniques in comparison with four other state-of-the-art methods, employing five widely used hyperspectral test images.

The rest of the paper is organized as follows: Section 2 presents the five hyperspectral images used in our experiments, the proposed approaches (both linear and non-linear) and the two embraced measures for the objective evaluation of the performance of the proposed approaches, Section 3 depicts the experimental results, Section 4 the discussion on the various aspects related to the proposed approaches, as well as possible further investigation paths, and Section 5 presents our conclusions.

#### **2. Data and Methods**

In this section we briefly describe the five hyperspectral images used in our experiments, the linear and non-linear models proposed and used to visualize the respective hyperspectral images, as well as the two quality metrics deployed to objectively evaluate the experimental results—the color entropy and the color fractal dimension.

#### *2.1. Hyperspectral Images*

The hyperspectral images used in our experiments are Pavia University, Pavia Centre, Indian Pines, SalinasA and Cuprite [32]. The first two were acquired by the ROSIS-3 sensor [33], while the other three were acquired by the AVIRIS sensor [34]. Figure 1 depicts RGB representations of the five test images.

Pavia University (Figure 1a) is a 610 × 340 image, with a resolution of 1.3 m. The image has 103 bands in the 430–860 nm range. According to the provided ground truth, the scene contains 9 classes of materials, both natural and man-made. Pavia Centre (Figure 1b) is a 1096 × 715, 102-band image with the same characteristics as Pavia University. In both cases, the 10th, 31st and 46th bands were used for generating the RGB representations [25].

The third test image, Indian Pines (Figure 1c), is a 145 × 145 image, having 224 spectral reflectance bands in the 400–2500 nm range with a 20 m resolution. The water absorption bands were removed, resulting in a total of 200 bands. The image contains 16 classes, mostly vegetation/crops.

SalinasA (Figure 1d), is an 86 × 83 sub-scene of the Salinas image. After removing the water absorption bands, the image has 204 spectral reflectance bands in the 400–2500 nm range with a spatial resolution of 3.7 m. This image exhibits 6 types of agricultural crops.

The fifth image, Cuprite (Figure 1e), is of size 512 × 614, with 188 spectral reflectance bands in the 400–2500 nm range remaining after removing noisy and water absorption channels. This image contains 14 types of minerals.

For the last three images, the RGB representations were generated by selecting the 6th, 17th, and 36th bands [25].


**Figure 1.** RGB representations of the five hyperspectral images used in our experiments. Top row: images acquired by the ROSIS-3 sensor; bottom row: images acquired by the AVIRIS sensor.

#### *2.2. Linear Color Formation Model*

Considering the formation process of an RGB image, we embraced a linear model given by Equation (1) [35]. In colorimetry, the linear model is used as a standard model for the color formation, but usually the XYZ coordinates of colors are used as an intermediate step before computing the RGB final color coordinates [36]. In the embraced approach, for a pixel at any position (*x*, *y*) in the resulting RGB color image, the scalar value on each channel of the RGB triplet is computed as the integral of the product between the spectral reflectance *R*(*λ*) of the (*x*, *y*) point in the real scene, the power spectral distribution *L*(*λ*) of the illuminant and the spectral sensitivity *C*(*λ*) of the imaging sensor:

$$I_k(x, y) = \int_{\lambda_{\min}}^{\lambda_{\max}} C_k(\lambda) L(\lambda) R_{(x,y)}(\lambda) \, d\lambda, \quad k = R, G, B \tag{1}$$

For the spectral sensitivity curves of the imaging sensor one can use theoretical or ideal curves, in order to simulate the image formation process. An alternative would be to use the actual sensitivity curves of a specific sensor, which can be measured according to the approach proposed in [35].

The illuminant can also be characterized, either by considering a standard illuminant or by measuring the real one by means of spectrophotometry. In colorimetry, a D65 illuminant is very often preferred, as it corresponds to a bright summer day light. For remotely-sensed images, one may know the illuminant as the direct sun light incident on the Earth's surface, as the position of the sun with respect to the position of the satellite is known. The use of the illuminant in the model from Equation (1) represents merely an unbalanced weighting of the three sensitivities, favoring the blue channel (lower wavelengths) over green and red. The classical D65 illuminant is depicted in Figure 2, in support of this statement. However, in this article we assume that the illuminant is constant across all wavelengths, as we are mostly interested in the effect of the image sensor sensitivity curves on the visualization process. Thus, the influence of the illuminant *L*(*λ*) in Equation (1) is basically null and it can be removed from the integral. Consequently, the equation reduces to the following:

$$I_k(x, y) = \int_{\lambda_{\min}}^{\lambda_{\max}} C_k(\lambda) R_{(x,y)}(\lambda) \, d\lambda, \quad k = R, G, B \tag{2}$$

**Figure 2.** The D65 illuminant.

This is the linear model we consider for the experimental results presented in Section 3. In order to apply Equation (2) on a hyperspectral image, we extract from it only the bands corresponding to the range [*λ<sub>min</sub>*, *λ<sub>max</sub>*] covered by the sensitivity curves, which corresponds to the visible spectrum. This is the main difference between the proposed model and the linear model presented in [18], which uses all of the bands of the hyperspectral image, with the weighting functions stretched in order to cover the entire range of wavelengths of the hyperspectral image. Since both the sensitivity functions and the reflectances are discrete, an interpolation of the pixel values of the hyperspectral image is done in order to match the wavelengths and number of values of the sensitivity functions.
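A minimal NumPy sketch of this procedure, assuming reflectance values in the cube and a simple max normalization for display (the display scaling is not specified in the text), is:

```python
import numpy as np

def visualize_linear(cube, band_wl, sens_wl, sens_rgb):
    """Render a hyperspectral cube as RGB via the discrete form of Eq. (2).

    cube:     (H, W, B) reflectance cube
    band_wl:  (B,) band wavelengths of the cube, in nm, increasing
    sens_wl:  (S,) wavelengths at which the sensitivities are sampled, in nm
    sens_rgb: (S, 3) sensitivity curves C_R, C_G, C_B
    """
    h, w, b = cube.shape
    flat = cube.reshape(-1, b)
    # Interpolate each pixel spectrum onto the sensitivity wavelengths
    interp = np.stack([np.interp(sens_wl, band_wl, s) for s in flat])
    rgb = interp @ sens_rgb  # discrete approximation of the integral
    rgb /= rgb.max()         # assumed display normalization
    return rgb.reshape(h, w, 3)
```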

Given the embraced linear model and sensitivity functions, our study is limited to the visible spectrum. The extension beyond the visible range could be done either by (i) stretching the sensitivity functions [18] or (ii) adding a fourth color channel, given that one of the latest trends in color display technologies is to add a fourth channel (such as a yellow channel) besides the RGB primaries [37]. However, both approaches would lead to unnatural-looking visualization results, which is not the goal of this study.

#### *2.3. Spectral Sensitivity Functions*

As the main objective of visualization is very often the interpretation of the image by humans, we start by considering the spectral sensitivity of the human visual system, which is actually the paradigm for RGB-based color image acquisition and display systems. Figure 3 presents the spectral sensitivities of the human cone cells in the retina, based on the data from [38]. The spectral sensitivity describes the relative efficiency of color detection as a function of the wavelength of the signal. These spectral sensitivities fall into three categories, depending on the peak value: short (S), medium (M) and long (L). The cone cells of the S group are called *β*, with a range that corresponds to the perception of the blue color. Similarly, the range of the M group (*γ* cells) corresponds to green and that of the L group (*ρ*) corresponds to red.

The RGB color digital cameras are characterized by their sensor spectral sensitivity functions, which define the performance of the respective system. The sensor sensitivity functions for consumer-grade cameras have a similar shape to the spectral sensitivities of human cone cells, since the aim of these products is to capture a representation of the scene that is as accurate as possible from the point of view of human perception. The five digital camera sensor spectral sensitivity functions used in our experiments, taken from [35], are presented in Figure 4.

Starting from the spectral sensitivities of the Canon 5D camera sensor, for our experiments we modeled a set of spectral sensitivities consisting of three Gaussian functions with the mean equal to the wavelength corresponding to the three peaks in Figure 4a and with increasing standard deviation. The functions are depicted in Figure 5. Figure 5a depicts Gaussian functions with a standard deviation of 0, which are essentially unit impulses; in this case, the linear model reduces to a band selection approach (BS). The standard deviation is gradually increased in the next graphs, resulting in an increasing degree of overlapping between the three functions: *no* overlap (NOL), *small* overlap (SOL), *medium* overlap (MOL) and *high* overlap (HOL). In this way, we emulate the various levels of correlation between the three RGB color channels of the considered sensor model—from zero correlation, corresponding to a complete separation between the color channels for an ideal imaging sensor, to high overlap, corresponding to a low-performance imaging sensor.

**Figure 3.** Spectral sensitivities of human cone cells.

**Figure 4.** Spectral sensitivity functions for 5 digital cameras.

**Figure 5.** Gaussian spectral sensitivity functions based on the functions of the Canon 5D camera from Figure 4a.
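The construction of the Gaussian sensitivity sets described above can be sketched as follows; the peak wavelengths and the example standard deviation are placeholders, since the actual peaks are read off the Canon 5D curves in Figure 4a and the σ values for NOL–HOL are not listed numerically.

```python
import numpy as np

def gaussian_sensitivities(wl, peaks=(600.0, 530.0, 460.0), sigma=20.0):
    """Three Gaussian sensitivity curves with assumed R, G, B peaks (nm).
    sigma = 0 degenerates to unit impulses, i.e., plain band selection (BS);
    increasing sigma increases the overlap (NOL -> SOL -> MOL -> HOL)."""
    if sigma == 0:
        curves = np.zeros((len(wl), 3))
        for k, p in enumerate(peaks):
            curves[np.argmin(np.abs(wl - p)), k] = 1.0  # unit impulse
        return curves
    curves = np.exp(-(wl[:, None] - np.asarray(peaks)[None, :]) ** 2
                    / (2.0 * sigma ** 2))
    return curves / curves.sum(axis=0, keepdims=True)  # equal-area curves

wl = np.arange(400.0, 701.0)                   # visible range, 1 nm steps
sens = gaussian_sensitivities(wl, sigma=40.0)  # e.g., a medium-overlap case
```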

#### *2.4. Non-Linear Color Formation Model*

The non-linear color formation model that we propose is based on an Artificial Neural Network (ANN) [39], with the input feature vector consisting of a spectral reflectance curve and the output being the corresponding RGB value. The architecture of the fully connected 5-layer network is depicted in Figure 6. The network uses the Exponential Linear Unit (ELU) [40] as an activation function instead of the more standard Rectified Linear Unit (ReLU), in order to overcome the problem of having a multitude of deactivated neurons (also referred to as "dying neurons" [41]). The implementation was done using the PyTorch library [42].

For the supervised training of the ANN, we chose to use a standard set of 24 colors widely used in colorimetry: the McBeth color chart [43], depicted in Figure 7. In Figure 8 we show the spectral reflectance curves of the color patches for each row in the McBeth color chart, with their original designations in the legend of the plots. For each color, the RGB triplet is known and we used the measurements provided by [44]. The wavelength range covered by the reflectance curves is 380–780 nm. The reason for choosing this McBeth standard color set is twofold: (i) the spectral reflectance curves of the colors are specified regardless of the illuminant, therefore they can be used as references in both ideal and real conditions; and (ii) this particular color set was determined independently from the domain of remote sensing, thus it can be seen as a neutral set of colors compared to existing data sets of material spectral signatures, such as the ASTER spectral library [45]. In addition, the chosen color set does not require the mapping between the spectral curves and corresponding RGB colors. The training of the ANN is done via the classical backpropagation algorithm, with the mean squared error (MSE) being used as a cost function and Adam used as the optimizer.

As in the case of the linear model, only the bands covered by the spectral reflectance curves of the McBeth color set are used from the hyperspectral image. Concretely, the common range between the Pavia University image and the McBeth curves is 430–780 nm. This range is covered by 83 bands of the image and 71 values of the spectral reflectance curves. The 83 bands of the image are reduced to 71 through interpolation, so as to match the McBeth spectral reflectance curves, which gives the size of the input feature vector in Figure 6.

After training with the 24 reflectance curves, the network is applied on a pixel-by-pixel basis; thus, for each pixel in the input image (a vector of 71 values in the case of Pavia University), the 3 output values (R, G and B) are obtained and placed in the corresponding position in the visualization result.
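A PyTorch sketch of this pipeline is shown below; the hidden layer widths are assumptions (only the 71-value input, the RGB output, the five fully connected layers and the ELU activation are given), and the random cube stands in for an interpolated hyperspectral image.

```python
import torch
import torch.nn as nn

class SpectrumToRGB(nn.Module):
    """Fully connected 5-layer network mapping a reflectance curve to RGB."""
    def __init__(self, n_in=71, hidden=(128, 96, 48, 24)):  # widths assumed
        super().__init__()
        dims = (n_in, *hidden, 3)
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                layers.append(nn.ELU())  # ELU avoids "dying" neurons
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Pixel-by-pixel application: flatten the cube to (H*W, 71) spectra,
# predict RGB triplets and reshape back into an (H, W, 3) image.
model = SpectrumToRGB()
cube = torch.rand(610, 340, 71)  # stand-in for the interpolated Pavia University
with torch.no_grad():
    rgb = model(cube.reshape(-1, 71)).reshape(610, 340, 3)
```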

**Figure 6.** Architecture of the ANN.

**Figure 7.** The McBeth color chart.

**Figure 8.** Spectral reflectance curves of the color patches in each row of the McBeth color chart.

#### *2.5. Quality Metrics*

A commonly used objective quality metric for hyperspectral image visualization is the entropy, which is a measure of the degree of information preservation in the resulting image [1]. The most common definition of entropy is the Shannon entropy (see Equation (3)) which measures the average level of information present in a signal with *N* quantization levels [30].

$$H = -\sum_{i=1}^{N} p_i \log_2 p_i \tag{3}$$

where *p<sub>i</sub>* represents the probability of finding a certain level in the signal (or color *i*, in the context of color images). From the Shannon definition, various other definitions were developed: Rényi entropy (as a generalization), Hartley entropy, collision entropy and min-entropy, or the Kolmogorov entropy, which is another generic definition of entropy [46]. The original Shannon entropy was embraced by Haralick as one of his thirteen features proposed for texture characterization [47]. In our experiments, we use the extension of the entropy to color images from [48].
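As a simplified stand-in for the color extension of [48], the following sketch computes the Shannon entropy of Equation (3) over a uniformly quantized RGB color distribution; the quantization step is an assumption of this illustration, not of the metric used in the paper.

```python
import numpy as np

def color_entropy(rgb, bins=8):
    """Shannon entropy, Eq. (3), over quantized colors of an (H, W, 3) image
    with values in [0, 1]."""
    img = (np.clip(rgb, 0.0, 1.0) * (bins - 1)).astype(int)  # quantize channels
    codes = (img[..., 0] * bins + img[..., 1]) * bins + img[..., 2]
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
```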

Additionally, we use the fractal dimension from fractal geometry [49] to assess the complexity of the color images resulting from the process of hyperspectral image visualization. The fractal dimension, also called similarity dimension, is a measure of the variations, irregularities or *wiggliness* of a fractal object [50]. This multi-scale measure is often used in practice for the discrimination between various signals or patterns exhibiting fractal properties, such as textures [51]. In [52] the fractal dimension was linked to the visual complexity of a color image, more specifically to the perceived beauty of visual art. Consequently, we use it in this article to objectively assess both the color image content at multiple scales and the appeal of the visualization from a human perception point of view.

The theoretical fractal dimension is the Hausdorff dimension [53], which lies in the interval [*E*, *E* + 1], where *E* is the topological dimension of the object (thus, for gray-scale images the fractal dimension lies between 2 and 3). Because it was defined for continuous objects, equivalent fractal dimension estimates were defined and used: the probability measure [54,55], the Minkowski or box-counting dimension [53], the *δ*-parallel body method [56], the gliding box-counting algorithm [57], etc. The fractal dimension estimation was extended to the color image domain, as in the marginal color analysis [58] or the fully vectorial probabilistic box-counting [59]. More recent attempts at defining the fractal dimension for color images exist [60,61]. For an RGB color image, the estimated color fractal dimension should lie in the interval [2, 5] [59].

In our experiments, we used the probabilistic box-counting approach defined for color images in [59] for the estimation of the fractal dimension of the visualization results. The classical box-counting method consists of covering the image with grids at different scales and counting the number of boxes that cover the image pixels in each grid. The fractal dimension *FD* is then computed as [62]:

$$FD = \lim_{r \to 0} \frac{\log N_r}{\log (1/r)} \tag{4}$$

where *Nr* is the number of boxes and *r* is the scale.

*FD* is defined and computed for binary and grayscale images (considering the *z* = *f*(*x*, *y*) image model, where *z* is the luminance and *x* and *y* are the spatial coordinates). The extension of *FD* to color images, the color fractal dimension (*CFD*), is defined by considering the color image as a surface in a 5-dimensional hyperspace (*RGBxy*) [59] and 5D hyper-boxes instead of 3D regular ones. For the experimental results presented in Section 3, the stable *CFD* estimator proposed in [63] was used, which minimizes the variance of the nine regression line estimators used in the process of fractal dimension estimation. See [64] for reference color fractal images and the Matlab implementation of the baseline CFD estimation approach.
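To illustrate the principle behind Equation (4) (not the full 5D color estimator of [59,63]), a minimal box-counting sketch for a 2D binary image is:

```python
import numpy as np

def box_counting_fd(binary, scales=(2, 4, 8, 16, 32)):
    """Classical box-counting estimate of Eq. (4) for a 2D binary image:
    count occupied r x r boxes at each scale, then fit log N_r vs log(1/r)."""
    h, w = binary.shape
    log_n, log_inv_r = [], []
    for r in scales:
        hh, ww = h - h % r, w - w % r   # crop to a multiple of the box size
        blocks = binary[:hh, :ww].reshape(hh // r, r, ww // r, r)
        n_r = np.sum(blocks.any(axis=(1, 3)))  # boxes containing any set pixel
        log_n.append(np.log(n_r))
        log_inv_r.append(np.log(1.0 / r))
    slope, _ = np.polyfit(log_inv_r, log_n, 1)  # FD is the regression slope
    return slope
```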

#### **3. Experimental Results**

Figures 9–13 depict the visualization results for the five hyperspectral test images presented in Section 2. Each figure is organized as follows: on the top row, the results obtained with the proposed linear approach using the Gaussian functions (Figure 5); on the middle row, the results obtained with the linear approach using camera spectral sensitivity functions (Figure 4); on the bottom row, the results obtained using the proposed ANN approach (Section 2.4), the approach based on the PCA to RGB mapping [15], the linear approach based on the stretched color matching functions (CMF) [18] and two recent approaches, constrained manifold learning (CML) [25] and decolorization-based hyperspectral visualization (DHV) [21].

For the Gaussian approaches, it can be noticed that, as the degree of overlap between the three functions increases, the visualization results tend to come closer to grayscale images, as expected. In the case of the camera functions, the differences between the results are not significant, showing that the choice of a particular camera model over another does not have a large impact on the visualization results. Moreover, there is no significant difference in the visualization results between the two cases of the proposed linear approach. The proposed ANN approach obtains satisfying results in terms of both color and contrast, while the other depicted methods, particularly PCA and DHV, do not tend to give natural-looking results.

**Figure 9.** Experimental results on the Pavia University image. (**a**) BS. (**b**) NOL. (**c**) SOL. (**d**) MOL. (**e**) HOL. (**f**) Canon 5D. (**g**) Canon 1D. (**h**) Hasselblad H2. (**i**) Nikon D3X. (**j**) Nikon D50. (**k**) ANN. (**l**) PCA [15]. (**m**) CMF [18]. (**n**) CML [25]. (**o**) DHV [21].

**Figure 10.** Experimental results on the Pavia Centre image. (**a**) BS. (**b**) NOL. (**c**) SOL. (**d**) MOL. (**e**) HOL. (**f**) Canon 5D. (**g**) Canon 1D. (**h**) Hasselblad H2. (**i**) Nikon D3X. (**j**) Nikon D50. (**k**) ANN. (**l**) PCA [15]. (**m**) CMF [18]. (**n**) CML [25]. (**o**) DHV [21].

**Figure 11.** Experimental results on the Indian Pines image. (**a**) BS. (**b**) NOL. (**c**) SOL. (**d**) MOL. (**e**) HOL. (**f**) Canon 5D. (**g**) Canon 1D. (**h**) Hasselblad H2. (**i**) Nikon D3X. (**j**) Nikon D50. (**k**) ANN. (**l**) PCA [15]. (**m**) CMF [18]. (**n**) CML [25]. (**o**) DHV [21].

The corresponding values of the color entropy *H* and color fractal dimension *CFD* are given in Tables 1 and 2. One may note that, for the set of Linear Gaussian approaches, both the color entropy and the color fractal dimension are maximal for band selection, with one exception for the SalinasA image, and both decrease as the correlation between the three Gaussian functions increases, since the color content tends to gray-scale and the complexity thus diminishes. For the set of Linear Camera approaches, the two quality measures have similar values; basically, there is no noticeable difference in the visualization results. For both the Linear Gaussian and Linear Camera approaches, the two quality measures exhibit relatively modest values, which indicates that the visualization result neither contains the most information nor is the most complex. The highest amount of information, measured through the color entropy, is obtained using the proposed non-linear ANN approach for the Pavia University and Pavia Centre images, the PCA approach for the Indian Pines and Cuprite images, and DHV for the SalinasA image. For the three latter images, the proposed ANN-based non-linear approach obtains the third (Indian Pines, Cuprite) and second (SalinasA) best visualization from the point of view of entropy. The highest complexity, measured through the color fractal dimension, is obtained when the hyperspectral images are visualized using the non-linear ANN approach, with the exception of the Cuprite image, for which the PCA approach proves superior. The main advantage of the ANN method is that basically any out-of-the-box artificial neural network model can be used, changing only the input layer to match the hyperspectral image under analysis. Table 3 lists, for each visualization method, the independent data used in addition to the hyperspectral images. In the case of the CML approach, the geographically-matched RGB image was obtained through band selection from the original image; the images used are depicted in Figure 1, while the specific bands chosen are listed in Section 2.1.

**Figure 12.** Experimental results on the SalinasA image. (**a**) BS. (**b**) NOL. (**c**) SOL. (**d**) MOL. (**e**) HOL. (**f**) Canon 5D. (**g**) Canon 1D. (**h**) Hasselblad H2. (**i**) Nikon D3X. (**j**) Nikon D50. (**k**) ANN. (**l**) PCA [15]. (**m**) CMF [18]. (**n**) CML [25]. (**o**) DHV [21].

**Figure 13.** Experimental results on the Cuprite image. (**a**) BS. (**b**) NOL. (**c**) SOL. (**d**) MOL. (**e**) HOL. (**f**) Canon 5D. (**g**) Canon 1D. (**h**) Hasselblad H2. (**i**) Nikon D3X. (**j**) Nikon D50. (**k**) ANN. (**l**) PCA [15]. (**m**) CMF [18]. (**n**) CML [25]. (**o**) DHV [21].


**Table 1.** Entropy and fractal dimension for the visualization results in Figures 9 and 10. The values in bold represent the highest values for the respective image.

**Table 2.** Entropy and fractal dimension for the visualization results in Figures 11–13. The values in bold represent the highest values for the respective image.

**Table 3.** Independent data used by the methods under comparison.


#### **4. Discussion**

First of all, other measures can be considered for the assessment of the complexity of color images, like the Naive Complexity Measure [65]. For the evaluation of the information present in a color image, one could use the Pearson correlation coefficient between the color channels of the resulting RGB color image [63] as an indication of the overlapping between the information on the three RGB color channels. In the presence of a reference or ground truth, similarity indexes like Structural Similarity Index Measure [66] can be used. Nevertheless, the ultimate criteria for the evaluation of the performance of the hyperspectral image visualization approaches are dictated by the specific application and its objectives.

The best experimental results were obtained using the proposed non-linear ANN-based model, despite the extremely reduced training set—only 24 spectral reflectance curves and the corresponding RGB triplets. One should investigate the effects of increasing the size of the training set, in order to assess and reduce the overfitting effect [67] which may occur in our experiments. Extending the training set implies the realization of more color references, characterized both by their hyperspectral signatures (e.g., by using a spectrophotometer) and RGB triplets (e.g., by using a calibrated digital color image acquisition system). The non-linear model itself could be developed further by considering the wavelengths outside the visible range and taking into account the possibility to display the image with more than 3 color channels, including various choices for the mapping between the hyperspectral signatures and RGB triplets.

The linear models used to obtain the experimental results can be useful in understanding both the capabilities and limitations of current or new imaging sensors. The full characterization of the imaging sensors is mandatory in order to predict the imaging process outcome.

#### **5. Conclusions**

In this article, we proposed the usage of a linear model for color formation based on spectral sensitivity curves in order to visualize hyperspectral images by rendering them as RGB color images. We deployed both Gaussian and real digital camera sensitivity curves and showed that, as the correlation between the RGB color channels increases, similar to the overlapping of the curves for both the human visual system and commercially-available digital cameras, the resulting color images tend toward gray-scale and exhibit both a smaller amount of information and lower complexity. We also proposed a non-linear color formation model based on an artificial neural network, trained with the colors of the McBeth color chart widely used in colorimetry. The training was supervised, as the 24 colors of the McBeth chart are specified both by their spectral reflectance curves and RGB triplets. Given their construction, both proposed linear and non-linear approaches generate color images with natural colors.

For the objective assessment of the quality of the hyperspectral image visualization results, we deployed the widely-used measure of entropy, as it is an indicator of the amount of information contained by a signal. We also proposed the usage of the fractal dimension, which is a multi-scale measure usually employed to assess the complexity of color images, but also their beauty and appeal according to some studies. The fractal dimension is an indicator of the amount of details present in the image along multiple analysis scales.

In our experiments, we compared the proposed approaches with four other visualization techniques, using five remotely-sensed hyperspectral images. In the case of the Gaussian functions, our results show that, as the degree of overlapping between functions increases, the visualization results come closer to a grayscale image. With regards to the camera sensitivity functions, we show that the specific choice of a camera model does not have a significant impact on the visualization result. Our experiments also show that the proposed non-linear model achieves the best visualization results from the point of view of the complexity of the resulting color images. We envisage further development by investigating the possible overfitting effect occurring in the case of the ANN approach, extending the approach beyond the visible range, and using a fourth color channel. We underline that, in choosing the most appropriate visualization technique, one may need to consider three important aspects: the naturalness of the resulting colors, the amount of information present in the resulting color image, and the complexity along multiple scales.

**Author Contributions:** Idea and methodology, M.I. and R.-M.C.; software, M.M., C.H. and R.-M.C.; investigation, M.I. and R.-M.C.; writing—original draft preparation, R.-M.C., M.M. and M.I.; writing—review and editing, R.-M.C.; supervision, M.I. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Generative Adversarial Network Synthesis of Hyperspectral Vegetation Data**

**Andrew Hennessy \*, Kenneth Clarke and Megan Lewis**

School of Biological Sciences, The University of Adelaide, Adelaide 5000, Australia; kenneth.clarke@adelaide.edu.au (K.C.); megan.lewis@adelaide.edu.au (M.L.)

**\*** Correspondence: andrew.hennessy@adelaide.edu.au

Received: 23 April 2021; Accepted: 3 June 2021; Published: 8 June 2021

**Abstract:** New, accurate and generalizable methods are required to transform the ever-increasing amount of raw hyperspectral data into actionable knowledge for applications such as environmental monitoring and precision agriculture. Here, we apply advances in generative deep learning models to produce realistic synthetic hyperspectral vegetation data, whilst maintaining class relationships. Specifically, a Generative Adversarial Network (GAN) is trained using the Cramér distance on two vegetation hyperspectral datasets, demonstrating the ability to approximate the distribution of the training samples. Evaluation of the synthetic spectra shows that they respect many of the statistical properties of the real spectra, conforming well to the sampled distributions of all real classes. An augmented dataset consisting of synthetic and original samples was used to train multiple classifiers, with increases in classification accuracy seen under almost all circumstances. The improvements in classification accuracy ranged from a modest 0.16% for the Indian Pines dataset to a substantial 7.0% for the New Zealand vegetation data. Selection of synthetic samples from sparse or outlying regions of the feature space of real spectral classes demonstrated increased discriminatory power over those from more central portions of the distributions.

**Keywords:** hyperspectral; vegetation; generative adversarial network; deep learning; data augmentation; classification

#### **1. Introduction**

Hyperspectral (HS) Earth observation has increased in popularity in recent years, driven by advancements in sensing technologies, increased data availability, research and institutional knowledge. The big data revolution of the 2000s and significant advances in data processing and machine learning (ML) have seen hyperspectral approaches used in a broad spectrum of applications, with methods of data acquisition covering wide-ranging spatial and temporal resolutions.

For researchers aiming to classify or evaluate vegetation, hyperspectral remote sensing offers rich spectral information detailing the influences of pigments, biochemistry, structure and water absorption, whilst being non-destructive, rapid and repeatable. These phenotypical variations imprint a kind of 'spectral fingerprint' that allows hyperspectral data to differentiate vegetation at taxonomic units ranging from broad ecological types to species and cultivars [1]. Acquiring labelled hyperspectral measurements of vegetation is expensive and time-consuming, resulting in limited training datasets for supervised classification techniques. This has been partly alleviated by multi/hyperspectral data-sharing portals such as ECOSTRESS [2] and SPECCHIO [3]. Supervised classification of such high-dimensional data has had to rely on feature reduction or selection techniques in order to overcome small training sample sizes and avoid the curse of dimensionality, also called the 'Hughes phenomenon'. Additionally, because ML generally requires large training datasets, there has been limited success in leveraging recent ML progress for the classification of HS data, with models often overfitting and generalizing poorly.


Data augmentation (DA), the process of artificially increasing training sample size, has been employed by the ML community when small or imbalanced datasets are encountered. DA methods vary from simple pre-processing steps such as mirroring, rotating or scaling of images [4] to more complicated simulations [5,6] and generative models [7,8]. DA for time-series or 1D data consists of adding noise, or of methods such as time dilation or cut-and-paste [9]. However, when dealing with non-spatial HS data these methods are unsuitable, as it is important to maintain reflectance and waveband relationships in order to ensure class labels are preserved. DA methods such as physics-based models [10] and noise injection [11,12] have been applied to HS data. Whilst successful, these methods are either simplifications of reality that require domain-dependent knowledge of the target features (in the case of physical models), or rely upon random noise, potentially producing samples that only approximate the true distribution.

Generative adversarial networks (GANs) have been used successfully in many fields as a DA technique, often for images, timeseries/1D [13], sound synthesis [14], or anonymising medical data [15]. GANs consist of two neural networks trained in an adversarial manner. The generator (G) network produces synthetic copies mimicking the real training data while the discriminator (D) network attempts to identify whether a sample was from the real dataset or produced by G. The D is scored on its accuracy in identifying real from synthetic data, before passing feedback to G allowing it to learn how best to fool D and improve generation of synthetic samples [16].

The use of GANs to generate synthetic HS data is a relatively new field of study. GANs of varying architectures, ranging from 1D spectral [17–19] to 2D [20] and 3D spectral-spatial [21], with differing data embeddings including individual spectra, HS images, and principal components, have been examined. All have demonstrated the ability to generate synthesized hyperspectral data and to improve classification outcomes to varying degrees, whether through DA or conversion of the GAN's discriminator into a classifier. However, issues such as training instability and mode collapse, a common form of overfitting, remain prevalent.

The work presented in this paper applies advances in generative models to overcome limitations previously encountered by Audebert et al. [17] to produce more realistic synthetic HS vegetation data and eliminate reliance on PCA to reduce dimensionality and stabilise training. Specifically, we train a GAN using the Cramér distance on two vegetation HS datasets, demonstrating the ability to approximate the distribution of the training samples while encountering no evidence of mode collapse. We go on to demonstrate the use of these synthetic samples for data augmentation and reduced under-sampling of class distributions, as well as establishing a method to quantify the potential classification power of a synthetic sample by evaluating its relative position in feature space.

#### *Generative Adversarial Networks—Background*

GANs are a type of generative machine learning algorithm known as an implicit density model. This type of model does not directly estimate or fit the data distribution but rather generates its own data, which is used to update the model. Since first being introduced by Goodfellow et al. [16], GANs have become a dominant field of study within ML/DL, with numerous variants, and have been described as "the most interesting idea in the last 10 years in machine learning" by a leading AI researcher [22]. Although sometimes utilizing non-neural-network architectures, GANs generally consist of two neural networks, sometimes more, that compete against each other in a minimax game, in which each network attempts to reduce its own "cost" or error as much as possible. Training proceeds in an adversarial manner: the discriminator is trained to maximize the probability of correctly labelling whether a sample originates from the original data distribution or has been produced by the generator, while the generator is simultaneously trained to minimize the probability that the discriminator labels its samples correctly [23].
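For reference, this game is conventionally written as the minimax objective of Goodfellow et al. [16] (standard notation, not reproduced from this paper):

$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x\sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_{z}(z)}[\log(1-D(G(z)))]$$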

As a result, the training of GANs is notoriously unstable, with issues such as the discriminator's cost quickly becoming zero and providing no gradient to update the generator, or the generator converging onto a small subset of samples that regularly fool the discriminator, a common issue known as mode collapse. Considerable research has gone into attempting to alleviate these issues, improve training stability and improve quality of synthetic samples, so much so that during 2018 more than one GAN-related paper was being released every hour [23].

Unlike non-adversarial neural networks, the loss function of a GAN does not converge to an optimal state, making the raw loss values uninformative for evaluating the performance of the model. To alleviate this problem, the Wasserstein GAN (WGAN) was developed to use the Wasserstein distance, also known as the Earth Mover's (EM) distance, which results in an informative loss function for both D and G that converges to a minimum [24]. Rather than D using a sigmoid activation in its final layer to produce a binary classification of real or fake, the WGAN critic approximates the Wasserstein distance as a regression task, estimating the distance between the real and fake distributions. Because the weight clipping used to constrain the original WGAN critic commonly caused vanishing or exploding gradients, WGANs were improved by replacing weight clipping with a gradient penalty (GP) [25], further improving training stability.
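As an illustration of the gradient penalty, the sketch below implements the WGAN-GP penalty term in modern tf.keras style. This is not the authors' TensorFlow 1.8 code, and `gp_weight = 10` is the value suggested in [25], not a setting taken from this paper:

```python
import tensorflow as tf

def gradient_penalty(critic, real, fake, gp_weight=10.0):
    """WGAN-GP penalty [25]: push the critic's gradient norm towards 1
    on points interpolated between real and generated spectra."""
    batch = tf.shape(real)[0]
    eps = tf.random.uniform([batch, 1], 0.0, 1.0)   # per-sample mix factor
    interp = eps * real + (1.0 - eps) * fake
    with tf.GradientTape() as tape:
        tape.watch(interp)
        score = critic(interp, training=True)
    grads = tape.gradient(score, interp)
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=-1) + 1e-12)
    return gp_weight * tf.reduce_mean(tf.square(norm - 1.0))
```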

#### **2. Experimental Design**

Here, we implement the CramérGAN, a GAN variant using the Cramér/energy distance as the D's loss, which reportedly offers improved training stability and increased generative diversity over WGANs [26]. This choice was informed by our preliminary testing of WGAN and WGAN-GP, which produced noisy synthesized samples with lower standard deviations, in addition to the learning instability and poor convergence previously reported for WGAN, which may explain the mode collapse encountered by Audebert et al. [17].
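For context, the energy distance that underlies the Cramér GAN objective [26] can be written (in standard notation, not taken from this paper) as

$$\mathcal{E}(P, Q) = 2\,\mathbb{E}\|X - Y\| - \mathbb{E}\|X - X'\| - \mathbb{E}\|Y - Y'\|$$

where X, X′ are independent samples from the real distribution P and Y, Y′ are independent samples from the generator's distribution Q; the Cramér GAN applies this distance to critic-transformed samples rather than to the raw spectra.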

Individual models were trained for each hyperspectral class, for a total of 38 models. Each model was trained for 50,000 epochs, at a ratio of 5:1 (five training iterations of D for every one of G), using the Adam optimiser at a learning rate of 0.0001, with beta1 = 0.5 and beta2 = 0.9. The latent noise vector was generated from a normal distribution with length 100. The G consists of two fully connected dense layers followed by two convolution layers, all using the ReLU activation function except for the final convolution layer, which uses sigmoid activation. The final layer of G reshapes the output to a 2D array of shape (batch size × number of bands). A similar, but reversed, architecture was used for the D: two convolution layers feeding a flatten layer, followed by two fully connected dense layers. All layers of D used Leaky ReLU activation except the final layer, which used a linear function (Figure 1) (Appendix A Tables A1 and A2).
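A minimal tf.keras sketch of this layout is given below. Only the layer ordering and activations follow the text; the filter counts, kernel sizes and hidden widths are placeholders (the authors' exact parameters are in Appendix A), and the TF 2 Keras API used here is not the authors' TensorFlow 1.8 implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

N_BANDS = 540     # NZ spectra; 200 for Indian Pines
LATENT_DIM = 100  # latent noise vector drawn from a normal distribution

def build_generator():
    # Two dense layers, then two 1D convolutions; ReLU throughout except
    # the final convolution, which uses sigmoid (reflectance lies in [0, 1]).
    return tf.keras.Sequential([
        layers.Dense(128, activation="relu", input_shape=(LATENT_DIM,)),
        layers.Dense(N_BANDS, activation="relu"),
        layers.Reshape((N_BANDS, 1)),
        layers.Conv1D(16, 5, padding="same", activation="relu"),
        layers.Conv1D(1, 5, padding="same", activation="sigmoid"),
        layers.Reshape((N_BANDS,)),  # output: (batch size, number of bands)
    ])

def build_discriminator():
    # The reverse: two convolutions, flatten, two dense layers; Leaky ReLU
    # everywhere except the final linear layer (an unbounded critic score).
    return tf.keras.Sequential([
        layers.Reshape((N_BANDS, 1), input_shape=(N_BANDS,)),
        layers.Conv1D(16, 5, padding="same"),
        layers.LeakyReLU(),
        layers.Conv1D(32, 5, padding="same"),
        layers.LeakyReLU(),
        layers.Flatten(),
        layers.Dense(64),
        layers.LeakyReLU(),
        layers.Dense(1),  # linear activation
    ])
```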

**Figure 1.** Schematic of Generative Adversarial Network (GAN) architecture.

The three classification models (SVM, RF and NN) were evaluated in four permutations: trained on real data and evaluated on real data (real–real); trained on real data and evaluated on synthetic data (real–synthetic); trained on synthetic data and evaluated on synthetic data (synthetic–synthetic); and trained on synthetic data and evaluated on real data (synthetic–real). Each dataset was split into training and testing subsets with 10 times cross-validation. All synthetic datasets were restricted to the same number of samples per class as the real datasets unless specified otherwise. The real–real experiments were expected to have the highest accuracy and to offer a baseline of comparison for the synthetic samples. If the accuracy of real–synthetic is significantly higher than that of real–real, this potentially indicates that the generator has not fully learned the true distribution of the training samples. Conversely, significantly lower accuracy could mean the synthetic samples fall outside the true distribution and are an unrealistic representation of the spectra.
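A single-split sketch of these four permutations is shown below, using an SVM only; the paper repeats this with 10 cross-validation runs and with all three classifier types, and the array names here are hypothetical:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X_real, y_real: measured spectra and class labels (hypothetical names);
# X_syn, y_syn: GAN-generated spectra carrying the same class labels.
def evaluate_permutations(X_real, y_real, X_syn, y_syn, seed=0):
    Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(
        X_real, y_real, test_size=0.2, stratify=y_real, random_state=seed)
    Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
        X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=seed)
    runs = {
        "real-real":           (Xr_tr, yr_tr, Xr_te, yr_te),
        "real-synthetic":      (Xr_tr, yr_tr, Xs_te, ys_te),
        "synthetic-synthetic": (Xs_tr, ys_tr, Xs_te, ys_te),
        "synthetic-real":      (Xs_tr, ys_tr, Xr_te, yr_te),
    }
    return {name: SVC().fit(Xtr, ytr).score(Xte, yte)
            for name, (Xtr, ytr, Xte, yte) in runs.items()}
```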

Extending this analysis, synthetic–synthetic and synthetic–real experiments were performed with the number of synthesized training samples increasing from 10 to 490 per class in increments of 10. The real–synthetic and real–real experiments were included for comparison with a consistent number of training samples, though the training and evaluation subsets differed in every iteration. The initial DA experiment used the same number of samples for the real and synthetic datasets (the augmented set therefore having twice as many), before the number of synthetic samples was incremented by 10 from 10 to 490 per class.

The data augmentation capabilities of the synthetic spectra were evaluated by similar methods: the three classifiers were trained with either real, synthetic or both combined into an augmented dataset, and tested against an evaluation dataset that was not used in the training of the GAN.

All code was written and executed in Python 3.7. The CramérGAN was implemented based upon [27] using the TensorFlow 1.8 framework. The Support Vector Machine (SVM) and Random Forest (RF) classifiers use the Scikit-Learn 0.22.2 library, with TensorFlow 1.8 utilized for the neural network (NN) classifier. Additionally, Scikit-Learn 0.22.2 provided the dimensionality reduction functions for Principal Components Analysis (PCA) and t-distributed Stochastic Neighbourhood Embedding (t-SNE), with Uniform Manifold Approximation and Projection (UMAP) used as a standalone library. Hyperparameters for all functions are provided in Appendix A.

#### *2.1. Classification Power*

The potential classification power of a sample was estimated with the C metric devised by Mountrakis and Xi [28] for predicting the likelihood of correctly classifying an unknown sample by measuring its Euclidean distance in feature space to samples in the training dataset. Mountrakis and Xi [28] demonstrated a strong correlation between an unknown sample's proximity to training samples and its likelihood of being correctly classified. The C metric is bounded between −1, indicating a low likelihood, and 1, indicating a high likelihood, of successful classification.

Rather than focusing on the proximity of an unknown sample to a classifier's training data, we are interested in the distance of each synthesized sample to the real data, in order to evaluate any potential increase in information density. We hypothesise that a synthetic sample with a C value closer to the lower bound lies further from the real data points than synthetic samples with C values close to the upper bound. Such a sample could potentially contain greater discriminatory power for the classifier, as it essentially fills a gap in the feature space of the class distribution.

To determine whether some samples of the NZ dataset provide more information to the classifier than others, and that the improvement in classification accuracy is not purely due to increased sample size, the distance of each generated sample to all real samples of its class was measured and converted to a C value as per Mountrakis and Xi [28], with an h value range of 1–50 at increments of 1. Two data subsets were then created from the first 100 spectral samples after all synthetic samples were ordered by their C value in ascending (most distant) and descending (least distant) order. The first 100 samples from each ordered dataset, rather than the full 500, were used to maximize differences, reduce computation time and simplify figures.
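A sketch of this ordering step follows. Because the exact C formula of Mountrakis and Xi [28] is not reproduced in this paper, the mean Euclidean distance to the real class samples is used here as a stand-in proxy (large mean distance roughly corresponds to C near the lower bound), and the function name is hypothetical:

```python
import numpy as np

def order_by_proxy_c(synthetic, real, n_keep=100):
    """Order one class's synthetic samples by a proxy for the C metric:
    the mean Euclidean distance to all real samples of that class."""
    # pairwise distances, shape (n_synthetic, n_real)
    d = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=-1)
    mean_d = d.mean(axis=1)
    closest_first = np.argsort(mean_d)
    descending = synthetic[closest_first[:n_keep]]        # least distant (high C)
    ascending = synthetic[closest_first[::-1][:n_keep]]   # most distant (low C)
    return ascending, descending
```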

*2.2. Datasets*

Two hyperspectral datasets were used to train the GAN: Indian Pines agricultural land cover types (INDI) and New Zealand plant spectra (NZ). The Indian Pines dataset (INDI), recorded by the AVIRIS airborne hyperspectral imager over north-west Indiana, USA, is made available by Purdue University and comprises 145 × 145 pixels at 20 m spatial resolution and 224 spectral reflectance bands from 400 to 2500 nm [29]. Removal of water absorption bands by the provider reduced these to 200 wavebands, and the reflectance of each pixel was then scaled between 0 and 1. Fifty pixels per class were randomly selected as training samples, except for three classes with fewer than 50 total samples, for which 15 samples were used for training (Table 1).
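A sketch of this preprocessing is given below. The per-pixel 0–1 scaling is an assumption (the text only says each pixel's reflectance was scaled between 0 and 1), and the function and argument names are illustrative:

```python
import numpy as np

def prepare_indian_pines(cube, labels, seed=0):
    """Flatten the (145, 145, 200) cube to per-pixel spectra, scale each
    pixel's reflectance to [0, 1], and randomly draw 50 training pixels
    per class (15 for classes with fewer than 50 labelled pixels)."""
    rng = np.random.default_rng(seed)
    X = cube.reshape(-1, cube.shape[-1]).astype(float)
    y = labels.ravel()
    mn = X.min(axis=1, keepdims=True)
    mx = X.max(axis=1, keepdims=True)
    X = (X - mn) / (mx - mn + 1e-12)       # per-pixel 0-1 scaling
    train_idx = []
    for c in np.unique(y[y > 0]):          # label 0 = unlabelled pixels
        idx = np.flatnonzero(y == c)
        k = 50 if idx.size >= 50 else 15
        train_idx.extend(rng.choice(idx, size=k, replace=False))
    return X, y, np.asarray(train_idx)
```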


**Table 1.** Land cover classes, training and evaluation sample numbers for Indian Pines dataset.

The New Zealand (NZ) dataset used in this study is a subsample of hyperspectral spectra for 22 species, taken from a dataset of 39 native New Zealand plant spectra collected at four different sites around the North Island of New Zealand and made available in the SPECCHIO database [3]. These spectra were acquired with an ASD FieldSpec Pro spectroradiometer at 1 nm sampling intervals between 350 and 2500 nm. Following acquisition from the SPECCHIO database, the spectra were resampled to 3 nm and the noisy bands associated with atmospheric water absorption were removed (1326–1464 nm, 1767–2004 nm and 2337–2500 nm), resulting in 540 bands per spectrum. Eighty percent of the samples per class were used for training the GAN and 20% were held aside to evaluate classifier performance (Table 2).
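A minimal per-spectrum sketch of this cleaning step is shown below; the authors' resampling method is not specified, so linear interpolation is an assumption:

```python
import numpy as np

# Water-absorption regions removed by the authors (nm).
WATER_BANDS = [(1326, 1464), (1767, 2004), (2337, 2500)]

def resample_and_clean(wavelengths, reflectance, step=3):
    """Resample a 1 nm ASD spectrum to 3 nm and drop water-absorption
    bands; returns the kept wavelengths and reflectance values."""
    new_wl = np.arange(wavelengths.min(), wavelengths.max() + 1, step)
    resampled = np.interp(new_wl, wavelengths, reflectance)
    keep = np.ones(new_wl.shape, dtype=bool)
    for lo, hi in WATER_BANDS:
        keep &= ~((new_wl >= lo) & (new_wl <= hi))
    return new_wl[keep], resampled[keep]
```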




**Table 2.** Plant species classes, training and evaluation sample numbers for the New Zealand dataset.

#### **3. Results and Discussion**

#### *3.1. Mean and Standard Deviation of Training and Synthetic Spectra*

In order to visualize similarities between synthetic and real spectra, the mean and standard deviation for each class are shown for the real, evaluation, and synthetic datasets. All low-frequency spectral features, as well as means and standard deviations, appear to be reproduced with high accuracy by the GAN. At finer scales of 3–5 wavebands, noise is present, most notably throughout the near infra-red (NIR) plateau (Figure 2). Smoothing of the synthesized data by several methods resulted in either no improvement or decreased performance in a number of tests; for this reason, no pre-processing was performed on the synthesized samples. Due to the high-frequency and random nature of the noise, once mean and STD statistics are calculated, the spectra appear smooth.

**Figure 2.** Synthetic spectra of NZ class 0, 350–2400 nm at 3 nm bandwidths.

Class 0 is one of the NZ classes with the largest number of samples, resulting in similar means and standard deviations across its real, evaluation, and synthetic subsets. However, this is not the case for all classes, with NZ-9 showing the mean and standard deviation of the randomly selected evaluation samples being vastly different from those of the real and synthetic spectra (Figure 3). The same is seen amongst the INDI classes, with class 2 matching across all three data subsets, and class 4, with only 40 samples, showing a substantial difference between evaluation and real samples, especially in the visible wavebands (Figure 4). Although some classes may struggle to represent the evaluation dataset due to the initial random splitting of the datasets, in general the mean and standard deviation of the synthetic samples very closely match the real training data.

**Figure 3.** Mean and +/− 1 STD for training (real), synthetic, and evaluation (real) datasets. (**A**) NZ class 0; Manuka (*L. scoparium*). (**B**) NZ class 9; Rata (*M. robusta*).

**Figure 4.** Mean and +/− 1 STD for training (real), synthetic, and evaluation (real) datasets. (**A**) INDI class 2; corn-no-till. (**B**) INDI class 4; corn.

#### *3.2. Generation and Distribution of Spectra*

Here, we demonstrate the ability of the GAN to reproduce realistic spectral shapes and to capture the statistical distribution of the class populations. Three dimensionality reduction methods—PCA, t-SNE, and UMAP—were applied to both the real and synthetic datasets of INDI and NZ spectra to reduce their 200 and 540 wavebands (respectively) down to a plottable 2D space (Figures 5 and 6). Upon visual inspection, the class clusters formed by the augmented data across all reduction methods mimic the distribution of those of the real data. Additionally, due to its small sample sizes, the structure of the clusters for the real NZ data is sparse and unclear, though it is emphasised by the large number of synthetic samples.
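A sketch of these three projections follows; the hyperparameters here are the library defaults, not the authors' settings (theirs are listed in Appendix A):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # standalone umap-learn package

def reduce_2d(X):
    """Project spectra of shape (n_samples, n_bands) to 2D with the
    three methods used in the paper."""
    return {
        "PCA":   PCA(n_components=2).fit_transform(X),
        "t-SNE": TSNE(n_components=2).fit_transform(X),
        "UMAP":  umap.UMAP(n_components=2).fit_transform(X),
    }
```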

**Figure 5.** Dimensional reduced representations of INDI real and synthetic datasets; highlighted classes: INDI1—Alfalfa (green); INDI11—Soybean-min-till (blue). (**A**) Real dataset; PCA reduction, (**B**) synthetic dataset; PCA reduction, (**C**) real dataset; t-SNE reduction, (**D**) synthetic dataset; t-SNE reduction, (**E**) real dataset; UMAP reduction, and (**F**) synthetic dataset; UMAP reduction.

**Figure 6.** Dimensional reduced representations of NZ real and synthetic datasets; + highlighted classes: NZ0—Manuka (*L. scoparium*) (green), NZ9—Rata (*M. robusta*) (blue). (**A**) Real dataset; PCA reduction, (**B**) synthetic dataset; PCA reduction, (**C**) real dataset; t-SNE reduction, (**D**) synthetic dataset; t-SNE reduction, (**E**) real dataset; UMAP reduction, and (**F**) synthetic dataset; UMAP reduction.

Such strong replication of the 2D representation of the classes is a good indication of the generative model's ability to learn distributions. Even when the models are trained separately for each class, the relationships between classes are maintained. However, the increased sample numbers in the synthetic datasets do in some cases extend beyond the bounds of the real samples. Whilst some may represent potential outliers, the majority are artefacts of the increased sample sizes. This is most evident in the UMAP representation, where a parameter that defines the minimum distance between samples can be set to a larger value, resulting in an increased spread of samples in the 2D representation [30]. This is most notable in the INDI dataset, with classes 1, 7, and 8 extending more broadly than in the real dataset (Figure 5F).

#### *3.3. Training Classification Ability*

In order to further examine the similarity of synthetic spectra to the real training data, three classifiers were trained (SVM, RF, NN), with four permutations of each (real–real, real–synthetic, synthetic–synthetic, and synthetic–real) (Table 3). With few exceptions, the neural network classifier outperformed the others, with SVM being the second most accurate, followed by RF. The INDI dataset recorded the highest accuracy for the real–real test with RF and NN classifiers at 74.76% and 84.13%, respectively, although the highest accuracy for the SVM classifier occurred during the synthetic–real test with 81.42% accuracy. Comparing the four combinations of real and synthetic, real–real had the highest accuracy for four experiments, with INDI synthetic–real with the SVM, and NZ real–synthetic with the RF classifier being the only exceptions.

**Table 3.** Classification accuracies for classifiers trained on real or synthesized spectral data and evaluated on either real or synthesized data for both Indian Pines and New Zealand datasets based on real class sample sizes. Highest achieved accuracy for each classifier per dataset indicated in bold.


To further evaluate the synthetic spectra, synthetic–synthetic and synthetic–real experiments were performed with the number of synthesized training samples increasing from 10 to 490 samples by increments of 10 samples per class (Figure 7). Synthetic–synthetic accuracy improves with more samples; this is to be expected, as it simply adds more training samples from the same distribution. Most importantly, synthetic–real accuracy, though often slightly lagging behind synthetic–synthetic, improves in the same manner, indicating that the synthetic samples are a good representation of the true distribution and that increasing their number for training a classifier is an effective method of data augmentation. The main exception to this is the NN NZ classifier, where synthetic–synthetic quickly reaches ~100% accuracy, while synthetic–real maintains ~80% before slowly decreasing in accuracy as more samples are added. This could indicate that the NN classifier focuses on different features than the other classifiers, potentially being more affected by the small-scale noise apparent in the NZ-generated samples, as the noise is not as apparent in the INDI data and the INDI NN classifier does not show such a discrepancy between synthetic–synthetic and synthetic–real.

**Figure 7.** Classification accuracies for classifiers trained on real or synthesized spectral data and evaluated on either real or synthesized data for both Indian Pines and New Zealand datasets ranging from 10 to 490 samples per class. (**A**) New Zealand dataset; SVM classifier, (**B**) Indian Pines dataset; SVM classifier, (**C**) New Zealand dataset; RF classifier, (**D**) Indian Pines dataset; RF classifier, (**E**) New Zealand dataset; NN classifier, and (**F**) Indian Pines dataset; NN classifier.

#### *3.4. Data Augmentation*

In order to test the viability of the synthetic data for data augmentation, the same three classifiers were trained with either real, synthetic or both combined into an augmented dataset and tested against an evaluation dataset (Table 4). All classifiers had higher accuracy when trained on the real dataset compared to synthetic, though the highest accuracy overall was with the augmented dataset. For the INDI data, this increase was minor, being <1% for all classifiers. A far more significant improvement was seen for the NZ data with increases of 3.54% (to 86.55%), 0.53% (to 50.80%), and 3.73% (to 85.14%) for SVM, RF, and NN, respectively.

**Table 4.** Classification accuracies for classifiers trained on real, synthesized, or augmented spectral data and evaluated on an evaluation dataset for both Indian Pines and New Zealand datasets based on real class sample sizes. Highest achieved accuracy for each classifier per dataset indicated in bold.


However, the number of synthetic samples need not be limited in this manner. As in previous experiments, the number of synthetic samples started at 10 and was incremented by 10 to a total of 490, demonstrating the potential of this data augmentation method. Dramatic increases in accuracy were seen for the synthetic dataset, with the smallest increase being 5.13% for INDI-SVM at 490 samples and the largest 20.47% for NZ-RF at 420 samples. These increases brought the synthetic dataset very close to the accuracy of the real samples, or even above it in the cases of INDI-NN, NZ-RF, and NZ-NN. Increases in accuracy were also seen for the augmented dataset, though not as dramatic as those for the synthetic dataset. Improvements in accuracy ranged from 0.16% for INDI-SVM at 10 synthetic samples to 9.45% for NZ-RF at 280 synthetic samples. These improvements raise the highest accuracy for the INDI dataset from 70.40% to 70.56%, an increase of 0.16% over the highest achieved by the real data alone. A larger increase was seen in the NZ dataset, with the previous highest accuracy rising from 86.55% to 90.01%, an increase of 3.45% over the previous augmented classification with restricted sample size and a 7% increase over the real dataset alone (Table 5).

**Table 5.** Classification accuracies for classifiers trained on real, synthesized, or augmented spectral data and evaluated on an evaluation dataset for both Indian Pines and New Zealand datasets with sample sizes ranging from 10 to 490 per class for synthetic and augmented while real contained all real samples. Highest achieved accuracy for each classifier per dataset indicated in bold.


#### *3.5. Classification Power of a Synthetic Sample*

Ordering the synthetic samples by their C value before iteratively adding samples one at a time from each class to the training dataset of an SVM classifier shows the differing classification power of the synthetic samples from lower to upper bounds of C and vice versa (Figure 8).

**Figure 8.** Classification accuracy of a SVM classifier for C metric ascending and descending ordered synthetic datasets incremented by single samples.

When in ascending order, from the lower to upper bounds of C, classification accuracy increases dramatically, reaching ~60% accuracy with ~100 samples, while 200 samples were required for similar accuracy in descending order. At approximately half the number of samples, the accuracies converge, then increase at the same rate before reaching 80% accuracy at 500 samples. These classification accuracies (Figure 8) provide the first insight into the increased discriminatory power associated with synthetic samples that occur at a distance from real samples. Although not encountered here, a maximum limit to this distance would be present, with synthetic samples needing to remain within the bounds of their respective class distributions.

A similar, though weaker, trend can be seen when the ordered synthetic samples are used to augment the real dataset. Both the ascending and descending datasets improve classification over that of the real dataset when samples are iteratively added to the classifier's training dataset (Figure 9). Although the descending-ordered samples outperform the ascending ones at times, on average the ascending samples achieved ~1.5% higher accuracy across the classifications (79.72% versus 78.24%).

**Figure 9.** Classification accuracy of a SVM classifier for C metric ascending and descending augmented datasets with randomly ordered real dataset incremented by single samples.

This artificial selection of synthetic data points distant from or close to the real data influences the sample distribution used to train the classifiers. As one might expect, the ordered data points come from the edges or sparse regions of the real data distribution, dramatically shifting the mean and standard deviation of the ordered datasets (Figure 10).

**Figure 10.** (**A**) Mean and (**B**) STD of C metric ascending, descending, and randomly ordered synthetic datasets incremented by single samples.

The inclusion of synthetic data points selected at random provides a baseline for comparison with the ordered datasets. Once the number of samples increases beyond a few points, the means for the descending and random datasets converge and stay steady throughout. Mean values for the ascending dataset start significantly higher and initially converge towards the other datasets before plateauing at a higher level. Although averaged across all classes and all wavebands, the mean reflectance of the ascending data is consistently higher. The standard deviation of the descending dataset is consistently low, increasing only slightly as samples are added. This is in stark contrast to the STD of the ascending dataset, which is ~5–6× higher across all n samples. The mean of the randomly selected dataset lies between the means of the two ordered datasets, though closer to the ascending mean, indicating that the samples making up the descending dataset are highly conserved.

To further illustrate the relationship between the ordered datasets and the real distribution, a PCA of one of the classes is shown (Figure 11). As the mean and STD indicated, the descending samples are tightly grouped near the mean and densest area of the real data distribution, with the ascending samples generally occurring along the border of the real distribution. Whilst ascending order selects samples with low C and greater distance from the real samples, it is important to note that these synthetic samples still appear to conform to the natural shape of the real distribution, a further indication that the generative model is performing well.

**Figure 11.** PCA of NZ class 0; Manuka (*L. scoparium*) real samples, with first 100 samples of the ascending and descending C ordered synthetic datasets.

#### **4. Conclusions**

In this paper, we have successfully demonstrated the ability to train a generative machine learning model for the synthesis of hyperspectral vegetation spectra. Evaluation of the synthetic spectra shows that they respect many of the statistical properties of the real spectra, conforming well to the sampled distributions of all real classes. Further to this, we have shown that the synthetic spectra generated by our models are suitable for augmenting a classification model's training dataset. Adding synthetic samples to the real training samples of a classifier increased overall classification accuracy under almost all circumstances examined. Of the two datasets, the New Zealand vegetation showed a maximum increase of 7.0% in classification accuracy, with Indian Pines demonstrating a more modest improvement of 0.16%. Selection of synthetic samples from sparse or outlying regions of the feature space of real spectral classes demonstrated increased discriminatory power over those from more central portions of the distributions. We believe further work could see targeted generation to maximize the information content of a synthetic sample, resulting in improved classification accuracy and generalizability with a smaller augmented dataset. The use of these synthesized spectra to augment real spectral datasets allows for the training of classifiers that benefit from large sample numbers without a researcher needing to collect additional labelled spectra from the field. This is of increasing significance as modern machine and deep learning algorithms tend to require larger datasets.

**Author Contributions:** Conceptualization, A.H.; methodology, A.H.; data curation, A.H.; formal analysis, A.H.; writing—original draft preparation, A.H., K.C. and M.L.; writing—review and editing, A.H., K.C. and M.L.; supervision, K.C., M.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The publicly available datasets used in our experiments are available at: Indian Pine Site 3 AVIRIS hyperspectral data image file, doi:10.4231/R7RX991C; New Zealand hyperspectral vegetation dataset, https://specchio.ch/.

**Acknowledgments:** Financial support for this research was provided by the Australian Government Research Training Program Scholarship and the University of Adelaide School of Biological Sciences.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Table A1.** Layer architecture of the GAN's generator.


**Table A2.** Layer architecture of the GAN's discriminator.


**Table A3.** Epochs with Kullback–Leibler divergence loss, Adam optimiser with a learning rate of 0.00001, and a batch size of 32.




**Table A4.** Hyperparameters used during UMAP dimension reduction for each dataset.


**Figure A1.** (**A**) New Zealand dataset; SVM classifier, (**B**) Indian Pines dataset; SVM classifier, (**C**) New Zealand dataset; RF classifier, (**D**) Indian Pines dataset; RF classifier, (**E**) New Zealand dataset; NN classifier, and (**F**) Indian Pines dataset; NN classifier.

#### **References**


### *Article* **Rice Leaf Blast Classification Method Based on Fused Features and One-Dimensional Deep Convolutional Neural Network**

**Shuai Feng 1, Yingli Cao 1,2, Tongyu Xu 1,2,\*, Fenghua Yu 1,2, Dongxue Zhao <sup>1</sup> and Guosheng Zhang <sup>1</sup>**


**\*** Correspondence: xutongyu@syau.edu.cn; Tel.: +86-024-8848-7121

Received: 15 July 2021; Accepted: 10 August 2021; Published: 13 August 2021

**Abstract:** Rice leaf blast, which seriously affects the yield and quality of rice around the world, is a fungal disease that develops easily under high temperature and humidity conditions. Therefore, the use of accurate and non-destructive diagnostic methods is important for rice production management. Hyperspectral imaging technology is a crop disease identification method with great potential. However, the large amount of redundant information mixed into hyperspectral data makes it more difficult to establish an efficient disease classification model. At the same time, the difficulty and small scale of agricultural hyperspectral imaging data acquisition have resulted in unrepresentative features being acquired. Therefore, the focus of this study was to determine the best classification features and classification models for the five disease classes of leaf blast in order to improve the accuracy of grading the disease. First, the hyperspectral imaging data were pre-processed in order to extract rice leaf samples of the five disease classes, and the number of samples was increased by data augmentation methods. Secondly, spectral feature wavelengths, vegetation indices and texture features were obtained from the augmented sample data. Thirdly, seven one-dimensional deep convolutional neural network (DCNN) models were constructed based on the spectral feature wavelengths, vegetation indices, texture features and their fusion features. Finally, the model in this paper was compared with the Inception V3, ZF-Net, TextCNN and bidirectional gated recurrent unit (BiGRU); support vector machine (SVM); and extreme learning machine (ELM) models in order to determine the best classification features and classification models for the different disease classes of leaf blast. The results showed that the classification model constructed using fused features was significantly better than models constructed with a single feature in terms of accuracy in grading the degree of leaf blast disease. The best performance was achieved with the combination of the successive projections algorithm (SPA) selected feature wavelengths and texture features (TFs). The modeling results also show that the DCNN model provides better classification capability for disease classification than the Inception V3, ZF-Net, TextCNN, BiGRU, SVM and ELM classification models. The SPA + TFs-DCNN achieved the best classification accuracy, with an overall accuracy (OA) and Kappa of 98.58% and 98.22%, respectively. In terms of the classification of the specific disease classes, the F1-scores for diseases of classes 0, 1 and 2 were all 100%, while the F1-scores for diseases of classes 3 and 4 were 96.48% and 96.68%, respectively. This study provides a new method for the identification and classification of rice leaf blast and a research basis for assessing the extent of the disease in the field.

**Keywords:** rice leaf blast; hyperspectral imaging data; deep convolutional neural networks; fused features

#### **1. Introduction**

Crop pests and diseases cause huge losses in agricultural production [1]. According to the Food and Agriculture Organization of the United Nations, the annual reduction in food production caused by pests and diseases accounts for about 25% of total food production worldwide, with 14% of the reduction caused by diseases and 10% by pests [2]. In China, the amount of grain lost due to pest and disease outbreaks is about 30% of total production each year, which has a huge impact on the domestic economy [3]. Crop disease monitoring still relies mainly on plant protection personnel conducting field surveys and field sampling. Although these traditional detection methods have high accuracy and reliability, they are time-consuming, laborious and lack representativeness. They also depend on the subjective judgment of investigators, which is prone to human misjudgment, subjective errors and variability [4–7]. Therefore, there is an urgent need for improved pest and disease monitoring and control methods.

Rice blast is one of the most serious rice diseases in the northern and southern rice-growing areas of China, and together with bacterial blight and sheath blight it is known as one of the three major rice diseases [8]. In September 2020, rice blast was listed as a Class I crop pest by the Ministry of Agriculture and Rural Affairs of China. Rice blast is caused by *Magnaporthe grisea* and *Pyricularia grisea*, which infest the leaves, neck and ears of rice by producing conidia, with devastating effects on the physiology of rice growth [9]. According to the period and location of damage, rice blast can be divided into seedling blast, leaf blast, spike blast and other forms, among which leaf blast is the most harmful. Leaf blast usually occurs after the three-leaf stage of rice plants and becomes increasingly serious from the tillering stage to the jointing stage. The spots first appear as white dots and gradually become diamond-shaped spots 1~3 cm long. The disease spot is gray in the middle and surrounded by a dark brown border. In severe infestations, the entire leaf dries out [10,11], reducing the green leaf area and photosynthesis in the lesioned area [12] and thus causing substantial rice yield reductions. Leaf blast generally causes a 10~30% yield reduction in rice; under favourable conditions, it can destroy an entire rice field in 15 to 20 days and cause up to 100% yield loss [13]. In China, rice blast affects an average of up to 3.8 million hectares annually, with annual losses of hundreds of millions of kilograms of rice. In order to control the spread of the leaf blast fungus over large areas and reduce yield losses, it is urgent to develop methods for the rapid and accurate monitoring and discrimination of leaf blast disease.

Spectroscopy is a commonly used technique for plant disease detection, and its non-destructive, rapid and accurate characteristics have attracted the attention of a wide range of scholars [14]. Multispectral techniques [15,16] and near-infrared spectroscopy [17,18] have been studied for crop disease stress classification. However, multispectral and near-infrared techniques obtain less spectral information, making it more difficult to detect a disease at its early stage of development and to discriminate it accurately. Compared with the above-mentioned spectroscopic techniques, hyperspectral imaging technology, which offers many spectral bands, high resolution and both spatial-domain and spectral-domain information, has gradually become a research hotspot. The technique has been widely used for disease detection in vegetables [19,20], fruits [21,22] and grains [23–25]. In recent years, with its continued development and application, hyperspectral imaging has made great progress in crop disease detection and greatly improved the precision of prevention, control and management decisions in the field. Luo et al. [26], after comparing the accuracy of rice blast identification with different spectral processing methods and modeling approaches, concluded that probabilistic neural network classification based on logarithmic spectra was the best, with an accuracy of 75.5% on the test set. Liu et al. [27] used support vector machine and extreme learning machine methods to model and classify white scab and anthracnose of tea, respectively, with a classification accuracy of 95.77%. Yuan et al. [28] extracted hyperspectral data from healthy leaves, diseased leaves without disease spots, leaves with less than 10% disease spot area and leaves with less than 25% disease spot area, and used CARS-PCA for dimensionality reduction in order to construct SVM rice blast classification models; the accuracy for all categories was greater than 94.6%. Knauer et al. [29] used hyperspectral imaging for the accurate classification of powdery mildew of wine grapes. Nagasubramanian et al. [30] used hyperspectral techniques to build early identification models of soybean charcoal rot based on genetic algorithms and support vector machines. Nettleton et al. [31] used operational process-based models and machine learning models for the predictive analysis of rice blast, concluding that machine learning methods adapted better to the prediction of rice blast when a training data set was available. All the above-mentioned studies achieved good results, but all of them focused on detecting crop diseases using the spectral information from hyperspectral images; they did not address the texture features in hyperspectral images, which are directly related to disease characterization. Texture features, as inherent properties of the crop that are not easily disturbed by the external environment, can reflect image properties and the spatial distribution of adjacent pixels, compensating to some extent for the saturation of crop disease detection that relies only on spectral information [32]. Zhang et al. [33] used spectral features and texture features to construct a support vector machine classification model; the results demonstrated that the model was able to effectively distinguish healthy, moderately and severely diseased wheat. Al-Saddik et al. [34] concluded that combining texture features of grape leaves with spectral information to construct a classification model resulted in the effective classification of yellowness and esca, with an overall accuracy of 99%. Zhang and Zhu et al. [35,36] concluded that classification models constructed by fusing spectral and texture features had superior classification accuracy compared to models using only spectral or texture features. The above literature shows that it is feasible to construct plant disease classification models by fusing spectral and texture information from hyperspectral images. However, the use of fused spectral and textural information to discriminate different disease levels of rice leaf blast remains to be explored in depth.

In the above-mentioned studies, researchers mostly used machine learning methods such as support vector machines and back-propagation neural networks to model hyperspectral data. There are still relatively few studies using deep learning methods for crop disease identification based on hyperspectral imaging data, possibly because the small quantity of sample data obtained makes it difficult to build deep learning models. Where such studies exist, researchers have favoured deep learning methods for their powerful feature extraction capabilities. Nagasubramanian et al. [37] constructed a 3D convolutional neural network recognition model for soybean charcoal rot using hyperspectral image data, with a classification accuracy of 95.73%. Huang et al. [38] obtained hyperspectral images of rice spike blast and constructed a detection model based on the GoogLeNet method, with a maximum accuracy of 92%. Zhang et al. [39] used a three-dimensional deep convolutional neural network model to model yellow rust of winter wheat, with an overall accuracy of 85%. Although this modeling approach can achieve high accuracy, it still requires expensive hyperspectral instruments to obtain data in practical agricultural applications and cannot yet be applied on a large scale.

In view of this, this study draws on existing research methods to expand the sample data size. The augmented sample data were then used to extract spectral feature wavelengths, vegetation indices and texture features. A total of seven one-dimensional deep convolutional neural network classification models were constructed for leaf blast disease classification based on the above features and their fusion features. Finally, Inception V3, ZF-Net, BiGRU, TextCNN, SVM and ELM models were used for comparative analysis against the model of this study in order to determine the best classification features and classification model for leaf blast. This work is expected to provide scientific theory and technical support for the identification of rice leaf blast disease grades.

#### **2. Materials and Methods**

#### *2.1. Study Site*

Rice leaf blast trials were conducted from July to August 2020 at Liujiaohe Village, Shenyang New District, Shenyang, Liaoning Province (42°01′17.16″ N, 123°38′14.57″ E). The region has a temperate semi-humid continental climate, with an average annual temperature of 9.7 °C and an average annual precipitation of 700 mm, making it a typical cold-region rice-growing area. Mongolian rice, a variety highly susceptible to leaf blast, was used as the test variety; it was planted on an area of about 100 m<sup>2</sup> with a row spacing of 30 cm and a plant spacing of 17 cm. Nitrogen, potassium and phosphorus fertilizers were applied according to local standards at 45, 15 and 51.75 kg/hm<sup>2</sup>, respectively. Prior to basal fertilizer application, soil samples were collected using the five-point sampling method from the disease trial plots, and soil nutrients were measured and analyzed. The results showed that the available potassium content ranged from 86.83 to 120.62 mg/kg; the available phosphorus content from 3.14 to 21.18 mg/kg; the total nitrogen content from 104.032 to 127.368 mg/kg; and the organic matter content from 15.8 to 20.0 g/kg. Leaf blast inoculation was carried out at 5:00 p.m. on 3 July 2020 by spraying a spore suspension at a concentration of 9 mg/100 mL (the suspension was shaken well and sprayed evenly over the surface of the plant leaves until the leaves were completely covered with water droplets); the plants were then wrapped in moistened black plastic bags, which were removed at 6:30 a.m. the following morning. The test plots were not treated with any disease control, and field management was otherwise normal. Five days after inoculation, the plants began to show symptoms, and healthy and diseased rice leaves were collected from the field under the guidance of a plant protection specialist and taken back to the laboratory to acquire hyperspectral image data.

#### *2.2. Data Acquisition and Processing*

#### 2.2.1. Sample Collection

Five collections of healthy and diseased plants were conducted across three critical growth stages: the jointing stage (8 July; 15 July), the booting stage (25 July; 2 August) and the heading stage (10 August). Under the supervision of plant protection experts, 57, 61 and 27 leaf samples spanning five different disease levels were collected at the jointing, booting and heading stages, respectively, for a total of 145 rice leaf samples. To maintain the moisture content of the rice leaves during the experiment, the leaves were placed in a portable refrigerator to keep them fresh. Hyperspectral image data were then acquired indoors using a hyperspectral imaging system. Figure 1 shows pictures of healthy rice leaves and leaves at the different disease grades. We used ENVI 5.3 (ITT Visual Information Solutions, Boulder, CO, USA) software for manual segmentation of the rice leaves, leaf background and diseased areas. The number of pixels in the whole leaf and in the diseased area was calculated, along with the diseased pixels as a percentage of the leaf pixels. Classification was then carried out according to the size of the disease spot following the GBT 15790-2009 Rules of Investigation and Forecast of the Rice Blast, as shown in Table 1. Level 5 leaf blast samples were not found in this study; therefore, the criteria for determining level 5 disease are not listed in Table 1.

**Figure 1.** Healthy and different disease levels of rice leaves.
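As a sketch of this pixel-counting step (not the authors' ENVI workflow; the masks are assumed to come from any segmentation tool, and the grading cut-offs below are placeholders, not the GBT 15790-2009 boundaries of Table 1):

```python
import numpy as np

def lesion_percentage(leaf_mask, lesion_mask):
    """Diseased pixels as a percentage of all leaf pixels, from boolean
    masks obtained by manual segmentation (ENVI 5.3 in the paper)."""
    return 100.0 * np.count_nonzero(lesion_mask) / np.count_nonzero(leaf_mask)

def disease_level(pct, cutoffs=(5.0, 10.0, 25.0, 50.0)):
    """Map a lesion percentage to a disease level 0 (healthy) to 4.
    The cut-offs are placeholders only, NOT the Table 1 boundaries."""
    return int(np.searchsorted(cutoffs, pct, side="right"))
```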



#### 2.2.2. Hyperspectral Image Acquisition

In this study, a hyperspectral imaging system was used to acquire hyperspectral images of rice leaves, as shown in Figure 2. The main components of the system include a hyperspectral imaging spectrometer (ImSpector V10E, Spectral Imaging Ltd., Oulu, Finland), a high-definition camera (IGV-B1410, Antrim, Northern Ireland), a precision displacement control stage, a light-free dark box, two 150 W fiber optic halogen lamps (Ocean Optics, Dunedin, FL, USA) and a computer. The effective spectral range obtained by this hyperspectral imaging system is 400–1000 nm with a spectral resolution of 0.64 nm. The distance of the camera lens from the surface of the rice leaves was set to 32 cm before acquiring the images. The lens focus was adjusted by using a white paper focusing plate with black stripes until the black stripes were imaged and the transition area between the black stripes and the white paper was clear. In order to obtain the best image quality, the light source intensity and exposure rate were adjusted and the scanning speed was set to 1.1 mm/s.

**Figure 2.** Hyperspectral imaging system: (1) EMCCD HD camera; (2) hyperspectral imaging spectrometer; (3) lens; (4) light source controller; (5) light source; (6) computer; (7) displacement stage; (8) displacement stage controller.

Because variation in light intensity across the leaf surface and the camera's dark current cause inconsistent intensity values across the hyperspectral image, the original hyperspectral images needed to be corrected with black and white reference plates using Equation (1) to obtain the final spectral reflectance:

$$I = \frac{R_S - R_D}{R_W - R_D} \tag{1}$$

where *I* is the corrected hyperspectral reflectance of the rice leaves, *RS* is the spectral reflectance of the original hyperspectral images of rice leaves, and *RW* and *RD* are the spectral reflectance of the white reference plate and black reference plate, respectively. Acquisition and transmission of the spectral images were handled by the system's hyperspectral acquisition software (Isuzu Optics, Hsinchu, Taiwan).
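
As a concrete illustration, the correction in Equation (1) amounts to a per-pixel, per-band normalization. The following minimal numpy sketch applies it; the array shapes and the small denominator guard are assumptions, not part of the original processing chain.

```python
import numpy as np

def calibrate(raw_cube, white_ref, dark_ref):
    """Black-and-white plate correction (Equation (1)).

    raw_cube:  (rows, cols, bands) raw hyperspectral image R_S
    white_ref: white reference R_W (same shape, or broadcastable)
    dark_ref:  dark reference R_D (same shape, or broadcastable)
    Returns the reflectance image I, approximately in [0, 1].
    """
    num = raw_cube.astype(np.float64) - dark_ref
    den = np.clip(white_ref.astype(np.float64) - dark_ref, 1e-6, None)
    return num / den
```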

#### 2.2.3. Spectra Extraction and Processing

In this study, the whole rice leaf was treated as a single region of interest (ROI), and ENVI 5.3 was used to manually delineate the ROI and extract its average spectral reflectance. This yielded 29 healthy samples and 116 diseased samples (27, 32, 27 and 30 samples for disease levels 1, 2, 3 and 4, respectively), for a total of 145 hyperspectral imaging data records.

In order to determine the best classification features and classification model for leaf blast, there were two main considerations in this study. Firstly, classification features extracted from a data set of this size are contingent rather than universal. Secondly, a classification model built on such a data set does not generalize well, and the data are insufficient for constructing a deep learning model that relies on large-scale, well-calibrated supervision. In view of these two considerations, the data set was divided into a training set and a testing set, and the data augmentation method proposed by Chen et al. [40] was then applied. This method augments the data by adding light intensity perturbations and Gaussian noise to the raw spectral data to simulate interference factors such as uneven illumination and instrument noise. The formula is shown in Equation (2):

$$\mathbf{y}_i = n\,\mathbf{y}_{Gaussian} + alp\,\mathbf{x}_i \tag{2}$$

where *n* is the weight controlling the Gaussian noise y*Gaussian*, *alp* is the light intensity perturbation factor and *xi* is the raw spectral data. Figure 3 shows the effect of data augmentation.

**Figure 3.** The effect of data augmentation.

In the end, a total of 986 healthy sample data, 918 level 1 disease data, 1088 level 2 disease data, 918 level 3 disease data and 1020 level 4 disease data were obtained, resulting in a total of 4930 sample data.
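
For illustration, Equation (2) can be applied per spectrum as sketched below; the noise weight, gain range and random seed are illustrative assumptions rather than the settings of Chen et al. [40].

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_spectrum(x, n=0.01, alp_range=(0.95, 1.05)):
    """Spectral augmentation after Equation (2): y = n*y_Gaussian + alp*x.

    n scales zero-mean, unit-variance Gaussian noise (instrument noise);
    alp is a random gain simulating uneven illumination.
    Parameter values here are assumptions for demonstration only."""
    alp = rng.uniform(*alp_range)
    y_gauss = rng.standard_normal(x.shape)
    return n * y_gauss + alp * x

# e.g., each of the 145 measured spectra can be expanded into dozens of
# perturbed copies to reach the 4930 samples reported above.
```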

#### *2.3. Optimal Spectral Feature Selection*

Hyperspectral data are characterized by rich information content, high resolution and band continuity, and can fully reflect differences in the physical structure and chemical composition of the leaf. However, the spectra still contain a large amount of redundant information, which degrades modeling accuracy. Hyperspectral data therefore need dimensionality reduction to extract valid and representative spectral features as model input. In this study, no new dimensionality reduction methods were proposed; instead, both the successive projections algorithm (SPA) and the random frog (RF) method were used to extract spectral feature wavelengths. Many researchers have confirmed that the characteristic wavelengths screened by SPA and RF are representative, and both methods select a small number of characteristic wavelengths, which makes the resulting models easy to generalize and use.

SPA is a forward feature variable selection method [41] that seeks the combination of variables containing the least redundant information and the minimum collinearity. The algorithm projects each remaining spectral wavelength onto the subspace orthogonal to the wavelengths already chosen, compares the magnitudes of the projection vectors, and selects the wavelength with the largest projection vector at each step. A multiple linear regression model is then developed to obtain the RMSECV of the modeling set for each candidate subset, and the subset (number and wavelengths) with the smallest RMSECV constitutes the optimal combination of spectral feature wavelengths.
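
The projection step can be sketched as follows; this is a minimal illustration of the SPA selection loop only, with the multiple-start evaluation and the RMSECV-based choice of subset size left out.

```python
import numpy as np

def spa(X, k, start=0):
    """Successive projections algorithm (SPA), a minimal sketch.

    X: (samples, bands) matrix of calibrated spectra.
    k: number of wavelengths to select; start: initial band (assumed).
    Returns the indices of the selected bands."""
    Xp = X.astype(float).copy()
    selected = [start]
    candidates = set(range(X.shape[1])) - {start}
    for _ in range(k - 1):
        v = Xp[:, selected[-1]]
        # project every column onto the orthogonal complement of v
        Xp = Xp - np.outer(v, (v @ Xp) / (v @ v))
        # the remaining band with the largest residual norm is least collinear
        norms = np.linalg.norm(Xp, axis=0)
        nxt = max(candidates, key=lambda j: norms[j])
        selected.append(nxt)
        candidates.discard(nxt)
    return selected
```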

RF is a relatively new feature variable screening method, initially used for gene expression data analysis of diseases [42]. The method uses reversible jump Markov chain Monte Carlo (RJMCMC) to transform and sample the dimensions of the spectrum. A Markov chain conforming to the steady-state distribution is then modeled in the variable space to calculate the selection frequency of each wavelength variable. These selection frequencies serve as the basis for eliminating redundant variables, leaving the best spectral characteristic wavelengths.

#### *2.4. Texture Features Extraction*

Textural features contain important information about the structural tissue arrangement of the leaf spot surface and the association of the spot with its surroundings. TFs can therefore reflect the physical characteristics of crop leaves and the growth status of the crop [26]. When leaf blast infects leaves, cell inclusions and cell walls are damaged, and the chlorophyll content and cell volume are reduced. This changes the color of some areas of the leaf surface and alters the textural characteristics.

A gray-level co-occurrence matrix (GLCM) is a common method for extracting texture features from the leaf surface. It captures comprehensive image information on direction, interval and magnitude of change by calculating the correlation between the gray levels of two points at a certain distance and in a certain direction in the image [43]. The energy, entropy, correlation and contrast can reflect the difference between the diseased and normal parts of the leaf and thus improve modeling accuracy (energy reflects the uniformity of the gray distribution and the texture coarseness; entropy is a measure of the amount of information in the image; correlation measures the similarity of gray levels along the row or column direction; and contrast reflects the sharpness of the image and the depth of the texture grooves). Hence, in this study, energy, entropy, correlation and contrast were calculated in four directions, namely 0°, 45°, 90° and 135°, at a relative pixel distance d of 1. The formulae for energy, entropy, correlation and contrast are shown in Table 2. The average and standard deviation of each measure over the four directions were then calculated, giving a total of eight texture features: the mean (MEne) and standard deviation (SDEne) of energy, the mean (MEnt) and standard deviation (SDEnt) of entropy, the mean (MCor) and standard deviation (SDCor) of correlation, and the mean (MCon) and standard deviation (SDCon) of contrast.


**Table 2.** Four texture features extracted from the GLCM.

Note: *i* and *j* represent the row number and column number of the grayscale co-occurrence matrix, respectively; *P*(*i*, *j*) denotes the relative frequency of two neighboring pixels.
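
For reference, the eight texture features can be computed as sketched below with scikit-image (version 0.19 or later for these function names); the gray-level quantization to 64 levels is an assumption, and entropy is computed by hand since `graycoprops` does not provide it.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_img, levels=64):
    """Energy, entropy, correlation and contrast from the GLCM at d = 1
    and angles 0/45/90/135 degrees, then their mean and standard
    deviation over the four directions (eight features in total)."""
    q = (gray_img / gray_img.max() * (levels - 1)).astype(np.uint8)
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
    glcm = graycomatrix(q, distances=[1], angles=angles,
                        levels=levels, symmetric=True, normed=True)
    feats = {}
    for prop in ("energy", "correlation", "contrast"):
        vals = graycoprops(glcm, prop)[0]        # one value per direction
        feats["M" + prop[:3].capitalize()] = vals.mean()
        feats["SD" + prop[:3].capitalize()] = vals.std()
    # entropy per direction, computed from the normalized GLCM
    p = glcm[:, :, 0, :]                         # (levels, levels, 4)
    ent = -np.sum(p * np.log2(p + 1e-12), axis=(0, 1))
    feats["MEnt"], feats["SDEnt"] = ent.mean(), ent.std()
    return feats
```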

#### *2.5. Vegetation Index Extraction*

VIs are indicators constructed from linear and nonlinear combinations of different spectral bands, and they are often used to monitor and discriminate the degree of vegetation disease. In this study, the VIs most strongly correlated with the leaf blast disease levels were screened by establishing contour maps of the coefficient of determination. The method selects any two spectral bands in the spectral range to construct a spectral index, and the Pearson correlation coefficient between the disease class and the index is then calculated to find the vegetation indices with the highest classification ability.

Based on previous research results, the ratio spectral index (RSI), the difference spectral index (DSI) and the normalized difference spectral index (NDSI) were used to construct the contour of the decision coefficient. The formula is as follows:

$$RSI = R_i / R_j \tag{3}$$

$$DSI = R_i - R_j \tag{4}$$

$$NDSI = (R_i - R_j)/(R_i + R_j) \tag{5}$$

where *Ri* and *Rj* denote the spectral reflectance values at bands *i* and *j* within the spectral range.
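
A brute-force sketch of the contour-of-decision-coefficient screening for NDSI follows; RSI and DSI only change the index formula. The epsilon guards are assumptions.

```python
import numpy as np

def ndsi_r2_map(R, grade):
    """R^2 surface between the disease grade and the NDSI built from
    every two-band combination.

    R: (samples, bands) reflectance; grade: (samples,) disease level.
    The returned matrix can be drawn with plt.contourf to reproduce
    Figure 7-style contour maps."""
    n_bands = R.shape[1]
    r2 = np.zeros((n_bands, n_bands))
    gc = grade - grade.mean()
    for i in range(n_bands):
        for j in range(n_bands):
            vi = (R[:, i] - R[:, j]) / (R[:, i] + R[:, j] + 1e-12)
            vc = vi - vi.mean()
            denom = np.sqrt((vc ** 2).sum() * (gc ** 2).sum())
            r2[i, j] = ((vc * gc).sum() / (denom + 1e-12)) ** 2
    return r2
```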

#### *2.6. Disease Classification Model*

Deep Convolutional Neural Network

The human visual system has a powerful ability to classify, monitor and recognize. In recent years, a wide range of researchers have therefore been inspired by biological vision systems to develop advanced data processing methods. Convolutional neural networks (CNNs) are deep neural networks developed to emulate biological perceptual mechanisms, and they are capable of automatically extracting sensitive features at both shallow and deep levels in the data. The residual network (ResNet) [44] is a typical representative of the CNN, as shown in Figure 4. The residual module (comprising a direct mapping and a residual component) is designed to better extract data features and to prevent degradation of the network. ResNet won the ILSVRC 2015 classification competition and is well recognized for its feature extraction and classification ability.

**Figure 4.** ResNet structure.

As ResNet has a deep network hierarchy, it is prone to over-fitting during training. It was also designed mainly for image classification and is not directly applicable to spectral data. This study therefore adapts ResNet to make it suitable for modeling one-dimensional data. Firstly, since the data in this study were all one-dimensional, the number of input features was used as the network input size, and there was no need to experimentally derive an optimum input layer size. The number of channels in the FC layer of ResNet was adjusted to 5 for the five-class problem of normal leaves and level 1 to level 4 rice leaf blast. ResNet is a DCNN designed for large-scale data, and its training is computationally intensive, whereas the disease classification problem here involves far less data and training computation. Therefore, to improve the modeling effect, different classification networks were designed by adjusting the network depth and structure of ResNet and by adding BatchNorm and Dropout layers, while maintaining the design concept of ResNet (Figure 5), so as to suit the data obtained in this study. The model in this paper was compared with the SVM [45], ELM [46], Inception V3 [47], ZF-Net [48], BiGRU [49] and TextCNN [50] models to determine the best leaf blast disease classification model.

**Figure 5.** DCNN models with different dimensionality reduction methods.

The above DCNN models were built using the deep learning framework Keras 2.3. The hardware environment for the experiments was 32 GB of RAM, an Intel Xeon Bronze 3204 CPU and an NVIDIA Quadro P5000 GPU.
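
A minimal sketch of the adapted one-dimensional residual block in Keras 2.3 syntax is shown below; the filter count, dropout rate and pooling choice are assumptions, not the authors' published configuration.

```python
from keras.models import Model
from keras.layers import (Input, Conv1D, BatchNormalization, Activation,
                          Dropout, GlobalAveragePooling1D, Dense, add)
from keras import backend as K

def residual_block(x, filters, kernel_sizes=(3, 3)):
    """Residual block with an identity shortcut; two Conv1D layers with
    kernel size 3 is the optimum reported in Section 3.3 for the
    SPA/RF/TFs features."""
    y = x
    for k in kernel_sizes:
        y = Conv1D(filters, k, padding="same")(y)
        y = BatchNormalization()(y)
        y = Activation("relu")(y)
    if K.int_shape(x)[-1] != filters:      # widen the shortcut if needed
        x = Conv1D(filters, 1, padding="same")(x)
    return Activation("relu")(add([x, y]))

def build_dcnn(n_features, n_classes=5, filters=32):
    """1-D DCNN over a spectral/texture feature vector; five output
    classes (healthy plus disease levels 1-4)."""
    inp = Input(shape=(n_features, 1))
    x = residual_block(inp, filters)
    x = Dropout(0.3)(x)                    # Dropout added against over-fitting
    x = GlobalAveragePooling1D()(x)
    out = Dense(n_classes, activation="softmax")(x)
    return Model(inp, out)
```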

#### **3. Results**

#### *3.1. Spectral Response Characteristics of Rice Leaves*

As shown in Figure 6, the mean spectral reflectance of healthy rice leaves and disease-susceptible leaves showed a consistent trend. The reflectance at 500~600 and 770~1000 nm changed significantly after rice blast spores infested the leaves: the reflectance of diseased leaves increased slightly between 500 and 600 nm and decreased significantly at 700~1000 nm. In the range of 680 to 770 nm, the spectral curves of the different disease degrees were shifted toward shorter wavelengths compared with the healthy leaf spectral curves, i.e., the "blue shift" phenomenon. This is due to damage to chloroplasts or other organelles within the leaf caused by the disease and to changes in pigment content, resulting in changes in spectral reflectance [51]. The band range between 400 and 450 nm shows severe reflectance overlap, and thus the range of 450 to 1000 nm was chosen as the main band for spectral feature extraction.

**Figure 6.** Comparison of average spectral curves. (**a**) Average spectral curves of diseases at 400 to 1000 nm. (**b**) Average spectral curves of diseases at 680 to 770 nm.

#### *3.2. Optimal Features*

#### 3.2.1. Vegetation Indices

Figure 7 shows the contours of the coefficient of determination between the leaf disease class and the DSI, RSI and NDSI formed by all two-band combinations. In Figure 7a, the NDSI constructed from band combinations of 623 to 700 nm with 700 to 1000 nm, and of 556 to 702 nm with 450 to 623 nm, correlated well with the disease levels, with coefficients of determination R<sup>2</sup> greater than 0.8. Among them, the NDSI constructed from the combination of 600 and 609 nm had the best correlation, with an R<sup>2</sup> of 0.8947. Compared with NDSI, RSI correlated well with the disease class over fewer band ranges, mostly concentrated in the visible range (Figure 7b). The best RSI was constructed from the combination of 725 and 675 nm, with an R<sup>2</sup> of 0.9103. Among the DSIs, the index constructed at 548 and 698 nm had the highest correlation, with an R<sup>2</sup> of 0.800 (Figure 7c).

**Figure 7.** Contour of decision coefficient between disease levels and DSI, RSI and NDSI. (**a**) NDSI. (**b**) RSI. (**c**) DSI.

#### 3.2.2. Extraction of Hyperspectral Features

The spectral data were processed using SPA to obtain characteristic wavelengths with high correlation. In this study, a minimum of eight and a maximum of ten wavelengths were set for screening, and the RMSE was used as the criterion for selecting the best spectral feature wavelengths. Figure 8a shows the eight optimal spectral characteristic wavelengths, which are listed in Table 3. The RMSE curve drops sharply as the number of selected wavelengths increases from 0 to 5 and stabilizes at the eighth wavelength. The final SPA selection comprises eight spectral feature wavelengths distributed evenly across the visible, red-edge and near-infrared regions.

**Figure 8.** Selected optimal variables using (**a**) SPA and (**b**) RF.


**Table 3.** The variables selected by SPA and RF.

The RF algorithm was used to screen the spectral feature wavelengths, setting the maximum number of potential variables to 6, the initial number of sampled variables to 1000 and the screening threshold to 0.1. Given that the RF algorithm uses RJMCMC as the screening principle, the characteristic bands are slightly different each time they are screened. The RF algorithm was, therefore, run a total of 10 times, and the final average of the results was taken as the basis for the judgment of the characteristic wavelengths. The screening probability results for each spectral characteristic wavelength are shown in Figure 8b. The larger the screening probability, the more important the corresponding spectral feature wavelengths are; thus, the wavelengths with a screening probability greater than 0.1 were selected as the best spectral feature wavelengths (Table 3), with a total of 13 spectral feature wavelengths, accounting for approximately 2.36% of the full wavelength band.

#### 3.2.3. Extraction of Texture Features by GLCM

Since hyperspectral images contain a large amount of redundant information, PCA was used to reduce their dimensionality and generate principal component images containing most of the effective information. The cumulative contribution of the first three principal component images (PC1–PC3) was greater than 95%, and these were therefore used to extract texture features. Figure 9 shows the principal component images of healthy and diseased leaves after dimensionality reduction by PCA.

**Figure 9.** Principal component images of healthy and diseased leaves.
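
The PCA step can be sketched as follows; reshaping the cube to a pixel-by-band matrix before fitting is the standard approach, and the use of scikit-learn here is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_images(cube, n_components=3):
    """Project a (rows, cols, bands) hyperspectral cube onto its first
    principal components and return them as images (PC1-PC3 here)."""
    rows, cols, bands = cube.shape
    flat = cube.reshape(-1, bands).astype(float)
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(flat)
    # cumulative explained variance should exceed 0.95 for PC1-PC3
    print("cumulative variance:", pca.explained_variance_ratio_.cumsum())
    return scores.reshape(rows, cols, n_components)
```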

The GLCM was applied to the PC1–PC3 images separately to obtain the eight features, namely the means and standard deviations of energy, entropy, contrast and correlation. To further improve modeling accuracy, redundant texture features were removed: the eight texture features were subjected to Pearson correlation analysis with the disease classes to screen the significantly and highly significantly correlated features. The correlation coefficients and significance levels are shown in Table 4. Among the eight features, MEne, SDEne, MEnt, SDEnt, MCon, SDCon and MCor displayed highly significant correlations, while SDCor displayed a lower correlation. Therefore, the seven highly significant features (MEne, SDEne, MEnt, SDEnt, MCon, SDCon and MCor) were chosen as the final texture features for modeling.


**Table 4.** Correlation of texture features with different disease classes.

Note: \*\* indicates significant correlation (0.01 < *p* < 0.05). \*\*\* indicates highly significant correlation (*p* < 0.001).

#### *3.3. Sensitivity Analysis of the Number of Convolutional Layers and Convolutional Kernel Size for the DCNN*

Figure 10 compares the accuracy obtained with different numbers of convolutional layers for each input feature set in the proposed model. The DCNNs constructed from the features obtained by SPA, RF, TFs, SPA + TFs and RF + TFs achieved their best classification accuracy when the residual block contained two convolutional layers, whereas for VIs and VIs + TFs the best results were obtained with three convolutional layers.

**Figure 10.** Effect of the number of convolutional layers in the proposed DCNN model on classification accuracy.

Based on the optimal number of convolutional layers, we investigated the effect of convolutional kernel size on classification accuracy through a set of experiments. Figure 11 compares the accuracy of models built with different kernel sizes. With a kernel configuration of (3,3), the DCNN models constructed from the features screened by SPA, RF, TFs, SPA + TFs and RF + TFs classified best, while the models constructed with VIs and VIs + TFs achieved their best accuracy with a kernel configuration of (1,3,3).

**Figure 11.** Comparison of the accuracy of models built with different sizes of convolutional kernels. Note: (3,3) denotes two convolutional layers with kernel sizes of 3 and 3; (1,3,3) denotes three convolutional layers with kernel sizes of 1, 3 and 3.

#### *3.4. DCNN-Based Disease Classification of Rice Leaf Blast*

#### 3.4.1. DCNN Model Training and Analysis

The modeling was carried out using the 4930 rice leaf blast records for the different disease classes as samples (including the data obtained by data augmentation), with the training, validation and test sets divided in a 7:1:2 ratio. Training experiments were carried out for the seven DCNN models with different dimensionality reduction methods shown in Figure 5. The overall accuracy (OA), Kappa coefficient and F1-score were selected as the model evaluation criteria. The Nadam algorithm [52] was used to train the DCNN models, with the same learning rate for all layers of the network: an initial learning rate of 0.002 and exponential decay rates of 0.9 and 0.999 for the first- and second-order moments, respectively. Since weight initialization has a large impact on the convergence speed of training, a normal distribution with a mean of 0 and a standard deviation of 0.01 was used to initialize the weights of all layers, and the biases of the convolutional and fully connected layers were initialized to 0. To determine the best disease classification features and classification models, each DCNN model was fully trained; the epochs for SPA-DCNN, RF-DCNN, VIs-DCNN, TFs-DCNN, SPA + TFs-DCNN, RF + TFs-DCNN and VIs + TFs-DCNN were 200, 180, 300, 150, 150, 150 and 250, respectively. The training results of the different DCNN models are shown in Figure 12.
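
The optimizer and initialization described above translate directly into Keras 2.3 calls, as sketched below using the `build_dcnn` sketch from earlier; the batch size and the dummy tensors standing in for the real 7:1:2 split are assumptions.

```python
import numpy as np
from keras.optimizers import Nadam
from keras.initializers import RandomNormal
from keras.utils import to_categorical

# weights ~ N(0, 0.01); pass as kernel_initializer=init when building
# the layers (bias_initializer defaults to zeros, as in the text)
init = RandomNormal(mean=0.0, stddev=0.01)

model = build_dcnn(n_features=15)   # e.g., 8 SPA wavelengths + 7 TFs
model.compile(optimizer=Nadam(learning_rate=0.002, beta_1=0.9, beta_2=0.999),
              loss="categorical_crossentropy", metrics=["accuracy"])

# dummy stand-ins with the shape of the training split (4930 * 0.7 = 3451)
x_train = np.random.rand(3451, 15, 1)
y_train = to_categorical(np.random.randint(0, 5, 3451), 5)
model.fit(x_train, y_train, epochs=150, batch_size=32)  # batch size assumed
```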


**Figure 12.** Change of loss function value and accuracy with iteration curves.

As can be observed from Figure 12, the training error of all DCNN models gradually decreases as the number of iterations increases and finally converges. At the beginning of training, the loss decreases rapidly as the gradient of the loss function is updated with small batches of samples, indicating that the batch size and optimization algorithm were chosen appropriately. In addition, as the training loss decreases, the prediction accuracy of the models on the training set shows an overall upward trend.

#### 3.4.2. DCNN Model Testing and Analysis

In order to obtain the best leaf blast classification features, spectral features, vegetation indices, texture features (TFs) and their fusion features were used to construct the DCNN leaf blast classification model. The modeling results are shown in Table 5.


**Table 5.** Results of the DCNN disease classification model based on different features.

The data in Table 5 show that all seven DCNN models designed on the basis of different features classify the disease degrees with high accuracy, with OA greater than 88% and Kappa coefficients greater than 85%. Among the single-feature DCNN models, the feature wavelengths selected by the SPA and RF methods gave the better classification results, with OA/Kappa of 97.67%/96.75% and 97.08%/95.93%, respectively. The DCNN model constructed from TFs, although less accurate than the spectral feature wavelength models, still achieved good classification, indicating that the image data also have the ability to identify rice leaf blast. Among the fused-feature DCNN models, SPA + TFs-DCNN obtained the highest classification accuracy, with OA and Kappa of 98.58% and 98.22%, respectively, and its F1-scores for the individual disease classes exceeded those of the other fusion features: 100%, 100%, 100%, 96.48% and 96.68% for levels 0, 1, 2, 3 and 4, respectively. This result shows that fusing the spectral wavelengths screened by SPA with textural features represents the valid information about the different disease levels in rice more accurately.

#### 3.4.3. Comparison with Other Classification Models

The model in this paper was analyzed and compared with six classification models, namely Inception V3, ZF-Net, BiGRU, TextCNN, SVM and ELM. The classification results of the six models are shown in Table 6.



As can be observed from Table 6, all six models achieved good accuracy in disease classification. For every model, fusing the spectral wavelengths screened by SPA with texture features as input gave the best classification accuracy, with OA and Kappa greater than 90% and 88%, respectively. For the identification of the individual disease classes, the F1-scores were greater than 84% for levels 0, 2 and 4 and greater than 82% for levels 1 and 3 (Appendix A, Tables A1–A3). The experimental results also show that fusing spectral feature wavelengths with texture features enhances the classification ability of the models. Compared to the machine learning models (SVM and ELM), the OA, Kappa and F1-scores of the model in this paper are significantly improved: OA and Kappa improved by 3.04% and 3.81%, respectively, over the SPA + TFs-SVM model, and by 6.91% and 8.63%, respectively, over the SPA + TFs-ELM model. Compared with the four deep learning models, the classification accuracy of ZF-Net, Inception V3, TextCNN and BiGRU is lower than that of the present model; their results for the one-dimensional disease data were similar, with the best model in each case constructed from the SPA + TFs features (OA > 97%, Kappa > 96%). The comparative analysis of different input features and modeling methods therefore shows that the fusion of the spectral feature wavelengths extracted by SPA with texture features is the best feature set for leaf blast classification, and the DCNN model proposed in this paper classifies the disease classes most accurately.

We compared the performance of the models constructed from the best classification features (SPA + TFs) in terms of OA and test time, as shown in Table 7. The deep learning models took significantly more time than the machine learning models on the 986 test samples, but the machine learning models fell short in OA. Among the deep learning models, the convolutional networks took significantly less time than the recurrent network (BiGRU), which may be because BiGRU is trained in a fully connected manner and requires more parameters. In comparison with the Inception V3, ZF-Net and TextCNN models, our proposed model has the highest classification accuracy and the shortest testing time: on the 986 test samples, disease classification took only 0.22 s. Our proposed DCNN model therefore has the best overall classification performance.


**Table 7.** Results of model detection efficiency comparison.

#### **4. Discussion**

At present, the identification and severity grading of rice blast is mainly carried out through the subjective judgment of plant protection personnel, which requires strong professional expertise yet is inefficient. Hyperspectral imaging is a highly promising disease detection technology that has attracted the interest of scholars because it is non-destructive, fast and accurate [53,54].

This study first pre-processed the hyperspectral imaging data to extract rice leaf samples of different disease classes and increased the number of samples by data augmentation methods. Secondly, in order to reduce the dimensionality of hyperspectral data, methods such as SPA, RF, the contour of decision coefficient and GLCM were used to screen spectral features, vegetation indices and texture features. Finally, deep learning and machine learning methods were used to construct rice leaf blast classification models and to determine the best classification features and classification models for leaf blast.

When a crop is infested with a disease, a range of physiological parameters of the rice change, such as the chlorophyll content, water content and cell structure [55]. The changes in these physiological parameters are reflected both in the spectral reflectance curves (Figure 6) and in the crop image features (Figure 1). When rice leaves were infested with leaf blast, the leaf blast level showed a correlation with the change in the mean spectral curve. In the visible wavelength range, the spectral reflectance increased slightly, because the rhombus-shaped lesions formed on leaf cells infected with *Magnaporthe grisea* reduce the cytochrome content and activity and weaken the absorption of light. At the same time, as the chlorophyll content decreased, the absorption band narrowed and the red edge (680~770 nm) shifted toward shorter wavelengths, producing the "blue shift" phenomenon. The 770~1000 nm range correlates more strongly with the internal structure of the leaves: compared with healthy leaves, the cell layers inside the diseased leaves were reduced and the spectral reflectance decreased [51]. These phenomena therefore provide a basis for obtaining grading characteristics of leaf blast.

In this work, the focus was on using hyperspectral imaging data to determine the best classification features and classification models for leaf blast. For dimensionality reduction, the SPA and RF methods were used to screen the spectral feature wavelengths, yielding 8 and 13 feature wavelengths, respectively, as shown in Table 3. The contour-of-decision-coefficient method was used to extract the three best vegetation indices, all with R<sup>2</sup> greater than 0.8, and the seven best texture features were selected by combining the GLCM with correlation analysis, as shown in Table 4. In DCNN modeling, the network depth and the number and size of convolutions can seriously affect performance [56]. We therefore borrowed the design concept of ResNet and adjusted its network depth and convolutional layer parameters through multiple tests to determine the best model structure, adding BatchNorm and Dropout layers to avoid overfitting while maintaining accuracy. Seven DCNN-based rice blast classification models were constructed from the different input features. The results show that all seven DCNN models have high classification accuracy, with OA greater than 88% and Kappa coefficients greater than 85%. One reason may be that the DCNN follows the ResNet design concept and adopts a "shortcut" structure, which passes the full information of the previous layer into each residual module and thus preserves more of the original information. At the same time, the data augmentation method increased the quantity and diversity of the samples, further enhancing the generalization capability of the model. Comparing the DCNN models constructed with different features, the models built on fused features all achieved high classification accuracy. The highest accuracy was obtained by SPA + TFs-DCNN, with OA and Kappa of 98.58% and 98.22%, respectively, and F1-scores of 100%, 100%, 100%, 96.48% and 96.68% for levels 0 to 4. This suggests that fusing spectral and texture features improves the classification accuracy of the model, which is consistent with previous studies [57].

To further verify the best classification features and classification model, the model in this paper was compared with the Inception V3, ZF-Net, BiGRU, TextCNN, SVM and ELM models. The SVM and ELM results showed that the model built from the SPA-screened feature wavelengths combined with TFs had the best classification accuracy, but the OA, Kappa and F1-scores of both classifiers were significantly lower than those of the DCNN model. The reason may be that the convolutional layers of the DCNN further extract disease features and obtain clearer differences between the disease classes, improving model accuracy. The classification accuracies of ZF-Net, Inception V3, TextCNN and BiGRU were likewise all lower than that of the model in this paper. This may be because our model uses the shortcut structure of ResNet to retain more of the fine-grained differences between disease classes, whereas models such as Inception V3 gradually discard fine-grained features and retain coarse-grained features as the network deepens and the iterations progress. For intra-class classification problems, fine-grained features are the key to achieving higher accuracy.

Therefore, the comparative analysis of different input features and different modeling methods leads to the conclusion that the DCNN model constructed from the fusion of the feature wavelengths acquired by SPA and the texture features has the highest classification accuracy. It can accurately classify the severity of rice leaf blast and provides technical support for the next step of UAV hyperspectral remote sensing monitoring of rice leaf blast. It is worth noting that only rice leaf blast was modeled and analyzed in this study; other leaf diseases of rice were not considered. Future research will therefore explore the best classification features for different rice diseases and establish a more representative, generalized and comprehensive disease classification model.

#### **5. Conclusions**

Leaf blast, a typical disease of rice, has major impacts on the yield and quality of grain. In this study, an indoor hyperspectral imaging system was used to acquire hyperspectral images of leaves. Given the limited hyperspectral data, data augmentation methods from existing studies were used to expand the sample data from 145 to 4930. Spectral features, vegetation indices and texture features were then extracted from the augmented hyperspectral data, and these features and their fusions were used to construct leaf blast classification models. The results showed that the models constructed from fused features were significantly more accurate in classifying the degree of leaf blast disease than those constructed from single feature sets. The best performance was achieved by combining the SPA-screened spectral features (450, 543, 679, 693, 714, 757, 972 and 985 nm) with the textural features (MEne, SDEne, MEnt, SDEnt, MCon, SDCon and MCor). The modeling results also showed that the proposed DCNN model classified disease better than the traditional machine learning models (SVM and ELM), with improvements of 3.04% and 6.91% in OA and 3.81% and 8.63% in Kappa, respectively. Compared with the deep learning models Inception V3, ZF-Net, BiGRU and TextCNN, this model also has the best classification accuracy: compared to ZF-Net and TextCNN, OA and Kappa both improved by 0.81% and 1.02%, respectively, and compared to Inception V3 and BiGRU, OA and Kappa improved by 1.52% and 1.22%, and by 1.9% and 1.52%, respectively. This study therefore confirms the great potential of the proposed one-dimensional deep convolutional neural network for disease classification, and the best fusion features identified here can further improve the accuracy of disease classification models. In future work, we will explore the classification features of rice diseases such as sheath blight and bacterial blight to establish a more stable, accurate and comprehensive disease classification model.

#### **Appendix A**


**Table A1.** F1-score for the SVM and ELM models.

**Table A2.** F1-score for the Inception V3 and ZF-Net models.

**Table A3.** F1-score for the BiGRU and TextCNN models.




**Author Contributions:** Conceptualization, S.F., Y.C., T.X. and G.Z.; methodology, S.F.; software, S.F.; validation, S.F., F.Y., G.Z. and D.Z.; formal analysis, S.F. and T.X.; investigation, S.F.; resources, S.F. and G.Z.; data curation, S.F., G.Z. and D.Z.; writing—original draft preparation, S.F.; writing—review and editing, S.F. and T.X.; visualization, S.F.; supervision, T.X.; project administration, T.X.; funding acquisition, T.X. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Liaoning Provincial Key R&D Program Project (2019JH2/ 10200002).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Data sharing is not applicable to this article.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Detection and Classification of Rice Infestation with Rice Leaf Folder (***Cnaphalocrocis medinalis***) Using Hyperspectral Imaging Techniques**

**Gui-Chou Liang <sup>1</sup>, Yen-Chieh Ouyang <sup>2</sup> and Shu-Mei Dai <sup>1,</sup>\***

Received: 15 October 2021; Accepted: 10 November 2021; Published: 15 November 2021


**Abstract:** The detection of rice leaf folder (RLF) infestation usually depends on manual monitoring, and early infestations cannot be detected visually. To improve detection accuracy and reduce human error, we use push-broom hyperspectral sensors to scan rice images and use machine learning and deep learning methods to detect RLF-infested rice leaves. Different from traditional image processing methods, hyperspectral imaging data analysis is based on pixel-based classification and target recognition. Since the spectral information itself is a feature and can be treated as a vector, the deep learning networks do not need convolutional layers to extract features. To correctly detect the spectral image of rice leaves infested by RLF, we use the constrained energy minimization (CEM) method to suppress the background noise of the spectral image. A band selection method was utilized to avoid the computational cost of processing the full band, and six bands were selected as candidate bands. A band expansion process (BEP) was then utilized to expand the vector length and compensate for the spectral information compressed by band selection. We use CEM and deep neural networks to detect defects in the spectral images of infested rice leaves and compare the performance of each on the full band, after band selection, and after the BEP. A total of 339 hyperspectral images were collected in this study; the results showed that six bands were sufficient for detecting early infestations of RLF, with a detection accuracy of 98% and a Dice similarity coefficient of 0.8, which is advantageous for the commercialization of this approach.

**Keywords:** rice; rice leaf folder; hyperspectral imaging; band selection; hyperspectral image classification; target detection

#### **1. Introduction**

Rice leaf folder (RLF), *Cnaphalocrocis medinalis* Guenée, is widely distributed in the rice-growing regions of humid tropical and temperate countries [1], and the developmental time of RLF decreases with an increase in temperature [2]. Due to global warming, RLF has become one of the most important insect pests of rice cultivation [3]. The larvae of RLF fold the leaves longitudinally and feed on the mesophyll tissue within the folded leaves. The feeding of RLF generates lineal white stripes (LWSs) in the early stage, which then enlarge into ocher patches (OPs) and membranous OPs [4]. As the infestation of RLF increases, the number and area of OPs increase. The feeding of RLF not only reduces the chlorophyll content and photosynthesis efficiency [4] but also opens a path for fungal and bacterial infection [5]. Severe damage caused by RLF may therefore cause 63–80% yield loss [6], and the largest area of rice cultivation damaged in a single year exceeded 30,000 hectares [7].

The economic injury level of RLF, which is important for the determination of insecticide applications, has been established by the International Rice Research Institute as 4.2% damaged leaves and 1.3 larvae per plant [8]. However, visual inspection for damage is laborious and time-consuming. In addition, RLF is a long-distance migratory insect pest. The uncertain timing of the appearance of RLF means that farmers are unable to predict pest arrival, so to avoid damage by undetected infestations, farmers often preventively spray chemical insecticides, which generates unnecessary costs and environmental pollution [9,10].

Hyperspectral imaging (HSI) is a novel technique that combines the simultaneous advantages of imaging and spectroscopy and that has been investigated and applied in crop protection [11–15]. HSI, which contains spatial and spectral information, is illustrated in Figure 1. The external and internal damage caused by pest infestations, such as yellowing/attenuation/defects and loss of pigments/photosynthetic activity/water content, respectively, can be identified by this system through image or spectral reflectance. Further automatic detection can be fulfilled by taking advantage of pest damage detection algorithms. For instance, constrained energy minimization (CEM) [16] and principal component analysis (PCA) [17] have been employed for band selection, and support vector machines (SVMs) [18], convolutional neural networks (CNNs) [19], and deep neural networks (DNNs) [20] are utilized for classification. Fan et al. [21] applied a visible/near-infrared hyperspectral imaging system to detect early invasion of rice streak insects; using the successive projections algorithm (SPA) [22] and PCA to identify key wavelengths and a back-propagation neural network (BPNN) [23] as the classifier, the classification accuracy of the calibration and prediction sets reached 95.65%. Chen et al. [24] also employed a visible/near-infrared hyperspectral imaging system to acquire images and further developed a hyperspectral insect damage detection algorithm (HIDDA) to detect pests in green coffee beans; the method combines CEM and SVM and achieves 95% accuracy and a 90% kappa coefficient. In addition, spectroscopy technology has been applied to detect plant diseases [25], the quality of agricultural products [26], and pesticide residues [27].

**Figure 1.** Two-dimensional projection of a hyperspectral cube.

To effectively manage RLF with a rational application of insecticides, an artificial intelligence-based inspection of economic injury levels is necessary. The purpose of this study is to establish a model for detecting early infestation of RLF based on visible-light hyperspectral data exploration techniques and deep learning technology. The specific objectives include (1) predefining the region of interest (ROI); (2) data preprocessing through band selection and a band expansion process (BEP); (3) combining a deep learning network to train the model and classify multiple different levels of damage; (4) using an automatic target generation program (ATGP) algorithm [28] to test unknown samples, fully automating the workflow and shortening the prediction time; and (5) establishing the spectral signatures of damaged leaves caused by RLF, which can serve as an expert system providing valuable guidance on the best timing of insecticide application.

#### **2. Materials and Methods**

#### *2.1. Insect Breeding*

The RLF used in this study was collected from the Taichung District Agricultural Research and Extension Station. The larvae were raised in insect rearing cages (47.5 × 47.5 × 47.5 cm<sup>3</sup>, MegaView Science Co., Ltd., Taichung, Taiwan) with corn seedlings (agricultural friend seedling Yumeizhen) and maintained at 27 ± 2 °C and 70% relative humidity under a photoperiod of 16:8 h (L:D). The adults were reared in a cage with 10% honey at 27 ± 2 °C and 90% relative humidity, which allows the adults to lay more eggs.

#### *2.2. Preparation of Rice Samples*

The variety Tainan No. 11, the most widely planted cultivar in Taiwan, was selected for this study. The rice plants were grown in a greenhouse to prevent infestation by other insect pests and diseases. To obtain different levels of damage caused by RLF, e.g., LWS and OP, 1st-, 2nd-, 3rd-, 4th-, or 5th-instar larvae of RLF were manually introduced to infest 40-day-old healthy rice for seven days, and three replicates were conducted for each treatment. Three different types of samples, shown in Figure 2, i.e., healthy leaves (HL), LWS, and OP caused by RLF, were prepared for image acquisition and spectral information extraction.

**Figure 2.** Appearance of healthy and damaged leaf types. (**a**) Healthy leaves, (**b**) lineal white stripe (LWS) caused by RLF (blue arrow) and LWS enlarge into ocher patch (OP) (yellow arrow) on Day 1 (D1), (**c**) LWS and OP on D2, and (**d**) OP on D6.

#### *2.3. Hyperspectral Imaging System and Imaging Acquisition*

#### 2.3.1. Hyperspectral Sensor

The hyperspectral scanning system employed in the experiment is shown in Figure 3. The hyperspectral image capturing system was composed of the following equipment: a hyperspectral sensor, halogen light sources, a conveyor system, a computer, and a photographic darkroom isolated from external light sources. The hyperspectral sensor utilized in the study was a V10E-B1410CL sensor (Isuzu Optics), which covered the visible and near-infrared (VNIR) range from 380–1030 nm with a resolution of 5.5 nm and 616 bands for imaging. The camera sensor is an ImSpector spectral camera, SW ver. 2.740. The halogen light sources used to illuminate the image were "3900e-ER" units with a power of 150 W. The halogen lights illuminated the left and right sides simultaneously and were focused on the conveyor track at an incident angle of 45 degrees to reduce shadow interference during the sampling process. The temperature and relative humidity in the laboratory were kept at 25 °C and 60%, respectively. A conveyor belt was designed to deliver the rice plants for acquiring hyperspectral images by line scanning (Figure 3). Both the speed of the conveyor belt and the halogen lights were controlled by computer software. The distance between the VNIR sensor and the rice sample was 0.6 m.

**Figure 3.** Hyperspectral imaging system.

#### 2.3.2. Image Acquisition

The damage to leaves infested with different RLF larvae (from 1st to 5th instar) for various durations of feeding (1–6 days) was assessed using VNIR hyperspectral imaging. Leaves were placed flat on the conveyor belt and scanned at every 90° turn to enlarge the dataset. The exposure time for scanning was 5.495 ms, and the number of pixels in each scan row was 816. Healthy leaves without RLF infestation were selected as the control. Before taking the VNIR hyperspectral images, light correction was conducted, and all imaging was conducted in a dark box to avoid interference from other light sources. In total, 339 images were taken, including 52 images of healthy leaves and 69, 32, 48, 52, 52, and 34 images of leaves infested for 1 to 6 days, respectively.

#### 2.3.3. Calibration

To eliminate the impacts of uneven illumination and dark current noise, the object scan, a dark reference, and a white reference are needed for normalization. The original hyperspectral image was therefore calibrated according to the following formula [21]:

$$R_C = \frac{R_0 - B}{W - B} \tag{1}$$

where *R*<sub>0</sub> is the raw hyperspectral image, *R*<sub>C</sub> is the hyperspectral image after calibration, *W* is the standard white reference obtained with a Teflon rectangular bar, and *B* is the standard black reference obtained by covering the lens with a lens cap.

#### *2.4. Spectral Information Extraction*

Removing the background of the image helps extract useful spectral information and reduces noise. The background removal process performs binary segmentation with the Otsu method, dividing the image into background and meaningful parts with similar features and attributes [29], the latter including healthy, RLF-infested, and otherwise defective leaves. To reduce unnecessary analysis work, the first step is to separate plant pixels from non-plant pixels. This is done by converting the true-color image to grayscale or generating a single-channel (grayscale) image from a simple index (e.g., Excess Green [30]). Second, a threshold is obtained with the Otsu method; the grayscale value of each pixel is compared with the threshold, and the pixel is classified as target or background based on the result of the comparison [31]. Since plants and backgrounds have very different characteristics, they can be separated quickly and accurately.
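
A compact sketch of this two-step segmentation follows; the band indices standing in for blue, green and red and the Excess-Green-style index are assumptions.

```python
import numpy as np
from skimage.filters import threshold_otsu

def remove_background(cube):
    """Separate plant from non-plant pixels with the Otsu method.

    Builds a single-channel index image from bands assumed to lie near
    blue, green and red, thresholds it with Otsu, and masks the cube."""
    b, g, r = cube[:, :, 20], cube[:, :, 80], cube[:, :, 140]  # assumed bands
    exg = 2 * g - r - b                     # Excess-Green-style index
    mask = exg > threshold_otsu(exg)        # True on plant pixels
    return cube * mask[:, :, None], mask
```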

Third, the background-removed images were used to determine the ROI with the CEM algorithm [16]. CEM has been widely employed for target detection in hyperspectral remote sensing imagery. CEM detects the desired target signal source by imposing a unity constraint on the target while minimizing the average output energy, thereby suppressing noise and unknown signal sources. The algorithm generates a finite impulse response filter from a given vector, the d value, to suppress regions unrelated to the features of the ROI. In this study the vector is the spectral reflectance of a pixel, and the ROI was predefined as an RLF-infested region in the images of rice leaves, e.g., Figure 2b,c. The CEM output enhances pixels whose characteristics are similar to the target feature d value. Using the Otsu method, a pixel whose value exceeds the threshold is assigned a feature similarity of 1, and 0 otherwise, yielding a binary image. This algorithm is an efficient method of pixel-based detection [32].
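
The CEM filter itself has a closed form, sketched below in numpy; the ridge term added before inversion is an assumption for numerical stability.

```python
import numpy as np

def cem(cube, d, eps=1e-6):
    """Constrained energy minimization (CEM) detector sketch.

    cube: (rows, cols, bands); d: (bands,) target signature.
    The filter is w = R^-1 d / (d^T R^-1 d), where R is the sample
    autocorrelation matrix; the output is w^T x for each pixel x."""
    rows, cols, bands = cube.shape
    X = cube.reshape(-1, bands).astype(float)
    R = X.T @ X / X.shape[0]                       # autocorrelation matrix
    Rinv = np.linalg.inv(R + eps * np.eye(bands))  # regularized inverse
    w = Rinv @ d / (d @ Rinv @ d)                  # unity constraint at d
    return (X @ w).reshape(rows, cols)
# threshold the output with Otsu (as described above) to get the binary ROI
```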

#### *2.5. Band Selection*

Since HSIs usually contain hundreds of spectral bands, full-band analysis of the spectrum is not only time-consuming but also highly redundant. To decrease the analysis time and redundancy, the first step of data analysis is to determine the key wavelengths. This is achieved by selecting highly correlated wavelengths by comparing reflectance and by maximizing the representativeness of the information through decorrelation. Various band selection methods based on statistical criteria have been proposed for this purpose [33]. The concept of band selection is similar to feature extraction in image processing and can improve the accuracy of identification and classification.

#### 2.5.1. Band Prioritization

In the band prioritization (BP) step, the priority of the spectral bands is calculated by statistical criteria [27]. Five criteria were chosen to calculate the priority of the spectral bands in this work: variance, entropy, skewness, kurtosis, and signal-to-noise ratio (SNR). Each spectral band thus receives a priority score and can be ranked accordingly.

#### 2.5.2. Band Decorrelation

When applying BP in the band selection process, the correlation between bands strongly affects the priority score: neighboring bands are frequently selected together because they are highly correlated. These redundant spectral bands do not help improve detection performance, so band decorrelation (BD) is utilized to remove them.

In this study, spectral information divergence (SID) [34] was applied for BD and utilized to measure the similarity between two vectors. By calculating the SID value, a threshold will be set to remove the bands with high similarity. The formula is:

$$SID(b_i, b_j) = D(b_i \parallel b_j) + D(b_j \parallel b_i) \tag{2}$$

The parameter *b* represents a vector of spectral information, and *D*(*bi* ∥ *bj*) denotes the Kullback–Leibler divergence, that is, the average difference between the self-information of *bj* and the self-information of *bi*, and vice versa.
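
Equation (2) can be computed directly once the two band vectors are normalized to probability distributions, as in this sketch; the epsilon guards are assumptions.

```python
import numpy as np

def sid(bi, bj, eps=1e-12):
    """Spectral information divergence between two band vectors
    (Equation (2)): the symmetric sum of Kullback-Leibler divergences
    computed on the vectors normalized to probability distributions."""
    p = bi / (bi.sum() + eps)
    q = bj / (bj.sum() + eps)
    d_pq = np.sum(p * np.log((p + eps) / (q + eps)))
    d_qp = np.sum(q * np.log((q + eps) / (p + eps)))
    return d_pq + d_qp
```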

#### *2.6. Band Expansion Process*

Although band selection reduces storage space and processing time, some of the original spectral features are lost. To address this loss of information, the divergence between spectra can be increased by expanding the bands. The concept of the BEP [35] derives from the fact that a second-order random process is generally specified by its first-order and second-order statistics. Correlated multispectral images provide missing but useful second-order statistical information about the original hyperspectral images. The second-order statistics used in the BEP include autocorrelation, cross-correlation, and nonlinear correlation, which create nonlinearly correlated images. The idea of generating second-order correlated band images coincides with the covariance functions employed in signal processing to generate random processes. Even though the band expansion process may have no direct physical interpretation, it provides an important advantage in addressing the problem of an insufficient number of spectral bands.
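
One common instantiation of the BEP is sketched below: the six selected bands are augmented with their squares (auto-correlation terms) and pairwise products (cross-correlation terms), giving 6 + 6 + 15 = 27 features, which matches the 27-neuron input quoted in Section 2.7. The exact set of nonlinear terms used by the authors is an assumption.

```python
import numpy as np
from itertools import combinations

def band_expansion(X):
    """Band expansion process (BEP) sketch for an (samples, bands) matrix.

    For 6 input bands this yields 6 originals + 6 squares + 15 pairwise
    products = 27 expanded features per pixel."""
    cols = [X]                                      # original bands
    cols.append(X ** 2)                             # auto-correlation terms
    for i, j in combinations(range(X.shape[1]), 2):
        cols.append((X[:, i] * X[:, j])[:, None])   # cross-correlation terms
    return np.hstack(cols)
```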

#### *2.7. Data Training Models*

Hyperspectral imaging data analysis is based on pixel-based classification and target recognition, using low-level features (such as spectral reflectance and texture) as the foundation; the feature representation output at the top of the network can be fed directly to subsequent classifiers for pixel-based classification [36]. This kind of pixel-wise classification is particularly suitable for deep learning algorithms, which learn representative and discriminative features from the data in a hierarchical manner. In this study, the input neurons take the reflectance of a pixel: the input layer has 466 neurons for the full band, 6 neurons after band selection, and 27 neurons after band expansion. As shown in Figure 4a,b, the reflectance of the HL, D1 OP, and D6 OP samples was divided into three categories. The model was trained with four hidden layers and a learning rate of 0.001. A softmax classifier at the DNN output produced the classification of each spectrum. The classified result was compared with the ground truth to calculate the accuracy. The model was cross-validated ten times, and the average was taken as the overall accuracy (OA).
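
A minimal sketch of such a pixel-wise classifier follows; the use of Keras and the hidden-layer widths are assumptions, while the input widths (466/6/27), the four hidden layers, the learning rate of 0.001 and the softmax output come from the text.

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

def build_dnn(n_inputs, n_classes=3, width=64):
    """Pixel-wise spectral classifier: n_inputs is 466 (full band),
    6 (band selection) or 27 (BEP); softmax over HL / D1 OP / D6 OP."""
    model = Sequential()
    model.add(Dense(width, activation="relu", input_shape=(n_inputs,)))
    for _ in range(3):                      # four hidden layers in total
        model.add(Dense(width, activation="relu"))
    model.add(Dense(n_classes, activation="softmax"))
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```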


**Figure 4.** (**a**) DNN model architecture. (**b**) Flowchart of classifying reflectance using DNN.

Figure 5 depicts the data training flowchart of this study, starting with hyperspectral image capture. First, the reflectance is extracted from the ROI, selected by the entomologist, as the ground truth. Second, DNN models are built from the full-band reflectance dataset and, processed in the same way, from the band-selected dataset. Last, the band-selected dataset is processed by the BEP to build a third DNN model.

The DNN model is constructed using three processes: full bands, band selection, and BEP. Each classification model has the best weight evaluated by its own model. Three DNN classification process models are constructed based on randomly distributed datasets, including 70% training, 15% validation, and 15% testing (as shown in Table 1). In the testing phase, the accuracy of each classification situation will be compared, and the OA of multiple classifications will be integrated. As a result, the most suitable model for identifying the classification was obtained.

**Table 1.** Number of pixels used for band section, training, and testing in the rice dataset.


<sup>1</sup> Band selection number = 5% of training number.

**Figure 5.** Data training flowchart of full bands, band selection, and band expansion process.

#### *2.8. Model Test for Unknown Samples*

To apply machine learning to the spectral reflectance of unknown samples of healthy leaves and early and late OPs, the first step is to quickly determine the ROI so as to reduce the time required for image recognition. To achieve this, a method combining the ATGP [28] and CEM is proposed. The ATGP is an unsupervised target recognition method that uses the concept of orthogonal subspace projection (OSP) to find distinct features without a priori knowledge. The ATGP was employed to identify the target pixel in the hyperspectral image, and all similar pixel data obtained were averaged as the d value for CEM.

Figure 6a,b shows the flowchart of the unknown sample prediction model. To automate the detection process, first, the full-band HSI, band selection, and BEP data of the rice sample must be calibrated. Second, using the combined ATGP and CEM method, the Otsu method is utilized to mark the ROI. The ROIs obtained from the full band, band selection, and BEP are classified by the corresponding DNN model and labeled HL, early OP, or late OP by entomologists according to the occurrence of damage caused by RLF. The labeled ROI is utilized to verify the prediction results of the DNN model. Five analysis methods, namely CEM\_Full-band→DNN\_Full-band, CEM\_band selection→DNN\_band selection, CEM\_band selection→DNN\_BEP, CEM\_BEP→DNN\_band selection, and CEM\_BEP→DNN\_BEP, are established to evaluate the prediction performance.

Last, the model classification results were visualized and overlaid on the original true-color images, and agricultural experts afterward verified the actual situation to compare the performance of the models.

#### *2.9. Prediction of Unknown Samples*

After a cross-validated predictive model has been established, a completely unknown sample with data different from the training set is needed to test its robustness. Eligible samples were obtained from the field. To keep other conditions fixed, the retrieved samples were also photographed with the push-broom hyperspectral camera.

Many different evaluation metrics have been mentioned in the literature. The confusion matrix [37] was selected as a measure of model accuracy. A true positive (TP) is a correct detection of the ground truth. A false positive (FP) is a detection that does not correspond to a true object. A false negative (FN) is a true object that is not detected.

However, relying on the confusion matrix alone is not enough. An additional set of common evaluation metrics was needed to facilitate a better comparison of classification models. The following metrics were employed for the evaluation in this study:

(i) recall

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{3}$$

(ii) precision

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{4}$$

(iii) Dice similarity coefficient

$$\text{Dice similarity coefficient} = 2 \times \frac{\text{TP}}{(2 \times \text{TP} + \text{FP} + \text{FN})} \tag{5}$$

Recall is the ability of the model to detect all relevant objects, i.e., to find all ground-truth bounding boxes in the validation set. Precision is the ability of the model to identify only relevant objects. The Dice similarity coefficient (DSC) is a set similarity measure commonly applied to calculate the similarity between two samples; its value ranges from 0 when there is no overlap to 1 when there is complete overlap.
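
The three metrics in Equations (3)–(5) reduce to a few lines of code; the following minimal Python sketch uses illustrative counts, not values from this study.

```python
def recall(tp, fn):
    """Equation (3): fraction of ground-truth objects that are detected."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Equation (4): fraction of detections that are correct."""
    return tp / (tp + fp)

def dice(tp, fp, fn):
    """Equation (5): 0 means no overlap, 1 means complete overlap."""
    return 2 * tp / (2 * tp + fp + fn)

# Illustrative counts only (not values from this study):
tp, fp, fn = 90, 10, 20
print(recall(tp, fn), precision(tp, fp), dice(tp, fp, fn))
# 0.818..., 0.9, 0.857...
```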

#### **3. Results and Discussion**

#### *3.1. Images and Spectral Signatures of Healthy and RLF-Infested Rice Leaves*

When larvae of RLF feed on rice leaves, they generate LWSs or OPs on the leaves. As time passes, the LWSs enlarge into a patch, the color of the patch gradually turns from white to ocher, and the images and spectral signatures of these patches change during this process, as shown in Figure 7a,b, respectively. The spectral signatures of HL and OP in Figure 7b were obtained manually, according to entomological experts. The OPs have higher reflectance than HL in the blue to red wavelength range, and among these spectral bands, the longer the infestation period, the higher the reflectance, e.g., day 6 (D6) > D5 > D2 > D1. However, only the reflectance of D6 OP is higher than that of HL at the NIR wavelengths (Figure 7b): the reflectance of D1 OP is much lower than the HL reflectance, and the reflectance of D2 and D5 OPs is approximately the same as that of healthy leaves. The decrease in NIR reflectance for D1 OP was mainly due to the destruction of the leaf structure, which caused photon scattering [38]. These results suggest that the early defects caused by RLF have very different spectral signatures from the subsequent damage of the infestation. These differences in spectral properties between the early and late phases of damage could serve as a basis for the early identification of RLF infestations.

**Figure 7.** (**A**) Hyperspectral images of healthy leaves on day 0 (**a**) and ocher patches (yellow arrow) infested by rice leaf folders on day 1 (**b**), day 2 (**c**), day 5 (**d**), and day 6 (**e**). (**B**) Spectral signature and corresponding hyperspectral images of the healthy leaves (D0) and ocher patches (from D1 to D6) caused by RLF.

#### *3.2. Band Selection and Band Expansion Process*

The HSI and spectral signatures from the full-band system shown in Figure 7a,b contain considerable redundant information that slows the analysis and consumes too much storage space. Therefore, band selection and the BEP were employed to select the most informative bands, increasing the analysis efficiency and reducing storage space. To detect early RLF infection more effectively, 5% of the training pixels of HL and D1 OP (as shown in Table 1) were chosen to perform band selection. Five criteria were utilized in BP to calculate the priority of each band from the full-band signatures of HL and D1 OP, and then a SID value of 2.5 was chosen as the threshold for BD to remove adjacent bands with high similarity for D1 OP. Six bands, at 489, 501, 603, 664, 684, and 705 nm, which had the largest difference in reflectance between HL and D1 OP, were selected as candidates through BP and BD using the entropy criterion (Figure 8a,b). To suit cheap and easy-to-use six-band handheld spectral sensors, only six bands were chosen. The results of band selection using the other four criteria are shown in Supplementary Table S1 and Figure S1. Furthermore, the six bands were expanded to 27 bands using the BEP to compensate for the information loss caused by band selection.
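
The paper does not give implementation details for BP and BD; the sketch below shows one plausible reading, ranking bands by the entropy of their image histograms (BP) and then greedily keeping a band only if its SID to every already-kept band exceeds the 2.5 threshold (BD). The normalization used to turn band images into probability vectors, and the function names, are assumptions.

```python
import numpy as np

def entropy_score(band, bins=256):
    """Shannon entropy of a band image's intensity histogram (BP criterion)."""
    p, _ = np.histogram(band, bins=bins, density=True)
    p = p[p > 0] / p[p > 0].sum()
    return -(p * np.log2(p)).sum()

def sid(x, y, eps=1e-12):
    """Spectral information divergence between two band vectors."""
    p = x / (x.sum() + eps) + eps
    q = y / (y.sum() + eps) + eps
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

def select_bands(cube, n_select=6, sid_threshold=2.5):
    """Rank bands by entropy (BP), then keep a band only if its SID to
    every previously kept band exceeds the threshold (BD)."""
    n_bands = cube.shape[-1]
    order = np.argsort([-entropy_score(cube[..., b]) for b in range(n_bands)])
    kept = []
    for b in order:
        if all(sid(cube[..., b].ravel(), cube[..., k].ravel()) > sid_threshold
               for k in kept):
            kept.append(b)
        if len(kept) == n_select:
            break
    return kept
```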

#### *3.3. ROI Detection with CEM in Full Bands, Band Selection, and Band Expansion Process*

CEM, a standard linear detector, was selected as a filter in this study to quickly identify the ROI; it increases the accuracy of automated detection and reduces the analysis time. The spectral signature of the OP that appeared on D1 in Figure 7b was employed as the d value of CEM to detect damaged leaves caused by RLF. Figure 9 shows the ROI detection results for the full band, band selection, and BEP cases, with the results of k-means clustering as a contrast. In the case of full bands, very little of the damage caused by RLF was detected (Figure 9c); the abundance of spectral data increases the complexity of detection and dilutes the spectral response caused by RLF. On the other hand, ROI detection in the band selection case reveals almost all the damage visible in Figure 9a, indicating that band selection achieves the best ROI detection performance through CEM (Figure 9d). In the case of the BEP, the result of CEM is better than for the full bands but not as good as for band selection (Figure 9e).

**Figure 9.** Region of interest detection with k-means (k = 10) or constrained energy minimization algorithm on different datasets using the reflectance of the D1 ocher patch as a d value on rice leaves. (**a**) True-color image, (**b**) k-means in full bands, (**c**) CEM in full bands, (**d**) band selection, and (**e**) band expansion process.

#### *3.4. DNN Model for Classification of Testing Dataset*

The DNN multilayer perceptron model is suited to HSI classification because the spectral reflectance of each pixel forms a vector. Even with few images, enough pixels are available as samples for analysis. Therefore, this study does not require thousands of images to train the deep learning models, which greatly reduces the tedious work of collecting samples and the difficulty of controlling sample conditions.

Table 2 describes the results of the OA verification using the DNN models of the full bands, band selection (6 bands), and BEP (27 bands). The confusion matrix [37] was utilized to evaluate the classification performance; the complete confusion matrix calculated for DNN classification is shown in Supplementary Figure S2. In the case of full bands, the OA (95%) and per-class performance are the best among the various classification cases, but classification takes longer (14.88 s) than with band selection or BEP. Band selection saves approximately half the time of the full bands but also reduces the classification accuracy. Except for HL, the accuracy of early and late OPs decreased after band selection, which may be attributed to the loss of some spectral information. The accuracy of the BEP is not higher than that of band selection, contrary to expectation; the BEP possibly amplifies noise that interferes with the classification. Among the five criteria, the OA of classification is best for the bands selected by entropy. For the entropy criterion, the accuracy for early OP with band selection is approximately 4% higher than with the BEP.


**Table 2.** Results for the testing dataset for DNN classification in different bands. The best performance is highlighted in red.

<sup>1</sup> Early OP comprises a set of D1 and D2 OP. <sup>2</sup> Late OP comprises a set of D5 and D6 OP. <sup>3</sup> OA is an abbreviation for overall accuracy.

#### *3.5. Prediction of Unknown Samples*

The predictions were carried out using ROIs obtained from full bands, band selection, and the BEP, as shown in Figure 6. CEM was applied to suppress the background and to detect the ROI. The DNN models of the full bands, band selection, and the BEP were used as classifiers to predict unknown samples through the five analysis methods. For band selection and the BEP, the bands selected by entropy were used according to the results of Table 2 to execute the prediction. Figure 10 shows the true-color image (a), the ground truth (b), and the predictions from an unknown sample (c–g). The ground truth was determined by entomologists and given different colors to distinguish HL (green) from OP (red). Figure 10c–g shows the classification results from the full bands, band selection, and BEP, also colored for visualization. Figure 10d,e shows the best results, as expected: the predicted ROI areas are approximately the same as the ground truth (Figure 10b). However, the predicted ROI in Figure 10c is scattered over the rice leaves beyond the ground-truth ROI.

**Figure 10.** Prediction of spectral information from unknown rice sample: (**a**) true-color image, (**b**) Ground Truth, (**c**) CEM\_Full-band→DNN\_Full-band, (**d**) CEM\_band selection→DNN\_band selection, (**e**) CEM\_band selection→DNN\_BEP, (**f**) CEM\_BEP→DNN\_band selection, and (**g**) CEM\_BEP→DNN\_BEP.

The performance of the pixel classification of the DNN models was verified by comparing the prediction results with the ground truth using a confusion matrix; the results are shown in Tables 3 and 4. Consistent with Figure 10, CEM\_band selection→DNN\_band selection showed the best prediction performance (Table 3), with the highest TP (correct identification of OP) and overall accuracy (OA) and the lowest FN (missed OP). However, very high false positives (FPs) were obtained from CEM\_Full-band→DNN\_Full-band, CEM\_band selection→DNN\_band selection, and CEM\_band selection→DNN\_BEP (Figure 10c–e). The high FP value of CEM\_Full-band→DNN\_Full-band may derive from the scattered distribution of the predicted ROI, while the high FP values of CEM\_band selection→DNN\_band selection and CEM\_band selection→DNN\_BEP may derive from predicted ROI areas that are undetectable by the naked eye. To test this interpretation, the images of Figure 10d,e were overlaid with the ground truth (Figure 10b). The extra predicted areas around the ground-truth ROI in Figure 11d,e are likely early RLF infestation that cannot be detected by the human eye.

To verify the necessity of using CEM to extract the ROI, the DNN classification results of the background-removed images are shown in Supplementary Table S2 and Figure S3. The results show that the accuracy of DNN classification after CEM processing is approximately 22% higher than that of the DNN applied directly to the background-removed images.


**Table 3.** Accuracy of DNN classification evaluated by the confusion matrix.

<sup>1</sup> Bands selected by entropy. <sup>2</sup> TP represents correct identification of OP; <sup>3</sup> FP denotes misidentification of HL as OP; <sup>4</sup> TN indicates correct identification of HL; <sup>5</sup> FN represents missed OP; <sup>6</sup> OP is positive, and <sup>7</sup> non-OP is negative.

**Figure 11.** Overlaid images of the predicted ROI with the ground-truth ROI for evaluating the performance of DNN classification. (**a**) Predicted ROI with CEM\_band selection→DNN\_band selection, (**b**) predicted ROI with CEM\_band selection→DNN\_BEP, (**c**) Ground Truth, (**d**) overlay of (**a**) and (**c**), (**e**) overlay of (**b**) and (**c**).

The performance of DNN classification was further evaluated by the metrics of recall, precision, accuracy, and DSC, as shown in Table 4. The analysis method of CEM\_band selection→DNN\_band selection was again rated as the best model for predicting unknown samples, as it had the highest accuracy, recall, and DSC and took the shortest time. Although the analysis method of CEM\_band selection→DNN\_BEP also showed reasonably good performance, the overall results indicated that six bands obtained from band selection are good enough to detect the early OP caused by RLF. The analysis method of CEM\_BEP→DNN\_band selection has the highest precision, but its recall and DSC are lower than those of CEM\_band selection→DNN\_band selection and CEM\_band selection→DNN\_BEP.

**Table 4.** Evaluation metrics of DNN prediction. The best performance is highlighted in red.


Taking the OP as an example, the pixels of the ROI were utilized for prediction evaluation, and a confusion matrix was employed to assess performance. As shown in Table 4, all analysis methods were successful in classification, and their accuracies reached at least 95%. The area classified as OP is smaller than the actual damaged area, as is the case in Figure 11e. As shown in Figure 11d for CEM\_band selection→DNN\_band selection, false positives were distributed around the OP, which means that earlier defects caused by insect pests could be identified as false-positive areas in hyperspectral images but could not be recognized in true-color images or by the human eye.

#### *3.6. Discussion*

Automatic detection of plant pests is extremely useful because it reduces the tedious work of monitoring large paddy fields, detects the damage caused by RLF at an early stage of pest development, and can eventually stop further plant degradation. This study proposes an automatic detection method that combines CEM and the ATGP. CEM is an efficient hyperspectral detection algorithm that can handle subpixel detection [39]. The quality of the CEM results is determined by the d value used as a reference; therefore, it is important to provide a plausible spectral feature. The ATGP was applied to identify the most representative feature vector from an unknown sample as the d value. Another limitation of CEM is that it provides only a rough detection result, so the DNN was selected to classify the reflectance of the ATGP→CEM detection results. In addition, band selection and the BEP were used to identify the key wavelengths among the five criteria to save time and improve accuracy. The accuracy of CEM\_band selection→DNN\_band selection in predicting unknown samples reached 98.1%. Traditional classifiers such as linear SVM (support vector machine) and LR (logistic regression) can be regarded as single-layer classifiers, while decision trees or SVMs with kernels are considered to have two layers [40,41]. However, deep neural architectures with more layers can potentially extract abstract and invariant features for better image or signal classification [42]. Our previous study detecting Fusarium wilt on Phalaenopsis showed this result [43]. In addition, we have used the Entoscan Plant imaging system to detect RLF infestation, but this system covers only 16 bands (390, 410, 450, 475, 520, 560, 625, 650, 730, 770, 840, 860, 880, 900, 930, and 960 nm) to obtain the Normalized Difference Vegetation Index; the results are shown in Supplementary Figure S4. It may not be specific enough to distinguish the damage caused by different pests. Therefore, we attempted to find a more representative vector from the spectral fingerprint of the hyperspectral imaging system to detect RLF infestation. At the same time, band selection was used to remove redundant information and reduce the time required for the automatic detection process: it not only reduces the time by a factor of 2.45 (from 8 min 11 s to 3 min 20 s) but also reaches a higher accuracy (0.981) than the full band (0.951). The time required for each stage of the prediction process is shown in Supplementary Figure S5. The six bands (489, 501, 603, 664, 684, and 705 nm) obtained through band selection are more representative than the bands supplied by the Entoscan Plant imaging system and can be applied to multispectral sensors on UAVs and portable instruments for field use. The methods, algorithms, and models established in this paper will be applied to other important rice insect pests and verified in the field using either UAVs or portable instruments carrying a multispectral sensor. In addition, a platform integrating all this information will be established to interact with farmers.

Other studies [44,45] used conventional true-color images, which can only classify spatial information based on color and shape and identify damage that is clearly visible to the naked eye. Compared with those studies, our DNN is based on hyperspectral sensors that provide spectral information, which can detect pixel-level targets while retaining the spatial information of the original image. The authors of [44,45] employed CNNs to detect pests and achieved classification accuracies of 90.9% and 97.12%, respectively; the accuracy of the method proposed in this paper is slightly higher. Although a CNN can simultaneously classify multiple insect pests and diseases, it often causes confusion among them. Moreover, those studies were conducted with images of the late damage stage and could not classify the level of infestation. In addition, most image classifications are trained by a CNN, which often needs a large number of training samples, and it is difficult to obtain sufficient training images in a short period of time. In contrast, hyperspectral image classification based on spectral pixels can be trained by a DNN, which means that even a single hyperspectral image provides a large amount of training data.

#### **4. Conclusions**

HSI techniques can provide a real-time monitoring system to guide the precise application and reduction of pesticides and to provide objective and effective options for the automatic detection of crop damage caused by insect pests or diseases. In this research, we propose a deep learning classification and detection method based on band selection and a BEP that can monitor leaf defects caused by RLF at the lowest cost. To compensate for the deficiencies caused by band selection, the BEP method was selected to improve the detection efficiency. The results on the test dataset show that full-band classification is the best and that band selection classification is better than the BEP. Except for the skewness and signal-to-noise ratio criteria, the accuracy of full-band classification is nearly 95%.

After using the trained model to predict the unknown samples, the results show that the CEM\_band selection→DNN\_band selection analysis method is the best model and reached the expected prediction performance. The maximum DSC is 0.80, meaning that 80% of its classification overlaps with the classification recognized by entomologists. In addition, we discovered that the predicted area of the model was larger than the area observed by the human eye. This phenomenon may indicate that RLF damage produces changes in parts of the spectrum that cannot easily be detected by the human eye. Furthermore, comparing the prediction operations based on the full-band DNN model and the band selection-based DNN model, the band selection method needs only 1% of the full-band time, which offers vast potential for wider applications with good rice identification capability. Only six bands are needed, reducing the technical cost required for on-site monitoring.

By providing more training data, the method also has significant room for improvement, for example by implementing a data augmentation process or by extending the data with other statistics, such as means or variances. While the current research has only been conducted in the laboratory or with non-specific multispectral images in the field, the handheld six-band sensor provided very good results, and its portability means that it could be adapted for field use to obtain realistic multispectral images on-site using band selection methods. In addition, most existing UAV approaches use CNNs or vegetation indices for analysis, and spectral reflectance has not been studied much. As mentioned in Section 3.5, the HSI prediction model can detect infested areas before they are noticed by the human eye. This technique can be extended to UAVs in the future to monitor invisible spectral changes on the leaf surface. Combining HSI techniques and deep learning classification models could provide real-time surveys that give on-site early warning of damage.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10.3390/rs13224587/s1, Figure S1: Bands selected through band prioritization and band decorrelation, Figure S2: Confusion matrix result of the DNN model, Figure S3: Prediction of spectral information from an unknown rice sample, Figure S4: Entoscan Plant imaging system, Figure S5: Approximate time required for each step of the prediction of unknown samples, Table S1: Results of the first six bands of band selection using different criteria, Table S2: The accuracy of DNN classification evaluated by the confusion matrix.

**Author Contributions:** Conceptualization, Y.-C.O. and S.-M.D.; methodology, Y.-C.O. and S.-M.D.; software, Y.-C.O.; validation, Y.-C.O. and S.-M.D.; formal analysis, G.-C.L.; investigation, G.-C.L.; resources, Y.-C.O. and S.-M.D.; data curation, G.-C.L.; writing—original draft preparation, G.-C.L.; writing—review and editing, Y.-C.O. and S.-M.D.; visualization, G.-C.L.; supervision, Y.-C.O. and S.-M.D.; funding acquisition, Y.-C.O. and S.-M.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Ministry of Science and Technology (MOST), Taiwan (Grant No. MOST 107-2321-B-005-013, 108-23321-B-005-008, and 109-2321-B-005-024), and Council of Agriculture, Taiwan (Grant No. 110AS-8.3.2-ST-a6). The APC was funded by MOST 109-2321-B-005-024.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We are grateful to Chung-Ta Liao from Taichung District Agricultural Research and Extension Station for RLF collection and maintenance. We would also like to thank the publication subsidy from the Academic Research and Development of NCHU.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Detection of Insect Damage in Green Coffee Beans Using VIS-NIR Hyperspectral Imaging**

**Shih-Yu Chen 1,2,\*, Chuan-Yu Chang 1,2, Cheng-Syue Ou <sup>1</sup> and Chou-Tien Lien <sup>1</sup>**


Received: 11 May 2020; Accepted: 20 July 2020; Published: 22 July 2020

**Abstract:** The defective beans of coffee are categorized into black beans, fermented beans, moldy beans, insect damaged beans, parchment beans, and broken beans, with insect damaged beans being the most frequently seen type. In the past, coffee beans were screened manually, and eye strain would induce misrecognition. This paper used a push-broom visible-near infrared (VIS-NIR) hyperspectral sensor to obtain images of coffee beans and further developed a hyperspectral insect damage detection algorithm (HIDDA), which can automatically detect insect damaged beans using only a few bands and one spectral signature. First, taking advantage of band selection methods developed from constrained energy minimization (CEM), namely constrained energy minimization-constrained band dependence minimization (CEM-BDM), minimum variance band prioritization (MinV-BP), maximal variance band prioritization (MaxV-BP), sequential forward CTBS (SF-CTBS), sequential backward CTBS (SB-CTBS), and principal component analysis (PCA), the bands were selected, and two classifier methods were then proposed. One combined CEM with a support vector machine (SVM) for classification, while the other used convolutional neural networks (CNN) and deep learning for classification, and the six band selection methods were then analyzed. The experiments collected 1139 beans and 20 images, and the results demonstrated that only three bands are really needed to achieve 95% accuracy and a 90% kappa coefficient. These findings show that 850–950 nm is an important wavelength range for accurately identifying insect damaged beans, and that HIDDA can indeed detect insect damaged beans with only one spectral signature, which will provide an advantage in the process of practical application and commercialization in the future.

**Keywords:** target detection; coffee beans; insect damage; hyperspectral imaging; band selection

#### **1. Introduction**

Coffee is one of the most widely consumed beverages, and high-quality coffee comes from healthy coffee beans, an important economic crop. However, insect damage is a hazard for green coffee beans, as the boreholes in green beans, also known as wormholes, cause a turbid or strange taste in the coffee made from such beans. Generally, coffee beans are inspected manually with the naked eye, which is laborious and error-prone, as visual fatigue often induces misrecognition. Even for an expert analyst, each batch of coffee takes about 20 min to inspect.

The international green coffee bean grading method is based on the SCAA (Specialty Coffee Association of America) Green Coffee Classification. This classification categorizes 300 g of properly hulled coffee beans into five grades according to the number of primary and secondary defects. Primary defects include full black beans, full sour beans, pod/cherry, etc.; one to two primary defects equal one full defect. Secondary defects include insect damaged, broken/chipped, partial black, partial sour, floater, shell, etc., where two to five secondary defects equal one full defect [1]. Specialty grade (Grade 1) shall have no more than five secondary defects and no primary defect in 300 g of coffee bean samples; at most, a 5% difference in screen mesh is permitted. These beans must have a special attribute in terms of concentration, fragrance, acidity, or aroma, with no defects or contamination. Premium grade (Grade 2) shall have no more than eight full defects in 300 g of coffee bean samples, and a maximum 5% difference in screen mesh is permitted; these beans must have a special attribute in terms of concentration, fragrance, acidity, or aroma, and the cup must be free of defects. The exchange grade (Grade 3) is permitted to have 9–23 full defects in 300 g of coffee bean samples; the test cup should be defect-free, and the moisture content should be 9–13%. Below standard grade (Grade 4) has 24–86 full defects in 300 g of coffee bean samples. Finally, the off-grade (Grade 5) has more than 86 full defects in 300 g of coffee bean samples.

In recent years, many coffee bean identification methods have been proposed, but few reports have used a spectral analyzer to evaluate the defects and impurities of coffee beans. The current manual inspection of defective coffee beans is time-consuming and unable to handle large quantities of samples. Therefore, this study used hyperspectral images for analysis, which provide more crucial spectral information than conventional RGB images for determining the spectral differences between healthy and defective coffee beans. Table 1 tabulates the green coffee bean evaluation methods proposed by previous studies.


**Table 1.** Existing green coffee bean evaluation methods.

In 2019, Oliveri et al. [2] used VIS-NIR to identify black beans, broken beans, dry beans, and dehydrated coffee beans, using principal component analysis (PCA) and the k-nearest neighbors algorithm (k-NN) for classification. Although their method can extract effective wavebands, the disadvantage is that the recognition rate is only 90%; as k-NN classifies by a qualified majority, it is prone to overfitting and underfitting. In 2018, Caporaso et al. [3] used hyperspectral imaging to recognize the origin of coffee beans, applying a support vector machine (SVM) to classify the origins. Their method is similar to that used in this paper, and its advantage is the richer spectral information of hyperspectral imaging. Although SVM and partial least squares (PLS) multi-dimensional classification can classify green coffee beans effectively, the bands were not selected according to the material, and the recognition rate was 97% among 432 coffee beans. Zhang et al. [4] proposed a hyperspectral analysis that used moving average smoothing (MA), wavelet transform (WT), empirical mode decomposition (EMD), and a median filter for the spatial preprocessing of gray-level images of each wavelength, and finally used SVM for classification. The advantage of their method is that the preprocessing treats the data as signals rather than images, with SVM used for classification. The disadvantages are that only second derivatives were used for band selection, the material was not analyzed, and the accuracy on 1200 coffee beans was only slightly higher than 80%. There have been a few reports on traditional RGB images. García [5] used k-NN to classify sour beans, black beans, and broken beans. The limitation of that method is that k-NN is prone to overfitting and underfitting; as the classified coffee beans were relatively clear target objects, the accuracy on about 444 coffee beans was 95%. Later, Arboleda [6] used thresholds to classify black beans. The defect of that method is that only a threshold was used, so if the external environment changes, the threshold changes accordingly; as the classified target objects were relatively apparent, the accuracy was higher, at 100% on 180 coffee beans.

Black beans, dry beans, dehydrated beans, and sour beans are still clearly recognizable coffee beans, only with very different colors; the differences in appearance are obvious in traditional color images. Most prior studies have used black beans as experimental targets because black beans are quite different from healthy beans, and broken beans can be identified using morphological analysis. Unlike the aforementioned studies, this paper sought to identify insect damaged beans, which are difficult to recognize visually from the data. While insect damaged coffee beans are the most common type of defective coffee beans, such targets have little presence and a low probability of occurrence in the data, and thus they have rarely been investigated in previous studies. More specifically, although this signal source is of interest, the signatures are not necessarily pure. Rather, they can be subpixel targets, which cannot be distinguished from the surrounding spectra because of their small size and cannot be detected by traditional spatial domain-based techniques. The method proposed in this paper can be applied to many different applications. Without considering spatial characteristics, hyperspectral imaging provides an effective way to detect, uncover, extract, and identify such targets using their spectral properties, as captured by high spectral-resolution sensors.

The study conducted in this paper collected a total of 1139 green coffee beans, including healthy beans and insect damaged beans in equal proportions, for hyperspectral data collection and experimentation. Our method differs from the prior studies listed in Table 1 in terms of spectral range, data volume, analysis method, and accuracy. This study used a push-broom VIS-NIR hyperspectral sensor to obtain the images of coffee beans and distinguished the healthy beans from insect damaged ones based on the obtained hyperspectral imaging. Moreover, the hyperspectral insect damage detection algorithm (HIDDA) was developed specifically to locate and capture the insect damaged areas of coffee beans. First, data preprocessing was performed through band selection (BS), as hyperspectral imaging has a wide spectral range and very fine spectral resolution. Because the inter-band correlation between adjacent bands is often very high, using the complete set of bands is unfavorable for subsequent data compression, storage, transmission, and analysis; extracting the most representative information from images is therefore one of the most important and popular research subjects in the domain. Step 1 of our HIDDA method involves the analysis of important spectra after band selection; one image is then chosen for insect damaged bean identification through constrained energy minimization (CEM) and SVM as training samples. In this step, as long as the spectral signature of one insect damaged bean is imported into CEM, the positions of the other insect damaged beans can be detected by Otsu's method and SVM. In Step 2, the image recognition result of Step 1 is used for training, and a deep learning CNN model is used to identify the remaining 19 images. The experimental results show that when the proposed method was used to analyze nearly 1100 green coffee beans with only three bands, the accuracy reached almost 95%.

#### **2. Materials and Methods**

#### *2.1. Hyperspectral Imaging System and Data Collection*

The hyperspectral push-broom scanning system (ISUZU Optics Corp.) used in this experiment is shown in Figure 1. A SPECIM FX10 hyperspectral sensor with a spectral range of 400–1000 nm, a resolution of 5.5 nm, and 224 bands was used for imaging. The light source for irradiating the samples was a "3900e-ER" (21 V/150 W). The system comprised a loading implement, a mobile platform, and a step motor (400 mm; maximum load: 5 kg; maximum speed: 100 mm/s) and was controlled with ISUZU software. Dark (closed shutter) and white (99% reflection spectrum) reference images were recorded and stored automatically before each measurement. The laboratory samples were placed on movable plates so that they were appropriately spaced. In each image, 60 green coffee beans were analyzed; each time, 30 insect damaged beans and 30 healthy beans were filmed. The process of filming coffee beans is shown in Figure 2, and Figure 3 shows the actual filming results. The mobile platform and correction whiteboard were located in the lower part, and the filming was performed in a dark box to avoid interference from other light sources. The spectral signatures of the green coffee beans were obtained after filming; Figure 3 shows the post-imaging hyperspectral images. The spectral range was 400–1000 nm, the hyperspectral camera captured 224 spectral images, and the image data size was 1024 × 629 × 224.

**Figure 1.** Hyperspectral imaging system.

**Figure 2.** Filming of coffee beans.

**Figure 3.** Results of green coffee beans filming. (**a**) Color image; (**b**) 583 nm; (**c**) 636 nm; (**d**) 690 nm; (**e**) 745 nm; (**f**) 800 nm.

#### *2.2. Coffee Bean Samples*

After the seeds produced by healthy coffee trees are removed, washed, sun-dried, fermented, dried, and shelled, healthy beans are then separated from defective beans. Common defective beans include black beans, insect damaged beans, and broken beans. Figure 4 shows the healthy and defective beans.

**Figure 4.** The appearance of healthy and defective beans. (**a**) Healthy bean, (**b**) defective bean (black bean), (**c**) defective bean (insect damaged bean), and (**d**) defective bean (broken bean).


The coffee bean samples used in this study were provided by coffee farmers in Yulin, Taiwan. The farmers filtered the beans and provided both healthy and defective coffee bean samples for the experiment on coffee bean classification. To ensure the intactness of the sample beans, all beans were removed from the bag using tweezers, and the tweezers were wiped before touching a different type of bean. A total of 1139 beans were collected, and 19 images were recorded. The quantities of the coffee beans are listed in Table 2.



The original hyperspectral data of the green coffee beans were obtained, with 224 bands observed over the spectral range of 400–1000 nm. The data were normalized to enhance model convergence speed and the precision of band selection with machine learning or deep learning. We collected 19 hyperspectral images in the experiments. Figure 5 shows the spectral signatures of the healthy and defective beans used by our proposed hyperspectral algorithm.

**Figure 5.** Spectral signature of the healthy and insect damaged coffee beans.

#### *2.3. Hyperspectral Band Selection*

In hyperspectral imaging (HSI), hyperspectral signals, with as many as 200 contiguous spectral bands, can provide high spectral resolution. In other words, subtle objects or targets can be located and extracted by hyperspectral sensors with very narrow bandwidths for detection, classification, and identification. However, as the number of spectral bands and the inter-band information redundancy are usually very high in HSI, the original data cube is not suitable for data compression or data transmission and, particularly, image analysis. The use of full bands for data processing often encounters the issue of "the curse of dimensionality"; therefore, band selection plays a very important role in HSI. The purpose of band selection is to select the most representative set of bands in the image so that they approximate the entire image as closely as possible. Previous studies have used various band selection methods based on certain statistical criteria [10–17], mostly by first selecting an objective function and then selecting the band group that maximizes it. This paper first used the histogram method in [18] to remove the background and then applied six band selection methods based on constrained energy minimization (CEM) [19–24] to select and extract a representative set of bands.

#### 2.3.1. Constrained Energy Minimization (CEM)

CEM [19–24] is similar to matched filtering (MF); the CEM algorithm only requires one spectral signature (the desired signature or target of interest) as the parameter $d$, while other prior knowledge (e.g., unknown signals or the background) is not required. Basically, CEM applies a finite impulse response (FIR) filter to pass the target of interest while minimizing and suppressing noise and unknown signals from the background using a specific constraint. CEM suppresses the background through the sample correlation matrix $\mathbf{R} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{r}_i \mathbf{r}_i^T$, and the signature $\mathbf{d}$ is used by the FIR filter to detect other similar targets. Assuming a hyperspectral image with $N$ pixels $\{\mathbf{r}_1, \mathbf{r}_2, \mathbf{r}_3, \ldots, \mathbf{r}_N\}$, where each pixel has $L$ dimensions, $\mathbf{r}_i = (r_{i1}, r_{i2}, r_{i3}, \ldots, r_{iL})^T$, the desired target can be defined as $\mathbf{d} = (d_1, d_2, d_3, \ldots, d_L)^T$ and is passed by the FIR filter. The coefficients of the FIR filter are $\mathbf{w} = (w_1, w_2, w_3, \ldots, w_L)^T$, where $\mathbf{w}$ is obtained under the constraint $\mathbf{d}^T\mathbf{w} = \mathbf{w}^T\mathbf{d} = 1$, and the result of CEM is:

$$\delta^{\text{CEM}}(\mathbf{r}) = \left(\mathbf{w}^{\text{CEM}}\right)^T \mathbf{r} = \left(\mathbf{d}^T \mathbf{R}_{L \times L}^{-1} \mathbf{d}\right)^{-1} \left(\mathbf{R}_{L \times L}^{-1} \mathbf{d}\right)^T \mathbf{r} \tag{1}$$
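
Equation (1) translates directly into a few lines of NumPy; the sketch below is a minimal implementation, assuming the N pixel vectors are stacked as rows of a matrix, and the function name `cem` is illustrative.

```python
import numpy as np

def cem(R_pixels, d):
    """Constrained energy minimization detector, Equation (1) (sketch).

    R_pixels : (N, L) matrix whose rows are the pixel vectors r_i.
    d        : (L,) desired target signature.
    Returns the (N,) detection map delta_CEM(r) for every pixel.
    """
    N = R_pixels.shape[0]
    R = (R_pixels.T @ R_pixels) / N          # sample correlation matrix
    Rinv_d = np.linalg.solve(R, d)           # R^{-1} d
    w = Rinv_d / (d @ Rinv_d)                # FIR weights, with d^T w = 1
    return R_pixels @ w
```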

CEM is one of the few algorithms that can suppress the background while enhancing the target at the subpixel level. CEM is easier to implement than binary classification, as it uses the sample correlation matrix $\mathbf{R}$ to suppress the background (BKG); thus, it only requires meaningful knowledge of the target, and no other information is required. In this regard, CEM has been used to design a new band selection method called constrained band selection (CBS) [19], in which the minimum variance resulting from CBS is used to calculate a priority score to rank the bands. Conceptually, constrained-target band selection (CTBS) [25,26] is slightly different from CBS: CBS focuses only on the band of interest, while CTBS simultaneously takes advantage of the target signature and the band of interest. First, it specifies the signature $d$ of a target and then constrains $d$ to minimize the variance caused by the background signal through the FIR filter; the resulting variance can then be used as the selection criterion. Since CEM has been widely used for subpixel target detection in hyperspectral imagery, this paper applied CBS- and CTBS-based methods for further analysis. The following are the six target detection-based band selection methods used in the experiments.

#### 2.3.2. Constrained Energy Minimization-Constrained Band Dependence Minimization (CEM-BDM)

CEM-BDM [19] is one of the CBS methods; it uses CEM to determine the correlation between the bands and regards this correlation as a score, which is then processed to obtain band selection algorithms with different band priorities. Let $\{\mathbf{B}_l\}$ be the set of all band images, where each band image $\mathbf{b}_l$ is of size $M \times N$ in the hyperspectral image cube. An optimization problem similar to CEM can be obtained for the constrained band-selection problem as $\min_{\mathbf{w}_l} \mathbf{w}_l^T \mathbf{Q} \mathbf{w}_l$ subject to $\mathbf{b}_l^T \mathbf{w}_l = 1$, which uses the least squares error (LSE) as the constraint. CEM-BDM can be derived as follows: with the autocorrelation matrix $\mathbf{Q} = \frac{1}{L-1}\sum_{j=1, j\neq l}^{L} \mathbf{b}_j \mathbf{b}_j^T$ and the FIR filter coefficients $\mathbf{w}_l^{\text{CEM}} = \left(\mathbf{b}_l^T \mathbf{Q}^{-1} \mathbf{b}_l\right)^{-1} \mathbf{Q}^{-1} \mathbf{b}_l$, the final result of CEM-BDM is defined as follows:

$$\text{BDM}_{\text{priority}}(\mathbf{B}_l) = \left(\mathbf{w}_l^{\text{CEM}}\right)^T \mathbf{Q}\,\mathbf{w}_l^{\text{CEM}} \tag{2}$$

This band selection method uses the least squares error to determine the correlation between the bands: the larger the least squares error, the more the current band depends on the other bands and, thus, the more significant the band.
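
A sketch of Equation (2) follows. It is not a faithful full-scale implementation: for a real image the band vectors have M × N entries, so we assume here that the pixels have been subsampled to keep the autocorrelation matrix small, and a pseudo-inverse stands in for the inverse because Q is rank-deficient; the function name is illustrative.

```python
import numpy as np

def cem_bdm_priority(bands):
    """CEM-BDM band priority, Equation (2) (sketch).

    bands : (L, P) matrix with one flattened band image per row.  For a
    real image P = M*N is very large, so we assume the pixels have been
    subsampled to keep the P x P autocorrelation matrix tractable.  A
    pseudo-inverse is used because Q has rank at most L - 1.
    """
    L, P = bands.shape
    scores = np.empty(L)
    for l in range(L):
        others = np.delete(bands, l, axis=0)
        Q = (others.T @ others) / (L - 1)       # (1/(L-1)) sum_j b_j b_j^T
        Qp = np.linalg.pinv(Q)
        w = Qp @ bands[l] / (bands[l] @ Qp @ bands[l])
        scores[l] = w @ Q @ w                   # equals (b_l^T Q^+ b_l)^(-1)
    return scores                               # larger score = more significant
```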

#### 2.3.3. Minimum Variance Band Prioritization (MinV-BP)

According to the optimization method of CEM, the priority score is derived from the variance value: the smaller the variance, the higher the priority score. CEM ranks bands by starting with the minimal variance as the first selected band. Let $\{b_l\}_{l=1}^{L}$ be the set of band images of a hyperspectral image cube, where $b_l$ is the $l$th band image. Applying CEM to the full band set $\Omega$ gives $V(\Omega) = \left(d_\Omega^T R_\Omega^{-1} d_\Omega\right)^{-1} = \left(d^T R^{-1} d\right)^{-1}$; in this case, for each single band $b_l$, the MinV-BP [23,25,26] variance can be defined as:

$$V(b_l) = \left(d_{b_l}^T R_{b_l}^{-1} d_{b_l}\right)^{-1} \tag{3}$$

This can be used as a measure of variance, as it uses only the data sample components specified by $b_l$. The value of $V(b_l)$ can therefore be used as the priority score of $b_l$: the bands are ranked by the value of $V(b_l)$, and the smaller $V(b_l)$ is, the higher the priority of the band.
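
For a single band, $R_{b_l}$ and $d_{b_l}$ reduce to scalars, so Equation (3) becomes a one-line computation per band; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def minv_bp_order(R_pixels, d):
    """MinV-BP ranking, Equation (3) (sketch).

    For a single band l, R_{b_l} reduces to the scalar (1/N) sum_i r_il^2
    and d_{b_l} to the scalar d_l, so V(b_l) = R_{b_l} / d_l^2.
    Returns band indices sorted by increasing V(b_l), highest priority first.
    """
    R_ll = np.mean(R_pixels ** 2, axis=0)   # per-band sample correlation
    V = R_ll / d ** 2
    return np.argsort(V)
```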

#### 2.3.4. Maximum Variance Band Prioritization (MaxV-BP)

In contrast to MinV-BP, the concept of MaxV-BP [23,25] is to first remove $b_l$ from the band set $\Omega$, with the variance calculated as follows:

$$V(\Omega - \{b_l\}) = \left(d_{\Omega - \{b_l\}}^T R_{\Omega - \{b_l\}}^{-1} d_{\Omega - \{b_l\}}\right)^{-1} \tag{4}$$

Under this criterion, the value of $V(\Omega - \{b_l\})$ can also serve as the priority score for $b_l$. Consequently, $\{b_l\}_{l=1}^{L}$ can be ranked by the decreasing values of $V(\Omega - \{b_l\})$: the maximum $V(\Omega - \{b_l\})$ is taken to be the most significant, and the bands are prioritized by (4). The difference between MinV-BP and MaxV-BP is that MinV-BP sorts according to a single band, while MaxV-BP sorts using the full band set, and the results of the two band selections are not simply opposite.

#### 2.3.5. Sequential Forward-Constrained-Target Band Selection (SF-CTBS)

SF-CTBS [25] uses the MinV-BP criterion in (3) to select one band at a time sequentially, instead of sorting all bands by the scores in (3), as MinV-BP does. As a result, the band $b_{l_1}^*$ attains the minimal variance:

$$b_{l_1}^* = \arg\min_{b_l \in \Omega} V(b_l) = \arg\min_{b_l \in \Omega} \left(d_{b_l}^T R_{b_l}^{-1} d_{b_l}\right)^{-1} \tag{5}$$

where $b_{l_1}^*$ is the first selected band, and the second band is generated by another minimum variance:

$$b_{l_2}^* = \arg\min_{b_l \in \Omega - \{b_{l_1}^*\}} V(b_l) = \arg\min_{b_l \in \Omega - \{b_{l_1}^*\}} \left(d_{b_l}^T R_{b_l}^{-1} d_{b_l}\right)^{-1} \tag{6}$$

This process is repeated continuously by adding each newly selected band, while the sequential forward technique in [26,27] selects one band at a time sequentially.
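
A sketch of the sequential forward loop, under the assumption (consistent with the CTBS literature) that each step evaluates the constrained variance of the currently selected subset augmented with each candidate band and keeps the minimizer; Equations (5) and (6) then correspond to the first iterations. The function names are illustrative.

```python
import numpy as np

def subset_variance(R_pixels, d, idx):
    """V(S) = (d_S^T R_S^{-1} d_S)^{-1} restricted to the bands in idx."""
    Rs = R_pixels[:, idx]
    R = (Rs.T @ Rs) / Rs.shape[0]           # may need regularization if
    ds = d[idx]                             # the bands are highly correlated
    return 1.0 / (ds @ np.linalg.solve(R, ds))

def sf_ctbs(R_pixels, d, n_select):
    """Sequential forward CTBS (sketch): grow the selected set one band
    at a time, adding the candidate that minimizes the constrained
    variance of the augmented subset."""
    selected = []
    remaining = list(range(R_pixels.shape[1]))
    for _ in range(n_select):
        best = min(remaining,
                   key=lambda l: subset_variance(R_pixels, d, selected + [l]))
        selected.append(best)
        remaining.remove(best)
    return selected
```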

#### 2.3.6. Sequential Backward-Constrained-Target Band Selection (SB-CTBS)

In contrast to SF-CTBS, which uses the MinV-BP criterion in (3), SB-CTBS [25] applies MaxV-BP as the criterion, using the leave-one-out method to select the optimal bands. For each single band $b_l$, consider the band subset $\Omega - \{b_l\}$, which removes $b_l$ from the full band set. The first selected band is obtained by (7), which yields the maximal variance, and $b_{l_1}^*$ can be considered the most significant band.

$$b_{l_1}^* = \arg\max_{b_l \in \Omega} V(\Omega - \{b_l\}) = \arg\max_{b_l \in \Omega} \left(d_{\Omega - \{b_l\}}^T R_{\Omega - \{b_l\}}^{-1} d_{\Omega - \{b_l\}}\right)^{-1} \tag{7}$$

After calculating $b_{l_1}^*$, we set $\Omega_1 = \Omega - \{b_{l_1}^*\}$, and the second band is generated by another maximal variance in (8). The same process is repeated continuously, removing the currently selected band from the band set one at a time.

$$b_{l_2}^* = \arg\max_{b_l \in \Omega_1} V(\Omega_1 - \{b_l\}) = \arg\max_{b_l \in \Omega_1} \left(d_{\Omega_1 - \{b_l\}}^T R_{\Omega_1 - \{b_l\}}^{-1} d_{\Omega_1 - \{b_l\}}\right)^{-1} \tag{8}$$

Note that the differences between SB-CTBS and SF-CTBS are that SB-CTBS removes bands from the full band set to generate the desired selected band subset, while SF-CTBS grows the selected set by calculating the minimal variance one band at a time. The correlation matrix in SB-CTBS is $R_{\Omega - \{b_l\}}$, while the correlation matrix in SF-CTBS is $R_{b_l}$.

#### 2.3.7. Principal Component Analysis (PCA)

PCA [28] is a feature extraction method for dimensionality reduction in machine learning and can be considered an unsupervised linear transformation technique, which is widely used in different fields. Dimensionality reduction reduces the number of dimensions in the data without much influence on the overall performance. The basic assumption of PCA is that a projection vector can be identified such that projecting the data onto it in feature space yields the maximum variance of the dataset. In this paper, PCA is compared with the other CEM-based band selection methods.

#### *2.4. Optimal Signature Generation Process*

Our proposed algorithm first identifies the desired signature of an insect damaged bean as the d (desired signature) in CEM for the detection of other similar beans. The optimal signature generation process (OSGP) [29,30] was used to find the optimal desired spectral signature. Because CEM needs only one desired spectral signature for detection, the quality of the detection result is very sensitive to that signature. To mitigate this sensitivity, the OSGP selects an initial desired target d, and CEM is repeated to obtain a stable and better d; thus, the stability of detection is increased, and the subsequent CEM gives the best detection result. Figure 6 shows the flow diagram of the OSGP. Otsu's method [31] is then used to find the optimal threshold; it divides the data into 0 and 1, labeling the data for follow-up analysis.

**Figure 6.** The optimal signature generation process.
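
A minimal sketch of this iterative loop, assuming the `cem()` function sketched in Section 2.3.1 and scikit-image's `threshold_otsu` for Otsu's method; the tolerance, iteration cap, and function name are illustrative.

```python
import numpy as np
from skimage.filters import threshold_otsu

def osgp(R_pixels, d0, n_iter=10, tol=1e-6):
    """Optimal signature generation process (sketch).

    Repeats: run CEM with the current signature, binarize the detection
    map with Otsu's method, and average the detected pixels to form the
    next signature; stops when the signature is stable.  Assumes the
    cem() function sketched in Section 2.3.1.
    """
    d = d0
    for _ in range(n_iter):
        scores = cem(R_pixels, d)               # detection map, Equation (1)
        mask = scores > threshold_otsu(scores)  # Otsu's method: 0/1 labels
        d_new = R_pixels[mask].mean(axis=0)
        if np.linalg.norm(d_new - d) < tol:
            break
        d = d_new
    return d
```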

#### *2.5. Convolutional Neural Networks (CNN)*

Feature extraction normally requires expert knowledge, as the important features for a classification problem must be known and extracted from the image. The "convolution" in a convolutional neural network (CNN) [32–38] refers to a method of feature extraction that can replace experts in extracting features. Generally speaking, a CNN effectively uses the spatial information in traditional RGB images; for example, a 2D-CNN uses the shape and color of the target in the image to capture features. However, insect damaged areas of coffee beans may be mixed with other material substances and may even be embedded in a single pixel, as their size is smaller than the ground sampling distance. In this case, since no shape or color can be captured, spectral information is important for detecting insect damaged areas. Therefore, this paper used a pixel-based 1D-CNN model to capture spectral features instead of spatial features. The result after band selection of the hyperspectral image was reshaped into one-dimensional data whose spectral context still exists, as shown in Figure 7. The 1D-CNN uses far fewer parameters than a 2D-CNN and is more accurate and faster [39].

**Figure 7.** The 1D-CNN model.

Figure 8 shows the 1D-CNN model architecture used in this paper. The hyperspectral image after band selection was used for further analysis, and the spatial size of the image was 1024 × 629. Features were extracted using convolution layers with 8 and 16 kernels, and the resulting 2048 neurons then entered the fully connected layer directly.

**Figure 8.** The 1D-CNN model architecture.

The network terminal was provided with a softmax classifier, and the classification result of the input spectrum was obtained. The parameters were as follows: train/test split: 0.33; epochs: 200; kernel size: 3; activation: 'relu'; optimizer: SGD; lr: 0.0001; momentum: 0.9; decay: 0.0005; factor = 0.2; patience = 5; min\_lr = 0.000001; batch\_size = 1024; and verbose = 1.
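
Most of these hyperparameters map directly onto a Keras model; the sketch below is one plausible reconstruction, in which the padding and the absence of pooling are assumptions, while the kernel counts, kernel size, activation, optimizer settings, and callback parameters follow the text (the stated decay of 0.0005 is omitted because its Keras mapping is version-dependent).

```python
import tensorflow as tf

def build_1d_cnn(n_bands, n_classes=2):
    """Pixel-wise 1D-CNN (sketch): convolution layers with 8 and 16
    kernels (kernel size 3, ReLU) over the spectral axis, flattened
    into a dense softmax output."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(8, kernel_size=3, activation="relu",
                               padding="same", input_shape=(n_bands, 1)),
        tf.keras.layers.Conv1D(16, kernel_size=3, activation="relu",
                               padding="same"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    sgd = tf.keras.optimizers.SGD(learning_rate=0.0001, momentum=0.9)
    model.compile(optimizer=sgd, loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(factor=0.2, patience=5,
                                                 min_lr=1e-6)
# model.fit(X, y, epochs=200, batch_size=1024, validation_split=0.33,
#           callbacks=[reduce_lr], verbose=1)
```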

#### *2.6. Hyperspectral Insect Damage Detection Algorithm (HIDDA)*

This paper combined the above methods into the hyperspectral insect damage detection algorithm (HIDDA), in which band selection is first used to filter out the important bands, and CEM-OTSU is then applied to generate training samples for two classifiers that implement binary classification of healthy and defective coffee beans. Method 1 uses a linear support vector machine (SVM) [39], where the labeled data are added for the classification of the coffee beans; although Otsu's method could be used for the subsequent classification, considering its possible misrecognition, this paper improved the classification with an SVM. Method 2 uses a CNN. Figure 9 describes the HIDDA flowchart, which is divided into two stages: training (Figure 9a,c) and testing (Figure 9b,d).

**Figure 9.** The hyperspectral insect damage detection algorithm flowchart. (**a**) Training data of the Support Vector Machine (SVM), (**b**) testing data of the Support Vector Machine (SVM), (**c**) training data of the Convolutional Neural Network (CNN), (**d**) testing data of the Convolutional Neural Network (CNN).

In the training process, the spectral signature of an insect damaged bean was imported into CEM as the desired target. The positions of other insect damaged beans could then be detected automatically by Otsu's method, and the result was taken as the training data of the SVM and CNN (Figure 9a,c) to classify the remaining 19 images (Figure 9b,d). The training and test sets of the CNN were converted into 1D data. The training data of this experiment were obtained from one hyperspectral image of 60 coffee beans, containing 30 insect damaged beans and 30 healthy beans, after obtaining the results of CEM-OTSU. The remaining 19 hyperspectral images were used for prediction, so the training samples were less than 5% of the data, and the testing data were about 95%. The data were preprocessed before the experiment using normalization and background removal. Then, the six band selection algorithms were used to find the bands sensitive to insect damaged and healthy beans, and the hyperspectral algorithm CEM was performed. As CEM needs only a single desired spectral signature for detection, this spectral signature is quite important in the algorithm. The best desired signature was found by the OSGP; this signature was put into CEM for analysis, and Otsu's method divided the data into 0 and 1 to label the training data. This paper analyzed pixels instead of images, so this step is particularly important. The remaining 19 images of the test set were used for the SVM (Figure 9b), which used the CEM result for classification; the same set of 19 images after band selection was used as the CNN testing set (Figure 9d). As the CNN uses convolution layers to extract features, CEM was not required for its analysis. Note that HIDDA generates training samples from the result of CEM-OTSU and not from prior knowledge, as the only prior knowledge HIDDA requires is a single desired spectral signature for CEM at the beginning.

#### **3. Results and Discussion**

#### *3.1. Band Selection Results*

According to Figure 9, the background was removed from the experimental hyperspectral data before band selection. This experiment compared the six band selection methods discussed earlier (CEM-BDM, MinV-BP, MaxV-BP, SF-CTBS, SB-CTBS, and PCA). The SVM and CNN classifiers were then used for classification. Finally, the confusion matrix [40] and kappa [41,42] were used for evaluation and comparison. Instead of using pixels for evaluation, this paper used coffee beans as the unit: if any pixel of a coffee bean was identified as insect damaged, the bean was classified as an insect damaged bean, and vice versa. In the confusion matrix of this experiment, TP represents a defective bean hit, FN is a missed defective bean, TN is a healthy bean hit, and FP is a misrecognized healthy bean. Figures 10–15 show the first 20 bands selected by each band selection method and the images after band selection. From the sensitive bands selected by the six band selection methods, 3, 10, and 20 bands were used for the tests. Bands beyond the first 20 were not selected because excessive bands can cause disorder and repeated data; in addition, excessive bands would make future hardware design difficult. Therefore, the number of bands was kept below 20.

**Figure 10.** Visualization of the CEM\_BDM band selection results. The first five bands are 933, 900, 869, 930, and 875 nm.

**Figure 11.** Visualization of the MinV\_BP band selection results. The first five bands are 936, 933, 927, 930, and 925 nm.

**Figure 12.** Visualization of the MaxV\_BP band selection results. The first five bands are 858, 927, 850, 674, and 891 nm.

**Figure 13.** Visualization of the SF\_CTBS band selection results. The first five bands are 936, 858, 534, 927, and 693 nm.

**Figure 14.** Visualization of the SB\_CTBS band selection results. The first five bands are 858, 850, 927, 674, and 891 nm.

**Figure 15.** Visualization of the PCA band selection results. The first five bands are 936, 875, 872, 869, and 866 nm.

According to the results in Figures 10–15, almost all of the foremost bands fell in the wavelength range of 850–950 nm. Table 3 lists the bands most frequently selected among the first 20 bands of the six band selection algorithms; 850 nm and 886 nm were each selected by five of the six algorithms, which means those bands are discriminative for coffee beans. This finding can help reduce costs and increase the use-value of future sensor designs.

**Table 3.** Most frequently selected bands by six band selection algorithms in the first 20 bands. (•: include, X: not include).
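As a toy illustration of how such tallies arise, the first five bands reported in the captions of Figures 10–15 can be counted across methods; Table 3 itself tallies the full top-20 lists, so these counts are only a sketch.

```python
from collections import Counter

# First five bands (nm) per method, taken from Figures 10-15.
top_bands = {
    "CEM_BDM": [933, 900, 869, 930, 875],
    "MinV_BP": [936, 933, 927, 930, 925],
    "MaxV_BP": [858, 927, 850, 674, 891],
    "SF_CTBS": [936, 858, 534, 927, 693],
    "SB_CTBS": [858, 850, 927, 674, 891],
    "PCA":     [936, 875, 872, 869, 866],
}

# Count how many methods pick each band; frequently shared bands are
# candidates for discriminative bands.
counts = Counter(b for bands in top_bands.values() for b in bands)
print(counts.most_common(3))  # [(927, 4), (936, 3), (858, 3)]
```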


#### *3.2. Detection Results Using Three Bands*

The final detection results using three bands were obtained by the CEM-SVM and the CNN model, as described in Section 2.3.7. Figures 16–21 show the final detection results generated by CEM-SVM with each of the six band selection methods selecting three bands, while Figures 22–26 show the final detection results obtained by the CNN model with five band selection methods selecting three bands. The upper three rows in Figures 16–21 are insect damaged beans, the lower three rows are healthy beans, and there were 1139 beans in the 20 images. To limit the text length, only four of the 20 images are displayed; the analysis results are summarized in Table 4. In the confusion matrix of this experiment, TP refers to insect damaged bean hits, FN to missed insect damaged beans, TN to healthy bean hits, and FP to false alarms. In the image representation, TP is green, FN is red, TN is blue, and FP is yellow; these colors are used for visualization in Figures 16–26. All results for the three bands are compiled and compared in Table 4. The ACC [40], kappa [41,42], and running time derived from the confusion matrix were used for evaluation; the reported running time is an average.
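As a small assumed sketch of this color coding (the mask format and function name are illustrative, only the four colors come from the text), each bean's region can be painted by its confusion-matrix outcome:

```python
import numpy as np

# TP green, FN red, TN blue, FP yellow, as in Figures 16-26.
COLORS = {"TP": (0, 255, 0), "FN": (255, 0, 0),
          "TN": (0, 0, 255), "FP": (255, 255, 0)}

def colorize(outcomes, bean_masks, shape):
    """Paint each bean mask with the color of its outcome.

    outcomes:   list of "TP"/"FN"/"TN"/"FP" strings, one per bean.
    bean_masks: list of boolean (H, W) masks; shape: (H, W).
    """
    img = np.zeros((*shape, 3), dtype=np.uint8)
    for outcome, mask in zip(outcomes, bean_masks):
        img[mask] = COLORS[outcome]  # broadcast RGB over the bean's pixels
    return img
```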

**Figure 16.** Results of green coffee beans CEM-SVM+CEM\_BDM three bands, (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 17.** Results of green coffee beans CEM-SVM+MinV\_BP three bands, (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 18.** Results of green coffee beans CEM-SVM+MaxV\_BP three bands, (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 19.** Results of green coffee beans CEM-SVM+SF\_CTBS three bands, (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 20.** Results of green coffee beans CEM-SVM+SB\_CTBS three bands, (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 21.** Results of green coffee beans CEM-SVM+PCA three bands, (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 22.** Results of green coffee beans CNN+CEM-BDM three bands, (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

**Figure 23.** Results of green coffee beans CNN+MaxV\_BP three bands, (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

**Figure 24.** Results of green coffee beans CNN+SF\_CTBS three bands, (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

**Figure 25.** Results of green coffee beans CNN+SB\_CTBS three bands, (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

**Figure 26.** Results of green coffee beans CNN+PCA three bands, (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

**Table 4.** The results of the green coffee bean classification (three bands). The best performance is highlighted in red.


In the case of three bands, CEM-SVM combined with BDM, MaxV\_BP, SF\_CTBS, SB\_CTBS, or PCA was successful in classification. However, a portion of the insect damaged beans was not detected, probably because the insect damaged surface was not irradiated. MinV\_BP+CEM-SVM could not perform classification at all, as shown in Figure 17, possibly because no sensitive band was selected; its result was therefore excluded from the subsequent discussion. As shown in Table 4, PCA+CNN had the highest TPR, and PCA+CEM-SVM had the highest ACC and kappa, showing that ranking bands by the PCA amount of variation is feasible for band selection. The minimum FDR was observed for SF\_CTBS+CEM-SVM: the minimum variance of CEM used for recurrent selection in SF\_CTBS identified healthy beans accurately.

In the case of CNN, BDM, MaxV\_BP, SF\_CTBS, SB\_CTBS, and PCA were used; MinV\_BP was excluded because the deep learning labels produced from its CEM output could not be identified, so its three-band result is not included in the comparison. PCA exhibited the highest TPR, so the bands selected by PCA were the most sensitive to defective beans. SF\_CTBS had the lowest FPR: the minimum variance of CEM calculated by SF\_CTBS for recurrent selection identified healthy beans accurately, with only eight green coffee beans misidentified as defective. CEM\_BDM possessed the highest ACC and kappa, so the CEM\_BDM method classified green coffee beans best overall. In terms of time, the CNN was faster than the SVM because the CNN model used batch\_size = 1024 for prediction, while the SVM predicted pixels one by one.
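The paper's exact 1D-CNN is described in Section 2.3.7; the sketch below is only an assumed stand-in for a per-pixel spectral 1D-CNN with batched prediction, where the layer widths are illustrative and only the 1D input format and batch\_size = 1024 come from the text.

```python
import tensorflow as tf

def build_1d_cnn(num_bands: int) -> tf.keras.Model:
    """Minimal 1D-CNN for per-pixel spectral classification
    (illustrative layer sizes, not the authors' exact architecture)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(num_bands, 1)),       # 1D spectrum per pixel
        tf.keras.layers.Conv1D(16, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Conv1D(32, 3, activation="relu", padding="same"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),    # damaged vs. healthy
    ])

model = build_1d_cnn(num_bands=3)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_pixels[..., None], train_labels, epochs=10)
# Batched prediction explains the speed advantage over per-pixel SVM:
# probs = model.predict(test_pixels[..., None], batch_size=1024)
```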

#### *3.3. Detection Results Using 10 Bands*

The final detection results using 10 bands were obtained by CEM-SVM and the CNN model, as described in Section 2.3.7. Figures 27–32 show the final detection results generated by CEM-SVM using six band selection methods to select 10 bands; Figures 33–37 show the final detection results obtained by the CNN model using five band selection methods to select 10 bands.

**Figure 27.** Results of green coffee bean CEM-SVM+CEM\_BDM 10 bands. (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 28.** Results of green coffee bean CEM-SVM+MinV\_BP 10 bands. (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 29.** Results of green coffee bean CEM-SVM+MaxV\_BP 10 bands. (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 30.** Results of green coffee bean CEM-SVM+SF\_CTBS 10 bands. (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 31.** Results of green coffee bean CEM-SVM+SB\_CTBS 10 bands. (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 32.** Results of green coffee bean CEM-SVM+PCA 10 bands. (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 33.** Results of green coffee bean CNN+CEM\_BDM 10 bands. (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

**Figure 34.** Results of green coffee bean CNN+MaxV\_BP 10 bands. (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

**Figure 35.** Results of green coffee bean CNN+SF\_CTBS 10 bands. (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

**Figure 36.** Results of green coffee bean CNN+SB\_CTBS 10 bands. (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

**Figure 37.** Results of green coffee bean CNN+PCA 10 bands. (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

All results from the 10 bands were compiled and compared, as shown in Table 5. As seen, several influential bands appeared at the front of the rankings, but excessive bands could induce misrecognitions.


**Table 5.** The results of the green coffee bean classification (10 bands). The best performance is highlighted in red.

In the case of CEM-SVM, CEM\_BDM+CEM-SVM performed best in FPR, ACC, and kappa, indicating that the CEM\_BDM band priority is reliable with 10 bands and that minimizing the correlation between bands helps discriminate green coffee beans; the sensitive bands were extracted by this method. MaxV\_BP+CEM-SVM had the highest TPR, indicating that sequencing bands by the maximum variance of CEM can classify defective beans. MinV\_BP was less effective than the other methods, which might be related to the variance of the green coffee beans, suggesting that this method is inapplicable to a small number of bands.

In the case of CNN, when MinV\_BP produced the labels, excessive misrecognitions in the data caused the training model to fail. PCA had the highest TPR, ACC, and kappa; therefore, among the 10-band selections, PCA+CNN seemed the most suitable for classifying green coffee beans. SF\_CTBS and SB\_CTBS had the minimum FPR, indicating that the cyclic ordering of CEM variance is appropriate for classifying good beans.

#### *3.4. Detection Results Using 20 Bands*

The final detection results using 20 bands were obtained by the CEM-SVM and the CNN model, as described in Section 2.3.7. Figures 38–43 show the final detection results generated by CEM-SVM using six band selection methods to select 20 bands; Figures 44–48 show the final detection results obtained by the CNN model using five band selection methods to select 20 bands.

**Figure 38.** Results of green coffee beans CEM-SVM+CEM\_BDM 20 bands. (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 39.** Results of green coffee bean CEM-SVM+MinV\_BP 20 bands. (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 40.** Results of green coffee bean CEM-SVM+MaxV\_BP 20 bands. (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 41.** Results of green coffee bean CEM-SVM+SF\_CTBS 20 bands. (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 42.** Results of green coffee bean CEM-SVM+SB\_CTBS 20 bands. (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 43.** Results of green coffee bean CEM-SVM+PCA 20 bands. (**a**–**d**) SVM classification results, (**e**–**h**) final visual images.

**Figure 44.** Results of green coffee bean CNN+CEM\_BDM 20 bands. (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

**Figure 45.** Results of green coffee bean CNN+MaxV\_BP 20 bands. (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

**Figure 46.** Results of green coffee bean CNN+SF\_CTBS 20 bands. (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

**Figure 47.** Results of green coffee bean CNN+SB\_CTBS 20 bands. (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

**Figure 48.** Results of green coffee bean CNN+PCA 20 bands. (**a**–**d**) CNN classification results, (**e**–**h**) final visual images.

All results of the 20 bands were compiled and compared, as shown in Table 6. The ACC and kappa calculated from the confusion matrix, together with the running time, were used for evaluation.


**Table 6.** The results of the green coffee bean classification (20 bands). The best performance is highlighted in red.

**20 Bands Green Coffee Beans CEM-SVM Results Analysis**

| Method | TP | FN | FP | TN | TPR | FPR | ACC | Kappa | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| SB\_CTBS+CEM-SVM | 544 | 26 | 253 | 316 | 0.954 | 0.444 | 0.755 | 0.516 | 11.78 |
| PCA+CEM-SVM | 546 | 24 | 144 | 425 | 0.957 | 0.253 | 0.852 | 0.729 | 11.38 |

**20 Bands Green Coffee Beans CNN Results Analysis**

| Method | TP | FN | FP | TN | TPR | FPR | ACC | Kappa | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| CEM\_BDM+CNN | 480 | 90 | 5 | 564 | 0.842 | 0.007 | 0.917 | 0.835 | 7.84 |
| MaxV\_BP+CNN | 355 | 215 | 1 | 568 | 0.620 | 0.001 | 0.809 | 0.618 | 7.27 |
| SF\_CTBS+CNN | 395 | 175 | 2 | 567 | 0.687 | 0.003 | 0.841 | 0.683 | 7.35 |
| SB\_CTBS+CNN | 336 | 234 | 1 | 568 | 0.585 | 0.001 | 0.791 | 0.583 | 7.13 |
| PCA+CNN | 511 | 59 | 5 | 564 | 0.896 | 0.007 | 0.944 | 0.888 | 7.82 |


In the case of CNN, as the data content increased, the accuracy of most methods declined. In the training of MinV\_BP, excessive data misrecognitions caused the training model to fail. The PCA band selection exhibited good performance in TPR, ACC, and kappa, indicating that PCA performed the best in classification with 20 bands. MaxV\_BP and SB\_CTBS had the lowest FPR; the use of the maximum variance of CEM in the case of 20 bands had the best effect on classifying good beans.

As for CEM-SVM with 20 bands, the MinV\_BP result improved greatly, suggesting that MinV\_BP is inapplicable to a small number of bands of green coffee beans, but applicable to a larger number of bands.

#### *3.5. Discussion*

The ACC and kappa values for three bands, 10 bands, and 20 bands were compared and represented as histograms, as shown in Figures 49 and 50. According to these comparisons, CEM\_BDM+CEM-SVM gave good results in the cases of three bands, 10 bands, and 20 bands: the accuracy was higher than 90% and the kappa was about 0.85, indicating that the BDM-selected bands are crucial and representative for both classifiers. For MinV\_BP+CEM-SVM, in the cases of three bands and 10 bands the selected bands could hardly separate the data, whereas with 20 bands the effect was enhanced greatly, suggesting that MinV\_BP needs a larger number of bands for better classification.

**Figure 49.** The ACC (accuracy) histograms of 3 bands, 10 bands, and 20 bands.

**Figure 50.** The Kappa histograms of 3 bands, 10 bands, and 20 bands.

As for MaxV\_BP+CEM-SVM, the green coffee beans could be classified in the cases of three bands and 10 bands, but the accuracy declined with 20 bands, indicating that excessive bands induced misrecognitions. Interestingly, this behavior is contrary to MinV\_BP, and is related to the CEM variance criteria of MinV\_BP and MaxV\_BP. For SF\_CTBS+CEM-SVM, the accuracy and kappa were quite high in the cases of three bands and 10 bands, while with 20 bands there were too many misrecognitions of healthy beans and the kappa decreased greatly. This again indicates that excessive bands induce misrecognitions, and confirms that the sensitive bands were already identified within the first 10 bands. For SB\_CTBS+CEM-SVM, high precision and kappa were observed with three bands, and as the number of bands increased, the rate of healthy bean misrecognition increased; the first three bands of this method were therefore the most representative, and additional bands did not increase the accuracy. For PCA+CEM-SVM, the results were good in the cases of three bands, 10 bands, and 20 bands, and ranking by the amount of variation proved feasible for band selection, much like the previous BDM; these two methods could select important spectral signatures with a small number of bands. CEM\_BDM+CNN also exhibited good results in all three cases, but poorer than the previous SVM except with three bands.

With three bands, MaxV\_BP+CNN exhibited high precision and kappa, which decreased as the number of bands increased; nevertheless, CNN seemed more suitable than SVM for this method. SF\_CTBS+CNN performed worse than SVM in the cases of three bands, 10 bands, and 20 bands, indicating that this method is inapplicable to CNN, which may be related to the variance of CEM. SB\_CTBS+CNN exhibited high precision and kappa with three bands, which decreased as the bands increased, suggesting that excessive bands influenced the decision; there was no significant difference from SVM except a slight one with 10 bands. PCA+CNN exhibited good results in the cases of three bands, 10 bands, and 20 bands, and the CNN performed much better than the SVM, whose results were middling; the 10-band and 20-band cases exhibited the best effect.

Based on the aforementioned results, this paper found that the number of bands is a critical factor. From the band selection results in Section 3.1, the foremost bands fell in the wavelength range of 850–950 nm, and according to the spectral signatures of healthy and insect damaged coffee beans in Figure 5, the largest spectral difference also lay between 850 and 950 nm. This explains the above results: when the selected bands fell within this range, the method performed relatively well. Considering the number of bands, in the case of three bands the CEM\_BDM+CNN method had the best ACC and kappa, with an ACC of 95%, indicating that minimizing inter-band correlation helps detect insect damaged beans, since its top three bands lay within 850–950 nm. In the 10-band and 20-band cases, the PCA+CNN method exhibited the best ACC and kappa, suggesting that using the covariance for band selection can identify the bands that differ between healthy and defective beans, and the effect improved further when combined with CNN. Based on the above results, several findings can be observed as follows.

1. As the background contains many unknown signal sources responding to various wavelengths, the hyperspectral data collected in this paper were pre-processed to remove the background, which left relatively simple signal sources in the image. Because too much spectral data increases the complexity of detection, only healthy coffee beans and insect damaged beans were included in the experimental data. Without other background noise interference, the experiment required only a few important bands to separate the insect damaged beans from the healthy beans. The applied CEM-based band selection methods rank the bands by the variance generated by CEM, where the top ranked bands are more representative and significant. Moreover, the basic assumption of PCA is that the data can be projected onto a vector in feature space that maximizes the variance of the projected data; thus, PCA also ranks by variance. In other words, band selection methods that use variance as the criterion need only the top few bands to distinguish our experimental data with only two signal sources, healthy and unhealthy beans, which is supported by our experimental results (see the sketch after Figure 51). As the top few bands are concentrated between 850 nm and 950 nm, the difference between the spectral signature curves of the insect damaged beans and healthy beans could be easily observed in this range, as shown in Figure 51: the curve of the healthy beans flattened, while the curve of the insect damaged beans rose beyond the 850–950 nm range.

**Figure 51.** Highlight of the spectral signature for healthy and insect damaged beans.
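As promised above, here is a deliberately simplified, assumed sketch of variance-criterion band ranking; the actual BP and CTBS procedures defined earlier are more elaborate, and this only illustrates the shared idea of scoring bands by CEM output variance.

```python
import numpy as np

def rank_bands_by_cem_variance(cube, d, descending=True):
    """Score each band by the variance of a single-band CEM output,
    then rank (a simplified stand-in for MinV-BP/MaxV-BP-style
    prioritization, not the paper's exact procedure).

    cube: (N, L) pixel spectra; d: (L,) desired signature.
    """
    variances = []
    for b in range(cube.shape[1]):
        x = cube[:, b:b + 1]                 # restrict to band b
        R = x.T @ x / x.shape[0]             # 1x1 sample correlation matrix
        w = np.linalg.solve(R, d[b:b + 1])
        w = w / (d[b:b + 1] @ w)             # unity gain on the target
        variances.append(np.var(x @ w))      # CEM output variance for band b
    order = np.argsort(variances)            # ascending (MinV-style)
    return order[::-1] if descending else order
```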



**Table 7.** The detailed comparison of prior studies.

#### **4. Conclusions**

Insect damage is the most commonly seen defect in coffee beans. The damaged areas are often smaller than the pixel resolution, thus, the targets to be detected are usually embedded in a single pixel. Therefore, the only way to detect and extract these targets is at the subpixel level, meaning traditional spatial domain (RGB)-based image processing techniques may not be suitable. To address this problem, this paper adopted spectral processing techniques that can characterize and capture the spectral information of targets, rather than their spatial information. After using a VIS-NIR push-broom hyperspectral imaging camera to obtain the images of green coffee beans, this paper further developed HIDDA, which includes six algorithms for band selection as well as CEM-SVM and CNN for identification. The experimental samples of this paper were 1139 coffee beans including 569 healthy beans and 570 defective beans. The accuracy in classifying healthy beans was 96.4%, and that in classifying defective beans was 93%; the overall accuracy was nearly 95%.

As CEM is one of the few algorithms that can suppress background noise while detecting targets at the subpixel level, the proposed method applies CEM as the kernel of the algorithm, using the sample correlation matrix **R** to suppress the background and a specific constraint on the FIR filter to pass the target through. CEM can easily implement binary classification because it requires knowledge of only the target signature and no other information; thus, CEM was used to design the CBS and CTBS band selection methods, which use the CEM-produced variance as the criterion to select and rank bands. PCA was compared because it also uses variance as its criterion. The results showed that the top few bands selected by the six band selection algorithms were concentrated between 850 nm and 950 nm, which means these bands are important and representative for distinguishing healthy beans from defective beans. Since no specific shape or color can be captured in the insect damaged beans, spectral information is needed to detect the damaged areas. Accordingly, this paper proposed two spectral-based classifiers applied after band selection: one combines CEM with the SVM for classification, while the other uses a 1D-CNN deep learning network to implement binary classification. To support future sensor design, this paper experimented with three bands, 10 bands, and 20 bands. The results showed that with three bands, both CEM-SVM and CNN performed very well, indicating that HIDDA can detect insect damaged coffee beans with only a few bands.
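For reference, the standard closed-form CEM filter consistent with this description is obtained by minimizing the output energy of the FIR filter subject to a unity-gain constraint on the desired signature:

$$
\min_{\mathbf{w}}\;\mathbf{w}^{T}\mathbf{R}\,\mathbf{w}
\quad \text{subject to} \quad \mathbf{d}^{T}\mathbf{w} = 1,
\qquad
\mathbf{w}^{\ast} = \frac{\mathbf{R}^{-1}\mathbf{d}}{\mathbf{d}^{T}\mathbf{R}^{-1}\mathbf{d}},
\qquad
y(\mathbf{x}) = \mathbf{w}^{\ast T}\mathbf{x},
$$

where $\mathbf{R}$ is the sample correlation matrix of the pixel spectra, $\mathbf{d}$ is the desired signature, and $y(\mathbf{x})$ is the detector output that is thresholded for classification.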

In conclusion, this paper makes several important contributions. First, hyperspectral images were used to detect insect damaged beans, which are more difficult to identify by visual inspection than other defective beans such as black and sour beans. Second, this paper applied the results from CEM to generate training samples for the CNN and SVM models, so the training sample rate could be kept relatively low. Moreover, HIDDA requires knowledge of only one insect damaged bean spectral signature and as few as three bands, while the accuracy was still nearly 95%; in other words, HIDDA is advantageous for the commercial development of sensors in the future. Third, six band selection methods were developed, analyzed, and combined with neural networks and deep learning. The accuracy over the 20 images of 1139 coffee beans was 95%, and the kappa was 90%. The results indicate that the bands in the 850–950 nm wavelength range are significant for distinguishing healthy beans from defective beans. Our future study will work toward commercialization in coffee processing, combining the experimental process with mechanical automation.

**Author Contributions:** Conceptualization, S.-Y.C.; Data curation, C.-Y.C.; Formal analysis, S.-Y.C. and C.-T.L.; Funding acquisition, C.-Y.C.; Investigation, C.-S.O. and C.-T.L.; Methodology, S.-Y.C.; Project administration, S.-Y.C.; Resources, C.-T.L. and C.-Y.C.; Software, C.-S.O. and C.-T.L.; Supervision, S.-Y.C.; Validation, S.-Y.C. and C.-S.O.; Visualization, C.-S.O. and S.-Y.C.; Writing—original draft, S.-Y.C. and C.-S.O.; Writing—review & editing, S.-Y.C., C.-S.O., and C.-Y.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan and Ministry of Science and Technology (MOST): 107-2221-E-224-049-MY2 in Taiwan.

**Acknowledgments:** This work was financially supported by the "Intelligent Recognition Industry Service Center" from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan. We would also like to acknowledge ISUZU OPTICS CORP. for their financial and technical support.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
