*Article* **FPGA-Based On-Board Hyperspectral Imaging Compression: Benchmarking Performance and Energy Efficiency against GPU Implementations**

**Julián Caba <sup>1,\*,†</sup>, María Díaz <sup>2,†</sup>, Jesús Barba <sup>1</sup>, Raúl Guerra <sup>2</sup>, Jose A. de la Torre <sup>1</sup> and Sebastián López <sup>2</sup>**


Received: 17 September 2020; Accepted: 9 November 2020; Published: 13 November 2020

**Abstract:** Remote-sensing platforms, such as Unmanned Aerial Vehicles, are characterized by limited power budgets and low-bandwidth downlinks. Therefore, handling hyperspectral data in this context can jeopardize the operational time of the system. FPGAs have traditionally been regarded as the most power-efficient computing platforms. However, there is little experimental evidence to support this claim, which is especially critical since the actual behavior of solutions based on reconfigurable technology is highly dependent on the type of application. In this work, a highly optimized FPGA accelerator of the novel *HyperLCA* algorithm has been developed and thoroughly analyzed in terms of performance and power efficiency. In this regard, a modification of the aforementioned lossy compression solution has also been proposed so that it can be efficiently executed on FPGA devices using fixed-point arithmetic. Single-core and multi-core versions of the reconfigurable computing solution are compared with three GPU-based implementations of the algorithm running on three NVIDIA computing boards: Jetson Nano, Jetson TX2 and Jetson Xavier NX. Results show that the single-core version of our FPGA-based solution fulfils the real-time requirements of a real-life hyperspectral application using a mid-range Xilinx Zynq-7000 SoC chip (XC7Z020-CLG484). Performance levels of the custom hardware accelerator are above the figures obtained by the Jetson Nano and TX2 boards, and power efficiency is higher for smaller sizes of the image block to be processed. To close the performance gap between our proposal and the Jetson Xavier NX, a multi-core version is proposed. The results demonstrate that a solution based on the use of several instances of the FPGA hardware compressor core achieves performance levels similar to those of the state-of-the-art GPU, with better efficiency in terms of frames processed per watt.

**Keywords:** hyperspectral imaging; lossy compression; on-board processing; FPGA; GPU; real-time performance; UAV; parallel computing

#### **1. Introduction**

Hyperspectral technology has experienced a steady surge in popularity in recent decades. Among the main reasons that have given hyperspectral imaging greater visibility is the richness of the spectral information collected by this kind of sensor. This feature has positioned hyperspectral analysis techniques as the mainstream solution for the analysis of land areas and the identification and discrimination of visually similar surface materials. As a consequence, this technology has acquired increasing relevance, being widely used for a variety of applications, such as precision agriculture, environmental monitoring, geology, urban surveillance and homeland security, among others. Nevertheless, hyperspectral image processing is accompanied by the management of large amounts of data, which affects, on the one hand, its real-time performance and, on the other hand, the requirements of the on-board storage resources. Additionally, the latest technological advances are bringing to market hyperspectral cameras with higher spectral and spatial resolutions. All of this makes efficient data handling, from the on-board processing, communication and storage points of view, even more challenging [1,2].

Traditionally, images sensed by spaceborne Earth-observation missions are not processed on board. The main rationale behind this is the limited on-board power capacity, which forces the use of low-power devices that normally do not perform as well as their commercial counterparts [3–8]. In this regard, images are subsequently downlinked to the Earth surface, where they are processed off-line on high-performance computing systems based on Central Processing Units (CPUs), Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), or heterogeneous architectures. In the case of airborne capturing platforms, images are normally stored on board, and hence the end user cannot access them until the flight mission is over [9]. Additionally, unmanned aerial vehicles (UAVs) have gained momentum in recent years. They have become a very popular solution for inspection, surveillance and monitoring, since they represent a lower-cost approach with a more flexible revisit time than the aforementioned Earth-observation platforms. In this context, hyperspectral image management is addressed similarly to how it is done in airborne platforms, although many efforts have been made recently to transmit the images to the ground segment as soon as they are captured [10,11].

Regrettably, data transmission from the aforementioned remote-sensing observation platforms introduces important delays. They are mainly related to the transfer of large volumes of data and the limited communication bandwidth between the source and the final target, which has also remained relatively stable over the years [6,12,13]. Consequently, a bottleneck arises in the downlink systems that can seriously affect the effective performance of real-time or nearly real-time applications. Moreover, the steadily growing data-rate of the latest-generation sensors makes it compulsory to reach higher compression ratios and to compress in real time, in order to prevent the unnecessary accumulation of large amounts of uncompressed data on board and to facilitate efficient data transfers [14].

In this scenario of limited communication bandwidths and increasing data volumes, it is becoming necessary to move from lossless or near-lossless compression approaches to lossy compression techniques. Although most state-of-the-art lossless compressors deliver quite satisfactory rate-distortion performance, they provide very moderate compression ratios of about 2∼3:1 [15,16], which presently are not sufficient to handle the high input data-rates of the newest-generation sensors. For this reason, a research effort targeting lossy compression is currently being made [17–21]. Due to the limited on-board computational capabilities of remote-sensing hyperspectral acquisition systems, low-complexity compression schemes stand as the most practical solution for such restricted environments [21–24]. Nevertheless, most state-of-the-art lossy compressors are generalizations of existing 2D image or video compression algorithms [25]. For this reason, they are normally characterized by a high computational burden, intensive memory requirements and a non-scalable nature. These features prevent their use in power-constrained applications with limited hardware resources, such as on-board compression [26,27].

In this context, the Lossy Compression Algorithm for Hyperspectral Image Systems (HyperLCA) [28] was developed as a novel hardware-friendly lossy compressor for hyperspectral images. HyperLCA was introduced as a low-complexity alternative that provides good compression performance at high compression ratios with a reasonable computational burden. Additionally, this algorithm compresses blocks of image pixels independently, which reduces both the amount of data to be managed at once and the hardware resources to be allocated. As a result, the HyperLCA algorithm becomes a very competitive solution for most applications based on pushbroom/whiskbroom scanners, paving the way for real-time compression performance. The flexibility and the high level of parallelism intrinsic to the HyperLCA algorithm have been evaluated in earlier publications. In particular, its suitability for real-time performance in applications characterized by high data-rates with restrictions in computational resources due to power, weight or space was tested in [29].

In this work, we focus on a use case where a visible-near-infrared (VNIR) hyperspectral pushbroom scanner is mounted onto a UAV. In particular, we have analyzed the performance of FPGAs for the lossy compression of hyperspectral images against low-power GPUs (LPGPUs), in order to establish the benefits and barriers of using each of these hardware devices for the on-board compression task. Specifically, we have implemented the HyperLCA algorithm on a Xilinx Zynq-7000 programmable System on Chip (SoC). We have selected this SoC because it can be found in low-cost, low-weight and compact-size development boards, such as the MicroZed™ and PYNQ boards. Although UAVs have been consolidated as trending aerial observation platforms, their acquisition costs are still not affordable for many end customers, not only those who want to purchase them but also those who lease their services. For this reason, we must also address the economic implications that come along with these devices. On this basis, in this work we have focused on the on-board computing platform, searching for a less expensive alternative that, in exchange, may not offer the same level of performance and functionality as costlier commercial products. Additionally, it is important to note that, while the experiments carried out in this work are oriented to the current necessities imposed by an application based on drones, all drawn conclusions can be extrapolated to other fields in which remotely sensed hyperspectral images have to be compressed in real time, such as spaceborne missions that employ next-generation space-grade FPGAs.

The rest of this paper is organized as follows. Section 2 outlines the main operations involved in the HyperLCA algorithm. In addition, it also includes a detailed explanation of the implementation model developed for the execution of the HyperLCA compressor on the selected FPGA SoC. Section 3 analyzes the obtained experimental results from both the compression-quality and hardware-implementation points of view. Section 4 discusses the strengths and limitations of the proposed solution and includes a comparison of the obtained results with those achieved by a selection of LPGPU embedded systems, following the implementation model shown in [29]. Finally, Section 5 draws the main conclusions of this work.

#### **2. Materials and Methods**

#### *2.1. HyperLCA Algorithm*

The HyperLCA algorithm is a novel lossy compressor for hyperspectral images especially designed for applications based on pushbroom/whiskbroom sensors. Concretely, this transform-based solution can independently compress blocks of image pixels regardless of any spatial alignment between pixels. Consequently, it facilitates the parallelization of the entire compression process and makes it relatively simple to stream the compression of hyperspectral frames collected by a pushbroom or whiskbroom sensor. Additionally, the HyperLCA compressor allows fixing a desired minimum compression ratio and guarantees that at least this minimum will be achieved for each block of image pixels. This makes it possible to know in advance the maximum data rate that will be attained after the acquisition and compression processes, in order to efficiently manage the data transfers and/or storage. Moreover, the HyperLCA compressor provides quite satisfactory rate-distortion results for higher compression ratios than those achievable by lossless compression approaches. Furthermore, the HyperLCA algorithm preserves the most characteristic spectral features of image pixels, which are potentially more useful for ulterior hyperspectral analysis techniques, such as target detection, spectral unmixing, change detection and anomaly detection, among others [30–34].

Figure 1 shows a graphic representation of the main computing stages involved in the HyperLCA compressor. First, the *Initialization* stage computes the compression parameter, *pmax*, from the input parameters *CR*, *Nbits* and *BS*. Since the HyperLCA compressor follows an unmixing-like strategy, the *pmax* most representative pixels within each independently processed block of image pixels are compressed with the highest precision. The other image pixels are then reconstructed using their scalar projections, *V*, over the previously selected pixels. Therefore, *pmax* is estimated according to the minimum desired compression ratio to be reached, *CR*, the number of hyperspectral pixels within each image block, *BS*, and the number of bits used for representing the values of the *V* vectors, *Nbits*. Second, these *pmax* characteristic pixels are selected using orthogonal projection techniques in the *HyperLCA Transform* stage, which results in a spectrally uncorrelated version of the original hyperspectral data. Third, the outputs of the *HyperLCA Transform* stage are adapted in the following *HyperLCA Preprocessing* stage to be coded more efficiently in the fourth and last stage, *HyperLCA Entropy Coding*. In the following lines, the operations involved in these four algorithm stages are further analyzed in order to give an insight into the decisions taken for the acceleration of the HyperLCA algorithm on FPGA-based hardware devices.

**Figure 1.** Data flow among the different computing stages of the HyperLCA compressor.

#### 2.1.1. General Notation

Before starting the algorithm description, it is necessary to introduce the general notation followed throughout the rest of this manuscript. In this sense, **HI** = {**F***i*, *i* = 1, ..., *nr*} is a sequence of *nr* scanned cross-track lines of pixels acquired by a pushbroom sensor, where each frame, **F***i*, comprises *nc* pixels with *nb* spectral bands. Pixels within **HI** are grouped into blocks of *BS* pixels, **M***k* = {**r***j*, *j* = 1, ..., *BS*}, where *BS* is normally equal to *nc*, or a multiple of it, and *k* spans from 1 to (*nr*·*nc*)/*BS*. *μ***ˆ** represents the average pixel of each image block, **M***k*. Therefore, **C** is defined as the centralized version of **M***k* after *μ***ˆ** has been subtracted from all image pixels. **E** = {**e***n*, *n* = 1, ..., *p*max} stores the *p*max most different hyperspectral pixels extracted from each block **M***k*. **V** = {**v***n*, *n* = 1, ..., *p*max} comprises *p*max vectors of *BS* elements, where each vector **v***n* corresponds to the projection of the *BS* pixels within **M***k* onto the corresponding extracted pixel, **e***n*. Finally, **Q** = {**q***n*, *n* = 1, ..., *p*max} and **U** = {**u***n*, *n* = 1, ..., *p*max} store *p*max vectors of *nb* bands that are orthogonal among them.
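As an illustration of this notation, the following minimal NumPy sketch groups the pixels of a pushbroom sequence into the **M***k* blocks. The function name is ours, and it assumes the frames are already assembled into an (*nr*, *nc*, *nb*) cube and that *BS* divides *nr*·*nc*:

```python
import numpy as np

def blocks_from_frames(frames, bs):
    """Group the pixels of nr cross-track frames into blocks of BS pixels,
    the independent compression unit M_k of HyperLCA.
    frames: (nr, nc, nb) array, one row of nc pixels per acquired frame."""
    nr, nc, nb = frames.shape
    assert (nr * nc) % bs == 0, "BS is assumed to divide the total pixel count"
    pixels = frames.reshape(nr * nc, nb)   # flatten in acquisition order
    return pixels.reshape(-1, bs, nb)      # (nr*nc/BS, BS, nb)
```

With *BS* = *nc*, each block is exactly one acquired frame, which is the streaming-friendly case highlighted in the text.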

#### 2.1.2. HyperLCA Initialization

The HyperLCA compressor must first determine the number of pixels, **e***n*, and projection vectors, **v***n*, to be extracted from each image block, **M***k* (*p*max), according to the input parameters *CR*, *Nbits* and *BS*, as shown in Equation (1). In it, *DR* refers to the number of bits per pixel per band used for representing the hyperspectral image elements to be compressed. As can be deduced, the number of selected pixels, *p*max, directly determines the maximum compression ratio to be reached with the selected algorithm configuration. Furthermore, a bigger *p*max results in better reconstructed images but lower compression ratios.

$$p_{\text{max}} \le \frac{DR \cdot nb \cdot (BS - CR)}{CR \cdot (DR \cdot nb + N_{\text{bits}} \cdot BS)} \tag{1}$$
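For concreteness, Equation (1) can be evaluated directly; the sketch below (the helper name is ours) returns the largest integer *p*max satisfying the bound:

```python
import math

def pmax_upper_bound(dr, nb, bs, cr, nbits):
    """Largest number of extracted pixels p_max that still guarantees the
    minimum compression ratio CR, following Equation (1).
    dr: bits per pixel per band; nb: spectral bands; bs: pixels per block."""
    return math.floor(dr * nb * (bs - cr) / (cr * (dr * nb + nbits * bs)))
```

For instance, with 12-bit data, 180 bands, *BS* = 512, *CR* = 20 and *Nbits* = 12, the bound allows extracting 6 pixels per block; relaxing *CR* to 10 allows more pixels, in line with the trade-off described above.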

#### 2.1.3. HyperLCA Transform

The HyperLCA compressor is a transform-based algorithm that employs a modified version of the well-known Gram-Schmidt orthogonalization method to obtain a compressed and uncorrelated image. In this regard, the *HyperLCA Transform* stage is in charge of carrying out the spectral transform. For this reason, it is the algorithm stage that involves the most demanding computing operations and, thus, the one most amenable to acceleration on parallel dedicated hardware devices.

The *HyperLCA Transform* is described in detail in Algorithm 1. As can be seen, the *HyperLCA Transform* stage receives as inputs the hyperspectral image block to be compressed, **M***k*, and the number of hyperspectral pixels to be extracted, *p*max. As outputs, the *HyperLCA Transform* produces the set of the *p*max most different hyperspectral pixels, **E**, and their corresponding projection vectors, **V**. To do so, the hyperspectral image block, **M***k*, is first centralized by subtracting the average pixel, *μ***ˆ**, from all pixels within **M***k*, which results in the centralized version of the data, **C** (line 3). Second, the *p*max most characteristic pixels are sequentially extracted in lines 4 to 15. In this iterative process, the brightness of each image pixel is calculated in lines 5 to 7; it is defined as the dot product of each image pixel with itself (line 6). The extracted pixel, **e***n*, is the pixel within the original image block, **M***k*, with the highest brightness in iteration *n* (lines 8 and 9). Afterwards, the orthogonal vectors **q***n* and **u***n* are accordingly defined, as shown in lines 10 and 11, respectively. **u***n* is employed to estimate the projection of each image pixel over the direction spanned by the selected pixel, **e***n*, yielding the projection vector, **v***n*, in line 12. Finally, this information is subtracted from **C** in line 13.

Accordingly, **C** retains in subsequent iterations the spectral information that cannot be represented by the already selected pixels, **e***n*. For this reason, once the *p*max pixels have been extracted, **C** contains the spectral information that will not be recovered after the decompression process. Hence, it represents the losses introduced by the HyperLCA compressor. Consequently, the remaining information in **C** could be used to establish extra stopping conditions based on quality metrics, such as the *Signal-to-Noise Ratio* (*SNR*) or the *Maximum Absolute Difference* (*MAD*). In this work, these extra stopping conditions have been discarded, and thus a fixed number of *p*max iterations is executed for all image blocks, **M***k*.

**Algorithm 1** HyperLCA Transform.

**Inputs:** **M***k* = [**r**1, **r**2, ..., **r***BS*], *p*max
**Outputs:** *μ***ˆ**; **E** = [**e**1, **e**2, ..., **e***pmax*]; **V** = [**v**1, **v**2, ..., **v***pmax*]
**Algorithm:**
1: {Additional stopping condition initialization.}
2: Average pixel: *μ***ˆ**
3: Centralized image: **C** = **M***k* − *μ***ˆ** = [**c**1, **c**2, ..., **c***BS*]
4: **for** *n* = 1 **to** *p*max **do**
5: &emsp;**for** *j* = 1 **to** *BS* **do**
6: &emsp;&emsp;Brightness calculation: *b**j* = **c***j*<sup>T</sup> · **c***j*
7: &emsp;**end for**
8: &emsp;Maximum brightness: *j*max = argmax(*b**j*)
9: &emsp;Extracted pixel: **e***n* = **r***jmax*
10: &emsp;**q***n* = **c***jmax*
11: &emsp;**u***n* = **q***n*/*b**jmax*
12: &emsp;Projection vector: **v***n* = **u***n*<sup>T</sup> · **C**
13: &emsp;Information subtraction: **C** = **C** − **q***n* · **v***n*
14: &emsp;{Additional stopping condition checking.}
15: **end for**
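For reference, Algorithm 1 can be sketched in floating-point NumPy as follows. This is a simplified illustration of ours (it omits the optional stopping conditions and stores pixels as matrix columns), not the fixed-point FPGA implementation discussed later:

```python
import numpy as np

def hyperlca_transform(M, p_max):
    """Floating-point sketch of Algorithm 1 (HyperLCA Transform).
    M: (nb, BS) image block, one hyperspectral pixel per column."""
    mu = M.mean(axis=1, keepdims=True)        # line 2: average pixel
    C = M.astype(np.float64) - mu             # line 3: centralized image
    E, V = [], []
    for _ in range(p_max):
        b = np.einsum('ij,ij->j', C, C)       # line 6: brightness per pixel
        j_max = int(np.argmax(b))             # line 8: brightest residual pixel
        E.append(M[:, j_max].copy())          # line 9: extracted pixel e_n
        q = C[:, j_max].copy()                # line 10
        u = q / b[j_max]                      # line 11
        v = u @ C                             # line 12: projection vector v_n
        C -= np.outer(q, v)                   # line 13: subtract explained info
        V.append(v)
    return mu.ravel(), np.array(E), np.array(V)
```

After the subtraction in line 13, the residual of the just-extracted pixel becomes exactly zero, so the same pixel is never selected twice.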

#### 2.1.4. HyperLCA Preprocessing

To ensure an efficient entropy coding of the *HyperLCA Transform* outputs, they must first be adapted in the *HyperLCA Preprocessing* stage. This transformation process is performed in two steps:

#### Scaling V Vectors

**V** vectors represent the projection of each pixel within **M***k* over the direction spanned by each orthogonal vector, **u***n*, in each iteration. Therefore, the potential values of each element of the **V** vectors are in the range (−1, 1]. However, the later *Entropy Coding* stage works exclusively with integers. Consequently, the elements within **V** must be scaled to fully exploit the dynamic range offered by the input parameter *N*bits, as shown in Equation (2). After doing so, the scaled **V** vectors are rounded to the closest integer values.

$$
v_{j,\text{scaled}} = (v_j + 1) \cdot (2^{N_{\text{bits}} - 1} - 1) \tag{2}
$$
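A direct transcription of Equation (2), including the rounding step, might look as follows (the helper name is ours; `np.rint` rounds to the nearest integer):

```python
import numpy as np

def scale_v(v, nbits):
    """Map projection values in (-1, 1] onto non-negative integers spanning
    the dynamic range given by Nbits, as in Equation (2), then round."""
    return np.rint((v + 1.0) * (2 ** (nbits - 1) - 1)).astype(np.int64)
```

With *N*bits = 12, a projection of 1 maps to 4094 and one of −1 maps to 0, so every scaled value fits in 12 bits.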

#### Error Mapping

The codification stage also exploits the redundancies within the data in the spectral domain to assign the shortest word lengths to the most common values. With it, the compression ratio achieved by the *HyperLCA Transform* can be further increased. To this end, the output vectors of the *HyperLCA Transform* (*μ***ˆ**, **E** and the scaled **V**) are losslessly pre-processed in the *Error Mapping* stage. To do so, the prediction error mapper described in the Consultative Committee for Space Data Systems (CCSDS) Blue Books is employed [35]. In this regard, each output vector (*μ***ˆ**, **e***n* and scaled **v***n*) is individually processed and transformed to be exclusively composed of non-negative integer values close to zero.
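The effect of such a mapper can be illustrated with the simplified sign-interleaving sketch below. Both simplifications are ours: a plain previous-sample predictor is assumed for illustration, and the actual CCSDS mapper additionally bounds the mapped value using the predicted sample, which is omitted here.

```python
def map_errors(values):
    """Sign-interleave prediction errors so the output is non-negative and
    small magnitudes stay close to zero (simplified CCSDS-style mapper)."""
    prev = 0
    mapped = []
    for x in values:
        delta = x - prev                     # previous-sample predictor (assumption)
        mapped.append(2 * delta if delta >= 0 else -2 * delta - 1)
        prev = x
    return mapped
```

Errors of 0, −1, 1, −2, 2, ... map to 0, 1, 2, 3, 4, ..., which is exactly the kind of distribution the Golomb–Rice coder of the next stage favours.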

#### 2.1.5. HyperLCA Entropy Coding

The *Entropy Coding* is the last stage of the HyperLCA compressor. It follows a lossless entropy-coding strategy based on the Golomb–Rice algorithm [36]. As in the *Error Mapping* stage, each single output vector is independently coded. To do so, the compression parameter, *M*, is estimated as the average value of the targeted vector. Afterwards, each of its elements is divided by *M* in order to obtain the quotient (*q*) and the remainder (*r*) of the division. On one hand, the quotient, *q*, is codified using unary code. On the other hand, the remainder, *r*, could be directly coded using *b* = log<sub>2</sub>(*M*) bits if *M* were a power of 2. Nevertheless, *M* can actually be any positive integer. For this reason, taking *b* = ⌈log<sub>2</sub>(*M*)⌉, the remainder, *r*, is coded in plain binary using *b* − 1 bits for *r* values smaller than 2<sup>*b*</sup> − *M*; otherwise, it is coded as *r* + 2<sup>*b*</sup> − *M* using *b* bits.
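The coding rule just described can be sketched as follows; the helper is ours and emits a readable bit string rather than a packed bitstream:

```python
import math

def golomb_rice_encode(n, m):
    """Encode a non-negative integer n with Golomb parameter m > 0.
    The quotient is coded in unary; the remainder uses truncated binary,
    so m does not need to be a power of two."""
    q, r = divmod(n, m)
    unary = '1' * q + '0'          # quotient q in unary, terminated by a 0
    if m == 1:
        return unary               # no remainder bits are needed
    b = math.ceil(math.log2(m))
    cutoff = (1 << b) - m          # 2**b - m
    if r < cutoff:
        return unary + format(r, 'b').zfill(b - 1)
    return unary + format(r + cutoff, 'b').zfill(b)
```

For a power of two such as *m* = 4, the cutoff is zero and every remainder takes exactly log2(*m*) bits; e.g., encoding 7 with *m* = 4 gives quotient 1 and remainder 3, i.e., the codeword 1011.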

#### 2.1.6. HyperLCA Data Types and Precision Evaluation

Most of the compression performance achieved by the HyperLCA algorithm is obtained in the *HyperLCA Transform* stage, originally designed to use floating-point precision. However, FPGA devices are, in general, more efficient at dealing with integer operations. Additionally, the execution of floating-point operations on different devices may produce slightly different results. For this reason, the performance of the HyperLCA algorithm using integer arithmetic was extensively studied in [37] in order to make it more suitable for this kind of device. In particular, the fixed-point concept was applied in a custom way, employing integer arithmetic and bit shifting [38]. In that previous work, two different versions of the *HyperLCA Transform* were considered, employing 32 and 16 bits, respectively, for representing the image values stored in the centralized version of the hyperspectral frame to be processed, **C**.

It is important to note that the aforementioned versions were developed for working with hyperspectral images whose element values can be represented with at most 16 bits per pixel per band. In this context, the quality of the compression results is very similar between the 32-bit fixed-point and the single-precision floating-point versions. Nevertheless, the compression performance obtained by the 16-bit version is not as competitive as that of the other two solutions. It was shown that this 16-bit approach provides its best performance for very high compression ratios, small *BS* and images packaged using fewer than 16 bits per pixel per band. This makes the 16-bit version a very interesting option for applications with limited hardware resources, especially regarding the availability of on-chip memory, one of the weaknesses of FPGAs.

On this basis, we have made some improvements to the 16-bit version to obtain better compression results. In this context, we have assumed that the available capturing system codes hyperspectral images with a maximum of 12 bits, which is the most common scenario in remote-sensing applications [39,40] and the one addressed in this manuscript. Table 1 summarizes the number of bits used by the two algorithm versions introduced in [37], known as I32 and I16, and the one described in this work, referred to as I12, for some algorithm variables. It should be pointed out that, when integer arithmetic is used, the **V** vectors are directly represented using integer data types and do not need to be scaled and rounded to integer values.

As described in Section 2.1.3, the information present in matrix **C** decreases with every iteration, *n*, though some specific values may increase in some quite unlikely situations. For this reason, the initial values of **C** were divided by 2 in the previous 16-bit version in order to avoid overflow, which directly decreases the precision by one bit. In the new version proposed in this work, this division is no longer necessary, since 2 extra bits are available for these improbable situations. As can be noticed from Table 1, image **C** is stored employing 16 bits in the new I12 version, as in the I16 version. However, it is assumed that image values are coded employing at most 12 bits instead of 16 bits. This permits having 2 extra bits for representing the integer part of the fixed-point values of the **C** elements. Consequently, image precision is not sacrificed, as it was in the previous 16-bit version, to deal with possible overflow scenarios. Additionally, this new version also allows having 2 bits for representing the decimal part of the fixed-point values of matrix **C**, which is not possible in the I16 version.
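The custom fixed-point scheme mentioned above (integer arithmetic plus bit shifting) can be illustrated with the following sketch using 2 fractional bits, as in the I12 layout of **C**. The helper names are ours, and overflow guarding is omitted:

```python
FRAC_BITS = 2  # I12 layout of C: 2 bits for the decimal part

def to_fixed(x):
    """Real value -> fixed-point integer (nearest multiple of 2**-FRAC_BITS)."""
    return int(round(x * (1 << FRAC_BITS)))

def fixed_mul(a, b):
    """Multiply two fixed-point values, shifting back to keep the scale."""
    return (a * b) >> FRAC_BITS

def to_real(a):
    """Fixed-point integer -> real value."""
    return a / (1 << FRAC_BITS)
```

For example, 2.0 × 1.5 is computed as (8 × 6) >> 2 = 12, which converts back to 3.0; 12-bit sample values plus the 2 guard bits for the integer part and these 2 fractional bits fill the 16-bit container described in Table 1.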


**Table 1.** Number of bits used for the integer and decimal parts of the variables involved in the *HyperLCA Transform*, for the three versions of the algorithm developed to use integer arithmetic (I32, I16, I12).
