#### *2.2. FPGA Implementation of the HyperLCA Algorithm*

An FPGA (Field-Programmable Gate Array) can be seen as a whiteboard for designing specialized hardware accelerators (HWacc), by composition of predefined memory and logic blocks that are available in the platform. Therefore, a HWacc is a set of architectural FPGA resources, connected and configured to carry out a specific task. Each vendor proposes its own reconfigurable platform, instantiating a particular mix of such resources, around a particular interconnection architecture.

FPGAs provide flexibility to designers, since the silicon adapts to the solution, instead of fitting the solution to the computing platform as is the case of GPU-based solutions. On top of that, FPGA-based implementations can take advantage of the fine-grain parallelism of their architecture (operation level) as well as task-level concurrency. In this paper, the HyperLCA lossy compressor has been implemented onto a heterogeneous Zynq-7000 SoC (System-on-a-Chip) from Xilinx that combines a Processor System (PS), based on a dual-core ARM processor, and a Programmable Logic (PL), based on an Artix-7/Kintex-7 FPGA architecture.

The development process followed a productive methodology based on High-Level Synthesis (HLS). This methodology focuses the effort on the design and verification of the HWacc, as well as the exploration of the solution space that helps to speed up the search for value-added solutions. The starting point of a methodology based on HLS is a high-abstraction model of the functionality to be deployed on the FPGA, usually described by means of high-level programming languages, such as C or C++. Then, HLS tools can generate the corresponding RTL (Register Transfer Level) implementation, functionally equivalent to the C or C++ model [41,42].

Productivity is the strongest point of HLS technology, and one of the main reasons why hardware architects and engineers have recently been attracted to it. However, HLS technology (more specifically, the tools that implement the synthesis process) has some weaknesses. For example, despite the fact that the designer can describe a modular and hierarchical implementation of a HWacc (i.e., grouping behavior via functions or procedures), all sub-modules are orchestrated by a global clock due to the way the translation from C to RTL is done. Another example is the rigid semantics when specifying dataflow architectures, which allows only a reduced number of alternatives. This prevents the designer from obtaining optimal solutions for certain algorithms and problems [43,44], as was the case of the HyperLCA compressor. Therefore, to overcome the limitations of current HLS tools, a hybrid solution that combines modules developed in VHDL with HLS-synthesized C or C++ blocks has been selected. This makes it possible to achieve the maximum performance; otherwise, it would not be possible to realize either the synchronization mechanisms or the parallel processing proposed in this work. On top of that, this approach also makes it possible to optimize the necessary resources thanks to the use of custom producer-consumer data exchange patterns that are not supported by HLS tools.

The four main stages of the HyperLCA compressor, described in Section 2.1 and shown in Figure 1, have been restructured in order to better adapt to devices with a high degree of fine-grain parallelism such as FPGAs. Thus, the operations involved in these parts of the HyperLCA algorithm have been grouped in two main stages: *HyperLCA Transform* and *HyperLCA Coder*. These new stages can run in parallel, which improves the performance.

The *Initialization* stage (Section 2.1.2), where the calculation of the *pmax* parameter is done, is considered only at design time. This is because several hardware components, such as internal memories or FIFOs, must be configured with the appropriate size. In this sense, it is worth mentioning that the *pmax* value (Equation (1)) depends on other parameters also known at design time (the minimum desired compression ratio (*CR*), the block size (*BS*) and the number of bits (*Nbits*) used for scaling the projection vectors, **V**) and, therefore, can be fixed for the HWacc.

#### 2.2.1. HyperLCA Transform

The *HyperLCA Transform* stage performs the operations of the HyperLCA transform itself (described in Section 2.1.3), as well as the computation of the average pixel and the scaling of the **V** vector. These are the most computationally demanding operations of the algorithm and, since they are highly parallelizable by nature, they are good candidates for acceleration.

Figure 2 shows an overview of the hardware implementation of the *HyperLCA Transform* stage. The *Avg\_Cent*, *Brightness* and *Proj\_Sub* modules have been modeled and implemented using the aforementioned HLS tools, while the memory buffers and the custom logic that integrates and orchestrates all the components in the design have been instantiated and implemented using the VHDL language. This HWacc has a single entry corresponding to the hyperspectral block (**M***k*) that will be compressed, while the output is composed of three elements (which differs from the output of Algorithm 1) in order to achieve a high level of parallelism. It must also be mentioned that the output of the *HyperLCA Transform* stage feeds the input of the *HyperLCA Coder* stage, which performs the error mapping and entropy coding in parallel. Thus, the centroid, *μ***ˆ**, is obtained as depicted in Algorithm 1, while the *pmax* most different hyperspectral pixels, **E**, are not directly obtained. In this regard, the HWacc provides the indexes of such pixels, *jmax*, in each loop iteration of Algorithm 1 (outer loop), while the *HyperLCA Coder* is responsible for obtaining each hyperspectral pixel, **e***n*, from the external memory in which the hyperspectral image is stored, to build the vector of most different hyperspectral pixels, **E**. Finally, the projection of the pixels within each image block, **V**, is provided by the HWacc in batch mode, i.e., each loop iteration (outer loop of Algorithm 1) obtains a projection (*vn*) that forms part of the **V** vector.

The architecture of the *HyperLCA Transform* HWacc can be divided into two main modules: *Avg\_Cent*, which corresponds to lines 2 and 3 of Algorithm 1, and *Loop\_Iter* (the main loop body, lines 5 to 14). These modules are connected by a bridge buffer (*BBuffer*), whose depth is only 32 words, and also share a larger buffer (*SBuffer*) with capacity to store a complete hyperspectral block. The size of the *SBuffer* depends on the *BS* algorithm parameter and its depth is determined at design time. The role of the *SBuffer* is to avoid the costly accesses to external memory, as is the case of the double data rate (DDR) memory in the Zynq-7000 SoCs.

The implementation of the *SBuffer* is based on a first-in, first-out (FIFO) memory that is written and read by different producers (i.e., *Avg* and *Brightness*) and consumers (i.e., *Cent* and *Projection*) of the original hyperspectral block and the intermediate results (transformations). Since there is more than one producer and consumer for the *SBuffer*, a dedicated synchronization and access control logic has been developed in VHDL (not illustrated in Figure 2). The use of a FIFO contributes to reducing the on-chip memory resources in the FPGA fabric, and its use is feasible because of the linear access pattern of the producers and consumers. However, this type of solution would not have been possible with HLS, because the semantics of stream-based communication between stages in a dataflow limits the number of producers and consumers to one. Also, it is not possible, with the HLS technology used, to exploit inter-loop parallelism as is done in the proposed solution. Notice that there is a data dependency between iterations in Algorithm 1 (centralized block, **C**) and, therefore, the HLS tool infers that the next iteration must wait for the previous one to finish, resulting in fact in sequential computation. However, a deeper analysis of the behavior of the algorithm shows that the computation of the brightest pixel for iteration *n* + 1 can start as soon as it has received the output of the subtraction stage, which will still be processing iteration *n*.

**Figure 2.** Overview of the *HyperLCA Transform* hardware accelerator. Light blue and white boxes represent modules implemented using HLS. Light red boxes and FIFOs represent glue logic and memory elements designed and instantiated using VHDL language.

The *Avg\_Cent* module has been developed using HLS technology and contains two sub-modules, *Avg* and *Cent*, that implement lines 2 and 3 of Algorithm 1, respectively.

*Avg* This sub-module computes the centroid or average pixel, *μ***ˆ**, of the original hyperspectral block, **M***k*, following line 2 of Algorithm 1, and stores it in the *CBuffer*, a buffer that it shares with the *Cent* sub-module. During this operation, *Avg* forwards a copy of the centroid to the *HyperLCA Coder* via a dedicated port (orange arrow). At the same time, the *Avg* sub-module writes all the pixels of **M***<sup>k</sup>* into the *SBuffer*. A copy of the original hyperspectral block, **M***k*, will thus be available once *Avg* finishes, ready to be consumed as a stream by *Cent*, which reduces the latency.

Figure 3 shows in detail the functioning of the *Avg* stage. The main problem in this stage is the way the hyperspectral data is stored. In our case, the hyperspectral block, **M***k*, is ordered by the bands that make up a hyperspectral pixel. However, to obtain the centroid, *μ***ˆ**, the hyperspectral block must be read by bands (in-width reading) instead of by pixels (in-depth reading). We introduce an optimization that handles the data as it is received (in-depth), avoiding the reordering of the data and thus maintaining a stream-like processing. This optimization consists of an accumulation vector, whose depth is equal to the number of bands, that stores the partial results of the summation for each band, i.e., the first position of this vector contains the partial result for the first band, the second position the partial result for the second band, and so on. The use of this vector removes the loop-carried dependency in the HLS loop that models the behavior of the *Avg* sub-module, saving processing cycles. The increase in resources is minimal and is justified by the gain in performance.

**Figure 3.** Overview of *Avg* stage.
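As a minimal sketch of this idea (illustrative names, data widths and pragmas are our assumptions, not the actual source of the design), the following C++ fragment accumulates the incoming samples into a per-band vector as they arrive in-depth, so that the pixel loop can be pipelined without reordering the data:

```
#include <cstdint>

#define NB 180   // number of spectral bands (fixed to 180 in this work)
#define BS 1024  // pixels per hyperspectral block (design-time parameter)

// Computes the average pixel (centroid) while the block arrives in-depth
// (all bands of pixel 1, then all bands of pixel 2, ...), with no reordering.
void avg_centroid(const uint16_t block[BS * NB], uint16_t centroid[NB]) {
    uint32_t acc[NB];  // accumulation vector: one partial sum per band
    for (int b = 0; b < NB; b++) acc[b] = 0;

    int band = 0;
    for (int i = 0; i < BS * NB; i++) {
#pragma HLS PIPELINE II=1
        // Consecutive iterations update different acc[] positions, so the
        // add-accumulate recurrence on a given band is NB iterations apart
        // and the loop can be pipelined (a DEPENDENCE directive may be
        // needed to state that the inter-iteration distance is NB).
        acc[band] += block[i];
        band = (band == NB - 1) ? 0 : band + 1;
    }
    for (int b = 0; b < NB; b++) centroid[b] = (uint16_t)(acc[b] / BS);
}
```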

*Cent* This sub-module reads the original hyperspectral block, **M***k*, from the *SBuffer* to centralize it (**C**). This operation consists of subtracting the average pixel, calculated in the previous stage, from each hyperspectral pixel of the block (line 3 of Algorithm 1). Figure 4 shows this process, highlighting the elements involved in the centralization of the first hyperspectral pixel. Thus, the *Cent* block reads the centroid, *μ***ˆ**, which is stored in the *CBuffer*, as many times as there are hyperspectral pixels in the original block (i.e., *BS* times in the example illustrated in Figure 4). Therefore, the *CBuffer* is an addressable buffer that permanently stores the centroid of the hyperspectral block currently being processed. The result of this stage is written into the *BBuffer* FIFO, which makes an additional copy of the centralized image, **C**, unnecessary. As soon as the centralized components of the hyperspectral pixels are computed, the data is ready at the input of the *Loop\_Iter* module and, therefore, it can start to perform its operations without waiting for the block to be completely centralized.

The *Loop\_Iter* module instantiates the *Brightness* and *Proj\_Sub* sub-modules which have been designed and implemented using HLS technology. Both modules are connected by two FIFO components (*uVectorB* and *qVectorB*) using customized VHDL code to link them. Unlike *Avg\_Cent* module, *Loop\_Iter* module is executed several times (specifically *pmax* times, line 4 of Algorithm 1) for each hyperspectral block that is being processed.

*Brightness* This sub-module starts as soon as there is data in the *BBuffer*. In this sense, the *Brightness* sub-module works in parallel with the rest of the system; the input of the *Brightness* module is the output of the *Cent* module in the first iteration, i.e., the centralized image, **C**, while the input for all other iterations is the output of the *Subtraction* sub-module, which corresponds to the image resulting from the subtraction, depicted as **X** for the sake of clarity (see brown arrows in Figure 2).

The *Brightness* sub-module has been optimized to achieve a dataflow behavior that takes the same time regardless of the location of the brightest hyperspectral pixel. Figure 5 shows how the orthogonal projection vectors **q***<sup>n</sup>* and **u***<sup>n</sup>* are obtained by the three sub-modules in *Brightness*. First, the *Get Brightness* sub-module reads, in order, each hyperspectral pixel of the block from the *BBuffer* (**C** or **X**, depending on the loop iteration) and calculates its brightness (*bj*) as specified in line 6 of Algorithm 1. *Get Brightness* also makes a copy of the hyperspectral pixel in an internal buffer (*actual\_pixel*) and in the *SBuffer*. Thus, *actual\_pixel* contains the current hyperspectral pixel whose brightness is being calculated, while the *SBuffer* will contain a copy of the hyperspectral block with its transformations (line 6 and assignment in line 13 of Algorithm 1).

Once the brightness of the current hyperspectral pixel is calculated, the *Update Brightness* sub-module will update the internal vector *brightness\_pixel* if the current brightness is greater than the previous one. Regardless of this condition, the module will empty the content of *actual\_pixel* in order to keep the dataflow with the *Get Brightness* sub-module. The operations of both sub-modules are performed until all hyperspectral pixels of the block have been processed (inner loop, lines 5 to 7 of Algorithm 1). The reason for using a vector to store the brightest pixel, instead of a FIFO, is that the HLS tool would otherwise stall the dataflow.
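The following sketch captures this streaming search for the brightest pixel, assuming the brightness of a pixel is its squared norm (our reading of line 6 of Algorithm 1); names, data widths and pragmas are illustrative and do not reproduce the actual ports of the design:

```
#include <cstdint>

#define NB 180
#define BS 1024

// Returns the index (jmax) of the brightest pixel of the incoming block while
// copying it into sbuffer, mimicking the Get Brightness / Update Brightness pair.
int find_brightest(const int16_t in[BS * NB], int16_t sbuffer[BS * NB],
                   int16_t brightest_pixel[NB]) {
    int64_t best = -1;
    int jmax = 0;
    for (int j = 0; j < BS; j++) {
        int16_t actual_pixel[NB];  // current pixel under evaluation
        int64_t b = 0;
        for (int k = 0; k < NB; k++) {
#pragma HLS PIPELINE II=1
            int16_t v = in[j * NB + k];
            actual_pixel[k] = v;
            sbuffer[j * NB + k] = v;   // copy of the block kept in SBuffer
            b += (int32_t)v * v;       // brightness as the squared norm
        }
        // Update phase: the copy loop always runs, so latency does not depend
        // on where the brightest pixel is located within the block.
        bool update = (b > best);
        if (update) { best = b; jmax = j; }
        for (int k = 0; k < NB; k++) {
#pragma HLS PIPELINE II=1
            brightest_pixel[k] = update ? actual_pixel[k] : brightest_pixel[k];
        }
    }
    return jmax;
}
```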

Finally, the orthogonal projection vectors **q***<sup>n</sup>* and **u***<sup>n</sup>* are accordingly obtained from the brightest pixel (lines 10 and 11 of Algorithm 1) by the *Build quVectors* module. Both are written in separate FIFOs: *qVectorB* and *uVectorB*, respectively. Furthermore, the contents of these FIFOs are copied into the *qVector* and *uVector* arrays in order to provide a double memory space that prevents the system from deadlocking and allows the *Proj\_Sub* sub-module to read the orthogonal projection vectors **q***<sup>n</sup>* and **u***<sup>n</sup>* several times (concretely, *BS* times) to obtain the projected image vector, **v***n*, and transform the current hyperspectral block. This module also returns the index of the brightest pixel, *jmax*, so that the *HyperLCA Coder* stage can read the original pixel from the external memory (such as DDR), where the hyperspectral image is stored, in order to build the compressed bitstream.

**Figure 5.** Overview of *Brightness* stage.

*Proj\_Sub* Although this sub-module is represented by separate *Projection* and *Subtraction* boxes in Figure 2, it must be mentioned that both perform their computations in parallel. The *Proj\_Sub* sub-module reads the hyperspectral block that was written in the *SBuffer* by the *Brightness* sub-module (**X**) just once. Figure 6 shows an example of the *Projection* and *Subtraction* stages. First, each hyperspectral pixel of the block is read by the *Projection* sub-module to obtain the projected image vector according to line 12 of Algorithm 1. At the same time, the hyperspectral pixel is written in the *PSBuffer*, which can store two hyperspectral pixels. Two pixels are needed because the *Subtraction* stage begins right after the projection of the first hyperspectral pixel is ready, i.e., the executions of *Projection* and *Subtraction* are shifted by one pixel. Figure 6 shows such behavior. While pixel **r**<sup>1</sup> is being consumed by *Subtraction*, pixel **r**<sup>2</sup> is being written in the *PSBuffer*. During the projection of the second hyperspectral pixel (**r**<sup>2</sup>), the subtraction of the first one (**r**<sup>1</sup>) can be performed, since all the input operands, including the projection **v***n*, are available, following the expression in line 13 of Algorithm 1.

The output of the *Projection* sub-module is the projected image vector, **v***n*, which is forwarded to the *HyperLCA Coder* accelerator (through the *Projection* port) and to the *Subtraction* sub-module (via the *PBuffer* FIFO). At the same time, the output of the *Subtraction* stage feeds the *Loop\_Iter* block (see purple arrow, labeled as **X**, in Figure 2) with the pixels of the transformed block in the *i*-th iteration. This means that the *Brightness* stage can start the next iteration without waiting for the complete image to be subtracted. Thus, the initialization interval between loop iterations is reduced as much as possible, because *Brightness* starts as soon as the first subtracted data is ready.

**Figure 6.** Example of Projection and Subtraction stages.
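The one-pixel shift between both stages can be sketched as follows; the fixed-point scaling (SCALE) and the exact arithmetic of lines 12 and 13 of Algorithm 1 are assumptions here, so the fragment only illustrates the producer-consumer overlap through the two-entry *PSBuffer*:

```
#include <cstdint>

#define NB 180
#define BS 1024
#define SCALE 12  // illustrative fixed-point rescaling, not the actual design value

// One loop iteration projects pixel j while pixel j-1, already sitting in the
// two-entry PSBuffer, is being subtracted: the two stages are shifted by one pixel.
void proj_sub(const int16_t x_in[BS * NB],          // block written by Brightness (X)
              const int32_t q[NB], const int32_t u[NB],
              int32_t v[BS],                         // projected image vector v_n
              int16_t x_out[BS * NB]) {              // transformed block fed back
    int16_t psbuffer[2][NB];                         // PSBuffer: two pixels deep
    for (int j = 0; j <= BS; j++) {                  // extra iteration drains PSBuffer
        for (int k = 0; k < NB; k++) {
#pragma HLS PIPELINE II=1
            if (j < BS) {                            // Projection of pixel j
                int16_t s = x_in[j * NB + k];
                psbuffer[j & 1][k] = s;
                if (k == 0) v[j] = 0;
                v[j] += (int32_t)q[k] * s;           // dot product (line 12)
            }
            if (j > 0) {                             // Subtraction of pixel j-1 (line 13)
                int32_t sub = psbuffer[(j - 1) & 1][k]
                            - (int32_t)(((int64_t)v[j - 1] * u[k]) >> SCALE);
                x_out[(j - 1) * NB + k] = (int16_t)sub;
            }
        }
    }
}
```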

The FPGA-based solution described above highlights how the operations are performed in parallel to make the most out of such technology. Also, specific synchronization scenarios have been identified that cannot be designed using HLS alone, due to the current communication semantics supported by the synthesis tools. For example, it is not possible for current synthesis tools to perform an analysis of the data and control dependencies such as the one carried out in this work. However, a hybrid solution based on HLS and hand-written VHDL code that glues the RTL models synthesized from the C/C++ models brings to life an efficient dataflow (see Figure 7). In this regard, the use of optimally sized FIFOs to interconnect the modules is key. For example, while the *Brightness* sub-module is filling the *SBuffer*, the *Projection* sub-module is draining it, and at the same time this sub-module supplies the *Subtraction* sub-module with the same data read from the *SBuffer*. Finally, the *Subtraction* sub-module feeds back the *Brightness* sub-module through the *BBuffer* FIFO. The *Brightness* sub-module fills, in turn, the *SBuffer* with the same data, closing the circle; this loop is repeated *pmax* times.

Furthermore, the initialization interval between image blocks has been reduced. The task performed by the *Avg* sub-module for block *k* + 1 (see Figure 7) can be scheduled at the same time that the *Projection* and *Subtraction* sub-modules are computing their outputs for block *k*, and right after the completion of the *Brightness* sub-module for block *k*. This is possible since the glue logic discards the output of the *Subtraction* sub-module during the last iteration. This logic ensures that the *BBuffer* is filled with the output from the *Avg* sub-module that feeds the first execution of *Brightness* for the first iteration of block *k* + 1, resulting in an overlapped execution of the computation for blocks *k* and *k* + 1.

**Figure 7.** Dataflow of *HyperLCA Transform* hardware accelerator. Overlapping of operations within inter-loops and inter-blocks.

Unfortunately, despite the optimizations introduced in the dataflow architecture, the HWacc is not able to reach the performance target as shown by the experimental results (see Section 3). The value in the column labeled as 1 PE (Processing Element) is clearly below the standard frame rates provided by contemporary commercial hyperspectral cameras. However, the number of FPGA resources required by the one-PE version of the HWacc is very low (see Section 3.2) which makes it possible to implement further parallelism strategies to speed up the compression process. Thus, three options open up for solving the performance problem.

First, a task-level parallelism approach is possible by means of the use of several instances of the HWacc working concurrently. Second, the intra-HWacc parallelism can be increased using multiple operators, computing several pixel bands at the same time. Third, a combination of the two previous approaches can be used. Regardless of the strategy chosen, the limiting factor is the version of the Xilinx Zynq-7000 programmable SoC, which must have enough resources to support the design. In Section 4, a detailed analysis of several single-HWacc (with variations in the number of PEs) and multi-HWacc versions of the design is drawn.

So far, the inner architecture of a HWacc that only performs a computational operation over a single band component of a hyperspectral pixel has been described. However, it can be modified to increase the number of bands that are processed in parallel. Thus, the HWacc of the HyperLCA compressor turns from a single PE into multiple PEs. This fact opens two new challenges. The first challenge is to increase the width of the input and output ports of the modules, in accordance with the number of bands that will be processed in parallel. It must be mentioned that this is technologically possible because HLS-based solutions allow designers to build their own data types. For example, if a band component of a hyperspectral pixel is represented by a 16-bit unsigned integer, we could define our own data type consisting of a 160-bit unsigned integer packing ten bands of a hyperspectral pixel (see Figure 8). The second challenge has to do with the strategy to process the data in parallel. In this regard, a solution based on the map-reduce programming model has been followed.

Figure 8 shows an example of the improvements applied to the *Cent* stage, following the above-mentioned optimizations. The inputs of this stage are the hyperspectral block, **M***k*, and the average pixel, *μ***ˆ**, which are read in blocks of *N* bands. The example assumes that the block is composed of ten bands and uses a user-defined data type, specifically a 160-bit unsigned integer (10 bands times the 16 bits used to represent each band). Then, both blocks are broken down into the individual components that feed the PEs in an orderly fashion. This process is also known as the *map* phase in the map-reduce programming model [45]. It must be mentioned that the design needs as many PEs as divisions made in the block.

Once the PEs have performed the assigned computational operation, the *reduce* phase of the map-reduce model is executed. For *Cent* sub-module, this stage consists of gathering in a block the partial results produced by each one of the PEs. Thus, a new block of *N*-bands is built, which in turn is part of the centralized block (**C**), which is the output of *Cent* sub-module.

**Figure 8.** Map-reduce programming model and data packaging on *Cent* stage.
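A possible HLS-style sketch of this packed map-reduce scheme for the *Cent* stage is shown below. It relies on the ap_uint<> type and hls::stream<> class shipped with Vivado HLS, while the port names, the packing order and the sign handling are illustrative assumptions rather than the actual code of the design:

```
#include <ap_int.h>       // Vivado HLS arbitrary-precision types
#include <hls_stream.h>   // Vivado HLS streaming interface

#define N_PES 10                               // bands processed in parallel
typedef ap_uint<16 * N_PES> packed_t;          // 160-bit word = 10 bands x 16 bits

// Centralizes one packed word per cycle: each PE subtracts its slice of the
// centroid from the matching slice of the incoming hyperspectral block.
void cent_packed(hls::stream<packed_t> &block_in,
                 hls::stream<packed_t> &centroid_in,
                 hls::stream<packed_t> &centralized_out, int n_words) {
    for (int i = 0; i < n_words; i++) {
#pragma HLS PIPELINE II=1
        packed_t m  = block_in.read();
        packed_t mu = centroid_in.read();
        packed_t c;
        for (int p = 0; p < N_PES; p++) {
#pragma HLS UNROLL
            // "Map": split the packed word into individual band components ...
            ap_int<17> diff = (ap_uint<16>)m.range(16 * p + 15, 16 * p)
                            - (ap_uint<16>)mu.range(16 * p + 15, 16 * p);
            // ... "Reduce": gather the partial results back into a packed word
            // (negative values wrap into 16-bit two's complement in this sketch).
            c.range(16 * p + 15, 16 * p) = (ap_uint<16>)diff;
        }
        centralized_out.write(c);
    }
}
```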

#### 2.2.2. Coder

The *HyperLCA Coder* is the second of the HWacc developed, responsible for the error mapping and entropy-coding stages of the HyperLCA compression algorithm (Figure 1). The coder module teams up with the transform module to perform in parallel the CCSDS prediction error mapping [35] and the Golomb–Rice [36] entropy-coding algorithms as the different vectors are received from the *HyperLCA Transform* block.

The *HyperLCA transform* block generates the centroid, *μ***ˆ**, extracted indexes of pixels, *jmax*, and projection vectors, **v***n*, for an input hyperspectral block, **M***k*. These arrays are consumed as they are received, reducing the need for large intermediate buffers. To minimize the necessary bandwidth to the memory that stores the hyperspectral image, only the indexes of the selected pixels in each iteration of the transform algorithm (line 8 of Algorithm 1) are provided to the coder (*MB\_index* port).

The operation of the transform and coder blocks overlaps in time. Since the coder takes approximately half of the time the transform needs to generate each vector (see Section 3.2) for the maximum number of PEs (i.e., the maximum performance achieved), no contention situation takes place, reducing the pressure over the FIFOs that connect both blocks and, therefore, requiring less space for these communication channels.

Figure 9 sketches the internal structure of the *HyperLCA Coder* that has been modeled entirely using Vivado HLS. It is a dataflow architecture comprising three steps. During the first step, the prediction mapping and entropy coding of all the input vectors are performed by the *coding command generator*. The result of this step is a sequence of commands that are subsequently interpreted by the *compressed bitstream generator*.

**Figure 9.** Overview of the *HyperLCA Coder* hardware accelerator.

The generation of the bitstream was extracted from the original entropy-coding functionality, which enabled a more efficient implementation of the latter: it could be rewritten as a perfect loop and, therefore, Vivado HLS was able to generate a pipelined datapath with the minimum initiation interval (II = 1).

The generation of the compressed bitstream is simple. This module continuously reads the *cmd\_queue* FIFO for a new command to process. A command contains the operation (unary or binary coding), the word to be coded (quotient or remainder, see Section 2.1.5), and the number of bits to generate. The unary and binary coding functions simply iterate over the word to be coded and produce a sequence of bits that corresponds to the compressed form of the hyperspectral block.
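A minimal, software-level sketch of this command mechanism is shown below; the command fields, the Golomb-Rice parameter *k* and the unary convention (quotient coded as zeros followed by a one) are illustrative assumptions, and in the real design the queue is the *cmd\_queue* FIFO between HLS dataflow stages:

```
#include <cstdint>
#include <queue>
#include <vector>

enum CmdOp : uint8_t { CMD_UNARY, CMD_BINARY };

struct CodingCmd {
    CmdOp    op;     // unary or binary coding
    uint32_t word;   // quotient (unary) or remainder (binary)
    uint8_t  nbits;  // number of bits to emit when binary coding
};

// Coding command generator: Golomb-Rice style split of one mapped value with
// parameter k into a unary command (quotient) and a binary command (remainder).
void encode_value(uint32_t value, uint8_t k, std::queue<CodingCmd> &cmd_queue) {
    cmd_queue.push({CMD_UNARY,  value >> k, 0});
    cmd_queue.push({CMD_BINARY, value & ((1u << k) - 1u), k});
}

// Compressed bitstream generator: interprets the commands and appends bits.
void generate_bits(std::queue<CodingCmd> &cmd_queue, std::vector<bool> &bits) {
    while (!cmd_queue.empty()) {
        CodingCmd c = cmd_queue.front();
        cmd_queue.pop();
        if (c.op == CMD_UNARY) {
            for (uint32_t i = 0; i < c.word; i++) bits.push_back(false); // q zeros
            bits.push_back(true);                                        // terminator
        } else {
            for (int i = (int)c.nbits - 1; i >= 0; i--)                  // MSB first
                bits.push_back(((c.word >> i) & 1u) != 0);
        }
    }
}
```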

Finally, the third step packs the compressed bitstream into words that are written to memory. For this implementation, the width of the memory word is 64 bits, the maximum allowed by the *AXI Master* interface port of the Zynq-7020 SoC. The *bitstream packer* module instantiates a small buffer (64 words) that is flushed to DDR memory once it has been filled. This way, the average number of memory access cycles per word is optimized by means of burst requests.
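The following behavioural sketch illustrates this packing strategy under the stated assumptions (64-bit words, a 64-word buffer flushed in bursts); the flush callback stands in for the *AXI Master* write and the handling of the final partial word is omitted:

```
#include <cstdint>
#include <cstddef>
#include <functional>
#include <vector>

// Bits are packed into 64-bit memory words and flushed in bursts of 64 words.
class BitstreamPacker {
public:
    explicit BitstreamPacker(std::function<void(const uint64_t *, size_t)> flush)
        : flush_(flush) {}

    void push_bit(bool b) {
        word_ = (word_ << 1) | (b ? 1u : 0u);
        if (++filled_ == 64) {                  // a complete 64-bit memory word
            buffer_.push_back(word_);
            word_ = 0;
            filled_ = 0;
            if (buffer_.size() == 64) drain();  // full 64-word buffer -> one burst
        }
    }

    void drain() {                              // also called at the end of a block
        if (!buffer_.empty()) {
            flush_(buffer_.data(), buffer_.size());
            buffer_.clear();
        }
    }

private:
    std::function<void(const uint64_t *, size_t)> flush_;
    std::vector<uint64_t> buffer_;
    uint64_t word_ = 0;
    int filled_ = 0;
};
```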

As mentioned above, the *HyperLCA Transform* block feeds the coder with the indexes of the extracted pixels, **e***n*, which correspond to the highest brightness as the hyperspectral block is processed. Hence, it is necessary to retrieve the *nb* spectral bands from memory before the coder can start generating the compressed stream for the pixel vector. This is the role played by the *pixel reader* module. As in the case of the *bitstream packer* step, the *pixel reader* makes use of a local buffer and issues burst requests to read the bands in the minimum number of cycles.

While the computing complexity of the coder module is low, the real challenge when it comes to the implementation of its architecture is to write a C++ HLS model that is consistent through the whole design, verification and implementation processes. To achieve this goal, the communication channels have been provided with extra semantics so as to keep the different stages in sync, despite the fact that the model of computation of the architecture falls into the category of GALS (Globally Asynchronous, Locally Synchronous) systems. Side-channel information is embedded in the *cmd\_queue* and *compressed\_stream* FIFOs that connect the different stages of the coder. This information is used by the different modules to reset their internal states or to stop and resume their operation (i.e., special commands to be interpreted by the bitstream generator or a flush signal as input to the packing module). This way, it is possible to integrate all the functionality of the coder under a single HLS design, which simplifies and speeds up the design process. On top of that, this strategy avoided the need to tailor the C++ HLS specification to make it usable in different steps of the design process. For example, it is common to write variations of the model depending on whether functional verification (C simulation) or RTL verification (co-simulation) is taking place, due to the fact that the former is based on an untimed model while the latter introduces timing requirements [46,47].

Designing for parallelism is key to obtain the maximum performance. The dataflow architecture of the coder ensures an inter-module level of concurrency. However, the design must be balanced as to the latency of the different stages in the dataflow. Otherwise, the final result could be jeopardized. As mentioned before, decoupling the generation of the compressed bitstream from the entropy-coding logic, led to a more efficient implementation of the latter by the HLS synthesis tool. Also, this change helped to redistribute the computing effort, achieving a more balanced implementation.

In the first stage of the dataflow (*coding command generator*), a simple logic that controls the encoding of each input vector plus a header is implemented. It is an iterative process that performs the error mapping and error coding over the centroid, and *pmax* times over the extracted pixels and projection vectors. The bulk of this process is, thus, the encoding algorithm. The encoding is delegated to another module that itself implements an internal dataflow. In this way, it is possible to reduce the interval between two encoding operations. As can be seen in Figure 9, the prediction mapping and entropy-coding sub-modules communicate through a ping-pong buffer for the *mapped* vector.

To conclude this section, it is worth mentioning a couple of optimizations carried out related to the *pow* and *log* operations, which are used by the prediction mapping and entropy-coding algorithms. This type of arithmetic operation is costly to implement in FPGAs, since the generated hardware is based on tables that consume on-chip resources (mainly BRAM memories) and on an iterative process that increases latency. Since the base used in this application is 2, the hardware can be largely simplified. Thus, we can substitute the *pow* and *log* operations with a logical shift instruction and the GCC compiler *\_\_builtin\_clz(x)* built-in function, respectively. This change is part of the refactoring process of the reference code implementation (golden model) that is almost mandatory at the beginning of any HLS project. The *\_\_builtin\_clz(x)* function is synthesizable and counts the leading zeros of the integer *x*. Therefore, the computation of the lowest power of 2 higher than *M*, performed during the entropy coding, is redefined as follows:

Listing 1: FPGA implementation of costly arithmetic operations during entropy coding.

```
// Original code
b = log2(M) + 1;
difference = pow(2, b) - M;

// FPGA optimization
b = (32 - __builtin_clz(M));
difference = (1 << b) - M;
```
#### *2.3. Reference Hyperspectral Data*

In this section, we introduce the hyperspectral imagery used in this work to evaluate the performance of the proposed computing approach using reconfigurable logic devices. This data set is composed of 4 hyperspectral images that were also employed in [29], where the HyperLCA algorithm was implemented in low-power consumption embedded GPUs. We have kept the same data set in order to compare, in Section 3, the performance of the developed FPGA-based solution with the results obtained by the GPU accelerators.

In particular, the test bench was sensed by the acquisition system extensively analyzed in [48]. This aerial platform mounts a *Specim FX10* pushbroom hyperspectral camera on a DJI Matrice 600 drone. The image sensor covers the range of the electromagnetic spectrum between 400 and 1000 nm using 1024 spatial pixels per scanned cross-track line and 224 spectral bands. Nevertheless, the hyperspectral images used in the experiments only retain the spectral information of 180 spectral bands. Concretely, the first 10 spectral bands and the last 34 bands have been discarded due to the low spectral response of the hyperspectral sensor at these wavelengths.

The data sets were collected over some vineyard areas in the center of Gran Canaria island (Canary Islands, Spain), in particular in a village called Tejeda, during two different flight campaigns. Figures 10 and 11 display some Google Maps pictures of the scanned terrain in both flight missions, whose exact coordinates are 27°59′35.6″N 15°36′25.6″W (green point in Figure 10) and 27°59′15.2″N 15°35′51.9″W (red point in Figure 11), respectively. Both flight campaigns were performed at a height of 45 m over the ground, which results in a ground sampling distance along and across the scanned line of approximately 3 cm.

**Figure 10.** Google Maps pictures of the vineyard areas sensed during the first flight campaign. False RGB representations of the hyperspectral images employed in the experiments.

The former was carried out with a drone speed of 4.5 m/s and the camera frame rate set to 150 frames per second (FPS). In particular, this flight mission consisted of 12 waypoints that led to a total of 6 swathes, but only one was used for the experiments, which has been highlighted in green in Figure 10. Two portions of 1024 hyperspectral frames, with all their 1024 hyperspectral pixels, were selected from this swath to generate two of the data sets that compose the test bench. A closer view of these selected areas can also be seen in Figure 10. These images are false RGB representations extracted from the acquired hyperspectral data.

The latter was carried out with a drone speed of 6 m/s and the camera frame rate set to 200 FPS. The entire flight mission consisted of 5 swathes, but only one was used for the experiments in this work, which has been highlighted in red in Figure 11. From this swath, two smaller portions of 1024 hyperspectral frames were cut out for simulations. A closer view of these selected areas is also displayed in Figure 11. Once again, they are false RGB representations extracted from the acquired hyperspectral data. For more details about the flight campaigns, we encourage the reader to see [48].

**Figure 11.** Google Maps pictures of the vineyard areas sensed during the second flight campaign. False RGB representations of the hyperspectral images employed in the experiments.

#### **3. Results**

#### *3.1. Evaluation of the HyperLCA Compression Performance*

The goodness of the I12 version of the HyperLCA algorithm proposed in this work has been evaluated and compared with the previous I32 and I16 versions of the algorithm presented in [37] and with the single-precision floating-point (F32) implementation presented in [29]. To do so, the hyperspectral imagery described in Section 2.3 has been compressed and decompressed using different settings of the HyperLCA compressor input parameters.

In this context, the information lost after the lossy compression process has been analyzed using three different quality metrics: concretely, the *Signal-to-Noise Ratio (SNR)*, the *Root Mean Squared Error (RMSE)* and the *Maximum Absolute Difference (MAD)*, which are shown in Equations (3)–(5), respectively. The *SNR* and the *RMSE* give an idea of the average information lost in the compression-decompression process. Bigger values of *SNR* are indicative of better compression performance. On the contrary, higher *RMSE* values mean that the lossy compression has introduced bigger data losses. The *MAD* assesses the amount of information lost for the worst reconstructed image value. For our targeted application, the dynamic range is 2<sup>12</sup> = 4096 and, hence, the worst possible *MAD* value is 4095. For the sake of clarity, the aforementioned metrics have been calculated using the entire compressed-decompressed images, i.e., after the HyperLCA algorithm has finished compressing all image blocks, **M***k*.

$$\text{SNR} = 10 \cdot \log\_{10} (\frac{\sum\_{i=1}^{nb} \sum\_{j=1}^{np} (I\_{i,j})^2}{\sum\_{i=1}^{nb} \sum\_{j=1}^{np} (I\_{i,j} - I c\_{i,j})^2}) \tag{3}$$

$$\text{RMSE} = \frac{1}{np \cdot nb} \cdot \sqrt{\sum\_{i=1}^{nb} \sum\_{j=1}^{np} (I\_{i,j} - Ic\_{i,j})^2} \tag{4}$$

$$\text{MAD} = \max(|I\_{i,j} - Ic\_{i,j}|) \tag{5}$$
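For reference, a straightforward computation of these three metrics over a pair of flattened images could look as follows (a verification sketch only; it follows the equations exactly as written above, and all names are illustrative):

```
#include <cmath>

// Reference computation of the quality metrics in Equations (3)-(5) for an
// original image I and its compressed-decompressed version Ic, both stored as
// flat arrays of np pixels times nb bands.
void quality_metrics(const double *I, const double *Ic, long np, long nb,
                     double &snr, double &rmse, double &mad) {
    double sig = 0.0, err = 0.0;
    mad = 0.0;
    for (long i = 0; i < np * nb; i++) {
        double d = I[i] - Ic[i];
        sig += I[i] * I[i];
        err += d * d;
        if (std::fabs(d) > mad) mad = std::fabs(d);  // Equation (5)
    }
    snr  = 10.0 * std::log10(sig / err);             // Equation (3)
    rmse = std::sqrt(err) / (double)(np * nb);       // Equation (4), as written
}
```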

Table 2 shows the average results obtained for each configuration of the HyperLCA compressor input parameters using the data set described in Section 2.3. Different conclusions can be drawn from these results. First of all, it is confirmed that the I12 version offers significantly better-quality compression results than the previous I16 version, employing the same hardware resources. These gaps are even wider for smaller compression ratios. This is because the losses introduced by the decrease in data precision with respect to the previous I16 version, as mentioned in Section 2.1.6, are disguised by the bigger losses introduced by the compression process itself at higher compression ratios.


**Table 2.** Comparison of the compression results for the four versions of the HyperLCA algorithm under study: I12, I16, I32 and F32.

Additionally, the deviations in the values of the three quality metrics employed in this work between the I32, I12 and F32 versions are almost negligible, with the advantage of halving the memory space required for storing **C**. Second, it can also be concluded that the HyperLCA lossy compressor is able to compress the hyperspectral data with high compression ratios without introducing significant spectral information losses.

#### *3.2. Evaluation of the HyperLCA Hardware Accelerator*

Section 2.2 describes the FPGA-based implementation of the HyperLCA compressor as defined in Algorithm 1. The architecture of the HWacc is divided into two blocks, *Transform* and *Coder*, which run in parallel following a producer-consumer approach. Therefore, for the performance analysis of the whole solution, the slowest block is the one determining the productivity of the proposed architecture.

The *HyperLCA Transform* block bears most of the complexity and computational burden of the compression process. For that reason, several optimizations (see Section 2.2.1) have been applied during its design in order to achieve a high degree of parallelism and, thus, reduce the latency. One of the most important improvements is the realization of the map-reduce programming model to enable an architecture with multiple PEs (see Figure 8) working concurrently on several bands. The experiments carried out over the different alternatives for the *HyperLCA Transform* block are intended to evaluate how the performance and resource usage of the FPGA-based solution scale as the number of PEs instantiated by the architecture grows.

The configuration of the input parameters has been set as follows: the *CR* parameter has been set to 12, 16 and 20; the *BS* parameter to 1024, 512 and 256; and the *Nbits* parameter to 12 and 8. The value of *pmax* is obtained at design time from these parameters following Equation (1) and is listed in the last column of Table 3. Concerning the sizing of the various memory elements present in the architecture, it is determined by two parameters: the number of PEs to instantiate (parallel processing of hyperspectral bands) and the size of the image block to be compressed. It is worth mentioning that, in this version of the HWacc, the number of PEs must be a divisor of the number of hyperspectral bands in order to simplify the design of the datapath logic.

First, the data width of the architecture must be defined. This parameter is obtained by multiplying the number of PEs by the size of the data type used to represent a pixel band. In the version of the HWacc under evaluation, the I12 alternative has been selected, due to its good behavior (comparable to the I32 version, as discussed in Section 2.1.6) and the resource savings it brings. For the I12 version, a band is represented with an *unsigned short int*, which turns into 16-bit words in memory. On the contrary, choosing the I32 version, the demand for memory resources and internal buffers (such as the *SBuffer*) would double, because an *unsigned int* data type is used in the model definition. Thus, if the HWacc only instantiates one PE, the data width will be 16 bits, whereas if the HWacc instantiates 12 PEs (i.e., the HWacc performs 12 operations over the set of bands in parallel), the data width will be 192 bits. Second, the depth of the *SBuffer* must be calculated following Equation (6), where *BS* is the block size, *nb* is the number of bands (in our case fixed to 180), and *NPEs* is the number of processing elements.

$$SBufferDepth\_{min} = \frac{BS \cdot nb}{N\_{PEs}} \tag{6}$$

Being optimal as to the use of BRAM blocks within the FPGA fabric is compulsory, since this resource is in high demand as the number of PEs increases (as seen in Table 4). On top of that, keeping the use of resources under control brings along some benefits, such as helping the synthesis tool to generate a better datapath and, therefore, obtaining an implementation with a shorter critical path, or containing the consumption of energy.


**Table 3.** Design-time configurations parameters of the HyperLCA transform hardware block for hyperspectral images with 180 bands.

Table 3 lists the memory requirements demanded by the *SBuffer* for the 108 different configurations of the *HyperLCA Transform* block evaluated in this work. The *SBuffer* component is, by far, the largest memory instantiated by the architecture and it is implemented as a FIFO. The *SBuffer* has been generated with the FIFO generator tool provided by Xilinx, which only allows depths that are powers of two. Therefore, from a technical point of view, it is not possible to match the minimum required space of the *SBuffer* with the depth obtained from the vendor tools. To mitigate the waste of memory space derived from such a constraint of the tool, the *SBuffer* has been broken into two concatenated FIFOs (i.e., the output of the first FIFO is the input of the second) while keeping the facade of a single FIFO to the rest of the system. For example, the minimum depth of the *SBuffer* for *PE* = 1 and *BS* = 256 is 46,080. With a single FIFO, the smallest depth with enough capacity generated by the Xilinx tools would be 65,536; therefore, the unused memory space would represent approximately 30% of the overall resources for the *SBuffer*. However, by using the double FIFO approach, one FIFO of 32,768 words plus another one of 16,384 words would be used, and only ≈6% of the resources assigned to the *SBuffer* would be wasted.
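The effect of this splitting strategy can be reproduced with the following self-contained sketch (the helper names and the splitting heuristic are ours, not the actual generation scripts):

```
#include <cstdio>

// Smallest power of two greater than or equal to n.
static unsigned next_pow2(unsigned n) {
    unsigned p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Covers the required SBuffer depth with two concatenated power-of-two FIFOs.
static void size_sbuffer(unsigned required, unsigned &fifo_a, unsigned &fifo_b) {
    fifo_a = next_pow2(required) / 2;        // largest power of two below the requirement
    fifo_b = next_pow2(required - fifo_a);   // second FIFO covers the remainder
}

int main() {
    // Example from the text: BS = 256, nb = 180, 1 PE -> 256 * 180 = 46,080 words.
    unsigned a, b;
    size_sbuffer(46080, a, b);
    std::printf("FIFO depths: %u + %u = %u (required: 46080)\n", a, b, a + b);
    // Expected: 32768 + 16384 = 49152 words, i.e., roughly 6% unused space,
    // versus 65536 words (about 30% unused) when a single FIFO is generated.
    return 0;
}
```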

To evaluate the *HyperLCA Transform* hardware accelerator, the proposed HWacc architecture has been implemented using the Vivado Design Suite. This toolchain is provided by Xilinx and features an HLS tool (Vivado HLS) devoted to optimizing the development process of IP (*Intellectual Property*) components for FPGA-based solutions on their own devices. The first implemented prototype instantiated one HWacc targeting the XC7Z020-CLG484 version of the Xilinx Zynq-7000 SoC. This FPGA has been selected because of its low cost, low weight and high flexibility, features that make it an interesting device to be integrated in aerial platforms such as drones. The aim of this first prototype is to evaluate the capability of a mid-range reconfigurable FPGA, such as the XC7Z020 chip, for a specific application such as the HyperLCA compression algorithm. Hence, and due to the amount of resources available on the target device, the maximum possible number of PEs for the single-HWacc prototype is 12.

Table 4 summarizes the resources required, extracted from post-synthesis reports, for each of the 108 versions of the HWacc that process different block sizes of 180-band hyperspectral images from the data set (see Section 2.3).


**Table 4.** Post-Synthesis results for the different versions of the HyperLCA Transform for a Xilinx Zynq-7020 programmable SoC and image block up to 180 bands.

Several conclusions can be derived from these figures. In the first place, the amount of digital signal processor (DSP), flip-flop (FF) and look-up table (LUT) resources increases with the number of PEs but is similar for different values of the *BS* parameter. On the contrary, the demand for BRAMs depends directly on *BS* and increases slightly with *PE* for a given block size. The *PE* = 10 version deserves a special remark. Such a version represents an anomaly in the linear behavior of the resource demand. Even with the use of the double FIFO approach, as explained before, the total capacity of the BRAM used to instantiate the *SBuffer* is clearly oversized for that data width, so as to ensure that a hyperspectral block and its transformations (**M***<sup>k</sup>* and **C**, respectively) can be stored in-circuit. Second, in addition to the resources needed by the HWacc of the *HyperLCA Transform*, it is necessary to take into account those corresponding to the other components in the system, such as the *Coder* or the DMA (*Direct Memory Access*) engine that moves the hyperspectral data (**M***k*) from/to DDR to/from the hardware accelerators. These extra components will make use of the remaining resources (especially LUTs, which are the most demanded, as can be seen in Table 4), establishing a maximum of 12 for the number of PEs.

Table 5 shows the post-synthesis results for the *HyperLCA Coder* block. The resources demanded by the coder do not depend on the *BS* parameter or the number of PEs. It is important to mention that the majority of the BRAM, FF and LUT resources are assigned to the two AXI-Memory interfaces that the HLS tool generates for the *Pixel Reader* and the *Bitstream Packer* modules (see Figure 9).

**Table 5.** Post-Synthesis results for the HyperLCA Coder block for a Xilinx Zynq-7020 programmable SoC and pixel size up to 180 bands.


Table 6 shows the throughput, expressed as the maximum frame rate, for each configuration of the HWacc using two clock frequencies: 100 MHz and 150 MHz (Table A1 extends this information by adding the number of cycles needed to compute a hyperspectral block). Columns labeled as *PE* denote the number of *processing elements* instantiated by the HWacc, which, in combination with the input parameters (*Nbits*, *BS*, *CR* and *pmax*), determine the average number of hyperspectral blocks that the HWacc is able to compress per second (FPS). The maximum frame rate has been normalized to 1024 hyperspectral pixels per block, which is the size of the frame delivered by the acquisition system.


**Table 6.** Maximum frame rate obtained using the FPGA-based solution on a Xilinx ZynQ-7020 programmable SoC for hyperspectral images with 180 bands.

*f*<sup>1</sup> Clock frequency: 100 MHz. *f*<sup>2</sup> Clock frequency: 150 MHz.

It is worth noting that these results include the transform and coding steps, which are performed in parallel. Table 7 shows the coding times (in clock cycles) compared to the time needed by the transform step with the best configuration possible (i.e., *PE* = 12). The *HyperLCA Coder* takes roughly 50% less time on average and, since the relation between both hardware components is a dataflow architecture, the latency of the whole process corresponds to the maximum; that is, the delay of the *Transform* step.

One key factor is the minimum frame rate that must be supported for the targeted application. Ideally, such a threshold would correspond to the maximum frame rate provided by the employed hyperspectral sensor (i.e., 330 FPS). However, the experimental validation of the camera set-up in the drone (Section 2.3) tells us that frame rates between 150 and 200 FPS are enough to obtain hyperspectral images with the desired quality, given the speed and altitude of the flights. Therefore, a threshold value of 200 FPS is established as the minimum throughput to validate the viability of the HyperLCA hardware core. In Table 7, the configurations that would be valid given this minimum have been highlighted (bold typeface). Thus, it can be observed that the *PE* = 12 version, for both clock frequencies, and the *PE* = 10 version at 150 MHz reach the minimum frame rate, even for the most demanding scenario (*Nbits* = 8, *BS* = 1024 and *CR* = 12).

In turn, by using lower values for *Nbits*, the compression rate is reduced due to the higher number of **V** vectors extracted by the HyperLCA Transform. This means that more computations must be performed to compress a hyperspectral block. It must be mentioned that the *PE* = 10 version meets the FPS target in all the scenarios but the most demanding one when the clock frequency is set to 100 MHz. As with the *PE* = 10 version, the *PE* = 6 version does not reach the minimum FPS in a few scenarios (concentrated in *Nbits* = 8 and *BS* = 1024), but it can be a viable solution given the actual needs of the application set-up.


**Table 7.** Comparison of the computation effort made by the *Transform* and *Coding* stages.

In addition to the performance results listed in Table 6, Figure 12 graphically shows the speed-up gained by the FPGA-based implementation as the number of PEs increases. The values have been normalized, using the average time of the *PE* = 1 version as the baseline. Several conclusions can be drawn from this figure. First, it is observable that the *PE* = 12 version of the HWacc performs between 7 (*Nbits* = 12) and 7.6 (*Nbits* = 8) times faster than the *PE* = 1 version. This configuration is the one that guarantees the fastest compression results. Second, the speed-up gain is nearly linear for the *PE* = 2 and *PE* = 4 versions, whereas the scalability of the accelerator drops as the number of PEs grows (see Figure 12a,b). This behavior is seen for both the *Nbits* = 8 and *Nbits* = 12 configurations (see Figure 12c,d). However, for higher values of *BS* and *CR*, the shape of the curve shows a better trend, though it does not reach the desired linear speed-up.


**Figure 12.** Speed-up obtained for multiple PEs compared to the 1 PE version of the HyperLCA HW compressor. (**a**) Speed-up *Nbits* = 8; (**b**) Speed-up *Nbits* = 12; (**c**) Speed-up curve *Nbits* = 8; (**d**) Speed-up curve *Nbits* = 12.

#### **4. Discussion**

In this paper, we present a detailed description of an FPGA-based implementation of the HyperLCA algorithm, a new lossy transform-based compressor. As fully discussed in Section 3.2, the proposed HWacc meets the requirements imposed by the targeted application in terms of the frame rate range employed to capture quality hyperspectral images. In this section, we would like to also provide a comprehensive analysis of the suitability of the FPGA-based HyperLCA compressor implementation and a comparison with the results obtained by an embedded System-on-Module (SoM) based on a GPU that has been recently published [29].

Concretely, María Díaz et al. introduce in [29] three implementation models of the HyperLCA compressor. In this previous work, the parallelism inherent to the HyperLCA algorithm was exploited beyond the thread-level concurrency of the GPU programming model by taking advantage of CUDA streams. In particular, the third approach, referred to as *Parallel Model 3* in [29], achieves the highest speed-up, especially for bigger image blocks (*BS* = 1024). This implementation model relies on pipelining the data transfers between the host and the device with the kernel executions for the different image blocks (**M***k*). To do this, such a proposal exploits the benefits of concurrent kernel execution through the management of CUDA streams. Unlike the FPGA-based implementation model proposed in this paper, only the *HyperLCA Transform* stage is accelerated in the GPU. In this case, the codification stage is also pipelined with the *HyperLCA Transform* stage but executed on the host using another parallel CPU process. Table 2 collects the quality results of the compression process issued by the aforementioned GPU-based implementation model in terms of SNR, MAD and RMSE (F32 version). For more details, we encourage the reader to see [29].

To compare the performance of the GPU-based implementation model and the FPGA solution presented here, we are going to use two assessment metrics: the maximum number of frames compressed in a second (FPS) and the power efficiency in terms of FPS per watt. The latter figure of merit is of great importance given the target application, since it is critical to maximize the battery life of the drone. Additionally, the GPU-based model introduced in [29] has been executed in three different NVIDIA Jetson embedded computing devices: Jetson Nano, Jetson TX2 and the most recent supercomputer Jetson Xavier NX.

These modules have been selected for the reasonable computational power they provide at a relatively low power consumption. Table 8 summarizes the most relevant technical characteristics of these embedded computing boards. As can be seen, the Jetson Nano module integrates the least advanced, oldest generation of the three GPU architectures, and instantiates the fewest execution units or CUDA cores as well. On the contrary, the Jetson Xavier NX represents one of the latest NVIDIA power-efficient products, which offers more than 10X the performance of its widely adopted predecessor, the Jetson TX2.

**Table 8.** Most relevant characteristics of the NVIDIA modules Jetson Nano, Jetson TX2 and Jetson Xavier NX.


Table 9 collects the performance results obtained for the three GPU-based implementations of the *HyperLCA* and the most powerful implementation of the HWacc in a Zynq-7020 SoC (*PE* = 12). For each implementation, the clock frequency and power budget are specified. Several algorithm parameters have been tested over 180-band input images. In addition, Figure 13 displays the obtained FPS according to different configurations of the input image block size (*BS*). For the sake of simplicity, we only represent results for *Nbits* = 8, since the behavior is similar for *Nbits* = 12, as the reader can see in Table 9.

From the performance point of view, the most competitive FPGA results, compared to the fastest GPU implementations on the Jetson Nano and Jetson TX2, are obtained for the smallest input block size (*BS* = 256), with very similar figures for *BS* = 512 and even better ones compared to the Jetson Nano. For the largest block size (*BS* = 1024), the performance of the GPU-based solutions is higher because of how the parallelism is inferred in both architectures. On the one hand, the FPGA architecture can process 12 bands at a time, overlapping the computation of several groups of pixels due to the internal pipeline architecture. Therefore, the processing time increases linearly as the number of pixels in the image block grows. On the other hand, GPUs can process all the pixels in an image block in parallel, regardless of the size of the block. Thus, the processing time is nearly constant, independently of the value of parameter *BS*. However, this assumption does not hold in reality, since the time required for memory transfers, kernel launches and data dependencies has to be taken into consideration. In this context, the time spent setting up and launching the instructions to execute a kernel or perform a memory transfer must also be taken into account. As analyzed in [29], the time used in transferring image blocks of 256 pixels is negligible in relation to the overhead of launching the memory transfers and the additional required logic. However, the time required for transferring image blocks of 1024 pixels is comparable with the overhead of initializing the copy. Consequently, the bigger the size of the block, the better the performance obtained by the GPU-based implementation. For this reason, the trend of the FPGA performance function (see blue line in Figure 13) decreases as *BS* increases, while the opposite behavior is shown for the GPU-based model, regardless of the desired *CR*.

**Table 9.** Maximum frame rates obtained by the proposed FPGA implementation and the GPU-based model introduced in [29] for the NVIDIA boards Jetson Nano, Jetson TX2 and Jetson Xavier NX.


This pattern is also present when analyzing the results for the Jetson Xavier NX. However, in this case, the GPU clearly outperforms the maximum number of FPS achieved by the FPGA for all algorithm settings. Nevertheless, the reader should notice that the Jetson Xavier NX represents one of the latest, most advanced NVIDIA single-board computers, whereas the Xilinx Zynq-7020 SoC that the ZedBoard mounts (i.e., XC7Z020-CLG484) is a mid-range FPGA several technological generations behind the Jetson Xavier GPU. Despite the fact that there are more powerful FPGA devices currently on the market, one of the main objectives of this work is to assess the feasibility of reconfigurable logic technology for high-performance embedded applications such as HyperLCA under real-life conditions. At the same time, it is also a goal of this work to explore the minimum requirements of an FPGA-based computing platform that is able to fulfil the performance demands and constraints of the hyperspectral application under study. Thus, the selected version of the Xilinx Zynq-7020 SoC meets the demand for all algorithm configurations, at a lower cost.

**Figure 13.** Comparison of the performance of the compression process, in terms of FPS as a function of the input parameter *BS*, achieved by a Xilinx Zynq-7020 programmable SoC following the design flow proposed in this work versus the GPU-based implementation model described in [29] running on NVIDIA power-efficient embedded computing devices (Jetson Nano, Jetson TX2 and Jetson Xavier NX). (**a**) FPS *Nbits* = 12, *CR* = 12; (**b**) FPS *Nbits* = 12, *CR* = 16; (**c**) FPS *Nbits* = 12, *CR* = 20.

Going deeper into the analysis of the results, Figure 14 plots the power efficiency of each targeted device, measured as the achieved FPS divided by the average power budget. The figure shows how the efficiency varies with the size of the input image blocks (*BS*). Jetson boards include a highly efficient Power Management Integrated Circuit that handles the voltage regulators and a power tree to optimize power efficiency. According to [49–51], the typical power budgets of the selected boards amount to 10 W, 15 W and 10 W for the Jetson Nano, Jetson TX2 and Jetson Xavier NX modules, respectively. In the case of the XC7Z020-CLG484 FPGA, the power consumption estimated after the *Place & Route* stage in the Vivado toolchain goes up to 3.74 W at 150 MHz. Based on the trend lines shown in Figure 14, it can be concluded that the FPGA-based platform is by far more efficient in terms of power consumption than the Jetson Nano and TX2 NVIDIA boards, for all algorithm configurations. As in the case of the performance analysis (Figure 13), the power efficiency of the FPGA slightly decreases for higher *BS* values, while the GPU-based implementations present the opposite behavior. As a result, the FPGA-based solution remains the more power-efficient approach for the smallest image block size (*BS* = 256) and shows similar figures for *BS* = 512 and higher *CR* values. Nevertheless, for *BS* = 1024, the Jetson Xavier NX clearly outperforms the proposed FPGA-based solution.
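As a small worked example of this metric, the snippet below computes FPS per watt from the power budgets quoted above; the frame-rate values in `fps` are placeholders only, since the exact figures depend on the algorithm configuration (see Table 9 and Figure 13).

```python
# Power-efficiency metric plotted in Figure 14: achieved FPS divided by the
# average power budget. Power figures are the ones stated in the text
# ([49-51] and the post Place & Route estimate); FPS values are placeholders.

power_w = {
    "Jetson Nano":      10.0,   # typical module power budget [49]
    "Jetson TX2":       15.0,   # typical module power budget [50]
    "Jetson Xavier NX": 10.0,   # typical module power budget [51]
    "Zynq-7020 (FPGA)":  3.74,  # Vivado estimate after Place & Route @ 150 MHz
}

fps = {                         # hypothetical frame rates, illustration only
    "Jetson Nano":      150.0,
    "Jetson TX2":       250.0,
    "Jetson Xavier NX": 400.0,
    "Zynq-7020 (FPGA)": 330.0,
}

for device, p in power_w.items():
    # FPS per watt is equivalent to frames processed per joule of energy.
    print(f"{device:18s}: {fps[device] / p:5.1f} FPS/W")
```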

**Figure 14.** Comparison of the energy efficiency of the compression process, in terms of the ratio between the obtained FPS and the power consumption as a function of the input parameter *BS*, achieved by a Xilinx Zynq-7020 programmable SoC following the design flow proposed in this work versus the GPU-based implementation model described in [29] running on NVIDIA power-efficient embedded computing devices (Jetson Nano, Jetson TX2 and Jetson Xavier NX). (**a**) FPS *Nbits* = 12, *CR* = 12; (**b**) FPS *Nbits* = 12, *CR* = 16; (**c**) FPS *Nbits* = 12, *CR* = 20.

The reasons behind this behavior lie in the fact that embedded GPU platforms have been able to significantly increase their performance while maintaining or even reducing their power demand. The combination of architectural improvements and better IC manufacturing processes has paved the way to a scenario where embedded GPU platforms are gaining ground and can be seen as competitors of FPGAs in terms of power efficiency.

Although the initial FPGA solution has proved to be sufficient, given the real-life requirements of the targeted application, we have ported the proposed design to a larger FPGA device. The objective is two-fold: first, to evaluate whether it is possible to reach the same level of performance (in terms of FPS) as that obtained by the Jetson Xavier NX implementation with current FPGA technology; second, to study how the FPGA power efficiency evolves as the complexity of the design increases, and to compare it with the results obtained by the Jetson Xavier NX.

Thus, a multi-HWacc version of the design was developed using the XC7Z100-FFV1156-1 FPGA, one of the largest Xilinx Zynq-7000 SoCs, as the target platform. The new FPGA allows up to 5 instances of the *HyperLCA* component to work in parallel. The selected baseline scenario is the configuration where the single-core FPGA design obtained the worst results compared with the Jetson Xavier NX (i.e., *BS* = 1024 and *Nbits* = 8 for all *CR* values). Synthesis and implementation were carried out with the Vivado toolchain using the *Flow\_AreaOptimized\_high* and *Power\_ExploreArea* strategies, respectively. As can be seen in Figure 15a, the performance of the multi-HWacc version grows almost linearly with the number of instances. There is a small loss due to the concurrent accesses to memory needed to fetch the hyperspectral frames and to the synchronization of the process, which is handled by software. The new multi-core FPGA-based *HyperLCA* computing platform reaches the same level of performance as the GPU for the maximum number of instances. Regarding the efficiency in terms of energy consumption per frame processed by the collection of HWaccs, with just three instances the FPGA is comparable to the GPU (above the Jetson Xavier NX results for *CR* = 16 and *CR* = 20), and better for four and five instances of the HWacc.
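A minimal sketch of this scaling behavior is given below. It is not the authors' model: the single-HWacc frame rate, the per-instance contention loss and the power breakdown (`FPS_SINGLE`, `LOSS_PER_EXTRA`, `P_STATIC`, `P_PER_HWACC`) are all hypothetical values used only to illustrate why the frame rate grows almost linearly with the number of instances while FPS per watt keeps improving.

```python
# Simple scaling model for the multi-HWacc version; all constants hypothetical.

FPS_SINGLE = 70.0          # hypothetical single-HWacc frame rate (BS = 1024)
LOSS_PER_EXTRA = 0.03      # hypothetical fractional loss per extra instance
P_STATIC = 1.0             # hypothetical static power of the device (W)
P_PER_HWACC = 1.0          # hypothetical dynamic power per HWacc instance (W)

def multi_fps(n):
    # Ideal linear speed-up degraded by shared-memory contention and the
    # software-managed synchronization of the n accelerators.
    return n * FPS_SINGLE * (1.0 - LOSS_PER_EXTRA * (n - 1))

def fps_per_watt(n):
    # Static power is amortized across instances, so efficiency improves with n.
    return multi_fps(n) / (P_STATIC + n * P_PER_HWACC)

for n in range(1, 6):
    print(f"{n} HWacc(s): {multi_fps(n):6.1f} FPS  {fps_per_watt(n):5.1f} FPS/W")
```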

**Figure 15.** Evolution of the performance (**a**) and energy efficiency (**b**) of a multi-core version of the FPGA-based *HyperLCA* computing platform, compared with the GPU-based implementation model described in [29] running on the NVIDIA Jetson Xavier NX (*Nbits* = 8, *BS* = 1024, using a Xilinx Zynq-7000 XC7Z100-FFV1156-1).

#### **5. Conclusions**

The suitability of the HyperLCA algorithm for execution using integer arithmetic had already been examined in detail in previous state-of-the-art publications. Nonetheless, in this work we have contributed to its optimization by providing an alternative that brings a substantial performance improvement along with a significant reduction in hardware resources, especially aimed at overcoming the scarcity of on-chip memory, one of the weaknesses of FPGAs.

In this context, the aforementioned modified version of the HyperLCA lossy compressor has been implemented onto a heterogeneous Zynq-7000 SoC in pursuit of accelerating its performance and thus complying with the requirements imposed by a UAV-based sensing platform that mounts a hyperspectral sensor characterized by a high frame rate. The adopted solution combines modules written in VHDL and HLS-synthesized models, giving rise to an efficient dataflow that fulfils the real-life requirements of the targeted application. On this basis, the designed HWaccs are able to reach frame rates higher than 330 FPS when compressing hyperspectral image blocks, with 200 FPS in the baseline scenario, using a small amount of FPGA resources and a low power consumption.

Additionally, we also provide a comprehensive comparison in terms of energy efficiency and performance between the FPGA-based implementation developed in this work and a state-of-the-art GPU-based model of the algorithm on three low-power NVIDIA computing boards, namely Jetson Nano, Jetson TX2 and Jetson Xavier NX. Conclusions drawn from the discussion show that although the FPGA-based platform is by far more efficient in terms of power consumption than the oldest-generation NVIDIA boards, such as the Jetson Nano and the Jetson TX2, the newest embedded-GPU platforms, such as the Jetson Xavier NX, are gaining ground and can be seen as competitors of FPGAs concerning power efficiency.

On account of that, we have also introduced a multi-HWacc version of the developed FPGA-based approach in order to analyze how its performance and power consumption evolve as the number of accelerators increases in a larger FPGA. The results show that the new multi-core FPGA-based version can reach the same level of performance as the most efficient embedded GPU systems. Moreover, in terms of energy consumption, the FPGA performance per watt becomes comparable from just three HWacc instances onwards.

Finally, although the work described in this manuscript has focused on a UAV-based application, it can easily be extrapolated to other work in the space domain. In this regard, FPGAs have become the mainstream solution for on-board remote-sensing applications due to their lower power consumption and, above all, the availability of radiation-tolerant devices [6]. For this reason, the FPGA-based model proposed in this manuscript implements all HyperLCA compression stages efficiently in the programmable logic (PL) of the SoC, that is, the FPGA. Hence, it can easily be adapted to run on other space-grade certified FPGAs.

**Author Contributions:** Investigation, J.C. and J.B.; Methodology, M.D. and R.G.; Software, M.D. and R.G.; Supervision, J.B.; Validation, J.C.; Writing—original draft, J.C., M.D., J.B. and S.L.; Writing—review & editing, J.C., M.D., J.B., J.A.d.l.T. and S.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research is partially funded by the Ministry of Economy and Competitiveness (MINECO) of the Spanish Government (PLATINO project, no. TEC2017-86722-C4, subprojects 1 and 4), the Agencia Canaria de Investigación, Innovación y Sociedad de la Información (ACIISI) of the Consejería de Economía, Industria, Comercio y Conocimiento of the Gobierno de Canarias, jointly with the European Social Fund (FSE) (POC2014-2020, Eje 3 Tema Prioritario 74 (85%)), and the Regional Government of Castilla-La Mancha (SymbIoT project, no. SBPLY-17-180501-000334).

**Conflicts of Interest:** The authors declare no conflict of interest.



**Table A1.** Evaluation of the results obtained using the FPGA-based solution on a Xilinx Zynq-7020 programmable SoC for hyperspectral images with …


