### *2.1. Dimensionality Reduction*

Although JYPEC supports multiple dimensionality reduction algorithms, only PCA was used for this approach. It has very low complexity (making it a candidate for real-time operation) and very good rate-distortion performance [20], falling short only when targeting very high qualities at low compression ratios (compared against vector quantization PCA (VQPCA)).

First, a preprocessing step selects random pixels from the image. These pixels are processed by PCA, generating a set of vectors that indicate the directions of maximum variance within the set. Since the selection is random and the sample is large (owing to the size of hyperspectral images), it closely matches the total variance even when using only 1% of the pixels [39]. This process yields a projection matrix that reduces the data size, as well as a recovery matrix that maps the reduced data back to the original space. Each band of the reduced image is then compressed with JPEG2000.
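The sampling-plus-projection step can be sketched in a few lines of numpy. This is an illustrative sketch, not JYPEC's actual code: the function name, the `sample_frac` parameter, and the `(bands, rows, cols)` image layout are assumptions.

```python
import numpy as np

def pca_reduce(image, n_components, sample_frac=0.01, seed=0):
    """Illustrative PCA reduction for a hyperspectral cube of shape
    (bands, rows, cols). Returns the reduced cube and the recovered cube."""
    bands, rows, cols = image.shape
    pixels = image.reshape(bands, -1).T                  # one row per pixel
    rng = np.random.default_rng(seed)
    n_sample = max(1, int(sample_frac * pixels.shape[0]))
    sample = pixels[rng.choice(pixels.shape[0], n_sample, replace=False)]
    mean = sample.mean(axis=0)
    # Eigen-decomposition of the sample covariance gives the directions of
    # maximum variance; keep the n_components with the largest eigenvalues.
    vals, vecs = np.linalg.eigh(np.cov(sample - mean, rowvar=False))
    proj = vecs[:, np.argsort(vals)[::-1][:n_components]]  # projection matrix
    reduced = (pixels - mean) @ proj
    recovered = reduced @ proj.T + mean                  # proj.T is the recovery matrix
    return (reduced.T.reshape(n_components, rows, cols),
            recovered.T.reshape(image.shape))
```

Because the projection columns are orthonormal, the transpose of the projection matrix serves directly as the recovery matrix, which is why the inverse step in Section 2.1 needs no extra training.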

The process can be undone by first uncompressing each reduced band, and then using the recovery matrix to project back into the original space.

### *2.2. The JPEG2000 Algorithm*

The JPEG2000 standard is a multi-step algorithm which takes advantage of different image characteristics for compression. As well as being used for its main purpose of image compression, it has also found applications in compressing electroencephalography [40], video [41], and hyperspectral images [20], among others.

It follows a series of simple steps (Figure 2) to compress an image:

**Figure 2.** Diagram of the full JPEG2000 algorithm. In green, steps that deal with the whole image at once. In blue, steps that deal with small blocks of the image.

• A color transformation is done per pixel, converting the input color space (usually RGB) into a luminance (brightness) and chrominance (color) model, since human vision is more sensitive to brightness than to color. The chrominance channels can be down-sampled with no perceptible loss in quality, removing a large fraction of the data bits.
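For reference, lossy JPEG2000 uses the irreversible color transform (ICT) with the standard luminance/chrominance weights. A minimal sketch (the function name is illustrative):

```python
def rgb_to_ycbcr(r, g, b):
    """Irreversible color transform (ICT) used in JPEG2000 lossy mode:
    one luminance component (Y) and two chrominance components (Cb, Cr)."""
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr
```

A pure gray pixel (equal R, G, B) maps to zero chrominance, which is what makes the chrominance channels cheap to down-sample.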


The color transform is not needed for hyperspectral compression, since the spectral dimension is already decorrelated by the dimensionality reduction step. Only the wavelet transform and encoding steps are performed.

The wavelet transform is fast compared to encoding, which takes up to 70% of the total execution time of JPEG2000 [24,25]. Within JYPEC, encoding also dominates execution time [39]. This paper focuses on an FPGA implementation that greatly accelerates encoding, reducing the full algorithm's execution time as much as possible by targeting the bottleneck.

In the following subsections, the encoder algorithm is detailed to give a better understanding of the accelerator presented later.

#### 2.2.1. Encoding

Encoding is done in a lossless way over blocks of up to 4096 samples (usually 64 × 64 squares), with a depth of 16 bits (15 magnitude + 1 sign). The encoding technique used in JPEG2000 is called embedded block coding with optimal truncation (EBCOT) [44]. Two different coders lie within EBCOT: the tier 1 and tier 2 coders.

The **tier 1** coder compresses the original block. The resulting stream has the prefix property: any prefix of that stream, once decoded, gives an approximation of the block. Longer prefixes provide better reconstruction accuracy.

The **tier 2** coder splits the output streams from each block into sections, with each section refining the information decoded by the previous one. Sections from different blocks are interleaved by first storing the ones which better approximate the original image. This technique extends the concept of the prefix property from blocks to the full image. This part of the coder is not used here, since JYPEC targets full image compression and not progressive decoding.

#### 2.2.2. Tier 1 Coder

Two phases lie within. First, the so-called bit plane coder (BPC) pairs each bit with a context, creating context-data (CxD) pairs. The distribution of all bits paired with the same context is highly skewed, making coding more efficient.

Second, the CxD pairs are processed by the MQ-coder (a member of the family of arithmetic coders), generating the compressed output stream. The more skewed the input distributions, the fewer bits the MQ-coder outputs.
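The gain from skew can be quantified with Shannon entropy, which gives the minimum average bits per symbol an ideal arithmetic coder can approach. This is a generic illustration, not part of the standard:

```python
import math

def entropy(p):
    """Shannon entropy (bits/symbol) of a binary source with P(1) = p.
    An ideal arithmetic coder approaches this rate asymptotically."""
    if p in (0.0, 1.0):
        return 0.0  # a deterministic source carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

A balanced source (p = 0.5) costs a full bit per symbol, while a heavily skewed one (e.g., p = 0.1, typical of cleanup-pass bits) costs under half a bit, which is exactly what the context modeling exploits.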

The **BPC** works by scanning the different bit planes of the block. First, the most significant bit of each sample is coded; then the second-most significant; and so on. This is the trick that later allows for progressive decoding.
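A minimal sketch of this most-significant-first decomposition, assuming the 15 magnitude bits described above (the function name is illustrative):

```python
def bit_planes(sample, depth=15):
    """Return the magnitude bits of one sample, most significant first.
    The sign is handled separately by the BPC, so only |sample| is used."""
    mag = abs(sample)
    return [(mag >> p) & 1 for p in range(depth - 1, -1, -1)]
```

Decoding plane by plane in this order is what makes any stream prefix a progressively better approximation of the block.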

The sign bit receives special treatment and is only coded when needed: the sign bit of a sample is coded once that sample is known to be nonzero, but not before. This avoids coding sign bits for samples that are zero.

To exploit redundancies, bits are scanned in a zig-zag pattern in up to three passes per bit plane *p* (Figure 3). To keep track of what bit is scanned in what pass, a "significance" value *σ*[*j*] is kept per sample *v*[*j*], where *j* is the 2D position within the block/bit plane.

**Figure 3.** Bit plane coder. The zig-zag pattern can be seen (four rows visited column by column). This is repeated three times (three passes) until all bits from a plane have been visited. The sign bit plane (red) is coded separately.
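The scan order described above, stripes of four rows with each stripe traversed column by column, can be sketched as follows (an illustrative helper, not the actual hardware scan logic):

```python
def scan_order(height, width):
    """Positions visited by the BPC zig-zag: the block is split into
    stripes of four rows, and each stripe is traversed column by column."""
    order = []
    for stripe in range(0, height, 4):
        for col in range(width):
            for row in range(stripe, min(stripe + 4, height)):
                order.append((row, col))
    return order
```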

This value indicates whether a sample is significant (i.e., at least one of its already coded bits is one) or insignificant (if all bits have been zero up until the current pass in previous bit planes). A significant sample is either positive or negative depending on its sign bit.

Three distinct passes exist: First, a significance propagation pass in which bits from samples that are expected to turn significant in the current plane are coded. Second, a refinement pass that codes bits from all samples that are already significant. Finally, a cleanup pass that codes the remaining bits.
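Which pass codes a given bit can be sketched from the significance state alone. This is a simplified illustration; the real BPC derives "expected to turn significant" from the neighborhood significance, and also maintains per-sample visited flags:

```python
def pass_of(significant, neighbor_significant):
    """Illustrative pass selection for one sample's bit in the current plane."""
    if significant:
        return "refinement"    # already-significant samples are refined
    if neighbor_significant:
        return "significance"  # insignificant but likely to turn significant
    return "cleanup"           # everything else is left for the last pass
```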

The cleanup pass will mostly code zeros and the significance pass will mostly code ones, while the refinement pass is more random. These skewed distributions are the key for compression. To increase efficiency, each bit is paired with a context based on the significance state of neighboring samples. The context is used to predict the value of the bit, and if right, can save even more space in the compressed stream.

The **MQ-coder** starts by mapping each context to two values: a state, which indexes a probability estimation table, and the symbol currently predicted as most probable for that context.


The basic idea of the MQ-coder is that of arithmetic coders: the input data are compressed into a subinterval *I* ⊂ [0, 1). Starting from [0, 1), the interval is subdivided with each CxD pair.

Given the probability *p*, *I* = [*c*, *c* + *a*) is divided into *I*<sup>1</sup> = [*c*, *c* + *ap*) and *I*<sup>2</sup> = [*c* + *ap*, *c* + *a*). If the predictive model is right and the probability *p* is high, *I*<sup>1</sup> (the bigger subinterval) is kept. The bigger the final interval, the fewer bits are needed to represent it.
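A worked sketch of this subdivision with real-valued arithmetic (the fixed-point MQ-coder performs the equivalent with the C and A registers):

```python
def subdivide(c, a, p, prediction_correct):
    """Split I = [c, c + a) at probability p and keep the sub-interval
    matching the observed symbol. Real-valued illustration only; the
    MQ-coder does this in fixed point with registers C and A."""
    if prediction_correct:
        return c, a * p                 # keep I1 = [c, c + a*p)
    return c + a * p, a * (1.0 - p)     # keep I2 = [c + a*p, c + a)
```

Two correct predictions at p = 0.8 leave an interval of length 0.64, while a single wrong one at the same point shrinks it to 0.16, illustrating why good predictions keep the interval, and hence the output, short in bits.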

In practice, infinite precision cannot be achieved, so 27 and 16 bits are allocated for *c* and *a*, respectively, as per the JPEG2000 standard, calling them registers *C* and *A*.

Each state has an associated probability of finite precision *p̄* ∈ {0, ..., 2<sup>16</sup> − 1}, mapped to the [0, 1) interval by dividing by 2<sup>16</sup>, indicating the probability of its prediction's success.

Since *p̄* is fixed for a given state, adaptation is accomplished with two transition functions: most probable symbol (MPS) and least probable symbol (LPS). The MPS or LPS function is applied depending on whether the prediction turns out to be right or not, moving to a state where *p̄* is expected to better approximate the current input bit distribution. The prediction is also inverted when certain states are reached, doubling the number of possible states.
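The transition mechanism can be illustrated with a toy table. The values below are invented for illustration and are NOT the much larger probability estimation table of the JPEG2000 standard:

```python
# Toy three-state probability model. "p" approximates the prediction's
# success probability; "mps"/"lps" are the next states on a correct or
# incorrect prediction. Values are hypothetical, not the standard's table.
STATES = {
    0: {"p": 0.50, "mps": 1, "lps": 0},
    1: {"p": 0.70, "mps": 2, "lps": 0},
    2: {"p": 0.90, "mps": 2, "lps": 1},
}

def next_state(state, prediction_correct):
    """Follow the MPS transition on a correct prediction, LPS otherwise."""
    return STATES[state]["mps" if prediction_correct else "lps"]
```

Repeated correct predictions walk toward states with higher *p̄* (more confident), while misses walk back toward less confident states, which is how the coder tracks a drifting input distribution.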

*p̄* is used to update the interval [*C*, *C* + *A*) by either adding *p̄* to *C* or subtracting it from *A*. When both ends get too close together, a left shift is performed to keep the length of the interval above the maximum value of *p̄*.
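The renormalization shift can be sketched as follows. This is illustrative only; the real coder interleaves byte output with these shifts, and the 2<sup>15</sup> threshold is an assumption matching the 16-bit *A* register:

```python
A_MIN = 1 << 15  # assumed renormalization threshold for the 16-bit A register

def renormalize(c, a):
    """Shift C and A left until A is back above A_MIN (byte-out logic of
    the real coder omitted). Returns updated registers and shift count."""
    shifts = 0
    while a < A_MIN:
        a <<= 1
        c <<= 1
        shifts += 1
    return c, a, shifts
```

The shift count matters for hardware: several of the implementations discussed in Section 3 differ precisely in whether multi-bit shifts are done in one cycle (barrel shifter) or cause stalls.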

*A* is kept under control by periodically resetting it, and *C* eventually overflows. The overflow is saved and forms the compressed output stream. To finish compression, *C* is flushed out. The process can then be reversed, and the original inputs recreated for decompression.

#### **3. Existing Implementations**

The objective of this study was to bring lossy compression to real time. To that end, JYPEC was chosen as the target algorithm, and a tier 1 coder implementation of JPEG2000 (a step within JYPEC) is presented, accelerating both the BPC and the MQ-coder with a single high-speed pipeline. To show that this effort is justified, a timing breakdown of compressing different images with JYPEC is shown in Figure 4.

**Figure 4.** Timing breakdown when compressing a set of hyperspectral images with JYPEC. Steps are sorted by time. While training takes a fair bit of time, coding is clearly the bottleneck of the process.

It is clear that coding is the slowest part of the algorithm, and that any efforts to speed up the algorithm should be dedicated to it. Many implementations of the full tier 1 coder have been proposed, but more efforts have been focused towards accelerating the MQ-coder alone. In the following paragraphs, the literature is explored in this regard.

#### *3.1. Bit Plane Coder*

In [45], the authors designed a basic BPC which goes over the full block following the zig-zag pattern. Its controller is a 24-state machine which goes over the different passes bit by bit, producing at most one CxD pair per cycle.

Improving on that, in [25] the authors introduced the concept of skipping. They loaded full columns with four bits at once, and marked them with flags when they were no longer required in a certain pass. This way, the BPC could skip them when not needed, saving clock cycles. They also introduced flags for groups of columns and even full passes, allowing one to skip big chunks of idle cycles in some cases. Finally, since they loaded the full column at once, they also checked which bits needed to be encoded, and skipped the others within the column. In the end, savings of around 60% of clock cycles were achieved.

A different approach was reported in [27]. Instead of skipping, the authors processed whole columns at once, producing up to 10 CxD pairs per cycle. CxD generation is not independent, so a series of cascading dependencies were taken into account. Despite the added complexity, this idea doubles the throughput of the sample-skipping technique without the extra memory. In [46], the authors went even further by allowing multiple planes to be coded simultaneously by using non-default coding options. Despite the small loss in compression efficiency, the throughput grew by a factor of 8 in CMOS 0.35 μm technology when using gray-scale images. This, however, deviates from the standard implementation, since multiple MQ-coders would have had to be used for a single block in order to keep up with the BPC.

#### *3.2. MQ-Coder*

The MQ-coder has seen more optimization efforts [47] than the BPC, since the MQ-coder was traditionally the bottleneck.

The MQ-coder receives and processes CxD pairs from the bit plane coder, generating a compressed bit stream that can be further processed by the tier 2 coder. Note that, by design, the CxD pairs are processed serially, so no parallelization is possible at this stage.

Two main approaches have been used to accelerate its execution: pipelining its internal stages, and processing more than one symbol (CxD pair) at a time.


In [31], a three-stage pipelined MQ-coder is proposed. It performs all arithmetic operations in the first stage, shifts the A and C registers in a second stage, and emits bytes in a third. The drawback of this approach is that the second stage could stall the first whenever the number of shifts was greater than one, since the authors did not use barrel shifters. They worked around this limitation with two clock domains, increasing the speed of the shifting stage so that stalling occurred only 0.64% of the time, and achieving a performance of around 145.9 MS/s on a Stratix EP1S10B672C6 board.

In [30], an implementation of the full tier 1 coder with neither pipelining nor dual-symbol processing is presented for the Virtex II Pro FG 456 board. The authors note that, at 112 MS/s, the arithmetic coder is the bottleneck, with the (context, bit) generation being five times faster. Speed was later increased by running up to four simultaneous instances of the tier 1 coder in parallel.

In [48], two techniques were used to create a pipelined design that works at 413 MS/s. The first, "traced pipelining," consists of designing a pipeline for the most likely cases and processing unlikely, more costly cases in a separate unit that stalls the pipeline if necessary. The second technique eliminates cascading shifts (such as those in [29]) by looking ahead at the number of necessary shifts and performing them all at once. All of this is made possible by 0.18 μm CMOS technology.

Both pipelining and dual processing are employed in the approach from [32], in which improvements are made to the multiple approaches from [26]. Dual processing is solved by having four different units processing all four possible scenarios (taking the lower or upper interval two times in any combination). Pipelining was used to separate the *A* register update, *C* register update, and byte output procedures, achieving in the end performances of 96.6 MS/s on FPGA and 440.2 MS/s on 0.18 μm CMOS technology.

A different pipelined approach is presented in [49]. The A register update, C register update, and byte-out procedures are kept in three distinct stages, and two more are added at the beginning by using two memory modules. The first stores contextual information (namely state and predicted symbol), and the second is a ROM which outputs state change information. If two consecutive contexts are equal, the second memory is read with the updated state that is sent to the first one. This splits reading into a two-step process that accelerates the pipeline. The other notable technique is that shifting is limited to seven bits per cycle, reducing the critical path at the cost of a one-cycle stall in the unlikely case that the shift amount exceeds seven. In the end, a speed of 192.77 MS/s was achieved on 0.18 μm technology.

#### **4. Implementation**

The presented design includes both the BPC and MQ-coder, chained together to form the full tier 1 coder. The basic structure of the tier 1 coder is shown in Figure 5.

**Figure 5.** Tier 1 coder architecture.

The bit plane coder receives data from memory and generates CxD pairs. These are coded by the MQ-coder and the output stream is generated.
