1. Introduction
With the development of the mobile industry, the user-centered application market has increased significantly and, as a result, the embedded system has become high-performance to meet advanced users’ requirements [
1]. In particular, the embedded system is equipped with a display in order to provide a visible environment to a user interface and attempts to achieve a high-performance graphical user interface (GUI) in the embedded system have appeared [
2]. As a result, not only the connection between the embedded system and the user, but also the provision of various types of sources is extended through the GUI [
3,
4]. Traditionally, the most essential elements to provide a GUI are the advancement of the memory capacity and processing time to store the source. However, the embedded system with the limited memory area has an essential challenge of storing the image data for GUI in minimum condition. Most systems that utilize large amounts of data exploit various data compression algorithms to overcome this problem [
5,
6]. However, in the case of an embedded processor, the workload that is demanded for the data compression and decompression causes an extreme bottleneck [
5].
The way to alleviate this problem is to integrate the hardware accelerator, which is strong on the specific workload, as a part of the peripheral [
7,
8]. The embedded processor requests the decompression operation which occurs in the process of decoding to the specially designed hardware accelerator. As a result, the processor has a benefit of containing the resource to execute other commands continuously. Through the advantage acquired by employing the accelerator, the embedded processor maintains the system frequency with high performance and reaches the low-power consumption design [
9,
10]. In addition, an optimized system design is possible because of the area benefit generated by utilizing compressed data. Several compression algorithms have been researched to apply a compression accelerator on the embedded system [
1,
11]. The lossy compression algorithm has been dominant owing to the relatively high compression ratio [
8]. However, in the case of image compression for constructing a GUI, a lossless compression can achieve a high compression ratio, as similar as to the lossy compression, while maintaining image quality [
12,
13].
In a GUI, the elements that are artificially produced to deliver clear information to users such as figures, Arabic numerals, icons, and symbols account for a large proportion. Although the natural image such as photograph or landscape may have an infinite number of embedded palettes, depending on the environment and object, the artificially produced images tend to select specific colors in order to eliminate ambiguity. This means that the probability of repetitive character strings is increased. The property makes it possible to obtain a high compression ratio through the window look-ahead buffer conventionally utilized in lossless compression [
14,
15]. Therefore, a hardware accelerator for lossless decompression is essential in order to utilize the lossless compressed data in an embedded system where the image quality deterioration is not allowed.
In this paper, we propose a lossless decompression accelerator (LDA) optimized for an embedded system with a GUI. The LDA is designed based on the inflate algorithm, which is a decompression algorithm of deflate compression known to be used in gzip or portable network graphics (PNGs). The accelerator supports the static Huffman method and LZ77 decoding in parallel, considering the embedded system to reduce processing time and minimize memory capacity. The checksum module based on Adler32 is designed to enable extensible input as the zlib compressed data format. Moreover, the LDA receives commands with the memory-mapped I/O method from the embedded processor through the system bus and fetches compressed data through direct memory access (DMA) to reduce the burden of the main processor. The LDA fetches the zlib format data that are compressed with up to 32 k window buffer without a separating preprocess. The fetched data are decompressed with the six-stage pipeline to minimize bottleneck. Finally, the decompressed raw data are transferred to the pre-assigned address by burst write to avoid long wait times for memory access. The embedded system including the LDA is implemented on a field programmable gate array (FPGA) to verify the functional suitability and is fabricated with Samsung 65 nm complementary metal oxide semiconductor (CMOS) process. The performance of the accelerator is evaluated by the Canterbury corpus benchmark, which is mainly used to test lossless data compression algorithms.
The rest of this paper is organized as follows. In
Section 2, we describe the deflate/inflate algorithm in detail.
Section 3 discusses some related works for the lossless compression system.
Section 4 explains the optimized architecture of the LDA for the embedded system. In
Section 5, we present the implementation and system-on-chip realization of the embedded system with the LDA and analyze the experimental results. Finally, we give a conclusion in
Section 6.
2. Inflate Algorithm
As inflate algorithm is a decompression algorithm for deflate compression; back-ground knowledge of deflate algorithms is required. A deflate algorithm has processes for compressing repetitive string and assigning bits to the repetitive string in accordance with the frequency. For those processes, LZ77 and Huffman encoding are applied [
16]. In LZ77 compression, a look-ahead buffer and window buffer are utilized. The data to be compressed are aligned in the look-ahead buffer and slide to the window buffer as the compression proceeds. The window buffer is scanned to find duplicated string starting on the front of the look-ahead buffer.
Figure 1 shows the compression sequence of the LZ77 algorithm. In
Figure 1a, string data from 0A to 0C are stored in the look-ahead buffer. As shown in
Figure 1b, the character data are sequentially shifted to the window buffer until the duplicated strings are detected between the window buffer and the look-ahead buffer. In
Figure 1c, the string data from 0A to 90 in the look-ahead buffer are duplicated with the string data in the window buffer. The length and distance of the duplicated string data are extracted as the length–distance pair, which is an output of the LZ77 compression [
17]. Therefore, 0A to 90 characters are stored as (3,5), which refers to the number of duplicated character and distance. As LZ77 stores the duplicated string with two characters, LZ77 compression has an advantage when the compressed data are over three characters. When the length–distance pair is generated, the duplicated string and next character in the look-ahead buffer are shifted to the window buffer, as shown in
Figure 1d. The length of buffers affects the compression ratio of LZ77 encoding because it represents the length of string data to be scanned. As the deflate format supports the window size up to 32 k-byte, a window buffer with identical size is required for optimal performance.
In deflate format, the LZ77 compressed data are additionally encoded with Huffman code. Huffman encoding is divided into static and dynamic encoding. Static Huffman encoding utilizes predefined code tables that allocate serial bits to the literal, length, and distance outputs of LZ77 compression, which are called symbols [
16]. On the contrary, dynamic Huffman encoding constructs a Huffman tree according to the frequency of symbols [
18]. By assigning a short-length code to the repeated symbols, dynamic Huffman encoding has an advantage of compression ratio [
19]. As a result, static Huffman encoding shows a relatively low compression ratio compared with the dynamic Huffman encoding. On the other hand, less computation is required because the static Huffman encoding does not need to generate a Huffman tree.
In order to decompress deflate data, Huffman and LZ77 decoding need to be performed sequentially. A Huffman code is decoded by matching the compressed data with Huffman code tables in bit level. In the case of dynamic Huffman decoding, analyzing Huffman tree information that is included in the deflate header is required to construct the Huffman code tables. As the deflate algorithm use two Huffman code tables for the literal-length symbol and distance symbol, it needs to identified whether the matched Huffman code is the literal or length symbol by the predefined range of the symbols. When the decoded symbol is a literal, the literal itself is an output of inflate algorithm and shifted to the window buffer of LZ77. When the decoded symbol is a length, the distance to the next symbol to decode is determined due to the order of LZ77 compression outputs. Therefore, the decoded length–distance pair is utilized to extract the duplicated data.
3. Related Work
The deflate compression can be processed in parallel and the research for accelerating compression is performed [
20,
21,
22,
23]. In the case of decompression, there are several challenges for accelerating owing to the serial nature of deflate format [
24]. The deflate data are compressed with the LZ77 compression and Huffman encoding. The deflate format includes multiple number of compressed data blocks and each block has a various length that continues until the end-of-block (EOB) code. When the type of the Huffman code is static, the length of the literal-length code and distance code are 7- to 9-bit and 5-bit long, respectively [
7]. In dynamic Huffman code, the literal-length and distance code bits vary from 1 to 15 bits including extra bits. Moreover, the back-reference between the deflate blocks for decoding LZ77 implies the dependency of compressed data. Because of the various length of the deflate data block and Huffman code with data dependency, the inflate operation need to be processed serially, and it causes challenges for accelerating [
18].
In order to address the limitation of decompression accelerating, the technique for parallel decoding has been researched [
24,
25,
26,
27]. Jang et al. [
25] proposed speculative parallelization for deflate decompression. By identifying the boundaries of the deflate block by scanning the EOB code, each compressed block is decoded independently. However, the challenge of LZ77 decoding remains owing to the back-reference. Sitaridi et al. [
26] proposed an alternate deflate format for massively parallel decompression on a graphics processing unit (GPU). The compressed data block includes starting offset values in the file header and the dependencies between the data blocks are eliminated in order to decode the Huffman and LZ77 compression in parallel. By applying the parallel decoding to both LZ77 and Huffman code, the authors achieved the increase of decompression throughputs over 13 GB/s on the GPU. Weißenberger et al. [
27] presented a GPU-based parallel decoder for Huffman code, which is compatible with the original deflate data format. The authors utilized the self-synchronization property that the decoding process is synchronized after the block boundary. Yamamoto et al. [
24] proposed a gap-array data structure, which is an array of the gaps of segments in the Huffman encoding process. The gap array is attached to the Huffman encoded data for accelerating Huffman decoding. Although the generation of the gap array requires additional operation in the Huffman encoding process, parallel decoding is enabled to achieve 1.26 to 2.63 times performance increase.
In [
11,
28,
29,
30], the designs of hardware accelerators for decompression were proposed. Koch et al. [
11] presented hardware decompression techniques for embedded systems. The authors modified the compression algorithms such as run length compression, Lempel–Ziv, and Huffman encoding in order to apply the algorithms to hardware efficiently. They implemented the hardware accelerators of each algorithm on FPGA and compared the compression ratio. Lazaro et al. [
28] designed a hardware decoder that applies dual core architecture in order to support static Huffman and LZ77 decoding. The decoder is implemented on FPGA with an embedded microprocessor. Satpathy et al. [
29] presented a decompression accelerator that applies a dual-arithmetic logic unit (ALU) architecture and block-adaptive Huffman decoder. The dual-ALU is utilized to improve the serial bottleneck of Huffman decoding by matching the Huffman code with two additional arrays. Moreover, the block-adaptive technique, which skips a missing length code region, reduces wasteful computations, resulting in an additional 13% performance increase. The decoder was fabricated in 14 nm tri-gate CMOS. Ledwon et al. [
30] designed the FPGA-based hardware accelerators for deflate compression and decompression using high-level synthesis (HLS). The decompressor supports static and dynamic Huffman decoding with the technique of joint length-distance decoding. The design achieved average throughputs of 196.61 MB/s and 97.40 MB/s at the static and dynamic decoding, respectively.
The contribution of this paper is that we propose a lossless decompression accelerator (LDA), which supports both the LZ77 decompression and static Huffman decoding in original deflate data format. The LDA is designed with Verilog hardware description language (HDL) and synthesized on both FPGA and Samsung 65 nm CMOS process. The techniques such as first-in, first-out (FIFO) control and DMA are applied to the design in order to optimize the design for the embedded system that has restrictions of area and performance. The throughput of the design is evaluated by Canterbury corpus benchmark in condition of system-on-chip (SoC) for embedded system. We constructed the SoC design by employing a Cortex-m0 processor with the LDA and achieved the throughput up to 20.7 MB/s. Moreover, we compared the design with another FPGA-based inflate accelerator in terms of area and throughput.
4. Lossless Decompression Accelerator
The decompression process consists of data import, Huffman decoding, LZ77 decoding, window buffer read/write, and data export. Data import and export operations are performed through memory devices with high-density storage such as synchronous dynamic random-access memory (RAM) in order to decrease the memory-area. Because the relatively slow memory speed and large data size make it hard to process with the embedded processor alone, the bus bottleneck is caused regardless of the decoding process. Besides, in the case wherein the decompression accelerator requires input data pre-processing, the main processor periodically implements the instructions to generate the input frame. The operating time for generating input frame is similar to the access time for memory. For this reason, the decompression accelerator must be an optimized design that minimizes the wait time by identifying the degree of bottleneck of each process in the target system.
Figure 2 shows the block diagram of the LDA, which performs optimized operations for the embedded systems to reduce the workload of the main processor, as mentioned above. They are largely composed of four domains, and each domain is controlled through a main finite state machine in the controller. The description of each domain is as follows.
4.1. Configuration
The configuration domain supports the 32-bit system bus protocol that enables the interconnection with the main processor. The LDA includes seven registers that store control, error status, length, start address of compression data and raw data, and checksum result. Both big endian and little endian are supported and the LDA is configured with two input modes about the compression data, system bus, or direct memory access.
4.2. Pre-Processing
The data format structure that is compressed through the deflate algorithm has endian switch points because of the Huffman coding. For this reason, in the decoding process, the Huffman coding boundary must be detected and aligned by analyzing it in bit unit. As the LDA receives the data with 32-bit unit, the boundaries of the code appear randomly and the function for aligning them is required. This operation is performed in the pre-processing domain. First, the zlib frame module checks the header data of zlib format in order to identify the format validation. The data are transmitted to the parser module through the FIFO. The FIFO has 32-bit data width and 16 data depth in order to optimize area-time of the applied system. The parser module operates repetitive sort-flush function that performs shift operation by checking the length of processed bits according to the operation result of the decompression domain.
4.3. Decompression
In the decompression domain, the inflate algorithm is performed by receiving input data from the parser module with a nine-bit unit. The LDA always receives the compressed input data of static-Huffman code and extracts the symbol code only at one clock cycle through the pre-defined Huffman table. When the extracted symbol code is a literal code, the Huffman decoder module transmits the literal data to the post-processing domain. When the symbol code is a duplicated string, composed of the length and distance, the symbol code is transmitted to the LZ77 decoder module. The LZ77 decoder module sequentially extracts the length and distance through the symbol code and transmits decoding results to the post-processing domain. As the post-processing domain affects the memory bandwidth when the length–distance pair is transmitted, the LZ77 decoder module checks the status of post controller module in the post-processing domain before a result is transmitted.
The LDA performs the static-Huffman decoding to be optimized for the embedded system. In the case of dynamic-Huffman decoding, the decompression time is increased because additional workloads are required to create a Huffman tree. In order to analyze the efficiency of compression ratios for an artificially produced image, we compare the compression ratios of static and dynamic-Huffman encoding by employing the sample images, which have a size of 128 by 128.
Table 1 shows the compression ratio results. The compression ratio is calculated as the following equation.
As the produced image has a single background pixel characteristic in order to eliminate ambiguity, the difference in compression ratio is not different by even 1% owing to the LZ77 compression. As a result, the LDA gains a fast processing time and memory benefit through a slight decrease the compression ratio.
4.4. Post-Processing
The post-processing domain performs the window memory read/write operation according to the literal, length, and distance code of the previous domain and outputs raw data to the external memory in 32-byte unit. In this domain, the operations of Adler32 calculation, window memory access, and raw data transmission are executed in parallel. Therefore, the two FIFOs are embedded to minimize the waiting time by analyzing the workload for each operation. Because the range of distance value that affects the number of memory access is up to 32,768, the processing time is changed according to the length of the frame unit processed by the post controller module. We analyze the processing time according to the frame length for 1 byte and 4 bytes, which correspond to the literal symbol length and system bus width, respectively. The result is described in
Section 5.