Article

A High-Performance Non-Indexed Text Search System

1 Department of Computer and Network Engineering, The University of Electro-Communications (UEC), 1-5-1 Chofugaoka, Tokyo 182-8585, Japan
2 Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet St., Dist. 10, Ho Chi Minh City 740050, Vietnam
3 Advanced Original Technologies Co., Ltd., Academy of Cryptography Techniques, Chiba 277-0827, Japan
* Author to whom correspondence should be addressed.
Electronics 2024, 13(11), 2125; https://doi.org/10.3390/electronics13112125
Submission received: 26 March 2024 / Revised: 7 May 2024 / Accepted: 28 May 2024 / Published: 29 May 2024
(This article belongs to the Section Microelectronics)

Abstract: Full-text search has a wide range of applications, including tracking systems, computer vision, and natural language processing. Standard methods usually implement a two-phase procedure, indexing and retrieving, with the retrieval performance entirely dependent on the index efficiency. In most cases, the more powerful the index algorithm, the more memory and processing time are required, and the time and memory required to index a collection of documents grow with its overall size. In this paper, we propose a full-text search hardware implementation without the indexing phase, thus removing the time and memory requirements for indexing. Additionally, we propose an efficient design that leverages the parallel architecture of High Bandwidth Memory (HBM). To our knowledge, few, if any, researchers have integrated their full-text search system with effective data access control on HBM. The functionality of the proposed system is verified on the Xilinx Alveo U50 Field-Programmable Gate Array (FPGA). The experimental results show that our system achieves a throughput of 8 Gigabytes per second, about a 6697× speed-up over software-based approaches.

1. Introduction

In the information age, the continuous growth of data places stringent demands on the timely retrieval of information from big datasets. Efficient and customized tools are designed to index and query databases to extract accurate information. Search engines play a crucial role in various domains, including machine learning [1], computer vision [1], artificial intelligence [2], business [3], and more. The most common search algorithms utilize both indexing and searching techniques, usually operating on a static dataset, and the indexing method significantly impacts the search algorithm’s performance. However, the circumstances have undergone several significant changes during the past decade. Small Internet-connected devices, such as Personal Computers (PCs), smartphones, and sensors, are increasingly utilized across various sectors, such as healthcare, manufacturing, industry, and smart homes. These devices continuously gather data worldwide and provide substantial volumes of data. This has presented a significant challenge for big data analytics, as elaborated upon in Refs. [3,4,5]. The proliferation of Internet-of-Things (IoT) applications has posed numerous hurdles regarding efficient data storage and management. Simply put, an effective method for searching big IoT data must handle substantial data volumes consistently gathered over time, all within a resource-constrained context. Hence, the effectiveness of the latest generation of search engines depends on both its performance and its resource utilization. Indexing methods must be optimized under limits on index size and search time, which decreases efficiency and increases implementation complexity. The issues involved in building the next generation of search systems encompass the following:
  • Scalability: The search system could be deployed on various devices and environments, each imposing different requirements. For example, the authors in [6,7] defined the restrictions of a search system on fog computers. Therefore, to fully utilize a device’s resources while satisfying its constraints, the search system needs to be scalable.
  • Index independence: The search system must process a large amount of local data collected over time, which means the indexes are updated regularly. Indexing and retrieving a small amount of local data is simple; however, managing extensive indexes and databases over distributed systems requires sophisticated techniques [8]. Recent studies on indexing techniques have focused on optimizing search performance over large datasets with greater efficiency in speed and storage usage [8,9,10]. Besides the existing index-based search algorithms, non-index search methods, which skip the indexing step, should be considered.
  • Performance and resource restrictions: IoT systems are inherently distributed. The nodes in such systems can be spread over a vast area and have restricted resources, especially storage capacity. A robust indexing and searching method can therefore demand time and resources that may not be available on such devices.
Among the search issues, text search is the most vital for retrieving the necessary information in big data analysis, such as in optical character recognition [11], cloud services [12], and natural language processing [13]. Full-text search (FTS) can locate relevant content even with regular expressions. FTS locates documents that contain the user’s query and ranks the documents according to their similarity to the query. Before being saved into the database, the documents are first indexed in an FTS system. The data or documents are scanned and indexed to create a list of search keywords, commonly named the “index”. The indexer creates an entry for each keyword discovered and may include the keyword’s relative position in the document. Indexing is the pivotal operation in an FTS system [14]. The FTS engine then searches within the index and returns hits matching the query. Finally, the documents containing the searched keywords are revealed by analyzing the search results.
In this work, we propose a novel text search system that satisfies the requirements of the next generation of FTS systems. Our contributions include a parallel text search algorithm that skips the indexing process entirely. Furthermore, it utilizes High Bandwidth Memory (HBM) to promote the hardware’s parallelism. The experimental results prove that the proposed system achieves high throughput compared to other hardware- and software-based proposals.
The remaining parts of this paper are structured as follows. Section 2 analyzes the related works and highlights our main contributions. Section 3 discusses efficient FTS systems and HBM. Section 4 presents the parallel FTS technique. Section 5 describes the high-performance hardware architecture. Section 6 presents the experimental results, along with comparisons and discussions. The final portion, Section 7, summarizes the study and discusses future work.

2. Related Work

Lucene [15] and Hyper Estraier [16] are the two most prominent candidates in this research area. Lucene is open-source software developed by the Apache Software Foundation in 1999. Its authors also provided an Application Programming Interface (API) that makes it easy to add Lucene to an FTS system. Like previous FTS implementations, Lucene follows a two-phase procedure that includes indexing documents and retrieving queries. Lucene-based search engine systems [17] have superior indexing and retrieval performance in comparison to other technologies [18]. Hyper Estraier [19] provides excellent scalability. It improves on the regular N-gram method, which is based on the occurrence probabilities of a given word sequence, by combining it with the morphological analysis in [11]. In addition, it supports many languages and searches using wildcards and gaps. In 2019, an enhanced inverse indexing-based FTS method was proposed in [20]. To reduce search time, the authors introduce three-component key indexes and arrange keywords according to their frequency of use. Although the index size is eight times larger than in their initial work [21], the search performance far exceeds conventional methods. Some proposals rely on well-known DBMSs, such as PostgreSQL [22] and ORACLE [17], to enhance FTS performance; unfortunately, the performance of this strategy is modest. Almost all of the remaining FTS research relies on Lucene and Hyper Estraier and is primarily concerned with enhancing indexing and retrieving. The authors in [23] presented a framework for performing FTS on a distributed file system. This strategy is effective for databases holding large quantities of small files, such as the Taobao file system from Taobao or Cassandra from Twitter [24]. To increase scalability in a distributed context, the authors replaced the inverted index in their search engine with a suffix index. The authors in [25] recommended a block-linked-list index structure for handling huge datasets; the testing results indicate a minor improvement. In [26], the authors implement an FTS system on a mobile device. They use a hash-table index to reduce memory access time and consumption during the indexing process, which helps to tackle the power problem. The proposal in [27] applies learning-based enhancement to the Lucene full-text retrieval method, thus reducing the retrieval time. Overall, the performance of these studies based on Lucene and Hyper Estraier depends mainly on their indexing and retrieving methods.
Besides software-based approaches, hardware-based text search accelerators have also been investigated. In 2021, the authors of [28] released a bandwidth-optimized search accelerator. The accelerator is an application-specific architectural substructure that works separately from the processor to accelerate an intensive workload at a lower cost [29]. Their proposed memory management solution utilizes Storage-Class Memory (SCM) and Compute Express Link (CXL) devices to alleviate the host–accelerator bottleneck. The experimental results indicate an acceleration of 8.1 times. However, SCM and CXL cannot be deployed on older devices, making this concept difficult to implement in the near future. Additionally, a portion of the search process is handled on the host computer, which has only eight cores in this concept; the achievable acceleration is therefore limited. Hardware Accelerator for Full-Text Search (HAFTS) [30] is another hardware design. The authors in [30] accelerated the retrieval time on a succinct data structure, a compressed and indexed data structure based on a suffix array. In this proposal, the authors try to reduce the bottleneck of fetching reference text by using multiple internal Random-Access Memories (RAMs), which are very limited in size. Even though internal RAM can provide very high bandwidth, only a small amount of reference text is processed per search. Therefore, this approach cannot deal with an extensive dataset. Implementations for other languages, such as Japanese [1] or Chinese [31], have also been investigated.
Generally speaking, the proposed methods must index the documents before searching, which is one of their major drawbacks. In almost all proposals, the indexing procedure is the step that takes the most time, and the indexing method has a significant influence on the final performance. It usually takes from 50 to 400 s to update new indexes [32]. Therefore, the traditional approach of existing FTS tools hardly satisfies the requirements of modern applications. Additionally, one of the challenges in FTS implementation is the space requirement for indexes. Another limitation is that current FTS systems are typically implemented on general-purpose computers with limited computational power. Therefore, a hardware accelerator is a promising solution to overcome these issues. This work presents a novel solution that can resolve the above issues for existing FTS systems. Our main contributions are summarized as follows.
  • A matching algorithm that can be efficiently deployed on hardware to achieve a high level of parallelism. Furthermore, the proposed algorithm can leverage the parallel architecture of HBM, which is increasingly adopted in high-performance computing systems.
  • The Text Search Processor (TSP), a hardware implementation of the proposed parallel matching algorithm. The proposed approach skips the indexing phase, relying instead on the parallel matching algorithm to find multiple search results simultaneously. The complete system is deployed on Field-Programmable Gate Arrays (FPGAs) and achieves high performance. Upon receiving a command from the host computer, the TSP automatically fetches the reference text, matches the data, generates the results, and then writes them back to the host memory. Depending on the available hardware, the number of processing cores can be significantly increased. Our experimental findings indicate that we could install 32 thousand processing cores on a Xilinx Alveo U50 while maintaining an operating frequency of over 180 megahertz (MHz).
  • The TSP is designed to exploit High Bandwidth Memory (HBM). To our knowledge, few, if any, researchers have integrated their FTS system with effective data access control on HBM.
  • A high-performance decoder design can translate thousands of matched positions, indicated by a bit, to matched addresses. The decoder helps reduce the number of results written back after searching, lessening the bottleneck of data exchange between the host computer and the text search system.

3. Background Knowledge

3.1. Text Search System

An FTS system searches the content of files given a specific input text. Usually, the text search engine seeks a piece of text that matches the original query. In the current state of the art, indexing the data before searching is crucial. The indexing procedure often requires the new document to undergo an identification task. Then, the input documents are divided into tokens that represent the semantic units used in the search. Afterward, each semantic unit is normalized to remove the different variations of each word, such as plural forms, different spellings in specific languages, case sensitivity, and grammatical structure. Further processes may entail the substitution of synonyms, the removal of suffixes (also known as stemming), lemmatization, and the exclusion of stop words. Lemmatization is the process of grouping the inflected forms of a word into a unified group. Removing stop words is essential for improving search accuracy; for example, the words “and” and “the” carry little meaningful content. Furthermore, each language has its own collection of stop words. Eliminating stop words reduces the size of the source text and decreases the number of comparisons. Consequently, the overall performance is improved. After completing these stages, the input text is prepared for a full-text search.
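To make these preprocessing stages concrete, the following Python sketch walks a sentence through tokenization, normalization, stop-word removal, and stemming. It is a minimal illustration only: the stop-word list and the suffix-stripping stemmer are deliberately simplistic placeholders, not the rules a production FTS engine would use.

```python
# Minimal sketch of indexing-time preprocessing. STOP_WORDS and stem() are
# toy placeholders, not production rules.
STOP_WORDS = {"and", "the", "a", "of", "to"}  # illustrative subset only

def stem(token: str) -> str:
    """Naive stemming: strip a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    tokens = text.lower().split()                      # split into semantic units
    tokens = [t.strip(".,;:!?\"'") for t in tokens]    # normalize punctuation
    return [stem(t) for t in tokens if t and t not in STOP_WORDS]

print(preprocess("The cats and the dogs were running."))
# ['cat', 'dog', 'were', 'runn']  (the naive stemmer over-strips 'running')
```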
For the search strategy, the typical naive method is to scan for the query string within each document, systematically examining each document from beginning to end to discover the required keywords. This kind of search is acceptable when the store holds only a few documents, but the time required grows rapidly with larger document collections. To improve search performance, existing algorithms begin by creating a search index; the inverted index is the most commonly used for this type of FTS. An inverted index closely resembles a concordance of the documents: it is a compilation of the terms mentioned in the source documents together with the matching document identifiers indicating their locations. The built index is mainly tied to a given topic and needs to be updated or replaced when the keywords change. When conducting a text search, the index is queried rather than the content, which can lead to a significant improvement in performance.
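The sketch below illustrates this structure: a toy inverted index with positional postings, built under the simplifying assumption that documents have already been tokenized by a pipeline like the one above. A query then becomes a dictionary lookup instead of a content scan.

```python
from collections import defaultdict

# Toy inverted index: each term maps to (document id, position) pairs,
# mirroring the keyword entries with relative positions described above.
def build_index(docs: dict[int, list[str]]) -> dict[str, list[tuple[int, int]]]:
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].append((doc_id, pos))
    return index

docs = {0: ["cat", "dog"], 1: ["dog", "bird", "dog"]}
index = build_index(docs)
print(index["dog"])   # [(0, 1), (1, 0), (1, 2)] -- a lookup, not a content scan
```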

3.2. High Bandwidth Memory

The Xilinx Alveo U50 FPGAs [33] are bonded with HBM stacks within the same chip package, enabling ultra-wide memory interconnections [34]. HBM provides the bandwidth, energy efficiency, and memory architecture flexibility to adapt to various applications [35]. HBM utilizes a vast quantity of separate memory channels, achieved by parallelizing access to a stack of Dynamic Random Access Memory (DRAM) chips. Xilinx devices utilize two 4-Hi stacks, each with a capacity of 4 GiB. This results in 32 channels, each presented as an AXI slave interface. Each pair of channels shares a Memory Controller (MC) for AXI-to-DDR protocol conversion. The global addressing interconnect topology is designed as a segmented switch network with local crossbar switches connecting all channels.
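As a rough mental model of this channel parallelism, the Python sketch below splits the 8 GiB address space evenly into 32 contiguous 256 MiB segments, one per pseudo channel. This even, contiguous mapping is an illustrative assumption; the actual Alveo U50 interconnect is a configurable segmented switch with local crossbars, so real address-to-channel routing may differ.

```python
# Hedged sketch: even split of the 8 GiB HBM space across 32 pseudo channels.
TOTAL_BYTES  = 8 * 1024**3                    # two 4 GiB stacks
NUM_CHANNELS = 32
SEG_BYTES    = TOTAL_BYTES // NUM_CHANNELS    # 256 MiB per pseudo channel

def channel_of(addr: int) -> tuple[int, int]:
    """Return (pseudo-channel index, offset within that channel)."""
    assert 0 <= addr < TOTAL_BYTES
    return addr // SEG_BYTES, addr % SEG_BYTES

print(channel_of(0))                   # (0, 0)
print(channel_of(5 * SEG_BYTES + 42))  # (5, 42)
```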
Despite the parallel architecture of the HBM, hardware developers must put considerable effort into utilizing the HBM for a particular application. In [36], a design is presented that combines HBM channels and lowers data dependencies to accelerate the sort algorithm. Database processing is one of the most prevalent study disciplines focusing on leveraging HBM by utilizing Processing In-Memory (PIM), such as the study in [37] about in-memory database acceleration on FPGAs. Multicore processors also take advantage of HBM, such as Intel’s Knights Landing, NVIDIA’s Titan V, and Google’s TPU. Recent research in this area has focused on demonstrating the utility of HBM for data-intensive processing problems, such as hash tables [38], graph processing [39], and stream processing [40]. When it comes to accelerating text search, there are two obstacles related to accessing the external memory:
  • Text search is memory-bounded. It requires a significant amount of memory access to retrieve the referencing content. Consequently, a restricted amount of memory bandwidth outside the chip would increase access time, thus reducing overall performance. To overcome this problem, we propose a hardware design that utilizes many HBM channels for the best data transfer efficiency.
  • Predictable and concurrent memory retrieval: The memory retrieval process in text search follows a predictable pattern. Several chunks of reference text are compared with the input keyword, and the matching step can exploit coarse-grained parallelism in a partition-based approach. We present a parallel matching technique and an HBM manager to enhance the parallelism of the stacked architecture, thereby mitigating the memory bottleneck and optimizing search efficiency.

4. Parallel Matching Algorithm

4.1. Parallel Matching Algorithm

We present a parallel matching algorithm to promote the TSP’s parallelism. This approach can locate several search keywords inside the input text without indexing. The algorithm’s core is a chain of Processing Elements (PEs) that operate in parallel. Each PE evaluates the match state between a byte of text and a character of the input keyword. The keywords are fed into the system byte-by-byte and compared with the input text buffered in the PEs. The winning score is evaluated based on the match mode, the score from the previous PE, the input keyword, and the input reference character. There are three match modes: normal, gap, and wildcard.
The matching process is shown in Algorithm 1. The first step initializes all PEs: each PE points to the previous PE, as shown in line 5 of Algorithm 1. Then, the reference text and the keyword are loaded from memory by the functions PE.setRef and PE.setKeyword, as shown in lines 20 and 22 of Algorithm 1, respectively. If the input text is bigger than the buffer, the text is split into multiple batches, each equal to the buffer’s size. Then, the input keyword is compared with every word in the text by PE.eval in line 23 of Algorithm 1. A simplified flowchart of Algorithm 1 is given in Figure 1.
Algorithm 1 Parallel text search algorithm.
Input:
    Reference text of length TEXT_LEN,
    String of keywords keys of length M,
    N_PE: the number of Processing Elements PE in the chain
Output:
    List of matched addresses address_list
 1: pidx ← 0    /* Processing Element index */
 2: Step 1: Initialize the Processing Elements
 3: while pidx ≤ N_PE do
 4:     if pidx > 0 then
 5:         PE[pidx].previous ← PE[pidx − 1]
 6:     else
 7:         PE[pidx].previous ← NULL
 8:     end if
 9:     pidx ← pidx + 1
10: end while
11: Step 2: Load text over multiple batches of a file
12: bidx ← 0    /* batch index */
13: while bidx ≤ TEXT_LEN / N_PE do
14:     Step 3: Evaluate matching state of input keywords
15:     for key k in KeywordList do
16:         pidx ← 0
17:         while pidx < N_PE do
18:             if (bidx × N_PE + pidx) < TEXT_LEN then
19:                 ref_char ← text[bidx × N_PE + pidx]
20:                 PE[pidx].setRef(ref_char)
21:             end if
22:             PE[pidx].setKeyword(k)
23:             PE[pidx].score ← PE[pidx].eval()
24:             pidx ← pidx + 1
25:         end while
26:     end for
27:     Step 4: Decode matching position
28:     matched_addresses ← addressDecode(score)
29:     bidx ← bidx + 1
30: end while
Each PE assigns a matching state for a pair of <keyword, reference character>, as described in Table 1. The matching function PE.eval is described in Algorithm 2. Many matches may be found at the same time. If the input keyword and a matching word in the text are identical, the last matched position of the input string is set. In the last stage, each matched position’s state is decoded to determine where the keyword is located in the text.
Algorithm 1 targets a heterogeneous hardware implementation and therefore provides a high level of scalability and parallelism. In Algorithm 1, the connections among PEs are initialized in step 1. They are implemented as hard-wired connections in hardware, so step 1 does not increase the complexity of the algorithm. Steps 2, 3, and 4 describe the computational process. Step 2, the loop in line 13, allows the matching function PE[x].eval() to process all of the input reference text. The number of iterations depends on the length of the input text and the number of available PEs; more PEs mean fewer required iterations. This feature reveals the scalability of the proposed algorithm: the number of PEs can be varied to fit the available resources, thus achieving the best performance outcome. The main computational process is the matching function PE[x].eval() in line 23, which is described in Algorithm 2. On hardware, Algorithm 2 is implemented as a combinational logic circuit with a two-cycle operation, so its complexity is O(1). Although the number of iterations from line 17 to line 25 in Algorithm 1 depends on the number of PEs, the complexity is O(1) because all the PEs work in parallel. Therefore, the iteration complexity from line 15 to line 26 is O(k), where k is the number of input keywords. Moreover, the complexity of step 2, lines 13 to 30, is O(k × TEXT_LEN / N_PE). Thanks to this massive parallelism, our proposed algorithm reduces the runtime roughly N_PE-fold compared with the direct method, whose complexity is O(k × TEXT_LEN). In our deployment on the Xilinx Alveo U50 FPGA, the number of PEs is 32,000.
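To clarify how the PE chain finds matches, the following Python sketch emulates one pass of the chain for a plain keyword (no gaps or wildcards). Slot i of the win list plays the role of PE i holding text byte i; the keyword is streamed in byte-by-byte, and each slot combines its own byte comparison with the score handed over by the previous slot. The hardware evaluates all PEs in a single clock cycle, so the inner loop here is a sequential emulation of one parallel step; batching and overlap handling (Section 5.4) are omitted.

```python
# Sequential model of the PE chain for a plain keyword. win[i] stands for the
# one-bit score of PE i; it is true when the keyword bytes seen so far match
# the text bytes ending at position i.
def chain_match(text: bytes, key: bytes) -> list[int]:
    win = [True] * len(text)                 # score before the first key byte
    for j, kb in enumerate(key):             # one key byte per "cycle"
        nxt = [False] * len(text)
        for i, rb in enumerate(text):        # in hardware: all PEs at once
            prev = True if j == 0 else (i > 0 and win[i - 1])
            nxt[i] = prev and rb == kb
        win = nxt
    # A set bit marks where the *last* keyword byte matched; subtracting
    # len(key) - 1 recovers the start address of the match.
    return [i - len(key) + 1 for i, w in enumerate(win) if w]

print(chain_match(b"abracadabra", b"abra"))  # [0, 7]
```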

4.2. Matching Status Evaluation

The proposed approach can handle matching keywords even when gaps and masks are involved. For a mask (wildcard), the matching score is determined only by the value of the preceding bit. For a gap, the string is considered matched when the keyword is discovered within a certain number of input gaps. Two extra states are required for the comparison in this scenario. When matching with gaps, the bit triad can be in one of the five states depicted in Table 1. In gap mode, the value of the bit triad can be 2, 3, or 4. If there is no match within a gap, the state is 2. The value is set to 3 if a match is discovered inside the gaps. The matching status is handled internally; there is no overhead in the output results. If a word is composed of multiple bytes, matching that word in gap mode proceeds as an ordinary match. Moreover, if all bytes of the gap are matched, the state of the gap is set to 4, and it maintains this value until all gaps have been processed.
Algorithm 2 Matching status evaluation (PE.eval()).
Input: Keyword key, reference character ref_char
Output: Score sc
 1: is_match ← (key = ref_char)
 2: sc.gap_eval ← sc.gap_eval AND is_match
 3: sc.gap_win ← sc.previous.gap_eval AND key.last
 4: sc.gap_win ← sc.gap_win AND is_match
 5: sc.gap_win ← sc.gap_win OR sc.previous.gap_win
 6: sc.gap_win ← sc.gap_win AND key.is_gap
 7: if key.is_mask then
 8:     sc.win ← sc.previous.win
 9: else if key.is_gap then
10:     sc.win ← sc.gap_win
11: else
12:     sc.win ← sc.previous.win AND is_match
13: end if
14: return sc
Algorithm 2 gives the matching status evaluation function for each input character of the keyword, which is PE.eval in Algorithm 1. The return value is the matching status sc between the input keyword key and the reference character ref_char. The evaluation also depends on the score of the previous PE, sc.previous. If the input keyword is a mask (key.is_mask), the matching status depends only on the previous score. If the input keyword is a gap, the gap score sc.gap_win is set when a match is found at the last byte of the input keyword (key.last). If the input keyword is neither a mask nor a gap, the score is set when the previous PE’s score sc.previous.win is set and the input keyword matches the reference character. The algorithm is suitable for a hardware-based approach, in which bit manipulation is performed effectively. In Algorithm 2, a single bit indicates the score for a match between a byte of the keyword and a byte of reference text. This bit is transmitted to the next PE, where it is used to calculate the matching state in the subsequent cycle. This design also reduces the cost of transporting data among PEs.
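For readability, the sketch below transcribes Algorithm 2’s single-bit logic into Python. The Key and Score records follow the algorithm’s notation; initialization of the chain and the per-cycle handover between PEs are outside the scope of this fragment, so the caller supplies the previous PE’s score (prev) and this PE’s stored score (cur) explicitly. In hardware, this whole function is combinational logic inside one PE.

```python
from dataclasses import dataclass

@dataclass
class Key:
    byte: int
    is_mask: bool = False   # wildcard '?': always matches
    is_gap: bool = False    # gap '_': match allowed within a bounded gap
    last: bool = False      # last byte of the gap string

@dataclass
class Score:
    win: bool = False
    gap_eval: bool = True
    gap_win: bool = False

def eval_pe(key: Key, ref_char: int, prev: Score, cur: Score) -> Score:
    """One PE.eval() step, mirroring Algorithm 2 line by line."""
    is_match = (key.byte == ref_char)
    sc = Score()
    sc.gap_eval = cur.gap_eval and is_match
    gap_win = prev.gap_eval and key.last
    gap_win = gap_win and is_match
    gap_win = gap_win or prev.gap_win
    sc.gap_win = gap_win and key.is_gap
    if key.is_mask:
        sc.win = prev.win            # wildcard: inherit the previous score
    elif key.is_gap:
        sc.win = sc.gap_win
    else:
        sc.win = prev.win and is_match
    return sc

# A normal character matching after an already-matched prefix:
print(eval_pe(Key(ord("b")), ord("b"), prev=Score(win=True), cur=Score()).win)  # True
```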

5. Hardware Architecture

5.1. Overall Architecture

The proposed FTS system has six main components, as shown in Figure 2. The first component is the communication bus. The TSP and host computer exchange data using Peripheral Component Interconnect Express (PCIe). An FTS application involves intensive data exchange between the host program and the search engine; therefore, to minimize data movement overhead, the TSP design relies on PCIe, which supports a high-performance communication interface between the host computer and the TSP. The second component is the memory system. In high-performance hardware designs, data transportation is a severe bottleneck, so our design provides an efficient memory system to eliminate data transportation bottlenecks. Two 4-GiB (gibibyte) High Bandwidth Memory 2 (HBM2) stacks are incorporated into the memory system, giving 8 GiB in total. The third component is the HBM management system. On the Alveo U50 FPGA, the HBM is organized into stacks of DDRs with 32 pseudo channels in total. Each channel manages 2 Gib (gibibits) of data, for a total of 64 Gib, or 8 GiB. Each HBM reader handles one pseudo channel, and the 32 HBM readers simultaneously access the 32 pseudo channels to reach the best HBM2 bandwidth. When data are written from the host computer to the HBM channels, the data are distributed over the memory; the switch network therefore guarantees that HBM readers fetch the right data from the correct locations in the HBM. Direct Memory Access (DMA) is applied to the communications between PCIe and HBM and between HBM and the switch network, thus reducing time wasted on handshaking.
The fourth component is the processing system, which combines many PEs; each PE holds one byte of the input text. The processing system handles the matching of input keywords against the text; the position of the PE and its matched state indicate the position of the matched keyword. The fifth component is the address decoder. The matched positions from the processing system are indicated by set bits in a bit string. Multiple address decoders therefore work in parallel to translate the bit-encoded results into direct addresses, which is necessary to reduce the decoding time on the host computer. The sixth and final component is the controller. Its task is to generate appropriate control signals based on the configuration data and input settings.

5.2. Processing Element

The TSP can perform normal searches, searches with gaps, and searches with wildcards in several languages without an index. Figure 3 illustrates a keyword search with five English characters. Figure 4 gives an example of a search with gaps; a gap is denoted as ‘_’ followed by a gap character. Figure 5 depicts a search with wildcards, denoted as ‘?’. A search procedure consists of the following steps. First, there is an initial match in which every instance of the beginning letter is set. Second, the matched positions are propagated to the next PE, and the previous PE’s state updates the current PE’s match position; in this phase, the state of a PE is decided based on the current matching state and the previous PE’s state. Finally, after evaluating all input keywords, the address decoder transforms the remaining matching positions into addresses.
In gap mode, the procedure is as follows. The status of the first gap, which does not match any preceding word, is set to 2, meaning “Not matched with a gap”. From the second gap onward, the status is set to 3, meaning “Temporarily matched”, since it matches the prior gaps. When the final gap byte is processed and all preceding gap bytes are matched, the status of this final gap is set to 4, meaning “Matched with a gap”. Once the final gap has been matched, the PE’s status is changed to “Matched with a gap”, and the subsequent characters are considered matched. Upon leaving the string of gaps, if the string’s status is 4, the final status of the PE is set to 1, meaning “Normal matched”. In contrast, if the string’s status is 2 or 3 upon leaving the string of gaps, the final status of the PE is set to 0, meaning “Not matched”. Table 1 outlines all the statuses of a PE. The state of every PE is encoded using three bits; matching is determined by one bit, and the remaining two bits are used only in gap mode.
Multiple PEs are joined as a chain, referred to as a PE chain. Each PE in the chain receives the value of the matched position from the previous PE and sends its data to the next PE. The number of PEs in a chain can be reconfigured; in this manner, the TSP is scalable. Figure 6 shows the architecture of a PE chain. To achieve the best optimization, the number of chains can also be reconfigured depending on the number of channels in the HBM. For example, the Xilinx Alveo U50 FPGA deployment is configured with 32 chains to match the 32 HBM pseudo channels. The number of chains can easily be scaled to suit other HBM configurations on other platforms. Additionally, the number of PEs in each chain is also configurable. Therefore, the proposed system can be scaled up or down to use all the available resources. These two levels of reconfiguration give our design a high level of scalability compared with existing proposals.

5.3. Address Decoder

Figure 7 demonstrates the address decoder’s architecture. The address decoder converts the matching locations from the PE chain into meaningful addresses for the host computer. Instead of scanning all match locations, which could take hundreds of clock cycles, the proposed address decoder adjusts its operation based on the number of matched positions. The proposed architecture is composed of several layers. Each layer consists of a set of Decoders (Dec), followed by a system of Arbitrators (Abtr), and finally a First-In-First-Out buffer (FIFO). One decoder can examine only a small number of matched bits, but multiple decoders can simultaneously process a vast number of matched bits. Depending on the matched position, an arbitrator in the subsequent layer can transform a set of bits into a meaningful address in a single clock cycle, processing only a few results from the decoders in the previous layer. Because each arbitrator has a restricted number of inputs, it does not introduce excessive logic delay. The translated addresses are stored in a FIFO, and the arbitrators in the subsequent tier utilize these addresses when they become available. The number of layers and the number of decoders, arbitrators, and FIFOs per layer can be modified to use all the available resources.
The address decoder includes multiple decoders and arbitrators. Each decoder decodes 64 bits of the corresponding location. It begins by scanning sets of 8 bits: whenever a scanned set contains a matched bit, the decoder proceeds to scan each bit of that set and converts the matched positions into the corresponding addresses. The location of the matched bit in the set indicates the match position in the string. The second element of the address decoder is the arbitrator, which collects data from multiple decoders whenever they are available. In this example, each arbitrator manages eight decoders. The lower the decoder’s index, the higher its priority; for example, data from decoder 0 are prioritized over data from decoder 1. When valid data are written to the FIFO that an arbitrator is connected to, the Read signal is sent to the FIFOs in the previous layer, thus retrieving new data. For instance, if the arbitrator reads data from the second channel, the second bit of the Read signal is set while the others are unset.
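The following Python sketch reproduces just the address translation performed by this decoder/arbitrator hierarchy: each “decoder” scans its 64-bit slice in 8-bit sets and skips all-zero sets in one step, and a fixed-priority loop stands in for the arbitrator draining decoders in index order. Cycle-level FIFO handshaking and the Read signals are deliberately omitted.

```python
# Software analogue of the two-level decode of a match bitmap into addresses.
def decode_slice(bits: int, base: int) -> list[int]:
    """One 64-bit decoder: scan eight 8-bit sets, expanding only non-zero ones."""
    addrs = []
    for group in range(8):
        byte = (bits >> (8 * group)) & 0xFF
        if byte == 0:
            continue                         # empty set skipped in one step
        for b in range(8):
            if byte & (1 << b):
                addrs.append(base + 8 * group + b)
    return addrs

def address_decode(bitmap: list[int]) -> list[int]:
    """Fixed-priority 'arbitrator': decoder 0 is drained first, then decoder 1, ..."""
    out = []
    for idx, word in enumerate(bitmap):      # one 64-bit word per decoder
        out.extend(decode_slice(word, base=64 * idx))
    return out

print(address_decode([0b1001, 1 << 63]))     # [0, 3, 127]
```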

5.4. Data Management System

The HBM on the Alveo U50 FPGA can support up to 32 pseudo channels (PCs) simultaneously. To take advantage of this HBM, the input data should be split into multiple fragments and distributed over the 32 pseudo channels by an HBM management system. The fragments are then written using the PCIe bus. Figure 2 describes the communication method between the software on the host computer and the proposed HBM management in the hardware.
Figure 8 shows the proposed organization of the buffer system. Because the HBM stacks are integrated outside the FPGA fabric, a buffer system is required to fetch data from multiple HBM channels efficiently. The buffer comprises multiple rows, each organized as a FIFO structure. The head of the FIFO directly receives data from an HBM channel and forwards the data to the next buffer in the chain on the next clock cycle. This approach shortens the paths from the HBM to a buffer and the paths between two consecutive buffers. The number of buffers in a row can be reconfigured depending on the number of PEs in the PE chain. The effective width of a buffer is 64 bits, which is the supported width of the available HBM. In each cycle, 64 bits of data are loaded into one buffer, and each buffer acts as local storage for a group of eight PEs. Each 64-bit buffer is split into eight 8-bit local registers, one per PE. Each 8-bit local register stores a byte of the input text to be matched against the input keyword. In this way, the local registers are placed near the PEs that use the data inside the buffer.
The text stored in the HBM can significantly exceed the buffer’s capacity. In that case, the input text is divided into a number of batches, where the buffer’s capacity determines the size of each batch. When the text is split, words that fall across the boundary of two consecutive batches risk being overlooked. To solve this problem, we use an overlap method: when a single text is broken up into batches, the last portion of each batch is reprocessed as the first portion of the following batch. The overlap zone has to be greater than the size of the input keyword for this to work correctly; as a result, matched keywords that fall in the overlap zone are fully protected. After completing the matching process, all the matched locations in the overlap zone are discarded; they are instead recalled and counted as matched positions in the following batch. Because of this, there is no address duplication across two consecutive batches. An example of the process is depicted in Figure 9. Once the initial batch has been loaded and processed, any matching addresses within the overlap zone Ov are not considered. However, as shown in Figure 9, the overlap region is loaded once more as the head of the following batch.
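A software rendering of this batching scheme is sketched below. find_in_batch is a naive stand-in for the PE-chain matcher of Section 4 (a hypothetical helper, defined here only so the sketch runs); the overlap length is set to the keyword length, the minimum that keeps boundary matches intact, and hits that start inside a batch’s overlap zone are dropped and recounted by the next batch so that no address is reported twice.

```python
def find_in_batch(batch: bytes, key: bytes) -> list[int]:
    # Stand-in for the PE-chain matcher: all start offsets of key in batch.
    return [i for i in range(len(batch) - len(key) + 1)
            if batch[i : i + len(key)] == key]

def search_with_overlap(text: bytes, key: bytes, batch_len: int) -> list[int]:
    overlap = len(key)                    # overlap zone must cover the keyword
    assert batch_len > overlap
    hits, base = [], 0
    while True:
        batch = text[base : base + batch_len]
        final = base + batch_len >= len(text)   # last batch keeps its zone hits
        zone = len(batch) - overlap             # start of the overlap zone
        for p in find_in_batch(batch, key):
            if p < zone or final:               # zone hits recounted next batch
                hits.append(base + p)
        if final:
            return hits
        base += batch_len - overlap

print(search_with_overlap(b"abracadabra", b"abra", batch_len=8))  # [0, 7]
```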

5.5. Hardware and Software Communication

To use the proposed FTS system effectively, we also provide a software flow to control the hardware accelerator, described in Figure 10. At the beginning, the software validates the design on the current deployment platform. If the bitstream has already been loaded onto the FPGA, the software skips the FPGA programming step; otherwise, programming the FPGA takes a few seconds. After that, the reference text is loaded from the hard drive into allocated memory on the host computer. Information about the loaded data, such as the text size and the starting positions of all files in the text, is registered in the database. Then, the text is split into fragments and loaded into the HBM via PCIe. Subsequently, the keywords are loaded into the hardware accelerator. Finally, a search process is activated.
The control flow for the search process is described in Figure 11. After registering all the search information, including the search mode, keywords, language setting, and the lengths of all processing files, the software activates the search procedure in the hardware accelerator and waits for the finish signal. Figure 12 illustrates the communication between the host computer and the proposed FTS system via PCIe. The system transfers the reference text and the keywords from the host’s memory to the device’s HBM. Then, the software program initiates a matching procedure. The TSP evaluates the matching score between the received keywords and the input text, then writes the matched addresses to the final FIFO. Whenever this FIFO is not empty, a DMA writes the results back to the host’s memory. Once a file is completed, the TSP raises a completion signal, and the host repeats this process for the next file in the database. When all the searches are done, the host reads the matched addresses from memory and updates the results.

6. Experimental Results

An analysis of the proposed architecture was conducted to evaluate both the hardware resources and the obtained performance. The proposed architecture was prototyped on a Xilinx Alveo U50, an HBM-equipped FPGA platform. Although the integrated HBM can theoretically reach up to 316 GB/s, power limitations [33] restrict it to 201 GB/s when used simultaneously with the PCIe.
The proposed FTS system has 32 PE chains, corresponding to the available HBM channels. Each chain has 1024 PEs, giving 32,768 PEs in total. We then analyzed the resource consumption and the maximum working frequency after place and route. Table 2 summarizes the TSP resource consumption. Four parameters are recorded: the number of slices (#Slices), the number of Lookup Tables (#LUT), the number of registers (#Registers), and the maximal operating frequency. According to the table, resource usage takes up to 54% of LUTs and 92% of slices, and a frequency of 180 MHz was obtained.
To evaluate the proposed FTS system, we use a dataset that includes four different languages: English, Japanese, Chinese, and Vietnamese. The dataset is extracted from different sources, as described in Table 3.
We obtained the Japanese dataset from the Aozora Bunko project [41]. For the Vietnamese dataset, we collected the data from a GitHub repository [42]. Finally, for the English and Chinese texts, we collected free eBooks from the Gutenberg Project [43]. We performed tests in four scenarios. In scenario 1, we randomly selected three keywords from the English dataset: “gold”, “Unfeeling”, and “statement”. In scenario 2, we selected three Vietnamese keywords. The keywords of scenario 3 are “野口”, “英司”, and “八巻美恵” from the Japanese dataset. Finally, the three keywords “翩翩舉”, “復行”, and “數十” from the Chinese dataset are used for scenario 4. In our proposed workflow, one or more keywords can be input for a search session. Therefore, in each scenario, we also performed a test with three keywords per search session, as shown in Table 4 (tests 4, 9, 13, and 17). Table 4 records the search time in milliseconds (ms) and the total hits of each search session. To measure the total search time exactly, we integrated hardware monitors to capture the time usage. Section 4 and Section 5 describe our proposed matching algorithm and its hardware implementation based on byte-by-byte matching. Such a method provides consistent performance across different kinds of databases, which also benefits our proposed method when compared to index-based approaches. As shown in the analysis in Section 4, the execution time of our method depends only on the data size and the number of input keywords. Therefore, we can perform a fair comparison with other proposals based on keyword size and input data size, as described in Table 4 and Table 5.
In Table 4, the total search time is split into four elements: init time, match time, decode time, and writeback time. Init time is the time for initializing a search session, transferring the reference text, and recording the search results. Match time is when the PEs fetch the reference characters from the HBM and match them with the corresponding keywords. Decode time is when the address decoder refines the search results. Finally, writeback time is when the search results are written back to the host computer’s memory. Test numbers 1–3, 6–8, 10–12, and 14–15 search one keyword per session, while test numbers 4, 9, 13, and 17 search three keywords per session. The total search time is counted from the start of the search to when the host receives the finish signal; that means the search time includes not only the TSP time but also the overhead time for data exchange between the device and the host computer. Table 4 reveals that the search time depends only on the data size and the number of input keywords. Almost all of the init time is spent transferring the reference text from the host computer to the TSP, mainly because of the speed limitation of the PCIe interface. Therefore, the performance can be further improved with a faster communication method, such as PCIe Gen4 or Gen5. The combination tests 4, 9, 13, and 17 also show an approach to reducing the overhead of PCIe transactions: multiple keywords with the same reference text can reuse the text that has already been transferred. Table 4 also shows that our proposed address decoder eliminates almost all of the writeback time. In this test, we selected all four languages as an example. Our FTS system can search in different languages because its working mechanism is based on the character encoding, not the actual language. For example, in this test, English characters are 1 byte, Japanese and Chinese characters are 3 bytes, and Vietnamese characters are mixed between 1 byte and 2 bytes. Therefore, from the TSP’s point of view, the only setting that differs is the number of bytes to be treated as one character.
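These per-language byte widths can be checked directly from the character encoding, assuming (as the byte counts above suggest) that the datasets are UTF-8; the sample characters below are arbitrary illustrations, not keywords from the test set.

```python
# The matcher compares raw bytes, so multilingual support reduces to telling
# the TSP how many bytes encode one character. UTF-8 widths for arbitrary
# sample characters from the four test languages:
for ch in ("g", "で", "翩", "đ"):   # English, Japanese, Chinese, Vietnamese
    print(repr(ch), "->", len(ch.encode("utf-8")), "byte(s)")
# 'g' -> 1, 'で' -> 3, '翩' -> 3, 'đ' -> 2
```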
Table 5 compares our work with other software-based and hardware-based proposals. In this comparison, we select from Table 4 the average time of the single-keyword test cases, which is 250.68 ms. The index time, retrieval time, overhead time, and throughput are compared. For the compared works, the retrieval period for a keyword search is lengthy, and the indexing process takes a considerable amount of time. The table makes clear that we have eliminated the need for indexing operations; as a result, there is no time lost to the indexing process. In most cases, the throughput achieved by a hardware-based method is far higher than that of a software-based approach. Comparatively, the throughput of our FTS system is 905× greater than [28,44] and 6697× greater than [19], the state-of-the-art hardware-based and software-based implementations, respectively.
Table 5 also compares our performance with other works when the indexing time is ignored. When indexing time is not considered, hardware can achieve extremely high throughput; for example, the system in [28] can reach up to 240 GiBps (GibiBytes per second) without the indexing time. Almost all existing FTS hardware uses an indexed data structure. Although index-based approaches can significantly reduce the amount of data that needs to be processed, they are only effective when the dataset is infrequently updated. Furthermore, the indexed structure’s efficiency significantly impacts the performance of these types of FTS hardware. Because the authors in [28,30,44] did not report their indexing time, we estimated it from the index size and the indexing algorithms they used to complete the comparison. The authors in [30] used the indexed data from [45], while the authors in [28,44] utilized the indexing tools of [46]. Table 5 also reveals the trade-offs between the index size, the overhead time, and the retrieval time. It is clear that our proposed method not only skips the index step, significantly reducing the storage space for indices, but also provides a competitive speed-up compared to other state-of-the-art software and hardware proposals. Although our approach cannot entirely replace index-based searching in conventional implementations, it is well-suited for applications where resources are limited or the database is regularly updated, such as IoT or fog computing systems.
Thanks to the high level of parallelism of the matching algorithm, the proposed hardware architecture can achieve a competitive throughput compared to other approaches. Together with the ability to scale, the proposed method allows the hardware developers to use up all the available resources, thus achieving the best possible outcome given the device’s limited memory bandwidth and communication speed. Therefore, this proposed idea is not only suitable for the current technology but also for the future technology of memory and communication standards.
Table 5. Processing time comparison.

| Method | Type | Algorithm | Platform | Index Size (MiB) | Text Size (MiB) | Index Time (ms) | Retrieval Time (ms) | Total Time (ms) | Retrieval Throughput (MiBps) | Total Throughput (MiBps) |
|---|---|---|---|---|---|---|---|---|---|---|
| This work | Hardware | Parallel Search | Alveo U50 | 0 | 2252 | 0 | 250.69 | 250.69 | 8773 | 8773 |
| [30] | Hardware | Succinct index | Virtex-5 | 0.06 † | 0.08 | 33.1 † | 0.002 | 33.10 | 240,000 | 2.59 |
| [44] | Hardware | Inverted index | TSMC 40-nm | 23,206 ‡ | 398,336 | 41,070,000 ‡ | 7.25 | 41,070,007 | 54,942,896 | 9.69 |
| [28] | Hardware | Inverted index | TSMC 40-nm | 23,206 ‡ | 398,336 | 41,070,000 ‡ | 1.62 | 41,070,001 | 245,886,419 | 9.69 |
| [19] | Software | Hyper Estraier | Intel T6600 | 45.4 | 42.7 | 31,518 | 987 | 32,505 | 43 | 1.31 |
| [47] | Software | Lucene-based | NG * | NG * | 1.03 | 1214 | 1250 | 2464 | 0.82 | 0.42 |
| [48] | Software | Generalized Suffix tree | Intel i7 6700HQ | 73,440 | 60 | 3,465,000 | 0.005 | 3,465,000 | 12,000,000 | 0.02 |

* NG: Not Given. †: collected from [45]. ‡: collected from [46].

7. Conclusions

This work provides a complete solution to the text search problem, including a parallel text search algorithm and its optimized hardware implementation. Additionally, the proposed architecture allows hardware developers to scale their implementation for the best outcome, making full use of HBM’s parallel memory architecture. The proposed design is deployed and validated on the Xilinx Alveo U50 FPGA platform. The experimental results show that our implementation successfully removes the indexing time, thus reducing the overall search time compared to other state-of-the-art index-based approaches. Our approach considers not only the performance but also the scalability of the implementation. The results indicate that our design can be configured on the Xilinx Alveo U50 FPGA with 32 PE chains containing 32,768 PEs. The system offers a throughput greater than 8 GiBps when operating at a frequency of 180 MHz.
In the current setup, the only bottleneck was the PCIe communication between the host computer and the hardware accelerator; we therefore want to reduce this bottleneck in future work. Such an improvement requires newer platform standards, such as PCIe Gen4 or Gen5, or CXL. Finally, we will explore multi-FPGA deployments of our design, which would significantly increase its scalability and performance in future implementations.

Author Contributions

Supervision, C.-K.P. and T.-T.H.; methodology, B.K.-D.-N., T.-K.D. and K.I.; investigation, B.K.-D.-N., T.-K.D., N.T.B. and H.P.N.; writing—original draft preparation, B.K.-D.-N. and C.P.-Q.; writing—review and editing, N.-T.T., C.-K.P. and T.-T.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in this article.

Conflicts of Interest

Author Katsumi Inoue was employed by the company Advanced Original Technologies Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Imura, J.; Tanaka, Y. A Full-Text Search System for Images of Hand-Written Cursive Documents. In Proceedings of the International Conference on Frontiers in Handwriting Recognition (ICFHR), Kolkata, India, 16–18 November 2010; pp. 640–645. [Google Scholar]
  2. Mu, C.M.; Zhao, J.R.; Yang, G.; Yang, B.; Yan, Z.J. Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines. In Proceedings of the Similarity Search and Applications (SISAP), Newark, NJ, USA, 2–4 October 2019; pp. 49–56. [Google Scholar]
  3. Dinh, L.T.N.; Karmakar, G.; Kamruzzaman, J. A survey on context awareness in big data analytics for business applications. Knowl. Inf. Syst. 2020, 62, 3387–3415. [Google Scholar] [CrossRef]
  4. Abbasi, A.; Sarker, S.; Chiang, R. Big Data Research in Information Systems: Toward an Inclusive Research Agenda. J. Assoc. Inf. Syst. 2016, 17, 1–32. [Google Scholar] [CrossRef]
  5. Virginia, N. On the Impact of High Performance Computing in Big Data Analytics for Medicine. Appl. Med. Inform. 2020, 42, 9–18. [Google Scholar]
  6. Xie, J.; Qian, C.; Guo, D.; Wang, M.; Shi, S.; Chen, H. Efficient Indexing Mechanism for Unstructured Data Sharing Systems in Edge Computing. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM), Paris, France, 29 April–2 May 2019; pp. 820–828. [Google Scholar]
  7. Wang, C.; Xie, M.; Bhowmick, S.S.; Choi, B.; Xiao, X.; Zhou, S. An Indexing Framework for Efficient Visual Exploratory Subgraph Search in Graph Databases. In Proceedings of the International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1666–1669. [Google Scholar]
  8. Kemouguette, I.; Kouahla, Z.; Benrazek, A.E.; Farou, B.; Seridi, H. Cost-Effective Space Partitioning Approach for IoT Data Indexing and Retrieval. In Proceedings of the International Conference on Networking and Advanced Systems (ICNAS), Annaba, Algeria, 27–28 October 2021; pp. 1–6. [Google Scholar]
  9. Qin, L.; Josephson, W.; Wang, Z.; Charikar, M.; Li, K. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. In Proceedings of the 33rd International Conference on Very Large Data Bases, Vienna, Austria, 23–27 September 2007; pp. 950–961. [Google Scholar]
  10. Dong, W.; Wang, Z.; Josephson, W.; Charikar, M.; Li, K. Modeling LSH for Performance Tuning. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), Napa Valley, CA, USA, 26–30 October 2008; pp. 669–678. [Google Scholar]
  11. Nayebi, M.; Ruhe, G. Chapter 19—Analytical Product Release Planning. In The Art and Science of Analyzing Software Data; Bird, C., Menzies, T., Zimmermann, T., Eds.; Morgan Kaufmann: Boston, MA, USA, 2015; pp. 555–589. [Google Scholar]
  12. Will, M.A.; Ko, R.K.L.; Witten, I.H. Bin Encoding: A User-Centric Secure Full-Text Searching Scheme for the Cloud. In Proceedings of the IEEE Trustcom/BigDataSE/ISPA, Helsinki, Finland, 20–22 August 2015; pp. 563–570. [Google Scholar]
  13. Nazemi, K.; Klepsch, M.J.; Burkhardt, D.; Kaupp, L. Comparison of Full-text Articles and Abstracts for Visual Trend Analytics through Natural Language Processing. In Proceedings of the International Conference Information Visualisation (IV), Melbourne, Australia, 7–11 September 2020; pp. 360–367. [Google Scholar]
  14. Ismail, B.I.; Kandan, R.; Goortani, E.M.; Mydin, M.N.M.; Khalid, M.F.; Hoe, O.H. Reference Architecture for Search Infrastructure. In Proceedings of the International Conference on Control System, Computing and Engineering (ICCSCE), Penang, Malaysia, 24–26 November 2017; pp. 115–120. [Google Scholar]
  15. Apache Software Foundation. Apache Lucene—Scoring. 2011. Available online: http://lucene.apache.org/java/3_4_0/scoring.html (accessed on 27 May 2024).
  16. Hirabayashi, M. Hyper Estraier: A Full-Text Search System for Communities. 2007. Available online: https://dbmx.net/hyperestraier (accessed on 27 May 2024).
  17. Shi, X.; Wang, Z. An Optimized Full-Text Retrieval System Based on Lucene in Oracle Database. In Proceedings of the Enterprise Systems Conference (ES), Shanghai, China, 2–3 August 2014; pp. 61–65. [Google Scholar]
  18. Lakhara, S.; Mishra, N. Desktop Full-Text Searching Based on Lucene: A Review. In Proceedings of the International Conference on Power, Control, Signals and Instrumentation Engineering (ICPCSI), Chennai, India, 21–22 September 2017; pp. 2434–2438. [Google Scholar]
  19. Tian, T.W.; Zhou, Y.; Huang, G. Research and Implementation of a Desktop Full-Text Search System Based on Hyper Estraier. In Proceedings of the International Conference on Intelligent Computing and Integrated Systems (ICISS), Guilin, China, 22–24 October 2010; pp. 820–822. [Google Scholar]
  20. Veretennikov, A.B. Proximity Full-Text Search by Means of Additional Indexes with Multi-component Keys: In Pursuit of Optimal Performance. In Proceedings of the Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL), Kazan, Russia, 15–18 October 2019; pp. 111–130. [Google Scholar]
  21. Veretennikov, A.B. Proximity Full-Text Search with a Response Time Guarantee by Means of Additional Indexes. In Proceedings of the Intelligent Systems and Applications (IntelliSys), London, UK, 6–7 September 2018; pp. 936–954. [Google Scholar]
  22. Chaitanya, B.S.S.K.; Reddy, D.A.K.; Chandra, B.P.S.E.; Krishna, A.B.; Menon, R.R.K. Full-Text Search Using Database Index. In Proceedings of the International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 19–21 September 2019; pp. 1–5. [Google Scholar]
  23. Xu, W.; Zhao, X.; Lao, B.; Nong, G. Enhancing HDFS with a Full-Text Search System for Massive Small Files. J. Supercomput. 2021, 77, 7149–7170. [Google Scholar] [CrossRef]
24. Lakshman, A.; Malik, P. Cassandra: A Decentralized Structured Storage System. ACM SIGOPS Oper. Syst. Rev. 2010, 44, 35–40. [Google Scholar] [CrossRef]
  25. Yang, Y.; Ning, H. Block Linked List Index Structure for Large Data Full Text Retrieval. In Proceedings of the International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), Guilin, China, 29–31 July 2017; pp. 2123–2128. [Google Scholar]
  26. Vishnoi, S.; Goel, V. Novel Table Based Air Indexing Technique for Full Text Search. In Proceedings of the International Conference on Computational Intelligence & Communication Technology (CICT), Ghaziabad, India, 13–14 February 2015; pp. 410–415. [Google Scholar]
  27. Yu, J.X.; Su, A.Y.; Liu, W.Y.; Cheng, X.; Yang, J. Thematic Learning-based Full-text Retrieval Research on British and American Journalistic Reading. In Proceedings of the International Conference on Computer Science & Education (ICCSE), Toronto, ON, Canada, 19–21 August 2019; pp. 611–615. [Google Scholar]
  28. Heo, J.; Lee, S.Y.; Min, S.; Park, Y.; Jung, S.J.; Ham, T.J.; Lee, J.W. BOSS: Bandwidth-Optimized Search Accelerator for Storage-Class Memory. In Proceedings of the International Symposium on Computer Architecture (ISCA), Valencia, Spain, 14–18 June 2021; pp. 279–291. [Google Scholar]
  29. Patel, S.; Hwu, W.M.W. Accelerator Architectures. IEEE Micro 2008, 28, 4–12. [Google Scholar] [CrossRef]
  30. Tanida, N.; Inaba, M.; Hiraki, K.; Yoshino, T. Hardware Accelerator for Full-Text Search (HAFTS) with Succinct Data Structure. In Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig), Cancun, Mexico, 9–11 December 2009; pp. 155–160. [Google Scholar]
  31. Liu, F.; He, X. Application of Full-Text Indexed Knowledge Graph in Chinese Address Matching for Hazardous Materials Transportation. In Proceedings of the International Conference on Electronic Information Engineering and Computer Science (EIECS), Changchun, China, 23–26 September 2021; pp. 510–515. [Google Scholar]
  32. Akram, S. Exploiting Intel Optane Persistent Memory for Full Text Search. In Proceedings of the ACM SIGPLAN International Symposium on Memory Management (ISMM), Virtual, 22 June 2021; pp. 80–93. [Google Scholar]
  33. Advanced Micro Devices (AMD), Inc. Alveo U50 Data Center Accelerator Card Data Sheet (DS965); Advanced Micro Devices (AMD), Inc.: Santa Clara, CA, USA, 2020. [Google Scholar]
  34. Shi, R.; Kara, K.; Hagleitner, C.; Diamantopoulos, D.; Syrivelis, D.; Alonso, G. Exploiting HBM on FPGAs for Data Processing. ACM Trans. Reconfigurable Technol. Syst. 2022, 15, 1–27. [Google Scholar] [CrossRef]
  35. Lee, D.U.; Lee, K.S.; Lee, Y.; Kim, K.W.; Kang, J.H.; Lee, J.; Chun, J.H. Design Considerations of HBM Stacked DRAM and the Memory Architecture Extension. In Proceedings of the Custom Integrated Circuits Conference (CICC), San Jose, CA, USA, 28–30 September 2015; pp. 1–8. [Google Scholar]
36. Jayaraman, S.; Zhang, B.; Prasanna, V. Hypersort: High-performance Parallel Sorting on HBM-enabled FPGA. In Proceedings of the International Conference on Field-Programmable Technology (ICFPT), Hong Kong, China, 5–9 December 2022; pp. 1–11. [Google Scholar]
  37. Fang, J.; Mulder, Y.T.B.; Hidders, J.; Lee, J.; Hofstee, H.P. In-memory Database Acceleration on FPGAs: A Survey. VLDB J. 2020, 29, 33–59. [Google Scholar] [CrossRef]
  38. Cheng, X.; He, B.; Lo, E.; Wang, W.; Lu, S.; Chen, X. Deploying Hash Tables on Die-Stacked High Bandwidth Memory. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), Beijing, China, 3–7 November 2019; pp. 239–248. [Google Scholar]
  39. Chen, X.; Chen, Y.; Cheng, F.; Tan, H.; He, B.; Wong, W.F. ReGraph: Scaling Graph Processing on HBM-enabled FPGAs with Heterogeneous Pipelines. In Proceedings of the International Symposium on Microarchitecture (MICRO), Chicago, IL, USA, 1–5 October 2022; pp. 1342–1358. [Google Scholar]
40. Miao, H.; Jeon, M.; Pekhimenko, G.; McKinley, K.S.; Lin, F.X. StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Providence, RI, USA, 13–17 April 2019; pp. 167–181. [Google Scholar]
  41. Aozora Bunko. Available online: https://www.aozora.gr.jp (accessed on 27 May 2024).
42. Web Scraping. Available online: https://github.com/trungngv/web_scraping (accessed on 27 May 2024).
  43. Project Gutenberg. Available online: https://www.gutenberg.org (accessed on 27 May 2024).
  44. Heo, J.; Won, J.; Lee, Y.; Bharuka, S.; Jang, J.; Ham, T.J.; Lee, J.W. IIU: Specialized Architecture for Inverted Index Search. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Lausanne, Switzerland, 16–20 March 2020; pp. 1233–1245. [Google Scholar]
  45. Gog, S.; Beller, T.; Moffat, A.; Petri, M. From Theory to Practice: Plug and Play with Succinct Data Structures. In Proceedings of the International Symposium on Experimental Algorithms (SEA 2014), Copenhagen, Denmark, 29 June–1 July 2014; pp. 326–337. [Google Scholar]
46. Mallia, A.; Siedlaczek, M.; Mackenzie, J.; Suel, T. PISA: Performant Indexes and Search for Academia. In Proceedings of the Open-Source IR Replicability Challenge (OSIRRC), Paris, France, 25 July 2019; pp. 50–56. [Google Scholar]
47. Lakhara, S.; Mishra, N. Design and Implementation of Desktop Full-Text Searching System. In Proceedings of the International Conference on Intelligent Sustainable Systems (ICISS), Palladam, India, 7–8 December 2017; pp. 480–485. [Google Scholar]
  48. Zaky, A.; Munir, R. Full-Text Search on Data with Access Control Using Generalized Suffix Tree. In Proceedings of the International Conference on Data and Software Engineering (ICoDSE), Denpasar, Indonesia, 26–27 October 2016; pp. 1–6. [Google Scholar]
Figure 1. Execution flow of the parallel matching algorithm.
Figure 2. Text Search Processor.
Figure 3. Matching regular words with English keywords.
Figure 4. Matching gaps in an English keyword.
Figure 5. Matching wildcards in an English keyword. The symbol "?" stands for a wildcard.
Figure 6. Processing Element chain.
Figure 7. Address Decoder architecture.
Figure 8. Buffer row.
Figure 9. Load file with overlap.
Figure 10. Search registration flow. *: The Search flow is explained in Figure 11.
Figure 11. Search flow.
Figure 12. Text search system.
Table 1. Matching status.

| Match Status | Definition |
|---|---|
| 0 | Not matched |
| 1 | Matched |
| 2 | Not matched with gap |
| 3 | Temporarily matched with gap |
| 4 | Matched with gap |
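For readers who want a concrete picture of these five status codes, the following is a minimal illustrative sketch in C of how they might be encoded and tested on the host side. It is not the authors' RTL or driver code; the enum names and the helper `is_final_match` are hypothetical and introduced here only to make the semantics of Table 1 explicit.

```c
/* Hypothetical host-side encoding of the matching-status codes in Table 1.
 * Illustrative sketch only; not the TSP hardware implementation. */
#include <stdio.h>

typedef enum {
    MATCH_NONE        = 0, /* Not matched */
    MATCH_EXACT       = 1, /* Matched */
    MATCH_GAP_FAIL    = 2, /* Not matched with gap */
    MATCH_GAP_PARTIAL = 3, /* Temporarily matched with gap */
    MATCH_GAP_OK      = 4  /* Matched with gap */
} match_status_t;

/* Only statuses 1 and 4 denote a completed match;
 * status 3 is transient and may still fail or succeed. */
static int is_final_match(match_status_t s)
{
    return s == MATCH_EXACT || s == MATCH_GAP_OK;
}

int main(void)
{
    printf("%d\n", is_final_match(MATCH_GAP_PARTIAL)); /* prints 0 */
    printf("%d\n", is_final_match(MATCH_GAP_OK));      /* prints 1 */
    return 0;
}
```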
Table 2. TSP resource consumption summary.

| TSP Configuration | |
|---|---|
| Platform | Xilinx Alveo U50 |
| Language support | Any |
| Search mode | Regular, mask, gap |
| #Processing Elements | 32,768 |
| **Resource Usage** | |
| #Slices | 100,401 (92.16%) |
| #LUTs | 483,080 (54.69%) |
| #Registers | 692,922 (39.73%) |
| Frequency | 180 MHz |
Table 3. Dataset.

| Language | File Count | Size (MiB) | Dataset Source |
|---|---|---|---|
| Japanese | 34,508 | 1434 | Aozora Bunko [41] |
| Vietnamese | 28,334 | 113 | VNExpress [42] |
| English | 1205 | 611 | Project Gutenberg [43] |
| Chinese | 231 | 96 | Project Gutenberg [43] |
Table 4. Search time.

| Test No | Input Keyword | Keyword Size (Byte) | Match Count | Init Time (ms) | Match Time (ms) | Decode Time (ms) | Writeback Time (ms) | Total Time (ms) |
|---|---|---|---|---|---|---|---|---|
| **Scenario 1: English** | | | | | | | | |
| 1 | gold | 4 | 23,346 | 186.5 | 7.61 | 55.98 | 0.00013 | 250.09 |
| 2 | Unfeeling | 9 | 12 | 186.5 | 9.61 | 54.89 | 0.00001 | 251.00 |
| 3 | statement | 9 | 8906 | 186.5 | 9.61 | 54.91 | 0.00005 | 251.02 |
| 4 * | Combine | 22 | 32,264 | 308.76 | 26.82 | 165.79 | 0.00019 | 501.37 |
| **Scenario 2: Vietnamese** | | | | | | | | |
| 6 | ngô | 4 | 9301 | 186.5 | 7.61 | 56.51 | 0.00005 | 250.62 |
| 7 | lân | 4 | 526 | 186.5 | 7.61 | 56.21 | 0.00001 | 250.32 |
| 8 | duyên | 5 | 828 | 186.5 | 8.00 | 55.80 | 0.00001 | 250.51 |
| 9 * | Combine | 14 | 10,655 | 308.56 | 23.22 | 168.53 | 0.00007 | 500.31 |
| **Scenario 3: Japanese** | | | | | | | | |
| 10 | 野口 | 6 | 854 | 186.5 | 8.41 | 55.41 | 0.00001 | 250.32 |
| 11 | 英司 | 6 | 109 | 186.5 | 8.41 | 55.80 | 0.00001 | 250.71 |
| 12 | 八巻美恵 | 12 | 66 | 186.5 | 10.81 | 53.88 | 0.00001 | 251.19 |
| 13 * | Combine | 24 | 1029 | 307.89 | 27.62 | 165.09 | 0.00003 | 500.61 |
| **Scenario 4: Chinese** | | | | | | | | |
| 14 | 翩翩舉 | 6 | 3 | 186.5 | 8.41 | 56.17 | 0.00001 | 251.08 |
| 15 | 復行 | 6 | 135 | 186.5 | 8.41 | 55.46 | 0.00001 | 250.37 |
| 16 | 數十 | 12 | 1590 | 186.5 | 10.41 | 53.68 | 0.00001 | 250.99 |
| 17 * | Combine | 24 | 1728 | 307.42 | 27.62 | 165.32 | 0.00003 | 500.36 |

* Search with a combination of the three keywords above.
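As a reading aid for Table 4 (our observation from the reported numbers, not a formula stated elsewhere in the paper), the total time is, to rounding precision, the sum of the four phase times:

```latex
% Relationship observed in Table 4 (reading aid, not a stated formula):
\[
T_{\mathrm{total}} \approx T_{\mathrm{init}} + T_{\mathrm{match}}
  + T_{\mathrm{decode}} + T_{\mathrm{writeback}}
\]
% Worked check with Test 1 (keyword "gold"):
\[
186.5 + 7.61 + 55.98 + 0.00013 \approx 250.09~\mathrm{ms}
\]
```

Note also that the initialization time roughly doubles for the combined-keyword tests (about 308 ms versus 186.5 ms), while the writeback time is negligible in every case.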
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
