1. Introduction
In recent years, there has been an increase in computational solutions employing Digital Image Processing (DIP) techniques, such as facial recognition, medical image enhancement, signature authentication, traffic control, autonomous cars, and product quality analysis [1,2,3,4,5]. These applications usually require real-time processing. However, meeting their processing-time requirements can be complex due to the large volume of data to be processed, which is proportional to the image resolution, color depth, and, in the case of video applications, the frame rate employed. Therefore, obtaining results in real time has become a challenge [6].
Some of the mentioned applications use segmentation algorithms to identify the image’s region of interest and classify its pixels as background or object. Thresholding is one of the main image segmentation techniques, in which pixels are classified based on their intensity values [7]. The Otsu algorithm, proposed in [8], is a widely used global thresholding technique that defines an optimal threshold by maximizing the between-class variance. However, the Otsu algorithm has a high computational cost due to the complex arithmetic operations performed iteratively, hindering its use in real-time applications.
Many works in the literature have implemented the Otsu algorithm in hardware, such as Field-Programmable Gate Arrays (FPGAs), to overcome processing-time constraints and thus allow applications to achieve real-time or near-real-time processing. The FPGA allows the exploitation of the algorithm’s parallelism and the development of dedicated hardware to obtain performance improvements [9,10,11,12,13,14,15]. However, FPGA implementations found in the literature are often developed with sequential processing schemes in some stages of the Otsu algorithm, limiting the hardware’s processing speed [16,17,18,19,20,21].
Therefore, this work proposes a fully parallel FPGA implementation of the Otsu algorithm. Unlike most approaches proposed in the literature, a fully parallel implementation reduces the processing-speed bottleneck compared to sequential systems or hybrid hardware architectures, that is, architectures implemented with both sequential and parallel schemes. Besides, given the continuous increase in the volume of data present in DIP applications, a fully parallel strategy is less likely to become obsolete quickly.
The remainder of this paper is organized as follows: Section 2 presents the related works in the literature; Section 3 addresses the theoretical foundation of the Otsu method; Section 4 gives a detailed description of the architecture proposed in this paper; Section 5 presents and analyzes the synthesis results obtained from the described implementation, including a comparison with other works; and, finally, Section 6 presents the final considerations.
2. Related Works
Many proposals can be found in the literature for real-time applications of the Otsu algorithm deployed in FPGAs. In [22], an adaptive lane departure detection and alert system is presented, while in [23], a lane departure and frontal collision warning system is described. Meanwhile, in [24], a vision system is presented to detect obstacles and locate a robot navigating indoors; in [25], a system for detecting moving objects is presented; [26] presents a system to assist in the diagnosis of glaucoma; and, in [27], a system for improving thermograms is presented. However, these articles provide few details about the hardware implementation.
Among the first FPGA implementations of the Otsu algorithm is the proposal of [16], synthesized for an Altera Cyclone II FPGA. The design improved the algorithm’s performance through a hybrid hardware architecture and Altera MegaCores, eliminating the algorithm’s complex divisions and multiplications. The developed architecture was used for the segmentation of images with pixels represented by 10 bits, and the authors evaluated the implementation’s performance as satisfactory through visual segmentation results.
Other works have proposed hardware implementations using logarithmic functions to eliminate the division and multiplication circuits. In [17,19], the authors implemented two versions of the Otsu algorithm, one of which uses logarithmic functions, to compare the results achieved by each. The architectures developed by [17] were synthesized for a Xilinx Virtex XCV800 HQ240-4 FPGA; the implementation without the logarithmic functions occupied a hardware area of 622 slices and 103 Input/Output Blocks (IOBs), while the logarithmic-function implementation occupied 109 slices and 49 IOBs, obtaining a clock latency of 132 ns.
The architectures presented by [19] were developed on the Altera Cyclone IV EP4CE115F29C6N FPGA. The synthesis results for the algorithm implemented without the logarithmic functions show an occupation of 6525 logic elements, 4920 registers, 18,266 bits of memory, and 79 9-bit multipliers, with a latency of 589 clock cycles. In contrast, the implementation with the logarithmic function used 2440 logic elements, 1026 registers, 10,943 bits of memory, and 79 9-bit multipliers, achieving a latency of 536 clock cycles. Therefore, the results presented in [17,19] indicate that the algorithm designed with the logarithmic function reduced both the FPGA area occupation and the latency.
In [18,20], similar architectures of the Otsu algorithm were deployed on the Virtex-5 xc5vfx70t ffg1136-1 FPGA, available on the Xilinx ML-507 development platform. Both proposals were developed in VHDL, using fixed-point representation. The proposal described in [18] occupied 168 slices and 33 IOBs, while in [20], the implementation reached an area occupation of 161 slices, 21 IOBs, 72 Look-Up Tables (LUTs), 591 registers, 4 blocks of RAM (BRAMs), and 5 DSP48Es. In addition, Reference [20] presented results related to the processing time for an image with pixels represented by 8 bits, reaching a latency of 5 clock cycles and a throughput measured in megabits processed per second (Mbps).
Meanwhile, an implementation for the binarization and thinning of fingerprint images using the Otsu method on a Spartan 6 LX45 FPGA was presented in [21]. Concerning the area occupation, 1898 registers, 1859 LUTs, 735 slices, 10 IOBs, and 44 BRAMs were used. Regarding processing time, a maximum clock frequency of 100 MHz was achieved, with an execution time of 1489 ms per image and a latency of 531 clock cycles, or 5310 ns. Besides, a comparison with the same technique implemented in Matlab showed that the FPGA version was substantially faster.
Thus, this work proposes an FPGA implementation of the Otsu algorithm to improve its performance. Unlike the works presented in the literature, the architecture proposed here uses a fully parallel scheme. The hardware implementation was developed in Register-Transfer Level (RTL), using fixed-point representation, in an Arria 10 GX 1150 FPGA. The results concerning the hardware area occupation and throughput are also presented.
3. Otsu’s Algorithm
Otsu’s method is one of the most popular thresholding algorithms, used to find an optimal threshold that separates an image into two classes, the background and the object, represented here by $C_1$ and $C_2$, respectively. This method has the advantage of performing all its calculations based only on the histogram of the image [7,8].

Initially, the algorithm starts by calculating the normalized histogram of a grayscale image, which can be described as the stream of pixels

$$\mathbf{I}[m] = \big[\, x[mWH],\; x[mWH+1],\; \dots,\; x[(m+1)WH-1] \,\big], \quad (1)$$

where $x[n]$ is one pixel of $b$ bits and $W \times H$ is the image dimension. The pixels can assume $L$ distinct integer intensity levels, represented by $k$ and characterized as a value in the range 0 to $L-1$, where $L = 2^b$. Each $n$-th pixel is processed at an instant $t_s$, which represents the sampling time. Thus, one complete image can be processed at every $m$-th moment, where

$$m = \left\lfloor \frac{n}{WH} \right\rfloor. \quad (2)$$

This equation must be changed if more than one pixel is processed per sample time.

The histogram of each $m$-th image, $\mathbf{I}[m]$, is calculated and stored in the vector

$$\mathbf{h}[m] = \big[\, h_0[m],\; h_1[m],\; \dots,\; h_{L-1}[m] \,\big], \quad (3)$$

where each $k$-th component, $h_k[m]$, is defined as

$$h_k[m] = \frac{q_k[m]}{WH}, \quad (4)$$

in which $q_k[m]$ denotes the number of pixels with intensity $k$ of the $m$-th image, described as

$$q_k[m] = \sum_{n=mWH}^{(m+1)WH-1} \delta\big(x[n], k\big), \quad (5)$$

with $\delta(x[n], k)$ expressed as

$$\delta\big(x[n], k\big) = \begin{cases} 1, & \text{if } x[n] = k, \\ 0, & \text{otherwise.} \end{cases} \quad (6)$$

Subsequently, after obtaining the normalized histogram, stored in the vector $\mathbf{h}[m]$, the Otsu algorithm calculates an optimal threshold between the two classes, i.e., $C_1$ and $C_2$. The optimal threshold, called here $t^{*}[m]$, can be characterized as

$$t^{*}[m] = \underset{0 \,\le\, k \,\le\, L-1}{\arg\max}\; \sigma_k^2[m], \quad (7)$$

where $\sigma_k^2[m]$ is the $k$-th between-class variance of the $m$-th image, defined as

$$\sigma_k^2[m] = \frac{\big(\mu_G[m]\, P_k[m] - \mu_k[m]\big)^2}{P_k[m]\,\big(1 - P_k[m]\big)}, \quad (8)$$

where $P_k[m]$ and $\mu_k[m]$ are the probability of class occurrence given a threshold $k$ and the mean intensity value of the pixels up to the threshold $k$ of the $m$-th image, respectively; meanwhile, $\mu_G[m]$ is the average intensity of the entire $m$-th image, called the global mean, with a value equal to $\mu_k[m]$ when $k = L-1$.

The variables $P_k[m]$ and $\mu_k[m]$ can be expressed as

$$P_k[m] = \sum_{i=0}^{k} h_i[m] \quad (9)$$

and

$$\mu_k[m] = \sum_{i=0}^{k} i\, h_i[m]. \quad (10)$$

After finding the optimal threshold value, $t^{*}[m]$, the pixels of the input image, $x[n]$, can be classified as background or object ($C_1$ or $C_2$), generating a mask for the input image.
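For reference, the whole procedure above can be prototyped in a few lines of software. The sketch below (illustrative Python, not the hardware described later) computes the normalized histogram, the cumulative statistics of Equations (9) and (10), and the optimal threshold of Equation (7) for an 8-bit image given as a flat list of pixels:

```python
def otsu_threshold(pixels, L=256):
    """Return the Otsu threshold maximizing the between-class variance."""
    n = len(pixels)
    # Normalized histogram h_k (Equations (3)-(6)).
    h = [0.0] * L
    for p in pixels:
        h[p] += 1.0 / n
    # Cumulative probability P_k (Eq. (9)) and cumulative mean mu_k (Eq. (10)).
    P, mu = [0.0] * L, [0.0] * L
    acc_p = acc_mu = 0.0
    for k in range(L):
        acc_p += h[k]
        acc_mu += k * h[k]
        P[k], mu[k] = acc_p, acc_mu
    mu_g = mu[L - 1]  # global mean
    # Between-class variance (Eq. (8)); skip degenerate class splits.
    best_k, best_var = 0, -1.0
    for k in range(L):
        denom = P[k] * (1.0 - P[k])
        if denom <= 0.0:
            continue
        var = (mu_g * P[k] - mu[k]) ** 2 / denom
        if var > best_var:
            best_var, best_k = var, k
    return best_k
```

For a bimodal pixel stream such as `[10]*50 + [200]*50`, every threshold between the two modes yields the same maximal variance, and the first such level is returned.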
4. Hardware Proposal
This work proposes a fully parallel architecture of the Otsu method capable of processing images of any dimension, focused on obtaining high-speed processing. The details of the hardware implementation are described in the following subsections.
4.1. General Architecture
The general hardware architecture implemented for the Otsu algorithm is presented through a block diagram, shown in
Figure 1. As can be observed, the architecture was developed based on the description presented in
Section 3. Therefore, it consists of five main modules: Normalized Histogram Module (NHM), Probability of Class Occurrence Module (PCOM), Mean Intensity Module (MIM), Between-Class Variance Module (BCVM), and Optimal Threshold Module (OTM).
Initially, the NHM module receives the parallel input of G image pixels, where G is the number of submodules internal to the NHM that simultaneously calculate the normalized histogram, according to Equations (
3) and (
4). Subsequently, the PCOM module uses the histogram components to calculate the class occurrence probabilities, according to Equation (
9), while the MIM module calculates the average intensities, based on Equation (
10). The PCOM and MIM modules perform their calculations simultaneously. Afterward, these two modules’ outputs are supplied to BCVM, in which the values of the between-class variance are computed, according to Equation (
8). Finally, the calculated between-class variances are compared in the OTM to select the optimal threshold value, as described in Equation (
7).
All variables and constants shown in
Figure 1 were implemented in fixed point to reduce the bit-width compared to floating-point implementations; the number of bits used can be adjusted to match the precision required by the desired application. Each pixel, $x[n]$, of the input image was configured with 8 bits in the integer part (without sign); hence, $b = 8$ and $L = 256$. For the histogram components, $h_k[m]$, and the probabilities of class occurrence, $P_k[m]$, which assume positive values less than 1, only 24 bits are used in the decimal part. For the average intensity elements, $\mu_k[m]$, 8 bits are used in the integer part (without sign) and 24 bits in the decimal part. For the between-class variances, $\sigma_k^2[m]$, 27 bits are used in the integer part (one bit for sign) and 24 bits in the decimal part. Finally, for the optimal threshold, $t^{*}[m]$, only 8 bits are used in the integer part (without sign).
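As an illustration of the fixed-point choice above (a minimal sketch; the actual RTL word lengths are the ones just listed), quantizing a quantity to 24 fractional bits amounts to scaling by $2^{24}$ and truncating:

```python
FRAC_BITS = 24
SCALE = 1 << FRAC_BITS  # 2^24

def to_fixed(x):
    """Quantize a real value to an integer with 24 fractional bits."""
    return int(x * SCALE)

def from_fixed(q):
    """Recover the real value represented by the fixed-point integer."""
    return q / SCALE

# A histogram increment such as 1/(W*H) for an assumed 640x480 image:
q = to_fixed(1 / (640 * 480))
err = abs(from_fixed(q) - 1 / (640 * 480))
# The quantization error is bounded by one LSB, i.e., 2^-24.
assert err < 2 ** -FRAC_BITS
```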
The modules of this architecture are pipelined, and the whole system operates with the same sample time, $t_s$. Nonetheless, each module has a different execution time, characterized here as $d_{\mathrm{NHM}}$, $d_{\mathrm{PCOM/MIM}}$, $d_{\mathrm{BCVM}}$, and $d_{\mathrm{OTM}}$. To minimize control logic, given the lack of synchronism between the modules, the hardware proposed here defines $d_{\mathrm{NHM}} = d_{\mathrm{PCOM/MIM}} = d_{\mathrm{BCVM}} = d_{\mathrm{OTM}} = t_m$, where $t_m$ is the time to process a complete image, being equal to the $m$-th moment at which an image is processed. The time $t_m$ is defined by the NHM block, since $d_{\mathrm{NHM}}$ is the longest execution time. Thus, the system has an initial latency expressed as

$$ D = 4\, t_m \quad (11)$$

and a throughput characterized as

$$ Th = \frac{1}{t_m}. \quad (12)$$
4.2. Normalized Histogram Module (NHM)
The NHM is responsible for generating the normalized histogram of the input image by computing Equations (3) and (4). This step of the algorithm usually costs more clock cycles than the other steps, as the entire image needs to be scanned to obtain the histogram components. To optimize this process, we propose parallelizing this step by computing partial values of the components in parallel; these partial values are then summed to obtain the final values. The architecture of this module is shown in
Figure 2.
As can be observed, the NHM module is composed of G identical submodules, called Partial Normalized Histogram (PNH), responsible for computing the partial values of the histogram components. Each g-th input pixel is processed by the g-th PNH. Likewise, the PNH modules are internally composed of L submodules, called Partial Component of the Normalized Histogram (PCNH), as shown in
Figure 3. Thus, each k-th partial component of the histogram calculated by the g-th PNH is computed in parallel by a PCNH submodule.
Figure 4 shows the internal circuit of each PCNH module, consisting of a comparator, an adder, two registers (R), and two constants.
Initially, each k-th comparator of the g-th PNH checks whether the input pixel has a value equal to the constant k, according to Equation (6); this constant has a different value of k in each PCNH submodule. The comparator output, which is a Boolean value, enables the adder when equal to 1. Therefore, when the adder is enabled, a constant equal to the reciprocal of the number of image pixels, $1/(WH)$, is summed with its previous value, so the adder operates as an accumulator. Being defined as $1/(WH)$, this constant yields the value of the normalized histogram component when accumulated $q_k[m]$ times, according to Equation (4). After all the image pixels have been entered, each PCNH outputs the k-th partial component of the normalized histogram computed in the g-th PNH.
After that, the final value of each k-th component of the m-th image, $h_k[m]$, is obtained by summing all the k-th partial values provided at the outputs of the G PNH modules. This sum is performed for each k-th component through an adder tree, as represented in
Figure 2, whose depth is equal to $\lceil \log_2 G \rceil$. At the end, the values of the components of the normalized histogram are obtained, according to Equation (3).
Instead of processing one pixel per sample time in the histogram step, the proposed architecture allows processing G pixels in parallel. Consequently, the number of clock cycles required in this step is reduced and, with it, the processing time of a complete image, $t_m$; the latency is also reduced, and the throughput increased. With this scheme, the value of $t_m$ can be defined by

$$ t_m = \left( \frac{WH}{G} + \left\lceil \log_2 G \right\rceil \right) t_s. \quad (13)$$

Through this equation, it is possible to observe that the higher the value of G, the better the performance obtained.
In each PCNH submodule, the comparison constant assumes a grayscale value between 0 and L−1 and is represented with only 8 bits in the integer part (without sign). The accumulation constant, which has a positive value less than 1, uses only 24 bits in the decimal part. This bit-width is also used for all the adders of the NHM module, as the normalized histogram components also assume positive values less than 1. All the k-th components of the histogram are transmitted in parallel to the PCOM and MIM.
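Functionally, the G-way parallel histogram described in this subsection can be modeled as follows (a behavioral Python sketch with illustrative names, not the RTL): the pixel stream is distributed among G PNH units, each accumulating the constant 1/(W·H) on a match, and the partial components are merged as the adder tree would:

```python
def parallel_normalized_histogram(pixels, G, L=256):
    """Model G PNH units accumulating partial histograms, then merge them."""
    n = len(pixels)       # n = W * H, the number of image pixels
    inc = 1.0 / n         # constant accumulated per matching pixel
    # Each PNH unit g processes every G-th pixel of the stream.
    partial = [[0.0] * L for _ in range(G)]
    for i, p in enumerate(pixels):
        partial[i % G][p] += inc
    # Adder tree: sum the G partial values of each component k.
    return [sum(partial[g][k] for g in range(G)) for k in range(L)]
```

For any G, the merged result equals the sequentially computed normalized histogram; only the number of cycles spent scanning the stream changes.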
4.3. Mean Intensity Module (MIM)
The MIM calculates the average intensity value of the pixels up to level k, according to Equation (10). Each k-th average intensity, $\mu_k[m]$, is calculated in parallel. The architecture of this module, shown in
Figure 5, consists of L gain submodules, adders, and one register (R) after each component.
Based on Equation (10), each k-th component of the normalized histogram is first multiplied by a gain with the value of k. These gains are represented in the architecture block diagram with an index indicating the value of the applied gain. Thereafter, each k-th average intensity, $\mu_k[m]$, is obtained by summing the outputs of all gains with indices from 0 to k. The sum of these values is carried out in parallel based on the adder proposed by [28], which requires a maximum of $\lceil \log_2 L \rceil$ cascaded adders.

All gains of this module have their outputs represented with 8 bits in the integer part (without sign) and 24 bits in the decimal part. Similarly, this bit resolution is also used for the adders, as the average intensity values of the pixels are at most 255 for input images with pixels represented by 8 bits. All k-th average intensities are provided to the BCVM in parallel.
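Behaviorally, the MIM reduces to a weighted running sum over the histogram, per Equation (10) (a software sketch; names are illustrative):

```python
def mean_intensities(h):
    """mu_k = sum_{i<=k} i * h_i (Eq. (10)), computed as a running sum."""
    mu, acc = [], 0.0
    for k, hk in enumerate(h):
        acc += k * hk  # gain of value k applied to component h_k
        mu.append(acc)
    return mu
```

The last element, `mu[-1]`, is the global mean used later by the BCVM.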
4.4. Probability of Class Occurrence Module (PCOM)
The probability of class occurrence for a given threshold k, $P_k[m]$, is computed in the PCOM based on Equation (9). This module has an architecture similar to the MIM, but without the gain submodules that weight the inputs; the inputs are directly connected to the adders. Thus, this architecture is composed only of adders and registers, as shown in
Figure 5.
According to Equation (9), the probability values, $P_k[m]$, are obtained by adding all the components of the histogram from index 0 to k. Thus, using the parallel adder proposed by [28], all values are computed simultaneously through the sum of the k-th entries.

The adders in this module were implemented with the same bit resolution as the inputs, since the probability of class occurrence also assumes positive values less than 1. All k-th probabilities calculated are propagated to the BCVM in parallel.
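The prefix sums computed by the PCOM can also be obtained with a logarithmic-depth scan, which mirrors the parallel-adder structure cited above (a software analogy assuming a Hillis–Steele-style scan, not the exact circuit of [28]):

```python
def prefix_scan(values):
    """Inclusive prefix sum in ceil(log2(L)) combining stages."""
    out = list(values)
    shift = 1
    while shift < len(out):
        # Stage: each position adds the value 'shift' places to its left.
        out = [out[i] + (out[i - shift] if i >= shift else 0.0)
               for i in range(len(out))]
        shift *= 2
    return out
```

Applied to the normalized histogram, the scan yields all $P_k$ values simultaneously after a number of stages logarithmic in L, matching the depth bound stated above.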
4.5. Between-Class Variance Module (BCVM)
The k-th between-class variances of the m-th image, $\sigma_k^2[m]$, are calculated by the BCVM based on Equation (8). The BCVM module is internally composed of L identical submodules, named Between-Class Variance of k (BCV), sharing the architecture shown in
Figure 6. Each k-th variance is computed in parallel by the k-th BCV submodule, which consists of four multipliers, two subtractors, a point shift, a Look-Up Table (LUT), and eleven registers (R).
Equation (8) is computed in parallel in this architecture; hence, the numerator and denominator are evaluated simultaneously. The numerator is obtained by first multiplying the k-th probability, $P_k[m]$, by the global mean, $\mu_G[m]$. Subsequently, the k-th BCV submodule performs the subtraction between $\mu_k[m]$ and this product, and the result of the subtraction is multiplied by itself. The denominator is calculated by initially subtracting the k-th probability, $P_k[m]$, from the value 1; lastly, the result is multiplied by $P_k[m]$ itself.
The division arithmetic operation is highly costly to the hardware in terms of processing speed, constituting the architecture’s bottleneck due to its long critical path. One way to avoid the division is to multiply the numerator by the reciprocal of the denominator. By definition, the reciprocal of a number is its multiplicative inverse. Thereupon, the denominator’s reciprocal can be approximated for a range of predefined values and stored in a LUT. Thus, the division can be performed using only one LUT and a multiplier, consequently increasing the throughput of the implementation.
Thus, each k-th value at the output of the denominator multiplier has an approximated reciprocal value in the k-th LUT. This LUT was configured with a depth of L, storing words of 33 bits, where 9 bits represent the integer part (one bit for sign) and 24 bits the fractional part. The mapping of the output value of each k-th multiplier to an address of the LUT is performed by shifting the binary point eight bits to the right in the k-th point-shift block. The approximated value of the reciprocal at the output of each k-th LUT is multiplied by the calculated value of the numerator; the result of this multiplication is the k-th between-class variance of the m-th image.
The first subtractor was configured with 9 bits in the integer part (one bit for sign) and 24 bits in the decimal part, while the second uses only 24 bits in the decimal part. The first multiplier uses 8 bits in the integer part (without sign) and 24 bits in the decimal part, while the second uses 18 bits in the integer part (one bit for sign) and 24 bits in the decimal part. Meanwhile, the third multiplier was configured with only 24 bits in the decimal part, and the final multiplier uses 27 bits in the integer part (one bit for sign) and 24 bits in the decimal part. Each k-th between-class variance calculated is propagated in parallel to the OTM.
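The division-free evaluation of Equation (8) can be modeled as below (a behavioral sketch; the LUT depth follows the description above, while the exact address scaling is an assumption for illustration):

```python
L = 256
# Reciprocal LUT: address a, obtained by scaling the denominator by L
# (a binary-point shift), maps to an approximation of 1/(a/L) = L/a.
RECIP_LUT = [0.0] + [L / a for a in range(1, L)]

def between_class_variance(P_k, mu_k, mu_g):
    """sigma_k^2 via numerator * LUT-approximated reciprocal of denominator."""
    num = (mu_g * P_k - mu_k) ** 2   # numerator of Eq. (8)
    den = P_k * (1.0 - P_k)          # denominator of Eq. (8)
    addr = int(den * L)              # point shift: quantize den to an address
    if addr == 0:
        return 0.0                   # degenerate class split, guard address 0
    return num * RECIP_LUT[addr]     # multiply instead of divide
```

When the quantized denominator lands exactly on a table entry, the result matches the true division; otherwise the error is bounded by the LUT resolution.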
4.6. Optimal Threshold Module (OTM)
The OTM module performs the last step of the Otsu algorithm, responsible for comparing all k-th values of the between-class variance, $\sigma_k^2[m]$, to determine the optimal threshold of the m-th image, $t^{*}[m]$, based on Equation (7). The architecture of this module is shown in
Figure 7. As can be observed, it consists of $L-1$ comparators and multiplexers, with a register (R) after each component.
According to Equation (7), the optimal threshold of the m-th image, $t^{*}[m]$, is the threshold value for which the highest between-class variance is obtained. For this purpose, the between-class variances of the m-th image, $\sigma_k^2[m]$, are compared through a comparator tree. Each comparator checks whether one variance is greater than the other, and its output is used as the selector of two multiplexers, which forward the larger of the two compared variances together with its respective threshold k. All multiplexer outputs are passed to the next level of the tree until the last comparator, at whose multiplexer outputs the optimal threshold value of the m-th image, $t^{*}[m]$, is selected together with its between-class variance.
Therefore, the optimal threshold of the m-th input image is determined by the proposed fully parallel architecture, and a new threshold value is computed at every m-th instant.
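The comparator tree of the OTM can be modeled as a pairwise tournament (a software sketch; the tie-breaking order is an assumption):

```python
def argmax_tree(variances):
    """Tree reduction selecting the threshold k with the largest variance."""
    # Leaves: (variance, threshold index k) pairs.
    level = [(v, k) for k, v in enumerate(variances)]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            # Comparator drives two multiplexers: pass the larger pair onward.
            nxt.append(a if a[0] >= b[0] else b)
        if len(level) % 2:           # odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
    best_var, best_k = level[0]
    return best_k
```

Each tree level halves the number of candidates, so the result is available after a number of comparison levels logarithmic in L, as opposed to a sequential scan over all L variances.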
5. Results
The architecture presented in the previous section was developed on an FPGA Arria 10 GX 1150, and the analysis of the synthesis results was carried out concerning hardware area occupation, throughput, and power consumption.
5.1. Hardware Area Occupation Analysis
Initially, the hardware area occupation analysis was performed for the architecture with only one PNH module. The results are shown in
Table 1. The first to third columns indicate the number of logic cells occupied, the number of multipliers implemented using DSP blocks, and the number of block memory bits, respectively. The table also presents the resources used as a percentage of the total available.
As can be observed, only a small percentage of the available block memory bits and logic cells was used, while the most-used resource was the DSP multipliers. Therefore, the data presented in
Table 1 demonstrate the feasibility of implementing the proposed architecture in the target FPGA. Besides, the Arria 10 FPGA still has resources available that can be used to implement additional logic, thus allowing an increase in the number of PNH modules.
PNH modules require only logic cells for their implementation, since they are designed without multipliers and memories. Therefore, the unused logic cells in the Arria 10 allow for increasing the number of PNH modules. Hence, we also analyzed the area occupation of the Arria 10 FPGA while increasing the number of PNH modules, G. The number of occupied logic cells is presented in
Table 2, as there is no change in the use of the other resources.
Figure 8 shows the curve obtained by linear regression using the set of values presented in
Table 2; the equation associated with the regression analysis is expressed as Equation (14). This equation yields the number of logic cells occupied by the architecture without the NHM module when $G = 0$ is defined. Therefore, Equation (14) allows for the estimation of the maximum number of PNH modules an FPGA can support.
Table 3 presents the maximum number of PNH modules supported on different commercial FPGAs [29,30]. The first and second columns present the label and the FPGA model, respectively, while the third and fourth columns present the number of logic cells available and the maximum number of PNH modules that can be implemented with these resources. Hence, a high degree of parallelism can be achieved with the proposed architecture, limited only by the FPGA resources available.
5.2. Processing Time Analysis
The data related to the system’s processing time were obtained considering the clock cycle defined by the architecture’s critical path. Moreover, as the circuit operates with a single clock, the sampling time, $t_s$, is equal to this clock cycle.
The system’s processing time for different numbers of PNH modules is presented in
Table 4. The first column indicates the number of PNH modules, while the second to fourth columns show, respectively, the image processing time, $t_m$, defined according to Equation (13); the initial system latency, $D$, according to Equation (11); and the throughput, $Th$, which in this work consists of the number of images processed per second (IPS), determined through Equation (12). According to Equation (13), the processing time of an image depends on its size; thus, the data presented in
Table 4 concern the processing of an image with 4K resolution.
As can be observed in
Table 4, the value of $Th$ is directly proportional to G and inversely proportional to $t_m$ and $D$. Thus, the more PNH modules employed in the implementation, the better the system performance.
Figure 9 shows the curve obtained by linear regression relating the values of $Th$ and $G$; the equation associated with this curve is expressed by Equation (15). According to Equations (12) and (13), Equation (15) can be rewritten as Equation (16), which allows one to define the $t_m$ values as

$$ t_m \approx \frac{WH}{G}\, t_s. \quad (17)$$

When comparing this equation to Equation (13), it is observed that the only difference between them is the absence of the logarithmic term. The reason is that

$$ \left\lceil \log_2 G \right\rceil \ll \frac{WH}{G}. \quad (18)$$

Thus, this component can be disregarded in the calculation of $t_m$.
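This negligibility can be checked numerically (a quick sketch; the 4K frame size and parallelism degree are illustrative assumptions):

```python
from math import ceil, log2

W, H, G = 3840, 2160, 256  # assumed 4K frame and parallelism degree
cycles_exact = W * H / G + ceil(log2(G))   # per Eq. (13), in units of t_s
cycles_approx = W * H / G                  # per Eq. (17)
# The adder-tree depth contributes only a tiny fraction of the total cycles.
assert (cycles_exact - cycles_approx) / cycles_exact < 1e-3
```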
Afterward, the proposed architecture’s processing time was analyzed for the other commercial FPGAs previously presented in
Table 3, adopting for each FPGA its maximum number of PNH modules. For this purpose, the values of $Th$ and $t_m$ were defined according to Equations (16) and (17), respectively, and different image resolutions were considered. The results are shown in
Table 5.
All FPGA models analyzed achieved a high throughput when processing images with 4K resolution, allowing real-time processing of 4K video on all models. For images with 8K and 10K resolutions, real-time processing proved to be viable on FPGA2 and FPGA3. Finally, a high throughput was also achieved by FPGA3 when processing 16K images, allowing real-time processing of video at this resolution. Therefore, FPGA3 offers the best performance due to the high number of PNH modules that can be implemented, i.e., the increased parallelism of the architecture.