A printed edition of this Special Issue is available at MDPI Books....

Open Access Article

Peer-Review Record

Hardware Implementation for an Improved Full-Pixel Search Algorithm Based on Normalized Cross Correlation Method

Electronics 2018, 7(12), 428; https://doi.org/10.3390/electronics7120428

by Guohe Zhang¹, Zejie Kuang¹, Sufen Wei¹,³, Kai Huang¹, Feng Liang¹ and Cheng-Fu Yang²,*

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Submission received: 7 November 2018 / Revised: 7 December 2018 / Accepted: 8 December 2018 / Published: 12 December 2018

(This article belongs to the Special Issue Selected Papers from IEEE ICKII 2018)

Round 1

Reviewer 1 Report

Dear authors, in the contributed paper, a fast full-pixel search method suitable for the digital speckle correlation method is proposed. The algorithm implemented is the NCC correlation matching algorithm.

The main drawback is that the paper does not show the insights of the HW design.

It is also not clear how the algorithm has been implemented. More or less, the reader can follow the first step of the procedure, the matching with the 5 templates. But the paper should show how the method works when different sizes are used for the template matching.

To obtain a match, a histogram of the statistics is used.

How the threshold is fixed, that is, how it is selected adaptively, is not well explained.

Later, the adaptive adjustment of the search area is used to adjust the position and size of the search area, but this is not clearly stated in the paper. Where are the convergence problems of this "search area"? Please explain the type of algorithm used for it.

Try to include the key elements and how it works. Later in the conclusions sections you could include other possibilities to use the soil of a certain EU region.

Finally, in the HW section the synthesized hardware circuit ready for implementation should be shown.

No real data, image capture or photo of the board is given that shows, at least once, that the system has been executed. Please also explain the testing system and the procedure used to keep track of the solution.

About the English, there is nothing to say: the paper is written in proper English and is technically sound.

Author Response

Dear Reviewer:

Thank you for the comments concerning our manuscript entitled “Hardware Implementation for an improved Full-pixel Search Algorithm based on Normalized Cross Correlation Method”. The comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have tried our best to improve the manuscript and have made some changes to it. These changes do not affect the content or framework of the paper. We sincerely appreciate your work and hope that the corrections will meet with your approval. Once again, thank you very much for your comments and suggestions.

The main corrections are highlighted in the paper, and the main improvements are described as follows.

Adjustments on Structure:

In order to show the insights of the HW design and to make the structure of the paper clear and well organized, we deleted the original Part 1.3 (Hardware circuit design) and added a new Part 3 (Hardware circuit design), which includes: Sum of square of all pixels in a reference sub-area, Sum of square of all pixels in a searching sub-area, Sum of cross-correlation product, Sum of product of the full template, and Selection of adaptive threshold.

Responses to the reviewer's comments:

Comments:

The main drawback is that the paper does not show the insights of the HW design.

To obtain a match, a histogram of the statistics is used.

How the threshold is fixed, that is, how it is selected adaptively, is not well explained.

Try to include the key elements and how it works. Later in the conclusions sections you could include other possibilities to use the soil of a certain EU region.

Finally, in the HW section the synthesized hardware circuit ready for implementation should be shown.

No real data, image capture or photo of the board is given that shows, at least once, that the system has been executed. Please also explain the testing system and the procedure used to keep track of the solution.

About the English, there is nothing to say: the paper is written in proper English and is technically sound.

Reply:

Figure 7 in the manuscript (page 6) shows the structure of the hardware implementation. The buffer unit reads in the data of the reference image and the target image serially and outputs them in parallel. The algorithm first uses the corresponding number of multipliers to calculate the correlation coefficient of the local matching template. If the full template calculation is required after the threshold comparison, the multiplier array is time-multiplexed, which ensures that the full template calculation for this window is completed in three clock cycles.

We think it was the lack of a hardware description of the local templates that caused the confusion. We describe it in detail below.

The local matching template is divided as shown in Figure 3 in the manuscript (page 3). The size of the full template is 31×31, and R0, R1, R2, R3, and R4 are all rectangular areas of size 7×7. The data in the search sub-area register group changes dynamically, so a parallel pipeline structure is used to calculate the sum of the squares of the pixels in the search sub-area. The five regions of the local template are divided into three parts that are calculated in parallel: the sums of squares of the regions R1R2, R0, and R3R4.

Figure 9 in the manuscript (page 7) shows the circuit that calculates the sum of the squares of R1R2. In a single clock cycle, the circuit computes both the sum of squares of the first 7 rows of data in the template and the sum of squares of the region R1R2. First, the square of each of the 7 parallel data values is calculated by an 8-bit multiplier. The squares are then summed by the pipeline adder, and the result is serially stored in 31 20-bit shift registers. Finally, the sum of squares of the region R1R2 (Sum_R1R2) is calculated by the pipeline adders 1 and 2 and the adder A, and the sum of squares of all 7 rows in the window (Sum1) is calculated by the pipeline adder 3 and the adder B. A similar circuit structure is used to obtain the sums of squares of the regions R0 and R3R4 and of the 7 rows each of them occupies, as well as the sum of squares of the remaining 10 rows. The sum of squares of all data in the local template of a search sub-area is obtained by summing Sum_R1R2, Sum_R0 and Sum_R3R4. The sum of squares of all data in the full template of the search sub-area is obtained by summing Sum1, Sum2, Sum3, and Sum4, so that it is ready in case full template matching is required later.
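The shift-register scheme for one 7-row band can be modelled in software roughly as follows. This is only an illustrative sketch, not the circuit of Figure 9: the function name `band_square_sums` is hypothetical, and placing R1 and R2 in the leftmost and rightmost 7 columns of the band is an assumption, because the layout of Figure 3 is not reproduced in this response. As a sanity check on the register width, 7 squared 8-bit pixels sum to at most 7 × 255² = 455,175, which fits in 20 bits.

```python
from collections import deque


def band_square_sums(band, width=31, region=7):
    """Model the column-wise sum of squares for one band of rows.

    band   : list of rows (7 in the text), each a list of 8-bit pixels
    width  : template width, i.e. the number of shift registers (31)
    region : side length of a local region such as R1 or R2 (7)

    Returns one (sum_r1r2, sum_band) pair per window position.
    ASSUMPTION: R1 is the leftmost and R2 the rightmost `region`
    columns of the band; the real layout is given by Figure 3.
    """
    shift = deque(maxlen=width)  # plays the role of the shift registers
    results = []
    for c in range(len(band[0])):
        # square the parallel pixels of this column and add them up
        column_sum = sum(row[c] * row[c] for row in band)
        shift.append(column_sum)  # the oldest column drops out automatically
        if len(shift) == width:
            regs = list(shift)
            sum_band = sum(regs)  # corresponds to Sum1
            sum_r1r2 = sum(regs[:region]) + sum(regs[-region:])  # Sum_R1R2
            results.append((sum_r1r2, sum_band))
    return results
```

Each new column costs 7 multiplications regardless of the template width, because the squares already held in the registers are reused as the window slides.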

As shown in Figure 10 in the manuscript (page 8), the hardware constructs a multiplier array with as many multipliers as there are pixels in the local template, and the reference sub-area and search sub-area register groups are the inputs of the multiplier array. In this way, the multiplier array completes the cross-correlation product calculation for all pixels in the local template in one clock cycle. Finally, the sum of cross-correlation products of the local template is obtained by summing all products with the pipeline adder.

The more pixels are involved in the calculation, the higher the matching accuracy, but the greater the consumption of hardware resources. We chose these 5 regions as the local template because we believe they represent the texture information of the entire template effectively, and we chose the size of each region as h=H/4 as a balance between accuracy and computational complexity. The size of the template can of course be changed. When it changes, the system still works as described earlier (calculating the square sum of the reference sub-area and calculating the sum of cross-correlation product). What needs to be done is to configure the system to adjust the regional division for the time-sharing calculation, so that the multiplier array can still complete the full template calculation in three clock cycles (revised paper p.9 line 225). The purpose of the local template method is to exclude some non-matching points and thus reduce computational complexity. The best match point is still obtained after the full template calculation.
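The two-stage idea (local template first, full template only for candidates that pass the threshold) can be illustrated with a short software sketch. It is not the hardware design: the names `ncc`, `local_points` and `match_score` are hypothetical, the correlation coefficient is taken to be the sum of products divided by the square root of the product of the two sums of squares, and the five regions are assumed to sit at the four corners and the centre of the template, which may differ from Figure 3.

```python
import math


def ncc(ref, tgt, points):
    """Normalized cross correlation of two equally sized windows over `points`."""
    sum_fg = sum_ff = sum_gg = 0
    for r, c in points:
        f, g = ref[r][c], tgt[r][c]
        sum_fg += f * g   # sum of cross-correlation products
        sum_ff += f * f   # sum of squares, reference sub-area
        sum_gg += g * g   # sum of squares, search sub-area
    denom = math.sqrt(sum_ff * sum_gg)
    return sum_fg / denom if denom else 0.0


def local_points(size=31, h=7):
    """Pixel coordinates of the five h x h regions.

    ASSUMPTION: four corners plus the centre of the size x size template.
    """
    m = (size - h) // 2
    corners = [(0, 0), (0, size - h), (m, m), (size - h, 0), (size - h, size - h)]
    return [(r0 + r, c0 + c) for r0, c0 in corners
            for r in range(h) for c in range(h)]


def match_score(ref, tgt, threshold, size=31, h=7):
    """Return the full-template coefficient, or None if the local test fails."""
    if ncc(ref, tgt, local_points(size, h)) < threshold:
        return None  # rejected cheaply using 245 of the 961 pixels
    full = [(r, c) for r in range(size) for c in range(size)]
    return ncc(ref, tgt, full)
```

With a 31×31 template and 7×7 regions the local test touches 245 pixels, so a rejected candidate costs about a quarter of a full evaluation.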

We did not clearly explain the threshold fixation and adaptive adjustment of the search area in the original manuscript. We will explain them in detail in the following.

We use the correlation coefficient of the local template as the threshold. Therefore, the closer the threshold is to 1, the stronger the correlation between the two sub-areas (reference sub-area and search sub-area) (revised paper p.4 line 104). Under ideal conditions, the best matching point should be the one with the correlation coefficient closest to 1. The threshold method can reduce the effects of noise and exposure. There is an overlapping area between the reference sub-areas and the search areas of two adjacent matching points, so their texture information, noise and exposure effects are similar, and a certain similarity can be seen in the distribution of their correlation coefficients. For two adjacent points A and B, we use a histogram to collect statistics on the correlation coefficients of matching point A under the local template. The correlation coefficient range 0~1 is subdivided into 100 intervals of length 0.01, which form the abscissa of the histogram. The ordinate represents the number of matching points in each interval. Each time the local matching template slides to a new position in the search area, a correlation coefficient is calculated, the interval in which it lies is determined, and the count of that interval is increased by 1. When the matching point search is completed, the number of occurrences of the correlation coefficient in each interval is known. The counts are then accumulated sequentially from the largest interval to the smallest, and accumulation stops when the accumulated value reaches 8% of the total amount.

Considering the effects of noise and exposure, we chose 8% as the margin, and the abscissa at which accumulation stops is used as the threshold of matching point B. This is a compromise between precision and computation. The percentage value is configurable. If the accuracy requirements are particularly high or the noise impact is particularly large, this value can be increased, which lowers the threshold (revised paper p.5 line 135).
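The histogram procedure can be summarized in a few lines of code. The sketch below follows the steps as described in this response (100 intervals, accumulation from the highest interval downwards, stop at 8%); the function name `adaptive_threshold` and the choice to return the lower edge of the stopping interval are assumptions made for illustration.

```python
def adaptive_threshold(coeffs, bins=100, percent=8):
    """Derive the threshold for point B from the coefficients of point A.

    coeffs  : local-template correlation coefficients in [0, 1]
    bins    : number of equal intervals on the abscissa (100 -> width 0.01)
    percent : margin at which accumulation stops (8 in the text)

    ASSUMPTION: the threshold is the lower edge of the stopping interval.
    """
    hist = [0] * bins
    for c in coeffs:
        idx = min(max(int(c * bins), 0), bins - 1)  # interval that contains c
        hist[idx] += 1

    total = len(coeffs)
    acc = 0
    for idx in range(bins - 1, -1, -1):  # from the largest interval downwards
        acc += hist[idx]
        if acc * 100 >= percent * total:  # integer test for "reached 8%"
            return idx / bins
    return 0.0
```

Raising `percent` keeps more candidates for the full-template stage, which lowers the threshold and increases the amount of calculation.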

We added the hardware description of the threshold in the manuscript (subsection 2.5).

To be precise, the adaptive adjustment of the search area adjusts the position but not the size of the search area (revised paper p.6 line 160). We apologize for our negligence.

When industrial measurements are made using digital speckle, the displacement or deformation of each point on the surface of an object can be considered to vary continuously. Sudden changes are generally unlikely, so the displacement values of two adjacent matching points in the speckle pattern before and after the displacement can be assumed not to differ significantly. In the traditional full-pixel search algorithm, a rectangular area that is centered on the coordinates of the center point P of the reference sub-area and is larger than the reference sub-area is framed in the target image as the search area. As shown in Figure 6 in the manuscript (page 5), the size of the initial search area is specified as 256×256 in the manuscript. When the search is completed, the best matching point P* is obtained, and the displacement (u,v) of the current matching point is recorded. When searching for the adjacent matching point Q, the point Q* that corresponds to Q after the displacement (u,v) is located in the target image. Because the displacement values of two adjacent matching points do not differ significantly, the adaptively adjusted search area of Q is placed at Q*. The sizes of the initial search area and of the adjusted search area are configurable. Considering accuracy and resource consumption, we chose 151×151 as the size of the adjusted search area. In the case of sudden changes, the size of the adjusted search area can of course be increased appropriately.
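A minimal sketch of this adjustment is given below. It assumes that the adjusted search area is a square of the configured size centred on Q* = Q + (u, v) and that it is clipped to the image borders; the function name `adjusted_search_area` and the clipping step are illustrative assumptions, not details taken from the manuscript.

```python
def adjusted_search_area(q, displacement, size=151, image_shape=None):
    """Return (left, top, right, bottom), inclusive, of the search area for Q.

    q            : (x, y) of the matching point Q in the reference image
    displacement : (u, v) found for the previous, adjacent matching point
    size         : side length of the adjusted search area (151 in the text)
    image_shape  : optional (width, height) used to clip the area
    """
    cx, cy = q[0] + displacement[0], q[1] + displacement[1]  # Q*
    half = size // 2
    left, top = cx - half, cy - half
    right, bottom = cx + half, cy + half
    if image_shape is not None:
        width, height = image_shape
        left, top = max(left, 0), max(top, 0)
        right, bottom = min(right, width - 1), min(bottom, height - 1)
    return left, top, right, bottom


# Q at (300, 300), previous displacement (12, -5)
print(adjusted_search_area((300, 300), (12, -5)))  # (237, 220, 387, 370)
```

Compared with the initial 256×256 area, a 151×151 area contains roughly one third as many candidate positions.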

Because the synthesized hardware circuit is complex and cluttered, showing it would convey little. Instead, we include Figure 7 in the manuscript (page 6) to show the structure of the hardware implementation.

We also include Figure 8 (page 7), Figure 9 (page 7) and Figure 10 (page 8) in the manuscript to describe every module of the hardware design.

In addition, Figure 11 in the manuscript (page 9) shows the state machine that controls the multiplier array.

Figure 1 is a photo of the FPGA board used in this paper, and Figure 2 shows the data received by the serial debugger. Because they add little to the article, we did not include them in the manuscript.

Figure 1. Stratix IV series FPGA (EP4SGX530HH35C2)

Figure 2. Data received by the serial debugger

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper briefly presents the hardware implementation of a full-pixel search algorithm used in digital speckle correlation. The hardware implementation of the algorithm was carried out on a Stratix IV FPGA and was used in a hardware-in-the-loop setup to perform pixel search. The results showed a great improvement in computation time (2000x faster than a software solution). In my assessment, the FPGA implementation of this algorithm is a first step towards integrating the algorithm into an ASIC that may power a new generation of digital speckle correlation measurement instruments.

My suggestions to improve the content of the paper:

1. The authors assessed the performance of the algorithm in terms of computation time and compared the implementation to its software implementation. Are there other hardware implementations of pixel search algorithms? If so, the present implementation should be compared to those.

2. Express in percent the resource usage in the FPGA.

3. There is a typo: MTLAB instead of MATLAB.

4. The significance of Figure 7 is marginal; the gray distributions of the two pictures look almost the same. You may consider removing these pictures.

5. Enlarge the surfaces in Figure 8.

Author Response

Dear Reviewer:

The main corrections are highlighted in the paper, and the main improvements are described as follows.

Adjustments on Structure:

Adjustments on Figures:

In the new section 2 “Hardware circuit design”, Figures 7 to 13 are added to describe the structure of the hardware implementation and every module of the hardware design. Figure 7 in the original paper is deleted. Figure 8 in the original paper becomes Figure 14 in the revised manuscript.

Responses to the reviewer's comments:

Comment 1: The authors assessed the performance of the algorithm in terms of computation time and compared the implementation to its software implementation. Are there other hardware implementations of pixel search algorithms? If so, the present implementation should be compared to those.

Reply:

Image correlation is today a mature concept with hundreds of different algorithms in circulation. Existing algorithms are therefore well optimized when implemented in software, and many software algorithms have been proposed in the field of pixel search. However, few digital speckle matching algorithms similar to the one in this paper have been implemented in hardware. Facing the challenge of real-time processing, we aimed to take advantage of parallelization, pipeline structure and low cost, and proposed an improved fast full-pixel search algorithm designed with hardware implementation in mind. We tried to compare the present implementation with others, but we were not able to find suitable data for comparison.

Comment 2: Express in percent the resource usage in the FPGA.

Reply:

The resource consumption of FPGA is shown in Table 3 in the revised manuscript. (revised paper p.12 line286)

Table 3. Resource consumption in FPGA

Resource	Consumption
Combinational ALUTs	8.55% (18173/212480)
Dedicated logic registers	5.96% (31665/531200)
Total pins	6.08% (70/1152)
BUFG	18.75% (3/16)
Total block memory bits	0.35% (95317/27376K)
DSP block 18-bit elements	28.42% (291/1024)

Comment 3: There is a typo: MTLAB instead of MATLAB.

Reply:

We have corrected the typo (revised paper p.12 line 290). Thank you for pointing it out.

Comment 4: The significance of Figure 7 is marginal; the gray distributions of the two pictures look almost the same. You may consider removing these pictures.

Reply:

Figure 7 in the original paper was included because it is the input of the experiment. However, the gray distributions of the two pictures do look almost the same, and deleting them has no effect on the consistency of the article, so we decided to delete the figure.

Comment 5: Enlarge the surfaces in Figure 8.

Reply:

The surfaces in Figure 8 in the original paper (Figure 14 in the revised manuscript) were too small. We have enlarged them (revised paper p.11 line 268).

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Dear authors,

The reviewed version of the paper includes more information about the hardware insights done in the work presented. The algorithm is now fully shown.

In my opinion, some minor parts should be modified prior to publishing the paper. The figures should appear after they are cited in the text (this is not the case for Fig. 1 and Fig. 13, for example). Some text in Figure 12 is crossed by lines, which could be avoided.

Some clarifications should also be included. For example, on page 10, please give the reasons why 8% of the total amount is the fixed threshold at which accumulation stops. Is it related to the selected sub-area size?

Additionally, the buffer unit responsible for reading and distributing the image data should be explained in more detail. Usually, memory access is the bottleneck of these systems, so how it is implemented and how it works is of great interest to the readers.

Finally, an equation that counts the number of cycles needed for each kind of process should be included. For example, if the multiplier array consumes 3 cycles for a 51x51 template, then the read process and the accumulation pipeline would result in XXX cycles for each template matching. By applying the frequency of the system, the reader would then obtain the main information about the speed-up of the hardware implementation compared with other processing systems.

Author Response

Dear Reviewer:

The main corrections are highlighted (in blue) in the paper, and the main improvements are described as follows.

Adjustments on Structure:

In order to explain the buffer unit in more detail, we added a new section (Section 3.1 Buffer unit), which includes the Data preprocessing module, the Parallel data output module and the Serial shift register groups of the matching window.

Responses to the reviewer's comments:

Comments:

The reviewed version of the paper includes more information about the hardware insights done in the work presented. The algorithm is now fully shown.

Reply:

We were not careful enough in typesetting before. We have now adjusted the positions of the figures, and we have redrawn Figure 12 in the original paper (Figure 18 in the revised manuscript, page 12).

We have explained the threshold fixation in Item 2 of Section 1.1 “Adaptive selection method of thresholds for histogram statistics” (page 4, lines 103 to 138). Perhaps our description was not clear enough. Because we use the correlation coefficient of the local template as the threshold, the closer the threshold is to 1, the stronger the correlation between the two sub-areas (reference sub-area and search sub-area). Under ideal conditions, the best matching point should be the one with the correlation coefficient closest to 1. Noise and exposure may cause some deviations, but the best match point is still one of the points whose correlation coefficient is closest to 1. We count all points by correlation coefficient, from 1 down to 0, using the histogram and retain those points close to 1 (revised paper page 13 line 299). With the input speckle images fixed, we performed a large number of simulation experiments. We found that, after the accumulation described in the paper, no best match point is lost when the accumulated value reaches 3% of the total amount. In the end, we decided to use the abscissa at which the accumulated value reaches 8% of the total amount as the threshold. It is not related to the selected sub-area size. In fact, it depends on the specific application scenario, and its main determinants are noise and exposure. This percentage value is configurable. If the noise or the exposure is high, it can be increased. Inevitably, increasing this value increases the amount of calculation. In most cases, such as the conditions of the experiments in this article, 8% is sufficient (revised paper page 13 line 310).

We added a new section (Section 3.1 Buffer unit, page 6 line 180 to page 9 line 242) to explain the buffer unit in more detail.

For the read process, one row of data is read per clock cycle (revised paper page 8 line 222); for a 31x31 template, for example, this takes 31 cycles. For the accumulation pipeline, the number of cycles needed to produce the first result depends on the number of pipeline stages. For the pipeline adder in Figure 16 (page 16), the number of points to be summed is 245 (7x7x5), which takes 8 cycles. If the size of the template is 51x51, R0~R4 are 13x13 regions, which takes 10 cycles. After that, each calculation is completed in one cycle (revised paper page 11 line 279). For the local template, the sum of squares and the sum of products are completed in one cycle (revised paper page 10 line 256). For the full template, the sum of products is completed in three cycles (revised paper page 11 line 295).
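The cycle counts quoted above can be reproduced with a small calculation. The sketch assumes that the pipeline adder is a binary adder tree with one register stage per level, so that its fill latency is the base-2 logarithm of the number of inputs, rounded up; this assumption is ours for illustration and merely happens to match the 8 and 10 cycles given in the reply. The function names are hypothetical.

```python
def adder_tree_stages(n_inputs: int) -> int:
    """Levels of a binary adder tree with n_inputs leaves, i.e. ceil(log2(n))."""
    if n_inputs < 1:
        raise ValueError("n_inputs must be positive")
    return (n_inputs - 1).bit_length()


def cycle_estimate(template_size: int, region_size: int, regions: int = 5) -> dict:
    """Cycle counts per step, using the figures stated in the reply.

    template_size : side length of the full template (31 or 51 in the text)
    region_size   : side length of each local region (7 or 13 in the text)
    """
    local_inputs = regions * region_size * region_size
    return {
        "read": template_size,                            # one row per cycle
        "pipeline_fill": adder_tree_stages(local_inputs), # ASSUMED binary tree
        "local_template": 1,                              # one result per cycle
        "full_template_product": 3,                       # time-shared array
    }


print(cycle_estimate(31, 7))
# {'read': 31, 'pipeline_fill': 8, 'local_template': 1, 'full_template_product': 3}
```

For the 51x51 case, `cycle_estimate(51, 13)` gives a pipeline fill of 10 cycles for the 845 local-template points.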

Author Response File: Author Response.pdf
