Article
Peer-Review Record

Linear Time Non-Local Cost Aggregation on Complementary Spatial Tree Structures

Appl. Sci. 2023, 13(21), 11892; https://doi.org/10.3390/app132111892
by Penghui Bu 1,2,*, Hang Wang 1,*, Yihua Dou 1 and Hong Zhao 3
Submission received: 29 July 2023 / Revised: 9 October 2023 / Accepted: 10 October 2023 / Published: 30 October 2023

Round 1

Reviewer 1 Report

  1. The paper proposes an efficient non-local cost aggregation method which performs filtering along two complementary tree structures.

  2. The paper's contributions consist of:
    • Novel complementary tree structures are proposed to balance the information propagated along all directions.
    • Efficient algorithms with linear time complexity are presented to compute the filtering output on each tree structure.
    • A comparison of handcrafted features and features learned by CNNs in computing stereo matching cost.
    • Extensive experiments in optical flow estimation and stereo matching show the effectiveness of our approach, and it turns out our spatial tree structures are superior to the MST-based trees in cost aggregation on the Middlebury [17] and KITTI [18,19] data sets.

  3. The motivation of the new method and the applications it addresses should be added to the introduction.

  4. The gap in the research field is not clear from the introduction concerning all contribution points.

  5. The proposed algorithms could be referenced via GitHub code.
  6. Fig. 2: the arrows should be explained and the differences could be highlighted.

  7. The conclusion could show results more numerically, especially for the comparison of CNN and the proposed method.

  8. In general a well written paper that could be published after the small corrections given above. Also, a GitHub repository or Python library is highly desirable.

Author Response

Q1: The motivation of the new method and the applications it addresses should be added to the introduction.

Answer: We have rewritten the introduction of our manuscript. We first presented the basic ideas of dense correspondence tasks in stereo matching and optical flow estimation. Then, we listed some typical approaches for this problem which are highly correlated with our work, and pointed out the shortcomings of these methods. Finally, we addressed the motivation of our method. The contributions of our work are mainly two-fold: a novel efficient non-local filtering method on two complementary spatial tree structures and a comparison of handcrafted features and features learned by CNNs in computing stereo matching cost.

The motivation of our novel efficient non-local filtering method on two complementary spatial tree structures comes from two key observations: first, the similarity of neighboring pixels can be evaluated by their intensity distance, so there is no need to extract a tree structure from the guidance image with extra computation; second, adjacent pixels with similar intensity tend to possess similar disparities, which suggests that informative messages should propagate along all edges.

The motivation of the comparison of handcrafted features and features learned by CNNs in stereo matching is that deep CNNs with a large receptive field can extract high-level semantic information from the guidance image, so the matching cost generated by deep features is more robust, even for pixels in homogeneous areas. However, features learned by CNNs have low positioning accuracy and are also less distinctive than handcrafted features. Therefore, an investigation of these two kinds of features in evaluating the similarity of corresponding pixels is necessary, and the integration of handcrafted features and features learned by CNNs may help to improve the generalization ability of deep learning-based approaches.

 

Q2: The gap in the research field is not clear from the introduction concerning all contribution points.

Answer: We have rewritten the introduction in the revised version. The gaps concerning all contributions of our manuscript are described in the new introduction.

The main contributions of our manuscript are: 1) novel complementary spatial tree structures to perform non-local cost aggregation; 2) linear time complexity algorithms to efficiently aggregate informative messages across the whole image; 3) a comparison of handcrafted features and features learned by CNNs in computing stereo matching cost.

Nowadays, non-local filters based on the minimum spanning tree (MST) tend to overuse the piece-wise constant assumption, resulting in an over-smoothing problem in cost aggregation, while many other non-local methods perform filtering along a row or column in each pass and can thus introduce streak artifacts in the disparity image. Our work attempts to overcome these shortcomings. We propose a novel efficient non-local cost aggregation approach based on two complementary spatial tree structures. These two spatial trees complement each other to balance the information propagated along all directions. Furthermore, a recursive strategy is used to implement the filtering procedure, which gives our algorithm linear time complexity. Experimental results in stereo matching and optical flow estimation show the effectiveness of our approach.
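
To make the recursive, linear-time aggregation idea concrete, here is a minimal one-dimensional sketch of edge-aware cost aggregation along a single scanline. It is not the full two-tree scheme of the manuscript; the exponential weighting, the sigma parameter and the function name are illustrative assumptions.

```python
import numpy as np

def aggregate_scanline(cost, guide, sigma=0.1):
    """Edge-aware cost aggregation along one scanline in O(n).

    cost  : 1-D array, matching costs of one disparity slice.
    guide : 1-D array, guidance intensities in [0, 1].
    sigma : range parameter; small values stop propagation at intensity edges.
    """
    cost = cost.astype(float)
    n = cost.size
    # Weight between neighbouring pixels from their intensity distance only.
    w = np.exp(-np.abs(np.diff(guide)) / sigma)

    fwd = cost.copy()                      # support from pixels to the left
    for i in range(1, n):
        fwd[i] += w[i - 1] * fwd[i - 1]

    bwd = cost.copy()                      # support from pixels to the right
    for i in range(n - 2, -1, -1):
        bwd[i] += w[i] * bwd[i + 1]

    # Each pixel's own cost is counted in both passes; subtract it once.
    return fwd + bwd - cost
```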

Features learned by CNNs with a large receptive field can extract high-level semantic information from the guidance image, which improves the robustness of the matching cost. However, features learned by CNNs have low positioning accuracy and are also less distinctive than handcrafted features. To the best of our knowledge, there is no evaluation so far of deep features learned by CNNs and handcrafted features in computing stereo matching cost under typical traditional cost aggregation methods. Thus, an investigation of these two kinds of features in evaluating the similarity of matching pixels is presented in our manuscript.

 

Q3: The proposed algorithms could be referenced via GitHub code.

Answer: Our code will be made publicly available as soon as our manuscript is accepted.

 

Q4: Fig. 2: the arrows should be explained and the differences could be highlighted.

Answer: We have presented more details about Fig. 2 in the caption. Black and red arrows indicate the directions in which information is propagated in the first and second steps when filtering along our two complementary spatial trees.

In Fig. 2, (a) and (b) show the ways information is propagated along our complementary spatial trees (Tree 1 and Tree 2) when computing the output of the root node (red node). The dotted lines between nodes p and q indicate the different paths used to compute the geodesic distances between pixels. (c) and (d) show the way to compute the output of the root node on Tree 1: the filtering procedure is performed first along the horizontal and vertical directions and then along the diagonal directions. (e) and (f) show the way to compute the output of the root node on Tree 2: the filtering procedure is performed first along the diagonal directions and then along the horizontal and vertical directions. (g) shows the filtering procedure of the recursive filter, which is performed first along the horizontal direction and then along the vertical direction. (h) shows the way information is propagated along the MST.

 

Q5: The conclusion could show results more numerically, especially for the comparison of CNN and the proposed method.

Answer: We have rewritten the conclusion in the revised version. The numerical comparison of handcrafted features and deep features learned by CNNs in computing stereo matching cost is added to the conclusion and future work (Section 5). After refinement, the percentage of erroneous pixels and the average end-point error in non-occluded areas when using features learned by deep CNNs to compute the matching cost are improved by 37.2% and 8.2%, respectively.

 

Q6: In general a well written paper that could be published after the small corrections given above. Also, a GitHub repository or Python library is highly desirable.

Answer: We have read our manuscript carefully and tried our best to improve the language. We also rewrote the introduction to address the motivation of our work. Our code will be available when our manuscript is accepted.

Reviewer 2 Report

In the reviewed paper, the authors presented a spatial tree-based non-local cost aggregation method based on two tree structures. The proposal was tested in optical flow estimation and stereo matching. Generally, the proposed method is promising. However, the lack of clear explanations regarding main terms, few details about implementation, and poor discussions characterize the paper.

 

The introduction should be rewritten. The authors should include information regarding the context of the field addressed. Please include explanations of the main terms. For example, What is dense correspondence? What is a labeling task? How can the costs be smoothed? What is a semantic feature? What is a loss function? What is cost volume? What is stereo matching? How is optical flow estimated?

 

Please include references to support the arguments presented. For example, the first sentence of the introduction explains that dense correspondence has been extensively studied. Therefore, please include evidence of this.

 

Please avoid exaggerations. For example, "extensive research work," "effective approaches," and "unprecedented progresses."

 

Please clearly explain the problem to be solved. How is the problem currently solved? Why is a new solution needed? What is the aim of the paper?

 

Figure 1 should be explained in depth. Please erase the exaggeration on the fourth contribution. How many experiments are needed to ensure that they are extensive?

 

Section II can be erased. No important information was presented. Please ask for help with how to analyze the related work and how to write the report. It would be desirable to build a comparison table. The information from Section II was not reused to conduct discussion or comparisons.

 

Please include more details to explain Figure 2. What is the meaning of the arrows?

 

In section V, please avoid exaggerations. Please demonstrate that the experiments are extensive. Please explain the process for selecting the data sets and the approaches employed in experimentation. What were the other data sets and approaches considered? Why were the other data sets and approaches discarded?

 

The details about how the experiments were conducted must be included. I think the authors on page 8, line 20, tried to refer to Table II, not Table 1. Please insert more details to explain Figure 3. Please insert more explanations about the handcrafted feature extraction.

 

Please explain what happened in Table 3, because in almost all comparisons the proposal did not obtain the best results. A better discussion of the results should be included. For all the results, please demonstrate that the differences are significant enough to establish that one method is better.

 

 

Please rewrite the conclusions. The conclusions should be derived from the information presented in the paper's body. Ultimately, it is unclear whether the proposal addressed the contributions explained in Section I.

Only a few details regarding English were detected.

Author Response

Q1: In the reviewed paper, the authors presented a spatial tree-based non-local cost aggregation method based on two tree structures. The proposal was tested in optical flow estimation and stereo matching. Generally, the proposed method is promising. However, the lack of clear explanations regarding main terms, few details about implementation, and poor discussions characterize the paper.

Answer: We have made significant improvements over the original manuscript. The introduction was rewritten in the revised manuscript, and the main terms are all clearly explained. In order to make our filtering approach easier to understand, more details are added to the caption of Fig. 1. More discussions about the experimental results are also provided for Fig. 3 and Table 4 (corresponding to Table 3 in the original manuscript). Furthermore, the conclusion is also improved.

In the introduction of our revised manuscript, we first explained the main terms, such as dense correspondence, stereo matching and optical flow estimation. Then, the procedures used to obtain the result are given. A discussion of typical methods shows the shortcomings of these approaches. With these discussions, we presented the motivations of our non-local filters on two complementary spatial trees. In addition, the shortcomings and advantages of features learned by CNNs are provided, which is the motivation for evaluating handcrafted features and deep features learned by CNNs in computing stereo matching cost under typical cost aggregation methods. Finally, we presented the filtering procedures on our spatial tree structures and the conclusions of the experimental results.

In order to make our method easier to understand, we presented more details in Fig. 1, Fig. 2 and Fig. 3. The ways messages are propagated and the details of filtering along our spatial tree structures are provided in the caption of Fig. 1. The meanings of the two kinds of arrows in Fig. 2 are provided to show the artifacts generated by different filtering methods. More explanations about the results in Fig. 3, generated by typical cost aggregation methods on the Middlebury dataset, are elaborated. In order to show the differences among non-local filters, a comparison of typical non-local edge-aware filters, namely NL, ST, DT and our method, is presented in Table 1 in the revised manuscript.

An in-depth explanation of Table 4 in the revised manuscript (corresponding to Table 3 in the original manuscript) is also provided. The reason the results of GF are slightly better than ours on the KITTI 2012 dataset when using handcrafted features to compute the matching cost is that only the disparities of pixels near the ground are provided in the ground-truth disparity image. Most of these valid pixels are located in highly textured regions, so GF can make full use of the structure information in local areas to generate high-quality results, while our method is superior to GF in homogeneous areas, as shown in Fig. 4 and Fig. 5 in the revised manuscript.

All these explanations, details and discussions are added to the revised manuscript, so that our method is easier to understand.

 

Q2: The introduction should be rewritten. The authors should include information regarding the context of the field addressed. Please include explanations of the main terms. For example, What is dense correspondence? What is a labeling task? How can the costs be smoothed? What is a semantic feature? What is a loss function? What is cost volume? What is stereo matching? How is optical flow estimated?

Answer: We have rewritten the introduction in the revised manuscript. We use unified expressions for the main terms throughout our manuscript, and the explanations of the main terms are provided in the new introduction. For example, the aim and usage of dense correspondence tasks are explained in the first sentence. Then, the difference between stereo matching and optical flow estimation is given. The four steps used to determine the result of dense correspondence for stereo matching are listed, and the way to implement each step is presented.

A labeling task in computer vision aims to assign a label to each pixel; in our original manuscript, the labeling tasks are stereo matching and optical flow estimation. A semantic feature in our original manuscript is another name for a feature learned by CNNs. These confusing expressions are removed in the revised version to make our method easier to understand.

 

Q3: Please include references to support the arguments presented. For example, the first sentence of the introduction explains that dense correspondence has been extensively studied. Therefore, please include evidence of this.

Answer: We have added related references to support the arguments in our revised manuscript. For example, related references are added to demonstrate the usage of dense correspondence in 3D reconstruction, image registration and interactive segmentation.

 

Q4: Please avoid exaggerations. For example, "extensive research work," "effective approaches," and "unprecedented progresses".

Answer: We have read our manuscript carefully and removed all these exaggerated words.

 

Q5: Please clearly explain the problem to be solved. How is the problem currently solved? Why is a new solution needed? What is the aim of the paper?

Answer: We have rewritten the introduction in the revised manuscript. We first provided the basic procedures of typical traditional methods used to solve dense correspondence tasks, and then listed the shortcomings of these methods. The aim of this manuscript is to overcome or alleviate the shortcomings of widely-used cost aggregation methods. Finally, our novel non-local cost aggregation method is presented.

Our work mainly focuses on cost aggregation in dense correspondence tasks, namely stereo matching and optical flow estimation. Due to radiometric variations, the initial matching costs are very noisy. Thus, cost aggregation is needed to improve the robustness of the matching costs. A typical way to perform cost aggregation is to compute the weighted average or sum of matching costs over a support window. Local filtering methods only take pixels in a local window into account, so they are not geometrically adaptive and cannot make full use of information from the whole image. MST-based non-local filtering methods extend the support region to the whole image by extracting tree structures from the guidance image. However, these methods tend to overuse the piece-wise constant assumption. Furthermore, extra computation is needed to build the tree structures in MST-based non-local filters.
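
For contrast with the non-local scheme, here is a minimal sketch of the local weighted-average aggregation over a support window described above, assuming bilateral-style weights; the window radius and the sigma parameters are illustrative, not values from the manuscript.

```python
import numpy as np

def local_aggregate(cost, guide, radius=4, sigma_s=3.0, sigma_r=0.1):
    """Weighted average of matching costs over a square support window.

    cost, guide : 2-D arrays (one cost slice and guidance intensities in [0, 1]).
    Only pixels inside the (2*radius+1)^2 window contribute, which is the
    limitation discussed above: no support can come from outside the window.
    """
    h, w = cost.shape
    out = np.zeros_like(cost, dtype=float)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            spatial = ((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2)
            range_ = (guide[y0:y1, x0:x1] - guide[y, x]) ** 2 / (2 * sigma_r ** 2)
            weights = np.exp(-(spatial + range_))
            out[y, x] = np.sum(weights * cost[y0:y1, x0:x1]) / np.sum(weights)
    return out
```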

The main goal of our work is to present novel tree structures which can aggregate information from the entire image while balancing the messages propagated along all directions. This is motivated by two key observations: 1) the similarity of neighboring pixels can be evaluated by their intensity distance, so there is no need to extract a tree structure from the guidance image with extra computation; 2) adjacent pixels with similar intensity tend to possess similar disparities, which suggests that all edges in the 4-connected neighborhood should be taken into account.

Another contribution of our work is a comparison of handcrafted features and features learned by deep CNNs in computing stereo matching cost. Deep CNNs with a large receptive field can extract high-level semantic information from the guidance image, improving the robustness of the matching costs in homogeneous regions. However, these learned features have low positioning accuracy and are also less distinctive than handcrafted features. Thus, an investigation of these two kinds of features in evaluating the similarity of matching pixels is necessary.

 

Q6: Figure 1 should be explained in depth. Please erase the exaggeration on the fourth contribution. How many experiments are needed to ensure that they are extensive?

Answer: We have added more details to explain the filtering procedures on the different trees in Fig. 1. The meanings of the black and red arrows are elaborated in the caption. The difference between the ways the distance between pixels is measured on the two complementary spatial trees is explained. The filtering procedures on each spatial tree and in two other non-local filtering methods (the recursive filter and the non-local filter along an MST) are also presented.

 All the exaggerated words are removed in our revised manuscript.

 

Q7: Section II can be erased. No important information was presented. Please ask for help with how to analyze the related work and how to write the report. It would be desirable to build a comparison table. The information from Section II was not reused to conduct discussion or comparisons.

Answer: We have removed Section II in the revised version and rewritten the introduction. In the new introduction, we first analyzed the shortcomings of typical cost aggregation methods and then presented the strategies to overcome them. Finally, the motivation of our work is given. In order to aggregate information across the entire image and avoid the over-smoothing problem introduced by MST-based non-local filters, we attempt to construct novel spatial tree structures which are able to balance the information propagated along all directions. As for the features used in computing stereo matching cost, handcrafted features are fragile under radiometric variations, while features learned by CNNs are highly robust even in low-texture areas. However, deep features learned by CNNs show low positioning accuracy and are also less distinctive. Therefore, we compare the performance of these two kinds of features in computing stereo matching cost under typical cost aggregation methods.

A comparison of typical MST-based non-local cost aggregation methods (NL, ST and DT) and our method is presented in Table 1 in the revised version. Both NL and ST need extra computation to build tree structures, while DT and our method take advantage of the spatial relationship among pixels to propagate information across the entire image. In order to keep high efficiency, all these non-local filters are implemented in a recursive manner. The spatial distance between any two pixels for NL and ST is evaluated along the corresponding tree structure, while it is the Manhattan distance for DT. In our method, the spatial distance is composed of two parts: one along the row/column directions and one along the diagonal directions. Moreover, our method propagates information along edges in the 8-connected neighborhood, whereas NL, ST and DT only take some of the edges in the 8-connected neighborhood into account.

 

Q8: Please include more details to explain Figure 2. What is the meaning of the arrows?

Answer: More details are added to the caption of Fig. 2. The blue arrows indicate the fake edges among sub-trees generated by the non-local filter based on an MST, which do not appear in our proposed complementary spatial trees, while the red arrows indicate the streak artifacts produced when filtering only along one of our complementary spatial trees. When the filtering is performed on both of our complementary tree structures, the streak artifacts become unnoticeable.

 

Q9: In section V, please avoid exaggerations. Please demonstrate that the experiments are extensive. Please explain the process for selecting the data sets and the approaches employed in experimentation. What were the other data sets and approaches considered? Why were the other data sets and approaches discarded?

Answer: We have read our manuscript carefully and removed all the exaggerated expressions in our revised manuscript.

We perform experiments on the widely-used Middlebury and KITTI datasets to evaluate the performance of typical cost aggregation approaches. The Middlebury dataset contains different indoor artificial scenes taken in a controlled environment with pixel-level ground truth generated by structured light. The KITTI datasets are composed of two parts, namely KITTI 2012 and KITTI 2015. Both contain real-world street views of highways and rural areas captured from a driving car under natural conditions, hence there is a large portion of textureless regions in these stereo pairs. Furthermore, radiometric variations often appear in these challenging datasets. Therefore, performing experiments on these two datasets can validate the effectiveness of our method under different conditions.

 

Q10: The details about how the experiments were conducted must be included. I think the authors on page 8, line 20, tried to refer to Table II, not Table 1. Please insert more details to explain Figure 3. Please insert more explanations about the handcrafted feature extraction.

Answer: We have provided more details about how to conduct the experiments in stereo matching and optical flow estimation in the revised version.

For stereo matching on the Middlebury dataset, we compute the initial matching cost using the pixel-based truncated absolute differences of both the color vector and the gradient to measure the proximity of candidate matching pixels. Then typical cost aggregation methods are used to smooth each cost slice. Both the left and right images are used as guidance images to generate the corresponding disparity images. In the refinement step, a left-right consistency check is performed to classify pixels as stable or unstable. The non-local refinement method is then utilized to propagate reliable disparities from stable pixels to unstable ones.
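
As an illustration of the cost measure just described, here is a minimal sketch of a pixel-based truncated absolute-difference cost over color and gradient; the truncation thresholds, the blending weight alpha and the simple np.roll warp are illustrative assumptions, not the exact settings of the manuscript.

```python
import numpy as np

def tad_cost(left, right, d, alpha=0.9, tau_color=0.03, tau_grad=0.01):
    """Truncated absolute differences of color and x-gradient between the
    left image and the right image shifted by disparity d.

    left, right : H x W x 3 float arrays in [0, 1].
    Returns one H x W cost slice for disparity d.
    """
    shifted = np.roll(right, d, axis=1)          # crude warp by d pixels

    color_diff = np.mean(np.abs(left - shifted), axis=2)
    grad_l = np.gradient(np.mean(left, axis=2), axis=1)
    grad_r = np.gradient(np.mean(shifted, axis=2), axis=1)
    grad_diff = np.abs(grad_l - grad_r)

    # Truncation bounds the influence of outliers and occluded pixels.
    return (alpha * np.minimum(color_diff, tau_color)
            + (1 - alpha) * np.minimum(grad_diff, tau_grad))
```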

For stereo matching on the KITTI datasets, we adopt both the Census Transform and the correlation of deep features learned by CNNs to compute the matching cost. The main idea of the Census Transform is to use a string of bits to characterize the pixels in a local window. The details of obtaining the string of bits for each pixel are presented in Eq. (10) in the revised version. Then the Hamming distance is used to measure the similarity of two strings of bits, as shown in Eq. (11). As for features learned by CNNs, we directly use the correlation of the left and right features to evaluate the similarity of matching pixels, as shown in Eq. (12) in the revised version. Then typical cost aggregation is used to smooth each cost slice. Both the left and right images are also used as guidance images to generate the corresponding disparity images. In the refinement step, a left-right consistency check is used to identify mismatched pixels, which are then assigned the lowest disparity value of the spatially closest matched pixels lying on the same scanline. Finally, a weighted median filter is used to remove streak artifacts.
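
Below is a minimal sketch of the Census/Hamming cost and the feature-correlation cost just described, in the spirit of Eqs. (10)-(12) of the revised manuscript but not their exact form; the 5x5 window, the np.roll warp and the function names are illustrative assumptions.

```python
import numpy as np

def census_transform(img, radius=2):
    """Encode each pixel as a bit string: one bit per neighbour in the
    (2*radius+1)^2 window, set when the neighbour is darker than the centre."""
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            bits.append((neighbour < img).astype(np.uint8))
    return np.stack(bits, axis=-1)               # H x W x (window size - 1)

def census_cost(left, right, d):
    """Hamming distance between Census bit strings of candidate matches."""
    cl = census_transform(left)
    cr = census_transform(np.roll(right, d, axis=1))
    return np.sum(cl != cr, axis=-1).astype(float)

def feature_correlation_cost(feat_l, feat_r, d):
    """Negative correlation of learned feature maps (H x W x C): higher
    correlation means a better match, so negate it to obtain a cost."""
    return -np.sum(feat_l * np.roll(feat_r, d, axis=1), axis=-1)
```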

For optical flow estimation, we also use the truncated absolute differences of both the color vector and the gradient to compute the matching cost. The initial flow maps of both the left and right images are generated in exactly the same way as in stereo matching. In order to obtain flow maps with sub-pixel accuracy, we utilize bicubic interpolation with an upscaling factor of 4 to upscale the input images. In the refinement step, the left-right consistency check is used to identify mismatched pixels, which are then assigned the lowest disparity value of the spatially closest matched pixels lying on the same scanline. Finally, a weighted median filter is used to remove streak artifacts.
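
A minimal sketch of the left-right consistency check used in the refinement steps above; the one-pixel tolerance and the function name are illustrative assumptions.

```python
import numpy as np

def left_right_check(disp_left, disp_right, tol=1.0):
    """Mark pixels whose left-view disparity disagrees, by more than `tol`,
    with the disparity of the corresponding pixel in the right view."""
    h, w = disp_left.shape
    xs = np.tile(np.arange(w), (h, 1))
    # Column of the corresponding pixel in the right disparity map.
    xr = np.clip(xs - np.round(disp_left).astype(int), 0, w - 1)
    disp_from_right = disp_right[np.arange(h)[:, None], xr]
    return np.abs(disp_left - disp_from_right) > tol   # True = mismatched
```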

The citation on page 8, line 20 is intended to show that our method generates results competitive with those of GF while having lower computational complexity. Therefore, it should refer to Table 2 (corresponding to Table 1 in the original manuscript) in the revised version.

More details are given to explain Fig. 3. BF and GF assume that all pixels in the support window lie on a single disparity plane. NL and FCGF utilize the MST of the input image to build long-range connections; however, they tend to overuse the piece-wise constant assumption. Thus, BF, GF, NL and FCGF produce a large number of erroneous disparities on slanted surfaces, as indicated by the boxes in the first row of Fig. 3 (b), (c), (d) and (e). Compared with BF and GF, NL and FCGF weaken the constraints from neighboring pixels; hence, they produce better results in regions containing a lot of fine-scale details, as indicated by the boxes in the second row of Fig. 3 (b), (c), (d) and (e). Our method overcomes these shortcomings by taking advantage of recursive filtering and balancing the information propagated along all directions. Thus, our method can successfully preserve fine-scale details in highly textured regions and alleviate the over-smoothing problem on slanted surfaces, as shown in Fig. 3 (f).

All these details and explanations are added in the revised version.

 

Q11: Please explain what happened in Table 3, because in almost all comparisons the proposal did not obtain the best results. A better discussion of the results should be included. For all the results, please demonstrate that the differences are significant enough to establish that one method is better.

Answer: The reason the performance of our method is slightly inferior to that of GF is that only the disparities of pixels near the ground are provided in the ground-truth disparity images of the KITTI dataset. Most of those valid pixels are located in highly textured regions, so GF can make full use of the structure information in local areas to generate high-quality results, while our method is superior to GF in homogeneous regions, as shown in Fig. 4 and Fig. 5 in our revised manuscript.

However, our method generates better results than GF in the other experiments. The reason is that our method utilizes geodesic distances in both the spatial and intensity spaces to measure the similarity of pixels, which helps to alleviate the over-smoothing problem on slanted surfaces and to preserve fine-scale details. Our method outperforms GF when using features learned by CNNs to compute the stereo matching cost on the KITTI 2012 dataset, which means our method can generate better results when robust features are provided. The dynamic foreground objects are removed in the KITTI 2015 dataset, and our method produces the best results among all the tested methods. Furthermore, the computational complexity of our method is lower than that of GF.

 

Q12: Please rewrite the conclusions. The conclusions should be derived from the information presented in the paper's body. Ultimately, it is unclear whether the proposal addressed the contributions explained in Section I.

Answer: We have rewritten the conclusions in the revised version. The conclusions are all derived from our manuscript, and all the contributions listed in Section I are addressed in the revised version. In our new conclusion, we first presented the main contribution of this work, which is the efficient non-local filtering method on two complementary spatial tree structures. Then, the strategy used to efficiently implement our algorithm is given, which corresponds to the second contribution. The conclusion of the comparison of handcrafted features and features learned by CNNs in computing stereo matching cost is presented, which is the third contribution of this work. Finally, we list the future work.

 

Q13: Only a few details regarding English were detected.

Answer: We have read our manuscript carefully and tried our best to improve the English quality. The typos in our manuscript are corrected in the revised version.

Author Response File: Author Response.pdf

Reviewer 3 Report

The paper discusses a filtering method (i.e., a linear time non-local cost aggregation method) on two complementary tree structures. First, I have read the article several times, but it is hard to understand. Second, and quite importantly, I have found an article that is similar to this paper. I believe the paper entitled "Linear Time Edge-aware Filtering on Complementary Tree Structures" may be written by the same authors. If so, the difference needs to be thoroughly elaborated.

 

Given the facts mentioned above, I would like to recommend that the authors rewrite the paper, especially in terms of contribution, by mentioning and discussing the difference with the paper entitled "Linear Time Edge-aware Filtering on Complementary Tree Structures". Some visualizations in the mentioned paper are also interesting; you can add them to this paper as well. Finally, the evaluation metrics need to be explained in more detail, especially their effect when they differ.

Author Response

Q1: The paper discusses a filtering method (i.e., a linear time non-local cost aggregation method) on two complementary tree structures. First, I have read the article several times, but it is hard to understand. Second, and quite importantly, I have found an article that is similar to this paper. I believe the paper entitled "Linear Time Edge-aware Filtering on Complementary Tree Structures" may be written by the same authors. If so, the difference needs to be thoroughly elaborated.

Answer: We have made significant improvements over the original manuscript. We have rewritten the introduction to show the shortcomings of typical cost aggregation methods and then presented the strategies to overcome them. Finally, the motivations of our work are given.

The main contribution of our work is an efficient non-local cost aggregation method on two complementary spatial trees. Fig. 1 illustrates the filtering procedure on these two trees, as shown in Fig. 1 (a) and (b), and more details of the implementation are provided in the caption. The filtering procedure on each tree is implemented in two steps: 1) compute the supports from pixels lying on the same scanlines, which are the row and column for the spatial tree in Fig. 1 (a) (corresponding to Fig. 1 (c)), while they are the diagonal scanlines for the spatial tree in Fig. 1 (b) (corresponding to Fig. 1 (e)); 2) compute the supports from pixels lying on four sub-trees, which are the sub-trees in the four diagonal directions for the spatial tree in Fig. 1 (a) (corresponding to Fig. 1 (d)), while they are the sub-trees in the horizontal and vertical directions for the spatial tree in Fig. 1 (b) (corresponding to Fig. 1 (f)). The ways to compute the outputs of these two steps are given in Section 2.

Moreover, a comparison of typical non-local filtering methods, namely NL, ST, DT and our method, is presented in Table 1 in our revised manuscript. All these elaborations help to make our work easier to understand.

The paper entitled "Linear Time Edge-aware Filtering on Complementary Tree Structures" is another paper by us (it has not been accepted by any conference or journal), which focuses on graphics applications such as edge-aware filtering, image denoising, detail enhancement, stylization and colorization. In that work, we also proposed a distance mapping scheme using a sigmoid function, which enables the proposed smoothing approach to controllably filter out low-amplitude structures while preserving sharp edges. In contrast, the present work mainly focuses on cost aggregation in stereo matching and optical flow estimation. A key difference between using our efficient non-local filtering method in cost aggregation and in graphics applications is that a normalization parameter on each spatial tree is needed for the image processing applications. Hence, there are many differences between our two manuscripts.

 

Q2: Given the facts mentioned above, I would like to recommend that the authors rewrite the paper, especially in terms of contribution, by mentioning and discussing the difference with the paper entitled "Linear Time Edge-aware Filtering on Complementary Tree Structures". Some visualizations in the mentioned paper are also interesting; you can add them to this paper as well. Finally, the evaluation metrics need to be explained in more detail, especially their effect when they differ.

Answer: As we explained in the first question, the manuscript entitled "Linear Time Edge-aware Filtering on Complementary Tree Structures" focuses on graphics applications. A distance mapping scheme relying on a sigmoid function is introduced there to enhance the versatility of our non-local edge-aware filtering method on two spatial trees. In contrast, this manuscript mainly focuses on the performance of our method in cost aggregation, where no normalization is needed. Furthermore, we present a comparison of handcrafted features and deep features learned by CNNs in computing stereo matching cost.

In our manuscript, we use four evaluation metrics to compare the performance of typical cost aggregation methods on the widely-used Middlebury and KITTI datasets, namely the percentages of erroneous pixels in non-occluded and all regions, and the average end-point errors in non-occluded and all regions. As the Middlebury dataset contains close-shot stereo images and the ground truth has pixel-level accuracy, the error threshold is 1 in our experiments. The KITTI datasets contain real-world street views captured from a driving car under natural conditions, and the sparse ground truth is provided by LiDAR, hence we use a threshold of 3 to identify outliers in the disparity image.
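
A minimal sketch of how the two metric families described above can be computed, assuming a boolean validity mask marking pixels that have ground truth (e.g. the sparse LiDAR points on KITTI); the function names are illustrative.

```python
import numpy as np

def bad_pixel_rate(disp_est, disp_gt, valid, threshold=3.0):
    """Percentage of valid pixels whose disparity error exceeds `threshold`
    (1 for Middlebury, 3 for KITTI, as described above)."""
    err = np.abs(disp_est - disp_gt)[valid]
    return 100.0 * float(np.mean(err > threshold))

def average_endpoint_error(disp_est, disp_gt, valid):
    """Mean absolute disparity error over valid pixels."""
    return float(np.mean(np.abs(disp_est - disp_gt)[valid]))
```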

Author Response File: Author Response.pdf

Reviewer 4 Report

The English quality has to be improved. It is difficult to read this paper.


Author Response

Q1: The English quality has to be improved. It is difficult to read this paper.

Answer: We have read our manuscript carefully and tried our best to improve the English quality. All the typos in the manuscript are corrected.

We have made significant improvements over the original manuscript. The introduction was rewritten in the revised manuscript, and the main terms are all clearly explained. In order to make our filtering approach easier to understand, more details are added to the caption of Fig. 1. More discussions about the experimental results are also provided for Fig. 3 and Table 4 (corresponding to Table 3 in the original manuscript). Furthermore, the conclusion is also improved.

In the introduction of our revised manuscript, we first explained the main terms, such as dense correspondence, stereo matching and optical flow estimation. Then, the procedures used to obtain the result are given. A discussion of typical methods shows the shortcomings of these approaches. With these discussions, we presented the motivations of our non-local filters on two complementary spatial trees. In addition, the shortcomings and advantages of features learned by CNNs are provided, which is the motivation for evaluating handcrafted features and deep features learned by CNNs in computing stereo matching cost under typical cost aggregation methods. Finally, we presented the filtering procedures on our spatial tree structures and the conclusions of the experimental results.

In order to make our method easier to understand, we presented more details in Fig. 1, Fig. 2 and Fig. 3. The ways messages are propagated and the details of filtering along our spatial tree structures are provided in the caption of Fig. 1. The meanings of the two kinds of arrows in Fig. 2 are provided to show the artifacts generated by different filtering methods. More explanations about the results in Fig. 3, generated by typical cost aggregation methods on the Middlebury dataset, are elaborated. In order to show the differences among non-local filters, a comparison of typical non-local edge-aware filters, namely NL, ST, DT and our method, is presented in Table 1 in the revised manuscript.

An in-depth explanation of Table 4 in the revised manuscript (corresponding to Table 3 in the original manuscript) is also provided. The reason the results of GF are slightly better than ours on the KITTI 2012 dataset when using handcrafted features to compute the matching cost is that only the disparities of pixels near the ground are provided in the ground-truth disparity image. Most of these valid pixels are located in highly textured regions, so GF can make full use of the structure information in local areas to generate high-quality results, while our method is superior to GF in homogeneous areas, as shown in Fig. 4 and Fig. 5 in the revised manuscript.

All these explanations, details and discussions are added to the revised manuscript, so that our method is easier to understand.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The authors have attended to almost all my suggestions. However, it is recommended to report exactly the locations where the changes were conducted.

Only minor editing of English is required.

Author Response

Q1: The authors have attended to almost all my suggestions. However, it is recommended to report exactly the locations where the changes were conducted.

Answer: The contents that have been changed in the revised version are highlighted in BLUE, which makes it easier for the reviewers to identify the changes in our manuscript.

Q2: Only minor editing of English is required.

Answer: We have read our manuscript carefully and tried our best to improve the language. All the typos in our manuscript are corrected.

Author Response File: Author Response.pdf

Reviewer 4 Report

In this paper the authors propose a linear time non-local cost aggregation method on two complementary spatial tree structures. Two datasets were used to demonstrate that the proposed method outperforms works already published.

The performance of the proposed method is compared using handcrafted and deep feature schemes. For the deep features comparison, I suggest including the reason why PSMNet was selected as the feature extractor.

In general the paper is well written; some corrections of English are required.

Corrections of English are required.

Author Response

Q1: For the deep features comparison, I suggest including the reason why PSMNet was selected as the feature extractor.

Answer: PSMNet combines spatial pyramid pooling and dilated convolutions to enlarge the receptive field. The resulting local and global features aggregate context information at different scales and locations and are widely used in many state-of-the-art methods. Hence, we also use PSMNet to extract deep features in our experiments.

All these explanations are added to the revised manuscript.

Q2: In general the paper is well written, some corrections of English are required.

Answer: We have read our manuscript carefully and tried our best to improve the quality of English.

Author Response File: Author Response.pdf
