Article

NTPP-MVSNet: Multi-View Stereo Network Based on Neighboring Tangent Plane Propagation

by Qi Zhao, Yangyan Deng, Yifan Yang, Yawei Li and Ding Yuan
1 Department of Electronic and Information Engineering, Beihang University, Beijing 100191, China
2 Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
3 School of Astronautics, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(14), 8388; https://doi.org/10.3390/app13148388
Submission received: 30 May 2023 / Revised: 4 July 2023 / Accepted: 17 July 2023 / Published: 20 July 2023
(This article belongs to the Special Issue Intelligent Analysis and Image Recognition)

Abstract

Although learning-based multi-view stereo algorithms have produced exciting results in recent years, few researchers have explored the specific role of depth sampling in these networks. We posit that depth sampling accuracy directly impacts the quality of scene reconstruction. To address this issue, we propose NTPP-MVSNet, which uses the normal vectors and depths of neighboring pixels to propagate tangent planes and, through homography transformation, obtain more accurate depth estimates. We use deformable convolution to acquire surface-continuous neighboring pixel positions and a 3D U-Net to regress depth and normal maps without consuming additional GPU memory. Finally, homography transformation maps the neighborhood surface tangent plane onto the imaging plane to generate depth hypotheses. Experiments on the DTU and Tanks and Temples datasets demonstrate the feasibility of NTPP-MVSNet, and ablation studies confirm the superior performance of our depth sampling methodology.

1. Introduction

Multi-view stereo was introduced by David Marr [1] in 1979 as part of a theoretical framework for computer vision. According to Marr, humans interpret objects by reconstructing a three-dimensional representation of their contours, a process that can be replicated through computational approaches. A number of mathematical models have been proposed over the years to tackle the problem, for instance, PatchMatch Stereo [2] and Colmap [3]. Nevertheless, constrained by the available computational resources, the accuracy and completeness of these models remain limited.
With the advent of convolutional neural networks in computer vision [4,5,6,7], deep learning-based multi-view stereo has made significant strides [8,9,10,11]. Learning-based multi-view stereo algorithms comprise several modular components, including feature extraction, depth sampling, cost volume aggregation [12], and depth map optimization and regression. Accurate depth hypotheses, however, are the foundation for regressing high-precision depth maps. Traditional depth sampling methodologies [13] often resort to dense sampling within a fixed range to obtain sampling points close to the true depth values, a brute-force enumeration that is both time-consuming and memory-intensive. By combining normal vector estimation of the scene with the homography induced by local planes, we propose a more accurate depth sampling method that, when applied within a multi-view stereo matching network, improves the accuracy of scene estimation.
A small number of researchers have proposed algorithmic modules for more efficient depth sampling. One approach processes the image's depth information at multiple scales, progressively narrowing the depth range of each pixel based on lower-scale depth prediction maps [14,15,16]. While this technique reduces the computational cost of dense sampling in high-resolution images by shrinking the depth hypothesis range, it remains an enumeration method and can still be computationally demanding. Another group of researchers proposed using neighborhood information in the image, taking the depth values of the four- or eight-neighborhood to provide depth hypotheses for the corresponding three-dimensional points [17,18,19]. Since the pixel distribution in a depth map can be treated as a Markov random field, adjacent points in the image are correlated; these methods therefore upsample the low-scale depth predictions and use neighborhood depth values to generate reliable depth hypotheses for each pixel [16,20]. However, the success of this depth sampling technique hinges on the assumption that neighboring pixels share equal depth values, implying that the object's surface is parallel to the imaging plane. Real-world scenes often violate this condition, so this sampling method can produce poor depth hypotheses on heavily inclined surfaces. In summary, existing methods either rely on uniform sampling within a fixed range at different resolutions, which is inefficient, or rest on a sampling premise that deviates from the actual scene and is prone to producing incorrect depth hypotheses.
In this paper, we present a multi-view stereo network that leverages neighborhood normal vector information to obtain efficient and robust depth sampling values under realistic conditions, while also improving the computational efficiency and depth map regression accuracy of the network. Our framework takes the neighborhood geometric structure of the current scene as the starting point for the depth sampling module. Through a homography mapping between the neighborhood's tangent plane and the image plane, the depth hypothesis of the current pixel under the tangent plane assumption can be calculated. We also incorporate a neighborhood bias learning module, implemented with deformable convolution, that adaptively selects depth-smooth neighborhood pixels for each pixel. This mechanism prevents neighborhood points at depth discontinuities from contributing erroneous depth hypotheses. Notably, our network achieves higher accuracy with the same number of depth hypotheses and additionally produces normal maps of the current image, which benefit downstream tasks such as model editing and texture mapping.
In summary, the main contributions of our algorithm are as follows:
Firstly, we proposed a novel multi-view stereo network that leverages neighborhood tangent plane cues, which incorporates a carefully designed depth sampling module and a neighborhood bias learning module, providing surface-continuous neighbor pixel positions for each pixel.
Secondly, we introduced a normal estimation module into the multi-view stereo network. The predicted depth map provides input for the normal map calculation, and the generated normal map in turn constrains the detail distribution of the depth map, bringing it closer to the ground truth. Adding this module does not significantly reduce the running efficiency of the algorithm or increase its GPU memory consumption.
Thirdly, we proposed a new depth sampling method based on the estimated normal vectors of the scene. Compared with uniform sampling and sampling based on neighborhood depth propagation, our method yields more accurate depth samples, and its underlying assumption better matches real scenes. Experiments conducted on the DTU and Tanks and Temples datasets demonstrate the potential of our method for various real-world applications.

2. Related Works

In 2018, Yao et al. [13] proposed the MVSNet algorithm, which represents the first end-to-end multi-view stereo network model that applies deep learning to estimate depth. The core of the network is a differentiable homography module that creates a matching cost volume through mapping source images to target images via depth sampling. The cost volume undergoes regularization and soft argmax to produce the final depth map. Subsequent improvements concentrate on two primary aspects.

2.1. Cost Volume Aggregation Optimization

The cost volume regularization using 3D convolution requires too much memory, making it impractical to employ MVSNet in high-resolution scenes. Consequently, numerous researchers have proposed solutions. Yao et al. [21] used the gated recurrent unit (GRU) in a recurrent neural network to iteratively regularize 2D cost volumes along the depth sampling direction. Liu et al. [22] built RED-Net with a recurrent encoder-decoder structure, which reduces memory consumption and computational cost. Wei et al. [23] introduced long short-term memory to build AA-RMVSNet, whose view-wise aggregation module significantly enhances performance on thin and low-texture surfaces. Using edge convolution, Chen et al. [24] proposed the Point-MVSNet algorithm, which follows a coarse-to-fine optimization strategy and uses a PointFlow module to refine the depth residual map on a coarse point cloud in three-dimensional space, thereby improving the target depth map. CVP-MVSNet [14] builds a cost volume pyramid in a coarse-to-fine manner to deliver a lighter network. Cascade-MVSNet, designed by Gu et al. [15], demonstrated that building a feature pyramid optimizes cost volume aggregation.

2.2. Other Optimization

To overcome the low absolute resolution caused by fixed depth hypotheses, Cheng et al. [16] developed an adaptive depth sampling module (ATV) combined with uncertainty estimation, which accurately determines each pixel's depth interval in three-dimensional space to produce more precise depth maps. Xu et al. [25] introduced PVSNet, the first deep learning framework that captures the visibility information of different adjacent views, with the aim of improving multi-view stereo performance on datasets with strong viewpoint changes. In contrast to popular learning-based plane-sweep algorithms that rely on approximately isotropic cost volumes, Luo et al. [26] argued that plane-sweep volumes are inherently anisotropic in space and in the depth direction; to remedy this, they proposed P-MVSNet [26], a multi-view stereo algorithm that employs both isotropic and anisotropic 3D convolutions. Beyond depth map accuracy, efficiency is desirable for multi-view depth estimation in real-world scenarios; Yu et al. [27] therefore created Fast-MVSNet, which uses a sparse-to-dense framework and further optimizes the depth map with a simple and effective Gauss–Newton layer.

3. Methods

3.1. Architecture of the Network

Figure 1 displays the overall architecture of the neighborhood-normal-vector-cue-based multi-view stereo network. As with other techniques, our approach uses a multi-scale feature pyramid network to extract multi-scale feature information from images captured from different viewpoints. The input to the network is $N$ images from varying viewpoints, denoted $\{I_i \mid i = 1, \ldots, N\}$, where $I_1$ is the target image and $\{I_i \mid i = 2, \ldots, N\}$ are the source images from surrounding viewpoints. In addition, we require the intrinsic matrix $K_i$ and extrinsic matrix $[R_i \mid T_i]$ of each image, which can be computed via Structure from Motion (SfM). Our network model has three stages, as indicated in Figure 1, and the feature map dimensions used in the three stages are $\mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 16C}$, $\mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 4C}$, and $\mathbb{R}^{H \times W \times C}$, respectively. The generated feature maps are then fed into the stereo-matching module at each stage, iteratively producing higher-resolution depth predictions.
As shown in Figure 2, the feature maps from the different viewpoints undergo differentiable homography transformations towards the target view for each depth sample, ensuring view consistency across feature vectors. The matching cost across the $N$ views is computed and regularized by a 3D U-Net. Softmax weighting is then applied to obtain the depth output, while the normal vector information is first normalized and then summed with probability weights. Ultimately, the depth map and normal map of the current view are obtained.
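To make the warping step concrete, the following is a minimal PyTorch-style sketch of warping source-view features onto the reference pixel grid for a set of per-pixel depth hypotheses, as is commonly done when building a cost volume. It assumes world-to-camera extrinsics; all tensor names and shapes are illustrative rather than taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def warp_src_feature(src_feat, K_ref, E_ref, K_src, E_src, depth_hyps):
    """Warp source-view features into the reference view for each depth hypothesis.

    src_feat:      [B, C, H, W] source-view feature map
    K_ref, K_src:  [B, 3, 3] intrinsics; E_ref, E_src: [B, 4, 4] world-to-camera extrinsics
    depth_hyps:    [B, D, H, W] per-pixel depth hypotheses in the reference frame
    returns:       [B, C, D, H, W] source features sampled at the warped locations
    """
    B, C, H, W = src_feat.shape
    D = depth_hyps.shape[1]

    # Homogeneous pixel grid of the reference view.
    y, x = torch.meshgrid(
        torch.arange(H, dtype=src_feat.dtype, device=src_feat.device),
        torch.arange(W, dtype=src_feat.dtype, device=src_feat.device),
        indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], dim=0).view(1, 3, -1).expand(B, 3, -1)

    # Back-project reference pixels along their rays and lift them to world coordinates.
    rays = torch.inverse(K_ref) @ pix                                   # [B, 3, H*W]
    cam_ref = rays.unsqueeze(1) * depth_hyps.view(B, D, 1, H * W)       # [B, D, 3, H*W]
    R_ref, t_ref = E_ref[:, :3, :3], E_ref[:, :3, 3:]
    world = torch.inverse(R_ref).unsqueeze(1) @ (cam_ref - t_ref.unsqueeze(1))

    # Project the world points into the source camera.
    R_src, t_src = E_src[:, :3, :3], E_src[:, :3, 3:]
    cam_src = K_src.unsqueeze(1) @ (R_src.unsqueeze(1) @ world + t_src.unsqueeze(1))
    uv = cam_src[:, :, :2] / cam_src[:, :, 2:].clamp(min=1e-6)          # [B, D, 2, H*W]

    # Normalise to [-1, 1] and bilinearly sample the source features.
    grid = torch.stack([2 * uv[:, :, 0] / (W - 1) - 1,
                        2 * uv[:, :, 1] / (H - 1) - 1], dim=-1)         # [B, D, H*W, 2]
    warped = F.grid_sample(src_feat, grid.view(B, D * H, W, 2),
                           mode="bilinear", padding_mode="zeros", align_corners=True)
    return warped.view(B, C, D, H, W)
```

The warped features from all source views can then be aggregated (for example, by feature variance across views) into the cost volume that the 3D U-Net regularizes.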

3.2. Depth Hypothesis

The brute-force enumeration approach illustrated in Figure 3 is typically used by traditional depth sampling methods, which establish a pre-determined range and sample it uniformly. While a dense sampling grid makes it easier to obtain depth values close to the true value, the technique is inefficient: low-density sampling is liable to miss the nearby true value, whereas high-density sampling risks exhausting memory.
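For reference, uniform sampling can be written in a few lines; the depth range below (425 mm to 935 mm, a range commonly used for DTU in MVSNet-style setups) is illustrative only.

```python
import torch

def uniform_depth_hypotheses(d_min, d_max, num_samples, height, width):
    """Brute-force uniform sampling: the same D depth values for every pixel.

    Returns a [D, H, W] tensor of depth hypotheses; the best achievable accuracy is
    bounded by the step (d_max - d_min) / (D - 1), so fine steps quickly become costly.
    """
    depths = torch.linspace(d_min, d_max, num_samples)          # [D]
    return depths.view(-1, 1, 1).expand(-1, height, width)      # [D, H, W]

# Example: 192 hypotheses between 425 mm and 935 mm for a 512 x 640 depth map.
hyps = uniform_depth_hypotheses(425.0, 935.0, 192, 512, 640)
```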
As demonstrated in Figure 4, this style of depth sampling confers on each pixel several highly reliable depth hypotheses by leveraging neighborhood results acquired at lower scales [17,18,19,28]. With the same or fewer depth samples, the disparity between the sampled and actual depth values is smaller. Nevertheless, the method rests on an idealized premise: that the imaged surface is parallel to the camera imaging plane. When the surface is significantly tilted relative to the imaging plane, or even nearly perpendicular to it, large errors arise relative to the true depth of the current point.
To address these issues, we propose a novel depth sampling method that operates on neighborhood tangent plane information, under the assumption that the 3D point corresponding to the current pixel lies on the tangent plane of its neighborhood. As presented in Figure 5, homography transformation is employed to deduce the spatial position of the current pixel from the known coarse depths and normals of its neighbors, which model the neighborhood's tangent plane. The first step is to estimate the normal vector information of the image; we then obtain relatively surface-continuous neighborhood locations and compute the depth hypotheses through homography transformation.

3.2.1. Normal Estimation

In general, normal estimation is not a primary objective of multi-view stereo networks; the emphasis is instead on the accuracy and completeness of the spatial position computed for each pixel. As a result, such networks often fail to fully consider the geometric properties of the resulting 3D model. Related studies, such as Monodepth, have addressed this by augmenting the loss function with smoothing terms to encourage coherent variation of the surface depth. By incorporating a smoothing constraint into the loss function, noise and discontinuities in the depth can be mitigated while detailed information is preserved:
$Loss_{smooth} = \left| \partial_x d_t \right| e^{-\left| \partial_x I_t \right|} + \left| \partial_y d_t \right| e^{-\left| \partial_y I_t \right|}.$
It has been observed, however, that the depth of a pixel and the depths of its neighbors frequently follow a linear rather than an identity relationship; enforcing a strong consistency constraint on the neighborhood depths therefore contradicts the observed behavior of real surfaces.
In this study, we propose estimating surface normal vectors within the multi-view stereo algorithm. The surface normals furnish structural cues for depth estimation, so the normal feature volume is optimized simultaneously with the regression of the depth map. Jointly estimating depth and surface normals yields greater accuracy and robustness in the 3D reconstruction while also promoting a better understanding of the scene geometry.
Related works, such as NNet [29], consume a large amount of memory. Moreover, although normal and depth information are logically linked, that method does not reuse the depth cost aggregation process but instead trains the normal vector map from the originally extracted feature maps, resulting in lower efficiency.
As shown in Figure 6, we propose producing the depth prediction and the normal vector prediction simultaneously from the cost aggregation. Following conventional multi-view stereo networks, a 3D U-Net structure is used to regularize the aggregated cost and suppress the effect of individual outliers in the matching. To obtain depth maps and normals at the same time, a convolutional layer is appended to the end of the 3D U-Net to transform the dimensions of the output cost volume, so that the final cost volume $V$ lies in $\mathbb{R}^{H \times W \times D \times 3}$. The regression formulas for the depth map are given below, with $L$ indicating the stage level.
$P_{(i,j)}^{L}(d_m) = \dfrac{ e^{V(i,j,m,0)} }{ \sum_{m'=0}^{M-1} e^{V(i,j,m',0)} }$
$D^{L}(p) = \sum_{m=0}^{M-1} d_m \times P_{p}^{L}(d_m), \quad \text{where } d_m \in \{ d_i \mid i = 1, \ldots, M \}.$
The confidence of each depth sample likewise governs the confidence of the normal vector computation. Since a normal vector has two degrees of freedom, the last two channels of the volume are used to compute the normal under each depth sample. In line with the depth regression, the normal estimation map for the target viewpoint is obtained by confidence-weighted averaging followed by re-normalization. The calculation is given below.
$n^{L}(p) = \sum_{m=0}^{M-1} n_m \times P_{p}^{L}(n_m), \quad \text{where } n_m \in \{ n_i \mid i = 1, \ldots, M \}.$
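A minimal sketch of this regression step is given below, assuming the regularized volume is stored as a [B, 3, D, H, W] tensor; the interpretation of the two normal channels as spherical angles is our assumption for illustration, not necessarily the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

def regress_depth_and_normal(volume, depth_hyps):
    """Regress a depth map and a normal map from the regularised cost volume.

    volume:     [B, 3, D, H, W] 3D U-Net output; channel 0 scores each depth hypothesis,
                channels 1-2 carry the two normal degrees of freedom (treated here, as an
                assumption, as azimuth/elevation angles).
    depth_hyps: [B, D, H, W] per-pixel depth hypotheses d_m
    """
    prob = F.softmax(volume[:, 0], dim=1)                        # P^L_(i,j)(d_m), [B, D, H, W]
    depth = torch.sum(depth_hyps * prob, dim=1)                  # D^L(p) = sum_m d_m P(d_m)

    # Per-hypothesis normals from the two remaining channels (2 DoF -> unit vector).
    theta, phi = volume[:, 1], volume[:, 2]                      # [B, D, H, W] each
    n_m = torch.stack([torch.sin(theta) * torch.cos(phi),
                       torch.sin(theta) * torch.sin(phi),
                       torch.cos(theta)], dim=1)                 # [B, 3, D, H, W]

    # Probability-weighted sum over hypotheses, then re-normalisation to unit length.
    normal = torch.sum(n_m * prob.unsqueeze(1), dim=2)           # [B, 3, H, W]
    normal = F.normalize(normal, dim=1)
    return depth, normal
```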

3.2.2. Neighboring Pixels

Our depth sampling approach relies heavily on the spatial disposition of neighboring pixels and requires them to share a similar tangent plane with the current pixel in 3D space, as shown in Figure 5. The algorithm therefore needs the surface around the neighboring points to remain continuous with the current pixel; otherwise, significant discrepancies can arise between the sampled and actual values. To tackle this challenge, we incorporate the reference image feature map into the model. As shown in Figure 7, deformable convolution is used to learn the position offsets of the neighboring pixels, which are added to the original neighborhood positions to obtain spatially continuous neighborhood locations. This enables depth sampling based on neighborhood tangent plane propagation.
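The sketch below illustrates the neighborhood-bias idea with a plain offset-prediction branch and bilinear sampling; the actual model uses deformable convolution, and all module names, tensor shapes, and the fixed 4-neighborhood starting layout are hypothetical choices made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborOffsetSampler(nn.Module):
    """Predict per-pixel neighbour offsets from the reference feature map (as a
    deformable-convolution-style offset branch would) and use them to gather
    surface-continuous neighbour depths and normals."""

    def __init__(self, feat_channels, num_neighbors=4):
        super().__init__()
        self.num_neighbors = num_neighbors
        # One (dx, dy) offset per neighbour, predicted from the reference features.
        self.offset_conv = nn.Conv2d(feat_channels, 2 * num_neighbors, 3, padding=1)
        # Fixed 4-neighbourhood (left, right, up, down) as the starting layout.
        base = torch.tensor([[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0]])
        self.register_buffer("base_offsets", base[:num_neighbors])

    def forward(self, ref_feat, depth, normal):
        """ref_feat: [B, C, H, W]; depth: [B, 1, H, W]; normal: [B, 3, H, W].
        Returns neighbour depths [B, K, H, W] and neighbour normals [B, K, 3, H, W]."""
        B, _, H, W = ref_feat.shape
        K = self.num_neighbors

        offsets = self.offset_conv(ref_feat).view(B, K, 2, H, W)        # learned bias
        y, x = torch.meshgrid(
            torch.arange(H, device=depth.device, dtype=depth.dtype),
            torch.arange(W, device=depth.device, dtype=depth.dtype),
            indexing="ij")
        base = self.base_offsets.view(1, K, 2, 1, 1)
        nx = x + base[:, :, 0] + offsets[:, :, 0]                       # [B, K, H, W]
        ny = y + base[:, :, 1] + offsets[:, :, 1]

        # Normalise to [-1, 1] and bilinearly sample neighbour depths / normals.
        grid = torch.stack([2 * nx / (W - 1) - 1, 2 * ny / (H - 1) - 1], dim=-1)
        grid = grid.view(B, K * H, W, 2)
        nb_depth = F.grid_sample(depth, grid, align_corners=True).view(B, K, H, W)
        nb_normal = F.grid_sample(normal, grid, align_corners=True).view(B, 3, K, H, W)
        return nb_depth, nb_normal.permute(0, 2, 1, 3, 4)
```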

3.2.3. Propagation

After acquiring the low-resolution depth and surface normal prediction maps $D^l, N^l$ ($l < 3$), the next step is to determine the neighborhood tangent plane information on these maps and then compute depth hypothesis values via homographic transformation. To this end, the tangent plane, defined on the object's surface, must be projected onto the image plane.
To accomplish the scale transformation, the depth and surface normal predictions are first upsampled, typically with bilinear interpolation. Then, based on the pinhole imaging principle, every pixel is re-projected into the world coordinate system using the formula below.
$n_{(i,j)}^{T}\, p = n_{(i,j)}^{T} \left( (KR)^{-1} \begin{bmatrix} i\, d_{(i,j)} \\ j\, d_{(i,j)} \\ d_{(i,j)} \end{bmatrix} - R^{-1} t \right).$
Here, $K$ is the camera intrinsic matrix, and $R$ and $t$ are the rotation matrix and translation vector transforming from the world coordinate system to the camera coordinate system. $\begin{bmatrix} X_{(i,j)} & Y_{(i,j)} & Z_{(i,j)} & 1 \end{bmatrix}^{T}$ denotes the homogeneous coordinates of the 3D point corresponding to pixel $(i, j)$.
Based on the surface normal prediction value of pixel ( i , j ) output by the regularization module, it is possible to estimate the tangent plane in space that corresponds to this point. Specifically, the calculation formula for this is as follows:
$\begin{bmatrix} n_{(i,j)}^{T} & -n_{(i,j)}^{T} \begin{bmatrix} X_{(i,j)} \\ Y_{(i,j)} \\ Z_{(i,j)} \end{bmatrix} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = 0.$
According to the prior assumption, we approximate the tangent plane at a neighboring pixel. Analogously, the equation of the tangent plane in space for the neighboring point $(\Delta i, \Delta j)$ under the predicted normal $n_{(\Delta i, \Delta j)}$ is as follows:
$\begin{bmatrix} n_{(\Delta i, \Delta j)}^{T} & -n_{(\Delta i, \Delta j)}^{T} \begin{bmatrix} X_{(\Delta i, \Delta j)} \\ Y_{(\Delta i, \Delta j)} \\ Z_{(\Delta i, \Delta j)} \end{bmatrix} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = 0.$
Substituting the above pinhole imaging equation, we obtain the following formula:
$\begin{bmatrix} n_{(\Delta i, \Delta j)}^{T} & -n_{(\Delta i, \Delta j)}^{T} \left( (KR)^{-1} \begin{bmatrix} \Delta i\, \hat{d}_{(\Delta i, \Delta j)} \\ \Delta j\, \hat{d}_{(\Delta i, \Delta j)} \\ \hat{d}_{(\Delta i, \Delta j)} \end{bmatrix} - R^{-1} t \right) \end{bmatrix} \begin{bmatrix} KR & Kt \\ \mathbf{0} & 1 \end{bmatrix}^{-1} \begin{bmatrix} i\, \tilde{d}_{(i,j)} \\ j\, \tilde{d}_{(i,j)} \\ \tilde{d}_{(i,j)} \\ 1 \end{bmatrix} = 0.$
After rearranging the formula, the final depth hypothesis for pixel ( i , j ) is:
$\tilde{d}_{(i,j)} = \dfrac{ n_{(\Delta i, \Delta j)}^{T} (KR)^{-1} \begin{bmatrix} \Delta i \\ \Delta j \\ 1 \end{bmatrix} \hat{d}_{(\Delta i, \Delta j)} }{ n_{(\Delta i, \Delta j)}^{T} (KR)^{-1} \begin{bmatrix} i \\ j \\ 1 \end{bmatrix} }.$
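The final formula translates directly into code. The sketch below evaluates it for a batch of pixel/neighbor pairs; the function and variable names and the degenerate-denominator guard are our own assumptions for illustration.

```python
import torch

def propagate_depth_from_neighbor(K, R, nb_normal, nb_depth, nb_pix, pix):
    """Depth hypothesis at pixel (i, j) induced by the tangent plane of a neighbour.

    K, R:      [3, 3] camera intrinsics and world-to-camera rotation
    nb_normal: [N, 3] neighbour normals n_(Δi,Δj)
    nb_depth:  [N]    neighbour depths  d̂_(Δi,Δj)
    nb_pix:    [N, 2] neighbour pixel coordinates (Δi, Δj)
    pix:       [N, 2] current pixel coordinates (i, j)
    returns:   [N]    propagated depth hypotheses d̃_(i,j)
    """
    KR_inv = torch.inverse(K @ R)                                     # (KR)^(-1)

    ones = torch.ones(nb_pix.shape[0], 1, dtype=nb_pix.dtype, device=nb_pix.device)
    nb_h = torch.cat([nb_pix, ones], dim=1)                           # [Δi, Δj, 1]
    cur_h = torch.cat([pix, ones], dim=1)                             # [i, j, 1]

    # Numerator:   n^T (KR)^(-1) [Δi, Δj, 1]^T * d̂   (neighbour point on its tangent plane)
    # Denominator: n^T (KR)^(-1) [i, j, 1]^T          (current ray intersecting that plane)
    num = (nb_normal * (nb_h @ KR_inv.T)).sum(dim=1) * nb_depth
    den = (nb_normal * (cur_h @ KR_inv.T)).sum(dim=1)
    den = torch.where(den.abs() < 1e-8, torch.full_like(den, 1e-8), den)  # avoid division by ~0
    return num / den
```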

3.3. Loss Function

The loss function used in this study comprises two components. The first is a smooth L1 loss between the predicted depth values and the ground-truth depth values. The second is a direction-consistency loss that compares the predicted normal vectors against the ground-truth normals, using the cosine distance to measure the disparity between them.
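A minimal sketch of such a two-part loss is shown below; the weighting between the two terms is an assumed hyper-parameter, since the text does not state one.

```python
import torch
import torch.nn.functional as F

def mvs_loss(pred_depth, gt_depth, pred_normal, gt_normal, mask, normal_weight=1.0):
    """Smooth-L1 depth loss plus a direction-consistency (cosine-distance) normal loss.

    pred_depth, gt_depth:   [B, H, W]
    pred_normal, gt_normal: [B, 3, H, W]
    mask:                   [B, H, W] boolean validity mask for pixels with ground truth
    """
    depth_loss = F.smooth_l1_loss(pred_depth[mask], gt_depth[mask])

    # Cosine distance 1 - cos(angle) between predicted and ground-truth normals.
    cos = F.cosine_similarity(pred_normal, gt_normal, dim=1)          # [B, H, W]
    normal_loss = (1.0 - cos)[mask].mean()

    return depth_loss + normal_weight * normal_loss
```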

4. Experiments

In this section, the results of the conducted experiments are presented. First, the dataset used to train and test the multi-view stereo network is described. Next, the relevant evaluation metrics and their calculation methods are introduced. Lastly, the experimental findings are presented and evaluated.

4.1. Dataset

The DTU dataset [30], as shown in Figure 8, is an indoor multi-view stereo dataset of small objects published in 2016. It was created by capturing multiple views of each object with an industrial robot arm under adjustable lighting, with every view precisely controlled so that camera intrinsic and extrinsic parameters are known. The dataset comprises 124 diverse scenes. Each scene includes 49 views, each captured at seven distinct brightness levels, giving 343 images per scene at a resolution of 1600 × 1200. To improve training efficiency, the images were downsampled to 640 × 512 during training. The training set comprises 79 scenes with a total of 27,097 images and corresponding ground-truth depth data; the validation set consists of 18 scenes with 6174 images, and the test set of 22 scenes with 7546 images, each with corresponding ground-truth depth. To obtain ground-truth normals for supervising the normal estimation module, we computed the k nearest neighboring points in space from the ground-truth depth, yielding 33,271 normal vector maps.
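As an illustration of how ground-truth normals can be derived from ground-truth depth, the sketch below fits a local plane to each pixel's back-projected neighbors; it uses a fixed window in place of a true k-nearest-neighbor search, so it is only an approximation of the procedure described above, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth, K, window=3):
    """Approximate normals from a depth map: back-project each pixel with intrinsics K
    and fit a plane to the 3D points in a small window around it.

    depth: [H, W] depth map; K: [3, 3] intrinsics; returns [H, W, 3] unit normals.
    """
    H, W = depth.shape
    y, x = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], dim=-1)               # [H, W, 3]
    pts = (pix @ torch.inverse(K).T) * depth.unsqueeze(-1)              # camera-space points

    # Gather every pixel's neighbouring 3D points inside the window.
    pad = window // 2
    pts_pad = F.pad(pts.permute(2, 0, 1).unsqueeze(0), (pad, pad, pad, pad),
                    mode="replicate").squeeze(0).permute(1, 2, 0)       # [H+2p, W+2p, 3]
    nbrs = torch.stack([pts_pad[dy:dy + H, dx:dx + W]
                        for dy in range(window) for dx in range(window)], dim=2)

    # Plane fit: the normal is the eigenvector of the local covariance matrix
    # associated with the smallest eigenvalue.
    centered = nbrs - nbrs.mean(dim=2, keepdim=True)                    # [H, W, k, 3]
    cov = centered.transpose(-1, -2) @ centered                         # [H, W, 3, 3]
    _, eigvecs = torch.linalg.eigh(cov)                                 # ascending eigenvalues
    normal = eigvecs[..., 0]                                            # [H, W, 3]

    # Orient all normals towards the camera (negative z in camera coordinates).
    flip = torch.where(normal[..., 2:] > 0, -torch.ones_like(normal[..., 2:]),
                       torch.ones_like(normal[..., 2:]))
    return normal * flip
```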
The Tanks and Temples dataset [31], as shown in Figure 9, contains photographs of structures such as tanks and temples, as well as vehicles, for 3D reconstruction. Compared to the DTU dataset, it poses greater challenges due to lighting changes and dynamic targets in outdoor scenes, demanding higher robustness and accuracy from algorithms. The dataset also includes the camera trajectories and the ground-truth coordinates of the buildings, which can be used to evaluate the reconstruction accuracy of algorithms.

4.2. Evaluation Metrics

To evaluate the reconstruction quality of our model, the accuracy and completeness of the predicted point cloud must be measured against the ground-truth point cloud. Completeness is defined as the shortest distance from each ground-truth point to its nearest predicted point, while accuracy is the shortest distance from each predicted point to its nearest ground-truth point. The F1 score, which considers both completeness and accuracy, provides an overall summary of the reconstruction quality.
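The sketch below computes these quantities for two point clouds. It omits the observability masks and outlier filtering of the official DTU protocol, and the threshold-based F-score follows the Tanks and Temples convention, whereas the DTU overall figure reported in Table 1 averages accuracy and completeness in millimetres.

```python
import torch

def point_cloud_metrics(pred, gt, threshold=2.0):
    """Chamfer-style evaluation of a reconstructed point cloud (simplified sketch).

    pred: [N, 3] predicted points; gt: [M, 3] ground-truth points; threshold in scene units.
    """
    # Accuracy: distance from each predicted point to its nearest ground-truth point.
    d_pred_to_gt = torch.cdist(pred, gt).min(dim=1).values              # [N]
    # Completeness: distance from each ground-truth point to its nearest predicted point.
    d_gt_to_pred = torch.cdist(gt, pred).min(dim=1).values              # [M]

    accuracy = d_pred_to_gt.mean()
    completeness = d_gt_to_pred.mean()

    # Threshold-based precision/recall combined into an F-score (Tanks and Temples style).
    precision = (d_pred_to_gt < threshold).float().mean()
    recall = (d_gt_to_pred < threshold).float().mean()
    fscore = 2 * precision * recall / (precision + recall).clamp_min(1e-8)
    return accuracy, completeness, fscore
```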

4.3. Implementation Details

The present study used the open-source DTU dataset for both model training and testing. Uniform depth sampling with [64, 24, 4] hypotheses was employed across the three stages; in addition, the second and third stages used eight and four sampling values, respectively, based on neighborhood normal information. Optimization used the Adam algorithm with a weight decay coefficient of 0.0 and beta parameters of (0.9, 0.999). A customized learning rate schedule was adopted, with a warm-up in which the learning rate increased linearly from 0 to a maximum of 0.0016 over the first 500 training batches; after the 20th epoch, the learning rate was reduced to 0.625 times its previous value every ten epochs. Training used five RTX 3090 GPUs with a batch size of 30, and each target image was paired with source images from two adjacent views.
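For clarity, the learning-rate schedule described above can be expressed as a small function; the exact boundary handling is our assumption, since the text only gives the constants.

```python
def learning_rate(step, epoch, max_lr=0.0016, warmup_steps=500,
                  decay_start_epoch=20, decay_every=10, decay_factor=0.625):
    """Linear warm-up from 0 to max_lr over the first 500 training batches, then a
    reduction to 0.625x every ten epochs after the 20th epoch."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if epoch <= decay_start_epoch:
        return max_lr
    num_decays = (epoch - decay_start_epoch + decay_every - 1) // decay_every
    return max_lr * decay_factor ** num_decays
```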

4.4. Results

The proposed method was evaluated with the official evaluation metrics on the DTU test set. During evaluation, five source images were used at an input resolution of 1600 × 1184. As indicated in Table 1, a comparison with several other stereo algorithms shows a commendable level of accuracy and completeness: the F score of our result ranks first among the compared algorithms, with high completeness and precision. Moreover, the reconstruction capability of our algorithm is showcased in Figure 10, where an accurate and detailed 3D model was generated. This outcome is largely attributable to the proximity of the per-point depth to the ground truth and to accurate pixel positioning within the images; each spatial 3D point is assigned the RGB value of its corresponding pixel, further refining the model.
Further evaluation of our method was conducted via a comparison with other learning-based methods, with a focus on runtime and GPU memory consumption. Table 2 effectively illustrates that the inclusion of normal estimation as an additional task did not exert significant pressure on either memory or time consumption. Compared with other algorithms, our time consumption ranks second, and while adding an additional task, our GPU memory consumption is not the highest among the comparison algorithms.
Notably, several normal maps were generated during the experiment to showcase the algorithm’s adroitness regarding the interpretation of surface structures, as illustrated in Figure 11. These outcomes confer further evidence of the aptitude of our method in capturing pertinent information essential for effective modeling. It is worthwhile to note that the resultant normal maps could be combined with the depth point cloud for further optimization and integration, which is an area that warrants further inquiry.
To furnish conclusive evidence of the generalization capabilities of our method, the model was subject to extensive testing on the Tanks and Temples dataset. The evaluation was conducted in a comprehensive manner, with a focus on image resolution and the number of source images employed, which were 1080 × 1920 and five, respectively. Table 3 displays the final F1 score comparison results, which effectively convey the generalization potential of our method. Additionally, some illustrative qualitative visualizations of the resultant models are depicted in Figure 12, highlighting the robust reconstructive abilities of our algorithm, particularly in the context of large-scale outdoor scenes.
The proposed method is more suitable for scenes with many inclined planes, as demonstrated by our theoretical analysis, which is further supported by the experimental results. Specifically, the scene reconstruction for the “M60”, “Playground”, and “Train” datasets yielded better results, as indicated by the larger corresponding indices in the table. These results further validate our approach as a promising method for effective model construction.

4.5. Ablation Study

We tested the training results of different depth sampling combinations at different stages of the algorithm. Figure 13 and Figure 14 show the convergence of the accuracy indicators on our validation set. The results show that: (1) with the same number of depth samples and the same network structure and tuning, using our method at the large-scale stage captures local details better than the baseline, which uses only uniformly sampled depth hypotheses; (2) when the number of neighborhood-based depth samples in the second stage was increased from 0 to 8, the error within the 2 mm threshold on the validation set decreased, indicating that our method brings the prediction closer to the ground-truth point cloud within the 2 mm error range than the other configurations.
We further calculated the accuracy and completeness of the above models on the DTU test set; the results, shown in Table 4, further demonstrate the superiority of our proposed method.

5. Conclusions and Future Work

To address the insufficient accuracy of scene details in multi-view stereo tasks, we proposed NTPP-MVSNet. Two modules make NTPP-MVSNet efficient and highly accurate. Firstly, we designed a depth sampling module that brings the depth hypotheses closer to the ground truth. This novel module takes advantage of neighborhood tangent plane propagation, guiding NTPP-MVSNet to understand scene information from the geometric structure and obtain more accurate depth predictions. Secondly, we introduced a normal estimation module to obtain the neighborhood tangent planes. The normal estimation module also plays an important role in the mutual optimization of the depth prediction map and the normal estimation map: the cost volume of the depth prediction provides input for the normal estimation, and the normal estimation map in turn provides geometric structural hints for the depth estimation.
Extensive experiments were conducted to examine the effectiveness of NTPP-MVSNet. The results show that its F score ranks first among the compared methods on the DTU dataset. The ablation experiments and the visualizations of the results demonstrate the high reconstruction accuracy of NTPP-MVSNet for local details, as shown in Figure 10. Future work may focus on improving completeness in low-overlap outdoor areas.

Author Contributions

Conceptualization, Q.Z.; methodology, Y.D. and Q.Z.; software, Y.D. and Y.L.; validation, Y.D.; formal analysis, Y.Y. and Q.Z.; writing—original draft preparation, D.Y. and Y.Y.; writing—review and editing, D.Y. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the National Natural Science Foundation of China (Nos. 61972015 and 62002005).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We mainly used the DTU dataset for multi-view stereo. The dataset's official website is http://roboimagedata.compute.dtu.dk.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Marr, D.; Poggio, T. A computational theory of human stereo vision. Proc. R. Soc. Lond. Ser. B Biol. Sci. 1979, 204, 301–328. [Google Scholar]
  2. Bleyer, M.; Rhemann, C.; Rother, C. Patchmatch stereo-stereo matching with slanted support windows. Bmvc 2011, 11, 1–11. [Google Scholar]
  3. Schonberger, J.L.; Frahm, J.M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113. [Google Scholar]
  4. Ali, R.; Hardie, R.C.; Narayanan, B.N.; Kebede, T.M. IMNets: Deep learning using an incremental modular network synthesis approach for medical imaging applications. Appl. Sci. 2022, 12, 5500. [Google Scholar] [CrossRef]
  5. Mohammadpour, L.; Ling, T.C.; Liew, C.S.; Aryanfar, A. A survey of CNN-based network intrusion detection. Appl. Sci. 2022, 12, 8162. [Google Scholar] [CrossRef]
  6. Al-onazi, B.B.; Nauman, M.A.; Jahangir, R.; Malik, M.M.; Alkhammash, E.H.; Elshewey, A.M. Transformer-based multilingual speech emotion recognition using data augmentation and feature fusion. Appl. Sci. 2022, 12, 9188. [Google Scholar] [CrossRef]
  7. Gu, Y.; Piao, Z.; Yoo, S.J. STHarDNet: Swin transformer with HarDNet for MRI segmentation. Appl. Sci. 2022, 12, 468. [Google Scholar] [CrossRef]
  8. Choy, C.B.; Xu, D.; Gwak, J.; Chen, K.; Savarese, S. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 628–644. [Google Scholar]
  9. Murez, Z.; Van As, T.; Bartolozzi, J.; Sinha, A.; Badrinarayanan, V.; Rabinovich, A. Atlas: End-to-end 3d scene reconstruction from posed images. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 414–431. [Google Scholar]
  10. Sun, J.; Xie, Y.; Chen, L.; Zhou, X.; Bao, H. NeuralRecon: Real-time coherent 3D reconstruction from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Montreal, BC, Canada, 11–17 October 2021; pp. 15598–15607. [Google Scholar]
  11. Bozic, A.; Palafox, P.; Thies, J.; Dai, A.; Nießner, M. Transformerfusion: Monocular rgb scene reconstruction using transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 1403–1414. [Google Scholar]
  12. Wang, S.; Li, B.; Dai, Y. Efficient multi-view stereo by iterative dynamic cost volume. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8655–8664. [Google Scholar]
  13. Yao, Y.; Luo, Z.; Li, S.; Fang, T.; Quan, L. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 767–783. [Google Scholar]
  14. Yang, J.; Mao, W.; Alvarez, J.M.; Liu, M. Cost volume pyramid based depth inference for multi-view stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4877–4886. [Google Scholar]
  15. Gu, X.; Fan, Z.; Zhu, S.; Dai, Z.; Tan, F.; Tan, P. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2495–2504. [Google Scholar]
  16. Cheng, S.; Xu, Z.; Zhu, S.; Li, Z.; Li, L.E.; Ramamoorthi, R.; Su, H. Deep stereo using adaptive thin volume representation with uncertainty awareness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2524–2534. [Google Scholar]
  17. Xu, Q.; Tao, W. Planar prior assisted patchmatch multi-view stereo. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 9–11 February 2020; Volume 34, pp. 12516–12523. [Google Scholar]
  18. Wang, F.; Galliani, S.; Vogel, C.; Speciale, P.; Pollefeys, M. Patchmatchnet: Learned multi-view patchmatch stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Montreal, BC, Canada, 11–17 October 2021; pp. 14194–14203. [Google Scholar]
  19. Lee, J.Y.; DeGol, J.; Zou, C.; Hoiem, D. Patchmatch-rl: Deep mvs with pixelwise depth, normal, and visibility. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6158–6167. [Google Scholar]
  20. Yang, J.; Alvarez, J.M.; Liu, M. Self-supervised learning of depth inference for multi-view stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Montreal, BC, Canada, 11–17 October 2021; pp. 7526–7534. [Google Scholar]
  21. Yao, Y.; Luo, Z.; Li, S.; Shen, T.; Fang, T.; Quan, L. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5525–5534. [Google Scholar]
  22. Liu, J.; Ji, S. A novel recurrent encoder-decoder structure for large-scale multi-view stereo reconstruction from an open aerial dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6050–6059. [Google Scholar]
  23. Wei, Z.; Zhu, Q.; Min, C.; Chen, Y.; Wang, G. Aa-rmvsnet: Adaptive aggregation recurrent multi-view stereo network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6187–6196. [Google Scholar]
  24. Chen, R.; Han, S.; Xu, J.; Su, H. Point-based multi-view stereo network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1538–1547. [Google Scholar]
  25. Xu, Q.; Tao, W. Learning inverse depth regression for multi-view stereo with correlation cost volume. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 9–11 February 2020; Volume 34, pp. 12508–12515. [Google Scholar]
  26. Luo, K.; Guan, T.; Ju, L.; Huang, H.; Luo, Y. P-mvsnet: Learning patch-wise matching confidence aggregation for multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10452–10461. [Google Scholar]
  27. Yu, Z.; Gao, S. Fast-mvsnet: Sparse-to-dense multi-view stereo with learned propagation and gauss-newton refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1949–1958. [Google Scholar]
  28. Xu, Q.; Tao, W. Multi-view stereo with asymmetric checkerboard propagation and multi-hypothesis joint view selection. arXiv 2018, arXiv:1805.07920. [Google Scholar]
  29. Kusupati, U.; Cheng, S.; Chen, R.; Su, H. Normal assisted stereo depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2189–2199. [Google Scholar]
  30. Aanæs, H.; Jensen, R.R.; Vogiatzis, G.; Tola, E.; Dahl, A.B. Large-scale data for multiple-view stereopsis. Int. J. Comput. Vis. 2016, 120, 153–168. [Google Scholar] [CrossRef] [Green Version]
  31. Knapitsch, A.; Park, J.; Zhou, Q.Y.; Koltun, V. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Trans. Graph. (ToG) 2017, 36, 78. [Google Scholar] [CrossRef]
  32. Tola, E.; Strecha, C.; Fua, P. Efficient large-scale multi-view stereo for ultra high-resolution image sets. Mach. Vis. Appl. 2012, 23, 903–920. [Google Scholar] [CrossRef] [Green Version]
  33. Galliani, S.; Lasinger, K.; Schindler, K. Massively parallel multiview stereopsis by surface normal diffusion. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 873–881. [Google Scholar]
  34. Yan, J.; Wei, Z.; Yi, H.; Ding, M.; Zhang, R.; Chen, Y.; Wang, G.; Tai, Y.W. Dense hybrid recurrent multi-view stereo net with dynamic consistency checking. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IV. Springer: Berlin/Heidelberg, Germany, 2020; pp. 674–689. [Google Scholar]
  35. Zhang, J.; Li, S.; Luo, Z.; Fang, T.; Yao, Y. Vis-MVSNet: Visibility-Aware Multi-view Stereo Network. Int. J. Comput. Vis. 2023, 131, 199–214. [Google Scholar] [CrossRef]
Figure 1. Overview of our proposed method. Multi-scale feature information is acquired by employing the target view image as well as the source view images as the input into the FPN network. The MVS estimation model undergoes three stages of iterative optimization, leading to the ultimate acquisition of high-resolution depth map and normal map information.
Figure 2. In the proposed network model, a differentiable homography transformation is first applied to the features of the various viewpoints while several depth hypotheses are acquired via depth sampling. After mapping to the target view, the cost volume is computed and regularized via a 3D U-Net. Finally, the depth map and normal information are predicted from the channel-specific cost volume.
Figure 3. Uniform sampling: this approach's efficiency is comparatively poor, as it demands a high sampling density to obtain dependable depth hypotheses, which is time-consuming. $\{Z_i \mid i = 1, \ldots, 4\}$ are four hypothetical depth values after uniform sampling; the error between $Z_2$, the depth sample closest to point A, and the true depth of point A is $E_2$. Of the three strategies, the error between assumed and actual values is largest under this methodology.
Figure 4. Sampling based on neighborhood depth information propagation: sampling efficiency can be enhanced to a certain extent. Ideally, the depth of the current point and that of its neighbors should be consistent, meaning the local surface patch is parallel to the camera plane. $\{Z_i \mid i = 1, \ldots, 4\}$ are four hypothetical depth values after uniform sampling, and $Z_N$ is the depth of the neighborhood point N. The error between $Z_N$, the depth sample closest to point A, and the true depth of point A is $E_N$; the error between assumed and real values is relatively small in this case.
Figure 5. Sampling based on neighborhood tangent plane information propagation: $\{Z_i \mid i = 1, \ldots, 4\}$ are four hypothetical depth values after uniform sampling, and $Z_{A'}$ is the depth of the point A' lying on the tangent plane of the neighboring point N. The projections of N and A' onto the image plane are n and a. The error between $Z_{A'}$, the depth sample closest to point A, and the true depth of point A is $E_{A'}$. Under identical sampling conditions, this strategy minimizes the error between estimated and true values, contributing to a more precise computation of the depth distribution.
Figure 6. The cost volume is fed into the 3D-UNet network for regularization, ultimately producing a cost volume with three channels. The channels serve different purposes: the first channel supports the regression of the probability volume for each depth sample of the target image, while the second and third channels support the regression of the normal volumes. The depth samples and normal volumes are weighted by the probability volume and summed to arrive at the final prediction.
Figure 7. The calculation and implementation of spatially continuous neighborhood positions.
Figure 8. Several images sourced from the DTU dataset are presented herein, accompanied by their corresponding depth truth and the associated normal vector truth, as derived from our analysis.
Figure 9. Various perspective images of some scenes on the Tanks and Temples dataset.
Figure 10. Comparison between some qualitative visualization examples on the DTU dataset.
Figure 11. Some results of normal estimation on the DTU test dataset are displayed.
Figure 12. Some reconstruction results of the Tanks and Temples dataset are displayed.
Figure 13. During training, the mean absolute difference between the depth predictions on the validation set and the true depth values, computed over pixels within a 2 mm error.
Figure 14. During training, the proportion of pixels on the validation set whose depth predictions fall within a 2 mm error.
Table 1. Comparison of accuracy and completeness with other methods on the DTU dataset.
Method                Acc. (mm)     Comp. (mm)    Fscore (mm)
Tola [32]             0.342         1.190         0.766
Gipuma [33]           0.283         0.873         0.578
Colmap [3]            0.400         0.664         0.532
CIDER [25]            0.417         0.437         0.427
P-MVSNet [26]         0.406         0.434         0.420
R-MVSNet [21]         0.383         0.452         0.417
D2HC-RMVSNet [34]     0.395         0.378         0.386
Point-MVSNet [24]     0.342         0.411         0.376
Fast-MVSNet [27]      0.336         0.403         0.370
Vis-MVSNet [35]       0.369         0.361         0.365
CasMVSNet [15]        0.325         0.385         0.355
PatchmatchNet [18]    0.427         0.277         0.352
CVP-MVSNet [14]       0.296         0.406         0.351
Ours                  0.337 (5th)   0.356 (2nd)   0.346 (1st)
Table 2. Comparison of GPU memory consumption and inference time on the DTU dataset. The image resolution is 1600 × 1184 and the number of source images is 5.
Method                Mem. (GB)    Time (s)    Normal
Vis-MVSNet [35]       5.6          0.61        No
Fast-MVSNet [27]      7.0          0.52        No
CVP-MVSNet [14]       8.8          1.51        No
CasMVSNet [15]        9.1          0.55        No
PatchmatchNet [18]    3.6          0.25        No
Ours                  7.8          0.52        Yes
Table 3. Comparison of F score with other methods on the Tanks and Temples dataset.
Method               Family   Francis   Horse   Lighthouse   M60           Panther   Playground    Train         Mean
MVSNet [13]          55.99    28.55     25.07   50.79        53.96         50.86     47.90         34.69         43.48
R-MVSNet [21]        69.96    46.65     32.59   42.95        51.88         48.80     52.00         42.38         48.40
CVP-MVSNet [14]      76.50    47.74     36.34   55.12        57.28         54.28     57.43         47.54         54.03
CasMVSNet [15]       76.37    58.45     46.26   55.81        56.11         54.06     58.18         49.51         56.84
D2HC-RMVSNet [34]    74.69    56.04     49.42   60.08        59.81         59.61     60.04         53.92         59.20
Ours                 76.15    52.35     36.94   55.15        57.58 (2nd)   53.55     58.64 (2nd)   50.05 (2nd)   55.05 (3rd)
Table 4. Evaluation results on the DTU dataset for different depth sampling combinations with the same total number of depth samples, progressively increasing the proportion of samples based on neighborhood tangent plane propagation.
Sampling     Acc.     Comp.    Fscore
[0, 0, 0]    0.358    0.357    0.357
[0, 0, 4]    0.353    0.356    0.355
[0, 8, 4]    0.337    0.356    0.346
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
