Article

Research on Rapid and Accurate 3D Reconstruction Algorithms Based on Multi-View Images

by Lihong Yang 1, Hang Ge 1,*, Zhiqiang Yang 1, Jia He 2, Lei Gong 1, Wanjun Wang 1, Yao Li 1, Liguo Wang 1 and Zhili Chen 1
1 College of Optoelectronic Engineering, Xi’an Technological University, Xi’an 710021, China
2 95841 Military Unit, Jiuquan 735000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4088; https://doi.org/10.3390/app15084088
Submission received: 9 December 2024 / Revised: 17 March 2025 / Accepted: 3 April 2025 / Published: 8 April 2025

Abstract:
Three-dimensional reconstruction entails the development of mathematical models of three-dimensional objects that are suitable for computational representation and processing. This technique constructs realistic 3D models from images and has significant practical applications across various fields. This study proposes a rapid and precise multi-view 3D reconstruction method to address the challenges of low reconstruction efficiency and inadequate, poor-quality point cloud generation in incremental structure-from-motion (SFM) algorithms in multi-view geometry. The methodology involves capturing a series of overlapping images of campus scenes. We employed the scale-invariant feature transform (SIFT) algorithm to extract feature points from each image, applied the KD-Tree algorithm for inter-image matching, and used the random sample consensus (RANSAC) algorithm with autonomous threshold adjustment to eliminate mismatches, thereby enhancing feature matching accuracy and the number of matched point pairs. Additionally, we developed a feature-matching strategy based on similarity, which optimizes the pairwise matching process within the incremental structure-from-motion algorithm. This approach decreased the number of matches and enhanced both algorithmic efficiency and model reconstruction accuracy. For dense reconstruction, we utilized the patch-based multi-view stereo (PMVS) algorithm. The results indicate that our proposed method achieves a higher number of reconstructed feature points and improves algorithmic efficiency by approximately ten times compared to the original incremental reconstruction algorithm. Consequently, the generated point cloud data are more detailed, and the textures are clearer, demonstrating that our method is an effective solution for three-dimensional reconstruction.

1. Introduction

Multi-view 3D reconstruction represents a crucial research avenue within the field of computer vision [1] and has significant applications in areas such as medical CT imaging, cultural heritage preservation, robotic visual localization, tunnel surveying, industrial inspection, and object digitization [2,3,4,5,6]. Fundamentally, three-dimensional reconstruction is the inverse process of camera imaging: it establishes the relationship between spatial three-dimensional points and two-dimensional image pixels, and this information is analyzed and integrated to calculate the spatial position of an object observed from various angles, ultimately yielding a three-dimensional model of the object. In the three-dimensional reconstruction of campus statues, traditional algorithms depend on the exhaustive matching of feature descriptors from corresponding image pairs, which considerably diminishes the efficiency of matching algorithms. Moreover, the various data acquisition methods employed in image reconstruction complicate the integration of all projections of an object across each image. The algorithms utilize a pairwise matching strategy for feature matching during the reconstruction phase to facilitate subsequent reconstruction tasks. This approach not only significantly increases processing time but also introduces ambiguity in matching, which negatively impacts position estimation. Even prominent open-source solutions, such as COLMAP [7] and Bundler [8], face difficulties in overcoming these challenges.
By concentrating on the three-dimensional reconstruction of small to medium-sized campus objects, this paper seeks to improve model completeness, point cloud density, and reconstruction accuracy based on the previously mentioned research background and current state. The following summarizes this paper’s primary research contributions:
(1)
To address the low number of correctly matched point pairs and the limited accuracy inherent in traditional feature matching algorithms, this study proposes an enhanced RANSAC optimization framework that integrates SIFT feature extraction, KD-Tree accelerated matching, and adaptive threshold adjustment. The core concept of the automatic threshold adjustment algorithm is to determine the threshold dynamically from the statistical properties of the data rather than relying on fixed settings. In each iteration of the RANSAC algorithm, after the preliminary model estimation, the residuals of all data points are computed with respect to the current model, and the threshold is adjusted dynamically so that the optimized model retains the maximum number of correct matching point pairs.
(2)
To address the issue of poor reconstruction model accuracy inherent in existing incremental Structure from Motion (SFM) sparse reconstruction algorithms, this paper proposes an enhanced incremental SFM sparse reconstruction algorithm. The study employs a similarity measurement algorithm to conduct an initial screening of the original matched views, selecting the top 70% as the standard for threshold setting and filtering, thereby generating preliminary matched views. Subsequently, the preliminary matched views are optimized based on temporal continuity to mitigate the impact of similar structures across time periods on the quantity and accuracy of the reconstructed model’s point cloud. This process yields more complete and accurate matched views, resulting in a greater number of point clouds and a more precise sparse reconstruction outcome.
(3)
This article further integrates the incremental Structure from Motion (SfM) algorithm and enhanced feature matching with the Patch-Based Multi-View Stereo (PMVS) algorithm to generate more comprehensive three-dimensional models, thereby addressing the problem of incomplete reconstruction models found in existing dense reconstruction algorithms.
This paper systematically reviews current solutions to these issues in Section 2. Section 3 provides a detailed introduction to the proposed improved algorithm. Subsequently, Section 4 conducts a multidimensional validation of the algorithm’s effectiveness through five sets of systematic comparative experiments (Section 4.2.1, Section 4.2.2, Section 4.2.3, Section 4.2.4 and Section 4.2.5). This includes an analysis of the effectiveness of the improved RANSAC algorithm with adaptive threshold adjustments, a comparative performance analysis of the enhanced feature matching algorithm, validation of the matching view improvement algorithm, and a comparison and analysis of the results from sparse and dense reconstruction improvements.

2. Related Works

To address this issue, Zhang Qing Peng and colleagues [9] employed an active selection image-matching strategy. This approach identifies the nearest neighbor for each image, thereby decreasing the number of pairwise matches in the original structure-from-motion algorithm; however, the resulting performance gain is minimal. Tu et al. [10] proposed the use of conflict measurement algorithms to optimize the sparse point clouds of three-dimensional structures in their work on enhancing the precision of repetitive structural components. However, the results of the optimized reconstruction were subpar, and there was a lack of detailed explanation regarding the outcomes of the improved sections. Chen Weiwen and colleagues [11] proposed a method that filters the SIFT descriptors of each image in relation to adjacent frames. They compared these with the previous frame to determine the number of valid point pairs, thereby establishing a matching strategy. However, this approach may be excessively rigid. Jiang San et al. [12] proposed a parallel structure-from-motion (SFM) method that integrates global descriptors with graph indexing for drone imagery. This approach tackles the problem of diminished retrieval efficiency due to the heightened dimensionality of image encoding. Zhang Haopeng et al. [13] proposed utilizing the chronological order of object imagery as a prior constraint, sequentially incorporating new images to achieve spatial and temporal dimension integration. However, this method is not applicable to unordered images. Kataria [14] improved camera position estimation by using only reliable matching points for initialization. They assigned greater importance to shorter matching trajectories, which helps reduce ambiguity caused by repetitive structures. Zach [15] introduced the concept of missing correspondences, which involves acquiring three images that align effectively to establish the epipolar geometry among images. This approach aims to identify and eliminate incorrect perspectives; however, the method is somewhat inflexible and may unintentionally exclude valid viewpoints. Wilson [16] eliminated erroneous feature point trajectories by utilizing local contextual information derived from the visibility graph. This method is applicable to large-scale network datasets and existing developed 3D reconstruction systems, such as Meshroom [17].

3. Materials and Methods

Smartphones were used in this study to capture immersive multi-view images of campus statues for three-dimensional reconstruction. By analyzing the imaging principles of objects and the camera’s imaging model, the aim is to extract the corresponding back-projected three-dimensional coordinate points from the two-dimensional pixel points in the images, thus achieving three-dimensional reconstruction. The specific camera imaging model is illustrated in Figure 1.
Figure 1a depicts the transformation between the world and camera coordinate systems. The world coordinate system refers to the coordinate system of the real world, measured in metres (m), in which any point is represented as $P_W = (X_W, Y_W, Z_W)^T$. The camera coordinate system is a right-handed coordinate system whose origin is the imaging device aperture $O$; its Z-axis typically aligns with the camera’s optical axis (perpendicular to the imaging plane), and it is measured in millimetres (mm), with any point represented as $P_C = (X_C, Y_C, Z_C)^T$. Both coordinate systems describe the same space and differ only by a rigid transformation, i.e., by position and orientation. Therefore, only a rotation $R$ and a translation $T$ are necessary to complete the transformation, as indicated in the following equation:
$$\begin{bmatrix} X_C \\ Y_C \\ Z_C \\ 1 \end{bmatrix} = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix}, \qquad R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}, \qquad T = \begin{bmatrix} t_{11} \\ t_{12} \\ t_{13} \end{bmatrix}$$
Figure 1b illustrates the transformation from the camera coordinate system to the image coordinate system. The latter resides on the camera’s imaging plane as a two-dimensional coordinate system. Its origin is at the center of the virtual imaging plane. The units are represented as mm, and the system effectively uses physical dimensions to describe pixel positions. All point coordinates in this system can be expressed as $p = (x, y)^T$. The relationship from the camera coordinate system to the image coordinate system exemplifies perspective projection, signifying the conversion from 3D to 2D, as detailed in the following equation:
$$Z_C \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_C \\ Y_C \\ Z_C \\ 1 \end{bmatrix}$$
Figure 1c illustrates the transformation from the image coordinate system to the pixel coordinate system. The latter represents a two-dimensional grid on a digital image, typically with the origin located at the top-left corner and measured in pixels. Scaling and translation exist between the physical image and pixel plane coordinate systems. Any point has coordinates $p = (u, v)^T$; both coordinate systems reside on the imaging plane, differing only in their respective origins and measurement units, as detailed in the following equation:
$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$
$1/d_x$ and $1/d_y$ represent the number of pixels contained within a unit physical length (1 mm) in the horizontal and vertical directions, respectively. By integrating the aforementioned steps, one can derive the comprehensive transformation formula from the world coordinate system to the pixel coordinate system:
$$Z_C \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix}$$
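To make the coordinate chain above concrete, the following minimal sketch projects a world point into pixel coordinates under the pinhole model. The pose (R, T) and the 3D point are hypothetical values chosen only for illustration, while the intrinsic values reuse the calibration result reported in Section 4.

```python
import numpy as np

def project_point(P_w, R, T, fx, fy, u0, v0):
    """Project a 3D world point to pixel coordinates with the pinhole model."""
    # World -> camera coordinates: rigid transform P_c = R * P_w + T
    X_c, Y_c, Z_c = R @ P_w + T
    # Camera -> image plane: perspective division by the depth Z_c
    x, y = fx * X_c / Z_c, fy * Y_c / Z_c
    # Image -> pixel coordinates: shift by the principal point (u0, v0)
    return np.array([x + u0, y + v0])

# Hypothetical pose: camera looking down the Z-axis, 2 m from the point
R = np.eye(3)
T = np.array([0.0, 0.0, 2.0])
print(project_point(np.array([0.10, 0.05, 0.0]), R, T,
                    3271.577, 3263.385, 2047.586, 1535.174))
```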
The overall framework of the presented algorithm is illustrated in Figure 2. First, we calibrate the camera’s intrinsic parameters. Then, we detect and match feature points in the pixel coordinate system and apply the proposed improved RANSAC algorithm with autonomous threshold adjustment to increase the number of matched point pairs, thereby establishing the two-dimensional correspondence of three-dimensional points between any two images. Subsequently, we convert the feature points to the camera coordinate system using the calibrated intrinsic parameters. Next, all three-dimensional points are combined with the camera positions and transformed into the world coordinate system; an improved incremental structure-from-motion (SFM) algorithm is used to perform sparse reconstruction, producing a sparse model with greater accuracy and a larger point cloud and yielding the sparse real three-dimensional coordinates of the object alongside the camera’s position parameters. Ultimately, we employ dense reconstruction to produce a visualizable three-dimensional model of the object.

3.1. Camera Calibration

Camera calibration determines the intrinsic parameter matrix during data acquisition, establishing the transformation relationship between the 2D-pixel coordinates of feature points and their corresponding 3D real-world coordinates. The imaging process of an object involves mapping transformations from 3D real coordinates to 2D plane coordinates. To achieve inverse projection from a 2D image to a 3D model, the intrinsic parameter matrix must be calibrated by adjusting for the transformation of the imaging coordinate system corresponding to the feature point coordinates. This calibration allows for camera position calculation and object reconstruction.
The camera imaging model illustrates how three-dimensional objects in the real world transform into two-dimensional images via the camera’s imaging process. In the field of computer vision, this typically involves four coordinate systems: the pixel, image, camera, and world coordinate systems. A diagram depicting the three-dimensional coordinate transformation via these coordinate systems is shown in Figure 3.
In 3D reconstruction, camera calibration involves determining internal parameters such as focal length and principal point. This study employs the Zhang Zhengyou calibration method [18], utilizing a checkerboard calibration pattern of 14 × 10 squares. We collected approximately 15~20 images from various angles with the camera to be calibrated. Subsequently, we used the MATLAB 2024a camera calibration toolbox to output the internal parameters.
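The paper uses the MATLAB 2024a calibration toolbox; as an equivalent illustration, the sketch below performs Zhang’s calibration with OpenCV. The image folder name is hypothetical, and a 14 × 10-square board is assumed to expose a 13 × 9 grid of inner corners.

```python
import glob
import cv2
import numpy as np

pattern_size = (13, 9)                       # inner corners of a 14 x 10-square board
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib/*.jpg"):        # hypothetical calibration image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Zhang's method: a planar target observed from 15-20 viewpoints
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
print("intrinsic matrix K:\n", K)
```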

3.2. Principles of the Enhanced Feature Matching Algorithm

The core problem of feature matching is establishing the corresponding relationships on a two-dimensional plane for images of the same target at different times, under different lighting conditions, and in different positions at the pixel level. Correspondence building between identical points in multi-view images is the foundation of multi-view image reconstruction. Consequently, this paper employs the SIFT algorithm for feature detection, complemented by KD-Tree for feature matching. The high discriminative descriptors of the SIFT algorithm effectively reduce false matches, while the efficient search capabilities of KD-Tree enable the processing of a greater number of feature points. The combination of these two methods not only ensures accuracy but also expands the pool of candidate matches. Furthermore, the enhanced RANSAC algorithm, with its adaptive threshold adjustment, retains a larger number of correct match pairs, thereby increasing the overall count of matching point pairs and providing foundational conditions for generating three-dimensional models with a higher quantity of point clouds.

3.2.1. Scale-Invariant Feature Transform (SIFT) Feature Point Detection

The objective of feature extraction focuses on identifying highly repeatable and distinguishable key points, followed by generating corresponding feature descriptors. High repeatability ensures that detected key points can be recognized across overlapping images. Meanwhile, high distinguishability guarantees the accurate matching of these key points in other overlapping images. This study utilizes the scale-invariant feature transform (SIFT) algorithm [19] for the feature extraction of multi-view images. The algorithm developed by Lowe employs a local feature detection method, searching for key points across various scale spaces and extracting each key point’s position, scale, and rotation invariants. SIFT features demonstrate invariance to changes in scale, rotation, and brightness while maintaining robustness against variations in viewpoint, affine transformations, and noise.
The SIFT algorithm introduces the concept of scale space for matching images across different scales. It constructs a Gaussian pyramid using a Gaussian blur function. To detect stable key points in the scale space, Gaussian difference kernels of varying scales are employed, convolving them with the image to generate the Difference of Gaussian (DOG) scale space. The scale space $L(x, y, \sigma)$ of the input two-dimensional image $I(x, y)$ can be represented as follows:
$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$
$G(x, y, \sigma)$ represents the Gaussian filter function as follows:
$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} \, e^{-\frac{(x - m/2)^2 + (y - n/2)^2}{2\sigma^2}}$$
In the formula above, $\sigma$ denotes the scale coordinate, and $(x, y)$ signifies the pixel coordinates of the image. Images of varying scales are obtained via a down-sampling method. The SIFT algorithm employs the Difference of Gaussian method during the extremum detection process as follows:
$$D(x, y, \sigma) = \left( G(x, y, k\sigma) - G(x, y, \sigma) \right) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma)$$
In the generated Gaussian difference pyramid, each sampling point is compared with its 26 neighboring points in the same layer and in the layers above and below. When a sampling point is the maximum or minimum among these detection points, it qualifies as a feature point. The scale-space Difference of Gaussian (DOG) function undergoes curve fitting to enhance feature point stability and improve noise resistance. This process eliminates low-contrast feature points and unstable edge points, allowing for the precise localization of feature points in both position and scale. To ensure the uniqueness of each feature point’s descriptor, a reference direction is assigned to each extremum point based on the image’s local features. The formulas for calculating the gradient magnitude and direction of the feature point $(x, y)$ are as follows:
$$m(x, y) = \sqrt{ \left( L(x+1, y) - L(x-1, y) \right)^2 + \left( L(x, y+1) - L(x, y-1) \right)^2 }$$
$$\theta(x, y) = \tan^{-1} \frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}$$
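As a brief illustration of this step, the following sketch detects SIFT key points and their 128-dimensional descriptors with OpenCV; the image file name is hypothetical.

```python
import cv2

img = cv2.imread("view_01.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
sift = cv2.SIFT_create()
# Each key point carries position, scale and orientation; descriptors are 128-D
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)
```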

3.2.2. Multidimensional Binary Search Tree (KD-Tree) Feature Point Matching

After the SIFT algorithm obtains feature points, feature point matching is conducted. Usually, the Euclidean distance is used to determine the similarity between two descriptors. If the SIFT descriptors in two images to be matched are $X_i = (X_{i1}, X_{i2}, \ldots, X_{i128})$ and $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{i128})$, the degree of similarity between these descriptors is determined from their Euclidean distance, as expressed in the following formula:
$$d = \sqrt{ \sum_{j=1}^{128} \left( X_{ij} - Y_{ij} \right)^2 }$$
Brute-force matching (BF) is a straightforward and direct matching technique. It iterates through all feature point descriptors, computes the distances between them, ranks these distances, and selects the nearest feature points as matches. Given two sets of feature descriptors extracted from two images, for each descriptor $X_i$ in the first set, the distance to every descriptor in the second image is calculated, and the shortest distances are chosen as matched points after filtering. While the brute-force method is simple, it has disadvantages. When image resolutions are very high, the number of feature points increases, resulting in substantial computational loads for pairwise comparisons. This leads to longer algorithm processing times and a higher incidence of erroneous matches, including cases of one-to-many matches.
A multidimensional binary search tree (KD-Tree)-based feature matching method is therefore employed to achieve faster and more efficient feature point matching. The KD-Tree [20] is a spatial partitioning data structure that forms a K-dimensional binary tree in which each node represents a K-dimensional data point. An illustrative example is shown in Figure 4, where (a) depicts a KD tree, and (b) illustrates the planar partitioning of the KD tree.
The KD-Tree algorithm is used for the nearest-neighbor search to identify the nearest and second-nearest feature points corresponding to each feature; a match is accepted only if the ratio of the nearest distance to the second-nearest distance falls below a specified threshold. Owing to the complexity of the scale space, many descriptor pairs can yield similar Euclidean distances. Lowe recommends setting this threshold at 0.8; however, considering the perspective variations between different matched images, a threshold of 0.4 to 0.6 is more appropriate. A value that is too low may yield insufficient matches, while a value that is too high may lead to excessive mismatches. In this study, the threshold is typically set to 0.5 to designate the nearest-neighbor feature point as a matching point.
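A minimal sketch of this matching stage, using OpenCV’s FLANN matcher with a KD-Tree index over the SIFT descriptors and the 0.5 nearest/second-nearest ratio threshold adopted here, is shown below; the image file names are hypothetical.

```python
import cv2

img1 = cv2.imread("view_01.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file names
img2 = cv2.imread("view_02.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, desc1 = sift.detectAndCompute(img1, None)
kp2, desc2 = sift.detectAndCompute(img2, None)

# FLANN matcher backed by a KD-Tree index over the 128-D SIFT descriptors
flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5),  # algorithm=1 -> KD-Tree
                              dict(checks=50))
knn = flann.knnMatch(desc1, desc2, k=2)

# Nearest / second-nearest distance ratio test with the 0.5 threshold
good = [m for m, n in knn if m.distance < 0.5 * n.distance]
print(f"{len(good)} candidate matches before RANSAC filtering")
```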
As previously mentioned, during the matching process, image information complexity and the limitations of computational methods can lead to the emergence of similar feature points, resulting in numerous mismatches. Ultimately, the random sample consensus (RANSAC) method [21], a robust parameter estimation approach, is utilized to eliminate mismatches and obtain optimal matching pairs. The RANSAC algorithm works iteratively to find the best solution. The specific steps are as follows: first, four pairs of matching points are randomly selected to establish an initial model. This model is then used to test the remaining matching points, counting how many points are inconsistent with the model and checking whether a defined threshold is met. If the threshold is met, the model is accepted as the final model; if not, the process restarts by randomly selecting four new pairs of matching points.

3.2.3. Improved RANSAC Algorithm with Automatic Threshold Adjustment

Numerous feature-matching mismatches can arise from the emergence of similar feature points during the matching process due to the complexity of the image information and the limitations of computational methods. The RANSAC (Random Sample Consensus) method is used to remove these discrepancies. Therefore, this paper employs a self-threshold adjustment method to optimize the algorithm. The specific flowchart illustrating the improved RANSAC matching result optimization algorithm is depicted in Figure 5.
The basic idea behind automatic threshold adjustment is to use the statistical properties of the data to dynamically determine thresholds instead of setting them manually. This greatly improves the algorithm’s ability to adapt to different datasets, thereby reducing missed and false matches. The drawbacks of fixed thresholds are addressed in the RANSAC algorithm by introducing an automatic threshold adjustment mechanism before determining the number of inliers. In particular, each time the initial model estimation is completed, the residuals of every data point are calculated with respect to the current model, and the threshold is dynamically adjusted in accordance with the results.
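The paper does not give the exact statistical rule used to adapt the threshold, only its range (10–100), step (5), initial value (30), and iteration count (100) reported in Section 4.2.1. The sketch below therefore assumes a simple median-of-residuals rule clamped to that range, applied to a 4-point homography model as in the RANSAC description above; it is an interpretation, not the authors’ implementation.

```python
import numpy as np
import cv2

def ransac_adaptive(src, dst, iters=100, t_init=30.0, t_min=10.0, t_max=100.0):
    """RANSAC with a data-driven inlier threshold (sketch).
    src, dst: N x 2 float arrays of putative matched point coordinates."""
    best_H, best_inliers, t = None, np.zeros(len(src), bool), t_init
    for _ in range(iters):
        idx = np.random.choice(len(src), 4, replace=False)   # minimal 4-point sample
        H, _ = cv2.findHomography(src[idx], dst[idx], 0)     # exact fit, no RANSAC
        if H is None:
            continue
        proj = cv2.perspectiveTransform(src.reshape(-1, 1, 2), H).reshape(-1, 2)
        residuals = np.linalg.norm(proj - dst, axis=1)        # per-point residuals
        # Dynamic threshold from residual statistics (assumed: median, clamped)
        t = float(np.clip(np.median(residuals), t_min, t_max))
        inliers = residuals < t
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    return best_H, best_inliers
```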

3.3. Enhanced Incremental SFM Sparse Reconstruction Algorithm for Similarity Measurement

Feature matching establishes 3D point correspondences across multi-view images, and incremental Structure from Motion (SfM) achieves sparse reconstruction by progressively optimizing camera poses and 3D coordinates. However, traditional methods face two critical limitations: exhaustive pairwise matching leads to exponential computational complexity in large-scale scenarios; repetitive textures and similar regions generate mismatches, degrading point cloud quality.
Traditional “Matching Views” construction relies on full-image feature matching and geometric validation (e.g., RANSAC), resulting in redundant matches. Although existing SfM frameworks attempt error filtering through bundle adjustment, reconstruction accuracy remains suboptimal. The improved algorithm in this study achieves transitive image connectivity with fewer matching instances, producing sparser yet geometrically more consistent matching relationships, as shown in Figure 6. Comparative experiments demonstrate that this method generates denser point clouds and more complete models.
To achieve high-precision and comprehensive scene reconstruction, this study proposes a view construction framework integrating similarity metrics and temporal continuity constraints. By employing adaptive matching strategies, the algorithm significantly reduces redundant matching requirements, enabling robust sparse reconstruction with minimal matching effort and generating richer point clouds and more accurate models (as illustrated in the flowchart in Figure 6).

3.3.1. Similarity Measurement Algorithms

For the unordered images of campus statues obtained via circular acquisition, we first calculate the overlap between images before conducting feature matching on the image sequences. If certain criteria are met, the two images are considered suitable for feature matching; otherwise, feature matching between the two images is not required, as illustrated in Figure 7.
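The exact similarity measure is defined by the flow in Figure 7, which is not reproduced here. As an illustration of the top-70% screening idea, the sketch below assumes one global descriptor per image and ranks all candidate pairs by cosine similarity, keeping only the most similar 70% for subsequent feature matching.

```python
import itertools
import numpy as np

def screen_pairs(global_desc, keep_ratio=0.7):
    """Initial screening of candidate image pairs by a global similarity score.
    global_desc: list of 1-D feature vectors, one per image (assumed input)."""
    pairs, scores = [], []
    for i, j in itertools.combinations(range(len(global_desc)), 2):
        a, b = global_desc[i], global_desc[j]
        scores.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)))
        pairs.append((i, j))
    # Keep the top 70% most similar pairs as the preliminary matching views
    order = np.argsort(scores)[::-1]
    return [pairs[k] for k in order[: int(keep_ratio * len(pairs))]]
```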

3.3.2. Image Continuity Optimization Adjustments

After autonomous strategy selection, the matching relationship between images is determined, and the process of SIFT feature point detection, KD-Tree feature point matching, and RANSAC mismatch filtering is carried out to determine the two-dimensional correspondence between images with overlapping content. After the initial filtering of the primary views through similarity metrics, the data collected across different time periods still exhibit intra-class similarity (such as repetitive textures and symmetrical structures). Consequently, even after preliminary selection, some spurious matches may persist, which can still impact the accuracy of the reconstruction. The specific algorithm flowchart for optimizing image continuity is illustrated in Figure 8.
After obtaining the initial matching views based on similarity metrics, this study introduces a temporal continuity constraint mechanism. By analyzing the temporal correlations within the image acquisition sequence, the preliminary matching results are optimized and refined, filtering key matching views that adhere to physical scene constraints, thereby establishing robust matching relationships.
This mechanism addresses the reduced accuracy of reconstruction models caused by mismatches between temporally disparate yet structurally similar data (i.e., where the time intervals of image acquisition are large but the similarity metrics are high), as illustrated in Figure 9.
Figure 9 illustrates the two criteria that control the image’s continuity:
The temporal continuity algorithm optimizes matching relationships through a dual mechanism: (1) it mitigates the interference of similar structures across time periods on reconstruction accuracy (see Figure 9a); (2) it ensures that essential matches between the first and last images in a circular acquisition scheme are preserved and not discarded by the temporal continuity algorithm (see Figure 9b). To prevent the loss of reliable matching relationships due to algorithmic improvements, this study employs a segmented temporal filtering strategy: on the one hand, it conducts continuity assessments based on a temporal matrix to segment images by viewpoint, thereby eliminating cross-period interference; on the other hand, it establishes a threshold for the first and last images, mandating the retention of critical matches in the circular shooting sequence to maintain the topological integrity of the scene.
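A minimal sketch of this temporal refinement is given below. It assumes images are indexed in acquisition order around the circular capture path, keeps pairs whose index gap is small (the 5-frame interval quoted in the Conclusions), and preserves the pairs that close the loop between the first and last images; the paper’s actual temporal-matrix segmentation is more elaborate.

```python
def refine_pairs(pairs, n_images, max_gap=5):
    """Temporal-continuity refinement of the preliminary matching views (sketch).
    pairs: (i, j) image-index pairs from the similarity screening above."""
    refined = []
    for i, j in pairs:
        gap = abs(i - j)
        wrap_gap = n_images - gap         # gap measured the other way around the loop
        # Keep temporally adjacent pairs and the first/last pairs closing the circle
        if gap <= max_gap or wrap_gap <= max_gap:
            refined.append((i, j))
    return refined
```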
In the incremental SFM sparse reconstruction process, the two images with the highest number of matching feature points are first selected, and the external parameters of one image are set to the identity matrix. Then, the position matrices of the two cameras are solved using the five-point method, and the obtained feature points are traced across the images. The two-dimensional feature points are triangulated to obtain their corresponding three-dimensional coordinates, which are then optimized using bundle adjustment (BA). The next image is added, and the above steps are repeated until all images have been incorporated. By improving the incremental SFM sparse reconstruction method, more accurate two-dimensional corresponding feature points can be obtained, and the ambiguous matching caused by structural similarity between images can be avoided, thereby improving reconstruction model accuracy.
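The initialization of this incremental pipeline can be sketched with OpenCV as follows: the essential matrix is estimated with the five-point method inside RANSAC, the relative pose is recovered, and the correspondences are triangulated. Bundle adjustment and the incremental addition of further views are omitted; pts1 and pts2 denote the matched pixel coordinates of the chosen initial pair.

```python
import numpy as np
import cv2

def initialize_pair(pts1, pts2, K):
    """Initialize incremental SfM from the best-matched image pair (sketch).
    pts1, pts2: N x 2 float arrays of matched pixels; K: 3 x 3 intrinsic matrix."""
    # Essential matrix via the five-point method inside a RANSAC loop
    E, mask = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # First camera fixed at the identity pose; second camera at [R | t]
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])

    # Triangulate the 2D correspondences into homogeneous 3D coordinates
    X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    return (X_h[:3] / X_h[3]).T, R, t        # N x 3 points plus the relative pose
```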

3.4. PMVS Dense Reconstruction

The PMVS (patch-based multi-view stereo) algorithm extracts feature points from multi-view images as initial seed points, represents these points using patches, and performs consistency verification and optimization in multiple views, gradually expanding and densely reconstructing the 3D point cloud in the image. This algorithm generates dense 3D point clouds via image feature matching, viewpoint consistency checking, local optimization, and iterative extension.
First, one of the images is selected as the reference image. Sparse points reconstructed via SFM are selected as seed points, which can be projected onto the reference image and other images. Second, based on the internal and external parameters of the camera, the pixel depth and normal vector of the sparse point mapping are obtained; the distance and direction between the 3D point and the origin of the reference image’s camera coordinate system are taken as the initial pixel depth and the initial normal vector of the projection point, respectively. Third, small square patches are created, centered on the selected point, and the patch points are mapped onto pixels in the other view images; the pixel depth and initial normal vector are optimized by minimizing the reprojection error. Fourth, the normalized cross-correlation (NCC) photometric consistency method is used to remove noise points. Photometric consistency refers to the color consistency of the same point observed from different perspectives: a small block centered on the selected pixel is projected onto a neighboring image, and the similarity of the multi-view pixels is calculated. As a result, the proposed 3D reconstruction method is robust to color and does not rely on color variations. Fifth, the depth and normal of the seed point are assigned to the four adjacent pixels as their initial depth and normal values, the corresponding pixels in other images are found via an epipolar search, and the loop continues until all pixels have been searched. Finally, the depth map is reconstructed from the pixel depths and normal vectors; the camera’s inverse projection matrix is used to project depth pixels into three-dimensional space, yielding a three-dimensional point cloud from the depth of each pixel in the depth map, and Poisson reconstruction is then used to generate a visualizable three-dimensional model.
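The photometric consistency test at the heart of this filtering step can be sketched as a normalized cross-correlation between two grayscale patches; the acceptance threshold of 0.7 mentioned in the comment is a common choice and is not taken from this paper.

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized cross-correlation between two same-sized grayscale patches,
    sampled around the projections of the same patch center in two views."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# A patch is typically kept as photometrically consistent if, e.g., ncc(...) > 0.7
```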

4. Results

The hardware environment utilized in this experiment includes a 12th Generation Intel(R) Core(TM) i9-12900H CPU operating at 2.50 GHz (Intel, Santa Clara, CA, USA), along with the Windows 11 operating system. The graphics processing unit is an NVIDIA GeForce RTX 3060 (Nvidia, Santa Clara, CA, USA), and the algorithm is implemented on the MATLAB 2024a platform. This study employed the Zhang Zhengyou [18] calibration method to determine the camera intrinsic parameter matrix as follows:
$$K = \begin{bmatrix} f_x & 0 & 0 \\ 0 & f_y & 0 \\ u_0 & v_0 & 1 \end{bmatrix} = \begin{bmatrix} 3271.577 & 0 & 0 \\ 0 & 3263.385 & 0 \\ 2047.586 & 1535.174 & 1 \end{bmatrix}$$

4.1. Experimental Data and Evaluation Metrics

4.1.1. Data Acquisition

In multi-view three-dimensional reconstruction, “multi-view” refers to the requirements for image acquisition and the fundamental conditions for algorithm execution. The images used for reconstruction must have at least two different camera angles and positions with overlapping content. This study employs the built-in camera of the Huawei Mate 50 Pro (Huawei, Shenzhen, China) and DJI AIR 3 (DJI, Shenzhen, China) to capture images of campus sculptures from various angles. The device parameters are listed in Table 1, utilizing a standard telephoto lens for imaging at a 1× magnification ratio.
This involved data collection for two groups of statues and two groups of buildings, which are sequentially labeled from left to right in Figure 10 as Statue1, Statue2, Archi1, and Archi2.
The number of images contained in each dataset and their corresponding resolution sizes are presented in Table 2.

4.1.2. Evaluation Metrics

The following are the primary quantitative evaluation metrics employed in this study to assess the improvement of the algorithm:
(1)
Reprojection Error: $R_e$
Reprojection Error refers to the geometric deviation between the projected coordinates of a three-dimensional reconstructed point cloud and the observed coordinates of feature points in the actual image plane when back-projected onto a two-dimensional image. The unit of measurement is pixels, and it is mathematically represented as follows:
$$R_e = \frac{1}{N} \sum_{i=1}^{N} \left\| P_j X_i - x_{ij} \right\|^2$$
In this context, $P_j$ represents the projection matrix of the $j$-th camera, $X_i$ denotes the coordinates of the $i$-th three-dimensional point in the model, and $x_{ij}$ indicates the two-dimensional observed coordinates of the $i$-th three-dimensional point in the $j$-th image.
(2)
Number of reconstructed point cloud points: $N$
The density of reconstructed point clouds positively correlates with feature matching density, demonstrating both the algorithm’s capability in tracking features under low-texture/repetitive-structure scenarios and the completeness of reconstructed regions. This study employs point cloud quantity N as a quantitative evaluation metric.
(3)
Number of matching-view edges: $N_C$
The edge count in the matching views reflects the number of feature-matching operations required by incremental SfM. Redundant matches not only consume computational resources but also degrade reconstruction accuracy, particularly for cross-temporal similar structures. This study quantifies the algorithmic improvements through the edge count $N_C$, with comparative analyses demonstrating the performance enhancements.
(4)
Feature-matching accuracy
Matching accuracy serves as a key metric for evaluating feature matching algorithm precision, defined as the ratio of correct matches to total matches. Higher values indicate enhanced algorithmic precision. The calculation formula is:
$$Accuracy = \frac{N_{correct\ matches}}{N_{matches}} \times 100\%$$
In this context, $N_{matches}$ denotes the total number of matched point pairs obtained by the matching algorithm, which includes both accurate feature correspondences and erroneous matching results, and $N_{correct\ matches}$ represents the number of correct matched point pairs within the results. Matching accuracy is one of the most critical metrics for evaluating algorithm performance, as it is directly linked to the reliability of the algorithm in practical applications. A minimal computational sketch of these metrics is given after this list.
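As referenced above, a minimal sketch of the two computed metrics follows; the projection matrix, reconstructed points, and observations are assumed inputs from the reconstruction pipeline.

```python
import numpy as np

def reprojection_error(P, X, x_obs):
    """Mean reprojection error R_e for one view (sketch).
    P: 3 x 4 projection matrix, X: N x 3 points, x_obs: N x 2 observed pixels."""
    X_h = np.hstack([X, np.ones((len(X), 1))])
    proj = (P @ X_h.T).T
    proj = proj[:, :2] / proj[:, 2:3]            # perspective division
    return float(np.mean(np.sum((proj - x_obs) ** 2, axis=1)))

def matching_accuracy(n_correct_matches, n_matches):
    """Feature-matching accuracy, in percent."""
    return 100.0 * n_correct_matches / n_matches
```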

4.2. Experimental Results and Analysis

This chapter conducts experiments on four self-collected 3D reconstruction datasets (Statue1, Statue2, Archi1, Archi2), covering statue and architectural scenes to balance structural complexity and texture diversity, thereby validating the proposed algorithm’s efficacy and generalization capability.

4.2.1. Improving the Effectiveness Analysis of the RANSAC Algorithm

The RANSAC algorithm’s performance in research and use is vulnerable to outliers and noise interference found in high-resolution datasets. Furthermore, the fixed threshold mechanism has inherent shortcomings, including a high rate of misclassification and a heavy reliance on empirical data. In order to overcome these problems, this study suggests an adaptive threshold adjustment technique based on statistical features. By using dynamic threshold settings, the possibility of incorrectly classifying inliers is decreased, improving the model’s resilience to noise. The RANSAC optimization algorithm has a threshold range of 10 to 100, a threshold increment of 5, a threshold initialization of 30, and a total of 100 iterations.
Using the Statue1 dataset as a reference, Figure 11 illustrates the effectiveness of the proposed algorithm in this study.
Figure 11a displays initial matching results using SIFT features and KD-Tree descriptors (green lines), revealing mismatches caused by endpoint misalignment. Figure 11b demonstrates the optimized results with our improved RANSAC algorithm: green lines represent matches under original thresholds, while red lines indicate additional matches from adaptive threshold adjustment. This highlights the algorithm’s enhanced robustness through increased valid matching pairs.
The quantitative results of the detailed autonomous threshold adjustment RANSAC algorithm are presented in Table 3.
Compared to fixed-threshold RANSAC (threshold = 40), the proposed adaptive threshold algorithm dynamically adjusts parameters (e.g., Statue1 = 50, Archi1 = 65), significantly enhancing feature-matching performance. Experimental results show: statue datasets (Statue1/Statue2) achieve 57.0%~80.7% growth in matched pairs, while architectural datasets (Archi1/Archi2) exhibit 45.2%~71.1% increases. Additionally, dynamic parameter optimization improves inlier selection accuracy by 23.6% (p < 0.01), effectively overcoming empirical threshold limitations, particularly in complex scenarios with significant resolution and noise variability.

4.2.2. Feature Detection and Matching

Feature point extraction and matching in images represent a crucial step in three-dimensional reconstruction. The feature matching phase is executed on the MATLAB platform, utilizing the SIFT algorithm for feature extraction. We employed the KD-Tree matching algorithm based on Euclidean distance to replace the previous brute-force matching approach with a matching threshold set at 0.5.
Figure 12 illustrates the SIFT feature detection results. Figure 12a displays two randomly selected original images from the collected data. Figure 12b presents the detection outcomes for these images, which took a cumulative time of 13.467 s. The number of feature points detected, from left to right, were 21,196 and 13,531, respectively.
Figure 13 illustrates the differing outcomes of two images after employing two methods of feature matching and error removal. Figure 13a,b present outlier removal results using brute-force (BF) matching and traditional RANSAC, while Figure 13c,d demonstrate the enhanced RANSAC algorithm integrating KD-Tree matching and adaptive threshold adjustment. KD-Tree achieves significantly lower mismatch rates compared to BF, and the improved algorithm retains more matched pairs while drastically reducing execution time without compromising accuracy. A comparison of Figure 13a and Figure 13c reveals that BF exhibits marginally higher per-frame matching accuracy but suffers from substantially more mismatches. These observations validate the rationale for adopting the KD-Tree + SIFT framework, which balances efficiency and robustness while providing precise correspondences for downstream 3D reconstruction tasks.
The experimental findings of the performance comparison are shown in Table 4. Experiments reveal significant differences in robustness, matching accuracy, and time efficiency among SIFT-based feature detection with varying matching strategies:
Time Efficiency: The SIFT + Brute Force (BF) baseline takes 84.523 s, increasing to 86.147 s after RANSAC optimization, confirming additional computational costs from iterative sampling. In contrast, SIFT + KD-Tree reduces matching time to 26.634 s (68.5% speedup over BF) through hierarchical spatial partitioning of high-dimensional feature vectors, with only a marginal increase to 26.953 s after enhanced RANSAC, indicating minimal computational overhead during model optimization.
Matching Quality: SIFT + BF generates 12,356 initial matches, retaining 5050 pairs after RANSAC filtering, with accuracy improving from 40.871% to 50.50%, demonstrating RANSAC’s outlier removal capability. Meanwhile, SIFT + KD-Tree reduces matches from 10,316 to 9136 pairs (11.4% decrease), achieving 88.561% initial matching accuracy after improved RANSAC, significantly outperforming SIFT + BF + RANSAC’s 59.1% reduction. This highlights the synergistic effect of SIFT’s high discriminative power and KD-Tree’s approximate nearest neighbor search, enhancing both precision and stability.
Overall, the SIFT + KD-Tree + enhanced RANSAC approach provides a solution that supports both real-time performance and robustness for large-scale image registration tasks by striking an ideal balance between time efficiency (26.953 s), matching accuracy (88.561%), and false match suppression capability (false match rate < 11.5%).
Figure 14 illustrates the comparative results of various feature-matching algorithms applied to self-collected data. Figure 14 shows the matching outcomes of four feature matching algorithms on four sets of self-collected data: SURF [22], SIFT [19], MS-HLMO [23], and our suggested approach. It is evident that our algorithm is capable of preserving a greater number of matching point pairs.
Table 5 and Table 6 present a comparative analysis of the proposed algorithm against various matching algorithms, focusing on the number of matched point pairs and accuracy metrics across four sets of self-collected data.
Based on a thorough examination of Table 5 and Table 6, we can determine that our algorithm performs noticeably better than competing algorithms in two crucial metrics: the accuracy of matching and the quantity of matched point pairs. In particular, throughout all datasets, our algorithm produces a significantly greater number of matched point pairs; however, it excels in the Archi1 dataset, successfully matching 4090 point pairs. With a maximum accuracy of 88.56% (Statue1) and a minimum of 64.27% (Archi2), our algorithm also performs exceptionally well in terms of matching accuracy across all datasets, demonstrating high precision and robustness. On the other hand, the MS-HLMO algorithm performs inconsistently in terms of matching accuracy, especially underperforming in the Archi1 dataset, even though it outperforms SIFT and SURF in terms of the number of matched point pairs.
The specific comparison of the results is illustrated in Figure 15. It can be clearly observed from Figure 15 that across all datasets, our algorithm and MS-HLMO consistently outperform SIFT and SURF algorithms, particularly when it comes to matching accuracy. Consequently, our algorithm not only matches more point pairs when processing self-collected datasets but also achieves a higher matching accuracy when taking into account both the number of matched point pairs and matching accuracy, highlighting its superiority and potential in real-world applications.
In summary, to address the issue of insufficient point cloud density in existing multi-view image 3D reconstruction algorithms, this study proposes a solution framework based on feature-matching optimization. By integrating the scale invariance of SIFT feature descriptors with the high-dimensional search efficiency of KD-Tree (resulting in a 68.5% reduction in matching time) alongside a self-designed dynamic threshold RANSAC optimization strategy (which enhances inlier selection accuracy by 23.6%), the experimental results demonstrate the algorithm’s exceptional performance across four datasets. This accomplishment demonstrates the synergistic efficacy of the adaptive RANSAC filtering mechanism and SIFT + KD-Tree feature space optimization, which effectively alleviates the issue of low model point cloud density caused by insufficient matching points during multi-view image reconstruction and provides a larger number of matching point pairs for subsequent multi-view geometric computations.

4.2.3. Analysis of the Effectiveness of View Improvement Algorithms

To address the issue of numerous invalid matching relationships in the incremental SFM sparse reconstruction algorithm, where the influence of similar structures across different time periods can diminish reconstruction accuracy, this paper proposes a feature-matching strategy based on similarity measures, which extracts the similarity of images and conducts an initial screening of matching relationships. By considering the temporal continuity of image acquisition, the initial matching views are optimized and adjusted to generate more complete and accurate matching views. The objective is to enhance the quantity and accuracy of point clouds in model reconstruction. The matching-view comparison results of COLMAP and the improved algorithm presented in this paper are illustrated in Figure 16.
Figure 16 presents matching views in the Structure from Motion (SfM) algorithm using network graphs, where nodes represent viewpoints and edges indicate matching relationships. Node size and edge thickness may scale with the number of matching pairs or quality, visually reflecting differences in view construction across algorithms. The first row (Figure 16(a1,b1,c1,d1)) shows matching views from the traditional COLMAP algorithm, revealing a high density of edges and well-organized views, indicating equal treatment of all images. However, similar structures across different time periods can cause ambiguous matches, reducing reconstruction accuracy, and feature matching on images without common areas leads to redundancy. In contrast, the second row (Figure 16(a2,b2,c2,d2)) displays matching views from the improved incremental SFM sparse reconstruction algorithm proposed in this study, with significantly fewer edges and more targeted view distribution. This demonstrates that the improved algorithm establishes more matches between highly similar images, forming denser combinations. The enhanced algorithm not only reduces redundant matches but also more accurately filters valid matching relationships, providing a reliable foundation for improving reconstruction quality and precision.
The specific performance improvements of the algorithm presented in this paper, as demonstrated in the matching view, are illustrated in Table 7.
The number of edges in the matching view, denoted $N_C$, equals the number of feature-matching operations required for the subsequent feature-matching tasks. According to the table’s data, the enhanced algorithm’s matching iterations for every dataset are noticeably fewer than those generated by the original algorithm. In particular, the matching iterations decrease by 63.4%, 76.3%, 76%, and 84.2% for Statue1, Statue2, Archi1, and Archi2, respectively.
Strong generalizability and applicability are demonstrated by the improved algorithm’s impressive performance across a variety of dataset types, including sculptures and architecture. The enhanced algorithm successfully lowers the number of matches, demonstrating its stability and effectiveness in managing datasets of different sizes, whether used on small datasets like Statue1 with only 30 images or large datasets like Archi2 with 100 images.

4.2.4. Improving the Analysis of Sparse Reconstruction Results in Algorithms

This section seeks to fully illustrate the influence of the suggested similarity measure enhancement on the incremental SFM sparse reconstruction algorithm, building on the analysis of the view improvement algorithm’s efficacy. We will compare the enhanced algorithm’s performance with other popular algorithms like VSFM [24], COLMAP [7], and SfSM [25] and look into how well it performs in sparse reconstruction. We will evaluate the algorithms based on the number of point clouds and the accuracy of the reconstruction results by comparing and contrasting the sparse reconstruction results before and after the improvements.
Figure 17, Figure 18, Figure 19 and Figure 20 illustrate the comparative sparse reconstruction results of the proposed algorithm against various other algorithms using the self-collected data from Statue1, Statue2, Archi1, and Archi2.
The presentation format of the results is as follows: from left to right, the front, side, and rear views of the reconstruction results are displayed. This three-view representation provides a clear visual demonstration of the accuracy and completeness of the sparse reconstruction from a subjective visual perspective. All results are saved in ‘.ply’ format and analyzed using the Cloud Compare software v2.13.2.
As illustrated in Figure 17, the sparse reconstruction results for the self-acquired dataset Statue1 include SfSM, VSFM, COLMAP, and the improved algorithm proposed in this paper. In terms of point cloud density, the algorithm presented herein produces a more densely populated point cloud that covers a significant portion of the statue, whereas the other three methods yield relatively sparse point clouds with numerous voids and missing areas. This indicates that the proposed algorithm is more effective in feature point extraction and retains a greater amount of valid information during the matching process, resulting in a richer point cloud dataset.
The algorithm presented in this paper generates statue models with greater accuracy in geometric shapes and details, successfully recovering more surface textures and contours. In contrast, the sparse reconstruction results from the SfSM and VSFM algorithms exhibit noticeable distortion or deformation, while the COLMAP algorithm performs relatively well; however, the proposed algorithm demonstrates superior results. This enhancement is attributed to the optimization of the incremental structure from the Motion (SfM) sparse reconstruction algorithm and improvements in the initial feature matching phase.
Comparative analysis of the supplementary data in Figure 18, Figure 19 and Figure 20, specifically for Statue2, Archi1, and Archi2, substantiates the significant advantages of the proposed algorithm in terms of reconstruction accuracy and point cloud density.
Through a comprehensive analysis of sparse reconstruction results from four self-collected datasets (Statue1, Statue2, Archi1, Archi2), the following conclusions are drawn: In terms of point cloud quantity and reconstruction accuracy, the proposed enhanced algorithm demonstrates significant advantages over traditional methods (e.g., SfSM, VSFM, COLMAP). The improved algorithm substantially increases the number of generated point clouds, achieving a more uniform distribution that nearly covers all areas of the scene, indicating its superior capability in extracting and utilizing image feature information to produce denser point cloud data. Furthermore, the enhanced algorithm excels in detail reconstruction, more accurately capturing subtle scene structures and textures, thereby improving the precision of geometric shapes and intricate details with a clear representation of scene contours and structures. Additionally, the improved algorithm exhibits greater robustness in handling complex scenes, effectively mitigating the impact of erroneous matches by optimizing feature-matching strategies and incorporating temporal continuity in image acquisition, enhancing both matching accuracy and robustness.
The specific quantitative results are presented in Table 8, where a more detailed analysis of the sparse reconstruction outcomes is conducted from the perspectives of reprojection error and point cloud quantity. The specific evaluation metrics are detailed in Section 4.1.2.
In terms of reprojection error, the proposed improved algorithm (Ours) achieves the lowest error values across all datasets, significantly outperforming traditional algorithms such as SfSM, VSFM, and COLMAP. For instance, in the Statue1 dataset, the reprojection error of the improved algorithm is 0.98, compared to 4.35 for SfSM, 3.25 for VSFM, and 1.21 for COLMAP, indicating a substantial increase in accuracy. This advantage is consistent across other datasets; for example, in the Archi2 dataset, the improved algorithm achieves an error of 1.58, significantly lower than other methods. Reprojection error measures the positional deviation of 3D reconstruction points projected onto the 2D image plane, with lower values indicating better geometric consistency between the reconstructed 3D structure and the actual imaging process. These results demonstrate the improved algorithm’s superior reconstruction accuracy, enabling more precise recovery of scene geometry and providing a reliable geometric foundation for subsequent 3D reconstruction tasks.
Regarding point cloud quantity, the improved algorithm exhibits exceptional performance. For example, the Statue1 dataset generates 78,986 point clouds, significantly surpassing SfSM’s 40,973, VSFM’s 60,588, and COLMAP’s 17,566. This trend is consistent in other datasets, such as Archi2, where the improved algorithm produces 500,247 point clouds, far exceeding other methods. This indicates that the improved algorithm more effectively extracts and utilizes image feature information, resulting in denser and more uniformly distributed point cloud data, thereby providing richer geometric support for subsequent 3D reconstruction.

4.2.5. PMVS Dense Reconstruction Results

PMVS (patch-based multi-view stereo) is a multi-view stereo matching algorithm used to reconstruct 3D models from multiple photos. The basic principle is to generate a dense 3D point cloud by performing dense matching on multi-view images. The core idea of PMVS is based on image patches, which obtain high-precision 3D reconstruction results via local optimization and global consistency.
Figure 21, Figure 22, Figure 23 and Figure 24 illustrate the comparison of dense reconstruction results obtained from the proposed algorithm against various other algorithms using the self-collected datasets Statue1, Statue2, Archi1, and Archi2.
Figure 21 illustrates the dense reconstruction results of the Statue.
From the frontal perspective, the VSFM algorithm produces sparse point clouds with significant detail loss, while COLMAP increases point cloud density but still exhibits reconstruction blind spots. In contrast, the proposed algorithm generates uniformly distributed high-density point clouds, effectively capturing the statue’s textures and intricate details. In the lateral view, VSFM displays blurred outlines, and COLMAP inadequately handles complex geometries; however, the proposed algorithm achieves precise contour delineation and fine feature reconstruction. The rear perspective comparison is even more pronounced: VSFM suffers from extensive point cloud omissions, and COLMAP fails to provide complete coverage, whereas the proposed algorithm ensures full rear-view coverage.
The comparative results of the supplementary data for Statue2, Archi1, and Archi2 presented in Figure 22, Figure 23 and Figure 24 further validate the significant advantages of the proposed algorithm in terms of point cloud quantity and completeness.
Through a comparative analysis from various perspectives, it is evident that the algorithm presented in this paper demonstrates significant improvements over VSFM and COLMAP in terms of point cloud quantity and detail reconstruction. The point clouds generated by VSFM appear relatively sparse across different viewpoints, resulting in the failure to recover numerous architectural details, which compromises the completeness of the reconstruction. Although COLMAP shows an increase in point cloud density, its performance in detail representation remains suboptimal, particularly concerning the textures and complex structures of buildings, failing to achieve the desired reconstruction quality. In contrast, the algorithm proposed in this paper not only generates a greater number of point clouds but also ensures a more uniform distribution, effectively capturing the intricate features of the architecture thereby enhancing the completeness and accuracy of the reconstruction across all perspectives.
Table 9 quantitatively illustrates the specific number of reconstructed point clouds generated by various dense reconstruction algorithms applied to self-acquired data.
This study compares the dense reconstruction performance of three pipelines (VSFM, COLMAP, and the improved algorithm) on four typical 3D reconstruction datasets (Statue1, Statue2, Archi1, Archi2). As shown in Table 9, VSFM generates point clouds fastest (Statue1: 65.732 s, Archi2: 605.858 s), but its reconstruction density is significantly lower (Statue1: 424,949 points, Archi2: 840,909 points). COLMAP substantially increases the point cloud quantity relative to VSFM (Statue1: +150%, Archi2: +465%), but at a considerable increase in time cost (Statue1: 3085.794 s, Archi2: 3785.947 s).
The quantitative results of the proposed algorithm, illustrated in Figure 25, reinforce these observations. The improved algorithm achieves the highest reconstruction density on all datasets, generating up to 7.14 times as many points as VSFM (Statue1: 3,033,703 points) and 1.80 times as many as COLMAP (Archi2: 8,574,136 points). Its computation time is lower than COLMAP’s by roughly 8.6% (Archi2) to 29.9% (Statue1), with Archi1 the exception where runtime is slightly higher; considering both point cloud quantity and runtime, the proposed algorithm offers the best overall trade-off.

5. Conclusions

Addressing the limited point cloud density, model inaccuracies, and poor completeness of existing multi-view image 3D reconstruction algorithms, this paper proposes an enhanced algorithm. With runtime comparable to most current algorithms, experimental validation on four sets of self-collected data demonstrates that the proposed reconstruction algorithm increases point cloud density by nearly 2.3 times, improves model accuracy by approximately 20%, and enhances completeness by 45%. The resulting models contain more points and are more accurate and complete. The primary methods and contributions of this study are as follows:
(1)
To address the low number of correctly matched point pairs and the limited accuracy of traditional feature matching algorithms, this study proposes an enhanced RANSAC optimization scheme that integrates SIFT feature extraction, KD-Tree accelerated matching, and autonomous threshold adjustment. SIFT extracts a large number of stable feature points in complex textured scenes owing to its strong robustness to scale, rotation, and illumination variations, while KD-Tree matching improves matching speed and reduces computational complexity through efficient spatial partitioning, in contrast to the O(N²) complexity of brute-force matching. By dynamically adjusting the RANSAC threshold to reject mismatches, we observed significant improvements in both the number of matched point pairs and accuracy across all four datasets. For instance, on the Statue1 dataset the number of effective matches rose from 5050 pairs (SIFT with fixed-threshold RANSAC) to 9136 pairs, with accuracy increasing from 40.87% to 88.56%, significantly outperforming SURF, MS-HLMO, and the baseline SIFT pipeline, and thereby providing high-quality sparse point cloud inputs for generating more detailed 3D models;
(2)
To address the model distortion caused by similar structures captured at different times and the computational inefficiency of exhaustive pairwise matching in traditional incremental Structure from Motion (SfM), this study introduces a two-stage optimization method. First, matching is prioritized between images from adjacent time frames (interval < 5 frames), which eliminates contradictory matches arising from similar structures across different time periods. Second, match pairs are selected through a similarity metric: using global feature extraction and a dynamic threshold (similarity > 0.75), redundant computations are reduced by about 70% and erroneous matches caused by repetitive textures are avoided. Experimental results show that the proposed approach increases the number of sparse reconstruction points by nearly 2.3 times. A minimal illustrative sketch of both steps follows this list.
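The sketch below (Python/OpenCV) illustrates the two steps under simplifying assumptions: the global image feature is approximated by a normalized intensity histogram, the autonomous RANSAC threshold by a simple data-driven heuristic, and the names select_candidate_pairs and match_pair are illustrative rather than taken from our implementation.

```python
import cv2
import numpy as np


def global_descriptor(gray: np.ndarray, bins: int = 64) -> np.ndarray:
    """Cheap whole-image descriptor (normalized intensity histogram), used only
    to pre-screen candidate image pairs; an illustrative stand-in for the
    global feature described in the text."""
    hist = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
    return hist / (np.linalg.norm(hist) + 1e-12)


def select_candidate_pairs(gray_images, max_frame_gap=5, sim_threshold=0.75):
    """Keep a pair (i, j) only if the images are temporally adjacent or
    globally similar, instead of matching all O(N^2) pairs exhaustively."""
    descs = [global_descriptor(img) for img in gray_images]
    pairs = []
    for i in range(len(gray_images)):
        for j in range(i + 1, len(gray_images)):
            similarity = float(np.dot(descs[i], descs[j]))  # cosine similarity
            if (j - i) < max_frame_gap or similarity > sim_threshold:
                pairs.append((i, j))
    return pairs


def match_pair(gray_a, gray_b, ratio=0.7):
    """SIFT features + KD-Tree (FLANN) matching + RANSAC outlier rejection.

    The RANSAC reprojection threshold is derived from the data (median
    displacement of the ratio-test survivors) as a simple stand-in for the
    autonomous threshold adjustment described in the paper."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(gray_a, None)
    kp_b, des_b = sift.detectAndCompute(gray_b, None)
    if des_a is None or des_b is None:
        return []

    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5),  # KD-Tree index
                                  dict(checks=50))
    good = []
    for pair in flann.knnMatch(des_a, des_b, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    if len(good) < 8:
        return []

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])

    # Data-driven threshold, clipped to a sensible pixel range (heuristic).
    spread = float(np.median(np.linalg.norm(pts_a - pts_b, axis=1)))
    thresh = float(np.clip(0.01 * spread, 1.0, 4.0))

    _, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC,
                                     ransacReprojThreshold=thresh,
                                     confidence=0.999)
    if mask is None:
        return []
    return [m for m, keep in zip(good, mask.ravel()) if keep]
```

In a full pipeline, select_candidate_pairs replaces exhaustive pairing before incremental SfM, and match_pair supplies the verified correspondences used for pose estimation and triangulation.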

Author Contributions

Conceptualization, L.Y. and H.G.; methodology, H.G. and Z.Y.; software, W.W.; validation, J.H. and L.G.; formal analysis, L.W.; investigation, L.Y.; data curation, Y.L.; writing—original draft preparation, H.G.; writing—review and editing, Z.C.; visualization, H.G.; supervision, L.Y.; project administration, L.G.; funding acquisition, L.Y. and Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Fund Project of the Shaanxi Provincial Department [2023-YBGY-369], the Key Scientific Research Plan of the Education Department of Shaanxi Province [23JY035], and the Youth Innovation Team of Shaanxi Universities [K20220184].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author, [gehang@st.xatu.edu.cn], upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Zhang, Y.; Hu, K.; Wang, P. A Review of Three-Dimensional Reconstruction Algorithms. J. Nanjing Univ. Inf. Sci. Technol. (Nat. Sci. Ed.) 2020, 12, 591–602. [Google Scholar] [CrossRef]
  2. Chen, H. Neural Implicit Surface Reconstruction of Freehand 3D Ultrasound Volume with Geometric Constraints. Med. Image Anal. 2024, 98, 103305. [Google Scholar] [CrossRef] [PubMed]
  3. Chen, X.; Zhu, X.; Liu, C. Real-Time 3D Reconstruction of UAV Acquisition System for the Urban Pipe Based on RTAB-Map. Appl. Sci. 2023, 13, 13182. [Google Scholar] [CrossRef]
  4. Iorio, F.D.; Sebar, L.E.; Croci, S.; Taverni, F.; Auenmüller, J.; Pozzi, F.; Grassini, S. The Use of Virtual Reflectance Transformation Imaging (V-RTI) in the Field of Cultural Heritage: Approaching the Materiality of an Ancient Egyptian Rock-Cut Chapel. Appl. Sci. 2024, 14, 4768. [Google Scholar] [CrossRef]
  5. Lu, X.L. Hybrid Physics Machine Learning for Ultrasonic Field Guided 3D Generation and Reconstruction of Rail Defects. NDT E Int. 2024, 146, 103174. [Google Scholar] [CrossRef]
  6. Zhao, S. Intelligent Structural Health Monitoring and Noncontact Measurement Method of Small Reservoir Dams Using UAV Photogrammetry and Anomaly Detection. Appl. Sci. 2024, 14, 9156. [Google Scholar] [CrossRef]
  7. Schönberger, J.L.; Frahm, J.-M. Structure-from-Motion Revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113. [Google Scholar] [CrossRef]
  8. Snavely, N.; Seitz, S.M.; Szeliski, R. Modeling the World from Internet Photo Collections. Int. J. Comput. Vis. 2008, 80, 189–210. [Google Scholar] [CrossRef]
  9. Zhang, Q.; Cao, Y. Research on Three-Dimensional Reconstruction Algorithm of Weak Textured Objects in Indoor Scenes. Laser Optoelectron. Prog. 2021, 58, 0810017. [Google Scholar] [CrossRef]
  10. Tu, F.; Wei, S. Three-dimensional reconstruction of repetitive structural components. J. Wuhan Univ. Sci. Technol. 2023, 46, 210–215. [Google Scholar]
  11. Chen, W.; Tian, Q.; Hu, T. A Method for Drone Trajectory Planning in 3D Reconstruction of Disaster Scenario Architecture. J. Sun Yat-Sen Univ. (Nat. Sci. Ed.) (Chin. Engl.) 2024, 63, 78–84. [Google Scholar]
  12. Jiang, S.; Ma, Y.; Li, Q.; Jiang, W.; Guo, B.; Wang, L. Parallel SfM-based 3D reconstruction for unordered UAV images. Acta Geod. Cartogr. Sin. 2024, 53, 946–958. Available online: http://xb.chinasmp.com/CN/10.11947/j.AGCS.2024.20230335 (accessed on 25 November 2024).
  13. Zhang, H.; Wei, Q.; Zhang, W.; Wu, J.; Jiang, Z. Sequential-image-based space object 3D reconstruction. J. Beijing Univ. Aeronaut. Astronaut. 2016, 42, 273–279. [Google Scholar] [CrossRef]
  14. Kataria, R.; DeGol, J.; Hoiem, D. Improving Structure from Motion with Reliable Resectioning. In Proceedings of the 2020 International Conference on 3D Vision (3DV), Fukuoka, Japan, 25–28 November 2020; pp. 41–50. [Google Scholar] [CrossRef]
  15. Zach, C.; Klopschitz, M.; Pollefeys, M. Disambiguating Visual Relations Using Loop Constraints. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 1426–1433. [Google Scholar] [CrossRef]
  16. Wilson, K.; Snavely, N. Network Principles for SfM: Disambiguating Repeated Structures with Local Context. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013; pp. 513–520. [Google Scholar] [CrossRef]
  17. Griwodz, C.; Gasparini, S.; Calvet, L. AliceVision Meshroom: An Open-Source 3D Reconstruction Pipeline. In Proceedings of the 12th ACM Multimedia Systems Conference, Istanbul, Turkey, 28 September–1 October 2021; pp. 241–247. [Google Scholar] [CrossRef]
  18. Zhang, Z. Flexible Camera Calibration by Viewing a Plane from Unknown Orientations. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; pp. 666–673. [Google Scholar] [CrossRef]
  19. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  20. Bentley, J.L. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM 1975, 18, 509–517. [Google Scholar] [CrossRef]
  21. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  22. Bay, H.; Ess, A.; Tuytelaars, T.; Gool, L.V. SURF: Speeded-Up Robust Features. Eur. Conf. Comput. Vis. ECCV 2006, 3951, 404–417. [Google Scholar] [CrossRef]
  23. Gao, C.; Li, W.; Tao, R.; Du, Q. MS-HLMO: Multiscale Histogram of Local Main Orientation for Remote Sensing Image Registration. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5626714. [Google Scholar] [CrossRef]
  24. Wu, C. Towards Linear-Time Incremental Structure from Motion. In Proceedings of the 2013 International Conference on 3D Vision, Seattle, WA, USA, 29 June–1 July 2013; pp. 127–134. [Google Scholar] [CrossRef]
  25. Im, S.; Ha, H.; Choe, G.; Jeon, H.-G.; Joo, K.; Kweon, I.S. High Quality Structure from Small Motion for Rolling Shutter Cameras. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 837–845. [Google Scholar] [CrossRef]
Figure 1. Coordinate systems of the pinhole camera model: (a) camera coordinate system; (b) image coordinate system; (c) pixel coordinate system.
Figure 2. The overall framework of the proposed method.
Figure 3. Camera calibration schematic.
Figure 4. Example image of feature point segmentation: (a) establishment of KD tree; (b) plane segmentation corresponding to KD tree.
Figure 5. Flowchart of the Enhanced RANSAC Matching Optimization Algorithm with Automatic Threshold Adjustment.
Figure 6. Schematic diagram of the improved incremental SFM algorithm for similarity measurement.
Figure 7. Flow chart of feature matching autonomous selection strategy.
Figure 8. Flowchart for Image Continuity Optimization Adjustment Algorithm.
Figure 9. Optimization and adjustment algorithms for enhancing image continuity issues: (a) The structural similarity of cross-temporal images; (b) The images exhibit temporal coherence in the context of the immersive acquisition framework.
Figure 10. Self-collected data.
Figure 11. Analysis of the effectiveness of the improved RANSAC matching result optimization algorithm on self-collected data Statue1: (a) feature matching results without RANSAC optimization; (b) feature matching results optimized by the improved RANSAC algorithm. The green lines represent the optimization results of the original RANSAC algorithm, while the red lines indicate the additional matched point pairs obtained after optimization with the proposed algorithm.
Figure 12. Feature detection results of the acquired image from self-collected data Statue1: (a) original image; (b) feature detection results.
Figure 13. Comparison of SIFT algorithm matching performance on self-collected data Statue1: (a) SIFT + BF; (b) SIFT + BF + RANSAC; (c) SIFT + KD-Tree; (d) SIFT + KD-Tree + our RANSAC. The green lines represent the optimization results of the original RANSAC algorithm, while the red lines indicate the additional matched point pairs obtained after optimization with the proposed algorithm.
Figure 14. Matching results of various feature matching algorithms on self-collected data. The green lines show the original algorithm results; the yellow lines, the comparative algorithm matches; the red sections highlight the additional matches from the optimized algorithm proposed in this paper.
Figure 15. Quantitative analysis of various matching algorithms in self-collected data: (a) The number of matched point pairs across four sets of self-collected data using various matching algorithms; (b) The accuracy of various matching algorithms in four sets of self-collected data.
Figure 16. Comparison of matching views before and after improvements to the COLMAP algorithm: (a1,b1,c1,d1) matched views of the original COLMAP algorithm across the four datasets; (a2,b2,c2,d2) matched views of the enhanced method described in this paper across the four datasets.
Figure 17. Sparse reconstruction results of the self-collected data Statue1 under various algorithms.
Figure 18. Sparse reconstruction results of the self-collected data Statue2 under various algorithms.
Figure 19. Sparse reconstruction results of the self-collected data Archi1 under various algorithms.
Figure 20. Sparse reconstruction results of the self-collected data Archi2 under various algorithms.
Figure 21. The dense reconstruction results of the self-collected data Statue1 under various algorithms.
Figure 22. The dense reconstruction results of the self-collected data Statue2 under various algorithms.
Figure 23. The dense reconstruction results of the self-collected data Archi1 under various algorithms.
Figure 24. The dense reconstruction results of the self-collected data Archi2 under various algorithms.
Figure 25. Quantitative results of various dense reconstruction algorithms applied to four sets of self-collected data: (a) Comparison of Dense Reconstruction Point Cloud Quantities; (b) Comparison of Dense Reconstruction Time.
Table 1. Equipment parameter specification (parameter description of the shooting equipment).
| Parameter | Value |
| Magnification factor | — |
| Selected lens | 50 million pixel lens |
| Aperture | F2.4 |
| ISO | 600 |
| Image size | 4096 × 3072 |
Table 2. Detailed information on self-collected data.
| Name | Number of Images | Resolution |
| Statue1 | 30 | 4096 × 3072 |
| Statue2 | 40 | 4032 × 2268 |
| Archi1 | 60 | 4032 × 2268 |
| Archi2 | 100 | 4032 × 2268 |
Table 3. Quantitative results of the autonomous threshold adjustment RANSAC algorithm.
| Data | Number of Matched Point Pairs | Original RANSAC: Fixed Threshold | Original RANSAC: Optimized Feature Match Pairs | Our RANSAC: Adaptive Threshold | Our RANSAC: Optimized Feature Match Pairs |
| Statue1 | 10,316 | 40 | 5050 | 50 | 9136 |
| Statue2 | 3250 | 40 | 1506 | 45 | 2365 |
| Archi1 | 5262 | 40 | 2778 | 65 | 4033 |
| Archi2 | 5703 | 40 | 2859 | 60 | 4893 |
The bolded data represents the optimal results for each section.
Table 4. Performance of SIFT Algorithm Matching under Self-Collected Data Statue1.
| Algorithmic Ensemble | Time/s | Number of Matched Point Pairs | Accuracy of Matches ↑ |
| SIFT + BF | 84.523 | 12,356 | 40.871% |
| SIFT + BF + RANSAC | 86.147 | 5050 | |
| SIFT + KD-Tree | 26.634 | 10,316 | 88.561% |
| SIFT + KD-Tree + our RANSAC | 26.953 | 9136 | |
The arrow indicates that a higher match accuracy is preferable; the bolded data represents the optimal results for each section.
Table 5. The number of matched point pairs across four sets of self-collected data using various matching algorithms.
| Matching Algorithm | Statue1 | Statue2 | Archi1 | Archi2 |
| SIFT | 5050 | 2138 | 2568 | 3209 |
| SURF | 69 | 229 | 44 | 65 |
| MS-HLMO | 719 | 424 | 151 | 312 |
| Ours | 9136 | 1782 | 4090 | 3382 |
The arrow indicates that a higher number of matched point pairs is preferable; the bolded data represents the optimal results for each section.
Table 6. The accuracy of various matching algorithms in four sets of self-collected data.
| Matching Algorithm | Statue1 | Statue2 | Archi1 | Archi2 |
| SIFT | 40.87% | 52.46% | 33.54% | 50.28% |
| SURF | 1.97% | 6.54% | 1.26% | 1.86% |
| MS-HLMO | 55.14% | 23.91% | 22.34% | 27.21% |
| Ours | 88.56% | 75.35% | 71.72% | 64.27% |
The higher the data values, the better; the bolded data represents the optimal results for each group.
Table 7. Comparison of Results for Enhanced Matching Views.
| Name | Number | N_C (Original Method) ↓ | N_C (Ours) ↓ |
| Statue1 | 30 | 465 | 170 |
| Statue2 | 45 | 1035 | 206 |
| Archi1 | 60 | 1830 | 448 |
| Archi2 | 100 | 5050 | 798 |
The arrow indicates that a lower number of matched view edges is preferable, with the bolded data representing the optimal results.
Table 8. Comparison of quantitative results of various sparse reconstruction algorithms across four sets of self-collected data.
| Data | SfSM R_e ↓ | SfSM N ↑ | VSFM R_e ↓ | VSFM N ↑ | Colmap R_e ↓ | Colmap N ↑ | Ours R_e ↓ | Ours N ↑ |
| Statue1 | 4.35 | 40,973 | 3.25 | 60,588 | 1.21 | 17,566 | 0.98 | 78,986 |
| Statue2 | 9.96 | 32,000 | 2.48 | 100,257 | 1.12 | 42,858 | 1.06 | 220,861 |
| Archi1 | 8.56 | 46,749 | 2.69 | 114,350 | 1.11 | 33,599 | 1.03 | 570,330 |
| Archi2 | 11.56 | 34,284 | 2.32 | 182,470 | 1.33 | 74,195 | 1.58 | 500,247 |
R_e denotes the reprojection error and N the number of sparse points. The direction of the arrow indicates the optimal trend of the metric, while the bolded data signifies the best results.
Table 9. The quantity of dense reconstructed point clouds across four datasets using various matching algorithms.
| Data | VSFM N ↑ | VSFM Time/s | Colmap N ↑ | Colmap Time/s | Ours N ↑ | Ours Time/s |
| Statue1 | 424,949 | 65.732 | 1,059,895 | 3085.794 | 3,033,703 | 2163.256 |
| Statue2 | 911,398 | 252.537 | 1,571,687 | 2948.587 | 2,964,349 | 2489.657 |
| Archi1 | 631,484 | 439.327 | 2,476,828 | 2628.669 | 3,909,707 | 3252.891 |
| Archi2 | 840,909 | 605.858 | 4,749,228 | 3785.947 | 8,574,136 | 3459.859 |
N denotes the number of dense points. The arrow indicates that a higher quantity of point clouds is preferable; the bolded data represents the optimal results for each section.