1. Introduction
Three-dimensional (3D) optical measurement has been broadly applied in various fields, such as biomedicine, computer vision, and industrial manufacturing, owing to its non-contact nature, low cost, and high efficiency. Structured light techniques are among the most popular 3D measurement methods due to their accuracy and versatility. Fringe projection profilometry (FPP) utilizes a structured light system to project sinusoidal fringe patterns onto the measured objects and captures the modulated patterns to calculate height maps. Most fringe analysis methods retrieve the phase from a set of phase-shifting fringe patterns and perform temporal phase unwrapping using multi-frequency fringe patterns [1,2].
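As background for the methods reviewed below, an $N$-step phase-shifting algorithm recovers the wrapped phase from $N$ fringe images $I_n = A + B\cos(\phi - 2\pi n/N)$. The following is a minimal NumPy sketch of this standard computation (the function and variable names are ours, not taken from any cited work):

```python
import numpy as np

def wrapped_phase(images):
    """Wrapped phase from N equally phase-shifted fringe images.

    images: stack of N images following I_n = A + B*cos(phi - 2*pi*n/N).
    Returns the wrapped phase phi in (-pi, pi].
    """
    imgs = np.asarray(images, dtype=np.float64)
    n = np.arange(len(imgs)).reshape(-1, 1, 1)
    delta = 2.0 * np.pi * n / len(imgs)          # phase shift of each frame
    num = np.sum(imgs * np.sin(delta), axis=0)   # proportional to B*sin(phi)
    den = np.sum(imgs * np.cos(delta), axis=0)   # proportional to B*cos(phi)
    return np.arctan2(num, den)
```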
Research on FPP has mostly focused on the improvement of phase-shifting techniques and temporal phase unwrapping. Yu et al. [3] proposed a 3D measurement method based on the unequal-period combination of shifting Gray code and dual-frequency phase-shifting fringes. This method additionally projects a set of low-frequency phase-shifting fringe patterns to effectively correct the period jump error. The experimental results show that the method can effectively reduce the period jump error and can measure multiple isolated objects, as well as objects with drastic changes in surface height, such as plaster heads. Peng et al. [4] produced sinusoidal fringes for high-speed three-dimensional shape measurement using a phase-shifting algorithm. In this method, the sinusoidal fringes are generated by expanding the inverted image of a filled binary sinusoidal pattern in a specified direction, and the phase shift is realized by switching the backlight sources of a set of binary sinusoidal patterns formed on a slide. Hu et al. [5] adopted a multi-frequency phase-shifting scheme for the microscopic 3D measurement of shiny surfaces. The scheme improves the integrity of the final phase map of the shiny surface, on which basis a complete and high-accuracy 3D reconstruction can be achieved in combination with a microscopic telecentric stereo system. Wu et al. [6] utilized phase-shifting profilometry (PSP) to realize temporal phase unwrapping with the lowest number of fringe patterns. This method does not require additional structural patterns to achieve full-field phase unwrapping and can measure discontinuous surfaces or multiple isolated objects simultaneously. Li et al. [7] proposed a high-accuracy temporal phase unwrapping method based on super-grayscale multi-frequency grating projection, exploiting the time-division multiplexing characteristic of a projector and the integral characteristic of a CCD camera. In this method, a super-grayscale grating is designed in place of the traditional 256-grayscale multi-frequency gratings to reduce the digital error. Li et al. [8] adopted a dynamic 3D reconstruction framework based on a modified three-wavelength phase unwrapping algorithm and a phase error compensation method to acquire a sufficient number of 3D results, ensure the continuity of dynamic 3D shape measurement, and reduce the phase and measurement errors introduced by object motion. Li et al. [9] used redundant data for self-correction in multi-frequency phase unwrapping. Pistellato et al. [10] proposed eliminating phase ambiguity using probabilistic consensus, where phase values were modeled with a wrapped Gaussian distribution. Lilienblum and Michaelis [11] derived a phase unwrapping method based on pattern sequences that reduces calculation errors caused by discontinuity, occlusion, and reflection.
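For context, a common form of multi-frequency temporal phase unwrapping scales an already-absolute low-frequency phase to predict the fringe order of the high-frequency phase. A minimal sketch in our own notation, not tied to any specific cited method:

```python
import numpy as np

def temporal_unwrap(phi_high, phi_low_abs, f_high, f_low):
    """Hierarchical temporal phase unwrapping between two frequencies.

    phi_high:      wrapped high-frequency phase in (-pi, pi]
    phi_low_abs:   absolute (already unwrapped) low-frequency phase
    f_high, f_low: numbers of fringes in the two projected patterns
    """
    # Fringe order: integer number of 2*pi periods separating the scaled
    # low-frequency phase from the wrapped high-frequency phase.
    k = np.round((phi_low_abs * f_high / f_low - phi_high) / (2.0 * np.pi))
    return phi_high + 2.0 * np.pi * k
```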
As deep learning techniques evolve, deep convolutional neural networks (CNNs) have been applied to 3D measurement. Qi et al. [12] proposed an absolute phase measurement method with a limited number of patterns. The method combines object reflectivity correction with a half-period gray-coded phase unwrapping algorithm and can obtain a large number of codewords for fringe orders without reducing the intensity level of each stair. Yao et al. [13] first proposed a multi-purpose neural network combined with code-based patterns to recover the absolute phase. The multi-purpose network can learn the principle of extracting the absolute phase from a small number of patterns and greatly decreases the number of required patterns while maintaining high accuracy.
These abovementioned methods require multi-shot phase-shifting fringe images at multiple frequencies. Therefore, 3D measurements based on these methods are time-consuming and inaccurate when applied to dynamic scenes. Although multi-shot methods provide highly accurate measurement data for static objects by capturing multiple fringe pattern images, the accuracy may be degraded by vibration and movement between image shots, rendering these methods vulnerable to temporal noise.
Single-shot methods extract phase maps from one fringe image and are robust against movement. Thus, single-shot methods are desirable for dynamic 3D measurement. Traditional single-shot 3D measurement methods utilize spatial demodulation, including Fourier transform profilometry (FTP), windowed Fourier transform profilometry (WFTP), and wavelet transform profilometry (WTP) [14,15,16]. Single-shot methods using specifically designed patterns have also been proposed; in contrast to methods with complex patterns, Kawasaki et al. [17] utilized a simple grid pattern to achieve dense shape reconstruction. Single-shot methods based on deep learning are more accurate and robust than traditional methods and are thus more viable for practical applications. These methods were first introduced into FPP to retrieve phase maps. Feng et al. [18] demonstrated that deep learning can improve the accuracy of phase demodulation from a single fringe pattern. They used twelve-step phase-shifting techniques to generate ground truth and trained two CNNs for fringe analysis, where CNN1 predicted the background image and CNN2 predicted the numerator and denominator for phase calculation. Their results were superior to those of FTP and WFTP. Qiao et al. [19] presented a single-shot phase retrieval method that reconstructs the phase distribution of specular objects by using deep learning. The networks are built on the ideas of depthwise separable convolution and inverted residuals. The method estimates results closer to the ground truth and effectively retains details on the measured surface. Yu et al. [20] designed the FPTNet for fringe transformation rather than setting phase maps as the output. The network requires only a single fringe image as input: the FPTNet transforms one fringe image into multiple phase-shifting fringe images, and phase retrieval is then achieved through phase-shifting techniques. Nguyen et al. [21] integrated a fringe-to-phase network with single-shot FPP to achieve 3D reconstruction. The proposed fringe-to-phase network has an architecture similar to that of U-Net and can directly retrieve three wrapped phase maps from a color image comprising three fringe patterns with designated frequencies. Zhang et al. [22] designed a convolutional neural network that accurately extracts phase information in both low signal-to-noise ratio (SNR) and saturation situations, increasing the dynamic range of 3D measurement. Qian et al. [23] proposed a single-shot absolute 3D shape measurement method with deep-learning-based color FPP. Trained on extensive data sets, the neural network can predict a high-resolution, motion-artifact-free, and crosstalk-free absolute phase directly from a single color fringe image. Qian et al. [24] presented deep-learning-enabled geometric constraints and a phase unwrapping approach for single-shot absolute 3D shape measurement. This method generates more accurate absolute phase maps than spatial phase unwrapping.
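Among the traditional single-shot approaches, FTP is the simplest to illustrate: the fringe carrier shifts the phase information to a side lobe in the spectrum, which can be isolated and demodulated. Below is an idealized NumPy sketch of this principle (parameter names are ours; practical implementations need careful filter design and carrier removal):

```python
import numpy as np

def ftp_wrapped_phase(fringe, carrier_col, half_width):
    """Single-shot FTP demodulation for vertical fringes.

    fringe:      2D fringe image
    carrier_col: column of the +1 carrier lobe in the row-wise FFT
    half_width:  half width of the band-pass window around the carrier
    Returns the wrapped phase (still containing the carrier term, which is
    usually removed with a reference-plane measurement).
    """
    spectrum = np.fft.fft(fringe, axis=1)            # 1D FFT across the fringes
    mask = np.zeros(fringe.shape[1])
    mask[carrier_col - half_width: carrier_col + half_width + 1] = 1.0
    analytic = np.fft.ifft(spectrum * mask, axis=1)  # keep only the +1 lobe
    return np.angle(analytic)
```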
End-to-end networks can also directly predict height maps from a single-shot fringe image. Van der Jeught and Dirckx [25] designed an end-to-end neural network for single-shot measurement. They randomly generated a large number of height maps and collected fringe images using simulated fringe projection. These images, with the corresponding height maps as ground truth, compose the data set for network training. Nguyen et al. [26] adopted a four-step phase-shifting technique to produce ground truth height maps using a real-world FPP system. The input of the technique is a single fringe-pattern image, and the output is the corresponding depth map for 3D shape reconstruction. They compared the performance of different network architectures, including FCN, AEN, and U-Net; U-Net obtained the most impressive results on their fringe projection data set. Machineni et al. [27] introduced an end-to-end deep-learning-based framework for FPP that does not require any frequency-domain filtering or phase unwrapping. The framework reconstructs the depth profile of the object from the deformed fringe itself through multi-resolution similarity evaluation using a convolutional neural network. Nguyen et al. [28] presented a 3D shape reconstruction technique that employs an end-to-end deep convolutional neural network to transform a single speckle-pattern image into its corresponding 3D point cloud, with the ground truth height maps measured by FPP.
However, deep learning methods usually require large data sets for training; data insufficiency results in overfitting and hinders performance. In the case of FPP, the commonly used supervised learning methods require ground truth height maps or phase maps for training, which are measured using phase-shifting and temporal phase unwrapping. To produce training samples with accurate ground truth, some studies adopted twelve-step phase-shifting with fringe patterns at four frequencies, in which 48 fringe images were captured for one training sample. This is a highly time-consuming process and is inconvenient for practical applications. To generate more training samples, previous works utilized computer graphics to generate synthetic data sets for supervised training or re-projected fringe patterns for unsupervised training. Zheng et al. [29] proposed constructing a digital twin of the FPP system and conducting virtual scanning, which generated 7200 fringe images and 800 corresponding 3D scenes in 1.5 h. Wang et al. [30] presented a single-shot fringe projection profilometry method based on deep learning and computer graphics. They built a virtual FPP system and tested different parameters to construct a sufficient data set. To estimate the depth image from only one fringe image, a new loss function was designed, and two network architectures, U-Net and pix2pix, were compared. Fan et al. [31] developed an unsupervised training method for 3D reconstruction with dual-frequency fringe projection profilometry that does not require ground truth in the training set. They re-projected fringe patterns using the height maps predicted by the network and calculated the loss between the re-projected fringe images and the real fringe images. The methods that use computer graphics to generate synthetic data sets are less time-consuming in the data set preparation stage but cannot avoid accuracy loss when applied to real-world FPP systems. Synthetic data sets and re-projection processes are unable to account for the various textures and noises in real-world fringe images; this domain discrepancy has a negative influence on network performance.
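The re-projection idea underlying such unsupervised training can be summarized in a few lines: the network prediction is rendered back into a synthetic fringe image and compared with the captured one. The following PyTorch sketch is our illustration of this general principle, not the exact loss of Fan et al. [31] or of this paper; the background and modulation terms are assumed to be estimated separately.

```python
import torch

def reprojection_loss(pred_abs_phase, fringe, background, modulation):
    """Unsupervised loss: re-render a fringe image from the predicted phase.

    pred_abs_phase:        absolute phase map predicted by the network
    fringe:                captured fringe image
    background, modulation: estimated A and B of the model I = A + B*cos(phi)
    """
    reprojected = background + modulation * torch.cos(pred_abs_phase)
    return torch.mean(torch.abs(reprojected - fringe))
```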
To overcome the dependence on static scenes and the time-consuming nature of multi-shot methods, as well as the high data requirements of deep-learning-based single-shot methods, we propose a single-shot 3D measurement method using a deep neural network named the Fringe Analysis Network (FrANet). The FrANet consists of three subnetworks for fringe analysis, namely a phase retrieval subnetwork, a phase unwrapping subnetwork, and a refinement subnetwork, replacing the phase retrieval and phase unwrapping steps of traditional FPP. Phase unwrapping, or height map prediction that integrates phase unwrapping, is an ill-posed task that requires long-range information for the accurate prediction of absolute phase maps. U-Net is capable of extracting features and recovering the image resolution and is thus efficient in processing high-resolution images; however, a generic CNN or a single U-Net cannot exploit long-range information due to its small receptive field. Therefore, instead of using a single U-Net, we adopt two subnetworks for phase unwrapping: the phase unwrapping subnetwork extracts long-range information through additional layers, and the refinement subnetwork provides further refinement. Specifically, the phase retrieval subnetwork extracts wrapped phase maps from single-shot fringe images. The phase unwrapping subnetwork analyzes the predicted wrapped phase maps together with the fringe images to yield absolute phase maps. The refinement subnetwork takes the wrapped phase maps, the fringe images, and the primary prediction of the absolute phase maps as inputs; the refined absolute phase maps are the final output of the FrANet. To solve the problems of overfitting and poor performance caused by insufficient samples, the FrANet is pre-trained on an unsupervised data set with fringe pattern re-projection and fine-tuned on a supervised data set with ground truth phase maps. This training strategy lowers the number of ground truth phase maps required, saves time during data collection, and maintains the accuracy of supervised methods in real-world setups.
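The data flow between the three subnetworks can be sketched as follows in PyTorch. The `UNet` constructor and the channel counts are placeholders for the actual architectures described in Section 2; only the wiring of inputs and outputs reflects the description above.

```python
import torch
import torch.nn as nn

class FrANetSketch(nn.Module):
    """Wiring of the three FrANet subnetworks (architectures abstracted away)."""

    def __init__(self, UNet):
        super().__init__()
        self.phase_retrieval = UNet(in_ch=1, out_ch=1)    # fringe -> wrapped phase
        self.phase_unwrapping = UNet(in_ch=2, out_ch=1)   # fringe + wrapped -> absolute
        self.refinement = UNet(in_ch=3, out_ch=1)         # all inputs -> refined absolute

    def forward(self, fringe):
        wrapped = self.phase_retrieval(fringe)
        absolute = self.phase_unwrapping(torch.cat([fringe, wrapped], dim=1))
        refined = self.refinement(torch.cat([fringe, wrapped, absolute], dim=1))
        return wrapped, absolute, refined
```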
The contributions can be summarized as follows: (1) a single-shot 3D measurement method based on deep learning is proposed, which enables accurate 3D measurements in dynamic scenes; (2) a deep network named the FrANet, with three subnetworks for fringe analysis, is designed to improve the accuracy of 3D measurements; (3) a two-stage training strategy for the FrANet is developed to reduce the number of supervised samples, which enables the efficient deployment of the proposed method in practical applications.
The rest of this paper is organized as follows: Section 2 describes the detailed architecture and training strategy of the FrANet; Section 3 presents the analysis of the experiments and the results; Section 4 presents the discussion, followed by the conclusion in Section 5.
3. Experiments
To verify the effectiveness of the proposed method, an FPP system consisting of a blue light projector from TengJu and digital cameras from DaHeng Mercury was built. The system is shown in Figure 5a. Only the left camera was used in the experiments. Figure 5b,c displays the samples for 3D measurement, including some automotive parts. The industrial part in Figure 5b belongs to the test set for evaluation, while the part in Figure 5c is used in the two-stage training. The schematic of the experimental setup is shown in Figure 6. The camera and projector are connected to the computer for coding fringe patterns and triggering data acquisition. The camera captures three-channel images. During data acquisition, the projector projects vertical blue fringe patterns onto the measured objects. The air temperature during the experiments was around 20 °C. In the unsupervised training stage, 2000 groups of fringe images were collected. Each group contains fringe images of the same scene at two frequencies, where the frequency is defined as the total number of fringes in the projected fringe pattern. The high and low frequencies were set to 64 and 9, respectively. In the supervised training stage, only 120 groups of twelve-step phase-shifting fringe images were captured at the four frequencies of 64, 16, 4, and 1. The test set, comprising 20 groups of fringe images, was split from the supervised data set. In the following sections, the implementation details of the FrANet are first provided; then, the accuracy of the results predicted by the network is evaluated, ablation studies on the training stages and the FrANet architecture are presented, and, lastly, the data efficiency of the proposed method is analyzed.
3.1. Implementation Details
The proposed method was implemented in PyTorch, and training was performed on a single RTX 2070 SUPER GPU. The Adam optimizer was used with the default parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, since the default parameters are generally adopted in deep network optimization except in rare cases, such as generative adversarial networks, which commonly use $\beta_1 = 0.5$. The Adam optimizer is less sensitive to the initial learning rate than the SGD optimizer. The initial learning rate was selected from four candidate values according to the results after one epoch of training, and the learning rate was further reduced whenever the network loss stopped decreasing for five epochs. During pre-training, the learning rate was decreased after 30 epochs, and the whole pre-training process took 60 epochs. The images were cropped to a fixed size. During fine-tuning, the network was trained at the initial learning rate for 200 epochs and at a reduced learning rate for another 200 epochs. A batch size of two and multiple data augmentations were adopted. After training, the FrANet was used for single-shot measurements with a run time of 0.87 s on the same setup. All deep networks used for comparison were trained with the same hyperparameters.
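For reproducibility, the optimizer and plateau-based schedule described above map directly onto standard PyTorch components. The snippet below is a sketch under our own assumptions; the stand-in model and the learning-rate value are placeholders, since the exact values are not reproduced here.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 1, 3, padding=1)  # stand-in for the FrANet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,   # placeholder lr
                             betas=(0.9, 0.999))            # PyTorch defaults
# Reduce the learning rate when the loss stops decreasing for five epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=5)

for epoch in range(60):  # pre-training runs for 60 epochs
    epoch_loss = 0.0
    # ... forward pass, unsupervised loss, backward pass over the data set ...
    scheduler.step(epoch_loss)
```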
The network loss on the training and test sets during the unsupervised and supervised training stages is shown in Figure 7. The loss converged after both training stages, and the test loss was only slightly larger than the training loss, indicating that the proposed training strategy avoided overfitting.
3.2. Accuracy Evaluation
After the two-stage training, the accuracy of the FrANet was evaluated. The test set, comprising 20 groups of fringe images, was fed into the network, and the error between the predicted phase maps and the ground truth obtained using twelve-step phase-shifting techniques was calculated. The fringe images in the test set did not contain any object that appeared in the training set, thus demonstrating the generalization capacity of the FrANet. Figure 8 displays the results for one fringe image in the test set, where (a) shows the original fringe image, (b) and (c) show the ground truth phase maps, and (d)–(f) show the network results: the predicted wrapped phase map, the predicted absolute phase map, and the refined absolute phase map. The FrANet predicts an accurate wrapped phase map from the fringe image and performs phase unwrapping. In the first predicted absolute phase map, the shape of the measured object is less clear than in the refined phase map, which verifies the effectiveness of the refinement subnetwork. The refined absolute phase map is similar to the ground truth in the valid region, which means that the FrANet prediction effectively captures the shape of the measured object and provides results similar to those of phase-shifting profilometry for most parts.
To visualize the accuracy of the proposed method, the error maps of the predicted wrapped phase map, the predicted absolute phase map, and the refined absolute phase map are shown in Figure 9. The error maps are calculated from the same results as in Figure 8, except that the background is eliminated to focus on the measured objects. The absolute error between the predicted and ground truth phase maps is displayed for each pixel. In addition to the error maps, the mean absolute error (MAE) over the entire test set is also calculated, as shown in Table 1. Since the phase unwrapping subnetwork predicts an absolute phase map instead of fringe orders, it can also refine the retrieved wrapped phase map. Unlike in classic FPP, where the accuracy of the absolute phase maps is bounded by the accuracy of the wrapped phase maps, the FrANet retains the information of the fringe images after the wrapped phase maps are predicted, and further refinement is conducted in the hidden layers throughout the network. There is a relatively large error in the primary wrapped phase map, which is reduced by the phase unwrapping process. The refinement subnetwork further improves the performance and eliminates fringe-shaped errors in the phase maps. The MAE of the final output is 0.0114 rad. These results indicate that the proposed method performs high-accuracy 3D measurements and avoids overfitting.
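For clarity, the error maps and MAE reported here correspond to the following straightforward computation (a sketch with our own variable names; the valid-region mask is assumed to come from background segmentation):

```python
import numpy as np

def phase_errors(pred, gt, valid_mask):
    """Per-pixel error map and MAE of a predicted phase map (radians)."""
    err_map = np.where(valid_mask, np.abs(pred - gt), 0.0)  # background zeroed
    mae = np.abs(pred - gt)[valid_mask].mean()              # MAE over valid pixels
    return err_map, mae
```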
After the absolute phase maps were obtained, the 3D shapes of the measured objects were reconstructed using both the ground truth phase maps and the phase maps predicted by the network. The 3D coordinates were compared to calculate the error of the 3D measurements. The results for the test set have a root mean square error (RMSE) of 0.67 mm. For a data set comprising various industrial parts, this accuracy is desirable in single-shot measurement.
To test the generalization ability of the proposed method, the single-shot 3D shape reconstruction data set proposed by Nguyen et al. [26] was used for fine-tuning and evaluation. After unsupervised learning on our data set, the pre-trained model was fine-tuned on that data set. The fringe image and the wrapped phase map calculated using the proposed method are shown in Figure 10. To adapt to the end-to-end format of this data set, a convolutional layer was added to transform the phase maps into height maps. Using the proposed method, the RMSE on the test set was 0.94 mm, compared to 1.62 mm in the original paper [26].
3.3. Dynamic Scene
Since only one fringe image is required as the network input, the proposed method can be applied to dynamic measurements. For the evaluation of dynamic scenes, the measured object was suspended as a pendulum, and fringe images were captured near the bottom of the swing, where the object reached its highest speed, which was estimated using a simple pendulum model. To measure the absolute error without using phase-shifting results as a reference, standard spheres were adopted for evaluation. A diagram and a dynamic scene of the standard sphere pendulum are shown in Figure 11. The diameter of the spheres is 30 mm within an error of 2 μm; this uncertainty of the reference diameter is specified by the manufacturer of the DS-DCB-D30L100 standard spheres. For a pendulum of length $L$, standard spheres released at a horizontal deviation $d$ reach an estimated maximum speed of $v_{\max}=\sqrt{2g\left(L-\sqrt{L^{2}-d^{2}}\right)}$, where $g$ is the gravitational acceleration.
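This maximum-speed estimate follows from energy conservation: the bob drops by a height $h = L - \sqrt{L^{2}-d^{2}}$, so $v_{\max}=\sqrt{2gh}$. A small Python check follows; the pendulum length of 0.7 m is our assumption for illustration, as the exact length is not reproduced in this excerpt.

```python
import numpy as np

def pendulum_max_speed(L, d, g=9.81):
    """Maximum speed at the bottom of a pendulum released at horizontal deviation d.

    L, d in meters; returns speed in m/s via v = sqrt(2*g*h), with
    h = L - sqrt(L**2 - d**2) the vertical drop of the bob.
    """
    h = L - np.sqrt(L**2 - d**2)
    return np.sqrt(2.0 * g * h)

# With an assumed length L = 0.7 m, the five release deviations give
# roughly 0.375, 0.565, 0.757, 0.952, and 1.569 m/s.
for d in (0.10, 0.15, 0.20, 0.25, 0.40):
    print(d, round(pendulum_max_speed(0.7, d), 3))
```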
The results of the dynamic scene measurements are shown in Figure 12. The spheres were released at horizontal deviations of 10.0 cm, 15.0 cm, 20.0 cm, 25.0 cm, and 40.0 cm, corresponding to maximum speeds of 0.376 ± 0.004 m/s, 0.567 ± 0.004 m/s, 0.759 ± 0.004 m/s, 0.955 ± 0.004 m/s, and 1.580 ± 0.005 m/s. The predicted wrapped phase maps and absolute phase maps were accurate and free of motion blur. The measured diameters, calculated by sphere fitting on the reconstruction results, were 30.078 mm, 29.946 mm, 29.916 mm, 30.152 mm, and 30.354 mm, with errors of 78 μm, 54 μm, 84 μm, 152 μm, and 354 μm, respectively. The RMSEs between the 3D coordinates of the reconstruction results and the fitted spheres were 49 μm, 32 μm, 56 μm, 88 μm, and 197 μm, respectively. These results show that the proposed method achieves high accuracy in dynamic scenes for objects moving at speeds below 1 m/s.
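The diameter errors above rely on a sphere fit to the reconstructed point cloud. A standard linear least-squares sphere fit is given here as a self-contained sketch (our implementation, not necessarily the one used in the experiments):

```python
import numpy as np

def fit_sphere(points):
    """Least-squares sphere fit to an (N, 3) point cloud.

    Expands (x-a)^2 + (y-b)^2 + (z-c)^2 = r^2 into the linear system
    2ax + 2by + 2cz + k = x^2 + y^2 + z^2 with k = r^2 - a^2 - b^2 - c^2.
    """
    A = np.hstack([2.0 * points, np.ones((len(points), 1))])
    f = np.sum(points ** 2, axis=1)
    (a, b, c, k), *_ = np.linalg.lstsq(A, f, rcond=None)
    center = np.array([a, b, c])
    radius = np.sqrt(k + a * a + b * b + c * c)
    # RMSE of the radial residuals, as reported for the fitted spheres
    rmse = np.sqrt(np.mean((np.linalg.norm(points - center, axis=1) - radius) ** 2))
    return center, 2.0 * radius, rmse  # center, diameter, fit RMSE
```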
3.4. Ablation Study
To provide a clear view of the contribution of each technique, we conducted ablation studies on the two training stages and the three subnetworks. The results of the ablation studies on the training stages are shown in Table 2, using the RMSE between the predicted and ground truth height maps on the test set as the metric. For unsupervised learning only, the pre-trained model without fine-tuning was evaluated; for supervised learning only, the FrANet was trained from scratch on the supervised training set, which has an insufficient number of training samples. Supervised learning alone yielded the worst result due to overfitting. The proposed method reached the lowest RMSE and improved the accuracy of unsupervised training by 68%, which verifies that two-stage training is effective in achieving high accuracy with the given data sets.
Since the FrANet consists of multiple subnetworks, we evaluated the contribution of each subnetwork on the test set individually, as shown in Table 3. In the experiments without the phase retrieval subnetwork, WFTP was utilized to extract the wrapped phase map; spatial phase unwrapping was adopted when testing without the phase unwrapping subnetwork. The refinement subnetwork does not affect the form of the output and thus needs no replacement. The results show that all the subnetworks contributed, and the integration of all three yielded the best performance.
3.5. Comparison with Other Methods
There are different methods for single-shot 3D measurement, such as traditional FTP and WFTP and deep learning methods. In the experiments using the traditional methods, fringe images were captured at the four frequencies of 1, 4, 16, and 64, and temporal phase unwrapping was conducted to obtain the absolute phase maps. The deep learning methods include end-to-end networks and networks used for phase retrieval. The U-Net architecture proposed by Nguyen et al. [26] was adopted as the end-to-end network for comparison. Note that we switched only the network architecture and still used two-stage training for the end-to-end network, in which fringe patterns were re-projected according to the predicted height maps in the unsupervised stage. Networks used for phase retrieval were also compared, including the convolutional network proposed by Feng et al. [18], the FPTNet [20], and the fringe-to-phase network [21]; these networks were also trained in two stages. The results of the proposed method were superior to those of the traditional methods, the deep networks for phase retrieval, and the end-to-end deep learning methods, as shown in Figure 13. The result of each single-shot method was computed as the RMSE between the 3D coordinates of the reconstruction results and the ground truth on the test set, where the ground truth was obtained using twelve-step phase-shifting and temporal phase unwrapping. The FPTNet yielded the second-best performance, but its input includes an additional low-frequency fringe image for phase unwrapping. The proposed method achieved the highest accuracy when processing a single fringe image.
The 3D reconstruction results obtained using the different single-shot methods are shown in Figure 14. The most accurate competing methods, namely WFTP [15] among the traditional methods and the FPTNet [20] among the deep-learning-based methods, were adopted for the qualitative comparison. The result of WFTP contained rough edges and notable shape distortion, indicating a large phase error. The FPTNet recovered the shape of the object well but lost a small portion of pixels in the reconstruction due to phase unwrapping errors. The proposed method yielded the most accurate 3D reconstruction result.
3.6. Data Acquisition Efficiency
We compared the data acquisition efficiency of the proposed method with those of unsupervised and supervised methods. To ensure a fair comparison, the other methods were assumed to have the same number of training samples as the unsupervised data set used in this work. Unsupervised methods require collecting 4000 fringe images (2000 groups at two frequencies), supervised methods require 96,000 fringe images (48 phase-shifting images per group), and the proposed method requires 8800 fringe images (the 4000 unsupervised images plus 100 supervised training groups of 48 images each). These numbers suggest that the proposed method has a data efficiency comparable to that of the unsupervised methods and is far superior to the supervised methods.
5. Conclusions
Multi-shot methods, which mostly adopt phase-shifting techniques, provide highly accurate measurement data for static objects by capturing multiple fringe pattern images. However, their performance may be degraded by disturbance from vibration, and they are time-consuming and inaccurate when applied to dynamic scenes. In this work, a single-shot 3D measurement method using a deep neural network named the FrANet was proposed. The FrANet consists of three subnetworks for fringe analysis: a phase retrieval subnetwork, a phase unwrapping subnetwork, and a refinement subnetwork. This design renders long-range information accessible, which is necessary for accurate phase unwrapping or height map prediction; generic convolutional networks have small receptive fields and cannot extract such long-range information. The phase unwrapping subnetwork in the FrANet acquires long-range information using additional down-sampling layers, and the refinement subnetwork conducts further refinement. All the subnetworks adopt the improved U-Net architecture, which is efficient in processing high-resolution images. To reduce the number of supervised training samples required, a two-stage training strategy was designed, comprising pre-training using unsupervised learning and fine-tuning using supervised learning. Re-projected fringe images were obtained from the network predictions to construct the unsupervised loss, and twelve-step phase-shifting techniques were adopted to acquire the ground truth for supervised learning.

The experimental results obtained using a real-world FPP system indicate that the proposed method achieves accurate single-shot 3D measurements, with an RMSE of 0.67 mm on the test set. The measurements of moving standard spheres verify the effectiveness of the method in dynamic scenes: the measurement errors of the sphere diameter were 78 μm, 54 μm, 84 μm, 152 μm, and 354 μm at speeds of 0.376 ± 0.004 m/s, 0.567 ± 0.004 m/s, 0.759 ± 0.004 m/s, 0.955 ± 0.004 m/s, and 1.580 ± 0.005 m/s, respectively. At most speeds, the standard sphere diameters were measured with an error of around 100 μm, which is considered high accuracy in single-shot 3D measurement. Two-stage training with 8800 fringe images saves time during data acquisition compared to supervised methods. The ablation studies verify the effectiveness of the two training stages and the three subnetworks: two-stage training reduces the error of unsupervised training by 68%, and the network comprising all the subnetworks achieves the highest accuracy. The results of the proposed method are superior to those of FTP, WFTP, and end-to-end networks. Overall, the proposed method achieves a data efficiency comparable to that of unsupervised methods and high accuracy in real-world setups.

Future work includes simplifying the training strategy and using lightweight networks to further accelerate the measurement process. Variations of the network architecture utilizing multiple subnetworks could be explored, and other two-stage training strategies could also be attempted.