In this section, we first benchmark our iterative 3D reconstruction algorithm against existing algorithms from the literature using the IoU metric (Equation (1)). In Section 4.3, we present an evaluation in terms of the instantaneous IoU(t) and demonstrate the problems that arise when naively deploying batch-based algorithms in an online setting. In Section 4.4, we empirically evaluate the variance in reconstruction quality in terms of the selected viewpoints and the order in which they are presented. Lastly, in Section 4.5, we compare the reconstruction quality obtained with different selection algorithms.
4.1. Final Reconstruction Quality
In our first experiment, we compared the quality of the 3D reconstructions obtained by our iterative approach to the reported performances of other state-of-the-art approaches on the R2N2 dataset, as listed in Table 1. In particular, we evaluated the IoU obtained after processing eight views. For the batch-based algorithms, we report the IoU of the (only) reconstruction obtained from the batched input. For the iterative algorithms, we report the IoU of the last reconstruction obtained when sequentially processing the eight images.
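As a point of reference, the IoU of Equation (1) between a predicted occupancy grid and the ground-truth voxel grid can be computed as in the short sketch below (a minimal illustration in NumPy; the threshold value and variable names are our assumptions, not taken from the original implementation).

```python
import numpy as np

def voxel_iou(pred_probs: np.ndarray, gt: np.ndarray, threshold: float = 0.3) -> float:
    """IoU between a predicted occupancy grid and a binary ground-truth grid.

    pred_probs: per-voxel occupancy probabilities, shape (D, H, W)
    gt:         binary ground-truth occupancy grid, same shape
    threshold:  binarization threshold for the prediction (assumed value)
    """
    pred = pred_probs >= threshold
    gt_bool = gt.astype(bool)
    intersection = np.logical_and(pred, gt_bool).sum()
    union = np.logical_or(pred, gt_bool).sum()
    return float(intersection) / float(union) if union > 0 else 1.0
```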
From these results, we can conclude that our iterative model produces better reconstructions than the RNN-based 3D-R2N2 model, which is, to our knowledge, the best-performing iterative model (aside from LSM [18], which, however, requires camera parameters as input). We can also conclude that most batch-based reconstruction methods outperform iterative reconstruction. However, the reconstruction quality of our model degrades only slightly compared to the Pix2Vox++/F architecture from which it was derived. This slight decrease can be attributed to the increased difficulty of the problem: the iterative model must additionally learn to compress the information from previously seen viewpoints into its fused representation.
In Table 2, we provide per-class IoU values for our algorithm and for the original Pix2Vox++/F architecture, on both the original and the extended version of the R2N2 dataset. As the Pix2Vox++/F paper only reports the overall IoU and the weights of the trained model are not publicly available, we retrained the Pix2Vox++/F model, achieving an overall IoU of 69.6% (whereas the original paper reported 69.0%). We provide results for two different backbone models: VGG and ResNet. We obtained results similar to those in [9]: ResNet outperformed VGG by 1–2% IoU on average.
Table 2 reveals wide variation in the per-class IoU both on the original and on the extended dataset. The difference in per-class IoU between batch-based and iterative reconstruction was limited to ∼2%.
The most notable result of Table 2 is that both algorithms achieved lower reconstruction quality on the extended dataset. We attribute this performance loss to the smaller viewpoint diversity in this dataset. In the original dataset, the azimuth of the 24 viewpoints is uniformly distributed over the interval [0°, 360°), so any selection of eight viewpoints is likely to cover a large azimuth range. Selecting eight viewpoints out of 120, in contrast, may result in batches covering much smaller azimuth ranges; in the worst case, eight consecutive viewpoints are selected, covering only a small fraction of the full azimuth range. We investigate the importance of selecting diverse viewpoints further in Section 4.5.
4.3. Online Reconstruction
In this section, we focus on the online reconstruction quality (IoU(t)) and its computational requirements. We compare our iterative algorithm to an online setup of the original Pix2Vox++/F model. The latter can take batches of any size as input, including batch size 1 (single viewpoint). The Pix2Vox++/F model can therefore be used in an online setting by buffering the most recent viewpoints with a first-in first-out replacement strategy and feeding this buffer as batch input to the algorithm.
In our initial experiments, we presented a new viewpoint from the extended dataset at every timestep, in order of increasing azimuth. For both algorithms, we observed a strong fluctuation in reconstruction quality; see Figure 6. For Pix2Vox++/F, this can be explained by the fact that reconstructions are always based on a batch of consecutive viewpoints with only minor differences in camera perspective. Although our model is equipped with a fused representation containing all information from previous viewpoints, we still observed long-term memory loss. This phenomenon, in which features extracted from the initial inputs tend to be forgotten as more viewpoints are processed, is a long-standing problem in machine learning [32].
To avoid long-term memory loss in the iterative algorithm and to ensure sufficient viewpoint diversity in the input batch of Pix2Vox++/F, we conducted our benchmark experiment on a subsampled version of the original R2N2 dataset: after ordering the 24 viewpoints of an object by azimuth, we selected every second image.
Table 4 compares the IoU(t) averaged over the full sequence when presenting the viewpoints ordered by azimuth. We experimented with several batch-composition strategies. The condition Size = 1 refers to a batch size of 1, in which a new reconstruction is calculated each time a new viewpoint becomes available; this is equivalent to single-view reconstruction. In the other conditions, a batch size of 3 viewpoints was used, updated with a stride of 1 or 3; see Figure 7 for additional clarification. Stride 1 corresponds to a first-in first-out buffer update strategy, whereas stride 3 corresponds to erasing the buffer after calculating each reconstruction. As a reference, we added a condition with a growing buffer, in which each reconstruction is based on all previously seen viewpoints. This last condition, however, has to redo a large amount of computation every time a new viewpoint is added to the buffer; the total computation time for a sequence grows quadratically with its length, which is not feasible in practice.
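The batch-composition strategies compared in Table 4 can be summarized with the sketch below (illustrative Python only; the function and its arguments are hypothetical, and each yielded batch would be passed to the Pix2Vox++/F model to produce one reconstruction).

```python
from collections import deque

def online_batches(view_stream, size=3, stride=3, growing=False):
    """Yield the input batch used for each reconstruction.

    size=1, stride=1 : single-view reconstruction at every timestep
    size=3, stride=1 : first-in first-out buffer, updated every viewpoint
    size=3, stride=3 : buffer is effectively erased after each reconstruction
    growing=True     : all previously seen viewpoints are reprocessed each time
    """
    if growing:
        seen = []
        for view in view_stream:
            seen.append(view)
            yield list(seen)            # cost grows with every new viewpoint
        return

    buffer = deque(maxlen=size)         # oldest view is dropped automatically
    for t, view in enumerate(view_stream, start=1):
        buffer.append(view)
        if t % stride == 0 and len(buffer) == size:
            yield list(buffer)
```

Each yielded batch is fed to the reconstruction model, so the number of reconstructions, and hence the computational cost, differs per condition.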
Limiting the batch size of Pix2Vox++ degrades the reconstruction quality. Only the condition in which all viewpoints are processed at every timestep (Growing Buffer) results in a performance on par with our algorithm. This setting, however, in addition to being slow, also requires keeping 12 images in memory, and thus requires considerably more memory than the fused-context tensor of the iterative approach.
To quantify the speed of both reconstruction models, we tested them on various hardware setups, as shown in Table 5. We include results on a Jetson TX1 to show their performance on an embedded system. As the different conditions produce different numbers of reconstructions (NoR), we compare the processing time for an entire sequence. Only the batch conditions with Size = 1, Stride = 1 and Size = 3, Stride = 3 are faster than the iterative approach, but they result in a loss in IoU.
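Per-sequence processing times of this kind can be measured with a helper along the following lines (a sketch assuming a PyTorch model; `time_sequence` and its arguments are hypothetical, and explicit synchronization is needed so that queued GPU work is included in the measurement).

```python
import time
import torch

def time_sequence(model, batches, device="cuda"):
    """Wall-clock time to process one full viewpoint sequence."""
    model = model.to(device).eval()
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for batch in batches:           # one stacked tensor of views per reconstruction
            model(batch.to(device))
    if device.startswith("cuda"):
        torch.cuda.synchronize()        # wait for all queued GPU work
    return time.perf_counter() - start
```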
4.4. Viewpoint Selection
The results presented in the previous sections already hinted at the importance of processing only a subset of the captured viewpoints. In this section, we compare different heuristics for determining which frames to process. In addition to improving the overall reconstruction, this also reduces the computational footprint of the reconstruction algorithm, since it scales linearly with the number of images processed. In the next section, we study several selection algorithms. Here, we characterize the difference in IoU when a fixed number of k viewpoints is selected from the n images available. Since the number of possible selections of k out of n viewpoints grows rapidly with n, we only present quantitative results obtained by testing all possible combinations on the smaller original R2N2 dataset (n = 24). For the larger extended dataset (n = 120), we only report qualitative results. To save on computation, we performed these experiments in batch mode with the original Pix2Vox++/F model.
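The exhaustive evaluation on the original dataset can be organized as in the following sketch (illustrative only; `reconstruct_batch` stands in for a batch call to Pix2Vox++/F and `voxel_iou` for the IoU of Equation (1), both passed in as callables).

```python
from itertools import combinations
import numpy as np

def iou_over_combinations(views, gt_voxels, reconstruct_batch, voxel_iou, k=3):
    """IoU of every possible selection of k viewpoints out of the n available ones."""
    scores = []
    for combo in combinations(range(len(views)), k):   # e.g., C(24, 3) = 2024 per object
        batch = [views[i] for i in combo]
        scores.append(voxel_iou(reconstruct_batch(batch), gt_voxels))
    scores = np.asarray(scores)
    p05, p25, p50, p75, p95 = np.percentile(scores, [5, 25, 50, 75, 95])
    return {"min": scores.min(), "P05": p05, "P25": p25, "P50": p50,
            "P75": p75, "P95": p95, "max": scores.max(), "P95-P05": p95 - p05}
```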
Table 6 shows different percentiles P, the minima and maxima, and the P95–P05 interpercentile range of the IoU obtained over all possible combinations of 3 out of 24 images (2024 combinations per object). Notably, a quarter of the tested combinations of three viewpoints resulted in an IoU on par with or better than the results presented in Table 2 for eight viewpoints. The results also highlight the importance of view selection, since there is on average a 16.3% difference between the best and worst combinations (using the 95th and 5th percentiles to avoid outliers). It is also worth noting that the reconstruction of some classes is more robust to the selected viewpoints than that of others, with the car class being the most robust.
To better understand which viewpoints contribute most to the reconstruction quality, we extended our study to the extended version of the dataset. Since there are 280,840 combinations of 3 out of 120 viewpoints per object, an exhaustive study is intractable. Figure 8 shows qualitative results for four objects; more examples can be found in Appendix C.
In this figure, the histograms of IoU values confirm the impact of viewpoint selection on obtaining good reconstructions: there are differences of up to ∼25% between the best and worst IoU for several objects. The second row in each subfigure indicates the distance, elevation and azimuth of each viewpoint. The red and green curves indicate how many of the combinations that include a given viewpoint belong to the overall best 25% and worst 25%, respectively. These graphs reveal that each object has clear azimuth intervals with informative and uninformative viewpoints. We will revisit this question later in this section.
To create the bottom graph of each subfigure, we binned all combinations by viewpoint diversity and calculated the average IoU of each bin. We define viewpoint diversity as the angle of the shortest arc on the unit circle that contains the three points corresponding to the azimuth values; more details can be found in Appendix A. Again, these results indicate the importance of selecting good viewpoint combinations. A clear degradation of the IoU can be observed for a viewpoint diversity of 180°, meaning that two of the three viewpoints in the tested combination were opposite viewpoints. All objects in the R2N2 dataset have at least one symmetry plane, and two opposite viewpoints along the axis orthogonal to that plane contain largely redundant information.
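A minimal sketch of this diversity measure, under the assumption that the shortest enclosing arc equals 360° minus the largest gap between circularly consecutive azimuths, could look as follows (the exact formulation is given in Appendix A).

```python
import numpy as np

def viewpoint_diversity(azimuths_deg):
    """Angle (degrees) of the shortest arc on the unit circle containing all azimuths."""
    az = np.sort(np.mod(azimuths_deg, 360.0))
    gaps = np.diff(np.append(az, az[0] + 360.0))   # gaps between circularly consecutive points
    return 360.0 - gaps.max()
```

For example, azimuths of 10°, 20° and 30° give a diversity of 20°, whereas two opposite viewpoints (0° and 180°) plus any third viewpoint in between give the 180° value discussed above.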
The distinct intervals in Figure 8 with informative and uninformative viewpoints raise the question of whether these intervals are linked to the object’s characteristics. Moreover, in real-world deployments, iterative reconstruction can start from any viewpoint. Starting the iterative reconstruction in a degraded region does not necessarily jeopardize the final reconstruction quality, as subsequent viewpoints can compensate for the missing information, but the IoU(t) will be degraded during the initial phase. We therefore calculated the reconstruction quality obtained from each individual viewpoint. Results for eight objects are shown in the polar plots of Figure 9, and more examples can be found in Appendix D. In these plots, the radial axis reflects the reconstruction quality, and the angular coordinate corresponds to the viewpoint azimuth.
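A polar plot of this kind can be reproduced with a few lines of Matplotlib (a sketch; the per-viewpoint IoU values are assumed to be precomputed, e.g., with `voxel_iou` above, and the function name is ours).

```python
import numpy as np
import matplotlib.pyplot as plt

def polar_iou_plot(azimuth_deg, iou, title=""):
    """Single-view reconstruction quality (radius) versus viewpoint azimuth (angle)."""
    theta = np.radians(np.asarray(azimuth_deg, dtype=float))
    iou = np.asarray(iou, dtype=float)
    order = np.argsort(theta)
    # Close the curve by repeating the first point at the end.
    theta_c = np.append(theta[order], theta[order][0])
    iou_c = np.append(iou[order], iou[order][0])
    ax = plt.subplot(projection="polar")
    ax.plot(theta_c, iou_c)
    ax.set_rlim(0.0, 1.0)
    ax.set_title(title)
    plt.show()
```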
In most polar plots, a butterfly-shaped curve is observed, most distinctly for instances that have one characteristic orientation, such as cars, telephones and aeroplanes. Viewpoints taken from perspectives aligned with or orthogonal to that orientation tend to be less favorable starting points for iterative reconstruction. Somewhat surprisingly, the single-view reconstruction quality fluctuates even for objects with rotational symmetry, such as the lamp. However, one should also account for the fluctuating distance and elevation of subsequent viewpoints; see Figure 8. For the lamp, the viewpoints most informative to the reconstruction are those taken from perspectives with lower elevation, because these perspectives allow for a better estimation of the height of the lamp. The second row in Figure 9 shows four telephone examples with very similar 3D models, yet the reconstruction quality varies significantly for similar azimuth angles. This indicates that the optimal viewpoint selection for one object does not necessarily guarantee good performance on other, similar objects.
Combining all these results, we conclude that the primary goal of a viewpoint-selection algorithm is to maintain sufficient diversity in azimuth and elevation. Only limited benefits can be expected from more advanced strategies based on particular characteristics of the object to be reconstructed. Another argument in favor of simpler strategies is that no prior assumptions can be made about the azimuth of the starting viewpoint. Following the principle of Occam’s razor, in the next section we therefore study only sampling strategies that are agnostic to particular object characteristics.