Figure 1.
An overview of the proposed network architecture, which comprises a feature extraction module, a multilayer linear module, and multiple self-attention modules. The network takes the complete measurement matrix as input and transforms it into 3D structure. During training, it is supervised by a reprojection loss and an additional temporal constraint.
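As a concrete reference for Figure 1, the sketch below outlines one plausible PyTorch realization of the pipeline: a feature extractor, stacked self-attention over frames, and a multilayer linear head regressing a per-frame 3D shape, trained with a reprojection loss plus a temporal constraint. The class name, layer widths, head count, attention implementation, and the coordinate layout of the measurement matrix `W` are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class NRSfMNet(nn.Module):
    """Minimal sketch of the Figure 1 pipeline (illustrative, not the exact design)."""

    def __init__(self, num_points, feat_dim=256, num_blocks=4):
        super().__init__()
        # Stand-in feature extractor; Figure 2 details the ConvBNBlock/ResBlock version.
        self.feature_extractor = nn.Linear(2 * num_points, feat_dim)
        # Stack of self-attention blocks over frames (Figure 3); a standard
        # transformer encoder stands in for the paper's attention modules.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.attention = nn.TransformerEncoder(layer, num_layers=num_blocks)
        # Multilayer linear module regressing a per-frame 3D shape.
        self.head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 3 * num_points),
        )

    def forward(self, W):
        # W: (B, F, 2P) measurement matrix -- 2D tracks of P points over F frames.
        x = self.feature_extractor(W)   # (B, F, D): one token per frame
        x = self.attention(x)           # model inter-frame relationships
        S = self.head(x)                # (B, F, 3P)
        return S.view(W.shape[0], W.shape[1], 3, -1)  # (B, F, 3, P) 3D structure


def training_losses(S, W, cams, lam=0.1):
    # Reprojection loss: project the predicted 3D shapes back to 2D with
    # hypothetical per-frame 2x3 camera matrices and compare against W,
    # assuming W stacks the x-coordinates of all points before the y-coordinates.
    proj = torch.einsum('bfij,bfjp->bfip', cams, S)   # (B, F, 2, P)
    l_reproj = (proj.flatten(2) - W).abs().mean()
    # Temporal constraint: neighbouring frames should have similar shapes.
    l_temp = (S[:, 1:] - S[:, :-1]).pow(2).mean()
    return l_reproj + lam * l_temp
```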
Figure 2.
The feature extraction module, which consists of a ConvBNBlock and a set of ResBlocks. Within this module, the notation denotes the transformation of a feature map with c input channels through a convolutional kernel of the indicated dimensions to produce an output with d channels.
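For readers who want the Figure 2 module in code form, the following is a minimal sketch under stated assumptions: 1D convolutions along the frame axis, ReLU activations, and a residual connection around two ConvBNBlocks. The kernel size k, channel widths, and block count are illustrative; the caption's notation (c input channels, d output channels) is kept in the constructor arguments.

```python
import torch.nn as nn

class ConvBNBlock(nn.Module):
    """Conv -> BatchNorm -> ReLU over the frame axis (1D conv is an assumption)."""

    def __init__(self, c, d, k=3):
        super().__init__()
        self.conv = nn.Conv1d(c, d, kernel_size=k, padding=k // 2)
        self.bn = nn.BatchNorm1d(d)
        self.act = nn.ReLU()

    def forward(self, x):              # x: (B, c, F)
        return self.act(self.bn(self.conv(x)))


class ResBlock(nn.Module):
    """Residual connection around two ConvBNBlocks."""

    def __init__(self, d, k=3):
        super().__init__()
        self.body = nn.Sequential(ConvBNBlock(d, d, k), ConvBNBlock(d, d, k))

    def forward(self, x):
        return x + self.body(x)


class FeatureExtractor(nn.Module):
    """One ConvBNBlock followed by a stack of ResBlocks, as in Figure 2."""

    def __init__(self, c_in, d=256, num_res=4):
        super().__init__()
        self.stem = ConvBNBlock(c_in, d)
        self.res = nn.Sequential(*[ResBlock(d) for _ in range(num_res)])

    def forward(self, x):
        return self.res(self.stem(x))
```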
Figure 3.
The structure of a single attention block within the network. The block treats individual frames as sequence tokens, allowing the network to capture relationships between frames. It serves as the fundamental building block for feature extraction and encoding, helping the network progressively capture the semantic and contextual information in the input sequence.
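The attention block of Figure 3 can be sketched as a standard pre-norm transformer layer that treats each frame as one sequence token. The head count, MLP expansion ratio, and pre-norm layout are assumptions; only the frames-as-tokens design is taken from the caption.

```python
import torch.nn as nn

class AttentionBlock(nn.Module):
    """One self-attention block over frames (pre-norm layout is an assumption)."""

    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):
        # x: (B, F, dim) -- each of the F frames is one sequence token, so the
        # attention weights directly model inter-frame relationships.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x
```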
Figure 4.
Training loss curves for (a) the Paper dataset, (b) the seq3 dataset, (c) the seq4 dataset, and (d) the Pants dataset.
Figure 5.
Reconstruction results on datasets without ground truth: the Back, Real Face, and Heart datasets. The original images are shown above and the corresponding reconstruction results below.
Figure 6.
Qualitative results for the four datasets evaluated quantitatively above: Actor Mocap, Paper, Traj.B, and Expressions. For the Actor Mocap and Paper datasets, the original images are shown with the corresponding reconstructions beneath them. The Traj.B and Expressions datasets provide only ground-truth OBJ files, so only their reconstructions are shown below.
Figure 7.
A comparison of the rank of the covariance matrices of the feature representations before and after the multilayer linear module. The comparison illustrates the module's influence on the effective dimensionality of the features by measuring the rank at two points in the pipeline: at the module's input and at its output.
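The Figure 7 diagnostic can be reproduced with a few lines of NumPy: compute the covariance matrix of the per-frame features and take its numerical rank. The function name, tolerance, and the `feats_in`/`feats_out` capture points are illustrative assumptions.

```python
import numpy as np

def covariance_rank(feats, tol=1e-6):
    """Rank of the covariance matrix of a (num_frames, dim) feature array."""
    # Center the features, form the covariance matrix, and take its numerical rank.
    centered = feats - feats.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (feats.shape[0] - 1)
    return np.linalg.matrix_rank(cov, tol=tol)

# Hypothetical usage: feats_in / feats_out are features captured before and
# after the multilayer linear module.
# print(covariance_rank(feats_in), covariance_rank(feats_out))
```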
Figure 8.
t-SNE visualization of the feature vectors before the self-attention layers, projected into a 2D latent space. Each point represents a frame in the sequence, colored by its temporal order. The visualization shows how the data is distributed in the lower-dimensional latent space.
Figure 9.
t-SNE visualization of the feature vectors after the self-attention layers, projected into a 2D latent space. Each point represents a frame in the sequence, colored by its temporal order. The visualization shows how the data is distributed in the 2D latent space after processing by the self-attention modules.
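A minimal sketch of the visualization behind Figures 8 and 9, assuming the features are collected as a (num_frames, dim) array before or after the self-attention layers; the perplexity, colormap, and marker size are illustrative choices.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(feats, title):
    # Project the per-frame features into a 2D latent space.
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
    # Color each point by its temporal order in the sequence.
    order = np.arange(len(feats))
    plt.scatter(emb[:, 0], emb[:, 1], c=order, cmap='viridis', s=8)
    plt.colorbar(label='frame index')
    plt.title(title)
    plt.show()
```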
Table 1.
Comparison results on the Synthetic Face sequences dataset. The proposed method achieves the lowest error among the evaluated approaches: VA, CMDR, GM, JM, SMSR, PPTA, EM-FEM, N-NRSfM, RONN, NTP, and DST-NRSFM.
Dataset | VA [8] | CMDR [39] | PPTA [40] | JM [19] | SMSR [21] | GM [9]
--- | --- | --- | --- | --- | --- | ---
Traj.A | 0.035 | 0.032 | 0.039 | 0.028 | 0.030 | 0.029
Traj.B | 0.038 | 0.037 | 0.052 | 0.033 | 0.032 | 0.031

Dataset | EM-FEM [7] | N-NRSFM [6] | RONN [41] | NTP [38] | DST-NRSFM [42] | Ours
--- | --- | --- | --- | --- | --- | ---
Traj.A | 0.039 | 0.032 | 0.031 | 0.031 | 0.013 | 0.013
Traj.B | 0.030 | 0.039 | 0.036 | 0.034 | 0.026 | 0.019
Table 2.
Comparison results on the Actor Mocap dataset. The proposed method is benchmarked against FML, SMSR, CMDR, RONN, N-NRSFM, and DST-NRSFM.
Dataset | FML [43] | SMSR [21] | CMDR [39] | RONN [41] | N-NRSFM [6] | DST-NRSFM [42] | Ours
--- | --- | --- | --- | --- | --- | --- | ---
Actor Mocap | 0.092 | 0.054 | 0.0257 | 0.0226 | 0.0181 | 0.0163 | 0.0154
Table 3.
Comparison results on the Expressions dataset. The proposed method is compared against CSF2, KSTA, GMLI, N-NRSFM, RONN, and DST-NRSFM.
Dataset | CSF2 [44] | KSTA [17] | GMLI [45] | RONN [41] | N-NRSFM [6] | DST-NRSFM [42] | Ours
--- | --- | --- | --- | --- | --- | --- | ---
Expressions | 0.030 | 0.035 | 0.026 | 0.026 | 0.026 | 0.023 | 0.019
Table 4.
Comparison results on the Pants dataset. The proposed method is compared against KSTA, BMM, PPTA, and DST-NRSFM.
Dataset | KSTA [17] | BMM [20] | PPTA [40] | DST-NRSFM [42] | Ours
--- | --- | --- | --- | --- | ---
Pants | 0.220 | 0.183 | 0.203 | 0.148 | 0.087
Table 5.
Comparison results on the Kinect Paper and T-Shirt sequences. The proposed method is compared against MP, DSTA, GM, JM, N-NRSFM, and DST-NRSFM.
Dataset | MP [46] | DSTA [47] | GM [9] | JM [19] | N-NRSFM [6] | DST-NRSFM [42] | Ours
--- | --- | --- | --- | --- | --- | --- | ---
Paper | 0.0827 | 0.0612 | 0.0394 | 0.0338 | 0.0332 | 0.0296 | 0.0312
T-Shirt | 0.0741 | 0.0636 | 0.0362 | 0.0386 | 0.0309 | 0.0396 | 0.0402
Table 6.
Ablation results for the multilayer linear module across different datasets.
Method | Actor Mocap | Traj.A | Traj.B | Expressions | Pants | Paper | T-Shirt
--- | --- | --- | --- | --- | --- | --- | ---
Baseline | 0.0189 | 0.017 | 0.023 | 0.022 | 0.092 | 0.0423 | 0.0589
Ours | 0.0154 | 0.013 | 0.019 | 0.019 | 0.087 | 0.0312 | 0.0401
Table 7.
Ablation results for the multiple self-attention modules across different datasets.
Method | Actor Mocap | Traj.A | Traj.B | Expressions | Pants | Paper | T-Shirt
--- | --- | --- | --- | --- | --- | --- | ---
Baseline-A | 0.986 | 0.113 | 0.403 | 0.991 | 0.291 | 0.0573 | 0.0792
Ours | 0.0154 | 0.013 | 0.019 | 0.019 | 0.087 | 0.0312 | 0.0401
Table 8.
Ablation results for the temporal constraint across different datasets.
Method | Actor Mocap | Traj.A | Traj.B | Expressions | Pants | Paper | T-Shirt
--- | --- | --- | --- | --- | --- | --- | ---
Baseline-B | 0.0209 | 0.014 | 0.017 | 0.020 | 0.090 | 0.0501 | 0.0648
Ours | 0.0154 | 0.013 | 0.019 | 0.019 | 0.087 | 0.0312 | 0.0401