Article

Intelligent Prediction of Ore Block Shapes Based on Novel View Synthesis Technology

by Lin Bi 1,2, Dewei Bai 1,* and Boxun Chen 1

1 School of Resources and Safety Engineering, Central South University, Changsha 410083, China
2 Changsha Digital Mine Co., Ltd., Changsha 410221, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8273; https://doi.org/10.3390/app14188273
Submission received: 4 August 2024 / Revised: 6 September 2024 / Accepted: 10 September 2024 / Published: 13 September 2024

Abstract: To address the problem of incomplete perception of ore blocks from limited viewpoints in future remote and intelligent shoveling-dominated mining scenarios, a method is proposed that uses novel view synthesis based on a latent diffusion model to predict ore block shapes from limited views. Initially, an ore block image-pose dataset is created. Then, building on prior knowledge, the latent diffusion model undergoes transfer learning to develop an intelligent ore block shape prediction model (IOBSPM) for rock blocks. During training, a structural similarity loss is innovatively introduced to constrain the prediction results and address the discontinuity of generated images. Finally, neural surface reconstruction is performed on the generated multi-view images of rock blocks to obtain a 3D model. Experimental results show that the prediction model, trained on the rock block dataset, produces better morphological and detail generation than the original model, with single-view generation time within 5 s. The average PSNR, SSIM, and LPIPS values reach 23.02 dB, 0.754, and 0.268, respectively. The generated views also perform well in 3D reconstruction, which has significant implications for future research on remote and autonomous shoveling.

1. Introduction

Ore shoveling is a crucial stage in the mining operation process. With advances in technology and the development of intelligent mining applications, traditional manual on-site shoveling is gradually being replaced by remote-controlled shoveling and autonomous machine shoveling [1], which offer higher production efficiency, lower costs, and enhanced safety [2]. However, remote-controlled shoveling demands a higher level of global information perception and accuracy than on-site shoveling, and autonomous shoveling must gather detailed real-time information about the shape and position of materials from a limited viewpoint to guide its next moves. Therefore, rapidly and accurately predicting the ore blocks at the operation site from limited information is an effective way to analyze shoveling actions and improve operational efficiency, and it is of significant importance for constructing a digital twin of the mining environment [3].
Ji [4] used the support vector machine (SVM) algorithm to predict mineral information. SVM transforms a practical problem into a high-dimensional feature space through nonlinear mapping and constructs a linear discriminant function there, achieving nonlinear discrimination in the original space; this offers robust generalization even with a limited sample size. Although the reported accuracy was high, the limited sample size means the results lack representativeness. Al-Bakri and Sazid [5] utilized the artificial neural network (ANN) algorithm to predict blast information of ore blocks. An ANN is a computational model that mimics the connections of human brain neurons, processing and learning information through multiple layers of weighted neural connections; it adaptively optimizes its weights from training data to generate predictions in complex pattern recognition and forecasting tasks. However, this method has limited transferability to other mining sites and requires substantial computational resources. Huai et al. [6] used cubic spline interpolation, linear interpolation, nearest neighbor interpolation, and piecewise cubic Hermite interpolation to predict the ore block excavation volume per unit time in autonomous shoveling under different conditions; nonetheless, the estimates were affected by the smoothness of the operating site and the surrounding environmental conditions. Thus, while traditional prediction algorithms have enriched the available information to a certain extent, their various constraints have gradually become a problem that cannot be ignored.
In contrast, new view synthesis methods based on computer vision are less affected by the surrounding environment and can present multi-dimensional ore block information more intuitively and accurately as images. Mildenhall et al. [7] introduced neural radiance fields (NeRF), a technique that achieves realistic image generation from novel viewpoints by feeding the 5D pose parameters (3D spatial coordinates and 2D viewing directions) of each sampled point into a multilayer perceptron [8]. However, the effectiveness of this method relies heavily on a large number of multi-view images and precise pose data [9], so it performs poorly with limited input [10]; moreover, installing numerous sensors or cameras to capture these views would significantly increase costs. To address this, Yu et al. [11] proposed incorporating convolutional neural networks to learn implicit geometric priors from the input image space for few-shot training, enabling the synthesis of new views from a limited number of perspectives without 3D supervision on the ShapeNet synthetic dataset [12]. Despite producing good view synthesis results, the method showed poor view generation and generalization performance on non-ShapeNet datasets.
More recently, denoising diffusion probabilistic models (DDPMs) [13] have shown promising results in view synthesis. A DDPM gradually adds noise to the data in the forward stage and restores the Gaussian noise to the original image in the reverse stage [14,15]. However, because both the noising and denoising processes operate on full-resolution images [16,17], they demand substantial computational resources. To address this issue, Rombach et al. [18] proposed latent diffusion models (LDMs), which move the processing of pixel images into a latent space that is perceptually equivalent to the original image space but has significantly lower computational complexity. This enables high-fidelity data reconstruction while reducing the model's computational requirements [19]. Consequently, LDMs can produce high-quality images with fewer computational resources [20,21] and have demonstrated significant application value in mining engineering. However, in real-world mining scenarios, ore blocks produced by blasting and caving exhibit diverse shapes and complex textures, while the public datasets used to train LDMs typically contain simpler patterns, which limits the models' ability to handle the complexity of ore blocks. As a result, such models perceive ore block information with a certain degree of bias when applied to mining engineering.
To address this issue, this study proposes a novel approach, the intelligent ore block shape prediction model (IOBSPM) based on LDMs, which derives global morphology from local perceptual information. The primary technical contribution of this research is the application of LDMs to mining engineering: the model is specifically designed to handle the complex shapes and textures of ore blocks, which are not well represented in existing public datasets. This approach overcomes the limitations of current view synthesis methods in complex environments and significantly improves the accuracy and computational efficiency of ore block morphology prediction. It also lays a foundation for future applications of deep learning in remote-controlled precision shoveling and autonomous shoveling operations.

2. A New View Synthesis Technique Based on Latent Diffusion Models

2.1. Architectural Framework for IOBSPM

The intelligent prediction technology for ore block morphology aims to perceive the overall shape of ore blocks in the mining area using a prediction model trained on simulation datasets. In autonomous shoveling scenarios, video or image data of the ore blocks ahead, along with corresponding pose information (3D spatial coordinates and angles), is captured by sensors mounted on the shoveling equipment. This enables the prediction of the ore block’s shape from multiple angles. The specific implementation process is shown in Figure 1.
The intelligent prediction technology for ore block morphology based on latent diffusion models consists of four main components. First, simulation software is used to create images of ore blocks and their poses as the target task dataset; this provides the model with a diverse range of ore block morphology samples and reduces the randomness of prediction results in real-world scenarios. Second, transfer learning is applied to the diffusion model on the target task dataset; through the forward and reverse diffusion processes, the model continuously optimizes its loss function, adjusting its weights to better suit the ore block prediction task and thereby achieving a more refined application. Third, the adverse effects of lighting and background are reduced to improve the accuracy of the predicted images; the ore block images are enhanced through methods such as background filling and brightness adjustment and then fed into the prediction model to generate new views from different perspectives, enabling the intelligent prediction of ore block morphology. Fourth, the generated multi-view images and pose information allow a more detailed perception of the global features of the target object, providing rich input data for mesh-based 3D model reconstruction and ultimately yielding a more visually expressive representation of the ore block morphology. Overall, the original views, predicted views, and 3D models of the ore blocks provide more diverse information references for machines or remote operators.

2.2. Inference of Ore Block Images Using Diffusion Models and Latent Diffusion Models

The most fundamental type of diffusion model is the denoising diffusion probabilistic model (DDPM). Its inference process comprises two phases: forward diffusion and reverse diffusion. In the forward diffusion phase, Gaussian noise is progressively added to the input image, transforming it into a noise image devoid of distinguishable features. For time steps t = 1, 2, …, T, this process can be described mathematically as follows:
$$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0,1) \tag{1}$$
where $\beta_t$ represents the noise intensity at time step $t$, and $\epsilon \sim \mathcal{N}(0,1)$ is noise sampled from a standard normal distribution (the same applies below). Defining $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$, the image $x_t$ at time $t$, obtained by adding noise directly to the original image $x_0$, is given by the following:
$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0,1) \tag{2}$$
In the reverse diffusion process, the noise-added image is input into the U-Net network to predict the noise component. The model then gradually denoises the image, approximating the reverse of the forward diffusion process using Bayesian inference. The simplified formula for the reverse process can be expressed as follows:
$$x_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sqrt{\frac{\beta_t\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0,1) \tag{3}$$
where $\epsilon_\theta(x_t, t)$ is the noise predicted by the U-Net at the current step, and $x_t$ and $x_{t-1}$ represent the images at time steps $t$ and $t-1$, respectively.
It is important to note that this approach requires substantial computational resources, and it assumes that the actual noise conforms to a Gaussian distribution. However, by using this method, DDPM can gradually reconstruct the original image from completely random noise, demonstrating significant potential in image generation tasks. The overall inference process of the denoising diffusion probabilistic models (DDPM) is illustrated in Figure 2.
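To make the two phases concrete, the following is a minimal PyTorch sketch of the closed-form forward noising of Equation (2) and a single reverse step of Equation (3). It assumes a linear $\beta$ schedule and a placeholder noise-prediction network `eps_model`; it is an illustration of the DDPM mechanics, not the authors' implementation.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise intensities beta_t (linear schedule)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative products alpha_bar_t

def q_sample(x0, t, eps):
    """Forward diffusion: jump directly from x_0 to x_t, as in Equation (2)."""
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)      # t is a batch of time-step indices
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

@torch.no_grad()
def p_sample(eps_model, x_t, t):
    """One reverse step x_t -> x_{t-1}, as in Equation (3); t is a Python int."""
    beta_t, a_bar_t = betas[t], alphas_bar[t]
    eps_pred = eps_model(x_t, t)                 # U-Net noise prediction eps_theta(x_t, t)
    mean = (x_t - beta_t / (1.0 - a_bar_t).sqrt() * eps_pred) / (1.0 - beta_t).sqrt()
    if t == 0:
        return mean                              # no noise is added at the final step
    sigma = (beta_t * (1.0 - alphas_bar[t - 1]) / (1.0 - a_bar_t)).sqrt()
    return mean + sigma * torch.randn_like(x_t)
```

Iterating `p_sample` from t = T − 1 down to 0 on pure Gaussian noise reproduces the gradual image-recovery behavior described above.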
Latent diffusion models (LDMs) are an evolution of DDPMs. They use a variational auto-encoder (VAE) to compress images into a latent space [22], allowing the denoising process to occur in that latent space, which significantly increases inference speed and reduces resource consumption. Additionally, LDMs augment the U-Net with a cross-attention mechanism [23] and incorporate the pre-trained CLIP model $\tau_\theta$ to extract feature information from input images and text [24]. This information is converted into embedding vectors, which are mapped into the U-Net layers through the attention mechanism (Q, K, V). The U-Net, with its skip-connection structure, effectively preserves high-resolution feature information [25], transforming the internal diffusion model into a conditional image generator [26]. The complete inference process is illustrated in Figure 3.
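As a concrete illustration of this conditioning path, the following is a minimal single-head cross-attention sketch in PyTorch, where queries come from flattened U-Net features and keys/values come from the CLIP embedding tokens. The dimensions (320 for U-Net features, 768 for CLIP tokens) are illustrative assumptions rather than the model's actual configuration.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: U-Net features attend to CLIP context tokens."""
    def __init__(self, dim_q=320, dim_ctx=768, dim_head=64):
        super().__init__()
        self.to_q = nn.Linear(dim_q, dim_head, bias=False)    # Q from U-Net features
        self.to_k = nn.Linear(dim_ctx, dim_head, bias=False)  # K from CLIP embedding
        self.to_v = nn.Linear(dim_ctx, dim_head, bias=False)  # V from CLIP embedding
        self.to_out = nn.Linear(dim_head, dim_q)
        self.scale = dim_head ** -0.5

    def forward(self, feats, context):
        # feats:   (B, N, dim_q)   flattened U-Net feature map
        # context: (B, M, dim_ctx) embedding tokens derived from c(x, R, T)
        q, k, v = self.to_q(feats), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.to_out(attn @ v)              # conditioned features, same shape as feats
```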

2.3. Transfer Learning Tasks Based on Latent Diffusion Models

To generate views of the same object from different angles given limited initial perspectives, Liu et al. [27] utilized over 10 million images and corresponding camera poses from the Objaverse dataset [28]. They employed the intrinsic image generation capabilities of latent diffusion models (LDMs) for large-scale training, enabling the generation of new views based on input pose information. The objective can be expressed as follows:
$$\hat{x}_{R,T} = f(x, R, T) \tag{4}$$
where $\hat{x}_{R,T}$ represents the newly synthesized image, $f$ denotes the latent diffusion model, $x$ represents the input image, and $R$ and $T$ refer to the camera's rotation and translation information, respectively.
To achieve rapid convergence of the network with limited data [29] and to mitigate texture and morphology deviations in new view synthesis on general data [30], a transfer learning approach is employed: the model is fine-tuned from pre-trained weights on the target task dataset. Various ore block datasets are used to refine the entire model, yielding the intelligent ore block shape prediction model based on the latent diffusion model. The detailed fine-tuning process is as follows (a condensed training-loop sketch is given after the list):
(1) Input the original ore block image x and the target-view ore block image x(R, T) (both 128 × 128 pixels), along with their corresponding camera pose data;
(2) Using the encoder $\mathcal{E}$, downsample the full-size images by a factor of 4, encoding them into lower-dimensional latent data of 32 × 32 pixels (compressed data);
(3) Randomly sample a time step t from K noise levels (commonly K = 1000);
(4) Conduct the forward diffusion process by progressively adding Gaussian noise, corresponding to the sampled time step t, to the target image x(R, T);
(5) Input the noise-added ore block image into the U-Net for reverse diffusion to predict the noise. Use the CLIP model to extract embedding information from the original input image and combine it with the camera pose data R and T to form an embedding vector c(x, R, T). This vector, along with the original image, guides the generation of the target image;
(6) Compute the loss between the actual noise and the predicted noise during the reverse diffusion process;
(7) Use the decoder D to upsample the compressed image from 32 × 32 pixels back to the full 128 × 128 pixel image, decoding it back into pixel space.
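The following is a minimal sketch of one fine-tuning iteration corresponding to steps (1)–(7), reusing the `alphas_bar` schedule from the DDPM sketch above. The modules `encoder`, `unet`, and `clip_embed` are placeholders standing in for the pre-trained components; engineering details of the actual pipeline are omitted.

```python
import torch

def finetune_step(x_src, x_tgt, pose, encoder, unet, clip_embed, opt, alphas_bar, K=1000):
    """One fine-tuning iteration; numbers refer to the steps listed above."""
    z0 = encoder(x_tgt)                                   # (2) 128x128 image -> 32x32 latent
    t = torch.randint(0, K, (z0.shape[0],))               # (3) random noise level per sample
    eps = torch.randn_like(z0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps  # (4) forward diffusion on target latent
    c = clip_embed(x_src, pose)                           # (5) embedding vector c(x, R, T)
    eps_pred = unet(z_t, t, c)                            # (5) conditioned noise prediction
    loss = torch.mean((eps - eps_pred) ** 2)              # (6) loss between actual and predicted noise
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()                                    # (7) the decoder D is applied only at inference
```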

2.4. Structural Similarity Constraint

In the process of transfer learning described above, each iteration primarily aims to achieve the following objectives:
$$\min_\theta \, \mathbb{E}_{z \sim \mathcal{E}(x),\, t,\, \epsilon \sim \mathcal{N}(0,1)} \left\| \epsilon - \epsilon_\theta\!\left(z_t, t, c(x, R, T)\right) \right\|_2^2 \tag{5}$$
where $\mathbb{E}$ denotes the expectation; $z \sim \mathcal{E}(x)$ is the latent representation extracted from the input image $x$ by the encoder $\mathcal{E}$; $t$ is the time step; $\epsilon \sim \mathcal{N}(0,1)$ is noise sampled from a standard normal distribution; $c(x, R, T)$ is the embedding vector; $z_t$ is the latent distribution at the current time step; and $\epsilon$ and $\epsilon_\theta$ are the actual and predicted noise signals, respectively, whose squared Euclidean norm is computed.
The model reduces this loss through iterative optimization and generates new viewpoints under the constraint of the pose conditions. Latent diffusion models use the MSE loss for this calculation. The MSE loss penalizes the gray-value error of the output noise at the pixel level and attends mainly to the overall character of the noise [31]. Although the embedding vector of the original image and pose information is introduced as guidance, latent diffusion models still suffer from randomness in generation [32], mainly because both the forward noising and reverse denoising processes are inherently stochastic, which can cause the generated image to deviate from expectations. To alleviate this effect and improve the accuracy of new view prediction, the structural similarity index measure (SSIM) [33] is introduced to guide the model toward new viewpoints that are closer to the actual situation. In noise prediction, structural similarity measures the structural information of the noise image in finer detail across dimensions such as brightness, contrast, and structure, improving the model's sensitivity to subtle structural changes in the image [34]. In the task of obtaining global information about ore blocks, higher structural similarity improves the visual continuity of the generated views. More importantly, in downstream 3D reconstruction applications, model reconstruction demands very high consistency between input images, and image consistency is positively correlated with structural similarity [35,36]. Improving the structural similarity of model-generated images is therefore crucial for subsequent engineering applications.
In the context of structural similarity, brightness is measured using the mean grayscale value, obtained by averaging the values of all pixels; the luminance term is calculated as follows:
$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \tag{6}$$
Contrast is measured using the standard deviation of the grayscale values (with the unbiased estimate of the standard deviation); the contrast term is calculated as follows:
$$c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} \tag{7}$$
Structure is measured using the correlation coefficient; the structure term is calculated as follows:
$$s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3} \tag{8}$$
Assigning an equal weight of 1 to each of the three terms (and taking $C_3 = C_2/2$), the SSIM formula becomes the following:
$$SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{9}$$
where $x$ and $y$ represent the original ore block view and the target ore block view, respectively; $\mu_x$ and $\mu_y$ denote the means of $x$ and $y$; $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$; $\sigma_{xy}$ is the covariance between $x$ and $y$; and $C_1$, $C_2$, and $C_3$ are constants.
During training, the original loss function, MSE, typically converges faster. To ensure that the two loss functions are of similar magnitude and to prevent any single loss function from dominating the model training process, a weighting parameter w = 0.1 is introduced to balance the influence between MSE and SSIM constraints. Thus, the model loss function incorporating structural similarity can be expressed as follows:
$$Loss = \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|_2^2 + 0.1 \times L_{SSIM}(\epsilon, \epsilon_\theta) \tag{10}$$
where the former term is the MSE loss based on the squared Euclidean norm, and the latter is the loss based on SSIM.
During the iterative process, a smaller loss value indicates that the predicted noise values are closer to the actual noise values, and the generated new view has a higher structural similarity to the original view.
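The following is a minimal sketch of this combined loss, using the global-statistics form of SSIM from Equation (9) applied to the actual and predicted noise. The constants $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$ with dynamic range $L = 1$ are the common convention and are an assumption here, as is the $1 - SSIM$ form of the SSIM loss term.

```python
import torch

def ssim_global(a, b, L=1.0):
    """Global-statistics SSIM of Equation (9) between two tensors."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a = ((a - mu_a) ** 2).mean()
    var_b = ((b - mu_b) ** 2).mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + C1) * (2 * cov + C2)) / (
        (mu_a ** 2 + mu_b ** 2 + C1) * (var_a + var_b + C2))

def combined_loss(eps, eps_pred, w=0.1):
    """Equation (10): squared-error term plus weighted SSIM loss."""
    mse = torch.mean((eps - eps_pred) ** 2)
    l_ssim = 1.0 - ssim_global(eps, eps_pred)   # smaller when structures agree
    return mse + w * l_ssim
```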

3. Experimental Verification

3.1. Ore Block Dataset Creation Process

Since there is currently no publicly available large-scale dataset of ore block views and pose data in mining engineering or computer vision, and since collecting diverse ore block views in the field and computing their corresponding poses is time-consuming and insufficient for the required training data scale, this paper uses the simulation graphics software Blender 2.90 to create simulated ore block model data that mimics the shapes and textures of real ore blocks. Blender is an open-source 3D graphics package capable of constructing 3D models of any shape [37] and of generating static images and corresponding poses with its built-in renderer; it also produces good results for details and textures [38]. To enhance the diversity of dataset views, two sampling trajectories were used: a hemispherical uniform distribution and an Archimedean spiral. Additionally, bounded random noise was added to the camera positions to further increase the diversity of the training data [39]. These measures ensure that the training data covers a wide range of viewpoints and conditions, prevent the model from overfitting to specific scenarios, and thereby enhance the model's generalization ability. Moreover, this diversified data generation strategy better captures subtle variations in ore block morphology, enabling the model to recognize and distinguish more complex morphological features and ultimately improving its robustness. Illustrations of the two sampling trajectories are shown in Figure 4, and a sketch of the sampling procedure is given below.
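The following NumPy sketch illustrates the two camera-position sampling strategies with bounded jitter. The radius, number of spiral turns, and noise bounds are illustrative assumptions; the actual Blender rendering script is not reproduced here.

```python
import numpy as np

def hemisphere_uniform(n, radius=2.0, noise=0.05, seed=0):
    """Camera positions sampled uniformly over the upper hemisphere, with bounded jitter."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(0.0, 2.0 * np.pi, n)         # azimuth
    cos_theta = rng.uniform(0.0, 1.0, n)           # uniform in cos(theta) -> uniform on the surface
    sin_theta = np.sqrt(1.0 - cos_theta ** 2)
    pts = radius * np.stack([sin_theta * np.cos(phi),
                             sin_theta * np.sin(phi),
                             cos_theta], axis=1)
    return pts + rng.uniform(-noise, noise, pts.shape)  # bounded random positional noise

def archimedean_spiral(n, radius=2.0, turns=5):
    """Camera positions spiraling from the pole of the viewing hemisphere to its equator."""
    t = np.linspace(0.0, 1.0, n)
    theta = t * np.pi / 2.0                        # polar angle sweeps 0 -> 90 degrees
    phi = t * turns * 2.0 * np.pi                  # azimuth winds `turns` times
    return radius * np.stack([np.sin(theta) * np.cos(phi),
                              np.sin(theta) * np.sin(phi),
                              np.cos(theta)], axis=1)
```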
In this study, 10,000 images and corresponding poses were rendered from 200 simulated ore blocks of various shapes using the aforementioned sampling methods. The data were also augmented to enhance their diversity and the model's accuracy. A portion of the data is shown in Figure 5.
All dataset images were resized to 128 × 128 pixels for input and were split into training, validation, and test sets in a ratio of 8.0 : 1.2 : 0.8 (i.e., 80%, 12%, and 8%), as in the sketch below.
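A minimal split sketch under that ratio, assuming the 10,000 image-pose pairs are indexed 0–9999; the seed value is arbitrary.

```python
import random

random.seed(42)
indices = list(range(10000))                 # one index per rendered image-pose pair
random.shuffle(indices)
n_train, n_val = int(0.80 * len(indices)), int(0.12 * len(indices))
train_set = indices[:n_train]                # 8,000 samples (80%)
val_set = indices[n_train:n_train + n_val]   # 1,200 samples (12%)
test_set = indices[n_train + n_val:]         # 800 samples (8%)
```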

3.2. Experimental Parameter Settings

In this study, the model was trained for 100 epochs using a single NVIDIA GeForce RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA) under an environment of Ubuntu 20.04, PyTorch 1.12.1, and Python 3.9. To balance model performance, the sampling scale factor was set to f = 4, the total number of noise addition steps T was 1000, and the batch size was set to 32. The Adam optimizer was used for network parameter optimization, with a fixed learning rate of 0.0001.
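For reference, a configuration sketch matching these settings; `model` is a placeholder for the network being fine-tuned.

```python
import torch

config = {
    "epochs": 100,
    "batch_size": 32,
    "lr": 1e-4,          # fixed learning rate
    "scale_factor": 4,   # latent downsampling factor f
    "timesteps": 1000,   # total number of noise-addition steps T
}

def make_optimizer(model):
    # Adam with a fixed learning rate, matching the settings above
    return torch.optim.Adam(model.parameters(), lr=config["lr"])
```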

3.3. Evaluation Metrics

In fields such as classification, regression, and QSAR model evaluation [40], metrics like RMSE and MAE are commonly used to assess the accuracy of predicted values against actual values. However, in the domain of computer vision and generative models, in order to comprehensively evaluate the quality of the generated images, this study uses three evaluation metrics: peak signal-to-noise ratio (PSNR) [41], structural similarity index (SSIM), and learned perceptual image patch similarity (LPIPS) [42]. These metrics are widely used in image quality assessment. LPIPS is calculated using a deep learning model, with lower values indicating better image generation quality; generally, an LPIPS value below 0.3 indicates good overall image quality, although for ore blocks with complex texture details this value may be higher. SSIM considers the brightness, contrast, and structural information of an image and is used to measure the similarity between two images; its calculation was given in Equation (9). PSNR is the ratio of the peak signal power to the mean noise power, calculated as follows:
$$PSNR = 10 \times \log_{10}\!\left(\frac{MAX_I^2}{MSE}\right) \tag{11}$$
where $MAX_I$ represents the maximum possible pixel value of the image, and MSE is the mean squared error. Higher SSIM and PSNR values indicate better image generation quality.
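The following sketch shows how PSNR (Equation (11)) and LPIPS could be computed; SSIM was sketched in Section 2.4. It assumes the `lpips` Python package is installed and that images are float tensors in [0, 1] of shape (3, H, W).

```python
import torch
import lpips  # assumed available via `pip install lpips`

lpips_fn = lpips.LPIPS(net="alex")  # deep-feature perceptual metric

def psnr(pred, target, max_i=1.0):
    """Equation (11) for float images in [0, max_i]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_i ** 2 / mse)

def lpips_score(pred, target):
    """LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1]; lower is better."""
    return lpips_fn(pred.unsqueeze(0) * 2 - 1, target.unsqueeze(0) * 2 - 1).item()
```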

4. Results Analysis and Downstream Task Application

4.1. Analysis of Experimental Results

4.1.1. Comparative Analysis of Image Synthesis Results

To facilitate a more intuitive comparison of new view generation effects, visual comparisons of new views generated from different styles of ore blocks without post-processing are shown in Figure 6. Sections that significantly differ from the actual results or contain certain discrepancies in the details are highlighted with red boxes. It is evident that traditional methods based on neural radiance fields (NeRF) have poorer generalization capabilities and slower generation speeds in new view generation tasks. In contrast, the latent diffusion model exhibits a more accurate inference of texture, shape, and color in the generation of new views of ore.
For predictions with smaller deviation angles (Groups A, B, and C), the model, after transfer learning on ore block data, provides more precise detail compared to the original latent diffusion model, especially in areas such as edge shadows, the smoothness of ore block contours, and the accuracy of shape generation (e.g., Group A). For predictions with larger deviation angles (Groups D and E), the latent diffusion model exhibits inherent shortcomings. This issue arises because, when the specified angle for the generated image differs significantly from the input image, the model faces increased difficulty in leveraging prior knowledge and current input information, which in turn reduces its ability to accurately infer missing details. Although this problem can be somewhat mitigated by covering a wider range of scenarios in the training data, the objective reality remains that greater loss of detail leads to larger prediction deviations. Nevertheless, the latent diffusion model, after being adapted through transfer learning on ore block data, still achieves generated results that are closer to the ground truth compared to the original model.
Experimental results based on the test set used in this study are shown in Table 1. The method proposed in this paper achieved an average PSNR of 23.02 dB, an average SSIM of 0.754, and an average LPIPS of 0.268. Compared to the traditional PixelNeRF model and the original model, the average PSNR increased by 12.48 dB and 1.66 dB, respectively, the average SSIM increased by 0.186 and 0.052, respectively, and the average LPIPS decreased by 0.211 and 0.050, respectively. Therefore, although the transfer learning model based on the ore block dataset still shows some biases and limitations when predicting at significantly shifted angles, which may not fully represent real-world objects, it still outperforms the original model. The new model is better adapted to the task of predicting different perspectives of ore blocks, and its higher accuracy is more conducive to guiding autonomous shoveling and remote-controlled shoveling operations.

4.1.2. Comparative Analysis of Structural Similarity Improvement

To evaluate the impact of the structural similarity constraint on the final results and on subsequent work, transfer learning was conducted using only the $L_{MSE}$ constraint and then with the addition of the $L_{SSIM}$ constraint at a scale of 0.1; the test set was reused for comparative experiments, as shown in Table 2. While the structural similarity constraint does not produce differences in the generated images visible to the naked eye, the SSIM values in Table 2 indicate that adding it enhances the model's overall structural recovery, thereby improving the quality of the predicted images. Using the models trained with the two constraints, five sets of image collections were generated, each containing 50 images. These images were then subjected to pose matching; the comparison is shown in Table 3. Images with higher structural similarity showed greater consistency, which improved the efficiency of pose matching and reduced the modeling workload in downstream tasks.

4.2. Actual Image Synthesis and Application in Downstream Tasks

4.2.1. Evaluation of the Actual Image Generation Quality

To better verify the practical application of the model, an image of an ore block from a Chinese mining site was used as input. The new view synthesis results are shown in Figure 7. In the figure, the red image represents the input view of the ore block, the green camera indicates the frontal view, and the blue camera represents the generated target view. The images below show the predicted new views after horizontal rotation, vertical rotation, side orientation, and top orientation. Leveraging the strong generative capabilities of the latent diffusion model, a new view image can be generated within 5 s under the conditions of this experiment.

4.2.2. Evaluation of 3D Reconstruction Results

In the field of computer vision, an important application of new view synthesis is the use of abundant pixels and pose information from multi-view images to reconstruct a 3D model of the object [43]. Since each reconstruction requires separate data collection and training, the overall process is relatively time-consuming [44]. However, in the context of mining engineering shoveling scenarios, reconstructing 3D models of ore blocks can provide machines and remote operators with comprehensive information about the target objects in a clearer and more flexible manner. It also allows for further estimation of physical properties such as the volume, surface texture, and roughness of the ore blocks, which has significant exploratory value for the future. This study uses the NeuS neural surface reconstruction method, which renders the transition from 2D images to 3D models by representing surfaces as zero-level sets of signed distance functions (SDF) [45]. Compared to previous methods using implicit differentiable renderers [46] to learn the shape of objects, this approach offers better robustness and the ability to handle complex objects. The multi-view images of ore blocks generated in Figure 7 were used as inputs, with pose matching employed to obtain image coordinate information, leading to the generation of point cloud data and the reconstruction of the 3D model. The overall implementation process, shown in Figure 8, can effectively achieve the transformation from synthesized views to the model.
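As a simplified sketch of the final surface-extraction step, the following shows how, once a NeuS-style SDF network has been trained, its zero-level set can be meshed with marching cubes. `sdf_net` is a placeholder for the trained network; the volume-rendering training of NeuS itself is omitted, and the grid resolution and bounding box are illustrative assumptions.

```python
import torch
from skimage.measure import marching_cubes

@torch.no_grad()
def extract_mesh(sdf_net, resolution=128, bound=1.0):
    """Mesh the zero-level set of a trained SDF network via marching cubes."""
    xs = torch.linspace(-bound, bound, resolution)
    grid = torch.stack(torch.meshgrid(xs, xs, xs, indexing="ij"), dim=-1)
    sdf = sdf_net(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    # vertices and faces of the zero-level set (the reconstructed surface)
    verts, faces, _, _ = marching_cubes(sdf.cpu().numpy(), level=0.0)
    # map voxel indices back to world coordinates in [-bound, bound]
    verts = verts / (resolution - 1) * 2.0 * bound - bound
    return verts, faces
```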

5. Conclusions and Discussion

As mining operations continue to evolve, the integration of advanced technologies becomes essential to improve efficiency, accuracy, and safety. This paper explores the application of latent diffusion models to address these challenges, providing new insights into ore block perception and reconstruction. Below, we summarize the key contributions, findings, and future outlook of this research:
(1) In future mining scenarios, remote-controlled and autonomous shoveling are expected to become the mainstream methods. To obtain more comprehensive information about ore blocks from limited viewpoints, thereby enhancing the accuracy of perception for remote operators or autonomous machinery, latent diffusion models are used to predict new views of single-view ore blocks.
(2) To improve the accuracy of the model in downstream tasks related to ore block view synthesis, a self-made ore block image-pose dataset created with simulation software is used for transfer learning. Additionally, a structural similarity loss is introduced to constrain the predicted results during training, which helps optimize details and improve consistency among generated views. The model performs well even when real images are used as input.
(3) Combining applications in computer vision and mining engineering, the predicted multi-view images can be further used to generate a 3D model of the ore using neural surface reconstruction methods, representing the global information of the ore blocks more intuitively.
(4) In shoveling scenarios, some limitations remain in generating new views and performing 3D reconstruction of ore blocks from the global information perceived from limited viewpoints: the high hardware resource requirements for running the model, the need for greater real-time performance in view synthesis and 3D reconstruction during actual shoveling, and the consistency between generated images, which still requires improvement. However, future research can build on the current results to further explore acquiring physical information such as the volume, surface texture, and roughness of predicted ore blocks through 3D models, thereby enriching the diversity of ore block information and enhancing the completeness of the prediction tasks.

Author Contributions

L.B.: formal analysis, funding acquisition, supervision, writing—review and editing; D.B.: data curation, investigation, software, writing—original draft, writing—review and editing; B.C.: methodology, resources, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the National Key R&D Program of China, 2023YFC2907305.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors gratefully acknowledge the funders and all advisors and colleagues who supported our work.

Conflicts of Interest

Author Lin Bi was employed by the company Changsha Digital Mine Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Tauger, V.; Valiev, N.; Volkov, E.; Simisinov, D.; Adas, V. Remote-Controlled Robotic Complex for Underground Mining. E3S Web Conf. 2020, 177, 03006. [Google Scholar] [CrossRef]
  2. Jiang, D.; Wang, L. Present Situation and Development Trend of Self-loading Technology for Underground Load-Haul-Dump. Gold Sci. Technol. 2021, 29, 35–42. [Google Scholar]
  3. Li, Q.; Liu, J.; Li, J.; Zhang, C.; Guo, J.; Wang, X.; Ran, W. Digital twin of mine ecological environment: Connotation, framework and key technologies. J. China Coal Soc. 2023, 48, 3860–3873. [Google Scholar] [CrossRef]
  4. Ji, B.; Zhou, T.; Yuan, F. The detection method of maglev gyroscope abnormal data based on the characteristics of two positioning. Sci. Surv. Mapp. 2015, 40, 106–107. [Google Scholar] [CrossRef]
  5. Al-Bakri, A.Y.; Sazid, M. Application of Artificial Neural Network (ANN) for Prediction and Optimization of Blast-Induced Impacts. Mining 2021, 1, 315–334. [Google Scholar] [CrossRef]
  6. Huai, Z.; Chen, Y.; Hong, Q. Optimizing of autonomous loading trajectory of loader based on various interpolation methods. Min. Process. Equip. 2022, 50, 10–15. [Google Scholar] [CrossRef]
  7. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Commun. ACM 2022, 65, 99–106. [Google Scholar] [CrossRef]
  8. Xia, W.; Xue, J.-H. A Survey on Deep Generative 3D-Aware Image Synthesis. ACM Comput. Surv. 2024, 56, 1–34. [Google Scholar] [CrossRef]
  9. Guo, Z.; Xie, Q.; Liu, S.; Xie, X. Bi-Resolution Hash Encoding in Neural Radiance Fields: A Method for Accelerated Pose Optimization and Enhanced Reconstruction Efficiency. Appl. Sci. 2023, 13, 13333. [Google Scholar] [CrossRef]
  10. Li, J.; Cheng, L.; He, J.; Wang, Z. Current Status and Prospects of Research on Neural Radiance Fields. J. Comput.-Aided Des. Comput. Graph. 2024, 3, 1–20. [Google Scholar]
  11. Yu, A.; Ye, V.; Tancik, M.; Kanazawa, A. pixelNeRF: Neural Radiance Fields from One or Few Images. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4576–4585. [Google Scholar] [CrossRef]
  12. Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv 2015, arXiv:1512.03012. [Google Scholar] [CrossRef]
  13. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar] [CrossRef]
  14. Liu, Z.; Ma, C.; She, W.; Xie, M. Biomedical Image Segmentation Using Denoising Diffusion Probabilistic Models: A Comprehensive Review and Analysis. Appl. Sci. 2024, 14, 632. [Google Scholar] [CrossRef]
  15. Ye, B.; Wang, H.; Li, J.; Jiang, J.; Lu, Y.; Gao, E.; Yue, T. 3D Point Cloud Completion Method Based on Building Contour Constraint Diffusion Probability Model. Appl. Sci. 2023, 13, 11246. [Google Scholar] [CrossRef]
  16. Fan, Y.; Lee, K. Optimizing DDPM Sampling with Shortcut Fine-Tuning. arXiv 2023, arXiv:2301.13362. [Google Scholar] [CrossRef]
  17. Khrulkov, V.; Ryzhakov, G.; Chertkov, A.; Oseledets, I. Understanding DDPM Latent Codes Through Optimal Transport. arXiv 2022, arXiv:2202.07477. [Google Scholar] [CrossRef]
  18. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10674–10685. [Google Scholar] [CrossRef]
  19. Croitoru, F.-A.; Hondru, V.; Ionescu, R.T.; Shah, M. Diffusion Models in Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10850–10869. [Google Scholar] [CrossRef]
  20. Kim, S.W.; Brown, B.; Yin, K.; Kreis, K.; Schwarz, K.; Li, D.; Rombach, R.; Torralba, A.; Fidler, S. NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 8496–8506. [Google Scholar] [CrossRef]
  21. Zhang, J.; Xu, Z.; Cui, S.; Meng, C.; Wu, W.; Lyu, M.R. On the Robustness of Latent Diffusion Models. arXiv 2023, arXiv:2306.08257. [Google Scholar] [CrossRef]
  22. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2022, arXiv:1312.6114. [Google Scholar] [CrossRef]
  23. Ates, G.C.; Mohan, P.; Celik, E. Dual Cross-Attention for Medical Image Segmentation. Eng. Appl. Artif. Intell. 2023, 126, 107139. [Google Scholar] [CrossRef]
  24. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar] [CrossRef]
  25. Shi, Y.; Hao, X.; Huang, X.; Pei, P.; Li, S.; Wei, T. Multi-View Synthesis of Sparse Projection of Absorption Spectra Based on Joint GRU and U-Net. Appl. Sci. 2024, 14, 3726. [Google Scholar] [CrossRef]
  26. Esser, P.; Chiu, J.; Atighehchian, P.; Granskog, J.; Germanidis, A. Structure and Content-Guided Video Synthesis with Diffusion Models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1 October 2023; pp. 7312–7322. [Google Scholar] [CrossRef]
  27. Liu, R.; Wu, R.; Van Hoorick, B.; Tokmakov, P.; Zakharov, S.; Vondrick, C. Zero-1-to-3: Zero-Shot One Image to 3D Object. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 9298–9309. [Google Scholar] [CrossRef]
  28. Deitke, M.; Schwenk, D.; Salvador, J.; Weihs, L.; Michel, O.; VanderBilt, E.; Schmidt, L.; Ehsani, K.; Kembhavi, A.; Farhadi, A. Objaverse: A Universe of Annotated 3D Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13142–13153. [Google Scholar] [CrossRef]
  29. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A Comprehensive Survey on Transfer Learning. Proc. IEEE 2021, 109, 43–76. [Google Scholar] [CrossRef]
  30. Long, X.; Guo, Y.-C.; Lin, C.; Liu, Y.; Dou, Z.; Liu, L.; Ma, Y.; Zhang, S.-H.; Habermann, M.; Theobalt, C.; et al. Wonder3D: Single Image to 3D Using Cross-Domain Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 9970–9980. [Google Scholar] [CrossRef]
  31. Hao, W.; Cai, H.; Zuo, T. Self-Supervised Pretraining for IVUS Image Segmentation Based on Diffusion Model. Laser Optoelectron. Prog. 2024, 20, 1–17. [Google Scholar]
  32. Hu, H.; Li, J.; A, X.; Duan, Y.; Wei, J. Cloud removal method of optical remote sensing image based on latent diffusion model. Acta Opt. Sin. 2024, 44, 1228009. [Google Scholar] [CrossRef]
  33. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  34. Xie, Q.; Ma, Z.; Zhu, L.; Jiang, Z. Retinal OCT image denoising based on structural similarity constrained generative adversarial network. J. Electron. Meas. Instrum. 2023, 37, 11–20. [Google Scholar] [CrossRef]
  35. Lo, Y.-M.; Chang, C.-C.; Way, D.-L.; Shih, Z.-C. Generation of Stereo Images Based on a View Synthesis Network. Appl. Sci. 2020, 10, 3101. [Google Scholar] [CrossRef]
  36. Deng, Z.; Wang, M. Reliability-Based View Synthesis for Free Viewpoint Video. Appl. Sci. 2018, 8, 823. [Google Scholar] [CrossRef]
  37. Zhu, X.; Zhang, Z.; Hou, L.; Song, L.; Wang, H. Light Field Structured Light Projection Data Generation with Blender. In Proceedings of the 2022 3rd International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA), Changchun, China, 20 May 2022; pp. 1249–1253. [Google Scholar] [CrossRef]
  38. Pyka, M.; Hertog, M.; Fernandez, R.; Hauke, S.; Heider, D.; Dannlowski, U.; Konrad, C. fMRI Data Visualization with BrainBlend and Blender. Neuroinform 2010, 8, 21–31. [Google Scholar] [CrossRef]
  39. Hatka, M.; Haindl, M. Advanced Material Rendering in Blender. IJVR 2012, 11, 15–23. [Google Scholar] [CrossRef]
  40. Bolboaca, S.; Jantschi, L. The Effect of Leverage and/or Influential on Structure-Activity Relationships. CCHTS 2013, 16, 288–297. [Google Scholar] [CrossRef] [PubMed]
  41. Huynh-Thu, Q.; Ghanbari, M. Scope of Validity of PSNR in Image/Video Quality Assessment. Electron. Lett. 2008, 44, 800. [Google Scholar] [CrossRef]
  42. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar] [CrossRef]
  43. Zhang, Y.; Sun, J.; He, X.; Fu, H.; Jia, R.; Zhou, X. Modeling Indirect Illumination for Inverse Rendering. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 18622–18631. [Google Scholar] [CrossRef]
  44. Alvi, H.M.U.H.; Farid, M.S.; Khan, M.H.; Grzegorzek, M. Quality Assessment of 3D Synthesized Images Based on Textural and Structural Distortion Estimation. Appl. Sci. 2021, 11, 2666. [Google Scholar] [CrossRef]
  45. Wang, P.; Liu, L.; Liu, Y.; Theobalt, C.; Komura, T.; Wang, W. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-View Reconstruction. arXiv 2023, arXiv:2106.10689. [Google Scholar] [CrossRef]
  46. Li, Z.; Wang, L.; Cheng, M.; Pan, C.; Yang, J. Multi-View Inverse Rendering for Large-Scale Real-World Indoor Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 12499–12509. [Google Scholar] [CrossRef]
Figure 1. Implementation process of intelligent prediction of ore block morphology.
Figure 2. Denoising diffusion probabilistic model (DDPM) inference process.
Figure 3. Latent diffusion model (LDM) inference process.
Figure 4. Schematic diagram of two rendering sampling trajectories: (a) hemispherical uniform distribution sampling trajectory; (b) Archimedean spiral sampling trajectory.
Figure 5. Partial dataset display of ore blocks.
Figure 6. Comparison of new views generated by different methods.
Figure 7. Single-to-multi-view transformation of the ore block using the diffusion model.
Figure 8. An application process example for generating a 3D model based on multi-views of the ore blocks.
Table 1. Comparison of experimental data based on the test set of this article.

Model                     PSNR (dB)   SSIM    LPIPS
PixelNeRF                 10.54       0.568   0.573
Original model            21.36       0.702   0.318
Transfer learning model   23.02       0.754   0.268
Table 2. Comparison of image generation indicators under different constraints.

Constraint            PSNR (dB)   SSIM    LPIPS
L_MSE                 22.88       0.737   0.281
L_MSE + 0.1 L_SSIM    23.02       0.754   0.268
Table 3. Comparison of image matching data under different constraints.

Group   Total Matches   Successful (L_MSE)   Successful (L_MSE + 0.1 L_SSIM)
1       50              18                   23
2       50              19                   21
3       50              21                   24
4       50              16                   26
5       50              20                   24
