1. Introduction
The rapid development of machine learning and artificial intelligence has led to significant progress in numerous application areas. Neural networks have proven to be powerful tools for automatically performing image processing tasks such as object detection and image segmentation. Extensive research has focused on the architecture of neural networks for object recognition and segmentation [1,2,3]. In particular, several pre-trained model architectures provide standardized network structures and associated weight initializations that can be directly employed. These include, for example, YOLO [4,5], Faster R-CNN [6], and Mask R-CNN [7]. These models can be fine-tuned to a specific task with little computing effort and thus enable the application of neural networks even with limited hardware resources.
However, the success of these networks is highly dependent on the quality and size of the datasets used for their training [8]. When assembling large real training datasets is impractical, synthetic datasets can be produced by rendering 3D models to generate images together with their corresponding annotations [9]. Various approaches have been developed to generate synthetic data for training neural networks; examples include the classification of defects on steel components [10], the detection of surgical instruments in the operating room [11], and the detection of food for pick-and-place applications [12].
The rendering of the synthetic image data can be carried out using a variety of software tools [13]. Noteworthy examples include approaches based on frameworks originally developed for game development, such as Unreal Engine or Unity [14,15]. Furthermore, the open-source software Blender, developed for 3D modeling and animation, and the BlenderProc framework built on top of it offer the possibility of automatically creating and rendering virtual scenes [16,17]. All these approaches offer the advantage that, in addition to the synthetic image data, the corresponding bounding boxes, segmentation masks, and, if required, depth maps and normal vectors can be generated.
When creating synthetic datasets, all objects must be available as 3D geometries with associated material and texture information. Additionally, the synthetic scene must be defined through parameters including object counts and positions, light source specifications (type, position, and intensity), virtual camera positions, and background or ground plane design. There are two general approaches for selecting those parameters to bridge the gap between the real and synthetic domains. Domain Adaptation aims to select parameters such that the resulting synthetic scenes mimic real application scenes as closely as possible. Domain Randomization, on the other hand, seeks to create synthetic environments so varied that the real scene can be considered simply another instance within this variation [18,19,20].
Current methods for generating synthetic datasets typically rely on manual selection or heuristic randomization of scene and material parameters, informed by domain knowledge or basic grid search strategies [21,22,23]. While these approaches can yield useful datasets, the optimal settings for maximizing neural network performance are often non-intuitive and highly task-dependent. Manual parameter selection thus risks producing suboptimal synthetic data, as shown by recent studies exploring manually varied parameters in object detection and segmentation pipelines [15,24].
To address these limitations, automated parameter-optimization frameworks are required to systematically identify synthetic scene and material parameters that maximize accuracy on real-world data. This study proposes a novel optimization pipeline designed to find optimal parameters tailored to a specific application, including its unique objects and environmental conditions. Notably, this approach requires only a small real dataset including images and annotations, which is used as a validation set within the optimization loop to directly assess the quality of each synthetic dataset configuration.
The next section presents the optimization pipeline developed to identify optimal parameter sets for synthetic data generation and examines its reproducibility. Section 3 details the case study methodology, followed by independent analyses of individual parameters to assess their impact on validation precision. Subsequently, multidimensional optimization is conducted separately for material and scene parameters, and the resulting optimal parameter configuration is used to generate a comprehensive training dataset for final evaluation. Results are interpreted in Section 4, Section 5 summarizes the key findings, and Section 6 concludes with recommendations for future research directions.
2. Optimization Pipeline
Building upon the limitations of manual parameter selection in synthetic data generation, this study introduces a systematic approach to identify optimal scene and material parameters for specific segmentation tasks. This section first presents the overall architecture and implementation of the pipeline, followed by analyses of its deterministic behavior under various computational conditions.
2.1. Pipeline Architecture and Implementation
The optimization architecture comprises two major components (see Figure 1): the optimization loop itself, which runs in a dedicated container, and a central database for communicating results between container instances. The containerized approach enables the framework to perform multiple parameter evaluations simultaneously on different machines.
Before beginning the optimization, the parameters and their limits have to be defined. For the purpose of this study, the parameters are separated into material parameters (e.g., realistic/parametric textures, roughness, metallicity, color) and scene parameters (e.g., number of target and distractor objects, position of target and distractor objects, number and intensity of light sources, ground texture, ground plane curvature, object scaling factor, camera position).
Optionally, each parameter can be defined by a normal distribution. The optimizer then chooses the distribution’s mean and standard deviation and samples a concrete value for each scene generation. This enables the optimization algorithm to freely explore both Domain Adaptation and Domain Randomization approaches.
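As an illustration, the following sketch shows how such a distribution-defined parameter could be handled inside the optimization loop. It assumes an Optuna trial object (the optimizer is introduced below); the parameter name and bounds are purely illustrative.

```python
import numpy as np

def sample_scene_parameter(trial, rng):
    """Let the optimizer choose a distribution, then draw one value per scene.

    A small standard deviation approximates Domain Adaptation (a nearly fixed,
    application-like value); a large one approximates Domain Randomization.
    """
    # The optimizer proposes the distribution, not the concrete value itself.
    mean = trial.suggest_float("ground_curvature_mean", 0.0, 1.0)  # hypothetical bounds
    std = trial.suggest_float("ground_curvature_std", 0.0, 0.5)

    # One concrete value is then drawn for every generated scene.
    return float(rng.normal(loc=mean, scale=std))

# Usage (once per generated scene):
# value = sample_scene_parameter(trial, np.random.default_rng())
```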
Inside the optimization loop, the open-source optimization framework Optuna [25] loads the already evaluated parameter sets and their corresponding average precision (AP) values and uses them, together with the evolutionary optimization algorithm NSGA-II [26], to identify the next set of material and scene parameters to evaluate.
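A minimal sketch of this loop is given below. The objective function structure follows the pipeline described here, but the study name, the storage string, the parameter names and bounds, and the three helper functions are placeholders rather than the actual implementation.

```python
import optuna

# Placeholders for the pipeline stages (scene rendering, training, validation).
def render_synthetic_dataset(params): return "/data/synthetic"
def train_yolov8(dataset_dir): return "/data/best.pt"
def validate_on_real_dataset(weights): return 0.0, 0.0

def evaluate_parameter_set(trial):
    """One trial: propose parameters, build a dataset, train, validate on real images."""
    params = {
        "cc_share": trial.suggest_float("cc_share", 0.0, 1.0),
        "n_objects": trial.suggest_int("n_objects", 10, 200),
        "cam_dist_min": trial.suggest_float("cam_dist_min", 0.1, 1.5),
    }
    dataset_dir = render_synthetic_dataset(params)
    weights = train_yolov8(dataset_dir)
    ap_waterpump, ap_cap = validate_on_real_dataset(weights)
    return ap_waterpump, ap_cap  # one objective per target class

# Trials are persisted so that several containers can contribute in parallel.
study = optuna.create_study(
    study_name="synthetic-data-params",
    storage="sqlite:///optuna.db",          # assumed storage; the paper uses a central database
    load_if_exists=True,
    directions=["maximize", "maximize"],    # per-class AP values
    sampler=optuna.samplers.NSGAIISampler(),
)
study.optimize(evaluate_parameter_set, n_trials=10)
```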
To generate the synthetic datasets, the BlenderProc framework [17] is used, which internally accesses Blender's Python interface. This framework offers a pipeline for generating datasets (including image data and annotations) for machine learning purposes. Figure 2 shows examples of synthetic images generated with this pipeline.
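To make the generation step concrete, the following is a minimal BlenderProc sketch for rendering one scene with COCO-style annotations. All file paths, object choices, and numeric values are placeholder assumptions; the actual pipeline additionally randomizes materials, distractors, lighting, and the ground plane as described above.

```python
import numpy as np
import blenderproc as bproc

bproc.init()

# Load the target meshes and assign class ids for the annotations (paths assumed).
waterpump = bproc.loader.load_obj("models/waterpump.obj")[0]
cap = bproc.loader.load_obj("models/cap.obj")[0]
waterpump.set_cp("category_id", 1)
cap.set_cp("category_id", 2)

# A simple ground plane and a single point light (the pipeline samples these).
ground = bproc.object.create_primitive("PLANE")
ground.set_scale([2, 2, 1])
light = bproc.types.Light()
light.set_type("POINT")
light.set_location([1, -1, 2])
light.set_energy(500)

# Point the camera at the center of the target objects.
poi = bproc.object.compute_poi([waterpump, cap])
location = np.array([0.6, -0.6, 0.5])
rotation = bproc.camera.rotation_from_forward_vec(poi - location)
bproc.camera.add_camera_pose(bproc.math.build_transformation_mat(location, rotation))

# Render color images plus instance segmentation and write COCO annotations.
bproc.renderer.enable_segmentation_output(map_by=["instance", "class", "name"])
data = bproc.renderer.render()
bproc.writer.write_coco_annotations(
    "output/coco_data",
    instance_segmaps=data["instance_segmaps"],
    instance_attribute_maps=data["instance_attribute_maps"],
    colors=data["colors"],
    color_file_format="JPEG",
)
```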
Subsequently, these datasets are used to train a neural network. The pre-trained image recognition and segmentation network YOLOv8 [5] with model size m is utilized. For the training, only the synthetic dataset generated with the selected material and scene parameters is used. Finally, the quality of the parameter set is determined by evaluating the neural network trained on the synthetic dataset against the real annotated dataset.
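A sketch of this training and evaluation step using the Ultralytics API is shown below; the dataset configuration files and training settings are assumptions, and only the per-class AP values returned to the optimizer are essential.

```python
from ultralytics import YOLO

# Fine-tune the pre-trained medium-size segmentation model on the purely
# synthetic dataset, then validate on the real, manually annotated images.
model = YOLO("yolov8m-seg.pt")
model.train(data="synthetic_dataset.yaml", epochs=100, imgsz=640)  # assumed settings

# Per-class mask AP on the real dataset is fed back to the optimizer.
metrics = model.val(data="real_validation.yaml")
ap_per_class = metrics.seg.maps  # one AP value per object class
print(dict(zip(model.names.values(), ap_per_class)))
```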
2.2. Deterministic Behavior
The pipeline for investigating the scene parameters comprises several components (see Figure 1). The following investigations examine whether the results of each step are reproducible. Reproducibility is important in this framework to ensure that the return values of the optimization function (the AP values of the object classes) depend solely on the chosen scene and material parameters.
The influence of the GPU model is investigated using various available models from NVIDIA across different architecture versions. First, three synthetic datasets are generated. Then, training with these datasets is performed on each GPU, and the trained models are evaluated using the real dataset.
The investigations indicate that different GPU models can achieve varying results on the real validation dataset despite using the same training dataset (see Table 1). When multiple training and validation runs are started on the same worker, the results remain unchanged. Likewise, identical results are consistently obtained on different workers that use the same GPU model. This demonstrates that the training step itself is deterministic. Hence, to improve reproducibility, only workers equipped with an RTX 4090 GPU will be used for the following parameter studies.
These findings show that the training process itself is deterministic when the same synthetic training dataset is used. Additionally, it needs to be investigated whether the rendering pipeline produces the same training images (and annotations) for identical scene and material parameters. To ensure reproducibility, the seeds of all random number generators used during scene creation (e.g., for object poses and selection of materials) are fixed.
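A minimal sketch of this seeding is shown below, assuming the scene-creation code draws its random numbers from Python's random module and NumPy's global generator (as the samplers used here typically do); the seed value itself is arbitrary.

```python
import random
import numpy as np

SEED = 42  # arbitrary but fixed per parameter set

# Fix the generators used during scene creation (object poses, material choice, ...)
# so that identical scene and material parameters always yield the same dataset.
random.seed(SEED)
np.random.seed(SEED)
```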
In this study, the ray-tracing-based Cycles engine from Blender version 3.3.14 is used for rendering. By rendering the same scene several times, it can be observed that the generated images are generally not exactly the same. The color values differ especially in poorly lit areas of the scene, which indicates that the ray-tracing process itself lacks deterministic behavior. It should be noted that there is ongoing work in the Blender project that aims for a deterministic rendering process [27].
If the number of ray samples is increased manually, these differences become smaller and are partly absorbed by the 8-bit discretization of the brightness values. However, no practical sample count could be found that fully resolved the issue. Rendering on the CPU with a single thread can mitigate the effect, but the associated loss of performance makes this impractical. Consequently, a potential influence of the non-determinism in Cycles on the subsequent investigations cannot be ruled out.
3. Analysis
In the following sections, the presented framework is used to evaluate the significance of individual material and scene parameters and to identify optimal parameter settings for a specific application.
3.1. Case Study Setup
For each optimization study, the AP scores of the individual object classes are taken directly as the objectives to be maximized. Although the optimization operates directly on these per-class AP values, subsequent sections frequently report only the mean average precision (mAP) for conciseness. Since Optuna supports multi-objective optimization, any number of object classes can be incorporated in practical applications of this pipeline. For simplicity, this case study focuses on two target classes:
The Waterpump class (see Figure 3a) represents a water pump cover. It exhibits distinctive features that facilitate recognition, though these may vary by viewpoint. The Cap class (see Figure 3b) corresponds to a housing cap. It is easily identifiable by its latching tabs, but these features become obscured from certain angles, causing the object to resemble a simple cuboid and complicating detection.
For the real dataset, physical versions of these objects were fabricated via 3D printing. To enhance visual diversity, some components were colored, while others were deliberately deformed or had parts removed to simulate manufacturing defects.
The real dataset comprises three distinct image groups, each designed to simulate potential application scenarios of varying difficulty. The first group (27 images) depicts office environments (Figure 4a), where objects rest on similarly colored surfaces (e.g., keyboards or desks) and are partially occluded by target or primitive objects. The second group (28 images) captures components on a conveyor belt (Figure 4b), featuring clear background separation to emulate a standard robotic pick-and-place task. The third group (47 images) shows components placed at varying heights amidst laboratory equipment and utensils, representing a complex laboratory setting. As this dataset serves as the benchmark for evaluating synthetic data effectiveness, the optimization should yield parameter sets that deliver optimal performance specifically for these application domains.
3.2. Synthetic Dataset Size
The time required to evaluate a parameter set is approximately proportional to the number of synthetic training images. To maximize the number of evaluations feasible with the available hardware, it is important to first assess the influence of the synthetic training dataset size on validation precision. For this purpose, a dataset of 400 scenes is generated first. The number of scenes to be used for the training processes is then varied. The scenes used for each training process are randomly selected from the generated pool.
Figure 5 indicates an almost linear correlation between the number of scenes and the achieved mAP for small numbers of scenes. As the number of scenes grows further, the gain in mAP saturates. To balance validation accuracy against processing time, the number of scenes is set to 50 (10 images per scene, i.e., 500 training images) for all subsequent investigations.
3.3. Optimization of Individual Parameters
The framework is first used to individually analyze the influence of several selected parameters. In this way, the influence of a single parameter—or a small group of parameters—on the training accuracy achieved can be analyzed in isolation. These analyses are conducted independently, and the results of one parameter study are not always integrated into subsequent analyses. For this reason, the mAP values obtained can be compared only within a study, not across different studies. In each parameter evaluation, 500 synthetic image–annotation pairs (10 in each of the 50 unique scenes) are generated.
When analyzing the number of objects, both the absolute number of all objects n and the individual object counts (Waterpump objects, Cap objects, and distractor objects) are considered, as well as the share of the target object classes in the total number of objects and the inter-class ratio r between the Waterpump and Cap objects. The distractor objects have primitive geometric shapes (e.g., cylinders, spheres, cubes and cones); their number follows as the total object count minus the number of target objects.
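The following sketch illustrates one way these aggregated parameters could be turned into per-class object counts. The conventions used (the target share as the fraction of target objects among all objects, and r as the fraction of Waterpump objects among the target objects) are assumptions, since the original notation did not survive, and the function name is illustrative.

```python
def split_object_counts(n_total, target_share, r):
    """Derive per-class object counts from the aggregated scene parameters.

    target_share: assumed fraction of target objects among all objects.
    r:            assumed fraction of Waterpump objects among the target objects
                  (r = 0 -> only Cap objects, r = 1 -> only Waterpump objects).
    """
    n_targets = round(n_total * target_share)
    n_waterpump = round(n_targets * r)
    n_cap = n_targets - n_waterpump
    n_distractor = n_total - n_targets  # the remaining objects are primitives
    return n_waterpump, n_cap, n_distractor

# Example: 120 objects, 75% targets, slightly more Cap than Waterpump objects.
print(split_object_counts(120, 0.75, 0.45))  # -> (40, 50, 30)
```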
3.3.1. Analysis of the Total Object Count
The first analysis investigates the total number of objects present in each scene. This total number is divided into distractor objects, Waterpump objects and Cap objects. The remaining material and scene parameters are selected at random.
Figure 6 shows a clear correlation between the total number of objects and the mAP achieved on the real validation dataset. If n is too low, the validation accuracy achieved on the real dataset decreases, as fewer objects are available overall for training the neural network. Excessive object counts also reduce mAP, likely due to increased occlusion that hinders complete object visibility. The optimal number of objects lies between these extremes.
3.3.2. Analysis of the Number of Primitive Objects
Furthermore, the influence of the primitive distractor objects on the achieved validation accuracy is analyzed. For this purpose, a fixed number of Waterpump and Cap objects is rendered in each of the 500 training images. The number of primitive objects is then sampled from a fixed interval. All other material and scene parameters are again selected at random.
The results in Figure 7 show a noisy but approximately linear decrease of mAP with the number of additional distractor objects. The results indicate that including distractor objects does not improve the dataset quality.
3.3.3. Analysis of the Inter-Class Ratio
The inter-class ratio r between the Waterpump and Cap target classes is analyzed next. The proportion of target objects in the total number of objects is held constant. The aim is to identify a Pareto front along which the average precision values achieved on the real validation dataset for both object classes are maximized.
Figure 8 shows the AP values for both target object classes as a function of the inter-class ratio r. As expected, for small r the AP values for the Waterpump class are low and those for the Cap class are high; the opposite behavior is observed for large r. Furthermore, the AP of the Cap class is lower when the dataset is heavily skewed toward Cap objects than for more balanced distributions. This effect also occurs for Waterpump objects under a strongly skewed distribution, but is much less pronounced. Most Pareto-optimal solutions lie at a ratio corresponding to a slightly higher number of Cap objects than Waterpump objects.
3.3.4. Analysis of the Material Library Composition
The materials for the objects (target and distractor objects) and the material for the ground plane are randomly selected from a material library comprising 100 different materials. The material library contains both parametric materials (defined by their color, roughness and metallicity) and realistic materials (sourced from the CC-Textures library). The composition of this material library is characterized by the proportion of realistic materials (referred to as CC-Share); the proportion of parametric materials is correspondingly 1 − CC-Share. All other parameters are set to random values.
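A sketch of how such a mixed library could be assembled with BlenderProc is shown below. The texture folder path, the library size handling, and the choice of principled-shader values are assumptions; the actual pipeline may parameterize the materials differently.

```python
import numpy as np
import blenderproc as bproc

def build_material_library(cc_share, size=100, cc_dir="resources/cctextures"):
    """Mix realistic CC-Textures materials with randomly parameterized materials.

    cc_share is the fraction of realistic materials in a library of `size` entries.
    """
    n_realistic = int(round(size * cc_share))
    realistic = bproc.loader.load_ccmaterials(cc_dir)[:n_realistic]

    parametric = []
    for i in range(size - n_realistic):
        mat = bproc.material.create(f"parametric_{i}")
        mat.set_principled_shader_value("Base Color", np.append(np.random.rand(3), 1.0))
        mat.set_principled_shader_value("Roughness", float(np.random.rand()))
        mat.set_principled_shader_value("Metallic", float(np.random.rand()))
        parametric.append(mat)

    return realistic + parametric
```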
Figure 9 shows that material libraries composed exclusively of parametric or realistic materials achieve lower mAP values than more balanced material libraries. The best precision (apart from outliers) is achieved at an intermediate CC-Share. It should be noted that the outliers in the data initially resemble noise effects. On closer inspection, however, it becomes apparent that the data points in the immediate vicinity of an outlier also feature similar AP values. This indicates a systematic cause for the outliers, which point to particularly good or bad CC-Share values.
3.3.5. Analysis of the Camera Distance
In this parameter study, the camera distance to the center of gravity of all objects is examined; both the minimum and the maximum camera distance are varied.
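A minimal sketch of how such a camera pose could be sampled in BlenderProc is given below; the elevation limits and the function name are illustrative assumptions, while the two distance bounds are the parameters examined in this study.

```python
import blenderproc as bproc

def sample_camera_pose(objects, dist_min, dist_max):
    """Place the camera at a random distance in [dist_min, dist_max] (in metres)
    from the objects' center of gravity and point it at that center."""
    poi = bproc.object.compute_poi(objects)  # center of the loaded objects
    location = bproc.sampler.shell(
        center=poi,
        radius_min=dist_min,
        radius_max=dist_max,
        elevation_min=15,                    # assumed viewing-angle limits
        elevation_max=89,
    )
    rotation = bproc.camera.rotation_from_forward_vec(poi - location)
    bproc.camera.add_camera_pose(bproc.math.build_transformation_mat(location, rotation))
```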
Figure 10 shows the mAP values obtained for the examined ranges of the minimum and maximum camera distance. Over part of this range, only low mAP values are achieved, whereas the best results on the real validation data are obtained within a narrower band of minimum and maximum distances. In detail, the optimal camera distances differ between the two target object classes.
3.4. Multi-Parameter Optimizations
The next step is the multidimensional optimization of the parameters already presented as well as some additional parameters, e.g., the ground plane curvature a and the number of light sources. The optimization is divided into separate scene and material parameter optimizations, reducing the dimensionality of the search space.
First, the multidimensional optimization of the material parameters is carried out. The roughness and metallicity of the parametric materials and the proportion of realistic materials in the material library are considered. A total of 85 trials with distinct parameters are evaluated, and three Pareto-optimal sets are identified.
As shown in Figure 11, the best training results occur when the proportion of realistic materials lies within an intermediate range. By contrast, metallicity and roughness show no clear relationship with the AP values. This suggests that, among the material parameters, the proportion of realistic materials has the greatest influence on the quality of a synthetic training dataset.
This is verified through a parameter relevance analysis using the fANOVA algorithm [28]. The results support the hypothesis that the proportion of realistic materials is the most important material parameter (see Figure 12). The three Pareto-optimal training results, in terms of the average precision for both object classes, are achieved with the material parameters shown in Table 2.
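Optuna ships an fANOVA-based importance evaluator that can reproduce this kind of analysis directly from a stored study; the study name, storage URL, and example output below are placeholders.

```python
import optuna
from optuna.importance import FanovaImportanceEvaluator

# Load the finished material-parameter study from the central database.
study = optuna.load_study(study_name="material-params", storage="sqlite:///optuna.db")

# For multi-objective studies, a target must be selected, e.g. the first
# objective (here: the AP of the Waterpump class).
importances = optuna.importance.get_param_importances(
    study,
    evaluator=FanovaImportanceEvaluator(),
    target=lambda trial: trial.values[0],
)
print(importances)  # e.g. {"cc_share": 0.7, "roughness": 0.2, "metallicity": 0.1}
```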
An analogous multidimensional optimization is also performed for the scene parameters. Here, the number of point light sources, the curvature of the ground plane, the total number of objects n, the share of target objects, the inter-class ratio r, and the minimum and maximum camera distances are optimized. A total of 171 parameter sets are evaluated.
Figure 13 and Table 3 show the parameters that are used to generate the most suitable datasets for training. For some parameters, the best training datasets cluster around similar values; these include the camera distances and the parameters that describe the object distribution.
This observation can also be confirmed for the scene parameter optimization using a fANOVA-based analysis (see Figure 14). The most important parameter is the minimum camera distance; its optimal interval roughly corresponds to the results of the preliminary study on the influence of the camera distance (see Figure 10).
3.5. Evaluation of Optimal Configuration
In the previous multidimensional optimizations, Pareto-optimal parameter sets were found for the generation of synthetic training data. Based on these investigations, an optimal dataset is generated. For this purpose, the material and scene parameters with which the highest mAP was achieved are combined (last columns in
Table 2 and
Table 3). This dataset with optimal parameters consists of 5000 images and annotations. Additionally, a second dataset is generated in which roughness, metallicity, background curvature, and number of light sources are not fixed values, but are instead modeled as normally distributed. The expected value of each normal distribution is set to the determined optimal value. The final evaluation of these training datasets is performed using the
xl-model of YOLOv8. The results of this final evaluation are shown in
Table 4.
For both the dataset with fixed parameters and the dataset with normally distributed values, one training process with and one without data augmentation is started. When data augmentation is used, the YOLO framework automatically alters the training data to increase variety. This includes adjustments to hue, saturation, and value, as well as geometric transformations (rotation, translation, scaling, shearing, flipping, etc.) and copy–pasting of image segments.
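The sketch below shows how these two training variants could be launched with the Ultralytics API. The hyperparameter values are illustrative (mostly the framework's defaults) rather than the exact settings of this study, the dataset file is an assumption, and yolov8x-seg.pt is assumed here to correspond to the extra-large ("xl") segmentation model mentioned above.

```python
from ultralytics import YOLO

DATA = "synthetic_optimal.yaml"  # assumed dataset configuration

# Run with the built-in augmentations active (illustrative values shown).
model = YOLO("yolov8x-seg.pt")
model.train(
    data=DATA, epochs=100,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,       # hue / saturation / value jitter
    degrees=0.0, translate=0.1, scale=0.5,   # geometric transformations
    shear=0.0, fliplr=0.5, flipud=0.0,
    copy_paste=0.1,                          # copy-pasting of segment instances
)

# Run with augmentation deactivated by zeroing the augmentation hyperparameters.
model_plain = YOLO("yolov8x-seg.pt")
model_plain.train(
    data=DATA, epochs=100,
    hsv_h=0.0, hsv_s=0.0, hsv_v=0.0,
    degrees=0.0, translate=0.0, scale=0.0,
    shear=0.0, fliplr=0.0, flipud=0.0,
    mosaic=0.0, copy_paste=0.0,
)
```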
4. Discussion
The mAP values achieved are modest compared with those reported in previous studies, primarily because the validation dataset is small and challenging. For example, a more application-specific validation dataset could have been generated, such as one in which all objects are placed on a conveyor belt, emulating a classic pick-and-place task. However, the aim of this work extends beyond maximizing absolute mAP values. Instead, the results show that the presented optimization pipeline is able to find a material and scene parameter set that is suitable for generating a purely synthetic training dataset for one or multiple specific applications.
The results demonstrate that the camera distance is the most important scene parameter, likely because it directly determines object size in synthetic images. The other scene parameters only have a minor influence on the accuracy achieved by the trained network (see Figure 14).
Regarding the material parameters, the results (Figure 12) suggest that the proportion of realistic materials is especially important. Thus, using a certain share of parametric materials in addition to realistic ones is advantageous. In contrast, the precise parameterization of these parametric materials (in terms of roughness and metallicity) appears to be less important.
It should also be mentioned that the results could be improved by increasing the number of parameter evaluations within the optimizations. However, the absolute number of optimization runs for this study is limited by the available computing power.
5. Conclusions
This paper introduces an optimization pipeline for enhancing the generation of synthetic datasets used in neural network-based object segmentation. Leveraging the BlenderProc framework for procedural scene generation and Optuna for parameter optimization, the pipeline systematically identifies optimal material and scene parameters for specific objects based on performance on a real validation dataset.
Through individual and multidimensional parameter studies, important parameters influencing synthetic dataset quality were identified. Notably, camera distance proved to be the most significant scene parameter, while the proportion of realistic materials was crucial among material parameters. The optimized parameter sets demonstrate that the pipeline can produce synthetic training data that meaningfully improves neural network performance. They also provide insight into how systematic scene design and material selection affect the robustness of synthetic data for segmentation tasks.
This suggests a significant improvement over conventional data generation practices, where parameter selection typically relies on domain expertise or trial-and-error. Such approaches cannot systematically identify the non-intuitive parameter combinations that might be optimal for specific segmentation tasks.
6. Future Work
While the results of this study confirm the feasibility of synthetic data parameter optimization, many avenues for further research remain. For example, a more comprehensive exploration of the parameter space could be achieved by increasing the number of parameter evaluations. Furthermore, the relative importance of the scene and material parameters could be determined by a joint multidimensional parameter optimization. However, both approaches would require additional computing resources. In addition, the influence of the size and type of the real validation dataset on the optimization should be investigated. The investigations should be repeated once a solution ensuring the deterministic behavior of Blender's Cycles renderer becomes available.
Optimization was performed on two components with markedly different geometries, suggesting that optimal parameter values and the relative importance of individual parameters depend on the component geometry. Therefore, a new optimization would be required to generate synthetic training data for a different component. In further work, the influence of the component geometry on the optimal material and scene parameters should be investigated in more detail, e.g., by parameterizing the geometry of the component itself so that the optimal parameters can be determined as a function of the component geometry.
The real validation dataset used is relatively heterogeneous, encompassing multiple environments. For a simple application (e.g., pick-and-place robots), on the other hand, a very homogeneous validation dataset can be used that is generated directly in the application environment. In such cases, it would be valuable to investigate whether the optimization converges to a parameter set that closely matches the actual scene, assuming the material and scene parameters permit such a configuration.
Author Contributions
Conceptualization, M.N. and N.M.; methodology, M.N.; software, M.N.; validation, L.H.; formal analysis, M.N.; investigation, M.N.; resources, K.H. and N.M.; data curation, K.H.; writing—original draft preparation, M.N.; writing—review and editing, K.H. and L.H.; visualization, M.N.; supervision, L.H. and E.R.; project administration, L.H. and E.R. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The raw data supporting the conclusions of this article will be made available by the authors on request.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object Detection with Deep Learning: A Review. IEEE Trans. Neural Networks Learn. Syst. 2019, 30, 3212–3232.
- Zaidi, S.S.A.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.; Lee, B. A survey of modern deep learning based object detection models. Digit. Signal Process. 2022, 126, 103514.
- Jiao, L.; Zhang, F.; Liu, F.; Yang, S.; Li, L.; Feng, Z.; Qu, R. A Survey of Deep Learning-Based Object Detection. IEEE Access 2019, 7, 128837–128868.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO. 2023. Available online: https://docs.ultralytics.com/de/models/yolov8/ (accessed on 30 December 2024).
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Schieber, H.; Demir, K.C.; Kleinbeck, C.; Yang, S.H.; Roth, D. Indoor Synthetic Data Generation: A Systematic Review. Comput. Vis. Image Underst. 2024, 240, 103907.
- Boikov, A.; Payor, V.; Savelev, R.; Kolesnikov, A. Synthetic Data Generation for Steel Defect Detection and Classification Using Deep Learning. Symmetry 2021, 13, 1176.
- Wiese, L.; Hinz, L.; Reithmeier, E. Medical instrument detection with synthetically generated data. In Proceedings of the Medical Imaging 2024: Imaging Informatics for Healthcare, Research, and Applications, Yokohama, Japan, 17–19 May 2024; Yoshida, H., Wu, S., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2024; Volume 12931, p. 1293109.
- Jonker, M.; Roozing, W.; Strisciuglio, N. Synthetic Data-Based Training of Instance Segmentation: A Robotic Bin-Picking Pipeline for Chicken Fillets. In Proceedings of the 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE), Bari, Italy, 28 August–1 September 2024; pp. 2805–2812.
- Arlovic, M.; Damjanovic, D.; Hrzic, F.; Balen, J. Synthetic Dataset Generation Methods for Computer Vision Application. In Proceedings of the 2024 International Conference on Smart Systems and Technologies (SST), Osijek, Croatia, 16–18 October 2024; pp. 69–74.
- Martinez-Gonzalez, P.; Oprea, S.; Garcia-Garcia, A.; Jover-Alvarez, A.; Orts-Escolano, S.; Garcia-Rodriguez, J. UnrealROX: An extremely photorealistic virtual reality environment for robotics simulations and synthetic data generation. Virtual Real. 2020, 24, 271–288.
- Borkman, S.; Crespi, A.; Dhakad, S.; Ganguly, S.; Hogins, J.; Jhang, Y.C.; Kamalzadeh, M.; Li, B.; Leal, S.; Parisi, P.; et al. Unity Perception: Generate Synthetic Data for Computer Vision. arXiv 2021, arXiv:2107.04259.
- Blender Foundation. Blender—A 3D Modelling and Rendering Software. 2024. Available online: https://www.blender.org (accessed on 30 December 2024).
- Denninger, M.; Winkelbauer, D.; Sundermeyer, M.; Boerdijk, W.; Knauer, M.; Strobl, K.H.; Humt, M.; Triebel, R. BlenderProc2: A Procedural Pipeline for Photorealistic Rendering. J. Open Source Softw. 2023, 8, 4901.
- Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 23–30.
- Tremblay, J.; Prakash, A.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C.; To, T.; Cameracci, E.; Boochoon, S.; Birchfield, S. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Salt Lake City, UT, USA, 18–22 June 2018.
- Prakash, A.; Boochoon, S.; Brophy, M.; Acuna, D.; Cameracci, E.; State, G.; Shapira, O.; Birchfield, S. Structured Domain Randomization: Bridging the Reality Gap by Context-Aware Synthetic Data. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7249–7255.
- Rawal, P.; Sompura, M.; Hintze, W. Synthetic Data Generation for Bridging Sim2Real Gap in a Production Environment. arXiv 2024, arXiv:2311.11039.
- Newman, C.; Petzing, J.; Goh, Y.M.; Justham, L. Investigating the optimisation of real-world and synthetic object detection training datasets through the consideration of environmental and simulation factors. Intell. Syst. Appl. 2022, 14, 200079.
- Wiese, L.; Hinz, L.; Reithmeier, E.; Korn, P.; Neuhaus, M. Detection of Surgical Instruments Based on Synthetic Training Data. Computers 2025, 14, 69.
- Schneidereit, T.; Gohrenz, S.; Breuß, M. Object detection characteristics in a learning factory environment using YOLOv8. arXiv 2025, arXiv:2503.10356.
- Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631.
- Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197.
- Blender Contributors. Blender Issue #101726: Cycles Does Not Generate the Exact Same Images When a Scene Is Rendered Twice. 2022. Available online: https://projects.blender.org/blender/blender/issues/101726 (accessed on 30 December 2024).
- Hutter, F.; Hoos, H.; Leyton-Brown, K. An Efficient Approach for Assessing Hyperparameter Importance. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 22–24 June 2014; Xing, E.P., Jebara, T., Eds.; Proceedings of Machine Learning Research, Volume 32, pp. 754–762.
Figure 1. Procedure of optimization.
Figure 2. Examples of synthetic images from two different scenes.
Figure 3. Renderings of the target objects with parametric materials.
Figure 4. Examples from the real-world dataset. (a) Sample captured in an office environment. (b) Sample captured on a conveyor belt.
Figure 5. Influence of the number of scenes on the real validation dataset mAP. Each scene contains 10 images captured from different camera angles.
Figure 6. Influence of the total number of objects n in the synthetic dataset on the validation accuracy on the real dataset. This number includes the target objects as well as the distractor objects.
Figure 7. Influence of the number of additional distractor objects. The distractor objects have primitive geometric shapes (e.g., cylinders, spheres, cubes and cones). When no distractor objects are present, the scene only contains the target objects (in this case Waterpump and Cap).
Figure 8. Influence of the inter-class ratio r on the average precision values for both object classes. The lower end of the ratio corresponds to a dataset that consists solely of Cap objects, while the upper end corresponds to a dataset that consists solely of Waterpump objects.
Figure 9. Analysis results for the composition of the material library. For a CC-Share of 0, the material library consists only of parametric materials; for a CC-Share of 1, it consists only of realistic materials.
Figure 10. Analysis results of the influence of the camera distance interval on the validation dataset mAP. The optimization bounds are chosen so that the minimum camera distance does not exceed the maximum camera distance.
Figure 11. Results achieved from the multidimensional material parameter optimization. The parameter sets which yielded the best results are highlighted.
Figure 12. Significance of the individual material parameters as calculated using the fANOVA algorithm.
Figure 13. Results from the multidimensional scene parameter optimization. The parameter sets which yielded the best results are highlighted.
Figure 14. Relevance of the individual scene parameters as calculated using the fANOVA algorithm.
Table 1. Average precision for the Waterpump class using identical training data on different GPUs.
| GPU | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---|---|---|
| RTX 4090 | 0.551 | 0.459 | 0.511 |
| RTX 4070 Ti | 0.554 | 0.466 | 0.579 |
| RTX 3090 | 0.545 | 0.534 | 0.569 |
| GV100 | 0.537 | 0.472 | 0.553 |
Table 2. Material parameters and precision values of the Pareto-optimal datasets.
| Dataset Hash | e0834c | 22d11f0 | 8a5d1c |
|---|---|---|---|
| Metallicity | 0.215 | 0.867 | 0.965 |
| Roughness | 0.204 | 0.204 | 0.785 |
| CC-Share | 0.638 | 0.638 | 0.386 |
| AP Waterpump | 0.630 | 0.646 | 0.386 |
| AP Cap | 0.590 | 0.559 | 0.546 |
| mAP | 0.610 | 0.603 | 0.611 |
Table 3. Scene parameters and precision values for the Pareto-optimal datasets.
| Dataset Hash | 9dc945 | 893078 | f4493f4 |
|---|---|---|---|
| Ground curvature a | 0.660 | 0.867 | 0.274 |
| Light sources | 3 | 7 | 2 |
| Min. camera dist. [m] | 0.268 | 0.280 | 0.207 |
| Max. camera dist. [m] | 1.212 | 0.816 | 1.228 |
| Share of target objects | 0.767 | 0.856 | 0.759 |
| Inter-class ratio r | 0.517 | 0.124 | 0.403 |
| Total objects n | 120 | 74 | 199 |
| AP Waterpump | 0.710 | 0.614 | 0.685 |
| AP Cap | 0.506 | 0.572 | 0.546 |
| mAP | 0.608 | 0.593 | 0.615 |
Table 4. Results of the final evaluation.
| Metric | Fixed Values (Augmentation Deactivated) | Fixed Values (Augmentation Activated) | Normally Distributed Values (Augmentation Deactivated) | Normally Distributed Values (Augmentation Activated) |
|---|---|---|---|---|
| AP Waterpump | 0.683 | 0.736 | 0.697 | 0.785 |
| AP Cap | 0.540 | 0.620 | 0.526 | 0.647 |
| mAP | 0.612 | 0.678 | 0.612 | 0.716 |