Article

Camera Pose Generation Based on Unity3D

Hao Luo, Wenjie Luo and Wenzhu Yang
1 School of Cyber Security and Computer, Hebei University, Baoding 071002, China
2 Machine Vision Engineering Research Center, Hebei University, Baoding 071002, China
* Author to whom correspondence should be addressed.
Information 2025, 16(4), 315; https://doi.org/10.3390/info16040315
Submission received: 27 February 2025 / Revised: 4 April 2025 / Accepted: 14 April 2025 / Published: 16 April 2025

Abstract
Deep learning models performing complex tasks require the support of datasets. With the advancement of virtual reality technology, virtual datasets are being used more and more widely in deep learning. Indoor scenes represent a significant area of interest for the application of machine vision technologies. Existing virtual indoor datasets exhibit deficiencies with regard to camera poses, resulting in problems such as occlusion, object omission, and objects occupying too small a proportion of the image, and they perform poorly in training for object detection and simultaneous localization and mapping (SLAM) tasks. To address the problem of enabling cameras to comprehensively capture scene objects, this study presents an enhanced algorithm based on rapidly exploring random tree star (RRT*) for generating camera poses in 3D indoor scenes. In addition, to generate multimodal data for various deep learning tasks, this study designs an automatic image acquisition module on the Unity3D platform. Experimental results on several mainstream virtual indoor datasets, such as 3D-FRONT and Hypersim, indicate that the image sequences generated in this study show improvements in object capture rate and efficiency. Even in cluttered environments such as those in SceneNet RGB-D, the object capture rate remains stable at around 75%. Compared with the image sequences from the original datasets, those generated in this study achieve improvements in object detection and SLAM tasks, with increases of up to approximately 30% in mAP for the YOLOv10 object detection task and up to approximately 10% in SR for the ORB-SLAM algorithm.

1. Introduction

Indoor scene datasets are suitable for tasks such as deep learning model training, object detection, and instance segmentation. However, complex tasks such as scene understanding and SLAM [1] often require continuously captured indoor images from which environmental information can be extracted. Indoor scene datasets that support such complex tasks generally take the form of image sequences or videos, which requires planning reasonable camera poses. For datasets originating from real scenes, the camera poses can be planned manually, as in Matterport3D [2]; for virtual datasets that require large-scale generation, automatic planning of camera poses is essential.
The current mainstream virtual indoor scene datasets are SceneNet RGB-D [3], 3D-FRONT [4], and TartanAir [5]. The images in these datasets are sampled by a camera from a virtual 3D scene; consequently, the creators of these datasets have all investigated the problem of planning camera poses to a considerable extent. Nevertheless, these studies still have their own limitations. For instance, SceneNet RGB-D employs a Brownian motion-based random walk algorithm to simulate the camera's motion through the Chrono physics engine [6]. However, the algorithm tends to become trapped in local loops in complex scenes. Experiments have shown that 30% of the paths cover the same area repeatedly, with a capture rate of only 38% for objects smaller than 0.1 cubic meters. 3D-FRONT employs a room topology-based optimization strategy, which generates discrete candidate points within a room using the Delaunay triangulation algorithm [7] and then selects a single optimal viewpoint in each room based on the visible area of the viewpoints. This strategy results in limited viewpoint diversity, with an average of only 5.2 viewpoints per scene and less than 12% coverage of the scene objects.
Constructing datasets with challenging virtual environments and motion trajectories is a strategy employed by some other dataset producers to push the limits of SLAM models. For instance, TartanAir was constructed with the Unreal engine and configured with two motion modes: Easy and Challenge. The Easy mode employs uniform linear motion, while the Challenge mode incorporates sinusoidal perturbations to simulate bumpy effects. TartanAir randomly samples keyframes from space and then connects the keyframes using the RRT* algorithm [8]. TartanAir's research emphasizes that many SLAM models that performed well on well-established datasets, such as KITTI [9], exhibit suboptimal performance when confronted with challenging trajectories. This underscores the pivotal role that camera trajectory generation plays in model training outcomes. Therefore, it is necessary to design camera pose generation methods suitable for indoor scenes.
Many scholars have investigated how to improve the performance of datasets on tasks such as object detection and SLAM by improving camera pose planning. For instance, Du et al. [10] proposed a method for real-time object detection in dynamic environments that substantially reduces the time needed to analyze object motion. This was accomplished by enhancing the low-frequency components extracted from point cloud data as feature points, achieving an average detection rate of 88.7% and raising the recall rate to 89.1%. Therefore, this study argues that it is entirely feasible to improve data performance on these tasks by optimizing camera poses in indoor scenes.
In recent years, there has been notable progress in virtual image design with the development of various rendering engines. More and more virtual datasets have been produced using these engines, such as SceneNet with the Chrono engine and TartanAir with the Unreal engine. Inspired by the above work, this study proposes a method for generating camera poses and capturing images in Unity3D scenes. The primary focus of our study is enhancing the camera's capture rate for indoor objects, increasing the area share of objects in the image, reducing the time spent, and optimizing the performance of the image sequences in object detection and SLAM tasks. The primary contributions of our study can be summarized as follows:
1. In this study, we propose a hierarchical camera pose generation algorithm. The algorithm first generates coarse-grained keypoints based on indoor spatial objects and then generates complete paths through an improved RRT* algorithm. The primary benefit of this algorithm is that it relies entirely on geometric information from the environment and does not depend on image information. Consequently, the method does not have to turn on camera rendering, which significantly reduces the running time. Generating a single trajectory requires approximately 500 ms, and the object capture rate is raised to around 80%.
2. In this study, we propose automatic image rendering and annotation methods in the Unity Engine. Once the camera trajectory is formed, it is made continuous and smooth using Bessel curve [11] interpolation. The HD rendering pipeline of the Unity Engine is used to render images automatically, and annotation information, such as instance segmentation and object detection labels, can be output by binding the NYU [12] category labels to the model.
3. In this study, we propose an import module that makes diverse datasets compatible with the Unity Engine. The module converts the model structures prevalent in mainstream datasets, including FBX, OBJ, and C4D, into triangular meshes that can be used by the Unity Engine through Delaunay triangulation. The method is successfully applied to these datasets, and new image sets are generated. Comparison experiments are conducted between the generated image sets and the original image sets to validate the performance of our method in object detection and SLAM tasks.

2. Related Work

The camera pose determines the 3D coordinates and angles of the camera; more complete camera poses also include the camera's field of view (FOV), focal length, and other parameters. A camera trajectory consists of a sequence of camera poses. The objective of camera trajectory generation is to determine a trajectory in a 3D environment that avoids all obstacles while capturing the objects. In this process, objects are not only obstacles that the camera must avoid but also targets that the camera must capture. Determining a camera trajectory in a confined and cluttered indoor environment is significantly more challenging than in a spacious outdoor environment. This problem is commonly studied in the context of producing virtual 3D datasets, where the work can be subdivided into methods for storing camera poses and algorithms for generating camera trajectories.

2.1. Storage of Camera Poses

The storage of camera poses is predominantly found in datasets used for camera pose estimation and 3D reconstruction. In datasets comprising a substantial number of images, the camera poses corresponding to all images along a camera track are typically integrated and labeled in a single file. The information stored in this file depends on the type of camera. For instance, the KITTI dataset uses light detection and ranging (LiDAR) scanning and four cameras, two grayscale and two color. The LiDAR scans the 3D scene, generating a 3D point cloud, and each camera maintains a 3 × 3 rotation matrix and a 3 × 1 translation matrix, which are stored in a text file (TXT). The camera pose is derived by applying this matrix to the homogeneous coordinates in the LiDAR coordinate system. The advantage of this method is that every camera and point cloud in the space can seamlessly transition between the LiDAR coordinate system and the camera coordinate system; the disadvantage is that it increases the storage burden. In the 3D-FRONT dataset, the position of the camera and the position of the object it observes are stored in two 3 × 1 matrices. In addition, information such as the camera's near and far planes and FOV is added, all stored in JavaScript Object Notation (JSON). The camera angle can be obtained from the camera coordinates and the object's coordinates. Therefore, this study employs the 3D-FRONT method to store the camera trajectory.
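To make the storage format concrete, the sketch below serializes a pose in the spirit of the position-plus-target storage described above (camera position, observed point, FOV, near/far planes) and recovers a rotation matrix from the two points. The JSON field names and the y-up, look-at construction are our own illustrative choices, not the official 3D-FRONT schema.

```python
import json
import numpy as np

def look_at_rotation(cam_pos, target, up=(0.0, 1.0, 0.0)):
    """Rotation matrix whose rows are the camera's right, up, and forward axes,
    with the forward axis pointing from the camera toward the observed target."""
    cam_pos, target, up = (np.asarray(v, dtype=float) for v in (cam_pos, target, up))
    forward = target - cam_pos
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    return np.stack([right, true_up, forward])

# Illustrative pose record; field names are assumptions, not the 3D-FRONT schema.
pose_record = {
    "pos": [1.2, 1.5, 3.0],      # camera position (3 x 1)
    "target": [0.0, 0.8, 0.0],   # observed point (3 x 1)
    "fov": 60.0,                 # field of view in degrees
    "near": 0.1,                 # near plane
    "far": 50.0,                 # far plane
}
rotation = look_at_rotation(pose_record["pos"], pose_record["target"])
print(json.dumps(pose_record))
print(rotation)
```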

2.2. Camera Trajectory Generation

Many large-scale datasets are produced with automated camera trajectory generation in mind, as there are high costs associated with manually planning camera trajectories. Furthermore, certain real-world filming techniques are often integrated into the generation process. This study summarizes two camera pose planning strategies commonly used for datasets: the two-body trajectory and the random walk.
The two-body trajectory strategy was proposed by SceneNet RGB-D and refined by InteriorNet [13]. Its core is to simulate two physical bodies that move randomly within the indoor free space. One of them defines the camera position, and the other defines the observation point; together they determine the camera's pose. The advantage of this strategy is that the camera poses are intuitively described and sufficiently diverse. The limitation is that the camera's region of interest is not considered, leading to 40% of the image sequences containing less than 10% of the target furniture. Although most objects can be captured with a telephoto lens, many small objects occupy a tiny fraction of the image area and are frequently missed in object detection. This study argues that using a two-body trajectory to define the camera pose is reasonable. However, the observation point should be focused on the object, which will improve the object capture rate. Furthermore, the camera position should be determined by the observation point, and maintaining a suitable distance enlarges the object's area in the image.
The random walk strategy has been studied in many path planning algorithms. However, in trajectory generation for a dataset, it is not entirely random. For instance, Hypersim [14]'s 3D environment uses a substantial number of triangles for modeling. Hypersim observed that protruding objects (e.g., beds, tables, and sofas) have a higher density of triangles in the mesh than non-protruding objects (e.g., ceilings, floors, and walls). This is because large planes are typically composed of a few large triangles, whereas furniture objects require many smaller triangles to reconstruct their intricate details. Inspired by this observation, Hypersim improved the sampling process for random walks. Suppose c_0, c_1, ..., c_n is the sequence of camera poses Hypersim wants to generate. Hypersim sets c_0 to be an artist-defined camera pose and then generates the subsequent camera poses by iteratively sampling from a conditional probability distribution p(c_{i+1} | c_i), where c_i is the current known camera pose and c_{i+1} is the next camera pose to be sampled. Hypersim restricts the sampling space of c_{i+1} to a small local neighborhood around c_i that lies in free space. During each iteration of this sampling procedure, Hypersim generates a discrete set of candidate poses uniformly at random and calculates the density of triangles covered by raycasting against the scene geometry. The probability of each candidate pose is determined by the density of the triangles. Hypersim employs environment heuristics to guide the random walk, thereby enhancing the object capture rate, with 88.3% of the pixels in its images occupied by objects (including walls, floors, and ceilings). However, the limitation of Hypersim is that the artist's intervention increases production time, with an average of 1.5 man-hours per trajectory, which is unacceptable for a large dataset. This study argues that replacing the artist's preliminary work with an object-heuristic generative approach can reduce manual intervention, i.e., setting a reasonable camera pose for each object individually. When there are too many objects in an indoor scene, an appropriate merging strategy is used to ensure that as many objects as possible are covered while balancing the time spent.
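The following sketch illustrates the density-weighted random walk idea described above, assuming a caller-supplied triangle_density function that stands in for Hypersim's raycast-based density evaluation; the candidate count, neighborhood radius, and fallback behavior are illustrative assumptions rather than Hypersim's actual implementation.

```python
import numpy as np

def next_pose(c_i, triangle_density, n_candidates=64, radius=0.5, rng=None):
    """Sample the next camera position c_{i+1} in a small neighborhood of c_i,
    weighting candidates by the triangle density they observe."""
    rng = np.random.default_rng() if rng is None else rng
    c_i = np.asarray(c_i, dtype=float)
    # Uniform candidates in a local box around the current pose
    # (free-space filtering is omitted here for brevity).
    candidates = c_i + rng.uniform(-radius, radius, size=(n_candidates, 3))
    scores = np.array([max(triangle_density(c), 0.0) for c in candidates])
    if scores.sum() == 0.0:
        return candidates[0]                      # degenerate case: no signal
    probs = scores / scores.sum()                 # discrete p(c_{i+1} | c_i)
    return candidates[rng.choice(n_candidates, p=probs)]
```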

3. Methods

This study designs a fully automated model in the Unity Engine, composed of three parts: data mapping, trajectory generation, and data processing. The data mapping process imports the 3D model and generates a free space by traversing all objects, producing the information needed for obstacle avoidance and trajectory planning. The trajectory generation process then produces a continuous camera pose sequence. We instruct the camera to move along the trajectory and collect data from multiple sensors. In the data processing phase, image quality is evaluated and the camera pose is fine-tuned to obtain higher-quality images. The tools used at each stage of the process are illustrated in Figure 1.

3.1. Data Mapping

Data mapping is the process of reading a 3D model and collecting the required information. Since many datasets store models as point clouds or polygon meshes, some preliminary processing is required to adapt them to the triangular mesh structure of the Unity platform. Point cloud data are first downsampled to reduce their density; a Poisson reconstruction method [17] then generates mesh data, and the mesh is post-processed to repair any holes. Polygonal meshes are converted into triangular meshes using the Delaunay algorithm. After this preliminary processing, we use an occupancy map to store the available space within the model. The methodology is as follows: first, a random sample of points is extracted from the model and converted into a point cloud structure; then the structure is stored in an Octomap, outputting an Octree that stores free space; finally, a free-space occupancy map is obtained by traversing the tree structure. The time required to map an environment measuring 10 m × 10 m × 10 m with an accuracy of 0.2 m is approximately 5 min. This method fails to recognize some free space that is too close to objects, but given that the camera's FOV is intended to keep a certain distance from the object, this error is acceptable (see Figure 2).
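As a rough illustration of the free-space mapping step, the sketch below marks voxels containing sampled surface points as occupied and treats the remainder as free, using a uniform grid as a simplified stand-in for the Octomap/Octree pipeline; a real occupancy map would also distinguish free from unknown space by raycasting.

```python
import numpy as np

def free_space_grid(surface_points, bounds_min, bounds_max, voxel=0.2):
    """Boolean grid in which True marks free space: voxels containing sampled
    surface points are occupied, everything else is treated as free."""
    bounds_min = np.asarray(bounds_min, dtype=float)
    bounds_max = np.asarray(bounds_max, dtype=float)
    dims = np.ceil((bounds_max - bounds_min) / voxel).astype(int)
    occupied = np.zeros(dims, dtype=bool)
    idx = np.floor((np.asarray(surface_points, dtype=float) - bounds_min) / voxel).astype(int)
    idx = np.clip(idx, 0, dims - 1)
    occupied[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return ~occupied
```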
The set of objects to be detected is obtained from the annotation file of the dataset; alternatively, new objects can be defined in the model as detection targets. We collect the position and orientation of each object to facilitate capturing a frontal view of it. The usual annotation file specifies the position and orientation of each object as 3D vectors or rotation angles, which facilitates this work.

3.2. Trajectory Generation

Following the work of McCormac et al. [3], the camera trajectory is represented by a smooth path together with observation points corresponding to a number of sampling points on that path, and a path planning algorithm is used to generate both. Since Blender is very mature in mesh processing, ray detection, and related operations, this study uses Blender to generate camera trajectories. The present approach is analogous to Hypersim in that it also generates a set of recommended camera poses as a reference. The difference is that Hypersim's recommended poses are chosen to be aesthetically pleasing, whereas our focus is to display the object comprehensively. Initially, informed by the object information collected earlier, suitable capture positions and angles, i.e., sampling points, are planned for each object in the room. Subsequently, the algorithm plans a reasonable path based on these sampling points. Finally, the paths are smoothed using a mathematical model. The creation of sampling points for an object depends on the scenarios described below.

3.2.1. Single Large Object

In the event that the object is large and is not in contact with other small objects, the observation point can be set to the object's center of gravity. An oriented bounding box is used to delimit the object, and the camera position P_cam is then calculated by the following formula:
$$P_{cam} = O_{obj} + \frac{d\,r}{v} \cdot M \cdot R_{obj}$$
where r represents the image resolution, typically set to 1024 or 512; d represents the ratio of the object to the image; and v represents the focal length of the camera. This fraction represents the straight-line distance between the camera and the object. M is a 3 × 3 rotation matrix used to rotate the direction vector of the object, R_obj, by a certain angle. Given that the virtual camera is set to a perspective projection, this straight-line distance must exceed the near-plane distance of the camera; otherwise, the object will be lost. The camera position P_cam must lie in free space and can be constrained to a frustum of a cone by fine-tuning the parameters d and M.
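The sketch below shows one way to realize this placement rule, reading the formula as "back the camera off from the object center by the distance d·r/v along a rotated copy of the object's facing direction". The yaw-only choice of M, the numeric defaults, and the near-plane clamp are illustrative assumptions.

```python
import numpy as np

def yaw_matrix(angle_rad):
    """Rotation about the vertical (y) axis, standing in for the matrix M."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def single_object_camera(o_obj, r_obj, d=0.3, r=1024, v=600.0, yaw=0.2, near=0.3):
    """Place the camera at distance d * r / v from the object's center along a
    rotated copy of the object's facing direction R_obj; the observation point
    is the object's center of gravity."""
    o_obj = np.asarray(o_obj, dtype=float)
    r_obj = np.asarray(r_obj, dtype=float)
    r_obj = r_obj / np.linalg.norm(r_obj)
    distance = max(d * r / v, near)            # keep the object beyond the near plane
    p_cam = o_obj + distance * (yaw_matrix(yaw) @ r_obj)
    return p_cam, o_obj                        # camera position and observation point
```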

3.2.2. Multiple Objects

Since the indoor scenes in which we work contain numerous small objects, we must consider situations where both large and small objects are in the frame at the same time. The prevailing computer vision models demonstrate a notable insensitivity to small objects. Consequently, when the size discrepancy between two objects is pronounced, we no longer seek to capture the entire large object but instead focus on capturing a close-up image of the small object. The relationship between the two objects can be categorized as follows:
(a) The small object is supported by the large object by gravity.
(b) The small object is suspended by the large object.
This can be determined by examining the relative positional relationship between the two objects. A support relationship results in many objects concentrated in the same plane, which leads to heavy occlusion between the objects and is not conducive to object detection. It is therefore necessary to elevate the camera to add an overhead view. In a support relationship, the centers of gravity of the small objects are situated above the bounding box of the large object. We exploit this by projecting rays from the center of gravity of the larger object through the centers of gravity of the smaller objects. The mean of these rays, R_ave, is then used as a reference for the camera's orientation, with the distance adjusted by the area ratio of the smallest object, d_Sma:
$$R_{ave} = \frac{1}{n} \sum_{i=1}^{n} \left( O_{obj_i} - O_{obj_{Lar}} \right)$$
$$P_{cam} = O_{obj_{Lar}} + \frac{d_{Sma}\,r}{v} \cdot M \cdot R_{ave}$$
For the suspension relation, we simply keep the small object between the camera and the large object. This ensures that the smaller object is not occluded by the larger one and that a frontal view can be obtained. The images shown in Figure 3 are rendered using these two strategies.
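A minimal sketch of the support-relationship case follows: it averages the rays from the large object's center through the small objects' centers (R_ave) and places the camera along that mean direction, scaled by the smallest object's ratio d_Sma. Normalizing the mean ray and using an identity matrix as the default M are our own simplifications.

```python
import numpy as np

def support_view(o_large, small_centers, d_sma=0.15, r=1024, v=600.0, M=np.eye(3)):
    """Camera placement for a support relationship: average the rays from the large
    object's center through each small object's center and back the camera off
    along that direction, scaled by the smallest object's ratio d_Sma."""
    o_large = np.asarray(o_large, dtype=float)
    rays = np.asarray(small_centers, dtype=float) - o_large
    r_ave = rays.mean(axis=0)
    r_ave = r_ave / np.linalg.norm(r_ave)      # normalization is our simplification
    p_cam = o_large + (d_sma * r / v) * (M @ r_ave)
    return p_cam, r_ave                        # camera position and viewing direction
```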

3.2.3. Path Planning

Following the generation of a series of observation points O and camera positions P, both are utilized as inputs to the path planning algorithm, in conjunction with an Octree T that stores the available space.
Algorithm 1 outputs a graph H of 7-tuples (heptads), each of which represents the position and orientation (converted to a quaternion) of a camera keyframe. The function Sample randomly samples a position point p from P and obtains the observation point o corresponding to p. The function Near obtains the vertex p_near with the highest priority with respect to p_rand from P. The priority of a position point is calculated as follows:
$$p(p_{near}, p_{rand}) = \lambda\, d(p_{near}, p_{rand}) + \delta\, s_{near} + \lvert \theta_{near} - \theta_{rand} \rvert$$
The closer the distance between position point p_near and point p_rand, the smaller the difference in direction angle θ, and the smaller the size of point p_near, the higher the priority of p_near.
Algorithm 1 Planning a route from a collection of position points.
1: Input: P, O, T
2: Output: H, O_unfin
3: Sample(P, O) → (p_init, o_rand), H.Vertices.add(p_init, o_rand)
4: while O_fin.count < O.count do
5:     Sample(P, O) → (p_rand, o_rand)
6:     Near(p_rand, P) → p_near
7:     Edge(p_rand, p_near) → E
8:     if CollisionFree(T, E) then
9:         H.Vertices.add(p_near, o_near), H.Edges.add(E)
10:        CaptureGet(p_near, o_near) → O_get, O_fin.add(o_get)
11:    else
12:        step = 5 or break
13:    end if
14: end while
15: O_unfin = O − O_fin
The function Edge establishes a direct path E between the positions p_rand and p_near, and the function CollisionFree detects whether E lies in the free space T. Determining whether E is in T can be time-consuming; therefore, we employ the Octree to streamline this process. The storage structure of the Octree ensures that if p_rand and p_near are both located within the same leaf node, then E must be in T. When the two are not located in the same leaf node, we set one of them as the starting point and obtain a number of uniformly sampled points on E, with the spacing of these points determined by the accuracy of the Octree. We consider E to be in T only if all the sampled points lie on leaf nodes of T. The function CaptureGet determines the sequence of observation points captured, O_get, which is achieved by detecting the rays emitted from the camera.
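The following sketch approximates the CollisionFree test on the uniform-grid stand-in introduced earlier: it samples the edge at the map resolution and requires every sample to fall in a free cell, mirroring the leaf-node check described above (the real implementation queries Octree leaves rather than a dense grid).

```python
import numpy as np

def collision_free(p_a, p_b, free_grid, bounds_min, voxel=0.2):
    """Approximate CollisionFree test: sample the edge p_a -> p_b at the map
    resolution and require every sample to fall inside a free cell."""
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    bounds_min = np.asarray(bounds_min, dtype=float)
    n_samples = max(int(np.ceil(np.linalg.norm(p_b - p_a) / voxel)) + 1, 2)
    for t in np.linspace(0.0, 1.0, n_samples):
        q = p_a + t * (p_b - p_a)
        idx = np.floor((q - bounds_min) / voxel).astype(int)
        if np.any(idx < 0) or np.any(idx >= free_grid.shape):
            return False                       # sample lies outside the mapped volume
        if not free_grid[tuple(idx)]:
            return False                       # sample lies in an occupied cell
    return True
```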
The graph formed at this point presents certain issues. For example, when the function CollisionFree declines to generate an edge between two vertices, the graph may comprise multiple connected components. In light of the many-to-one relationship between position points and observation points, a proportion of the observation points may remain unobserved. Furthermore, the presence of a cyclic structure within the graph may cause the algorithm to become trapped in a local loop.
The RRT* algorithm is employed to establish connections between the connected sub-graphs. Assuming that the edge between points p_1 and p_2 is rejected by the function CollisionFree, we sample a series of points r_i from the leaf nodes of T in which p_1 and p_2 are located. These points r_i are sampled around the edge and follow a Gaussian distribution:
$$p(r_{n+1}) = \frac{1}{\sqrt{(2\pi)^3 \det(\Sigma)}} \exp\!\left( -\frac{1}{2} (r_{n+1} - r_n)^{\top} \Sigma^{-1} (r_{n+1} - r_n) \right)$$
This is a recursive formula, where r_n is the 3D coordinate of the point drawn in the n-th sampling step, r_0 stands for point p_1 or p_2, and Σ is the covariance matrix (a 3 × 3 matrix with σ² on the diagonal and 0 elsewhere). The larger σ is, the more dispersed the samples. By sampling continuously, we establish feasible edges between p_1 and p_2.
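A compact sketch of this bridging step is given below: starting from one endpoint of the rejected edge, each new candidate is drawn from an isotropic Gaussian centered at the previous sample, so a larger sigma spreads the candidates more widely. The sample count and sigma value are illustrative.

```python
import numpy as np

def bridge_samples(p_start, sigma=0.3, n=32, rng=None):
    """Recursive Gaussian sampling around a rejected edge: each new candidate is
    drawn from N(r_n, sigma^2 * I) centered at the previous sample, with r_0 set
    to one of the edge endpoints (p_1 or p_2)."""
    rng = np.random.default_rng() if rng is None else rng
    cov = (sigma ** 2) * np.eye(3)
    samples = [np.asarray(p_start, dtype=float)]
    for _ in range(n):
        samples.append(rng.multivariate_normal(samples[-1], cov))
    return np.array(samples[1:])
```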
Subsequently, a depth-first search is employed to eliminate the edges that induce cycles. Eventually, we obtain a connected directed acyclic graph (CDAG). We traverse it as an undirected graph to identify the branch containing the maximum number of vertices as the path. The detailed process is shown in Figure 4.
At this juncture, the path is still a polyline. A Bessel curve is used to smooth the path. To prevent the curve from deviating excessively, we take a certain distance d before and after each inflection point of the path and use only second-order Bessel curves. The camera pose along the section of path between the two endpoints is derived by spherical linear interpolation [18]. Path smoothing introduces some chance of collision, which is unavoidable. As the value of d increases, the path becomes smoother and the collision probability changes accordingly. Whether a path collides can be determined by analyzing the depth information of the images captured by the camera. We measure the relationship between the distance d and the collision rate in experiments on various datasets, as shown in Figure 5.
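The sketch below shows the corner-rounding and orientation-interpolation steps as we read them: each sharp corner is replaced by a second-order (quadratic Bézier) curve whose outer control points sit a distance d back along the incident segments, and orientations between the endpoints are blended with spherical linear interpolation. The specific control-point construction is our interpretation of the text.

```python
import numpy as np

def quadratic_bezier(p0, p1, p2, n=20):
    """Points on a second-order curve with control points p0, p1, p2."""
    p0, p1, p2 = (np.asarray(p, dtype=float) for p in (p0, p1, p2))
    t = np.linspace(0.0, 1.0, n)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

def round_corner(prev_pt, corner, next_pt, d=0.45, n=20):
    """Replace a sharp corner of the polyline: step back a distance d along both
    incident segments and use the corner itself as the middle control point."""
    prev_pt, corner, next_pt = (np.asarray(p, dtype=float) for p in (prev_pt, corner, next_pt))
    a = corner + d * (prev_pt - corner) / np.linalg.norm(prev_pt - corner)
    b = corner + d * (next_pt - corner) / np.linalg.norm(next_pt - corner)
    return quadratic_bezier(a, corner, b, n)

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions (w, x, y, z)."""
    q0, q1 = np.asarray(q0, dtype=float), np.asarray(q1, dtype=float)
    dot = np.dot(q0, q1)
    if dot < 0.0:                              # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:                           # nearly parallel: linear fallback
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
```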
It is important to note that the camera does not render images during trajectory generation, as frequent rendering calls are computationally expensive. Figure 6 shows the complete process of trajectory generation within a 3D-FRONT scene.

3.3. Data Processing

We instruct the virtual camera to shoot along the trajectory and collect multimodal information, including RGB-D images, camera poses, and segmentation. From these fundamental data, we can also compute additional information, such as optical flow and stereo disparity, which is challenging to acquire with conventional equipment. Raw images captured by the camera are rendered using the High Definition Render Pipeline [19] (HDRP) provided by the Unity engine, and the camera's frame rate can be adjusted. During rendering, deliberate adjustments to the camera's focus or FOV are typically not made. When the frame rate of the camera is excessively high, or the camera is too close to the subject, the quality of the captured images is compromised. This manifests primarily as the absence of the object from the frame, an excessive proportion of the object within the frame, or significant occlusions between objects. Therefore, we design a formula to evaluate the quality of each image:
$$Score_i = \frac{\lambda}{(i-1)!} \sum_{m=1}^{count_i} \sum_{n=m+1}^{count_i} \left( \frac{d_{mn}}{c_{mn}} - \frac{B_m^{gt} \cap B_n^{gt}}{B_m^{gt} \cup B_n^{gt}} \right) + \frac{\mu}{s} \sum_{j=i-s}^{i-1} \frac{count_i}{count_j}$$
The formula requires the image segmentation information, where count_i represents the total number of objects captured in the i-th image, B^gt represents a ground-truth bounding box, d_mn represents the distance between the centers of ground-truth boxes m and n, and c_mn represents the diagonal length of the smallest box covering both. This fraction is equivalent to measuring the degree of occlusion between two objects in terms of the distance-intersection over union [20] (D-IoU). Each image is also compared to the previous s frames, which lowers the score when an image captures a substantially smaller number of objects or when there is a high rate of occlusion between multiple objects.
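To make the occlusion term concrete, the sketch below computes the standard D-IoU-style quantity from the cited work [20] for a pair of axis-aligned 2D ground-truth boxes: the IoU of the two boxes minus the squared ratio of their center distance d_mn to the diagonal c_mn of the smallest box enclosing both. How this pairwise term is weighted and combined with the frame-to-frame object-count comparison follows the formula above and is not reproduced here.

```python
import numpy as np

def pair_occlusion(box_m, box_n):
    """D-IoU-style measure for two axis-aligned boxes (x1, y1, x2, y2): IoU of the
    two ground-truth boxes minus the squared ratio of the center distance d_mn to
    the diagonal c_mn of the smallest box enclosing both."""
    box_m = np.asarray(box_m, dtype=float)
    box_n = np.asarray(box_n, dtype=float)
    # Intersection and union areas.
    lt = np.maximum(box_m[:2], box_n[:2])
    rb = np.minimum(box_m[2:], box_n[2:])
    inter = np.prod(np.clip(rb - lt, 0.0, None))
    area_m = (box_m[2] - box_m[0]) * (box_m[3] - box_m[1])
    area_n = (box_n[2] - box_n[0]) * (box_n[3] - box_n[1])
    union = area_m + area_n - inter
    iou = inter / union if union > 0 else 0.0
    # Center distance d_mn and enclosing-box diagonal c_mn.
    d_mn = np.linalg.norm((box_m[:2] + box_m[2:]) / 2 - (box_n[:2] + box_n[2:]) / 2)
    c_mn = np.linalg.norm(np.maximum(box_m[2:], box_n[2:]) - np.minimum(box_m[:2], box_n[:2]))
    return iou - (d_mn / c_mn) ** 2 if c_mn > 0 else iou
```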
For image frames that receive low scores, there are two processing strategies: when the scores of the frames after the i-th image are normal, we can directly delete the i-th image; conversely, when the scores of the frames after the i-th image are also clearly low, the design of this section of the trajectory is unreasonable. In that case, we delete this section of the trajectory and record the breakpoints preceding and following it. Thereafter, the trajectory between the breakpoints is regenerated with the previously described RRT* algorithm.

4. Experiments

Given the paucity of datasets providing 3D models, the method is implemented on the 3D-FRONT, SceneNet RGB-D, and Hypersim datasets. We show the performance of the method on these datasets, and experiments are conducted to evaluate it on object detection and SLAM tasks.
The first set of images in Table 1 shows a selection captured using the algorithm within the 3D-FRONT 3D environment. The layout is very neat, as the smaller objects in this dataset are locked into position by the larger furniture. The second set of images is from SceneNet RGB-D, showing an indoor environment composed entirely of object file format (OBJ) models, with furniture taken from ShapeNet [21]. The layout of this dataset is considerably more cluttered, as it is based solely on rudimentary physical constraints. The third set of images is drawn from the Hypersim dataset, which is renowned for its rich annotation information.

4.1. Basic Metrics

Table 2 lists the basic metrics of the method running on each dataset. The most prevalent room types in each dataset are enumerated. Most room types include numerous individual rooms, with the largest sizes belonging to the suite category from 3D-FRONT. The average capture rate of objects within a room is calculated from the output image sequence (an object counts as captured when 80% of it is exposed); the higher the capture rate, the more objects are covered by the generated camera poses. The data demonstrate that the method performs well on 3D-FRONT and Hypersim but has significant shortcomings on SceneNet RGB-D. We believe that the increase in small objects and the cluttered layouts adversely affect the method, which is reflected in the average area share of the objects in the image (using a 2D bounding box as the baseline for calculating area). To facilitate a meaningful comparison with the images from the original dataset, the resolution of our images is kept the same as that of the original images. The average area share is defined as the mean pixel area share of each object in the image according to the instance segmentation. The larger the value, the less likely it is for an object to be missed, and the better the performance of visual models that are sensitive to small objects. The majority of the objects have an average area share of about 20% in the image. Furthermore, the average time spent generating trajectories for each room is measured (excluding data mapping); the method's time remains around 500 ms in all instances, a testament to the efficient utilization of the Octree by the algorithm.

4.2. Object Detection

We compare the images collected from multiple scenes with the corresponding image sets of the datasets. 3D-FUTURE: the image set corresponding to the 3D-FRONT dataset; the furniture in 3D-FRONT scenes comes from 3D-FUTURE, so we select the images of furniture appearing in 250 scenes, totaling 4832 images. SceneNet: we use its room generator to generate 200 scenes and employ its image generator and our method to generate 4000 images each. Hypersim: a total of 40 short frame sequences are selected from 40 scenes, each sequence comprising 100 images; the 3D models for these 40 scenes are purchased from TurboSquid [22]. Each image set is fed into the object detection models for training separately and is divided equally into 5 folds for 5-fold cross-validation. The average values are reported in Table 3.
The object detection models employed in this study are YOLOv10 [23] and RT-DETR [24]. Both are end-to-end object detection models that do not require post-processing modules such as non-maximum suppression [25] (NMS), so the detection results depend exclusively on the dataset and the model itself. The output metrics of the YOLO model include mAP, precision (Pre), and recall (Rec), and the output metrics of the DETR model include accuracy (Acc), mAP, mAR (maxDet = 10), and mAR (maxDet = 100). Precision (Pre) is the proportion of correctly predicted samples among all samples predicted by the model; higher precision corresponds to stronger classification ability. Recall (Rec) is the proportion of truly positive samples that are predicted correctly; the higher the recall, the lower the model's miss rate. The mean average precision (mAP) for both detection models is calculated over IoU thresholds ranging from 0.5 to 0.95. Accuracy (Acc) is the proportion of all samples that the model predicts correctly. The mean average recall (mAR) is the average recall when the number of predicted boxes per image is limited (maxDet, i.e., the detection count threshold). Comparing the average recall at maxDet = 10 and maxDet = 100 rules out the possibility of the model repeatedly detecting the same object.
As shown in the table, compared with the original images, the images captured from the camera poses generated by our method show substantial improvements in all metrics on SceneNet and Hypersim. When training the YOLOv10 model, the mAP of our method is improved by up to 30% over the original SceneNet RGB-D data, while on Hypersim the improvement is about 5%. The results on 3D-FUTURE are not as satisfactory, and Pre and Rec even decline. This study hypothesizes that this is related to 3D-FUTURE's models, since the mesh models of many furniture items and the objects on them are bound together (e.g., a desk and the computer on the desk). It is very difficult to split the two models automatically, so they can only share the desk label. When generating camera poses with our method, these small objects cannot serve as independent observation targets, and thus omissions occur. Only improved model data can effectively solve this problem.

4.3. SLAM Task

We perform the SLAM task with the image sequences and test SLAM algorithms including oriented FAST and rotated BRIEF SLAM [26] (ORB-SLAM), large-scale direct SLAM [27] (LSD-SLAM), and direct sparse odometry [28] (DSO). Given that the images provided by 3D-FUTURE are not continuous, this study uses SceneNet and Hypersim. To ensure the success rate of the SLAM algorithms, the frame rate of the camera is increased. A total of 40 scenes are selected from each dataset, with each scene containing 300 images. The test metrics are the absolute trajectory error (ATE), the reconstruction error (RE), and the success rate (SR). ATE measures the discrepancy between the trajectory estimated by the algorithm and the actual trajectory; it is obtained by calculating the Euclidean distance between the estimated and ground-truth poses for each frame and then taking the root mean square error (RMSE). The lower the ATE, the closer the camera trajectory estimated by the SLAM algorithm is to the true trajectory. RE evaluates the accuracy of the algorithm's reconstructed map by calculating the average distance between each point of the semi-dense reconstruction and the nearest surface of the indoor model; the lower the RE, the more accurate the reconstructed map. SR is defined as the ratio of image sequences that run successfully to the total number of sequences. The results of the SLAM task are presented in Table 4.
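A minimal sketch of the ATE computation as described above: per-frame Euclidean distances between estimated and ground-truth camera positions, aggregated by RMSE. It assumes the two trajectories are already time-aligned and expressed in the same coordinate frame (SLAM evaluations usually add a similarity alignment step first).

```python
import numpy as np

def absolute_trajectory_error(est_positions, gt_positions):
    """ATE: per-frame Euclidean distance between estimated and ground-truth camera
    positions, aggregated with the root mean square error (RMSE)."""
    est = np.asarray(est_positions, dtype=float)
    gt = np.asarray(gt_positions, dtype=float)
    errors = np.linalg.norm(est - gt, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))
```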
The table shows that the method proposed in this study achieves substantial improvements for ORB-SLAM and DSO, with significant reductions in ATE. However, some issues remain with LSD-SLAM, which we infer are related to the frequent lighting changes in the images.

5. Conclusions

This study proposes a methodology for the generation of camera poses within a virtual 3D indoor environment. The deployment of the method on a variety of virtual datasets was achieved through the Unity engine, and the method demonstrated significant improvements in both the efficiency of camera pose generation and the capture rate of objects. A comparative analysis of existing datasets demonstrates that the proposed method is capable of achieving effective results in object detection and SLAM tasks.
This research contributes to the rapid iteration of data-driven visualization models and reduces the reliance on large amounts of realistic data collection. In addition, this research reduces the time to generate poses and is expected to be applied in dynamic environments where real-time generation is required.
Recommendations for field specialists include the following. Application extension: in robot navigation and VR development, there is an increasing requirement for more precise environment complexity parameters, such as lighting variations, custom relationship annotations, and custom regions of interest; task-specific training data can be generated by adding these tunable parameters to the Unity interface. Generalization performance: despite the power of Unity's High Definition Render Pipeline, synthetic images still do not perform well on tasks in real scenes, and their generalization could be improved. This study suggests using synthetic data as a supplement to real data or for pre-training.
Future research directions include the following. Semantic awareness: We intend to investigate semantic awareness modules to control pose generation, utilizing architectural features such as doors, windows, and corridors as anchor points to generate more ergonomic trajectories. Algorithmic robustness: We intend to investigate the robustness of validating methods in dense scenes and in environments containing transitional spaces (e.g., stairs and corridors). Multimodal sensor fusion: We intend to investigate a hybrid framework of virtual cameras and real sensors to simulate real noise (e.g., motion blur and light noise) to enhance the realism of the data. Real-time dynamic environment: We intend to investigate the real-time generation of camera poses in dynamic environments to realize complex functions such as object re-identification and tracking.
By addressing these directions, future research will bridge the gap between synthetic data and realistic deployment. Collaboration with architects and robotics engineers is expected to further advance the utility of synthetic data for real-world missions.

Author Contributions

Conceptualization, H.L.; Methodology, H.L.; Software, H.L.; Validation, H.L.; Investigation, H.L.; Resources, H.L.; Writing—original draft, H.L.; Writing—review & editing, W.L.; Supervision, W.L.; Project administration, W.Y.; Funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hebei Province (F2024201012).

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from Alibaba Tao Technology Department and are available from https://tianchi.aliyun.com/dataset/65347, accessed on 13 July 2024.

Conflicts of Interest

The authors declare no conflicts of interest. The funders play a role in project administration and funding acquisition.

Abbreviations

The following abbreviations are used in this manuscript:
RRT*      rapidly exploring random tree star
SLAM      simultaneous localization and mapping
FOV       field of view
ORB-SLAM  oriented FAST and rotated BRIEF SLAM
LSD-SLAM  large-scale direct SLAM
DSO       direct sparse odometry
ATE       absolute trajectory error
RE        reconstruction error
SR        success rate

References

  1. Macario Barros, A.; Michel, M.; Moline, Y.; Corre, G.; Carrel, F. A comprehensive survey of visual slam algorithms. Robotics 2022, 11, 24. [Google Scholar] [CrossRef]
  2. Ramakrishnan, S.K.; Gokaslan, A.; Wijmans, E.; Maksymets, O.; Clegg, A.; Turner, J.; Undersander, E.; Galuba, W.; Westbury, A.; Chang, A.X.; et al. Habitat-matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI. arXiv 2021, arXiv:2109.08238. [Google Scholar]
  3. McCormac, J.; Handa, A.; Leutenegger, S.; Davison, A.J. Scenenet rgb-d: 5m photorealistic images of synthetic indoor trajectories with ground truth. arXiv 2016, arXiv:1612.05079. [Google Scholar]
  4. Fu, H.; Cai, B.; Gao, L.; Zhang, L.X.; Wang, J.; Li, C.; Zeng, Q.; Sun, C.; Jia, R.; Zhao, B.; et al. 3d-front: 3d furnished rooms with layouts and semantics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10933–10942. [Google Scholar]
  5. Wang, W.; Zhu, D.; Wang, X.; Hu, Y.; Qiu, Y.; Wang, C.; Hu, Y.; Kapoor, A.; Scherer, S. Tartanair: A dataset to push the limits of visual slam. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24–28 October 2020; pp. 4909–4916. [Google Scholar]
  6. Tasora, A.; Serban, R.; Mazhar, H.; Pazouki, A.; Melanz, D.; Fleischmann, J.; Taylor, M.; Sugiyama, H.; Negrut, D. Chrono: An open source multi-physics dynamics engine. In Proceedings of the High Performance Computing in Science and Engineering: Second International Conference, HPCSE 2015, Soláň, Czech Republic, 25–28 May 2015; pp. 19–49. [Google Scholar]
  7. Elshakhs, Y.S.; Deliparaschos, K.M.; Charalambous, T.; Oliva, G.; Zolotas, A. A comprehensive survey on Delaunay triangulation: Applications, algorithms, and implementations over CPUs, GPUs, and FPGAs. IEEE Access 2024, 12, 12562–12585. [Google Scholar] [CrossRef]
  8. Noreen, I.; Khan, A.; Habib, Z. Optimal path planning using RRT* based approaches: A survey and future directions. Int. J. Adv. Comput. Sci. Appl. 2016, 7. [Google Scholar] [CrossRef]
  9. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  10. Du, W.; Beltrame, G. LiDAR-based Real-Time Object Detection and Tracking in Dynamic Environments. arXiv 2024, arXiv:2407.04115. [Google Scholar]
  11. Srivastava, H.M. An introductory overview of Bessel polynomials, the generalized Bessel polynomials and the q-Bessel polynomials. Symmetry 2023, 15, 822. [Google Scholar] [CrossRef]
  12. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from rgbd images. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 746–760. [Google Scholar]
  13. Li, W.; Saeedi, S.; McCormac, J.; Clark, R.; Tzoumanikas, D.; Ye, Q.; Huang, Y.; Tang, R.; Leutenegger, S. Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset. arXiv 2018, arXiv:1809.00716. [Google Scholar]
  14. Roberts, M.; Ramapuram, J.; Ranjan, A.; Kumar, A.; Bautista, M.A.; Paczan, N.; Webb, R.; Susskind, J.M. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF international Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10912–10922. [Google Scholar]
  15. Gharehbagh, A.K.; Judeh, R.; Ng, J.; von Reventlow, C.; Röhrbein, F. Real-time 3D Semantic Mapping based on Keyframes and Octomap for Autonomous Cobot. In Proceedings of the 2021 9th International Conference on Control, Mechatronics and Automation (ICCMA), Luxembourg, 11–14 November 2021; pp. 33–38. [Google Scholar]
  16. Pivoňka, T.; Přeučil, L. Stereo camera simulation in blender. In Proceedings of the Modelling and Simulation for Autonomous Systems: 7th International Conference, MESAS 2020, Prague, Czech Republic, 21 October 2020; pp. 206–216. [Google Scholar]
  17. Kazhdan, M.; Chuang, M.; Rusinkiewicz, S.; Hoppe, H. Poisson surface reconstruction with envelope constraints. In Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2020; Volume 39, pp. 173–182. [Google Scholar]
  18. Barrera, T.; Hast, A.; Bengtsson, E. Incremental spherical linear interpolation. In Proceedings of the The Annual SIGRAD Conference. Special Theme-Environmental Visualization, Gävle, Sweden, 24–25 November 2004; Linköping University Electronic Press: Linköping, Sweden, 2004; pp. 7–10. [Google Scholar]
  19. Peng, Y. Game Design Based on High-Definition Render Pipeline in Unity. Highlights Sci. Eng. Technol. 2024, 93, 167–178. [Google Scholar] [CrossRef]
  20. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  21. Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. Shapenet: An information-rich 3d model repository. arXiv 2015, arXiv:1512.03012. [Google Scholar]
  22. TurboSquid. Available online: http://www.turbosquid.com/ (accessed on 26 February 2025).
  23. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  24. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 16965–16974. [Google Scholar]
  25. Gong, M.; Wang, D.; Zhao, X.; Guo, H.; Luo, D.; Song, M. A review of non-maximum suppression algorithms for deep learning target detection. In Proceedings of the Seventh Symposium on Novel Photoelectronic Detection Technology and Applications, Kunming, China, 5–7 November 2020; SPIE: Bellingham, WA, USA, 2021; Volume 11763, pp. 821–828. [Google Scholar]
  26. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  27. Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 834–849. [Google Scholar]
  28. Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Flow chart of the different stages in our method. The data mapping process is implemented by Octomap [15], the trajectory generation process is implemented by Blender [16], and the data processing process is implemented by Unity.
Figure 2. (a) Bedroom without ceiling, as an example of a scene from the SceneNet dataset. (b) Visualization of the model after random sampling. (c) Free-space visualization; light blue squares represent free space.
Figure 3. (a) A front view of a suspended picture frame, a clock, and a pillow on a bed. (b) An overhead view of some monitors and drinks on a table.
Figure 4. (a) The initial graph structure, with red edges denoting rejection by the function CollisionFree, consequently yielding three connected sub-graphs (indicated by distinct colors). (b) The connectivity graph after the RRT* algorithm completes; red vertices represent sampling points. (c) The path found after eliminating the cyclic structure.
Figure 5. Plot of distance versus collision rate, where the test interval for the distance d is 0.05 m. This study suggests that d be set at about 0.45 m, which corresponds to a collision rate that can be kept below 15%.
Figure 6. The first figure illustrates a complete bedroom. The second figure illustrates the generation of observation points O and position points P: • represents an observation point, yellow represents large objects, and red represents small objects; × represents a position point, and the arrow indicates its direction. The third figure presents the result of path planning: the solid red line represents the planned path, the dashed lines represent branches of the graph, and the arrows are as in the second figure. The fourth figure illustrates the actual trajectory after curve smoothing.
Table 1. Color images captured from three different dataset models.
3D-FRONT
SceneNet RGB-D
Hypersim
Table 2. Basic metrics of the method for generating image sequences.
Dataset        | Room Style  | Room Size (m)   | Object Number | Capture Rate (%) | Average Area Share (%) | Time (ms)
3D-FRONT       | Suite       | 9.8 × 8.6 × 3.1 | 49.7          | 94.2             | 25.5                   | 324
3D-FRONT       | Office      | 7.7 × 6.4 × 2.9 | 38.7          | 89.5             | 27.4                   | 243
3D-FRONT       | Studio      | 3.2 × 2.9 × 2.7 | 25.8          | 92.2             | 26.2                   | 212
SceneNet RGB-D | Living room | 6.2 × 4.8 × 2.9 | 78.2          | 78.5             | 15.2                   | 425
SceneNet RGB-D | Office      | 4.4 × 4.2 × 3.0 | 102.5         | 74.9             | 13.8                   | 522
SceneNet RGB-D | Kitchen     | 3.5 × 3.2 × 2.9 | 82.7          | 75.1             | 19.5                   | 466
Hypersim       | Living room | 8.8 × 7.7 × 3.1 | 52.5          | 81.3             | 28.6                   | 304
Hypersim       | Dining room | 6.2 × 5.5 × 2.9 | 56.3          | 85.6             | 29.8                   | 353
Hypersim       | Bedroom     | 5.4 × 5.2 × 2.9 | 49.2          | 88.9             | 26.7                   | 295
Table 3. Comparison of object detection models with different camera poses.
Training Data | YOLOv10 (mAP / Pre / Rec) | RT-DETR (Acc / mAP / mAR maxDet=10 / mAR maxDet=100)
3D-FUTURE     | 0.595 / 0.728 / 0.656     | 0.618 / 0.572 / 0.716 / 0.727
Ours          | 0.602 / 0.708 / 0.640     | 0.625 / 0.557 / 0.722 / 0.731
SceneNet      | 0.528 / 0.694 / 0.575     | 0.582 / 0.515 / 0.522 / 0.536
Ours          | 0.688 / 0.754 / 0.620     | 0.644 / 0.602 / 0.688 / 0.693
Hypersim      | 0.565 / 0.702 / 0.633     | 0.715 / 0.544 / 0.602 / 0.633
Ours          | 0.594 / 0.768 / 0.625     | 0.745 / 0.575 / 0.615 / 0.638
Table 4. Comparison of SLAM algorithms with different camera poses.
Training Data | ORB-SLAM (ATE / SR / RE (cm)) | LSD-SLAM (ATE / SR / RE (cm)) | DSO (ATE / SR / RE (cm))
SceneNet      | 0.17 / 0.75 / 0.44            | 0.72 / 0.35 / 0.72            | 0.42 / 0.62 / 0.67
Ours          | 0.09 / 0.82 / 0.42            | 0.78 / 0.52 / 0.77            | 0.38 / 0.72 / 0.69
Hypersim      | 0.18 / 0.85 / 0.35            | 0.65 / 0.63 / 0.66            | 0.35 / 0.75 / 0.48
Ours          | 0.07 / 0.80 / 0.33            | 0.73 / 0.58 / 0.71            | 0.33 / 0.69 / 0.56
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
