Early Fire Detection System by Using Automatic Synthetic Dataset Generation Model Based on Digital Twins
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
In this paper, the authors propose a model that automatically generates digital twin-based training data optimized for a given site for the early detection of fires. Experiments confirmed that a detection model trained on a dataset simulating possible fire conditions at the site, created by combining field recordings from RGB-D cameras with virtual fires, is sufficient to detect small initial fires on site with very high confidence.
The theoretical part of the work needs to be developed more fully before the experimental part is presented. The definition of digital twins and the way the datasets are used need a more rigorous scientific explanation. More comparison with existing methods could also be investigated.
The paper is interesting and can be accepted after minor improvements.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes an early fire detection system based on an automatic synthetic dataset generation model. The idea is interesting; however, the novelty and presentation are not satisfactory, and some descriptions are confusing. The following important issues must be addressed.
1. The authors mention that the proposed method was designed based on digital twins, but little information is provided to illustrate how this works in the system, especially the so-called “digital twin-based learning”.
2. The core of this scheme lies in the generation and detection algorithms. However, the authors do not describe any of the algorithms used. In other words, this work reads more like an engineering application without theoretical innovation.
3. In the Experimental Results section, which model is the best in this scenario? The authors fail to demonstrate this conclusion.
4. What is the main contribution of this work? The results fail to prove its effectiveness.
5. The references are insufficient; few state-of-the-art (SOTA) methods and related works are cited.
6. The formatting should be checked throughout, e.g., the template remnant “Version December 4, 2023 submitted to Remote Sens”.
Comments on the Quality of English Language
Moderate editing of English language required.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The introduction effectively highlights the significance of addressing fire accidents in urban areas and the role of technology, including digital twin technology, in smart cities. However, there is a repeated sentence at the end of the first paragraph and the beginning of the second paragraph ("Simulation in the actual environment..." to "...fields such as natural disasters, fires, collapses, and environmental contamination accidents"). Please address this repetition for clarity.
I noticed the use of YOLOv4 in the paper. Considering that YOLOv8 and YOLO-NAS are available, could you clarify the choice of YOLOv4? Have you compared its accuracy with these newer models, and if so, what were the results?
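(For context, benchmarking a newer detector on the same data is straightforward; below is a minimal sketch using the ultralytics package, where the dataset config "fire.yaml" is a hypothetical placeholder, not a file from the paper.)

```python
# Hypothetical comparison sketch: fine-tune and evaluate YOLOv8 on the same
# synthetic fire dataset used for YOLOv4. "fire.yaml" is a placeholder dataset
# config, not a file from the paper.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                 # pretrained YOLOv8-nano weights
model.train(data="fire.yaml", epochs=100)  # assumed dataset config and schedule
metrics = model.val()                      # standard COCO-style detection metrics
print(metrics.box.map50)                   # mAP@0.5, comparable across models
```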
It would be beneficial to focus on one specific model in the paper and explain why this model is essential for urban cities. Avoid making comparisons between models, as that amounts to a product evaluation. Emphasize the core of the paper's title.
Regarding the Synthetic Dataset Automatic Generation Model, could you specify the models used to generate the dataset and simulate defects?
I did not find any mention of IoT in the paper. Could you provide information on how IoT is integrated into the proposed system or clarify if it is not applicable to this research?
The conclusion needs clarification, especially concerning the results. Please elaborate on the key findings and ensure that the conclusion effectively summarizes the main contributions of the paper.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The article is very interesting. Some things were not entirely clear to me, so I have the following comments. In terms of content, I suggest adding information about the experiment, the deployment of the system, and system limitations.
In line 3 - In the references, you mention the extensive FireNET dataset [10].
In line 4 - Is the existence of a digital twin a prerequisite for using the model?
In the introduction, describing your concept in more detail would be good, explaining that you simulate a fire using a digital twin to train the early fire detection model. Will it be applicable in places where a digital twin is not available? Are you considering the use of moving cameras or static cameras? It's best to explain the conditions for deploying the system.
In lines 34-35 - I couldn't find information about the simulated flame. What is modelled in the flame: the change in colour due to the burning material, or the changes in lighting conditions caused by the fire?
In line 53 - How do you use an RGB-D camera? Is it about creating a digital twin or placing the camera in a 3D model?
In line 56 - What is meant by the term 'height detection'?
In line 59 - 'In 2020, Fuller et al. defined the concept' - Why is this not cited with a literature source [x], or why is the source not added for this author?
In line 71 - ‘ETRI’ - When using an acronym for the first time, the full name should also be provided.
In line 90 - Similarly, a literature source should be cited, not just ‘Kim et al.’
In line 91 - What datasets do these models use, or were they real-world field data? References are missing.
In line 96 - Reference the literature source for 'Liau et al.'
In line 102 - The reference for the ‘Pascal VOC 2007 dataset’ is missing.
In line 103 - When using an acronym for the first time, the acronym should be explained.
In lines 105-107 - The extensive dataset mentioned lacks a reference. Does the dataset address changes in lighting conditions due to fire?
In line 108 - How was accuracy expressed? Is it accuracy in detection or accuracy in locating the focal point?
In lines 109-111 - Which models were tested on the NVIDIA Xavier NX?
In lines 113-115 - Please explain why real fire data is deemed unsuitable for early fire detection systems. Isn't it rather the opposite, that real data provides all the details, such as changes in the colour of the fire and changes in lighting conditions, which, as you mention, are challenging to simulate?
In lines 116-119 - ‘As a result, a dataset that synthesizes fires...’ - I understand that you wouldn't intentionally set a fire in an office, but real fires reflect all conditions that are very difficult to simulate. Please explain this idea.
In line 122 - Are you also creating a 3D model for placing fires, and do you use the same images when placing a simulated fire? How does the fire used for placement in the model differ? Explain how you use the depth data. Is it about placing the user (camera) in space, or what is its purpose? Are you creating a digital twin from the RGB-D camera data, or is it just about placing the camera and its orientation in space?
Figure 1 - The text details in the image are small and hard to read, especially the upper half of the image.
In line 126 - ‘Intel’s RealSense D455 model’ - Is it a model or a camera? What will be the impact, for example, when using it in a system with a different camera that captures fire? Does it have to be a stereo camera, and if so, does it have to be the same? Please specify the requirements for deploying your system.
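(For context on this question: the D455 is a stereo depth camera that is typically accessed as sketched below; a minimal pyrealsense2 example under assumed default stream settings, not the authors' code.)

```python
# Minimal pyrealsense2 sketch (an assumption, not the authors' code) showing
# how aligned colour and depth frames are typically read from a RealSense D455.
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)      # align depth pixels to the colour frame

try:
    frames = align.process(pipeline.wait_for_frames())
    depth = frames.get_depth_frame()
    color = frames.get_color_frame()
    # depth.get_distance(x, y) returns the range in metres at pixel (x, y),
    # which is what would allow placing virtual fires at plausible 3D positions.
finally:
    pipeline.stop()
```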
In line 128 - What kind of motion data do you mean, and how are they used?
In line 138 - How does the system use animations? Can it use them?
Figure 3 - The fire is a 2D image you place in a 3D environment. If you generate frames from a specific camera position, how do you use them for another camera position? Or do you always generate new data (images) for the chosen camera position? Are you working with a static camera (building surveillance system with security cameras) or a moving camera?
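(To illustrate why this question matters: a fire anchored at a fixed 3D point has to be re-projected for every new camera pose, so a frame rendered for one viewpoint cannot simply be reused for another. A minimal pinhole-camera sketch follows; all poses and intrinsics are assumed values, not the paper's parameters.)

```python
# Minimal pinhole-projection sketch (all values below are assumptions): the
# pixel position of a 3D-anchored fire depends on the camera pose R, t and
# the intrinsics, so each viewpoint needs its own rendering.
import numpy as np

def project(point_w, R, t, fx, fy, cx, cy):
    """Project a world point into pixels for a camera with rotation R (3x3),
    translation t (3,), focal lengths fx, fy, and principal point cx, cy."""
    p = R @ point_w + t                    # world -> camera coordinates
    return fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy

fire_pos = np.array([1.0, 0.5, 4.0])       # assumed fire anchor point, in metres
print(project(fire_pos, np.eye(3), np.zeros(3), 600.0, 600.0, 320.0, 240.0))
# -> (470.0, 315.0) for this identity pose; a different R, t gives other pixels
```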
In line 145 - 'virtual camera' - Why a virtual camera? Isn't it the case that, for the selected cameras, what is visible in the 3D model is determined, and then fires are added to the images? Can you add to Figure 4 images from before and after adding the fire?
In line 167 - You cite Figure 6(b), which is not in the article.
In line 169 - What performs the extraction of images?
In line 171 - Please explain what the term Axis Linked Binding Box (AABB) means.
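(For reference, 'AABB' conventionally expands to axis-aligned bounding box; the following is a minimal sketch of the standard overlap test, offered as an assumption about the authors' usage rather than their implementation.)

```python
# Axis-aligned bounding box (AABB) overlap test: two boxes intersect exactly
# when their intervals overlap on every axis. A box is given as
# (min_x, min_y, min_z, max_x, max_y, max_z).
def aabb_intersects(a, b):
    return all(a[i] <= b[i + 3] and b[i] <= a[i + 3] for i in range(3))

# Unit cube at the origin vs. a cube shifted by 0.5 along each axis -> True
print(aabb_intersects((0, 0, 0, 1, 1, 1), (0.5, 0.5, 0.5, 1.5, 1.5, 1.5)))
```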
In line 174 - Does ‘Boxcollider’ refer to the intersection of the line connecting the camera, fire, and the 3D model?
Figure 7 - I would connect and highlight the common elements in Figures 6a and 7a and explain the variables BoxCollider and AABB. Why is the fire placed in front of the model and not directly on the model (on tables and walls)?
In lines 176-177 - Wouldn't it be interesting to try training the model only on simulated data and then testing it on real natural data, measuring the accuracy there?
In line 204 - Please explain what the input is for your model: images or video? Or do you use video for faster data collection, from which you extract frames?
In lines 208-209 - In the scheme (Figure 12), nothing seems to come out of the Kafka server; there is only input.
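(To illustrate the concern: a Kafka topic normally has both a producer and a consumer side; below is a minimal kafka-python sketch of the consumer side the figure appears to omit, with the topic name and broker address as assumptions.)

```python
# Hypothetical consumer side for the pipeline in Figure 12; the topic name
# "camera-frames" and the broker address are assumptions, not values from
# the paper. A detection service would read frames from the topic here.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "camera-frames",                       # assumed topic name
    bootstrap_servers="localhost:9092",    # assumed broker address
    auto_offset_reset="latest",            # start from the newest frames
)
for message in consumer:
    frame_bytes = message.value            # raw frame payload from a camera
    # ... decode frame_bytes and pass the image to the fire-detection model ...
```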
In line 213 - What is meant by ‘angle information’?
In Figure 13 - Please give more detail on how these data are further used:
The timestamp is to 4 decimal places in seconds. So, are you extracting frames from the video? Why is such precise timing needed? How do you then synchronize all cameras with such precision?
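(For context: when frames are extracted from video, sub-second timestamps typically come from the container's per-frame position; a minimal OpenCV sketch follows, with the file name as an assumption.)

```python
# Minimal OpenCV sketch (the file name is an assumption) showing where
# per-frame timestamps with sub-second precision typically come from when
# frames are extracted from a recorded video.
import cv2

cap = cv2.VideoCapture("camera01.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    t_sec = cap.get(cv2.CAP_PROP_POS_MSEC) / 1000.0
    print(f"{t_sec:.4f}")                 # e.g. 12.3456, four decimal places
cap.release()
```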
It would be good to provide a practical example for one Digital Twin (DT) in terms of how many photos were used, how many simulated fires, and how much data was generated at what time.
Figure 14 - The text details in the image are unreadable. Overall, the image is poorly legible.
Figure 15 - The detected fires in the images in the second row are very poorly visible. Perhaps showing fewer images and enlarging them might help.
In line 222 - ‘for each location’ - What locations are being referred to? Places where cameras are accessible, monitoring the space?
In line 222 - Please provide additional information about the generated fire from the perspective of detection. How do these images differ (intensity, colour, size, etc.)? Do you simulate the reflection of fire from shiny objects and changes in lighting conditions due to fire?
In line 230 - ‘NVIDIA’s DGX A100 was used as a learning environment’ - If you used a server with these GPUs for model learning, will the NVIDIA Xavier NX be sufficient for testing?
In lines 235-238 - Ideally, the model would be verified on data with a real fire under real lighting conditions, specific textures, changing fire size, and so on. I don't mean to devalue your result; trying a different dataset for testing is just a suggestion. It doesn't have to be part of this article.
In Figure 18 - Explain the significance of the green bounding boxes. Why are the image labels denoted with letters (a), (b), and so on at the end of the description?
In line 253 - ‘learnt differently’ - How differently?
In Figure 20 - The detected fire is not visible; try using, for example, the green colour for the bounding box.
Why are the letters distinguishing the images, (a) to (d), not placed before the image label or the relevant part of the description?
In line 268 - ‘20x20’ - What units are these data? Are they pixels?
In line 269 - '[?]' - Correct the broken reference number.
In lines 276-277 - How are the limits set? If you consider camera movement during fire detection, do you recalculate the possible fire position with each new camera position? What if only a static camera is available - for example, a traditional building monitoring system with static cameras?
In lines 282-283 - ‘red bounding box is suspected to be a flame during’ - This problem must also arise from the reflection of light from the floor and mirrors.
In lines 284-285 - According to Figure 24, it looks like a static camera on a crane. How were you able to apply the additional procedure? If the camera position does not change, there will be no difference compared to the previous frame. I see the same problem with all static cameras.
In describing the experimental section, it would be good to explain the system's limitations better: what the input to the system is, what the resulting model is, and which detection method it uses. Why isn't the model validated on real fire data alone first, and then on a combination with fire simulations? What are the limitations of the camera used? Is the camera static or in motion, and what data are used in the models? What is meant by motion data?
In the solution concept, you could address questions such as: Who is the model intended for if the camera is in motion? Does it involve the use of monitoring drones? Is it possible to deploy the system in classic building monitoring, where the cameras are static, and how would the check based on comparison with the previous frame work there?
It would be best to provide a practical example of deploying the system, what needs to be ensured, on what object, what the inputs will be, and what the result will be.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have addressed my concerns.
Comments on the Quality of English Language
Minor editing of English language required.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
YOLOv8 and YOLO-NAS can be used for the same purpose. I suggest you consider them once more. Also, please improve the conclusion, as it is too short.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf