Article
Peer-Review Record

Intelligent Whistling System of Rail Train Based on YOLOv4 and U-Net

Appl. Sci. 2023, 13(3), 1695; https://doi.org/10.3390/app13031695
by Kai Wang 1, Zhonghang Zhang 1, Chaozhi Cai 1,*, Jianhua Ren 1,* and Nan Zhang 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4:
Submission received: 30 December 2022 / Revised: 25 January 2023 / Accepted: 27 January 2023 / Published: 29 January 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

It was a surprise to me that image recognition and a decision module based on artificial intelligence could find such a specific application. The article is convincingly written. The deep learning tools are selected and used correctly. The authors should place more emphasis on their original and creative contribution to the development of this practical application of AI, in particular the process of constructing the weight matrix. The conclusions are convincing.

Author Response

Q1.  It was a surprise to me that image recognition and a decision module based on artificial intelligence could find such a specific application. The article is convincingly written. The deep learning tools are selected and used correctly. The authors should place more emphasis on their original and creative contribution to the development of this practical application of AI, in particular the process of constructing the weight matrix. The conclusions are convincing.

 

Our response: Thank you for your comment. We have revised part of the article to better emphasize the originality and creative contribution of this method to the application of artificial intelligence in the field of train whistle control.

  1. For example, after describing the practical problems of current train whistle control technology, we introduce the model based on artificial intelligence to better emphasize its originality and significance. The revised paper reads: "Therefore, this paper proposed a train intelligent whistle system based on deep learning, which provides a new idea for the application of artificial intelligence in the field of train whistle and noise control" (see the last paragraph of the introduction).
  2. Some original methods are emphasized in the revised paper, such as the method of determining the train's current track based on the weight matrix. The revised paper reads: "For problem 2, a new method of weighted mean ordering of anchor box was proposed. This method determines the track of the anchor frame where the maximum value is located as the current running track of the train by calculating and ranking the average weight of each track anchor frame area under the weight matrix" (see section 3.2). The originality of these methods is emphasized again in the newly added section 5.3.

Your comments are very helpful to us. Thank you again for your comments.

Author Response File: Author Response.doc

Reviewer 2 Report

In this paper, the authors propose an intelligent whistling system for rail trains comprising a road condition sensing module and a whistling decision module. They adopt the YOLOv4 and U-Net models to analyze image sequences, detecting eleven chosen object types, to ensure train safety and reduce noise. They present the diversity of the raw dataset and explain how they handle the data imbalance problem. The workflow of the proposed system, the model structures, the model parameters, etc. are explained and provided. Finally, the performance test of the intelligent whistle system shows that it can effectively and intelligently adjust the whistle strength according to the actual scene to achieve the goal of safe noise reduction. However, several tables need their horizontal spacing adjusted so that their full content is visible and the reviewer can assess its validity.

Several suggestions are listed below.

1. The authors may add a brief description of how the anchor frame weight mean ranking method and the focal loss method are each used in the proposed system.

2. On page 4, lines 116 to 124, the descriptions of the eleven label types could be slightly adjusted into complete, grammatical sentences.

3. In Tables 3 and 5, on pages 6 and 11 respectively, the authors may verify the values of the weight attenuation method or add some explanation with reference to Figure 9 and equation (6) on page 9.

4. On page 13, line 326, the authors state that the whistling volume output is based on the identified target data, with reference to the current time and train speed; the GPS information shown in Figure 4 is never included.

5. On page 14, in the whistling volume formula of Table 7, several sentences in three description fields are cut off. The authors may adjust the horizontal spacing so that the complete content is visible. The reviewer can then understand the descriptions and assess the validity of the parameters in the equation.

6. In Table 8, on page 15, the authors may adjust the horizontal spacing so that the complete content is visible.

7. On page 16, line 387, the grammar of the first sentence may be corrected.

8. On line 394, the authors may correct the capitalization of some characters.

Author Response

Q1.  The authors may add a brief description of how the anchor frame weight mean ranking method and the focal loss method are each used in the proposed system.

 

Our response: Thank you for your comments. Your comments are very helpful to us. We have added brief descriptions of these methods in the corresponding parts of the article, as follows.

Anchor frame weight mean ranking method: For problem 2, a new method of weighted mean ordering of anchor boxes was proposed. This method calculates and ranks the average weight of each track's anchor frame area under the weight matrix, and takes the track whose anchor frame has the maximum value as the current running track of the train (see section 3.2).
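The ranking step described above can be illustrated with a minimal sketch. It assumes the weight matrix is a 2D grid of per-pixel weights and that each candidate track is represented by an axis-aligned box; the toy matrix, the box coordinates and the names `mean_weight` and `current_track` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Box as (x_min, y_min, x_max, y_max); the max edges are exclusive (assumption).
Box = Tuple[int, int, int, int]


def mean_weight(weights: List[List[float]], box: Box) -> float:
    """Average of the weight matrix over the pixels covered by one box."""
    x0, y0, x1, y1 = box
    values = [weights[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    if not values:
        raise ValueError("box covers no pixels")
    return sum(values) / len(values)


def current_track(weights: List[List[float]], boxes: Dict[str, Box]) -> str:
    """Rank tracks by mean weight and return the name of the top-ranked one."""
    ranked = sorted(boxes, key=lambda name: mean_weight(weights, boxes[name]),
                    reverse=True)
    return ranked[0]


# Toy 4x6 weight matrix (hypothetical): larger weights towards the bottom centre,
# where the train's own track would usually appear in a driver's-view image.
weights = [
    [0.0, 0.1, 0.2, 0.2, 0.1, 0.0],
    [0.0, 0.2, 0.5, 0.5, 0.2, 0.0],
    [0.1, 0.4, 0.8, 0.8, 0.4, 0.1],
    [0.1, 0.5, 1.0, 1.0, 0.5, 0.1],
]
boxes = {"left": (0, 0, 2, 4), "centre": (2, 0, 4, 4), "right": (4, 0, 6, 4)}

print(current_track(weights, boxes))
```

In this toy layout the centre box has the largest mean weight, so it is selected as the running track; how the real weight matrix is constructed is described in section 3.2 of the paper, not here.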

Focal loss method: The focal loss method was used for optimization. By increasing the weight of difficult-to-distinguish samples in the loss function, the loss function focuses on those samples, which helps to improve the accuracy of categories with few samples and alleviates the imbalance in the number of samples per category (see section 3.1.(2)).
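As a rough illustration of how focal loss shifts weight towards hard samples, the sketch below implements the standard binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The values gamma = 2 and alpha = 0.25 are the commonly cited defaults for this loss and are an assumption here; the record does not state which values the authors used.

```python
import math


def cross_entropy(p: float, y: int) -> float:
    """Plain binary cross-entropy for predicted probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(max(p_t, 1e-12))


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss; gamma and alpha defaults are assumed, not from the paper."""
    p_t = p if y == 1 else 1.0 - p          # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = max(p_t, 1e-12)                   # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)   # well-classified positive sample
hard = focal_loss(0.3, 1)   # poorly classified positive sample
print(easy, hard, hard / easy)
```

The factor (1 - p_t)^gamma shrinks the loss of well-classified samples much more than that of poorly classified ones, so training gradients are dominated by the hard samples.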

 

Q2.  On page 4, lines 116 to 124, the descriptions of the eleven label types could be slightly adjusted into complete, grammatical sentences.

 

Our response: Thank you for your comments. We have fine-tuned these sentences to make them more reasonable, so that readers can understand them more easily. You can find it in the revised version.

 

Q3.  In the tables 3 and 5 listed at pages 6 and 11, respectively, the authors may verify the values of weight attenuation method or supplement some explanation according to Figure 9 and equation (6) at page 9.

 

Our response: Thank you for your comments. We have made appropriate explanations for figure 9 and equation (6) to make the content easy for readers to understand. You can find it in the revised version.

 

Q4.  On page 13, line 326, the authors state that the whistling volume output is based on the identified target data, with reference to the current time and train speed; the GPS information shown in Figure 4 is never included.

 

Our response: Thank you for your comments. We are impressed by how carefully you reviewed the article. GPS information is indeed not used in the volume formula. GPS is a powerful tool, and our idea is to use it to assist the whistle system in two main ways. 1. Checking the health of the whistle system: GPS provides information on the fixed targets on the line ahead (tunnels, bridges, intersections, etc.), and if the sensing system fails to recognize such a target within a given period of time, it reports an error. 2. Obtaining information on the whistle restriction zones along the line to better assist the whistle system. We will add and improve these functions in future research work.
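The health check outlined in point 1 could look roughly like the sketch below. It is only an illustration of the idea, which the authors list as future work: the `ExpectedTarget` record, the time windows and the function name `missed_targets` are hypothetical, and the step that turns GPS positions into expected time windows is assumed to exist elsewhere.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExpectedTarget:
    """A fixed target that GPS/route data says should be seen (hypothetical record)."""
    kind: str            # e.g. "tunnel", "bridge", "intersection"
    window_start: float  # seconds since trip start
    window_end: float


def missed_targets(expected: List[ExpectedTarget],
                   detections: List[Tuple[float, str]],
                   now: float) -> List[ExpectedTarget]:
    """Return expected targets whose time window has closed without a matching detection."""
    missed = []
    for target in expected:
        if now < target.window_end:
            continue  # window still open, too early to report an error
        seen = any(
            target.window_start <= ts <= target.window_end and kind == target.kind
            for ts, kind in detections
        )
        if not seen:
            missed.append(target)
    return missed


expected = [ExpectedTarget("tunnel", 0.0, 10.0), ExpectedTarget("bridge", 20.0, 30.0)]
detections = [(5.0, "tunnel")]  # (timestamp, detected class) from the sensing module

for target in missed_targets(expected, detections, now=35.0):
    print(f"health check: expected {target.kind} was not detected")
```

A non-empty result would be the trigger for reporting an error to the driver.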

 

Q5.  On page 14, in the whistling volume formula of Table 7, several sentences in three description fields are cut off. The authors may adjust the horizontal spacing so that the complete content is visible. The reviewer can then understand the descriptions and assess the validity of the parameters in the equation.

 

Our response: Thank you for your comments. The table content is displayed completely in the Microsoft Word version. The problem was caused by the conversion from Microsoft Word to PDF. We have fixed it so that reviewers and readers can see the full content. You can find it in the revised version.

 

Q6.  In Table 8, on page 15, the authors may adjust the horizontal spacing so that the complete content is visible.

 

Our response: Thank you for your comments. The table content is displayed completely in the Microsoft Word version. The problem was caused by the conversion from Microsoft Word to PDF. We have adjusted the horizontal spacing so that the complete content is visible. You can find it in the revised version.

 

Q7.  On page 16, line 387, the grammar of the first sentence may be corrected.

 

Our response: Thank you very much for your detailed comments. We have corrected this problem in the paper. You can find it in the revised version.

 

Q8.  On line 394, the authors may correct the capitalization of some characters.

 

Our response: Thank you very much for your detailed comments. We have corrected this problem in the paper.

Your comments are very useful to us. Thank you again for your comments.

Author Response File: Author Response.doc

Reviewer 3 Report

The paper presents an interesting application of computer vision for controlling the Rail Train Whistle. The problem statements and objectives are intelligibly declared. Given the novelty of the idea, the experiments can be considered baseline. The reviewer has several concerns.

 

1. It is suggested to summarize the contribution of the manuscript.

 

2. It can be clearer if the literature and the previous related work are discussed.

 

3. Since the system is a real-world alternative to a human agent, the paper would be stronger if it presented more about the datasets. For example: (a) the training sets (e.g., 800 pictures) are very small, so how can we know whether the model has generalized? (b) Some YouTube videos of train recordings from the driver's viewing angle are chosen. The parameters are quite specific, such as the video capturing location being two meters above the ground. It is suggested to include an evaluation of robustness to the position of the camera.

 

4. It is suggested to compare the proposed method with other related methods on the same task.

 

Author Response

Q1. It is suggested to summarize the contribution of the manuscript.

 

Our response: Thank you for your comments. A new section 5.3, "Model performance and application analysis", has been added to summarize our work, the model's performance, the problems to be faced in practical application, and the development prospects. You can find it in the revised version.

 

Q2. It can be clearer if the literature and the previous related work are discussed.

 

Our response: Thank you for your comments. We compared the model with previous technologies (the control mode based on GPS and database retrieval and the traditional mode of manual control) to better highlight the practical significance of the intelligent control model and make the article more readable and reasonable. You can find it in the penultimate paragraph of the introduction and section 5.3 of the revised version.

 

Q3. Since the system is a real-world alternative to a human agent, the paper would be stronger if it presented more about the datasets. For example: (a) the training sets (e.g., 800 pictures) are very small, so how can we know whether the model has generalized? (b) Some YouTube videos of train recordings from the driver's viewing angle are chosen. The parameters are quite specific, such as the video capturing location being two meters above the ground. It is suggested to include an evaluation of robustness to the position of the camera.

Our response: Thank you for your comment.

(a) To judge whether the model generalizes, we mainly observe the loss value and the mAP value of the model on the validation set, as shown in Figure 6 and Figure 12. When training the YOLO model, a total of 9000 train running pictures were produced and divided into training, validation and test sets at a ratio of 8:1:1. When training the U-net semantic segmentation model, the data set consists of 800 track images. We found that accurate semantic segmentation can be achieved with this smaller data set. On the one hand, the semantic segmentation model is a two-class model that only needs to distinguish the track from the background and does not extract the edge features of the other nine categories of objects; on the other hand, the track has obvious line features, so its edge features are easier for the model to learn.

(b) We have considered this issue before, including the installation height of the camera, its horizontal position, the angle it forms with the ground, and the impact of train vibration on the image. These topics are too numerous to complete in a short time, and we hope to explore them in future research. In addition, consider the following: because the publicly available train driving videos on the network come from various countries, the train models differ and the image capturing conditions differ as well. In such a data set, the camera installation height, the vibration response and the other parameters above necessarily vary. These differences are already present in the training, validation and test sets, and therefore throughout the learning and testing of the model. The performance of the model on the validation and test sets can thus reflect, to a certain extent, its robustness to changes in these parameters.
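For reference, the 8:1:1 split of the 9000 YOLO images mentioned in (a) works out as follows. This is plain arithmetic; the helper name `split_counts` is hypothetical and says nothing about how the authors actually shuffled or assigned images.

```python
from typing import Tuple


def split_counts(total: int, ratios: Tuple[int, ...] = (8, 1, 1)) -> Tuple[int, ...]:
    """Split `total` items by integer ratios; any remainder goes to the first subset."""
    unit = sum(ratios)
    counts = [total * r // unit for r in ratios]
    counts[0] += total - sum(counts)
    return tuple(counts)


train, val, test = split_counts(9000)
print(train, val, test)  # 7200 900 900
```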

 

Q4. It is suggested to compare the proposed method with other related methods on the same task. 

Our response: Thank you for your comment. A comparison with traditional methods has been added to the performance evaluation of the model. In terms of safety, the model is compared with the whistle system based on manual control; in terms of noise reduction, the model is compared with the traditional whistling methods based on remote communication technology and database retrieval. You can find it in section 5.3 of the revised version.

Your comments are very useful to us. Thank you again for your comments.

Author Response File: Author Response.doc

Reviewer 4 Report

This paper presents an intelligent whistling system for railway trains based on deep learning, which automatically decides whether to whistle with safety as the priority, ensures whistling in dangerous situations, and reduces noise pollution in a safe environment. The experiments show good performance in different situations and demonstrate high reliability on the test platform. However, there are still some concerns that need to be carefully addressed.

1. Sometimes there are railroad crossings around houses and villages, and the surrounding environment is very complex, there may be houses or villages and pedestrians, the intelligent whistling system needs to judge the sound of whistling according to the actual situation from the perspective of the train. How effective is the series FSM method when the above situation occurs? Does it meet the requirements of the author and the needs of the actual environment? If not, how does the author plan to solve it using intelligent algorithms?

2. The low mAP of pedestrian and intersection recognition will bring more safety hazards in actual operation. How do the authors plan to improve the detection accuracy? Is the IoU threshold of mAP in Table 4 0.5 or higher?

3. For Sec. 4.2: Since the system is an intelligent system, can it apply different sound control methods, according to the distance of the identified target, when the train is going "from far to near" and "from near to far", to better reduce noise?

4. For Sec. 4.4:  Since the problem of noise reduction at night is mentioned, there is a lack of light on the train track, and the dark environment at night makes it difficult to identify the target in the captured image, which is the challenge of the intelligent system, how to solve it? If the harsh environment causes the intelligent system to not work properly, what should be done at this time? If human assistance is required, how should the collaboration of artificial and intelligent systems be designed?

5. On P14, Table formatting problem, the content is incomplete.

6. What kind of platform is the test to be carried out, is the test platform the same as the training platform? If not, what is the hardware level of the test platform, whether it is equipped with a high-performance graphics card, and what is the computing power of the graphics card? If a high-performance graphics card is used, will there be cost problem and difficulty involved in the actual deployment environment?

Author Response

Q1. Sometimes there are railroad crossings around houses and villages, and the surrounding environment is very complex, there may be houses or villages and pedestrians, the intelligent whistling system needs to judge the sound of whistling according to the actual situation from the perspective of the train. How effective is the series FSM method when the above situation occurs? Does it meet the requirements of the author and the needs of the actual environment? If not, how does the author plan to solve it using intelligent algorithms?

 

Our response: Thank you for your comment. Our original description of the control system as a series finite state machine was indeed problematic. What we wanted to express is that the whistling decision-making subsystem is arranged from top to bottom by logical level, that is, the data obtained by the road condition sensing subsystem passes through three modules: the scene segmentation module, the volume calculation module and the horn control module. We have corrected the original wording and added a new section 4.6 to explain the state machine. The state transition diagram of the state machine is shown in Figure 14.

Figure 14. State transition diagram of the state machine

When the environment is complex, such as houses coexisting with intersections, the whistling decision-making subsystem first uses the information from the sensing subsystem to confirm whether there are intruders within the safe distance of the track. If there are, it directly sounds the horn warning at maximum volume. If not, the current state is classified as a must-whistle scene because of detected objects such as bridges, crossings and tunnels, and the system then calculates a reasonable volume. The volume calculation formula refers to the current time and to the size and location of the surrounding pedestrians, houses and villages to obtain a volume value. As for verifying the effectiveness of the system, in section 5.2 the whistle volume for each scene in the test set is estimated in advance and then compared with the output of the model. The results show that the accuracy of the model in judging whether to whistle is 99.22%, and the average deviation of the whistle volume at intersections and in similar environments is about 3.15 dB. This is indeed larger than in other whistle scenarios; however, the overall error is within the allowable range and meets the use requirements.
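The priority order described in this response can be sketched as a small decision function. This is a simplified illustration, not the state machine of section 4.6: the state names, the placeholder `max_volume` value and the "no whistle" fallback branch are assumptions, and the Table 7 volume formula is not reproduced in this record, so it is passed in as a hypothetical callable.

```python
from enum import Enum
from typing import Callable, Iterable, Tuple


class WhistleState(Enum):
    NO_WHISTLE = "no_whistle"        # assumed fallback when nothing requires a whistle
    SCENE_WHISTLE = "scene_whistle"  # must-whistle scene, volume is computed
    EMERGENCY = "emergency"          # intruder within the safe distance


# Object classes named in the response as forcing a whistle.
MUST_WHISTLE_SCENES = {"bridge", "crossing", "tunnel"}


def decide(intruder_in_safe_distance: bool,
           detected: Iterable[str],
           volume_fn: Callable[[], float],
           max_volume: float) -> Tuple[WhistleState, float]:
    """Return (state, volume). `volume_fn` stands in for the Table 7 formula."""
    if intruder_in_safe_distance:
        # Highest priority: warn at maximum volume, skipping the volume calculation.
        return WhistleState.EMERGENCY, max_volume
    if MUST_WHISTLE_SCENES & set(detected):
        return WhistleState.SCENE_WHISTLE, min(volume_fn(), max_volume)
    return WhistleState.NO_WHISTLE, 0.0


# Example: a crossing next to houses, no intruder; the numbers are placeholders.
state, volume = decide(False, ["crossing", "house"], lambda: 72.5, max_volume=100.0)
print(state.name, volume)
```

Keeping the intruder check first means the emergency path never depends on the volume calculation succeeding.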

 

Q2. The low mAP of pedestrian and intersection recognition will bring more safety hazards in actual operation. How do the authors plan to improve the detection accuracy? Is the IoU threshold of mAP in Table 4 0.5 or higher?

                                                    

Our response: Thank you for your comment. A plan to improve the detection accuracy of pedestrians and intersections has been added in the new section 5.3. We believe that the slightly lower detection accuracy and detection distance for pedestrians and intersections expose the model's limited ability to detect small and complex targets. Pedestrians are small targets among the detection targets and have a certain complexity (for example, differences in posture, clothing colors that may be close to the environment, cyclists, etc.). As for intersections, the other 10 types of targets are vertical, but the road surface is almost parallel to the camera's line of sight, so only a small area is visible from the train's perspective and the intersection becomes a small target (as shown in Figure 1). If the whole intersection were used as the intersection label, on the one hand the features of the intersection would become too complex; on the other hand, such a wide view is hard to obtain in some real scenes, and by the time the camera can capture a relatively complete picture of the intersection, the train is already very close to it and recognition may come too late.

In view of the above problems, we plan to take the following measures:

  1. Expand the data set, which is always effective.
  2. Enlarge the network input. Because the pixel information of a small target can hardly express its real characteristics when the picture resolution is low, the network input size can be increased to avoid this error. However, such an adjustment reduces the number of frames recognized per second under existing conditions. In this experiment, the mid-range graphics card RTX 3060 is used. When the model input size is 618×618, the frame rate is 19 f/s. When the train is running at high speed, it advances about 2 meters per frame. Such perception accuracy can still meet practical application. If the network input is increased, computer performance and application cost need to be considered.
  3. Add an algorithm that enhances the detection of small targets. Many scholars have already achieved higher detection accuracy for small targets by optimizing the algorithm of the basic YOLO model.
  4. Change the network input to an oblong shape. The network input of a target detection algorithm is usually square, but in this application the information in front of the train is clearly distributed unevenly between the horizontal and vertical directions (horizontal information is usually richer and more important). Future model optimization can therefore build a network with an oblong input, so that more effective pixels are fed in while the overall computation of the model stays unchanged.
  5. Use road signs for assistance. Set up a corresponding traffic landmark ahead of each intersection and let the model learn the landmark.

Your question is very useful for our research. Our future research will focus on these issues.
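The figure of about 2 meters per frame follows from dividing the train speed by the frame rate. The short check below uses only the 19 f/s and 2 m values quoted in the response; the implied speed is derived from them and is not a number stated by the authors, and the function names are hypothetical.

```python
def metres_per_frame(speed_kmh: float, fps: float) -> float:
    """Distance the train travels between two consecutive processed frames."""
    return speed_kmh / 3.6 / fps


def speed_for_travel(metres: float, fps: float) -> float:
    """Train speed in km/h at which it advances `metres` per processed frame."""
    return metres * fps * 3.6


implied_speed = speed_for_travel(2.0, 19.0)
print(f"{implied_speed:.1f} km/h")                        # 136.8 km/h
print(f"{metres_per_frame(implied_speed, 19.0):.2f} m")   # 2.00 m
```

Halving the frame rate, for example by enlarging the network input, would double the distance covered between frames at the same speed.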

The IoU threshold of mAP in Table 4 is 0.5. This information has been added to the revised paper.
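For readers unfamiliar with the term, the sketch below shows how an IoU threshold of 0.5 is typically applied when matching a predicted box to a ground-truth box during mAP evaluation. It is a generic illustration with made-up box coordinates, not the authors' evaluation code.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A prediction counts as a hit when its IoU with the ground truth reaches the threshold."""
    return iou(pred, truth) >= threshold


truth = (0.0, 0.0, 10.0, 10.0)
print(is_match((2.0, 0.0, 12.0, 10.0), truth))  # IoU = 80/120, above 0.5
print(is_match((5.0, 0.0, 15.0, 10.0), truth))  # IoU = 50/150, below 0.5
```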

 

Q3. For Sec. 4.2: Since the system is an intelligent system, can it apply different sound control methods, according to the distance of the identified target, when the train is going "from far to near" and "from near to far", to better reduce noise?

 

Our response: Thank you for your comment. The model does have such a function. According to the volume formula in Table 7, the impact of detected houses and villages on the volume output changes with their size in the image. For pedestrians and vehicles, however, their impact on the whistle mainly depends on the distance between their positions and the track. The closer they are to the track, the more likely they are to trigger the train whistle.

 

Q4. For Sec. 4.4:  Since the problem of noise reduction at night is mentioned, there is a lack of light on the train track, and the dark environment at night makes it difficult to identify the target in the captured image, which is the challenge of the intelligent system, how to solve it? If the harsh environment causes the intelligent system to not work properly, what should be done at this time? If human assistance is required, how should the collaboration of artificial and intelligent systems be designed?

 

Our response: Thank you for your comment. These problems are very realistic. First of all, driving at night is indeed a very difficult problem, for which our current scheme is as follows: 1. Establish a target detection model based on an infrared thermal imager. In addition to the ordinary three-channel image, the image from the infrared thermal imager is added to capture information about intrusions near the track. 2. Establish traffic signs. Except for track intruders, the remaining 9 types of targets are all fixed targets. We can therefore set up road signs with high visibility at night for these kinds of targets and trigger the corresponding whistling response or status by identifying these signs, so as to control the whistle. 3. Compared with the automatic driving of cars, the train has a simple driving route, high driving vibration noise and high headlamp power, so its safety is higher.

If the intelligent system suddenly fails, manual assistance is required. As shown in Figure 4, the whistle can still be operated manually while the whistle system is working. The difficulty may lie in judging whether the whistle system has failed. Our solution is as follows: 1. Design a module that evaluates the clarity of the image taken by the camera, so that the model can notify the driver in time to take manual control in case of fog, object occlusion, etc. 2. Compare with a remote database. Through GPS positioning, information on fixed targets such as curves, bridges and intersections on the route ahead is obtained. If the whistle model does not detect the relevant targets in the corresponding time period, the driver is promptly notified of the error.

Your question is very useful for our research. Our future research will focus on these issues.

 

Q5. On P14, Table formatting problem, the content is incomplete.

 

Our response: Thank you for your comment. The previous table did have problems and has been corrected. The identification distance of straight track is not included in the table because it is difficult to determine from the driving perspective.

 

Q6. What kind of platform is the test to be carried out, is the test platform the same as the training platform? If not, what is the hardware level of the test platform, whether it is equipped with a high-performance graphics card, and what is the computing power of the graphics card? If a high-performance graphics card is used, will there be cost problem and difficulty involved in the actual deployment environment?

 

Our response: Thank you for your comment. The test platform and training platform adopt the same configuration: Intel Core i7-11800H CPU and NVIDIA GeForce RTX3060 GPU. The RTX3060 is a mid-range graphics card, and the corresponding price is about 350 francs, which is relatively acceptable.

Author Response File: Author Response.doc

Round 2

Reviewer 3 Report

The authors have addressed the concern from this reviewer.

Author Response

Thank you very much for your comments. Your comments are very helpful to our research. Wish you all the best!

Reviewer 4 Report

Although the authors have addressed most of the comments, some problems have not been revised in the manuscript and some still need to be addressed. It is necessary to add comparison experiment results to demonstrate the advantages of the proposed method, especially for the object detection method (the proposed method uses YOLOv4; what about YOLOv5, v6, v7?) and the segmentation method (what about other segmentation methods from 2022?).

Author Response

  1. Although the authors have addressed most of the comments, some problems have not been revised in the manuscript and some still need to be addressed. It is necessary to add comparison experiment results to demonstrate the advantages of the proposed method, especially for the object detection method (the proposed method uses YOLOv4; what about YOLOv5, v6, v7?) and the segmentation method (what about other segmentation methods from 2022?).

 

Our response: Thank you for your comment. We have carefully considered the questions you raised and revised some of the content on the basis of the last revision, including: 1. The description of horn control based on the FSM method in Sec 4.6 has been improved. 2. In Sec 5.3, the scheme of using an infrared camera or a depth camera to enhance the performance of the model at night is mentioned, and a reference is provided for this content. 3. The format of Table 8 in Sec 5.1 has been modified. 4. Sec 3.1 now explains that the model training platform and the test platform are the same platform.

To highlight the advanced nature of this model through comparison, we mainly compare the train whistle model based on artificial intelligence with the existing manual control model and with the model based on remote communication and database retrieval, in terms of response time, scope of application, etc. These comparisons are mainly discussed in section 5.3 and the introduction. Because this paper leans toward application innovation, the algorithms used for target detection and semantic segmentation in the perception module of the model have not been modified. Comparing these unmodified algorithms with newly proposed algorithms may therefore not be very meaningful. At the same time, the YOLOv4 and U-net models are still widely used by researchers and have not been displaced by newer algorithms, owing to the good performance of the models themselves.

The suggestions you raised are of great practical significance and research value. We will focus on the adaptability of the model in complex scenarios and the improvement of application algorithms in future research.

Author Response File: Author Response.doc
