Article
Peer-Review Record

Ensemble Deep Learning for Automated Damage Detection of Trailers at Intermodal Terminals

Sustainability 2024, 16(3), 1218; https://doi.org/10.3390/su16031218
by Pavel Cimili, Jana Voegl, Patrick Hirsch * and Manfred Gronalt
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4: Anonymous
Reviewer 5: Anonymous
Submission received: 15 December 2023 / Revised: 26 January 2024 / Accepted: 30 January 2024 / Published: 31 January 2024
(This article belongs to the Special Issue Sustainable Supply Chain Optimization and Risk Management)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a framework based on ensemble deep learning for damage detection of trailers at intermodal terminals. The proposed approach is validated through experiments. The paper is well written, but there are critical issues that need to be addressed.

1. What is the novelty of this paper? The authors should highlight the contributions of this work in the Introduction.

2. The authors state that the proposed framework and approach outperform existing methods for ADD of trailers, but the experimental scenario lacks a comparison and analysis with the state of the art, especially the work presented in [Cimili et al., 2022].

3. The paper also lacks a quantitative analysis of the proposed framework's time efficiency.

Author Response

Dear reviewer,

Thank you for your comments, suggestions, and essential critical remarks. Below are our explanations point by point.

1) We agree entirely that a clear description of the contributions was missing, and we added it to the Introduction (Section 1, line 85):

“…The contributions of this paper are summarized as follows:

  • Introduction of a novel ensemble deep learning algorithm designed explicitly for ADD of trailers at terminals. To the best of our knowledge, this is the first published algorithm that can handle damage detection of trailers using full-sized real-world OCR images of trailers.
  • Comprehensive testing of several state-of-the-art object detection deep neural networks in the ADD ensemble framework.
  • The proposal and testing of an alternative algorithm for detecting the trailer in the image using only classical computer vision techniques, without deep learning.
  • Analysis and recommendations for ADD improvement at a specific terminal, which was part of this study. While derived from a particular case study, these recommendations are applicable to any terminal worldwide that seeks to integrate a similar ADD algorithm…”

 

2) Comparing the per-image performance of our ensemble learning algorithm with that of Cimili et al. (supervised and semi-supervised learning) is not straightforward. Cimili et al. used several classification models for the trailers' upper and lower parts separately. In this paper, by contrast, we work not with classification but with an object detection network combined with SAHI.
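For readers unfamiliar with SAHI, the sketch below shows how sliced inference is typically wired up with the sahi Python package; the model type, weights path, and slice parameters are illustrative assumptions, not the exact configuration used in the paper:

```python
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

# Wrap a trained detector (illustrative: a YOLOv8 checkpoint) for sliced inference.
detection_model = AutoDetectionModel.from_pretrained(
    model_type="yolov8",
    model_path="trailer_damage_yolov8.pt",  # hypothetical weights file
    confidence_threshold=0.4,
    device="cuda:0",
)

# SAHI slices the full-sized trailer image into overlapping tiles,
# runs the detector on each tile, and merges the per-tile detections.
result = get_sliced_prediction(
    "trailer_full_side.jpg",  # hypothetical OCR-gate image
    detection_model,
    slice_height=512,
    slice_width=512,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)
print(len(result.object_prediction_list), "detections on the full image")
```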

However, it is worth mentioning that Cimili et al. (2022) tested ADD only on tiles; their performance was not sufficient to test on full-sized trailer images. In contrast, we tested the ensemble model on full-sized trailer images, which requires the correct analysis of several hundred small tiles per image. Even one wrong tile prediction would ruin the prediction for the whole image. To clarify this, we added more description in the text (line 517):

"...

The results show that the presented ensemble learning model outperforms the approaches introduced by Cimili et al. [9]. Nevertheless, a direct quantitative comparison is highly challenging. Cimili et al. [9] used classification instead of object detection NNs, tested separate models for the trailer's upper and lower parts, and employed supervised and semi-supervised approaches. However, their per-tile results for each of the tested models were not sufficient at that time to conduct testing on full-size trailer images instead of just tiles. The maximal observed accuracy of 89% per tile in Cimili et al. [9] would mean that, for each full-sized trailer image consisting of at least a hundred such tiles, around 10% of the tiles would be misclassified. In contrast, our ensemble learning model achieves almost 90% accuracy and an 81% F1-score for the entire trailer image, which is notable given that the correct analysis of a single full-sized trailer image requires accurate predictions for several hundred tiles. Even a single erroneous prediction in these tiles could render the entire prediction incorrect...."
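As a back-of-the-envelope check of the quoted argument (treating tile errors as independent, which is an approximation):

```latex
% Per-tile accuracy p = 0.89 and n = 100 tiles per image:
\mathbb{E}[\text{misclassified tiles}] = n(1 - p) = 100 \cdot 0.11 = 11
\qquad
P(\text{all tiles correct}) = p^{\,n} = 0.89^{100} \approx 8.7 \times 10^{-6}
```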

3) In Section 4 (Results, line 472), we describe: "...The inference takes around 5.5-8 seconds with the alternative cropping and 7.5-12 seconds with the standard approach, depending on the image size...".

As the inference speed depends only on the image size (analyzing a single tile always takes roughly the same time), and the image size in turn depends on the speed of the vehicle passing through the OCR gate, it is hard to state an average inference time. That is why it makes the most sense to report the minimal and maximal times we observed for the standard algorithm variant and for the variant with Alternative Phase 1. In our concrete case, this time is short enough to deliver the result of the damage analysis to the driver before the driver arrives at the check-in point.
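A rough sketch of why a time range, rather than an average, is the natural quantity to report; the tile size, overlap, and per-tile latency below are illustrative assumptions, not measured values:

```python
def estimate_inference_seconds(image_length_px: int,
                               tile_px: int = 512,
                               overlap: float = 0.2,
                               per_tile_s: float = 0.02) -> float:
    """Inference time grows with the number of tiles, and the number of
    tiles grows with the image length, which depends on vehicle speed."""
    stride = int(tile_px * (1 - overlap))
    n_tiles = max(1, (image_length_px - tile_px) // stride + 1)
    return n_tiles * per_tile_s

# A slower truck produces a longer image and therefore a longer inference time.
print(estimate_inference_seconds(8000))   # shorter image
print(estimate_inference_seconds(16000))  # longer image, roughly double the time
```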

We also did not compare execution times during the algorithm selection for each phase: the per-tile times were very close for all the neural networks. Thus, we were interested only in detection performance.

There is also no other published algorithm for ADD of trailers that can handle the whole trailer side and with which we could directly compare inference time. Cimili et al. (2022) tracked performance only for tiles; full trailer images were not analyzed due to the lack of performance, so we cannot make a direct time comparison here either.

In conclusion, we also advise standardizing the image length, for example by introducing a speed limit. Implementing such a measure would make tracking a model's inference time easier.

 

*) We also tried to improve the text based on your single-choice feedback (all changes are marked in yellow in the text).

Reviewer 2 Report

Comments and Suggestions for Authors


Comments for author File: Comments.pdf

Author Response

Dear reviewer,

Thank you very much for your report and necessary and reasonable suggestions for improvements.

1) We updated Figure 1 with a higher-resolution version (some of its images were initially blurry).

2) We apologize for the unclear description. Adjusting only brightness (+100, -40), blur, or saturation does not help with snow or rain. We should have mentioned in the text that, for snow and rain, we also used Gaussian noise and its combination with the previously mentioned techniques. We added more explanation in Section 3.1 (line 239):

“…We found a solution in augmentation: adjusting brightness (+100, -40), adding blur, and changing saturation help to mimic extremely sunny or very dark days. Combining these techniques with Gaussian noise during augmentation helped to get rid of false predictions in cases of rain and snow…”
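To make the augmentation recipe concrete, a minimal OpenCV/NumPy sketch; the blur kernel, desaturation factor, and noise level are illustrative assumptions, as the paper only states the +100/-40 brightness shifts:

```python
import cv2
import numpy as np

def augment(img: np.ndarray, brightness: int = 100,
            desaturate: float = 0.7, noise_std: float = 15.0) -> np.ndarray:
    """Illustrative augmentation pipeline: brightness shift, blur,
    saturation change, and additive Gaussian noise (for rain/snow)."""
    # Brightness shift, e.g. +100 (very sunny) or -40 (very dark), clipped to [0, 255]
    out = np.clip(img.astype(np.int16) + brightness, 0, 255).astype(np.uint8)
    # Mild Gaussian blur
    out = cv2.GaussianBlur(out, (5, 5), 0)
    # Saturation change in HSV space (factor is an illustrative assumption)
    hsv = cv2.cvtColor(out, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] = np.clip(hsv[..., 1] * desaturate, 0, 255)
    out = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
    # Additive Gaussian noise, the ingredient that helped with rain and snow
    noise = np.random.normal(0.0, noise_std, out.shape)
    return np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```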

3) We think that the section would be too large if we combined Phase 1 and Alternative Phase 1 into one section. Alternatively, we could move Alternative Phase 1 right after the section on Phase 1. However, we still find it clearer for the reader to understand why the alternative method can be a good substitute for the standard Phase 1 after understanding how Phase 3 and Phase 4 work, because implementing Alternative Phase 1 eliminates Phase 3, and Phase 2 is parallelized with Alternative Phase 1 + Phase 4.

4-5) As we uploaded the paper in free format rather than the Sustainability template, some formatting and numbering issues appeared during the conversion that happened after our submission. We have fixed them.

6) Thanks for the point! As mentioned, we uploaded the paper in free format, and several tables were moved and split between pages during the conversion that happened after submission. We have fixed this. Initially, Table 8 was not in a standard form, and the automatic conversion could not handle it. We have now formatted the tables manually.

Reviewer 3 Report

Comments and Suggestions for Authors

1. The title of Fig. 6 is too long to read.

2. In Fig. 2 and Fig. 9, what is the equation (x != y)?

3. In Fig. 10, there are six subfigures; they should be numbered a, b, c, d, e, f.

Author Response

Dear reviewer,

We appreciate the time you invested in the review of our article and your suggestions for improvements.

1) We made the description of Figure 6 more concise.

2) We assume you meant Figure 3 instead of Figure 2. We thoroughly reviewed the algorithm scheme but still find the explanations in the figures quite clear. In square 2d of the scheme, we state that y is the number of detected metal parts, and in square 2b, that x is the number of detected complete belts.

“x!=y?” checks whether these numbers differ; "!=" denotes inequality in Python, and we are used to this notation. For clarity, we have now changed "x!=y?" to "x = y?", swapping the outgoing True and False branches accordingly.
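In Python terms, the two forms of the decision node are equivalent once the branches are swapped:

```python
x, y = 3, 4  # x: detected complete belts, y: detected metal parts

# Original node "x != y?": True when the counts differ
mismatch = x != y

# Revised node "x = y?" (i.e., equality): the True/False branches swap
match = x == y

assert mismatch == (not match)
```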

3) We changed the numbering from numbers to letters. Thanks!

Reviewer 4 Report

Comments and Suggestions for Authors

In this paper, the authors propose an ensemble deep-learning method for automated damage detection of trailers. The overall network architecture involves multiple well-known deep-learning models to process tasks such as detecting metal parts, belts, and damaged parts. The experiments are conducted on a publicly unavailable dataset due to privacy restrictions but demonstrate the effectiveness of the proposed method. However, the authors should address the following issues:

1) In the abstract, the authors mention that the algorithm achieves an 88.33% accuracy and 81.08% F1-score without naming the dataset. It is recommended to describe the dataset, e.g., as "a publicly unavailable dataset" or similar, to avoid confusion for readers.

2) The authors should summarize the contributions in Section 1.

3) It takes considerable effort to understand the definitions of p1 and p2 in lines 241-242. The authors should give clearer equations, with explanations, for both.

4) What are the loss functions for each NN? They should be briefly introduced at the end of each phase.

5) Double-check Eq. (1) and Table 6 for their format, and Eqs. (3)-(6) for duplication issues.

 

6) The authors should present some qualitative results if there are no privacy restrictions.

Comments on the Quality of English Language

N/A

Author Response

Dear reviewer,

Thank you for your comments and suggestions, as well as for your critical remarks. We found them valuable and reasonable.

1) In the section where we describe our dataset (line 175), we explicitly pointed out that the models were tested on a publicly unavailable dataset containing trailer damage:

“… The dataset used in this research (publicly unavailable due to privacy restrictions) …”

In the abstract (line 10), we also added, for clarity, the information that we used a real-life dataset:

“…The algorithm achieves an 88.33% accuracy and 81.08% F1-score on the real-life trailer damage dataset by leveraging the strengths of each object detection model…”

2) We agree that a clear description of the contributions needed to be included, and we added it to the Introduction (Section 1, line 85):

“…The contributions of this paper are summarized as follows:

  • Introduction of a novel ensemble deep learning algorithm designed explicitly for ADD of trailers at terminals. To the best of our knowledge, this is the first published algorithm that can handle damage detection of trailers using full-sized real-world OCR images of trailers.
  • Comprehensive testing of several state-of-the-art object detection deep neural networks in the ADD ensemble framework.
  • The proposal and testing of an alternative algorithm for detecting the trailer in the image using only classical computer vision techniques, without deep learning.
  • Analysis and recommendations for ADD improvement at a specific terminal, which was part of this study. While derived from a particular case study, these recommendations are applicable to any terminal worldwide that seeks to integrate a similar ADD algorithm…”

3) After revising how we find coordinates for point 1 and point 2, we agreed that the description could have been more informative and precise.

Instead of one sentence, we added a whole paragraph (line 262) describing how we find these coordinates:

“…

Save the following coordinates for the future crop, derived from the detected semi-squares: p1(x1 - w1×0.5, max(min(y1 - h1, y2 - h2), 0)) and p2(x2 + w2×1.5, max(min(y1 - h1, y2 - h2), 0)). Point p1 is located at the top left corner of the trailer, and point p2 at the top right corner (see Figure 8). Our formula calculates their precise coordinates using data from the detected semi-squares. The X-coordinates for the trailer's upper corners, x1 - w1×0.5 and x2 + w2×1.5, differ from the X-coordinates of the detected semi-squares: the actual left corner of the trailer is positioned further to the left than the detected left semi-square, and the right corner is located further to the right than the detected right semi-square. The expression max(min(y1 - h1, y2 - h2), 0) gives the Y-coordinate for both points: we select the higher of the two candidates, min(y1 - h1, y2 - h2), and the comparison with 0 guarantees we do not leave the image space.

…”
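For readers who prefer code, a minimal sketch of the quoted corner computation; the (x, y, w, h) box convention is our assumption about how the detector reports the semi-squares, made only for illustration:

```python
def trailer_crop_corners(sq1, sq2):
    """Compute the trailer's upper-corner points p1 (left) and p2 (right)
    from two detected semi-squares, each given as (x, y, w, h)."""
    x1, y1, w1, h1 = sq1
    x2, y2, w2, h2 = sq2
    # Shared Y-coordinate: the higher of the two candidates, clamped to the image
    top = max(min(y1 - h1, y2 - h2), 0)
    p1 = (x1 - w1 * 0.5, top)  # left corner lies further left than the left semi-square
    p2 = (x2 + w2 * 1.5, top)  # right corner lies further right than the right semi-square
    return p1, p2
```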

4) Thanks for pointing out that information on the loss functions was missing. This information is essential for any paper on neural networks.

We added this information to the Section 3 (line 201):

“…During the YOLOv8 training, a composite loss function was used. It consists of two major components, VarifocalLoss and SIoU Loss. VarifocalLoss is a refined version of the traditional focal loss that manages class imbalance, while SIoU Loss improves the prediction of bounding boxes by evaluating geometrical factors such as angle and shape discrepancies between the ground-truth and predicted bounding boxes. The abovementioned classical focal loss was used for the training of RetinaNet. For the Faster R-CNN training, we used a combination of the L1 loss and the binary cross-entropy loss...”
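As an illustration of the classical focal loss mentioned above for RetinaNet, a minimal PyTorch sketch with the usual α = 0.25, γ = 2 defaults; this is our sketch, not the authors' training code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Classical binary focal loss (Lin et al., 2017). targets are 0/1 floats."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()   # down-weights easy examples
```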

 

5) As we uploaded the paper in free format rather than the Sustainability template, some formatting and numbering issues appeared during the conversion that happened after our submission. We have fixed them.

6) We appreciate your concerns about the amount of qualitative results. We tried to summarize all qualitative results in the Conclusions, describing current and possible issues and giving managerial advice on how the quality of an ADD integration can be improved through administrative decisions such as speed limits, a cover over the OCR gate, etc. (lines 579-583):

“In the future, more training data can be generated by installing additional cameras at the OCR gate. These can be regular cameras placed at new angles and positions or depth cameras that generate not only visual information but also distance data. For instance, Beckman et al. [32] proposed a solution for crack detection using Faster R-CNN with a sensitivity detection network using such cameras.

The varying length of the images creates additional challenges in detecting damage and training models. Setting a truck speed limit at the OCR gate could be an effective method to address this issue.”

We also describe that, with such an ADD in use, emissions will be significantly reduced due to the reduction of unnecessary trailer movements, and that the workload on personnel should decrease significantly as well (lines 593-597):

“…In the long term, the proposed ADD method for trailers can significantly reduce the workload on personnel and redirect resources to other necessary tasks. Subsequently, such changes can enhance the cost efficiency of intermodal terminals worldwide. This, in turn, will increase the attractiveness of such facilities and encourage investments in constructing new terminals.

By optimizing intermodal terminals with the proposed algorithm, the transportation sector can advance sustainability by promoting intermodal transport, contributing to CO2 reduction through significant reduction of unnecessary trailer movements, and fighting against the greenhouse effect in the long run…”

It is not easy to describe more qualitative results at the moment, as the algorithm has not yet been tested with a real-world OCR system. However, it will be possible to gain more feedback from the terminal's management after the algorithm is integrated into the terminal's OCR system, which, due to software, hardware, and financial limitations and requirements, might happen several years from now in the worst case.

Reviewer 5 Report

Comments and Suggestions for Authors

The authors proposed an ensemble-learning-based algorithm for damaged-trailer detection. The introduction of the problem was well set up, with a description of the problem, relevant research works, and the present challenges. The design of the ADD framework has many assumptions that may be very specific to the dataset used in the study. For this method to be more generalizable, more discussion and/or independent data testing is desired.

 

More specific comments:

1. Can the authors provide higher-resolution figures in Figure 1?

2. Can your algorithm work if the ‘semi-squares’ are covered or blurred in the photo captured by the OCR gate?

3. Line 239 & Line 409: format issues with equation numbering.

4. Line 325: 305 damaged cases out of how many before augmentation?

Author Response

Dear reviewer,

We appreciate your kind feedback and reasonable comments on the possible improvements.

We understand your concerns about the generalizability of the approach. However, this is the first published study on this problem in which full-sized trailer images are analyzed, and much future research is required. As you note, there are no openly available datasets on trailer damage; even in our case, we received a private, closed dataset, and the amount of training and testing material is unfortunately strictly limited.

1) We uploaded Figure 1 in a higher resolution.

2) Thank you for raising this critical point. We added more explanation in Section 3.1 (line 279):

“…While developing the trailer detection algorithm, we assumed that the fresh livery of the trailers used for this study would stay the same in the long run. However, the trailer might not be detected if the semi-squares are covered by dirt or paint or are repainted. In cases where the semi-squares are not detected in the first phase of the algorithm, we could initiate an Alternative Phase 1. Future research on the ADD of trailers might include developing a trailer detection algorithm that does not rely on semi-squares, thereby avoiding this limitation…”

3) As we uploaded the paper in the free format and not the Sustainability template, during the conversion, which happened after our submission, some issues with formatting and numeration appeared. We have fixed them.

4) For clarity, we added more information on image and tile selection in Section 3.4 (line 361):

“…After the tiling of the images, we selected only the tiles that represented damage — in total, 305 tiles from 81 damage cases consisting of cracked tarpaulin and improper patches…”

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have revised the manuscript according to the reviewer's comments. I think the paper can be accepted.

Reviewer 4 Report

Comments and Suggestions for Authors

The authors have addressed all the issues. Thanks. 

Comments on the Quality of English Language

line 99: should be "state-of-the-art"
