Article
Peer-Review Record

High-Performance Scaphoid Fracture Recognition via Effectiveness Assessment of Artificial Neural Networks

Appl. Sci. 2021, 11(18), 8485; https://doi.org/10.3390/app11188485
by Yu-Cheng Tung 1, Ja-Hwung Su 2,*, Yi-Wen Liao 3, Ching-Di Chang 1, Yu-Fan Cheng 1, Wan-Ching Chang 1 and Bo-Hong Chen 3
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 3 August 2021 / Revised: 10 September 2021 / Accepted: 10 September 2021 / Published: 13 September 2021

Round 1

Reviewer 1 Report

In this paper, the authors present a two-stage method for scaphoid fracture recognition based on an effectiveness analysis of numerous modern artificial neural networks.

The paper’s subject could be interesting for the journal’s readers. I therefore recommend this paper for publication, but first I have a few comments on the text that should be addressed:

 

Comments:

1) About the figure captions in this article: some are centered (like Figure 10) and others are left-aligned (like Figure 1). Their alignment should be unified.

2) Figures 3 and 4 should be moved to the right; they are aligned too far to the left. The authors should fix this problem.

3) The title of Section 5 (Conclusions and Future Work) should be rewritten as “Conclusion and Future Works”.

4) In the conclusion section, the authors did not mention anything about conflicts of interest. There should also be more suggestions about future studies with similar topics.

5) Regarding the title of Section 4.3.3 (“Imapcts of Reizes”), what does “Reizes” mean? Did you mean “resizes”? If so, the authors should rewrite this title.

6) On page 8, line 249: the equation used has no number. Every equation in this article should be numbered so that it can be referenced more easily and effectively.

7) The authors should explain why they used 70 percent of the data as training data. Where does this percentage come from? Is it based on experience or related works?

8) About the table captions in this article: some are centered (like Table 3) and others are left-aligned (like Figure 2). Their alignment should be unified.

9) Which software was used to draw and export the charts? The authors should also explain why they selected this software over alternatives.

10) Which software was used to model the data and apply the convolutional neural network? The authors should also explain why they selected this software over similar alternatives.

11) Why did the authors not use criteria such as RMSE or R to evaluate the accuracy of the modelling and prediction? In other words, the authors should explain more clearly how they evaluated the accuracy of the modelling and prediction.

12) In line 345, the authors mention that "In this paper, due to problem of insufficient data, the overfitting occurs and heavily impact the recognition quality." Why are the used data insufficient? In other words, why did the authors not utilize more appropriate data?

13) Since it has recently been shown that artificial intelligence (AI) has numerous applications across all engineering fields, I highly recommend that the authors add some references in this regard. It would be useful for the journal’s readers to become familiar with applications of AI in other engineering fields. I recommend that the authors add all of the following references, which are the newest in petroleum engineering [1], computer engineering [2], electrical engineering [3], biomedical engineering [4], software engineering [5], and energy engineering [6]:

[1] Roshani, M. (2020). Application of GMDH neural network technique to improve measuring precision of a simplified photon attenuation based two-phase flowmeter. Flow Measurement and Instrumentation.

[2] Arab, F., Karimi, M., & Safavi, S. M. (2016, December). Analysis of QoS parameters for video traffic in homeplug AV standard using NS-3. In 2016 Smart Grids Conference (SGC) (pp. 1-6). IEEE.

[3] Fathabadi, F. R., & Molavi, A. (2019). Black-box identification and validation of an induction motor in an experimental application. European Journal of Electrical Engineering, 21(2), 255-263.

[4] Tavakoli, S., & Yooseph, S. (2019, November). Algorithms for inferring multiple microbial networks. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 223-227). IEEE.

[5] Nisar, M. U., Voghoei, S., & Ramaswamy, L. (2017, June). Caching for pattern matching queries in time evolving graphs: challenges and approaches. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS) (pp. 2352-2357). IEEE.

[6] Bahramian, F., Akbari, A., Nabavi, M., Esfandi, S., Naeiji, E., & Issakhov, A. (2021). Design and tri-objective optimization of an energy plant integrated with near-zero energy building including energy storage: An application of dynamic simulation. Sustainable Energy Technologies and Assessments, 47, 101419.

Author Response

The authors are grateful for the reviewers’ helpful comments, which are valuable in improving this paper. We have revised the manuscript as follows. For details of the figures, please see the uploaded file.

Revisions made in accordance with the comments of Reviewer No. 1

  1. About the figure captions in this article: some are centered (like Figure 10) and others are left-aligned (like Figure 1). Their alignment should be unified.

Answer: Thanks for this comment. We apologize; the previous format was out of date (the old template). First, we updated our manuscript to the correct format. Second, as suggested, we made an overall check for similar errors. Third, we used the MDPI editing service to improve the writing.

  2. Figures 3 and 4 should be moved to the right; they are aligned too far to the left. The authors should fix this problem.

Answer: Thanks for this comment. We have centered the figures accordingly.

  3. The title of Section 5 (Conclusions and Future Work) should be rewritten as “Conclusion and Future Works”.

Answer: Thanks for this comment. The title of Section 5 has been changed to “Conclusion and Future Works” according to this comment.

  4. In the conclusion section, the authors did not mention anything about conflicts of interest. There should also be more suggestions about future studies with similar topics.

Answer: Thanks for this comment. Regarding conflicts of interest, on page 16 we added the statement: “The authors would like to declare that no conflicts of interest exist in this paper. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.” Regarding suggestions about future studies with similar topics, we added a paragraph (P. 15) to the conclusion section. Please refer to the following for a quick review.

“Finally, a number of suggestions about future related studies with similar focus are listed in the following:

    • Other bioimage recognition tasks have also been studied by the authors’ team, such as the detection of liver and lung tumors. Whatever the method, a suggestion from experience is that segmentation is useful for decreasing the computational cost and increasing the accuracy.
    • In real applications, there is a huge imbalance between positives and negatives, especially between cancers and non-cancers. Consequently, data augmentation and transfer learning are necessary.
    • For fracture recognition, although the experimental results indicated that most scaphoid fractures can be recognized, other symptoms, such as hydrops, joint effusion, non-union, delayed union, avascular necrosis, and arthritis, also need attention, as their automated recognition is very helpful to doctors in determining the required treatments.”
  5. Regarding the title of Section 4.3.3 (“Imapcts of Reizes”), what does “Reizes” mean? Did you mean “resizes”? If so, the authors should rewrite this title.

Answer: Thanks for this comment. We are sorry for this careless writing error. We have corrected such typos, and we also used the MDPI editing service to enhance the writing quality. Please refer to the figure in the uploaded file.

  6. On page 8, line 249: the equation used has no number. Every equation in this article should be numbered so that it can be referenced more easily and effectively.

Answer: Thanks for this comment. All equation numbers (P. 8-9) have been added to the paper.

  7. The authors should explain why they used 70 percent of the data as training data. Where does this percentage come from? Is it based on experience or related works?

Answer: Thanks for this comment. This setting follows reference [32], and the related citation was added in Section 3.2.1 (P. 5).
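For illustration, a minimal sketch of such a stratified 70/30 split, with dummy arrays standing in for the radiographs; the fixed seed is an assumption for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the radiograph tensors and binary labels; the real
# data came from Kaohsiung Chang Gung Memorial Hospital.
images = np.random.rand(100, 224, 224, 3)
labels = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    images, labels,
    train_size=0.7,   # 70% training data, following reference [32]
    stratify=labels,  # preserve the fracture/normal ratio in both splits
    random_state=42,  # assumed fixed seed for reproducibility
)
print(X_train.shape, X_test.shape)  # (70, 224, 224, 3) (30, 224, 224, 3)
```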

  8. About the table captions in this article: some are centered (like Table 3) and others are left-aligned (like Figure 2). Their alignment should be unified.

Answer: Thanks for this comment. We carefully adjusted the captions to the standard format.

  9. Which software was used to draw and export the charts? The authors should also explain why they selected this software over alternatives.

Answer: Thanks for this comment. According to this comment, we added a description in Section 4.1 (P. 9). Please refer to the following for a quick review.

“In the following experiments, the results are shown in terms of accuracy charts and AUC curves, where the charts and the AUC curves were generated with Microsoft Excel and Python (Matplotlib’s pyplot), respectively.”

“Moreover, the CNN models were constructed using Keras and TensorFlow 2.3. The reasons for selecting this software are that it is popular, inexpensive, and easy to obtain.”
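For illustration, a minimal sketch (with made-up labels and scores, not the paper’s data) of how an AUC/ROC curve can be produced with Matplotlib’s pyplot:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Toy ground-truth labels and model scores (illustrative only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]

fpr, tpr, _ = roc_curve(y_true, y_score)           # ROC points
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], "k--", label="chance")    # diagonal reference
plt.xlabel("False-Positive Rate")
plt.ylabel("True-Positive Rate")
plt.legend()
plt.show()
```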

  10. Which software was used to model the data and apply the convolutional neural networks? The authors should also explain why they selected this software over similar alternatives.

Answer: Thanks for this comment. The CNN models were constructed with Keras and TensorFlow 2.3. The reasons for selecting this software are that it is popular, inexpensive, and easy to obtain. The related explanation was added in Section 4.1 (P. 9).
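As an illustration of this toolchain, the following is a minimal sketch of assembling one of the pre-trained CNNs (DenseNet201 here) as a binary fracture classifier in Keras/TensorFlow; the classification head and hyperparameters are assumptions, not the paper’s exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet201

# Pre-trained backbone without the ImageNet classification head.
base = DenseNet201(weights="imagenet", include_top=False,
                   input_shape=(224, 224, 3))

# Hypothetical binary head: fracture vs. normal.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),  # assumed learning rate
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```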

  11. Why did the authors not use criteria such as RMSE or R to evaluate the accuracy of the modelling and prediction? In other words, the authors should explain more clearly how they evaluated the accuracy of the modelling and prediction.

Answer: Thanks for this comment. The main reason for using accuracy instead of RMSE/R is as follows. RMSE measures the standard deviation of prediction errors, where a prediction error is the difference between the predicted value and the true value; it suits applications in which predictions and truths are expressed as continuous values (e.g., probabilities), and it reveals the prediction variance. In contrast, the prediction outcome in this paper is a binary value, namely fracture or no fracture. In this type of prediction, attention focuses on TP, TN, FP, and FN, which stand for hits, correct rejections, false alarms, and misses, respectively. Hence, RMSE is not used as a metric in this paper. The major intent of using accuracy as the basic measure was to observe the overall correctness rate, covering true positives and true negatives simultaneously, which has been widely used in the field of machine learning [43, 44] (note that, in these earlier studies, accuracy was called precision). In addition to accuracy, we adopted six further metrics, following references [24-31]. The related description has been added in Section 4.1 (P. 9).
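To make the metric choice concrete, the following sketch (with toy labels and scores, not the paper’s data) computes the seven metrics named above from the TP/TN/FP/FN counts using scikit-learn:

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score,
                             precision_score, recall_score,
                             roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground truth
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]  # predicted probabilities
y_pred = [int(s >= 0.5) for s in y_score]           # thresholded decisions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy:   ", accuracy_score(y_true, y_pred))
print("sensitivity:", recall_score(y_true, y_pred))  # TP / (TP + FN), hits
print("specificity:", tn / (tn + fp))                # TN / (TN + FP), correct rejections
print("precision:  ", precision_score(y_true, y_pred))
print("F1-score:   ", f1_score(y_true, y_pred))
print("AUC:        ", roc_auc_score(y_true, y_score))
print("kappa:      ", cohen_kappa_score(y_true, y_pred))
```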

  12. In line 345, the authors mention that "In this paper, due to problem of insufficient data, the overfitting occurs and heavily impact the recognition quality." Why are the used data insufficient? In other words, why did the authors not utilize more appropriate data?

Answer: Thanks for this comment. Medical data in real applications are not easy to collect due to privacy concerns. Moreover, without permission from patients, it is not ethically appropriate to experiment on collected data. To make the experiment more reliable, the experimental data were gathered from Kaohsiung Chang Gung Memorial Hospital. We have added these explanations in Sections 3.2.1 (P. 5) and 4.5 (P. 12).

  13. Since it has recently been shown that artificial intelligence (AI) has numerous applications across all engineering fields, I highly recommend that the authors add some references in this regard. It would be useful for the journal’s readers to become familiar with applications of AI in other engineering fields. I recommend that the authors add all of the references listed above, which are the newest in petroleum engineering [1], computer engineering [2], electrical engineering [3], biomedical engineering [4], software engineering [5], and energy engineering [6].


Answer: Thanks for this comment. The recommended references are valuable, and we have gladly added them to the reference list [1-6].

Author Response File: Author Response.pdf

Reviewer 2 Report

Overview and general recommendation:

The article proposes a method for scaphoid fracture recognition and gives a thorough effectiveness analysis of several artificial neural networks. The authors carry out adequate comparative experiments on different CNNs under different experimental settings and explain the experimental results as much as possible. The use of data augmentation and transfer learning enhances the method's performance. Although the article does not compare the experimental results of the proposed method against other related previous methods, it still recommends two models that achieve good performance.

The following are some issues with this article:

  1. In Table 1, there are several previous methods that focus on the same problem. What datasets were used in those former experiments, and why not use the same datasets to compare the performance of the proposed method against the former methods?
  2. The description of the model and the training is too brief and not detailed enough. Section 3 presents the method’s ideas, the dataset preprocessing, and the CNN framework, but the details of the model architecture and training are thin.
  3. Section 4.3.1 shows that different data augmentations achieve different performance gains. But why do different rotations lead to different results? And why does the effect differ across models?
  4. Section 4 reports experiments with different training settings and identifies the best setting on average. However, the analysis points out that different models fit different settings, so training all models with the same settings seems unfair and unreasonable. What about training each model with its own best settings and reporting the best performance of each method?
  5. The article mentions two paradigms of transfer learning. However, the implementation of transfer learning is not detailed enough. Which paradigm does the method use, and which performs better in the experiments?

Author Response

The authors are grateful for the reviewers’ helpful comments, which are valuable in improving this paper. We have revised the manuscript and used the MDPI editing service to enhance the writing quality. The modifications for the reviewer’s comments are shown below. For details of the figures, please see the uploaded file.

Revisions made in accordance with the comments of Reviewer No. 2

  1. In Table 1, there are several previous methods that focus on the same problem. What datasets were used in those former experiments, and why not use the same datasets to compare the performance of the proposed method against the former methods?

Answer: Thanks for this comment. Our concerns can be summarized as follows. First, medical data are not easy to collect due to privacy concerns. Second, the datasets of the former methods are not public. Third, without permission from patients, it is not ethically appropriate to experiment on collected data. Therefore, to make the experiment more reliable, the experimental data were gathered from Kaohsiung Chang Gung Memorial Hospital under ethical approval. We have added these clarifications in Section 3.2.1 (P. 5) and Section 4.5 (P. 12).

  2. The description of the model and the training is too brief and not detailed enough. Section 3 presents the method’s ideas, the dataset preprocessing, and the CNN framework, but the details of the model architecture and training are thin.

Answer: Thanks for this comment. We added Tables 2 and 3 to Section 3.2.4 (P. 7), together with the related description. Please refer to the following statement for a quick review.

“Table 2 provides the architecture details of the considered CNN models, including the numbers of convolutions and pools, activation functions, and optimization functions. Based on these architectures, our primary intent was to approximate the nearly optimal settings for different CNN models, as depicted in Table 3. The detailed analysis for determining the best settings is shown in Section 4.” 

  3. Section 4.3.1 shows that different data augmentations achieve different performance gains. But why do different rotations lead to different results? And why does the effect differ across models?

Answer: Thanks for this comment.

For the first concern, “why do different rotations lead to different results?”: we are sorry; this was our careless fault, caused by performance instability. In CNNs, the filters are randomly initialized for each training run; therefore, the resulting models and prediction results differ. To examine this point, we re-ran the experiments 20 times for rotations of 15° and −15°. Tables A and B below show the best, worst, and average accuracies of the CNN models under these settings. From Tables A and B, we can see that there is a gap between the best and worst accuracies for each model; however, the best accuracies under rotations of 15° and −15° are very close for each model. Therefore, our response to this question is that the prediction difference between rotations of 15° and −15° is not significant. Based on this finding, we regenerated Figure 9 in the manuscript (P. 10). Please refer to Figure A below for a quick review.

Table A. Best, worst, and average accuracies of the CNN models over 20 runs under a rotation of 15°.

Model    Best       Worst      Average
VGG16    0.833333   0.680556   0.783333
VGG19    0.819444   0.694444   0.7625
RN50     0.861111   0.666667   0.759722
RN101    0.791667   0.5        0.60625
RN152    0.777778   0.444444   0.583333
DN121    0.861111   0.597222   0.757639
DN169    0.888889   0.694444   0.797222
DN201    0.902778   0.597222   0.809722
ENB0     0.902778   0.694444   0.796528
INv3     0.902778   0.694444   0.802778

Table B. Best, worst, and average accuracies of the CNN models over 20 runs under a rotation of −15°.

Model    Best       Worst      Average
VGG16    0.861111   0.722222   0.789583
VGG19    0.833333   0.694444   0.772222
RN50     0.833333   0.708333   0.776389
RN101    0.791667   0.486111   0.646528
RN152    0.777778   0.486111   0.590278
DN121    0.847222   0.638889   0.767361
DN169    0.861111   0.680556   0.814583
DN201    0.930556   0.5        0.79375
ENB0     0.861111   0.763889   0.808333
INv3     0.861111   0.708333   0.800694

 Figure A. Accuracies of different models with different data augmentations.

For the second concern, “why does the effect differ across models?”: a potential interpretation is that the model architectures have different sensitivities to data augmentation when a pre-trained model is considered. In particular, the residual design works well for the ResNets and DenseNets, and this is the main factor distinguishing the AI Nets. The related explanation was added in Section 4.3.1 (P. 10).
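For illustration, a minimal sketch of the fixed-angle ±15° rotation augmentation discussed above; the paper does not spell out its augmentation pipeline, so this SciPy-based implementation is an assumption:

```python
import numpy as np
from scipy.ndimage import rotate

image = np.random.rand(224, 224, 3)  # dummy radiograph

# Fixed-angle rotations in the image plane; mode="nearest" fills borders.
rot_pos = rotate(image, angle=15, reshape=False, mode="nearest")
rot_neg = rotate(image, angle=-15, reshape=False, mode="nearest")

# Each rotated copy joins the training set alongside the original,
# enlarging the effective sample size for the CNNs.
augmented_batch = np.stack([image, rot_pos, rot_neg])
print(augmented_batch.shape)  # (3, 224, 224, 3)
```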

  4. Section 4 reports experiments with different training settings and identifies the best setting on average. However, the analysis points out that different models fit different settings, so training all models with the same settings seems unfair and unreasonable. What about training each model with its own best settings and reporting the best performance of each method?

Answer: Thanks for this comment. Indeed, this was an oversight on our part. We re-ran the overall comparisons with the best settings for each model. In the revision, Table 3 (P. 7) shows the best settings, and the overall comparison results are shown in Table 8 (P. 12). Because the best settings have been adopted, the related figures (Figures 10-13) were updated accordingly. For a quick review, the related table (Table C) and figures (Figures B-D) are summarized below.

  • The new results in Table C (Table 8 in the revision) differ slightly from the previous results, as summarized below.

In Section 4.4, P. 12:

“First, the ranking of the five Nets was DenseNet, InceptionNet, EfficientNet, ResNet, and VGG, with average accuracies of 0.889, 0.889, 0.861, 0.852, and 0.806, respectively. Second, the top three individual models were DN201, DN169, and RN101, in accordance with their accuracies. Third, overall, the best performances for each metric consistently occurred in two Nets; namely, DN201 and RN101. From this aspect, DN201 and RN101 were the two most reliable models. Fourth, from an average point of view, the top three individual models were DN201, RN101, and INv3, for which the averages of all metric results were 0.886, 0.882, and 0.879, respectively. This can be regarded as an echo of the third point that DN201 and RN101 had a balanced performance for all metrics.”

Table C. Effectiveness of the compared CNNs with transfer learning for different metrics.

Model    Sensitivity   Specificity   Precision   F1-score   AUC      Kappa    Accuracy
VGG16    0.861         0.806         0.816       0.838      0.860    0.667    0.833
VGG19    0.833         0.722         0.750       0.789      0.870    0.556    0.778
RN50     0.889         0.833         0.842       0.865      0.910    0.722    0.861
RN101    0.889         *0.889        *0.889      0.889      *0.950   0.778    0.889
RN152    0.806         0.806         0.806       0.806      0.880    0.611    0.806
DN121    0.917         0.833         0.846       0.880      0.930    0.750    0.875
DN169    0.917         0.861         0.868       0.892      0.890    0.778    0.889
DN201    *0.944        0.861         0.872       *0.907     0.910    *0.806   *0.903
INv3     0.889         *0.889        *0.889      0.889      0.930    0.778    0.889
ENB0     *0.944        0.778         0.810       0.872      0.920    0.722    0.861

Note: the asterisk (*) indicates the best performance for each metric.

  • Second, the discussion of Figure B (Figure 11 in the revision) was modified as follows.

In Section 4.5, P. 12-13:

“The medical data in real applications is not easy to collect. To make the experiment more reliable, the experimental data were gathered from Kaohsiung Chang Gung Memorial Hospital, instead of crawling the Web. Therefore, due to the problem of insufficient data, overfitting occurred and heavily impacted the recognition quality. To cope with this problem, two operations were adopted: data augmentation and transfer learning. Here, an issue to clarify is the performance comparisons among learning by original data, learning by data augmentation, learning by transfer learning, and learning by fusing data augmentation and transfer learning. This comparison can be explained by Figure 11, summarizing Table 5, Figure 10, and Table 8. From this comparison, we can determine that, whatever the model, without the fusion of transfer learning and data augmentation, the best accuracy cannot be achieved.”

Figure B. Accuracies of compared CNNs by considering Transfer Learning (termed TL), Data Augmentation (termed DA), and fusion of TL and DA (termed TL+DA).

  • Moreover, the description for Figure C (Figure 12 in the revision) was updated as follows.

In Section 4.5, P. 13:

“Figure 12 reveals the AUC comparisons of models with accuracies larger than 0.9; namely, RN50, RN101, DN121, DN201, INv3, and ENB0. Basically, it can be divided into two observation spaces, by the line where the False-Positive Rate (FPR) equals 0.1. Before FPR 0.1, DN121 is better than the others. To the contrary, the performances of RN101 and DN201 are pretty close, but higher than those of the other models, after FPR 0.1. Overall, RN101, DN201, and INv3 are the highly considerable models, in terms of the AUC.”

Figure C. AUCs of models with accuracies larger than 0.9.

  • Finally, the updated description for Figure D (Figure 13 in the revision) is shown as follows.

In Section 4.5, P. 13-14:

“Figure 13 depicts the validation result, showing that the top three models were DN201, DN169, and RN101, although their values were very close. This result is consistent with those given above. However, by considering the overall performances in Table 8, the best two models for recognizing scaphoid fractures are DN201 and RN101.”

Figure D. Hybrid evaluations by averaging F1-score, AUC, and Kappa metrics.

  5. The article mentions two paradigms of transfer learning. However, the implementation of transfer learning is not detailed enough. Which paradigm does the method use, and which performs better in the experiments?

Answer: Thanks for this comment. In this paper, fine-tuning-based transfer learning was performed for the VGG Nets, while layer-transfer-based learning was performed for the other Nets. The related description was added in Section 3.2.5 (P. 8).
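For illustration, a minimal sketch of the two paradigms in Keras: fine-tuning (all weights trainable, as used for the VGG Nets) versus layer transfer (the pre-trained base frozen, as used for the other Nets). The classification head is a hypothetical choice:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16, ResNet50

def build_classifier(base, trainable):
    """Attach a hypothetical binary head to a pre-trained backbone."""
    base.trainable = trainable  # False => layer transfer (frozen base)
    return models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(1, activation="sigmoid"),
    ])

# Fine-tuning: all VGG16 weights stay trainable.
fine_tuned = build_classifier(
    VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3)),
    trainable=True)

# Layer transfer: the ResNet50 base is frozen; only the new head learns.
layer_transfer = build_classifier(
    ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3)),
    trainable=False)
```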

Author Response File: Author Response.pdf

Reviewer 3 Report


The authors use image classification rather than object detection or instance segmentation; there is no novelty in applying CNNs to image classification.
Moreover, these are old methods; the SOTA methods now are Vision Transformers.

The quality of the presentation is illustrated by the section title “Imapcts of Reizes”.

The authors suggest using augmentation and some transfer of knowledge; no novelty, as this is taught in any deep learning course.

Also, the paper does not read scientifically: no formulas are presented in the manuscript.

Using domain-specific data is also not novel, since the authors do not provide realistic methods like object detection or instance segmentation. More importantly, the authors fail to mention the main drawback of their study:
the age of the imaged hand. The bones of children and adults have different structures; since the authors use a small amount of data, the model would fail on data from other age groups.

 

A lot of information is missing: for example, how many epochs were trained, what the training stop rule was, and so on.

 

The reviewer thinks that such a small amount of research work (just running a loop over a little data with known methods) is unacceptable for such a high-level journal.

So, I cannot find any merit in this work.

Author Response

The authors are grateful for the reviewers’ helpful comments, which are valuable in improving this paper. We have revised the manuscript as follows. For details of the figures, please see the uploaded file.

Revisions made in accordance with the comments of Reviewer No. 3

  1. The authors use image classification rather than object detection or instance segmentation; there is no novelty in applying CNNs to image classification. Moreover, these are old methods; the SOTA methods now are Vision Transformers. The authors suggest using augmentation and some transfer of knowledge; no novelty, as this is taught in any deep learning course. Also, the paper does not read scientifically: no formulas are presented in the manuscript. Using domain-specific data is also not novel, since the authors do not provide realistic methods like object detection or instance segmentation.

Answer: Thanks for this comment. Basically, this paper can be viewed as an empirical study, and its major intent is to enhance the existing techniques with segmentation, data augmentation, and transfer learning. Moreover, a detailed analysis of related works is provided in Section 2. Table 1 shows the primary points of the previous studies, including the data used, detection by deep learning, number of CNNs, detection on radiographs, transfer learning, data augmentation, number of evaluation metrics, and segmentation, thus indicating the uniqueness of this paper. Finally, through the effectiveness assessment, a recommendation of AI Nets is determined. For this concern, we have updated the motivation in Sections 1 (P. 2) and 2 (P. 3-4).

Figure 13. Hybrid evaluations by averaging F1-scores, AUCs and Kappas.

In addition to the effectiveness evaluation, another important issue is efficiency. Considering this issue, Figure 14 shows the execution times of the compared models. The comparison reveals that DenseNet, InceptionNet, and EfficientNet have higher costs than the other models. By integrating the results of Figures 13 and 14, three further viewpoints are given here. First, from an effectiveness point of view, DN201, DN169, and RN101 are the three candidate models. Second, from an efficiency point of view, VGG16, RN50, and RN152 are the three candidate models. Third, from a balanced viewpoint, the top three models are VGG16, RN50, and RN152, because their balance between test performance and training cost is higher than that of the other models. The related description has been added to Section 4.5 (P. 14).

Figure 14. Training time of recognition models.

  2. The age of the imaged hand: the bones of children and adults have different structures. Since the authors use a small amount of data, the model would fail on data from other age groups.

Answer: Thanks for this comment. The major aim of this paper is adult bones, and this limitation has been added in Section 3.2.1 (P. 5).

  3. A lot of information is missing: for example, how many epochs were trained, what the training stop rule was, and so on.

Answer: Thanks for this comment. Indeed, the numbers of epochs for the models were not given in this paper. We have added the epoch counts in Table 3 (P. 7).

  4. The quality of the presentation, e.g., the section title “Imapcts of Reizes”.

Answer: Thanks for this comment. We are sorry that such writing errors existed in this paper. We have checked and corrected them, and we then used the MDPI editing service to enhance the writing quality. Please refer to the figure in the uploaded file.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

All the comments have been addressed correctly and the paper is ready for publication in the present form.

Author Response

The authors are grateful for the reviewers’ helpful comments that are valuable in improving this paper.

Reviewer 3 Report

Because the authors use only a simple image-classification comparison on a small dataset, I cannot recommend this work for publication in a journal at the level of Applied Sciences.

The authors must at least take another hand dataset without fractures, for example, RSNA Bone Age, and demonstrate that, while they report high accuracy on a small dataset, their model will not predict a fracture for every patient when there is none (type I error, false positive). The authors must demonstrate, by testing the model on a different dataset, that their algorithm has almost no false-positive cases.

Author Response

The authors are grateful for the reviewers’ helpful comments, which are valuable in improving this paper. We have revised the manuscript as follows. For figure details, please see the attached file.

Revisions made in accordance with the comments of Reviewer No. 3
1.    The authors must at least take another hand dataset without fractures, for example, RSNA Bone Age, and demonstrate that, while they report high accuracy on a small dataset, their model will not predict a fracture for every patient when there is none (type I error, false positive). The authors must demonstrate, by testing the model on a different dataset, that their algorithm has almost no false-positive cases.

Answer: Thanks for this comment. As suggested, we conducted an additional evaluation comparing the specificities (true-negative rates) obtained on the RSNA data (new) and the KCGMH data (old) for all constructed models. The experimental results show that the specificity differences are very small (within 5%), even though the datasets were generated by different radiograph-capturing devices. That is, the constructed models are stable in detecting scaphoid fractures. We have added the description in Section 4.5 (P. 15), and the related link was added to the reference list [45]. Please refer to the following for a quick review.

  • In the above experiments, the evaluation results on the KCGMH data were shown in great detail. However, to make the experiments more robust, we further verified the above models on another dataset, named RSNA [45], which was proposed for a challenge to predict bone age from pediatric hand radiographs. The major intent behind this verification is to investigate the performance variance of the above recognition models when different data sources are used. The RSNA dataset contains bones of different ages, and 538 images were selected as the testing data. Because the bones in this dataset are all normal, the specificity (TN rate) is the target measure (a minimal sketch of this check is given after Figure 15). Figure 15 shows the specificity comparison between the RSNA data and the KCGMH data for the recognition models. It shows that the specificity differences are very small (within 5%), even though the datasets were generated by different radiograph-capturing devices. That is, the constructed models are stable in detecting scaphoid fractures.
  • In summary, the above experimental results provide evidence that DenseNet and ResNet perform better than the other three Nets. From these two Nets, DN201 and RN101 are further selected as the recommended recognizers, because their overall performance is better than that of the other models in these two Nets.

 
Figure 15. Specificity comparison between the RSNA data and the KCGMH data for the recognition models.
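For illustration, a minimal sketch (with toy predictions) of the specificity check on an all-normal dataset such as RSNA, where every positive prediction is a false positive:

```python
import numpy as np

# Toy predictions on an all-normal batch (0 = normal, 1 = fracture).
preds = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])

tn = np.sum(preds == 0)       # correct rejections
fp = np.sum(preds == 1)       # false alarms (type I errors)
specificity = tn / (tn + fp)  # equals the TN rate on normals
print(f"specificity = {specificity:.3f}")  # 0.800 for this toy batch
```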

 

Author Response File: Author Response.pdf
