Article
Peer-Review Record

Impact of Image Preprocessing Methods and Deep Learning Models for Classifying Histopathological Breast Cancer Images

Appl. Sci. 2022, 12(22), 11375; https://doi.org/10.3390/app122211375
by David Murcia-Gómez 1, Ignacio Rojas-Valenzuela 1 and Olga Valenzuela 2,*
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 30 September 2022 / Revised: 24 October 2022 / Accepted: 5 November 2022 / Published: 9 November 2022

Round 1

Reviewer 1 Report

The authors presented a comparative study of different preprocessing techniques used to improve deep learning models in the classification of cancer histopathology images. While this is an essential conclusive study that this preprocessing has no significant difference in performance, the study needs to focus more on augmentation techniques as this is the step that has seen more attention when training deep learning models than preprocessing images. Image preprocessing has more impact when working on feature engineering rather than deep learning models. Other important comments the author needs to consider are as follows:

1. Rather than making three random starts per experiment, the author could simply work on cross-validation and make conclusions from that.

2. How did the author use the pre-trained models? did they retrain all layers or some of the final layers?

3. Is Table 1 confusing? Is the number in decimal points or?

4. VGG16 y VGG19 is this referring to VGG16 and VGG19? Please use the appropriate notation

5. A summary table of the models and other architectural differences will be useful, e.g. No training parameters, number of layers, main component of the architecture (i.e. skip connection or particular module etc.)

6. I am not sure why the author used both VGG16 and VGG19 when they mentioned that they selected different architectures with a different setup 

Author Response

 

Comments and Suggestions for Authors

 

The authors presented a comparative study of different preprocessing techniques used to improve deep learning models in the classification of cancer histopathology images. While this is an essential conclusive study that this preprocessing has no significant difference in performance, the study needs to focus more on augmentation techniques as this is the step that has seen more attention when training deep learning models than preprocessing images. Image preprocessing has more impact when working on feature engineering rather than deep learning models. Other important comments the author needs to consider are as follows:

 

  1. Rather than making three random starts per experiment, the author could simply work on cross-validation and make conclusions from that.

 

Thank you for this comment. Indeed, cross-validation could have been used for the statistical analysis; running several independent simulations and cross-validation are both viable ways to obtain the groups needed for the ANOVA. Recall that the analysis compares the variance of the group means (the variance explained by the group variable, the inter-group variance) with the average variance within the groups (the variance not explained by the group variable, the intra-group variance). The simulations were run separately in this manner (due to hardware availability, among other factors) and to avoid out-of-memory errors.
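The inter-/intra-variance comparison described above is the standard one-way ANOVA F statistic. A minimal pure-Python sketch (the function name is ours, not from the paper):

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a list of numeric groups.

    F = MS_between / MS_within, where MS_between is the variance
    explained by the group factor (inter-variance) and MS_within
    the average residual variance inside the groups (intra-variance).
    """
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n

    # Between-group (explained) sum of squares
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group (unexplained) sum of squares
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within
```

Whether the groups come from repeated random starts or from cross-validation folds, the same statistic applies; only the way the groups are generated changes.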

 

  2. How did the author use the pre-trained models? Did they retrain all layers or some of the final layers?

 

The usual transfer learning procedure was used: a deep learning model trained to solve a different task (in this case, a multiclass classification problem on the ImageNet dataset) is taken, and its final layers are replaced so that the model produces output appropriate for the problem being solved.


Transfer learning was used to train all the models. It is widely used in image classification problems with deep learning, since it has the great advantage of reusing what the network has previously learned.

 

The following sentence has also been included in the paper:

“Therefore, usual transfer learning procedure was used. Relying on prior learning, it avoids starting from scratch and allows us to build accurate models in a time-saving manner \cite{Shin2016, Rawat2017, Zhuang2021}.”


Several relevant references have been included in the bibliography in the field of transfer learning.


Regarding the second question, no layer was frozen, so the entire network could be fully trained, as stated in the subsection Experimental settings: “No layer was frozen so that the network could be fully trained.” More information has been included in the paper:

 

Different scenarios can be considered for transfer learning. In the first scenario, a large dataset is available (in our case, images) that differs from the pre-trained model's dataset, and the entire model can be trained; here it is useful to initialize the model from a pre-trained model, using its architecture and weights.

 

The second scenario arises when the images to be trained on have a certain similarity to the images that were used to train the original model, or when only a small set of training images is available. In this case, it is usual to train some layers and leave the others frozen.

 

Finally, in the third scenario a small set of images is available, but one similar to the pre-trained model's dataset. In that case, the convolutional base is usually frozen.

 

The scenario used in this article is the first one: the model is initialized from a pre-trained model (using its architecture and weights), and the entire model is trained/adjusted (no layer was frozen).
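A minimal Keras sketch of this first scenario, assuming the Keras/TensorFlow stack the response mentions; the replacement head (Flatten plus 20% dropout) follows the setup described elsewhere in the reviews, and the function name is ours:

```python
def build_finetune_model(num_classes=2, input_shape=(224, 224, 3)):
    """Scenario 1: initialize from a pre-trained model and train everything.

    Assumes TensorFlow/Keras is installed; the import is deferred so the
    sketch can be defined without it.
    """
    from tensorflow import keras

    # Pre-trained convolutional base (ImageNet weights), original
    # classification head removed.
    base = keras.applications.VGG16(
        weights="imagenet", include_top=False, input_shape=input_shape
    )
    base.trainable = True  # no layer is frozen: the whole network is trained

    # Replacement head: flatten, drop 20% of activations, new classifier.
    model = keras.Sequential([
        base,
        keras.layers.Flatten(),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=[keras.metrics.AUC()],
    )
    return model
```

In the second and third scenarios, one would instead set `base.trainable = False` (or freeze a subset of layers) before compiling.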

 

 

  3. Is Table 1 confusing? Is the number in decimal points or?

 

Thank you very much for your suggestion. The numbers in Table 1 have been modified as indicated; they now read 63,050 instead of 63.050.

 

  4. VGG16 y VGG19 is this referring to VGG16 and VGG19? Please use the appropriate notation

 

Thank you very much for your correction. Yes, this is a mistake and we have modified it in the paper.

 

 

  5. A summary table of the models and other architectural differences will be useful, e.g. No training parameters, number of layers, main component of the architecture (i.e. skip connection or particular module etc.)

 

We thank you for this comment; indeed, a table summarizing the main characteristics of the models used is very relevant to improve the comprehension of the paper. Table 3 presents a summary of the different deep learning models.

 

The first column is the model name. The second, Size (MB), gives the storage size of the model, obtained from the well-known Keras applications (Keras is a deep learning API written in Python, widely used on the TensorFlow machine learning platform). The third column, Parameters, is the total number of parameters of the deep learning model. The fourth column, Depth, refers to the topological depth of the network, including activation layers, batch normalization layers, etc. The last column summarizes the main characteristics of each model.

 

  6. I am not sure why the author used both VGG16 and VGG19 when they mentioned that they selected different architectures with a different setup

 

Although it is true that they are similar architectures, VGG16 and VGG19 differ in the number of layers. Furthermore, since this type of architecture usually provides very good results and is widely used in the literature, we considered both to be good candidates for obtaining good results.

 

There are references, such as [KUM-20], [HAM-20], [MAN-20] and [HAM-22], that compare both VGG16 and VGG19 with other models on histopathological image problems. For this reason, we considered analyzing both structures.

 

[KUM-20] Kumar, K.; Saeed, U.; Rai, A.; Islam, N.; Shaikh, G.M.; Qayoom, A. IDC Breast Cancer Detection Using Deep Learning Schemes. Advances in Data Science and Adaptive Analysis 2020, 12(2), 2041002.

[HAM-22] Hameed, Z.; Garcia-Zapirain, B.; Aguirre, J.J.; et al. Multiclass Classification of Breast Cancer Histopathology Images Using Multilevel Features of Deep Convolutional Neural Network. Sci. Rep. 2022, 12, 15600. https://doi.org/10.1038/s41598-022-19278-2

[HAM-20] Hameed, Z.; et al. Breast Cancer Histopathology Image Classification Using an Ensemble of Deep Learning Models. Sensors 2020, 20(16), 4373. https://doi.org/10.3390/s20164373

[MAN-20] Man, R.; Yang, P.; Xu, B. Classification of Breast Cancer Histopathological Images Using Discriminative Patches Screened by Generative Adversarial Networks. IEEE Access 2020. https://doi.org/10.1109/ACCESS.2020.3019327

 

Reviewer 2 Report

 

This manuscript investigated the effects of image preprocessing methods on five deep learning models for breast cancer histopathological image classification. Admittedly, the novelty of the work is not very high. However, the efforts the authors have made may contribute to the research field, and the following concerns should be addressed before the manuscript is accepted for publication.

 

1. Title. The current title is “Impact of Image Preprocessing Methods on Cancer Histopathological Image Classification Models.” First, the title should be confined to breast cancer, but not cancer, because only breast cancer histopathological images were investigated. Second, the title should reflect deep learning.

2. Introduction. Please refer to Comment 1. The background of cancer is suggested to be confined to breast cancer.

3. Indeed, the authors have conducted a lot of experiments. However, image preprocessing methods are too many, and the image preprocessing parameters will have an impact on the results. For example, the CLAHE method investigated in the manuscript has several parameters, each of which will affect the image contrast enhancement. The authors should report all the image preprocessing parameters they used for different image preprocessing techniques, e.g., the number of blocks divided in CLAHE. Also, the authors are encouraged to investigate the effects of different image preprocessing parameters. In addition, more image preprocessing techniques can be considered.

4. The network structures of the deep learning models are suggested to be drawn in the manuscript, i.e., in the form of figures.

5. There are many deep learning models for the task concerned. The authors only used five models. Are these representative and enough? Maybe more deep learning models should be considered.

6. A separate Discussion section should be provided. The discussion should be more in-depth.

7. The numbers in Table 1, such as 63.050, are recommended to change to 63,050.

8. “The [18] models”: Do you mean “The models [18]”?

9. “VGG16 y VGG19”: Do you mean “VGG16 & VGG19”?

10. Mobilenet should be changed to MobileNet.

11. Densenet should be changed to DenseNet.

12. 0,05 should be changed to 0.05.

13. 95,0% should be changed to 95.0%.

14. What is the implication of this study for other related studies? Should we use any image preprocessing method?

Author Response

Comments and Suggestions for Authors

 

 

This manuscript investigated the effects of image preprocessing methods on five breast cancer histopathological image classification models of deep learning. Actually, the novelty of the work is not very high. However, the efforts the authors have made may contribute to the research field. However, the following concerns should be addressed before being accepted for publication.

 

 

  1. Title. The current title is “Impact of Image Preprocessing Methods on Cancer Histopathological Image Classification Models.” First, the title should be confined to breast cancer, but not cancer, because only breast cancer histopathological images were investigated. Second, the title should reflect deep learning.

 

Thank you very much for your comment, which is very accurate and indeed clarifies the theme of the paper and the problem it is applied to. We have changed the title according to your recommendation. The new title is:

 

“Impact of Image Preprocessing Methods and Deep Learning Models for Classifying Histopathological Breast Cancer Images”

 

 

  2. Introduction. Please refer to Comment 1. The background of cancer is suggested to be confined to breast cancer.

 

Thank you very much for your comment; we agree that the introduction should focus on breast cancer. For this reason, we have modified Section 1, adding and revising several paragraphs. Mainly:

 

Breast cancer is one of the leading causes of cancer-related deaths worldwide and affects a large number of women today [10,11]. Invasive ductal carcinoma (IDC) is the most frequent subtype of all breast cancers [12]. As stated in [13], 934,870 new cancer cases and 287,270 cancer deaths in women are projected to occur in the United States in 2022, with an estimated 287,850 new cases of breast cancer (the leading cause, accounting for 31% of new cancer cases) and 43,250 deaths from breast cancer (the second leading estimated cause of cancer death, with breast cancer accounting for 15% of cancer deaths in women). Incidence rates for breast cancer in women have slowly increased by about 0.5% per year since the mid-2000s [14].

 

Most women diagnosed with breast cancer are over the age of 50, but younger women can also develop breast cancer. About 1 in 8 women will be diagnosed with breast cancer during their lifetime. Survival rates for breast cancer are believed to have increased and the number of related deaths continues to decline, largely due to factors such as earlier detection, a new personalized treatment approach, and a better understanding of the disease [15,16].

 

The diagnosis of breast cancer histopathology images stained with hematoxylin and eosin is non-trivial and labor-intensive, and often leads to disagreement between pathologists [20,21]. Computer-assisted diagnosis systems help pathologists improve diagnostic consistency and efficiency, especially in breast cancer diagnosis from histopathological images [22–24].

 

This paper proposes a comparative study of how different histopathological image preprocessing methods and various deep learning models affect the performance of breast cancer classification systems.

 

 

  3. Indeed, the authors have conducted a lot of experiments. However, image preprocessing methods are too many, and the image preprocessing parameters will have an impact on the results. For example, the CLAHE method investigated in the manuscript has several parameters, each of which will affect the image contrast enhancement. The authors should report all the image preprocessing parameters they used for different image preprocessing techniques, e.g., the number of blocks divided in CLAHE. Also, the authors are encouraged to investigate the effects of different image preprocessing parameters. In addition, more image preprocessing techniques can be considered.

 

Thank you for your comment. We agree with the reviewer that there are a large number of image preprocessing methods; due to the high computational cost of running multiple simulations, we have tried to analyze the image preprocessing methods most commonly used in the bibliography on histopathological imaging in breast cancer.

 

It should be noted that the convolution-based filters are non-parametric, so their effectiveness does not depend on parameter settings. Regarding the histogram-based methods, on the other hand, we fully agree with the reviewer's comment. Although histogram equalization is performed non-parametrically (as stated in: Danelljan, M.; Robinson, A.; Khan, F.S.; Felsberg, M. Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking. In Computer Vision – ECCV 2016; Springer International Publishing, 2016; pp. 472–488), it is true that the image pre-processing parameters have an impact on the behavior of the parametric pre-processing methods.

 

In CLAHE, the default parameters offered by the OpenCV library for this method were used, namely clipLimit=2.0 and tileGridSize=(8, 8). This information has been included in the paper. Thank you for your suggestion.
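For reference, plain (global) histogram equalization is the parameter-free baseline; CLAHE adds exactly the two parameters reported above, a clip limit on each local histogram and a tile grid. Below is a minimal NumPy sketch of the global version, assuming an 8-bit grayscale input; the OpenCV call for the CLAHE variant would be `cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))`:

```python
import numpy as np

def equalize_histogram(img):
    """Global, parameter-free histogram equalization of an 8-bit
    grayscale image (assumes the image is not a single constant value).
    """
    # 256-bin histogram and its cumulative distribution function
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()

    # Map grey levels through the normalized CDF, anchoring the lowest
    # occupied bin at 0 so the output spans the full [0, 255] range.
    cdf_min = cdf[np.nonzero(cdf)][0]
    lut = np.clip(
        np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255), 0, 255
    ).astype(np.uint8)
    return lut[img]
```

CLAHE applies the same mapping per tile of the grid, after clipping each local histogram at the clip limit, which is why its output (unlike the sketch above) depends on those two parameters.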

 

Lastly, regarding FHE, the pipeline used is from:

https://www.kaggle.com/code/nguyenvlm/fuzzy-logic-image-contrast-enhancement.

 

Although the effect of different image preprocessing parameters is relevant, we want to analyze the methods themselves, with the conventional parameters used in the literature. The objective is to analyze the behavior of the different methods (using standard parameters) in combination with the different deep learning models, and to perform an exhaustive statistical analysis. Because each deep learning simulation requires a large amount of computing time, analyzing different parameter settings of a given filter is outside the scope of this study.

 

  4. The network structures of the deep learning models are suggested to be drawn in the manuscript, i.e., in the form of figures.

 

We thank you for this comment; indeed, a table summarizing the main characteristics of the models used is very interesting for the reader. Instead of figures, we include a summary table that presents the information in a more compact way. Table 3 presents a summary of the different deep learning models.

 

The first column is the model name. The second, Size (MB), gives the storage size of the model, obtained from the well-known Keras applications (Keras is a deep learning API written in Python, widely used on the TensorFlow machine learning platform). The third column, Parameters, is the total number of parameters of the deep learning model. The fourth column, Depth, refers to the topological depth of the network, including activation layers, batch normalization layers, etc. The last column summarizes the main characteristics of each model.

 

 

  5. There are many deep learning models for the task concerned. The authors only used five models. Are these representative and enough? Maybe more deep learning models should be considered.

 

 

Indeed, there are many different deep learning models used to solve this type of problem, the classification of histopathological images. The models were selected as mentioned in Section 3.3, based on two main reasons: first, to use models that show good results in the literature; and second, to have a varied set of deep learning models in terms of the number of parameters and layers.

 

In the bibliography there are a large number of references to the five models used in this paper. As stated in [Srinidhi-2021], in most applications (related to deep learning and image classification), “standard architectures (e.g., VGGNet, InceptionNet, ResNet, MobileNet, DenseNet) can be directly employed, and custom networks should only be used if it is impossible to transform the inputs into a suitable format for the given architecture, or the transformation may cause significant information loss that may affect the task performance”.

 

It is important to remember that to perform statistical analysis with great rigor and accuracy, all possible combinations of filters and models (in several repetitions), must be simulated, analyzing the behavior and measuring the error rates. The computational cost and the time required for each simulation is high, and for this reason it was decided to select these five models, which have been widely used in the literature.
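The simulation grid behind that analysis (five models × ten filters × three random starts = 150 runs, averaged into 50 cells) can be sketched in plain Python; the filter identifiers below are placeholders, not the paper's actual filter names:

```python
from itertools import product

# Model names are from the paper; the ten preprocessing filters are
# represented by placeholder identifiers.
MODELS = ["VGG16", "VGG19", "ResNet50", "MobileNet", "DenseNet"]
FILTERS = [f"filter_{i}" for i in range(10)]
RUNS = range(3)  # three random starts per (model, filter) cell

# Every combination that must be simulated: 5 * 10 * 3 = 150 runs.
experiment_grid = list(product(MODELS, FILTERS, RUNS))

def mean_auc_per_cell(results):
    """Aggregate run-level AUC values into the per-(model, filter) means
    that the ANOVA compares. `results` maps (model, filter, run) -> AUC.
    """
    cells = {}
    for (model, flt, _run), auc in results.items():
        cells.setdefault((model, flt), []).append(auc)
    return {cell: sum(v) / len(v) for cell, v in cells.items()}
```

Each of the 150 runs is a full training of a deep network, which is what makes the grid expensive and motivates limiting the study to five models.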

 

We believe that these five models are sufficiently representative, although, of course, the study can be extended to more models in the future. New related references have been placed in the document.

 

[Srinidhi-2021] Srinidhi, C.L.; Ciga, O.; Martel, A.L. Deep Neural Network Models for Computational Histopathology: A Survey. Medical Image Analysis 2021, 67, 101813. https://doi.org/10.1016/j.media.2020.101813

 

  6. A separate Discussion section should be provided. The discussion should be more in-depth.

 

A separate Discussion section has been included in the paper as Section 5:

 

As mentioned in Section 2, there are a large number of publications in the literature on the use of deep learning systems in medical image classification, especially in histopathology. Image pre-processing is an important step, and as can be seen from the bibliography, there are a large number of alternatives for its implementation (in this paper we have used the ones most frequently presented in the literature). On the other hand, there are a large number of deep learning models in the literature, and in order to perform a statistical analysis with great rigor and accuracy, all possible combinations of filters and models must be simulated (in several repetitions), the behavior analyzed, and the error rates measured. The computational cost and time required for each simulation are high. For this reason, it was decided to select these five models, which have been widely used in the literature. However, to the best of our knowledge, there is no exhaustive statistical analysis in the literature of the relevance or impact that the different deep learning models and preprocessing alternatives for histological images have on the behavior of the system for the breast cancer problem.

 

As mentioned in Section 4, the analysis of the p-value in Table 6 yields a value of 0.146 for the filter technique and a value of 0.0000 for the deep learning model. This means that the choice among the different deep learning model alternatives has a statistically significant influence on the behaviour of the system (measured by the AUC index). On the other hand, the filter effect is not statistically significant, and therefore the different methods have similar behaviour in terms of the AUC value obtained. This means that the designer of a deep learning system, for the problem of histopathological cancer images studied, must focus more attention on the deep learning model to be used than on the preprocessing systems used. The three homogeneous groups of filter types intersect each other; therefore, from a statistical point of view they are equivalent in terms of their behaviour on the AUC output variable. It is also important to note that different convolution-based and histogram-based filter alternatives have been used. These alternatives, of these two major types of pre-processing, are mixed in the three homogeneous groups of Table 7. No statistically significant difference was found between the convolution-based and histogram-based types; both have similar behaviour for the output variable.

 

For the deep learning model type factor, there are also three groups, but in this case there are statistically significant differences between them. For this particular problem, when analysing and discussing the results, it can be seen that the group that achieves the best results (consisting of VGG16, VGG19 and ResNet50) comprises deep learning models that have a large number of parameters, and therefore their size (measured in MB) and complexity are high. This does not carry over to the depth parameter (depth refers to the topological depth of the network, including activation layers, batch normalization layers, etc.).

 

As might be initially expected, VGG16 and VGG19 behave similarly from a statistical point of view, and both produce the best results. While it is true that they have the largest number of parameters of the five deep learning models analysed, their topological depth is still the lowest. As strengths of this paper, we must highlight the novelty and robustness of performing a comprehensive statistical analysis of the impact that the different pre-processing algorithms and deep learning models have on the classification of histopathological images in breast cancer. As a possible weakness of the study, we point out that it would have been interesting to analyse other deep learning models, other preprocessing methods, and even other pathologies. This weakness is due to the high computational cost and time required to run multiple simulations for each combination of filter and deep learning model.

 

 

  7. The numbers in Table 1, such as 63.050, are recommended to change to 63,050.

 

Thank you very much for your suggestion. The numbers in Table 1 have been modified; they now read 63,050 instead of 63.050.

 

  8. “The [18] models”: Do you mean “The models [18]”?

 

Thank you very much for your correction. The phrase has been modified. The subsection Models Used now begins:

 

“Five deep learning models have been selected in order to use architectures that are diverse in size and performance.”

 

  9. “VGG16 y VGG19”: Do you mean “VGG16 & VGG19”?

 

Thank you very much for your correction. Yes, this is a mistake and we have modified it in the paper.

 

  10. Mobilenet should be changed to MobileNet.

 

Thanks for the clarification. Changed Mobilenet to MobileNet throughout the paper.

 

  11. Densenet should be changed to DenseNet.

 

Thanks for the clarification. Changed Densenet to DenseNet throughout the paper.

 

  12. 0,05 should be changed to 0.05.

 

Thanks for the clarification. The sentence in the Experimental Results section is now:

The P-values test the statistical significance of each of the factors. Since one P-value is less than 0.05, this factor has a statistically significant effect on AUC at the 95% confidence level.

 

  13. 95,0% should be changed to 95.0%.

 

Thanks for the clarification. The sentence in the Experimental Results section is now:

 

The P-values test the statistical significance of each of the factors. Since one P-value is less than 0.05, this factor has a statistically significant effect on AUC at the 95% confidence level.

 

  14. What is the implication of this study for other related studies? Should we use any image preprocessing method?

 

By analyzing the influence of the functional blocks in the development of deep learning systems for classifying histopathological breast cancer images, this study shows the importance of the choice of the deep learning model used to solve this problem, whereas applying any of the filters used in this paper brings no benefit (no differences from a statistical point of view).

 

Although it cannot be easily generalized, this statistical analysis methodology can be extended and applied to other problems (related to medical images) and pathologies (not only breast cancer). However, it is important to have a large and relevant amount of data, and possibly to increase the number of deep learning models evaluated, depending on the problem to be solved.


Reviewer 3 Report

Manuscript summary:

 

The authors perform a retrospective analysis to assess the influence of pre-processing techniques of 2D images in the context of classification with regression and deep learning models. The study is conducted using a dataset of almost 3x10^5 images from a pathology database. The images are binary tagged for the presence of Invasive Ductal Carcinoma. Five different regression models and ten filters are used for a total of 50 combinations. The results show how the filters have a negligible impact on the prediction precision and accuracy, while the model selection plays an important role.

 

General comments:

 

The investigated topic is very technical but of interest to the research community. Increasing the robustness of the classification methods is reported as challenging in the literature and data pre-processing can play a significant role. However, the submitted manuscript provides only limited insights on this problem and does not represent a significant contribution for the research community. The information is not clearly presented and the language should be improved. My general concerns are the following:

 

1) Information flow and organization:

The information flow is sub-optimal in all the manuscript. As an example, the second paragraph of the introduction describes the burden of cancer worldwide (general topic) and the third paragraph goes into technical details of pathology (very specific) without any linking information. Moreover, the organization of the manuscript is chaotic. The first paragraph of 3.1 should be part of the introduction, the first two paragraphs of 4. should be part of materials and methods and there is no discussion of the results.

 

2) Sections written as review:

The sections and sub-sections 2, 3.2, 3.3 and 3.4 are written as a review paper and not as an original research article. Most of the manuscript describes previous art (pages 1 to 7), and the original work is limited to pages 8 and 9. The original contribution of the authors is overall limited in the presented manuscript. Moreover, the filtering techniques and the models are described in too much detail, which could be avoided by reporting the parameters specific to this study and referencing the appropriate literature for more details on the models themselves.

 

3) Conclusions

The presented results show how applying filtering may even worsen the results in terms of AUC. See for example the AUC for VGG16 in Table 5, where the best result is achieved with the raw data. One may expect the precision of the results to improve after filtering, but Figure 3 shows that the standard deviation of the AUC is unaffected and there are some differences only in accuracy. Figure 4 reports that the models perform with different accuracy, which is an expected result and not in the scope of this manuscript, which aims to investigate the influence of pre-processing and not the models themselves. Therefore, the conclusions do not provide insights on the problem.

 

4) Discussion 

The discussion of the results is missing. The authors do not comment on the strengths and the weaknesses of their study. 

 

 

Specific comments:

The sentence “Accurate detection and classification of some types of cancer is a critical task in medical imaging due to the complexity of those tissues” requires a motivation based on scientific arguments.

 

“The images require preprocessing” is incorrect. As a matter of fact, a few paragraphs later, the authors cite [21] mentioning that normalization and filtering can make the models more robust. Applying pre-processing may even lead to information loss; it is not a required step, but it can bring benefits if performed consistently.

 

ANOVA is referred to as “powerful techniques”. Please rephrase

 

The authors balance the dataset to have exactly 50-50 splitting between positive and negative label. Why? If they want to deploy the model at the clinic where the dataset was acquired, I would expect that the future incidence will be similar to the retrospective data. Therefore, the model should be trained without balancing. Balancing could be useful when datasets are strongly biased, which is not the case for this dataset.

 

The authors should motivate why they removed the last default layers, replacing them with flatten layers and a dropout of 20% of the neurons. This action is reported, not motivated and not discussed.

 

The p-value discussion at the end of page 8 should be rephrased and moved to the materials and methods section. 

 

Table 1 highlights the best model for a given filter. This is not the presented scope of the manuscript. It would be more appropriate to highlight the best filter for each model instead.

 

In the conclusion section the authors refer to “intelligent systems”; did they mean “artificial intelligence” or similar?

 

The conclusion that “the choice of the architecture is statistically relevant” is an obvious conclusion. On the other hand, the filtering (focus of the manuscript) did not show significant differences and is not discussed. 

Author Response

Comments and Suggestions for Authors

Manuscript summary:

 

 

 

The authors perform a retrospective analysis to assess the influence of pre-processing techniques of 2D images in the context of classification with regression and deep learning models. The study is conducted using a dataset of almost 3x10^5 images from a pathology database. The images are binary tagged for the presence of Invasive Ductal Carcinoma. Five different regression models and ten filters are used for a total of 50 combinations. The results show how the filters have a negligible impact on the prediction precision and accuracy, while the model selection plays an important role.

 

 

 

General comments:

 

 

 

The investigated topic is very technical but of interest to the research community. Increasing the robustness of the classification methods is reported as challenging in the literature, and data pre-processing can play a significant role. However, the submitted manuscript provides only limited insights on this problem and does not represent a significant contribution to the research community. The information is not clearly presented and the language should be improved. My general concerns are the following:

 

 

 

1) Information flow and organization:

 

The information flow is sub-optimal throughout the manuscript. As an example, the second paragraph of the introduction describes the burden of cancer worldwide (general topic) and the third paragraph goes into technical details of pathology (very specific) without any linking information.

 

Thank you for these comments, which undoubtedly help to improve the quality of the paper. We have re-written and substantially modified both the introduction and the paragraphs that it indicates.

 

We have now written the following in the introduction section, to better link the information:

 

Breast cancer is one of the leading causes of cancer-related deaths worldwide and affects a large number of women today [10,11]. Invasive ductal carcinoma (IDC) is the most frequent subtype of all breast cancers [12]. As stated in [13], 934,870 new cancer cases and 287,270 cancer deaths in women are projected to occur in the United States in 2022, with an estimated 287,850 new cases of breast cancer (the leading cause, accounting for 31% of new cancer cases) and 43,250 deaths from breast cancer (the second leading estimated cause of cancer death, with breast cancer accounting for 15% of cancer deaths in women). Incidence rates for breast cancer in women have slowly increased by about 0.5% per year since the mid-2000s [14].

 

Most women diagnosed with breast cancer are over the age of 50, but younger women can also develop breast cancer. About 1 in 8 women will be diagnosed with breast cancer during their lifetime. Survival rates for breast cancer are believed to have increased and the number of related deaths continues to decline, largely due to factors such as earlier detection, a new personalized treatment approach, and a better understanding of the disease [15,16].

 

The diagnosis of breast cancer histopathology images stained with hematoxylin and eosin is non-trivial, labor-intensive and often leads to disagreement between pathologists [20,21]. Computer-assisted diagnosis systems help pathologists improve diagnostic consistency and efficiency, especially in breast cancer using histopathological images [22–24].

 

This paper proposes a comparative study of how different histopathological image preprocessing methods, and various deep learning models, affect the performance of the classification system in breast cancer.

 

 

Moreover, the organization of the manuscript is chaotic. The first paragraph of 3.1 should be part of the introduction, the first two paragraphs of 4. should be part of materials and methods and there is no discussion of the results.

 

Indeed, we have modified section 3.1 and moved these paragraphs to the introduction section as indicated. A new subsection, Experimental settings, has also been added to the materials and methods section, where the information that was previously in section 4 (results) is now included.

 

We have also included a new Discussion section.

 

2) Sections written as review:

 

The sections and sub-sections 2., 3.2, 3.3, 3.4 are written as a review paper and not as an original research article. Most of the manuscript describes previous art (pages 1 to 7) and the original work is limited to pages 8 and 9. The original contribution of the authors is overall limited in the presented manuscript. Moreover, the filtering techniques and the models are described in too much detail, which could be avoided by reporting the parameters specific to this study and referencing the appropriate literature for more details on the models themselves.

 

We appreciate the reviewer's comment, but in order to make the article self-contained (all information about the pre-processing and deep learning models is explained), and as requested by both the first and the second reviewer, additional information has been added on the deep learning methods used (Table 3, with the main characteristics of the methods used). It is true that the filtering techniques and the deep learning models are described in detail in this paper, but this may make it easier for the future reader to understand them. Nevertheless, for more detail on each of the filtering techniques and deep learning models, the appropriate bibliographical references are provided.

 

3) Conclusions

 

The presented results show how applying filtering may even worsen the results in terms of AUC. See for example the AUC for VGG16 in Table 5, where the best result is achieved with the raw data. One may expect the precision of the results to improve after filtering, but Figure 3 shows that the standard deviation of the AUC is unaffected and there are some differences only in the accuracy. Figure 4 reports that the models perform with different accuracy, which is an expected result and not within the scope of this manuscript, which aims to investigate the influence of pre-processing and not the models themselves. Therefore, the conclusions do not provide insights on the problem.

 

 

When analyzing the influence of the functional blocks in the design of deep learning models for solving image classification tasks in histopathological studies of breast cancer, this study highlights the importance of the choice of architecture/deep learning model used to solve this problem.

 

This work highlights that, although there is extensive research and there are many contributions related to the preprocessing of images for deep learning systems, from a statistical point of view the different techniques analyzed have the same behavior with respect to the error rate analyzed (AUC).

 

Indeed, the title of the work has been modified so that the reader can see from the outset that the paper statistically analyzes the influence of the different types of filtering used in image preprocessing, as well as different alternatives for the design of deep learning models.

 

4) Discussion

 

The discussion of the results is missing. The authors do not comment on the strengths and the weaknesses of their study.

 

Thank you for your comment; indeed, it is very relevant to have a discussion section in the paper. In the final part of that discussion, the strengths and weaknesses of the work presented in the paper are discussed. This is Section 5:

 

As mentioned in Section 2, there are a large number of publications in the literature on the use of deep learning systems in medical image classification, especially in histopathology. Image pre-processing is an important step and, as can be seen from the bibliography, there are a large number of alternatives for its implementation (in this paper we have used the ones most frequently presented in the literature). On the other hand, there are a large number of deep learning models in the literature, and in order to perform a statistical analysis with great rigor and accuracy, all possible combinations of filters and models must be simulated (in several repetitions), their behavior analyzed, and the error rates measured. The computational cost and time required for each simulation are high. For this reason, it was decided to select five models that have been widely used in the literature. However, to the best of our knowledge, there is no exhaustive statistical analysis in the literature of the impact that the different alternatives of deep learning models and preprocessing of histological images have on the behavior of the system for the problem of breast cancer.

 

As mentioned in Section 4, the analysis of the p-value in Table 6 yields a value of 0.146 for the filter technique and a value of 0.0000 for the deep learning model. This means that the choice among the different deep learning model alternatives has a statistically significant influence on the behaviour of the system (measured by the AUC index). On the other hand, the filter effect is not statistically significant, and therefore the different methods have similar behaviour with respect to the AUC value. This means that the designer of a deep learning system, for the problem of histopathological cancer images studied, must focus more attention on the deep learning model to be used than on the pre-processing systems used. The three homogeneous groups of filter types intersect each other; therefore, from a statistical point of view, they are equivalent in terms of their behaviour on the AUC output variable. It is also important to note that different convolution-based and histogram-based filter alternatives have been used. These alternatives, of these two major types of pre-processing, are mixed in the three homogeneous groups of Table 7. No statistically significant difference was found between the convolution-based and histogram-based types; both have similar behaviour with respect to the output variable.
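The significance rule applied to Table 6 can be sketched in a few lines of Python (the 0.05 threshold and the two p-values are taken from the text above; the function name is illustrative):

```python
ALPHA = 0.05  # significance threshold at the 95% confidence level

def is_significant(p_value, alpha=ALPHA):
    # A factor has a statistically significant effect on the AUC
    # when its p-value falls below the chosen threshold.
    return p_value < alpha

# p-values reported in Table 6
p_filter = 0.146   # pre-processing filter factor
p_model = 0.0000   # deep learning model factor

print(is_significant(p_filter))  # False: filter choice is not significant
print(is_significant(p_model))   # True: model choice is significant
```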

 

For the deep learning model type factor, there are also three groups, but in this case there are statistically significant differences between them. For this particular problem, when analysing and discussing the results, it can be seen that the group that achieves the best results (consisting of VGG16, VGG19 and ResNet50) comprises deep learning models that have a large number of parameters, and therefore their size (measured in MB) and complexity are high. This does not carry over to the depth parameter (depth refers to the topological depth of the network, including activation layers, batch normalization layers, etc.).

 

As might be initially expected, VGG16 and VGG19 behave similarly from a statistical point of view, and both produce the best results. While it is true that both have the largest number of parameters of the five deep learning models analysed, their topological depth is still the lowest. As strengths of this paper, we must highlight the novelty and robustness of performing a comprehensive statistical analysis of the impact that the different pre-processing algorithms and deep learning models have on the classification of histopathological images in breast cancer. As a possible weakness of the study, we point out that it would have been interesting to analyse other deep learning models, other pre-processing methods, and even other pathologies. This weakness is due to the high computational cost and time required to run multiple simulations for each of the combinations of filters and deep learning models.

 

Specific comments:

 

The sentence “Accurate detection and classification of some types of cancer is a critical task in medical imaging due to the complexity of those tissues” requires a motivation based on scientific arguments.

 

Thank you for your comment; indeed, it is helpful to attach bibliographical references that support this statement. The sentence has been modified as follows in the paper:

 

Histopathology is a technique frequently used to diagnose tumors. It can identify the characteristics of how the cancer looks under the microscope. Accurate detection and classification of some types of cancer is a critical task in medical imaging due to the complexity of those tissues [17,18], specifically in breast cancer [19], which is the one that will be analyzed in this paper.

 

 

“The images require preprocessing” is incorrect. As a matter of fact, a few paragraphs later, the authors cite [21] mentioning that normalization and filtering can make the models more robust. Applying pre-processing may even lead to information loss; it is not a required step, but it can bring benefits if performed consistently.

 

Thank you for your comment, which has allowed us to improve this sentence of the paper.

 

The previous sentence was:

 

“In most cases, when such models are used, the images require preprocessing in order to remove noise, blurring, enhance contrast and, ultimately, improve the quality of the images to obtain better performance from the deep learning algorithms.”

 

The new sentence is now:

 

Once the database of WSI images is available, a phase that is usually performed in the papers presented in the bibliography is to pre-process these images. The main objective of this pre-processing step is to improve the behavior of the system (several alternatives can be used, among which it is worth mentioning noise removal, deblurring and contrast enhancement, ultimately improving the quality of the images to obtain better performance from the final deep learning algorithms).
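As a purely illustrative sketch (not one of the filters evaluated in the paper), the contrast-enhancement alternative mentioned above can be realized as a simple linear min-max stretch over a patch's gray values:

```python
def contrast_stretch(pixels, out_min=0, out_max=255):
    # Linearly map the observed intensity range onto the full
    # [out_min, out_max] range, increasing image contrast.
    lo, hi = min(pixels), max(pixels)
    if hi == lo:  # flat patch: nothing to stretch
        return [out_min] * len(pixels)
    scale = (out_max - out_min) / (hi - lo)
    return [round(out_min + (p - lo) * scale) for p in pixels]

# a low-contrast row of gray values spread over the full range
print(contrast_stretch([100, 120, 140, 160]))  # [0, 85, 170, 255]
```

In practice such operations are applied per channel over whole image arrays; the list-based version here only shows the mapping itself.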

 

ANOVA is referred to as “powerful techniques”. Please rephrase

 

Thank you very much for your comment.

 

Now the sentence is:

 

In order to carry out this statistical analysis, the Analysis of Variance (ANOVA) tool is used.

 

The authors balance the dataset to have an exact 50-50 split between positive and negative labels. Why? If they want to deploy the model at the clinic where the dataset was acquired, I would expect that the future incidence will be similar to the retrospective data. Therefore, the model should be trained without balancing. Balancing could be useful when datasets are strongly biased, which is not the case for this dataset.

 

Data balancing is a widely used method in classification problems. It is true that a real use case in a clinic would be different. However, our objective in this paper is not to design a model that fits a real case, but to statistically measure the influence of certain factors (deep learning models and filters) when designing a pipeline for solving the breast cancer image classification problem. Therefore, balancing was carried out in order to have the best possible conditions for measuring the effect of the factors under study.
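A minimal sketch of the 50-50 balancing step described above (the helper name and the undersampling strategy are assumptions; the paper does not detail its exact implementation):

```python
import random

def balance_5050(positives, negatives, seed=0):
    # Undersample the majority class so that positive and
    # negative labels end up in an exact 50-50 split.
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)

# toy example: 3 positive patches vs. 8 negative patches
pos = ["p1", "p2", "p3"]
neg = ["n%d" % i for i in range(8)]
pos_b, neg_b = balance_5050(pos, neg)
print(len(pos_b), len(neg_b))  # 3 3
```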

 

 

The authors should motivate why they removed the last default layers replacing with flatten layers and override of 20% of the neurons. This action is reported, not motivated and not discussed.

 

We appreciate this comment, since the motivation and discussion of how transfer learning is carried out is a relevant aspect.

A widely used procedure in transfer learning is to remove the last layers of the pre-trained model and replace them with layers that offer an adequate solution to the problem to be solved. In this case, since the last unremoved layer of the model is a convolutional one whose output has more than one dimension, we need a layer that converts the convolutional output into a vector. This is precisely what the flatten layer does. On the other hand, after some experiments in which the models suffered from overfitting, measures were incorporated to avoid this behavior, so we embedded regularization kernels and a dropout layer that removes 20% of the neurons.

The motivation and discussion is now reported in a new subsection: 3.4 Experimental settings:

 

A total of 3 different runs were performed, using the transfer learning technique, for each filter and each model, by randomly selecting the training, validation and test sets.

Different scenarios can be considered for transfer learning. The first scenario is to have a large dataset (in our case, images), different from the pre-trained model's dataset, and to be able to train the entire model; here it is useful to initialize the model from a pre-trained model, using its architecture and weights. The second scenario is when the set of images to be trained on has a certain similarity to the images used to train the model, or when only a small set of images is available to train the new system. In this case, some layers are usually trained and others are left frozen. Finally, the third scenario is to have a small set of images that is nevertheless similar to the pre-trained model's dataset; in that case, the convolutional base is usually frozen. The scenario used in this article is the first one: the model is initialized from a pre-trained model (using its architecture and weights), and the entire model is trained/adjusted (no layer was frozen).

Therefore, the procedure followed for each filter and deep learning model is as follows. First, the model pre-trained on the ImageNet dataset is loaded, removing the last default layers and replacing them with layers that offer an adequate solution to the problem to be solved (a flatten layer). On the other hand, after some experiments in which the models suffered from overfitting, measures were incorporated to avoid this behavior, so we embedded regularization kernels and a dropout layer that removes 20% of the neurons. Finally, a fully connected layer whose activation function is a sigmoid is added, resulting in a single output. Additionally, in this last layer, L1 and L2 kernels have been applied to regularize training. These kernels are penalty factors added to the loss function so that abrupt changes in the values of the weights are penalized during training, avoiding over-fitting of the models. Therefore, the usual transfer learning procedure was used. By relying on prior learning, it avoids starting from scratch and allows us to build accurate models in a time-saving manner [77–79].
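The effect of the L1 and L2 kernel penalties described above can be sketched as follows (the coefficient values and function name are illustrative; the paper does not report the coefficients used):

```python
def penalized_loss(base_loss, weights, l1=0.01, l2=0.01):
    # Add L1 (absolute value) and L2 (squared) penalties on the
    # layer weights to the base loss, so that large, abrupt
    # weight values are discouraged during training.
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = l2 * sum(w * w for w in weights)
    return base_loss + l1_term + l2_term

# larger weights are penalized more heavily
small = penalized_loss(1.0, [0.1, -0.1])  # about 1.0022
large = penalized_loss(1.0, [3.0, -3.0])  # about 1.24
```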

The input to the different deep learning models is the breast cancer histopathology images. These images received the appropriate scaling for the expected input of each neural network. An Adam optimizer with a learning rate of 10^-5 was used. During training, binary cross-entropy was used as the loss function and a total of 50 epochs were iterated for each model. However, in general, not that many iterations were performed, because early stopping with a patience of 15 epochs on validation accuracy was used, with the weights of the best result recovered at the end of training. A batch size of 128 was used.
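The early-stopping rule described above (stop after 15 epochs without improvement in validation accuracy, keeping the best epoch's weights) can be sketched as follows (function and variable names are illustrative):

```python
def early_stopping(val_accuracies, patience=15):
    # Track the best validation accuracy seen so far; stop once
    # it has not improved for `patience` consecutive epochs.
    best_acc, best_epoch, waited = float("-inf"), -1, 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # restore the best epoch's weights here
    return best_epoch, best_acc

# toy history: accuracy peaks at epoch 2, then stagnates
history = [0.70, 0.80, 0.85] + [0.84] * 20
print(early_stopping(history))  # (2, 0.85)
```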

 

 

The p-value discussion at the end of page 8 should be rephrased and moved to the materials and methods section.

 

Thank you for your comment. We have modified in this new version of the paper the paragraphs of the discussion of p-value to the following:

 

“The ANOVA table decomposes the variability of the AUC into contributions due to different factors. Because the sums of squares are of type III (the default), the contribution of each factor is measured after the effects of all other factors have been removed. The p-values test the statistical significance of each factor. When a p-value is less than 0.05, that factor has a statistically significant effect on the AUC at a 95% confidence level.

Of all the information presented in the ANOVA table, the researcher's main interest is likely to be focused on the values in the "F-ratio" and "p-value" columns. If the numbers in the p-value column are smaller than the critical value set by the experimenter, the effect is considered significant. This critical value is often set at 0.05. Any value lower than this leads to significant effects, while any value higher than this leads to non-significant effects.”

 

Since section 3 is already large, this information has been kept in the results section (which reports the results obtained from the statistical analysis).

 

Table 1 highlights the best model for a given filter. This is not the presented scope of the manuscript. It would be more appropriate to highlight the best filter for each model instead.

We would like to clarify that the aim of this paper is to statistically analyze the influence of the different preprocessing alternatives for the images (which we have called filters), as well as the different alternatives of deep learning models, on the AUC (an error rate widely used in the literature). We have highlighted that both main effects are statistically analyzed. Therefore, the behavior of the different deep learning models is relevant in this paper, and it also follows from the statistical analysis that it is the factor with the greater impact on the behavior of the system. The alternatives for the type of pre-processing have no statistical significance with respect to the error measure that determines the behavior of the system.


For this reason, the best deep learning model has been highlighted in Table 4, which presents all possible combinations of deep learning models and filters.

 

In the conclusion section the authors refer to “intelligent systems”; did they mean “artificial intelligence” or similar?

 

Thank you very much for the comment. Yes, you are right. We have changed it to:

 

“There is an extensive bibliography on the use of artificial intelligence systems for classification and decision support of histopathological images.”

 

The conclusion that “the choice of the architecture is statistically relevant” is an obvious conclusion. On the other hand, the filtering (focus of the manuscript) did not show significant differences and is not discussed.

 

As previously stated, the aim of this paper is to statistically analyze the influence of the different preprocessing alternatives for the images (which we have called filters), as well as the different alternatives of deep learning models. There are a large number of studies on both factors in the literature: there are many alternatives for image pre-processing, and there are also many alternatives for the deep learning model. However, to the best of our knowledge, there is no exhaustive statistical analysis in the literature of the impact that the different alternatives of deep learning models and preprocessing of histological images have on the behavior of the system for the problem of breast cancer.

 

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Thanks for incorporating the feedback given. Maybe the authors could consider expanding Table 5, as it is very hard to read in its current form.

Reviewer 2 Report

Thanks for the revision. My concerns have been addressed. However, the novelty of this study is not very high. The authors are recommended to highlight the revised text to facilitate the review process.
