1. Introduction
Bone cancer accounts for approximately 0.2% of all cancers worldwide. Approximately 3600 patients were diagnosed with bone cancer and around 1720 patients died from it in 2020 [1]. The most common type of bone cancer is osteosarcoma, which accounts for 28% of bone cancers found in adults and 56% of those found in adolescents [2]. Tumors can be felt when touched and are painful when pressed, and the affected bones can break easily during daily activities. One of the issues in orthopedic nursing is that the prevalence of the disease is very low; as a result, doctors are unfamiliar with it and find it difficult to diagnose. Osteosarcoma is a primary bone malignancy with a particularly high incidence rate in children and adolescents relative to other age groups [3,4]. Within one to two years after surgery, osteosarcoma patients often return emaciated and with dyspnoea because cancer cells have spread to other organs, especially the lungs. Unfortunately, between 50 and 75% of patients with osteosarcoma will present with clinically detectable metastases from bone to lung [4,5,6], a proportion that has increased with sophisticated methods of detection such as computed tomography (CT).
Computer-aided diagnosis (CAD) of osteosarcoma metastases is a popular research topic because it can help doctors detect nodules in a patient’s lung at an early stage, and several methods have been proposed in the past few years. A CAD system can separate nodule from non-nodule images in the large series of CT-scanned images extracted from the Digital Imaging and Communications in Medicine (DICOM) format. The most promising machine learning tools are Convolutional Neural Networks (CNNs), which are trained by feeding labelled data or images into the networks to obtain the output categories. CNNs are the most popular methods for classifying images [7,8]. A CNN extracts and learns the important features in its convolution layers, then classifies the output as a positive or negative diagnosis of osteosarcoma metastases using its fully connected layers. This learning algorithm, called a supervised learning algorithm, trains a CNN model to achieve good performance by using quality images in the dataset, such as balanced or enhanced images. The quantity of images in the dataset, as well as the choice of a suitable pretrained neural network, also influence how well the learning techniques function [9,10,11,12]. A CNN is a trainable machine learning algorithm and can classify with high accuracy when provided with a good image dataset. In most cases, the dataset consists of a huge number of images to learn from; however, since osteosarcoma metastatic disease is a very rare disease, CT-scanned images are limited. The situation is more severe for transferred patients because of the patient privacy law and government policy in Thailand: only CT-scanned image files on CD-ROM media can be sent to the new hospital, and the transmission of patient data or image files via the internet is prohibited for safety control. CD-ROM media generally has a capacity of 703 MB, so the DICOM file from the CT scan machine cannot be written onto one CD-ROM. The source hospital therefore has to convert the DICOM file to another image file format, such as JPG, PNG or BMP, to reduce the file size before writing it onto CD-ROM and transferring it to the destination hospital. The quality of the image files is then changed by their format and may affect the accuracy of the network used for detecting nodules in the lung.
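As an illustration of this supervised pipeline, the following minimal sketch uses Keras to place a binary (nodule/non-nodule) classification head on a pretrained VGG-16 convolutional base; the directory layout, input size and hyperparameters are illustrative assumptions rather than the exact configuration used in this study.

```python
# Minimal transfer-learning sketch (illustrative, not the study's exact setup):
# a frozen pretrained VGG-16 base extracts features in its convolution layers,
# and a small fully connected head classifies each CT slice as nodule/non-nodule.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # train only the classifier head

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # positive/negative diagnosis
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Labelled slices arranged in class subfolders (hypothetical path "dataset/train").
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/train", image_size=(224, 224), batch_size=16, label_mode="binary")
model.fit(train_ds, epochs=400)
```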
One popular image file format is the raster format, which consists of a grid of pixels combined into a larger image. When a raster image is zoomed in, plenty of visible squares appear. Raster is thus a complex image file format with smooth color gradients; however, if the image is reduced or enlarged, the lines between pixels start to show. The differences between pixels can be used in image processing to detect the edges of objects in an image. There is also another type of image, called a vector image, generated from mathematical calculation. Vector images are displayed as geometric shapes created by a graphics program, and the image file is defined by a set of mathematical parameters; examples of vector files include AI and EPS. At present, this type of image is not applied to medical use because of its huge file size, which affects storage space and delays display time; furthermore, the software in many hospitals still cannot support this type of image [13,14,15].
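The pixel-difference idea mentioned above can be illustrated with a minimal NumPy sketch (not part of this study's pipeline): differences between neighbouring pixels of a raster image are combined into an edge-strength map.

```python
# Pixel-difference edge sketch (illustration only): large differences between
# neighbouring raster pixels mark object boundaries.
import numpy as np

def edge_magnitude(gray: np.ndarray) -> np.ndarray:
    g = gray.astype(float)
    gx = np.diff(g, axis=1)[:-1, :]  # horizontal neighbour differences
    gy = np.diff(g, axis=0)[:, :-1]  # vertical neighbour differences
    return np.hypot(gx, gy)          # combined edge strength

img = np.zeros((4, 4))
img[:2, :2] = 255                    # toy image: bright square in one corner
print(edge_magnitude(img))           # large values along the square's border
```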
A DICOM file does not contain only the pixel image data from the CT scan machine; it also contains extra data such as the patient record, pixel density, slice size, medical protocol, hospital information and machine specification. Moreover, the pixel data in a DICOM file are recorded in a long binary format without being separated into individual CT slice images, so the pixel data must be extracted, separated and converted into a series of common CT-scanned image files before they can be used in any machine learning algorithm. This conversion produces a variety of image file formats with different qualities.
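A minimal sketch of this extraction and conversion step is shown below, assuming the pydicom and Pillow libraries; the actual conversion software used by the source hospitals is not specified, and the folder layout and simple min-max rescaling to 8 bits are illustrative only.

```python
# Sketch: extract the pixel data of a DICOM series and write each slice as a
# common 8-bit image file. pydicom/Pillow are assumed to be available; the
# folder layout and the naive min-max rescaling are illustrative only.
from pathlib import Path
import numpy as np
import pydicom
from PIL import Image

for i, path in enumerate(sorted(Path("study").glob("*.dcm"))):
    ds = pydicom.dcmread(path)         # header (patient record, protocol, ...) plus pixels
    px = ds.pixel_array.astype(float)  # raw slice pixel data
    rng = px.max() - px.min()
    px = (px - px.min()) / (rng if rng else 1.0) * 255.0  # rescale to 8 bits
    Image.fromarray(px.astype(np.uint8)).save(f"slice_{i:04d}.png")  # or .jpg / .bmp
```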
The common image file formats are BMP, JPG and PNG. Given the limited number of images and the different file formats, this research creates CNNs for osteosarcoma metastatic disease detection, investigates the effect of the number of images in the training dataset when using different image formats, and evaluates the F1-Score performance index of networks trained on each image format against all formats.
The aim of this study is therefore to analyze the effectiveness of learning systems built from common image file types to detect osteosarcoma based on CNN models, as a guideline for radiologists to choose a robust model regardless of image format and dataset size.
3. Results
Experiments were conducted to measure the performance impact of different quality image formats and different training dataset sizes. In the first experiment, all three possible CT-scanned image file formats were used to train three different CNNs: VGG-16, ResNet-50 and MobileNet-V2. The dataset in this experiment, called the large dataset, was built by randomly selecting 80% of the 2212 CT-scanned images (1769 images) and converting them into BMP, JPG and PNG formats. These 1769 CT-scanned images contained nodules and were labeled as the positive class. The remaining CT-scanned images were used for evaluating the trained networks. To balance the positive images in the large dataset, negative CT-scanned images were randomly selected from other CT-scanned images that did not contain nodules. In total, 3538 images in the large dataset were used to train the networks. Owing to the limitations of the hospital’s computer resources, the batch size was set to 16 per iteration, meaning that 16 images were fed forward into the networks and the errors from those images were accumulated before being used to find the gradient and train the networks via the backpropagation algorithm to reduce the loss value. To cover all the images in the dataset, one epoch requires 222 iterations; the term “epoch” means training the neural network with all the training data for one cycle. Therefore, training the networks for 400 epochs requires 88,800 iterations, and training them for 2000 epochs requires 444,000 iterations.
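These iteration counts can be verified directly from the dataset size and batch size:

```python
# Check of the iteration counts quoted above.
import math

images, batch_size = 3538, 16
iters_per_epoch = math.ceil(images / batch_size)  # 222 iterations cover the dataset once
print(iters_per_epoch)           # 222
print(iters_per_epoch * 400)     # 88,800 iterations for 400 epochs
print(iters_per_epoch * 2000)    # 444,000 iterations for 2000 epochs
```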
In the first experiment, each model, VGG-16, ResNet-50 and MobileNet-V2, was trained with the three different image file types for 400 epochs. This produced nine trained networks, named L-VGG-JPG, L-VGG-PNG, L-VGG-BMP, L-ResNet-JPG, L-ResNet-PNG, L-ResNet-BMP, L-MobileNet-JPG, L-MobileNet-PNG and L-MobileNet-BMP. These names are used as references in this section, with the prefix L indicating that the networks were trained with the large dataset.
Figure 5 shows the loss value when training VGG-16. The loss graphs of ResNet-50 and MobileNet-V2 show behavior similar to VGG-16, differing in training slope and steady-state loss fluctuation. The loss of the models trained with different image formats differed at the beginning; however, the loss decreased toward zero as the number of epochs increased. The loss of VGG-16 (a) was the smallest among the three selected models. The loss indicates the performance of a model when evaluated on the dataset: a loss value near zero means less error.
One big concern for a radiologist or oncologist who trains the model in their hospital is when to stop the training. To answer this question, the second experiment was conducted, in which the CNN models were trained for more epochs to determine the effect of the number of epochs on the loss values. Since the VGG-16 models had the smallest loss values in the first experiment, only the VGG-16 networks were trained further, up to 2000 epochs. The training graphs were recorded and divided into four training phases. In phase 1, the loss values between 300 and 400 epochs were considered, as shown in Figure 6a; the loss values between 1000 and 1100 epochs, 1500 and 1600 epochs and 1900 and 2000 epochs were considered for phases 2, 3 and 4, respectively, as shown in Figure 6b–d. In each phase, the trendline of loss vs. epochs was calculated and the slope of the trendline was collected. Table 3 shows the slopes of all training phases.
From Table 3, the slopes of the training phases are very close to each other, since all the slopes are very small. This second experiment showed that the loss values converged to zero and the training slopes converged to zero for all training phases. Thus, it was not necessary to keep training beyond 400 epochs. This result means that if a radiologist trains the network and its loss is small enough, the training process can be stopped with confidence.
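The text does not state how the trendlines were fitted; a least-squares linear fit of loss against epoch, as sketched below, is one common way to obtain such per-phase slopes.

```python
# Per-phase trendline slope, assuming a least-squares linear fit of loss
# against epoch (the fitting method is not stated in the text).
import numpy as np

def phase_slope(epochs, losses, start, end):
    epochs, losses = np.asarray(epochs), np.asarray(losses)
    mask = (epochs >= start) & (epochs <= end)
    slope, _intercept = np.polyfit(epochs[mask], losses[mask], deg=1)
    return slope

# e.g., phase 1 covers epochs 300-400:
# slope_phase1 = phase_slope(epochs, losses, 300, 400)
```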
Because osteosarcoma is a very rare bone cancer, the set of CT-scanned images available for training a CNN model may be very small. Therefore, a third experiment was conducted to determine the impact of a small training dataset on the models’ performances; it simulates the situation of a small hospital that has very few CT-scanned images of patients. To obtain the effect of a small dataset, 10% of the 2212 positive class images, or 220 images, were randomly picked for model training. Of these, 177 positive images (80%) were randomly selected as the positive CT-scanned class for training and combined with 177 negative CT-scanned class images that did not contain nodules to create the small training dataset. In total, 354 images in the small dataset were used to train VGG-16, ResNet-50 and MobileNet-V2. The CNN models were then trained for 400 epochs under the same scenario as the first experiment, yielding nine trained models. The training graphs for the small dataset were displayed and compared with those for the large dataset; to see this phenomenon, the training graphs were expanded from 300 to 400 epochs, as shown in Figure 7. The fluctuation of the VGG-16 model’s loss is very small when trained with either the small or the large dataset. In the case of ResNet-50 and MobileNet-V2, the loss values fluctuated within a larger range when the models were trained with the small dataset: the loss of ResNet-50 fluctuated by about 0.23 when trained with the small dataset and 0.07 when trained with the large dataset, and the loss of MobileNet-V2 fluctuated by about 0.17 when trained with the small dataset and 0.07 when trained with the large dataset. According to this result, VGG-16 is the most robust CNN model when trained with either the large or the small dataset.
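The exact sampling procedure is not given in the text; the following sketch shows one way the small dataset could be drawn, with the counts taken from the description above and the negative image pool size and seed assumed for illustration.

```python
# One way the small dataset could be drawn (the exact sampling code is not
# given in the text). Counts follow the description; the negative pool size
# (5000) and the seed are assumptions for illustration.
import random

random.seed(0)
positive_ids = list(range(2212))              # stand-ins for positive slice IDs
subset = random.sample(positive_ids, 220)     # ~10% of the positive class
train_pos = random.sample(subset, 177)        # 80% of the subset for training
train_neg = random.sample(range(5000), 177)   # equal number of nodule-free slices
print(len(train_pos) + len(train_neg))        # 354 images in the small dataset
```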
The first, second and third experiments show that the losses approached zero for all models under all conditions. However, the losses do not reflect real performance when the trained models are used in real situations where the input images are unseen. Therefore, the fourth experiment was conducted to obtain the models’ performance when validated with unseen CT-scanned images. The unseen positive CT-scanned images are the remaining 443 images of each image file type (JPG, PNG and BMP) from the first experiment, which were not included in the large dataset. The 443 unseen negative CT-scanned images were randomly selected from the CT-scanned images that were not included in any dataset and did not contain nodules. All 400-epoch-trained models (nine models trained with the small dataset and nine models trained with the large dataset) were then evaluated with the three different unseen image file types, yielding 54 cases in total. The evaluated results and performance scores are shown in Table 4 for the models trained with the small dataset and in Table 5 for the models trained with the large dataset. The experimental results show that different unseen file formats have only a small effect on the models’ performance, which indicates that the trained models can achieve the same performance in real scenarios where the input image formats differ, such as in the case of transferred patients.
In the case of the small dataset, the VGG-16 models have the best F1-Score when evaluated with unseen images, while the accuracy of all models is close. Moreover, the VGG-16 model achieves the same scores and results (TP, TN, FP and FN) regardless of the trained or tested image file format. Therefore, if the training CT-scanned images are limited, the VGG-16 model is the best CNN choice for detecting nodules, since it is the most robust model with the highest F1-Score. In the case of the large dataset, the accuracy of all models decreased, while the F1-Score increased for some models. For ResNet-50, the models trained with the JPG file format had a higher F1-Score than the models trained with the BMP and PNG formats, respectively. MobileNet-V2 showed a similar format dependence, but its models trained with the BMP format had the highest F1-Score, followed by the models trained with the PNG and JPG formats, respectively. Considering the precision and recall values, the VGG-16 model was still robust across all trained and tested image formats in the case of the large dataset, even though it did not have the best F1-Score.
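The scores reported in Tables 4–6 follow the standard definitions computed from the confusion counts (TP, TN, FP and FN); the helper below makes the relationships explicit, with purely illustrative counts in the usage line.

```python
# Standard scores computed from the confusion counts, as reported in Tables 4-6.
def scores(tp: int, tn: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # a.k.a. sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

print(scores(tp=300, tn=400, fp=43, fn=143))  # illustrative counts only
```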
To verify the results for training with the large dataset, the models from the fourth experiment trained for 2000 epochs were considered. The evaluated results and performance scores of these models are shown in Table 6. The accuracy and F1-Score increased for some models and decreased for others.
The uncertainty remains for ResNet-50 and MobileNet-V2 in the 2000-epoch training case. For ResNet-50, the model trained with the JPG format had the highest F1-Score, followed by the models trained with the PNG and BMP formats, respectively. Even though the F1-Score of the ResNet-50 model trained with the JPG format increased slightly when the epochs were increased from 400 to 2000, the F1-Scores of the ResNet-50 models trained with the PNG and BMP formats decreased. The situation was more confusing for the MobileNet-V2 cases: the MobileNet-V2 model trained with the BMP format had the highest F1-Score when trained for 400 epochs, but the model trained with the PNG format had the highest F1-Score when trained for 2000 epochs, and the model trained with the JPG format overtook the model trained with the BMP format when the epochs were increased to 2000. For VGG-16, all scores remained the same even when the epochs were increased. From this result, the VGG-16 model is still the most robust CNN model for detecting nodules in CT-scanned image files. It is not easy to find the number of epochs required for the ResNet-50 and MobileNet-V2 models to reach their best F1-Score, and the image file formats affected the scores of ResNet-50 and MobileNet-V2 inconclusively.
All the experiments show low F1-Scores and accuracy for all models, regardless of the training or testing image format, because of the small nodule size in the CT-scanned images. However, the experiments establish the robustness of each model across different image formats, which addresses the issue of patients transferred between hospitals in Thailand. Given the variety of common image formats that can be sent to Lerdsin Hospital, the experimental results provide a guideline for radiologists to select the VGG-16 model in order to obtain consistent accuracy and F1-Score outputs regardless of image format and dataset size. The other models show fluctuating and unpredictable F1-Scores and accuracy when trained and tested with different image formats and different numbers of epochs. The experiments also provide a training guideline: if the model’s loss is low enough, it is not necessary to keep training the model. Finally, radiologists can comfortably use the VGG-16 model with any common image file format, whether already on hand or arriving in the future.
4. Conclusions and Discussion
This research analyzes the effectiveness of learning systems built from common image file types to detect osteosarcoma based on CNN models. Three popular CNN models for CT-scanned images were investigated: VGG-16, ResNet-50 and MobileNet-V2. Osteosarcoma bone cancer is a rare type of cancer and is thus hard to detect even when advanced computer-aided technology, such as machine learning or CNNs, is applied [36]; the cases and resources available for successful machine learning are limited. Furthermore, there is an additional problem in Thailand. Owing to the technical and policy issues affecting transferred patients at Lerdsin Hospital, all patient records and image files must be written to CD-ROM media for safety purposes. The patients’ DICOM files cannot be transferred from the source hospital to the destination hospital because they cannot fit onto a CD-ROM, so the DICOM files must be transformed into other common CT-scanned image file formats [37,38]. The three common CT-scanned image file formats sent to Lerdsin Hospital are BMP, JPG and PNG [15,39]. This research sought to answer the question of the impact of these different file formats on the CNN models during training and on performance validation. The conclusion can leverage telemedicine operations in Thailand: hospitals in remote areas can use an online detection system with any kind of CT-scanned image format, yet achieve the same result. The findings indicate uncertain and inconclusive results for ResNet-50 and MobileNet-V2 when the models were trained with different CT-scanned image formats; moreover, the F1-Score of the same model changed when the number of epochs was increased. This implies that ResNet-50 and MobileNet-V2 are not robust for detecting osteosarcoma bone cancer metastasized to the lung when trained with different image formats. They depend on both the training image format and the number of epochs, which leads to an unclear conclusion about how many training epochs are required to obtain the best F1-Score and which training image format should be used. In the case of the VGG-16 model, however, the results of the models trained with the large and small datasets were consistent, and the different CT-scanned image formats did not affect the performance scores of the trained models in any case. VGG-16 is the most robust CNN model for detecting osteosarcoma bone cancer metastasized to the lung. Given the experimental results, the file formats have little impact on the overall performance scores; thus, the criteria for selecting an image format should be the image size and quality that can be used most practically. The average CT-scanned image sizes of the JPG, PNG and BMP formats in the datasets are 177.22, 271.10 and 1029.92 kilobytes, respectively.
Therefore, the PNG image format is preferred for both training and validating models in medical applications, because PNG is a patent-free, lossless compression image format that has been proven to give the best quality medical images for radiologic application [15,39] with a small storage size [37] that is suitable for a small hospital.
The F1-Score decreased when the training epochs increased; this phenomenon was also investigated and discussed for further understanding. As the positive class images in Figure 8 show, the area of the nodule in the patient’s lung is tiny compared with the whole image area, which makes it difficult even for experienced radiologists to detect the positive nodules in the image. Consequently, to increase the performance of models for detecting osteosarcoma nodules in the lung, more sophisticated machine learning algorithms or CNN models that can identify the regions of interest most likely to contain nodules, such as a Region-based Convolutional Neural Network (RCNN) or the Single Shot Detection (SSD) framework, must be further investigated.