Article

A Method for Real-Time Lung Nodule Instance Segmentation Using Deep Learning

Department of Medicine and Health Sciences “Vincenzo Tiberio”, University of Molise, 86100 Campobasso, Italy
* Author to whom correspondence should be addressed.
Life 2024, 14(9), 1192; https://doi.org/10.3390/life14091192
Submission received: 20 June 2024 / Revised: 16 September 2024 / Accepted: 19 September 2024 / Published: 20 September 2024

Abstract

Lung screening is crucial for the early detection and management of lung masses, with particular regard to cancer. Studies have shown that lung cancer screening can reduce lung cancer mortality by 20–30% in high-risk populations. In recent times, deep learning, and computer vision in particular, has demonstrated the ability to effectively detect and locate objects in video streams and also in (medical) images. Considering these aspects, in this paper we propose a method for instance segmentation of lung masses, i.e., a method that provides a mask for each detected lung mass instance, allowing individual masses to be identified even when they overlap or lie close to each other, and that classifies the detected masses as (generic) nodules, cancer, or adenocarcinoma. We considered the you-only-look-once (YOLO) model for lung nodule segmentation. An experimental analysis, performed on a set of real-world lung computed tomography images, demonstrated the effectiveness of the proposed method not only in detecting lung masses but also in segmenting them, thus providing radiologists with a helpful tool for automatic lung screening and for discovering very small masses that are not easily recognizable to the naked eye and that may deserve attention. In the evaluation of a dataset composed of 3654 lung scans, the proposed method obtains an average precision of 0.757 and an average recall of 0.738 in the classification task, and reaches an average mask precision of 0.75 and an average mask recall of 0.733. These results indicate that the proposed method is capable not only of classifying masses as nodules, cancer, or adenocarcinoma, but also of effectively segmenting the corresponding areas, thereby performing instance segmentation.

1. Introduction

Lung cancer (LC) causes nearly 400,000 deaths annually in Europe (https://gco.iarc.fr/today/en, accessed on 19 June 2024). Unfortunately, most patients are diagnosed at an advanced stage, resulting in very low 5-year overall survival rates of less than 10% [1,2,3]. This is particularly tragic, as low-dose computed tomography (CT) screening has shown a survival benefit for a well-defined high-risk population since the National Lung Screening Trial (NLST) in 2011 [4,5,6], a finding later confirmed by several European randomized controlled trials [7,8,9]. Screening high-risk individuals could thus lead to earlier detection and significantly improve 5-year survival rates. However, the U.S. experience reveals challenges in implementing national screening programs, with less than 10% of at-risk individuals participating [10,11,12]. Similar difficulties are anticipated in other countries, suggesting that it may take years to achieve a stage shift and improved survival rates in cancer registries.
Another major opportunity to achieve earlier detection is through incidentally detected lung nodules, which must be distinguished from those found through screening. Incidental pulmonary nodules (IPNs) are a common finding in radiology, occurring on any CT scan that includes the lungs. With the increasing use of chest imaging for various reasons, the detection of IPNs is also rising. While only a small number of IPNs are early-stage LC, identifying and following up on these malignant nodules is crucial. The accurate assessment and appropriate follow-up of IPNs could be as vital as screening programs in reducing LC mortality.
This issue is particularly pressing since IPNs often occur in individuals who do not qualify for LC screening due to diverse sociodemographic and medical characteristics, forming a distinct group from the LC screening population. A major concern is that in the U.S., approximately 1.1 million IPNs annually do not receive proper follow-up, as shown by several analyses of U.S. health data [13].
In recent years, specially designed IPN management programs have been developed, mainly in the U.S., to improve the follow-up process. Implementing these programs has increased the rate of early LC detection, especially among individuals ineligible for LC screening. Therefore, a combined approach that includes both LC screening for high-risk individuals and proper follow-up for those with IPNs is essential for reducing LC mortality across a large population of potentially affected individuals.
However, there is a lack of data on the real-world situation of IPNs in European countries. The reality of IPN management remains unclear, and we can only speculate on how many IPNs may be overlooked in Europe [13].
There are several important reasons to boost lung screening. First of all, screening, particularly with low-dose computed tomography (LDCT), allows lung nodules to be detected at an earlier stage. Early detection is crucial because it increases the likelihood of successful treatment and improves survival rates. Moreover, screening can lead to a shift in diagnosis from advanced stages of lung cancer, which are harder to treat and have poorer prognoses, to earlier stages that are more amenable to curative treatment.
Screening can also help in reducing mortality; in fact, studies such as the National Lung Screening Trial (NLST) and European randomized controlled trials have demonstrated that LDCT screening in high-risk populations can significantly reduce lung cancer mortality. Furthermore, by identifying lung nodules early, screening can lead to earlier interventions, which can significantly improve 5-year overall survival rates.
With screening, it is also possible to target high-risk populations: screening programs focus on individuals at high risk for lung cancer, such as long-term smokers and older adults, thereby maximizing the benefit-to-harm ratio of screening. The early detection of nodules in high-risk individuals also allows for prompt medical or surgical interventions, potentially preventing progression to advanced lung cancer.
Screening programs often include regular follow-up scans to monitor identified nodules. This ensures that any changes in size or appearance that might indicate malignancy are promptly addressed. Systematic screening reduces the likelihood of incidental nodules being overlooked, ensuring that suspicious nodules are appropriately evaluated.
The early detection and treatment of lung cancer through screening can be more cost-effective than treating advanced-stage cancer. Early-stage treatments often involve less complex and less expensive interventions. Screening can help allocate healthcare resources more efficiently by reducing the need for extensive treatments associated with advanced-stage cancer.
There are also several psychological benefits from screening; for instance, for high-risk individuals, regular screening can provide reassurance and reduce anxiety about developing lung cancer. Early detection empowers patients with more treatment options and better-informed decisions about their health.
Unfortunately, there are several barriers to screening campaigns. One is low participation rates: despite the benefits, participation in screening programs can be low, as seen in the U.S., where less than 10% of eligible individuals participate, so efforts to increase awareness and accessibility are needed. Screening can also result in false positives, leading to unnecessary anxiety, further testing, and invasive procedures; effective screening programs must therefore balance the benefits of early detection with the risks of overdiagnosis and overtreatment.
Another problem is related to ensuring high-quality, standardized screening programs, which is essential for maximizing the benefits and minimizing the harms of lung cancer screening. Radiologists and healthcare providers need adequate training to accurately identify and manage lung nodules detected during screening.
In summary, the importance of screening in discovering lung nodules lies in its potential to detect lung cancer at an earlier, more treatable stage, thereby improving survival rates and reducing mortality. Despite the challenges, the benefits of systematic and well-implemented screening programs can be substantial for high-risk populations.
There are several kinds of masses that can be found in the lung: a mass in the lung is a larger, abnormal growth in the lung tissue, typically defined as being greater than 3 cm (about 1.2 inches) in diameter. Lung masses can be benign (non-cancerous) or malignant (cancerous), but they are more likely to be malignant compared to smaller lung nodules.
In detail, a lung nodule is a small, round, or oval-shaped growth in the lung, often detected incidentally during chest imaging tests such as X-rays or CT scans. It is typically less than 3 cm (about 1.2 inches) in diameter. Nodules larger than 3 cm are generally referred to as lung masses and are more likely to be malignant. They can be benign (non-cancerous) or malignant (cancerous). Causes of benign lung nodules include infections, inflammatory conditions, and non-cancerous growths. They require monitoring through follow-up imaging to determine if the nodules change in size or appearance, which can indicate malignancy.
A lung mass can be recognized as lung cancer if it is characterized by an uncontrolled growth of abnormal cells in the lung tissues, which can form tumors, interfere with normal lung function, and spread (metastasize) to other parts of the body.
The major types of lung cancer include non-small-cell lung cancer (NSCLC) and small-cell lung cancer (SCLC). Symptoms may include persistent cough, chest pain, shortness of breath, and unexplained weight loss. They can be diagnosed through imaging tests, biopsies, and other diagnostic procedures. Staging is based on the size and extent of the tumor, lymph node involvement, and presence of metastasis.
Adenocarcinoma is a subtype of non-small-cell lung cancer (NSCLC) that arises from glandular cells in the lungs, which are cells that secrete mucus and other substances. It represents the most common type of lung cancer, especially in non-smokers and women. It is often found in the outer regions of the lungs and tends to grow more slowly than other types of lung cancer, such as small-cell lung cancer. Symptoms are similar to other forms of lung cancer but may include more frequent mucus production due to the glandular origin of the cancer. Diagnosis and treatment are similar to other types of lung cancer, involving imaging, biopsy, and treatments such as surgery, radiation, chemotherapy, and targeted therapies.
The main difference is that a lung nodule is a small growth that can be either benign or malignant, while lung cancer specifically refers to malignant tumors that originate in the lung. Lung nodules are often detected incidentally and require monitoring to determine their nature, whereas lung cancer is diagnosed based on the presence of malignant cells and typically requires comprehensive treatment.
Lung cancer is a broad term encompassing various types of cancer that originate in the lung, including both non-small-cell lung cancer (NSCLC) and small-cell lung cancer (SCLC). Adenocarcinoma is a specific type of NSCLC that originates from glandular cells in the lung. Adenocarcinoma has distinct pathological and clinical features compared to other types of lung cancer, such as squamous-cell carcinoma or small-cell lung cancer.
In recent times, artificial intelligence (AI), with particular regard to computer vision, has demonstrated a great ability to discover diseases automatically from medical images. For this reason, researchers are working to integrate computer vision models into medical diagnostics, given their impact on the screening process and their potential to enhance the accuracy, speed, and efficiency of healthcare delivery.
As a matter of fact, AI can be utilized to analyze medical images such as X-rays, CT scans, MRI, and mammograms with high precision. Machine learning algorithms, particularly deep learning algorithms, can identify patterns and anomalies that may not be easily detectable by human eyes.
Moreover, AI tools can assist radiologists by highlighting potential areas of concern in scans, such as tumors or fractures, thereby reducing the chance of oversight. Regarding pathology, digital pathology images can be analyzed to detect cancerous cells or other diseases, providing a second opinion to pathologists.
There are several tasks that can be automated with the adoption of computer vision; one of these tasks is so-called segmentation, i.e., a process that involves dividing an image into segments or regions that are easier to analyze and understand. This technique is fundamental for various applications such as object detection, recognition, and image classification. There are two main types of segmentation in computer vision: semantic segmentation and instance segmentation. Semantic segmentation assigns a label to each pixel in an image, categorizing them into predefined classes. Instance segmentation not only classifies each pixel but also distinguishes between different instances of the same class. For example, it can separate two different cars in the same image, labeling them as distinct instances.
With regard to the state-of-the-art methods in the literature on disease detection using AI, convolutional neural networks (CNNs) [14,15,16,17], deep convolutional neural networks (DCNNs) [18,19,20], and recurrent neural networks (RNNs) are the most commonly used learning algorithms in medical imaging [21]. The CNN architecture is highly favored for supervised deep learning tasks such as lesion segmentation and classification due to its minimal preprocessing requirements. Recently, CNNs have been utilized in medical imaging for tasks like image segmentation (e.g., Mask R-CNN [22]) and classification (e.g., AlexNet [23] and VGGNet [24]). DCNNs, which include more layers and complex nonlinear relationships, have demonstrated reasonable accuracy in classification and regression tasks [25,26]. RNNs, as higher-order neural networks, can re-input the network output, allowing them to capture and exploit cross-slice variations to incorporate volumetric patterns of nodules, but they are susceptible to the vanishing gradient problem [27].
In this paper, we propose a method aimed to detect and locate the presence of lung nodules, cancer, and adenocarcinoma from medical images, by considering instance segmentation.
As the instance segmentation model, we consider the you-only-look-once (YOLO) model, in particular YOLO version 8 (YOLOv8); YOLOv8 has demonstrated strong performance in both object detection and instance segmentation, especially in terms of speed and accuracy, making it suitable for real-time applications. Differently from state-of-the-art methods in the literature, which mainly focus on classification, we focus on instance segmentation, i.e., a computer vision task concerned with identifying and delineating individual objects within an image while assigning a unique label to each pixel, as opposed to classification, which aims to assign a single label to the full image. In the current state-of-the-art literature, there have been several attempts to exploit the YOLO model for lung nodule segmentation; for instance, the researchers in [28,29,30,31] adopted the YOLO model for lung nodule segmentation, but, differently from the proposed method, they adopted previous versions of the model; in particular, some authors considered YOLOv5, while the authors in [29,30,31] proposed the adoption of YOLOv7. Methods exploiting CNNs have also been proposed [32,33]: in this case, the main difference with respect to the proposed method is that such CNNs assign a label to an image without the ability to perform instance segmentation (i.e., to locate and delineate the nodule in the image under analysis).
The remainder of this paper proceeds as follows: in the next section, an overview of the state-of-the-art research in lung cancer segmentation is provided, Section 3 introduces the method we propose for lung nodule instance segmentation, Section 4 discusses the experimental analysis obtained through the application of the proposed method on a real-world dataset of medical images, and, in the last section, the conclusion and future research lines are presented.

2. Related Work

Lung segmentation is a critical task in medical image analysis, enabling the precise identification of lung regions for subsequent diagnostic or therapeutic processes. Over the past decade, deep learning has revolutionized this field by providing robust and automated solutions. This section reviews significant contributions in lung segmentation using deep learning, focusing on the methods, datasets, and performances reported in the literature.
With regard to the state-of-the-art literature on disease detection using AI, CNNs, deep convolutional neural networks (DCNNs), and recurrent neural networks (RNNs) are the most commonly used learning algorithms in medical imaging [21]. The CNN architecture is highly favored for supervised deep learning tasks such as lesion segmentation and classification due to its minimal preprocessing requirements. Recently, CNNs have been utilized in medical imaging for tasks like image segmentation (e.g., Mask R-CNN [22]) and classification (e.g., AlexNet [23] and VGGNet [24]). DCNNs, which include more layers and complex nonlinear relationships, have demonstrated reasonable accuracy in classification and regression tasks [25,26]. RNNs, as higher-order neural networks, can re-input the network output, allowing them to capture and exploit cross-slice variations to incorporate volumetric patterns of nodules, although they are susceptible to the vanishing gradient problem [27].
Reinforcement learning was first applied by Google DeepMind in 2013 [34], and since then, it has been extensively explored to enhance lung cancer detection accuracy, sensitivity, and specificity. Semi-supervised learning techniques, such as deep reinforcement learning and generative adversarial networks, utilize labeled datasets. Supervised learning involves using a learning algorithm, where labels are assigned to input data during training. Various supervised deep learning methods have been applied to CT images for identifying abnormalities with anatomical localization. However, these methods have limitations, including the need for large amounts of labeled data, fixed network weights after training, and no capacity for post-training improvement. To address this, few-shot learning (FSL) models [35,36] have been developed to reduce data requirements during training.
Several deep learning approaches have been explored for lung segmentation. Wang et al. [37] developed a multi-view CNN (MV-CNN) for lung nodule segmentation, achieving an average DSC of 77.67% and an average ASD of 0.24 on the LIDC-IDRI dataset. Unlike conventional CNNs, MV-CNN integrates multiple input images for lung nodule identification, though it struggles with 3D CT scans. Consequently, a 3D CNN was created to handle volumetric patterns of cancerous nodules [38]. Sun et al. [39] designed a two-stage CAD system for automatic lung nodule segmentation and false positive (FP) reduction. This system, tested on the LIDC-IDRI dataset and evaluated by four experienced radiologists, achieved an average F1_score of 0.8501 for lung nodule segmentation.
In 2020, Cao et al. [40] introduced a dual-branch residual network (DB-ResNet) that captures multi-view and multi-scale features of nodules simultaneously. DB-ResNet, evaluated on the LIDC-IDRI dataset, achieved a DSC of 82.74%, outperforming trained radiologists. In 2021, Banu et al. [41] proposed the attention-aware weight-excitation U-Net (AWEU-Net) architecture for lung nodule segmentation in CT images. This architecture includes two stages: lung nodule detection using a fine-tuned Faster R-CNN and lung nodule segmentation using U-Net with position and channel attention-aware weight excitation. AWEU-Net achieved DSCs of 89.79% and 90.35% and IoUs of 82.34% and 83.21% on the LUNA16 and LIDC-IDRI datasets, respectively. Dutta [42] developed Dense R2Unet, a dense recurrent residual CNN based on U-Net with dense interconnections, which demonstrated superior segmentation performance compared to U-Net and ResUNet on a lung segmentation dataset.
One of the pioneering works in this domain is by Ronneberger et al. [43], who introduced the U-Net architecture, a convolutional neural network designed for biomedical image segmentation. This architecture has been widely adopted and modified for lung segmentation tasks due to its ability to capture context and high-resolution features simultaneously.
Christ et al. [44] proposed a two-step approach involving a lung segmentation step followed by nodule detection. Their method utilizes a fully convolutional network (FCN) and showed promising results on the LUNA16 dataset, demonstrating the effectiveness of deep learning in handling medical images.
Jin et al. [45] developed a 3D U-Net model to leverage volumetric information in CT scans. Their approach improved the segmentation accuracy by considering the spatial coherence across slices. The results on the LUNA16 dataset indicated superior performance compared to traditional 2D methods.
Another significant contribution is by Hofmanninger et al. [46], who introduced a self-supervised learning approach to enhance lung segmentation. By pre-training their model on a large dataset with a self-supervised task, they achieved improved performance on the LIDC-IDRI dataset.
The authors of [47] presented a hybrid approach combining deep learning and traditional image processing techniques. Their method integrates a CNN with conditional random fields (CRFs) to refine the segmentation boundaries, yielding better delineation of lung regions on the JSRT dataset.
In recent years, attention mechanisms have been incorporated into segmentation networks to improve focus on relevant features. The researchers in [48] developed an attention-guided U-Net, which significantly enhanced segmentation accuracy on the COVID-19-CT-Seg dataset by emphasizing critical regions within the lung fields.
Another notable study by Zheng et al. [49] utilized a dual-branch network to separately process lung fields and lesions, achieving state-of-the-art results on the COVID-19-CT-Seg dataset. This approach allowed for better handling of heterogeneous appearances of lung lesions.
Chen et al. [50] introduced a transformer-based architecture for lung segmentation, leveraging the global context provided by transformers to enhance segmentation performance on the LUNA16 and COVID-19-CT-Seg datasets.
The authors of [51] proposed a cross-attention network to further refine the segmentation process. By integrating cross-attention mechanisms, their model achieved improved accuracy and robustness, particularly in challenging cases with severe pathologies.
Finally, Wang et al. [52] explored the use of generative adversarial networks (GANs) for lung segmentation. Their model, trained on the LUNA16 and LIDC-IDRI datasets, demonstrated high accuracy and generalizability by generating more realistic segmentations.
Table 1 summarizes state-of-the-art methods in lung segmentation using deep learning, including details on the datasets used and the performance metrics reported.
Advancements in deep learning have significantly enhanced the lung instance segmentation performance. State-of-the-art models, such as Mask R-CNN and U-Net-based architectures, have achieved impressive accuracy in segmenting individual lung structures from medical images. For instance, Mask R-CNN can reach average Intersection over Union (IoU) scores exceeding 85%, while U-Net models often achieve Dice coefficients above 90% for lung segmentation tasks.
Recent developments also involve Transformer-based architectures and attention mechanisms, which have further refined the segmentation accuracy. These techniques improve the model’s ability to distinguish between overlapping lung regions and small structures, often resulting in accuracy improvements of 3–5% over previous methods. Additionally, integrating multi-scale and multi-modal data has enhanced performance, with some models achieving Dice scores and IoU metrics that approach or exceed 95%, especially in complex cases involving lesions and abnormalities.
Overall, deep learning has made substantial strides in lung instance segmentation, leading to more accurate and robust models that contribute to better diagnostic and treatment outcomes in pulmonary medicine.
Differently from the cited works, in this paper, we propose a method aimed to detect and locate the presence of lung nodules, cancer, and adenocarcinoma from medical images, by considering instance segmentation.

3. Deep Learning for Lung Nodule Instance Segmentation

In this section, we present the proposed method for detecting and localizing objects in lung CT, with a specific focus on identifying (generic) nodules, cancer, and adenocarcinomas by exploiting deep learning.
Figure 1 demonstrates the proposed approach: alongside each detected nodule, the proposed method presents the prediction percentage, reflecting the confidence of the model in the detection accuracy.
To build an effective deep learning model for object detection in lung CT, we need a dataset consisting of lung CT images. To create a model capable of identifying not only the presence of lung nodules but also their locations within the CT image, these images must be manually labeled by expert radiologists. This involves annotating the images with bounding boxes that mark the areas where the objects, specifically lung nodules, cancer, and adenocarcinoma, are located (as shown in Figure 1).
To ensure that the model accurately predicts unseen lung CT images, the dataset should encompass images captured from diverse angles and under varying conditions. Despite initial differences in image sizes, a preprocessing step is necessary to resize all images to a uniform dimension.
The next step involves increasing the dataset size through image augmentation (depicted in Figure 1). This technique introduces controlled random modifications, such as rotations, flips, and trims, to the existing images, creating new variations. Data augmentation enhances the neural network’s ability to generalize by exposing it to a wider array of scenarios, thus reducing the risk of overfitting, where the model performs well on training data but poorly on unseen data.
In the following, we further explain the techniques we considered for preprocessing and for augmentation: for each technique, we describe its purpose and working mechanism.
Regarding preprocessing, we apply the following techniques:
  • Auto-Orient
    - Purpose: Ensures that all images are oriented correctly.
    - How It Works: This technique reads the EXIF metadata of the image, which includes the orientation information (e.g., landscape, portrait, upside-down). It then rotates the image accordingly so that it is displayed correctly, regardless of how the photo was taken.
  • Resize: Stretch to 640 × 640
    - Purpose: Standardizes the size of all images to 640 × 640 pixels.
    - How It Works: Each image is resized to fit within a 640 × 640 pixel frame. “Stretch” indicates that the aspect ratio is not preserved, so the image may be distorted to fit the specified dimensions.
  • Auto-Adjust Contrast: Using histogram equalization.
    - Purpose: Enhances the contrast of the image.
    - How It Works: Histogram equalization is a method that adjusts the contrast of an image by redistributing the intensity values. It spreads out the most frequent intensity values, which can enhance the contrast of the image and make features more distinguishable.
  • Grayscale
    - Purpose: Converts the image to grayscale.
    - How It Works: The image is converted from a color image (RGB) to a grayscale image where each pixel represents the intensity of light. This simplifies the image data and can reduce computational complexity while focusing on the structural information of the image.
Regarding augmentation, we applied the following techniques:
  • Outputs per training example: three.
    - Purpose: Generates multiple variations of each training example to increase the dataset’s size and diversity.
    - How It Works: For each original image, three different augmented versions are created using the specified augmentation techniques. This helps the model generalize better by exposing it to a wider variety of image transformations.
  • Rotation: Between −10° and +10°.
    - Purpose: Introduces variability in the orientation of the images.
    - How It Works: Each image is randomly rotated by an angle between −10 degrees and +10 degrees. This helps the model become invariant to small rotations and improves its ability to recognize objects from different angles.
  • Hue: Between −15° and +15°.
    - Purpose: Adjusts the color tone of the image.
    - How It Works: The hue of the image is randomly adjusted within the specified range. This means that the colors in the image will shift slightly, helping the model learn to recognize objects regardless of slight changes in color.
  • Saturation: Between −25% and +25%.
    - Purpose: Varies the intensity of colors in the image.
    - How It Works: The saturation level of the image is randomly increased or decreased by up to 25%. This means that the colors can become more vibrant or more muted, aiding the model in recognizing objects with different color intensities.
  • Brightness: Between −10% and +10%.
    - Purpose: Adjusts the brightness of the image.
    - How It Works: The brightness of the image is randomly altered within the specified range. This means that the image can become slightly brighter or darker, helping the model handle varying lighting conditions.
  • Exposure: Between −15% and +15%.
    - Purpose: Simulates changes in exposure.
    - How It Works: The exposure of the image is randomly adjusted by up to ±15%. This affects the overall lightness or darkness of the image, helping the model learn to recognize features under different exposure levels.
These preprocessing and augmentation techniques work together to prepare the images for training in a way that enhances model performance. Preprocessing steps ensure that all images are standardized and have enhanced contrast, while augmentation techniques create variations of the images to make the model robust to changes in orientation, color, and lighting. This combination helps in building a more generalizable and resilient model.
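To make these steps concrete, the following Python sketch illustrates how the preprocessing (stretch-resize to 640 × 640, histogram equalization, grayscale conversion) and two of the augmentations (rotation and brightness) could be implemented with OpenCV and NumPy; the file name and helper functions are illustrative assumptions, not the exact pipeline used to build the dataset.

```python
import cv2
import numpy as np

def preprocess(image_path: str, size: int = 640) -> np.ndarray:
    """Stretch-resize to 640x640, convert to grayscale, and equalize the histogram."""
    image = cv2.imread(image_path)                    # BGR image as loaded from disk
    image = cv2.resize(image, (size, size))           # "stretch": aspect ratio is not preserved
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)    # single-channel intensity image
    return cv2.equalizeHist(gray)                     # histogram equalization for contrast

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random rotation in [-10, +10] degrees and brightness jitter in [-10%, +10%]."""
    h, w = image.shape[:2]
    angle = rng.uniform(-10.0, 10.0)
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(image, matrix, (w, h))
    factor = 1.0 + rng.uniform(-0.10, 0.10)
    return np.clip(rotated.astype(np.float32) * factor, 0, 255).astype(np.uint8)

# Example: generate three augmented variants per original image, as in our setup.
rng = np.random.default_rng(42)
original = preprocess("lung_ct_example.png")          # hypothetical file name
variants = [augment(original, rng) for _ in range(3)]
```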
After preparing the augmented images with their corresponding annotations (bounding boxes for each generic nodule, cancer, and adenocarcinoma), we need a deep learning model for object detection. In this paper, we resort to the YOLO model [53], introduced by J. Redmon et al. in 2016; YOLO is the first one-stage object detection model, designed to classify images and detect object locations simultaneously.
YOLO’s primary advantage over other models is its speed. It processes the entire image in a single pass by dividing it into an S × S grid, where each grid cell is responsible for detecting an object if the object’s center falls within the cell. This single-stage detection pipeline enables YOLO to predict bounding boxes, object classes, and probabilities simultaneously, making it much faster than multi-stage models.
Although YOLO tends to make more localization errors compared to other models, it is less likely to produce false positives in the background and remains one of the fastest and most reliable models for object detection. In this paper, we consider the YOLOv8s version using the PyTorch framework.
The YOLO network consists of three main components: the backbone, the neck, and the head. The backbone, a CNN, extracts image features at various granularities. The neck combines these features to prepare them for the head, which makes the final predictions of bounding boxes and object classes.
As explained, the YOLO model is primarily designed for object detection rather than segmentation tasks. However, understanding how YOLO works for object detection helps in grasping how adaptations can be made for segmentation tasks. Below, we explain the working of YOLO for object detection and how similar principles can be applied to segmentation, followed by an explanation of the YOLO architecture.
YOLO is designed to detect objects in an image and simultaneously predict bounding boxes and class probabilities. In the following, we explain how it works:
  • Grid Division: YOLO divides the input image into an S × S grid.
  • Bounding Box Prediction: Each grid cell predicts a fixed number of bounding boxes. For each bounding box, it predicts the following:
    - The coordinates (x, y) of the box center relative to the grid cell.
    - The width and height of the box relative to the entire image.
    - The confidence score indicating the likelihood of the box containing an object and the accuracy of the box’s location.
  • Class Prediction: Each grid cell also predicts the probabilities of each class for the objects it contains.
  • Non-Maximum Suppression (NMS): YOLO applies NMS to filter out overlapping boxes, keeping only the most confident predictions (a minimal sketch of IoU and NMS is given below).
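The following is a minimal NumPy sketch of the IoU computation and the greedy non-maximum suppression step just described; it illustrates the principle only and is not the exact implementation used inside YOLOv8.

```python
import numpy as np

def iou(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5) -> list[int]:
    """Greedy NMS: keep the most confident box, drop boxes overlapping it too much, repeat."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        if order.size == 1:
            break
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep
```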
While YOLO is not inherently designed for segmentation, it can be adapted for simple forms of segmentation by modifying the output to predict masks or contours of objects instead of just bounding boxes. This involves adding layers to predict a binary mask for each grid cell indicating the presence of an object.
The YOLO architecture consists of three main components: the backbone, the neck, and the head [54,55,56,57,58,59]:
  • Backbone: The backbone is a CNN used to extract features from the input image. Common backbones include versions of Darknet (e.g., Darknet-53) for YOLO. The backbone’s purpose is to create a feature map that highlights important features in the image. In this paper, the backbone was obtained via transfer learning using the ImageNet dataset.
  • Neck: The neck is a series of layers that further process the feature map from the backbone. This typically involves layers like convolutional layers, upsampling, and feature concatenation to ensure that the network captures multi-scale features.
  • Feature Pyramid Network (FPN): Often used to improve the model’s ability to detect objects at different scales by combining low-resolution, semantically strong features with high-resolution, semantically weak features.
  • Head: The head is responsible for making the final predictions. It outputs a tensor that contains the bounding box coordinates, object confidence scores, and class probabilities for each grid cell.
Below, we list the processing pipeline of the YOLO object detection model:
  • Image Input: An image is input into the model and resized to a standard dimension.
  • Feature Extraction (Backbone): The image is passed through the backbone CNN, which extracts feature maps.
  • Feature Processing (Neck): The feature maps are processed through the neck, enhancing multi-scale feature representation.
  • Prediction (Head): The processed feature maps are passed through the head, where the following are performed:
    • Bounding box coordinates are predicted.
    • Object confidence scores are generated.
    • Class probabilities are calculated.
  • Postprocessing: Non-maximum suppression (NMS) is applied to remove duplicate detections, retaining the most confident predictions.
To adapt YOLO for segmentation tasks, the following modifications can be made:
  • Mask Prediction: Instead of just predicting bounding boxes, additional layers can be added to predict segmentation masks. Each grid cell can output a binary mask that indicates the presence of the object.
  • Loss Function: The loss function can be modified to include segmentation mask loss, in addition to the bounding box and classification loss.
  • Architectural Adjustments: The network architecture can be adjusted to include upsampling layers and skip connections similar to those used in fully convolutional networks (FCNs) or U-Net, which are common in segmentation tasks.
In detail, YOLO computes loss using three main components: localization loss, confidence loss, and classification loss. Below is the mathematical formulation of the YOLO loss function.
Regarding the YOLO loss function, the total loss L in YOLO is a combination of three types of losses:
  • Localization loss (bounding box regression loss);
  • Confidence loss (objectness loss);
  • Classification loss.
The combined loss function can be expressed as
$$
\begin{aligned}
L ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \big(p_i(c) - \hat{p}_i(c)\big)^2
\end{aligned}
$$
where $S \times S$ is the grid size; $B$ is the number of bounding boxes per grid cell; $\mathbb{1}_{ij}^{\text{obj}}$ is an indicator function that equals 1 if the $j$-th bounding box in cell $i$ is responsible for the prediction; $\mathbb{1}_{ij}^{\text{noobj}}$ is an indicator function that equals 1 if no object is present in the $j$-th bounding box of cell $i$; $x_i, y_i$ are the coordinates of the center of the bounding box and $\hat{x}_i, \hat{y}_i$ are the corresponding ground truth coordinates; $w_i, h_i$ are the width and height of the bounding box and $\hat{w}_i, \hat{h}_i$ are the corresponding ground truth values; $C_i$ is the confidence score for the bounding box and $\hat{C}_i$ is the ground truth confidence score; $p_i(c)$ is the predicted probability of class $c$ and $\hat{p}_i(c)$ is the ground truth probability of class $c$; $\lambda_{\text{coord}}$ and $\lambda_{\text{noobj}}$ are the weights for the localization loss and for the confidence loss of cells without objects, respectively.
In the following, we explain the components of the loss function:
  • Localization Loss (Bounding Box Regression Loss): This term penalizes errors in the predicted bounding box coordinates and dimensions:
    $$ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right] $$
  • Confidence Loss (Objectness Loss): This term penalizes errors in the confidence score, which reflects whether an object is present in the predicted bounding box:
    $$ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 $$
  • Classification Loss: This term penalizes errors in the predicted class probabilities:
    $$ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \big(p_i(c) - \hat{p}_i(c)\big)^2 $$
The overall loss function balances these components to train a model that accurately predicts bounding box coordinates, confidence scores, and class probabilities.
In the following, we describe the introduced notation:
  • $\mathbb{1}_{ij}^{\text{obj}}$: indicator function for bounding boxes that contain objects.
  • $\mathbb{1}_{ij}^{\text{noobj}}$: indicator function for bounding boxes that do not contain objects.
  • $\lambda_{\text{coord}}$: weight for the localization loss, often set higher to emphasize precise localization.
  • $\lambda_{\text{noobj}}$: weight for the confidence loss of cells without objects, usually set lower to reduce the impact of false positives.
These components together ensure that the YOLO model learns to predict bounding boxes accurately, with high confidence and correct classification.
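To make the interplay of these terms concrete, the following PyTorch sketch computes the localization, confidence, and classification components for a set of grid-cell predictions; the tensor layout, function name, and weight values are illustrative assumptions rather than the exact YOLOv8 implementation (which also includes the distribution focal loss discussed in Section 4).

```python
import torch

def yolo_loss(pred_boxes, true_boxes, pred_conf, true_conf, pred_cls, true_cls,
              obj_mask, lambda_coord: float = 5.0, lambda_noobj: float = 0.5):
    """Sum-of-squares YOLO-style loss over a flattened set of grid-cell predictions.

    pred_boxes / true_boxes: (N, 4) tensors with (x, y, w, h) per box.
    pred_conf  / true_conf:  (N,) confidence scores.
    pred_cls   / true_cls:   (N, C) class probability vectors.
    obj_mask:                (N,) boolean, True where a box is responsible for an object.
    """
    noobj_mask = ~obj_mask

    # Localization loss: penalizes center and size errors only for responsible boxes.
    loc = ((pred_boxes[obj_mask] - true_boxes[obj_mask]) ** 2).sum()

    # Confidence (objectness) loss: separate weights for cells with and without objects.
    conf_obj = ((pred_conf[obj_mask] - true_conf[obj_mask]) ** 2).sum()
    conf_noobj = ((pred_conf[noobj_mask] - true_conf[noobj_mask]) ** 2).sum()

    # Classification loss: squared error over class probabilities for object cells.
    cls = ((pred_cls[obj_mask] - true_cls[obj_mask]) ** 2).sum()

    return lambda_coord * loc + conf_obj + lambda_noobj * conf_noobj + cls
```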
While YOLO is primarily an object detection model, its architecture and principles can be adapted for segmentation by incorporating mask prediction and appropriate architectural modifications. The backbone extracts features, the neck processes these features, and the head makes predictions, which can be extended to include segmentation masks for each detected object.
Below, we provide a more technical explanation related to the steps needed to adapt the YOLO model for segmentation tasks; in particular, the modifications belong to the following categories: architectural, training, and postprocessing enhancements.
Below, we describe the architectural modifications:
  • Mask Prediction Head
    - Addition: An additional head is introduced to predict the segmentation masks for each detected object. This head works alongside the existing heads responsible for bounding box regression and class prediction.
    - Structure: The mask head typically includes convolutional layers designed to output a binary mask for each detected instance. Each pixel in the mask indicates whether it belongs to the object or the background.
  • Feature Pyramid Network (FPN) Enhancements
    - Purpose: FPNs are crucial for handling objects at various scales by effectively combining features from different layers of the backbone network.
    - Implementation: YOLOv8 leverages enhanced FPNs to ensure that both the detection and segmentation tasks benefit from rich, multi-scale feature representations.
  • ROIAlign (Region of Interest Alignment)
    - Role: ROIAlign is used to extract fixed-size feature maps for each predicted bounding box. This technique ensures precise spatial alignment, which is critical for accurate mask predictions.
    - Functionality: Unlike ROI pooling, ROIAlign avoids quantization errors, using bilinear interpolation to compute the feature values, thereby preserving the spatial structure of the objects.
  • Multi-Task Learning
    - Framework: YOLOv8 is trained in a multi-task learning setup where the backbone and neck networks are shared between the detection and segmentation tasks.
    - Benefits: This shared learning helps the model to learn more robust and generalizable features, improving performance on both tasks simultaneously.
In the following, we describe the training modifications:
  • Segmentation Loss Function
    - Loss Addition: A segmentation loss, often a binary cross-entropy loss or a variant like Dice loss, is added to the overall loss function to train the mask head.
    - Purpose: This loss ensures that the predicted masks closely match the ground truth masks, optimizing the model for accurate instance segmentation.
  • Data Preprocessing and Augmentation
    - Preprocessing: Images are resized and normalized to ensure consistency and efficiency during training.
    - Augmentation: Techniques such as random cropping, flipping, rotation, and color jittering are applied to increase the diversity of the training data and improve model generalization. Specific augmentations like CutMix and MixUp might also be used to help the model learn more robustly from partial object views and blended images.
  • Anchor-Free Mechanism
    - Adaptation: For segmentation, the anchor-free mechanism allows the model to predict object locations without relying on predefined anchor boxes. This flexibility helps in dealing with varying object shapes and sizes more effectively.
In the following, we describe the last category, i.e., the postprocessing enhancements:
  • Non-Maximum Suppression (NMS) for Masks
    - Application: NMS is applied to the predicted masks to eliminate redundant and overlapping predictions, ensuring that only the most confident and non-overlapping masks are retained.
    - Thresholding: Specific thresholds are set for the Intersection over Union (IoU) to filter out masks with high overlap, improving the final segmentation results.
  • Mask Refinement
    - Techniques: Additional refinement steps, such as conditional random fields (CRFs) or simple morphological operations, may be used to smooth the predicted masks and correct any minor inaccuracies.
By incorporating these modifications, YOLOv8 is adapted for instance segmentation tasks. The model retains its efficiency and speed while achieving high accuracy in predicting both bounding boxes and detailed object masks, as shown in Section 4.
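As an illustration of how such an adapted model is used in practice, the snippet below loads a YOLOv8 segmentation variant through the Ultralytics Python API and runs it on a single CT slice; the weight file and image path are placeholders, and the mapping of class indices to labels depends on the dataset configuration.

```python
from ultralytics import YOLO

# Load a YOLOv8 small segmentation model (pretrained weights as a starting point).
model = YOLO("yolov8s-seg.pt")

# Run instance segmentation on a single CT slice (placeholder path).
results = model.predict("lung_ct_slice.png", imgsz=640, conf=0.25)

for result in results:
    if result.masks is None:
        continue                      # no instances detected in this image
    for box in result.boxes:          # bounding boxes, confidences, and class ids
        cls_id = int(box.cls)         # class index; label mapping depends on the dataset
        score = float(box.conf)
        print(model.names[cls_id], f"{score:.2f}", box.xyxy.tolist())
    masks = result.masks              # one binary mask per detected instance
```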

4. Experimental Analysis

In this section, we present the results of the experimental analysis aimed to demonstrate the effectiveness of the proposed method.
Regarding model training and testing, we exploited the Python programming language, in particular version 3.10 (https://www.python.org/, accessed on 19 June 2024), and PyTorch version 2.4 (https://pytorch.org/, accessed on 19 June 2024), a machine learning library based on the Torch library, used for applications such as computer vision and natural language processing. For model training, we utilized a machine equipped with an NVIDIA Tesla T4 GPU card featuring 16 GB of memory.
For our purposes, we considered the Lung Nodule Segmentation study Image Dataset [60], freely available for research purposes (https://universe.roboflow.com/varun-18tlk/lung-nodule-segmentation-study/dataset/3, accessed on 19 June 2024). The final dataset is composed of 3654 different lung CT images across three classes: adenocarcinoma, cancer, and nodules. For each lung CT, there is a related segmentation annotation; the annotation process was performed by expert radiologists. The dataset was built starting from 1650 original images, on which we applied the following preprocessing techniques:
  • Auto-Orient;
  • Resize: stretch to 640 × 640;
  • Auto-Adjust Contrast: using histogram equalization;
  • Grayscale: applied.
Moreover, the following augmentation techniques were used to enrich the considered dataset:
  • Rotation: between −10° and +10°;
  • Hue: between −15° and +15°;
  • Saturation: between −25% and +25%;
  • Brightness: between −10% and +10%;
  • Exposure: between −15% and +15%.
By applying these augmentation techniques for each image, we obtained three different images, so the final dataset was composed of 3654 images. We split the images in the following way: 3006 images for training (82%), 324 (9%) for validation, and the remaining 324 (9%) for testing.
Figure 2 shows the number of instances related to the analyzed dataset.
In detail, from Figure 2, it emerges that in the exploited dataset, the numbers of instances related to cancer and adenocarcinoma are quite similar, while the number of nodules is greater if compared with the remaining two categories.
Figure 3, Figure 4 and Figure 5 show three examples of lung CT belonging to the exploited dataset with the related segmentation annotation. In particular, in Figure 3, there is the segmentation annotation related to a nodule; in Figure 4, the segmentation annotation related to a cancer; and in Figure 5, the segmentation annotation related to an adenocarcinoma.
Regarding the model parameters, we used a batch size of 16, set the number of epochs to 50, and initialized the learning rate at 0.01. For model training, we utilized a machine equipped with an NVIDIA Tesla T4 GPU card with 16 GB of memory.
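For reference, a minimal sketch of the corresponding training call with the Ultralytics API is shown below; the dataset configuration file name is a placeholder, while the image size, batch size, number of epochs, and initial learning rate match the settings reported above.

```python
from ultralytics import YOLO

# Start from the pretrained YOLOv8s segmentation checkpoint.
model = YOLO("yolov8s-seg.pt")

# "lung_nodules.yaml" is a placeholder for the dataset configuration listing the
# train/validation/test image folders and the three classes
# (adenocarcinoma, cancer, nodule).
model.train(
    data="lung_nodules.yaml",
    imgsz=640,       # images resized to 640 x 640 in preprocessing
    epochs=50,       # number of training epochs
    batch=16,        # batch size
    lr0=0.01,        # initial learning rate
    device=0,        # single NVIDIA Tesla T4 GPU
)

# Evaluate on the validation split after training.
metrics = model.val()
```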
In Figure 6, we show the experimental analysis results.
In the following, we describe the metrics related to each subplot in Figure 6. In the first line of the plots in Figure 6, we have the train/box_loss (i.e., the box_loss trend during training: a loss that measures how “tight” the predicted bounding boxes are to the ground truth objects); the train/seg_loss, referring to the segmentation loss during the training phase (this loss measures the discrepancy between the predicted segmentation masks and the ground truth masks); the train/cls_loss trend (i.e., the cls_loss trend during training: the cls_loss measures the correctness of the classification of each predicted bounding box; each box may contain an object class or “background”; this loss is usually a cross-entropy loss); and the metrics/precision(B), metrics/recall(B), metrics/precision(M), and metrics/recall(M), which are, respectively, the precision and recall computed for the bounding box predictions (B) and for the segmentation mask predictions (M).
The distribution focal loss (DFL) is a loss function designed to improve the training of object detection models by refining the predictions of bounding box coordinates. It is used in particular in methods like Generalized Focal Loss (GFL) for dense object detection, and it aims to enhance localization precision by focusing on the distribution of the predicted bounding box coordinates.
Thus, to formulate the DFL from a mathematical point of view, let us define the following: $y$ is the ground truth value, $p$ is the predicted value, $\text{cls}(i)$ is the classification probability for the $i$-th bin, $\text{bin}(i)$ is the center of the $i$-th bin, $\delta$ is the predicted distribution (logits or probabilities), and $\sigma(\cdot)$ is the sigmoid function.
The distribution focal loss is defined as
$$ L_{\text{DFL}} = - \sum_{i=1}^{k} \left[ y_i \log\big(\sigma(\delta_i)\big) + (1 - y_i) \log\big(1 - \sigma(\delta_i)\big) \right] $$
where $k$ is the number of bins, $y_i$ is the ground truth distribution for the $i$-th bin, and $\delta_i$ is the predicted distribution (logits or probabilities) for the $i$-th bin.
In the following, we explain the DFL formula:
  - Ground Truth Value ($y$): This represents the true value of the bounding box coordinate. In the context of the DFL, this is usually expressed in terms of a probability distribution over a set of discrete bins.
  - Predicted Value ($p$): This represents the predicted probability distribution over the bins for a bounding box coordinate.
  - Classification Probability ($\text{cls}(i)$): This is the predicted probability for the $i$-th bin. It represents the likelihood that the bounding box coordinate falls within that bin.
  - Bin Center ($\text{bin}(i)$): The center value of the $i$-th bin. Bins are used to discretize the continuous range of possible bounding box coordinates.
  - Predicted Distribution ($\delta$): The predicted logits or probabilities for each bin. These values are usually converted to probabilities using the sigmoid function.
  - Sigmoid Function ($\sigma(\cdot)$): This function is used to convert logits into probabilities. It ensures that the predicted values are between 0 and 1, suitable for a probability distribution.
In practice, the distribution focal loss is combined with other losses, such as classification loss and confidence loss, to form the total loss function used to train the object detection model. The combined loss function can be expressed as
$$ L = \lambda_{\text{coord}} L_{\text{coord}} + L_{\text{conf}} + L_{\text{cls}} + L_{\text{DFL}} $$
where $L_{\text{coord}}$ is the localization (bounding box regression) loss, $L_{\text{conf}}$ is the confidence (objectness) loss, $L_{\text{cls}}$ is the classification loss, $L_{\text{DFL}}$ is the distribution focal loss, and $\lambda_{\text{coord}}$ is a weighting factor for the localization loss.
The distribution focal loss enhances the precision of bounding box predictions by focusing on the fine-grained distribution of the coordinates, leading to better localization performance in object detection models.
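The following PyTorch sketch implements the per-bin binary cross-entropy form of the DFL given above; the bin count and tensor shapes are illustrative, and the production YOLOv8 code may organize this computation differently.

```python
import torch
import torch.nn.functional as F

def distribution_focal_loss(logits: torch.Tensor, target_dist: torch.Tensor) -> torch.Tensor:
    """Per-bin binary cross-entropy, as in the formula above.

    logits:      (N, k) raw predictions delta_i for each of the k bins.
    target_dist: (N, k) ground truth distribution y_i over the same bins.
    """
    # sigmoid(logits) gives per-bin probabilities; BCE sums
    # -[y*log(p) + (1-y)*log(1-p)] over the bins.
    return F.binary_cross_entropy_with_logits(logits, target_dist, reduction="sum")

# Toy usage: two bounding-box coordinates discretized into k = 8 bins each (illustrative).
logits = torch.randn(2, 8)
target = torch.zeros(2, 8)
target[0, 3] = 1.0   # ground truth falls in bin 3 for the first coordinate
target[1, 5] = 1.0   # and in bin 5 for the second
loss = distribution_focal_loss(logits, target)
```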
In the second line of the plots in Figure 6, we have the val/box_loss (i.e., the trend of the box_loss on the validation set), the val/seg_loss (i.e., the trend of the seg_loss on the validation set), the val/cls_loss (aimed at quantifying the classification error of the predicted labels), and the val/dfl_loss (aimed at measuring the discrepancy between the predicted and actual feature locations in the convolutional feature maps; the DFL, or distribution focal loss, is designed to tackle class imbalance in object detection across different categories, while the cls_loss is computed using the cross-entropy loss function; each of these losses is computed independently and then summed to form the final loss).
Metrics/mAP50(B) is the mean average precision for bounding box predictions at an Intersection over Union threshold of 0.5 (mAP_0.5), metrics/mAP50(M) is the mean average precision for mask predictions at the same threshold, and metrics/mAP50-95(B) and metrics/mAP50-95(M) are the corresponding mean average precisions computed over Intersection over Union thresholds from 0.5 to 0.95 (mAP_0.5:0.95).
All the metrics show the anticipated trends: the precision, recall, mAP_0.5, and mAP_0.5:0.95 values increase with the number of epochs, indicating that the model is effectively learning to detect objects in lung CT images. Conversely, the loss metrics decrease as the number of epochs increases, further confirming the model’s correct learning process: the losses, which represent the model’s errors in recognizing specific objects, are generally high in the initial epochs but decrease as the model enhances its detection capabilities over time.
Below are more details about the precision, recall, mAP_0.5, and mAP_0.5:0.95 metrics.
Precision measures the proportion of positive predictions that are correct, accounting for false positives, which are cases incorrectly flagged for inclusion. It can be calculated as follows:
$$ \text{Precision} = \frac{TP}{TP + FP} $$
From the precision trend shown in Figure 6, we observe that precision increases over the epochs. This upward trend indicates that the network is effectively learning, over time, to distinguish the masses of interest in the lung CT images.
The second metric we used to evaluate the effectiveness of the proposed method is recall. Recall measures the proportion of actual positives that were correctly predicted, accounting for false negatives, which are cases that should have been flagged but were not. Recall is computed as
$$ \text{Recall} = \frac{TP}{TP + FN} $$
The recall trend shown in Figure 6 mirrors the precision trend: both metrics increase with the number of epochs. This indicates an improving model performance over time. As shown in the plots, both precision and recall range from 0 to 1, demonstrating promising performance. Like precision, recall also exhibits a growing trend with an increasing number of epochs.
Precision and recall are frequently employed to gauge a model’s performance in classification tasks. Yet, to evaluate the model’s ability to precisely locate the objects of interest within lung CT images, we employed metrics like average precision (AP). AP assesses the accuracy of object detectors, such as the YOLO model employed in our study, by averaging precision values across recall values spanning from 0 to 1.
We aimed to compute the mean average precision (mAP), which entails using the Intersection over Union (IoU), precision, recall, precision–recall curve, and AP. Object detection models forecast both the bounding boxes and categories of objects within an image, with the IoU serving to ascertain the accuracy of these predictions.
The IOU indicates the extent of overlap between predicted and ground truth bounding boxes. An exact match results in an IOU of 1.0, while no overlap results in an IOU of 0.0. The IOU formula is
$$ IOU = \frac{a_o}{a_u} $$
where $a_o$ is the area of overlap, and $a_u$ is the area of union.
In evaluating object detection models, the degree of overlap needed for successful recognition is defined by IOU thresholds. For example, mAP_0.5 measures accuracy at IOU = 50%, meaning detections with more than 50% overlap are considered successful. Higher IOU thresholds, like mAP_0.75, require more precise bounding boxes, making detection more challenging.
mAP is the average of AP values across all classes, and Figure 6 shows the mAP values for IOU = 50 (metrics/mAP_0.5) and for IOUs ranging from 50 to 95 (metrics/mAP_0.5:0.95), with a step size of 0.05.
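For completeness, the sketch below shows one common way of turning a list of per-class detections into an AP value at a fixed IoU threshold, using the usual all-point interpolation of the precision–recall curve; it is an illustrative reimplementation, not the exact evaluation code used by the framework.

```python
import numpy as np

def average_precision(confidences: np.ndarray, is_tp: np.ndarray, num_gt: int) -> float:
    """AP at a fixed IoU threshold from per-detection confidences and TP/FP flags.

    confidences: (N,) detection scores for one class.
    is_tp:       (N,) 1 if the detection matches a ground truth box at the chosen IoU, else 0.
    num_gt:      number of ground truth instances of that class.
    """
    order = confidences.argsort()[::-1]           # rank detections by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(1 - is_tp[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # Add sentinels, enforce a monotonically decreasing precision envelope, then integrate.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    return float(np.sum((mrec[1:] - mrec[:-1]) * mpre[1:]))

# mAP_0.5 would then be the mean of these AP values over the three classes.
```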
Both metrics/mAP_0.5 and metrics/mAP_0.5:0.95 in Figure 6 exhibit increasing trends, indicating that the model effectively learns to locate lung masses in the CT images, correctly identifying the regions of interest.
In the following, we provide a detailed analysis related to each subplot.
With regard to the training losses, we comment on the following metrics:
  • train/box_loss:
    - The box loss decreases steadily over the training epochs, indicating that the model is improving its ability to predict bounding box coordinates.
    - The smoothed curve (dotted orange) aligns well with the actual results (solid blue), suggesting consistent improvement.
  • train/seg_loss:
    - The segmentation loss also shows a steady decline, indicating the model’s improving performance in segmenting the lung regions.
    - Similar to the box loss, the smoothed curve aligns with the actual results, reinforcing the trend of consistent improvement.
  • train/cls_loss:
    - The classification loss starts high but decreases rapidly, showing the model’s quick adaptation in classifying lung regions correctly.
    - The smoothed curve again aligns well with the actual results, supporting the trend of improving classification accuracy.
  • train/dfl_loss:
    - The distribution focal loss (DFL) decreases over time, indicating better prediction of the bounding box distribution.
    - The alignment between the smoothed and actual curves suggests a stable training process.
With regard to the validation losses, we comment on the following metrics:
  • val/box_loss:
    - The box loss shows a downward trend with some fluctuations, which is typical of validation losses due to the varying difficulty of the validation set.
    - The overall decreasing trend indicates that the model generalizes well on the validation set.
  • val/seg_loss:
    - The segmentation loss for validation also follows a downward trend, but with notable fluctuations. This could indicate varying difficulty among the validation samples or mild overfitting.
  • val/cls_loss:
    - The classification loss shows a downward trend with fluctuations, indicating the model’s varying performance on different validation samples.
  • val/dfl_loss:
    - The distribution focal loss decreases overall, similar to the other losses, but with fluctuations. This is consistent with the nature of validation losses.
With regard to the precision, recall, mAP50, and mAP50-95, we comment on the following metrics:
  • metrics/precision(B) and metrics/recall(B):
    - The precision and recall for bounding box predictions (B) show an increasing trend, indicating better detection accuracy and recall over the epochs.
    - Fluctuations are present, which is typical, but the overall trend is positive.
  • metrics/precision(M) and metrics/recall(M):
    - The precision and recall for mask predictions (M) also show an increasing trend, indicating improving segmentation performance.
    - Fluctuations are present, but the overall improvement trend is clear.
  • metrics/mAP50(B) and metrics/mAP50(M):
    - The mean average precision (mAP) at an IoU threshold of 50% shows improvement for both bounding boxes (B) and masks (M).
    - The increase is consistent, showing the model’s capability to accurately predict both bounding boxes and masks.
  • metrics/mAP50-95(B) and metrics/mAP50-95(M):
    - The mAP across IoU thresholds from 50% to 95% for both bounding boxes (B) and masks (M) shows a steady improvement.
    - These metrics are more stringent and reflect the model’s robustness across various IoU thresholds.
To conclude this analysis: in the training phase, the model shows a consistent improvement in all loss metrics, indicating effective learning; in the validation phase, the model generalizes well, with a general downward trend in the validation losses and improving performance metrics. Fluctuations in the validation losses are typical and expected. The increasing trends in precision, recall, and mAP suggest that the model is becoming more accurate and robust in both bounding box and mask predictions.
Overall, the training process appears to be effective, leading to a model that generalizes well on unseen data.
Table 2 shows the values obtained for bounding box precision, recall, mAP_0.5, and mAP_0.5:0.95 metrics (detailed for the single classes, i.e., nodule, cancer, and adenocarcinoma, and for all the classes).
In Table 2, it is evident that the bounding box precision and recall are 0.757 and 0.738, respectively, for all classes indicated under the all label. Moreover, we obtained a precision equal to 0.884 and a recall equal to 0.747 for the nodule label, a precision of 0.572 and a recall of 0.783 for the cancer label, and a precision of 0.815 and a recall of 0.685 for the adenocarcinoma label.
Table 3 shows the values obtained for the mask precision, recall, mAP_0.5, and mAP_0.5:0.95 metrics (detailed for the single classes, i.e., nodule, cancer, and adenocarcinoma, and for all the classes).
In Table 3, it is evident that the mask precision and recall are 0.75 and 0.733, respectively, for all classes indicated under the all label. Moreover, we obtained a precision equal to 0.865 and a recall equal to 0.73 for the nodule label, a precision of 0.572 and a recall of 0.783 for the cancer label, and a precision of 0.815 and a recall of 0.685 for the adenocarcinoma label.
The testing of the lung CT nodule instance segmentation model required, on average, 3.4 ms for preprocessing, 12.3 ms for inference, and 3.9 ms for postprocessing per image.
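As an indication of how such per-stage timings can be obtained, the following sketch uses the Ultralytics YOLOv8 segmentation API; the checkpoint name and image path are placeholders, not the artifacts produced in this study:

```python
from ultralytics import YOLO  # pip install ultralytics

# A minimal timing sketch with a placeholder checkpoint and a placeholder image.
model = YOLO("yolov8n-seg.pt")            # any YOLOv8 *-seg checkpoint
results = model("lung_ct_slice.png")      # one CT slice exported as an image

r = results[0]
# Ultralytics reports per-stage latency in milliseconds for each image.
print(r.speed)                 # {'preprocess': ..., 'inference': ..., 'postprocess': ...}
print(len(r.boxes))            # number of detected masses
if r.masks is not None:
    print(r.masks.data.shape)  # one binary mask per detected instance
```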
Moreover, with the aim of better evaluating the effectiveness of the proposed method, in Figure 7 we report the precision and recall values on a precision–recall graph.
The trend in this plot is expected to be monotonically decreasing due to the trade-off between precision and recall: increasing one typically decreases the other. Although the precision–recall graph is not always monotonically decreasing because of certain exceptions or data limitations, the plot in Figure 7 shows a decreasing trend for the labels involved.
The precision–recall plot also displays the Area Under the Curve (AUC) values for the involved classes (nodule, cancer, and adenocarcinoma) and for the overall identification at mAP_0.5. As anticipated, the precision–recall trend is generally monotonically decreasing, as shown by the curve for all classes at mAP_0.5, which has an AUC of 0.708; this value is the average of the AUCs of all the considered classes. Specifically, the AUC for the adenocarcinoma class is 0.734, for the cancer class it is 0.588, and for the nodule class it is 0.802. Given that these metrics range from 0 to 1, these values indicate that the proposed model is particularly effective in detecting (generic) nodules and adenocarcinoma, while it is less effective in detecting cancer.
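As a simple illustration of how a per-class precision–recall curve and its area can be computed (with invented scores and labels, not the predictions of the proposed model), one can proceed as follows:

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

# Illustrative labels (1 = mass of this class present) and confidence scores.
y_true  = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.92, 0.80, 0.75, 0.66, 0.55, 0.48, 0.30, 0.20])

# Per-class precision-recall curve and its area, analogous to Figure 7.
precision, recall, _ = precision_recall_curve(y_true, y_score)
print(f"area under the PR curve = {auc(recall, precision):.3f}")
```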
In the following, we provide interpretations of the curves:
  • Adenocarcinoma (blue line):
    - The blue line represents the precision–recall relationship for the adenocarcinoma class.
    - With an average precision (AP) of 0.734, the model performs relatively well in detecting adenocarcinoma, maintaining high precision and recall values.
  • Cancer (orange line):
    - The orange line represents the performance for the cancer class.
    - An AP of 0.588 indicates moderate performance, with a noticeable drop in precision as recall increases, suggesting that the model struggles more with this class than with adenocarcinoma and nodules.
  • Nodule (green line):
    - The green line shows the precision–recall curve for nodules.
    - With the highest AP of 0.802, the model performs best on this class, indicating high precision and recall across most thresholds.
  • All classes (bold blue line):
    - The bold blue line represents the overall performance across all classes, with an mAP_0.5 of 0.708.
    - The mean average precision (mAP) at an IoU threshold of 0.5 provides a single metric summarizing the model’s performance across all classes.
    - An mAP_0.5 of 0.708 suggests that the model has a strong overall detection capability, balancing precision and recall well.
The model performs best on the nodule class, followed by adenocarcinoma and then cancer, while the overall mAP_0.5 score of 0.708 shows that the model is quite effective in detecting the various classes.
Figure 8 shows a confusion matrix. A confusion matrix is a useful representation for evaluating the performance of a classification model, including object detection models like the YOLO one we exploited in this study for the instance segmentation task. In a nutshell, a confusion matrix provides a summary of the prediction results, showing the number of correct and incorrect predictions made by the model, categorized by the actual classes and predicted classes.
In the confusion matrix, the rows represent the actual labels (ground truth), while the columns represent the predicted labels by the model; in detail, we have the following:
  • True Positives (TP): The number of times a class was correctly predicted.
  • False Positives (FP): The number of times a class was incorrectly predicted.
  • False Negatives (FN): The number of times the model failed to predict a class when it should have.
  • True Negatives (TN): The number of times the background was correctly identified as not having an object.
Moreover, in object detection tasks, the “background” label plays a crucial role because it indicates areas where no objects of interest are present. This is essential for models like YOLO, which need to differentiate between objects and non-objects; in particular, we have the following:
  • Background Row: This row contains the counts of predictions where the actual label is “background”. Ideally, the majority of these counts should fall in the “background” column, indicating that the model correctly identified non-object areas. Counts in other columns indicate false positives, where the model incorrectly identified the background as containing an object.
  • Background Column: This column contains the counts of cases for which the model predicted “background”. Ideally, the majority of these counts should come from the “background” row, indicating correct predictions of non-object areas. Counts in other rows indicate false negatives, where the model failed to detect actual objects and mistakenly classified them as background.
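To illustrate how such a matrix can be assembled and row-normalized, the following sketch uses counts that partly mirror Figure 8 but are otherwise hypothetical (in particular, the background row), and should not be read as the study's exact results:

```python
import numpy as np

# A minimal sketch of a detection confusion matrix with a "background" class.
classes = ["adenocarcinoma", "cancer", "nodule", "background"]

# Rows = actual class, columns = predicted class (illustrative counts).
cm = np.array([
    [31,  0,   0, 14],   # adenocarcinoma
    [ 0, 35,   9,  2],   # cancer
    [ 0, 23, 184, 34],   # nodule
    [ 5,  8,  20,  0],   # background: false positives land in the class columns
])

# Row normalization turns counts into per-class fractions in [0, 1].
row_sums = cm.sum(axis=1, keepdims=True)
cm_norm = cm / np.maximum(row_sums, 1)
for name, row in zip(classes, np.round(cm_norm, 2)):
    print(f"{name:>15}: {row}")
```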
In the confusion matrix in Figure 8, we can note that 31 adenocarcinomas were correctly assigned to the right class, while 14 adenocarcinomas were wrongly marked as background. Regarding the cancer class, 35 cancers were correctly labeled, 9 cancers were wrongly labeled as nodules, and 2 cancers were wrongly marked as background. With regard to the nodule label, 184 nodules were correctly recognized as nodules, 23 nodules were wrongly detected as cancer, and 34 nodules were wrongly labeled as background.
In Figure 9, we also show the normalized confusion matrix, where each value ranges from 0 to 1.
From the normalized confusion matrix, it is immediately evident that 69% of adenocarcinomas were correctly detected, and 76% of cancers and 76% of nodules were also correctly assigned to their respective categories.
In Figure 10, we show several examples of images from the exploited lung CT dataset, together with the segmentation annotations performed by expert radiologists.
In detail, Figure 10 contains images with different lung conditions, with instance segmentations related to nodules, cancer, and adenocarcinoma. The segmented areas are decidedly small compared to the entire image; these pathologies may therefore not always be immediately visible to the radiologist, given the size of the masses in question. For this reason, the proposed method is of particular interest for screening.
In Figure 11, we show the same set of images shown in Figure 10, with the segmentation and classification output produced by the trained model; in this figure, it is possible to directly compare the segmentation annotations with the predictions performed by the proposed model.
As shown by the lung CT images in Figure 11, in most cases the proposed method correctly predicts the segmentation annotation and the related label. For each prediction, Figure 11 also reports the confidence score assigned by the model to the label of the segmented area.
We note that the image related to adenocarcinoma was correctly predicted, with a confidence of 70%, while some masses labeled as cancer by the radiologists were predicted as nodules, and vice versa for some nodules labeled as cancer. From a segmentation point of view, however, the proposed method is able to correctly delineate even extremely small masses, as can be seen in Figure 11, often with very high classification confidence. We also note that in one image, two nodules were predicted, while in the same image the radiologists highlighted only one nodule; cases like these, where the model highlights more masses than the radiologists, are very interesting and deserve a more detailed in-depth study aimed at understanding whether additional masses are actually present in the lung CT. From this example, it can also be deduced that the model is able to highlight multiple masses, even of different types, in a single CT scan. Finally, we note that in two images, no masses were predicted by the proposed method, while in the same images the radiologists had annotated two nodules.
As shown in Figure 11, the images in row 1 show high-confidence predictions for nodules (0.8 to 0.9), and row 2 contains predicted nodules and cancers, similar to the ground truth (shown in Figure 10) but with slight variations in confidence scores. The images in row 3 include an adenocarcinoma prediction (0.7) that matches the ground truth, while the images in row 4 show high-confidence predictions for nodules and cancers, corresponding closely to the ground-truth labels (shown in Figure 10).
By comparing the ground truth images (shown in Figure 10) and the predictions with the related segmentation masks shown in Figure 11, we conclude that the predicted masks closely match the ground truth masks, indicating the model’s high accuracy in detecting nodules and cancers. Moreover, the confidence scores for predictions are generally close to those of the ground truth, demonstrating the reliability of the model. There are minor variations in the confidence scores between the ground truth and predictions. In some cases, the exact positioning of the bounding boxes might differ slightly, but the identified regions largely overlap. The predictions consistently identify the same types of abnormalities (nodules, cancers) as the ground truth. The model accurately identifies regions labeled as adenocarcinoma, indicating its effectiveness in distinguishing different types of abnormalities.
The proposed instance segmentation model demonstrates high performance in segmenting lung nodules and cancers, as evidenced by the close alignment between the ground truth and predicted labels. The minor differences in confidence scores and bounding box positions do not significantly impact the overall accuracy and reliability of the model predictions.

5. Conclusions and Future Work

Considering the importance of screening for the detection of abnormal masses in the lung, in this paper we propose a method aimed at automatically detecting and locating masses in lung CT images. In particular, the proposed method is able to segment (i.e., divide an image into segments, each of which represents a separate object or part of the image) the lung CT under analysis, with the detected masses classified as (generic) nodule, cancer, or adenocarcinoma. For model training, we resorted to the YOLO instance segmentation model, one of the most widespread models exploited for real-time segmentation. To evaluate the effectiveness of the proposed method, a dataset composed of 3654 lung CT images was considered, obtaining an average precision and recall equal to 0.757 and 0.738, respectively, in the classification task. Furthermore, we obtained an average mask precision and mask recall equal to 0.75 and 0.733, which indicates that the proposed method is able not only to classify masses as nodule, cancer, or adenocarcinoma but also to segment the corresponding areas, thus performing effective instance segmentation.
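For reproducibility, a minimal sketch of the training and evaluation pipeline based on the Ultralytics YOLOv8 segmentation API is reported below; the checkpoint, dataset configuration file, and hyperparameters are illustrative placeholders rather than the exact setup adopted in this study:

```python
from ultralytics import YOLO  # pip install ultralytics

# Illustrative fine-tuning of a YOLOv8 segmentation checkpoint on an annotated
# lung CT dataset described by a YOLO-format data file (placeholder name).
model = YOLO("yolov8n-seg.pt")
model.train(data="lung_nodules.yaml", epochs=50, imgsz=640, batch=16)

# Validation reports box/mask precision, recall, mAP50, and mAP50-95 per class.
metrics = model.val()
print(metrics.box.map50, metrics.seg.map50)
```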
While YOLOv8’s adaptation for instance segmentation brings several advantages, there are also some inherent limitations and challenges associated with this approach; for example, the addition of a mask prediction head and the need for detailed segmentation increase the computational and memory requirements compared to standard object detection tasks. This can limit model deployment on resource-constrained devices. Furthermore, segmentation tasks are inherently more complex and computationally intensive, leading to longer inference times compared to pure object detection. As a matter of fact, while YOLOv8 is designed for real-time applications, adding segmentation capabilities can introduce a trade-off between accuracy and speed. Ensuring both high segmentation accuracy and real-time performance is challenging. Segmenting very small objects remains challenging, as the feature maps may not have enough resolution to accurately delineate tiny details. Accurately segmenting overlapping objects can be problematic, as the model might struggle to separate closely packed instances within the same class. The predicted masks can sometimes be coarse and less detailed, especially for objects with intricate shapes.
To address these limitations, several strategies can be employed; for instance, it is possible to use model compression techniques like pruning and quantization to reduce the computational burden, enhance the training dataset with extensive data augmentation to improve generalization and performance, and employ advanced postprocessing techniques to refine segmentation results and improve mask quality.
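As a rough sketch of the first mitigation strategy (assuming the Ultralytics YOLOv8 API together with PyTorch's pruning utilities; the checkpoint name and pruning ratio are illustrative and untuned), model compression could be approached as follows:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from ultralytics import YOLO

# Load a segmentation checkpoint (placeholder name).
model = YOLO("yolov8n-seg.pt")

# (1) Unstructured L1 pruning of convolutional weights to reduce model size.
for module in model.model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.2)
        prune.remove(module, "weight")  # make the induced sparsity permanent

# (2) Half-precision (FP16) export for lighter deployment (may require a GPU).
model.export(format="onnx", half=True)
```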
As future work, we plan to consider additional models, for instance, other YOLO versions, such as versions 9 and 10 (we experimented with version 8 in this study). Moreover, we plan to take into account so-called explainability, i.e., a way to understand which areas of the segmented images most influenced the segmentation from the model’s point of view. As a matter of fact, newer versions of the YOLO model are likely to incorporate further innovations in neural network architecture and training methodologies, which will be essential for maintaining state-of-the-art performance in object detection and instance segmentation tasks. Furthermore, integrating explainability techniques into YOLO models will be crucial for enhancing transparency and trustworthiness, allowing researchers and practitioners to better understand the decision-making processes of these models. These efforts will provide a clearer road map for advancing this research, paving the way for more robust, interpretable, and efficient computer vision systems.

Author Contributions

Conceptualization, A.S. and F.M.; Methodology, F.M.; Software, F.M.; Validation, A.S.; Formal analysis, A.S. and L.B.; Investigation, A.S. and L.B.; Resources, L.B.; Writing—original draft, F.M.; Writing—review & editing, A.S., F.M. and L.B.; Supervision, F.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially supported by EU DUCA, EU CyberSecPro, SYNAPSE, PTR 22–24 P2.01 (Cybersecurity) and SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the EU—NextGenerationEU projects, by MUR—REASONING: foRmal mEthods for computAtional analySis for diagnOsis and progNosis in imagING—PRIN, e-DAI (Digital ecosystem for integrated analysis of heterogeneous health data related to high-impact diseases: innovative model of care and research), Health Operational Plan, FSC 2014–2020, PRIN-MUR-Ministry of Health, the National Plan for NRRP Complementary Investments D^3 4 Health: Digital Driven Diagnostics, prognostics and therapeutics for sustainable Health care, Progetto MolisCTe, Ministero delle Imprese e del Made in Italy, Italy, CUP: D33B22000060001 and FORESEEN: FORmal mEthodS for attack dEtEction in autonomous driviNg systems CUP N.P2022WYAEW.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset (i.e., images and annotation segmentations) is freely available for research purposes at the following URL: https://universe.roboflow.com/varun-18tlk/lung-nodule-segmentation-study/dataset/3, accessed on 19 June 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Blandin Knight, S.; Crosbie, P.A.; Balata, H.; Chudziak, J.; Hussell, T.; Dive, C. Progress and prospects of early detection in lung cancer. Open Biol. 2017, 7, 170070. [Google Scholar] [CrossRef] [PubMed]
  2. Siegel, R.L.; Miller, K.D.; Fuchs, H.E.; Jemal, A. Cancer statistics, 2022. CA Cancer J. Clin. 2022, 72, 524–541. [Google Scholar] [CrossRef] [PubMed]
  3. Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2021, 71, 209–249. [Google Scholar] [CrossRef] [PubMed]
  4. National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. N. Engl. J. Med. 2011, 365, 395–409. [Google Scholar] [CrossRef]
  5. Pinsky, P.F.; Church, T.R.; Izmirlian, G.; Kramer, B.S. The National Lung Screening Trial: Results stratified by demographics, smoking history, and lung cancer histology. Cancer 2013, 119, 3976–3983. [Google Scholar] [CrossRef]
  6. National Lung Screening Trial Research Team. Lung cancer incidence and mortality with extended follow-up in the National Lung Screening Trial. J. Thorac. Oncol. 2019, 14, 1732–1742. [Google Scholar] [CrossRef]
  7. Horeweg, N.; van Rosmalen, J.; Heuvelmans, M.A.; van der Aalst, C.M.; Vliegenthart, R.; Scholten, E.T.; ten Haaf, K.; Nackaerts, K.; Lammers, J.W.J.; Weenink, C.; et al. Lung cancer probability in patients with CT-detected pulmonary nodules: A prespecified analysis of data from the NELSON trial of low-dose CT screening. Lancet Oncol. 2014, 15, 1332–1341. [Google Scholar] [CrossRef]
  8. Becker, N.; Motsch, E.; Trotter, A.; Heussel, C.P.; Dienemann, H.; Schnabel, P.A.; Kauczor, H.U.; Maldonado, S.G.; Miller, A.B.; Kaaks, R.; et al. Lung cancer mortality reduction by LDCT screening—Results from the randomized German LUSI trial. Int. J. Cancer 2020, 146, 1503–1513. [Google Scholar] [CrossRef]
  9. de Koning, H.J.; van Der Aalst, C.M.; de Jong, P.A.; Scholten, E.T.; Nackaerts, K.; Heuvelmans, M.A.; Lammers, J.W.J.; Weenink, C.; Yousaf-Khan, U.; Horeweg, N.; et al. Reduced lung-cancer mortality with volume CT screening in a randomized trial. N. Engl. J. Med. 2020, 382, 503–513. [Google Scholar] [CrossRef]
  10. Doria-Rose, V.P.; White, M.C.; Klabunde, C.N.; Nadel, M.R.; Richards, T.B.; McNeel, T.S.; Rodriguez, J.L.; Marcus, P.M. Use of lung cancer screening tests in the United States: Results from the 2010 National Health Interview Survey. Cancer Epidemiol. Biomarkers Prev. 2012, 21, 1049–1059. [Google Scholar] [CrossRef]
  11. Jemal, A.; Fedewa, S.A. Lung cancer screening with low-dose computed tomography in the United States—2010 to 2015. JAMA Oncol. 2017, 3, 1278–1281. [Google Scholar] [CrossRef] [PubMed]
  12. Pham, D.; Bhandari, S.; Pinkston, C.; Oechsli, M.; Kloecker, G. Lung cancer screening registry reveals low-dose CT screening remains heavily underutilized. Clin. Lung Cancer 2020, 21, e206–e211. [Google Scholar] [CrossRef] [PubMed]
  13. Schmid-Bindert, G.; Vogel-Claussen, J.; Gütz, S.; Fink, J.; Hoffmann, H.; Eichhorn, M.E.; Herth, F.J. Incidental pulmonary nodules—What do we know in 2022. Respiration 2022, 101, 1024–1034. [Google Scholar] [CrossRef] [PubMed]
  14. Tufail, A.B.; Ma, Y.K.; Kaabar, M.K.; Martínez, F.; Junejo, A.; Ullah, I.; Khan, R. Deep learning in cancer diagnosis and prognosis prediction: A minireview on challenges, recent trends, and future directions. Comput. Math. Methods Med. 2021, 2021, 9025470. [Google Scholar] [CrossRef] [PubMed]
  15. Ahmad, S.; Ullah, T.; Ahmad, I.; Al-Sharabi, A.; Ullah, K.; Khan, R.A.; Rasheed, S.; Ullah, I.; Uddin, M.N.; Ali, M.S. A novel hybrid deep learning model for metastatic cancer detection. Comput. Intell. Neurosci. 2022, 2022, 8141530. [Google Scholar] [CrossRef] [PubMed]
  16. Peralta-Ochoa, A.M.; Chaca-Asmal, P.A.; Guerrero-Vásquez, L.F.; Ordoñez-Ordoñez, J.O.; Coronel-González, E.J. Smart healthcare applications over 5G networks: A systematic review. Appl. Sci. 2023, 13, 1469. [Google Scholar] [CrossRef]
  17. Muhammad, G.; Alshehri, F.; Karray, F.; El Saddik, A.; Alsulaiman, M.; Falk, T.H. A comprehensive survey on multimodal medical signals fusion for smart healthcare systems. Inf. Fusion 2021, 76, 355–375. [Google Scholar] [CrossRef]
  18. Huang, P.; Li, C.; He, P.; Xiao, H.; Ping, Y.; Feng, P.; Tian, S.; Chen, H.; Mercaldo, F.; Santone, A.; et al. MamlFormer: Priori-experience guiding transformer network via manifold adversarial multi-modal learning for laryngeal histopathological grading. Inf. Fusion 2024, 108, 102333. [Google Scholar] [CrossRef]
  19. Huang, P.; Xiao, H.; He, P.; Li, C.; Guo, X.; Tian, S.; Feng, P.; Chen, H.; Sun, Y.; Mercaldo, F.; et al. LA-ViT: A Network with Transformers Constrained by Learned-Parameter-Free Attention for Interpretable Grading in a New Laryngeal Histopathology Image Dataset. IEEE J. Biomed. Health Inform. 2024, 28, 3557–3570. [Google Scholar] [CrossRef]
  20. Mercaldo, F.; Martinelli, F.; Santone, A. A proposal to ensure social distancing with deep learning-based object detection. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Virtual, 18–22 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–5. [Google Scholar]
  21. Wang, L. Deep learning techniques to diagnose lung cancer. Cancers 2022, 14, 5569. [Google Scholar] [CrossRef]
  22. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  23. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2012; Volume 25. [Google Scholar]
  24. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  25. Liu, W.; Chen, W.; Wang, C.; Mao, Q.; Dai, X. Capsule embedded resnet for image classification. In Proceedings of the 2021 5th International Conference on Computer Science and Artificial Intelligence, Beijing, China, 4–6 December 2021; pp. 143–149. [Google Scholar]
  26. Guan, X.; Gao, W.; Peng, H.; Shu, N.; Gao, D.W. Image-based incipient fault classification of electrical substation equipment by transfer learning of deep convolutional neural network. IEEE Can. J. Electr. Comput. Eng. 2021, 45, 1–8. [Google Scholar] [CrossRef]
  27. Magge, A.; Weissenbacher, D.; Sarker, A.; Scotch, M.; Gonzalez-Hernandez, G. Bi-directional recurrent neural network models for geographic location extraction in biomedical literature. In BIOCOMPUTING 2019: Proceedings of the Pacific Symposium; World Scientific: Singapore, 2018; pp. 100–111. [Google Scholar]
  28. Liu, K. Stbi-yolo: A real-time object detection method for lung nodule recognition. IEEE Access 2022, 10, 75385–75394. [Google Scholar] [CrossRef]
  29. Wu, X.; Zhang, H.; Sun, J.; Wang, S.; Zhang, Y. YOLO-MSRF for lung nodule detection. Biomed. Signal Process. Control 2024, 94, 106318. [Google Scholar] [CrossRef]
  30. Mammeri, S.; Amroune, M.; Haouam, M.Y.; Bendib, I.; Corrêa Silva, A. Early detection and diagnosis of lung cancer using YOLO v7, and transfer learning. Multimed. Tools Appl. 2024, 83, 30965–30980. [Google Scholar] [CrossRef]
  31. Ji, Z.; Zhao, J.; Liu, J.; Zeng, X.; Zhang, H.; Zhang, X.; Ganchev, I. ELCT-YOLO: An efficient one-stage model for automatic lung tumor detection based on CT images. Mathematics 2023, 11, 2344. [Google Scholar] [CrossRef]
  32. Bilal, A.; Shafiq, M.; Fang, F.; Waqar, M.; Ullah, I.; Ghadi, Y.Y.; Long, H.; Zeng, R. IGWO-IVNet3: DL-based automatic diagnosis of lung nodules using an improved gray wolf optimization and InceptionNet-V3. Sensors 2022, 22, 9603. [Google Scholar] [CrossRef]
  33. Haq, I.; Mazhar, T.; Malik, M.A.; Kamal, M.M.; Ullah, I.; Kim, T.; Hamdi, M.; Hamam, H. Lung nodules localization and report analysis from computerized tomography (CT) scan using a novel machine learning approach. Appl. Sci. 2022, 12, 12614. [Google Scholar] [CrossRef]
  34. Garg, S.; Kaur, J. Improving segmentation by denoising brain MRI images through interpolation median filter in ADTVFCM. Int. J. Comput. Trends Technol. 2013, 4, 187–188. [Google Scholar]
  35. Lim, J.Y.; Lim, K.M.; Lee, C.P.; Tan, Y.X. SSL-ProtoNet: Self-supervised Learning Prototypical Networks for few-shot learning. Expert Syst. Appl. 2024, 238, 122173. [Google Scholar] [CrossRef]
  36. Gharoun, H.; Momenifar, F.; Chen, F.; Gandomi, A. Meta-learning approaches for few-shot learning: A survey of recent advances. Acm Comput. Surv. 2024, 56, 294. [Google Scholar] [CrossRef]
  37. Wang, S.; Zhou, M.; Gevaert, O.; Tang, Z.; Dong, D.; Liu, Z.; Jie, T. A multi-view deep convolutional neural networks for lung nodule segmentation. In Proceedings of the 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Jeju, Republic of Korea, 11–15 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1752–1755. [Google Scholar]
  38. Hamidian, S.; Sahiner, B.; Petrick, N.; Pezeshk, A. 3D convolutional neural network for automatic detection of lung nodules in chest CT. In Proceedings of the Medical Imaging 2017: Computer-Aided Diagnosis, Orlando, FL, USA, 13–16 February 2017; SPIE: St Bellingham, WA, USA, 2017; Volume 10134, pp. 54–59. [Google Scholar]
  39. Sun, X.; Lin, H.; Wang, S.; Zheng, L. Industrial robots sorting system based on improved faster RCNN. Comput. Syst. Appl 2019, 28, 258–263. [Google Scholar]
  40. Cao, H.; Liu, H.; Song, E.; Hung, C.C.; Ma, G.; Xu, X.; Jin, R.; Lu, J. Dual-branch residual network for lung nodule segmentation. Appl. Soft Comput. 2020, 86, 105934. [Google Scholar] [CrossRef]
  41. Banu, S.F.; Sarker, M.M.K.; Abdel-Nasser, M.; Puig, D.; Raswan, H.A. AWEU-Net: An attention-aware weight excitation U-Net for lung nodule segmentation. Appl. Sci. 2021, 11, 10132. [Google Scholar] [CrossRef]
  42. Dutta, K. Densely connected recurrent residual (dense r2unet) convolutional neural network for segmentation of lung ct images. arXiv 2021, arXiv:2102.00663. [Google Scholar]
  43. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  44. Christ, P.F.; Ettlinger, F.; Grün, F.; Elshaera, M.E.A.; Lipkova, J.; Schlecht, S.; Ahmaddy, F.; Tatavarty, S.; Bickel, M.; Bilic, P.; et al. Automatic liver and tumor segmentation of CT and MRI volumes using cascaded fully convolutional neural networks. arXiv 2017, arXiv:1702.05970. [Google Scholar]
  45. Jin, D.; Xu, Z.; Tang, Y.; Harrison, A.P.; Mollura, D.J. CT-realistic lung nodule simulation from 3D conditional generative adversarial networks for robust lung segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, 16–20 September 2018; Proceedings, Part II 11; Springer: Berlin/Heidelberg, Germany, 2018; pp. 732–740. [Google Scholar]
  46. Hofmanninger, J.; Prayer, F.; Pan, J.; Röhrich, S.; Prosch, H.; Langs, G. Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. Eur. Radiol. Exp. 2020, 4, 50. [Google Scholar] [CrossRef]
  47. Hooda, R.; Mittal, A.; Sofat, S. Segmentation of lung fields from chest radiographs-a radiomic feature-based approach. Biomed. Eng. Lett. 2019, 9, 109–117. [Google Scholar] [CrossRef]
  48. Gholamiankhah, F.; Mostafapour, S.; Goushbolagh, N.A.; Shojaerazavi, S.; Layegh, P.; Tabatabaei, S.M.; Arabi, H. Automated Lung Segmentation from Computed Tomography Images of Normal and COVID-19 Pneumonia Patients. Iran. J. Med. Sci. 2022, 47, 440. [Google Scholar]
  49. Zheng, C.; Deng, X.; Fu, Q.; Zhou, Q.; Feng, J.; Ma, H.; Liu, W.; Wang, X. Deep learning-based detection for COVID-19 from chest CT using weak label. MedRxiv 2020. [Google Scholar] [CrossRef]
  50. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  51. Yang, X.L.; Jin, Z.Y. Att GGO-Net: A Semantic Segmentation Method of Lung CT Images with Self and Cross Attention Mechanism. In Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2023; Volume 2504, p. 012017. [Google Scholar]
  52. Xiang, K.; Zhang, X.; She, J.; Liu, J.; Wang, H.; Deng, S.; Jiang, S. Toward robust diagnosis: A contour attention preserving adversarial defense for covid-19 detection. In Proceedings of the AAAI Conference on Artificial Intelligence; 2023; Volume 37, pp. 2928–2937. [Google Scholar]
  53. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  54. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
  55. Mercaldo, F.; Martinelli, F.; Santone, A.; Cesarelli, M. Blood Cells Counting and Localisation through Deep Learning Object Detection. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 4400–4409. [Google Scholar]
  56. Mercaldo, F.; Cesarelli, M.; Martinelli, F.; Santone, A. Deep learning for blood cells classification and localisation. In Proceedings of the Fifteenth International Conference on Machine Vision (ICMV 2022), Yerevan, Armenia, 15–18 November 2023; SPIE: Bellingham, WA, USA, 2023; Volume 12701, pp. 315–322. [Google Scholar]
  57. Zhou, Y. A YOLO-NL object detector for real-time detection. Expert Syst. Appl. 2024, 238, 122256. [Google Scholar] [CrossRef]
  58. Santone, A.; Cesarelli, M.; Colasuonno, E.; Bevilacqua, V.; Mercaldo, F. A Method for Ocular Disease Diagnosis through Visual Prediction Explainability. Electronics 2024, 13, 2706. [Google Scholar] [CrossRef]
  59. Zhuo, S.; Zhang, X.; Chen, Z.; Wei, W.; Wang, F.; Li, Q.; Guan, Y. DAMP-YOLO: A Lightweight Network Based on Deformable Features and Aggregation for Meter Reading Recognition. Appl. Sci. 2023, 13, 11493. [Google Scholar] [CrossRef]
  60. Varun. Lung Nodule Segmentation Study Dataset. 2024. Available online: https://universe.roboflow.com/varun-18tlk/lung-nodule-segmentation-study (accessed on 28 May 2024).
Figure 1. The proposed method.
Figure 2. The number of instances for each category, i.e., nodule, cancer, and adenocarcinoma: it appears that in the dataset used, the numbers of instances for cancer and adenocarcinoma are quite similar, whereas the number of nodules is significantly higher compared to the other two categories.
Figure 3. An example of a CT image belonging to the analyzed dataset, with the annotation related to a nodule.
Figure 4. An example of a CT image belonging to the analyzed dataset, with the annotation related to lung cancer.
Figure 5. An example of a CT image belonging to the analyzed dataset, with the annotation related to adenocarcinoma.
Figure 6. The results of the experimental analysis: The analyzed metrics exhibit the expected trends, i.e., precision, recall, mAP_0.5, and mAP_0.5:0.95 all increase with the number of epochs, indicating that the model is effectively improving its ability to detect objects in CT lung images.
Figure 7. Precision–recall graph.
Figure 8. Confusion matrix.
Figure 9. Normalized confusion matrix.
Figure 10. Example lung CT images from the analyzed dataset, with the instance segmentation annotations provided by expert radiologists.
Figure 11. The same lung CT images as in Figure 10, with the segmentation and classification output (and related confidence scores) produced by the trained model.
Table 1. Comparison of state-of-the-art segmentation methods using deep learning.

Method                      Dataset                     References
U-Net                       LUNA16                      [43]
FCN + Nodule Detection      LUNA16                      [44]
3D U-Net                    LUNA16                      [45]
Self-Supervised Learning    LIDC-IDRI                   [46]
CNN + CRFs                  JSRT                        [47]
Attention-Guided U-Net      COVID-19-CT-Seg             [48]
Dual-Branch Network         COVID-19-CT-Seg             [49]
Transformer-Based Network   LUNA16, COVID-19-CT-Seg     [50]
Cross-Attention Network     COVID-19-CT-Seg             [51]
GANs                        LUNA16, LIDC-IDRI           [52]
Table 2. Bounding box classification results.

Class            Images   Instances   Box (P)   Box (R)   Box (mAP50)   Box (mAP50-95)
all              324      332         0.757     0.738     0.71          0.488
adenocarcinoma   324      45          0.815     0.685     0.713         0.492
cancer           324      46          0.572     0.783     0.588         0.467
nodule           324      241         0.884     0.747     0.831         0.505
Table 3. Mask classification results.

Class            Images   Instances   Mask (P)   Mask (R)   Mask (mAP50)   Mask (mAP50-95)
all              324      332         0.75       0.733      0.708          0.438
adenocarcinoma   324      45          0.815      0.685      0.734          0.539
cancer           324      46          0.572      0.783      0.588          0.366
nodule           324      241         0.865      0.73       0.802          0.41
