1. Introduction
The world has seen a rise in lung-related diseases in recent years, which can be linked to various factors such as pollution, global warming, viruses, and allergies. COVID-19 is an infectious disease caused by the SARS-CoV-2 virus, which originated in Wuhan, China, and became a global pandemic [
1]. It has claimed more than 7 million lives to date [
2]. COVID-19 primarily affects the lungs, which can lead to pneumonia and require critical support. The main challenge during the COVID-19 pandemic was accurately diagnosing the disease. Previously, COVID-19 diagnosis relied primarily on the Polymerase Chain Reaction (PCR) test, which took at least a day to deliver results, putting lives at risk during the waiting period. Later, rapid testing kits were developed to diagnose the disease, but they were not as accurate as PCR tests. Another challenge in diagnosing COVID-19 is that radiologists often mistake it for bacterial pneumonia or other diseases due to the similarity of symptoms.
Chronic obstructive pulmonary disease (COPD) is another respiratory condition that is mainly caused by pollution and smoking. It is the third leading cause of death globally, with 3.23 million deaths reported in 2019 [
3]. On the other hand, asthma is a common respiratory condition that affected around 262 million individuals in 2019 and resulted in 455,000 deaths [
4]. Lung cancer is another major disease that is very hard to diagnose in its early stage [
5]. Symptoms of lung cancer are also very similar to other lung-related diseases, including cough, chest pain, shortness of breath, fatigue, etc. [
6]. For a cancer diagnosis, the patient has to undergo very expensive tests and often has to wait long periods for the reports, making it very difficult for the patient to manage their mental health and pay for treatment.
Bacterial pneumonia is a widespread illness that typically affects only one lobe of a lung. To diagnose this disease, a chest X-ray and other blood reports are often necessary. Not receiving proper and timely treatment can result in fatal consequences. Symptoms of bacterial pneumonia, such as fever, cough, and fatigue, are also very similar to those of other lung diseases. Moreover, pneumonia is also the leading cause of death from infection in children worldwide [
7].
Tuberculosis (TB) is a serious illness that primarily affects the lungs, but it can also target other organs [
8]. An estimated 3 million cases of TB remain undetected annually [
9]. The disease is spread through the droplets of cough or sneeze by an infected person [
10]. To diagnose TB, doctors often perform a TB skin test and/or TB blood tests, as well as a chest X-ray [
11]. The increasing threat of lung diseases is particularly alarming in developing and low- and middle-income countries, where millions of people are struggling with poverty and the harmful effects of air pollution. The World Health Organization (WHO) estimates that over 4 million premature deaths occur annually due to diseases caused by household air pollution, including asthma and pneumonia [
12]. Therefore, it is not only crucial to take immediate action to reduce air pollution and carbon emissions but also to establish efficient diagnostic systems that can play a pivotal role in the early detection and management of lung diseases.
The growing number of lung-related diseases is causing a significant increase in the risk of illness and mortality rates [
13]. The COVID-19 outbreak, which began in late 2019, has brought to light the importance of lung health. The disease primarily affects the lungs, causing severe damage and respiratory issues, including pneumonia. The pandemic has emphasized the critical need for early and accurate detection of lung diseases. A chest X-ray (CXR) is a medical test that uses a very small amount of ionizing radiation to create grayscale images of the inside of the chest. These images can only be interpreted by a radiologist or physician. The test is used to evaluate the lungs, heart, and chest wall and may help diagnose symptoms like shortness of breath, coughing, fever, chest pain, or injury. It can also aid in diagnosing and monitoring the treatment of various lung conditions like pneumonia, emphysema, and cancer. Due to its speed and ease, it is especially useful in emergency situations [
14]. As shown in
Figure 1, there are various chest X-rays available that indicate different lung diseases.
Artificial intelligence (AI) is a technology in computer science that enables computers to learn from provided data and assist in making decisions. Over the past decade, AI has made significant progress and is now integrated into many of our daily activities. It mimics human intelligence but with greater speed and accuracy. With the advancements in AI, training an AI model to achieve targets has become easier; the availability of relevant datasets is the main requirement. Machine learning is a subfield of artificial intelligence that mainly focuses on training models on structured data, such as text or tabular records. Given a dataset, and after preprocessing has been performed, we can train a model to predict unseen data and evaluate it.
Deep learning is a subfield of machine learning that is particularly suited to image data. The main algorithm used in deep learning for images is the Convolutional Neural Network (CNN), which is widely used for image classification. It has the ability to extract deep insights from images, which helps in decision-making [
20].
Figure 2 explains the process of training a model in deep learning.
Transfer learning is a method of deep learning that involves reusing a model developed for a task as a starting point for a model on a second task. With recent advancements in deep learning, many big organizations have made their pre-trained models available to the public to achieve highly accurate results with minimal effort. These pre-trained models are typically very complex and have been trained on a large number of labels [
21].
Deep learning, combined with CXR, a low-cost and easily accessible imaging technique, has enabled researchers to develop various deep learning models for predicting lung diseases. Many researchers have used pre-trained classifiers to train models on chest X-ray datasets and achieve high accuracy. This review aims to examine the existing work in this area and the datasets used to achieve these tasks. It is crucial to note that the focus of this research is not limited to a specific lung disease; instead, it aims to provide a broad solution for detecting and diagnosing various lung-related diseases. This technology offers the potential to significantly improve diagnostic accuracy, reduce associated costs, and meet the crucial need for precise and efficient diagnosis across a range of lung diseases, ultimately contributing to improved healthcare outcomes. This section explains two methods that are widely used in deep learning.
1.1. Convolutional Neural Networks
Convolutional neural networks (CNNs) are a type of supervised deep learning algorithm that is primarily used for image classification tasks. In the field of deep learning, CNNs are a highly popular and widely used algorithm, with numerous research papers validating their effectiveness [
22,
23,
24,
25,
26,
27]. The main advantage of CNNs over other machine-learning algorithms is that they can automatically identify relevant features without requiring human intervention [
28].
A typical CNN architecture consists of multiple layers. The first is a convolutional layer whose outputs pass through an activation function (usually ReLU) to introduce non-linearity into the network. It is followed by a pooling layer that downsamples the feature maps while retaining the most salient features, which are then passed to a fully connected layer that acts as a classification layer and outputs the final predicted class. Sometimes, a dropout layer is also used to reduce overfitting [
29].
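As an illustration of this layer ordering, the following is a minimal PyTorch sketch; the one-channel 224 × 224 input and the two-class output are illustrative assumptions, not a reference architecture from the reviewed papers.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal CNN: convolution + ReLU, pooling, dropout, fully connected head."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),                                   # non-linearity
            nn.MaxPool2d(2),                             # pooling: downsample feature maps
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                             # dropout to reduce overfitting
            nn.Linear(32 * 56 * 56, num_classes),        # fully connected classification layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = SimpleCNN()(torch.randn(1, 1, 224, 224))  # -> shape (1, 2)
```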
1.2. Transfer Learning (TL)
Transfer Learning is a powerful machine learning technique that involves training a model for a specific task and then applying it to another related task [
30]. This technique enhances learning on the current task by building on knowledge gained from related tasks performed at different times but within a similar source domain. By creating a connection between previous tasks and the target task, transfer learning enables faster and more effective learning, resulting in improved solutions. It is particularly useful when the supply of target training data is limited. Many publicly available pre-trained models are used by the researchers reviewed in this paper, including ResNet50, VGG-16, and MobileNet [
31].
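As a minimal sketch of this workflow, the code below loads an ImageNet pre-trained ResNet50 from torchvision (the weights API assumes torchvision 0.13 or later), freezes the backbone, and attaches a new classification head; the two-class head is an illustrative assumption.

```python
import torch.nn as nn
from torchvision import models

# Load a model pre-trained on ImageNet and freeze its backbone.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False  # keep pre-trained weights fixed

# Replace the final layer with a new head for the target task;
# only this layer is updated during fine-tuning.
model.fc = nn.Linear(model.fc.in_features, 2)
```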
The objective of this research is to utilize the capabilities of deep learning to enhance the diagnosis and detection of lung diseases without the constraints of targeting a single specific condition. This study aims to explore the current research in diagnosing lung diseases using artificial intelligence by examining trends, methods, and techniques used in diagnosis. It also seeks to identify the available datasets, their types, and their accessibility for research purposes. Additionally, this study will investigate the data augmentation techniques used in ongoing research, the types of results being generated, and their accessibility to physicians. The trustworthiness of these results, based on the application of fundamental artificial intelligence terminologies, will also be assessed. Furthermore, this research will explore whether hybrid approaches combining text and images are being utilized and examine the use of generative adversarial networks (GANs) in current lung disease diagnosis methods.
We present a systematic examination of deep learning approaches for interpreting chest radiographs (CXR) in the diagnosis of lung diseases. The main contributions are listed below:
Analyzes various methods and datasets aimed at enhancing diagnostic accuracy for conditions like lung cancer, COVID-19, and pneumonia.
Synthesizes current research to offer a cohesive overview of trends, techniques, and common challenges in AI for early disease detection.
2. Literature Review
As the world moves towards automating the day-to-day tasks that consume a lot of our time and effort, machine learning and deep learning have become the primary approaches for achieving automation. Previously, diagnosing lung diseases was a difficult and expensive task that required a lot of time and effort. However, with the research performed in this field, it has become much easier to achieve accurate diagnoses.
G.M.M. Alshmrani et al. presented a multiclass deep learning framework that predicts lung diseases. They collected data from various sources on different lung diseases and employed a pre-trained classifier, VGG19, with three convolutional layers to train the model. They targeted five disease classes, namely Pneumonia, Lung Cancer, Tuberculosis (TB), Lung Opacity, and COVID-19, versus normal. The achieved accuracy was 96.48%, with an F1-score of 95.62% [
32]. A hybrid deep learning approach called VDSNet was presented by S. Bharati et al. They did not target any specific disease; instead, their output is binary, labeled “findings” or “non-findings”. Multiple approaches were tried, but VDSNet produced the best results. The approach is built from three parts: spatial transformer layers, feature extraction layers, and a classification layer. The spatial transformer part consists of three further layers: the first normalizes the X-ray image features, the second performs batch normalization, and the third is the spatial transformer itself, which helps to identify the features crucial for lung disease classification. In the second part, a VGG16 model with several layers performs feature extraction; this involves convolutional and dense layers, with some dropout layers in between. In the third part, the features from the previous layers are assembled, with dropout layers applied before the final classification. On the full dataset, VDSNet achieved a validation accuracy of 73%, with a recall of 63% and a precision of 69% [
33].
M. K. Gourisaria et al. proposed a deep learning network, namely PneuNetV1, to predict Pneumothorax. Initially, they had a highly imbalanced dataset comprising 9378 images (77.6%) from the non-pneumothorax class and 2711 (22.4%) from the pneumothorax class. They used data augmentation techniques to balance the classes. Few details are provided about the architecture of their proposed CNN model. They compared it with other pre-trained classifiers, and PneuNetV1 outperformed all of them with an accuracy of 91% [
34]. In a similar approach to predicting pneumonia, Prusty et al. used a pre-trained classifier called ResNet50V2. As many lung diseases can progress to pneumonia, this research focused specifically on pneumonia; if pneumonia is detected, it can be an indication of other underlying issues in the lungs. The dataset for this study consisted of 1341 normal cases and 3875 cases of pneumonia. Applying transfer learning, they trained the ResNet50V2 model for 10 epochs with early stopping to avoid overfitting. The researchers achieved a validation accuracy of 99.69% [
35]. Ravi et al. proposed a stacking classifier approach that involved using three different models: EfficientNetB0, EfficientNetB1, and EfficientNetB2. These models were used to detect pediatric pneumonia, tuberculosis, and COVID-19. To achieve this, the features from all three models were concatenated and fed into multiple fully connected layers. Dropout was applied to avoid overfitting, and batch normalization layers were used. The approach resulted in an accuracy of 98% for detecting pediatric pneumonia, 99% for detecting tuberculosis, and 98% for detecting COVID-19 [
36].
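Early stopping, as used by Prusty et al. above, is typically configured as a training callback. A minimal Keras sketch follows; the framework choice and parameter values are assumptions, as the authors’ exact setup is not specified.

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # stop when validation loss stops improving
    patience=3,                 # epochs tolerated without improvement (assumed value)
    restore_best_weights=True,  # roll back to the best epoch's weights
)
# model.fit(train_ds, validation_data=val_ds, epochs=10, callbacks=[early_stop])
```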
In another study, F.J.M. Shamrat et al. applied a transfer learning approach to detect various lung diseases, namely Emphysema, Infiltration, Mass, Pleural Thickening, Pneumonia, Pneumothorax, Atelectasis, Edema, Effusion, Hernia, Cardiomegaly, Pulmonary Fibrosis, Nodule, and Consolidation, using the NIH ChestX-ray14 dataset. To preprocess the images, they used the CLAHE approach with a Gaussian filter to reduce noise and improve contrast. They balanced the classes using random under-sampling to reduce the majority classes to 5000 images and oversampling to enlarge the under-represented classes. They then fine-tuned several pre-trained models, including InceptionV3, AlexNet, DenseNet121, VGG19, and MobileNetV2. The MobileNetV2 model showed the best results, with a high precision of 96.71% and an accuracy of 99.78% [
37]. In a similar approach, Souid et al. utilized MobileNetV2 and the NIH ChestX-ray14 dataset to predict lung diseases. They augmented images with ImageDataGenerator and implemented a sparse labeling technique (a form of one-hot encoding) to generate a column for each disease: if the image showed the specific disease, the corresponding column was assigned a “1” and a “0” otherwise. Their model achieved a validation accuracy of 94% [
38]. Zhang et al. utilized the Pneumonia Classification Dataset, which comprises 5786 X-ray images, to train a VGG-based CNN model. This model has fewer trainable parameters and yields better results compared to VGG-16 and other pre-trained models. Additionally, it requires less training time. To enhance the quality of the images, they employed Dynamic Histogram Equalization (DHE) in preprocessing. They trained two models, one with the enhanced image quality and the other with the original images. The models were tested on an imbalanced testing set and achieved an overall accuracy of 96.07% and an F1-Score of 92.58% with the enhanced-quality images. They also achieved an accuracy of 95.38% and F1-Score of 91.87% with the original images [
39].
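The per-disease binary-column encoding described for Souid et al. above can be reproduced with scikit-learn’s MultiLabelBinarizer; a sketch follows, with illustrative disease labels rather than the full NIH label set.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each image may carry zero or more disease labels.
labels = [
    ["Pneumonia"],
    ["Effusion", "Cardiomegaly"],
    [],  # "No Finding"
]
mlb = MultiLabelBinarizer(classes=["Cardiomegaly", "Effusion", "Pneumonia"])
print(mlb.fit_transform(labels))
# [[0 0 1]
#  [1 1 0]
#  [0 0 0]]
```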
A. Sivasangari et al. used CNN to predict lung cancer from CXR images. They used data from the Japanese Society of Radiological Technology, which contains a total of 247 images. Of these, 154 images show evidence of lung infection (pneumonic stroma), while 93 images show no signs of infection. Preprocessing involves maintaining the images’ aspect ratio, subsampling them to a smaller size, and then applying symmetrical reduction. Four layers of oversampling are used to enhance image quality, with specific maps and designs applied in these steps. The images then pass through a 1 × 1 convolutional layer for further processing, followed by a classification layer using Softmax to categorize image components. The pre-processed images were then fed into the proposed Deep Convolutional Neural Network (DCNN), achieving a validation accuracy of 97.20% [
40]. D.M. Ibrahim et al. proposed a multi-class transfer learning framework to predict four classes: COVID-19, Pneumonia, Lung Cancer, and normal. They trained four pre-trained classifiers: VGG19+CNN, ResNet152V2, ResNet152V2+GRU, and ResNet152V2+Bi-GRU. In the first step, they augmented the original 33,676 images to 75,000 images. In the preprocessing stage, images were resized to 224 × 224 × 3 and augmented through rotation, flipping, and skewing. All images, originals and augmented versions alike, were normalized and converted into arrays for input into the subsequent model stage, and the dataset was split into training and validation sets. Each architecture features input, feature extraction, and classification layers. The VGG19+CNN architecture takes 224 × 224 × 3 images as input, extracts features with VGG19 and two CNN blocks, and flattens the output for classification using dense layers with Softmax activation. The ResNet152V2 model uses the pre-trained network for feature extraction, followed by a reshape layer, a flatten layer, a dense layer with 128 neurons, a dropout layer, and a final dense layer with Softmax activation for classification. The GRU (256 units) and Bi-GRU (512 units) variants both employ ResNet152V2 for feature extraction and classify images into the four classes using dense layers. VGG19+CNN achieved the best accuracy of 98.05%, while ResNet152V2+GRU achieved the minimal loss of 0.1350 [
41]. Md. Nahiduzzaman et al. proposed a deep learning framework for predicting 17 different classes using 11 datasets from various sources, including a new dataset called the CXR17 dataset. This new dataset was created by combining data from different sources. The team trained six models with different sets of data. In the first trial, they trained the model on all 17 classes. In the second trial, they trained it on 14 classes taken from the CXR14 dataset. In the third trial, the model was trained on COVID-19, viral and bacterial pneumonia, and normal classes. In the fourth trial, they trained it on the generalized pneumonia class with COVID-19 and normal classes. In the fifth trial, they targeted binary classification, COVID-19 vs. normal, while, in the sixth trial, they targeted another binary approach, Tuberculosis vs. normal [
42].
The proposed CNN-ELM model combines feature extraction through a CNN with the efficiency and performance of the Extreme Learning Machine (ELM) for classification tasks using standardized features from lung CXR images. In the first trial, which was a unique multiclass approach using all 17 classes, the model achieved an F1-Score of 91%. For the other trials, the model achieved greater AUC scores for the CXR14 dataset when compared to other research studies. They have also developed a mobile application to predict lung disease from CXR images [
42].
2.1. Challenges of Using ImageNet Pre-Trained Models in Biomedical Imaging
Recent studies, as shown in
Table 1, have highlighted significant limitations in using models pre-trained on natural image datasets like ImageNet for medical imaging tasks. The domain mismatch between natural and medical images stems from differences in modality and visual information. Natural images, such as those in ImageNet, display diverse textures, colors, and contours, which support general feature extraction. In contrast, medical images are typically more homogeneous and focus on fine-grained features relevant to pathology, such as texture variations or small abnormalities. As a result, models pre-trained on ImageNet often fail to capture the domain-specific morphological characteristics needed for accurate medical image analysis, especially in tasks requiring detailed segmentation [
43].
In response to these limitations, researchers are exploring medical-specific pretraining approaches. For instance, Wen et al. show that pretraining on relevant medical datasets like CheXpert or EyePACS can improve performance on classification tasks, where modality similarity between pretraining and target data is beneficial. However, for segmentation tasks, which demand high morphological awareness, medical-specific pretraining alone may be insufficient. Wen et al. further suggest that integrating diverse medical datasets (e.g., X-ray, MRI, and CT) in pretraining can enhance a model’s morphological feature learning, potentially leading to improved performance in both classification and segmentation tasks [
43].
2.2. Pre-Processing Techniques
2.2.1. Contrast Limited Adaptive Histogram Equalization (CLAHE)
Contrast Limited Adaptive Histogram Equalization (CLAHE) is a method designed to address the issue of low contrast in digital images, particularly medical images. When evaluated against Adaptive Histogram Equalization (AHE) and conventional Histogram Equalization (HE), CLAHE has shown enhanced efficacy in the medical imaging domain. It achieves this by moderating the contrast amplification seen with HE, which can unintentionally intensify noise, a common issue in medical images. In essence, contrast amplification can be visualized as the slope of the function linking the original image brightness levels to the intended output brightness levels. This contrast modulation can be realized either by limiting the slope or by capping the histogram at a particular level; thus, both slope regulation and histogram clipping play a pivotal role in steering contrast amplification. Users can adjust the contrast level by setting a clipping threshold, aligning the enhancement with their specific needs [
44].
The mathematical equation for the CLAHE (Contrast-Limited Adaptive Histogram Equalization) transfer function is given by the following:

$$T(i) = (L - 1) \sum_{j=0}^{i} p(j), \qquad p(j) = \frac{h(j)}{N}$$

The variables are defined as follows:
$T(i)$: the output intensity level after applying CLAHE for the input intensity level $i$. It represents the enhanced intensity of the pixel after histogram equalization.
$L$: the total number of possible intensity levels in the image, typically $L = 256$ for an 8-bit grayscale image (with intensity values ranging from 0 to 255).
$N$: the total number of pixels in the region of interest (ROI) or tile where CLAHE is being applied. CLAHE works locally by dividing the image into small tiles, and $N$ represents the number of pixels in a single tile.
$i$: the current intensity level of the pixel being processed in the input image. This variable ranges from 0 to $L - 1$.
$p(j)$: the normalized histogram of intensity level $j$ within the current tile. The histogram $h(j)$ is normalized after applying the contrast-limiting step in CLAHE, which caps the histogram values to prevent over-amplification of noise [44].
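As an illustration, OpenCV ships a CLAHE implementation in which clipLimit is the user-set clipping threshold and tileGridSize controls the local tiles; the file name and parameter values below are illustrative.

```python
import cv2

img = cv2.imread("chest_xray.png", cv2.IMREAD_GRAYSCALE)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # clip threshold, tile layout
enhanced = clahe.apply(img)
cv2.imwrite("chest_xray_clahe.png", enhanced)
```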
2.2.2. Gaussian Filter
The Gaussian filter is a linear filter used to remove noise from images by blurring them, similar to the average filter. However, it differs from the average filter in that its kernel is shaped like a bell curve (the Gaussian probability density function) [
45].
The equation for the 2-Dimensional Gaussian is given as follows:

$$G(x, y) = \frac{1}{2\pi\sigma^{2}} \, e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}$$

The variables are defined as follows:
$G(x, y)$: the value of the 2D Gaussian function at the point $(x, y)$. This represents the intensity or weight at this point in the Gaussian distribution.
$x$ and $y$: the coordinates in the 2D plane where the Gaussian function is being evaluated. They represent the horizontal and vertical distances from the center of the Gaussian distribution, respectively.
$\sigma$: the standard deviation of the Gaussian distribution. This parameter controls the “spread” or “width” of the Gaussian function. A larger $\sigma$ results in a broader, flatter distribution, while a smaller $\sigma$ results in a narrower, more peaked distribution.
$e$: Euler’s number, approximately equal to 2.718. It is the base of the natural logarithm and appears in the exponential part of the Gaussian function.
$\frac{1}{2\pi\sigma^{2}}$: a constant that normalizes the Gaussian function, ensuring that the total area under the 2D Gaussian surface equals 1.
Here the mean is $(0, 0)$ and $\sigma^{2}$ is the variance (default is 1). The Gaussian filter is effective in removing Gaussian noise, which is often referred to as white noise or amplifier noise. This type of noise can be caused by amplifier electronics, thermal vibrations of atoms, and radiation from warm objects. The Gaussian filter is particularly effective in removing this noise due to its bell-shaped probability density function [45].
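As a brief illustration, OpenCV’s GaussianBlur applies the kernel above; the kernel size below is illustrative, and sigmaX = 1 matches the default variance of 1 mentioned in the text.

```python
import cv2

img = cv2.imread("chest_xray.png", cv2.IMREAD_GRAYSCALE)
denoised = cv2.GaussianBlur(img, ksize=(5, 5), sigmaX=1.0)  # bell-shaped smoothing kernel
cv2.imwrite("chest_xray_denoised.png", denoised)
```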
2.2.3. Dynamic Histogram Equalization
Dynamic Histogram Equalization (DHE) is an innovative contrast enhancement technique that offers a refined approach compared to conventional histogram equalization (HE). Unlike traditional HE, which can sometimes lead to a loss of image details, DHE meticulously controls the enhancement process to ensure image details are preserved. The core principle of DHE is to partition the image histogram based on local minima, assigning specific gray level ranges to each partition. These partitions are then equalized separately. To further refine the process, DHE subjects these partitions to a repartitioning test, ensuring that no dominating histogram components overshadow the others. This method of segmentation and repartitioning ensures that the image’s contrast is enhanced without introducing severe side effects, such as a washed-out appearance or checkerboard effects. One of the standout features of DHE is its ability to control the extent of enhancement, allowing for a balanced and effective contrast improvement. The technique has been particularly highlighted for its efficacy in medical image processing, where clarity and detail preservation are paramount. Overall, DHE offers a dynamic approach to contrast enhancement, ensuring that images are not only clearer but also free from undesirable artifacts [
46].
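Common imaging libraries offer no single-call DHE routine, so the NumPy sketch below implements only the core idea described above, partitioning the histogram at local minima and equalizing each partition within its own gray-level range; it omits DHE’s repartitioning test and dynamic-range reallocation and is a simplified illustration rather than the full algorithm.

```python
import numpy as np

def dhe(img: np.ndarray, smooth: int = 5) -> np.ndarray:
    """Simplified DHE for an 8-bit grayscale image."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    # Smooth the histogram so spurious local minima are ignored.
    smoothed = np.convolve(hist, np.ones(smooth) / smooth, mode="same")
    # Local minima of the smoothed histogram become partition boundaries.
    cuts = [0] + [i for i in range(1, 255)
                  if smoothed[i] < smoothed[i - 1] and smoothed[i] < smoothed[i + 1]] + [255]
    lut = np.zeros(256)
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        seg = hist[lo:hi + 1].astype(float)
        if seg.sum() == 0:
            lut[lo:hi + 1] = np.arange(lo, hi + 1)  # identity map for empty partitions
            continue
        cdf = np.cumsum(seg) / seg.sum()
        lut[lo:hi + 1] = lo + cdf * (hi - lo)  # equalize within [lo, hi] only
    return lut[img].astype(np.uint8)
```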
2.3. Potential Strategies to Address Data Imbalance
Data imbalance refers to a situation where one class in a dataset is significantly underrepresented compared to others. This imbalance can create issues in training machine learning models, as they may become biased toward predicting the majority class, leading to poor performance on the minority class, which is often critical in domains like medical diagnosis or fraud detection. One effective strategy for addressing this issue is artificially increasing the volume of minority-class samples. Techniques such as flipping, rotation, scaling, and shearing are common approaches that modify existing data to create new, varied samples. This allows the model to be exposed to a broader range of the minority class, aiding in better generalization and balanced performance. For image processing, augmentations like horizontal or vertical flips, random rotations, and rescaling are used to create diverse examples from limited data, which helps the model learn more generalizable patterns [
47].
Data augmentation refers to increasing the dataset size when there is too little data to train a machine learning model to good accuracy. The process expands the minority-class representation, reducing the likelihood of bias towards the majority class and ultimately resulting in more accurate predictions across all classes [
47].
In the literature review section, we have seen researchers using different techniques to increase the number of images in their dataset to achieve good results.
2.3.1. Geometric Transformations
Geometric transformations are widely used techniques in image data augmentation to modify the structure of an image, enhancing the dataset’s variability and improving model performance. These transformations alter the spatial arrangement of image pixels without changing the actual pixel values, simulating real-world variations such as changes in viewpoint, scale, perspective, and non-rigid deformations [
48].
Common geometric transformations include flipping, rotation, shearing, masking, and scaling, which are discussed in detail below; a combined code sketch follows the Scaling subsection.
Flipping
Flipping is a widely used data augmentation technique in image processing. It involves mirroring the image along its vertical or horizontal axis. When an image is flipped horizontally (left to right), it is called a horizontal flip; when an image is flipped vertically (top to bottom), it is called a vertical flip, as depicted in
Figure 3. However, flipping an image vertically may not always preserve the label, especially in datasets that involve text recognition, such as the MNIST dataset [
49,
50].
Rotation
Rotation is another data augmentation technique in which images are rotated at different angles between 1° and 359°. The angles are not fixed and can be altered as required. Thus, from one image, several new images can be produced (four in the example depicted in
Figure 4), which enables the model to fit the training data better and enhances learning. However, some datasets are sensitive to such transformations, such as the MNIST dataset, where a model can easily confuse a rotated 6 with a 9 [
49].
Shearing
Shearing is a geometric transformation that shifts one edge of an image along the vertical or horizontal axis, resulting in a parallelogram shape. In
Figure 5, a vertical shear moves the edge along the vertical axis, whereas a horizontal shear moves it along the horizontal axis. The extent of the transformation is determined by the shear angle [
52].
Masking
Masking is an image augmentation technique that intentionally obscures or masks certain parts of an image during training, forcing the model to learn from and focus on the more salient, unmasked features. It augments the training data by transforming existing images to create additional variations for the model to learn from. When masking is applied, it generates multiple versions of the same image with different areas obscured, as depicted in
Figure 6 [
50].
Scaling
Scaling is another image augmentation technique that involves resizing images to different resolutions as shown in
Figure 7. This process can be performed by either increasing or decreasing the image size while preserving the aspect ratio, which helps to prevent distortion [
50].
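To illustrate how these transformations are applied in practice, here is a minimal torchvision sketch covering the five techniques above; the parameter values are illustrative, and vertical flips are deliberately omitted since they rarely preserve chest X-ray semantics.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # flipping
    transforms.RandomRotation(degrees=10),                # rotation
    transforms.RandomAffine(degrees=0, shear=10),         # shearing
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # scaling
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),                      # masking (operates on tensors)
])
# augmented = augment(pil_image)  # yields a new random variant on each call
```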
2.3.2. Synthetic Data Generation
Synthetic data generation is a crucial technique for addressing the challenges of data scarcity and privacy restrictions, particularly in fields like medical imaging. In this context, synthetic data can serve as a viable substitute for real data by simulating the distribution and characteristics of actual datasets. Traditional methods, such as data augmentation, which modifies existing data to produce slight variations, are widely used to increase dataset size; however, they have limitations in terms of diversity and scale. For the comprehensive expansion of datasets, generative models, especially Generative Adversarial Networks (GANs), have emerged as a powerful approach to generating entirely new yet realistic data samples [
53].
Generative Adversarial Networks (GANs)
Generative Adversarial Networks (GANs) consist of two competing neural networks, a generator and a discriminator, that operate in tandem. The generator’s task is to produce synthetic data that resembles real samples, while the discriminator aims to differentiate between real and synthetic data. Through this adversarial process, the generator iteratively improves its output until the synthetic data becomes virtually indistinguishable from real data.
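To make the adversarial setup concrete, below is a minimal PyTorch sketch of one GAN training step, assuming 64 × 64 grayscale images (e.g., downscaled CXR patches); the architectures and hyperparameters are illustrative and not taken from any reviewed paper.

```python
import torch
import torch.nn as nn

latent_dim = 100

G = nn.Sequential(  # generator: latent vector -> synthetic 64x64 image
    nn.ConvTranspose2d(latent_dim, 128, 4, 1, 0), nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.ConvTranspose2d(32, 1, 4, 4, 0), nn.Tanh(),
)
D = nn.Sequential(  # discriminator: image -> real/fake logit
    nn.Conv2d(1, 32, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 8, 1, 0), nn.Flatten(),
)

loss_fn = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real: torch.Tensor) -> None:
    b = real.size(0)
    fake = G(torch.randn(b, latent_dim, 1, 1))
    # Discriminator: label real images 1 and synthetic images 0.
    opt_d.zero_grad()
    d_loss = (loss_fn(D(real), torch.ones(b, 1)) +
              loss_fn(D(fake.detach()), torch.zeros(b, 1)))
    d_loss.backward()
    opt_d.step()
    # Generator: try to make the discriminator call its output real.
    opt_g.zero_grad()
    g_loss = loss_fn(D(fake), torch.ones(b, 1))
    g_loss.backward()
    opt_g.step()
```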
In medical imaging, GANs have demonstrated significant potential by generating high-quality synthetic data, such as MRI or CT scans, that can supplement limited datasets. This approach is especially beneficial in medical applications, where collecting large, diverse datasets is often constrained by ethical, privacy, and logistical barriers. GAN-generated data has been shown to aid in various tasks, including data augmentation, cross-modality image synthesis, and resolution enhancement, all of which contribute to the robustness and accuracy of machine-learning models used in medical diagnosis and analysis [
53].
2.3.3. Effectiveness of Data Augmentation Techniques
Wang and Perez’s study explores both geometric transformations and Generative Adversarial Networks (GANs) for their impact on model performance through data augmentation. Traditional geometric transformations, such as cropping, rotation, and flipping, have been shown to enhance model accuracy by introducing spatial variation. For instance, in the Dogs vs. Goldfish classification task, these transformations improved validation accuracy from 0.855 to 0.890, while, in the Dogs vs. Cats classification, accuracy rose from 0.705 to 0.775. These improvements underscore the effectiveness of spatial diversity in helping models to generalize across different orientations [
54].
In contrast, GANs can generate synthetic images that resemble real data, particularly for underrepresented classes. In the Dogs vs. Goldfish task, using GAN-generated images improved validation accuracy to 0.865, and, for Dogs vs. Cats, the accuracy rose from 0.705 to 0.720. Although the improvements with GANs were slightly less pronounced than with traditional augmentations, GANs are especially valuable in addressing class imbalance by creating realistic images for minority classes, thereby enhancing dataset diversity. These results, sourced from Wang and Perez’s study on data augmentation, highlight the complementary strengths of geometric and GAN-based augmentations in improving model robustness and accuracy across various classification tasks [
54].
3. Research Methodology
To conduct this systematic literature review [
15,
16,
17,
18,
55,
56], we followed the PRISMA 2020 statement guidelines [
57]. These guidelines helped us to identify the rationale for this study with greater clarity and to refine our research questions in detail: why this review is being performed, how it should be conducted, and what its findings are. They further helped us shortlist highly relevant papers in an easy and effective manner, keeping this review closely focused on deep learning for chest radiographs.
3.1. Research Questions
This section formulated research questions to guide the analysis of the selected papers, focusing on key aspects of diagnosing lung diseases using artificial intelligence. The research questions aim to explore the current trends, methods, and techniques employed in AI-based diagnosis, as well as the accessibility and types of datasets available for this purpose.
RQ1: What are the current trends, methods, and techniques in diagnosing lung diseases using artificial intelligence?
RQ2: What types of datasets and data augmentation techniques are available and accessible for AI-based lung disease diagnosis research?
RQ3: How accessible and trustworthy are the AI-generated results for physicians in diagnosing lung diseases?
RQ4: Are hybrid approaches, such as combining text and image analysis or Generative Adversarial Networks (GANs), being used in current research on lung disease diagnosis?
3.2. Information Source
To conduct this systematic literature review of deep learning advancements in chest radiograph interpretation, we selected papers only from reputable scientific databases: MDPI, Elsevier, IEEE, and Springer, or work presented at reputable conferences associated with these publishers.
To query Google Scholar, we used multiple keywords to find relevant papers: image classification, lung disease, MRI, chest X-ray, and lung cancer. Lung cancer was included as an additional keyword because, without it, the search did not retrieve relevant papers on that topic.
3.3. Exclusion and Inclusion Criteria
This review filtered the search results to include only papers published in or after 2020 in reputable journals (MDPI, Elsevier, Springer, and IEEE) or presented at reputable conferences. The main reason to shortlist papers from 2020 onward was the spread of COVID-19 at the end of December 2019. Papers were selected based on the relevance of the study, the currency of the paper’s version, and the publishing institution; we also required that the datasets used be publicly available. Additionally, this review included some relevant research papers gathered from journals, and we made sure to use only authentic research papers.
3.4. Data Collection Procedure
These research questions were designed to find the overall outcome of the available research, from data availability to the reliability of the findings by different researchers. To answer all our research questions,
Figure 8 illustrates the paper collection process, which started with querying Google Scholar using keywords such as image classification, lung disease, MRI, chest X-ray, and lung cancer, yielding 7546 papers from the literature. In the shortlisting process, we excluded a large number of papers that did not target the prediction of lung diseases using machine learning. Filtering out papers published before 2020, so that only recent research was included, reduced the number to 2684. Excluding papers without a relevant title further reduced the number to 355. Many papers had a relevant title but were excluded upon abstract review, bringing the number down to 200. The final shortlisting retained only papers from reputable journals or reputable conferences, leaving 11 highly relevant papers for our review.
3.5. Data Analysis
This section outlines the specific conferences or journals where each paper was published and the year of publication. Additionally, it highlights the number of dataset sources used in different research papers and their availability to the public. In
Table 2, the number of datasets used and their public availability are presented. The highest number of datasets was used by G.M.M. Alshmrani et al. [
32] and Md. Nahiduzzaman et al. [
42], with 13 and 11 datasets, respectively.
Datasets
In this section, we discuss the different datasets used in the existing literature. According to the literature review, many CXR datasets are available targeting different diseases; however, many of them suffer from a scarcity of images in some classes.
Table 3 gives a brief overview of the existing datasets used in the literature. It has been noticed that some of the dataset links referred to in research papers are broken: many authors have used datasets from various sources, but some of the given links are no longer functional. In
Table 3, we have listed only those links that are currently working. Additionally, we verified the updated links available on the internet and cross-checked them against the details and number of images provided by the authors whose links did not work.
From
Table 3, it is visible that the most used dataset is NIH Chest X-Rays, also known as the CXR-14 dataset, which contains 14 classes. Md. Nahiduzzaman et al. have worked on a wide range of diseases and applied various approaches, as shown in
Figure 9 and
Figure 10 [
42]. They used the CXR-14 dataset in combination with other sources of data to create a new dataset of 17 classes, which they named the CXR-17 dataset. S. Bharati et al., Shamrat et al., and Souid et al. have also used the CXR-14 dataset, but S. Bharati et al. converted the dataset into finding and non-finding classes. The CXR-14 dataset contains the highest number of images, 112,120, of all the datasets [
33,
37,
38,
62]. It has been observed that very few datasets are available for Lung Cancer and Tuberculosis, and the ones that are available contain a limited number of images. A. Sivasangari et al. and D.M. Ibrahim et al. both used the Japanese Society of Radiological Technology (JSRT) dataset, which contains only 247 images, for their work on Lung Cancer [
40,
41,
66]. On the other hand, G.M.M. Alshmrani et al. have not clearly specified the source of data for Lung Cancer and Lung Opacity [
32]. In comparison to other classes, COVID-19 and Pneumonia have a good number of datasets available, as per the literature.
4. Experimental Results of Deep Learning Approaches
In the results section, we first discuss the assessment of research performance using the various metrics employed by different authors; results were analyzed in Jupyter notebooks using the Python programming language. The metric equations are presented below:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
In these statistical metrics, TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively. These values are derived from a confusion matrix. A confusion matrix (also known as an error matrix) is a table used to describe the performance of a classifier, providing the number of true positives, true negatives, false positives, and false negatives [
37].
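As a brief illustration, these metrics can be derived from model predictions with scikit-learn; the labels below are illustrative (1 = disease, 0 = normal).

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Accuracy:", accuracy_score(y_true, y_pred))
# Per-class precision, recall, and F1-Score:
print(classification_report(y_true, y_pred, target_names=["normal", "disease"]))
```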
Based on the literature we have reviewed, we found that some authors focused on a single disease while others worked on multiple diseases. For Lung Cancer, three groups, G.M.M. Alshmrani et al., A. Sivasangari et al., and D.M. Ibrahim et al., have conducted studies [
32,
40,
41]. A. Sivasangari et al. worked solely on lung cancer and achieved a very high accuracy rate of 97.20%, as shown in
Figure 11 [
40]. On the other hand, G.M.M. Alshmrani et al. and D.M. Ibrahim et al. have worked on lung cancer in combination with other diseases and achieved an accuracy rate of 79.78% and 93.38%, respectively, for the Lung Cancer class, based on the confusion matrix provided [
32,
41]. It is important to note that A. Sivasangari et al. proposed a model that can differentiate between benign lesions, adenocarcinoma, and squamous cell carcinoma, while G.M.M. Alshmrani et al. and D.M. Ibrahim et al. proposed networks that can differentiate between lung cancer and other complex diseases [
32,
40,
41].
Several research studies have been conducted on COVID-19 with approaches that resulted in high accuracy. Research has been conducted by Ravi et al. and Md. Nahiduzzaman et al. to predict COVID-19 vs normal class [
36,
42]. In addition to working on other classes of diseases, they both specifically targeted COVID-19 in their research. Both achieved very high accuracy rates, with Md. Nahiduzzaman et al. achieving 99.37% and Ravi et al. achieving 98% [
36,
42]. It is important to note that they used different datasets to achieve these accuracies. Md. Nahiduzzaman et al. also worked on multi-class prediction of COVID-19, viral pneumonia, and bacterial pneumonia, achieving 95.33% accuracy [
42]. In another approach, Md. Nahiduzzaman et al. worked on COVID-19 vs. pneumonia, achieving 99.30% accuracy [
42].
Several research studies have been conducted on the CXR-14 dataset, which contains 14 different classes [
62]. Shamrat et al. achieved an overall accuracy of 91.60%, while Md. Nahiduzzaman et al. achieved an accuracy of 89.10% [
37,
42]. Souid et al. achieved 94% accuracy on the same dataset [
38]. Additionally, Md. Nahiduzzaman et al. combined the COVID-19 dataset, the Tuberculosis dataset, and the viral and bacterial pneumonia dataset with CXR-14 to create a new dataset with 17 classes. They then trained their proposed model on this new dataset and achieved an accuracy of 90.92% [
42,
62].
S. Bharati et al. developed a generalized approach to detect abnormalities versus no abnormalities and achieved an overall accuracy of 73% [
33]. D.M. Ibrahim et al. achieved an accuracy of 98.05% by targeting three classes: Lung Cancer, Pneumonia, and COVID-19 [
41]. Ravi et al. achieved an accuracy of 98% for detecting Pediatric Pneumonia [
36]. Two researchers, Prusty et al. and Zhang et al., achieved 99.69% and 96.07% accuracy, respectively, in detecting pneumonia versus normal [
35,
39]. G.M.M. Alshmrani et al. achieved 96.48% accuracy in detecting Lung Opacity along with other diseases [
32]. To target Pneumothorax, M.K. Gourisaria et al. achieved 91.23% accuracy [
34]. For the Tuberculosis class, Md. Nahiduzzaman et al. and Ravi et al. achieved very high accuracies of 99.82% and 99%, respectively, by detecting tuberculosis versus normal [
36,
42]. All these numbers are given in
Figure 11.
However, accuracy alone cannot be used to verify a model’s credibility. This is because most of this research was tested on imbalanced validation sets. So, other metrics like precision, recall, and, most importantly, F1-Score should also be considered.
Table 4 gives a detailed overview of all the metrics of these models.
Limitations in Existing Methods
This review highlights the potential of numerous research papers while also identifying important limitations and considerations for future work. A balanced dataset is crucial for effective model training, but datasets are often scarce or imbalanced. Fortunately, various techniques are available for augmenting data to balance classes, though some researchers still face challenges even after applying them. While training on slightly imbalanced data may still yield results, it is critical to ensure that testing sets are balanced. Some of the research we reviewed relied on imbalanced data, which raises concerns about the accuracy of its conclusions; for example, the testing results of G.M.M. Alshmrani et al. were based on imbalanced data [
32]. M. K. Gourisaria et al. used data augmentation techniques, but, even after augmentation, a slight class imbalance still existed in the validation set. Furthermore, they did not provide sufficient detail about the architecture of their proposed CNN model [
34]. Prusty et al. trained and tested on an imbalanced dataset without any data augmentation [
35]. To ensure transparency and accuracy, it is important for authors to clearly state the number of training and testing images used per class and to specify which dataset was utilized. Finally, in evaluating deep learning models, a confusion matrix and a classification report are essential and should be included in research papers.
In our literature review, we did not see a single research paper that used Generative Adversarial Networks (GANs) to augment datasets with class imbalance. GANs have been used elsewhere for data augmentation, generating new training images for classification, refining synthetic images, and improving brain segmentation [
73]. In studies where data scarcity is a problem, such as in the case of lung cancer research, Generative Adversarial Networks (GANs) can be a valuable tool.
5. Discussion
This review shows that plenty of research has been performed in the field of prompt lung disease diagnostics, and most researchers have obtained near-perfect results in predicting lung disease. However, a model featured solely in a research paper holds little value for society if it cannot aid radiologists and physicians in real-life clinical settings. Doctors should be included in the implementation of deep learning models for real-world applications to save time and improve patient outcomes. The quality of images is a crucial factor in the effectiveness of a deep learning model. Models trained on high-quality image datasets tend to produce good results; however, in scenarios where new X-ray images of excellent quality are not available, such models may struggle. Therefore, it is essential to train these models on a mix of good- and poor-quality images, which helps them learn and perform well even in adverse conditions. Predicting lung diseases can be a complex process, and doctors typically do not rely solely on chest X-ray images to make their decisions; instead, they conduct various tests and assessments before making a diagnosis. Therefore, future models should incorporate not only CXR images but also test reports, combining the two to create a hybrid approach that doctors cannot ignore. This approach will improve the accuracy of diagnoses and provide better treatment options for patients. The following questions were formulated based on the literature review and answered below.
RQ1. Current Trends, Methods, and Techniques in Diagnosing Lung Diseases Using AI: Recent studies emphasize using deep learning, particularly Convolutional Neural Networks (CNNs), for chest radiograph (CXR) interpretation. Techniques like transfer learning with pre-trained models such as ResNet, VGG, and MobileNet are common, enabling high diagnostic accuracy. Researchers achieved impressive results across various diseases, including lung cancer, pneumonia, tuberculosis, and COVID-19, with accuracy levels often exceeding 90% for specific tasks.
RQ2. Available Datasets and Data Augmentation Techniques: The most frequently used datasets include the NIH CXR14, RSNA, and Shenzhen TB datasets. These are primarily used for training models on common lung diseases. Data augmentation techniques like flipping, rotation, and shearing are applied to overcome dataset limitations, balance classes, and improve model generalizability.
RQ3. Accessibility and Trustworthiness of AI-Generated Results for Physicians: Studies show that deep learning models achieve high accuracy, often above 95%, which is promising for clinical settings. However, challenges remain in applying these models to real-world scenarios due to limited validation in diverse clinical environments. Transparent reporting with confusion matrices and evaluation metrics like F1 scores and precision is recommended to enhance trust.
RQ4. Use of Hybrid Approaches (Text–Image, GANs) in Current Research: While hybrid approaches are suggested to improve diagnostic reliability, they are not yet widely applied. Text–image analysis and Generative Adversarial Networks (GANs) are mentioned as potential avenues, especially for data augmentation and improving model accuracy. GANs could help address class imbalances, particularly in datasets with fewer images of rare diseases like lung cancer.
6. Implications of This Study
This paper provides a comprehensive resource for researchers and practitioners working on AI-driven diagnostics, particularly within the healthcare sector. By evaluating the strengths, limitations, and real-world applicability of various AI models, this work serves as a reference point for developing future diagnostic tools. Importantly, it emphasizes the need for these AI models to be effectively integrated into clinical workflows. This means creating models that not only perform well in research settings but also align with the practical requirements and routines of healthcare professionals, ensuring they are user-friendly and impactful in real-world diagnostic environments.
Another implication is that hybrid AI systems, which integrate multiple types of diagnostic information, such as imaging data, lab results, and patient histories, could offer more accurate and reliable diagnostic outcomes. By combining different sources of data, hybrid models can capture a more complete picture of a patient’s health, potentially leading to more nuanced insights and improved diagnostic accuracy. This approach leverages the strengths of each data type, which could help mitigate the limitations of using any single data source. Hybrid models also align well with the complexity of real-world diagnostic scenarios, where decisions are rarely based on one type of data alone.
This study also underscores the need for advanced data techniques to handle the common problem of data imbalance, where certain conditions or outcomes may be underrepresented in the dataset. Data augmentation (modifying and replicating existing data to create new samples) and generative models (like Generative Adversarial Networks) are highlighted as essential tools to address these imbalances. By creating more balanced datasets, these techniques enable models to better learn patterns across all conditions, reducing biases toward majority classes. This focus on dataset balancing is crucial for developing robust AI models that perform well across diverse patient populations and clinical conditions, making it a key area for future research and technological advancement.
7. Conclusions
This review highlights significant advancements in deep learning models for interpreting chest radiographs, particularly for diagnosing a range of lung diseases. The findings underscore the effectiveness of artificial intelligence (AI) and deep learning, especially convolutional neural networks (CNNs), in classifying lung conditions with remarkable accuracy. Leveraging pre-trained models and transfer learning approaches has contributed to high diagnostic performance, as observed in various studies, achieving over 90% accuracy in detecting diseases such as lung cancer, pneumonia, tuberculosis, and COVID-19.
This study also addresses the challenges of data imbalance and scarcity, which are especially pronounced in medical imaging due to the limited availability of diverse datasets. Although data augmentation techniques like flipping, rotation, and shearing have been commonly used to mitigate this issue, future research should explore synthetic data generation methods, particularly Generative Adversarial Networks (GANs), to further address data scarcity in classes with fewer samples. GANs offer the potential to create realistic synthetic images that can balance datasets and enhance the model’s ability to generalize across different conditions.
Despite these advancements, integrating these AI-driven models into clinical practice remains a challenge. High model accuracy alone is not sufficient; real-world application requires validation in diverse clinical settings and alignment with the workflow of medical professionals. Hybrid approaches, combining chest X-ray data with other diagnostic reports, could further improve the reliability and usability of these models in clinical settings, enhancing diagnostic precision. Moreover, this study analyzed a limited number of studies from specific regions; future research will aim to analyze additional datasets from reputable journals and conferences published globally.