The data for this study came from two medical centers and were divided into a training group, an internal validation group, and an external validation group. The specific experimental workflow is shown in
Figure 1. Densenet201 was used to train an automatic benign/malignant discrimination model, and the external validation group was used to verify the model's performance. At the same time, we compared it with other commonly used deep learning models. We invited a chief physician with over 30 years of experience in otolaryngology diagnosis and treatment (Clinician A) and an attending physician with 10 years of experience in otolaryngology diagnosis and treatment (Clinician B) to evaluate the malignancy risk of the lesions in the laryngoscopic images of the external validation set, assigning each a risk value from 0.00 to 1.00. We then drew ROC curves, calculated AUC values, and conducted DeLong tests against the external validation results of our Densenet201 model.
2.1. Study Population and Imaging Acquisitions
Data were acquired from two medical centers. Medical center A is the Donghai Campus of the Second Affiliated Hospital of Fujian Medical University, and medical center B is the Licheng Campus of the same hospital. Between January 2019 and June 2023, 428 patients with laryngeal lesions visited the otolaryngology head and neck surgery departments at medical centers A and B. At medical center A, simple randomization was used to select 127 cases of benign laryngeal lesions (53 males and 74 females, aged 45 ± 12.3 years) and 105 cases of laryngeal squamous cell carcinoma (102 males and 3 females, aged 52 ± 8.6 years) for training and calibrating the AI system. The remaining cases at medical center A, comprising 53 cases of benign laryngeal lesions (20 males and 33 females, aged 46 ± 12.8 years) and 45 cases of laryngeal squamous cell carcinoma (44 males and one female, aged 52 ± 9.6 years), were used for internal testing of the AI system. The cases at medical center B, comprising 53 cases of benign laryngeal lesions (24 males and 29 females, aged 41 ± 11.2 years) and 45 cases of laryngeal squamous cell carcinoma (44 males and one female, aged 53 ± 9.1 years), were used for external testing. In total, 195 cases with laryngeal squamous cell carcinoma (LSCC) pathologically confirmed on surgical resection were retrieved: 150 cases from medical center A served as the training and internal validation cohort, and 45 cases from medical center B served as the external validation cohort. Likewise, 233 cases with pathologically confirmed benign laryngeal lesions were retrieved from the two medical centers: 180 cases from medical center A served as the training and internal validation cohort, and 53 cases from medical center B served as the external validation cohort.
Our raw laryngoscopic images were captured using integration-system endoscopes (CV-170; Olympus Medical Systems Corp., Tokyo, Japan), standard endoscopes (OTV-S7; Olympus Medical Systems Corp., Tokyo, Japan), and endoscopic systems (LMD-1420; Shanghai Suoguang Visual Products Corp., Shanghai, China, and CLV-S40; Olympus Medical Systems Corp., Tokyo, Japan). For each case, an experienced endoscopist selected 4 to 11 high-quality images captured from different perspectives for data augmentation, yielding a total of 2254 laryngoscopic images for this study; all lesions, including LSCC and benign laryngeal lesions such as polyps and non-specific inflammation, were biopsy-proven. Demographic and clinical characteristics, including age, gender, pathology, and tumor size (T stage, according to the American Joint Committee on Cancer staging for LSCC), were collected from the case management system [
24]. Patients from medical center A were randomly divided into training and internal validation cohorts at a ratio of 7:3. Patients from medical center B were utilized as the external validation cohort. A summary of the image sets and clinical characteristics is provided in
Table 1. In
Figure 2, we present a set of examples of benign and malignant laryngoscopic images.
Table 2 presents the histopathological results of the benign lesions encountered in our study. Our dataset included a diverse range of benign lesions, such as papilloma, tuberculosis, and granulomatous lesions. The inclusion of these lesions allowed for a comprehensive assessment of the diagnostic performance of Densenet201 across various histopathological categories.
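The 7:3 patient-level split described above can be sketched as follows. This is a minimal illustration, not the study's actual code; the function name, seed, and the use of a plain patient-ID range are assumptions for demonstration. Splitting at the patient level, rather than the image level, keeps all images from one patient in the same cohort and avoids leakage between training and validation.

```python
import random

def split_patients(patient_ids, train_frac=0.7, seed=42):
    """Randomly split patient IDs into a training cohort and an internal
    validation cohort at the patient level (illustrative helper)."""
    ids = sorted(patient_ids)       # fixed order before shuffling, for reproducibility
    rng = random.Random(seed)       # seeded RNG so the split can be reproduced
    rng.shuffle(ids)
    cut = round(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

# Example: 330 medical-center-A patients (180 benign + 150 LSCC) split ~7:3
train_ids, val_ids = split_patients(range(330))
print(len(train_ids), len(val_ids))  # 231 99
```

In practice the split would also be stratified by pathology so that benign and malignant cases keep roughly the same proportions in both cohorts.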
2.2. Structure of CNN Model
In this study, we leveraged the Densenet201 architecture, a state-of-the-art convolutional neural network (CNN) renowned for its outstanding performance in image recognition tasks. Densenet, short for Densely Connected Convolutional Networks, exhibits a distinctive architectural characteristic: dense connectivity. This sets it apart from traditional CNN architectures by establishing direct connections between layers within the network. Densenet201, a deeper variant of the original Densenet architecture, is particularly well suited to extracting features from complex images [
37,
38]. Here is a breakdown of its key architectural components:
Dense Blocks: Densenet201 comprises multiple dense blocks, each containing a series of densely connected convolutional layers. In these blocks, each layer receives feature maps not just from the previous layer but also from all preceding layers within the same block. This dense connectivity promotes feature reuse, enabling the network to capture both low-level and high-level features effectively.
Transition Layers: Between dense blocks, transition layers are inserted. These layers include batch normalization, a pooling operation (typically average pooling), and a convolutional layer with a bottleneck structure (1 × 1 convolution). Transition layers reduce the spatial dimensions of feature maps while increasing the number of channels, striking a balance between computational efficiency and expressive power.
Global Average Pooling (GAP): At the end of the network, a global average pooling layer is used to aggregate the feature maps spatially, resulting in a single vector for each feature map. This reduces the spatial dimension to 1 × 1, enabling the network to produce a fixed-size feature vector regardless of input size.
Fully Connected Layer: Following GAP, a fully connected layer performs the final classification. The number of neurons in this layer corresponds to the number of classes in the classification task.
Feature Reuse: Densenet’s dense connectivity allows for maximum feature reuse, which facilitates the learning of more compact and discriminative representations from the data [
35].
Mitigating Vanishing Gradient: The dense connections ensure the flow of gradients during training, mitigating the vanishing gradient problem often encountered in very deep networks.
Efficient Parameter Utilization: Densenet’s parameter-efficient design enables it to maintain high accuracy while using fewer parameters compared to traditional architectures [
36].
State-of-the-Art Performance: Densenet201 consistently achieves state-of-the-art performance in various image recognition challenges, outperforming many other architectures in terms of both accuracy and computational efficiency [
39,
40].
The network structure diagram of Densenet201 and detailed parameters can be seen in
Figure 3 and
Table 3.
2.3. Training Process of DCNN Model
The hardware was an NVIDIA RTX 3090 GPU with 24 GB of memory. The software environment incorporated Python 3.6, Pytorch 0.4.1, OpenCV 3.4.1, Numpy 1.15, and SimpleITK 2.0. The training process of the deep convolutional neural network (DCNN) model is the crucial phase in which the model learns to recognize patterns and features within the training data. This section provides an overview of the key steps involved in training the DCNN model:
Data Preprocessing: Before training begins, the laryngoscopic images are preprocessed to ensure uniformity and compatibility with the model. This involves resizing the images to a consistent resolution of 512 × 512 and normalizing pixel values to a common scale (0–255).
Initialization: The DCNN model is initialized either with random weights or with weights pretrained on a large dataset such as ImageNet; transfer learning from a pretrained model often accelerates convergence and boosts performance. The initial learning rate was 0.001 and was decreased by a factor of 0.5 after every 100 epochs; adjusting the learning rate in this way improved performance and training speed. The total number of epochs was 16,000, and the optimizer was SGD.
Loss Function Selection: A suitable loss function was chosen based on the nature of the classification task. For binary classification (LSCC vs. benign), a common choice is binary cross-entropy loss. For multi-class problems, categorical cross-entropy may be used.
Optimizer: An optimizer, such as Adam, SGD (Stochastic Gradient Descent), or RMSprop, is employed to adjust the model’s weights during training to minimize the selected loss function. The learning rate and other hyperparameters associated with the optimizer are carefully tuned to ensure effective convergence.
Mini-Batch Training: To manage memory and computational resources efficiently, training is typically performed in mini-batches. During each training iteration, a batch of laryngoscopic images and their corresponding ground truth labels are fed into the model. The optimizer computes gradients and updates the model weights based on this mini-batch. The batch size was 64.
Backpropagation: After each mini-batch forward pass, backpropagation is used to calculate gradients with respect to the loss function. These gradients are then used to update the model’s weights in the direction that minimizes the loss.
Regularization Techniques: To prevent overfitting, regularization techniques such as dropout and L2 regularization may be applied. These methods help the model generalize better to unseen data.
Validation: During training, a separate validation dataset, distinct from the training set, is used to assess the model’s performance at regular intervals (e.g., after each epoch). This allows for early stopping if the model’s performance on the validation data starts deteriorating, preventing overfitting.
Monitoring and Logging: Key metrics such as accuracy, loss, and possibly others like precision, recall, and F1-score, are monitored and logged during training. Visualization tools and logging systems are often employed to keep track of the model’s progress.
The training process is iterative, with the model gradually learning to make accurate predictions as it updates its weights during each epoch. This process continues until the model reaches a level of performance deemed satisfactory for the given task.
In this study, we diligently followed these steps and fine-tuned hyperparameters as needed during model training. Because this study compared multiple deep learning models, all models were trained with identical hyperparameters to ensure a fair comparison.
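The steps above (SGD optimization, the step-decay learning-rate schedule, mini-batches of 64, and backpropagation) can be sketched as a minimal training loop. This is a toy illustration under stated assumptions: a tiny CNN stands in for Densenet201, random tensors stand in for a preprocessed mini-batch, and only two epochs are run rather than the study's 16,000.

```python
import torch
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

# Stand-in model for Densenet201 (keeps the sketch fast and self-contained).
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                      nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1),   # global average pooling
                      nn.Flatten(),
                      nn.Linear(8, 2))           # 2-class head

criterion = nn.CrossEntropyLoss()                # cross-entropy for the binary task
optimizer = SGD(model.parameters(), lr=0.001)    # initial LR 0.001, as in the text
scheduler = StepLR(optimizer, step_size=100, gamma=0.5)  # halve LR every 100 epochs

images = torch.randn(64, 3, 64, 64)  # mini-batch of 64, as in the text
labels = torch.randint(0, 2, (64,))  # ground-truth benign/malignant labels

for epoch in range(2):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)  # forward pass + loss
    loss.backward()                          # backpropagation: compute gradients
    optimizer.step()                         # SGD weight update
    scheduler.step()                         # advance the LR decay schedule
print(loss.item())
```

A real run would iterate over a DataLoader of laryngoscopic images, evaluate on the internal validation cohort after each epoch, and log accuracy and loss for early stopping.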
2.4. Statistical Analysis
In this section, we present a rigorous statistical analysis to evaluate the performance of our deep convolutional neural network (DCNN) model in the context of laryngeal cancer diagnosis based on laryngoscopic images. The assessment encompasses several key metrics, including accuracy, specificity, sensitivity, receiver operating characteristic (ROC) analysis, area under the curve (AUC), and the DeLong test.
Accuracy: Accuracy is a pivotal metric quantifying the overall classification performance of our model. It is defined as the ratio of correctly classified samples to the total number of samples. Mathematically, it can be expressed as:
Accuracy = (TP + TN)/(TP + TN + FP + FN)
where TP (True Positives) denotes correctly identified laryngeal cancer cases, TN (True Negatives) denotes correctly identified non-cancerous laryngeal lesions, FP (False Positives) denotes non-cancerous laryngeal lesions incorrectly identified as laryngeal cancer, and FN (False Negatives) denotes laryngeal cancer cases incorrectly classified as non-cancerous laryngeal lesions.
Specificity: Specificity assesses the model's capability to correctly identify non-cancerous laryngeal lesions. It is calculated as:
Specificity = TN/(TN + FP)
Sensitivity: Sensitivity, also referred to as the true positive rate or recall, measures the model's ability to accurately detect laryngeal cancer cases. It is calculated as:
Sensitivity = TP/(TP + FN)
Receiver Operating Characteristic (ROC) Analysis: ROC analysis is employed to visualize the model’s performance across different threshold settings. It generates an ROC curve illustrating the trade-off between sensitivity and specificity at varying thresholds.
Area Under the Curve (AUC): The AUC quantifies the overall performance of the model by calculating the area under the ROC curve. A higher AUC signifies superior discrimination, with 1 indicating perfect discrimination and 0.5 representing random chance.
DeLong Test: The DeLong test serves as a statistical tool for comparing the ROC curves of multiple classification models. It determines whether observed differences in AUC values are statistically significant, aiding in model selection and validation.
Statistical Procedure:
Accuracy, specificity, and sensitivity were computed based on the model’s predictions against the ground truth labels within the dataset. ROC analysis was executed to construct the ROC curve, and the AUC was quantified as a holistic measure of the model’s discriminatory capacity. To discern any significant distinctions in performance among different models or model variants, the DeLong test was applied. This statistical test ascertained whether variations in AUC values were statistically meaningful. In the discussion section, the outcomes of these meticulous statistical analyses offer valuable insights into the effectiveness of our DCNN model in the diagnosis of laryngeal cancer from laryngoscopic images. Additionally, they enable the assessment of potential performance disparities between our model and alternative models or variations in the classification task.
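The metric definitions above can be sketched in a few lines of plain Python. This is an illustrative implementation, not the study's analysis code; the labels and scores below are made-up examples, and the AUC is computed via its rank-statistic (Mann-Whitney) formulation rather than by tracing the full ROC curve.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity from binary labels and
    predictions (1 = LSCC, 0 = benign lesion)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {"accuracy": (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}

def auc(y_true, scores):
    """AUC as the probability that a randomly chosen malignant case is
    scored higher than a randomly chosen benign case (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 3 malignant and 3 benign cases with model risk scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5
print(binary_metrics(y_true, preds))
print(auc(y_true, scores))
```

The DeLong test itself additionally requires an estimate of the covariance between the two models' AUC estimates on the same cases, so in practice it is performed with a dedicated statistical implementation rather than the rank statistic alone.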