3.3.1. Feature Extraction Based on EfficientNet
EfficientNet [37] is an efficient network architecture that uses Neural Architecture Search (NAS) to jointly optimize network width, depth, and input resolution. Through its compound scaling strategy, EfficientNet scales these three dimensions systematically, achieving higher accuracy under a fixed resource budget. EfficientNet has shown strong performance on standard image recognition benchmarks, and in particular on complex medical imaging tasks, where it efficiently captures the features that are essential for accurate medical image retrieval.
In this paper, we compared candidate backbones, including EfficientNet and ResNet variants, through a series of experiments and ultimately selected EfficientNet-B6. Its larger capacity and input resolution allow the model to capture finer details and richer features, and it incorporates Mobile inverted Bottleneck Convolution (MBConv) blocks and Squeeze-and-Excitation (SE) blocks, which strengthen the model's ability to recognize key features while remaining parameter-efficient.
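To make this choice concrete, the following minimal sketch uses EfficientNet-B6 as a convolutional feature extractor. The torchvision implementation is an assumption (the paper does not name a specific library), and the shapes shown are for B6's nominal 528 × 528 input:

```python
import torch
from torchvision import models

# Minimal sketch: EfficientNet-B6 as a convolutional feature extractor.
# torchvision exposes the MBConv stages (including their SE blocks) as
# `.features`, separate from the classifier head.
backbone = models.efficientnet_b6(weights=None)  # pretrained weights optional
feature_extractor = backbone.features

x = torch.randn(2, 3, 528, 528)      # B6's nominal input resolution is 528 x 528
feature_map = feature_extractor(x)   # -> torch.Size([2, 2304, 17, 17])
print(feature_map.shape)
```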
3.3.2. Attention and Multi-Layer Feature Fusion
In this study, we adopted EfficientNet as the backbone network and integrated a Convolutional Block Attention Module (CBAM) [38] to enhance the expressive power of the features. We selected the output of the penultimate Mobile inverted Bottleneck Convolution (MBConv) stage of EfficientNet as the input to the CBAM, on the premise that deeper network layers encode rich spatial hierarchical information that is well suited to further attention processing. The CBAM adaptively reweights the feature map through channel and spatial attention mechanisms, thereby enhancing the expressive capacity of the features. Specifically, the Channel Attention Module (CAM) selectively enhances channels according to their informational importance, while the Spatial Attention Module (SAM) emphasizes significant spatial locations, guiding the network's focus within the image.
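For reference, a compact CBAM sketch in PyTorch is shown below. The reduction ratio and kernel size follow the defaults of the original CBAM paper [38]; the class names are illustrative:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: reweights channels using globally pooled (avg + max) descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """SAM: reweights locations using channel-wise avg and max maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as in [38]."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.cam = ChannelAttention(channels, reduction)
        self.sam = SpatialAttention(kernel_size)

    def forward(self, x):
        return self.sam(self.cam(x))
```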
The output of the CBAM is concatenated with the output of EfficientNet, integrating features from different processing stages and enhancing the model's ability to capture both global and local features. Specifically, the output of the CBAM, denoted as $F_{\text{CBAM}}$, is concatenated with the output of EfficientNet, $F_{\text{Eff}}$, to form a unified feature representation $F_{\text{concat}}$ through the following operation:

$$F_{\text{concat}} = \mathrm{Concat}(F_{\text{CBAM}}, F_{\text{Eff}})$$

The concatenated features $F_{\text{concat}}$ are then subjected to further feature transformation through a linear layer $L_1$:

$$F_{L_1} = W_1 F_{\text{concat}} + b_1$$

Here, $W_1$ and $b_1$ represent the weight matrix and bias term of the layer, respectively.

The output of the linear layer, $F_{L_1}$, is split into two branches: one branch is directed to the classification layer for image categorization, where a softmax function outputs class probabilities; the other branch is concatenated with the flattened CBAM output $F_{\text{CBAM}}$ and processed through a second linear layer $L_2$:

$$F_{L_2} = W_2\,\mathrm{Concat}(F_{L_1}, F_{\text{CBAM}}) + b_2$$

where $W_2$ and $b_2$ represent the weight matrix and bias term of this layer, respectively. Ultimately, $F_{L_2}$ is fed into the hashing layer to produce hash codes.
This design enables our model not only to perform effective image classification but also to generate hash codes for image retrieval, greatly enhancing the practicality and efficiency of the model in medical image analysis.
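A minimal sketch of this fusion head is given below. The layer dimensions, variable names, and the assumption that both feature maps have already been globally pooled and flattened into vectors are illustrative choices, not the exact configuration used in this work:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Sketch of the fusion and bifurcation described above (sizes illustrative)."""
    def __init__(self, eff_dim, cbam_dim, hidden_dim, num_classes, hash_bits):
        super().__init__()
        self.linear1 = nn.Linear(eff_dim + cbam_dim, hidden_dim)    # L1: W1, b1
        self.classifier = nn.Linear(hidden_dim, num_classes)        # classification branch
        self.linear2 = nn.Linear(hidden_dim + cbam_dim, hash_bits)  # L2: W2, b2

    def forward(self, f_eff, f_cbam):
        # F_concat = Concat(F_CBAM, F_Eff); F_L1 = W1 * F_concat + b1
        f_l1 = self.linear1(torch.cat([f_cbam, f_eff], dim=1))
        # Branch 1: class logits; softmax over these yields the class probabilities
        logits = self.classifier(f_l1)
        # Branch 2: F_L2 = W2 * Concat(F_L1, F_CBAM) + b2, feeding the hash layer
        f_l2 = self.linear2(torch.cat([f_l1, f_cbam], dim=1))
        return logits, f_l2

# Hypothetical dimensions for a quick shape check
head = FusionHead(eff_dim=2304, cbam_dim=2304, hidden_dim=512,
                  num_classes=4, hash_bits=64)
logits, pre_hash = head(torch.randn(2, 2304), torch.randn(2, 2304))
```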
3.3.3. Learnable Quantization Hashing Layer
To conduct image retrieval effectively, this study introduces a hash code layer at the final output stage of the model. Image retrieval requires the model to map learned features into a discrete hash space so that rapid comparison and retrieval are possible. To this end, we designed a hash code layer that uses a learnable parameter to optimize the binarization of feature vectors.
The core of the hash code layer is a learnable hyperbolic tangent function $\tanh(\beta x)$, whose parameter $\beta$ is adjusted during model training. This function maps continuous features into a more compact representation space while maintaining the discriminative ability between features. Specifically, the features produced by the linear layer $L_2$, denoted as $F_{L_2}$, are quantized through the hyperbolic tangent function to generate the final hash code $H$:

$$H = \tanh(\beta F_{L_2})$$

where $\beta$ is a learnable parameter, and we require $\beta > 0$.
This design not only improves the efficiency of image retrieval but also enhances the model’s sensitivity to feature similarity, enabling the generation of more compact and discriminative hash codes. This is of significant importance for rapid retrieval and comparison in large-scale image databases.
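A minimal PyTorch sketch of this layer is shown below. The initial value of $\beta$ and the clamp used to keep it positive are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LearnableTanhHash(nn.Module):
    """Sketch of the learnable quantization layer: H = tanh(beta * F_L2)."""
    def __init__(self, init_beta=1.0):  # initial value is an assumption
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(init_beta))  # learnable scale beta

    def forward(self, x):
        beta = torch.clamp(self.beta, min=1e-3)  # enforce beta > 0 (clamp is our choice)
        return torch.tanh(beta * x)              # relaxed hash codes in (-1, 1)

hash_layer = LearnableTanhHash()
codes = hash_layer(torch.randn(2, 64))   # 64-bit relaxed codes (illustrative width)
binary = torch.sign(codes)               # discrete codes used at retrieval time
```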
3.3.4. Classification and Triplet Loss
In our study, to train an efficient medical image retrieval model, we adopted a multitask learning strategy that combines classification loss and triplet loss. Below is a detailed description of the two key loss functions involved in constructing the objective function.
The classification loss is based on focal loss [39], a loss function designed to address class imbalance and to improve the model's ability to recognize minority classes. We use the output of the classification layer to compute the focal loss, defined as follows:

$$L_{\text{focal}} = -(1 - p_t)^{\gamma} \log(p_t)$$

Here, $p_t$ represents the probability predicted by the model that the sample belongs to the positive class, and $\gamma$ is a tuning parameter used to balance the importance of easy-to-classify and hard-to-classify samples.
For multi-class problems, the focal loss can be extended to a weighted sum over all classes:

$$L_{\text{focal}} = -\sum_{i=1}^{n} y_i (1 - p_i)^{\gamma} \log(p_i)$$

Here, $n$ is the total number of classes, $p_i$ is the predicted probability of class $i$, and $y_i$ is the indicator function, which equals 1 if the sample's true label is $i$, and 0 otherwise.
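A sketch of this multi-class focal loss is shown below, using the $\gamma = 1.5$ reported in Section 3.3.5; the batch averaging is an assumption:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=1.5):
    """Multi-class focal loss sketch: -(1 - p_t)^gamma * log(p_t) for each
    sample's true class, averaged over the batch (our assumption)."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log prob of true class
    pt = log_pt.exp()
    return ((1.0 - pt) ** gamma * -log_pt).mean()

loss = focal_loss(torch.randn(8, 4), torch.randint(0, 4, (8,)))
```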
Triplet loss [40] is used to learn discriminative representations of image features, ensuring that images of the same class have smaller feature distances than images of different classes. In our study, the anchor, positive, and negative samples are fed simultaneously into feature extractors with shared parameters, where the positive sample belongs to the same class as the anchor and the negative sample to a different class. The triplet loss is computed as follows:

$$L_{\text{triplet}} = \max\left(d(f_a, f_p) - d(f_a, f_n) + m,\ 0\right)$$

Here, $f_a$, $f_p$, and $f_n$ represent the feature representations of the anchor, positive, and negative samples, respectively, $d(\cdot,\cdot)$ is the distance metric between features, and $m$ is the margin, i.e., the minimum distance maintained between positive and negative sample pairs.
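A corresponding sketch is given below, with Euclidean distance as $d$ (an assumption; the paper does not specify the metric) and the margin $m = 0.5$ reported in Section 3.3.5:

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, margin=0.5):
    """Triplet loss sketch: max(d(f_a, f_p) - d(f_a, f_n) + m, 0)."""
    d_ap = F.pairwise_distance(f_a, f_p)  # anchor-positive distances
    d_an = F.pairwise_distance(f_a, f_n)  # anchor-negative distances
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()

# PyTorch's built-in nn.TripletMarginLoss(margin=0.5) is an equivalent alternative.
```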
Combining the two loss functions above, we construct the objective function as follows:

$$L = \lambda \left(L_{\text{focal}}^{a} + L_{\text{focal}}^{p} + L_{\text{focal}}^{n}\right) + \mu L_{\text{triplet}}$$

In this objective function, $L_{\text{focal}}^{a}$, $L_{\text{focal}}^{p}$, and $L_{\text{focal}}^{n}$ represent the focal losses of the anchor, positive, and negative samples, respectively, while $L_{\text{triplet}}$ represents the triplet loss; $\lambda$ and $\mu$ are hyperparameters used to balance the classification and triplet loss weights, with $\lambda$ set to 1/3 in our experiments. During model training, we minimize the objective function to simultaneously optimize classification accuracy and feature discriminability, thereby improving the performance of medical image retrieval.
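Reusing the two sketches above, the combined objective might be assembled as follows. Here $\mu = 1$ is an assumption, consistent with Algorithm 1 below, which simply sums the two losses:

```python
def objective(logits_a, logits_p, logits_n, y_a, y_p, y_n,
              f_a, f_p, f_n, lam=1.0 / 3.0, mu=1.0):
    """L = lam * (L_focal^a + L_focal^p + L_focal^n) + mu * L_triplet,
    reusing focal_loss and triplet_loss from the sketches above.
    lam = 1/3 follows the paper; mu = 1.0 is an assumption."""
    cls = (focal_loss(logits_a, y_a) + focal_loss(logits_p, y_p)
           + focal_loss(logits_n, y_n))
    return lam * cls + mu * triplet_loss(f_a, f_p, f_n)
```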
3.3.5. Training and Implementation Details
In this study, we employed the PyTorch deep learning framework for model construction and training. Model training was conducted on a high-performance computing platform equipped with NVIDIA V100 32 GB GPUs to ensure computational efficiency. The data were partitioned into a database set (training set) and a test set in an 8:2 ratio.
Table 2 shows the parameter settings used during model training. We employed the Adam optimizer with a learning rate of 1 × . Training ran for 50 epochs, with the triplet loss margin $m$ set to 0.5 and the focal loss $\gamma$ set to 1.5. The objective of training was to minimize the total loss function $L$, which integrates the classification loss and the triplet loss, as detailed in Algorithm 1.
Algorithm 1 Deep Hashing Network Training
1:  Initialize:
2:    Set model parameters: epochs = 50, learning_rate = 1 ×
3:    optimizer = Adam, loss_function = triplet loss + focal loss
4:  Load Data:
5:    Load dataset from TCIA
6:    Preprocess and augment data
7:    Split data into training and testing sets
8:  Build Model:
9:    Use EfficientNet as the base feature extractor
10:   Add an attention mechanism module (e.g., CBAM)
11:   Integrate features into deep hashing layer
12:   Configure outputs for classification and hashing layers
13: Training Loop:
14: for epoch = 1 to epochs do
15:   for data, labels in training_data do
16:     optimizer.zero_grad()
17:     features = EfficientNet(data)
18:     attention_features = CBAM(features)
19:     hash_codes = DeepHashLayer(attention_features)
20:     classification_output = ClassifierLayer(hash_codes)
21:     triplet_loss = compute_triplet_loss(hash_codes, labels)
22:     focal_loss = compute_focal_loss(classification_output, labels)
23:     total_loss = triplet_loss + focal_loss
24:     total_loss.backward()
25:     optimizer.step()
26:   end for
27:   Validate model on test data
28: end for
29: Save Model:
30:   Save trained model parameters to file
31: Output:
32:   Print loss and validation results during training
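For concreteness, the inner loop of Algorithm 1 could be rendered in PyTorch as follows, reusing the loss sketches above. The `model` interface (returning class logits and hash codes per batch) and the triplet sampling are hypothetical:

```python
import torch

def train_step(model, optimizer, anchor, positive, negative, y_a, y_p, y_n):
    """One optimization step corresponding to Algorithm 1's inner loop.
    `model` is assumed to return (class logits, hash codes) for a batch."""
    optimizer.zero_grad()
    logits_a, codes_a = model(anchor)
    logits_p, codes_p = model(positive)
    logits_n, codes_n = model(negative)
    loss = objective(logits_a, logits_p, logits_n, y_a, y_p, y_n,
                     codes_a, codes_p, codes_n)  # focal + triplet terms
    loss.backward()
    optimizer.step()
    return loss.item()
```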
3.3.6. Metrics
In our study, the core metric for evaluating retrieval performance is Mean Average Precision (MAP), which summarizes performance across the whole dataset. Compared with precision and recall alone, MAP provides a more comprehensive and stable measure of ranking quality.
Average Precision (AP) quantifies the average proportion of correct results among all returned results for a single query image. It is calculated using the following formula:

$$AP = \frac{1}{R} \sum_{k=1}^{n} P(k)\,\delta(k)$$

Here, $R$ represents the total number of correct results for query image $q$, $n$ is the total number of returned results, $P(k)$ denotes the precision at rank $k$, and $\delta(k)$ is an indicator function that takes a value of 1 when the result at rank $k$ is correct, and 0 otherwise.

MAP is computed by averaging the AP values over all query images. Assuming there are $Q$ query images, the formula for MAP is

$$MAP = \frac{1}{Q} \sum_{q=1}^{Q} AP(q)$$
Additionally, the MAP@k metric measures average precision over only the top k retrieval results. In particular, when k = 1, MAP@1 considers only the first returned result, reflecting the accuracy of the system's top-ranked answer; in retrieval tasks it can therefore serve as an indicator of classification accuracy.
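As a reference implementation of these definitions, the sketch below computes AP, MAP, and MAP@k from per-query binary relevance lists; the input format is our choice, and $R$ is taken as the number of correct results within the (possibly truncated) returned list:

```python
import numpy as np

def average_precision(relevant, k=None):
    """AP sketch: `relevant` is a 0/1 list over ranked results (1 = correct).
    Truncating to the top k results gives the AP@k used for MAP@k."""
    rel = np.asarray(relevant[:k], dtype=float)  # list[:None] keeps all results
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)  # P(k) at each rank
    return float((precision_at_k * rel).sum() / rel.sum())       # (1/R) sum P(k) d(k)

def mean_average_precision(all_relevant, k=None):
    """MAP: mean of the per-query AP values."""
    return float(np.mean([average_precision(r, k) for r in all_relevant]))

# Two hypothetical queries with ranked results marked correct (1) or not (0)
print(mean_average_precision([[1, 0, 1, 1], [0, 1, 1, 0]]))       # MAP
print(mean_average_precision([[1, 0, 1, 1], [0, 1, 1, 0]], k=1))  # MAP@1
```

With k = 1 this reduces to the fraction of queries whose first returned result is correct, matching the interpretation of MAP@1 given above.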