In the tasks of left–right ear similarity detection and identity authentication, designing an effective network architecture is crucial for improving model performance. To this end, we conducted a series of ablation experiments to analyze the impact of different modules and loss functions, systematically evaluating the contribution of each component to the final result and providing empirical evidence for model design. The experiments focus on the roles of the Symmetry Alignment Module (SAM) and the Feature Interaction Network (FIN), as well as the relative importance of the contrastive loss, symmetry loss, and BCE loss during training. By incrementally analyzing combinations of modules and loss functions, we can gain a deeper understanding of their specific impacts on the ear similarity detection and identity authentication tasks.

In both the similarity detection and identity authentication processes, the quantities we compute are True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN). Specifically, TP is the number of positive sample pairs correctly classified as positive, FN the number of positive pairs incorrectly classified as negative, FP the number of negative pairs incorrectly classified as positive, and TN the number of negative pairs correctly classified as negative.
The metrics to be calculated in left–right ear similarity detection are as follows:
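Assuming the standard formulations in terms of the counts defined above:
\[
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, &\quad \text{Precision} &= \frac{TP}{TP + FP},\\[2pt]
\text{Recall (TPR)} &= \frac{TP}{TP + FN}, &\quad F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{aligned}
\]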
The model was trained using a batch size of 32 for 100 epochs with an initial learning rate of , optimized with the Adam optimizer. A cosine annealing learning rate scheduler was employed, with a minimum learning rate of , balancing stability and convergence. The feature dimension was set to 512 to provide sufficient representation power. Data augmentation was applied to increase dataset diversity and improve model robustness. Mixed-precision training was used to optimize computational efficiency on CUDA-enabled devices. A custom multi-task loss function with dynamic weighting was designed to adapt during training. Early stopping with a patience of 10 epochs prevented overfitting. A validation set was allocated at 20% of the total data to monitor generalization performance throughout the training process.
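The configuration above can be summarized in the following illustrative sketch. The learning-rate values are placeholders (the exact numbers are not given here), and the model, data loaders, multi-task loss, and evaluation helper are assumed to exist under the hypothetical names shown; this is a sketch of the described setup, not the authors' implementation.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Placeholders: the exact learning rates are not stated in this excerpt.
initial_lr = 1e-3
min_lr = 1e-6
num_epochs = 100
patience = 10  # early stopping patience

optimizer = torch.optim.Adam(model.parameters(), lr=initial_lr)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs, eta_min=min_lr)
scaler = GradScaler()  # mixed-precision training on CUDA devices

best_val_loss, epochs_without_improvement = float("inf"), 0
for epoch in range(num_epochs):
    model.train()
    for left_ear, right_ear, label in train_loader:
        optimizer.zero_grad()
        with autocast():  # forward pass in mixed precision
            similarity, features = model(left_ear, right_ear)  # hypothetical outputs
            loss = multi_task_loss(similarity, features, label, epoch)  # dynamic weighting
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()

    val_loss = evaluate(model, val_loader)  # assumed helper on the 20% validation split
    if val_loss < best_val_loss:
        best_val_loss, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # early stopping
```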
4.1.1. Left–Right Ear Similarity Detection
In the task of left–right ear similarity detection, this study used 300 pairs of positive samples and a corresponding number of negative samples for training, with 1000 pairs used for testing. During testing, a threshold of 0.5 was set, and if the similarity score between the left and right ears exceeded 0.5, the sample was considered positive; otherwise, it was considered negative.
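As a minimal illustration of this decision rule and of how the counts defined earlier are accumulated, assuming per-pair similarity scores and binary labels stored in arrays (all names are illustrative):

```python
import numpy as np

def confusion_counts(similarities, labels, threshold=0.5):
    """Count TP, FN, FP, TN for pairwise similarity scores.
    labels: 1 for positive pairs (same person's left/right ears), 0 for negative pairs."""
    preds = (np.asarray(similarities) > threshold).astype(int)
    labels = np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    tn = int(np.sum((preds == 0) & (labels == 0)))
    return tp, fn, fp, tn
```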
Table 1 presents the impact of different modules and loss functions on model performance in the similarity detection task.
Figure 9 shows the similarity distribution scatter plot and the distribution plot for 1000 pairs of positive and 1000 pairs of negative samples.
Combining the ablation metrics for the similarity detection task in Table 1 with the similarity distribution scatter plots of positive and negative ear samples in Figure 9, we can further analyze the impact of the individual network modules. The basic ResNet18 model trained with only the BCE loss learns basic left–right ear similarity features but has limited discriminative power: its accuracy is 0.8992 and its F1 score 0.9051, and the similarity distributions of positive and negative samples still overlap substantially. Introducing the contrastive loss improves performance markedly, raising the accuracy of the ResNet18+BCE loss+contrastive loss model to 0.9493 and the F1 score to 0.9488. When the ResNet18 model instead employs all three loss functions, its accuracy, precision, and F1 score are slightly lower than with the BCE and contrastive losses alone. The primary reason is the absence of the SAM module, which limits the effectiveness of the symmetry loss: SAM is designed to optimize the learning of symmetry features between the left and right ears, and without it the symmetry loss conflicts with the BCE and contrastive losses rather than enhancing performance, producing a slight decline in these metrics. Nevertheless, as shown in the scatter and distribution plots, a bottleneck remains even with the additional loss; some overlap persists between the positive and negative samples (Figure 9a), which limits the model's discriminative ability.
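To make the loss combination concrete, the following sketch shows one conventional way such a weighted multi-loss objective can be assembled; the BCE and margin-based contrastive terms are standard formulations, while the symmetry loss is left as an abstract term and the weights are placeholders, since their exact forms and the dynamic weighting scheme are defined elsewhere in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(dist, label, margin=1.0):
    # Conventional pairwise contrastive loss: pull positive pairs together,
    # push negative pairs beyond the margin. `dist` is the feature distance,
    # `label` is 1 for positive pairs and 0 for negative pairs (float tensor).
    pos = label * dist.pow(2)
    neg = (1 - label) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

def total_loss(sim_logits, dist, sym_term, label, w_bce=1.0, w_con=1.0, w_sym=1.0):
    # Weighted sum of BCE, contrastive, and symmetry terms; the weights are
    # placeholders for the paper's dynamic weighting scheme, and `sym_term`
    # stands in for the symmetry loss defined in the paper.
    bce = F.binary_cross_entropy_with_logits(sim_logits, label.float())
    return w_bce * bce + w_con * contrastive_loss(dist, label) + w_sym * sym_term
```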
To further optimize the stability and accuracy of the similarity detection task, we introduce the Symmetry Alignment Module (SAM), which dynamically adjusts the features by utilizing the symmetry information of the left and right ears. As a result, the accuracy of the ResNet18+SAM model improves to 0.9435, and the F1 score reaches 0.9430. As shown in Figure 9b, compared to the model using only ResNet18, the addition of the SAM module makes the separation between positive and negative samples clearer. It reduces the number of negative samples with high similarity, thereby decreasing the false positive rate and enhancing the model's discriminative ability. However, some positive and negative samples still cross near the 0.5 threshold, indicating that although SAM improves feature alignment, the model has not yet achieved optimal classification performance.
After incorporating the Feature Interaction Network (FIN), the ResNet18+SAM+FIN model achieves outstanding performance in the similarity detection task, reaching an accuracy of 0.9903 and an F1 score of 0.9898. This is further supported by the scatter distribution plot in Figure 9c, which shows that, under this configuration, the separation between positive and negative samples is at its best: negative samples are predominantly clustered between 0 and 0.5, while positive samples are concentrated between 0.9 and 1.0, with minimal overlap between the two regions. This clear distinction indicates that the FIN enhances the interaction mechanism across feature layers, enabling the model to more effectively capture subtle feature disparities between the left and right ear images and significantly improving the precision of similarity detection. It can therefore be inferred that the SAM provides robust symmetry calibration, while the FIN further refines the model's feature interaction capabilities; the synergy between these components enables the ResNet18+SAM+FIN model to deliver exceptional classification performance in the similarity detection task, achieving optimal separation of positive and negative samples.
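Purely as an illustration of how the three components compose in a Siamese-style pipeline (shared backbone encoding, symmetry alignment, feature interaction, similarity head), a hypothetical skeleton is sketched below; the SAM and FIN bodies are placeholders, and all dimensions and names are assumptions rather than the paper's actual design.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class EarSimilarityNet(nn.Module):
    # Illustrative composition only: backbone features -> symmetry alignment ->
    # feature interaction -> similarity score. SAM/FIN internals are placeholders.
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # shared ResNet18 encoder
        self.sam = nn.Identity()  # placeholder for the Symmetry Alignment Module
        self.fin = nn.Sequential(  # placeholder for the Feature Interaction Network
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, 1)  # similarity logit

    def forward(self, left, right):
        f_l = self.encoder(left).flatten(1)   # (B, 512) left-ear features
        f_r = self.encoder(right).flatten(1)  # (B, 512) right-ear features
        f_l, f_r = self.sam(f_l), self.sam(f_r)          # symmetry-aware alignment
        fused = self.fin(torch.cat([f_l, f_r], dim=1))   # cross-feature interaction
        return torch.sigmoid(self.head(fused)).squeeze(1)  # similarity in [0, 1]
```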
4.1.2. Performance and Analysis in Human Ear Authentication
In the authentication task, different positive-to-negative sample ratios were tested.
Table 2 presents the impact of different modules and loss functions on the model performance for ear authentication using 1000 pairs of positive and 1000 pairs of negative ear samples. In the identity authentication task, we formulated an optimization problem to set the threshold dynamically; the optimal threshold is calculated using Equation (13). When the input sample is a positive sample, a similarity above the threshold is classified as the same person (authentication success); otherwise it is classified as different people (authentication failure). When the input sample is a negative sample, a similarity above the threshold is incorrectly classified as the same person (authentication failure); otherwise it is classified as different people (authentication success).
In the identity authentication task, to assess generalization we tested two ratios of positive to negative samples, 1:1 and 1:2. We designed an optimization function J to find the optimal threshold by determining the intersection point of TPR and 1-FPR; this intersection represents the balance point at which the model favors neither positive nor negative samples when distinguishing between them. For the 1:1 ratio, the optimal threshold selection curves and ROC curves for the three networks are shown in Figure 10.
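A minimal sketch of this threshold search, assuming per-pair similarity scores and labels and using scikit-learn's ROC utilities (the exact objective J is given by Equation (13); the code below simply locates the point where TPR and 1-FPR intersect):

```python
import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(labels, similarities):
    """Return the threshold where TPR is closest to 1 - FPR, the balance point
    used for threshold selection; at this point FPR ~= FNR, i.e. the EER."""
    fpr, tpr, thresholds = roc_curve(labels, similarities)
    idx = np.argmin(np.abs(tpr - (1 - fpr)))
    return thresholds[idx], fpr[idx]  # selected threshold and the EER estimate
```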
From the ablation study presented in Table 2 and the optimal threshold analysis shown in Figure 10, it can be observed that in the identity authentication task the baseline ResNet18 achieves an accuracy of only 0.8863, with both precision and True Positive Rate (TPR) between 0.8851 and 0.8879, suggesting some misidentification issues. According to the ROC curve data, the Equal Error Rate (EER) for the baseline ResNet18 is 0.1080, corresponding to a threshold of 0.7672, which is consistent with the optimal threshold identified for this configuration. This relatively high EER indicates room for improvement in balancing false acceptance and false rejection rates.
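For reference, the rates used in this analysis follow the usual relations
\[
\mathrm{TNR} = \frac{TN}{TN + FP} = 1 - \mathrm{FPR}, \qquad \mathrm{FNR} = \frac{FN}{FN + TP} = 1 - \mathrm{TPR},
\]
and the EER is the common error rate at the threshold where FPR = FNR.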
The introduction of the SAM module significantly enhances the model’s ability to distinguish negative samples. The True Negative Rate (TNR) increases from 0.8847 to 0.9003, while the False Positive Rate (FPR) decreases from 0.1153 to 0.0997. The EER improves to 0.0980, aligning with an optimal threshold of 0.7252. This reduction in the optimal threshold from 0.7672 to 0.7252 reflects a shift in the decision boundary, demonstrating that SAM effectively reduces incorrect matches by refining the model’s sensitivity to negative samples. However, limitations in feature interaction continue to constrain further performance gains.
To overcome this, the FIN module is incorporated to optimize feature representation. With the ResNet18+SAM+FIN approach, the accuracy rises to 0.9252, with precision, TPR, and F1 score all reaching 0.9252. The false positive rate further decreases to 0.0748, and the EER drops to 0.0360, corresponding to an optimal threshold of 0.8875. The increase in the optimal threshold from 0.7252 to 0.8875 across these enhancements indicates a progressively stricter criterion for positive sample acceptance, enabling the model to achieve superior discrimination while maintaining high accuracy.
The ablation study of the various modules in the identity authentication task under a 1:2 positive-to-negative sample ratio, detailed in Table 3, and the ROC curve analysis with optimal threshold selection, shown in Figure 11, demonstrate that the baseline ResNet18 model achieves an accuracy of 0.8863 on this dataset. Yet its precision drops sharply to 0.7960 compared with the 1:1 scenario, revealing that a predominance of negative samples undermines the model's ability to accurately identify positive instances and elevates misclassification risks. The ROC data indicate an EER of 0.1135 at an optimal threshold (Best_threshold) of 0.7959, reflecting a suboptimal balance between the False Acceptance Rate (FAR) and False Rejection Rate (FRR). Despite this, the TPR remains strong at 0.8860, consistent with the 1:1 case, indicating robust positive sample recall. However, the abundance of negative samples limits the True Negative Rate (TNR) to 0.8865 and sustains a high FPR of 0.1135. These findings highlight the baseline ResNet18's limited adaptability to imbalanced data without the enhancement modules, particularly in negative-sample-dominated settings, where its precision in identifying positive instances is significantly compromised.
The incorporation of the SAM module refines the model's recognition capabilities, subtly shifting its performance metrics. While accuracy dips marginally to 0.8737, the EER improves significantly to 0.0970, corresponding to an optimal threshold of 0.7177. This reduction from the baseline threshold of 0.7959 reflects a recalibrated decision boundary, enhancing the model's resilience to data perturbations and bolstering its generalization to negative samples; the improved EER evidences a more balanced trade-off between false acceptance and rejection rates. The TPR remains robust at 0.8740, closely approximating the baseline, yet the TNR declines slightly to 0.8735, accompanied by a modest rise in the FPR to 0.1265. Precision, impacted by the imbalanced sample distribution, decreases further to 0.7755. Nevertheless, the SAM's stabilizing influence sustains a TPR near 0.9, effectively minimizing false rejections while improving on the baseline model's EER and thereby enhancing overall performance on imbalanced data.
The ResNet18+SAM+FIN configuration significantly enhances the model's performance across key metrics. Accuracy climbs to 0.9010 and precision improves to 0.8198, outperforming both the standalone ResNet18 and the ResNet18+SAM model. The optimal threshold rises to 0.8875, accompanied by a substantial reduction in the EER to 0.0425 at this Best_threshold. The increase in threshold from 0.7177 to 0.8875 underscores the FIN's role in refining feature representations and greatly improves the model's capacity to discern positive samples in a negative-sample-dominated context; as the optimal threshold increases, the model's acceptance criterion for positive samples becomes stricter, which reduces both False Positives and False Negatives. The TPR reaches 0.9010, boosting the F1 score to 0.8585, while the FPR drops to 0.0990 and the TNR stabilizes at 0.9010.
The comparative analysis of the classification metrics True Positive (TP), True Negative (TN), False Negative (FN), and False Positive (FP) under the optimal threshold for the different positive sample ratios is shown in Figure 12 and Figure 13. In the bar charts, the green bars represent TPs, the blue bars TNs, the red bars FPs, and the orange bars FNs. With the integration of the SAM and FIN modules, the model's performance improves systematically: the counts of TPs and TNs increase significantly, while FPs and FNs are notably reduced.
This result demonstrates that the FIN, by enhancing the interaction and representation capabilities among features, significantly refines the model's decision boundary between positive and negative samples. Notably, in high-noise scenarios with imbalanced data, the FIN reduces the False Positive Rate (FPR) to 0.0990, a decrease of 0.0145 compared to the baseline model, while boosting the TNR to 0.9010. Based on the distributions shown in Figure 12 and Figure 13, we also calculated the Matthews correlation coefficient (MCC): when the ratio of positive to negative samples is 1:1, the MCC is 0.925; when the ratio is 1:2, the MCC is 0.916, demonstrating the strong performance of our network in the identity authentication task. This improvement highlights how the integration of the SAM's robustness with the FIN's feature enhancement effectively reduces false positive misidentifications while preserving recall, so that the model strikes a sound balance between precision, recall, and security in identity authentication.
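For reference, the MCC values reported above follow the standard definition in terms of the confusion-matrix counts:
\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} .
\]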
This finding underscores the pivotal role of the FIN in identity authentication tasks. By strengthening the interaction between the left and right ear features, the FIN effectively minimizes cross-individual similarity interference, allowing the model to differentiate between individuals more precisely and significantly enhancing the reliability of the authentication process. Furthermore, the experiments validate the threshold selection strategy, confirming that the intersection of the True Positive Rate (TPR) and 1-FPR serves as a well-justified optimal threshold and ensures that the final model achieves strong discriminative power across sample categories. Even in scenarios dominated by negative samples, our model maintains excellent performance, highlighting its robust generalization capability in identity authentication tasks.
To gain a deeper understanding of the model’s decision-making process and enhance its interpretability, we introduced SHAP (SHapley Additive exPlanations) value analysis. By applying the DeepExplainer technique, we were able to visualize the key areas the model focused on when determining whether two ear images belonged to the same individual. The feature importance heatmaps generated by SHAP analysis clearly showed the contribution of each ear region in the model’s decision-making process, where the brighter areas indicated a greater influence on the final decision.
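A minimal sketch of how DeepExplainer can be applied in this setting is given below; all names are illustrative, and for clarity the network is treated as taking a single image tensor, whereas the paired-input model used here would receive a list of left- and right-ear tensors instead.

```python
import numpy as np
import shap
import torch

# Assumed: `model` is the trained PyTorch network and `images` is a CPU tensor of
# preprocessed ear images in NCHW layout; all names are illustrative.
background = images[:50]                              # background samples for the explainer
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(images[50:60])    # attributions for 10 test images

# shap.image_plot expects channel-last arrays, so move the channel axis.
if not isinstance(shap_values, list):
    shap_values = [shap_values]
shap_numpy = [np.transpose(s, (0, 2, 3, 1)) for s in shap_values]
test_numpy = np.transpose(images[50:60].numpy(), (0, 2, 3, 1))
shap.image_plot(shap_numpy, test_numpy)               # per-region importance heatmaps
```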
As shown in Figure 14, the upper row presents the feature importance distribution of the left and right ears from the same individual, while the lower row shows the comparison of the left and right ears from different individuals. In the same-individual comparison, the auricle (outer rim), earlobe, and concha (bowl-shaped cavity) regions consistently exhibit high-brightness patterns, indicating that these anatomical structures carry the most person-specific biometric information. In contrast, the comparison between different individuals displays distinct dark areas and inconsistent brightness distributions, reflecting the model's ability to identify significant differences in these regions.
The use of SHAP explainability analysis not only validated the effectiveness of the model but also provided a clear direction for further optimization of the authentication system, specifically emphasizing the anatomical structures of the ear that play a decisive role in biometric differentiation. These findings are also consistent with the existing knowledge in the field of ear biometric research, further confirming the theoretical and practical validity of the method we proposed.