In the tasks of left–right ear similarity detection and identity authentication, designing an effective network architecture is crucial for improving model performance. To this end, we conducted a series of ablation experiments to analyze the impact of different modules and loss functions, systematically evaluating the contribution of each component to the final result and providing empirical evidence for model design. The experiments focus on the roles of the Symmetry Alignment Module (SAM) and the Feature Interaction Network (FIN), as well as the relative importance of the contrastive loss, symmetry loss, and BCE loss during training. By incrementally analyzing combinations of modules and loss functions, we can gain a deeper understanding of their specific impacts on the ear similarity detection and identity authentication tasks.

In both the similarity detection and identity authentication processes, the quantities we compute are True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN). Specifically, TP is the number of positive sample pairs correctly classified as positive, FN the number of positive pairs incorrectly classified as negative, FP the number of negative pairs incorrectly classified as positive, and TN the number of negative pairs correctly classified as negative.
The metrics to be calculated in left–right ear similarity detection are as follows:
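Assuming the standard formulations in terms of the counts defined above:
\[
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, &\quad \text{Precision} &= \frac{TP}{TP + FP},\\[2pt]
\text{Recall (TPR)} &= \frac{TP}{TP + FN}, &\quad F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{aligned}
\]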
The model was trained using a batch size of 32 for 100 epochs with an initial learning rate of , optimized with the Adam optimizer. A cosine annealing learning rate scheduler was employed, with a minimum learning rate of , balancing stability and convergence. The feature dimension was set to 512 to provide sufficient representation power. Data augmentation was applied to increase dataset diversity and improve model robustness. Mixed-precision training was used to optimize computational efficiency on CUDA-enabled devices. A custom multi-task loss function with dynamic weighting was designed to adapt during training. Early stopping with a patience of 10 epochs prevented overfitting. A validation set was allocated at 20% of the total data to monitor generalization performance throughout the training process.
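The configuration above can be summarized in the following illustrative sketch. The learning-rate values are placeholders (the exact numbers are not given here), and the model, data loaders, multi-task loss, and evaluation helper are assumed to exist under the hypothetical names shown; this is a sketch of the described setup, not the authors' implementation.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Placeholders: the exact learning rates are not stated in this excerpt.
initial_lr = 1e-3
min_lr = 1e-6
num_epochs = 100
patience = 10  # early stopping patience

optimizer = torch.optim.Adam(model.parameters(), lr=initial_lr)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs, eta_min=min_lr)
scaler = GradScaler()  # mixed-precision training on CUDA devices

best_val_loss, epochs_without_improvement = float("inf"), 0
for epoch in range(num_epochs):
    model.train()
    for left_ear, right_ear, label in train_loader:
        optimizer.zero_grad()
        with autocast():  # forward pass in mixed precision
            similarity, features = model(left_ear, right_ear)  # hypothetical outputs
            loss = multi_task_loss(similarity, features, label, epoch)  # dynamic weighting
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()

    val_loss = evaluate(model, val_loader)  # assumed helper on the 20% validation split
    if val_loss < best_val_loss:
        best_val_loss, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # early stopping
```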
4.1.1. Left–Right Ear Similarity Detection
In the task of left–right ear similarity detection, this study used 300 pairs of positive samples and a corresponding number of negative samples for training, with 1000 pairs used for testing. During testing, a threshold of 0.5 was set, and if the similarity score between the left and right ears exceeded 0.5, the sample was considered positive; otherwise, it was considered negative.
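As a minimal illustration of this decision rule and of how the counts defined earlier are accumulated, assuming per-pair similarity scores and binary labels stored in arrays (all names are illustrative):

```python
import numpy as np

def confusion_counts(similarities, labels, threshold=0.5):
    """Count TP, FN, FP, TN for pairwise similarity scores.
    labels: 1 for positive pairs (same person's left/right ears), 0 for negative pairs."""
    preds = (np.asarray(similarities) > threshold).astype(int)
    labels = np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    tn = int(np.sum((preds == 0) & (labels == 0)))
    return tp, fn, fp, tn
```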
Table 1 presents the impact of different modules and loss functions on model performance in the similarity detection task.
Figure 9 shows the similarity distribution scatter plot and the distribution plot for 1000 pairs of positive and 1000 pairs of negative samples.
Combining the ablation metrics for the similarity detection task in Table 1 with the similarity distribution scatter plots of positive and negative ear samples in Figure 9, we can further analyze the impact of the individual network modules. The basic ResNet18 model trained with only the BCE loss learns basic left–right ear similarity features but has limited discriminative power: its accuracy is 0.8992 and its F1 score 0.9051, and the similarity distributions of positive and negative samples still overlap substantially. Introducing the contrastive loss improves performance markedly, raising the accuracy of the ResNet18+BCE loss+contrastive loss model to 0.9493 and the F1 score to 0.9488. When the ResNet18 model instead employs all three loss functions, its accuracy, precision, and F1 score are slightly lower than with the BCE and contrastive losses alone. The primary reason is the absence of the SAM module, which limits the effectiveness of the symmetry loss: SAM is designed to optimize the learning of symmetry features between the left and right ears, and without it the symmetry loss conflicts with the BCE and contrastive losses rather than enhancing performance, producing a slight decline in these metrics. Nevertheless, as shown in the scatter and distribution plots, a bottleneck remains even with the additional loss; some overlap persists between the positive and negative samples (Figure 9a), which limits the model's discriminative ability.
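To make the loss combination concrete, the following sketch shows one conventional way such a weighted multi-loss objective can be assembled; the BCE and margin-based contrastive terms are standard formulations, while the symmetry loss is left as an abstract term and the weights are placeholders, since their exact forms and the dynamic weighting scheme are defined elsewhere in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(dist, label, margin=1.0):
    # Conventional pairwise contrastive loss: pull positive pairs together,
    # push negative pairs beyond the margin. `dist` is the feature distance,
    # `label` is 1 for positive pairs and 0 for negative pairs (float tensor).
    pos = label * dist.pow(2)
    neg = (1 - label) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

def total_loss(sim_logits, dist, sym_term, label, w_bce=1.0, w_con=1.0, w_sym=1.0):
    # Weighted sum of BCE, contrastive, and symmetry terms; the weights are
    # placeholders for the paper's dynamic weighting scheme, and `sym_term`
    # stands in for the symmetry loss defined in the paper.
    bce = F.binary_cross_entropy_with_logits(sim_logits, label.float())
    return w_bce * bce + w_con * contrastive_loss(dist, label) + w_sym * sym_term
```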
To further optimize the stability and accuracy of the similarity detection task, we introduce the Symmetry Alignment Module (SAM), which dynamically adjusts the features by utilizing the symmetry information of the left and right ears. As a result, the accuracy of the ResNet18+SAM model improves to 0.9435, and the F1 score reaches 0.9430. As shown in Figure 9b, compared to the model using only ResNet18, the addition of the SAM module makes the separation between positive and negative samples clearer. It reduces the number of negative samples with high similarity, thereby decreasing the false positive rate and enhancing the model's discriminative ability. However, some positive and negative samples still cross near the 0.5 threshold, indicating that although SAM improves feature alignment, the model has not yet achieved optimal classification performance.
After incorporating the Feature Interaction Network (FIN), the ResNet18+SAM+FIN model achieves outstanding performance in the similarity detection task, reaching an accuracy of 0.9903 and an F1 score of 0.9898. This is further supported by the scatter distribution plot in Figure 9c, which shows that, under this configuration, the separation between positive and negative samples is at its best: negative samples are predominantly clustered between 0 and 0.5, while positive samples are concentrated between 0.9 and 1.0, with minimal overlap between the two regions. This clear distinction indicates that the FIN enhances the interaction mechanism across feature layers, enabling the model to more effectively capture subtle feature disparities between the left and right ear images and significantly improving the precision of similarity detection. It can therefore be inferred that the SAM provides robust symmetry calibration, while the FIN further refines the model's feature interaction capabilities; the synergy between these components enables the ResNet18+SAM+FIN model to deliver exceptional classification performance in the similarity detection task, achieving optimal separation of positive and negative samples.
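Purely as an illustration of how the three components compose in a Siamese-style pipeline (shared backbone encoding, symmetry alignment, feature interaction, similarity head), a hypothetical skeleton is sketched below; the SAM and FIN bodies are placeholders, and all dimensions and names are assumptions rather than the paper's actual design.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class EarSimilarityNet(nn.Module):
    # Illustrative composition only: backbone features -> symmetry alignment ->
    # feature interaction -> similarity score. SAM/FIN internals are placeholders.
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # shared ResNet18 encoder
        self.sam = nn.Identity()  # placeholder for the Symmetry Alignment Module
        self.fin = nn.Sequential(  # placeholder for the Feature Interaction Network
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, 1)  # similarity logit

    def forward(self, left, right):
        f_l = self.encoder(left).flatten(1)   # (B, 512) left-ear features
        f_r = self.encoder(right).flatten(1)  # (B, 512) right-ear features
        f_l, f_r = self.sam(f_l), self.sam(f_r)          # symmetry-aware alignment
        fused = self.fin(torch.cat([f_l, f_r], dim=1))   # cross-feature interaction
        return torch.sigmoid(self.head(fused)).squeeze(1)  # similarity in [0, 1]
```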
4.1.2. Performance and Analysis in Human Ear Authentication
In the authentication task, different positive-to-negative sample ratios were tested.
Table 2 presents the impact of different modules and loss functions on the model performance for ear authentication using 1000 pairs of positive and 1000 pairs of negative ear samples. In the identity authentication task, we formulated an optimization problem to set the threshold dynamically; the optimal threshold is calculated using Equation (13). When the input sample is a positive sample, a similarity above the threshold is classified as the same person (authentication success); otherwise it is classified as different people (authentication failure). When the input sample is a negative sample, a similarity above the threshold is incorrectly classified as the same person (authentication failure); otherwise it is classified as different people (authentication success).
In the identity authentication task, to assess generalization we tested two ratios of positive to negative samples, 1:1 and 1:2. We designed an optimization function J to find the optimal threshold by determining the intersection point of TPR and 1-FPR; this intersection represents the balance point at which the model favors neither positive nor negative samples when distinguishing between them. For the 1:1 ratio, the optimal threshold selection curves and ROC curves for the three networks are shown in Figure 10.
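A minimal sketch of this threshold search, assuming per-pair similarity scores and labels and using scikit-learn's ROC utilities (the exact objective J is given by Equation (13); the code below simply locates the point where TPR and 1-FPR intersect):

```python
import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(labels, similarities):
    """Return the threshold where TPR is closest to 1 - FPR, the balance point
    used for threshold selection; at this point FPR ~= FNR, i.e. the EER."""
    fpr, tpr, thresholds = roc_curve(labels, similarities)
    idx = np.argmin(np.abs(tpr - (1 - fpr)))
    return thresholds[idx], fpr[idx]  # selected threshold and the EER estimate
```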
From the ablation study presented in Table 2 and the optimal threshold analysis shown in Figure 10, it can be observed that in the identity authentication task the baseline ResNet18 achieves an accuracy of only 0.8863, with both precision and True Positive Rate (TPR) between 0.8851 and 0.8879, suggesting some misidentification issues. According to the ROC curve data, the Equal Error Rate (EER) for the baseline ResNet18 is 0.1080, corresponding to a threshold of 0.7672, which is consistent with the optimal threshold identified for this configuration. This relatively high EER indicates room for improvement in balancing false acceptance and false rejection rates.
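For reference, the rates used in this analysis follow the usual relations
\[
\mathrm{TNR} = \frac{TN}{TN + FP} = 1 - \mathrm{FPR}, \qquad \mathrm{FNR} = \frac{FN}{FN + TP} = 1 - \mathrm{TPR},
\]
and the EER is the common error rate at the threshold where FPR = FNR.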
The introduction of the SAM module significantly enhances the model’s ability to distinguish negative samples. The True Negative Rate (TNR) increases from 0.8847 to 0.9003, while the False Positive Rate (FPR) decreases from 0.1153 to 0.0997. The EER improves to 0.0980, aligning with an optimal threshold of 0.7252. This reduction in the optimal threshold from 0.7672 to 0.7252 reflects a shift in the decision boundary, demonstrating that SAM effectively reduces incorrect matches by refining the model’s sensitivity to negative samples. However, limitations in feature interaction continue to constrain further performance gains.
To overcome this, the FIN module is incorporated to optimize feature representation. With the ResNet18+SAM+FIN approach, the accuracy rises to 0.9252, with precision, TPR, and F1 score all reaching 0.9252. The false positive rate further decreases to 0.0748, and the EER drops to 0.0360, corresponding to an optimal threshold of 0.8875. The increase in the optimal threshold from 0.7252 to 0.8875 across these enhancements indicates a progressively stricter criterion for positive sample acceptance, enabling the model to achieve superior discrimination while maintaining high accuracy.
The ablation study of the various modules in the identity authentication task under a 1:2 positive-to-negative sample ratio, detailed in Table 3, and the ROC curve analysis with optimal threshold selection, shown in Figure 11, demonstrate that the baseline ResNet18 model achieves an accuracy of 0.8863 on this dataset. Yet its precision drops sharply to 0.7960 compared with the 1:1 scenario, revealing that a predominance of negative samples undermines the model's ability to accurately identify positive instances and elevates misclassification risks. The ROC data indicate an EER of 0.1135 at an optimal threshold (Best_threshold) of 0.7959, reflecting a suboptimal balance between the False Acceptance Rate (FAR) and False Rejection Rate (FRR). Despite this, the TPR remains strong at 0.8860, consistent with the 1:1 case, indicating robust positive sample recall. However, the abundance of negative samples limits the True Negative Rate (TNR) to 0.8865 and sustains a high FPR of 0.1135. These findings highlight the baseline ResNet18's limited adaptability to imbalanced data without the enhancement modules, particularly in negative-sample-dominated settings, where its precision in identifying positive instances is significantly compromised.
The incorporation of the SAM module refines the model's recognition capabilities, subtly shifting its performance metrics. While accuracy dips marginally to 0.8737, the EER improves significantly to 0.0970, corresponding to an optimal threshold of 0.7177. This reduction from the baseline threshold of 0.7959 reflects a recalibrated decision boundary, enhancing the model's resilience to data perturbations and bolstering its generalization to negative samples; the improved EER evidences a more balanced trade-off between false acceptance and rejection rates. The TPR remains robust at 0.8740, closely approximating the baseline, yet the TNR declines slightly to 0.8735, accompanied by a modest rise in the FPR to 0.1265. Precision, impacted by the imbalanced sample distribution, decreases further to 0.7755. Nevertheless, the SAM's stabilizing influence sustains a TPR near 0.9, effectively minimizing false rejections while improving on the baseline model's EER and thereby enhancing overall performance on imbalanced data.
The ResNet18+SAM+FIN configuration significantly enhances the model's performance across key metrics. Accuracy climbs to 0.9010 and precision improves to 0.8198, outperforming both the standalone ResNet18 and the ResNet18+SAM model. The optimal threshold rises to 0.8875, accompanied by a substantial reduction in the EER to 0.0425 at this Best_threshold. The increase in threshold from 0.7177 to 0.8875 underscores the FIN's role in refining feature representations and greatly improves the model's capacity to discern positive samples in a negative-sample-dominated context; as the optimal threshold increases, the model's acceptance criterion for positive samples becomes stricter, which reduces both False Positives and False Negatives. The TPR reaches 0.9010, boosting the F1 score to 0.8585, while the FPR drops to 0.0990 and the TNR stabilizes at 0.9010.
The comparative analysis of the classification metrics True Positive (TP), True Negative (TN), False Negative (FN), and False Positive (FP) under the optimal threshold for the different positive sample ratios is shown in Figure 12 and Figure 13. In the bar charts, the green bars represent TPs, the blue bars TNs, the red bars FPs, and the orange bars FNs. With the integration of the SAM and FIN modules, the model's performance improves systematically: the counts of TPs and TNs increase significantly, while FPs and FNs are notably reduced.
This result demonstrates that the FIN, by enhancing the interaction and representation capabilities among features, significantly refines the model's decision boundary between positive and negative samples. Notably, in high-noise scenarios with imbalanced data, the FIN reduces the False Positive Rate (FPR) to 0.0990, a decrease of 0.0145 compared to the baseline model, while boosting the TNR to 0.9010. Based on the distributions shown in Figure 12 and Figure 13, we also calculated the Matthews correlation coefficient (MCC): when the ratio of positive to negative samples is 1:1, the MCC is 0.925; when the ratio is 1:2, the MCC is 0.916, demonstrating the strong performance of our network in the identity authentication task. This improvement highlights how the integration of the SAM's robustness with the FIN's feature enhancement effectively reduces false positive misidentifications while preserving recall, so that the model strikes a sound balance between precision, recall, and security in identity authentication.
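For reference, the MCC values reported above follow the standard definition in terms of the confusion-matrix counts:
\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} .
\]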
This finding underscores the pivotal role of the FIN in identity authentication tasks. By strengthening the interaction between the left and right ear features, the FIN effectively minimizes cross-individual similarity interference, allowing the model to differentiate between individuals more precisely and significantly enhancing the reliability of the authentication process. Furthermore, the experiments validate the threshold selection strategy, confirming that the intersection of the True Positive Rate (TPR) and 1-FPR serves as a well-justified optimal threshold and ensures that the final model achieves strong discriminative power across sample categories. Even in scenarios dominated by negative samples, our model maintains excellent performance, highlighting its robust generalization capability in identity authentication tasks.
To gain a deeper understanding of the model’s decision-making process and enhance its interpretability, we introduced SHAP (SHapley Additive exPlanations) value analysis. By applying the DeepExplainer technique, we were able to visualize the key areas the model focused on when determining whether two ear images belonged to the same individual. The feature importance heatmaps generated by SHAP analysis clearly showed the contribution of each ear region in the model’s decision-making process, where the brighter areas indicated a greater influence on the final decision.
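A minimal sketch of how DeepExplainer can be applied in this setting is given below; all names are illustrative, and for clarity the network is treated as taking a single image tensor, whereas the paired-input model used here would receive a list of left- and right-ear tensors instead.

```python
import numpy as np
import shap
import torch

# Assumed: `model` is the trained PyTorch network and `images` is a CPU tensor of
# preprocessed ear images in NCHW layout; all names are illustrative.
background = images[:50]                              # background samples for the explainer
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(images[50:60])    # attributions for 10 test images

# shap.image_plot expects channel-last arrays, so move the channel axis.
if not isinstance(shap_values, list):
    shap_values = [shap_values]
shap_numpy = [np.transpose(s, (0, 2, 3, 1)) for s in shap_values]
test_numpy = np.transpose(images[50:60].numpy(), (0, 2, 3, 1))
shap.image_plot(shap_numpy, test_numpy)               # per-region importance heatmaps
```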
As shown in Figure 14, the upper row presents the feature importance distribution of the left and right ears from the same individual, while the lower row shows the comparison of the left and right ears from different individuals. In the same-individual comparison, the auricle (outer rim), earlobe, and concha (bowl-shaped cavity) regions consistently exhibit high-brightness patterns, indicating that these anatomical structures carry the most person-specific biometric information. In contrast, the comparison between different individuals displays distinct dark areas and inconsistent brightness distributions, reflecting the model's ability to identify significant differences in these regions.
The use of SHAP explainability analysis not only validated the effectiveness of the model but also provided a clear direction for further optimization of the authentication system, specifically emphasizing the anatomical structures of the ear that play a decisive role in biometric differentiation. These findings are also consistent with the existing knowledge in the field of ear biometric research, further confirming the theoretical and practical validity of the method we proposed.