1. Introduction
Human ear recognition is a biometric technology that emerged at the end of the last century. The human ear has unique physiological characteristics and favorable viewing angles [1], which gives ear recognition natural advantages over other biometric technologies. Currently, relatively mature biometric technologies include face recognition, fingerprint recognition, and iris recognition [2]. Among them, face recognition is affected by various factors, such as changes in facial expression, the presence of glasses, and facial hair; ear recognition, in contrast, is almost independent of these factors [3]. Acquiring ear images is also much easier than acquiring fingerprints, because ear images can be captured covertly without a subject's cooperation. Compared to iris recognition, the installation cost of ear image capture devices is relatively low, and acquiring iris information is more complex than acquiring ear images. Therefore, human ear recognition technology can be applied in many fast-paced identity verification scenarios. Although ear recognition has been studied extensively worldwide, the technology is not yet mature, and considerable work remains before it can be applied in real life. In-depth research on this technology can actively promote and improve contactless remote identification. The worldwide outbreak of COVID-19 over the past three years has affected many biometric systems; face recognition, for example, is severely impacted when people wear masks, whereas ear recognition can still support identity confirmation [4]. In addition, it performs well in financial and surveillance security applications [5].
Computer vision and machine learning techniques have developed significantly in recent years. Among them, deep convolutional neural networks have become popular among most researchers and have been applied to almost all areas of computer vision, including ear recognition. Deep convolutional neural networks fuse feature extraction and classification into an end-to-end model that can handle different practical problems by learning representations of the input data. Most ear recognition methods based on hand-crafted features do not use standard performance evaluation metrics or baseline ear databases, and the collected subject ear images exhibit only slight variation. When these methods are confronted with an ear database with significant asymmetry collected in an unconstrained environment, their recognition performance is significantly worse than that of deep learning-based approaches. In deep feature extraction methods, the parameters of a static convolution are set and then fixed, which can limit the effectiveness of ear image feature extraction. Dynamic convolution [6], in contrast, aggregates multiple parallel convolution kernels to adaptively adjust the convolution parameters and further refine ear features. The ECA [7] module realizes cross-channel information interaction, suppresses invalid features, and increases the feature weights of the ear geometry region. Dynamic convolution and the ECA module can significantly enhance the feature representation ability of a model and have shown excellent performance in CIFAR and ImageNet classification [6,7,8,9,10], scene recognition [10], ancient Chinese character recognition [11], fine-grained image classification [12], and plant disease recognition [13]. Therefore, we propose a feature fusion human ear recognition method based on channel features and dynamic convolution (CFDCNet).
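To make the mechanism concrete, the sketch below gives a minimal PyTorch-style implementation of dynamic convolution in the spirit of [6]: K parallel kernels are aggregated per input sample using attention weights computed from a globally pooled descriptor, and the aggregated kernel is applied via a grouped convolution. The class and parameter names (DynamicConv2d, K, temperature) are illustrative assumptions, not the exact implementation used in this paper.

```python
# Minimal sketch of dynamic convolution: per-sample attention over K kernels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, K=4, reduction=4, temperature=30.0):
        super().__init__()
        self.K, self.in_ch, self.out_ch = K, in_ch, out_ch
        self.kernel_size = kernel_size
        self.temperature = temperature
        # K parallel convolution kernels and biases.
        self.weight = nn.Parameter(
            torch.randn(K, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        self.bias = nn.Parameter(torch.zeros(K, out_ch))
        # Small attention branch: squeeze -> FC -> ReLU -> FC (-> softmax in forward).
        hidden = max(in_ch // reduction, 4)
        self.attn = nn.Sequential(
            nn.Linear(in_ch, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, K))

    def forward(self, x):
        b, c, h, w = x.shape
        # Per-sample attention weights over the K kernels.
        squeeze = F.adaptive_avg_pool2d(x, 1).flatten(1)                 # (b, c)
        alpha = F.softmax(self.attn(squeeze) / self.temperature, dim=1)  # (b, K)
        # Aggregate the kernels for each sample.
        weight = torch.einsum('bk,koihw->boihw', alpha, self.weight)
        bias = torch.einsum('bk,ko->bo', alpha, self.bias)
        # Run all samples in one call by treating the batch as groups.
        weight = weight.reshape(b * self.out_ch, self.in_ch,
                                self.kernel_size, self.kernel_size)
        out = F.conv2d(x.reshape(1, b * c, h, w), weight, bias.flatten(),
                       padding=self.kernel_size // 2, groups=b)
        return out.reshape(b, self.out_ch, out.shape[-2], out.shape[-1])
```

The grouped-convolution trick lets every sample in the batch use its own aggregated kernel within a single conv2d call, so the adaptivity costs little extra computation.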
Our contributions can be summarized as follows: (1) we propose a feature fusion human ear recognition method based on channel features and dynamic convolution [6], which achieves good recognition performance in both constrained and unconstrained ear recognition scenarios; (2) to cope with the significant differences in ear sample features within the same category and across different categories, we introduce dynamic convolution to extract ear image features adaptively, enhancing the robustness of the ear feature representation; (3) we introduce an ECA mechanism [7] to efficiently fuse the depth and spatial information of ear images and suppress invalid features such as background and noise; (4) we use maximum pooling in the network to retain the primary feature information of the ear contour as much as possible and to prevent the model from overfitting; and (5) we performed experiments on the AMI [14] and AWE [15,16,17] human ear databases and achieved 99.70% and 72.70% Rank-1 (R1) recognition accuracy, respectively. The recognition performance of our method is significantly better than that of the DenseNet-121 [18] model and most existing human ear recognition methods.
The rest of this paper is organized as follows.
Section 2 briefly reviews past work;
Section 3 describes our proposed method;
Section 4 presents the experimental results and analysis; and
Section 5 presents a conclusion.
2. Related Work
Earlier researchers performed ear identification based on hand-crafted features. In [19], the authors used Haar wavelets for ear localization. Their method is robust against occlusions, and its recognition performance is significantly better than Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Orthogonal Locality Preserving Projection (OLPP). The drawback of this method is that it was evaluated only on small datasets, and no standard performance evaluation metrics were used to assess it. In [20], the authors proposed an ear recognition method combining homographic distance and Scale-Invariant Feature Transform (SIFT) features. The method outperforms PCA in recognition and is robust to slight angle changes, background noise, and occlusions. Its drawback is that it does not use a benchmark database or specify evaluation metrics to assess the model's performance. In [21], the authors first segmented the ear using Fourier and morphological descriptors and then used log-Gabor, Gabor, and complex Gabor filters for local ear feature extraction. The method was evaluated on a private database containing 465 ear images, and the results show that the log-Gabor filter has the best feature extraction performance. The disadvantage is that no exact performance evaluation metrics or benchmark database were used. In [22], the authors proposed an ear recognition method using 2D orthogonal filters for ear feature extraction and evaluated it on the IITD and UND ear databases. The results show that the 2D orthogonal filters perform better than the alternatives. The disadvantage is that the databases used for evaluation exhibit only slight variation. In [23], the authors first localized the ear using the snake model and then used geometric features for ear identification. The model was evaluated on the IIT Delhi ear database. The drawback of their method is that it was validated only on a small database whose ear images were collected indoors with slight variation. In [24], the authors used a robust pattern recognition technique for human ear recognition. The method uses descriptors for ear feature extraction, and the extracted features are robust to rotation and illumination changes. The authors tested it on the AMI [14], IITD-II, and AWE [15,16,17] databases, and its recognition performance is significantly better than that of other descriptor-based methods. The disadvantage is that its recognition performance on unconstrained datasets remains limited. In [25], the authors first extracted the local features of the ear using the local phase quantization operator, then extracted the global features of the ear using the Gabor–Zernike operator, and finally selected the optimal ear features using a genetic algorithm. The recognition performance of this method on three constrained databases is satisfactory, but on unconstrained databases, its recognition performance is lower than that of deep learning-based methods.
Researchers have found that some application scenarios with high security requirements call for the combination of multiple biometrics, so multimodal approaches to ear recognition have been explored. In [26], the authors proposed a multimodal biometric technique combining the ear and the iris. They used a local feature descriptor, SIFT, for feature fusion, and evaluated the method on the USTB-II ear database and the CASIA iris database. In terms of accuracy, the method outperforms ear biometric recognition alone. In [27], the authors proposed a multimodal recognition system combining side faces and ears. They first augmented the images in the database, then obtained the local extrema of the images using the Hessian matrix, and finally used Speeded Up Robust Features (SURF) to construct the scale space and localize the image feature points. The results on three ear and side face databases show that multimodal recognition of ears and side faces performs better than ear recognition alone. In [28], the authors used ears and fingerprints for multimodal recognition, with Local Binary Patterns (LBP) used to extract the local texture features of the images. The system achieved an accuracy of 98.10%. The drawback is that the system was not evaluated on a benchmark database.
In recent years, ear recognition methods based on deep feature learning have achieved good results. In [29], the authors used a Convolutional Neural Network (CNN) consisting of convolutional, maximum pooling, and fully connected layers for ear feature extraction, evaluated on the USTB-III ear database. The disadvantage is that the method does not use standard evaluation metrics, and the database used for the assessment is constrained and small. In [30], the authors fine-tuned the CNN frameworks of VGG face, VGG, ResNet, AlexNet, and GoogLeNet to perform ear recognition. To enable the networks to learn multi-scale information, the last pooling layer of each CNN model was replaced with a spatial pyramid pooling layer, and a combination of softmax and center loss was used for training. The authors also created an unconstrained ear dataset called USTB HelloEar. The results show that the VGG face model has the best recognition performance; the drawback is that no performance evaluation metrics are used to evaluate the models. In [31], the authors first used RefineNet for ear detection and then hand-crafted feature-based and ResNet models for ear recognition. The models were tested on the UERC database, and the recognition performance of the deep learning-based approach was significantly better than that of the hand-crafted feature-based approach. The disadvantage is that the novelty is limited, since ear detection and recognition are performed with existing models. In [32], the authors used ensemble learning, feature extraction, and other learning strategies for ear recognition based on network models such as Inception, ResNeXt, and VGG. They evaluated the models by varying the input image size and achieved good recognition results on the EarVN1.0 unconstrained ear database. The drawback is that the method was tested on only one dataset and not compared with other ear recognition techniques. A CNN model for ear recognition was designed in [2] and evaluated on the AMI and IITD-II databases; however, the authors did not use standard performance evaluation metrics, and the databases used are constrained, with slight variation in the ear images. In [33], ear recognition is performed with the NASNet model, and its performance is compared with MobileNet, VGG, and ResNet. The method was evaluated on the UERC-2017 unconstrained ear database and achieved the best recognition performance.
It is worth mentioning that most recognition methods based on hand-crafted features exhibit poor recognition performance on human ear datasets with highly variable illumination, angle, occlusion, and background. Therefore, we propose a feature fusion human ear recognition method based on channel features and dynamic convolution (CFDCNet). Building on the DenseNet-121 [18] model, we enhance the robustness of the ear feature representation by replacing the original convolutional layers with dynamic convolution [6] for adaptive extraction of ear image features. The weights of the important ear features are then increased by an efficient channel attention (ECA) mechanism [7]. Finally, we improve the model's generalization ability by using the maximum pooling operation to retain the ear's key features. We evaluated our model on two publicly available ear datasets, on which it exhibits good recognition performance.
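As an illustration of the channel attention used here, the following is a minimal PyTorch-style sketch of an ECA layer in the spirit of [7]: global average pooling produces a channel descriptor, a 1D convolution with an adaptively chosen kernel size models local cross-channel interaction, and a sigmoid gate rescales each channel of the feature map. The layer name and hyper-parameters (gamma, b) follow common ECA implementations and are assumptions rather than the exact code of CFDCNet.

```python
# Minimal sketch of an ECA channel-attention layer.
import math
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Kernel size adapted to the channel dimension, forced to be odd.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (batch, channels, H, W) -> per-channel descriptor.
        y = self.avg_pool(x)                                  # (b, c, 1, 1)
        y = y.squeeze(-1).transpose(-1, -2)                   # (b, 1, c)
        y = self.conv(y)                                      # local cross-channel interaction
        y = self.sigmoid(y).transpose(-1, -2).unsqueeze(-1)   # (b, c, 1, 1)
        return x * y                                          # reweight channels
```

Because the 1D convolution has only k weights, such a layer adds a negligible number of parameters while still reweighting channels for every input, which is why it can be inserted throughout a DenseNet-style backbone at little cost.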