Boosting the Performance of Deep Ear Recognition Systems Using Generative Adversarial Networks and Mean Class Activation Maps
Abstract
1. Introduction
- To mitigate the absence of color information and to enhance the visual quality of dark images before they are fed to a CNN, we introduced and trained a framework that employs a DCGAN model to colorize grayscale and dark ear images (a rough generator sketch follows this list).
- To improve the predictive ability of CNNs, we introduced a novel framework, termed Mean-CAM-CNN, which guides the CNN's focus toward the most salient common region containing the most pertinent features. The Mean-CAM procedure uses CAMs to delineate a region of interest (RoI) from images belonging to the same class, isolating the discriminative features that provide the representations essential for ear recognition.
- We extensively evaluated the proposed methods on two widely used and challenging ear recognition datasets (AMI and AWE), assessing performance through both graphical and statistical analyses.
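Since the colorization framework is described only at a high level here, the sketch below illustrates one way a DCGAN-style generator can map a one-channel grayscale ear image to a three-channel color output. The architecture, layer widths, and input size are our own assumptions, not the authors' implementation:

```python
import torch.nn as nn

class ColorizationGenerator(nn.Module):
    """Illustrative DCGAN-style generator: grayscale (1-channel) in, RGB out.
    All layer sizes are assumptions; the paper's exact design may differ."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                        # 128x128 -> 32x32
            nn.Conv2d(1, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.decoder = nn.Sequential(                        # 32x32 -> 128x128
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
            nn.Tanh(),                                       # DCGAN-style [-1, 1] output
        )

    def forward(self, gray):                                 # gray: (N, 1, H, W)
        return self.decoder(self.encoder(gray))              # rgb:  (N, 3, H, W)
```

In a full DCGAN setup, this generator would be trained adversarially against a discriminator that distinguishes real color ear images from colorized ones.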
2. Related Work
2.1. Handcrafted Methods
2.2. Deep Learning Methods
3. Proposed Approach
3.1. Preprocessing
3.2. Feature Extraction/Classification
3.2.1. CAM Methodology
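Following the CAM formulation of Zhou et al. [12], which this section builds on, the activation map for a class $c$ is the weighted sum of the feature maps $f_k(x, y)$ of the last convolutional layer, using the output-layer weights $w_k^c$:

$$M_c(x, y) = \sum_k w_k^c \, f_k(x, y)$$

Regions where $M_c(x, y)$ is large are those that contributed most to the prediction of class $c$.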
3.2.2. Proposed Mean-CAM Methodology
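The heading and the contribution list suggest that Mean-CAM averages the per-image CAMs of a class, $\bar{M}_c = \frac{1}{N_c} \sum_{i=1}^{N_c} M_c^{(i)}$, so that only regions salient across the whole class survive. A minimal NumPy sketch under that reading (not the authors' reference implementation):

```python
import numpy as np

def mean_cam(cams):
    """Average per-image CAMs (a list of HxW arrays) from one class into a
    single class-level map, rescaled to [0, 1]. A sketch of our reading of
    the Mean-CAM step, not the authors' reference implementation."""
    m = np.stack(cams, axis=0).mean(axis=0)
    m -= m.min()
    return m / (m.max() + 1e-8)  # normalize so a threshold in (0, 1) applies
```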
3.2.3. Mask Inference
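Experiment #2 (Section 4.5) sweeps a threshold τ from 0.1 to 0.9, which suggests that the binary mask is obtained by thresholding the normalized mean CAM at τ. The sketch below additionally assumes that the RoI is the bounding box of the retained region; both assumptions are ours:

```python
import numpy as np

def infer_mask(mean_cam, tau=0.5):
    """Binarize a [0, 1]-normalized mean CAM at threshold tau and return the
    mask plus the bounding box of the retained region (assumed to be the RoI)."""
    mask = mean_cam >= tau
    ys, xs = np.nonzero(mask)
    if ys.size == 0:  # nothing survives an overly strict threshold
        return mask, None
    return mask, (ys.min(), xs.min(), ys.max(), xs.max())  # top, left, bottom, right
```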
3.2.4. Global and Local Stages
4. Experimental Analysis
- Rank-1 and Rank-5 recognition rates.
- Cumulative match score curves (CMCs).
- Area under the CMC curve (AUCMC); a sketch computing all three metrics follows this list.
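All three metrics can be derived from a single matrix of per-class scores. The following implementation is ours, for illustration only, and treats AUCMC as the normalized area under the CMC curve:

```python
import numpy as np

def cmc_metrics(scores, labels):
    """scores: (N, C) per-class scores for N probes; labels: (N,) true ids.
    Returns rank-1 (%), rank-5 (%), the full CMC curve (%), and AUCMC (%)."""
    order = np.argsort(-scores, axis=1)                  # classes by descending score
    ranks = np.argmax(order == labels[:, None], axis=1)  # 0-based rank of true class
    cmc = np.array([(ranks <= r).mean() for r in range(scores.shape[1])]) * 100
    return cmc[0], cmc[4], cmc, cmc.mean()               # AUCMC = normalized area
```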
4.1. Datasets
4.1.1. AMI
4.1.2. AWE
4.2. Evaluation Protocols
- Normalization using the mean and standard deviation.
- Random rotation of the image between −20 and +20 degrees.
- Application of a Gaussian blur filter to the image.
- Adjustment of the hue, saturation, contrast, and brightness of the image within specified range values.
- Horizontal flipping of the image with a probability of 50% (a torchvision sketch of this pipeline follows the list).
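The augmentations listed above map directly onto standard torchvision transforms. The sketch below assumes torchvision; the kernel size, jitter ranges, and normalization statistics are illustrative values, since the text does not specify them:

```python
from torchvision import transforms

# Illustrative training pipeline for the augmentations listed above; the
# specific parameter values are our assumptions, not the paper's settings.
train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=20),                # rotate in [-20, +20] degrees
    transforms.GaussianBlur(kernel_size=5),               # Gaussian blur filter
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.1),      # hue/saturation/contrast/brightness
    transforms.RandomHorizontalFlip(p=0.5),               # flip with 50% probability
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],      # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])
```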
4.3. Setup
4.4. Experiment #1
4.5. Experiment #2
4.6. Experiment #3
4.7. Comparison
5. Conclusions
- Firstly, we aim to investigate various feature visualization techniques, such as t-distributed stochastic neighbor embedding (t-SNE), to gain deeper insight into the discriminative features inherent in ear images.
- Secondly, we plan to evaluate multiple CNN architectures side by side; this comparative study will identify the most effective architecture for our task and further optimize the system's performance.
- Lastly, we will explore potential synergies between deep-learned and handcrafted features, investigating combinations such as local binary patterns (LBP), robust local oriented patterns (RLOP), and local phase quantization (LPQ). By integrating these feature types, we aim to leverage the strengths of both deep learning and traditional feature engineering, resulting in a more robust and accurate identification system.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Wang, Z.; Yang, J.; Zhu, Y. Review of Ear Biometrics. Arch. Comput. Methods Eng. 2021, 28, 149–180.
2. Doghmane, H.; Bourouba, H.; Messaoudi, K.; Bourennane, E.B. Ear recognition based on discriminant multi-resolution image representation. Int. J. Biom. 2020, 12, 377–395.
3. Sforza, C.; Grandi, G.; Binelli, M.; Tommasi, D.; Rosati, R.; Ferrario, V. Age- and Sex-Related Changes in the Normal Human Ear. Forensic Sci. Int. 2009, 187, 110–111.
4. Yoga, S.; Balaih, J.; Rangdhol, V.; Vandana, S.; Paulose, S.; Kavya, L. Assessment of Age Changes and Gender Differences Based on Anthropometric Measurements of the Ear: A Cross-Sectional Study. J. Adv. Clin. Res. Insights 2017, 4, 92–95.
5. Ganapathi, I.I.; Ali, S.S.; Prakash, S.; Vu, N.S.; Werghi, N. A survey of 3D ear recognition techniques. ACM Comput. Surv. 2023, 55, 1–36.
6. Ma, Y.; Huang, Z.; Wang, X.; Huang, K. An Overview of Multimodal Biometrics Using the Face and Ear. Math. Probl. Eng. 2020, 2020, 6802905.
7. Beghriche, T.; Attallah, B.; Brik, Y.; Djerioui, M. A multi-level fine-tuned deep learning based approach for binary classification of diabetic retinopathy. Chemom. Intell. Lab. Syst. 2023, 237, 104820.
8. Mohammed, A.; Kora, R. A comprehensive review on ensemble deep learning: Opportunities and challenges. J. King Saud Univ. Comput. Inf. Sci. 2023, 35, 757–774.
9. Amrouni, N.; Benzaoui, A.; Zeroual, A. Palmprint Recognition: Extensive Exploration of Databases, Methodologies, Comparative Assessment, and Future Directions. Appl. Sci. 2023, 14, 153.
10. Matsuo, Y.; LeCun, Y.; Sahani, M.; Precup, D.; Silver, D.; Sugiyama, M.; Morimoto, J. Deep Learning, Reinforcement Learning, and World Models. Neural Netw. 2022, 152, 267–275.
11. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784.
12. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929.
13. Hassaballah, M.; Alshazly, H.A.; Ali, A.A. Ear Recognition Using Local Binary Patterns: A Comparative Experimental Study. Expert Syst. Appl. 2019, 118, 182–200.
14. Hassaballah, M.; Alshazly, H.A.; Ali, A.A. Robust Local Oriented Patterns for Ear Recognition. Multimed. Tools Appl. 2020, 79, 31183–31204.
15. Sarangi, P.P.; Mishra, B.S.P.; Dehuri, S.; Cho, S.B. An Evaluation of Ear Biometric System Based on Enhanced Jaya Algorithm and SURF Descriptors. Evol. Intell. 2020, 13, 443–461.
16. Sajadi, S.; Fathi, A. Genetic Algorithm Based Local and Global Spectral Features Extraction for Ear Recognition. Expert Syst. Appl. 2020, 159, 113639.
17. Khaldi, Y.; Benzaoui, A. Region of interest synthesis using image-to-image translation for ear recognition. In Proceedings of the 2020 International Conference on Advanced Aspects of Software Engineering (ICAASE), Constantine, Algeria, 28–30 November 2020; pp. 1–6.
18. Regouid, M.; Touahria, M.; Benouis, M.; Mostefai, L.; Lamiche, I. Comparative Study of 1D-Local Descriptors for Ear Biometric System. Multimed. Tools Appl. 2022, 81, 29477–29503.
19. Korichi, A.; Slatnia, S.; Aiadi, O. TR-ICANet: A Fast Unsupervised Deep-Learning-Based Scheme for Unconstrained Ear Recognition. Arab. J. Sci. Eng. 2022, 47, 9887–9898.
20. Alshazly, H.; Linse, C.; Barth, E.; Martinetz, T. Handcrafted versus CNN Features for Ear Recognition. Symmetry 2019, 11, 1493.
21. Alshazly, H.; Linse, C.; Barth, E.; Martinetz, T. Ensembles of Deep Learning Models and Transfer Learning for Ear Recognition. Sensors 2019, 19, 4139.
22. Priyadharshini, R.A.; Arivazhagan, S.; Arun, M. A Deep Learning Approach for Person Identification Using Ear Biometrics. Appl. Intell. 2020, 51, 2161–2172.
23. Khaldi, Y.; Benzaoui, A. A New Framework for Grayscale Ear Images Recognition Using Generative Adversarial Networks under Unconstrained Conditions. Evol. Syst. 2021, 12, 923–934.
24. Alshazly, H.; Linse, C.; Barth, E.; Idris, S.A.; Martinetz, T. Towards Explainable Ear Recognition Systems Using Deep Residual Networks. IEEE Access 2021, 9, 122254–122273.
25. Omara, I.; Hagag, A.; Ma, G.; Abd El-Samie, F.E.; Song, E. A Novel Approach for Ear Recognition: Learning Mahalanobis Distance Features from Deep CNNs. Mach. Vis. Appl. 2021, 32, 1–14.
26. Sharkas, M. Ear Recognition with Ensemble Classifiers; A Deep Learning Approach. Multimed. Tools Appl. 2022, 81, 43919–43945.
27. Xu, X.; Liu, Y.; Liu, C.; Lu, L. A Feature Fusion Human Ear Recognition Method Based on Channel Features and Dynamic Convolution. Symmetry 2023, 15, 1454.
28. Aiadi, O.; Khaldi, B.; Saadeddine, C. MDFNet: An unsupervised lightweight network for ear print recognition. J. Ambient Intell. Humaniz. Comput. 2023, 14, 13773–13786.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
30. Gonzalez, E.; Alvarez, L.; Mazorra, L. AMI Ear Database. 2008. Available online: http://www.ctim.es/research%20works/ami%20ear%20database (accessed on 10 April 2024).
31. Emeršič, Ž.; Struc, V.; Peer, P. Ear Recognition: More than a Survey. Neurocomputing 2017, 255, 26–39.
32. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 2012 Advances in Neural Information Processing Systems 25, Lake Tahoe, NV, USA, 3–8 December 2012; pp. 1097–1105.
33. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
34. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
Experiment #1: recognition performance of pretrained CNN architectures on the AMI and AWE datasets.

| Metric | Architecture | AMI | AWE |
|---|---|---|---|
| Rank-1 (%) | AlexNet | 88.50 | 30.25 |
| Rank-1 (%) | VGG-16 | 95.50 | 47.25 |
| Rank-1 (%) | VGG-19 | 91.50 | 40.25 |
| Rank-1 (%) | ResNet-50 | 98.66 | 57.75 |
| Rank-5 (%) | AlexNet | 95.50 | 50.25 |
| Rank-5 (%) | VGG-16 | 99.50 | 73.75 |
| Rank-5 (%) | VGG-19 | 97.50 | 66.75 |
| Rank-5 (%) | ResNet-50 | 99.66 | 79.00 |
| AUCMC (%) | AlexNet | 92.11 | 56.54 |
| AUCMC (%) | VGG-16 | 94.56 | 75.41 |
| AUCMC (%) | VGG-19 | 93.91 | 72.44 |
| AUCMC (%) | ResNet-50 | 100.00 | 96.54 |
Experiment #2: baseline ResNet-50 versus the proposed Mean-CAM-CNN for threshold values τ = 0.1 to 0.9.

| Model | AMI Rank-1 (%) | AMI Rank-5 (%) | AMI AUCMC (%) | AWE Rank-1 (%) | AWE Rank-5 (%) | AWE AUCMC (%) |
|---|---|---|---|---|---|---|
| ResNet-50 | 98.66 | 99.66 | 100.00 | 57.75 | 79.00 | 96.54 |
| Mean-CAM-CNN (τ = 0.1) | 96.67 | 100.00 | 100.00 | 62.25 | 79.50 | 96.70 |
| Mean-CAM-CNN (τ = 0.2) | 98.00 | 100.00 | 99.99 | 58.25 | 78.75 | 97.02 |
| Mean-CAM-CNN (τ = 0.3) | 98.66 | 99.66 | 99.99 | 68.50 | 85.75 | 97.99 |
| Mean-CAM-CNN (τ = 0.4) | 99.33 | 99.33 | 99.99 | 68.75 | 83.25 | 98.56 |
| Mean-CAM-CNN (τ = 0.5) | 99.67 | 100.00 | 100.00 | 74.50 | 89.50 | 98.93 |
| Mean-CAM-CNN (τ = 0.6) | 99.67 | 100.00 | 100.00 | 69.25 | 87.00 | 98.56 |
| Mean-CAM-CNN (τ = 0.7) | 98.33 | 100.00 | 100.00 | 67.50 | 87.25 | 98.34 |
| Mean-CAM-CNN (τ = 0.8) | 98.66 | 99.66 | 99.99 | 62.00 | 85.25 | 97.88 |
| Mean-CAM-CNN (τ = 0.9) | 96.00 | 99.00 | 99.99 | 60.00 | 78.75 | 96.81 |
Baseline (B/L) versus Mean-CAM predictions with confidence P for sample inputs (original images and visualizations not reproduced).

| Dataset | Input Class | B/L Prediction | Mean-CAM Prediction |
|---|---|---|---|
| AMI | 46 | 46 (P = 56.69%) | 46 (P = 90.10%) |
| AMI | 88 | 74 (P = 33.27%) | 88 (P = 99.83%) |
| AWE | 95 | 41 (P = 92.51%) | 95 (P = 99.14%) |
| AWE | 8 | 65 (P = 79.10%) | 8 (P = 98.96%) |
Experiment #3: performance with and without the colorization preprocessing.

| Configuration | AMI Rank-1 (%) | AMI Rank-5 (%) | AMI AUCMC (%) | AWE Rank-1 (%) | AWE Rank-5 (%) | AWE AUCMC (%) |
|---|---|---|---|---|---|---|
| Without preprocessing | 99.67 | 100.00 | 100.00 | 74.50 | 89.50 | 98.93 |
| With preprocessing | 100.00 | 100.00 | 100.00 | 76.25 | 91.25 | 99.96 |
Comparison with state-of-the-art methods; entries are Rank-1 recognition rates (%), and "/" indicates no reported result.

| Approach | Publication | Year | Method | AMI | AWE |
|---|---|---|---|---|---|
| Handcrafted | Hassaballah et al. [13] | 2019 | LBP variants | 73.71 | 49.60 |
| Handcrafted | Hassaballah et al. [14] | 2020 | RLOP | 72.29 | 54.10 |
| Handcrafted | Sarangi et al. [15] | 2020 | Jaya algorithm + SURF | / | 44.00 |
| Handcrafted | Sajadi and Fathi [16] | 2020 | GZ + LPQ | / | 53.50 |
| Handcrafted | Khaldi and Benzaoui [17] | 2020 | BSIF | / | 44.53 |
| Handcrafted | Regouid et al. [18] | 2022 | 1D multi-resolution LBP | 100.00 | 43.00 |
| Deep learning | Alshazly et al. [20] | 2019 | VGG-13/16/19 ensembles | 93.96 | / |
| Deep learning | Alshazly et al. [21] | 2019 | AlexNet (fine-tuned) | 94.50 | / |
| Deep learning | Priyadharshini et al. [22] | 2020 | CNN | 96.99 | / |
| Deep learning | Khaldi and Benzaoui [23] | 2021 | DCGAN + VGG-16 | 96.00 | 50.53 |
| Deep learning | Alshazly et al. [24] | 2021 | Combination of ResNet features | 99.64 | 67.25 |
| Deep learning | Omara et al. [25] | 2021 | Mahalanobis distance + CNN | 97.80 | / |
| Deep learning | Sharkas [26] | 2022 | Discrete curvelet transform + ensemble of ResNet features | 99.45 | / |
| Deep learning | Xu et al. [27] | 2023 | CFDCNet | 99.70 | 72.70 |
| Deep learning | Aiadi et al. [28] | 2023 | MDFNet | 97.67 | / |
| Deep learning | Our proposed method | 2024 | DCGAN + Mean-CAM-CNN | 100.00 | 76.25 |