Article

Gender Recognition Based on the Stacking of Different Acoustic Features

Vocational School of Technical Sciences, Ordu University, Ordu 52200, Turkey
Appl. Sci. 2024, 14(15), 6564; https://doi.org/10.3390/app14156564
Submission received: 31 May 2024 / Revised: 19 July 2024 / Accepted: 21 July 2024 / Published: 27 July 2024
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)

Abstract

A speech signal can provide various information about a speaker, such as their gender, age, accent, and emotional state. The gender of the speaker is the most salient piece of information contained in the speech signal and is directly or indirectly used in many applications. In this study, a new approach is proposed for recognizing the gender of the speaker based on the use of hybrid features created by stacking different types of features. For this purpose, five different features, namely Mel frequency cepstral coefficients (MFCC), Mel scaled power spectrogram (Mel Spectrogram), Chroma, Spectral contrast (Contrast), and Tonal Centroid (Tonnetz), and twelve hybrid features created by stacking these features were used. These features were applied to four different classifiers, two of which were based on traditional machine learning (KNN and LDA) while the other two were based on the deep learning approach (CNN and MLP), and the performance of each was evaluated separately. In the experiments conducted on the Turkish subset of the Common Voice dataset, it was observed that hybrid features, created by stacking different acoustic features, led to improvements in gender recognition accuracy ranging from 0.3% to 1.73%.

1. Introduction

In parallel with technological advances, human–computer interaction has increased, and technologies that employ more complex methods, such as speech recognition, have become widespread [1]. In systems that use speech recognition, artificial intelligence processes the speech signal and determines the necessary actions based on the extracted information. Speech production is a complex human process, and the resulting speech signal carries linguistic information such as vocabulary and syntax, as well as paralinguistic information such as the speaker’s age, gender, accent, and emotional state [2]. The paralinguistic cues that people routinely use when communicating with each other are now also being exploited in communication with computers. The speaker’s gender is one of the most salient of these cues and is widely used to personalize services; nevertheless, recognizing gender from the speech signal remains a challenging task. Gender recognition serves many applications, such as developing effective advertising and marketing strategies, enhancing human–machine interaction, identifying criminals, and increasing customer satisfaction through personalized services [3]. Moreover, gender information is also used in other speech-based recognition systems, for example to limit the search space to speakers of the same gender or to build gender-specific models, thereby increasing the accuracy and speed of these systems [4].
Various factors influence the accuracy of automatic gender recognition systems, including voice quality, data diversity, feature selection and extraction, and the design and evaluation of classifier models [5]. Among these factors, choosing and extracting suitable features from the speech signal is one of the most crucial steps and directly affects the recognition rate [6]. The Mel scaled power spectrogram (Mel Spectrogram), Mel frequency cepstral coefficients (MFCCs), spectral contrast (Contrast), power spectrogram chroma (Chroma), and tonal centroid features (Tonnetz) are among the most commonly used speech features. However, none of these features represents every aspect of speech, so the feature extraction process must be tailored to each task. After the features are extracted from the speech signals, they are fed into a classifier together with class labels indicating the speaker’s gender, and the classifier is trained. A dataset containing new speech data not used in the training phase is then used to test the performance of the model on unseen data. Many methods have been developed for classifying input data represented in different feature spaces. Each method approaches the problem differently and accordingly has different advantages and disadvantages [7]. Traditionally, machine learning methods such as Hidden Markov Models (HMM), Gaussian Mixture Models (GMM), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), and Support Vector Machines (SVM) were commonly used for gender recognition [8,9]. In recent years, deep learning has become one of the most popular areas of machine learning thanks to its success in fields such as computer vision [10], natural language processing [11], and speech recognition [12]. Deep learning algorithms tend to perform better when a large amount of training data is available, but they require powerful GPUs and significant memory for training. Moreover, they are highly dependent on data quality, and their performance suffers with noisy, incomplete, or biased data [13]. Traditional machine learning algorithms, on the other hand, are better suited to small datasets and do not require significant computing power [14].
In this study, we focus on feature extraction methods and compare the gender recognition performance of features that are frequently used in the literature: MFCC, Chroma, Mel Spectrogram, Contrast, and Tonnetz. The features are first evaluated separately, and then hybrid features created by stacking them in different combinations are evaluated. Four classifiers are used: KNN, LDA, MLP, and CNN. KNN and LDA are traditional machine learning approaches; KNN classifies based on the distance between feature vectors, while LDA classifies based on a linear combination of the features. MLP and CNN are deep learning approaches. MLP is considered the fundamental architecture of deep neural networks (DNN) and does not require as much computing power as modern deep learning architectures. CNN is the best-known and most widely used deep learning algorithm; it is highly effective at capturing complex patterns and relationships in data types such as images, audio, and sensor signals, and offers state-of-the-art accuracy across many tasks. In the study, 20 models are created using the individual features and 48 models are created using the hybrid features. The experimental results show that hybrid features created by stacking different acoustic features increase gender recognition accuracy, and the accuracy achieved is competitive with that of similar studies. In our literature review, no other study was found that applies a similar stacking approach to gender recognition; in this respect, the study contributes to the literature. The rest of the paper is organized as follows: Section 2 reviews related work on gender recognition from speech signals; Section 3 explains the materials and methods in detail; Section 4 presents the experimental results; and the paper ends with conclusions and future work.

2. Literature Review

Many studies in the literature have been conducted to automatically classify speakers by gender using various data processing techniques and machine learning methods. This section presents an overview of the current studies on this topic.
In [15], a new model was proposed that uses a deeper Long Short-Term Memory (LSTM) network structure for gender prediction from a speech dataset. The proposed model achieved 98.4% accuracy in classifying the speeches in a public audio dataset according to the gender of the speakers. In [16], noise-free, smoothed data were obtained through preprocessing, and features were then extracted using a multi-layered architectural model. The study conducted experiments with KNN and SVM classifiers on three different datasets (TIMIT, RAVDESS, and BGC) and reported that the highest accuracy, 96.8%, was achieved on the TIMIT dataset using KNN. In [17], the authors introduced a gender classification model for Arabic speakers based on an ensemble classifier. The model utilized a three-stage machine learning approach to optimize the feature engineering process, resulting in a 96.02% classification rate on the test dataset with a linear SVM classifier. In a different study [18], different Deep Neural Network-based embedder architectures such as x-vector and d-vector were used for predicting age and gender, and the model was reported to achieve an accuracy of 99.60% on the TIMIT dataset. In a recent study [4], a new convolutional neural network with multiple attention modules was introduced to classify speakers based on their age and gender. The researchers conducted experiments using the Common Voice and a local Korean speech recognition dataset, and the proposed model achieved gender recognition accuracies of 96% and 97%, respectively. In [19], the authors proposed two self-attention-based models for end-to-end gender recognition in unconstrained environments. Experimental results showed that the proposed convolution self-attention model provided 96.23% accuracy on the VoxCeleb dataset. In [20], a language-independent classification model for gender recognition was introduced using the Common Voice dataset. The study involved training a deep learning network to extract vital information from speech spectrograms, followed by fine-tuning the pre-trained ResNet50 model for gender recognition. The proposed model achieved a gender classification accuracy of 98.57%. In [21], a model was proposed for predicting gender from Turkish speech signals using MFCC features and machine learning. The researchers examined 58 different TV series and movies, creating a new dataset of 894 audio recordings, each a 5 s segment, and also created a hybrid feature vector from MFCC and spectrogram features. The performance of various machine learning algorithms, such as logistic regression (LR), decision trees (DT), random forest (RF), and extreme gradient boosting (XGB), was analyzed; among these, logistic regression exhibited the highest accuracy at 89%. In [22], a new method was presented that utilizes MFCC, LPCC, and LPC features in combination with KNN and MLPNN classifiers. The useful coefficients of these three features were combined and used to recognize the gender of the speakers. The experiments were conducted on the TIMIT dataset, and the results showed that the proposed method classified male speakers with an accuracy of 97% and female speakers with an accuracy close to 98%. In [23], the researchers aimed to develop a gender recognition model with high accuracy and low complexity.
To achieve this, a two-stage heterogeneous stacked ensemble model was proposed. In the experiments carried out using data from different sources, the gender recognition accuracy of the proposed model was found to be 99.36%. In [24], modified ensemble techniques based on classifiers such as K-NN, Random Forest (RF), and SVM were used for gender recognition. The study stated that the proposed ensemble model provides 99.05% accuracy on a dataset consisting of 3168 audio recordings, outperforming traditional machine learning models.

3. Materials and Methods

3.1. Dataset

In this study, audio recordings selected from the Turkish section of the publicly available Mozilla Common Voice (MCV) dataset are utilized [25]. MCV is a dataset created through contributions from volunteers globally. This dataset contains a total of 30,329 h of speech data, of which 19,916 h have been verified. The dataset covers 120 languages as of February 2024. Each recording is stored in a unique MP3 file, and demographic data for the recordings, such as age, gender, and accent, are stored in a text file. Participants can log in to the system via the Common Voice website or iPhone application and contribute to the dataset as either speakers or listeners. Speakers voice the sentences that appear on the screen, while listeners verify the previously voiced sentences using a simple voting method. Speeches are considered valid if they are heard by at least two listeners and receive a positive vote by the majority. Conversely, speeches are considered invalid if they receive a negative vote by the majority. Records that have not yet received a sufficient number of valid or invalid votes are grouped as “other”. The datasets for each language are split into training, development, and testing sections, and only valid recordings are included in these sections. In the datasets, the gender information of the speakers is categorized into male, female, and other. Additionally, the age information is categorized into seven groups with ten-year age intervals, such as twenties, thirties, forties, and so on.
A total of 113 h of verified speech records in the Turkish dataset version 16.1 is used in the study. However, some of these records are missing gender and age information. Additionally, some of the recordings have been down-voted by at least one listener, even though they are verified. In the study, records that have received at least one down-vote, lack age or gender information, or have “other” defined as their gender are excluded from the dataset. Additionally, there is an imbalance in the number of records across the different age groups, with some having a high number of records and others having a low number. To address this issue and prevent biased results, a balanced dataset is created by randomly selecting an equal number of records from each age group. Unfortunately, there were no speech samples that met the criteria for males in their seventies and eighties, so these groups were excluded from the dataset. As a result, speeches from ten groups remained in the final dataset, consisting of a total of 6700 speeches. This dataset was used in the development of the model proposed in the study.
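For readers who want to reproduce this selection step, the sketch below filters the Common Voice metadata with pandas. It rests on assumptions: the column names (down_votes, age, gender) follow the standard validated.tsv format, the file path depends on the downloaded archive, and the per-group sample size is illustrative rather than taken from the paper.

```python
import pandas as pd

# A minimal sketch of the selection criteria described above, assuming the
# standard Common Voice metadata columns (down_votes, age, gender).
df = pd.read_csv("cv-corpus-16.1/tr/validated.tsv", sep="\t")  # path is illustrative

# Keep only records with no down-votes and with age and gender information,
# excluding records whose gender is labeled "other".
df = df[(df["down_votes"] == 0)
        & df["age"].notna()
        & df["gender"].notna()
        & (df["gender"] != "other")]

# Balance the dataset by sampling an equal number of records per gender-age group
# (the sample size per group is illustrative, not taken from the paper).
n_per_group = 670
balanced = (df.groupby(["gender", "age"], group_keys=False)
              .apply(lambda g: g.sample(min(len(g), n_per_group), random_state=0)))
print(balanced.shape)
```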

3.2. Feature Extraction

The process of extracting a set of features from a speech signal is an important step in speech-based recognition tasks. There are several methods available for extracting speech features. MFCC, chroma, Mel Spectrogram, Contrast, and Tonnetz are among the commonly used features and are used for gender recognition in this study.
Mel Spectrogram: A spectrogram is a way to visually represent the time and frequency information of a sound signal and is obtained using Short-Time Fourier Transform. A Mel Spectrogram is a type of spectrogram that converts the linear frequencies into the Mel scale, which roughly represents how the human ear perceives frequencies. This scale is given by the following equation [26].
m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)
where f represents the physical frequency in Hz and m represents the perceived frequency in the Mel scale. Mel Spectrograms are commonly utilized in various applications, particularly in speech recognition, speech synthesis, and voice classification. Figure 1 illustrates the Mel Spectrogram features extracted from both female and male speech.
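As a quick numerical check of the equation above, the conversion from Hz to Mel can be computed directly; librosa exposes the same HTK-style formula through librosa.hz_to_mel with htk=True (its default uses the Slaney variant instead), so the two lines below should agree.

```python
import numpy as np
import librosa

def hz_to_mel_htk(f_hz):
    # Perceived Mel frequency for a physical frequency in Hz (formula above).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

freqs = np.array([250.0, 1000.0, 4000.0])
print(hz_to_mel_htk(freqs))                 # direct evaluation of the equation
print(librosa.hz_to_mel(freqs, htk=True))   # librosa's HTK-style conversion
```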
Mel Frequency Cepstral Coefficients (MFCCs): MFCCs provide a non-linear representation of the short-term power spectrum of an audio signal. They are widely used in various audio signal processing and speech recognition tasks. To calculate MFCC, the Mel Spectrogram of the signal is created and then the Discrete Cosine Transform is applied to the logarithmic Mel Spectrogram of the signal. The resulting list of numbers or coefficients from this process is called the Mel Frequency Cepstrum Coefficients. MFCC features extracted from both female and male speech are shown in Figure 2.
Chroma: Chroma is widely used in various fields such as music classification, genre recognition, and automatic music transcription. It captures the harmonic content of a piece of music and provides a robust representation. In this representation, the entire spectrum is projected into twelve regions representing different semitones of the musical octave [27]. Chroma features extracted from a female and a male speech are shown in Figure 3.
Spectral Contrast: The spectral contrast feature considers the spectral peak, spectral valley, and their difference in each subband to estimate the distribution of harmonic and non-harmonic components in the audio signal. Figure 4 shows the contrast features extracted from female and male speech.
Tonnetz: The Tonnetz features, similar to the Chromogram, offer a way to represent the harmony and pitch classes of a sound. These features are determined by measuring the tonal centroids of a sound in a six-dimensional pitch space known as the Tonal Centroid Space. In several studies, it has been experimentally demonstrated that this method effectively identifies harmonic changes in sound recordings. The Tonnetz features extracted from a female and a male speech are illustrated in Figure 5.

3.3. Classification Models

3.3.1. Convolutional Neural Network

The Convolutional Neural Network (CNN) is a type of deep learning method inspired by the organization of the visual cortex of animals. Its capacity to effectively capture spatial hierarchies of features has led to its widespread use in various fields, including computer vision, speech processing, and face recognition [28]. CNNs consist of one or more convolutional layers, pooling layers, and one or more fully connected layers similar to a standard multilayer neural network. The convolutional layer, a key component of the CNN architecture, performs feature extraction with different kernel sizes and strides. The pooling layer is used to reduce feature map size and dimensional complexity, cutting down on computational load and enabling faster learning. The fully connected layer, usually located at the end of the CNN architecture, is responsible for producing the final output predictions. The layer structure of the 1D CNN model proposed in the study is outlined in Table 1.
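A minimal Keras sketch of a 1D CNN following the configuration in Table 1 is given below. The input is assumed to be the one-dimensional stacked feature vector described in Section 4 (193 values, treated as a single-channel sequence); the activation of the 10-neuron dense layer, the loss function, and the absence of padding are assumptions not specified in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_1d_cnn(input_len=193, n_classes=2):
    """1D CNN roughly following Table 1 (kernel size 5, Adam, lr = 0.001)."""
    model = models.Sequential([
        layers.Input(shape=(input_len, 1)),           # stacked features as a 1D sequence
        layers.Conv1D(256, 5, activation="relu"),
        layers.Conv1D(128, 5, activation="relu"),
        layers.Dropout(0.1),
        layers.MaxPooling1D(5),
        layers.Conv1D(128, 5, activation="relu"),
        layers.Conv1D(128, 5, activation="relu"),
        layers.Flatten(),
        layers.Dense(10, activation="relu"),          # activation assumed
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",  # loss assumed
                  metrics=["accuracy"])
    return model

# model = build_1d_cnn()
# model.fit(X_train[..., None], y_train, epochs=100)
```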

3.3.2. Multilayer Perceptron

The multi-layer perceptron (MLP) is one of the most fundamental and widely used forms of artificial neural networks. This structure is a mathematical model that aims to simulate the way the human brain processes information. The MLP consists of multiple layers of interconnected neurons (artificial nerve cells), typically organized as an input layer, one or more hidden layers, and an output layer [29]. Input Layer: This is where the model first receives data. Hidden Layer(s): These layers, each containing a set of weights and biases, are where complex features are learned. Output Layer: This is the layer that produces results (predictions). The parameter settings of the MLP model proposed in the study can be found in Table 2.
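The parameter names in Table 2 correspond to scikit-learn's MLPClassifier, so a minimal sketch under that assumption looks as follows; the feature-scaling step is an addition not mentioned in the paper.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# MLP configured with the settings listed in Table 2.
mlp = make_pipeline(
    StandardScaler(),   # scaling is an assumption, not stated in the paper
    MLPClassifier(hidden_layer_sizes=(256,), alpha=0.001, batch_size=128,
                  epsilon=1e-8, learning_rate="constant", activation="logistic",
                  solver="adam", max_iter=600, random_state=0),
)
# mlp.fit(X_train, y_train)
# print(mlp.score(X_test, y_test))
```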

3.3.3. Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) [30] is a supervised learning method that aims to find a linear combination of features to create a decision boundary that effectively separates two or more classes in a dataset. The method consists of the following two steps: dimensionality reduction and linear classification. By minimizing within-class variance and maximizing between-class variance, it becomes easier to distinguish between groups in the dataset. LDA is widely used for both dimensionality reduction and as a classifier. It assumes that the features in each class fit a multivariate Gaussian distribution, which is described by the following expression:
P(x \mid y = k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_k)^{t}\,\Sigma_k^{-1}(x - \mu_k)\right)
where d is the dimension of the feature vector, and \mu_k and \Sigma_k are the mean vector and covariance matrix of class k. Given a new input x, LDA calculates the posterior probability of each class based on the observed features. The posterior probability is calculated using Bayes' theorem as follows:
P(y = k \mid x) = \frac{P(x \mid y = k)\,P(y = k)}{P(x)} = \frac{P(x \mid y = k)\,P(y = k)}{\sum_{l} P(x \mid y = l)\,P(y = l)}
LDA evaluates the likelihood that a new input aligns with each class, and then assigns the input to the class with the highest posterior probability.
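In practice this two-step procedure does not need to be implemented by hand; a minimal scikit-learn sketch, given here purely for illustration, is:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA as a binary classifier: it fits per-class means and a shared covariance
# estimate, then assigns each sample to the class with the highest posterior.
lda = LinearDiscriminantAnalysis()
# lda.fit(X_train, y_train)
# posteriors = lda.predict_proba(X_test)   # P(y = k | x) for each class
# y_pred = lda.predict(X_test)             # class with the highest posterior
```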

3.3.4. K-Nearest Neighbors

The K-Nearest Neighbors (KNN) algorithm [31] is a supervised machine learning method that can be used for both regression and classification tasks. Unlike other supervised learning algorithms, the KNN classifier does not have a distinct training phase. For this reason, it is also referred to as a lazy learner. KNN does not focus on building a general model; instead, it stores all examples corresponding to the training data in an n-dimensional space. KNN is highly robust against noisy training data and can also be applied to multi-class classification problems [32]. KNN has proven to be a very effective classifier in various applications and is therefore widely used in many fields [33].
When classifying new data with the KNN algorithm, the distances between the new point and all points in the training set are calculated. Different metrics such as Euclidean, Manhattan, and Minkowski are used for distance calculation in machine learning [34]. In the KNN algorithm, the letter K represents the number of nearest neighbors considered when classifying new data. For K = 1 , the new data are assigned to the class of the nearest neighbor, while for K = 3 , the three closest neighbors are taken into account and the new data are assigned to the class that has the majority among these three neighbors. The number of neighbors (K) and the distance metric used are important parameters that influence the performance of the KNN algorithm. Typically, these parameters are determined through an optimization process. This study uses grid search and 10-fold cross-validation for parameter optimization.
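A sketch of this optimization with scikit-learn is shown below; the candidate values in the grid are illustrative assumptions, while the 10-fold cross-validation matches the procedure described above.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Grid search over the number of neighbors and the distance metric,
# evaluated with 10-fold cross-validation (candidate values are illustrative).
param_grid = {
    "n_neighbors": list(range(1, 21)),
    "metric": ["euclidean", "manhattan", "minkowski"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```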

3.4. Performance Evaluation

Performance evaluation is a crucial stage in developing an effective machine learning model. Various metrics are used to evaluate a model’s performance, depending on the task it performs. Accuracy, precision, recall, and F1 score are the most commonly used metrics for evaluating the classification task [34]. Mathematical representations of these metrics are provided below.
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}
\mathrm{F1\ score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
where TP denotes the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
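These metrics can be computed directly from the predictions; the short helper below uses scikit-learn and assumes weighted averaging over the two gender classes, which the paper does not state explicitly.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the gender classification task
    (weighted averaging over the two classes is an assumption)."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall":    recall_score(y_true, y_pred, average="weighted"),
        "f1":        f1_score(y_true, y_pred, average="weighted"),
    }
```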

4. Experimental Results and Discussion

In this section, all experiments use speeches from the Turkish subset of the Common Voice dataset, selected based on the criteria detailed in Section 3.1. The dataset is split into two parts with a 75:25 ratio for training and testing. Four classifiers (KNN, LDA, MLP, and CNN) and five feature extraction methods (MFCC, Mel Spectrogram, Chroma, Contrast, and Tonnetz) are utilized in the study. A total of 193 features, including 40 MFCC, 12 Chroma, 128 Mel Spectrogram, 7 Contrast, and 6 Tonnetz, are extracted from each speech signal using the “librosa” library in Python during the feature extraction phase. After these features are averaged over time and converted into one-dimensional arrays, they are applied to each classifier, first separately and then stacked end-to-end. The performance evaluation results of the classifiers, created using each feature separately, are presented in Table 3.
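A librosa-based sketch of this feature extraction and stacking step is given below. The specific librosa calls and their defaults (e.g., computing Chroma and Contrast from the STFT magnitude, applying harmonic separation before Tonnetz) are assumptions; the feature dimensions, the time-averaging, and the end-to-end stacking follow the description above. The resulting vectors, split 75:25 into training and test sets, are what the classifiers consume.

```python
import numpy as np
import librosa

def extract_stacked_features(path):
    """Extract the five acoustic features, average each over time, and stack
    them end-to-end into a single 193-dimensional vector
    (40 MFCC + 12 Chroma + 128 Mel Spectrogram + 7 Contrast + 6 Tonnetz)."""
    y, sr = librosa.load(path, sr=None)
    stft = np.abs(librosa.stft(y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)
    return np.hstack([mfcc, chroma, mel, contrast, tonnetz])   # shape: (193,)

# Smaller hybrid features are built the same way, e.g. MFCC + Contrast (47 x 1):
# np.hstack([mfcc, contrast])
```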
Based on the results in Table 3, it is evident that the CNN classifier consistently offers the highest accuracy in gender recognition, irrespective of the type of feature used. The ranking of classifiers based on accuracy is as follows: CNN, MLP, KNN, and LDA. Additionally, when considering the type of feature, the MFCC feature consistently provides the highest accuracy across all classifiers, followed by the Mel Spectrogram, Contrast, Chroma, and Tonnetz features, in that order. Specifically, the CNN model utilizing MFCC features achieved a remarkable 99.22% classification accuracy by accurately predicting the gender of 1662 out of 1675 speakers in the test dataset. The classification accuracies obtained with other classifiers on the same dataset are as follows: 98.45% with MLP, 98.09% with KNN, and 95.94% with LDA.
After examining the gender classification performance of each feature separately, the performance evaluations of hybrid features created by stacking these features in twos, threes, fours, and fives are carried out. Initially, four hybrid features are formed by combining MFCC with four other features. Then, the number of stacked features is increased to three, four, and five to create eight hybrid features, and the performance of each is measured by following the same processes. The study ultimately creates twelve hybrid features by stacking five features in various sequences. A list of these features is given in Table 4, along with the stacked components and their sizes.
Each hybrid feature in the table is used as input for four different classifiers, resulting in a total of 48 classification models. Performance evaluations are conducted for these models, and the results are presented in Table 5. The table highlights the highest accuracy achieved by each classifier.
When the results in Table 5 are examined, it is seen that the use of hybrid features yields a performance increase of between 0.3% and 1.73% in all models. The highest classification accuracy is achieved with the CNN model created by stacking the MFCC, Mel Spectrogram, and Contrast features (Feature ID = 5). This model improves gender classification accuracy from 99.22% to 99.52% by accurately predicting the gender of 1667 out of 1675 speeches in the test dataset. The CNN model is followed by the KNN model with an accuracy of 99.22%, the MLP model with an accuracy of 99.04%, and the LDA model with an accuracy of 97.67%. In each model, the highest classification accuracy is achieved with a different hybrid feature. The CNN and KNN models achieved their highest classification accuracy with the stacking of the MFCC, Mel Spectrogram, and Contrast features (ID = 5), the MLP model with the stacking of the MFCC and Contrast features (ID = 3), and the LDA model with the stacking of the MFCC, Chroma, Mel Spectrogram, and Contrast features (ID = 9).
Among these three features, the size of the one created by stacking the MFCC and Contrast features is considerably smaller (47 × 1) than that of the other two (175 × 1 and 187 × 1). Despite this, an increase of between 0.18% and 1.59% is achieved in the accuracy of the models developed with this feature. In addition, the difference between the accuracy of this model (99.40%) and the highest accuracy obtained in the study (99.52%) is only 0.12%. Taking this into account, it can be concluded that the hybrid feature created by stacking the MFCC and Contrast features is the most effective feature for the proposed gender classification model. To thoroughly assess the performance of the gender classification models developed in the study, their confusion matrices are presented in Table 6.
The confusion matrix in Table 6 illustrates the distribution of the correct and incorrect predictions made by each classifier based on gender classes. The CNN accurately predicts the gender class of 1667 speakers, while the KNN, MLP, and LDA correctly predict the gender class of 1662, 1659, and 1636 speakers, respectively. The numbers of incorrectly predicted samples by these models are as follows: CNN (8 speakers), KNN (13 speakers), MLP (16 speakers), and LDA (39 speakers). When considering the accuracy of these classifiers as a percentage for each class separately, we find that the CNN model has an accuracy of 99.17% for female speakers and 99.88% for male speakers. The KNN model achieves 99.53% accuracy for female speakers and 98.91% for male speakers, while the MLP model scores 99.06% for female speakers and 99.03% for male speakers. Finally, the LDA model has an accuracy rating of 98.35% for female speakers and 96.98% for male speakers.
The experimental results from this study are compared with similar studies on the subject in Table 7. Most of these studies used different datasets, so it is important to consider the datasets used when comparing the results. The Common Voice dataset used in this study contains vocalizations recorded online by volunteer users in uncontrolled environments. In contrast, some of the studies cited in the table utilized datasets (such as TIMIT and Kaggle) created in controlled environments, such as soundproof rooms. Considering this situation, the dataset used in this study is no simpler than other datasets, and therefore the comparison made here is fair. Nearly all the studies in the table can predict the gender classes of speakers with an accuracy rate of over 96%. Four of these studies have an accuracy rate of over 99%, including this study. Two of the studies achieving over 99% accuracy use the ensemble learning approach, while the other study uses a two-stage transfer learning approach. In this study, four different models were developed, and the accuracy of three of them, including KNN, is over 99%. The high accuracy achieved with a simple classifier like KNN demonstrates the effectiveness of the hybrid features used in the study. This suggests that the proposed gender classification model is highly practical and useful.

5. Conclusions and Future Works

The study examines the performance of hybrid features created by stacking different types of features to determine the gender of the speaker. The study uses MFCC, Chroma, Mel Spectrogram, Contrast, and Tonnetz features, which are commonly used in speech-based recognition tasks, with four different classifiers. Two of them, CNN and MLP, are deep learning models, while the other two, KNN and LDA, are traditional machine learning models. First, the gender recognition performances of five features are measured individually. Then, the features are stacked in different orders to create twelve hybrid features, and the performance of each is evaluated separately. In all experiments, a subset of 6700 speeches selected from the Turkish section of the Common Voice dataset is utilized. This dataset is randomly split into two parts in a 75:25 ratio, with the first part being used for model training and parameter optimization, and the second part for testing the models. In the study, a total of 68 models were developed—20 using five individual features and 48 using 12 hybrid features created by stacking these features. Performance measurements were made for each model. The experimental results show that using hybrid features created by stacking different features increases the accuracy of gender classification, regardless of the classifier used. Through the use of hybrid features, the CNN model increases its classification accuracy from 99.22% to 99.52%, the KNN model from 98.09% to 99.22%, the MLP model from 98.45% to 99.04%, and the LDA model from 95.94% to 97.67%. The difference in accuracy between the CNN and KNN models, which have the highest accuracies, is only 0.3%, which is negligible given the complexity of the two models. These results demonstrate that the hybrid features, created by stacking different features, significantly increase gender recognition accuracy. In future studies, new features such as Zero Crossing Rate (ZCR) and root mean square energy (RMSE) can be included in hybrid features. It is also possible to change the dimensions of the features that make up the hybrid features. Additionally, these approaches can be applied to different problems using various datasets, and their performance can be evaluated.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Common Voice Corpus 16.1 Turkish database is available via https://commonvoice.mozilla.org/tr/datasets (accessed on 19 January 2024).

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Gondohanindijo, J.; Noersasongko, E. Multi-Features Audio Extraction for Speech Emotion Recognition Based on Deep Learning. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 198–206. [Google Scholar] [CrossRef]
  2. Safavi, S.; Russell, M.; Jančovič, P. Automatic speaker, age-group and gender identification from children’s speech. Comput. Speech Lang. 2018, 50, 141–156. [Google Scholar] [CrossRef]
  3. Alkhawaldeh, R.S. DGR: Gender recognition of human speech using one-dimensional conventional neural network. Sci. Program. 2019, 2019, 7213717. [Google Scholar] [CrossRef]
  4. Tursunov, A.; Khan, M.; Choeh, J.Y.; Kwon, S. Age and gender recognition using a convolutional neural network with a specially designed multi-attention module through speech spectrograms. Sensors 2021, 21, 5892. [Google Scholar] [CrossRef]
  5. Rezapour Mashhadi, M.M.; Osei-Bonsu, K. Speech emotion recognition using machine learning techniques: Feature extraction and comparison of convolutional neural network and random forest. PLoS ONE 2023, 18, e0291500. [Google Scholar] [CrossRef] [PubMed]
  6. Jiang, S.; Chen, Z. Application of dynamic time warping optimization algorithm in speech recognition of machine translation. Heliyon 2023, 9, e21625. [Google Scholar] [CrossRef] [PubMed]
  7. Reda, M.M.; Nassef, M.; Salah, A. Factors affecting classification algorithms recommendation: A survey. In Proceedings of the 8th International Conference on Soft Computing, Artificial Intelligence and Applications, Dubai, United Arab Emirates, 29–30 June 2019. [Google Scholar] [CrossRef]
  8. Tian, Q.; Arbel, T.; Clark, J.J. Deep LDA-runed nets for efficient facial gender classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 10–19. [Google Scholar] [CrossRef]
  9. Singhal, A.; Sharma, D.K. Estimation of Accuracy in Human Gender Identification and Recall Values Based on Voice Signals Using Different Classifiers. J. Eng. 2022, 2022, 9291099. [Google Scholar] [CrossRef]
  10. Chai, J.; Zeng, H.; Li, A.; Ngai, E.W. Deep learning in computer vision: A critical review of emerging techniques and application scenarios. Mach. Learn. Appl. 2021, 6, 100134. [Google Scholar] [CrossRef]
  11. Lauriola, I.; Lavelli, A.; Aiolli, F. An introduction to deep learning in natural language processing: Models, techniques, and tools. Neurocomputing 2022, 470, 443–456. [Google Scholar] [CrossRef]
  12. Kwon, H.; Yoon, H.; Park, K.W. Acoustic-decoy: Detection of adversarial examples through audio modification on speech recognition system. Neurocomputing 2020, 417, 357–370. [Google Scholar] [CrossRef]
  13. Sarker, I.H. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2021, 2, 420–440. [Google Scholar] [CrossRef] [PubMed]
  14. Taye, M.M. Understanding of machine learning with deep learning: Architectures, workflow, applications and future directions. Computers 2023, 12, 91. [Google Scholar] [CrossRef]
  15. Ertam, F. An effective gender recognition approach using voice data via deeper LSTM networks. Appl. Acoust. 2019, 156, 351–358. [Google Scholar] [CrossRef]
  16. Uddin, M.A.; Hossain, M.S.; Pathan, R.K.; Biswas, M. Gender recognition from human voice using multi-layer architecture. In Proceedings of the 2020 International Conference on Innovations in Intelligent Systems and Applications (INISTA), Novi Sad, Serbia, 24–26 August 2020. [Google Scholar] [CrossRef]
  17. Hamdi, S.; Moussaoui, A.; Oussalah, M.; Saidi, M. Gender identification from arabic speech using machine learning. In Proceedings of the International Symposium on Modelling and Implementation of Complex Systems, Batna, Algeria, 24–26 October 2020. [Google Scholar] [CrossRef]
  18. Kwasny, D.; Hemmerling, D. Gender and age estimation methods based on speech using deep neural networks. Sensors 2021, 21, 4785. [Google Scholar] [CrossRef] [PubMed]
  19. Nasef, M.M.; Sauber, A.M.; Nabil, M.M. Voice gender recognition under unconstrained environments using self-attention. Appl. Acoust. 2021, 175, 107823. [Google Scholar] [CrossRef]
  20. Alnuaim, A.A.; Zakariah, M.; Shashidhar, C.; Hatamleh, W.A.; Tarazi, H.; Shukla, P.K.; Rajnish, R. Speaker gender recognition based on deep neural networks and ResNet50. Wirel. Commun. Mob. Comput. 2022, 2022, 4444388. [Google Scholar] [CrossRef]
  21. Hızlısoy, S.; Çolakoğlu, E.; Arslan, R.S. Speech-to-Gender Recognition Based on Machine Learning Algorithms. Int. J. Appl. Math. Electron. Comput. 2022, 10, 84–92. [Google Scholar] [CrossRef]
  22. AL-Dujaili, M.J.; Ahily, H.J.S.; Fatlawi, A. Gender recognition of human based on speech characteristics by features fusion with K_NN and MLPNN classifications. AIP Conf. Proc. 2023, 2977, 020092. [Google Scholar] [CrossRef]
  23. Taran, S.; Pandey, A.A. Dual-Staged heterogeneous stacked ensemble model for gender recognition using speech signal. Appl. Acoust. 2023, 205, 109271. [Google Scholar] [CrossRef]
  24. Madhu, G.; Bukka, A. Ensemble Learning Model for Gender Recognition Using the Human Voice. In Proceedings of the 2023 First International Conference on Advances in Electrical, Electronics and Computational Intelligence (ICAEECI), Tiruchengode, India, 19–20 October 2023. [Google Scholar] [CrossRef]
  25. Mozilla Common Voice. Available online: https://commonvoice.mozilla.org/tr/datasets (accessed on 19 January 2024).
  26. Qiao, T.; Zhang, S.; Zhang, Z.; Cao, S.; Xu, S. Sub-spectrogram segmentation for environmental sound classification via convolutional recurrent neural network and score level fusion. In Proceedings of the 2019 IEEE International Workshop on Signal Processing Systems (SiPS), Nanjing, China, 20–23 October 2019. [Google Scholar] [CrossRef]
  27. Senevirathna, E.N.W.; Jayaratne, L. Audio music monitoring: Analyzing current techniques for song recognition and identification. GSTF J. Comput. 2015, 4, 23–34. [Google Scholar] [CrossRef]
  28. Li, H.C.; Deng, Z.Y.; Chiang, H.H. Lightweight and resource-constrained learning network for face recognition with performance optimization. Sensors 2020, 20, 6114. [Google Scholar] [CrossRef]
  29. Chan, K.Y.; Abu-Salih, B.; Qaddoura, R.; Ala’M, A.Z.; Palade, V.; Pham, D.S.; Muhammad, K. Deep neural networks in the cloud: Review, applications, challenges and research directions. Neurocomputing 2023, 545, 126327. [Google Scholar] [CrossRef]
  30. Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188. [Google Scholar] [CrossRef]
  31. Fix, E.; Hodges, J.L. Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties. Int. Stat. Rev./Rev. Int. Stat. 1989, 57, 238–247. [Google Scholar] [CrossRef]
  32. Boateng, E.Y.; Otoo, J.; Abaye, D.A. Basic tenets of classification algorithms K-nearest-neighbor, support vector machine, random forest and neural network: A review. J. Data Anal. Inf. Process. 2020, 8, 341–357. [Google Scholar] [CrossRef]
  33. Ozturk Kiyak, E.; Ghasemkhani, B.; Birant, D. High-Level K-Nearest Neighbors (HLKNN): A Supervised Machine Learning Model for Classification Analysis. Electronics 2023, 12, 3828. [Google Scholar] [CrossRef]
  34. Kalra, V.; Kashyap, I.; Kaur, H. Effect of distance measures on K-nearest neighbour classifier. In Proceedings of the 2022 Second International Conference on Computer Science, Engineering and Applications (ICCSEA), Gunupur, India, 8 September 2022. [Google Scholar] [CrossRef]
Figure 1. Mel Spectrogram features extracted from a female (left) and a male speech (right).
Figure 2. MFCC features extracted from a female (left) and a male speech (right).
Figure 3. Chroma features extracted from a female voice (left) and a male voice (right).
Figure 4. Contrast features extracted from a female voice (left) and a male voice (right).
Figure 5. Tonnetz features extracted from a female voice (left) and a male voice (right).
Table 1. Layer structure of the proposed 1D CNN model.
1D CNN configuration: Conv1D (5, @256), ReLU; Conv1D (5, @128), ReLU; Dropout (0.1); MaxPooling (5); Conv1D (5, @128), ReLU; Conv1D (5, @128), ReLU; Flatten; Dense (10 neurons); Softmax (2 genders).
Training parameters: Adam optimizer, 0.001 learning rate, 100 epochs.
Table 2. The parameter settings of the MLP model proposed in the study.
MLP configuration: hidden_layer_sizes = (256), alpha = 0.001, batch_size = 128, epsilon = 1 × 10−8, learning_rate = ‘constant’, activation = ‘logistic’, solver = ‘adam’, max_iter = 600.
Table 3. Performance evaluation results of the classifiers created with different features.
Feature Type | Feature Size | KNN (Acc./Prec./Rec./F1) | LDA (Acc./Prec./Rec./F1) | MLP (Acc./Prec./Rec./F1) | CNN (Acc./Prec./Rec./F1)
MFCC | 40 × 1 | 98.09 / 98.09 / 98.09 / 98.09 | 95.94 / 95.98 / 95.94 / 95.94 | 98.45 / 98.45 / 98.45 / 95.45 | 99.22 / 99.23 / 99.22 / 99.22
Mel | 128 × 1 | 95.70 / 95.72 / 95.70 / 95.70 | 89.79 / 89.96 / 89.79 / 89.78 | 97.37 / 97.38 / 97.37 / 97.37 | 98.33 / 98.33 / 98.33 / 98.33
Contrast | 7 × 1 | 93.97 / 93.97 / 93.97 / 93.97 | 89.55 / 89.77 / 89.55 / 89.54 | 90.33 / 90.33 / 90.33 / 90.33 | 93.01 / 93.02 / 93.01 / 93.02
Chroma | 12 × 1 | 73.85 / 74.16 / 73.85 / 73.79 | 68.60 / 69.37 / 68.60 / 68.34 | 68.36 / 69.22 / 68.36 / 68.07 | 78.09 / 78.52 / 78.09 / 78.03
Tonnetz | 6 × 1 | 65.37 / 65.49 / 65.37 / 65.34 | 62.15 / 62.22 / 62.15 / 62.23 | 62.15 / 62.85 / 62.15 / 61.49 | 69.19 / 69.21 / 69.19 / 69.19
Table 4. Hybrid features created by stacking five different features.
Feature ID | Stacked Features | Feature Size
1 | MFCC + Chroma | 52 × 1
2 | MFCC + Mel Spectrogram | 168 × 1
3 | MFCC + Contrast | 47 × 1
4 | MFCC + Tonnetz | 46 × 1
5 | MFCC + Mel Spectrogram + Contrast | 175 × 1
6 | MFCC + Chroma + Mel Spectrogram | 180 × 1
7 | MFCC + Mel Spectrogram + Tonnetz | 174 × 1
8 | Chroma + Mel Spectrogram + Contrast | 147 × 1
9 | MFCC + Chroma + Mel Spectrogram + Contrast | 187 × 1
10 | MFCC + Chroma + Mel Spectrogram + Tonnetz | 186 × 1
11 | Chroma + Mel Spectrogram + Contrast + Tonnetz | 153 × 1
12 | MFCC + Chroma + Mel Spectrogram + Contrast + Tonnetz | 193 × 1
Table 5. Performance measurement results of different classifiers created with hybrid features.
Feature ID | KNN (Acc./Prec./Rec./F1) | LDA (Acc./Prec./Rec./F1) | MLP (Acc./Prec./Rec./F1) | CNN (Acc./Prec./Rec./F1)
1 | 98.20 / 98.22 / 98.21 / 98.21 | 96.24 / 96.27 / 96.24 / 96.24 | 98.93 / 98.93 / 98.93 / 98.93 | 99.28 / 99.28 / 99.28 / 99.28
2 | 98.69 / 98.69 / 98.69 / 98.69 | 96.24 / 96.28 / 96.24 / 96.24 | 98.87 / 98.87 / 98.87 / 98.87 | 99.40 / 99.40 / 99.40 / 99.40
3 | 99.10 / 99.10 / 99.10 / 99.10 | 97.43 / 97.43 / 97.43 / 97.43 | 99.04 / 99.04 / 99.04 / 99.04 | 99.40 / 99.40 / 99.40 / 99.40
4 | 98.09 / 98.09 / 98.09 / 98.09 | 96.06 / 96.10 / 96.06 / 96.06 | 98.51 / 98.52 / 98.51 / 98.51 | 99.40 / 99.40 / 99.40 / 99.40
5 | 99.22 / 99.23 / 99.22 / 99.22 | 97.37 / 97.38 / 97.37 / 97.37 | 98.87 / 98.87 / 98.87 / 98.87 | 99.52 / 99.52 / 99.52 / 99.52
6 | 98.75 / 98.75 / 98.75 / 98.75 | 96.78 / 96.79 / 96.78 / 96.78 | 98.69 / 98.69 / 98.69 / 98.69 | 99.40 / 99.40 / 99.40 / 99.40
7 | 98.69 / 98.69 / 98.69 / 98.69 | 96.36 / 96.39 / 96.36 / 96.36 | 98.93 / 98.93 / 98.93 / 98.93 | 99.52 / 99.52 / 99.52 / 99.52
8 | 97.67 / 97.69 / 97.67 / 97.67 | 92.78 / 92.78 / 92.78 / 92.78 | 97.91 / 97.91 / 97.91 / 97.91 | 98.87 / 98.87 / 98.87 / 98.87
9 | 99.22 / 99.23 / 99.22 / 99.22 | 97.67 / 97.68 / 97.67 / 97.67 | 98.93 / 98.93 / 98.93 / 98.93 | 99.46 / 99.46 / 99.46 / 99.46
10 | 98.75 / 98.75 / 98.75 / 98.75 | 96.60 / 96.62 / 96.60 / 96.60 | 98.87 / 98.87 / 98.87 / 98.87 | 99.52 / 99.52 / 99.52 / 99.52
11 | 97.67 / 97.69 / 97.67 / 97.67 | 93.07 / 93.07 / 93.07 / 93.07 | 97.91 / 97.93 / 97.91 / 97.91 | 98.87 / 98.87 / 98.87 / 98.87
12 | 99.16 / 99.17 / 99.16 / 99.16 | 97.67 / 97.68 / 97.67 / 97.67 | 98.99 / 98.99 / 98.99 / 98.99 | 99.52 / 99.52 / 99.52 / 99.52
For each classifier, the highest accuracy obtained with the hybrid features is as follows: KNN 99.22% (Feature IDs 5 and 9), LDA 97.67% (IDs 9 and 12), MLP 99.04% (ID 3), and CNN 99.52% (IDs 5, 7, 10, and 12).
Table 6. Confusion matrices of different classifiers created with hybrid features.
Classifier | Actual Class | Predicted Female | Predicted Male
CNN | Female | 840 | 7
CNN | Male | 1 | 827
KNN | Female | 843 | 4
KNN | Male | 9 | 819
MLP | Female | 839 | 8
MLP | Male | 8 | 820
LDA | Female | 833 | 14
LDA | Male | 25 | 803
Table 7. Comparison of the proposed gender classification approach with competing approaches.
Reference | Dataset | Method | Year | Accuracy
Ertam [15] | Kaggle voice gender dataset | Deeper LSTM networks | 2019 | 98.4%
Uddin et al. [16] | TIMIT | KNN and SVM | 2020 | 96.8%
Hamdi et al. [17] | Arabic Natural Audio Dataset (ANAD) | Ensemble classifier | 2020 | 96.02%
Kwasny and Hemmerling [18] | Common Voice and TIMIT | x-vector with QuartzNet embedder and two-stage transfer learning | 2021 | 99.60%
Tursunov et al. [4] | Common Voice / Korean speeches dataset | CNN with MAM | 2021 | 96% / 97%
Nasef et al. [19] | VoxCeleb | Convolution self-attention | 2021 | 96.23%
Alnuaim et al. [20] | Common Voice | Pre-trained ResNet50 model | 2022 | 98.57%
Hızlısoy et al. [21] | A new local dataset | LR, DT, RF, XGB, etc. | 2022 | 89%
AL-Dujaili et al. [22] | TIMIT | Combining MFCC, LPCC, and LPC with KNN and MLPNN | 2023 | 97% (male), 98% (female)
Taran and Pandey [23] | Kaggle voice gender dataset | Two-stage heterogeneous stacked ensemble model | 2023 | 99.36%
Madhu and Bukka [24] | Kaggle voice gender dataset | Modified ensemble techniques based on k-NN, RF, and SVM | 2023 | 99.05%
Proposed method | Common Voice | CNN model based on stacked acoustic features | 2024 | 99.52%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
