Recently, research on the classification of infant cry signals has made significant progress. N. Nimbarte et al. [6] employed Mel-frequency cepstral coefficients (MFCCs) for feature extraction and a K-nearest neighbors (KNN) classifier, achieving high accuracy in cry-type recognition on an open dataset. W. You et al. [7] employed long short-term memory (LSTM) networks for infant cry analysis; their model achieved a maximum accuracy of 92.39% in recognizing various types of infant cries and stabilized after 75 training rounds. V. Bella et al. [8], besides combining MFCCs and an LSTM, introduced two data augmentation techniques, time stretching and pitch shifting, to improve model performance. B. Lv et al. [9] proposed an infant cry emotion recognition method based on a multiscale CNN-BLSTM (convolutional neural network–bidirectional long short-term memory network). N. Meephiw et al. [10] applied MFCCs for feature extraction, comparing different numbers of MFCC coefficients to determine configurations suitable for classification; their experiments indicated that 11 coefficients (MFCC:11) were well suited for infant cry classification. The results of Y.-C. Liang et al. [11] showed that CNN and LSTM models both performed well, at around 95% accuracy, precision, and recall, in differentiating healthy from sick infants. Micheletti et al. [12] evaluated a deep learning model's performance relative to LENA's cry classifier, one of the most commonly used commercial software systems for quantifying child crying; broadly, they found that both the deep learning and LENA model outputs showed convergent validity with human annotations of infant crying. Yasin et al. [13] used machine learning and artificial intelligence to distinguish cry tones in real time through feature analysis. The review by T. Özseven et al. [14] summarized the methods used for ICRC to guide future studies, presented an overview of developments in the field, and compared the findings of the studies examined; it concluded that the most widely used traditional methods in ICRC are MFCCs as the feature set and neural network-based classifiers. K. Zhang et al. [15] proposed BCRNet, a novel approach integrating transfer learning and feature fusion: the model takes multidomain features as inputs and uses a transfer learning model to extract deep features. H. A. Patil et al. [16] extensively analyzed various attention-based methods through experiments on infant cry classification and speech emotion recognition tasks; their results revealed that the transformer model surpassed the previous state of the art in infant cry classification, improving recall by 10.9%. S. K. Singh et al. [17] compared the performance of CNN and transformer models in categorizing infant crying into five reasons: hunger, pain, hiccups, discomfort, and fatigue. Anders et al. [18] investigated CNNs for the classification of infant vocalization sequences, with the target classes 'crying', 'fussing', 'babbling', 'laughing', and 'vegetative vocalizations'. P. Sandhya et al. [19] aimed to identify the speaker in an emotional environment using spectral features and various classification techniques to achieve a high speaker recognition rate, noting that feature combinations can further improve accuracy. Matikolaie et al. [20] used a novel combination of short-term and long-term features from different timescales to develop an automatic newborn cry diagnostic system that differentiates the cry audio signals (CASs) of healthy infants from those with respiratory distress syndrome (RDS). P. Kulkarni et al. [21] presented a study on the classification of child cries based on various features extracted through speech and auditory processing, finding that certain spectral and descriptive features vary significantly with the purpose of a child's cry. Ekinci et al. [22] interpreted the information embedded in baby cry audio signals using sound processing methods and classified the signals with machine learning algorithms, utilizing a dataset of baby cry recordings divided into five distinct classes. K. Rezaee et al. [23] used automatic acoustic analysis and data mining to determine the discriminative features of preterm and full-term infant cries. The approach proposed by E. Sutanto et al. [24] started with an analysis of signal power from WAV files and then explored 2D patterns as features for machine learning, achieving around 85% accuracy. In [25], the best experimental results showed a mean accuracy of around 91% in most scenarios, demonstrating the potential of the proposed extreme gradient boosting-powered grouped-support-vector network for neonate cry classification. Matikolaie et al. [26] analyzed the CASs of newborns under two months old using machine learning approaches to develop an automatic diagnostic system for distinguishing septic infants from healthy ones. The ChatterBaby algorithm detected significant acoustic similarities between colic and painful cries, suggesting that they may share a neuronal pathway [27]. In [28], audio and speech (AS) features were extracted from the cry signal spectrogram using Mel–Bark-frequency cepstral coefficients and fed into a deep CNN (DCNN); the proposed system yielded a balanced accuracy of 92.31%, with a highest accuracy of 95.31%, highest specificity of 94.58%, and highest sensitivity of 93%. In [29], a CNN trained on recordings of Australian babies was used to classify audio material from Romanian babies, to examine what happens when participants belong to a different cultural landscape; the CNN's automatic classifications were compared with those of the Dunstan coaches, and the results supported the conclusion that the Dunstan baby language is universal. Finally, S. Cabon et al. [30] proposed a method to extract cries based on an initial segmentation between silence and sound events, followed by feature extraction on the resulting audio segments and cry/non-cry classification.
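Across these studies, the most common traditional baseline, as also noted in the review [14], is MFCC feature extraction paired with a conventional classifier such as KNN [6]. The sketch below illustrates that pipeline with librosa and scikit-learn; the file names, labels, and hyperparameter values are illustrative placeholders, not drawn from any cited study.

```python
# Minimal sketch of the MFCC + KNN pipeline common to the surveyed work.
# File names, labels, and hyperparameters are illustrative assumptions.
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load a cry recording and summarize it as a fixed-length MFCC vector."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    # Average over time so every clip maps to the same feature dimension.
    return mfcc.mean(axis=1)

# Hypothetical labeled recordings.
train_files = ["hungry_01.wav", "pain_01.wav", "discomfort_01.wav"]
train_labels = ["hungry", "pain", "discomfort"]

X_train = np.stack([mfcc_features(f) for f in train_files])
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, train_labels)

print(knn.predict([mfcc_features("unlabeled_cry.wav")]))
```

Averaging the MFCC matrix over time is one simple way to obtain a fixed-length vector per clip; many of the surveyed systems instead keep the full time-frequency representation as input to sequence models such as LSTMs.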
Research on infant cry classification has thus advanced through a variety of techniques. MFCC-based feature extraction, often paired with classical machine learning methods such as KNN and SVM, has been widely used, as in the work of N. Nimbarte et al. [6] and N. Meephiw et al. [10]. More recently, deep learning approaches, especially CNN and LSTM networks, have shown great promise: studies by W. You et al. [7], V. Bella et al. [8], and B. Lv et al. [9] achieved high accuracy, and hybrid models such as BCRNet [15] and attention-based methods [16] have further enhanced performance. Data augmentation techniques, cross-cultural studies, and real-time classification have also contributed significant improvements to the field.
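Of the contributing factors listed above, the augmentation techniques named in [8], time stretching and pitch shifting, have generic implementations in librosa; the sketch below uses those, with illustrative stretch rates and semitone steps rather than the parameters of any cited study.

```python
# Sketch of time stretching and pitch shifting with librosa; the input file,
# stretch rates, and semitone steps are illustrative, not taken from [8].
import librosa

y, sr = librosa.load("cry.wav", sr=16000)  # hypothetical input clip

# Time stretching: change duration while preserving pitch.
y_fast = librosa.effects.time_stretch(y, rate=1.1)   # 10% faster
y_slow = librosa.effects.time_stretch(y, rate=0.9)   # 10% slower

# Pitch shifting: change pitch while preserving duration.
y_up = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)     # up 2 semitones
y_down = librosa.effects.pitch_shift(y, sr=sr, n_steps=-2)  # down 2 semitones
```

Each augmented waveform can then be passed through the same feature extraction as the original recordings, multiplying the effective size of the training set.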
This paper proposes an improvement upon the ResNet architecture by incorporating the squeeze-and-excitation (SE) attention module, resulting in the SE-ResNet-Transformer. As features, it uses an enhanced version of the MFCCs, comprising the static MFCCs together with their first-order and second-order differential coefficients. Additionally, it establishes a database containing infant cry signals with three types of emotions by integrating the Donateacry-corpus, the Chillanto database, and the ESC-50 database.
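To make these two components concrete, the sketch below shows (1) the enhanced MFCC feature stack of static coefficients plus their first- and second-order deltas, computed with librosa, and (2) a generic squeeze-and-excitation block in PyTorch of the kind inserted into ResNet stages. All parameter values (sampling rate, number of coefficients, reduction ratio) are illustrative assumptions, and the SE-ResNet-Transformer's exact layout is not reproduced here.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

# (1) Enhanced MFCC features: static MFCCs stacked with their first- and
# second-order differential (delta) coefficients.
def enhanced_mfcc(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc)           # first-order differential
    d2 = librosa.feature.delta(mfcc, order=2)  # second-order differential
    return np.concatenate([mfcc, d1, d2], axis=0)  # shape: (3 * n_mfcc, frames)

# (2) A generic squeeze-and-excitation (SE) block; the reduction ratio of 16
# follows the original SE design, and the paper's placement may differ.
class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),  # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # recalibrate each channel of the feature map

# Quick shape check on a dummy batch of 2D feature maps.
x = torch.randn(2, 64, 30, 30)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 30, 30])
```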