1. Introduction
With the rapid development of artificial intelligence technology, a large body of research on prediction and classification (such as music genre classification) has emerged based on machine learning and deep learning [1,2]. In the medical field, since hospitals hold massive structured text data (such as clinical complaints [3], symptoms [4], and physiological values) and image data (such as CT and MRI) [5], there are many studies on disease risk prediction based on text and image classification [6].
As population aging continues to intensify, improving the ability of nursing homes to predict disease risks is necessary to reduce the pressure on public medical care. However, nursing homes differ from hospitals: they have neither professional medical testing equipment nor professional medical information systems. Therefore, to improve the disease risk prediction ability of nursing homes, only data collected by non-professional medical equipment can be used. In this article, the sample data we use are the dataset “patient-health-detection-using-vocal-audio”, which is publicly available on Kaggle, together with subject data. The “patient-health-detection-using-vocal-audio” dataset contains audio files of three types of sounds: those of regular people and those of people with the voice diseases Laryngocele and Vox senilis.
In order to predict disease risk effectively from audio files in nursing homes, we first solve the problem of sample data imbalance in the “patient-health-detection-using-vocal-audio” dataset and the subject data; we then learn representations of the sample data; next, we analyze the representation learning results to classify the sample data effectively; finally, we obtain the best Risevi model (a disease risk prediction model based on Vision Transformer). The application scenario of the Risevi model is shown in Figure 1.
In Figure 1, a nursing home that already uses an information system is taken as an example to describe the application scenario of the Risevi model. The Risevi model is deployed on the nursing home’s local server and a cloud server, and the two servers are hot backups for each other. We use third-party speech synthesis technology to interact with the elderly residents through the voice of a tablet computer during daily routine rounds. Every time an older adult replies, the voice file is collected on the local and cloud servers. After the Risevi model deployed on the cloud server receives a resident’s voice file, it performs the relevant analysis and feeds the results back to the health monitoring and alarm platform. If the Risevi model deployed on the cloud server fails to respond in time, the Risevi model deployed on the local server feeds the analysis results back to the health monitoring and alarm platform instead. The health management personnel of the nursing home formulate a health management plan for each older adult by checking that resident’s health information on the health monitoring and alarm platform.
The research goal of this paper is to use audio files in the application scenario of nursing homes to achieve adequate disease risk prediction. To achieve this goal, we first design a sample generation method based on MelGAN to solve the problem of sample data imbalance; we then design a sample feature extraction method that uses Mel frequency cepstral coefficients (MFCC) and Wav2vec2 to extract audio data features; next, we design a sample feature classification method based on transfer learning and Vision Transformer to classify the audio data effectively; finally, we use the best Risevi model obtained to achieve effective prediction of disease risk from audio files.
The main work of this paper is as follows:
When designing the Risevi model, to solve the problem of audio sample data imbalance, we propose an audio sample generation method based on MelGAN;
We propose an audio sample feature extraction method to allow the designed Risevi model to obtain the audio sample data characteristics accurately;
To accurately classify audio sample data features, we propose an audio sample feature classification method based on transfer learning and Vision Transformer;
The implementation of the Risevi model that we propose is readily reproducible for audio multi-class classification tasks.
This paper consists of six sections. The first section is an introduction, which mainly describes the research background of this paper; the second section is related work, which mainly describes the research status of this paper; the third section is model design, which mainly describes the process of Risevi model design; the fourth section is the realization of the model, which mainly describes the realization process of the Risevi model; the fifth section is the analysis of the experimental results, which mainly describes the experimental results; the sixth section is the conclusion, which mainly describes the results of this research and the prospect of this research.
2. Related Works
There have been many kinds of research on audio classification in recent years. M. Wisniewski et al. proposed a recognition method based on MRMR and SVM to identify asthmatic wheezing using lung sound features [7]. The recognition accuracy of this method reached 92.9%. Ian et al. proposed a robust sound classification system based on support vector machines and deep neural network classifiers [8]. The system achieved an average accuracy of 92.58% under high-noise conditions. Le et al. proposed a method and predictive framework for depression classification based on audio, video, and text descriptions, built on deep convolutional neural networks (DCNN), deep neural networks (DNN), paragraph vectors (PV), support vector machines (SVM), and random forests [9]. M. V. A. Rao et al. designed a prediction method based on the SVR (Support Vector Regression) algorithm to automatically predict spirometry readings from audio signals of coughing and wheezing and then predict the severity of asthma [10]. This method achieved an accuracy of 77.77% in experiments.
Myounggyu et al. developed an audio classification framework for mobile applications using mobile sensing technology to solve the problem that audio characteristics change over time and affect the accuracy of audio classification in automotive application scenarios [11]. Experimental results show that this framework improves the average classification accuracy for speech and music by 166% and 64%, respectively, compared to non-adaptive methods. Michael et al. developed the AUDEEP toolkit for deep unsupervised representation learning from acoustic data using deep recurrent neural networks [12]. This toolkit achieved 88% accuracy on the sound scene recognition dataset TUT Acoustic Scenes 2017. Yifang et al. proposed a new acoustic scene classification system based on multi-modal deep feature fusion [13]. This system achieved 91% accuracy on the DCASE16 dataset.
Gabriel et al. proposed a speech emotion classification model based on a neural network [14]. This model uses convolutional layers and feature combinations for feature extraction. The model achieved 71% accuracy in classifying four categories: Anger, Happiness, Neutral, and Sadness. E. J. Alqahtani et al. designed a classification method based on the SMOTE algorithm, the AdaBoost ensemble classifier, and the NNge classification algorithm to predict Parkinson’s disease (PD) through voice recordings [15]. This method achieved an accuracy rate of 96.3% in the experiment. A. Joshi et al. built a system based on a hierarchical Bayesian neural network to predict Parkinson’s disease (PD) by extracting relevant features from video and audio [16].
Y. You et al. proposed a parallel classification system based on KNN and SVM to realize dementia risk prediction using audio recordings [17]. The accuracy of this system reached 97.2% in the experiment. Arunodhayan et al. developed a real-time classification system for acoustic events based on convolutional neural networks (CNN) using the publicly available ESC-50 and Ultrasound-8k datasets [18]. S. Aich et al. proposed a Parkinson’s disease (PD) prediction method based on nonlinear and linear classification of speech, using a dataset created by Oxford University, Max Leedt University, and the National Center for Voice and Speech in Denver, Colorado [19]. This method achieved an accuracy rate of 97.57% in the experiment. D. Pettas et al. proposed a method for predicting audio events of pressurized metered-dose inhalers based on long short-term memory (LSTM) units and recurrent neural networks with spectrogram features to improve drug compliance in patients with obstructive inflammatory lung disease [20]. The accuracy of this method in the experiment reached 92.76%.
In order to solve the problem of severe decline in classification accuracy in noisy or weakly segmented application scenarios in audio classification, Irene et al. proposed a new type of pooling layer [21]. The average performance gain of this method on the ESC-30 and URBAN datasets reaches 7.9% and 17.3% for Gaussian noise and 4.3% and 6.7% for Brownian noise, respectively. In order to capture the temporal information of the entire audio sample, Liwen et al. proposed a method called Pyramid Time Pooling (PTP) [22]. This approach can capture high-level temporal dynamics of input feature sequences in an unsupervised manner. This method achieves 88.9% accuracy on the AER task.
In order to improve the performance of multi-class deep neural networks, Yuan et al. proposed a novel multi-class classification framework based on local OVR deep neural networks [23]. This framework achieved a classification accuracy of 86.8% on the ESC-50 dataset. Bo et al. proposed a retrieval-based scene classification framework based on recurrent neural networks [24]. A detection accuracy of 93% was obtained through experiments on natural audio scenes. G. Pinkas et al. designed a prediction model based on self-supervised attention, a recurrent neural network, and SVM to facilitate the screening of COVID-19 through speech [25]. This model achieved F1 values between 0.74 and 0.8 in experiments.
Based on the paralinguistic features extracted from recordings of older participants who completed the LOGOS episodic memory test, K. Sriskandaraja et al. proposed a model for predicting low and high risks of dementia based on a KNN classifier [26]. The model achieved 91% accuracy in experiments. M. T. Guimarães et al. used digitized recordings of Huntington’s disease patients and healthy volunteers reading Lithuanian poetry and proposed a prediction method for Huntington’s disease based on the KNN classifier [27]. This method achieved 99% accuracy in experiments. To assist healthcare professionals in diagnosing and monitoring patients with depression, V. Aharonson et al. designed a speech-based severity classification model using machine learning techniques [28].
V. Ramesh et al. used GAN-based cough audio features to create a classifier that can distinguish common respiratory diseases in adults [29]. This classifier achieved 76% accuracy in experiments. Using respiratory recordings, L. Pham et al. proposed a framework based on deep learning techniques to classify abnormalities in the respiratory cycle and predict diseases [30]. This framework achieves 84% accuracy on the ICBHI breath sounds benchmark dataset. Based on the CatBoost algorithm, Maksim et al. proposed a new method to classify the sex of cats based on vocalization [31]. This method achieved an accuracy rate of 90.16% in the experiment. In order to realize the classification of COVID-19 cough, Hao et al. proposed a new self-supervised learning framework based on a Transformer feature encoder [32]. This framework achieved 83.74% accuracy in experiments.
In order to detect depression as early as possible through speech, Ermal et al. used the DAIC-WOZ dataset to propose AudiBERT, a new deep learning framework that utilizes the multimodal characteristics of the human voice [33]. In order to realize the prediction of Parkinson’s disease (PD), S. Kamoji et al. used the Freezing of Gait dataset, the Parkinson’s clinical voice dataset, and the Parkinson’s disease wave and spiral drawing dataset, and designed predictive models based on decision trees, KNN, transfer learning, and convolutional neural networks [34]. The accuracy rate on the Freezing of Gait dataset reached 94.98%, and it reached 97% on the Parkinson’s clinical speech dataset. The Parkinson’s wave and spiral plot datasets achieved accuracies of 80% and 93.33%, respectively.
In order to achieve effective accent classification, Zixuan et al. proposed an end-to-end classification method based on a temporal convolutional attention network [35]. In the experiment, compared with the baseline method, the accuracy on the English and Chinese speech datasets increased by 6.27% and 26.11%, respectively. Using the Coswara dataset, P. Srikanth et al. proposed a framework for COVID-19 prediction from audio cough data based on the random forest algorithm [36]. This framework achieved 98.36% accuracy in experiments. Y. F. Khan et al. trained a hybrid model to predict Alzheimer’s disease (AD) based on a convolutional neural network (CNN) and bidirectional long short-term memory (Bidirectional LSTM) using the DementiaBank clinical transcription audio dataset [37]. This hybrid model achieved 85.05% accuracy in experiments.
Using the DiCOVA2021 audio dataset, J. Chang et al. designed a COVID-19 detection method called UFRC based on ImageNet pre-trained ResNet-50 [38]. This method achieved an accuracy of 86% in experiments. In order to detect early symptoms of voice deterioration and predict Parkinson’s disease (PD), R. Shah et al. proposed an interpretable temporal audio classification model based on neural networks [39]. This model achieved an accuracy rate of 90.32%, a precision of 91%, a recall rate of 90%, and an F1 score of 90.5% in the experiment. In order to realize the prediction of heart disease through the classification of heartbeat sounds, S. Kamepalli et al. proposed a multi-class heart sound classification and prediction model for detecting abnormal heart sounds based on stacked LSTM [40]. This model classifies heartbeat sounds into four categories with 85% and 87% accuracy on the training and validation sets, respectively.
To clarify the position of the boundaries of heart murmurs in heartbeat sounds, N. S. Bathe et al. built a model that takes the audio PCG signal as input [41]. This model achieved an accuracy rate of 93.45% in the experiment. In order to realize the prediction of Alzheimer’s disease (AD) through audio classification, V. Yadav et al. designed a feature selection method based on MFCC and MLP using the ADReSSo dataset [42]. This method achieved an accuracy rate of 75% in experiments. A. Patel et al. proposed a lung disease prediction algorithm for lung audio classification based on transfer learning and ResNet-50 [43]. This algorithm achieved an accuracy rate of 82% in the experiment.
S. Redekar et al. proposed a method for predicting heart rate values through human speech based on Mel frequency cepstral coefficients (MFCC) and the random forest algorithm [44]. This method achieved 90.3% accuracy in experiments. Using the voices of 92 subjects, F. Amato et al. proposed a method based on machine learning techniques for predicting gastroesophageal reflux disease (GERD) through voice [45]. This method achieved 82% accuracy in experiments. Y. Zhu et al. designed a detection system for coronavirus disease (COVID-19) through speech using novel modulation spectral features and linear prediction [46]. This system achieves an AUC-ROC of 0.711 on the DiCOVA2 dataset and 0.612 on the Cambridge set.
In research on neonatal care, C. Sitaula et al. proposed a method for detecting bowel sounds in newborns to assist auscultation [47]. The method uses a convolutional neural network (CNN) to classify wriggling and non-wriggling sounds and a Laplacian hidden semi-Markov model (HSMM) to refine the classification. This method achieved an accuracy of 89.89% and an area under the curve (AUC) of 83.96% in experiments. L. Burne et al. proposed a new method for automatically detecting wriggling sounds from neonatal abdominal recordings, using hand-crafted features and 1D and 2D deep features obtained from Mel frequency cepstral coefficients (MFCC) [48]. The results are then refined with a hierarchical hidden semi-Markov model (HSMM) strategy. This method achieved 95.1% accuracy and an 85.6% area under the curve (AUC) in experiments.
As can be seen from Table 1 and the above paragraphs, in current research on disease risk prediction using audio classification, the accuracy of identifying asthma has reached 77.77%, the accuracy of Parkinson’s disease prediction has reached 97.57%, the accuracy of dementia identification has reached 97.2%, and the accuracy of Huntington’s disease prediction has reached 99%. However, the underlying algorithms used in these studies could be further improved. For example, when there are many decision trees, the training time and space overhead of the random forest algorithm becomes considerable. The KNN, SVM, and LSTM algorithms run very inefficiently when there are many audio samples. The audio sample features extracted by CNN models have poor interpretability. In short, in the application scenario of nursing homes, existing research cannot meet the needs of audio-based disease risk prediction, and the Risevi model we propose meets the current application needs of nursing homes.
3. Model Design and Implementation
In this section, we mainly describe the Risevi model design and implementation process. In the Risevi model design process, we first design an audio sample generation method based on MelGAN; secondly, we design an audio sample feature extraction method based on the sampling rate; then, we construct a sample classification method based on Vision Transformer; finally, after iterative training, we obtain the best Risevi model. In implementing the Risevi model, the software platforms we use are Python 3.8.16, Keras 2.6.0, LightGBM 3.3.5, Tensorflow-GPU 2.6.0, sklearn 0.0, torch 2.0.1, transformers 4.29.1, and vit-keras 0.1.2.
3.1. Sample Generation
In the process of designing the Risevi model, the shortage and imbalance of audio samples were taken into consideration. Therefore, we designed an audio sample generation method based on MelGAN. In computer vision, GAN is widely used in image data generation [49]. In order to improve the application of GAN in the field of audio modeling, Kundan Kumar et al. proposed MelGAN, a GAN-based audio generation model, in 2019. The overall structure of MelGAN is shown in Figure 2 [50].
As shown in Figure 2 [50], MelGAN consists of a generator and a discriminator, and the Mel spectrogram is the generator’s input. The Mel spectrogram first passes through a convolutional layer; after the convolutional layer, upsampling layers, and residual modules, the generator finally outputs audio. The discriminator adopts a multi-scale architecture: it analyzes not only the original audio file but also downsampled versions of the original audio obtained through average pooling. MelGAN uses the hinge loss as the model’s loss function, as shown in Formulas (1) and (2) [50].
min_{Dk} E_x[max(0, 1 − Dk(x))] + E_{s,z}[max(0, 1 + Dk(G(s, z)))], for each scale k        (1)

min_G E_{s,z}[Σ_k −Dk(G(s, z))]        (2)

In Formulas (1) and (2) [50], x represents the initial audio file, s represents the Mel spectrogram input to the generator, z represents Gaussian noise, and k represents the scale of the discriminator.
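To make the two objectives concrete, the hinge losses above can be sketched in NumPy. This is an illustrative sketch of the loss arithmetic only (the function names are ours), not the MelGAN training code itself.

```python
import numpy as np

def discriminator_hinge_loss(d_real, d_fake):
    """Hinge loss for one discriminator scale Dk, as in Formula (1).

    d_real: discriminator outputs on real audio x.
    d_fake: discriminator outputs on generated audio G(s, z).
    """
    real_term = np.mean(np.maximum(0.0, 1.0 - d_real))
    fake_term = np.mean(np.maximum(0.0, 1.0 + d_fake))
    return real_term + fake_term

def generator_hinge_loss(d_fakes):
    """Generator loss summed over all discriminator scales, as in Formula (2).

    d_fakes: list of discriminator outputs on generated audio, one per scale k.
    """
    return sum(-np.mean(d_fake) for d_fake in d_fakes)
```

Real samples scoring above 1 and fake samples scoring below −1 contribute nothing to the discriminator loss, which is what gives the hinge objective its margin behavior.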
When we design an audio sample generation method, the efficiency of sample generation is a critical evaluation index. MelGAN’s generator replaces the autoregressive model with a non-autoregressive feedforward convolutional architecture. MelGAN generates audio efficiently enough to meet our efficiency goals without significant degradation of the generated audio quality.
3.2. Feature Extraction
Our audio sample feature extraction method draws on Mel frequency cepstral coefficients (MFCC) and the unsupervised pre-training model Wav2vec2.
The Mel scale is a non-linear frequency scale formed based on the human ear’s perception of audio frequency changes. The identification of an audio file is mainly based on the formant positions in the audio file and the transitions between formant positions. A formant position is a peak of the audio file in the spectrogram, and the process of formant position change is described by the spectral envelope, a curve drawn through the formant positions [51]. After taking the logarithm of the Fourier transform of the audio signal, the spectrum obtained by the inverse Fourier transform is the cepstrum. The process of MFCC feature extraction is shown in Figure 3.
In Figure 3, first, a Fourier transform is performed on the audio signal; secondly, the obtained spectral power is mapped onto the Mel scale; then, a logarithmic operation is performed on the obtained Mel-frequency power [51]; next, a discrete cosine transform is applied to the Mel log-power; finally, the MFCC of the spectrum is obtained [52].
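The MFCC pipeline in Figure 3 can be sketched for a single windowed frame in pure NumPy. This is a minimal illustration of the transform chain (FFT, Mel filter bank, logarithm, DCT); the helper names and default parameters are ours, not the implementation used in the Risevi model.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters whose centers are evenly spaced on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):
            fbank[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fbank[i - 1, j] = (right - j) / max(right - center, 1)
    return fbank

def dct_ii(x):
    # Orthonormal DCT-II computed directly from its definition.
    N = len(x)
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    out = 2.0 * (np.cos(np.pi * k * (2 * n + 1) / (2 * N)) @ x)
    out[0] *= np.sqrt(1.0 / (4 * N))
    out[1:] *= np.sqrt(1.0 / (2 * N))
    return out

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    """MFCC of one windowed frame, following Figure 3:
    Fourier transform -> Mel mapping -> log -> discrete cosine transform."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft         # spectral power
    mel_power = mel_filterbank(n_filters, n_fft, sr) @ power
    log_mel = np.log(mel_power + 1e-10)                     # logarithmic operation
    return dct_ii(log_mel)[:n_coeffs]                       # cepstral coefficients
```

In practice a library implementation (e.g., one built into an audio toolkit) would be used; the sketch only makes the order of operations in Figure 3 explicit.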
Wav2vec2 is a self-supervised learning framework for speech representation proposed by the Facebook AI Research team in 2020. The structure of this framework is shown in Figure 4.
In Figure 4, X represents an audio file. The framework uses a multi-layer convolutional neural network to encode X into latent speech representations Z and then inputs Z into a Transformer network to build context representations C. The construction of C is not discrete but based on continuous speech representations. The quantized representations Q are used in the loss function of the contrastive task. The loss function used by the Wav2vec2 framework in the contrastive task is shown in Formula (3) [53].
L_m = −log( exp(sim(c_t, q_t)/κ) / Σ_{q̃ ∈ Q_t} exp(sim(c_t, q̃)/κ) )        (3)

In Formula (3), c_t represents the output of the context network centered at the masked time step t, q_t is the true quantized representation at that step, sim denotes cosine similarity, and κ is a temperature constant. Formula (3) states that the model needs to distinguish the true quantized representation q_t among the candidate quantized representations q̃ ∈ Q_t.
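For a single masked time step, this contrastive loss reduces to a softmax cross-entropy over cosine similarities. The following is a minimal NumPy sketch under that reading (function names are ours); Wav2vec2 itself applies this over batches of masked steps inside a Transformer.

```python
import numpy as np

def cosine_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(c_t, q_true, q_distractors, kappa=0.1):
    """Contrastive loss for one masked time step, as in Formula (3).

    c_t: context network output at the masked step.
    q_true: the true quantized representation q_t.
    q_distractors: the distractor quantized representations drawn from Q_t.
    """
    candidates = [q_true] + list(q_distractors)
    logits = np.array([cosine_sim(c_t, q) / kappa for q in candidates])
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # true candidate sits at index 0
```

The loss is small when c_t points in the same direction as q_t and large when a distractor is more similar, which is exactly the discrimination task Formula (3) describes.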
3.3. Transfer Learning
The ideal machine learning training scenario has many labeled samples to train the model. However, it is challenging to collect enough training samples in most cases. Semi-supervised learning can partially solve the problem by relaxing the demand for massive labeled data: it needs only a certain amount of labeled sample data and can use large amounts of unlabeled sample data to improve the accuracy of model learning. However, in practical application scenarios, comprehensive unlabeled datasets are also challenging to collect. Transfer learning focuses on cross-domain knowledge transfer and is currently the most effective machine learning method for solving the above problems [54].
Transfer learning is an essential topic in machine learning and data mining [55]. Traditional machine learning tries to train a learning model for a single task, while transfer learning tries to transfer knowledge from a source task to a target task. The datasets collected by non-medical testing equipment are relatively scarce, and there are many parameters in a deep learning network model; therefore, when performing health risk prediction [56], a limited number of samples makes the deep learning model prone to overfitting during training. To overcome this problem, we employ a transfer learning strategy. Transfer learning is a deep learning technique that uses a model pre-trained on a large dataset as the initialization weights for a model trained on a different dataset. CNNs tend to perform better on larger datasets than on smaller ones. Fields using transfer learning techniques include object detection, medical imaging [57], and image classification. Models trained on large datasets, such as ImageNet, can extract features from smaller datasets. Training with transfer learning is more efficient than training from scratch, helps prevent overfitting during model training, and requires only a small amount of data to improve model performance. The pre-trained model used in this research is Vision Transformer.
In this paper, the loss function used by the Risevi model we proposed based on transfer learning is shown in Formula (4).
L = −(1/N) Σ_{i=1..N} Σ_{j=1..M} y_ij log(ŷ_ij)        (4)

In Formula (4), L represents the loss value of the model; N represents the number of samples; M represents the dimension of the prediction vector; ŷ_ij represents the predicted value of the output; y_ij represents the actual value of the sample.
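Formula (4) is the standard categorical cross-entropy, which can be computed directly, as in this NumPy sketch (the function name and the clipping constant are ours):

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy matching Formula (4):
    L = -(1/N) * sum_i sum_j y_ij * log(yhat_ij).

    y_true: one-hot actual labels, shape (N, M).
    y_pred: predicted probabilities, shape (N, M).
    """
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
```

A perfect prediction yields a loss near 0, while assigning probability 0.5 to the true class yields −log(0.5) ≈ 0.693.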
3.4. Feature Classification
When designing the feature classification method, we used Vision Transformer, LightGBM, CNN, LSTM, and SVM as the basic algorithms to construct the feature classification method and compared the experimental results. Experimental results show that the effect of constructing feature classification is the best using Vision Transformer as the primary model algorithm.
The Transformer model is an end-to-end NLP model proposed by the Google team in 2017. This model uses the self-attention mechanism, which allows it to obtain global information and to be trained in parallel. Vision Transformer (ViT), proposed by the Google team in 2021, adapts the Transformer to images and is used in computer vision for image classification, object detection, and video processing. Y. Zhou et al. used the ViT algorithm for fire smoke detection [58], W. Zhang et al. used the ViT algorithm to identify the quality of metal 3D printing [59], S. R. Dubey et al. used the ViT algorithm for image search [60], and X. Li et al. used the ViT algorithm to reconstruct 3D volumes from a single image [61]. Y. Fang et al. used the ViT algorithm for sitting posture recognition [62], A. Dey et al. used the ViT algorithm for detecting fall events [63], and T. Chuman used the ViT algorithm for image encryption [64].
In NLP, the input to the Transformer is a sequence of phrases. Following this cue, the researchers first segmented the image into multiple blocks, then combined these image blocks into a linear sequence, and finally used these sequences as input to the Vision Transformer. The overall architecture of the Vision Transformer is shown in Figure 5.
In Figure 5 [65], the Vision Transformer first divides the image into image blocks that the model can process; it then uses the Linear Projection layer to embed all patches and encode their positions, generating an embedded sequence; next, the sequence is input into the standard Transformer Encoder; finally, the model is trained on a large-scale training set. The feature classification method we designed is based on the Vision Transformer model and fine-tuned using our dataset.
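The patch-and-project step in Figure 5 can be sketched in NumPy. This illustrative version (function and argument names are ours) splits the image into non-overlapping patches and applies one linear projection; in ViT itself, learned positional encodings and a class token are then added before the Transformer Encoder.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size, W_proj):
    """Split an (H, W, C) image into non-overlapping patches and project
    each flattened patch with a linear layer, as in the Linear Projection
    step of Figure 5. W_proj has shape (patch_size * patch_size * C, embed_dim)."""
    H, W, C = image.shape
    p = patch_size
    patches = [image[i:i + p, j:j + p, :].reshape(-1)
               for i in range(0, H, p) for j in range(0, W, p)]
    return np.stack(patches) @ W_proj   # (num_patches, embed_dim)
```

For a 224 × 224 × 3 image with 16 × 16 patches this yields the familiar sequence of 196 patch embeddings.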
3.5. Risevi Model
In order to realize disease risk prediction using audio data, we designed the Risevi model based on MelGAN, transfer learning, and Vision Transformer. The overall architecture of the Risevi model is shown in Figure 6.
In Figure 6, the operation process of the Risevi model is as follows:
Combine the dataset “patient-health-detection-using-vocal-audio” and subject data to form a mixed sample data;
Generate sample data using a sample generation method designed based on MelGAN;
Use the feature extraction method to extract features, then convert the tensor of the audio file into a floating-point number and convert it into a waveform;
Perform data augmentation and deduplication operations on image data;
Divide the calculated image data into three parts, one for training data, one for verification data, and one for test data;
Load the Vision Transformer model pre-trained weights;
Iteratively train and fine-tune the Risevi model;
Obtain the Risevi model.
3.5.1. Sample Data
During the implementation of the Risevi model, the dataset used is the dataset “Patient Health Detection using Vocal Audio” publicly available on Kaggle. This dataset consists of sound audio files of regular people and diseased patients [66], and the sample data distribution is shown in Table 2.
As shown in Table 2, the audio files in the dataset are all wav audio files. Among them are 560 audio files of regular people, 84 audio files of patients with Laryngocele disease, and 392 audio files of patients with Vox senilis disease. The dataset comes from Kaggle, and its total size is 401.7 MB.
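To quantify the imbalance that motivates the MelGAN-based sample generation, the class shares and imbalance ratio implied by Table 2 can be computed directly (a minimal sketch; the variable names are ours):

```python
# Class counts taken from Table 2.
counts = {"Normal": 560, "Laryngocele": 84, "Vox senilis": 392}

total = sum(counts.values())                                   # 1036 files in all
shares = {name: n / total for name, n in counts.items()}       # per-class share
imbalance_ratio = max(counts.values()) / min(counts.values())  # 560 / 84, about 6.67
```

The largest class outnumbers the smallest by roughly 6.7 to 1, which is the gap the generated samples are meant to close.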
3.5.2. Sample Generation Implementation Process
In this paper, we propose a sample generation method based on MelGAN. The method’s input is the spectrogram of the audio file, and the output is the audio file corresponding to the spectrogram. The primary process of the sample generation method is as follows:
Library file import;
Load the audio dataset;
Preprocess audio datasets;
In order to enhance the characteristic information of the audio data, the audio file is pre-emphasized using an FIR high-pass filter;
In order to avoid spectral leakage of the audio signal, apply a Hamming window function to the audio data;
Use the short-time Fourier transform to obtain the time-frequency signal of an audio file;
Obtain the audio file’s spectrogram by superimposing each frame’s frequency domain signal;
Convert the spectrogram to a Mel-spectrogram by using a Mel-scale filter bank;
Using the MelGAN generator network, build a sample generator model;
Using the MelGAN discriminator network, create a sample discriminator model;
Define the generator loss;
Define the discriminator loss;
Set up checkpoint saving;
Define training parameters;
Train the model.
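Steps 4–7 of the pipeline above (pre-emphasis, Hamming windowing, short-time Fourier transform, and stacking frames into a spectrogram) can be sketched in NumPy. The function name, default coefficients, and frame sizes here are illustrative choices of ours; the Mel-filter-bank conversion of step 8 is omitted.

```python
import numpy as np

def preprocess_to_spectrogram(audio, pre_emphasis=0.97, n_fft=512, hop=256):
    """Pre-emphasize, window, and STFT an audio signal, returning a
    magnitude spectrogram of shape (n_fft // 2 + 1, n_frames)."""
    # Step 4: one-tap FIR high-pass pre-emphasis, y[n] = x[n] - a * x[n-1].
    emphasized = np.append(audio[0], audio[1:] - pre_emphasis * audio[:-1])
    # Steps 5-6: Hamming-windowed frames, each sent through an FFT.
    window = np.hamming(n_fft)
    frames = [np.abs(np.fft.rfft(emphasized[s:s + n_fft] * window))
              for s in range(0, len(emphasized) - n_fft + 1, hop)]
    # Step 7: stack the per-frame spectra into a spectrogram.
    return np.stack(frames, axis=1)
```

With a 16 kHz signal and the defaults above, each column covers 32 ms of audio with 50% overlap between adjacent frames.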
The structural design of the MelGAN generator model used during the sample generation process is shown in Table 3.
In Table 3, the audio spectrogram input to the generator network passes through a convolutional layer and is then transmitted to the upsampling stages. A residual module is nested after each upsampling layer, and the final output is audio. Since the original audio has a 256× higher temporal resolution than the Mel spectrogram, the generator network needs to perform 256× upsampling in total. The structural design of the MelGAN discriminator model used in the sample generation process is shown in Table 4.
In Table 4, the discriminator network uses a large convolution kernel of size 41 and uses grouped convolution to keep the number of parameters small. The weights in each layer of the discriminator network are normalized.
3.5.3. Feature Extraction Implementation Process
This paper’s feature extraction method mainly refers to the Mel frequency cepstral coefficient MFCC of audio files and the unsupervised pre-training model Wav2vec2. The primary process of the feature extraction method is as follows:
Library file import;
Load the dataset using buffered prefetch;
Apply noise reduction to audio files;
Fade in and fade out audio files;
Load a pre-trained Wav2Vec2 model as an embedding feature extractor;
Fine-tune the pre-trained feature extractor and perform embedding calculations on the current audio features;
Resample audio to 16 kHz;
Use feature_extractor to extract features;
Obtain embedding representations from the model;
Convert the feature tensor of the extracted audio file into a floating-point number, and then convert the obtained floating-point number into a waveform;
Perform frequency masking and time masking;
Generate a new sample dataset.
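The frequency masking and time masking step above (in the style of SpecAugment) can be sketched in NumPy: each mask zeroes out a random band of consecutive frequency bins or time frames of a spectrogram. The function names and the maximum-width parameterization are ours.

```python
import numpy as np

def freq_mask(spec, max_width, rng):
    """Zero out a random band of consecutive frequency bins (rows)."""
    spec = spec.copy()
    w = rng.integers(1, max_width + 1)
    f0 = rng.integers(0, spec.shape[0] - w + 1)
    spec[f0:f0 + w, :] = 0.0
    return spec

def time_mask(spec, max_width, rng):
    """Zero out a random band of consecutive time frames (columns)."""
    spec = spec.copy()
    w = rng.integers(1, max_width + 1)
    t0 = rng.integers(0, spec.shape[1] - w + 1)
    spec[:, t0:t0 + w] = 0.0
    return spec
```

Applying both masks to each training spectrogram forces the classifier not to rely on any single narrow band of frequencies or short stretch of time.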
3.5.4. Feature Classification Implementation Process
This paper proposes a feature classification method based on transfer learning and the Vision Transformer (ViT) network model. ViT pre-trained models include L/16, B/16, B/32, S/16, R50 + L/32, and R26 + S/32, among others. The accuracy of the pre-trained models L/16, B/16, B/32, and R50 + L/32 on the ImageNet2012 dataset exceeds 80%, but the parameter size of the pre-trained model B/32 is comparatively small, so we use the pre-trained model B/32. When migrating the weights of the pre-trained model, from the perspective of dataset size and similarity, the generated dataset we use is large, but it differs from the dataset used by the pre-trained model. To fine-tune the model, we first freeze all layers and then unfreeze and train one layer at a time, continuing layer by layer until the model reaches the preset evaluation index threshold.
In order to adapt the model to audio feature classification, we fine-tuned the Encoder Block and MLP Block of the Transformer Encoder module in the ViT network model. Dropout randomly deactivates some neurons according to a preset node retention probability, which reduces overfitting of the neural network. However, this method cannot guarantee that the cost function decreases monotonically, so the model needs more training iterations to achieve the expected accuracy. In order to further improve the training efficiency and prediction accuracy of the ViT network model, we replaced the Dropout in the Encoder Block and MLP Block of the Transformer Encoder module with DropPath. During model training, after the training, validation, and test sets are divided, DropPath is active in the classification model training process; when the trained model is deployed to the application for classification, DropPath is not applied.
The primary process of the feature classification method is as follows:
Library file import;
Load the sample dataset;
Rotate and horizontally flip the sample dataset;
Scale the sample dataset pixel values;
Divide the training set, validation set, and test set;
Create a base model using the pre-trained ViT-B32 model;
Add Flatten, Batch Normalization, and Dense;
Set optimizer and learning rate;
Train the model;
Evaluate the model;
Adjust parameter values.
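The rotation, horizontal flip, and pixel-scaling steps above can be sketched with NumPy alone; this is an illustrative stand-in for the framework's image preprocessing utilities, with the 90-degree rotation granularity as an assumption:

```python
import numpy as np

def augment_and_scale(img: np.ndarray, rng=None) -> np.ndarray:
    """Random 90-degree rotation and horizontal flip, then scale pixel
    values from [0, 255] to [0, 1]."""
    rng = rng or np.random.default_rng()
    img = np.rot90(img, k=int(rng.integers(0, 4)))  # random rotation
    if rng.random() < 0.5:
        img = np.fliplr(img)                        # horizontal flip
    return img.astype(np.float32) / 255.0           # pixel scaling
```

Scaling to [0, 1] before the train/validation/test split keeps the input range consistent with what the pre-trained ViT-B32 backbone expects after its own normalization.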
3.5.5. Model Algorithm
The operation process of the Risevi model is shown in Algorithm 1.
Algorithm 1: Risevi model algorithm
Input: Public dataset D_pub; audio dataset of young, healthy subjects D_sub.
Output: Risevi model
D_train: training sample
D_val: verification sample
D_test: test sample
D_mix: mixed sample
1   D_mix ← D_pub + D_sub
2   D_gen ← Generate samples from D_mix based on MelGAN
3   D_feat ← Extract features from D_gen, operate tensors
4   D_wave ← Convert D_feat into waveform images
5   for item in D_wave:
6       Data augmentation
7   for item in D_wave:
8       Sample deduplication
9   D_train, D_val, D_test ← D_wave
10  Load classification model
11  Load pretrained weights
12  while True:
13      Train the classification model on D_train
14      Use D_val for accuracy, precision, recall, and F1 score monitoring
15      Adjust the parameter values according to the monitoring results
16      if accuracy ≥ θ_acc and precision ≥ θ_pre and recall ≥ θ_rec and F1 score ≥ θ_F1:
17          Obtain the Risevi model
18          break
In Algorithm 1, D_pub represents the public dataset used; D_sub represents the audio dataset of young, healthy subjects; D_gen represents the dataset generated by the sample generation method; D_feat represents the dataset extracted by the feature extraction method; D_wave represents the dataset after conversion into waveform images; θ_acc represents the preset accuracy value; θ_pre represents the preset precision value; θ_rec represents the preset recall value; θ_F1 represents the preset F1 score.
3.6. Evaluation Metrics
This paper uses Accuracy, Precision, Recall, and F1 score to evaluate the Risevi model.
In Formulas (5)–(8), TP means true positive, TN means true negative, FP means false positive, and FN means false negative. The confusion matrix of the evaluation indicators is shown in Table 5.
Table 5 describes the confusion matrix relationship of evaluation indicators true positive, true negative, false positive, and false negative.
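The four evaluation metrics follow directly from the confusion matrix entries. A straightforward implementation for the binary case:

```python
def metrics_from_confusion(tp: int, tn: int, fp: int, fn: int):
    """Compute Accuracy, Precision, Recall, and F1 score from
    confusion matrix counts (binary case)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For the three-class setting used here (regular voice, Laryngocele, Vox senilis), these quantities are computed per class and then averaged in the usual way.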
3.7. Application Practice
In industrial practice, we use the Python language on the Linux system platform to develop a backend service system based on the open-source web application framework Django. This backend service system encapsulates the trained Risevi model behind an API interface: given a received audio file, the API interface returns the name of the disease and the probability of disease risk. An APP or a health analysis system can leverage the Risevi model by connecting to this API interface. During operation, the API interface may time out because of network speed and other factors. Therefore, the backend service system is deployed on both the cloud server and the local server, and the cloud and local servers are hot backups for each other.
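The hot-standby behavior described above reduces to a simple failover rule. In this hedged sketch, `cloud_predict` and `local_predict` are hypothetical stand-ins for calls to the two deployed API interfaces:

```python
def predict_with_failover(audio_bytes, cloud_predict, local_predict):
    """Query the cloud-deployed Risevi API first; if it times out,
    fall back to the model deployed on the local server."""
    try:
        return cloud_predict(audio_bytes)
    except TimeoutError:
        return local_predict(audio_bytes)
```

Injecting the two prediction callables keeps the failover policy independent of the transport (HTTP client, timeout settings), which also makes it easy to test.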
In this section, we describe the sample generation, feature extraction, and classification methods used in the Risevi model’s design. We describe the sample data used in the implementation of the Risevi model, the implementation of the sample generation method, the implementation of the feature extraction method, the implementation of the feature classification method, the algorithm and application practice of the Risevi model, and the evaluation index of the Risevi model.
4. Experimental Results and Analysis
This section presents the experimental results of the Risevi model obtained with different optimizers, with different learning rates, before and after model optimization, and with different basic algorithms.
4.1. Comparison of Different Optimizers
In the Risevi model experiment, we used five optimizers, RectifiedAdam, AdaBelief, LAMB, LazyAdam, and NovoGrad, when the batch_size was 16, and the learning_rate was 0.0001. We compared the experimental results using Accuracy, Precision, Recall, and F1 score.
As can be seen from Figure 7, the accuracy and precision of the five optimizers RectifiedAdam, AdaBelief, LAMB, LazyAdam, and NovoGrad are all over 90%. As can be seen from Figure 7a, the accuracy rate using the RectifiedAdam optimizer is the highest. It can be seen from Figure 7b that the precision rate using the RectifiedAdam optimizer is the highest.
As can be seen from Figure 8, the recall rate and F1 score of the five optimizers RectifiedAdam, AdaBelief, LAMB, LazyAdam, and NovoGrad all exceed 75%, and the RectifiedAdam optimizer achieves the highest recall rate and F1 score among them.
4.2. Comparison of Different Learning Rates
In the Risevi model experiment, we fixed the batch size at 16 and set the learning rate to 0.1, 0.01, 0.001, 0.0001, and 0.00001, respectively. We compared the experimental results using Accuracy, Precision, Recall, and F1 score.
As can be seen from Figure 9, the accuracy and precision of the five learning rates 0.1, 0.01, 0.001, 0.0001, and 0.00001 are all over 85%. It can be seen from Figure 9a that a learning rate of 0.0001 achieves the highest accuracy. From Figure 9b, we can see that the precision rate is highest when using a learning rate of 0.0001.
Figure 10 shows that the recall rates and F1 scores using the four learning rates 0.01, 0.001, 0.0001, and 0.00001 are over 85%. From Figure 10a, we can see that a learning rate of 0.0001 achieves the highest recall. From Figure 10b, we can see that the F1 score using a learning rate of 0.0001 is the highest.
4.3. Comparison before and after Optimization
In the Risevi model experiment, we optimized feature extraction, feature classification, and the corresponding parameter values. We compared the experimental results before and after model optimization using Accuracy, Precision, Recall, and F1 score. The model optimization process is shown in Figure 11.
As shown in Figure 11, we first load the pre-trained model; second, we freeze all layers, then unfreeze and train one layer at a time; next, we compare the different optimizers and learning rates; then, we fine-tune the Encoder Block and MLP Block of the Transformer Encoder module in the ViT network model; finally, we optimize the other parameters in the model.
As can be seen from Figure 12a, the accuracy, precision, recall, and F1 score of the optimized model are all improved compared with those before optimization. It can be seen from Figure 12b that the AUC value of the optimized model is very close to 1.0.
4.4. Comparison of Different Algorithms
In the Risevi model experiment, we used Vision Transformer, VGG19, ResNet50, CNN, LightGBM, LSTM, and SVM as basic models to build feature classification models.
As can be seen from Figure 13, using Vision Transformer, VGG19, ResNet50, CNN, LightGBM, LSTM, and SVM as the basic models, the accuracy and precision of the constructed feature classification models all exceed 50%. As can be seen from Figure 13a, the accuracy rate using Vision Transformer is the highest. As can be seen from Figure 13b, the precision rate using Vision Transformer is the highest.
As can be seen from Figure 14, using Vision Transformer, VGG19, ResNet50, CNN, LightGBM, LSTM, and SVM as the basic models, the recall rate and F1 score of the constructed feature classification models all exceed 50%. From Figure 14a, we can see that the recall rate using Vision Transformer is the highest. From Figure 14b, we can see that the F1 score using Vision Transformer is the highest.
The contribution of this article is to innovatively apply the Vision Transformer to audio classification so that disease risk can be predicted from audio files in the application scenario of nursing homes. Most current audio classification research classifies the MFCC parameters extracted from audio files. In contrast, after extracting the features of an audio file, we first convert the feature tensor into floating-point numbers, then convert them into a waveform image, then classify the waveform image, and thereby classify the audio file.
In this section, we presented the experimental results obtained using different optimizers, different learning rates, before and after model optimization, and different basic algorithms. These results show that, in the application scenario of nursing homes, the Risevi model can predict disease risk based on audio.