1. Introduction
With the rapid development of artificial intelligence technology, a large body of research on prediction and classification (such as music genre classification) has emerged based on machine learning and deep learning [1,2]. In the medical field, since hospitals hold massive structured text data (such as clinical complaints [3], symptoms [4], and physiological values) and image data (such as CT and MRI) [5], there are many studies on disease risk prediction based on text and image classification [6].
As population aging continues to intensify, improving the ability of nursing homes to predict disease risks is necessary to reduce the pressure on public medical care. However, nursing homes differ from hospitals: they have neither professional medical testing equipment nor professional medical information systems. Therefore, to improve the disease risk prediction ability of nursing homes, only data collected by non-professional medical equipment can be used. In this article, the sample data we use are the dataset “patient-health-detection-using-vocal-audio”, which is publicly available on Kaggle, together with subject data. The “patient-health-detection-using-vocal-audio” dataset contains audio files of three types of sounds: those of regular people and those of people with the voice diseases Laryngocele and Vox senilis.
In order to predict disease risk effectively from audio files in nursing homes, we first solve the problem of sample data imbalance in the “patient-health-detection-using-vocal-audio” dataset and the subject data; we then learn representations of the sample data; next, we analyze the representation learning results to classify the sample data effectively; finally, we obtain the best Risevi model (a disease risk prediction model based on Vision Transformer). The application scenario of the Risevi model is shown in Figure 1.
In Figure 1, a nursing home that already uses an information system is taken as an example to describe the application scenario of the Risevi model. The Risevi model is deployed on the nursing home’s local server and a cloud server, and the two servers are hot backups for each other. We use third-party speech synthesis technology to interact with the elderly residents through the voice of a tablet computer during daily routine rounds. Every time an older adult replies, the voice file is collected on the local and cloud servers. After the Risevi model deployed on the cloud server receives a resident’s voice file, it performs the relevant analysis and feeds the results back to the health monitoring and alarm platform. If the Risevi model deployed on the cloud server fails to respond in time, the Risevi model deployed on the local server feeds the analysis results back to the health monitoring and alarm platform instead. The health management personnel of the nursing home formulate a health management plan for each older adult by checking that resident’s health information on the health monitoring and alarm platform.
The research goal of this paper is to use audio files in the application scenario of nursing homes to achieve adequate disease risk prediction. To achieve this goal, we first design a sample generation method based on MelGAN to solve the problem of sample data imbalance; we then design a sample feature extraction method that uses Mel frequency cepstral coefficients (MFCC) and Wav2vec2 to extract audio data features; next, we design a sample feature classification method based on transfer learning and Vision Transformer to classify the audio data effectively; finally, we use the best Risevi model obtained to achieve effective prediction of disease risk from audio files.
The main work of this paper is as follows:
When designing the Risevi model, to solve the problem of audio sample data imbalance, we propose an audio sample generation method based on MelGAN;
We propose an audio sample feature extraction method to allow the designed Risevi model to obtain the audio sample data characteristics accurately;
To accurately classify audio sample data features, we propose an audio sample feature classification method based on transfer learning and Vision Transformer;
The implementation of the Risevi model that we propose is readily reproducible for audio multi-class classification tasks.
This paper consists of six sections. The first section is an introduction, which mainly describes the research background of this paper; the second section is related work, which mainly describes the research status of this paper; the third section is model design, which mainly describes the process of Risevi model design; the fourth section is the realization of the model, which mainly describes the realization process of the Risevi model; the fifth section is the analysis of the experimental results, which mainly describes the experimental results; the sixth section is the conclusion, which mainly describes the results of this research and the prospect of this research.
2. Related Works
There have been many kinds of research on audio classification in recent years. M. Wisniewski et al. proposed a recognition method based on MRMR and SVM to identify asthmatic wheezing using lung sound features [7]. The recognition accuracy of this method reached 92.9%. Ian et al. proposed a robust sound classification system based on support vector machines and deep neural network classifiers [8]. The system achieved an average accuracy of 92.58% under high-noise conditions. Le et al. proposed a method and predictive framework for depression classification based on audio, video, and text descriptions, built on deep convolutional neural networks (DCNN), deep neural networks (DNN), paragraph vectors (PV), support vector machines (SVM), and random forests [9]. M. V. A. Rao et al. designed a prediction method based on the SVR (Support Vector Regression) algorithm to automatically predict spirometry readings from audio signals of coughing and wheezing and then predict the severity of asthma [10]. This method achieved an accuracy of 77.77% in experiments.
Myounggyu et al. developed an audio classification framework for mobile applications using mobile sensing technology to solve the problem that audio characteristics change over time and affect the accuracy of audio classification in automotive application scenarios [11]. Experimental results show that this framework improves the average classification accuracy for speech and music by 166% and 64%, respectively, compared to non-adaptive methods. Michael et al. developed the AUDEEP toolkit for deep unsupervised representation learning from acoustic data using deep recurrent neural networks [12]. This toolkit achieved 88% accuracy on the sound scene recognition dataset TUT Acoustic Scenes 2017. Yifang et al. proposed a new acoustic scene classification system based on multi-modal deep feature fusion [13]. This system achieved 91% accuracy on the DCASE16 dataset.
Gabriel et al. proposed a speech emotion classification model based on a neural network [14]. This model uses convolutional layers and feature combinations for feature extraction. The model achieved 71% accuracy in classifying four categories: Anger, Happiness, Neutral, and Sadness. E. J. Alqahtani et al. designed a classification method based on the SMOTE algorithm, the AdaBoost ensemble classifier, and the NNge classification algorithm to predict Parkinson’s disease (PD) through voice recordings [15]. This method achieved an accuracy rate of 96.3% in the experiment. A. Joshi et al. built a system based on a hierarchical Bayesian neural network to predict Parkinson’s disease (PD) by extracting relevant features from video and audio [16].
Y. You et al. proposed a parallel classification system based on KNN and SVM to realize dementia risk prediction using audio recordings [17]. The accuracy of this system reached 97.2% in the experiment. Arunodhayan et al. developed a real-time classification system for acoustic events based on convolutional neural networks (CNN) using the publicly available ESC-50 and Ultrasound-8k datasets [18]. S. Aich et al. proposed a Parkinson’s disease (PD) prediction method based on nonlinear and linear classification of speech, using a dataset created by Oxford University, Max Leedt University, and the National Center for Voice and Speech in Denver, Colorado [19]. This method achieved an accuracy rate of 97.57% in the experiment. D. Pettas et al. proposed a method for predicting audio events of pressurized metered-dose inhalers based on long short-term memory (LSTM) units and recurrent neural networks with spectrogram features to improve drug compliance in patients with obstructive inflammatory lung disease [20]. The accuracy of this method in the experiment reached 92.76%.
In order to solve the problem of severe decline in classification accuracy in noisy or weakly segmented application scenarios in audio classification, Irene et al. proposed a new type of pooling layer [21]. The average performance gain of this method on the ESC-30 and URBAN datasets reaches 7.9% and 17.3% for Gaussian noise and 4.3% and 6.7% for Brownian noise, respectively. In order to capture the temporal information of the entire audio sample, Liwen et al. proposed a method called Pyramid Time Pooling (PTP) [22]. This approach can capture high-level temporal dynamics of input feature sequences in an unsupervised manner. This method achieves 88.9% accuracy on the AER task.
In order to improve the performance of multi-class deep neural networks, Yuan et al. proposed a novel multi-class classification framework based on local OVR deep neural networks [23]. This framework achieved a classification accuracy of 86.8% on the ESC-50 dataset. Bo et al. proposed a retrieval-based scene classification framework based on recurrent neural networks [24]. A detection accuracy of 93% was obtained through experiments on natural audio scenes. G. Pinkas et al. designed a prediction model based on self-supervised attention, a recurrent neural network, and SVM to facilitate the screening of COVID-19 through speech [25]. This model achieved F1 values between 0.74 and 0.8 in experiments.
Based on the paralinguistic features extracted from recordings of older participants who completed the LOGOS episodic memory test, K. Sriskandaraja et al. proposed a model for predicting low and high risks of dementia based on a KNN classifier [26]. The model achieved 91% accuracy in experiments. M. T. Guimarães et al. used digitized recordings of Huntington’s disease patients and healthy volunteers reading Lithuanian poetry and proposed a prediction method for Huntington’s disease based on the KNN classifier [27]. This method achieved 99% accuracy in experiments. To assist healthcare professionals in diagnosing and monitoring patients with depression, V. Aharonson et al. designed a speech-based severity classification model using machine learning techniques [28].
V. Ramesh et al. used GAN-based cough audio features to create a classifier that can distinguish common respiratory diseases in adults [29]. This classifier achieved 76% accuracy in experiments. Using respiratory recordings, L. Pham et al. proposed a framework based on deep learning techniques to classify abnormalities in the respiratory cycle and predict diseases [30]. This framework achieves 84% accuracy on the ICBHI breath sounds benchmark dataset. Based on the CatBoost algorithm, Maksim et al. proposed a new method to classify the sex of cats based on vocalization [31]. This method achieved an accuracy rate of 90.16% in the experiment. In order to realize the classification of COVID-19 cough, Hao et al. proposed a new self-supervised learning framework based on a Transformer feature encoder [32]. This framework achieved 83.74% accuracy in experiments.
In order to detect depression as early as possible through speech, Ermal et al. used the DAIC-WOZ dataset to propose AudiBERT, a new deep learning framework that utilizes the multimodal characteristics of the human voice [33]. In order to realize the prediction of Parkinson’s disease (PD), S. Kamoji et al. used the Freezing of Gait dataset, the Parkinson’s clinical voice dataset, and the Parkinson’s disease wave and spiral drawing dataset, and designed predictive models based on decision trees, KNN, transfer learning, and convolutional neural networks [34]. The accuracy rate on the Freezing of Gait dataset reached 94.98%, and it reached 97% on the Parkinson’s clinical speech dataset. The Parkinson’s wave and spiral plot datasets achieved accuracies of 80% and 93.33%, respectively.
In order to achieve effective accent classification, Zixuan et al. proposed an end-to-end classification method based on a temporal convolutional attention network [35]. In the experiment, compared with the baseline method, the accuracy on the English and Chinese speech datasets increased by 6.27% and 26.11%, respectively. Using the Coswara dataset, P. Srikanth et al. proposed a framework for COVID-19 prediction from audio cough data based on the random forest algorithm [36]. This framework achieved 98.36% accuracy in experiments. Y. F. Khan et al. trained a hybrid model to predict Alzheimer’s disease (AD) based on a convolutional neural network (CNN) and bidirectional long short-term memory (Bidirectional LSTM) using the DementiaBank clinical transcription audio dataset [37]. This hybrid model achieved 85.05% accuracy in experiments.
Using the DiCOVA2021 audio dataset, J. Chang et al. designed a COVID-19 detection method called UFRC based on ImageNet pre-trained ResNet-50 [38]. This method achieved an accuracy of 86% in experiments. In order to detect early symptoms of voice deterioration and predict Parkinson’s disease (PD), R. Shah et al. proposed an interpretable temporal audio classification model based on neural networks [39]. This model achieved an accuracy rate of 90.32%, a precision of 91%, a recall rate of 90%, and an F1 score of 90.5% in the experiment. In order to realize the prediction of heart disease through the classification of heartbeat sounds, S. Kamepalli et al. proposed a multi-class heart sound classification and prediction model for detecting abnormal heart sounds based on stacked LSTM [40]. This model classifies heartbeat sounds into four categories with 85% and 87% accuracy on the training and validation sets, respectively.
To clarify the position of the boundaries of heart murmurs in heartbeat sounds, N. S. Bathe et al. built a model that takes the audio PCG signal as input [41]. This model achieved an accuracy rate of 93.45% in the experiment. In order to realize the prediction of Alzheimer’s disease (AD) through audio classification, V. Yadav et al. designed a feature selection method based on MFCC and MLP using the ADReSSo dataset [42]. This method achieved an accuracy rate of 75% in experiments. A. Patel et al. proposed a lung disease prediction algorithm for lung audio classification based on transfer learning and ResNet-50 [43]. This algorithm achieved an accuracy rate of 82% in the experiment.
S. Redekar et al. proposed a method for predicting heart rate values through human speech based on Mel frequency cepstral coefficients (MFCC) and the random forest algorithm [44]. This method achieved 90.3% accuracy in experiments. Using the voices of 92 subjects, F. Amato et al. proposed a method based on machine learning techniques for predicting gastroesophageal reflux disease (GERD) through voice [45]. This method achieved 82% accuracy in experiments. Y. Zhu et al. designed a detection system for coronavirus disease (COVID-19) through speech using novel modulation spectral features and linear prediction [46]. This system achieves an AUC-ROC of 0.711 on the DiCOVA2 dataset and 0.612 on the Cambridge set.
In research on neonatal care, C. Sitaula et al. proposed a method for detecting bowel sounds in newborns to assist auscultation [47]. The method uses a convolutional neural network (CNN) to classify wriggling and non-wriggling sounds and a Laplacian hidden semi-Markov model (HSMM) to refine the classification. This method achieved an accuracy of 89.89% and an area under the curve (AUC) of 83.96% in experiments. L. Burne et al. proposed a new method for automatically detecting wriggling sounds from neonatal abdominal recordings, using hand-crafted features and 1D and 2D deep features obtained from Mel frequency cepstral coefficients (MFCC) [48]. The results are then refined with a hierarchical hidden semi-Markov model (HSMM) strategy. This method achieved 95.1% accuracy and an 85.6% area under the curve (AUC) in experiments.
As can be seen from Table 1 and the above paragraphs, in current research on disease risk prediction using audio classification, the accuracy of identifying asthma has reached 77.77%, the accuracy of Parkinson’s disease prediction has reached 97.57%, the accuracy of dementia identification has reached 97.2%, and the accuracy of Huntington’s disease prediction has reached 99%. However, the underlying algorithms used in these studies could be further improved. For example, when there are many decision trees, the training time and space overhead of the random forest algorithm becomes considerable. The KNN, SVM, and LSTM algorithms run very inefficiently when there are many audio samples. The audio sample features extracted by CNN models have poor interpretability. In short, in the application scenario of nursing homes, existing research cannot meet the needs of audio-based disease risk prediction, and the Risevi model we propose meets the current application needs of nursing homes.
3. Model Design and Implementation
In this section, we mainly describe the Risevi model design and implementation process. In the Risevi model design process, we first design an audio sample generation method based on MelGAN; secondly, we design an audio sample feature extraction method based on the sampling rate; then, we construct a sample classification method based on Vision Transformer; finally, after iterative training, we obtain the best Risevi model. In implementing the Risevi model, the software platforms we use are Python 3.8.16, Keras 2.6.0, LightGBM 3.3.5, Tensorflow-GPU 2.6.0, sklearn 0.0, torch 2.0.1, transformers 4.29.1, and vit-keras 0.1.2.
3.1. Sample Generation
In the process of designing the Risevi model, the shortage and imbalance of audio samples were taken into consideration. Therefore, we designed an audio sample generation method based on MelGAN. In computer vision, GAN is widely used in image data generation [49]. In order to improve the application of GAN in the field of audio modeling, Kundan Kumar et al. proposed MelGAN, a GAN-based audio generation model, in 2019. The overall structure of MelGAN is shown in Figure 2 [50].
As shown in Figure 2 [50], MelGAN consists of a generator and a discriminator, and the Mel spectrogram is the generator’s input. The Mel spectrogram first passes through a convolutional layer; after the convolutional layer, upsampling layers, and residual modules, the generator finally outputs audio. The discriminator adopts a multi-scale architecture: it analyzes not only the original audio file but also downsampled versions of the original audio obtained through average pooling. MelGAN uses the hinge loss as the model’s loss function, as shown in Formulas (1) and (2) [50].
min_{Dk} E_x[max(0, 1 − Dk(x))] + E_{s,z}[max(0, 1 + Dk(G(s, z)))], for each scale k        (1)

min_G E_{s,z}[Σ_k −Dk(G(s, z))]        (2)

In Formulas (1) and (2) [50], x represents the initial audio file, s represents the Mel spectrogram input to the generator, z represents Gaussian noise, and k represents the scale of the discriminator.
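To make the two objectives concrete, the hinge losses above can be sketched in NumPy. This is an illustrative sketch of the loss arithmetic only (the function names are ours), not the MelGAN training code itself.

```python
import numpy as np

def discriminator_hinge_loss(d_real, d_fake):
    """Hinge loss for one discriminator scale Dk, as in Formula (1).

    d_real: discriminator outputs on real audio x.
    d_fake: discriminator outputs on generated audio G(s, z).
    """
    real_term = np.mean(np.maximum(0.0, 1.0 - d_real))
    fake_term = np.mean(np.maximum(0.0, 1.0 + d_fake))
    return real_term + fake_term

def generator_hinge_loss(d_fakes):
    """Generator loss summed over all discriminator scales, as in Formula (2).

    d_fakes: list of discriminator outputs on generated audio, one per scale k.
    """
    return sum(-np.mean(d_fake) for d_fake in d_fakes)
```

Real samples scoring above 1 and fake samples scoring below −1 contribute nothing to the discriminator loss, which is what gives the hinge objective its margin behavior.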
When we design an audio sample generation method, the efficiency of sample generation is a critical evaluation index. MelGAN’s generator replaces the autoregressive model with a non-autoregressive feedforward convolutional architecture. MelGAN generates audio efficiently enough to meet our efficiency goals without significant degradation of the generated audio quality.
3.2. Feature Extraction
Our audio sample feature extraction method draws on Mel frequency cepstral coefficients (MFCC) and the unsupervised pre-training model Wav2vec2.
The Mel scale is a non-linear frequency scale formed based on the human ear’s perception of audio frequency changes. The identification of an audio file is mainly based on the formant positions in the audio file and the transitions between formant positions. A formant position is a peak of the audio file in the spectrogram, and the process of formant position change is described by the spectral envelope, a curve drawn through the formant positions [51]. After taking the logarithm of the Fourier transform of the audio signal, the spectrum obtained by the inverse Fourier transform is the cepstrum. The process of MFCC feature extraction is shown in Figure 3.
In Figure 3, first, a Fourier transform is performed on the audio signal; secondly, the obtained spectral power is mapped onto the Mel scale; then, a logarithmic operation is performed on the obtained Mel-frequency power [51]; next, a discrete cosine transform is applied to the Mel log-power; finally, the MFCC of the spectrum is obtained [52].
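The MFCC pipeline in Figure 3 can be sketched for a single windowed frame in pure NumPy. This is a minimal illustration of the transform chain (FFT, Mel filter bank, logarithm, DCT); the helper names and default parameters are ours, not the implementation used in the Risevi model.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters whose centers are evenly spaced on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):
            fbank[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fbank[i - 1, j] = (right - j) / max(right - center, 1)
    return fbank

def dct_ii(x):
    # Orthonormal DCT-II computed directly from its definition.
    N = len(x)
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    out = 2.0 * (np.cos(np.pi * k * (2 * n + 1) / (2 * N)) @ x)
    out[0] *= np.sqrt(1.0 / (4 * N))
    out[1:] *= np.sqrt(1.0 / (2 * N))
    return out

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    """MFCC of one windowed frame, following Figure 3:
    Fourier transform -> Mel mapping -> log -> discrete cosine transform."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft         # spectral power
    mel_power = mel_filterbank(n_filters, n_fft, sr) @ power
    log_mel = np.log(mel_power + 1e-10)                     # logarithmic operation
    return dct_ii(log_mel)[:n_coeffs]                       # cepstral coefficients
```

In practice a library implementation (e.g., one built into an audio toolkit) would be used; the sketch only makes the order of operations in Figure 3 explicit.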
Wav2vec2 is a self-supervised learning framework for speech representation proposed by the Facebook AI Research team in 2020. The structure of this framework is shown in Figure 4.
In Figure 4, X represents an audio file. The framework uses a multi-layer convolutional neural network to encode X into latent speech representations Z and then inputs Z into a Transformer network to build context representations C. The construction of C is not discrete but based on continuous speech representations. The quantized representations Q are used in the loss function of the contrastive task. The loss function used by the Wav2vec2 framework in the contrastive task is shown in Formula (3) [53].
L_m = −log( exp(sim(c_t, q_t)/κ) / Σ_{q̃ ∈ Q_t} exp(sim(c_t, q̃)/κ) )        (3)

In Formula (3), c_t represents the output of the context network centered at the masked time step t, q_t is the true quantized representation at that step, sim denotes cosine similarity, and κ is a temperature constant. Formula (3) states that the model needs to distinguish the true quantized representation q_t among the candidate quantized representations q̃ ∈ Q_t.
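For a single masked time step, this contrastive loss reduces to a softmax cross-entropy over cosine similarities. The following is a minimal NumPy sketch under that reading (function names are ours); Wav2vec2 itself applies this over batches of masked steps inside a Transformer.

```python
import numpy as np

def cosine_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(c_t, q_true, q_distractors, kappa=0.1):
    """Contrastive loss for one masked time step, as in Formula (3).

    c_t: context network output at the masked step.
    q_true: the true quantized representation q_t.
    q_distractors: the distractor quantized representations drawn from Q_t.
    """
    candidates = [q_true] + list(q_distractors)
    logits = np.array([cosine_sim(c_t, q) / kappa for q in candidates])
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # true candidate sits at index 0
```

The loss is small when c_t points in the same direction as q_t and large when a distractor is more similar, which is exactly the discrimination task Formula (3) describes.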
3.3. Transfer Learning
The ideal machine learning training scenario has many labeled samples to train the model. However, it is challenging to collect enough training samples in most cases. Semi-supervised learning can partially solve the problem by relaxing the demand for massive labeled data: it needs only a certain amount of labeled sample data and can use large amounts of unlabeled sample data to improve the accuracy of model learning. However, in practical application scenarios, comprehensive unlabeled datasets are also challenging to collect. Transfer learning focuses on cross-domain knowledge transfer and is currently the most effective machine learning method for solving the above problems [54].
Transfer learning is an essential topic in machine learning and data mining [55]. Traditional machine learning tries to train a learning model for a single task, while transfer learning tries to transfer knowledge from a source task to a target task. The datasets collected by non-medical testing equipment are relatively scarce, and there are many parameters in a deep learning network model; therefore, when performing health risk prediction [56], a limited number of samples makes the deep learning model prone to overfitting during training. To overcome this problem, we employ a transfer learning strategy. Transfer learning is a deep learning technique that uses a model pre-trained on a large dataset as the initialization weights for a model trained on a different dataset. CNNs tend to perform better on larger datasets than on smaller ones. Fields using transfer learning techniques include object detection, medical imaging [57], and image classification. Models trained on large datasets, such as ImageNet, can extract features from smaller datasets. Training with transfer learning is more efficient than training from scratch, helps prevent overfitting during model training, and requires only a small amount of data to improve model performance. The pre-trained model used in this research is Vision Transformer.
In this paper, the loss function used by the Risevi model we proposed based on transfer learning is shown in Formula (4).
L = −(1/N) Σ_{i=1..N} Σ_{j=1..M} y_ij log(ŷ_ij)        (4)

In Formula (4), L represents the loss value of the model; N represents the number of samples; M represents the dimension of the prediction vector; ŷ_ij represents the predicted value of the output; y_ij represents the actual value of the sample.
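Formula (4) is the standard categorical cross-entropy, which can be computed directly, as in this NumPy sketch (the function name and the clipping constant are ours):

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy matching Formula (4):
    L = -(1/N) * sum_i sum_j y_ij * log(yhat_ij).

    y_true: one-hot actual labels, shape (N, M).
    y_pred: predicted probabilities, shape (N, M).
    """
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
```

A perfect prediction yields a loss near 0, while assigning probability 0.5 to the true class yields −log(0.5) ≈ 0.693.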
3.4. Feature Classification
When designing the feature classification method, we used Vision Transformer, LightGBM, CNN, LSTM, and SVM as the basic algorithms to construct the feature classification method and compared the experimental results. Experimental results show that the effect of constructing feature classification is the best using Vision Transformer as the primary model algorithm.
The Transformer model is an end-to-end NLP model proposed by the Google team in 2017. This model uses the self-attention mechanism, which allows it to obtain global information and to be trained in parallel. Vision Transformer (ViT), proposed by the Google team in 2021, adapts the Transformer to images and is used in computer vision for image classification, object detection, and video processing. Y. Zhou et al. used the ViT algorithm for fire smoke detection [58], W. Zhang et al. used the ViT algorithm to identify the quality of metal 3D printing [59], S. R. Dubey et al. used the ViT algorithm for image search [60], and X. Li et al. used the ViT algorithm to reconstruct 3D volumes from a single image [61]. Y. Fang et al. used the ViT algorithm for sitting posture recognition [62], A. Dey et al. used the ViT algorithm for detecting fall events [63], and T. Chuman used the ViT algorithm for image encryption [64].
In NLP, the input to the Transformer is a sequence of phrases. Following this cue, the researchers first segmented the image into multiple blocks, then combined these image blocks into a linear sequence, and finally used these sequences as input to the Vision Transformer. The overall architecture of the Vision Transformer is shown in Figure 5.
In Figure 5 [65], the Vision Transformer first divides the image into image blocks that the model can process; it then uses the Linear Projection layer to embed all patches and encode their positions, generating an embedded sequence; next, the sequence is input into the standard Transformer Encoder; finally, the model is trained on a large-scale training set. The feature classification method we designed is based on the Vision Transformer model and fine-tuned using our dataset.
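The patch-and-project step in Figure 5 can be sketched in NumPy. This illustrative version (function and argument names are ours) splits the image into non-overlapping patches and applies one linear projection; in ViT itself, learned positional encodings and a class token are then added before the Transformer Encoder.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size, W_proj):
    """Split an (H, W, C) image into non-overlapping patches and project
    each flattened patch with a linear layer, as in the Linear Projection
    step of Figure 5. W_proj has shape (patch_size * patch_size * C, embed_dim)."""
    H, W, C = image.shape
    p = patch_size
    patches = [image[i:i + p, j:j + p, :].reshape(-1)
               for i in range(0, H, p) for j in range(0, W, p)]
    return np.stack(patches) @ W_proj   # (num_patches, embed_dim)
```

For a 224 × 224 × 3 image with 16 × 16 patches this yields the familiar sequence of 196 patch embeddings.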
3.5. Risevi Model
In order to realize disease risk prediction using audio data, we designed the Risevi model based on MelGAN, transfer learning, and Vision Transformer. The overall architecture of the Risevi model is shown in Figure 6.
In Figure 6, the operation process of the Risevi model is as follows:
Combine the dataset “patient-health-detection-using-vocal-audio” and subject data to form a mixed sample data;
Generate sample data using a sample generation method designed based on MelGAN;
Use the feature extraction method to extract features, then convert the tensor of the audio file into a floating-point number and convert it into a waveform;
Perform data augmentation and deduplication operations on image data;
Divide the calculated image data into three parts, one for training data, one for verification data, and one for test data;
Load the Vision Transformer model pre-trained weights;
Iteratively train and fine-tune the Risevi model;
Obtain the Risevi model.
3.5.1. Sample Data
During the implementation of the Risevi model, the dataset used is the dataset “Patient Health Detection using Vocal Audio” publicly available on Kaggle. This dataset consists of sound audio files of regular people and diseased patients [66], and the sample data distribution is shown in Table 2.
As shown in Table 2, the audio files in the dataset are all wav audio files. Among them are 560 audio files of regular people, 84 audio files of patients with Laryngocele disease, and 392 audio files of patients with Vox senilis disease. The dataset comes from Kaggle, and its total size is 401.7 MB.
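To quantify the imbalance that motivates the MelGAN-based sample generation, the class shares and imbalance ratio implied by Table 2 can be computed directly (a minimal sketch; the variable names are ours):

```python
# Class counts taken from Table 2.
counts = {"Normal": 560, "Laryngocele": 84, "Vox senilis": 392}

total = sum(counts.values())                                   # 1036 files in all
shares = {name: n / total for name, n in counts.items()}       # per-class share
imbalance_ratio = max(counts.values()) / min(counts.values())  # 560 / 84, about 6.67
```

The largest class outnumbers the smallest by roughly 6.7 to 1, which is the gap the generated samples are meant to close.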
3.5.2. Sample Generation Implementation Process
In this paper, we propose a sample generation method based on MelGAN. The method’s input is the spectrogram of the audio file, and the output is the audio file corresponding to the spectrogram. The primary process of the sample generation method is as follows:
Library file import;
Load the audio dataset;
Preprocess audio datasets;
In order to enhance the characteristic information of the audio data, the audio file is pre-emphasized using an FIR high-pass filter;
In order to avoid spectral leakage of the audio signal, apply a Hamming window function to the audio data;
Use the short-time Fourier transform to obtain the time-frequency signal of an audio file;
Obtain the audio file’s spectrogram by superimposing each frame’s frequency domain signal;
Convert the spectrogram to a Mel-spectrogram by using a Mel-scale filter bank;
Using the MelGAN generator network, build a sample generator model;
Using the MelGAN discriminator network, create a sample discriminator model;
Define the generator loss;
Define the discriminator loss;
Set up checkpoint saving;
Define training parameters;
Train the model.
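Steps 4–7 of the pipeline above (pre-emphasis, Hamming windowing, short-time Fourier transform, and stacking frames into a spectrogram) can be sketched in NumPy. The function name, default coefficients, and frame sizes here are illustrative choices of ours; the Mel-filter-bank conversion of step 8 is omitted.

```python
import numpy as np

def preprocess_to_spectrogram(audio, pre_emphasis=0.97, n_fft=512, hop=256):
    """Pre-emphasize, window, and STFT an audio signal, returning a
    magnitude spectrogram of shape (n_fft // 2 + 1, n_frames)."""
    # Step 4: one-tap FIR high-pass pre-emphasis, y[n] = x[n] - a * x[n-1].
    emphasized = np.append(audio[0], audio[1:] - pre_emphasis * audio[:-1])
    # Steps 5-6: Hamming-windowed frames, each sent through an FFT.
    window = np.hamming(n_fft)
    frames = [np.abs(np.fft.rfft(emphasized[s:s + n_fft] * window))
              for s in range(0, len(emphasized) - n_fft + 1, hop)]
    # Step 7: stack the per-frame spectra into a spectrogram.
    return np.stack(frames, axis=1)
```

With a 16 kHz signal and the defaults above, each column covers 32 ms of audio with 50% overlap between adjacent frames.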
The structural design of the MelGAN generator model used during the sample generation process is shown in Table 3.
In Table 3, the audio spectrogram input to the generator network passes through a convolutional layer and is then transmitted to the upsampling stages. A residual module is nested after each upsampling layer, and the final output is audio. Since the original audio has a 256× higher temporal resolution than the Mel spectrogram, the generator network needs to perform 256× upsampling in total. The structural design of the MelGAN discriminator model used in the sample generation process is shown in Table 4.
In Table 4, the discriminator network uses a large convolution kernel of size 41 and uses grouped convolution to keep the number of parameters small. The weights in each layer of the discriminator network are normalized.
3.5.3. Feature Extraction Implementation Process
This paper’s feature extraction method mainly refers to the Mel frequency cepstral coefficient MFCC of audio files and the unsupervised pre-training model Wav2vec2. The primary process of the feature extraction method is as follows:
Library file import;
Load the dataset using buffered prefetch;
Apply noise reduction to audio files;
Fade in and fade out audio files;
Load a pre-trained Wav2Vec2 model as an embedding feature extractor;
Fine-tune the pre-trained feature extractor and perform embedding calculations on the current audio features;
Resample audio to 16 kHz;
Use feature_extractor to extract features;
Obtain embedding representations from the model;
Convert the feature tensor of the extracted audio file into a floating-point number, and then convert the obtained floating-point number into a waveform;
Perform frequency masking and time masking;
Generate a new sample dataset.
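The frequency masking and time masking step above (in the style of SpecAugment) can be sketched in NumPy: each mask zeroes out a random band of consecutive frequency bins or time frames of a spectrogram. The function names and the maximum-width parameterization are ours.

```python
import numpy as np

def freq_mask(spec, max_width, rng):
    """Zero out a random band of consecutive frequency bins (rows)."""
    spec = spec.copy()
    w = rng.integers(1, max_width + 1)
    f0 = rng.integers(0, spec.shape[0] - w + 1)
    spec[f0:f0 + w, :] = 0.0
    return spec

def time_mask(spec, max_width, rng):
    """Zero out a random band of consecutive time frames (columns)."""
    spec = spec.copy()
    w = rng.integers(1, max_width + 1)
    t0 = rng.integers(0, spec.shape[1] - w + 1)
    spec[:, t0:t0 + w] = 0.0
    return spec
```

Applying both masks to each training spectrogram forces the classifier not to rely on any single narrow band of frequencies or short stretch of time.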
3.5.4. Feature Classification Implementation Process
This paper proposes a feature classification method based on transfer learning and the Vision Transformer (ViT) network model. ViT pre-trained models include L/16, B/16, B/32, S/16, R50 + L/32, and R26 + S/32, among others. The accuracy of the pre-trained models L/16, B/16, B/32, and R50 + L/32 on the ImageNet2012 dataset exceeds 80%, but the parameter size of the pre-trained model B/32 is comparatively small, so we use the pre-trained model B/32. When migrating the weights of the pre-trained model, from the perspective of dataset size and similarity, the generated dataset we use is large, but it differs from the dataset used by the pre-trained model. To fine-tune the model, we first freeze all layers and then unfreeze and train one layer at a time, continuing layer by layer until the model reaches the preset evaluation index threshold.
In order to adapt the model to audio feature classification, we fine-tuned the Encoder Block and MLP Block of the Transformer Encoder module in the ViT network model. Dropout randomly deactivates some neurons according to a preset node retention probability, which reduces overfitting of the neural network. However, this method cannot guarantee that the cost function decreases monotonically, so the model needs more training iterations to achieve the expected accuracy. In order to further improve the training efficiency and prediction accuracy of the ViT network model, we replaced the Dropout in the Encoder Block and MLP Block of the Transformer Encoder module with DropPath. During model training, after the training, validation, and test sets are divided, DropPath is active in the classification model training process; when the trained model is deployed to the application for classification, DropPath is not applied.
The primary process of the feature classification method is as follows:
Library file import;
Load the sample dataset;
Rotate and horizontally flip the sample dataset;
Scale the sample dataset pixel values;
Divide the training set, validation set, and test set;
Create a base model using the pre-trained ViT-B32 model;
Add Flatten, Batch Normalization, and Dense;
Set optimizer and learning rate;
Train the model;
Evaluate the model;
Adjust parameter values.
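The rotation, horizontal flip, and pixel-scaling steps above can be sketched with NumPy alone; this is an illustrative stand-in for the framework's image preprocessing utilities, with the 90-degree rotation granularity as an assumption:

```python
import numpy as np

def augment_and_scale(img: np.ndarray, rng=None) -> np.ndarray:
    """Random 90-degree rotation and horizontal flip, then scale pixel
    values from [0, 255] to [0, 1]."""
    rng = rng or np.random.default_rng()
    img = np.rot90(img, k=int(rng.integers(0, 4)))  # random rotation
    if rng.random() < 0.5:
        img = np.fliplr(img)                        # horizontal flip
    return img.astype(np.float32) / 255.0           # pixel scaling
```

Scaling to [0, 1] before the train/validation/test split keeps the input range consistent with what the pre-trained ViT-B32 backbone expects after its own normalization.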
3.5.5. Model Algorithm
The operation process of the Risevi model is shown in Algorithm 1.
Algorithm 1: Risevi model algorithm
Input: Public dataset D_pub; audio dataset of young, healthy subjects D_sub.
Output: Risevi model
D_train: training sample
D_val: verification sample
D_test: test sample
D_mix: mixed sample
1   D_mix ← D_pub + D_sub
2   D_gen ← Generate samples from D_mix based on MelGAN
3   D_feat ← Extract features from D_gen, operate tensors
4   D_wave ← Convert D_feat into waveform images
5   for item in D_wave:
6       Data augmentation
7   for item in D_wave:
8       Sample deduplication
9   D_train, D_val, D_test ← D_wave
10  Load classification model
11  Load pretrained weights
12  while True:
13      Train the classification model on D_train
14      Use D_val for accuracy, precision, recall, and F1 score monitoring
15      Adjust the parameter values according to the monitoring results
16      if accuracy ≥ θ_acc and precision ≥ θ_pre and recall ≥ θ_rec and F1 score ≥ θ_F1:
17          Obtain the Risevi model
18          break
In Algorithm 1, D_pub represents the public dataset used; D_sub represents the audio dataset of young, healthy subjects; D_gen represents the dataset generated by the sample generation method; D_feat represents the dataset extracted by the feature extraction method; D_wave represents the dataset after conversion into waveform images; θ_acc represents the preset accuracy value; θ_pre represents the preset precision value; θ_rec represents the preset recall value; θ_F1 represents the preset F1 score.
3.6. Evaluation Metrics
This paper uses Accuracy, Precision, Recall, and F1 score to evaluate the Risevi model.
In Formulas (5)–(8), TP means true positive, TN means true negative, FP means false positive, and FN means false negative. The confusion matrix of the evaluation indicators is shown in Table 5.
Table 5 describes the confusion matrix relationship of evaluation indicators true positive, true negative, false positive, and false negative.
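The four evaluation metrics follow directly from the confusion matrix entries. A straightforward implementation for the binary case:

```python
def metrics_from_confusion(tp: int, tn: int, fp: int, fn: int):
    """Compute Accuracy, Precision, Recall, and F1 score from
    confusion matrix counts (binary case)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For the three-class setting used here (regular voice, Laryngocele, Vox senilis), these quantities are computed per class and then averaged in the usual way.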
3.7. Application Practice
In industrial practice, we use the Python language on the Linux system platform to develop a backend service system based on the open-source web application framework Django. This backend service system encapsulates the trained Risevi model behind an API interface: given a received audio file, the API interface returns the name of the disease and the probability of disease risk. An APP or a health analysis system can leverage the Risevi model by connecting to this API interface. During operation, the API interface may time out because of network speed and other factors. Therefore, the backend service system is deployed on both the cloud server and the local server, and the cloud and local servers are hot backups for each other.
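The hot-standby behavior described above reduces to a simple failover rule. In this hedged sketch, `cloud_predict` and `local_predict` are hypothetical stand-ins for calls to the two deployed API interfaces:

```python
def predict_with_failover(audio_bytes, cloud_predict, local_predict):
    """Query the cloud-deployed Risevi API first; if it times out,
    fall back to the model deployed on the local server."""
    try:
        return cloud_predict(audio_bytes)
    except TimeoutError:
        return local_predict(audio_bytes)
```

Injecting the two prediction callables keeps the failover policy independent of the transport (HTTP client, timeout settings), which also makes it easy to test.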
In this section, we describe the sample generation, feature extraction, and classification methods used in the Risevi model’s design. We describe the sample data used in the implementation of the Risevi model, the implementation of the sample generation method, the implementation of the feature extraction method, the implementation of the feature classification method, the algorithm and application practice of the Risevi model, and the evaluation index of the Risevi model.
4. Experimental Results and Analysis
This section presents the experimental results of the Risevi model obtained with different optimizers, with different learning rates, before and after model optimization, and with different basic algorithms.
4.1. Comparison of Different Optimizers
In the Risevi model experiment, we used five optimizers, RectifiedAdam, AdaBelief, LAMB, LazyAdam, and NovoGrad, when the batch_size was 16, and the learning_rate was 0.0001. We compared the experimental results using Accuracy, Precision, Recall, and F1 score.
As can be seen from Figure 7, the accuracy and precision of the five optimizers RectifiedAdam, AdaBelief, LAMB, LazyAdam, and NovoGrad are all over 90%. As can be seen from Figure 7a, the accuracy rate using the RectifiedAdam optimizer is the highest. It can be seen from Figure 7b that the precision rate using the RectifiedAdam optimizer is the highest.
As can be seen from Figure 8, the recall rate and F1 score of the five optimizers RectifiedAdam, AdaBelief, LAMB, LazyAdam, and NovoGrad all exceed 75%, and the RectifiedAdam optimizer achieves the highest recall rate and F1 score among them.
4.2. Comparison of Different Learning Rates
In the Risevi model experiment, we fixed the batch size at 16 and set the learning rate to 0.1, 0.01, 0.001, 0.0001, and 0.00001, respectively. We compared the experimental results using Accuracy, Precision, Recall, and F1 score.
As can be seen from Figure 9, the accuracy and precision of the five learning rates 0.1, 0.01, 0.001, 0.0001, and 0.00001 are all over 85%. It can be seen from Figure 9a that a learning rate of 0.0001 achieves the highest accuracy. From Figure 9b, we can see that the precision rate is highest when using a learning rate of 0.0001.
Figure 10 shows that the recall rates and F1 scores using the four learning rates 0.01, 0.001, 0.0001, and 0.00001 are over 85%. From Figure 10a, we can see that a learning rate of 0.0001 achieves the highest recall. From Figure 10b, we can see that the F1 score using a learning rate of 0.0001 is the highest.
4.3. Comparison before and after Optimization
In the Risevi model experiment, we optimized feature extraction, feature classification, and the corresponding parameter values. We compared the experimental results before and after model optimization using Accuracy, Precision, Recall, and F1 score. The model optimization process is shown in Figure 11.
As shown in Figure 11, we first load the pre-trained model; second, we freeze all layers, then unfreeze and train one layer at a time; next, we compare the different optimizers and learning rates; then, we fine-tune the Encoder Block and MLP Block of the Transformer Encoder module in the ViT network model; finally, we optimize the other parameters in the model.
As can be seen from Figure 12a, the accuracy, precision, recall, and F1 score of the optimized model are all improved compared with those before optimization. It can be seen from Figure 12b that the AUC value of the optimized model is very close to 1.0.
4.4. Comparison of Different Algorithms
In the Risevi model experiment, we used Vision Transformer, VGG19, ResNet50, CNN, LightGBM, LSTM, and SVM as basic models to build feature classification models.
As can be seen from Figure 13, using Vision Transformer, VGG19, ResNet50, CNN, LightGBM, LSTM, and SVM as the basic models, the accuracy and precision of the constructed feature classification models all exceed 50%. As can be seen from Figure 13a, the accuracy rate using Vision Transformer is the highest. As can be seen from Figure 13b, the precision rate using Vision Transformer is the highest.
As can be seen from Figure 14, using Vision Transformer, VGG19, ResNet50, CNN, LightGBM, LSTM, and SVM as the basic models, the recall rate and F1 score of the constructed feature classification models all exceed 50%. From Figure 14a, we can see that the recall rate using Vision Transformer is the highest. From Figure 14b, we can see that the F1 score using Vision Transformer is the highest.
The contribution of this article is to innovatively apply the Vision Transformer to audio classification so that disease risk can be predicted from audio files in the application scenario of nursing homes. Most current audio classification research classifies the MFCC parameters extracted from audio files. In contrast, after extracting the features of an audio file, we first convert the feature tensor into floating-point numbers, then convert them into a waveform image, then classify the waveform image, and thereby classify the audio file.
In this section, we presented the experimental results obtained using different optimizers, different learning rates, before and after model optimization, and different basic algorithms. These results show that, in the application scenario of nursing homes, the Risevi model can predict disease risk based on audio.