Article

Automatic COVID-19 Detection from Cough Sounds Using Multi-Headed Convolutional Neural Networks

1 School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
2 Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada
* Authors to whom correspondence should be addressed.
Appl. Sci. 2023, 13(12), 6976; https://doi.org/10.3390/app13126976
Submission received: 27 April 2023 / Revised: 27 May 2023 / Accepted: 30 May 2023 / Published: 9 June 2023

Featured Application

Coronavirus disease 2019 (COVID-19) is rampant all over the world, threatening human life and health. Currently, the detection of COVID-19 is mainly based on the nucleic acid test as the standard. However, this method not only takes up a lot of medical resources but also takes a long time to deliver results. Unlike the existing method, we propose using the cough sound as a large-scale pre-screening tool before the nucleic acid test.

Abstract

Novel coronavirus disease 2019 (COVID-19) is rampant all over the world, threatening human life and health. Currently, the detection of COVID-19 relies mainly on the nucleic acid test as the standard. However, this method not only takes up a lot of medical resources but also takes a long time to deliver results. According to medical analysis, the surface protein of the novel coronavirus can invade the respiratory epithelial cells of patients and cause severe inflammation of the respiratory system, making the cough of COVID-19 patients different from that of healthy people. In this study, the cough sound is used as a large-scale pre-screening method before the nucleic acid test. First, the Mel spectrum features, Mel Frequency Cepstral Coefficients (MFCC), and VGG embedding features of the cough sound are extracted, and oversampling is used to balance the dataset for classes with a small number of samples. For the model, we designed multi-headed convolutional neural networks to predict audio samples and adopted an early stopping method to avoid over-fitting. The model is trained with the binary cross-entropy loss function. Our model performs well, achieving an accuracy of 98.1% on the AICovidVN 115M challenge dataset and 91.36% on the University of Cambridge dataset.

1. Introduction

Life and health have always been important topics of global concern. In order to safeguard human life and health, different countries have established their own healthcare systems. Studies have shown that there is an upward trend in government and individual health expenditures in some regions [1,2]. However, some countries still need to improve their healthcare systems [3,4]. In recent years, the emergence of the COVID-19 epidemic has greatly tested the sustainability of health systems in many countries [5], and the lives of people in some developing countries have also been greatly impacted [6]. In order to better manage the national medical and financial risks brought by the COVID-19 epidemic, countries and the research community are actively taking measures to deal with the epidemic [7], especially in the detection of COVID-19 [8,9].
At present, the main detection method for COVID-19 is to use throat swabs and nasal swabs to collect samples from the throat (i.e., the posterior pharyngeal wall) of the subjects [10]. Although it is the standard method, there are many potential risks in its sampling and detection process. Before sampling, the tested people need to wait in line for nucleic acid detection. During sampling, the medical staff have close contact with the subjects during on-site collection. Such a sampling process violates social distancing and greatly increases the risk of cross-infection. After sampling, the medical staff send the collected samples back to the hospital and test them with nucleic acid testing reagents. The tested person needs to wait about four to five hours to obtain the result. This detection process is not efficient and increases the chances of virus transmission. According to medical analysis [11], the new coronavirus is an RNA virus, and its surface protein first attacks human respiratory epithelial cells, causing severe inflammation of the respiratory system. This leads to differences in acoustic characteristics between patients with COVID-19 and healthy people [12,13,14,15]. Many research results show that, in addition to visual information, audio signals produced by the human body can also provide a reliable basis for medical diagnosis [16,17,18]. Meanwhile, with the development of artificial intelligence, it has become practical for wearable devices to collect and analyze audio signals [19].
During the COVID-19 epidemic, many universities tried to collect the cough sounds of COVID-19 patients and healthy people and used machine learning to distinguish the two types of cough sounds. In September 2020, the team at Cambridge University extracted several features of cough sounds through manual extraction and transfer learning and divided the work into three tasks: the first is to distinguish COVID-positive from non-COVID samples, the second is to distinguish COVID-positive with a cough from non-COVID with a cough, and the third is to distinguish COVID-positive coughs from non-COVID asthma coughs [20]. The results showed that all tasks achieved an area under the curve (AUC) above 0.8. In the same year, the Massachusetts Institute of Technology also reported research based on cough sound detection [21]. The team collected audio signals widely through its own website, extracted the acoustic characteristics of cough sounds, and then used a CNN to classify them, achieving a specificity of 94.2% and an AUC of 0.97 on its dataset. In January 2021, Andreu-Perez et al. [22] proposed a cough analysis system that could classify audio feature tensors and estimate the severity of infection in COVID-19 patients. In the AICovidVN 115M COVID-19 challenge, Nguyễn Thành Trung’s team [23] extracted multiple features from the dataset provided by the competition, combined them, and applied different normalization methods to the input data. They used the LightGBM model for classification and tuned the model through k-fold cross-validation, achieving an area under the ROC curve of 0.96. Pahar et al. [24] extracted features such as Mel Frequency Cepstral Coefficients (MFCC) and zero-crossing rates from the audio data of the Coswara public dataset and used them for model training. In their experiment, seven machine learning classifiers were trained and evaluated, including logistic regression (LR), a support vector machine, the k-nearest neighbor algorithm, a multilayer perceptron (MLP), long short-term memory (LSTM), a CNN, and a 50-layer deep residual neural network (ResNet50). Among them, the LSTM, CNN, and ResNet50 classifiers performed better than the other architectures. In terms of COVID-19 detection, many research teams in China have also made contributions. Researchers from the Institute of Artificial Intelligence of Beijing University of Posts and Telecommunications expanded a corpus of cough sounds and speech clips; they extracted voice quality (VQ) features from the audio and used VLAD encoding to make the features more representative and improve the performance of the algorithm [25]. A team formed by the University of Science and Technology of China and iFLYTEK won two championships in the second DiCOVA COVID-19 sound signal detection challenge held at ICASSP 2022 [26]. Their approach uses both supervised and self-supervised pre-training schemes and finally fuses the prediction results, achieving an outstanding AUC of 0.88.
Considering that a single type of acoustic feature may not have a strong enough representation ability, we adopted a combination of three features that fully considers the characteristics of human hearing; in particular, the pre-trained VGGish model is trained on a large-scale audio dataset and has strong domain adaptability. The combination of three features gives our proposed method strong generalization ability, so it can be better applied in situations where training data are limited.
Unlike a single neural network structure, our multi-headed convolutional neural network (MHCNN) structure fuses the different abstract representations from each head into a more comprehensive data representation. At the same time, from the perspective of information fusion, each head of the MHCNN has a different structure. By integrating information from multiple heads, the network benefits from different perspectives, resulting in a richer representation and better decision-making ability.
This paper adopts the deep learning method of MHCNNs to introduce intelligent audio recognition technology into the medical diagnosis process, and our experimental results show good performance. This provides a low-risk, high-efficiency method for large-scale epidemic surveillance. At the same time, it can effectively reduce medical expenses in many countries, thereby indirectly improving the quality of national healthcare services [27,28].

2. Feature Extraction

This paper uses two methods to extract audio features from cough audio data. The first is to manually extract the Mel spectrum and MFCC features from the samples. The other is to automatically extract features from the raw audio with the VGGish network.

2.1. Mel Spectrum

In human auditory perception, the resolution of frequency is not uniform. The Mel frequency scale is more consistent with the characteristics of the human auditory system than the linear frequency scale. Firstly, the speech is pre-emphasized with a filter to spectrally flatten the signal. The pre-emphasized speech is split into short frames to guarantee stationarity inside each frame, and adjacent frames overlap to preserve continuity between frames. To reduce frame edge effects, a Hamming window is applied to each frame. Secondly, to reveal the salient characteristics of the voice signal, the spectrum of each frame is calculated with the fast Fourier transform (FFT). Finally, the spectrogram is converted into the Mel spectrum to more accurately represent the characteristics of the audio signal.
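As a concrete illustration, the following is a minimal sketch of this pipeline using the librosa library; the sampling rate, FFT size, hop length, and number of Mel bands are illustrative assumptions, not values reported in this paper.

```python
# Minimal Mel-spectrum extraction sketch (assumed parameters, librosa backend).
import librosa


def extract_mel_spectrum(path, sr=16000, n_mels=128):
    y, _ = librosa.load(path, sr=sr)
    # Pre-emphasis filter to spectrally flatten the signal.
    y = librosa.effects.preemphasis(y, coef=0.97)
    # Framing, Hamming windowing, and the FFT are handled internally;
    # the power spectrogram is then mapped onto the Mel scale.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, hop_length=256,
        window="hamming", n_mels=n_mels, power=2.0)
    # Log compression, as is customary for Mel spectra.
    log_mel = librosa.power_to_db(mel)
    return log_mel.T  # shape: (num_frames, n_mels)
```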

2.2. MFCC

Among short-term cepstral features, MFCC is the most widely used. The block diagram of the MFCC feature extraction is shown in Figure 1.
After pre-emphasis, windowing, framing, and the FFT, the power of each frequency band is calculated. To simulate the nonlinear auditory characteristics of the human cochlea, a bank of triangular filters is designed. The power spectrum is multiplied by the response of each triangular filter, and the products are summed to obtain the output power of each filter. To remove the correlation between the outputs of the triangular filters, the logarithm of the filter bank energies is transformed by the discrete cosine transform (DCT) to obtain the MFCC. To enhance the performance of speaker recognition systems, the first-order and second-order deltas of the MFCC are often computed as dynamic parameters and stitched together with the static parameters to form the features of each frame.
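A minimal sketch of this extraction, again assuming librosa, follows. The 13 static coefficients match the (nums, 13) matrix mentioned in Section 2.3; the optional delta computation shown here illustrates the dynamic parameters described above, and all other parameters are illustrative assumptions.

```python
# Minimal MFCC extraction sketch following the block diagram in Figure 1.
import librosa
import numpy as np


def extract_mfcc(path, sr=16000, n_mfcc=13, with_deltas=False):
    y, _ = librosa.load(path, sr=sr)
    y = librosa.effects.preemphasis(y)
    # Windowing, framing, FFT, the triangular Mel filterbank, log energies,
    # and the DCT are all handled inside librosa.feature.mfcc.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=256, window="hamming")
    if not with_deltas:
        return mfcc.T  # (num_frames, 13), as used in the fusion of Section 2.3
    # First- and second-order deltas stitched to the static parameters per frame.
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T  # (num_frames, 3 * n_mfcc)
```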

2.3. VGGish

VGGish [29] is a convolutional neural network with a simple structure and good generalization ability. The VGGish model was pre-trained on a large-scale dataset of YouTube video audio tracks, and the learned model parameters were released on GitHub. Using the VGGish network, the raw audio is transformed into features. The pre-trained VGGish model first divides each data sample into non-overlapping 0.96 s sub-samples and returns a 128-dimensional feature vector for each sub-sample.
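As a rough sketch of how such embeddings can be obtained, one publicly available distribution of the pre-trained VGGish model is the TensorFlow Hub release; the hub URL and the 16 kHz mono input assumption below describe that release, not details reported by the authors.

```python
# Hedged sketch: VGGish embeddings via the TensorFlow Hub release of the model.
import librosa
import tensorflow_hub as hub

vggish = hub.load("https://tfhub.dev/google/vggish/1")


def extract_vggish_embeddings(path):
    # The TF Hub VGGish model expects a mono waveform sampled at 16 kHz.
    y, _ = librosa.load(path, sr=16000, mono=True)
    # Each non-overlapping 0.96 s sub-sample yields one 128-dimensional embedding.
    embeddings = vggish(y)
    return embeddings.numpy()  # shape: (num_subsamples, 128)
```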
For each cough audio file, the shapes of the extracted Mel spectrum feature matrix, MFCC feature matrix, and VGG network feature matrix are (nums, 128), (nums, 13), and (nums, 128), respectively, where nums is related to the duration of the audio; the longer the audio, the larger the nums value. The features extracted by the pre-trained model are converted into one-dimensional vectors, which are called VGG embeddings. We averaged each feature matrix over its frame dimension (i.e., took the mean of each column of the (nums, d) matrix) to form a one-dimensional vector. The sizes of the feature vectors corresponding to the Mel spectrum, MFCC, and VGG embeddings are (1, 128), (1, 13), and (1, 128), respectively. In this paper, we use early fusion, horizontally splicing the three feature vectors of each audio file to obtain a (1, 269) feature vector. N cough audio files are finally integrated into a feature sequence of shape (N, 269), where N denotes the number of cough audio files, and output as an .npy file together with the corresponding label data (0 for healthy people and 1 for COVID-19 patients).
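A minimal sketch of this early-fusion step is given below, reusing the helper functions from the sketches above; the output file names and the helper names are assumptions for illustration, while the 269-dimensional layout and the label convention follow the text.

```python
# Early fusion: frame-averaged Mel, MFCC, and VGG embedding vectors spliced
# into one 269-dimensional vector per cough audio file.
import numpy as np


def fuse_features(path):
    mel = extract_mel_spectrum(path)         # (nums, 128)
    mfcc = extract_mfcc(path)                # (nums, 13)
    vgg = extract_vggish_embeddings(path)    # (nums', 128)
    # Mean over the frame axis gives one fixed-length vector per feature type.
    return np.concatenate([mel.mean(axis=0),
                           mfcc.mean(axis=0),
                           vgg.mean(axis=0)])  # shape: (269,)


def build_dataset(paths, labels, out_prefix="cough"):
    X = np.stack([fuse_features(p) for p in paths])  # (N, 269)
    y = np.asarray(labels)                            # 0 = healthy, 1 = COVID-19
    np.save(f"{out_prefix}_features.npy", X)
    np.save(f"{out_prefix}_labels.npy", y)
    return X, y
```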

3. Multi-Headed Convolutional Neural Networks Architecture

Because the convolution and pooling operations of each CNN produce progressively more abstract representations of the input feature sequence, allowing the network to learn it from multiple perspectives, this paper designs and implements the MHCNN network structure. The MHCNN feeds the one-dimensional audio feature sequence into three CNNs with different operations, combines the outputs abstracted by the three CNNs, and feeds them into several fully connected layers for classification.
For the first input of the network, the 1D cough audio feature sequence passes through two Conv1D layers and a MaxPooling1D layer, is further abstracted by another two Conv1D layers, and is then fed into a GlobalAveragePooling1D layer. Finally, a Dropout layer and a Flatten layer are added. After this series of layers, the length of the audio feature sequence is 128.
For the second input of the network, the 1D cough audio feature sequence first passes through two Conv1D layers and one MaxPooling1D layer, and then through a single Conv1D layer before the GlobalAveragePooling1D layer. After the Dropout layer and the Flatten layer, the final sequence length is 256.
For the third input of the network, the 1D cough audio feature sequence has a length of 7936 after being processed by a four-layer network consisting of a Conv1D layer, a Dropout layer, a MaxPooling1D layer, and a Flatten layer.
Then, the abstracted sequences of three different lengths are concatenated, and the final output is obtained after passing through four Dense layers. The network architecture is shown in Figure 2. The activation function of all convolution layers is ReLU, and the activation function of the last Dense layer is Softmax.
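The following is a hedged Keras sketch of this architecture. The paper does not report filter counts, kernel sizes, or dropout rates; the values below are assumptions chosen so that the three branch outputs roughly match the reported lengths (128, 256, and 7936) for a (269, 1) input.

```python
# Hedged MHCNN sketch in the Keras functional API (assumed hyperparameters).
import tensorflow as tf
from tensorflow.keras import layers, models


def build_mhcnn(input_len=269, num_classes=2):
    inp = layers.Input(shape=(input_len, 1))

    # Head 1: two Conv1D -> MaxPooling1D -> two Conv1D -> GAP -> Dropout -> Flatten (128).
    h1 = layers.Conv1D(64, 3, activation="relu")(inp)
    h1 = layers.Conv1D(64, 3, activation="relu")(h1)
    h1 = layers.MaxPooling1D(2)(h1)
    h1 = layers.Conv1D(128, 3, activation="relu")(h1)
    h1 = layers.Conv1D(128, 3, activation="relu")(h1)
    h1 = layers.GlobalAveragePooling1D()(h1)
    h1 = layers.Dropout(0.3)(h1)
    h1 = layers.Flatten()(h1)

    # Head 2: two Conv1D -> MaxPooling1D -> one Conv1D -> GAP -> Dropout -> Flatten (256).
    h2 = layers.Conv1D(128, 5, activation="relu")(inp)
    h2 = layers.Conv1D(128, 5, activation="relu")(h2)
    h2 = layers.MaxPooling1D(2)(h2)
    h2 = layers.Conv1D(256, 5, activation="relu")(h2)
    h2 = layers.GlobalAveragePooling1D()(h2)
    h2 = layers.Dropout(0.3)(h2)
    h2 = layers.Flatten()(h2)

    # Head 3: Conv1D -> Dropout -> MaxPooling1D -> Flatten (7936 with these assumed sizes).
    h3 = layers.Conv1D(64, 22, activation="relu")(inp)
    h3 = layers.Dropout(0.3)(h3)
    h3 = layers.MaxPooling1D(2)(h3)
    h3 = layers.Flatten()(h3)

    # Fuse the three abstract representations and classify with four Dense layers.
    merged = layers.concatenate([h1, h2, h3])
    x = layers.Dense(256, activation="relu")(merged)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs=inp, outputs=out)
```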

4. Experimental Dataset 1

To evaluate the performance of the proposed method, our experiments are conducted on the AICovidVN 115M challenge dataset. This dataset contains 4068 cough audio files, including 669 cough audio files of COVID-19 patients and 3399 cough audio files of non-COVID people, which indicates that the category distribution of this dataset is imbalanced. The visualization of the audio time length of this dataset is shown in Figure 3. The horizontal axis represents 4068 cough audio files, and the vertical axis represents the time length of each audio file, in seconds (s). The average duration of all cough audio files is about 9.1 s, where 1225 cough audio files are longer than 9 s and 2843 cough audio files are shorter than 9 s.
Due to the class imbalance in the dataset, the SMOTE technique is used to augment the data of COVID-19 patients. Starting from the minority class (COVID-19 patients), it finds neighboring samples and synthesizes new samples between them, so that the distribution within the minority class remains almost unchanged while the data are augmented. In this way, the dataset is balanced, and the final total number of samples is 6798. The balanced feature sequences are divided into a training set and a testing set with a ratio of 8:2; that is, 5438 feature sequences are used for training (of which 453 are used as the validation set), and 1360 feature sequences are used for testing. To be consistent with the input size of the CNN network structure, the input feature vectors are expanded from two dimensions to three dimensions; taking the training data as an example, the shape changes from (5438, 269) to (5438, 269, 1) as the input for CNN model training.
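A minimal sketch of this balancing and splitting step is shown below, assuming the (N, 269) feature matrix and label vector produced in Section 2.3; the use of imbalanced-learn and scikit-learn is a common choice, not a detail stated in the paper.

```python
# SMOTE balancing, 8:2 split, and reshaping to a 3D Conv1D input.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X = np.load("cough_features.npy")   # (4068, 269) fused feature sequences
y = np.load("cough_labels.npy")     # 0 = healthy, 1 = COVID-19

# Synthesize new minority-class samples until both classes are the same size.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)   # ~6798 samples in total

# 8:2 train/test split, then expand to 3D for the Conv1D input, e.g. (5438, 269, 1).
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=42)
X_train = X_train[..., np.newaxis]
X_test = X_test[..., np.newaxis]
```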

4.1. Experimental Parameter Setting

This experiment trains the network model on Google’s Colab platform and accelerates training with its GPU. Google Drive is used to load the experimental data and save the network model parameters. Learning rate decay is applied during training: it reduces the step size of the parameter updates, which is conducive to the convergence of the algorithm. The Keras exponential decay schedule in TensorFlow is used to decay the learning rate, so that it is set reasonably and adjusted dynamically as training progresses. The initial learning rate should not be too large; when it was set to 0.1, training was fast but the accuracy was only about 50%. The initial learning rate was therefore set to 0.00035, with a decay rate of 0.9 and a decay step of 1000. At the same time, this paper uses the adaptive moment estimation (Adam) optimizer to optimize the training process. The exponential decay schedule created above is passed as a parameter to the Adam optimizer, which then adapts the learning rate automatically during training. The resulting optimizer object becomes part of the model compilation.
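The schedule and optimizer described above can be expressed with the Keras ExponentialDecay schedule as follows; only the initial rate, decay rate, and decay steps are taken from the text, and everything else is standard Keras usage.

```python
# Learning-rate decay and Adam optimizer as described in the text.
import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=3.5e-4,   # 0.00035
    decay_steps=1000,
    decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```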
During training, this paper uses the binary cross-entropy loss function, which is commonly used for classification tasks, and sets the number of epochs. If the epoch value is set too large, the network may over-fit and generalize poorly. To avoid over-fitting, this article uses the EarlyStopping function in Keras with the patience parameter set to 10; when the monitored loss on the validation set does not decrease for 10 consecutive epochs, training is terminated early. If the epoch value is too small, the network may under-fit because it has not learned enough from the training data. To avoid this, we set the epoch value to be larger than the number of epochs at which the network converges. During the experiments, we found that training a model never required more than 100 passes over the dataset. Considering the computational cost and these observations, we set the epoch value of the model to 100 and the batch size to 64.
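A hedged sketch of the resulting training call is given below, reusing build_mhcnn, optimizer, X_train, and y_train from the sketches above; the one-hot label encoding and the validation fraction (roughly 453 of the 5438 training sequences) are assumptions made to match the setup described in the text.

```python
# Compile and train with binary cross-entropy, early stopping (patience 10),
# 100 epochs, and a batch size of 64.
import tensorflow as tf

model = build_mhcnn()
model.compile(optimizer=optimizer,
              loss="binary_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)

history = model.fit(
    X_train, tf.keras.utils.to_categorical(y_train, 2),
    validation_split=0.0833,   # roughly 453 of the 5438 training sequences
    epochs=100, batch_size=64,
    callbacks=[early_stop])
```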

4.2. Experimental Results and Analysis

To compare the performance of the MHCNN with that of single-headed CNNs, the MHCNN is divided into three independent single-headed CNNs, named Head1 CNN, Head2 CNN, and Head3 CNN.
In order to compare the effects of the MHCNN and the three single-headed CNNs (Head1 CNN, Head2 CNN, and Head3 CNN), we evaluated three different feature sets (MFCC + Mel, MFCC + Mel + VGG embeddings, and VGG embeddings) with each model. The accuracy is shown in Table 1, where the combination of the MFCC + Mel + VGG embeddings feature and the MHCNN model (accuracy of 98.09%) outperforms the other methods.
Moreover, the performance of different features (MFCC + Mel, MFCC + Mel + VGG embeddings, and VGG embeddings) with the same model and of different models (Head1 CNN, Head2 CNN, Head3 CNN, and MHCNN) with the same feature is also evaluated by AUC value. The results are shown in Table 2, where it can be seen that the combination of the MFCC + Mel + VGG embeddings feature and the MHCNN model has the highest AUC value (AUC = 0.9959).
In this study, the model.fit() function is used to train the model for a certain number of epochs and returns the training history. The learning curves for accuracy and loss are shown in Figure 4 and Figure 5. The model was trained for 38 epochs, and by the 30th epoch the network had gradually converged. It can be seen from the two figures that the generalization performance of the network is satisfactory, as the curves on the training set and the validation set are relatively close.
Other evaluation indicators, including precision, recall, and F1 score, are also calculated and shown in Table 3. Among the 1360 test samples, there are 680 negative samples and 680 positive samples. The precision on the negative samples is 0.99, the recall is 0.97, and the F1 score is 0.98. The precision on the positive samples is 0.97, the recall is 0.99, and the F1 score is 0.98. These three evaluation metrics show that the MHCNN model performs well on the test set.
At the same time, the confusion matrix is adopted to evaluate the learning algorithm, showing where the predicted values agree or disagree with the true values. The confusion matrix of the MHCNN model is shown in Figure 6. By row, the first row represents the cough audio clips of healthy people in the test set, and the second row represents the cough audio clips of patients with COVID-19. By column, the first column represents the number of cough signals predicted as negative, and the second column represents the number predicted as positive. The matrix counts the prediction results on the test data, including the numbers of correct and incorrect classifications. Of the 680 healthy audio clips, 662 are predicted correctly and 18 incorrectly. Of the 680 cough audio clips of patients with COVID-19, 672 are predicted correctly and 8 incorrectly.
In addition, the ROC curve of the new method is shown in Figure 7.
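The evaluation metrics reported above (precision, recall, F1 score, confusion matrix, and ROC/AUC) can be computed from the trained model's test-set predictions with scikit-learn, as sketched below; this is a standard recipe rather than the authors' exact evaluation script.

```python
# Evaluation sketch: classification report, confusion matrix, and ROC/AUC.
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)

probs = model.predict(X_test)      # (num_test, 2) softmax scores
y_pred = probs.argmax(axis=1)
y_score = probs[:, 1]              # probability of the COVID-19 (positive) class

print(classification_report(y_test, y_pred,
                            target_names=["negative", "positive"]))
print(confusion_matrix(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_score))
fpr, tpr, _ = roc_curve(y_test, y_score)   # points for an ROC curve like Figure 7
```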

5. Experimental Dataset 2

To verify the generalization of our proposed method, experiments are also conducted on the University of Cambridge dataset (as of 22 May 2020). This dataset was gathered through a web-based app and an Android app. It contains 544 cough audio files, including 141 cough audio files of COVID-19 patients and 403 cough audio files of non-COVID people, which indicates that the class distribution of this dataset is imbalanced. The 141 cough audio files of COVID-19 patients have a total duration of 897 s, and the 403 cough audio files of non-COVID people have a total duration of 1870 s.
Due to the class imbalance in the dataset, the SMOTE technique is again used to augment the data of COVID-19 patients, finding neighboring samples in the minority class (COVID-19 patients) and synthesizing new samples between them, so that the distribution within the minority class remains almost unchanged while the data are augmented. In this way, the dataset is balanced, and the final total number of samples is 806. The balanced feature sequences are divided into a training set and a testing set with a ratio of 8:2; that is, 644 feature sequences are used for training (of which 54 are used as the validation set), and 162 feature sequences are used for testing. To be consistent with the input size of the CNN network structure, the input feature vectors are expanded from two dimensions to three dimensions; taking the training data as an example, the shape changes from (644, 269) to (644, 269, 1) as the input for CNN model training.

5.1. Experimental Parameter Setting

This experiment trains the network model on Google’s Colab platform and accelerates training with its GPU. Google Drive is used to load the experimental data and save the network model parameters. During training, the initial learning rate is set to 0.00025 and the decay rate to 0.85. The binary cross-entropy loss function, commonly used for classification tasks, is used here too. To avoid over-fitting, this article uses the EarlyStopping function in Keras with the patience parameter set to 10. To avoid under-fitting due to insufficient learning of the training data, we set the epoch value to be larger than the number of epochs at which the network converges. Considering the computational cost and experimental observations, we set the epoch value of the model to 100 and the batch size to 64.

5.2. Experimental Results and Analysis

To compare the performance of the MHCNN with that of single-headed CNNs, the MHCNN is divided into three independent single-headed CNNs, named Head1 CNN, Head2 CNN, and Head3 CNN.
In order to compare the effects of the MHCNN and the three single-headed CNNs (Head1 CNN, Head2 CNN, and Head3 CNN), we evaluated three different feature sets (MFCC + Mel, MFCC + Mel + VGG embeddings, and VGG embeddings) with each model. The accuracy is shown in Table 4 and in Figure 8, where the combination of the MFCC + Mel + VGG embeddings feature and the MHCNN model (accuracy of 91.36%) outperforms the other methods.
Moreover, the performance of different features (MFCC + Mel, MFCC + Mel + VGG embeddings, and VGG embeddings) with the same model and of different models (Head1 CNN, Head2 CNN, Head3 CNN, and MHCNN) with the same feature is also evaluated by AUC value. The results are shown in Table 5 and in Figure 9, where it can be seen that the combination of the MFCC + Mel + VGG embeddings feature and the MHCNN model has the highest AUC value (AUC = 0.9646).
In this study, the model.fit() function is used to train the model for a certain number of epochs and returns the training history. The learning curves for accuracy and loss are shown in Figure 10 and Figure 11. After about 10 training epochs, the network gradually converges. It can be seen that the curves on the training set and the validation set differ noticeably, with better performance on the training set, because the University of Cambridge dataset is smaller than the AICovidVN 115M dataset.
Other evaluation indicators, including precision, recall, and F1 score, are also calculated and shown in Table 6. Among the 162 test samples, there are 81 negative samples and 81 positive samples. The precision on the negative samples is 0.95, the recall is 0.88, and the F1 score is 0.91. The precision on the positive samples is 0.89, the recall is 0.95, and the F1 score is 0.92. These three evaluation metrics show that the MHCNN model performs well on the test set.
At the same time, the confusion matrix is adopted to evaluate the learning algorithm, showing where the predicted values agree or disagree with the true values. The confusion matrix of the MHCNN model is shown in Figure 12. By row, the first row represents the cough audio clips of healthy people in the test set, and the second row represents the cough audio clips of patients with COVID-19. By column, the first column represents the number of cough signals predicted as negative, and the second column represents the number predicted as positive. Of the 81 healthy audio clips, 71 are predicted correctly and 10 incorrectly. Of the 81 cough audio clips of patients with COVID-19, 77 are predicted correctly and 4 incorrectly.
In addition, the ROC curve of the new method is shown in Figure 13.
At the same time, in order to verify that our proposed method can distinguish the coughs of other respiratory patients from those of COVID-19 patients, we chose the coughing sounds of asthma patients and COVID-19 patients in the University of Cambridge dataset to conduct experiments. Among them, there are 21 cough files of asthmatic patients with a total duration of 120 s, and 54 cough files of COVID-19 patients with a total duration of 299 s.
Our designed MHCNN network is used to classify asthma coughs and COVID-19 coughs on the University of Cambridge dataset, where the training set and testing set are divided in the same way as in the previous experiments. The optimal feature combination (MFCC + Mel + VGG embeddings) found in the previous two groups of experiments is adopted to train the MHCNN network.
The evaluation indicators of precision, recall, and F1 score are calculated and shown in Table 7, showing that the MHCNN model also performs well on this test set.
In this experiment, the confusion matrix of the MHCNN model is shown in Figure 14.
The ROC curve of this experiment is shown in Figure 15.

6. Conclusions

In this paper, an MHCNN is designed for the detection of COVID-19 from human cough sounds. We validated the method separately on the AICovidVN 115M challenge dataset and the University of Cambridge dataset and achieved good results. The method performs well as a new way to assist in the screening and diagnosis of COVID-19. Compared with the traditional detection method, it reduces manpower and material costs and decreases the frequent contact between medical staff and the people being tested. Most importantly, it provides a new way to actively find patients with COVID-19, which is conducive to the early isolation and treatment of infected people, reduces the risk of potential infection, and is of great significance for combating the epidemic.

Author Contributions

Conceptualization, W.W., Q.S. and H.L.; methodology, W.W. and Q.S.; software, Q.S.; validation, Q.S. and H.L.; formal analysis, H.L.; investigation, W.W. and Q.S.; resources, W.W.; data curation, Q.S. and H.L.; writing—original draft preparation, W.W. and Q.S.; writing—review and editing, W.W. and Q.S.; visualization, W.W.; supervision, W.W.; project administration, W.W.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work has received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors thank Zhen-Hao Zhang for support in preparing the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jakovljevic, M.; Camilleri, C.; Rancic, N.; Grima, S.; Buttigieg, S.C. Cold War Legacy in Public and Private Health Spending in Europe. Front. Public Health 2018, 6, 215.
  2. Jakovljevic, M.; Fernandes, P.O.; Teixeira, J.P.; Rancic, N.; Timofeyev, Y.; Reshetnikov, V. Underlying differences in health spending within the world health organisation europe region—Comparing eu15, eu post-2004, cis, eu candidate, and carinfonet countries. Int. J. Environ. Res. Public Health 2019, 16, 3043.
  3. Cerda, A.A.; García, L.Y.; Rivera-Arroyo, J.; Riquelme, A.; Teixeira, J.P.; Jakovljevic, M. Comparison of the healthcare system of Chile and Brazil: Strengths, inefficiencies, and expenditures. Cost Eff. Resour. Alloc. 2022, 20, 71–79.
  4. Katoue, M.G.; Cerda, A.A.; García, L.Y.; Jakovljevic, M. Healthcare system development in the Middle East and North Africa region: Challenges, endeavors and prospective opportunities. Front. Public Health 2022, 10, 4937.
  5. Mihajlo, J.; Sulaiman, M.; Sanaa, A.A.; Dalal, H.H. Editorial: Does healthcare financing explain different healthcare system performances and responses to COVID-19? Front. Public Health 2022, 10, 4183.
  6. You, J.; Zhang, J.; Li, Z. Consumption-Related Health Education Inequality in COVID-19: A Cross-Sectional Study in China. Front. Public Health 2022, 10, 666.
  7. Zhao, W.; Sun, Y.; Li, Y.; Guan, W. Prediction of COVID-19 Data Using Hybrid Modeling Approaches. Front. Public Health 2022, 10, 923978.
  8. Giri, B.; Pandey, S.; Shrestha, R.; Pokharel, K.; Ligler, F.S.; Neupane, B.B. Review of analytical performance of COVID-19 detection methods. Anal. Bioanal. Chem. 2021, 413, 35–48.
  9. Cheng, M.P.; Papenburg, J.; Desjardins, M.; Kanjilal, S.; Yansouni, C.P. Diagnostic Testing for Severe Acute Respiratory Syndrome—Related Coronavirus 2: A narrative review. Ann. Intern. Med. 2020, 172, 726–734.
  10. Hung, K.F.; Sun, Y.C.; Chen, B.H.; Lo, J.F.; Cheng, C.M.; Chen, C.Y.; Wu, C.H.; Kao, S.Y. New COVID-19 saliva-based test: How good is it compared with the current nasopharyngeal or throat swab test? J. Chin. Med. Assoc. 2020, 83, 891–894.
  11. Mason, R.J. Pathogenesis of COVID-19 from a cell biology perspective. Eur. Respir. J. 2020, 55, 2000607.
  12. Al Ismail, M.; Deshmukh, S.; Singh, R. Detection of COVID-19 through the analysis of vocal fold oscillations. In Proceedings of the 46th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1035–1039.
  13. Lella, K.K.; Pja, A. Automatic diagnosis of COVID-19 disease using deep convolutional neural network with multi-feature channel from respiratory sound data: Cough, voice, and breath. Alex. Eng. J. 2022, 61, 1319–1334.
  14. Rahman, T.; Ibtehaz, N.; Khandakar, A.; Hossain, M.S.A.; Mekki, Y.M.S.; Ezeddin, M.; Bhuiyan, E.H.; Ayari, M.A.; Tahir, A.; Qiblawey, Y.; et al. QUCoughScope: An Intelligent Application to Detect COVID-19 Patients Using Cough and Breath Sounds. Diagnostics 2022, 12, 920.
  15. Lella, K.K.; Pja, A. Automatic COVID-19 disease diagnosis using 1D convolutional neural network and augmentation with human respiratory sound based on parameters: Cough, breath, and voice. Alex. Eng. J. 2021, 8, 240–264.
  16. Laguarta, J.; Subirana, B. Longitudinal speech biomarkers for automated Alzheimer’s detection. Front. Comput. Sci. 2021, 3, 1–12.
  17. Pramono, R.X.A.; Imtiaz, S.A.; Rodriguez-Villegas, E. Evaluation of features for classification of wheezes and normal respiratory sounds. PLoS ONE 2019, 14, e0213659.
  18. Abbas, A.; Fahim, A. An automated computerized auscultation and diagnostic system for pulmonary diseases. J. Med. Syst. 2010, 34, 1149–1155.
  19. Al Bassam, N.; Hussain, S.A.; Al Qaraghuli, A.; Khan, J.; Sumesh, E.P.; Lavanya, V. IoT based wearable device to monitor the signs of quarantined remote patients of COVID-19. Inform. Med. Unlocked 2021, 24, 100588.
  20. Brown, C.; Chauhan, J.; Grammenos, A.; Han, J.; Hasthanasombat, A.; Spathis, D.; Xia, T.; Cicuta, P.; Mascolo, C. Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 23–27 August 2020; pp. 3474–3484.
  21. Laguarta, J.; Hueto, F.; Subirana, B. COVID-19 Artificial Intelligence Diagnosis Using Only Cough Recordings. IEEE Open J. Eng. Med. Biol. 2020, 1, 275–281.
  22. Andreu-Perez, J.; Pérez-Espinosa, H.; Timonet, E.; Kiani, M.; Girón-Pérez, M.I.; Benitez-Trinidad, A.B.; Jarchi, D.; Rosales-Pérez, A.; Gatzoulis, N.; Reyes-Galaviz, O.F.; et al. A Generic Deep Learning Based Cough Analysis System from Clinically Validated Samples for Point-of-Need COVID-19 Test and Severity Levels. IEEE Trans. Serv. Comput. 2021, 3, 1220–1232.
  23. Nguyễn, T.T.; Hoàng, Đ.T.; Đào, M.T. EE3063-SEM202-FINAL-PROJECT. Available online: https://github.com/dee-ex/EE3063-SEM202-FINAL-PROJECT (accessed on 1 May 2023).
  24. Pahar, M.; Klopper, M.; Warren, R.; Niesler, T. COVID-19 Cough Classification using Machine Learning and Global Smartphone Recordings. Comput. Biol. Med. 2021, 135, 104572–104573.
  25. Haoran, Z.; Yichen, H.; Yongmei, T.; Ya, L. COVID-19 Detection Algorithm Using Voice Quality Features and VLAD Coding. J. Signal Process. 2021, 37, 1843–1851.
  26. Official of iFLYTEK. After Winning the Two Dicova Championships, Can You Detect COVID-19 by Voice. Available online: https://new.qq.com/rain/a/20220620A092R500 (accessed on 1 May 2023).
  27. Ranabhat, C.L.; Jakovljevic, M. Sustainable Health Care Provision Worldwide: Is There a Necessary Trade-Off between Cost and Quality? Sustainability 2023, 15, 1372–1383.
  28. Jakovljevic, M.; Liu, Y.; Cerda, A.; Simonyan, M.; Correia, T.; Mariita, R.; Kumara, A.; Garcia, L.; Krstic, K.; Osabohien, R.; et al. The Global South political economy of health financing and spending landscape–history and presence. J. Med. Econ. 2021, 24, 25–33.
  29. Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the 42nd International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135.
Figure 1. The block diagram of the MFCC feature extraction.
Figure 2. The architecture of a MHCNN.
Figure 3. The time length analysis of cough audio files in the dataset.
Figure 4. Changes in accuracy value during training on the AICovidVN 115M challenge dataset.
Figure 5. Changes in loss value during training on the AICovidVN 115M challenge dataset.
Figure 6. Confusion matrix of the MHCNN model on the AICovidVN 115M challenge dataset.
Figure 7. ROC curve of the MHCNN model on the AICovidVN 115M challenge dataset.
Figure 8. Accuracy bar charts of the combination of four different models and three different features on the University of Cambridge dataset.
Figure 9. AUC bar charts of the combination of four different models and three different features on the University of Cambridge dataset.
Figure 10. Changes in accuracy value during training on the University of Cambridge dataset.
Figure 11. Changes in loss value during training on the University of Cambridge dataset.
Figure 12. Confusion matrix of the MHCNN model on the University of Cambridge dataset.
Figure 13. ROC curve of the MHCNN model on the University of Cambridge dataset.
Figure 14. Confusion matrix of the MHCNN model for asthma vs. COVID-19 coughs.
Figure 15. ROC curve of the MHCNN model for asthma vs. COVID-19 coughs.
Table 1. Accuracy of the combination of four different models and three different features on the AICovidVN 115M challenge dataset.

Accuracy                      Head1 CNN   Head2 CNN   Head3 CNN   MHCNN
MFCC + Mel                    96.62%      97.28%      97.79%      97.13%
MFCC + Mel + VGG embeddings   97.28%      94.93%      96.91%      98.09%
VGG embeddings                91.03%      93.75%      95.15%      95.29%
Table 2. AUC of the combination of four different models and three different features on the AICovidVN 115M challenge dataset.

AUC                           Head1 CNN   Head2 CNN   Head3 CNN   MHCNN
MFCC + Mel                    0.9909      0.9948      0.9942      0.9929
MFCC + Mel + VGG embeddings   0.9938      0.9908      0.9948      0.9959
VGG embeddings                0.9782      0.9926      0.9835      0.9858
Table 3. Evaluation indicators of the MHCNN model on the AICovidVN 115M challenge dataset.

            Precision   Recall   F1
Negative    0.99        0.97     0.98
Positive    0.97        0.99     0.98
Table 4. Accuracy of the combination of four different models and three different features on the University of Cambridge dataset.

Accuracy                      Head1 CNN   Head2 CNN   Head3 CNN   MHCNN
MFCC + Mel                    86.59%      86.59%      89.63%      88.41%
MFCC + Mel + VGG embeddings   82.72%      81.48%      87.04%      91.36%
VGG embeddings                84.57%      79.01%      86.42%      89.51%
Table 5. AUC of the combination of four different models and three different features on the University of Cambridge dataset.

AUC                           Head1 CNN   Head2 CNN   Head3 CNN   MHCNN
MFCC + Mel                    0.9289      0.9378      0.9491      0.9488
MFCC + Mel + VGG embeddings   0.8820      0.9053      0.9491      0.9646
VGG embeddings                0.9514      0.9114      0.9265      0.94734
Table 6. Evaluation indicators of the MHCNN model on the University of Cambridge dataset.

            Precision   Recall   F1
Negative    0.95        0.88     0.91
Positive    0.89        0.95     0.92
Table 7. Performance of the MHCNN model on evaluation indicators (asthma vs. COVID-19 coughs).

            Precision   Recall   F1
Negative    0.83        0.91     0.87
Positive    0.90        0.82     0.86
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
