1. Introduction
Alzheimer’s disease (AD), the most common form of dementia, is a neurodegenerative disease characterized by cognitive decline [1,2,3]. Among the various causes of dementia, AD accounts for 60–70% of all cases [4]. The increase in life expectancy has been accompanied by a steady increase in the population with AD [5,6]. After the age of 65, the likelihood that a person will develop AD doubles every five years. Consequently, the number of patients with dementia is projected to be more than three times higher by 2050 than in 2010 [7,8].
From the early stages to the most severe cases, the symptoms of AD include decreased spatial awareness, lack of concentration, and distraction [9,10]. Other major symptoms include memory impairment and deficits in language skills [11,12,13]. Additionally, physical function declines, making daily activities difficult to perform; consequently, patients lose autonomy and become dependent on others for care [13,14]. In particular, deterioration in language ability can impair communication, making daily life difficult for AD patients and their caregivers [14,15]. Over time, family members of AD patients may experience increasing physical and emotional exhaustion [16,17]. Moreover, owing to the high costs of care, diagnosis, and pharmacological treatment, AD is one of the most expensive chronic diseases [15,18]. For example, the cost of caring for patients with AD and other forms of dementia is over twice that for patients of the same age with cancer and 74% higher than that for patients with cardiovascular diseases [19,20]. Consequently, patients and their families incur a substantial financial burden [14,16,17,18].
Currently, there is no cure for AD, and it is considered a very serious disease [2,4]. However, if the disease is detected early, the progression of symptoms can be delayed or alleviated with medication [21,22]. A definitive diagnosis of AD relies on techniques such as genetic tests, cerebrospinal fluid tests, positron emission tomography (PET), and magnetic resonance imaging (MRI), which can be costly and invasive [23,24]; they are therefore unsuitable for early diagnosis [18,23,24]. In addition, various standards for AD diagnosis exist, but most depend on the results of tests performed by experts [25,26]. Furthermore, AD has social consequences, such as its cost to the national economy. There is therefore increasing interest in the development of simple screening techniques that can provide an easy, convenient, accessible, and low-cost diagnosis [27,28,29].
One possible solution is to use speech analysis and processing to detect changes in language ability, which can facilitate the early detection of AD [30,31]. These changes could be a key indicator for the preclinical stages of AD [12,30] and for patients experiencing greater difficulty speaking as the disease progresses [13]. AD patients may show speech-related symptoms such as hesitation, frequent pauses, blurred pronunciation, tremors, light stuttering, the use of irregular words, reduced verbal fluency, changes in the rhythm of speech, deviation from simple grammatical and lexical rules, slow or irregular breathing, and an inability to control breathing [12,31]. Moreover, there is a close relationship between language ability and cognitive ability [32,33]. These characteristics can serve as an initial indicator to distinguish between AD-related anomic aphasia and non-AD pathology [12,30]. In terms of the order of symptom manifestation, language impairment generally occurs before memory impairment; it can therefore be a good predictor of early AD [32,33].
Although changes in acoustic and vocal rhythm may be imperceptible to the human ear, advances in automatic speech analysis technology have made it possible to identify and effectively extract these acoustic and temporal parameters [14,34]. Speech biometrics and automatic speech analysis are considered ideal tools for assessing cognitive deficits or changes in older adults, as these methods can record speech planning, sequencing, and performance in real time [35]. Recently, various studies have attempted to classify spoken language using different speech processing techniques and algorithms to identify early signs of cognitive decline [14,27,35,36].
Beyond overt symptoms of language disorders, vocabulary level, the complexity of syntactic structure, and the use of irregular words are significantly affected by factors such as age, educational background, and cognitive ability; it is therefore difficult to use them as indicators of early AD [37]. In contrast, the frequency of hesitation, impaired affective prosody, the emphasis of specific syllables, changes in tempo or timing, differences in pitch and intonation, and irregular breathing can be used as indicators in the analysis and processing of voice signals [38,39,40]. Language analysis is important owing to its suitability for classification; some studies have shown that it can distinguish between people with and without AD with over 91.2% accuracy [41,42].
The COVID-19 pandemic has increased the demand for the remote diagnosis and management of AD [43,44]. The existing Mini-Mental State Examination for Dementia Screening (MMSE-DS) requires patients to visit medical institutions for in-person screening; the restrictions imposed during the COVID-19 pandemic therefore made early AD diagnosis difficult [45,46]. To satisfy these changing requirements, government agencies are planning to introduce a remote AD management system that will enable elderly people with reduced mobility to undergo dementia screening examinations at home [47,48,49]. Hence, there is a pressing need to develop more accurate and efficient measures for remote AD diagnostic screening [48,50,51]. Previously, remote dementia management systems were implemented over the phone [29,32,35], and in recent years there has been active research on the delivery of telemedicine via smart devices and applications [45,46]. Various smart devices are highly suitable for the diagnosis and remote management of AD, as they can quickly and easily capture voice and image data and, in some cases, record basic bio-signals [46,49,52].
In this study, the MMSE-DS was administered to AD patients and healthy adults to select factors significant for classifying dementia patients based on the case records. After pre-processing the voice data obtained for each item, mel-frequency cepstral coefficients (MFCCs) were used to produce a spectrogram by arranging the coefficients in a specific order defined by the authors and synthesizing them into a single image. These images were used to train different deep learning models to obtain high accuracy. Furthermore, we establish criteria for selecting factors suitable for analysis based on the MMSE-DS and voice data. Lastly, we verify that the proposed method can diagnose AD with high accuracy compared to the MMSE-DS by applying deep learning to voice signals that can be easily acquired through simple online questions and answers, without establishing a special examination system for AD screening.
4. Discussion
This study presented a method of screening AD patients using only voice data. Voice data were collected, pre-processed, normalized, and converted into spectrograms to obtain MFCC images. Deep learning models are commonly trained on MFCC representations to exploit the advantages of sound signals and non-verbal elements [71,72,73,74]. MFCC features have been widely used in voice classification tasks, as they perform well in terms of robustness and discrimination power. These features are based on the mel-frequency scale, a non-linear frequency scale that approximates human auditory perception and thus captures non-verbal elements of speech. For this reason, we used MFCCs and two-dimensional spectrogram images [75,76,77]. The MMSE-DS assessment tool, the most widely used method for interviewing AD patients in public health centers in South Korea, was used to acquire the voice data. The MMSE-DS comprises 28 questions that screen patients for attention and calculation, temporal orientation, memory registration, and delayed recall.
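To illustrate this front end, a minimal MFCC extractor can be sketched in numpy. The sampling rate, FFT size, hop length, filterbank size, and coefficient count below are illustrative assumptions, not the parameters used in this study:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_coeffs=13):
    """Compute a small MFCC matrix (n_coeffs x n_frames) from a mono signal.
    All parameters are illustrative defaults, not the study's settings."""
    # Frame the signal with a Hann window and take the power spectrum
    window = np.hanning(n_fft)
    frames = [signal[s:s + n_fft] * window
              for s in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2

    # Triangular mel filterbank spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_mel = np.log(power @ fbank.T + 1e-10)  # (n_frames, n_mels)

    # DCT-II over the mel axis keeps the first n_coeffs cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5)
                 * np.arange(n_coeffs)[:, None])
    return dct @ log_mel.T  # (n_coeffs, n_frames)
```

The resulting coefficient matrix can then be rendered as a two-dimensional image, which is the form consumed by the CNN models discussed below.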
Responses were obtained for all MMSE-DS questions. At the beginning of the experiment, we trained the deep learning model using all the questions. We attempted to train the model in various ways, but the area under the ROC curve remained very low, at about 0.6. We therefore judged that selecting valid questions for model training was more effective than using the full questionnaire. For this reason, not all of the responses were converted into MFCC images; only 12 questions were finally selected based on expert advice from a focus group interview conducted by psychiatrists at Chungnam National University Hospital. The excluded questions concerned temporal orientation, the ability to follow a command, spatial orientation, and visuospatial construction, as they required judgment, visual denomination, and abstract thinking. In particular, the ability-to-follow-a-command and visuospatial construction questions did not require a verbal response and were hence unsuitable for this study [32,58].
The answer to each question was standardized to a duration of 3 s, one MFCC image was generated per answer, and the images for all 12 questions were compiled into a single training image. The responses were standardized to 3 s because none of the responses in either group exceeded this duration. However, the AD patients were unable to answer some questions, in which case a waiting time of 30 s was provided; if there was still no answer at the end of this period, the entire duration was filled with silence. The texture of the MFCC image in the parts processed as silence differed from that of the voice signals, and if all 30 s of the wait time had been used, the proportion of silence texture would have become unnecessarily high. As these factors hinder learning performance, the voice data were standardized to 3 s, which was found to be an appropriate time in which to receive an answer.
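The standardization and compilation steps above can be sketched as follows; the 16 kHz sampling rate and float32 dtype are assumptions for illustration, and shorter or absent answers are zero-padded, which corresponds to filling the remainder with silence:

```python
import numpy as np

SR = 16000           # assumed sampling rate (not stated in the study)
TARGET_LEN = 3 * SR  # every answer standardized to 3 s

def standardize_3s(signal):
    """Truncate or zero-pad (silence) a mono signal to exactly 3 s."""
    out = np.zeros(TARGET_LEN, dtype=np.float32)
    n = min(len(signal), TARGET_LEN)
    out[:n] = signal[:n]
    return out

def compile_answers(answers):
    """Stack the 12 standardized answers into one array, from which the
    single combined training image is synthesized."""
    assert len(answers) == 12
    return np.stack([standardize_3s(a) for a in answers])  # (12, TARGET_LEN)
```

An unanswered question is simply a zero-length signal, which this sketch turns into 3 s of pure silence, mirroring the handling described above.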
In this study, the AD classification accuracy of the deep learning models was measured using the five-fold cross-validation and hold-out validation methods. In the five-fold cross-validation method, the DenseNet121 model exhibited the highest overall accuracy, and Inception v3 and VGG19 also exhibited high accuracy. In the hold-out validation method, ResNet50, Inception v3, and DenseNet121 showed high performance.
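The five-fold protocol can be sketched as a generic random partition (not the exact split used in the study): the data are shuffled once, divided into five disjoint folds, and each fold serves as the test set exactly once while the remaining four train the model:

```python
import numpy as np

def five_fold_indices(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for 5-fold cross-validation.
    The seed and shuffling scheme are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, 5)
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, test
```

With the 80 subjects of this study, each iteration would test on 16 subjects and train on the remaining 64; the reported cross-validation metrics are averages over the five test folds.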
The conventional methods used for a definitive diagnosis of AD, such as genetic tests, cerebrospinal fluid tests, PET, and MRI, are invasive and costly, making them less accessible [23,24]. Moreover, these methods typically require expert interpretation [25,26]. In contrast, the approach presented in this study, a deep learning model based on voice analysis, offers a non-invasive, cost-effective, and time-efficient alternative. With further systematization, it could be automated and used without expert assistance.
Overall, the DenseNet121, Inception v3, ResNet50, and VGG19 models performed excellently and exhibited performance similar or superior to that of the MMSE-DS [59]. When the voice signals of the responses were analyzed to classify the AD patients, we achieved a sensitivity of 0.9550, a specificity of 0.8333, an accuracy of 0.9000, and an area under the curve (AUC) of 0.9243 with the five-fold cross-validation method. The accuracy of this study, wherein spectrograms of the voice data were used to train a convolutional neural network (CNN) to classify AD patients, was higher than that reported by Duc and Ryu (85.27%), who investigated the correlation between 3D-functional MRI results and MMSE scores [78], and by Kim et al. (0.895), who evaluated the diagnostic accuracy of the MMSE-DS [59]. Additionally, this study showed greater accuracy than the results reported by Liu and Cheng (91.2%), who used FDG-PET images to classify AD patients with a CNN and a recurrent neural network [79].
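For reference, the reported sensitivity, specificity, and accuracy follow directly from the confusion matrix of the binary screening decision. A minimal sketch, with AD coded as the positive class:

```python
import numpy as np

def screening_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for binary AD screening.
    Labels are assumed to be 1 = AD, 0 = healthy control."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # AD correctly detected
    tn = np.sum((y_true == 0) & (y_pred == 0))  # healthy correctly cleared
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }
```

The AUC additionally requires the model's continuous scores rather than hard labels, since it summarizes the sensitivity/specificity trade-off across all decision thresholds.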
Because AD patients were classified solely based on the analysis of voice signals, questions from the MMSE-DS requiring execution, visuospatial construction, judgment, and abstract thinking were not included in the experimental data, following recommendations from psychiatrists. This study aimed to show that AD classification is possible without these elements, and the results are therefore significant. It is also meaningful to have confirmed which MMSE-DS questions are suitable for voice-based deep learning classification. The academic community related to AD in South Korea judged these results to be reliable and useful for screening AD patients. In this study, an individual effect analysis of each question was not performed. More diverse and accurate results could be obtained by expanding the scope to include an individual analysis of each question, validation of combinations of various questions, and analyses including additional elements.
We attempted to detect Alzheimer’s disease based on audio data alone, without including any semantics. We therefore focused on non-verbal characteristics in the responses of patients with Alzheimer’s disease, and recruited subjects and designed the experiment accordingly. The experimental results showed that Alzheimer’s dementia patients had distinctive features in intonation and nuance, as well as inaccurate pronunciation, a slow pace, and the elongation of vowel sounds, regardless of whether the answer was correct [71,72,73,74].
As this was an experimental study in which real AD patients were recruited, there are limitations in terms of the subjects and data. Because the experiment was conducted with only 80 participants, external validation could not be performed. In future studies, subjects will be recruited on a larger scale, and subjects from external institutions will be considered for external validation. In addition, other voice data, beyond responses to specific questions, will be collected, compared, and analyzed, and the severity of dementia will be classified using a deep learning model.
With respect to data augmentation, there are concerns that traditional augmentation methods (flip, crop, enlargement, reduction, rotation, inversion, etc.) may negatively affect the patterns or textures reflecting non-verbal characteristics in MFCC images. As shown in Table A1 and Table A2, the ranges of the brightness and horizontal-shift augmentations were validated to preserve the quality of the usable data. More accurate and valid results are expected as the number of participating subjects and the amount of data increase.
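The two texture-preserving augmentations can be sketched as follows. The brightness factor and shift limits here are illustrative placeholders (the study's validated ranges are those reported in Table A1 and Table A2), and zeroing the wrapped-around columns after the shift is a design choice of this sketch to avoid introducing spurious texture:

```python
import numpy as np

def augment_mfcc_image(img, rng, max_brightness=0.1, max_shift=5):
    """Brightness scaling and horizontal (time-axis) shift for an MFCC image.
    Flips and rotations are deliberately avoided, as they would distort the
    time-frequency texture that carries the non-verbal characteristics."""
    # Multiplicative brightness jitter within the assumed range
    out = img * (1.0 + rng.uniform(-max_brightness, max_brightness))
    # Shift along the time axis (columns) by a small random offset
    shift = rng.integers(-max_shift, max_shift + 1)
    out = np.roll(out, shift, axis=1)
    # Zero the wrapped-around columns so no artificial texture appears
    if shift > 0:
        out[:, :shift] = 0.0
    elif shift < 0:
        out[:, shift:] = 0.0
    return out
```

Restricting augmentation to brightness and horizontal shift keeps the spectral (vertical) structure of each MFCC image intact, which is consistent with the concern stated above.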
From a clinical point of view, our results have several strengths. First, the MMSE-DS requires checking all 28 items, whereas our method requires only 12, making it simple and time-saving. Second, the MMSE-DS requires different cut-off scores according to educational background, age, and gender; our method is convenient and useful because clinicians do not have to consider these conditions for dementia screening.
However, there are also limitations. Dementia occurs when there is a severe decrease in function in some of the various cognitive domains. Because this study used only temporal orientation, memory registration, delayed recall, and attention, the evaluation of other cognitive functions may be limited. Nevertheless, as mentioned earlier, the early stages of dementia generally involve decreased spatial awareness, a lack of concentration, distraction, memory impairment, and a decrease in language ability. In other words, if early dementia patients can be identified through voice responses to a limited set of items, a fundamental innovation in dementia screening can be expected.
Because this study presents a simplified dementia screening test, further validation of the reliability of each question is needed; nevertheless, it is considered a useful technique for screening high-risk groups. It will also be of great help in developing a platform that performs high-risk screening, precise diagnostic testing, and management.
If the deep learning model proposed in this study is used, AD screening can be performed more easily and quickly. On this basis, a telemedicine or automated screening system can be built using smart devices [46,52]. By installing such devices in easily accessible places, the barrier to entry can be lowered, and the quality of telemedicine can be improved by utilizing virtual reality [47,48,49]. In addition, since the early diagnosis of AD can inhibit its progression, the simple screening method introduced in this paper is also expected to work as a digital therapeutic.