1. Introduction
Alzheimer’s disease (AD) and other neurodegenerative causes of late-life dementia have become one of the greatest medical challenges of our time. Diagnosis is difficult to make and, in a significant proportion of cases, inaccurate. It is common for a diagnosis to be delayed for 2–3 years after symptom onset [2], and a clinical diagnosis of AD proves to be mistaken at postmortem examination in as many as 25% of cases [3]. Currently, there are no licensed treatments that stop, let alone reverse, the progression of the neurodegenerative pathologies. In recent years, however, understanding of the biological and clinical characteristics of dementias has benefited from the ability to interrogate the increasing quantities of electronic data that exist for this growing patient population.
One example concerns the delineation of the typical pattern of response to one of the symptomatic treatments (cholinesterase inhibitors) that are licensed for the treatment of AD, using clinical evaluations recorded in thousands of electronic patient records. These evaluations included a summary estimate of a patient’s cognitive ability in the form of a score (out of 30) on a bedside or office-based cognitive test such as the Mini-Mental State Examination (MMSE) [4] or the Montreal Cognitive Assessment (MoCA) [5].
Given the progressive nature of AD, it can be assumed that the score achieved by a given patient will remain unchanged for a variable length of time after the diagnosis has been made, and then gradually worsen. Once decline begins, the score obtained at a given assessment will usually be lower than that obtained at the preceding one. Other than in patients who enjoy a particularly strong therapeutic effect from cognitively enhancing drugs such as acetylcholinesterase inhibitors [6], it is almost never the case that a patient’s score on the MMSE undergoes a significant increase between two assessments. Moreover, the longer the interval between assessments, the more likely it is that decline, rather than stability, will be the observed pattern of change. Perera et al. [7] found that, other than in patients who started with MMSE scores of 10 or less, the improvement in MMSE score after starting a cholinesterase inhibitor seldom exceeded 5 points, and that the trajectory returned to progressive decline after as little as six months of treatment.
While knowledge of these trends may be helpful to individual clinicians when asked for a prognosis, and to health economists, epidemiologists and policymakers when considering the burden of the disease on society, their applicability depends crucially on the accuracy of the recorded MMSE data. Accuracy cannot, however, be taken for granted, given the tendency for such clinical instruments to be used and scored in different ways by different clinicians, not to mention for accidentally erroneous values to be entered in the record. A challenge that therefore arises in the interpretation of big data such as electronic patient records (EPR) is the identification of clinical episodes where the MMSE score has been incorrectly calculated or inaccurately recorded. Previous work [8] supports the use of a set of baseline rules for identifying such inaccuracies, based on the differences between MMSE scores obtained on consecutive episodes of assessment. A more advanced approach, however, could attempt to identify an inaccurate MMSE score using features associated with an individual episode. A further challenge is to investigate whether MMSE score values can be correctly and effectively predicted using information from electronic health records.
This paper employs machine learning models to attempt accurate MMSE score prediction, based on a set of clinical, molecular (e.g., APOE type), and demographic features associated with each patient episode. The main aim of this prediction is to determine whether instances of specious and/or inconsistent MMSE scores, which appear frequently in clinical records, can be identified and corrected. In the setting of the current experiments, a patient’s MMSE score was defined as erroneous if one of a number of conditions was met. These conditions, based on a priori assumptions derived from the clinical experience of one of the authors (PG), were as follows: an erroneous MMSE score had been documented if the MMSE value between two consecutive episodes of testing either (a) increased by more than five points during any interval up to a year, (b) increased by three or more points between the first and second years after diagnosis, or (c) increased by any number of points over an interval greater than two years. The solution to this problem entails two steps. First, these rules are used to separate the assessments at which erroneous MMSE scores are recorded from those in which the value of the MMSE score is plausible. Second, an MMSE score prediction model is trained and tested to determine what proportion of miscalculated cases are replaced by a predicted MMSE value, and what proportion of the scores presumed to be correct are left unchanged by the prediction model.
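As an illustration, the three a priori conditions can be expressed as a simple check over consecutive assessments. The function below is a minimal sketch, not the study's actual code: its name and signature are ours, intervals are assumed to be measured in days between tests, and rule (b) is interpreted as applying to intervals of between one and two years.

```python
def is_erroneous(prev_score: int, curr_score: int, interval_days: int) -> bool:
    """Flag an implausible MMSE increase between two consecutive tests."""
    change = curr_score - prev_score
    if change <= 0:
        return False               # decline or stability is always plausible
    if interval_days <= 365:
        return change > 5          # rule (a): more than 5 points within a year
    if interval_days <= 730:
        return change >= 3         # rule (b): 3 or more points in years one to two
    return True                    # rule (c): any increase beyond two years
```

For example, a rise from 20 to 26 within 200 days would be flagged as erroneous, whereas a fall from 26 to 20 over any interval would not.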
We hypothesise that the MMSE test and other clinical factors reported in electronic health records represent rich sources of information for cognitive assessment and AD severity. Building on this hypothesis, the predictive capacity of machine learning algorithms is exploited to forecast the MMSE value of a patient in a clinical episode and to classify whether or not the patient is misdiagnosed. The accuracy of these models is experimentally assessed, and the observed results provide evidence of the influence that clinical conditions have on AD assessment. Further clinical studies are required to validate the assumptions that guided our work. Nevertheless, the analytical validation, based on traditional metrics that measure the accuracy of the predictive models, indicates that these results are promising and strengthens the premise that a holistic evaluation of a patient is required for an accurate diagnosis. Although several approaches in the literature report the use of machine learning on combined datasets that include the MMSE test and other clinical factors [3,9,10,11], to the best of our knowledge, the predictive problems stated in this paper have not been modelled with machine learning models trained on the OPTIMA dataset [2,12,13]. Thus, this work not only evidences the role of machine learning, but also represents a novel contribution to the portfolio of accurate tools to support clinicians in the challenging task of diagnosing dementia patients.
The principal stages of this work are summarised below:
Cleaning and curating data from a real-world ageing study, based on certain rules, to test and validate meaningful predictions.
A prediction (regression and classification) model that provides MMSE score estimations for patient episodes, based on various features of those episodes, and identifies the most important features for calculating the MMSE value.
A prediction model that classifies patient episodes with erroneous MMSE scores, based on other clinical features associated with the episode.
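The regression stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual pipeline: the feature set (age, years since diagnosis, APOE e4 carrier status, previous MMSE score) and the synthetic data stand in for the curated OPTIMA variables, and the model settings are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical episode features: age, years since diagnosis,
# APOE e4 carrier (0/1), previous MMSE score.
n = 500
X = np.column_stack([
    rng.integers(60, 95, n),
    rng.uniform(0, 10, n),
    rng.integers(0, 2, n),
    rng.integers(5, 30, n),
])
# Synthetic target loosely following progressive decline from the
# previous score, bounded to the 0-30 MMSE range.
y = np.clip(X[:, 3] - 0.8 * X[:, 1] + rng.normal(0, 1, n), 0, 30)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Report accuracy in MMSE points and the learned feature importances.
mae = mean_absolute_error(y_te, model.predict(X_te))
print(f"MAE: {mae:.2f} MMSE points")
print("feature importances:", model.feature_importances_.round(2))
```

The classification stage follows the same pattern with a classifier and the rule-derived erroneous/plausible label as the target; the feature-importance output addresses the second aim of identifying the most informative features.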
The paper is structured as follows: the current Section 1 introduces the background and motivation of the work. Section 2 presents recent related work on dementia diagnostic models and MMSE score predictions, while Section 3 formulates the problem addressed and analyses the methods employed in this work. Section 4 presents the OPTIMA dataset [2,12,13], the cleaning and curation process, and the experimental results and their evaluation on our dataset. Section 5 discusses our approach, and Section 6 concludes and outlines future work.
5. Discussion
The experiments presented in this study illustrate the predictive value of demographic features, combined with past measurements of specific aspects of a cognitive examination (the MMSE), in the effective prediction of a patient’s score on the same instrument. The resulting MMSE predictions can therefore be used to approximate correct MMSE values in cases where miscalculated scores, which may arise for various reasons, have been entered into an electronic database. In this way, assessments of the cognitive state of an individual at each assessment episode become more reliable means of distinguishing cognitively normal people from those with incipient or established dementia, and of estimating the severity of dementia where it is present.
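The correction step described above can be sketched as a simple substitution: scores flagged as implausible are replaced by model predictions, and all others are kept. Both the per-episode flags (from the rule-based check) and the trained predictor are assumed inputs here; the helper itself is hypothetical.

```python
from typing import Callable, Sequence

def correct_scores(scores: Sequence[int],
                   flags: Sequence[bool],
                   predict: Callable[[int], float]) -> list:
    """Replace flagged MMSE scores with rounded model predictions.

    `flags[i]` is True where episode i was judged erroneous;
    `predict(i)` returns the model's MMSE estimate for episode i.
    """
    return [round(predict(i)) if bad else score
            for i, (score, bad) in enumerate(zip(scores, flags))]
```

For instance, with scores [25, 30, 24] where the second is flagged and the model predicts roughly 25 for it, the corrected series preserves the first and third values and replaces only the flagged one.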
It is axiomatic that the performance and value of any analytical model will be limited by the quality of the available data [41]. Outside clinical circles, however, the potential for inaccuracy in medical reporting is perhaps less well recognised, and in the evaluation of patients with cognitive disorders (a process that is still largely driven by observation and experience, sometimes, but not always, guided by the application of consensus criteria), disagreements and erroneous diagnoses are particularly rife. This is largely because the majority of cases of dementia are caused by the gradual accumulation of molecular-level damage that has no effect on function, cognition or behaviour until relatively advanced [42], and that has been impossible to detect with certainty during life [43,44].
In machine learning, dataset size and heterogeneity (e.g., diverse populations) are always important [23]. However, finding an adequate dataset in the healthcare domain is usually difficult. Because our predictive models were trained on a small and relatively old dataset, additional and more diverse data are needed to address generalisation issues. Many obstacles may arise when deploying machine learning models in real-world scenarios, such as differently distributed datasets, retraining, re-calibration, and generalisability. Technical differences, such as varying measurement equipment, coding definitions, EHR systems and medical personnel, are also expected to contribute to this generalisation problem. As a result, it is important to monitor shifts in new cases in order to assess and improve the predictive models’ effectiveness. One of the alternatives for this is data-driven testing [45].
Nonetheless, it is possible to envisage scenarios in which the approach described in this work may be of practical clinical use. One is research involving electronic health records in which MMSE scores are recorded as part of the clinical assessment and are of interest to the researcher, whether as an outcome or a covariate. The researcher is likely to encounter at least as many (and in practice probably many more) erroneously calculated MMSE scores in such an unstructured dataset. The ability to detect between-group differences in values, or the influence that the values may exert on another comparison, would be enhanced by a dataset in which these erroneous values had been replaced by plausible ones. Another situation in which automatic detection of an anomalous MMSE score could be useful is a clinical decision-support system that has learned to detect, and alert the clinician to, an MMSE value that is implausibly different from previously recorded scores. Such a system would contribute to the goal of accuracy in clinical record keeping, but would also be useful for ensuring data quality in research settings (such as a pharmaceutical trial) where change in MMSE score is an outcome of interest.
Finally, the increasing availability of large datasets, in the form of both open data resources (such as DPUK (https://www.dementiasplatform.uk/ (accessed on 28 July 2021)) or the AD & FTD Mutation Database (https://uantwerpen.vib.be/ADMutations (accessed on 28 July 2021))) and more recently described methods for integrating these different sources of information [46,47,48], is set to further improve the analytical and predictive capabilities of data science in complex domains. These include the interplay of factors that underlie the development of common but aetiologically heterogeneous medical conditions, including dementia [49]. While medical datasets that incorporate sufficient structured information are currently few and far between, such resources are increasing in number and availability owing to the growing recognition of the potential of data science and data mining and the increasing number of relevant data portals. Important and potentially powerful examples of the latter include tools for interrogating and anonymously harvesting clinical information from aggregated EPRs. An early and influential example of this approach to clinical data assets without compromising data confidentiality has been the UK’s Clinical Record Interactive Search (UK-CRIS) facility, which has delivered original insights into aspects of the diagnosis and management of mental health disorders [50]. Because (at least in the UK) the diagnosis and management of dementia has historically fallen within the purview of mental health professionals, similar insights have been possible in this increasingly important domain of clinical research [51,52].
6. Conclusions and Future Work
The analyses presented in this study provide preliminary evidence that machine learning approaches may be helpful in the task of optimising the accuracy of big data assets which, although potentially highly informative, are also vulnerable to the presence of inaccuracies. When in vivo diagnostic tools such as molecular ligands [53,54,55] become widely available, it will be possible to address, and retrospectively ameliorate, diagnostic inaccuracies in large clinical dementia datasets. This will, in turn, lead to new insights into disease risks and mechanisms, the discovery of endophenotypes, and patient stratification. Although we did not have access to diagnostic ground truth, the use of MMSE scores as a proxy variable is presented as evidence for the feasibility of using AI to improve diagnostic accuracy within large datasets through the implementation of ML algorithms such as random forests (RF).
One of the biggest strengths of this study was the availability of a large, comprehensive and structured dataset containing values of numerous variables obtained from an ageing population. In contrast, much of the clinical data currently available from EPRs is unstructured and requires NLP pre-processing to extract a database of numerical values similar to those used here, a database that will be less comprehensive and likely to contain missing values. Since MMSE scores are both amenable to accurate extraction using NLP and widely documented in clinical assessments, replication of the current study on a larger and more ecologically valid database should be possible.
It will, in addition, be important to substitute data-driven methods for the ‘handcrafted’ criteria for the accuracy of recorded scores adopted here, which, while unlikely to overestimate the rate of deviation from scoring accuracy, may have underestimated it. The development of methods for deriving accuracy criteria from the characteristics of the data themselves, and their implementation, are among the studies currently in progress in our consortium, and we expect to report on them in the near future.