Article

A Hybrid System Based on Bayesian Networks and Deep Learning for Explainable Mental Health Diagnosis

Department of Informatics, Technical University Federico Santa María, Valparaíso 2390123, Chile
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8283; https://doi.org/10.3390/app14188283
Submission received: 19 June 2024 / Revised: 6 September 2024 / Accepted: 8 September 2024 / Published: 14 September 2024
(This article belongs to the Topic Explainable AI for Health)

Featured Application

The proposed model can be used to improve symptoms checker tools for mental health disorders, allowing accurate and transparent predictions based on user input.

Abstract

Mental illnesses are becoming one of the most common health concerns among the population. Despite the proven efficacy of psychological treatments, mental illnesses remain largely underdiagnosed, particularly in developing countries. A key contributing factor is the scarcity of mental health providers able to diagnose them. In this work, we propose a novel method that combines the general capabilities and accuracy of large language models with the explainability of Bayesian networks. Our system analyzes symptom descriptions written by users in natural language and, based on these descriptions, asks questions to confirm or refine the initial diagnosis made by the deep learning model. We trained our model on a large-scale dataset collected from various internet sources, comprising over 2.3 million data points. The initial prediction of the large language model is refined through symptom confirmation questions derived from a probabilistic graphical model constructed by experts on the basis of the DSM-5 diagnostic manual. We present results on symptom descriptions sourced from the internet and on clinical vignettes extracted from behavioral science exams, demonstrating the effectiveness of our hybrid model in classifying mental health disorders. Our model achieves high accuracy across a wide range of mental health disorders while providing transparent and explainable predictions.

1. Introduction

In recent literature, mental health has emerged as a topic of paramount concern, with increasing emphasis on its profound impact, especially on young people. There is growing evidence of an upward trend in mental health disorders within this age group, which influences quality of life and contributes to an increase in self-harm incidents and suicide rates.
Although there is substantial evidence supporting the effectiveness of psychopharmacological and psychotherapeutic treatments for mental illness, studies show that 60% of individuals with mental illness do not receive any form of treatment. This issue is even more pronounced in developing countries, where 75–80% of serious mental conditions remain untreated [1,2].
Recently, online platforms for psychological treatments have become an important resource for patients to receive care. Platforms such as BetterHelp (betterhelp.com, accessed on 18 June 2024) and Talkspace (talkspace.com, accessed on 18 June 2024) conduct millions of sessions monthly, becoming one of the main means of accessing psychological treatment. These platforms often use symptom checkers to accurately identify the needs of each user and match them with the most appropriate provider.
Symptom checkers are not new; they have been part of artificial intelligence applications in medicine since the field's inception, with systems such as INTERNIST-I, a pioneering rule-based reasoning system for medical diagnosis [3].
These systems have become increasingly popular, allowing users to enter symptoms and receive information about possible conditions [4,5]. Research indicates that 80% of people consult online platforms before visiting a healthcare professional [6].
Symptom checking systems can be divided into two broad categories. The first category includes rule-based systems that use predefined rules and logic to derive a diagnosis. Probabilistic systems, such as those based on Bayesian networks, also fall into this category. These methods model conditional probabilities between symptoms, risk factors, and conditions, and use the laws of probability to derive a diagnosis [3,7,8,9,10].
The second category consists of models based on machine learning. These models require only raw data or features derived from raw data. The model then learns by itself the important features of the input data and the relationships between the input variables and the output variable, which in this case is a diagnosis [11,12,13,14,15,16]. These models have already shown great promise for mental health diagnosis tasks. For a comprehensive review of machine learning models and AI applied to mental health diagnosis, the reader can refer to [17].
Both categories have pros and cons. Rule-based systems have the advantage of being directly linked to proven medical knowledge, as their rules are often defined by medical professionals. They are also more transparent and explainable than machine learning-based models, since the reasoning process leading to a decision can be traced and evaluated by professionals. This facilitates the collaboration between human professionals and AI systems. However, a common limitation of rule-based systems is their diagnostic accuracy. Although these models played an important role in the early development of artificial intelligence, they have been largely replaced by machine learning models, which achieve higher accuracy. Additionally, rule-based systems are difficult to adapt to real raw data, as they typically require input in the form of discrete variable values defined by an expert or provided by the user.
On the other hand, machine learning models are generally more accurate than rule-based systems. In addition, they are easier to apply to the kind of unprocessed data found on the Internet and social media, since modern machine learning models can be trained on raw data such as text or images and automatically learn the important features to extract from those data. However, machine learning models lack the transparency of rule-based models. This opacity limits their application in medical diagnosis, as human experts cannot clearly understand the decision-making process of the model, making it difficult to collaborate effectively with these kinds of AI systems.
In this work, we propose a hybrid model that combines a rule-based Bayesian network and a deep language model based on machine learning. The deep language model, trained on human-written symptom descriptions, diagnoses the most probable conditions. The Bayesian network, built by experts, uses clinically validated information from diagnostic manuals to refine the initial diagnosis of the large language model. By integrating these models, we achieve high diagnostic accuracy, allowing the system to process user-provided symptom descriptions without the need for expert intervention, while also ensuring transparency and explainability, enabling users and professionals to understand the reasoning behind the diagnoses. This fusion aims to create a highly accurate diagnostic tool for mental conditions that adheres to clinical standards and provides clear, understandable results. A diagram summarizing the model is shown in Figure 1.
This work is organized as follows:
In the Methods section, we define the system and establish a set of desirable requirements to ensure that it serves as a useful tool for both users and experts. We then detail the components of the system—specifically, the deep language model trained to predict condition categories based on symptom descriptions from users and the Bayesian network used to refine the initial response while providing an explainable and transparent diagnostic decision. We describe the construction of the dataset and network necessary for these models and outline the primary features of each one.
In the Results section, we present the performance of both models on two datasets designed to evaluate the accuracy of the system in classifying clinically valid condition categories according to the DSM-5 diagnostic manual. We compare these results with other models in the literature and with classifications made by psychology professionals.
Finally, in the Discussion section, we discuss the results and present our conclusions.

2. Materials and Methods

2.1. System Requirements

We start by defining a set of system requirements that we think the model must comply with. Our intention is to create a system that is useful for both users and clinical experts. With this in mind, we defined a set of requirements that will guide the proposal of our system.

2.1.1. Clinically Valid

First, the model must adhere to clinical standards within the psychiatry domain, as delineated in recognized clinical diagnostic manuals such as the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders) and ICD-10 (International Statistical Classification of Diseases and Related Health Problems). This means that the model is required to classify symptoms in a manner consistent with the diagnostic criteria specified in these manuals. In addition, the decisions made by the model must be both explainable and transparent, allowing users and professionals to understand the underlying reasoning behind each diagnosis. This transparency is crucial to ensure the traceability and auditability of the model.
To fulfill these requirements, the model must be able to articulate the list of symptoms considered when making the diagnosis and the rationale for its decisions. For example, the model might determine that a patient is suffering from depression if they exhibit symptoms such as depressed mood, loss of interest or pleasure, and significant weight loss. Providing this detailed rationale is vital for users to understand the decision-making process of the model and for professionals to assess its accuracy.

2.1.2. Accuracy

The model must classify symptoms with high accuracy, ensuring alignment with the diagnostic criteria specified in the DSM-5 manual. This precision is crucial for the model to serve as a reliable tool for both users and professionals.

2.1.3. Flexibility

An additional objective is to ensure the flexibility of the model to accommodate both lay users and professionals. This requires the model to comprehend natural language descriptions of symptoms and engage in interactive querying to confirm or refine initial diagnoses. Such capabilities are essential to make the model accessible to users who may not have a detailed understanding of the DSM-5 diagnostic criteria. In addition, the model must be fast and user-friendly, enabling quick and simple access to the needed information.
With these criteria in place, we propose a hybrid system that merges a deep language model based on machine learning with a rule-based Bayesian network. The deep language model, trained on human-written symptom descriptions, predicts condition categories from user inputs. The Bayesian network, developed by experts, leverages clinically validated data from the DSM-5 manual to refine the initial diagnosis provided by the deep language model. This integration facilitates high diagnostic accuracy and allows the system to process the symptom descriptions provided by the users. In addition, it maintains transparency and explainability, ensuring that both users and professionals can understand the diagnostic reasoning.
In subsequent sections, we detail the development of the dataset and network required for these models and describe their key features.

2.2. Deep Learning Model for Classification of Mental Health Disorders

2.2.1. Dataset Construction

Building a useful dataset to train a model that can diagnose mental health disorders is a challenging task. The dataset must be large enough to train a deep learning model and diverse enough to cover a wide range of mental health disorders. In this section, we describe the construction of the dataset used to train our models. Our goal is to build a labeled dataset that is large and consistent with the diagnostic criteria of the DSM-5 manual so that the model can learn to classify symptoms in a way that is clinically sound.
During recent years, a growing body of literature has begun to explore the use of data obtained from social media platforms to detect mental health disorders [18,19,20,21]. The underlying premise is to discern discursive markers indicative of potential mental health issues, thus facilitating the prediction of users who may require mental health support.
For instance, in [22], the authors used both the network of friends of users and the textual content of the posts to predict the risk of depression. In [23], the authors proposed a model that incorporated the temporal variable, using the textual content of tweets and Twitter metadata to predict the onset of a depressive episode in users. The reader is invited to read [18] for a recent survey on work using data from social media to detect or predict mental health disorders.
Although a range of social media platforms have been used for data extraction, such as Facebook [24], Twitter has been widely documented in the literature as a preferred data source, something attributable to the simplicity of large-scale data extraction via its API. The popularity of Twitter among younger demographics and the accessibility of tweets make it a convenient choice, albeit with a caveat. The scarcity of labeled data poses a restriction for the training of cutting-edge deep learning models, recognized as the state-of-the-art in natural language processing [25,26,27,28,29].
To obtain labeled data from Twitter and other social networks, in [19], four distinct methods are proposed to label mental health-related data. The first approach involves the use of self-report questionnaires. In this method, the user completes a standardized questionnaire designed to assess a specific mental health condition [30,31,32,33]. Due to their clinical validity, questionnaires are a conventional method to identify mental health disorders in patients and are widely employed in various studies [23,24,34,35]. Nonetheless, the amount of labeled data is directly dependent on the willingness of users to participate in the survey, limiting its scope.
A second methodology for data labeling involves self-declaration, where data are extracted from users' own disclosures of a mental health diagnosis in their tweets, looking for key phrases such as "I was diagnosed with depression" or "I suffer from depression" [36,37,38]. This approach yields a greater data volume because it can be fully automated, from raw data collection to labeling; it has thus been used in the construction of benchmark datasets, such as the one created for the 2015 CLPsych Workshop [39]. However, this method of labeling is generally deemed less reliable than manual labeling or clinical questionnaires, and despite the automation, it does not necessarily guarantee a large data volume. This limitation is largely due to the relatively infrequent public declarations of mental health issues on social platforms, a consequence of the persisting stigma surrounding such disclosures.

A more robust method for data labeling entails manual annotation by experts. This approach requires annotators, either with clinical experience or under the supervision of clinical experts, to apply established criteria to classify a post or tweet. This method affords flexibility in labeling, potentially allowing categorization by severity, causative factors, or any variable stipulated in the labeling rules. This strategy aligns with traditional labeling and benchmark construction methodologies within the field of Natural Language Processing (NLP). However, the demand for expert involvement significantly constrains the speed of labeling relative to other methods. Furthermore, it necessitates agreement protocols among multiple experts to accurately label challenging cases, ultimately limiting the achievable size of such datasets.
A noteworthy effort directed towards these datasets is the construction of the Deptweet Benchmark [40]. The authors of this study collected and labeled 40,191 tweets, utilizing annotators trained by clinical psychologists. Despite the commendable effort, involving a collaborative endeavor of 90 annotators and an exhaustive validation and check process, the resulting dataset size remains small when measured against the standards of deep learning models.
An alternative data labeling method involves the determination of forum membership [41,42,43]. Here, users are classified based on their participation in public support forums related to specific mental health diagnoses. The underlying premise is that these forums are frequently sought for assistance and support following the diagnosis of a mental health condition; thus, forum membership can serve as a form of self-labeling. Despite the likely reduced clinical validity compared to expert labeling [44], the data volume available with this method vastly surpasses that of previous methods.
Reddit presents a particularly resourceful platform for obtaining this type of data [45]. With more than 1.6 billion visits per month and more than 10 million monthly posts, Reddit is among the most popular social networks. Mental health discussions are prevalent on Reddit, with numerous specialized forums providing spaces for anonymous discourse on mental health issues. In particular, there are at least 15 well-known subreddits (or subforums) devoted to mental health topics. These topics are summarized in Table 1.
These subforums are dedicated to specific disorders, often aligned with diagnostic criteria as outlined in mental health disorder manuals [46,47]. This alignment offers an opportunity to harness such data for training models capable of identifying social media posts that correspond to these condition categories.
Several studies have used Reddit datasets to train machine learning models [45,47,48,49]. For instance, ref. [48] details the creation of a training dataset and the deployment of a deep learning model using Convolutional Neural Networks (CNNs) [50] to automatically classify each post in the subreddit. Similarly, ref. [49] introduces a slightly different CNN-based model for subreddit classification, achieving results that are promising but not directly comparable to the previous work.
A notable limitation of a dataset sourced primarily from Reddit is its incomplete representation of clinically valid diagnoses. The nature of discussions on Reddit typically skews towards more prevalent conditions like anxiety, depression, or addictions, resulting in a bias in the dataset. This skew limits the effectiveness of the dataset as a comprehensive training tool for a model tasked with automatically identifying and diagnosing a full spectrum of conditions. Overrepresentation of common conditions increases the likelihood of misdiagnosing rarer mental health problems.
To address this challenge, we propose to augment the Reddit dataset with additional online sources. Reddit is not the sole platform for mental health discussions; numerous other forums and data sources specialize in this area, offering a broader range of discussions. An example is PsychForums, a forum analogous to Reddit but specifically dedicated to mental health topics. In PsychForums, discussions are organized into clinically valid categories and diagnoses with a high level of specificity, enabling users to engage in more specialized conversations. This approach helps to create a more balanced and diverse dataset, enhancing the potential accuracy and scope of our model.
The vast quantity of data available in these forums paves the way for the training of data-intensive models such as large language models. An emerging practice in training these models, particularly in the linguistic domain, is the use of semi-supervised learning as an initial stage to supervised training [25,26,51]. The objective of pre-training is to enable the model to learn contextualized vector representations of words, which can subsequently be utilized in classification tasks. These contextualized representations encapsulate crucial syntactic and semantic information that can improve performance in later tasks.
Multiple studies have demonstrated that pre-training models on data derived from the specific domain of subsequent application markedly improves performance [52,53,54].
This improvement can be attributed to the fact that general models, trained on general domain data scraped from the web, often lack exposure to domain-specific terminology, resulting in a loss of important contextual information about word meaning. Although some of these models are trained on data that potentially contain specific domain information, this information can be partially lost due to phenomena such as catastrophic forgetting [55].
In the domain of psychology, a project that follows this path is MentalBERT, a BERT model further pretrained on Reddit data. However, MentalBERT does not aim to develop a model for categorizing a range of clinically valid diagnoses. As a result, it is limited to a small subset of subreddits, specifically five, encompassing only three different diagnoses (anxiety, depression, and bipolar disorder).
Our goal is to significantly increase the number of subreddits analyzed, thus creating a much larger dataset than that of MentalBERT. Our dataset comprises 30,972,717 sentences, a substantial increase compared to MentalBERT's 13,671,785 sentences.
In addition, we have broadened our scope beyond the original Reddit dataset by integrating additional data sources. This expansion is crucial in building a more comprehensive and clinically relevant training set. Our dataset encompasses a wide spectrum of the DSM-5 diagnostic categories, enhancing the ability of the model to recognize and categorize a diverse array of mental health disorders.
To achieve this, we first aggregated posts from subreddits including BPD, BipolarSOs, BipolarReddit, bipolar, schizophrenia, Anxiety, depression, selfharm, StopSelfHarm, SuicideWatch, addiction, cripplingalcoholism, OpiatesRecovery, opiates, autism, and mentalhealth, spanning the period from 2005 to 2021. Some examples of the extracted posts are shown in Table 2.
We expand this initial dataset with other symptom descriptions from internet data sources, mainly Psychforums, but also Psychology Today, Medline, and Mayo Clinic, among others.
The topics covered in these new data sources range from specific mental health disorders to therapy, medication, and overall support. In contrast to Reddit, PsychForums organizes its subforums according to diagnostic categories defined in the DSM, which are further broken down into more specific diagnoses. For example, the Anxiety Disorders subforum includes specific sections like Agoraphobia, Social Phobia, and Generalized Anxiety Disorder.
To further enrich our dataset with examples of less common conditions, we obtained condition descriptions and symptoms from specialized databases such as PsychologyToday and the other sources described. We enhanced these samples by simulating symptom variations and modifying the phrasing using back-translation techniques. This method allowed us to create a more robust and diverse dataset, encompassing a wide array of mental health disorders.
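As an informal illustration of the back-translation step, the sketch below paraphrases an English symptom description by translating it to a pivot language and back with MarianMT checkpoints; the specific model names and pivot language are our assumptions, since the exact augmentation setup is not specified here.

```python
# Minimal back-translation sketch (assumed MarianMT checkpoints; the exact
# pivot language and translation models used in this work are not specified).
from transformers import MarianMTModel, MarianTokenizer

def back_translate(texts,
                   fwd="Helsinki-NLP/opus-mt-en-de",
                   back="Helsinki-NLP/opus-mt-de-en"):
    """Paraphrase English texts by translating to a pivot language and back."""
    tok_fwd, model_fwd = MarianTokenizer.from_pretrained(fwd), MarianMTModel.from_pretrained(fwd)
    tok_back, model_back = MarianTokenizer.from_pretrained(back), MarianMTModel.from_pretrained(back)

    # English -> German
    batch = tok_fwd(texts, return_tensors="pt", padding=True, truncation=True)
    pivot = tok_fwd.batch_decode(model_fwd.generate(**batch), skip_special_tokens=True)

    # German -> English: yields a paraphrase of the original text
    batch = tok_back(pivot, return_tensors="pt", padding=True, truncation=True)
    return tok_back.batch_decode(model_back.generate(**batch), skip_special_tokens=True)

augmented = back_translate(["I feel a constant sense of dread in crowded places."])
```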
By combining Reddit examples with examples from other internet sources, we are able to construct a dataset that is both large, enabling the training of modern large language models, and extensive in coverage of conditions, thereby reducing the bias of the original Reddit dataset, which is skewed towards more common conditions. Our new dataset includes conditions in the following DSM-5 diagnostic categories: Neurodevelopmental Disorders; Schizophrenia Spectrum and Other Psychotic Disorders; Depressive Disorders; Anxiety Disorders; Trauma and Stressor-Related Disorders; Dissociative Disorders; Somatic Symptom and Related Disorders; Feeding and Eating Disorders; Sleep–Wake Disorders; Sexual Dysfunctions; Disruptive, Impulse-Control, and Conduct Disorders; Substance-Related and Addictive Disorders; Neurocognitive Disorders; and Personality Disorders. After conducting preprocessing and data cleaning, we amassed 2,480,600 data points.
We used the resulting dataset to pretrain a transformer-based deep learning model, specifically BERT, which is described in the next section.

2.2.2. Dataset Preprocessing

A comprehensive dataset was assembled by aggregating 1.5 TB of data spanning from 2005 to 2021. Specifically, the dataset focused on submissions from Reddit; in this context, a submission denotes the initial post introduced by a user to start a discussion. These submissions were filtered to maintain only those associated with the subreddits listed in Table 1.
Each data entry, structured as a JSON object, includes attributes such as the post title, content, author, and an array of metadata fields that include unique identification, thumbnail representation, and voting statistics, among others.
Notably, a non-negligible proportion of submissions predominantly contained links, imagery, or other content not useful for this work. To improve data quality, entries whose post content was shorter than 50 characters were excluded, and URLs within the post content were removed. After these preprocessing steps, the curated dataset comprised 2,319,514 entries with an average word count of 195.8. For publication, we anonymized the dataset using the Python library scrubadub, which employs natural language processing methods to remove personally identifiable information (PII), including names, email addresses, physical addresses/postal codes, credit card numbers, dates of birth, URLs, phone numbers, username and password combinations, Skype/Twitter usernames, social security numbers, tax numbers, and driving license numbers [56].
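A minimal sketch of this cleaning pipeline is shown below, assuming Reddit's standard submission schema (the "selftext" field name is an assumption) and using the scrubadub anonymization step described above.

```python
# Sketch of the cleaning pipeline described above; the JSON field names
# ("selftext") are assumptions based on Reddit's submission schema.
import json
import re
import scrubadub

URL_RE = re.compile(r"https?://\S+")

def preprocess(lines):
    """Yield cleaned, anonymized submissions at least 50 characters long."""
    scrubber = scrubadub.Scrubber()
    for line in lines:
        post = json.loads(line)
        text = URL_RE.sub("", post.get("selftext", "")).strip()  # strip URLs
        if len(text) < 50:          # drop link/image-only submissions
            continue
        yield scrubber.clean(text)  # remove PII (names, emails, phones, ...)
```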
The Reddit-based dataset covers only 7 of the 22 diagnostic categories (a conservative count, as it treats depression as the sole representative of the Depressive Disorders category, which in fact encompasses several more specific diagnoses). To construct a more representative training set that encompasses the full spectrum of diagnoses in the DSM-5, we compiled additional examples from sources specialized in psychology, including forums like PsychForums and informative websites such as MayoClinic or Medline. After collecting, anonymizing, and preprocessing these data, we compiled 161,086 training examples. Although this number is considerably lower than the data collected from Reddit, it forms a much more diverse and representative set in terms of clinical diagnoses. This new dataset encompasses 18 of the 22 categories of the DSM-5.
Table 2 shows a selection of submissions extracted from this dataset. An overview of the statistics of the dataset is delineated in Figure 2.

2.2.3. Model Training

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-centric model based on the Transformer architecture [57]. It employs an unsupervised training approach with emphasis on two principal loss functions during its pre-training phase: Masked Language Model loss (MLM) and Next Sentence Prediction loss (NSP).
For the MLM loss computation, a subset of tokens from an initial input sequence, denoted $x = (x_1, x_2, \ldots, x_T)$, is substituted with the special token [MASK], which produces a modified sequence $m$. The MLM loss at a masked position $t$ is then $\mathcal{L}_t = -\log P(x_t \mid m)$, where $P(x_t \mid m)$ is the probability the model predicts for the genuine token $x_t$ at the masked position, conditioned on the modified sequence $m$.
Regarding the NSP loss, individual examples are separated into sentences, establishing a dataset of examples with sentence pairs A and B. For half of the instances, sentence B sequentially follows A in the corpus; for the other half, B is a sentence selected at random from the corpus. The goal of the model is to predict label 1 when B is the sentence that follows A in the corpus and 0 when B was randomly chosen. The NSP loss is the binary cross-entropy $\mathcal{L}_{NSP} = -y \log(y^*) - (1 - y)\log(1 - y^*)$.
Following pre-training, the model is fine-tuned via a supervised approach for specific tasks. This synergy between unsupervised and supervised training has shown superior results across diverse domains and is the foundation of many modern language models.
The BERT authors pretrained the model in an unsupervised fashion on a combined corpus of BookCorpus and Wikipedia. The resulting pretrained models were made available to the community [58] in several versions of the base Transformer model, differing in size (number of weights).
Several subsequent studies [52,53,54] have demonstrated that incorporating an additional phase of unsupervised learning or pre-training can improve classification performance in downstream tasks across multiple domains. This added phase allows the model to adapt to distributional shifts in the corpus, which are anticipated due to the general nature of the corpus used during model training. For example, technical corpora, or those abundant in domain-specific terminology, are likely to exhibit a distribution with significantly higher conditional probability for certain terms that would be rare in a more general corpus.
Notably, it has been recognized that discussions related to mental health on Reddit display a particular distribution of linguistic features [45]. Furthermore, terms associated with treatments and diagnoses, which are atypical in broader contexts, are widely used.
With this in mind, the first step involved pretraining the BERT model using the entire collection of Reddit posts from the training dataset. For this endeavor, a training set was assembled by masking 15% of the tokens, acquired via WordPiece tokenization, and constructing pairs of sentences for the next sentence prediction task. This was balanced at a 50% ratio between genuinely consecutive sentences and randomly selected ones.
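The following sketch illustrates this pretraining setup using the standard HuggingFace recipe; the file path, block size, and checkpoint name are placeholders rather than the exact configuration used.

```python
# Sketch of continued pretraining with MLM (15% masking) + NSP using the
# standard HuggingFace recipe; paths and checkpoint names are placeholders.
from transformers import (BertTokenizer, BertForPreTraining,
                          DataCollatorForLanguageModeling,
                          TextDatasetForNextSentencePrediction,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# Builds 50/50 consecutive vs. random sentence pairs for the NSP objective.
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer, file_path="psyreddit_posts.txt", block_size=256)

# Dynamically masks 15% of WordPiece tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    # Pretraining hyperparameters as reported in Section 3.1.
    args=TrainingArguments(output_dir="bert-psyreddit", num_train_epochs=3,
                           learning_rate=2e-5, weight_decay=0.01,
                           warmup_steps=10),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```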
After the initial pretraining step, we conducted a subsequent pretraining phase using data from the extended training set. We separated this from the Reddit set due to the considerable size difference, leading to an imbalance particularly for the less common conditions in the extended set. Adapting the model to these less common conditions was the primary reason for extending the initial dataset. Hence, we opted for a distinct, shorter pretraining process, separate from the first pretraining phase. The processing for this phase was performed in the same manner as for the Reddit data.
For the fine-tuning step, we explored two approaches. The first is the classic BERT fine-tuning process, which entails adding a final classification layer. This layer is applied to the vector corresponding to the [CLS] token, which acts as an aggregator of the information captured in the other vectors. The model is then trained with Adam, adjusting all BERT weights.
The second approach employs LoRA (Low-Rank Adaptation) [59,60] to train a significantly smaller subset of parameters. LoRA has become a very popular method for fine-tuning large models in recent years. It works by introducing low-rank matrices that adapt the pretrained weights instead of fine-tuning all the parameters: the update to an original weight matrix of dimension $M \times N$ is expressed as the product of two low-rank matrices of sizes $M \times R$ and $R \times N$, where the rank $R$ is a hyperparameter of the model. This technique reduces the number of trainable parameters considerably, focusing training on a smaller, more critical subset. By adjusting these low-rank matrices, LoRA effectively captures the adaptations needed for the specific task without retraining the entire model.
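A minimal PyTorch sketch of this low-rank update, using the dimensions described above, might look as follows; this is our illustration, not the reference implementation of [59].

```python
# Minimal LoRA layer sketch (PyTorch assumed): a frozen weight W of shape
# (M, N) is adapted by a trainable low-rank product B @ A of rank R.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        self.linear.weight.requires_grad_(False)  # freeze pretrained weights
        m, n = linear.out_features, linear.in_features
        self.A = nn.Parameter(torch.randn(rank, n) * 0.01)  # shape (R, N)
        self.B = nn.Parameter(torch.zeros(m, rank))          # shape (M, R)
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen projection plus the trainable low-rank adaptation delta.
        return self.linear(x) + (x @ self.A.T @ self.B.T) * self.scaling
```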

2.3. Bayesian Network for Diagnosis Refinement through Questioning

In this section, we describe the methodology used to refine the diagnosis obtained from the language model. Our complete system is based on a dual-step approach that combines the strengths of deep learning models and Bayesian networks. The language model is able to automatically analyze textual descriptions of symptoms, capturing the variability and uncertainty inherent in how symptoms are reported by different users. This first step streamlines the diagnostic process, significantly reducing the number of steps to arrive at a prediction.
The enhancement process is based on a Bayesian network constructed on the basis of the DSM-5 diagnostic manual. The process of inquiry and diagnosis is framed as a sequential decision-making problem in which an agent interacts with a patient. At each step, the agent asks the patient about the presence of a specific symptom $s_i$ from a set of possible symptoms $S$. The patient's response, either confirming or denying the symptom $s_i$, guides the agent's next steps.
The ultimate objective of the agent is to accurately identify the disease $d$ from a set of possible diseases $D$ with as few questions as possible.
This approach is similar to other methodological frameworks that have been used in deep reinforcement learning studies, such as [61,62], to model the diagnostic process.
As mentioned previously, our model is based on a Bayesian network, a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph. The network comprises two types of nodes: symptom nodes $S$ and disease nodes $D$. The symptom nodes represent the presence or absence of a symptom, while the disease nodes represent the presence or absence of a disease. The edges between the nodes denote the conditional dependencies between symptoms and diseases.
The inference process in the Bayesian network involves updating the probabilities of the disease nodes based on the responses of the user that confirm or reject the symptom nodes. This process is carried out iteratively, with the agent asking questions about the presence or absence of symptoms to refine the probabilities of the disease nodes. The objective of the agent is to identify the most probable disease given the responses of the patient to the symptoms.
In the following section, we describe the construction of the Bayesian network and the methodology used to refine the diagnosis obtained from the language model.

2.4. Building the Database

2.4.1. DSM-5 Overview

Traditionally, mental health diagnoses are performed using diagnostic manuals, which provide criteria for identifying specific conditions. The most widely recognized of these manuals are the DSM-5 and ICD-11 [46,63]. These guides play a crucial role in standardizing the diagnosis of mental health disorders, offering a structured approach to understanding the symptomatology necessary to classify a condition.
To illustrate: according to the DSM-5, the two core symptoms of major depressive disorder are (1) a depressed mood most of the day and/or (2) markedly diminished interest or pleasure. At least four of the remaining seven symptoms must also be present (three, if both core symptoms are met) for a diagnosis of a depressive disorder, which means that depressive disorders do not have a single consistent symptom profile and vary greatly between individuals.
Doctors use standardized questionnaires to confirm symptoms. For example, the PHQ-9 (Patient Health Questionnaire-9) assesses the nine symptom criteria in the DSM-5. It should be noted that the presentation of symptoms is highly variable between patients, so this method of evaluation has drawn criticism [64].
However, this standardization is crucial to ensure consistent and precise diagnoses and treatments in different healthcare settings.
Specifically, the DSM-5 encompasses mental health disorders spanning a spectrum from widely recognized disorders such as depression and anxiety to less common conditions such as catatonia. Associated with each disorder is a series of symptoms and diagnostic criteria that cumulatively amount to hundreds of distinct symptoms and criteria. These criteria are meticulously described, specifying the nature, duration, and impact of symptoms on an individual's functioning. Primarily, the DSM-5 assists clinicians and researchers in diagnosing mental disorders, ensuring that diagnoses are uniform and reflect the latest scientific understanding.
To develop a Bayesian network that models the relationships between symptoms and conditions, we created a database with the DSM-5 as our primary source. We methodically reviewed each condition listed in the diagnostic manual, extracting and simplifying the descriptions of associated symptoms. This process involved distilling the complex and detailed symptom criteria into a more accessible format that can be readily understood by patients.
We designed the network structure to link symptoms ($s$) with conditions ($d$). Our network is made up of 42 conditions and 114 symptoms.
To generate the necessary Conditional Probability Tables (CPTs) to calculate probabilities within the network, we used the method of expert elicitation [65]. This is a common method for obtaining probabilities when data are scarce or unavailable. We asked a professional psychologist to assess the likelihood that a symptom manifests given a condition on a scale of three categories: necessary, common, and uncommon.
These probabilities align with the QMR-DT network requirements, which will be explained later.
An example of one of these ratings is the following: within sleep disorders, the symptom of difficulty initiating or maintaining sleep was classified as necessary, as it is a defining characteristic of the condition, whereas the symptom of low mood was classified as common.
We then assign probabilities to these categories in the CPT as follows: necessary $\rightarrow p(s \mid d) = 1$, common $\rightarrow p(s \mid d) = 0.7$, and uncommon $\rightarrow p(s \mid d) = 0.3$.
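As an illustration, this mapping from elicited categories to edge probabilities can be expressed as a simple lookup; the condition and symptom names below are hypothetical excerpts of the expert ratings, not the actual database.

```python
# Sketch: mapping the expert-elicited categories to the per-edge
# probabilities p(s = 1 | d = 1) used in the network's CPTs.
ELICITED_PROB = {"necessary": 1.0, "common": 0.7, "uncommon": 0.3}

# Hypothetical excerpt of the expert ratings (condition -> symptom -> category).
expert_ratings = {
    "insomnia_disorder": {
        "difficulty_initiating_or_maintaining_sleep": "necessary",
        "low_mood": "common",
    },
}

# Per-edge activation probabilities p_ij for the Noisy-OR model below.
edge_probs = {
    (cond, sym): ELICITED_PROB[cat]
    for cond, symptoms in expert_ratings.items()
    for sym, cat in symptoms.items()
}
```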
With this information, we developed a Bayesian network encompassing 42 conditions and 114 symptoms.

2.4.2. Modeling the Bayesian Network

To model the diagnostic process, we constructed a Bayesian network that captures the relationships between symptoms and conditions. This network is designed to refine the diagnosis obtained from the language model by validating or refuting the identified symptoms. The network comprises two types of nodes: symptom nodes $S$ and disease nodes $D$. The symptom nodes represent the presence or absence of a symptom, while the disease nodes represent the presence or absence of a disease. The edges between the nodes denote the conditional dependencies between symptoms and diseases, and a Conditional Probability Table (CPT) is defined for each edge.
Our network is based on the Quick Medical Reference (QMR) network [66,67], a Bayesian network specifically designed for medical diagnosis. It represents the relationships between diseases and symptoms using a two-layer structure: the top layer consists of disease nodes and the lower layer of finding nodes (symptoms or exam results), with arcs from diseases to findings. The graph representation of the network is shown in Figure 3.
The probability distribution represented by the QMR network is typically a joint probability distribution over all diseases and symptoms.
$P(S, D) = P(S \mid D)\,P(D) = \prod_{i=1}^{n} P(s_i \mid D) \prod_{j=1}^{m} P(d_j)$  (1)
where $D = \{d_1, d_2, \ldots, d_m\}$ and $S = \{s_1, s_2, \ldots, s_n\}$ are binary variables modeling the presence or absence of a certain disease or symptom.
Consider a symptom $s$ that can arise from $m$ conditions $\{d_1, \ldots, d_m\}$. Specifying this conditional probability distribution in full requires $2^m$ parameters, one for every combination of parent states. This introduces a significant challenge for experts defining the probabilities, since an expert would need to consider every potential combination of active parent variables for each symptom.
A more intuitive approach for experts is to determine the sensitivity and specificity values for each symptom and condition. For a given condition $d_j$ and symptom $s_i$, sensitivity is the probability that the symptom appears when the condition $d_j$ is present:
$Se_{ij} := P(s_i = 1 \mid d_j = 1)$  (2)
In contrast, specificity is the probability that the symptom $s_i$ does not manifest when the condition $d_j$ is absent:
$Sp_{ij} := P(s_i = 0 \mid d_j = 0)$  (3)
To model these probabilities, Pearl [7] introduced the Noisy-OR model. This model presupposes that each condition can independently cause the symptom $s$, and that the probability of the symptom being active is the cumulative effect of these independent causes. Mathematically, it is assumed that there are $m$ links between the parent conditions $\{d_1, \ldots, d_m\}$ and the symptom $s$, merged in an OR gate. To add stochasticity to the model, each link between a parent and the symptom may fail with probability $1 - p_i$. Under these assumptions, the probability of the symptom being "on" is given by
$P(s_i = 1 \mid d) = 1 - \prod_{j : d_j = 1} (1 - p_{ij})$  (4)
Here, $p_{ij}$ represents the likelihood that the condition $d_j$, if present, independently activates the symptom $s_i$; the product runs over the conditions that are present.
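A direct translation of Equation (4) into code, reusing the edge probabilities sketched earlier, could look as follows; this is an illustrative sketch, not the system's implementation.

```python
# Noisy-OR activation sketch: probability that symptom s_i is present given
# the set of active parent conditions, using the edge probabilities p_ij.
from math import prod

def noisy_or(symptom, active_conditions, edge_probs):
    """P(s_i = 1 | d) = 1 - product over active parents of (1 - p_ij)."""
    failure = prod(
        1.0 - edge_probs[(cond, symptom)]
        for cond in active_conditions
        if (cond, symptom) in edge_probs
    )
    return 1.0 - failure

# e.g., noisy_or("low_mood", {"insomnia_disorder"}, edge_probs) -> 0.7
```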
The network is then constructed by defining the relationships between symptoms and conditions using the Noisy-OR model. The network structure is shown in Figure 4.
In the following section, we describe how we combine the predictions from the language model with the Bayesian network to refine the diagnosis.

2.5. A Hybrid Approach for Diagnosis Refinement

In this section, we describe the mathematical details of our symptom checking system.
Consider a network $G$ comprising a set of diseases $D = \{d_1, \ldots, d_m\}$ and symptoms $S = \{s_1, \ldots, s_n\}$, where each disease $d_i$ is associated with a subset of symptoms $s_{d_i}$, and each edge between a symptom $s_i$ and an associated disease $d_j$ is defined by a Conditional Probability Table (CPT) representing $P(s_i \mid d_j)$ that follows the Noisy-OR assumptions.
When a user provides a description of symptoms $T$ in the form of a list of words $T = (w_1, \ldots, w_t)$, a BERT model computes the a priori probabilities of the disease vector, $P(D \mid T)$. This probability is given by the function $f_\theta(T)$, where $\theta$ denotes the model weights.
The BERT model output is formalized as follows:
$f_\theta(T) = \mathrm{softmax}(W \cdot \mathrm{BERT}_{\mathrm{enc}}(T))$  (5)
where $W \in \mathbb{R}^{m \times H}$, with $H$ being the size of the output vector of the BERT model.
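In code, computing this prior from a fine-tuned classifier amounts to a softmax over the output logits; the sketch below uses a placeholder checkpoint name, not a released model.

```python
# Sketch of computing the prior P(D | T) from a fine-tuned classifier;
# the checkpoint name is a placeholder, not a released model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("our-finetuned-bert")   # placeholder
model = AutoModelForSequenceClassification.from_pretrained("our-finetuned-bert")

def disease_priors(text: str) -> torch.Tensor:
    """Return softmax probabilities over the m condition categories."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits          # shape (1, m)
    return torch.softmax(logits, dim=-1).squeeze(0)
```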
As discussed in the previous section, while our deep learning model is designed to predict categories, the Bayesian network operates at the level of individual conditions. To bridge the two, we identify the most probable categories as predicted by the neural network and then construct a graph comprising the top $k$ conditions associated with these categories, based on the output probability that the BERT model assigns to each category.
In the second stage of the algorithm, we compute the probability of symptoms based on the a priori values and the network $G$. We start by taking the top $k$ diseases from the disease probability vector $P(D \mid T)$, as mentioned before:
$D_k = \mathrm{top}_k(P(D \mid T))$  (6)
With this, we construct a subnetwork $G_k$ from $G$, including only the diseases in $D_k$ and their associated symptoms $S_k = \bigcup_{d_i \in D_k} s_{d_i}$.
The joint distribution represented by this subnetwork is given by
$P(S, D) = P(S \mid D)\,P(D) = \prod_{s_i \in S_k} P(s_i \mid D) \prod_{d_j \in D_k} P(d_j \mid T)$  (7)
following the factorization in Equation (1).
The probability of each symptom is computed as
$P(s_i \mid T) = \sum_{d \in D_k} P(s_i \mid d)\,P(d \mid T)$  (8)
The probability of each disease is computed via Bayes' rule as
$P(d_j \mid S) = \frac{1}{N}\,P(S \mid d_j)\,P(d_j)$  (9)
where $N = \sum_{d_i \in D_k} P(S \mid d_i)\,P(d_i)$ is a normalization factor ensuring that the probabilities sum to 1.
To gather new evidence, we query the user to confirm or deny the presence of the $q$ top-ranked symptoms:
$S_q = \mathrm{top}_q(P(S \mid T))$  (10)
where $S_q$ represents the symptoms with the highest updated probabilities and $q$ is a parameter of the algorithm indicating how many symptoms the user confirms or rejects at each step. At each step, the user provides evidence of the form $e = \{s_1 = 1, s_2 = 1, \ldots, s_q = 0\}$.
The new evidence $e$ is added to the existing evidence vector, $E \leftarrow E \cup e$, and the disease probability vector is updated using Equation (9).
Then, new probabilities are computed for the symptoms excluded from the evidence vector, $S_n = S \setminus E$, and the process repeats for $t$ iterations, each time updating the probabilities and gathering new evidence.
Finally, the most probable disease is determined as
$D_p = \arg\max_d P(d \mid T, E)$  (11)
The complete algorithm is summarized in Algorithm 1.
Algorithm 1 Disease diagnosis using BERT and a QMR-inspired Bayesian network
Input: Symptom description word vector $T = (w_1, \ldots, w_t)$
Output: Predicted condition $D_p$
1: procedure QMRBert($T$)
2:   Initialize the BERT model with weights $\theta$
3:   Let $D = \{d_1, \ldots, d_m\}$ be the set of diseases
4:   Let $S = \{s_1, \ldots, s_n\}$ be the set of symptoms
5:   Let $G$ be the disease–symptom network
6:   Compute $P(D \mid T)$ using BERT: $f_\theta(T) \leftarrow \mathrm{softmax}(W \cdot \mathrm{BERT}_{\mathrm{enc}}(T))$
7:   Select top $k$ diseases: $D_k \leftarrow \mathrm{top}_k(P(D \mid T))$
8:   Construct subnetwork $G_k$ from $G$ using $D_k$
9:   Initialize evidence: $E \leftarrow \{\}$
10:  for iteration $i \in \{1, \ldots, t\}$ do
11:    Compute symptom probabilities: $P(s_i \mid T) \leftarrow \sum_{d} P(s_i \mid d)\,P(d \mid T)$
12:    Select $q$ symptoms: $S_q \leftarrow \mathrm{top}_q(P(S \mid T))$
13:    Ask the user to provide evidence $e$ based on $S_q$
14:    Update the evidence vector: $E \leftarrow E \cup e$
15:    Update $P(d_j \mid S, E)$ for each disease $d_j$ in $G_k$:
16:      $P(d_j \mid S) \leftarrow \frac{1}{N}\,P(S \mid d_j)\,P(d_j)$
17:  end for
18:  Final prediction: $D_p \leftarrow \arg\max P(D \mid T, E)$
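For concreteness, a compact Python sketch of this loop is given below, composed from the illustrative helpers sketched above (disease_priors, noisy_or, edge_probs) and with an ask_user callback standing in for the interactive confirmation step; it mirrors Algorithm 1 but is not the system's reference implementation.

```python
# Runnable sketch of Algorithm 1's refinement loop; all names are our
# illustrative assumptions, not the paper's reference implementation.
def refine_diagnosis(text, diseases, symptoms, edge_probs, ask_user,
                     k=10, q=4, t=3):
    priors = disease_priors(text)                           # P(D | T) from BERT
    top = sorted(range(len(diseases)), key=lambda i: -priors[i])[:k]
    prior0 = {diseases[i]: float(priors[i]) for i in top}   # fixed BERT prior
    p_d, evidence = dict(prior0), {}

    for _ in range(t):
        # Equation (8): P(s | T) = sum_d P(s | d) P(d | T), unqueried symptoms only.
        p_s = {s: sum(noisy_or(s, {d}, edge_probs) * p for d, p in p_d.items())
               for s in symptoms if s not in evidence}
        # Query the q top-ranked symptoms (Equation (10)).
        for s in sorted(p_s, key=p_s.get, reverse=True)[:q]:
            evidence[s] = ask_user(s)                       # True / False

        # Equation (9): Bayes update from the fixed prior and all evidence.
        def likelihood(d):
            lik = 1.0
            for s, present in evidence.items():
                p = noisy_or(s, {d}, edge_probs)
                lik *= p if present else 1.0 - p
            return lik

        unnorm = {d: likelihood(d) * prior0[d] for d in prior0}
        z = sum(unnorm.values()) or 1.0                     # normalization N
        p_d = {d: v / z for d, v in unnorm.items()}

    return max(p_d, key=p_d.get)                            # Equation (11)
```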

3. Results

3.1. Experimental Setup

In this section, we present the experimental setup and results of our proposed methodology. We first describe the datasets used for training and evaluation, followed by the experimental results for the Reddit classification task and the extended dataset classification task. Finally, we present the results for the Bayesian network diagnosis refinement task.
For our research, we used the BERT-base model, which comprises 12 layers, 768 hidden units, and 12 heads within its multi-head attention mechanism, for a total of 110 M parameters. We used the HuggingFace implementation of these models [68].
The model was pretrained on the complete PsyReddit dataset, which covers submissions from 2007 to 2021. Training was carried out over three epochs, amounting to 203,661 steps. In general, the training strategies (warmup, weight decay, Adam) and default hyperparameters recommended by the authors of BERT [58] were used unless otherwise stated.
For comparison purposes, after pretraining, the models were fine-tuned to predict the class or subreddit of submissions between the years 2017 and 2018, following the methodology in [48] for a valid comparison. Following those authors, we split the dataset into 80% training and 20% testing and trained on the former for the multiclass classification task over the 11 classes defined by the chosen subreddits.
The models were fine-tuned for three epochs on this dataset.
To select the hyperparameters for both the BERT and LoRA models, we employed a grid search strategy. For LoRA, $r \in \{4, 8, 16\}$ was tested, with $r = 8$ selected based on its accuracy on the validation set of the fine-tuning dataset. For BERT, we explored maximum sequence lengths of 128 and 256. Pretraining parameters were set to a learning rate of $2 \times 10^{-5}$, a weight decay of 0.01, and 10 warmup steps. During fine-tuning, we adjusted the learning rate to $5 \times 10^{-5}$.
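The LoRA grid search can be expressed compactly with the peft library; the following sketch is illustrative, with a placeholder checkpoint name and our own choice of lora_alpha.

```python
# Sketch of the LoRA fine-tuning grid search over the rank r (peft library
# assumed; the checkpoint name is a placeholder for our pretrained model).
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

for r in (4, 8, 16):                                # grid search over the rank
    base = AutoModelForSequenceClassification.from_pretrained(
        "our-pretrained-bert", num_labels=11)       # placeholder checkpoint
    config = LoraConfig(task_type=TaskType.SEQ_CLS, r=r,
                        lora_alpha=2 * r, lora_dropout=0.1)
    model = get_peft_model(base, config)
    model.print_trainable_parameters()              # ~2% of weights at r = 8
    # ... fine-tune for three epochs; keep the r with best validation accuracy
```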
To obtain a model useful for identifying general conditions in mental health-related posts, we fine-tuned the model using our newly proposed extended dataset, which contains a much more diverse array of diagnoses. We trained the model for three epochs and chose the best model based on validation performance. The final model is capable of identifying among 18 clinically valid diagnostic categories. We selected the best-performing model from the previous step to fine-tune on this dataset.
Furthermore, adhering to standard practices in the evaluation of symptom checking systems [4,69], we created a validation set using clinical vignettes, or clinical cases, within the realm of mental health. Clinical vignettes are detailed representative cases used in the training and evaluation of health professionals. These cases are known for their complexity and resemblance to real-life scenarios that are challenging to diagnose, yet are reflective of actual clinical experiences.
We sourced these clinical vignettes from the Blackwell Clinical Vignettes for Behavioral Sciences textbook [70], a resource commonly used in professional training. From this textbook, we selected 37 vignettes. Our task for the model was to identify the most likely diagnosis category, as well as the top three most probable categories, for each vignette.
To establish a realistic benchmark for comparison, we asked three professional psychologists to determine the three most probable conditions for each vignette from the list of available condition labels. Their consensus was used to calculate the top-1 and top-3 accuracies. For a practical illustration, some examples of these vignettes and the corresponding predictions that our model generated are presented in Table 3.
Our main objective was to determine whether the incorporation of a refinement process using a Bayesian network could enhance the initial classification results. To mimic the symptom-checking procedure, we simulated a scenario in which a user responds to the symptoms identified by the system in each interaction.
Specifically, for the symptoms $S_q = \mathrm{top}_q(P(S \mid T))$, our simulation provided evidence $s_i = 1$ if the symptom is included in the symptom subset of the example's true label and $s_i = 0$ otherwise. We studied how the number of symptoms queried in each interaction affected the final diagnosis and how system performance evolved as more refinement steps were taken.

3.2. Experimental Results

Table 4 shows the results on the Reddit classification dataset for the base models with sequence lengths of 128 and 256, as well as the results of the non-pretrained model for comparison; results are reported per class. The performance of the deep CNN model presented in [48] and of the MentalBERT model is also documented for comparative analysis.
As can be observed, the pretrained BERT models surpass the performance of their counterparts, with no significant differences between the 128 and 256 max-length models. The MentalBERT model comes very close, also presenting highly competitive results on this dataset. In addition, we compared the results with the model fine-tuned using LoRA adapters. Our best pretrained model achieved a macro-averaged F1-score of 0.754, while the LoRA-based model attains a macro-averaged F1-score of 0.73. Although its score is slightly lower, it should be noted that the LoRA adapter trains only 2.08% of the original BERT weights, which considerably reduces training requirements in terms of memory, making it a very appealing option.
Based on this, we chose the best-performing model (our pretrained BERT model with a maximum length of 256) and fine-tuned it on our extended dataset to obtain a model that can classify a post into any of the main diagnostic categories of the DSM-5. Given the efficiency advantages of the LoRA approach, we opted to train it with LoRA adapters. The results are summarized in Table 5. As can be seen, the results are very promising, with an average accuracy of 0.71. The results also highlight the difficulty of training on this more comprehensive dataset, which is expected given the increased complexity of the task: a larger number of classes and a considerable imbalance in the number of examples, especially for the rarest classes.
To demonstrate the out-of-sample utility of our model, we tested it on a dataset of clinical vignettes compiled from the Blackwell Clinical Vignettes for Behavioral Sciences [70]. We asked the model to predict the top-1 and top-3 most likely diagnoses for each vignette. The model was not trained on this dataset, making it a true out-of-sample test. We also asked three professional psychologists to perform the same task and calculated top-1 and top-3 accuracies based on the majority vote of their answers. Some examples of vignettes and predictions are shown in Table 3. Our model achieved a top-1 accuracy of 0.54 and a top-3 accuracy of 0.86, whereas the professionals attained a top-1 accuracy of 0.6 and a top-3 accuracy of 0.67. Interestingly, although the top-1 accuracy of the model was lower than that of the professionals, its top-3 accuracy was significantly better. As illustrated in Table 3, some vignettes are challenging to categorize, as the limited description of symptoms can fit many categories. Consider, for example, the second vignette, where the correct label is dementia (Neurocognitive Disorders category) but the model predicted a Psychotic Disorder. The symptoms are very similar, and the accuracy of the model could be increased considerably if it could ask further questions.
To test this, we simulated a question-and-answer process in which the model asks the user to confirm or deny the presence of symptoms. We then updated the disease probabilities based on the user's responses and repeated the process for a total of $t$ iterations. In each step, the user is asked about $q$ symptoms, each of which can be confirmed or denied. This simulates a symptom-checking process in which the symptom checker asks the user to confirm or deny the presence of symptoms in the format of multiple-choice questions. We tested the model with $q = 1$, $q = 4$, and $q = 8$ symptoms per iteration and with $t = 1, 3, 5, 8$ refinement iterations. The results are summarized in Figure 5.
Interestingly, a decline in performance is observed after one iteration of the symptom refinement process. This could be attributed to several factors, but primarily it appears that the posterior probabilities of the conditions $D$ are strongly influenced by both the initial prior probabilities and the conditional probabilities of the symptoms given the conditions. With little evidence accumulated, these conditional probabilities exert a stronger influence than the prior probabilities determined by the predictive BERT model, which already makes a good prediction based on the textual description of symptoms.
However, the situation changes after three iterations of symptom confirmation. On the dataset derived from online sources, the top-1/top-3 performance improves to 0.83/0.92, a 6-point increase in top-1 accuracy.
The transformation is even more pronounced on the clinical vignette-based dataset. Here, the accuracy at three iterations reaches 0.72 for top-1 and 0.81 for top-3, a nearly 20-percentage-point improvement in top-1 accuracy that even surpasses the performance of the professional psychologists.
Accuracy trends upward with additional confirmation iterations but plateaus after eight. This pattern suggests that, while iterative confirmation significantly enhances diagnostic accuracy at first, there is a threshold beyond which further iterations yield diminishing returns.
Interestingly, although the two configurations are similar, it appears more beneficial to distribute the questioning across more iterations rather than asking more questions per iteration. For example, when the model performs five iterations, each querying four symptoms, it collects a total of 20 symptoms and achieves a top-1 accuracy of 0.75. In contrast, three iterations of eight symptoms each gather 24 symptoms but yield a slightly lower accuracy of 0.73. Despite the smaller amount of evidence, performance is better in the first scenario, suggesting that spreading the questions across more iterations improves the model's diagnostic refinement and highlighting its sequential reasoning ability.

4. Discussion

In this work, we have developed a model that combines the scalable, powerful predictive capabilities and adaptability of large language models with the transparency and clinical soundness of Bayesian networks. This architecture aims to create an accurate diagnostic tool for mental conditions, characterized by its explainability and adherence to the clinical standards of psychiatric diagnosis.
To accomplish this, we constructed an extensive pretraining and training dataset, amassing over 2.4 million data points from various forums and online sources. These data were carefully processed and anonymized to facilitate the training of a large language model using modern training methods.
In addition, we designed a Bayesian network comprising a database of 42 conditions and 114 symptoms. This network enhances the predictive accuracy of the model through a question-and-answer process that echoes the differential diagnosis approach used by healthcare professionals. Integrating this Bayesian network improves the initial predictions of the deep learning model by up to 20 percentage points in accuracy, underlining the effectiveness of the proposed hybrid approach.
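For reference, under the standard noisy-OR parameterization used in QMR-style networks (Figures 3 and 4), the probability that a symptom s is present given a set of active conditions d_1, ..., d_n factorizes as

$$
P(s = 1 \mid d_1, \ldots, d_n) = 1 - (1 - \lambda_s) \prod_{i \,:\, d_i = 1} (1 - q_{s,i}),
$$

where q_{s,i} is the probability that condition i alone activates symptom s and λ_s is the standard leak term for causes outside the network; in our case, the parameter values were elicited from experts based on the DSM-5.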
We believe that the integration of deep learning models, with their capability to learn high-quality mappings from raw text to prediction probabilities, and probabilistic models, which are well suited to distilling clinical knowledge accumulated over years of professional practice, represents a promising and fruitful area of research.
There is plenty of room for improvement and future development of our work. First, although the probabilistic network we used reflects the relationships between conditions and symptoms delineated in diagnostic manuals, it still leaves significant room for improvement. Contemporary psychopathological research increasingly conceptualizes mental conditions as states of intricate networks comprising symptoms, biological factors, and environmental and social influences. Integrating these complex networks into our model could substantially boost its predictive power and help professionals understand the multifaceted development of mental health disorders.
Another growing research area involves causal analysis based on Bayesian network models. Tools such as do-calculus enable inquiries far more complex than the correlational analysis offered by standard Bayesian networks. This is particularly pertinent in mental health, where symptoms often have multiple causes. For instance, predicting that a condition is probable given a symptom differs fundamentally from determining that the symptom would not exist without the condition being present. The latter is a causal question that traditional probabilistic models struggle to address effectively. An illustrative case is depressed mood, which can have various causes, including grief; the critical question is whether the symptom is the result of a mental health condition. Expanding our model to incorporate causal analysis tools could significantly improve its capacity to provide insightful and relevant responses.
Additionally, because our network is seeded with prior probabilities of conditions obtained from an initial prediction on the user's input, it opens up opportunities for adapting to the large variability encountered across demographics and users. Specifically, we can adjust the network to account for the unique characteristics of each user by modifying the prior probability of conditions with a user-specific factor derived from features of each person, as sketched below. This adaptability is crucial in psychology, where the prevalence of mental health disorders and symptoms varies significantly with factors such as ethnicity, age, and gender.
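As a minimal sketch of this adaptation, assuming hypothetical prevalence weights (the real factors would come from epidemiological prevalence tables stratified by ethnicity, age, and gender):

```python
import numpy as np

def adjust_prior(bert_prior, prevalence_weights):
    # Reweight the model's prior over conditions with user-specific
    # prevalence factors and renormalize so it remains a distribution.
    adjusted = np.asarray(bert_prior) * np.asarray(prevalence_weights)
    return adjusted / adjusted.sum()

# Hypothetical example: three conditions, and a demographic group in
# which the second condition is twice as prevalent as in the reference
# population used during training.
prior = [0.5, 0.3, 0.2]
weights = [1.0, 2.0, 1.0]
print(adjust_prior(prior, weights))  # approx. [0.385 0.462 0.154]
```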
Although our study has demonstrated the potential of our hybrid method for the classification of mental health conditions, there are some limitations that should be considered in future research.
Despite concerted efforts to minimize dataset bias in the representation of uncommon diagnoses, significant challenges persist. Many rare conditions remain underrepresented, leading to disproportionately higher accuracy on frequently discussed conditions and diminished performance on rarer diagnoses. A promising way to address this is the generation of synthetic data using modern large language models, which have shown strong results for this purpose; we plan to explore this direction in future work.
Additionally, the current validation process relies on a simulated user interaction to emulate patient responses. This method may not fully capture the complex nuances and inherent uncertainties of real-world patient interactions, so future research will require a more comprehensive evaluation involving closer collaboration with field professionals to ensure the predictions of the model are robust and reliable. Moreover, our current framework does not integrate critical factors such as patient history, hereditary traits, and other pertinent clinical considerations typically assessed in a thorough diagnostic process. Addressing these limitations is crucial for advancing the applicability of our model as a reliable support tool in clinical psychology and will be a focal point of future iterations of our work.
In conclusion, our work represents a step forward in the development of a diagnostic tool for mental health disorders that combines the predictive power of deep learning models with the clinical soundness and transparency of Bayesian networks. The results of our experiments demonstrate the potential of this hybrid approach to enhance the accuracy and transparency of automated mental health diagnosis. We believe that this work lays the foundation for future research in this area and opens up exciting possibilities for the development of more effective and reliable diagnostic tools for mental health disorders.

Author Contributions

Conceptualization, J.P.; Methodology, J.P.; Software, J.P.; Writing—original draft, J.P.; Supervision, H.A.; Funding acquisition, H.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by ANID PIA/APOYO AFB220004 and in part by Project DGIIP-UTFSM PI-LIR23-13.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The authors would like to express their gratitude to Diego Acuña for his essential contributions to technical support and infrastructure management throughout the project. Additionally, they would like to thank Simón Michell, whose expertise as a clinical psychologist has been important in the design of the Bayesian network, the interpretation of the results, and the coordination of their validation.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Patel, V.; Chowdhary, N.; Rahman, A.; Verdeli, H. Improving access to psychological treatments: Lessons from developing countries. Behav. Res. Ther. 2011, 49, 523–528. [Google Scholar] [CrossRef] [PubMed]
  2. Ngui, E.M.; Khasakhala, L.; Ndetei, D.; Roberts, L.W. Mental disorders, health inequalities and ethics: A global perspective. Int. Rev. Psychiatry 2010, 22, 235–244. [Google Scholar] [CrossRef]
  3. Miller, R.A.; Pople, H.E., Jr.; Myers, J.D. Internist-I, an experimental computer-based diagnostic consultant for general internal medicine. In Computer-Assisted Medical Decision Making; Springer: Berlin/Heidelberg, Germany, 1985; pp. 139–158. [Google Scholar]
  4. Semigran, H.L.; Linder, J.A.; Gidengil, C.; Mehrotra, A. Evaluation of symptom checkers for self diagnosis and triage: Audit study. BMJ 2015, 351, h3480. [Google Scholar] [CrossRef] [PubMed]
  5. Razzaki, S.; Baker, A.; Perov, Y.; Middleton, K.; Baxter, J.; Mullarkey, D.; Sangar, D.; Taliercio, M.; Butt, M.; Majeed, A.; et al. A comparative study of artificial intelligence and human doctors for the purpose of triage and diagnosis. arXiv 2018, arXiv:1806.10698. [Google Scholar]
  6. White, R.W.; Horvitz, E. Experiences with web search on medical concerns and self diagnosis. In Proceedings of the AMIA Annual Symposium Proceedings; American Medical Informatics Association: Bethesda, MD, USA, 2009; Volume 2009, p. 696. [Google Scholar]
  7. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: Burlington, MA, USA, 1988. [Google Scholar]
  8. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
  9. Kononenko, I. Inductive and Bayesian learning in medical diagnosis. Appl. Artif. Intell. Int. J. 1993, 7, 317–337. [Google Scholar] [CrossRef]
  10. Semigran, H.L.; Levine, D.M.; Nundy, S.; Mehrotra, A. Comparison of physician and computer diagnostic accuracy. JAMA Intern. Med. 2016, 176, 1860–1861. [Google Scholar] [CrossRef] [PubMed]
  11. Dabowsa, N.I.A.; Amaitik, N.M.; Maatuk, A.M.; Aljawarneh, S.A. A hybrid intelligent system for skin disease diagnosis. In Proceedings of the 2017 International Conference on Engineering and Technology (ICET), Antalya, Turkey, 21–23 August 2017; pp. 1–6. [Google Scholar]
  12. Rathod, J.; Waghmode, V.; Sodha, A.; Bhavathankar, P. Diagnosis of skin diseases using Convolutional Neural Networks. In Proceedings of the 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 29–31 March 2018; pp. 1048–1051. [Google Scholar]
  13. Alfian, G.; Syafrudin, M.; Ijaz, M.F.; Syaekhoni, M.A.; Fitriyani, N.L.; Rhee, J. A personalized healthcare monitoring system for diabetic patients by utilizing BLE-based sensors and real-time data processing. Sensors 2018, 18, 2183. [Google Scholar] [CrossRef]
  14. Gonsalves, A.H.; Thabtah, F.; Mohammad, R.M.A.; Singh, G. Prediction of coronary heart disease using machine learning: An experimental analysis. In Proceedings of the 2019 3rd International Conference on Deep Learning Technologies, Xiamen, China, 5–7 July 2019; pp. 51–56. [Google Scholar]
  15. Ardila, D.; Kiraly, A.P.; Bharadwaj, S.; Choi, B.; Reicher, J.J.; Peng, L.; Tse, D.; Etemadi, M.; Ye, W.; Corrado, G.; et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med. 2019, 25, 954–961. [Google Scholar] [CrossRef]
  16. McKinney, S.M.; Sieniek, M.; Godbole, V.; Godwin, J.; Antropova, N.; Ashrafian, H.; Back, T.; Chesus, M.; Corrado, G.S.; Darzi, A.; et al. International evaluation of an AI system for breast cancer screening. Nature 2020, 577, 89–94. [Google Scholar] [CrossRef]
  17. Abd-Alrazaq, A.; Alhuwail, D.; Schneider, J.; Toro, C.T.; Ahmed, A.; Alzubaidi, M.; Alajlani, M.; Househ, M. The performance of artificial intelligence-driven technologies in diagnosing mental disorders: An umbrella review. NPJ Digit. Med. 2022, 5, 87. [Google Scholar] [CrossRef]
  18. Iyortsuun, N.K.; Kim, S.H.; Jhon, M.; Yang, H.J.; Pant, S. A Review of Machine Learning and Deep Learning Approaches on Mental Health Diagnosis. Healthcare 2023, 11, 285. [Google Scholar] [CrossRef] [PubMed]
  19. Guntuku, S.C.; Yaden, D.B.; Kern, M.L.; Ungar, L.H.; Eichstaedt, J.C. Detecting depression and mental illness on social media: An integrative review. Curr. Opin. Behav. Sci. 2017, 18, 43–49. [Google Scholar] [CrossRef]
  20. Kim, J.; Lee, D.; Park, E. Machine learning for mental health in social media: Bibliometric study. J. Med. Internet Res. 2021, 23, e24870. [Google Scholar] [CrossRef] [PubMed]
  21. Di Nuovo, A.G.; Catania, V.; Di Nuovo, S.; Buono, S. Psychology with soft computing: An integrated approach and its applications. Appl. Soft Comput. 2008, 8, 829–837. [Google Scholar] [CrossRef]
  22. De Choudhury, M.; Gamon, M.; Counts, S.; Horvitz, E. Predicting depression via social media. In Proceedings of the International AAAI Conference on Web and Social Media, Cambridge, MA, USA, 8–11 July 2013; Volume 7, pp. 128–137. [Google Scholar]
  23. Reece, A.G.; Reagan, A.J.; Lix, K.L.; Dodds, P.S.; Danforth, C.M.; Langer, E.J. Forecasting the onset and course of mental illness with Twitter data. Sci. Rep. 2017, 7, 13006. [Google Scholar] [CrossRef] [PubMed]
  24. Schwartz, H.A.; Eichstaedt, J.; Kern, M.; Park, G.; Sap, M.; Stillwell, D.; Kosinski, M.; Ungar, L. Towards assessing changes in degree of depression through facebook. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Baltimore, MD, USA, 27 June 2014; pp. 118–125. [Google Scholar]
  25. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  26. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. arXiv 2018, arXiv:1802.05365. [Google Scholar]
  27. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  28. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  29. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  30. Kroenke, K.; Spitzer, R.L.; Williams, J.B. The PHQ-9: Validity of a brief depression severity measure. J. Gen. Intern. Med. 2001, 16, 606–613. [Google Scholar] [CrossRef]
  31. Williams, N. The GAD-7 questionnaire. Occup. Med. 2014, 64, 224. [Google Scholar] [CrossRef]
  32. Sharp, R. The Hamilton rating scale for depression. Occup. Med. 2015, 65, 340. [Google Scholar] [CrossRef]
  33. Beck, A.T.; Steer, R.A.; Brown, G.K. Beck Depression Inventory; Harcourt Brace Jovanovich: New York, NY, USA, 1987. [Google Scholar]
  34. Tsugawa, S.; Kikuchi, Y.; Kishino, F.; Nakajima, K.; Itoh, Y.; Ohsaki, H. Recognizing depression from twitter activity. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, Seoul, Republic of Korea, 18–23 April 2015; pp. 3187–3196. [Google Scholar]
  35. De Choudhury, M.; Counts, S.; Horvitz, E.J.; Hoff, A. Characterizing and predicting postpartum depression from shared facebook data. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, Baltimore, MD, USA, 15–19 February 2014; pp. 626–638. [Google Scholar]
  36. Resnik, P.; Armstrong, W.; Claudino, L.; Nguyen, T.; Nguyen, V.A.; Boyd-Graber, J. Beyond LDA: Exploring supervised topic modeling for depression-related language in Twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Denver, CO, USA, 5 June 2015; pp. 99–107. [Google Scholar]
  37. Pedersen, T. Screening Twitter users for depression and PTSD with lexical decision lists. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Denver, CO, USA, 5 June 2015; pp. 46–53. [Google Scholar]
  38. Coppersmith, G.; Dredze, M.; Harman, C.; Hollingshead, K. From ADHD to SAD: Analyzing the language of mental health on Twitter through self-reported diagnoses. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Denver, CO, USA, 5 June 2015; pp. 1–10. [Google Scholar]
  39. Coppersmith, G.; Dredze, M.; Harman, C.; Hollingshead, K.; Mitchell, M. CLPsych 2015 shared task: Depression and PTSD on Twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Denver, CO, USA, 5 June 2015; pp. 31–39. [Google Scholar]
  40. Kabir, M.; Ahmed, T.; Hasan, M.B.; Laskar, M.T.R.; Joarder, T.K.; Mahmud, H.; Hasan, K. DEPTWEET: A typology for social media texts to detect depression severities. Comput. Hum. Behav. 2023, 139, 107503. [Google Scholar] [CrossRef]
  41. Bagroy, S.; Kumaraguru, P.; De Choudhury, M. A social media based index of mental well-being in college campuses. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, Denver, CO, USA, 6–11 May 2017; pp. 1634–1646. [Google Scholar]
  42. Gkotsis, G.; Oellrich, A.; Hubbard, T.; Dobson, R.; Liakata, M.; Velupillai, S.; Dutta, R. The language of mental health problems in social media. In Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology, San Diego, CA, USA, June 2016; pp. 63–73. [Google Scholar]
  43. De Choudhury, M.; Kiciman, E.; Dredze, M.; Coppersmith, G.; Kumar, M. Discovering shifts to suicidal ideation from mental health content in social media. In Proceedings of the CHI Conference on Human Factors in Computing Systems, San Jose, CA, USA, 7–12 May 2016; pp. 2098–2110. [Google Scholar]
  44. Ernala, S.K.; Birnbaum, M.L.; Candan, K.A.; Rizvi, A.F.; Sterling, W.A.; Kane, J.M.; De Choudhury, M. Methodological gaps in predicting mental health states from social media: Triangulating diagnostic signals. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, UK, 4 May 2019; pp. 1–16. [Google Scholar]
  45. De Choudhury, M.; De, S. Mental health discourse on reddit: Self-disclosure, social support, and anonymity. In Proceedings of the International AAAI Conference on Web and Social Media, Ann Arbor, MI, USA, 1–4 June 2014; Volume 8, pp. 71–80. [Google Scholar]
  46. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders: DSM-5; American Psychiatric Association: Washington, DC, USA, 2013; Volume 5. [Google Scholar]
  47. Gaur, M.; Kursuncu, U.; Alambo, A.; Sheth, A.; Daniulaityte, R.; Thirunarayan, K.; Pathak, J. Let me tell you about your mental health! Contextualized classification of reddit posts to DSM-5 for web-based intervention. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Turin, Italy, 22–26 October 2018; pp. 753–762. [Google Scholar]
  48. Gkotsis, G.; Oellrich, A.; Velupillai, S.; Liakata, M.; Hubbard, T.J.; Dobson, R.J.; Dutta, R. Characterisation of mental health conditions in social media using Informed Deep Learning. Sci. Rep. 2017, 7, 45141. [Google Scholar] [CrossRef] [PubMed]
  49. Kim, J.; Lee, J.; Park, E.; Han, J. A deep learning model for detecting mental illness from user content on social media. Sci. Rep. 2020, 10, 11846. [Google Scholar] [CrossRef]
  50. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef]
  51. Howard, J.; Ruder, S. Universal language model fine-tuning for text classification. arXiv 2018, arXiv:1801.06146. [Google Scholar]
  52. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef] [PubMed]
  53. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676. [Google Scholar]
  54. Huang, K.; Altosaar, J.; Ranganath, R. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv 2019, arXiv:1904.05342. [Google Scholar]
  55. Ramasesh, V.V.; Lewkowycz, A.; Dyer, E. Effect of scale on catastrophic forgetting in neural networks. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  56. LeapBeyond; Malmgren, D.; IDEO; Datascope. Scrubadub: Clean Personally Identifiable Information from Dirty Dirty Text. 2023. Latest Version v2.0.1. Available online: https://snyk.io/advisor/python/scrubadub (accessed on 18 June 2024).
  57. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  58. Devlin, J. BERT. 2020. Available online: https://github.com/google-research/bert (accessed on 18 June 2024).
  59. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  60. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. arXiv 2023, arXiv:2305.14314. [Google Scholar]
  61. Tang, K.F.; Kao, H.C.; Chou, C.N.; Chang, E.Y. Inquire and diagnose: Neural symptom checking ensemble using deep reinforcement learning. In Proceedings of the NIPS Workshop on Deep Reinforcement Learning, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  62. Xia, Y.; Zhou, J.; Shi, Z.; Lu, C.; Huang, H. Generative adversarial regularized mutual information policy gradient framework for automatic diagnosis. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 1062–1069. [Google Scholar]
  63. Meagher, D.J.; MacLullich, A.M.; Laurila, J.V. Defining delirium for the international classification of diseases, 11th revision. J. Psychosom. Res. 2008, 65, 207–214. [Google Scholar] [CrossRef] [PubMed]
  64. Yan, W.J.; Ruan, Q.N.; Jiang, K. Challenges for artificial intelligence in recognizing mental disorders. Diagnostics 2022, 13, 2. [Google Scholar] [CrossRef]
  65. Constantinou, A.C.; Fenton, N.; Neil, M. Integrating expert knowledge with data in Bayesian networks: Preserving data-driven expectations when the expert variables remain unobserved. Expert Syst. Appl. 2017, 57, 197–208. [Google Scholar] [CrossRef]
  66. Shwe, M.; Cooper, G. An empirical analysis of likelihood-weighting simulation on a large, multiply connected medical belief network. Comput. Biomed. Res. 1991, 24, 453–475. [Google Scholar] [CrossRef]
  67. Middleton, B.; Shwe, A.; Heckerman, E.; Henrion, M.; Horvitz, J.; Lehmann, P.; Cooper, F. Probabilistic diagnosis using a reformulation of the internist-1/qmr knowledge base. Methods Inf. Med. 1991, 30, 256–267. [Google Scholar]
  68. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
  69. Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Hou, L.; Clark, K.; Pfohl, S.; Cole-Lewis, H.; Neal, D.; et al. Towards expert-level medical question answering with large language models. arXiv 2023, arXiv:2305.09617. [Google Scholar]
  70. Bhushan, V.; Pall, V.; Le, T.; Nguyen, H. Behavioral Science, 3rd ed.; Blackwell's Underground Clinical Vignettes; Blackwell: Oxford, UK, 2022. [Google Scholar]
Figure 1. A summary diagram representing our proposed architecture.
Figure 2. (a) The distribution of the number of examples per subreddit in the Reddit dataset. (b) The distribution of the number of examples per diagnosis category in the extended dataset.
Figure 3. A diagram of a QMR network of symptoms and diseases.
Figure 4. Noisy-OR gate.
Figure 5. Improvement over iterations, for different numbers of questions asked per iteration, on the vignettes dataset and the validation set.
Table 1. Subset of subreddits related to mental health; some smaller ones are left out.

Subreddit | Description
BPD | A community for people with BPD (Borderline Personality Disorder), as well as for their family and friends, to exchange stories, provide support, and discuss coping strategies.
bipolar | A subreddit for people affected by bipolar disorder to share personal stories, discuss treatment options, and offer support to each other. Also includes the related subreddits BipolarReddit and BipolarSOs.
Schizophrenia | A subreddit dedicated to creating a supportive environment for those affected by schizophrenia, whether personally or in their relationships.
Anxiety | A place for people with anxiety disorders and those who know them to share their experiences and strategies for managing anxiety.
depression | A subreddit dedicated to sharing experiences, providing support, and fostering discussion about depression.
selfharm | A community dedicated to offering support, advice, and a safe space for individuals who are dealing with self-harm behaviors or thoughts. Also includes the related subreddit StopSelfHarm.
suicidewatch | A supportive community for individuals who are experiencing feelings of suicide, with the goal of providing emotional support and directing people to appropriate professional help.
addiction | A community offering support, advice, and a non-judgmental platform for discussing addiction and recovery.
cripplingalcoholism | A subreddit specifically for people dealing with severe, often life-threatening alcoholism.
Opiates | A supportive community for people who are trying to recover from opioid addiction. This subreddit offers a platform to share experiences, coping strategies, and resources. Also includes the related subreddit OpiatesRecovery.
Autism | A community that serves as a platform for individuals on the autism spectrum, as well as their family members, friends, and anyone else interested in learning about autism.
Table 2. Representative examples of content posted by authors in the corresponding subreddits.

Example | Category
I’ve spent all my money. This isn’t the first time I’ve done this. I’ll go on long kicks without looking at my bank account, and then I don’t realize what’s happened until it’s too late …I’m just going to fuck things up again. I don’t know what I’m going to do. I can’t even go to the store because one of my tires is flat. I may call the counselor that I’ve seen here (but no longer, it was short term) because I only see one option out of this. I don’t know why I’m writing this, but than you for reading it. | SuicideWatch
I’ve always consumed a lot of booze. I’ve always drank more than the normal person. Granted, I’ve been drinking a BIT more than normal, but today my SO made some criticisms about it. Ugh. They accepting it for 2 years. And all of a sudden it’s a concern? Does this mean I have to hide my drinking now like I’m doing something wrong? Grr! | cripplingalcoholism
My son (3.5 year old) who is autistic has recently started obsessing over objects. Initially it was a Star Trek drinking cup though he never drank out of it. He knew when we went out he couldn’t carry it around, that it was just for home and in the car. This obsession has moved to a Star Trek DVD case. …I’m having difficulty determining how to handle it. It’s worth mentioning there’s a family history of OCD (diagnosed). Any advice would be very helpful. Thanks | autism
Table 3. Clinical vignettes and their corresponding labels and model predictions.

Clinical Vignette | Label | Top-1 Prediction
28 y/o man. irritability and insomnia. extremely productive lately. enhanced alertness. hyper energetic states episodically followed by periods of malaise. apathy. loss of appetite. decreased ability to concentrate. hypersomnia. | mood | mood
30 y/o man. full blown aids. tremor. ataxia. memory loss. visual and auditory hallucinations. | cognitive | psychotic
14 y/o female. body weight is less than 85% of that expected. she has missed her last three periods. reports that she is fat and strongly fears gaining weight. intense hunger but prefers not to eat. abusing laxatives. | eating | eating
22 y/o man. committing multiple crimes and attempting suicide. hyperactive. did poorly in school. abused animals. neglected by his drug abusing parents. failed to hold down any job. committed numerous crimes to support his drinking habit. he feels no remorse for the pain inflicted on others. | personality | personality
38 y/o female. complaining that her left hand has disappeared. wearing mismatched shoes and has colored her eyebrows with red lipstick. when the physician examines her hands she suddenly seems relieved and thanks him for restoring her missing hand. avoids family and neighbors because they cant be trusted. claims that can read others mind. | personality | psychotic
Table 4. We compared results from the Reddit test dataset using the deep model presented in [48] and our pretrained BERT models. We also included a non-pretrained model for reference.

Category | Metric | CNN | MentalBert | BERT (Ours) Base128 | BERT (Ours) Base256 | BERT (Ours) LoRA
r/BPD | P | 0.88 | 0.77 | 0.74 | 0.79 | 0.78
r/BPD | R | 0.46 | 0.61 | 0.62 | 0.60 | 0.59
r/BPD | F | 0.60 | 0.68 | 0.67 | 0.69 | 0.67
r/bipolar | P | 0.77 | 0.72 | 0.80 | 0.75 | 0.79
r/bipolar | R | 0.60 | 0.74 | 0.70 | 0.73 | 0.69
r/bipolar | F | 0.67 | 0.73 | 0.74 | 0.74 | 0.74
r/schizophrenia | P | 0.75 | 0.77 | 0.71 | 0.74 | 0.62
r/schizophrenia | R | 0.48 | 0.56 | 0.58 | 0.57 | 0.57
r/schizophrenia | F | 0.58 | 0.65 | 0.64 | 0.64 | 0.64
r/Anxiety | P | 0.83 | 0.75 | 0.79 | 0.79 | 0.80
r/Anxiety | R | 0.75 | 0.82 | 0.78 | 0.79 | 0.77
r/Anxiety | F | 0.79 | 0.78 | 0.79 | 0.79 | 0.79
r/depression | P | 0.70 | 0.73 | 0.70 | 0.72 | 0.71
r/depression | R | 0.77 | 0.79 | 0.80 | 0.82 | 0.83
r/depression | F | 0.73 | 0.76 | 0.75 | 0.77 | 0.77
r/selfharm | P | 0.70 | 0.79 | 0.82 | 0.82 | 0.79
r/selfharm | R | 0.58 | 0.78 | 0.80 | 0.76 | 0.77
r/selfharm | F | 0.64 | 0.78 | 0.81 | 0.79 | 0.78
r/suicidewatch | P | 0.62 | 0.65 | 0.65 | 0.64 | 0.65
r/suicidewatch | R | 0.59 | 0.55 | 0.58 | 0.56 | 0.56
r/suicidewatch | F | 0.61 | 0.59 | 0.61 | 0.60 | 0.60
r/addiction | P | 0.72 | 0.70 | 0.68 | 0.71 | 0.69
r/addiction | R | 0.41 | 0.53 | 0.54 | 0.53 | 0.48
r/addiction | F | 0.52 | 0.60 | 0.60 | 0.60 | 0.57
r/cripplingalcoholism | P | 0.68 | 0.87 | 0.87 | 0.86 | 0.84
r/cripplingalcoholism | R | 0.76 | 0.74 | 0.82 | 0.77 | 0.74
r/cripplingalcoholism | F | 0.72 | 0.80 | 0.84 | 0.81 | 0.79
r/Opiates | P | 0.76 | 0.92 | 0.92 | 0.94 | 0.92
r/Opiates | R | 0.86 | 0.93 | 0.92 | 0.92 | 0.92
r/Opiates | F | 0.80 | 0.93 | 0.92 | 0.83 | 0.92
r/autism | P | 0.84 | 0.88 | 0.86 | 0.90 | 0.90
r/autism | R | 0.71 | 0.78 | 0.78 | 0.78 | 0.75
r/autism | F | 0.77 | 0.83 | 0.82 | 0.83 | 0.82
Macro Averaged | P | 0.75 | 0.78 | 0.78 | 0.79 | 0.78
Macro Averaged | R | 0.63 | 0.71 | 0.72 | 0.71 | 0.70
Macro Averaged | F | 0.67 | 0.74 | 0.75 | 0.75 | 0.73
Table 5. Results of the LoRA-fine-tuned model on the extended dataset.

Category | Precision | Recall | F1-Score | Support
Trauma and Stressor-Related Disorders | 0.69 | 0.66 | 0.68 | 803
Substance-Related and Addictive Disorders | 0.88 | 0.84 | 0.86 | 668
Anxiety Disorders | 0.79 | 0.80 | 0.80 | 1840
Neurocognitive Disorders | 1.00 | 0.12 | 0.22 | 74
Neurodevelopmental Disorders | 0.83 | 0.61 | 0.70 | 240
Dissociative Disorders | 0.75 | 0.77 | 0.76 | 1400
Feeding and Eating Disorders | 0.85 | 0.85 | 0.85 | 239
Somatic Symptom and Related Disorders | 0.90 | 0.37 | 0.52 | 52
Impulse-Control and Conduct Disorders | 0.86 | 0.70 | 0.77 | 163
Depressive Disorders | 0.78 | 0.77 | 0.77 | 2263
Personality Disorders | 0.79 | 0.84 | 0.81 | 4851
Schizophrenia Spectrum and Other Psychotic Disorders | 0.71 | 0.70 | 0.71 | 1032
Sexual Dysfunctions | 0.80 | 0.78 | 0.79 | 942
Sleep–Wake Disorders | 0.67 | 0.76 | 0.71 | 123
Macro Avg | 0.81 | 0.68 | 0.71 | 15,170
