Review
Peer-Review Record

AI Chatbots for Mental Health: A Scoping Review of Effectiveness, Feasibility, and Applications

Appl. Sci. 2024, 14(13), 5889; https://doi.org/10.3390/app14135889
by Mirko Casu 1,2,*, Sergio Triscari 2,*, Sebastiano Battiato 1,3, Luca Guarnera 1 and Pasquale Caponnetto 2,3
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Reviewer 4: Anonymous
Submission received: 31 May 2024 / Revised: 3 July 2024 / Accepted: 5 July 2024 / Published: 5 July 2024
(This article belongs to the Special Issue Innovative Digital Health Technologies and Their Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1. While this article offers a comprehensive overview of the current literature, its broad scope did not include an in-depth quality assessment or risk of bias evaluation.

2. Due to the rapid evolution of AI technology, the findings of this review may become outdated as new advancements in natural language processing and conversational AI emerge. 

Comments on the Quality of English Language

minor revision


Author Response

Comment 1: While this article offers a comprehensive overview of the current literature, its broad scope did not include an in-depth quality assessment or risk of bias evaluation.

Response 1: We thank the reviewer for his valuable feedback. We have expanded the section on risk of bias, going into more detail on the assessment of risk of bias in the included studies, explaining more in-depth what the RoB 2 and ROBINS-E assess and in which domains, and adding a summary of the results obtained using these tools. Here is the updated section with a detailed subsection for each tool (lines 553-594):

3.10 Risk of Bias

The included studies were assessed for risk of bias using the Cochrane risk-of-bias tool for randomized trials, version 2 (RoB 2) and the Cochrane Risk Of Bias In Non-randomized Studies - of Exposure (ROBINS-E), depending on the nature of the study. The following figures summarize the implementation of these tools and the overall risk-of-bias evaluations for each included study.

3.10.1 Risk of Bias in Randomized Trials (RoB 2)

The assessment using the RoB 2 tool [54] focused on five domains:

  • D1: Bias arising from the randomization process
  • D2: Bias due to deviations from intended interventions
  • D3: Bias due to missing outcome data
  • D4: Bias in measurement of the outcome
  • D5: Bias in selection of the reported result

Figure 4 presents the detailed evaluation for each study across these domains. Most studies demonstrated a low risk of bias across all domains, indicating a robust methodological quality. Specifically, Thunström et al. [72], Vereschagin et al. [63], Yasukawa et al. [60], Ulrich et al. [62], So et al. [65], Peuters et al. [59], Ogawa et al. [61], Olano-Espinosa et al. [66], Fitzsimmon-Craft et al. [67], He et al. [58], Bennion et al. [71], Greer et al. [69], and Oh et al. [70] exhibited low risk of bias in all domains. Conversely, some concerns were noted in studies such as Prochaska et al. [64], which showed a high risk of bias in the randomization process and some concerns in the measurement of the outcome.
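For readers less familiar with RoB 2, the short sketch below illustrates how per-domain judgements are typically rolled up into the overall ratings shown in Figure 4; the domain judgements in the example are hypothetical placeholders, not data from the review, and the roll-up shown is a simplification of the full RoB 2 algorithm.

```python
# Simplified RoB 2 overall-judgement roll-up: a study is rated "High" if any
# domain is high risk, "Some concerns" if any domain raises some concerns
# (and none is high), and "Low" only if all five domains are low.
# Note: the full RoB 2 guidance also allows "High" when several domains have
# some concerns that together substantially lower confidence in the result.

DOMAINS = ["D1", "D2", "D3", "D4", "D5"]

def overall_rob2(judgements: dict[str, str]) -> str:
    """Derive an overall RoB 2 rating from per-domain judgements."""
    levels = [judgements[d] for d in DOMAINS]
    if "High" in levels:
        return "High"
    if "Some concerns" in levels:
        return "Some concerns"
    return "Low"

# Hypothetical example mirroring the pattern described for Prochaska et al.:
# a high-risk randomization domain dominates the overall judgement.
example = {"D1": "High", "D2": "Low", "D3": "Low", "D4": "Some concerns", "D5": "Low"}
print(overall_rob2(example))  # -> "High"
```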

3.10.2 Risk of Bias in Non-Randomized Studies (ROBINS-E)

The ROBINS-E tool [55] assessed seven domains for non-randomized studies:

  • D1: Risk of bias due to confounding
  • D2: Risk of bias in selection of participants into the study
  • D3: Risk of bias in classification of interventions
  • D4: Risk of bias due to deviations from intended interventions
  • D5: Risk of bias due to missing data
  • D6: Risk of bias in measurement of outcomes
  • D7: Risk of bias in selection of the reported result

Figure 5 provides a summary of these evaluations. Cheah et al. [68] was assessed with low risk of bias across all domains, indicating high methodological rigor and reliable results.

In general, the risk-of-bias assessments using RoB 2 and ROBINS-E tools indicate that most included studies exhibit low risk of bias, suggesting that the findings are robust and reliable. However, attention should be given to studies with identified risks to interpret their results cautiously.

 

Comment 2: Due to the rapid evolution of AI technology, the findings of this review may become outdated as new advancements in natural language processing and conversational AI emerge. 

Response 2: We thank the reviewer for his feedback. Unfortunately, the topic of AI is indeed one that is changing very quickly, as numerous organizations are working to make LLM and AI models more advanced at a rapid pace. We noted this limitation in our scoping review (lines 681-683):

"[...] the rapid evolution of AI technology suggests that the findings of this review may quickly become outdated as new advancements in natural language processing and conversational AI emerge."

Reviewer 2 Report

Comments and Suggestions for Authors
  • The paper is well-written and I enjoyed reading it. I think there are multiple aspects that can be tightened, and some of the key terms, such as their use of "AI-powered", can be modified or at least clarified. Please see numerous specific suggestions below.

  • Good rationale and clear objectives
  • Have some concern that the review may have been limited by not including "conversational agent" as a search term. The reported search string only seemed to include "chatbot"
  • Authors didn't document if there were any language restrictions for the included articles and I assume they excluded articles in which NLP wasn't used. The PRISMA diagram notes that they excluded studies where "asynchronous chat not mediated by AI"
  • You need to expand your inclusion or exclusion criteria. From the flow chart you excluded chatbots using scripted dialogue/non-AI. This needs to be clarified in section 2.3.
  • I couldn't follow the PRISMA Flow diagram as there seemed to be 42 articles comprehensively reviewed with 17 screened out and yet only 15 remaining in the review. 
  • Good narrative review of included studies; however, I think Figure 2 is too small/complicated. My understanding of figures is that they should assist with summarising and I didn't feel that this figure achieved that.
  • Would have liked more information regarding whether each program was "stand alone" or offered as a component/adjunct to other therapy; this was not always clear and may have assisted in linking to the point raised in the discussion, "integration into existing systems remains a challenge".
  • On page 7, you talk about Gamebot2 and say "These findings suggest that more intensive therapist involvement may be necessary to improve the effectiveness of self-help interventions for gambling problems." I would have thought it also suggests that the intervention can be standalone and does not also need costly therapist involvement.  Alternatively, it might suggest the need for further investigation to identify what elements can best or only be provided by a therapist to determine the best human-machine balance/synergy.
  • On page 8, "lower usability ratings" - do you mean that the chatbot has lower usability ratings than the book? What scale was used? The questions for scales such as SUS would not apply well to a book. Similarly, a scale suitable for a book might not make sense for a chatbot.
    Relatedly, how comparable are the different chatbots, even on usability? Was a particular instrument commonly used, or did each study have its own questions? I draw your attention to an instrument developed by the intelligent virtual community to allow comparison: https://ii.tudelft.nl/evalquest/
    You can see the process and related articles used to identify the constructs and instrument
    https://ii.tudelft.nl/evalquest/ASAQ%20publication.html
    Wider uptake of such an instrument will assist comparisons and understanding of what features are useful and when, etc., to advance research efforts relating to conversational agents.
  • In section 3.8, is Eliza in the Bennion study the same Eliza from decades ago? Is it a fair comparison?
  • In Table 1, you only refer to Thunstrom in the context of usability and engagement, however these were also mentioned in many other studies.  I understand you are focusing on the application area, but I would have thought that BETSY was interested in emotional/mental well-being as well. Were these measured - or just the response to the different interfaces. Since it had a large multidisciplinary team, I can't imagine the only focus was on the preferred interface. If the health outcomes are not in this paper, then maybe you need to identify where they are published?
    The risk table indicates outcome data is not missing from this article. It seems to me for this work you need to cite more than one paper related to the study - unless they only cared about usability and not health outcomes.
  • In Discussion, you mention engagement is a major problem. Are issues with/rates of engagement fewer/better with human-delivered? What is reported?
  • You mention the need for "cultural sensitivity and localization". This can also be a potential issue with human-delivered interventions. There are also other reasons why a conversational agent might be better. See Lisetti, Christine. (2012). 10 advantages of using avatars in patient-centered computer-based interventions for behavior change. ACM SIGHIT Record. 2. 28-28. 10.1145/2180796.2180820.
    Also Jonathan Gratch's work with PTSD and preference for agent because it is perceived to be less judgemental.
  • You mention ethical concerns in a few places in the article but only really specify concerns on data privacy and security. Hudlicka (2016) identifies several ethical issues specific to relational agents: the right to keep your emotions to yourself, manipulating others' emotions, and virtual relationships.
    That work and other initiatives/discussions relating to ethical concerns with IVAs, such as emotional dependencies and the use of deception/false memories, are pointed to in Deborah Richards, Ravi Vythilingam, and Paul Formosa. 2023. A principlist-based study of the ethical design and acceptability of artificial social agents. Int. J. Hum.-Comput. Stud. 172, C (Apr 2023). https://doi.org/10.1016/j.ijhcs.2022.102980
  • I became quite confused in the Discussion as there seemed to be a significant amount of new information being discussed that did not come out of the scoping review. I think the discussion about "Natural Language Processing and Technical Aspects of Chatbots" should have been in the "background" section of the paper to set the scene for the reader.
  • Page 11 - rule-based reasoning is still part of the field of AI, so that section heading is somewhat misleading. While rules are often developed using knowledge engineering from a domain expert, rules and decision-trees can also be produced from machine-learning using algorithms such as C4.5. You need to rephrase "AI-powered chatbots" to reflect that you are specifically referring to chatbots using NLP and/or ML.
  • I would like to know what AI technology is used in the 15 analysed chatbots - perhaps it can be added to one of the tables.
  • Section 4.2 "Users make selections to receive answers, but these chatbots are slower and less reliable" - what evidence is there for this statement? Why are they slower? They should be instantaneous and, if hand-crafted and validated by a health expert, more reliable than responses generated by AI.

Author Response

The paper is well-written and I enjoyed reading it. I think there are multiple aspects that can be tightened, and some of the key terms, such as their use of "AI-powered", can be modified or at least clarified. Please see numerous specific suggestions below.

Good rationale and clear objectives

Comment 1: Have some concern that the review may have been limited by not including "conversational agent" as a search term. The reported search string only seemed to include "chatbot"

Response 1: Thank you for your valuable feedback. The decision to use only "chatbot" as a search term was made to maintain focus and consistency in the literature search and subsequent analysis. Using a single search term helped to ensure that the scope of the review remained manageable and that the results were directly relevant to the topic of AI chatbots. While "conversational agents" and other related terms could have been included, they may have introduced additional complexity and potentially diluted the focus of the review. By sticking to "chatbot," we were able to specifically target literature and research directly related to this particular type of AI technology. Furthermore, the term "chatbot" is widely used and recognized in the field of artificial intelligence and human-computer interaction, making it a suitable choice for a scoping review on this topic. Using a well-defined and commonly accepted term also helped to ensure that the search strategy was clear and reproducible, which are important considerations in research methodology. However, in future comprehensive reviews, we agree that employing a broader set of search strings will provide a more inclusive overview of the literature. We appreciate your suggestion.

 

Comment 2: Authors didn't document if there were any language restrictions for the included articles and I assume they excluded articles in which NLP wasn't used. The PRISMA diagram notes that they excluded studies where "asynchronous chat not mediated by AI"

Response 2: We appreciate the reviewer's attention to detail regarding our inclusion and exclusion criteria. We would like to clarify several points in response to the concerns raised. Regarding language restrictions, we want to emphasize that no language restrictions were applied in our study. We included articles regardless of the language of publication. Concerning the use of NLP and AI-mediated chatbots, we would like to clarify that our study did not exclude chatbots without NLP capabilities. We recognize that modern and advanced rule-based reasoning systems are considered a branch of the AI field. As such, we included studies featuring both NLP-based and rule-based AI chatbots in our review.

With respect to the exclusion of "asynchronous chat not mediated by AI," we acknowledge that this criterion may have been unclear in our original manuscript. To clarify, by this phrase we meant to exclude older chatbot systems that were not based on modern architectures/frameworks, rule-based systems, NLP, or machine learning techniques. Our focus was on contemporary AI-driven chatbot systems that employ more advanced simulation of conversation, regardless of the specific AI approach used (e.g., NLP, modern rule-based systems, or machine learning).

We have updated the manuscript to reflect these clarifications, including our inclusion and exclusion criteria in the methods section (lines 231-250).

 

Comment 3: You need to expand your inclusion or exclusion criteria. From the flow chart you excluded chatbots using scripted dialogue/non-AI. This needs to be clarified in section 2.3.

Response 3: We appreciate your feedback regarding the need to expand our inclusion and exclusion criteria. We have addressed this concern by updating section 2.3 of our manuscript. Specifically, we have also added the following criteria (lines 231-250):

Studies were selected based on the following inclusion criteria:

    • Clinical trials;
    • Randomized controlled trials (RCTs);
    • Articles written in any language;
    • Chatbot interventions mediated by modern AI architectures/frameworks;
    • Chatbots utilizing rule-based systems, natural language processing (NLP), or machine learning;
    • Pilot studies examining chatbot interventions for mental health conditions.

Exclusion criteria included:

    • Cross-sectional studies;
    • Reviews;
    • Commentaries;
    • Editorials;
    • Protocols;
    • Case studies;
    • Older chatbot systems not based on modern AI architectures/frameworks;
    • Human-to-human asynchronous communication platforms without AI mediation;
    • Scripted or pre-set chat systems without AI-driven conversation simulation;
    • Studies not focused on chatbot interventions or mental health conditions.

This modification aligns with the exclusions noted in our flow chart and provides a more precise description of our study selection process.

 

Comment 4: I couldn't follow the PRISMA Flow diagram as there seemed to be 42 articles comprehensively reviewed with 17 screened out and yet only 15 remaining in the review. 

Response 4: Thank you for your careful examination of our PRISMA Flow diagram and for bringing this discrepancy to our attention. You are correct in pointing out that the flow diagram as presented doesn't add up correctly. This is indeed an error on our part, and we are grateful for your keen observation.

Upon re-examination of our data and selection process, we have identified and corrected this error. The corrected information is as follows:

- Total articles comprehensively reviewed (assessed for eligibility): 42

- Articles excluded with reasons: 27

- Final number of studies included in the review: 15

We have updated our PRISMA Flow diagram to accurately reflect these numbers, ensuring consistency throughout the selection process.
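As a simple transparency aid, the corrected flow can be verified to add up with a minimal check like the sketch below; it uses only the three counts reported above (any other PRISMA stages would require the actual numbers from the diagram) and is an illustration rather than part of the manuscript.

```python
# Sanity check for the corrected PRISMA flow: reports assessed for
# eligibility minus reports excluded must equal studies included.
assessed_for_eligibility = 42
excluded_with_reasons = 27
included_in_review = 15

assert assessed_for_eligibility - excluded_with_reasons == included_in_review
print(f"{assessed_for_eligibility} assessed - {excluded_with_reasons} excluded "
      f"= {included_in_review} included")
```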

 

Comment 5: Good narrative review of included studies; however, I think Figure 2 is too small/complicated. My understanding of figures is that they should assist with summarising and I didn't feel that this figure achieved that.

Response 5: We appreciate your feedback on Figure 2 (now Figure 3). In response, we've redesigned the figure to focus solely on the key areas covered in the included studies. We've removed the text about feasibility, effectiveness, and applications, and increased the overall size for better readability. The revised figure is now more streamlined and immediate, providing a clearer summary of our review's content.

 

Comment 6: Would have liked more information regarding whether each program was "stand alone" or offered as a component/adjunct to other therapy; this was not always clear and may have assisted in linking to the point raised in the discussion, "integration into existing systems remains a challenge".

Response 6: Thank you for your helpful comment. In response, we have added information to the Results section for each study to clarify whether the program was "stand alone" or offered as a component/adjunct to other therapies. Here is an example of the added sentence for the first study (lines 321-324):

“He et al. [58] conducted a randomized controlled trial to assess the impact of an AI chatbot on depressive symptoms in college students; the chatbot, named XiaoE, was employed as a standalone intervention in this study.”

Additionally, we have made a new table (Table 3) in which we indicated whether each program was “stand alone” or offered as a component/adjunct to other therapy or programs. Moreover, we expanded the Discussion section to summarize the findings and link them to the challenge of integrating such programs into existing systems (lines 658-671):

“While AI chatbots offer scalable solutions for mental health support, their integration with existing healthcare systems remains a challenge. The results indicate that AI chatbots can effectively function as standalone interventions and adjuncts to therapy for improving mental health (see Table 3). For instance, He et al. [58] demonstrated that XiaoE, used independently, reduced depressive symptoms in college students. Conversely, Yasukawa et al. [60] found that an AI chatbot paired with iCBT improved program adherence. Similarly, Ogawa et al. [61] utilized chatbots within a broader telemedicine approach for Parkinson's patients, enhancing patient engagement. The effectiveness of standalone AI chatbots is further supported by studies from Vereschagin et al. [63], Fitzsimmon-Craft et al. [67], Oh et al. [70], and Bennion et al. [71]. However, Peuters et al. [59] and Cheah et al. [68] showcased the benefits of AI chatbots as components of multi-faceted interventions for mental health and HIV prevention, respectively. These findings suggest that AI chatbots are versatile tools in mental health interventions, and ongoing research will help optimize their use and determine their long-term efficacy relative to human therapists.”

 

Comment 7: On page 7, you talk about Gamebot2 and say "These findings suggest that more intensive therapist involvement may be necessary to improve the effectiveness of self-help interventions for gambling problems." I would have thought it also suggests that the intervention can be standalone and does not also need costly therapist involvement.  Alternatively, it might suggest the need for further investigation to identify what elements can best or only be provided by a therapist to determine the best human-machine balance/synergy.

Response 7: Thank you for your thoughtful comment. We have addressed your suggestion and revised the paragraph to reflect your feedback. We agree that the findings suggest the potential for the intervention to be standalone and effective without the need for costly therapist involvement. However, as you rightly pointed out, further exploration is needed to understand the specific roles and contributions of human therapists and machine interventions to optimize their synergy and balance. 

Here is the revised paragraph (lines 409-421): 

In a similar vein, So et al. [65] compared the effectiveness of guided versus unguided chatbot interventions for problem gambling. Their randomized controlled trial evaluated the addition of minimal therapist guidance to the standalone AI chatbot intervention, GAMBOT2. The research specifically aimed to test the isolated effectiveness of the AI-based intervention by varying only the presence of researcher guidance between groups. Both groups showed significant within-group improvements in gambling outcomes, yet there were no significant between-group differences. This indicates that the guidance provided by therapists did not enhance the outcomes beyond the unguided GAMBOT2 intervention. These findings suggest that the standalone intervention is effective and that costly therapist involvement may not be necessary for positive outcomes. However, further investigation is warranted to identify the specific elements that can best or only be provided by a therapist, to determine the optimal human-machine balance and synergy in such interventions.

 

Comment 8: On page 8, "lower usability ratings" - do you mean that the chatbot has lower usability ratings than the book? What scale was used? The questions for scales such as SUS would not apply well to a book. Similarly, a scale suitable for a book might not make sense for a chatbot.

Response 8: Thank you for your valuable feedback. Yes, that is correct. The SUS was used to evaluate the usability of the chatbot. In that study, the chatbot group had a lower mean SUS score (64.5) compared to the book group (69.5), indicating that the participants found the chatbot slightly less usable than the book. However, the difference was not statistically significant (p=0.35). The lower usability ratings for the chatbot could be due to various factors, such as technical issues, complexity of the interface, or unfamiliarity with the technology.

Here is the revised passage in the article (Section 3.7, lines 486-490):


“The chatbot received lower usability ratings and faced technical challenges; the chatbot group reported a lower mean System Usability Scale score (64.5) compared to the book group (69.5). However, qualitative feedback highlighted several advantages of the chatbot, including the availability of coping tools, interactive learning, and self-management features.”
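For context on how SUS means such as 64.5 and 69.5 are obtained, the sketch below shows the standard SUS scoring procedure; the item responses in the example are invented for illustration and are not taken from the study.

```python
# Standard System Usability Scale (SUS) scoring: ten items rated 1-5,
# odd-numbered (positively worded) items contribute (response - 1),
# even-numbered (negatively worded) items contribute (5 - response),
# and the summed contributions are multiplied by 2.5 to give a 0-100 score.

def sus_score(responses: list[int]) -> float:
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# Hypothetical respondent; a group mean such as 64.5 is simply the average
# of these per-participant scores.
print(sus_score([4, 2, 4, 2, 3, 2, 4, 3, 4, 2]))  # -> 70.0
```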

 

Comment 9: Relatedly, how comparable are the different chatbots, even on usability? Was a particular instrument commonly used, or did each study have its own questions? I draw your attention to an instrument developed by the intelligent virtual community to allow comparison: https://ii.tudelft.nl/evalquest/ You can see the process and related articles used to identify the constructs and instrument: https://ii.tudelft.nl/evalquest/ASAQ%20publication.html Wider uptake of such an instrument will assist comparisons and understanding of what features are useful and when, etc., to advance research efforts relating to conversational agents.

Response 9: Thank you for your valuable feedback and for highlighting the resources from the intelligent virtual community. Regarding the comparability of different chatbots and the instruments used for usability evaluation, our review revealed significant variation. While some studies employed standardized tools like the System Usability Scale (SUS) and the Working Alliance Questionnaire (WAQ), others relied on qualitative feedback without standardized measures. The SUS, for instance, was used in studies by Bennion et al., Oh et al., and Thunström et al., suggesting its broad acceptance. However, the diverse methods and instruments used indicate a need for a unified evaluation framework. The instrument developed by the intelligent virtual community could provide this standardization, facilitating more consistent and comparable assessments across different chatbots. We appreciate your suggestion and will consider incorporating this standardized instrument in future research to enhance the comparability and rigor of our evaluations.

We also updated the Discussion section, see below (lines 612-627):

“While the studies indicate the feasibility and acceptability of AI chatbots for mental health interventions, challenges related to usability and engagement persist (see Table 3). Some studies, like those by He et al. [58] and Prochaska et al. [64], utilized specific standardized tools such as the Working Alliance Questionnaire (WAQ) and the System Usability Scale (SUS) to assess usability and human interaction. Other studies, such as Peuters et al. [59] and Yasukawa et al. [60], relied on qualitative feedback from interviews without using standardized instruments. For instance, Ulrich et al. [62] employed a mix of quantitative metrics and standardized scales like the Mobile Application Rating Scale (MARS) alongside qualitative feedback. The SUS was a commonly used tool in several studies (e.g., Bennion et al. [71], Oh et al. [70], Cheah et al. [68], and Thunström et al. [72]), indicating a preference for its straightforward and well-established usability assessment. However, each study often tailored its evaluation approach to its specific context and objectives, leading to variability in the comparability of findings across different chatbots. This diversity in evaluation methods underscores the importance of a standardized approach for assessing chatbot usability and user interaction to facilitate more consistent comparisons across studies.”

We have also created a new table that specifically compares AI technology, usability, and engagement among the analyzed chatbots. You can find it at the end of the Results section (Table 3).

 

Comment 10: In section 3.8, is Eliza in the Bennion study the same Eliza from decades ago? Is it a fair comparison?

Response 10: Thank you for your insightful comment. The Bennion study included in Section 3.8 compared the usability and effectiveness of two conversational agents: MYLO, a more advanced system utilizing NLP and potentially basic ML techniques, and an updated version of the classic ELIZA chatbot from the 1960s. ELIZA was an early NLP system based on pattern matching and rule-based algorithms, rather than the more sophisticated NLP and ML approaches used in modern chatbots like MYLO. While ELIZA and MYLO represent different generations of chatbot technology, the comparison offers valuable insights into the evolution of this field. By including this study, we were able to illustrate the progress made in chatbot development, from ELIZA's foundational rule-based approach to MYLO's more advanced NLP and ML capabilities. The comparison is therefore "fair", considering that it's motivated by the need to illustrate developments in the field, and to show how newer architectures can deliver better results in direct comparison with previous ones.

 

Comment 11: In Table 1, you only refer to Thunstrom in the context of usability and engagement, however these were also mentioned in many other studies.  I understand you are focusing on the application area, but I would have thought that BETSY was interested in emotional/mental well-being as well. Were these measured - or just the response to the different interfaces. Since it had a large multidisciplinary team, I can't imagine the only focus was on the preferred interface. If the health outcomes are not in this paper, then maybe you need to identify where they are published?

Response 11: Thank you for your valuable feedback. Unfortunately, the study focused on evaluating the system's capabilities and user preferences for the chatbot interface without explicitly measuring emotional or mental well-being outcomes. The health outcomes were not the primary focus of this paper, and the large multidisciplinary team was involved in various aspects of the project, including ethical, medical, and legal considerations, rather than solely the preferred interface. However, it seems that future work will involve assessing the impact of the chatbot on users' emotional and mental well-being, which will be published separately (as there are no health outcomes available at the moment). For these reasons, we have referred to Thunstrom only in the context of usability and engagement in Table 1.

 

Comment 12: The risk table indicates outcome data is not missing from this article. It seems to me for this work you need to cite more than one paper related to the study - unless they only cared about usability and not health outcomes.

Response 12: Thank you for your comment. Unfortunately, as noted in the previous response, the study focused only on usability without considering health outcomes, so in the RoB table we have included outcomes related to usability and engagement.

 

Comment 13: In Discussion, you mention engagement is a major problem. Are issues with/rates of engagement fewer/better with human-delivered? What is reported?

Response 13: Thank you for your valuable feedback. In examining engagement issues across the studies, several key points emerged. In general, while chatbot engagement varied, several studies suggest potential benefits and positive engagement comparable to or better than traditional methods, highlighting the need for ongoing refinement to address engagement variability and improve user experience. We also added the following paragraph to the Discussion section (lines 632-642), thanks to your comment.

Notably, He et al. [58] found that the XiaoE chatbot demonstrated lower attrition and high initial engagement, though engagement fluctuated and appeared more suited for short-term use. Peuters et al. [59] reported initial enthusiasm but overall low engagement, with the chatbot's ability to provide meaningful replies being a significant factor. Yasukawa et al. [60] observed improved completion rates with the chatbot-enhanced iCBT program, indicating positive engagement. Similarly, Ulrich et al. [62] noted good engagement rates with a substantial portion of participants completing the program, and Vereschagin et al. [63] found higher engagement with the chatbot compared to other app components. Oh et al. [70] and Bennion et al. [71] reported satisfactory engagement and completion rates, with the latter study indicating comparable engagement to human-delivered CBT.

 

Comment 14: You mention the need for "cultural sensitivity and localization". This can also be a potential issue with human-delivered interventions. There are also other reasons why a conversational agent might be better. See Lisetti, Christine. (2012). 10 advantages of using avatars in patient-centered computer-based interventions for behavior change. ACM SIGHIT Record. 2. 28-28. 10.1145/2180796.2180820. Also Jonathan Gratch's work with PTSD and preference for agent because it is perceived to be less judgemental.

Response 14: Thank you for your valuable comment. We agree, and have decided to expand that paragraph as follows, incorporating the sources you suggested (lines 643-657):

The reviewed studies also emphasize tailoring chatbot interventions to specific populations and contexts. The success of mental health chatbots during the COVID-19 pandemic underscores the need for adaptive solutions in crises. Similarly, cultural sensitivity and localization are critical for chatbots addressing substance use disorders and HIV prevention in vulnerable groups. Personalization and context-specific adaptation are key to their effectiveness and acceptability. These sensitivities are also critical in human-delivered interventions, where cultural biases and misunderstandings can hinder progress. Conversational agents offer advantages in accessibility, confidentiality, and tailored information [73]. They provide consistent care, infinite patience, and can bridge literacy gaps with anthropomorphic features. Chatbots can also address healthcare provider diversity disparities by matching racial features with patients, fostering trust. Additionally, as outlined by Han et al. [74], chatbots are perceived as less judgmental, creating a safe space for users with sensitive information, which is vital for managing conditions like PTSD. These advantages make chatbots a compelling complement to human healthcare, especially in behavior change and diverse population support.

 

Comment 15: You mention ethical concerns in a few places in the article but only really specify concerns on data privacy and security. Hudlicka (2016) identifies several ethical issues specific to relational agents: the right to keep your emotions to yourself, manipulating others' emotions, and virtual relationships. That work and other initiatives/discussions relating to ethical concerns with IVAs, such as emotional dependencies and the use of deception/false memories, are pointed to in Deborah Richards, Ravi Vythilingam, and Paul Formosa. 2023. A principlist-based study of the ethical design and acceptability of artificial social agents. Int. J. Hum.-Comput. Stud. 172, C (Apr 2023). https://doi.org/10.1016/j.ijhcs.2022.102980

Response 15: Thank you for your valuable comment. We expanded the ethical concerns paragraph in the Limitations subsection of the Discussion section, as follows (lines 695-706):

Ensuring data privacy and protecting user information against breaches are paramount. However, the ethical concerns surrounding relational agents like chatbots go beyond data privacy. Hudlicka [75] identifies key issues: affective privacy, emotion induction, and virtual relationships. Affective privacy relates to keeping thoughts and emotions private, raising questions about the extent of chatbot probing. Emotion induction refers to chatbots' potential to manipulate users' emotions, bringing up consent and impact concerns. Virtual relationships, where users bond with chatbots, also blur lines between human and artificial connections, leading to dependency worries. Richards [76] reinforces these concerns with survey data showing user discomfort with AI's handling of emotions and personal data. Respondents stressed the importance of transparency and user control. Developing ethical guidelines and frameworks, as well as implementing advanced security measures such as end-to-end encryption, are necessary to address these concerns.

 

Comment 16: I became quite confused in the Discussion as there seemed to be a significant amount of new information being discussed that did not come out of the scoping review. I think the discussion about "Natural Language Processing and Technical Aspects of Chatbots" should have been in the "background" section of the paper to set the scene for the reader.

Response 16: We appreciate your feedback. We have decided to move that subsection to the Introduction section, under the Background subsection. This way, the article follows a clearer logical thread, is easier to follow narratively, and provides the technical background needed to understand the subsequent sections.

 

Comment 17: Page 11 - rule-based reasoning is still part of the field of AI, so that section heading is somewhat misleading. While rules are often developed using knowledge engineering from a domain expert, rules and decision-trees can also be produced from machine-learning using algorithms such as C4.5. You need to rephrase "AI-powered chatbots" to reflect that you are specifically referring to chatbots using NLP and/or ML.

Response 17: Thank you for your insightful feedback. After careful consideration, we decided to remove the specified subsection as it was unnecessary and may cause confusion. We also agree that rule-based reasoning can indeed be part of the AI field. An AI-powered chatbot developed using NLP and ML architectures may still implement rule-based system reasoning for specific tasks.

 

Comment 18: I would like to know what AI technology is used in the 15 analysed chatbots - perhaps it can be added to one of the tables.

Response 18: Thank you for your valuable suggestion. In response, we have created a new table that specifically compares AI technology, chatbot protocol, usability, and engagement among the analyzed chatbots (Table 3). This addition aims to provide a clearer understanding of the technologies used and how they impact the user experience and engagement levels.

 

Comment 19: Section 4.2 "Users make selections to receive answers, but these chatbots are slower and less reliable" - what evidence is there for this statement? Why are they slower? They should be instantaneous and, if hand-crafted and validated by a health expert, more reliable than responses generated by AI.

Response 19: Thank you for your valuable feedback. We carefully reviewed the reference for these statements and conducted a further search for confirmation of the data. However, we did not find any other sources that corroborated the information provided by the original reference. Therefore, for better transparency and clarity, we have decided to remove this statement from the review.

Reviewer 3 Report

Comments and Suggestions for Authors

General comments

=============

Thank you for submitting your manuscript entitled "AI Chatbots for Mental Health: A Scoping Review of Effectiveness, Feasibility, and Applications." The manuscript is well-structured and articulately written, making for a compelling read. However, there are several areas within the manuscript that require extensive revisions to improve clarity, depth, and scholarly rigor.

 

Specific comments

=============

Major comments

---------------------

[Whole Manuscript]

Term Clarifications: There is a recurring use of terms "AI chatbots" and "chatbots" interchangeably. It is critical to define and distinguish these terms clearly, particularly in instances where non-AI chatbots are included in the study scope.

 

[Introduction]

Explain Cyber Health Psychology: The term "Cyber Health Psychology" appears in line 54 but is not adequately explained. Providing a definition and its relevance to the study will help in setting the context for readers unfamiliar with the term.

Background on Chatbots in Medicine: Before delving into mental health applications, provide a broader overview of AI and non-AI chatbots in various medical contexts. This will give readers a fuller understanding of the landscape into which mental health applications fit.

Background on Non-AI Chatbots for Mental Health: Include a section detailing the use and impact of non-AI chatbots in mental health to contrast with AI-based approaches and highlight the transition from non-AI to AI solutions in this field.

State of the Art: Before stating the research question, summarize what is currently known and what gaps your study aims to fill based on existing literature.

 

[Methods]

Clarification in Figure 1: The flow diagram (Figure 1) shows that out of 42 reports assessed for eligibility, only 15 were included after 24 research reports were excluded. Please explain the discrepancy to maintain transparency and reproducibility.

 

[Results]

Details of Study Types: Early in the results section, specify the types of studies included (e.g., randomized controlled trials, observational studies). This helps in setting the context for the findings presented.

Summarization in Table 1: Enhance Table 1 by summarizing the types of studies included, their key findings, and the number of participants. This provides a quick reference and overview for readers.

Focus on Generative AI: Explicitly mention which of the included studies specifically focused on generative AI technologies, as this is central to the manuscript’s theme.

 

[Discussion]

Refocus the Discussion: The current discussion tends to veer towards a general review of external studies rather than focusing on the findings from the included studies. I recommend moving the general AI discussion to the introduction and refocusing the discussion on directly interpreting and contextualizing the results of the included studies.

These suggestions aim to enhance the manuscript's contribution to the field by ensuring clarity, depth, and a focused narrative that aligns with the stated objectives of the study.

Author Response

General comments

 

=============

 

Comment 1: Thank you for submitting your manuscript entitled "AI Chatbots for Mental Health: A Scoping Review of Effectiveness, Feasibility, and Applications." The manuscript is well-structured and articulately written, making for a compelling read. However, there are several areas within the manuscript that require extensive revisions to improve clarity, depth, and scholarly rigor.

Response 1: We thank the reviewer for the positive and constructive feedback. We have revised and improved the main document based on these suggestions.

 

Specific comments

 

=============

 

Major comments

 

---------------------

 

[Whole Manuscript]

 

Comment 2: Term Clarifications: There is a recurring use of terms "AI chatbots" and "chatbots" interchangeably. It is critical to define and distinguish these terms clearly, particularly in instances where non-AI chatbots are included in the study scope.

Response 2: Thank you for your valuable feedback. After a thorough review of the article and in consideration of another peer reviewer’s comment, we have decided to remove the subsection that included "non-AI powered chatbots" from the Discussion section. This adjustment ensures that the entire paper is now exclusively focused on AI chatbots, thereby eliminating any potential for confusion. The terms "AI chatbots" and "chatbots" are now consistently used to refer specifically to chatbots powered by AI technologies.

 

 

[Introduction]

 

Comment 3: Explain Cyber Health Psychology: The term "Cyber Health Psychology" appears in line 54 but is not adequately explained. Providing a definition and its relevance to the study will help in setting the context for readers unfamiliar with the term.

Response 3: Thank you for your insightful comment. We have added a full definition and description of Cyber Health Psychology to provide clarity and context for readers unfamiliar with the term. Here is the new paragraph added to the manuscript (lines 54-62):

 

"The rise of Cyber Health Psychology has significantly transformed mental health support. This is an interdisciplinary field that explores the intersection of psychology, health, and digital technology. It focuses on understanding how digital tools and online platforms can influence health behaviors, mental well-being, and healthcare practices. This field examines the psychological impacts of using health-related technologies, such as mobile health apps, telemedicine, and online health communities, and seeks to develop digital interventions to promote healthy behaviors and improve mental health outcomes. Cyber Health Psychology aims to enhance healthcare delivery and patient engagement in the digital age [10,11]."

 

Comment 4: Background on Chatbots in Medicine: Before delving into mental health applications, provide a broader overview of AI and non-AI chatbots in various medical contexts. This will give readers a fuller understanding of the landscape into which mental health applications fit.

Response 4: Thank you for your valuable feedback. In response to your suggestion, we have added a broader overview of AI chatbots used in various medical contexts in the Introduction section of the paper. This addition aims to provide a fuller understanding of the landscape in which mental health applications of chatbots fit. We have specifically focused on AI-powered chatbots to avoid any possible confusion and to maintain the focus on AI chatbots, excluding non-AI chatbots from the discussion.

 

Here is the added paragraph (lines 79-93):

 

AI-powered chatbots have evolved from simple rule-based only systems to advanced models using natural language processing (NLP) [22,23]. They show great potential in medical contexts, offering personalized, on-demand health promotion interventions [24,25]. These chatbots mimic human interaction through written, oral, and visual communication, providing accessible health information and services. Over the past decade, research has assessed their feasibility and efficacy, particularly in improving mental health outcomes [26]. Systematic reviews have evaluated their effectiveness, feasibility in healthcare settings, and technical architectures in chronic conditions [26]. Recent studies focus on using AI chatbots for health behavior changes like physical activity, diet, and weight management [26]. Integrated into devices like robots, smartphones, and computers, they support behavioral outcomes such as smoking cessation and treatment adherence [26]. Additionally, AI chatbots aid in patient communication, diagnosis support, and other medical tasks, with studies discussing their benefits, limitations, and future directions [25]. Their potential uses include mental health self-care and health literacy education [14,25,27,28].

 

Comment 5: Background on Non-AI Chatbots for Mental Health: Include a section detailing the use and impact of non-AI chatbots in mental health to contrast with AI-based approaches and highlight the transition from non-AI to AI solutions in this field.

Response 5: Thank you for your insightful suggestion regarding the inclusion of a section on non-AI chatbots in mental health. After careful consideration and thorough discussions with the co-authors, we have decided not to incorporate this addition. Our primary concern is that adding an extensive section on non-AI chatbots could risk making the paper excessively long. Furthermore, our aim is to maintain a clear and focused scope by concentrating solely on AI chatbots. Including a detailed discussion on non-AI chatbots could potentially dilute the main topic and make the paper less coherent and readable. We believe that keeping a sharp focus on AI chatbots will ensure a more linear and engaging narrative. We appreciate your understanding and hope this explanation clarifies our decision. 

 

Comment 6: State of the Art: Before stating the research question, summarize what is currently known and what gaps your study aims to fill based on existing literature.

Response 6: Thank you for your valuable feedback. In response to your suggestion, we have expanded the Introduction section by integrating the subsections 1.1 Technical Background and 1.1.1 Natural Language Processing and Detailed Aspects of AI Chatbots. These subsections now provide an in-depth explanation of the state of the art of AI chatbots.

 

Additionally, we have rewritten the Aim of the Study subsection at the end of the Introduction to better specify the gap our study addresses in the existing literature. The revised aim is as follows (lines 194-206):

 

"This study aims to conduct a detailed scoping review to address a critical gap in the existing literature on the effectiveness and feasibility of AI chatbots in the treatment of mental health disorders. Despite the growing prevalence of mental health issues and the global shortage of mental health professionals, the potential of AI-powered chatbots as a scalable and accessible solution remains underexplored. This study seeks to fill this significant void by evaluating the current state of research on the effectiveness of AI chatbots in improving mental and emotional well-being, as well as their ability to address specific mental health conditions. Additionally, it will assess the feasibility of AI chatbots in terms of acceptability, usability, and adoption by both users and mental health professionals. By addressing these critical gaps, this study will contribute to a deeper understanding of the potential of AI chatbots as a viable and scalable solution to the growing mental health crisis, informing the development and implementation of more effective and accessible mental health interventions."

 

[Methods]

 

Comment 7: Clarification in Figure 1: The flow diagram (Figure 1) shows that out of 42 reports assessed for eligibility, only 15 were included after 24 research reports were excluded. Please explain the discrepancy to maintain transparency and reproducibility.

Response 7: Thank you for pointing out the discrepancy in Figure 1 regarding the flow diagram. We have reviewed and addressed this issue, and it was indeed a counting and calculation mistake. We have corrected the numbers to ensure accuracy and maintain transparency and reproducibility in our methodology. The updated flow diagram now accurately reflects that out of the 42 reports assessed for eligibility, 27 were excluded, and 15 were included in the study. We apologize for any confusion this error may have caused and appreciate your attention to detail.

 

[Results]

 

Comment 8: Details of Study Types: Early in the results section, specify the types of studies included (e.g., randomized controlled trials, observational studies). This helps in setting the context for the findings presented.

Response 8: Thank you for your valuable feedback. We have addressed your comment by specifying the types of studies included in our review. In subsection 3.1 "Characteristics of Included Studies," we added the following sentence to provide details about the study types (lines 302-305):

"The methodologies employed in these studies included randomized controlled trials, cluster-controlled trials, open-label randomized studies, 8-week usability studies, pragmatic multicenter randomized controlled trials, pilot randomized controlled trials, and beta testing mixed methods studies."



Comment 9: Summarization in Table 1: Enhance Table 1 by summarizing the types of studies included, their key findings, and the number of participants. This provides a quick reference and overview for readers.

Response 9: Thank you for your constructive feedback. We have addressed your comment by summarizing the types of studies included, their key findings, and the number of participants (Table 1). The table provides a comprehensive and quick reference for readers, ensuring that detailed information is easily accessible.



Comment 10: Focus on Generative AI: Explicitly mention which of the included studies specifically focused on generative AI technologies, as this is central to the manuscript’s theme.

Response 10: Thank you for your insightful feedback. We acknowledge the importance of addressing the role of generative AI in our manuscript. Although numerous included studies were very recent (2023, 2024), none of the analyzed AI chatbots focused on generative AI. Instead, they mostly utilized proprietary or third-party architectures based on classic modern NLP and Machine Learning processes. 

Consequently, we did not include or consider generative AI models in our analysis. We chose to discuss generative AI and Large Language Models briefly in subsection 1.1.1, where we explored the technical aspects of AI chatbots in general. This approach ensures clarity about the scope of our review and the technologies covered.

 

[Discussion]

 

Comment 11: Refocus the Discussion: The current discussion tends to veer towards a general review of external studies rather than focusing on the findings from the included studies. I recommend moving the general AI discussion to the introduction and refocusing the discussion on directly interpreting and contextualizing the results of the included studies.

These suggestions aim to enhance the manuscript's contribution to the field by ensuring clarity, depth, and a focused narrative that aligns with the stated objectives of the study.

Response 11: Thank you for your valuable feedback. We agree with your suggestion and have made the necessary changes. We have moved the general AI discussion to the introduction section to provide a clear background. The discussion section has been refocused to directly interpret and contextualize the results of the included studies.

Reviewer 4 Report

Comments and Suggestions for Authors

The work presents a review of works that use AI chatbots in mental health monitoring.

The work is well structured and written. The references are current and cover various works related to mental health and its applications in the context of chatbots.

In my opinion, the ethical and moral issues of using this type of application in the treatment and monitoring of these diseases, and the impact of replacing the human factor with a bot, should be emphasized. These issues are briefly mentioned in point 4.3, but I think they should be more prominent in this type of article.

Author Response

The work presents a review of works that use AI chatbots in mental health monitoring.

Comment 1: The work is well structured and written. The references are current and cover various works related to mental health and its applications in the context of chatbots.

Response 1: We thank the reviewer for the kind words and feedback.

 

Comment 2: In my opinion, the ethical and moral issues of using this type of application in the treatment and monitoring of these diseases, and the impact of replacing the human factor with a bot, should be emphasized. These issues are briefly mentioned in point 4.3, but I think they should be more prominent in this type of article.

Response 2: Thank you for your kind words and your valuable comments. We expanded the ethical concerns paragraph in the Limitations subsection of the Discussion section, as follows (lines 695-706):


Ensuring data privacy and protecting user information against breaches are paramount. However, the ethical concerns surrounding relational agents like chatbots go beyond data privacy. Hudlicka [75] identifies key issues: affective privacy, emotion induction, and virtual relationships. Affective privacy relates to keeping thoughts and emotions private, raising questions about the extent of chatbot probing. Emotion induction refers to chatbots' potential to manipulate users' emotions, bringing up consent and impact concerns. Virtual relationships, where users bond with chatbots, also blur lines between human and artificial connections, leading to dependency worries. Richards [76] reinforces these concerns with survey data showing user discomfort with AI's handling of emotions and personal data. Respondents stressed the importance of transparency and user control. Developing ethical guidelines and frameworks, as well as implementing advanced security measures such as end-to-end encryption, are necessary to address these concerns.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Thank you for your extensive revisions to the paper and detailed responses to comments. I look forward to being able to share this article with my students and others.

Reviewer 3 Report

Comments and Suggestions for Authors

Thank you for submitting your manuscript entitled "AI Chatbots for Mental Health: A Scoping Review of Effectiveness, Feasibility, and Applications." The manuscript is well-structured and articulately written, making for a compelling read. Almost all responses were reasonable.
