Article

Harnessing the Power of ChatGPT for Automating Systematic Review Process: Methodology, Case Study, Limitations, and Future Directions

1 Department of Civil and Environmental Engineering, FAMU-FSU College of Engineering, Florida State University, Tallahassee, FL 32013, USA
2 Department of Civil and Environmental Engineering, FAMU-FSU College of Engineering, Florida A&M University, Tallahassee, FL 32013, USA
3 Department of Building and Real Estate, Faculty of Construction and Environment, The Hong Kong Polytechnic University, Kowloon TU428, Hong Kong
4 Public Works Department, Geomatics Lab, Faculty of Engineering, Cairo University, Giza 12613, Egypt
5 Department of Industrial and System Engineering, The Hong Kong Polytechnic University, Hung Hom TU428, Hong Kong
* Authors to whom correspondence should be addressed.
Systems 2023, 11(7), 351; https://doi.org/10.3390/systems11070351
Submission received: 8 June 2023 / Revised: 4 July 2023 / Accepted: 7 July 2023 / Published: 9 July 2023
(This article belongs to the Special Issue Human–AI Teaming: Synergy, Decision-Making and Interdependency)

Abstract:
Systematic reviews (SR) are crucial in synthesizing and analyzing existing scientific literature to inform evidence-based decision-making. However, traditional SR methods often have limitations, including a lack of automation and decision support, resulting in time-consuming and error-prone reviews. To address these limitations and drive the field forward, we harness the power of the revolutionary language model, ChatGPT, which has demonstrated remarkable capabilities in various scientific writing tasks. By utilizing ChatGPT’s natural language processing abilities, our objective is to automate and streamline the steps involved in traditional SR, explicitly focusing on literature search, screening, data extraction, and content analysis. Therefore, our methodology comprises four modules: (1) Preparation of Boolean research terms and article collection, (2) Abstract screening and article categorization, (3) Full-text filtering and information extraction, and (4) Content analysis to identify trends, challenges, gaps, and proposed solutions. Throughout each step, our focus has been on providing quantitative analyses to strengthen the robustness of the review process. To illustrate the practical application of our method, we have chosen the topic of IoT applications in water and wastewater management and quality monitoring due to its critical importance and the dearth of comprehensive reviews in this field. The findings demonstrate the potential of ChatGPT in bridging the gap between traditional SR methods and AI language models, resulting in enhanced efficiency and reliability of SR processes. Notably, ChatGPT exhibits exceptional performance in filtering and categorizing relevant articles, leading to significant time and effort savings. Our quantitative assessment reveals the following: (1) the overall accuracy of ChatGPT for article discarding and classification is 88%, and (2) the F1-scores of ChatGPT for article discarding and classification are 91% and 88%, respectively, compared to expert assessments. However, we identify limitations in its suitability for article extraction. Overall, this research contributes valuable insights to the field of SR, empowering researchers to conduct more comprehensive and reliable reviews while advancing knowledge and decision-making across various domains.

1. Introduction

Review articles serve various purposes within the academic literature and come in different types, including narrative reviews, systematic reviews (SRs), meta-analyses, scoping reviews, and integrative reviews [1]. Narrative reviews provide a broad overview and subjective analysis of existing literature [2], while SRs employ a thorough methodology to synthesize all relevant studies on a specific research question, ensuring objectivity and minimizing bias [3]. SRs offer several advantages, such as providing a reliable and comprehensive assessment of evidence, guiding evidence-based practice and policymaking, identifying research gaps, and enhancing statistical power through meta-analysis [4,5]. SR articles are a valuable tool for synthesizing and analyzing research evidence in many fields, particularly those where the evidence base is constantly evolving, such as healthcare [6,7,8,9], project management [10,11,12,13], construction management [14,15,16,17,18,19], and aviation routing management [20,21]. To ensure that SRs are reported accurately and comprehensively, PRISMA (Preferred Reporting Items for SRs and Meta-Analyses) is widely used, and developing and executing a comprehensive search strategy is essential to conducting an SR using the PRISMA method.
The search strategy is vital in identifying relevant studies to be included in SRs. Such a strategy involves carefully selecting appropriate databases, applying pertinent Boolean research terms (BST) and keywords, and executing systematic searches to capture a comprehensive variety of evidence related to the research question [22]. In accordance with the PRISMA guidelines [23], inclusion and exclusion criteria are also crucial in the SR process. These predetermined criteria help assess the relevance of articles during the study selection phase, ensuring that the chosen studies align with the review’s objectives and provide pertinent information to address the research question. Furthermore, snowballing is mainly applied to identify additional relevant articles that may have been missed in the initial literature search. The snowballing process can be achieved by gathering articles from the reference (backward) and citation (forward) lists of included studies [24]. However, it is important to acknowledge the PRISMA method’s limitations, including potential reporting bias, the challenges of adapting to different review articles, human uncertainties in determining an article’s eligibility, and the time consumed in including and excluding articles from the database [25,26]. Despite these limitations, the SR process, PRISMA guidelines, and snowballing procedures all contribute significantly to evidence synthesis and knowledge advancement across various fields. With the continued advancement of AI-driven language and chatbot technologies, there is an increasing potential for automating the SR process through alternative methods. Leveraging these AI-powered tools offers opportunities to streamline the SR process, saving time and costs while addressing uncertainties arising from human responses. By exploring these possibilities, we can optimize workflows and enhance the overall efficiency of conducting SRs.
ChatGPT (Generative Pre-trained Transformer) has proven to be a valuable tool in various fields, including healthcare [27,28,29], education [30,31,32,33], construction management [34,35], and scientific writing [36,37,38]. Within scientific writing, ChatGPT has proven its efficacy in generating abstracts, introductions, and research article summaries, while also assisting with SR processes by extracting relevant information and providing concise summaries [39,40]. Its capabilities as a powerful language model extend beyond simple language generation, offering valuable suggestions for structuring articles, enhancing clarity, and ensuring a logical flow [41]. Collaborating with ChatGPT empowers researchers to outline different manuscript sections, including the introduction, methods, results, and discussion, facilitating comprehensive and cohesive narratives [42]. Furthermore, ChatGPT’s role extends to the editing and proofreading stages of scientific writing, serving as a meticulous grammar and language checker that helps adhere to the required style and formatting guidelines [43]. However, it is essential to recognize that while ChatGPT provides indispensable support, its usage should complement human expertise. Researchers must critically evaluate the model’s outputs, thoroughly verify information, and ensure the accuracy and reliability of the generated content [44]. By combining the capabilities of ChatGPT with human insight, researchers can significantly enhance the efficiency, productivity, and overall quality of their research and scientific writing endeavors.
Despite the capabilities of ChatGPT in various aspects of scientific writing, no previous research has focused on automating the SR process by leveraging the power of ChatGPT. However, a recent study by Qureshi [45] raised important questions about the possibilities of ChatGPT in automating the SR process. It is worth mentioning that this study [45] only raised the question and discussed ChatGPT’s capabilities in the SR process; it did not introduce a practical implementation of how this can be done by leveraging ChatGPT. While acknowledging the outstanding capabilities of ChatGPT in automating the SR process, the study [45] recommended further research to investigate its limitations and capacities. Therefore, our paper aims to bridge this gap by harnessing the power of ChatGPT to introduce a practical implementation of the automated SR process. Our main focus is on streamlining the traditional SR process and introducing practical implementations of ChatGPT at different stages of that process.
In order to showcase the practical implementation of our methodology, we delve into the extensive domain of Internet of Things (IoT) applications pertaining to water and wastewater management, as well as water quality monitoring. This subject holds significant importance due to the transformative impact of IoT in these particular domains. By undertaking this exploration, we contribute to the automation of the systematic review (SR) process, which can be applicable to various research fields, and provide valuable insights into the current state of IoT technologies in these critical areas.
Our approach encompasses a series of well-designed steps, commencing with a comprehensive and systematic search across relevant databases. Subsequently, we employ stringent filtering and extraction techniques to extract the most pertinent information from the collected literature. This is followed by a thorough content analysis of the selected studies, enabling us to unveil patterns, identify emerging trends, and gain a holistic understanding of the overall landscape regarding IoT applications in water management and water quality monitoring. By harnessing the capabilities of ChatGPT technology, we can leverage its natural language processing capabilities to streamline the analysis process and unveil concealed connections within the research corpus.
It is important to emphasize that while this paper outlines the methodology for conducting an SR, it does not delve into the specific findings regarding IoT applications in water management and water quality monitoring. Instead, the findings will be meticulously documented and published separately, allowing for a comprehensive exploration of this dynamic and critical area. The detailed objectives of the study can be summarized in the following points:
To investigate the potential of ChatGPT in generating relevant keywords and phrases for literature search in water and wastewater management applications and water quality monitoring.
To compare the accuracy and efficiency of utilizing ChatGPT for screening and filtering studies to be included in an SR, in contrast to conventional methods.
To assess the completeness and accuracy of employing ChatGPT in extracting and synthesizing information from abstracts and full-text articles of the selected studies.
To compare the quality and rigor of the SR process when utilizing ChatGPT against traditional SR methods. This comparison will consider various metrics, including reproducibility, bias, and transparency.
To provide comprehensive guidance on the best practices for integrating ChatGPT into the methodology of SRs specifically focused on water and wastewater management.
To fulfil the objectives of this study, a novel methodology is devised to integrate ChatGPT into the SR procedure, and its performance is compared against traditional SR approaches. This paper makes a valuable contribution to the existing body of knowledge on utilizing artificial intelligence (AI) in advancing SR methodologies by presenting an innovative approach that leverages ChatGPT (based on the GPT-3.5 architecture model) to enhance the overall process. The proposed methodology is employed to conduct an SR article focusing on IoT applications in water and wastewater management. Furthermore, the implications and limitations of this methodology for future research endeavors in the field are thoroughly examined and discussed.

2. Research Methodology

2.1. Exploring ChatGPT: Characteristics and Interactions

ChatGPT is a powerful language model that is specifically designed to facilitate interactive conversations and simulate human-like dialogue. It is built upon the foundation of GPT-3.5, an advanced variant of the GPT-3 model developed by OpenAI. ChatGPT leverages the enhancements and refinements introduced in GPT-3.5, which include improved natural language understanding, longer consecutive output, and better adherence to instructions. By utilizing ChatGPT, our study benefits from its ability to retain context from previous interactions, allowing for more coherent and context-aware responses. This feature enables ChatGPT to generate high-quality and engaging conversational experiences, making it an ideal choice for chat-based applications and conversational agents. Furthermore, ChatGPT based on GPT-3.5 offers advanced natural language processing capabilities, enabling it to perform tasks such as summarization, question answering, and handling large datasets with enhanced accuracy and relevance. Generally, GPT is a general-purpose language model developed by OpenAI, while ChatGPT is a variant of GPT specifically designed for conversational interactions.
In the proposed methodology, we adopted an interactive approach by engaging in conversations with ChatGPT. To ensure effective interaction, we carefully prepared prompts that prompted ChatGPT to generate responses in a conversational manner. Notably, we made a deliberate decision to retain the conversation history throughout the interaction. By intentionally preserving the dialogue context and not clearing the conversation history before generating new responses, we observed a significant improvement in the learning and performance of ChatGPT. Retaining the conversation history allows ChatGPT to maintain a contextual understanding of the ongoing conversation, resulting in more coherent and relevant responses. This approach enables ChatGPT to effectively build upon the previous exchanges, consider the entirety of the conversation’s context, and provide responses that are not only accurate but also contextually appropriate. By leveraging the full conversational context, our methodology harnesses the true potential of ChatGPT based on GPT-3.5 and enhances the overall quality of the interactive experience.
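For readers who wish to reproduce this interaction pattern programmatically rather than through the ChatGPT web interface, the following minimal Python sketch illustrates how retaining the full message history lets each new prompt build on earlier exchanges. It assumes the OpenAI Chat Completions API (openai package < 1.0) and a placeholder API key; it is an illustrative sketch, not the setup used in this study.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# The history list is never cleared, so every new prompt is answered
# in the context of all earlier exchanges.
history = [{"role": "system", "content": "You are assisting with a systematic review."}]

def ask(prompt: str) -> str:
    history.append({"role": "user", "content": prompt})
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=history,
        temperature=0,
    )
    answer = response["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer

# Gradual initialization: general questions first, then more specific ones.
ask("What is the Internet of Things (IoT)?")
ask("How is IoT applied in water and wastewater infrastructure management?")
print(ask("Suggest Boolean research terms for a Scopus search on IoT applications "
          "in water and wastewater management and water quality monitoring."))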

2.2. Automation of SR Process Using ChatGPT

This study utilized a mixed-methods research design, combining ChatGPT, an AI-driven language model, with traditional SR methods to automate and streamline the review process while enhancing its efficiency and reliability. By bridging the gap between traditional SR methods and AI language models, this approach facilitated a comprehensive exploration of the research topic through qualitative and quantitative analyses. Qualitative analysis identified trends, challenges, gaps, and recommendations within selected studies, while quantitative analysis evaluated ChatGPT’s performance compared to expert assessments. This methodology involved iterative stages depicted in Figure 1, where ChatGPT automated specific tasks while ensuring result accuracy and reliability through human oversight. These stages encompassed extracting research questions, generating Boolean research terms (BSTs), filtering publications based on abstracts, conducting full-text filtration and information extraction, and performing comprehensive content analysis. The following subsections provide a comprehensive and detailed description of the proposed methodology, encompassing each stage of the automation process.

2.2.1. Initialization, Extraction of Research Words and Article Records

The methodology for automating the SR process involves the following procedures. Firstly, a suitable database is chosen as the primary source of information. A crucial step in commencing the SR involves identifying and including pertinent articles addressing the research questions. To facilitate this process, it is imperative to generate BSTs capable of effectively searching diverse databases, such as Scopus, Google Scholar, or Web of Science. To enhance the quality of responses from ChatGPT, which utilizes reinforcement learning [45], we implemented a strategy of gradual input of questions: general questions about the research topic are posed first, followed by more specific inquiries. This approach facilitates a progressive refinement of ChatGPT’s understanding and enables the generation of accurate responses. Following the initialization process, ChatGPT is informed about the objective of conducting an SR within a specific research area. ChatGPT leverages this information to generate search terms or BSTs tailored to the selected database. These BSTs are designed to refine the search and include relevant keywords associated with the research topic. It is important to note that while ChatGPT streamlines the search process, manual searching remains necessary to account for potential formatting inconsistencies or limitations, ensuring the accurate retrieval of relevant articles. This manual search complements the automated search process and serves to validate the results obtained from ChatGPT.
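Where programmatic database access is available, the manual search can also be supported by submitting the ChatGPT-generated BSTs directly to the database API. The sketch below assumes the Elsevier Scopus Search API and a valid API key (placeholders); this is an illustrative alternative, not the export workflow used in this study.

import requests

API_KEY = "YOUR_ELSEVIER_API_KEY"  # placeholder
QUERY = ('TITLE-ABS-KEY(("internet of things" OR "IoT") AND '
         '("water" OR "wastewater") AND ("infrastructure" OR "infrastructures"))')

response = requests.get(
    "https://api.elsevier.com/content/search/scopus",
    headers={"X-ELS-APIKey": API_KEY, "Accept": "application/json"},
    params={"query": QUERY, "count": 25},
)
response.raise_for_status()

# Print title and DOI for each returned record to cross-check against
# the references suggested by ChatGPT.
for entry in response.json()["search-results"]["entry"]:
    print(entry.get("dc:title"), "|", entry.get("prism:doi"))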
To evaluate ChatGPT’s proficiency in keyword extraction, it is assigned the task of identifying frequently used keywords based on the BSTs employed for publication extraction. The extracted keywords are then compared with keywords obtained from established software tools (e.g., VOSviewer software) for validation and analysis. This comparative analysis facilitates the assessment of the degree of overlap and potential differences in the extracted keywords, ensuring the reliability of the keyword extraction process.
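One simple way to quantify this overlap is a case-insensitive set comparison between the two keyword lists, as in the sketch below. The keyword lists shown are hypothetical placeholders, and the union-based similarity percentage is only one possible definition of agreement.

def keyword_overlap(chatgpt_keywords, vosviewer_keywords):
    # Normalize to lowercase so formatting differences do not count as mismatches.
    gpt = {k.strip().lower() for k in chatgpt_keywords}
    vos = {k.strip().lower() for k in vosviewer_keywords}
    shared = gpt & vos
    return {
        "similarity_pct": 100 * len(shared) / len(gpt | vos),
        "only_chatgpt": sorted(gpt - vos),
        "only_vosviewer": sorted(vos - gpt),
    }

chatgpt_kw = ["Internet of Things", "water quality", "wireless sensor networks"]
vosviewer_kw = ["internet of things", "water quality monitoring", "wireless sensor networks"]
print(keyword_overlap(chatgpt_kw, vosviewer_kw))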

2.2.2. Article Filtration Using Titles and Abstracts

Traditionally, the initial filtration of articles in the SR process involves manual investigation of abstracts, which is time-consuming and prone to human error. To overcome these challenges, an alternative approach is employed in which ChatGPT performs the filtration process. Initially, broad categories of interest are identified based on an analysis of research trends in the field. These categories are selected to encompass the key focus areas and ensure that the filtration process targets the most relevant articles within those domains. To better elaborate on the capabilities of ChatGPT, the problem is restructured as a classification task, where ChatGPT is assigned the responsibility of categorizing articles into specific predefined categories. In cases where an article does not fit into any of these categories, ChatGPT should classify it as irrelevant, under the “not related” category. To assess the classification abilities of ChatGPT across various input scenarios, two task scenarios are conducted. In the first scenario (i.e., ChatGPT (APA)), ChatGPT is provided with only the article APA reference as input, while in the second scenario (i.e., ChatGPT (APA + Abstract)), both the article APA reference and abstract are included as input. By comparing the results of these two scenarios, we are able to examine how the inclusion of supplementary information affects the classification accuracy, enabling a comprehensive evaluation of ChatGPT’s performance with different input levels and allowing us to determine the most suitable methodology for automating the initial article filtration process.
As the classification of articles utilizing ChatGPT represents a novel approach, it is of utmost importance to establish a robust evaluation methodology that can accurately assess its performance. Recognizing this, we carried out a comprehensive evaluation process incorporating the opinions and expertise of expert volunteers to provide a reliable assessment. These volunteers, consisting of researchers and engineers with varying levels of expertise in water and wastewater management, provided a benchmark against which ChatGPT’s classification outcomes were compared. The evaluation process incorporates human interpretation and contextual understanding, enriching the assessment with valuable feedback and insights. Expert volunteers are given a questionnaire containing article titles and abstracts to evaluate and classify. Transparency is a key aspect of our evaluation approach. To better evaluate the agreement between raters and to decrease human biases, we evaluate the inter-rater reliability of the volunteer responses using Cohen’s kappa [46]. Based on this analysis, we can estimate the consistency of classifications among volunteers and identify any unreliable raters. Raters with a low kappa value or a lack of agreement with other raters will be excluded from further analysis to ensure the accuracy and reliability of the process.
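Cohen’s kappa can be computed pairwise between raters using scikit-learn, as sketched below with hypothetical category codes (1–4); rater pairs with low kappa values flag candidates for exclusion.

from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings: each rater assigns a category code (1-4) per article.
ratings = {
    "rater_A": [1, 2, 4, 3, 1, 4],
    "rater_B": [1, 2, 4, 3, 2, 4],
    "rater_C": [3, 1, 2, 4, 3, 1],  # a poorly agreeing rater
}

for r1, r2 in combinations(ratings, 2):
    kappa = cohen_kappa_score(ratings[r1], ratings[r2])
    flag = "  <- low agreement" if kappa < 0.4 else ""
    print(f"{r1} vs {r2}: kappa = {kappa:.2f}{flag}")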
Furthermore, a confusion matrix will be constructed to assess the relationship between expert classification (i.e., benchmark) and ChatGPT’s classifications based on the two different scenarios. The confusion matrix is a widely used tool to evaluate the identification accuracy between actual and predicted values in classification tasks. It provides valuable insights into the precision and accuracy of the classification model [47]. The confusion matrix consists of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) values. The diagonal values of the matrix represent the correctly identified samples, while FP and FN represent incorrect predictions. As depicted in Figure 2, the confusion matrix will allow us to calculate various performance metrics such as precision, accuracy, and F1-score based on the TP, TN, FP, and FN values. Our evaluation will consider the expert classifications (i.e., benchmark) as true values and ChatGPT classifications as the predicted values.
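Treating the expert consensus as the true labels and ChatGPT’s output as the predictions, the confusion matrix and the derived per-class metrics can be obtained as in the sketch below (the labels are hypothetical examples, not the study data).

from sklearn.metrics import confusion_matrix, classification_report

labels = ["water management", "wastewater management", "water quality", "not related"]
expert  = ["water management", "water quality", "not related", "wastewater management",
           "water management", "not related"]
chatgpt = ["water management", "water quality", "not related", "water management",
           "water management", "not related"]

# Rows correspond to the expert (true) labels, columns to ChatGPT's predictions.
print(confusion_matrix(expert, chatgpt, labels=labels))
print(classification_report(expert, chatgpt, labels=labels, zero_division=0))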

2.2.3. Full-Text Filtration and Information Extraction

After the initial article filtration using titles and abstracts, a second round of article filtration is traditionally conducted to evaluate the suitability of the remaining articles for inclusion in the review and to extract valuable information from them. However, this manual reading process can be time-consuming. To address this challenge, an automated approach utilizing ChatGPT is employed for full-text filtering. The approach focuses on identifying sub-categories within each main category, enabling a targeted exploration of specific areas of interest and ensuring comprehensive coverage of diverse topics relevant to the review. Careful selection of these sub-categories serves two primary objectives: extracting relevant information for each sub-category and eliminating articles that do not align with the research goals. To automate the information extraction process, a prompt is designed to solicit ChatGPT’s recommendations for relevant questions related to each sub-category. ChatGPT’s responses help extract information from the articles and eliminate irrelevant studies. Accordingly, two task scenarios are conducted to evaluate ChatGPT’s efficacy in automating this process. The first scenario involves providing ChatGPT with only the article reference as input (i.e., ChatGPT (APA)), while in the second scenario, the input includes the article’s relevant sections, such as the abstract, methodology, and parts of the results and discussion. The length of the prompts is adjusted to balance obtaining reliable responses from ChatGPT and saving time.
It is important to highlight that in the second scenario, the relevant information in the articles includes data presented in tabular and figure formats, which constitute a significant amount of details influencing the quality of the extracted information. To address these limitations, we took measures to incorporate tabular information into the input provided to ChatGPT. This inclusion of structured data from tables aimed to enhance the model’s understanding and improve the accuracy of its responses. However, it is essential to acknowledge that models such as ChatGPT may not possess the specific capability to interpret visual data when it comes to extracting information from figures. Therefore, we recommend that researchers carefully analyze figures and rely on human interpretation to extract relevant information, particularly when the figures contain substantial and intricate content. By retaining control over full-text filtration and information extraction, researchers can ensure the accurate interpretation and the inclusion of important details from non-textual sources.
The evaluation process in this stage is subjective and cannot solely be relied on to assess ChatGPT’s performance in extracting information. To overcome this limitation, a collective approach is adopted. The authors collaboratively answer the questions posed to a subset of articles, following the conventional systematic review process. The agreement between the authors’ answers and ChatGPT’s responses indicates ChatGPT’s efficacy in comprehending and extracting information from the articles.

2.2.4. Content Analysis of the Extracted Information

The content analysis of the extracted information is a critical phase in SR methodology, which traditionally consumes a significant amount of time. This phase focuses on analyzing the content collected in the previous stages to identify patterns, extract key insights, and generate comprehensive data statistics. The primary objective is to facilitate a thorough discussion and evaluation of the research, including identifying research gaps and limitations in previous studies, ultimately leading to informed recommendations. To expedite this time-consuming process, ChatGPT is utilized for automating the content analysis, providing efficient analysis capabilities. It is important to emphasize that ChatGPT’s role is confined to analyzing the given information through text analysis of the questions and responses. The authors maintain complete control over the conversation, guiding ChatGPT using specific prompts tailored to the analysis objectives.
The evaluation of ChatGPT’s responses in this stage is subjective and relies on the expertise and judgment of the authors. While ChatGPT’s responses offer initial analysis, the authors critically evaluate and validate the generated content. The collected responses are then compiled and organized to facilitate structured data exploration, allowing for a rigorous examination of the insights derived from the extracted information. ChatGPT’s automated responses will serve as a valuable starting point for further exploration and examination. By incorporating ChatGPT to automate the content analysis process, the methodology aims to improve efficiency while preserving the authors’ control and oversight. This approach enables a streamlined analysis of the extracted information, leading to a comprehensive discussion, identifying research gaps, and formulating well-informed recommendations.

2.3. Case Study Selection

To demonstrate the effectiveness of our suggested SR approach, we have intentionally selected the topic of Internet of Things (IoT) applications in water and wastewater management and water quality monitoring. This topic holds immense significance due to the transformative impact of IoT in these domains. However, despite the growing importance and advancements of IoT technologies, there remains a lack of comprehensive reviews that delve into the intricacies of this specific domain [48,49,50,51]. Therefore, our research aims to contribute to the automation of the SR process by leveraging the power of ChatGPT to conduct an SR in the context of IoT applications in water and wastewater management. Furthermore, selecting this case study topic is well-aligned with the authors’ background, facilitating better oversight and validation of ChatGPT’s responses. This ensures the accuracy and reliability of all generated content.
It is worth noting that our case study concentrates on three specific subtopics within the broader domain of IoT applications in water and wastewater management: IoT-based water quality monitoring, IoT-based water infrastructure management, and IoT-based wastewater infrastructure management. These subtopics have been carefully chosen to comprehensively cover various aspects and applications of IoT technologies in water and wastewater management. Moreover, they allow for thorough testing of the proposed methodology through distinct and specific topics under the overarching theme of IoT application in infrastructure management. This comprehensive approach contributes to advancing the potential of ChatGPT as a tool for automating SR and understanding IoT applications in water and wastewater management.

3. Results and Discussion

This section endeavors to provide a thorough exposition of our methodology implementation within the context of the case study focusing on IoT applications in water and wastewater management alongside water quality monitoring. Furthermore, we will offer a detailed assessment of the performance and outcomes achieved by ChatGPT across various sections.

3.1. Research Word Generation, Article Extraction, and Keyword Retrieval

Figure 3 showcases the flowchart representing the initial phase of our methodology. For this study, we directed our attention toward the Scopus database as the primary source of information. To enhance the quality of responses from ChatGPT, we implemented a strategy of gradual input of questions. Practically, the interaction with ChatGPT was initiated by posing general questions pertaining to the research topic. These initial inquiries served as a foundation for further exploration and understanding. Subsequently, we transitioned to more targeted and specific questions, delving into various aspects, such as the definition of IoT, civil infrastructures, and the intersection of infrastructure management with IoT applications in water and wastewater management. A compilation of the questions employed during the initialization phase can be found in Table 1.
Furthermore, additional questions were posed for a comprehensive understanding of ChatGPT’s capabilities, and the corresponding responses provided by ChatGPT are displayed in Figures S1–S7. This gradual approach empowered ChatGPT to generate responses that were well-informed, contextually relevant, and increasingly refined as we progressed through our SR methodology.
Upon completing the initialization process, we apprised ChatGPT of our intention to conduct an SR focusing on “IoT applications in water and wastewater management and water quality monitoring”. Surprisingly, ChatGPT generated BSTs tailored to the Scopus database, as depicted in Figure 4a, a noteworthy outcome. This successful generation of BSTs highlights the potential of ChatGPT in assisting with the literature search process. Moving forward, we refined the inclusion and exclusion of articles by instructing ChatGPT to generate BSTs that constrained the search to English-language journal articles and conference papers published between 2010 and 2022, as demonstrated in Figure 4b. Furthermore, Figure 4c shows an additional request to ensure that the BSTs covered publications with the search terms present in their titles, abstracts, or keywords. Following these gradual iterations of refinement, the final set of BSTs was obtained, as follows: “TITLE-ABS-KEY((“internet of things” or “IoT”) AND (“water” OR “wastewater” OR “sewage” OR “sanitation”) AND (“infrastructure” OR “infrastructures”)) AND (LIMIT-TO (DOCTYPE, “ar”) OR LIMIT-TO (DOCTYPE, “cp”)) AND (PUBYEAR > 2009 AND PUBYEAR < 2023)”. However, it is essential to note that despite ChatGPT’s assistance in generating the BSTs (refer to Figure S6), we encountered inconsistencies in the formatting of the references associated with these publications, indicating challenges in the extraction process. These findings corroborate a previous study [52] that documented similar issues encountered by ChatGPT models in reference extraction.
Consequently, we resorted to manual searching on Scopus in order to ensure the accurate retrieval of relevant articles. Table 2 provides examples of ChatGPT’s responses, illustrating errors in the DOI, publication title, or both. For additional instances of references generated by ChatGPT, please refer to Figure S7. Following the extraction of all relevant articles from Scopus, our focus shifted towards evaluating the proficiency of ChatGPT in retrieving keywords as part of the SR process. To assess this, we assigned ChatGPT the task of identifying the top 50 frequently used keywords based on the BSTs employed for publication extraction, as illustrated in Figure 5. The effectiveness of ChatGPT’s keyword extraction was then evaluated through a comparative analysis with VOSviewer software (1.6.19), a widely used tool for visualizing and analyzing bibliographic data. By comparing the keywords extracted by ChatGPT with those obtained from VOSviewer, we sought to assess the degree of overlap and potential differences in the extracted keywords.
Table 3 presents the similarity percentage between the keywords obtained from ChatGPT and VOSviewer for different numbers of keywords considered. This comparative analysis allowed us to gauge the level of agreement between ChatGPT’s keyword extraction and the results generated by VOSviewer. While our findings indicated a certain level of agreement between the keywords extracted by ChatGPT and those obtained from VOSviewer, we also observed some notable differences (refer to Table 3). Specific unique keywords surfaced in VOSviewer that ChatGPT did not identify, and vice versa. These differences indicate the weaker performance of AI-powered keyword extraction compared to traditional software tools. The presence of unique keywords exclusively identified by VOSviewer suggests that ChatGPT achieved only partial success in extracting the keywords. Therefore, it is recommended to rely on alternative methods, such as Scopus or VOSviewer, for a more reliable approach. Such differences in the extracted keywords can be attributed to factors such as the training data, biases, and algorithmic limitations, which can impact the effectiveness and accuracy of AI-powered keyword extraction. Further research in this area would shed light on the strengths and weaknesses of AI models such as ChatGPT and inform the future refinement and improvement of keyword extraction techniques. The compilation and summary of the unique keywords obtained from both ChatGPT and VOSviewer are provided in Table S1, providing a comprehensive overview of the extracted terms from different perspectives.
In summary, the initial phase of our methodology revealed the considerable capability of ChatGPT in generating pertinent BSTs for retrieving relevant articles and its limited capability in extracting keywords. Regrettably, ChatGPT was unable to autonomously extract relevant articles without human guidance. These preliminary findings lay a solid foundation for the subsequent stages of our methodology, which will primarily concentrate on the accurate filtering and categorization of the extracted articles in order to enhance the depth and comprehensiveness of the SR process. In the next phase, we will explore how ChatGPT can filter and categorize the articles extracted in phase one.

3.2. First-Round Article Classification and Filtration (Title and Abstract)

A total of 496 English-language journal articles and conference proceedings relevant to the research topic were retrieved from the Scopus database using BSTs suggested by ChatGPT. Figure 6 shows the flowchart for filtering and categorizing articles in the first part and for full-text filtration and information extraction from related articles in the second part.
Initially, we identified three broad categories of interest based on our comprehensive analysis of research trends in the field: IoT-based water infrastructure management, IoT-based wastewater infrastructure management, and IoT-based water quality monitoring. These categories were selected to encompass the key focus areas in our research and ensure that the filtration process targeted the most relevant articles within these specific domains.
To better elaborate on the capabilities of ChatGPT, we transformed the task into a classification problem, where ChatGPT was asked to assign articles to one of four distinct categories: water management, wastewater management, water quality, or unrelated. To facilitate this classification process, we requested ChatGPT to generate definitions for each of the four categories, as depicted in Figure 7. ChatGPT responded by generating precise definitions for each category, which would subsequently serve as guiding principles for categorizing articles (see Figure 7). By incorporating these guidelines, we aimed to enhance the accuracy and consistency of ChatGPT’s classification outputs, thus optimizing the subsequent stages of our methodology.
We evaluated the classification/discarding performance of ChatGPT in two distinct scenarios by comparing its performance to the human experts’ evaluations. This task was executed by carefully crafting prompts for ChatGPT and ensuring that each prompt contained 10 articles at a time, provided as APA references. By limiting the number of articles in each prompt, we aimed to balance information comprehensiveness and manageable input sizes for ChatGPT. Moreover, we imposed specific constraints during the classification process to maintain consistency and control. These constraints encompassed categorizing articles exclusively into the predefined four categories, refraining from making assumptions, focusing on articles directly related to the three main categories of interest, and presenting the classification results in a structured tabular format.
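As an illustration of this batching step, the sketch below builds prompts of ten APA references each with the above constraints embedded; the template wording is an assumption for illustration, not the exact prompt used in the study.

CATEGORIES = ["water management", "wastewater management", "water quality", "not related"]

def build_classification_prompts(apa_references, batch_size=10):
    instructions = (
        "Classify each article below into one of these categories: "
        + ", ".join(CATEGORIES) + ". Do not make assumptions beyond the given "
        "reference, focus only on the three main categories of interest, and "
        "present the results as a table with an 'x' marking the assigned "
        "category and a brief justification."
    )
    prompts = []
    for start in range(0, len(apa_references), batch_size):
        batch = apa_references[start:start + batch_size]
        numbered = "\n".join(f"{i + 1}. {ref}" for i, ref in enumerate(batch))
        prompts.append(f"{instructions}\n\n{numbered}")
    return prompts

# Hypothetical references: 23 articles yield 3 prompts.
refs = [f"Author et al. ({2010 + i}). Example title {i}. Journal of IoT." for i in range(23)]
print(len(build_classification_prompts(refs)), "prompts prepared")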
Upon preparing the prompts, ChatGPT generated responses that included the classification output in a visually organized table (Figure 8). Within this table, “x” markings indicated the assigned category for each article, while accompanying explanations provided insights into the underlying decision-making process employed by ChatGPT (refer to Figure 8). This comprehensive representation facilitated the interpretation of ChatGPT’s classification outcomes and allowed for a deeper understanding of the rationale behind each categorization.
To assess the classification and discarding of articles, we carefully selected a subset of 120 articles, comprising approximately 25% of the total articles (496) and representing all four categories. We then organized the titles and abstracts of these articles and shared them with the experts using Google Forms to facilitate the management of the evaluation process. A sample of the questions, including the article’s title and abstract, illustrating the format used in the questionnaire, is attached in Figure S8. We provided the article title and abstract because this is the method followed in the traditional article discarding process. To flexibly account for articles that may cover multiple categories, we permitted volunteers to select a maximum of two categories per article, provided that neither of them was the “not related” category. This approach acknowledged the complexity of some articles, ensuring that they were not constrained to a single classification. The volunteers’ responses were then converted into a numerical scale, where the four predefined categories were represented by the numbers 1, 2, 3, and 4, making quantitative analysis and comparison easier.
After eliminating unreliable raters based on their Cohen’s kappa coefficient values, we employed the majority vote approach to determine the final category for each article. This consensus-based classification was then used as a benchmark to evaluate the filtration process of ChatGPT. Table S2 provides a detailed breakdown of the final categories assigned to the articles based on the majority vote of the volunteers.
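The majority-vote consensus can be derived with a simple tally per article, as in the sketch below (hypothetical category codes 1–4 for a handful of articles).

from collections import Counter

# Hypothetical votes from the retained raters, coded 1-4.
votes_per_article = {
    "article_01": [1, 1, 2, 1],
    "article_02": [4, 4, 4, 3],
    "article_03": [2, 3, 2, 2],
}

consensus = {article: Counter(votes).most_common(1)[0][0]
             for article, votes in votes_per_article.items()}
print(consensus)  # {'article_01': 1, 'article_02': 4, 'article_03': 2}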
Figure 9(a1) shows the confusion matrix of the comparison between the benchmark (true classifications) and the classifications from ChatGPT (APA). By analyzing the findings, we observed that the “not related” class achieved a promising accuracy of 78.00%, an F1-score of 81.00%, and a recall of 80.00%. This indicates that ChatGPT demonstrated effective performance in removing irrelevant articles. However, for the remaining classes, the F1-scores were lower than 80%. These lower accuracies were expected, since ChatGPT relied solely on APA information for classification.
The generation of the confusion matrix provided a comprehensive evaluation of ChatGPT’s performance. While employing ChatGPT (APA) in the classification process exhibited promising results in filtering out irrelevant articles, there is room for improvement in its classification accuracy for other categories. It is worth mentioning that the sole dependence on the APA information to filter was an intentional choice aimed at assessing ChatGPT’s performance at different stages and input levels, even though it deviated from conventional methods. However, recognizing the potential limitations of relying solely on APA information, we sought to improve the accuracy of the filtering process by incorporating article abstracts. This modified approach, ChatGPT (APA + Abstract), aimed to leverage both the APA and abstract information to enhance the system’s performance.
To implement the classification of articles using the ChatGPT (APA + Abstract) approach, we obtained the APA and abstract information of the articles from Scopus in CSV file format. This allowed us to gather the necessary data for creating prompts that could be fed into ChatGPT. However, it is crucial to consider that the performance of ChatGPT models is mainly constrained by token length and capacity [53]. Each token represents a text unit, such as a word or character. The maximum token limit for ChatGPT models is a crucial factor to consider when designing prompts. Exceeding the token limit would require truncating or omitting input parts, potentially losing important information. Therefore, we limited the number of articles in each prompt to five at a time. This decision was made considering the average token length of the APA information and article abstracts, and to avoid confusing the ChatGPT model. By incorporating article abstracts into the classification process, we aimed to address the potential limitations of relying solely on APA information. Abstracts often provide a concise summary of an article, offering valuable contextual cues that can aid in accurate classification. Figure 10 provides a visual representation of the process, illustrating how ChatGPT was fed with prompts containing both APA and abstract information, and it showcases the system’s classification responses.
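A practical way to respect the token limit when batching APA references and abstracts is to count tokens before sending each prompt, for example with the tiktoken library; the 3,000-token budget below is an illustrative assumption rather than a value used in the study.

import tiktoken

encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")
TOKEN_BUDGET = 3000  # illustrative budget, leaving headroom for the model's reply

def fits_budget(records, budget=TOKEN_BUDGET):
    # records: list of (apa_reference, abstract) tuples forming one prompt batch.
    text = "\n\n".join(f"{apa}\nAbstract: {abstract}" for apa, abstract in records)
    return len(encoder.encode(text)) <= budget

batch = [("Author et al. (2020). Example title. Journal of IoT.",
          "This study presents an IoT-based monitoring system ...")] * 5
print(fits_budget(batch))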
It can be observed that the classification process conducted by ChatGPT (APA + Abstract) occasionally results in assigning two categories to a single article. While this is acceptable when neither of the two categories is the “Not related” category, indicating that the article covers distinct topics, complications arise when an article is classified as both relevant and “Not related”. This situation can pose challenges for users, particularly given the criticality of accurately including or excluding articles in the SR process.
Notably, ChatGPT occasionally tends to retain articles to the maximum extent, even if they are unrelated, by assigning them to the closest corresponding category. Figure 10 provides an illustration of an article being classified into two categories, with one of them being “Not related”. Alongside the classification outputs, ChatGPT also provides justifications for its selections, which are pivotal in informing the decision-making process. ChatGPT provides insights into the factors and reasoning underlying its decisions by explaining its classifications. This justification feature serves as a valuable tool for evaluating and validating the appropriateness of the classification decisions.
To address the challenge posed by articles being classified into two categories, one of which is “Not related”, we leveraged the explanations provided by ChatGPT to assist in confirming decisions regarding article inclusion or exclusion. Practically, we collected the articles that ChatGPT assigned to two categories and re-requested their classification; this time, however, we provided ChatGPT with the explanations accompanying its initial classifications. In practical applications, we recommend reading the justification provided by ChatGPT for articles classified into two classes to confirm whether the article is relevant.
Similarly, we evaluated the performance of the classification from ChatGPT (APA + Abstract) by comparing ChatGPT’s results (APA + Abstract) to our benchmark, which consisted of the opinions of experts. This evaluation aimed to assess the efficacy of the filtration process, particularly in relation to the “Not related” class (Figure 9(b1)). The results showed significant improvement when applying ChatGPT (APA + Abstract) compared to ChatGPT (APA) alone. Regarding precision, recall, and F1-score, the ChatGPT (APA + Abstract) achieved impressive values for the “Not related” class, with scores of 85.00%, 93.00%, and 90.00%, respectively. These metrics outperformed the corresponding scores obtained by ChatGPT (APA) (Figure 9(b2)). Furthermore, the F1-scores for the three other classes, namely, water management, wastewater management, and water quality, were also notably higher, with scores of 91.00%, 87.00%, and 86.00%, respectively. The implementation of ChatGPT (APA + Abstract) led to a reduction in misclassification rates of approximately 64% compared to ChatGPT (APA), demonstrating its capacity for improved accuracy. Additionally, other evaluation measures, such as accuracy, macro-F1, and weighted F1, experienced enhancements. These findings collectively underscore the exceptional performance of ChatGPT (APA + Abstract) in effectively filtering and categorizing articles, positioning it as a valuable tool for subsequent classification and article exclusion with enhanced precision.
However, it is important to acknowledge that certain limitations remain, particularly regarding the number of articles that can be filtered simultaneously. While ChatGPT exhibits remarkable capabilities, practical constraints need to be considered when scaling up its application. This evaluation provides valuable insights into the effectiveness and potential of ChatGPT (APA + Abstract) as a robust classification system, offering improved precision and reliability in filtering and categorizing scientific articles. By combining AI-driven classification strengths with human evaluators’ expertise, we can harness the power of automation while ensuring the highest standards of accuracy and relevance.
Despite the limitation on the feeding rate of articles into ChatGPT, it continues to surpass traditional filtering methods in terms of time efficiency. The performance of ChatGPT (APA + Abstract) in article filtering is considered outstanding. Therefore, ChatGPT (APA + Abstract) was utilized to screen all articles within the study. The comprehensive results of the filtering and categorizing of all articles can be found in Tables S3–S6. It is important to note that the output of this step goes beyond the elimination of articles; it also involves categorizing relevant articles into three main classes. Following the filtration process, a total of 351 articles were discarded as they were deemed irrelevant, while 145 articles were retained as relevant. The relevant articles were categorized into specific domains, with 76 articles on water management, 53 on wastewater management, and 32 on water quality. It is important to acknowledge that specific articles may overlap and fall into multiple categories, resulting in 161 articles across the three domains. However, when considering unique articles, the total count stands at 145.
Ultimately, the utilization of ChatGPT (APA + Abstract) in the filtration and categorization process demonstrates its effectiveness in efficiently managing a large volume of articles, streamlining the identification of relevant content, and facilitating the organization of articles based on their thematic relevance. By leveraging the capabilities of AI-powered classification, researchers can optimize their workflow, allocate their time more effectively, and enhance the accuracy and precision of their literature review processes.

3.3. Second-Round Article Filtration (Full-Text) and Information Extraction

The full-text filtration and information extraction phase was carried out during a second round of article filtration to evaluate the suitability of the remaining 145 articles for inclusion in our review and to extract valuable information from them. This challenge was addressed by utilizing the capabilities of ChatGPT for full-text filtering, as illustrated in Figure 6. To effectively leverage ChatGPT for this purpose, we initially identified five sub-categories within each main category to concentrate on specific areas of interest and ensure a comprehensive exploration of the diverse topics relevant to our review. These sub-categories were thoughtfully selected to cover diverse aspects of the subject matter: sensors and sensing technology, data acquisition and transmission, data analytics and visualization, applications and case studies, and research gaps and trends.
To automate information extraction and harness the capabilities of ChatGPT, we devised a prompt that solicited ChatGPT’s recommendations for relevant questions pertaining to each sub-category. The response generated by ChatGPT to this request is depicted in Figure 11, while Figure 12 showcases the 14 generated questions belonging to the five sub-categories. It is important to note that these questions are of a general nature and elicit responses in the form of “yes” or “no”. ChatGPT’s answers to these questions help extract information from the articles and remove irrelevant articles. In this phase, we tested the performance of ChatGPT in two scenarios: ChatGPT (APA) and ChatGPT (APA + Abstract + relevant information). Practically, the ChatGPT prompts were constructed using the article’s APA reference, abstract, methodology, discussion, and occasionally the conclusions section. Due to the extended length of these extracted sections compared to the previous steps (i.e., abstract only), the ChatGPT prompts were designed to handle one article at a time.
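The sketch below illustrates how such a single-article prompt could be assembled from the extracted sections and the 14 yes/no questions; the section names, question wording, and instruction text are placeholders, not the study’s exact prompt.

QUESTIONS = [
    "1-1: Does the study specify the sensors or sensing technology used?",
    "2-1: Does the study describe the data transmission technology?",
    "4-1: Does the study report a real-world case study or deployment?",
    # ... the remaining questions covering the five sub-categories
]

def build_fulltext_prompt(apa, sections):
    # sections: dict mapping section name -> extracted text (abstract, methodology, ...).
    body = "\n\n".join(f"{name.upper()}:\n{text}" for name, text in sections.items())
    question_block = "\n".join(QUESTIONS)
    return (
        f"Reference: {apa}\n\n{body}\n\n"
        "Answer each question with Yes or No. If Yes, quote the supporting "
        "information from the article. If most answers are No, classify the "
        f"article as unrelated.\n\n{question_block}"
    )

prompt = build_fulltext_prompt(
    "Author et al. (2021). Example title. Journal of IoT.",
    {"abstract": "...", "methodology": "...", "results": "..."},
)
print(prompt[:200])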
However, as previously discussed, the prompt’s length is carefully adjusted to balance obtaining reliable responses from ChatGPT and saving time. It is worth noting that the time invested in this step is considerably less than the time of manual execution, particularly considering the added benefit of information extraction alongside the article’s filtration.
During the assessment of ChatGPT’s responses to the 14 questions, we observed three distinct scenarios. Firstly, when the answer to a question was “yes”, ChatGPT confirmed this affirmative response and provided relevant information from the article that corresponded to the question (refer to Figure 13). Secondly, in instances where the answer was “no”, ChatGPT simply reported “No” without furnishing any further information derived from the article (as shown in Figure 14). Lastly, when ChatGPT determined that the majority of answers were “No”, it classified the paper as “unrelated” (as shown in Figure S9).
In this phase, we evaluated ChatGPT’s performance by examining its responses for individual articles (we selected one article well known to the authors as an example). Initially, we asked ChatGPT to answer these questions based on the article’s APA information alone. However, as demonstrated in Figure 14, where ChatGPT provided incorrect responses, the APA information proved to be inadequate. For example, in Answer 1-1, ChatGPT mistakenly reported the wrong type of sensors, and in Answer 4-1, ChatGPT inaccurately identified the research location as Saudi Arabia instead of Hong Kong.
To improve the accuracy of ChatGPT’s responses, we supplemented its understanding by incorporating additional information from the articles themselves. We considered various sections, including the titles, abstracts, methodology descriptions, relevant parts of the results, and conclusions, as these sections often provided more detailed and context-rich information compared to abstracts alone. However, we intentionally excluded article introductions and related work sections to maintain clarity and avoid confusion. Figure 15 provides an example of a ChatGPT prompt with a title, abstract, methodology description, and ChatGPT’s response to the questions. In this example, we used the same article as before, and it is evident that the quality of ChatGPT’s responses has significantly improved. For instance, in Answer 1-1, ChatGPT accurately reported the use of 58 ultrasonic sensors, and in Answer 4-1, ChatGPT correctly identified the research area’s location.
At this stage, it can be concluded that by refining the prompt and incorporating additional article information, we enhanced the accuracy of ChatGPT’s responses during the information extraction phase. This iterative process allowed us to leverage the strengths of ChatGPT while ensuring the reliability and validity of the extracted information. Nonetheless, human oversight and critical evaluation remained essential to validate and interpret the results obtained from ChatGPT.
To overcome the limitation of subjective evaluation, we collaboratively answered the 14 questions for a subset of 30 articles and compared our answers with ChatGPT’s outputs. Remarkably, despite the expected total of 420 individual answers for the 14 questions and 30 articles, the answers amounted to 381, owing to the classification of 3 articles as irrelevant. The summarized outcomes of these responses are presented in Figure 15, while more details about the answers can be found in Table S7. Among the 381 obtained responses, ChatGPT accurately captured 371, resulting in an impressive similarity rate exceeding 97%.
Regarding discarding articles, both ChatGPT and the authors agreed on the same articles. However, it is worth noting that ChatGPT’s responses were completely different for unrelated articles, and it stopped responding to questions (Please refer to Figure S9). This substantial level of agreement underscores the efficacy of ChatGPT in effectively comprehending and extracting information from the articles. Upon evaluating the efficacy of this approach in filtering the initial set of 145 articles, we successfully identified 56 articles as irrelevant, enabling us to focus on extracting pertinent information from the remaining 86 articles. This demonstrates the valuable role of ChatGPT in streamlining the article filtration process and automating information extraction from a substantial number of articles.
Since the snowballing process is an integral part of conducting an SR, we employed both backward and forward snowballing techniques to uncover additional relevant studies that might have been overlooked during the initial database search [24]. The backward snowballing method involves scrutinizing the references of the included papers to identify related articles, while the forward snowballing technique entails searching for studies among the articles that cited the included ones [24]. We manually conducted the snowballing process in this study by screening the titles of articles. However, we recognize the potential of leveraging ChatGPT to automate this step in order to advance the full automation of the SR process. By implementing the snowballing strategy, we successfully identified 52 new articles through multiple iterations in addition to the articles previously identified. These 52 articles underwent the same comprehensive filtration method outlined earlier in our methodology. As a result, 19 articles were excluded due to their lack of relevance, while the remaining 33 articles met the criteria for inclusion in our review database. Consequently, the total number of relevant articles included in our review increased to 119.
Overall, leveraging ChatGPT ensures a more thorough filtering process, assists in extracting information based on responses to comprehensive questions, and enables the inclusion of snowballing articles, expanding our review’s breadth and scope. By capitalizing on ChatGPT’s capabilities, we enhance the SR methodology’s efficiency, accuracy, and reliability.

3.4. Analysis and Interpretation of Extracted Information

This phase focuses on analyzing the content collected in the previous phases, with particular emphasis on the sub-categories outlined in Figure 12. The flowchart for Phase 3 is illustrated in Figure 16, providing a visual representation of the analysis process. To streamline the analysis, the “Yes” responses to each question were first compiled and organized. These compiled responses were then further analyzed and presented in Table S8. This approach facilitates cohesive and structured data exploration, allowing for a more rigorous examination of the insights obtained from ChatGPT.
Accordingly, Table 4 provides a comprehensive overview of the response statistics obtained during Phase 2 and the corresponding objectives for analyzing each question. These responses served as prompts for ChatGPT, with a maximum of ten responses per prompt, covering all sub-categories outlined in Figure 12. The content analysis encompassed information extraction related to sensors and sensing technologies, data acquisition and transmission, data analytics and visualization, and applications and case studies, as well as limitations and gaps identified in the reviewed articles. Leveraging ChatGPT as an analytical tool facilitated a more thorough identification of various patterns and trends within the data analysis process. For example, a specific prompt was designed to explore the utilization of multiple types of sensors and their associated benefits, as depicted in Figure 17.
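As an illustration of the batching rule mentioned above (at most ten responses per prompt), the following minimal sketch groups the compiled “Yes” responses into analysis prompts; the instruction text and example responses are assumptions, not the study’s exact prompts.

```python
def batch_analysis_prompts(responses, instruction, batch_size=10):
    """Group the compiled responses into chunks of at most `batch_size`
    and wrap each chunk with the content-analysis instruction."""
    prompts = []
    for start in range(0, len(responses), batch_size):
        chunk = responses[start:start + batch_size]
        numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(chunk))
        prompts.append(f"{instruction}\n\nResponses:\n{numbered}")
    return prompts

# Hypothetical usage for question 1-2 (use of different types of sensors):
sensor_responses = [f"Article {i}: uses pH, turbidity, and ultrasonic sensors"
                    for i in range(1, 24)]
prompts = batch_analysis_prompts(
    sensor_responses,
    "Identify trends in the types of sensors used and their reported benefits.")
print(len(prompts))  # 3 prompts for 23 responses
```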
Similarly, trends in data transfer technologies were examined based on the responses to question 2-1 (Figure 12). Figure 18 illustrates ChatGPT’s responses concerning the specific applications of wireless communication technologies. Furthermore, multiple prompts were devised within the data analysis and visualization section. These prompts aided in exploring the diverse approaches employed for data analysis, including AI and ML techniques, as well as the visualization methods used to support decision-making (Figure S10). Additionally, questions 4-1 and 4-2 were integral to the review process, assessing the implementation of proposed systems or case studies in the studied papers while identifying prevailing trends and scopes (Figure 19). The benefits associated with such implementations were also investigated within each article (Figure S11).
The analysis stage also involved thoroughly examining the limitations and research gaps discussed in previous studies, along with the corresponding recommendations put forth by researchers. Leveraging ChatGPT in this phase facilitated a comprehensive exploration and in-depth understanding of the challenges and limitations encountered in prior research and the proposed solutions adopted to address them. To ensure a systematic approach to identifying and categorizing the limitations and challenges discussed by different authors, a carefully designed prompt (Figure 20) was employed, utilizing the results obtained from questions 5-1 and 5-2 in Figure 12.
This approach allowed for extracting and organizing valuable insights from the collected data. Additionally, a comprehensive list of recommendations was compiled, drawing from the proposed solutions identified in question 5-3 and categorized based on common trends (Figure 21).
This approach yielded a wealth of information regarding the challenges, limitations, and potential solutions found in the reviewed articles. In order to gain a deeper understanding and assess the extent of the resolved issues, a ChatGPT prompt was utilized to compare the limitations and the challenges highlighted by various authors with the suggested solutions and recommendations. This comparative analysis provided valuable insights into the existing research gaps and identified areas for further investigation and research. An example depicting the resulting research gaps is illustrated in Figure 22.
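A comparable comparison prompt could be assembled programmatically, as in the brief sketch below; the wording is an assumption, and the study’s actual prompt is the one shown in Figure 22.

```python
def build_gap_analysis_prompt(limitations, recommendations):
    """Pair the compiled limitations with the compiled recommendations and
    ask for the limitations that remain unaddressed (i.e., research gaps)."""
    lims = "\n".join(f"- {item}" for item in limitations)
    recs = "\n".join(f"- {item}" for item in recommendations)
    return (
        "Compare the following limitations and challenges reported in the "
        "reviewed articles with the proposed solutions and recommendations, "
        "and list the limitations that remain unaddressed as research gaps.\n\n"
        f"Limitations and challenges:\n{lims}\n\n"
        f"Solutions and recommendations:\n{recs}"
    )
```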

4. ChatGPT Strengths, Limitations, and Future Directions in Automating SR Process

ChatGPT, built on the GPT-3.5 architecture, represents a significant breakthrough in AI research, enabling the generation of coherent and meaningful human-like language by leveraging vast amounts of language data. This innovative language model holds promise for various domains, including systematic reviews, and can potentially contribute to the advancement of general artificial intelligence. However, it is important to acknowledge that, being a generative model, ChatGPT cannot guarantee the absolute accuracy of its outputs. Therefore, this section will explore the strengths, limitations, potential areas for enhancement, and future research directions concerning ChatGPT in the context of conducting SRs.

4.1. Strengths of ChatGPT in SR Process

ChatGPT has been proven to be a valuable tool in the SR process, offering several strengths that enhance the efficiency and effectiveness of the methodology. Through our methodology and evaluation, we have identified the following key strengths of ChatGPT in conducting SRs:
  • Extensive automation: ChatGPT contributes to automating several tasks in the SR process, such as generating research questions, suggesting BRTs, categorizing relevant articles, discarding unrelated ones, proposing sub-categories to be covered under each main category, generating questions to aid information extraction from the articles, and extracting the relevant information itself. This level of automation helps streamline the SR process and reduces both time and errors.
  • Enhanced accuracy and efficiency: ChatGPT offers a valuable advantage by improving the accuracy and efficiency of filtering and classifying articles. Researchers can benefit from its ability to swiftly identify relevant studies, reducing uncertainty and saving significant time and effort. Moreover, ChatGPT’s proficiency in natural language processing aids precise content analysis, minimizing the risk of errors and omissions in research interpretation.
  • Time-saving: ChatGPT demonstrates significant potential in saving time during SRs, which are known to be time-consuming and resource-intensive processes that require high levels of expertise and attention to detail. ChatGPT assists in this process by swiftly analyzing and summarizing large volumes of the literature, aiding researchers in identifying relevant studies and extracting key information more efficiently. In our study, ChatGPT played a significant role in tasks such as filtering, categorizing, and content analysis, which resulted in decreased time and effort as well as reduced sources of uncertainty. However, it is important to note that human experts should carefully review ChatGPT-generated summaries.
  • Improved reproducibility: Although ChatGPT’s responses are influenced by the user prompts, the same procedure can be replicated multiple times by following the same guidelines and adhering to the recommended approach. This enhances the reproducibility of the results, allowing consistent outcomes to be obtained through repeated application of the methodology.
  • Flexibility: The introduced ChatGPT-based method for automating the SR process can be applied to conduct SRs across various fields, providing opportunities for its use beyond the specific context of the current study.

4.2. Limitations of ChatGPT in SR Process

ChatGPT, despite its strengths, also has certain limitations that need to be considered when applying it to the SR methodology. These limitations arise from the nature of the model and the challenges associated with its implementation in complex research tasks. Understanding these limitations and constraints is crucial to ensuring the appropriate use and interpretation of ChatGPT-generated outputs in the SR process. This subsection discusses the limitations of ChatGPT in the context of SR methodology and identifies improvement opportunities. Our study has uncovered the following limitations:
  • Limited ability to extract full-text articles: Although ChatGPT can suggest and adjust BSTs based on user requests, it is not optimized for article extraction, which may constrain the comprehensiveness and completeness of the SR process.
  • Limited ability to extract all information from articles: Although ChatGPT can filter and categorize articles and extract textual information, it may fail to extract relevant information presented in non-standard formats, such as figures or other non-textual forms. This can result in incomplete extraction of relevant data from articles that rely on such presentation methods, potentially affecting the comprehensiveness and accuracy of the extracted information during the SR process.
  • Dependence on input data: ChatGPT’s performance depends highly on the quality of the input data. If the input data are biased or incomplete, ChatGPT’s output may be similarly flawed.
  • Limited Access to Real-Time Data: One notable drawback of ChatGPT in its application to automating the SR process pertains to its dependence on a pre-existing database. ChatGPT relies solely on the information it was trained on, lacking access to real-time data from the internet. Consequently, the model’s knowledge and comprehension are confined to the training data, limiting its ability to incorporate the latest research studies, publications, and emerging evidence. This limitation poses challenges in providing comprehensive and up-to-date information throughout the systematic review process.
  • Length of prompts: While ChatGPT has the ability to generate high-quality responses, the length and complexity of the prompts used can impact the accuracy and coherence of the generated text. Our study revealed that longer prompts tended to result in more accurate and relevant responses, but also required more time and effort to prepare. Conversely, shorter prompts were easier and quicker to generate, but may have led to less accurate or incomplete responses. Hence, balancing the prompt’s length and complexity with the generated text’s accuracy and relevance is important. Additionally, careful consideration should be given to the prompt formulation process to ensure that the generated responses meet the desired quality standards in the context of the SR process.
  • Token limitations: ChatGPT limits the number of tokens that can be processed at once, meaning that the length of the input sequence (i.e., prompt plus generated text) is bounded and longer responses may require multiple iterations or segmentation. Our study encountered this limitation when attempting to generate longer responses. It can affect the efficiency and effectiveness of the ChatGPT model for certain tasks, especially in Phase 2, where filtration required feeding ChatGPT selected parts of each article; one practical workaround is to segment the input by token count, as sketched after this list.
  • Memory limitations: ChatGPT’s ability to recall previous prompts and maintain a coherent, accurate discourse on a specific topic is a crucial consideration, as it can impose constraints that affect its scalability and applicability to certain tasks. In our study, we encountered memory-related restrictions: ChatGPT occasionally struggled to keep its responses focused on the precise topic, leading to deviations or inaccuracies in its understanding of our prompts. This was particularly noticeable when working with large datasets or over multiple iterations, highlighting the potential impact of memory limitations on the model’s performance.
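As referenced in the token-limitations bullet above, one way to stay within the token budget is to segment article text before submitting it. The sketch below uses OpenAI’s open-source tiktoken tokenizer; the 3,000-token budget is an arbitrary illustrative choice, not a value used in the study.

```python
import tiktoken

def chunk_by_tokens(text, max_tokens=3000, encoding_name="cl100k_base"):
    """Split `text` into consecutive chunks of at most `max_tokens` tokens."""
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    return [encoding.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]
```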

4.3. Future Perspectives: Expanding the Potential of ChatGPT in SR

As technology advances and AI-driven language models such as ChatGPT become more sophisticated, there are exciting opportunities for further development and utilization in the field of SR. The future perspectives of ChatGPT in SR offer potential avenues for enhancing the review process’s efficiency, accuracy, and comprehensiveness. By addressing existing challenges and building upon the strengths of ChatGPT, researchers can unlock its full potential in advancing evidence synthesis and knowledge discovery. This subsection explores some of the future perspectives and areas of improvement for ChatGPT in the SR methodology, including:
  • Conducting the snowballing procedure using ChatGPT: This approach involves utilizing ChatGPT to search the database using BSTs, applying the first round of filtering based on abstracts, and then collecting the remaining articles along with their references (backward) and the publications citing them (forward). These collected articles would undergo another round of abstract screening before proceeding to the second level of filtering. Automating the snowballing procedure with ChatGPT could streamline the filtration process, making it more efficient and time-saving for researchers.
  • Developing more sophisticated algorithms to extract information from articles: Advanced techniques such as entity recognition and topic modeling could be employed to enhance the accuracy and precision of information extraction from articles (a brief entity-recognition illustration is sketched after this list). These techniques could enable ChatGPT to identify and extract relevant information more effectively, particularly from non-standard formats such as tables, figures, and other complex structures commonly found in the scholarly literature.
  • Improving the interpretability of ChatGPT’s output: Efforts could be made to develop tools or techniques to visualize and comprehend ChatGPT’s output. This may involve creating visual representations or graphical displays that aid in understanding the generated summaries or recommendations. Additionally, developing more transparent algorithms, which are easier for researchers to comprehend, can improve the interpretability of ChatGPT’s output.
  • Expanding the scope of input data for ChatGPT: One potential avenue for enhancing ChatGPT’s performance in conducting SRs is to explore the model’s applicability to data from fields with a larger body of relevant articles. This could involve testing ChatGPT’s content analysis capabilities by inputting a large amount of data and examining the conclusions drawn by the model. Additionally, applying ChatGPT to data from new fields can serve as a valuable means of testing the robustness and integrity of the developed methodology across different research contexts.
  • Access to real-time data: The SR process using ChatGPT could benefit from several avenues for improvement. Firstly, with real-time access to databases such as Scopus and Web of Science, ChatGPT could provide accurate, current information about articles, and internet connectivity would enhance data retrieval and screening by giving users access to a broader range of sources. Secondly, dynamic search strategies would allow real-time feedback to be incorporated into iterative refinements of the search. Thirdly, automated citation and reference management, integration with collaborative platforms, and access to diverse perspectives and global research materials would further enhance the SR process. However, the success of these enhancements hinges critically on the particular implementation, ethical considerations, and rigorous validation of the retrieved information.
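To illustrate the entity-recognition direction noted in the second bullet above, the short sketch below applies spaCy’s pretrained English pipeline to an invented abstract sentence. It is not part of the study’s methodology and assumes the en_core_web_sm model has been downloaded.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

# Invented abstract sentence used purely for illustration:
abstract = ("The system deploys 58 ultrasonic sensors across the sewer network "
            "in Hong Kong and transmits readings over LoRaWAN every 15 minutes.")

for ent in nlp(abstract).ents:
    print(ent.text, ent.label_)  # e.g., "58" CARDINAL, "Hong Kong" GPE, "every 15 minutes" TIME
```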
Overall, it is essential to embrace the development of AI while using it with caution and supervision in critical domains. While ChatGPT offers significant potential for automating SR processes, its limitations must be acknowledged and addressed, and strategies for enhancing its performance in conducting SRs should be carefully devised and implemented.

5. Ethical Considerations in Utilizing AI-Language Models

The utilization of AI language models such as ChatGPT in scientific writing necessitates careful attention to ethical considerations. Integrating these models raises important questions that require thorough examination and appropriate safeguards. One crucial ethical consideration in utilizing AI language models is the validation, verification, and critical evaluation of AI-generated outputs in order to ensure their accuracy, reliability, and appropriate contextualization within the broader scientific knowledge. In this regard, the involvement of human experts is paramount. Their supervision and expertise play a critical role in aligning the outputs with established standards, identifying and rectifying potential inaccuracies or biases, and providing a comprehensive and accurate interpretation of the AI-generated content. By incorporating human judgment and critical evaluation, researchers uphold responsible practices that enhance the reliability and credibility of the findings derived from AI language models.
Ethical considerations also encompass aspects such as data privacy, informed consent, and bias mitigation strategies. Researchers must adhere to established guidelines and regulations to protect data privacy when utilizing AI language models. This involves handling sensitive or personal information with utmost care and ensuring strict confidentiality to comply with privacy standards. Obtaining informed consent becomes crucial when utilizing data collected from individuals or sources with sensitive information. Moreover, researchers must proactively implement strategies to identify and mitigate biases that may arise from the input data used in the automated SR process, ensuring fair and unbiased outcomes. By conscientiously addressing these ethical considerations, researchers contribute to the cultivation of a responsible and ethical environment for the utilization of AI language models in scientific writing.

6. Concluding Remarks and Recommendations

Our study presents a novel methodology for conducting systematic reviews by leveraging the power of ChatGPT. By combining the strengths of human expertise and AI capabilities, we aimed to streamline the traditional SR process and improve its efficiency and accuracy. We applied this method to conduct a comprehensive SR on IoT applications in water and wastewater infrastructure management, and our findings highlight the benefits of using ChatGPT in each step of the process. Our study revealed that ChatGPT effectively generates research questions and suggests Boolean research terms but is not appropriate for article extraction. However, it performs excellently in filtering and categorizing articles, as well as in full-text filtration and information extraction once suitable prompts are prepared. Our comprehensive content analysis of the selected publications revealed valuable insights into the current research landscape, highlighting emerging trends, identifying research gaps, and shedding light on future directions in the domains of IoT-based sensing and monitoring, data analytics and visualization, and applications and case studies. We evaluated our methodology through quantitative comparisons with traditional review techniques and expert opinions, and the results show that our approach significantly saves time and effort while maintaining high levels of accuracy. These findings demonstrate the potential of ChatGPT to improve the efficiency and accuracy of SRs, contributing to the advancement of scientific knowledge. Promising avenues for future research include fully exploring the capabilities of ChatGPT in SRs, investigating its limitations in diverse research contexts, and applying our approach to other fields to further enhance the efficiency and accuracy of SRs. We strongly recommend adopting our proposed framework, depicted in Figure 23, as a reliable guide for conducting SRs in diverse domains; it provides a robust foundation for automating the SR process, offering adaptability and scalability to accommodate research complexities. By recognizing the strengths and limitations of ChatGPT and taking appropriate measures to enhance its performance, researchers can maximize the benefits of AI in evidence synthesis while ensuring the precision and integrity of SRs in the scientific community.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/systems11070351/s1. Figure S1: Initialization Process. (a–e) Introducing IoT Technology; Figure S2: Initialization Process. (a–d) Introducing Civil Engineering Infrastructure; Figure S3: Initialization Process. (a–d) Introducing Water and Wastewater Infrastructure; Figure S4: Initialization Process. (a–d) Implementing IoT in Water and Wastewater Infrastructure; Figure S5: Initialization Process. (a–d) Investigating the Systematic Review Capability; Figure S6: ChatGPT’s Utilization of BSTs. (a–e) Extracting Search Keywords; Figure S7: Examples of references from ChatGPT. (a) Extracting related papers based on the Boolean search term. (b) Example of one of the incorrect references; Figure S8: A section of the questionnaire created using Google Forms; Figure S9: Two examples of ChatGPT’s responses in the case of irrelevant articles; Figure S10: User prompt and ChatGPT answer regarding the methods used for data analysis and visualization; Figure S11: User prompt and ChatGPT answer regarding the benefits of implementing the case studies. Table S1: Unique keywords as extracted from ChatGPT and VOSviewer; Table S2: Comparison between ChatGPT and human experts in the classification process for the selected 120 articles; Table S3: Categorization of all articles using ChatGPT (APA + abstract); Table S4: Articles belonging to IoT-based water quality monitoring as classified using ChatGPT, with explanation; Table S5: Articles belonging to IoT-based wastewater infrastructure management as classified using ChatGPT, with explanation; Table S6: Articles belonging to IoT-based water infrastructure management as classified using ChatGPT, with explanation; Table S7: Comparison between answers from ChatGPT and human experts for the 14 questions related to the five sub-categories for the selected 30 articles; Table S8: ChatGPT responses to the 14 questions with Yes/No and detailed descriptions of the answers. (a) IoT-based water infrastructure management, (b) IoT-based wastewater infrastructure management, and (c) IoT-based water quality monitoring.

Author Contributions

Conceptualization, A.A., E.A. and M.E.; methodology, A.A., E.A. and M.E.; validation, A.A., E.A. and M.E.; formal analysis, A.A., E.A. and M.E.; investigation, E.A. and A.E.E.E.; writing—original draft preparation, A.A., E.A. and M.E.; writing—review and editing, E.A., A.E.E.E. and A.A.; visualization, M.E., E.A. and A.A.; supervision, E.A., A.E.E.E. and T.Z.; project administration, A.E.E.E. and T.Z.; funding acquisition, A.E.E.E. and T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the University Grant Committee of Hong Kong Polytechnic University: [Grant Number Project No. P0036181].

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to greatly thank the volunteers who participated in the filtering process.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Paré, G.; Trudel, M.-C.; Jaana, M.; Kitsiou, S. Synthesizing Information Systems Knowledge: A Typology of Literature Reviews. Inf. Manag. 2015, 52, 183–199.
  2. Yuan, Y.; Hunt, R.H. Systematic Reviews: The Good, the Bad and the Ugly. Am. J. Gastroenterol. 2009, 104, 1086–1092.
  3. Kitchenham, B. Procedures for Performing Systematic Reviews; Keele University: Keele, UK, 2004.
  4. Mulrow, C.D. Systematic Reviews: Rationale for Systematic Reviews. BMJ 1994, 309, 597–599.
  5. Needleman, I.G. A Guide to Systematic Reviews. J. Clin. Periodontol. 2002, 29, 6–9.
  6. Agbo, C.; Mahmoud, Q.; Eklund, J. Blockchain Technology in Healthcare: A Systematic Review. Healthcare 2019, 7, 56.
  7. FitzGerald, C.; Hurst, S. Implicit Bias in Healthcare Professionals: A Systematic Review. BMC Med. Ethics 2017, 18, 19.
  8. Milne-Ives, M.; de Cock, C.; Lim, E.; Shehadeh, M.H.; de Pennington, N.; Mole, G.; Normando, E.; Meinert, E. The Effectiveness of Artificial Intelligence Conversational Agents in Health Care: Systematic Review. J. Med. Internet Res. 2020, 22, e20346.
  9. Abu-Odah, H.; Su, J.; Wang, M.; Lin, S.-Y.; Bayuo, J.; Musa, S.S.; Molassiotis, A. Palliative Care Landscape in the COVID-19 Era: Bibliometric Analysis of Global Research. Healthcare 2022, 10, 1344.
  10. Aarseth, W.; Ahola, T.; Aaltonen, K.; Økland, A.; Andersen, B. Project Sustainability Strategies: A Systematic Literature Review. Int. J. Proj. Manag. 2017, 35, 1071–1083.
  11. Shaban, I.A.; Eltoukhy, A.E.E.; Zayed, T. Systematic and Scientometric Analyses of Predictors for Modelling Water Pipes Deterioration. Autom. Constr. 2023, 149, 104710.
  12. Silva, M. A Systematic Review of Foresight in Project Management Literature. Procedia Comput. Sci. 2015, 64, 792–799.
  13. Karam, A.; Eltoukhy, A.E.E.; Shaban, I.A.; Attia, E.-A. A Review of COVID-19-Related Literature on Freight Transport: Impacts, Mitigation Strategies, Recovery Measures, and Future Research Directions. Int. J. Environ. Res. Public Health 2022, 19, 12287.
  14. Araújo, A.G.; Pereira Carneiro, A.M.; Palha, R.P. Sustainable Construction Management: A Systematic Review of the Literature with Meta-Analysis. J. Clean. Prod. 2020, 256, 120350.
  15. Hussein, M.; Eltoukhy, A.E.E.; Karam, A.; Shaban, I.A.; Zayed, T. Modelling in Off-Site Construction Supply Chain Management: A Review and Future Directions for Sustainable Modular Integrated Construction. J. Clean. Prod. 2021, 310, 127503.
  16. Taiwo, R.; Shaban, I.A.; Zayed, T. Development of Sustainable Water Infrastructure: A Proper Understanding of Water Pipe Failure. J. Clean. Prod. 2023, 398, 136653.
  17. Michalski, A.; Głodziński, E.; Böde, K. Lean Construction Management Techniques and BIM Technology—Systematic Literature Review. Procedia Comput. Sci. 2022, 196, 1036–1043.
  18. Abdelkader, E.M.; Zayed, T.; Faris, N. Synthesized Evaluation of Reinforced Concrete Bridge Defects, Their Non-Destructive Inspection and Analysis Methods: A Systematic Review and Bibliometric Analysis of the Past Three Decades. Buildings 2023, 13, 800.
  19. Elshaboury, N.; Al-Sakkaf, A.; Mohammed Abdelkader, E.; Alfalah, G. Construction and Demolition Waste Management Research: A Science Mapping Analysis. Int. J. Environ. Res. Public Health 2022, 19, 4496.
  20. Eltoukhy, A.E.E.; Chan, F.T.S.; Chung, S.H. Airline Schedule Planning: A Review and Future Directions. Ind. Manag. Data Syst. 2017, 117, 1201–1243.
  21. Hassan, L.K.; Santos, B.F.; Vink, J. Airline Disruption Management: A Literature Review and Practical Challenges. Comput. Oper. Res. 2021, 127, 105137.
  22. Aromataris, E.; Riitano, D. Systematic Reviews. AJN Am. J. Nurs. 2014, 114, 49–56.
  23. Meline, T. Selecting Studies for Systemic Review: Inclusion and Exclusion Criteria. Contemp. Issues Commun. Sci. Disord. 2006, 33, 21–27.
  24. Wohlin, C. Guidelines for Snowballing in Systematic Literature Studies and a Replication in Software Engineering. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, London, UK, 13–14 May 2014; ACM: New York, NY, USA, 2014; pp. 1–10.
  25. Moher, D.; Liberati, A.; Tetzlaff, J.; Altman, D.G. Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. Int. J. Surg. 2010, 8, 336–341.
  26. Sarkis-Onofre, R.; Catalá-López, F.; Aromataris, E.; Lockwood, C. How to Properly Use the PRISMA Statement. Syst. Rev. 2021, 10, 117.
  27. Aydın, Ö.; Karaarslan, E. OpenAI ChatGPT Generated Literature Review: Digital Twin in Healthcare. SSRN Electron. J. 2022.
  28. Cascella, M.; Montomoli, J.; Bellini, V.; Bignami, E. Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios. J. Med. Syst. 2023, 47, 33.
  29. Vaishya, R.; Misra, A.; Vaish, A. ChatGPT: Is This Version Good for Healthcare and Research? Diabetes Metab. Syndr. Clin. Res. Rev. 2023, 17, 102744.
  30. Halaweh, M. ChatGPT in Education: Strategies for Responsible Implementation. Contemp. Educ. Technol. 2023, 15, ep421.
  31. Kung, T.H.; Cheatham, M.; Medenilla, A.; Sillos, C.; De Leon, L.; Elepaño, C.; Madriaga, M.; Aggabao, R.; Diaz-Candido, G.; Maningo, J.; et al. Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models. PLOS Digit. Health 2023, 2, e0000198.
  32. Zhai, X. ChatGPT for Next Generation Science Learning. XRDS Crossroads ACM Mag. Stud. 2023, 29, 42–46.
  33. Rudolph, J.; Tan, S.; Tan, S. ChatGPT: Bullshit Spewer or the End of Traditional Assessments in Higher Education? J. Appl. Learn. Teach. 2023, 6, 342–362.
  34. Prieto, S.A.; Mengiste, E.T.; García de Soto, B. Investigating the Use of ChatGPT for the Scheduling of Construction Projects. Buildings 2023, 13, 857.
  35. You, H.; Ye, Y.; Zhou, T.; Zhu, Q.; Du, J. Robot-Enabled Construction Assembly with Automated Sequence Planning Based on ChatGPT: RoboGPT. arXiv 2023, arXiv:2304.11018.
  36. Alkaissi, H.; McFarlane, S.I. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus 2023, 15, e35179.
  37. Salvagno, M.; Taccone, F.S.; Gerli, A.G. Can Artificial Intelligence Help for Scientific Writing? Crit. Care 2023, 27, 75.
  38. Zheng, H.; Zhan, H. ChatGPT in Scientific Writing: A Cautionary Tale. Am. J. Med. 2023.
  39. Dergaa, I.; Chamari, K.; Zmijewski, P.; Ben Saad, H. From Human Writing to Artificial Intelligence Generated Text: Examining the Prospects and Potential Threats of ChatGPT in Academic Writing. Biol. Sport 2023, 40, 615–622.
  40. Khosravi, H.; Shafie, M.R.; Hajiabadi, M.; Raihan, A.S.; Ahmed, I. Chatbots and ChatGPT: A Bibliometric Analysis and Systematic Review of Publications in Web of Science and Scopus Databases. arXiv 2023, arXiv:2304.05436.
  41. Lecler, A.; Duron, L.; Soyer, P. Revolutionizing Radiology with GPT-Based Models: Current Applications, Future Possibilities and Limitations of ChatGPT. Diagn. Interv. Imaging 2023, 104, 269–274.
  42. Hosseini, M.; Horbach, S.P.J.M. Fighting Reviewer Fatigue or Amplifying Bias? Considerations and Recommendations for Use of ChatGPT and Other Large Language Models in Scholarly Peer Review. Res. Integr. Peer Rev. 2023, 8, 4.
  43. Fang, T.; Yang, S.; Lan, K.; Wong, D.F.; Hu, J.; Chao, L.S.; Zhang, Y. Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation. arXiv 2023, arXiv:2304.01746.
  44. Sallam, M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare 2023, 11, 887.
  45. Qureshi, R.; Shaughnessy, D.; Gill, K.A.R.; Robinson, K.A.; Li, T.; Agai, E. Are ChatGPT and Large Language Models “the Answer” to Bringing Us Closer to Systematic Review Automation? Syst. Rev. 2023, 12, 72.
  46. Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46.
  47. Zeng, G. On the Confusion Matrix in Credit Scoring and Its Analytical Properties. Commun. Stat. Theory Methods 2020, 49, 2080–2093.
  48. Jan, F.; Min-Allah, N.; Saeed, S.; Iqbal, S.Z.; Ahmed, R. IoT-Based Solutions to Monitor Water Level, Leakage, and Motor Control for Smart Water Tanks. Water 2022, 14, 309.
  49. Singh, M.; Ahmed, S. IoT Based Smart Water Management Systems: A Systematic Review. Mater. Today Proc. 2021, 46, 5211–5218.
  50. Zulkifli, C.Z.; Garfan, S.; Talal, M.; Alamoodi, A.H.; Alamleh, A.; Ahmaro, I.Y.Y.; Sulaiman, S.; Ibrahim, A.B.; Zaidan, B.B.; Ismail, A.R.; et al. IoT-Based Water Monitoring Systems: A Systematic Review. Water 2022, 14, 3621.
  51. Alshami, A.; Elsayed, M.; Mohandes, S.R.; Kineber, A.F.; Zayed, T.; Alyanbaawi, A.; Hamed, M.M. Performance Assessment of Sewer Networks under Different Blockage Situations Using Internet-of-Things-Based Technologies. Sustainability 2022, 14, 14036.
  52. Haluza, D.; Jungwirth, D. Artificial Intelligence and Ten Societal Megatrends: An Exploratory Study Using GPT-3. Systems 2023, 11, 120.
  53. Yang, X.; Li, Y.; Zhang, X.; Chen, H.; Cheng, W. Exploring the Limits of ChatGPT for Query or Aspect-Based Text Summarization. arXiv 2023, arXiv:2302.08081.
Figure 1. Overview of the SR Process Automation Stages.
Figure 2. Illustration of the components of the confusion matrix and the equation used to estimate the assessment metrics. Symbols in green cells represent the numbers of correctly classified samples, while symbols in magenta represent the numbers of misclassified samples. q, r, s, and t represent the total numbers of articles belonging to the different categories, while u, v, w, and x represent the total numbers of ChatGPT classifications.
Figure 3. The flowchart depicts the initial phase of the systematic review with ChatGPT. It shows three primary steps: (1) the development of Boolean research terms, (2) the extraction of relevant research articles, and (3) the extraction of the most common keywords. ChatGPT’s performance was evaluated using conventional, state-of-the-art techniques for conducting systematic reviews.
Figure 4. Response from ChatGPT to our request to create research terms for use in Scopus searches. (a) Response with BST, (b) response with BST for the latest 12 years, and (c) response with BST for the latest 12 years, including articles and conference papers in English only.
Figure 5. User prompt asking ChatGPT to retrieve the top 50 keywords, and ChatGPT’s response in tabular format.
Figure 6. Flow chart of the first and second phases of the filtration process. The figure depicts the details of Phase 1 (article filtration) and Phase 2 (information extraction and sub-category generation).
Figure 7. User prompt asking ChatGPT about its knowledge of the three main categories.
Figure 8. APA-style article filtration procedure (feeding rate: five articles at a time). (a) The user prompt. (b) ChatGPT’s response to the request. ChatGPT presents the answers in tabular format with an “x” next to the corresponding category and explains each decision beneath the table.
Figure 9. The confusion matrices comparing the classification of the articles by experts and ChatGPT. (a1,b1) display the confusion matrices, while (a2,b2) depict the performance metrics of the categorization process.
Figure 10. (a) An illustration of ChatGPT input utilizing APA metadata and the abstract. (b) ChatGPT’s response to the request. ChatGPT classified the article as both unrelated and in the water quality category; nonetheless, reviewing the explanation from the user’s perspective helps determine that the article is unrelated.
Figure 11. ChatGPT’s response to our request to propose research questions that fit each class. There are 14 questions in all.
Figure 12. Our systematic review taxonomy. The first level represents the three categories of the review, the second level depicts the sub-categories, and the third level illustrates the questions that aid information extraction. The 14 questions and five sub-categories are identical for each main category.
Figure 13. Illustration of a ChatGPT question-answer request prompt. The sole input was the APA article format. The left panel displays ChatGPT’s responses to these questions in the required tabular format. The dots indicate that only a portion of the questions and answers is displayed, as the complete prompt and answers are too long to be presented.
Figure 14. Illustration of a ChatGPT question-answer request prompt. The sole input was the article titles, abstracts, and portions of the methods sections. The left panel displays ChatGPT’s responses to these questions in the required tabular format. The dots indicate that only a portion of the questions and answers is displayed, as the complete prompt and answers are too long to be presented.
Figure 15. A comparison of ChatGPT’s responses with the authors’ responses for the 30 articles in the sample.
Figure 16. Flow chart of Phase 3.
Figure 17. User prompt and ChatGPT answer for the use of different types of sensors.
Figure 18. User prompt and ChatGPT answer for questions related to wireless communication technologies.
Figure 19. User prompt and ChatGPT answer for the trends within the proposed systems or case studies.
Figure 20. User prompt and ChatGPT answer to identify and categorize the limitations and challenges discussed by previous authors.
Figure 21. User prompt and ChatGPT answer to generate a comprehensive list of the compiled recommendations.
Figure 22. User prompt and ChatGPT answer for comparing the limitations and challenges highlighted by various authors with the suggested solutions and recommendations.
Figure 23. Automated Framework for Streamlining SR Methodology: A Proposed Approach.
Table 1. Examples of the questions asked to ChatGPT to feed the AI with information about the topic.

ID | Question
1 | What is the Internet of Things?
2 | What are the applications of the IoT so far?
3 | What are the requirements to build the IoT system?
4 | What are the infrastructures from the Civil engineering perspective?
5 | How can the concept of the IoT be implemented in the domain of water and wastewater management?
6 | What are the academic insights about implementing the IoT in water and wastewater management?
Table 2. Examples of the references extracted from ChatGPT and the evaluation of the correctness of both titles and DOIs.

No. | ChatGPT Reference | DOI Check | Title Check
1 | AlFuqaha, A., Guizani, M., Mohammadi, M., Aledhari, M., & Ayyash, M. (2015). Internet of things: A survey on enabling technologies, protocols, and applications. IEEE Communications Surveys & Tutorials, 17(4), 2347-2376. https://doi.org/10.1109/COMST.2015.2444095 | Correct | Correct
2 | Akbar, M., & Khan, M. S. (2018). IoT-based smart wastewater management system. International Journal of Distributed Sensor Networks, 14(6), 1550147718781026. https://doi.org/10.1177/1550147718781026 | DOI not found | Correct
3 | Arumugam, S., & Selvamani, S. (2020). Internet of things (IoT) based smart water management system: A review. Journal of Ambient Intelligence and Humanized Computing, 11(10), 4369-4390. https://doi.org/10.1007/s12652-020-02770-w | Incorrect | Correct
4 | Dong, J., Li, Y., & Li, X. (2019). A review of internet of things-based smart water management systems. Journal of Sensors, 2019, 1-11. https://doi.org/10.1155/2019/2567391 | DOI not found | Incorrect
5 | Kaur, P., Kumar, M., & Singh, P. (2021). IoT-enabled water management: A review. In I. Ahmad, S. Shafi, S. S. Gill, & V. Chang (Eds.), Internet of things and big data analytics towards next-generation intelligence (pp. 381-398). Springer. https://doi.org/10.1007/978-981-33-6965-5_17 | DOI not found | Incorrect
6 | Kumar, M., Kumar, V., & Al-Fuqaha, A. (2021). An overview of cyber-physical system-based water management in smart cities. Journal of Sensor and Actuator Networks, 10(2), 19. https://doi.org/10.3390/jsan10020019 | DOI not found | Incorrect
Table 3. The similarity percentage between the keywords from ChatGPT and VOSviewer.

ChatGPT | VOSviewer | Similarity (%) | Number of Unique Keywords from ChatGPT
50 | 50 | 20 | 40
100 | 100 | 28 | 72
180 | 180 | 23 | 138
200 | 200 | 21 | 158
50 | 263 | 68 | 16
Table 4. The gathered responses (yes) for each of the three major categories.

Sub-Category | Question | Water Quality Monitoring (Yes) | Water Infrastructure Management (Yes) | Wastewater Infrastructure Management (Yes) | Objectives
Sensors’ development | 1-1: Sensor development. | 26 | 28 | 36 | Identify trends in sensor development and manufacturing, study the advantages of employing several sensors, investigate the frequency of sensor use, categorize sensors according to their functionality, and investigate the methods used to evaluate sensor performance.
 | 1-2: Use of different types of sensors. | 18 | 37 | 19 |
 | 1-3: Sensors performance evaluation. | 21 | 15 | 15 |
Data transmission | 2-1: Data collection and transmission method. | 33 | 45 | 38 | Identify trends and anomalies in transmission methods, including the utilization of wireless communications, the types of wireless technologies employed, and the frequency of their occurrence in the examined papers. Analyze, also, the effectiveness of utilizing various communication technologies.
 | 2-2: Use of wireless communication. | 31 | 38 | 31 |
 | 2-3: Connectivity performance evaluation. | 7 | 5 | 5 |
Data analysis | 3-1: Data analysis methods. | 20 | 33 | 14 | Define frequently applied data analysis techniques, including AI and ML techniques, and study the trends in visualization approaches.
 | 3-2: Use of ML algorithms. | 6 | 11 | 0 |
 | 3-3: Data visualization to facilitate decision-making. | 12 | 19 | 17 |
Case studies | 4-1: The use in real-world settings. | 28 | 39 | 28 | Identify trends in the implementation of IoT-based systems in various real-world contexts and the outcomes and advantages of these implementations.
 | 4-2: Benefits and outcomes. | 17 | 37 | 27 |
Limitations and gaps | 5-1: Limitations and gaps in current research. | 14 | 24 | 15 | Define the limitations and gaps identified by the authors, the obstacles encountered in implementing their systems, the offered solutions, and the recommendations for overcoming them.
 | 5-2: Implementation challenges. | 25 | 42 | 23 |
 | 5-3: Recommendations or solutions. | 20 | 37 | 16 |
