A Hybrid Semi-Automated Workflow for Systematic and Literature Review Processes with Large Language Model Analysis
Abstract
1. Introduction
1.1. Research Aims and Contributions in This Work
- A human-oriented workflow that blends effectively with an existing LLM and reduces the human workload of performing SRs. The aim is to lessen the cognitive burden of a human-only approach and thereby increase accuracy, while maintaining the transparency and reproducibility of the review process.
- A novel strategy of prompt engineering, categorizing extracted data/information into identifiers, verifiers, and data fields (IVD), which is particularly efficient for extracting information from a large volume of research articles for performing SRs. The IVD strategy may be applied in other applications involving fixed-format documents using LLMs.
- A novel approach to study selection in SRs that leverages the capabilities of LLMs for efficient data extraction. This strategy employs an LLM to extract short, meaningful pieces of information from full-text articles, from which humans can make quick, manual decisions based on pre-determined exclusion criteria. This process is not only faster but may also reduce the human bias that has been linked to conventional SRs [12]. These techniques could set the foundation for future advancements in the field.
1.2. Focus and Limitations of This Work
- Reduce the workload of a researcher;
- Give the correct classifications regarding inclusion and exclusion.
2. Related Works
2.1. The Systematic Review Process
2.2. General AI Application
2.2.1. LLM-Based SR Approach
2.2.2. LLM-Assisted Human-in-the-Loop Approach
3. The Proposed Hybrid Workflow
3.1. Retrieve Data in Full Text
3.2. Define Selection Criteria
3.3. IVD Prompt Development and Testing
3.3.1. Development of the Prompt Engineering Methodology
- (a) Identifiers are the prompted outputs of the LLM that humans use to make inclusion/exclusion decisions. The identifier prompts are engineered to reflect the exclusion criteria of a review and are typically one or a few words in length. Strategically worded prompts can extract identifying keywords or phrases in the output to aid human reviewers in making swift and accurate exclusion decisions. For example, in our case study, we focused on desk-based workers, so we developed a prompt to output an identifier indicating the occupation of the participants in the study to aid in decision-making about inclusion and exclusion.
- (b) Verifiers are prompted outputs consisting of concise phrases or sentences that summarize certain aspects of the study. Verifiers can be used in a similar way to conventional abstract screening in the SR process. The output should include clear and concise sentences reflecting a summary of pertinent information, enabling human reviewers to validate exclusion decisions quickly. This method aims to replace conventional title/abstract screening with a more efficient mechanism. For example, the LLM can be prompted to output a summary of the study’s objective, method, or results so that the human reviewer can use the verifier to assess the relevance of the study.
- (c) Data Fields are prompted outputs that direct the LLM to focus on predetermined fields aligned with the data extraction employed in SRs. For example, we gathered data on the population and context of the study design, such as “number of participants” and “study duration”. Identifiers may also be treated as data fields. (A schematic sketch of an IVD prompt and its expected output follows this list.)
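To make the IVD categories concrete, the following is a minimal sketch of how an IVD-style prompt and the JSON output it is designed to elicit might be structured; the field names (e.g., occupation, objective, participants) are illustrative examples rather than the exact prompts used in the study.

```python
# Illustrative IVD (identifier/verifier/data field) prompt and output schema.
# Field names are hypothetical examples, not the exact prompts used in the study.
import json

IVD_PROMPT = """
You are extracting information from a research article for a systematic review.
Return a JSON object with exactly these keys:
  "occupation":    identifier, 1-3 words describing the participants' occupation
  "studyType":     identifier, 1-3 words (e.g., "RCT", "cross-sectional")
  "objective":     verifier, one or two sentences summarizing the study aim
  "method":        verifier, one or two sentences summarizing the method
  "participants":  data field, number of participants as an integer
  "studyDuration": data field, duration of the study (e.g., "12 weeks")
Article text:
{article_text}
"""

def build_prompt(article_text: str) -> str:
    """Fill the IVD template with the full text of one candidate article."""
    return IVD_PROMPT.format(article_text=article_text)

# Example of the kind of JSON output the prompt is designed to elicit.
example_output = json.loads("""
{
  "occupation": "office workers",
  "studyType": "RCT",
  "objective": "Evaluate a sit-stand desk intervention on sedentary time.",
  "method": "Participants wore an activity monitor for one work week.",
  "participants": 46,
  "studyDuration": "12 weeks"
}
""")
```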
3.3.2. Prompt Testing and Refinement
- Prepare initial few-shot prompts: Examples improve performance regardless of their correctness; this is not solely because of the accuracy of each individual example but also because examples help to structure and format the outputs. Furthermore, explicitly providing “examples” has been shown to lead to better results than providing none [27]. Therefore, random examples could be added based on user experience.
- Design initial few-shot prompt based on a relevant SR example: Develop prompts that concisely instruct the LLM on the task, incorporating the few-shot examples to illustrate the expected output. An LLM application could be utilized by providing the instruction, “You need to improve the following prompt to extract information from a research article,” followed by the initial draft prompt.
- Test the initial few-shot prompt: Input an initial few-shot prompt on a limited number of full-text articles. Review the outputs to improve the prompt with more relevant examples from the extraction.
- Implement more broadly: Once the few-shot prompts are optimized, apply them to the broader data extraction task across the full dataset. Monitor the outputs to verify that the LLM maintains consistent performance as it processes varying articles. (A sketch of the few-shot prompt assembly appears after this list.)
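As a sketch of the few-shot assembly described above, the snippet below prepends a small number of worked examples to the task instruction before sending the prompt to the LLM; the example records, instruction wording, and helper names are assumptions for illustration.

```python
# Minimal sketch of assembling a few-shot prompt from worked examples.
# The example records and the instruction wording are illustrative assumptions.
import json

FEW_SHOT_EXAMPLES = [
    {
        "excerpt": "Forty-six desk-based employees wore an activity monitor for 12 weeks...",
        "output": {"occupation": "office workers", "participants": 46},
    },
    {
        "excerpt": "Twenty nurses were recruited from a regional hospital...",
        "output": {"occupation": "nurses", "participants": 20},
    },
]

INSTRUCTION = (
    "Extract the requested fields from the article below and answer in JSON, "
    "following the format of the examples."
)

def build_few_shot_prompt(article_text: str) -> str:
    """Concatenate the instruction, worked examples, and the target article."""
    parts = [INSTRUCTION, "", "Examples:"]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Article: {ex['excerpt']}")
        parts.append(f"Answer: {json.dumps(ex['output'])}")
    parts += ["", "Article:", article_text, "Answer:"]
    return "\n".join(parts)
```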
3.3.3. Assessment of the Prompt Development and LLM Outputs
3.4. LLM Data Extraction and Formatting: JSON and CSV
3.5. Selecting the Articles: Decision-Making
- The identifier filtering only looks at short phrases of one to three words, which are typically repeated across multiple rows. For example, filtering on identifiers that share the same values across multiple articles can be automated using standard spreadsheet software.
- Next, the verifier screening is performed on the articles that failed the identifier filtering steps. The verifiers summarize key aspects of the articles into a few sentences, replacing the need to read larger excerpts of the article to make inclusion/exclusion decisions faster. The amount of repetition for these fields is less likely than for identifiers, as each article presents a different approach and solution.
- Lastly, during data cleaning, the human has fewer articles to evaluate in cases where the LLM did not find the information. Assuming the prompts were well designed, the human reviewer already has a good idea of what to look for in a full-text article during data cleaning. Our experience showed that any article whose CSV row is less than 50% complete is unlikely to be related to the research topic and may be excluded. (A pandas-style sketch of these filtering steps appears after this list.)
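As an illustration of how these steps might look in code, the following is a minimal sketch using pandas; the column names, the exclusion list, and the CSV file name are hypothetical placeholders, while the 50% completeness rule mirrors the heuristic described above.

```python
# Sketch of identifier filtering and the completeness heuristic with pandas.
# Column names, file name, and the exclusion list are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("llm_extractions.csv")

# Step 1: identifier filtering - exclude rows whose identifier matches a
# pre-determined exclusion value (repeated values make this automatable).
excluded_occupations = {"nurses", "drivers", "students"}
identifier_excluded = df["occupation"].str.lower().isin(excluded_occupations)

# Step 2: verifier screening is manual - flag the remaining rows for a human
# reviewer, who reads the short verifier summaries instead of the full text.
to_screen = df[~identifier_excluded]

# Step 3: completeness heuristic - rows with fewer than 50% of fields filled
# are flagged as likely irrelevant to the research topic.
completeness = df.notna().mean(axis=1)
likely_irrelevant = df[completeness < 0.5]

print(f"{identifier_excluded.sum()} excluded by identifiers, "
      f"{len(to_screen)} left for verifier screening, "
      f"{len(likely_irrelevant)} flagged by the completeness rule")
```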
4. Case Study
4.1. Data Resources
4.2. Data Preparation
4.3. Prompt Engineering
- x is the total number of distinct terms (values) for a prompted output, e.g., identifier. This number may be high if the articles used a range of terms to describe a certain aspect of the study or if the prompt results in many incorrect responses.
- y is the number of unaligned, i.e., incorrect, or unwanted values from x distinct values. A smaller value reflects fewer erroneous responses and, thus, a better prompt.
- z is the number of rows in the spreadsheet, i.e., articles, that contain one of the y unwanted values. A smaller value results from a better prompt. (A short computational sketch of these counts appears after this list.)
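Given these definitions, the three counts can be computed directly from one spreadsheet column; in the sketch below, the set of aligned (acceptable) values is an assumption supplied by the reviewer.

```python
# Sketch of computing the prompt-quality counts x, y, z for one prompted output.
# `aligned_values` is supplied by the reviewer and is an illustrative assumption.
import pandas as pd

def prompt_quality(column: pd.Series, aligned_values: set[str]) -> tuple[int, int, int]:
    """Return (x, y, z) for one prompted output column.

    x: number of distinct values returned for the prompt
    y: number of distinct values that are unaligned (incorrect or unwanted)
    z: number of rows (articles) containing one of those unwanted values
    """
    values = column.dropna().astype(str)
    distinct = set(values)
    unwanted = distinct - aligned_values
    x = len(distinct)
    y = len(unwanted)
    z = int(values.isin(unwanted).sum())
    return x, y, z

# Example usage with a hypothetical "occupation" identifier column.
occupation = pd.Series(["office workers", "office workers", "nurses", "unknown"])
print(prompt_quality(occupation, {"office workers", "desk-based workers"}))
# -> (3, 2, 2)
```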
4.4. Evaluation Design
- Phase 1A (human-only): We followed the PRISMA flow diagram to screen articles for the SR. We applied predefined inclusion and exclusion criteria to select relevant articles. Each article was screened independently by two reviewers, and any conflict was resolved by team discussion. The results were divided into two groups as follows: inclusion (n = 170) and exclusion (n = 220). We used predefined data extraction templates to collect information from each article.
- Phase 1B (hybrid): We used a plugin with the Gemini-Pro API (Top-p = 0.1, Temperature = 0.2) to extract data from the same articles. Gemini-Pro is an LLM created by Google DeepMind. This LLM (version 1.0) has an input token limit of 30,720 tokens, which corresponds to about 18,432–24,576 English words [32]. Several prompts were engineered iteratively based on the inclusion criteria and relevant keywords from the case study. Each prompt was sent to Gemini-Pro via the plugin, the responses were received in JSON format, and each request was repeated three times because of the uncertainty in the AI output [10]. The data collection period was between 13 January 2024 and 2 February 2024. We developed an LLM application that used the API to send the prompts and collect the JSON outputs. The API was designed to handle multiple requests simultaneously, which allowed us to collect the data quickly and efficiently. Once all JSON outputs had been collected, we stored them in a file (see Figure 5a). The data were pre-processed before filtering. After converting the JSON file to a spreadsheet using “jsoneditoronline”, one author filtered the data in Microsoft Excel and removed records according to the identifiers (see Figure 5b). All records (n = 390) were filtered and screened by one author. (A minimal sketch of the repeated API requests appears after this list.)
- Phase 2 (evaluation): One author validated the records from the hybrid workflow against the full-text articles of the included studies. Any conflicts between the hybrid and human-only methods were resolved by team discussion with reference to the full-text articles, ensuring that each study was screened twice independently by humans.
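For reference, the request/repetition procedure described in Phase 1B could look roughly like the sketch below, using the google-generativeai Python SDK; the authors used a plugin and their own application, so the model handle, file names, and output parsing here are illustrative assumptions, while the generation settings (Top-p = 0.1, Temperature = 0.2) and the three repeats are taken from the text.

```python
# Sketch of sending one prompt to Gemini-Pro and repeating the request three
# times, as described in Phase 1B. File names and parsing are illustrative.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: key supplied separately
model = genai.GenerativeModel("gemini-pro")
config = genai.GenerationConfig(temperature=0.2, top_p=0.1)

def extract(prompt: str, repeats: int = 3) -> list[dict]:
    """Send the same prompt several times to account for output variability."""
    outputs = []
    for _ in range(repeats):
        response = model.generate_content(prompt, generation_config=config)
        try:
            outputs.append(json.loads(response.text))
        except json.JSONDecodeError:
            outputs.append({"raw": response.text})  # keep unparsable replies
    return outputs

if __name__ == "__main__":
    article_text = open("article_001.txt").read()  # hypothetical input file
    prompt = ("Extract the IVD fields from the article below and answer in "
              "JSON.\n\n" + article_text)
    with open("extractions.json", "w") as f:
        json.dump(extract(prompt), f, indent=2)
```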
4.5. Results
4.5.1. Observations from Filtering and Screening
4.5.2. Observations from Data Cleaning
- Standardizing formats: This involved applying a uniform name to terms with several variations. For example, we standardized variants such as “activPAL3”, “ActivPAL3”, “activPAL3c”, and “activPAL 3 micro” to “activPAL” to reduce the sensitivity of the search.
- Handling missing values: Certain headings, such as “devicemodel”, had a high prevalence of missing data. The LLM could not differentiate between a device’s name and its model. Extracting such missing values involved identifying articles that met the selection criteria, particularly those with missing values in the identifiers, and searching for the specific missing information.
- Error removal: This involved identifying and rectifying errors in the output. Specific fields that required attention included the subjective measurements and the device model. These fields were reviewed for inaccuracies or inconsistencies and corrected with information from the original article. (A short sketch of the standardization step follows this list.)
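As referenced above, the following is a minimal sketch of the standardization and missing-value steps, assuming pandas and a hand-built mapping of device-name variants; the mapping, column names, and file name are illustrative and would grow as new variants appear during cleaning.

```python
# Sketch of standardizing device-name variants during data cleaning.
# The mapping, column names, and file name are illustrative assumptions.
import pandas as pd

DEVICE_NAME_MAP = {
    "activPAL3": "activPAL",
    "ActivPAL3": "activPAL",
    "activPAL3c": "activPAL",
    "activPAL 3 micro": "activPAL",
}

df = pd.read_csv("llm_extractions.csv")
df["devices.name"] = df["devices.name"].replace(DEVICE_NAME_MAP)

# Flag rows that still miss key fields, so a human reviewer can locate the
# missing information in the original article.
missing_model = df[df["devicemodel"].isna()]
print(f"{len(missing_model)} articles need a manual device-model check")
```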
4.5.3. Revised Human-Only Decision
5. Discussion
5.1. How Can Prompt Engineering Be Optimized to Leverage the Capabilities of LLMs in Extracting Relevant Information from Full-Text Articles in SRs?
5.2. What Is the Impact of the Hybrid Workflow in Conducting SRs?
5.3. Comparison with the Human-Only Approach
- (i) Conventional SR methodology might include a snowballing strategy, a technique used to identify additional relevant studies that may not have been captured by the initial search strategy [13]. However, this manual process can be time-consuming and labor-intensive, especially for SRs that deal with a large volume of studies [33].
- (ii) The hybrid workflow also reduces the chance of bias in selecting or reading articles. When the process is completed manually, researchers may pay less attention to certain articles based on their meta-information. In the proposed hybrid workflow, this is not possible, as all the articles are handed over to the LLM, which treats them all equally.
5.4. Limitations and Future Work
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Moher, D.; Liberati, A.; Tetzlaff, J.; Altman, D.G.; Group, P. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. Int. J. Surg. 2010, 8, 336–341. [Google Scholar] [CrossRef] [PubMed]
- Chalmers, I.; Haynes, B. Reporting, updating, and correcting systematic reviews of the effects of health care. BMJ 1994, 309, 862–865. [Google Scholar] [CrossRef] [PubMed]
- Higgins, J.P.T.; Green, S. Cochrane Handbook for Systematic Reviews of Interventions; Wiley: Hoboken, NJ, USA, 2008. [Google Scholar]
- Robinson, K.A.; Whitlock, E.P.; Oneil, M.E.; Anderson, J.K.; Hartling, L.; Dryden, D.M.; Butler, M.; Newberry, S.J.; McPheeters, M.; Berkman, N.D.; et al. Integration of existing systematic reviews into new reviews: Identification of guidance needs. Syst. Rev. 2014, 3, 60. [Google Scholar] [CrossRef] [PubMed]
- Ahn, E.; Kang, H. Introduction to systematic review and meta-analysis. Korean J. Anesthesiol. 2018, 71, 103–112. [Google Scholar] [CrossRef] [PubMed]
- Liberati, A.; Altman, D.G.; Tetzlaff, J.; Mulrow, C.; Gotzsche, P.C.; Ioannidis, J.P.; Clarke, M.; Devereaux, P.J.; Kleijnen, J.; Moher, D. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate healthcare interventions: Explanation and elaboration. BMJ 2009, 339, b2700. [Google Scholar] [CrossRef] [PubMed]
- Borah, R.; Brown, A.W.; Capers, P.L.; Kaiser, K.A. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open 2017, 7, e012545. [Google Scholar] [CrossRef] [PubMed]
- Michelson, M.; Reuter, K. The significant cost of systematic reviews and meta-analyses: A call for greater involvement of machine learning to assess the promise of clinical trials. Contemp. Clin. Trials. Commun. 2019, 16, 100443. [Google Scholar] [CrossRef] [PubMed]
- Khraisha, Q.; Put, S.; Kappenberg, J.; Warraitch, A.; Hadfield, K. Can large language models replace humans in the systematic review process? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. arXiv 2023, arXiv:2310.17526. [Google Scholar] [CrossRef]
- Syriani, E.; David, I.; Kumar, G. Assessing the ability of ChatGPT to screen articles for systematic reviews. arXiv 2023, arXiv:2307.06464. [Google Scholar] [CrossRef]
- Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.J.; Horsley, T.; Weeks, L.; et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann. Intern. Med. 2018, 169, 467–473. [Google Scholar] [CrossRef] [PubMed]
- Goodyear-Smith, F.A.; van Driel, M.L.; Arroll, B.; Del Mar, C. Analysis of decisions made in meta-analyses of depression screening and the risk of confirmation bias: A case study. BMC Med. Res. Methodol. 2012, 12, 76. [Google Scholar] [CrossRef] [PubMed]
- Tsafnat, G.; Glasziou, P.; Choong, M.K.; Dunn, A.; Galgani, F.; Coiera, E. Systematic review automation technologies. Syst. Rev. 2014, 3, 74. [Google Scholar] [CrossRef] [PubMed]
- Aromataris, E.; Fernandez, R.; Godfrey, C.M.; Holly, C.; Khalil, H.; Tungpunkom, P. Summarizing systematic reviews: Methodological development, conduct and reporting of an umbrella review approach. Int. J. Evid. Based Healthc. 2015, 13, 132–140. [Google Scholar] [CrossRef] [PubMed]
- Meline, T. Selecting studies for systemic review: Inclusion and exclusion criteria. Contemp. Issues Commun. Sci. Disord. 2006, 33, 21–27. [Google Scholar] [CrossRef]
- Bannach-Brown, A.; Przybyła, P.; Thomas, J.; Rice, A.S.C.; Ananiadou, S.; Liao, J.; Macleod, M.R. Machine learning algorithms for systematic review: Reducing workload in a preclinical review of animal studies and reducing human screening error. Syst. Rev. 2019, 8, 23. [Google Scholar] [CrossRef] [PubMed]
- Yu, Z.; Menzies, T. FAST2: An intelligent assistant for finding relevant papers. Expert Syst. Appl. 2019, 120, 57–71. [Google Scholar] [CrossRef]
- van de Schoot, R.; de Bruin, J.; Schram, R.; Zahedi, P.; de Boer, J.; Weijdema, F.; Kramer, B.; Huijts, M.; Hoogerwerf, M.; Ferdinands, G.; et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat. Mach. Intell. 2021, 3, 125–133. [Google Scholar] [CrossRef]
- Marshall, I.J.; Wallace, B.C. Toward systematic review automation: A practical guide to using machine learning tools in research synthesis. Syst. Rev. 2019, 8, 163. [Google Scholar] [CrossRef] [PubMed]
- Alshami, A.; Elsayed, M.; Ali, E.; Eltoukhy, A.E.E.; Zayed, T. Harnessing the power of ChatGPT for automating systematic review process: Methodology, case study, limitations, and future directions. Systems 2023, 11, 351. [Google Scholar] [CrossRef]
- Qureshi, R.; Shaughnessy, D.; Gill, K.A.R.; Robinson, K.A.; Li, T.; Agai, E. Are ChatGPT and large language models “the answer” to bringing us closer to systematic review automation? Syst. Rev. 2023, 12, 72. [Google Scholar] [CrossRef] [PubMed]
- Guo, E.; Gupta, M.; Deng, J.; Park, Y.J.; Paget, M.; Naugler, C. Automated paper screening for clinical reviews using large language models: Data analysis study. J. Med. Internet Res. 2024, 26, e48996. [Google Scholar] [CrossRef] [PubMed]
- van Dijk, S.H.B.; Brusse-Keizer, M.G.J.; Bucsán, C.C.; van der Palen, J.; Doggen, C.J.M.; Lenferink, A. Artificial intelligence in systematic reviews: Promising when appropriately used. BMJ Open 2023, 13, e072254. [Google Scholar] [CrossRef] [PubMed]
- de la Torre-López, J.; Ramírez, A.; Romero, J.R. Artificial intelligence to automate the systematic review of scientific literature. Computing 2023, 105, 2171–2194. [Google Scholar] [CrossRef]
- Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652. [Google Scholar]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
- Min, S.; Lyu, X.; Holtzman, A.; Artetxe, M.; Lewis, M.; Hajishirzi, H.; Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? arXiv 2022, arXiv:2202.12837. [Google Scholar]
- Chu, X.; Ilyas, I.F.; Krishnan, S.; Wang, J. Data cleaning. In Proceedings of the 2016 International Conference on Management of Data, New York, NY, USA, 26 June–1 July 2016; pp. 2201–2206. [Google Scholar]
- Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef] [PubMed]
- Lusa, S.; Punakallio, A.; Manttari, S.; Korkiakangas, E.; Oksa, J.; Oksanen, T.; Laitinen, J. Interventions to promote work ability by increasing sedentary workers’ physical activity at workplaces—A scoping review. Appl. Ergon. 2020, 82, 102962. [Google Scholar] [CrossRef] [PubMed]
- Wei, J.; Wei, J.; Tay, Y.; Tran, D.; Webson, A.; Lu, Y.; Chen, X.; Liu, H.; Huang, D.; Zhou, D. Larger language models do in-context learning differently. arXiv 2023, arXiv:2303.03846. [Google Scholar]
- Gemini Team; Anil, R.; Borgeaud, S.; Wu, Y.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; et al. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805. [Google Scholar] [CrossRef]
- Horsley, T.; Dingwall, O.; Sampson, M. Checking reference lists to find additional studies for systematic reviews. Cochrane Database Syst. Rev. 2011. [Google Scholar] [CrossRef] [PubMed]
- AMSTAR Checklist. Available online: https://amstar.ca/Amstar_Checklist.php (accessed on 19 March 2024).
Section of a Candidate Article | Verifier | Identifier | Data Fields |
---|---|---|---|
Abstract | High | Possible | Possible |
Introduction | High | Possible | Unlikely |
Literature review | Unlikely | Unlikely | Unlikely |
Methodology | Unlikely | High | High |
Experimental setup | Unlikely | High | High |
Results | Unlikely | Unlikely | High |
Discussions | Possible | Unlikely | Unlikely |
Number of Articles (k) | Time to Process
---|---
50 articles | 7.5 min |
25 articles | 3.5 min |
10 articles | 1.5 min |
Prompted Output (k = 45) | Type | Initial Prompt: x | Initial Prompt: y | Initial Prompt: z | Refined Prompt: x | Refined Prompt: y | Refined Prompt: z
---|---|---|---|---|---|---|---
“Occupation” | Identifier | 21 | 2 | 11 | 18 | 0 | 0
“Study Type” | Identifier | 27 | 12 | 22 | 4 | 2 | 4
“Work-Related Outcome” | Identifier | 24 | 2 | 22 | 18 | 0 | 0
“Device.Name” | Data Field | 25 | 1 | 20 | 34 | 0 | 0
“Device.Placement” | Data Field | 25 | 7 | 21 | 23 | 1 | 7
Prompt | Type | Average | Std. Deviation |
---|---|---|---|
objective | verifier | 184.90 | 61.29 |
method | verifier | 294.81 | 108.76 |
result | verifier | 291.91 | 121.40 |
occupation | identifier | 18.49 | 11.88 |
movementVariables | identifier | 15.53 | 9.45 |
devices.name | identifier, data field | 15.52 | 9.47 |
devices.sensor | identifier, data field | 17.29 | 20.47 |
(n = 390) | LLM-Assisted Hybrid Workflow: Correct | LLM-Assisted Hybrid Workflow: Incorrect | Traditional Workflow: Correct | Traditional Workflow: Incorrect
---|---|---|---|---
Inclusion | 170 | 0 | 167 | 3
Exclusion | 220 | 0 | 217 | 3