Review
Peer-Review Record

Large Language Models in Oncology: Revolution or Cause for Concern?

Curr. Oncol. 2024, 31(4), 1817-1830; https://doi.org/10.3390/curroncol31040137
by Aydin Caglayan 1, Wojciech Slusarczyk 2, Rukhshana Dina Rabbani 1, Aruni Ghose 1,3,4,5,6, Vasileios Papadopoulos 7 and Stergios Boussios 1,2,8,9,10,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 29 January 2024 / Revised: 13 March 2024 / Accepted: 29 March 2024 / Published: 29 March 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper reviews the progress of LLMs and discusses their impact on oncology, including the revolutions they may bring and the problems worth paying attention to. LLM and AIGC technologies will surely cause revolutionary changes across medical diagnosis and treatment. Existing research has shown that AI may even replace doctors in many respects. On the whole, this paper conducts a wide literature analysis on this very meaningful topic and offers valuable thinking, providing significant insight into this issue. However, some aspects still need further thought and improvement, as follows:

1. The paper lacks an in-depth analysis of the technical principles of LLMs, as well as their differences from human thinking and reasoning. Only by deeply understanding the differences at the mechanism level can we make an accurate judgement about future development trends.

2. As a literature review study, it is also necessary to adopt a scientific research method, such as meta-analysis. The research method of this paper is not very clear.

3. The analysis and discussion are not comprehensive and in-depth enough. The revolution brought by LLMs and the new problems worthy of concern extend far beyond the aspects mentioned in the paper. It is suggested that a sound logical framework be established for a more comprehensive and in-depth analysis.

Author Response

Dear Editor and Reviewers,

I am pleased to resubmit for publication the revised version of manuscript curroncol-2872681, entitled “Large Language Models in Oncology—Revolution or Cause for Concern?”.

We hope you agree that the revised version further strengthens our comprehensive review. All reviewer comments have been addressed in this version of the manuscript, as detailed in this point-by-point response.

All corresponding changes are marked in red in the manuscript.

Reviewer #1:

  • General comment:

“This paper reviews the progress of LLMs and discusses their impact on oncology, including the revolutions they may bring and the problems worth paying attention to. LLM and AIGC technologies will surely cause revolutionary changes across medical diagnosis and treatment. Existing research has shown that AI may even replace doctors in many respects. On the whole, this paper conducts a wide literature analysis on this very meaningful topic and offers valuable thinking, providing significant insight into this issue. However, some aspects still need further thought and improvement, as follows:”

Response:

We appreciate the opportunity to have our manuscript considered for publication.

 

  • Specific comments:

1. The paper lacks an in-depth analysis of the technical principles of LLMs, as well as their differences from human thinking and reasoning. Only by deeply understanding the differences at the mechanism level can we make an accurate judgement about future development trends.

Response:

Thank you for your comment. We have added greater detail to Section 3, Large Language Model Function. Notably, we have gone into further detail on how DL algorithms work, and we provide a new Figure 1 to help readers picture ANN architecture. We delve further into LLM mechanisms and note zero-shot and few-shot learning, as well as fine-tuning, in the role of LLM training. We also now touch on the complex question of whether LLMs truly understand natural language, and thus appreciate the physical and social scenarios which language can describe, when considering how LLMs differ from human thinking and reasoning. However, we believe that expanding further on this concept is beyond the scope of this paper.

 

2. As a literature review study, it is also necessary to adopt a scientific research method, such as meta-analysis. The research method of this paper is not very clear.

Response:

Thank you for your comment.

We have now added section “2. Methods”, as you kindly requested.

 

3. The analysis and discussion are not comprehensive and in-depth enough. The revolution brought by LLMs and the new problems worthy of concern extend far beyond the aspects mentioned in the paper. It is suggested that a sound logical framework be established for a more comprehensive and in-depth analysis.

Response:

Thank you for your comment.

In section 4. A Cause for Revolution we have made the following changes:

In section 4.1. Oncology Clinical Practice we have added the role of LLMs in supporting oncologists with documentation and administrative duties. The potential of integration of voice-to-text technology with LLMs has also been added.

We introduce a new section 4.3. Educating Students and Healthcare Professionals in Oncology, where we consider LLM application as an education tool for professionals and students including its role in supporting personalized learning.

In section 4.4. Oncology Research (previously section 3.3), we add further detail to the data extraction ability of LLMs and consider domain specific LLMs including BioMedLM and BioGPT, as well as considering customizable models. We also discuss in detail the ‘advanced data analysis’ feature of ChatGPT-4.0 as per reviewer 2, in addition to its advantages when used in the oncology setting.

Consequently, we now consider the revolutionary aspects of LLMs in oncology, including the following:

4.1. Oncology Clinical Practice: diagnostic workup, including radiology image/laboratory analysis; screening; data extraction from electronic health records; support for pathologists; NGS panel analysis; a ‘virtual assistant’ role for oncologists in treatment planning and management; and documentation and administrative duties

4.2. Cancer Patient Support and Education: a ‘virtual assistant’ for patients, disease understanding, medical information delivery, emotional support, and caregiver support

4.3. Educating Students and Healthcare Professionals in Oncology: facilitating the learning process and a personalised learning experience, and touching on the gamification process

4.4. Oncology Research: academic writing; the research process, i.e. evidence synthesis and data extraction (including advanced data analysis support); queries in systematic reviews; data analysis; writing support; consideration of domain-specific LLMs; customisable models; and supporting non-native English speakers

In section 5. A Cause for Concern we have made the following changes:

We have separated the previous Section 4.1, Data Accuracy and Accountability, into two respective sections.

In Section 5.1, Data Accuracy, we have added detail on how LLM hallucinations can be mitigated, as per reviewer 3, including data-related methods as well as modelling and inference methods. We have also added detail on the use of prompt engineering to improve data accuracy, as per reviewer 3, in addition to the practical challenges in applying prompt engineering.

We have also further enhanced Section 5.4, Research Integrity (formerly Section 4.3), by adding detail on the problems of limited LLM transparency, referencing issues, the risk of academic fraud, and the impact on critical thinking.

Consequently, we now consider the concerning aspects of LLMs in oncology, including the following:

5.1. Data Accuracy: hallucinations and suboptimal outputs; the need for sufficient oversight; training datasets (up to date, non-discriminatory, avoiding bias; the importance of the EDI concept in training); prompt engineering and its challenges

5.2. Accountability: concerns over data accountability and responsibility; application of the concepts of meaningful human control and level of automation; lack of current legislation

5.3. Data Security: confidentiality/data breaches, i.e. compromise and manipulation of patient data; cyber-attacks; a note on legislation

5.4. Research Integrity: plagiarism and author misrepresentation; limited transparency of data provided (black-box issues); compromised research output/reproducibility; referencing issues; academic fraud in research; impact on critical thinking; journals' stance on LLMs in supporting publications

Reviewer 2 Report

Comments and Suggestions for Authors

Congratulations to the author for this interesting and needed review!

The paper gives a good overview of the current state of AI models like ChatGPT, their strengths and limitations, practical applications in clinical practice and research, judicial issues, and ethical concerns. The paper is of interest for a broad range of audiences also beyond the oncological setting.

Some aspects should be highlighted more/made clearer:

· ChatGPT 4.0 performance is superior to 3.5, as shown by recent publications. ChatGPT is not specifically trained on medical literature such as PubMed, and future improvements (to ChatGPT or other models) are likely to improve performance and consistency.

· One of the latest improvements of ChatGPT 4.0, the “Data Analyst” model, with many interesting features, should be mentioned and explained in section “3.3 Oncology Research”. The model is capable of performing most statistical analyses, can provide graphs, make statistical analysis plans, and much more. Different file types, preferably CSV but also Stata or SPSS, can be uploaded. In addition to simply performing the analyses based on free-text prompts, it also provides the corresponding Python code to reproduce the analyses. As a safety feature, uploaded data in memory will be deleted after the session has timed out or after some time; in this case, to continue with the analyses, the data must be uploaded once more. After being given a free-text explanation of the data, the “Data Analyst” can also make its own suggestions on how to analyze the provided data and which methods to use. In my own experience, the Data Analyst performs much better than ChatGPT for statistical analysis.

· As a research tool, AI models perform best in the hands of an experienced researcher with clear expectations of the outcome. In this case, possible flaws in the output are detected immediately, and the prompt can be modified to improve the response. Used in this way, AI models can be a real time saver.

· It should also be mentioned that ChatGPT now allows customizable models provided by the community. This will probably accelerate the development of tailored solutions; for example, there are already several models specifically for searching and using PubMed. However, systematic testing and quality control of these models is still missing or unclear.

Comments on the Quality of English Language

Overall English language is fine. Only a few minimal issues.

Author Response

Reviewer #2:

  • General comment:

Congratulations to the author for this interesting and needed review!

The paper gives a good overview of the current state of AI models like ChatGPT, their strengths and limitations, practical applications in clinical practice and research, judicial issues, and ethical concerns. The paper is of interest for a broad range of audiences also beyond the oncological setting.

Some aspects should be highlighted more/made clearer:

Response:

Thank you for your positive reinforcement and constructive feedback. We appreciate the opportunity to revise our work for consideration for publication.

 

  • Specific comments:

· ChatGPT 4.0 performance is superior to 3.5, as shown by recent publications. ChatGPT is not specifically trained on medical literature such as PubMed, and future improvements (to ChatGPT or other models) are likely to improve performance and consistency.

Response:

Thank you for your comment.

We note the superiority of ChatGPT 4.0 as seen below:

“Its most recent release, GPT-4, has been speculated to have over 100 trillion parameters, as well as the ability to process both text and image input, which is superior to GPT-3.5.”

We have also added detail on domain-specific LLMs, such as BioMedLM and BioGPT, in Section 4.4, Oncology Research, noting that they are trained on data from the biomedical literature on PubMed and can be fine-tuned with gold-standard oncology corpora.

 

· One of the latest improvements of ChatGPT 4.0, the “Data Analyst” model, with many interesting features, should be mentioned and explained in section “3.3 Oncology Research”. The model is capable of performing most statistical analyses, can provide graphs, make statistical analysis plans, and much more. Different file types, preferably CSV but also Stata or SPSS, can be uploaded. In addition to simply performing the analyses based on free-text prompts, it also provides the corresponding Python code to reproduce the analyses. As a safety feature, uploaded data in memory will be deleted after the session has timed out or after some time; in this case, to continue with the analyses, the data must be uploaded once more. After being given a free-text explanation of the data, the “Data Analyst” can also make its own suggestions on how to analyze the provided data and which methods to use. In my own experience, the Data Analyst performs much better than ChatGPT for statistical analysis.

Response:

Thank you for your comment.

We have added to Section 4.4, Oncology Research, as seen below:

“Notably, OpenAI has introduced an ‘advanced data analysis’ feature, available in ChatGPT-4.0, which can further eliminate barriers that researchers may face with data analysis [10]. This model can support a variety of data and program files. As well as performing statistical analysis when prompted, the corresponding Python code is also provided, allowing for reproducible data analysis. Thus, appropriate oversight can be maintained, and the code can be modified as required to improve data output. Suggestions are also offered for further data manipulation.”
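To illustrate the kind of reproducible Python code such a feature can return alongside its results, here is a minimal sketch of a hypothetical session output; the dataset, variable names, and values are invented for illustration and do not come from the manuscript:

```python
import statistics

# Hypothetical example: summary code an LLM 'advanced data analysis'
# feature might emit so the analysis can be re-run outside the chat session.
tumour_size_arm_a = [2.1, 3.4, 2.8, 3.0, 2.5]  # cm, invented values
tumour_size_arm_b = [3.9, 4.2, 3.7, 4.5, 4.0]  # cm, invented values

def summarise(values):
    """Return the sample size, mean, and sample standard deviation."""
    return {
        "n": len(values),
        "mean": round(statistics.mean(values), 2),
        "sd": round(statistics.stdev(values), 2),
    }

summary_a = summarise(tumour_size_arm_a)
summary_b = summarise(tumour_size_arm_b)
print("Arm A:", summary_a)
print("Arm B:", summary_b)
```

Because the code itself is returned rather than only the numbers, the analysis can be re-run and audited outside the chat session, which supports the oversight point made in the quoted passage.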

 

· As a research tool, AI models perform best in the hands of an experienced researcher with clear expectations of the outcome. In this case, possible flaws in the output are detected immediately, and the prompt can be modified to improve the response. Used in this way, AI models can be a real time saver.

Response:

Thank you for your comment.

We also note the advantages of LLMs in supporting data analysis in Section 4.4, Oncology Research, as seen below:

“Easy access to such powerful AI tools in oncology research can dismantle barriers researchers may face, in addition to improving the efficiency of data manipulation, thus facilitating further cancer data exploration, coding, and the tackling of empirical problems in oncology.”

We also now discuss prompt engineering in section 5.1. Data Accuracy to improve response and desired output, as well as methods of reducing occurrence of hallucinations.

 

· It should also be mentioned that ChatGPT now allows customizable models provided by the community. This will probably accelerate the development of tailored solutions; for example, there are already several models specifically for searching and using PubMed. However, systematic testing and quality control of these models is still missing or unclear.

Response:

Thank you for your comment.

We have added to Section 4.4, Oncology Research, as seen below:

“Domain-specific LLMs such as BioMedLM and BioGPT are trained on data from the biomedical literature on PubMed and can be fine-tuned with gold-standard oncology corpora [60,61]. This will facilitate the ability of LLMs to yield high-quality results for extraction tasks in the oncology domain. The release of LLMs with the option of customizable models provided by the community will also likely accelerate the development of tailored solutions and the addressing of oncology domain-specific queries [62].”

Reviewer 3 Report

Comments and Suggestions for Authors

This paper offers an overview of the pros and cons of the use of LLMs in oncology. It is interesting, but lacks some discussion of important concepts in GenAI and LLMs, and focuses too much on ChatGPT.

 

Specific remarks:

- "As of late, novel advances in DL models have gained widespread public prominence and importantly new calls for optimism regarding AI systems" -> It would be good to show some examples of the progress that has been made since ref [4] from 1987. AI is being used very successfully in diagnosis, prognosis, treatment, surgery, drug discovery, etc. See, for example, https://atm.amegroups.org/article/view/104776/html. An obvious example in oncology is the detection of tumors in MR or digital pathology images. This should also be added to the list in lines 65-68.

- Data accuracy: I think the importance of prompt engineering should be stressed here. For example, "one-shot learning" and "few-shot learning" can improve results.

- 4.1: Hallucinations are an issue, but could be overcome by reducing the "temperature" (i.e. the "creativity" of the LLM). This should also be discussed in more detail.

- LLM fine-tuning is not mentioned anywhere in the paper. If fine-tuning for the biomedical domain is done, the LLM can give much better results.

- The paper discusses LLMs mostly in the context of ChatGPT, but there are also biomedical LLMs such as BioMedLM. See e.g. https://link.springer.com/article/10.1186/s12859-023-05411-z for a list of biomedical LLMs. These should produce much more reliable results than ChatGPT, which is not meant for assisting medical professionals.

Comments on the Quality of English Language

There are some textual errors, e.g.:

- line 55-56: "Large language models [...] has" -> "Large language models [...] have"

- line 82: "mutli-layer" -> "multi-layer"

Author Response


 

Reviewer #3:

  • General comment:

“This paper offers an overview of the pros and cons of the use of LLMs in oncology. It is interesting, but lacks some discussion of important concepts in GenAI and LLMs, and focuses too much on ChatGPT.”

Response:

Thank you for your positive reinforcement and constructive feedback. We have now added the LLMs BioMedLM, BioGPT, and Flan-PaLM to our discussion. OpenAI's ChatGPT remains one of, if not the, most well-known LLM/AI systems at present. We acknowledge that we have referred to ChatGPT more than to other LLMs; however, our literature search noted a surge of publications on it, as well as its predominance in the literature considering LLM use in the oncology/medical setting. We have also followed reviewer 2 in considering ChatGPT-4.0's data analyst function.

 

  • Specific comments:

- "As of late, novel advances in DL models have gained widespread public prominence and importantly new calls for optimism regarding AI systems" -> It would be good to show some examples of the progress that has been made since ref [4] from 1987. AI is being used very successfully in diagnosis, prognosis, treatment, surgery, drug discovery, etc. See, for example, https://atm.amegroups.org/article/view/104776/html. An obvious example in oncology is the detection of tumors in MR or digital pathology images. This should also be added to the list in lines 65-68.

Response:

Thank you for your comment.

We have added to Section 1, Introduction, as seen below:

“AI's remarkable success has been noted broadly in the medical field, in disease diagnosis, treatment, and prognosis. Notable examples include the analysis of medical imaging, extending to the interpretation of ECGs, pathology slides, ophthalmic images, and dermatological conditions, as well as its application in surgery through preoperative planning, intraoperative guidance, and surgical robotics [7,8].”

“Oncology is entering a new age in which the role of AI is no longer a theoretical possibility but a reality, with its approval for use in diverse clinical scenarios, from cancer diagnostics and computer vision, including tumour detection in medical imaging and digital histopathology, to anticancer drug development and discovery with AI-driven target identification [14–16].”

 

- Data accuracy: I think the importance of prompt engineering should be stressed here. For example, "one-shot learning" and "few-shot learning" can improve results.

Response:

Thank you for your comment.

We consider prompt engineering and its challenges in detail, as seen below, in Section 5.4, Research Integrity:

In order to mitigate concerns regarding the accuracy of data output and to positively influence LLM performance in the oncology setting, prompt engineering can be leveraged; this is a new field of research concerned with the development and refinement of prompt wording to optimise LLM output [75]. Thus, prompt engineering will be an important emerging skill for users of LLMs, patients and oncologists alike. Different styles and types of prompts can be utilised. For example, in zero-shot prompts, the LLM is expected to perform a task it has not been specifically trained on, hence without exposure to previous examples [76]. Few-shot prompts involve task completion where the LLM has previously been exposed to only a few initial examples; thus, the task is completed with appropriate generalisation to unseen examples [77]. Notably, Singhal et al. were able to demonstrate the effectiveness of prompt engineering strategies by improving the output accuracy of the LLM Flan-PaLM in answering USMLE-style questions through chain-of-thought, few-shot, and self-consistency prompting strategies [78]. Overall, adequately engineered prompts will be key to maximising the performance of LLMs, as well as to reducing unsatisfactory responses in the oncology setting.

In practice, however, challenges remain in the application of prompt engineering, including prompt robustness and transferability [79]. Thus, when prompts are used in the oncology domain, patients and oncologists may receive different responses even if the same prompt framework is used [80]. Additionally, given that prompt engineering performance depends on the inherent capabilities of individual LLMs, prompt strategies deemed effective for one LLM may not be appropriate for another [80]. Appropriate guidance will need to be developed to ensure that suitable prompt strategies are used to guide LLM output for various tasks in the oncology domain. It will also be important for oncologists and patients to be involved in the development of human evaluation frameworks and LLM response evaluation frameworks, enabling researchers to measure progress as well as to identify and mitigate potential harm [78].

We note “one-shot learning” and “few-shot learning” in Section 2, Large Language Model Function, as seen below:

“Zero-shot and self-supervised learning methods are used to facilitate the correct use of grammar, semantics, and conceptual relationships.”
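As a concrete illustration of the zero-shot and few-shot prompt styles discussed in this response, the following sketch assembles both prompt types as plain strings; the classification task, labels, and example notes are hypothetical and chosen only for illustration:

```python
def build_prompt(task, query, examples=None):
    """Assemble a zero-shot prompt, or a few-shot prompt when examples are given."""
    lines = [task]
    for question, answer in (examples or []):
        # Few-shot: show the model solved examples before the real query.
        lines.append(f"Q: {question}\nA: {answer}")
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

task = "Classify the oncology note as 'benign' or 'malignant'."

# Zero-shot: no worked examples; the model must generalise from the instruction alone.
zero_shot = build_prompt(task, "Biopsy shows invasive ductal carcinoma.")

# Few-shot: a few invented labelled examples demonstrate the expected answer format.
few_shot = build_prompt(
    task,
    "Biopsy shows invasive ductal carcinoma.",
    examples=[
        ("Imaging shows a simple renal cyst.", "benign"),
        ("Histology confirms adenocarcinoma.", "malignant"),
    ],
)
print(zero_shot)
print(few_shot)
```

The few-shot variant differs only in the worked examples it prepends, which is the mechanism by which few-shot prompting guides the model's output format and accuracy.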

 

- 4.1: Hallucinations are an issue, but could be overcome by reducing the "temperature" (i.e. the "creativity" of the LLM). This should also be discussed in more detail.

Response:

Thank you for your comment.

We have added to Section 5.1, Data Accuracy, as seen below:

It should be noted that different strategies exist to overcome LLM hallucinations, which can be separated into two categories: data-related methods, and modelling and inference methods [66]. Data-related methods include ensuring that high-quality cancer data are used for pretraining LLMs. Fine-tuning can also be utilised, adapting the LLM to oncology-specific domains [67]. Retrieval-augmented generation is a framework that can further reduce the risk of hallucinations by grounding LLMs with knowledge from external reference textual databases [68]. Modelling and inference methods include reinforcement learning from human feedback, which involves a human evaluator ranking LLM output quality [69]. Appropriate prompt strategies, notably chain-of-thought prompting, which uses a stepwise approach and aggregates LLM output, can reduce incorrect responses by encouraging LLMs to reason before arriving at an answer [70]. The sampling temperature of LLMs, which guides the randomness of output, can also be adjusted. It is a scalar value, typically from 0.0 to 1.0, that adjusts the probability distribution of subsequent word selection in LLM output. The higher the temperature, the more random and ‘creative’ the output will be. Conversely, lower temperatures will result in more deterministic, repetitive, and focussed outputs, in line with patterns from cancer training data [71]. When used in the oncology clinical setting, appropriate temperatures for optimal LLM output will need to be established, and a variety of methods will need to be harnessed to reduce and avoid hallucinations when LLMs are used in the oncology domain and beyond.
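The temperature mechanism described in this passage can be sketched as a temperature-scaled softmax over candidate next-token scores. The scores below are invented for illustration; real LLMs apply this over their entire vocabulary:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw next-token scores into probabilities, scaled by temperature."""
    # Note: exactly 0.0 would divide by zero; implementations treat that
    # limit as greedy argmax selection instead.
    scaled = [score / temperature for score in logits]
    max_s = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens

low_t = softmax_with_temperature(logits, 0.2)   # near-deterministic
high_t = softmax_with_temperature(logits, 1.0)  # more 'creative'

# Lower temperature concentrates probability on the top-scoring token.
print(round(low_t[0], 3), round(high_t[0], 3))
```

At a temperature near zero the distribution collapses towards the single highest-scoring token, giving deterministic output, while a temperature of 1.0 leaves the model's learned distribution unchanged, which is the "randomness versus focus" trade-off the quoted passage describes.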

 

- LLM fine-tuning is not mentioned anywhere in the paper. If fine-tuning for the biomedical domain is done, the LLM can give much better results.

Response:

Thank you for your comment.

We have now included fine-tuning at multiple relevant points in our paper.

We also note, as seen in Section 4.4, Oncology Research:

“The data extraction ability of LLMs can also be enhanced through fine-tuning. This includes pre-trained LLMs in both generative and discriminative settings, i.e. they can generate responses to a question when prompted in a given context and can classify input data into predefined labels [59]. Domain-specific LLMs such as BioMedLM and BioGPT are trained on data from the biomedical literature on PubMed and can be fine-tuned with gold-standard oncology corpora [60,61]. This will facilitate the ability of LLMs to yield high-quality results for extraction tasks in the oncology domain.”

 

- The paper discusses LLMs mostly in the context of ChatGPT, but there are also biomedical LLMs such as BioMedLM. See e.g. https://link.springer.com/article/10.1186/s12859-023-05411-z for a list of biomedical LLMs. These should produce much more reliable results than ChatGPT, which is not meant for assisting medical professionals.

Response:

Thank you for your comment.

We have added, as seen in Section 4.4, Oncology Research:

“The data extraction ability of LLMs can also be enhanced through fine-tuning. This includes pre-trained LLMs in both generative and discriminative settings, i.e. they can generate responses to a question when prompted in a given context and can classify input data into predefined labels [59]. Domain-specific LLMs such as BioMedLM and BioGPT are trained on data from the biomedical literature on PubMed and can be fine-tuned with gold-standard oncology corpora [60,61]. This will facilitate the ability of LLMs to yield high-quality results for extraction tasks in the oncology domain.”

 

  • Comments on the Quality of English Language:

There are some textual errors, e.g.:

- line 55-56: "Large language models [...] has" -> "Large language models [...] have"

Response:

Thank you for your comment.

We have appropriately amended this.

 

- line 82: "mutli-layer" -> "multi-layer"

Response:

Thank you for your comment.

We have appropriately amended this.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The issues I raised have been addressed in this revised manuscript, and I have no further questions.

Reviewer 3 Report

Comments and Suggestions for Authors

All comments have been addressed.
