ChatGPT in Occupational Medicine: A Comparative Study with Human Experts
Abstract
:1. Introduction
2. Materials and Methods
2.1. Questions and Reference Answers Generation
2.2. Bot-Generated Answers
2.3. Answers Evaluation
- Accuracy: the answer provides correct and precise information without containing errors or inaccuracies;
- Precision: the answer is concise and to the point, without being overly verbose or ambiguous;
- Completeness: the answer provides all the necessary information to address the question;
- Usability: the answer is easy to understand and use for the target audience, which may consist of workers, managers, or safety professionals;
- Relevance: was the answer appropriate and relevant to the question asked.
2.4. Evaluation Process and Criteria
2.5. Error Analysis in ChatGPT + Context Answers
2.6. Error Analysis in ChatGPT + Context Answers
3. Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Appendix A.1. ChatGPT-Generated Answers
Language | Question | Answer |
---|---|---|
Italian | Secondo il D.lgs 81/08 chi è il lavoratore video terminalista? | Il lavoratore che utilizza un’attrezzatura munita di videoterminali, in modo sistematico o abituale, per venti ore settimanali, dedotte le interruzioni. |
English | According to legislative decree 81/08, who is the video terminal worker? | A worker who systematically or habitually uses video terminal equipment for twenty hours per week, excluding breaks. |
Appendix A.2. ChatGPT + Context-Generated Answers
- Parse the original document into chunks of text. We parsed the document using the ‘UnstructuredPDF’ library (version 0.5.11) to extract raw text from the document. We then split the documents into several chunks of N tokens. The chunks’ size is N, a system hyperparameter to optimize later;
- Create text embeddings for the documents in the search corpus. Each document in the corpus is represented as a numerical vector using text embeddings. We used the OpenAI Embeddings API using the ‘text-embedding-ada-002’ model;
- Create a query embedding: the user’s search query is also converted as before and represented as a numerical vector using text embeddings;
- Calculate the similarity between the query embedding and the document embeddings. We used cosine similarity, which is often used to measure the similarity between two embedding vectors. For each document in the corpus, the cosine similarity between the query embedding and the document embedding is calculated;
- Rank the documents by similarity: The documents are ranked in descending order of similarity to the query. The most similar documents are returned as the search results.
Appendix A.3. ChatGPT System Message
Language | ChatGPT | ChatGPT + Context |
---|---|---|
Italian | Voglio che tu agisca come un assistente di un medico specialista in medicina del lavoro in Italia. Usa il decreto legislativo 81/08 per rispondere alle domande. | Voglio che tu agisca come un assistente di un medico specialista in medicina del lavoro in Italia. Usa il decreto legislativo 81/08 ed il seguante contesto normativo per rispondere alla seguente domanda. Contesto normativo: {context} |
English | I want you to act as an assistant to a specialist doctor in occupational medicine in Italy. Use legislative decree 81/08 to answer the questions. | I want you to act as an assistant to a specialist doctor in occupational medicine in Italy. Use legislative decree 81/08 and the following legislative context to answer the questions. Legislative context: {context} |
Appendix A.4. Question/Answering Optimization and Prompt Engineering
- Split the text into smaller, semantically meaningful chunks (usually sentences or paragraphs);
- Combine these smaller chunks into larger chunks until a certain size is reached;
- Make that chunk its own piece of text once the size threshold is reached, and then create a new chunk with some overlap to maintain context between the chunks.
Language | Evaluator System Message |
---|---|
Italian | Data la seguente domanda: {question}, la risposta corretta è {reference}. Misura in una scala Likert che va da 1 a 5 la completezza ed accuratezza della risposta dell’utente. Le risposte possono essere non compeltamente uguali a quelle di riferimento ed essere più complete ed accurate ed in quel caso non penalizzarle. Il formato della tua risposta deve essere un JSON parsabile nel seguente formato: Esempio in caso di risposta accurate e completa {“completezza”: 5, “accuratezza”: 5} Esempio in caso di risposta accurata ma non completa {“completezza”: 2, “accuratezza”: 5} Esempio in caso di risposta errata {“completezza”: 1, “accuratezza”: 1} |
English | Given the following question: {question}, the correct answer is {reference}. Measure the completeness and accuracy of the user’s response on a Likert scale ranging from 1 to 5. The answers may not be completely the same as the reference ones and may be more complete and accurate; in that case, do not penalize them. The format of your response must be a parsable JSON in this format. For example, in the case of an accurate and complete response. {“completeness”: 5, “accuracy”: 5} For example, in the case of an accurate but incomplete response. {“completeness”: 2, “accuracy”: 5} For example, in the case of an incorrect answer. {“completeness”: 1, “accuracy”: 1} |
Appendix B. Details of the Web Interface Used for Evaluation
Appendix C. Evaluation Questionnaire
* REQUIRED 1. Which question are you about to evaluate? * question 1 question 2 question 3 question 4 question 5 2. Are there any spelling or grammar errors in the ChatGPT + context answer? If yes, please specify. * 3. Does the ChatGPT + context answer in an inappropriate or offensive tone? * no yes 4. Did you notice any inconsistencies in ChatGPT + context answer? If yes, please specify which ones * 5. Are there any acronyms or Roman numerals in the question that were not understood by ChatGPT + context? * no yes 6. Was the answer provided by ChatGPT + context understandable? If not, please specify why. * 7. Why did you score the answer low? * |
Appendix D. Examples of Good/Bad/Hallucinating Replies from ChatGPT
Language | Italian | English |
---|---|---|
Question | Quale è la definizione di operatore secondo il D. Lgs 81/08? | What is the definition of an operator according to legislative decree 81/08? |
Reference answer | L’operatore è definito come il lavoratore incaricato dell’uso di una attrezzatura di lavoro o il datore di lavoro che ne fa uso. | The operator is defined as the worker responsible for the use of work equipment or the employer who uses it. |
ChatGPT + context answer | Secondo l’art. 69, comma 1, lettera (e) del D.Lgs 81/08, l’operatore è il lavoratore incaricato dell’uso di una attrezzatura di lavoro o il datore di lavoro che ne fa uso. | According to Article 69, paragraph 1, letter (e) of legislative decree 81/08, the operator is the worker in charge of using work equipment or the employer who uses it. |
Average ChatGPT + context accuracy | 5.0 | |
Number of evaluations | 4 |
Language | Italian | English |
---|---|---|
Question | A quale classe appartiene il virus dell’Epatite B secondo la classificazione prevista dal D.lgs 81/08? | To which class does the hepatitis B virus belong according to the classification provided by legislative decree 81/08? |
Reference answer | Secondo l’ allegato XLVI—Elenco degli agenti biologici classificati, il virus dell’epatite B è classificato come agente biologico di classe 3. | According to Annex XLVI—List of classified biological agents, the hepatitis B virus is classified as a biological agent of class 3. |
ChatGPT + context answer | Il virus dell’Epatite B appartiene alla classe D (rischio alto) secondo la classificazione prevista dal D.lgs 81/08. | The hepatitis B virus belongs to class D (high risk) according to the classification provided by legislative decree 81/08. |
Average ChatGPT + context accuracy | 1.0 | |
Number of evaluations | 2 |
Language | Italian | English |
---|---|---|
Question | Quale titolo del D.Lgs 81/08 definisce i criteri per la valutazione del rischio da atmosfere iperbariche e microclima? | Which title of legislative decree 81/08 defines the criteria for assessing the risk of hyperbaric atmospheres and microclimates? |
Reference answer | Non esiste un titolo che definisce i criteri per la valutazione del rischio da atmosfere iperbariche e microclima, tuttavia al titolo VIII del D.Lgs 81/08 si parla di valutazione del rischio da agenti fisici ivi compresi microclima ed atmosfere iperbariche. | No title defines the criteria for assessing the risk of hyperbaric atmospheres and microclimates. However, Title VIII of legislative decree 81/08 addresses the risk assessment of physical agents, including microclimates and hyperbaric atmospheres. |
ChatGPT + context answer | Il titolo X del D.Lgs 81/08, “Lavori in ambienti iperbarici e rischio da microclima”, definisce i criteri per la valutazione del rischio da atmosfere iperbariche e microclima. | Title X of legislative decree 81/08, “Work in Hyperbaric Environments and Risk from Microclimate,” defines the criteria for assessing the risk of hyperbaric atmospheres and microclimate. |
Average ChatGPT + context accuracy | 1.0 | |
Number of evaluations | 3 |
References
- Fogel, A.L.; Kvedar, J.C. Artificial intelligence powers digital medicine. NPJ Digit. Med. 2018, 1, 5. [Google Scholar] [CrossRef] [PubMed]
- Moassefi, M.; Faghani, S.; Khosravi, B.; Rouzrokh, P.; Erickson, B.J. Artificial Intelligence in Radiology: Overview of Application Types, Design, and Challenges. Semin. Roentgenol. 2023, 58, 170–177. [Google Scholar] [CrossRef] [PubMed]
- Raghunath, S.; Pfeifer, J.M.; Ulloa-Cerna, A.E.; Nemani, A.; Carbonati, T.; Jing, L.; Vanmaanen, D.P.; Hartzel, D.N.; Ruhl, J.A.; Lagerman, B.F.; et al. Deep Neural Networks Can Predict New-Onset Atrial Fibrillation From the 12-Lead ECG and Help Identify Those at Risk of Atrial Fibrillation-Related Stroke. Circulation 2021, 143, 1287–1298. [Google Scholar] [CrossRef] [PubMed]
- Chen, D.; Liu, J.; Zang, L.; Xiao, T.; Zhang, X.; Li, Z.; Zhu, H.; Gao, W.; Yu, X. Integrated Machine Learning and Bioinformatic Analyses Constructed a Novel Stemness-Related Classifier to Predict Prognosis and Immunotherapy Responses for Hepatocellular Carcinoma Patients. Int. J. Biol. Sci. 2022, 18, 360–373. [Google Scholar] [CrossRef] [PubMed]
- Srinivasu, P.N.; SivaSai, J.G.; Ijaz, M.F.; Bhoi, A.K.; Kim, W.; Kang, J.J. Classification of Skin Disease Using Deep Learning Neural Networks with MobileNet V2 and LSTM. Sensors 2021, 21, 2852. [Google Scholar] [CrossRef] [PubMed]
- Yu, K.H.; Beam, A.L.; Kohane, I.S. Artificial intelligence in healthcare. Nat. Biomed. Eng. 2018, 2, 719–731. [Google Scholar] [CrossRef] [PubMed]
- Haug, C.J.; Drazen, J.M. Artificial Intelligence and Machine Learning in Clinical Medicine. N. Engl. J. Med. 2023, 388, 1201–1208. [Google Scholar] [CrossRef]
- Aung, Y.Y.; Wong, D.C.; Ting, D.S. The promise of artificial intelligence: A review of the opportunities and challenges of artificial intelligence in healthcare. Br. Med. Bull. 2021, 139, 4–15. [Google Scholar] [CrossRef]
- Rajpurkar, P.; Lungren, M.P. The Current and Future State of AI Interpretation of Medical Images. N. Engl. J. Med. 2023, 388, 1981–1990. [Google Scholar] [CrossRef]
- Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. arXiv 2023, arXiv:2307.06435. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. arXiv.1706.03762. [Google Scholar]
- Ziegler, D.M.; Stiennon, N.; Wu, J.; Brown, T.B.; Radford, A.; Amodei, D.; Christiano, P.; Irving, G. Fine-tuning language models from human preferences. arXiv 2019, arXiv:1909.08593. [Google Scholar]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
- Open, A.I. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Lee, P.; Bubeck, S.; Petro, J. Benefits, limits, and risks of GPT-4 as an AI Chatbot for medicine. N. Engl. J. Med. 2023, 388, 1233–1239. [Google Scholar] [CrossRef] [PubMed]
- White, J.; Fu, Q.; Hays, S.; Sandborn, M.; Olea, C.; Gilbert, H.; Elnashar, A.; Spencer-Smith, J.; Schmidt, D.C. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv 2023, arXiv:230211382. [Google Scholar]
- Dahmen, J.; Kayaalp, M.E.; Ollivier, M.; Pareek, A.; Hirschmann, M.T.; Karlsson, J.; Winkler, P.W. Artificial intelligence bot ChatGPT in medical research: The potential game changer as a double-edged sword. Knee Surg. Sports Traumatol. Arthrosc. 2023, 31, 1187–1189. [Google Scholar] [CrossRef]
- Liu, J.; Wang, C.; Liu, S. Utility of ChatGPT in Clinical Practice. J. Med. Internet Res. 2023, 25, e48568. [Google Scholar] [CrossRef]
- Gordijn, B.; Have, H.T. ChatGPT: Evolution or revolution? Med. Health Care Philos. 2023, 26, 1–2. [Google Scholar] [CrossRef]
- Rao, A.S.; Pang, M.; Kim, J.; Kamineni, M.; Lie, W.; Prasad, A.K.; Landman, A.; Dryer, K.; Succi, M.D. Assessing the utility of ChatGPT throughout the entire clinical workflow. medRxiv 2023. 2023-02. [Google Scholar] [CrossRef]
- Hirosawa, T.; Harada, Y.; Yokose, M.; Sakamoto, T.; Kawamura, R.; Shimizu, T. Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study. Int. J. Environ. Res. Public Health 2023, 20, 3378. [Google Scholar] [CrossRef] [PubMed]
- Liu, S.; Wright, A.P.; Patterson, B.L.; Wanderer, J.P.; Turer, R.W.; Nelson, S.D.; McCoy, A.B.; Sittig, D.F.; Wright, A. Assessing the value of ChatGPT for clinical decision support optimization. medRxiv 2023. 2023-02. [Google Scholar] [CrossRef]
- Chintagunta, B.; Katariya, N.; Amatriain, X.; Kannan, A. Medically aware GPT-3 as a data generator for medical dialogue summarization. In Proceedings of the Machine Learning for Healthcare Conference, PMLR, Virtual, 21 October 2021; pp. 354–372. [Google Scholar]
- Joshi, A.; Katariya, N.; Amatriain, X.; Kannan, A. Dr. summarize: Global summarization of medical dialogue by exploiting local structures. arXiv 2020, arXiv:2009.08666. [Google Scholar]
- Sivasubramanian, J.; Shaik Hussain, S.M.; Virudhunagar Muthuprakash, S.; Periadurai, N.D.; Mohanram, K.; Surapaneni, K.M. Analysing the clinical knowledge of ChatGPT in medical microbiology in the undergraduate medical examination. Indian J. Med. Microbiol. 2023, 45, 100380. [Google Scholar] [CrossRef] [PubMed]
- Antaki, F.; Touma, S.; Milad, D.; El-Khoury, J.; Duval, R. Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings. Ophthalmol. Sci. 2023, 3, 100324. [Google Scholar] [CrossRef]
- Patil, N.S.; Huang, R.S.; van der Pol, C.B.; Larocque, N. Comparative Performance of ChatGPT and Bard in a Text-Based Radiology Knowledge Assessment. Can. Assoc. Radiol. J. 2023, 14, 8465371231193716. [Google Scholar] [CrossRef] [PubMed]
- Guerra, G.A.; Hofmann, H.; Sobhani, S.; Hofmann, G.; Gomez, D.; Soroudi, D.; Hopkins, B.S.; Dallas, J.; Pangal, D.J.; Cheok, S.; et al. GPT-4 Artificial Intelligence Model Outperforms ChatGPT, Medical Students, and Neurosurgery Residents on Neurosurgery Written Board-Like Questions. World Neurosurg. 2023, 18. [Google Scholar] [CrossRef]
- Sridi, C.; Brigui, S. The use of ChatGPT in occupational medicine: Opportunities and threats. Ann. Occup. Environ. Med. 2023, 35, e42. [Google Scholar] [CrossRef]
- Amato FDF Gianfranco. Decreto Legislativo 81/08: Test Unico Sulla Salute e Sicurezza Sul Lavoro. Available online: https://www.ispettorato.gov.it/files/2023/03/TU-8108-Ed-Gennaio-2023.pdf (accessed on 1 May 2023).
- Jones, E.; Palangi, H.; Simões, C.; Chandrasekaran, V.; Mukherjee, S.; Mitra, A.; Awadallah, A.; Kamar, E. Teaching Language Models to Hallucinate Less with Synthetic Tasks. arXiv 2023, arXiv:2310.06827. [Google Scholar]
- Sisaengsuwanchai, K.; Nananukul, N.; Kejriwal, M. How does prompt engineering affect ChatGPT performance on unsupervised entity resolution? arXiv 2023, arXiv:2310.06174. [Google Scholar]
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
- Neelakantan, A.; Xu, T.; Puri, R.; Radford, A.; Han, J.M.; Tworek, J.; Yuan, Q.; Tezak, N.; Kim, J.W.; Hallacy, C.; et al. Text and code embeddings by contrastive pre-training. arXiv 2022, arXiv:2201.10005. [Google Scholar]
- Johnson, S.B.; King, A.J.; Warner, E.L.; Aneja, S.; Kann, B.H.; Bylund, C.L. Using ChatGPT to evaluate cancer myths and misconceptions: Artificial intelligence and cancer information. JNCI Cancer Spectr. 2023, 7. [Google Scholar] [CrossRef] [PubMed]
- Johnson, D.; Goodman, R.; Patrinely, J.; Stone, C.; Zimmerman, E.; Donald, R.; Chang, S.; Berkowitz, S.; Finn, A.; Jahangir, E.; et al. Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model. Res. Sq. 2023. [Google Scholar] [CrossRef]
- Alkaissi, H.; McFarlane, S.I. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus 2023, 15, e35179. [Google Scholar] [CrossRef] [PubMed]
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
- Rebuffel, C.; Roberti, M.; Soulier, L.; Scoutheeten, G.; Cancelliere, R.; Gallinari, P. Controlling hallucinations at word level in data-to-text generation. Data Min. Knowl. Discov. 2022, 36, 318–354. [Google Scholar] [CrossRef]
- Wang, J.; Zhou, Y.; Xu, G.; Shi, P.; Zhao, C.; Xu, H.; Ye, Q.; Yan, M.; Zhang, J.; Zhu, J.; et al. Evaluation and Analysis of Hallucination in Large Vision-Language Models. arXiv 2023, arXiv:2308.15126. [Google Scholar]
- Zhu, Y.; Yuan, H.; Wang, S.; Liu, J.; Liu, W.; Deng, C.; Dou, Z.; Wen, J.R. Large Language Models for Information Retrieval: A Survey. arXiv 2023, arXiv:2308.07107. [Google Scholar]
- Maliha, G.; Gerke, S.; Cohen, I.G.; Parikh, R.B. Artificial Intelligence and Liability in Medicine: Balancing Safety and Innovation. Milbank Q. 2021, 99, 629–647. [Google Scholar] [CrossRef]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
- Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Alpaca: A strong, replicable instruction-following model. Stanf. Cent. Res. Found. Models 2023, 3, 7. Available online: https://crfm.stanford.edu/2023/03/13/alpaca.html (accessed on 7 December 2023).
- Peng, B.; Li, C.; He, P.; Galley, M.; Gao, J. Instruction tuning with gpt-4. arXiv 2023, arXiv:2304.03277. [Google Scholar]
Physicians vs. ChatGPT | Physicians vs. ChatGPT + Ctx | ChatGPT vs. ChatGPT + Ctx | ||||
---|---|---|---|---|---|---|
Metric | Values | p-Value | Values | p-Value | Values | p-Value |
Compl. | 3.658 vs. 3.159 | <0.001 | 3.658 vs. 3.618 | 0.862 | 3.159 vs. 3.618 | <0.001 |
Accuracy | 4.042 vs. 3.091 | <0.001 | 4.042 vs. 3.631 | <0.001 | 3.091 vs. 3.631 | <0.001 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Padovan, M.; Cosci, B.; Petillo, A.; Nerli, G.; Porciatti, F.; Scarinci, S.; Carlucci, F.; Dell’Amico, L.; Meliani, N.; Necciari, G.; et al. ChatGPT in Occupational Medicine: A Comparative Study with Human Experts. Bioengineering 2024, 11, 57. https://doi.org/10.3390/bioengineering11010057
Padovan M, Cosci B, Petillo A, Nerli G, Porciatti F, Scarinci S, Carlucci F, Dell’Amico L, Meliani N, Necciari G, et al. ChatGPT in Occupational Medicine: A Comparative Study with Human Experts. Bioengineering. 2024; 11(1):57. https://doi.org/10.3390/bioengineering11010057
Chicago/Turabian StylePadovan, Martina, Bianca Cosci, Armando Petillo, Gianluca Nerli, Francesco Porciatti, Sergio Scarinci, Francesco Carlucci, Letizia Dell’Amico, Niccolò Meliani, Gabriele Necciari, and et al. 2024. "ChatGPT in Occupational Medicine: A Comparative Study with Human Experts" Bioengineering 11, no. 1: 57. https://doi.org/10.3390/bioengineering11010057
APA StylePadovan, M., Cosci, B., Petillo, A., Nerli, G., Porciatti, F., Scarinci, S., Carlucci, F., Dell’Amico, L., Meliani, N., Necciari, G., Lucisano, V. C., Marino, R., Foddis, R., & Palla, A. (2024). ChatGPT in Occupational Medicine: A Comparative Study with Human Experts. Bioengineering, 11(1), 57. https://doi.org/10.3390/bioengineering11010057