Next Article in Journal
Cloud Database Analysis of Instant Messaging Apps on Mobile Devices
Next Article in Special Issue
In Vitro Antioxidant Effects of Coenzyme Q10 on Cellular Metabolism in Aged Mesenchymal Stem Cells
Previous Article in Journal
LightVSR: A Lightweight Video Super-Resolution Model with Multi-Scale Feature Aggregation
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

ChatGPT-4 vs. Google Bard: Which Chatbot Better Understands the Italian Legislative Framework for Worker Health and Safety?

1
Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy
2
Intel Corporation, Santa Clara, CA 95054, USA
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(3), 1508; https://doi.org/10.3390/app15031508
Submission received: 18 December 2024 / Revised: 15 January 2025 / Accepted: 30 January 2025 / Published: 1 February 2025

Abstract

:
Large language models, such as ChatGPT-4 and Google Bard, have demonstrated potential in healthcare. This study explores their utility in occupational medicine, a field where decisions rely on compliance with specific workplace health and safety regulations. A dataset of questions encompassing key occupational health topics derived from the Italian Legislative Decree 81/08, which governs workplace health and safety, was utilized. Responses from ChatGPT-4 with contextual information (ChatGPT-4+context) and Google Bard were evaluated for accuracy and completeness, with error categorization used to identify common issues. Subcategories of the topics of the regulations were analyzed as well. In total, 433 questions were included in our analysis. ChatGPT-4+context surpasses Bard in terms of accuracy and completeness in responses, with a lower error rate in the categories analyzed, except for the percentage of missed responses. In the subcategories analyzed, Bard is superior to ChatGPT-4+context only in the areas of the manual handling of loads and physical hazards. ChatGPT-4+context outperformed Bard in providing answers about Italian regulations on health and safety at work. This study highlights the potential and limitations of large language models as decision-support tools in occupational medicine and underscores the importance of regulatory context in enhancing their reliability.

1. Introduction

Large language models (LLMs) are highly advanced neural networks designed to process and generate human-like text based on extensive training data. They are capable of tasks like translating languages, crafting content, and generating automated replies [1,2]. ChatGPT is a conversational application of an LLM [3]. Since 2022, the year of ChatGPT’s release, Artificial Intelligence (AI) tools have rapidly spread across a wide range of sectors, from personal virtual assistants to tools that guide business [4] and military [5] decisions. Medicine made no exception, as indicated by many studies about the applications of LLMs in the healthcare field [6]. Numerous working groups have tested ChatGPT and other chatbots as tools for diagnosis [7,8], choosing therapeutic protocols [9,10], and patient rehabilitation [11,12]. Occupational medicine has been explored little so far [13,14]. Occupational healthcare choices must be based not only on scientific evidence and workers’ health conditions but must also comply with specific regulations on health and safety in the workplace. These regulations often vary from country to country. The interpretation of the regulatory framework could represent an additional obstacle in the decision-making process of a chatbot. Indeed, it is necessary to compare different LLMs and evaluate which one best manages to balance the various elements that need to be considered to reach a decision that is reliable from both a normative and scientific standpoint. A previous study [15] compared the quality of the answers to a pool of questions given by medical personnel specialized in Occupational Medicine; a ChatGPT-3.5 [16] model trained on Italian regulations on health and safety in the workplace, the Legislative Decree 81/08 [17]; and an untrained ChatGPT-3.5 model. Our aim is to compare the quality of the answers given by GPT-4, a newer version of the GPT framework [18], and Google Bard [19,20], an AI-powered chatbot based on the PaLM 2 model [21] developed by Google, that has since been upgraded to Gemini [22]. The purpose is to provide a starting point for selecting a virtual assistant that could assist occupational physicians in the future, particularly with tasks related to more mnemonic aspects, such as laws and regulations.

2. Materials and Methods

Our research group includes occupational health physicians and resident doctors specializing in occupational medicine from the University of Pisa.

2.1. Questions Dataset

We used a dataset of 433 questions generated for a previous study [15]. Each occupational physician was assigned a specific topic and instructed to generate questions based on Legislative Decree 81/08, the Italian regulatory framework for workplace health and safety [17], to which they had direct access. The decree covers key topics such as risk assessments, the adoption of preventive and protective measures, the training and provision of information to workers on occupational risks, health surveillance, emergency management, the legal responsibilities of employers, and specific regulations for work in particular environments.
To ensure the dataset was comprehensive and representative of the legislative decree, a stratified sampling method was employed. Each physician was tasked with formulating approximately 40 questions based on his assigned topic, ensuring that all aspects of the decree were systematically covered. This stratified approach minimized selection bias and ensured a balanced representation of topics.
The assignment of experts to specific areas was based on their expertise, further enhancing the quality and relevance of the questions. Additionally, to ensure the dataset’s quality and appropriateness for LLMs, the questions were reviewed and cross-validated by an occupational physician not involved in the question generation process. This validation process helped ensure consistency, alignment with the legislative content, and a balance of topics.
Additionally, redundancy was minimized, and questions were crafted to be as clear and unambiguous as possible. By addressing these factors, we have created a robust, comprehensive, and well-structured dataset that is well-suited for use in training and evaluating LLMs on legal content related to Legislative Decree 81/08. The topics addressed in the dataset included the common principles of the legislative decree, such as general and criminal provisions, the institutional system and management of workplace insurance, workplace safety, and the main occupational hazards to which a worker may be exposed:
  • Manual handling of loads: Refers to any activity involving lifting, dragging, pushing, or moving objects that may cause musculoskeletal disorders, particularly affecting the spine, joints, and muscles.
  • Use of visual display terminals: Focuses on ergonomic and health risks (e.g., eye strain or posture-related issues) from prolonged use of computer screens or similar devices.
  • Physical hazards: Includes risks from physical agents such as noise, vibrations, extreme temperatures, or radiation in the workplace.
  • Chemical hazards: Involves risks from exposure to harmful chemicals, which can lead to poisoning, respiratory problems, or skin irritation.
  • Asbestos: A material that, when inhaled, can cause serious diseases, including cancer. Although widely used in the past, asbestos is now banned or heavily regulated due to its health risks.
  • Carcinogenic and mutagenic agents: Substances that can cause cancer (carcinogenic) or DNA mutations (mutagenic), posing severe health risks in specific industries.
  • Biological hazards: Refers to environments where exposure to biological agents (e.g., bacteria or viruses) is possible, such as healthcare services, veterinary clinics, and diagnostic laboratories.
  • Risk of explosive atmospheres: An explosive atmosphere is a mixture of air and flammable substances (gasses, vapors, mists, or dusts) that, after ignition, leads to rapid combustion. This risk is significant in the chemical and energy industries.
Additionally, the physicians were instructed to create questions related to official inquiries called “interpelli”. Trade unions, employers, and professional councils can submit general questions to a commission, which evaluates the implementation of health and safety regulations and proposes legislative improvements. The commission’s answers serve as interpretive criteria for supervisory activities and the application of Italian occupational safety and health laws. The dataset reflects this variety of topics, aiming to avoid any bias that could limit the model’s ability to generalize to the different regulations and situations related to workplace safety.
The dataset of questions used in our study is available in the Supplementary Materials, which provides both the original version of each question in Italian and its translation into English (Table S1). The Average Question Length was 17.02 words, as shown in Figure 1.

2.2. Google Bard- and ChatGPT-4-Generated Answers

Google offered a free experimental version of the collaborative AI Bard accessible to everyone via the link https://bard.google.com/chat (accessed on 23 August 2023). Now, from this same link, you can access Google Gemini. Google Bard has direct access to the Internet, allowing it to retrieve real-time information. For the generation of answers using Google Bard, the questions generated by the physicians were manually inputted one by one into the input bar, with a system message used as a prefix to each question. A new chat was created for each question. The system message acts as the primary set of instructions for the model. Although it is not mandatory, it is useful for improving the quality of the responses. Further details about the system message used for both LLMs are provided in Appendix A. The answers were generated on 25–26 August 2023. Each answer generated by Bard was then copied into an Excel file next to the corresponding question.
The OpenAI ChatGPT API (Application Programming Interface) gpt-4-0613 was employed to automatically generate responses to questions through a script on 25–26 August 2023. The standard version of ChatGPT-4 does not have direct Internet access. It generates responses based on pre-existing training data. In our approach, the relevant legislative content from D.lgs. 81/08 was provided alongside the questions in what is called the “ChatGPT-4+context” method. To find this legal content, a semantic search was performed using text embeddings. Text embeddings are a tool used in natural language processing (NLP) and machine learning to represent sentences as numerical vectors. Unlike traditional keyword searches, embeddings focus on the meaning behind the query and the content being searched. This method was used to locate the relevant parts of D.lgs. 81/08. The extracted legal sections were then given to ChatGPT along with the original question, helping it generate a more accurate and informed response. More technical details about this method can be found in Appendix B and Appendix C. Given the results from [15], where the ChatGPT version with context (ChatGPT-3.5+context) performed better in terms of accuracy and completeness compared to ChatGPT-3.5, we decided to compare only the trained version of ChatGPT-4 (ChatGPT-4+context) with Google Bard.
To ensure impartiality in the subsequent human evaluation, automatic phrases that could reveal the source of the response (Google Bard or ChatGPT-4) during the blind evaluation were removed from the generated responses by an engineer responsible for the technical aspects of this study, who did not participate in the process of evaluating the responses. Examples of removed phrases include: “Hello! My name is Bard, and I’m your virtual assistant. I can help you with your question”; “Hi, I’m ChatGPT, the assistant to an occupational medicine specialist in Italy”.

2.3. Answers Evaluation

The evaluation criteria used to assess the quality of the responses generated by ChatGPT-4+context and Google Bard can be briefly described as follows:
-
Accuracy: the answer provides correct and precise information without containing errors or inaccuracies.
-
Completeness: the answer provides all the information necessary to address the question.
In this study, the term ’accuracy’ does not refer to the statistical metric commonly used to evaluate predictive models but to the qualitative precision of the responses provided by the two LLMs. Specifically, accuracy is understood as the assessment of the correctness, consistency, and absence of errors in the responses, evaluated against the contents of the legislative decree that served as the basis for the formulated questions.
We used the two 5-point Likert scales to assess the accuracy and completeness of the responses. The scale facilitated a more detailed assessment of response quality by assigning a numerical score to represent the accuracy and completeness of each response. The 5-point Likert scales ranged from 1 (indicating a complete lack of accuracy/completeness) to 5 (indicating full accuracy/completeness), with intermediate scores reflecting varying levels of these attributes.
To evaluate the reliability of these scales, we calculated Cronbach’s Alpha for the two key dimensions. The Cronbach’s Alpha for answer accuracy (5 items) was 0.83, indicating good internal consistency, while the Cronbach’s Alpha for answer completeness (5 items) was 0.92, reflecting excellent internal consistency. These results demonstrate the scales’ reliability in measuring both dimensions, ensuring that the items consistently capture the underlying constructs.
For each response option, evaluators were also asked to directly indicate any errors, if present, in each response option. The error categories are summarized in Table 1.

2.4. Evaluation Process

A website was developed that allowed each physician to log in and access the questions and their corresponding response options for evaluation. To minimize the risk of implicit bias, the order of the questions provided by the website was randomized across different users, as was the order of the two response options for each question (ChatGPT-4+Context and Bard). Additionally, a timer function was added to the website to measure the response time, which is the time from when the question to be evaluated is displayed to when the evaluator submits their evaluation and the next question appears. For each evaluator, the average evaluation time was calculated at the end of the evaluation phase, and the evaluations of those with an average response time of less than 1 min were excluded, as this time was deemed the reliability cut-off for the evaluation itself. This led to the exclusion of all evaluations conducted by two evaluators. The evaluations were performed between 27 August and 27 September 2023.

2.5. Statistical Analysis

The statistical analysis was performed using Python 3.10 software (Python Software Foundation) and the Pandas (v1.5.3), NumPy(v1.23.5), SciPy(v1.10.1), Matplotlib(v3.7.1), and Seaborn libraries(v0.12.2). The latter two libraries were used to generate plots. Mean, standard error, and percentage values were calculated. Mean accuracy and completeness scores were analyzed using the Mann–Whitney U test for independent samples.
Since the dataset is balanced, with the same number of responses provided by Bard and ChatGPT evaluated, the need for effect size calculation was initially minimized. In each evaluation, both models’ responses were assessed blindly on the same question, ensuring an equal number of observations for each model. Given this balanced design, the primary focus was on comparing the distributions between the two groups using the Mann–Whitney U test.
However, to provide additional insight into the magnitude of the observed differences, effect size calculations were performed. Both Cohen’s d and the effect size r were calculated. Cohen’s d quantified the standardized difference between the means, offering a direct measure of the effect size that is independent of sample size. To complement this, the effect size r was derived from d, providing a more intuitive and interpretable representation of the effect strength in terms of the proportion of the variance explained. These calculations ensured a more comprehensive understanding of the results, particularly for contextualizing the practical significance of the observed differences.
To assess the presence of each error class in the two evaluated LLMs, a two-sided Pearson’s χ2 test was conducted. When at least one cell in the contingency table showed a low frequency, Fisher’s exact test was performed. All statistical tests were conducted with a 95% confidence interval to ensure a robust evaluation of the observed differences.

3. Results

Out of 1191 evaluations conducted, 237 were excluded for not meeting the established reliability standards. Analysis of the remaining 954 answers from ChatGPT-4 and 954 answers from Bard revealed that the ones generated by ChatGPT-4+context were more accurate and complete compared to those generated by Bard, as shown in Table 2.
For completeness, Bard’s responses averaged 3.47 ± 0.042, while ChatGPT-4+context achieved a higher average of 3.68 ± 0.041 (p < 0.05). ChatGPT-4+context reached the maximum completeness score of 5 in 30.3% of cases, compared to 26.5% for Bard. In terms of minimum scores, Bard received the lowest rating in 10.5% of cases, slightly more than ChatGPT-4+context at 10.3%. Effect size calculations indicated that the influence of the sample on completeness was negligible (Cohen’s d = 0.16; r = 0.08).
Regarding accuracy, Bard scored an average of 3.14 ± 0.043, while ChatGPT-4+context performed better, with an average score of 3.67 ± 0.043 (p < 0.05). Bard achieved the maximum Likert scale score for accuracy in 19.2% of cases, compared to 33.4% for ChatGPT-4+context. The lowest score of 1 was assigned to Bard in 14.9% of cases, compared to 11.3% for ChatGPT-4+context. Effect size calculations showed a moderate sample influence for accuracy (Cohen’s d = 0.40; r = 0.20).
Figure 2 shows a comparison of median, 25th, 50th, and 75th percentiles for the accuracy and completeness of the responses from Bard and ChatGPT-4+context (abbreviated as ChatGPT4 in the charts on the right).

3.1. Analysis of Answers Divided by Question Category

Table 3 provides a detailed breakdown of the average accuracy and completeness scores for Bard and ChatGPT-4+context, categorized by the topic of the question. ChatGPT-4+context consistently achieved higher scores for both accuracy and completeness in the subcategories of asbestos, explosive atmospheres, work equipment and PPE, “interpelli”, common principles, biological hazards, and the use of visual display terminals. However, Bard outperformed ChatGPT-4+context in the subcategories of manual handling of loads and physical hazards. For the chemical hazards, ChatGPT-4+context was superior in accuracy but not in completeness, where Bard performed better.

3.2. Error Analysis

Our error analysis showed that Bard’s responses contained an internal contradiction in 1.5% of cases compared to 0.9% for ChatGPT-4+context (p = 0.294). Bard referred to an incorrect context in 16.8% of responses, a statistically significant difference compared to the 6.3% of responses in which ChatGPT-4+context did (p < 0.05). Bard provided incorrect definitions in 12.8% of cases, whereas ChatGPT-4+context did so in 6.6% (p < 0.05). Bard failed to answer 4.4% of questions compared to ChatGPT-4+context’s 2.9%, a difference close to statistical significance (p = 0.088). Furthermore, 28.7% of Bard’s responses contained partially or completely incorrect content, compared to 15.8% for ChatGPT-4+context (p < 0.05). Grammar or typo errors were found in 5.6% of Bard’s responses, compared to 3.4% for ChatGPT-4+context (p < 0.05), as shown in Figure 3.
Table 4 illustrates the error rates for each category in the responses generated by Bard and ChatGPT-4+context.

4. Discussion

LLMs like ChatGPT by OpenAI and Bard by Google have the potential to be integrated into healthcare systems to assist with a wide range of tasks, from clinical decision-making to administrative functions.
This study is the first to evaluate ChatGPT-4 and Google Bard in the context of occupational medicine. It focuses on their ability to generate open-ended responses to complex questions related to workplace safety and occupational risks. Our findings indicate that ChatGPT-4, when combined with legislative context, generally provides more accurate and complete answers than Bard. In Appendix D, we have included an example of a correct and complete response generated by Bard and one by ChatGPT-4 (Table A2), as well as an example of an incorrect response, both in terms of completeness and accuracy, generated by both LLMs (Table A3).
The literature highlights the focus on the evaluation of LLMs in answering multiple-choice questions in medical contexts. ChatGPT-4 demonstrated promising accuracy in bariatric surgery-related questions compared to Microsoft’s Bing and Google Bard [23]. Similarly, ChatGPT-3.5 and 4 outperformed Bard in radiology board practice questions [24]. ChatGPT-4 excelled in exams like the National Council Licensure Examination for Registered Nurses, the National Nursing Licensure Examination in China [25], clinical chemistry questions [26], and the Japanese national dental hygienist examination [27]. It was the only LLM to outperform others in the 2022 Family Medicine In-Training Exam [28]. Bard was surpassed by both ChatGPT-3.5 and 4 in the Advanced Burn Life Support exam [29]. Additionally, ChatGPT-4 showed superior performance in the American Board of Physical Medicine and Rehabilitation exam [30], lung cancer-related questions [31], oral neurobiology committee exams [32], and the Royal College of Ophthalmologists’ exams in the United Kingdom [33].
When evaluating subject-specific responses, Bard outperforms ChatGPT-4 only in manual handling of loads and physical hazards. In chemical hazards, Bard exceeds ChatGPT-4+context solely in terms of completeness. This result can be explained by several factors, which are primarily related to the nature of the questions and the distinct characteristics of Bard and ChatGPT-4. Questions about manual handling were, in many cases, formulated with a more practical approach, such as how to lift a load safely. In these instances, Bard appears to perform better due to its access to updated internet resources, which may provide more recent or specific practical guidelines. Similarly, in the domain of physical and chemical hazards, Bard’s superior performance may stem from its ability to retrieve threshold values or technical details, which are again more practical than regulatory questions. This reliance on dynamic and up-to-date information could give Bard an advantage in areas requiring precision and alignment with current standards.
Additionally, Bard’s performance differences may reflect inherent biases in the training data or optimization strategies. It is plausible that Bard’s pretraining placed greater emphasis on technical resources, specific regulations, and manuals related to physical safety and manual handling. This focus might better align Bard with regional or international standards, such as European or U.S. regulations, which influence workplace safety guidelines and practices.
Finally, these explanations remain speculative and are based on observed performance differences and the general characteristics of large language models (LLMs). Given that the exact details of the training processes and datasets used for Bard and ChatGPT-4 are not publicly available, further studies would be necessary to definitively determine the causes of the differences observed in this evaluation.
Effect size calculations provided additional insights into the magnitude of the observed differences. Regarding completeness, the effect size (Cohen’s d = 0.16; r = 0.08) suggests that, although the difference between ChatGPT-4+context and Bard is statistically significant, the effect is small and thus of limited practical significance. In other words, the completeness of the responses shows only a modest difference between the two models, despite the significant p-value.
As for accuracy, the effect size (Cohen’s d = 0.40; r = 0.20) indicates a moderate difference between the two models, with ChatGPT-4+context performing significantly better than Bard. This suggests that ChatGPT-4+context’s responses are more accurate to a meaningful degree, making the difference between the two models relevant in practical contexts, particularly in medical fields where accuracy is crucial.
From the error analysis, it emerged that Bard had a higher error rate than ChatGPT-4 for all categories except for “non-response”. ChatGPT’s programming may lead it to decline answering certain questions such as those asking for personalized medical advice or involving sensitive information, preferring instead to recommend consulting a medical professional. In Appendix E, six tables are presented, each displaying one response from both LLMs, containing one of the six evaluated error categories. Specifically, Table A4 demonstrates cases of no response, Table A5 showcases incorrect definition, Table A6 includes instances of content error, Table A7 illustrates two examples of responses with a context error, Table A8 highlights responses with internal contradiction, and Table A9 presents examples of typo errors.
Broadly speaking, in agreement with the aforementioned studies, Bard has shown a lower performance, and our results indicate that this is due to incorrect content in the text produced, even when simply asked for a definition. The presence of such a high rate of content errors (almost 30%) for Bard represents a significant limitation, as we deduce that its performance, despite its full and direct access to the internet, is currently unreliable and has a large margin for improvement. Additionally, Bard often makes use of incorrect contextual references. It is not possible to hypothesize the reason for such incorrect context retrieval, as it is unknown how Google selects the requested data. This type of error is more than halved in ChatGPT-4+context, thanks in part to the optimization of its semantic search process.
Despite their promise, conversational AI systems present potential ethical risks, such as misinformation, privacy violations, reinforcement of the biases in training data, and the generation of plagiarized or inaccurate content. These risks are particularly concerning in healthcare, where accuracy is critical [34]. LLMs are prone to generating “hallucinations,” producing responses that are plausible but factually incorrect. Notably, studies have shown that 45% of ChatGPT’s responses contain inaccuracies, and Bard’s error rate is even higher, particularly in academic contexts [34].
Research and development are underway to improve the accuracy and reliability of conversational AI in medicine. For this reason, the World Health Organization (WHO) has recently released a comprehensive set of guidelines titled “Ethics and Governance of Artificial Intelligence for Health: Guidance on Large Multi-Modal Models” [35], which is aimed at ensuring the ethical use of these generative AI technologies in healthcare. The WHO’s guidance includes over 40 recommendations for governments, technology developers, and healthcare providers to manage the use of LMMs responsibly. The guidelines identify five areas of application for LMMs in healthcare: clinical diagnosis, patient-guided care, administrative tasks, medical training, and scientific research for drug development.
However, the WHO also emphasizes the potential risks associated with LMMs. These include concerns about accuracy, potential biases, misinformation, and the ethical implications of using AI in sensitive healthcare settings. The organization stresses the need for transparency, robust regulatory frameworks, and responsible innovation to ensure that these technologies do not exacerbate existing health inequities.
To address these ethical challenges in occupational medicine, several recommendations were proposed by Baldassare et al. [36]. First and foremost, transparency and explainability in clinical decision-making supported by these technologies are essential. Occupational health professionals must take responsibility for the transparent and clear use of LLMs, ensuring that all decisions are understandable and well justified.
Given the significant influence that LLMs can have on clinical decisions, it is crucial to clearly define the roles and responsibilities of practitioners in overseeing the use of generative AI. These professionals must remain accountable for the decisions made, ensuring that human oversight remains a central component.
Data privacy and confidentiality are equally important, with the guidelines reaffirming the need to protect the sensitive health surveillance data processed by LLMs. Safeguarding the confidentiality of this information is imperative to maintain trust and comply with ethical standards. Additionally, the potential for LLMs to perpetuate or amplify biases necessitates a commitment to reducing discrimination and promoting equitable access to and use of these technologies, particularly in health surveillance and clinical decision-making.
The ethical use of LLMs must also include continuous education and training for practitioners. It is essential that they receive proper instruction on the responsible use of these technologies and engage in ongoing professional development to stay informed about emerging ethical challenges and technological advancements.
Despite the technical, ethical, and legal challenges that LLMs currently face in ensuring their safe and reliable use in the international medical landscape, we can hypothesize several potential applications for chatbots like ChatGPT or Bard in the daily practice of occupational physicians. First, given the vast array of laws, legislative decrees, technical standards, and both national and international guidelines forming the theoretical and practical foundation of this discipline, chatbots could act as 24/7 virtual assistants. This capability would allow occupational physicians to query current regulations on workplace health and safety, workers’ rights and duties, safety protocols, and potential health risks in specific work environments, among other topics, with chatbots providing immediate responses to support their professional activities.
Other potential applications include developing management systems for occupational health surveillance and enabling the efficient recording of all medical examinations and diagnostic assessments conducted on workers. This would facilitate more comprehensive data collection. Additionally, LLMs could assist in the prevention of occupational diseases by identifying trends in work-related illnesses or injuries based on available data, particularly underreported trends. Such insights could inform more effective prevention strategies.
Furthermore, LLMs could support training obligations by generating educational materials on workplace best practices, the risks associated with specific activities or substances, and company safety guidelines.
Chatbots like ChatGPT or Bard could also be leveraged to generate reports, informational documents, and corporate communications related to occupational health and safety. They could assist in drafting and reviewing risk assessment documents to ensure regulatory compliance for employers. Additionally, LLMs have the potential to support scientific research by analyzing diverse health data collected during occupational health surveillance. However, their use must strictly comply with data privacy regulations and ethical standards.
Considering the potential and hypothetical applications of chatbots in occupational medicine, our study could help accelerate the adoption of these generative AI models in a field like occupational medicine, which often tends to progress more slowly compared to other medical fields.
Despite being the first evaluation of ChatGPT-4 vs. Bard in the context of international occupational medicine, our study has several limitations. One of the main limitations of the current study is the scope of the dataset, which is based exclusively on the Italian Legislative Decree 81/08, reflecting the legal framework for workplace health and safety in Italy. This geographical limitation may restrict the generalizability of the results to other countries or regions, where legal provisions and terminology may differ.
To address this limitation, we plan to conduct additional tests to evaluate the performance of our model on queries from other countries or regions, including different types of legal inquiries, such as case-specific analyses or interpretations of legal articles. By testing the model on legal texts from other jurisdictions—such as those from other European Union countries, the United States, or even Latin America—we will be able to assess its ability to adapt to different legal systems, frameworks, and terminologies. This will help us evaluate the robustness of the model and its capacity for cross-jurisdictional generalization.
Future research will focus on expanding the model’s capabilities and assessing its performance across diverse legal contexts to ensure its wider applicability and improve its accuracy and efficiency in different legal environments. We also only evaluated two LLMs (ChatGPT-4 and Google Bard) and focused solely on their ability to respond to open-ended questions. Additionally, the generalizability of our findings may be limited due to the single-center design of the study and the restricted database of responses.

5. Conclusions

ChatGPT-4+context outperforms Bard in terms of accuracy and completeness in its responses, with a lower error rate in the categories analyzed, except for the percentage of missed responses. This study highlights both the strengths and limitations of ChatGPT-4 and Google Bard, particularly in terms of their ability to provide accurate, context-aware responses to complex occupational health questions.
Despite these limitations, LLMs can play a crucial role in occupational medicine. However, their deployment must be guided by clear ethical frameworks. Moving forward, further research and development are essential to enhance the reliability, accuracy, and ethical use of these AI systems in occupational healthcare settings. The careful implementation of these technologies, along with robust regulatory oversight, could pave the way for their responsible integration into the daily practice of occupational physicians.

Supplementary Materials

The following supporting information can be downloaded at: www.mdpi.com/article/10.3390/app15031508/s1, Table S1. The dataset of questions used in our study, which provides both the original version of each question in Italian and its translation into English.

Author Contributions

Conceptualization, M.P. and A.P. (Alessandro Palla); Data curation, A.P. (Alessandro Palla), M.P., R.M., F.P., B.C., A.P. (Armando Petillo), G.N. (Gianluca Nerli), S.S., F.C., L.D., G.N. (Gabriele Necciari) and V.C.L.; Methodology and statistics, R.M. and M.P.; Software, A.P. (Alessandro Palla); Formal analysis, A.P. (Alessandro Palla), M.P. and R.M.; Writing—original draft preparation, M.P., A.P. (Alessandro Palla) and R.M.; Writing—review and editing, A.P. (Alessandro Palla), M.P., G.N. (Gianluca Nerli), L.D. and R.F.; Supervision R.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available as their release is subject to ongoing collaboration and further analysis to derive additional insights.

Conflicts of Interest

Author Alessandro Palla was employed by the company Intel Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. ChatGPT-4+Context and Bard System Messages

Both LLMs have been instructed to tune their behavior with a proper system message. The system message in LLMs provides primary instructions for the model and can be customized to fit the system’s role. It can include an overview of the assistant, its personality traits, specific instructions, guidelines, or important data like frequently asked questions. Personalizing the system message allows it to be tailored to specific use cases. While it is not mandatory, high-quality system messages significantly enhance the overall quality of the models’ responses [37]. The ChatGPT-4+context and Bard system messages used are described in Table A1.
Table A1. ChatGPT-4+context and Bard system messages used. The text {context} was replaced with the actual legislative context from a search before submitting the text to ChatGPT-4.
Table A1. ChatGPT-4+context and Bard system messages used. The text {context} was replaced with the actual legislative context from a search before submitting the text to ChatGPT-4.
LanguageGoogle BardChatGPT-4+Context
ItalianVoglio che tu agisca come un assistente di un medico specialista in medicina del lavoro in Italia. Usa il decreto legislativo 81/08 per rispondere alle domande.Voglio che tu agisca come un assistente di un medico specialista in medicina del lavoro in Italia. Usa il decreto legislativo 81/08 ed il seguente contesto normativo per rispondere alla seguente domanda.
Contesto normativo: {context}
EnglishI want you to act as an assistant to a specialist doctor in occupational medicine in Italy. Use Legislative Decree 81/08 to answer the questions.I want you to act as an assistant to a specialist doctor in occupational medicine in Italy. Use Legislative Decree 81/08 and the following legislative context to answer the questions.
Legislative context: {context}

Appendix B. ChatGPT-4 Legislative Context Semantic Search

To generate answers from the OpenAI LLM, the legislative reference context from the D.lgs. 81/08 was given to ChatGPT-4 alongside the questions submitted by physicians. To optimally retrieve the relevant passage from the reference legislative text, we used a semantic search technique that used text embeddings.
Text embeddings are a technique in NLP that convert text into numerical vectors [38]. These vectors represent the semantic meaning of words, sentences, or entire texts in a high-dimensional space. They allow for measuring the similarity between different text strings. For example, OpenAI’s ‘text-embedding-ada-002’ model can process sentences of up to 8192 tokens and generate vectors of 1536 dimensions. In the semantic search process, to retrieve relevant documents from a large corpus, the user’s query text embedding is compared with the embedding of each paragraph and the results scored by similarity.
In the process of retrieving legislative context from D.lgs. 81/08, we performed the following steps:
  • Parse the document into chunks: using the ‘UnstructuredPDF’ library, we extracted raw text and split it into chunks of N tokens (a tunable parameter).
  • Create embeddings for the document corpus: each chunk was converted into a numerical vector using the ‘text-embedding-ada-002’ model via OpenAI’s Embeddings API.
  • Create a query embedding: the user’s search query was similarly transformed into a vector.
  • Calculate similarity: we measured cosine similarity between the query and document vectors.
  • Rank results: the document vectors were ranked by similarity, and the most relevant ones were selected.
  • Concatenate top results: we combined the highest-ranked paragraphs, ensuring that the combined context plus the user’s query remained under 2048 tokens to allow for a complete response.

Appendix C. Question/Answering Optimization and Prompt Engineering

Due to the limited context size of language models, handling lengthy texts requires breaking the legislative context into smaller chunks or paragraphs. While this may seem straightforward, it is a complex task for an automated algorithm, as semantically related pieces of text need to remain intact. The degree of semantic connection can vary significantly based on the type of text.
At a high level, text splitters work as follows:
  • Step 1: split the text into smaller, semantically meaningful chunks, such as sentences or paragraphs.
  • Step 2: combine these smaller chunks into larger sections until a predefined size threshold is reached.
  • Step 3: once the threshold is met, finalize that chunk as a standalone piece and create a new chunk with overlapping text to ensure continuity of the context between sections.
This process is highly dependent on the characteristics of the text, making it difficult to determine the optimal chunk size, system prompt, and text-processing approach. Generally, the process involves iterating through various parameter configurations, with human evaluators assessing the quality of the generated responses to identify the best combination. However, we opted against this approach due to the latency human evaluation would introduce, as we needed to iterate through numerous configurations quickly during system design.
To address this bottleneck, we utilized ChatGPT itself as a proxy for human judgment. This allowed us to optimize chunk size, system prompts, and other parameters by using the language model to score generated answers based on their accuracy and completeness. This approach offered significant advantages: it was faster, scaled effectively, and enabled rapid experimentation with multiple parameter combinations.
Our evaluator was implemented using prompt engineering within the LangChain framework. Specifically, we provided the model with three inputs: the question, the reference answer (from physicians), and the LLM-generated answer (with context). We then instructed the evaluator to score the generated answers based on accuracy and completeness. By aggregating these scores, we ranked each parameter configuration and selected the one that maximized performance across both metrics.
The system message used for this evaluation is detailed in Table A3. This message was presented along with the ChatGPT-generated answer, enabling the evaluator to compute and return metrics for accuracy and completeness. While there is a potential for misalignment between human evaluators and ChatGPT’s assessments, this method offers a fast, scalable way to optimize and fine-tune a Question/Answering (QA) pipeline. It is particularly effective given the numerous design choices and potential system prompts at play.
Once the system was fine-tuned, it was used to generate answers for a final phase of human evaluation, ensuring that the QA system met the necessary performance standards.

Appendix D. Examples of Good and Bad Replies from LLMs

Table A2. Examples of good replies from LLMs.
Table A2. Examples of good replies from LLMs.
LanguageItalianEnglish
QuestionAl medico competente è garantita l’autonomia ai sensi dell’art. 39, comma 4 del D.Lgs. N. 81/2008 se esso risulta funzionalmente collocato in Unità Operativa Complessa (UOC) di cui il Responsabile del Servizio di Prevenzione e Protezione è il direttore?The competent doctor is guaranteed autonomy pursuant to Article 39, paragraph 4 of Legislative Decree No. 81/2008 if they are functionally placed in a Complex Operational Unit (UOC) where the Head of the Prevention and Protection Service is the director?
ChatGPT-4+ctx answerSì, l’autonomia del medico competente deve essere sempre garantita, a prescindere dalla sua collocazione organizzativa all’interno dell’azienda. Secondo l’art. 39, comma 4 del D.Lgs. n. 81/2008, il datore di lavoro “assicura al medico competente le condizioni necessarie per lo svolgimento di tutti i suoi compiti garantendone l’autonomia”.

Inoltre, nonostante l’art. 17 del D.Lgs. n. 81/2008 preveda la non delegabilità della designazione del responsabile del servizio di prevenzione e protezione e non anche della nomina del medico competente, questo non implica una preminenza del responsabile del servizio di prevenzione e protezione sul medico competente. Infatti, queste sono figure funzionalmente autonome, con responsabilità nettamente distinte.

Pertanto, anche se il medico competente è collocato in una Unità Operativa Complessa (UOC) di cui il Responsabile del Servizio di Prevenzione e Protezione è il direttore, la sua autonomia funzionale nello svolgimento del suo ruolo deve sempre essere garantita. Il medico competente può essere gerarchicamente subordinato al direttore della UOC solo per quanto riguarda gli aspetti organizzativi che esulano dalle sue funzioni di medico competente.
Yes, the autonomy of the competent doctor must always be guaranteed, regardless of their organizational placement within the company. According to Article 39, paragraph 4 of Legislative Decree No. 81/2008, the employer “ensures that the competent doctor has the necessary conditions to carry out all their tasks while guaranteeing their autonomy”.
Furthermore, although Article 17 of Legislative Decree No. 81/2008 provides for the non-delegability of the appointment of the head of the prevention and protection service, and not the appointment of the competent doctor, this does not imply a superior position for the head of the prevention and protection service over the competent doctor. In fact, these are functionally independent figures with clearly distinct responsibilities.
Therefore, even if the competent doctor is placed in a Complex Operational Unit (UOC) where the Head of the Prevention and Protection Service is the director, their functional autonomy in performing their role must always be guaranteed. The competent doctor may be hierarchically subordinate to the UOC director only regarding organizational aspects unrelated to their functions as a competent doctor.
QuestionChe cosa si intende per Dirigente?What is meant by Manager?
Bard answerUn dirigente è una persona che, in ragione delle proprie competenze professionali e di poteri gerarchici e funzionali adeguati alla natura dell’incarico conferitogli, attua le direttive del datore di lavoro organizzando l’attività lavorativa e vigilando su di essa.

Il dirigente è quindi un soggetto che ha una posizione di comando nell’azienda e che ha la responsabilità di garantire la sicurezza dei lavoratori.

Secondo il decreto legislativo 81/2008, il dirigente è tenuto a:

attuare le direttive del datore di lavoro;
organizzare l’attività lavorativa;
vigilare su di essa;
collaborare con il datore di lavoro e il responsabile del servizio di prevenzione e protezione per la valutazione dei rischi;
informare i lavoratori dei rischi connessi all’attività lavorativa e delle misure di prevenzione e protezione adottate;
fornire ai lavoratori i necessari dispositivi di protezione individuale.
Il dirigente è quindi una figura chiave per la sicurezza sul lavoro e ha un ruolo fondamentale nella prevenzione degli infortuni e delle malattie professionali.
A manager is a person who, due to their professional skills and hierarchical and functional powers appropriate to the nature of the task assigned to them, implements the employer’s directives by organizing the work activity and overseeing it.
The manager thus holds a position of command within the company and is responsible for ensuring the safety of the workers.
According to Legislative Decree 81/2008, the manager is required to:
  • Implement the employer’s directives;
  • Organize the work activity;
  • Supervise it;
  • Collaborate with the employer and the head of the prevention and protection service for risk assessment;
  • Inform workers of the risks associated with the work activity and the prevention and protection measures adopted;
  • Provide workers with the necessary personal protective equipment.
The manager is therefore a key figure in workplace safety and plays a fundamental role in the prevention of accidents and occupational diseases.
Table A3. Examples of bad replies from LLMs.
Table A3. Examples of bad replies from LLMs.
LanguageItalianEnglish
QuestionCosa tratta l’Allegato LI del decreto legislativo 9 Aprile 2008 n.81?What does Annex LI of Legislative Decree No. 81 of 9 April 2008 address?
ChatGPT-4+ctx answerL’Allegato LI del Decreto Legislativo 9 aprile 2008 n.81 tratta le “Modalità di svolgimento della sorveglianza sanitaria”. La sorveglianza sanitaria è una componente fondamentale della tutela della salute e sicurezza dei lavoratori. Essa comprende visite mediche preventive ed periodiche, nonché la verifica dell’idoneità alla specifica mansione. Questo allegato contiene quindi indicazioni precise su come questa sorveglianza sanitaria dovrebbe essere svolta, descrivendo ad esempio, i criteri per la programmazione delle visite mediche, le modalità di conservazione e gestione dei dati sanitari, etc.Annex LI of Legislative Decree No. 81 of 9 April 2008 addresses the “Methods of carrying out health surveillance”. Health surveillance is a fundamental component of the protection of workers’ health and safety. It includes preventive and periodic medical examinations, as well as the assessment of fitness for specific tasks. This annex provides detailed guidelines on how health surveillance should be carried out, describing, for example, the criteria for scheduling medical visits, the methods for storing and managing health data, etc.
BardL’Allegato LI del decreto legislativo 9 Aprile 2008, n. 81, riguarda i “requisiti minimi di salute e sicurezza per l’utilizzo di attrezzature di lavoro portatili”. Questo allegato si applica a tutte le attrezzature di lavoro portatili, comprese le macchine, gli utensili e le attrezzature elettriche. L’obiettivo dell’allegato è garantire che le attrezzature di lavoro portatili siano utilizzate in modo sicuro e non danneggino la salute dei lavoratori.

L’Allegato LI definisce i requisiti minimi per la sicurezza delle attrezzature di lavoro portatili, tra cui:

La scelta di attrezzature di lavoro portatili sicure
L’uso corretto delle attrezzature di lavoro portatili
La manutenzione e la riparazione delle attrezzature di lavoro portatili
L’informazione e la formazione dei lavoratori sull’uso sicuro delle attrezzature di lavoro portatili
L’allegato LI prevede inoltre che le attrezzature di lavoro portatili siano sottoposte a controlli periodici per verificare che siano in buone condizioni di manutenzione e sicurezza.

L’utilizzo di attrezzature di lavoro portatili sicure è importante per la salute e la sicurezza dei lavoratori. L’Allegato LI del decreto legislativo 9 Aprile 2008, n. 81, definisce i requisiti minimi per la sicurezza delle attrezzature di lavoro portatili e fornisce indicazioni su come utilizzarle in modo sicuro.
Annex LI of Legislative Decree No. 81 of 9 April 2008, concerns the “minimum health and safety requirements for the use of portable work equipment”. This annex applies to all portable work equipment, including machines, tools, and electrical equipment. The goal of the annex is to ensure that portable work equipment is used safely and does not harm the health of workers.
Annex LI defines the minimum safety requirements for portable work equipment, including:
  • The selection of safe portable work equipment
  • The correct use of portable work equipment
  • The maintenance and repair of portable work equipment
  • The information and training of workers on the safe use of portable work equipment
The annex also stipulates that portable work equipment must undergo periodic inspections to ensure they are in good maintenance and safety condition.
The use of safe portable work equipment is essential for the health and safety of workers. Annex LI of Legislative Decree No. 81 of 9 April 2008, defines the minimum safety requirements for portable work equipment and provides guidelines on how to use them safely.

Appendix E. Examples of Responses from Bard and ChatGPT-4+Context Categorized by Error Types

Table A4. Examples of ChatGPT-4+ctx and Bard giving no response.
Table A4. Examples of ChatGPT-4+ctx and Bard giving no response.
LanguageItalianEnglish
QuestionCosa devono fare i medici e le strutture sanitari pubbliche e private, nonché gli istituti previdenziali ed assicurativi pubblici o privati, che identificano casi di neoplasie da loro ritenute attribuibili ad esposizioni lavorative ad agenti cancerogeni e/o mutageni?What should doctors, public and private healthcare facilities, as well as public or private social security and insurance institutions do when they identify cases of neoplasms that they believe are attributable to occupational exposure to carcinogenic and/or mutagenic agents?
ChatGPT-4+ctx answerNon sono programmato per aiutarti in questo caso.I am not programmed to assist you in this case.
QuestionDa chi è composto il comitato per l’indirizzo e la valutazione delle politiche attive e per il coordinamento nazionale delle attività di vigilanza in materia di salute e sicurezza sul lavoro?Who is part of the committee for the direction and evaluation of active policies and for the national coordination of surveillance activities in the field of occupational health and safety?
Bard answerNon posso aiutarti in questo, poiché sono solo un modello linguistico e non ho la capacità di capire e rispondere.I cannot help you with this, because I am just a language model and do not have the ability to understand and respond.
Table A5. Examples of ChatGPT-4+ctx and Bard giving incorrect definitions.
Table A5. Examples of ChatGPT-4+ctx and Bard giving incorrect definitions.
LanguageItalianEnglish
QuestionCos’è l’indice NIOSH?What is the NIOSH index?
ChatGPT-4+ctx answerL’indice NIOSH non è un termine specifico, ma potresti fare riferimento all’Istituto Nazionale per la Sicurezza e la Salute Occupazionale (NIOSH), che è l’ente degli Stati Uniti che si occupa di ricerche e raccomandazioni per prevenire infortuni e malattie legate al lavoro. Questo istituto rilascia certificazioni per vari tipi di dispositivi di protezione individuale, incluso le maschere respiratorie.
Tuttavia, è importante ricordare che in Italia, secondo il D.Lgs. 81/2008, i dispositivi di protezione individuale devono essere conformi alle normative europee, quindi la certificazione NIOSH non è riconosciuta direttamente.
The NIOSH index is not a specific term, but you might be referring to the National Institute for Occupational Safety and Health (NIOSH), which is the U.S. agency responsible for research and recommendations to prevent workplace injuries and diseases. This institute issues certifications for various types of personal protective equipment, including respirators. However, it is important to remember that in Italy, according to Legislative Decree 81/2008, personal protective equipment must comply with European regulations, so NIOSH certification is not directly recognized.
QuestionCosa si intende per informazione in ambito lavorativo?What is meant by information in the workplace?
Bard answerIn ambito lavorativo, l’informazione è qualsiasi dato, notizia o conoscenza che viene trasmessa ai lavoratori in relazione al loro lavoro. L’informazione può essere trasmessa in diversi modi, ad esempio attraverso manuali, opuscoli, corsi di formazione, riunioni informative o conversazioni informali. L’informazione è importante per i lavoratori perché consente loro di svolgere il loro lavoro in modo sicuro ed efficiente. Inoltre, l’informazione può aiutare i lavoratori a conoscere i loro diritti e doveri, a comprendere i rischi sul lavoro e a prendere decisioni informate sul loro lavoro.

Il decreto legislativo 81/08, che riguarda la tutela della salute e della sicurezza sul lavoro, definisce l’informazione come “l’insieme di dati, notizie e conoscenze che vengono trasmesse ai lavoratori in relazione al loro lavoro”. Il decreto legislativo 81/08 prevede che i datori di lavoro forniscano ai lavoratori informazioni adeguate e aggiornate sui rischi per la loro salute e sicurezza sul lavoro e sulle misure di prevenzione e protezione da adottare. Il decreto legislativo 81/08 prevede inoltre che i lavoratori abbiano diritto di accedere alle informazioni sulla salute e sulla sicurezza sul lavoro e di partecipare alle attività di prevenzione e protezione.

Ecco alcuni esempi di informazioni che i datori di lavoro devono fornire ai lavoratori:

I rischi per la salute e la sicurezza sul lavoro presenti nell’azienda;
Le misure di prevenzione e protezione da adottare per ridurre i rischi;
Le procedure di emergenza da adottare in caso di incidenti o infortuni;
I diritti e i doveri dei lavoratori in materia di salute e sicurezza sul lavoro.
I lavoratori hanno diritto di accedere alle informazioni sulla salute e sulla sicurezza sul lavoro attraverso diversi canali, ad esempio:

Il manuale di sicurezza;
Gli opuscoli informativi;
I corsi di formazione;
Le riunioni informative;
Le conversazioni informali con i datori di lavoro o i responsabili della sicurezza sul lavoro.
I lavoratori hanno il dovere di informarsi sui rischi per la loro salute e sicurezza sul lavoro e di adottare le misure di prevenzione e protezione previste. I lavoratori hanno inoltre il dovere di segnalare ai datori di lavoro o ai responsabili della sicurezza sul lavoro qualsiasi situazione di pericolo.
In the workplace, information refers to any data, news, or knowledge that is transmitted to workers in relation to their work. Information can be conveyed in various ways, such as through manuals, brochures, training courses, informational meetings, or informal conversations. Information is important for workers because it enables them to perform their job safely and efficiently. Moreover, information can help workers understand their rights and duties, recognize workplace risks, and make informed decisions about their work.
Legislative Decree 81/08, which addresses the protection of health and safety at work, defines information as ’the set of data, news, and knowledge transmitted to workers in relation to their work’. Legislative Decree 81/08 mandates that employers provide workers with adequate and up-to-date information about risks to their health and safety at work and about preventive and protective measures to be adopted. The decree also ensures that workers have the right to access information on health and safety at work and to participate in prevention and protection activities.
Here are some examples of information that employers must provide to workers:
  • Health and safety risks present in the company.
  • Preventive and protective measures to reduce risks.
  • Emergency procedures to follow in case of accidents or injuries.
  • Workers’ rights and duties related to health and safety at work.
Workers have the right to access information on health and safety at work through various channels, such as:
  • The safety manual.
  • Informational brochures.
  • Training courses.
  • Informational meetings.
  • Informal conversations with employers or health and safety officers.
Workers also have the duty to inform themselves about risks to their health and safety at work and to adopt the prescribed preventive and protective measures. Furthermore, workers are required to report any hazardous situations to employers or health and safety officers.
Table A6. Examples of ChatGPT-4+ctx and Bard making semantic errors.
Table A6. Examples of ChatGPT-4+ctx and Bard making semantic errors.
LanguageItalianEnglish
QuestionQual è nel D.Lgs. 81/08 il valore limite del peso sollevabile dal singolo lavoratore che, se superato, crea le condizioni di rischio?What is the weight limit for lifting by a single worker according to Legislative Decree 81/08, beyond which it creates risk conditions?
ChatGPT-4+ctx answer Il D.Lgs. 81/08 non specifica un preciso valore limite del peso che un singolo lavoratore può sollevare. Tuttavia, l’allegato XXXIII del documento specifica che la movimentazione manuale di un carico può costituire un rischio per la salute del lavoratore se il carico è troppo pesante. Il determinare cosa costituisca un carico “troppo pesante” dipende da vari fattori, inclusi la forza del lavoratore, la posizione in cui il carico è sollevato, la frequenza con la quale il carico è sollevato, ecc. Inoltre, lo stesso allegato suggerisce che le attività lavorative di movimentazione manuale dei carichi dovrebbero considerare una serie di elementi di riferimento e di fattori individuali di rischio. Questi includono le caratteristiche del carico e lo sforzo fisico richiesto. È importante notare che le normative specifiche sul peso massimo che un lavoratore può sollevare possono variare a seconda delle leggi locali o regionali, delle linee guida dell’industria e delle politiche del datore di lavoro. Pertanto, è sempre una buona idea consultare un esperto di salute e sicurezza sul lavoro per avere consigli specifici per la propria situazione.Legislative Decree 81/08 does not specify a precise weight limit for lifting by a single worker. However, Annex XXXIII of the document specifies that manual handling of a load can pose a risk to the worker’s health if the load is too heavy. Determining what constitutes a ’too heavy’ load depends on various factors, including the worker’s strength, the position in which the load is lifted, the frequency of lifting, etc. Additionally, the same annex suggests that manual load handling activities should consider a series of reference elements and individual risk factors. These include the characteristics of the load and the physical effort required. It is important to note that specific regulations on the maximum weight a worker can lift may vary depending on local or regional laws, industry guidelines, and employer policies. Therefore, it is always a good idea to consult an occupational health and safety expert for specific advice regarding your situation.
QuestionE’ possibile la presenza dei lavoratori
all’interno di locali riservati ai fumatori?
Is it possible for workers to be present in smoking areas?
Bard answerNo, non è possibile la presenza dei lavoratori all’interno di locali riservati ai fumatori. Il Decreto Legislativo 81/08, art. 51, comma 1, lettera a), vieta il fumo in tutti i locali chiusi aperti al pubblico, compresi i locali di lavoro. L’art. 51, comma 2, del D.Lgs. 81/08 prevede che i datori di lavoro adottino le misure necessarie per garantire ai lavoratori un ambiente di lavoro salubre e privo di rischi per la salute, anche in relazione all’esposizione al fumo di tabacco. Pertanto, i datori di lavoro sono tenuti a vietare il fumo all’interno dei locali di lavoro, anche se riservati ai fumatori.

Il divieto di fumo nei luoghi di lavoro è stato introdotto per tutelare la salute dei lavoratori, sia passivi che attivi. Il fumo passivo è un rischio per la salute, in quanto può causare malattie cardiovascolari, respiratorie e cancro. Il divieto di fumo nei luoghi di lavoro è stato inoltre introdotto per promuovere un ambiente di lavoro più sano e produttivo.
No, it is not allowed for workers to be present in smoking areas. Legislative Decree 81/08, Article 51, paragraph 1, letter (a), prohibits smoking in all enclosed public places, including workplaces. Article 51, paragraph 2, of Legislative Decree 81/08 stipulates that employers must take the necessary measures to ensure a healthy and risk-free working environment for workers, including in relation to exposure to tobacco smoke. Therefore, employers are required to prohibit smoking within workplaces, even in areas designated for smokers.
The smoking ban in workplaces was introduced to protect the health of both active and passive workers. Passive smoking is a health risk, as it can cause cardiovascular diseases, respiratory issues, and cancer. The smoking ban in workplaces was also introduced to promote a healthier and more productive work environment.
Table A7. Examples of ChatGPT-4+ctx and Bard responses with context errors.
Table A7. Examples of ChatGPT-4+ctx and Bard responses with context errors.
LanguageItalianEnglish
QuestionA cosa si applicano le prescrizioni dell’Allegato L?What do the provisions of Annex L apply to?
ChatGPT-4+ctx answerL’Allegato L del Decreto Legislativo n. 81/2008 riguarda le prescrizioni minime per il miglioramento della sicurezza e della salute dei lavoratori durante i periodi di esposizione a rumore e si applicano a tutte le attività in cui i lavoratori sono o possono essere esposti a rischio da rumore nell’esercizio del loro lavoro.Annex L of Legislative Decree No. 81/2008 concerns the minimum provisions for improving the safety and health of workers during periods of exposure to noise, and applies to all activities where workers are or may be exposed to noise risk in the course of their work.
QuestionIl lavoratore esposto ad agenti cancerogeni e mutageni può conoscere il contenuto della sua cartella sanitaria e di rischio e del registro degli esposti?Can a worker exposed to carcinogenic and mutagenic agents know the content of their health and risk file and the register of exposed workers?
Bard answerSì, il lavoratore esposto ad agenti cancerogeni e mutageni ha diritto di conoscere il contenuto della sua cartella sanitaria e di rischio e del registro degli esposti. Il Decreto Legislativo 81/2008, all’articolo 238, comma 1, stabilisce che “Il lavoratore ha diritto di accesso al proprio dossier sanitario e di essere informato sul contenuto dello stesso. Il datore di lavoro, ove previsto, il medico competente e il responsabile del servizio di prevenzione e protezione sono tenuti a fornire al lavoratore le informazioni di cui al comma 1, in un formato facilmente comprensibile e, su richiesta del lavoratore, in forma scritta”.

Il registro degli esposti è un documento che deve essere tenuto dal datore di lavoro e che contiene le informazioni sui lavoratori che sono esposti ad agenti cancerogeni e mutageni. Il registro deve essere aggiornato periodicamente e deve essere accessibile ai lavoratori.

Il lavoratore ha diritto di accedere alla sua cartella sanitaria e di rischio e al registro degli esposti, in modo da conoscere le informazioni relative alla sua salute e al suo rischio di esposizione ad agenti cancerogeni e mutageni. Queste informazioni sono importanti per il lavoratore, in quanto gli consentono di prendere decisioni informate sulla sua salute e sulla sua sicurezza sul lavoro.

Il lavoratore può richiedere l’accesso alla sua cartella sanitaria e di rischio e al registro degli esposti rivolgendosi al datore di lavoro, al medico competente o al responsabile del servizio di prevenzione e protezione. Il datore di lavoro, il medico competente e il responsabile del servizio di prevenzione e protezione sono tenuti a fornire al lavoratore le informazioni richieste entro 15 giorni dalla richiesta.

Se il datore di lavoro, il medico competente o il responsabile del servizio di prevenzione e protezione non forniscono al lavoratore le informazioni richieste, il lavoratore può presentare un reclamo all’INAIL o all’Ispettorato del Lavoro.
Yes, a worker exposed to carcinogenic and mutagenic agents has the right to know the content of their health and risk file and the register of exposed workers. Legislative Decree 81/2008, Article 238, paragraph 1, establishes that ‘The worker has the right to access their health dossier and to be informed about its content. The employer, where applicable, the competent doctor, and the head of the prevention and protection service are required to provide the worker with the information mentioned in paragraph 1, in an easily understandable format and, upon the worker’s request, in writing’.
The register of exposed workers is a document that must be kept by the employer, containing information about workers exposed to carcinogenic and mutagenic agents. The register must be updated periodically and must be accessible to workers.
The worker has the right to access their health and risk file and the register of exposed workers to learn about their health and risk of exposure to carcinogenic and mutagenic agents. This information is important for the worker as it enables them to make informed decisions about their health and workplace safety.
The worker can request access to their health and risk file and the register of exposed workers by contacting the employer, the competent doctor, or the head of the prevention and protection service. The employer, the competent doctor, and the head of the prevention and protection service are required to provide the requested information within 15 days of the request.
If the employer, the competent doctor, or the head of the prevention and protection service fail to provide the requested information, the worker can file a complaint with INAIL or the Labor Inspectorate.
Table A8. Examples of ChatGPT-4+ctx and Bard responses containing internal contradictions.
Table A8. Examples of ChatGPT-4+ctx and Bard responses containing internal contradictions.
LanguageItalianEnglish
QuestionCosa si intende per attrezzatura di lavoro?What is meant by work equipment?
ChatGPT-4+ctx answerUn’attrezzatura di lavoro si riferisce a qualsiasi macchina, apparecchio, utensile, impianto, infrastruttura, ecc. utilizzato a lavoro. Questa definizione è molto ampia e comprende tutta una serie di elementi che possono essere usati in ambiente di lavoro, da quelli più comuni come i computer e i telefoni, a quelli specifici di certi campi, come ad esempio gli attrezzi utilizzati in un cantiere edile o le macchine presenti in un laboratorio di ricerca. Per quanto riguarda il quadro legislativo italiano, la definizione di attrezzatura di lavoro viene fornita dal Decreto Legislativo 81/2008 che, all’articolo 2, rende riferimento a “qualsiasi macchina, apparecchio, utensile, impianto, infrastruttura, ecc, utilizzato dal datore di lavoro durante il lavoro”.Work equipment refers to any machine, device, tool, plant, infrastructure, etc., used at work. This definition is very broad and includes a wide range of items that can be used in the workplace, from more common ones like computers and phones, to those specific to certain fields, such as tools used on a construction site or machines found in a research laboratory. Regarding the Italian legislative framework, the definition of work equipment is provided by Legislative Decree 81/2008, which, in Article 2, refers to “any machine, device, tool, plant, infrastructure, etc., used by the employer during work”.
QuestionDurante l’allattamento, la lavoratrice può essere esposta a rischio biologico?Can a breastfeeding worker be exposed to biological risks?
Bard answerSì, è possibile che una lavoratrice in allattamento possa essere esposta a rischi biologici sul posto di lavoro. Gli agenti biologici sono microrganismi, come batteri, virus, funghi e parassiti, che possono causare malattie. Possono essere presenti in diversi ambienti di lavoro, come ospedali, laboratori, allevamenti e industrie alimentari.

Se una lavoratrice in allattamento è esposta a rischi biologici, è importante che informi il suo datore di lavoro e il medico del lavoro. Il datore di lavoro è tenuto a prendere misure per ridurre l’esposizione ai rischi biologici, come fornire alla lavoratrice la giusta formazione e indumenti protettivi. Il medico del lavoro può valutare il livello di rischio e fornire alla lavoratrice consigli su come proteggere se stessa e il suo bambino.

Il Decreto Legislativo 81/08, Testo Unico sulla Sicurezza sul Lavoro, prevede una serie di misure per la tutela della salute e della sicurezza dei lavoratori, comprese le lavoratrici in allattamento. L’articolo 200 del Decreto Legislativo 81/08 prevede che il datore di lavoro deve adottare misure per ridurre l’esposizione dei lavoratori a rischi biologici, tra cui:

Informare i lavoratori sui rischi biologici presenti sul posto di lavoro
Fornire ai lavoratori i dispositivi di protezione individuale (DPI) necessari
Organizzare la formazione dei lavoratori sui rischi biologici
Effettuare la valutazione dei rischi biologici
Il datore di lavoro è inoltre tenuto a garantire che i lavoratori in allattamento non siano esposti a rischi biologici che potrebbero causare danni al loro bambino. Se una lavoratrice in allattamento è esposta a rischi biologici, il datore di lavoro deve adeguare le misure di sicurezza per ridurre l’esposizione al minimo.

Se sei una lavoratrice in allattamento e sei preoccupata per l’esposizione a rischi biologici sul posto di lavoro, è importante che tu parli con il tuo datore di lavoro e il medico del lavoro.
Yes, it is possible for a breastfeeding worker to be exposed to biological risks at the workplace. Biological agents are microorganisms, such as bacteria, viruses, fungi, and parasites, that can cause diseases. They may be present in various work environments, such as hospitals, laboratories, farms, and food industries.
If a breastfeeding worker is exposed to biological risks, it is important for her to inform her employer and the occupational doctor. The employer is required to take measures to reduce exposure to biological risks, such as providing the worker with proper training and protective clothing. The occupational doctor can assess the level of risk and provide the worker with advice on how to protect herself and her child.
Legislative Decree 81/08, the Consolidated Law on Workplace Safety, provides a series of measures for the protection of the health and safety of workers, including breastfeeding workers. Article 200 of Legislative Decree 81/08 requires the employer to take measures to reduce workers’ exposure to biological risks, including:
  • Informing workers about biological risks present at the workplace
  • Providing workers with necessary personal protective equipment (PPE)
  • Organizing training for workers on biological risks
  • Conducting a biological risk assessment
The employer is also required to ensure that breastfeeding workers are not exposed to biological risks that could harm their child. If a breastfeeding worker is exposed to biological risks, the employer must adjust safety measures to minimize exposure.
If you are a breastfeeding worker and are concerned about exposure to biological risks at work, it is important to talk to your employer and the occupational doctor.
Table A9. Examples of ChatGPT-4+ctx and Bard responses containing typo errors.
Table A9. Examples of ChatGPT-4+ctx and Bard responses containing typo errors.
LanguageItalianEnglish
QuestionQuali sono gli obblighi del datore di lavoro in merito a formazione, informazione e addestramento per le attrezzature da lavoro?What are the employer’s obligations regarding training, information, and instruction for work equipment?
ChatGPT-4+ctx answerIl datore di lavoro ha una serie di obblighi rispetto alle attrezzature di lavoro. Secondo l’articolo 73 del Decreto Legislativo 81/2008:
  • Il datore di lavoro deve garantire che i lavoratori incaricati dell’uso delle attrezzature di lavoro ricevano sufficienti informazioni e istruzioni, e che siano adeguatamente formati e addestrati, specialmente per quanto riguarda la sicurezza nell’utilizzo delle attrezzature e durante situazioni anormali prevedibili.
  • Deve anche informare i lavoratori sui rischi a cui sono esposti durante l’uso delle attrezzature e sulle attrezzature presenti nell’ambiente di lavoro circostante, anche se non usate direttamente dai lavoratori. Questo include informare i lavoratori su eventuali cambiamenti nelle attrezzature.
  • Le informazioni e le istruzioni fornite devono essere comprensibili per i lavoratori interessati.
  • Per le attrezzature che richiedono conoscenze e responsabilità particolari, il datore di lavoro deve garantire una formazione, informazione e addestramento adeguati e specifici.
  • Il datore di lavoro è anche obbligato a prendere le misure necessarie affinché il posto di lavoro e la posizione dei lavoratori durante l’uso delle attrezzature siano sicuri e rispondano ai principi dell’ergonomia.
  • Se le attrezzature richiedono per il loro impiego conoscenze o responsabilità particolari per i loro rischi specifici, il datore di lavoro deve prendere le misure necessarie affinché l’uso dell’attrezzatura sia riservato ai lavoratori opportunamente formati, informati ed addestrati. In caso di riparazione, trasformazione o manutenzione, i lavoratori interessati devono essere qualificati in maniera specifica.
  • Finally, il datore di lavoro deve assicurarsi che le attrezzature di lavoro la cui sicurezza dipende dalle condizioni di installazione siano controllate dopo l’installazione e prima della messa in esercizio, e dopo ogni montaggio in un nuovo luogo.
The employer has several obligations regarding work equipment. According to Article 73 of Legislative Decree 81/2008:
  • The employer must ensure that workers assigned to use work equipment receive sufficient information and instructions, and that they are adequately trained and instructed, especially regarding safety in the use of equipment and during foreseeable abnormal situations.
  • The employer must also inform workers about the risks they are exposed to when using the equipment and about equipment in the surrounding work environment, even if it is not directly used by the workers. This includes informing workers about any changes to the equipment.
  • The information and instructions provided must be understandable to the workers concerned.
  • For equipment that requires specific knowledge and responsibility, the employer must ensure adequate and specific training, information, and instruction.
  • The employer is also required to take the necessary measures to ensure that the workplace and workers’ positions during the use of equipment are safe and meet ergonomic principles.
  • If the equipment requires specific knowledge or responsibility for its specific risks, the employer must take the necessary measures to ensure that the use of the equipment is reserved for workers who are properly trained, informed, and instructed. In the case of repair, modification, or maintenance, the workers involved must be specifically qualified.
  • Finally, the employer must ensure that work equipment, the safety of which depends on installation conditions, is checked after installation and before being put into operation, and after each reassembly in a new location.
Question Come deve essere il sedile del lavoratore addetto al videoterminale secondo l’allegato XXXIV del Decreto Legislativo 81/08?How should the seat of the worker assigned to the visual display unit be, according to Annex XXXIV of Legislative Decree 81/08?
Bard answerL’allegato XXXIV del Decreto Legislativo 82/08 prevede che il sedile del lavoratore addetto al videoterminale deve essere:

Regolabile in altezza e in profondità
Dotato di schienale regolabile
Dotato di braccioli
Dotato di una base stabile
Realizzato con materiali che non accumulino calore
Dotato di un rivestimento facile da pulire
Il sedile deve essere regolato in modo che le ginocchia siano piegate a un angolo di 90 gradi e i piedi appoggiati a terra. Il torace deve essere in posizione eretta e lo schienale del sedile deve sostenere la schiena. Le braccia devono essere appoggiate ai braccioli e i gomiti devono formare un angolo di 90 gradi.

Un sedile adeguato è importante per prevenire i disturbi muscoloscheletrici legati all’utilizzo dei videoterminali. Questi disturbi possono includere mal di schiena, dolori al collo, alle spalle e alle braccia. Un sedile adeguato può aiutare a ridurre il rischio di questi disturbi e migliorare il comfort del lavoratore.

Se il lavoratore avverte dolore o disagio durante l’utilizzo del videoterminale, è importante consultare un medico o un fisioterapista. Questi professionisti possono aiutare a identificare la causa del dolore e consigliare soluzioni per prevenirlo.
Annex XXXIV of Legislative Decree 82/08 provides that the seat of the worker assigned to the visual display unit must be:
  • Adjustable in height and depth
  • Equipped with an adjustable backrest
  • Equipped with armrests
  • Equipped with a stable base
  • Made of materials that do not accumulate heat
  • Covered with a material that is easy to clean
The seat should be adjusted so that the knees are bent at a 90-degree angle and the feet are flat on the floor. The torso should be in an upright position, and the backrest should support the back. The arms should rest on the armrests, and the elbows should form a 90-degree angle.
An appropriate seat is important to prevent musculoskeletal disorders related to the use of visual display units. These disorders can include back pain, neck pain, shoulder pain, and arm pain. A suitable seat can help reduce the risk of these disorders and improve the worker’s comfort.
If the worker experiences pain or discomfort while using the visual display unit, it is important to consult a doctor or a physiotherapist. These professionals can help identify the cause of the pain and recommend solutions to prevent it.

References

  1. Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large language models: A survey. arXiv 2024, arXiv:2402.0619. [Google Scholar] [CrossRef]
  2. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar] [CrossRef]
  3. Introducing ChatGPT. Available online: https://openai.com/index/chatgpt/ (accessed on 8 October 2024).
  4. AI is Promoted from Back-Office Duties to Investment Decisions, Financial Times. Available online: https://www.ft.com/content/3d82ea9f-f040-47aa-9b9d-0be9decdbb14 (accessed on 7 July 2024).
  5. Israel Offers a Glimpse into the Terrifying World of Military AI, The Washington Post, 05/04/2024. Available online: https://www.washingtonpost.com/world/2024/04/05/israel-idf-lavender-ai-militarytarget/ (accessed on 7 July 2024).
  6. Wang, D.; Zhang, S. Large language models in medical and healthcare fields: Applications, advances, and challenges. Artif. Intell. Rev. 2024, 57, 299. [Google Scholar] [CrossRef]
  7. Hirosawa, T.; Harada, Y.; Mizuta, K.; Sakamoto, T.; Tokumasu, K.; Shimizu, T. Evaluating ChatGPT-4’s Accuracy in Identifying Final Diagnoses Within Differential Diagnoses Compared With Those of Physicians: Experimental Study for Diagnostic Cases. JMIR Form Res. 2024, 8, e59267. [Google Scholar] [CrossRef] [PubMed]
  8. Apornvirat, S.; Namboonlue, C.; Laohawetwanit, T. Comparative analysis of ChatGPT and Bard in answering pathology examination questions requiring image interpretation. Am. J. Clin. Pathol. 2024, 162, 252–260. [Google Scholar] [CrossRef] [PubMed]
  9. Altıntaş, E.; Ozkent, M.S.; Gül, M.; Batur, A.F.; Kaynar, M.; Kılıç, Ö.; Göktaş, S. Comparative analysis of artificial intelligence chatbot recommendations for urolithiasis management: A study of EAU guideline compliance. Fr. J. Urol. 2024, 34, 102666. [Google Scholar] [CrossRef] [PubMed]
  10. Burnette, H.; Pabani, A.; von Itzstein, M.S.; Switzer, B.; Fan, R.; Ye, F.; Puzanov, I.; Naidoo, J.; Ascierto, P.A.; Gerber, D.E.; et al. Use of artificial intelligence chatbots in clinical management of immune-related adverse events. J. Immunother. Cancer 2024, 12, e008599. [Google Scholar] [CrossRef]
  11. Neo, J.R.E.; Ser, J.S.; Tay, S.S. Use of large language model-based chatbots in managing the rehabilitation concerns and education needs of outpatient stroke survivors and caregivers. Front. Digit. Health 2024, 6, 1395501. [Google Scholar] [CrossRef] [PubMed]
  12. Rossettini, G.; Cook, C.; Palese, A.; Pillastrini, P.; Turolla, A. Pros and Cons of Using Artificial Intelligence Chatbots for Musculoskeletal Rehabilitation Management. J. Orthop. Sports Phys. Ther. 2023, 53, 728–734. [Google Scholar] [CrossRef]
  13. Dos Santos, M.L.; Victória, V.N.G. Critical evaluation of applications of artificial intelligence based linguistic models in Occupational Health. Rev. Bras. Med. Trab. 2024, 22, e20231241. [Google Scholar] [CrossRef]
  14. Johnson, M.W. P-320 What does a large language model know about the prevalence of occupationally-related medical conditions? experiments with synthetic and real occupational medicine data. Occup. Med. 2024, 74, i366. [Google Scholar] [CrossRef]
  15. Padovan, M.; Cosci, B.; Petillo, A.; Nerli, G.; Porciatti, F.; Scarinci, S.; Carlucci, F.; Dell’Amico, L.; Meliani, N.; Necciari, G.; et al. ChatGPT in Occupational Medicine: A Comparative Study with Human Experts. Bioengineering 2024, 11, 57. [Google Scholar] [CrossRef]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  17. Amato, F.D.F. Gianfranco. Decreto Legislativo 81/08: Test Unico Sulla Salute e Sicurezza Sul Lavoro. Available online: https://www.ispettorato.gov.it/documenti-e-normativa/normativa-di-interesse/salute-e-sicurezza-nei-luoghi-di-lavoro/ (accessed on 1 May 2023).
  18. Open, A.I. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  19. Pichai, S.; Hassabis, D.; Introducing Gemini: Our Largest and Most Capable AI Model. Google. 2023. Available online: https://blog.google/technology/ai/google-gemini-ai/ (accessed on 6 June 2024).
  20. An Important Next Step on Our AI Journey. Available online: https://blog.google/technology/ai/bard-google-ai-search-updates/ (accessed on 6 October 2024).
  21. Ai Google. Available online: https://ai.google/discover/palm2/ (accessed on 6 June 2024).
  22. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805. [Google Scholar] [CrossRef]
  23. Lee, Y.; Tessier, L.; Brar, K.; Malone, S.; Jin, D.; McKechnie, T.; Jung, J.J.; Kroh, M.; Dang, J.T. ASMBS Artificial Intelligence and Digital Surgery Taskforce. Performance of artificial intelligence in bariatric surgery: Comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery textbook of bariatric surgery questions. Surg. Obes. Relat. Dis. 2024, 20, 609–613. [Google Scholar] [CrossRef]
  24. Patil, N.S.; Huang, R.S.; van der Pol, C.B.; Larocque, N. Comparative Performance of ChatGPT and Bard in a Text-Based Radiology Knowledge Assessment. Can. Assoc. Radiol. J. 2024, 75, 344–350. [Google Scholar] [CrossRef]
  25. Wu, Z.; Gan, W.; Xue, Z.; Ni, Z.; Zheng, X.; Zhang, Y. Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study. JMIR Med. Educ. 2024, 10, e52746. [Google Scholar] [CrossRef]
  26. Sallam, M.; Al-Salahat, K.; Eid, H.; Egger, J.; Puladi, B. Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5 and Humans in Clinical Chemistry Multiple-Choice Questions. Adv. Med. Educ. Pract. 2024, 15, 857–871. [Google Scholar] [CrossRef]
  27. Yamaguchi, S.; Morishita, M.; Fukuda, H.; Muraoka, K.; Nakamura, T.; Yoshioka, I.; Soh, I.; Ono, K.; Awano, S. Evaluating the efficacy of leading large language models in the Japanese national dental hygienist examination: A comparative analysis of ChatGPT, Bard, and Bing Chat. J. Dent. Sci. 2024, 19, 2262–2267. [Google Scholar] [CrossRef] [PubMed]
  28. Hanna, R.E.; Smith, L.R.; Mhaskar, R.; Hanna, K. Performance of Language Models on the Family Medicine In-Training Exam. Fam. Med. 2024, 56, 555–560. [Google Scholar] [CrossRef] [PubMed]
  29. Alessandri-Bonetti, M.; Liu, H.Y.; Donovan, J.M.; Ziembicki, J.A.; Egro, F.M. A Comparative Analysis of ChatGPT, ChatGPT-4, and Google Bard Performances at the Advanced Burn Life Support Exam. J. Burn. Care Res. 2024, 45, 945–948. [Google Scholar] [CrossRef]
  30. Menekşeoğlu, A.K.; İş, E.E. Comparative performance of artificial intelligence models in physical medicine and rehabilitation board-level questions. Rev. Assoc. Med. Bras. 2024, 70, e20240241. [Google Scholar] [CrossRef]
  31. Rahsepar, A.A.; Tavakoli, N.; Kim, G.H.J.; Hassani, C.; Abtin, F.; Bedayat, A. How AI Responds to Common Lung Cancer Questions: ChatGPT vs Google Bard. Radiology. 2023, 307, e230922. [Google Scholar] [CrossRef] [PubMed]
  32. Resnick, D.K. Commentary: Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank. Neurosurgery 2023, 93, e123–e124. [Google Scholar] [CrossRef] [PubMed]
  33. Raimondi, R.; Tzoumas, N.; Salisbury, T.; Di Simplicio, S.; Romano, M.R. Comparative analysis of large language models in the Royal College of Ophthalmologists fellowship exams. Eye 2023, 37, 3530–3533. [Google Scholar] [CrossRef]
  34. Ahmed, I.; Kajol, M.; Hasan, U.; Datta, P.P.; Roy, A.; Reza, M.R. ChatGPT versus Bard: A comparative study. Eng. Rep. 2024, 6, e12890. [Google Scholar] [CrossRef]
  35. World Health Organization. Ethics & Governance of Artificial Intelligence for Health; WHO: Geneva, Switzerland, 2021. Available online: https://www.who.int/publications/i/item/9789240029200 (accessed on 30 October 2024).
  36. Baldassarre, A.; Padovan, M. Regulatory and Ethical Considerations on Artificial Intelligence for Occupational Medicine. Med. Lav. 2024, 115, e2024013. [Google Scholar] [CrossRef]
  37. Jones, E.; Palangi, H.; Simões, C.; Chandrasekaran, V.; Mukherjee, S.; Mitra, A.; Awadallah, A.; Kamar, E. Teaching Language Models to Hallucinate Less with Synthetic Tasks. arXiv 2023, arXiv:2310.06827. [Google Scholar] [CrossRef]
  38. Neelakantan, A.; Xu, T.; Puri, R.; Radford, A.; Han, J.M.; Tworek, J.; Yuan, Q.; Tezak, N.; Kim, J.W.; Hallacy, C.; et al. Text and code embeddings by contrastive pre-training. arXiv 2022, arXiv:2201.10005. [Google Scholar] [CrossRef]
Figure 1. Distribution of question lengths.
Figure 1. Distribution of question lengths.
Applsci 15 01508 g001
Figure 2. Median (red line), 25th (green line) and 75th percentiles (blue line) for the accuracy and completeness of the responses from and Bard and ChatGPT-4+context.
Figure 2. Median (red line), 25th (green line) and 75th percentiles (blue line) for the accuracy and completeness of the responses from and Bard and ChatGPT-4+context.
Applsci 15 01508 g002
Figure 3. Error rate (%) detected for each error class in the responses generated by Bard and ChatGPT-4+context.
Figure 3. Error rate (%) detected for each error class in the responses generated by Bard and ChatGPT-4+context.
Applsci 15 01508 g003
Table 1. Description of the types of errors assessed by physicians during the evaluation phase.
Table 1. Description of the types of errors assessed by physicians during the evaluation phase.
Error CategoryDescription
Context errorThe answer is not relevant to the context of the question.
Internal ContradictionThe answer contains two statements that contradict each other.
Typo ErrorThe answer contains a typographical error.
Incorrect DefinitionThe answer contains an incorrect definition.
Content ErrorThe answer contains partially or entirely incorrect content.
No ResponseThe chatbot did not provide an answer, making evaluation impossible.
Table 2. Mean score of accuracy and completeness obtained for ChatGPT-4+context and Bard.
Table 2. Mean score of accuracy and completeness obtained for ChatGPT-4+context and Bard.
Google Bard vs. ChatGPT-4+Context
MetricValuesp-Value
Completeness3.47 vs. 3.68<0.05
Accuracy3.14 vs. 3.67<0.05
Table 3. Mean, standard deviation (SD), and statistical significance (*) of the accuracy and completeness scores of the responses from the two chatbots based on the question topic. In the table, ChatGPT-4+context has been abbreviated as GPT-4.
Table 3. Mean, standard deviation (SD), and statistical significance (*) of the accuracy and completeness scores of the responses from the two chatbots based on the question topic. In the table, ChatGPT-4+context has been abbreviated as GPT-4.
TopicMetricsLLMAverageStandard Deviationp-Value
AsbestosAccuracyBard2.811.347p = 0.017 *
GPT-43.441.287
CompletenessBard3.171.404p = 0.239
GPT-43.501.353
Risk of Explosive AtmospheresAccuracyBard2.861.216p = 0.38
GPT-43.511.483
CompletenessBard3.031.150p = 0.083
GPT-43.491.401
Work Equipment and PPEAccuracyBard2.611.443p = 0.00 *
GPT-44.031.150
CompletenessBard3.211.234p = 0.05
GPT-44.000.986
“Interpelli”AccuracyBard3.171.342p = 0.00 *
GPT-43.921.161
CompletenessBard3.491.302p = 0.00 *
GPT-43.951.057
Manual Handling of LoadsAccuracyBard3.351.232p = 0.695
GPT-43.251.284
CompletenessBard3.701.188p = 0.009 *
GPT-43.201.191
Common PrinciplesAccuracyBard3.331.179p = 0.00 *
GPT-44.190.892
CompletenessBard3.531.226p = 0.007 *
GPT-44.100.767
Biological HazardsAccuracyBard2.891.302p = 0.00 *
GPT-43.761.253
CompletenessBard3.321.368p = 0.06
GPT-43.891.239
Carcinogenic and Mutagenic AgentsAccuracyBard2.991.216p = 0.132
GPT-43.311.241
CompletenessBard3.331.267p = 0.795
GPT-43.381.204
Chemical HazardsAccuracyBard3.371.280p = 0.0328
GPT-43.481.404
CompletenessBard3.751.239p = 0.131
GPT-43.451.377
Physical HazardsAccuracyBard3.431.362p = 0.635
GPT-43.271.622
CompletenessBard3.621.238p = 0.101
GPT-43.241.498
Use of Visual Display Terminals AccuracyBard2.961.427p = 0.000 *
GPT-43.831.411
CompletenessBard3.331.452p = 0.002 *
GPT-43.921.322
Table 4. Error rate (%) and statistical significance per error class of the two chatbots when analyzed for each question topic. (-c): no statistics are calculated because this value is constant. In the table, ChatGPT-4 has been abbreviated as ChatGPT-4+ctx. Note: p-values marked with (*) were calculated using Fisher’s exact test due to low frequency in at least one cell of the contingency table.
Table 4. Error rate (%) and statistical significance per error class of the two chatbots when analyzed for each question topic. (-c): no statistics are calculated because this value is constant. In the table, ChatGPT-4 has been abbreviated as ChatGPT-4+ctx. Note: p-values marked with (*) were calculated using Fisher’s exact test due to low frequency in at least one cell of the contingency table.
Topic Error CategoryBardChatGPT-4+ctxp-Value
AsbestosInternal Contradiction4.2%6.3%1.000 *
Incorrect Definition14.6%8.3%0.336
Context Error31.3%14.6%0.052
Typo Error14.6%14.6%1.000
No Response8.3%14.6%0.336
Content Error35.4%18.8%0.066
Risk of Explosive AtmospheresInternal Contradiction100%100%-c
Incorrect Definition22.9%20%0.771
Context Error37.1%22.9%0.192
Typo Error14.3%2.9%0.198 *
No Response8.6%5.7%0.643
Content Error34.3%28.65%0.607
Work Equipment and PPEInternal Contradiction0%2.6%1.000 *
Incorrect Definition34.2%0%0.00 *
Context Error42.1%7.9%0.01 *
Typo Error0%5.3%0.493 *
No Response100%100%-c
Content Error50%10.5%0.00 *
Internal Contradiction2.2%1.3%0.724 *
Incorrect Definition11.2%0.9%0.00 *
Context Error12.5%1.3%0.00 *
“Interpelli”Typo Error4%1.8%0.260 *
No Response1.8%1.8%1.00
Content Error28.1%8.5%0.00 *
Manual Handling of LoadsInternal Contradiction100%100%-c
Incorrect Definition8.5%12.7%0.413
Context Error19.7%12.7%0.225
Typo Error2.8%5.6%0.681 *
No Response7%11.3%0.383
Content Error28.2%29.6%0.853
Common PrinciplesInternal Contradiction1.4%0%1.000 *
Incorrect Definition12.3%2.7%0.056 *
Context Error23.3%2.7%0.00 *
Typo Error4.1%1.4%0.620 *
No Response5.5%4.1%1.000 *
Content Error21.9%9.6%0.041 *
Biological HazardsInternal Contradiction2.6%1.3%1.000 *
Incorrect Definition15.8%7.9%0.132
Context Error9.2%3.9%0.327 *
Typo Error2.6%2.6%1.000
No Response100%100%-c
Content Error32.9%13.2%0.004 *
Carcinogenic and Mutagenic AgentsInternal Contradiction1.4%0%1.000 *
Incorrect Definition6.9%5.6%0.731
Context Error23.6%13.9%0.135
Typo Error6.9%1.4%0.209 *
No Response9.7%5.6%0.347
Content Error34.7%27.8%0.369
Chemical HazardsInternal Contradiction100%100%-c
Incorrect Definition4.6%7.4%0.569 *
Context Error5.6%0.9%0.119 *
Typo Error10.2%2.8%0.050 **
No Response0%7.4%0.007 *
Content Error19.4%11.1%0.089
Physical HazardsInternal Contradiction1%0%1.000 *
Incorrect Definition14.6%14.6%1.00
Context Error10.7%11.7%0.825
Typo Error5.8%6.8%0.774
No Response1%5.8%0.119 *
Content Error22.3%17.5%0.299
Use of Visual Display Terminals Internal Contradiction1.9%9%1.000 *
Incorrect Definition16%5.7%0.015 *
Context Error15.1%1.9%0.001 *
Typo Error2.8%0%0.246 *
No Response100%100%-c
Content Error30.2%19.8%0.081 *
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Padovan, M.; Palla, A.; Marino, R.; Porciatti, F.; Cosci, B.; Carlucci, F.; Nerli, G.; Petillo, A.; Necciari, G.; Dell’Amico, L.; et al. ChatGPT-4 vs. Google Bard: Which Chatbot Better Understands the Italian Legislative Framework for Worker Health and Safety? Appl. Sci. 2025, 15, 1508. https://doi.org/10.3390/app15031508

AMA Style

Padovan M, Palla A, Marino R, Porciatti F, Cosci B, Carlucci F, Nerli G, Petillo A, Necciari G, Dell’Amico L, et al. ChatGPT-4 vs. Google Bard: Which Chatbot Better Understands the Italian Legislative Framework for Worker Health and Safety? Applied Sciences. 2025; 15(3):1508. https://doi.org/10.3390/app15031508

Chicago/Turabian Style

Padovan, Martina, Alessandro Palla, Riccardo Marino, Francesco Porciatti, Bianca Cosci, Francesco Carlucci, Gianluca Nerli, Armando Petillo, Gabriele Necciari, Letizia Dell’Amico, and et al. 2025. "ChatGPT-4 vs. Google Bard: Which Chatbot Better Understands the Italian Legislative Framework for Worker Health and Safety?" Applied Sciences 15, no. 3: 1508. https://doi.org/10.3390/app15031508

APA Style

Padovan, M., Palla, A., Marino, R., Porciatti, F., Cosci, B., Carlucci, F., Nerli, G., Petillo, A., Necciari, G., Dell’Amico, L., Lucisano, V. C., Scarinci, S., & Foddis, R. (2025). ChatGPT-4 vs. Google Bard: Which Chatbot Better Understands the Italian Legislative Framework for Worker Health and Safety? Applied Sciences, 15(3), 1508. https://doi.org/10.3390/app15031508

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop