Article

MédicoBERT: A Medical Language Model for Spanish Natural Language Processing Tasks with a Question-Answering Application Using Hyperparameter Optimization

by Josué Padilla Cuevas 1, José A. Reyes-Ortiz 2, Alma D. Cuevas-Rasgado 1,*, Román A. Mora-Gutiérrez 2 and Maricela Bravo 2

1 Computer Engineering, Universidad Autónoma del Estado de Mexico CU, Texcoco 56259, Mexico
2 Systems Department, Autonomous Metropolitan University, Azcapotzalco, Mexico City 02200, Mexico
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 7031; https://doi.org/10.3390/app14167031
Submission received: 5 July 2024 / Revised: 7 August 2024 / Accepted: 7 August 2024 / Published: 10 August 2024
(This article belongs to the Special Issue Techniques and Applications of Natural Language Processing)

Abstract

The increasing volume of medical information available in digital format presents a significant challenge for researchers seeking to extract relevant information. Manually analyzing voluminous data is a time-consuming process that constrains researchers’ productivity. In this context, innovative and intelligent computational approaches to information search, such as large language models (LLMs), offer a promising solution. LLMs understand natural language questions and respond accurately to complex queries, even in the specialized domain of medicine. This paper presents MédicoBERT, a Spanish medical language model developed by adapting a general-domain language model (BERT) to medical terminology and vocabulary related to diseases, treatments, symptoms, and medications. The model was pre-trained on 3 M medical texts containing 1.1 B words. MédicoBERT was then adapted and evaluated for answering medical questions in Spanish, with promising results. The question-answering (QA) task was fine-tuned using a Spanish corpus of over 34,000 medical questions and answers, after which a search for the optimal hyperparameter configuration was conducted using heuristic methods and nonlinear regression models. MédicoBERT was evaluated using perplexity, which measures how well the language model adapted to the Spanish medical vocabulary and reached a value of 4.28, and the average F1 score on the medical question-answering task, which reached 62.35%. The objective of MédicoBERT is to support research in natural language processing (NLP) in Spanish, with a particular emphasis on applications within the medical domain.

1. Introduction

The medical literature is expanding at a rapid pace, as evidenced by the more than two million scientific papers on SARS-CoV-2-related research published worldwide since the onset of the pandemic (https://reports.dimensions.ai/covid-19/, accessed on 4 July 2024). Manually extracting pertinent information about medical advancements, such as treatments, diseases, or drugs, from this vast number of publications is a laborious task that requires significant time and human resources. In this context, the need for computational methods and tools that enable the rapid and efficient extraction of information from unstructured or semi-structured medical documents is becoming increasingly pressing.
Several tools facilitate the analysis and processing of the information contained in such documents. These include technologies such as Word2Vec [1], which generates vector representations of text for subsequent computational processing, and Long Short-Term Memory (LSTM) neural networks [2]. The latter are highly effective for tasks involving large amounts of text; however, their sequential processing makes them relatively slow. With the introduction of the Transformer architecture [3], text processing has been revolutionized: its attention mechanism allows information to be processed in parallel, so larger volumes of text can be handled in significantly less time.
The introduction of the Transformer architecture has facilitated the development of large language models (LLMs) based on deep learning [4], such as Bard, LLaMA, Bloom, BERT, and GPT. These models have provided the scientific community with indispensable tools for processing large amounts of text and solving complex natural language processing (NLP) tasks. Some of their most prominent applications include information extraction (IE), named entity recognition (NER), sentiment analysis, text classification, summary generation, part-of-speech (POS) tagging, translation, and question answering (QA). Nevertheless, most of these LLMs have been trained in English and on general knowledge.
This paper examines the feasibility of adapting an LLM to a specific domain despite its being pre-trained on general texts such as English Wikipedia and Google Books. The proposal is to take the BERT base model [5], which was pre-trained in English, and adapt it to Spanish, one of the world’s five most spoken languages [6]. The objective is to specialize the model in Spanish medical terminology and expressions so that it can perform NLP tasks, such as answering extractive questions formulated by medical researchers in Spanish, thus enabling it to extract information from the medical literature. To adapt the large language model to medical terminology, more than 3 million Spanish texts related to the medical domain were used. Furthermore, fine calibration of the hyperparameters was performed to optimize the model’s training and enable it to efficiently answer factoid extractive questions in the Spanish medical domain. This task involved fine-tuning with more than 34 thousand medical questions. The main contribution of this paper is MédicoBERT, a large language model adapted to the medical domain in Spanish, which supports various research applications in natural language processing within this domain.
The rest of this paper is organized as follows. Section 2 briefly reviews related work on pre-trained large language models. Section 3 describes the research methodology: the datasets used, the model training, the evaluation, and the calibration of the hyperparameters. Section 4 presents the results obtained, and Section 5 analyzes and discusses them. Finally, Section 6 presents the conclusions and suggestions for future work.

2. Related Works

This section presents related works involving the adaptation of a pre-trained LLM to a specific domain to solve natural language processing tasks.
In [7], a biomedical language representation model for mining large amounts of text was presented. BioBERT adapts the pre-trained language model BERT to biomedical corpora, including PubMed abstracts and PMC full-text articles. This adaptation outperformed the original model and other state-of-the-art models with the same architecture in several biomedical text-mining tasks. BioBERT improved named entity recognition (NER), biomedical relation extraction (REL), and other biomedical text-mining tasks, showing that training BERT on biomedical corpora can help it better understand a complex domain. BioBERT obtained a maximum F1 score of 86.51% in the relation extraction task, an F-measure of 93.47% in the NER task, and a maximum lenient accuracy of 60% in the QA task. In [8], a model called ClinicalBERT was trained on patients’ clinical notes, which contained laboratory values and medications, to process these notes and dynamically assign risk scores predicting whether a patient would be readmitted to the hospital within 30 days. The base model used was BERT, and the Medical Information Mart for Intensive Care III (MIMIC-III) dataset, consisting of medical records from 58,976 hospital admissions of 38,597 patients between 2001 and 2012, was utilized. ClinicalBERT achieved an AUROC of 71.4%.
Similarly, in the medical domain, the authors of [9] proposed adapting the RoBERTa model using texts from the biomedical and clinical literature in Spanish, aiming to improve NLP applications for Spanish in biomedicine. The authors of RobertaClinical confirmed that domain-specific pre-training is fundamental to obtaining better results in specific tasks such as NER and reported an F-score of 88.21%. Similarly, [10] presented med-BERT, which recognized medical entities with an F1 score of 93% after being trained on clinical texts in English and Chinese. Models intended to facilitate research on linguistic representations in the biomedical domain were presented in [11,12]. These models were subjected to a five-task evaluation for biomedical language comprehension (BLUE and BLURB). In contrast, SciBERT [13] is a model adapted to the scientific publications of the Semantic Scholar repository. SciBERT was trained on a corpus in which 18% of the articles came from computer science and 82% from biomedicine, resulting in a hybrid model that was evaluated on information extraction tasks. In a related development, models for health management have been proposed. For instance, in [14], the ConBERT model was used to predict International Classification of Diseases (ICD-9) codes from electronic health records and diagnoses written in natural language, with the aim of standardizing the automatic generation of medical reports. Additionally, in [15], a model based on the BERT-BiGRU-ATT architecture was introduced to extract connections between diseases, drugs, and drug effects from texts from online health communities in China, and it proved to be effective. The authors incorporated word embeddings into their model to account for contextual, grammatical, and semantic information within the biomedical field.
The authors of [16] presented MarIA, a compendium of experiments with several Spanish language models, such as RoBERTa-base, RoBERTa-large [17], GPT2, and GPT2-large [18], pre-trained using a corpus of 135 billion words extracted from the Spanish Web Archive built by the National Library of Spain between 2009 and 2019. The performance of the models was evaluated using nine datasets for different NLP tasks in the general domain, as well as a question-answering set (SQAC) obtained from Spanish Wikipedia articles, encyclopedias, and Wikinews. The best experimental results were an F1 score of 88.51% for RoBERTa-base in the NER task and an F1 score of 98.56% for RoBERTa-large in the POS task. In [19], the BETO model was presented, which consists of 12 self-attention layers with 12 attention heads each and a hidden layer of size 1024, for a total of 110 million parameters. The model was pre-trained on texts from the Spanish part of Wikipedia and a fragment of the OPUS project. For BETO, the authors reported that their model outperformed the multilingual BERT variant in the POS tagging, NER, and natural language inference (XNLI) tasks, achieving an F1 score of 88.43%, an accuracy of 98.97%, and an F1 score of 82.01%, respectively.
In contrast, the BERTIN [20] and ELECTRICIDAD models, which are large language models trained on general-domain texts in Spanish, represent an alternative approach. BERTIN was trained from scratch on the Spanish part of the mC4 set, while ELECTRICIDAD was trained on the OSCAR 20 corpus. BERTIN achieved F1 scores of 87.79% in the NER task and 96.44% in the POS tagging task. The ELECTRICIDAD language model, in turn, achieved F1 scores of 80.5% in the NER task and 98.16% in the POS task, as well as an accuracy of 78.78% in the XNLI task. The authors of [21,22,23] presented language models based on BERT and RoBERTa trained in French, Dutch, and Italian, respectively, which are described below.
FlauBERT was adapted to French using a large and heterogeneous corpus. Furthermore, it was applied to various NLP tasks, such as text classification (CLS), paraphrasing (PAWS-X), natural language inference, parsing, and word-sense disambiguation (WSD), and evaluated using the FLUE benchmark. RobBERT, on the other hand, was tuned to improve the performance of NLP tasks, specifically in Dutch, and the authors also evaluated the importance of tokenizers for LLM training using the Adam optimizer. Finally, the AlBERTo model focused on the language used in Italian social networks, specifically Twitter. The authors submitted AlBERTo to the EVALITA 2016 competition in the SENTIPOLC (SENTIment POLarity Classification) task, ranking among the best results in the detection of polarity, subjectivity, and irony in Italian tweets.
Table 1 compares the most relevant characteristics of training large language models. These characteristics include the language of the texts used for training, the source of the data, the domain or area of application of the model, the tasks it can solve after fine-tuning, and the score obtained when evaluated with the metrics of each task.
The construction of large language models based on the Transformer architecture has been shown to facilitate the resolution of a range of NLP tasks. Nevertheless, to obtain adequate and dependable outcomes, it is essential to tailor the models to the language and the particular domain in which the tasks are to be addressed. This paper presents MédicoBERT, a language model capable of answering medical questions in Spanish. The model was trained using a hyperparameter optimization technique.
MédicoBERT aims to address the limitations or gaps of existing models in the Spanish medical domain. These models, pre-trained on general corpora, lack the specialized medical vocabulary and terminology necessary for optimal performance in tasks such as identifying medical entities or extracting answers from medical texts in Spanish. In contrast, MédicoBERT is pre-trained on a substantial corpus of scientific texts in Spanish, comprising over 3 M texts and 1.1 B words. This training enables it to comprehend and identify texts on various medical topics, including cancer, diabetes, hypertension, and Coronavirus disease 2019 (COVID-19), and helps ensure the precision and dependability of the information conveyed.

3. Materials and Methods

Large language models pre-trained on general-domain datasets can perform better in domain-specific information extraction tasks once they are adapted to that domain. Therefore, in this work, we build MédicoBERT by adapting an LLM to extract answers to Spanish medical questions. This section describes the datasets used and the process of adapting the LLM to the Spanish medical language. Finally, we fine-tune it using hyperparameter optimization to specialize it for the question-answering task. Figure 1 shows the datasets used and all the steps carried out for the training, adaptation, and fine-tuning of the MédicoBERT model.

3.1. Corpus

Although Spanish is one of the world’s five most widely spoken languages, training large language models in Spanish remains a significant challenge: the scarcity of large datasets and resources for training or evaluating models in Spanish makes this task particularly difficult. This section presents the datasets used to train and evaluate the MédicoBERT language model: three datasets for adaptive learning and one for fine-tuning the model on the question-answering task.
The datasets utilized to adapt the model to the Spanish medical vocabulary are BioAsq [24], CORD-19 [25], and CoWeSe [26]. This adaptability ensures the relevance and applicability of our model. The initial dataset utilized is derived from the international competition focused on biomedicine, Biomedical Information Extraction and Retrieval (BioAsq), from the year 2021. It includes 249,497 scientific article abstracts and 45,322,119 words in JSON format, with labels such as id, title, abstract text, and year. The second dataset is CORD-19, a repository of more than one million full-text scientific articles, including content on COVID-19 and historical research related to coronaviruses. The documents are structured in Comma-Separated Values (CSV) format and are divided into the following labels: cord_uid, source_x, title, doi, abstract, authors, journal, and url. To adapt to the medical model, 814,402 abstracts from the CORD-19 dataset translated into Spanish were utilized. The final dataset for the pre-training of MédicoBERT is CoWeSe, which, according to its creators, represents the largest biomedical corpus in Spanish to date, encompassing nearly two million pre-processed texts and 750 million words, derived from 3000 Spanish sources. Table 2 presents a list of the corpora and the vocabulary of each dataset used for pre-training the medical model.
The MédicoBERT model was adapted and trained on the BioAsq10 dataset from 2022 to answer medical questions. The dataset consists of 39,680 biomedical questions of a factoid type and is divided into the following sections: context, question, ID, title, and answers. Figure 2 illustrates a representative sample of the questions included in this dataset.

3.2. Adaptive Learning of the MédicoBERT Model

Adaptive learning in LLMs refers to the ability of these models to improve their performance and accuracy depending on the data they are exposed to or trained on. It implies that LLMs continuously learn and improve by adapting to the needs, characteristics, and domain-specific nature of the tasks they are asked to solve. LLMs can adapt to training changes and new data, making them a robust and reliable tool for solving different NLP tasks.
This paper presents a novel approach to adapting an LLM to the medical domain. In this context, the Bidirectional Encoder Representations from Transformers (BERT) model, developed by Google, is employed. BERT is a neural network-based model designed to solve NLP tasks.
The base BERT LLM is built using the encoder component of the Transformer architecture [3] and is pre-trained on Wikipedia and a corpus of digital books. The model is a multi-layer bidirectional Transformer encoder; its base architecture includes 12 encoding layers, a maximum sequence length of 512 tokens, 768-dimensional hidden states, and 12 attention heads, for a total of 110 million parameters [5].

3.2.1. Training New Tokenizers

A tokenizer is a fundamental component of an LLM that transforms text into a numerical representation for processing. The BERT model uses a tokenizer based on subwords [27]. The primary purpose of this approach is to prevent frequently used words from becoming fragmented, thus preserving their whole meaning. On the other hand, uncommon or unusual words are broken down into smaller subwords, allowing the model to recognize them and avoid classifying them as unknown.
Adapting BERT to the medical domain requires replacing the default tokenizer. A tokenizer trained on a dataset from a general domain and in another language is not suitable for performing information extraction tasks in the medical literature. The medical domain-specific vocabulary in Spanish differs significantly from the domain-general vocabulary, which limits the performance of a generic tokenizer. The purpose of training new tokenizers is to improve precision in understanding the structure and composition of abbreviations and medical terms in Spanish, such as the names of drugs or diseases.
In this work, three tokenizers, one for each dataset (BioAsq, CORD-19, and CoWeSe), were trained using an NVIDIA 3090 GPU, the PyTorch deep learning framework, and the tokenizer APIs of the Hugging Face Transformers ecosystem. The objective of this training was to provide the model with the capacity to comprehend medical terminology accurately and effectively, equipping it with the requisite vocabulary for the medical domain. The vocabulary size of each trained tokenizer was 50,000 tokens.
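A minimal sketch of how such a subword (WordPiece) tokenizer can be trained with the Hugging Face tokenizers library is shown below; the corpus file name and output directory are illustrative placeholders, and the paper does not report settings beyond the 50,000-token vocabulary.

from tokenizers import BertWordPieceTokenizer

# One WordPiece tokenizer is trained per corpus; a single (hypothetical) file is shown here.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["cowese_es.txt"],          # pre-processed Spanish medical texts (placeholder path)
    vocab_size=50_000,                # vocabulary size reported above
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("medicobert_tokenizer")  # writes vocab.txt for later use with BERT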
Figure 3 depicts the output of one of the trained tokenizers, exemplifying how a sentence is divided into subwords. This segmentation aims to obtain concise and significant values for the model in the medical context. Furthermore, the figure illustrates the numerical values resulting from converting scientific texts. This conversion enables the model to efficiently process and analyze the information.

3.2.2. MédicoBERT Model Pre-Training

To pre-train MédicoBERT, the tokenizers and medical literature datasets were employed. Furthermore, the datasets were subjected to a minimal pre-processing step, which involved the removal of punctuation marks such as “?”, “¿”, “¡”, “!”, “/”, “;”, “:”, “*”, “Ç”, “$”, and “&”. These characters are irrelevant for model training and may potentially introduce noise during the process. Finally, the text was normalized by converting it to lowercase. This normalization ensured that all words were represented consistently, thus facilitating the subsequent processing.
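The pre-processing described above can be expressed as a short text-cleaning routine; this is only an illustrative sketch, since the exact cleaning script is not published.

import re

PUNCTUATION = r"[?¿¡!/;:\*Ç\$&]"   # characters listed above

def preprocess(text: str) -> str:
    """Remove the listed punctuation marks and normalize the text to lowercase."""
    return re.sub(PUNCTUATION, " ", text).lower().strip()

print(preprocess("¿La vacuna contra el SARS-CoV-2 es segura?"))  # "la vacuna contra el sars-cov-2 es segura"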
MédicoBERT was trained using the Masked Language Modeling (MLM) technique. This technique involves training an LLM to predict the missing words in a text sequence. In this case, 15% of the words in the medical literature corpus were randomly replaced with the special token '<mask>'. The primary objective of this technique is to learn the relationships between words and their context, a crucial aspect of natural language processing. It is important to note that this technique does not require the dataset to be labeled. Below is a fragment of a masked text from the CORD-19 dataset, where the objective is to predict P(m), i.e., which word should replace the special token so that the sentence has the most accurate meaning in the medical context.
La pandemia por COVID-19 afecta especialmente a pacientes con <mask> con mayor incidencia y mortalidad (The COVID-19 pandemic especially affects patients with <mask> with increased incidence and mortality)
P(m) = {cáncer (cancer) | urticaria (hives) | de (from)}
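As an illustration, the 15% random masking can be reproduced with the Hugging Face Transformers data collator, as sketched below; the tokenizer directory is the hypothetical output of Section 3.2.1, and the paper does not state that this exact API was used.

from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("medicobert_tokenizer")  # hypothetical tokenizer directory
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,   # 15% of the tokens are masked, as described above
)

example = tokenizer("La pandemia por COVID-19 afecta especialmente a pacientes con cáncer")
batch = collator([example])
print(batch["input_ids"])   # some ids replaced by tokenizer.mask_token_id
print(batch["labels"])      # original ids at masked positions, -100 elsewhere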
The next task used to train the medical model was Next Sentence Prediction (NSP). This technique involved splitting each dataset into sentences and creating two different sets, as follows:
  • Related pair set: This set contains 50% of the original dataset’s sentences, where the subsequent sentence in each pair is the logical continuation of the first.
  • Random pair set: This set contains the remaining 50% of the original dataset’s sentences, where the sentences are randomly combined without any logical relationship.
The following is an example of training the medical model using NSP, where the model is presented with “Sentence A” and “Sentence B” and is asked to predict (P) whether “Sentence B” is the logical continuation of “Sentence A”.
  • “Sentence A”: El fracaso renal agudo FRA en pacientes hospitalizados por COVID-19 se presenta en el 0.5–2.5% y es un factor de mal pronóstico (Acute renal failure ARF occurs in 0.5–2.5% of COVID-19 hospitalized patients and is a poor prognostic factor).
  • ”Sentence B”: Los mecanismos de afectación renal no están completamente aclarados (The mechanisms of renal involvement are not completely elucidated).
  • (P): IsNext
This task compels the model to perform a deeper semantic analysis of medical text. It helps the model discern the relationships between sentences, thereby enhancing its capacity to produce coherent and meaningful text.
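The construction of the two NSP training sets can be sketched as follows; this is a simplified illustration (for instance, it does not guard against a random pair accidentally being consecutive) rather than the authors' exact procedure.

import random

def make_nsp_pairs(sentences, seed=13):
    """Build 50% consecutive ('IsNext') and 50% random ('NotNext') sentence pairs."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))        # related pair
        else:
            pairs.append((sentences[i], rng.choice(sentences), "NotNext"))  # random pair
    return pairs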

3.2.3. MédicoBERT Adaptive Learning Evaluation Metrics

To evaluate MédicoBERT, a metric called perplexity was used. According to [28], perplexity is an essential NLP metric for evaluating language models. Abbreviated as PP and shown in Equation (1), it is defined as the inverse probability of the test set, normalized by the number of words. In this equation, P() represents the probability of a sequence of words, and perplexity measures how well a language model can predict that sequence. The lower the perplexity, the better the model. The following steps are used to calculate perplexity (a short numerical illustration follows the list):
- For a test set W = w_1 w_2 ... w_N,
  PP(W) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots, w_N)}}     (1)
- The higher the probability P(W) that the model assigns to the test set, the lower the perplexity.
- In the perplexity equation, the Nth root of the inverse of the joint probability of the word sequence represents the degree to which the model is surprised by the words w_1, w_2, ..., w_N presented to it.
- P(w_1, w_2, ..., w_N) represents the probability that the model generates this specific sequence of words. The higher this probability, the more expected the text is for the model.
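In practice, perplexity can be computed as the exponential of the mean cross-entropy (the negative log-likelihood per predicted token) reported during evaluation. The loss value used below is inferred from the reported perplexity of 4.28 and is only illustrative.

import math

def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood per predicted token."""
    return math.exp(mean_cross_entropy)

print(perplexity_from_loss(1.454))  # ≈ 4.28, the value reported for MédicoBERT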

3.2.4. MédicoBERT Adaptive Learning Hyperparameter Configuration

To determine the best hyperparameter configuration for the adaptive learning of the MédicoBERT model, a broad initial search strategy known as coarse calibration was used. The search was conducted as follows:
  • Obtaining the initial hyperparameter configuration: In this stage, the hyperparameter values reported in related works were used as a starting point. Subsequently, an exhaustive manual search was performed to adjust each hyperparameter to improve training performance.
  • Manual hyperparameter search: This involved trying different values for each hyperparameter and evaluating the model’s performance for each configuration. This technique helped identify the hyperparameter configuration that improved model performance.
The hyperparameter configuration found for the MédicoBERT pre-training is shown below, and the results obtained with this configuration are presented in Table 3.
training_args = {
    epochs = 45,
    learning_rate = 2 × 10^-5,
    weight_decay = 0.01,
    train_batch_size = 16,
    gradient_accumulation_steps = train_batch_size / 4,
    fp16 = True,
    optimizer = adafactor,
}
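As a rough illustration, the configuration above maps onto the Hugging Face TrainingArguments API as sketched below; the argument names are those of the transformers library and the output directory is a placeholder, not the authors' actual script.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="medicobert_pretraining",
    num_train_epochs=45,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,   # train_batch_size / 4, as in the listing above
    fp16=True,                       # mixed precision; requires a CUDA GPU such as the RTX 3090 used here
    optim="adafactor",
)
# These arguments would then be passed to transformers.Trainer together with a
# BertForPreTraining model, the tokenized corpora, and the MLM data collator.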

3.3. Fine-Tuning MédicoBERT for Question Answering

The generated medical language model (MédicoBERT) was fine-tuned for the Spanish medical question-answering (QA) task. This fine-tuning process aimed to solve a specific NLP task. For this purpose, the pre-trained model was provided with a smaller dataset explicitly labeled for the QA task. By adjusting the weights of the model’s neural connections, it was better adapted to the characteristics and patterns of the new dataset.
Question answering is a fundamental task in NLP that involves providing responses to questions formulated in natural language using relevant or query-related text. The following multi-stage fine-tuning process was employed to adapt the Spanish medical language model to the QA task:
  • Base architecture of the medical language model (MédicoBERT). The same architecture, generated from the BERT model and adapted to the medical literature, was employed. This architecture provided a solid basis for the QA task, as it had learned to process and statistically understand the medical language and its terminology. Furthermore, the 512-token sequence length specified by BERT was retained, and, to handle longer texts, the sliding-window approach described in [29] was applied (see the sketch after this list).
  • Adaptation to the structure of Spanish questions. The final layer of MédicoBERT was adapted using a dataset of 64,000 Spanish questions from the SQuAD dataset [30]. This process allowed the medical model to become familiar with the Spanish question structure, which is crucial for the QA task.
  • Medical question data training. The model was fine-tuned using a dataset labeled for the medical QA task. This dataset, obtained from the 2022 BioAsq10 competition (https://huggingface.co/datasets/avacaondata/bioasq22-es, accessed on 4 July 2024), contains 39,680 instances, each with a context, a factoid question, and its answer, with the partitioning for training presented in Table 4. These texts differ from those used in the pre-training or adaptation to Spanish medical vocabulary and terminology to avoid bias in the data.
  • Hyperparameter fine calibration. In the final stage of the process, hyperparameter optimization was conducted to enhance the model’s performance in the Spanish medical question-answering task, as measured by the F1 metric.
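To illustrate the sliding-window handling of contexts longer than 512 tokens mentioned in the first stage above, the fragment below shows a possible tokenization call; the stride value and the tokenizer path are assumptions, since the paper does not report them.

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("medicobert_tokenizer")  # hypothetical tokenizer directory
context = " ".join(["La hipertensión arterial es una enfermedad crónica del sistema cardiovascular."] * 120)

encoded = tokenizer(
    "¿Qué enfermedad crónica afecta al sistema cardiovascular?",  # question
    context,                         # context longer than 512 tokens
    max_length=512,
    truncation="only_second",        # only the context is truncated into windows
    stride=128,                      # assumed overlap between consecutive windows
    return_overflowing_tokens=True,  # produce one feature per window
    return_offsets_mapping=True,     # needed to map predicted spans back to the text
)
print(len(encoded["input_ids"]))     # number of windows generated for this example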

3.3.1. MédicoBERT Fine-Tuning Evaluation Metrics

The F1 metric (Equation (2)), defined as the harmonic mean of precision (Equation (3)) and recall (Equation (4)), two metrics well known in the literature for the evaluation of NLP tasks [31], was employed to evaluate the MédicoBERT model in the QA task.
F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}     (2)
Precision: The proportion of positive results that the model correctly classifies.
Precision = \frac{\#(True\ positives)}{\#(True\ positives + False\ positives)}     (3)
Recall: The proportion of actual positive results that the model correctly classifies.
Recall = \frac{\#(True\ positives)}{\#(True\ positives + False\ negatives)}     (4)
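For extractive QA, precision and recall are typically computed over the tokens shared by the predicted and reference answers. The sketch below follows the SQuAD-style token-overlap evaluation, without the punctuation and article stripping performed by the official script.

from collections import Counter

def qa_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer span and the reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())  # shared tokens
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)   # shared tokens / predicted tokens
    recall = common / len(ref_tokens)       # shared tokens / reference tokens
    return 2 * precision * recall / (precision + recall)

print(qa_f1("fracaso renal agudo", "el fracaso renal agudo"))  # ≈ 0.857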

3.3.2. Hyperparameter Optimization for the MédicoBERT Model

We used an optimization calibration technique to search for suitable hyperparameters for MédicoBERT. This calibration aimed to maximize the F1 metric, used to evaluate the model’s precision and recall for the medical question-answering task.
The optimized hyperparameter configuration and the ranges of values are shown below.
max F1
s.t.
    θ = {Lr, w0, bz, e}
    Lr : learning rate, Lr ∈ [0, 1], Lr ∈ ℝ⁺
    w0 : weight, w0 ∈ [0, 1], w0 ∈ ℝ⁺
    bz : batch size, bz ∈ {2, 4, 8, 16, 32, 64}
    e : epochs, e ∈ [3, 7], e ∈ ℝ⁺
The following approaches were employed for the calibration, starting from hyperparameter values reported in the literature:
  • Literature hyperparameters. This stage represents the initial phase of the process, during which the model was trained with a hyperparameter configuration derived from the existing literature. The objective was to obtain an initial configuration and a preliminary solution. The default values for this configuration were as follows: Epochs = 5, Batch size = 8, Weight = 0.001, and Learning rate = 1 × 10^-5. The training results obtained with these values are presented in Table 5.
  • Coarse calibration. An exploratory method was employed to identify the configuration that enhanced the preliminary solution and reduced the search space for the optimal configuration. This search was achieved by randomly exploring the hyperparameter space and evaluating the model’s performance for each configuration. A total of 50 experiments were conducted, with the hyperparameters varied in each iteration as follows: Epochs ± 0.25, Batch size / 2, Weight ± 0.005, and Learning rate ± 0.5 × 10^-5. The training results for the question-answering task with this configuration are presented in Table 6.
  • Optimization calibration. The following metaheuristic approaches were employed to identify a high-quality hyperparameter configuration for the MédicoBERT model and optimize its performance [32] on the Spanish QA task:
    - Heuristic search (hill-climbing algorithm). To optimize the hyperparameter configuration of MédicoBERT for the QA task, a heuristic technique inspired by the ascent of a mountain climber was used. The search strategy was based on an iterative local search algorithm. The search started with a random configuration of the hyperparameters. At each iteration, a new random configuration was generated within a neighborhood of the current configuration, and the performance of the model with both configurations was evaluated. If the performance with the new configuration was better than with the current configuration, the current configuration was updated. This process was repeated for 67 iterations (a sketch of this procedure follows the list), and the training results with the hyperparameter configuration found are shown in Table 7.
    - Nonlinear regression (harmony search algorithm). This procedure entailed generating a log-normal-type nonlinear regression model, with the F1 metric designated as the response variable and the hyperparameter values serving as predictor variables. This regression model was selected with the objective of minimizing the error. Subsequently, an optimization problem was formulated in which the objective function was the regression model and the constraints were the intervals of each hyperparameter. Due to the nonlinear nature of the model, the harmony search metaheuristic proposed in [33] was employed to identify the optimal configuration of the hyperparameters (epoch, batch, weight, and learning rate). The results obtained with this hyperparameter configuration are presented in Table 8.
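A minimal sketch of the hill-climbing search over this hyperparameter space is shown below. The neighbourhood steps reuse the coarse-calibration increments, and evaluate() stands for a full fine-tuning run that returns the average F1 on the validation split; both are assumptions, since the exact neighbourhood definition is not reported.

import random

BATCH_SIZES = [2, 4, 8, 16, 32, 64]

def neighbour(cfg, rng):
    """Return a perturbed copy of the configuration, kept inside the allowed ranges."""
    new = dict(cfg)
    new["epochs"] = min(7.0, max(3.0, cfg["epochs"] + rng.choice([-0.25, 0.25])))
    new["batch"] = rng.choice(BATCH_SIZES)
    new["weight"] = min(1.0, max(0.0, cfg["weight"] + rng.choice([-0.005, 0.005])))
    new["lr"] = min(1.0, max(1e-6, cfg["lr"] + rng.choice([-0.5e-5, 0.5e-5])))
    return new

def hill_climb(evaluate, iterations=67, seed=7):
    """Iterative local search: keep a candidate configuration only if it improves F1."""
    rng = random.Random(seed)
    best = {"epochs": 5.0, "batch": 8, "weight": 0.001, "lr": 1e-5}  # literature starting point
    best_f1 = evaluate(best)
    for _ in range(iterations):
        candidate = neighbour(best, rng)
        f1 = evaluate(candidate)
        if f1 > best_f1:
            best, best_f1 = candidate, f1
    return best, best_f1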

4. Experimental Results

This section presents the experimental results of the MédicoBERT model and is divided into two parts: the evaluation of the adaptive learning of the medical vocabulary and the evaluation of the fine-tuning for the Spanish question-answering task.

4.1. MédicoBERT Adaptive Learning Results

The MédicoBERT model was pre-trained for 16 days using an NVIDIA 3090 GPU (NVIDIA, Santa Clara, CA, USA), PyTorch 2.4, 3,036,923 texts from the Spanish medical literature, the base architecture of BERT, and a hyperparameter configuration identified through a coarse calibration process.
Figure 4 shows the model’s pre-training process and indicates that MédicoBERT converged on the training and validation data over 40 epochs, reaching a point of maximum performance; no significant improvement was observed in the subsequent epochs. This behavior suggests that this hyperparameter configuration is optimal for training the Spanish medical language model.
Table 3 shows the best perplexity results obtained during the training of the MédicoBERT model. These results come from experiments in which the amount of training data and the model’s hyperparameters (epoch, learning rate, weight, and batch size) were varied. Among the hyperparameters analyzed, the number of epochs had the most significant impact on the model’s performance during training.

4.2. MédicoBERT Fine-Tuning Results for Question Answering

For the question-answer fine-tuning experiments of the MédicoBERT model, the following resources were used: an NVIDIA 3090 GPU, PyTorch, the Huggingface (https://huggingface.co/, accessed on 4 July 2024) Transformer library, an adaptive learning-generated tokenizer, and the BioASQ10 dataset, fragmented as shown in Table 4.
Five experiments were conducted with identical hyperparameter configurations to ensure the reliability and reproducibility of the results. The QA dataset was partitioned in the same proportions in every run, with the examples in each split drawn randomly, and the arithmetic mean of the F1 metric over the five experiments was calculated. The objective was to assess the degree of generalization of the MédicoBERT model, that is, its capacity to correctly answer questions not included in the training data.
Table 5 presents the most significant results of fine-tuning the MédicoBERT and base BERT models for the QA task, using the hyperparameter configuration obtained from the literature. These results highlight the benefit of domain adaptation over the general-domain baseline.
Table 6 shows the best training results of the MédicoBERT model for the Spanish medical QA task, using the optimal hyperparameter configuration obtained from coarse calibration with the exploratory method.
Table 7 presents the results of fine-tuning the MédicoBERT model for the Spanish medical QA task, utilizing the hyperparameters derived from the heuristic search conducted with the hill-climbing algorithm.
Table 8 presents the training results of the MédicoBERT model for the QA task, using the hyperparameters obtained through the harmony search algorithm.
Figure 5 compares the average F1 scores obtained from training the base BERT and MédicoBERT models for the QA NLP task using different hyperparameter calibration strategies.
The hill-climbing algorithm facilitated the identification of the best hyperparameter configuration for MédicoBERT fine-tuning. The results obtained with the harmony search substantiate this configuration’s robustness, indicating that the two strategies yield statistically similar outcomes.
In evaluating various hyperparameter configurations for answering Spanish-language questions about medical texts, it was found that the hyperparameters pertaining to batch size and weight had a pronounced impact on the model’s efficiency. It was determined that processing small batches facilitated a more gradual adaptation of the model parameters. However, it is essential to exercise caution when utilizing exceedingly small values, as this markedly increases the training time. Furthermore, allowing for fractional variations in the number of epochs resulted in more precise control over the training process, thus avoiding overfitting and enabling the model to converge at an earlier stage, specifically at epoch 3.5.
Ultimately, the comparison results clearly demonstrate that the default hyperparameter settings are not optimal for the medical model and lead to a significant loss of efficiency, underscoring the value of a dedicated hyperparameter optimization process.

5. Discussion

The results of the MédicoBERT model generation are divided into two categories: model fitting through adaptive learning and fine-tuning to answer medical questions in Spanish. The latter category includes searching for the best hyperparameters through optimization methods.
The adaptive learning results showed that training domain-specific tokenizers ensures an accurate transformation of text into numerical values. In addition, better recognition of specialized medical vocabulary and terminology was observed, allowing for more practical information extraction and, consequently, deeper and more meaningful analysis of medical literature.
The pre-training of the base model on the medical literature has excellent potential for improving the performance and applicability of LLMs in specific domains. This was demonstrated by achieving a low perplexity value of 4.28, indicating a successful adaptation of the model to the medical domain in Spanish. It was also observed that the performance of language models in a given domain improves as the number of texts and words available in that domain increases.
In the context of the fine-tuning results, our MédicoBERT model represents a significant advancement in adapting large language models to the medical domain in Spanish. In contrast to previous studies, our model is designed to answer medical questions. Unlike the language models presented in [7,8,10,11,12,13,14], which focus on the medical or clinical domain in English, our model offers the advantage of being in Spanish, which makes it more accessible and valuable for Spanish-speaking health professionals. It also differs from other models in Spanish, such as [13,16,18,19], which are tailored to non-specialized or general domains, as our model focuses specifically on the medical domain. It is noteworthy that although the authors of [9] presented a model adapted to the medical domain in Spanish, it was designed to solve the named entity recognition (NER) task. In contrast, our model addresses one of the most complex NLP tasks, question answering (QA), where it achieved promising results, with an average F1 score of 62.35%.
As evidenced by the literature, there is growing interest in large language models within the scientific community. However, studies focusing on Spanish and the medical domain remain scarce. This paper contributes to the development of large language models explicitly designed for Spanish in the medical domain. The extant literature on Transformer-based models for the medical domain in Spanish has predominantly concentrated on natural language processing tasks such as entity recognition, sentence labeling, and classification. In contrast, our model is designed to provide accurate responses to questions. Although language models capable of answering questions in the medical domain have been developed in English, a research gap persists in Spanish. This study addresses this gap by proposing a model trained on a general question dataset (SQuAD) and a medical corpus designed and validated by domain experts (BioAsq). This combination enables the model to comprehend the general structure of questions and the nuances of medical language, facilitating more precise and pertinent responses.
One of the key strengths of MédicoBERT is its comprehensive training on CoWeSe, the most extensive Spanish medical corpus currently available. This extensive training has enabled MédicoBERT to develop a deep semantic understanding of medical language, equipping it with a profound knowledge of diseases, treatments, and symptoms. As a result, MédicoBERT is capable of retrieving accurate and relevant medical information, making it a significant advancement in the field of Spanish language models for the medical domain.
The results of the various hyperparameter calibration strategies demonstrated that calibration through optimization markedly enhances the performance and efficiency of large language models for specific tasks, such as medical question answering. Moreover, it was demonstrated that fine calibration of a medical language model in Spanish, whether through heuristic or harmony search, leads to statistically similar results.
Like other large language models, MédicoBERT exhibits bias and generalization issues despite being trained on scientific texts in Spanish. The quality and representativeness of the training data are critical determinants of model performance. Moreover, the absence of exposure to a more expansive natural language corpus may restrict its capacity to comprehend language in contexts beyond the medical domain. Finally, the computational demands of these models present a substantial obstacle for many medical researchers.

6. Conclusions and Future Work

This paper describes the process and outcomes of customizing a large pre-trained language model from the general domain to the medical domain, resulting in MédicoBERT. Three specific tokenizers were developed for the Spanish medical literature vocabulary. The model was then trained on Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks. The obtained results confirm the successful adaptation of the model to medical language, validating the effectiveness of the process.
The primary contribution of this work is the MédicoBERT language model, trained on a dataset of over 3 million texts from the Spanish medical literature. As a result, the MédicoBERT model has comprehensive knowledge about diseases, drugs, and treatments, making it a valuable tool for various natural language processing tasks in the medical field. The model is highly adaptable and can be fine-tuned for diverse NLP tasks, such as medical document classification and clinical information extraction. Additionally, this paper details steps to enhance the fine-tuning process of a large language model (LLM) through calibration and hyperparameter optimization, significantly improving the model’s efficiency in solving NLP tasks.
Furthermore, a new approach to answering medical questions in Spanish is introduced, showing promising results compared to existing methods. In summary, this work indicates that advancements in adaptive learning techniques will lead to the development of more sophisticated large language models. These models will have the ability to interact naturally and intelligently across a wide range of tasks and domains, adapting to varied terminologies and vocabularies. This progress is driven by the potential and capabilities of LLMs and the active area of research in adaptive learning in NLP.
Currently, the model demonstrates satisfactory performance in answering questions validated by experts in the medical domain. However, there is an ongoing objective to extend the scope of MédicoBERT and address potential challenges in the medical area. In future work, we intend to extend its application to other natural language processing tasks within the medical domain. In particular, we will concentrate on the identification of medical entities, including anatomical structures, diseases, and drugs. Additionally, we propose to enhance MédicoBERT’s capabilities to encompass the extraction of semantic relations within the medical domain. This includes discerning causal relationships and associations between entities, such as the indication of a particular disease for a specific drug or the determination of a disease’s relation to one or more anatomical parts. Furthermore, we will investigate the adaptation of the model for medical text classification tasks.
The enhancement of MédicoBERT for the specified tasks entails the collection, adaptation, and processing of datasets comprising accurate and validated information. Furthermore, a novel approach for fine-tuning the model hyperparameters will be employed, utilizing a simulated annealing algorithm, which will require considerable computational time. By outlining these potential areas of research and development, this paper provides a roadmap for future work on the MédicoBERT model.

Author Contributions

Conceptualization, A.D.C.-R., J.A.R.-O., R.A.M.-G., J.P.C. and M.B.; methodology, A.D.C.-R., J.A.R.-O. and J.P.C.; software, J.P.C., R.A.M.-G. and J.A.R.-O.; validation, J.A.R.-O. and A.D.C.-R.; formal analysis, J.P.C. and J.A.R.-O.; investigation, J.P.C.; resources, J.P.C. and R.A.M.-G.; data curation, J.P.C., J.A.R.-O. and M.B.; writing—original draft preparation, J.P.C.; writing—review and editing, A.D.C.-R., J.A.R.-O., R.A.M.-G. and M.B.; visualization, J.P.C.; supervision, J.A.R.-O. and A.D.C.-R.; funding acquisition, J.A.R.-O., M.B. and A.D.C.-R. All authors have read and agreed to the published version of the manuscript.

Funding

The present work was funded by the Consejo Nacional de Humanidades, Ciencia y Tecnologia (CONAHCYT) Mexico under scholarship No. 793971.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank Universidad Autonoma Metropolitana, Azcapotzalco, and Universidad Autónoma del Estado de México, Texcoco.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2013; Volume 26. [Google Scholar]
  2. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  3. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  4. Atkinson-Abutridy, J. Large Language Models: Concepts, Techniques and Applications; CRC Press: Boca Raton, FL, USA, 2024. [Google Scholar]
  5. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics, Online, 6–11 June 2019. [Google Scholar]
  6. Gordon, R.G. Ethnologue: Languages of the World; SIL International: Dallas, TX, USA, 2005. [Google Scholar]
  7. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2019, 36, 1234–1240. [Google Scholar] [CrossRef] [PubMed]
  8. Huang, K.; Altosaar, J.; Ranganath, R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv 2019, arXiv:1904.05342. [Google Scholar]
  9. Carrino, C.P.; Armengol-Estapé, J.; Gutiérrez-Fandiño, A.; Llop-Palao, J.; Pàmies, M.; Gonzalez-Agirre, A.; Villegas, M. Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario. arXiv 2021, arXiv:2109.03570. [Google Scholar]
  10. Liu, N.; Hu, Q.; Xu, H.; Xu, X.; Chen, M. Med-BERT: A Pretraining Framework for Medical Records Named Entity Recognition. IEEE Trans. Ind. Inform. 2022, 18, 5600–5608. [Google Scholar] [CrossRef]
  11. Peng, Y.; Yan, S.; Lu, Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task, Florence, Italy, 1 August 2019; Demner-Fushman, D., Cohen, K.B., Ananiadou, S., Tsujii, J., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 58–65. [Google Scholar] [CrossRef]
  12. Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthc. 2021, 3. [Google Scholar] [CrossRef]
  13. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 3615–3620. [Google Scholar] [CrossRef]
  14. Park, S.; Bong, J.W.; Park, I.; Lee, H.; Choi, J.; Park, P.; Kim, Y.; Choi, H.S.; Kang, S. ConBERT: A Concatenation of Bidirectional Transformers for Standardization of Operative Reports from Electronic Medical Records. Appl. Sci. 2022, 12, 1250. [Google Scholar] [CrossRef]
  15. Zhang, Y.; Li, X.; Yang, Y.; Wang, T. Disease- and Drug-Related Knowledge Extraction for Health Management from Online Health Communities Based on BERT-BiGRU-ATT. Int. J. Environ. Res. Public Health 2022, 19, 6590. [Google Scholar] [CrossRef] [PubMed]
  16. Gutiérrez-Fandiño, A.; Armengol-Estapé, J.; Pàmies, M.; Llop-Palao, J.; Silveira-Ocampo, J.; Carrino, C.P.; Armentano Oller, C.; Rodríguez Penagos, C.; Gonzalez-Agirre, A.; Villegas Montserrat, M. MarIA: Spanish Language Models. arXiv 2022, arXiv:2107.07253. [Google Scholar]
  17. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  18. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  19. Cañete, J.; Chaperon, G.; Fuentes, R.; Ho, J.H.; Kang, H.; Pérez, J. Spanish pre-trained bert model and evaluation data. arXiv 2023, arXiv:2308.02976. [Google Scholar]
  20. Rosa, J.d.l.; Ponferrada, E.G.; Villegas, P.; González de Prado Salas, P.; Romero, M.; Grandury, M. BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling. arXiv 2022, arXiv:2207.06814. [Google Scholar]
  21. Le, H.; Vial, L.; Frej, J.; Segonne, V.; Coavoux, M.; Lecouteux, B.; Allauzen, A.; Crabbé, B.; Besacier, L.; Schwab, D. FlauBERT: Unsupervised Language Model Pre-training for French. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., et al., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 2479–2490. [Google Scholar]
  22. Delobelle, P.; Winters, T.; Berendt, B. Robbert: A dutch roberta-based language model. arXiv 2020, arXiv:2001.06286. [Google Scholar]
  23. Polignano, M.; Basile, P.; Degemmis, M.; Semeraro, G.; Basile, V. AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets. In Proceedings of the Italian Conference on Computational Linguistics, Bari, Italy, 13–15 November 2019. [Google Scholar]
  24. Gasco, L.; Nentidis, A.; Krithara, A.; Estrada-Zavala, D.; Murasaki, R.T.; Primo-Peña, E.; Bojo Canales, C.; Paliouras, G.; Krallinger, M. Overview of BioASQ 2021-MESINESP track. Evaluation of advance hierarchical classification techniques for scientific literature, patents and clinical trials. In Proceedings of the CEUR Workshop Proceedings, Bucharest, Romania, 21–24 September 2021. [Google Scholar]
  25. Wang, L.L.; Lo, K.; Chandrasekhar, Y.; Reas, R.; Yang, J.; Burdick, D.; Eide, D.; Funk, K.; Katsis, Y.; Kinney, R.M.; et al. CORD-19: The COVID-19 Open Research Dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online, 5–10 July 2020; Verspoor, K., Cohen, K.B., Dredze, M., Ferrara, E., May, J., Munro, R., Paris, C., Wallace, B., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2020. [Google Scholar]
  26. Carrino, C.P.; Armengol-Estapé, J.; de Gibert Bonet, O.; Gutiérrez-Fandiño, A.; Gonzalez-Agirre, A.; Krallinger, M.; Villegas, M. Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models. arXiv 2021, arXiv:2109.07765. [Google Scholar]
  27. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
  28. Daniel, J.; Martin, J.H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition; Pearson: Bangalore, India, 2000. [Google Scholar]
  29. Wu, S.; Dredze, M. Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 833–844. [Google Scholar] [CrossRef]
  30. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; Su, J., Duh, K., Carreras, X., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2016; pp. 2383–2392. [Google Scholar] [CrossRef]
  31. Cleverdon, C.W.; Mills, J.; Keen, M. Factors Determining the Performance of Indexing Systems; Cranfield University: Cranfield, UK, 1966. [Google Scholar]
  32. Feurer, M.; Hutter, F. Hyperparameter Optimization. In Automated Machine Learning: Methods, Systems, Challenges; Hutter, F., Kotthoff, L., Vanschoren, J., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 3–33. [Google Scholar] [CrossRef]
  33. Geem, Z.W.; Kim, J.H.; Loganathan, G.V. A New Heuristic Optimization Algorithm: Harmony Search. Simulation 2001, 76, 60–68. [Google Scholar] [CrossRef]
Figure 1. Overview of the pre-training and fine-tuning of MédicoBERT.
Figure 2. Example of questions from the BioAsq10 training dataset.
Figure 3. Text-to-number conversion performed by the new tokenizer.
Figure 4. MédicoBERT training.
Figure 5. Calibration strategies for the base BERT and MédicoBERT models in the medical QA task.
Table 1. Comparison of LLMs reported in the literature.

LLM/Reference | Language | Dataset | Domain | Tasks | Evaluation for Each Task
BioBERT [7] | English | PubMed abstracts, PMC articles | Medical | REL; NER; QA | F1 86.51%; F1 93.47%; F1 60.0%
ClinicalBERT [8] | English | MIMIC-III | Clinical/Medical | Predict hospital readmission | AUROC 71.4%
RobertaClinical [9] | Spanish | Medical crawler, clinical cases, mespen_Medline, PubMed | Clinical/Medical | NER | F1 88.21%
Med-BERT [10] | English and Chinese | Clinical records | Clinical | NER | F1 93%
BlueBERT [11] | English | PubMed, MIMIC-III | Biomedical | Sentence similarity; REL; CLS; NER; XNLI | Pearson 84.8%; F1 74.4%; F1 87.3%; F1 86.6%; F1 84.0%
PubMedBERT [12] | English | PubMed | Biomedical | Sentence similarity; NER; REL; CLS; QA | macro-F1 92.30%; macro-F1 85.62%; macro-F1 77.24%; macro-F1 82.32%; macro-F1 55.84%
SciBERT [13] | English | Semantic Scholar | Computing, Biomedical | NER; CLS; REL | F1 67.57%; F1 70.98%; F1 79.97%
ELECTRICIDAD [16] | Spanish | OSCAR 20 | General | XNLI; POS; NER | Accuracy 78.78%; F1 98.16%; F1 80.5%
ROBERTA [17] | Spanish | Wikipedia, Wikinews | General | NER; POS | F1 88.51%; F1 98.56%
BETO [19] | Spanish | Wikipedia, OPUS19 | General | XNLI; NER; POS | F1 88.43%; Accuracy 98.97%; Accuracy 82.01%
BERTIN [20] | Spanish | mC4 | General | NER; POS | F1 87.79%; F1 96.44%
FlauBERT [21] | French | Wikipedia, Books, Common Crawl | General | CLS; PAWS-X; WSD | Accuracy 94.10%; Accuracy 89.34%; Accuracy 50.48%
RobBERT [22] | Dutch | OSCAR, Common Crawl, SoNaR-500 | General | Sentiment analysis; Zero-shot; NER | F1 94.378%; Accuracy 98.75%; F1 89.08%
ALBERTo [23] | Italian | TWITA | Social networks | Irony detection; Sentiment analysis; Subjectivity classification | F1 60.90%; F1 72.23%; F1 79.06%
ConBERT [14] | English | Surgical records from the Korea University Guro Hospital | Clinical | ICD-9 classification | F1 74.15%; AUC 98.42%
Table 2. Corpora used for pre-training MédicoBERT.

Corpus | Number of Texts | Vocabulary
BioAsq | 249,473 | 48,217
Cord-19 | 814,402 | 104,186
CoWeSe | 1,973,048 | 227,667
Table 3. MédicoBERT adaptive learning results.

Corpus | Number of Texts | Number of Words | Epochs | Perplexity
BioAsq+CoWeSe+CORD-19 | 3,036,923 | 1,155,535,281 | 40 | 4.28
BioAsq+CoWeSe+CORD-19 | 3,036,923 | 1,155,535,281 | 50 | 7.25
BioAsq+CoWeSe | 1,973,048 | 948,899,576 | 30 | 8.25
BioAsq | 249,473 | 45,322,119 | 46 | 5.41
Table 4. Partitioning of the BioAsq10 dataset.

Partition | Number of Examples | Percent
Train | 23,808 | 60%
Test | 7,936 | 20%
Validation | 7,936 | 20%
Table 5. LLM training results with hyperparameter configuration from the literature.

Model | Epochs | Batch | Learning Rate | Weight | Average F1
MédicoBERT | 5 | 8 | 1 × 10^-5 | 0.001 | 48.05
Base BERT | 5 | 8 | 1 × 10^-5 | 0.001 | 21.33
Table 6. MédicoBERT training results for the QA task with coarse hyperparameter calibration.

Epochs | Batch | Learning Rate | Weight | Average F1
3.25 | 8 | 2.5 × 10^-5 | 0.015 | 57.5
Table 7. MédicoBERT training results for the QA task with heuristic search hyperparameters.

Epochs | Batch | Learning Rate | Weight | Average F1
3.5 | 16 | 2 × 10^-5 | 0.001 | 62.35
Table 8. MédicoBERT training results for the QA task with harmony search hyperparameters.

Epoch | Batch | Learning Rate | Weight | Average F1
3.25 | 8 | 2 × 10^-5 | 0.015 | 61.72
