Article

An Improved Model for Medical Forum Question Classification Based on CNN and BiLSTM

Emmanuel Mutabazi, Jianjun Ni, Guangyi Tang and Weidong Cao
1 School of Artificial Intelligence and Automation, Hohai University, Changzhou 213022, China
2 College of Information Science and Engineering, Hohai University, Changzhou 213022, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(15), 8623; https://doi.org/10.3390/app13158623
Submission received: 3 June 2023 / Revised: 21 July 2023 / Accepted: 25 July 2023 / Published: 26 July 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Question Classification (QC) is a fundamental task in implementing Question Answering Systems (QASs), as it identifies the category of a question and thus helps predict the type of answer while building a QAS. However, classifying medical questions remains challenging due to the complexity of medical terminology. Many researchers have proposed different techniques to address these problems, but some of them remain partially solved or unsolved. With the help of deep learning technology, various text-processing problems have become much easier to solve. In this paper, an improved deep learning-based model for Medical Forum Question Classification (MFQC) is proposed to classify medical questions. In the proposed model, feature representation is performed using Word2Vec, a word embedding model. The features are then extracted from the word embedding layer based on a Convolutional Neural Network (CNN). Finally, a Bidirectional Long Short Term Memory (BiLSTM) network is used to classify the extracted features: it analyzes the target information of the representation and outputs the question category via a SoftMax layer. Our model achieves state-of-the-art performance by effectively capturing semantic and syntactic features from the input questions. We evaluate the proposed CNN-BiLSTM model on two benchmark datasets and compare its performance with existing methods, demonstrating its superiority in accurately categorizing medical forum questions.

1. Introduction

The development of intelligent healthcare systems has shown significant progress in recent years [1,2,3]. Medical forum platforms provide a valuable resource for patients, caregivers, and medical professionals to seek and exchange health information. Efficiently classifying questions asked on these platforms is essential for organizing the content and providing relevant and accurate responses [4]. Question classification (QC) aims to assign predefined labels or categories to input questions, enabling effective information retrieval and facilitating knowledge sharing [5]. A Question Answering System (QAS) is a computer-based system that is designed to understand and respond to questions posed by users in natural language. It aims to provide accurate and relevant answers to user queries by extracting information from a given collection of documents or knowledge sources [6]. The need for intelligent QAS in the medical domain has increased in the last decade, due to the rapid increase of internet users around the world. Traditional search engines return a list of pages so that users can read and find answers themselves. With the help of QAS, people can search for information online and obtain precise and accurate answers in a very short time [7]. To obtain accurate answers, questions need to be preprocessed and classified based on their category. Automated approaches to QC have achieved significant progress in terms of categorizing questions [8]. However, these approaches have not completely addressed specific problems such as ambiguity in the question, lexical gap, and polysemy problem [9].
A QAS involves three main stages, namely question processing, document processing, and answer processing. QC is a subtask of question processing that identifies the type of question posed by the user, based on the possible categories of answers. QC assists the QAS in eliminating categories of answers that are irrelevant to the question. The type of question also plays a vital role in determining the type of answer proposed by the system. For instance, some questions start with “who”, “when”, “what”, “where”, “how”, “why”, etc. Questions that start with “who”, “where”, “what”, and “when” are considered factoid questions, and the expected answer type for these questions is short and simple. In contrast, questions that start with “how” and “why” are non-factoid questions, and the expected answer type for these questions is long and complex.
There are three main techniques used for the QC task, namely rule-based, machine-learning, and hybrid techniques [7,10]. Rule-based techniques use a set of programmed rules to generate pre-defined answers. Machine-learning techniques can learn from data and classify questions efficiently; methods such as Neural Networks (NN), Support Vector Machines (SVM), Random Forests (RF), Decision Trees (DT), and K-Nearest Neighbors (KNN) have been used for question classification. Hybrid techniques combine rule-based and machine-learning techniques to classify questions. Deep learning techniques were derived from artificial neural networks and have since become a principal area of machine learning [2,11]. They include various architectures, such as the Convolutional Neural Network (CNN) [12], Long Short Term Memory (LSTM), which derives from the traditional Recurrent Neural Network (RNN), and the Gated Recurrent Unit (GRU) [13,14].
A medical QAS differs from an open-domain QAS [15]. The medical domain has domain-specific terminologies and domain-specific question types, which differ between domain-expert and non-expert users. The size of data, domain context, and resources are three key points to consider when comparing the open domain and the medical domain. Athenikos and Han [16] described four characteristics of restricted-domain QC in the biomedical domain: (1) large textual corpora (e.g., MEDLINE [17]); (2) highly complex domain-specific terminology, covered by domain-specific lexical, terminological, and ontological resources (e.g., the Unified Medical Language System (UMLS) [18]); (3) tools and methods for exploiting semantic information (e.g., MetaMap [19]); and (4) domain-specific format and typology of questions [20]. Syntactic-pattern-based methods have gained popularity for classifying question types, due to the restricted number of question types and insufficient labeled data [16].
Thus, in this study, we will focus on the following research questions: (1) Can the proposed model generalize well to diverse medical forum datasets? (2) What is the impact of word embeddings on medical question classification performance? To deal with these problems, an improved model for medical forum QC based on CNN and Bidirectional Long Short Term Memory (BiLSTM) is proposed, and our main contributions can be summarized as follows:
  • An improved CNN-BiLSTM model is proposed, which is an integrated model for the medical QC task. The model classifies a posed question into one of the predefined categories, which helps the QAS easily identify the expected answer type.
  • The accuracy of the proposed medical QC model is improved by tuning hyperparameters such as the learning rate, batch size, and the number of epochs.
  • A word embedding model based on Word2Vec is presented.
The rest of the paper is organized as follows. Section 2 gives more details about the related works performed by different researchers on medical QC and neural network-based deep learning methods used in this field. Section 3 describes the classification methods used to classify medical questions and the proposed architecture. Section 4 presents the experiments and analysis performed in this research, including the discussion of our proposed methods. Finally, Section 5 provides the conclusion and recommends some future directions.

2. Related Work

In this section, we discuss the work that has been performed by other researchers on medical QC and the deep neural network-based methods used in this field.

2.1. Medical Question Classification

QC in the medical domain has attracted many researchers recently. Some representative works are described as follows.
Chen et al. [21] introduced BioSentVec, a new sentence embedding model specifically designed for biomedical texts. By training on a large-scale biomedical corpus, BioSentVec generated high-quality embeddings that captured the semantic information in biomedical sentences. The embeddings were utilized for various biomedical text-mining tasks, including medical question classification, and achieved state-of-the-art performance. However, the quality and coverage of the pre-training corpus used for BioSentVec may impact its performance.
Lee et al. [22] proposed BioBERT, a pre-trained language representation model based on the transformer architecture, specifically designed for biomedical text. BioBERT achieved high-level performance on various biomedical text-mining tasks, including medical question classification. The model was pre-trained on a large-scale biomedical corpus, enabling it to capture domain-specific semantics and improve classification accuracy. However, BioBERT is a large and complex model, requiring substantial computational resources for training and fine-tuning.
Rasmy et al. [23] introduced MedBERT, a variant of BERT specifically tailored for Electronic Health Records (EHRs) and medical text. MedBERT was pre-trained on a large-scale EHR dataset and achieved state-of-the-art performance on various clinical NLP tasks, including medical question classification. The model effectively captured the complex relationships and semantics present in medical text. However, it is challenging to acquire and preprocess medical data for training MedBERT, due to privacy concerns, data access restrictions, and the need for expert annotations.
Liu [24] proposed a multi-dimensional feature extraction model for classifying medical questions. This approach combines multiple neural network models, namely RNN, LSTM, and GRU, to extract the characteristics of questions, achieving an accuracy of 54%. By comparing the accuracy and loss of the proposed model with traditional methods, the results showed that the RNN has a negative impact on the extracted final features, which resulted in increased losses.
Faris et al. [25] proposed an automatic system that classifies medical questions asked by patients and predicts the medical specialty. The system works with the Arabic language and utilizes 15,000 medical questions asked by clients, classified into 15 medical specialties, achieving an accuracy of 85%. However, they used the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, which is context-free. It therefore does not capture the semantics of the questions: the arrangement of words in a sentence does not convey its meaning or semantic relationships.
Wasim et al. [26] proposed a transformation technique called Label Power Set with Logistic Regression (LPLR) for multi-label biomedical QC, to address several problems: systems that fail to exploit the dependence between the labels of a given question, classifiers that fail to construct a decision boundary during training, and the limitation of the number of predicted labels to five. They also generated a multi-labeled corpus (i.e., MLBioMedLAT) with the help of Open Advancement of Question Answering (OAQA).
Suffian et al. [27] presented a novel technique for key-phrase extraction. Their approach aims to extract meaningful information in order to reduce the textual dimensionality. To classify diseases such as asthma, hypertension, diabetes, fever, abdominal issues, and heart problems, they employed machine-learning algorithms on a dataset of 690 patients with 200 annotated medical questions for the training and test sets. The results showed that they achieved an accuracy of 95%. Nevertheless, their method does not classify questions in real time.
McRoy et al. [28] created a new corpus of consumer cancer-related questions and built an Expected Answer Type (EAT) taxonomy to automatically classify these questions. This taxonomy was developed based on supervised machine-learning methods and achieved an F1-score (see Section 4.2) of 96.3%. They took dimensionality reduction and spelling correction into consideration, but the results showed that these had a small impact on the question classifier results. They concluded that statistical classification methods effectively address the natural imbalance in the types of questions asked by users, particularly in differentiating factual and patient-specific questions.
Liu [29] examined the intentions of many Chinese health questions on the Internet, formulating intention recognition as a text classification problem. Two techniques, location-based and area-based feature weighting, were used to improve the learning-based text classifier, achieving a Macro-averaged F1-score of 73.1% and a Micro-averaged F1-score of 82.9%. Here, the location-based technique considers a word in a health question intention-indicative if it appears at the beginning or the end of the question, while the area-based technique considers a health question likely to carry an intention if it contains many words that appear in a particular area of the question.
Other notable works are as follows. Llanos et al. [30] presented a model focused on automating the classification of doctor-patient questions by simulating consultations with virtual patients. They classified questions by looking up data in the clinical record using a computational strategy, achieving an average F1-score of 81.2%. Abacha et al. [31] developed a manually annotated dataset of medication question-answer pairs based on real consumer questions submitted to MedlinePlus. They also proposed recurrent and convolutional neural networks for question-type identification and focus recognition, achieving an accuracy of 75.7%. Yu et al. [32] demonstrated that a ladder approach, which incorporates the knowledge representation of a hierarchical evidence taxonomy, gives the highest performance; they achieved an accuracy of 57%. However, the classification performance across the five categories is not high due to the shortage of training data. Moreover, the problem of ambiguity in Population, Intervention, Comparator, and Outcome (PICO) sentence prediction tasks is studied in [33]. The authors discussed the impact that annotations for training named entity recognition systems have on training a high-performing and flexible architecture for question answering. They also noted that an augmentation approach, which generates additional training data by applying various transformations or modifications to the existing datasets, can compensate for insufficient training annotations for PICO entity extraction. Dodiya et al. [34] presented a rule-based approach for QC in the healthcare domain. Their approach used 500 medical questions collected from patients and doctors to classify questions into various categories based on the two-layered taxonomy of six coarse-grained and 50 fine-grained categories developed by Li and Roth. Nevertheless, the accuracy on the “how” type of questions is very low (44%), because the classifier fails to identify the matching pattern of the question and assigns it to the wrong category.
As introduced above, different word and sentence embedding models have been proposed, and these models play a significant role in categorizing medical questions during the classification process. Learning from the limitations of the existing works, we propose a CNN-BiLSTM model to address some of these problems and build a better classifier of medical questions.

2.2. Deep Neural Network-Based Methods

The commonly used deep neural network-based methods for text classification are RNN, LSTM, GRU, and CNN. CNNs have been employed in object recognition, question answering, and sentiment analysis [2,35], and CNN-based models are used for pattern recognition in text. Representative deep learning-based methods are introduced as follows.
Ambekar et al. [36] proposed a CNN-based unimodal disease risk prediction (CNN-UDRP) algorithm that automatically extracts features from the dataset and predicts heart disease. Their study focused on two main objectives: (1) to predict heart disease along with heart disease risk based on structured data, and (2) to handle missing values so as to improve the accuracy of heart disease prediction. The results showed that they achieved an accuracy of 65%.
Dai et al. [37] proposed an Inception Convolutional Autoencoder Healthcare question Clustering (ICAHC) model to solve existing problems in learning and representing the question corpus, including high dimensionality, sparseness, noise, and nonprofessional expression in Chinese healthcare. Their model can be used to predict patients’ conditions and to develop an automatic Health Question Answering (HQA) system.
Lu et al. [38] used a finetuned CNN model and an enhanced LSTM for natural language inference (ESIM) to predict breast cancer. They developed a dynamic website that has a cancer detection facility, user interface, and chatbot. By finetuning the Visual Geometry Group (VGG) CNN model, they were able to classify hematoxylin and eosin (H&E) breast issue images; while ESIM was used to process text matching in website intelligent question answering.
Existing research assumes the availability of large, high-quality training datasets. However, in real-world scenarios, obtaining labeled data for medical question classification can be challenging due to privacy concerns or limited expert annotations. The works presented in this section provide the background of QC in the medical domain based on deep learning technology, but the proposed model differs from these existing works. In our proposed model, Word2Vec is used for word embedding; features are then extracted from the word embedding matrix based on a CNN; finally, a BiLSTM is used to classify the questions. Although CNNs are commonly used to extract features from images, in our work we combine a CNN with a BiLSTM to classify text-based questions. By leveraging the power of pre-trained word embeddings and a hybrid model, our proposed model relies less on large, high-quality training datasets than other methods.

3. Proposed Architecture

In this section, we describe our proposed method for the QC task. We start by removing stopwords and bad symbols using different Natural Language Processing (NLP) techniques for data preprocessing. Here, stopwords are commonly used words that are considered to be insignificant or irrelevant for text analysis and information retrieval tasks, while bad symbols refer to characters or symbols that are undesirable or problematic in text data. Next, we proceed with word representation using a word embedding technique. Finally, we extract features using CNN and employ BiLSTM for sequence learning.
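To make this preprocessing step concrete, the following is a minimal Python sketch of how stopword and bad-symbol removal might look. It assumes NLTK's English stopword list; the function name and the exact symbol pattern are illustrative choices, not the authors' actual implementation.

```python
import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

STOPWORDS = set(stopwords.words('english'))
BAD_SYMBOLS = re.compile(r'[^a-z0-9 ]')  # anything outside lowercase letters, digits, spaces

def preprocess_question(text: str) -> str:
    """Lowercase the question, strip undesirable symbols, and drop stopwords."""
    text = BAD_SYMBOLS.sub(' ', text.lower())
    return ' '.join(t for t in text.split() if t not in STOPWORDS)

print(preprocess_question("What are the symptoms of Adult Acute Lymphoblastic Leukemia?"))
# -> 'symptoms adult acute lymphoblastic leukemia'
```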
The proposed QC system can be modeled as follows: $Q = \{q_1, q_2, \ldots, q_N\}$, where $Q$ is the set of questions and $q_i$ is the $i$-th question; $C = \{c_1, c_2, \ldots, c_n\}$, where $C$ is the set of categories. Each question is classified into one of the $n$ different categories. For example, $n = 7$ for the ICHI dataset [39], i.e., $q_i \in \{c_1, c_2, \ldots, c_7\}$.
Figure 1 shows the architecture of the proposed model for medical QC. The main parts of the proposed model will be introduced in detail as follows.

3.1. Word Embedding Model

Word representation plays a key role in system performance, especially in improving the performance of supervised models by learning word-level features [40]. Words are represented as vectors in a continuous space using an embedding technique called Word2Vec. Word2Vec is useful for the question classification model because it provides word embeddings that encode the semantic and contextual information of words. These embeddings enable the model to understand the meaning of words, capture semantic relationships, and improve the generalization and efficiency of the classification process. In the proposed model, Word2Vec is used to learn word associations from a large text corpus using a neural network model. For example, Word2Vec can be used to detect synonymous words in a given sentence. Figure 2 shows the pre-trained Word2Vec word representation model that we used.
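As a rough illustration of this step, the sketch below trains a Word2Vec model with gensim on a toy corpus of preprocessed questions. The corpus and the hyperparameters (vector_size, window, min_count, skip-gram) are assumed values for demonstration; the paper does not specify them.

```python
from gensim.models import Word2Vec

# Toy corpus: one token list per preprocessed question (illustrative data only).
sentences = [
    ["symptoms", "adult", "acute", "lymphoblastic", "leukemia"],
    ["treatments", "adult", "acute", "myeloid", "leukemia"],
    ["brand", "names", "abacavir"],
]

# Hyperparameters are assumptions, not values from the paper.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vec = w2v.wv["leukemia"]                        # 100-dimensional word vector
print(w2v.wv.most_similar("leukemia", topn=2))  # nearest words in embedding space
```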

3.2. Features Extraction Based on CNN

Generally, a CNN model comprises a convolution layer, a pooling layer, and a full connection layer [41]. The input data are passed to the convolution layer, where several convolutional kernels slide over the data vertically and horizontally to obtain local samples, which are then pooled in the pooling layer. The pooling process reduces the dimension of the data, which are finally sent to the full connection layer. The CNN used for feature extraction is introduced as follows:
Our input text is represented by the word embedding layer as a matrix $T$ of dimension $l \times e$, where $l$ is the maximum length of the text and $e$ is the dimension of the word vector. The convolution layer encodes $T$ using convolution kernels $w \in \mathbb{R}^{n \times h \times e}$, where $n$ is the number of convolution kernels and $h$ is the kernel height. The width of each convolution kernel equals the dimension of the embedding word vector, and each kernel $w_i$, $i = 1, 2, \ldots, n$, slides over $T$ to sample it.
The output of the convolution layer is

$$O_p = [O_{p1}, O_{p2}, \ldots, O_{pn}]$$

where the dimension of each $O_{pi}$ depends on the size and stride of the corresponding convolution kernel.
In the pooling layer, a matrix window that slides like a convolution kernel is established. This window slides over each $O_{pi}$, $i \in \{1, 2, \ldots, n\}$. The output of the pooling layer is

$$p_l = [p_{l1}, p_{l2}, \ldots, p_{ln}]$$
The vectors in $p_l$ are then combined, and the final classification is obtained using the full connection layer and the SoftMax layer. The SoftMax layer applies the SoftMax activation function and is a commonly used layer in neural networks, particularly for classification tasks.
We use the CNN to extract local features, taking the output of the embedding layer as input. The sentence is represented as $x \in \mathbb{R}^{n \times e}$, where $x_i$ is the word vector of the $i$-th word, $n$ is the number of words, and $e$ is the vector dimension.
Generally, the convolution operation applies a filter of window size $k$ to extract features from the input text sentence, which can be expressed as follows:

$$r_i = f(\omega \cdot x_{i:i+k-1} + b)$$

where $x_{i:i+k-1}$ is the sentence vector composed of $k$ consecutive words, $\omega$ is the filter weight matrix, $f$ is the activation function, and $b$ is the offset term.
After passing through the convolution layer, the characteristic matrix $V$ can be expressed as follows:

$$V = [r_1, r_2, \ldots, r_{n-k+1}]$$
We obtain the most salient local values by down-sampling the local feature matrix of the sentence obtained from the convolution layer. The Max-Pooling operation can be described as follows:

$$M_p = \max\{r_1, r_2, \ldots, r_{n-k+1}\}$$
Because the BiLSTM expects a serialized input and pooling disrupts the sequence structure of $V$, we add a full connection layer that connects the $M_{pi}$ vectors after the pooling layer into the vector $Z$, which serves as the input of the BiLSTM:

$$Z = [M_{p1}, M_{p2}, \ldots, M_{pn}]$$
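The convolution and pooling steps above map naturally onto standard one-dimensional convolution layers. The following is a minimal Keras sketch of this feature-extraction block, assuming 128 filters, a window size of $k = 3$, and a pooling size of 2; these values are illustrative, as the paper does not report them.

```python
from tensorflow.keras import layers

def cnn_feature_extractor(embedded):
    # embedded: tensor of shape (batch, l, e), the embedding matrix T
    # Convolution layer: each filter of window size k computes the features r_i.
    conv = layers.Conv1D(filters=128, kernel_size=3, activation='relu')(embedded)
    # Max-Pooling layer: down-samples the characteristic matrix V (the M_p step).
    pooled = layers.MaxPooling1D(pool_size=2)(conv)
    # The pooled feature vectors are passed on as the serialized input Z of the BiLSTM.
    return pooled
```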

3.3. Question Classification Based on BiLSTM

The LSTM model was designed to address the RNN’s problem of memory and information storage limitation. LSTM comprises three gates: the input gate, the forget gate, and the output gate [42,43]. The LSTM model can remember long-term information using its memory cells, and regulate this process through a gate mechanism. LSTM models are much better at handling long-term dependencies and much less susceptible to the vanishing gradient problem, and are very efficient at modeling complex sequential data. With a single LSTM, information is processed from only one forward direction, while for BiLSTM, it is processed from two directions: forward and backward. The forward LSTM deals with the input sequence of past data information, while the backward LSTM receives information on the input sequence of future data information [44,45].
BiLSTM is more efficient than LSTM because it uses previous and succeeding information, and combines them to obtain the output. To control the state of memory cells, point-wise multiplication and sigmoid function operations are performed in each gate. The input data at the current state and the output from the hidden state of the previous layer enter all gates. The role of the forget gate is to decide whether the information should be kept or ignored. The output value of the forget gate is between zero and one. When the value is close to zero, the information is ignored. However, when the value is close to one, the information is kept.
The BiLSTM model runs two LSTMs in opposite directions at each time step [41]. Let $x_t$ be the input at time $t$, $h_f$ the forward LSTM output at time $t$, $h_b$ the backward LSTM output at time $t$, and $h_t$ the BiLSTM output at time $t$. Each time step of the BiLSTM model is calculated as follows:

$$h_f = \mathrm{LSTM}(x_t, h_{t-1})$$

$$h_b = \mathrm{LSTM}(x_t, h_{t+1})$$

$$h_t = v_o h_f + y_o h_b + c_t$$

where $v_o$ is the weight matrix of the forward output, $y_o$ is the weight matrix of the backward output, and $c_t$ is the offset at time $t$.
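A minimal Keras sketch of the BiLSTM classification head described by these equations is given below; the number of LSTM units (64) is an assumed value. The Bidirectional wrapper runs the forward and backward LSTMs and combines their outputs, and the final Dense layer is the SoftMax layer that produces the question category.

```python
from tensorflow.keras import layers

def bilstm_classifier(seq_features, num_classes):
    # seq_features: (batch, steps, dims) sequence Z from the CNN block
    # Forward and backward LSTM outputs (h_f and h_b) are combined per question.
    h = layers.Bidirectional(layers.LSTM(64))(seq_features)
    h = layers.Dropout(0.2)(h)  # dropout rate from Section 4.1
    # SoftMax layer: probability distribution over the question categories.
    return layers.Dense(num_classes, activation='softmax')(h)
```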
Remark 1.
The main advantage of BiLSTM is its ability to capture dependencies from both past and future contexts, enabling the model to make more informed predictions. This is particularly beneficial in tasks where understanding the full context of a sentence or document is important, such as sentiment analysis, named entity recognition, sequence labeling, and machine translation.

4. Experiments and Analysis

4.1. Dataset and Experimental Setup

We evaluate the performance of our proposed model on two different datasets, namely ICHI and MedQuAD. The ICHI dataset used as a benchmark in this research was obtained from [39]. It consists of 11,000 sample questions, classified into seven categories, namely Demographic (DEMO), Disease (DISE), Treatment (TRMT), Goal-oriented (GOAL), Pregnancy (PREG), Family support (FAML), and Socializing (SOCL). More details on the ICHI dataset are shown in Table 1.
Figure 3 shows the number of questions by category in the ICHI dataset. The DISE category has the most questions, followed by the PREG, DEMO, GOAL, SOCL, FAML, and TRMT categories, respectively.
The MedQuAD dataset contains 47,457 medical question-answer pairs created from 12 National Institutes of Health (NIH) websites [46]. The dataset has 37 question types. In our study, we removed the questions of seven categories that were not in the format we wanted, leaving 30 categories. The main reason for removing these seven categories is that some of their questions have more columns than the others, so their format differs from that of the other questions. We therefore used a total of 19,749 questions. The dataset has several columns, such as question id, question type, question, answer, and id, but for our QC problem we retained only the question type and the question. In the MedQuAD dataset, questions are classified into categories such as brand names (BRAND1), brand names of combination products (BRAND2), usage (USAGE), treatment (TREAT), and so on; see Table 2 for more details.
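For illustration, category filtering and column selection of this kind could be done with pandas as sketched below. The file name and column names are assumptions, and the seven removed categories are not listed in the paper, so the set here is a placeholder.

```python
import pandas as pd

# Hypothetical flattened export of MedQuAD; file and column names are assumptions.
df = pd.read_csv("medquad.csv")

# The paper does not name the seven ill-formatted categories; placeholder set.
REMOVED_TYPES = {"..."}

df = df[~df["question_type"].isin(REMOVED_TYPES)]  # drop the seven categories
df = df[["question_type", "question"]]             # keep only the columns used for QC
```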
Figure 4 shows the word cloud of questions in the Pregnancy category of the ICHI dataset. The bigger and the bolder the word appears, the more often it is mentioned within a given category and the more important it is. For example, words like “day”, “period” and “pregnant” are more important in this category.
Figure 5 shows the word cloud for the information category in the MedQuAD dataset. Words such as “Tumor”, “Childhood” and “Cancer”, which appear in bigger sizes, are more important in this category.
The hyperparameters used in our experiments are the same for all models. The RMSprop optimizer that we used restricts oscillations in the vertical direction. The learning rate is set to 0.0001. We used 50 epochs, a batch size of 128, and a dropout of 0.2. Categorical cross-entropy is used as the loss function.
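Putting the pieces together, a minimal end-to-end Keras sketch of the CNN-BiLSTM classifier with the hyperparameters stated above might look as follows. The optimizer, learning rate, dropout, loss, epochs, batch size, and validation split come from this section; the vocabulary size, maximum question length, embedding dimension, filter count, and LSTM units are assumed values.

```python
from tensorflow.keras import layers, models, optimizers

vocab_size, max_len, emb_dim, num_classes = 20000, 50, 100, 7  # assumed sizes; 7 ICHI categories

inputs = layers.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, emb_dim)(inputs)  # pre-trained Word2Vec vectors can be loaded here
x = layers.Conv1D(128, 3, activation='relu')(x)    # CNN feature extraction
x = layers.MaxPooling1D(2)(x)
x = layers.Bidirectional(layers.LSTM(64))(x)       # BiLSTM sequence learning
x = layers.Dropout(0.2)(x)                         # dropout of 0.2
outputs = layers.Dense(num_classes, activation='softmax')(x)

model = models.Model(inputs, outputs)
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.0001),  # rmsprop, lr = 0.0001
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=50, batch_size=128, validation_split=0.1)
```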

4.2. Evaluation Measures

A set of evaluation measures was utilized to quantify the performance on the QC problem, including Accuracy, Recall, Precision, and F1-score [47].
Accuracy is the fraction of correct predictions over the total number of questions, i.e., the proportion of correctly classified questions:

$$\mathrm{Accuracy} = \frac{TP + TN}{TOTAL}$$

where TP (True Positive) means that both the actual and predicted values are positive, TN (True Negative) means that both the actual and predicted values are negative, and TOTAL is the total number of questions.
Precision measures how many of the records predicted as positive are actually positive, i.e., the proportion of positive class predictions that truly belong to the positive class. Note the difference between accuracy and precision: accuracy assesses the overall correctness of predictions across all classes, whereas precision focuses only on the correctness of the positive predictions. Precision is calculated by the following formula:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

where FP (False Positive) indicates that the predicted value is positive but the actual value is negative, and $TP + FP$ is the total number of questions classified into a given category.
Recall measures how many of the actual positive records are predicted correctly, and is calculated as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where FN (False Negative) signifies that the predicted value is negative but the actual value is positive.
The F1-score (F1-measure) is the harmonic mean of Precision and Recall; it balances the concerns of precision and recall in one score:

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
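As a sanity check on these definitions, the per-class metrics can be computed directly from label arrays as in the short sketch below (plain NumPy; the sample labels are illustrative).

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, cls):
    """Per-class Precision, Recall, and F1 computed from TP, FP, and FN counts."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2])
print(precision_recall_f1(y_true, y_pred, cls=1))  # (0.666..., 1.0, 0.8)
```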

4.3. Result Analysis

To verify the performance of the proposed model for QC, we first carried out experiments on the ICHI dataset, which is divided into a training set of 8000 questions and a test set of 3000 questions. Before training, 10% of the training set is set aside as a validation set. We compared our proposed model with two baseline models, CNN and BiLSTM. The results on the test set are shown in Table 3.
The results in Table 3 show that our proposed model achieved average values for Precision, Recall, and F1-score of 58.71%, 57.29%, and 57.14%, respectively. Compared with the baselines, our proposed model outperforms the CNN model on the average Precision, Recall, and F1-score by margins of 0.71%, 0.43%, and 0.14%, respectively, and the BiLSTM model by margins of 1.28%, 0.58%, and 0.28%, respectively. We can also see that the FAML, GOAL, and PREG categories give better results than the TRMT, SOCL, DEMO, and DISE categories. This may be because the questions in the latter categories are very long, which negatively affects analysis and question-type prediction with deep learning methods. Another reason may be that the model failed to understand some technical words used in the medical field.
To further test the performance of the proposed model, we conducted another experiment on the MedQuAD dataset, which is divided into training and test sets in proportions of 70% and 30%, respectively. Before training, 10% of the training set is set aside as a validation set. The results on the test set are summarized in Table 4.
According to the results in Table 4, our proposed model achieved average values for Precision, Recall, and F1-score of 93.33%, 93.33%, and 93.33%, respectively. Compared with the baselines, our proposed model outperforms the CNN model on the average Precision, Recall, and F1-score by margins of 3.33%, 3.33%, and 3.33%, respectively, and the BiLSTM model by margins of 3.43%, 3.36%, and 3.36%, respectively. In addition, the results for all categories are very high except for the DOSE and CAUSES categories. This is because some questions contain terms that could be assigned to multiple closely related question types, which hurts the classification performance in these categories. We can also see that many categories achieved 100% on all measures while a few others obtained 0%, meaning that the classification model performs well in some categories but fails in others. The classification failures may be caused by noisy data in those categories.
The accuracy of the proposed model and of the two baseline models on the test sets of the two datasets is listed in Table 5. The results show that the proposed model achieves the best accuracy on both datasets: 57.73% on the ICHI dataset and 100% on the MedQuAD dataset. For readability, only the confusion matrix of our proposed model on the test set of the ICHI dataset is given, as illustrated in Figure 6. The losses and accuracies of the proposed model on the training and test sets of the two datasets are depicted in Figure 7.
The results in Figure 7 show that the loss on the training set of the ICHI dataset is 1.44%, while that on the test set is 42.27%; the accuracy on the training set is 98.56%, while that on the test set is 57.73%. On the MedQuAD dataset, the loss is 0.00% on both the training and test sets, and the accuracy is 100.00% on both. This implies that our proposed model performs better on the MedQuAD dataset than on the ICHI dataset.
These results show that our proposed model outperforms the compared models in accurately classifying medical forum questions on both benchmark datasets. Although the improvement in accuracy is modest, the results in Table 3 and Table 4 show clear improvements in Precision, Recall, and F1-score, so the overall performance of the proposed model is better than that of the other models.

4.4. Discussions on Hyper-Parameters

The number of epochs and the bias affect the performance of the model. Increasing the number of epochs generally improves the accuracy, and decreasing the bias value also increases the accuracy. For example, in our case, with 10 epochs and a bias of 0.0001, we achieved an accuracy of 100% on the MedQuAD dataset; after decreasing the number of epochs to 5, the accuracy dropped to 99.94%. For the ICHI dataset, however, when we increased the number of epochs from 10 to 50 and reduced the bias from 0.001 to 0.0001, the accuracy decreased from 60.62% to 57.73%.

4.5. Limitations and Future Works

Although our proposed model improves the medical forum question classification task, it is essential to acknowledge the limitations of our study. For example, our model relies heavily on Word2Vec for word embedding representation, which may struggle with out-of-vocabulary words or fail to adequately capture domain-specific semantics. Exploring alternative word embedding techniques or domain-specific embeddings could be an avenue for future research. In addition, our model is a black-box model, making it challenging to interpret its decision-making process. Understanding the reasoning behind the model’s predictions is crucial, especially in the medical domain; incorporating interpretability techniques such as attention mechanisms or gradient-based saliency maps, which should be studied specifically, can help shed light on the model’s decisions. In future work, we plan to learn more medical features and use transformers to enhance the performance of the classification model. We also intend to collect data and build a large medical dataset to boost the QC accuracy of medical QASs.

5. Conclusions

This work presents a QC method for medical questions using a CNN-BiLSTM model. Questions are converted into word embedding vectors using Word2Vec; the CNN extracts features from the word embedding vectors, and a BiLSTM then learns from and classifies these features. Our model classifies any medical question from the datasets and predicts the label to which that question belongs. The proposed model was trained and tested on two benchmark datasets, and the experimental results revealed that it outperforms all baseline methods on both datasets, with an accuracy of 57.73% on the ICHI dataset and 100.00% on the MedQuAD dataset. This model can help medical practitioners identify the category of questions posed by patients. This work is also crucial for building an effective QAS, because it helps determine the expected answer type of a given question.

Author Contributions

Funding acquisition, J.N.; Project administration, J.N. and W.C.; Writing—original draft, E.M.; Writing—review and editing, E.M. and G.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (61873086) and the Science and Technology Support Program of Changzhou (CE20215022).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://tinyurl.com/medCat18 (accessed on 10 December 2022) and https://github.com/abachaa/MedQuAD (accessed on 10 April 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Asteris, P.G.; Gavriilaki, E.; Touloumenidou, T.; Koravou, E.E.; Koutra, M.; Papayanni, P.G.; Pouleres, A.; Karali, V.; Lemonis, M.E.; Mamou, A.; et al. Genetic prediction of icu hospitalization and mortality in COVID-19 patients using artificial neural networks. J. Cell. Mol. Med. 2022, 26, 1445–1455. [Google Scholar] [CrossRef] [PubMed]
  2. Mutabazi, E.; Ni, J.; Tang, G.; Cao, W. A Review on Medical Textual Question Answering Systems Based on Deep Learning Approaches. Appl. Sci. 2021, 11, 5456. [Google Scholar] [CrossRef]
  3. Asteris, P.G.; Kokoris, S.; Gavriilaki, E.; Tsoukalas, M.Z.; Houpas, P.; Paneta, M.; Koutzas, A.; Argyropoulos, T.; Alkayem, N.F.; Armaghani, D.J.; et al. Early prediction of COVID-19 outcome using artificial intelligence techniques and only five laboratory indices. Clin. Immunol. 2023, 246, 109218. [Google Scholar] [CrossRef] [PubMed]
  4. Roy, S.; Chakraborty, S.; Mandal, A.; Balde, G.; Sharma, P.; Natarajan, A.; Khosla, M.; Sural, S.; Ganguly, N. Knowledge-aware neural networks for medical forum question classification. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Gold Coast, Australia, 1–5 November 2021; pp. 3398–3402. [Google Scholar]
  5. Momtazi, S. Unsupervised Latent Dirichlet Allocation for supervised question classification. Inf. Process. Manag. 2018, 54, 380–393. [Google Scholar] [CrossRef]
  6. Bansal, A.; Eberhart, Z.; Wu, L.; McMillan, C. A neural question answering system for basic questions about subroutines. In Proceedings of the 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Honolulu, HI, USA, 9–12 March 2021; IEEE: Piscataway Township, NJ, USA, 2021; pp. 60–71. [Google Scholar]
  7. Agrawal, S.; Mishra, N. Question classification system for health care: A review. In Proceedings of the Third International Conference on Advanced Informatics for Computing Research, Shimla, India, 15–16 June 2019; pp. 1–6. [Google Scholar]
  8. Roberts, K.; Kilicoglu, H.; Fiszman, M.; Demner-Fushman, D. Automatically classifying question types for consumer health questions. In Proceedings of the AMIA Annual Symposium Proceedings; American Medical Informatics Association: Bethesda, MD, USA, 2014; Volume 2014, p. 1018. [Google Scholar]
  9. Dimitrakis, E.; Sgontzos, K.; Tzitzikas, Y. A survey on question answering systems over linked data and documents. J. Intell. Inf. Syst. 2020, 55, 233–259. [Google Scholar] [CrossRef]
  10. Zulqarnain, M.; Alsaedi, A.K.Z.; Ghazali, R.; Ghouse, M.G.; Sharif, W.; Husaini, N.A. A comparative analysis on question classification task based on deep learning approaches. PeerJ Comput. Sci. 2021, 7, e570. [Google Scholar] [CrossRef]
  11. Ni, J.; Shen, K.; Chen, Y.; Yang, S.X. An Improved SSD-Like Deep Network-Based Object Detection Method for Indoor Scenes. IEEE Trans. Instrum. Meas. 2023, 72, 5006915. [Google Scholar] [CrossRef]
  12. Park, J.; Jung, D.J. Deep Convolutional Neural Network Architectures for Tonal Frequency Identification in a Lofargram. Int. J. Control Autom. Syst. 2021, 19, 1103–1112. [Google Scholar] [CrossRef]
  13. Ni, J.; Shen, K.; Chen, Y.; Cao, W.; Yang, S.X. An Improved Deep Network-Based Scene Classification Method for Self-Driving Cars. IEEE Trans. Instrum. Meas. 2022, 71, 5001614. [Google Scholar] [CrossRef]
  14. Kang, H.; Yang, S.; Huang, J.; Oh, J. Time Series Prediction of Wastewater Flow Rate by Bidirectional LSTM Deep Learning. Int. J. Control Autom. Syst. 2020, 18, 3023–3030. [Google Scholar] [CrossRef]
  15. Sarrouti, M.; Lachkar, A.; Ouatik, S.E.A. Biomedical question types classification using syntactic and rule based approach. In Proceedings of the 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), Lisbon, Portugal, 12–14 November 2015; IEEE: Piscataway Township, NJ, USA, 2015; Volume 1, pp. 265–272. [Google Scholar]
  16. Athenikos, S.J.; Han, H. Biomedical question answering: A survey. Comput. Methods Programs Biomed. 2010, 99, 1–24. [Google Scholar] [CrossRef] [PubMed]
  17. Sarrouti, M.; El Alaoui, S.O. SemBioNLQA: A semantic biomedical question answering system for retrieving exact and ideal answers to natural language questions. Artif. Intell. Med. 2020, 102, 101767. [Google Scholar] [CrossRef] [PubMed]
  18. Yang, W.; Zeng, G.; Tan, B.; Ju, Z.; Chakravorty, S.; He, X.; Chen, S.; Yang, X.; Wu, Q.; Yu, Z.; et al. On the generation of medical dialogues for COVID-19. arXiv 2020, arXiv:2005.05442. [Google Scholar]
  19. Mishra, S.; Sharma, A. Automatic word embeddings-based glossary term extraction from large-sized software requirements. In Requirements Engineering: Foundation for Software Quality, Proceedings of the 26th International Working Conference, REFSQ 2020, Pisa, Italy, 24–27 March 2020; Proceedings 26; Springer: Berlin/Heidelberg, Germany, 2020; pp. 203–218. [Google Scholar]
  20. Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. (HEALTH) 2021, 3, 1–23. [Google Scholar] [CrossRef]
  21. Chen, Q.; Peng, Y.; Lu, Z. BioSentVec: Creating sentence embeddings for biomedical texts. In Proceedings of the 2019 IEEE International Conference on Healthcare Informatics (ICHI), Xi’an, China, 10–13 June 2019; IEEE: Piscataway Township, NJ, USA, 2019; pp. 1–5. [Google Scholar]
  22. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef] [PubMed]
  23. Rasmy, L.; Xiang, Y.; Xie, Z.; Tao, C.; Zhi, D. Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit. Med. 2021, 4, 86. [Google Scholar]
  24. Liu, J. Research on Question Classification Methods in the Medical Field. arXiv 2022, arXiv:2202.00298. [Google Scholar]
  25. Faris, H.; Habib, M.; Faris, M.; Alomari, M.; Alomari, A. Medical speciality classification system based on binary particle swarms and ensemble of one vs. rest support vector machines. J. Biomed. Inform. 2020, 109, 103525. [Google Scholar] [CrossRef]
  26. Wasim, M.; Asim, M.N.; Khan, M.U.G.; Mahmood, W. Multi-label biomedical question classification for lexical answer type prediction. J. Biomed. Inform. 2019, 93, 103143. [Google Scholar] [CrossRef]
  27. Suffian, M.; Khan, M.Y.; Wasi, S. Developing disease classification system based on keyword extraction and supervised learning. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 599–605. [Google Scholar] [CrossRef]
  28. McRoy, S.; Jones, S.; Kurmally, A. Toward automated classification of consumers’ cancer-related questions with a new taxonomy of expected answer types. Health Inform. J. 2016, 22, 523–535. [Google Scholar] [CrossRef]
  29. Liu, R.L. Intention Classification for Retrieval of Health Questions. Int. J. Knowl. Content Dev. Technol. 2017, 7, 101–120. [Google Scholar]
  30. Llanos, L.C.; Rosset, S.; Zweigenbaum, P. Automatic classification of doctor-patient questions for a virtual patient record query task. In BioNLP; Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 333–341. [Google Scholar]
  31. Abacha, A.B.; Mrabet, Y.; Sharp, M.; Goodwin, T.R.; Shooshan, S.E.; Demner-Fushman, D. Bridging the Gap Between Consumers’ Medication Questions and Trusted Answers. In Proceedings of the 17th World Congress on Medical and Health Informatics, MEDINFO 2019, Lyon, France, 25–30 August 2019; IOS Press: Amsterdam, Netherlands, 2019; pp. 25–29. [Google Scholar]
  32. Yu, H.; Sable, C.; Zhu, H.R. Classifying medical questions based on an evidence taxonomy. In Proceedings of the AAAI 2005 Workshop on Question Answering in Restricted Domains, Pittsburgh, PA, USA, 9–13 July 2005. [Google Scholar]
  33. Schmidt, L.; Weeds, J.; Higgins, J. Data mining in clinical trial text: Transformers for classification and question answering tasks. arXiv 2020, arXiv:2001.11268. [Google Scholar]
  34. Dodiya, T.; Jain, S. Question classification for medical domain question answering system. In Proceedings of the 2016 IEEE International WIE Conference on Electrical and Computer Engineering (WIECON-ECE), Pune, India, 19–21 December 2016; IEEE: Piscataway Township, NJ, USA, 2016; pp. 204–207. [Google Scholar]
  35. Kim, S.H.; Choi, H.L. Convolutional Neural Network for Monocular Vision-based Multi-target Tracking. Int. J. Control Autom. Syst. 2019, 17, 2284–2296. [Google Scholar] [CrossRef]
  36. Ambekar, S.; Phalnikar, R. Disease risk prediction by using convolutional neural network. In Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 16–18 August 2018; IEEE: Piscataway Township, NJ, USA, 2018; pp. 1–5. [Google Scholar]
  37. Dai, D.; Tang, J.; Yu, Z.; Wong, H.S.; You, J.; Cao, W.; Hu, Y.; Chen, C.P. An inception convolutional autoencoder model for Chinese healthcare question clustering. IEEE Trans. Cybern. 2019, 51, 2019–2031. [Google Scholar] [CrossRef] [PubMed]
  38. Lu, Y.; Zhao, Z.; Zhao, Z. Breast Cancer Classification Based on CNN and ESIM Model. In Proceedings of the 2021 IEEE International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), Fuzhou, China, 24–26 September 2021; IEEE: Piscataway Township, NJ, USA, 2021; pp. 742–746. [Google Scholar]
  39. Jalan, R.; Gupta, M.; Varma, V. Medical forum question classification using deep learning. In Advances in Information Retrieval, Proceedings of the 40th European Conference on IR Research, ECIR 2018, Grenoble, France, 26–29 March 2018; Proceedings 40; Springer: Berlin/Heidelberg, Germany, 2018; pp. 45–58. [Google Scholar]
  40. Kearns, W.R.; Thomas, J.A. Resource and response type classification for consumer health question answering. In Proceedings of the AMIA Annual Symposium Proceedings, San Francisco, CA, USA, 3–7 November 2018; American Medical Informatics Association: Bethesda, MD, USA, 2018; Volume 2018, p. 634. [Google Scholar]
  41. Sun, F.; Chu, N. Text sentiment analysis based on CNN-BiLSTM-attention model. In Proceedings of the 2020 International Conference on Robots & Intelligent System (ICRIS), Sanya, China, 7–8 November 2020; IEEE: Piscataway Township, NJ, USA, 2020; pp. 749–752. [Google Scholar]
  42. Yu, X.; Gong, R.; Chen, P. Question Classification Method in Disease Question Answering System Based on MCDPLSTM. In Proceedings of the 2021 IEEE 21st International Conference on Software Quality, Reliability and Security Companion (QRS-C), Hainan Island, China, 6–10 December 2021; IEEE: Piscataway Township, NJ, USA, 2021; pp. 381–387. [Google Scholar]
  43. An, H.; Zhang, S.; Cui, C.; Qian, C.; Lin, W. Dynamic Model Identification for Adaptive Polishing System. Int. J. Control Autom. Syst. 2022, 20, 3110–3120. [Google Scholar] [CrossRef]
  44. Kavianpour, P.; Kavianpour, M.; Jahani, E.; Ramezani, A. A cnn-bilstm model with attention mechanism for earthquake prediction. arXiv 2021, arXiv:2112.13444. [Google Scholar] [CrossRef]
  45. Ni, J.; Liu, R.; Tang, G.; Xie, Y. An Improved Attention-based Bidirectional LSTM Model for Cyanobacterial Bloom Prediction. Int. J. Control Autom. Syst. 2022, 20, 3445–3455. [Google Scholar] [CrossRef]
  46. Ben Abacha, A.; Demner-Fushman, D. A question-entailment approach to question answering. BMC Bioinform. 2019, 20, 511. [Google Scholar] [CrossRef] [PubMed]
  47. Mahanty, C.; Kumar, R.; Asteris, P.G.; Gandomi, A.H. COVID-19 patient detection based on fusion of transfer learning and fuzzy ensemble models using CXR images. Appl. Sci. 2021, 11, 11423. [Google Scholar] [CrossRef]
Figure 1. Architecture of the proposed CNN-BiLSTM model for MFQC. Word2Vec is used for word embeddings, CNN is used for Feature Extraction, BiLSTM is used for classification, the SoftMax layer uses the SoftMax activation function and finally, the class prediction layer produces the question category.
Figure 2. The word representation model using Word2Vec, where X is the word representation of the input vector and Y is the word representation of the output vector. W represents the word embedding matrix, and n is the vocabulary size with word embedding of size t.
Figure 3. The number of questions by category (ICHI dataset).
Figure 4. Word cloud for the Pregnancy category (ICHI dataset).
Figure 5. Word cloud for information category (MedQuAD dataset).
Figure 6. Confusion matrix of our proposed model on the test set of the ICHI dataset.
Figure 7. The Losses and Accuracies of the proposed model on the training and test sets of the two datasets: (a) the Losses on the ICHI dataset; (b) The Accuracies on the ICHI dataset; (c) the Losses on the MedQuAD dataset; (d) The Accuracies on the MedQuAD dataset.
Table 1. Description of ICHI question categories [39].
No   Category                  Details
1    Demographic (DEMO)        Questions related to a given demographic subgroup.
2    Disease (DISE)            Questions related to a particular disease.
3    Treatment (TRMT)          Questions related to a specific treatment or procedure.
4    Goal-oriented (GOAL)      Questions related to attaining a health goal.
5    Pregnancy (PREG)          Questions related to pregnancy.
6    Family support (FAML)     Questions related to matters of a caregiver.
7    Socializing (SOCL)        Questions related to socializing.
Table 2. Description of MedQuAD dataset [46].
No   Question Type                                  Question Sample
1    brand names (BRAND1)                           What are the brand names of Abacavir?
2    brand names of combination products (BRAND2)   What are the brand names of combination products of Abacavir?
3    usage (USAGE)                                  How should Abacavir be used and what is the dosage?
4    treatment (TREAT)                              What are the treatments for Adult Acute Myeloid Leukemia?
5    symptoms (SYMPT)                               What are the symptoms of Adult Acute Lymphoblastic Leukemia?
6    susceptibility (SUSCEPT)                       Who is at risk for Adult Acute Lymphoblastic Leukemia?
7    storage and disposal (STORAGE)                 What should I know about the storage and disposal of Abacavir?
8    stages (STAGES)                                What are the stages of Adult Acute Lymphoblastic Leukemia?
9    side effects (SIDE)                            What are the side effects or risks of Tretinoin?
10   severe reaction (SEVERE)                       What to do in case of a severe reaction to Anthrax Vaccine?
11   research (RESEAR)                              What research (or clinical trials) is being done for Adult Acute Myeloid Leukemia?
12   precautions (PRECAUT)                          Are there safety concerns or special precautions about Abacavir?
13   outlook (OUTLOOK)                              What is the outlook for Adult Acute Lymphoblastic Leukemia?
14   other information (OTHER)                      What other information should I know about Abacavir?
15   inheritance (INHERIT)                          Is Retinoblastoma inherited?
16   information (INFORM)                           What is (are) Adult Acute Lymphoblastic Leukemia?
17   indication (INDICAT)                           Who should get Abacavir and why is it prescribed?
18   important warning (IMPORT)                     What important warning or information should I know about Abacavir?
19   how can i learn more (HOW)                     How can I learn more about Anthrax Vaccine?
20   genetic changes (GENETIC)                      What are the genetic changes related to Chronic Myelogenous Leukemia?
21   frequency (FREQUE)                             How many people are affected by Aarskog-Scott syndrome?
22   forget a dose (FORGET)                         What should I do if I forget a dose of Abacavir?
23   exams and tests (EXAMS)                        How to diagnose Adult Acute Lymphoblastic Leukemia?
24   emergency or overdose (EMERGE)                 What to do in case of emergency or overdose of Abacavir?
25   dietary (DIET)                                 What special dietary instructions should I follow with Abacavir?
26   contraindication (CONTRAI)                     Who should not get Anthrax Vaccine and what are its contraindications?
27   why get vaccinated (WHY)                       Why get vaccinated with Diphtheria, Tetanus, and Pertussis (DTaP) Vaccine?
28   prevention (PREVENT)                           How to prevent Liver (Hepatocellular) Cancer?
29   dose (DOSE)                                    What is the dosage of Bacillus Calmette-Guerin (BCG) Vaccine?
30   causes (CAUSES)                                What causes Adult Central Nervous System Tumors?
Table 3. Results of different models on the test set of the ICHI dataset.
Category    CNN                            BiLSTM                         Proposed
            Precision  Recall   F1-Score   Precision  Recall   F1-Score   Precision  Recall   F1-Score
FAML        81.00%     83.00%   82.00%     76.00%     77.00%   77.00%     82.00%     84.00%   83.00%
GOAL        72.00%     73.00%   72.00%     86.00%     70.00%   77.00%     84.00%     65.00%   73.00%
SOCL        47.00%     48.00%   47.00%     56.00%     53.00%   55.00%     51.00%     55.00%   53.00%
DISE        42.00%     46.00%   44.00%     48.00%     55.00%   51.00%     44.00%     64.00%   52.00%
PREG        69.00%     50.00%   58.00%     49.00%     49.00%   49.00%     58.00%     53.00%   55.00%
TRMT        49.00%     47.00%   48.00%     48.00%     44.00%   46.00%     53.00%     36.00%   43.00%
DEMO        46.00%     51.00%   48.00%     39.00%     49.00%   43.00%     39.00%     44.00%   41.00%
Average     58.00%     56.86%   57.00%     57.43%     56.71%   56.86%     58.71%     57.29%   57.14%
Table 4. Results of different models on the test set of the MedQuAD dataset.
Category    CNN                            BiLSTM                         Proposed
            Precision  Recall   F1-Score   Precision  Recall   F1-Score   Precision  Recall   F1-Score
BRAND1      100.00%    100.00%  100.00%    97.00%     100.00%  99.00%     100.00%    100.00%  100.00%
INFORM      100.00%    100.00%  100.00%    100.00%    99.00%   100.00%    100.00%    100.00%  100.00%
PREVENT     0.00%      0.00%    0.00%      0.00%      0.00%    0.00%      100.00%    100.00%  100.00%
DOSE        0.00%      0.00%    0.00%      0.00%      0.00%    0.00%      0.00%      0.00%    0.00%
CAUSES      0.00%      0.00%    0.00%      0.00%      0.00%    0.00%      0.00%      0.00%    0.00%
Average     90.00%     90.00%   90.00%     89.90%     89.97%   89.97%     93.33%     93.33%   93.33%
Table 5. Comparison of the accuracy of different models on the test sets of the two datasets.
Dataset     Model                Accuracy
ICHI        CNN                  56.95%
ICHI        BiLSTM               57.45%
ICHI        Ours (CNN-BiLSTM)    57.73%
MedQuAD     CNN                  99.89%
MedQuAD     BiLSTM               99.84%
MedQuAD     Ours (CNN-BiLSTM)    100.00%
