As mentioned above, posts collected from social media provide useful information about patients’ opinions. To identify ADRs, in this section we describe how we perform data analytics on these social media posts from two aspects, tackling the related issues and enhancing the predictive performance. The first aspect considers critical factors in data processing for feature engineering, including traditional data balancing and feature selection; the second develops more effective models and methods for feature learning (such as deep learning-based methods). Further details are given below.
4.1. Tackling the Data Imbalance Problem by Resampling and Ensemble Learning
The class distribution of ADRs and non-ADRs can present a very severe between-class imbalance problem. To obtain higher accuracy, a machine learning model often biases toward the non-ADR class, which has a relatively large number of data samples. Thus, an additional data balancing procedure is required to mitigate this problem. Before developing more advanced computational methods for classification, we investigate the effect of data balancing using two types of methods (i.e., data resampling and ensemble learning), described below, to derive models from data with imbalanced class distributions.
The first type is to adopt direct resampling techniques. Intuitively, re-balancing the class distribution involves over-sampling or under-sampling. Simple over-sampling duplicates minority class instances to equalize the amount of data in the different classes. However, adding data this way may not improve performance, because it gives the learning method no additional information about how to identify minority class instances. Therefore, most over-sampling techniques analyze data properties to synthesize more meaningful and informative data for the minority class. The other technique is under-sampling, which balances the classes by decreasing the amount of majority class data. A simple approach is to randomly sample from the majority class the same amount of data as in the minority class. However, this approach also has drawbacks: it might sample data with similar representations in the feature space, causing the majority class to lose many useful data points that support its original representation. In this work, we use the imbalanced-learn Python package [34] to perform the resampling process. This package provides many resampling methods reported in the literature, and we design a series of experimental evaluations to examine whether resampling can improve predictive performance. Three steps are performed: first, we split the data into training and test datasets; then we apply the resampling techniques on the training dataset only, in order to re-balance the skewed classes while keeping the test dataset in its original distribution; finally, we use the rebalanced dataset to train the classification algorithms.
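The three-step protocol above (split first, resample the training portion only) can be sketched as follows. This is a minimal NumPy illustration using plain random over-sampling; the actual study uses the imbalanced-learn package, and the function name here is hypothetical.

```python
import numpy as np

def split_then_oversample(X, y, test_frac=0.25, seed=0):
    """Split first, then randomly over-sample the minority class
    in the training portion only (the test set keeps its skew)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(len(y) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]

    # Random over-sampling: duplicate minority instances until balanced.
    classes, counts = np.unique(y_tr, return_counts=True)
    n_max = counts.max()
    parts_X, parts_y = [X_tr], [y_tr]
    for c, n in zip(classes, counts):
        if n < n_max:
            extra = rng.choice(np.where(y_tr == c)[0], size=n_max - n, replace=True)
            parts_X.append(X_tr[extra])
            parts_y.append(y_tr[extra])
    return np.concatenate(parts_X), np.concatenate(parts_y), X_te, y_te
```

Keeping the test set in its original distribution is the crucial design point: resampling before the split would leak duplicated minority instances into the evaluation data.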
In addition to the above approach, which separates data quantity processing from model learning, a popular alternative for alleviating the class imbalance problem is ensemble learning, which couples data modeling and data balancing in the same procedure. This type of learning takes a multi-view perspective on the dataset, and many operating strategies (such as bagging, boosting, and stacking) can be employed with the chosen learning method (such as a decision tree). Some strategies sample subsets from the original dataset and train models on the different subsets, or run different learning algorithms on the same dataset to build the overall model; others consider different dimensions of the dataset to train individual models. All these approaches inspect the data from multiple angles, though they differ in operating details. With this property, ensemble learning methods are more capable of overcoming the effects of imbalanced classes. Moreover, they are easy to combine with resampling and cost-sensitive techniques to provide more effective prediction.
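As one concrete instance of coupling data balance with ensemble learning, the following sketch trains each bagging member on a balanced bootstrap (all minority instances plus an equal-sized random draw from the majority) and combines them by majority vote. The weak base learner and all names here are illustrative, not the paper's implementation.

```python
import numpy as np

class Stump:
    """A one-feature threshold classifier used as a weak base learner."""
    def fit(self, X, y):
        # Choose the feature/threshold pair with the best training accuracy.
        best = (0, 0.0, 0.0)
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                acc = np.mean((X[:, j] > t).astype(int) == y)
                if acc > best[2]:
                    best = (j, t, acc)
        self.j, self.t, _ = best
        return self

    def predict(self, X):
        return (X[:, self.j] > self.t).astype(int)

def balanced_bagging(X, y, n_models=11, seed=0):
    """Train each base model on a balanced subset: all minority
    instances plus an equal-sized random draw from the majority."""
    rng = np.random.default_rng(seed)
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    models = []
    for _ in range(n_models):
        maj = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, maj])
        models.append(Stump().fit(X[idx], y[idx]))
    return models

def vote(models, X):
    """Majority vote over the ensemble members."""
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)
```

Because every member sees a balanced view of the data, no single model is dominated by the majority class, while the vote across differently sampled majority subsets recovers the information that under-sampling discards in any one subset.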
4.2. Solving the High-Dimension Problem by Feature Selection
Another issue that needs to be seriously considered in Twitter posts is the feature dimension. A dataset collected from a social forum is often composed of many free-style messages, meaning the data (i.e., the tweets) are textual documents in a non-structured format. We therefore have to format the text so that classification algorithms can understand and process it. Moreover, words with ADR meanings are obscure, implicit, and not easy to detect. Sometimes an ADR mention is not a single word but a set of words, or even a sentence, and needs to be recognized through context analysis. As we do not know in advance what kinds of word relations or types the ADRs are built on, we apply different kinds of feature extraction methods to the same sentence and concatenate the results into one data record. As a result, a short sentence in a Twitter post is often expanded into high-dimensional data.
For the dataset used in this work, the number of features is much larger than the amount of data. Though rich data features are considered helpful for improving model performance, the functional roles of the feature categories must be examined carefully to verify whether they actually benefit the model. One way to address this problem is to adopt feature selection methods that choose a subset of the original features to maximize the modeling performance of a learning algorithm. Consequently, the dimension of the feature vectors can be reduced, which in turn reduces overfitting and the computational effort of learning a model.
Traditionally, the positive effect of data features is the major concern of a classification task, but it is also important to examine their negative effects. To inspect the negative effect of each feature category listed in Table 1, former studies have employed the leave-one-out method to remove individual categories and observed how the predictive performance (i.e., the metrics of accuracy and F-score) changed. Their results show that though most of the features are helpful, the performance changes are not obvious. To amend this, here we further investigate these effects by examining and analyzing the impact of each feature category in detail. A series of experiments is conducted to identify whether the rich feature categories contribute effectively to data modeling, and whether each category can be substituted by others in the modeling procedure while maintaining the same level of performance. If the most appropriate feature combination can be used, it is expected that both the efficiency and effectiveness of the models can be improved. In our application, we target the metrics of accuracy, F-score, and AUC (area under the Receiver Operating Characteristic curve).
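The leave-one-category-out protocol can be sketched as below. The category-to-column mapping and the deliberately simple nearest-centroid stand-in model are illustrative assumptions; the study's actual classifiers and feature categories differ.

```python
import numpy as np

def centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Nearest-centroid classifier: a deliberately simple stand-in model."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(X_te - c0, axis=1)
    d1 = np.linalg.norm(X_te - c1, axis=1)
    return np.mean((d1 < d0).astype(int) == y_te)

def leave_one_category_out(X_tr, y_tr, X_te, y_te, categories):
    """For each named feature category (a set of columns), drop it,
    retrain, and report the accuracy change versus the full feature set."""
    base = centroid_accuracy(X_tr, y_tr, X_te, y_te)
    report = {}
    for name, cols in categories.items():
        keep = [j for j in range(X_tr.shape[1]) if j not in set(cols)]
        acc = centroid_accuracy(X_tr[:, keep], y_tr, X_te[:, keep], y_te)
        report[name] = acc - base  # negative => the category was helpful
    return base, report
```

A strongly negative delta marks a category the model depends on; a delta near zero suggests the category is redundant or substitutable, which is exactly what the experiments in this section probe.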
One of the best ways to select features is the family of filtering-based methods, which employ statistical techniques to score features and determine accordingly which to retain. A popular filtering method is the univariate filter, which evaluates each feature independently. As the calculation for the univariate filter is fast and the predictive performance is generally acceptable, we adopt this method for feature selection. Univariate selection performs a statistical test on non-negative features to select the k best features. That is, it applies a common univariate statistical test to each feature, measures the relationship between the feature and the target, and selects the best features accordingly. This work uses the simple and efficient tools of scikit-learn [35], with the three criteria it provides for feature evaluation: chi-square, Pearson correlation, and ANOVA F-value. More details about the three algorithms are given in [36], and the evaluation results are presented in the experimental section.
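To make the univariate filter concrete, the following sketch implements one of the three criteria, the ANOVA F-value, by hand and keeps the k highest-scoring columns. It mirrors what scikit-learn's `SelectKBest` does conceptually but is not the library code itself.

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F-value per feature: between-class variance
    divided by within-class variance (higher = more discriminative)."""
    scores = np.empty(X.shape[1])
    classes = np.unique(y)
    N, k = len(y), len(classes)
    grand = X.mean(axis=0)
    for j in range(X.shape[1]):
        ss_between = sum(
            (y == c).sum() * (X[y == c, j].mean() - grand[j]) ** 2 for c in classes
        )
        ss_within = sum(
            ((X[y == c, j] - X[y == c, j].mean()) ** 2).sum() for c in classes
        )
        scores[j] = (ss_between / (k - 1)) / (ss_within / (N - k))
    return scores

def select_k_best(X, y, k):
    """Keep the k highest-scoring feature columns (univariate filter)."""
    top = np.argsort(anova_f_scores(X, y))[::-1][:k]
    return np.sort(top)
```

Because each feature is scored independently, the filter is fast on high-dimensional data, at the cost of ignoring interactions between features.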
4.3. Enhanced Deep Learning for ADR Recognition
The above methods for class balance and feature selection have pipeline workflows, so they cannot guarantee the best overall performance by pursuing the best result individually at each stage and then combining the stages. Therefore, we press on to integrate the above stages and build an end-to-end approach via a deep neural model. The model is designed to capture the long-term dependencies of the input sequential texts (tweets). It combines the functions of feature extraction and feature selection via temporal contextual information. Working in this way, the model is able to detect key words as well as context information, just as humans can. This is a crucial characteristic for our application, because ADR symptoms are often expressed in various types of sentences and require human-like comprehension for precise recognition.
When using a deep learning approach for a language-based task, it is difficult to deal with the diversity of words, especially on a small corpus, because there are not sufficient data to build up a model by learning hidden patterns from that corpus alone. Adopting transfer learning techniques, we can choose a pre-trained model (trained on a large-scale corpus in a particular application domain) and then use a specific dataset to re-train it into a fine-tuned model. In this way, despite the relatively limited data in the domain-specific dataset, the re-trained model can still achieve outstanding performance.
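The essence of this transfer-learning setup can be illustrated with a toy NumPy sketch: a fixed mapping stands in for the frozen pre-trained encoder, and only a new classification head is trained on the small target dataset. This is a conceptual analogy, not the BERT fine-tuning procedure itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_head(encode, X, y, epochs=200, lr=0.1):
    """Logistic-regression head on top of a frozen encoder: only the
    head weights w, b are updated; the encoder is left untouched."""
    H = encode(X)                      # frozen "pre-trained" features
    w = np.zeros(H.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(H @ w + b)
        g = p - y                      # gradient of binary cross-entropy
        w -= lr * H.T @ g / len(y)
        b -= lr * g.mean()
    return w, b
```

Only the small head is fitted to the scarce domain data, which is why the approach remains trainable even when the target corpus is far too small to learn the encoder from scratch.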
In this study, we adopt BERT as the encoder to develop our approach. BERT, a deep learning technique based on the transformer network architecture, has been pre-trained on a large-scale corpus. The end part of the proposed model is a classifying layer, which can be a simple fully connected layer or a more complicated RNN-like layer. The output layer uses the sigmoid activation function to generate a prediction as a real number between 0 and 1. Here, an output above 0.5 means that the model takes the observation as an ADR.
Figure 1 illustrates the deep neural network-based architecture of our ADR classifier. As shown, before the tweets are fed into the model, each of them is parsed by a data preprocessing procedure, including the steps of stop word elimination, HTML tag elimination, URL link elimination, and tokenization. A vocabulary dictionary is constructed from all corpus terms, and each word is encoded as the index of the corresponding term in the dictionary. We align all input sentences to sequences of length 32 (the average data length); over-length sentences are truncated, while short sentences are padded with zero values. In this network model we use a fully connected layer of 768 neurons as the classifying layer.
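The preprocessing and encoding steps described above can be sketched with the standard library alone. The stop word list, regular expressions, and function names are simplified placeholders for illustration.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "am", "are", "i", "my", "me"}  # tiny demo list
SEQ_LEN = 32  # fixed sequence length (the average tweet length)

def preprocess(tweet):
    """Strip HTML tags and URL links, lowercase, tokenize, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", tweet)       # html tag elimination
    text = re.sub(r"https?://\S+", " ", text)   # url link elimination
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def encode(tokens, vocab):
    """Map tokens to dictionary indices (0 is reserved for padding),
    then truncate or zero-pad to the fixed length of 32."""
    ids = [vocab.setdefault(t, len(vocab) + 1) for t in tokens]
    return (ids + [0] * SEQ_LEN)[:SEQ_LEN]
```

The shared `vocab` dictionary grows as the corpus is scanned, so every term receives a stable index that the embedding layer of the network can look up.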
In addition to the network architecture, it is important to enhance the training method to reduce the serious imbalanced-class effect in ADR detection. As the deep neural network is trained by a stochastic gradient descent optimization algorithm, it requires a loss function to repeatedly estimate the error of the model and then update the network weights accordingly. The loss function represents the primary training objective of a neural network, and the training performance largely depends on its choice. Different loss functions have been used for training deep neural models. Among others, the cross-entropy function has been widely adopted because it couples well with the commonly used Softmax output layer to undo its exponentiated outputs. Moreover, its properties are closely related to the Kullback–Leibler (KL) divergence, which describes the difference between two probability distributions [37]. Consequently, minimizing the cross-entropy function is similar to minimizing the KL divergence. However, this function is not inherently flexible about the amount of information to be back-propagated. Therefore, instead of using a static loss function, we propose a new objective function based on batch-wise self-adaptive weighting. With the adaptive loss function, the model becomes flexible in estimating its error and can dynamically capture the characteristics of the data and the learning environment. The model can thus be forced to learn only the most discriminative and contributive features. As the model captures the inherent trade-off between classification accuracy and robustness to noise, the trained model is more immune to overfitting.
Our objective function enables the deep neural network model to balance the class weights at each batch step of the training stage. With an adaptive function to dynamically guide the learning, we can alleviate the data imbalance problem and improve the performance of the resulting model. The following set of equations (i.e., Equations (1)–(6)) quantitatively explains the proposed method. Equation (1) presents the cost function, which indicates that, given a loss function L, we can compute the distance between the prediction ŷ (by the model) and the ground truth y based on the data distribution, where θ represents the set of trainable parameters of the deep model. The goal is to approximate the distribution of the model mapping based on the input x and the prediction ŷ, so that it is as close as possible to the empirical distribution. This means the cost needs to be minimized, and we can thus turn Equation (1) into the optimization problem described in Equation (2). Here, the deep model adjusts the trainable parameters (i.e., θ) so that the probability distribution of the model’s output reaches the maximum log-likelihood on the training data {x, y}. Substituting Equation (2) into Equation (1), we can revise Equation (1) to obtain Equation (3). In such a gradient method, the optimization algorithm updates the parameters based on the training set, as described in Equation (4).
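The typeset Equations (1)–(4) did not survive extraction; in standard maximum-likelihood form they would read roughly as follows (a reconstruction from the surrounding description, not necessarily the paper's exact notation):

```latex
% Eq. (1): cost as the expected loss between prediction and ground truth
J(\theta) = \mathbb{E}_{(x,y)\sim \hat{p}_{\mathrm{data}}}\, L\big(f(x;\theta),\, y\big)

% Eq. (2): fit by maximizing the log-likelihood on the training data
\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \log p\big(y_i \mid x_i;\theta\big)

% Eq. (3): substituting (2) into (1) yields the negative log-likelihood cost
J(\theta) = -\,\mathbb{E}_{(x,y)\sim \hat{p}_{\mathrm{data}}} \log p\big(y \mid x;\theta\big)

% Eq. (4): gradient-based parameter update with learning rate \eta
\theta \leftarrow \theta - \eta\, \nabla_{\theta} J(\theta)
```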
As can be observed in Equation (1), the type of loss function influences how the deep model learns patterns from the data. For an imbalanced dataset, if the traditional cross-entropy is used as the objective function, the model tends to take the majority class (i.e., non-ADRs here) as the prediction result, because the majority class yields the minimum loss on the training data. However, this results in a serious bias. It is necessary to amplify the error when an instance is an ADR and the model gives a wrong answer. Therefore, as Equation (5) shows, the binary cross-entropy loss function is multiplied by a weight w_i to reduce the class imbalance effect. In this equation, y_i is the ground-truth label, ŷ_i is the model prediction, and N is the data size. In contrast to studies that use fixed weights to balance the skewed classes, our weight is computed for each mini-batch, depending on the class distribution of the batch data. The proposed batch-wise adaptive weight formula is shown in Equation (6), in which B is the batch size, y_j is the label of the j-th instance in the batch (an ADR datum has a y value of 1), and the term (Σ_j y_j)/B means the percentage of the ADR class in the batch data. For example, assuming that ADRs are the minority in the mini-batch (the same as in the training set) and they occupy 10% of the mini-batch, we can calculate (Σ_j y_j)/B = 0.1, so for an ADR instance we derive w_i = 1 − 0.1 = 0.9. Meanwhile, for an instance of non-ADR data, we derive w_i = 0.1. Through the above steps, the weights are dynamically adjusted during learning according to the class distribution of the mini-batch, so larger penalties are given to wrongly predicted ADR examples. In this way, our method more precisely describes the relation between the model and the data in practice, and no special handling of the data imbalance problem is otherwise required.
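The batch-wise weighting of Equations (5) and (6) can be sketched as below. Since the typeset equations are not available in the extracted text, this NumPy version is a reconstruction from the worked example (each instance weighted by the batch fraction of the opposite class), not the paper's verified code.

```python
import numpy as np

def adaptive_weighted_bce(y_true, y_pred, eps=1e-7):
    """Weighted binary cross-entropy (Eq. (5)) with batch-wise adaptive
    weights (Eq. (6)): each instance is weighted by the batch fraction
    of the opposite class, so errors on rare ADRs are penalized more."""
    p_adr = y_true.mean()                       # ADR share of this mini-batch
    w = np.where(y_true == 1, 1.0 - p_adr, p_adr)
    y_pred = np.clip(y_pred, eps, 1 - eps)      # numerical safety for log()
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.mean(w * losses)
```

With a batch that is 10% ADRs, an ADR instance receives weight 0.9 and a non-ADR instance weight 0.1, matching the worked example above, so a misclassified ADR contributes far more loss than a misclassified non-ADR.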