Article

The Effect of Training Data Size on Disaster Classification from Twitter

by Dimitrios Effrosynidis 1,*, Georgios Sylaios 2 and Avi Arampatzis 1
1 Database & Information Retrieval Research Unit, Department of Electrical & Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
2 Lab of Ecological Engineering & Technology, Department of Environmental Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
* Author to whom correspondence should be addressed.
Information 2024, 15(7), 393; https://doi.org/10.3390/info15070393
Submission received: 3 June 2024 / Revised: 28 June 2024 / Accepted: 6 July 2024 / Published: 8 July 2024

Abstract

In the realm of disaster-related tweet classification, this study presents a comprehensive analysis of various machine learning algorithms, shedding light on crucial factors influencing algorithm performance. The exceptional efficacy of simpler models is attributed to the quality and size of the dataset, which enables them to discern meaningful patterns. Complex models, while powerful, are time-consuming and prone to overfitting, particularly with smaller or noisier datasets. Hyperparameter tuning, notably through Bayesian optimization, emerges as a pivotal tool for enhancing the performance of simpler models. A practical guideline for algorithm selection based on dataset size is proposed: Bernoulli Naive Bayes for datasets below 5000 tweets and Logistic Regression for larger datasets. Notably, Logistic Regression shines with 20,000 tweets, delivering an impressive combination of performance, speed, and interpretability. A further improvement of 0.5% is achieved by applying ensemble and stacking methods.

1. Introduction

The increasing use of social media platforms during disaster events has opened up new avenues for extracting valuable information and enhancing crisis response efforts. The identification of informative messages and the classification of crisis-related content have become essential tasks in crisis informatics.
In particular, natural disasters such as floods, earthquakes, storms, extreme weather events, landslides, and wildfires could see better operational management and assessment by utilizing the immediate and valuable information provided by social media platforms such as Facebook, Twitter, Instagram, Flickr, and others. Real-time information disseminated through public posts could promote community disaster awareness and warnings [1], helping to mobilize the scientific community to produce more accurate forecasting of the evolution of extreme events as well as to improve authorities’ actions and response [2]. For example, in [3] the authors studied the capacity of Twitter to spread vital emergency information to the public based on real-time posts uploaded on the platform during Hurricane Sandy. Belcastro et al. [4] focused on identifying secondary disaster-related events from social media. Annis and Nardi [5] integrated crowdsourced data and images into a hydraulic model to improve flood forecasting and support decision-making in early warning situations. Peary et al. [6] examined the use of social media in disaster preparedness for earthquakes and tsunamis. Styve et al. [7] utilized Twitter text data to assess the severity of extreme weather events such as heavy precipitation, storms, and sea level rise, with the aim of enhancing the adaptive capacity to mitigate such events.
Recent works have focused on different aspects of information identification and content classification. The labeled data in these studies cover various disaster events, including floods, other natural disasters, and non-natural disasters, as well as situation-aware tweets. The labels assigned to the data correspond to binary classification problems, e.g., informative/non-informative [8,9,10,11,12], situation-aware/non-situation-aware [13], damage assessment [14], request for help or not [15,16], urgency classification [17], and disaster/non-disaster [18].
To tackle these classification tasks, scholars have employed a range of algorithms, including Support Vector Machine (SVM) [8,10,12,13,14,15,16,17,19], Artificial Neural Network (ANN) [8,17], Convolutional Neural Network (CNN) [8,9,10,13,17,20], Naïve Bayes (NB) [9,17], Recurrent Neural Network (RNN) [9], Random Forest (RF) [12,13,14,15,19], XGBoost (XGB) [12,13], Logistic Regression (LR) [13,17], AdaBoost [13,14,17], Deep Belief Network (DBN) [13], Decision Tree (DT) [12,14,15], and CART [20]. Most recently, transformer-based models have seen increased usage in the domain, with BERT, RoBERTa, and DistilBERT [10,11,16,19] being the most common. This diversity reflects the exploration of different machine learning techniques and their suitability for crisis-related classification tasks.
Feature engineering and preprocessing are crucial aspects in this field. Both involve cleaning and selecting relevant features to capture the distinctive characteristics of crisis-related content. The employed feature engineering techniques encompass n-grams [8,15], linguistic features (e.g., part-of-speech, sentiment analysis) [9,13,14,17], user-based information [9], polarity-based features [9], and entity-based features [13]. Additionally, topic modeling techniques such as Latent Dirichlet Allocation (LDA) [9] are utilized to extract topic-based information. The most common preprocessing techniques involve the removal of stopwords [9,11,12,19]; the removal of URLs, user names, punctuation, hashtags, and numbers [10,11,12,17,18,19]; lowercasing [12,14,18]; and lemmatization or stemming [11,12,17,18].
The training strategies employed vary, with some studies utilizing publicly available crisis-related datasets such as CrisisNLP, CrisisMMD, and SMERP and others leveraging their own collected data from specific events such as Hurricanes Sandy [15] and Harvey [17]. Training sets are typically partitioned into subsets for training, validation, and testing purposes. In some cases, only one set is used and the evaluation is performed through cross-validation [11,14,16]. Evaluation metrics, including accuracy, F1 score, precision, recall, area under the curve (AUC), and weighted average precision and recall, are employed to assess the performance of the classification models.
Comparing the results across relevant studies reveals valuable insights into the effectiveness of different approaches. The application of CNN often demonstrates superior performance compared to other algorithms such as SVM, NB, RF, and CART in identifying informative messages and classifying crisis-related content [8,9,13,20]. Deep learning models, particularly CNNs, exhibit the ability to capture complex patterns and representations from the noisy and diverse nature of social media data during crisis events.
Furthermore, the utilization of advanced language models such as BERT, DistilBERT, and RoBERTa combined with careful preprocessing steps has shown promising results in benchmarking crisis-related social media datasets [10]. The removal of symbols, emoticons, invisible and non-ASCII characters, punctuation, numbers, URLs, and hashtag signs enhances the generalization capabilities of these models [9,10,18,19]. Using the RoBERTa and BERT models with a combination of datasets has demonstrated improved performance in generalization and classification accuracy [10,16].
Several papers have challenged the notion that deep learning models consistently outperform other algorithms or have produced results showing that the performance difference is small. These studies provide valuable insights into alternative approaches that yield competitive results. While deep learning models excel at capturing complex patterns and representations in crisis-related data, other algorithms can offer complementary advantages such as interpretability, efficiency, and robustness.
For example, a number of studies have explored the effectiveness of traditional machine learning algorithms such as SVM, NB, RF, and DT for information identification and content classification. These algorithms leverage carefully engineered features such as unigrams, bigrams, and trigrams as well as topic modeling techniques such as LDA. When coupled with non-deep learning algorithms, these feature engineering techniques have demonstrated competitive performance in various scenarios [15].
Moreover, research comparing deep learning models against NB in identifying informative tweets during disasters has shown contrasting results. While CNNs consistently outperform NB classifiers, the performance of NB classifiers depends heavily on the nature of the data. In particular, NB classifiers perform markedly worse when trained on natural disaster events and evaluated on non-natural disasters, and vice versa. These findings emphasize the importance of understanding the characteristics and distribution of the data when selecting appropriate algorithms [20].
Additionally, several studies have explored the use of ensemble methods and boosted algorithms such as XGBoost, AdaBoost, and Deep Belief Networks (DBN) to leverage the strengths of multiple models. These techniques aim to improve overall classification performance by combining the outputs of individual classifiers or by employing sophisticated learning algorithms that capture complex relationships in the data [13].
By considering these alternative approaches, it is possible to gain a more nuanced understanding of the strengths and limitations of different algorithms in crisis informatics. While deep learning models, particularly CNNs, have shown remarkable performance, they are not always the optimal choice in all scenarios. Other algorithms, such as traditional machine learning models and ensemble methods, offer valuable alternatives that balance performance, interpretability, efficiency, and robustness.

2. Related Work

Several studies have investigated the impact of training set size and dataset size on supervised machine learning. The authors of [21] examined various algorithms for land cover classification of large-area high-resolution remotely sensed data, including Artificial Neural Network (ANN), Support Vector Machine (SVM), Random Forest (RF), and others, across a wide range of training set sizes from 40 to 10,000 samples. The study found that RF achieved a high accuracy of almost 95% with small training sample sets, and there were minimal variations in overall accuracy between small and even very large sample sets.
Another study explored the impact of training and testing data splits by investigating the performance of Gaussian Process models on time series forecasting tasks. The study examined data sizes ranging from 2 months to 12 months, and found that the performance of the models varied depending on the specific training data splits [22].
Furthermore, the work of [23] focused on image datasets for plant disease classification. Their study emphasized the need for a substantial amount of data to achieve optimal performance.
In the context of machine learning algorithm validation with limited sample sizes, in [24] the authors examined simulated datasets ranging from 20 to 1000 samples. Their study compared Support Vector Machine (SVM), Logistic Regression (LR), and different validation methods. They concluded that K-fold Cross-Validation (CV) can lead to biased performance estimates, while Nested CV and training/testing split approaches provide more robust and unbiased estimates.
Additionally, the impact of dataset size on training sentiment classifiers was explored in [25]. The study used seven datasets ranging from 1000 to 243,000 samples, and evaluated algorithms such as Decision Tree (DT), Naive Bayes (NB), K-Nearest Neighbors (KNN), and Support Vector Machine (SVM). The findings showed that all algorithms improved with increased training set size, with NB performing the best overall and reaching a plateau after 81,000 samples.
In the context of political text classification, the study of [26] addressed data scarcity using deep transfer learning with BERT-NLI. BERT-NLI fine-tuned on 100–2500 texts outperformed classical models by 10.7–18.3 points. It matched the performance of classical models trained on 5000 texts with just 500 texts, showing significant data efficiency and improved handling of imbalanced data.
In the domain of sentiment analysis of Twitter data, [27] investigated the influence of different training set sizes ranging from 10% to 100%. The results indicated that changing the training set size did not significantly affect the sentiment classification accuracy, with SVM outperforming Naive Bayes.
Another Twitter study on pharmacovigilance [28] used weak supervision with noisy labeling to train machine learning models, exploring various training set sizes from 100,000 to 3 million tweets. Classical models such as SVM and deep learning models such as BERT performed well, showing similar accuracy to gold standard data.
Finally, [29] examined the impact of dataset size on Twitter classification tasks using various datasets and models, including BERT, BERTweet, LSTM, CNN, SVM, and NB. The results indicated that adding more data is not always beneficial in supervised learning. More than 1000 samples were recommended for reliable performance; notably, transformer-based models remained competitive even with smaller sample datasets.
The contributions of the present paper are summarized as follows:
  • Algorithm performance analysis: This study systematically evaluates the performance of multiple machine learning algorithms for tweet classification in the context of disaster events. It provides valuable insights into the strengths and weaknesses of each algorithm, aiding practitioners in making informed choices. It also employs ensemble and stacking techniques to further boost performance.
  • Hyperparameter tuning importance: By emphasizing the significance of hyperparameter tuning, particularly through Bayesian optimization, this work underscores the potential performance gains achievable by systematically exploring the hyperparameter space. This knowledge can guide practitioners in optimizing their models effectively.
  • Occam’s razor in machine learning: The application of Occam’s razor to machine learning model selection is explored, emphasizing the advantages of simpler models in terms of interpretability and reduced overfitting risk.
  • Impact of dataset size on model choice: This research establishes a practical guideline for selecting the most suitable algorithm based on dataset size. This contribution can aid practitioners in making efficient and effective algorithm choices appropriate to the scale of their data.

3. Materials and Methods

This section provides the resources necessary for understanding the basis of our experiments and interpreting their outcomes.

3.1. Dataset

The consolidated CrisisBench dataset [10] was employed for the experiments reported in this work. This is a benchmark dataset that consists of annotated data from several different data sources, such as CrisisLex (CrisisLex26, CrisisLex6) [30], CrisisNLP [31], SWDM2013 [32], ISCRAM13 [33], Disaster Response Data (DRD), Disasters on Social Media (DSM), CrisisMMD [34], and data from AIDR [35]. In total, the dataset, which is shown in Table 1, consists of 166,098 labeled tweets divided into two label classes: informative (101,759 tweets) regarding a disaster event, and not informative (64,339 tweets).

3.2. Classification Models

Naive Bayes. We used a variant of Naive Bayes called Bernoulli Naive Bayes [36], which assumes binary features. Given the class, it models the conditional probability of each feature as a Bernoulli distribution. It is commonly used for text classification tasks where features represent the presence or absence of words. Despite its simplifying assumptions, Bernoulli Naive Bayes can achieve good performance and is computationally efficient. It works by calculating the likelihood of each feature given the class, and then combining the results with prior probabilities to make predictions using Bayes’ theorem.
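For illustration, a minimal scikit-learn sketch of such a classifier is shown below; the tweets and labels are hypothetical toy data, and the configuration is not the Bayesian-optimized one used in our experiments.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real experiments use the CrisisBench tweets.
tweets = ["flood waters rising near the bridge", "enjoying coffee this morning"]
labels = [1, 0]  # 1 = informative, 0 = not informative

# binary=True encodes the presence/absence of each token, matching the
# Bernoulli assumption described above.
model = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
model.fit(tweets, labels)
print(model.predict(["storm damage reported downtown"]))
```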
Light Gradient Boosting. Light Gradient Boosting (LightGBM) [37] is a gradient boosting framework that uses tree-based learning algorithms. It aims to provide a highly efficient and scalable solution for handling large-scale datasets. LightGBM builds trees in a leaf-wise manner, making it faster and more memory-efficient than other boosting algorithms. It uses a gradient-based approach to iteratively optimize the model by minimizing the loss function. LightGBM also implements features such as regularized learning, bagging, and feature parallelism to improve performance. Thanks to its ability to handle large datasets and high predictive accuracy, LightGBM has become popular in various machine learning tasks, including classification, regression, and ranking.
Linear SVC. Linear Support Vector Classifier (Linear SVC) [38] is a linear classification algorithm that belongs to the Support Vector Machine (SVM) family. It aims to find the optimal hyperplane that separates the data points of different classes. Linear SVC works by maximizing the margin between the classes while minimizing the classification error. Unlike traditional SVM, Linear SVC uses a linear kernel function, making it computationally efficient and suitable for large-scale datasets. It performs well in scenarios where the classes are linearly separable. Linear SVC is widely used in various applications such as text classification, image recognition, and sentiment analysis. It provides a powerful and interpretable solution for binary and multiclass classification problems, offering good generalization and robustness to noisy data.
Logistic Regression. Logistic Regression [39] is a popular statistical model used for binary classification tasks. It predicts the probability of an instance belonging to a certain class by fitting a logistic function to the input features. The model estimates the coefficients for each feature, which represent the influence of the corresponding feature on the outcome. Logistic Regression assumes a linear relationship between the features and the log-odds of the target variable. It is a simple and interpretable algorithm that performs well when the classes are linearly separable. Logistic Regression is widely used in domains such as healthcare, finance, and marketing for tasks including churn prediction, fraud detection, and customer segmentation. It is computationally efficient and provides probabilistic outputs, making it suitable for both binary and multi-class classification problems.
XGBoost. XGBoost (Extreme Gradient Boosting) [40] is an advanced gradient boosting algorithm that has gained popularity in machine learning competitions and real-world applications. It is designed to optimize performance and scalability by utilizing a gradient boosting framework. XGBoost trains an ensemble of weak decision tree models sequentially, with each subsequent model correcting the errors made by the previous models. It incorporates regularization techniques to prevent overfitting and provides options for customizing the learning objective and evaluation metrics. XGBoost supports both classification and regression tasks, and can efficiently handle missing values and sparse data. It also offers features such as early stopping, parallel processing, and built-in cross-validation. XGBoost excels in capturing complex patterns and interactions in the data, making it a powerful tool for predictive modeling. Its high performance and flexibility have made it a popular choice across various domains, including finance, healthcare, and online advertising.
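Both boosting libraries described above expose scikit-learn-compatible classifiers. The sketch below shows a minimal, hypothetical configuration of each; the parameter values are illustrative rather than the tuned settings used in our experiments.

```python
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

# Hypothetical dense data standing in for vectorized tweet features.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)

# LightGBM grows trees leaf-wise; num_leaves controls model complexity.
lgbm = LGBMClassifier(n_estimators=200, num_leaves=31, learning_rate=0.05)
# XGBoost adds regularization terms (e.g., reg_lambda) to curb overfitting.
xgb = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.05,
                    reg_lambda=1.0, eval_metric="logloss")

for clf in (lgbm, xgb):
    clf.fit(X, y)
    print(type(clf).__name__, clf.score(X, y))
```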
Convolutional Neural Networks. CNNs are deep learning models commonly used for analyzing and classifying text data, including tweets. They excel at capturing local patterns and dependencies within the text through the use of convolutional layers. By applying filters to the input text, CNNs extract features and learn representations that are relevant for classification tasks. These models have proven effective in tasks such as sentiment analysis, topic classification, and spam detection. CNNs offer a robust approach for understanding and categorizing tweet content based on the textual characteristics. We use a similar architecture to that proposed by [41].

3.3. Setup

Our preprocessing step included various techniques that have previously shown increased performance on Twitter texts [42,43]. The techniques were: (a) removing non-ASCII characters; (b) replacing URLs and user mentions; (c) removing hashtags; (d) removing numbers; (e) collapsing multiple repetitions of exclamation marks, question marks, and full stops; and (f) lemmatization.
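A rough sketch of such a preprocessing function is given below, assuming NLTK's WordNet lemmatizer; the regular expressions are illustrative and may differ in detail from the implementation we used.

```python
import re
from nltk.stem import WordNetLemmatizer  # assumes NLTK and its WordNet data are installed

lemmatizer = WordNetLemmatizer()

def preprocess(tweet: str) -> str:
    text = tweet.encode("ascii", "ignore").decode()           # (a) drop non-ASCII characters
    text = re.sub(r"https?://\S+", "<url>", text)              # (b) replace URLs
    text = re.sub(r"@\w+", "<user>", text)                     # (b) replace user mentions
    text = re.sub(r"#\w+", "", text)                           # (c) remove hashtags
    text = re.sub(r"\d+", "", text)                            # (d) remove numbers
    text = re.sub(r"([!?.])\1+", r"\1", text)                  # (e) collapse repeated punctuation
    tokens = [lemmatizer.lemmatize(t) for t in text.lower().split()]  # (f) lemmatize
    return " ".join(tokens)

print(preprocess("OMG!!! Flooding near @user123 #storm http://t.co/abc 2024"))
```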
The hyperparameters of the machine learning algorithms were tuned using Bayesian optimization [44]. Bayesian optimization combines statistical modeling and sequential decision-making to find the optimal set of hyperparameters that maximizes the performance of the model. Unlike traditional grid or random search methods, Bayesian optimization intelligently explores the hyperparameter space by learning from previous evaluations. The process begins by constructing a probabilistic surrogate model, such as a Gaussian process or tree-based model, to approximate the unknown performance function. This surrogate model captures the trade-off between exploration and exploitation, allowing for informed decisions to be made about which hyperparameters to evaluate next. Bayesian optimization uses an acquisition function to balance exploration (sampling uncertain regions) and exploitation (focusing on promising areas). By iteratively evaluating the model’s performance with different hyperparameter configurations, Bayesian optimization updates the surrogate model and refines its estimation of the performance landscape. This iterative process guides the search towards promising regions, ultimately converging to the set of hyperparameters that yield the best performance.
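As a hedged illustration of this procedure, the sketch below uses the scikit-optimize package (an assumption; any Bayesian optimization library could be substituted) to tune the regularization strength of a Logistic Regression pipeline on hypothetical toy data.

```python
from skopt import BayesSearchCV            # assumes scikit-optimize is installed
from skopt.space import Real
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy tweets; the real experiments tune on the CrisisBench training set.
tweets = ["flood warning issued for the valley", "bridge collapsed after the earthquake",
          "rescue teams deployed downtown", "wildfire spreading near the highway",
          "great pizza tonight", "watching a movie with friends",
          "new phone arrived today", "weekend plans anyone"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])

# The surrogate model learns from each evaluation and proposes the next
# configuration to try, balancing exploration and exploitation.
search = BayesSearchCV(pipe,
                       search_spaces={"clf__C": Real(1e-3, 1e2, prior="log-uniform")},
                       n_iter=15, scoring="f1", cv=2, random_state=0)
search.fit(tweets, labels)
print(search.best_params_, round(search.best_score_, 3))
```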
The CNN model was trained by following the guidelines from the CrisisBench work [10]. We used the architecture proposed by [41] and the Adam optimizer [45]. The batch size was 128, the maximum number of epochs was 1000, there were 300 filters with window sizes and pooling lengths of 2, 3, and 4, and the dropout rate was 0.02. The early stopping criterion was based on the accuracy of the development set, with a patience of 200.
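An approximate Keras sketch of this configuration follows; the vocabulary size, sequence length, and embedding dimension are placeholders, and global max pooling is used here in place of the stated pooling lengths, so the architecture follows [41] only loosely.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, EMB_DIM = 20000, 40, 300   # placeholder values

inputs = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)

# Parallel convolutions with window sizes 2, 3, and 4, 300 filters each.
branches = []
for window in (2, 3, 4):
    conv = layers.Conv1D(filters=300, kernel_size=window, activation="relu")(emb)
    branches.append(layers.GlobalMaxPooling1D()(conv))

merged = layers.concatenate(branches)
merged = layers.Dropout(0.02)(merged)
outputs = layers.Dense(1, activation="sigmoid")(merged)

model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(), loss="binary_crossentropy",
              metrics=["accuracy"])

# Early stopping on development-set accuracy with a patience of 200 epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=200,
                                           restore_best_weights=True)
# X_train, y_train, X_dev, y_dev are placeholders for the encoded CrisisBench splits.
# model.fit(X_train, y_train, validation_data=(X_dev, y_dev),
#           batch_size=128, epochs=1000, callbacks=[early_stop])
```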
All models were evaluated using the F-measure score due to the class label imbalance of the two classes (informative vs. not informative).
The machine learning models were developed in Python using the scikit-learn package [46], while the CNN models were developed using the keras package [47].

4. Results and Discussion

The CrisisBench dataset used in this work is already split into three sets: a training set of 109,441 samples, a development set of 15,870 samples, and a test set of 31,037 samples. We trained and tuned our models on the training set and validated them separately on both the development set and the test set. We performed both validations, as we are interested in whether the models generalize well and avoid overfitting to one set. For each machine learning algorithm, we created 21 models using subsamples of the training set, starting at 1% and then increasing from 5% to 100% in steps of 5%. The resulting samples were 1094, 5472, 10,944, 16,416, …, 109,441 in size.
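A schematic version of this experimental loop, with hypothetical variable names for the loaded splits and Logistic Regression standing in for any of the classifiers, is sketched below.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Fractions 1% and then 5% ... 100% of the training set (21 models per algorithm).
fractions = [0.01] + [round(f, 2) for f in np.arange(0.05, 1.01, 0.05)]

def learning_curve(train_texts, train_y, dev_texts, dev_y, seed=0):
    """Train on growing subsamples and score on the development set."""
    rng = np.random.default_rng(seed)
    scores = {}
    for frac in fractions:
        n = int(len(train_texts) * frac)
        idx = rng.choice(len(train_texts), size=n, replace=False)
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit([train_texts[i] for i in idx], [train_y[i] for i in idx])
        # average="weighted" is an assumption about how the F-measure is aggregated.
        scores[frac] = f1_score(dev_y, model.predict(dev_texts), average="weighted")
    return scores
```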
The results of this study are presented in Table 2 for the development set and Table 3 for the test set. These tables are visualized in Figure 1 and Figure 2, respectively. One notable observation lies in the striking similarity between the results obtained on the development and test datasets. This resemblance manifests not only in the performance metrics but also in the characteristic trends seen in the line plots as the training size increases. Notably, the F-measure exhibits a consistent trend, with a marginal 0.4% increment on average in the test set, which possesses twice the sample size in comparison to the development set (31,037 vs. 15,870). This convergence of outcomes hints at the high quality of the dataset, and implies a degree of linguistic similarity between these three subsets.
Indeed, it is reasonable to anticipate such lexical overlaps, particularly when dealing with disaster-related events. It stands to reason that new and unseen data would seldom introduce a multitude of previously unencountered terms. Another noteworthy implication is that in the context of this specific problem the algorithms we employed demonstrate a remarkable capacity for generalization. This assertion is substantiated by their nearly indistinguishable performance on the development and test sets, suggesting a high likelihood of continued success when applied to entirely new and unfamiliar Twitter datasets.
Another notable observation centers on the remarkable performance achieved by the models. When subjected to training on the entire training dataset, all algorithms consistently exhibit F-measure scores exceeding 85%. Specifically, in ascending order of performance on the test set, the scores are as follows: Multinomial Naive Bayes (85.59%), Bernoulli Naive Bayes (86.20%), Convolutional Neural Network (86.61%), XGBoost (87.91%), LightGBM (88.49%), Linear Support Vector Classification (88.53%), and Logistic Regression (88.54%).
It is worth highlighting that our CNN model’s results align closely with those reported in the paper introducing this dataset [10], where the authors reported an F-measure of 86.6%. Intriguingly, our simpler Bayesian-optimized models outperform the CNN, with Logistic Regression surpassing it by nearly 2%. This observation suggests that the problem at hand can be characterized as relatively tractable, since even straightforward algorithms can attain nearly 90% F-measure performance with some parameter tuning.
The phenomenon of simpler algorithms consistently outperforming their more complex counterparts on certain machine learning tasks, as evidenced in this particular case, can be attributed to a combination of several key factors.
First and foremost, the quality and size of the dataset wield substantial influence. In this instance, the dataset is not only sizable but also clean and meticulously structured. Such a favorable data environment allows even simpler algorithms to capture meaningful patterns adeptly; when the data already exhibit a clear structure and distinctive features, introducing complex models does not yield a significant advantage.
Another pivotal factor contributing to this phenomenon is the propensity for complex models, exemplified in this case by deep neural networks such as the Convolutional Neural Network (CNN), to succumb to overfitting. This vulnerability becomes particularly pronounced when dealing with smaller or noisy datasets. Complex models have a higher capacity to memorize the idiosyncrasies and noise present in the training data, resulting in suboptimal generalization when confronted with unseen data. In stark contrast, simpler models are characterized by their reduced complexity, which renders them more resilient to overfitting, ultimately bolstering their robustness.
Hyperparameter tuning also plays a crucial role in elucidating the superior performance of simpler algorithms over complex ones. Bayesian optimization systematically explores and identifies optimal hyperparameters for simpler models. This meticulous tuning process can propel these models to deliver exceptional performance, often surpassing their more complex counterparts.
Moreover, opting for simpler models aligns with the age-old principle of Occam’s razor. This principle posits that when all other factors are equal, simpler models should be preferred. Simpler models are inherently more interpretable and carry a reduced risk of overfitting. In many real-world scenarios, these streamlined models can effectively approximate the underlying data distribution, especially when the problem at hand lacks excessive complexity.
The improved performance of simpler models is not solely evident when training with the entire dataset; rather, it becomes increasingly apparent as the size of the training dataset grows. This trend is clearly illustrated in the figures, where the CNN and Bernoulli models consistently exhibit lower performance while the other four algorithms demonstrate similar scores.
Notably, Bernoulli Naive Bayes stands out, essentially reaching its performance plateau at just 10% of the training set (10,944 samples), whereas the remaining algorithms continue to improve their F-measures as the training set grows. Remarkably, with as little as 1% of the data, Bernoulli Naive Bayes outperforms the other algorithms. For the rest of the algorithms, the point where performance gains become marginal lies around 60% of the training set (65,664 samples).
As demonstrated above, this problem appears relatively easy to solve, as even at a conservative cutoff of 20% all algorithms except CNN achieve F-measures above 85%. Beyond this point, the algorithms exhibit only marginal improvements, typically in the range of 1–2%. Considering the substantial increase in training data required for this minor gain, it becomes evident that the effort may not be justified.
Based on these findings, it is possible to establish a practical rule for selecting the most suitable algorithm for a given dataset size. When dealing with datasets containing fewer than 5000 tweets, the Bernoulli Naive Bayes model emerges as a promising choice. Conversely, when confronted with larger datasets exceeding 5000 tweets, Logistic Regression proves to be the preferred option. Both of these algorithms are also the fastest in terms of execution.
Notably, with a dataset size of 20,000 tweets, Logistic Regression offers a compelling advantage, delivering an impressive 87% F-measure. This is achieved with remarkable speed and a high level of interpretability, making Logistic Regression an attractive choice for such datasets.
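A minimal sketch of this rule of thumb is given below; the vectorizers and default settings are illustrative and not tied to the exact configurations used in our experiments.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

def choose_classifier(n_tweets: int):
    """Practical guideline: Bernoulli NB below 5000 tweets, Logistic Regression above."""
    if n_tweets < 5000:
        return make_pipeline(CountVectorizer(binary=True), BernoulliNB())
    return make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

print(choose_classifier(3000))    # Bernoulli Naive Bayes pipeline
print(choose_classifier(20000))   # Logistic Regression pipeline
```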

Further Improvements using Ensemble and Stacking Approaches

As a final experiment, we tried to push the performance even further by utilizing ensemble and stacking methods.
In an ensemble methodology for classification, majority voting is applied to predictions derived from multiple algorithmic models. This synergistic approach aims to enhance the predictive F-measure beyond the capability of any single model within the ensemble. By capitalizing on the divergent strengths of various models and mitigating their individual weaknesses, ensemble approaches can offer a robust solution showcasing superior performance, diminished susceptibility to overfitting, and enhanced predictive stability.
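A minimal sketch of hard majority voting over binary predictions, using hypothetical prediction arrays, is shown below; in our experiments the votes come from the five retained models.

```python
import numpy as np

def majority_vote(predictions):
    """Hard majority vote over binary predictions from several models.

    `predictions` has shape (n_models, n_samples) with 0/1 labels.
    """
    votes = np.asarray(predictions)
    return (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

# Hypothetical predictions from five models on four tweets.
preds = [[1, 0, 1, 1],
         [1, 0, 0, 1],
         [0, 0, 1, 1],
         [1, 1, 1, 0],
         [1, 0, 1, 1]]
print(majority_vote(preds))   # -> [1 0 1 1]
```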
Stacking involves a hierarchical model architecture in which a secondary model (the meta-learner) is trained to synthesize the predictions of multiple primary models (the base learners). The base learners are trained on the complete dataset, while the meta-learner’s training leverages the base learners’ predictions as input features. The meta-learning phase aims to capture the essence of the predictions made by the base models, yielding a refined final prediction.
The efficacy of both voting and stacking methodologies is significantly bolstered by incorporating models that are uncorrelated or exhibit low correlation in terms of their errors. Figure 3 shows the results of computing the Pearson correlation between the predictions on the validation set (training with 20% of the data) of the six algorithms that we used. It can be observed that the highest correlations are between XGBoost and Light Gradient Boosting (0.89) and between Logistic Regression and Linear SVC (0.86). The CNN’s predictions depict the lowest overall correlations, ranging from 0.63 to 0.7.
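Such pairwise correlations can be computed directly from the prediction vectors; a minimal sketch with hypothetical 0/1 predictions from three models follows.

```python
import pandas as pd

# Hypothetical 0/1 predictions of three models on the same validation tweets.
preds = pd.DataFrame({
    "log_regr":   [1, 0, 1, 1, 0, 1],
    "linear_svc": [1, 0, 1, 1, 0, 0],
    "cnn":        [1, 1, 0, 1, 0, 1],
})
print(preds.corr(method="pearson"))   # pairwise Pearson correlation matrix
```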
According to these results, we decided not to use the XGBoost model for the ensemble and stacking experiment due to its high correlation with Light Gradient Boosting; of the two, the single XGBoost model is inferior and requires more computational time. While we could have also dropped Linear SVC, that would have resulted in four final models, and majority voting works best with an odd number of models, as ties are avoided.
For the stacking method, we used Logistic Regression as the meta-learner. Because stacking uses the predictions of the individual models as features for the meta-learner, the validation set was used as the meta-learner’s training set.
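A schematic of this stacking arrangement is sketched below with two hypothetical base learners in place of the five tuned models actually used; base learners are fitted on the training set, their validation-set predictions become the meta-learner's training features, and the fitted meta-learner is applied to their test-set predictions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

def stack(base_models, X_train, y_train, X_val, y_val, X_test):
    # 1. Fit each base learner on the training texts.
    for m in base_models:
        m.fit(X_train, y_train)
    # 2. Base-learner predictions on the validation set become meta-features.
    meta_train = np.column_stack([m.predict(X_val) for m in base_models])
    meta_test = np.column_stack([m.predict(X_test) for m in base_models])
    # 3. Logistic Regression as the meta-learner, trained on the validation set.
    meta = LogisticRegression().fit(meta_train, y_val)
    return meta.predict(meta_test)

# Hypothetical base learners; the paper uses five tuned models including a CNN.
base = [make_pipeline(CountVectorizer(binary=True), BernoulliNB()),
        make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))]
```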
Table 4 shows the F-measure results for the individual models trained on 20% of the training set in comparison to the ensemble and stacking models. Both validation and test results are shown. The majority voting ensemble technique shows improved performance over the best individual model by about 0.5% in both validation and test sets, reaching 87.29% and 87.57%, respectively. Stacking shows the same behavior, improving the test set’s F-measure by about 0.5%, ultimately reaching 87.61%.

5. Conclusions

After conducting a comprehensive analysis of various machine learning algorithms and ensembles for tweet classification in disaster contexts, our study reveals several key findings.
The exceptional performance of simpler models can be attributed in part to the quality and size of the dataset. This dataset, characterized by its substantial size and meticulous organization, allowed even basic models to capture meaningful patterns. However, it is important to note that the advantages of simpler models may diminish when dealing with more complex or noisy datasets.
Complex models such as deep neural networks are susceptible to overfitting, especially in scenarios involving smaller or noisier datasets. These models tend to memorize noise, which hinders their ability to generalize effectively. In contrast, simpler models with reduced complexity exhibit greater resilience to overfitting.
Hyperparameter tuning, particularly through Bayesian optimization, played a pivotal role in enhancing the performance of the simpler models. Systematically exploring the hyperparameter space enabled these models to outperform their more complex counterparts.
The principle of Occam’s razor, favoring simpler models when all other factors are equal, holds true in this context. Simpler models are not only easier to interpret but also less prone to overfitting; thus, in many real-world scenarios these streamlined models can effectively approximate the underlying data distribution, particularly for problems that lack excessive complexity.
Our findings also underscore the significance of training dataset size. The superiority of simpler models becomes increasingly evident as the size of the training dataset grows. Even with as little as 20% of the data, simpler models consistently achieve F-measures above 85%, while complex models exhibit only marginal improvements with more data.
Based on these insights, we propose a practical guideline for algorithm selection based on dataset size. For datasets containing fewer than 5000 tweets, Bernoulli Naive Bayes emerges as a promising choice. Conversely, for larger datasets exceeding 5000 tweets, Logistic Regression proves to be the preferred option. Notably, Logistic Regression offers a compelling advantage with a dataset size of 20,000 tweets, delivering an impressive 87% F-measure along with speed and interpretability benefits. Ensemble and stacking approaches can also be used to further boost the results by approximately 0.5%.
These findings could be used in operational forecasting and management of extreme events. Text-related information transmitted in real time through social media channels could be used to support response operations in the case of extreme weather events, floods, earthquakes, storms, and wildfires. Earth-related Digital Twin technologies could capitalize on the explosion of these new data sources to enable simultaneous communication with real-world systems and models.
In summary, our study highlights the importance of dataset quality, model complexity, hyperparameter tuning, and the principle of simplicity when choosing a machine learning algorithm. These insights provide valuable guidance for practitioners in text classification, especially when dealing with disaster-related tweet data. Ultimately, our findings emphasize the substantial performance potential of simpler models when applied judiciously, even in scenarios that might initially be perceived as more complex.
In future work, we intend to explore and expand these findings on dataset quality and model complexity across more diverse classifications, such as identifying multiple types of disaster-related tweets.

Author Contributions

Conceptualization, D.E.; methodology, D.E.; software, D.E.; validation, D.E.; formal analysis, D.E.; investigation, D.E.; resources, D.E.; data curation, D.E.; writing—original draft preparation, D.E.; writing—review and editing, D.E., G.S. and A.A.; visualization, D.E.; supervision, G.S. and A.A.; project administration, G.S.; funding acquisition, G.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the European Union’s Horizon 2020 European Green Deal Research and Innovation Program (H2020-LC-GD-2020-4), grant number No. 101037643—ILIAD (Integrated Digital Framework for Comprehensive Maritime Data and Information Services). The article reflects only the authors’ views, and the Commission is not responsible for any use that may be made of the information it contains.

Data Availability Statement

The dataset used in this study is available at https://github.com/firojalam/crisis_datasets_benchmarks?tab=readme-ov-file#datasets (accessed on 2 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Takahashi, B.; Tandoc, E.C., Jr.; Carmichael, C. Communicating on Twitter during a disaster: An analysis of tweets during Typhoon Haiyan in the Philippines. Comput. Hum. Behav. 2015, 50, 392–398. [Google Scholar] [CrossRef]
  2. Yuan, F.; Li, M.; Liu, R. Understanding the evolutions of public responses using social media: Hurricane Matthew case study. Int. J. Disaster Risk Reduct. 2020, 51, 101798. [Google Scholar] [CrossRef]
  3. Wang, B.; Zhuang, J. Crisis information distribution on Twitter: A content analysis of tweets during Hurricane Sandy. Nat. Hazards 2017, 89, 161–181. [Google Scholar] [CrossRef]
  4. Belcastro, L.; Marozzo, F.; Talia, D.; Trunfio, P.; Branda, F.; Palpanas, T.; Imran, M. Using social media for sub-event detection during disasters. J. Big Data 2021, 8, 1–22. [Google Scholar] [CrossRef]
  5. Annis, A.; Nardi, F. Integrating VGI and 2D hydraulic models into a data assimilation framework for real time flood forecasting and mapping. Geo-Spat. Inf. Sci. 2019, 22, 223–236. [Google Scholar] [CrossRef]
  6. Peary, B.D.; Shaw, R.; Takeuchi, Y. Utilization of social media in the east Japan earthquake and tsunami and its effectiveness. J. Nat. Disaster Sci. 2012, 34, 3–18. [Google Scholar] [CrossRef]
  7. Styve, L.; Navarra, C.; Petersen, J.M.; Neset, T.S.; Vrotsou, K. A visual analytics pipeline for the identification and exploration of extreme weather events from social media data. Climate 2022, 10, 174. [Google Scholar] [CrossRef]
  8. Caragea, C.; Silvescu, A.; Tapia, A.H. Identifying informative messages in disaster events using convolutional neural networks. In Proceedings of the International Conference on Information Systems for Crisis Response and Management, Rio de Janeiro, Brazil, 22–25 May 2016; pp. 137–147. [Google Scholar]
  9. Neppalli, V.K.; Caragea, C.; Caragea, D. Deep neural networks versus naive bayes classifiers for identifying informative tweets during disasters. In Proceedings of the 15th Annual Conference for Information Systems for Crisis Response and Management (ISCRAM), Rochester, NY, USA, 20–23 May 2018. [Google Scholar]
  10. Alam, F.; Sajjad, H.; Imran, M.; Ofli, F. CrisisBench: Benchmarking crisis-related social media datasets for humanitarian information processing. In Proceedings of the International AAAI Conference on Web and Social Media, Atlanta, GA, USA, 8–10 June 2021; Volume 15, pp. 923–932. [Google Scholar]
  11. Jain, P.; Ross, R.; Schoen-Phelan, B. Estimating distributed representation performance in disaster-related social media classification. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Vancouver, BC, Canada, 27–30 August 2019; pp. 723–727. [Google Scholar]
  12. Krishna, D.S.; Gorla, S.; PVGD, P.R. Disaster tweet classification: A majority voting approach using machine learning algorithms. Intell. Decis. Technol. 2023, 17, 343–355. [Google Scholar] [CrossRef]
  13. Ning, X.; Yao, L.; Wang, X.; Benatallah, B. Calling for response: Automatically distinguishing situation-aware tweets during crises. In Proceedings of the Advanced Data Mining and Applications: 13th International Conference, ADMA 2017, Singapore, 5–6 November 2017; pp. 195–208. [Google Scholar]
  14. Madichetty, S.; Sridevi, M. A novel method for identifying the damage assessment tweets during disaster. Future Gener. Comput. Syst. 2021, 116, 440–454. [Google Scholar] [CrossRef]
  15. Nazer, T.H.; Morstatter, F.; Dani, H.; Liu, H. Finding requests in social media for disaster relief. In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, CA, USA, 8–21 August 2016; pp. 1410–1413. [Google Scholar]
  16. Toraman, C.; Kucukkaya, I.E.; Ozcelik, O.; Sahin, U. Tweets under the rubble: Detection of messages calling for help in earthquake disaster. arXiv 2023, arXiv:2302.13403. [Google Scholar]
  17. Devaraj, A.; Murthy, D.; Dontula, A. Machine-learning methods for identifying social media-based requests for urgent help during hurricanes. Int. J. Disaster Risk Reduct. 2020, 51, 101757. [Google Scholar] [CrossRef]
  18. Murzintcev, N.; Cheng, C. Disaster hashtags in social media. Isprs Int. J. Geo-Inf. 2017, 6, 204. [Google Scholar] [CrossRef]
  19. Alam, F.; Qazi, U.; Imran, M.; Ofli, F. Humaid: Human-annotated disaster incidents data from twitter with deep learning benchmarks. In Proceedings of the International AAAI Conference on Web and Social Media, Virtually, 7–10 June 2021; Volume 15, pp. 933–942. [Google Scholar]
  20. Burel, G.; Alani, H. Crisis event extraction service (crees)-automatic detection and classification of crisis-related content on social media. In Proceedings of the 15th International Conference on Information Systems for Crisis Response and Management, Rochester, NY, USA, 20–23 May 2018. [Google Scholar]
  21. Ramezan, C.A.; Warner, T.A.; Maxwell, A.E.; Price, B.S. Effects of training set size on supervised machine-learning land-cover classification of large-area high-resolution remotely sensed data. Remote Sens. 2021, 13, 368. [Google Scholar] [CrossRef]
  22. Medar, R.; Rajpurohit, V.S.; Rashmi, B. Impact of training and testing data splits on accuracy of time series forecasting in machine learning. In Proceedings of the 2017 International Conference on Computing, Communication, Control and Automation (ICCUBEA), Pune, India, 17–18 August 2017; pp. 1–6. [Google Scholar]
  23. Barbedo, J.G.A. Impact of dataset size and variety on the effectiveness of deep learning and transfer learning for plant disease classification. Comput. Electron. Agric. 2018, 153, 46–53. [Google Scholar] [CrossRef]
  24. Vabalas, A.; Gowen, E.; Poliakoff, E.; Casson, A.J. Machine learning algorithm validation with a limited sample size. PLoS ONE 2019, 14, e0224365. [Google Scholar] [CrossRef]
  25. Prusa, J.; Khoshgoftaar, T.M.; Seliya, N. The effect of dataset size on training tweet sentiment classifiers. In Proceedings of the 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 9–11 December 2015; pp. 96–102. [Google Scholar]
  26. Laurer, M.; Van Atteveldt, W.; Casas, A.; Welbers, K. Less annotating, more classifying: Addressing the data scarcity issue of supervised machine learning with deep transfer learning and BERT-NLI. Political Anal. 2024, 32, 84–100. [Google Scholar] [CrossRef]
  27. Abdelwahab, O.; Bahgat, M.; Lowrance, C.J.; Elmaghraby, A. Effect of training set size on SVM and Naive Bayes for Twitter sentiment analysis. In Proceedings of the 2015 IEEE international symposium on signal processing and information technology (ISSPIT), Abu Dhabi, United Arab Emirates, 7–10 December 2015; pp. 46–51. [Google Scholar]
  28. Tekumalla, R.; Banda, J.M. Using weak supervision to generate training datasets from social media data: A proof of concept to identify drug mentions. Neural Comput. Appl. 2023, 35, 18161–18169. [Google Scholar] [CrossRef]
  29. Nguyen, T.H.; Nguyen, H.H.; Ahmadi, Z.; Hoang, T.A.; Doan, T.N. On the Impact of Dataset Size: A Twitter Classification Case Study. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Melbourne, Australia, 14–17 December 2021; pp. 210–217. [Google Scholar]
  30. Olteanu, A.; Castillo, C.; Diaz, F.; Vieweg, S. Crisislex: A lexicon for collecting and filtering microblogged communications in crises. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media, Ann Arbor, MI, USA, 1–4 June 2014; Volume 8, pp. 376–385. [Google Scholar]
  31. Imran, M.; Mitra, P.; Castillo, C. Twitter as a lifeline: Human-annotated twitter corpora for NLP of crisis-related messages. arXiv 2016, arXiv:1605.05894. [Google Scholar]
  32. Imran, M.; Elbassuoni, S.; Castillo, C.; Diaz, F.; Meier, P. Practical extraction of disaster-relevant information from social media. In Proceedings of the 22nd International World Wide Web Conference, Rio de Janeiro, Brazil, 13–17 May 2013; pp. 1021–1024. [Google Scholar]
  33. Imran, M.; Elbassuoni, S.; Castillo, C.; Diaz, F.; Meier, P. Extracting information nuggets from disaster-Related messages in social media. Iscram 2013, 201, 791–801. [Google Scholar]
  34. Alam, F.; Ofli, F.; Imran, M. Crisismmd: Multimodal twitter datasets from natural disasters. In Proceedings of the International AAAI Conference on Web and Social Media, Palo Alto, CA, USA, 25–28 June 2018; Volume 12. [Google Scholar]
  35. Imran, M.; Castillo, C.; Lucas, J.; Meier, P.; Vieweg, S. AIDR: Artificial intelligence for disaster response. In Proceedings of the 23rd International Conference on World Wide Web, Seoul, Republic of Korea, 7–11 April 2014; pp. 159–162. [Google Scholar]
  36. Schütze, H.; Manning, C.D.; Raghavan, P. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008; Volume 39. [Google Scholar]
  37. Bojer, C.S.; Meldgaard, J.P. Kaggle forecasting competitions: An overlooked learning opportunity. Int. J. Forecast. 2021, 37, 587–603. [Google Scholar] [CrossRef]
  38. Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1999. [Google Scholar]
  39. Wright, R.E. Logistic Regression. In Reading and Understanding Multivariate Statistics; Grimm, L.G., Yarnold, P.R., Eds.; American Psychological Association: Washington, DC, USA, 1995; pp. 217–244. [Google Scholar]
  40. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T.; et al. Xgboost: Extreme Gradient Boosting. R Package Version 0.4-2. 2015, Volume 1, pp. 1–4. Available online: https://cran.ms.unimelb.edu.au/web/packages/xgboost/vignettes/xgboost.pdf (accessed on 2 June 2024).
  41. Nguyen, D.; Al Mannai, K.A.; Joty, S.; Sajjad, H.; Imran, M.; Mitra, P. Robust classification of crisis-related data on social networks using convolutional neural networks. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM-17), Montreal, QC, Canada, 15–18 May 2017; Volume 11, pp. 632–635. [Google Scholar]
  42. Effrosynidis, D.; Symeonidis, S.; Arampatzis, A. A comparison of pre-processing techniques for twitter sentiment analysis. In Proceedings of the Research and Advanced Technology for Digital Libraries: 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Thessaloniki, Greece, 18–21 September 2017; pp. 394–406. [Google Scholar]
  43. Symeonidis, S.; Effrosynidis, D.; Arampatzis, A. A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis. Expert Syst. Appl. 2018, 110, 298–310. [Google Scholar] [CrossRef]
  44. Frazier, P.I. Bayesian optimization. In Recent Advances in Optimization and Modeling of Contemporary Problems; Informs: Catonsville, MD, USA, 2018; pp. 255–278. [Google Scholar]
  45. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  46. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  47. Keras. 2015. Available online: https://keras.io (accessed on 2 June 2024).
Figure 1. Comparison of Bernoulli Naive Bayes, Light Gradient Boosting, Linear Support Vector Machine, Logistic Regression, Extreme Gradient Boosting, and Convolutional Neural Network as the training set size increases with steps of 5%, followed by validation on the development set using the F-measure.
Figure 2. Comparison of Bernoulli Naive Bayes, Light Gradient Boosting, Linear Support Vector Machine, Logistic Regression, Extreme Gradient Boosting, and Convolutional Neural Network as the training set size increases with steps of 5%, followed by validation on the test set using the F-measure.
Figure 3. Pearson correlation between predictions on the validation set when training with 20% of the training set.
Table 1. Dataset tweet distribution across the different sources.

Label | CrisisLex | CrisisNLP | SWDM13 | ISCRAM13 | DRD | DSM | CrisisMMD | AIDR | Total
Informative | 42,140 | 23,694 | 716 | 2443 | 14,849 | 3461 | 11,488 | 2968 | 101,759
Not informative | 27,559 | 16,707 | 141 | 78 | 6047 | 5374 | 4532 | 3901 | 64,339
Total | 69,699 | 40,401 | 857 | 2521 | 20,896 | 8835 | 16,020 | 6869 | 166,098
Table 2. F-measure scores of the Machine Learning and Deep Learning methods obtained by fitting while increasing the training set size and evaluating on the development set.

Training Size | CNN | Bernoulli NB | LGB | Linear SVC | Log. Regr. | XGB
0.01 (1094) | 0.767006 | 0.827914 | 0.795521 | 0.826859 | 0.819480 | 0.805749
0.05 (5472) | 0.817594 | 0.847006 | 0.842708 | 0.849749 | 0.847635 | 0.841576
0.10 (10,944) | 0.832310 | 0.851917 | 0.850162 | 0.860265 | 0.856589 | 0.850461
0.15 (16,416) | 0.844751 | 0.851077 | 0.861370 | 0.863743 | 0.860611 | 0.854576
0.20 (21,888) | 0.842131 | 0.853103 | 0.865964 | 0.869158 | 0.866019 | 0.856966
0.25 (27,360) | 0.845838 | 0.853619 | 0.865838 | 0.870425 | 0.868642 | 0.860029
0.30 (32,832) | 0.845977 | 0.854866 | 0.871616 | 0.872945 | 0.869069 | 0.863636
0.35 (38,304) | 0.849324 | 0.854992 | 0.871856 | 0.874139 | 0.869954 | 0.867107
0.40 (43,776) | 0.855468 | 0.854340 | 0.867247 | 0.875189 | 0.872209 | 0.868828
0.45 (49,248) | 0.846153 | 0.854289 | 0.869479 | 0.876488 | 0.873499 | 0.867771
0.50 (54,720) | 0.853218 | 0.854913 | 0.873175 | 0.876596 | 0.873894 | 0.870020
0.55 (60,192) | 0.857627 | 0.854928 | 0.875890 | 0.877865 | 0.874766 | 0.871631
0.60 (65,664) | 0.857755 | 0.854836 | 0.870250 | 0.878697 | 0.874896 | 0.871449
0.65 (71,136) | 0.858243 | 0.855855 | 0.877637 | 0.878350 | 0.876792 | 0.874482
0.70 (76,608) | 0.853668 | 0.856543 | 0.872727 | 0.878383 | 0.876389 | 0.873957
0.75 (82,080) | 0.863194 | 0.857458 | 0.880775 | 0.878759 | 0.875894 | 0.875208
0.80 (87,552) | 0.859496 | 0.857563 | 0.878524 | 0.878986 | 0.877193 | 0.874284
0.85 (93,024) | 0.857330 | 0.857608 | 0.875800 | 0.879284 | 0.877253 | 0.876186
0.90 (98,849) | 0.862121 | 0.857308 | 0.876491 | 0.880046 | 0.877699 | 0.877721
0.95 (103,968) | 0.859183 | 0.857128 | 0.880402 | 0.880067 | 0.878064 | 0.876060
1.00 (109,441) | 0.863807 | 0.857458 | 0.881492 | 0.880683 | 0.878582 | 0.877902
Table 3. F-measure scores of the Machine Learning and Deep Learning models obtained by fitting while increasing the training set size and evaluating on the test set.

Training Size | CNN | Bernoulli NB | LGB | Linear SVC | Log. Regr. | XGB
0.01 (1094) | 0.760970 | 0.828699 | 0.793511 | 0.824018 | 0.820579 | 0.801083
0.05 (5472) | 0.825369 | 0.851177 | 0.845027 | 0.856757 | 0.853133 | 0.845988
0.10 (10,944) | 0.840800 | 0.854125 | 0.853006 | 0.865346 | 0.861614 | 0.853893
0.15 (16,416) | 0.850352 | 0.857158 | 0.864257 | 0.867850 | 0.865944 | 0.859451
0.20 (21,888) | 0.847575 | 0.857831 | 0.871351 | 0.872391 | 0.870868 | 0.861848
0.25 (27,360) | 0.851999 | 0.859652 | 0.870432 | 0.874624 | 0.872885 | 0.865389
0.30 (32,832) | 0.854714 | 0.860439 | 0.876219 | 0.876940 | 0.875810 | 0.869051
0.35 (38,304) | 0.853916 | 0.859715 | 0.874784 | 0.877344 | 0.877032 | 0.871913
0.40 (43,776) | 0.860278 | 0.859205 | 0.873649 | 0.876895 | 0.877959 | 0.873085
0.45 (49,248) | 0.851993 | 0.860048 | 0.875293 | 0.880186 | 0.878756 | 0.870071
0.50 (54,720) | 0.857606 | 0.860140 | 0.880370 | 0.880724 | 0.879013 | 0.874163
0.55 (60,192) | 0.861118 | 0.860143 | 0.880256 | 0.882413 | 0.880510 | 0.875076
0.60 (65,664) | 0.858978 | 0.860485 | 0.878439 | 0.882081 | 0.880602 | 0.873560
0.65 (71,136) | 0.861469 | 0.860549 | 0.882362 | 0.883176 | 0.881592 | 0.877876
0.70 (76,608) | 0.858348 | 0.861224 | 0.880650 | 0.882728 | 0.882752 | 0.877911
0.75 (82,080) | 0.868000 | 0.861219 | 0.884633 | 0.883806 | 0.883376 | 0.877862
0.80 (87,552) | 0.865731 | 0.861811 | 0.882780 | 0.884273 | 0.883718 | 0.878570
0.85 (93,024) | 0.859579 | 0.861736 | 0.881777 | 0.884310 | 0.884066 | 0.880593
0.90 (98,849) | 0.866165 | 0.861556 | 0.882326 | 0.884636 | 0.884272 | 0.879329
0.95 (103,968) | 0.863560 | 0.861340 | 0.885998 | 0.885465 | 0.884855 | 0.879588
1.00 (109,441) | 0.866116 | 0.862007 | 0.884912 | 0.885308 | 0.885435 | 0.879179
Table 4. F-measures on the validation and test sets when training with 20% of data across individual algorithms and ensembles.

Set | CNN | Bernoulli NB | LGB | Linear SVC | Log. Regr. | XGB | Ensemble | Stacking
Validation | 0.8421 | 0.8531 | 0.8659 | 0.8691 | 0.8660 | 0.8569 | 0.8729 | -
Test | 0.8475 | 0.8578 | 0.8713 | 0.8723 | 0.8708 | 0.8618 | 0.8757 | 0.8761
