4.1. Dataset
The first step begins with the collection of visuals from social media sites, such as Facebook, Google Images, and Flickr. While crawling imagery from the aforementioned websites, copyright was taken into account, and only visuals authorized for distribution were selected. In addition, visuals were retrieved using tags such as storms, seismic events, hurricanes, and tidal waves. These tags were made specific to places, such as “earthquake in Nepal,” “monsoons in Brazil,” and “natural disasters in Japan.”
The identification of labels for emotions and related behaviors is among the most crucial concerns of this research. The majority of current works focus on coarse emotions, such as “negative,” “positive,” and “neutral,” without considering human-related behaviors [39]. Nevertheless, we want to focus on emotions that are highly pertinent to disaster-related details, for instance, labels such as “grief,” “excitement,” and “rage” in catastrophic conditions. Moreover, based on recent research in human psychology [40], we argue that persons surrounded by a catastrophe are more likely to express two important sets of human emotions. This investigation revealed various human emotions [29]. The first set comprises common human expressions, such as “negative,” “positive,” and “neutral.” The second set contains the terms “pleased,” “anxious,” “normal,” and “worried.”
The final collection is an expanded set of human expressions, which contains additional specific emotions, such as “anguish,” “revulsion,” “delight,” “wonder,” “excitement,” “sadness,” “pain,” “weeping,” “scared,” “antsy,” and “relief.” These emotions are associated with the activity terms “seated,” “standing,” “running,” “lying,” and “jogging.”
Table 2 provides a thorough classification of potential human emotions and related human actions under catastrophic situations, and Figure 4 illustrates a selection of images extracted from the dataset. The dataset is made publicly available for academic research [41].
By combining individual viewpoints and thoughts on disasters with accompanying external perceptions, the crowdsourcing initiative aims to generate objective labels for the envisaged human sentiment analysis. During the crowdsourcing phase, we provided preselected images to the study participants for annotation through Fiverr [42]. Participants labeled 500 images throughout the procedure. We disseminated a questionnaire survey, each including a disaster-related image, to participants so that they could label the human emotion and associated behavior. In the first question, participants were asked to evaluate images on a scale from 1 to 10, where 10 represents a ‘positive’ emotion, 5 represents a ‘neutral’ emotion, and 1 represents a ‘negative’ emotion. The purpose of this question was to ascertain the respondents’ general impression of the image. The follow-up question is somewhat more precise, offering labels such as ‘sad’, ‘happy’, ‘angry’, and ‘calm’, and retrieves the exact emotions that an image conveys to respondents. In the third question, respondents were asked to rate the imagery on a scale from 1 to 7 and to describe the emotion expressed by each image.
In addition, the participants were requested to describe their sentiments about the presented images and to explicitly tag images with a specific sentiment tag if that tag was not included in the list of available tags. The fourth question seeks to identify the image characteristics that elicit human emotions and behavior given the scenario or underlying context. In the fifth question, respondents were asked to provide their opinion about the depicted human activity, such as ‘seated,’ ‘standing,’ ‘running,’ ‘walking,’ and ‘jogging.’
Table 3 shows a concise distribution of images in the dataset.
At least six people were chosen to examine the images to confirm the uniformity of the labels. During the analysis, 10,000 unique responses from 2300 unique individuals were gathered. The participants represented a wide range of ages, genders, and 25 different regions. The response latency per person was 200 s, which allowed us to weed out sloppy and incorrect responses. Cohen’s kappa (κ) was used to assess inter-rater reliability and was calculated as 0.65. Before the final analysis, two trial tests were conducted to refine the test, fix discrepancies, and increase consistency and clarity. The dataset was split into training and test sets, with 5500 training instances and 1695 test instances, following the standard protocol of dividing the dataset into 70% for training and 30% for testing.
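For illustration, the inter-rater agreement and the 70/30 split described above could be computed along the following lines. This is a minimal sketch with toy data: the rater arrays, class list, and disagreement rate are hypothetical stand-ins and do not reproduce the paper's actual annotation pipeline.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-rater labels for the same 200 images (toy data).
classes = ["sad", "happy", "angry", "calm"]
rater_a = rng.choice(classes, size=200)
# A second rater agrees most of the time and disagrees on a random subset.
rater_b = rater_a.copy()
flip = rng.random(200) < 0.2
rater_b[flip] = rng.choice(classes, size=flip.sum())

# Cohen's kappa: agreement between two raters corrected for chance
# (the paper reports 0.65 over all raters).
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")

# 70/30 train/test split, stratified by class so that every category is
# represented in both partitions (the paper reports 5500/1695 instances).
X = np.arange(len(rater_a)).reshape(-1, 1)   # placeholder image IDs/features
X_train, X_test, y_train, y_test = train_test_split(
    X, rater_a, test_size=0.30, stratify=rater_a, random_state=42)
```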
4.2. Experimental Setup
Multiple experiments were performed to demonstrate the efficacy of the proposed AL-based FL framework. We tested and evaluated the AL techniques in the FL environment against two baselines: Baseline 1, a carefully annotated training set; and Baseline 2, a sparsely annotated set containing impurities. Because the objective of this study is to assess the advantages of AL in an FL environment, comparing against these two baselines rather than the state of the art (SoA) in both domains is the most practical choice. The first baseline depicts the best-case situation, in which manually labeled training data exist, whereas the second depicts the worst-case situation, in which a model is trained on a dataset containing a significant amount of inconsequential data. To this end, the dataset for the human sentiment baseline was synthesized by including up to 35–40% unrelated images in the collection of unlabeled data. The human physical activities in the natural catastrophe analysis application, in contrast, provide a more realistic situation in which the second baseline is trained on a collection of social media images with the accompanying tags/queries, without human inspection and elimination of unnecessary image data. For the hand-labeled baseline, by contrast, each picture was carefully evaluated and labeled using crowdsourcing.
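As a rough illustration, the impure training pool for Baseline 2 could be synthesized as sketched below, assuming that relevant and unrelated image paths are kept in separate lists; the function name and mixing logic are our own assumptions, not the paper's exact procedure.

```python
import random

def build_noisy_pool(relevant, unrelated, noise_fraction=0.40, seed=0):
    """Mix unrelated images into the unlabeled pool for Baseline 2.

    `relevant` and `unrelated` are lists of image paths; `noise_fraction`
    is the target share of unrelated images in the final pool (the paper
    injects up to 35-40% unrelated images).
    """
    rng = random.Random(seed)
    # Number of noise images needed so that they make up `noise_fraction`
    # of the combined pool: n_noise / (n_relevant + n_noise) = noise_fraction.
    n_noise = int(len(relevant) * noise_fraction / (1.0 - noise_fraction))
    pool = relevant + rng.sample(unrelated, min(n_noise, len(unrelated)))
    rng.shuffle(pool)
    return pool
```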
In addition, one of our goals was to demonstrate how the effectiveness of the AL approaches varies depending on whether they are used in an FL or a centralized environment.
Furthermore, we investigated how clients with small sample sizes impact the performance of the global model. Throughout the investigations, the experimental setup for AL and FL was kept unchanged. In the following subsections, we describe in detail the experimental procedures performed using AL and FL.
We have included a summary of the parameter values utilized throughout the experimental procedure in Table 4.
The seeding images were determined based on the number of manually annotated training images available for an application. There are many variables to consider when training a learner, such as the amount and quality of the data in the seed. The fundamental premise of AL is that there is a trade-off between the effort needed to label the data and its effectiveness, which is reflected in the initial training dataset. The ability to obtain better outcomes with fewer seed instances is critical for the success of AL techniques.
We started the experiments with a seed of 180 instances, comprising 18 images from each training category of human sentiments in a natural disaster. For human physical activities in the disaster dataset, we used a seed of 140 instances containing 18 images per category. The seed was then enlarged with pertinent instances selected by the query strategy in every iteration. Manual annotation of the test and seed sets is essential for reliable evaluation.
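For clarity, the pool-based selection loop described above can be sketched as follows, here with a least-confidence query strategy and a scikit-learn classifier standing in for the actual learner and feature pipeline; the data shapes, class count, and number of iterations are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confidence_query(model, X_pool, n_queries=10):
    """Pick the pool instances the model is least confident about."""
    proba = model.predict_proba(X_pool)
    confidence = proba.max(axis=1)             # probability of the top class
    return np.argsort(confidence)[:n_queries]  # lowest confidence first

# Toy data standing in for image features: seed set, unlabeled pool, oracle labels.
rng = np.random.default_rng(0)
X_seed, y_seed = rng.normal(size=(180, 32)), rng.integers(0, 10, 180)
X_pool, y_pool = rng.normal(size=(2000, 32)), rng.integers(0, 10, 2000)

model = LogisticRegression(max_iter=500)
X_train, y_train = X_seed, y_seed
for _ in range(5):                             # a few AL iterations
    model.fit(X_train, y_train)
    picked = least_confidence_query(model, X_pool)
    # "Oracle" annotation of the queried instances (crowdsourcing in the paper).
    X_train = np.vstack([X_train, X_pool[picked]])
    y_train = np.concatenate([y_train, y_pool[picked]])
    X_pool = np.delete(X_pool, picked, axis=0)
    y_pool = np.delete(y_pool, picked)
```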
In a manner similar to the active learning component, a predetermined experimental setup was used throughout the federated learning portion of the experiment. The dataset was split into six parts: five parts form the training set and are distributed to five different connected nodes, such that each node receives an adequate sample from each category, while the test images constitute the sixth part.
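A minimal sketch of this partitioning, assuming the labeled data are available as NumPy arrays and that stratified shards approximate the "adequate sample from each category" requirement (the helper name and exact splitting scheme are assumptions):

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

def split_federated(X, y, n_clients=5, test_fraction=1/6, seed=42):
    """Hold out a test segment, then shard the remainder over `n_clients`
    nodes so that every node receives samples from each category."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_fraction, stratify=y, random_state=seed)
    skf = StratifiedKFold(n_splits=n_clients, shuffle=True, random_state=seed)
    shards = [(X_tr[idx], y_tr[idx]) for _, idx in skf.split(X_tr, y_tr)]
    return shards, (X_te, y_te)   # five client shards plus the sixth (test) part
```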
The learning model is an LSTM made up of four layers: a first LSTM layer with 100 neurons, a second LSTM layer with 20 neurons, a dropout layer, and a classification layer. The overfitting problem was addressed by introducing dropout and regularization techniques; the dropout layer randomly eliminates specific features from the model by setting them to zero. Within the FL framework, the number of communication rounds and the number of clients handled in each round, two key characteristics of FL frameworks, were set to 50 and 5, respectively.
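A hedged Keras sketch of the described architecture is given below; the input shape, number of classes, dropout rate, and regularization strength are assumptions, since the paper only specifies the layer sizes, 50 communication rounds, and 5 clients per round.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

NUM_CLASSES = 14                 # assumed number of emotion/activity labels
TIMESTEPS, FEATURES = 10, 128    # assumed shape of the per-image feature sequence

model = tf.keras.Sequential([
    tf.keras.Input(shape=(TIMESTEPS, FEATURES)),
    layers.LSTM(100, return_sequences=True,
                kernel_regularizer=regularizers.l2(1e-4)),   # first LSTM layer
    layers.LSTM(20),                                          # second LSTM layer
    layers.Dropout(0.5),    # randomly zeroes features to mitigate overfitting
    layers.Dense(NUM_CLASSES, activation="softmax"),          # classification layer
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# FL configuration reported in the paper: 50 communication rounds, 5 clients per round.
NUM_ROUNDS, CLIENTS_PER_ROUND = 50, 5
```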
4.3. Experimental Results
The experimental results are shown in Figure 5, which depicts the accuracy obtained using the aforementioned AL techniques in an FL context. In the first 50 iterations, there is no significant difference among the six AL techniques, but afterwards there is substantial variation in accuracy across these techniques. The performance of each approach stagnated after the 50th iteration. In particular, query-by-committee (QBC) approaches tend to achieve relatively higher accuracy than uncertainty-based approaches. The AL techniques are utilized to achieve better accuracy despite smaller sample sizes; accordingly, the QBC technique outperformed the uncertainty methods by achieving relatively high accuracy despite the reduced number of training instances. In the majority of instances, with the exception of query-by-committee using the consensus entropy and entropy-based disagreement strategies, the performance of FL is somewhat inferior to that of centralized learning.
The assessment of the above techniques is based on standard ML metrics, such as the F1-score, precision, recall, and test accuracy. These metrics allow the above methods to be assessed accurately, keeping in mind the imbalances in the datasets in general and in the implementation of AL techniques. As shown in Table 5 and Table 6, the values of accuracy, recall, and F1-score follow a largely comparable pattern. Moreover, there is variation in these metrics, which motivated us to report the standard deviation among them to better understand the impact of AL in FL and centralized learning contexts.
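For reference, the reported metrics and their standard deviation across runs could be summarized as follows, assuming per-run ground-truth and prediction pairs are available; the helper name and the choice of macro averaging (a common way to account for class imbalance) are our assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize_runs(runs):
    """`runs` is a list of (y_true, y_pred) pairs, one per repetition or technique.
    Returns the mean and standard deviation of accuracy, precision, recall, and F1."""
    rows = []
    for y_true, y_pred in runs:
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="macro", zero_division=0)
        rows.append([accuracy_score(y_true, y_pred), p, r, f1])
    rows = np.asarray(rows)
    return rows.mean(axis=0), rows.std(axis=0)
```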
4.6. Comparison against the Benchmarks
To demonstrate the efficacy of the AL techniques, Table 6 and Table 7 show comparative results of the AL approaches against the two baselines in the FL and centralized learning contexts. For the human sentiments in disaster situations dataset, in both federated and centralized learning, the models trained on samples labeled using the AL techniques and on the manually labeled training sets yielded equivalent results. Nevertheless, the test accuracy of the models trained on AL-selected training sets remained higher than that of the sparsely annotated baseline, which demonstrates the efficacy of the AL techniques.
To understand the variations in the performance of the models in the centralized and FL settings, we compare the standard deviation values. The low standard deviations of the performance parameters of the individual AL methods in the centralized and FL environments indicate that FL has the potential to attain comparable accuracy with enhanced privacy.
There is a significantly greater difference between the performance of Baselines 1 and 2 and that of the suggested approach. Baseline 2 is the primary contributor to the performance variance, because its results are lower than those of Baseline 1 and the AL techniques.
Moreover, we compared our results with recent work conducted in AL-based FL environments, which is presented in Table 9. Aussel et al. [43] conducted their experiments on the MOA airlines dataset [44] and proposed a communication-efficient distributed learning approach based on active federated learning. They achieved a test accuracy of 61% with 10 participating clients and 30 communication rounds. Ahmed et al. [33] presented an approach similar to ours, in which a pre-trained ResNet [45] was used for feature extraction from two datasets. They analyzed five different active learning techniques in the federated learning environment. In their investigations, the performance of the techniques under different sampling and disagreement approaches varied significantly beyond the first 500 iterations. In a federated learning environment, they attained an accuracy of 71% on a natural disaster image dataset and 82% on a waste classification dataset, in both cases using consensus entropy AL. Ahn et al. [46] assessed annotation strategies when the AL algorithm is maximum classifier discrepancy for AL (MCDAL) [47]. MCDAL learns auxiliary classifiers and then maximizes the discrepancy among their predictions, substituting auxiliary classifier predictions for conventional uncertainty measures. This technique outperforms state-of-the-art AL algorithms on CIFAR-10 [48]. They create an FL framework in the annotation stage and contrast several active learning methodologies in FL, including traditional FL with client-level, separate active learning; federated active learning, in which clients work together, using a distributed optimization method, to choose the instances that are expected to be most valuable to FL; and random sampling. They achieved 51% accuracy, 49% recall, and a 50% F1-score on the CIFAR-10 dataset. In our proposed approach, in which we analyzed six different AL techniques in the federated learning environment, we achieved results comparable to those of the centralized learning approach. These AL-based methods include vote entropy, maximum disagreement, margin sampling, consensus entropy, entropy sampling, and least confidence.
Table 5 and Table 6 present the results obtained for all six AL techniques during the experiments in both the federated and centralized learning contexts. We believe that accuracy comparable to that of centralized learning is achieved in the federated learning environment.