Article

Distributed Fire Classification and Localization Model Based on Federated Learning with Image Clustering

by Jiwon Lee 1,2, Jeongheun Kang 1,3, Chun-Su Park 4,* and Jongpil Jeong 1,*
1 Department of Smart Factory Convergence, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon 16419, Republic of Korea
2 AI Research Team, AIPro, 25-2 Sungkyunkwan-ro, Jongno-gu, Seoul 03063, Republic of Korea
3 AI Research Lab, ATEC Mobility, 289, Pangyo-ro, Bundang-gu, Seongnam 13488, Republic of Korea
4 Department of Computer Education, Sungkyunkwan University, 25-2 Sungkyunkwan-ro, Jongno-gu, Seoul 03063, Republic of Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(20), 9162; https://doi.org/10.3390/app14209162
Submission received: 10 September 2024 / Revised: 30 September 2024 / Accepted: 6 October 2024 / Published: 10 October 2024

Abstract:
In this study, we propose a fire classification system using image clustering based on a federated learning (FL) structure. This system enables fire detection in various industries, including manufacturing. The accurate classification of fire, smoke, and normal conditions is an important element of fire prevention and response systems in industrial sites. The server in the proposed system extracts data features using a pretrained vision transformer model and clusters the data using the bisecting K-means algorithm to obtain weights. The clients utilize these weights to cluster local data with the K-means algorithm and measure the difference in data distribution using the Kullback–Leibler divergence. Experimental results show that the proposed model achieves nearly 99% accuracy on the server, and the clustering accuracy on the clients remains high. In addition, the normalized mutual information value remains above 0.6 and the silhouette score reaches 0.9 as the rounds progress, indicating improved clustering quality. This study shows that the accuracy of fire classification is enhanced by using FL and clustering techniques and has a high potential for real-time detection.

1. Introduction

In human society, fire is one of the most damaging disasters; it can originate from natural causes, human activities, or conditions exacerbated by climate change. Fires cause loss of life and property and have posed a serious problem for humanity for centuries. The National Interagency Fire Center provides statistics on wildfires (i.e., the number of fires and the area burned, in acres) in the United States of America for the years 2014–2023. The data show that in 2020, there were 58,950 fires and 10,122,336 acres burned, and in 2017, there were 71,499 fires and 10,026,086 acres burned [1,2].
Traditional fire alarm systems use multiple sensors for detecting heat, temperature, smoke, etc. [3,4,5]. These systems primarily determine the presence of fire based on the signals detected by the sensors, which play an important role in real-time fire detection. However, these sensor-based fire alarm systems can be prone to frequent false alarms due to water vapor, dust, etc. [6]. Such false alarms incur social costs as well as the cost for unnecessary usage of firefighting resources and can make it difficult to respond quickly to actual fires. In particular, sensor-based systems have limitations in detecting fires in the early stages, as they can only detect fires after they have already started.
To overcome these limitations, recent research has focused on image-based fire detection systems that utilize computer vision techniques.
Computer vision technology can analyze images captured by cameras to visually recognize smoke, flames, or heat; such recognition allows rapid determination of the occurrence of a fire [7]. While this technology has the potential to enable more precise detection and earlier warnings than traditional sensor-based systems, image-based systems also have limitations, such as the need for a large amount of training data, a lack of infrastructure to handle the data, and data privacy concerns. In particular, large-scale image-based learning requires large amounts of data, which can impose a heavy burden on data processing and storage infrastructure. Further, since these data are mainly collected from closed-circuit television (CCTV) or other imaging devices, there is a possibility of violating individual privacy. To solve these problems, federated learning (FL) methods that can maintain high performance while protecting data privacy in distributed environments have attracted attention [8].
In this paper, we propose a new approach for fire classification, namely, an FL structure using image clustering techniques. In this structure, a vision transformer (ViT) is used to extract important features from images and group the data through clustering to achieve more accurate fire classification. Unlike traditional convolutional neural network (CNN)-based models, the ViT model is a powerful model that can better learn the global features of images by dividing them into image patches and learning the relationships between patches using the transformer structure. ViT models achieve high performance, especially on large-scale data. We utilized these characteristics to build models that can accurately classify images into three classes: fire, smoke, and normal. In this study, we applied the ViT model to an FL environment, which is designed to train the data of each client individually without sending the data to a central server. The core of the system works by allowing each client to train independently on its own dataset and then send only the weights of the resulting model to the server. This allows for collaborative model learning to improve the performance of the global model while protecting data privacy. Initial training is performed on the server using a pretrained model, and the server and client train on that model to extract features from the image effectively. The client passes the weights to the server, and the server updates the global model based on the data it receives from the client and passes the model back to the client, and this process is repeated. At this point, each client does not send data to the central server but only exchanges weights and cluster information of the trained model. This ensures data privacy and the ability to analyze images globally with the ViT model to classify fire, smoke, and normal conditions with high accuracy.
The proposed model is based on a pretrained ViT model on the server to extract features from the data. The ViT model is a powerful deep learning model that can learn the global features of images. In this study, this model is used to extract important features for classifying fire, smoke, and normal images. Based on these features, the server clusters the data using the bisecting K-means algorithm to obtain global cluster information (GCI). The GCI contains information about the center point and data distribution of each cluster, and the server delivers this information to each client to help them learn from local data. Each client clusters its local data based on the GCI received from the server. The client applies the K-means algorithm to cluster its local data, and in doing so, it evaluates how well the client data match the model trained by the server by referring to the GCI provided by the server. This evaluation allows the client to perform additional training on its local data and fine-tune the initial weights received from the server. Each client then sends its local training results back to the server, which synthesizes the client information to continuously improve the performance of the global model. Based on the weights and clustering information received from each client, the server ensembles the weights to improve performance. During this process, the server updates the GCI, taking into account the differences and imbalances in the characteristics of the data learned by each client, and sends new weights to the clients. The process of learning and exchanging weights between the server and clients is repeated, and the model performance improves gradually with each iteration. In each round, clients send the information they learned about their local data to the server, and the server uses this information to update the global model. Thus, the server builds a more sophisticated model and adjusts it to reflect the clients' data uniformly.
Compared to the CNN-based model, the ViT model performed better in terms of clustering accuracy and stability. While the CNN model performed relatively well in the early rounds, it showed a gradual degradation in performance as the rounds progressed. In particular, while CNNs are good at learning local features, they have limitations in learning global patterns. This limitation was more pronounced for the clustering problem, and the performance degradation of the CNN model accelerated in a federated learning environment where data were distributed across multiple clients. The ViT model, on the other hand, was very strong at learning global relationships in images. ViT effectively learned the relationships between patches in the clustering task through its transformer structure, maintaining higher clustering accuracy than the CNN model. The ViT-based model showed minimal performance degradation as rounds progressed and performed particularly well on clients with unbalanced data distributions. These results suggest that ViT can provide more robust and stable performance than a CNN when processing data from a variety of clients in a federated learning environment. Thus, we found that the ViT model is superior to the CNN model in learning the global characteristics of the data and maintaining high performance on clustering tasks. This further highlights the strengths of ViT, especially in federated learning structures that require distributed processing of data in a variety of environments.
The proposed model can solve the problem of data imbalance and improve the overall classification performance through FL between the server and clients. FL that reflects the data characteristics of each client can be achieved while ensuring data privacy. It has high practicality and scalability in solving problems, such as fire classification in various environments.
The major contributions of this paper can be summarized as follows:
  • We built a new dataset for fire, smoke, and normal image classification suitable for real-world environments. Utilizing public datasets and crawled data, we built a realistic fire classification system, which enables the development of efficient and reliable fire detection models in real-world environments.
  • By comparing the CNN and ViT models, we found that the ViT model maintained higher performance and stability in a collaborative learning environment. While the CNN-based model degraded significantly as the rounds progressed, ViT achieved 98% classification accuracy despite the data imbalance problem. This scientifically demonstrates that ViT is a more suitable model in a federated learning structure.
  • The pretrained ViT model was sent to each client to fine-tune on each client's data, and the GCI was used to analyze the difference in data distribution between the clients and the server. KL divergence was used to precisely measure data differences, and the bisecting K-means and K-means algorithms were applied on the server and clients, respectively, to reduce the complexity of federated learning while maintaining performance. In particular, the silhouette score reached 0.9, demonstrating strong clustering performance.
The rest of this paper is organized as follows. Section 2 reviews related work, including clustering algorithms, ViT, and FL. Section 3 describes the proposed fire classification model, which applies image clustering techniques in a federated learning structure. Section 4 presents the experimental environment and results. Finally, Section 5 synthesizes the experimental results and discusses future work.

2. Related Work

2.1. Image Classification

Early image classification techniques were primarily based on traditional machine learning techniques. These techniques mainly involved manually extracting certain features from images and then training a classifier based on these features. Typical feature extraction methods include scale-invariant feature transform (SIFT) [9] and histogram of oriented gradients (HOG) [10], and classifiers based on these features include the support vector machine and K-nearest neighbors (K-NN) [11]. While these techniques have been successful for simple image classification tasks, they have limitations when dealing with increasing image complexity and large datasets. In particular, manually defined features lack robustness to image rotation, size changes, lighting changes, noise, etc. [12].
Deep learning techniques, notably the CNN, emerged as a solution to these issues. The CNN exploits the spatial structure of an image to extract features automatically and learns high-level patterns through successive layers. This approach overcomes the limitations of manual feature extraction and has shown excellent performance on large image datasets. CNNs use filters to learn local patterns and compress them through a pooling process to achieve spatial invariance. The features learned in this process are robust to changes in image size and location [13]. Since the introduction of the CNN, various models that further improve image classification performance have emerged. For instance, VGG uses a deep network structure to learn high-level patterns, and ResNet introduces residual connections to solve the problem of vanishing gradients in deep networks, allowing further expansion of network depth [14,15]. The Inception model introduces a structure that simultaneously uses filters of different sizes to extract features at different scales [16]. While all of these models are CNN-based, they leverage different structural properties to maximize performance.
Recently, transformer-based models such as ViT have emerged in an attempt to overcome the limitations of CNN [17]. The shift from traditional to deep learning methods has revolutionized the performance of image classification, and various deep learning models, including CNNs, have been successfully applied to large-scale image classification tasks.

2.2. Clustering

Clustering is an unsupervised classification technique that is used when there are no preestablished classes; it is the process of discovering patterns or structures in data and organizing them into groups with similar characteristics [18]. The main goals of clustering are to increase internal cohesion and external separation. Internal cohesion refers to the extent of similarity between data points within the same cluster: the smaller the distance between data points within a cluster, the higher the internal cohesion. External separation indicates a large difference between data from different clusters: the clearer the boundaries between two clusters, the greater their external separation. How well these two criteria are satisfied is an important factor in evaluating the performance of a clustering algorithm [19]. Several types of clustering algorithms are available, and the choice of algorithm depends on the nature of the data and the purpose. Examples include partitional clustering, hierarchical clustering, density-based clustering, and distribution-based clustering [20,21]. Prior work has demonstrated that efficient clustering can be achieved on large datasets by leveraging the Greedy Sinkhorn algorithm and memory banks [22].

2.2.1. K-Means

The K-means clustering algorithm is a representative partitional clustering method that divides data into K preset clusters. Figure 1 compares the situations before and after clustering: the red cluster represents data around the first center point, the orange cluster represents data around the second center point, and the green cluster represents data around the third center point. These visualizations give an intuitive sense of the distribution and centroid of each cluster. The algorithm starts with the random selection of K cluster centroids, and each data point is assigned to the closest centroid, where closeness is calculated based on Euclidean distance. Then, a new center point for each cluster is determined by averaging the data points within that cluster. After repeated reassignment of each data point to the closest cluster based on the newly calculated centroids, the algorithm converges when the centroids or assignments no longer change [23].
The K-means algorithm has the advantages of simplicity and computational efficiency, and it is widely used in many fields because it can be applied quickly even to large datasets [24]. However, it requires specifying the number of clusters, K, in advance, which makes determining the optimal number of clusters difficult. Further, since K-means assumes spherical clusters, it may not be suitable for data with nonspherical structures, and the results may vary depending on the initial centroid settings [18].
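As a minimal sketch of this procedure, the scikit-learn implementation below clusters synthetic 2D data into three groups; the data are made up and serve only to illustrate the API:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # toy 2-D data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final centroid of each cluster
print(km.labels_[:10])       # closest-centroid assignment per point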

2.2.2. Bisecting K-Means

Bisecting K-means is a variation of the K-means algorithm that combines the advantages of top-down hierarchical clustering and K-means [25]. First, the entire dataset is viewed as one cluster, which is then divided into two clusters using K-means. Then, the largest of the partitioned clusters is selected and split again using K-means, and this process is repeated until the desired number of clusters is achieved [26].
Figure 2 compares the results obtained by the K-means and bisecting K-means algorithms. The red dots represent the centroids of each cluster, providing a visual representation of how the data are divided among the clusters. For both K-means and bisecting K-means, the four-cluster results are shown with four colors, while the eight-cluster results show a finer-grained distribution with eight colors. This makes it easy to see the difference between the two algorithms and the distribution of the clustering results. The clustering results of bisecting K-means are more stable than those of K-means: while K-means can produce different results depending on the initial centroid settings, bisecting K-means divides the clusters gradually to produce balanced results. It works especially well on large datasets by breaking large clusters into smaller units. Further, bisecting K-means does not require a fixed number of clusters in advance: while the traditional K-means algorithm requires the user to set the number of clusters, K, beforehand, bisecting K-means can dynamically adjust the number of clusters, making the algorithm flexible and responsive to the data structure. However, the initial partitioning can affect the results; an incorrect initial partitioning can negatively affect overall performance, while a sequential approach to dividing large clusters leads to balanced and consistent clustering results [27,28,29].
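scikit-learn (version 1.1 or later) ships a BisectingKMeans estimator that follows this split-the-largest-cluster strategy; the sketch below contrasts it with plain K-means on synthetic data. Note that, unlike the dynamic variant described above, this implementation still takes a target number of clusters:

import numpy as np
from sklearn.cluster import BisectingKMeans, KMeans

X = np.random.default_rng(0).normal(size=(500, 2))   # toy 2-D data

plain = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
bisect = BisectingKMeans(n_clusters=8, random_state=0).fit(X)

print(plain.cluster_centers_.shape)    # (8, 2)
print(bisect.cluster_centers_.shape)   # (8, 2), built by repeated binary splits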

2.3. ViT

ViT is an innovative deep learning model that applies the transformer structure to image processing. Transformers are effective primarily in natural language processing, but ViT applies transformers to computer vision [30]. ViT does not process images based on a traditional CNN; instead, it processes image data in patches and inputs them into the transformer model [31]. Figure 3 shows the ViT architecture. Specifically, ViT works in the following steps.
First, with patch embedding, an input image X is divided into patches of fixed size. For example, an image of size 224 × 224 is divided into 16 × 16 patches, resulting in a total of 196 patches. For each patch, an embedding vector is generated, and the length of this vector is determined by the patch and image sizes. Second, since the transformer cannot handle order by itself, positional encoding is added to provide position information for each patch. This encoding indicates the location of a patch in the image, allowing the model to learn spatial information. Third, the transformer model processes the input patches through a transformer block consisting of multiple self-attention and feedforward neural networks. The self-attention mechanism is used to learn how each patch is related to the others, allowing the model to learn not only local but also global information in the image. Fourth, a class token is added to the model input as a vector that encapsulates the full information of the image for later classification. Finally, the vector corresponding to the class token in the final output of the transformer block is connected to the MLP head to perform downstream tasks such as image classification [32,33,34].
ViT has the advantage of modeling global interactions better than traditional CNN-based models. While CNNs use filters to extract features by considering only localized regions, ViT learns interactions between patches throughout the image via a self-attention mechanism. ViT is particularly powerful on very large datasets: with large training datasets, it can outperform a CNN, and it can be easily adapted to a variety of image-processing tasks with pretraining and fine-tuning. Furthermore, it has a relatively simple architecture that processes images using only repeated transformer blocks, which is advantageous for model design and scaling [17]. CNNs perform well on relatively small datasets, while ViT suffers from poor generalization performance when the data are insufficient; to compensate, data augmentation techniques or pretraining may be required. Because of the nature of the transformer model, ViT is computationally expensive and requires longer training times than CNNs, and it has to process long input sequences, resulting in high memory consumption and computational complexity [35].
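A minimal sketch of ViT feature extraction with the Hugging Face transformers library follows; the image path is a placeholder, and the extracted 768-dimensional class-token embedding corresponds to the feature vectors used throughout this paper:

import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("scene.jpg").convert("RGB")          # placeholder path
inputs = processor(images=image, return_tensors="pt")   # resize/normalize to 224x224

with torch.no_grad():
    outputs = model(**inputs)                  # 196 patch tokens + 1 class token
cls_feature = outputs.last_hidden_state[:, 0]  # 768-d class-token embedding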

2.4. FL

FL was first proposed by Google. It is a distributed learning method that does not send each user's data to a centralized server; instead, training is performed on the local device, and only the results are shared with the server. While the existing centralized learning model gathers data in one place to train a model, FL performs training with the data distributed across devices or local servers. This protects data privacy, and the central server only integrates model updates from individual devices to learn a global model [9,36].
Figure 4 shows the structure and operation of FL. The FL process starts with the central server distributing the initial model to individual devices. Each device then trains the model using its local data and sends the trained results to the central server. The central server aggregates this information to create a new global model, which is then distributed back to the devices, and the process is repeated. During this process, the client’s data are never sent to the server; only model updates are sent back and forth, minimizing the risk of data leakage [37]. FL has three main features. First, local devices train models using their own data, so the data never leave the device. Second, a central server integrates model updates from all devices to create a new global model and redistributes this model to the devices. Third, the communication between the devices and the server only involves model updates, and various protocols are required for efficient communication [38].
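To make the weight-exchange loop concrete, the sketch below averages client model weights in PyTorch in the style of FedAvg. This is a generic illustration of the FL round described above, not the aggregation rule proposed in this paper (which exchanges cluster information instead; see Section 3):

import copy
import torch

def average_weights(client_states):
    """Uniform FedAvg-style aggregation of client state_dicts (illustrative)."""
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        stacked = torch.stack([sd[key].float() for sd in client_states])
        global_state[key] = stacked.mean(dim=0)   # element-wise mean per tensor
    return global_state

# One round: clients train locally, send state_dicts, the server aggregates,
# and the resulting global_state is redistributed for the next round.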
Several unresolved challenges remain: The first challenge is asynchronous learning. Since all devices cannot participate in learning at all times, depending on the availability of devices or network conditions, some devices may not be able to participate in learning, thus creating an unbalanced learning environment. The second challenge is data imbalance, where the amount or nature of data held by each device is different, affecting model performance. To address this, the recently proposed Federated Feature Augmentation (FedFA) approach improves model performance by sharing the feature space between clients. FedFA uses each client’s local data to extract feature vectors, which are then combined with data from other clients to form a richer feature space. This process mitigates the problem of data imbalance and enables more accurate model training, especially for clients with small datasets [39]. The third problem is communication cost; there is a need for research to reduce the cost of devices periodically sending weights to the server. Finally, privacy and security concerns need to be addressed. To prevent attacks based on model updates, techniques such as differential privacy and homomorphic encryption can be applied to enhance the security of personal data [40,41,42]. FL has great potential for applications in various fields, such as smartphone applications, healthcare, autonomous vehicles, and IoT, and in the future, with the advancement of network technology and artificial intelligence (AI) technology, more diverse applications of FL are expected [43].

3. Fire Classification Based on FL with Image Clustering

3.1. Overall Architecture

Figure 5 shows the proposed architecture of the fire classification model, which applies clustering techniques to fire, smoke, and normal images and operates as an FL structure. The model uses clustering techniques to process data and works by having multiple clients and a server cooperate to improve model performance. Algorithm 1 is pseudocode representing clustering-based federated learning using ViT. First, the server trains on the server dataset using the ViT model. The ViT model loads the pretrained vit-base-patch16-224 model and trains on the dataset containing the fire, smoke, and normal labels. When training is complete, the model's weights are saved in a .pth file, which is later passed to the clients. The server extracts feature vectors from the trained ViT model and uses them to perform clustering with the bisecting K-means algorithm. During this process, the global centroid for each class is calculated, and based on this, the global distribution is also calculated. This information is stored in the server's GCI, which is then delivered to the clients.
The client trains on the local dataset using the GCI and .pth file delivered from the server. The client extracts feature vectors from the local dataset using the ViT model and performs clustering using the K-means algorithm (Algorithm 1). The client uses regular K-means clustering instead of the bisecting K-means used by the server. The client compares the global center points received in the GCI with its local data to estimate the local center points and local data distribution. Next, the client computes the Kullback–Leibler (KL) divergence. The KL divergence is an important metric that measures the difference in data distribution between the client and the server, allowing the client to understand how similar its data are to the server's global model. Based on this KL divergence, an alpha value and a style vector are calculated. The alpha value indicates how similar the client's data are to the server's global model, and the style vector indicates the difference between the client's data and the server's data, reflecting the characteristics of the client's data. The client sends the calculated alpha value, style vector, local cluster centroids, and density information to the server. The client does not send local data directly to the server but only these four pieces of information, maintaining the privacy principle of federated learning.
Algorithm 1 FL with ViT clustering
 1: Initialize: Load pretrained ViT model and set server dataset.
 2: Initialize optimizer with learning rate.
 3: Server: Initial Training and Feature Extraction
 4: Load ViT model (pretrained on vit-base-patch16-224).
 5: Train model on the server dataset (fire, smoke, normal labels).
 6: Extract features from server dataset using trained ViT model.
 7: Apply Bisecting K-Means clustering on server features:
 8:     Calculate global center points for each cluster.
 9:     Calculate global data distribution and store Global Cluster Information (GCI).
10: Save GCI and trained weights (.pth file).
11: Client: Receive Initial Model and Perform Clustering
12: for each client i do
13:     Receive GCI and pretrained ViT model from server.
14:     Load the pretrained ViT model and GCI.
15:     Extract features from client's local dataset.
16:     Apply K-Means clustering based on GCI:
17:         Estimate local center points and data distribution.
18:     Compute Kullback–Leibler (KL) divergence between client and server data distributions.
19:     Compute alpha value and style vector based on KL divergence and local center points.
20:     Send alpha value, style vector, local center points, and density back to the server.
21: end for
22: Server: Aggregation and Update
23: for each round up to N rounds do
24:     for each client i do
25:         Receive alpha value, style vector, local center points, and density from client i.
26:         Aggregate information from all clients:
27:             Update global center points based on client center points.
28:             Adjust global weights using style vectors and alpha values.
29:     end for
30:     Recompute GCI with updated global center points and distributions.
31:     Send updated GCI to all clients for the next round of training.
32:     Retrain ViT model on the server with updated center points.
33: end for
34: Final Model Saving
35: Save final model weights after N rounds.
The server aggregates the alpha values, style vectors, local cluster centroids, and density information received from each client to update the global centroid and weights. The style vectors and alpha values sent by clients play an important role in helping the server update the weights in the model. Based on this updated information, the server generates a new GCI, which it passes back to the client for the next round of training. This process is repeated over multiple rounds, and the client and server work together to continuously improve the model’s performance.

3.2. Initial ViT Training

This process involves the initial training executed using ViT to provide a strong starting model for the FL structure. To improve the performance of the initialized model, pretraining is performed to set it as the initial model for federated learning using a ViT model. We used the "vit-base-patch16-224" pretrained model and image preprocessor provided by Google. This model trains on the images in the server's dataset, converting class labels into indices to start model training. During the training process, the Adam optimizer and cross-entropy loss function are used to update the model weights. We evaluated model performance over a specified number of epochs; each time the results improve, the model weights are saved in a .pth file, which is used as the initial model for federated learning. Both the server and clients initialize their models from this file. This process ensures that the starting model has high accuracy and leads to good performance of the final global model.
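A sketch of this server-side pretraining step with Hugging Face transformers and PyTorch follows; train_loader, val_loader, and evaluate are assumed helpers, and the checkpoint filename is illustrative:

import torch
from torch import nn
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=3,                    # fire, smoke, normal
    ignore_mismatched_sizes=True,    # swap the ImageNet head for a 3-class head
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

best_acc = 0.0
for epoch in range(50):
    model.train()
    for pixel_values, labels in train_loader:          # assumed DataLoader
        optimizer.zero_grad()
        logits = model(pixel_values=pixel_values).logits
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
    acc = evaluate(model, val_loader)                  # assumed helper
    if acc > best_acc:                                 # save only on improvement
        best_acc = acc
        torch.save(model.state_dict(), "initial_vit.pth")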

3.3. Obtaining GCI

3.3.1. First Round

On the server side, the clustering information for the fire, smoke, and normal images, which are private data held by the server, must be obtained, and the initial weights must be delivered to the clients. Figure 6 shows the process of acquiring the initial weights on the server side. The server applies the pretrained ViT model to the input images to extract their features; feature extraction is performed with a model that excludes the fully connected layer. The bisecting K-means algorithm is used to cluster the feature vectors, and the center point for each label is calculated based on the feature extraction results. The algorithm then estimates the data density and calculates the distances between data points inside the clusters to store the data distribution. Finally, we obtain the GCI, which constitutes the initial weights passed to the clients in the first round. The GCI includes the global center points and the global distribution.
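The first-round GCI acquisition can be sketched as follows; build_gci and the returned dictionary keys are illustrative names, and loader is an assumed DataLoader over the server's images:

import numpy as np
import torch
from sklearn.cluster import BisectingKMeans

@torch.no_grad()
def build_gci(vit_model, loader, n_clusters=3, device="cuda"):
    """First-round GCI: cluster ViT features and record centers/distribution."""
    vit_model.eval().to(device)
    feats = []
    for images, _ in loader:                               # labels not needed here
        out = vit_model(pixel_values=images.to(device))
        feats.append(out.last_hidden_state[:, 0].cpu())    # 768-d class-token features
    X = torch.cat(feats).numpy()

    km = BisectingKMeans(n_clusters=n_clusters, random_state=0).fit(X)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    return {
        "centers": km.cluster_centers_,                    # global center points
        "distribution": sizes / sizes.sum(),               # global data distribution
    }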

3.3.2. N Round

For N rounds, the GCI is obtained by an ensemble based on the information obtained from each client. Algorithm 2 is a pseudocode representation of the ensemble algorithm for N rounds. The proposed algorithm works by adjusting the cluster information on the server side based on the style direction of each client in an FL structure.
Algorithm 2 N-round GCI
Require: cp_info_server, style_direction, alpha, acc_info
 1: Global Variables: step, call_cnt
 2: call_cnt ← call_cnt + 1
 3: before_acc ← acc_info[0]
 4: present_acc ← acc_info[1]
 5: if call_cnt = 1 then
 6:     Do nothing
 7: else if present_acc < before_acc then
 8:     step ← step × 1.1
 9: else if present_acc > before_acc then
10:     if present_acc < 0.95 then
11:         step ← step × 0.9
12:     else
13:         step ← step × 0.1
14:     end if
15: end if
16: Print "[Adjusted step]:" step
17: moved_cp ← DataFrame with same index and columns as cp_info_server
18: moved_cp[Label] ← cp_info_server[Label]
19: labels ← unique(cp_info_server[Label])
20: clients ← list(style_direction.keys())
21: num_clients ← len(clients)
22: for each label in labels do
23:     weighted_sum ← zeros(768)
24:     for df_style, a in zip(style_direction.values(), alpha.values()) do
25:         df_style.fillna(0, inplace=True)
26:         weighted_sum ← weighted_sum + df_style[df_style[Label] == label].iloc[:, :-1] × a
27:     end for
28:     average_vector ← weighted_sum / num_clients
29:     server_style ← cp_info_server[cp_info_server[Label] == label].iloc[0, :-1]
30:     new_vector ← server_style + step × average_vector
31:     index ← moved_cp[moved_cp[Label] == label].index[0]
32:     moved_cp.loc[index, moved_cp.columns[:-1]] ← new_vector
33: end for
34: return moved_cp
Here, cp_info_server is the cluster point information managed by the server. It is organized in DataFrame format and contains each cluster point and its label. Further, style_direction is a value indicating the direction of style change for each client; it is organized per client and sent by the client to the server. The alpha value is a client-specific weight that controls each client's contribution to updating the GCI. The parameter acc_info consists of two values indicating the accuracy of the previous and current rounds; step is the step size used for the update, and call_cnt is the number of times the function has been called. The parameters step and call_cnt retain their values from the previous state, and call_cnt is checked to identify which round the algorithm is in. On the first call, nothing is done. If present_acc is lower than before_acc, the step is increased by a factor of 1.1 because performance has worsened; conversely, if present_acc is higher than before_acc, the step is multiplied by 0.9 because performance has improved. If present_acc exceeds 0.95, the step is instead multiplied by 0.1, because no significant change is needed.
The moved cluster point information is stored in moved_cp, and labels is a list of the unique labels present in the dataset; clients is the list of clients, and num_clients is the number of clients. For each label, we weight the style direction of each client by its alpha value and compute the average vector. Then, we add this average vector, scaled by step, to the server's cluster point style vector to compute the new vector. The cluster points are updated for the corresponding labels in moved_cp. Finally, the GCI for round N is obtained by returning the DataFrame moved_cp of moved cluster points. This value is passed to the clients as the weight for round N.
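As a minimal illustration of the accuracy-driven step schedule above, the following Python sketch re-implements lines 2–15 of Algorithm 2 as a standalone function; the function name and signature are our own, not from the paper's code:

def adjust_step(step: float, before_acc: float, present_acc: float,
                call_cnt: int) -> float:
    """Accuracy-driven step schedule from Algorithm 2 (illustrative)."""
    if call_cnt == 1:
        return step                  # first call: keep the initial step
    if present_acc < before_acc:
        return step * 1.1            # performance worsened: enlarge the step
    if present_acc > before_acc:
        if present_acc < 0.95:
            return step * 0.9        # improving but not converged: shrink gently
        return step * 0.1            # near convergence: shrink aggressively
    return step                      # equal accuracy: leave the step unchanged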

3.4. Client Model Training Process

Figure 7 details the behavior of a client operating in the FL structure. The client takes in and processes its own local images, such as fire, smoke, and normal. The data are used entirely locally and are never shared externally. After preprocessing the input image, the client extracts features using the ViT model. ViT works by dividing the image into small patches, which are then analyzed, and only the important information is extracted. The feature vector for each image extracted in this process is 768-dimensional, which encapsulates the important patterns and characteristics of the image. After the client processes the data, it receives global cluster information from the server. This information consists of the cluster centroids that the server obtained as a result of processing data from other clients. The client uses these global cluster centroids as a basis for local clustering to determine how to classify its data. This process serves to help clients compare their data to global data.
The client clusters its data based on the global cluster centroids it receives from the server. It places each data point in the appropriate cluster, calculating the distance between data points and assigning them to the closest cluster centroid. Based on these assignments, the client estimates how the data are distributed within each cluster. Density estimation, which calculates the degree to which data are clustered, is used to assess how densely the data are clustered within each cluster. Density estimation is an important step in clustering because it determines the relationship between data within a cluster by calculating how close or far apart they are. In this process, the client clusters data points that fall below a threshold distance and analyzes their relationships. By doing so, the client calculates how evenly distributed the data are, or if they are concentrated in certain areas.
The client extracts a style vector that reflects the characteristics of the data in each cluster. The style vector indicates how the client data differ in direction from the global data on the server. For example, it indicates whether the client's data are more concentrated in certain classes or have stronger characteristics compared to the server's data. The style vector numerically captures these differences and helps improve the global model when it is passed to the server. In this step, the client also calculates the KL divergence. KL divergence is a mathematical measure of how different two probability distributions are, and it is used here to measure the difference between the client's data distribution and the server's global data distribution. First, the client computes a distribution for its local data, which determines the proportion of data points in each class. Next, the global data distribution received from the server is compared to the client's local distribution. The server knows the overall proportion of each class by aggregating the information it receives from all clients. The KL divergence compares these two distributions and calculates their difference, which is mathematically the logarithm of the ratio between the local and global data distributions, weighted by the proportions of the local distribution. This allows the client to measure how closely its data match the global data on the server. A larger KL divergence value means that the data on the client and server are very different, while a smaller value means that the data are similar. The result of this KL divergence calculation is converted to an alpha value. The alpha value quantifies how much the client can contribute to the server's model; a high alpha value indicates that the client's data have a significant impact on the global model.
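As a sketch of this computation, the snippet below measures the KL divergence between per-class proportion vectors and converts it to an alpha value. The exact KL-to-alpha mapping is not specified in the text, so the inverse form used here is an illustrative assumption:

import numpy as np

def kl_divergence(local_dist, global_dist, eps=1e-10):
    """KL(local || global) between per-class proportion vectors."""
    p = np.asarray(local_dist, dtype=float) + eps   # eps avoids log(0)
    q = np.asarray(global_dist, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def alpha_from_kl(kl):
    """Illustrative mapping: similar distributions (small KL) get larger alpha."""
    return 1.0 / (1.0 + kl)

local_dist = [0.2, 0.2, 0.6]       # client's fire/smoke/normal proportions
global_dist = [1/3, 1/3, 1/3]      # server's global proportions
alpha = alpha_from_kl(kl_divergence(local_dist, global_dist))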
Finally, the client sends the calculated cluster centroid, style vector, alpha, and density values to the server. This allows the server to update the global model based on the information it receives from each client. In particular, the style vector received from each client is an important reference for the server to adjust the global cluster centroids. The server fine-tunes the global cluster centroids to reflect each client’s data characteristics and differences, and these adjusted cluster centroids are used in the next round of training. This process is the core of federated learning, allowing the server to incrementally improve the global model to reflect the data characteristics of each client without directly sharing the client’s data.

4. Experiment and Results

4.1. Experiment Environments

The experimental environment used in this study is described below. Table 1 lists the specifications of the main hardware and software used in the experiment. We used an Intel Core i7-10700F 2.90 GHz CPU (Intel, Santa Clara, CA, USA), a high-performance multicore processor that performs complex calculations and data processing tasks. An NVIDIA GeForce RTX 3070 GPU (NVIDIA, Santa Clara, CA, USA) was used to accelerate the training and inference of the deep learning models. We used 32 GB of RAM to support data processing and complex algorithms. The operating system was Windows 10 Pro, and Python version 3.11.9 was used for compatibility. PyTorch version 2.4.0+cu121 was used for model development and training optimization, and CUDA version 12.1 was used to run the experiments.

4.2. Dataset

4.2.1. Data Definition and Introduction

The dataset used in this study includes a portion of the publicly available D-fire dataset and images collected through direct web crawling. The D-fire dataset is a public dataset of real-world images for fire and smoke detection, divided into fire-only images, smoke-only images, images with both fire and smoke, and normal images. The dataset is collected from real-world environments and is labeled with high quality, and duplicate images were removed to increase the reliability of the data. It reflects a wide variety of situations to help models work accurately in the real world, and the large dataset provides reliable performance in complex situations. The D-fire dataset is divided into four main categories. First, fire-only images are those where the fire phenomenon is clearly visible. Second, smoke-only images are those where there is no fire, but smoke is captured. Smoke is an important early sign of a fire and is used to increase the sensitivity of prevention and detection systems. Third, images with both fire and smoke reflect a complex situation and help model real-world fire scenes. Finally, normal images represent situations where there is no fire or smoke at all and are used to train the system to avoid raising false alarms.
After collecting images from the D-fire dataset, the collected images were thoroughly reviewed to remove as many duplicates as possible, and unnecessary images were deleted to prevent duplicates from negatively impacting model training. A web crawl was performed for fire, smoke, and normal images to obtain additional data for the experiments. The web crawl automatically collected images from various sources based on specific keywords, and the crawled images were then categorized into fire, smoke, and normal conditions. The D-fire dataset and the data collected from the web crawl were merged to form the final experimental dataset after deleting images that were not real, for example, illustrations of fire or Photoshopped images.
Figure 8 shows a graph of the number of images with corresponding labels: 6144 normal images, 1216 smoke images, and 2549 fire images, totaling 9909 images. In general, when collecting image data in a distributed environment, fire and smoke are less frequent than normal situations; we took this fact into account when collecting data. Figure 9 shows sample images from the dataset according to their labels.
This experiment was designed based on prior research showing that federated learning on small datasets can work well with fewer than 10 clients [44]. While federated learning typically assumes an environment with many clients, prior research has shown that sufficient learning results can be achieved with fewer clients without significant performance degradation. For this reason, we limited the number of clients to at most 10 to evaluate the performance and efficiency of federated learning.
In this experiment, we configured one server with 3, 5, and 10 clients, respectively, to compare the impact of different numbers of clients on federated learning. Table 2 shows the distribution of image data used by each client and the server by label. All clients and the server used the same 3000 images, but each client was given a different percentage of data per label to reproduce a non-independent and identically distributed (non-IID) environment. By non-IID, we mean that the clients do not share the same data distribution and contain data biased toward certain classes. In this experiment, we skewed the distribution by varying the ratio between labels on each client to address the data imbalance issues that may occur in real-world environments.
Specifically, the server and Clients 1, 5, and 6 have a balanced data distribution with equal proportions of fire, smoke, and normal data (1:1:1). Clients 2, 7, and 8, on the other hand, are dominated by normal data, with a 1:1:4 ratio of fire, smoke, and normal data. Clients 3 and 10 have a 4:1:1 ratio of fire, smoke, and normal data, reflecting the dominance of fire data. Finally, Clients 4 and 9 have a distribution heavily dominated by smoke data, with fire, smoke, and normal data set to 1:4:1.

4.2.2. Hyperparameters

For initial ViT training using server-side data, we used the pretrained Google vit-base-patch16-224 model through the ViTForImageClassification and ViTImageProcessor classes (Google, Mountain View, CA, USA). The number of epochs was 50, the learning rate was set to 0.0001, and the Adam optimizer with a cross-entropy loss criterion was adopted. For the FL structure using the .pth file obtained from the initial ViT training, the number of experimental epochs was 10, the learning rate was set to 0.0001, and the Adam optimizer was used. The step for moving the center point was $10^{6}$; if the average silhouette score of the three clients exceeded 0.8, the step size was reduced by multiplying it by 0.9 per round. When passing the global center point from the server to the clients, the ratio was set to step/2 to reflect the style vector of each client. The distance threshold used to compute intracluster distances (distances below the threshold are summed) was 30.
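For reference, these hyperparameters can be collected into a single configuration; the dictionary below is our own consolidation of the values reported in this subsection, with illustrative key names:

# Hypothetical consolidation of the hyperparameters reported in Section 4.2.2.
CONFIG = {
    "pretrained_model": "google/vit-base-patch16-224",
    "initial_epochs": 50,          # server-side ViT pretraining
    "fl_epochs": 10,               # epochs in the FL structure
    "learning_rate": 1e-4,
    "optimizer": "Adam",
    "criterion": "cross_entropy",
    "centroid_step": 1e6,          # step for moving the global center point
    "step_decay": 0.9,             # applied per round once mean silhouette > 0.8
    "style_ratio": 0.5,            # global center point passed as step / 2
    "distance_threshold": 30,      # intracluster distance cutoff
}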

4.3. Evaluation Metrics and Visualization

4.3.1. Evaluation Metrics

In this study, we used a number of metrics to evaluate the performance of our model for classifying fire, smoke, and normal images. The first metric is accuracy. Accuracy is the percentage of samples correctly predicted by the model, calculated as the ratio of correctly predicted samples to the total number of samples. The accuracy is determined using the four elements of the confusion matrix: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP means that the actual value is true and is correctly predicted to be true by the model. TN means that the actual value is false and is correctly predicted by the model as false. FP means that the actual value is false, but the model incorrectly predicts it to be true. FN means that the actual value is true, but the model incorrectly predicts it as false. The formula for accuracy using these four factors is as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
The normalized mutual information (NMI) is a metric that evaluates clustering performance. It is calculated by normalizing the mutual information by the entropies of the two clustering results and takes a value between 0 and 1; the closer it is to 1, the more similar the two clustering results are. Mutual information (MI) measures the degree of information shared between two clustering results, while entropy represents the uncertainty of each clustering. The formula for NMI is as follows:
$$\mathrm{NMI}(U, V) = \frac{2 \times I(U, V)}{H(U) + H(V)}$$
$I(U, V)$ represents the MI, and $U$ and $V$ denote the two clustering results. $I(U, V)$ is calculated as follows:
$$I(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} P(U_i, V_j) \log \frac{P(U_i, V_j)}{P(U_i)\, P(V_j)}$$
$P(U_i, V_j)$ is defined as the proportion of data points belonging to both cluster $U_i$ and cluster $V_j$. $H(U)$ and $H(V)$ are defined as the entropy of clusterings $U$ and $V$, respectively, and are given by the following formulas:
$$H(U) = -\sum_{i=1}^{|U|} P(U_i) \log P(U_i)$$
$$H(V) = -\sum_{i=1}^{|V|} P(V_i) \log P(V_i)$$
The silhouette score is a metric that evaluates clustering performance by measuring how well the data points are clustered. A score close to 1 means that the data points are well clustered within their respective clusters and clearly separated from other clusters. Scores close to zero are obtained when the intracluster and intercluster distances are similar, and values close to −1 indicate that the data points belong to the wrong cluster. Thus, this metric has the advantage of simultaneously considering the internal cohesion and external separation of clusters. The silhouette score for the entire dataset is calculated as the average of the silhouette scores of the individual data points:
$$S = \frac{1}{N} \sum_{i=1}^{N} s(i)$$
The silhouette score $s(i)$ for each data point $i$ is given by the following formula:
$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$
Here, $a(i)$ denotes the average distance between data point $i$ and all other data points in the same cluster and is referred to as the intracluster distance or internal distance. The cluster that data point $i$ belongs to is called $C_i$, $|C_i|$ is the number of data points in cluster $C_i$, and $\mathrm{distance}(i, j)$ is the distance between data points $i$ and $j$. The metric $a(i)$ indicates how closely a data point $i$ is connected within the cluster it belongs to, with smaller values indicating a higher density within the cluster. Its formula is as follows:
$$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} \mathrm{distance}(i, j)$$
The metric $b(i)$ denotes the average distance to the nearest cluster among the other clusters to which data point $i$ does not belong; this metric is called the nearest-cluster distance or external distance. Let $C_k$ denote one of the clusters other than $C_i$, let $|C_k|$ be the number of data points that belong to cluster $C_k$, and let $\min_{C_k \neq C_i}$ be the operation that finds the minimum value over all clusters except $C_i$. This metric indicates how far a data point $i$ is from the other clusters, with larger values indicating better separation. If $b(i)$ is greater than $a(i)$, the data point is properly clustered; if the converse holds, the clustering may be improper. The formula is as follows:
$$b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} \mathrm{distance}(i, j)$$
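All three metrics are available in scikit-learn; the snippet below is a toy example with made-up labels and features, not the paper's evaluation code:

import numpy as np
from sklearn.metrics import (accuracy_score,
                             normalized_mutual_info_score,
                             silhouette_score)

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])            # toy ground-truth labels
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0])            # toy predicted clusters
X = np.random.default_rng(0).normal(size=(8, 768))     # stand-in feature vectors

print("accuracy  :", accuracy_score(y_true, y_pred))
print("NMI       :", normalized_mutual_info_score(y_true, y_pred))
print("silhouette:", silhouette_score(X, y_pred))      # uses features + assignments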

4.3.2. Confusion Matrix

Figure 10 shows an example image of a confusion matrix. This matrix is used to evaluate the performance of a classification model. It summarizes the relationship between predicted and true classes. The rows represent the true labels, and the columns represent the predicted classes. Each cell represents a count for a particular combination of true and predicted classes. The confusion matrix allows us to identify intuitively the classes that are incorrectly predicted by the model and helps us analyze the performance of each class. Thus, we can identify the strengths and weaknesses of the model and delve into various aspects of model performance.

4.3.3. PCA

High-dimensional data, such as image data, can be visualized in two dimensions using principal component analysis (PCA). Figure 11 shows an example of a two-dimensional (2D) PCA image. Fire data points are colored red, smoke data points blue, and normal data points green. Two-dimensional PCA reduces the data to two principal components by identifying the two axes that explain the most variance in the high-dimensional space; the data are then projected onto these axes. This method allows the distribution of each class to be visualized in a 2D space, showing how well the model separates the classes and providing an intuitive understanding of the data structure and distribution. PCA visualization allows us to analyze the model-trained data and evaluate model performance from different perspectives. Thus, the performance of the proposed fire classification model can be evaluated and improved.
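A minimal sketch of this visualization using scikit-learn and Matplotlib follows, with random stand-in features in place of the real ViT embeddings (the color scheme follows Figure 11):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 768))           # stand-in for 768-d ViT features
labels = rng.integers(0, 3, size=300)            # 0 = fire, 1 = smoke, 2 = normal

coords = PCA(n_components=2).fit_transform(features)   # project onto top-2 axes
for cls, color, name in zip(range(3), ("red", "blue", "green"),
                            ("fire", "smoke", "normal")):
    mask = labels == cls
    plt.scatter(coords[mask, 0], coords[mask, 1], c=color, s=10, label=name)
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend()
plt.show()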

4.4. Results

For the configuration with three clients, Table 3 lists the NMI values for each client per round, Table 4 lists the silhouette scores for each client per round, and Table 5 lists the accuracy values for the server and each client per round.
For the configuration with five clients, Table 6 lists the NMI values for each client per round, Table 7 lists the silhouette scores for each client per round, and Table 8 lists the accuracy values for the server and each client per round.
For the configuration with 10 clients, Table 9 lists the NMI values for each client per round, Table 10 lists the silhouette scores for each client per round, and Table 11 lists the accuracy values for the server and each client per round.
First, based on NMI, we see that with three clients, the clustering quality is stable, with NMI scores mostly staying above 0.80. This means that the local models generated by each client reflect the overall data well. However, with five clients, the NMI fluctuates between 0.70 and 0.85, and the variation in clustering quality between clients increases. This suggests that as the data are distributed across more clients, the quality of clustering may decrease on some clients. With 10 clients, the NMI score varies widely from the low 0.60s to 0.80, suggesting that as the number of clients increases, the data distribution becomes more unbalanced and clustering performance is likely to deteriorate. In particular, some clients may have very low NMI scores.
We also see differences in terms of silhouette score. With three clients, the silhouette score consistently shows values close to 0.8, indicating high data aggregation within clusters and clear boundaries between clusters. With five clients, the silhouette score varies between 0.7 and 0.85, with more clients with unclear cluster boundaries than with three clients. With 10 clients, the silhouette score falls below 0.7 on average, indicating that the boundaries between clusters are more blurred, and clustering efficiency tends to decrease as the number of clients increases.
Looking at accuracy, with three clients, the accuracy ranges between 93% and 97%, indicating that the small number of clients means that the amount of data covered by each client is sufficient and the learning performance is stable. With five clients, the accuracy varies from 88% to 97%, with some clients performing below 90%, which is likely due to data imbalance or distributed learning. With 10 clients, the accuracy variance is wider, ranging from 85% to 95%, indicating that the difference in performance is due to the size and quality of the data each client handles.
In terms of convergence speed and learning stability, convergence is relatively fast with three clients: when the converged models from each client are merged, performance does not degrade significantly, and optimization is stable. With five clients, convergence can be somewhat slower, and the difference in model performance between clients can be large. With 10 clients, convergence is even slower, and the model gap between clients becomes larger, making learning less stable; synchronization becomes more difficult, and the optimization process becomes more complex. In conclusion, the three-client configuration provides overall stable results in performance metrics such as NMI, silhouette score, and accuracy. It is also very favorable in terms of convergence speed and communication efficiency. On the other hand, as the number of clients increases to 5 and 10, the performance variance tends to increase, the clustering quality decreases, and the communication and computational resource consumption increases. Therefore, three clients is the optimal setting for balancing resource efficiency and performance.
Figure 12 plots the NMI values of each client over the rounds, Figure 13 the silhouette scores, and Figure 14 the accuracy of the server and of each client. The NMI decreases slightly for all clients but remains above 0.6. The silhouette score increases over the rounds for all clients, reaching 0.9. For accuracy, the server attains the highest value, and it changes little over the rounds.
In this study, confusion matrices were used to visualize the classification results. Figure 15 shows the confusion matrices of the server and clients; the proposed model classifies most cases correctly.
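Because K-means assigns arbitrary cluster ids, building such a confusion matrix requires matching clusters to classes first. The paper does not spell out this step, so the sketch below uses one common choice, Hungarian matching on the cluster-class overlap counts; it is an assumption on our part, not the authors' code.

```python
# Map arbitrary cluster ids to classes (Hungarian matching), then build the
# confusion matrix and the resulting clustering accuracy.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def clustering_confusion(true_labels, cluster_ids, n_classes=3):
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    overlap = np.zeros((n_classes, n_classes), dtype=int)
    for c, t in zip(cluster_ids, true_labels):
        overlap[c, t] += 1                        # assumes n_clusters == n_classes
    rows, cols = linear_sum_assignment(-overlap)  # maximize total overlap
    mapping = {int(r): int(c) for r, c in zip(rows, cols)}
    mapped = np.array([mapping[int(c)] for c in cluster_ids])
    return confusion_matrix(true_labels, mapped), float((mapped == true_labels).mean())
```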
With the bisecting K-means algorithm on the server, all but 1 of the 3000 images are assigned the correct label, achieving high classification accuracy. On the clients, the model maintains high accuracy in both independently and identically distributed (IID) and non-IID environments. Near-perfect predictions were obtained for the fire class on all clients. Some confusion was observed between the normal and smoke classes, but it did not significantly affect overall performance. The results for Client 2 and Client 3 reflect the non-IID data distribution, with confusion between the normal and smoke classes, yet the fire class is predicted very accurately. In conclusion, the model maintains high overall performance even under a non-IID data distribution, performing well and without apparent bias despite the class imbalance.
In this study, high-dimensional image features were visualized in two dimensions using the PCA technique. Figure 16 shows the 2D PCA visualization of the data distribution of each class. In the figure, squares represent fire; triangles, smoke; and circles, normal; predicted classes are drawn in the same color as the corresponding true-class symbols. The figure shows that the proposed model separates the classes well and provides a clear view of the data structure and clustering performance.
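A short sketch of how such a plot can be produced with scikit-learn and Matplotlib, following the marker convention of the figure caption; the plotting details are our assumptions.

```python
# Project embeddings to 2D with PCA and scatter one marker shape per class.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

MARKERS = {"fire": "s", "smoke": "^", "normal": "o"}  # square / triangle / circle

def plot_pca_2d(features, labels):
    labels = np.asarray(labels)
    xy = PCA(n_components=2).fit_transform(features)
    for name, marker in MARKERS.items():
        mask = labels == name
        plt.scatter(xy[mask, 0], xy[mask, 1], marker=marker, label=name, alpha=0.6)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.legend()
    plt.show()
```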
To compare CNN and ViT backbones, we repeated the experiments in the same setup, using ResNet18 as the CNN feature extractor. Table 12 presents the NMI values of each client per round, which allows us to analyze the difference in clustering quality. Table 13 compares the silhouette scores of each client per round to evaluate intracluster cohesion and intercluster separation. Finally, Table 14 shows the overall performance change through the server accuracy per round. In the CNN-based experiments, the server's clustering accuracy started at 0.587 in the first round and gradually decreased as the rounds progressed. The NMI values of the clients were generally low, falling continuously from 0.334 to 0.087, which indicates poor clustering quality. Silhouette scores were also low, averaging around 0.3 per client, showing that cluster boundaries were unclear and the clusters were poorly separated.
In contrast, the ViT-based experiments showed much more stable and superior performance. The ViT model achieved high clustering accuracy from the early rounds and showed little degradation as the rounds progressed. In the final round, it maintained a clustering accuracy of 0.9787, and the client NMI values decreased only slightly, from 0.859 to 0.744, remaining high. Silhouette scores stayed above 0.8 for all clients, with well-defined cluster boundaries and clear separation between clusters. ViT outperformed the CNN not only in clustering accuracy but also in NMI and silhouette score, and it delivered stable clustering performance even in the presence of data imbalance.
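In practice, this comparison amounts to swapping the feature extractor in front of the same clustering pipeline. A minimal sketch, assuming torchvision's pretrained ResNet18 and the vit-base-patch16-224 checkpoint mentioned later in this paper (image preprocessing is omitted):

```python
# Two interchangeable feature extractors feeding the same clustering pipeline.
import torch
import torchvision.models as models
from transformers import ViTModel

@torch.no_grad()
def cnn_features(images):                    # images: (N, 3, 224, 224), normalized
    resnet = models.resnet18(weights="IMAGENET1K_V1")
    backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop fc head
    backbone.eval()
    return backbone(images).flatten(1)       # (N, 512) embeddings

@torch.no_grad()
def vit_features(images):
    vit = ViTModel.from_pretrained("google/vit-base-patch16-224").eval()
    return vit(pixel_values=images).last_hidden_state[:, 0]  # CLS token, (N, 768)
```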
The proposed model achieves 98% classification accuracy for the fire, smoke, and normal classes in an FL structure under a non-IID environment. The images used in the experiments are real images, which improves the model's applicability to industrial sites. By providing reliable, highly accurate fire classification, the proposed model is expected to play an important role in systems for the early diagnosis of fire at industrial sites.

5. Conclusions

A fire classification system using image clustering based on an FL structure is proposed herein. The proposed architecture focuses on optimizing classification performance to reflect data characteristics through cooperation between the server and clients. The client data are non-IID so that performance can be verified under realistic conditions. In the proposed system, a ViT model pretrained with server data provides the initial weights sent from the server to the clients. Using the ViT model, the server extracts data features and clusters them with the bisecting K-means algorithm. It then calculates the global center points and global distribution and generates the GCI. Each client clusters its local data with the K-means algorithm initialized from the received GCI and computes new center points and data distributions. It then measures the difference between the client and server data distributions using KL divergence and calculates the style vector and alpha value, which are delivered to the server. The values sent to the server are ensembled to improve clustering performance; a simplified code sketch of this server-client exchange is given after the contribution list below. The main contributions of the proposed architecture are as follows:
  • An improved ViT-based fire classification model was proposed. An initial ViT model was pretrained to serve as the base model for federated learning, effectively classifying fire, smoke, and normal situations. Using this model, both the server and the clients achieved high accuracy; in particular, the server reached 98% accuracy.
  • The proposed architecture maintained good performance even when each client's local data were non-IID. As the rounds progressed, the clients' NMI values decreased slightly but remained above 0.6, indicating consistently accurate clustering. The silhouette score of each client increased over the rounds, finally reaching 0.9, indicating improving clustering quality. These results imply that the clients effectively classify and cluster local data through the GCI, the weights sent by the server.
  • The proposed model can be integrated into real-world fire classification systems for real-time detection. Experimental results show that the model can quickly and accurately classify fire, smoke, and normal conditions, making it an important contribution to fire prevention and response systems in industrial sites. It has high reliability because it utilizes real image data. By calculating KL divergence to indicate how similar each client’s data are to the GCI, we proposed a way to measure and understand differences in data distributions, enabling clients to learn more precisely.
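To make the server-client exchange described above concrete, the sketch below reduces the GCI to global centroids plus a cluster-size distribution and scores each client's drift with KL divergence. The style vector, alpha calculation, and ensembling steps are omitted, so this is a simplification under our own assumptions, not the authors' implementation.

```python
# Simplified one-round exchange: the server clusters with bisecting K-means and
# ships a reduced "GCI" (centroids + cluster-size distribution); each client
# runs K-means seeded from those centroids and reports its KL divergence.
import numpy as np
from sklearn.cluster import BisectingKMeans, KMeans

def server_round(server_features, k=3):
    model = BisectingKMeans(n_clusters=k, random_state=0).fit(server_features)
    counts = np.bincount(model.labels_, minlength=k)
    return model.cluster_centers_, counts / counts.sum()

def client_round(local_features, centers, global_dist, eps=1e-9):
    km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(local_features)
    counts = np.bincount(km.labels_, minlength=len(centers))
    local_dist = counts / counts.sum()
    # KL(local || global): how far this client's data drifts from the GCI.
    kl = float(np.sum(local_dist * np.log((local_dist + eps) / (global_dist + eps))))
    return km.cluster_centers_, local_dist, kl
```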
In future research, we plan to pursue the following directions. First, we will study additional model variants: beyond vit-base-patch16-224, we will evaluate other ViT variants, CNNs, the Swin transformer, and similar architectures to explore model lightweighting and performance optimization. Second, we will explore ways to strengthen client data privacy in FL by applying techniques such as encryption and differential privacy. Third, we will integrate the proposed model into a real-world fire classification system to evaluate real-time detection and optimize the model accordingly. In doing so, we hope to provide a practical solution for rapid response in the event of a fire.

Author Contributions

Conceptualization, J.L.; methodology, J.L.; software, J.L.; validation, J.L.; formal analysis, J.L.; investigation, J.L.; resources, J.L.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, C.-S.P. and J.J.; visualization, J.L. and J.K.; supervision, C.-S.P. and J.J.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Sungkyunkwan University and the BK21 FOUR (Graduate School Innovation) funded by the Ministry of Education (MOE, Korea) and the National Research Foundation of Korea (NRF).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the following repositories: the D-Fire dataset is openly available on GitHub at https://github.com/gaiasd/DFireDataset (accessed on 9 September 2024), and the dataset collected by the authors is available at https://github.com/Stellajiwon/Fire-Classification-Data (accessed on 9 September 2024).

Acknowledgments

This research was supported by Sungkyunkwan University and the BK21 FOUR (Graduate School Innovation) funded by the Ministry of Education (MOE, Korea) and the National Research Foundation of Korea (NRF). Moreover, this research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ICT Creative Consilience Program (IITP-2024-2020-0-01821) supervised by the IITP (Institute for Information and Communications Technology Planning and Evaluation).

Conflicts of Interest

Author Jeongheun Kang was employed by the company ATEC Mobility. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. National Interagency Fire Center. Wildland Fire Summary and Statistics Annual Report 2023. Available online: https://www.nifc.gov/sites/default/files/NICC/2-Predictive%20Services/Intelligence/Annual%20Reports/2023/annual_report_2023_0.pdf (accessed on 5 September 2024).
  2. National Interagency Fire Center. Total Wildland Fires and Acres, 1983–2023. Available online: https://www.nifc.gov/fire-information/statistics/wildfires (accessed on 5 September 2024).
  3. Sridhar, P.; Thangavel, S.K.; Parameswaran, L.; Oruganti, V.R.M. Fire Sensor and Surveillance Camera-Based GTCNN for Fire Detection System. IEEE Sens. J. 2023, 23, 7626–7633. [Google Scholar] [CrossRef]
  4. Zhang, L.; Huang, Y.; Dong, H.; Xu, R.; Jiang, S. Flame-Retardant Shape Memory polyurethane/MXene Paper and the Application for Early Fire Alarm Sensor. Compos. Part B Eng. 2021, 223, 109149. [Google Scholar] [CrossRef]
  5. Lv, L.-Y.; Cao, C.-F.; Qu, Y.-X.; Zhang, G.-D.; Zhao, L.; Cao, K.; Song, P.; Tang, L.-C. Smart Fire-Warning Materials and Sensors: Design Principle, Performances, and Applications. Mater. Sci. Eng. R Rep. 2022, 150, 100690. [Google Scholar] [CrossRef]
  6. Vorwerk, P.; Kelleter, J.; Müller, S.; Krause, U. Classification in Early Fire Detection Using Multi-Sensor Nodes-A Transfer Learning Approach. Sensors 2024, 24, 1428. [Google Scholar] [CrossRef]
  7. Liu, P.; Xiang, P.; Lu, D. A New Multi-sensor Fire Detection Method Based on LSTM Networks with Environmental Information Fusion. Neural Comput. Appl. 2023, 35, 25275–25289. [Google Scholar] [CrossRef]
  8. Ahn, Y.; Choi, H.; Kim, B.S. Development of Early Fire Detection Model for Buildings Using Computer Vision-Based CCTV. J. Build. Eng. 2023, 65, 105647. [Google Scholar] [CrossRef]
  9. Liu, Y.; Kang, Y.; Zou, T.; Pu, Y.; He, Y.; Ye, X.; Ouyang, Y.; Zhang, Y.-Q.; Yang, Q. Vertical Federated Learning: Concepts, Advances, and Challenges. IEEE Trans. Knowl. Data Eng. 2024, 36, 3615–3634. [Google Scholar] [CrossRef]
  10. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  11. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
  12. Thanh Noi, P.; Kappas, M. Comparison of Random Forest, k-Nearest Neighbor, and Support Vector Machine Classifiers for Land Cover Classification Using Sentinel-2 Imagery. Sensors 2018, 18, 18. [Google Scholar] [CrossRef]
  13. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  14. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  15. Shafiq, M.; Gu, Z. Deep Residual Learning for Image Recognition: A Survey. Appl. Sci. 2022, 12, 8972. [Google Scholar] [CrossRef]
  16. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  18. Georgakis, A.; Gatziolis, D.; Stamatellos, G. A Primer on Clustering of Forest Management Units for Reliable Design-Based Direct Estimates and Model-Based Small Area Estimation. Forests 2023, 14, 1994. [Google Scholar] [CrossRef]
  19. Sarkar, M.; Puja, A.R.; Chowdhury, F.R. Optimizing Marketing Strategies with RFM Method and K-Means Clustering-Based AI Customer Segmentation Analysis. J. Bus. Manag. Stud. 2024, 6, 54–60. [Google Scholar] [CrossRef]
  20. Pitafi, S.; Anwar, T.; Sharif, Z. A Taxonomy of Machine Learning Clustering Algorithms, Challenges, and Future Realms. Appl. Sci. 2023, 13, 3529. [Google Scholar] [CrossRef]
  21. Ran, X.; Xi, Y.; Lu, Y.; Wang, X.; Lu, Z. Comprehensive Survey on Hierarchical Clustering Algorithms and the Recent Developments. Artif. Intell. Rev. 2023, 56, 8219–8264. [Google Scholar] [CrossRef]
  22. Zhou, T.; Wang, W. Prototype-Based Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6858–6872. [Google Scholar] [CrossRef]
  23. Ali, I.; Rehman, A.U.; Khan, D.M.; Khan, Z.; Shafiq, M.; Choi, J.-G. Model Selection Using K-Means Clustering Algorithm for the Symmetrical Segmentation of Remote Sensing Datasets. Symmetry 2022, 14, 1149. [Google Scholar] [CrossRef]
  24. Huang, Z.; Zheng, H.; Li, C.; Che, C. Application of Machine Learning-Based K-Means Clustering for Financial Fraud Detection. Acad. J. Sci. Technol. 2024, 10, 33–39. [Google Scholar] [CrossRef]
  25. Xumin, N.; Yong, G. Research on K-means clustering algorithm: An improved K-means clustering algorithm. In Proceedings of the 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, Ji’an, China, 29–31 May 2010; pp. 63–67. [Google Scholar]
  26. Wang, B.; Wang, J. Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-Means, and Local Search. J. Mach. Learn. Res. 2023, 24, 1–36. [Google Scholar]
  27. Seniwati, E.; Sidauruk, A.; Haryoko, H.; Lukman, A. Clustering Performance between K-Means and Bisecting K-Means for Students Interest in Senior High School. Build. Inform. Technol. Sci. (BITS) 2023, 5, 308–316. [Google Scholar] [CrossRef]
  28. Rohilla, M.S.S.; Kumar, C.; Singh, M.S. Data Clustering Using Bisecting K-Means. In Proceedings of the 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), Greater Noida, India, 18–19 October 2019; pp. 80–83. [Google Scholar]
  29. Steinbach, M.; Karypis, G.; Kumar, V. A Comparison of Document Clustering Techniques. In Proceedings of the KDD Workshop on Text Mining, Boston, MA, USA, 20–23 August 2000; pp. 525–526. [Google Scholar]
  30. Islam, A.M.; Masud, F.B.; Ahmed, M.R.; Jafar, A.I.; Ullah, J.R.; Islam, S.; Shatabda, S.; Islam, A.K.M.M. An Attention-Guided Deep-Learning-Based Network with Bayesian Optimization for Forest Fire Classification and Localization. Forests 2023, 14, 2080. [Google Scholar] [CrossRef]
  31. Sun, W.; Qin, Z.; Deng, H.; Wang, J.; Zhang, Y.; Zhang, K.; Barnes, N.; Birchfield, S.; Kong, L.; Zhong, Y. Vicinity Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12635–12649. [Google Scholar] [CrossRef] [PubMed]
  32. Han, D.; Pan, X.; Han, Y.; Song, S.; Huang, G. FLatten Transformer: Vision Transformer using Focused Linear Attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 5961–5971. [Google Scholar]
  33. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  34. Papa, L.; Russo, P.; Amerini, I.; Zhou, L. A Survey on Efficient Vision Transformers: Algorithms, Techniques, and Performance Benchmarking. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 1, 1–20. [Google Scholar] [CrossRef]
  35. Li, C.; Zhang, C. Toward a Deeper Understanding: RetNet Viewed through Convolution. arXiv 2023, arXiv:2309.05375. [Google Scholar] [CrossRef]
  36. Ahn, J.; Lee, Y.; Kim, N.; Park, C.; Jeong, J. Federated Learning for Predictive Maintenance and Anomaly Detection Using Time Series Data Distribution Shifts in Manufacturing Processes. Sensors 2023, 23, 7331. [Google Scholar] [CrossRef]
  37. Lu, Z.; Pan, H.; Dai, Y.; Si, X.; Zhang, Y. Federated Learning with Non-IID Data: A Survey. IEEE Internet Things J. 2024, 11, 19188–19209. [Google Scholar] [CrossRef]
  38. Gecer, M.; Garbinato, B. Federated Learning for Mobility Applications. ACM Comput. Surv. 2024, 56, 1–28. [Google Scholar] [CrossRef]
  39. Zhou, T.; Konukoglu, E. FedFA: Federated Feature Augmentation. arXiv 2023, arXiv:2301.12995. [Google Scholar]
  40. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and Open Problems in Federated Learning. Found. Trends® Mach. Learn. 2021, 14, 1–120. [Google Scholar] [CrossRef]
  41. Kanchan, S.; Jang, J.W.; Yoon, J.Y.; Choi, B.J. GSFedSec: Group Signature-Based Secure Aggregation for Privacy Preservation in Federated Learning. Appl. Sci. 2024, 14, 7993. [Google Scholar] [CrossRef]
  42. Chai, S.; Yang, J.W.; Li, Y. Communication Efficiency Optimization in Federated Learning Based on Multi-objective Evolutionary Algorithm. Evol. Intell. 2023, 16, 1033–1044. [Google Scholar] [CrossRef]
  43. Ficco, M.; Guerriero, A.; Milite, E.; Palmieri, F.; Pietrantuono, R.; Russo, S. Federated Learning for IoT Devices: Enhancing TinyML with On-Board Training. Inf. Fusion 2024, 104, 102189. [Google Scholar] [CrossRef]
  44. Kamp, M.; Fischer, J.; Vreeken, J. Federated Learning from Small Datasets. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 6953–6961. [Google Scholar]
Figure 1. Comparison of the original data and the data after K-means clustering.
Figure 2. Comparison of the results of K-means and bisecting K-means algorithms.
Figure 3. Architecture of the vision transformer (ViT).
Figure 4. FL process.
Figure 5. Proposed architecture.
Figure 6. Initial GCI process.
Figure 7. Client model training process.
Figure 8. Comparison of the number of data labels.
Figure 9. Sample images in the experiment dataset corresponding to the following labels: (a) normal; (b) fire; (c) smoke.
Figure 10. Confusion matrix.
Figure 11. Two-dimensional (2D) PCA.
Figure 12. NMI line chart of clients: (a) Client 1; (b) Client 2; (c) Client 3.
Figure 13. Silhouette score line chart of clients: (a) Client 1; (b) Client 2; (c) Client 3.
Figure 14. Accuracy line chart of the server and clients: (a) Server; (b) Client 1; (c) Client 2; (d) Client 3.
Figure 15. Confusion matrices of the server and clients: (a) Server; (b) Client 1; (c) Client 2; (d) Client 3.
Figure 16. Two-dimensional PCA of clients: (a) Client 1; (b) Client 2; (c) Client 3.
Table 1. Experimental environment hardware and software.

Hardware Environment | Software Environment
CPU: Intel Core i7-10700F 2.90 GHz | Operating system: Windows 10 Pro
GPU: NVIDIA GeForce RTX 3070 | Python: 3.11.9
RAM: 32 GB | Pytorch: 2.4.0+cu121
 | CUDA: 12.1
Table 2. Distribution of label image data by server and clients.

 | Fire | Smoke | Normal
Server | 1000 | 1000 | 1000
Client 1 | 1000 | 1000 | 1000
Client 2 | 500 | 500 | 2000
Client 3 | 2000 | 500 | 500
Client 4 | 500 | 2000 | 500
Client 5 | 1000 | 1000 | 1000
Client 6 | 1000 | 1000 | 1000
Client 7 | 500 | 500 | 2000
Client 8 | 500 | 500 | 2000
Client 9 | 500 | 2000 | 500
Client 10 | 2000 | 500 | 500
Table 3. NMI values of each client over rounds (three clients).

Round | Client 1 | Client 2 | Client 3
0 | 0.859 | 0.768 | 0.872
1 | 0.807 | 0.699 | 0.832
2 | 0.810 | 0.714 | 0.842
3 | 0.793 | 0.696 | 0.830
4 | 0.785 | 0.671 | 0.814
5 | 0.781 | 0.653 | 0.806
6 | 0.764 | 0.634 | 0.792
7 | 0.791 | 0.668 | 0.817
8 | 0.746 | 0.608 | 0.767
9 | 0.744 | 0.607 | 0.769
Table 4. Silhouette scores of each client over rounds (three clients).

Round | Client 1 | Client 2 | Client 3
0 | 0.554 | 0.448 | 0.639
1 | 0.708 | 0.630 | 0.757
2 | 0.769 | 0.718 | 0.806
3 | 0.809 | 0.763 | 0.836
4 | 0.834 | 0.796 | 0.852
5 | 0.850 | 0.818 | 0.864
6 | 0.862 | 0.834 | 0.869
7 | 0.877 | 0.856 | 0.888
8 | 0.874 | 0.856 | 0.876
9 | 0.878 | 0.859 | 0.879
Table 5. Accuracy of the server and clients over rounds (three clients).

Round | Server | Client 1 | Client 2 | Client 3
0 | 0.9987 | 0.965 | 0.936 | 0.976
1 | 0.9843 | 0.941 | 0.896 | 0.962
2 | 0.9937 | 0.945 | 0.908 | 0.966
3 | 0.9940 | 0.936 | 0.896 | 0.963
4 | 0.9903 | 0.931 | 0.884 | 0.958
5 | 0.9890 | 0.928 | 0.874 | 0.954
6 | 0.9837 | 0.920 | 0.859 | 0.948
7 | 0.9967 | 0.936 | 0.889 | 0.959
8 | 0.9650 | 0.911 | 0.841 | 0.939
9 | 0.9787 | 0.913 | 0.846 | 0.942
Table 6. NMI values of each client over rounds (five clients).

Round | Client 1 | Client 2 | Client 3 | Client 4 | Client 5
0 | 0.859 | 0.768 | 0.872 | 0.868 | 0.878
1 | 0.827 | 0.740 | 0.844 | 0.854 | 0.865
2 | 0.822 | 0.739 | 0.847 | 0.858 | 0.868
3 | 0.811 | 0.718 | 0.839 | 0.858 | 0.858
4 | 0.799 | 0.701 | 0.825 | 0.851 | 0.850
5 | 0.790 | 0.686 | 0.812 | 0.846 | 0.841
6 | 0.783 | 0.675 | 0.799 | 0.840 | 0.821
7 | 0.766 | 0.655 | 0.777 | 0.821 | 0.801
8 | 0.754 | 0.643 | 0.762 | 0.807 | 0.790
9 | 0.738 | 0.625 | 0.749 | 0.801 | 0.776
Table 7. Silhouette scores of each client over rounds (five clients).

Round | Client 1 | Client 2 | Client 3 | Client 4 | Client 5
0 | 0.554 | 0.448 | 0.639 | 0.573 | 0.565
1 | 0.710 | 0.650 | 0.755 | 0.736 | 0.728
2 | 0.766 | 0.723 | 0.798 | 0.789 | 0.784
3 | 0.802 | 0.769 | 0.827 | 0.829 | 0.819
4 | 0.831 | 0.801 | 0.847 | 0.858 | 0.844
5 | 0.850 | 0.826 | 0.857 | 0.877 | 0.857
6 | 0.862 | 0.844 | 0.861 | 0.890 | 0.869
7 | 0.865 | 0.849 | 0.860 | 0.897 | 0.872
8 | 0.869 | 0.857 | 0.860 | 0.901 | 0.875
9 | 0.870 | 0.863 | 0.860 | 0.902 | 0.875
Table 8. Accuracy of the server and clients over rounds (five clients).

Round | Server | Client 1 | Client 2 | Client 3 | Client 4 | Client 5
0 | 0.9997 | 0.965 | 0.936 | 0.976 | 0.976 | 0.968
1 | 0.991 | 0.953 | 0.922 | 0.968 | 0.973 | 0.963
2 | 0.9977 | 0.952 | 0.922 | 0.968 | 0.974 | 0.965
3 | 0.9967 | 0.945 | 0.909 | 0.965 | 0.974 | 0.960
4 | 0.994 | 0.938 | 0.898 | 0.961 | 0.972 | 0.957
5 | 0.9857 | 0.932 | 0.885 | 0.957 | 0.970 | 0.952
6 | 0.9823 | 0.930 | 0.881 | 0.953 | 0.968 | 0.945
7 | 0.9743 | 0.922 | 0.863 | 0.944 | 0.963 | 0.934
8 | 0.975 | 0.917 | 0.855 | 0.939 | 0.959 | 0.930
9 | 0.9647 | 0.908 | 0.844 | 0.933 | 0.957 | 0.922
Table 9. NMI values of each client over rounds (10 clients).

Round | Client 1 | Client 2 | Client 3 | Client 4 | Client 5 | Client 6 | Client 7 | Client 8 | Client 9 | Client 10
0 | 0.859 | 0.768 | 0.872 | 0.868 | 0.878 | 0.867 | 0.799 | 0.798 | 0.851 | 0.892
1 | 0.827 | 0.740 | 0.844 | 0.854 | 0.865 | 0.852 | 0.773 | 0.763 | 0.845 | 0.870
2 | 0.823 | 0.738 | 0.847 | 0.857 | 0.867 | 0.853 | 0.787 | 0.770 | 0.842 | 0.869
3 | 0.812 | 0.718 | 0.839 | 0.860 | 0.859 | 0.855 | 0.763 | 0.758 | 0.835 | 0.869
4 | 0.798 | 0.700 | 0.823 | 0.851 | 0.850 | 0.835 | 0.739 | 0.740 | 0.821 | 0.860
5 | 0.790 | 0.685 | 0.812 | 0.846 | 0.841 | 0.819 | 0.723 | 0.718 | 0.815 | 0.848
6 | 0.780 | 0.675 | 0.795 | 0.837 | 0.821 | 0.802 | 0.713 | 0.710 | 0.795 | 0.825
7 | 0.766 | 0.660 | 0.778 | 0.824 | 0.806 | 0.783 | 0.694 | 0.695 | 0.789 | 0.813
8 | 0.764 | 0.656 | 0.772 | 0.815 | 0.801 | 0.782 | 0.692 | 0.692 | 0.789 | 0.814
9 | 0.737 | 0.624 | 0.747 | 0.795 | 0.776 | 0.756 | 0.661 | 0.670 | 0.779 | 0.786
Table 10. Silhouette scores of each client over rounds (10 clients).

Round | Client 1 | Client 2 | Client 3 | Client 4 | Client 5 | Client 6 | Client 7 | Client 8 | Client 9 | Client 10
0 | 0.554 | 0.448 | 0.639 | 0.573 | 0.565 | 0.561 | 0.459 | 0.462 | 0.574 | 0.647
1 | 0.710 | 0.650 | 0.755 | 0.736 | 0.728 | 0.721 | 0.659 | 0.662 | 0.737 | 0.764
2 | 0.766 | 0.724 | 0.798 | 0.789 | 0.784 | 0.778 | 0.731 | 0.737 | 0.791 | 0.807
3 | 0.802 | 0.769 | 0.827 | 0.829 | 0.819 | 0.814 | 0.773 | 0.777 | 0.829 | 0.834
4 | 0.832 | 0.801 | 0.847 | 0.858 | 0.844 | 0.838 | 0.803 | 0.806 | 0.855 | 0.852
5 | 0.850 | 0.825 | 0.857 | 0.876 | 0.857 | 0.853 | 0.826 | 0.827 | 0.874 | 0.862
6 | 0.861 | 0.842 | 0.859 | 0.890 | 0.867 | 0.865 | 0.840 | 0.843 | 0.887 | 0.866
7 | 0.868 | 0.851 | 0.863 | 0.897 | 0.873 | 0.874 | 0.851 | 0.850 | 0.897 | 0.869
8 | 0.874 | 0.860 | 0.865 | 0.902 | 0.879 | 0.880 | 0.860 | 0.860 | 0.902 | 0.875
9 | 0.871 | 0.865 | 0.860 | 0.901 | 0.875 | 0.876 | 0.860 | 0.862 | 0.902 | 0.871
Table 11. Accuracy of the server and clients over rounds (10 clients).

Round | Server | Client 1 | Client 2 | Client 3 | Client 4 | Client 5 | Client 6 | Client 7 | Client 8 | Client 9 | Client 10
0 | 0.9997 | 0.965 | 0.936 | 0.976 | 0.976 | 0.968 | 0.970 | 0.961 | 0.950 | 0.940 | 0.930
1 | 0.991 | 0.953 | 0.922 | 0.968 | 0.973 | 0.963 | 0.968 | 0.956 | 0.945 | 0.935 | 0.925
2 | 0.9977 | 0.952 | 0.922 | 0.968 | 0.974 | 0.965 | 0.970 | 0.960 | 0.948 | 0.938 | 0.928
3 | 0.9967 | 0.945 | 0.909 | 0.965 | 0.974 | 0.960 | 0.965 | 0.955 | 0.943 | 0.933 | 0.923
4 | 0.994 | 0.938 | 0.898 | 0.961 | 0.972 | 0.957 | 0.960 | 0.951 | 0.940 | 0.930 | 0.920
5 | 0.9857 | 0.932 | 0.885 | 0.957 | 0.970 | 0.952 | 0.956 | 0.948 | 0.936 | 0.926 | 0.916
6 | 0.9823 | 0.930 | 0.881 | 0.953 | 0.968 | 0.945 | 0.952 | 0.944 | 0.931 | 0.921 | 0.911
7 | 0.9743 | 0.922 | 0.863 | 0.944 | 0.963 | 0.934 | 0.940 | 0.932 | 0.921 | 0.911 | 0.901
8 | 0.975 | 0.917 | 0.855 | 0.939 | 0.959 | 0.930 | 0.936 | 0.928 | 0.917 | 0.907 | 0.897
9 | 0.9647 | 0.908 | 0.844 | 0.933 | 0.957 | 0.922 | 0.930 | 0.923 | 0.912 | 0.902 | 0.892
Table 12. NMI comparison between CNN and ViT.

Round | CNN Client 1 | CNN Client 2 | CNN Client 3 | ViT Client 1 | ViT Client 2 | ViT Client 3
0 | 0.334 | 0.241 | 0.407 | 0.859 | 0.768 | 0.872
1 | 0.257 | 0.194 | 0.256 | 0.807 | 0.699 | 0.832
2 | 0.209 | 0.154 | 0.221 | 0.810 | 0.714 | 0.842
3 | 0.187 | 0.135 | 0.203 | 0.793 | 0.696 | 0.830
4 | 0.153 | 0.118 | 0.181 | 0.785 | 0.671 | 0.814
5 | 0.127 | 0.100 | 0.158 | 0.781 | 0.653 | 0.806
6 | 0.114 | 0.086 | 0.148 | 0.764 | 0.634 | 0.792
7 | 0.102 | 0.080 | 0.137 | 0.791 | 0.668 | 0.817
8 | 0.093 | 0.076 | 0.127 | 0.746 | 0.608 | 0.767
9 | 0.087 | 0.070 | 0.123 | 0.744 | 0.607 | 0.769
Table 13. Silhouette score comparison between CNN and ViT.

Round | CNN Client 1 | CNN Client 2 | CNN Client 3 | ViT Client 1 | ViT Client 2 | ViT Client 3
0 | 0.475 | 0.450 | 0.542 | 0.554 | 0.448 | 0.639
1 | 0.426 | 0.413 | 0.420 | 0.708 | 0.630 | 0.757
2 | 0.414 | 0.398 | 0.405 | 0.769 | 0.718 | 0.806
3 | 0.384 | 0.373 | 0.371 | 0.809 | 0.763 | 0.836
4 | 0.346 | 0.351 | 0.336 | 0.834 | 0.796 | 0.852
5 | 0.360 | 0.370 | 0.358 | 0.850 | 0.818 | 0.864
6 | 0.369 | 0.382 | 0.368 | 0.862 | 0.834 | 0.869
7 | 0.377 | 0.389 | 0.377 | 0.877 | 0.856 | 0.888
8 | 0.381 | 0.393 | 0.382 | 0.874 | 0.856 | 0.876
9 | 0.381 | 0.394 | 0.384 | 0.878 | 0.859 | 0.879
Table 14. Server accuracy comparison between CNN and ViT.

Round | CNN | ViT
0 | 0.5873 | 0.9987
1 | 0.5640 | 0.9843
2 | 0.5477 | 0.9937
3 | 0.5400 | 0.9940
4 | 0.4987 | 0.9903
5 | 0.4830 | 0.9890
6 | 0.4790 | 0.9837
7 | 0.4723 | 0.9967
8 | 0.4650 | 0.9650
9 | 0.4593 | 0.9787