1. Introduction
The COVID-19 pandemic spread rapidly across the planet, exposing the human species to a systemic risk [1]. Under uncertainty about the mechanism of viral spread and effective cures, the only possible mechanisms to prevent or mitigate the epidemic spread of infection are non-pharmaceutical: quarantines/lockdowns and mask use [2]. However, lockdowns damage economic and social life, so adaptive methods for individual isolation have been proposed [3,4]. Although the practical means of implementing adaptive quarantine methods in real-world societies remain uncertain, the effectiveness of mask-wearing (despite all the controversies [5]) is proven [6,7]. In addition, wearing protective masks at work, school, or public places does not hinder economic and social activities. Consequently, mask use should be enforced in all places where infection can spread [8].
Real-time tracking systems that detect mask-wearing are required to mitigate the consequences of epidemic diseases. Accordingly, we need an affordable hardware platform that can apply the tracking solution at a large scale and provide the computational power for deep learning image recognition. One way to build a prototype for such a mask detection system is to use a Single-Board Computer (SBC) such as the Raspberry Pi [9], which has adequate resources, including an attached camera to supply the video stream.
There are significant anthropometric variations based on race, gender, culture, and age [10,11]. Furthermore, the type of mask also plays a role in training an efficient and accurate detection system [12]. Thus, obtaining comprehensive data for efficiently training a mask detection system is difficult [13]. The situation becomes even more challenging when considering a lightweight SBC implementation for the mask detection solution.
Traditional centralized mask-detection systems require that every raw video frame be sent to a cloud or data center server for training. Deployed on a large scale, this configuration faces three significant challenges. First, there is bandwidth usage: simultaneously transmitting continuous video streams from several cameras can strain network connections. The second issue concerns privacy: concentrating raw biometric images in a single location subjects them to specific regulations (e.g., the GDPR in the European Union). The third challenge is vulnerability to failures: any server or network interruption can completely halt detection at the local level. Federated learning [14,15] can address these challenges by keeping images on local devices and only sending encrypted model updates, thus reducing costs while significantly improving privacy and system robustness.
This paper addresses the problem of building a lightweight and scalable mask detection solution using image recognition and machine learning (ML). To this end, we propose a novel federated learning implementation that deals with the difficulty of collecting a comprehensive dataset. The distributed training approach of federated learning is better suited for the heterogeneous distribution of anthropometric traits in a large population (i.e., at national, international, or continental levels [16]). Further, federated learning ensures better data privacy because the raw image data remain on the client side (i.e., they are not exposed by communicating them to a centralized server [17]). The main contributions of our work are as follows:
We provide a novel FL framework for face mask detection using image recognition;
We tailor a face and mask detection model that requires limited hardware resources, namely, Raspberry Pi single-board computers;
We present simulation-based experiments that demonstrate the effectiveness and accuracy of our FL framework compared to the conventional, centralized ML approach.
The remainder of this paper is organized as follows: Section 2 presents existing work on mask detection systems, Section 3 describes the hardware platform used to train and run the face detection model, Section 4 describes the FL model, the dataset, and the software architecture, Section 5 shows the evaluations of our proposed mask detection system, and Section 6 states the final remarks.
3. Experimental Configuration
We need a device capable of running all the algorithms required to achieve the performance that our mask detection task demands. Accordingly, at the client level, we identified the Raspberry Pi as the most convenient option in terms of both quality and affordability, according to [29]. However, although the Raspberry Pi has the specifications of a standard computer, its computing power is limited compared to a traditional desktop or laptop.
For the project described in this paper, we selected the Raspberry Pi 4 Model B with 8 GB of RAM and a Micro-SDXC card (to store the Raspberry Pi OS, the project files, and a database of images necessary for training). The video capture component is a webcam that provides high-definition video quality and operates at a frame rate of 30 fps.
Given the processor’s propensity to overheat, particularly during training sessions, we implemented an enhanced cooling solution. Consequently, our system comprises two large heatsinks (which also serve as a protective casing for the entire device) and two fans to maintain optimal temperature control.
Figure 1 shows the entire hardware configuration.
To address the constraints of implementing a lightweight yet accurate and adaptive mask detection model, we propose a federated learning framework, where the Raspberry Pi system shown in Figure 1 represents a single component (or client) within a larger network. This network includes M aggregators, each overseeing N clients; see Figure 2. The operational FL sequence is as follows: Initially, the server sends the initial model weights to each of the M aggregators linked to it. Subsequently, each aggregator relays these weights to its N clients. The training then commences individually among these clients, using their local data. After finishing the training, each client sends its updated weights to its respective aggregator. Each aggregator then builds a more general model with the weights received from its clients. In the final step, each aggregator transmits its aggregated weights back to the server. The server performs a comprehensive aggregation of all the weights received from the lower-level aggregators, thus delivering the final model weights. This marks the end of the first communication cycle, and the process is repeated until the desired results are obtained. This paper considers an FL architecture with one aggregator (M = 1) that oversees N clients.
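To make this communication cycle concrete, the following minimal Python sketch simulates one round across M aggregators and N clients. It is a toy model under explicit assumptions: local_train merely perturbs the weights to stand in for client-side training, and the averages are unweighted, matching the equal-sized IID shards used in our simulation; the function names and weight shapes are illustrative, not part of our implementation.

```python
import numpy as np

def local_train(weights, rng):
    # Stand-in for client-side training on private local data:
    # perturb the received weights as if a few local epochs had run.
    return [w + 0.01 * rng.standard_normal(w.shape) for w in weights]

def average_weights(models):
    # Layer-wise average of a list of weight lists; unweighted because
    # every simulated client holds an equal-sized IID shard.
    return [np.mean(layers, axis=0) for layers in zip(*models)]

def communication_cycle(server_weights, M, N, rng):
    aggregator_models = []
    for _ in range(M):                                      # server -> M aggregators
        clients = [local_train(server_weights, rng) for _ in range(N)]
        aggregator_models.append(average_weights(clients))  # aggregator-level step
    return average_weights(aggregator_models)               # server-level aggregation

rng = np.random.default_rng(0)
global_w = [np.zeros((4, 4)), np.zeros(4)]                  # toy two-layer "model"
for _ in range(3):                                          # repeat until satisfied
    global_w = communication_cycle(global_w, M=1, N=5, rng=rng)
```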
4. Materials and Methods
We propose a method combining computer vision, transfer learning via MobileNetV2, and federated learning for distributed training to create a real-time mask detector. This approach provides an efficient, scalable, and privacy-conscious solution for the real-time identification of face masks.
Computer vision plays a crucial role in our proposed system, enabling the real-time detection of people not wearing masks and the automatic analysis of visual data. This technology is particularly effective in monitoring large groups where direct supervision is challenging.
Using transfer learning, MobileNetV2 enables efficient processing by exploiting pre-trained ImageNet features to achieve high accuracy even with limited data.
Federated learning is a key component of our proposed system, facilitating model training on multiple devices using local datasets without the need for data exchange. This distributed approach enhances data security, aligns with data protection regulations, and reduces privacy concerns. Moreover, federated learning enhances scalability and adaptability, enabling the model to glean insights from diverse data sources and improve its generalization across various user groups and data types.
The key concepts in federated learning are client, aggregator/server, round, epoch, and privacy boundary. The client is the edge device (e.g., Raspberry Pi board) responsible for conducting local training on its private dataset. The aggregator/server is the entity that gathers the weights from the clients and calculates their weighted average (FedAvg) to update the global model. The round refers to a complete cycle that involves updating global weights, client-side local training, and aggregation at the server. The local epoch is a process in which a client makes a complete pass over its dataset during a round. The privacy boundary ensures that the raw images are kept on the client device; only encrypted weight updates are sent.
4.1. Transfer Learning
We have chosen a neural network architecture that meets our needs in terms of efficiency and size so that it can be deployed efficiently on a Raspberry Pi, thus guaranteeing practicality. Accordingly, we adapt the pre-trained MobileNetV2 architecture as the basis for the initial model that the server passes to the clients [30].
MobileNetV2 encompasses only 3.4 million parameters (approximately 10 MB of storage); this makes it the most lightweight network to achieve an accuracy of at least 97% in our initial repeated cross-validation experiments. The use of depth-wise separable convolutions and inverted residual blocks helps to diminish the size of the weight-update tensors transmitted during federated learning rounds. Given its compact size and high accuracy, MobileNetV2 is particularly well-suited for deployment on resource-constrained devices, such as the Raspberry Pi 4, where memory and computational efficiency are crucial.
MobileNetV2 [31], originally trained on the ImageNet dataset [33], is known for its efficiency and is designed for mobile and embedded vision applications; we adapt it to our task via transfer learning [32]. In our model, we only use the feature extraction components of MobileNetV2, eliminating the top classification layers. On top of this base model, we introduce a custom head tailored to our task: an Average Pooling Layer, a Flatten Layer, a Dense Layer with ReLU Activation, and an Output Dense Layer.
The Average Pooling Layer reduces the spatial dimensions of the feature maps, the Flatten Layer transforms them into a one-dimensional vector, the Dense Layer with ReLU Activation provides the network with the capacity to learn complex patterns, and the Output Dense Layer has two neurons corresponding to the two classes of our classification task (i.e., mask and no mask), with the softmax activation function employed to interpret the model’s outputs as class probabilities.
To preserve the feature extraction capabilities of the pre-trained MobileNetV2, all of its layers are frozen during training, allowing us to harness the rich representations learned on ImageNet while training only the custom head for our specific classification challenge.
Figure 3 presents the entire neural network architecture we propose for mask detection.
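As a concrete illustration, the following TensorFlow/Keras sketch assembles the architecture just described. The width of the hidden Dense layer (128 units) and the pooling window are our assumptions for illustration; the text specifies the layer types, not their exact sizes.

```python
import tensorflow as tf

# Pre-trained MobileNetV2 as a frozen feature extractor (no top classifier).
base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze all pre-trained layers

model = tf.keras.Sequential([
    base,                                                 # 7 x 7 x 1280 features
    tf.keras.layers.AveragePooling2D(pool_size=(7, 7)),   # reduce spatial dims
    tf.keras.layers.Flatten(),                            # to a 1-D vector
    tf.keras.layers.Dense(128, activation="relu"),        # learn task patterns
    tf.keras.layers.Dense(2, activation="softmax"),       # mask / no-mask probs
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy", metrics=["accuracy"])
```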
Furthermore, the use of transfer learning with ImageNet pre-trained weights means that a large dataset for training the model is not required. In a real-world federated learning scenario, each client might have a small and heterogeneous dataset, yet the approach can still achieve robust performance. This is a significant advantage, as it demonstrates the effectiveness of the method even with limited data.
4.2. Federated Learning
Federated learning involves numerous devices; hence, it is also termed collaborative learning [34,35]. Training is decentralized, using local data across multiple devices, which eliminates the need to share raw user data [36].
The rationale behind using FL for the mask detection problem is twofold. First, FL ensures the individual’s privacy, as the raw data (i.e., images) remain at the local level, with the server receiving only the model weights. Second, the main aim of our method is to minimize the occurrence of false positives. (The main problem arises when an individual is labeled as correctly wearing the mask, even though they are not.) Experimental results suggest that a false positive is expected roughly once every 500 frames. FL repeatedly trains identical neural networks with various simulated clients, each possessing unique data, so, at least in theory, the probability of developing a robust model should be significantly enhanced within the FL paradigm.
To experiment with and test our mask detection system, we simulated the entire process, even though the prediction software running at the client level can run on the Raspberry Pi described in Figure 1. The reason is that we did not have enough Raspberry Pi boards for a full-scale deployment. In this situation, the Raspberry Pi functions as both a server and a client (see Figure 4); however, the system was designed and implemented as two separate entities: server and clients. The simulation process starts with the server generating the clients. Since this is a client simulation, and we have only a few Raspberry Pi boards, we randomly split the entire dataset among the clients so that each client receives the same amount of Independent and Identically Distributed (IID) data.
The dataset consists of 6000 images, evenly balanced in two classes (3000 with masks and 3000 without masks). To explain the distribution process, we define a general formula for allocating samples to clients, taking into account the training split used in preprocessing, where the following are defined:
D is the total number of images in the dataset (i.e., 6000);
S is the training split ratio (i.e., 0.8 for 80% training data, with the remaining 20% reserved for testing);
C is the number of classes (i.e., 2: with mask and without mask);
N is the number of clients (varies per experiment).
The total number of training images is calculated as $T = D \times S$, the number of samples per client is $T / N$, and the number of samples per class per client is $T / (N \times C)$.
For $D = 6000$, $S = 0.8$, and $C = 2$, we have $T = 4800$ training images in total; each client therefore receives $4800 / N$ samples, with $2400 / N$ samples per class (with mask and without mask), for the number of clients $N$ used in a given experiment. For instance, the 25-client configuration reported in Section 6 yields 192 samples per client, with 96 samples per class.
This distribution ensures no overlap between clients’ datasets, maintaining data privacy. The dataset assignment remains fixed across all communication rounds, with each client training on its unique subset throughout the experiment. This setup aligns with federated learning principles, preventing the server from accessing confidential data and ensuring robust model training across diverse client datasets.
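A short worked instance of the allocation formula follows; N = 25 matches the client count of the best-performing configuration reported in Section 6, and other experiments follow the same arithmetic with their own N.

```python
# Allocation formula: T = D * S training images in total,
# T / N samples per client, and T / (N * C) samples per class per client.
D, S, C, N = 6000, 0.8, 2, 25
T = int(D * S)                   # 4800 training images
per_client = T // N              # 192 samples per client
per_class = per_client // C      # 96 samples per class per client
print(T, per_client, per_class)  # 4800 192 96
```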
The FL process we simulate (according to the architecture in Figure 4) begins when clients acquire the initial global model (based on MobileNetV2) from the server. Subsequently, the clients will use their local data to train their models over several epochs. (Throughout this local training phase, the mean gradient is employed for optimization.) Following several iterations of local updates, the resulting local models are dispatched to the server for comprehensive aggregation. This cycle continues until the aggregated model meets the desired performance standards.
The aggregation creates a new global model in each communication round $t$ [37]. In this paper, we perform the aggregation according to Equation (1):

$$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} \, w_{t+1}^{k} \qquad (1)$$

where $w_{t+1}$ represents the global weights, $K$ is the total number of clients, $n$ is the total number of data samples across all clients, $n_k$ is the number of data samples held by client $k$, and $w_{t+1}^{k}$ are the local weights from client $k$. Algorithms 1 and 2 present the pseudocode for the federated learning procedure that we emulate in this paper.
Algorithm 1 ServerUpdate
1: initialize $w_0$
2: for each round $t = 1, 2, \ldots$ do
3:  $S_t \leftarrow$ random set of $n$ simulated clients
4:  for each client $k \in S_t$ in parallel do
5:   $w_{t+1}^{k} \leftarrow$ UpdateClientWeights($k, w_t$)
6:  end for
7:  // update global weights
8:  $w_{t+1} \leftarrow \sum_{k \in S_t} \frac{n_k}{n} \, w_{t+1}^{k}$
9: end for
Algorithm 2 UpdateClientWeights($k, w$)
1: // run on client $k$
2: for each local epoch $i = 1$ to $e$ do
3:  $w \leftarrow w - \eta \nabla \ell(w; b)$ // mean-gradient step on local mini-batch $b$
4: end for
5: return $w$
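As a minimal NumPy sketch of Equation (1), i.e., the aggregation step on line 8 of Algorithm 1, the following function computes the sample-weighted average of the clients’ weight lists; the toy client weights and sizes are illustrative.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    # Equation (1): w_{t+1} = sum_k (n_k / n) * w_{t+1}^k,
    # with n the total number of samples across all clients.
    n = sum(client_sizes)
    return [
        sum((n_k / n) * w for n_k, w in zip(client_sizes, layers))
        for layers in zip(*client_weights)  # iterate layer-wise across clients
    ]

# Toy example: three clients with two-layer weight lists and unequal data sizes.
rng = np.random.default_rng(1)
clients = [[rng.standard_normal((4, 2)), rng.standard_normal(2)]
           for _ in range(3)]
w_next = fedavg(clients, client_sizes=[240, 192, 96])
```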
4.3. Computer Vision
The application running at the client level to achieve mask detection requires a few procedures from the domain of computer vision [38]. Each frame arriving from the video stream is processed independently. If a person’s face is detected, mask recognition and label prediction follow. Accordingly, the output is presented in real time as a rectangle that borders the person’s face, with a color corresponding to the final mask-wearing decision (red: no mask; blue: mask incorrectly worn; green: mask correctly worn).
As prescribed by [39], we utilized diverse libraries, such as OpenCV, TensorFlow, and NumPy, to analyze the photos collected from the video stream; we used them to extract relevant information from incoming photos, make predictions, scale the photos, and create the appropriate output. To detect people’s faces, we used the OpenCV face detection model [40]. An essential observation is that we integrated data collection: when a person is recognized as wearing a mask, the corresponding photographs are recorded on the device. Indeed, each device/client can collect data to perform federated learning. The diagram in Figure 5 illustrates the application running at the client level.
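The following sketch approximates the client-level loop of Figure 5 under stated assumptions: a Haar cascade stands in for the OpenCV face detector [40] (we do not restate the exact variant here), the model file name and the (mask, no mask) output order are hypothetical, and only the binary red/green decision is drawn.

```python
import cv2
import numpy as np
import tensorflow as tf

face_det = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
mask_model = tf.keras.models.load_model("mask_detector.keras")  # assumed path

cap = cv2.VideoCapture(0)                          # 30 fps webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_det.detectMultiScale(gray, 1.1, 5):
        face = cv2.resize(frame[y:y + h, x:x + w], (224, 224))
        face = tf.keras.applications.mobilenet_v2.preprocess_input(
            face.astype("float32"))[np.newaxis]    # batch of one, in [-1, 1]
        mask_p, no_mask_p = mask_model.predict(face, verbose=0)[0]
        color = (0, 255, 0) if mask_p > no_mask_p else (0, 0, 255)  # BGR
        cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
    cv2.imshow("mask detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```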
4.4. Dataset
We took the face mask-wearing dataset from Kaggle [41], which contains 12K images divided into test, train, and validation folders. The dataset includes images with different resolutions and orientations of individuals wearing the mask properly, improperly, or not wearing a mask at all.
For this paper’s experiments, we randomly select 6000 images divided equally into two groups, with and without masks, each containing 3000 images. The dataset includes individuals from different ethnic backgrounds, skin tones, genders, and facial structures; this diversity is crucial to develop models that perform well in different demographics and minimize bias. The photos show the faces of men and women from various regions around the world, ensuring a comprehensive representation. The with-mask group includes face masks commonly used during the COVID-19 pandemic, such as surgical masks, standard disposable blue/green masks, N95/FFP2 respirators, high-filtration masks (with and without exhalation valves), cloth masks, reusable masks (made from different fabrics, colors, and patterns), KN95 masks (similar to N95 but with unique design variations), and fashion masks (masks with custom designs, logos, and patterns). In this way, we ensure that models trained on the dataset can accurately recognize faces in various scenarios, enhancing their reliability and fairness. To significantly mitigate demographic bias, we established an equal distribution of 50% female and 50% male participants, ensuring that each of the four main visible-ethnicity categories provided approximately 25% of the images.
4.5. Data Preprocessing
Image preprocessing is an essential step in preparing data for neural network models, especially with a pre-trained neural network architecture such as MobileNetV2. This step ensures that the input data meet the necessary format and scale requirements, thus improving model performance and convergence. In Figure 6, we provide a breakdown of each step, focusing on how images are prepared as MobileNetV2 input.
The images are loaded and resized; each image is loaded from the specific directory path. The target size is set to 224 × 224 pixels, which resizes all images to a uniform input shape—this size is standard for many prebuilt models.
We converted each image from a PIL (i.e., Pillow, a Python image processing library) image object to a NumPy array; this transformation allows for further image manipulations and operations. After completing these steps, we begin the preprocessing that prepares the image array specifically for MobileNetV2, which rescales the pixel values from their original range of [0, 255] to the range [−1, 1] required by the network, thus normalizing the input data.
In the next step, we extract the label for each image from its directory path. We transform the class labels into binary vectors (label binarization) and then into one-hot codes, which are required for categorical classification models; this allows MobileNetV2 to assign probabilities to each class.
The final step splits the dataset, with 80% of the data used for training and 20% for testing. To perform all these operations, we use the TensorFlow image preprocessing utilities; for label binarization, one-hot encoding, and data splitting, we use the scikit-learn library.
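Below is a compact sketch of this preprocessing pipeline. The directory layout (dataset/with_mask, dataset/without_mask) is an assumption, and we use Keras’s to_categorical for the one-hot step alongside scikit-learn’s LabelBinarizer and train_test_split.

```python
import os
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.utils import to_categorical

data, labels = [], []
for label in ("with_mask", "without_mask"):
    folder = os.path.join("dataset", label)          # label from directory name
    for name in os.listdir(folder):
        img = tf.keras.utils.load_img(
            os.path.join(folder, name), target_size=(224, 224))  # load + resize
        arr = tf.keras.utils.img_to_array(img)                   # PIL -> NumPy
        # Rescale pixel values from [0, 255] to [-1, 1] for MobileNetV2.
        data.append(tf.keras.applications.mobilenet_v2.preprocess_input(arr))
        labels.append(label)

lb = LabelBinarizer()
y = to_categorical(lb.fit_transform(labels))  # binarize, then one-hot
X = np.array(data, dtype="float32")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)  # 80/20 split
```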
4.6. Assessment Metrics
We evaluated the performance of our model with the accuracy metric; as shown in Table 1 (column ‘Results’), previous work mostly used this metric as the most relevant. Indeed, for our problem, namely, correctly labeling a person as properly wearing or not wearing a face mask, what matters is the ratio of correctly labeled persons to the total number of assessed individuals, hence accuracy.
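For reference, accuracy has the standard definition in terms of true/false positives (TP, FP) and true/false negatives (TN, FN):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$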
6. Discussions and Conclusions
This paper introduces a novel federated learning framework to address the challenge of creating an efficient and scalable face mask detection system. The core motivation is to avoid the potential public health disasters and severe disruption associated with epidemic respiratory infections; in this context, we use FL to circumvent the difficulties associated with centralized data collection and model training (particularly in the context of heterogeneous data distribution across different populations) and to improve data privacy while still producing a highly accurate model. To the best of our knowledge, this is the first mask detection system to integrate MobileNetV2 with a federated learning structure, operating end-to-end on Raspberry Pi-like devices, retaining raw images locally, and addressing the privacy and bandwidth challenges faced by prior centralized systems.
Our federated learning mask recognition system achieved a remarkable training accuracy of 98% after eight rounds of communication with 25 clients and two epochs on a Raspberry Pi 4, outperforming almost all the methods listed in Table 1. The only exception is the last method in the table, which achieved a slightly higher accuracy of 98.7%. However, it is important to note that this model was trained in a centralized environment, while our model works in a decentralized environment. In contrast to the other approaches, which utilized high-performance hardware such as Intel i7 CPUs or Google Colab for training, our solution showcases outstanding resource efficiency on the Raspberry Pi 4. Moreover, our FL model prioritizes privacy by retaining data on client devices, making it perfectly suited for low-resource and privacy-critical applications. By training decentrally, our model learns from a much wider variety of data, resulting in a significantly more robust model. Nevertheless, issues such as early-round loss variations indicate a need for further optimization, in contrast to centralized methods, which rely on more powerful hardware but do not address privacy concerns.
6.1. Performance
The experiments demonstrate the accuracy of our FL mask detection framework. We showed that the accuracy of the FL model progressively increased with the number of communication rounds, indicating that more frequent interactions between clients and the server contribute positively to model performance. Indeed, the model achieved a strong accuracy of 98% after eight communication rounds with 25 clients and two epochs; this significantly improved upon the accuracy observed in earlier rounds and with higher client counts. These results confirm the assumption of the FL paradigm: even if fewer clients in each round may initially slow down the model’s learning, increased communication brings more precise weight updates, thus increasing accuracy over time.
Even with a relatively small number of communication rounds, our mask detection FL model demonstrated substantial accuracy, which, although slightly lower than the centralized model, is still highly effective for practical applications. This means that FL is a viable alternative to centralized training, particularly when dealing with data privacy issues, heterogeneous environments, and hardware limitations.
However, our results also emphasize specific challenges inherent in FL, such as variability in performance based on the number of clients and communication rounds. Fluctuations in loss values (especially in the earlier rounds) imply that there may be inefficiencies or imbalances in how the model aggregates information from different clients. These variations suggest the need for further investigation into optimization techniques that could stabilize the training process and ensure more consistent performance across different configurations.
By implementing the MobileNetV2 architecture on resource-constrained devices such as the Raspberry Pi, we demonstrate that it is possible to achieve high accuracy in mask detection while preserving data privacy and accommodating the diverse anthropometric traits of a broad population.
6.2. Epidemic Respiratory Infections Scenario
Traditional centralized models require large-scale data collection for training at the server level, thus exposing individual data to privacy issues. Furthermore, communication with a central server entails additional logistic problems. In contrast, FL uses local training at the device level, reducing individual data exposure and thus enhancing individual privacy. As such, when used in public health emergencies, such as epidemic respiratory infections, the FL architecture allows institutions to monitor mask compliance safely, a crucial factor in public health response where trust and privacy are paramount.
The FL model is compatible with smaller-scale and low-power hardware devices, such as the Raspberry Pi, making it a cost-effective and accessible choice for organizations in all sectors, including public health agencies, schools, businesses, retail stores, and more. Because it has reduced hardware requirements, the FL mask detection model can be deployed in various public spaces, including work environments and educational institutions, with minimal costs. In addition, the distributed approach to private mask detection enables areas with limited infrastructure and resources to participate, enhancing this solution’s reach and inclusivity.
Furthermore, the decentralized nature of the FL model supports scalability in diverse regions and heterogeneous situations. Its capacity to be expanded by adding more devices makes it well suited to geographic scalability and inclusion, as it adapts to real-time local data variations; this enables people from various regions and backgrounds to contribute to and benefit from the system more quickly, making it a democratic solution that can be implemented globally without expensive centralized resources. Indeed, our FL solution fosters fast, equitable access and resilience in managing health crises.
6.3. Limitations
Using federated learning on the Raspberry Pi has limitations, mainly related to low hardware resources. Consequently, the most crucial downside is the training time, which increases significantly due to the limited processor and memory of the device. In a federated learning scenario, where training is performed on multiple devices in parallel, each Raspberry Pi will contribute gradually, slowing down the whole process of aggregating and updating the centralized model. Therefore, an increase in latency is expected when we aim for high accuracy. However, the federated approach can have significant advantages in the long run: using multiple devices that perform parallel training on distributed and varied datasets will achieve a more robust and generalized model.
6.4. Future Work
To reduce loss variability and enhance model accuracy, particularly in the early stages of communication, future research should focus on improving the process for aggregating client updates. To confirm the usefulness and adaptability of this federated learning framework, it should also be applied to various other image recognition tasks. Further, we plan to advance our system in future studies by implementing advanced aggregation techniques to ensure more consistent training and better convergence. Our objectives also include integrating secure aggregation and differential privacy techniques to improve data protection and overall system resilience. Moreover, we can extend this framework to cover additional computer vision tasks, such as emotion or activity recognition, which will underscore its adaptability.