1. Introduction
In December 2019, an outbreak of pneumonia of unknown origin was reported in Wuhan, China. After several tests on the associated virus, it was concluded to be a new coronavirus, related to SARS-CoV. On 12 March 2020, the WHO declared a global emergency and classified the outbreak as a pandemic, with nearly 125,000 cases reported across more than 118 countries at that point. Since then, strict measures have been implemented worldwide to contain the spread of the virus and break chains of contagion, given the virus's high transmissibility and devastating effects, especially in people with chronic diseases, weakened immune systems, or advanced age. These measures severely affected all sectors, from the closure of the overwhelming majority of public establishments to bans on movement on public roads. The main symptoms of the disease are fever, cough, headache, fatigue, and loss of taste, and its transmission through droplets released from the nose and mouth required rules of physical distancing and the mandatory use of masks in all activities involving direct or indirect contact [1].
In this way, the need arose to develop advanced systems capable of monitoring people's behavior in an optimized way, especially in places that concentrate large numbers of people in small areas, thus reducing as much as possible the spread of the virus within the community.
With the easing of restrictions, levels of mobility and concentration of people, especially in public spaces and shopping areas, gradually began to increase again. However, the persistent presence of the virus means that behavior must remain cautious and every precaution must be taken so that the number of new infections keeps decreasing and normality can return as quickly as possible. The fact that many infected people are asymptomatic also contributes to careless attitudes and negligent behavior, mainly associated with not wearing a mask. It is imperative to monitor these behaviors and risk factors. Since this type of management is quite complicated to conduct with human resources (e.g., at the entrances of shopping areas, where there are multiple entry points and a large influx of people simultaneously), it is necessary to adopt methodologies that allow this monitoring to be performed in a simpler and more optimized manner.
This article presents the study and implementation of algorithms that allow the real-time identification of risk factors and behaviors, such as detecting the presence or absence of masks on people, as well as the precise measurement of their body temperature, to identify possible cases of infection. The work can be divided into two distinct modules: (A) detection of the presence or absence of masks on people in places where their use is mandatory, and (B) a targeted temperature measurement to detect situations where people are in a feverish state, a key symptom of SARS-CoV-2 infection. Moreover, the developed algorithms are suitable for integration into an embedded system, enabling the deployment of a market-ready monitoring product.
The main contributions of the paper are as follows:
- 1. A methodology for the generation of hybrid datasets with masks added on top of real samples from public datasets (Section 3.1.1);
- 2. An RGB dataset with synthetic masks added on top of public datasets; the MoLa RGB CovSurv dataset [2] was made publicly available;
- 3. An IR dataset with information on the presence of the caruncle, masks, and glasses; the MoLa IR CovSurv dataset [3] was made publicly available;
- 4. State-of-the-art object detectors and keypoint face detectors were trained and evaluated using a hyperparameter genetic search algorithm, i.e., Evolve. Considering the highest precision and lowest computational requirements, two models were selected: YOLOv5 Small for RGB and IR mask and glasses detection, and a keypoint detector with a Resnet-50 backbone for caruncle detection in IR images.
These algorithms can be implemented in an embedded system with RGB cameras and installed as a monitoring system to assist with controlling the entrance of crowded establishments, replacing the in-person task of measuring body temperature. The architecture of the proposed solution is shown in Figure 1.
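As a rough illustration of how such a monitoring system could combine the two modules, the sketch below merges the detector's per-person labels and a temperature reading into an entry decision. The function name, label strings, and the 37.5 °C fever threshold are illustrative assumptions, not the paper's actual interface:

```python
def assess_entry(detections, temperature_c, fever_threshold=37.5):
    """Combine per-person mask labels and a temperature reading into an
    entry decision.

    `detections` is a list of class labels from the mask detector
    (here assumed to be the strings 'mask' / 'no_mask'); the threshold
    is an illustrative fever cut-off, not a clinical recommendation.
    """
    # Require at least one detection, and every detected face masked.
    mask_ok = bool(detections) and all(label == "mask" for label in detections)
    if not mask_ok:
        return "deny: mask missing"
    if temperature_c >= fever_threshold:
        return "deny: elevated temperature"
    return "allow"
```

In a deployed pipeline, this decision step would run after each camera frame is processed by the two selected models.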
The paper is organized as follows. First, the state of the art is presented regarding deep-learning-based algorithmic solutions for the use cases at hand (i.e., RGB mask detection and IR keypoint detection).
In the implementation section, for the RGB mask detection, a collection of public datasets was assembled. Moreover, due to the lack of contextualized samples, a synthetic data generation toolchain was developed to produce the MoLa RGB CovSurv dataset [2].
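As a rough illustration of what such a toolchain does, the sketch below paints a mask-colored patch over the lower face of an image using previously detected facial landmarks. The landmark names and geometry heuristics are illustrative assumptions; the actual toolchain warps textured mask templates of several types rather than a flat patch:

```python
import numpy as np


def overlay_synthetic_mask(image, landmarks, color=(120, 144, 156)):
    """Paint a rectangular 'mask' patch over the lower face of an
    (H, W, 3) uint8 array.

    `landmarks` holds hypothetical five-point detections
    ('left_mouth', 'right_mouth', 'nose', ...) as (x, y) tuples.
    """
    img = image.copy()
    lx, ly = landmarks["left_mouth"]
    rx, ry = landmarks["right_mouth"]
    nx, ny = landmarks["nose"]
    pad = int(0.6 * (rx - lx))          # extend beyond the mouth corners
    top = ny + (ly - ny) // 3           # start just below the nose tip
    bottom = min(img.shape[0], max(ly, ry) + 2 * pad)
    x0, x1 = max(0, lx - pad), min(img.shape[1], rx + pad)
    img[top:bottom, x0:x1] = color      # fill the patch with the mask color
    return img
```

The corresponding bounding-box label for the "mask" class can be derived directly from the patch coordinates, which is how such a toolchain produces labels automatically.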
For the IR algorithmic development, the same procedure was followed: publicly available datasets were used to create a pool of samples with extra label information (i.e., caruncle, mask, and glasses positions), forming the new MoLa IR CovSurv dataset [3].
Several evaluations were performed for RGB and IR detection use-cases using the generated datasets, where YOLOv5 was used as the main object detector, and a keypoint detector, based on the Resnet-50 backbone, was used for the caruncle detection.
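Evaluations of this kind rest on the intersection-over-union (IoU) overlap between predicted and ground-truth boxes, which underlies the mAP metrics reported later. A minimal sketch of the standard computation for axis-aligned boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # Union = sum of areas minus the overlap.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A detection counts as a true positive at mAP_0.5 when its IoU with a ground-truth box of the same class is at least 0.5.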
Finally, results are presented and discussed, making it possible to select the best algorithms with the highest precision and lowest computational requirements.
Figure 2 summarizes the entire development pipeline of this article.
2. Related Work
Human mask detection in a surveillance scenario requires an approach similar to those used in object detection methodologies. There are several studies focused on object detection that can be applied to various topics and are worth considering for the task of mask detection. The authors of [4,5,6] developed the R-CNN family of algorithms, which detect different regions of interest in the image while using a CNN to classify the presence of the object in each region. More recently, the YOLO [7] object detection family, continued by YOLOv2 [8], YOLOv3 [9], YOLOv4 [10], and YOLOv5 [11], provides a more accurate and faster method than the R-CNN family. Most recently, several object detection algorithms have been used for the sole purpose of mask detection in a COVID-19 context. Jiang et al. [12] proposed a one-stage detector, achieving state-of-the-art results on a public face mask dataset. In the same context, Loey et al. [13] used YOLOv2 with a Resnet-50 backbone on two publicly available medical mask datasets, reaching an average precision of 81%. Alternatively, the authors of [14] used a single-shot detector with a MobileNetV2 backbone for the sole purpose of detecting masks in a surveillance scenario; public datasets with real and synthetic samples were used for the algorithmic development, achieving 92.64% accuracy with 64 ms of inference time.
For the detection of facial points, an important requirement for locating the caruncle in human faces, several state-of-the-art algorithms have been developed. The first efficient algorithm for face detection in images was presented in 2001 by [15]. Later, in 2015, the authors of [16] presented a cascaded CNN model using three distinct CNNs (12-net, 24-net, and 48-net), in which a gradual analysis of the image is performed: initially, several small boxes referring to certain facial elements are generated, and throughout the process, dimensional adjustments and calibrations are made until the face is identified as a whole. Sun et al. [17] presented an algorithm consisting of three levels of cascaded CNNs for the detection of the five main facial points: Left-Eye Center (LE), Right-Eye Center (RE), Nose Tip (N), Left-Mouth Corner (LM), and Right-Mouth Corner (RM). It is a supervised approach: given the bounding box of a face, the location of the respective points is predicted. Haavisto et al. [18] presented a DBN-based algorithm to identify 15 facial points in grayscale images. Longpre et al. [19] presented an approach to predict facial features in grayscale images; the algorithm consists of a mixture of convolutional layers based on the LeNet and VGG CNN architectures and, given an input image, returns the (x,y) coordinates of 30 facial points. Agarwal et al. [20] presented NaimishNet, an adaptation of the LeNet architecture for identifying facial features.
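Many modern keypoint detectors, including the heatmap-based heads on Resnet backbones used later in this paper, predict one heatmap per facial point and decode the coordinate from the per-channel maximum. A minimal decoding sketch, assuming a (K, H, W) array of heatmaps is already available:

```python
import numpy as np


def decode_keypoints(heatmaps):
    """Decode (K, H, W) heatmaps into (K, 2) keypoint coordinates (x, y)
    by taking the per-channel argmax."""
    k, h, w = heatmaps.shape
    # Flatten each channel and locate its maximum response.
    flat = heatmaps.reshape(k, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat, (h, w))
    return np.stack([xs, ys], axis=1)
```

Sub-pixel refinements (e.g., interpolating around the peak) are common in practice but omitted here for brevity.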
Several studies were already developed to monitor risk behavior in an attempt to mitigate the spread of COVID-19.
The author of [21] proposed a vision-based monitoring and warning approach to enforce social distancing (SD), shown to be effective at preventing the spread of COVID-19. In this study, a real-time, vision-based system is presented that can detect SD violations and send nonintrusive audio-visual cues using recent DL models. A critical value of social density was defined, and the authors showed that the probability of an SD violation can be kept close to zero if the pedestrian density is kept below this value. The proposed system is also ethically fair: it does not record data or target individuals, and no human supervisor is present during operation. The system was evaluated on real-world datasets.
The author of [22] proposed a detection and diagnosis system using IoT-based smart glasses that can automatically and quickly detect COVID-19 from thermal images. The proposed design can perform face detection for suspected COVID-19 cases among crowds with high body temperatures and adds information on the visited locations of suspected virus carriers through Google Location History (GLH) to provide reliable data for the detection process.
The authors of [23,24] evaluated the probability of COVID-19 disease through sound analysis. Ref. [23] proposed studying voice (speech) signal processing for screening and early diagnosis of COVID-19, using a Recurrent Neural Network (RNN), specifically its well-known Long Short-Term Memory (LSTM) architecture, to analyze the acoustic characteristics of patients' cough, breath, and voice. The presented study shows low accuracy on voice samples compared with cough and breath sound samples; however, the authors highlight the possibility of increasing voice-test accuracy by expanding the dataset to a larger group of healthy and infected people. Ref. [24] proposes a study that analyzes cough sounds, presenting a reliable tool that can differentiate between different respiratory diseases, which is very relevant in the COVID-19 context.
The authors of [25,26] present DL approaches for detecting whether individuals are wearing face masks. Ref. [25] proposes a system that restricts the growth of COVID-19 by tracking people not wearing a face mask in a smart city network where all public places are monitored by Closed-Circuit Television (CCTV) cameras. When a person without a mask is detected, the corresponding authority is informed through the city network. The system uses a DL architecture trained on a dataset of images of people with and without masks collected from various sources; the trained architecture achieved 98.7% accuracy in distinguishing people with and without face masks on previously unseen test data. Ref. [26] proposes the implementation of a facial mask and social distancing detection model as an embedded vision system, using pretrained models such as MobileNet, ResNet, and VGG classifiers. People violating social distancing or not wearing masks were detected, and after implementation and deployment, the selected model achieved a 100% confidence index.
5. Discussion
In Section 4.1, the object detection algorithms, the YOLOv5 family and FaceMaskDetection-SSD, are evaluated, more precisely for detecting the presence or absence of masks. Although all the algorithms of the YOLOv5 family presented good results, the method selected for the mask detection task is the Small model of the YOLOv5 architecture. This choice is justified by the fact that the different metrics obtained do not change substantially as layers are added in the deeper models, since the task does not present a high degree of complexity: only two distinct classes (with or without mask) need to be detected. The inference times obtained were 0.032 s for the Small model, 0.045 s for the Medium model, 0.062 s for the Large model, and 0.089 s for the Extra-Large model. Thus, the best choice was the lightest model (Small), with an mAP_0.5 of 82.38%.
Figure 11b shows qualitative results obtained on different samples, based on the inference of the selected model. The FaceMaskDetection-SSD method achieves 36.4% mAP_0.5 when inferred on our test dataset. This may be because the model was trained on 7971 samples, a significantly lower number than in our dataset; hence, its inference capability on our test dataset is much lower. Furthermore, the FaceMaskDetection-SSD model has a lower complexity than our lightest model, YOLOv5s, with 1.01 M and 1.9 M parameters, respectively.
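The selection rule applied throughout this discussion, preferring the lightest model when deeper ones add little mAP, can be made explicit. The sketch below is an illustrative formalization of that trade-off; the dictionary field names and the 2-point tolerance are assumptions, not values from the paper:

```python
def select_model(results, map_tolerance=0.02):
    """Pick the fastest model whose mAP_0.5 is within `map_tolerance`
    of the best mAP among the candidates.

    `results` is a list of dicts with hypothetical keys
    'name', 'map50', and 'latency_s'.
    """
    best_map = max(r["map50"] for r in results)
    # Keep only models whose accuracy is close enough to the best.
    eligible = [r for r in results if best_map - r["map50"] <= map_tolerance]
    # Among those, return the one with the lowest inference time.
    return min(eligible, key=lambda r: r["latency_s"])["name"]
```

Applied to the YOLOv5 family above, where the deeper models gain little mAP over the Small model but cost roughly two to three times the latency, this rule reproduces the choice of the Small model.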
Section 4.2 presents models capable of detecting the facial points of interest (using a thermographic camera) so that effective temperature measurements can be carried out as a way to screen for the potential presence of the SARS-CoV-2 virus. This task comprises two distinct steps. In the first step, given that temperature measurements are not possible in the presence of glasses, object detection algorithms capable of detecting not only this object but also masks were implemented (Section 4.2.2); for this glasses and mask detection component, the algorithms of the YOLOv5 architecture were selected. In the second step, for the face point detection component (Section 4.2.3), algorithms whose backbones are CNNs from the Resnet and HrNetv2 architectures were selected. The results obtained by the different algorithms in both steps are quite satisfactory, in the sense that they show practically no improvement with deeper algorithms, since the number of classes and face points to be identified is quite low and the dataset is highly uniform, with very similar samples. Since the goal was to achieve high precision with low computational requirements, the Small model was chosen for glasses and mask detection (corresponding to E5.1, with a precision of 81.86% and an inference time equal to that of the model selected for mask detection, 0.032 s), and the model with a Resnet-50 backbone was chosen for face point detection (corresponding to E9, with a precision of 78.68% and an inference time of 0.024 s).
Figure 11a shows qualitative results from the inference of the algorithms chosen for both tasks.
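Once the caruncle keypoints are located in the thermal frame, the temperature read-out itself reduces to a small neighborhood look-up around each point. The sketch below is a hypothetical post-processing step, assuming a radiometric image already calibrated in degrees Celsius; window size and aggregation by maximum are illustrative choices:

```python
import numpy as np


def caruncle_temperature(thermal_img, keypoints, win=3):
    """Read the maximum radiometric value in a (2*win+1)-pixel window
    around each detected caruncle keypoint and return the hottest one.

    `thermal_img` is an (H, W) array in degrees Celsius;
    `keypoints` is a list of (x, y) pixel coordinates.
    """
    h, w = thermal_img.shape
    temps = []
    for x, y in keypoints:
        # Clamp the window to the image borders.
        x0, x1 = max(0, x - win), min(w, x + win + 1)
        y0, y1 = max(0, y - win), min(h, y + win + 1)
        temps.append(float(thermal_img[y0:y1, x0:x1].max()))
    return max(temps)
```

Taking the maximum over a small window makes the reading robust to keypoint localization error of a few pixels, since the caruncle is the hottest facial region in a well-calibrated frame.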
6. Conclusions
This article presents a system capable of detecting risk behaviors and risk factors within the scope of the COVID-19 pandemic, more specifically through the implementation of algorithms for the detection of masks in public spaces, as well as targeted temperature measurements for the detection of possible cases of fever. Initially, a survey of existing state-of-the-art algorithms suitable for the proposed tasks was carried out; the selected algorithms belong to the fields of object detection and keypoint detection. The first task was mask detection in RGB images. As a basis for training the selected algorithms, it was necessary to create a dataset and generate the respective labels. Given that the number of existing samples in this area is still scarce, a tool capable of applying synthetic masks to RGB images was developed, using pretrained models to locate the faces present and their respective facial points; based on this information, a mask of one of the available types and textures is applied to the facial points where it should be placed. The labels associated with this dataset were sourced automatically from the pretrained models used. Subsequently, using this dataset, multiple algorithms based on the YOLOv5 architecture were evaluated. After training and evaluation, all models obtained good results; however, the Small model was selected (with a precision of 71.01%). This choice is justified because the obtained metrics are very similar despite the use of deeper models, mainly because the required degree of complexity is not high, as only two different classes need to be detected. Another reason is the Small model's balance between precision and real-time performance compared with the other tested models.
For the temperature measurement component, it was also necessary to create a dataset consisting of thermographic images and generate the respective labels. In this case, algorithms were implemented both for mask and glasses detection and for the detection of facial points associated with the caruncle area, where temperature measurement is performed with the greatest accuracy. The labels were generated semiautomatically, i.e., partly from the pretrained models used in the previous task and partly through manual labeling, image by image. For the mask and glasses detection task, the models of the YOLOv5 architecture were again tested, while for the face point detection task, keypoint detection algorithms were implemented that differ from each other in their backbones, which correspond to variants of the Resnet and HRNetv2 CNNs. The YOLOv5 Small algorithm was chosen (with a precision of 81.86%), as well as the algorithm whose backbone is the Resnet-50 architecture (with a precision of 78.68%). As in the mask detection component, these choices were based on the trade-off between the obtained metrics and real-time performance.