**2. Materials and Methods**

This section presents the main steps in developing an automatic system for measuring the human temperature at sanitary barriers by combining thermography and computer vision technologies.

The Research Ethics Committee of the Federal Institute of Espírito Santo, linked to the National Research Ethics Commission of the Ministry of Health of Brazil, approved this research under the Certificate of Presentation and Ethical Appreciation (CAAE) 33502120.2.0000.5072, opinion number 4.180.201, on 29 July 2020.

The volunteers who participated in this research were informed about the objectives, the scope of their participation, the confidential treatment of their data, and that results would be disclosed only in consolidated, statistically grouped form. All participants provided written consent.

The inclusion criteria were being 18 years of age or older and signing the free-participation consent form, with no burden or bonus for the volunteers or researchers and with the possibility of withdrawing from the study at any time.

#### *2.1. Fever and Human Thermography*

Fever occurs when there is an increase in the body's thermal threshold, usually maintained at around 37 °C, triggering metabolic responses of heat production and conservation, for example, shivering and peripheral vasoconstriction. These responses raise the body temperature to the new threshold. After the fever resolves or is treated, the threshold returns to baseline and heat-loss processes begin, e.g., peripheral vasodilation and sweating [17].

The surface temperature threshold for determining whether a patient is in a febrile state varies among different authors. However, the most widely adopted thresholds are 37.5 °C [18–20] and 38.0 °C [21–23].

However, the surface temperature of the human body differs from the core temperature, which is the gold standard for diagnosing fever. The surface temperature is typically lower than the core temperature and varies across the regions of the face, the surface commonly inspected at sanitary barriers. Depending on the facial region, nonfever temperatures can range from 32.3 °C to 35.9 °C [24–26]. Therefore, properly identifying the region of interest (ROI) on the human face where the temperature is being measured and applying an adequate threshold leads to a more accurate diagnosis than measuring the maximum face temperature without considering which region of the face is being analyzed.

For this reason, this work adopts the following ROIs: medial palpebral commissure (eye region), temporal (forehead region), and external acoustic meatus (ear region). These regions are recommended in the literature [25,27,28].
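The ROI-aware decision argued for above can be sketched as follows: each region's measurement is compared against that region's own nonfebrile upper bound rather than a single face-wide limit. The per-ROI values below are placeholders chosen within the 32.3–35.9 °C range cited in [24–26], not thresholds determined by this study.

```python
# Illustrative per-ROI nonfever upper bounds (placeholder values, in °C).
ROI_UPPER_C = {"eyes": 35.9, "forehead": 34.5, "ear": 35.5}

def febrile(roi, measured_c):
    """Flag a measurement that exceeds its ROI's nonfever upper bound."""
    return measured_c > ROI_UPPER_C[roi]

print(febrile("eyes", 36.2))   # exceeds the eye-region bound
print(febrile("eyes", 35.0))   # within the eye-region bound
```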

#### *2.2. Infrared Thermography*

In physics, waves are periodic disturbances that maintain their shape as they propagate through space as a function of time. The literature describes visible light, ultraviolet radiation (UV), and infrared radiation (IR) as types of electromagnetic (EM) waves. The spatial periodicity, i.e., the interval between two wave peaks, is called the wavelength, λ, and is given in meters, nanometers, or micrometers. The temporal periodicity, i.e., the time interval between two wave peaks, is denoted as the oscillation period, T, and is given in seconds or its submultiples. The frequency, ν, is the inverse of the period T, with hertz as its unit.
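The quantities above can be related in a short sketch: for an EM wave in vacuum, wavelength and frequency are linked through the speed of light (ν = c/λ), and the period is the inverse of the frequency. The function names are illustrative.

```python
# Relations between wavelength, frequency, and period for EM waves.
C = 299_792_458  # speed of light in vacuum, m/s

def frequency_from_wavelength(wavelength_m):
    """Return the frequency nu (Hz) for a wavelength in meters: nu = c / lambda."""
    return C / wavelength_m

def period_from_frequency(nu_hz):
    """Return the oscillation period T (s), the inverse of the frequency."""
    return 1.0 / nu_hz

# Red edge of the visible range: 780 nm.
nu = frequency_from_wavelength(780e-9)
T = period_from_frequency(nu)
print(f"nu = {nu:.3e} Hz, T = {T:.3e} s")
```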

Figure 1 presents an overview of the most common characteristics of EM waves. Visible light, defined as the range that the light receptors of the human eye can detect, covers a small part of the spectrum, with wavelengths ranging from 380 nm to 780 nm [11].

**Figure 1.** Overview of the electromagnetic wave spectrum. Adapted from [11].

The spectral region with wavelengths in the range of 0.7–1000 μm is generally called the infrared region and is the focus of this work [11]. Infrared radiation is invisible to the human eye and has a long wavelength and low energy [29]. Any body with a temperature above absolute zero (0 K, −273.15 °C) emits infrared radiation, perceived as heat. The amount of radiation emitted by a body depends on its temperature and material properties [11].

Within the infrared spectrum, some bands exhibit particular characteristics that affect their applications. The main ranges are near-infrared (NIR, 0.7–1 μm), shortwave infrared (SWIR, 1–2.5 μm), mid-wave infrared (MWIR, 3–5 μm), long-wave infrared (LWIR, 7.5–14 μm), and very long-wave infrared (VLWIR, 14–1000 μm) [29]. Figure 2 depicts these ranges.

**Figure 2.** Infrared spectral regions. Adapted from [29].

The infrared radiation spectrum bands generally applied in technologies involving thermographic images are MWIR and LWIR [29].
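The band boundaries listed above can be encoded in a small helper that maps a wavelength (in micrometers) to its band name; the edges follow [29], and the gaps between the listed bands (e.g., 2.5–3 μm) simply return no band.

```python
# Infrared band edges in micrometers, as given in the text.
IR_BANDS = [
    ("NIR",    0.7,    1.0),
    ("SWIR",   1.0,    2.5),
    ("MWIR",   3.0,    5.0),
    ("LWIR",   7.5,   14.0),
    ("VLWIR", 14.0, 1000.0),
]

def ir_band(wavelength_um):
    """Return the IR band containing the wavelength, or None if it
    falls in a gap between the listed bands."""
    for name, lo, hi in IR_BANDS:
        if lo <= wavelength_um < hi:
            return name
    return None

print(ir_band(10.0))  # LWIR, the band most thermographic cameras use
```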

Infrared thermal imaging, also called infrared thermography (IRT), is a rapidly evolving technology. Currently, researchers are applying IRT to intelligent solutions in different fields, including condition monitoring, predictive maintenance, and gas detection. Medicine is another area that has benefited from this technology, employing IRT in oncology (breast, skin, etc.), surgery, medication effectiveness monitoring, and, more recently, for acute respiratory syndrome testing applications [30].

Technologies based on IRT can detect the intensity of thermal radiation emitted by objects, since bodies transmit, emit, and reflect infrared radiation. Transmissivity is the ability of a material to allow infrared radiation to pass through it. Emissivity is the capacity of a material to emit infrared radiation. Finally, reflectivity is the capability of a material's surface to reflect radiation, i.e., radiation from the surroundings reflected by the object.

#### *2.3. Machine Learning*

With the high volume of data generated by devices, sensors, and users, machines capable of identifying patterns and assisting in decision making have become essential, with supervised and unsupervised learning being the most widely adopted methods. Reinforcement learning and semisupervised learning may also be used [31].

Deep learning is a set of machine learning technologies that utilize algorithms to detect, recognize, and classify objects and text in images or other documents. One of the leading deep learning architectures is the convolutional neural network (CNN), which is used to solve most image analysis problems [32].

#### 2.3.1. Convolutional Neural Networks

CNNs have been widely applied as image classifiers. They excel at analyzing images and learning abstract representations. A typical CNN has an input layer, an output layer, and several hidden layers. The hidden layers generally consist of a series of convolutional layers. The first convolutional layer learns to identify simple features, while the following layers learn to detect larger and more complex characteristics. Other operations include rectified linear unit (ReLU) activations and pooling, fully connected, and normalization layers. Finally, backpropagation is used for error distribution and weight adjustment [33,34].

Digital images can be represented by a matrix in which each pixel contains one or more values. First, a CNN processes each input image by passing the pixel values through a series of convolution operations with filters (kernels). Then, the results are grouped (pooling) to reduce the matrix dimensions and generate a new, simplified matrix. These operations complete the feature-extraction step. Finally, a vector is created from the feature map and fed to the input layer of a multilayer neural network (fully connected, FC) [35]. Figure 3 presents a simplified diagram of a CNN [35,36].

**Figure 3.** Simplified diagram of the structure of a CNN.
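The feature-extraction step described above can be illustrated from scratch, without any ML library: a single valid 2D convolution with one hand-made kernel, 2×2 max pooling, and flattening into the vector that would feed the FC layers. The image and kernel values are arbitrary examples.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation of a 2D list `image` with `kernel`."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2x2(fmap):
    """Downsample a feature map by taking the max of each 2x2 block."""
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

def flatten(fmap):
    """Turn the pooled map into the vector fed to the FC layers."""
    return [v for row in fmap for v in row]

image = [[1, 0, 0, 2, 1],
         [0, 1, 2, 0, 0],
         [1, 2, 1, 0, 1],
         [0, 0, 1, 1, 2],
         [2, 1, 0, 0, 1]]
kernel = [[1, 0], [0, -1]]   # a tiny hand-made filter
features = flatten(max_pool2x2(conv2d(image, kernel)))
print(features)  # a 4-element feature vector from the 5x5 image
```

In a trained CNN, the kernel values are learned rather than hand-picked, and many kernels run in parallel to produce a stack of feature maps.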

#### 2.3.2. Region-Based Convolutional Neural Networks

Region-based CNNs (R-CNNs) emerged as an improvement on CNNs: they can detect and locate specific objects in an image. The architecture of an R-CNN is similar to that of a CNN but includes an added step that extracts the region containing the object to be detected [37]. Figure 4 presents a simplified diagram of an R-CNN [36,37].

**Figure 4.** Simplified diagram of the structure of an R-CNN.

The R-CNN detector consists of four main steps: candidate box generation, feature extraction, classification, and regression. For candidate box generation, approximately 2000 boxes are proposed in the image using the selective search method. For feature extraction, the CNN extracts the features of each candidate box. In the third step, a classifier determines whether the extracted features belong to a specific class. Finally, the regression step adjusts the position of the bounding box based on the corresponding features [38,39].
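The fourth step above, bounding-box regression, can be sketched with the parameterization used in the original R-CNN work: the box center is shifted by a fraction of the box size, and the width and height are rescaled exponentially. The function name and sample values are illustrative.

```python
import math

def refine_box(box, deltas):
    """Apply R-CNN-style regression offsets (dx, dy, dw, dh) to a box
    given as (x, y, w, h), where (x, y) is the box center."""
    x, y, w, h = box
    dx, dy, dw, dh = deltas
    return (x + dx * w,          # shift center by a fraction of the width
            y + dy * h,          # shift center by a fraction of the height
            w * math.exp(dw),    # rescale width
            h * math.exp(dh))    # rescale height

refined = refine_box((50.0, 60.0, 20.0, 40.0), (0.1, -0.05, 0.0, 0.2))
print(refined)  # center nudged right and up, height enlarged
```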

#### 2.3.3. You Only Look Once Network

According to [36], many improved algorithms have been proposed based on the R-CNN model, each providing different degrees of improvement in detection performance over the original R-CNN.

The You Only Look Once (YOLO) network, proposed by [40], is an object detector pretrained on the common objects in context (COCO) image dataset, which contains RGB (red, green, and blue) images of various object classes. Its main contribution is real-time detection. Additionally, unlike other object detection algorithms, the YOLO network takes an entire image as input. It performs object detection through a fixed-grid regression and consists of 24 convolutional layers followed by two fully connected layers. The network can process images in real time at 45 frames per second (FPS). Furthermore, YOLO produces fewer false positives than similar architectures [41]. In this study, YOLO was used for transfer learning in the training of a specific dataset.

Transfer learning is a technique that takes the structure of a CNN pretrained for one application as the starting point for a new, previously unseen task. The convolutional layers and filters of the feature-extraction stage are reused for the new application. Afterward, changes are made to the FC layer, where the classes of the pretrained network can be removed and/or new classes added to suit the new application. After these changes, only the FC layer needs to be retrained, drastically reducing the effort of training a complete CNN, which demands a high computational cost and a large amount of training data to achieve high performance. In this work, a pretrained structure was fine-tuned with a dataset of 998 images to recognize volunteers' faces. The aim is for the new structure to detect the ROIs on human faces under conditions not originally imposed: volunteers wearing semifacial masks and glasses.
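The setup described above can be sketched schematically, without any ML framework: the pretrained convolutional layers are frozen, the original FC head is dropped, and a new head with the application's own classes is the only part marked for retraining. The layer names are illustrative.

```python
# Schematic model: each layer is a dict with a trainable flag.
pretrained = [
    {"name": "conv1", "trainable": False},   # reused feature extractors
    {"name": "conv2", "trainable": False},
    {"name": "conv3", "trainable": False},
    {"name": "fc_head", "trainable": True},  # original classification head
]

def replace_head(layers, new_classes):
    """Drop the pretrained FC head and attach a new one for the new task."""
    body = [dict(layer) for layer in layers if not layer["name"].startswith("fc")]
    head = {"name": "fc_new", "classes": list(new_classes), "trainable": True}
    return body + [head]

# New head for the face ROIs targeted in this work.
model = replace_head(pretrained, ["eyes", "forehead", "ear"])
trainable = [layer["name"] for layer in model if layer["trainable"]]
print(trainable)  # only the new head is retrained
```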

#### 2.3.4. Optical Character Recognition

Optical character recognition (OCR) is a technology that allows the recognition and extraction of characters in image files to generate analyzable, editable, and searchable data [42]. This technology uses image and natural language processing to solve different challenges [43].

Tesseract is free, open-source OCR software originally developed at Hewlett-Packard Laboratories Bristol and Hewlett-Packard Co., Greeley, Colorado, between 1985 and 1994. From 2006 to 2018, Google improved the software, which is currently available on GitHub [44]. It can recognize text in over 100 languages.

In this study, Tesseract was used to identify the minimum and maximum temperature values in the temperature scale of the analyzed thermographs.
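Once Tesseract returns the text printed on a thermograph's temperature scale, the minimum and maximum values can be pulled out with a simple pattern, as sketched below. The sample string is illustrative, not actual Tesseract output, and the pattern assumes unsigned decimal values.

```python
import re

def scale_limits(ocr_text):
    """Return (min, max) of all unsigned decimal numbers found in the
    OCR text, or None if no number is found."""
    values = [float(v) for v in re.findall(r"\d+(?:\.\d+)?", ocr_text)]
    return (min(values), max(values)) if values else None

# Illustrative OCR output for a thermograph's scale labels.
limits = scale_limits("35.1\n28.4")
print(limits)  # (28.4, 35.1)
```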
