1. Introduction
The automatic and remote monitoring of the environment, vital signs and human behaviour has increasingly wide applications in improving quality of life and the care of elderly, sick or disabled people [1,2,3,4,5,6]. These solutions can also increase the comfort of patients or the safety of people working in difficult conditions. Such systems can use different types of devices to observe and record parameters, phenomena and the surroundings. Those most commonly used are acceleration sensors, motion sensors, sensor networks, cameras and advanced vision systems using visible light cameras and thermovision. Typical threats to disabled or elderly people include falls, cuts, contusions and burns. Early detection of such situations can save the health or even the life of the injured person. Direct contact with temperatures dangerous to health or life can have particularly severe consequences. The process of treating burns is usually difficult and long lasting, so it is worth preventing this type of accident. Such situations occur constantly in everyday life, and people with various diseases or health problems may become their potential victims. Since burns can occur at a temperature of 70 °C within 1 s [
7], it seems reasonable to use a warning system that could protect people at risk. Given how quickly burns occur, this article proposes an autonomous mobile system for the early detection of danger and warning against burns. There are several devices on the market that can be used to build such a system (Figure 1). The proposed system uses the Flir One Pro [8] thermal imaging camera working with a smartphone. The mobile camera records a thermal image directly in front of the user and transmits this information to the mobile device, where image analysis, threat detection and warning take place. The Flir One Pro captures thermal images with a resolution of 320 × 240 (images are scaled up) and a maximum temperature range of up to 400 °C. The device is equipped with its own battery and a built-in visible light camera with a resolution of 1440 × 1080. The FLIR ONE cameras can be connected to Android and iOS devices via a micro USB or USB-C connector and can record images at eight frames per second.
A better solution is to use integrated devices in the form of goggles with a built-in IR camera, but in this case, it is important that the equipment used has sufficient computing power or is able to communicate with mobile devices, as is the case with the Flir One Pro camera. The thermal imaging camera integrated with the goggles presented in
Figure 1 (SATIR Thermal Vision 256) has an optical resolution of 256 × 192 and a temperature range of −20 °C to +550 °C. The set supports communication in the 4G standard, which makes it possible to expand its capabilities and use external image analysis algorithms. Currently, however, the device cannot be used directly in the proposed system.
To detect the risk of burns, images transmitted from a thermal imaging camera (in the form of a temperature matrix in degrees Celsius) showing the environment directly in front of the user are used. This makes it possible for the application to react directly, based on the temperature values in the image, which may additionally determine the level of threat. The thermal imaging camera, mounted on glasses or on the chest, captures the thermal image and transmits it to the mobile monitoring application, which analyses the image and determines the threat level. The algorithm implemented in it (discussed below) detects arms or hands in the camera's field of view and detects high temperatures (over 70 °C). The system analyzes thermal images autonomously using deep learning methods based on a CNN, which enables high effectiveness and additional classification of threats when the user's arms or hands appear in the monitored area. The use of CNNs should significantly increase the effectiveness of the system relative to similar solutions.
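The high-temperature stage described above reduces to a threshold test on the temperature matrix. A minimal NumPy sketch (the 70 °C threshold and the 240 × 320 frame shape follow the text; the function name and the synthetic frame are illustrative):

```python
import numpy as np

BURN_THRESHOLD_C = 70.0  # burns can occur within ~1 s at this temperature

def detect_hot_regions(frame_c, threshold_c=BURN_THRESHOLD_C):
    """Return (mask, danger flag, max temperature) for a thermal frame.

    frame_c: 2-D array of temperatures in degrees Celsius, e.g. the
    240 x 320 matrix delivered by the mobile thermal camera.
    """
    mask = frame_c >= threshold_c
    return mask, bool(mask.any()), float(frame_c.max())

# Synthetic 240 x 320 frame: room temperature with one hot spot (e.g. a pan).
frame = np.full((240, 320), 22.0)
frame[100:120, 150:170] = 95.0

mask, danger, t_max = detect_hot_regions(frame)
```

In the full system, a positive `danger` flag would trigger the warning block, while the mask feeds the subsequent hand-detection analysis.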
2. Related Works
Since the main task of the proposed system is to warn of hand burns, hand detection is its most critical component. Issues related to hand and arm detection are widely discussed in the literature across various applications. They very often concern gesture recognition [
10], detecting location, tracking hands, building human–robot communication interfaces, controlling devices and systems or manipulating objects in 3D space [
11,
12]. Typical solutions most often use independent image analysis methods (using RGB (conventional cameras), RGB-D (Kinect—Microsoft, Leap Motion—
https://leap2.ultraleap.com/products/leap-motion-controller-2/ (accessed on 3 September 2024)) or IR (Flir One,
https://www.flir.eu/products/flir-one-pro/ (accessed on 3 September 2024)) cameras) but also specialized equipment in the form of gloves communicating with a computer or robot. The appearance of depth sensors on the market has also resulted in 3D analysis methods [
11] allowing for the assessment of hand location in three dimensions. Currently, methods using deep learning and convolutional neural networks are becoming more and more common [
13].
The first group comprises methods based on the analysis of an image in the RGB or YCbCr palette. They use colour information for hand detection and localization, which is the basis for further interpretation. In article [
14], an image is first mapped to the CbCr colour space, and then edge detection is performed using morphological operations. As a result, the algorithm locates areas that meet the colour criteria and designates hand areas. The proposed algorithm obtained the following results: TPR = 94.6% and FPR = 2.8%. The authors did not present details of the studied image set, which makes it difficult to assess the universality of the method. Another example of a hand localization method in visible light is described in paper [
15]. Here, the authors proposed a fast, real-time classification method based on hand shape. The algorithm detects the hand area based on colour, determines the orientation of the hand in a vertical position and then uses shape context analysis, template matching, orientation histograms, Hausdorff distance analysis and Hu moments to determine shape coefficients and classify them. The set contained 499 images, and the method was used for gesture recognition.
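The colour-criterion idea behind methods such as [14] can be sketched as a Cb/Cr range test. The conversion below is the standard BT.601 full-range mapping; the skin bounds are common literature values, not the thresholds of the cited paper:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """HxWx3 uint8 RGB -> YCbCr (ITU-R BT.601, full range)."""
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

def skin_mask(rgb, cb_range=(77, 127), cr_range=(133, 173)):
    """Boolean mask of pixels whose Cb/Cr fall inside typical skin bounds."""
    ycbcr = rgb_to_ycbcr(rgb)
    cb, cr = ycbcr[..., 1], ycbcr[..., 2]
    return ((cb >= cb_range[0]) & (cb <= cb_range[1]) &
            (cr >= cr_range[0]) & (cr <= cr_range[1]))

# One skin-like pixel and one green background pixel:
sample = np.array([[[200, 140, 120], [0, 200, 0]]], dtype=np.uint8)
mask = skin_mask(sample)
```

The mask produced this way typically needs morphological cleanup (as in [14]) before the hand region can be delineated reliably.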
In recent years, with a significant drop in the prices of thermal imaging cameras and the rapid development of thermovision, the number of applications of this imaging method has been increasing in many areas. This opens up completely new possibilities and makes it possible to extend methods developed for visible light. Hand detection in thermal images makes it possible to eliminate problems typical of visible light images, e.g., weak or uneven lighting, the impact of skin colour or a complex coloured background. Exemplary solutions are based on a model of the hand area or on brightness information in the analyzed area (model-based or appearance-based approaches). The article [
16] described a hand segmentation method based on statistical texture features, including, among others, mean brightness, standard deviation, entropy, homogeneity and contrast. Image segmentation was performed using k-means cluster analysis, and the resulting regions were then analyzed using the above-mentioned texture features, which allowed for the classification of regions located in the hand area. The algorithm was used in the treatment of rheumatoid arthritis. In paper [
17], an adaptive hand segmentation algorithm was proposed. First, a Gaussian model representing the background was prepared, from which the approximate area of the hand was determined. In the next step, five areas were generated inside the hand area and temperature distribution models were created for them. After the image was analyzed by the five prepared models, the resulting masks were combined into a single mask. Effectiveness was determined as the ratio of the bounding box marked by the expert to the bounding box determined by the algorithm, and amounted to 86%.
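In the spirit of the texture-based approach of [16], per-pixel feature vectors can be clustered with k-means. The sketch below is purely illustrative: it uses only local mean and standard deviation plus raw temperature, and a minimal two-cluster k-means, not the cited authors' full feature set or implementation:

```python
import numpy as np

def local_stats(img, win=5):
    """Per-pixel local mean and standard deviation over a win x win window."""
    pad = win // 2
    p = np.pad(img.astype(np.float64), pad, mode="edge")
    w = np.lib.stride_tricks.sliding_window_view(p, (win, win))
    w = w.reshape(img.shape[0], img.shape[1], -1)
    return w.mean(-1), w.std(-1)

def kmeans2(features, iters=20):
    """Minimal two-cluster k-means, seeded with the coolest/warmest samples."""
    centers = features[[features[:, 0].argmin(), features[:, 0].argmax()]]
    for _ in range(iters):
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(2):
            if (labels == k).any():
                centers[k] = features[labels == k].mean(0)
    return labels

def segment_hand(img):
    """Cluster per-pixel (temperature, local mean, local std) features;
    returns a mask where 1 marks the warmer, hand-like cluster."""
    mean, std = local_stats(img)
    feats = np.stack([img.astype(np.float64).ravel(),
                      mean.ravel(), std.ravel()], axis=1)
    labels = kmeans2(feats).reshape(img.shape)
    if img[labels == 0].mean() > img[labels == 1].mean():
        labels = 1 - labels
    return labels
```

On a synthetic frame with a ~33 °C region over a ~25 °C background, the warmer cluster coincides with the hand-like region; real thermal frames require the richer texture features used in [16].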
Another application of hand segmentation in thermal imaging is biometrics [
18]. The goal in this publication was to segment the hand and determine the vein system. The authors proposed several approaches based on thermovision and a combination of visible light and thermovision. In the case of thermal images, the active shape model method was used. In cases where segmentation for visible light images was combined with thermal images, masks established in visible light were used in thermal imaging to preselect the hand region. In this scenario, it was important to match the mask obtained from the visible image to the shape of the hand in the thermal image.
In previous studies, the authors proposed a hand detection method with an SVM classifier using the proposed geometric features and texture parameters in the hand area [
19]. A set of over 5100 images was prepared, containing the user's hands and arms in various situations and with objects at elevated temperatures. Hand detection accuracy reached Acc = 0.89, and high-temperature detection was correct for all images in the test set.
In the hand detection process, thermal images are also used in combination with visible light and depth image information [
20]. In the case of 3D methods, information about colour and depth (RGB-D) is used. The use of a thermal camera in this case increased efficiency by reducing the impact of variable external lighting on segmentation results. The authors annotated ground truth bounding boxes for RGB, depth and thermal images, and the Fast R-CNN object detector was used to analyze them. The algorithm was trained on a set of several thousand images (2000 RGB images, 1000 thermal images and 1000 depth images) recorded with two cameras and a depth sensor. The authors noted that RGB images had the greatest impact on detection efficiency, followed by depth images and then thermal images, but they did not report the efficiency values obtained by the proposed method. Another example is publication [
21], where deep neural networks were proposed for the detection and classification of gestures on sequences of combined colour, depth and stereo-IR images. The proposed 3D recurrent network was trained on the Sports-1M set containing video sequences of 487 types of sports activities. The algorithm did not directly detect hands, and the gesture classification efficiency reached 98.2%.
In publication [
22], the authors applied a deep convolutional network to recognize hand gestures in static images. It was noted that hand detection against a complex background is difficult and classic methods are not always effective; therefore, the proposed model functioned simultaneously as a detector and classifier. Relatively few images were used: 1600 training and 400 test images. Gesture classification efficiency reached approximately 94.7%.
Another method related to hand analysis was proposed in paper [
23] where the goal was to locate the hand and detect skin in visible light images. The performance of RCNN [
24] and Fast-RCNN [
25] combined with skin segmentation was compared on several image sets (over 13,200 images containing hands). Hand detection efficiency on various image sets reached maximum values of 96–97%. As a direction for further improvement, the authors indicated increasing robustness to poor lighting, shadows and image blur, because on other image sets the effectiveness dropped to as low as 30–40%. In publication [
26], the aim was to detect the driver’s hands while using a mobile phone or when they were placed on the steering wheel. The authors proposed a modified version of Fast-RCNN for hand, smartphone and steering wheel detection. Then, using geometric relationships, the system determined whether the driver had his hands on the steering wheel or was using a smartphone. The effectiveness of smartphone detection reached 94%, and the detection of hands on the steering wheel was effective in 93% of cases.
Another example of a hand detection algorithm combined with position determination using CNN was presented in publication [
27]. The authors suggested that methods of detecting hands and their position and orientation also help computers understand human intentions and provide guidance for more complex tasks. The proposed CNN model was tested on selected sets (e.g., the Oxford Hand Dataset—13,050 hand images) and achieved a sensitivity of 99–100% at the stage of generating hand region proposals for position analysis.
In publication [
28], the authors tested many detectors and, as a result, proposed a hand recognition method using the Yolov7 and Yolov7x models. The Oxford Hand Dataset was again used for testing. The following results were achieved: 84.7% precision and 79.9% recall.
In light of the research and examples described above, a hand detection method based on deep learning and convolutional neural networks is proposed. A set of over 21,000 thermal images and a model based on a 15-layer convolutional network were prepared. The network was optimized, and a set of hyperparameters is proposed to ensure high training efficiency and speed. The system presented here fully and automatically analyzes images obtained from a mobile thermal imaging camera and detects hot objects as well as the risk of burning hands. Its task is to increase the safety of visually impaired or blind people during their everyday activities. The proposed system is an innovative application of thermal imaging, as there are no similar solutions on the market combining mobile technologies and thermal image analysis using a CNN. The main reason for choosing a CNN was its well-confirmed high effectiveness in image classification, object detection and semantic segmentation, which results from the image analysis mechanism used (based on the hierarchical arrangement of convolutional layers and the use of receptive fields) and effective, automatic feature extraction [
29,
30]. The use of a CNN was made possible by preparing a set of over 21,000 images.
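As a schematic illustration only, a small convolutional classifier for "hands present / absent" on single-channel thermal frames might look as follows in PyTorch; the layer counts and filter sizes below are placeholders, not the tuned 15-layer architecture described in the paper:

```python
import torch
import torch.nn as nn

class HandNet(nn.Module):
    """Illustrative small CNN for 'hands present / absent' classification of
    single-channel thermal frames; layer sizes are placeholders, not the
    paper's tuned 15-layer architecture."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pooling
        )
        self.classifier = nn.Linear(64, 2)  # two classes: hands / no hands

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = HandNet()
logits = model(torch.zeros(1, 1, 240, 320))  # one 240 x 320 thermal frame
```

The stacked convolution/pooling layers realize the hierarchical receptive-field mechanism mentioned above; global average pooling keeps the classifier head independent of the exact input resolution.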
4. Experiments and Results
The results of hand detection and threat level classification obtained using the developed algorithm are presented below. Hot object detection is easy to accomplish using classic image segmentation methods (e.g., thresholding), so there are no requirements regarding the number of training images. Detection of the risk of burns is based on the CNN, and in this case the size of the image set affects the results. Depending on the number of images in the examined set, the algorithm achieved different burn-risk detection effectiveness: for 10,590 images (50% of the entire set), accuracy reached 0.975; for 15,885 images (75%), Acc = 0.986; and for the full set of 21,180 images, 0.995. The algorithm reached Acc above 0.9 for approximately 5295 images (25% of the entire set). The first block of the algorithm detects hot objects with temperatures that pose a threat to health with 100% efficiency. The threshold value of 70 degrees Celsius makes it possible to detect objects that can cause burns quickly. The algorithm block responsible for burn detection requires a more complex image analysis. Using the proposed CNN-based model, the system can determine with high efficiency whether hands or arms are present in the analyzed image. At the beginning of the study, several different network configurations and structures were tested and compared. To evaluate the model, four typical measures were used: accuracy, precision, specificity and sensitivity, based on the confusion matrix recording the numbers of correct and incorrect hand detections. The set was divided into training and test sets in two configurations (similar to [
19,
21]): 75/25 and 90/10, and the impact on the learning process was verified.
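The four measures follow directly from the confusion-matrix counts; the counts in the example below are illustrative, not the paper's results:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, specificity and sensitivity from
    confusion-matrix counts (TP, FP, TN, FN)."""
    acc = (tp + tn) / (tp + fp + tn + fn)   # overall correctness
    prec = tp / (tp + fp)                   # correct positive detections
    spec = tn / (tn + fp)                   # correct rejections
    sens = tp / (tp + fn)                   # detected among actual positives
    return acc, prec, spec, sens

# Illustrative counts only (not the paper's confusion matrix):
acc, prec, spec, sens = metrics(tp=90, fp=10, tn=80, fn=20)
```

Sensitivity is the most safety-relevant of the four here, since a missed hand (FN) means a missed burn warning.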
As can be seen in
Table 1, the algorithm achieves very good results for various set configurations. Further research and results concern a set divided in a 90/10 ratio (19,062 training images and 2118 test images). Some selected cases of correct and incorrect hand detection are presented below.
Figure 8 shows cases where the algorithm correctly responded to the presence of hands (TP) and the presence of hot objects at the same time. In some cases, even images with a small portion of the hand are classified correctly.
Figure 9 shows examples of images where the algorithm detected hands even though, according to expert opinion, they do not actually appear in the image (FP). This situation results from areas of warm air (with a temperature and shape similar to that of a hand) surrounded by hot objects. The next images in
Figure 10 are examples where the algorithm correctly reported no hands (TN), because none appear in the images. In such a situation, the algorithm should only warn about the existence of a hot object, but the level and intensity of the warning may be lower.
Figure 11 shows false negative cases (FN), in which the algorithm did not detect hands even though they are actually present in the image. These constitute less than 0.2% of cases, which is very important because this is a key element of the algorithm, whose operation affects the safety of the system user. The lack of hand detection may be caused by too small a part of the hand being visible in the image (just entering the camera's field of view) or by interference in the hand area resulting from the camera operation. Bearing in mind that the algorithm analyses subsequent frames continuously, the hand will be detected when most of it appears in the image. Even if the hand is not detected, the high-temperature detection block will inform the user about the hot object, which should increase their alertness.
Several examples were selected from available publications in which the problem of hand detection in visible light or thermal images appears. Since in most cases the authors did not present comprehensive hand detection results (or hand detection was only part of the analysis process), the comparison includes those solutions and results that allow for at least a partial comparison of effectiveness. Results obtained in visible light, RGB-D and IR, using classic algorithms and deep learning, are summarized against the method proposed here. For the reference solutions, the results obtained by their authors on their own datasets are presented; the proposed CNN models were tested on the image set discussed here. Additional information, such as the number of images used by the authors, is also included. For the method discussed here, averaged results obtained during testing with 10-fold cross-validation are presented (
Table 2).
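The 10-fold cross-validation protocol used to average these results can be sketched as follows; the `evaluate` callback is a placeholder standing in for training the CNN on the training folds and measuring accuracy on the held-out fold:

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Split n sample indices into k shuffled, roughly equal folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(n_samples, evaluate, k=10):
    """Average a per-fold score over k folds; each fold serves once as the test set."""
    folds = kfold_indices(n_samples, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(evaluate(train_idx, test_idx))
    return float(np.mean(scores))
```

Every image is used for testing exactly once across the ten runs, so the averaged score is less sensitive to a particular train/test split than a single 90/10 division.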
6. Conclusions
In summary, the main contributions of this research are the proposal of a deep learning-assisted automated method for burn detection, an analysis of the impact of several hyperparameters on the model, the proposal of the best-tuned model and a comparison with state-of-the-art methods. The high efficiency of the described method proves that it can be used as an element of a thermal threat warning system. Detection of a dangerous temperature is achieved with 100% accuracy. Hand and burn risk detection reaches Acc = 99.5% and Prec = 99.5%. Compared to previous studies and classic methods, effectiveness has improved significantly, by about 8–9%. Compared to current methods using CNNs, the effectiveness is comparable or better (
Table 2). The use of deep learning and convolutional neural networks and optimization of the network structure also allowed for faster image analysis than in the case of the compared methods. The most important features of the proposed method are an independent and fully automatic hand detection block, high effectiveness and speed of action and classification of various threat levels. Given that the system will analyze the environment continuously, it can be assumed that the user will be warned of a hot object and burn early enough, but a situation of direct risk of burns may only occur when a user does not react to previous warnings. The solution should be considered as a real-time system [
32,
33] with hard real-time constraints, because the image analysis results should be delivered simultaneously with the recording of subsequent image frames, and exceeding the time limit may pose a health risk. A hot object is detected from a distance of over 4 m, so the user is informed about a potential threat in advance, which increases their vigilance. When recording at 8 frames per second, the system reacts to the appearance of a hand and a hot object in the frame within 0.125–0.25 s (the first or second frame). If the user is about 0.5 m away from the hot object, touching the object may take about 1 s, so the user has about 0.5 s to react. If the user holds out their hand while the hot object is further away, the reaction time is longer. The method may have limitations in very small spaces, where an object may suddenly appear in the field of view when the user turns towards a hot object that was previously not visible to the camera.
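The timing figures above follow directly from the frame rate; a quick arithmetic check (the 1 s time-to-touch is the text's estimate, and the computed margin ignores warning-delivery latency, which the text's 0.5 s reaction figure additionally budgets for):

```python
FRAME_RATE_HZ = 8                        # Flir One Pro frame rate
frame_period_s = 1 / FRAME_RATE_HZ       # 0.125 s per frame

# The hand/hot-object pair is detected in the first or second frame:
detection_s = (1 * frame_period_s, 2 * frame_period_s)

# Text's estimate: reaching a hot object ~0.5 m away takes about 1 s.
time_to_touch_s = 1.0
margin_s = time_to_touch_s - detection_s[1]  # time left before warning latency
```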
When testing the developed model on a desktop computer with a GPU, the average classification time for the 2118 test images was approximately 0.74 s, using an Nvidia RTX 3060 graphics card and CUDA library version 12.5. It can therefore be assumed that porting the model to a mobile platform should not prevent its use.
The burn risk detection efficiency (hand and hot object) reached 99.7%. Various neural network configurations and structures, and various training and test set configurations, were tested in the hand detection process on over 21,000 images. The results prove that it is possible to automatically and quickly analyze the user's environment and protect them against burns using the proposed deep learning methods and device sets. With further development and miniaturization of vision equipment, the proposed method can be used more comfortably by the user. Ultimately, the proposed CNN model will be implemented on Android mobile devices. For this purpose, the TensorFlow library can be used, which supports Android and allows for building and exporting trained CNN networks [
34,
35]. The capabilities and hardware reserves offered by newer devices (fast multi-core CPUs with frequencies above 3 GHz, GPUs clocked at over 1.4 GHz and specialized AI systems) and large memory capacities (up to a dozen GB) will allow for further expansion, as well as the use of faster mobile thermal imaging cameras when they appear on the market. Hardware support for machine learning is increasingly offered in mobile devices (for example, the Dimensity 9300, Snapdragon 8 Gen 3 and Dimensity 8300 [
36] processors contain specialized AI systems) and allows for the use of increasingly advanced solutions related to image processing and analysis. The proposed CNN system and architecture contribute to the development of thermal image analysis by showing new possibilities offered by available mobile devices. Applications provided by camera manufacturers most often only enable temperature measurement at a point or in a given area. The addition of artificial intelligence algorithms can facilitate reasoning, the observation of phenomena or processes, and the automation of performed tasks, similarly to vision systems operating in visible light. Thermal images are used in vision systems less frequently than visible light images, but the results of analyzing both types of images can also be combined, which most often improves the final effectiveness of these systems.