Details of the three parts are described below.
3.1. Tracking-by-Detection
This part has two main components: an object detector, YOLOv5 [41], for human detection, and a tracker, DeepSort [46].
The video collected from a thermal camera is the input to the detector YOLOv5 for frame-by-frame human detection. To integrate clothing status and key posture recognition into this detection procedure, we classify persons into six categories (see Table 3). Here the clothing status is represented by the sleeve status (long, short) for four reasons: (i) these two are the most common clothing situations in an office environment, while the lower part of the body is often totally occluded by the desk; (ii) according to [10,34,35], sleeve status is significantly important in estimating I_cl; (iii) the change from a long-sleeved status to a short-sleeved status by rolling up sleeves or taking off outer jackets is a sign of feeling hot, and vice versa, indicating a person's thermal sensation directly; (iv) the sleeve status helps to locate the skin region and the clothes region separately for further skin and clothes temperature acquisition. For example, the elbows of a person wearing short-sleeved clothes are skin regions, while the elbows of a person wearing long-sleeved clothes are clothes regions. This localization makes it possible to use such key body points to calculate a person's skin temperature and clothes temperature, because key body points on the arms are widely used sensitive heat receptors in thermal comfort assessment [35,54,55,56]. Besides the two statuses of long sleeves and short sleeves, a third status, difficult to predict clothes type due to occlusion, is also common in daily life. For clear illustration, such cases are shown in
Figure 2. The right persons in Figure 2a,b are partly occluded by the computer monitor; the right person in Figure 2c has moved the arms out of the scene; the left person in Figure 2d occludes his lower arms by hiding them behind the torso. These occlusions make it unrealistic to know whether the sleeves are long or short. Note that even though a person is occluded in a few frames, his or her clothing status can be recognized in other frames; therefore, voting on the classified category over a few seconds is important. When it comes to key posture recognition, according to ISO standards [
8,9,11,12], a person's metabolic rate M is closely related to the posture (sitting, standing, lying down, etc.). In a typical office environment the most common postures are sitting and standing; therefore, these two are considered in our study.
The ultimate goal of this research is to acquire every occupant's personal factors and thus facilitate individual thermal comfort assessment. This means that each person must be tracked across time. To this end, we adopt DeepSort. This tracker receives the image information and the YOLOv5-predicted detections, and then decides which tracking ID each detection should be associated with. As Figure 1 shows, DeepSort uses the detected bounding box information (x_i, y_i, w_i, h_i) in the (t−1)th frame (indicating the ith box's top-left coordinates, width, and height, respectively) to infer the location of the same object in the tth frame in the form of (x̂_i, ŷ_i, ŵ_i, ĥ_i) by a Kalman filter. At the same time, DeepSort extracts and saves the deep features of the object as its appearance information. In this way, two similarity metrics (location and appearance) can be calculated, based on which each detected person can be linked to a specific identity, making the same person tracked with a consistent ID over time.
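The box-prediction step can be illustrated with a toy constant-velocity sketch. DeepSort's actual Kalman filter maintains an eight-dimensional state with covariances, so this is only a simplified illustration, and all names are our own.

```python
import numpy as np

class SimpleBoxPredictor:
    """Toy constant-velocity predictor for a bounding box (x, y, w, h).

    DeepSort's real Kalman filter also tracks uncertainty (covariances);
    here we only illustrate how a box observed up to frame t-1
    predicts a box in frame t.
    """

    def __init__(self, box):
        self.box = np.asarray(box, dtype=float)  # (x, y, w, h)
        self.velocity = np.zeros(4)              # per-frame change of the state

    def predict(self):
        """Predicted box for the next frame."""
        return self.box + self.velocity

    def update(self, observed_box):
        """Correct the state with a matched detection."""
        observed_box = np.asarray(observed_box, dtype=float)
        self.velocity = observed_box - self.box  # crude velocity estimate
        self.box = observed_box

# A box moving 5 px to the right per frame:
p = SimpleBoxPredictor([100, 50, 40, 80])
p.update([105, 50, 40, 80])
print(p.predict().tolist())  # -> [110.0, 50.0, 40.0, 80.0]
```

The predicted box is what gets compared against new detections via the location similarity metric.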
The reason why this DeepSort-by-YOLOv5 paradigm is chosen and applied to this specific research field is explained further below. The data we use is in a thermal mode with significantly fewer details than its RGB counterpart. This makes the reuse of such limited details/features extremely important. Compared with other detectors, YOLOv5 introduces PANet (Path Aggregation Network) [57] as its neck, giving the deeper layers much more efficient access to the lower-layer features, so the thermal features are well reused. When it comes to the tracking part, the Maximum Age strategy in DeepSort, which deletes a track only when it is not associated with any detection for more than a preset number of frames, can guarantee a consistent ID despite a few false negatives (FN) from YOLOv5. The Tentative Track strategy in DeepSort, which confirms a track only after it is associated with a detection in three consecutive frames, also guarantees that occasional false positives (FP) from YOLOv5 have no severe influence on the output. That is to say, this tracking-by-detection framework smooths the direct output of the detector by filtering the undesired consequences of FN and FP, making the detector and the tracker benefit each other. Additionally, the low complexity and real-time performance of DeepSort fit the relatively simple scene in our case well, compared with other cases like pedestrian/vehicle tracking in autonomous driving assistance systems.
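The Maximum Age and Tentative Track strategies described above can be sketched as a small state machine. The confirmation count of 3 follows the text; max_age = 30 is an assumed default, and all names are illustrative.

```python
class TrackState:
    TENTATIVE, CONFIRMED, DELETED = range(3)

class Track:
    """Minimal illustration of a DeepSort-style track lifecycle.

    n_init = 3: a track is confirmed only after 3 consecutive associations,
    so an occasional false positive never becomes a confirmed track.
    max_age: a confirmed track survives up to max_age missed frames,
    so an occasional false negative does not change the ID.
    """

    def __init__(self, n_init=3, max_age=30):
        self.n_init, self.max_age = n_init, max_age
        self.hits, self.misses = 1, 0  # created from a first detection
        self.state = TrackState.TENTATIVE

    def mark_matched(self):
        self.hits += 1
        self.misses = 0
        if self.state == TrackState.TENTATIVE and self.hits >= self.n_init:
            self.state = TrackState.CONFIRMED

    def mark_missed(self):
        self.misses += 1
        # A tentative track dies immediately; a confirmed one only
        # after more than max_age consecutive misses.
        if self.state == TrackState.TENTATIVE or self.misses > self.max_age:
            self.state = TrackState.DELETED

t = Track()
t.mark_matched(); t.mark_matched()      # three consecutive hits in total
print(t.state == TrackState.CONFIRMED)  # -> True
```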
Overall, this design not only locates and tracks each individual with a consistent ID in the scene, but also simultaneously predicts the person's clothing and posture status, which directly influence the I_cl and M estimation.
3.2. I_cl Estimation
I_cl estimation relying on lookup tables in ISO standards [8,9,10,12] and updated clothes databases [58,59] can be a fast solution for laboratory studies, but such a scheme is unfeasible in real applications for two reasons: (i) looking up the I_cl value for a person needs extra manual work, which is tedious and expensive; (ii) if this look-up task is expected to be done automatically, the solution must be able to recognize hundreds of different garment combinations that vary in materials and number of layers, since the latest research has revealed their significant importance in thermal comfort [2], which is far beyond the capability of existing algorithms.
Therefore, to realize automated I_cl estimation, we take another route: using the difference between the skin temperature T_s and the clothes temperature T_c to calculate I_cl. This method is intuitive, since the difference between T_s and T_c explicitly reveals the heat insulation of the clothes that isolate the bare skin from the environmental air. The larger the temperature difference, the higher the clothing insulation.
To get T_s and T_c for each individual, the person's skin region R_s and clothing-covered region R_c need to be differentiated from each other. Empirically, R_s includes the face, hands, and neck; R_c includes the shoulders, torso, and upper arms. However, in daily life, accessories (hat, glasses, scarf, watch, etc.), spontaneous behaviors (lowering one's head, turning one's face away, hiding one's arm behind the torso, etc.), and inevitable occlusions by objects in front make many body parts be detected unreliably or even become totally invisible. After considering such situations, this research counts the lower arms (the middle point between the elbow and the wrist) for short-sleeved clothes and the nose area as R_s, and the elbows for long-sleeved clothes and the shoulders as R_c. These regions are also widely used heat receptors in thermal comfort research [35,54,55,56].
Figure 3 illustrates R_s in green crosses and R_c in red crosses on four images.
To locate these body parts, we employ OpenPose [60], a 2D pose estimation tool. OpenPose is robust against occlusions when detecting key body points. The level of this robustness is determined by a parameter called the confidence threshold: only detected key points whose confidence scores are higher than the threshold are counted as output. The higher the threshold, the lower the robustness against occlusions but the higher the detection accuracy; the lower the threshold, the higher the robustness but the more false positives. This is shown in Figure 4, which draws the key body points detected by OpenPose with confidence thresholds of 0.1, 0.3, 0.5, and 0.7.
Since the detected key points are representations of R_s and R_c and thus directly related to T_s and T_c, high accuracy matters much more than robustness against occlusions. As in Figure 4a,b, the detected elbows of the left person actually fall in the computer monitor region; the result in Figure 4c is more accurate, but the detected wrists of the right person fall in the laptop region, which will influence the lower-arm localization in R_s. These preliminary trials inspire us to set the confidence threshold as high as possible, but too high a threshold produces more missed detections. Therefore, our work uses 0.6 as the threshold throughout, which is proved to be an effective setting in the experimental part, Section 4.3. To further decrease the influence of missed detections, an accumulation strategy is introduced that collects all the detected key points within a duration such as five minutes, since a person's clothing status does not change very frequently; this at the same time filters out potential noise.
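The threshold-and-vote scheme can be sketched as follows. The (x, y, confidence) triple layout mirrors OpenPose's per-keypoint output; the helper names and the string labels are our own.

```python
from collections import Counter

CONF_THRESHOLD = 0.6  # the threshold used throughout this work

def filter_keypoints(keypoints, threshold=CONF_THRESHOLD):
    """Keep only key points whose confidence exceeds the threshold.

    `keypoints` is assumed to be a list of (x, y, confidence) triples,
    mirroring OpenPose's per-person output layout.
    """
    return [(x, y) for x, y, c in keypoints if c > threshold]

def vote_clothing_status(per_frame_labels):
    """Majority vote over per-frame clothing classifications,
    ignoring frames where occlusion made the status unpredictable."""
    votes = [s for s in per_frame_labels if s != "occluded"]
    if not votes:
        return "occluded"
    return Counter(votes).most_common(1)[0][0]

print(vote_clothing_status(
    ["long", "occluded", "long", "short", "long"]))  # -> long
```

A person briefly hidden behind a monitor is thus still assigned the clothing status seen in the unoccluded frames.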
Another thing worth mentioning is that although OpenPose detects key body points for each person, it provides no multi-person tracking, so our tracking-by-detection framework is still necessary.
In mathematics, based on the recognized sleeve status and the OpenPose-predicted key body points, the skin region R_s and the clothing-covered region R_c are determined, both of which are sets of pixel coordinates (x, y) in the image plane, as in Equations (1) and (2).
In the equations, the subscript (t, t+1, …, t+T) refers to the index of each frame within a time period of T frames; the superscript refers to the index of each detected key point. So in the T consecutive frames there are N_s and N_c key points detected in R_s and R_c, respectively.
The thermal camera we use is a Xenics Gobi-384-GigE, which can visualize a thermography of the scene it captures and measure the temperature of each pixel within the image with a resolution of 0.08 °C. Therefore, the temperatures of the detected key points in R_s and in R_c are easily read from the camera. Averaging the temperature values over the key points in R_s and over those in R_c then gives T_s and T_c, respectively.
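Reading an average temperature over the detected key points can be sketched as follows. This assumes the radiometric frame is already available as a per-pixel temperature array; the patch-averaging detail and all names are our own additions, not stated in the text.

```python
import numpy as np

def mean_region_temperature(thermal_image, points, patch=3):
    """Average temperature of small patches around detected key points.

    `thermal_image` is assumed to be a per-pixel temperature array in deg C
    (as read from the radiometric camera); `points` are (x, y) key points.
    Averaging a small patch around each point makes the reading robust
    to one-pixel localization errors.
    """
    h, w = thermal_image.shape
    r = patch // 2
    values = []
    for x, y in points:
        x, y = int(round(x)), int(round(y))
        region = thermal_image[max(0, y - r):min(h, y + r + 1),
                               max(0, x - r):min(w, x + r + 1)]
        values.append(region.mean())
    return float(np.mean(values))

# Toy 5x5 "thermal image" at a uniform 30 deg C with one warmer key point:
img = np.full((5, 5), 30.0)
img[1, 1] = 33.0
print(mean_region_temperature(img, [(1, 1)], patch=1))  # -> 33.0
```

Running the same function once over the skin key points and once over the clothes key points yields the two averages T_s and T_c.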
As long as T_s and T_c of each individual are calculated, the person's I_cl can be estimated by:

I_cl = (T_s − T_c) / (h · (T_c − T_o)),  (3)

where h equals 8.6, referring to the human heat transfer coefficient; T_o is the operative temperature considering both the air temperature and the mean radiant temperature, so here it is calculated as the average temperature of the background region in each frame. This calculation comes from [35] according to [10,61], and all the temperatures T_s, T_c, and T_o are in degrees Celsius. We emphasize that our contribution is the OpenPose strategy for localizing R_s and R_c to get T_s and T_c, based on which any I_cl calculation method can be applied.
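As a minimal sketch of this calculation, assuming the conventional resistance form (the skin-to-clothes temperature drop divided by the heat flux h · (T_c − T_o); the equation image is not reproduced here, so this form is our reconstruction, and all names are our own):

```python
H = 8.6  # human heat transfer coefficient used in the text

def estimate_icl(t_skin, t_clothes, t_operative, h=H):
    """Clothing insulation from the skin/clothes temperature difference.

    Assumed resistance form: the temperature drop across the clothing
    divided by the heat flux h * (T_c - T_o). All temperatures in deg C.
    """
    return (t_skin - t_clothes) / (h * (t_clothes - t_operative))

# Warmer skin under cooler clothes in a 22 deg C room:
print(round(estimate_icl(33.0, 29.0, 22.0), 3))  # -> 0.066
```

As the text notes, any other I_cl formula can be substituted here once T_s and T_c are available.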
3.3. M Estimation
In this part, we first propose in Section 3.3.1 three vision-based features to represent each person's activity intensity, based on which M is estimated in Section 3.3.2.
3.3.1. Three Vision-Based Features
Though M can be estimated from a person's key posture or activity type as listed in ISO standards [8,9,11,12] and updated databases [62,63], this is a rough estimation in many cases, since we have observed that different people tend to have different activity intensities for the same posture. For example, some people will do a bit of stretching when standing up, while others may just stand still. Therefore, a more accurate and dynamic M estimation is expected. This is done by computing three vision-based features over a few seconds (10 s, i.e., 210 frames, in our case): a person's bounding box changes in two aspects (location and scale) and the optical flow intensity within the bounding box. Here, the choice of 10 s comes from the observation that a smart bracelet takes a similar duration to monitor a user's heartbeat and blood oxygen content, two human physiological signals indicating the M value. This three-feature idea is motivated by the following: the bounding box location change captures the general body movement; the bounding box scale change captures the motion of the limbs; the optical flow intensity within the box captures the subtle movements that the box changes may ignore.
To realize this, for the location change of a person's bounding boxes during 10 s (210 frames), the center coordinates of the person's bounding box in each frame are drawn as a point in a 2D plane, and in total the 210 2D points form a cluster-shaped pattern. The more spread out the points are, the larger the general body movement is. The degree of spread can be approximated by fitting an ellipse to the cluster and then calculating the area of this ellipse. In mathematics, first, the covariance matrix of the vector of the horizontal coordinates and the vector of the vertical coordinates of the 210 points is computed; then the two eigenvalues of the covariance matrix are computed; finally, the product of these two eigenvalues represents the area of the ellipse.
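The eigenvalue-based spread measure can be sketched as follows, assuming NumPy and names of our own. The same function also serves the scale-change feature below when fed the origin-centered upper-right corners instead of the centers.

```python
import numpy as np

def spread_feature(points):
    """Activity-intensity feature from a cluster of 2D points.

    `points` is an (N, 2) array of per-frame bounding-box centers (or,
    for the scale feature, origin-centered upper-right corners). The
    product of the two eigenvalues of the covariance matrix grows with
    the area of an ellipse fitted to the cluster.
    """
    pts = np.asarray(points, dtype=float)
    cov = np.cov(pts[:, 0], pts[:, 1])     # 2x2 covariance matrix
    eigenvalues = np.linalg.eigvalsh(cov)  # symmetric -> real eigenvalues
    return float(eigenvalues[0] * eigenvalues[1])

rng = np.random.default_rng(0)
still    = rng.normal(0, 1.0, size=(210, 2))  # person barely moving
stretchy = rng.normal(0, 5.0, size=(210, 2))  # person moving a lot
print(spread_feature(still) < spread_feature(stretchy))  # -> True
```

Since the eigenvalue product equals the determinant of the covariance matrix, the feature is invariant to the orientation of the movement pattern.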
For the scale change of a person's bounding boxes, the 210 bounding boxes from the 210 frames are translated so that they share the same center at the origin; the upper-right coordinates of each bounding box then represent its scale. Similarly, the 210 upper-right points form a cluster in a 2D plane, and the area of the ellipse fitted to the cluster represents the scale change across time. The larger the area, the larger the movement of the limbs.
When it comes to the optical flow intensity in a person's bounding box from the tth to the (t+T)th frame (T equals 210 here), for each frame two optical flows in the horizontal and vertical directions are extracted by the TV-L1 algorithm [64] realized in a tool called MMAction [65]. Each optical flow is saved as an 8-bit image in which pixels with a grayscale value of 127 represent no movement, while pixels with grayscale values farther away from 127 represent larger movements. Therefore, within a duration of T frames, a person's optical flow intensity F is calculated by:

F = Σ_{i=t}^{t+T} f_i,  (4)
f_i = f_i^x + f_i^y,  (5)
f_i^x = (1/|B_i|) Σ_{p∈B_i} |OF^x(p) − 127|,  (6)
f_i^y = (1/|B_i|) Σ_{p∈B_i} |OF^y(p) − 127|,  (7)

where i indicates the frame index; f_i is the person's optical flow intensity in the ith frame; f_i^x and f_i^y are the person's optical flow intensities in the horizontal and vertical directions in the ith frame, respectively; p is any pixel in the optical flow; B_i is the bounding box region of the person in the ith frame; and OF^x and OF^y denote the two optical flows in the horizontal and vertical directions, respectively. In Equations (6) and (7), the number of pixels in the bounding box acts as the denominator to normalize out the influence of the size of the box.
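The per-frame normalization can be sketched as follows. The 8-bit flow encoding with 127 as zero motion follows the text; the function names and the (x1, y1, x2, y2) box format are our own.

```python
import numpy as np

def flow_intensity(flow_x, flow_y, box):
    """Per-frame optical flow intensity inside a bounding box.

    `flow_x` and `flow_y` are 8-bit optical flow images (grayscale 127
    means no motion); `box` is (x1, y1, x2, y2). The per-pixel deviation
    from 127 is averaged over the box, so the box size does not matter,
    mirroring the normalization in Equations (6) and (7).
    """
    x1, y1, x2, y2 = box
    fx = flow_x[y1:y2, x1:x2].astype(float)
    fy = flow_y[y1:y2, x1:x2].astype(float)
    return np.abs(fx - 127).mean() + np.abs(fy - 127).mean()

def intensity_over_window(flows, boxes):
    """Sum of per-frame intensities over a window (e.g. 210 frames)."""
    return sum(flow_intensity(fx, fy, b) for (fx, fy), b in zip(flows, boxes))

# A still frame: both flows uniformly 127 -> zero intensity.
still = np.full((10, 10), 127, dtype=np.uint8)
print(flow_intensity(still, still, (0, 0, 10, 10)))  # -> 0.0
```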
In this way, the three features (bounding box location change, bounding box scale change, optical flow intensity) representing an individual's activity intensity are acquired. A visualization showing the bounding box location change as a cluster of 210 2D points/circles, the bounding box scale change also as a cluster of 210 2D points/circles, and the optical flow intensity within the bounding box in each frame of a 210-frame duration is given in Figure 5, in which the ID 1 person is standing with very limited movements while the ID 2 person is standing and stretching with large movements. The figure intuitively illustrates that the larger body movements of ID 2 lead to more spread-out points/circles in Figure 5d,f and a larger optical flow intensity in Figure 5h.
3.3.2. M Estimation from the Three Features
In real life, people may perform various activities that are unrealistic to analyze accurately. However, in an office environment, staff usually have scheduled routines and thus relatively fixed behaviors. Generally, the sitting staff are typing on the keyboard, reading, taking notes, sorting through files, chatting with colleagues, attending online meetings, etc. The standing staff are occupied by the same tasks but may also be involved in some walking or body stretching. This prior knowledge is so important that it gives a metabolic rate range within which each individual's M varies.
Therefore, with the above prior knowledge of standard office behaviors, by referring to Table A.1 and Table A.2 in ISO 8996 [11], the CBE (Center for the Built Environment) thermal comfort tool [66], and the 2011 compendium of physical activities tables [63,67], the usual metabolic rate range of a sitting office worker is quite narrow, from 58 W/m2 (1.0 MET) to 87 W/m2 (1.5 MET), while a standing worker's metabolic rate usually varies from 75 W/m2 (1.3 MET) to 174 W/m2 (3.0 MET). According to the CBE thermal comfort tool, the slight M change of a sitting person within the range [58 W/m2, 87 W/m2] has a mild influence on his or her thermal sensation, while an M change within the much larger range of a standing person significantly influences the thermal feeling. This result inspires us to use the middle value of 72.5 W/m2 to represent a sitting office worker's M for simplicity and generalization, which also spares the three-feature extraction for him or her, but we need to specifically derive a standing person's M from his or her dynamic activity intensity as represented by the three vision-based features.
To map such features to a value of M, a classification idea is introduced. Similar to Table A.2 in ISO 8996, where metabolic rates from 55 W/m2 to more than 260 W/m2 are categorized into resting, low, moderate, high, and very high levels, we categorize the metabolic rate of a standing office worker into low, moderate, and high levels. Specifically, a low level means standing with very limited movements or transient spontaneous movements (standing quietly in a line, reading, using a cellphone, chatting normally, etc.); a moderate level means standing with spontaneous but lasting movements (natural and small paces, limb movements, head movements, discussing with gestures, etc.); a high level means standing with significant movements, usually indicating intentional actions like sustained location changes by walking, constant trunk movements to stretch/relax the body, etc.
It is extremely important to note that the three levels do not mean there are only three options for the M value. Instead, for a person's activity intensity, there are three classification probabilities P_l, P_m, and P_h indicating the possibilities of being viewed as low, moderate, and high level, respectively. Based on P_l, P_m, and P_h, the person's final M is estimated by:

M = P_l · M_l + P_m · M_m + P_h · M_h,  (8)

where M_l, M_m, and M_h are the lower boundary, the middle value, and the upper boundary of a standing person's M, that is, 75 W/m2, 125 W/m2, and 174 W/m2, respectively.
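The probability-weighted summation can be sketched in a few lines; this is a minimal illustration under the boundary values quoted above, with all function and variable names being our own.

```python
# Boundaries of a standing office worker's metabolic rate (W/m^2),
# taken from the ranges quoted in the text.
M_LOW, M_MID, M_HIGH = 75.0, 125.0, 174.0

def estimate_standing_m(p_low, p_moderate, p_high):
    """Probability-weighted metabolic rate for a standing person.

    The three probabilities come from the activity-intensity classifier
    and should sum to 1, so the result varies continuously in
    [M_LOW, M_HIGH] rather than jumping between three discrete values.
    """
    assert abs(p_low + p_moderate + p_high - 1.0) < 1e-6
    return p_low * M_LOW + p_moderate * M_MID + p_high * M_HIGH

# A person classified as mostly moderate activity:
print(round(estimate_standing_m(0.2, 0.7, 0.1), 1))  # -> 119.9
```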
To realize this solution, the classification probabilities P_l, P_m, and P_h are needed. With only three features describing a person's activity intensity within a few seconds as input, a simple and flexible classification model can be used instead of a CNN. In this study, several lightweight models are evaluated, and the random forest model works best. The training and testing details are in Section 4.4.
In summary, the proposed M estimation method has several advantages: (i) the three explicitly extracted features can guide the metabolic rate estimation efficiently, considering that features automatically extracted by a learning method are relatively difficult to anticipate and may thus fail for a specific task; (ii) the three features are very low dimensional, making it possible to use lightweight machine learning classifiers that are flexible to integrate into the whole system; (iii) the probability-weighted summation (Equation (8)) makes the estimated M change continuously within a range, which not only fits the real-life scenario better than the limited, discrete choices in existing methods, but also avoids the very difficult annotation that a regression model would require.