The proposed methodology for classifying a person's actions and objects using the ANU-Net technique is shown in Figure 2. In this paper, we propose a deep-learning-based method for detecting and classifying objects and actions, such as hats, glasses, normal, rotation, and hat with glasses, from human thermal face images. The model proceeds in five steps. First, in the pre-processing step, the input thermal images are resized, converted to grayscale, and filtered with a median filter. Then, principal component analysis (PCA) extracts the features from the pre-processed thermal image in the feature extraction step. For feature selection, the horse herd optimization algorithm (HOA) is employed. After feature selection, the face must be detected in order to identify the objects on it; the LeNet-5 technique is used to detect human faces in thermal images. Finally, the objects and actions on faces are classified using the ANU-Net approach, combined with the monarch butterfly optimization algorithm to achieve higher accuracy.
3.2. Feature Extraction
After pre-processing, the features that are useful for categorizing the images are extracted. This feature extraction process yields features that help to detect the face accurately and clearly. In this step, the representative features are extracted with principal component analysis (PCA). Obtaining features is a significant phase in the model construction process [20]. PCA is a mathematical approach whose goal is to reduce the dimension of the dataset while keeping the loss of detail small. The data are transformed into a new coordinate system by an orthogonal linear transformation: linear combinations of the native features provide new features, ordered by decreasing variance. Accordingly, the PCA method converts the $n$ vectors $\{x_1, x_2, \ldots, x_n\}$ from the $d$-dimensional space into $n$ vectors $\{y_1, y_2, \ldots, y_n\}$ in a new $d'$-dimensional space, with $d' < d$.
Since PCA calculates unit-norm directions, the data can be projected onto them. For an input vector $x$, the projected feature $y$ with the greatest variance is

$y = \hat{W}^{T} x$,

where $\hat{W} = W/\|W\|$. The input data can then be reconstructed, i.e., $\hat{x} = \hat{W} y$, by the use of linear least-squares estimation. With this data reconstruction, the reconstruction error can be identified by comparing the differences between the original and reconstructed data.
Solving this issue requires a method that restricts the dimension while enhancing the overall representation produced by PCA. Based on PCA, the reconstruction error can be reduced [21]. When the data are projected onto a $k$-dimensional subspace, the following calculations apply.
According to the proposed method, a maximum-likelihood latent-variable model may be used to map the underlying space into the data space. Each observed input vector $x$ is generated from a latent vector $z$ as

$x = Wz + \mu + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^{2} I)$,

so the conditional likelihood is given by

$p(x \mid z) = \mathcal{N}(Wz + \mu, \sigma^{2} I)$.

Marginalizing over $z$, the maximum likelihood observation model can also be expressed as

$p(x) = \mathcal{N}(\mu, C)$, where $C = WW^{T} + \sigma^{2} I$.

The sample covariance $S$ can be expressed as

$S = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^{T}$.

The maximum log-likelihood over the $n$ observations can be expressed using the equation below:

$\mathcal{L} = -\frac{n}{2}\left[d\ln(2\pi) + \ln|C| + \mathrm{tr}(C^{-1}S)\right]$.

By maximizing $\mathcal{L}$ with respect to $W$ and $\sigma^{2}$, the error can be reduced, yielding a better solution based on data reconstruction in PCA.
The aforementioned model improves output and reduces errors in feature extraction using PCA. These extracted features are used to detect human faces from thermal images.
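As an illustration (not the authors' implementation), the PCA projection and reconstruction error described in this section can be sketched with NumPy; the function name `pca_extract` is ours:

```python
import numpy as np

def pca_extract(X, k):
    """PCA sketch: project centered data onto the top-k principal
    directions and measure the reconstruction error."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # unit-norm principal directions = right singular vectors of Xc
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                      # d x k matrix, columns unit-norm
    Y = Xc @ W                        # k-dimensional extracted features
    X_hat = Y @ W.T + mu              # least-squares reconstruction
    err = np.mean((X - X_hat) ** 2)   # reconstruction error
    return Y, err
```

When `k` matches the intrinsic rank of the data, the reconstruction error is (numerically) zero, which is the behavior the maximum-likelihood argument above exploits.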
3.3. Feature Selection
Before using detection algorithms, feature selection is an essential step needed to enhance the performance of detection. Multiple features are frequently used to produce good detection results. However, adding more features could result in the so-called “curse of dimensionality”, which impairs detection performance while lengthening computation times and boosting model complexity. As a result, feature selection is required to increase a detection algorithm’s ability to discriminate. By eliminating redundant or unnecessary characteristics, the most important subset of the initial feature set is discovered throughout the feature selection process.
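As an illustration of the wrapper idea described above (not the paper's actual objective), a candidate feature subset can be scored by combining a classifier's error on the selected columns with a sparsity penalty; the 1-NN leave-one-out choice and the weight `alpha` are our assumptions:

```python
import numpy as np

def fs_fitness(mask, X, y, alpha=0.99):
    """Wrapper-style feature-selection fitness (illustrative): weighted sum
    of a 1-NN leave-one-out error on the selected columns and a sparsity
    term that penalizes large subsets."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return 1.0                            # empty subset is worst
    Xs = X[:, idx]
    # pairwise distances for leave-one-out 1-NN classification
    d = np.linalg.norm(Xs[:, None] - Xs[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude self-matches
    err = np.mean(y[np.argmin(d, axis=1)] != y)
    return alpha * err + (1 - alpha) * idx.size / X.shape[1]
```

A lower score means a better subset, so a metaheuristic such as HOA can minimize this fitness over binary masks.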
The goal of this work is to tackle the feature selection issue with an optimization algorithm. The horse herd optimization algorithm (HOA) was therefore the main technique employed. HOA is a powerful algorithm inspired by the herding behavior of horses of different ages [22]. HOA performs remarkably well on difficult high-dimensional problems. Its exploration and exploitation efficiency is high, and it outperforms several popular metaheuristic optimization techniques in accuracy and efficiency. It can identify the optimal solution with little effort, low cost, and low complexity. Six behavioral characteristics are frequently seen in horses: grazing (G), hierarchy (H), sociability (S), imitation (I), defense mechanism (D), and roam (R).
The movement of the horses follows Equation (12) at every iteration:

$X_{m}^{Iter,AGE} = V_{m}^{Iter,AGE} + X_{m}^{(Iter-1),AGE}$, $AGE \in \{\alpha, \beta, \gamma, \delta\}$, (12)

where $X_{m}^{Iter,AGE}$ is the position of the $m$th horse, $AGE$ its age range, and $V_{m}^{Iter,AGE}$ its velocity vector. Every cycle of the algorithm can be described by Equation (13), which composes the velocity from the six behavior vectors (with the terms depending on the age group):

$V_{m}^{Iter,AGE} = G_{m}^{Iter,AGE} + H_{m}^{Iter,AGE} + S_{m}^{Iter,AGE} + I_{m}^{Iter,AGE} + D_{m}^{Iter,AGE} + R_{m}^{Iter,AGE}$. (13)
Grazing (G): Horses are grazing creatures that eat vegetation such as grass and other forages. Their daily grazing time ranges from 16 to 20 h, with only brief breaks. Grazing is represented mathematically by Equations (17) and (18):

$G_{m}^{Iter,AGE} = g_{Iter}\,(\breve{u} + p\,\breve{l})\,[X_{m}^{(Iter-1)}]$, (17)

$g_{m}^{Iter,AGE} = g_{m}^{(Iter-1),AGE} \times \omega_{g}$. (18)

With each iteration, the grazing factor $g$ decreases linearly by $\omega_{g}$; $p$ is a random number between 0 and 1, and $\breve{l}$ and $\breve{u}$ are the respective lower and upper limits of the grazing space.
Hierarchy (H): Horses cannot exist in complete freedom. They go through life following a leader, as is also frequently observed in humans. By the law of hierarchy, an adult stallion provides authority within wild horse herds. Equation (19) describes this:

$H_{m}^{Iter,AGE} = h_{m}^{Iter,AGE}\,[X_{*}^{(Iter-1)} - X_{m}^{(Iter-1)}]$, (19)

where $X_{*}^{(Iter-1)}$ shows where the best horse is located and $h_{m}^{Iter,AGE}$ describes how the best horse's placement affects the velocity parameter.
Sociability (S): Horses need to interact and occasionally coexist with other animals. Living in a group makes escape easier and enhances the likelihood of survival. Due to their social nature, horses may frequently be observed fighting with one another, and isolation makes a horse irritable. The herd attracts horses between the ages of 5 and 15 years, as described by the following equation:

$S_{m}^{Iter,AGE} = s_{m}^{Iter,AGE}\left[\left(\frac{1}{N}\sum_{j=1}^{N} X_{j}^{(Iter-1)}\right) - X_{m}^{(Iter-1)}\right]$,

where $N$ stands for the total number of horses and $AGE$ for the age range of each horse. In the evaluation of the sensitivity parameter, the $s$ coefficient is determined for horses $\beta$ and $\gamma$.
Imitation (I): Horses mimic one another and learn from each other's desirable and undesirable behavior, such as where to find the best pasture. The imitation behavior of horses is likewise modeled as a factor in the current approach:

$I_{m}^{Iter,AGE} = i_{m}^{Iter,AGE}\left[\left(\frac{1}{pN}\sum_{j=1}^{pN} \hat{X}_{j}^{(Iter-1)}\right) - X_{m}^{(Iter-1)}\right]$,

where $pN$ indicates the number of horses in the best positions; ten percent of the horses is the recommended value for $p$.
Defense mechanism (D): Horses' behavior is a result of their history as prey animals. They exhibit the fight-or-flight response to defend themselves; their initial response is to run away, and they buck when trapped. Horses fight for sustenance, to drive off rivals, and to avoid dangerous areas where wolves and other natural predators are present. Equations (26) and (27) show the horse's defense mechanism, which carries a negative coefficient to keep the animal away from inappropriate positions:

$D_{m}^{Iter,AGE} = -d_{m}^{Iter,AGE}\left[\left(\frac{1}{qN}\sum_{j=1}^{qN} \check{X}_{j}^{(Iter-1)}\right) - X_{m}^{(Iter-1)}\right]$, (26)

$d_{m}^{Iter,AGE} = d_{m}^{(Iter-1),AGE} \times \omega_{d}$, (27)

where $qN$ shows the number of horses in the worst positions; twenty percent of the horse population is taken as $q$.
Roam (R): In search of food, wild horses graze and travel around the countryside from pasture to pasture. Although most domestic horses are kept in stables, they still retain this quality. A factor $r$ displays this tendency and simulates it as a random movement:

$R_{m}^{Iter,AGE} = r_{m}^{Iter,AGE}\,p\,X_{m}^{(Iter-1)}$.

Young horses nearly always exhibit roaming, which eventually decreases as they mature. Accordingly, the velocity vector of each age group combines different behaviors. The velocity of a $\delta$ horse (between 0 and 5 years old):

$V_{m}^{Iter,\delta} = G_{m}^{Iter,\delta} + I_{m}^{Iter,\delta} + R_{m}^{Iter,\delta}$.

The velocity of a $\gamma$ horse (between the ages of 5 and 10):

$V_{m}^{Iter,\gamma} = G_{m}^{Iter,\gamma} + H_{m}^{Iter,\gamma} + S_{m}^{Iter,\gamma} + I_{m}^{Iter,\gamma} + D_{m}^{Iter,\gamma} + R_{m}^{Iter,\gamma}$.

The velocity of a $\beta$ horse (between 10 and 15 years old):

$V_{m}^{Iter,\beta} = G_{m}^{Iter,\beta} + H_{m}^{Iter,\beta} + S_{m}^{Iter,\beta} + D_{m}^{Iter,\beta}$.

The velocity of an $\alpha$ horse (older than 15 years):

$V_{m}^{Iter,\alpha} = G_{m}^{Iter,\alpha} + D_{m}^{Iter,\alpha}$.
The outcomes validated HOA’s capacity to deal with complex situations, such as several uncertain variables in high-dimensional areas. Adult α horses begin a highly precise local search in the vicinity of the global optimum. They exhibit a tremendous desire to explore new terrain and locate new global spots. Young δ horses provide ideal candidates for the random search phase due to certain behavioral traits they possess.
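To make the update rules concrete, a heavily simplified HOA sketch follows (not the authors' implementation): only the grazing and hierarchy terms are modeled for all horses, and every coefficient and decay factor is an assumption:

```python
import numpy as np

def hoa_minimize(f, dim, n_horses=20, iters=100, lb=-5.0, ub=5.0, seed=0):
    """Simplified horse herd optimization sketch: each horse's velocity is a
    grazing (random local) step plus a hierarchy pull toward the leader."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, (n_horses, dim))
    g, h = 1.5, 0.9            # grazing / hierarchy coefficients (assumed)
    wg, wh = 0.95, 0.95        # per-iteration decay factors (assumed)
    best = X[np.argmin([f(x) for x in X])].copy()
    for _ in range(iters):
        for m in range(n_horses):
            p = rng.random(dim)
            graze = g * (p * (ub - lb) + lb) * 0.01  # small random step
            hier = h * (best - X[m])                 # pull toward leader
            X[m] = np.clip(X[m] + graze + hier, lb, ub)
        g *= wg; h *= wh                             # behaviors fade over time
        cand = X[np.argmin([f(x) for x in X])]
        if f(cand) < f(best):
            best = cand.copy()
    return best, f(best)
```

For feature selection, the continuous positions would additionally be thresholded into a binary mask and scored with a subset-quality fitness.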
3.4. Detection
After feature selection, the face detection process separates the image into background and foreground. The detection step is important to the classification process, since the objects and actions to be classified lie on the face. In this research, the LeNet-5 technique is used to perform detection; it remains usable even when the background and foreground colors of the images are very close. The LeNet-5 technique is described below.
Detection networks of this kind have also proven effective in other domains, such as deep skin lesion detection. According to the needs of the application, many kinds of CNNs can be employed for detection, and faces can be detected by using a LeNet-5 network pre-trained on the ImageNet database. LeNet-5 has seven layers: an input layer, two convolutional layers, two pooling layers, a fully connected layer, and an output layer [23]. LeNet-5's detailed architecture is shown in Table 1. Several weighted layers are built on the concept of eliminating convolution layer blocks by leveraging shortcut connections. The fundamental building blocks are referred to as "bottleneck" blocks, and two design rules are used by these blocks: the same number of filters is used to produce the same output feature size, and the convolution layers down-sample at a stride of two. Batch normalization is conducted after each convolution and before the rectified-linear-unit (ReLU) activation.
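A minimal PyTorch sketch of the LeNet-5 layout described above follows; the layer sizes are those of the classic design (32×32 grayscale input assumed), with batch normalization before each ReLU as the text specifies, and `num_classes` is a placeholder:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """LeNet-5 sketch: two conv layers, two pooling layers,
    fully connected layers; batch norm precedes each ReLU."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # C1: 32x32 -> 28x28
            nn.BatchNorm2d(6), nn.ReLU(),
            nn.AvgPool2d(2),                   # S2: 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),   # C3: 14x14 -> 10x10
            nn.BatchNorm2d(16), nn.ReLU(),
            nn.AvgPool2d(2),                   # S4: 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),        # output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```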
The final detection network uses these region proposals for object classification. Anchor boxes are originally produced across each feature map pixel with varying scales and aspect ratios in the RPN. Nine anchor boxes are typically utilized, with aspect ratios of 1:1, 1:2, and 2:1 and scales of 128, 256, and 512. An anchor box’s probability of containing a background or object is predicted by RPN. The required object proposals are sent to the next stage in the form of a list of filtered anchor boxes. Equations (34) and (35) must be used to convert the final predicted region proposals from the anchor boxes. The translation between the center coordinates that is scale-invariant is shown in Equation (34). Equation (35) shows how the height and width translate in log space.
$t_{x} = (x - x_{a})/w_{a}$, $t_{y} = (y - y_{a})/h_{a}$, (34)

$t_{w} = \log(w/w_{a})$, $t_{h} = \log(h/h_{a})$, (35)

where the bounding box regression vector is represented by $(t_{x}, t_{y}, t_{w}, t_{h})$; the height, width, and center coordinates in $x$ and $y$ are depicted by $h$, $w$, $x$, and $y$; and $x_{a}$, $y_{a}$, $w_{a}$, and $h_{a}$ are the corresponding coordinates of the anchor box. The convolutional layers and fully connected layers are utilized in this process to detect the human face in thermal images based on the extracted features.
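The scale-invariant center translation and the log-space size translation of Equations (34) and (35) can be illustrated with a small encode/decode pair; boxes are given as center coordinates plus width and height, and the function names are ours:

```python
import math

def encode_box(box, anchor):
    """Box regression targets (Faster R-CNN style): scale-invariant
    center offsets and log-space width/height ratios.
    Boxes are (cx, cy, w, h)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def decode_box(t, anchor):
    """Invert encode_box: recover the predicted box from the
    regression vector and its anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya,
            wa * math.exp(tw), ha * math.exp(th))
```

Decoding the encoded targets against the same anchor recovers the original box, which is exactly the conversion the RPN applies to its predicted region proposals.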
3.5. Classification
We classify the objects and actions from the detected face in thermal images. For this classification, we create an integrated network termed Attention U-Net++ (ANU-Net). Compared with other existing classification techniques, it is both novel and effective. A sequence of U-Nets of various depths is integrated using a nested U-Net architecture that borrows from DenseNet. The nested framework differs from U-Net in that it uses nested convolutional blocks and redesigned dense skip links between the encoder and decoder at various depths [24]. In layered U-Nets, each nested convolutional block captures semantic information using several convolution layers, and each layer in the block is linked via connections, allowing the concatenation layer to combine semantic data of various levels.
3.5.1. Attention Gate (AG)
The AG employs the PASSR net's model and includes an effective attention gate in the nested architecture. A more thorough analysis of the attention gate: the first input ($g$) serves as the gating signal that facilitates learning of the second input ($f$); in other words, this gating signal selects the more advantageous features from the encoded features ($f$) and transfers them to the upper decoder. The two inputs are combined pixel by pixel following a convolution operation ($W_{g}$, $W_{f}$) and batch normalization ($b_{g}$, $b_{f}$). The S-shaped sigmoid activation is chosen to obtain the attention coefficient ($\alpha$) from the gate's parameters, and the result is produced by multiplying each pixel of the encoder feature by its coefficient. The attention-gate feature selection phase can be formulated as:

$\alpha = \sigma_{2}\big(\psi\big(\sigma_{1}(W_{g}\,g + b_{g} + W_{f}\,f + b_{f})\big)\big)$,

$\mathrm{out} = \alpha \cdot f$,

where $\sigma_{1}$ is the ReLU activation and $\sigma_{2}$ the sigmoid.
The AG can learn to classify the task-related target region, and it can suppress the task-unrelated target region. This work incorporates the attention gate to enhance the effectiveness of propagating semantic information through skip links in the innovative proposed network.
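A PyTorch sketch of such an additive attention gate, in the style of Attention U-Net, is given below; the channel counts are placeholders rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate sketch: the gating signal g re-weights the
    encoder features f before they cross the skip connection."""
    def __init__(self, g_ch, f_ch, inter_ch):
        super().__init__()
        # 1x1 convolutions (Wg, Wf) followed by batch norm
        self.wg = nn.Sequential(nn.Conv2d(g_ch, inter_ch, 1),
                                nn.BatchNorm2d(inter_ch))
        self.wf = nn.Sequential(nn.Conv2d(f_ch, inter_ch, 1),
                                nn.BatchNorm2d(inter_ch))
        # psi maps to one channel; sigmoid yields the coefficient alpha
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, 1),
                                 nn.BatchNorm2d(1), nn.Sigmoid())
        self.relu = nn.ReLU()

    def forward(self, g, f):
        alpha = self.psi(self.relu(self.wg(g) + self.wf(f)))
        return f * alpha   # pixel-wise re-weighted encoder features
```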
3.5.2. Attention-Based Nested U-Net
Based on the attention mechanism and the nested U-Net architecture, ANU-Net is an integrated network for thermal image classification. Since ANU-Net employs nested U-Net as its primary network design, more useful hierarchical features can be extracted. Through the extensive skip connections, the encoder transmits the collected context information to the relevant decoder layers. By using the proposed ANU-Net classification model, the time complexity is reduced, effective training is performed, and the performance of the classification is improved.
When there are several dense skip connections, each convolutional block in the decoder obtains two equal-scale feature maps as inputs: The outcomes of earlier attention gates with residual connections at the same depth are used to create the preliminary feature maps, and the output of the deeper block deconvolution process is used to create the final feature map. After receiving and concatenating all extracted feature maps, the decoder reconstructs features from the bottom up.
The extracted feature map of ANU-Net may be expressed as follows. Let $x^{i,j}$ indicate the outcome of the convolutional block, where $i$ defines the feature depth and $j$ signifies the sequence of the convolution block along the skip pathway:

$x^{i,j} = \begin{cases} C\big(D(x^{i-1,j})\big), & j = 0 \\ C\big(\big[\,[A(x^{i,k})]_{k=0}^{j-1},\; U(x^{i+1,j-1})\,\big]\big), & j > 0 \end{cases}$

where $C(\cdot)$ denotes the convolution block, $D(\cdot)$ down-sampling, $U(\cdot)$ up-sampling, $A(\cdot)$ the attention gate, and $[\cdot]$ concatenation; that is, the outcomes of the attention gates from node $x^{i,0}$ to $x^{i,j-1}$ in the $i$th layer are concatenated.
Only the selected same-scale feature maps from the encoder will be used by the decoder's convolution blocks after the concatenation procedure, rather than all of the feature maps obtained via dense skip connections. The outcomes of the $j$ preceding blocks in a layer serve as inputs, while the up-sampled feature of the block in the layer below serves as an additional input. Two key innovations of ANU-Net are therefore the way features collected from the encoder are transferred through the network, and the attention gate implemented in the decoder path between nested blocks, which allows features retrieved at various layers to be combined with a targeted selection. ANU-Net accuracy should therefore increase. However, the accuracy of the classification technique alone does not improve much over other existing techniques, so the monarch butterfly optimization (MBO) algorithm is utilized with ANU-Net to improve classification accuracy.
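Since MBO is only named here, a heavily simplified sketch of its two operators (migration and butterfly adjusting) follows; the Levy flight is replaced by a small Gaussian step, and the greedy elitism is our assumption rather than part of the canonical algorithm:

```python
import numpy as np

def mbo_minimize(f, dim, n=20, iters=100, p=5/12, peri=1.2, bar=5/12,
                 lb=-5.0, ub=5.0, seed=0):
    """Simplified monarch butterfly optimization sketch: the population is
    split into two lands; land-1 butterflies migrate coordinates from random
    peers, land-2 butterflies copy the best or a random peer."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(lb, ub, (n, dim))
    fit = np.array([f(x) for x in pop])
    n1 = int(np.ceil(p * n))                 # size of subpopulation 1
    for t in range(1, iters + 1):
        order = np.argsort(fit)
        pop, fit = pop[order], fit[order]
        best = pop[0].copy()
        new = pop.copy()
        for i in range(n1):                  # migration operator
            for d in range(dim):
                if rng.random() * peri <= p:
                    new[i, d] = pop[rng.integers(0, n1), d]
                else:
                    new[i, d] = pop[rng.integers(n1, n), d]
        alpha = 1.0 / t ** 2                 # decaying adjusting rate
        for i in range(n1, n):               # butterfly adjusting operator
            for d in range(dim):
                if rng.random() <= p:
                    new[i, d] = best[d]
                else:
                    new[i, d] = pop[rng.integers(n1, n), d]
                    if rng.random() > bar:
                        new[i, d] += alpha * rng.normal()
        new = np.clip(new, lb, ub)
        new_fit = np.array([f(x) for x in new])
        keep = new_fit < fit                 # greedy elitism (assumption)
        pop[keep], fit[keep] = new[keep], new_fit[keep]
    return pop[np.argmin(fit)], fit.min()
```

In the proposed pipeline, such an optimizer would tune ANU-Net's trainable choices toward higher classification accuracy; the exact coupling is not detailed in this section.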