1. Introduction
The global population of individuals aged 65 and above is increasing rapidly [1]. Unattended falls pose a life-threatening risk to these individuals if they are unable to call for help. According to the WHO global report on fall prevention among seniors, unintentional falls within the home environment rank as the second leading cause of accidental injuries and subsequent fatalities [2], and around half of this demographic are unable to stand up independently after they fall [3]. If the elderly lose consciousness due to the injury, they may miss the chance of timely treatment and face a higher risk of death [4]. However, if the elderly are equipped with fall detection devices, their physical condition can be monitored and responded to in real time, and they can be rescued swiftly if their life and health are threatened by a fatal injury caused by an accidental fall.
With the development of artificial intelligence, object detection algorithms are becoming increasingly important to our lives and have been applied in many fields, such as security, healthcare, robotics [5], and autonomous driving [6,7,8]. Currently, there are two categories of fall detection methods: non-computer-vision-based methods and computer-vision-based methods. The non-computer-vision-based methods detect fall movement via various sensors. For example, the elderly can wear devices with built-in sensors (e.g., accelerometers) on the wrist, chest, or waist, and fall movements can be detected by analyzing the human posture data (velocity, acceleration, etc.) obtained by these sensors. In a study by Mathie et al. [9], fall detection was implemented by analyzing the changes in the acceleration signals of different movements, and falling and standing-up movements were effectively distinguished. Lu et al. [10] analyzed pressure signals to determine whether an accidental fall had occurred. However, these methods have the drawback of relying on the performance of the sensors, which increases the cost of the device. Furthermore, wearing devices equipped with multiple sensors can reduce the comfort of the elderly in their daily lives. Fall detection via computer vision relies on images and videos of the daily activities of elderly individuals recorded by cameras. When the person in view is detected to be falling or to have fallen, the surveillance system can immediately send a distress message so that they can be rescued in time. In the detection process, image processing, pattern recognition, and other related techniques are utilized to extract human motion information, and a human behavior detection model is constructed to identify the fall movement. Compared with the non-computer-vision methods, the computer vision methods have three main advantages: (1) they are non-intrusive, so the elderly are free from the inconvenience of wearing devices; (2) they are not affected by environmental noise, which prevents missed detections or misjudgments caused by interference with wearable sensors; (3) they can monitor other abnormal emergencies at the same time.
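For intuition, a sensor-based detector of the kind described above can be caricatured as a threshold rule on the acceleration magnitude. The following sketch, including its function name, thresholds, and impact-then-stillness heuristic, is purely illustrative and is not the actual method of [9] or [10]:

```python
import math

def detect_fall(samples, impact_g=2.5, still_g=0.3, still_window=5):
    """Toy threshold-based fall detector over accelerometer samples.

    `samples` is a list of (ax, ay, az) readings in units of g. A fall is
    flagged when an impact spike (magnitude > impact_g) is followed by a
    window of near-stillness (|magnitude - 1| < still_g), approximating
    "hit the ground, then not moving". Thresholds are illustrative only.
    """
    mags = [math.sqrt(ax * ax + ay * ay + az * az) for ax, ay, az in samples]
    for i, m in enumerate(mags):
        if m > impact_g:
            window = mags[i + 1:i + 1 + still_window]
            if len(window) == still_window and all(abs(v - 1.0) < still_g for v in window):
                return True
    return False
```

Real wearable systems are considerably more elaborate (orientation changes, machine-learned classifiers), which is part of the sensor-cost drawback noted above.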
With further advances in artificial intelligence, detection algorithms have also been applied in fields such as finance, medical treatment, robotics, and autonomous driving [6,11]. Currently, deep learning algorithms have become the mainstream in computer-vision-based fall detection and have been extensively investigated. Deep-learning-based fall detection algorithms can be categorized as either two-stage or one-stage algorithms. The two-stage detection algorithms, such as R-CNN [12], Fast R-CNN [13], and Faster R-CNN [14], initially generate a series of candidate frames as samples and then classify the samples via a convolutional neural network (CNN). Liu et al. [15] successfully used Faster R-CNN for the detection of the elderly falling off furniture. The algorithm detected and tracked human activity characteristics, measured the changes in these characteristics, and then determined whether the individual fell by analyzing the locations of the individual and the furniture in a Cartesian coordinate system. One obvious drawback of such algorithms is their slow detection speed. The one-stage algorithms predict the localization and classification of targets directly through the detection network. These algorithms implement end-to-end prediction of target boxes and target classes in a single pass, making detection faster and more efficient than that of the two-stage algorithms. The mainstream one-stage algorithms include SSD [16] and the YOLO series [17,18,19]. The YOLO series algorithms have the advantages of being fast, efficient, highly accurate, and easy to deploy and use. Among them, YOLOv5 is one of the leading target detection algorithms in the industry, with faster speed and higher accuracy than YOLOv3 and YOLOv4. Yin et al. [20] achieved real-time, accurate fall detection using an improved YOLOv5s model, but the speed and size of the model were still deficient. This disadvantage increased the cost in their study; that is, they needed to upload the video data to the cloud to judge whether a fall had taken place.
Although video monitoring systems are widely used in public places, this approach still has limits in the detection of indoor falls. This is because the existing fall detection algorithms require substantial computation and network bandwidth, whereas the slow speed of home networks limits fall detection efficiency, making the algorithms difficult to deploy on embedded devices. In order to solve the problem of detection efficiency and accuracy, this paper proposes a lightweight fall detection algorithm based on YOLOv5s. The algorithm has the advantages of high detection accuracy, fast detection speed, and low hardware requirements and computational burden, and is feasible for deployment on embedded devices. The structure of the paper is organized as follows:
Section 2 illustrates the modelling of the lightweight YOLOv5s algorithm;
Section 3 describes the training and validation of the algorithm;
Section 4 presents the results and relevant discussions on the mechanisms of the high-accuracy and high-efficiency detection achieved via the lightweight algorithm.
2. Methodology
2.1. The YOLOv5 Series of Algorithms
Typically, there are four variants of the YOLOv5 algorithm: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x; these models differ in two parameters: the depth and the width of the network. The depth refers to the number of layers of the network, and the width refers to the number of channels output by each network layer.
Increasing the width and depth of the model can improve the detection performance; however, this also increases the computational cost, memory consumption, and inference time. Among these algorithms, YOLOv5s has the smallest depth and width, with fewer channels per layer in the output feature maps, which significantly reduces the number of parameters and, consequently, the computational load. The typical network structure of the YOLOv5s algorithm consists of four parts: input, backbone, neck, and prediction (head), as shown in Figure 1.
The inputs of YOLOv5 are images, which are processed via Mosaic-4 data enhancement, including cropping, stitching, and scaling operations. This increases training efficiency by enriching the detection dataset and enlarging small-scale targets.
The backbone uses CSPDarknet53 as the feature extraction network. The network consists of Conv and C3 modules, which extract the feature maps of the input images and provide the basis for the subsequent processing of the model. The extracted feature maps are then input to the spatial pyramid pooling (SPP) module and transformed into feature vectors with fixed sizes.
The neck network adopts a PANet+FPN [21] structure to achieve multi-scale feature fusion. FPN enhances target detection by propagating high-level semantic features down to the lower layers, which especially benefits small-sized targets; PANet [22] is a modified FPN with an additional bottom-up information flow path that shortens the information delivery path, strengthening the propagation of low-level localization information through the entire feature extraction network. The structure of PANet is shown in Figure 2.
The head predicts the features of the target. Anchor boxes are applied on the targeted feature map to generate the final output vector with category probabilities and target boxes.
2.2. Improved YOLOv5s Network
Although YOLOv5s is relatively small in size among different versions of YOLOv5, it is still a challenge to deploy the YOLOv5s model directly into embedded devices due to the following potential issues: (1) real-time inference may be slow; (2) the model size may exceed the available memory of the embedded device, resulting in the model not being able to be loaded or run; (3) the device may overheat during inference, which affects the inference performance and lifespan of the device. Thus, further lightweight processing of the YOLOv5s algorithm is needed to reduce the computational task of the model.
To avoid these issues, the K-means++ algorithm is applied to the fall dataset to optimize the scale of the predefined anchors, which improves the match between anchor boxes and real samples. The backbone is replaced by the lightweight ShuffleNetV2 network to simplify the fall detection model. Then, the SE attention module is embedded at the end of the backbone to compensate for the loss of accuracy caused by model simplification. Finally, the SIOU loss function is introduced to improve the detection accuracy of the model and accelerate its convergence. The structure of the improved YOLOv5s network is shown in Figure 3.
2.2.1. K-Means++ Algorithm
The anchor boxes in YOLOv5 are conventionally clustered using the K-means algorithm. However, this algorithm randomly assigns the initial cluster centers, which can deviate from the optimal cluster centers, lead to locally optimal solutions, and impair the efficacy of the clustering. Therefore, the K-means++ clustering approach is applied to generate more appropriate anchor values, aiming to improve training convergence without additional parameters or computation. The mechanism of K-means++ can be expressed as follows:
Firstly, a sample is randomly selected from the sample dataset as the initial cluster center. The shortest distance between each sample and the selected cluster centers is then calculated, and each sample is assigned to the category corresponding to its closest cluster center. The probability of each sample being selected as the next cluster center is calculated according to Equation (1):

P(n) = D(n)² / Σₙ D(n)²    (1)

where D(n) represents the shortest distance from sample n to the selected cluster centers. When a new sample is assigned to a cluster, the cluster center is recalculated based on the updated cluster samples. This process is repeated until all K cluster centers have been selected.
Table 1 shows the default and optimized sizes of the prior anchor boxes. The K-means++ algorithm selects the initial cluster centers using a smarter initialization, which avoids the instability of random initialization and avoids falling into locally optimal solutions. Additionally, K-means++ converges quickly once appropriate cluster centers are selected, which aids training convergence on the fall detection dataset.
In Table 1, 80 × 80 represents the size of the shallow feature map (P3), which contains more low-level information and is suitable for detecting small-sized targets. In contrast, 20 × 20 represents the size of the deep feature map (P5), which contains more high-level information, such as contour and structure, and is suitable for detecting large-sized targets. The remaining 40 × 40 is the size of the mesoscale feature map (P4), which uses anchor sizes between the two mentioned above and is used for detecting medium-sized targets. The second column indicates the preset anchor box sizes for the three scales, where the two numbers in parentheses indicate the width and height of the anchor box. The third column shows the optimized anchor box sizes.
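The K-means++ seeding step described above can be sketched in pure Python. The squared-Euclidean distance over (width, height) pairs is a simplifying assumption (YOLO-style anchor clustering commonly uses 1 − IOU as the distance measure), and the function name is ours:

```python
import random

def kmeans_pp_init(boxes, k, seed=0):
    """K-means++ seeding for anchor clustering.

    `boxes` is a list of (width, height) pairs. Returns k initial
    cluster centers chosen with probability proportional to the squared
    distance to the nearest already-selected center (Equation (1)).
    """
    rng = random.Random(seed)
    centers = [rng.choice(boxes)]                 # first center: uniform random
    while len(centers) < k:
        # D(n)^2: squared distance from each sample to its nearest center
        d2 = [min((b[0] - c[0]) ** 2 + (b[1] - c[1]) ** 2 for c in centers)
              for b in boxes]
        total = sum(d2)
        if total == 0:                            # degenerate: all samples covered
            centers.append(rng.choice(boxes))
            continue
        # roulette-wheel draw with probability D(n)^2 / sum D(n)^2
        r, acc = rng.random() * total, 0.0
        for b, w in zip(boxes, d2):
            acc += w
            if acc >= r:
                centers.append(b)
                break
    return centers
```

After seeding, the standard K-means assignment/update iterations run as usual; only the initialization differs from plain K-means.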
2.2.2. Lightweight ShuffleNetV2 Backbone Network
The CSPDarknet53 feature extraction network is commonly used in YOLOv5s. Although its feature extraction is efficient, it is difficult to deploy the algorithm on embedded devices due to its heavy computational load. In this study, the ShuffleNetV2 [23] network was used in the improved algorithm to replace the original backbone of YOLOv5s, which meets the requirements of being lightweight and highly accurate. ShuffleNetV2 inherits the depthwise separable convolution [24] and channel shuffle from ShuffleNetV1 [25], and additionally splits the channels. The two basic units of ShuffleNetV2 are shown in Figure 4. It can be seen that the channels of the unit are divided into two branches because the channels split before concatenation (Figure 4a), which effectively reduces redundant features and increases the computational efficiency of the network. Adding the channel shuffle module after the shortcut connections avoids the problem that the output of a channel comes only from a small part of the original feature map; this realizes the exchange of feature information between different branches and improves detection accuracy. The application of channel split and channel shuffle compresses the computation and memory usage of the model, significantly simplifying it.
For S_Block1, the left branch performs an identity mapping through a shortcut connection to increase the network depth, which reduces fragmentation and accelerates training. The right branch performs convolution through multiple layers while keeping the numbers of input and output channels equal, which minimizes memory access and improves the model's speed.
S_Block2 is a downsampling module in which the input is fed directly into both branches without the splitting operation. It adjusts the number of channels using 1 × 1 convolutions and downsamples via depthwise convolution with a stride of 2. The two branches are then concatenated, halving the size of the feature map and doubling the number of channels.
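The channel shuffle operation that exchanges information between the two branches can be illustrated in a few lines. This sketch operates on a flat list of channels (the function name is ours) and mirrors the reshape-transpose-flatten trick used in the ShuffleNet papers:

```python
def channel_shuffle(channels, groups):
    """Channel shuffle as used in ShuffleNetV2.

    `channels` is a flat list of per-channel feature maps (any objects).
    The list is viewed as (groups, channels_per_group), transposed, and
    flattened again, so each output group mixes channels from all input
    groups (i.e., from both branches when groups == 2).
    """
    n = len(channels)
    assert n % groups == 0, "channel count must be divisible by groups"
    per_group = n // groups
    # reshape to (groups, per_group), transpose to (per_group, groups), flatten
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]
```

With groups = 2 and channels [0, 1, 2, 3, 4, 5], the output interleaves the two halves as [0, 3, 1, 4, 2, 5], so subsequent convolutions see features from both branches.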
2.2.3. SE Attention Module
Inspired by the way human eyes naturally and efficiently find important areas in complex scenes, the "attention mechanism" has been introduced into the field of computer vision [26]. Due to its excellent performance, the attention mechanism is widely used for various computer vision tasks, such as image recognition, object detection, and semantic segmentation. Currently, most studies focus on the extraction of spatial features but pay little attention to the differences among channels. To enhance the network's perception of human motion features in videos, the relationships among channels should be considered.
The SE attention module [27] automatically learns the importance of each channel during training and assigns different weights to the spatial and channel dimensions of the network. The more important a channel's information, the larger its weighting factor. This module directs the network to focus on important features and ignore irrelevant ones, which improves its ability to distinguish features. The structure of the SE attention module is shown in Figure 5.
The SE attention module consists of three main operations: Squeeze, Excitation, and Rescale. The Squeeze operation transforms the input feature map into a global description vector through global average pooling. A fall usually involves changes in human posture and the surrounding environment. With the Squeeze operation, the network is able to capture global features from the inputs, which provides a more comprehensive contextual understanding and improves the accuracy of distinguishing fall movements from other motions. While keeping the number of channels C unchanged, the spatial size of the input feature map is compressed from (H, W) to (1, 1); global pooling thus encodes the overall spatial feature of each channel into a global feature. The value of the cth channel after the squeezing operation can be calculated via Equation (2):

z_c = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)    (2)

where u_c represents the feature map of the cth channel, H represents the height of the feature map, W represents the width of the feature map, and z_c denotes the cth global feature.
The Excitation operation emphasizes features related to the fall movement by learning the relationships among feature map channels. The features of the fall movement involve overall body movements and posture changes, which are often carried by specific channels. Through the Excitation operation, the network self-adaptively learns the weight of each channel to highlight fall-related features, which enhances the network's response to the channels related to fall movement and improves the accuracy and sensitivity of detection. The weights can be calculated via Equation (3):

s = σ(W₂ δ(W₁ z))    (3)

where σ refers to the Sigmoid function, δ refers to the ReLU function, and W₁ and W₂ refer to the weights of the two fully connected layers that form the bottleneck structure.
The Rescale operation re-weights the feature map based on the learned excitation vector. The detection of fall movement is usually disturbed by complex background interference. By re-weighting the feature map, the Rescale operation highlights important features and suppresses less important ones, which improves the localization and detection of fall movement and enhances the network's ability to perceive the key actions of falling. The re-weighting of the channels can be calculated via Equation (4):

x̃_c = s_c · u_c    (4)

where x̃_c indicates the output result, i.e., the product of the feature map u_c and the channel weight s_c.
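The three SE operations can be sketched end-to-end on plain nested lists. This is a minimal illustrative sketch, not the actual network code: the function name, the bias-free fully connected layers, and the toy shapes are our assumptions:

```python
import math

def se_block(feature_maps, w1, w2):
    """Minimal Squeeze-and-Excitation sketch on nested lists.

    `feature_maps` is a list of C channels, each an H x W grid of floats.
    `w1` (C/r x C) and `w2` (C x C/r) are the weights of the two fully
    connected layers of the bottleneck (biases omitted for brevity).
    """
    # Squeeze: global average pooling per channel -> vector z of length C
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in feature_maps]

    # Excitation: s = sigmoid(W2 . relu(W1 . z)), one weight per channel
    hidden = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]

    # Rescale: re-weight every spatial value of channel c by s_c
    return [[[v * s_c for v in row] for row in ch]
            for ch, s_c in zip(feature_maps, s)]
```

In the real module the weights W₁ and W₂ are learned by backpropagation; here they are simply passed in so the data flow of Squeeze, Excitation, and Rescale is visible.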
2.2.4. Improvement of the Loss Function
In the YOLOv5s network, the loss function consists of three components: the rectangular box loss, the classification loss, and the confidence loss. The rectangular box loss uses GIOU loss [28], which adds the smallest rectangular box enclosing the real and predicted boxes to the IOU-based loss calculation. This solves the problem of gradient vanishing when there is no overlapping area between the two boxes. Assuming that A represents the ground truth, B represents the prediction box, and C is the smallest bounding box that can cover them both, the GIOU loss can be calculated by the following equations:

GIOU = IOU − |C − (A ∪ B)| / |C|    (5)
L_GIOU = 1 − GIOU    (6)
However, GIOU degenerates to IOU when the predicted box fully contains the ground truth. Additionally, slow convergence and less-accurate regression are the major drawbacks of GIOU. Therefore, the improved algorithm uses SIOU loss [29] to replace the original GIOU loss function in this study. SIOU loss considers the vector angle between the desired regressions and redefines the penalty indicator. The SIOU loss function consists of four components: angle loss, distance loss, shape loss, and IOU loss.
For the angle loss, SIOU adds the angle perception between the center of the real frame A and that of the predicted frame B, which reduces the number of distance-related extra variables. The angle loss can be calculated with the following equations:
Λ = 1 − 2 sin²(arcsin(x) − π/4)    (7)
x = c_h / d = sin(α)    (8)
c_h = max(b_cy^gt, b_cy) − min(b_cy^gt, b_cy)    (9)
d = √((b_cx^gt − b_cx)² + (b_cy^gt − b_cy)²)    (10)

where Λ is the final calculation result of the angular loss; x is the sine of the angle α between the center points of the real and predicted frames; d is the distance between the two center points; c_h is the relative height difference between them; and (b_cx, b_cy) and (b_cx^gt, b_cy^gt) are the center coordinates of the predicted and real frames, respectively.
The distance loss of SIOU (Δ) differs from that of GIOU due to the newly added angle term, and is expressed as follows:

Δ = Σ_{t=x,y} (1 − e^(−γ ρ_t))    (11)
ρ_x = ((b_cx^gt − b_cx) / c_w)²,  ρ_y = ((b_cy^gt − b_cy) / c_h)²    (12)
γ = 2 − Λ    (13)

where ρ_x and ρ_y are the squared ratios of the relative distances between the centroids of the real and predicted boxes in the X and Y directions to the width c_w and height c_h of the smallest enclosing rectangle, and e is the Euler number.
The formula for calculating the shape loss is shown in Equations (14) and (15):

Ω = Σ_{t=w,h} (1 − e^(−ω_t))^θ    (14)
ω_w = |w − w^gt| / max(w, w^gt),  ω_h = |h − h^gt| / max(h, h^gt)    (15)

where (w, h) and (w^gt, h^gt) are the widths and heights of the prediction frame and the real frame, respectively, and θ is the attention coefficient of the shape loss, with values defined between 2 and 6 for different datasets.
The final SIOU loss can be calculated via Equation (16):

L_SIOU = 1 − IOU + (Δ + Ω) / 2    (16)

The SIOU loss function effectively increases the convergence speed of the model and improves the performance of the fall detection model.
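Putting the four components together, a minimal single-box sketch of the SIOU loss in plain Python might look as follows. The function name and the (cx, cy, w, h) box convention are our assumptions, and θ is set to 4 by default:

```python
import math

def siou_loss(pred, gt, eps=1e-9, theta=4.0):
    """Compute the SIOU loss between one predicted and one ground-truth box.

    Boxes are (cx, cy, w, h). `theta` is the shape-loss attention
    coefficient (defined between 2 and 6 depending on the dataset).
    """
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # --- IOU term ---
    ix1, iy1 = max(px - pw / 2, gx - gw / 2), max(py - ph / 2, gy - gh / 2)
    ix2, iy2 = min(px + pw / 2, gx + gw / 2), min(py + ph / 2, gy + gh / 2)
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # --- smallest enclosing box (for the distance normalizers c_w, c_h) ---
    cw = max(px + pw / 2, gx + gw / 2) - min(px - pw / 2, gx - gw / 2)
    ch = max(py + ph / 2, gy + gh / 2) - min(py - ph / 2, gy - gh / 2)

    # --- angle loss: Lambda = 1 - 2 sin^2(arcsin(x) - pi/4) ---
    s_cw, s_ch = gx - px, gy - py              # center offsets
    d = math.hypot(s_cw, s_ch) + eps           # center distance
    x = abs(s_ch) / d                          # sin(alpha)
    lam = 1 - 2 * math.sin(math.asin(min(x, 1.0)) - math.pi / 4) ** 2

    # --- distance loss: Delta = sum(1 - exp(-gamma * rho_t)), gamma = 2 - Lambda
    gamma = 2 - lam
    rho_x, rho_y = (s_cw / (cw + eps)) ** 2, (s_ch / (ch + eps)) ** 2
    delta = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # --- shape loss: Omega = sum((1 - exp(-omega_t)) ** theta) ---
    omega_w = abs(pw - gw) / max(pw, gw)
    omega_h = abs(ph - gh) / max(ph, gh)
    omega = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    # --- final loss: 1 - IOU + (Delta + Omega) / 2 ---
    return 1 - iou + (delta + omega) / 2
```

The loss is zero for a perfect prediction and grows as the boxes separate in position, angle, or shape; in training this scalar would be computed per matched anchor and averaged.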