Article

YOLOv8s-NE: Enhancing Object Detection of Small Objects in Nursery Environments Based on Improved YOLOv8

1 Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu-ku, Kitakyushu 808-0196, Japan
2 Department of Information Systems, Hasanuddin University, Makassar 90245, South Sulawesi, Indonesia
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3293; https://doi.org/10.3390/electronics13163293
Submission received: 26 April 2024 / Revised: 26 May 2024 / Accepted: 16 August 2024 / Published: 19 August 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

The primary objective of this research is to examine object detection within the specific environment of a nursery. The nursery environment presents a complex scene with a multitude of objects varying in size and background. To simulate real-world conditions, we gathered data from a nursery. Our study centers on the detection of small objects, particularly in nursery settings where objects such as stationery, toys, and small accessories are commonly present. These objects are of significant importance for understanding the activities and interactions taking place within the room. Due to their small size and the possibility of occlusion by other objects or children, these objects are inherently difficult to detect precisely. This study introduces YOLOv8s-NE in an effort to enhance the detection of small objects found in the nursery. We improve the standard YOLOv8 by incorporating an extra detection head to effectively detect small objects. We replace the C2f module with C2f_DCN to further improve the model’s ability to detect objects of varying sizes that can be deformed or occluded within the image. Furthermore, we introduce NAM attention to focus on the important features and ignore less informative ones, thereby improving the accuracy of our proposed model. We used a five-fold cross-validation approach to split the dataset for evaluating the performance of YOLOv8s-NE, thereby facilitating a more comprehensive model evaluation. Our model achieves 34.1% APs, 45.1% mAP50:95, and 76.7% mAP50 detection accuracy at 37.55 FPS on the nursery dataset. In terms of the APs, mAP50:95, and mAP50 metrics, our proposed YOLOv8s-NE model outperforms the standard YOLOv8s model, with improvements of 4.6%, 4.7%, and 3.9%, respectively. We apply our proposed YOLOv8s-NE model as a safety system by developing an algorithm to detect objects on top of cabinets that could be potentially risky to children.

1. Introduction

Object detection is a part of the popular field of computer vision, which involves identifying and localizing objects in images or videos. It goes beyond simple image classification by not only determining the class or category of an object but also providing information about its precise position within the image. The use of object detection has become increasingly prevalent due to its potential to enhance the efficiency and effectiveness of various systems. For instance, in camera surveillance systems, object detection plays a critical role in detecting and tracking targets such as people and vehicles, and even in recognizing activities [1,2,3].
This research study focuses on object detection in the specific setting of a nursery environment. The nursery environment presents a complex scene with a multitude of objects, varying in size, shape, and attributes. One of the key challenges is that nursery environments often contain objects such as toys, stationery, or small accessories that may be critical for understanding the activities and interactions within the space. However, detecting these small objects accurately presents inherent difficulties due to their limited visual appearance and potential occlusion by other objects or children.
Detecting small objects requires the development of specialized techniques that can handle the unique characteristics of these objects. The low resolution and limited visual details make it challenging to distinguish small objects from the background or from other similar-looking objects. Moreover, the presence of occlusion further complicates the task, as small objects can easily be concealed or partially hidden by larger objects or children. By addressing the difficulty of recognizing small objects in nursery environments, the present research aims to contribute to the development of reliable and precise object detection systems. The outcome of this research can improve the surveillance and management of nursery environments by providing reliable detection of small objects, thereby enhancing safety, efficiency, and the overall quality of care in these settings.
Several applications extensively apply object detection algorithms due to their significant advancements in both detection accuracy and speed [4,5,6,7]. Object detection algorithms can be classified as either two-stage or one-stage detectors. Two-stage detectors consist of a separate region proposal network (RPN) and a classification stage. The first stage extracts features and generates region proposals, which are then used in the second stage for classification to determine class labels and for regression to determine the bounding boxes. R-CNN [8], Fast R-CNN [9], and Faster R-CNN [10] are popular object detection algorithms that use a two-stage approach. In contrast, one-stage detectors perform both operations simultaneously; YOLO [11], SSD [12], and RetinaNet [13] are among the most popular one-stage algorithms.
The popular object detection algorithm YOLOv8 [14], introduced by the Ultralytics company in January 2023, attracted the attention of the computer vision community because of its enhanced accuracy and speed. YOLOv8 is a well-liked option for object detection tasks in a variety of fields, such as autonomous driving [15,16], robotics [17], and surveillance [18,19], as a result of its exceptional performance in accuracy and speed. However, some limitations appear when YOLOv8 and other algorithms trained on generalized datasets are applied to domain-specific datasets. These limitations include difficulty in detecting small objects with limited visual features, inadequate adaptation to complex backgrounds, and limited ability to handle target deformation and changes in object size or position. Unfortunately, the standard YOLOv8 model does not yield the best results in the nursery environment, especially when it comes to recognizing smaller objects.
To address these issues, this study introduces YOLOv8s-NE, an enhancement of standard YOLOv8 that aims to improve the detection of small objects in the nursery environment. Several improvements to standard YOLOv8 are developed to achieve our goals. First, we enhance the detection of small objects in standard YOLOv8 by adding an extra detection head that is adaptively sensitive to small objects. This integration incorporates shallow, fine-grained features, which effectively exploit the characteristics of small objects. For further improvement, DCNv2 [20] is integrated into the original C2f module to enhance the ability of the model to handle the arbitrary sizes and shapes of objects caused by partial occlusion between objects during the children’s activities within the nursery. Finally, we introduce an attention mechanism module into the YOLOv8 model. This additional module selectively learns and adjusts the relationships between the channels and the spatial information, which effectively extracts relevant information from small objects with restricted visual features. All of these enhancements greatly improve the ability of the standard YOLOv8 to detect small objects in a nursery environment.
As a result, we propose YOLOv8s-NE to achieve our goals of accurately detecting small objects and ensuring real-time performance. This research effort’s main contribution to detecting small objects in the nursery environment is briefly summarized as follows:
  • To improve feature fusion for small objects, an extra detection head with large-scale feature maps is added, connecting the shallower backbone network of the YOLOv8 model;
  • The original C2f module of the YOLOv8 network incorporates the DCNv2 module to improve its capacity to detect objects of various sizes and shapes that may arise due to partial occlusion;
  • We add four Normalization-based Attention Modules (NAM) to the neck of YOLOv8, which effectively extract relevant information from small objects with restricted visual features;
  • We perform five-fold cross-validation, which provides a more comprehensive model evaluation; the proposed model improves the APs and mAP50 metrics by 4.6% and 3.9%, respectively, over the baseline YOLOv8s model;
  • Finally, we apply our proposed model as a safety system by developing an algorithm to detect objects on cabinets in a nursery.

2. Overview of YOLOv8 Network Architecture

The Ultralytics-developed YOLOv8 object detection algorithm comes in five scaled versions: YOLOv8n targets low-resource efficiency; YOLOv8s balances speed and accuracy; YOLOv8m offers moderate performance; YOLOv8l emphasizes high accuracy; and YOLOv8x delivers the highest accuracy at increased computational demands. The YOLOv8 network architecture, shown in Figure 1, consists of four primary components: input, backbone, neck, and head.
(1) Input: The input component’s purpose is to receive and prepare the input image data. Pre-processing steps commonly involve image augmentation techniques such as mosaic, scaling, mixup, etc. Choosing the appropriate image augmentation technique for the dataset results in a considerable improvement in the performance of the model. Furthermore, the image size needs to be normalized to 640 × 640 pixels according to the YOLOv8 input standard.
(2) Backbone: The primary task of the backbone component is to extract features from the input image by utilizing a sequence of convolutional layers. The backbone in YOLOv8 consists of a modified version of the Conv module, the C2f module, and the Spatial Pyramid Pooling-Fast (SPPF) module. The Conv module combines convolution, batch normalization, and the SiLU activation function (a minimal sketch of this building block is given below). The C2f module comprises two Conv modules and n Bottleneck modules, connected by Split and Concat operations. For multi-scale fusion, SPPF applies serial max-pooling, which expands the receptive field of the feature map.
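For concreteness, the following is a minimal PyTorch sketch of the Conv building block described above (convolution, batch normalization, SiLU). It follows the publicly documented YOLOv8 design rather than the exact Ultralytics source, and the class name ConvBlock is our own.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution -> batch normalization -> SiLU activation, as in YOLOv8's Conv module."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        padding = kernel_size // 2  # "same" padding for odd kernel sizes
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Example: a stride-2 ConvBlock halves the 640 x 640 input resolution.
x = torch.randn(1, 3, 640, 640)
print(ConvBlock(3, 32, stride=2)(x).shape)  # torch.Size([1, 32, 320, 320])
```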
(3) Neck: The neck component fuses backbone features from different layers of the network, improving their representational ability. YOLOv8 uses the Feature Pyramid Network (FPN) [21] and Pixel Aggregation Network (PAN) [22] structures. This combination enables the network to fuse more feature information, creating a multi-scale feature fusion module.
(4) Head: The head component is responsible for generating the network’s output by localizing the bounding boxes and predicting class probabilities for each object in the image. The standard YOLOv8 model outputs three feature maps of different sizes: 80 × 80 × 256, 40 × 40 × 512, and 20 × 20 × 1024, corresponding to the detection of small, medium, and large objects, respectively. YOLOv8 uses a decoupled head [23], enabling separate processing of the classification and regression tasks. The classification task uses binary cross-entropy (BCE) loss, and the bounding box loss function uses CIoU loss [24] and DFL [25]. NMS [26] is performed as the final step, selecting the highest-scoring predicted bounding boxes for the final output (a small illustration is given below).
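The NMS step can be illustrated with torchvision’s built-in operator; the boxes and scores below are made-up values, not outputs of the actual model.

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],     # heavily overlaps the first box
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.8, 0.75])

keep = nms(boxes, scores, iou_threshold=0.5)    # indices of boxes kept after suppression
print(keep)  # tensor([0, 2]): the lower-scoring duplicate of box 0 is removed
```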

3. YOLOv8s-NE Network Architecture Design

To address the main challenges in object detection for nursery scene imagery, we propose YOLOv8s-NE, an object detection model specifically designed for nursery scenarios. The schematic diagram of YOLOv8s-NE illustrates the architectural design of the model, as depicted in Figure 2. YOLOv8s-NE incorporates the following improvements over standard YOLOv8: (1) an extra detection head with large-scale feature maps is added to preserve the textural features of small objects; (2) every C2f module of YOLOv8 is replaced with a new module called C2f_DCN; (3) four NAM modules are added after the last four C2f_DCN modules in the neck of YOLOv8.

3.1. Extra Detection Head for Small Objects Detection

The standard YOLOv8 model utilizes three scale layers for detection, with feature map dimensions of 20 × 20 × 1024, 40 × 40 × 512, and 80 × 80 × 256, respectively. This model works well on certain datasets but performs inadequately on datasets with high size diversity, specifically datasets that contain a high proportion of small objects. The YOLOv8 backbone network is constructed using multiple layers of down-sampling convolutions, which reduces the feature map size as the network progresses deeper in the feature extraction process. The limited dimensions of the feature map pose challenges in identifying and locating small objects within the image. Unfortunately, the nursery image dataset contains a high proportion of small objects. Therefore, to enhance the feature fusion effect for small objects, we add an extra detection head with large-scale feature maps of 160 × 160 × 128 and connect it to the shallower layers of the backbone network of the standard YOLOv8. As a result, the network’s feature fusion layers receive detailed, lower-level image features, and the additional detection head becomes more sensitive to smaller objects. The final structure of the network with the extra detection head is visualized in Figure 2.
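A quick numeric sketch of why the extra head helps: for a 640 × 640 input, each detection head operates on the feature map produced by its stride, and the stride-4 map (160 × 160) added in YOLOv8s-NE keeps far more spatial detail for small objects than the stride-8/16/32 maps. The P2-P5 naming below is a common convention for these levels, not terminology taken from our network definition.

```python
# Grid size per detection head for a 640x640 input image.
IMG_SIZE = 640
heads = {
    "extra small-object head (P2)": 4,   # added in YOLOv8s-NE
    "small-object head (P3)": 8,
    "medium-object head (P4)": 16,
    "large-object head (P5)": 32,
}
for name, stride in heads.items():
    grid = IMG_SIZE // stride
    # each grid cell corresponds to a stride x stride pixel region of the input
    print(f"{name}: {grid}x{grid} grid, one cell per {stride}x{stride} pixel region")
```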

3.2. Deformable ConvNets V2 (DCNv2)

The objects to be detected in the nursery environment have complex characteristics. The objects exhibit significant variation in size, from large pieces of furniture such as cabinets to small toys. Adding to this complexity is the phenomenon of occlusion, in which some objects are partially or completely obscured by other objects or people, which is quite common. For example, children engaged in play or larger objects present in the nursery environment can obscure toys. The altered appearance under occlusion causes the perceived shapes of the occluded objects to vary. Consequently, objects in these settings pose a challenge for accurate detection, particularly small objects.
The DCNv1 deformable convolution, as described in [27], is designed to overcome the limitation of regular convolution, which relies on fixed geometric structures. The innovation involves learnable 2-D offsets on top of the regular convolution, which enable the sampling grid to be flexibly deformed. This process is described by Equations (1) and (2):
$$Y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n), \tag{1}$$
$$R = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}, \tag{2}$$
where $Y(p_0)$ is the value of the output feature at location $p_0$, and a 3 × 3 kernel grid $R$ with a dilation of 1 is used. Here, $x$ represents the input feature map, $w$ denotes the sampled value weights, $p_n$ enumerates the locations in $R$, and $\Delta p_n$ is the learnable 2-D offset consisting of offsets in the x and y directions.
The limitation of the prior deformable convolution is resolved by DCNv2, as shown in Figure 3, which incorporates modulation scalars $\Delta m_{p_n} \in [0, 1]$ that adjust the weighting of the sample points, as described by Equation (3):
$$Y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \cdot \Delta m_{p_n} \tag{3}$$
In our work, the C2f module is replaced by the C2f_DCN module; that is, the Conv module in the Bottleneck module is substituted with DCNv2, as depicted in Figure 2. The new module enhances the capacity to detect objects of various sizes and shapes, even when partially occluded by other objects in a nursery setting.
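As a rough illustration, the following sketch builds a modulated deformable convolution (DCNv2-style) layer with torchvision’s DeformConv2d, assuming a reasonably recent torchvision that supports the modulation mask. The offset/mask branch and the DCNBlock name are our own simplification; the actual C2f_DCN module wires this operation into YOLOv8’s Bottleneck as shown in Figure 2.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCNBlock(nn.Module):
    """Modulated deformable convolution: offsets and masks are predicted from the input."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        # one conv predicts 2*k*k offset channels and k*k modulation-mask channels
        self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, kernel_size=k, padding=k // 2)
        self.dcn = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        om = self.offset_mask(x)
        n = self.k * self.k
        offset, mask = torch.split(om, [2 * n, n], dim=1)
        mask = torch.sigmoid(mask)          # modulation scalars in [0, 1], as in Equation (3)
        return self.dcn(x, offset, mask)

x = torch.randn(1, 64, 80, 80)
print(DCNBlock(64, 64)(x).shape)            # torch.Size([1, 64, 80, 80])
```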

3.3. Normalization-Based Attention Module (NAM)

NAM is an attention mechanism proposed in [28] that adopts the CBAM module [29] and redesigns the channel and spatial attention submodules. It re-weights attention by adjusting the variance measure of the training weights in both the channel and spatial dimensions. The scaling factors from batch normalization (BN) can be formulated as in Equation (4):
$$B_{out} = \mathrm{BN}(B_{in}) = \gamma \frac{B_{in} - \mu_b}{\sqrt{\sigma_b^2 + \epsilon}} + \beta, \tag{4}$$
where $\gamma$ and $\beta$ are the trainable scale and shift parameters, respectively, and $\mu_b$ and $\sigma_b$ represent the mean and standard deviation, respectively, of each batch denoted by $b$.
The scaling factor $\gamma$ represents the variance in BN. A higher variance indicates more variation and richer information. The channel attention module prioritizes relevant channels by utilizing the $\gamma$-normalized correlation weights $W_\gamma$, while disregarding less informative ones. Suppose we have an input feature map $F_1 \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ indicate the height, width, and number of channels, respectively. The resulting channel attention value $M_c$ can be expressed as in Equation (5):
$$M_c = \mathrm{sigmoid}(W_\gamma(\mathrm{BN}(F_1))) \tag{5}$$
The spatial attention module applies BN to pixels in the spatial dimension, a process denoted as pixel normalization (PN). It prioritizes the pixels that contain more valuable information by considering the scaling factor $\lambda$ and adjusting the associated weights $W_\lambda$. Similarly, $F_2 \in \mathbb{R}^{H \times W \times C}$ is the input feature map, and the output $M_s$ of the spatial attention module is given in Equation (6):
$$M_s = \mathrm{sigmoid}(W_\lambda(\mathrm{PN}(F_2))) \tag{6}$$
The nursery setting contains many small objects with low resolution and limited visual features, making it difficult to distinguish the actual object from the background or from other objects that may appear similar. To overcome this limitation, we incorporate four NAM modules positioned after the last four C2f_DCN modules, before each detection head in YOLOv8’s neck, as depicted in Figure 2, aiming to focus on the objects’ pertinent features by refining the channel and spatial information.
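A compact sketch of NAM-style channel attention as described by Equation (5) is shown below: the batch normalization scale factors $\gamma$ are normalized into channel weights $W_\gamma$ and used to re-weight the feature map. It follows the public description in [28]; the class name and tensor shapes are illustrative only.

```python
import torch
import torch.nn as nn

class NAMChannelAttention(nn.Module):
    """Channel attention driven by the BN scale factors (gamma), as in Equation (5)."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x
        x = self.bn(x)
        # W_gamma: each channel's share of the total BN scale magnitude
        w = self.bn.weight.abs() / self.bn.weight.abs().sum()
        x = x * w.view(1, -1, 1, 1)
        return residual * torch.sigmoid(x)   # M_c applied as a gate on the input features

feat = torch.randn(2, 128, 40, 40)
print(NAMChannelAttention(128)(feat).shape)  # torch.Size([2, 128, 40, 40])
```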

4. Experiments and Results

4.1. Dataset

We used cameras to gather the dataset for this study from the nursery environment. The dataset comprises 12 classes of objects commonly found in nursery environments: toys, cabinets, tables, chairs, gadgets, people, beds, blankets, books, basket storage, bottles, and cups. Additional images related to the nursery environment were also collected from the internet because of the limited number and variety of objects in the nursery where we collected the dataset. The collected dataset reflects the actual environment, including a variety of object sizes, lighting conditions, and backgrounds, making it relevant to the real world, as shown in Figure 4.
We labeled the objects within the images with bounding boxes that indicate the location or area of each object class by using roboflow.com with YOLOv8’s output format. The dataset contains a total of 360 images, which we randomly split into two sets: the training set comprises 80 % of the images, totaling 288 images, and the validation set comprises 20 % , totaling 72 images. The dataset includes 7016 labeled objects from 12 different object classes, with objects of different sizes found in nursery dataset images. We defined the sizes of objects based on the bounding box’s area, following the reference [30]. Table 1 shows the size distributions of annotated objects among different object categories in the nursery dataset. The dataset contains large numbers of small objects, as shown in Figure 5.
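The following small helper illustrates how the size buckets of Table 1 are interpreted: each annotated object is binned by its bounding box area at the 32 × 32 and 96 × 96 pixel thresholds, following [30]. The coordinates in the example are arbitrary.

```python
def size_category(x_min, y_min, x_max, y_max):
    """Classify a bounding box as small, medium, or large by its pixel area."""
    area = (x_max - x_min) * (y_max - y_min)
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"

print(size_category(100, 100, 125, 125))  # "small"  (25 x 25 = 625 px^2)
print(size_category(100, 100, 180, 180))  # "medium" (80 x 80 = 6400 px^2)
```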
We preprocessed all images in the dataset by resizing them to 640 × 640 pixels. We make sure the model performance is not influenced by a particular data partition by applying a five-fold cross-validation approach and reporting the mean performance.
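A sketch of this five-fold protocol with scikit-learn’s KFold is given below; the file names and random seed are placeholders, and the reported metrics are the mean over the five folds.

```python
from sklearn.model_selection import KFold

image_files = [f"nursery_{i:03d}.jpg" for i in range(360)]  # 360 images in the dataset
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, val_idx) in enumerate(kfold.split(image_files)):
    train_set = [image_files[i] for i in train_idx]   # 288 images per fold
    val_set = [image_files[i] for i in val_idx]       # 72 images per fold
    print(f"fold {fold}: {len(train_set)} train / {len(val_set)} val")
    # train YOLOv8s-NE on train_set, evaluate on val_set, then average the metrics
```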

4.2. Experimental Settings

We used the software and hardware specifications from Table 2 to implement all experiments and the model parameter settings listed in Table 3. Every experiment was conducted using typical data augmentation techniques, wherein the image was uniformly resized to 640 × 640 pixels.
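For illustration, the following hedged sketch shows how the settings in Table 3 (together with the 640 × 640 input size) map onto the Ultralytics training API; the model and dataset file names are placeholders, not the actual files used in this study.

```python
from ultralytics import YOLO

model = YOLO("yolov8s-ne.yaml")      # placeholder: custom model definition for YOLOv8s-NE
model.train(
    data="nursery.yaml",             # placeholder: dataset config with the 12 nursery classes
    imgsz=640,                       # images resized to 640x640
    epochs=450,
    batch=8,
    optimizer="SGD",
    lr0=0.01,
    workers=1,
)
```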

4.3. Evaluation Metrics

To verify the generalization ability of YOLOv8s-NE, our evaluation metrics consist of two indicators, as follows:
(1) The mean average precision (mAP) metrics include mAP50 and mAP50:95. mAP50 is the mean average precision at an intersection over union (IoU) threshold of 0.5, averaged over all classes. The other metric, mAP50:95, averages the AP for all classes over IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05. The metrics APs, APm, and APl denote the average precision for bounding boxes characterized as small, medium, and large based on their areas; see [31] for comprehensive scale-based AP evaluation metrics. This study primarily aims to evaluate our proposed model’s performance in detecting small objects, with APs serving as the main indicator for the experiments.
The average precision (AP) is calculated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} = \frac{TP}{\text{all detections}}, \tag{7}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} = \frac{TP}{\text{all ground truths}}, \tag{8}$$
where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. AP is calculated as the area under the precision-recall curve:
$$AP = \int_0^1 P(R)\, dR , \tag{9}$$
Then, mAP and mAP50:95 are determined using Equations (10) and (11), respectively:
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i, \tag{10}$$
where $N$ is the total number of classes examined and $AP_i$ is the AP value for the i-th class.
$$\mathrm{mAP}_{50:95} = \frac{\mathrm{mAP}_{50} + \mathrm{mAP}_{55} + \cdots + \mathrm{mAP}_{95}}{10} \tag{11}$$
(2) Furthermore, we use Frames Per Second (FPS) as the indicator of real-time capability. The FPS is calculated by Equation (12):
$$\mathrm{FPS} = \frac{1000}{\text{pre-process (ms)} + \text{inference (ms)} + \text{post-process (ms)}} \tag{12}$$
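The following numerical sketch walks through Equations (7)-(12) with toy values: precision and recall from TP/FP/FN counts, AP as the area under a precision-recall curve, mAP as the mean of per-class APs, and FPS from the per-image processing times. All numbers below are illustrative only.

```python
import numpy as np

tp, fp, fn = 80, 20, 10
precision = tp / (tp + fp)                      # Equation (7): 0.80
recall = tp / (tp + fn)                         # Equation (8): ~0.89

# Equation (9): AP approximated as the trapezoidal area under a toy precision-recall curve
recall_pts = np.array([0.0, 0.2, 0.5, 0.8, 0.89])
precision_pts = np.array([1.0, 0.95, 0.90, 0.85, 0.80])
ap = float(np.sum((recall_pts[1:] - recall_pts[:-1]) * (precision_pts[1:] + precision_pts[:-1]) / 2))

per_class_ap = [ap, 0.7, 0.6]                   # toy APs for N = 3 classes
map_value = float(np.mean(per_class_ap))        # Equation (10)

def frames_per_second(pre_ms, infer_ms, post_ms):
    # Equation (12): FPS from pre-processing, inference, and post-processing times in ms
    return 1000.0 / (pre_ms + infer_ms + post_ms)

print(precision, recall, round(ap, 3), round(map_value, 3))
print(round(frames_per_second(1.5, 23.0, 2.1), 2))   # 37.59 FPS, about 26.6 ms per frame
```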

4.4. Experimental Results

4.4.1. Results of Ablation Experiments

This section explores the effect of improvements to the network model. We used YOLOv8s (small version) as a baseline for model improvement. We performed five-fold cross-validation, which provides a more comprehensive model evaluation, to demonstrate the detection performance evaluation of the proposed YOLOv8s-NE model over the standard YOLOv8s model. Table 4 presents the experimental results.
  • Effect of the addition of an extra detection head (YOLOv8s-v1).
    Compared to the baseline YOLOv8s model, the extra detection head improves the detection of small objects by 3.3% in APs.
  • Effect of integrating an extra detection head and DCNv2 (YOLOv8s-v2).
    The C2f_DCN module, which incorporates DCNv2, replaces the standard C2f module in the model network. Compared to the YOLOv8s-v1 model, the detection of small objects is further improved by 0.6% in APs, and YOLOv8s-v2 also contributes to the overall improvement of mAP50:95 and mAP50 by 0.5% and 0.3%, respectively.
  • Effect of integrating an extra detection head, DCNv2 and NAM module (YOLOv8s-NE).
    Finally, we add four NAM modules after the last four C2f_DCN modules in the neck of the network, before each detection head. Compared to YOLOv8s-v2, YOLOv8s-NE achieves a further 0.7% gain in APs, and it also improves mAP50:95 and mAP50 by 0.6% and 0.9%, respectively.
Overall, in terms of the APs, mAP50:95, and mAP50 metrics, the proposed YOLOv8s-NE model outperforms the standard YOLOv8s model by 4.6%, 4.7%, and 3.9%, respectively. Although YOLOv8s-NE runs at 37.55 FPS, it is still applicable to real-time scenarios.

4.4.2. Proposed Method Performance Demonstration

To facilitate a clearer and more intuitive comparison between the prediction results of the proposed YOLOv8s-NE and the other improved YOLOv8s models, we present Figure 6 and Figure 7, which visualize their respective abilities to detect objects within a nursery environment. In alignment with our research focus, these visualizations show an evaluation performed to identify only the small objects in two nursery settings: one with a few objects and one with many. Figure 6 shows that the proposed YOLOv8s-NE model achieves the best detection performance, with the fewest missed and false detections, reaching 88.9% precision and 100% recall. The YOLOv8s-v1 and YOLOv8s-v2 models both achieve moderate performance, identifying most of the small objects, while the YOLOv8s model has the worst detection performance, missing many small objects and yielding relatively low precision and recall values of 60% and 75%, respectively. Similarly, the YOLOv8s-NE model also outperforms the other models in the many-object setting presented in Figure 7. In this evaluation, YOLOv8s-NE achieves the best detection performance, with a precision of 64% and a recall of 61.5%. It is followed by YOLOv8s-v2, with a precision of 52% and a recall of 50%, and YOLOv8s-v1, with a precision of 48% and a recall of 46.2%, while YOLOv8s demonstrates the lowest precision of 39.3% and recall of 42.3%. This signifies that the proposed YOLOv8s-NE model greatly improves the detection of small objects in the nursery environment.

4.4.3. Comparison with Other Object Detection Models

To evaluate the competitiveness of YOLOv8s-NE, we conducted a comparative analysis against various object detection models. Five-fold cross-validation was performed, and the results are summarized in Table 5. As presented in Table 5, the proposed model demonstrates superior performance in the APs metric, with improvements of 10.6%, 14.7%, 1.2%, 5.6%, 6.0%, and 1.5% compared to Faster RCNN, YOLOv3-tiny, YOLOv3 [32], YOLOv5s, YOLOv6s [33], and YOLOv9-c [34], respectively, which indicates that the YOLOv8s-NE model has superior detection ability, specifically for small target objects in the nursery environment. The mAP50:95 and mAP50 reach 45.1% and 76.7%, respectively, the highest detection performance among the compared models. The high mAP50:95 and mAP50 scores achieved by our proposed model indicate its superiority in detecting the various object classes within a nursery environment, and the high mAP50:95 score in particular indicates strong performance across different detection thresholds. In terms of FPS, our proposed model achieves 37.55 fps, which is higher than YOLOv3 but lower than the other models, while still retaining real-time capability.

4.5. Application of the Nursery Environment Safety System Based on YOLOv8s-NE

In this study, we apply the YOLOv8s-NE model to develop a safety system for the nursery environment. Specifically, our system is designed to detect objects placed on top of cabinets, since such objects pose a risk to children if they are pulled down and hit a child during activities in the nursery.
Initially, the YOLOv8s-NE model is applied to the input video frames to perform object detection, resulting in a set of detected objects, obj, and a set of detected cabinets, cab. Each object and cabinet is represented by a bounding box with coordinates (obj.xMin, obj.yMin, obj.xMax, obj.yMax) and (cab.xMin, cab.yMin, cab.xMax, cab.yMax), respectively. To ascertain which objects are situated on top of the cabinets, a comparative analysis of the bounding box coordinates is performed. Specifically, for each object and each cabinet, the coordinates are compared to determine whether the object lies within the top area of the cabinet’s bounding box. Algorithm 1 outlines how our system detects the objects placed on top of a cabinet. Figure 8 displays our system detecting four objects on the cabinets.
Algorithm 1 Detect objects on top of cabinets
Input: detectionResults, the bounding box values (xMin, yMin, xMax, yMax) from the YOLOv8s-NE model.
Output: Target objects on top of cabinets.
function calcRelPosition(obj.xMin, obj.yMin, obj.xMax, obj.yMax)
    relativeX ← (obj.xMin + obj.xMax) / 2
    relativeY ← obj.yMax
    return relativeX, relativeY
end function
function calcTopAreaCabinet(cabinetBbox, alpha = 0.2)
    cab ← cabinetBbox
    α ← alpha                          ▹ α is the tolerance value
    yTopAreaCabinet ← cab.yMin + α × (cab.yMax − cab.yMin)
    return yTopAreaCabinet
end function
function isWithinRangeX(value, minValue, maxValue)
    return minValue ≤ value ≤ maxValue
end function
function isWithinRangeY(value, minValue, maxValue)
    return minValue ≤ value ≤ maxValue
end function
for each targetObject in detectionResults do
    (relativeX, relativeY) ← calcRelPosition(obj.xMin, obj.yMin, obj.xMax, obj.yMax)
    yTopAreaCabinet ← calcTopAreaCabinet(cabinetBbox)
    isWithinX ← isWithinRangeX(relativeX, cab.xMin, cab.xMax)
    isWithinY ← isWithinRangeY(relativeY, cab.yMin, yTopAreaCabinet)
    if isWithinX and isWithinY then
        output "Target object on cabinet."
    end if
end for
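For reference, a hedged Python realization of Algorithm 1 on top of Ultralytics-style detection results is sketched below. The weights file, the frame path, the class name "cabinet", and the tolerance value are placeholders for illustration only.

```python
from ultralytics import YOLO

model = YOLO("yolov8s_ne.pt")                      # placeholder weights for YOLOv8s-NE
result = model("nursery_frame.jpg")[0]             # placeholder input frame

boxes = result.boxes.xyxy.tolist()                 # [xMin, yMin, xMax, yMax] per detection
labels = [result.names[int(c)] for c in result.boxes.cls]

cabinets = [b for b, l in zip(boxes, labels) if l == "cabinet"]
objects = [(b, l) for b, l in zip(boxes, labels) if l != "cabinet"]

ALPHA = 0.2                                        # tolerance for the cabinet's top area
for (ox1, oy1, ox2, oy2), label in objects:
    rel_x = (ox1 + ox2) / 2                        # horizontal centre of the object
    rel_y = oy2                                    # bottom edge of the object
    for cx1, cy1, cx2, cy2 in cabinets:
        y_top_area = cy1 + ALPHA * (cy2 - cy1)     # top band of the cabinet bounding box
        if cx1 <= rel_x <= cx2 and cy1 <= rel_y <= y_top_area:
            print(f"Target object on cabinet: {label}")
```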

5. Conclusions and Future Work

This study introduces YOLOv8s-NE in an effort to enhance the model’s ability to detect small objects found in a nursery environment. We improve the standard YOLOv8 by incorporating an extra detection head that is adaptable for detecting small objects. We replace the C2f module with C2f_DCN to enhance the detection accuracy for objects of varying scales that are affected by deformation or occlusion within the image. Lastly, we introduce NAM attention to focus on the important features and ignore less informative ones, thereby improving the detection performance. Compared to the baseline YOLOv8s model, the extra detection head increases the APs by 3.3%. The C2f_DCN module further improves the detection of small objects by 0.6% in APs, and it also contributes to the overall improvement of mAP50:95 and mAP50 by 0.5% and 0.3%, respectively. Finally, NAM attention improves APs by 0.7%, and mAP50:95 and mAP50 by 0.6% and 0.9%, respectively. In terms of the APs, mAP50:95, and mAP50 metrics, our proposed YOLOv8s-NE model outperforms the standard YOLOv8s model, with improvements of 4.6%, 4.7%, and 3.9%, respectively. Compared to other object detection models, the proposed YOLOv8s-NE model offers superior performance in terms of APs at a lower FPS, while maintaining functionality for real-time scenarios. In this study, we applied our proposed YOLOv8s-NE model as an innovative safety system in a nursery environment: the algorithm is designed to detect objects placed on top of cabinets that could pose a potential threat to children during activities.
Our future direction involves the development of an advanced algorithm and system designed not only to detect objects but also to recognize potential hazards or risks during children’s activities within the nursery. This approach aims to enhance the safety of the nursery environment for children.

Author Contributions

Conceptualization, S.B.A. and K.H.; methodology, S.B.A. and K.H.; software, S.B.A.; validation, S.B.A. and K.H.; formal analysis, S.B.A.; investigation, S.B.A. and K.H.; resources, S.B.A. and K.H.; data creation, S.B.A.; writing—original draft preparation, S.B.A.; writing—review and editing, S.B.A. and K.H.; visualization, S.B.A.; supervision, K.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JSPS KAKENHI Grant Number 24K15108.

Institutional Review Board Statement

The studies involving human participants were reviewed and approved by the Kyushu Institute of Technology Institutional Review Board, approval code 24-02, approval date 16 August 2024.

Informed Consent Statement

Informed consent was obtained from Nijiironohana Nursery School prior to the study, and procedures were approved by the Kyushu Institute of Technology Institutional Review Board.

Data Availability Statement

The dataset is stored on a PC at the Kyushu Institute of Technology.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, Q.; Hu, S.; Shimasaki, K.; Ishii, I. An Active Multi-Object Ultrafast Tracking System with CNN-Based Hybrid Object Detection. Sensors 2023, 23, 4150. [Google Scholar] [CrossRef] [PubMed]
  2. Hua, S.; Kapoor, M.; Anastasiu, D.C. Vehicle Tracking and Speed Estimation from Traffic Videos. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; Volume 2018. [Google Scholar]
  3. Domozi, Z.; Stojcsics, D.; Benhamida, A.; Kozlovszky, M.; Molnar, A. Real Time Object Detection for Aerial Search and Rescue Missions for Missing Persons. In Proceedings of the 2020 IEEE 15th International Conference of System of Systems Engineering (SoSE), Budapest, Hungary, 2–4 June 2020; pp. 519–524. [Google Scholar]
  4. Choutri, K.; Lagha, M.; Meshoul, S.; Batouche, M.; Bouzidi, F.; Charef, W. Fire Detection and Geo-Localization Using UAV’s Aerial Images and Yolo-Based Models. Appl. Sci. 2023, 13, 11548. [Google Scholar] [CrossRef]
  5. Jawaharlalnehru, A.; Sambandham, T.; Sekar, V.; Arunnehru, J.; Loganathan, V.; Kannadasan, R.; Khan, A.A.; Wechtaisong, C.; Haq, M.A.; Alhussen, A.; et al. Target Object Detection from Unmanned Aerial Vehicle (UAV) Images Based on Improved YOLO Algorithm. Electronics 2022, 11, 2343. [Google Scholar] [CrossRef]
  6. Song, X.; Cao, S.; Zhang, J.; Hou, Z. Steel Surface Defect Detection Algorithm Based on YOLOv8. Electronics 2024, 13, 988. [Google Scholar] [CrossRef]
  7. Arifando, R.; Eto, S.; Wada, C. Improved YOLOv5-Based Lightweight Object Detection Algorithm for People with Visual Impairment to Detect Buses. Appl. Sci. 2023, 13, 5802. [Google Scholar] [CrossRef]
  8. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 142–158. [Google Scholar] [CrossRef] [PubMed]
  9. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  13. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef] [PubMed]
  14. Ultralytics YOLO (Version 8.0.0). Available online: https://github.com/ultralytics/ultralytics (accessed on 20 September 2023).
  15. Ren, Z.; Zhang, H.; Li, Z. Improved YOLOv5 Network for Real-Time Object Detection in Vehicle-Mounted Camera Capture Scenarios. Sensors 2023, 23, 4589. [Google Scholar] [CrossRef] [PubMed]
  16. Wang, H.; Liu, C.; Cai, Y.; Chen, L.; Li, Y. YOLOv8-QSD: An Improved Small Object Detection Algorithm for Autonomous Vehicles Based on YOLOv8. IEEE Trans. Instrum. Meas. 2024, 73, 1–16. [Google Scholar] [CrossRef]
  17. Jin, Y.; Shi, Z.; Xu, X.; Wu, G.; Li, H.; Wen, S. Target Localization and Grasping of NAO Robot Based on YOLOv8 Network and Monocular Ranging. Electronics 2023, 12, 3981. [Google Scholar] [CrossRef]
  18. Li, Y.; Fan, Q.; Huang, H.; Han, Z.; Gu, Q. A Modified YOLOv8 Detection Network for UAV Aerial Image Recognition. Drones 2023, 7, 304. [Google Scholar] [CrossRef]
  19. Chen, H.; Zhou, G.; Jiang, H. Student Behavior Detection in the Classroom Based on Improved YOLOv8. Sensors 2023, 23, 8385. [Google Scholar] [CrossRef] [PubMed]
  20. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More Deformable, Better Results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  21. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  22. Wang, W.; Xie, E.; Song, X.; Zang, Y.; Wang, W.; Lu, T.; Yu, G.; Shen, C. Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8439–8448. [Google Scholar]
  23. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  24. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
  25. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual, 6–12 December 2020; pp. 21002–21012. [Google Scholar]
  26. Hosang, J.; Benenson, R.; Schiele, B. Learning Non-maximum Suppression. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6469–6477. [Google Scholar]
  27. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  28. Liu, Y.; Shao, Z.; Teng, Y.; Hoffmann, N. NAM: Normalization-based Attention Module. arXiv 2021, arXiv:2111.12419. [Google Scholar]
  29. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  30. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  31. Padilla, R.; Passos, W.L.; Dias, T.L.B.; Netto, S.L.; da Silva, E.A.B. A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit. Electronics 2021, 10, 279. [Google Scholar] [CrossRef]
  32. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  33. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A Full-Scale Reloading. arXiv 2023, arXiv:2301.05586. [Google Scholar]
  34. Wang, C.-Y.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
Figure 1. The standard YOLOv8 network architecture.
Figure 2. Our proposed YOLOv8s-NE model network architecture.
Figure 3. Illustration of the DCNv2 deformable convolution process.
Figure 4. Example nursery image dataset with different (a) sizes, (b) lighting conditions and (c) background settings.
Figure 5. Example ground truth of a nursery image dataset with green bounding boxes of different sizes. The image (a) contains 29 small objects, (b) 13 medium objects, and (c) 5 large objects.
Figure 6. Visualization of small objects only: (a) input image; (b) ground truth (green); and the inferences (red) of (c) YOLOv8s; (d) YOLOv8s-v1; (e) YOLOv8s-v2; (f) YOLOv8s-NE.
Figure 7. Visualization of small objects only: (a) input image; (b) ground truth (green); and the inferences (red) of (c) YOLOv8s; (d) YOLOv8s-v1; (e) YOLOv8s-v2; (f) YOLOv8s-NE.
Figure 8. Our system detects four objects (two bottles, a gadget, and basket storage) on top of cabinets.
Table 1. The size distributions of annotated objects from the nursery dataset.

| Small (0 < area < 32 × 32 pixels) | Medium (32 × 32 < area < 96 × 96 pixels) | Large (96 × 96 pixels < area) |
|---|---|---|
| 2321 | 3145 | 1550 |
Table 2. Software environment and hardware setting.

| Platform | Description |
|---|---|
| System | Windows 11 Pro |
| CPU | 12th Gen Intel(R) Core(TM) i7-12700 2.10 GHz |
| GPU | NVIDIA GeForce RTX 2080 Ti, 11264 MiB |
| IDE | Visual Studio Code |
| Framework | PyTorch 2.0.1 |
| Python version | Python 3.9.17 |
Table 3. The model parameter settings.

| Parameter | Configuration |
|---|---|
| Neural network optimizer | SGD |
| Learning rate | 0.01 |
| Epochs | 450 |
| Batch size | 8 |
| Workers | 1 |
Table 4. Comparative results of the improved YOLOv8s models with five-fold cross-validation.

| Model | Description | APs (%) | APm (%) | APl (%) | mAP50:95 (%) | mAP50 (%) | FPS |
|---|---|---|---|---|---|---|---|
| YOLOv8s ¹ | YOLOv8s | 29.5 | 46.6 | 59.2 | 40.4 | 72.8 | 60.55 |
| YOLOv8s-v1 | YOLOv8s + extra head | 32.8 | 46.3 | 57.3 | 44.0 | 75.5 | 52.26 |
| YOLOv8s-v2 | YOLOv8s + extra head + C2f_DCN | 33.4 | 47.1 | 59.0 | 44.5 | 75.8 | 40.13 |
| YOLOv8s-NE ² | YOLOv8s + extra head + C2f_DCN + NAM | 34.1 | 47.1 | 59.2 | 45.1 | 76.7 | 37.55 |

¹ Baseline model. ² Our proposed YOLOv8s-NE model.
Table 5. The comparative results of object detection models with five-fold cross-validation.

| Model | APs (%) | APm (%) | APl (%) | mAP50:95 (%) | mAP50 (%) | FPS |
|---|---|---|---|---|---|---|
| Faster RCNN | 23.5 | 44.8 | 48.0 | 39.8 | 72.8 | 11.60 |
| YOLOv3-tiny | 19.4 | 41.4 | 56.4 | 36.5 | 63.7 | 86.94 |
| YOLOv3 | 32.9 | 47.6 | 59.5 | 44.5 | 74.4 | 34.18 |
| YOLOv5s | 28.5 | 42.5 | 55.7 | 39.2 | 69.9 | 116.04 |
| YOLOv6s | 28.1 | 46.3 | 58.6 | 41.9 | 70.3 | 59.49 |
| YOLOv9c | 32.6 | 48.0 | 59.2 | 44.8 | 75.5 | 43.43 |
| YOLOv8s-NE ¹ | 34.1 | 47.1 | 59.2 | 45.1 | 76.7 | 37.55 |

¹ Our proposed model.

Share and Cite

MDPI and ACS Style

Amir, S.B.; Horio, K. YOLOv8s-NE: Enhancing Object Detection of Small Objects in Nursery Environments Based on Improved YOLOv8. Electronics 2024, 13, 3293. https://doi.org/10.3390/electronics13163293

