Article

YOLO-OD: Obstacle Detection for Visually Impaired Navigation Assistance

College of Computer Science and Technology, Changchun University, No. 6543, Satellite Road, Changchun 130022, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(23), 7621; https://doi.org/10.3390/s24237621
Submission received: 14 October 2024 / Revised: 8 November 2024 / Accepted: 25 November 2024 / Published: 28 November 2024

Abstract

Visually impaired individuals frequently encounter difficulties in detecting and avoiding obstacles in the wild. To address this issue, we propose an obstacle detection method for visual navigation assistance, named YOLO-OD. To improve the ability to detect and differentiate between obstacles of different sizes in outdoor environments, we introduce the Feature Weighting Block (FWB), which improves feature importance discrimination. To address the challenges of cluttered outdoor environments and occlusion, we introduce the Adaptive Bottleneck Block (ABB), which captures varying features across different scenes. To address the problem of detecting relatively small obstacles in outdoor environments, we propose the Enhanced Feature Attention Head (EFAH). The proposed YOLO-OD achieves an average precision of 30.02% on a public dataset, making it a promising approach for blind and visually impaired navigation aids. Our study addresses the navigation challenges faced by visually impaired individuals by improving model performance, thereby enhancing its practical value. The code for YOLO-OD has been made publicly available to ensure reproducibility and facilitate further research.

1. Introduction

According to the World Health Organization, the global population is approximately 7.7 billion, with around 246 million individuals experiencing visual impairments, including 39 million who are blind. A primary challenge faced by visually impaired individuals is mobility, as they struggle with recognizing and navigating obstacles, particularly in urban environments. These obstacles significantly impact their safety and independence. Advances in computer technology have led to a proliferation of computer-assisted devices, particularly in image processing and medical image analysis. However, visual detection tasks for visually impaired individuals are particularly hindered by challenges such as occlusion [1], dense crowds [2], and small object detection [3], especially in real-world settings with complex environments. These challenges limit the effectiveness of existing assistive technologies, making it difficult for visually impaired individuals to navigate safely. Occlusion complicates object localization, while small objects are challenging to detect due to their diminutive size and inconspicuous features. In densely populated scenes, overlapping objects blur boundaries, further complicating segmentation and recognition. Accurate and efficient object detection technology is crucial for visually impaired individuals. However, existing algorithms often struggle with precision and reliability in practical applications, particularly when faced with overlapping objects, varying lighting conditions, and small obstacle detection. These limitations make current solutions insufficient for ensuring the safety and autonomy of visually impaired individuals. Therefore, the research presented in this paper is essential for addressing these challenges.
Visually impaired individuals need to locate and avoid obstacles in unfamiliar environments [4]. Traditional aids, such as guide dogs [5] and white canes [6], are limited in terms of speed, precision, and coverage. Guide dogs are an effective means of navigation, but they cannot detect every small obstacle, are costly, and require professional training. White canes, on the other hand, can only detect obstacles within a very limited range, making them insufficient for safe navigation in complex urban settings. Consequently, there is an urgent need for computer-aided approaches [7]. Numerous studies utilizing object detection aim to assist visually impaired individuals. For example, Said et al. [8] proposed a real-time intelligent navigation assistance system. Compared to traditional methods, data-driven object detection technologies offer higher precision, greater generalizability, and sustainable development potential. These advances can be integrated into assistive devices, providing visually impaired individuals with a greater sense of safety and independence when navigating urban and unfamiliar environments.
In the field of computer vision, object detection is a fundamental task that has been extensively researched. Object detectors can be broadly divided into one-stage and two-stage methods. One-stage object detectors [9,10,11] perform both localization and classification by directly predicting bounding boxes and class scores for each object from the extracted features. These methods are generally faster because they perform detection and classification in a single network pass, which makes them well suited for real-time applications. In contrast, two-stage object detectors [12,13,14] employ a two-step strategy: they first generate region proposals and then refine these proposals by classifying them and adjusting their bounding boxes. These methods normally offer higher detection precision and generalizability, which makes them ideal for scenarios that demand high precision. The advent of universal object detectors has facilitated the deployment of object detection technology in assistive devices designed for individuals with visual impairments. By enabling real-time recognition of pedestrians, vehicles, and other obstacles, object detection helps visually impaired individuals navigate more effectively and improves their quality of life.
As illustrated in Figure 1, object detection in the real world often suffers from challenges such as occlusion, small objects, and complex backgrounds, making safe navigation extremely difficult for visually impaired individuals in dynamic environments. Specifically, occlusion results in an incomplete presentation of object features, which impairs semantic information and increases the difficulty of subsequent regression and classification, directly affecting the ability of visually impaired individuals to accurately detect obstacles in their surroundings, especially in busy environments. The high complexity of densely crowded scenes, such as crowded pedestrian crossings, poses a significant challenge to feature extractors because distinguishing between objects and background becomes difficult, producing redundant information that interferes with object features. This significantly affects the ability of assistive technologies to provide timely and accurate information to visually impaired individuals. Overlapping object regions also make it harder for detectors to distinguish between different object boundaries. For smaller obstacles, such as potholes or traffic cones, the lack of effective pixels means that less useful information can be extracted, which hampers traditional detection methods; this issue is especially critical for visually impaired individuals, who rely on precise detection of small obstacles to ensure their safety. The scarcity of informative pixels also limits the ability of feature extractors to adapt to object features across different scales, ultimately reducing the discriminative power and precision of the features. Our proposed method aims to mitigate these challenges by enhancing feature extraction and improving robustness to occlusion and background complexity, ensuring that visually impaired individuals receive reliable and accurate assistance in various real-world environments.
To address the aforementioned challenges, we propose YOLO-OD. Building on YOLOv8, we integrate an Adaptive Bottleneck Block (ABB) within the detection neck to enhance object detection in complex scenes and add a Feature Weighting Block (FWB) to re-weight the features initially extracted by the backbone. We also propose the Enhanced Feature Attention Head (EFAH), which incorporates the Convolutional Block Attention Module (CBAM) [15] into the classification head to improve classification precision and introduces a transposed convolution layer followed by a convolutional layer in the regression head to improve localization precision. These modifications collectively aim to enhance the robustness and precision of the model, thereby providing effective navigational assistance to visually impaired individuals.
To provide a concise overview of our contributions, we present them in five key aspects:
(1) We propose a new obstacle detection method that incorporates a Feature Weighting Block (FWB), an Adaptive Bottleneck Block (ABB), and an Enhanced Feature Attention Head (EFAH) to improve usability in obstacle detection for visually impaired individuals.
(2) To enhance the detection and distinction of obstacles of varying sizes in outdoor environments, we introduce the Feature Weighting Block (FWB), designed to improve the discrimination of feature importance.
(3) To tackle the challenges of cluttered outdoor environments and occlusion, we present the Adaptive Bottleneck Block (ABB), which captures diverse features across different scenes.
(4) For the detection of relatively small obstacles, we propose the Enhanced Feature Attention Head (EFAH), which better captures fine-grained feature details through a transposed convolution layer.
(5) Our proposed method was validated on a publicly available dataset and achieved an average precision of 30.02%, making it a promising approach for navigation aids for blind and visually impaired individuals.

2. Related Work

2.1. One-Stage Detectors

One-stage object detectors, exemplified by the YOLO [16] and SSD series, are optimized for real-time applications due to their high detection speeds. YOLOv3 [17] introduced a more complex network structure and multi-scale predictions, thereby improving precision while maintaining speed. YOLOv4 [18] further optimized the architecture with CSPNet and various training techniques, enhancing both speed and precision. YOLOv5 [19] was designed to enhance ease of use, with a focus on efficient implementation and deployment. YOLOv8 [20], one of the state-of-the-art YOLO variants, introduces architectural refinements, enhanced anchor-free detection, dynamic task prioritization, and superior real-time performance. The Single Shot MultiBox Detector (SSD) [21] utilizes multi-scale feature maps for efficient object detection, offering a balance between speed and accuracy without a region proposal step, which makes it well suited for real-time applications.

2.2. Two-Stage Detectors

Two-stage object detectors, particularly the R-CNN series [22], are distinguished by their precision in object detection. R-CNN (Region-CNN) [23] introduced the two-stage approach by generating region proposals using selective search and applying CNNs to classify each proposal. While effective, it was slow due to the need for individual CNN processing per region. Fast R-CNN [24] improved the efficiency of R-CNN by using shared convolutional feature maps and introducing RoI pooling. This allowed all region proposals to share the same feature map, greatly reducing computation time while maintaining high accuracy. Faster R-CNN [25] introduced the Region Proposal Network (RPN), integrating proposal generation directly into the network, which significantly increased speed and efficiency. By jointly predicting object bounds and scores, it streamlined the two-stage detection process. Cascade R-CNN [26] extends the Faster R-CNN architecture by progressively refining region proposals across multiple stages, with each stage using increasingly higher IoU thresholds. This makes the model more accurate in difficult detection cases, especially when handling high-quality proposals.

2.3. Similar Works

Many researchers have already addressed the problem of obstacle avoidance for the visually impaired [27,28]. Atitallah et al. [29] propose an obstacle detection system based on a modified YOLOv5 neural network architecture. The proposed system is capable of recognizing and locating a set of indoor and outdoor objects that are extremely useful for Blind and Visually Impaired (BVI) navigation aids. Khan et al. [30] propose a smart data-driven mobility aid for visually impaired people (SOMAVIP), enabling real-time scene perception; the system comprises responsive devices, IoT-embedded smart-city environments, deep machine learning, computer vision, and data processing methods. Although much research has been conducted, many challenges remain, such as significant shortcomings in detecting small objects and handling complex backgrounds. Our proposed YOLO-OD addresses these issues: the FWB performs preliminary weighting of the extracted features; the ABB enhances the model's ability to detect objects in complex scenes by adding multi-scale convolutions, a learnable normalization layer, and a learnable activation function; and the EFAH improves the classification and regression of the model by capturing more detailed information.

3. Methodology

3.1. Architecture of the Proposed Method

To address the challenges faced by visually impaired individuals in complex environments, such as small obstacles, occlusion, and background noise, we propose YOLO-OD, inspired by the architecture of YOLOv8. As illustrated in Figure 2, the proposed YOLO-OD leverages this advanced structure to improve detection performance, particularly in real-world tests involving urban obstacles such as potholes, traffic cones, and pedestrian congestion. These enhancements are crucial for improving the safety and independence of visually impaired individuals. YOLO-OD consists of three main components: the backbone network (CSPDarknet-53), the detection neck (KAN Path Aggregation Feature Pyramid Network, KAN-PAFPN) [31], and the detection head (Enhanced Feature Attention Head).
CSPDarknet-53 is chosen as the backbone due to its effective use of cross-stage partial connections, which optimize feature extraction by reducing redundancy in the network. This structure allows efficient feature reuse and enhances the model's ability to process high-dimensional data with minimal computational overhead. The multi-scale feature extraction capabilities of the backbone make it particularly well suited for identifying both subtle and prominent obstacles, which is essential for assisting visually impaired individuals in dynamic environments.
The KAN-PAFPN is employed for feature fusion, where integration is performed through top-down and bottom-up pathways to capture both detailed and abstract features. However, when the extracted obstacle features are merged indiscriminately within the PAFPN, significant and insignificant features are treated equally, important information may be diluted, and the model may struggle to prioritize key information, potentially leading to missed detections or inaccurate classifications. This is particularly problematic in urban environments, where visually impaired individuals need precise obstacle detection. To address this challenge, we incorporate the Feature Weighting Block (FWB) before the KAN-PAFPN to apply differential weighting to features, which significantly improved the model's ability to detect small and critical obstacles in real-world tests, such as potholes and narrow pathways, ensuring more reliable navigation for visually impaired individuals [32]. Furthermore, to address the complex environments faced by visually impaired individuals, we introduce the Adaptive Bottleneck Block. This module enhances object detection by capturing diverse features and adapting to various scenarios, thereby improving the overall precision and robustness of the system.
The EFAH consists of two parts: the classification head and the regression head. The classification head incorporates the CBAM, which selectively focuses on spatial and channel-wise information, helping to differentiate obstacles more effectively from complex backgrounds. The regression head is enhanced with a transposed convolution followed by a convolution layer, which adds more detailed information and improves precision in predicting obstacle locations, enabling more reliable navigation assistance [33]. A high-level sketch of how these components fit together is given below.
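As an illustration only, the following PyTorch-style sketch shows one plausible way to wire the components described above (backbone, FWB, KAN-PAFPN with ABB, EFAH). The module names and the per-level application of the FWB are our assumptions about the data flow; the authors' repository contains the definitive implementation.

```python
import torch
import torch.nn as nn

class YOLOODSketch(nn.Module):
    """Illustrative wiring of the pipeline described above: backbone features are
    re-weighted by one FWB per level, fused by the KAN-PAFPN (whose bottlenecks are
    Adaptive Bottleneck Blocks), and passed to the EFAH. The four sub-modules are
    stand-ins supplied by the caller; only the data flow is shown here."""

    def __init__(self, backbone: nn.Module, fwb_blocks, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone                     # CSPDarknet-53
        self.fwb_blocks = nn.ModuleList(fwb_blocks)  # one FWB per feature level
        self.neck = neck                             # KAN-PAFPN with ABB
        self.head = head                             # Enhanced Feature Attention Head

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)                # multi-scale features, e.g. (P3, P4, P5)
        feats = [fwb(f) for fwb, f in zip(self.fwb_blocks, feats)]  # feature re-weighting
        fused = self.neck(feats)                     # top-down + bottom-up fusion
        return self.head(fused)                      # classification and regression outputs
```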

3.2. Feature Weighting Block

As illustrated in Figure 3, to help visually impaired individuals navigate safely, particularly in urban environments with varying obstacles, we propose the FWB, which is a feature-weighted processing module. The specific computational process can be described as follows:
$F = \mathrm{FWB}(f) = \mathrm{HSigmoid}\left(\mathrm{LeakyReLU}\left(\mathrm{Conv2d}_{1\times 1}\left(\mathrm{AvgPool}(f)\right)\right)\right)$
where $f$ represents the original features extracted by CSPDarknet-53, with dimensions $H \times W \times C$, where $H$, $W$, and $C$ denote the height, width, and number of channels, respectively. $\mathrm{Conv2d}_{1\times 1}$ is a convolution layer with a kernel size of 1 × 1, used mainly to adjust the number of channels. Average pooling (AvgPool) is a downsampling technique in convolutional neural networks that reduces the spatial dimensions of feature maps by averaging the values within a sliding window; after this step the features become $1 \times 1 \times C$, and the 1 × 1 convolutional layer then adjusts the number of channels. The LeakyReLU activation function applies a small non-zero slope to negative input values so that all inputs contribute to the output, preventing neurons from becoming inactive during training [34]. The Hard Sigmoid activation function is a computationally efficient piecewise linear approximation of the standard sigmoid, designed to reduce complexity by capping and scaling inputs without exponential computation; it compresses values to the range between 0 and 1, producing an output of dimension $1 \times 1 \times w$, where $w$ is the resulting number of weight channels. $F$ is the final feature representation. The FWB assigns weights to these features based on their semantic importance, ensuring that the features most relevant to obstacles are highlighted. This is crucial for improving detection accuracy in scenarios where small or partially occluded obstacles might be overlooked by conventional methods. The block allows the model to better differentiate between obstacles of different sizes, as demonstrated in our experiments on datasets containing obstacles of varying sizes. In particular, the FWB showed a significant improvement in detecting smaller obstacles, which are often missed by conventional models. This enhancement is crucial for providing reliable navigational assistance to visually impaired individuals in complex and dynamic environments.
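A minimal PyTorch sketch of this computation is given below. The SE-style element-wise re-weighting in the last line of forward, and the choice to keep the channel count unchanged in the 1 × 1 convolution, are our assumptions about how the weights are applied; the official repository contains the definitive implementation.

```python
import torch
import torch.nn as nn

class FeatureWeightingBlock(nn.Module):
    """Sketch of FWB(f) = HardSigmoid(LeakyReLU(Conv1x1(AvgPool(f)))).
    Keeping the channel count unchanged in the 1x1 convolution and applying the
    resulting weights multiplicatively to the input are assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)               # H x W x C -> 1 x 1 x C
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.LeakyReLU(negative_slope=0.01)
        self.gate = nn.Hardsigmoid()                           # squashes weights into [0, 1]

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        w = self.gate(self.act(self.conv(self.avg_pool(f))))   # per-channel weights
        return f * w                                           # emphasise important channels


# Example: re-weight a dummy 256-channel feature map.
if __name__ == "__main__":
    fwb = FeatureWeightingBlock(256)
    out = fwb(torch.randn(1, 256, 40, 40))
    print(out.shape)  # torch.Size([1, 256, 40, 40])
```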

3.3. Adaptive Bottleneck Block

As illustrated in Figure 4, we propose the Adaptive Bottleneck Block in the KAN-PAFPN, which aims to address the lack of flexibility when dealing with varied scenarios and features. The Adaptive Bottleneck Block enhances the model's ability to detect high-contrast edges and fine textures in complex environments. This capability is crucial for providing safer and more reliable obstacle avoidance assistance for visually impaired individuals navigating urban environments. To achieve this, the module employs multi-scale convolutional kernels (1 × 1, 3 × 3, 5 × 5) [35] to capture multi-scale features. The use of multi-scale convolutional kernels allows the model to capture diverse features across different receptive fields, which is particularly important for detecting a wide range of obstacles, from large, easily visible barriers to small, less noticeable hazards that visually impaired individuals might encounter on urban streets. The multi-scale convolution is calculated as follows:
$f_{1\times 1} = \mathrm{Conv2d}_{1\times 1}(f), \quad f_{3\times 3} = \mathrm{Conv2d}_{3\times 3}(f), \quad f_{5\times 5} = \mathrm{Conv2d}_{5\times 5}(f)$
$f_{\mathrm{concat}} = \mathrm{Concat}(f_{1\times 1}, f_{3\times 3}, f_{5\times 5})$
where $\mathrm{Conv2d}_{n\times n}$ refers to a convolution layer with a kernel size of $n \times n$, and $\mathrm{Concat}$ denotes concatenation of the features along the channel dimension. A minimal sketch of this multi-scale branch is shown below.
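The following PyTorch snippet illustrates the parallel 1 × 1 / 3 × 3 / 5 × 5 convolutions and channel-wise concatenation described by the equations above; the per-branch channel count is an illustrative choice, not a value taken from the paper.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions whose outputs are concatenated along
    the channel dimension, as in the equations above. Padding keeps the spatial
    size constant; the per-branch channel count is illustrative."""

    def __init__(self, in_ch: int, branch_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.conv3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f_concat = Concat(f_1x1, f_3x3, f_5x5)
        return torch.cat([self.conv1(f), self.conv3(f), self.conv5(f)], dim=1)
```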
After the convolutional layers, the fixed Batch Normalization layer is replaced by a Learnable Batch Normalization layer, which includes additional trainable parameters that adaptively adjust the normalization process based on the input data distribution. In the standard Batch Normalization (BN) layer, the normalized output is computed as:
$\hat{x} = \dfrac{x - \mu_{\mathrm{batch}}}{\sqrt{\sigma_{\mathrm{batch}}^2 + \epsilon}}$
where $\mu_{\mathrm{batch}}$ and $\sigma_{\mathrm{batch}}^2$ are the mean and variance calculated across the batch, $\epsilon$ is a small constant for numerical stability, and $\hat{x}$ denotes the normalized feature value. Because the mean and variance are computed statically from the current batch of data, the model's ability to adapt to more complex data distributions over time is limited.
Next, a linear transformation is applied:
$y = \gamma \hat{x} + \beta$
where $y$ represents the output after normalization and the linear transformation, and $\gamma$ and $\beta$ are trainable parameters that scale and shift the normalized features. This allows the model to adjust the normalized features slightly, but the reliance on batch statistics for $\mu_{\mathrm{batch}}$ and $\sigma_{\mathrm{batch}}^2$ makes it less flexible in handling diverse or complex input distributions.
In our proposed Learnable Batch Normalization layer, we introduce additional flexibility by making the mean and variance learnable. This is achieved through trainable parameters $\Delta\mu$ and $\Delta\sigma$, which allow the model to adjust the mean and variance during training rather than relying solely on batch statistics. The normalization is computed as:
$f_{\mathrm{norm}} = \mathrm{LearnableBatchNorm}(x) = \gamma \cdot \dfrac{x - (\mu_{\mathrm{batch}} + \Delta\mu)}{\sqrt{\sigma_{\mathrm{batch}}^2 + \Delta\sigma + \epsilon}} + \beta$
where $\Delta\mu$ and $\Delta\sigma$ are trainable parameters that adjust the batch mean and variance, respectively. This modification allows the normalization process to adapt dynamically to varying input data distributions, such as different lighting conditions in urban areas or varying obstacle textures in crowded environments. Experiments show that this approach enhances detection stability in these challenging scenarios, thereby improving the safety of visually impaired individuals.
By introducing these learnable adjustments, $\Delta\mu$ and $\Delta\sigma$, the Adaptive Batch Normalization module provides the flexibility to handle diverse input conditions more effectively than standard Batch Normalization. This capability is especially useful in visually impaired navigation scenarios, where environmental conditions such as lighting and obstacle types can vary significantly. Specifically:
  • $\Delta\mu$ acts as a learnable shift of the batch mean $\mu_{\mathrm{batch}}$, allowing the model to modify the mean in response to changes in the data distribution.
  • $\Delta\sigma$ adjusts the batch variance $\sigma_{\mathrm{batch}}^2$, allowing the model to fine-tune the spread of the data distribution.
Thus, the learnable parameters $\gamma$, $\beta$, $\Delta\mu$, and $\Delta\sigma$ enable the Learnable Batch Normalization layer to adapt more effectively to diverse input conditions, making it particularly useful in complex and dynamic environments, such as obstacle detection for visually impaired individuals, where robustness is critical for system performance.
The learnable adjustments to mean and variance provide greater stability and adaptability, especially when working with small batch sizes or data with high variability, resulting in improved model robustness.
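As a concrete illustration, the sketch below implements the training-time behaviour of the normalization equation above in PyTorch; the handling of running statistics at inference time is omitted, and the per-channel parameter shapes are our assumptions.

```python
import torch
import torch.nn as nn

class LearnableBatchNorm2d(nn.Module):
    """Batch statistics are shifted by trainable offsets (delta_mu, delta_sigma)
    before the usual gamma/beta affine transform, mirroring the equation above.
    Only the training-time path is sketched; running statistics are omitted."""

    def __init__(self, num_features: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.delta_mu = nn.Parameter(torch.zeros(num_features))     # learnable mean shift
        self.delta_sigma = nn.Parameter(torch.zeros(num_features))  # learnable variance shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # per-channel batch statistics over (N, H, W)
        mu = x.mean(dim=(0, 2, 3))
        var = x.var(dim=(0, 2, 3), unbiased=False)
        mu_adj = (mu + self.delta_mu).view(1, -1, 1, 1)
        var_adj = (var + self.delta_sigma).view(1, -1, 1, 1)
        x_hat = (x - mu_adj) / torch.sqrt(var_adj + self.eps)
        return self.gamma.view(1, -1, 1, 1) * x_hat + self.beta.view(1, -1, 1, 1)
```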
This adaptive normalization technique enhances robustness in a variety of situations, including different lighting conditions and complex backgrounds, and thus effectively helps visually impaired people avoid obstacles. In addition, the Adaptive Bottleneck module incorporates a Learnable Activation function. Unlike fixed activation functions such as SiLU in the original PAFPN, the Learnable Activation function is calculated as:
$F = \mathrm{LearnableActivation}(x) = \omega \left( \dfrac{x}{1 + e^{-x}} + \sum_i c_i B_i(x) \right)$
where $F$ represents the output feature map after processing through the Adaptive Bottleneck module, $\omega$ and $c_i$ are trainable parameters, $x$ denotes the input feature, and $B_i(x)$ are basis functions such as B-splines.
This dynamic adjustment ensures optimal non-linear transformation for varying input features, enhancing the detection of high-contrast edge features and fine textures, which are critical for accurate obstacle detection.
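The sketch below mirrors this formulation in PyTorch: a SiLU-like term plus a trainable combination of basis functions, scaled by a trainable weight. For brevity, Gaussian bumps stand in for the B-spline bases referenced above, so this is an approximation of the idea rather than the paper's exact activation.

```python
import torch
import torch.nn as nn

class LearnableActivation(nn.Module):
    """F = omega * (x / (1 + e^{-x}) + sum_i c_i * B_i(x)).
    Gaussian bumps replace B-spline bases here purely for brevity."""

    def __init__(self, num_basis: int = 8, grid_range: float = 3.0):
        super().__init__()
        self.omega = nn.Parameter(torch.ones(1))               # trainable output scale
        self.coeffs = nn.Parameter(torch.zeros(num_basis))     # c_i
        self.register_buffer("centers", torch.linspace(-grid_range, grid_range, num_basis))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        silu_term = x * torch.sigmoid(x)                        # x / (1 + e^{-x})
        # evaluate each basis function on x and combine with learnable coefficients
        basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)   # (..., num_basis)
        spline_term = (basis * self.coeffs).sum(dim=-1)
        return self.omega * (silu_term + spline_term)
```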

3.4. Enhanced Feature Attention Head

As illustrated in Figure 5, to improve the detection accuracy of smaller obstacles in the scenarios that visually impaired people encounter while travelling, we propose the Enhanced Feature Attention Head, which is divided into a classification head and a regression head. In the classification branch, the input feature is first processed by a convolutional layer, which extracts preliminary features. The Convolutional Block Attention Module (CBAM) is then applied to refine the features by focusing on crucial regions and suppressing irrelevant information. This attention mechanism enriches the spatial and channel information, thereby improving classification accuracy. A subsequent convolutional layer processes these attended features, which are then used to compute the classification loss. In the regression branch, the input feature is first processed by a convolutional layer that adjusts its spatial size and channel dimension. For precise localization, a transposed convolution module is introduced to enlarge the feature maps so that more detailed information can be captured, helping to achieve finer object localization. This is followed by another convolutional layer that refines the upsampled features and resizes them back to the original resolution. A final convolutional layer processes these refined features, which are then used to output the prediction results. A structural sketch of this head is given below.
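The following PyTorch sketch reflects our reading of the head structure described above. The cbam argument stands for any implementation of the Convolutional Block Attention Module [15]; channel counts, kernel sizes, and the stride-2 transposed convolution paired with a stride-2 refinement convolution are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class EFAHSketch(nn.Module):
    """One Enhanced Feature Attention Head level: a CBAM-augmented classification
    branch and a regression branch that expands features with a transposed
    convolution before restoring their original resolution."""

    def __init__(self, in_ch: int, num_classes: int, num_reg: int, cbam: nn.Module):
        super().__init__()
        # classification branch: conv -> CBAM -> conv -> class scores
        self.cls_conv1 = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)
        self.cbam = cbam
        self.cls_conv2 = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)
        self.cls_pred = nn.Conv2d(in_ch, num_classes, kernel_size=1)
        # regression branch: conv -> transposed conv (upsample) -> conv (restore size) -> preds
        self.reg_conv1 = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)
        self.reg_up = nn.ConvTranspose2d(in_ch, in_ch, kernel_size=2, stride=2)
        self.reg_down = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=2, padding=1)
        self.reg_pred = nn.Conv2d(in_ch, num_reg, kernel_size=1)

    def forward(self, feat: torch.Tensor):
        c = self.cls_conv2(self.cbam(self.cls_conv1(feat)))
        cls_out = self.cls_pred(c)
        r = self.reg_down(self.reg_up(self.reg_conv1(feat)))   # expand, then restore size
        reg_out = self.reg_pred(r)
        return cls_out, reg_out
```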

4. Experiment

4.1. Datasets and Evaluation Metrics

The dataset used in this article is a road obstacle dataset in COCO format, available at the link in [36]. It contains a total of 6276 images covering four categories, with the following number of instances per category: car (22,318), person (53,756), traffic cone (1176), and pothole (1719).
We employ a set of metrics to evaluate the effectiveness of our model, including mAP (50, s, m, l), GFLOPs, and Params. mAP is the mean of the Average Precision (AP) across all classes. mAP50 denotes the mean average precision at an IoU (Intersection over Union) threshold of 0.5. mAPs quantifies the mean detection precision for objects with an area smaller than 32 × 32 pixels, mAPm for objects with an area between 32 × 32 and 96 × 96 pixels, and mAPl for objects with an area larger than 96 × 96 pixels. GFLOPs is used to evaluate the computational cost of a method, whereas Params quantifies the number of parameters in the method.
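For reference, the sketch below spells out the two ingredients behind these metrics, the IoU between two boxes and the COCO-style area buckets used for mAPs/mAPm/mAPl; it is a didactic illustration, not the evaluation code used in the experiments, which follows the standard COCO protocol.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def size_bucket(box):
    """COCO-style size bucket used by mAPs / mAPm / mAPl."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return "small"      # counted in mAPs
    if area < 96 ** 2:
        return "medium"     # counted in mAPm
    return "large"          # counted in mAPl
```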

4.2. Implementation Details

The code for the proposed model and the comparison models was implemented using the MMYOLO and MMDetection frameworks. YOLO-OD was trained on two RTX 3090 GPUs for 150 epochs, while the other models used their default settings. MMDetection used the SGD optimizer with a learning rate of 0.02, while MMYOLO used the SGD optimizer with a learning rate of 0.01. For MMDetection, the input image size was 1333 × 800 pixels, and the test image size was also 1333 × 800 pixels. In contrast, MMYOLO used an input image size of 640 × 640 pixels. The batch size was set to 16 by default, and the confidence threshold for visualization was set to 0.5 by default.
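For convenience, the settings stated above can be summarized as follows; this is a plain summary of the reported hyperparameters with field names of our own choosing, not an actual MMYOLO or MMDetection configuration file.

```python
# Summary of the training settings reported in this subsection (field names are ours).
TRAIN_SETTINGS = {
    "epochs": 150,
    "gpus": "2 x RTX 3090",
    "batch_size": 16,
    "visualization_confidence_threshold": 0.5,
    "mmdetection": {"optimizer": "SGD", "lr": 0.02, "input_size": (1333, 800)},
    "mmyolo": {"optimizer": "SGD", "lr": 0.01, "input_size": (640, 640)},
}
```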

4.3. Ablation Studies

Effect of the Feature Weighting Block (FWB). In Table 1, the addition of FWB to YOLOv8-s leads to a modest increase in computational cost, with the parameters rising from 11.13 to 11.14 and GFLOPs from 14.27 to 14.28. However, the benefits are evident through the improvement in mAPs, which increases from 18.93 to 20.02. This gain in performance demonstrates the effectiveness of FWB in enhancing small obstacle detection capabilities, which is particularly crucial for visually impaired individuals navigating complex environments. Furthermore, there is also a significant improvement in overall mAP, increasing from 28.51 to 29.33, mAPm from 28.93 to 30.42, and mAPl from 34.01 to 34.72. These improvements indicate that FWB not only provides considerable enhancements in small object detection but also yields notable gains in the detection of medium and large objects. This comprehensive enhancement across different object sizes underscores the versatility of FWB in improving the overall detection performance of the model. Similarly, Table 2 demonstrates the impact of incorporating FWB into YOLOv5-s. The parameters increase slightly from 8.12 to 8.13, and GFLOPs rise from 12.33 to 12.34. Despite this minimal increase, there is a significant improvement in the mAPs metric, which rises from 11.24 to 12.17. Additionally, the overall mAP improves from 14.23 to 15.22, mAPm from 17.71 to 19.14, and mAPl from 22.88 to 24.55. These improvements clearly demonstrate that the inclusion of FWB not only enhances the model's ability to detect small obstacles but also significantly improves its performance for medium and large objects. This further validates the effectiveness of FWB across different scenarios and object scales. The inclusion of FWB significantly enhances the feature representation capability of the model, thereby achieving substantial improvements in detection performance with a negligible impact on model complexity.
Effect of the Adaptive Bottleneck Block (ABB). As shown in Table 1, adding the Adaptive Bottleneck Block (ABB) significantly improves YOLOv8-s in terms of both small and medium obstacle detection. Specifically, mAPs increased from 18.93 to 20.23, and mAPm rose from 28.93 to 29.94. Although the computational cost increased, with GFLOPs rising from 14.27 to 58.52 and parameters growing from 11.14 to 40.09, this trade-off is justified by the notable accuracy gains. Such improvements are crucial for accurately detecting overlapping and small obstacles in complex urban environments, thereby enhancing the safety of navigation systems for visually impaired individuals. Similarly, Table 2 shows that incorporating ABB into YOLO X-s increased the mAP from 23.22 to 24.76 and mAP50 from 40.31 to 41.76. While the GFLOPs increased from 13.32 to 47.92 and the parameters grew from 8.94 to 38.56, these performance improvements highlight the effectiveness of ABB in capturing complex features. This enhanced capability is particularly beneficial in navigating intricate and cluttered environments, offering more reliable detection of potential obstacles, which is key to supporting visually impaired individuals.
Effect of the Enhanced Feature Attention Head (EFAH). As shown in Table 1, integrating the Enhanced Feature Attention Head (EFAH) into YOLOv8-s results in improved performance across various metrics, with mAP increasing from 28.51 to 29.44, mAPs rising from 18.93 to 20.23, and only a slight increase in model complexity, as GFLOPs grow from 14.27 to 14.90 and parameters from 11.14 to 11.37. This indicates that EFAH effectively enhances the model's ability to focus on crucial features, significantly benefiting the detection of small obstacles, which is critical for navigation assistance for visually impaired individuals in crowded environments. In Table 3, applying EFAH to Deformable DETR yields similar advantages, with mAPs improving from 15.92 to 16.73 and mAPl from 30.36 to 31.65. Although GFLOPs and parameters increase from 208.00 to 231.00 and 41.36 to 41.63, respectively, this slight complexity boost is offset by the enhanced detection accuracy. Such improvements are essential in complex scenarios involving obstacles of various sizes, ensuring the model can capture fine details necessary for the safe and efficient navigation of visually impaired individuals. The consistent gains from EFAH across different models underscore its effectiveness in addressing the challenge of small-object detection, providing an essential enhancement for applications requiring precise obstacle recognition in dynamic and intricate environments.

4.4. Comparisons

As illustrated in Table 4, the proposed method demonstrates a significant advantage over both the YOLO series models [20] and other mainstream object detectors, including the one-stage SSD [21] model and two-stage detectors like Faster R-CNN [25], Cascade R-CNN [26], and Deformable DETR [37]. This superiority is evident in both accuracy and efficiency, especially in detecting small objects, a critical feature for ensuring the safe navigation of visually impaired individuals in complex environments. Compared to the baseline YOLOv8-s model, our method achieves an increase in mAP from 28.51 to 30.01 and in mAPs from 18.93 to 20.96, underscoring the model's enhanced ability to detect small, less conspicuous obstacles. Such improvements are crucial for applications requiring precise small-object detection, as they provide better obstacle awareness in environments where visual cues are limited or subtle. Additionally, while the two-stage Cascade R-CNN achieves a respectable mAP of 27.23 and mAPs of 17.12, our model surpasses it with 30.01 and 20.96, respectively, while also being more computationally efficient. Specifically, our model has fewer parameters (48.32 vs. 69.16) and lower GFLOPs (59.15 vs. 236.00), making it more suitable for deployment on resource-limited assistive devices. The proposed method stands out as the most precise among both the two-stage detectors (Faster R-CNN, Cascade R-CNN, Deformable DETR) and the one-stage detectors (SSD, YOLOv5-s, YOLOv7-tiny, YOLOX-s, YOLOv8-s, YOLOv10-n). Its ability to excel in detecting small obstacles further substantiates its potential as an effective tool for assistive technology aimed at visually impaired individuals. Overall, the proposed model's high precision, combined with its efficient computation, positions it as a robust solution for developing assistive technology dedicated to enhancing the independence and safety of visually impaired users. The performance gains in small-object detection, in particular, highlight the model's capability in complex urban environments, making it an invaluable aid in supporting visually impaired individuals in real-world scenarios.

4.5. Visualization

As illustrated in Figure 6, the detection results of YOLOv8 (left) and our proposed YOLO-OD (right) are visualized. The red box represents the actual detected box, and the green box highlights the difference between the detection results of the two models. We can see that the red car in the first row on the left is not detected by YOLOv8, whereas it is accurately detected by YOLO-OD, reflecting the superiority of YOLO-OD in detecting small obstacles that are crucial for visually impaired individuals. Similarly, the pedestrians in the first set of images in the second row are not detected by YOLOv8 due to the occlusion of the railings and the complex background. However, they are accurately detected by YOLO-OD, demonstrating the ability of YOLO-OD to handle challenging conditions such as occlusion and complex environments. This ability is particularly critical for visually impaired individuals who need reliable detection to navigate safely through urban areas with many obstacles. These visualization results clearly illustrate the practical benefits of YOLO-OD in real-world scenarios, particularly for visually impaired individuals who face significant challenges when navigating through environments with multiple overlapping objects and varying background complexities.

5. Discussion

Our work proposes an object detection method named YOLO-OD, specifically designed to help visually impaired individuals navigate safely through challenging environments, such as low-light or cluttered urban areas, by effectively detecting obstacles that may compromise their mobility and safety. YOLO-OD was developed to address the shortcomings of existing object detection methods, which often struggle with small or occluded obstacles in dynamic environments that visually impaired individuals face. We have demonstrated the superior detection performance of YOLO-OD through experiments on public obstacle detection datasets, emphasizing its capability to overcome these challenges. The experimental results show that YOLO-OD significantly outperforms other comparison models, particularly in detecting small and partially occluded obstacles which are critical for the safety of visually impaired individuals. We conducted ablation experiments on different models, including both one-stage and two-stage detectors, to understand the contribution of individual components, such as the Adaptive Bottleneck Block and Feature Weighting Block, to the improved performance of YOLO-OD in challenging environments. Our goal is to further improve the detection accuracy of our method while optimizing inference speed. Below are the directions for improvement that we plan to explore:
1. Addressing the specific challenges of obstacle detection for visually impaired individuals in low-light conditions to further refine the model.
2. Upgrading the model using techniques such as wavelet convolution and StarNet to enhance both detection accuracy and inference speed.
Through these improvements, we expect that YOLO-OD will achieve even more outstanding performance in obstacle detection tasks for visually impaired individuals, providing stronger support for those with visual impairments.

Author Contributions

Conceptualization, W.W.; methodology, B.J.; software, W.W.; validation, X.Y.; formal analysis, Y.S.; investigation, L.Y.; resources, W.W.; data curation, C.W.; writing—original draft preparation, B.J.; writing—review and editing, W.W.; visualization, X.Y.; supervision, W.W.; project administration, L.Y.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Jilin Provincial Education Department’s Science and Technology Project (Project No. JJKH20240749KJ) and by Changchun Yaxin Technology Co., Ltd. through the Jilin Provincial Science and Technology Development Program (Project No. 2023JBH05L54).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

You can access and obtain the data through the following links: Code repository for YOLO-OD: https://github.com/jjking00/YOLO-OD (accessed on 21 November 2024). Dataset: https://aistudio.baidu.com/datasetdetail/198589 (accessed on 21 November 2024). We confirm that the code and dataset will remain publicly accessible to ensure reproducibility and verification of the experimental results.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ABB	Adaptive Bottleneck Block
BN	Batch Normalization
BVI	Blind and Visually Impaired
CBAM	Convolutional Block Attention Module
EFAH	Enhanced Feature Attention Head
FWB	Feature Weighting Block
IoU	Intersection over Union
KAN-PAFPN	KAN Path Aggregation Feature Pyramid Network
mAP	mean Average Precision

References

  1. Li, F.; Li, X.; Liu, Q.; Li, Z. Occlusion handling and multi-scale pedestrian detection based on deep learning: A review. IEEE Access 2022, 10, 19937–19957. [Google Scholar] [CrossRef]
  2. Khan, S.D.; Salih, Y.; Zafar, B.; Noorwali, A. A deep-fusion network for crowd counting in high-density crowded scenes. Int. J. Comput. Intell. Syst. 2021, 14, 168. [Google Scholar]
  3. Liu, Y.; Sun, P.; Wergeles, N.; Shang, Y. A survey and performance evaluation of deep learning methods for small object detection. Expert Syst. Appl. 2021, 172, 114602. [Google Scholar] [CrossRef]
  4. Real, S.; Araujo, A. Navigation systems for the blind and visually impaired: Past work, challenges, and open problems. Sensors 2019, 19, 3404. [Google Scholar] [CrossRef] [PubMed]
  5. Saegusa, S.; Yasuda, Y.; Uratani, Y.; Tanaka, E.; Makino, T.; Chang, J.Y.J. Development of a guide-dog robot: Leading and recognizing a visually-handicapped person using a LRF. J. Adv. Mech. Des. Syst. Manuf. 2010, 4, 194–205. [Google Scholar] [CrossRef]
  6. dos Santos, A.D.P.; Medola, F.O.; Cinelli, M.J.; Garcia Ramirez, A.R.; Sandnes, F.E. Are electronic white canes better than traditional canes? A comparative study with blind and blindfolded participants. Univers. Access Inf. Soc. 2021, 20, 93–103. [Google Scholar] [CrossRef]
  7. Arakeri, M.P.; Keerthana, N.; Madhura, M.; Sankar, A.; Munnavar, T. Assistive technology for the visually impaired using computer vision. In Proceedings of the 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Bangalore, India, 19–22 September 2018; IEEE: Piscataway, NY, USA, 2018; pp. 1725–1730. [Google Scholar]
  8. Said, Y.; Atri, M.; Albahar, M.A.; Ben Atitallah, A.; Alsariera, Y.A. Obstacle detection system for navigation assistance of visually impaired people based on deep learning techniques. Sensors 2023, 23, 5262. [Google Scholar] [CrossRef]
  9. Jiang, Z.; Zhao, L.; Li, S.; Jia, Y. Real-time object detection method based on improved YOLOv4-tiny. arXiv 2020, arXiv:2011.04244. [Google Scholar]
  10. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  11. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  12. Wang, X.; Shrivastava, A.; Gupta, A. A-fast-rcnn: Hard positive generation via adversary for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2606–2615. [Google Scholar]
  13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1137–1149. [Google Scholar] [CrossRef]
  14. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  15. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  16. Diwan, T.; Anirudh, G.; Tembhurne, J.V. Object detection using YOLO: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 2023, 82, 9243–9275. [Google Scholar] [CrossRef] [PubMed]
  17. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  18. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  19. Liu, G.; Hu, Y.; Chen, Z.; Guo, J.; Ni, P. Lightweight object detection algorithm for robots with improved YOLOv5. Eng. Appl. Artif. Intell. 2023, 123, 106217. [Google Scholar] [CrossRef]
  20. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  21. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  22. Cheng, B.; Wei, Y.; Shi, H.; Feris, R.; Xiong, J.; Huang, T. Revisiting rcnn: On awakening the classification power of faster rcnn. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 453–468. [Google Scholar]
  23. Ren, Y.; Zhu, C.; Xiao, S. Object detection based on fast/faster RCNN employing fully convolutional architectures. Math. Probl. Eng. 2018, 2018, 3598316. [Google Scholar] [CrossRef]
  24. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  25. Liu, B.; Zhao, W.; Sun, Q. Study of object detection based on Faster R-CNN. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; IEEE: Piscataway, NY, USA, 2017; pp. 6233–6236. [Google Scholar]
  26. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  27. Chandna, S.; Singhal, A. Towards outdoor navigation system for visually impaired people using YOLOv5. In Proceedings of the 2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Virtual, 27–28 January 2022; IEEE: Piscataway, NY, USA, 2022; pp. 617–622. [Google Scholar]
  28. Elgendy, M.; Sik-Lanyi, C.; Kelemen, A. A novel marker detection system for people with visual impairment using the improved tiny-yolov3 model. Comput. Methods Programs Biomed. 2021, 205, 106112. [Google Scholar] [CrossRef]
  29. Atitallah, A.B.; Said, Y.; Atitallah, M.A.B.; Albekairi, M.; Kaaniche, K.; Boubaker, S. An effective obstacle detection system using deep learning advantages to aid blind and visually impaired navigation. Ain Shams Eng. J. 2024, 15, 102387. [Google Scholar] [CrossRef]
  30. Khan, W.; Hussain, A.; Khan, B.M.; Crockett, K. Outdoor mobility aid for people with visual impairment: Obstacle detection and responsive framework for the scene perception during the outdoor mobility of people with visual impairment. Expert Syst. Appl. 2023, 228, 120464. [Google Scholar] [CrossRef]
  31. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. Kan: Kolmogorov-Arnold networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
  32. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382. [Google Scholar]
  33. Zeiler, M.D.; Krishnan, D.; Taylor, G.W.; Fergus, R. Deconvolutional networks. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; IEEE: Piscataway, NY, USA, 2010; pp. 2528–2535. [Google Scholar]
  34. Xu, J.; Li, Z.; Du, B.; Zhang, M.; Liu, J. Reluplex made more practical: Leaky ReLU. In Proceedings of the 2020 IEEE Symposium on Computers and communications (ISCC), Rennes, France, 7–10 June 2020; IEEE: Piscataway, NY, USA, 2020; pp. 1–7. [Google Scholar]
  35. Cui, Z.; Chen, W.; Chen, Y. Multi-scale convolutional neural networks for time series classification. arXiv 2016, arXiv:1603.06995. [Google Scholar]
  36. Zyh. Road Obstacle Dataset in COCO Format. Baidu AI Studio. 2023. Available online: https://aistudio.baidu.com/datasetdetail/198589 (accessed on 13 October 2024).
  37. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
Figure 1. Visualization of picture samples from the real world.
Figure 2. Architecture of the proposed method.
Figure 3. Feature Weighting Block.
Figure 4. Adaptive Bottleneck Block.
Figure 5. Enhanced Feature Attention Head.
Figure 6. Visualization of YOLOv8 (left) and YOLO-OD (right) detection results.
Table 1. Ablation experiments.
Models | mAP | mAP50 | mAPs | mAPm | mAPl | Flops/G | Parameters/M
YOLOv8-s (baseline) | 28.51 | 42.02 | 18.93 | 28.93 | 34.01 | 14.27 | 11.14
YOLOv8-s + FWB | 29.33 | 42.54 | 20.02 | 30.42 | 34.72 | 14.28 | 11.14
YOLOv8-s + ABB | 29.55 | 42.85 | 20.23 | 29.94 | 35.14 | 58.52 | 40.09
YOLOv8-s + EFAH | 29.44 | 42.46 | 19.25 | 29.86 | 35.06 | 14.90 | 11.37
YOLOv8-s + FWB + ABB | 30.00 | 42.33 | 20.87 | 30.41 | 35.74 | 58.53 | 48.09
YOLOv8-s + ABB + EFAH | 29.78 | 42.86 | 19.98 | 30.05 | 35.37 | 59.14 | 48.32
YOLOv8-s + FWB + EFAH | 29.72 | 42.72 | 19.92 | 30.35 | 35.08 | 14.90 | 11.37
YOLOv8-s + FWB + ABB + EFAH | 30.02 | 43.32 | 20.96 | 31.07 | 35.58 | 59.15 | 48.32
Table 2. Ablation experiments on other one-stage detectors.
Models | mAP | mAP50 | mAPs | mAPm | mAPl | Flops/G | Parameters/M
YOLOv5-s | 14.23 | 28.52 | 10.65 | 17.71 | 12.83 | 8.12 | 12.33
YOLOv5-s + FWB | 15.22 | 29.41 | 11.24 | 19.05 | 14.45 | 8.13 | 12.34
YOLOv5-s + EFAH | 15.34 | 28.84 | 12.17 | 19.16 | 13.78 | 8.23 | 12.39
YOLOv7-tiny | 21.91 | 37.61 | 13.02 | 24.01 | 25.92 | 6.56 | 6.02
YOLOv7-tiny + FWB | 23.43 | 38.85 | 14.66 | 25.32 | 27.67 | 6.57 | 6.03
YOLOv7-tiny + EFAH | 22.12 | 38.26 | 13.68 | 23.65 | 27.09 | 6.67 | 6.12
YOLO X-s | 23.22 | 40.31 | 14.51 | 25.06 | 27.52 | 13.32 | 8.94
YOLO X-s + FWB | 23.54 | 40.82 | 14.76 | 25.67 | 27.61 | 13.33 | 8.94
YOLO X-s + ABB | 24.76 | 41.76 | 15.25 | 26.23 | 30.27 | 47.92 | 38.56
YOLO X-s + EFAH | 23.91 | 41.12 | 14.65 | 25.42 | 28.47 | 15.80 | 9.83
SSD300 | 21.22 | 39.04 | 8.52 | 21.95 | 28.61 | 30.58 | 24.15
SSD300 + FWB | 22.06 | 40.12 | 9.26 | 23.23 | 29.47 | 30.58 | 24.15
SSD300 + EFAH | 21.63 | 39.37 | 8.63 | 22.67 | 29.09 | 44.72 | 56.23
Table 3. Ablation experiments on two-stage detectors.
Models | mAP | mAP50 | mAPs | mAPm | mAPl | Flops/G | Parameters/M
Cascade R-CNN | 27.23 | 40.95 | 17.12 | 27.26 | 34.01 | 14.27 | 11.14
Cascade R-CNN + FWB | 27.12 | 41.85 | 18.06 | 27.62 | 34.72 | 14.28 | 11.14
Cascade R-CNN + EFAH | 27.63 | 41.66 | 18.45 | 28.47 | 35.14 | 58.52 | 48.09
Faster R-CNN | 25.90 | 41.61 | 17.72 | 25.94 | 33.04 | 236.00 | 69.16
Faster R-CNN + FWB | 26.03 | 41.52 | 19.04 | 27.16 | 32.77 | 236.00 | 69.17
Faster R-CNN + EFAH | 26.32 | 41.74 | 18.12 | 26.55 | 32.82 | 236.00 | 69.38
Deformable-DETR | 27.92 | 44.75 | 15.92 | 29.93 | 30.36 | 208.00 | 41.36
Deformable-DETR + FWB | 28.52 | 45.24 | 16.95 | 30.31 | 30.38 | 208.00 | 41.38
Deformable-DETR + EFAH | 28.32 | 45.02 | 16.73 | 30.44 | 31.65 | 231.00 | 41.63
Table 4. Comparison experiments.
Models | Backbone | mAP | mAP50 | mAPs | mAPm | mAPl | Flops/G | Parameters/M
YOLOv5-s | YOLOv5CSPDarknet | 14.23 | 28.52 | 10.65 | 17.71 | 12.83 | 8.12 | 12.33
YOLOv7-tiny | E-ELAN | 21.91 | 37.61 | 13.02 | 24.01 | 25.92 | 6.56 | 6.02
YOLO X-s | YOLOXCSPDarknet | 23.22 | 40.31 | 14.51 | 25.06 | 27.52 | 13.32 | 8.94
SSD300 | SSDVGG | 21.22 | 39.04 | 8.52 | 21.95 | 28.61 | 30.58 | 24.15
YOLOv8-s | YOLOv8CSPDarknet | 28.51 | 42.02 | 18.93 | 28.93 | 34.01 | 14.27 | 11.14
YOLOv10-n | CSPDarknet | 29.1 | 42.3 | 19.9 | 30.1 | 35.2 | 8.25 | 2.69
Faster R-CNN | ResNet-50 | 25.90 | 41.6 | 17.7 | 25.9 | 30.3 | 208.00 | 41.36
Cascade R-CNN | ResNet-50 | 27.23 | 40.95 | 17.12 | 27.26 | 33.04 | 236.00 | 69.16
Deformable DETR | ResNet-50 | 27.9 | 44.7 | 15.9 | 29.9 | 35.4 | 193.00 | 40.10
Ours | CSPDarknet | 30.01 | 43.32 | 20.96 | 31.07 | 35.58 | 59.15 | 48.32
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
