1. Introduction
According to a 2023 report by the World Health Organization, approximately 2.2 billion people, or roughly 27% of the global population, experience near or distance vision impairment, and in nearly half of these cases (approximately 1 billion people) the impairment could have been prevented or has yet to be addressed. Nearly 80% of near vision impairment cases are attributable to presbyopia, whereas conditions such as cataracts, refractive errors, age-related macular degeneration, glaucoma, and diabetic retinopathy predominantly contribute to distance vision impairment [
1]. Advances in deep learning technology, including computer vision, are being applied across various fields to enhance daily life. This technology is particularly effective in assisting visually impaired individuals by supporting visual functions for tasks such as mobility and object recognition [
2,
3]. To advance our study, we investigated the requirements for walking aids for visually impaired individuals through a literature review. Kim et al. [
4] surveyed 154 visually impaired adults to identify the essential features for developing an AI-based guide robot and analyzed the results. The feature rated most essential was “listening and providing guidance” at 48.1%, followed by “exploring the surroundings and providing information” at 29.9%. Other important features included “providing spontaneous information about places” and “conveying display content through voice guidance”. Based on the requirements identified in these references, we designed a walking aid system that incorporates the needs of visually impaired individuals, utilizing YOLO-based object detection and natural language voice output to assist them.
One of the primary challenges visually impaired individuals face is the difficulty of navigating independently in complex urban environments characterized by varied traffic signal systems, pedestrian volumes, intersections, and roundabouts [
5]. Walking aids and facilities, such as canes [
6,
7] and tactile paving [
8], have been provided to promote safe travel. However, owing to technological limitations, incomplete road information (such as details about obstacles or the shape of the road), and legal regulations regarding the installation of safety facilities, visually impaired individuals cannot rely solely on these aids and facilities. Therefore, to navigate complex urban roads, walking aids must be able to sense the surroundings and provide information on traffic signal systems, structures, and road safety facilities that are difficult to recognize visually.
Recently, various studies have introduced methods for providing environmental scanning, traffic, and street information—such as the location and color of traffic lights and the presence of crosswalks—to visually impaired individuals [
5,
6,
7,
8,
9,
10,
11,
12,
13,
14]. The most widely used smart cane [
6] alerts visually impaired individuals to approaching objects or obstacles through vibrations. However, because the cane must be held manually, its usability is reduced in crowded places.
In [
7], a system was developed to detect braille blocks, a walking aid that guides visually impaired individuals through various environments, by recognizing the geometric features and colors of the blocks. However, its use is limited to individuals familiar with braille blocks. In addition, recognition is limited when the blocks are obscured by obstacles, such as pedestrians and street facilities, or when the blocks are discolored or damaged due to aging.
In a subsequent study that enhanced the system’s object recognition performance [
10], an intelligent navigation support system was developed using deep learning and a neural architecture search (NAS) to provide obstacle information to visually impaired individuals. By designing a high-performance object recognition model based on a NAS, the system offers faster and more accurate recognition than previous models. Additionally, it detects obstacles in real time and delivers the results to the user via voice notifications. However, conveying only object names through voice notifications can be insufficient when complex information must be delivered.
Recently, with advancements in sensor technology, information for visually impaired individuals has increasingly been provided in wearable forms. Chen et al. [
11] proposed a wearable assistive system for visually impaired individuals that uses object recognition, distance measurement, and tactile feedback. This system utilizes the YOLOv3 object recognition model and two stereo cameras to detect obstacles and measure their distance in real time. The measured information is used to generate various vibration patterns based on the situation, utilizing shape memory alloy (SMA) actuators and small vibration motors. These vibration signals are provided via gloves, offering the advantage of intuitively conveying the direction and distance of obstacles. However, users must learn the different vibration patterns for each situation, and tactile perception varies among individuals.
Regarding smart devices, some devices combine smart glasses with a depth camera or connect a camera with a smartwatch or smartphone to detect objects and provide auditory warnings to visually impaired individuals [
12,
13]. However, these devices are limited to individuals wearing smart devices, and their implementation costs are high.
A device for object detection and notification for visually impaired individuals that applies object recognition algorithms to augmented reality devices was developed in [
14]. It combines the Microsoft HoloLens, a wearable device, with the YOLOv2 object recognition algorithm to convey the names of detected objects through voice. However, this device is limited in providing contextual information for complex situations and has high implementation costs.
Representative walking aid apps for visually impaired individuals include OKO [
15], Be My AI [
16], and Oorion [
17]. These apps utilize smartphone cameras to detect objects, obstacles, and text in real time and convey the recognized information to users via voice. In particular, Be My AI combines augmented reality technology to offer a more intuitive user experience by guiding visually impaired users along their walking paths and providing obstacle information. While these apps are similar to this study in that they deliver the names of surrounding objects to users in real time, they are limited in that they only convey the names of simple objects without describing the state of complex scenes in sentences. By contrast, our system integrates object recognition with natural language generation to provide voice-guided descriptive sentences composed of object information and cautionary phrases rather than just simple object names. This study proposes solving these issues by combining real-time object detection and inference using the object recognition model YOLO [
18] with the large-scale natural language model KoAlpaca [
19] to generate descriptive sentences that provide complex information about the recognized objects. These sentences are conveyed using text-to-speech (TTS) conversion [
20], implementing a walking aid system for visually impaired individuals. The hardware setup reduces implementation costs by using a single board, a webcam, and headphones, without requiring expensive devices. The main contribution of this study is that the proposed method explains the contextual situation of detected objects in natural language for visually impaired individuals. Existing assistive tools have primarily focused on replacing visual functions; however, visually impaired individuals also require situational explanations of detected obstacles. Therefore, the usefulness of assistive tools can be considerably enhanced by incorporating a natural language generation model that provides such situational explanations.
The remainder of this paper is structured as follows.
Section 2 discusses related research and the characteristics of the models used in this study.
Section 3 outlines multi-object recognition using YOLOv5.
Section 4 presents the generation of alert sentences using KoAlpaca, detailing the training data, model training, and evaluation.
Section 5 confirms the applicability of the system through the experimental results, explaining the integration and lightweight implementation of the two systems. Finally,
Section 6 describes the results and contributions of our research and presents future research directions and limitations.
3. Multiple Object Detection with YOLOv5
This study proposes a method to detect situations in the walking environments of visually impaired individuals and automatically generate walking information based on these situations. Real-time video data were processed frame by frame through object detection. The object detection algorithm uses the YOLOv5 model to detect objects in the video input from the camera and obtain labels for the detected objects.
Four classes were defined for the training data: bollards, crosswalks, pedestrian traffic lights (red), and pedestrian traffic lights (green). When an object is detected above a certain threshold, the detected object label is input into the “Instruction” field of the natural language generation model KoAlpaca, which then generates information about the object and outputs it through the “Output” field.
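To make this hand-off concrete, the following minimal sketch fills the “Instruction” field with a detected class label and reads back the generated “Output”. The Hugging Face model identifier, prompt wording, and generation settings are illustrative assumptions rather than the exact configuration used in this study.

```python
# Hedged sketch: detected label -> KoAlpaca "Instruction" field -> generated "Output".
# Model ID, prompt template, and decoding settings are assumptions for illustration.
from transformers import pipeline

generator = pipeline("text-generation", model="beomi/KoAlpaca-Polyglot-5.8B")  # assumed checkpoint

def generate_walking_info(detected_label: str) -> str:
    # Alpaca-style template: the detected class name fills the Instruction field.
    prompt = f"### Instruction:\n{detected_label}\n\n### Output:\n"
    generated = generator(prompt, max_new_tokens=64, do_sample=False)[0]["generated_text"]
    # Keep only the text produced after the Output marker.
    return generated.split("### Output:\n", 1)[-1].strip()

print(generate_walking_info("crosswalk, bollard"))
```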
Figure 2 depicts the structure of the full system. The generation of obstacle information automatically compiles the current walking situation and risk factors based on recognized objects. The KoAlpaca model analyzes the input object information and generates walking information, including details regarding objects observed during walking, such as crosswalks and bollards, along with cautionary phrases related to them. The generated walking information is conveyed to the user as a voice through a TTS module. Additionally, for the object detection and natural language generation experiments, the system was implemented on a single-board computer, Jetson Nano [
34], enabling rapid and efficient notifications to visually impaired individuals about their walking environment.
Figure 2 also illustrates the setup, which includes headphones that can be connected to the single board via Bluetooth. This setup was used for the prototype implementation and experiments of the system proposed in this study. To allow visually impaired individuals to continue perceiving environmental sounds, we considered the use of bone-conduction headphones.
3.1. Multiple Object Detection Pipeline
To implement multi-object detection, YOLOv5 is utilized. YOLOv5 can perform real-time multi-object detection and inference simultaneously and is available as open-source software with various versions, depending on the network size.
Figure 3 shows the structure of the YOLOv5 model for multi-object detection.
The input image passes through the backbone, neck, and head, ultimately returning the final prediction. The features of the input image are extracted in the backbone field. The multiple convolution and pooling layers alter the image resolution, allowing features to be extracted at various resolutions. The extracted features then pass through the neck field, where they are fused.
At this stage, a path aggregation network [
35] is used to fuse low- and high-resolution features to improve performance. The fused features are then passed to the head field, where convolution layers transform them into the final output. The final output includes the bounding box parameters (x, y, w, and h), a confidence score indicating the probability of an object’s presence, and the class probabilities, thereby completing the final recognition.
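For reference, the sketch below shows how such frame-by-frame inference could be run with a fine-tuned YOLOv5 checkpoint loaded through torch.hub, reading the bounding boxes, confidence scores, and class labels from each webcam frame. The weight file name, camera index, and confidence threshold are illustrative assumptions.

```python
# Hedged sketch of per-frame YOLOv5 inference; file names and thresholds are assumptions.
import cv2
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")  # fine-tuned weights
model.conf = 0.5                            # minimum confidence for a reported detection

cap = cv2.VideoCapture(0)                   # webcam input
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame[..., ::-1])       # BGR -> RGB before inference
    # Each detection row: x1, y1, x2, y2, confidence, class index
    for *box, conf, cls in results.xyxy[0].tolist():
        print(model.names[int(cls)], round(conf, 2), [round(v) for v in box])
cap.release()
```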
In this study, YOLOv5 was fine-tuned using training data that covered the appearance and status of obstacles that visually impaired individuals need to avoid or be aware of when crossing roads in urban areas.
3.2. YOLOv5 Training Data
To detect situations occurring in the walking environment of visually impaired individuals, the training data combined publicly available image sets obtained from Roboflow [
36] with images directly captured by us. By increasing the number of road and street images taken from a pedestrian’s eye level, we aimed to enable the YOLOv5 model to better learn the objects of the four defined classes (bollards, crosswalks, pedestrian traffic lights (red), and pedestrian traffic lights (green)).
The collection of public data is explained as follows. The classes were designed to focus on static objects that can be detected while walking. Because the shapes of bollards and signals vary by region, various forms of original images were collected through five Roboflow projects. We collected 3345, 2146, 679, 9497, and 1682 images from the following: the “Traffic-light Computer Vision Project” [
37] and the “Capstone for Detection1 Computer Vision Project” [
38], which includes walking environments such as signals, crosswalks, and bollards; the “Traffic-sign Project” [
39], which includes various forms of bollards and crosswalks, depending on the region; the “Pedestrian Signs and Lanes Computer Vision Project” [
40], which includes signals from various regions; and the “July_6 Computer Vision Project” [
41], respectively. Images whose object area or sharpness fell outside the permissible range were removed, resulting in 5180 raw images.
Of these, 25% of the raw images contained objects classified into two or more classes. Specifically, 1105 images contained objects classified into only two classes and 193 images contained objects classified into three classes, totaling 1298 images.
To ensure sufficient images from the pedestrian’s eye level, fifty-two images corresponding to the four classes were captured between 2 and 4 PM under clear and rainy or cloudy weather conditions. These images were augmented through blurring effects and noise application to triple the number and were refined for use in training. The device used to capture these images was an iPhone 15 Pro with a resolution of 4032 × 3024.
Annotation for creating the training data was conducted using RoboFlow. The four classes were selected considering obstacles encountered in real walking environments. The total number of annotations in the constructed training dataset was 9045.
Figure 4 shows examples of collected images.
Table 1 lists the specifications of the training data, including the number of images and annotations in each class.
YOLOv5 recommends more than 1500 training images per class over 300 epochs [
42]. Therefore, we used data augmentation techniques to expand the diversity of the training data from raw images. As presented in column 4 of
Table 1, the number of annotations for each class is nearly equal. This demonstrates that data augmentation allowed us to expand the training images while simultaneously addressing the imbalance in the number of annotations per class.
Two augmentation techniques, blurring and noise effects, were selected and applied, considering the decrease in the object detection rate caused by weather conditions during walking. As a result, 12,846 images and 22,342 annotations were constructed, excluding images that were similar to the originals but of reduced resolution; augmentation thus yielded approximately three times the number of collected images.
Figure 5 illustrates examples of the augmented data. The constructed dataset was divided into training, validation, and test sets of 11,499, 1085, and 262 images, respectively, at an approximate ratio of 9:0.8:0.2 (90%, 8%, and 2%). Owing to its feature pyramid structure [
43] and model architecture, YOLOv5 downsamples input images by multiples of 32. Therefore, the input images were resized to 640 × 640 pixels to satisfy this requirement.
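The snippet below is a minimal sketch of the two augmentation effects described above (blurring and additive noise) together with resizing to the 640 × 640 input size, written with OpenCV and NumPy; the kernel size, noise strength, and file name are illustrative assumptions rather than the exact settings used to build the dataset.

```python
# Hedged sketch of the blur/noise augmentation and 640x640 resizing; parameters are assumptions.
import cv2
import numpy as np

def blur_image(img: np.ndarray) -> np.ndarray:
    # Simulates reduced sharpness, e.g., from rain or an out-of-focus lens.
    return cv2.GaussianBlur(img, (7, 7), 0)

def add_noise(img: np.ndarray, sigma: float = 15.0) -> np.ndarray:
    # Additive Gaussian noise to mimic low-light or sensor noise.
    noise = np.random.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def to_model_input(img: np.ndarray) -> np.ndarray:
    # YOLOv5 downsamples by multiples of 32, so 640 x 640 satisfies the requirement.
    return cv2.resize(img, (640, 640))

original = cv2.imread("crosswalk_example.jpg")           # hypothetical raw image
augmented = [to_model_input(f(original)) for f in (blur_image, add_noise)]
```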
3.3. Training YOLOv5
In this study, we selected YOLO as the object detection model because of its multi-object detection and rapid inference capability. We used YOLOv5, which combines bottleneck [
44] and cross-stage partial network (CSPNet) [
45] techniques. The bottleneck technique controls the computational load and information loss by reducing and restoring the channels of the feature maps extracted from the images, thereby minimizing errors with minimal computation. The CSP technique applies convolution operations to only part of the base layer before convolution with the next layer, and the remaining part is merged with the computed results. This reduces the computational load of the convolution operations, enhances the learning ability of the model, and reduces memory consumption. Therefore, YOLOv5, with its combined BottleneckCSP structure, is capable of accurate and fast object detection with low memory consumption. YOLOv5 provides five pretrained weight files, depending on the size of the network used.
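The following simplified PyTorch sketch illustrates the two ideas described above: a bottleneck block that narrows and restores the channel dimension around a residual connection, and a CSP-style block that routes only part of the feature map through the bottlenecks before merging the two paths. It mirrors the structure of YOLOv5’s BottleneckCSP/C3 modules for illustration only and is not the library’s exact implementation.

```python
# Hedged, simplified sketch of Bottleneck and CSP blocks (not YOLOv5's exact code).
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels: int, shortcut: bool = True):
        super().__init__()
        hidden = channels // 2                                   # 1x1 conv narrows the channels
        self.reduce = nn.Conv2d(channels, hidden, 1, bias=False)
        self.expand = nn.Conv2d(hidden, channels, 3, padding=1, bias=False)
        self.act = nn.SiLU()
        self.shortcut = shortcut

    def forward(self, x):
        y = self.act(self.expand(self.act(self.reduce(x))))
        return x + y if self.shortcut else y                     # residual limits information loss

class CSPBlock(nn.Module):
    def __init__(self, channels: int, n: int = 1):
        super().__init__()
        half = channels // 2
        self.path_a = nn.Conv2d(channels, half, 1, bias=False)   # part routed through bottlenecks
        self.path_b = nn.Conv2d(channels, half, 1, bias=False)   # part that bypasses them
        self.blocks = nn.Sequential(*(Bottleneck(half) for _ in range(n)))
        self.merge = nn.Conv2d(2 * half, channels, 1, bias=False)
        self.act = nn.SiLU()

    def forward(self, x):
        a = self.blocks(self.path_a(x))                          # convolved partial path
        b = self.path_b(x)                                       # untouched partial path
        return self.act(self.merge(torch.cat((a, b), dim=1)))    # merge both paths

x = torch.randn(1, 64, 80, 80)                                   # dummy feature map
print(CSPBlock(64, n=2)(x).shape)                                # torch.Size([1, 64, 80, 80])
```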
The development environment for training YOLOv5 is listed in
Table 2. We used a single NVIDIA RTX 4090Ti GPU and conducted training in an Anaconda environment. NVIDIA’s CUDA was used as the GPU development tool, and training was based on Python and PyTorch.
To implement a lightweight system, we conducted an ablation study by training and comparing the nano and small versions of the YOLOv5 backbone, which are smaller-sized weight files provided by YOLOv5. We used the default learning rate provided by YOLOv5 for parameter training [
46]. The number of epochs and the batch size were set to 100 and 16, respectively, and were applied equally to both versions to compare the models’ performance and size.
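As a reference for reproducing this setup, the sketch below launches the two training runs from Python using the command-line flags of the ultralytics/yolov5 repository’s train.py; the dataset configuration file name is a hypothetical placeholder.

```python
# Hedged sketch of launching the nano/small ablation runs; the dataset YAML name is hypothetical.
import subprocess

for weights in ("yolov5n.pt", "yolov5s.pt"):       # nano and small pretrained backbones
    subprocess.run(
        [
            "python", "train.py",                  # train.py from the ultralytics/yolov5 repo
            "--img", "640",                        # input resolution (multiple of 32)
            "--batch", "16",
            "--epochs", "100",
            "--data", "walking_env.yaml",          # hypothetical 4-class dataset config
            "--weights", weights,
        ],
        check=True,
    )
```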
Table 3 presents the results of the comparative analysis of the two versions. Although approximately a fourfold difference in model size exists, the performance improvement is only approximately 1%. Therefore, to build an efficient lightweight system, the smaller nano model was used for system implementation.
Figure 6 shows an example of the training results. The results show that multiple objects belonging to the four classes are detected. We can generate natural language sentences that describe the situation in advance using the information inferred from these images. For example, in an image where a crosswalk and bollard are inferred together, a natural language sentence can be generated to caution the user, such as “Be careful, there is a crosswalk with bollards ahead”.
3.4. Evaluation of YOLOv5
The performance of the trained model was evaluated using the confusion matrix [
47] and the evaluation metrics observed during model training. The evaluation metrics included stepwise loss rates, precision, and recall, from which the mean average precision (mAP) was determined to assess model performance [
48].
Figure 7 shows the confusion matrix for the test data. The detection rate for each class in the test data was confirmed through the confusion matrix. For bollards, the true detection rate was 92%, the false positive rate was 8%, and the false negative rate was 17%. No misdetections occurred between classes; across all classes, the average false positive rate was 11.8% and the average false negative rate was 18%.
Figure 8 illustrates the precision, recall, and mean average precision (mAP) values of the model for each class. Precision represents the ratio of correctly predicted objects out of all objects predicted by the model; the higher the precision, the greater the model’s accuracy. Recall indicates the proportion of actual objects in the input data that the model correctly identified; a higher recall means the model has a better detection rate [
48]. The mAP is the mean of the per-class average precision values and indicates recognition accuracy; the higher the mAP, the more accurate the model’s predictions.
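For reference, these metrics follow their standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives, p(r) is the precision as a function of recall, and N is the number of classes:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} p(r)\,dr, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i
```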
Table 4 lists the classwise performance of the model on the test data. The detection and recognition accuracies are 88.84% and 98.68%, respectively. Although the confusion matrix and precision–recall curve indicate lower recognition accuracy for crosswalks, in the actual experiments there were no instances in which a crosswalk object failed to be detected or recognized.
5. Experimental Results
The proposed system was implemented on a single board to detect situations and objects that may occur in the walking environments of visually impaired individuals using a webcam. It automatically generates obstacle notifications and caution sentences based on the names of detected objects.
The proposed system integrates object recognition and natural language generation through the Slack API [
56], a cloud-based messaging platform. When the number of frames in which YOLOv5 recognizes an object reaches a certain threshold, the object’s name is sent to the Slack user channel. Subsequently, the object name sent to the user channel is retrieved using the “conversations.history” API method, which searches existing messages in the Slack channel and is passed to the KoAlpaca generator. The notification and cautionary sentences generated based on the transmitted object name are then sent back to the Slack channel via the “chat.postMessage” API. Finally, the sent sentences are converted to speech via a TTS module, which is delivered to the user through headphones connected to the single board.
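The sketch below illustrates this Slack-based hand-off with the slack_sdk WebClient and a TTS step; the bot token, channel ID, and the use of gTTS for speech synthesis are illustrative assumptions rather than the exact implementation.

```python
# Hedged sketch of the Slack hand-off between detector and generator; token, channel,
# and TTS back end are placeholders/assumptions.
from slack_sdk import WebClient
from gtts import gTTS

client = WebClient(token="xoxb-...")              # bot token (placeholder)
CHANNEL = "C0123456789"                           # user channel ID (placeholder)

def send_detected_label(label: str) -> None:
    # Detector side: post the object name once the frame-count threshold is reached.
    client.chat_postMessage(channel=CHANNEL, text=label)

def fetch_latest_label() -> str:
    # Generator side: read the most recent message from the channel history.
    history = client.conversations_history(channel=CHANNEL, limit=1)
    return history["messages"][0]["text"]

def send_caution_sentence(sentence: str, audio_path: str = "caution.mp3") -> None:
    # Post the generated sentence back to the channel, then synthesize Korean speech
    # for playback through the headphones connected to the single board.
    client.chat_postMessage(channel=CHANNEL, text=sentence)
    gTTS(sentence, lang="ko").save(audio_path)
```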
The single board used was a Jetson Orin Nano, an embedded computing board from NVIDIA designed for lightweight, low-power systems. Unlike conventional single boards, such as the Arduino and Raspberry Pi, it is designed to support AI computations. The single-board implementation was based on NVIDIA’s Ampere GPU architecture [
57], using NVIDIA’s CUDA as the GPU development tool. The system was developed using Anaconda, Python (3.7.0), and PyTorch (11.3).
Table 10 lists the environment of the computing system in the Jetson Orin Nano.
A time domain analysis was conducted to verify the applicability of the proposed system. The analysis examined the number of classes and objects within a frame, the average time taken for object detection and sentence generation, the average time of audio prompts, and the total time to process a single frame to verify the performance across various scenarios.
Table 11 presents the results of the time domain analysis of the proposed system. We used the Shapiro–Wilk test to verify the normality of the data for the four processing indices. All results followed a normal distribution except for the average audio prompt time in the case of three classes and seven objects in the frame (p-value = 0.0270).
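The normality check can be reproduced with the Shapiro–Wilk test in SciPy, as sketched below; the sample values are placeholders, not the measured timings.

```python
# Hedged sketch of the Shapiro-Wilk normality check; the values below are placeholders.
from scipy import stats

frame_times = [0.41, 0.39, 0.44, 0.40, 0.42, 0.38, 0.43]   # hypothetical per-frame times (s)
statistic, p_value = stats.shapiro(frame_times)
# p_value > 0.05: no evidence against normality; p_value <= 0.05: normality rejected,
# as for the audio prompt times in the three-class, seven-object case (p = 0.0270).
print(statistic, p_value)
```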
The proposed system automatically detects objects and situations and generates cautionary sentences based on them. This allows the system to convey contextually meaningful sentences about complex situations rather than mere object names. Additionally, by delivering sentences via a TTS module, the system can reduce users’ reliance on sensory-dependent aids, such as canes and clickers, which visually impaired individuals commonly use. Compared with other methods that use various sensors and devices for object detection and information conveyance to support the visually impaired, the proposed system achieves detection with a low-cost webcam and conveys cautionary sentences through headphones.
Table 12 presents the results of comparing the proposed system with related works in this field.
A qualitative evaluation of the proposed system was conducted through a pilot test with one visually impaired person and three blindfolded sighted people in their 20s. The test for the visually impaired participant was conducted in an indoor desktop environment, whereas the sighted participants performed the test while walking on actual roads.
Figure 11 shows scenes of the experiment for each participant. All participants used our system with the assistance of a helper. Before the experiment, a brief description of the system was provided so that the participants could understand the experimental steps. The results were confirmed through usability scores and feedback from participants after the experiment. Scores were out of five, with higher scores indicating better usability of the proposed system.
Table 13 lists the obtained experimental results.
The pilot test results indicated an average usability score of 4.05. While the participants expressed some dissatisfaction with the notification delivery time and the online implementation, they highly praised the system for its ability to detect objects from a distance and for its accuracy.
In particular, the participants praised the system’s ability to convey sentences that described the visual context of detected objects rather than simply announcing the presence of an obstacle or a warning sound. The visually impaired participant noted that, compared with the assistance app they were currently using, the proposed system provided faster notifications and helped them quickly understand complex walking situations through voice output that described the walking scenario in sentences. These reviews confirm that the proposed system can provide accurate notifications to users and demonstrate its applicability in real walking situations.
6. Discussion
6.1. Results and Contribution
In this study, we developed a multi-modal walking environment information generation system for visually impaired individuals by merging object recognition and cautionary sentence generation systems using YOLOv5 and KoAlpaca, respectively. We applied image data augmentation to ensure the diversity of training data for object recognition in various environments and used GPT, a generative AI, to build our own natural language data and train the pedestrian caution warning generation system. The training of each model was evaluated using the evaluation metrics. The system was implemented on a single board, and a comparative analysis with related studies was conducted. In addition, a pilot test with visually impaired individuals was conducted to verify the applicability and flexibility of the system in a real walking environment.
The constructed system is expected to bring about significant changes when applied to various wearable devices, and it has the following outcomes. First, it enables real-time monitoring of the various situations and objects encountered in walking environments using AI. By using images collected from the viewpoint of actual pedestrians as training data, we improved the object detection rate in real walking environments. Second, the automatic information provision system automates the detection and monitoring of the walking environment, converts environmental information into text, and conveys it in real time. This allows for immediate AI-based responses and offers various potential applications, such as data accumulation.
6.2. Limitations and Future Work
We developed a walking environment information generation system for visually impaired individuals, focusing on four static classes. The system was designed to recognize the walking environment and generate environmental information. A pilot test was conducted with one visually impaired participant and three sighted participants. The system’s contribution was assessed through comparison with previous research and by evaluating system execution time. However, this study had limitations. The experiments were conducted in a controlled laboratory environment with a small number of participants. To generalize the proposed system, it should undergo validation with a more diverse and larger group of participants, considering factors such as age, gender, and physical abilities, after obtaining IRB approval. This would allow for a more accurate assessment of the system’s effectiveness and usability and help improve its performance by reflecting the actual needs of the participants.
Additionally, for the generalization of the system, the classes should be updated to include static and dynamic objects. This means effectively processing a large and diverse amount of data and applying it to the system. Future research should focus on obtaining high-quality data to enhance the system’s performance and generating various types of sentences to provide more information to visually impaired individuals.