1. Introduction
Blindness and impaired vision result from a range of causes including glaucoma, cataract, and age-related macular degeneration [1]. While age-related macular degeneration causes central vision loss, glaucoma mainly affects the outer visual field and Peripheral Vision (PV) in particular [2]. While most visual acuity problems can be corrected with traditional solutions such as eyeglasses, visual field defects are not easily rehabilitated. This is because most of these defects arise after brain injury or eye conditions in which parts of the visual system become permanently damaged [3].
Two types of vision areas define a human’s visual field: central and peripheral. These areas are used to see and recognise different levels of detail and information. The brain relies on the central visual field (5°) for reading, focusing, drawing, crossing the road, and many other daily activities that require a detailed understanding of specific features. Peripheral vision, on the other hand, is used to detect larger contrasts, colours and motion, and extends up to 160° horizontally and 145° vertically for each eye [4]. While peripheral vision is inferior to central vision in terms of detail, it is particularly useful for drawing the brain’s attention to the surrounding environment. One of its critical roles is the ability to detect and avoid potential hazards in the surroundings. To explore the fine details of a specific object, humans use head movements to gather more information and increase their cognitive understanding. Standard visual field extents for both eyes are shown in Figure 1. The central vision is shown as a white circle in the middle covering 30° around the fixation point (assumed to be the centre of the figure). Due to retinal eccentricity [5], different parts of the visual field are resolved at different resolutions: the more central the area, the higher the visual resolution [6].
In the case of peripheral vision loss, the outer visual field areas are impaired to varying degrees, while central vision may remain healthy. Tunnel vision is the most extreme form of peripheral vision loss: people with tunnel vision can see only through a small circular area in their central vision (≈10°). In this case, the person must continuously shift their focus around to gain a full understanding of the surroundings and possible threats [7,8].
Figure 2 shows a simulated view of the same scene with healthy and tunnel vision.
People with peripheral vision loss have both healthy and impaired areas in their visual field. Eye specialists use a perimetry test to measure these areas. The test clearly defines where the person can and cannot see, and it can also plot the progression of field loss over time [9].
A system that runs computer vision algorithms in real time to provide useful information about possible threats in the user’s blind area would enhance functional vision by giving cues from the affected field, without requiring the user to constantly shift their fixation point. It is essential that these additional cues provide fast and trustworthy notifications that reflect the hazard type, the degree of danger and, most importantly, the location of that hazard.
Smart Assistive Technologies (AT) and mobile healthcare systems are developing rapidly. With the massive growth in the hardware and software sectors, wearable smart devices have become widely affordable. Vision assistance devices have been developed to be worn on several body parts such as the head, chest, fingers, feet, and ears. Digital cameras have been used to collect data about the surrounding environment, which is processed to generate useful outputs for many vision rehabilitation applications such as indoor/outdoor navigation, obstacle detection and tracking, and activity recognition. All these applications share the common goal of enhancing the individual’s quality of life [
10].
Object recognition, object tracking, visual odometry, activity classification and many other real-time computer vision algorithms are now used every day in applications such as video surveillance, AT, video compression and robotic navigation. These applications are becoming more affordable in the healthcare field due to considerable developments in mobile and portable smart devices [
11].
Traditional computer vision applications use the captured data to respond in a real-time manner to a specific condition or scenario. On the other hand, context-aware systems try to understand the context and circumstances of a given case and respond or update their response accordingly [
12]. This is applied not only in the system design phase but also in the on-going processing time while the system is performing its tasks [
13].
Context-Aware Assistive Systems (CAAS) have become widely used in autonomous cars, mobile phone applications and healthcare sectors [
14]. Using context-awareness concepts together with computer vision algorithms and new wearable technology could provide smart, context-aware and easy-to-use wearable AT for visually impaired people. Incorporating such context-aware assistive software into wearable technology suitable for use by the visually impaired requires careful consideration and a delicate balance between computational power, visual display and usability/wearability.
While Virtual Reality (VR) attempts to replace the user’s vision with a computer-generated, virtual environment [
15], Augmented Reality (AR) and Mixed Reality (MR) [
16] use the individuals’ vision to add more helpful information and extend their knowledge without blocking the original vision [
15,
17]. In AR applications, the computer-generated inputs cannot interact with the real-world content, whereas in MR applications the two can react to each other.
In this paper, we propose a smart assistive technology that combines smart glasses with computer vision algorithms: a system that recognises objects in the user’s visual field and classifies them to determine the possible danger level. Motion features of the detected objects, such as speed, direction, location and age (the appearance time in terms of the number of frames), are extracted using object tracking modules. This work is part of a larger project to develop a user-centred design for a wearable, context-aware hazard detection system for people with peripheral vision loss [18,19].
A Neural Network classifier is implemented to classify the detected objects, based on the extracted motion features, into one of five classes. Public and private datasets are used to train the system, with a predefined ground truth labelled by an expert. This is used to generate a meaningful notification that is reliable and placed in the best visual position to warn the person about any possible hazard. The contributions of this paper can be summarised as follows:
A context-aware assistive technology is developed to increase cognitive awareness for people who have vision impairment using computer vision and machine learning algorithms.
An egocentric indoor and outdoor hazard recognition dataset is created using a wearable camera and classified using a deep learning object detector and a Kalman Filter tracker, to be used for hazard detection and classification for people with vision defects.
A motion model that describes the hazard type in the user’s environment based on motion features is presented to be used in the classification stage.
A machine learning-based hazard classification system using motion features for multiple hazards simultaneously is proposed to provide a smart and early warning system to help people with peripheral vision loss.
This paper is structured as follows: Section 2 reviews the related literature, and Section 3 describes our proposed system. Section 4 describes the datasets employed, while system evaluation experiments are presented and analysed in Section 5. Finally, research findings, conclusions, and recommendations for future work are provided in Section 6.
3. The Proposed System
We propose a wearable AT for Visually Impaired People (VIP) providing early smart notifications for potential hazards. The proposed technology is to be used with wearable smart glasses that utilise a wide-angle camera integrated into an Android device.
Rather than replace their (already limited) visual field, this system is intended to help VIP compensate for their visual defects by increasing their cognitive awareness of the peripheral field. The purpose is to alert the user to any possible hazards or threats in their environment.
Figure 3 shows the general concept of this work. In this figure, the grey area represents the user’s (blind) peripheral vision, while the blue area represents the user’s actual vision. People with peripheral vision loss miss information that describes their surroundings; therefore, they cannot build a mental map of the physical environment. Although it is possible to use other sensory data such as sound and smell, the vision sensor is particularly valuable for identifying dangerous situations during navigation. The system provides visual feedback for the user by detecting, tracking and classifying objects in the blind area and generating a suitable notification to increase the user’s awareness.
3.1. User Requirements
An exploratory study was conducted with five visually impaired participants to understand the daily challenges and needs of VIP. The participants answered a questionnaire about the significant challenges that affect their quality of life in general and their independent navigation in particular.
It was found that 80% of the participants prefer having notifications about moving rather than stationary objects. When they were asked to specify the type of objects they are interested in, cars, people and bicycles were the most chosen options.
Figure 4 shows the users’ preferences for the types of objects for which they would require a notification.
Several research papers explored the VIP requirements for assistive technologies. In their research paper [
37], Jafri and Khan presented their obstacle detection and avoidance application for VIP based on the results of a semi-structured interview. While a human guide was superior to the white cane as a navigation aid, the participants mentioned that this method causes them problems, as they depend entirely on the guide, who may not provide accurate warnings about obstacles. In the same study, moving and minimal obstacles were reported as the most difficult to detect and avoid during indoor navigation.
The lack of information describing the physical environment is one of the core challenges for VIP navigation. Many participants mentioned this as the need for a clear description of the main indoor and outdoor landmarks that would help them build a mental map [
38,
39]. In their comprehensive study about computer vision algorithms for AT, Leo et al. [
40] mentioned several open challenges in developing AT for VIP; object detection and tracking, especially in egocentric video streams, are examples.
Based on these requirements, preferences and challenges, we designed our system to provide early, meaningful and straightforward notifications that extend the user’s mental map. We define a possible hazard as any moving or stationary object in the user’s pathway that they are not able to see or recognise and may collide with while walking. The system therefore scans the real-time video to search for objects and then tracks their movement. Motion features such as object speed, direction and location, in addition to the object type and other features, determine the level of danger for each detected hazard. As shown in
Figure 3, the goal of the system is to produce notifications that make the user aware of potential hazards. The system prioritises the detected hazards to generate useful feedback without overloading the user with too much information.
Visual field test results are used to delineate both healthy and impaired vision areas.
Figure 5 shows three examples of visual field test results for peripheral vision defects. The left column shows the central field test (left/right eye results), while the right column shows the full field test (both eyes together). For the central visual field, the test checks the visual sensitivity for each eye’s visual field (30° around the fixation point) and displays different grey levels for each location representing different visual sensitivity. A full field test result covers ≈160°.
This system uses the visual field test results to search for possible threats in the user’s blind area and classifies these threats based on their danger level. The smart glasses are used to display essential notification outputs in the user’s healthy vision area.
Figure 6 shows an overview of the proposed system and the main components used in our project.
The first stage is object detection and recognition, where objects are detected using a deep learning object classifier to determine the type and location. Motion features are extracted using the moving objects’ tracking module to determine the age (the appearance time in terms of frames), speed and direction for each detected object. This information is processed and used to determine the level of danger for each identified object using a neural network classifier.
Objects in the peripheral vision of VIP manifest themselves in different ways, such as hazards, obstructions, surprises or immediate dangers. For example, someone walking in the street may not be aware of a cyclist or pedestrian on the other side of the road, or of dangers such as a car crossing their walking route, a street bollard, overhanging cables, or trees and bushes at the side of the road. Not all activities in the periphery are equally important to the VIP. Therefore, a system is needed that prioritises all these activities and nudges the user to turn their head towards the most immediate threat so that it can be seen through their healthy vision.
Based on the users’ preferences and needs mentioned above, five hazard classes are defined using videos captured with the smart glasses and public datasets. The class number represents the danger level (one is the lowest, five is the highest); a minimal enumeration sketch follows the list:
Class 1: static object not in the user’s pathway,
Class 2: moving objects not related to the user (any type),
Class 3: static object in the user’s pathway,
Class 4: person moving towards the user (or user’s pathway),
Class 5: object moving towards the user (or user’s pathway).
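For illustration, these classes can be captured as a simple enumeration; the sketch below is a hypothetical representation whose names are chosen only to mirror the list above, not taken from the paper’s implementation.

```python
# Hypothetical enumeration of the five hazard classes; higher value = higher danger.
from enum import IntEnum

class HazardClass(IntEnum):
    STATIC_OFF_PATH = 1      # static object not in the user's pathway
    MOVING_UNRELATED = 2     # moving object not related to the user (any type)
    STATIC_ON_PATH = 3       # static object in the user's pathway
    PERSON_APPROACHING = 4   # person moving towards the user or their pathway
    OBJECT_APPROACHING = 5   # other object moving towards the user or their pathway
```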
The visual field has different levels of visual sensitivity depending on where the image lies relative to the fovea or fixation point [3]. This inspired us to define the user’s navigation route as the depth extent of the central vision and a small part of the macular vision around the fixation point. While the fixation point will vary, images are treated as centred around the fixation point.
3.2. Deep Learning-Based Object Detection
The first stage of the proposed system is to detect predefined objects that exist in the real world but are not visible to the visually impaired user. The goal of this stage is to obtain (1) the type of the detected objects and (2) the current locations of these objects. In the related literature, researchers have used You Only Look Once (YOLO) [33], Faster Region-based Convolutional Neural Networks (Faster R-CNNs) [41], Single-Shot Detectors (SSDs) [42] and other deep convolutional neural network object recognition systems. In our system, we found that YOLO needs a powerful graphics processing unit to perform the classification process, which is not available in the smart glasses. On the other hand, Faster R-CNNs are quite slow (on the order of seven frames per second), which would affect the whole hazard classification process.
SSD was originally developed by a research group at Google. The method can detect multiple objects in an image at the same time using a single deep neural network [
42]. Since our system will be running on resource-constrained devices, we used an existing lightweight network architecture called MobileNets [
43]. We used a combined version of SSD and MobileNets, which is called MobileNet SSD. This model was trained on the Common Objects in Context (COCO) dataset [44] and then fine-tuned on the Pascal Visual Object Classes (VOC) dataset [45] to achieve better accuracy.
This framework was implemented using the OpenCV 3.3 Deep Neural Network (DNN) module to create a real-time object detector capable of detecting 21 classes (20 object classes plus background): airplanes, bicycles, birds, boats, bottles, buses, cars, cats, chairs, cows, dining tables, dogs, horses, motorbikes, people, potted plants, sheep, sofas, trains, and TV monitors. This framework was found to be the best choice to cover the object types mentioned in the user requirements section. In this work, a pre-trained version of the detector is used; in future work, we plan to re-train the classifier to reduce the number of classes based on the users’ requirements.
The detection stage starts by processing each frame to extract the objects’ blobs. These blobs are then passed to the OpenCV deep learning module to recognise the type of each detected blob. A final check filters out objects with low confidence to reduce the number of false detections.
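As an illustration of this detection stage, the following minimal sketch uses the OpenCV DNN module with the publicly released MobileNet SSD Caffe files; the model file names and the 0.5 confidence threshold are assumptions for the sketch, not values taken from the paper.

```python
# Minimal MobileNet SSD detection sketch (assumed model files:
# MobileNetSSD_deploy.prototxt / .caffemodel from the public release).
import cv2
import numpy as np

CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat", "bottle",
           "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse",
           "motorbike", "person", "pottedplant", "sheep", "sofa", "train",
           "tvmonitor"]
CONFIDENCE_THRESHOLD = 0.5

net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")

def detect_objects(frame):
    """Return (class_name, confidence, [x1, y1, x2, y2]) for one frame."""
    h, w = frame.shape[:2]
    # MobileNet SSD expects a 300x300, mean-subtracted and scaled input blob.
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()            # shape: (1, 1, N, 7)
    results = []
    for i in range(detections.shape[2]):
        confidence = float(detections[0, 0, i, 2])
        if confidence < CONFIDENCE_THRESHOLD:
            continue                      # drop low-confidence detections
        class_id = int(detections[0, 0, i, 1])
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
        results.append((CLASSES[class_id], confidence, box.astype(int)))
    return results
```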
3.3. Multiple Object Tracking
Since the system has detected moving objects in the previous stage, the approximate location of each object is known. For each detected object, we used the location information to initialise a Kalman Filter (KF) to predict its motion over time. The KF is a recursive estimator that predicts the state of the system $x_t$ at time $t$ based on information from the previous state $x_{t-1}$ using the following equation:
$$x_t = F_t x_{t-1} + B_t u_t + w_t,$$
where $F_t$ refers to the state transition model that describes the change that happens to the state between time $t-1$ and $t$, $u_t$ is the vector of control inputs, $B_t$ is the control matrix, and $w_t$ is the noise vector for the process transition model.
Then, the measurement vector $z_t$ is computed using the following equation:
$$z_t = H_t x_t + v_t,$$
where $H_t$ is the transformation matrix between the state vector parameters and the measurement domain and $v_t$ is the measurement noise vector. The process noise at time $t$ is assumed to be Gaussian distributed with covariance $Q_t$.
The KF estimation process has two phases: prediction and update. In the prediction phase, the filter uses the initial state estimate and its associated uncertainty (covariance) matrix to create an estimate of the current state. For a better and more accurate estimate, the update phase computes the KF gain and uses the measurement vector from the current state to improve the prediction for the next state.
The KF is used in this work to estimate each detected object’s location and speed. Thus, the state of each object is represented as:
$$X = [x, y, v_x, v_y]^{T},$$
where $x$, $y$ are the centre of mass coordinates for each object and $v_x$, $v_y$ are the velocity components.
In the prediction phase, the system predicts both the state vector $\hat{X}$ and the covariance matrix $P$ using the following equations:
$$\hat{X}_{t|t-1} = F_t \hat{X}_{t-1|t-1} + B_t u_t,$$
$$P_{t|t-1} = F_t P_{t-1|t-1} F_t^{T} + Q_t.$$
This estimate is corrected in the next iteration (frame) after calculating the KF gain $K_t$ using the following equation:
$$K_t = P_{t|t-1} H_t^{T} \left( H_t P_{t|t-1} H_t^{T} + R_t \right)^{-1},$$
where $R_t$ is the observation noise covariance matrix. Finally, the system corrects the state vector $X$ and the covariance matrix $P$ using the following KF update equations:
$$\hat{X}_{t|t} = \hat{X}_{t|t-1} + K_t \left( z_t - H_t \hat{X}_{t|t-1} \right),$$
$$P_{t|t} = \left( I - K_t H_t \right) P_{t|t-1}.$$
These two phases are applied for all detected objects over time to update the motion model for each hazard object.
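A minimal per-object tracker along these lines can be built with OpenCV’s cv2.KalmanFilter using a constant-velocity model; the time step and noise covariances below are illustrative assumptions rather than the values used in the paper.

```python
# Constant-velocity Kalman filter for one tracked object (state [x, y, vx, vy],
# measurement [x, y]). Noise covariances here are illustrative, not tuned.
import numpy as np
import cv2

def create_object_tracker(cx, cy, dt=1.0):
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, dt, 0],
                                    [0, 1, 0, dt],
                                    [0, 0, 1,  0],
                                    [0, 0, 0,  1]], dtype=np.float32)  # F
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], dtype=np.float32)  # H
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2            # Q
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1        # R
    kf.statePost = np.array([[cx], [cy], [0], [0]], dtype=np.float32)
    return kf

# Per frame, for each object:
#   predicted = kf.predict()                                  # prediction phase
#   kf.correct(np.array([[mx], [my]], dtype=np.float32))      # update phase
```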
As Kalman filtering consists only of matrix and vector operations, from simple vector addition to matrix inversion, we believe it can run in real-time applications. However, the performance of the KF is highly dependent on the processing unit used. In the proposed work, we present technology that runs on a wearable device to track moving objects. Since the object tracker is used to determine the motion model for each detected object, it is possible to skip some frames of detection and tracking if the process would otherwise slow down the hazard detection phase.
One of the well-known problems when tracking multiple objects simultaneously is deciding which detection corresponds to which object. To track multiple objects at the same time, the system uses the Hungarian algorithm to find the best assignments between detected and estimated measurements [
46]. Initially, the system defines a tracker instance for each detected object. The tracker object includes a KF and other motion features history for each identified object. The Hungarian algorithm is one of the best optimisation algorithms used to solve the assignment problem in polynomial time [
46]. The algorithm also keeps track of all missing and new detections to maintain tracking consistency and efficiency.
In object tracking problems, the goal of the Hungarian algorithm is to find the assignment with the lowest cost between detections and tracks. The cost, in this case, is the Euclidean distance between these two sets of variables. Each time the system detects new objects, the multi-object tracking algorithm updates its state to include the new and existing objects using Algorithm 1.
Algorithm 1: Multi-object tracking update procedure.
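A minimal sketch of this assignment step is given below, using SciPy’s linear_sum_assignment implementation of the Hungarian algorithm with a Euclidean-distance cost; the 50-pixel gating distance is an illustrative assumption, not a value from the paper.

```python
# Hungarian-algorithm assignment between predicted track centres and detected
# object centres, with Euclidean distance as the cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_detections_to_tracks(track_centres, detection_centres, max_distance=50.0):
    """Return (matches, unmatched_track_ids, unmatched_detection_ids)."""
    if len(track_centres) == 0 or len(detection_centres) == 0:
        return [], list(range(len(track_centres))), list(range(len(detection_centres)))
    tracks = np.asarray(track_centres, dtype=float)      # shape (T, 2)
    dets = np.asarray(detection_centres, dtype=float)    # shape (D, 2)
    cost = np.linalg.norm(tracks[:, None, :] - dets[None, :, :], axis=2)  # (T, D)
    rows, cols = linear_sum_assignment(cost)             # optimal assignment
    matches = []
    matched_tracks, matched_dets = set(), set()
    for r, c in zip(rows, cols):
        if cost[r, c] <= max_distance:                   # gate implausible matches
            matches.append((r, c))
            matched_tracks.add(r)
            matched_dets.add(c)
    unmatched_tracks = [t for t in range(len(tracks)) if t not in matched_tracks]
    unmatched_dets = [d for d in range(len(dets)) if d not in matched_dets]
    return matches, unmatched_tracks, unmatched_dets
```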
At this stage, all the detected objects have been tracked and our system can now determine the type, position and speed for each one of them. Objects with low type confidence were filtered out to reduce false alarms and increase the system’s reliability. In the next step, the motion features for each tracked object are extracted to create a hazard profile and prepare these features for the classification stage.
3.4. Motion Feature Extraction
The purpose of this step is to collect information about how each object is behaving while it is in the user’s environment. From the detection stage, the system recognises the type of the detected object and the confidence of that recognition and saves this information into a global feature array.
The tracker accesses the same information to add the following object features (a minimal sketch of such a feature record follows the list):
Age: a feature that represents the appearance duration (number of frames) for the tracked object.
Current and estimated next location. This information is important to distinguish between moving objects and static obstacles.
Speed (pixels/second).
Motion direction.
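A minimal sketch of such a per-object feature record is shown below; the field names are hypothetical and chosen only to mirror the features listed above, not taken from the paper’s implementation.

```python
# Hypothetical per-object feature record; field names are illustrative only.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class HazardFeatures:
    object_type: int                 # detector class id (e.g., 15 = person)
    type_confidence: float           # detection confidence in [0, 1]
    age: int                         # number of frames the object has been tracked
    location: Tuple[int, int]        # current centre of mass (x, y), in pixels
    next_location: Tuple[int, int]   # Kalman-predicted centre for the next frame
    speed: float                     # pixels per second
    direction: float                 # motion direction, e.g. in degrees
```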
Figure 7 shows sample motion features for one of our test videos, where the features were extracted over two consecutive frames. The moving object (type 15: person) is moving towards the camera. As shown, the detector recognised the object type as a person, and the tracker estimated the speed and direction of that person.
3.5. Hazard Classification Using Machine Learning
The purpose of this stage is to classify the detected hazards and decide which object has a higher priority for notifying the user. As mentioned in the user requirements section, the needs and challenges of VIP differ in terms of object type, motion type and other physical features. Generating feedback for every detected hazard would not be useful and could be perceived as annoying. For these reasons, we grouped the VIP choices into the five hazard classes described in Section 3.
Figure 8 shows a visual example for these classes.
These classes are based on a questionnaire in which we asked a group of visually impaired participants about the hazardous situations they face while navigating. In addition, we asked them to classify predefined hazard classes to estimate the most dangerous conditions. Based on the questionnaire results and group consultation, we defined these hazard classes.
5. System Evaluation and Output
This work is part of a larger project for developing a user-centred, wearable assistive device for people with visual field defects. In this paper, we presented an assistive technology for people with peripheral vision loss. Therefore, we analysed the performance of the hazard detection and classification subsystems and evaluated the feedback generation module based on users’ recommendations. We implemented the proposed system on the Moverio BT-200 smart glasses, which capture 15 FPS video; the average processing time of a single frame on the glasses is 0.49 s. Based on this, the glasses can process roughly two frames per second (1/0.49 s ≈ 2 FPS), which means that we have to reduce the input frame rate to guarantee real-time feedback generation. Thus, the glasses process one out of every seven captured frames without affecting the overall detection accuracy, which is considered sufficient for the purposes of this paper.
The evaluation presented in this paper was performed on a MacBook laptop (2.7 GHz Intel Core i5 processor, 8 GB RAM; Cupertino, CA, USA), which was able to process the high-resolution CamVid videos (30 FPS) at an average of 0.2160 s per frame and the videos captured by the Moverio BT-200 smart glasses at an average of 0.1932 s per frame.
A three-layer NN model was created with seven inputs to the input layer representing the detection and motion features: object type, detection-type confidence, object age, object location, object speed, and motion direction. The output layer has five nodes representing the five hazard classes. For each detected object, the classifier decides its hazard class based on its motion features. Some objects may change their class over time depending on how they move around the user. For each object, its class is determined in every frame in which the system detects it.
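As a concrete illustration, the sketch below uses a scikit-learn multi-layer perceptron as a stand-in for the paper’s neural network; the training-data file names are hypothetical, the expansion of the location into x and y components is one plausible way to reach seven inputs, and the 19 hidden neurons follow the configuration reported later in this section.

```python
# Sketch of the hazard classifier as a scikit-learn MLP (a stand-in, not the
# paper's actual implementation). Assumed feature vector per object:
# [object_type, type_confidence, age, location_x, location_y, speed, direction]
import numpy as np
from sklearn.neural_network import MLPClassifier

X_train = np.load("hazard_features.npy")   # hypothetical training features
y_train = np.load("hazard_classes.npy")    # labels in {1, 2, 3, 4, 5}

clf = MLPClassifier(hidden_layer_sizes=(19,), max_iter=1000, random_state=0)
clf.fit(X_train, y_train)

def classify_hazard(feature_vector):
    """Return the predicted hazard class (1 = lowest danger, 5 = highest)."""
    return int(clf.predict(np.asarray(feature_vector).reshape(1, -1))[0])
```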
The average MSE for each of the ten experiments is calculated to evaluate the performance for a specific number of hidden neurons using the following equation:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( p_i - o_i \right)^2,$$
where $p_i$ is the predicted value, $o_i$ is the observed value and $N$ is the total number of values.
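For reference, this evaluation measure can be computed directly as follows:

```python
# Direct implementation of the MSE formula above.
import numpy as np

def mean_squared_error(predicted, observed):
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return np.mean((predicted - observed) ** 2)
```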
The datasets described in Section 4.2 are used to evaluate the classification model. A total of 3536 samples are used. The best NN configuration, 19 hidden neurons with a decision threshold of 0.3, provides the lowest False Positive Rate (FPR) and the highest True Positive Rate (TPR) across all the hazard classes: a TPR of 90% with an FPR of 7%. An average False Negative Rate (FNR) of 13% is achieved.
In this system, the TPR represents the rate of correctly detected, tracked and classified hazards relative to the total number of classifications. The FPR represents the rate of falsely providing a hazard notification (false alarms). The FNR represents the rate of cases the system reported as not dangerous when, in fact, they were. The average MSE value for the five classes is 8.7655%.
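Assuming the usual confusion-matrix notation (TP, FP, TN, FN), these rates correspond to the standard definitions:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{FNR} = \frac{FN}{FN + TP}.$$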
Figure 11 shows the Receiver Operating Characteristic (ROC) curves for the five classes. These results are encouraging for a first version of the system; they could be improved in the future by using additional features to model the hazard classification process.
Regression analysis was applied to the classification results to understand the relationship between the predicted hazard class (dependent variable) and the extracted features (independent variables). The best testing coefficient of determination was R = 0.72, meaning that the regression model can predict the hazard class reasonably well.
These results are promising and could be used to determine hazard classes in real-time applications to help people with impaired vision in their daily activities.
Figure 12 shows some examples of the feedback generation stage. Column (a) shows the captured frame; the red outline overlapping the image marks the seeing area of a patient with the severe glaucoma condition shown in Figure 5. Column (b) shows the results of the hazard detection and tracking modules; yellow lines give an approximate illustration of the user’s navigation route. These results are fed to the NN classifier to determine the hazard class. The result of the classification module is displayed in column (c).
In the first row, two objects are detected and tracked: a stationary man (class 1) and a moving woman (class 2). Since class 2 is more dangerous than class 1, and the user is unable to see either of them, the system generates a green arrow pointing to the left side (relative to the reference cross symbol in the centre of the healthy area). In the second row, two new objects are detected to the right side of the user: a stationary train (class 3) and a moving car (class 2). Although class 3 has a higher priority, the system ignores it because the user can see this object (2b) and thus generates a notification for a class 2 hazard. In this case, the notification is for the moving woman rather than the car, since she is closer to the user. It is worth mentioning that the wrong object type (train instead of a wall) is due to a false detection.
In the third row, the system detected a stationary car in the user’s navigation route (class 3) and a person moving towards the user (class 4). Since class 4 has a higher priority than class 3, and the user can see the class 3 hazard in the seeing area (3b), a notification is generated for the class 4 hazard as an orange arrow pointing to the right side. Class 5 (highest priority) is detected in the fourth-row example as two cars coming towards the user; the system generates one red arrow pointing to the left side, towards the closer car. Finally, a class 3 hazard notification is shown in the fifth-row example, where a bicycle is detected in the user’s navigation route. Although the user can see this object, the system produces a simple notification (a cross symbol in light orange) without any arrows, telling the user to keep looking straight ahead because there are no other dangerous hazards in the scene.
Currently, the implemented system generates a single notification for the highest hazard level as a visual notification (arrows with different colours reflecting the hazard type and pointing to the object direction). A participant study was conducted with a group of patients with different visual field defects to explore their preferences, suggestions and opinions about the notification style, frequency and other presentation features.
After describing the project’s idea and design, basic demographic information and the visual impairment history of the participants were collected. We found that 100% of the participants use portable devices such as the iPhone and iPad, and only 20% of them use these devices for navigation. Regarding the feedback format, the participants preferred either visual notifications (33%) or a hybrid style (66%) [visual and vibration (75%), visual and beeps (25%)]. The participants tried the Moverio BT-200 smart glasses with basic notifications in an indoor environment. Due to ethical approval constraints, we have not yet been able to perform outdoor experiments with the proposed system. Therefore, we presented the basic system concept (a single notification for the highest hazard level) to the participants and collected their feedback through a questionnaire; the results will be analysed, used in the ongoing development, and published in an extended study in the near future. In general, the participants responded very positively to the technology and were keen to try more features soon.
6. Conclusions
This paper presents a novel context-aware hazard attention system to be used on smart glasses to help people who suffer from peripheral visual field defects. The system includes hazard detection and recognition, hazard tracking and real-time hazard classification modules. Based on the detected motion features, the system assigns a hazard type class for each detected threat in order to generate a suitable visual notification output.
The main goal of this system is to increase the user’s awareness of the surrounding environment without interfering with their healthy vision. Unlike other obstacle avoidance and navigation systems, this system is directed at people who have partially healthy vision. Our system uses this healthy vision and augments it with new, meaningful and smart notifications that appear only when necessary. The system has been tested on both publicly available and new private datasets. The classification stage shows promising results, and the system can correctly classify detected hazards into one of five predefined hazard classes.
Through our research collaboration with the Department of Health Services Research, we created our own dataset for hazard detection and classification. Using Epson’s Moverio BT-200 smart glasses, we captured indoor and outdoor videos, and an expert labelled each of them with one of the five hazard classes. These classes were discussed with a visual field loss patient group.
We held a group meeting with the aforementioned patient group to discuss the idea of the proposed technology and to gather their requirements for the feedback generation stage, based on their personal experience. The participants tried the glasses, discussed the project’s idea and stages, and gave us their feedback; some of them agreed to take part in real-world experiments in the future.
Our next research directions will focus on adding more features for better hazard model representation and personalising the notification style and position using the user’s visual field test results. In addition, we will use some of the state-of-the-art smart glasses with the newest technologies in the field.