Article

An Assistive Model for the Visually Impaired Integrating the Domains of IoT, Blockchain and Deep Learning

by Shruti Jadon *,†, Saisamarth Taluri, Sakshi Birthi, Sanjana Mahesh, Sankalp Kumar, Sai Shruthi Shashidhar and Prasad B. Honnavalli
Department of Computer Science and Engineering, PES University, Bengaluru 560085, India
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Symmetry 2023, 15(9), 1627; https://doi.org/10.3390/sym15091627
Submission received: 24 July 2023 / Revised: 17 August 2023 / Accepted: 21 August 2023 / Published: 23 August 2023
(This article belongs to the Section Computer)

Abstract:
Internet of Things, blockchain and deep learning are emerging technologies that have recently gained popularity due to their various benefits and applications. All three domains have had success independently in various applications such as automation, agriculture, travel, finance, image recognition, speech recognition, and many others. This paper proposes an efficient, lightweight, and user-friendly solution to help visually impaired individuals navigate their way by taking advantage of modern technologies. The proposed method involves the usage of a camera lens attached to a Raspberry Pi device to capture live video frames of the user's environment, which are then transmitted to cloud storage. The link to access these images is stored within a symmetrical private blockchain network (no superior access), where all deep learning servers act as nodes. The deep learning model deployed on these servers analyses the video frames to detect objects and feeds the output back to the cloud service. Ultimately, the user receives audio notifications about obstacles through an earphone plugged into the Raspberry Pi. In particular, when running the model on a high-performing network and an RTX 3090 GPU, the average obstacle notification time is reported to be within 2 s, highlighting the proposed system's responsiveness and effectiveness in aiding visually impaired individuals.

1. Introduction

Individuals with visual impairments face challenges, often relying on alternative senses, such as auditory and tactile, to effectively explore and navigate their environment. Walking sticks and guide dogs offer temporary assistance to the visually impaired, but their coverage area is limited. Walking sticks prove effective mainly in familiar environments, where objects are static and located at short distances from one another. However, in the presence of dynamic obstacles such as vehicles, animals, or distant objects, walking sticks fail to serve their intended purpose.
A lot of research has been conducted to overcome the limitations of walking sticks. One such example is the use of a combination of different sensors to detect objects and inform the visually impaired. The approach proposed by Satam et al. [1] involves the use of three ultrasonic sensors, one in front of the stick and two on each side to detect objects from almost every side. When an object is detected, it sounds a buzzer alarm to the user. Although this solution helps in detecting objects that the walking stick cannot, as it covers more distance than a walking stick, it still fails when objects are large and dynamic. B. Zuria [2] presents a camera-based wearable system that aims to offer descriptive information on potential obstacles located in front of the user. The proposed system employs haptic feedback as a means of summarising environmental information for the user.
To overcome the above drawbacks, this paper proposes a multi-domain technological solution that replicates the function of the human eye. The solution aims to provide a simple, user-friendly, portable, and wearable IoT device that scans the surrounding environment to detect obstacles by integrating the domains of IoT, cloud, blockchain, and deep learning. It involves utilizing a camera lens possessing a wide-angle field of view spanning 155 degrees, which helps capture the surroundings. The authors developed a deep learning model to analyse the captured images, identifying obstacle positions as left, right, or centre, with special attention given to obstacles at closer distances. The user is then alerted about their surroundings through audio cues. Additionally, the solution integrates blockchain to address the system's privacy and security implications. Testing the prototype of the proposed model showed that it can efficiently detect objects at distances greater than 2 feet from the user.
This paper delves into the detailed design, implementation, and evaluation of the proposed system. Through detailed analysis and comparative studies, the authors aim to demonstrate the effectiveness and practicality of this novel solution, shedding light on its potential implications and avenues for future research.
The subsequent sections of this paper are structured as follows: Section 2 provides the background of various technical domains involved in the proposed system. Section 3 presents the related work conducted by other researchers on the problem statement or the technology used. Section 4 provides a detailed explanation of the proposed solution. In Section 5, the results obtained while testing the prototype are discussed. Section 6 lists some possibilities where further research can be conducted, and the paper is concluded in Section 7.

2. Background

This section highlights the fundamental aspects integrated into the research presented in this paper: IoT, cloud, blockchain, and deep learning. The advent of the internet has led to massive developments in the world of technology and has changed the everyday lives of people. Blockchain, IoT, edge computing, and cloud computing are some of the areas that have emerged as the internet evolved.

2.1. Internet of Things

The internet is now integral to households, with the world’s future hinging on connectivity and security. The Internet of Things (IoT) is one of the most promising domains that helps in improving the customer experience and aims to make efficient use of resources. It has many applications in sectors such as agriculture, environment, smart living, automation, and healthcare.
IoT refers to the interconnection of physical 'things', e.g., sensors and embedded software, which have unique identifiers and can share data across networks. For a device to be classified as an IoT device, it must satisfy certain constraints: small memory size, low processing power, low cost, and small physical size. IoT devices represent the state of things, and this state and its surroundings can be changed as needed. Changes to the environment can be made from a distance, such as turning on the air conditioning 30 min before someone reaches their home. IoT offers many such applications that aid in leading a comfortable life.
IoT constitutes a network of intelligent entities, establishing a global framework for both wired and wireless communication. This integration introduces the concept of smart elements, ranging from cities to devices, each contributing to a more connected ecosystem [3]. These entities possess the capacity to function autonomously, responding in real-time to various demands and facilitating applications that cater to end-users directly. An essential facet of IoT is automation, a transformative element with the potential to revolutionize the digital landscape [4,5]. Additionally, IoT’s connection to cloud systems for data processing amplifies its reach, enabling the extraction of insights from the information garnered by IoT devices [6]. This extensive application spans multiple domains, fostering industrial advancements and charting the course toward an interconnected and intelligent global paradigm [7].

2.2. Cloud

Cloud computing represents a paradigm in computing that facilitates ubiquitous and instantaneous access to storage, processing capabilities, networking infrastructure, development tools, and deployment resources. There are typically three types of cloud service models:
  • Infrastructure as a Service (IaaS): a cloud model where storage, network, servers, and virtualisation are provided.
  • Platform as a Service (PaaS): the cloud model offers developers a comprehensive platform for the creation and deployment of applications.
  • Software as a Service (SaaS): the cloud model offers users a fully configurable application as a service, allowing them to tailor it to their specific requirements.
Based on their requirements, users can choose from three deployment models: public, private, or hybrid clouds. Private cloud refers to a setup where infrastructure is exclusive to one enterprise/organisation. In contrast, a public cloud setup stores data from multiple organisations in a shared yet isolated environment. A hybrid cloud model, on the other hand, allows organisations to deploy applications in a private or a public environment and transition between them. It is common practice to host critical applications within a private cloud infrastructure, while noncritical applications are typically deployed in a public cloud environment [8]. Cloud services come with a set of advantages such as scalability, elasticity, measured service (pay-as-you-go) and reliability which prove to be useful for certain applications [9]. The utilization of cloud services in conjunction with the Internet of Things (IoT) has witnessed substantial enhancements in both its security and performance aspects [10].

2.3. Blockchain

Blockchain is an immutable, decentralised, and distributed database designed to facilitate seamless transaction recording within a network [11]. Transactions are consolidated into blocks, each consisting of a header and a body. The header comprises essential information, including the hash of the previous block, metadata such as timestamp, and the root of a hash tree, known as the Merkle root [12]. The Merkle root represents the topmost hash in a hierarchical structure called a Merkle tree, constructed by iteratively hashing pairs of data until a single root hash is generated. This root hash, which represents the entire dataset, is propagated to subsequent blocks, creating a chain-like linkage between them, as shown in Figure 1. Several fundamental components contribute to the architecture of a blockchain, including:
  • Distributed ledger: blockchain uses a distributed ledger, characterised as a decentralised database, to record and store all transactions across multiple nodes or computers. Each participating node retains an exact replica of the ledger, ensuring transparency and consistency of data [13].
  • Immutable records: upon being recorded on the blockchain, transactions acquire immutability and resistance to tampering. It becomes impossible to alter or delete these transactions, thereby safeguarding the integrity and enabling reliable auditability of the data [11,14].
  • Smart contracts: smart contracts refer to self-executing lines of code that enforce and automate the terms of an agreement between nodes on the blockchain [15]. These contracts are verified and executed automatically through a computer network, eliminating the need for intermediaries and ensuring a reliable and transparent execution of contractual obligations [14,16].
  • Consensus mechanism: the consensus mechanism is a standardised approach employed by the nodes within a blockchain network that are responsible for managing the blockchain and storing transaction records. It enables these nodes to reach a dependable consensus concerning the present state of the data within the blockchain network [17].
In existing client-server technology, there is a high risk of failure and loss of data due to a centralised structure. In applications that demand real-time and time-sensitive operations, such an outcome is not tolerable. Hence, blockchain proves to be a valuable asset in such applications, as it has a decentralised structure by enforcing distributed ledger technology. The various systems are connected to the blockchain as nodes. Upon the addition of a new block to the blockchain, any node within the network is able to perform the processing load. In the event that a node fails, the remaining nodes can independently continue processing operations. As subsequent blocks are added to the blockchain, the processing load is distributed among the functioning nodes. This mechanism guarantees uninterrupted functionality of the system, even in the presence of node failures [18].
As the ledger is distributed across all nodes on the blockchain, it ensures data transparency between nodes, as well as trust and reliability of the data. Records within the blockchain are inherently immutable, which means that participants are prohibited from altering or modifying transactions once they have been recorded in the shared ledger. This characteristic ensures the security and privacy of sensitive data, as it prevents unauthorised tampering or changes to the recorded information [19].
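To make the Merkle root construction described above concrete, the following minimal Python sketch hashes a list of transactions pairwise until a single root hash remains, duplicating the last node when a level has an odd count (as Bitcoin-style Merkle trees commonly do). It is an illustrative example rather than the exact scheme used by any particular blockchain.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transactions: list[bytes]) -> bytes:
    """Compute a Merkle root by repeatedly hashing pairs of nodes."""
    if not transactions:
        raise ValueError("at least one transaction is required")
    # Leaf level: hash every transaction.
    level = [sha256(tx) for tx in transactions]
    while len(level) > 1:
        # Duplicate the last hash when the level has an odd number of nodes.
        if len(level) % 2 == 1:
            level.append(level[-1])
        # Parent level: hash the concatenation of each pair.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


if __name__ == "__main__":
    txs = [b"tx1", b"tx2", b"tx3", b"tx4"]
    print("Merkle root:", merkle_root(txs).hex())
```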

2.4. Deep Learning

In the expansive domain of artificial intelligence (AI), machine learning (ML) is a pivotal technique, providing machines the capability to automatically learn from data and improve their performance based on experience. Deep learning (DL) is a subfield of machine learning that utilizes neural networks to process and analyze vast amounts of data. Neural networks, inspired by the structures of neurons in the human brain, facilitate the modeling of complex patterns and relationships in the data. This approach has gained significant popularity and proven to be highly effective within the broader domain of machine learning [20]. Deep learning finds extensive applications in various domains, including image classification, object recognition, speech recognition, language translation, personality analysis, and numerous other tasks [21]. Deep neural networks (DNNs) represent a specific category of artificial neural networks characterised by multiple layers, each layer comprising numerous artificial “neurons”. These neurons connect to every neuron in the subsequent layer through specialized computational pathways. This layered architecture allows DNNs to perform complex computations and learn hierarchical representations of data [22].
Object detection is a technique that allows an AI system to identify and categorise objects in an image [23]. Through the use of deep learning techniques, particularly convolutional neural networks (CNN), such systems can locate, identify, and label objects. This process involves training a model on annotated images with labels indicating the objects within them [24]. CNNs are specifically designed to analyse visual data and have shown remarkable performance in object-detection tasks. The deep learning algorithm then uses these tags as filters to identify objects in new images. This method is efficient and superior because it does not require human intervention and can be automated at a large scale.
Monocular depth estimation is a process that entails determining the distance between an object and a camera by comparing it with other objects depicted in an image [25]. In the traditional approach, the task of monocular depth estimation relied on geometric calculations that took into account various factors, such as the apparent size and distance of an object from the camera [26]. However, these calculations are often complex and occasionally unreliable. As a result, deep learning algorithms have emerged as a viable alternative for monocular depth estimation [27]. The utilisation of deep learning algorithms is advantageous, as they exhibit exceptional proficiency in quickly identifying patterns, surpassing human capabilities in this regard. Moreover, deep learning algorithms possess the ability to rapidly learn from mistakes, obviating the need for iterative trial-and-error processes typically employed by humans [28].
Having seen the background of various domains, the paper now proceeds to provide insights to the related work conducted in each domain. The next section of the paper gives a brief description of the literature survey conducted on related papers.

3. Related Work

Extensive research has been conducted on the domains of IoT, blockchain, and deep learning, resulting in significant advancements in these fields. Several pertinent studies have been examined to progress further, and their details are outlined below.
In their study, G. C. Mallikarjuna et al. [29] present a hardware configuration comprising a Raspberry Pi 3, a speaker, and a Raspberry Pi Camera to capture images from the surroundings. The system employs neural networks for object detection and relays the output to the user through audio using the speaker, and is specifically trained to recognise four objects: bus, car, cat, and person. However, there are limitations within the proposed system. Firstly, it focuses solely on the detection of a limited set of four objects. Second, it exclusively performs object detection and lacks consideration for factors such as the object’s position or depth information.
F. Ashiq et al. [30] propose a system that combines various components to create a comprehensive solution. This system incorporates a Raspberry Pi digital signal processing (DSP) board, a GSM module, a GPS module, headphones, and a camera. The DSP board is responsible for capturing real-time video footage from the camera. This video feed is then transmitted to an object detection and recognition module, which utilizes a convolutional neural network (CNN) model. The CNN model analyzes the video frames and predicts the objects present in them. The names of these objects are then passed to a text-to-speech converter module (SAPI), which pronounces the object names through the headphones. Additionally, the system keeps track of the user’s exact location and stores this information in a server database.
A. Shah et al. [31] propose an embedded IoT system with sensors and actuators attached to the shoe of visually impaired individuals, employing computer vision techniques to detect and avoid obstacles by providing haptic feedback to users. It also provides smartphone-based voice assistance. M. A. Rahman and M. S. Sadi. [32] introduce a system that aims to assist individuals with visual impairments in real-time, whether they are indoors or outdoors. The system’s primary focus is on object recognition and it utilizes advanced technology to provide audio alerts to the user. By employing four laser sensors, the system can effectively detect objects, while leveraging the single-shot detector (SSD) model for accurate and efficient object recognition. Through this approach, the system is able to identify different objects and convey relevant information to visually impaired users via audio messages.
S. Durgadevi et al. [33] propose the concept of a smart shoe that uses IoT technology to help visually impaired and older people. R. S. Krishnan et al. [34] propose the development of monitoring systems for visually impaired individuals with the help of IoT devices and cloud services such as SMS and GPS to help visually impaired users call for help in the case of an emergency at the push of a button. In the work proposed by S. Durgadevi et al. [33] and R. S. Krishnan et al. [34], various IoT sensors such as ultrasonic, PIR, MEMS, flame sensors and soil moisture sensors are used to detect obstacles in the way of visually impaired individuals and notify them via output sensors such as vibration, buzzer, and audio outputs.
M. Trent et al. [35] propose a method of a wearable device that aids navigation. It uses iBeacons which are single-celled and use Bluetooth low energy for tracking the location of the user. A band of ultrasonic sensors to be tied around the waist detects obstacles. The output is given to the user in an audio format. C. Dragne et al. [36] propose an assistive mechatronic system that uses cameras and LIDAR sensors for object detection. The results provide a foundation for further development of a mechatronic system for visually impaired individuals using multiple sensors, cameras, and the implementation of machine learning.
The research mentioned above and various other solutions devised to help visually impaired people depend on IoT technology primarily to collect input data. Some of them optionally use deep learning techniques for better identification of the obstacles and to produce outputs more precisely by providing additional information such as distance, direction, etc., to the user.
Some of the existing work has also used cloud services for storage, sending messages about the location (using GPS) to caretakers via SMS. The proposed paper presents a novel approach that combines the concepts of blockchain, IoT, deep learning, and cloud computing to enhance privacy, security, and reliability in handling sensitive user data, as well as to mitigate the risks associated with a single point of failure when running deep learning models.

4. Proposed Method

The following section of this research paper presents a comprehensive explanation for addressing the proposed method, aiming to provide readers with a detailed understanding of the approach taken. The subsequent sections delve into various aspects, including the underlying assumptions, system architecture, hardware components, deep learning techniques, and the integration of blockchain technology. Additionally, the algorithms utilised in the proposed method are briefly discussed, giving readers a glimpse into the computational processes underlying the approach without delving into technical details.

4.1. Assumptions and Justifications

This research paper is anchored on several fundamental assumptions which support its investigative approach and ensuing results. It is crucial to recognize and adhere to these assumptions for the successful deployment and operation of the proposed system. Failing to adhere to one or more of these assumptions may result in diminished performance, intermittent disruptions, or a complete halt of the system’s functionality. To provide clarity and facilitate understanding, the assumptions and their respective implications are elaborated in the following points:
  • Illumination: the model presupposes a consistently well-lit environment for the user.
    Justification: the proposed model leverages a combination of object detection and relative depth estimation. While object detection exhibits a degree of robustness to varying lighting conditions, the relative depth estimation significantly relies on the interaction of light and shadows. An adequate and consistent lighting environment is paramount to ensure the accurate interpretation of object depths based on these light-shadow dynamics.
    Implication of Violation: insufficient lighting can lead to severe misinterpretations by the depth estimation module. Even if objects are correctly identified, their relative positions might be inaccurately gauged. For instance, an object situated further away might be incorrectly inferred as being closer, leading to potentially grave inaccuracies in real-world applications.
  • Internet Connectivity: the proposed model necessitates uninterrupted internet connectivity for its proper functioning.
    Justification: various components of the system, including the Raspberry Pi, Firebase Storage, and the blockchain network comprised of GPU compute units, are interdependent and rely on a robust internet connection to communicate and function cohesively. Communication is critical, especially when uploading images and transmitting output inferences back to the user.
    Implication of Violation: in scenarios of slow or unstable internet [37], there could be significant delays in system responses due to the time taken to upload images and relay output inferences, potentially undermining the real-time requirements of the model. A complete loss of internet connectivity would result in the full failure of the proposed system, rendering it non-operational.
  • Power Source: the model assumes a consistent power supply from a battery-operated source to drive the Raspberry Pi device for its entire operational span.
    Justification: the Raspberry Pi module serves as the primary point of operation for the user, making the entirety of the proposed system contingent upon its consistent and proper functioning. A reliable power source is indispensable to ensure that the Raspberry Pi remains operational, thereby ensuring the stability of the overall system.
    Implication of Violation: a subpar or unstable power supply can expose the Raspberry Pi to erratic behaviors such as unexpected shutdowns or restarts. These disruptions jeopardize the proposed system’s performance and responsiveness. In scenarios of a complete power outage, the entire system will come to an abrupt halt, rendering it entirely non-functional.
  • Environment Density: the model is optimized for deployment in less crowded environments where objects are relatively spaced out.
    Justification: in highly populated areas, such as bustling streets or busy shopping centers, the sheer volume and proximity of objects pose a significant challenge. The proposed system could potentially be overwhelmed by the multitude of objects present, which would compromise its ability to provide precise inferences.
    Implication of Violation: in dense environments, while the system might accurately notify the user of an object to their right, it might concurrently overlook or deprioritize objects on the left that are slightly farther away. Such inaccuracies could lead to incomplete or misleading feedback for the user. The proposed model excels and delivers the most accurate results in environments where objects are less densely packed and reasonably distant from one another.

4.2. System Architecture

This section of the paper presents a detailed description of the system architecture that has been developed based on the analysis of research conducted in the various domains. Central to the system’s real-time data acquisition is the IoT component, designed with the visually impaired user’s needs at its core. The user equips a Raspberry Pi device with an attached camera module, which captures the environment in front of the user at 20 frames per second. This live feed is broken down into continuous image frames, providing a dynamic and real-time depiction of the user’s immediate surroundings that are promptly uploaded to Firebase cloud storage. The rate at which image frames are uploaded is dependent on the internet speed and connectivity of the Raspberry Pi device. Once the image is uploaded to Firebase storage, a unique URL for each image is generated. Subsequently, the name of the image, along with its associated URL, is pushed to the Firebase Realtime Database.
The integrity and security of the links to the stored video frames are fortified using a blockchain component. The link of the image is retrieved from the Firebase Realtime Database and is then placed as a transaction in the private blockchain network. An integral aspect of this process is the verification of the transaction: every node in the blockchain network verifies the transaction using proof of authority as the consensus mechanism. This not only ensures the authenticity of the data but also decentralizes its access.
Every node in the blockchain network is equipped with a GPU Compute resource which contains the proposed deep learning algorithm. For further processing, any node can retrieve the link of the image from the blockchain network. Once obtained, the image is downloaded using the link and is then subjected to the proposed multifaceted deep learning algorithm, a combination of YoloV7 [38] for object detection, MiDaS Swin2 Large [39] for depth estimation, and a novel algorithm specifically designed to retrieve the closest object and identify its type.
Once the inference is complete, the resultant textual information, namely a description of the detected object and its position relative to the user, is pushed back to the Firestore Database. The data from this database are then accessed by the Raspberry Pi, which retrieves the textual data and converts it to an auditory response. The user, equipped with an audio device, e.g., headphones, receives detailed auditory cues rather than generic beeps or alerts.
In summation, the proposed system as discussed above and illustrated in Figure 2 incorporates cutting-edge technologies, leveraging the real-time capabilities of IoT, the security and decentralization of blockchain, and the analytical prowess of deep learning. The authors aim to provide an efficient architecture designed specifically to assist the visually impaired.

4.3. Hardware Requirements

In the realm of computer science and associated studies, hardware plays a central and critical role. This section provides a comprehensive exploration of the various hardware components used by the proposed system, shedding light on their role in the system.
  • Raspberry Pi 4 Model B
    Figure 3 represents the Raspberry Pi 4 Model B, featuring a Broadcom BCM2711 chip. It is a single-board computer that incorporates a quad-core Cortex-A72 central processing unit (CPU) based on the ARM v8 architecture. The CPU operates at a clock rate of 1.5 GHz, providing efficient and capable processing power for various applications. Its operating temperature lies between 0 and 50 degrees Celsius. It has a micro-SD card slot to load the operating system and provides 16 GB of data storage. It also comes with a camera serial interface connector and an audio jack to plug in the audio device. All these features of the Raspberry Pi device make it suitable for solving the problem at hand.
  • Camera Lens
    Figure 4 presents the Arducam MINI OV5647, an imaging device featuring a wide-angle camera module characterised by a horizontal field of view (HFOV) of 155 degrees. The camera module possesses a focal length of 1.3 mm and operates at a peak current of 300 mA. It has a maximum frame rate of 30 frames per second (FPS) and comes with an IR filter. The Arducam MINI OV5647 is connected to the Raspberry Pi computer board via a flat ribbon cable. The connection is established through the 15-pin MIPI Camera Serial Interface (CSI) connector. The selection of this specific lens is attributed to its wider field of view, which allows it to approximate the visual range of the human eye to a certain extent.
  • Wired Earphones
    The wired earphones are connected to the audio jack of the Raspberry Pi to send alerts to the user about obstacles [40,41]. This is illustrated in Figure 5. The schematic representation of the logical design of the hardware of the proposed model is illustrated in Figure 6. It can be observed that the camera lens is connected to the Raspberry Pi via the CSI connector, while the earphones are connected to the audio jack. The Raspberry Pi is then powered on to function using a power supply. The red LED light indicates that the Raspberry Pi is on. Continuous red light indicates that the Raspberry Pi has enough power supply to function, and blinking red light indicates that the power supply is not sufficient.
    The above-mentioned components are used by the authors in the proposed model. Figure 5 shows the actual setup comprising the hardware components that the authors used to test the model and its responsiveness. In conclusion, this section highlights the critical hardware components and specifications necessary to achieve optimal performance and functionality.

4.4. Modules

This section introduces the different areas used by the proposed model and further describes them by explaining the input and output of the various modules. The work proposed in this paper deals with three major areas: image capturing and output feed; privacy and security; and object detection, classification, and relative depth estimation. The technical details of these modules are elaborated on in the following subsections.

4.4.1. Image Capturing and Output Feed: IoT and Cloud

The problem at hand necessitates the transmission of the images of the surroundings to the cloud and the subsequent delivery of the processed results from the cloud back to the user in an audio format, all facilitated by a compact, portable device. These functions can be suitably carried out by an IoT device: the Raspberry Pi 4 Model B. Another requirement to enhance the capability of the model is the utilisation of a camera lens that possesses an expanded field of view. This enables the deep learning model to detect objects that may pose obstacles to the user, even if they are not in the user's direct line of sight.
Figure 7 provides an overview of the control flow in the model. The camera lens, with a horizontal field of view of 155 degrees, is connected to the Raspberry Pi through the camera serial interface (CSI). It captures video frames and these images are subsequently resized to a smaller size to facilitate faster transfer to the cloud [42]. These resized images are then transmitted to Firebase Storage, a cloud-based storage service, as depicted in Figure 8.
Figure 8 shows that each image link is stored in a real-time database. The blockchain module then takes over to perform further processing. Upon the completion of the deep learning model’s output transmission to the Firestore database, a change in the document snapshot is detected, prompting the Raspberry Pi to retrieve and return the corresponding result from the Firestore. The generated output is delivered to the user in an audio format, as illustrated in Figure 7.
Although the frames of the video are reduced to a smaller size in the IoT device, when frames are saved at such a fast rate, they occupy memory in the Raspberry Pi device as well. To avoid the memory of Raspberry Pi being utilised completely, the images are deleted from the device as soon as they are sent to the cloud, since they are not used again by the device. The final output received by the user is of the following format: [Object] detected at your [Position]. For example, an output inference could say “Chair detected at your left”.
Algorithm 1 elaborates the process of taking the input of a real-time video feed from the camera lens and sending it over to the Firebase Storage and the metadata to the real-time database. The algorithm initializes Firebase and creates a PiCamera object. It then loops forever, capturing a frame and saving it locally. The image is then smoothed and uploaded to Firebase Storage. The URL of the uploaded image is then retrieved and pushed to Firebase Realtime Database. The local copy of the image is then removed and the captured frame buffer is cleared. If an error occurs, the camera is closed.
Algorithm 2 first initializes the Firebase app and Firestore client. This allows the algorithm to access Firebase’s cloud database. Next, it creates a thread synchronization event to notify the main thread when the document changes. Then, it sets up a listener to watch the document. This listener will be called whenever the document changes. When the listener is called, it will retrieve the result from the document and set the callback. Finally, the audio output is given to the user.
Algorithm 1: Image Capture and Upload to Firebase
1. Initialise the Firebase app, storage bucket, and Realtime Database reference
2. Create a PiCamera object
3. Loop: capture a video frame and save it locally
4. Smooth and resize the image, then upload it to Firebase Storage
5. Retrieve the URL of the uploaded image and push it to the Firebase Realtime Database
6. Remove the local copy of the image and clear the captured frame buffer
7. On error, close the camera
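A minimal Python sketch corresponding to Algorithm 1 is shown below. It assumes the firebase_admin SDK, the legacy picamera library, and placeholder credential, bucket, and database names (serviceAccount.json, <project-id>.appspot.com, and a Realtime Database URL); the authors' actual implementation details may differ.

```python
import os
import time

import firebase_admin
from firebase_admin import credentials, db, storage
from picamera import PiCamera

# Initialise Firebase with placeholder credentials (assumed names).
cred = credentials.Certificate("serviceAccount.json")
firebase_admin.initialize_app(cred, {
    "storageBucket": "<project-id>.appspot.com",
    "databaseURL": "https://<project-id>-default-rtdb.firebaseio.com/",
})
bucket = storage.bucket()
images_ref = db.reference("images")

camera = PiCamera(resolution=(640, 480), framerate=20)  # reduced frame size for faster uploads

try:
    while True:
        name = f"frame_{int(time.time() * 1000)}.jpg"
        local_path = f"/tmp/{name}"
        camera.capture(local_path)                # capture one frame and save it locally

        blob = bucket.blob(name)                  # upload the frame to Firebase Storage
        blob.upload_from_filename(local_path)
        blob.make_public()

        # Push the image name and its URL to the Realtime Database.
        images_ref.push({"name": name, "url": blob.public_url})

        os.remove(local_path)                     # free local memory on the Raspberry Pi
except Exception:
    camera.close()                                # close the camera if an error occurs
```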
Algorithm 2: Result retrieval from Firestore
1. Initialise the Firebase app and Firestore client
2. Create a thread synchronization event to notify the main thread
3. Set up a listener to watch the document
4. On change to the document, retrieve the result and set the callback
5. Give the audio output to the user
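A corresponding sketch of Algorithm 2 is given below. It assumes firebase_admin's Firestore client, a hypothetical results/latest document path written by the deep learning servers, and pyttsx3 for on-device text-to-speech; these names are illustrative, not the authors' exact choices.

```python
import threading

import firebase_admin
from firebase_admin import credentials, firestore
import pyttsx3

cred = credentials.Certificate("serviceAccount.json")   # assumed credential file
firebase_admin.initialize_app(cred)
client = firestore.client()

engine = pyttsx3.init()                  # local text-to-speech engine
callback_done = threading.Event()        # notifies the main thread of changes


def on_snapshot(doc_snapshot, changes, read_time):
    # Called whenever the watched document changes.
    for doc in doc_snapshot:
        result = doc.to_dict().get("result", "")
        if result:                       # e.g. "Chair detected at your left"
            engine.say(result)
            engine.runAndWait()
    callback_done.set()


# Watch the document that the deep learning servers write their output to.
doc_ref = client.collection("results").document("latest")
watch = doc_ref.on_snapshot(on_snapshot)

callback_done.wait()                     # keep the main thread alive until a result arrives
```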

4.4.2. Privacy and Security: Blockchain

In the previous section, the authors convey how the IoT configuration is actively acquiring real-time video frames. The captured images encompass sensitive user data, including location information and depictions of other individuals present within the environment. A significant challenge encountered within this system pertains to determining a secure and efficient method for storing such sensitive information.
Blockchain is a technology that can facilitate fine-grained access control through smart contracts [43]. This ensures that only authorized entities have access to specific data, improving data privacy and reducing the risk of unauthorized data usage or breaches [44]. Hence, the authors proposed the concept of a private blockchain network for the purpose of storing the captured images.
However, storing images directly in a blockchain network is expensive: data are replicated across all nodes, transaction fees for large data uploads are high, and the accumulation of large amounts of data causes rapid and excessive growth of the blockchain's size, which can impact network performance and resource requirements. This excessive growth is referred to as "Blockchain Bloat".
In response to this challenge, the authors decided to address it by utilizing a secure storage space for images, coupled with the integration of a blockchain network to ensure streamlined and secure access control to the data. Therefore, the authors adopted a two-fold approach: images were stored in the cloud using Firebase for secure storage, and a private blockchain network was employed to record the links to these images as transactions within the network.
Blockchain also provides immutability, guaranteeing that once the image link data are recorded on the blockchain, they cannot be altered without consensus from the network. Furthermore, should any image become corrupted, tampered with, or be removed from the Firebase storage, the hash link associated with the image will no longer correspond to the one stored within the blockchain network. As a result, such an alteration can be readily detected. This ensures an additional level of security.
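As a simple illustration of this integrity check, the sketch below (with hypothetical helper names) downloads an image from its stored URL, recomputes its SHA-256 digest, and compares it with the digest recorded on the blockchain; a mismatch indicates that the stored image has been tampered with or replaced.

```python
import hashlib

import requests


def image_digest(url: str) -> str:
    """Download the image at `url` and return its SHA-256 digest (hex)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return hashlib.sha256(response.content).hexdigest()


def verify_image(url: str, digest_on_chain: str) -> bool:
    """Return True if the stored image still matches the digest recorded on the blockchain."""
    return image_digest(url) == digest_on_chain


# Example usage (hypothetical values):
# ok = verify_image("https://storage.googleapis.com/<bucket>/frame_001.jpg", "ab3f...")
# if not ok: flag the frame as tampered with or missing
```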
In real-time systems such as the proposed project, continuous data acquisition is crucial for maintaining up-to-date and accurate information. Having a single point of failure, where data collection or distribution relies on a sole component, poses a significant risk. If that component fails, the entire system can collapse, leading to data gaps, delays, or disruptions.
To ensure uninterrupted data flow, blockchain provides decentralization by distributing data and control among multiple authorized participants [45,46]. Further, identical copies of the blockchain ledger are maintained across multiple nodes. This redundancy ensures that if one node fails or experiences issues, others can seamlessly take over, preserving data integrity and continuous operation.
The current framework employs Firebase for image storage within the proposed system. Nonetheless, should the need arise to transition to a different storage platform, the existing setup remains adaptable due to the blockchain’s role as an abstraction layer, ensuring portability. The process involves using the link of the new platform for image storage onto the blockchain network with minimal alterations to the initial code. This streamlined approach effectively diminishes the degree of interdependence among system components, mitigating tight coupling.
A suitable consensus mechanism has to be chosen for the proposed private blockchain network. The consensus mechanism is used to validate the nodes that are added to the blockchain network. There are two commonly used consensus mechanisms [47]:
  • Proof of Work (PoW)
    In the present consensus algorithm, the integration of fresh blocks into the blockchain network is achieved by mining, which involves resolving intricate computational challenges to add a legitimate block to the blockchain. The determination of a block’s validity is contingent upon the longest sequence of blocks, thus resolving any potential conflicts that may arise.
  • Proof of Stake (PoS)
    This mechanism depends on cryptocurrency holders who pledge their coins as collateral in order to participate in the block validation process, whereby a subset of them is chosen randomly to serve as validators. Cryptocurrency refers to a form of digital or virtual currency utilised to ensure the security of transactions conducted on the blockchain. Only when a consensus is reached among multiple validators regarding the accuracy of a transaction is it deemed valid and allowed to proceed.
For the given real-time model, instant processing and transferring of data is required. PoW requires high energy and is computationally expensive, so it is not suitable for the proposed model. PoS eliminates the requirement of high energy and computational power, but in this mechanism [48], although two or more validators might stake the same amount of coins, the value of stake differs depending on their individual assets.
To address this constraint, a modified version of Proof of Stake (PoS) called the Proof of Authority (PoA) consensus mechanism has been employed. PoA functions by assigning significance to the validator’s identity rather than a stake associated with monetary value [49,50].
In the private blockchain network model, the servers that run the deep learning model act as nodes in the blockchain as depicted in Figure 9. A new deep learning server can join as a node of the blockchain only after being approved by the network administrator. Nodes can add blocks to the blockchain once it is validated using the PoA consensus mechanism.
In the proposed work, the private blockchain network is deployed using Ganache, a local blockchain development tool. Ganache provides a fast and convenient environment for testing and deploying Ethereum smart contracts. It offers a personal Ethereum network for developers, allowing them to simulate various blockchain behaviors without the need for a public network. Smart contracts are utilized within this setup to automate and enforce predefined agreements, offering a secure and efficient decentralized application environment.
Algorithm 3 provides a general overview of the process of retrieving the image link from the cloud-based Firebase Realtime Database and storing it on the blockchain through a smart contract in the private network. The deep learning servers continuously retrieve the latest image link from the blockchain network using a reference to the same smart contract that was used when uploading the link. The corresponding procedure is shown in Algorithm 4.
The nodes within the blockchain are computational units that run the deep learning algorithms of the proposed model. These algorithms process images and produce textual results, which will be further elaborated in the subsequent section.
Algorithm 3: Image link retrieval from Firebase and Upload to Blockchain
1. Initialise the Firebase app and get a reference to the Realtime Database
2. Initiate connection with the Private Blockchain Network
3. Get a reference to the Smart Contract using its address
4. Retrieve the latest image link from the Realtime Database
5. Store the image link on the blockchain as a transaction through the Smart Contract
Algorithm 4: Retrieval of image link from Blockchain by Deep learning server
1. Initiate connection with the Private Blockchain Network
2. Get a reference to the Smart Contract using its address
3. Retrieve the last pending block data using the Smart Contract reference
4. Send the image data to the Deep learning server for further processing
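A condensed Python sketch of Algorithms 3 and 4 using web3.py against a local Ganache instance is shown below. The contract address, ABI, and function names (storeLink, getLatestLink) are placeholders for the authors' actual smart contract and must be replaced with the deployed values.

```python
from web3 import Web3

# Connect to the private (Ganache) blockchain network.
w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))
account = w3.eth.accounts[0]                 # an unlocked Ganache account

CONTRACT_ADDRESS = "0x..."                   # placeholder: deployed contract address
CONTRACT_ABI = [...]                         # placeholder: ABI produced at compile time
contract = w3.eth.contract(address=CONTRACT_ADDRESS, abi=CONTRACT_ABI)


def upload_link(image_url: str) -> None:
    """Algorithm 3 (upload side): record an image link as a blockchain transaction."""
    tx_hash = contract.functions.storeLink(image_url).transact({"from": account})
    w3.eth.wait_for_transaction_receipt(tx_hash)


def fetch_latest_link() -> str:
    """Algorithm 4: a deep learning node reads the most recently stored image link."""
    return contract.functions.getLatestLink().call()
```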

4.4.3. Object Detection and Relative Depth Estimation: Deep Learning

The present study focuses on exploring object detection and relative depth estimation using a deep neural network architecture. The study investigates the use of an object detection framework based on the widely recognised and popular convolutional neural network (CNN) [51] model known as YOLOv7. This real-time object detection system [38] represents an enhanced version of the original YOLO (You Only Look Once) system, initially introduced in 2015. YOLOv7 has been specifically designed to identify objects quickly and precisely in both images and videos. By performing a single pass through the input data, it efficiently predicts bounding boxes and class probabilities for the detected objects. Furthermore, YOLOv7 exhibits efficiency and scalability, having been trained on a substantial dataset comprising annotated images from the MS COCO dataset [52]. The selection of YOLOv7 for this study stems from its advantageous properties; in particular, it exhibits a significantly reduced inference time in comparison to other object detection models, e.g., RCNN, Faster RCNN, and SPP-Net. This characteristic renders YOLOv7 highly suitable for real-time applications, including the requirements of this study.
To identify the closest identifiable objects, the research incorporates a monocular depth estimation model referred to as MiDaS (Monocular Depth Scaling). MiDaS is an approach that enables the estimation of object depths in an image using a single camera. The foundation of MiDaS lies in the concept of utilising a CNN to predict the depth of objects by analysing their visual appearance, as well as the patterns of light and shadow within the scene. MiDaS was trained on ten distinct datasets: ReDWeb, DIML, Movies, MegaDepth, WSVD, TartanAir, HRWSI, ApolloScape, BlendedMVS, and IRS. To optimise its performance, a multi-objective optimisation technique was employed. In order to strike a balance between time constraints and accuracy, the authors adopted MiDaS v3.1 Swin2 Large.
The identification of objects within the image is facilitated by the YOLO system, as shown in Figure 10, which detects objects and constructs the corresponding bounding boxes. Concurrently, the relative depth map is extracted from the input image. The determination of the nearest object involves a comparison of the depth values across the bounding boxes associated with each object. The step-by-step procedure to perform depth estimation on an input image retrieved from the Firebase cloud and acquiring the extracted depth map through the MiDaS model is elucidated in Algorithm 5.
Algorithm 5: Depth Estimation
Input: Image from Firebase Cloud
Output: Relative Depthmap of Input Image
1. Read the image and convert the colour space from BGR to RGB
2. Load MiDaS from TorchHub and move the model to the GPU if available
3. Load the transforms to resize and normalise the image for the model
4. Load the image and apply the transform
5. Predict the depth map and save the image
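A Python sketch of Algorithm 5 is given below. It loads MiDaS from TorchHub following the repository's published usage pattern; the exact entry-point and transform names for the Swin2 Large variant are assumptions and may need adjusting.

```python
import cv2
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load MiDaS and its input transforms from TorchHub (entry-point names assumed).
midas = torch.hub.load("intel-isl/MiDaS", "DPT_SwinV2_L_384").to(device).eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.swin384_transform      # assumed transform matching the SwinV2-L 384 model


def estimate_depth(image_path: str):
    """Return a relative depth map (larger value = closer) for the given image."""
    img = cv2.imread(image_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)        # convert BGR to RGB
    batch = transform(img).to(device)                 # resize and normalise for the model
    with torch.no_grad():
        prediction = midas(batch)
        # Resize the prediction back to the original image resolution.
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze()
    return prediction.cpu().numpy()


# Example usage:
# depth_map = estimate_depth("frame_001.jpg")
# cv2.imwrite("depth.png", cv2.normalize(depth_map, None, 0, 255, cv2.NORM_MINMAX).astype("uint8"))
```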
Algorithm 6 elaborates on the process of evaluating the closest obstacle and its corresponding class by leveraging the object detection and depth map obtained from Algorithm 5. To maintain accuracy, object detections with confidence levels below 30% are ignored due to the likelihood of distant or partially visible objects. The position of the nearest object is categorised into one of three regions situated in front of the user, namely, left, middle, or right. The resulting output encompasses the appropriate region and the class of the object.
Algorithm 6: Nearest Object Detection
Input: Image from Firebase Cloud and Relative Depth Map
Output: Nearest Object Type with Location
1. Load the YOLOv7 model and yolov7 weights from TorchHub
2. Read the image and convert the colour space from BGR to RGB
3. Obtain the result dataframe and filter out objects with less than 0.3 confidence
4. If the result is empty (no detectable objects in the image), return an empty string
5. Obtain the bounding box coordinates for each object in the result
6. Obtain the depth map values within each object's bounding box from the Depth Estimation Algorithm
7. Store the maximum depth map value for each object
8. Compare the maximum values and select the highest (closest object)
9. Divide the image frame into 3 regions: left/middle/right
10. Select the region where the closest object is detected
11. Return the closest object class and region
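The sketch below follows Algorithm 6, loading YOLOv7 through its TorchHub custom entry point (an assumption about the published interface) and combining its detections with the depth map from the previous sketch; estimate_depth refers to the hypothetical helper defined there.

```python
import cv2
import torch

# Load YOLOv7 with pretrained weights via TorchHub (entry point assumed).
yolo = torch.hub.load("WongKinYiu/yolov7", "custom", "yolov7.pt", trust_repo=True)

REGIONS = ("left", "middle", "right")


def nearest_object(image_path: str, depth_map) -> str:
    img = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    detections = yolo(img).pandas().xyxy[0]                  # one row per detected object
    detections = detections[detections["confidence"] >= 0.3] # drop low-confidence detections
    if detections.empty:
        return ""                                            # no detectable objects

    best_name, best_depth, best_cx = None, -1.0, 0.0
    for _, det in detections.iterrows():
        x1, y1, x2, y2 = map(int, (det.xmin, det.ymin, det.xmax, det.ymax))
        # In a MiDaS relative depth map, larger values correspond to closer objects.
        box_depth = float(depth_map[y1:y2, x1:x2].max())
        if box_depth > best_depth:
            best_name, best_depth = det["name"], box_depth
            best_cx = (x1 + x2) / 2.0

    # Divide the frame into three equal vertical regions: left / middle / right.
    region = REGIONS[min(int(best_cx / (img.shape[1] / 3)), 2)]
    return f"{best_name} detected at your {region}"


# Example usage:
# depth = estimate_depth("frame_001.jpg")
# print(nearest_object("frame_001.jpg", depth))
```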

4.5. Accessibility and Cost

The proposed model utilizes certain hardware components, specifically the Raspberry Pi 4 Model B and the camera lens Arducam MINI OV5647, both of which are readily available in electronics stores. The Raspberry Pi, in particular, has gained prominence in numerous IoT-related projects, making it easily accessible from a multitude of vendors. As for the camera lens, while the Arducam MINI OV5647 is recommended, any lens with a wide field of view compatible with the Raspberry Pi can serve as an appropriate substitute. Other essential components include wired earphones for audio feedback and a power bank to power the Raspberry Pi device. Both of these are commonly available in standard retail outlets.
Prioritizing user-friendliness and portability in the prototype design, the choice of lightweight and low-profile components such as the Raspberry Pi board and camera lens is deliberate. To extend the device’s operational duration, it can be coupled with a power bank. This design ensures that users experience minimal burden and maintain flexibility while utilizing the prototype.
In terms of expenses, the Raspberry Pi board is priced between 35 USD and 60 USD, with variations based on geographic location. A compatible wide-angle lens for the Raspberry Pi can be acquired for approximately 30 USD, leading to a combined cost ranging from 70 to 100 USD. If the model moves towards mass production, a production-ready model would cost less due to bulk purchasing.

5. Results and Discussion

This section presents the comprehensive results derived from the experiments conducted on object and depth estimation models. This study extensively documents the datasets used for benchmarking, the performance evaluation of various models, and the specific error metrics employed for comparison. Additionally, experiments that were performed using different camera lenses connected to the IoT device are included, as well as observations recorded in different environmental settings.

5.1. Datasets

Within the realm of object detection, the COCO dataset holds a prominent status as a standard benchmark for assessing the performance of different object detection models. This dataset encompasses an extensive assortment of images containing diverse objects characterised by varying sizes and shapes. Consequently, it serves as an ideal dataset for evaluating the efficacy of object detection models. Employing the COCO dataset as a benchmark allows for a standardised and objective evaluation of the accuracy and effectiveness of these models.
With regards to depth estimation, NYU Depth v2 [53] is a popular dataset that comprises more than 1449 RGB-D indoor images captured from various indoor scenes, with pixel-level depth annotation for each image. Benchmarking on NYU Depth v2 is a common practice in the domain of computer vision for evaluating and comparing different methods of depth estimation. Another popular dataset for depth estimation is the KITTI Vision Benchmark Suite [54], which contains more than 4000 images with accompanying LIDAR point clouds, GPS/IMU data, and 3D bounding box annotations for objects commonly found on the streets.

5.2. Comparative Analysis

This section is dedicated to the quantitative evaluation of the accuracy of object and depth estimation models, employing specific error metrics. Object detection was measured using Mean Average Precision (mAP), while depth estimation used zero-shot error. These metrics provide reliable indicators of the model’s ability to accurately estimate object boundaries and depth information. Furthermore, this section includes an analysis of the inference latency of the models, which assesses their computational efficiency. Taking into account these quantitative measures and examining the inference latency, a comprehensive evaluation of the performance, accuracy, and efficiency of various models was conducted.

5.2.1. Mean Average Precision

The evaluation of different object detection models can be conducted effectively by comparing their Mean Average Precision (mAP) scores. Calculating the mAP involves the construction of a precision-recall curve by systematically varying the confidence threshold used for model predictions. At each threshold, the precision and recall values are determined for each class and these values are then utilised to compute the Average Precision (AP) which can be determined by measuring the area under precision-recall curve. Finally, the AP values for all classes are averaged, resulting in the mAP score. By considering the trade-off between these two measures, mAP offers a comprehensive evaluation of the performance of object detection or image classification models across multiple classes. Higher mAP values are indicative of higher model performance, highlighting greater accuracy and effectiveness in object detection tasks.
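To make the metric concrete, the short sketch below computes AP for one class as the area under a precision-recall curve (by trapezoidal integration over sampled points) and averages per-class APs into mAP. Real evaluators such as the COCO toolkit additionally interpolate precision and average over a range of IoU thresholds, so this is only a simplified illustration.

```python
import numpy as np


def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under a precision-recall curve for one class (sorted by recall)."""
    order = np.argsort(recall)
    return float(np.trapz(precision[order], recall[order]))


def mean_average_precision(per_class_curves: dict) -> float:
    """mAP: the mean of per-class AP values."""
    aps = [average_precision(r, p) for r, p in per_class_curves.values()]
    return float(np.mean(aps))


# Toy example with two classes (recall, precision arrays per class):
curves = {
    "person": (np.array([0.0, 0.5, 1.0]), np.array([1.0, 0.8, 0.6])),
    "chair":  (np.array([0.0, 0.5, 1.0]), np.array([1.0, 0.7, 0.5])),
}
print(f"mAP = {mean_average_precision(curves):.3f}")
```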
By analyzing the mAP values presented in Table 1, it becomes evident that the YOLOv5ns model demonstrates the least-accurate performance, while the YOLOv7-X model exhibits the highest performance among the tested models. However, considering only the accuracy metric is not sufficient to select the most suitable model, as the proposed approach requires real-time inference and limited computing resources further contribute to the decision-making process when choosing a specific model. These considerations will be further discussed in subsequent sections to ensure an appropriate selection that balances both accuracy and computational efficiency.

5.2.2. Zero-Shot Error

In the process of evaluating and comparing depth estimation models such as MiDaS, zero-shot error is a popular metric. Zero-shot error is commonly used to determine the error rate of a model when presented with a new class or category of data on which it has not been trained explicitly. When the zero-shot error is compared across different models, valuable insights can be obtained regarding their capability to generalise and perform effectively on novel data. This metric aids in gauging the robustness and adaptability of the models to handle diverse and previously unseen scenarios in depth-estimation tasks.
Lower values of the zero-shot error indicate better performance of the depth estimation models on unseen images. By examining the values presented in Table 2, it appears that the v3.1 family of models outperforms the older versions of MiDaS in terms of their ability to handle novel data. However, it is essential to take into account the trade-off between speed and performance when selecting a final model. This consideration will be further discussed in the upcoming section, as it is crucial to balance the accuracy of depth estimation with the computational efficiency required for the proposed approach in the paper.

5.2.3. Inference Latency

This section focuses on comparing the inference latency of the proposed system, using different object detection models and depth estimation models on the RTX 3090 GPU. Figure 11 illustrates the mAP 50:95 by latency in milliseconds for the YOLOv5 and YOLOv7 models. Among the models examined, v5l, v5x, and the v7 family of models exhibit favourable mAP scores. However, when considering latency, the v7 models demonstrate better performance. In particular, the v7 model strikes a favourable balance between accuracy and latency. Consequently, the proposed solution presented in the paper employs YOLOv7 as its object detection framework. By leveraging this model, the system aims to achieve a desirable combination of accuracy and inference speed for its real-time application.
In Figure 12, the relative performance of various selected depth estimation models is compared to the v2.1 Large model with respect to the latency in milliseconds. It is apparent that the v3.1 BEiT models exhibit subpar performance when compared to the other models. In particular, the v3.1 BEiT Large model is excluded from Figure 12 due to its excessively high latency. Among the remaining v3.1 models, the Swin models demonstrate a better balance of accuracy and latency than the Next-ViT model. Consequently, the proposed method in the paper has opted to use the Swin2 Large model, as it achieves the highest relative performance while keeping latency levels low.
By selecting the Swin2 Large model, the proposed method aims to strike a favourable trade-off between accuracy and inference speed in the context of depth estimation.
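The authors' exact benchmarking setup is not described, but inference latency of this kind is typically measured by timing repeated forward passes after a warm-up phase. The sketch below shows one such measurement for an arbitrary PyTorch model loaded earlier (e.g., the midas or yolo objects from the previous sketches); the input shape and run counts are illustrative.

```python
import time

import numpy as np
import torch


def measure_latency_ms(model, example_input, warmup: int = 10, runs: int = 100) -> float:
    """Median per-inference latency in milliseconds for a PyTorch model."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):              # warm-up: exclude one-off startup costs
            model(example_input)
        if torch.cuda.is_available():
            torch.cuda.synchronize()         # ensure queued GPU work is finished
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            model(example_input)
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            timings.append((time.perf_counter() - start) * 1000.0)
    return float(np.median(timings))


# Example usage (hypothetical): latency of the depth model on a 384x384 input batch.
# dummy = torch.randn(1, 3, 384, 384).to("cuda")
# print(f"median latency: {measure_latency_ms(midas, dummy):.1f} ms")
```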

5.3. Object Detection—YOLOv7

Based on mAP tests conducted on various object detection models and comparisons of inference latencies, the authors chose to proceed with YOLOv7 due to its optimal balance between mAP and latency. This section delves into a thorough presentation and analysis of the results, derived from an extensive evaluation conducted on YOLOv7. The assessment of the performance of such a model requires an exploration into several metrics, particularly emphasizing precision, recall, and accuracy. Through their examination of these metrics across varying IoU thresholds, object detection areas, and numbers of detections, the paper seeks to furnish readers with a nuanced and wide-ranging perspective on the capabilities and efficiency of YOLOv7.

5.3.1. Precision Analysis

In the realm of object detection, precision is a vital measure of how well a model identifies true objects while avoiding mistakes. Precision is the fraction of all detections made by the model that are correct. A model with high precision accurately spots targets and makes few false identifications. Therefore, when evaluating any object detection system, it is crucial to consider its precision, as it indicates the system's reliability in distinguishing correct objects from irrelevant ones.
When the model's performance is observed over a range of IoU thresholds spanning 0.50 to 0.95, the Average Precision (AP) stands at 0.512. This metric considers all object areas and provides a broad view of the model's precision across a variety of overlap criteria between predicted and ground-truth bounding boxes. When the IoU threshold is fixed at 0.50, the AP rises notably to 0.697, indicating that the model exhibits enhanced precision when the required overlap between ground-truth and predicted bounding boxes is more lenient. However, as expected in object detection, when the IoU criterion is tightened to 0.75, the AP contracts to 0.555. This decrement underscores the inherent challenge of maintaining precision as the demand for exact bounding box overlaps grows.
Beyond the IoU threshold, the size of the detected objects plays a pivotal role in precision. The model exhibits a discernible struggle with small objects, reflected in a relatively modest AP of 0.352; this could be attributed to the fine details of smaller objects, which are harder to detect precisely. Medium-sized objects are detected with better precision, with an AP of 0.559, and the model performs best on large objects, achieving an AP of 0.667. This suggests that the model is better at recognizing objects of larger dimensions, possibly due to their more pronounced features and lower ambiguity in detection.
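As a simplified illustration of how precision at a fixed IoU threshold can be derived from raw detections, the sketch below greedily matches score-ranked predictions to ground-truth boxes. The official COCO evaluation (pycocotools) is considerably more involved, averaging over classes, IoU thresholds, and object-area ranges, so this is only a conceptual sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_at_iou(detections, ground_truths, thr=0.5):
    """Precision = TP / (TP + FP); each ground-truth box can be matched only once."""
    matched, tp = set(), 0
    for det in sorted(detections, key=lambda d: -d["score"]):
        candidates = [(iou(det["box"], gt), i) for i, gt in enumerate(ground_truths)
                      if i not in matched]
        best_iou, best_i = max(candidates, default=(0.0, None))
        if best_i is not None and best_iou >= thr:
            matched.add(best_i)
            tp += 1
    return tp / max(len(detections), 1)
```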

5.3.2. Recall Analysis

When evaluating the performance of object detection models, recall stands as an imperative metric that evaluates a model’s capacity to discern and identify all pertinent objects within a given image. Essentially, it quantifies the proportion of true positives relative to the sum of true positives and false negatives. A high recall value signifies that the model adeptly recognizes the vast majority of relevant objects, ensuring minimal omissions.
Considering an IoU threshold that ranges from 0.50 to 0.95 and accounting for only a single detection, the Average Recall (AR) clocks in at 0.384. This might appear modest, but as the model is allowed more leeway in detections, increasing the number to 10, the AR witnesses a substantial leap to 0.638. This suggests that the model’s recall potential expands with increased detection allowances. This observation is further solidified when one perceives the AR value of 0.688 upon considering up to 100 detections. The upward trajectory of AR with increased detections is a testament to the model’s latent capacity to recognize and retrieve relevant instances, provided it is granted more detection opportunities.
Similarly to precision, recall varies with the size of the detected objects. Small objects, with an AR of 0.537, show a decent but not optimal recall rate, indicating possible difficulty in capturing all smaller instances within images. Medium objects display a commendable recall, with an AR of 0.735, while large objects achieve the highest AR of 0.838. This consistently strong performance reiterates the model's predisposition to effectively recall larger objects.
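The dependence of recall on the detection budget can be illustrated with the same matching scheme, reusing the iou helper from the previous sketch; COCO's average recall additionally averages over IoU thresholds from 0.50 to 0.95, which is omitted here for brevity.

```python
def recall_at_max_dets(detections, ground_truths, thr=0.5, max_dets=100):
    """Recall = TP / (TP + FN), keeping only the top-scoring max_dets detections."""
    kept = sorted(detections, key=lambda d: -d["score"])[:max_dets]
    matched = set()
    for det in kept:
        for i, gt in enumerate(ground_truths):
            if i not in matched and iou(det["box"], gt) >= thr:
                matched.add(i)
                break
    return len(matched) / max(len(ground_truths), 1)

# With max_dets=1 only the single most confident detection can score a hit,
# which is why AR rises as the detection allowance grows to 10 and then 100.
```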

5.3.3. Visual Analysis

In the pursuit of quantifying model performance, raw numerical outputs, though invaluable, often lack the immediate clarity that visual representations afford. While these figures give a precise quantitative overview, their interpretation can be greatly enriched by corresponding visual aids which convey intuitive insights.
At the heart of assessing the YOLOv7 object detection model, especially when trained on the COCO dataset with its extensive 80 classes, lies the confusion matrix. This essential visual tool adeptly encapsulates the instances of true positives, false positives, true negatives, and false negatives for each class. As illustrated in Figure 13, the matrix is pivotal in providing insights into YOLOv7’s capability to accurately differentiate among the myriad classes of the COCO dataset. Moreover, it emphasizes specific classes or regions where the model may benefit from further tuning or training to enhance its detection accuracy.
Concluding the visual evaluation tools is the precision-recall (PR) curve, which represents the trade-off between the model's precision and its recall. For YOLOv7 evaluated on the COCO dataset [52] with its 80 distinct classes, as presented in Figure 14, a close examination of this curve reveals the inherent trade-offs between these two pivotal metrics. Such a study can guide model fine-tuning efforts, spotlighting specific areas that might need improvement.
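A precision-recall curve of the kind shown in Figure 14 can be traced by sweeping through the score-ranked detections and recording the running precision and recall after each one. The sketch below does this for a single class at a single IoU threshold, again reusing the iou helper above, whereas Figure 14 aggregates over all 80 COCO classes.

```python
import matplotlib.pyplot as plt

def pr_curve(detections, ground_truths, thr=0.5):
    """Running precision and recall as detections are consumed in score order."""
    matched, precision, recall, tp = set(), [], [], 0
    ranked = sorted(detections, key=lambda d: -d["score"])
    for k, det in enumerate(ranked, start=1):
        hit = next((i for i, gt in enumerate(ground_truths)
                    if i not in matched and iou(det["box"], gt) >= thr), None)
        if hit is not None:
            matched.add(hit)
            tp += 1
        precision.append(tp / k)
        recall.append(tp / max(len(ground_truths), 1))
    return precision, recall

# precision, recall = pr_curve(detections, ground_truths)
# plt.plot(recall, precision); plt.xlabel("Recall"); plt.ylabel("Precision"); plt.show()
```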
In totality, these results render a comprehensive portrait of the YOLOv7 object detection model’s performance. While the model showcases a commendable aptitude, especially in the domain of larger object detection, there undeniably remains scope for refinement, particularly in precision concerning smaller objects. However, given the specific use case, these findings indicate that YOLOv7 stands as an excellent option for object detection. Its strengths and potential areas for improvement have been taken into account, and it is being utilized as a foundational component in this paper’s proposed model.

5.4. Communication Cost

The communication cost plays a significant role in the efficiency of the proposed system. This section provides a detailed breakdown of the time spent on each subsystem operation in the proposed model, assuming adequate network conditions and an RTX 3090 GPU; this breakdown can be instrumental in identifying potential bottlenecks and areas for improvement.
In analyzing the communication costs, several operations within the system exhibited significant temporal expenditure. Capturing an image and subsequently uploading it to Firebase storage took an average of 300 ± 100 ms, while the storage of the image link in Firebase’s Realtime Database followed closely at 270 ± 10 ms. Transitioning the data from the Firebase Database to the private blockchain network was similarly time-intensive, taking 300 ± 100 ms. Operations on the blockchain, such as link uploading and retrieval, were relatively swift, necessitating only 50 ± 10 ms and 100 ± 10 ms respectively. Notably, downloading the image using the GPU node was one of the lengthier processes, taking 650 ± 200 ms.
The cold starts of the YOLOv7 and MiDaS models, which are one-time loading processes, are especially notable: they took 5500 ± 500 ms and 3500 ± 500 ms, respectively, signifying substantial initialization costs. However, once initialized, the YOLOv7 and MiDaS inferences were highly efficient, requiring just 70 ± 10 ms and 45 ± 10 ms, respectively. Algorithmic text output generation was the fastest operation, taking less than 10 ms, while transmitting the output text as an audio cue through a Raspberry Pi wrapped up the sequence at 200 ± 50 ms.
It is crucial to note that the total time for processing a single image, including the cold start (i.e., the first initialization), amounts to 10,995 ± 1000 ms. Once the system is initialized, however, the effective processing time for each subsequent image reduces dramatically to 1995 ± 500 ms, or approximately 2 s. Within this effective processing time, the average time taken by the GPU compute node to generate an output inference is 125 ms, or about 8 frames per second. These results, detailed in Table 3, provide a comprehensive understanding of where time is spent in the proposed system, enabling targeted optimizations and efficiency improvements in future iterations.
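The end-to-end figures quoted above follow directly from the per-stage timings in Table 3; the short script below reproduces the arithmetic using the nominal (central) values, with the ± tolerances omitted.

```python
# Nominal per-stage timings in milliseconds, taken from Table 3.
STAGES_MS = {
    "capture_and_upload_to_firebase":   300,
    "store_link_in_realtime_database":  270,
    "database_to_private_blockchain":   300,
    "blockchain_link_upload":            50,
    "blockchain_link_retrieve":         100,
    "download_image_on_gpu_node":       650,
    "yolov7_inference":                  70,
    "midas_inference":                   45,
    "text_output_generation":            10,
    "audio_cue_via_raspberry_pi":       200,
}
COLD_START_MS = {"yolov7_model_load": 5500, "midas_model_load": 3500}

warm_total = sum(STAGES_MS.values())                    # ~1995 ms per image once warm
cold_total = warm_total + sum(COLD_START_MS.values())   # ~10,995 ms for the first image
gpu_compute = sum(STAGES_MS[k] for k in
                  ("yolov7_inference", "midas_inference", "text_output_generation"))
print(f"warm: {warm_total} ms, cold: {cold_total} ms, "
      f"GPU compute: {gpu_compute} ms (~{1000 // gpu_compute} fps)")
```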

5.5. Camera Lens Characteristics

This section provides a comprehensive analysis of the results obtained from experiments conducted using two distinct types of camera lenses. The experiments encompassed the manipulation of varying distances between the lenses and objects, as well as the evaluation of performance under low-light conditions.

5.5.1. Comparison Based on Angle of View

In comparison to a wide-angle camera lens, a standard camera lens has limited capability to capture information about the environment. In Figure 15, the standard lens fails to capture objects such as the chair and sofa in the image. In Figure 16, however, the wide-angle lens captures a significant portion of the surrounding environment, approximately 155 degrees in front of the user, thus facilitating the detection and avoidance of objects. One drawback of wide-angle lenses is the distortion that occurs at the edges of the captured image. Despite this, YOLOv7 demonstrates a high degree of robustness, as the accuracy of predicting slightly distorted objects is negligibly affected compared to undistorted objects.
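The difference between the two lenses can also be reasoned about geometrically: for a rectilinear lens, the horizontal angle of view follows from the sensor width and focal length. The values below are illustrative assumptions rather than the exact specifications of the lenses used in the paper, and very wide lenses are typically fisheye designs for which this formula underestimates the true angle of view.

```python
import math

def horizontal_fov_degrees(sensor_width_mm, focal_length_mm):
    """Horizontal angle of view of an ideal rectilinear lens."""
    return math.degrees(2 * math.atan(sensor_width_mm / (2 * focal_length_mm)))

# Illustrative values only (assumed, not taken from the paper's hardware datasheets):
print(horizontal_fov_degrees(3.68, 3.0))   # ~63 degrees: standard-lens territory
print(horizontal_fov_degrees(3.68, 0.9))   # ~128 degrees: approaching wide-angle coverage
```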

5.5.2. Distance from Camera Lens

This study determined that a minimum distance of 2 feet is necessary for the wide-angle lens to capture accurate results. Figure 17 shows an image taken 2 feet from the chair using the wide-angle camera lens, and the chair is detected successfully with a high level of accuracy and confidence.
It has been noted that at distances less than 2 feet, erroneous classifications occur due to the inclusion of partial objects within the captured image. In Figure 18, an image taken from a distance of 1 foot from the chair using a wide angle lens, the chair is incorrectly interpreted as a TV, while the air vent is misclassified as a bed, although with low confidence levels. These misclassifications highlight the challenges faced when capturing images at close proximity, as the larger field of view of the wide angle lens may lead to incomplete object representation, resulting in incorrect object identification.

5.5.3. Performance in Low-Light Conditions

The proposed system functions optimally in well-lit environments, emphasising the necessity of abundant light for an accurate prediction of the nearest object. The performance of the model is also significantly affected by adverse conditions such as water, fog, or dirt on the lens, which can result in increased inaccuracies. With respect to object detection, the proposed model exhibits a higher level of robustness across different lighting conditions, as demonstrated by the results presented in Figure 19 and Figure 20: the couch and chair are detected in both images, albeit with slightly lower confidence in Figure 20. It should also be noted that in such conditions, non-existent objects might be inferred due to the loss of detail, as visible in Figure 20, which contains an additional erroneous chair detection. Overall, YOLOv7 successfully detects objects with comparable confidences despite variations in lighting conditions.
The main factor limiting the proposed system under dimly lit conditions is monocular depth estimation, which depends heavily on light and shadow cues. This dependency becomes evident in Figure 21 and Figure 22, where the performance of the system is significantly compromised in low-light environments: certain objects, such as the chair and vent, are not clearly visible due to insufficient illumination. As a result, the accuracy and reliability of the proposed model are reduced, leading to decreased performance in identifying and localising objects.

5.6. Edge Cases

In this section of the paper, a detailed investigation is conducted to evaluate the performance of the proposed model in challenging edge cases. These edge cases are defined as scenarios that involve irrelevant objects, terrain obstacles, crowded environments and motion-induced irregularities. The objective of this analysis is to evaluate the robustness and generalisability of the models under consideration in real-world conditions.

5.6.1. Terrain Obstacles

Given that YOLOv7 is trained only on a limited number of classes, it is unable to identify general objects such as rocks, stones and debris, and consequently, in Figure 23, the protruding stone slabs, which pose significant safety risks, will not be detected.

5.6.2. Irrelevant Objects

Although the object detection confidence threshold has been set to 0.3 (30%), it is evident that in specific cases certain objects, such as the car inside the gate on the left in Figure 24, do not need to be flagged to the user, as they are not on the street and do not cause any obstruction. There may be other similar cases where partial or non-obstructing objects give erroneous notifications to the user of the proposed system.

5.6.3. Crowded Environments

In dynamic and crowded situations, such as the one shown in Figure 25, the proposed model offers limited assistance to the user: with numerous objects in close proximity, it does not help the user distinguish between the positions of the objects. Additionally, the user may experience confusion due to the high number of alerts generated by the proposed system, triggered by the large number of objects in the vicinity.

5.6.4. Sudden Change in Direction

When the user is in motion and makes sudden changes in direction or orientation, the resulting blurred images cause the proposed system to perform suboptimally, producing erroneous results from both object detection and depth estimation. As shown in Figure 26, this issue is particularly challenging, making it difficult for the proposed system to identify objects accurately. Optical image stabilisation can help mitigate the impact of blurry images, but it requires additional hardware, which increases both the cost and complexity of the proposed system.

5.6.5. Erroneous Detection

When the user is stationary and interacts with objects or people, the wide angle of the camera causes persons or objects located close to the user to be detected and announced, potentially leading to unnecessary disturbances. In Figure 27, a person standing next to the user would be detected and announced despite not posing any significant obstacle. However, it is important to consider that providing the user with awareness of their surroundings can be beneficial in certain situations. To address this issue, a form of intelligence could be incorporated into the model so that it adapts to situations in which the user is stationary.

5.7. Comparison of Various Related Systems

This section offers a comparative analysis between the proposed system and various related works. The proposed system discussed in this study offers a broader spectrum of features than its contemporaries. While many visual assistance systems are engineered to operate seamlessly in both indoor and outdoor environments, only a select few have adeptly integrated lightweight deep learning models to facilitate real-time obstacle detection on handheld devices, thus ensuring more rapid response times.
In Table 4, a comparative analysis between the methodology introduced in this study and various analogous systems is provided. The findings indicate that the proposed model offers marked advantages over other similar methodologies. Notably, the majority of existing studies exhibit limited ranges, predominantly due to their reliance on ultrasound as the sole mode of detection. A salient contribution of the present study is the enhancement of system security via the incorporation of blockchain technology. When contrasted with other methodologies, the proposed system is faster because it delegates processing tasks to external GPU servers. As previously underscored, the adoption of a wide-angle lens proves advantageous in extending the visual field in front of the user, a feature noticeably absent in many comparative studies. While ultrasound-based systems are primarily suited to object detection, they fall short in the realm of recognition; this limitation is addressed in the current study through the integration of object detection and recognition techniques. Benefiting from the blockchain infrastructure, the proposed system is scalable and accommodates new users with ease, whereas studies that rely on local processing often struggle to scale. Furthermore, the proposed model harnesses the advanced object detection algorithm YOLOv7, which boasts a superior COCO mAP@0.5 (val) of 0.69 when compared with other relevant contributions.
Additionally, the authors provide a detailed analysis, comparing several significant contributions that underscore the superior merits of the proposed system. In the study by [55], a smartphone application was developed, aiming to assist the visually impaired in navigation. Within their system, the authors employed the YOLOv3 model for object detection, which subsequently relayed its results audibly to the user through the application. Notably, YOLOv3's performance on the COCO dataset is characterized by a mean average precision (COCO mAP@0.5 (val)) of 0.545. In contrast, YOLOv7, as leveraged in the present study, achieves a mAP of 0.697 on the same dataset. Moreover, the research in [55] predominantly relies on the intrinsic processing capacities of smartphones for model execution, thereby facilitating obstacle detection. However, such an approach can be optimized further by delegating the inference tasks to external servers equipped with robust GPU capabilities. Another major limitation of the application described in [55] is its omission of obstacle positioning details. Such an oversight circumscribes its utility, as users lack directional guidance to adeptly circumvent detected obstacles.
In the study presented in [56], the authors employed TensorFlow models to develop a classifier on the Raspberry Pi board aimed at obstacle detection. Specifically, they utilized the ssdlite-mobilenet-v2-coco model to identify nine distinct objects typically found on sidewalks. To relay feedback to users, the system integrated three vibration sensors along with a speaker, offering both tactile and auditory cues. The incorporation of tactile feedback represents a novel innovation, aiding users in pinpointing the obstacle's location. However, the study in [56] bears certain limitations. The classifier in [56] was trained exclusively to recognize a set of nine obstacles commonly encountered on sidewalks; as a consequence, its applicability remains restricted to such environments, rendering it unsuitable for diverse settings. Conversely, the YOLOv7 model employed in this study has undergone training on more expansive datasets, including PASCAL VOC, ImageNet, and COCO, and can recognize 80 classes of objects. Lastly, the approach delineated in [56] employs speakers to furnish auditory feedback, a choice that may compromise audibility in noise-saturated environments and inadvertently contribute to noise pollution. To mitigate this, the current paper's proposed model advocates for the use of earphones, ensuring clear feedback delivery while preserving ambient acoustic quality.
The authors in [57] present a multifaceted system, encompassing a smart cane, a mobile application, and a web server. Within this system, the caretaker employs a large-screen smartphone equipped with a navigation application. This application offers a visual representation of the visually impaired person’s (VIP) field of view. Interaction with the application’s buttons facilitates the transmission of haptic feedback to guide the VIP. Additionally, the system incorporates vocal communication between the caretaker and the VIP. While this arrangement precludes the necessity for the caretaker to be co-located with the VIP, it does not eradicate the VIP’s reliance on the caretaker. Conversely, the model introduced in this study harnesses the capabilities of deep learning to address this challenge. By doing so, it aspires to significantly diminish, if not entirely remove, the dependence on a caretaker. The proposed model has demonstrated commendable performance in the realm of assistive navigation. Beyond removing manual intervention, this approach instills a heightened sense of confidence and autonomy within the VIP. Ultimately, this translates to a tangible enhancement in their overall quality of life.
The system proposed in [58] introduces an IoT-driven smart glasses system for the visually impaired person (VIP), structured in alignment with the OSI reference model. Employing an ultrasonic sensor alongside an ambient light sensor, the system perpetually gathers data, which is then compared against predefined threshold values, with the user subsequently being alerted of obstructions based on these comparisons. Feedback to the user is conveyed via actuators such as vibration motors and piezoelectric sensors. Additionally, in post-processing, the video feed is archived in cloud storage. The system's capabilities are limited to obstacle detection without actual object recognition. In contrast, the model delineated in this study addresses this constraint: it not only employs YOLOv7 for adept object detection but also capitalizes on the MiDaS depth estimation model to identify the nearest obstacle. The system then furnishes the user with an auditory output detailing the identified object alongside its spatial positioning. A further limitation of [58] is its sensor-based detection mechanism, which constricts its spatial coverage range. In comparison, the present study utilizes an optical camera lens endowed with a superior field of view (FOV), thereby ensuring expansive spatial coverage and enhancing its utility.

5.8. Ethical Implications and Privacy Laws

The fusion of the Internet of Things, blockchain, and deep learning technologies as presented in this paper, while innovative, also brings about potential ethical and privacy concerns which are addressed below.
Ethical considerations:
  • Informed consent: prior to the deployment of the proposed system, individuals with visual impairments will be comprehensively educated about the technological functionality, its data management processes, and potential associated risks.
  • Data collection protocols: it is imperative that any images gathered do not violate an individual’s right to privacy. Thus, the system under discussion grants users the autonomy to activate or deactivate the device as they deem fit.
  • Beneficence principle: it is posited that the system will offer genuine advantages to its users, ensuring that the derived benefits substantially surpass any prospective harm or discomfort.
Adherence to Privacy Laws:
  • Data handling and storage: the video frames capture not only the user’s surroundings but potentially other individuals who might not have provided consent. The decentralized nature of blockchain, while ensuring no superior access, will still comply with regulations such as general data protection regulation (GDPR) which give individuals a right over their personal data.
  • Data minimization practices: the requisite image data is temporarily stored on the Raspberry Pi and is immediately removed post-usage. This practice considerably mitigates potential misuse of sensitive information.
  • User data autonomy: users will have the ability to delete all their data and revoke consent at any point.
Potential concerns: while deep learning models are notably robust, there exists a potential for occasional inaccuracies. An erroneous object identification might result in an inaccurate alert or an overlooked barrier, which could misinform the user.
In conclusion, the integration of these technologies offers immense potential for enhancing the lives of visually impaired individuals. Still, it is paramount to deploy them ethically, adhering strictly to privacy regulations, and with a continuous feedback mechanism to mitigate potential concerns.

6. Future Scope

Although the proposed solution has been tested with a prototype under various conditions and has been made adaptable to most circumstances, there are still some edge cases and limitations which can be worked upon in the future to increase usefulness and practicality.
Some research opportunities include:
  • Currently, the model does not function accurately under dim or no-light conditions. Due to the dependency on lighting for depth estimation, the model fails to detect objects in dark environments. This limitation can be handled with the use of infrared sensors or other technologies.
  • The proposed model performs suboptimally in crowded areas where there are many obstacles in close proximity to the user. Further research can be conducted to make the model more efficient and accurate in such circumstances.
  • Scheduling algorithms or load balancing algorithms can be incorporated among the nodes of the blockchain network to optimise the processing load.
  • Additional features such as geolocation and navigation can be implemented to help the visually impaired travel to destinations more easily.
  • The use of a binocular (stereo) camera setup would further help in determining the absolute distance of an obstacle from the user, in addition to identifying the object itself, giving the user a more detailed description of the surroundings.
  • Face recognition can also be incorporated to help identify trustworthy individuals.
  • Language translation can also be added to improve the features of the product to help a user in a new city.
Beyond the aforementioned points, there exists significant potential for evaluating the performance of the proposed model in practical scenarios. Conducting a study based on the prototype from this paper, which involves empirical testing with multiple visually impaired individuals, would not only be interesting but also contribute substantially to the existing body of knowledge in this domain.

7. Conclusions

This paper introduces a comprehensive solution to help visually impaired individuals by integrating the domains of IoT, blockchain, and deep learning. The combination of these three domains, while relatively niche, has proven to be advantageous when applied appropriately. The paper addresses the need for an affordable yet effective solution to help visually impaired individuals navigate their surroundings. By incorporating blockchain technology, the system ensures data security, privacy, and a distributed model, thus avoiding a single point of failure. The present study analyses and compares multiple object detection and depth estimation models, evaluating their accuracy and latency with the aim of achieving a favourable equilibrium and identifying the most appropriate model. The object detection model employed is YOLOv7, which achieves a high COCO mAP@0.5 (val) of 0.69 compared with other relevant contributions. Additionally, the article highlights various edge cases that need to be considered in the chosen approach. The proposed model demonstrates practicality by providing real-time alerts to users within approximately 2 s, with a 125 ms processing time per image, assuming a robust network connection and a high-performance GPU for the deep learning server. However, it is essential to acknowledge the limitations of the system; these identified limitations provide opportunities for future research and the advancement of this field. Overall, this study serves as a stepping stone towards enhancing assistance for visually impaired individuals and provides valuable information for future studies aiming to refine and expand upon the proposed system.

Author Contributions

Conceptualization, S.J. and S.S.S.; methodology, S.J., S.B., S.M. and S.K.; software, S.T. and S.B.; validation, S.T., S.B., S.S.S. and P.B.H.; formal analysis, S.T. and S.K.; resources, S.M.; data curation, S.S.S.; writing—original draft preparation, S.T., S.B., S.M., S.K. and S.S.S.; writing—review and editing, S.J.; supervision, S.J. and P.B.H.; project administration, S.K. and P.B.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GPU: Graphical Processing Unit
GPS: Global Positioning System
URL: Uniform Resource Locator
LED: Light Emitting Diode
IoT: Internet of Things
LIDAR: Light Detection and Ranging
PIR: Passive Infrared Sensor
MEMS: Micro-Electromechanical Systems
SMS: Short Message Service
YOLO: You Only Look Once
CPU: Central Processing Unit
ARM: Advanced RISC Machine
CSI: Camera Serial Interface
PoW: Proof of Work
PoS: Proof of Stake
PoA: Proof of Authority
PASCAL VOC: PASCAL Visual Object Classes
COCO: Common Objects in Context
RCNN: Region-based Convolutional Neural Network
SPP: Spatial Pyramid Pooling
CNN: Convolutional Neural Network
DCNN: Deep Convolutional Neural Network
MiDaS: Monocular Depth Scaling
ReDWeb: Relative Depth from Web
DIML: Digital Image Media Lab RGB-D Dataset
WSVD: Web Stereo Video Dataset
HRWSI: High-Resolution Web Stereo Image
BlendedMVS: Blended Multi-View Stereo
IRS: Indoor Robotics Stereo
BRG: Blue Red Green
RGB: Red Green Blue
USD: United States Dollar
KITTI: Karlsruhe Institute of Technology and the Toyota Technological Institute
IMU: Inertial Measurement Unit
mAP: Mean Average Precision
AP: Average Precision
BEiT: Bidirectional Encoder representation from Image Transformers
IoU: Intersection over Union
AR: Average Recall
VIP: Visually Impaired Person
FOV: Field of View
GDPR: General Data Protection Regulation

References

  1. Satam, I.A.; Al-Hamadani, M.N.; Ahmed, A.H. Design and implement smart blind stick. J. Adv. Res. Dyn. Control. Syst. 2019, 11, 42–47. [Google Scholar]
  2. Bauer, Z. Monocular Depth Estimation: Datasets, Methods, and Applications. Ph.D. Thesis, University of Alicante, Alicante, Spain, 2021. [Google Scholar]
  3. Sharma, N.; Shamkuwar, M.; Singh, I. The history, present and future with IoT. In Internet of Things and Big Data Analytics for Smart Generation; Springer Nature Switzerland AG: Cham, Switzerland, 2019; pp. 27–51. [Google Scholar]
  4. Ugajin, A. Automation in Hospitals and Health Care. In Springer Handbook of Automation; Springer Nature Switzerland AG: Cham, Switzerland, 2023; pp. 1209–1233. [Google Scholar]
  5. Ko, H.S.; Eshraghi, S. Automation in Home Appliances. In Springer Handbook of Automation; Springer Nature Switzerland AG: Cham, Switzerland, 2023; pp. 1311–1330. [Google Scholar]
  6. Sohaib, O.; Lu, H.; Hussain, W. Internet of Things (IoT) in E-commerce: For people with disabilities. In Proceedings of the 2017 12th IEEE Conference on Industrial Electronics and Applications (ICIEA), Siem Reap, Cambodia, 18–20 June 2017; pp. 419–423. [Google Scholar]
  7. Cheng, S. Future Development of Industries in the Intelligent World Riding the New Wave of Technologies. In China’s Opportunities for Development in an Era of Great Global Change; Springer: Singapore, 2023; pp. 203–213. [Google Scholar]
  8. Nadeem, F. Evaluating and Ranking Cloud IaaS, PaaS and SaaS Models Based on Functional and Non-Functional Key Performance Indicators. IEEE Access 2022, 10, 63245–63257. [Google Scholar] [CrossRef]
  9. Karam, Y.; Baker, T.; Taleb-Bendiab, A. Security support for intention driven elastic cloud computing. In Proceedings of the 2012 Sixth UKSim/AMSS European Symposium on Computer Modeling and Simulation, Valletta, Malta, 14–16 November 2012; pp. 67–73. [Google Scholar]
  10. Ahmad, W.; Rasool, A.; Javed, A.R.; Baker, T.; Jalil, Z. Cyber security in iot-based cloud computing: A comprehensive survey. Electronics 2022, 11, 16. [Google Scholar] [CrossRef]
  11. Rajasekaran, A.S.; Azees, M.; Al-Turjman, F. A comprehensive survey on blockchain technology. Sustain. Energy Technol. Assessments 2022, 52, 102039. [Google Scholar] [CrossRef]
  12. de Ocáriz Borde, H.S. An Overview of Trees in Blockchain Technology: Merkle Trees and Merkle Patricia Tries; University of Cambridge: Cambridge, UK, 2022. [Google Scholar]
  13. Soltani, R.; Zaman, M.; Joshi, R.; Sampalli, S. Distributed Ledger Technologies and Their Applications: A Review. Appl. Sci. 2022, 12, 7898. [Google Scholar] [CrossRef]
  14. Altaş, H.; Dalkiliç, G.; Cabuk, U.C. Data immutability and event management via blockchain in the Internet of things. Turk. J. Electr. Eng. Comput. Sci. 2022, 30, 451–468. [Google Scholar]
  15. Hassan, A.; Ali, M.I.; Ahammed, R.; Khan, M.M.; Alsufyani, N.; Alsufyani, A. Secured insurance framework using blockchain and smart contract. Sci. Program. 2021, 2021, 6787406. [Google Scholar] [CrossRef]
  16. Taherdoost, H. Smart Contracts in Blockchain Technology: A Critical Review. Information 2023, 14, 117. [Google Scholar] [CrossRef]
  17. Borse, M.; Shendkar, P.; Undre, Y.; Mahadik, A.; Patil, R.Y. A Review of Blockchain Consensus Algorithm. In Expert Clouds and Applications, Proceedings of the ICOECA 2022, Bangalore, India, 3–4 February 2022; Springer: Singapore, 2022; pp. 415–426. [Google Scholar]
  18. Ali, A.; Pasha, M.F.; Guerrieri, A.; Guzzo, A.; Sun, X.; Saeed, A.; Hussain, A.; Fortino, G. A Novel Homomorphic Encryption and Consortium Blockchain-based Hybrid Deep Learning Model for Industrial Internet of Medical Things. IEEE Trans. Netw. Sci. Eng. 2023. [Google Scholar] [CrossRef]
  19. Yaga, D.; Mell, P.; Roby, N.; Scarfone, K. Blockchain technology overview. arXiv 2019, arXiv:1906.11078. [Google Scholar]
  20. Deng, L.; Yu, D. Deep learning: Methods and applications. Found. Trends Signal Process. 2014, 7, 197–387. [Google Scholar] [CrossRef]
  21. Sarker, I.H. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2021, 2, 420. [Google Scholar] [CrossRef] [PubMed]
  22. Benuwa, B.B.; Zhan, Y.Z.; Ghansah, B.; Wornyo, D.K.; Banaseka Kataka, F. A review of deep machine learning. Int. J. Eng. Res. Afr. 2016, 24, 124–136. [Google Scholar] [CrossRef]
  23. Xiao, Y.; Tian, Z.; Yu, J.; Zhang, Y.; Liu, S.; Du, S.; Lan, X. A review of object detection based on deep learning. Multimed. Tools Appl. 2020, 79, 23729–23791. [Google Scholar]
  24. Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef]
  25. Zhao, C.; Sun, Q.; Zhang, C.; Tang, Y.; Qian, F. Monocular depth estimation based on deep learning: An overview. Sci. China Technol. Sci. 2020, 63, 1612–1627. [Google Scholar] [CrossRef]
  26. Salma, R.F.O.; Arman, M.M. Smart parking guidance system using 360° camera and Haar-cascade classifier on IoT system. Int. J. Recent Technol. Eng. 2019, 8, 864–872. [Google Scholar]
  27. Ming, Y.; Meng, X.; Fan, C.; Yu, H. Deep learning for monocular depth estimation: A review. Neurocomputing 2021, 438, 14–33. [Google Scholar] [CrossRef]
  28. Ashok Kumar, K.; Karunakar Reddy, V.; Narmada, A. An Integration of AI, Blockchain and IoT Technologies for Combating COVID-19. In Emergent Converging Technologies and Biomedical Systems, Select Proceedings of the ETBS 2021; Springer: Singapore, 2022; pp. 173–181. [Google Scholar]
  29. Mallikarjuna, G.C.; Hajare, R.; Pavan, P. Cognitive IoT system for visually impaired: Machine learning approach. Mater. Today Proc. 2022, 49, 529–535. [Google Scholar] [CrossRef]
  30. Ashiq, F.; Asif, M.; Ahmad, M.B.; Zafar, S.; Masood, K.; Mahmood, T.; Mahmood, M.T.; Lee, I.H. CNN-based object recognition and tracking system to assist visually impaired people. IEEE Access 2022, 10, 14819–14834. [Google Scholar] [CrossRef]
  31. Shah, A.; Sharma, G.; Bhargava, L. Smart Implementation of Computer Vision and Machine Learningfor Pothole Detection. In Proceedings of the 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 28–29 January 2021; pp. 65–69. [Google Scholar]
  32. Rahman, M.A.; Sadi, M.S. IoT enabled automated object recognition for the visually impaired. Comput. Methods Programs Biomed. Update 2021, 1, 100015. [Google Scholar] [CrossRef]
  33. Durgadevi, S.; Komathi, C.; ThirupuraSundari, K.; Haresh, S.; Harishanker, A. IOT Based Assistive System for Visually Impaired and Aged People. In Proceedings of the 2022 2nd International Conference on Power Electronics & IoT Applications in Renewable Energy and its Control (PARC), Mathura, India, 21–22 January 2022; pp. 1–4. [Google Scholar]
  34. Krishnan, R.S.; Narayanan, K.L.; Murali, S.M.; Sangeetha, A.; Ram, C.R.S.; Robinson, Y.H. IoT based blind people monitoring system for visually impaired care homes. In Proceedings of the 2021 5th international conference on trends in electronics and informatics (ICOEI), Tirunelveli, India, 3–5 June 2021; pp. 505–509. [Google Scholar]
  35. Trent, M.; Abdelgawad, A.; Yelamarthi, K. A smart wearable navigation system for visually impaired. In Proceedings of the Smart Objects and Technologies for Social Good: Second International Conference, GOODTECHS 2016, Venice, Italy, 30 November–1 December 2016; Springer: Cham, Switzerland, 2017; pp. 333–341. [Google Scholar]
  36. Dragne, C.; Todiriţe, I.; Iliescu, M.; Pandelea, M. Distance Assessment by Object Detection—For Visually Impaired Assistive Mechatronic System. Appl. Sci. 2022, 12, 6342. [Google Scholar] [CrossRef]
  37. Mohanraj, I.; Siddharth, S. A framework for tracking system aiding disabilities. In Proceedings of the 2017 IEEE International Conference on Current Trends in Advanced Computing (ICCTAC), Bangalore, India, 2–3 March 2017; pp. 1–7. [Google Scholar]
  38. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  39. Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1623–1637. [Google Scholar] [CrossRef] [PubMed]
  40. Landa-Hernández, A.; Bayro-Corrochano, E. Cognitive guidance system for the blind. In Proceedings of the World Automation Congress 2012, Puerto Vallarta, Mexico, 24–28 June 2012; pp. 1–6. [Google Scholar]
  41. Lakde, C.K.; Prasad, P.S. Navigation system for visually impaired people. In Proceedings of the 2015 International Conference on Computation of Power, Energy, Information and Communication (ICCPEIC), Melmaruvathur, India, 22–23 April 2015; pp. 93–98. [Google Scholar]
  42. Ahn, H.; Lee, J.H.; Cho, H.J. Research of panoramic image generation using IoT device with camera for cloud computing environment. Wirel. Pers. Commun. 2019, 105, 619–634. [Google Scholar] [CrossRef]
  43. Samaniego, M.; Deters, R. Internet of smart things-iost: Using blockchain and clips to make things autonomous. In Proceedings of the 2017 IEEE international conference on cognitive computing (ICCC), Honolulu, HI, USA, 25–30 June 2017; pp. 9–16. [Google Scholar]
  44. Alrubei, S.M.; Ball, E.; Rigelsford, J.M. A secure blockchain platform for supporting AI-enabled IoT applications at the Edge layer. IEEE Access 2022, 10, 18583–18595. [Google Scholar] [CrossRef]
  45. Javed, A.R.; Hassan, M.A.; Shahzad, F.; Ahmed, W.; Singh, S.; Baker, T.; Gadekallu, T.R. Integration of blockchain technology and federated learning in vehicular (iot) networks: A comprehensive survey. Sensors 2022, 22, 4394. [Google Scholar]
  46. Niya, S.R.; Stiller, B. Efficient Designs for Practical Blockchain-IoT Integration. In Proceedings of the NOMS 2022–2022 IEEE/IFIP Network Operations and Management Symposium, Budapest, Hungary, 25–29 April 2022; pp. 1–6. [Google Scholar]
  47. Mingxiao, D.; Xiaofeng, M.; Zhe, Z.; Xiangwei, W.; Qijun, C. A review on consensus algorithm of blockchain. In Proceedings of the 2017 IEEE international conference on systems, man, and cybernetics (SMC), Banff, AB, Canada, 5–8 October 2017; pp. 2567–2572. [Google Scholar]
  48. Zhang, R.; Chan, W.K.V. Evaluation of energy consumption in block-chains with proof of work and proof of stake. J. Phys. Conf. Ser. 2020, 1584, 012023. [Google Scholar] [CrossRef]
  49. Al Asad, N.; Elahi, M.T.; Al Hasan, A.; Yousuf, M.A. Permission-based blockchain with proof of authority for secured healthcare data sharing. In Proceedings of the 2020 2nd International Conference on Advanced Information and Communication Technology (ICAICT), Dhaka, Bangladesh, 28–29 November 2020; pp. 35–40. [Google Scholar]
  50. Manolache, M.A.; Manolache, S.; Tapus, N. Decision making using the blockchain proof of authority consensus. Procedia Comput. Sci. 2022, 199, 580–588. [Google Scholar] [CrossRef]
  51. Du, J. Understanding of object detection based on CNN family and YOLO. J. Phys. Conf. Ser. 2018, 1004, 012029. [Google Scholar] [CrossRef]
  52. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings Part V 13. Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  53. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. In Proceedings of the ECCV, Florence, Italy, 7–13 October 2012. [Google Scholar]
  54. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  55. Rachburee, N.; Punlumjeak, W. An assistive model of obstacle detection based on deep learning: YOLOv3 for visually impaired people. Int. J. Electr. Comput. Eng. 2021, 11, 3434–3442. [Google Scholar] [CrossRef]
  56. Pehlivan, S.; Unay, M.; Akan, A. Designing an obstacle detection and alerting system for visually impaired people on sidewalks. In Proceedings of the 2019 Medical Technologies Congress (TIPTEKNO), Izmir, Turkey, 3–5 October 2019; pp. 1–4. [Google Scholar]
  57. Chaudary, B.; Pohjolainen, S.; Aziz, S.; Arhippainen, L.; Pulli, P. Teleguidance-based remote navigation assistance for visually impaired and blind people—Usability and user experience. Virtual Real. 2023, 27, 141–158. [Google Scholar] [PubMed]
  58. Bulletin, E.C. IoT Embedded Smart Glasses for the Visually Impaired. Eur. Chem. Bull. 2023, 12, 4582–4599. [Google Scholar]
Figure 1. Structure of blockchain.
Figure 2. High-level architecture of the proposed system.
Figure 3. Raspberry Pi 4 Model B.
Figure 4. Arducam MINI OV5647.
Figure 5. Working prototype of the proposed model.
Figure 6. Illustration of hardware design of the proposed system.
Figure 7. An overview of the IoT and cloud architecture.
Figure 8. Illustration representing the design of the cloud infrastructure of the proposed model.
Figure 9. Proposed blockchain architecture.
Figure 10. Illustration of the deep learning pipeline demonstrating object detection and depth estimation on an input image.
Figure 11. An evaluation of YOLO models based on mean average precision (mAP) and latency (ms).
Figure 12. Comparative analysis of selected MiDaS models against the MiDaS v2.1 Large 384 model in terms of relative performance.
Figure 13. Visualization of the confusion matrix for the YOLOv7 object detection model trained on the COCO dataset.
Figure 14. Illustration of the precision-recall curve obtained while evaluating YOLOv7 on the COCO dataset.
Figure 15. Photo taken using a standard angle camera lens with a horizontal field of view of 60 degrees.
Figure 16. Photo taken using a wide angle camera lens with a horizontal field of view of 155 degrees.
Figure 17. Image taken at 2 feet away from the chair using a wide angle lens.
Figure 18. Image taken at 1 foot from the chair using a wide angle lens.
Figure 19. Image taken in an environment with abundant light.
Figure 20. Image taken in an environment which is dimly lit.
Figure 21. Relative depth map in an environment with abundant light.
Figure 22. Relative depth map in an environment which is dimly lit.
Figure 23. Image of terrain obstacles in front of the user.
Figure 24. Image of a street taken with a wide angle lens.
Figure 25. Image of crowded college premises.
Figure 26. Blurry image due to sudden movement of the camera when the image was being captured.
Figure 27. Image taken when the user is interacting with other people.
Table 1. Performance comparison of YOLO object detection models on the COCO Dataset, measured by mAP [38].

Model       COCO mAP@0.5:0.95 (val)   COCO mAP@0.5 (val)
YOLOv5n     0.280                     0.457
YOLOv5s     0.374                     0.568
YOLOv5m     0.454                     0.641
YOLOv5l     0.490                     0.673
YOLOv5x     0.507                     0.689
YOLOv7      0.514                     0.697
YOLOv7-X    0.531                     0.712
Table 2. Zero-shot error comparison of MiDaS models on various benchmarking datasets [39].

Model                           DIW (WHDR)   ETH3D (AbsRel)   Sintel (AbsRel)   TUM (δ1)
MiDaS v2.1 Large 384            0.1295       0.1155           0.3285            12.51
MiDaS v3.0 DPT Large 384        0.1082       0.0888           0.2697            9.97
MiDaS v3.1 BEiT Base 384        0.1159       0.0967           0.2901            9.88
MiDaS v3.1 Next-ViT Large 384   0.1031       0.0954           0.2295            9.21
MiDaS v3.1 BEiT Large 384       0.1239       0.0667           0.2545            7.17
MiDaS v3.1 Swin Large 384       0.1126       0.0853           0.2428            8.74
MiDaS v3.1 Swin2 Base 384       0.1095       0.0790           0.2404            8.93
MiDaS v3.1 Swin2 Large 384      0.1106       0.0732           0.2442            8.87
Table 3. Time breakdown of subsystem operations in the proposed system.

Event                                                           Time (ms)
Capture image and upload to Firebase Storage                    300 ± 100
Store image link in Firebase Realtime Database                  270 ± 10
Firebase Database to Private Blockchain Network                 300 ± 100
Blockchain Link Upload                                          50 ± 10
Blockchain Link Retrieve                                        100 ± 10
Download Image from Link on GPU Node                            650 ± 200
Cold start YOLOv7 Model (one-time loading)                      5500 ± 500
Cold start MiDaS Model (one-time loading)                       3500 ± 500
YOLOv7 Inference                                                70 ± 10
MiDaS Inference                                                 45 ± 10
Algorithmic Text Output Generation                              <10
Output text transmission to Audio Cue through Raspberry Pi      200 ± 50
Table 4. Comparison of features of the proposed methodology with similar systems.

Paper            Coverage         Range         Security   Processing Time   FOV          Detection Method        Scalability   COCO mAP@0.5 (val)
[29]             Indoor/Outdoor   Unclear       No         <8 s              Standard     Detection/Recognition   No            NA
[30]             Indoor/Outdoor   Unclear       No         Unclear           Standard     Detection/Recognition   No            NA
[31]             Outdoor          Unclear       No         Unclear           NA           Detection/Recognition   No            NA
[32]             Indoor/Outdoor   1.5 m         No         Unclear           Standard     Detection/Recognition   No            NA
[33]             Indoor           10 m          No         Unclear           NA           Detection               No            NA
[34]             Outdoor          10 m          No         Unclear           NA           Detection               No            NA
[35]             Indoor/Outdoor   4 m           No         Unclear           NA           Detection               No            NA
[36]             Indoor           1.5 m         No         <1 s              FishEye      Detection/Recognition   No            0.57
[55]             Indoor/Outdoor   Unclear       No         Unclear           Standard     Detection/Recognition   No            0.54
[56]             Indoor/Outdoor   Not Limited   No         Unclear           Standard     Detection/Recognition   No            0.21
[57]             Indoor/Outdoor   Not Limited   No         Unclear           NA           Detection/Recognition   No            NA
[58]             Outdoor          1 m           Yes        Unclear           NA           Detection               No            NA
Proposed Paper   Outdoor          Not Limited   Yes        0.125 s           Wide Angle   Detection/Recognition   Yes           0.69
