A User Location Reset Method through Object Recognition in Indoor Navigation System Using Unity and a Smartphone (INSUS)

Fajrianti, Evianita Dewi; Panduman, Yohanes Yohanie Fridelin; Funabiki, Nobuo; Haz, Amma Liesvarastranta; Brata, Komang Candra; Sukaridhoto, Sritrusta

doi:10.3390/network4030014

by Evianita Dewi Fajrianti ¹, Yohanes Yohanie Fridelin Panduman ¹, Nobuo Funabiki ¹,*, Amma Liesvarastranta Haz ¹, Komang Candra Brata ¹,² and Sritrusta Sukaridhoto ³

¹ Graduate School of Natural Science and Technology, Okayama University, Okayama 700-8530, Japan

² Department of Informatics Engineering, Universitas Brawijaya, Malang 65145, Indonesia

³ Department of Informatic and Computer, Politeknik Elektronika Negeri Surabaya, Surabaya 60111, Indonesia

* Author to whom correspondence should be addressed.

Network 2024, 4(3), 295-312; https://doi.org/10.3390/network4030014

Submission received: 14 June 2024 / Revised: 17 July 2024 / Accepted: 18 July 2024 / Published: 22 July 2024

Abstract

To enhance user experiences of reaching destinations in large, complex buildings, we have developed an indoor navigation system using Unity and a smartphone, called INSUS. It can reset the user location using a quick response (QR) code to reduce the loss of direction of the user during navigation. However, this approach needs a number of QR code sheets to be prepared in the field, causing extra loads at implementation. In this paper, we propose another reset method to reduce loads by recognizing information of naturally installed signs in the field using object detection and Optical Character Recognition (OCR) technologies. A lot of signs exist in a building, containing texts such as room numbers, room names, and floor numbers. In the proposal, the Sign Image is taken with a smartphone, the sign is detected by YOLOv8, the text inside the sign is recognized by PaddleOCR, and it is compared with each record in the Room Database using Levenshtein distance. For evaluations, we applied the proposal in two buildings in Okayama University, Japan. The results show that YOLOv8 achieved mAP@0.5 of 0.995 and mAP@0.5:0.95 of 0.978, and PaddleOCR could extract text in the sign image accurately with an average CER lower than 10%. The combination of both YOLOv8 and PaddleOCR decreases the execution time by 6.71 s compared to the previous method. The results confirmed the effectiveness of the proposal.

Keywords:

indoor navigation system; INSUS; location reset method; natural sign; text; YOLO; PaddleOCR

1. Introduction

Nowadays, indoor navigation is increasing in importance around the world due to the growth of large, complex buildings such as shopping malls, campuses, airports, and libraries [1]. These buildings often lack proper signs and staff assistance to guide new guests to their destinations [2]. In such situations, visitors often have to wander for a while to find their destinations. An indoor navigation system using a smartphone is therefore useful for providing an efficient and user-friendly solution [3]. It allows a guest to reach the destination in the building without confusion or wasted time.

Previously, we developed an indoor navigation system called INSUS using Unity and a smartphone to guide users in unfamiliar indoor environments [4]. Unity is a game engine development environment that allows the integration of a phone’s sensors and an augmented reality (AR) software development kit (SDK) to build an AR application [5]. Besides its function as an integration tool, Unity supports the development of multi-platform applications, enabling the same system to be deployed on Android and iOS [6,7,8]. INSUS employs AR technology to overlay directional navigation guides on the smartphone screen, providing intuitive navigational assistance. It utilizes the smartphone’s gyroscope sensor and the simultaneous localization and mapping (SLAM) algorithm to accurately detect and map the user’s location.

To reduce loss of direction during navigation, INSUS employs the user location reset method using a quick response (QR) code. First, a user needs to scan a QR code that encodes the location information to set the initial position. Then, SLAM calculates the user’s location by measuring the distance from feature points captured by the smartphone’s camera across frames. Finally, the gyroscope sensor is used to estimate the user’s orientation to determine in which direction the user is facing. Unfortunately, INSUS often faces the loss of direction problem during navigation. The loss of direction refers to localization errors during navigation due to small accumulated errors from the SLAM localization components. Since the localization experiences errors, the generated path from the user’s current position to the target position becomes inaccurate. This results in false or absent path guidance visualization. To mitigate it, a user needs to periodically scan the QR code to reset the current position. Therefore, a number of QR code sheets need to be placed around the building, which adds extra load to the implementation.

In this paper, as an alternative user location reset method, we propose the use of natural signs that have been allocated in buildings. These signs may indicate room numbers, room names, or floor numbers which can uniquely identify the location. The proposed method uses the object detection technique to extract the sign from the captured image and the Optical Character Recognition (OCR) technique to recognize the text from the sign [9,10]. Specifically, YOLOv8 is adopted for object detection [11], and PaddleOCR is adopted for OCR [12]. The recognized text is compared with the registered text in the database using the Levenshtein distance to obtain the location information. Since the text recognized by OCR may contain errors, including wrong or missing characters, the Levenshtein distance between this text and each registered text in the database is calculated, and the one giving the shortest distance is selected as the matched one.

The utilization of these algorithms is based on Python, while the development of INSUS is based on Unity. We integrate these two components with an application programming interface (API), which allows two different programming languages to work together using a standardized format (e.g., JSON). We employ a representational state transfer (REST) API to communicate between Unity and Python, as exemplified by several studies [13,14,15]. Python is responsible for the machine learning operations of YOLOv8, PaddleOCR, and the Levenshtein distance process. It receives and processes the input from Unity and returns the output back to Unity. We utilize the FastAPI library in Python to handle these interactions [16]. On the other hand, Unity is responsible for providing the input in the form of images and processing the output to update the user’s current coordinates. Unity handles this interaction using the Unity Web Request module [17,18].

For evaluations of the proposal, first, the accuracy of sign detection by YOLOv8 is evaluated. Images containing signs in the building are collected as a dataset to generate the model. Then, the precision, recall, and mean average precision (mAP) values over intersection over union (IoU) thresholds ranging from 0.5 to 0.95 are calculated [19]. Second, the accuracy of the text recognition by PaddleOCR is measured using the character error rate (CER) metric [20]. Finally, the performance is compared between the proposal and the previous method in #2 and #3 Engineering Buildings at Okayama University. The results confirm the effectiveness of the proposal in reducing the execution time required for the user location reset method.
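To make the two evaluation metrics concrete, the following is a minimal sketch rather than the evaluation code used in this work. It assumes boxes given as `(x1, y1, x2, y2)` tuples, a step of 0.05 between IoU thresholds (the usual convention behind mAP@0.5:0.95), and CER defined as the edit distance divided by the reference length. The function names and the sample strings are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# Assumed threshold grid: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

if __name__ == "__main__":
    # Hypothetical room number read with one wrong character (O instead of 0)
    print(cer("D207", "D2O7"))
    print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

A detection counts as correct at a given threshold when its IoU with the ground-truth box reaches that threshold, so a higher threshold demands a tighter box.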

Our contributions are as follows: we trained a YOLOv8 model with our custom dataset to detect room numbers from sign images. We confirmed the validity of PaddleOCR to extract text with an average CER lower than 10% from small images. We also incorporated the Levenshtein distance to mitigate slight character recognition errors and achieve accurate database matching to acquire the correct room coordinates.

The rest of this paper is organized as follows. Section 2 discusses related works. Section 3 presents previous work. Section 4 presents the user location reset method using object recognition. Section 5 evaluates the proposed system through experiments. Finally, Section 6 concludes this paper with future works.

2. Literature Review

In this section, we discuss works related to the proposed method.

2.1. YOLO Model-Based Detection Methods

First, we reviewed works on implementations of the You Only Look Once (YOLO) model in indoor navigation systems. Such systems integrate a camera to detect specific objects in the environment. The detected objects are then used to determine the user’s position or to minimize navigation errors according to the defined parameters of the indoor navigation system.

In [21], Ahmad et al. presented a modified YOLOv1-based neural network for object detection to improve detection accuracy and speed. The proposed model enhances the original YOLOv1 by adjusting the loss function to a proportion style, adding a spatial pyramid pooling layer, and incorporating an inception model with a 1 × 1 convolution kernel to reduce weight parameters. Extensive experiments on the Pascal VOC 2007/2012 datasets demonstrated that the modified model achieved better performance than the original YOLOv1.

In [22], Sang et al. developed a fine-tuned model of YOLOv2 to extract vehicle-type information from images or videos. The model incorporates k-means++ clustering for vehicle bounding boxes, normalization for bounding box dimensions, and a multi-layer feature fusion strategy. The fine-tuned model achieved a mAP of 94.78% on the BIT-Vehicle validation dataset and demonstrated the ability to predict new datasets without encountering them in the training process. However, due to the low variety and amount of vehicle-type data, incorporating an additional dataset could further improve the model’s accuracy and robustness.

In [23], Zhao proposed a new method for initializing the width and height of predicted bounding boxes to improve the YOLOv3 object detection model. The method uses Markov chains and intersection-over-union for faster convergence and more accurate initial cluster centers. On the MS COCO dataset, it achieves an average IoU of 60.44% (0.56% higher) and a running time 1/297 of the original, while on the PASCAL VOC dataset, it achieves an average IoU of 67.45% (0.13% higher) and a running time 1/81 of the original. The proposed method outperforms YOLOv3 in terms of recall, mAP, F1-score, and the detection of small objects.

In [24], Wang et al. proposed a YPD-SLAM system that provides a real-time VSLAM in dynamic indoor environments. It integrates YOLO-FastestV2 target detection and Cylinder and Plane extraction (CAPE). YOLO-FastestV2 detects and isolates the target to enhance the feature point extraction. The deployed proposed system only utilizes CPU for cost-effective hardware implementation. It achieved robust, accurate, and real-time performance. However, tracking failures were detected when the user rotated too quickly.

In [25], Cong et al. proposed a Visual Simultaneous Localization and Mapping (VSLAM) algorithm that is based on ORB-SLAM3 for dynamic scenarios. The proposed algorithm is integrated with the YOLOv5 object detection model. It utilizes the depth information to categorize detected objects and eliminate dynamic feature points. The results indicate an improvement in accuracy compared to the original ORB-SLAM3. This highlights the effectiveness of the algorithm in dynamic scenarios. However, some feature points are removed in stationary scenarios, and missed object class detection is observed.

In [26], Gupta et al. proposed a transfer-learning-based model for real-time object detection, enhancing the YOLOv6 algorithm through pruning and finetuning to improve detection accuracy and inference speed. The model, which utilizes Google Text-to-Speech for audio feedback, was trained on the MS-COCO dataset and demonstrates significant improvements over various baseline models, achieving a 37.8% higher average precision with 1235 frames per second. Despite its improved performance, the model struggles with detecting objects against textured backgrounds.

In [27], Kucukayan proposed Indoor Human Detection (IHD) for drones using the YOLO-IHD model. The proposed system incorporates YOLOv7-tiny to detect small objects in the images. The results demonstrate an increase in performance, achieving a 42.51% improvement on the IHD dataset and a 33.05% improvement on the VisDrone dataset compared to the baseline model. However, low-light conditions and crowded areas might reduce the accuracy of the detection.

In [28], Lou et al. proposed a small-size object detection algorithm based on YOLOv8 designed to address the limitations of human observation in complex scenes. It ensures higher precision and consistent accuracy across various object sizes. The algorithm introduces three main innovations: a new downsampling method that preserves context features, an improved feature fusion network, and a novel network structure to enhance detection accuracy. Experimental results demonstrate that the proposed DC-YOLOv8 outperforms YOLOX, YOLOR, YOLOv3, scaled YOLOv5, YOLOv7-Tiny, and YOLOv8, with notable improvements in mAP, precision, and recall ratios on the VisDrone, TinyPerson, and PASCAL VOC2007 datasets.

The performance of YOLO models has consistently improved with each new version. Incremental advancements in the YOLO architecture have significantly enhanced detection accuracy, speed, and efficiency. This is particularly emphasized by YOLOv8’s superior ability to detect smaller-sized objects accurately. Since our system’s use case involves detecting room numbers from images where the room number is relatively small compared to the overall picture, the selection of YOLOv8 is well justified.

2.2. Optical Character Recognition Methods

Second, we reviewed works on OCR to extract text from images. This involves training deep learning models on a collection of text images with the corresponding reference text labels. Commonly, the performance of OCR models is related to the characteristics of the trained text image [29]. Thus, it is important to select the appropriate OCR model for each use case.

In [30], Kamisetty et al. proposed an invoice document processing method to digitize physical documents. The method involves the use of computer vision and OCR techniques. The data extracted from the physical documents are structured in JSON and CSV formats. In the experiments, Tesseract OCR demonstrated superior performance compared to other OCR models. However, this approach is limited to invoice images, which leads to poor text recognition on images that differ from the training images.

In [31], Salehudin et al. evaluated EasyOCR’s performance for extracting textual information within Latin characters under image degradation. This study aimed to highlight the capabilities and limitations of EasyOCR. Based on the results, EasyOCR excels at recognizing unique lowercase and uppercase characters including C, S, U, and Z. However, EasyOCR’s character detection accuracy decreases, ranging from 30 to 40% when recognizing characters with fonts under size 18.

In [32], Qi proposed a real-time system for identifying spray mark characters on moving steel slabs in the manufacturing process. The system utilizes high-sensitivity cameras, optical filters, temperature control systems, and the PaddleOCR model. The research addresses challenges related to complex lighting conditions, high temperatures, and fast-moving steel slabs. The implemented system has been operational for a year and has achieved positive detection rates every week. The minimum weekly positive rate is reported to be 91.2%, while the minimum weekly detection rate is 98.7%. However, due to the limitations of the cameras, the conveyor needs to be slowed down in the manufacturing process so as not to blur the images and create recognition errors.

2.3. Indoor Positioning Methods of AR-Based Indoor Navigation Systems

In [33], Huang et al. developed an augmented-reality-based navigation system called ARBIN for indoor navigation. The localization method employed in their work uses Bluetooth RSSI values from Lbeacon placed inside the building. This approach was able to achieve 3–5 m localization accuracy and provide correct instructions to reach destinations. However, the need to use Lbeacon for user localization constrains the implementation, making it less scalable due to the requirements for additional Lbeacons and extensive RSSI measurement and mapping in new environments.

In [34], Yang proposed using AR markers in an AR-based smartphone indoor navigation system to help visually impaired people navigate indoor environments. The AR markers act as guides, requiring the system to scan them incrementally to ensure that the user reaches their destination accurately. The approach achieved results close to the true value during the evaluation. However, implementing AR markers for user localization hinders scalability and complicates the implementation process, as the markers need to be placed throughout the building.

In [35], Ng developed a mobile augmented reality system for indoor navigation by utilizing a smartphone’s internal sensors. The system employs the IndoorAtlas SDK, which leverages the magnetic field captured by the smartphone sensors and the building’s WiFi signal. The localization method achieved an accuracy of around 1.2 m, and the developed system received positive feedback from survey participants. However, utilizing the magnetic field requires mapping the building through fingerprinting, as it is necessary to determine the user’s coordinates relative to their current location. When implemented in new buildings or locations, this limits and constrains the system’s scalability.

3. Review of Indoor Navigation System Using Unity and a Smartphone (INSUS)

In this section, we review INSUS in our previous studies.

3.1. INSUS Overview

INSUS has been designed and implemented on Android and iOS platforms [36]. It shows navigation guides on the smartphone display. It integrates AR technology, the 3D virtual environment, the SLAM algorithm, and the path-planning algorithm in the Unity game engine environment. Figure 1 shows the INSUS overview.

The system consists of the modules for Input, Unity game engine, the SEMAR server, and Output. Input considers the QR code reader and other necessary input components for navigation. Unity game engine provides services for real-time navigation processes through the SLAM algorithm, including path-planning and the estimation of user positions. The SEMAR server stores all the data related to navigation and user identification. Output displays the 3D arrow as a navigation guide for users.

3.2. Input

The application receives four inputs (the QR code, the camera, the gyroscope, and the 3D environment) as the components for INSUS to determine the user’s initial position, perform real-time localization, find the shortest paths, and visualize 3D arrows for navigation. The QR codes determine the user’s starting position and serve as calibration locations when loss of direction errors occur. Information stored inside the QR code includes the room name and its 3D coordinates (XYZ), aligned with the virtual 3D environment. The camera continuously captures images for real-time localization using the SLAM algorithm. The gyroscope measures the angle and speed at which the user rotates along an axis. It provides a user’s pose, including roll, pitch, and yaw. Thus, the gyroscope allows for stabilizing real-time localization through the SLAM algorithm [37].

Three-dimensional environments provide a virtual representation of the real environment. Information in a 3D environment includes 1:1 scale models of rooms and objects aligned with real-world coordinates [38]. The combination of the Navigation Mesh (NavMesh) and the 3D environment enables accurate navigation and path-planning to reach the target position.

3.3. Unity Game Engine

The Unity game engine is a development platform widely used for creating immersive applications [39]. It specializes in real-time rendering and simulation, making it ideal for applications such as games, simulations, and AR experiences [40]. Unity’s robust feature set includes support for multi-platform deployment, a powerful physics engine, and extensive libraries for graphics and animation [41].

In our system, Unity serves as the primary platform for integrating inputs, enabling navigational processes through the AR interface, performing path-planning in the 3D environment, visualizing the user interface, and managing external communication through the Unity Web Request module.

Unity utilizes AR Foundation as an AR software development kit (SDK). AR Foundation uses continuous camera inputs and gyroscope sensors to perform real-time localization via the SLAM algorithm and renders 3D navigation guides that overlay the real environment through the AR interface. The SLAM implementation from AR Foundation is based on Oriented FAST and Rotated BRIEF (ORB)-SLAM [42], which works through feature extraction, feature tracking, pose estimation, and mapping [43]. Feature extraction and tracking involve identifying distinctive features in images captured by the camera and tracking them across multiple frames. This tracking provides information regarding the user’s pose, rotation, and position.

In this study, SLAM’s function for the mapping process was not applied because a 3D environment had already been prepared. The Unity game engine facilitates the creation of a NavMesh to perform path-planning in a predefined 3D environment. The NavMesh enables the A* algorithm to use the 3D environment as a map to find the shortest path to the target position [44].
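As an illustration of the path-planning step, the sketch below runs A* on a small occupancy grid. It is not Unity's NavMesh implementation: the 2D grid, the 4-connected moves with unit cost, and the Manhattan-distance heuristic are all simplifying assumptions standing in for the polygonal NavMesh.

```python
import heapq


def a_star(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) and 1 (blocked) cells.

    Returns the path as a list of (row, col) cells, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates the cost of unit-step moves
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]  # (f = g + h, g, cell)
    came_from = {}
    best_cost = {start: 0}

    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry, a cheaper route was found already
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


if __name__ == "__main__":
    # Hypothetical floor plan: the only opening in the wall (row 1) is at column 2
    floor = [
        [0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0],
    ]
    print(a_star(floor, (0, 0), (3, 0)))
```

If the start position fed to the planner is wrong because of accumulated localization error, the returned path is still optimal for that wrong start, which is why the reset step matters.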

The limitation of the current implementation is the loss of direction, which results from small accumulated errors in various components used for SLAM localization. These errors can arise from camera miscalibration, inaccuracies in feature tracking, path-planning algorithms, and NavMesh being misaligned from the real environment. As these small errors accumulate, they lead to significant localization errors, hindering the visualization process of the path from the user’s current location to the destination. It is necessary to reset the location using a QR code to eliminate these accumulated errors. However, the current implementation of the user location reset method requires the placement of a lot of QR codes throughout the building, adding an extra burden during the implementation process. To address this, we propose using pre-installed signs in the building as an alternative to QR codes through an object detection function and text extraction function.

The user interface displays several components to facilitate interaction and navigation, including the initial login, QR code scanning, destination input, and real-time navigation guides. The Unity Web Request module facilitates communication with the SEMAR server by handling HTTP requests and responses to ensure reliable interactions between the Unity game engine and the SEMAR server.

3.4. Output

The output from INSUS includes real-time navigational guidance displayed to the user via an AR interface. This guidance features 3D arrows that show the direction to follow and path overlays in the real environment. These visualizations enable effective indoor navigation. The 3D arrows appear after the floor has been detected by plane detection via AR Foundation, which ensures that the guidance is accurately aligned with the user’s surroundings. The path overlay is calculated based on the shortest route determined by the A* algorithm.

3.5. SEMAR Server

The SEMAR server functions as a database provider that manages communication through the HTTP protocol [45]. It is responsible for storing and retrieving various types of user data, including the user identity for login and authentication processes, the user’s initial position after scanning a QR code, the target position inputted by the user through the destination input interface, and the room label associated with the target position. The SEMAR server ensures reliable communication between INSUS and the server by handling HTTP requests and responses using REST API services. The REST API service handles GET requests when users log into the application and POST requests for transmitting user data to the database.

4. Proposal

In this section, we propose a user location reset method by object recognition.

4.1. System Overview

This section presents an overview of this approach that utilizes existing signs in the building. Figure 2 provides an overview of the integration functions for running this method in INSUS with the help of the SEMAR server. Since the main process of the proposed system is implemented in Python and runs on the SEMAR server, we utilize an application programming interface (API) to integrate these two components. First, the camera captures an image containing signs as the input, which is called a sign image. The image transmission function sends it to the SEMAR server via HTTP communication using the REST API service in base64 format. Upon receiving the data, the object detection function identifies the sign in the image and saves the result as an isolated sign image. Then, the text extraction function recognizes and extracts text from the isolated sign image. The database matching function selects the room number in the Room Database that is most similar to the extracted text to determine the room coordinates. We employ the Levenshtein distance to calculate the similarity between the extracted text and the room number. The SEMAR server sends the room coordinates back to Unity via HTTP through the Unity Web Request module. Finally, Unity uses the new coordinates to update the user’s location.

The connection between Unity and the SEMAR server through the REST API service is visualized in Figure 3. Since the REST API is based on the HTTP communication protocol, we employ the Unity Web Request module on the Unity side to send an encoded base64 sign image to the SEMAR server using JSON data. On the SEMAR server side, due to the utilization of Python as the main process, we employ the FastAPI library to receive the image and process the image via YOLOv8, PaddleOCR, Levenshtein distance, and the regular expression library to select corresponding room coordinates in the database. Finally, the SEMAR server sends the room coordinates back to Unity to update the user’s current coordinates through the Unity Web Request module in JSON data.
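The server-side flow described above can be sketched as a single handler that chains the three functions. This is a minimal illustration under stated assumptions, not the SEMAR server code: it uses only the standard library instead of FastAPI, the JSON field names (`image`, `status`, `room`, `x`, `y`, `z`) are hypothetical, and the detection, OCR, and matching steps are passed in as callables so that the real YOLOv8, PaddleOCR, and Levenshtein components could be plugged in.

```python
import base64
import json


def handle_reset_request(body, detect_sign, read_text, match_room):
    """Run the reset pipeline on one JSON request body and return a JSON reply.

    detect_sign, read_text, and match_room are injected callables standing in
    for the object detection, text extraction, and database matching functions.
    """
    payload = json.loads(body)
    image_bytes = base64.b64decode(payload["image"])  # hypothetical field name

    sign = detect_sign(image_bytes)  # isolated sign image, or None if no sign
    if sign is None:
        return json.dumps({"status": "no_sign"})

    text = read_text(sign)  # cleaned text read from the sign
    number, (x, y, z) = match_room(text)  # closest room and its coordinates
    return json.dumps({"status": "ok", "room": number, "x": x, "y": y, "z": z})
```

Keeping the three stages behind plain callables also makes it easy to time each stage separately when comparing execution time against the QR code method.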

4.2. Sign Image

In this study, we employ sign images as an alternative to QR codes for the user location reset method. A sign image is a visual representation of an existing sign installed within the building, such as a room number, room name, floor number, or a sign for a toilet or elevator. Our approach utilizes the room number to identify the user’s location. Each room number provides a unique visual identifier for each room within the building.

By recognizing the sign image, INSUS can retrieve the text from the room number and determine the room’s coordinates to update the user’s location. This approach minimizes the need to place numerous QR codes during implementation. Figure 4 illustrates a sign image captured by a smartphone.

4.3. Image Transmission Function

The image transmission function sends the captured images to the SEMAR server in base64 format [45]. The proposed approach utilizes the sign image captured from the surrounding environment during navigation to reset the user’s location when encountering loss of direction errors. The image is captured through the phone’s camera when the user presses the rescan button. The image size is set to 480 × 480 pixels to keep the data small while retaining image quality. The image is then encoded into base64 format and sent to the SEMAR server through the Unity Web Request module, which handles all external communication between INSUS and the SEMAR server via HTTP through a REST API service using a POST request. Since this communication only allows text-based data transmission, the base64 format encodes the binary data of the image into text so that it can be carried in the request.
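The encoding step can be illustrated with the Python standard library. This is only a sketch of the payload format: in INSUS the encoding happens on the Unity side, and the field name `image` is an assumption. It also shows the cost of the text encoding, since base64 emits 4 characters for every 3 input bytes.

```python
import base64
import json


def encode_image_payload(image_bytes, field="image"):
    """Wrap raw image bytes in a JSON body using base64 text."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({field: encoded})


def decode_image_payload(body, field="image"):
    """Recover the raw image bytes from a JSON body."""
    return base64.b64decode(json.loads(body)[field])


if __name__ == "__main__":
    raw = bytes(range(256)) * 3  # 768 arbitrary bytes standing in for image data
    body = encode_image_payload(raw)
    assert decode_image_payload(body) == raw  # lossless round trip
    print(len(raw), len(base64.b64encode(raw)))  # 768 bytes become 1024 characters
```

The roughly one-third size increase is one reason to keep the captured image small before encoding.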

4.4. Object Detection Function

The proposed approach of the user location reset employs a text recognition algorithm to retrieve text information from a sign image. However, direct implementation of this algorithm could lead to inefficient processes. Since the algorithm scans all text present in the image, this also causes any unwanted text to be extracted. As a result, the accuracy and the performance decrease. To address this challenge, we introduced the object detection function to identify the sign in the sign image and isolate it for the text extraction function. The object detection function recognizes labeled objects from the dataset using the YOLOv8 model to produce bounding boxes that outline the room number. Then, it isolates the image inside the bounding box, similar to extracting the pixel values, which results in an isolated sign image.

First, the YOLOv8 model [46] receives an image as input with a size of 480 × 480 pixels after it has been decoded from base64 format. Then, it convolves the image to automatically extract the features of the labeled objects. This model uses the sigmoid function to estimate whether the object is present in the input image, enclosing it with bounding boxes and providing the confidence score of its prediction result. Finally, through this implementation, the object detection function produces an isolated sign image with a high confidence score to be processed in the text extraction function.
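The isolation step can be sketched independently of the detector. The example below assumes that detections arrive as `(x1, y1, x2, y2, confidence)` tuples in pixel coordinates and that the image is a plain list of pixel rows; the 0.5 confidence cut-off and all function names are hypothetical and are not taken from the YOLOv8 configuration used in this work.

```python
import math


def sigmoid(logit):
    """Map a raw network score to a confidence value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


def best_detection(detections, min_conf=0.5):
    """Return the (x1, y1, x2, y2, conf) tuple with the highest confidence.

    Detections below min_conf (an assumed cut-off) are ignored; None is
    returned when nothing qualifies.
    """
    candidates = [d for d in detections if d[4] >= min_conf]
    return max(candidates, key=lambda d: d[4]) if candidates else None


def crop(image, box):
    """Cut the pixels inside box out of image (a list of rows).

    box is (x1, y1, x2, y2) with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4 x 4 image
    found = best_detection([(0, 0, 2, 2, 0.4), (1, 1, 3, 3, 0.9)])
    if found is not None:
        print(crop(image, found[:4]))
```

Passing only the cropped region to the OCR stage is what keeps unrelated text in the scene out of the result.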

4.5. Text Extraction Function

The text extraction function pre-processes the isolated sign image before retrieving and cleaning the text from the image. OpenCV implements the pre-processing step to standardize the image quality [47]. Then, PaddleOCR recognizes and extracts text from the pre-processed images. Finally, the extracted text is cleaned of lowercase letters and punctuation using the regular expression (re) library [48].

First, we employ image normalization and contrast enhancement through OpenCV to standardize the image quality, addressing the varied illumination levels in the surrounding environment. This includes noise reduction through blur functions, bilateral filters, and average brightness adjustment through histogram equalization.
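To show what histogram equalization does to a grayscale image, the following pure-Python sketch applies the standard cumulative-histogram mapping. It is an illustration only: the system itself relies on OpenCV for this step, and the list-of-rows image representation and the function name are assumptions.

```python
def equalize_histogram(image):
    """Spread the gray levels of an 8-bit image over the full 0..255 range.

    image is a list of rows of integers in 0..255; a new image is returned.
    """
    pixels = [p for row in image for p in row]

    # Histogram and cumulative distribution of the gray levels
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)

    cdf_min = next(c for c in cdf if c > 0)  # CDF at the darkest level present
    n = len(pixels)
    if n == cdf_min:
        return [row[:] for row in image]  # single gray level, nothing to spread

    # Map each level so that the CDF becomes approximately linear
    lut = [max(0, round((c - cdf_min) * 255 / (n - cdf_min))) for c in cdf]
    return [[lut[p] for p in row] for row in image]


if __name__ == "__main__":
    dim = [[10, 20], [30, 40]]  # low-contrast toy image
    print(equalize_histogram(dim))
```

A sign photographed in a dim corridor occupies a narrow band of gray levels, and stretching that band makes the character strokes easier to separate from the background.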

Then, due to its high accuracy in various illumination conditions, as reported in [49], we utilize PaddleOCR to automatically recognize and extract text from the standardized isolated sign image. This detects text in images, adjusts the orientation of the detected text to horizontal, and recognizes text from the corrected orientation.

Finally, the system removes lowercase letters and punctuation that might be generated from the output of the PaddleOCR using the re library. This step produces cleaned text that is ready to be compared with the room number in the room database for the database matching function.
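A cleaning step of this kind can be written with a single regular expression. The sketch below keeps only uppercase letters and digits; dropping whitespace as well is an assumption, and the pattern and sample strings are hypothetical rather than the expression used in the system.

```python
import re

# Everything except uppercase ASCII letters and digits is treated as noise.
# Removing whitespace too is an assumption made for this sketch.
_NOISE = re.compile(r"[^A-Z0-9]")


def clean_text(ocr_text):
    """Strip lowercase letters, punctuation, and whitespace from OCR output."""
    return _NOISE.sub("", ocr_text)


if __name__ == "__main__":
    print(clean_text("D-207."))
    print(clean_text("room D207a"))
```

Any wrong character that survives this filter, such as a letter read in place of a digit, is left for the database matching function to absorb.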

4.6. Database Matching Function

We implemented a database matching function to determine the room coordinates. It compares the cleaned text against the list of room numbers in the room database using the Levenshtein distance algorithm [50]. The Levenshtein distance measures the similarity between two strings of text based on the edit distance. Equation (1) formulates the calculation of the Levenshtein distance. We employed the RapidFuzz Python library to implement the Levenshtein distance [51].

$$\mathrm{lev}_{a,b}(s,t) = \begin{cases} \max(|s|,|t|) & \text{if } \min(|s|,|t|) = 0, \\ \min \begin{cases} \mathrm{lev}_{a,b}(s[1:], t) + 1 \\ \mathrm{lev}_{a,b}(s, t[1:]) + 1 \\ \mathrm{lev}_{a,b}(s[1:], t[1:]) + 1_{(s_{i} \neq t_{j})} \end{cases} & \text{otherwise} \end{cases}$$

(1)

where $\mathrm{lev}_{a,b}(s,t)$ represents the Levenshtein distance between strings s and t, and $|s|$ and $|t|$ refer to the lengths of strings s and t. Here, s[1:] denotes the string s without its first character, and the indicator $1_{(s_{i} \neq t_{j})}$ equals 1 when the compared characters differ and 0 otherwise. The smaller the Levenshtein distance, the more similar the strings are. The function selects the room data with the smallest Levenshtein distance value. To conclude the system’s workflow, the types of data processed in each function are shown in Table 1.
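The matching step can be sketched in plain Python as follows. The paper uses the RapidFuzz library for the distance computation; the dynamic-programming function below is a standard-library stand-in, and the names `levenshtein`, `match_room`, and `ROOM_DB` are hypothetical. The two database records reuse the room numbers and coordinates listed in Table 1.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance between s and t using row-by-row dynamic programming."""
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,                         # delete char_s
                current[j - 1] + 1,                      # insert char_t
                previous[j - 1] + (char_s != char_t),    # substitute if different
            ))
        previous = current
    return previous[-1]


def match_room(cleaned_text: str, room_db: dict) -> str:
    """Return the room number with the smallest distance to the cleaned text."""
    return min(room_db, key=lambda room: levenshtein(cleaned_text, room))


# Room numbers and (x, y, z) coordinates as listed in Table 1.
ROOM_DB = {
    "d101": (73.360, -0.473, 5.563),
    "d102": (63.909, -0.473, 3.763),
}

room = match_room("d1d2", ROOM_DB)
print(room, ROOM_DB[room])
```

When two rooms are equally close to the cleaned text, `min` returns the first one encountered, so a production system would likely need an explicit tie-breaking or rejection rule; the paper does not describe one.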

In Table 1, the source represents the image captured by the phone camera. The object detection function visualizes the YOLOv8 output and the isolated sign image. Then, the text extraction function’s results consist of the PaddleOCR output and the cleaned text. The output of PaddleOCR contains some incorrect characters due to recognition errors, which are caused by the visual similarity of characters in the isolated sign image to other letters or numbers. Finally, the database matching function selects the room number with the smallest Levenshtein distance based on the cleaned text, along with its room coordinates.

5. Evaluations

In this section, we evaluate the proposal by applying it to the #2 and #3 Engineering Buildings at Okayama University, Japan.

5.1. Training Preparation and Dataset Augmentation

Here, the dataset preparation, training environment, and hyperparameters used to generate the YOLOv8 model are discussed for the object detection function.

First, we prepared a custom sign image dataset based on the YOLO format. Each image is annotated with the class name and the bounding box around the sign. We collected 213 images with separate labels from the #2 and #3 Engineering Buildings at Okayama University. Since the training process of a deep learning approach requires a large amount of data [52], we performed augmentation using the Albumentations Python package to generate a larger dataset [53]. This resulted in 2130 images generated from the original 213 images. The augmentation methods included lighting adjustments, image compression, and color shifts. Figure 5 shows the result of the augmentation methods compared to the original dataset. The image dimensions were standardized to 480 px × 480 px. We used 80% of the images for training, 10% for validation, and 10% for testing the model.
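The 80/10/10 split can be sketched as below. This assumes that the split is applied to the 2130 augmented images and that a fixed seed is used for shuffling; neither detail is stated in the text, and the function name `split_dataset` is hypothetical.

```python
import random


def split_dataset(items, seed=42):
    """Shuffle deterministically, then split 80% / 10% / 10%."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(shuffled) * 80 // 100
    n_val = len(shuffled) * 10 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical image identifiers for the 2130 augmented images.
train_ids, val_ids, test_ids = split_dataset(range(2130))
print(len(train_ids), len(val_ids), len(test_ids))  # 1704 213 213
```

Under these assumptions, the model is trained on 1704 images and validated and tested on 213 images each.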

The training was conducted on a machine running Ubuntu 20.04 as the Operating System (OS), equipped with an Intel® Xeon® Gold 5218 processor and an NVIDIA Quadro RTX 6000 GPU with 24 GB of VRAM for accelerated computation. PyTorch version 1.12.1 was employed to train the YOLOv8 model. We used Python version 3.9 with CUDA version 11.3 to enable GPU acceleration. We trained the YOLOv8 model for 400 epochs. The specific details of the training environment are presented in Table 2.

For the hyperparameters, Stochastic Gradient Descent (SGD) is used as the optimizer to provide balance during the training process of the YOLOv8 model. Both the initial and final learning rates are set to $1 \times 10^{-2}$. The momentum is set to 0.937, and the Weight Decay Coefficient of $5 \times 10^{-4}$ helps prevent the model from overfitting during training. To enable the reproducibility of the trained model, we set the random seed (listed as Random set in Table 3) to 42. The details of the hyperparameter values are described in Table 3.

5.2. Performance Analysis of the Object Detection Function

This section explains the training results and the performance measurement of YOLOv8 on the augmented sign image dataset using the Box Loss, Class Loss, precision, recall, and mAP validation methods.

5.2.1. Box and Class Loss Validation of the Object Detection Function

We validated the model by applying the Box Loss and Class Loss validation methods. These methods compare detection results over training and validation datasets. This is intended to measure the performance of the model for different datasets to avoid overfitting.

The Box Loss method measures the difference between the detected bounding box and the ground truth bounding box coordinates. A lower value of the Box Loss indicates better performance in terms of accurately detecting the location of objects in images. As shown in Figure 6a, the trained model accurately predicts the object’s bounding box. It achieved final Validation Box Loss and Train Box Loss values of 0.311 and 0.256, respectively. These results represent the ability of the YOLOv8 model to detect the bounding box over a varied dataset.

The Class Loss method measures the difference between the detected class and the ground truth class. A lower value of the measure indicates that the YOLOv8 model accurately predicts the class of the objects. Figure 6b shows the validation of the model in predicting the class of the objects. It achieved accurate classification results, with final Validation Class Loss and Train Class Loss values of 0.182 and 0.164, respectively. These results represent the ability of the trained model to detect the class in the datasets. The validation results indicate that the trained model successfully predicted the class of the object and avoided overfitting.

5.2.2. Precision, Recall, and mAP Validation of the Object Detection Function

The YOLOv8 model is evaluated using the precision, recall, and mAP metrics, compared between the training and testing datasets. These metrics reflect the model’s ability to accurately identify and classify classes [54]. Precision measures the proportion of correctly predicted data (true positives) among all data classified as positive. Recall measures the proportion of correctly predicted data (true positives) among all actual positive data in the dataset [55]. The mAP metric evaluates the accuracy and coverage of the detected bounding boxes using the Intersection over Union (IoU) threshold [56]. The IoU threshold represents the minimum required overlap between a detected bounding box and the ground truth bounding box. In this study, we report mAP at an IoU threshold of 0.5 and over the IoU threshold range 0.5–0.95.
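To make these definitions concrete, the following minimal sketch computes IoU for two axis-aligned boxes and derives precision and recall from detection counts. The box format `(x1, y1, x2, y2)`, the function names, and the example numbers are assumptions for illustration and are not values from the experiments.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# A detection counts as a true positive at mAP@0.5 when its IoU is at least 0.5.
detected = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
print(iou(detected, ground_truth) >= 0.5)
print(precision_recall(true_pos=8, false_pos=2, false_neg=0))
```

In this hypothetical case, the two boxes overlap on only one third of their union, so the detection would be rejected at the 0.5 threshold.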

In the metrics of precision and recall, the YOLOv8 model demonstrates a precision value of 99.99% and achieves a 100% recall value. This indicates that the model is robust and accurate in detecting the sign image from an input image, as illustrated in Figure 7a. The model achieved a final mAP of 0.995 at IoU 0.5 and a final mAP@0.5–0.95 value of 0.978. Based on [57], these mAP values demonstrate the consistent and reliable capability of the YOLOv8 model in detecting the trained object. The corresponding mAP results are shown in Figure 7b. A high mAP value indicates that the model could accurately and confidently detect objects with performance similar to the expected result.

5.3. Performance Analysis of Text Extraction and Database Matching Function

This section presents the performance measurement of the text extraction function in recognizing the room number and the Levenshtein distance of the database matching function. We compared the accuracy and execution time of the PaddleOCR before and after incorporating YOLOv8 to obtain the room coordinate from the cleaned text.

5.3.1. Experimental Scenarios

We measured the performance of the text extraction function and the database matching function by implementing INSUS in the #2 and #3 Engineering Buildings at Okayama University. The experiment was conducted in two scenarios. The first scenario involved text extraction using only the PaddleOCR model. The second scenario combined the YOLOv8 model and the PaddleOCR model. Both scenarios used the same environment, and for each room on each floor, we took the measurements at a distance of one meter from the door. This distance was chosen to ensure consistency, image clarity, and relevance to real-world usage. The specifications of the evaluation environment are shown in Table 4.

The brightness on each floor of the buildings is expressed as illuminance in lux. The illuminance was measured using the light sensor of a smartphone [58]. Based on [59], the average illuminance values of the buildings indicate that the lighting conditions provide sufficient visibility. During the experiments, we used two devices with different operating systems. The specifications of the experimental devices are detailed in Table 5.

5.3.2. Accuracy of Text Extraction and Database Matching Function

In this evaluation, the accuracy of the cleaned text was calculated using the Character Error Rate (CER) [60,61]. It represents the percentage of characters incorrectly recognized by the OCR model relative to the total number of characters in the reference text. Lower values indicate better performance of the OCR model, while higher values indicate worse performance.
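A common way to compute CER is to divide the edit distance between the recognized text and the reference text by the reference length. The sketch below follows that convention; whether the authors used exactly this formulation is an assumption, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,
                current[j - 1] + 1,
                previous[j - 1] + (char_a != char_b),
            ))
        previous = current
    return previous[-1]


def cer(reference: str, recognized: str) -> float:
    """Character error rate relative to the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, recognized) / len(reference)


# Room number d102 recognized as d1d2 (sample from Table 7): 1 error in 4 characters.
print(f"{cer('d102', 'd1d2'):.0%}")  # 25%
```

A single misread character in a four-character room number already yields a CER of 25%, which shows why short room numbers are sensitive to recognition errors.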

The effectiveness of the user location reset method is determined by its ability to retrieve the room coordinates from the Room Database. Based on the experimental scenarios, we evaluated each room for each floor in both buildings. For each scenario, we took five images of each room on each floor in both buildings. Then, the CER values were averaged and normalized to account for variations in the number of rooms. The results demonstrate the impact of incorporating the YOLOv8 model with PaddleOCR, as visualized in Table 6.

Table 6 shows the results with and without the YOLOv8 model. The combination of the YOLOv8 model and PaddleOCR consistently achieved a CER lower than 10% (between 2% and 8%), whereas PaddleOCR alone produced CER values between 32% and 51%. However, the combined approach still caused errors in text extraction. To address this problem, we employed the Levenshtein distance algorithm in the database matching function. It compared the cleaned text with the list of room numbers in the Room Database and selected the room coordinates of the room number with the smallest Levenshtein distance. The sample results of the database matching function are displayed in Table 7.

5.4. Comparison of the Execution Time of the User Location Reset Method

This measurement evaluates the proposed system’s impact on the time required to successfully reset the user location. It measures the time taken to execute the user location reset method, from scanning the sign image, through producing the cleaned text, to resetting the user location using the room coordinates retrieved by the database matching function. The evaluation was conducted in the two scenarios described previously. The measurement process begins with the user positioned 1 m from the door, scanning the Sign Image using INSUS. If the method fails to correctly reset the user’s location, the user moves closer to the sign and rescans it to obtain more accurate cleaned text. This process is repeated until the correct text is successfully retrieved and the user’s location is accurately reset. The results of this comparison are shown in Table 8.

Because YOLOv8 automatically isolates the sign image, the combination of YOLOv8 and PaddleOCR achieved better performance: PaddleOCR only has to recognize and extract the room number from the isolated image rather than from the whole image. In contrast, the approach that only utilized PaddleOCR requires more time to extract accurate information because it produces inaccurate cleaned text. Thus, the user needs to move closer to the sign image, which prolongs the execution time. The proposed approach detects the sign in the input image and isolates it automatically, therefore resulting in a faster execution time.
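The size of the improvement can be checked directly from Table 8. The short sketch below averages the eight per-floor execution times of each scenario; the variable names are illustrative, and the values are copied from Table 8.

```python
from statistics import mean

# Execution times in seconds from Table 8 (#2 building 1F-4F, then #3 building 1F-4F).
paddleocr_only = [11.7, 11.2, 10.57, 10.8, 11.16, 11.05, 10.72, 10.47]
yolo_paddleocr = [4.22, 4.17, 3.93, 4.34, 3.49, 4.46, 4.81, 4.56]

mean_baseline = mean(paddleocr_only)
mean_proposed = mean(yolo_paddleocr)
reduction = mean_baseline - mean_proposed

print(f"PaddleOCR only:     {mean_baseline:.2f} s")
print(f"YOLOv8 + PaddleOCR: {mean_proposed:.2f} s")
print(f"Mean reduction:     {reduction:.2f} s")
```

The mean reduction across all floors is about 6.71 s, which is consistent with the figure reported in the Conclusions.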

6. Conclusions

This paper proposed the user location reset method using object recognition as an alternative approach in INSUS. The integration between Unity (INSUS) and Python (SEMAR server) is facilitated through HTTP communication using the REST API service. YOLOv8 is adopted to locate and isolate the sign image. PaddleOCR is used to extract the text information from the sign image. The database matching function with the Levenshtein distance is applied to detect the same room number from the database. The location identified from the database matching function is then used to update the user’s current position within the building, completing the method’s workflow.

For evaluations, we applied the proposal to two buildings at Okayama University, Japan. The results show that YOLOv8 achieved an mAP@0.5 of 0.995 and an mAP@0.5:0.95 of 0.978, and PaddleOCR could extract text in the sign image accurately, with the averaged CER lower than 10%. The database matching using Levenshtein distance predicted the correct room number from the database, thereby accurately providing the coordinates to reset the user’s location. The combination of YOLOv8 and PaddleOCR decreased the execution time by 6.71 s on average. The benefits of the proposed approach include improved user location reset accuracy after incorporating the object detection function, reduced execution time, and the elimination of the need to install and maintain QR codes. This makes the system more flexible and easier to implement in various environments. The results confirmed the effectiveness of the proposal.

In future work, we will increase the types of sign images to be recognized to improve the usability and accuracy of the proposal.

Author Contributions

Conceptualization, E.D.F., N.F. and S.S.; Methodology, E.D.F.; Software, E.D.F.; Validation, E.D.F.; Resources, S.S.; Data curation, A.L.H. and K.C.B.; writing—original draft preparation, E.D.F.; writing—review and editing, N.F. and Y.Y.F.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were not required for this study because the involvement of humans was limited to obtaining the indoor user location coordinates during the testing phase to validate the feasibility of our developed system.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors thank the reviewers for their thorough reading and helpful comments and all the colleagues in the Distributed System Laboratory, Okayama University who were involved in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Franco, J.T. Navigating Complexity and Change in Architecture with Data-Driven Technologies. 2023. Available online: https://www.archdaily.com/1001585/navigating-complexity-and-change-in-architecture-with-data-driven-technologies (accessed on 5 June 2024).
2. Engel, C.; Mueller, K.; Constantinescu, A.; Loitsch, C.; Petrausch, V.; Weber, G.; Stiefelhagen, R. Travelling more independently: A Requirements Analysis for Accessible Journeys to Unknown Buildings for People with Visual Impairments. In Proceedings of the 22nd International ACM SIGACCESS Conference on Computers and Accessibility, Virtual Event, Greece, 26–28 October 2020; Volume 11.
3. Mansour, A.; Chen, W. SUNS: A user-friendly scheme for seamless and ubiquitous navigation based on an enhanced indoor-outdoor environmental awareness approach. Remote Sens. 2022, 14, 5263.
4. Fajrianti, E.D.; Funabiki, N.; Sukaridhoto, S.; Panduman, Y.Y.F.; Dezheng, K.; Shihao, F.; Surya Pradhana, A.A. Insus: Indoor navigation system using unity and smartphone for user ambulation assistance. Information 2023, 14, 359.
5. Simon, J. Augmented Reality Application Development using Unity and Vuforia. Interdiscip. Descr. Complex Syst. INDECS 2023, 21, 69–77.
6. Haas, J.K. A History of the Unity Game Engine. 2014. Available online: https://www.semanticscholar.org/paper/A-History-of-the-Unity-Game-Engine-Haas/5e6b2255d5b7565d11e71e980b1ca141aeb3391d (accessed on 7 July 2024).
7. Unity. Unity Real-Time Development Platform: 3D, 2D, VR & AR Engine. Available online: https://unity.com/cn (accessed on 7 July 2024).
8. Linowes, J. Augmented Reality with Unity AR Foundation: A Practical Guide to Cross-Platform AR Development with Unity 2020 and Later Versions; Packt Publishing Ltd.: Birmingham, UK, 2021.
9. Afif, M.; Ayachi, R.; Said, Y.; Pissaloux, E.; Atri, M. An evaluation of retinanet on indoor object detection for blind and visually impaired persons assistance navigation. Neural Process. Lett. 2020, 51, 2265–2279.
10. Pivavaruk, I.; Cacho, J.R.F. OCR Enhanced Augmented Reality Indoor Navigation. In Proceedings of the 2022 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), Virtual, 12–14 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 186–192.
11. Farooq, J.; Muaz, M.; Khan Jadoon, K.; Aafaq, N.; Khan, M.K.A. An improved YOLOv8 for foreign object debris detection with optimized architecture for small objects. Multimed. Tools Appl. 2023, 83, 60921–60947.
12. Chidsin, W.; Gu, Y.; Goncharenko, I. Smartphone-Based Positioning Using Graph Map for Indoor Environment. In Proceedings of the 2023 IEEE 12th Global Conference on Consumer Electronics (GCCE), Kokura, Japan, 29 October–1 November 2024; IEEE: Piscataway, NJ, USA, 2023; pp. 895–897.
13. Bueno, J. Development of Unity 3D Module For REST API Integration: Unity 3D and REST API Technology. 2017. Available online: https://www.academia.edu/83722981/Development_of_Unity_3D_Module_For_REST_API_Integration_Unity_3D_and_REST_API_Technology (accessed on 8 July 2024).
14. Ward, T.; Bolt, A.; Hemmings, N.; Carter, S.; Sanchez, M.; Barreira, R.; Noury, S.; Anderson, K.; Lemmon, J.; Coe, J.; et al. Using unity to help solve intelligence. arXiv 2020, arXiv:2011.09294.
15. Wang, Z.; Han, K.; Tiwari, P. Digital twin simulation of connected and automated vehicles with the unity game engine. In Proceedings of the 2021 IEEE 1st International Conference on Digital Twins and Parallel Intelligence (DTPI), Beijing, China, 15 July–15 August 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–4.
16. Lubanovic, B. FASTAPI Modern Python Web Development; O’Reilly Media: Sebastopol, CA, USA, 2023.
17. Dwarampudi, V.S.S.R.; Mandhala, V.N. Social Media Login Authentication with Unity and Web Sockets. In Proceedings of the 2023 International Conference on Computer Science and Emerging Technologies (CSET), Bangalore, India, 10–12 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5.
18. Kwon, H. Visualization Methods of Information Regarding Academic Publications, Research Topics, and Authors. Proceedings 2022, 81, 154.
19. Sun, W.; Dai, L.; Zhang, X.; Chang, P.; He, X. RSOD: Real-time small object detection algorithm in UAV-based traffic monitoring. Appl. Intell. 2022, 52, 8448–8463.
20. Schaefer, R.; Neudecker, C. A two-step approach for automatic OCR post-correction. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Dubrovnik, Barcelona, Spain, 12 December 2020; pp. 52–57.
21. Ahmad, T.; Ma, Y.; Yahya, M.; Ahmad, B.; Nazir, S.; Haq, A.u. Object detection through modified YOLO neural network. Sci. Program. 2020, 2020, 8403262.
22. Sang, J.; Wu, Z.; Guo, P.; Hu, H.; Xiang, H.; Zhang, Q.; Cai, B. An improved YOLOv2 for vehicle detection. Sensors 2018, 18, 4272.
23. Zhao, L.; Li, S. Object detection algorithm based on improved YOLOv3. Electronics 2020, 9, 537.
24. Wang, Y.; Bu, H.; Zhang, X.; Cheng, J. YPD-SLAM: A real-time VSLAM system for handling dynamic indoor environments. Sensors 2022, 22, 8561.
25. Cong, P.; Liu, J.; Li, J.; Xiao, Y.; Chen, X.; Feng, X.; Zhang, X. YDD-SLAM: Indoor Dynamic Visual SLAM Fusing YOLOv5 with Depth Information. Sensors 2023, 23, 9592.
26. Gupta, C.; Gill, N.S.; Gulia, P.; Chatterjee, J.M. A novel finetuned YOLOv6 transfer learning model for real-time object detection. J. Real-Time Image Process. 2023, 20, 42.
27. Kucukayan, G.; Karacan, H. YOLO-IHD: Improved Real-Time Human Detection System for Indoor Drones. Sensors 2024, 24, 922.
28. Lou, H.; Duan, X.; Guo, J.; Liu, H.; Gu, J.; Bi, L.; Chen, H. DC-YOLOv8: Small-size object detection algorithm based on camera sensor. Electronics 2023, 12, 2323.
29. Wang, J.; Tang, J.; Yang, M.; Bai, X.; Luo, J. Improving OCR-based image captioning by incorporating geometrical relationship. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1306–1315.
30. Kamisetty, V.N.S.R.; Chidvilas, B.S.; Revathy, S.; Jeyanthi, P.; Anu, V.M.; Gladence, L.M. Digitization of Data from Invoice using OCR. In Proceedings of the 2022 6th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 29–31 March 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–10.
31. Salehudin, M.; Basah, S.; Yazid, H.; Basaruddin, K.; Safar, M.; Som, M.M.; Sidek, K. Analysis of Optical Character Recognition using EasyOCR under Image Degradation. J. Phys. Conf. Ser. 2023, 2641, 012001.
32. Peng, Q.; Tu, L. Paddle-OCR-Based Real-Time Online Recognition System for Steel Plate Slab Spray Marking Characters. J. Control. Autom. Electr. Syst. 2023, 35, 221–233.
33. Huang, B.C.; Hsu, J.; Chu, E.T.H.; Wu, H.M. Arbin: Augmented reality based indoor navigation system. Sensors 2020, 20, 5890.
34. Yang, G.; Saniie, J. Indoor navigation for visually impaired using AR markers. In Proceedings of the 2017 IEEE International Conference on Electro Information Technology (EIT), Lincoln, NE, USA, 14–17 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–5.
35. Ng, X.H.; Lim, W.N. Design of a mobile augmented reality-based indoor navigation system. In Proceedings of the 2020 4th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Istanbul, Turkey, 22–24 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
36. Fajrianti, E.D.; Haz, A.L.; Funabiki, N.; Sukaridhoto, S. A Cross-Platform Implementation of Indoor Navigation System Using Unity and Smartphone INSUS. In Proceedings of the 2023 Sixth International Conference on Vocational Education and Electrical Engineering (ICVEE), Surabaya, Indonesia, 14–15 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 249–254.
37. Foxlin, E. Motion tracking requirements and technologies. In Handbook of Virtual Environment Technology; CRC Press: Boca Raton, FL, USA, 2002.
38. Yan, J.; Zlatanova, S.; Diakité, A. A unified 3D space-based navigation model for seamless navigation in indoor and outdoor. Int. J. Digit. Earth 2021, 14, 985–1003.
39. Hussain, A.; Shakeel, H.; Hussain, F.; Uddin, N.; Ghouri, T.L. Unity game development engine: A technical survey. Univ. Sindh J. Inf. Commun. Technol 2020, 4, 73–81.
40. Sukaridhoto, S.; Fajrianti, E.D.; Haz, A.L.; Budiarti, R.P.N.; Agustien, L. Implementation of virtual Fiber Optic module using Virtual Reality for vocational telecommunications students. JOIV Int. J. Inform. Vis. 2023, 7, 356–362.
41. Sukaridhoto, S.; Haz, A.L.; Fajrianti, E.D.; Budiarti, R.P.N. Comparative Study of 3D Assets Optimization of Virtual Reality Application on VR Standalone Device. Int. J. Adv. Sci. Eng. Inf. Technol. 2023, 13.
42. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 2564–2571.
43. Brata, K.C.; Funabiki, N.; Panduman, Y.Y.F.; Fajrianti, E.D. An Enhancement of Outdoor Location-Based Augmented Reality Anchor Precision through VSLAM and Google Street View. Sensors 2024, 24, 1161.
44. Candra, A.; Budiman, M.A.; Hartanto, K. Dijkstra’s and a-star in finding the shortest path: A tutorial. In Proceedings of the 2020 International Conference on Data Science, Artificial Intelligence, and Business Analytics (DATABIA), Medan, Indonesia, 16–17 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 28–32.
45. Panduman, Y.Y.F.; Funabiki, N.; Puspitaningayu, P.; Kuribayashi, M.; Sukaridhoto, S.; Kao, W.C. Design and implementation of SEMAR IoT server platform with applications. Sensors 2022, 22, 6436.
46. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
47. Itseez. Open Source Computer Vision Library. 2015. Available online: https://github.com/itseez/opencv (accessed on 6 June 2024).
48. Van Rossum, G. The Python Library Reference, Release 3.8.2; Python Software Foundation: Wilmington, DE, USA, 2020.
49. Fajrianti, E.D.; Funabiki, N.; Haz, A.L.; Sukaridhoto, S. A Proposal of OCR-based User Positioning Method in Indoor Navigation System Using Unity and Smartphone (INSUS). In Proceedings of the 2023 12th International Conference on Networks, Communication and Computing, Osaka Japan, 15–17 December 2023; pp. 99–105.
50. Po, D.K. Similarity based information retrieval using Levenshtein distance algorithm. Int. J. Adv. Sci. Res. Eng 2020, 6, 6–10.
51. Bachmann, M. Rapidfuzz/RapidFuzz: Release 3.8.1. 2024. Available online: https://github.com/rapidfuzz/RapidFuzz/releases (accessed on 23 May 2024).
52. Yu, F.; Seff, A.; Zhang, Y.; Song, S.; Funkhouser, T.; Xiao, J. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv 2015, arXiv:1506.03365.
53. Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and flexible image augmentations. Information 2020, 11, 125.
54. Zhang, L.; Zhao, C.; Feng, Y.; Li, D. Pests identification of ip102 by yolov5 embedded with the novel lightweight module. Agronomy 2023, 13, 1583.
55. Beger, A. Precision-recall curves. 2016.
56. Zhu, H.; Wei, H.; Li, B.; Yuan, X.; Kehtarnavaz, N. A review of video object detection: Datasets, metrics and methods. Appl. Sci. 2020, 10, 7834.
57. Padilla, R.; Netto, S.L.; Da Silva, E.A. A survey on performance metrics for object-detection algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niteroi, Brazil, 1–3 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 237–242.
58. Vochozka, V. Using a Mobile Phone as a Measurement Tool for Illuminance in Physics Education. J. Phys. Conf. Ser. 2024, 2693, 012016.
59. Bhandary, S.K.; Dhakal, R.; Sanghavi, V.; Verkicharla, P.K. Ambient light level varies with different locations and environmental conditions: Potential to impact myopia. PLoS ONE 2021, 16, e0254027.
60. da Silva, L.V.; Junior, P.L.J.D.; da Costa Botelho, S.S. An Optical Character Recognition Post-processing Method for technical documents. In Proceedings of the Anais Estendidos do XXXVI Conference on Graphics, Patterns and Images, Rio Grande, Brazil, 6–9 November 2023; SBC: Porto Alegre, Brazil, 2023; pp. 126–131.
61. Randika, A.; Ray, N.; Xiao, X.; Latimer, A. Unknown-box approximation to improve optical character recognition performance. In Proceedings of the Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland, 5–10 September 2021; Proceedings, Part I 16. Springer: Berlin/Heidelberg, Germany, 2021; pp. 481–496.

Figure 1. Overview of INSUS.

Figure 2. Overview of INSUS with object-detection-based user location reset method.

Figure 3. Connection between Unity (INSUS) and Python (SEMAR server) with HTTP communication using REST API service.

Figure 4. Example input image with sign.

Figure 5. (a) Example in original dataset; (b) example in the augmented dataset.

Figure 6. (a) Validation training from box loss; (b) validation training from class loss.

Figure 7. (a) Precision and recall during the training process; (b) mAP@0.5 and mAP@0.5-0.95 results during the training process.

Table 1. Examples of data in each function for the proposed method.

Source	Object Detection Function		Text Extraction Function		Database Matching Function
Source	YOLOv8 Output	Isolated Sign Image	PaddleOCR Output	Cleaned Text	Room Number	Room Coordinates
			D-1D2	d1d2	d102	x: 63.909 y: −0.473 z: 3.763
			D-L01	dl01	d101	x: 73.360 y: −0.473 z: 5.563

Table 2. Device specification for model training.

Component	CPU	RAM	GPU	OS	Python Version	CUDA Version	PyTorch Version
Specification	Intel® Xeon® Gold 5218	24 GB	NVIDIA QUADRO RTX 6000 VRAM 24 GB	Ubuntu 20.04	3.9	11.3	1.12.1

Table 3. Hyperparameter specifications for YOLOv8 model training.

Hyperparameter Name	Hyperparameter Value
Optimizer	SGD
Initial Learning Rate (lr0)	0.01
Final Learning Rate (lrF)	0.01
Momentum	0.937
Weight Decay Coefficient	$5 \times 10^{- 4}$
Random set	42

Table 4. Specifications of evaluation environment.

Name	Floor Level	Number of Rooms	Average Illuminance (LUX)
#2 Engineering Building	1	8	106.18
	2	8	116.91
	3	8	91.67
	4	6	121.22
#3 Engineering Building	1	18	112.11
	2	16	123.68
	3	17	115.62
	4	18	97.21

Table 5. Specifications of evaluation devices.

	Samsung Galaxy S22 Ultra	iPhone X
OS	Android	iOS
GPU	Adreno730	Apple GPU
CPU	Octa-core ( $1 \times 3$ GHz, $2 \times 2.5$ GHz, $4 \times 1.8$ GHz)	Hexa-core (2.39 GHz)
Memory	8 GB	3 GB
Camera	108 MP	12 MP

Table 6. CER results for every floor level on both buildings with and without the incorporation of the YOLOv8 model.

		#2 Building				#3 Building
OS	Method	1F	2F	3F	4F	1F	2F	3F	4F
Android	PaddleOCR	42%	39%	36%	35%	36%	36%	39%	43%
Android	YOLOv8 model + PaddleOCR	2%	2%	5%	8%	3%	5%	3%	2%
iOS	PaddleOCR	51%	38%	32%	37%	34%	40%	34%	36%
iOS	YOLOv8 model + PaddleOCR	2%	2%	3%	4%	4%	3%	3%	2%

Table 7. Sample results of Levenshtein distance from two implementation buildings.

Implementation Building	Floor Level	Cleaned Text	Levenshtein Distance	Room Number
#2 Engineering Building	1F	d1d2	1	d102
	2F	d205	0	d205
	3F	db01	1	d301
	4F	d4o1	1	d401
#3 Engineering Building	1F	el13	1	e113
	2F	e2l2	1	e212
	3F	3303	1	e303
	4F	e401	0	e401

Table 8. Execution time (seconds) in #2 and #3 Engineering Buildings.

	#2 Engineering Building				#3 Engineering Building
Floor Level	1F	2F	3F	4F	1F	2F	3F	4F
PaddleOCR	11.7	11.2	10.57	10.8	11.16	11.05	10.72	10.47
YOLOv8 model + PaddleOCR	4.22	4.17	3.93	4.34	3.49	4.46	4.81	4.56

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

