1. Introduction
Achieving the large-scale use of automatic guided vehicles (AGVs) called for by Industry 4.0 is a daunting undertaking. Most AGVs today are deployed by large enterprises, such as Amazon, Alibaba, Lotte, Carrefour, Walmart, and Pinduoduo [1,2,3,4,5,6]. AGVs have shown great benefits in the logistics field and have led to a significant reduction in handling and transportation costs. However, to meet the anticipated needs of Society 5.0, further work is required to improve the cost, speed, safety, and versatility of AGV systems [5,6]. Industry 4.0 requirements include interoperability (compatibility), information transparency, technical assistance, and independent decision making. The system built for the AGV in this work fulfills the requirement of independent decision making: the AGV decides autonomously when to navigate and when to stop through a deep action learning (DAL) mechanism. The proposed DAL can only perform optimally with the support of existing systems, such as YOLO for object detection, SURF for route confirmation, and kNN for verifying the AGV's position against references such as the start point, the home point, and obstacles.
Continuous advancements in wireless connectivity, sensor miniaturization, cloud computing, big data, and analytics have led to a new concept called the Internet of Things (IoT), in which devices collect and exchange information with one another with little or no human intervention [5,7,8]. The integration of IoT and Artificial Intelligence (AI) technology has led to the development of the Artificial Intelligence of Things (AIoT), which is essentially a more intelligent and capable ecosystem of connected devices and systems. In the context of AGVs, AIoT primarily aims to emulate the execution of human tasks within logistics and storage systems through the utilization of Internet networks and intelligent decision making [8,9,10,11]. Although Industry 4.0 already leverages such AGV technology to a certain extent, its use is generally limited to larger enterprises and structured environments [2].
One of the many components of Industry 4.0 is warehousing, which integrates technology and automation to optimize various tasks in the storage and distribution of goods. Within warehouse environments, AGVs are mainly used for order picking and material transport and are typically guided by line tracking, barcodes, laser sensors, and camera techniques [1]. Such methods work well in structured environments with clearly defined paths, controlled obstacles, predictable workflows, and minimal human interactions [12,13]. However, in environments with a dynamic layout, obstacle variability, and the need for complex decision making, existing AGV systems face significant challenges. Unfortunately, these problems are yet to be adequately resolved in the context of warehousing. More broadly, other studies link Industry 4.0 with various machine models and simulations through dynamic, intelligent, flexible, and open applications. Photogrammetry models, primarily assisted by GNSS and ground control point (GCP) navigation, are integral to the flexibility aspect of Industry 4.0 and have also been used to develop a wheeled AGV that employs artificial neural networks to achieve high accuracy [14,15].
In the literature, AIoT approaches have been widely used for applications such as roadside recognition for AGV forklift equipment, unmanned vehicle obstacle detection, autonomous lane detection, crop and weed detection, and collision avoidance [2,16]. The authors in [17] used a recurrent neural network (RNN) to guide musculoskeletal arms and robotic arms to achieve the precise and efficient completion of goal-directed tasks. The research in [17] built on the work of Krizhevsky et al. [18], who earlier developed an efficient convolutional neural network (CNN) called AlexNet, which uses dropout regularization to reduce overfitting [19]. Zhang et al. [5,11,12,20,21] proposed that features be generated from two consistent domains using Generative Adversarial Networks (GANs). However, the training procedure is slow, and the approach was therefore updated for unsupervised cross-domain object detection, a method known as CA-FRCNN (Cycle-Consistent Domain Adaptive Faster R-CNN). Another method, You Only Look Once (YOLO), is a real-time object identification method developed by Fang et al., and several variants of the original YOLO model have since been proposed [22]. In general, the results have shown that YOLO has many advantages over other lightweight models, including real-time processing capabilities, multi-object detection, and customization for specific applications, making it a versatile and efficient choice for many computer vision tasks.
YOLO has found many applications across various industries, including robotics, agriculture, medicine, health, education, and the military. As the detection speed of YOLO has improved over the past few years, the scope of its applications has expanded [23], and there is now growing interest in applying YOLO to AGVs. However, YOLO has a high power consumption and relies on fast GPU cards and complex computational processes. Furthermore, while YOLO facilitates autonomous driving, navigation, and obstacle avoidance, particularly in unstructured environments, there remain many challenging concerns to be overcome [18,19,20,22,23,24].
The reliable navigation of modern AGVs generally depends on the successful recognition and detection of routing markers. Stereo camera systems, such as D435i, Kinect, or RealSense, provide an effective solution for the detection of fixed objects. However, they suffer from several severe limitations in practical situations, such as vulnerability to environmental conditions, the need to maintain accurate calibration and camera alignment over time, and a high computational complexity. Consequently, the feasibility of using mono camera systems for AGV navigation has gained increasing traction in recent years. In such approaches, a relative baseline is calculated as the AGV moves based on the pixel shift at a fixed point on an object, and the speeded-up robust features (SURF) algorithm is then applied to the pixel shift for navigation purposes. Compared with stereoscopic vision systems, mono camera systems have a lower power consumption, thus facilitating a longer AGV running time. Moreover, in automatic navigation scenarios, the ability to detect random obstacles, narrow positions, and changes in the object size, orientation, and type is typically more important than performing depth estimation [7,25,26].
The studies in [3,27,28] examined the roles of AI, robotics, and data mining in AGV navigation and concluded that effective algorithms for navigating indoor spaces rely heavily on the extraction of appropriate local features for performing keyframe selection, localization, and relative posture calculation. Many features and feature processing methods have been proposed, including invariant column segments [4], SIFT (Scale Invariant Feature Transform) [29,30], and FREAK (Fast Retina Keypoint) [29]. It was shown in [31] that the feature processing speed can be accelerated through a bag-of-words (BoW) technique, in which a histogram of visual words is used to represent the quantized image. Term Frequency–Inverse Document Frequency (TF-IDF), a statistics-based method, can also be applied to each histogram bin to quantify the relevance of a particular visual word to any image within the image set [27,32]. However, although local features are theoretically less sensitive to lighting variations and motion blur, indoor environments still pose a significant challenge owing to their extreme visual diversity, the presence of repeated patterns, and the potential for occlusion. Random Sample Consensus (RANSAC) [2,28,33] provides a means of overcoming these problems through more accurate and robust feature matching. However, in complex, unstructured environments, the resulting substantial mismatch ratio increases the computation time required by RANSAC to estimate the relative poses with precision.
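As context for the RANSAC discussion, the following pure-NumPy sketch estimates a 2-D translation between matched keypoints in the presence of mismatches. It is a simplified stand-in for full relative-pose estimation (names, thresholds, and the translation-only model are illustrative assumptions, not this paper's implementation), but it shows why a high mismatch ratio costs iterations: each sample drawn from an outlier is wasted.

```python
import numpy as np

def ransac_translation(src, dst, iters=200, thresh=2.0, seed=0):
    """Estimate a 2-D translation t with dst ~ src + t despite
    mismatches. src, dst: (N, 2) arrays of matched keypoints."""
    rng = np.random.default_rng(seed)
    best_t, best_inliers = None, 0
    for _ in range(iters):
        i = rng.integers(len(src))           # minimal sample: one match
        t = dst[i] - src[i]                  # candidate translation
        err = np.linalg.norm(dst - (src + t), axis=1)
        inliers = int((err < thresh).sum())
        if inliers > best_inliers:
            best_t, best_inliers = t, inliers
    # refine on the consensus set
    mask = np.linalg.norm(dst - (src + best_t), axis=1) < thresh
    t_refined = dst[mask].mean(axis=0) - src[mask].mean(axis=0)
    return t_refined, mask
```

A usage example: with five correct matches shifted by (5, 3) and one gross mismatch, the consensus set excludes the mismatch and the refined translation recovers (5, 3).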
In previous studies [34,35,36], the present group developed an R-CNN (region-based CNN) with an eye-in-hand structure that utilized a single camera to estimate depth and object location. In a later study, this approach was extended to the task of object picking, and an action learning (AL) method was proposed to help the manipulator robot learn from its mistakes [32]. Although the system can learn from actions, its working area coverage is very narrow and its object detection area is fixed. Consequently, the AL method is less suitable for unstructured areas, varying distances, and dynamic object positions. In the present study, a robust and efficient navigation method is proposed for AGVs by extending AL into deep action learning (DAL) and utilizing SURF and the k-nearest neighbors (kNN) method as feedback guides for the navigation process.
This study’s primary contributions can be summed up as follows:
A DAL architecture is employed to perform robust and accurate detection of objects in an indoor environment.
Object localization is performed using a single monochrome camera fixed to the AGV. An automated navigation capability is realized through the amalgamation of YOLOv4, SURF, and kNN in a seamless DAL architecture.
The AGV’s self-navigation performance is enhanced by representing obstacles as points or nodes in the AGV mapping system, thereby improving its ability to plan routes around them.
The experimental outcomes demonstrate that the suggested system performs robustly and meets the requirements of advanced AGV operations.
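The amalgamated YOLOv4–SURF–kNN pipeline ultimately reduces to a decision rule that maps detections and feature shifts to motion commands. The toy sketch below conveys the shape of such a rule; the labels, thresholds, and steering logic are hypothetical illustrations, not this paper's actual controller.

```python
def decide_command(detections, mean_shift_x, stop_dist=0.5):
    """Toy navigation rule. detections: list of (label, distance_m)
    pairs from the detector; mean_shift_x: mean horizontal pixel shift
    of the matched feature nodes between consecutive frames.
    Returns a motion command: 'forward', 'left', 'right', or 'stop'."""
    for label, dist in detections:
        if label == "home" and dist < stop_dist:
            return "stop"                       # target reached
        if label == "obstacle" and dist < stop_dist:
            # steer away from the side toward which the scene drifts
            return "right" if mean_shift_x > 0 else "left"
    return "forward"                            # path clear
```

For instance, a nearby obstacle with a leftward feature drift would yield a `left` command, while reaching the home marker yields `stop`.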
The remainder of this paper is organized as follows. The general system design is presented in Section 2, including both the AGV robotic platform and the navigation system. The suggested visual navigation system and the localization and detection techniques are described in depth in Section 3. Section 4 discusses the visual navigation, obstacle avoidance, safety, and moving obstacle detection issues for the AGV platform in an indoor warehouse environment. The results of the experiment are presented and analyzed in Section 5. Section 6 offers a brief conclusion and suggests future research directions.
2. System Design
The proposed system comprises two main components: the physical AGV robotic platform and the navigation system used to control its motion. The navigation system is designed to allow the AGV to operate naturally inside an environment containing various objects, such as walls, aisles, shelves, and items on the floor. For evaluation purposes, the AGV in this research navigates past various obstacles or markers from the starting position to the home position. As the AGV moves, it performs continuous object detection and recognition using a visual navigation system implemented with a DAL architecture, based on YOLOv4 for segmentation and SURF for collision avoidance.
The fundamental elements of the suggested DAL architecture are shown in Figure 1. The brown boxes refer to the simulated localization environment, which contains various objects, including sports cones, gallon water containers, and cardboard boxes. A dataset consisting of images showing these objects and the associated environment was compiled to support the YOLO segmentation process. While the AGV is moving, the paired RGB images captured by the mono camera are used as input to the SURF algorithm (shown in purple) to perform obstacle avoidance according to the object detection results obtained from the kNN-assisted YOLOv4 model. Finally, a set of commands is produced to instruct the AGV to move forward, backward, left, or right, or to stop, as required, to safely reach the designated home position (depicted in white).
In general, DAL is divided into three main parts; note that the DAL concept adopts both AL and reinforcement learning (RL). The first part is the environment, which consists of cones, water containers, cardboard boxes, and other items; a previously collected dataset representing this passive environment is also provided. The second part, in the middle, is the visual navigation system, and the third is the resulting set of movement commands. RGB images of the indoor environment serve as the system's input for detection and recognition by YOLOv4, trained with consideration of three optimizers: SGDM, RMSProp, and ADAM. The YOLOv4 recognition results are then used by kNN to find connection points between the previous image and the current image, and this shortest-distance search informs the AGV's movements in finding a safe route. The RGB input from the AGV is captured continuously and compared: as soon as the AGV moves, the first image is taken, and after the next movement the following image is taken. This movement produces a baseline shift along the x-axis and/or z-axis, so the AGV effectively mimics a stereo camera using a pair of near-identical images.
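The kNN step described above, which pairs each feature in the current image with its nearest counterpart in the previous image, can be sketched as a brute-force nearest-neighbor search (illustrative only; in the actual system the descriptors come from the SURF/YOLOv4 stages):

```python
import numpy as np

def knn_match(desc_prev, desc_curr, k=1):
    """Brute-force kNN: for each descriptor in the current frame,
    return the indices and distances of its k nearest descriptors
    in the previous frame (Euclidean distance)."""
    diff = desc_curr[:, None, :] - desc_prev[None, :, :]
    d = np.linalg.norm(diff, axis=2)          # shape (n_curr, n_prev)
    idx = np.argsort(d, axis=1)[:, :k]        # k closest per row
    return idx, np.take_along_axis(d, idx, axis=1)
```

Each returned index pair is a candidate connection point between the two frames; the associated distances are what the shortest-distance search ranks.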
SURF processes these near-identical images to find link points between the first and second images. In principle, there is no limit to the number of matching points, and more matching points naturally give a more valid result; however, the number must be limited to keep performance efficient, and ten nodes are sufficient. The navigation algorithm uses these nodes to avoid obstacles, approach the target, turn, and stop. Another algorithm related to DAL tests the accuracy of target detection. The two algorithms are combined to find the target and to determine the AGV's navigation in an indoor context.
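Keeping only the ten best nodes amounts to sorting the candidate matches by descriptor distance and truncating the list, as in the following minimal sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def best_nodes(matches, distances, n=10):
    """Keep only the n match nodes with the smallest descriptor
    distance; ten nodes trade match validity against runtime."""
    order = np.argsort(np.asarray(distances))[:n]
    return [matches[i] for i in order]
```

The retained nodes are the strongest correspondences, so the navigation algorithm bases its avoid/approach/turn/stop decisions on the most reliable evidence available.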