#### *2.3. Forklift Precision Control Algorithm*

The most common forklift control algorithms are based on improvements to PID control [16], where the desired control effect is achieved by tuning the control parameters to the actual usage. However, although this method is easy to use, its constrained range of adjustment makes it difficult to adapt to complex factory or warehouse environments, and it frequently exhibits low control accuracy in practice.

Jiang, Zhizheng et al. proposed a robust H∞-based forklift control algorithm [17] that models the dynamics of the electric power steering (EPS) system of an electric forklift. The standard H∞ control model of the EPS system is transformed to derive the generalized state equation for the EPS of an electric forklift. The principle of genetically optimized robust control is described, a constraint function for the genetic algorithm (GA) is constructed to optimize the parameters of the weighting function of the H∞ control model, and a genetically optimized robust controller is derived, effectively enhancing the accuracy and robustness of the forklift control process. However, this method also focuses only on the function of the EPS in forklift control; it neither accounts for the actual operating circumstances during forklift operation nor considers dynamic adjustment of the control quantity to increase control accuracy.

#### **3. Intelligent Forklift Cargo Precision Transfer System**

In a factory or warehouse, intelligent forklifts are needed to accurately insert and transport pallets, as well as to autonomously determine whether any pallet in the pallet storage area needs to be transferred. We introduce an intelligent forklift cargo precision transfer system to carry out this function, as depicted in Figure 2. To determine whether any pallet at the pallet storage location needs to be transported, we use a standard RGB surveillance camera. The forklift is then dispatched to the area of the pallet, and the exact position of the pallet is recognized by the RGB-D camera mounted on the forklift. Finally, based on the recognized position, a high-precision control algorithm controls the forklift to insert and pick up the pallet.

**Figure 2.** The overall flow of the intelligent forklift precision cargo transfer system. An RGB camera captures images of the pallet storage area and transmits them to the pallet monitoring module, which identifies the pallets and informs the dispatching system. The dispatching system dispatches forklifts to insert and pick up the identified pallets. When a forklift reaches a pallet, the pallet positioning module identifies the pallet position, after which the high-precision control module controls the vehicle to insert and pick up the pallet.

#### *3.1. Pallet Monitoring Module*

Pallet monitoring is the cornerstone of pallet transfer by forklift, and we carry out this monitoring task using standard RGB surveillance cameras. These RGB cameras are less expensive than photoelectric sensors or RGB-D cameras, and they are much easier to deploy because one RGB surveillance camera can cover at least 50 pallet positions. To prevent items placed on the pallets from influencing the recognition outcome, we use the side aperture feature of the pallets, which is not easily covered up, as the identification mark. We use a Yolov5-based network for pallet monitoring. However, the number of pixels occupied by the same pallet in the same image varies with the viewing angle and distance at which the surveillance camera is placed. The traditional Yolov5 scheme easily fails when the pallet is far from the camera because the pallet then occupies only a few pixels in the field of view. To handle these situations, we added a small target detection module and a convolutional block attention module (CBAM).

#### 3.1.1. Backbone Module

CspDarknet53 [18] is a Backbone structure based on Darknet53 [19] that contains 5 CSP modules. Each CSP module's convolution kernel is 3 × 3 with stride = 2, so each module functions as a downsampling operation. The input image is 608 × 608, and since the Backbone has 5 CSP modules, the feature map changes as 608 → 304 → 152 → 76 → 38 → 19. After the 5 CSP modules, a 19 × 19 feature map is obtained.
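
To make the downsampling arithmetic concrete, the following minimal sketch (channel widths are our assumption, not taken from the paper) chains five stride-2 3 × 3 convolutions, one per CSP stage:

```python
import torch
import torch.nn as nn

# Five 3x3, stride-2 convolutions, each halving spatial resolution:
# 608 -> 304 -> 152 -> 76 -> 38 -> 19.
stages = nn.Sequential(*[
    nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)
    for c_in, c_out in zip((3, 64, 128, 256, 512), (64, 128, 256, 512, 1024))
])

x = torch.randn(1, 3, 608, 608)
print(stages(x).shape)  # torch.Size([1, 1024, 19, 19])
```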

#### 3.1.2. Neck Module

The feature extractor of this network uses an enhanced bottom-up pathway FPN [20] structure that improves the propagation of low-level features. Each stage of the third pathway takes the feature map of the previous stage as input and processes it with a 3 × 3 convolutional layer. The output is added to the feature map of the same stage of the top-down pathway via a lateral connection, and these fused feature maps feed the next stage. Adaptive feature pooling is also used to recover the corrupted information path between each candidate region and all feature levels, aggregating each candidate region at every feature level to avoid arbitrary assignment.
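
A minimal sketch of this bottom-up augmentation, assuming a common channel width and three pyramid levels (both assumptions): each stage downsamples the previous bottom-up map with a stride-2 3 × 3 convolution and adds the same-stage top-down map through a lateral connection.

```python
import torch
import torch.nn as nn

class BottomUpPath(nn.Module):
    def __init__(self, channels=256, num_levels=3):
        super().__init__()
        # One stride-2 3x3 conv per transition between pyramid levels.
        self.down = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(num_levels - 1)
        ])

    def forward(self, top_down_maps):
        # top_down_maps: same-channel maps, highest resolution first.
        outs = [top_down_maps[0]]
        for conv, lateral in zip(self.down, top_down_maps[1:]):
            # Downsample the previous bottom-up map, add the lateral map.
            outs.append(conv(outs[-1]) + lateral)
        return outs

maps = [torch.randn(1, 256, s, s) for s in (76, 38, 19)]
for m in BottomUpPath()(maps):
    print(m.shape)  # (1,256,76,76), (1,256,38,38), (1,256,19,19)
```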

#### 3.1.3. Small Target Detection

We incorporate a transformer prediction head (TPH) into Yolov5 to detect small targets. The added TPH is generated from a low-level, high-resolution feature map, which is more sensitive to small objects. We also swap out some convolution and CSP blocks for transformer encoder blocks. Compared with the original bottleneck block in CSPDarknet53, the transformer encoder block captures more information. The first sub-layer in each transformer encoder block is a multi-head attention layer, and the second sub-layer is a fully connected layer (MLP). Each sub-layer has a residual connection. The transformer encoder block improves the ability to capture diverse local details and can also exploit the self-attention mechanism to unlock the potential of the feature representation.
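
The transformer encoder block can be sketched as follows (embedding size, head count, and normalization placement are our assumptions; the paper does not state them): a multi-head self-attention sub-layer and an MLP sub-layer, each wrapped in a residual connection.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, dim=256, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        # Sub-layer 1: multi-head self-attention with a residual connection.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Sub-layer 2: MLP with a residual connection.
        return x + self.mlp(self.norm2(x))

tokens = torch.randn(1, 19 * 19, 256)  # a flattened 19x19 feature map
print(TransformerEncoderBlock()(tokens).shape)
```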

#### 3.1.4. Convolutional Block Attention Module (CBAM)

CBAM [21] is a simple but effective attention module. It is lightweight, can be integrated into a CNN, and can be trained end-to-end. According to the experiments in [22], integrating CBAM into different models greatly improves their performance on various classification and detection datasets, which proves the effectiveness of the module. In pallet surveillance images, large coverage areas can contain many interference terms. CBAM extracts attention regions, helping the network resist confusing information and focus on useful target objects.
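
A compact CBAM sketch following the structure in [21] (the reduction ratio of 16 and the 7 × 7 spatial kernel are the defaults from that paper, not values confirmed here): channel attention from average- and max-pooled descriptors through a shared MLP, followed by spatial attention from a convolution over pooled channel maps.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over avg- and max-pooled features.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: 7x7 conv over channel-wise avg and max maps.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        return x * torch.sigmoid(self.spatial(s))

print(CBAM(256)(torch.randn(1, 256, 38, 38)).shape)
```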

#### *3.2. Pallet Positioning Module*

After the forklift approaches the target pallet, the relative pose between the target pallet and the intelligent forklift must be identified to provide a precise real-time pose for the forklift's control algorithm. To accomplish this, we propose a novel approach based on a deep 3D Hough voting network. More specifically, we employ an RGB-D camera to record the color and depth data of the pallet. We then feed the color and depth data into a feature extraction module to extract the surface features and geometric information of the pallet. This extracted information is then fed to a key-point detection module *Mk* to predict the offset of each point relative to our specified key points, which are defined as the 8 corner points of the two apertures of the pallet. Additionally, we employ a center voting module *Mc* to predict each point's offset from the target center, and an instance segmentation module *Ms* to predict the label of each point. Finally, the obtained 8 key points are used to estimate the pallet's pose relative to the forklift by the least squares method. The whole algorithm flow is shown in Figure 3.

**Figure 3.** Flow of the pallet position recognition algorithm. The feature extraction module extracts features from the RGB-D images, which are then fed into the *Mk*, *Ms*, and *Mc* modules to predict the translational offsets to the key points, the offsets to the center points, and the label of each point. A clustering algorithm distinguishes the different instances, and the points in each instance then vote for the key points of the instance they belong to. Finally, the pose is fitted by the least squares method from the eight extracted key points. *Kp* is the number of key points.

#### 3.2.1. Key-Point Detection Module

After feature extraction, the pallet features are fed to the key-point detection module *Mk* to detect key points on the pallet. The main function of *Mk* is to predict the Euclidean translation offset from each visible point to the key points; the visible points and the predicted translation offsets together vote for the key points. The voted points are grouped by a clustering algorithm [23], and the center of each cluster is taken as the final key point.
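
As an illustration of this voting-and-clustering step (MeanShift is a plausible choice for the clustering algorithm in [23], but that is our assumption): each visible point casts a vote at its own position plus its predicted offset, the votes are clustered, and the center of the best-supported cluster becomes the key point.

```python
import numpy as np
from sklearn.cluster import MeanShift

def aggregate_votes(points, offsets, bandwidth=0.05):
    """points: (N, 3) visible points; offsets: (N, 3) offsets to one key point."""
    votes = points + offsets                     # each point casts a vote
    ms = MeanShift(bandwidth=bandwidth).fit(votes)
    # Take the center of the most-supported cluster as the key point.
    labels, counts = np.unique(ms.labels_, return_counts=True)
    return ms.cluster_centers_[labels[np.argmax(counts)]]

pts = np.random.rand(500, 3)
offs = np.random.randn(500, 3) * 0.01
print(aggregate_votes(pts, offs))
```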

The loss function of *Mk* is calculated as follows. Given a set of extracted feature points $\{p_i\}_{i=1}^{N}$, where $p_i = [x_i; f_i]$, $x_i$ denotes the 3D coordinates of the point and $f_i$ denotes the point features. Similarly, $\{kp_j\}_{j=1}^{M}$ denotes the selected key points. We use $of_i^j$ to denote the translation offset of the $i$-th point with respect to the $j$-th key point, so the key point can be represented as $kp_i^j = x_i + of_i^j$. To supervise the learning of $of_i^j$, we use the loss function:

$$L_{\text{key-points}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left\| of_i^j - of_i^{j*} \right\| \tag{1}$$

where $of_i^{j*}$ is the ground-truth translation offset, $M$ is the number of selected key points (usually 8 in our system), and $N$ is the number of feature points.
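
In code, Eq. (1) reduces to a mean over the $N$ points of summed per-key-point offset errors; the sketch below is an illustrative PyTorch version, not the authors' implementation.

```python
import torch

def keypoint_offset_loss(pred_of, gt_of):
    """pred_of, gt_of: (N, M, 3) predicted / ground-truth offsets."""
    # L2 norm over xyz, summed over the M key points, averaged over N points.
    return torch.linalg.norm(pred_of - gt_of, dim=-1).sum() / pred_of.shape[0]

pred = torch.randn(12288, 8, 3)  # N = 12,288 points, M = 8 key points
gt = torch.randn(12288, 8, 3)
print(keypoint_offset_loss(pred, gt))
```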

#### 3.2.2. Instance Segmentation Module

To handle the case where there are multiple objects in the image, i.e., multiple goods or other objects in addition to the pallet, the other objects must be segmented from the pallet so that the key points can be extracted more accurately. We therefore present the instance segmentation module *Ms*, which predicts the semantic label of each point from the extracted point-wise features. To supervise the learning of this module, we employ Focal Loss [24]:

$$L_s = -\alpha (1 - q_i)^{\gamma} \log(q_i), \text{ where } q_i = c_i \cdot l_i \tag{2}$$

where $\alpha$ is the balance parameter, $\gamma$ the focusing parameter, $c_i$ the predicted confidence that the $i$-th point belongs to each class, and $l_i$ the one-hot representation of the ground-truth class label.
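
A sketch of Eq. (2) follows; the values $\alpha = 0.25$ and $\gamma = 2$ are the defaults from [24] and are our assumption, as the paper does not state them.

```python
import torch

def focal_loss(conf, one_hot, alpha=0.25, gamma=2.0, eps=1e-8):
    """conf: (N, C) softmax confidences; one_hot: (N, C) labels."""
    q = (conf * one_hot).sum(dim=-1)          # q_i = c_i . l_i
    # Hard points (small q_i) are up-weighted by the (1 - q_i)^gamma factor.
    return (-alpha * (1 - q) ** gamma * torch.log(q + eps)).mean()

conf = torch.softmax(torch.randn(12288, 2), dim=-1)  # pallet vs. background
labels = torch.nn.functional.one_hot(
    torch.randint(0, 2, (12288,)), num_classes=2).float()
print(focal_loss(conf, labels))
```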

We also employ a center voting module, *Mc*, to help distinguish between the various instances. We use a module similar to CenterNet [25], but expand the center points from 2D to 3D. This module predicts the Euclidean translation offset of each point to its object center to achieve better instance segmentation; similar to the key-point detection module above, the object's center can be regarded as a special kind of key point. We denote by $\Delta x_i$ the translation offset of each feature point with respect to its object center; the learning of $\Delta x_i$ is then supervised by the following loss function:

$$L_c = \frac{1}{N} \sum_{i=1}^{N} \|\Delta x_i - \Delta x_i^*\| \, \mathbb{I}(p_i \in I) \tag{3}$$

where $N$ denotes the total number of seed points on the object surface, $\Delta x_i^*$ is the ground-truth translation offset from seed $p_i$ to the instance center, and $\mathbb{I}(p_i \in I)$ is an indicator function indicating whether point $p_i$ belongs to instance $I$.
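
The center-voting loss of Eq. (3) differs from Eq. (1) mainly in the instance-membership mask; an illustrative sketch:

```python
import torch

def center_offset_loss(pred_dx, gt_dx, in_instance):
    """pred_dx, gt_dx: (N, 3) offsets; in_instance: (N,) boolean mask."""
    err = torch.linalg.norm(pred_dx - gt_dx, dim=-1)
    # The indicator keeps only points belonging to the instance.
    return (err * in_instance.float()).sum() / pred_dx.shape[0]

pred, gt = torch.randn(12288, 3), torch.randn(12288, 3)
mask = torch.rand(12288) > 0.5  # hypothetical instance membership
print(center_offset_loss(pred, gt, mask))
```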

#### 3.2.3. Network Architecture

As shown in Figure 3, the first part of the network is the feature extraction module. In this module, a PSPNet [26] extracts the appearance information in the RGB images, while PointNet++ [27] extracts the geometric information in the point cloud and its normal map. The two are fused by the DenseFusion block [28] to obtain the combined features of each point. The subsequent modules *Mk*, *Ms*, and *Mc* consist of shared multilayer perceptrons (MLPs). We supervise the learning of modules *Mk*, *Ms*, and *Mc* jointly with a multi-task loss:

$$L_{\text{multi-task}} = \lambda_1 L_k + \lambda_2 L_s + \lambda_3 L_c \tag{4}$$

We sample $n = 12{,}288$ points for each RGB-D image frame and set $\lambda_1 = \lambda_2 = \lambda_3 = 1.0$.

#### 3.2.4. Least Squares Fitting

In the least-squares fit [29], we denote the final set of 8 key points inferred by the network as $\{kp_j\}_{j=1}^{M}$. The coordinates of this point set are in the camera coordinate system, and the corresponding set of 8 points in the pallet coordinate system is denoted as $\{kp'_j\}_{j=1}^{M}$. To obtain the pose relationship $(R, t)$ between the pallet and the camera, we minimize the following loss function over these two point sets:

$$L_{\text{least-squares}} = \sum_{j=1}^{M} \left\| kp_j - \left( R \cdot kp'_j + t \right) \right\|^2 \tag{5}$$

The camera and the intelligent forklift are rigidly coupled in use. After obtaining the relative pose of the pallet and the camera, only one extrinsic transformation needs to be applied to obtain the relative pose between the forklift and the pallet.
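
Eq. (5) is the standard rigid registration problem between corresponding point sets; a common closed-form solver is the SVD-based (Kabsch) method sketched below. Whether [29] uses this particular solver is our assumption.

```python
import numpy as np

def fit_pose(kp_cam, kp_pallet):
    """kp_cam, kp_pallet: (M, 3) corresponding key points; returns (R, t)
    minimizing sum_j ||kp_cam_j - (R @ kp_pallet_j + t)||^2."""
    mu_c, mu_p = kp_cam.mean(0), kp_pallet.mean(0)
    H = (kp_pallet - mu_p).T @ (kp_cam - mu_c)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_c - R @ mu_p
    return R, t

# Synthetic check: recover a known rotation about z and a translation.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
t_true = np.array([1.0, 2.0, 0.5])
kp_pallet = np.random.rand(8, 3)
kp_cam = kp_pallet @ R_true.T + t_true
R, t = fit_pose(kp_cam, kp_pallet)
print(np.allclose(R, R_true), np.allclose(t, t_true))
```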

#### *3.3. High Precision Control Module*

Once the pallet pose has been determined, we must accurately control the forklift to insert the pallet at that position. Traditional forklift control models [30] and algorithms are overly simplistic; in particular, the actual running condition of the vehicle and the dynamic adjustment of the control quantity are not sufficiently considered, which frequently results in significant inaccuracies in practice. We propose a high-precision trajectory control method for intelligent forklifts that incorporates prediction over the forklift's motion cycle into the control process, continuously updates the prediction model, and determines the optimal control quantity, thereby increasing control accuracy.

To describe the motion of the intelligent forklift over the prediction cycle, a discrete prediction model based on the forklift model is required. Using the forklift motion model as the base model, a prediction model for the non-linear optimization problem is established, and a discrete vehicle model in incremental form is derived, as shown below:

$$u(k) = u(k-1) + \Delta u(k) \tag{6}$$

As shown in Figure 4, according to the recurrence relationship of the incremental model, the *i*-th future predicted state can be represented by the initial state and the sequence of *i* control increments, completing the construction of the non-linear prediction model.
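
To make the recurrence concrete, the sketch below rolls out predicted states from an initial state and a sequence of control increments; a kinematic bicycle model stands in for the forklift motion model here, and the model, time step, and wheelbase are all our assumptions.

```python
import numpy as np

DT, WHEELBASE = 0.1, 1.5  # assumed time step [s] and wheelbase [m]

def rollout(x0, u0, du_seq):
    """x0 = (x, y, heading); u0 = (speed, steer); du_seq: (H, 2) increments."""
    x, u, states = np.array(x0, float), np.array(u0, float), []
    for du in du_seq:
        u = u + du                          # Eq. (6): u(k) = u(k-1) + du(k)
        x = x + DT * np.array([u[0] * np.cos(x[2]),
                               u[0] * np.sin(x[2]),
                               u[0] * np.tan(u[1]) / WHEELBASE])
        states.append(x.copy())
    return np.array(states)

du = np.tile([0.05, 0.01], (10, 1))         # constant increments, 10 steps
print(rollout([0, 0, 0], [0.5, 0.0], du)[-1])
```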

**Figure 4.** Recurrence relationship of the incremental model, showing how the predicted values at successive times are obtained.

Combining the path tracking objective function based on the approximate tracking error [22] with the non-linear prediction model and the constraints [31], the path tracking optimization problem based on the approximate tracking error is constructed as shown below:

$$\min J = \sum \left( \|\boldsymbol{\varepsilon}\|^2 + \|\Delta \boldsymbol{u}\|^2 + \|\boldsymbol{u}\|^2 \right) \tag{7}$$

The constraints are added to the optimization problem, and the predicted states and control increments are obtained by minimizing the sum of squares of the approximate tracking error, the control increments, and the control quantities over the prediction cycle, completing the update of the model and the control quantity.
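
Putting Eqs. (6) and (7) together, a minimal receding-horizon sketch is shown below; the vehicle model, weights, horizon, and increment bounds are placeholders rather than values from the paper.

```python
import numpy as np
from scipy.optimize import minimize

DT, WHEELBASE, H = 0.1, 1.5, 10  # assumed time step, wheelbase, horizon

def predict(x0, u0, du):
    """Roll out Eq. (6): u(k) = u(k-1) + du(k), stepping a kinematic model."""
    x, u, states = np.array(x0, float), np.array(u0, float), []
    for d in du:
        u = u + d
        x = x + DT * np.array([u[0] * np.cos(x[2]),
                               u[0] * np.sin(x[2]),
                               u[0] * np.tan(u[1]) / WHEELBASE])
        states.append(x.copy())
    return np.array(states)

def solve_step(x0, u0, ref_xy):
    """Minimize Eq. (7) over the increment sequence; return the first increment."""
    def cost(z):
        du = z.reshape(H, 2)
        err = predict(x0, u0, du)[:, :2] - ref_xy   # approximate tracking error
        u = np.asarray(u0) + np.cumsum(du, axis=0)  # control quantities u(k)
        return (err ** 2).sum() + (du ** 2).sum() + (u ** 2).sum()
    res = minimize(cost, np.zeros(H * 2),
                   bounds=[(-0.2, 0.2)] * (H * 2))  # |du| box constraints
    return res.x.reshape(H, 2)[0]

ref = np.column_stack([np.linspace(0.05, 0.5, H), np.zeros(H)])
print(solve_step([0.0, 0.0, 0.0], [0.5, 0.0], ref))
```

In a receding-horizon loop, only the first optimized increment is applied each cycle before the model is updated from new observations and the problem is solved again.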

In general, we update the prediction model based on the observed values and the discrete prediction model, establish the path tracking objective function based on the approximation error, obtain the optimal control quantity, and combine it with the previous moment's control output to the controlled vehicle to complete the high-precision tracking control.

#### **4. Experiment Results**

To guarantee that the findings in this article are consistent with results in actual use, we run the system in real-world scenarios to obtain more realistic and accurate results.

#### *4.1. Experiment Environment Construction*

The whole system consists of three main parts: the pallet monitoring module, the pallet positioning module, and the high-precision control module. The supporting scheduling system is not discussed in detail in this work. The forklift body control algorithm, i.e., the high-precision control module, does not require a purpose-built experiment environment and can be tested directly on the forklift.
