Article

Identification and Positioning Method of Bulk Cargo Terminal Unloading Hopper Based on Monocular Vision Three-Dimensional Measurement

1 College of Transport & Communications, Shanghai Maritime University, Shanghai 201306, China
2 Logistics Engineering College, Shanghai Maritime University, Shanghai 201306, China
3 Shanghai SMUVision Smart Technology Ltd., Shanghai 201306, China
4 Higher Technology School, Shanghai Maritime University, Shanghai 201306, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(8), 1282; https://doi.org/10.3390/jmse12081282
Submission received: 16 June 2024 / Revised: 28 July 2024 / Accepted: 29 July 2024 / Published: 30 July 2024

Abstract

Rapid identification and localization of dry bulk cargo hoppers are currently core issues in the automation control of gantry cranes at dry bulk terminals. The current conventional method relies on LiDAR systems for the identification and positioning of bulk unloading hoppers. However, this approach is complex and costly. In contrast, GPS-based positioning solutions for bulk unloading hoppers are prone to damage due to the vibrations generated during the operation process. Therefore, in this paper, a hopper localization system based on monocular camera vision is proposed to locate the position of the bulk unloading hopper. The hopper identification and localization process is divided into three stages. The first stage uses the improved YOLOv5 model to quickly and roughly locate the hopper target. The second stage uses morphological geometric features to locate the corner points of the hopper target. The third stage determines the three-dimensional coordinates of the hopper target by solving the position of the corner points in the world coordinate system through the PnP (Perspective-n-Point) algorithm. The experimental results show that the average positioning accuracy of the method's coordinate estimates is above 93%, demonstrating the accuracy and effectiveness of the method.

1. Introduction

The loading and unloading speed of cargo is a key metric in dry bulk terminals [1], and the schematic diagram of the gantry crane ship unloading operation is shown in Figure 1 [2]. At present, the efficiency of the gantry crane loading and unloading operation mainly depends on the experience of the operator [3]. Therefore, it is necessary to position the bulk unloading hopper to assist the grab in unloading the ship. To measure the relative position between the grab and the bulk unloading hopper, there are two main engineering application schemes:
1. Using a contact measurement approach, sensors such as attitude sensors and GPS devices are deployed on the upper part of the bulk hopper for direct positional measurement. However, these sensors are highly susceptible to vibration, impact, and shock, leading to potential failures. Moreover, the bulk unloading hopper is mobile and lacks inherent power and communication networks, making the deployment of such sensors complex.
2. Using a non-contact measurement approach typically involves deploying LiDAR capable of three-dimensional measurement for the bulk unloading hopper. However, LiDAR is expensive and easily affected by material dust and rain, which severely compromise measurement accuracy. Additionally, LiDAR’s limited field of view results in a small effective measurement range, thereby impacting overall measurement efficiency [4].
Based on this, this paper proposes a monocular camera vision localization system for locating the bulk unloading hopper to improve the efficiency of gantry crane unloading operations. In this paper, we use the improved YOLOv5 model with image processing for bulk unloading hopper corner point localization and apply the PnP algorithm to measure the target’s world coordinates. The contributions of this paper are mainly reflected in the following aspects:
1. We propose a device and approach based on monocular vision for three-dimensional measurement, achieving bulk unloading hopper recognition and spatial positioning solely through a single camera.
2. We have improved the traditional YOLOv5 model and proposed a balanced accuracy-speed YOLOv5 model for the coarse localization of bulk unloading hoppers. This model is combined with morphological image processing methods to accurately determine the pixel coordinates of the hopper corners.
3. We propose a method for solving the three-dimensional coordinates of the bulk unloading hopper using a monocular camera based on the PnP algorithm. By utilizing the pixel coordinates of the four corners of the bulk unloading hopper, we directly determine its spatial coordinates.
The detailed composition of this paper is as follows: In Section 2, we discuss related work, including target detection algorithms and key point detection techniques based on machine vision. In Section 3, we introduce the proposed method for bulk unloading hopper localization in detail. Section 4 presents experimental results based on real data. Finally, Section 5 summarizes this paper.

2. Related Works

Vision-based measurement (VBM) systems are capable of obtaining a wealth of information by analyzing images [5], and with the development of visual inspection algorithms and the growth of hardware computing power, vision-based measurement solutions are becoming increasingly popular in ports. Ngo et al. [6] relied on a vision system to propose a solution for lane line detection in container terminals. Ji et al. [7] used vision systems combined with deep learning models to detect truck bodies and prevent truck-lifting accidents. Lee [8] installed at least four cameras on cranes to assist in detecting and identifying container corner castings. However, object detection alone does not provide the positional information of the target. Therefore, the 3D coordinate information of a target must be recovered from its 2D image, and there are two common deep learning approaches: (1) direct regression methods; (2) solving coordinates from key points.
Direct regression approach: Xu et al. [9] introduced a generalized framework to extend 2D object detectors to 3D object detection, employing end-to-end learning (ETL) to estimate the location of 3D objects. This approach relies on an external network as at least one component of its framework, which leads to an inherent disconnect in component learning and adds system complexity. On this basis, Brazil et al. [10] proposed a Monocular 3D Region Proposal Network (M3D-RPN) to generate both 2D and 3D object candidate boxes and designed a novel depth-aware convolutional layer to learn spatial features. Haq et al. [11] utilized geometric constraints in the 3D and 2D detection spaces to predict 3D objects and gained a performance improvement. However, these algorithms are usually complex and have relatively poor real-time performance.
Key point coordinate localization: using known world coordinates, the position information of the target is obtained by solving the mapping relationship between the world coordinates of the target and its pixel coordinates. Mi et al. [12] used a triangulation method on images calibrated with Zhang's method [13], based on the characteristics of the pinhole camera, and utilized machine vision to determine the three-dimensional pose of a container in real time through the center offset displacement of the locking hole. However, the camera position in that work is relatively fixed and does not satisfy the moving-camera scenario of this paper. The Perspective-n-Point (PnP) algorithm is commonly used to solve the transformation from pixel coordinates to world coordinates. Mi et al. [14] used a two-stage PnP to solve the camera position and marker position; Shan et al. [15] solved the pose information of an object with the PnP algorithm; Li et al. [16] used PnP to solve the marker's pose in the camera coordinate system. All of the above papers use target markers to improve detection accuracy, whereas the target in this paper has a simple structure, so the coordinates of the target's corner points can be located directly and the coordinate information obtained without attaching a marker. For example, Rad et al. [17] predicted the 2D projections of the corner points of the target's 3D bounding box from segmented 2D images, which is robust to partially occluded targets. Tekin et al. [18] proposed a similar scheme based on the YOLO architecture, with an innovative confidence scoring system that dispenses with the need for subsequent refinement steps. Ma et al. [19] used neural networks and greedy algorithms to directly regress the corner point information of the target. However, the accuracy of pose estimation largely depends on the quality of key point detection, and the quality of the extracted features plays a key role in key point detection performance. Therefore, improving the quality of extracted features remains crucial for accurate key point detection and reliable pose estimation results.
In addition to detection accuracy, detection speed is also an important performance metric in bulk cargo terminal operations. YOLOv5 [20], owing to its reliable accuracy and fast single-stage detection, has become a mainstream choice. However, the original YOLOv5 model tends to exhibit low frame rates when deployed on edge computing devices and when performing synchronous inference across multiple video streams, posing significant safety risks for industrial production and monitoring. In response, Zhang et al. [21], inspired by ShuffleNetv2 [22], enhanced YOLOv5 by removing the focus layer, replacing the original feature extraction network with ShuffleNetv2, and pruning the neck part of the network. This effectively reduces the size of the original YOLOv5 model while significantly increasing detection speed. Li et al. [23], by integrating YOLOv5 with two lightweight models, ShuffleNet and MobileNet, significantly reduced the model size; however, this modification led to a notable decline in detection accuracy. Thus, effectively modifying YOLOv5 to balance detection accuracy with speed, so as to meet the computational demands of multiple automated handling devices at bulk cargo terminals on resource-limited hardware, has become a critical challenge.

3. Our Method

In the process of automated loading and unloading operations at bulk cargo terminals, it is necessary to accurately position the bulk unloading hoppers. Fixed cameras, which have a wide field of view to cover the entire operation scene, often encounter challenges. The weak targets within the wide and complex scenes at the bulk terminal contain limited feature information, leading to significant deviations in target detection and identification based on visual neural networks.
Additionally, the GPS-based bulk unloading hopper positioning method may result in structural damage to the GPS equipment due to the violent vibrations caused by the collision of the grab bucket. Therefore, to improve the accuracy and robustness of bulk unloading hopper positioning during automated loading and unloading operations at bulk cargo terminals, this paper proposes a three-stage bulk unloading hopper offset positioning method.

3.1. System Design

The bulk cargo terminal loading and unloading operation process involves a variety of equipment, mainly the gantry crane and the bulk unloading hopper, as shown in Figure 2. GPS units are installed on the gantry to obtain real-time world coordinates of the key points of the gantry during operation. There are two GPS devices on the gantry crane elephant trunk: GPS1 (X_1, Y_1, Z_1) and GPS2 (X_2, Y_2, Z_2). At the same time, a vertically oriented (downward-facing) camera is installed at the end of the maintenance platform of the crane, directly above the spreader, to capture visual images of the entire automated loading and unloading process. From GPS1 and GPS2, the camera's coordinates in the world coordinate system, (X_c, Y_c, Z_c), can be calculated, as well as the direction vector v of the camera's x-axis in the world coordinate system.
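As a minimal illustration of this step, the Python/NumPy sketch below shows one way the camera pose could be derived from the two antennas, assuming for simplicity that the camera sits at the midpoint between GPS1 and GPS2 with a fixed vertical offset; the actual mounting geometry on the gantry crane is site-specific and not specified in this paper, so the offset value and function name here are purely illustrative.

```python
import numpy as np

def camera_pose_from_gps(gps1, gps2, cam_offset_z=-1.5):
    """Estimate the camera world position and the image x-axis direction vector v.

    gps1, gps2:   (X, Y, Z) world coordinates of the two antennas on the elephant trunk.
    cam_offset_z: assumed vertical offset of the camera below the antenna line (m);
                  this value is illustrative only.
    """
    gps1 = np.asarray(gps1, dtype=float)
    gps2 = np.asarray(gps2, dtype=float)

    # Direction of the elephant trunk, used as the camera image x-axis direction v.
    v = gps2 - gps1
    v = v / np.linalg.norm(v)

    # Assume the camera is mounted midway between the antennas, offset vertically.
    cam = (gps1 + gps2) / 2.0 + np.array([0.0, 0.0, cam_offset_z])
    return cam, v

# Example usage with made-up antenna coordinates:
cam_xyz, v = camera_pose_from_gps((187.0, 80.0, 25.0), (190.0, 80.0, 25.0))
```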
Due to the inherent shortcomings of visual cameras, accurately locating small targets in complex scenes, such as loading and unloading operations at bulk cargo terminals, is challenging. The 3D coordinate positioning method based solely on monocular machine vision can be significantly affected by pixel-level deviations in target recognition results, which seriously impacts the accuracy of automated loading and unloading operations at bulk cargo terminals.
In the automatic loading and unloading process at bulk cargo terminals, O_1, O_2, O_3, and O_4 are the four corner points of the bulk unloading hopper; O_H is the center point of the bulk unloading hopper, with world coordinates (X_H, Y_H, Z_H); and O_P is the predicted center point of the bulk unloading hopper, with world coordinates (X_P, Y_P, Z_P), as shown in Figure 3.
The prediction accuracy of the bulk unloading hopper is defined as a percentage based on the geometric error relative to the hopper size. The error distance, defined in Equation (1), is the Euclidean distance between the predicted and the true hopper center points.
d = \sqrt{(X_P - X_H)^2 + (Y_P - Y_H)^2 + (Z_P - Z_H)^2}    (1)
To determine the reference dimension of the bulk unloading hopper, the side length L of the hopper is a reasonable choice. As the hopper is square, half of its diagonal is taken as the reference scale, denoted L_reference, which represents the maximum extent of the hopper.
L_{\mathrm{reference}} = \frac{\sqrt{2}}{2} L    (2)
The predicted positioning accuracy P is designed to be inversely related to the error, i.e., the larger the error, the lower the accuracy:
P = \begin{cases} \left(1 - \dfrac{d}{L_{\mathrm{reference}}}\right) \times 100\%, & d \le L_{\mathrm{reference}} \\ 0, & d > L_{\mathrm{reference}} \end{cases}    (3)
Taking the bulk cargo terminal of Tianjin Port as an example, the side length L of the bulk unloading hopper is 7 m, and the predicted positioning accuracy P under this parameter is expressed in Equation (4).
P = \left(1 - \frac{\sqrt{2}\, d}{7}\right) \times 100\%    (4)
At the Tianjin Port operation site, the maximum allowable operational error between the center points of the grab and the bulk unloading hopper is 50 cm, while the actual maximum average error for manual operations can reach 35 cm. According to the calculation based on Formula (4), the allowable operational accuracy for the center points of the grab and the hopper at Tianjin Port is 90%, while the actual average operational accuracy for manual operations reaches as high as 93%. Therefore, to ensure accuracy and safety, this paper adopts the average operational accuracy of manual operations, which is 93%, as the requirement for predictive positioning accuracy of automated bulk cargo terminal handling equipment.
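For reference, the accuracy metric in Equations (1)-(4) is straightforward to compute; the short Python sketch below is a minimal illustration using variable names of our own choosing, with the 7 m side length taken from the Tianjin Port example.

```python
import math

def positioning_accuracy(pred, actual, side_length=7.0):
    """Prediction accuracy P (%) from Equations (1)-(3).

    pred, actual: (X, Y, Z) world coordinates of the predicted and true hopper centers.
    side_length:  hopper side length L in meters (7 m at the Tianjin Port terminal).
    """
    # Equation (1): Euclidean distance between predicted and true center points.
    d = math.dist(pred, actual)
    # Equation (2): reference scale, half of the square hopper's diagonal.
    l_ref = math.sqrt(2) / 2 * side_length
    # Equation (3): accuracy decreases linearly with the error, clamped at 0.
    return max(0.0, 1.0 - d / l_ref) * 100.0

# A 0.35 m error corresponds to roughly 93% accuracy, the manual-operation benchmark.
print(positioning_accuracy((187.864, 82.418, 7.1), (187.900, 82.479, 7.1)))
```

With the sample coordinates shown, the result (about 98.6%) reproduces, up to rounding, the first row of Table 4.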
At the same time, improving the neural network recognition algorithm for small and weak targets would significantly increase processing time, which is intolerable in the automated loading and unloading process at bulk cargo terminals and would introduce serious safety hazards during operation.
Therefore, we have designed a three-stage bulk unloading hopper positioning method based on machine vision. The main method architecture is shown in Figure 4. In the coarse positioning stage, the bulk unloading hopper is identified by a lightweight and efficient visual neural network during operation, and the cropped target is passed to the fine positioning stage. In the fine positioning stage, a linear fitting method is used to determine the exact position of the target corner within the coarse positioning detection frame according to the geometric characteristics of the target. This position is then mapped to the coordinate system of the original image, providing accurate input data for the subsequent three-dimensional coordinate solving stage. In the three-dimensional coordinate solving stage, the image coordinates are back-projected to the world coordinate system using the camera’s intrinsic and extrinsic parameter matrices, and the deviation between the grab and the bulk unloading hopper is calculated.

3.2. The Coarse Positioning Stage

For the detection task during automated loading and unloading operations at bulk cargo terminals, achieving real-time detection while maintaining accuracy is crucial. This necessitates lightweight optimization of the model to strike a balance between ensuring accuracy and preventing significant degradation. Such optimization is essential to facilitate real-time synchronous processing of video streams from multiple gantry cranes in future implementations.
YOLOv5, proposed by Jocher, G. et al. [20], adopts the one-stage object detection paradigm. We apply YOLOv5 to detect the position of the bulk unloading hopper in the actual image; its network structure is shown in Figure 5. The YOLOv5 network architecture consists of three main components: the backbone, neck, and head. The CBS module comprises a convolutional layer followed by batch normalization and SiLU activation. The CSP1_X module evolves from CSPNet and includes CBL modules, several residual modules, convolutional layers, and concatenation operations, with 'X' indicating the number of such modules. Similarly, the CSP2_X module is also derived from CSPNet, consisting of a convolutional layer and 'X' CBL modules. This architecture balances detection accuracy and detection speed.
YOLOv5 includes five different models: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The primary differences among these models lie in the depth and width of the network, as well as the number of channels. The designations n, s, m, l, and x represent models of different sizes, with 'n' being the smallest and 'x' the largest. The official performance comparison of each version on the COCO public dataset is shown in Table 1 [20]. Although YOLOv5x achieves the best detection performance, its high computational complexity leads to slower inference due to its deep network architecture. In contrast, a lightweight model is more suitable for industrial applications ported to embedded devices. Therefore, we choose YOLOv5n as the base model in the coarse localization phase.

3.2.1. The Shufflenet V2 Optimized Yolov5 Backbone Network

We adopt a strategy of alternately stacking the basic and downsampling modules of ShuffleNet V2 [22] to replace the original C3 and Conv modules in YOLOv5, reducing model size and increasing inference speed. In the architecture of ShuffleNet V2, the authors introduce an innovative channel splitting technique that splits the input feature channels (c) into two branches with (c − c′) and c′ channels, respectively, where c′ usually takes the value of half of c. This splitting operation avoids the use of group convolution, thus reducing the Memory Access Cost (MAC) [22]. The network structures of the basic module and the downsampling module are shown in Figure 6 and Figure 7.
Of these two branches, one remains unchanged as a residual connection (identity branch), while the other is processed by three convolutional layers: two 1 × 1 convolutions and a 3 × 3 depthwise convolution (DWConv), all of which keep the same number of input and output channels to ensure consistent channel widths. It is worth noting that these 1 × 1 convolutions do not use group convolution, both to avoid the increase in MAC that group convolution causes and because the channel splitting operation itself already creates two separate channel groups.
The processed branches are subsequently merged through a concatenation (Concat) operation, which increases the number of channels in the network. This step ensures that information between different branches can be exchanged through the subsequent Channel Shuffle operation, which physically reorders the channels. The Channel Shuffle operation not only facilitates the flow of information between different groups but also enhances the network’s expressiveness without adding extra computational burden. Ultimately, through this channel splitting and shuffling, ShuffleNet V2 efficiently integrates and exchanges information while maintaining high computational efficiency, achieving an excellent balance of speed and accuracy across various levels of computational complexity.
To improve detection accuracy while keeping the model lightweight, the basic and downsampling modules of ShuffleNet V2 are alternately stacked in place of the original C3 and Conv modules in YOLOv5, forming an efficient network structure. This design not only reduces the computational burden of the model but also enhances its target detection performance through precise feature extraction. In this revised structure, the basic module of ShuffleNet V2 is responsible for extracting rich feature information, while the downsampling module reduces the spatial resolution of the feature map via a convolution with a stride of 2 and increases the number of channels to facilitate feature fusion and information transfer across levels. This alternate stacking strategy enables the network to capture multi-scale target information while maintaining a low parameter count, resulting in improved performance in target detection tasks.
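For reference, a condensed PyTorch sketch of the ShuffleNet V2 basic unit described above is given below (channel split, a 1 × 1 - DWConv - 1 × 1 branch, concatenation, and channel shuffle). It follows the description in [22] and is not the exact module definition used in our code.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    # Reorder channels so information flows between the two branches after Concat.
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class ShuffleV2Basic(nn.Module):
    """ShuffleNet V2 basic unit (stride 1): split -> conv branch -> concat -> shuffle."""
    def __init__(self, channels):
        super().__init__()
        branch_c = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(branch_c, branch_c, 1, bias=False),
            nn.BatchNorm2d(branch_c), nn.ReLU(inplace=True),
            # 3x3 depthwise convolution; channel count is kept unchanged.
            nn.Conv2d(branch_c, branch_c, 3, padding=1, groups=branch_c, bias=False),
            nn.BatchNorm2d(branch_c),
            nn.Conv2d(branch_c, branch_c, 1, bias=False),
            nn.BatchNorm2d(branch_c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Channel split: the identity branch passes through untouched.
        identity, processed = x.chunk(2, dim=1)
        out = torch.cat((identity, self.branch(processed)), dim=1)
        return channel_shuffle(out)
```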

3.2.2. Gsconv Convolutional Module Optimized Yolov5 Neck Part

Despite upgrading the backbone network to ShuffleNet V2 to reduce the computational burden of the model while maintaining detection accuracy, the network still contains a large number of convolutional operations, which consume a significant amount of computational resources to some extent. Although ShuffleNet V2 reduces computational load through pointwise convolution and channel shuffling, it still exhibits notable deficiencies in the depthwise separable convolution process.
Firstly, the separation of spatial and channel convolutions in depthwise separable convolution results in the fragmentation of feature information during processing, adversely affecting feature representation and fusion. This issue is particularly pronounced in complex scenarios or tasks requiring high-precision feature extraction, leading to a significant decline in model performance.
Secondly, while ShuffleNet V2’s structural design simplifies the computational process, it does not fully address the problem of efficient feature fusion, still necessitating substantial computational resources to maintain performance.
Furthermore, in practical applications, especially when handling high-resolution images or requiring real-time processing, the inefficiency of ShuffleNet V2 remains prominent. Its computational cost and resource consumption cannot be overlooked when pursuing higher detection accuracy.
GSConv [24], proposed by Li, H. et al., combines the features of depthwise separable convolution and standard convolution. The core idea of GSConv is to concatenate the output feature maps of these two types of convolution operations. This connection aims to realize feature complementation and enhancement, thereby improving the overall effectiveness of the model in capturing and processing diverse information from the input data. Subsequently, by introducing a shuffle mechanism, GSConv facilitates the exchange of information between the depthwise separable convolution and standard convolution feature maps, significantly reducing the computational complexity of the model while maintaining its accuracy.
Depthwise separable convolution effectively reduces the number of parameters and the computational cost of the model by decomposing the standard convolution into two steps: depthwise convolution and pointwise convolution. However, this decomposition also leads to a loss of information between feature channels, because the feature map of each channel is processed independently. In contrast, standard convolution maintains the connections between channels, but at a high computational expense. GSConv reduces the computational burden of the model while maintaining accuracy by concatenating the output feature maps of the depthwise separable convolution and the standard convolution, and by rearranging the information between channels through shuffle operations. This design retains the computational efficiency of depthwise separable convolution while improving the expressive power of the model through the fusion of inter-channel information from the standard convolution.
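A schematic PyTorch sketch of a GSConv-style block, following the description above (a standard convolution, a depthwise branch on its output, concatenation, and a channel shuffle), is shown below. Consult [24] for the authoritative definition; details such as kernel sizes and activations here are assumptions.

```python
import torch
import torch.nn as nn

class GSConvSketch(nn.Module):
    """GSConv-style block: half the output channels come from a standard convolution,
    the other half from a depthwise convolution of that result; the two halves are
    concatenated and shuffled so inter-channel information is mixed."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        c_half = c_out // 2
        self.std_conv = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU(inplace=True))
        self.dw_conv = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU(inplace=True))

    def forward(self, x):
        y1 = self.std_conv(x)                    # standard convolution branch
        y2 = self.dw_conv(y1)                    # depthwise branch
        y = torch.cat((y1, y2), dim=1)           # concatenate the two halves
        # Shuffle: interleave channels from the two branches.
        n, c, h, w = y.shape
        return y.view(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)
```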

3.2.3. The Simplified Yolov5 Detection Head

In the original YOLOv5 model, the detection head consists of three different scale detection heads, each responsible for detecting targets at different scales. This design is to ensure that the model can effectively detect targets of large, medium and small scales. However, in the fixed camera operation scenario of a bulk terminal, the need for detection of small targets is relatively low because the scale of the bulk unloading hopper targets does not vary much. Therefore, in this paper, the detection head of YOLOv5 is simplified by removing the detection head for small target detection to reduce the complexity and computational burden of the model. The simplified YOLOv5 model will contain only two detection heads, which are responsible for the detection of medium-scale and large-scale targets, respectively.

3.2.4. The Improved Yolov5 Framework

By integrating the following elements: a backbone based on ShuffleNet V2 basic modules and down sampling modules, using GSConv convolutional modules to replace traditional convolutions in the neck part, and removing the small object detection head in the head part, we have constructed the YOLOv5-slim model.
Specifically, we reconstructed the backbone network by employing the basic modules and downsampling modules of ShuffleNet V2. This structural design aims to reduce the number of parameters and computational cost, making the network more compact and thus more effective on resource-constrained on-site devices. ShuffleNet V2 achieves significant reductions in computation and improvements in efficiency through the introduction of depthwise separable convolutions, making it particularly suitable for environments with limited computational power, such as mobile devices. However, the depthwise separable convolutions separate channel information during computation, adversely affecting the capability for feature extraction and fusion, thereby significantly reducing the accuracy of these processes.
While ShuffleNet V2 offers clear advantages in terms of parameter reduction and computational cost, its limitations in feature extraction and fusion remain a critical issue. Although depthwise separable convolutions effectively lower computational complexity, the separation of channel information during convolution reduces inter-channel information exchange, impairing the comprehensive representation of feature maps and consequently affecting the overall model performance. To address this shortcoming, we introduced the GSConv module in the neck network to replace traditional convolution modules.
The GSConv module enhances information exchange between different convolutional feature maps by rearranging channel orders. Specifically, GSConv promotes cross-channel information sharing and fusion by reordering the channels, thereby improving feature extraction and fusion. This modification not only enhances the comprehensive representation of feature maps but also significantly improves the overall performance of the model while maintaining a lower computational complexity. Through this approach, the GSConv module achieves a balance between model precision and inference efficiency, making it suitable for deployment on resource-constrained devices.
By adopting the basic modules and downsampling modules of ShuffleNet V2, we achieved a compact and efficient network structure. However, to address the limitations in feature extraction and fusion capabilities inherent in depthwise separable convolutions, we introduced the GSConv module. The GSConv module enhances inter-channel information exchange by rearranging channel orders, significantly improving feature extraction and fusion capabilities. Ultimately, this integration of ShuffleNet V2 and the GSConv module, both modified and incorporated, works synergistically, achieving an ideal balance between model precision and inference efficiency.
In the context of fixed camera operations at bulk cargo terminals, where the target scale variation is minimal, we further reduced the model complexity by removing the small object detection head. The structure of the proposed coarse localization network is shown in Figure 8.
Compared to the traditional YOLOv5 model, the YOLOv5-slim model has fewer parameters and lower complexity, while still balancing model accuracy and inference efficiency. This makes it suitable for the low-power recognition requirements of automated loading and unloading operations in the context of bulk cargo terminals.
With the proposed YOLOv5-slim lightweight neural network, we are able to recognize the bulk unloading hopper in the operation video in real time. The recognition slices of the bulk unloading hopper are then transmitted to the fine localization stage through post-processing to complete the corner recognition acquisition.

3.3. The Fine Positioning Stage

The location of the bulk unloading hopper detected by the deep learning model is shown in Figure 9. The edge features of the image are the key to extracting the target corner point information. Due to lighting factors and the target's own structure, the edge features contain considerable noise, so directly determining the corner point coordinates produces a large error. Therefore, the edge information extracted by the Canny algorithm [25] and the binarized image obtained after Gaussian filtering are considered together, and the corner coordinates are determined with the geometric features of the outline itself as constraints.

3.3.1. Extract Edge Information

Canny edge detection detects edge information in an image by computing the gradient magnitude and direction of each pixel. The Sobel operator, which has relatively low computational complexity [26], is used, with Sobel kernels G_x and G_y representing the gradients in the x and y directions, respectively. Convolving these kernels with the input image yields the gradients in the horizontal and vertical directions, where d_x denotes the horizontal gradient and d_y the vertical gradient.
G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}, \quad d_x = f(x, y) * G_x(x, y), \quad d_y = f(x, y) * G_y(x, y)    (5)
d denotes the gradient magnitude at point (x, y), and θ represents the gradient direction at point (x, y).
d = \sqrt{d_x^2 + d_y^2}, \qquad \theta = \arctan\frac{d_y}{d_x}    (6)
By selecting high and low thresholds, edge pixels with weaker gradients are filtered out while edge pixels with stronger gradient values are preserved. If an edge pixel's gradient value is higher than the high threshold, it is marked as a strong edge pixel. If an edge pixel's gradient value is smaller than the high threshold but larger than the low threshold, it is marked as a weak edge pixel. If an edge pixel's gradient value is smaller than the low threshold, it is suppressed. The target dataset was sampled to determine the optimal thresholds for Canny edge detection, selecting the values that gave the best edge extraction over 50 sampled images. In clear daylight conditions, the double threshold for Canny edge detection is set to (154, 208), as depicted in Figure 10.
The noise in the edge features extracted by the Canny algorithm primarily originates from edge features inside the target. To obtain cleaner edge features of the target, the binarized brightness values at corresponding coordinates are combined with the Canny edge image, after 2D Gaussian filtering and morphological dilation, using a bitwise sum operation, as shown in Figure 11. The formulas for 2D Gaussian filtering and binarization are provided in Equation (7).
H_{ij} = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(x_i - u_x)^2 + (y_i - u_y)^2}{2\sigma^2}\right), \qquad dst(x, y) = \begin{cases} maxVal, & \text{if } src(x, y) > thresh \\ 0, & \text{otherwise} \end{cases}    (7)
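The edge-extraction pipeline described above can be sketched with OpenCV as follows (Gaussian filtering, binarization, Canny detection with the (154, 208) thresholds, morphological dilation, and a bitwise combination). The kernel sizes and binarization threshold here are illustrative, and the text's "bitwise sum" is interpreted as a bitwise AND that keeps edges falling on bright hopper regions.

```python
import cv2
import numpy as np

def hopper_edge_map(crop_bgr, canny_low=154, canny_high=208, bin_thresh=128):
    """Combine a Canny edge map with a binarized image to suppress interior noise."""
    gray = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)                 # 2D Gaussian filtering, Eq. (7)
    _, binary = cv2.threshold(blurred, bin_thresh, 255,
                              cv2.THRESH_BINARY)                # binarization, Eq. (7)
    edges = cv2.Canny(blurred, canny_low, canny_high)           # double-threshold Canny
    dilated = cv2.dilate(edges, np.ones((3, 3), np.uint8))      # morphological dilation
    return cv2.bitwise_and(dilated, binary)                     # combine the two maps
```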

3.3.2. Hough Transform for Corner Detection

After image processing, the edge features of the source image are more completely preserved. However, since the image feature contours are not continuous, conventional corner point detection algorithms cannot be used. Instead, straight line fitting and the target’s geometric features are utilized to extract the corner coordinates.
  • Employ the Hough transform to fit all sufficiently long segments, identify the longest segment among them, and take the line on which this segment lies as an edge of the bulk unloading hopper. Figure 12a illustrates the segments fitted by the Hough transform, while Figure 12b depicts the position of the line containing the longest segment.
  • Since the camera coordinates are not fixed, rotation may cause the bulk unloading hopper to appear as an irregular rectangle. Based on the slope of the longest line segment, find the longest line segment that is approximately perpendicular to it. The slope threshold range is set to [−5, 5].
  • Based on the slopes of the two intersecting sides and the center of the image, divide the original image into four parts. Using the slope of the known straight line as a constraint, find the other two symmetric edges. The slope range is set to [−0.05, 0.05].
  • If the area of the resulting closed rectangle is less than 60% of the predicted bounding box, set the pixel values of the longest straight line fitted in step 1 to 0 and restart from step 1. The result of the straight-line fitting is shown in Figure 12 (a minimal code sketch of the line-fitting step is given after this list).
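The sketch below illustrates the line-fitting step in the procedure above, using OpenCV's probabilistic Hough transform to collect segments and pick the longest one. The vote, length, and gap thresholds are illustrative values, not the parameters used on site.

```python
import cv2
import numpy as np

def longest_hough_segment(edge_map, min_len=80, max_gap=10):
    """Fit line segments on the edge map and return the longest one as (x1, y1, x2, y2)."""
    segments = cv2.HoughLinesP(edge_map, rho=1, theta=np.pi / 180, threshold=60,
                               minLineLength=min_len, maxLineGap=max_gap)
    if segments is None:
        return None  # handled by the fallback described later in this subsection
    # Choose the segment with the greatest Euclidean length.
    return max((seg[0] for seg in segments),
               key=lambda s: np.hypot(s[2] - s[0], s[3] - s[1]))
```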
In the image after coarse localization and cropping, the pixel coordinate system by default has its origin at the top-left corner of the image. For the target coordinates (u_i, v_i) in the original image and the target coordinates (u_i', v_i') in the roughly localized (cropped) image frame, the following relation holds between the two:
u_i = u_i' + u_0, \qquad v_i = v_i' + v_0    (8)
Here, (u_0, v_0) denotes the coordinates of the upper-left corner of the roughly localized image frame in the original image. In the subsequent step, the transformed coordinates (u_i, v_i) of the four corner points are used as inputs to the 3D coordinate solving stage to realize the mapping from target pixel coordinates to world coordinates.
In practical work scenarios, the algorithm may encounter situations where straight lines cannot be detected, making it impossible to accurately determine the positions of the four corner points. When this occurs occasionally, the system automatically disregards the detection result and instead uses the current coordinates from the downward-facing camera, the yaw angle, and the coordinates from the last successful positioning to produce a reasonable estimate of the central position of the bulk unloading hopper. This ensures the coherence of the overall system workflow. If the situation persists, the system notifies remote operators to intervene and ensure the normal operation of cargo positioning tasks.

3.4. The Three-Dimensional Coordinate Solving Stage

Given the known world coordinates of the camera (X_C, Y_C, Z_C), the two-dimensional coordinates of the four corner points of the bulk unloading hopper in the camera image are (u_1, v_1), (u_2, v_2), (u_3, v_3) and (u_4, v_4). The height of the bulk unloading hopper in the field environment is known to be H. The current orientation of the elephant trunk in the world coordinate system is calculated from GPS1 and GPS2, providing the direction vector v of the camera image x-axis in the world coordinate system. The coordinates of the four corner points of the bulk unloading hopper in the world coordinate system are assumed to be (X_1, Y_1, H), (X_2, Y_2, H), (X_3, Y_3, H) and (X_4, Y_4, H), respectively.

3.4.1. PNP Algorithm Inverse Solution for World Coordinates

Using the camera's intrinsic matrix K and the extrinsic matrix [R|t], we back-project from image coordinates to world coordinates. First, the relationship between the world coordinate system and the image coordinate system is established, as shown in Equation (9).
s \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = K \, [R \mid t] \begin{bmatrix} X_i \\ Y_i \\ H \\ 1 \end{bmatrix}    (9)
where s is a scale factor. Expanding Equation (9) yields Equation (10):
s \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix} \begin{bmatrix} X_i \\ Y_i \\ H \\ 1 \end{bmatrix}    (10)
In the equation, f_x and f_y represent the focal lengths along the x-axis and y-axis of the image, respectively, and c_x and c_y denote the coordinates of the principal point on the image plane, typically expressed in pixels. The camera extrinsic matrix [R|t] combines the rotation matrix R and the translation vector t; together they describe the complete transformation from the world coordinate system to the camera coordinate system. In particular, the rotation matrix R represents the orientation transformation from the world coordinate system to the camera coordinate system. The direction vector v = (X_v, Y_v, Z_v) of the camera image x-axis in the world coordinate system is taken as the first column vector of the rotation matrix. However, to form a valid rotation matrix, the following conditions must be satisfied: the column vectors of the matrix must be unit vectors, and they must be mutually orthogonal.
Therefore, the first vector is normalized to the unit vector d_1:
d_1 = \frac{(X_v, Y_v, Z_v)}{\sqrt{X_v^2 + Y_v^2 + Z_v^2}}    (11)
Next, choose an arbitrary vector u that is not parallel to the first vector and compute the cross product of this vector with the first vector. The result of this cross product is perpendicular to both original vectors, ensuring orthogonality. The resulting vector is then normalized to obtain the second unit vector:
d_2 = \mathrm{normalize}(d_1 \times u)    (12)
where normalize means dividing the vector by its modulus to ensure that it becomes a unit vector.
To ensure that the three constructed vectors are mutually orthogonal, compute the cross product of d_1 and d_2. This vector is naturally orthogonal to the first two. Normalizing the result gives the third unit vector:
d_3 = \mathrm{normalize}(d_1 \times d_2)    (13)
By following the above steps, three vectors that are orthogonal to each other and of unit length are obtained. These vectors form the three column vectors of the rotation matrix, each representing the direction of one axis in the new coordinate system. This rotation matrix ensures that lengths and angles remain invariant during coordinate transformations.
The rotation matrix R can be expressed as:
R = \begin{bmatrix} d_1 & d_2 & d_3 \end{bmatrix}    (14)
The translation vector t represents the position of the camera coordinate origin in the world coordinate system. Given the camera position (X_C, Y_C, Z_C) in the world coordinate system, t can be expressed as:
t = \begin{bmatrix} X_C \\ Y_C \\ Z_C \end{bmatrix}    (15)
The resulting extrinsic matrix [R|t] is
[R \mid t] = \begin{bmatrix} d_1 & d_2 & d_3 & t \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & X_C \\ r_{21} & r_{22} & r_{23} & Y_C \\ r_{31} & r_{32} & r_{33} & Z_C \end{bmatrix}    (16)
Substituting the extrinsic matrix [R|t] into the original equation gives the system of equations for the four corner points of the bulk unloading hopper in the world coordinate system:
s \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = \begin{bmatrix} f_x (r_{11} X_i + r_{12} Y_i + r_{13} H + X_C) + c_x (r_{31} X_i + r_{32} Y_i + r_{33} H + Z_C) \\ f_y (r_{21} X_i + r_{22} Y_i + r_{23} H + Y_C) + c_y (r_{31} X_i + r_{32} Y_i + r_{33} H + Z_C) \\ r_{31} X_i + r_{32} Y_i + r_{33} H + Z_C \end{bmatrix}    (17)
Solving this system of equations yields the coordinates of the four corner points of the bulk unloading hopper in the world coordinate system: (X_1, Y_1, H), (X_2, Y_2, H), (X_3, Y_3, H) and (X_4, Y_4, H).
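Because the plane height H is known, Equation (17) reduces to a 2 × 2 linear system in (X_i, Y_i) for each corner point once the scale factor s is eliminated. The NumPy sketch below follows the formulation of Equations (11)-(17) literally, with K, the camera position, the direction vector v, and H supplied by the caller; it is a minimal illustration rather than the deployed implementation.

```python
import numpy as np

def rotation_from_x_axis(v):
    """Build a rotation matrix whose first column is the normalized direction v,
    following Equations (11)-(14)."""
    d1 = np.asarray(v, float) / np.linalg.norm(v)
    u = np.array([0.0, 0.0, 1.0])                 # arbitrary vector not parallel to d1
    if abs(np.dot(d1, u)) > 0.99:
        u = np.array([0.0, 1.0, 0.0])
    d2 = np.cross(d1, u); d2 /= np.linalg.norm(d2)
    d3 = np.cross(d1, d2); d3 /= np.linalg.norm(d3)
    return np.column_stack((d1, d2, d3))

def corner_world_xy(uv, K, R, cam_xyz, H):
    """Solve Equation (17) for (X, Y) of one corner point at known height H."""
    u, v = uv
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    Xc, Yc, Zc = cam_xyz
    r = R
    # Eliminating s gives a linear system A @ [X, Y] = b for each image point.
    A = np.array([
        [fx * r[0, 0] - (u - cx) * r[2, 0], fx * r[0, 1] - (u - cx) * r[2, 1]],
        [fy * r[1, 0] - (v - cy) * r[2, 0], fy * r[1, 1] - (v - cy) * r[2, 1]],
    ])
    b = np.array([
        (u - cx) * (r[2, 2] * H + Zc) - fx * (r[0, 2] * H + Xc),
        (v - cy) * (r[2, 2] * H + Zc) - fy * (r[1, 2] * H + Yc),
    ])
    X, Y = np.linalg.solve(A, b)
    return X, Y, H
```

Applying `corner_world_xy` to the four image corners and averaging the results gives the hopper center of Equation (18) in the next subsection.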

3.4.2. Locate the Center Coordinates of the Bulk Unloading Hopper

The world coordinates (X_h, Y_h, Z_h) of the center point of the bulk unloading hopper are determined from the coordinates of its four corner points in the world coordinate system, which are obtained with the Perspective-n-Point (PnP) algorithm:
X_h = \frac{X_1 + X_2 + X_3 + X_4}{4}, \qquad Y_h = \frac{Y_1 + Y_2 + Y_3 + Y_4}{4}, \qquad Z_h = H    (18)

4. Experiment and Evaluation

4.1. Experimental Environment and Equipment Configuration

The experiment described in this paper was conducted at the Tianjin Port bulk cargo terminal, as shown in Figure 13. Two GPS units were installed on the gantry crane’s elephant trunk. A vertical vision camera was mounted on the maintenance platform at the tip of the elephant trunk. The camera model is LDS-BP0B96 (2.8 mm), and its main parameters are listed in Table 2.
Ubuntu 20.04 serves as the underlying operating system for the experimental ecosystem. The equipment used for training includes 16 GB RAM, an NVIDIA RTX 3090 GPU, and an Intel i5-12400F CPU. The edge computing device is a WIPC-707 industrial computer equipped with a GeForce GTX 1660 graphics card with 6 GB of video memory. The software framework comprises torch 1.12.1 + cu113 in conjunction with Anaconda. In our experimental setup, we configure the following parameters: an image size of 1280, 300 training epochs, and a batch size of 8. We employ the SGD optimizer and do not use cache-based data loading. This setup ensures consistency and precision in our experimentation process.

4.2. Experimental Data Collection

Due to the lack of publicly available datasets for bulk unloading hopper operations at bulk cargo terminals, it is necessary to create a custom dataset. This involves capturing a large number of photos in the operational scenes of bulk unloading hoppers at the terminal, followed by image preprocessing and annotation. For this purpose, we collected 1800 photos of bulk unloading hoppers in operation under various weather conditions, and annotated only the class of bulk unloading hopper with the label “Hopper”. The images in the dataset were obtained using a vertical camera mounted on the tip of the gantry crane’s elephant trunk, capturing the live operating scene from above. The resolution of the pre-processed images was 1920 × 1080.
Due to the complexity of the loading and unloading environment at the bulk cargo terminal, the acquired image data needed to be expanded to increase the number and spatial diversity of the dataset. This included applying techniques such as horizontal flipping and luminance enhancement to enhance the learning capability of the deep neural network. After expanding the dataset, a total of 3000 images were collected. The dataset was divided into training and testing sets: the training set was used for training the algorithm models, while the testing set was employed to test and analyze the model performance. The split between the training and testing sets was 7:3. Figure 14 illustrates examples of the photos used in the dataset.
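The horizontal flipping and luminance enhancement used to expand the dataset can be reproduced with simple OpenCV operations, as in the sketch below; the brightness gain is an illustrative value, not the one used to build our dataset.

```python
import cv2
import numpy as np

def augment(image_bgr, brightness_gain=1.3):
    """Return two augmented variants of an input image: a horizontal flip and a
    luminance-enhanced copy (gain applied to the V channel in HSV space)."""
    flipped = cv2.flip(image_bgr, 1)  # horizontal flip

    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 2] = np.clip(hsv[..., 2] * brightness_gain, 0, 255)
    brightened = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
    return flipped, brightened
```

When flipping is applied, the corresponding bounding-box annotations must of course be mirrored as well.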
The GPS data are recorded in real time during operation, and the sample contains the X, Y and Z positioning coordinates of two GPSs on the elephant trunk.

4.3. Experimental Results

In the comparison experiments of the coarse positioning stage, we evaluated the performance of our method against YOLOv5, which was used as the baseline. Both the baseline model and our model were trained on the same training set. Since different training parameters can affect the final results, the training parameters of the baseline model were set mainly according to its recommended default values, with only minor adjustments. The results in Table 3 show that, in the bulk cargo terminal loading and unloading operation scenario, our proposed method achieves a substantial increase in detection speed, reaching approximately 102 FPS compared with 67 FPS for the YOLOv5n baseline, without a significant loss in detection accuracy. These improvements lay a solid foundation for future implementations where video streams from multiple gantry cranes will be processed simultaneously on a single server.
In practical scenarios where the conditions do not allow for the use of multiple GPUs for parallel processing, a single server GPU’s inference capability can support up to 67 FPS when executing YOLOv5n model inference tasks. In the actual automated bulk cargo terminal operation, multiple gantry crane video streams are processed in real time by a single server, placing a high computational load on it. Although theoretically, a single server can handle two 30 FPS video streams, practical implementation requires consideration of resource competition, data transmission, and load balancing issues.
Given the limited computational resources, optimizing the inference model to enhance the processing capability of a single GPU is crucial for achieving higher inference efficiency. The improved inference model discussed in this paper achieves a processing speed of 102 FPS. This means that the same server can simultaneously handle the real-time inference tasks of three gantry cranes with 30 FPS video streams, while reserving 12% of the GPU capacity to handle unexpected situations.
During the actual deployment at the site, our engineering team further quantized the model to int8, which significantly enhanced the GPU processing limit. Consequently, the system can meet the simultaneous operational inference requirements of 5 to 6 automated gantry cranes. These optimizations not only improve processing efficiency but also ensure the accuracy of operations, enabling real-time inference tasks for multiple video streams.
The following experiments take sample test videos of the loading and unloading operation process at the Tianjin Port bulk cargo terminal as an example.
In the on-site experiments, we selected three bulk unloading hoppers at different coordinates. Ten camera positions were set based on the heading angle of the grab-type gantry crane to predict the coordinates of the bulk unloading hoppers. Video streams were captured under varying weather conditions and changes in lighting. The positions of the hoppers in the camera images under different heading angles are shown in Figure 15.
By comparing the actual world coordinates of the bulk unloading hoppers with the predicted coordinates and calculating their accuracy at the scale of the bulk unloading hopper, we obtained the results shown in Table 4. The true coordinates of the center of the bulk unloading hopper were determined using RTK measurement, as shown in Figure 16.
During the automated loading and unloading operations at the bulk cargo terminal, we employed this positioning method. We conducted a total of 30 coordinate positioning tests at different coordinates and heading angles, as depicted in Figure 17. The data obtained are summarized in Table 4.
Using this method, the average positioning error of the bulk unloading hopper in the automated operations was 0.20 m, with an average positioning accuracy of 96.06%. This exceeds the 90% alignment accuracy required for the operation as well as the 93% average accuracy achieved in manual operations, demonstrating that the alignment accuracy of this method has reached and surpassed the level achievable by manual operations. This further validates the effectiveness of the proposed approach.

5. Conclusions

This paper proposes a monocular vision-based, three-stage efficient positioning method for high-precision deviation positioning of bulk unloading hoppers at automated bulk cargo terminals. Accurate measurement of the 3D coordinates of the bulk unloading hopper is achieved using an improved YOLOv5 model and image processing techniques combined with the PnP algorithm. Experimental results show that this method significantly improves the detection speed to 102 FPS without a significant loss of detection accuracy. This detection speed allows a single server to conduct detection for multiple unloading operations simultaneously, thereby reducing the hardware costs of the system.
Furthermore, applying this positioning method to automated loading and unloading operations at the bulk cargo terminal resulted in an average positioning error of 0.20 m over thirty tests, with an average positioning accuracy of 96.06%. This exceeds the 93% technical benchmark of manual operations. These results demonstrate that the method not only improves operational efficiency but also surpasses the accuracy and reliability of manual operations, indicating a wide range of practical application prospects. For the practical deployment in the field, we also performed int8 quantization on the model. This process converts the floating-point weights in the model to 8-bit integers, significantly reducing the model’s size and increasing its inference speed, making it more suitable for field operations.
Indeed, the robustness of the system still requires improvement. The harsh working conditions at bulk cargo terminals, such as dust covering the cameras, can affect the overall stability of the system. Additionally, the swinging motion of the bulk cargo grab is an important factor to consider during unloading operations. In the future, bulk cargo terminals will need to focus on maintaining stable operation, achieving rapid and precise positioning, and integrating the measurement of the bulk grab’s posture with automatic control systems to realize efficient automation of gantry crane unloading operations.

Author Contributions

Conceptualization, Z.S. and J.W.; methodology, Z.S. and J.W.; software, Z.S., J.W. and L.Z.; validation, Y.Z.; formal analysis, C.M. and Y.S.; investigation, Z.S., J.W., Y.Z. and L.Z.; resources, C.M. and Y.S.; data curation, Z.S., Y.Z. and Y.S.; writing—original draft, Z.S., J.W. and L.Z.; writing—review and editing, Y.Z., C.M. and Y.S.; visualization, Y.Z. and L.Z.; supervision, Y.S.; project administration, Z.S. and Y.S.; funding acquisition, C.M. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Education Science Research Project of Shanghai Municipality (No. B2023003) and the Science and Technology Commission of Shanghai Municipality (No. 22ZR1427700).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study did not involve any public datasets.

Conflicts of Interest

Authors Chao Mi and Yang Shen were employed by the company Shanghai SMUVision Smart Technology Ltd. The remaining authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Balci, G.; Cetin, I.B.; Esmer, S. An evaluation of competition and selection criteria between dry bulk terminals in Izmir. J. Transp. Geogr. 2018, 69, 294–304.
  2. Zhang, G.; Xiang, Y. Study on Control System of Bridge-Type Grab Ship Unloader. J. Phys. Conf. Ser. 2023, 2483, 012052.
  3. Sun, J. Enhancing the Intelligent Application of Dry Bulk Cargo Terminals. China Water Transp. 2019, 19, 113–114.
  4. Benkert, J.; Maack, R.; Meisen, T. Chances and Challenges: Transformation from a Laser-Based to a Camera-Based Container Crane Automation System. J. Mar. Sci. Eng. 2023, 11, 1718.
  5. Mi, C.; Huang, Y.; Fu, C.; Zhang, Z.; Postolache, O. Vision-Based Measurement: Actualities and Developing Trends in Automated Container Terminals. IEEE Instrum. Meas. Mag. 2021, 24, 65–76.
  6. Vinh, N.Q.; Kim, H.S.; Long, L.N.B.; You, S.S. Robust Lane Detection Algorithm for Autonomous Trucks in Container Terminals. J. Mar. Sci. Eng. 2023, 11, 731.
  7. Ji, Z.; Zhao, K.; Liu, Z.; Hu, H.; Sun, Z.; Lian, S. A Novel Vision-Based Truck-Lifting Accident Detection Method for Truck-Lifting Prevention System in Container Terminal. IEEE Access 2024, 12, 42401–42410.
  8. Lee, J. Deep learning–assisted real-time container corner casting recognition. Int. J. Distrib. Sens. Netw. 2019, 15, 1–11.
  9. Xu, B.; Chen, Z. Multi-level Fusion Based 3D Object Detection from Monocular Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2345–2353.
  10. Brazil, G.; Liu, X. M3D-RPN: Monocular 3D Region Proposal Network for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9286–9295.
  11. ul Haq, Q.M.; Haq, M.A.; Ruan, S.J.; Liang, P.J.; Gao, D.Q. 3D Object Detection Based on Proposal Generation Network Utilizing Monocular Images. IEEE Consum. Electron. Mag. 2022, 11, 47–53.
  12. Mi, C.; Huang, S.; Zhang, Y.; Zhang, Z.; Postolache, O. Design and Implementation of 3-D Measurement Method for Container Handling Target. J. Mar. Sci. Eng. 2022, 10, 1961.
  13. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334.
  14. Mi, C.; Liu, Y.; Zhang, Y.; Wang, J.; Feng, Y.; Zhang, Z. A Vision-Based Displacement Measurement System for Foundation Pit. IEEE Trans. Instrum. Meas. 2023, 72, 2525715.
  15. Shan, D.; Zhu, Z.; Wang, X.; Zhang, P. Pose Measurement Method Based on Machine Vision and Novel Directional Target. Appl. Sci. 2024, 14, 1698.
  16. Li, D.; Cheng, B.; Wang, K. Self-calibrating technique for 3D displacement measurement using monocular vision and planar marker. Autom. Constr. 2024, 159, 105263.
  17. Rad, M.; Lepetit, V. BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3848–3856.
  18. Tekin, B.; Sinha, S.N.; Fua, P. Real-time seamless single shot 6D object pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 292–301.
  19. Ma, Y.; Mi, C.; Yao, L.; Liu, Y.; Mi, W. Automated Ship Berthing Guidance Method Based on Three-dimensional Target Measurement. J. Mar. Sci. Appl. 2023, 12, 172–180.
  20. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Yifu, Z.; Wong, C.; Montes, D.; et al. ultralytics/yolov5: v7.0—YOLOv5 SOTA Realtime Instance Segmentation; Zenodo: Geneva, Switzerland, 2022.
  21. Zhang, S.; Yang, H.; Yang, C.; Yuan, W.; Li, X.; Wang, X.; Zhang, Y.; Cai, X.; Sheng, Y.; Deng, X.; et al. Edge Device Detection of Tea Leaves with One Bud and Two Leaves Based on ShuffleNetv2-YOLOv5-Lite-E. Agronomy 2023, 13, 577.
  22. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
  23. Li, Y.; Li, A.; Li, X.; Liang, D. Detection and Identification of Peach Leaf Diseases based on YOLO v5 Improved Model. In Proceedings of the 5th International Conference on Control and Computer Vision (ICCCV'22), New York, NY, USA, 9 November 2022; pp. 79–84.
  24. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process. 2024, 21, 62.
  25. Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 679–698.
  26. Pratt, W.K. Edge Detection. In Digital Image Processing; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2007; Chapter 15; pp. 465–533.
Figure 1. Gantry crane unloading process schematic diagram.
Figure 2. Equipment composition.
Figure 3. Illustration of the positioning accuracy of the bulk unloading hopper.
Figure 4. System architecture diagram.
Figure 5. Structure of the YOLOv5 network.
Figure 6. ShuffleNet V2 basic module network structure diagram.
Figure 7. ShuffleNet V2 down sampling module network structure diagram.
Figure 8. Structure of the proposed YOLOv5-slim network.
Figure 9. Positioning diagram of the bulk unloading hopper.
Figure 10. Canny algorithm edge map.
Figure 11. Flowchart of edge extraction: (a) Canny edge inflation graph; (b) binarization graph; (c) edge contour graph.
Figure 12. (a) Hough transform fitting diagram. (b) Diagram of the longest line position. (c) Straight line fitting plot.
Figure 13. Experimental environment and equipment.
Figure 14. Sample of bulk cargo terminal loading and unloading operations.
Figure 15. Three-stage experiment for solving 3D coordinates.
Figure 16. Using RTK for on-site measurement.
Figure 17. (a) 3D positioning results; (b) horizontal distribution of positioning results.
Table 1. Performance of YOLOv5 versions on the COCO dataset [20].
Model | mAP@0.5 (%) | Speed (ms) | Params (M)
YOLOv5n | 45.7 | 45 | 1.9
YOLOv5s | 56.8 | 98 | 7.2
YOLOv5m | 64.1 | 224 | 21.2
YOLOv5l | 67.3 | 430 | 46.5
YOLOv5x | 68.9 | 766 | 86.7
Table 2. Camera parameters.
Optical zoom | 2.8 mm, fixed focus
Field of view | Horizontal 101.7 degrees
Aperture | F1.0
Spotlight | F = 2.8 mm
Table 3. Comparison of the detection accuracy of the vision-based baseline method and our method on the dataset.
Method | Params (M) | AP (IoU = 0.5, %) | AP (IoU = 0.5:0.9, %) | FPS
YOLOv5n | 1.9 | 99.5 | 85.3 | 67
Ours | 1.2 | 98.3 | 81.9 | 102
Table 4. Experiment on positioning accuracy in fully automated loading and unloading operations at bulk cargo terminals. Actual coordinates are listed once per hopper and apply to rows 1–10, 11–20, and 21–30, respectively.
Serial Number | Actual Coordinates (X_H, Y_H, Z_H) | Predicted Coordinates (X_P, Y_P, Z_P) | Offset Distance (m) | Prediction Accuracy P (%)
1 | (187.900, 82.479, 7.100) | (187.864, 82.418, 7.100) | 0.07 | 98.56
2 | | (188.183, 82.418, 7.100) | 0.29 | 94.15
3 | | (187.894, 82.487, 7.100) | 0.01 | 99.80
4 | | (187.853, 82.486, 7.100) | 0.05 | 99.04
5 | | (188.031, 82.431, 7.100) | 0.14 | 97.18
6 | | (187.915, 82.496, 7.100) | 0.02 | 99.55
7 | | (187.799, 82.303, 7.100) | 0.20 | 95.90
8 | | (188.191, 82.417, 7.100) | 0.30 | 93.99
9 | | (187.704, 82.706, 7.100) | 0.30 | 93.95
10 | | (187.906, 82.475, 7.100) | 0.01 | 99.85
11 | (126.217, 81.905, 7.100) | (126.129, 82.072, 7.100) | 0.19 | 96.19
12 | | (126.050, 81.656, 7.100) | 0.30 | 93.94
13 | | (126.007, 81.706, 7.100) | 0.29 | 94.16
14 | | (125.985, 82.091, 7.100) | 0.30 | 93.99
15 | | (126.316, 81.901, 7.100) | 0.10 | 98.00
16 | | (126.211, 81.730, 7.100) | 0.18 | 96.46
17 | | (126.437, 81.763, 7.100) | 0.26 | 94.71
18 | | (126.282, 81.613, 7.100) | 0.30 | 93.96
19 | | (125.936, 81.906, 7.100) | 0.28 | 94.32
20 | | (125.949, 81.787, 7.100) | 0.29 | 94.08
21 | (−46.791, 82.857, 7.100) | (−47.004, 82.647, 7.100) | 0.30 | 93.95
22 | | (−46.795, 82.844, 7.100) | 0.01 | 99.72
23 | | (−46.667, 82.669, 7.100) | 0.23 | 95.45
24 | | (−46.892, 83.067, 7.100) | 0.23 | 95.29
25 | | (−46.584, 83.026, 7.100) | 0.27 | 94.61
26 | | (−46.726, 82.861, 7.100) | 0.06 | 98.69
27 | | (−46.585, 82.639, 7.100) | 0.30 | 93.94
28 | | (−46.774, 82.871, 7.100) | 0.02 | 99.56
29 | | (−46.546, 83.027, 7.100) | 0.30 | 93.98
30 | | (−46.625, 82.670, 7.100) | 0.25 | 94.95
Average | | | 0.20 | 96.06
