Article

DoubleNet: A Method for Generating Navigation Lines of Unstructured Soil Roads in a Vineyard Based on CNN and Transformer

1 State Key Laboratory of Agricultural Equipment Technology, Beijing 100083, China
2 Chinese Academy of Agricultural Mechanization Sciences Group Co., Ltd., Beijing 100083, China
* Author to whom correspondence should be addressed.
Agronomy 2025, 15(3), 544; https://doi.org/10.3390/agronomy15030544
Submission received: 9 January 2025 / Revised: 17 February 2025 / Accepted: 19 February 2025 / Published: 23 February 2025
(This article belongs to the Special Issue Advanced Machine Learning in Agriculture)

Abstract:
Navigating unstructured roads in vineyards with weak satellite signals presents significant challenges for robotic systems. This research introduces DoubleNet, an innovative deep-learning model designed to generate navigation lines for such conditions. To improve the model’s ability to extract image features, DoubleNet incorporates several key innovations, such as a unique multi-head self-attention mechanism (Fused-MHSA), a modified activation function (SA-GELU), and a specialized operation block (DNBLK). Based on them, DoubleNet is structured as an encoder–decoder network that includes two parallel subnetworks: one dedicated to processing 2D feature maps and the other focused on 1D tensors. These subnetworks interact through two feature fusion networks, which operate in both the encoder and decoder stages, facilitating a more integrated feature extraction process. Additionally, we utilized a specially annotated dataset comprising images fused with RGB and mask, with five navigation points marked to enhance the accuracy of point localization. As a result of these innovations, DoubleNet achieves a remarkable 95.75% percentage of correct key points (PCK) and operates at 71.16 FPS on our dataset, with a combined performance that outperformed several well-known key point detection algorithms. DoubleNet demonstrates strong potential as a competitive solution for generating effective navigation routes for robots operating in vineyards with unstructured roads.

1. Introduction

Grapes are one of the most significant fruits in China, holding immense economic value. By 2022, the Chinese grape planting area had reached 757,993 hectares, with a production output of 15.605 million tons [1]; this positions China as the third largest cultivator and producer of grapes worldwide. However, automation in Chinese vineyards currently covers only about 15%, resulting in a heavy reliance on manual labor. Therefore, replacing manual labor with tireless and efficient robots is a promising strategy to boost grape productivity and promote further industry development.
While enabling robots is a promising choice, several crucial aspects need careful consideration. One such aspect is the robot’s ability to perceive its surrounding environment. This perception function encompasses a wide range of subfunctions, including obstacle detection, road recognition, and navigation line planning. Together, these functions form the robot’s environmental understanding system, ensuring that it can operate safely and efficiently in complex environments. Navigation line planning is one of the fundamental yet crucial functions for the regular operation of robots. It allows robots to determine the optimal path in known and unknown environments, enabling them to achieve predetermined goals.
Researchers have dedicated considerable effort to exploring various solutions for obtaining high-quality navigation lines. These solutions can be broadly categorized into Global Navigation Satellite System (GNSS), Lidar, and Computer Vision (CV) methods. GNSS [2,3,4] methods are the easiest and most effective means of obtaining navigation lines, and they have found widespread application in traffic and agricultural navigation. However, the effectiveness of GNSS methods heavily relies on satellite signals. In some agricultural settings with significant signal obstructions, such as forests or orchards, the navigation performance of GNSS methods may be severely compromised. On the other hand, Lidar [5,6] methods offer another practical approach for acquiring navigation lines. The data collected by Lidar, known as point clouds, are generated by reflecting laser beams off objects and contain valuable 3D positional information. Fitting lines or curves to the point clouds of roadside objects allows the navigation line to be calculated indirectly. While the accuracy of this method is remarkable, the computation on point clouds often leads to slow line generation. Therefore, to overcome the abovementioned defects and generate navigation lines rapidly and effectively in a vineyard with weak satellite signals, CV methods are the only viable option.
Specifically, CV methods can be further sorted into traditional CV methods and AI-based CV methods. The former involve the application of digital image processing technologies [7,8,9], such as the Hough transform [10] and RANSAC methods [11,12], which have been widely used in the agricultural field. However, traditional CV methods are known to be vulnerable to changes in illumination, making them less robust and gradually outdated. In contrast, with the rise of deep learning, AI-based CV methods are gradually replacing traditional methods due to their higher accuracy and robustness.
At the present stage, there are two common deep-learning approaches to generating a navigation line in agricultural scenes. The first is to segment the unstructured road or field area, then calculate guide points from the mask edges, and finally obtain the navigation line with fitting algorithms [13,14,15,16,17,18,19]. To obtain an accurate navigation path, Yang et al. [20] segmented the unstructured road area with SegNet and UNet, averaged their masks, and used the average coordinates of the averaged mask’s left and right edges as the path, achieving an average distance error of 5.03 cm. Yu et al. [21] introduced an effective method to obtain a navigation line. They trained five popular models, selected the fastest one to segment the unstructured road, and then computed the line with a polygon fitting algorithm that connects the middle point of the mask polygon’s bottom line with either the top point or the middle point of the top line. Silva et al. [22] designed a method to detect the navigation line in a crop field. They first segmented the crop skeleton line with UNet and then obtained a navigation line covering the crop line with a triangle scan method. Li et al. [23] presented a method for detecting the navigation line along the curved path of an unstructured road. They utilized ESANet, a multimodal model that takes depth maps and RGB images as inputs, as the segmentation model to identify the road area and edge. Subsequently, they extracted the coordinates of the edge points and performed a coordinate transformation to generate the line within the road area.
Another approach involves utilizing object detection models to detect the plants alongside roads [24,25]. This method entails calculating the coordinates of the center points of the detected objects and fitting lines based on these points. Subsequently, the navigation lines can be obtained by utilizing the lines. Zheng et al. [26] first detected jujube trees using a modified YOLOX-Nano model. They then applied the K-means clustering algorithm to divide the tree points into two classes, corresponding to the left and right tree rows. Finally, they determined the navigation line by leveraging geometric relations and the least squares method. Liu et al. [27] introduced a navigation algorithm for a pineapple harvester. They applied YOLOv5 to detect the pineapples’ locations and engaged a modified shortest-distance algorithm to fit the navigation lines. Zhang et al. [28] proposed a lightweight model, SN-CNN, to detect vegetables and further extract navigation lines with RANSAC; their detection accuracy of vegetables exceeded a series of YOLO models and thus generated high-accuracy navigation lines.
While the two types of deep-learning schemes have impressive results, they have inherent limitations. Firstly, their acquisition of navigation lines involves a lengthy process consisting of three stages: deep-learning preprocessing, the extraction of target-point coordinates, and the fitting of navigation lines. It is important to note that the speed of this process decreases as more points are extracted, especially in the first stage. Additionally, these schemes are inflexible and are primarily suitable for single and flattened road sections, which face challenges with more complex scenarios, such as curves.
Therefore, this study aims to develop a novel deep learning-based approach for directly generating navigation lines in unstructured agricultural environments. Compared with the existing approach, our method simplifies the process of generating navigation lines and enhances adaptability to different road scenarios.
Our contributions are as follows: (1) We introduce Self-Adaption GELU (SA-GELU), an improved activation function that adjusts its output based on negative inputs for better performance in negative regions. (2) We propose Fused-MHSA (F-MHSA), a refined multi-head self-attention variant that enhances global information acquisition by optimizing the Q and K sequence calculations. (3) We design DNBLK operation blocks with dual structures to process 2D and 1D information for unstructured roads. (4) We develop DoubleNet, an encoder–decoder model combining CNN and Transformer subnetworks to infer navigation lines effectively. Additionally, a unique dataset is employed to address the issue of inappropriately placed navigation lines.

2. Materials and Methods

2.1. Data Collection and Dataset Establishment

2.1.1. Data Collection Platform

To enhance the convenience and effectiveness of collecting vineyard images, we developed a data collection platform consisting of a camera, a Lidar, and an automated guided vehicle (AGV). The camera is the Intel® RealSense™ D455 RGB-D camera (Intel Corp., Santa Clara, CA, USA), which is capable of capturing RGB images at various resolutions, such as 1280 × 720 (720p, 30 FPS) and 640 × 480 (480p, 60 FPS), helping improve the effectiveness of the dataset. The Lidar is the Velodyne® 16 (Velodyne Lidar, Inc., San Jose, CA, USA), which is employed for collecting point cloud information for other tasks and is not used in this research.
As shown in Figure 1, the AGV, measuring 1200 mm (length) × 800 mm (width) × 525 mm (height), serves as the carrier for the devices mentioned above, providing both data storage space and power for them. It can be controlled remotely using a handset. The control unit of the AGV is an Intel® NUC equipped with an Intel® CORE™ i5-9300H CPU (Intel Corp., Santa Clara, CA, USA), 32 GB of DDR4L RAM (Samsung Electronics Co., Ltd., Suwon, South Korea), and a 128 GB SSD (Samsung Electronics Co., Ltd., Suwon, South Korea).

2.1.2. The Navigation Point Dataset

To enhance the robustness of the dataset to varying illumination, we used the platform to collect vision data on several dates (28 June 2023, 30 June 2023, 19 July 2023, 23 July 2023, and 6 August 2023), at different times of day (morning, noon, afternoon, and dusk), and under different weather (sunny and cloudy) in a vineyard of the Gansu Wine Industry Technology Research and Development Center in Lanzhou City, Gansu Province, China. The videos of the soil roads between grape trellises were captured at a vehicle speed of 2 m/s and cover various weather conditions, different camera angles (0°–45°), and resolutions of 1280 × 720 and 640 × 480. We extracted frames from the videos at 0.25 FPS and obtained 2413 images. Finally, to improve the quality of the image set, we removed extremely blurry and duplicate images, resulting in a final set of 957 images at a resolution of 1280 × 720 and 1456 images at 640 × 480.
After completing the image collection, the next step was annotating the images. However, we did not annotate the navigation points directly on the RGB images. Instead, we first labeled the unstructured road mask with the Labelme (version 5.4.1) annotation software, which has a deep learning automatic annotation mode and can annotate masks accurately and conveniently. Subsequently, we generated binary PNG images of the road masks and labeled the navigation points on fused images produced by a weighted addition algorithm; the weight ratio is RGB/mask = 0.9:0.1, which our experiments showed to be appropriate. The merit of this double annotation approach is that it improves the location accuracy of the inferred navigation points and reduces the chance of points falling outside the road by adding road position information to the RGB image, while not affecting the actual inference performed on RGB images.
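As a rough illustration, the sketch below reproduces the weighted-addition step described above with OpenCV; the function name and file paths are hypothetical, and only the 0.9:0.1 weight ratio comes from the text.

```python
import cv2
import numpy as np

def fuse_rgb_with_mask(rgb_path: str, mask_path: str,
                       w_rgb: float = 0.9, w_mask: float = 0.1) -> np.ndarray:
    """Blend an RGB frame with its binary road mask to build a fused annotation image."""
    rgb = cv2.imread(rgb_path, cv2.IMREAD_COLOR)          # H x W x 3, uint8
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)    # H x W, 0 or 255
    mask_bgr = cv2.cvtColor(mask, cv2.COLOR_GRAY2BGR)     # match the channel count
    # Weighted addition with the paper's RGB/mask = 0.9:0.1 ratio.
    return cv2.addWeighted(rgb, w_rgb, mask_bgr, w_mask, 0)

# Example usage (hypothetical file names):
# fused = fuse_rgb_with_mask("frame_0001.png", "frame_0001_mask.png")
# cv2.imwrite("frame_0001_fused.png", fused)
```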
The process of annotating navigation points can be divided into four distinct steps. First, we utilized Labelme to annotate five categories of navigation points: top point, middle point 1, middle point 2, middle point 3, and bottom point. This annotation process was applied to 500 fused images, which were split into training, validation, and test sets in a 400:50:50 ratio, creating an auxiliary dataset. The decision to use five navigation points was made to strike a balance between effective navigation performance and inference speed. These points, along with the navigation lines connecting them, are capable of handling most unstructured road configurations. Second, the dataset was used to train the DoubleNet model for 100 epochs, producing a trained weight file. Third, the trained DoubleNet model was employed to detect the locations of the five points in the remaining images. Finally, based on the identified position coordinates, a custom program was written to annotate these remaining images, thereby generating the final dataset.
To adhere to the general practice of dataset splitting, we divided the final dataset into training, validation, and test sets. The split ratio used was 8:1:1, meaning the training set consisted of 1929 images, while the validation and test sets contained 242 images each. It is important to note that the training set utilized fused images, whereas the validation and test sets employed RGB images to assess the training performance of the model. Figure 2 shows the annotation process of the dataset images.

2.1.3. Test Videos

We also recorded several 30 s short videos of the unstructured roads in the vineyard during the data collection period; each was recorded at 24 FPS. The contents of the videos are similar to the dataset images but not identical, which means they can be used to evaluate the trained models’ real-time inference performance and robustness.

2.2. DoubleNet

In the current CV landscape, two prominent model architectures have emerged: Convolutional Neural Network (CNN) and Vision Transformer. CNN is celebrated for its exceptional ability to extract local features using convolutional kernels, making it adept at managing objects of varying sizes. In contrast, Vision Transformer excels in processing global information, primarily leveraging the multi-head self-attention (MHSA) mechanism.
To leverage the strengths of both models and improve the accuracy of vision feature extraction, we propose DoubleNet—a model designed with an encoder–decoder architecture that integrates CNN and Transformer subnetworks. The CNN subnetwork is optimized for processing 2D feature maps, taking advantage of its robust local feature extraction capabilities. Meanwhile, the Transformer subnetwork is tailored to handle 1D feature tensors, enhancing the model’s ability to capture global features. A feature-fusion network is employed to enable mutual verification between the maps and tensors. This integration allows DoubleNet to comprehensively extract vision features, significantly enhancing the accuracy of navigation point positioning.

2.2.1. Self-Adaption GELU (SA-GELU)

Based on related research [29], while the famous GELU activation function has non-zero gradient values in the negative area, which avoids the phenomenon of neuron death, the activation performance in this area is limited and can lead to the loss of partial detailed information from the input feature map.
Therefore, to optimize the activation performance in the negative area, we have modified the GELU expression function to determine its negative activation effect by the input. As a result, we propose our Self-Adaption GELU (SA-GELU) activation function. Compared with GELU, SA-GELU maintains the positive part of GELU and adjusts its negative part, enabling the function to adaptively output more suitable results, the expression of which can be illustrated as follows:
$$\text{SA-GELU}(x) = y = \begin{cases} \text{GELU}(x), & \text{if } y \ge 0 \\ \text{GELU}(x) \times N(x, 1), & \text{if } y < 0 \end{cases}$$
where N(x,1) is the Gaussian distribution, which applies x and 1 as its mean value and standard deviation, respectively. GELU is a variant of ReLU and is widely employed; it is based on Gaussian distribution and can be expressed as follows:
$$\text{GELU}(x) = x \times \Phi(x) = x \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(X-\mu)^2}{2\sigma^2}}\,dX \approx 0.5x\left[1 + \tanh\left(\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715x^3\right)\right)\right]$$
where Φ(x) is the cumulative distribution function of the Gaussian distribution, which applies μ and σ as its mean value and standard deviation.
Specifically, as shown in Figure 3b,c, when the value of SA-GELU is greater than 0, the calculation process is identical to that of GELU (the curve in Figure 3a); when the value is less than 0, the computation additionally introduces a normal distribution term N(x,1). The output is adjusted by multiplying the absolute value by −1, ensuring that y is negative. This reflects a controlled response influenced by the polynomial component, enhancing the model’s expressiveness in handling negative outputs.
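The following is a minimal PyTorch sketch of Equation (1) under our reading of the text: N(x, 1) is interpreted as the density of a Gaussian with mean x and unit standard deviation evaluated at zero, and the negative branch is forced negative via the absolute value, as described above. This is an illustrative interpretation, not the authors’ released implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAGELU(nn.Module):
    """Sketch of Self-Adaption GELU (Equation (1)).

    Assumption: N(x, 1) is read as the density of a Gaussian with mean x and
    unit standard deviation evaluated at 0, i.e. exp(-x^2 / 2) / sqrt(2*pi),
    and the negative branch is forced negative via -|.| per the text.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = F.gelu(x)                                         # standard GELU branch
        gauss = torch.exp(-0.5 * x.pow(2)) / math.sqrt(2 * math.pi)
        neg = -(y * gauss).abs()                              # input-adaptive negative branch
        return torch.where(y >= 0, y, neg)
```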

2.2.2. Fused MHSA (F-MHSA)

The MHSA computes attention by transforming the input sequence into queries (Q), keys (K), and values (V) using learned weight matrices. It then applies scaled dot-product attention, calculating attention weights from Q and K, and using them to generate a weighted sum of V. This process runs across multiple heads to capture diverse features, with outputs concatenated and linearly transformed to produce the final result.
Analyzing vineyard road scenes revealed abundant yet highly similar visual features, making the original MHSA less effective for this task. To address this, we enhanced its feature extraction ability by developing the Fused MHSA (F-MHSA).
F-MHSA is a special version of MHSA, which enhances global feature understanding ability by integrating multi-scale representations of tensor Q to strengthen attention to critical features, improving upon the standard MHSA framework.
Specifically, the modification is not sophisticated: it improves the attention scores of the critical features in tensor Q. As shown in Figure 4, firstly, we apply the CBSG (convolution 1D, batch normalization, and SA-GELU) operation twice in succession to collect the critical information of Q; this generates Q1 (2× downsample) and Q2 (4× downsample). Then, Q1 and Q2 are resized back to the original size of tensor Q with the nearest-neighbor interpolation algorithm, producing Q3 and Q4, respectively. Next, Q, Q3, and Q4 are combined by addition, and Q, Q5 (= Q + Q3), and Q6 (= Q5 + Q4) are the outputs of this handling process. Then, three matrix multiplications are performed (Q × K, Q5 × K, and Q6 × K), and their results are summed as the input of the scaling operation. After that, the remaining operations are the same as those of the MHSA.
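A single-head sketch of this fused attention-score computation, assuming PyTorch, is given below; the convolution kernel sizes and the plain GELU stand-in for SA-GELU are assumptions, while the Q1/Q2 downsampling, nearest-neighbor restoration, cumulative addition, and summed Q × K products follow the description above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBSG1d(nn.Module):
    """Conv1d + BatchNorm1d + activation, halving the sequence length (kernel size assumed)."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.GELU()                         # stand-in for SA-GELU

    def forward(self, x):                            # x: (B, dim, L)
        return self.act(self.bn(self.conv(x)))

class FusedSelfAttention(nn.Module):
    """Single-head sketch of F-MHSA's fused attention-score computation."""
    def __init__(self, dim: int):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.cbsg1 = CBSG1d(dim)                     # produces Q1 (2x downsample)
        self.cbsg2 = CBSG1d(dim)                     # produces Q2 (4x downsample)

    def forward(self, x):                            # x: (B, L, dim)
        B, L, D = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        qc = q.transpose(1, 2)                       # (B, D, L) for Conv1d
        q1 = self.cbsg1(qc)                          # (B, D, L/2)
        q2 = self.cbsg2(q1)                          # (B, D, L/4)
        q3 = F.interpolate(q1, size=L, mode="nearest").transpose(1, 2)  # restore length
        q4 = F.interpolate(q2, size=L, mode="nearest").transpose(1, 2)
        q5 = q + q3                                  # cumulative fusion
        q6 = q5 + q4
        scores = (q @ k.transpose(-2, -1)
                  + q5 @ k.transpose(-2, -1)
                  + q6 @ k.transpose(-2, -1)) / math.sqrt(D)
        return scores.softmax(dim=-1) @ v
```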

2.2.3. Flattening Module

The flattening layer is typically the initial module in a Transformer model, serving to flatten, downsample, and transform the 2D input image into a 1D tensor. This enables subsequent layers to process information in the 1D tensor. In our DoubleNet, however, the flattening module not only performs these functions but also possesses the capability to compute features in images and generate 2D feature maps.
Figure 5 illustrates the pipeline of our flattening module. As depicted in the figure, there are two branches for handling the input images. The upper branch is designed to process the 2D feature maps: if the input image size is 3 × 224 × 224 (channel × length × width), the output size will be 16 × 56 × 56, implementing a quadruple downsample and feature extraction with two convolution operations. The lower branch is established to process the 1D tensors; the input image is first flattened from 3 × 224 × 224 to 16 × 9408 × 1. After that, since the 1D tensor is oversized and contains an excessive number of parameters, we utilize a 1D max-pooling operation to implement a triple downsample. Additionally, to enhance the model’s performance in capturing local details, the output of this branch is the sum of the reshaped output of the upper branch and the max-pooling output.
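A minimal sketch of the two-branch flattening module, assuming PyTorch, is shown below; the convolution kernel sizes are assumptions, while the 3 × 224 × 224 → 16 × 56 × 56 and 16 × 9408 → 16 × 3136 shapes follow the description above.

```python
import torch
import torch.nn as nn

class Flattening(nn.Module):
    """Sketch of the two-branch flattening module (shapes follow Section 2.2.3)."""
    def __init__(self):
        super().__init__()
        # Upper branch: two stride-2 convolutions, 3 x 224 x 224 -> 16 x 56 x 56.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # -> 16 x 112 x 112
            nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1),  # -> 16 x 56 x 56
        )
        # Lower branch: flatten to 16 x 9408, then 1D max pooling (3x downsample).
        self.pool = nn.MaxPool1d(kernel_size=3, stride=3)           # -> 16 x 3136

    def forward(self, img):                        # img: (B, 3, 224, 224)
        fmap = self.conv(img)                      # (B, 16, 56, 56)
        flat = img.reshape(img.size(0), 16, -1)    # (B, 16, 9408)
        tokens = self.pool(flat)                   # (B, 16, 3136)
        tokens = tokens + fmap.reshape(fmap.size(0), 16, -1)   # fuse local detail
        return fmap, tokens
```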

2.2.4. DNBLK

DNBLK is the fundamental operational block of DoubleNet and is used to process information from the two subnetworks. This module expands the output’s channel or number dimension fourfold compared to the input, reduces the feature map size by half, and decreases the tensor length to a quarter. As depicted in Figure 6, we have designed two distinct DNBLK structures. Both structures have a double-in-double-out configuration, enabling the simultaneous handling of 2D feature maps and 1D tensors. This facilitates the fusion of feature maps and tensors for enhanced performance.
Their difference lies in their ability. DNBLK 1 is designed to handle the features in large road areas, which have more developed large-scale convolution branches, while the performance of the small-scale convolution branches is relatively weak. Furthermore, it equips the standard form Transformer block (MHSA + Feed-Forward Neural Network (FFN)) and achieves feature fusion from tensor to feature map.
Conversely, DNBLK 2 performs better in processing small road areas. It has more effective small-scale convolution branches and applies a special Transformer block, which uses the residual of F-MHSA + FFN and FFN + F-MHSA to alleviate gradient vanishing and exploding. It achieves feature fusion from the feature map to the tensor.

2.2.5. DoubleNet Structure

Based on the method introduced above, the entire structure of the DoubleNet can be established, as shown in Figure 7 (the operation detail is shown in Table A1 in Appendix A). The DoubleNet is designed as an encoder–decoder model consisting of two subnetworks that handle objects with different dimensions. The input can be either the RGB or fused image, and the output consists of the heatmaps of the five navigation points.
In the encoder, the model consists of four operation blocks: a flattening block, two DNBLK 2 blocks, and one DNBLK 1 block. The order of the DNBLKs was determined through experimentation, and changing the order prevents the model from converging. In addition to the two subnetworks, there is a feature-fusion network. This network takes the outputs of the first DNBLK 2 block and adds them to its first input; the first input is then processed with a CBSG operation combination, and the result is added to the outputs of the second DNBLK 2 block (the second input) to generate the input of the second CBSG. The final result of the feature-fusion network is added to the 2D feature map output by the DNBLK 1 block to enhance the processing of targets at different scales.
In the decoder, we apply the “Convolution 2D (transposed) + Batch Normalization + SA-GELU” (CtBSG) and “Convolution 1D + Linear + Layer Normalization + SA-GELU” (CLLSG) operation combinations to adjust the feature map or tensor sizes to match the output size of the flattening module. The 2D feature map subnetwork employs three CtBSG operations, each of which reduces the channels of the feature map by a factor of four and doubles the image size. The other subnetwork, the 1D tensor subnetwork, uses CLLSG to perform similar operations on the 1D tensors: the 1D convolution reduces the dimensions by a factor of four, and the linear (fully connected) layer increases the tensor length by a factor of four. The two subnetworks achieve feature fusion in different directions through reshaping and addition operations. Following the final feature fusion, the fused result undergoes a convolution (kernel = 1) to adjust its channels and generate the output, a heatmap containing the coordinates of the five navigation points.
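As an illustration of the decoder building blocks, the sketch below implements CtBSG and CLLSG with the channel and size ratios stated above; the kernel sizes and the plain GELU stand-in for SA-GELU are assumptions.

```python
import torch.nn as nn

class CtBSG(nn.Module):
    """Transposed Conv2d + BatchNorm + activation: channels / 4, spatial size x 2."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, in_ch // 4, kernel_size=2, stride=2),
            nn.BatchNorm2d(in_ch // 4),
            nn.GELU(),                              # stand-in for SA-GELU
        )

    def forward(self, x):                           # (B, C, H, W) -> (B, C/4, 2H, 2W)
        return self.block(x)

class CLLSG(nn.Module):
    """Conv1d + Linear + LayerNorm + activation: channels / 4, tensor length x 4."""
    def __init__(self, in_ch: int, in_len: int):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, in_ch // 4, kernel_size=1)   # channel reduction
        self.linear = nn.Linear(in_len, in_len * 4)               # length expansion
        self.norm = nn.LayerNorm(in_len * 4)
        self.act = nn.GELU()                        # stand-in for SA-GELU

    def forward(self, x):                           # (B, C, L) -> (B, C/4, 4L)
        return self.act(self.norm(self.linear(self.conv(x))))
```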

2.3. Computing Hardware and Soft Environment

The coordinate regression mission of navigation points is relatively simple. As a result, the models’ training processes often converge quickly, which means the computer’s training pressure is relatively low.
Based on the analysis above, we utilized a portable laptop for training and testing the models, which can also serve as the server of a robot. The laptop has an Intel® CORE™ i9-13900HX CPU @ 2.2 GHz (Intel Corp., Santa Clara, CA, USA), an NVIDIA® GeForce™ RTX 4070 Laptop GPU (NVIDIA Corp., Santa Clara, CA, USA), and 32 GB of LPDDR5 RAM @ 5600 MHz (Samsung Electronics Co., Ltd., Suwon, South Korea). The software environment includes Windows 11, Python 3.8, PyTorch 2.1.1, CUDA 12.1, and CUDNN 8.8.1.

2.4. Experiment Introduction and Evaluation Metrics

2.4.1. Experiment Introduction

The research primarily focuses on the evaluation of models for key point detection tasks, with experiments designed around three main aspects: training performance, accuracy performance, and inference performance. For the training performance experiments, the goal was to assess the efficiency and effectiveness of different models during training and deployment. These experiments specifically compared the performance of DoubleNet models under varying activation functions. To ensure fairness, irrelevant programs were closed prior to training, and all models were trained on the same computational platform with identical parameters. During training, loss values were recorded to analyze the convergence behavior of the models. Additionally, average hardware utilization and average epoch time were collected to evaluate the training process comprehensively.
Accuracy performance experiments were designed to compare the precision of different models in the task of navigation point coordinate regression. Furthermore, an ablation study was conducted to evaluate the contributions of individual components within the DoubleNet architecture to the overall accuracy. These experiments provided insights into the effectiveness of the models in accurately detecting key points and the role of each model component in achieving this.
Inference performance experiments aimed to evaluate the real-world applicability of the models in static and dynamic scenarios. This aspect was divided into two categories: static image inference and real-time video inference. Static image inference focused on assessing the accuracy of navigation point coordinate regression on individual frames, while real-time video inference emphasized analyzing the comprehensive performance of adjacent frames in videos.
Beyond these three core aspects, additional functional experiments were conducted to explore the specific capabilities of the proposed methods, including a comparison of SA-GELU and GELU activation functions. These functional experiments provided deeper insights into the strengths and limitations of the models, which are further elaborated in Section 4.

2.4.2. Evaluation Metrics

The evaluation metrics for the experiments were carefully chosen to comprehensively assess the performance of the models in training, accuracy, and inference tasks. For training performance, the primary metrics included the loss curve, which reflects the convergence speed and stability of the models, as well as data related to hardware usage (CPU, GPU, and RAM usage) and average epoch time. These metrics offered a detailed view of the efficiency and resource utilization during the training phase.
The accuracy evaluation must consider several factors, the most important of which is the choice of accuracy metric. In our research, we use the percentage of correct key points (PCK), which can be expressed as follows:
$$PCK_{0.05} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{K} correct_{i,j} \times mask_{i,j}}{\sum_{i=1}^{N}\sum_{j=1}^{K} mask_{i,j}}$$
where N and K represent the batch size (32) and the number of navigation points (5), respectively. $mask_{i,j}$ is a Boolean matrix that indicates whether the jth key point in the ith sample of a batch is visible; if the matrix value is true, the jth point is visible and the judgement $correct_{i,j}$ is further calculated. $correct_{i,j}$ can be described as follows:
$$correct_{i,j} = \begin{cases} 1, & \text{if } d_{i,j} < 0.05 \\ 0, & \text{otherwise} \end{cases}$$
In Equation (4), for the jth key point in the ith sample of one batch, $d_{i,j}$ represents the Euclidean distance between the predicted and ground-truth points divided by the normalization factor:
$$d_{i,j} = \frac{\sqrt{\left(x_{i,j}^{pred} - x_{i,j}^{gt}\right)^2 + \left(y_{i,j}^{pred} - y_{i,j}^{gt}\right)^2}}{normalize[i]}$$
where $(x^{pred}, y^{pred})$ and $(x^{gt}, y^{gt})$ are the coordinates of the predicted and ground-truth navigation points, and $normalize[i]$ is a scale factor for normalizing the ith sample. In our research, this factor is the L2 norm of the output heatmap’s length and width, giving a value of $56\sqrt{2}$ for each sample. For the other models, this value is defined by their output size or a preset parameter.
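A compact sketch of the PCK@0.05 computation in Equations (3)–(5), assuming NumPy, is given below; the function name is hypothetical.

```python
import numpy as np

def pck(pred, gt, visible, normalize, thr=0.05):
    """PCK@0.05 over a batch (Equations (3)-(5)).

    pred, gt:  (N, K, 2) arrays of (x, y) coordinates.
    visible:   (N, K) Boolean mask of labelled (visible) points.
    normalize: (N,) per-sample scale factor (here 56 * sqrt(2) for the heatmap).
    """
    dist = np.linalg.norm(pred - gt, axis=-1) / normalize[:, None]   # d_{i,j}
    correct = (dist < thr) & visible                                 # correct_{i,j} * mask_{i,j}
    return correct.sum() / max(visible.sum(), 1)

# Example with the paper's settings: N = 32, K = 5, normalize filled with 56 * np.sqrt(2).
```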
For inference performance, static image inference was measured based on the accuracy of coordinate regression for navigation points, highlighting the effectiveness of the models in single-frame scenarios. Real-time video inference, on the other hand, emphasized practical deployment metrics, including hardware utilization (CPU, GPU, and RAM usage), inference speed (FPS), inference cost (GFLOPS), GPU power (W), and the consistency of predictions across adjacent frames.

3. Results

3.1. Training Results

In this section, we will showcase the pertinent outcomes of the training experiment. This will encompass the loss curves and training data for the training processes of diverse models and the DoubleNets featuring distinct activation functions.

3.1.1. Training Performances of Different Models

To prove DoubleNet’s performance, this study compares several well-known key point detection models: DEKR, YOLOv7-Tiny-pose, YOLOv8s-pose, VITPose, and RCNN-pose. To ensure the consistency of variables, all models are trained with identical parameter settings: 300 epochs, a batch size of 32, an initial learning rate of 0.01, two CPU worker processes, an AdamW optimizer, and an exponential learning rate decay strategy.
Furthermore, all models are trained with our navigation-point dataset, but to meet the requirements of each model, the dataset has two formats: the COCO pose and YOLO pose formats, and the models’ input sizes are unified as 3 × 224 × 224.
To avoid overfitting during our method’s training, we utilize L2 regularization to optimize the training process, with a weight decay rate of 0.1. To ensure the fairness of the experiment, we use each model’s original loss function to maintain its convergence performance. Since these loss functions differ, the calculated loss values may vary by a large magnitude, which makes comparing the loss curves in a single figure challenging. However, the most crucial aspect of the loss curves is the changing trend, which indicates the convergence effect of the training. Therefore, it is not necessary to merge the curves into one figure.
Moreover, we use two training and validation loss curves to evaluate a model’s training performance comprehensively. The former is used to assess convergence, while the latter is employed to observe overfitting or underfitting.
Based on the data presented in Figure 8, four models, VITPose [30], YOLOv7-Tiny-pose [31], YOLOv8s-pose, and DoubleNet, successfully completed their training. The training and validation loss curves for these models show apparent convergence. However, the other two models, DEKR [32] and RCNN-pose [33], encountered problems during training, potentially affecting their accuracy and inference performance.
VITPose follows a standard training process, with training and validation loss curves converging around the 150th epoch and showing minimal difference. In contrast, YOLOv7-Tiny-pose has a unique process, with training loss converging by the 25th epoch and validation loss converging more slowly by the 270th epoch, after which the curves nearly overlap, indicating continuous performance improvement.
For the YOLOv8s-pose, training is effective, with its loss curves converging around the 200th epoch and a consistently lower validation loss due to normalization applied in training but not validation. A decline in training loss toward the end suggests room for improvement.
DoubleNet, our method, also trains successfully, with training loss converging at the 200th epoch. However, validation loss plateaus around the 100th epoch and rises slightly later, indicating potential overfitting.
DEKR and RCNN-pose failed to train effectively. DEKR’s training loss flattens after the 25th epoch but does not fully converge, likely due to its unique structure, allowing greater training potential. Despite having fewer parameters (29,404,738) than VITPose (89,991,429), DEKR fails to achieve similar convergence.
For RCNN-pose, both loss curves converge around the 150th epoch, but the training loss is significantly higher than the validation loss, likely due to overfitting during validation.
Table 1 shows that DoubleNet has the shortest training time (19.4 s), even faster than YOLOv7-Tiny-pose (20.3 s), enabling quick deployment. This is due to two factors: (1) Adjusting the CPU/GPU occupancy ratio accelerates data loading and processing. DoubleNet has higher CPU (22.4%) and lower GPU (41.3%) usage compared to RCNN-pose (CPU: 9.5%, GPU: 82.5%), preventing GPU overuse and performance throttling. (2) DoubleNet’s efficient parallel structure speeds up the forward pass, while RCNN-pose’s two-stage CNN design is slower.
As a result, RCNN-pose’s training time (351.6 s) is 18 times longer than DoubleNet’s.

3.1.2. DoubleNet’s Training Performances

To evaluate the training effect of the DoubleNet employing our activation function, SA-GELU, we implemented a training experiment on DoubleNets with six activation functions: Exponential Linear Units (ELU) [34], Leaky ReLU [35], Sigmoid Linear Unit (SiLU) [36], Mish [37], GELU, and SA-GELU. The loss curves for each activation function are displayed in Figure 9.
Figure 9 compares the training loss of six models with identical structures, loss functions, and training settings. While 2D line charts make it difficult to discern final loss values, 3D line charts provide a clearer and more convenient comparison.
DoubleNets show similar training loss convergence trends (around the 150th epoch) since their only difference is the activation function. DoubleNet (SA-GELU) achieves the lowest training loss (0.0001156), outperforming other variants like DoubleNet (GELU) (0.0001434) and DoubleNet (ELU) (0.0002281). For validation loss, all models converge around the 60th epoch, with DoubleNet (ELU) achieving the lowest value (0.0007007), followed by DoubleNet (SA-GELU) (0.0009351), which also outperforms DoubleNet (GELU) (0.0010952).
In summary, DoubleNet (SA-GELU) combines low training loss, reduced overfitting, and strong generalization, all without compromising training efficiency, making it an effective and stable choice for diverse scenarios.
Table 2 shows the hardware occupancy and training behavior of DoubleNets, highlighting that all six models share identical model size and parameter count, as these depend on the model structure, not activation functions. CPU usage varies minimally, with DoubleNet (SiLU) at 22.7% and DoubleNet (Mish) at 21.3%, reflecting their shared dataset-loading program. However, DoubleNet (SA-GELU) stands out with higher GPU (41.3%) and RAM (10.5%) usage due to its input-dependent computations, which increase resource demand.

3.2. Accuracy Results

This section will present the accuracy results of various experiments, including the accuracy experiment of different models and the ablation experiment of DoubleNet. The former aims to evaluate the accuracy performance of DoubleNet compared to other models, while the latter is conducted to assess the accuracy contribution of DoubleNet’s components.

3.2.1. Accuracy Results of Different Models

Table 3 displays the accuracy results of the six models. It also contains the PCK result, the model path file size, and the parameter amount.
The results show that DoubleNet achieves the highest average PCK performance of 95.75%, outperforming other models, including DEKR (57.66%), VITPose (90.80%), RCNN-pose (76.40%), YOLOv7-Tiny-pose (74.32%), and YOLOv8s-pose (92.79%). DoubleNet also performs best across all specific key points, with notable PCKs such as 96.26% (top point) and 97.53% (middle point 2).
DEKR shows the poorest performance, likely due to insufficient training, as suggested by its non-converged loss curve. Other models with moderate accuracy, such as RCNN-pose and YOLOv7-Tiny-pose, face different challenges: RCNN-pose suffers from overfitting due to its complex structure (59,038,942 parameters), while YOLOv7-Tiny-pose sacrifices accuracy for lightweight design (9,599,635 parameters).
YOLOv8s-pose (92.79%) outperforms the more complex VITPose (90.80%), despite having 7.9 times fewer parameters (11,423,552 vs. 89,991,429). This can be attributed to YOLOv8s-pose’s efficient architecture and the novel Adaptive Threshold Focus Loss (ATFL), which enhances its feature representation and training performance.
Overall, DoubleNet and YOLOv8s-pose demonstrate superior efficiency and accuracy, highlighting the importance of architecture and loss function optimization in achieving high performance.

3.2.2. Ablation Experiment Result

As previously mentioned, we conducted an ablation experiment to assess the accuracy contributions of our approaches, such as the particular key point dataset and SA-GELU, to the overall model. The corresponding accuracy results are displayed in Table 4.
To evaluate the contributions of DoubleNet’s components, we tested eight structural combinations. The final version, Combination 8, achieved the highest average PCK, outperforming the second-best setup, Combination 3, by 0.90% (95.75% vs. 94.85%).
A key experiment investigated the impact of the proposed F-MHSA method by comparing Combination 1 (standard MHSA) and Combination 2 (F-MHSA). The results show that F-MHSA improves average PCK by 3.2%, with the most notable increase at the bottom point (+12.35%). Smaller differences were observed for the top point (+2.05%) and middle point 1 (+0.81%), while slight decreases occurred at middle point 2 (−3.05%) and middle point 3 (−3.96%). This highlights F-MHSA’s effectiveness in enhancing accuracy, especially for challenging key points.
However, F-MHSA introduces greater computational complexity, resulting in a larger model size and a potential trade-off in inference speed. These findings emphasize the balance between improved accuracy and computational efficiency in achieving optimal model performance.
The second experiment evaluated the contribution of the F-dataset and the activation function to model accuracy.
Using the F-dataset, Combination 3 achieved an average PCK of 94.85%, surpassing Combination 2 (original RGB dataset) by 1.91%. Notable improvements were observed at the top point (+3.78%), middle point 1 (+4.16%), and bottom point (+0.95%), demonstrating that the F-dataset’s inclusion of road area and boundary information improves accuracy by reducing errors outside the road.
The primary difference between Combinations 3 to 8 is the activation function. GELU was selected as the benchmark due to its strong performance, with four key points achieving over 95% accuracy and a lowest score of 88.18%, which had minimal impact on average accuracy. Combination 3, using GELU, achieved the highest average PCK (94.85%), outperforming the second-best, Combination 4, by 0.53%.
The final version, Combination 8 (DoubleNet), employs the improved SA-GELU activation function, which includes input-based adaptive adjustments. This enhancement allows DoubleNet to achieve the highest accuracy across all five key points, with an average PCK of 95.75%, surpassing Combination 3 and further validating the effectiveness of both the F-dataset and the SA-GELU activation function in improving model performance.

3.3. Inference Results

3.3.1. Image Inference Results

As shown in Figure 10, partial inference results include four test samples representing unstructured road conditions: half road, short straight road, long straight road, and curved road. The models’ performances are categorized into three levels, reflecting their accuracy.
At the top level, DoubleNet, YOLOv8s-pose, and VITPose demonstrate excellent performance. Our DoubleNet achieves accurate navigation point inference, with minimal positional offsets and smooth, rational navigation lines. However, occasional slight deviations occur at the bottom point, as seen in the last image of Figure 10. YOLOv8s-pose also performs well but struggles with slight road curves, leading to errors in the top or bottom points. VITPose excels at most points but suffers from overfitting at middle point 3 and the bottom point, causing fluctuations in navigation lines, as seen in the second image of Figure 10.
The second-level models, RCNN-pose and YOLOv7-Tiny-pose, show unsatisfactory results with frequent location errors. Their navigation lines change direction unpredictably or terminate unreasonably, attributed to overfitting and inadequate feature extraction.
At the lowest level, DEKR fails to converge during training, yielding erratic navigation lines with no meaningful orientation, making it entirely unusable.

3.3.2. Video Inference Results

This section presents the experimental results for video inference, focusing on adjacent frame continuity and inference performance, as shown in Figure 11. The former evaluates the stability of navigation lines during real-time video inference, while the latter assesses inference speed and hardware usage.
Three videos, recorded during dataset collection but excluded from the dataset, were used to test adjacent frame continuity. Only the three top-level models were evaluated, as the lower-performing models were deemed unqualified for further testing.
Since ground truth is unavailable for these frames, achieving completely accurate inferences remains challenging despite similar environments to the dataset images. In Video 1, as the view advances slowly, the navigation line should shorten while maintaining a consistent angle with the bottom edge of the images. These criteria directly measure the models’ performance.
YOLOv8s-pose demonstrates excellent stability in Video 1, with minimal changes in navigation lines, except for slight angle deviations in highlighted areas. In contrast, VITPose shows poorer stability, with frequent changes in middle point 2 affecting the navigation line. DoubleNet performs best, maintaining consistent navigation points and lines with only minor angle changes.
In Video 2, where the camera’s height and angle are adjusted to test robustness, YOLOv8s-pose misses the top point in the 4th and 5th frames, resulting in invalid bending of the navigation line. VITPose exhibits confusion in point locations and extends navigation lines outside the road area. DoubleNet remains stable, with correct navigation point order and minimal positional changes that do not affect overall accuracy.
For Video 3, extracted from a static scene, the navigation lines should remain unchanged. DoubleNet maintains fixed navigation lines, while YOLOv8s-pose shows minor lateral deviation in the top point. VITPose produces unreliable lines, with significant changes in middle point 2’s position. Overall, DoubleNet demonstrates superior stability across frames in real-time video inference.
Table 5 highlights inference performance, noting reduced CPU, RAM, and GPU usage during inference compared to training, as no dataset loader is required. The exception is DEKR, which shows a 9.2% increase in GPU usage.
In terms of inference speed, YOLOv7-Tiny-pose and YOLOv8s-pose excel due to their lightweight design, achieving speeds of 156.25 FPS and 192.32 FPS, respectively. Both models are well-suited for real-time applications. DoubleNet and VITPose also achieve real-time performance, with speeds of 71.16 FPS and 74.63 FPS, respectively. Despite its larger parameter count, VITPose slightly outpaces DoubleNet because of its simpler activation function and MHSA implementation. In contrast, DEKR and RCNN-pose fail to meet real-time requirements due to their complex architectures, demonstrating slower speeds unsuitable for time-sensitive tasks. These findings underline DoubleNet’s balance of accuracy and real-time capability, especially compared to other high-accuracy models.
The table also highlights the computational cost and GPU power consumption of models. YOLOv8s-pose (1.85 GFLOPS, 7.9 W) and YOLOv7-Tiny-pose (1.47 GFLOPS, 9.4 W) are the most efficient, ideal for real-time and resource-limited applications. DoubleNet offers a balance of efficiency and performance with 2.07 GFLOPS and 11.4 W. In contrast, RCNN-pose is highly demanding at 483.95 GFLOPS and 25.3 W, limiting its real-time viability. Mid-range models like DEKR (21 GFLOPS, 15.8 W) and VITPose (22 GFLOPS, 12.6 W) provide moderate complexity and power usage. Overall, DoubleNet stands out for its balance of computational and energy efficiency.

4. Discussion

In this section, we will initially discuss the performance and enhancements of our method, such as the activation effect of SA-GELU and the feature extraction capabilities of F-MHSA. Following this, we will outline our future work, such as potential operations to further enhance accuracy, and provide solutions to inference problems.

4.1. The Discussion About DoubleNet Components

4.1.1. The Discussion About SA-GELU

As previously mentioned, our SA-GELU is an adapted version of the GELU activation function. By incorporating a self-adaptive input function, it enhances the activation effect in the negative region; however, this improvement comes at the expense of reduced inference speed. To evaluate its performance, we tested three RGB images to assess the capabilities of SA-GELU, comparing it against five other activation functions.
To successfully implement this comparative experiment, we designed a simple network comprising three operations: a 2D convolution, batch normalization, and an activation function. Initially, an RGB image is fed into the network. The 2D convolution is then applied to the image, maintaining its size but increasing its channels to 16 to demonstrate the activation functions’ effects adequately. Subsequently, the convolution output is normalized and activated through batch normalization and activation functions. Finally, the activated feature maps, generated using different activation functions, are visualized in RGB image format, and the visualized feature maps are illustrated in Figure 12.
As shown in the figure, our method, SA-GELU, outperforms the other five activation functions in terms of activation performance. In most cases, the feature maps activated by SA-GELU demonstrate superior dark area expression and retain more details. This means that SA-GELU enables effective color transitions in dark areas, whereas the dark places in feature maps activated by other functions typically exhibit little to no color transitions and are primarily pure black without any detail.
Based on the simple network established above, we conducted an experiment to evaluate the execution speed of SA-GELU. The forward pass of the network was run 10 times to measure the running time of the activation functions. The average values were calculated, and the results are shown in Table 6.
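A minimal sketch of such a timing harness, assuming PyTorch, is shown below; the function name and tensor shape are hypothetical.

```python
import time
import torch

def time_activation(act, x, runs=10):
    """Average forward-pass time (ms) of an activation function on a fixed input."""
    with torch.no_grad():
        act(x)                                   # warm-up pass
        if x.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            act(x)
        if x.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0

# Example: compare GELU against the SA-GELU sketch on a 16-channel feature map.
# x = torch.randn(1, 16, 224, 224)
# print(time_activation(torch.nn.GELU(), x), time_activation(SAGELU(), x))
```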
As illustrated in the table, the five fixed-expression activation functions demonstrate significant superiority in terms of speed. Even the slowest among them, GELU, with an execution time of 1.99 ms, is 12.52 times faster than SA-GELU, which takes 24.91 ms. While the self-adaptive capability of SA-GELU provides excellent feature activation, this advantage comes at the cost of more complex calculations and a slower execution speed.

4.1.2. The Discussion About F-MHSA

F-MHSA enhances the visual feature expression capability of the attention mechanism by strengthening the multi-scale representation ability of MHSA, but this structure comes at the cost of increasing the number of model parameters.
To further evaluate the performance of F-MHSA, we conducted additional experiments using a specialized network to isolate its effectiveness. This was necessary as DoubleNet’s integrated operations make it difficult to assess F-MHSA independently due to interactions with other components.
To analyze F-MHSA’s attention performance, a resized and transformed input image is fed into an untrained network containing only the F-MHSA or MHSA operation. Training is avoided to preserve the global attention characteristics of the MHSAs. The resulting feature map is transformed, normalized into a matrix, and visualized using pseudo-color. This matrix is fused with the input image to provide a clear representation of attention performance across the image.
The experimental results, shown in Figure 13, illustrate the attention performance of F-MHSA compared to MHSA. In the attention maps, darker pixel colors indicate poorer attention, while brighter pixels indicate richer attention. The difference images (F-MHSA-MHSA) highlight areas where F-MHSA outperforms MHSA, with brighter regions signifying a greater attention gap. Pure black areas represent regions where attention levels are equal or where MHSA surpasses F-MHSA.
Figure 13 demonstrates that F-MHSA generally provides better global attention than MHSA, as evidenced by larger bright areas in F-MHSA’s results and minimal pure black regions in the difference images. Table 7 further shows that DoubleNet with F-MHSA (combination 2) achieves the best final loss and PCK scores in the ablation experiments. However, this accuracy improvement comes at the cost of increased parameters, execution time, and operational size, potentially affecting training and deployment speed. Additionally, the performance gap between F-MHSA and MHSA widens as the tensor input size increases.

4.1.3. The Discussion About F-Dataset

To evaluate our dataset’s inference performance, we compared combinations 2 and 3 from the ablation experiment, with partial inference results shown in Figure 14. The results indicate that combination 3 outperforms combination 2, especially on unfamiliar images and video frames. Combination 3 effectively confines navigation points to the road area, while combination 2 shows issues like location offsets and out-of-bound points, likely due to the absence of manually labeled road areas. This supports the 1.91% accuracy improvement of combination 3, as shown in Table 5.

4.1.4. The Discussion About Model Structure

To optimize DoubleNet’s performance, we carefully designed its structure to balance all aspects. For input size, we tested 3 × 192 × 192, 3 × 224 × 224, and 3 × 256 × 256. The smallest size reduces parameters but sacrifices detail, affecting accuracy, while the largest size offers more detail but exceeds RAM limits due to hundreds of millions of parameters. Thus, 3 × 224 × 224 was chosen as the optimal input size.
The encoder, particularly the DNBLK structure, is key to DoubleNet’s performance. Testing one or two DNBLKs resulted in underfitting and non-convergence. Keeping the input size and structure constant, we explored all possible combinations of three DNBLKs, with performance results shown in Table 8.
As shown in Table 8, the arrangement and sequence of DNBLKs significantly affect training and accuracy due to model complexity and overfitting tendencies. Most configurations lead to overfitting, except for the “1 + 2 + 2” and “2 + 2 + 1” schemes, indicating that two DNBLK 2 and one DNBLK 1 provide optimal performance. The “2 + 1 + 2” scheme, however, overfits, suggesting that the frequency of transitions between DNBLKs also impacts performance. Thus, the “2 + 2 + 1” scheme, with the highest accuracy, is selected as the final encoder design.
For the decoder and head components of DoubleNet, experiments with different feature fusion directions and layer counts showed minimal impact on performance, demonstrating their limited importance. As such, detailed results for these variations were not recorded.

4.2. The Discussion About the Number of Navigation Points

The smoothness of the navigation line is crucial for the vehicle or robot that follows it; the smoother the line, the better the guidance effect. Therefore, the number of navigation points determining the line’s smoothness is a significant focus of our research. Generally, more navigation points result in a smoother navigation line but slower model inference speed. Thus, the critical factor in our study is finding the balance between the number of navigation points and the speed of inference.
Figure 15 illustrates the inference results of DoubleNet with different numbers of navigation points. The two-point scheme provides fast and accurate navigation on straight roads but struggles with curved roads, making it unsuitable for vineyard scenarios. The three-point and four-point schemes improve smoothness on curves but remain stiff, with minimal changes on straight roads and slightly reduced speed, rendering them inadequate for the vineyard environment.
The six-point and seven-point schemes excel in generating smooth navigation lines on curves but exhibit redundancy and fluctuating lines on short, straight roads. Additionally, their inference speed drops significantly, from 71.16 FPS (14.13 ms) to 45.70 FPS (21.88 ms), limiting their practicality.
In conclusion, the five-point scheme strikes the best balance between smoothness on curved roads, stability on straight roads, and inference speed, making it the optimal solution for the vineyard application.

4.3. The Discussion About DoubleNet’s Robustness

This section evaluates DoubleNet’s robustness by examining its performance under varying illumination conditions. Due to limited weather-varied images in the test set, additional unstructured road images with foggy and rainy conditions were used. The inference results under different lighting scenarios are shown in Figure 16.
As illustrated in Figure 16, the model delivers flawless inference results on test images, maintaining high accuracy even under challenging conditions such as intense shadows or cloudy weather. The generated navigation lines are precise in both shape and length, highlighting the model’s robust performance under controlled lighting scenarios.
However, when applied to internet images, certain limitations become evident. In image (c), reflections on the wet road slightly interfere with DoubleNet’s predictions, resulting in a navigation path that is shorter than the road and fails to cover its full extent. Similarly, in image (d), the presence of fog significantly reduces the visibility of soil road features, particularly at greater distances. This diminished visibility causes a loss of road details, preventing the model from accurately inferring navigation paths in remote areas, especially around curves.
In conclusion, the trained model demonstrates excellent robustness on the test set. However, when applied to unfamiliar images, DoubleNet displays limited robustness. While it performs reasonably well in situations with clear soil road features, its reliability decreases in challenging scenarios, where lighting and visibility play a critical role.

4.4. Limitations of DoubleNet

This section examines three problems observed during inference, namely fluctuation, deviation, and misdirection, together with their causes and possible remedies. Fluctuation, which is especially common in video inference or when the same image is processed repeatedly (e.g., Figure 17), occurs when low-confidence points shift noticeably between runs, making the navigation line unstable across video frames.
To improve model stability during video inference, we propose reusing features from earlier frames or applying time-sequence analysis methods, such as the Kalman filter, to predict current navigation point locations based on prior frames.
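As one possible realisation of this idea, the sketch below applies an independent constant-velocity Kalman filter to each navigation point, fusing the detection in the current frame with the state carried over from earlier frames. The class name, noise settings, and example coordinates are illustrative assumptions rather than components of DoubleNet.

```python
import numpy as np

class PointKalman:
    """Constant-velocity Kalman filter for one navigation point.
    State: [x, y, vx, vy]; measurement: [x, y] detected in the current frame."""
    def __init__(self, q=1.0, r=25.0):
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = 1.0   # transition (dt = 1 frame)
        self.H = np.zeros((2, 4)); self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = q * np.eye(4)       # process noise (assumed value)
        self.R = r * np.eye(2)       # measurement noise (assumed value)
        self.x = None                # state estimate
        self.P = np.eye(4) * 1e3     # state covariance

    def update(self, z):
        z = np.asarray(z, dtype=float)
        if self.x is None:                            # initialise on the first frame
            self.x = np.array([z[0], z[1], 0.0, 0.0])
            return z
        # predict from the previous frame
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # correct with the detection from the current frame
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                             # smoothed (x, y)

# One filter per navigation point; feed each frame's raw predictions through it.
filters = [PointKalman() for _ in range(5)]
raw_points = [(320, 470), (310, 380), (295, 300), (270, 230), (250, 180)]
smoothed = [f.update(p) for f, p in zip(filters, raw_points)]
```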
As shown in Figure 18, the deviation problem arises when inferred lines significantly mismatch ideal navigation lines, often due to unfamiliar images or poor training effectiveness. This issue, more severe than fluctuation, hinders navigation accuracy as the model fails to locate navigation points correctly.
To address this, two solutions are suggested: (1) enriching the dataset to improve the model’s ability to handle unfamiliar images by extracting more generalizable features, and (2) optimizing training settings and hyperparameters to enhance performance in poorly trained scenarios, potentially leading to substantial improvements.
Figure 19 illustrates the misdirection problem: when DoubleNet processes an image that resembles the training data but is not contained in the dataset, and the image includes a T-junction, misdirection is likely to occur. Across repeated inferences, the generated navigation lines may point toward different branches of the junction because the image features along the two candidate lines are similar.
Employing a post-processing program is an effective way to control the orientation of the navigation line. It calculates the left-side angle between each navigation line segment and the bottom edge of the image. If the angle is acute or straight, no action is required; otherwise, the inferred point locations are rejected and inference is repeated until the angle meets the requirement.
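A minimal sketch of such a post-processing check is given below. For each segment it measures the angle from the right-going direction of the image's bottom edge (with the image y-axis flipped so that upward is positive) and accepts the prediction only when every angle is below a threshold. The geometric convention used for the "left-side angle" and the 90° threshold are assumptions made for illustration and may differ from the exact rule used in our implementation.

```python
import math

def segment_angle_deg(p_low, p_high):
    """Angle (degrees) between a navigation segment and the image's bottom edge.

    p_low / p_high are (x, y) pixel coordinates of the lower and upper endpoints;
    image y grows downward, so it is flipped to make upward positive. The angle is
    measured counter-clockwise from the right-going direction of the bottom edge."""
    dx = p_high[0] - p_low[0]
    dy = p_low[1] - p_high[1]          # flip y so that "up" is positive
    return math.degrees(math.atan2(dy, dx))

def accept_navigation_points(points, max_angle_deg=90.0):
    """Accept the inferred points only if every segment's angle satisfies the
    threshold; otherwise the caller should re-run inference, as described above."""
    pts = sorted(points, key=lambda p: p[1], reverse=True)   # bottom point first
    for low, high in zip(pts, pts[1:]):
        if segment_angle_deg(low, high) > max_angle_deg:
            return False
    return True

# Example: a line leaning toward the right branch passes the acute-angle check.
ok = accept_navigation_points([(320, 470), (310, 380), (295, 300), (270, 230), (250, 180)])
```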

4.5. Practical Applications and Perspectives

The navigation line generation method based on DoubleNet holds significant potential for practical applications, particularly in the realms of agricultural robotics and automation. This model is specifically engineered to address navigation challenges on unstructured roads. By integrating DoubleNet into autonomous ground vehicles or robots, it becomes feasible to streamline complex navigation processes, efficiently traverse intricate pathways, reduce reliance on manual labor, and enhance overall operational efficiency. While this research primarily focuses on vineyards, the robustness of DoubleNet suggests its adaptability to other agricultural scenarios, such as orchards or irregular croplands, thereby expanding its range of applications. Furthermore, the improved navigation capabilities offered by DoubleNet can enhance the safety of autonomous machinery during agricultural operations, minimizing the risk of accidents or operational errors.
DoubleNet presents significant potential for advancing autonomous navigation and its applications in agriculture. Its scalability and adaptability could be further enhanced by extending its capabilities to accommodate a wider variety of terrain types and environmental conditions. Additionally, utilizing transfer learning techniques could allow DoubleNet to be quickly adapted to new agricultural contexts with minimal retraining, making it a versatile and cost-efficient solution for a range of applications. As the global movement toward smart farming and AI-driven agricultural innovations gains momentum, DoubleNet has the potential to become a highly effective navigation method for robotic systems, addressing the increasing demand for efficient, sustainable, and autonomous farming practices.

5. Conclusions

In this research, we introduced DoubleNet, a robust deep-learning model specifically designed to address navigation challenges in vineyards characterized by unstructured soil roads.
To enhance feature extraction, we modified the GELU activation function, creating the self-adaptive SA-GELU, which improves activation performance in the negative region and preserves crucial details. Additionally, we developed the Fused-MHSA mechanism to enhance the feature representation of the Q sequence through improved extraction and recombination techniques.
To further increase the accuracy of inferred navigation points under varying road conditions, we created a specialized dataset incorporating road area information, reducing the likelihood of misplacing navigation points outside road boundaries. Furthermore, we designed two DNBLK operation blocks that leverage convolutional structures and F-MHSA to process roads of different lengths effectively.
The DoubleNet architecture integrates these innovations through two subnetworks: convolutional and transformer. This dual-network approach exploits the unique strengths of each subnetwork, resulting in a comprehensive navigation solution.
Experimental results validate our model’s effectiveness, demonstrating that the inferred navigation lines are accurate and smooth in real-time applications. Overall, DoubleNet exhibits superior performance compared to existing models, establishing itself as a leading solution for navigation in challenging vineyard environments.

Author Contributions

Conceptualization, X.C. (Xuezhi Cui), B.Z., and K.L.; methodology, X.C. (Xuezhi Cui) and X.C. (Xiaoyi Cui); software, X.C., Z.H., and K.L.; validation, X.C., L.Z., and B.Z.; formal analysis, X.C., R.W., K.L., and X.F.; investigation, B.Z. and J.N.; resources, B.Z. and Z.H.; data curation, X.F. and X.C.; writing—original draft, X.C. and L.Z.; writing—review and editing, X.C., L.Z., B.Z., R.W., X.F., J.N., and X.C.; visualization, J.N.; supervision, B.Z., Z.H., and X.C.; project administration, B.Z.; funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Project of China National Machinery Industry Corporation Ltd., grant number ZDZX2023-2 and the Youth Science and Technology Fund of China National Machinery Industry Corporation, grant number QNJJ-PY-2022-20.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

Authors Xuezhi Cui, Licheng Zhu, Bo Zhao, Ruixue Wang, Zhenhao Han, Kunlei Lu, Xuguang Feng, Jipeng Ni, and Xiaoyi Cui were employed by Chinese Academy of Agricultural Mechanization Sciences Group Co., Ltd. The authors declare that this study received funding from the Science and Technology Project of China National Machinery Industry Corporation Ltd. (grant number ZDZX2023-2) and the Youth Science and Technology Fund of China National Machinery Industry Corporation (grant number QNJJ-PY-2022-20). The funder was not involved in the study design; the collection, analysis, or interpretation of data; the writing of this article; or the decision to submit it for publication.

Appendix A

Table A1. The implementation details of DoubleNet (Figure 7).
Block | Operation | Input Size | Output Size
(1) | cv 2D (3, 16, 3, 2, 1) + batch normalization + SA-GELU | 3 × 224 × 224 | 16 × 112 × 112
(1) | cv 2D (16, 16, 3, 2, 1) + batch normalization + SA-GELU | 16 × 112 × 112 | 16 × 56 × 56
(1) | reshape 2 | 16 × 56 × 56 | 16 × 3136 × 1
(1) | reshape 1 | 3 × 224 × 224 | 16 × 9408 × 1
(1) | maxpooling 1D (3, 3) | 16 × 9408 × 1 | 16 × 3136 × 1
(1) | adding | 16 × 3136 × 1, 16 × 3136 × 1 | 16 × 3136 × 1
(2) | cv 2D (16, 16, 1, 1, 0) + batch normalization + SA-GELU | 16 × 56 × 56 | 16 × 56 × 56
(2) | cv 2D (16, 16, 3, 1, 1) + batch normalization + SA-GELU | 16 × 56 × 56 | 16 × 56 × 56
(2) | adding | 16 × 56 × 56, 16 × 56 × 56, 16 × 56 × 56 | 16 × 56 × 56
(2) | cv 2D (16, 16, 1, 1, 0) + batch normalization + SA-GELU | 16 × 56 × 56 | 16 × 56 × 56
(2) | cv 2D (16, 16, 3, 1, 1) + batch normalization + SA-GELU | 16 × 56 × 56 | 16 × 56 × 56
(2) | concatenation | 16 × 56 × 56, 16 × 56 × 56 | 32 × 56 × 56
(2) | cv 2D (16, 32, 3, 1, 1) + batch normalization + SA-GELU | 16 × 56 × 56 | 32 × 56 × 56
(2) | adding | 32 × 56 × 56, 32 × 56 × 56 | 32 × 56 × 56
(2) | cv 2D (32, 32, 3, 2, 1) + batch normalization + SA-GELU | 32 × 56 × 56 | 32 × 28 × 28
(2) | cv 2D (16, 16, 7, 1, 3) + batch normalization + SA-GELU | 16 × 56 × 56 | 16 × 56 × 56
(2) | adding | 16 × 56 × 56, 16 × 56 × 56 | 16 × 56 × 56
(2) | cv 2D (16, 32, 3, 2, 1) + batch normalization + SA-GELU | 16 × 56 × 56 | 32 × 28 × 28
(2) | concatenation | 32 × 28 × 28, 32 × 28 × 28 | 64 × 28 × 28
(2) | MLP | 16 × 3136 × 1 | 16 × 784 × 1
(2) | F-MHSA_2 + layer normalization | 16 × 784 × 1 | 16 × 784 × 1
(2) | F-MHSA_1 + layer normalization | 16 × 3136 × 1 | 16 × 3136 × 1
(2) | MLP | 16 × 3136 × 1 | 16 × 784 × 1
(2) | adding | 16 × 784 × 1, 16 × 784 × 1 | 16 × 784 × 1
(2) | cv 1D (16, 64, 3, 1, 1) + batch normalization + SA-GELU | 16 × 784 × 1 | 64 × 784 × 1
(2) | MLP | 64 × 784 × 1 | 64 × 784 × 1
(2) | reshape | 64 × 28 × 28 | 64 × 784 × 1
(2) | adding | 64 × 784 × 1, 64 × 784 × 1 | 64 × 784 × 1
(3) | cv 2D (64, 64, 1, 1, 0) + batch normalization + SA-GELU | 64 × 28 × 28 | 64 × 28 × 28
(3) | cv 2D (64, 64, 3, 1, 1) + batch normalization + SA-GELU | 64 × 28 × 28 | 64 × 28 × 28
(3) | adding | 64 × 28 × 28, 64 × 28 × 28, 64 × 28 × 28 | 64 × 28 × 28
(3) | cv 2D (64, 64, 1, 1, 0) + batch normalization + SA-GELU | 64 × 28 × 28 | 64 × 28 × 28
(3) | cv 2D (64, 64, 3, 1, 1) + batch normalization + SA-GELU | 64 × 28 × 28 | 64 × 28 × 28
(3) | concatenation | 64 × 28 × 28, 64 × 28 × 28 | 128 × 28 × 28
(3) | cv 2D (64, 128, 3, 1, 1) + batch normalization + SA-GELU | 64 × 28 × 28 | 128 × 28 × 28
(3) | adding | 128 × 28 × 28, 128 × 28 × 28 | 128 × 28 × 28
(3) | cv 2D (128, 128, 3, 2, 1) + batch normalization + SA-GELU | 128 × 28 × 28 | 128 × 14 × 14
(3) | cv 2D (64, 64, 7, 1, 3) + batch normalization + SA-GELU | 64 × 28 × 28 | 64 × 28 × 28
(3) | adding | 64 × 28 × 28, 64 × 28 × 28 | 64 × 28 × 28
(3) | cv 2D (64, 128, 3, 2, 1) + batch normalization + SA-GELU | 64 × 28 × 28 | 128 × 14 × 14
(3) | concatenation | 128 × 14 × 14, 128 × 14 × 14 | 256 × 14 × 14
(3) | MLP | 64 × 784 × 1 | 64 × 196 × 1
(3) | F-MHSA_2 + layer normalization | 64 × 196 × 1 | 64 × 196 × 1
(3) | F-MHSA_1 + layer normalization | 64 × 784 × 1 | 64 × 784 × 1
(3) | MLP | 64 × 784 × 1 | 64 × 196 × 1
(3) | adding | 64 × 196 × 1, 64 × 196 × 1 | 64 × 196 × 1
(3) | cv 1D (64, 256, 3, 1, 1) + batch normalization + SA-GELU | 64 × 196 × 1 | 256 × 196 × 1
(3) | MLP | 256 × 196 × 1 | 256 × 196 × 1
(3) | reshape | 256 × 14 × 14 | 256 × 196 × 1
(3) | adding | 256 × 196 × 1, 256 × 196 × 1 | 256 × 196 × 1
(4) | cv 2D (256, 256, 1, 1, 0) + batch normalization + SA-GELU | 256 × 14 × 14 | 256 × 14 × 14
(4) | cv 2D (256, 256, 3, 1, 1) + batch normalization + SA-GELU | 256 × 14 × 14 | 256 × 14 × 14
(4) | adding | 256 × 14 × 14, 256 × 14 × 14, 256 × 14 × 14 | 256 × 14 × 14
(4) | cv 2D (256, 256, 1, 1, 0) + batch normalization + SA-GELU | 256 × 14 × 14 | 256 × 14 × 14
(4) | cv 2D (256, 256, 3, 1, 1) + batch normalization + SA-GELU | 256 × 14 × 14 | 256 × 14 × 14
(4) | concatenation | 256 × 14 × 14, 256 × 14 × 14 | 512 × 14 × 14
(4) | cv 2D (512, 512, 3, 2, 1) + batch normalization + SA-GELU | 512 × 14 × 14 | 512 × 7 × 7
(4) | cv 2D (256, 256, 7, 1, 3) + batch normalization + SA-GELU | 256 × 14 × 14 | 256 × 14 × 14
(4) | cv 2D (256, 256, 5, 1, 2) + batch normalization + SA-GELU | 256 × 14 × 14 | 256 × 14 × 14
(4) | adding | 256 × 14 × 14, 256 × 14 × 14, 256 × 14 × 14 | 256 × 14 × 14
(4) | cv 2D (256, 512, 3, 2, 1) + batch normalization + SA-GELU | 256 × 14 × 14 | 512 × 7 × 7
(4) | concatenation | 512 × 7 × 7, 512 × 7 × 7 | 1024 × 7 × 7
(4) | F-MHSA_1 + layer normalization | 256 × 196 × 1 | 256 × 196 × 1
(4) | F-MHSA_1 + layer normalization | 256 × 196 × 1 | 256 × 196 × 1
(4) | adding | 256 × 196 × 1, 256 × 196 × 1 | 256 × 196 × 1
(4) | MLP | 256 × 196 × 1 | 256 × 49 × 1
(4) | cv 1D (256, 1024, 3, 1, 1) + batch normalization + SA-GELU | 256 × 49 × 1 | 1024 × 49 × 1
(4) | MLP | 1024 × 49 × 1 | 1024 × 49 × 1
(4) | reshape | 1024 × 49 × 1 | 1024 × 7 × 7
(4) | adding | 1024 × 7 × 7, 1024 × 7 × 7 | 1024 × 7 × 7
(5) | cv 2D (64, 256, 3, 2, 1) + batch normalization + SA-GELU | 64 × 28 × 28 | 256 × 14 × 14
(6) | cv 2D (256, 1024, 3, 2, 1) + batch normalization + SA-GELU | 256 × 14 × 14 | 1024 × 7 × 7
(7) | T—cv 2D (1024, 256, 4, 2, 1) + batch normalization + SA-GELU | 1024 × 7 × 7 | 256 × 14 × 14
(8) | T—cv 2D (256, 64, 4, 2, 1) + batch normalization + SA-GELU | 256 × 14 × 14 | 64 × 28 × 28
(9) | T—cv 2D (64, 16, 4, 2, 1) + batch normalization + SA-GELU | 64 × 28 × 28 | 16 × 56 × 56
(10) | cv 2D (16, 5, 1, 1, 0) | 16 × 56 × 56 | 5 × 56 × 56
(11) | cv 1D (1024, 256, 1, 1, 0) + linear (49, 196) + layer normalization + SA-GELU | 1024 × 49 × 1 | 256 × 196 × 1
(12) | cv 1D (256, 64, 1, 1, 0) + linear (196, 784) + layer normalization + SA-GELU | 256 × 196 × 1 | 64 × 784 × 1
(13) | cv 1D (64, 16, 1, 1, 0) + linear (784, 3136) + layer normalization + SA-GELU | 64 × 784 × 1 | 16 × 3136 × 1
Note: T—cv (A, B, C, D, E) and cv (A, B, C, D, E) represent the transposed convolution and convolution operations, respectively. A, B, C, D, E represent the input channels, output channels, kernel size, stride size, and padding. Maxpooling 1D (A, B) represents a maxpooling operation, where the kernel size = A and stride size = B. Linear (A, B) means the full connection operation with the input neurons of A and the output neurons of B. MLP is the abbreviation of multi-layer perceptron.
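The "cv 2D (A, B, C, D, E) + batch normalization + SA-GELU" pattern that recurs throughout Table A1 maps naturally onto a small PyTorch module. The sketch below is illustrative only: SA-GELU is replaced by the stock GELU activation, and this is not the released implementation of DoubleNet.

```python
import torch
import torch.nn as nn

class CBSG(nn.Module):
    """cv 2D (in_ch, out_ch, kernel, stride, padding) + batch normalization
    + activation, following the notation of Table A1. The paper's SA-GELU is
    approximated by nn.GELU() here purely for illustration."""
    def __init__(self, in_ch, out_ch, k, s, p):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=p),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)

# First two rows of block (1): 3 × 224 × 224 -> 16 × 112 × 112 -> 16 × 56 × 56
stem = nn.Sequential(CBSG(3, 16, 3, 2, 1), CBSG(16, 16, 3, 2, 1))
out = stem(torch.randn(1, 3, 224, 224))   # torch.Size([1, 16, 56, 56])
```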
From Table A1, it is evident that DoubleNet contains numerous and complex operations, and the resulting large number of parameters makes the model prone to overfitting. To alleviate this issue, we implemented several measures; within the model structure, we incorporated dropout with a relatively high drop rate of 0.6 to regularize the network.
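A minimal sketch of this regularization is shown below. Since Table A1 does not specify where dropout is inserted, the placement inside an MLP-style block and the layer sizes are assumptions; only the drop rate of 0.6 is taken from the text.

```python
import torch.nn as nn

def mlp_with_dropout(in_features, out_features, p=0.6):
    """Illustrative MLP-style block with the high drop rate mentioned above;
    the exact dropout placement in DoubleNet is an assumption here."""
    return nn.Sequential(
        nn.Linear(in_features, out_features),
        nn.GELU(),
        nn.Dropout(p=p),
        nn.Linear(out_features, out_features),
    )
```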

References

  1. State of the World Vine and Wine Sector. Available online: https://www.oiv.int/what-we-do/data-discovery-report?oiv (accessed on 20 October 2024).
  2. Guevara, J.; Cheein, F.; Gené-Mola, J.; Rosell-Polo, J.; Gregorio, E. Analyzing and overcoming the effects of GNSS error on LiDAR based orchard parameters estimation. Comput. Electron. Agric. 2020, 170, 105255. [Google Scholar] [CrossRef]
  3. Shamshiri, R.; Navas, E.; Dworak, V.; Cheein, F.; Weltzien, C. A modular sensing system with CANBUS communication for assisted navigation of an agricultural mobile robot. Comput. Electron. Agric. 2024, 223, 109112. [Google Scholar] [CrossRef]
  4. Eceoğlu, O.; Ünal, İ. Optimizing Orchard Planting Efficiency with a GIS-Integrated Autonomous Soil-Drilling Robot. AgriEngineering 2024, 6, 2870–2890. [Google Scholar] [CrossRef]
  5. Malavazi, B.P.F.; Guyonneau, R.; Fasquel, J.; Lagrange, S.; Mercier, F. LiDAR-only based navigation algorithm for an autonomous agricultural robot. Comput. Electron. Agric. 2018, 154, 71–79. [Google Scholar] [CrossRef]
  6. Liu, W.; Li, W.; Feng, H.; Xu, J.; Yang, S.; Zheng, Y.; Liu, X.; Wang, Z.; Yi, X.; He, Y.; et al. Overall integrated navigation based on satellite and lidar in the standardized tall spindle apple orchards. Comput. Electron. Agric. 2024, 216, 108489. [Google Scholar] [CrossRef]
  7. Guhur, P.L.; Tapaswi, M.; Chen, S.; Laptev, I.; Schmid, C. Airbert: In-domain pretraining for vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 1634–1643. [Google Scholar]
  8. Zhang, B.; Zhao, D.; Chen, C.; Li, J.; Zhang, W.; Qi, L.; Wang, S. Extraction of Crop Row Navigation Lines for Soybean Seedlings Based on Calculation of Average Pixel Point Coordinates. Agronomy 2024, 14, 1749. [Google Scholar] [CrossRef]
  9. Ban, C.; Wang, L.; Chi, R.; Su, T.; Ma, Y. A Camera-LiDAR-IMU fusion method for real-time extraction of navigation line between maize field rows. Comput. Electron. Agric. 2024, 223, 109114. [Google Scholar] [CrossRef]
  10. Gong, J.; Wang, X.; Zhang, Y.; Lan, Y.; Mostafa, K. Navigation line extraction based on root and stalk composite locating points. Comput. Electr. Eng. 2021, 92, 107115. [Google Scholar] [CrossRef]
  11. Stefas, N.; Bayram, H.; Isler, V. Vision-Based UAV Navigation in Orchards. IFAC-PapersOnLine 2016, 49, 10–15. [Google Scholar] [CrossRef]
  12. Fu, D.; Chen, Z.; Yao, Z.; Liang, Z.; Cai, Y.; Liu, C.; Tang, Z.; Lin, C.; Feng, X.; Qi, L. Vision-based trajectory generation and tracking algorithm for maneuvering of a paddy field robot. Comput. Electron. Agric. 2024, 226, 109368. [Google Scholar] [CrossRef]
  13. Opiyo, S.; Okinda, C.; Zhou, J.; Mwangi, E.; Makange, N. Medial axis-based machine-vision system for orchard robot navigation. Comput. Electron. Agric. 2021, 185, 106153. [Google Scholar] [CrossRef]
  14. Navone, A.; Martini, M.; Ambrosio, M.; Ostuni, A.; Angarano, S.; Chiaberge, M. GPS-free autonomous navigation in cluttered tree rows with deep semantic segmentation. Robot. Auton. Syst. 2025, 183, 104854. [Google Scholar] [CrossRef]
  15. Liu, Y.; Guo, Y.; Wang, X.; Yang, Y.; Zhang, J.; An, D.; Han, H.; Zhang, S.; Bai, T. Crop Root Rows Detection Based on Crop Canopy Image. Agriculture 2024, 14, 969. [Google Scholar] [CrossRef]
  16. Li, G.; Le, F.; Si, S.; Cui, L.; Xue, X. Image Segmentation-Based Oilseed Rape Row Detection for Infield Navigation of Agri-Robot. Agronomy 2024, 14, 1886. [Google Scholar] [CrossRef]
  17. Simons, C.; Liu, Z.; Marcus, B.; Roy-Chowdhury, A.K.; Karydis, K. Language-guided Robust Navigation for Mobile Robots in Dynamically-changing Environments. arXiv 2024, arXiv:2409.19459. [Google Scholar]
  18. Choudhary, A.; Kobayashi, Y.; Arjonilla, F.J.; Nagasaka, S.; Koike, M. Evaluation of mapping and path planning for non-holonomic mobile robot navigation in narrow pathway for agricultural application. In Proceedings of the 2021 IEEE/SICE International Symposium on System Integration (SII), Iwaki, Fukushima, Japan, 11–14 January 2021; pp. 17–22. [Google Scholar]
  19. Xiao, K.; Xia, W.; Liang, C. Visual Navigation Path Extraction Algorithm in Orchard under Complex Background. Trans. Chin. Soc. Agric. Mach. 2023, 54, 197–204+252. [Google Scholar] [CrossRef]
  20. Yang, Z.; Ouyang, L.; Zhang, Z.; Duan, J.; Yu, J.; Wang, H. Visual navigation path extraction of orchard hard pavement based on scanning method and neural network. Comput. Electron. Agric. 2022, 197, 106964. [Google Scholar] [CrossRef]
  21. Yu, J.; Zhang, J.; Shu, A.; Chen, Y.; Chen, J.; Yang, Y.; Tang, W.; Zhang, Y. Study of convolutional neural network-based semantic segmentation methods on edge intelligence devices for field agricultural robot navigation line extraction. Comput. Electron. Agric. 2023, 209, 107811. [Google Scholar] [CrossRef]
  22. Silva, R.; Cielniak, G.; Gao, J. Vision based crop row navigation under varying field conditions in arable fields. Comput. Electron. Agric. 2024, 217, 108581. [Google Scholar] [CrossRef]
  23. Li, C.; Pan, Y.; Li, D.; Fan, J.; Li, B.; Zhao, B.; Zhao, Y.; Wang, J. A curved path extraction method using RGB-D multimodal data for single-edge guided navigation in irregularly shaped fields. Expert Syst. Appl. 2024, 255, 124586. [Google Scholar] [CrossRef]
  24. Saha, S.; Noguchi, N. Smart vineyard row navigation: A machine vision approach leveraging YOLOv8. Comput. Electron. Agric. 2025, 229, 109839. [Google Scholar] [CrossRef]
  25. Ball, D.; Upcroft, B.; Wyeth, G.; Corke, P.; English, A.; Ross, P.; Patten, T.; Fitch, R.; Sukkarieh, S.; Bate, A. Vision-based Obstacle Detection and Navigation for an Agricultural Robot. J. Field Rob. 2016, 33, 1107–1130. [Google Scholar] [CrossRef]
  26. Zheng, Z.; Hu, Y.; Li, X.; Huang, Y. Autonomous navigation method of jujube catch-and-shake harvesting robot based on convolutional neural networks. Comput. Electron. Agric. 2023, 215, 108469. [Google Scholar] [CrossRef]
  27. Liu, T.; Zheng, Y.; Lai, J.; Cheng, Y.; Chen, S.; Mai, B.; Liu, Y.; Li, J.; Xue, Z. Extracting visual navigation line between pineapple field rows based on an enhanced YOLOv5. Comput. Electron. Agric. 2024, 217, 108574. [Google Scholar] [CrossRef]
  28. Zhang, T.; Zhou, J.; Liu, W.; Yue, R.; Shi, J.; Zhou, C.; Hu, J. SN-CNN: A Lightweight and Accurate Line Extraction Algorithm for Seedling Navigation in Ridge-Planted Vegetables. Agriculture 2024, 14, 1446. [Google Scholar] [CrossRef]
  29. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2019, arXiv:1606.08415. [Google Scholar]
  30. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS) 2022, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  31. Wang, C.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  32. Geng, Z.; Sun, K.; Xiao, B.; Zhang, Z.; Wang, J. Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) 2021, Virtual, 19–25 June 2021; pp. 14676–14686. [Google Scholar]
  33. Braun, M.; Rao, Q.; Wang, Y.; Flohr, F. Pose-RCNN: Joint object detection and pose estimation using 3D object proposals. In Proceedings of the International Conference on Intelligent Transportation Systems (ITSC) 2016, Rio de Janeiro, Brazil, 1–4 November 2016; pp. 1546–1551. [Google Scholar]
  34. Clevert, D.; Unterthiner, T.; Hochreiter, S. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUS). In Proceedings of the 4th International Conference on Learning Representations (ICLR) 2016, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  35. Maas, A.; Hannun, A.; Ng, A. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013. [Google Scholar]
  36. Shazeer, N. GLU Variants Improve Transformer. arXiv 2020, arXiv:2002.05202. [Google Scholar]
  37. Misra, D. Mish: A Self Regularized Non-Monotonic Neural Activation Function. arXiv 2019, arXiv:1908.08681. [Google Scholar]
Figure 1. Data collection platform.
Figure 2. Dataset images and annotations. The colored dots are the manually annotated navigation points; different colors indicate different classes: "top point", "middle point 1", "middle point 2", "middle point 3", and "bottom point".
Figure 3. The activation curves of the GELU and SA-GELU functions. (a) The activation curve of GELU; (b,c) two possible activation curves of SA-GELU. Because SA-GELU contains the random term N(x,1) and is input-adaptive, different inputs produce different outputs, so its activation curve is not fixed.
Figure 4. Fused MHSA. (a) The structure of fused scaled dot-product attention; (b) the structure of Fused MHSA. CBSG denotes 1D convolution, 1D batch normalization, and SA-GELU; MatMul denotes matrix multiplication; Concat denotes concatenation.
Figure 5. Flattening module. A × B × C represents the channel × length × width (2D feature map) or number × length × width (1D tensor). CBSG A, B, C, D, E: CBSG is the abbreviation of convolution 2D, 2D batch normalization, and SA-GELU, and A, B, C, D, and E indicate the input channel, output channel, kernel size, stride size, and padding of the 2D convolution, respectively. Maxpooling 1D A, B represents the 1D maxpooling operation; A and B are the kernel size and stride size, respectively. ⊕ is the symbol of the adding operation.
Figure 6. DNBLK. (a,b) The structures of DNBLK 1 and DNBLK 2, respectively. CBSG A, B, C: CBSG is the abbreviation of 2D convolution, batch normalization, and SA-GELU, and A, B, and C indicate the kernel size, stride size, and padding of the 2D convolution. Inc and otc abbreviate the input and output channels; if a CBSG block does not show them, inc = otc. X1din, x1dout, x2din, and x2dout represent the 1D tensor input and output and the 2D feature map input and output, respectively. MLP is the abbreviation of multi-layer perceptron. F-MHSA_1 and F-MHSA_2 represent the Fused-MHSA with different inputs. LN is the abbreviation of layer normalization. Dimnumber is the number dimension of the 1D tensors. Cat is the abbreviation of concatenation.
Figure 7. DoubleNet. The CtBSG and CLLSG represent the “Convolution 2D (transposed) + Batch Normalization + SA-GELU” and the “Convolution 1D + Linear + Layer Normalization + SA-GELU”, respectively.
Figure 8. Training and validation loss curve.
Figure 9. Training and validation loss curves of DoubleNets with different activation functions. (a) Training loss curves; (b) validation loss curves.
Figure 10. Inference results for a subset of images.
Figure 11. The result of the adjacent frame continuity experiment.
Figure 12. Activation results of activation functions.
Figure 13. The visualization of MHSAs’ attention effects and their difference.
Figure 14. Inference results of DoubleNet trained with the two datasets.
Figure 15. Inference performance of DoubleNet with different navigation point numbers.
Figure 16. The results of the illumination robustness experiment of DoubleNet. (a) Intense shadows (test image); (b) cloudy weather (test image); (c) rainy weather and uneven road (internet image); (d) foggy weather (internet image).
Figure 17. The fluctuation problem: two identical images inferred repeatedly by DoubleNet.
Figure 18. The deviation problem on an untrained image. (a) The ideal navigation line; (b) the inferred navigation line.
Figure 19. The misdirection problem on an untrained image. (a) The inferred navigation line directed to the left; (b) the inferred navigation line directed to the right.
Table 1. Training hardware occupancies and behaviors of different models.
Models | Average CPU Occupancy | Average GPU Occupancy | Average RAM Occupancy | Epoch Time (s) | Model Size (MB) | Parameter Amount
DEKR | 14.4% | 55.0% | 7.5% | 29.4 | 341.7 | 29,404,738
VITPose | 6.4% | 83.8% | 6.1% | 38.1 | 343.2 | 89,991,429
RCNN-pose | 9.5% | 82.5% | 7.2% | 351.6 | 225.7 | 59,038,942
YOLOv7-Tiny-pose | 19.7% | 36.3% | 18.8% | 20.3 | 18.6 | 9,599,635
YOLOv8s-pose | 10.2% | 25.0% | 11.0% | 26.3 | 22.4 | 11,423,552
DoubleNet | 22.4% | 41.3% | 10.5% | 19.4 | 202.0 | 53,167,038
Note: the bold fonts are used to highlight the best results.
Table 2. Training hardware occupancies and behaviors of DoubleNets.
Models | Average CPU Occupancy | Average GPU Occupancy | Average RAM Occupancy | Epoch Time (s) | Model Size (MB) | Parameter Amount
DoubleNet (ELU) | 21.5% | 32.5% | 10.0% | 18.4 | 202.0 | 53,167,038
DoubleNet (LeakyReLU) | 21.6% | 32.5% | 9.9% | 18.2
DoubleNet (SiLU) | 22.7% | 32.5% | 10.0% | 17.9
DoubleNet (Mish) | 21.3% | 32.5% | 9.9% | 17.5
DoubleNet (GELU) | 22.4% | 32.5% | 9.9% | 18.8
DoubleNet (SA-GELU) | 22.4% | 41.3% | 10.5% | 19.3
Table 3. Accuracy performance (PCK) of different models.
Model | Top Point | Middle Point 1 | Middle Point 2 | Middle Point 3 | Bottom Point | Average
DEKR | 60.25% | 59.10% | 57.10% | 55.10% | 56.78% | 57.66%
VITPose | 92.53% | 92.42% | 85.36% | 95.14% | 88.53% | 90.80%
RCNN-pose | 78.15% | 76.80% | 75.35% | 75.60% | 76.10% | 76.40%
YOLOv7-Tiny-pose | 75.10% | 73.92% | 74.89% | 73.45% | 74.25% | 74.32%
YOLOv8s-pose | 91.22% | 95.13% | 96.41% | 95.00% | 86.17% | 92.79%
DoubleNet | 96.26% | 97.48% | 97.53% | 97.46% | 90.04% | 95.75%
Note: The PCK is calculated with the 0.05 distance threshold.
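For reference, the following is a minimal sketch of how PCK with a 0.05 distance threshold can be computed. Normalising the threshold by the image diagonal is an assumption here and may differ from the normalisation used in our evaluation.

```python
import numpy as np

def pck(pred, gt, image_wh, thresh=0.05):
    """Percentage of correct key points: a prediction counts as correct when its
    distance to the ground-truth point is below `thresh` times a normalising
    length (the image diagonal is assumed here).

    pred, gt: (N, K, 2) arrays of (x, y) pixel coordinates for N images and
    K navigation points; image_wh: (width, height)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    norm = np.hypot(*image_wh)                    # image diagonal as the reference length
    dist = np.linalg.norm(pred - gt, axis=-1)     # (N, K) Euclidean distances
    correct = dist < thresh * norm
    return correct.mean(axis=0), correct.mean()   # per-point PCK and average PCK

per_point, average = pck(np.zeros((1, 5, 2)), np.zeros((1, 5, 2)), image_wh=(224, 224))
```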
Table 4. The ablation experiment result of DoubleNet.
No. | Combination | Top Point | Middle Point 1 | Middle Point 2 | Middle Point 3 | Bottom Point | Average | Path File Size | Parameter Amount
1 | MHSA + GELU + RGB-dataset | 94.89% | 93.21% | 93.06% | 92.66% | 74.88% | 89.74% | 201.0 MB | 52,745,694
2 | F-MHSA + GELU + RGB-dataset | 92.34% | 92.40% | 96.11% | 96.62% | 87.23% | 92.94% | 202.0 MB | 53,167,038
3 | F-MHSA + GELU + F-dataset | 96.12% | 96.56% | 97.47% | 95.92% | 88.18% | 94.85%
4 | F-MHSA + SiLU + F-dataset | 95.42% | 93.83% | 94.58% | 97.21% | 90.56% | 94.32%
5 | F-MHSA + ELU + F-dataset | 94.91% | 93.32% | 98.67% | 97.88% | 82.78% | 93.51%
6 | F-MHSA + Mish + F-dataset | 91.59% | 94.87% | 96.39% | 93.83% | 86.53% | 92.64%
7 | F-MHSA + LeakyReLU + F-dataset | 92.68% | 90.29% | 93.65% | 91.14% | 82.10% | 89.97%
8 | F-MHSA + SA-GELU + F-dataset | 96.26% | 97.48% | 97.53% | 97.46% | 90.04% | 95.75%
Note: The F-dataset is the abbreviation of our dataset with fused images.
Table 5. The inference performances of models.
Models | Average CPU Occupancy | Average GPU Occupancy | Average RAM Occupancy | Speed (FPS) | Cost (GFLOPS) | Power (W)
DEKR | 6.9% | 64.2% | 3.7% | 2.23 | 21.82 | 15.8
VITPose | 4.9% | 83.1% | 5.6% | 74.63 | 22.08 | 12.6
RCNN-pose | 6.4% | 78.9% | 6.4% | 0.91 | 483.95 | 25.3
YOLOv7-Tiny-pose | 7.6% | 31.6% | 6.8% | 156.25 | 1.47 | 9.4
YOLOv8s-pose | 10.1% | 24.4% | 5.6% | 192.32 | 1.85 | 7.9
DoubleNet | 7.4% | 40.2% | 5.4% | 71.16 | 2.07 | 11.4
Table 6. Execution time of the six activation functions.
Activation Function | SA-GELU | GELU | Mish | SiLU | LeakyReLU | ELU
Average Time (ms) | 24.91 | 1.99 | 1.37 | 1.98 | 1.13 | 1.98
Table 7. Performances of the MHSAs.
Operation | Parameter Amount | Average Time (ms) | Size (MB) | Final Training Loss (DoubleNet) | Final Validation Loss (DoubleNet) | PCK (DoubleNet)
MHSA | 31,427,424 | 39.14 | 127.59 | 0.000189 | 0.000377 | 89.74%
F-MHSA | 31,428,240 | 58.79 | 128.20 | 0.000185 | 0.000370 | 92.94%
Table 8. The performances of the DoubleNet built with three DNBLKs.
Scheme | Accuracy (Average PCK) | Total Size (MB) | Final Training Loss | Final Validation Loss | Conclusion | Operation
1 + 1 + 1 | 80.16% | 999.08 | 0.0001314 | 0.0014247 | overfitting | reject
1 + 1 + 2 | 82.51% | 977.42 | 0.0001242 | 0.0013319
1 + 2 + 1 | 79.77% | 977.04 | 0.0001559 | 0.0014659
1 + 2 + 2 | 93.66% | 955.38 | 0.0001480 | 0.0008751 | 2nd high
2 + 1 + 1 | 67.81% | 984.37 | 0.0001679 | 0.0018778 | overfitting
2 + 1 + 2 | 82.92% | 962.71 | 0.0001173 | 0.0015441
2 + 2 + 1 | 95.75% | 962.32 | 0.0001156 | 0.0009351 | 1st high | accept
2 + 2 + 2 | 52.68% | 940.67 | 0.0001426 | 0.0024134 | overfitting | reject
Note: “1” and “2” represent the DNBLK 1 and DNBLK 2, respectively; “x + x + x” means the DNBLK combination of the encoder.