1. Introduction
The relatively new worldwide trend of ‘precision forestry’ refers to the use of high-tech sensors and analytical tools to support site-specific forest management for the conservation and use of forest resources [
1,
2]. According to McKinsey & Company Research, precision forestry plays an important role in nursery and forest management, forestry fees, timber delivery and value chains [
3]. The global precision forestry market is projected to be worth USD 6.1 billion by 2024 [
4] and has become an important industry in China.
Afforestation and reforestation operations constitute an important part of forest management [
5], and the quality of seedlings produced by nurseries affects the survival rate of planted trees, so it is crucial to advance research on nursery techniques. The number of seedlings per unit area is a common indicator used to characterise seedling production. The rapid and accurate identification of saplings and the detection of the number of saplings per unit area play an important role not only in estimating production, but also in breeding and plant phenotyping. The height of a sapling not only reflects its current growth status and the conditions required for its cultivation, but also indicates its production trend and yield as a cash crop. The traditional approach of counting saplings and measuring their height through manual sampling is not only unstable and slow [
6], but they are also labour intensive. When forest nurseries need to produce seedlings in a sustainable manner, they must be produced at minimal cost and with a minimal input of resources [
7]. Therefore, it is imperative to speed up the mechanisation and automation of seedling production management to significantly reduce labour time and eliminate heavy and inefficient manual work while maintaining a high accuracy.
With the development of technologies such as deep learning [
8,
9] and computer vision and the substantial increase in computers’ computing power in recent years, the real-time extraction of the number of saplings in a nursery can be accomplished using the fast detection characteristics of neural networks; real-time extraction of the height of saplings in a nursery can be accomplished using binocular cameras to build depth images. Many studies have applied deep learning techniques to forestry timber species recognition, e.g., Jozef Martinka [
10] used Matlab to build a deep neural network for detecting timber species, and identified colour temperature pictures of light with 97.9% accuracy. Shustrov [
11] used four neural network structures (AlexNet, VGG-16, GoogLeNet and ResNet-50) to classify fir, pine and spruce wood, with accuracies of over 90%, 90%, 80% and 70% for the four networks, respectively. Deep learning has also frequently been used to detect wood knots and surface defects in wood veneers and to predict wood properties. For example, Wei et al. [
12], Mohan et al. [
13] and Urbonas et al. (2019) [
14] designed neural networks for identifying timber knots and timber veneer surface defects, and they achieved detection accuracies of 70%–95%. The detection of tree vegetation in cities using techniques such as UAVs and deep learning has also been studied. For example, Xi et al. [
15] used two instance segmentation networks (BlendMask and Mask R-CNN) to segment ginkgo tree canopies in urban environments after a dimensionality reduction in the UAV multispectral images, while Zheng et al. [
16] used YoloV4-Lite to detect and localise trees in high-resolution remote sensing images of campus woodland. In a tree height measurement study, Prada [
17] measured the height of a single tree by means of a UAV-based LiDAR scanning sensor. In the detection of forests or native woods, Castilla et al. [
18] used a UAV image-derived point cloud to measure the height of individual conifer seedlings, noting that the accuracy was high for seedlings above 30 cm but that the method was not applicable to seedlings below 30 cm; in 2019, Puliti et al. [
19] used a random forest model to estimate tree height in 580 circular plots in Norway, and Imangholiloo et al. [
20] used 2.5 cm GSD DIPCs (defoliated and defoliation) to estimate the average height of small trees within 15 plots in a conifer-dominated regeneration stand in Finland, but both studies involved trees above 1 metre in height.
After comparing 40 studies on deep learning in agroforestry applications, Kamilaris et al. [
21] found that deep learning has higher accuracy in image recognition and is better than commonly used image processing techniques. Using deep learning for detection can help to obtain deeper features and produce more accurate classification results [
22,
23,
24,
25]. Classical target detection algorithms fall into three broad categories: (1) the region-based convolutional neural network (R-CNN) family [
26], which has the highest accuracy but whose algorithm is complex and time consuming; (2) the regression-based YOLO (you only look once) series and SSD (single shot multibox detector), which are fast, small and efficient [
27]; (3) density estimation-based methods, i.e., estimating the number of targets through learning target features, and combining the corresponding linear mapping and spatial features to construct a density map [
28]. Compared with other neural network detection models, the YOLO model significantly improves detection speed while ensuring detection accuracy [
29,
30]. This, coupled with the advantage of the smaller overall size of the YOLO model, makes it ideal for mobile embedded device applications, e.g., the improved Fast YOLO model by Redmon [
31] was able to process 155 frames per second. Wang et al. [
32] used YOLOV4-Tiny combined with a ZED 2 stereo camera for 3D reconstruction to obtain 3D coordinates of pixels in the current scene, and calculated the distance between the centre of the potted plant and the optical centre of the binocular left camera while completing the identification of the flower species. They completed real-time positioning and ranging of flowers with a real-time detection frame rate of 16 FPS and an average absolute error of 18.1 mm for flower centre positioning, with a maximum positioning error of 25.8 mm for flower centres under different light radiation conditions.
The above studies provide an insight into the detection of forest trees. However, the application of these techniques to nurseries with complex backgrounds and smaller individual saplings for target counting and height measurement still suffers from inaccuracies and is time consuming and labour intensive [
33]. This study proposes the use of the Ghostnet–YoloV4 network and binocular vision technology to address this problem and obtain the number and height of target saplings in real time. It also investigates whether the improved network can deliver better detection on the low-powered computing equipment and inexpensive binocular cameras available to most small nurseries. Field tests were carried out in nurseries to check that the method meets the practical requirements of sapling counting and height measurement and enables intelligent mechanical operation at minimal cost. The short-term aim is to use the research results to relieve nursery staff of the burden of manually counting saplings and measuring their height, while the long-term goal is to improve the effectiveness of forestry machinery automation and lay the foundation for intelligent forestry management.
3. Results
3.1. Training Parameters and Results
The Ghostnet–YoloV4 network parameters and training results are shown in
Table 4. The original detection image size was 640 × 480, but each image was automatically scaled to 416 × 416 before being fed into the network for training and detection. The mAP values and recall rates were not particularly high because of the use of data augmentation. Training on 25,000 images took nearly 120 h in total and resulted in a training loss of 0.35. The system achieved a frame rate of 15 FPS, which meets the requirements for real-time detection.
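The scaling step can be illustrated with a short sketch. The text does not state whether the aspect ratio is preserved when a 640 × 480 frame is scaled to 416 × 416; the sketch below assumes the aspect-preserving "letterbox" resize with padding that is common in YOLO implementations, and the function names `letterbox_geometry` and `to_source_coords` are hypothetical.

```python
def letterbox_geometry(src_w, src_h, dst_w=416, dst_h=416):
    """Return (scale, new_w, new_h, pad_x, pad_y) for an aspect-preserving resize.

    Assumption: the image is shrunk by a single scale factor and centred on
    the network input, with the remaining border filled by padding.
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    pad_x = (dst_w - new_w) // 2   # padding on the left (and right)
    pad_y = (dst_h - new_h) // 2   # padding on the top (and bottom)
    return scale, new_w, new_h, pad_x, pad_y


def to_source_coords(x, y, geometry):
    """Map a point from network-input pixels back to source-image pixels."""
    scale, _, _, pad_x, pad_y = geometry
    return (x - pad_x) / scale, (y - pad_y) / scale


geometry = letterbox_geometry(640, 480)
centre_in_source = to_source_coords(208, 208, geometry)
```

Mapping detections back to source pixels matters here because the top and bottom points of a detection frame must be looked up in the depth image, which is still at the camera resolution.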
3.2. Presentation of Sapling Detection Results
As shown in
Figure 7, the spruce saplings varied in colour, size, texture and planting density during the three different growth periods. The colour and texture features helped the network to distinguish saplings from their surroundings, which directly affected the accuracy of the counts. The different sizes and planting densities of the saplings made the depth images of each sapling different. The larger the sapling, the more accurate the height measurement; the sparser the planting, the lower the level of occlusion, and the more accurate the count. From the diagram, we can quickly determine the number and height of saplings of this type.
The results of the real-time inspection of Mongolian scotch pine are displayed in
Figure 8, which also shows the complete design of the output window during detection. For sparse Mongolian scotch pine, no adjustment of the camera shooting angle was required; counting within the effective depth of the binocular camera was virtually error-free and the height measured by the system was very close to the actual height.
As shown in
Figure 9, the crown of the Manchurian ash sapling was relatively large compared to its slender trunk; the extended crowns caused the saplings to obscure one another, which resulted in poor detection. Additionally, the roots of the Manchurian ash saplings in the rows farther from the camera were easily blocked by the saplings nearer the binocular camera, making it difficult to detect the roots of some of the saplings. Moreover, the trunks of the saplings were relatively thin, so they were not visible in the depth image, making it difficult to match the bottom of the detection frame with the bottom of the sapling. In this study, the binocular camera was moved downwards to expose the roots of the Manchurian ash as much as possible, and for severely obscured saplings, the top and bottom points were selected manually, which improved the detection accuracy for Manchurian ash saplings.
3.3. Analysis of Test Results
Table 5 shows the detection accuracy for three different forms of spruce saplings, together with the number of saplings and their average height for each form. The 3D coordinates of the centre point, obtained simultaneously, could be used to locate the saplings for future operations, such as precise automatic watering and the application of pesticides. For each detection point, the following measures were calculated: TP indicates the number of true saplings correctly detected as saplings; FP indicates the number of false saplings incorrectly detected as saplings; FN indicates the number of true saplings incorrectly detected or missed; count indicates the average number of saplings counted manually; H indicates the average height of saplings measured by the system in centimetres; and TH indicates the average height of saplings measured manually in centimetres.
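The measures above can be combined into summary figures as in the following minimal sketch. Precision and recall follow their standard definitions; the relative-accuracy formula used for count and height is an assumption made for illustration, since this section does not state how the accuracies in the tables were computed. The function names and the sample numbers are hypothetical, not values from Table 5.

```python
def detection_metrics(tp, fp, fn):
    """Standard precision and recall from TP, FP and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def relative_accuracy(measured, reference):
    """Assumed accuracy measure: 1 - |measured - reference| / reference."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return 1.0 - abs(measured - reference) / reference


# Illustrative values only (not taken from the paper's tables).
tp, fp, fn = 18, 1, 2
manual_count = tp + fn        # true saplings present at the detection point
system_count = tp + fp        # detection frames reported by the network
precision, recall = detection_metrics(tp, fp, fn)
count_accuracy = relative_accuracy(system_count, manual_count)
height_accuracy = relative_accuracy(28.5, 30.0)   # H = 28.5 cm, TH = 30.0 cm
```

Note that false and missed detections can cancel in the count, so a high count accuracy does not by itself imply high precision and recall.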
For the large spruce saplings, the nurserymen chose to plant them at a higher density in order to ensure their growth rate, so that they were counted with 100% accuracy. As the crowns of the large spruce saplings were farther away from the roots on the ground, it was easy to distinguish between them, and because these saplings were taller, the binocular camera took photographs from the side, so that the top and bottom points were selected more accurately, so their counting accuracy was also higher. The medium and small spruces were photographed diagonally downwards. The medium spruce was denser, and the shading between saplings had a greater impact on detection, making it easier for two or even more adjacent saplings to be mistakenly detected as one. The small spruce tilted and fell easily and the plants were shorter, making it easier to find the wrong top and bottom points of the saplings when taking height measurements, and resulting in a slightly lower accuracy than the other two spruce forms. However, the small spruce had a lower spacing, so the number of missed detections was lower, but there were slightly more false detections due to the similarity of its form to the surrounding weeds.
Of the three species, Mongolian scotch pine showed the best counting results. This is because Mongolian scotch pine was more sparsely planted, while the three spruce forms and the Manchurian ash saplings were more densely planted, so the number of false and missed detections at each detection point was relatively low for Mongolian scotch pine. In terms of height measurement, the saplings of Mongolian scotch pine were the furthest apart from one another, so their root and crown features were the most pronounced and their height measurements were the most accurate. In contrast, although manual intervention improved the detection accuracy of Manchurian ash, there were still a small number of missed detections, especially for smaller saplings that stood too close to a slightly larger sapling and were easily detected together with it as one sapling.
The difference between the system count and the manual count was minimal, which shows that the dataset was a good fit for the network and that detection worked well. The slightly larger difference in the sapling height measurement is due to the errors inherent in the binocular camera and the fact that the roots of some saplings were not detected, which made the gap between the bottom of the rectangular frame and the bottom of the sapling too large and ultimately pulled down the average height. During detection, the binocular camera was very sensitive to changes in light, which was highlighted by the depth maps of Manchurian ash and spruce. When comparing tests of medium-sized spruce in shade and in direct sunlight, and tests of Manchurian ash under cloudy conditions and Mongolian scotch pine under sunny conditions, we found that under good lighting conditions, the grey contours of the saplings on the depth image were very close together, which also made the height measurements of spruce and Mongolian scotch pine saplings under sunlight more accurate. The reflection of sunlight brought out the colour and texture characteristics of the saplings and allowed more accurate results for spruce and Mongolian scotch pine saplings. In the shade or on cloudy days, the grey-scale contours of the Manchurian ash and medium-sized spruce on the depth images differed less from the background and neighbouring saplings, and the colour and texture characteristics were somewhat reduced, which reduced the accuracy of both the height measurements and the counts. Adequate light made features such as texture and colour more visible, which helped the network detect the saplings. It is worth noting that when there was sufficient light, the rate at which multiple saplings were mistakenly detected as one was significantly reduced.
Table 6 shows the overall detection accuracy for the saplings. The Mongolian scotch pine benefitted from its larger spacing, with the highest count accuracy of 96.97% and a high height measurement accuracy of 96.55%. Although the Manchurian ash saplings obscured one another from front to rear, they could still be counted and measured with an accuracy of over 92%, which could be further improved with human intervention.
3.4. Network Performance Analysis
In order to verify the performance of the improved network, the following four networks were trained separately and tested for comparison to complete the ablation experiment, and the results are shown in
Table 7, where (1) represents the original YoloV4 network; (2) represents YoloV4 with Ghostnet introduced to replace the backbone; (3) represents YoloV4 with the PANet modification only; and (4) represents Ghostnet–YoloV4, with Ghostnet replacing the YoloV4 backbone and the PANet modified. The experiment used the original sapling dataset of 1500 images, divided into training and validation sets in a ratio of 8:2, and each of the four networks was trained for 400 epochs. Of these, (1) had the longest training time and (4) had the shortest; (4) also had the best training results, with an mAP value of 92.93%. Additionally, all four networks reached a real-time frame rate of 15 FPS, the maximum for this binocular camera. As the amount of training and detection data for the four networks was not very large, a network could show better training performance yet poorer detection results. To control for uncertainties of this type, we kept the influence of external factors as low as possible: all variables were the same, the detection locations were identical and the four neural networks were run separately for counting and height measurement. Accuracy was still calculated by comparing the system results with manual measurements. As can be seen in
Table 7, Ghostnet–YoloV4, which introduces Ghostnet to replace the YoloV4 backbone and modifies the PANet, demonstrated the highest accuracy in counting and height measurement.
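For reference, the 8:2 division of the 1500-image dataset can be sketched as follows. The file names, the seed and the use of a shuffled random split are assumptions for illustration; the text does not describe how images were assigned to each set.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle reproducibly and split into (training, validation) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)      # fixed seed for repeatability
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the 1500 original sapling images.
image_names = [f"sapling_{i:04d}.jpg" for i in range(1500)]
train_set, val_set = split_dataset(image_names)
```

Using the same split for all four networks keeps the ablation comparison fair, since every network then sees identical training and validation images.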
4. Discussion
4.1. Reliability of the Ghostnet–YoloV4 Network
The experimental results show that the Ghostnet–YoloV4 network achieves good accuracy in the real-time counting of all three sapling species. This result validates the prediction that using the Ghostnet network and depthwise separable convolution to improve YoloV4 not only greatly reduces the network load but also gives better detection results. The detection speed of the Ghostnet–YoloV4 network is sufficient for real-time use, judging from the real-time frame rate of 15 FPS achieved. There is therefore no obstacle to deploying the Ghostnet–YoloV4 network on personal computers. It may also be possible to apply the network to mobile devices, such as mobile phones and tablets, in the future, which would greatly enhance its practicality and generalisation and allow it to be applied to detection in more fields.
It is worth noting that Ghostnet–YoloV4 has far fewer parameters than YoloV4, so it trains much faster and saves computing time. This matters for practical applications: each tree species requires its own data collection, labelling and network training, and with a very large variety of saplings in a nursery, training time is particularly important when conducting large-scale counts. If training takes too long, the saplings will have grown by the time the network is ready, which reduces detection capability and, more importantly, delays the production process.
4.2. Binocular Camera 3D Reconstruction Capability
This experiment used images of three sapling species taken with a binocular camera as the main dataset, and the binocular view allowed the reconstruction of spatial location information. We chose a low-resolution binocular camera of only 640 × 480 for two reasons: (1) Since the training data comprised 25,000 images, increasing the resolution of each image would enlarge the training set and massively increase the training time, which is unfavourable for both experimental research and field applications. (2) With higher-resolution images, real-time network processing and real-time 3D reconstruction by the binocular camera run at inconsistent speeds and cause delays, so we needed lower-resolution images to preserve real-time performance. Although the lower-resolution images are sufficient to support the extraction of key points of the saplings and their height measurement, the accuracy and generalisability of the system would be improved if the inexpensive system could process high-resolution images quickly. This will become possible as hardware computing power increases and information sources become more abundant. On the one hand, the increase in computing power will allow the system to obtain better information on colour, texture and depth for learning and detection, which should improve the accuracy of counting and height measurement; on the other hand, other data sources, such as UAV point clouds [
17,
18,
19,
20] and hyperspectral imagery [
41,
42,
43,
44,
45], can effectively reconstruct the structure of individual tree types and thus help their detection.
It is clear from the experimental results that the binocular camera can complete a 3D reconstruction of the current scene and generate a depth map. The depth map contains the 3D coordinates of the pixel points, from which we can obtain the vertex and base points of the saplings, as well as the centre point. The vertex and base points were used to calculate the height of the saplings, while the centroids could be used to position them. Compared to studies that estimate tree height using point cloud images from a drone, we have the advantage that height measurements can be carried out for small saplings shorter than 30 cm, and the binocular camera is cheaper and simpler to operate. The centroid location and height estimation of saplings could provide the basic capability to sense, distinguish, measure and locate target objects in future fully automated nursery management, which in turn would enable unmanned cultivation operations, such as automatic watering, fertilization and temperature control.
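The use of the three key points can be sketched as below. This assumes that height is taken as the Euclidean distance between the 3D coordinates of the vertex and the base point, and that the coordinate origin is the optical centre of the left camera; neither detail is stated in this section, and the function names and coordinates are hypothetical.

```python
import math


def sapling_height(top_xyz, base_xyz):
    """Height as the straight-line distance between vertex and base point."""
    return math.dist(top_xyz, base_xyz)


def range_from_camera(centre_xyz):
    """Distance of the centre point from the assumed origin (left camera)."""
    return math.hypot(*centre_xyz)


# Illustrative camera-frame coordinates in millimetres (x right, y down, z forward).
top = (20.0, -150.0, 1000.0)
base = (20.0, 100.0, 1000.0)
centre = (300.0, 400.0, 1200.0)

height_mm = sapling_height(top, base)
range_mm = range_from_camera(centre)
```

Because both points are read from the depth map, a base point that lands on the ground in front of or behind the sapling changes the z coordinate and therefore the computed height, which is one reason an unmatched frame bottom distorts the measurement.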
However, the low cost of the binocular camera dictates the simplicity of its hardware architecture. As a result, compared to sensors such as UAVs and LiDAR, binocular cameras can produce larger data errors and require more tedious pre-calibration and other tasks. Binocular cameras are susceptible to terrain, light and weather [
46], which negatively affects subsequent processes, such as height measurement and localisation, and limits the ability of binocular cameras to generalise.
In addition, the binocular camera is unable to measure the height of saplings that are heavily obscured by one another; because the key points of such saplings cannot be accessed, it outputs a height value containing a large error. For this reason, we propose a method for manually selecting key points for height measurement. This method solves the problem of low fit and missed saplings through simple human–machine interaction; in addition, by changing the position of the binocular camera, the height of some heavily occluded saplings can be detected. However, occlusion is still a problem in vision technology, and the accuracy of counting and height measurement using the binocular camera in this study was greatly reduced in the case of dense hibiscus and other saplings. To address these shortcomings, we will introduce other sensors in future work, such as using LiDAR to acquire point cloud data [
17], to segment saplings for height measurement and enhance the generalisability of the system.
4.3. Experimental Errors
Although experimental errors are avoided as much as possible, some errors are inevitable due to objective conditions [
47]. The errors mainly originate from the following: (1) There is always some discrepancy between the calibrated parameters and the real parameters of the binocular camera in actual use, which is due to the errors inherent in the binocular camera and errors in the checkerboard calibration images taken. (2) Saplings that are too close together can be mistakenly detected as one sapling, because interlaced branches cause system identification errors. (3) The detection frame never fits the sapling perfectly, and for very dense saplings, the count and measurement accuracy is drastically reduced, which is a drawback of the Yolo series using rectangular frames as the detection tool. (4) Poor lighting conditions reduce the differentiation between saplings and their surroundings, making sapling features weaker and causing detection errors; for saplings on sloping ground, unsuitable detection points decrease the accuracy. These errors can cause discrepancies between the manually measured height TH and the system-measured height H. They can be countered by applying manual intervention and finding the best camera angle, as described above. Where saplings stand in overgrown grass that obscures their roots, the estimated height of the grass needs to be added to the average height of the saplings.
To reduce binocular camera errors, the more expensive ZED 2 integrated binocular camera can be used, which carries a much lower risk of calibration error and allows for higher-resolution images, but this will require more hardware, such as a computer graphics card. In addition, using techniques such as density mapping [
28] to estimate the number of saplings, or using LiDAR, may alleviate the difficulty of counting and measuring the height of dense saplings.