1. Introduction
Safe sailing is the premise for unmanned surface vehicles (USVs) to carry out diverse tasks [1]. Scenario perception and understanding is a fundamental capability of an unmanned system. Detecting, localizing and recognizing objects or obstacles is one of the functions that ensure safe inland and maritime autonomous sailing, and it is as vital for USVs as it is for unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs). Considering both effectiveness and cost, the camera is the most information-dense, affordable sensor for water scenario understanding [2]. Detecting accessible regions in visual images therefore remains a central issue for safe navigation in water scenes.
Water scenario segmentation aims to distinguish water from obstacles in inland or maritime scenarios, providing situation awareness and annotations for the safe driving of USVs. Unmanned systems generally impose stricter accuracy and efficiency requirements on segmentation algorithms.
In the past decades, some researchers attempted to utilize horizon line extraction methods to solve this problem [3,4,5,6]. However, these methods are not suitable for varied water scenarios, such as interception or disaster relief in maritime environments, or water quality monitoring and waste management inland. Therefore, a reliable solution for situation awareness and scenario segmentation in such varied and expanded water scenarios is important for the autonomous navigation of USVs.
In recent years, convolutional neural networks (CNNs) have proven effective across various fields, providing rich deep features and excellent perception results on different platforms. Various powerful CNN-based water scenario segmentation approaches have been proposed [7,8,9,10,11,12,13,14]. Compared with traditional approaches, these methods achieve better performance. However, most of them do not constrain the number of parameters, which affects the deployability of the model. Water scenario segmentation aims to provide real-time support for downstream tasks, such as obstacle avoidance in harbors or floating-waste detection in inland rivers. To cope with complex environments and strict requirements, especially for USVs or embedded devices running multiple task-specific programs within limited computation resources, the processing speed of existing water scenario segmentation methods still needs improving. Thus, a lighter segmentation network is required to further compress the computational cost.
On the other hand, water scenario segmentation to some extent detects the boundaries between water, obstacle and sky regions. Edge information is a powerful prior in both visual data and manual annotations. Long used in traditional segmentation approaches, edge features remain a popular strategy in current deep learning methods. For example, Lee et al. [15] utilize an edge map as one of the inputs to capture rough semantics; Lu et al. [16] and Fan et al. [17] utilize an auxiliary boundary-aware stream extracted from the ground truth to obtain salient features and further estimate the silhouette and segmentation of objects. Inspired by these approaches, we recognize the importance of edge information in both raw images and ground-truth data, yet it is rarely exploited in existing water scenario segmentation methods. Therefore, we establish a spatial stream that integrates the edge information.
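As context for the edge prior used throughout this paper (computed with a Laplacian operator, see Section 4.8), a minimal NumPy sketch of extracting such a prior from a grayscale image might look as follows; the exact kernel and normalization used in ELNet are assumptions here:

```python
import numpy as np

def laplacian_edge_prior(gray: np.ndarray) -> np.ndarray:
    """Edge prior as the normalized magnitude of a 3x3 Laplacian response.

    gray: 2-D float array (H, W); returns values in [0, 1].
    """
    p = np.pad(gray.astype(np.float64), 1, mode="edge")
    # 4-neighbour Laplacian: sum of the four neighbours minus 4x the centre
    lap = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
           - 4.0 * p[1:-1, 1:-1])
    mag = np.abs(lap)
    peak = mag.max()
    return mag / peak if peak > 0 else mag
```

A uniform region yields a zero response, while boundaries such as the water–sky line produce strong responses, which is exactly the cue the spatial stream exploits.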
In this paper, we propose an edge-aware lightweight water scenario segmentation method (dubbed ELNet) for USV systems, which reduces structural complexity and enhances perception capability for water scenarios. Specifically, we build a two-stream learning strategy consisting of a context stream and a well-designed spatial stream. First, edge-prior information is concatenated to the context stream to reduce its parameters and fully utilize the edge similarity between the raw image and the ground truth. Second, we design an edge-dominant spatial stream, which operates only in the training stage, to assist feature learning without introducing extra parameters. Edge-prior information is encoded and coupled with the ground truth to guide detail-feature learning. These designs normalize and enrich the model features, covering not only edge-related semantics but also inter-class granularity semantics in the raw data. The main contributions are summarized as follows:
We propose a lightweight segmentation method for water scenarios that utilizes a two-stream learning strategy. In addition to the traditional context stream, a spatial stream is added to learn spatial details in low-level layers at no extra computation cost at inference time.
We introduce edge-prior information into different layers of both streams, which enables object-level semantic learning and memorizing, and expands the perspectives of pixel-level visual modeling.
Evaluations on the MODS benchmark and the USVInland dataset demonstrate that our approach achieves compelling performance. Notably, we obtain a significant improvement with a much lower number of parameters than the best frame-grained method.
This paper is organized as follows. In Section 2, related work on existing water scenario segmentation methods, edge detection and lightweight networks is introduced. In Section 3, we present how the proposed network ELNet works and detail its design. In Section 4, experimental settings and results validate the performance of the proposed approach. Finally, the work is concluded briefly in Section 5.
4. Experiments
4.1. Dataset and Benchmark
Following the evaluation method of Bovcon et al. [41], we use the public MaSTr1325 dataset for training and validate the proposed method on the MODS benchmark and the USVInland dataset.
MaSTr1325 [42]: The Marine Semantic Segmentation Training Dataset (MaSTr1325) is specially designed for developing obstacle detection methods for small coastal USVs. The dataset contains 1325 images captured in real conditions, which include obstacles, water surface, sky and unknown targets, covering a series of situations encountered in coastal surveillance missions. It captures a variety of weather conditions, ranging from foggy and partly cloudy with sunrise to overcast and sunny, and visually diverse obstacles, as shown in Figure 2. The image size of MaSTr1325 is .
MODS benchmark [41]: The goal of MODS is to benchmark segmentation-based and detection-based obstacle detection methods for the maritime domain, specifically for use in unmanned surface vehicles (USVs). For segmentation-based detection, the segmentation method classifies each pixel of a given sensor image into one of three classes: sky, water or obstacle. The MaSTr1325 dataset was created specifically for training.
The MODS benchmark consists of 94 maritime sequences comprising approximately 8000 annotated frames with over 60k annotated objects. It contains obstacle annotations of two types: dynamic obstacles, which are objects floating in the water such as boats and buoys; and static obstacles, which are all remaining obstacle regions such as shorelines and piers.
USVInland [43]: Different from conditions at sea, the inland-river environment, which is relatively narrow and complex, often brings additional challenges to the positioning and perception of a USV. Compared to the established public datasets in autonomous road driving, such as KITTI [44], Oxford RobotCar [45] and nuScenes [46], the USVInland dataset fills this gap and opens a new situation for inland-river unmanned ships. A total of 27 pieces of original data were collected. The water segmentation sub-dataset contains both relatively low-resolution and high-resolution images. The full original data are used directly for validation.
4.2. Evaluation Metrics
Different from general segmentation metrics, MODS scores models with USV-oriented metrics, which focus on the obstacle detection capabilities of methods. For dynamic obstacles, MODS annotates them with bounding boxes. For static obstacles, MODS annotates the boundary between static obstacles and water as polylines. In this paper, we utilize water-edge accuracy and detection robustness to evaluate the capability of the baseline and the proposed method to detect water edges, and precision, recall and F-score to evaluate the accuracy of segmentation.
For the USVInland dataset, water-edge polylines are annotated manually. Precision, recall and F-score are also chosen to evaluate the segmentation performance. The formulas of the metrics are as follows:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F-score = 2 · Precision · Recall / (Precision + Recall), and water-edge accuracy = (1/N) · Σ_{i=1}^{N} E_i,

where E_i denotes the water-edge RMSE of the i-th frame, TP, FP and FN denote the numbers of true-positive, false-positive and false-negative pixels, and N is the number of evaluated frames.
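Given pixel counts per class, these metrics reduce to a few lines of arithmetic; a small sketch (the function names are ours, not from the MODS toolkit):

```python
def segmentation_scores(tp: int, fp: int, fn: int):
    """Precision, recall and F-score from pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

def water_edge_accuracy(rmse_per_frame):
    """Mean water-edge RMSE over all evaluated frames."""
    return sum(rmse_per_frame) / len(rmse_per_frame)
```

For example, 80 true-positive, 20 false-positive and 20 false-negative pixels yield precision, recall and F-score of 0.8 each.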
4.3. Implementation Details
Although the MaSTr1325 dataset covers various weather conditions, data augmentation is also officially recommended to increase the quantity of the dataset, prevent overfitting and simulate diverse cruising conditions. Therefore, a series of augmentation methods, comprising random horizontal flipping, random rotation by up to 15 degrees, random scaling from 60 to 90 percent, and color changes, are adopted with probabilities of 50%, 20%, 20% and 20%, respectively, when training with MaSTr1325; this is stricter than the implementations in other papers and poses challenges to existing methods. Additionally, since the resolution is not consistent in the USVInland dataset, high-resolution images are resized to the low resolution.
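The stochastic augmentation policy above can be sketched as follows (a schematic sketch with placeholder op names; the actual implementation applies each sampled operation to the image–label pair):

```python
import random

# probability of applying each augmentation, as described above
AUG_PROBS = {"hflip": 0.5, "rotate": 0.2, "scale": 0.2, "color": 0.2}

def sample_augmentations(rng: random.Random):
    """Draw the list of (op, parameter) pairs for one training sample."""
    ops = []
    if rng.random() < AUG_PROBS["hflip"]:
        ops.append(("hflip", None))
    if rng.random() < AUG_PROBS["rotate"]:
        ops.append(("rotate", rng.uniform(-15.0, 15.0)))  # degrees
    if rng.random() < AUG_PROBS["scale"]:
        ops.append(("scale", rng.uniform(0.6, 0.9)))      # relative size
    if rng.random() < AUG_PROBS["color"]:
        ops.append(("color", None))
    return ops
```

Each operation is drawn independently, so roughly half of the training samples are flipped while the remaining operations fire on about one sample in five.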
All approaches are trained with the PyTorch framework [47] and the Adam optimizer [48] on a GeForce GTX 1080 Ti. The learning rates are all initially set to and halved every 25 epochs. The values of the input images are normalized to the range [0, 1]. Additionally, the implementation of ELNet is built on the PyTorch library and follows the standard PyTorch preprocessing, in which and normalization are applied before training.
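The step-decay schedule described above (halving every 25 epochs) admits a closed-form rule; the base learning rate below is a placeholder, not the value used in the paper:

```python
def lr_at_epoch(base_lr: float, epoch: int) -> float:
    """Step decay: the learning rate is halved every 25 epochs."""
    return base_lr * 0.5 ** (epoch // 25)
```

In PyTorch the same schedule is available as torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.5).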
4.4. Comparison with Related Segmentation Methods
In the experiments, state-of-the-art algorithms are compared with the proposed network: WaSR [13], WODIS [9], CollisionFree [10], Skip-ENet [12], ShorelineNet [14] and a general-purpose segmentation network, BiSeNet [39]. The models of WaSR, Skip-ENet and BiSeNet are obtained from official code, and the backbone we choose for WaSR is ResNet-101 with no IMU. The models of WODIS and CollisionFree are reproduced by ourselves based on their published papers.
The parameter counts and inference speeds of these methods are collected in Table 2. The total inference time already includes the time consumed acquiring the edge prior, which is negligible. The table shows that the proposed network ELNet has a clear advantage in the total number of parameters, ∼4.86 M. Except for the lightest model, Skip-ENet, ELNet is 26% smaller than ShorelineNet and far lighter than the state-of-the-art method WaSR and the strong general segmentation method BiSeNet. This achieves the goal of establishing a lightweight network.
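The parameter totals in Table 2 follow the usual convention of summing the element counts of all weight tensors; given a list of tensor shapes this can be sketched as below (with PyTorch, the equivalent idiom is sum(p.numel() for p in model.parameters())):

```python
import math

def param_count_millions(shapes) -> float:
    """Total number of parameters, in millions, from a list of tensor shapes."""
    return sum(math.prod(s) for s in shapes) / 1e6

# e.g. one 3x3 convolution from 3 to 64 channels plus its bias
# (a hypothetical layer, for illustration only):
example_shapes = [(64, 3, 3, 3), (64,)]
```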
As for inference speed, ELNet ranks among the fastest of the compared group on both GPU and CPU, with times much lower than the state-of-the-art water scenario method WaSR (10.63 s/0.98 s on GPU/CPU) and the general method BiSeNet (36.12 s/4.87 s on GPU/CPU). Moreover, the quantitative results of Skip-ENet and ShorelineNet are unstable according to
Table 3, which is unfavorable for scenario perception. The improvement in inference speed mainly comes from three aspects:
First, a two-stream learning strategy (context-stream learning with spatial-stream learning) is applied, and the spatial stream works only in the training stage, which removes its computation cost and thus improves the speed at inference time.
Second, the backbone of the proposed network draws on the designs of lightweight networks such as the MobileNet series, the Inception series, etc., which have been proven to be faster than traditional CNN backbones.
Third, after experiments we select an asymmetric decoder rather than one symmetric with the encoder, which also contributes to the speed at inference time. The experimental detail is discussed in
Section 4.7.
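The per-image inference times in Table 2 can be measured with a generic wall-clock loop (a framework-agnostic sketch; for GPU timing, torch.cuda.synchronize() must additionally be called before each clock read so that queued kernels are included):

```python
import time

def mean_inference_time(fn, x, warmup: int = 3, runs: int = 10) -> float:
    """Average wall-clock time of fn(x) over `runs` calls, after warm-up."""
    for _ in range(warmup):
        fn(x)                      # warm-up: caches, lazy init, JIT, etc.
    start = time.perf_counter()
    for _ in range(runs):
        fn(x)
    return (time.perf_counter() - start) / runs
```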
4.5. Performance on the MODS Benchmark
The training results of ELNet and qualitative evaluation results on the MODS benchmark are shown in
Figure 3. The convergent trend of the training curve shows that ELNet has successfully modeled the semantics of the scenarios in the MaSTr1325 dataset. The training loss curve of ELNet on the MaSTr1325 dataset confirms that the model converged rapidly within the first 5000 iterations. In our view, the reasons that a couple of peaks emerge in the loss curve after the first 5000 iterations are: (a) the scene of the sampled data is challenging to the model at that training stage, so the model performs slightly worse than the models of neighboring iterations, for example when the sky–water line is long and hard to distinguish; and (b) strict data augmentation, including random horizontal flipping, random rotation, random scaling and color changes, is adopted, which intensifies the difficulty. In fact, after the first 5000 iterations almost all loss differences remain within 0.05, far smaller than the differences (>0.8) during the first 5000 iterations. Combined with the training accuracy curve, we regard this as a normal phenomenon.
Among all experimental results in
Figure 4, the edge detection result of WODIS and the obstacle detection result of ShorelineNet are the worst, while WaSR, BiSeNet and ELNet reveal more accurate sea-edge detection and better obstacle detection performance. Additionally, for the segmentation results, it is observed in the first row that WODIS and ShorelineNet leave the most misclassified pixels, while BiSeNet and ELNet have fewer per-pixel classification errors. Especially for identifying obstacles, the proposed network ELNet achieves better segmentation of object contours, which is evident in the yellow circle.
The quantitative evaluation results are summarized in
Table 3. The table shows that WaSR and the proposed method ELNet achieve a much higher water-edge accuracy (11 px for both). Apart from water-edge detection, CollisionFree achieves the best precision at 75% and F-score at 80.7%, but falls behind on recall at 86.7%. Meanwhile, WaSR has the best recall at 98.3%, but its precision and F-score are both lower than those of ELNet. Across all metrics, ELNet shows the best trade-off, combining the best water-edge detection with strong segmentation results. Meanwhile, ELNet has a much lower number of parameters and better qualitative performance, as aforementioned, which meets the original pursuit of a smaller and stronger model.
4.6. Performance on USVInland Dataset
The model used on USVInland is also trained on the MaSTr1325 dataset, the purpose of which is to evaluate the capability of transfer learning. The experimental results are illustrated in
Figure 5 and
Table 4. Since the ground truth does not distinguish the sky from obstacles, we compare only the quality of obstacle–water segmentation and the water–sky line. From
Figure 5, we observe that WaSR and ELNet achieve more accurate edge detection, as judged by manually labeling the obstacle–water edge. CollisionFree fails to perceive the boundary of water in river scenarios. For the segmentation results, ELNet has fewer false-positive pixels when detecting water regions.
From
Table 4, the results of the proposed network ELNet reveal performance comparable with the state-of-the-art method WaSR. This observation confirms that ELNet achieves similar performance to other excellent water segmentation approaches with a lighter network, which proves its strong robustness. Therefore, we believe that ELNet is more capable of performing well when the model is transferred to other datasets.
4.7. Ablation Study on the Number of Upsampling Blocks
Five types of links between the encoder and decoder are trained on the MaSTr1325 dataset and validated on the MODS benchmark following the implementation details in
Section 4.3. They are summarized as follows:
Type 1: features from stages 1 and 5, which preserve the highest- and lowest-level features.
Type 2: features from stages 3 and 5, which preserve the finest-guidance and lowest-level features.
Type 3: features from stages 1, 3 and 5, which preserve the highest-level, lowest-level and finest-guidance features.
Type 4: features from stages 1, 2, 4 and 5, which preserve all features apart from the finest-guidance ones.
Type 5: features from stages 1, 2, 3, 4 and 5, which classically preserve the features of all levels.
The experimental results are shown in
Table 5. The comparison between Type 2 and Type 3 demonstrates the importance of low-level features for fundamental cognition. Additionally, the results of Type 1 and Type 3 validate the improvement brought by the low-level features of the third stage. Type 4 has one large upsampling kernel (kernel size 4) instead of two small kernels, yet achieves better performance than Type 5.
It is observed that the model of Type 4 performs comparably to that of Type 3, the second-best structure type. Although Type 3 has nearly identical performance to Type 4, its total number of parameters is 20% lower. This is the configuration the proposed network ultimately adopts.
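To illustrate how the kernel choice affects parameter count, consider transposed-convolution layers with hypothetical channel widths (the real ELNet widths are not reproduced here); such a layer has c_in · c_out · k · k weights plus c_out biases:

```python
def deconv_params(c_in: int, c_out: int, k: int) -> int:
    """Parameter count of a transposed-convolution layer (weights + bias)."""
    return c_in * c_out * k * k + c_out

# hypothetical channel width, for illustration only
c = 64
one_large = deconv_params(c, c, 4)      # single k=4 upsampling block
two_small = 2 * deconv_params(c, c, 2)  # two stacked k=2 blocks
```

Note that at equal channel widths a single 4x4 kernel holds more weights than two 2x2 kernels, so the 20% saving of Type 3 over Type 4 presumably comes from dropping skip links and their decoder blocks rather than from the kernel size alone.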
4.8. Ablation Study on Edge-Aware Modules
We define the specially designed components in ELNet as Fusion and Auxiliary, where Fusion means the concatenation of the edge prior, calculated with a Laplacian operator, with the main-branch feature maps, and Auxiliary means the entire spatial stream, which includes two convolutional heads and their corresponding objective functions. The experiments on the effectiveness of the two components also follow the paradigm in which the models are trained on the MaSTr1325 dataset and validated on the USVInland dataset with the implementation details in Section 4.3.
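At the tensor level, the Fusion step amounts to channel-wise concatenation of the single-channel edge prior with the feature maps; a shape-level NumPy sketch (ELNet itself performs this on network tensors, e.g. with torch.cat):

```python
import numpy as np

def fuse_edge_prior(features: np.ndarray, edge: np.ndarray) -> np.ndarray:
    """Concatenate a (H, W) edge prior onto (C, H, W) feature maps."""
    if edge.shape != features.shape[1:]:
        raise ValueError("edge prior must match the feature maps spatially")
    return np.concatenate([features, edge[np.newaxis]], axis=0)
```

The fused tensor simply gains one extra channel, which keeps the added cost of the Fusion strategy negligible.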
Table 6 illustrates the final results.
It can be observed from the first and second rows that the Fusion strategy helps better recognize objects in visual data. The same holds for the Auxiliary strategy when comparing the first row with the third row, with an even larger margin than the Fusion strategy; this demonstrates that the spatial stream contributes more to the improvements. When the two strategies are implemented simultaneously, the final result naturally achieves the best. Nevertheless, it is worth noting that ELNet reaches only 88.69% recall against 93.96% precision; although lower false segmentation is an advantage for safer navigation on the water surface in maritime environments, recall is a key point to be improved in the future.