Article

Towards Smaller and Stronger: An Edge-Aware Lightweight Segmentation Approach for Unmanned Surface Vehicles in Water Scenarios

1 School of Mechatronic Engineering and Automation, Shanghai University, Shanghai 200444, China
2 Systems Engineering Research Institute, China State Shipbuilding Corporation, Beijing 100094, China
3 School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(10), 4789; https://doi.org/10.3390/s23104789
Submission received: 13 April 2023 / Revised: 6 May 2023 / Accepted: 8 May 2023 / Published: 16 May 2023
(This article belongs to the Section Environmental Sensing)

Abstract

The accurate detection and segmentation of accessible surface regions in water scenarios is one of the indispensable capabilities of unmanned surface vehicle systems. Most existing methods focus on accuracy and ignore lightweight and real-time demands; therefore, they are not suitable for the embedded devices widely used in practical applications. An edge-aware lightweight water scenario segmentation method (ELNet), which establishes a lighter yet better network with lower computation, is proposed. ELNet utilizes two-stream learning and edge-prior information. Besides the context stream, a spatial stream is expanded to learn spatial details in low-level layers with no extra computation cost in the inference stage. Meanwhile, edge-prior information is introduced to the two streams, which expands the perspectives of pixel-level visual modeling. The experimental results are 45.21 FPS, 98.5% detection robustness and a 75.1% F-score on the MODS benchmark, and 97.82% precision and a 93.96% F-score on the USVInland dataset. This demonstrates that ELNet uses fewer parameters to achieve comparable accuracy and better real-time performance.

1. Introduction

Safe sailing is the premise for unmanned surface vehicles (USVs) to carry out diverse tasks [1]. Scenario perception and understanding is a fundamental capability of such unmanned systems. Detecting, localizing and recognizing objects or obstacles is essential for safe inland and maritime autonomous sailing, just as it is for unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs). Considering effectiveness and cost, the camera is the most information-dense, affordable sensor for water scenario understanding [2]. Consequently, detecting accessible regions in visual images is a central issue for safe navigation in water scenes.
Water scenario segmentation aims to distinguish water from obstacles in inland or maritime scenarios, providing situational awareness and annotations for the safe driving of USVs. Generally, unmanned systems place stricter requirements on the accuracy and efficiency of segmentation algorithms.
In past decades, some researchers attempted to solve this problem with horizon-line extraction methods [3,4,5,6]. However, these methods are not suitable for varied water scenarios, such as interception or disaster relief in maritime environments, or water quality monitoring and waste management inland. Therefore, a reliable solution for situational awareness and scenario segmentation in such broad and varied water scenarios is important for the autonomous navigation of USVs.
In recent years, convolutional neural networks (CNNs) have proven effective in various fields, providing rich deep features and excellent perception results on different platforms. Various powerful CNN-based water scenario segmentation approaches have been proposed [7,8,9,10,11,12,13,14]. Compared with traditional approaches, these methods achieve better performance. However, most of them pay little attention to the number of parameters, which affects the deployability of the model. Water scenario segmentation aims to give real-time support to downstream tasks, such as obstacle avoidance in harbors or floating waste detection in inland rivers. To cope with complex environments and strict requirements, especially for USVs or devices embedded with multiple task-specific programs within limited computational resources, the processing speed of existing water scenario segmentation methods still needs improvement. Thus, a lighter segmentation network is required to further compress the computational cost.
On the other hand, water scenario segmentation to some extent detects the boundaries between water, obstacle and sky regions. Edge information is a powerful prior in both the visual data and the manual annotations. Although edge features originate in traditional segmentation approaches, they remain a popular strategy in current deep learning methods. For example, Lee et al. [15] utilize an edge map as one of the inputs to capture rough semantics, while Lu et al. [16] and Fan et al. [17] utilize an auxiliary boundary-aware stream extracted from the ground truth to make features salient and further estimate the silhouette and segmentation of objects. Inspired by those approaches, we recognize the importance of edge information in both the raw image and the ground-truth data; nevertheless, it is rarely adopted in existing water scenario segmentation methods. Therefore, we establish a spatial stream that learns jointly with the edge information.
In this paper, we propose an edge-aware lightweight water scenario segmentation method (dubbed ELNet) for USV systems, which reduces structural complexity and enhances perception capability in water scenarios. Specifically, we build a two-stream learning strategy consisting of a context stream and a well-designed spatial stream. First, edge-prior information is concatenated to the context stream, the purpose of which is to reduce the parameters in the context stream and fully utilize the edge similarity between the raw image and the ground truth. Second, we design an edge-dominant spatial stream, which works only in the training stage, to assist feature learning while introducing no parameters. Edge-prior information is encoded and coupled with the ground truth to guide detail feature learning. These designs normalize and enrich the model features, covering not only edge-related semantics but also inter-class granularity in the raw data. The main contributions are summarized as follows:
  • We propose a lightweight segmentation method for water scenarios, which utilizes a two-stream learning strategy. Besides the traditional context stream, a spatial stream is expanded to learn spatial details in low-level layers with no extra computation cost at inference time.
  • We introduce edge-prior information to different layers in both streams, which leads to object-level semantic learning and memorization, and expands the perspectives of pixel-level visual modeling.
  • Evaluations on the MODS benchmark and the USVInland dataset demonstrate that our approach achieves compelling performance. Notably, we obtain a significant improvement with far fewer parameters than the best frame-grained method.
This paper is organized as follows. Section 2 introduces related work on existing water scenario segmentation methods, edge detection and lightweight networks. Section 3 presents how the proposed network ELNet works and describes its detailed design. Section 4 validates the performance of the proposed approach with experimental settings and results. Finally, Section 5 concludes the work briefly.

2. Related Works

2.1. Water Scenario Segmentation

Semantic segmentation is a classic problem in computer vision: each pixel in an image is assigned a category ID according to the object of interest to which it belongs [18,19,20]. Recently, segmentation in water scenarios has also attracted much attention due to the development of unmanned surface vehicle systems. Water scenario segmentation aims to semantically classify each pixel in an image into obstacle/environment, water or sky regions. The segmentation provides an accurate separation of non-water regions, so that fully autonomous navigation of USVs can be realized on complex water surfaces.
In recent years, vision-based segmentation methods have achieved promising improvements with CNN-based modeling [8,9,10,11,12,13,14]. For inland waterways, Bovcon et al. extend a single-view model to a stereo system and propose a stereo-based obstacle detection method [8]. Zhou et al. propose an inland collision-free waterway segmentation model based on pixel-wise classification [10]. Vandaele et al. improve inland semantic results on two segmentation datasets by applying transfer learning to segmentation [11].
For maritime scenarios, Chen et al. propose an attention-based semantic segmentation network, designing an attention refinement module (ARM) to improve detection accuracy in sea–sky-line areas [9]. Kim et al. propose a vision sensor-based Skip-ENet model to recognize marine obstacles effectively within a limited computational budget [12]. Bovcon et al. propose a water-obstacle separation and refinement network (WaSR) to improve water-edge estimation and small-obstacle detection and to reduce the high false-positive rates caused by water reflections and wakes [13].

2.2. Edge Detection

Edge detection localizes and extracts significant variations (the boundaries between different objects) in an image, which benefits various vision-based tasks. Bertasius et al. utilize a multi-scale deep network to exploit object-related features as high-level cues for contour detection [21]. Combined with feature fusion, Yu et al. propose a novel skip-layer design and a multi-label loss function for semantic edge detection tasks [22]. Shen et al. adapt the training strategy with edge information, in which contour data are partitioned into sub-classes and each sub-class is fitted by different model parameters [23].
In this paper, different applications of edge information are explored to benefit model learning and feature analysis, and thereby improve the performance of pixel-level prediction.

2.3. Lightweight Networks

CNNs are applied in diverse research fields, and the performance of many tasks has thus improved tremendously. A critical remaining question, however, is efficiency. Conventional CNN inference is difficult to apply in resource-constrained scenarios, such as mobile terminals and the Internet of Things, because CNNs require a large amount of computation; only through complex tailoring can CNN models be deployed to the mobile end. Fortunately, starting from SqueezeNet [24] and MobileNetV1 [25], researchers have gradually paid attention to efficiency problems in resource-constrained scenarios. After several years of development, relatively mature lightweight networks include MnasNet [26], the MobileNet series [25,27,28], the Inception series [29,30,31], the FBNet series [32,33,34], and GhostNet [35].
MobileNetV2 borrows the residual structure of ResNet and introduces the inverted residual module for further improvement. On the one hand, the residual structure is conducive to the network's learning; on the other hand, it reduces the computation of the original point-wise convolution. Although MobileNetV3 draws on the advantages of MnasNet and the MobileNet series, its improvement comes mainly from manual effort and is not flexible enough to serve as a backbone here. In this paper, MobileNetV2 is selected as the encoder backbone of the proposed network.
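To make the backbone choice concrete, the following is a minimal sketch of a MobileNetV2-style inverted residual block, assuming the standard expand–depthwise–project layout with ReLU6 activations and a linear bottleneck as described in [27]; the expansion factor and channel arguments are illustrative, not this paper's exact configuration.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Minimal MobileNetV2-style inverted residual block (illustrative sketch)."""
    def __init__(self, in_ch, out_ch, stride=1, expansion=6):
        super().__init__()
        mid_ch = in_ch * expansion
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),             # 1x1 point-wise expansion
            nn.BatchNorm2d(mid_ch),
            nn.ReLU6(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1,
                      groups=mid_ch, bias=False),                # 3x3 depth-wise convolution
            nn.BatchNorm2d(mid_ch),
            nn.ReLU6(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),            # 1x1 linear projection (no activation)
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```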

3. Method

An overview of the proposed edge-aware lightweight network (ELNet) is shown in Figure 1. First, edge detection is conducted to acquire the edge information, which also serves as input data to the network. Then, the raw sensor data and the corresponding edge data are fed into the context stream and the spatial stream. The context stream utilizes the classic encoder–decoder structure to infer the segmentation result, and the spatial stream implements feature guidance mainly in the low-level layers. In this section, we introduce the details of the proposed network and describe the whole architecture.
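As a rough illustration of the two-stream workflow described above, the sketch below assumes hypothetical context_stream and spatial_stream modules and a generic segmentation criterion; it shows only the training-versus-inference flow, not the authors' implementation.

```python
import torch

def training_step(context_stream, spatial_stream, image, edge_map, gt_mask, seg_criterion):
    """Training: both streams run; the spatial stream only contributes auxiliary losses."""
    seg_logits, stage3_feat = context_stream(image, edge_map)    # encoder-decoder with edge prior
    aux_losses = spatial_stream(stage3_feat, edge_map, gt_mask)  # detail guidance (training only)
    return seg_criterion(seg_logits, gt_mask) + sum(aux_losses)

@torch.no_grad()
def predict(context_stream, image, edge_map):
    """Inference: the spatial stream is discarded, so it adds no parameters or compute."""
    seg_logits, _ = context_stream(image, edge_map)
    return seg_logits.argmax(dim=1)                              # per-pixel class map
```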

3.1. Network Architecture

  • Encoder. ELNet follows the traditional encoder–decoder structure to obtain the segmentation result. The encoder in the context stream is essential for feature extraction and latent feature analysis. We choose MobileNetV2 [27] as the backbone and retain as much of its original design as possible. Except for the last classification module, MobileNetV2 has a total of 17 calculation modules. It contains multiple bottleneck blocks, which operate at five scales to extract features from the original image progressively.
To achieve a deep perception of edges, edge-prior information is concatenated to each encoding stage. A small fraction of the feature channels is chosen to store the boundary semantics, which also slightly reduces the number of parameters. To match the size of the learned features, average pooling is used to align the size of the edge feature for fusion. We generate binary edge information from the raw sensor data with a Laplacian operator $k_{laplace}$, whose parameters are set as
$$k_{laplace} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$
We select this basic filter mask as the parameters of the Laplacian operator because it generalizes well, applying to almost all cases of edge detection.
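A minimal sketch of this edge-prior extraction is given below, assuming the Laplacian response is computed on a grayscale version of the image and binarized with a fixed threshold; the grayscale conversion and threshold value are our assumptions, as the paper specifies only the kernel.

```python
import torch
import torch.nn.functional as F

def laplacian_edge_prior(image: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    """Binary edge map for an image batch (B, C, H, W) using the 4-neighbour Laplacian kernel.
    The binarization threshold is an illustrative assumption."""
    k = torch.tensor([[0., 1., 0.],
                      [1., -4., 1.],
                      [0., 1., 0.]], device=image.device)
    gray = image.mean(dim=1, keepdim=True)             # collapse RGB to a single channel
    response = F.conv2d(gray, k.view(1, 1, 3, 3), padding=1)
    return (response.abs() > threshold).float()        # binarize the filter response
```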
  • Decoder. To minimize the total number of parameters, we utilize the representative feature maps of three stages: the first, the third and the last. We believe that the features of the first stage preserve fundamental pixel-level information such as object shapes, the features of the last stage preserve the most abstract semantic information, and the features of the third stage are guided by the edge detail from the inputs and the ground truth. Therefore, the features of the second and fourth stages are not as vital as those of the first, third and last stages. For the feature maps of the first and third stages, the features after the last convolution layer are discarded and replaced by the features at the point where the number of channels is at its maximum. This selection fully preserves the enriched information obtained by the encoder and contributes favorable information for segmentation. We also design an ablation study to validate the rationality of this design; the experimental results are given in Section 4.7. The decoding and fusion strategy in the decoder can be formulated as:
$$\begin{aligned}
\mathrm{fea}_i &= \mathrm{TransposeConv}(\mathrm{fea}_i) \\
\mathrm{fea}_i &= \mathrm{Normalization}(\mathrm{fea}_i) \\
\mathrm{fea}_i &= \mathrm{Dropout}(\mathrm{fea}_i) \\
\mathrm{fea}_i &= \mathrm{Activation}(\mathrm{fea}_i) \\
\mathrm{fea}_{i-2} &= \mathrm{Concatenate}(\mathrm{fea}_i, \mathrm{fea}_{i-2}) \qquad (i = 3,\ 5)
\end{aligned}$$
where $\mathrm{fea}_i$ denotes the feature to be processed in the $i$-th stage. Batch normalization [36] is used for normalization and the ReLU function [37] for activation in this paper. The dropout rate is set to 0.5 [38].
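The sketch below mirrors one such decoding/fusion step, assuming a transposed convolution whose kernel size equals the upsampling stride; the channel counts are placeholders rather than the exact values listed in Table 1.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Decoder block following the fusion rule above: transposed conv -> batch norm ->
    dropout -> ReLU, then concatenation with an earlier-stage feature map."""
    def __init__(self, in_ch, out_ch, scale=4):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=scale, stride=scale)
        self.norm = nn.BatchNorm2d(out_ch)
        self.drop = nn.Dropout2d(p=0.5)
        self.act = nn.ReLU(inplace=True)

    def forward(self, fea_i, fea_skip):
        x = self.act(self.drop(self.norm(self.up(fea_i))))
        return torch.cat([x, fea_skip], dim=1)         # fuse with the skip feature (fea_{i-2})
```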
  • Auxiliary detail guidance. As shown in Figure 1, we use the low-level features (the 3rd stage) to produce detail guidance via the edge information and the segmentation ground truth. Specifically, the guidance comes from two perspectives: consistency with the edge-prior features calculated from the Laplacian operator, and consistency with the segmentation ground truth after decoding. Based on the above description, a convolution head (Conv Head) is introduced to regularize the third-stage feature maps. This module is defined as
$$\mathrm{ConvHead}(I) = \mathrm{Conv}_2(\mathrm{Activation}(\mathrm{Normalization}(\mathrm{Conv}_1(I))))$$
where $I$ denotes the input features, and "Normalization" and "Activation" are batch normalization and the ReLU function, respectively. The kernels of the two convolution layers are $3 \times 3$ and $1 \times 1$, respectively.
The edge features $\mathrm{fea}_{edge}$ and distinguished features $\mathrm{fea}_{jdg}$ are defined as:
$$\mathrm{fea}_{edge} = \mathrm{ConvHead}_1(\mathrm{LD}), \qquad \mathrm{fea}_{jdg} = \mathrm{ConvHead}_2(\mathrm{fea}_3)$$
where $\mathrm{LD}$ is the Laplacian-detected input image, and $\mathrm{fea}_3$ denotes the entire third-stage features after processing.
Note that this stream is discarded in the inference phase. Therefore, the spatial stream boosts prediction accuracy while introducing no additional parameters.
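A minimal sketch of the Conv Head defined above, with the channel sizes left as arguments since they differ between the two heads:

```python
import torch.nn as nn

class ConvHead(nn.Module):
    """Conv Head from the equation above: 3x3 conv -> batch norm -> ReLU -> 1x1 conv."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1, bias=False)
        self.norm = nn.BatchNorm2d(mid_ch)
        self.act = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(mid_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.conv2(self.act(self.norm(self.conv1(x))))
```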
  • Network details. Table 1 shows the detailed structure of the proposed network ELNet.

3.2. Detail Guidance in Spatial Path

Inspired by [17,39], we note that, compared with a single-stream network, which provides only context information in the backbone's low-level layers (the 3rd stage), an additional spatial stream can encode complementary spatial details, e.g., boundaries and corners. Based on this observation, we utilize the auxiliary stream to guide the low-level features to learn spatial information independently.
  • Guide with edge-prior information. The detail feature prediction is modeled as a small knowledge-distillation task and a binary segmentation task. We first generate the edge features encoded from the input edge image by a Laplacian operator and guide the partial learning of the third-stage coding features, which learn the same information from the input image pairs. This can be written as
$$L_{fea} = L_1\left(\mathrm{fea}_{edge},\ \mathrm{fea}_3[N,:,:]\right)$$
where $L_1$ is the L1 loss. Specifically, the last $N$ channel features are guided to be consistent with the knowledge from the input edge image.
  • Guide with ground truth. Another Conv Head is then utilized to generate a segmentation prediction from the whole third-stage feature map and the detail ground truth, which guides the feature map of the low-level layer to learn more spatial details. As shown in Figure 1, this guidance can be formulated as
$$L_{3rd\text{-}seg} = L_{focal}(p_s, g_s)$$
where $p_s \in \mathbb{R}^{H \times W}$ denotes the pixel-level spatial features $\mathrm{fea}_{jdg}$ and $g_s \in \mathbb{R}^{H \times W}$ denotes the corresponding downscaled spatial ground truth, where the downscale factor is 8. $L_{focal}$ is the focal loss with cross entropy, modified from [40]. A combined sketch of both guidance terms follows.
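The sketch below combines both guidance terms, assuming the edge input has already been pooled to 1/8 resolution, that ConvHead1 outputs the same number of channels as are guided, and that ConvHead2 outputs a single-channel map; the focal weighting of $L_{3rd\text{-}seg}$ is replaced by plain binary cross-entropy for brevity, and n_guided is an illustrative assumption.

```python
import torch.nn.functional as F

def detail_guidance_losses(fea3, edge_lowres, detail_gt, conv_head1, conv_head2, n_guided=4):
    """Spatial-stream losses (training only).
    fea3:        third-stage features, shape (B, C, H/8, W/8)
    edge_lowres: Laplacian edge image pooled to (B, C_e, H/8, W/8)
    detail_gt:   binary detail ground truth, shape (B, 1, H, W)"""
    fea_edge = conv_head1(edge_lowres)                        # ConvHead1, n_guided output channels
    l_fea = F.l1_loss(fea_edge, fea3[:, -n_guided:, :, :])    # align the last n_guided channels

    p_s = conv_head2(fea3)                                    # ConvHead2 -> coarse logits (B,1,H/8,W/8)
    g_s = F.interpolate(detail_gt.float(), scale_factor=1 / 8, mode="nearest")
    l_3rd_seg = F.binary_cross_entropy_with_logits(p_s, g_s)  # the paper uses a focal variant here
    return l_fea, l_3rd_seg
```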

3.3. Total Loss Function

The total loss function of the proposed network ELNet is composed of three terms:
$$L = \lambda_{fea} L_{fea} + \lambda_{3rd\text{-}seg} L_{3rd\text{-}seg} + \lambda_{seg} L_{seg}$$
where $\lambda_i$ is the corresponding weight balancing the three terms. The first two terms, $L_{fea}$ and $L_{3rd\text{-}seg}$, are stated in Section 3.2. The last term, $L_{seg}$, is the focal cross-entropy loss at the original image scale between the final prediction and the ground truth. Although there is no severe class imbalance (the number of positive (water) pixels is not much smaller than that of negative (non-water) pixels), focal loss is used to impose a stronger penalty on false predictions. The focal cross-entropy loss is written as
$$L_{focal} = -\alpha (1 - p)^{\gamma} \log(p)$$
where $\gamma$ is the focal rate, $\alpha$ is a balancing parameter, and $p \in [0, 1]$ is the probability of the predicted pixel, defined as
$$p = \exp(-L_{CE})$$
In this paper, the weights are set as $\lambda_{fea} = 1$, $\lambda_{3rd\text{-}seg} = 1$ and $\lambda_{seg} = 1$ in the subsequent experiments. For the focal cross-entropy loss, the setting $\gamma = 2$, $\alpha = 0.25$ follows [40].
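A sketch of the focal cross-entropy term and the total objective, assuming a standard per-pixel cross-entropy as $L_{CE}$; variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def focal_cross_entropy(logits, target, gamma=2.0, alpha=0.25):
    """Per-pixel focal cross-entropy: with p = exp(-L_CE), the loss is alpha * (1 - p)^gamma * L_CE."""
    ce = F.cross_entropy(logits, target, reduction="none")   # L_CE = -log(p) for the true class
    p = torch.exp(-ce)
    return (alpha * (1.0 - p) ** gamma * ce).mean()

def total_loss(l_fea, l_3rd_seg, seg_logits, gt_mask, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; all weights are 1 in the paper's experiments."""
    l_seg = focal_cross_entropy(seg_logits, gt_mask)
    return weights[0] * l_fea + weights[1] * l_3rd_seg + weights[2] * l_seg
```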

4. Experiments

4.1. Dataset and Benchmark

Following the evaluation protocol of Bovcon et al. [41], we use the public MaSTr1325 dataset for training and validate the proposed method on the MODS benchmark and the USVInland dataset.
  • MaSTr1325 [42]: The Marine Semantic Segmentation Training Dataset (MaSTr1325) was created specifically to develop obstacle detection methods for small coastal USVs. The dataset contains 1325 real-world images, which include obstacles, water surface, sky and unknown targets, covering a range of real conditions encountered in coastal surveillance missions. It captures a variety of weather conditions, ranging from foggy and partly cloudy with sunrise to overcast and sunny, and visually diverse obstacles, as shown in Figure 2. The image size of MaSTr1325 is $512 \times 384$.
  • MODS benchmark [41]: The goal of MODS is to benchmark segmentation-based and detection-based obstacle detection methods for the maritime domain, specifically for use on unmanned surface vehicles (USVs). For segmentation-based detection, the segmentation method classifies each pixel in a given sensor image into one of three classes: sky, water or obstacle. The MaSTr1325 dataset was created specifically for training.
The MODS benchmark consists of 94 maritime sequences totaling approximately 8000 annotated frames with over 60k annotated objects. It contains obstacle annotations of two types: dynamic obstacles, which are objects floating in the water such as boats and buoys, and static obstacles, which are all remaining obstacle regions such as shorelines and piers.
  • USVInland [43]: Unlike conditions at sea, the inland river environment, which is relatively narrow and complex, often brings additional challenges to the positioning and perception of USVs. Compared with established public datasets in the field of autonomous road driving, such as KITTI [44], Oxford RobotCar [45] and nuScenes [46], the USVInland dataset fills a gap and opens a new avenue for inland unmanned vessels. A total of 27 original data sequences were collected. The water segmentation sub-dataset contains both relatively low-resolution ($640 \times 320$) and high-resolution ($1280 \times 640$) images. The full original data are used directly for validation.

4.2. Evaluation Metrics

Unlike general segmentation metrics, MODS scores models using USV-oriented metrics that focus on the obstacle detection capabilities of methods. Dynamic obstacles are annotated with bounding boxes, while the boundaries between static obstacles and water are annotated as polylines. In this paper, we utilize water-edge accuracy ($\mu_A$) and detection robustness ($\mu_R$) to evaluate the capability of the baselines and the proposed method to detect water edges, and precision ($Pr$), recall ($Re$) and F-score to evaluate the accuracy of segmentation.
For the USVInland dataset, water-edge polylines are annotated manually. Precision ($Pr$), recall ($Re$) and F-score are also chosen to evaluate the segmentation performance. The metrics are defined as follows:
$$\begin{aligned}
\mu_A &= \frac{\sum_{i=0}^{N_{frame}} WE_{RMSE}^{i}}{N_{frame}}, \qquad
\mu_R = \frac{TP_{land}}{TP_{land} + FP_{land}}, \\
Pr &= \frac{TP}{TP + FP}, \qquad
Re = \frac{TP}{TP + FN}, \qquad
F\text{-}score = \frac{2 \times Pr \times Re}{Pr + Re}
\end{aligned}$$
where $WE_{RMSE}^{i}$ denotes the water-edge RMSE of the $i$-th frame.
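A direct sketch of these metric definitions from TP/FP/FN counts and per-frame water-edge RMSE values; the counting of TP/FP/FN against the annotations is handled by the benchmark toolkit and is not reproduced here.

```python
def segmentation_metrics(tp: int, fp: int, fn: int):
    """Pixel-level precision, recall and F-score from counts, as defined above."""
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    return pr, re, 2 * pr * re / (pr + re)

def water_edge_accuracy(rmse_per_frame):
    """mu_A: mean water-edge RMSE over all evaluated frames."""
    return sum(rmse_per_frame) / len(rmse_per_frame)

def detection_robustness(tp_land: int, fp_land: int):
    """mu_R: fraction of correctly detected water-edge (land) pixels."""
    return tp_land / (tp_land + fp_land)
```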

4.3. Implementation Details

Although the MaSTr1325 dataset covers various weather conditions, data augmentation is also officially recommended to enlarge the dataset, prevent overfitting and simulate diverse cruising conditions. Therefore, when training on MaSTr1325 we adopt a series of augmentations, namely random horizontal flipping, random rotation by up to 15 degrees, random scaling between 60 and 90 percent, and color changes, with probabilities of 50%, 20%, 20% and 20%, respectively; this is stricter than the implementations in other papers and poses challenges to existing methods. Additionally, since the resolution is not consistent in the USVInland dataset, high-resolution images are resized to the low resolution.
All approaches are trained with the PyTorch framework [47] and the Adam optimizer [48] on a GeForce GTX 1080 Ti. The learning rates are all initially set to $1 \times 10^{-4}$ and are halved every 25 epochs. The values of the input images are normalized to the range [0, 1]. The backbone of ELNet is imported from the PyTorch library and follows the standard PyTorch preprocessing, in which normalization with $mean = [0.485, 0.456, 0.406]$ and $std = [0.229, 0.224, 0.225]$ is applied before training.
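A sketch of this training configuration using torchvision transforms; the specific transform classes and their grouping under RandomApply are our assumptions (the paper states only the operations and probabilities), and in practice the geometric transforms must be applied identically to the label masks.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR
from torchvision import transforms

# Image-side augmentation with the stated probabilities; masks need the same geometric transforms.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.RandomRotation(degrees=15)], p=0.2),
    transforms.RandomApply([transforms.RandomAffine(degrees=0, scale=(0.6, 0.9))], p=0.2),
    transforms.RandomApply([transforms.ColorJitter(0.2, 0.2, 0.2)], p=0.2),
    transforms.ToTensor(),                                    # scales pixel values into [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def make_optimizer(model):
    """Adam with an initial learning rate of 1e-4, halved every 25 epochs."""
    optimizer = Adam(model.parameters(), lr=1e-4)
    scheduler = StepLR(optimizer, step_size=25, gamma=0.5)
    return optimizer, scheduler
```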

4.4. Comparison with Related Segmentation Methods

In the experiments, state-of-the-art algorithms are compared with the proposed network: WaSR [13], WODIS [9], CollisionFree [10], Skip-ENet [12], ShorelineNet [14] and a general-purpose segmentation network, BiSeNet [39]. The models of WaSR, Skip-ENet and BiSeNet are obtained from official code, and the backbone we choose for WaSR is ResNet101 with no IMU. The models of WODIS and CollisionFree are reproduced by ourselves based on their published papers.
The parameter counts and inference speeds of these methods are collected in Table 2. The total inference time already includes the time consumed acquiring the edge prior, which is negligible. Table 2 shows that the proposed network ELNet has an extremely evident advantage in the total number of parameters, ∼4.86 M. Except for the lightest model, Skip-ENet, ELNet is 26% smaller than ShorelineNet and far lighter than the state-of-the-art method WaSR and the strong general segmentation method BiSeNet. This achieves the goal of establishing a lightweight network.
As for inference speed, ELNet ranks high in the compared group on both GPU and CPU, and is much faster than the state-of-the-art water scenario method WaSR (10.63/0.98 FPS on GPU/CPU) and faster than the general method BiSeNet (36.12/4.87 FPS on GPU/CPU). Although Skip-ENet and ShorelineNet are faster still, their quantitative results are unstable according to Table 3, which is unfavorable for scenario perception. The improvement in inference speed mainly comes from three aspects:
  • First, a two-stream learning strategy (context-stream learning with spatial-stream learning) is applied, and the spatial stream works only in the training stage, which reduces the computation cost and thus improves speed at inference time.
  • Second, the backbone of the proposed network follows the designs of lightweight networks such as the MobileNet series, the Inception series, etc., which have been proven faster than traditional CNN architectures.
  • Third, after experiments we select an asymmetric decoder rather than one symmetric to the encoder, which also contributes to inference speed. The experimental details are discussed in Section 4.7.

4.5. Performance on the MODS Benchmark

The training results of ELNet and the qualitative evaluation results on the MODS benchmark are shown in Figure 3. The convergent trend in the training curve shows that ELNet has successfully modeled the semantics of the scenarios in the MaSTr1325 dataset. The training loss curves of ELNet on the MaSTr1325 dataset confirm that the model converges rapidly in the first 5000 iterations. In our view, the reasons that a couple of peaks emerge in the loss curve after the first 5000 iterations are: (a) the scenes of the sampled data are challenging for the model at that training stage, so the model performs slightly worse than the models of neighboring iterations, e.g., when the sky–water line is long and hard to distinguish in the sampled data; and (b) strict data augmentation including random horizontal flipping, random rotation, random scaling and color changes is adopted, which also intensifies the difficulty. In fact, after the first 5000 iterations the loss varies by less than 0.05, which is far smaller than the variation (>0.8) within the first 5000 iterations. Combined with the training accuracy curve, we regard this as a normal phenomenon.
Among the experimental results in Figure 4, the edge detection result of WODIS and the obstacle detection result of ShorelineNet are the worst, while WaSR, BiSeNet and ELNet reveal more accurate sea-edge detection and better obstacle detection performance. For the segmentation results, it is observed in the first row that WODIS and ShorelineNet leave the most misclassified pixels, while BiSeNet and ELNet make fewer per-pixel classification errors. Especially for identifying obstacles, the proposed network ELNet achieves better segmentation of contours, which is evident in the yellow circle.
The quantitative evaluation results are summarized in Table 3. The table shows that WaSR and the proposed ELNet achieve much higher water-edge accuracy (11 px for both). Beyond water-edge detection, CollisionFree achieves the best precision (75.5%) and F-score (80.7%) but has the lowest recall (86.7%). Meanwhile, WaSR has the best recall (98.3%), but its precision and F-score are both lower than those of ELNet. Across all metrics, ELNet offers the best trade-off, combining the best water-edge detection with strong segmentation results. Meanwhile, ELNet has a much lower number of parameters and better qualitative performance, as mentioned above, which fulfills the original pursuit of a smaller and stronger model.

4.6. Performance on USVInland Dataset

The model evaluated on USVInland is also trained on the MaSTr1325 dataset, the purpose of which is to evaluate the transfer-learning capability. The experimental results are illustrated in Figure 5 and Table 4. Since the ground truth does not distinguish the sky from obstacles, we can only compare the quality of obstacle–water segmentation and the water–sky line. By labeling the obstacle–water edge manually, we observe from Figure 5 that WaSR and ELNet achieve more accurate edge detection, whereas CollisionFree fails to perceive the water boundary in river scenarios. For the segmentation result, ELNet produces fewer false-positive pixels when detecting water regions.
The results in Table 4 reveal that the proposed network ELNet performs comparably to the state-of-the-art method WaSR. This observation confirms that ELNet achieves similar performance to other excellent water segmentation approaches with a lighter network, which indicates strong robustness. Therefore, we believe ELNet is well placed to perform well when transferred to other datasets.

4.7. Ablation Study on the Number of Upsampling Blocks

Five types of links between the encoder and decoder are trained on the MaSTr1325 dataset and validated on the MODS benchmark following the implementation details in Section 4.3. They are summarized as:
  • Type 1: features in stage 1, 5, which preserve the features of the highest- and lowest-level.
  • Type 2: features in stage 3, 5, which preserve the features of the finest-guidance and lowest-level.
  • Type 3: features in stage 1, 3, 5, which preserve the features of the highest-level, lowest-level and the finest-guidance.
  • Type 4: features in stage 1, 2, 4, 5, which preserve the features apart from that of the finest-guidance.
  • Type 5: features in stage 1, 2, 3, 4, 5, which classically preserve the features of all levels.
The experimental results are shown in Table 5. The comparison between Type 2 and Type 3 demonstrates the importance of low-level features for fundamental cognition. Additionally, the results of Type 1 and Type 3 validate the improvement brought by the low-level features of the third stage. Type 4 uses one large upsampling kernel (kernel size 4) instead of two small kernels, yet achieves better performance than Type 5.
It is observed that the model of Type 4 has comparable performance to that of Type 3, which is the second-best structure type. Although Type 3 performs nearly identically to Type 4, its total number of parameters is 20% lower. This is therefore the configuration the proposed network ultimately adopts.

4.8. Ablation Study on Edge-Aware Modules

We define the specially designed components in ELNet as Fusion and Auxiliary, where Fusion denotes the concatenation of the edge prior (calculated with a Laplacian operator) with the main-branch feature maps, and Auxiliary denotes the entire spatial stream, which includes two convolutional heads and their corresponding objective functions. The experiments on the effectiveness of the two components also follow the paradigm of training on the MaSTr1325 dataset and validating on the USVInland dataset with the implementation details in Section 4.3. Table 6 shows the final results.
Comparing the first and second rows shows that the Fusion strategy helps to better recognize objects in visual data. The same holds for the Auxiliary strategy when comparing the first row with the third row, and with a larger margin than the Fusion strategy, which demonstrates that the spatial stream contributes more to the improvement. When the two strategies are applied simultaneously, the result is naturally the best. Nevertheless, it is worth noting that ELNet reaches only 88.69% recall; its high precision (97.82%), i.e., lower false segmentation, is an advantage for safer navigation on the water surface in maritime environments, but the recall remains a key point to improve in the future.

5. Conclusions

In this paper, an edge-aware lightweight algorithm, ELNet, is proposed to promote the practical development of unmanned surface systems. By leveraging two-stream learning, the proposed method achieves better perception of object-level details with limited computational cost. By taking a strongly edge-guided optimization direction in the two streams, ELNet achieves a visible margin in comprehensive segmentation accuracy. In addition, the generalization capability of ELNet is validated by training and testing on different water-scenario datasets. In particular, ELNet achieves a parameter reduction of more than 50% and holistically stable detection accuracy (i.e., precision, recall and F-score) in comparison with the state-of-the-art methods WaSR [13] and BiSeNet [39], which suggests that it can provide safer passable regions for USVs. Its robustness and stability also demonstrate the potential for application in more water scenarios.

Author Contributions

Conceptualization, W.H.; methodology, W.H. and B.Z.; software, W.H.; validation, W.H., B.Z. and J.L.; formal analysis, B.Z.; investigation, W.H. and B.Z.; writing—original draft preparation, B.Z.; writing—review and editing, W.H.; visualization, B.Z.; supervision, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original dataset MaSTr1325 and MODS benchmark can be obtained online by Bovcon et al. [41,42]. The USVInland dataset can be obtained from the access online by Cheng et al. [43].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, Z.; Zhang, Y.; Yu, X.; Yuan, C. Unmanned surface vehicles: An overview of developments and challenges. Annu. Rev. Control 2016, 41, 71–93. [Google Scholar] [CrossRef]
  2. Kristan, M.; Kenk, V.S.; Kovačič, S.; Perš, J. Fast image-based obstacle detection from unmanned surface vehicles. IEEE Trans. Cybern. 2015, 46, 641–654. [Google Scholar] [CrossRef] [PubMed]
  3. Luo, J.; Etz, S.P. A physical model-based approach to detecting sky in photographic images. IEEE Trans. Image Process. 2002, 11, 201–212. [Google Scholar] [PubMed]
  4. Lipschutz, I.; Gershikov, E.; Milgrom, B. New methods for horizon line detection in infrared and visible sea images. Int. J. Comput. Eng. Res. 2013, 3, 1197–1215. [Google Scholar]
  5. Yan, Y.; Shin, B.; Mou, X.; Mou, W.; Wang, H. Efficient horizon detection on complex sea for sea surveillance. Int. J. Electr. Electron. Data Commun. 2015, 3, 49–52. [Google Scholar]
  6. Liang, D.; Liang, Y. Horizon detection from electro-optical sensors under maritime environment. IEEE Trans. Instrum. Meas. 2019, 69, 45–53. [Google Scholar] [CrossRef]
  7. Taipalmaa, J.; Passalis, N.; Zhang, H.; Gabbouj, M.; Raitoharju, J. High-resolution water segmentation for autonomous unmanned surface vehicles: A novel dataset and evaluation. In Proceedings of the 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), Pittsburgh, PA, USA, 13–16 October 2019; pp. 1–6. [Google Scholar]
  8. Bovcon, B.; Kristan, M. Obstacle detection for usvs by joint stereo-view semantic segmentation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 5807–5812. [Google Scholar]
  9. Chen, X.; Liu, Y.; Achuthan, K. WODIS: Water obstacle detection network based on image segmentation for autonomous surface vehicles in maritime environments. IEEE Trans. Instrum. Meas. 2021, 70, 1–13. [Google Scholar] [CrossRef]
  10. Zhou, R.; Gao, Y.; Wu, P.; Zhao, X.; Dou, W.; Sun, C.; Zhong, Y.; Wang, Y. Collision-Free Waterway Segmentation for Inland Unmanned Surface Vehicles. IEEE Trans. Instrum. Meas. 2022, 71, 1–16. [Google Scholar] [CrossRef]
  11. Vandaele, R.; Dance, S.L.; Ojha, V. Automated water segmentation and river level detection on camera images using transfer learning. In Proceedings of the DAGM German Conference on Pattern Recognition, Tübingen, Germany, 28 September–1 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 232–245. [Google Scholar]
  12. Kim, H.; Koo, J.; Kim, D.; Park, B.; Jo, Y.; Myung, H.; Lee, D. Vision-based real-time obstacle segmentation algorithm for autonomous surface vehicle. IEEE Access 2019, 7, 179420–179428. [Google Scholar] [CrossRef]
  13. Bovcon, B.; Kristan, M. A water-obstacle separation and refinement network for unmanned surface vehicles. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 9470–9476. [Google Scholar]
  14. Yao, L.; Kanoulas, D.; Ji, Z.; Liu, Y. ShorelineNet: An efficient deep learning approach for shoreline semantic segmentation for unmanned surface vehicles. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 5403–5409. [Google Scholar]
  15. Lee, S.P.; Chen, S.C.; Peng, W.H. GSVNet: Guided spatially-varying convolution for fast semantic segmentation on video. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
  16. Lu, H.; Deng, Z. A Boundary-aware Distillation Network for Compressed Video Semantic Segmentation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 5354–5359. [Google Scholar]
  17. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking BiSeNet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9716–9725. [Google Scholar]
  18. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  19. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  20. Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding convolution for semantic segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1451–1460. [Google Scholar]
  21. Bertasius, G.; Shi, J.; Torresani, L. Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4380–4389. [Google Scholar]
  22. Yu, Z.; Feng, C.; Liu, M.Y.; Ramalingam, S. Casenet: Deep category-aware semantic edge detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5964–5973. [Google Scholar]
  23. Shen, W.; Wang, X.; Wang, Y.; Bai, X.; Zhang, Z. Deepcontour: A deep convolutional feature learned by positive-sharing loss for contour detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3982–3991. [Google Scholar]
  24. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  25. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  26. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 2820–2828. [Google Scholar]
  27. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  28. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  29. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  30. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  31. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  32. Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; Keutzer, K. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 10734–10742. [Google Scholar]
  33. Wan, A.; Dai, X.; Zhang, P.; He, Z.; Tian, Y.; Xie, S.; Wu, B.; Yu, M.; Xu, T.; Chen, K.; et al. Fbnetv2: Differentiable neural architecture search for spatial and channel dimensions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 19–20 June 2020; pp. 12965–12974. [Google Scholar]
  34. Dai, X.; Wan, A.; Zhang, P.; Wu, B.; He, Z.; Wei, Z.; Chen, K.; Tian, Y.; Yu, M.; Vajda, P.; et al. FBNetV3: Joint architecture-recipe search using predictor pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16276–16285. [Google Scholar]
  35. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  36. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  37. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  38. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  39. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
  40. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  41. Bovcon, B.; Muhovič, J.; Vranac, D.; Mozetič, D.; Perš, J.; Kristan, M. MODS–A USV-oriented object detection and obstacle segmentation benchmark. IEEE Trans. Intell. Transp. Syst. 2021, 23, 13403–13418. [Google Scholar] [CrossRef]
  42. Bovcon, B.; Muhovič, J.; Perš, J.; Kristan, M. The mastr1325 dataset for training deep usv obstacle detection models. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 3431–3438. [Google Scholar]
  43. Cheng, Y.; Jiang, M.; Zhu, J.; Liu, Y. Are We Ready for Unmanned Surface Vehicles in Inland Waterways? The USVInland Multisensor Dataset and Benchmark. IEEE Robot. Autom. Lett. 2021, 6, 3964–3970. [Google Scholar] [CrossRef]
  44. Fritsch, J.; Kuehnl, T.; Geiger, A. A New Performance Measure and Evaluation Benchmark for Road Detection Algorithms. In Proceedings of the International Conference on Intelligent Transportation Systems (ITSC), Hague, The Netherlands, 6–9 October 2013. [Google Scholar]
  45. Maddern, W.; Pascoe, G.; Linegar, C.; Newman, P. 1 Year, 1000 km: The Oxford RobotCar Dataset. Int. J. Robot. Res. IJRR 2017, 36, 3–15. [Google Scholar] [CrossRef]
  46. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
  47. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  48. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. Overview of the ELNet segmentation algorithm. ELNet consists of two streams, a context stream and a spatial stream. The backbone of the context stream is MobileNetV2 [27], and "S1–S5" represent the different stages of intermediate context features at different resolutions. The resolution of "S1" is the largest, half the size of the input image, and the resolution of "S5" is the smallest, 1/32 the size of the input image.
Figure 2. The MaSTr1325 contains a series of weather conditions and diverse object/obstacles, which provides a broad range of appearances and types for scenario segmentation.
Figure 3. Training loss (a) and accuracy (b) results of ELNet on MaSTr1325 dataset.
Figure 4. Qualitative evaluations on the MODS benchmark. The sky, obstacles and water are colored in deep-blue, yellow and cyan color, respectively. The red dashed line indicates the ground truth sea edge, and the main differences are outlined with arrows or zoomed in.
Figure 5. Qualitative evaluations on the USVInland dataset. The sky/obstacles and water are colored in purple and yellow, respectively. The red dashed line indicates the ground truth river edge.
Table 1. The architecture of the proposed network ELNet. "T" is the expansion factor of the bottleneck, "N" is the repeat times, "S" is stride, "OS" is the output scale of coding features, "IC" is input channel, "MC" is middle channel, and "OC" is output channel.
| Stage | Operator | OS | T | N | S | IC | MC | OC |
| Enc | conv2d | S/2 | - | 1 | 2 | 1 | - | 32 |
| Enc | bottleneck | S/2 | 1 | 1 | 1 | 32 | - | 16 |
| Enc | bottleneck | S/4 | 6 | 2 | 2 | 16 | 96/144 | 24 |
| Enc | bottleneck | S/8 | 6 | 3 | 2 | 24 | 144/192 | 32 |
| Enc | bottleneck | S/8 | 6 | 4 | 2 | 32 | 192/384 | 64 |
| Enc | bottleneck | S/16 | 6 | 3 | 1 | 64 | 384/576 | 96 |
| Enc | bottleneck | S/32 | 6 | 3 | 2 | 96 | 576/960 | 160 |
| Enc | bottleneck | S/32 | 6 | 1 | 1 | 160 | 960 | 320 |
| Dec | up block | S/8 | - | 1 | 4 | 320 | - | 256 |
| Dec | up block | S/2 | - | 1 | 4 | 448 | - | 64 |
| Dec | conv2d | S | - | 1 | 2 | 160 | - | 2 |
| Aux | conv block | S/8 | - | 1 | 1 | 3 | - | 16 |
| Aux | conv block | S/8 | - | 1 | 1 | 192 | - | 4 |
Table 2. The number of trainable parameters and inference speed of compared segmentation methods.
| Algorithms | Number of Parameters (M) | FPS on GPU | FPS on CPU |
| BiSeNet [39] | 13.42 | 36.12 | 4.87 |
| WODIS [9] | 49.07 | 33.56 | 1.81 |
| CollisionFree [10] | 100.36 | 9.89 | 0.21 |
| Skip-ENet [12] | 0.75 | 59.82 | 11.24 |
| WaSR [13] | 71.50 | 10.63 | 0.98 |
| ShorelineNet [14] | 6.50 | 49.02 | 6.33 |
| ELNet (Ours) | 4.86 | 45.21 | 6.95 |
Table 3. Quantitative evaluation of segmentation methods on the MODS benchmark in terms of water-edge localization accuracy ($\mu_A$) and detection robustness ($\mu_R$, the percentage of correctly detected water-edge pixels, in parentheses), with precision (Pr), recall (Re) and F-score in percentages.
| Algorithms | $\mu_A$ [px] ($\mu_R$) | TP | FP | FN | Pr [%] | Re [%] | F-Score [%] |
| BiSeNet [39] | 12 (98.4) | 51,045 | 33,152 | 1443 | 60.6 | 97.3 | 74.7 |
| WODIS [9] | 18 (97.1) | 49,966 | 87,651 | 2522 | 36.3 | 95.2 | 52.6 |
| CollisionFree [10] | 53 (91.7) | 45,528 | 14,797 | 6960 | 75.5 | 86.7 | 80.7 |
| Skip-ENet [12] | 25 (95.8) | 48,786 | 178,013 | 3702 | 21.5 | 92.9 | 34.9 |
| WaSR [13] | 11 (98.6) | 51,607 | 85,374 | 881 | 37.7 | 98.3 | 54.5 |
| ShorelineNet [14] | 12 (98.4) | 49,643 | 131,130 | 2845 | 27.5 | 94.6 | 42.6 |
| ELNet (Ours) | 11 (98.5) | 51,429 | 29,318 | 1156 | 63.7 | 97.8 | 75.1 |
Table 4. Quantitative evaluation of segmentation methods on the USVInland dataset in terms of precision (Pr), recall (Re) and F-score in percentages.
| Algorithms | Pr [%] | Re [%] | F-Score [%] |
| BiSeNet [39] | 97.89 | 87.77 | 93.04 |
| WODIS [9] | 96.14 | 88.84 | 92.68 |
| CollisionFree [10] | 93.64 | 75.96 | 84.88 |
| Skip-ENet [12] | 96.81 | 76.78 | 86.65 |
| WaSR [13] | 97.96 | 83.23 | 90.55 |
| ShorelineNet [14] | 96.60 | 81.17 | 88.83 |
| ELNet (Ours) | 97.82 | 88.69 | 93.96 |
Table 5. Ablation study of the upsampling number in terms of water-edge localization accuracy ($\mu_A$) and detection robustness ($\mu_R$), precision (Pr), recall (Re) and F-score in percentages.
| Type | Params | $\mu_A$ [px] ($\mu_R$) | Pr [%] | Re [%] | F-Score [%] |
| 1 | 8.34 M | 14 (98.0) | 60.6 | 95.9 | 64.7 |
| 2 | 4.42 M | 28 (95.3) | 20.9 | 92.3 | 34.3 |
| 3 | 4.87 M | 11 (98.5) | 63.7 | 97.8 | 75.1 |
| 4 | 6.05 M | 10 (98.6) | 63.4 | 97.7 | 75.2 |
| 5 | 5.17 M | 14 (98.0) | 60.2 | 96.6 | 69.5 |
Table 6. Ablation study of edge-aware modules on USVInland.
| Fusion | Auxiliary | Pr [%] | Re [%] | F-Score [%] |
| × | × | 96.96 | 88.12 | 93.99 |
| ✓ | × | 97.32 | 88.37 | 93.97 |
| × | ✓ | 97.36 | 88.42 | 93.88 |
| ✓ | ✓ | 97.82 | 88.69 | 93.96 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

