1. Introduction
Autonomous boats have significant commercial and societal potential for transoceanic cargo shipping, passenger transport, hazardous area inspection, and environmental control. Their autonomy crucially relies on obstacle detection, which is particularly challenging in maritime environments such as coastal waters, marinas, urban canals, and rivers. This is because the appearance of the navigable surface (water) is dynamic, changes with the weather, and contains object reflections and glitter. Similarly, obstacles can be static (e.g., shore and piers) or dynamic (e.g., boats, swimmers, debris, buoys) and have a wide range of appearances.
The current state of the art in vision-based maritime obstacle detection [1] is based on the segmentation of captured images into water, obstacle, and sky classes. Unlike detection-based methods, segmentation methods can simultaneously detect static and dynamic obstacles, are more robust to the diverse appearance of obstacle types, and directly detect the navigable area (water). However, their performance depends on the availability of large per-pixel segmented training datasets [2].
Manual segmentation of training sets is time-consuming, costly, and error-prone. For example, manual segmentation of a typical maritime image takes approximately 20 min [2]. In the related field of autonomous vehicles, significant efforts have thus been invested in semi-automatic annotation [3,4], domain transfer [5,6], and semi-supervised learning [7,8] to reduce the required labeling effort. Alternatively, weak supervision methods [9,10] aim to achieve this by simplifying annotations and harnessing prior knowledge about the image and data structure. Unfortunately, these methods are largely inapplicable to the maritime domain due to their reliance on overly general annotations, such as image-level labels or object bounding boxes, which cannot capture all the requirements of robust maritime perception. Instead, the annotation effort may be reduced in a more principled way by considering the specifics of the downstream application. Thus far, this direction has not been explored in the domain of maritime obstacle detection.
We observe that segmentation errors in different semantic regions may have profoundly different consequences for maritime navigation.
Figure 1 visualizes several of these cases. Detecting the boundary between water and static obstacles (i.e., the water edge) is crucial for collision prevention, while accurate segmentation of the shore–sky boundary is irrelevant for obstacle avoidance. Similarly, misclassifying a few pixels on a floating obstacle will not cause a collision, but falsely classifying isolated patches of water pixels as obstacles adversely affects control, causing frequent and unnecessary stops. Recent maritime obstacle detection benchmarks [1] also reflect these navigation-specific segmentation requirements by defining evaluation measures in terms of dynamic obstacle detection and water–edge estimation accuracy, while ignoring the segmentation accuracy beyond the water edge.
Taking into account these requirements of the maritime obstacle detection task, we propose a new scaffolding learning regime (SLR), which is our main contribution. SLR avoids the need for densely labeled ground truth when training maritime obstacle detection networks and instead relies only on weak, obstacle-oriented annotations (Figure 1), consisting of water–edge poly-lines, bounding boxes denoting dynamic obstacles, and the horizon location estimated from on-board inertial measurement unit (IMU) data. At a high level, SLR proceeds in EM-like steps, alternating between estimating the unknown segmentation labels and improving the network parameters. First, by exploiting the domain constraints, we construct partial segmentation labels (i.e., not all pixels are labeled) from the weak annotations. A segmentation network is then trained on the partial labels to learn domain-specific feature extraction. During this initialization step, we also use additional bounding-box-based training objectives to learn the segmentation of foreground dynamic obstacles.
In turn, the trained network is used to estimate the labels in the unknown regions of the image. While the predictions of such a network cannot be expected to reach the desired robustness due to the complex interplay of multiple training objectives, the encoder must learn powerful domain-specific features to satisfy all the objectives. We therefore use a feature clustering approach to estimate the most likely labels for unlabeled pixels in the partial segmentation labels. Finally, the network is fine-tuned using the newly estimated pseudo-labels.
Extensive evaluation on maritime obstacle detection by segmentation [1] shows that models trained using SLR outperform models classically trained on full dense annotations, which is a remarkable result. In fact, the new training regime increases robustness to false-positive detections and improves the generalization capabilities of the trained networks, while reducing the time required for ground-truth annotation by orders of magnitude. To the best of our knowledge, this is the first method for training obstacle detection from weak annotations in the marine domain that surpasses fully supervised training on dense labels. Furthermore, SLR makes minimal assumptions about the network architecture and can thus be applied to most existing segmentation networks. Additionally, it is used only during training and thus does not affect inference characteristics such as speed or hardware requirements.
Preliminary results of our approach were presented in a conference paper [11]. This paper goes beyond the preliminary work in several ways. We introduce an additional auxiliary object loss, which uses segmentation priors, estimated from bounding boxes, to supervise dynamic obstacle segmentation. Additionally, we reformulate the retraining step as fine-tuning, starting from the pre-trained weights of the warm-up stage. Both changes lead to substantial performance improvements. The experimental analysis is also significantly extended with a cross-domain generalization analysis (Section 4.5), an annotation-efficiency study including a comparison with a recent state-of-the-art annotation-reduction approach (semi-supervised learning, Section 4.4), experiments and a discussion on soft labels (Section 4.8), and a qualitative analysis on a wide range of maritime images (Section 4.9).
3. Learning Segmentation by Scaffolding
We now introduce the new learning approach for maritime segmentation networks, which we call the scaffolding learning regime (SLR). SLR gradually improves the trained model by alternating between improving the network parameters using per-pixel (pseudo-)labels and re-estimating the pseudo-labels using the learned network. The process is composed of the three steps shown in Figure 2. In the first step (Section 3.1), the network is trained using (1) partial labels, constructed from the weak annotations and domain constraints, and (2) additional weak objectives for learning the segmentation of dynamic obstacles. In the second step (Section 3.2), the learned network is used to estimate the labels of the unlabeled regions in the partial labels. Finally (Section 3.3), the network is fine-tuned using the estimated pseudo-labels. The three steps are explained in more detail below.
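The alternation between label estimation and parameter improvement can be sketched as a simple control flow. This is a toy illustration only: `warmup`, `estimate_pseudo_labels`, and `finetune` are hypothetical stand-ins for the actual training procedures, with unlabeled pixels marked by `-1`.

```python
import numpy as np

def warmup(partial_labels):
    # Step 1: train the network on the partial labels (here the "model"
    # is reduced to the labels it was trained on, for illustration).
    return {"labels": partial_labels}

def estimate_pseudo_labels(model, partial_labels):
    # Step 2: estimate labels for unlabeled pixels (-1); a real system
    # would use feature clustering against class prototypes.
    pseudo = partial_labels.copy()
    pseudo[pseudo == -1] = 0  # e.g., fall back to the water class
    return pseudo

def finetune(model, pseudo_labels):
    # Step 3: fine-tune the warmed-up model on the dense pseudo-labels.
    model["labels"] = pseudo_labels
    return model

def slr(partial_labels, iterations=1):
    model = warmup(partial_labels)
    for _ in range(iterations):
        pseudo = estimate_pseudo_labels(model, partial_labels)
        model = finetune(model, pseudo)
    return model
```

A single iteration (the paper's default) already produces a dense label set for fine-tuning.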
3.1. Feature Warm-Up
The purpose of the feature warm-up step is to learn domain-specific encoder features and initial segmentation predictions. This is achieved through weakly supervised training. By combining domain knowledge and weak annotations, we can label specific regions of an input image with high confidence, while leaving other regions unlabeled, resulting in partial labels (Section 3.1.1) that can be used to supervise the model training (Section 3.1.2).
3.1.1. Partial Labels from Weak Annotations
To generate per-pixel partial labels for the water, sky, and obstacle classes, we introduce domain-specific constraints extrapolated from the weak annotations and from the horizon location estimated from the IMU measurements (following [26]), as shown in Figure 3. The estimated horizon divides the image into the regions above and below the horizon. Similarly, the water–edge annotations delineate the regions above and below the water edge, and the bounding boxes define the set of rectangular regions that tightly enclose dynamic obstacles.
Using this notation, we define class-constrained regions in which a given class c cannot appear: water pixels cannot appear above the horizon or the water edge, sky pixels cannot appear below the horizon or the water edge, and obstacle pixels cannot appear outside the object bounding boxes, except above the water edge. We can thus set the probability of class c at a pixel i to 0 within the respective restricted region. In certain regions, these constraints leave only a single possible class, yielding unambiguous labels; in general, the probability of class c at a pixel i is defined by the resulting constrained distribution.
To account for the lack of unique labels on some pixels, we introduce pixel-wise training weights. In particular, the weights are set to 1 for pixels with unambiguous labels and to 0 elsewhere, so the latter pixels are effectively ignored during training.
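The constraint logic above can be illustrated with a small sketch. The representation is ours: a scalar horizon row, a scalar water-edge row, labels as integer masks with ambiguous pixels marked as ignored, whereas the paper works with per-pixel class probabilities.

```python
import numpy as np

WATER, SKY, OBSTACLE = 0, 1, 2
IGNORE = -1

def partial_labels(h, w, horizon_y, water_edge_y, boxes):
    """Construct partial labels from weak annotations (illustrative).

    horizon_y    -- row index of the IMU-estimated horizon
    water_edge_y -- row index of the annotated water edge
    boxes        -- list of (x0, y0, x1, y1) dynamic-obstacle boxes
    """
    rows = np.arange(h)[:, None]                 # (h, 1) row indices
    allowed = np.ones((3, h, w), dtype=bool)
    # Water cannot appear above the horizon or the water edge.
    allowed[WATER] = (rows >= horizon_y) & (rows >= water_edge_y)
    # Sky cannot appear below the horizon or the water edge.
    allowed[SKY] = (rows < horizon_y) & (rows < water_edge_y)
    # Obstacles cannot appear outside boxes, except above the water edge.
    in_box = np.zeros((h, w), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        in_box[y0:y1, x0:x1] = True
    allowed[OBSTACLE] = in_box | (rows < water_edge_y)

    labels = np.full((h, w), IGNORE)
    weights = np.zeros((h, w))
    unambiguous = allowed.sum(axis=0) == 1       # exactly one class remains
    labels[unambiguous] = allowed[:, unambiguous].argmax(axis=0)
    weights[unambiguous] = 1.0
    return labels, weights
```

Open-water pixels outside all boxes receive an unambiguous water label with weight 1, while pixels where several classes remain possible (e.g., inside a bounding box) are ignored.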
From the water–edge annotations, we can also infer that a pixel located immediately above the water edge must, by definition, belong to the obstacle class. However, since the height of the static obstacle is unknown, we employ a heuristic and only label a pixel as an obstacle if its distance from the water edge is below a threshold.
We furthermore set the weight of each such pixel to decay with its distance from the water edge, reflecting the increase in label uncertainty, and truncate all weights below a small value to zero.
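One possible form of this weighting is an exponential decay; the exact decay function, threshold, and cutoff are hyper-parameters whose values are not fixed here, so the numbers below are illustrative assumptions.

```python
import numpy as np

def static_obstacle_weight(d, threshold=20.0, eps=1e-2):
    """Training weight for static-obstacle labels above the water edge.

    d -- distance (in pixels) of the pixel from the water edge
    threshold -- pixels farther than this are not labeled at all
    eps -- weights below this cutoff are truncated to zero
    """
    w = np.exp(-d / threshold)            # uncertainty grows with distance
    w = np.where(d <= threshold, w, 0.0)  # only label within the threshold
    return np.where(w < eps, 0.0, w)      # truncate negligible weights
```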
3.1.2. Training
The network is trained with a weighted focal loss [51] on the partial labels and their corresponding weights. However, further training signals can be derived for the unlabeled regions corresponding to dynamic obstacles.
In particular, we use several loss functions inspired by the instance segmentation literature [47,49]. We leverage the bounding-box annotations with a projection loss [49], which weakly constrains the segmentation of an obstacle by forcing the horizontal and vertical projections of the segmentation mask to coincide with the edges of the bounding box. Further regularization is provided by a pairwise loss [49], which promotes equal labels for visually similar neighboring pixels. We adapt the pairwise loss to the multi-class setting and apply it to the entire image, so that it supervises both dynamic and static obstacle segmentation.
Finally, we estimate a segmentation-mask prior for each dynamic obstacle using a pre-trained class-agnostic deep grab-cut segmentation network [52]. In turn, a focal loss between the predicted object segmentation and the estimated prior is used as an auxiliary source of obstacle segmentation supervision.
The final loss is thus composed of the global losses and the dynamic obstacle losses, averaged over the N annotated dynamic obstacles.
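The projection loss can be sketched for a single image and box as follows. This is a simplified version in the spirit of [49]: the Dice-style comparison against the box edges is an assumption, not the paper's exact formulation.

```python
import numpy as np

def projection_loss(mask, box):
    """Box projection loss sketch: for a tight box, every row and column
    of the box should contain some obstacle, so the max-projections of
    the predicted mask inside the box should be all ones.

    mask -- (h, w) predicted obstacle probabilities in [0, 1]
    box  -- (x0, y0, x1, y1) tight bounding box
    """
    x0, y0, x1, y1 = box
    region = mask[y0:y1, x0:x1]
    proj_x = region.max(axis=0)   # projection onto the horizontal axis
    proj_y = region.max(axis=1)   # projection onto the vertical axis
    # Dice-style mismatch against the all-ones target projections.
    loss_x = 1.0 - (2 * proj_x.sum()) / (proj_x.sum() + proj_x.size)
    loss_y = 1.0 - (2 * proj_y.sum()) / (proj_y.sum() + proj_y.size)
    return loss_x + loss_y
```

A mask that exactly fills the box incurs zero loss, while an empty prediction incurs the maximum loss.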
3.2. Estimating Pseudo Labels from Features
The model learned in the warm-up phase (Section 3.1) cannot be expected to produce robust predictions due to the complex combination of training objectives. However, to satisfy all these objectives simultaneously, the encoder must learn strong domain-specific features. We thus estimate the labels of the unlabeled regions of the partial labels by clustering in the learned feature space, producing dense pseudo-labels. This process relies on the assumption that pixels belonging to the same semantic class cluster together in the learned feature space.
We first correct the model predictions with constraints derived from the weak annotations. Given the per-pixel predictions (probabilities) for the water, sky, and obstacle classes and the feature maps produced by the encoder for an input image, we define a function that corrects (i.e., constrains) the predictions according to the domain constraints: the probability of class c is set to 0 in its restricted locations, as in Section 3.1.1.
From the constrained predictions, class prototypes are constructed. A class prototype is a single feature vector describing class c and is computed by masked average pooling over the encoder features, with the constrained probabilities of class c acting as the pooling mask. Because dynamic obstacle appearance may vary greatly across instances, we construct an individual prototype for each dynamic obstacle in the image (from the features within its bounding box) and a separate prototype for all the remaining static obstacles. Two further prototypes are extracted, one for the water class and one for the sky class.
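Masked average pooling for a single prototype can be written compactly. This is a sketch in our own notation: `features` stands for the encoder feature maps and `probs` for the constrained probabilities of one class.

```python
import numpy as np

def class_prototype(features, probs, eps=1e-8):
    """Masked average pooling: the prototype is the average of the encoder
    features, weighted by the constrained class probabilities.

    features -- (c, h, w) encoder feature maps
    probs    -- (h, w) constrained probabilities of one class
    """
    weighted = (features * probs).sum(axis=(1, 2))  # weighted feature sum
    return weighted / (probs.sum() + eps)           # normalize by total mass
```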
The probability of a pixel belonging to a specific class is reflected in the similarity between the pixel and the respective class prototype in the feature space: high similarity relative to the other prototypes indicates high class probability, and low relative similarity indicates low probability. To implement this, we compute, at every image location, the cosine similarity between the pixel's feature vector and each of the constructed prototypes, producing one similarity map per prototype.
Due to the multiple prototypes for the obstacle class, we obtain multiple obstacle similarity maps, which are merged into a single obstacle similarity map as follows: the similarity map of each dynamic obstacle is applied inside its bounding box, and the static-obstacle similarity map is used elsewhere. In regions where the annotations of multiple dynamic obstacles overlap, the maximum of their similarity values is used.
We then apply a softmax over the class similarities, scaled by a temperature hyper-parameter, to obtain the class probability distribution at each image position. The resulting probabilities are corrected by the constraint function to agree with the weak annotation constraints.
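The clustering-based label estimation can be sketched end to end: cosine similarities of every pixel feature to each class prototype, followed by a temperature-scaled softmax over classes. The temperature value below is illustrative, not the paper's setting.

```python
import numpy as np

def pseudo_label_probs(features, prototypes, temperature=10.0):
    """Soft pseudo-labels from feature clustering.

    features   -- (c, h, w) encoder feature maps
    prototypes -- (k, c) one prototype per class
    """
    c, h, w = features.shape
    f = features.reshape(c, -1)                      # (c, h*w) pixel features
    f = f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    sim = temperature * (p @ f)                      # (k, h*w) scaled cosine
    sim -= sim.max(axis=0, keepdims=True)            # numerically stable softmax
    e = np.exp(sim)
    probs = e / e.sum(axis=0, keepdims=True)
    return probs.reshape(-1, h, w)                   # (k, h, w) class probs
```

A pixel whose feature points toward a prototype receives a correspondingly high probability for that class.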
The estimated probabilities are used as soft pseudo-labels to fine-tune the model. Since the partial labels (Section 3.1.1) are already available, the estimated probabilities are applied only to the unlabeled (i.e., unknown) regions of the partial labels, and the two together form the final dense pseudo-labels.
Finally, the weights of pixels whose labels are estimated by feature clustering are set to a constant value, reflecting that these labels are less certain than the partial labels derived directly from the weak annotations.
3.3. Fine-Tuning with Dense Pseudo Labels
In the final step, the model trained in the warm-up stage is fine-tuned by optimizing a weighted focal loss between the predicted labels and the re-estimated dense pseudo-labels, together with the global pairwise loss.
4. Results
A battery of experiments was conducted to probe the learning capabilities of SLR in the context of maritime obstacle detection. In all experiments, the focal loss parameter and the remaining SLR hyper-parameters are kept fixed. The number of training epochs in the warm-up and fine-tuning phases is 25 and 50, respectively. Unless stated otherwise, we stop the training after a single SLR iteration (i.e., one round of pseudo-label estimation and model fine-tuning).
SLR is demonstrated on the state-of-the-art maritime obstacle detection network WaSR [26], which employs a ResNet-101 backbone as the encoder. In the warm-up phase, the encoder is initialized from ImageNet pre-trained weights, while the remaining network weights are initialized randomly. Features from the penultimate (third) encoder residual block are used in the pseudo-label estimation phase after warm-up (Section 3.2). The water separation loss from WaSR, with its corresponding weight, is added to the losses in the fine-tuning step (Section 3.3).
Since our goal is to compare SLR with classical learning, we minimized the number of changes relative to WaSR to allow a fair comparison. Thus, following [2], all networks are trained with the RMSProp optimizer with momentum 0.9, the initial learning rate and standard polynomial decay (power 0.9) from [2], and a batch size of 12. Random image augmentations, including color transformations, horizontal flipping, scaling, and rotation, are applied to the training images.
4.1. Evaluation Protocol
The standard obstacle detection evaluation protocol from the MODS [1] benchmark is used. The networks are trained on MaSTr1325 [2], which contains 1325 diverse, fully per-pixel-labeled images captured from unmanned surface vehicles (USV). To evaluate SLR, we additionally annotated the images with water edges and object bounding boxes. The models are evaluated on the MODS test dataset, which contains approximately 100 sequences annotated with bounding boxes and water edges, using the detection-oriented evaluation protocol [1]. Static obstacle detection is evaluated by the water–edge detection robustness, which measures the proportion of the water–edge boundary that is correctly identified (i.e., detected within a set distance threshold). Dynamic obstacle detection is evaluated by precision (Pr), recall (Re), and the F1 measure, both over the entire navigable surface and separately within a 15 m danger zone around the USV, where obstacle detection is most critical for immediate collision prevention.
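The dynamic-obstacle measures reduce to standard detection counts; for clarity, a minimal computation is given below (the counts are illustrative only).

```python
def detection_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, and
    false-negative detection counts, as used for dynamic obstacles."""
    pr = tp / (tp + fp) if tp + fp else 0.0   # precision
    re = tp / (tp + fn) if tp + fn else 0.0   # recall
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1
```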
4.2. Comparison with Full Supervision
SLR was evaluated by training two of the top models from the recent MODS [1] benchmark using the weak annotations from MaSTr1325 [2] and evaluating them on the MODS test set. In addition, we trained several top-performing models from the same benchmark (RefineNet [53], DeepLabV3 [54], BiSeNet [55], and our re-implementation of WaSR [26]) on the dense labels to form a strong baseline. In the following, networks trained with SLR are referred to as SLR-trained variants.
Results in Table 1 show that, remarkably, both the SLR-trained WaSR and the SLR-trained DeepLabV3 outperform their classically trained counterparts, despite using considerably simpler weak annotations. The SLR-trained DeepLabV3 outperforms its counterpart by 5.8 and 59.0 percentage points overall and inside the danger zone, respectively. In the case of WaSR, SLR boosts performance by 1.4 (overall) and 5.1 (danger zone) percentage points and sets a new state of the art on the MODS benchmark. We observe that SLR consistently decreases false-positive detections and increases precision while preserving a high recall (qualitative results in Section 4.9). We speculate that this might be due to the detection-based training objectives in SLR, which better reflect the downstream task requirements than standard pixel-based segmentation losses.
4.3. Segmentation Quality
Since the test set annotations in MODS [1] do not enable segmentation accuracy analysis, we split the MaSTr1325 dataset into training (70%) and test (30%) sets. The standard WaSR and its SLR-trained variant are trained on the resulting training set and evaluated in terms of IoU against the segmentation ground truth of the test set.
Results in Table 2 show that the SLR-trained WaSR closely matches the segmentation accuracy of WaSR, with merely a 1.2 point decrease in mIoU. This decrease can be attributed to slightly over-segmented objects and to missed details and thin structures in the sky, which are labeled as obstacles but are unimportant from the obstacle detection perspective (Figure 4).
4.4. Comparison with Semi-Supervised Learning
SLR can be considered a weakly supervised learning method, in which the segmentation supervision comes from weak obstacle-oriented annotations. Semi-supervised methods, on the other hand, use a small set of fully labeled images and a larger set of unlabeled images in training. In both cases, the aim is to reduce the required annotation effort. We thus compare SLR with the recent semi-supervised state-of-the-art method ATSO [7] in terms of annotation efficiency. Table 3 compares the two methods and the classical fully supervised training scheme with varying percentages of labeled images in the training set.
According to [2], manual segmentation of a MaSTr1325 image takes approximately 20 min. Since weak annotations take approximately 1 min per image, annotating all images weakly amounts to roughly 5% of the effort required to manually segment the entire dataset. We thus first analyze WaSR performance when trained with only 5% of the fully segmented images (selected at random). Compared to using all training images, the F1 detection performance drops substantially, by almost 10 percentage points overall and 19.1 percentage points within the danger zone. Applying the semi-supervised approach ATSO with 5% annotated (and 95% unannotated) training images significantly improves the performance, particularly within the danger zone (by 18 percentage points). Nevertheless, SLR by far outperforms both the fully supervised approach and ATSO, improving over ATSO by nearly five percentage points overall and over six percentage points within the danger zone. Even when the annotation effort is increased to twice that used by SLR, by using 10% of the dense annotations, ATSO still falls short by over three percentage points overall and almost seven percentage points within the danger zone. These results indicate that SLR and the obstacle-oriented reduction of labels to weak annotations are much more efficient than the general strategy of decreasing the number of labeled images.
For reference, we also evaluate the segmentation performance on the validation set of MaSTr1325. Fully supervised training using all images with ground-truth segmentation labels achieves the best mIoU. Reducing the number of training images to match the annotation effort of SLR results in a 3.5 point drop in mIoU. This drop is smaller when using ATSO (2.2 points) and smaller still when using SLR (1.2 points). The segmentation accuracy of ATSO matches that of SLR only when using twice the manual annotation effort. These results suggest that SLR could also be useful as a method for acquiring segmentation ground truth at a very low manual annotation cost.
4.5. Cross-Domain Generalization
MaSTr1325 [2] and MODS [1] both contain images acquired from the perspective of a small USV. To explore the cross-domain generalization advantages of SLR, we thus perform several experiments evaluating on the Singapore Marine Dataset (SMD) [30]. This dataset presents a different domain than the MaSTr training set and contains video sequences mostly acquired from on-shore vantage points. The objects are annotated with weak annotations, and we use the horizon annotations and the training/test split from [26].
In the first experiment, we train WaSR on MaSTr1325 and test it on the SMD test set. Table 4 shows that SLR outperforms training on segmentation ground truth by 5.1 percentage points, indicating that better generalization capabilities are obtained purely from the proposed training regime.
We next consider the performance in the context of domain adaptation, with the MaSTr1325 and SMD training sets used for training and the SMD test set for evaluation. We compare SLR with the recent state-of-the-art domain adaptation method FDA [6]. WaSR trained with FDA performs worse (a 19 percentage point drop) than WaSR trained only on the MaSTr1325 segmentation ground truth. However, using SLR to fine-tune the model on the SMD training set outperforms the latter by nearly 15 percentage points. This implies that SLR holds strong potential for domain generalization as well as domain adaptation.
4.6. Ablation Study
To expose the contributions of the individual components of SLR, a series of experiments with different components turned off was carried out. Results of training on MaSTr1325 and testing on MODS are reported in Table 5.
The most basic model, which uses only the feature warm-up step (Section 3.1) without fine-tuning and without the static obstacle labels above the water edge (Section 3.1.1), results in a 7.2 and 38.8 percentage point drop overall and within the danger zone, respectively, compared to the full SLR. Detailed inspection showed that this is mainly due to an increased number of false-positive detections (see Figure 5). Adding the static obstacle labels in the warm-up step considerably improves the performance of the basic model (by 16.1 percentage points within the danger zone). A further boost of 13.9 percentage points within the danger zone is obtained by enabling the fine-tuning step (Section 3.3) and learning from the predictions of the warmed-up model.
Applying the label constraints to the predictions (Section 3.2) leads to a further 4.4 percentage point improvement within the danger zone, while using only the re-estimation of pseudo-labels by feature clustering (Section 3.2) leads to a 4.8 percentage point improvement. Combining both the label constraints and the pseudo-label re-estimation results in 3.7 (overall) and 6.5 (danger zone) percentage point improvements over the model without the prediction refinement step. Finally, using the auxiliary per-object segmentation loss during warm-up further improves the performance by 0.7 and 2.3 points, respectively. The mIoU (Table 5) between the predicted segmentation masks and the ground-truth labels on the validation set shows that each SLR module consistently contributes to learning the unknown underlying segmentation.
4.7. Influence of SLR Iterations
The number of SLR iterations, each composed of the pseudo-label estimation (Section 3.2) and network fine-tuning (Section 3.3) steps, is set to 1 in SLR. Table 6 reports results with an increasing number of iterations. For reference, the performance without the fine-tuning step is reported as well. The results show that a single fine-tuning step significantly boosts the performance, but further iterations bring no additional improvement. We thus conclude that a single fine-tuning step in SLR is sufficient.
4.8. Comparison with Label Smoothing
The pseudo-label estimation step (Section 3.2) produces soft labels, meaning that the distribution over labels at a given pixel is not collapsed to a single mode. We therefore tested the hypothesis that the observed improvements might actually come from the smoothed training labels, an effect reported in the literature [56].
We implemented a smoothing model that applies per-pixel and cross-pixel label smoothing as follows. Given the label distribution at a pixel i, the new distribution is a convex combination of the original distribution and the uniform distribution, (1 − α)·y + α/N, where α is the extent of smoothing and N is the number of label classes. The obtained label mask can be further spatially smoothed by a 2D Gaussian filter with a size parameter. This model is applied to the one-hot ground-truth segmentation labels in MaSTr1325.
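The smoothing model can be sketched as follows. The separable-convolution implementation of the Gaussian blur is ours, and the default parameter values are illustrative; `alpha` and `sigma` correspond to the two searched smoothing parameters.

```python
import numpy as np

def smooth_labels(one_hot, alpha=0.1, sigma=0.0):
    """Per-pixel label smoothing with optional cross-pixel smoothing.

    one_hot -- (n, h, w) one-hot label masks for n classes
    alpha   -- extent of per-pixel smoothing toward the uniform distribution
    sigma   -- size of the spatial Gaussian filter (0 disables it)
    """
    n = one_hot.shape[0]
    y = (1.0 - alpha) * one_hot + alpha / n          # per-pixel smoothing
    if sigma > 0:
        # Separable Gaussian blur over the two spatial dimensions.
        r = int(3 * sigma)
        x = np.arange(-r, r + 1)
        k = np.exp(-x**2 / (2 * sigma**2))
        k /= k.sum()                                  # normalized kernel
        for ax in (1, 2):
            pad = [(0, 0)] * 3
            pad[ax] = (r, r)
            yp = np.pad(y, pad, mode="edge")
            y = np.apply_along_axis(
                lambda v: np.convolve(v, k, "valid"), ax, yp)
    return y
```

Both operations preserve a valid per-pixel probability distribution over the classes.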
Results of an exhaustive search over the smoothing parameters are shown in Figure 6. As predicted by [56], both types of label smoothing do improve the performance of the classically trained WaSR, particularly within the danger zone, albeit often at a slight decrease in overall performance. Nevertheless, SLR training considerably outperforms all label smoothing combinations in both measures, despite using only weak annotations.
4.9. Qualitative Analysis
Figure 7 visualizes the performance of WaSR and its SLR-trained variant on the MODS [1] dataset. Observe that SLR training reduces false-positive detections on wakes, sun glitter, and reflections. Reducing these is crucial for practical USV applications, since false positives cause frequent unnecessary slow-downs and effectively render the perception useless for autonomous navigation. Note that, while segmentation, and thus obstacle detection, substantially improves in regions important for navigation, structures such as masts are missed. However, these lie in areas irrelevant for navigation and are not accounted for by the supervision signals in SLR.
We next downloaded and captured several diverse maritime photographs. These images were taken with various cameras, and we made sure they are out-of-distribution examples of scenes and vantage points not present in MaSTr1325 [2]. Since IMU data (the horizon) are not available for these images, we re-trained the SLR variant of WaSR without the IMU input on MaSTr1325. The results on the new images are shown in Figure 8.
A remarkable generalization to conditions not present in the training data is observed. For example, WaSR performs well even in certain night-time scenes, although only daytime images are present in MaSTr1325. Furthermore, the training data contain only open, coastal sea scenes, while high performance is obtained even on rivers and in tight canals with different water colors and surface textures. We observe failure cases on thin structures (e.g., part of a surfboard missing) and false-positive detections on objects seen through shallow water. We also observe less accurate segmentation of the top-most parts of static obstacles, which is not critical for obstacle detection but indicates that further research in maritime segmentation networks could address such cases.