1. Introduction
Autonomous surface vehicles (ASVs) are emerging machines catering to a range of applications such as monitoring of the aquatic environment, inspection of hazardous areas, and automated search-and-rescue missions. Because they do not require a crew, they can be substantially downsized and thus offer potentially low operating costs. Among other capabilities, reliable obstacle detection plays a crucial role in their autonomy, since timely detection is required to prevent collisions that could damage the vessel or cause injuries.
The current state-of-the-art (SOTA) algorithms for maritime obstacle detection [1,2,3] are based on semantic segmentation and classify each pixel of the input image as an obstacle, water, or sky. These models excel at detecting static and dynamic obstacles of various shapes and sizes, including those not seen during training. They also adapt well to challenging and dynamic water appearance, enabling reliable prediction of the area that is safe to navigate. However, these benefits come at a high computational cost, as most of the recent SOTA semantic segmentation algorithms for maritime [1,2,3] and other [4,5] environments utilize computationally intensive architectural components with a large number of parameters and operations. SOTA algorithms therefore typically require expensive and energy-inefficient high-end GPUs, making them unsuitable for real-world small-sized energy-constrained ASVs.
Various hardware designs have been considered to bring neural network inference to industry applications. These include edge TPUs [6], embedded GPUs [7], and dedicated hardware accelerators for neural network inference [8]. In this paper, we consider OAK-D [9], a smart stereo camera that integrates the MyriadX VPU [8]. OAK-D reserves a fixed TOPS budget for neural network inference and shares its limited on-board memory among neural networks, programs, stereo depth computation, and other image-manipulation operations. These properties make it an ideal embedded low-power smart sensor for autonomous systems such as ASVs.
However, state-of-the-art models such as WaSR [1] cannot be deployed to OAK-D due to memory limitations, and many standard models that can be deployed, such as U-Net [4], typically run at less than 1 FPS, which is impractical. These limitations generally hold for other embedded devices as well. Recent works [3,10] consequently explored low-latency lightweight architectures for ASVs, but their high throughput comes at the cost of reduced accuracy. While significant research has been invested in the development of general embedded-ready backbones [11,12,13,14,15,16,17,18], these have not yet been analyzed in the aquatic domain and, in our experience, also generally sacrifice accuracy. For these reasons, there is a pressing need for embedded-compute-ready ASV obstacle detection architectures that do not substantially compromise detection accuracy.
To address the aforementioned problems, we propose a fast and robust neural network architecture for ASV obstacle detection capable of low latency on embedded hardware with a minimal accuracy trade-off. The architecture, which is our main contribution, is inspired by the current state-of-the-art WaSR [1], hence the name embedded-compute-ready WaSR (eWaSR). By careful analysis, we identify the most computationally intensive modules in WaSR and propose computationally efficient replacements. We further reformulate WaSR in the context of transformer-based architectures and propose channel refinement module (CRM) and spatial refinement module (SRM) blocks for efficient extraction of semantic information from image features. On a standard GPU, eWaSR runs at 115 FPS, substantially faster than the original WaSR [1], with on-par detection accuracy. We also deploy eWaSR on a real embedded device, OAK-D, where it runs comfortably at a practical frame rate, in contrast to the original WaSR, which cannot even be deployed on the device. By matching WaSR detection accuracy and vastly surpassing it in speed and memory requirements, eWaSR thus simultaneously addresses maritime-specific obstacle detection challenges as well as the embedded sensor design requirements crucial for practical ASV deployment. The source code and trained eWaSR models are publicly available (https://github.com/tersekmatija/eWaSR, accessed on 25 May 2023) to facilitate further research in embedded maritime obstacle detection.
The remainder of the paper is structured as follows: Section 2 reviews the existing architectures for maritime obstacle detection, efficient encoders, and lightweight semantic segmentation architectures. Section 3 analyzes the WaSR [1] blocks, their importance, and bottlenecks. The new eWaSR is presented in Section 4 and extensively analyzed in Section 5. Finally, the conclusions and outlook are drawn in Section 6.
3. WaSR Architecture Analysis
An important drawback of the current best-performing maritime obstacle detection network WaSR [1] lies in its computational and memory requirements, which prohibit deployment on low-power embedded devices. In this section, we therefore analyze the main computational blocks of WaSR in terms of resource consumption and detection accuracy. These results are the basis of the new architecture proposed in Section 4.
The WaSR architecture, summarized in Figure 1, contains three computational stages: the encoder, a feature mixer, and the decoder. The encoder is a ResNet-101 [37] backbone, while the feature mixer and decoder are composed of several information fusion and feature scaling blocks. The first fusion block is the channel attention refinement module (cARM1 and cARM2 in Figure 1) [45], which re-weights the channels of the input features based on the channel content. The per-channel weights are computed by averaging the input features across the spatial dimensions, resulting in a per-channel feature vector that is passed through a 1 × 1 convolution followed by a sigmoid activation. The second fusion block is the feature fusion module (FFM) [45], which fuses features from different branches of the network by concatenating them, applying a convolution, and re-weighting the channels with a technique similar to cARM1, implemented with 1 × 1 convolutions and a sigmoid activation. The third major block is the atrous spatial pyramid pooling (ASPP) [5], which applies convolutions with different dilation rates in parallel and merges the resulting representations to capture object and image context at various scales. The feature mixer and the decoder also utilize the inertial measurement unit (IMU) sensor readings in the form of a binary encoded mask that denotes the horizon location at different fusion stages. In addition to the focal loss [56] for learning semantic segmentation from ground-truth labels, Bovcon and Kristan [1] proposed a novel water-obstacle separation loss to encourage the separation of water and obstacle pixels in the encoder's representation space.
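To make the cARM re-weighting concrete, the following is a minimal PyTorch sketch of a cARM-style channel attention block as described above. The class and argument names are ours and do not come from the WaSR codebase; the placement of batch normalization before the sigmoid is an assumption.

```python
import torch
import torch.nn as nn

class ChannelARM(nn.Module):
    """cARM-style block: re-weights input channels with weights
    computed from the globally pooled channel content."""

    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # average over spatial dims
        self.conv = nn.Conv2d(channels, channels, 1)  # 1 x 1 conv on pooled vector
        self.bn = nn.BatchNorm2d(channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.sigmoid(self.bn(self.conv(self.pool(x))))  # (B, C, 1, 1) weights
        return x * w                                        # per-channel re-weighting

# Example: re-weight a 256-channel feature map.
y = ChannelARM(256)(torch.randn(2, 256, 24, 32))
```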
We note that the encoder is the major culprit in memory consumption, since it employs the ResNet-101 [37] backbone. This can be trivially addressed by replacing it with any lightweight backbone. For example, replacing the backbone with ResNet-18, which uses approximately four times fewer parameters and FLOPs than ResNet-101 and does not use dilated convolutions, thus producing smaller feature maps, already leads to a variant of WaSR that runs on an embedded device. Concretely, a WaSR variant with a ResNet-18 encoder runs on OAK-D but suffers in detection accuracy (a noticeable drop in overall F1 and in F1 on close obstacles; see Section 5). The performance obviously depends on the particular backbone used, and we refer the reader to Section 5, which explores various lightweight backbone replacements.
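As a quick sanity check of the relative backbone sizes, one can count the parameters of the stock torchvision implementations. This is an illustrative sketch only; the backbone used in WaSR is truncated and modified, so its exact numbers differ.

```python
import torch
from torchvision.models import resnet18, resnet101

def count_params(model: torch.nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

for name, ctor in [("resnet18", resnet18), ("resnet101", resnet101)]:
    model = ctor(weights=None)  # random init; we only count parameters
    print(f"{name}: {count_params(model) / 1e6:.1f} M parameters")
```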
We now turn to the analysis of the WaSR computational blocks in the feature mixer and the decoder. The contribution of each block to the detection performance is first analyzed by replacing it with a plain convolution and retraining the network (see Section 5 for the evaluation setup). The results in Table 1 indicate a substantial performance drop for each replacement, which means that all blocks indeed contribute to the performance and cannot be trivially removed for a speedup. Table 2 reports the computational requirements of each block in terms of the number of parameters, floating point operations (FLOPs), and the execution time of each block measured by the PyTorch Profiler (https://pytorch.org/docs/stable/profiler.html, accessed on 25 May 2023) on a modern laptop CPU. The results indicate that the FFM and FFM1 blocks are by far the most computationally intensive. The reason lies in the convolution at the beginning of each FFM block (Figure 1), which mixes a large number of input channels and thus entails a substantial computational overhead. For example, the first convolution in each FFM block accounts for over 90% of the block's execution time. In comparison, ASPP is significantly less computationally intensive, and cARM is the least demanding block.
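Per-block timings of the kind reported in Table 2 can be collected with the PyTorch Profiler by wrapping each block in a labeled region. The stand-in modules and shapes below are placeholders, not the actual WaSR blocks.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Stand-ins; in practice these would be the instantiated WaSR FFM/ASPP modules.
ffm_standin = nn.Conv2d(1024, 512, 3, padding=1)
aspp_standin = nn.Conv2d(512, 512, 3, padding=2, dilation=2)

x = torch.randn(1, 1024, 48, 64)
with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function("FFM"):
        y = ffm_standin(x)
    with record_function("ASPP"):
        y = aspp_standin(y)

# Aggregated CPU time per labeled region and operator.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```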
Both the FFM and cARM blocks contain a channel re-weighting branch, which entails some computational overhead. We therefore inspect the diversity of the computed per-channel weights as an indicator of the re-weighting efficiency. The weight diversity can be quantified by the per-channel standard deviations of the predicted weights across several input images. Figure 2 shows the distribution of the standard deviations computed on the MaSTr1325 [43] training set images for the FFM and cARM blocks. We observe that the standard deviations for the FFM blocks are closer to 0 than those of cARM1 and have a shorter right tail than those of cARM2. This suggests that the per-channel weights computed in the FFM/FFM1 blocks do not vary much across images, so less computationally intensive replacements could be considered. In contrast, this is not true for the cARM blocks, where the re-weighting changes substantially across images. A further investigation of the cARM blocks (see Appendix A) shows that the blocks learn to assign a higher weight to the IMU channel in images where the horizon is poorly visible. This further indicates the utility of the cARM blocks.
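The weight-diversity statistic can be computed by recording the sigmoid outputs of a re-weighting branch over a set of images and taking the standard deviation per channel. A minimal sketch, assuming the module exposes its sigmoid as `arm.sigmoid`, as in the cARM sketch above:

```python
import torch

def channel_weight_std(arm: torch.nn.Module, images: list) -> torch.Tensor:
    """Per-channel std of the predicted re-weighting coefficients across images."""
    weights = []
    hook = arm.sigmoid.register_forward_hook(
        lambda mod, inp, out: weights.append(out.flatten(1))  # (B, C) per batch
    )
    with torch.no_grad():
        for img in images:  # each img: a (B, C_in, H, W) feature tensor
            arm(img)
    hook.remove()
    return torch.cat(weights, dim=0).std(dim=0)  # (C,) std across all images
```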
In terms of the WaSR computational stages, Table 2 indicates that the decoder entails nearly twice the total execution time of the feature mixer. Nevertheless, since computationally intensive blocks occur in both stages, both are candidates for potential speedups through architectural changes.
4. Embedded-Compute-Ready Obstacle Detection Network eWaSR
The analysis in Section 3 identified the decoder as the most computationally and memory-hungry part of WaSR, with the second most intensive stage being the backbone. As noted, the backbone can easily be sped up by considering a lightweight drop-in replacement. However, this reduces the detection accuracy due to semantically impoverished backbone features. Recently, Zhang et al. [29] proposed compensating for the semantic impoverishment of lightweight backbones by concatenating features from several layers and mixing them using a transformer. We follow this line of architecture design in eWaSR, shown in Figure 3.
We replace the ResNet-101 backbone in WaSR with the lightweight counterpart ResNet-18 and concatenate the features from the last layer with resized features from layers 6, 10, and 14 (see the sketch after this paragraph). These features are then semantically enriched by the feature mixer stage. The recent work [29] proposed a transformer-based mixer capable of producing semantically rich features at a low computational cost. However, the transformer applies token cross-attention [57], which still adds a computationally prohibitive overhead. We propose a more efficient feature mixer that draws on the findings of Yu et al. [53] that the computationally costly token cross-attention in transformers can be replaced by alternative operations, as long as they implement a cross-token information flow.
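The multi-scale feature concatenation can be illustrated as follows. The tapped node names and the pooling to the last-layer resolution are our assumptions for illustration; eWaSR taps layers 6, 10, and 14 of its modified ResNet-18, which do not map one-to-one to the torchvision node names used here.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18
from torchvision.models.feature_extraction import create_feature_extractor

# Tap intermediate features of a stock ResNet-18 (illustrative node names).
backbone = create_feature_extractor(
    resnet18(weights=None),
    return_nodes={"layer1": "c1", "layer2": "c2", "layer3": "c3", "layer4": "c4"},
)

feats = backbone(torch.randn(1, 3, 384, 512))
h, w = feats["c4"].shape[-2:]  # spatial size of the last-layer features
resized = [F.adaptive_avg_pool2d(feats[k], (h, w)) for k in ("c1", "c2", "c3")]
tokens = torch.cat(resized + [feats["c4"]], dim=1)  # (1, 64+128+256+512, h, w)
print(tokens.shape)
```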
We thus propose a lightweight scale-aware semantic extractor (LSSE) for the eWaSR feature mixer, which is composed of two metaformer refinement modules (Figure 3): the channel refinement module (CRM) and the spatial refinement module (SRM). Both modules follow the metaformer [53] structure and differ in the type of information flow implemented by the token mixer. The CRM applies the cARM [45] module to enable a global cross-channel information flow. This is followed by the SRM, which applies sARM [58] to enable a cross-channel spatial information flow. To make the LSSE suitable for our target hardware, we replace the commonly used GeLU and layer normalization blocks of the metaformer with ReLU and batch normalization. The proposed LSSE is much more computationally efficient than SSE [29]: for example, with a ResNet-18 encoder, SSE would contain 66.4 M parameters (requiring 3.2 GFLOPs), while LSSE contains 47.9 M parameters (requiring 2.1 GFLOPs).
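To illustrate the metaformer structure of the refinement modules, below is a minimal sketch of a CRM-style block: a residual token-mixer step (the cARM-style channel attention from the earlier sketch) followed by a residual pointwise MLP, with batch normalization and ReLU in place of layer normalization and GeLU. The names and the expansion ratio are our own choices, not the exact eWaSR configuration.

```python
import torch.nn as nn

class CRM(nn.Module):
    """Metaformer-style block: BN -> channel-attention token mixer -> residual,
    then BN -> pointwise MLP with ReLU -> residual."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(channels)
        self.mixer = ChannelARM(channels)  # cARM-style token mixer (earlier sketch)
        self.norm2 = nn.BatchNorm2d(channels)
        hidden = channels * expansion
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))  # global cross-channel information flow
        x = x + self.mlp(self.norm2(x))    # per-location channel mixing
        return x
```

An SRM would keep the same skeleton and only swap the token mixer for an sARM-style spatial attention.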
We also simplify the WaSR decoder following the TopFormer [29] semantic-enrichment routines to avoid the computationally costly FFM and ASPP modules. In particular, the output features of the LSSE module are gradually upsampled and fused with the intermediate backbone features using the semantic injection modules (SIM) [29]. To better cope with the high visual diversity of the maritime environment and with small objects, the intermediate backbone features on the penultimate SIM connection are processed by two SRM blocks. The final per-layer semantically enriched features are concatenated with the IMU mask and processed by a shallow prediction head to predict the final segmentation mask. The prediction head is composed of a convolutional block with batch normalization and ReLU, followed by a final convolution and softmax.
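The prediction head can be sketched as follows. The kernel sizes, channel counts, the 3-class output (obstacle, water, sky), and the way the single-channel IMU mask is resized and concatenated are our assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictionHead(nn.Module):
    """Shallow head: concatenate the IMU mask, conv + BN + ReLU, final conv, softmax."""

    def __init__(self, in_channels: int, num_classes: int = 3):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels + 1, in_channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv2 = nn.Conv2d(in_channels, num_classes, 1)

    def forward(self, feats: torch.Tensor, imu_mask: torch.Tensor) -> torch.Tensor:
        # Resize the binary IMU (horizon) mask to the feature resolution, then append.
        imu = F.interpolate(imu_mask, size=feats.shape[-2:], mode="nearest")
        x = torch.cat([feats, imu], dim=1)
        x = F.relu(self.bn(self.conv1(x)))
        return torch.softmax(self.conv2(x), dim=1)  # per-pixel class probabilities
```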
6. Conclusions
We presented eWaSR, a novel semantic segmentation architecture for maritime obstacle detection suitable for deployment on embedded devices. eWaSR semantically enriches downsampled features from a ResNet-18 [37] encoder in an SSE-inspired [29] lightweight scale-aware semantic extraction module (LSSE). We proposed the transformer-like blocks CRM and SRM, which utilize cARM [45] (channel attention) and sARM [58] (simplified 2D spatial attention) blocks as token mixers instead of costly transformer attention, allowing the LSSE to efficiently produce semantically enriched features. The encoder features are fused with the semantically enriched features in SIM [29] blocks. To help the model extract semantic information from a more detailed feature map, we use two SRM blocks on the second long skip connection, and we concatenate the binary encoded IMU mask into the prediction head to inject information about the tilt of the vehicle.
The proposed eWaSR is substantially faster than the state-of-the-art WaSR [1] on a modern laptop GPU and can run comfortably on the embedded device OAK-D. Compared to other lightweight architectures for maritime obstacle detection [3,10], eWaSR does not sacrifice the detection performance to achieve the reduced latency and achieves only marginally lower overall and danger-zone F1 scores on the challenging MODS [46] benchmark compared to the state of the art.
Because of the additional memory accesses, long skip connections can increase the overall latency of the network. In the future, more emphasis could be put on exploring different embedded-compute-suitable means of injecting detail-rich information into the decoder. The developed embedded-compute-ready architecture can be further extended in several ways. One example is to consider the temporal component, as in [2]. Furthermore, since OAK-D is capable of on-board depth computation, fusing depth into the model could be explored to increase the performance on close obstacles. Alternatively, the solution could be redesigned into a multitask architecture that would also provide an estimate of the distance besides the existing semantic segmentation output. Potential practical applications (for example, autonomous ferries [70] or collision avoidance systems [71]) could easily combine the predicted segmentation masks with depth, which would provide ASVs with precise information about the distance and location of nearby obstacles. We leave these improvements and the implementation of a practical application for future work.