1. Introduction
The detection and recognition of water surface targets (WSTs) remain crucial for ship navigators and unmanned navigation systems, significantly enhancing overall navigation safety. Effective look-out can provide essential spatial information about WSTs, supporting situational awareness and the assessment of collision risks. Furthermore, identifying the spatial relations among these targets can improve the accuracy of target detection and tracking, thus contributing to a comprehensive evaluation of the navigation environment and the potential risk of collision. Currently, the visual sensors installed on vessels serve as powerful tools for navigators in the detection and identification of WSTs, as advancements in computer vision have demonstrated remarkable capabilities in object detection and recognition tasks. In addition to providing information on the categories and sizes of the WSTs, which can be efficiently processed by modern computer vision algorithms, the spatial relations among these targets are also represented in the images captured by the visual sensors. This enables the extraction of spatial information concerning the vessel and the WSTs through computational methods.
The recognition and detection of WSTs based on images have been extensively studied in the literature. Traditional approaches primarily depend on edge detection, color analysis, and morphological processing. Widely used image processing algorithms, such as the Canny edge detector and the Sobel operator [
1], identify the edges of targets by evaluating the contrast between the water surface and the target itself. Nonetheless, these approaches demonstrate limited effectiveness when faced with complex backgrounds, such as waves and splashes, as well as low-contrast situations. Recently, numerous automated feature extraction methods based on complex neural networks have been proposed to improve the efficiency of image recognition. A significant milestone was the introduction of convolutional neural networks (CNNs), which have become one of the mainstream methods for water surface image processing in recent years [
2]. CNNs enable the recognition and detection of WSTs through a multi-layered architecture. Several convolutional layers are employed in CNNs to automatically extract local features from the image such as surface fluctuations and target edges. Activation functions and pooling layers subsequently enhance feature representations while reducing computational complexity. In the fully connected layers, the network integrates the extracted features, ultimately using the output layer to classify and localize water surface targets. CNNs have achieved significant advancements in the recognition and detection of WSTs, demonstrating exceptional capability in handling complex backgrounds and low-contrast images. For instance, networks such as VGGNet [
3] and GoogleNet [
4] excel at extracting precise feature information from large datasets of WST images, enabling accurate target classification. Deep networks like Faster R-CNN [
5] allow CNNs to effectively detect water surface targets, such as ships and buoys, maintaining high recognition accuracy even in the presence of challenging backgrounds like waves and splashes. Furthermore, the YOLO [
6,
7,
8,
9,
10,
11] model has been successfully applied to real-time target detection, facilitating the rapid localization and classification of water surface targets, which substantially enhances the efficiency of real-time monitoring. Models such as fully convolutional networks (FCNs) [
12] and U-Net [
13] employ pixel-level image segmentation to successfully isolate targets from the water surface background, thereby improving detection accuracy in complex environments including those with waves and splashes. Generally speaking, CNNs, through their automated feature learning, end-to-end training processes, and deep network architectures, have significantly enhanced the accuracy, efficiency, and robustness of water surface target recognition and detection.
Although CNNs have achieved remarkable success in WST recognition tasks, they still face significant challenges. A notable issue in many neural network models is that deep semantic information in images has not yet received sufficient attention during the object recognition process. In particular, recognition may yield inaccurate results when the spatial relations between features are not properly represented.
Figure 1 presents a misclassification case, where
Figure 1a shows an image containing a sailboat, and
Figure 1b illustrates an image containing a wooden boat and a piece of canvas. We conducted a classification experiment on both
Figure 1a,b with the three widely adopted classification models: VGG16 [
3], ResNet50 [
14], and EfficientNet [
15]. All three models classified both
Figure 1a,b as “sailboat.” However, the canvas in
Figure 1b is not positioned where a sail would be, so labeling the image as a “sailboat” is an obvious misclassification. This failure highlights a fundamental shortcoming of conventional CNNs: without explicit modeling of the spatial dependencies between features, such architectures struggle to differentiate contextually dissimilar objects (e.g., a sailboat versus a boat with a displaced canvas).
Figure 2 presents a representative example of a WST image. Compared with natural or urban scene images, WST images exhibit distinctive features, with particularly prominent spatial position relations among these features. Firstly, the background of WST images is typically simple, with targets being highly salient. In open water regions such as lakes or oceans, the background is usually homogeneous, leading to a higher contrast between the targets (e.g., ships, buoys) and the background. Secondly, while the targets in WST images typically exhibit a natural two-dimensional spatial distribution, this distribution often reflects an underlying three-dimensional spatial arrangement. For example, targets
B and
C appear in an “up-down” relation in the WST image, but actually correspond to “front-back” spatial position relations in reality. Similarly, while target
C may appear “below right” of target
A in the WST image, it corresponds to a “right-front” relation in three-dimensional space. The two-dimensional relations in WST images facilitate intuitive inference of the relative positions and distances of targets in three-dimensional space. Overall, WST images, characterized by their simple backgrounds, salient targets, and two-dimensional spatial distribution, excel in representing spatial positional relations. Compared with images from other scenes, WST images can better highlight target spatial relations under specific conditions. However, despite the clear advantages of WST images in representing spatial relations, the current exploration of deep semantic features, such as spatial positioning and context, remains underdeveloped and warrants further research.
Recent advances in artificial intelligence have driven significant progress in spatial relation recognition for WST systems, spanning a diverse spectrum of methodological approaches that aim to address the inherent limitations of water surface target recognition and detection while providing innovative frameworks for spatial information analysis. These methods encompass both traditional deep learning architectures and emerging hybrid solutions, with implementations ranging from CNNs and capsule networks to more sophisticated frameworks, collectively advancing the capabilities of spatial understanding in water surface scenarios. Ref. [
16] advanced the field by developing a sophisticated object detection framework that demonstrated a remarkable capability in analyzing ship safety plans. Their innovative approach enables the automated verification of safety equipment and signage placement against regulatory requirements, representing a significant step forward in maritime safety compliance. Building upon the YOLOv7 architecture, Ref. [
17] introduced the groundbreaking LWS-YOLOv7, a lightweight model specifically engineered for water-surface object detection. Their notable contribution includes the enhancement of the localization loss function and the integration of the w-CIoU function, substantially improving the model’s performance on small object detection while optimizing positive and negative sample label allocation. A significant advancement in underwater robotics was achieved by [
18], who pioneered a YOLO-based 3D perception algorithm for underwater vehicle-manipulator systems. This sophisticated approach revolutionizes surface and underwater object detection and localization precision, marking a crucial development in underwater manipulation capabilities. Ref. [
19] made substantial strides in maritime surveillance by developing an innovative adaptive target detection and tracking model for unmanned surface vehicles. Their research addressed the complex challenges posed by maritime environments, offering enhanced tracking capabilities amid dynamic conditions and diverse environmental factors. Ref. [
20] proposed the capsule network, which uses vector input and output. Each capsule corresponds to an entity, with the vector’s length representing the entity’s probability, and the vector’s direction representing the entity’s attribute. As a class of neural networks, the capsule network can evaluate the relative relations between targets and their spatial orientations. Ref. [
21] introduced an image captioning model that could automatically produce descriptive text from an image. This model enables machines to recognize targets in an image and understand how they relate to each other. In fact, image captioning is a general semantic understanding approach that can recognize the spatial position relations of WSTs under certain conditions.
While these traditional deep learning approaches have shown promising results in maritime object detection and spatial recognition, recent advances in deep learning architectures have opened new avenues for improving detection accuracy and robustness. Particularly noteworthy are the developments in enhanced hybrid attention deep learning and spectrum-based hybrid deep learning, which have brought novel perspectives to maritime surveillance, object detection, and spatial relation recognition. Ref. [
22] proposed an enhanced hybrid attention-based neural network combining CNNs, BiLSTMs, and attention mechanisms to analyze the spatial-temporal characteristics of ship heave motion, achieving superior prediction accuracy in complex maritime spatial environments. Their research significantly advanced spatial motion prediction through the comprehensive analysis of multi-dimensional spatial features (heave displacement, pitch movements, and wave interactions), establishing a robust framework for maritime spatial relationship understanding. Refs. [
23,
24] introduced AEFFNet and HAFDN architectures, both leveraging enhanced hybrid attention mechanisms. AEFFNet focuses on small object detection in UAV imagery with innovative feature fusion strategies, while HAFDN combines channel, local, and global spatial attention for micro-video venue recognition. Although not directly tested on maritime targets, their advances in enhanced hybrid attention deep learning provide valuable guidance for improving maritime object detection and spatial recognition. Furthermore, the research by [
25,
26] on spectrum-based hybrid deep learning, through their exploration of hybrid sensing techniques that incorporated two different spectrum sensing approaches, provides valuable insights that could be potentially applied to maritime object detection and spatial relationship recognition.
Current research on spatial feature understanding is mainly built on target recognition; it neither provides a transparent scheme for spatial position relation recognition nor reports any corresponding performance results. Indeed, to the best of the authors’ knowledge, there is no existing model dedicated to recognizing the spatial relations between objects in an image. This highlights the need for further exploration into integrating deep semantic features with spatial relations to improve target recognition and detection accuracy.
Fortunately, although the above research largely ignored the spatial relations between features, it implies that CNNs are capable of learning the spatial positions of WSTs in an image [
27]. Therefore, this paper addresses the recognition of the spatial orientation between WSTs in images using a new one-stage algorithm. We first categorized the relative spatial orientations of the WSTs in an image with respect to a reference object, and introduced the triplet (a three-element structure consisting of object A, object B, and the type of spatial orientation relation between A and B) to characterize the spatial orientation relation between targets. As shown in
Figure 3, these spatial orientations can further be converted into four categories according to people’s perception of the WST’s relative spatial orientation in the image, where
Figure 3a–d represents the left and right (L&R) spatial orientation, the front and back (F&B) spatial orientation, the front left and rear right (FL&RR) spatial orientation, and the left rear and right front (LR&RF) spatial orientation, respectively. Thus, we can transform the recognition of the spatial orientation between WSTs in an image into a classification problem. Furthermore, when observing the spatial orientation between two WSTs, a human observer usually scans from one WST to the other along the straight line between them. This perception mechanism motivated us to encode the spatial orientation into a CNN model. This paper proposes a new algorithm, the WST spatial orientation vector field (WST-SOVF) algorithm, which is the first to recognize the four categories of spatial orientation between WSTs in images. Analogous to the way a human being perceives the spatial orientation between WSTs, the WST-SOVF algorithm establishes a spatial orientation vector field between WSTs. The ground truth of each pixel of the vector field is encoded by the spatial orientation angle of the WST relative to the reference object. All of the pixels in the vector field finally “vote” to determine the category of the spatial orientation relation between a group of WSTs. For WST category recognition, we followed a keypoint detection scheme and determined the target category by estimating the WST’s central area. Consequently, a new DCNN with two branches based on ResNet50 [
14] was developed. One branch of this DCNN identifies the category of the WSTs, and the other predicts the spatial orientation vector field between each group of WSTs. Moreover, in order to obtain the optimal parameters of the WST-SOVF algorithm, a pair of coupled regression loss functions, corresponding to WST category recognition and spatial orientation vector prediction, was designed based on the structure of the two-branch neural network. We present extensive experimental results on Huawei’s “Typical Surface/Underwater Target Recognition” dataset, demonstrating the efficiency and robustness of the proposed algorithm.
The rest of the paper is organized as follows.
Section 2 discusses the representation of the spatial orientation relation between WSTs in images.
Section 3 illustrates the proposed WST-SOVF algorithm in detail. In
Section 4, the experimental results demonstrate the superior performance of the proposed WST-SOVF algorithm on a variety of water surface scene images.
Section 5 provides an in-depth discussion of the experimental results and limitations of the algorithm. Finally, concluding remarks with the future direction of research are summarized in
Section 6.
2. Spatial Orientation Relation Between WSTs in an Image
Currently, the recognition and detection techniques for WSTs primarily focus on extracting the regional features of the WSTs, while deep semantic features have yet to receive significant attention. As a fundamental semantic characteristic, the spatial patterns between WSTs are rarely discussed. A water surface image may contain complex spatial patterns. As shown in
Figure 4,
Figure 4a shows a typical WST spatial pattern. The lighthouse is on the right side of the sailboat if the sailboat is considered as the reference; the sailboat is on the left side of the lighthouse if the lighthouse is taken as the reference. This spatial pattern is quite common for the separate objects in the image. Compared with the spatial pattern in
Figure 4a,
Figure 4b presents a more complex spatial pattern with overlapping water surface objects. The image shows the runabout overlapping the bow of the cabin boat, indicating that the runabout is in front of the cabin boat; this overlap also provides a clear representation of their spatial relation.
Figure 4c further exhibits various spatial patterns in the image. At first glance, the image shows that the two red lifeboats are on the cruise ship, which is a spatial containment pattern. Additionally, when considering the relative position of the lifeboat in the horizontal direction of the image, it becomes apparent that the lifeboat is to the right of the cruise ship. This spatial pattern is often ignored.
Figure 4d shows a scene with multiple open boats and cabin boats containing complex spatial patterns. Any open boat or cabin boat can serve as the reference, and the spatial pattern between each WST and the reference can then be obtained. Therefore, for a given combination of water surface objects, we can observe a specific spatial pattern, and these combinations may yield multiple spatial patterns.
Although there may be various spatial patterns between water surface objects in an image, this paper concentrates only on the spatial orientation patterns between two separated WSTs. However, there is no strict definition of how the spatial orientation relations between two separated WSTs should be represented. As shown in
Figure 5, there are three WSTs,
A,
B, and
C, in the pixel coordinate system. To represent the spatial relation between targets
A and
C, if target
A is taken as the reference origin, the spatial relation between the two can be described as “
C is to the right-front of
A”. Conversely, when using target
C as the reference origin, the relation is described as “
A is to the left-rear of
C”. In reality, these two representations of the spatial relations between
A and
C are equivalent, with the apparent ambiguity arising from the symmetry inherent in human language. However, machine recognition typically requires determinacy and uniqueness in the representation of spatial relation to enable reliable subsequent operations such as classification, decision-making, and prediction. Therefore, in order to eliminate the ambiguity between the spatial orientations of targets described in natural language, this paper introduced four fundamental categories of spatial orientation based on the orientation angle between the two water surface objects (i.e., L&R (left and right), F&B (front and back), FL&RR (front left and rear right), and LR&RF (left rear and right front)) through the following formula:
where Oref is the center point of the object that is closer to the vertical axis of the pixel coordinate system, Otar is the center point of the other object, and θ is the angle between the vector from Oref to Otar and the horizontal axis of the pixel coordinate system. According to (1), the spatial orientation between two objects is invariant even if the roles of the two targets are interchanged. Consequently, the recognition of spatial orientation can be formulated as a classification problem involving four fundamental spatial orientations between water surface objects within an image.
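As a concrete illustration of this angle-based classification, the following sketch maps the angle θ between the vector from Oref to Otar and the horizontal axis to one of the four categories. The 22.5° band boundaries and the assignment of the two diagonal bands (based on the convention that "below" in the image corresponds to "front") are illustrative assumptions rather than the exact thresholds of Equation (1).

```python
import math

# Illustrative sketch of the angle-based orientation classification.
# The 22.5-degree band boundaries and the diagonal-band assignment are assumptions.
def classify_orientation(o_ref, o_tar):
    """o_ref: center of the object closer to the vertical image axis;
    o_tar: center of the other object; both are (x, y) pixel coordinates."""
    dx = o_tar[0] - o_ref[0]
    dy = o_tar[1] - o_ref[1]                       # y grows downward in pixel coordinates
    theta = abs(math.degrees(math.atan2(dy, dx)))  # angle to the horizontal axis, in [0, 180]

    if theta <= 22.5 or theta >= 157.5:
        return "L&R"                               # roughly horizontal arrangement
    if 67.5 <= theta <= 112.5:
        return "F&B"                               # roughly vertical arrangement
    # Diagonal arrangement: the sign of dy decides which diagonal pair applies
    # (assumed convention: target below the reference lies to its "front").
    return "LR&RF" if dy > 0 else "FL&RR"
```

With the reference fixed as the object closer to the vertical axis, swapping the two targets does not change the result, which is consistent with the invariance noted above.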
In representing the spatial orientation relations between objects, it is essential not only to capture the category of relations among a set of objects, but also to clearly define the category between the two involved objects. To address this, the paper introduced the use of triplets to represent spatial orientation relations between WSTs, specifically in the form of <object A, object B, their spatial orientation relation>. In this triplet, the first two elements, A and B, denote the two objects involved in the spatial relation, while the third element specifies the category of their spatial orientation. The order of A and B in the triplet does not affect the relation, allowing for flexibility in representation. This triplet provides a clear, intuitive, and structured way to express the spatial relations between WSTs in an image. Such representation enhances the interpretability of the reasoning process, as each relation can be individually interpreted and verified, making the machine’s understanding of the image more transparent and facilitating debugging and refinement of the system.
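To make this representation concrete, a minimal sketch of a triplet container is given below; the class and field names are illustrative and not part of the paper's notation, while the unordered comparison reflects the order-invariance of A and B noted above.

```python
from dataclasses import dataclass

# Minimal, illustrative container for one spatial orientation triplet.
# Only the <object A, object B, relation> structure follows the paper;
# the names used here are assumptions.
@dataclass(frozen=True)
class OrientationTriplet:
    object_a: str   # e.g., "sailboat"
    object_b: str   # e.g., "lighthouse"
    relation: str   # one of "L&R", "F&B", "FL&RR", "LR&RF"

    def matches(self, other: "OrientationTriplet") -> bool:
        """The order of A and B does not affect the relation, so compare unordered pairs."""
        return (self.relation == other.relation
                and {self.object_a, self.object_b} == {other.object_a, other.object_b})
```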
3. WST Spatial Orientation Relation Recognition Algorithm
3.1. Toward Spatial Orientation Relations Representation
To determine the spatial orientation category between two separated water surface objects and ultimately output a triplet characterizing the recognition result, this paper investigated two critical aspects of the problem (i.e., the types of water surface objects involved and their corresponding spatial orientations). The identification of water surface objects can be effectively addressed using various state-of-the-art object detection algorithms such as CenterNet [
28,
29,
30,
31], CornerNet [
29], OpenPose [
32], and YOLO [
6,
7,
8,
9,
10,
11]. The spatial orientations can subsequently be recognized by using the relative localizations obtained after all water surface objects in the image have been identified and localized. However, such a two-phase methodology cannot provide an end-to-end network capable of comprehensively understanding the spatial features of an image. To overcome the limitations inherent in the two-stage framework, it is essential for the recognition algorithm to identify both aspects simultaneously. In other words, the algorithm is required to recognize the objects and obtain the category of the spatial orientation relation between them, along with the corresponding confidence levels for each water surface object pair in an image, represented as {<object A1, object B1, their spatial orientation>, …, <object An, object Bm, their spatial orientation>} [
33,
34].
Our algorithm was conceptualized as a pixel-level deep learning network featuring two parallel branches connected after an hourglass structure, with the objective of independently identifying WSTs and the spatial orientation relations between these targets. The final output, through a fusion module, generates a representation of the target spatial orientation relations as triplets. As shown in
Figure 6, the WST recognition branch leverages current state-of-the-art keypoint detection algorithms to predict a Gaussian distribution (marked with yellow salient areas) at the center of each WST, with the heatmap channel corresponding to the Gaussian distribution indicating the category of the WST. Additionally, to recognize the spatial orientation relations between WSTs, we propose that the pixels within the region between target centers can fully determine these relations. Each pixel in this region, along with the reference target, forms a vector, where the vector orientation represents the “trend” of the spatial orientation relations between the targets. The distribution of all vector “trends” in this region forms the category of the spatial orientation relation, as indicated by the red vector arrows in
Figure 6. This region is referred to as the spatial orientation relations vector field of WST
A and
C. We can effectively design a spatial orientation relations recognition branch to predict the pixel distribution within this vector field and subsequently determine the type of field based on that distribution, which in turn allows us to classify the spatial orientation relation between the WSTs. Finally, a fusion module is introduced to integrate the predictions from both branches, yielding the output of the spatial orientation relation triplets.
3.2. Target Category Recognition Branch Design
The WST category recognition branch (T-branch) was designed to recognize the categories of the WSTs within an image. Inspired by the DCNNs developed for object detection [
35,
36], which typically extract feature maps by detecting keypoints associated with the target, the architecture of the T-branch is similarly constructed to generate Gaussian distributions, centered around the WSTs, in the various heatmap layers, which can be utilized to supervise the prediction of
Y.
Let I ∈ R^(w×h×3) represent an input image of width w and height h, and let FM denote the refined feature map generated by the hourglass structure, where z = 4 is the output stride. FM is processed through the T-branch, which comprises two consecutive convolutional layers, to obtain the heatmap Y of size (w/z) × (h/z) × c. In our case, c = 8 denotes the number of water surface object categories, which include “lighthouse”, “sailboat”, “buoy”, “cargo ship”, “naval vessels”, “passenger ship”, “submarine”, and “fishing boat”.
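A minimal PyTorch-style sketch of such a T-branch head is shown below. Only the two consecutive convolutions and the c = 8 output channels follow the text; the intermediate width (256 channels), the kernel sizes, and the sigmoid output are assumptions.

```python
import torch.nn as nn

# Sketch of the T-branch head: two consecutive convolutions mapping the refined
# feature map FM to an 8-channel heatmap Y (one channel per WST category).
# The 256 intermediate channels and the 3x3/1x1 kernels are assumed, not from the paper.
class TBranch(nn.Module):
    def __init__(self, in_channels: int, num_classes: int = 8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1),
        )

    def forward(self, fm):
        # A sigmoid keeps each Y[x, y, c] in [0, 1], interpreted as a confidence.
        return self.head(fm).sigmoid()
```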
Accurately predicting a WST from a single pixel in the heatmap
Y poses significant challenges. Therefore, our objective was to generate a Gaussian distribution centered at the detected WST’s location within
Y, thereby enhancing the heatmap’s features for target localization. Let
Ct = (
Ctx,
Cty) denote the coordinates of the water surface object’s center point in
Y. The Gaussian distribution [
28,
29,
30,
31] around
Ct can be obtained as follows:
where (x, y) denotes the coordinates of the pixel in the heatmap, and σt represents a target size-adaptive standard deviation. The subscript c indicates the channel of the Gaussian kernel, with the c-th channel corresponding to the class of the water surface object. The parameter σt ensures that the radius of the Gaussian circle is proportional to the water surface object size; typically, σt can be set to 1/3 of the WST size.
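For illustration, the following NumPy sketch renders such a size-adaptive Gaussian onto the heatmap channel of the corresponding class. The exponential kernel follows the standard CenterNet-style form of the cited references, and setting σt to one third of the object size matches the description above; the function name and array layout are assumptions.

```python
import numpy as np

# Sketch: draw a size-adaptive Gaussian around a WST center on its class channel.
# The kernel form follows the CenterNet-style references cited in the text;
# sigma_t = object_size / 3 matches the description above.
def draw_center_gaussian(heatmap, center, object_size, cls):
    """heatmap: (H, W, C) ground-truth array; center: (cx, cy) in heatmap coordinates."""
    h, w, _ = heatmap.shape
    cx, cy = center
    sigma = max(object_size / 3.0, 1.0)   # target size-adaptive standard deviation
    ys, xs = np.mgrid[0:h, 0:w]
    gauss = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    # Keep the element-wise maximum so overlapping targets of the same class
    # do not overwrite each other.
    heatmap[:, :, cls] = np.maximum(heatmap[:, :, cls], gauss)
    return heatmap
```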
The prediction pixel value Yx,y,c denotes the confidence level of a detected WST within the heatmap Y, where (x, y) indicates the coordinates of the pixel in the heatmap and c refers to both the channel of the heatmap and the class of the detected WST. In other words, the number of heatmap channels corresponds directly to the number of WST categories. The larger the value of Yx,y,c, the closer the pixel lies to the center point of the predicted Gaussian distribution Tx(x,y). Notably, Yx,y,c = 0 corresponds to the background.
Figure 7 displays the outputs of the T-branch for the original image presented in
Figure 7a, which contains two classes of WSTs: two “lighthouse” targets and one “passenger ship”.
Figure 7b,c illustrates the ground-truth representations of the “lighthouse” channel and “passenger ship” channel, respectively, generated via the Gaussian kernel centered at the WST’s central point. In
Figure 7b, the heatmap represents the “lighthouse” class, displaying two Gaussian distributions,
Tx (
x,
y), centered around the “lighthouse”, while
Figure 7c corresponds to the “passenger ship” class, where the Gaussian distribution,
Tx (
x,
y), was centered on the “passenger ship”. Notably, the radius of the Gaussian circle was directly proportional to the size of the WST. Consequently, the heatmap where the Gaussian circle is located indicates the WST category, while the location of the Gaussian circle and the associated pixels can be further employed to recognize the respective spatial orientations between the WSTs.
3.3. Spatial Orientation Relations Recognition Branch Design
The spatial orientation relations recognition branch (S-branch) was designed to identify the spatial orientation relations between the WSTs within the image. It generates two spatial orientation vector fields, an FB-field and an LR-field, between each pair of WSTs, which can be utilized to evaluate the spatial orientation relation between them.
Specifically, the S-branch is a neural network comprising two convolutional layers. The second convolutional layer employs two convolutional kernels to map the feature map FM to a feature map F of size (W/Z) × (H/Z) × C, which contains the spatial orientation vector fields, where W and H represent the width and height of the input image, Z = 4 denotes the output stride, and C = 2 indicates the number of channels in F. Let Ct = {Ct1, Ct2, Ct3, …, Ctn} represent the n center points of the WSTs detected by the T-branch. Our objective was to generate two spatial orientation vector fields for each pair of WSTs in F. Consequently, F contains a set Fc of areas (each area is referred to as a field), where the number of fields is n(n−1) (two per WST pair), h = 4 indicates that the width of each field is four pixels, and there are only two types of fields, namely the FB-field and the LR-field, which correspond to the front and back (F&B) and left and right (L&R) orientations, respectively. For each field, a field score is accumulated over its pixels (either an FB-field score or an LR-field score), where m indicates the number of pixels within the field.
Figure 8 demonstrates how to generate each spatial orientation relation vector field. Taking the encoding of the vector field between the reference object
A and the target object
B as an example, the center point of reference object
A is designated as the origin, thereby establishing an image coordinate system. The ground truth for pixel scores within this field can be encoded according to the following formula:
where c = 0 indicates a pixel in the FB-field, c = 1 indicates a pixel in the LR-field, and î and ĵ represent the unit vectors along the x-axis and y-axis, respectively. The vector v denotes the displacement vector formed by the pixel within the field relative to the coordinate origin. Pixels located outside the field are considered irrelevant to spatial orientation and are assigned a value of 0. Conversely, pixels within the field are subject to superposition, with their values capped at a maximum of 1. This encoding was designed to ensure that, as the spatial orientation relation between WSTs approaches the “F&B” relation, the pixel values in the FB-field increase while the pixel values in the LR-field decrease, and vice versa.
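To make the encoding concrete, the following sketch implements one plausible version of the ground-truth assignment described above: each in-field pixel forms a displacement vector v relative to the reference center, the FB-field (c = 0) stores its normalized vertical component, and the LR-field (c = 1) its normalized horizontal component, accumulated and capped at 1. The exact expression used in the paper's formula may differ; this is an illustrative assumption consistent with the stated behavior.

```python
import numpy as np

# Sketch of the ground-truth encoding for one WST pair. The normalized |y| (FB-field)
# and |x| (LR-field) components of the displacement vector v from the reference
# center are one plausible encoding consistent with the described behavior.
def encode_pair_fields(F, ref_center, field_pixels):
    """F: (H, W, 2) ground-truth array; ref_center: (rx, ry);
    field_pixels: iterable of (x, y) pixels lying in the strip between the two WST centers."""
    rx, ry = ref_center
    for x, y in field_pixels:
        vx, vy = x - rx, y - ry
        norm = np.hypot(vx, vy)
        if norm == 0:
            continue
        # Superpose contributions, capping each pixel value at 1; pixels outside
        # the field are never touched and stay 0.
        F[y, x, 0] = min(F[y, x, 0] + abs(vy) / norm, 1.0)  # FB-field score
        F[y, x, 1] = min(F[y, x, 1] + abs(vx) / norm, 1.0)  # LR-field score
    return F
```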
Figure 9 illustrates the output
F from the S-branch corresponding to the original image shown in
Figure 9a, which is identical to
Figure 7a. In this image, the spatial orientation relations are as follows: between the “lighthouse” on the left and the “lighthouse” on the right, the relation is “L&R”; between the “lighthouse” on the left and the “passenger ship”, the relation is also “L&R”; and between the “lighthouse” on the right and the “passenger ship”, the relation remains “L&R”.
Figure 9b,c shows the anticipated ground-truth FB-field and LR-field generated by the S-branch. In total, six spatial orientation vector fields were produced in
F, consisting of three FB-fields and three LR-fields. Notably, the pixel scores in the LR-field were significantly higher than those in the FB-field.
3.4. Fusion Module Design
To obtain the triplet list that contained all the spatial connections within an image, a fusion module was required to synthesize the output Y from the T-branch and F from the S-branch, along with the corresponding confidence levels for each pair.
The fusion module was implemented as follows. Initially, a 3 × 3 max pooling layer was applied to the predicted heatmap Y. The subsequent step selected as center points those locations whose values in the heatmap Y exceeded a threshold C (we set C = 0.5), indicating the presence of detected WSTs. The channel containing each center point was utilized to determine the category of the corresponding WST, with k representing the confidence associated with that category. Following this, based on the center points of the detected WSTs in Y, pairwise relations between the WSTs were established, and the corresponding FB-fields and LR-fields were constructed from the predicted output F.
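A PyTorch-style sketch of this peak-selection step is given below. It mirrors the common keypoint-decoding recipe (3 × 3 max pooling acting as non-maximum suppression, followed by thresholding at C = 0.5); the helper name and tensor layout are assumptions.

```python
import torch
import torch.nn.functional as nnf

# Sketch of WST center extraction from the predicted heatmap Y (shape [C, H, W]).
# A 3x3 max pooling acts as non-maximum suppression; surviving peaks above the
# threshold are kept as detected centers, their channel giving the class and
# their value k the class confidence.
def extract_centers(Y, threshold: float = 0.5):
    pooled = nnf.max_pool2d(Y.unsqueeze(0), kernel_size=3, stride=1, padding=1).squeeze(0)
    peaks = (Y == pooled) & (Y > threshold)
    centers = []
    for cls, y, x in torch.nonzero(peaks, as_tuple=False).tolist():
        centers.append({"class": cls, "center": (x, y), "confidence": float(Y[cls, y, x])})
    return centers
```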
The pixel scores derived from each field can be utilized to predict the spatial orientation relations between the water surface objects. We calculate the sum of the scores within the spatial orientation vector fields for both the FB-fields and the LR-fields as follows:
where
Sfb and
Slr represent the votes from all pixels in the FB-fields and LR-fields, respectively. Finally, based on the pixel positions of the center point predicted on
Y, the confidence associated with a specific spatial relation can be obtained from the following formula:
where
R represents the category of spatial orientation relations while
conf denotes the confidence of the corresponding spatial relation; both are represented using key-value pairs. Additionally,
S and
T are two thresholds, configured as
S = 0.8 and
T = 1.2.
Crty and
Csty denote the Y-axis coordinates of the centers of the reference object (object that is closer to the vertical axis of the pixel coordinate system) and the target, respectively. It is clear that our predictions for “FL&RR” and “LR&RF” require leveraging the pixel positions of the center point predicted on
Y.
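The following sketch shows one plausible decision rule consistent with the description above: the ratio of the LR and FB votes is tested against the thresholds S and T, and the vertical ordering of the reference and target centers (Crty versus Csty) disambiguates the two diagonal categories. The exact formula and the confidence definition are assumptions; only S = 0.8, T = 1.2, and the use of the center y-coordinates come from the text.

```python
# Sketch of the relation decision implied by the description above. The ratio test
# against S and T and the confidence definition are assumptions consistent with
# the surrounding explanation.
def decide_relation(s_fb: float, s_lr: float, c_rty: float, c_sty: float,
                    S: float = 0.8, T: float = 1.2):
    ratio = s_lr / max(s_fb, 1e-6)          # relative strength of the LR vote
    total = max(s_fb + s_lr, 1e-6)
    if ratio >= T:
        return "L&R", s_lr / total
    if ratio <= S:
        return "F&B", s_fb / total
    # Mixed votes indicate a diagonal relation; the vertical ordering of the
    # reference (c_rty) and target (c_sty) centers selects which diagonal pair applies.
    relation = "LR&RF" if c_sty > c_rty else "FL&RR"
    return relation, max(s_fb, s_lr) / total
```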
Thus, the fusion module integrates the features extracted from both the T-branch and S-branch. Utilizing the planar positions of these central points, the spatial orientation relations and confidence levels between each pair of WSTs can be derived from the feature map F produced by the S-branch. This process yields the final output of the target spatial orientation relations in the form of a triplet set.
3.5. Architecture of Spatial Orientation Relations Recognition Network
As illustrated in
Figure 10, the architecture of the WST-SOVF algorithm followed the generic encoder-decoder framework, forming an hourglass-shaped deep convolutional neural network (DCNN) consisting of a chain of convolution and up-convolution layers. Any effective deep neural network (DNN) architecture, such as VGG [
3], ResNet [
14], InceptionNet [
4], DenseNet [
37], etc. can serve as the backbone of the encoder module for feature extraction from the images. The decoder comprises up-convolutional layers that diverge into two branches: the T-branch and the S-branch, for the extraction of two classes of feature maps. The upsample module consists of three up-convolutional layers that transform the output of the backbone into a higher-resolution feature map. The T-branch is specifically designed for WST recognition, which generates multiple heatmap layers, with the quantity of these layers corresponding to the number of WST categories. Simultaneously, the spatial orientation recognition branch, referred to as the S-branch, generates a dual-channel heatmap, facilitating the prediction of vector fields for each WST pair. The subsequent fusion module was designed to integrate the WST recognition output
Y with the spatial orientation recognition output
F, enabling the compilation of a comprehensive triplet list of all WST pairs.
The network architecture details of our spatial orientation relations recognition model are illustrated in
Figure 11. The first part adopted ResNext50, which had been pre-trained for image classification. For the feature extraction phase, the input color image (512 × 512 size) was highly compressed into latent features through a series of deeply stacked convolutional blocks. As a result, the spatial dimensions of these features were reduced to 1/32 of the original resolution, while the number of channels remained substantial. The second part was the upsample module, which enhances a standard residual network by incorporating three up-convolutional networks, thereby facilitating a higher-resolution output. This up-convolutional phase ensures that the output size of the heatmap (512 × 512 pixels) is adequate to meet the coding requirements for the WST center point region and the WST-SOVF. The third part of the network consisted of a branching structure that divided into two branches, each containing two layers of convolution. One branch, namely the T-branch, is responsible for generating the heatmap corresponding to the Gaussian center point of the image target category [
38], effectively performing a decoding task to reconstruct the global layout of the target center point region. The other branch (i.e., S-branch) predicts the spatial orientation vector fields between WSTs. The fusion module was the last part, which establishes the spatial connections among all water surface objects in the image by synthesizing the outputs from both the T-branch and the S-branch. Thus, the output of the entire network is a triplet list that includes the object pairs and their corresponding spatial orientation relations.
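To summarize the data flow, a condensed PyTorch-style sketch of the encoder, upsample module, and two parallel heads is given below. The use of torchvision's resnext50_32x4d, the transposed-convolution upsampling, and the channel widths are assumptions standing in for the details of Figure 11.

```python
import torch.nn as nn
from torchvision.models import resnext50_32x4d

# Condensed sketch of the WST-SOVF layout: backbone encoder, three up-convolutions,
# then the parallel T-branch (8-channel category heatmap) and S-branch (2-channel
# FB/LR vector fields). Channel widths and kernel sizes are assumptions.
class WSTSOVFNet(nn.Module):
    def __init__(self, num_classes: int = 8):
        super().__init__()
        backbone = resnext50_32x4d(weights="IMAGENET1K_V1")
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])   # 1/32 resolution, 2048 channels
        self.upsample = nn.Sequential(                                  # 1/32 -> 1/4 resolution
            nn.ConvTranspose2d(2048, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.t_branch = nn.Sequential(                                  # WST category heatmap Y
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True), nn.Conv2d(64, num_classes, 1),
        )
        self.s_branch = nn.Sequential(                                  # FB/LR vector fields F
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True), nn.Conv2d(64, 2, 1),
        )

    def forward(self, image):
        features = self.upsample(self.encoder(image))
        # Sigmoids keep both outputs in [0, 1], matching the confidence and
        # capped field-score interpretations used in the text.
        return self.t_branch(features).sigmoid(), self.s_branch(features).sigmoid()
```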
3.6. Loss Function Design
Minimizing the discrepancy between the predicted and ground-truth spatial orientations significantly enhances the model accuracy across diverse objects. In order to optimize the trainable parameters of the proposed network, this study introduced a novel spatial loss function that was included in the overall loss computation and backpropagation process. The total loss function can be defined as
L = λ · Lcen (Y, Y*) + β · Ltof (F, F*),
where
Lcen (
Y,
Y*) and
Ltof (
F,
F*) represent the target center point loss and the spatial orientation vector field loss, respectively. The parameters
λ and
β are two balancing factors for
Lcen (
Y,
Y*) and
Ltof (
F,
F*), which were determined to be 2 and 1, respectively, through extensive experimentation. The two loss terms are computed as follows:
- (1)
Target center point loss
Lcen (
Y,
Y*):
Lcen (
Y,
Y*) contributes to the target category recognition.
Lcen (
Y,
Y*) has a similar form as the loss function employed in CenterNet [
28,
30] and CornerNet [
29,
31], that is
where
Yx,y,c denotes the confidence level of a detected target within the heatmap
Y, (
x,
y) indicates the coordinates of the pixel in the heatmap
Y,
c refers to both the channel of the heatmap and the class of the detected target,
Y*x,y,c denotes the ground-truth value of the pixel
Yx,y,c, and
N is the number of targets in the heatmap
Y*. This loss function enables the predicted values of all pixels surrounding the target center point to follow a Gaussian distribution. Additionally, the pixel values outside this region diminish toward zero as the distance increases.
- (2)
Spatial orientation vector field loss
Ltof (
F,
F*):
Ltof (
F,
F*) was specifically designed to accurately model the region of the spatial orientation vector field, taking into account the spatial orientation vectors both within and outside the field [
39,
40,
41]. For the vectors located inside the field, we expected that each pixel would approximate the ground truth closely. Thus,
Ltof (
F,
F*) was chosen to represent the mean square error relative to the ground-truth. Conversely, for pixels outside this region, a logarithmic function was employed to facilitate a more rapid convergence toward zero.
where
Fx,y,c denotes the value of the pixels in the feature map
F, (
x,
y) indicates the coordinates of the pixel in the feature map
F,
c refers to both the channel of the feature map and the class of the spatial orientation vector fields,
F*x,y,c denotes the ground-truth value of the pixel
Fx,y,c, and
M is the number of spatial orientation vector fields in the feature map
F. It is obvious that
F*x,y,c inside the fields is non-zero, and the values of the pixels outside the fields are all 0. The hyperparameter
α was set to 4, determined through comprehensive experimental evaluations.
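Since the exact loss expressions are not reproduced above, the sketch below shows one plausible realization consistent with the descriptions: a CenterNet/CornerNet-style penalty-reduced focal loss for Lcen, a masked mean-squared error inside the fields plus a logarithmic penalty outside for Ltof, and their weighted sum with λ = 2 and β = 1. The focal-style exponents in Lcen and the exact placement of α in Ltof are assumptions; only the MSE-inside/log-outside split, α = 4, λ = 2, and β = 1 come from the text.

```python
import torch

EPS = 1e-6

# Sketch of the two loss terms and their weighted combination, as described above.
# Exponents in l_cen and the placement of alpha in l_tof are assumptions.
def l_cen(Y, Y_gt, n_targets: int):
    pos = (Y_gt == 1).float()
    pos_loss = pos * (1 - Y) ** 2 * torch.log(Y + EPS)
    neg_loss = (1 - pos) * (1 - Y_gt) ** 4 * Y ** 2 * torch.log(1 - Y + EPS)
    return -(pos_loss + neg_loss).sum() / max(n_targets, 1)

def l_tof(F, F_gt, alpha: float = 4.0, n_fields: int = 1):
    inside = (F_gt > 0).float()
    mse_inside = inside * (F - F_gt) ** 2                              # match ground truth inside fields
    log_outside = -(1 - inside) * alpha * F * torch.log(1 - F + EPS)   # push outside pixels toward 0
    return (mse_inside + log_outside).sum() / max(n_fields, 1)

def total_loss(Y, Y_gt, F, F_gt, n_targets, n_fields, lam: float = 2.0, beta: float = 1.0):
    return lam * l_cen(Y, Y_gt, n_targets) + beta * l_tof(F, F_gt, n_fields=n_fields)
```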
5. Discussion
Our WST-SOVF algorithm demonstrated excellent performance in water surface target spatial orientation relation recognition, achieving an impressive average precision of 97.5% and average recall of 92.0% across all categories. The algorithm exhibited particularly strong detection capabilities for the “L&R” and “F&B” spatial orientation relations, with high precision and F1 scores across all four spatial relation categories. Beyond the primary dataset, our approach showed remarkable generalization capabilities, performing effectively on the Pascal VOC dataset with minimal adaptations. This cross-domain effectiveness validates the robustness of our architectural design, particularly its anchor-free detection principles, which maintained reliable performance even under moderate occlusion conditions (below 50%).
Despite these achievements, our algorithm still faces several notable challenges. Primarily, the WST-SOVF algorithm’s performance in complex marine environments, particularly those characterized by dense fog and rough seas, still has scope for improvement. Our qualitative experiments revealed that while the algorithm performed effectively under mild conditions (fog levels 1–3, sea states 0–4), its performance tended to decrease in challenging conditions with dense fog (levels 6–8) or rough seas (Beaufort scale 7–9). This performance degradation manifests as lower confidence scores and increased misclassifications, limiting the algorithm’s practical utility in real-world maritime surveillance applications where adverse weather conditions are common. The suboptimal performance in complex water surface scenarios can be attributed to several factors. Dense fog substantially reduces image contrast and obscures critical visual features, compromising the algorithm’s ability to accurately detect and classify objects. Similarly, rough sea conditions introduce significant motion-induced noise and create unstable backgrounds with complex wave patterns, disrupting the feature representation learning process. Moreover, complex marine states significantly affect environmental illumination, imaging quality, and object morphology. Most notably, these challenging conditions can cause considerable variations in color characteristics, which serve as crucial features for both maritime object detection and spatial relation recognition. Our future work will focus on improving the WST-SOVF’s robustness in complex water surface environments through enhanced image preprocessing, multi-modal sensing integration, and expanded training datasets. Furthermore, inspired by traditional water surface target detection methods that leverage color features effectively, we plan to explore the integration of color-based semantic information into our deep learning framework. In particular, we aim to utilize the natural color contrast between targets and the blue water background, which could provide valuable implicit cues for spatial orientation relation recognition.
Secondly, while the WST-SOVF algorithm demonstrated satisfactory performance in handling moderate occlusion situations, another technical challenge emerged when dealing with severe occlusion (occlusion rate exceeding 50%) and overlapping scenarios. Occlusion handling remains a persistent challenge in computer vision, as it represents a complex spatial relationship among objects. However, we have initiated research to address this limitation by extending the WST-SOVF algorithm. Our approach incorporates clustering-based concepts to group occluded foreground objects, occluded background objects, and the occlusion context together, enabling the recognition of such spatial relations as a unified set.
Additionally, the computational efficiency of the WST-SOVF algorithm merits careful consideration. Our empirical evaluation demonstrated that the algorithm achieved a processing throughput of 15–20 frames per second when deployed on an NVIDIA RTX 3070 GPU (Shenzhen Shenzhou Innovation Technology Co., Ltd. in Shenzhen, China.), with a model complexity of 124 M parameters. While this computational performance suffices for conventional water surface target surveillance applications, the processing latency may present challenges in scenarios requiring a stringent real-time response. The model’s parametric complexity reflects an intentional design choice that prioritizes detection reliability in challenging marine environments over computational efficiency. Future optimizations could potentially explore model compression techniques and hardware acceleration frameworks to enhance the WST-SOVF algorithm’s computational performance.