1. Introduction
In recent years, unmanned surface vehicles (USVs) have gradually been used in various fields, such as autonomous surface transportation [1], water quality testing [2], and autonomous surface cleaning [3]. To ensure that USVs complete their tasks safely and intelligently, an excellent and robust perception system is essential. Among all the perception tasks, object detection plays an important role in both safe navigation and special task completion, and small object detection poses the greatest challenge; examples include small reefs and other small obstacles that may endanger USVs, or small floating waste that a cleaning USV needs to collect.
Recent developments in computer vision have made vision-based object detection one of the most cost-effective solutions for the detection system of USVs. However, for vision-based small object detection on water surfaces, many objects can be missed or falsely detected due to the water surface environment. On the one hand, since the sky and the water surface occupy most of the image area, the reflection of sunlight may cause overexposure; small objects can be shaded by the halo or fused with the background, which causes missed detections. On the other hand, reflections of objects in the surrounding environment also disturb the detection system and cause false detections. In addition to the camera, LiDAR is also widely used for object detection as it can provide precise location and shape information of objects [4]. However, for small object detection on water surfaces with a LiDAR that has a low number of beams, the probability of LiDAR beams falling on small objects is low, and the detections of such objects can be unstable across sequential frames. In addition, dense fog readily forms on the water surface, which disturbs the propagation of LiDAR and leads to more clutter points [5].
With the development of integrated circuits, the low-cost single-chip 77 GHz millimeter-wave (mmWave) radar has gradually been adopted in autonomous vehicles and mobile robots in recent years. The mmWave radar can provide measurements of the range, azimuth, and Doppler velocity of objects. Besides, benefiting from the inherent propagation characteristics of the 77 GHz electromagnetic wave, the mmWave radar shows better robustness to harsh weather and lighting conditions than the camera and LiDAR [6] and can be used in all weather conditions, day and night. Despite this, there are still some challenges in using mmWave radar for small object detection on water surfaces. The angular resolution of mmWave radar point clouds is relatively low, and the points on objects are usually sparser [7]. Furthermore, the semantic information of mmWave radar point clouds is often insufficient, making it difficult to accurately discern the types of targets.
Therefore, for small object detection on water surfaces, vision and mmWave radar data complement each other effectively, and the fusion of vision and radar can improve detection performance. Compared to other levels of fusion, decision-level fusion has greater robustness and adaptability, and the fused results are also more interpretable. However, there are two challenges in the decision-level fusion of camera and radar in USV scenes:
Extrinsic Calibration. To perform decision-level fusion, the spatial relationship between the mmWave radar and the camera needs to be found, which is referred to as extrinsic calibration. Due to the glittery and sparse characteristics of mmWave radar point clouds, extrinsic calibration between the mmWave radar and the camera typically requires specific markers, and the calibration process is usually complex. Current extrinsic calibration is mainly conducted offline with human assistance. However, the positions of the sensors on the platform may change due to vibrations, shocks, or structural deformations of the USV, leading to some degree of variation in the extrinsic parameters between the mmWave radar and the camera.
Data association. Traditional methods tend to manually craft various distance metrics to represent the similarities between vision and mmWave radar data. However, these manually crafted metrics are not adaptable when the data from different sensors degrade, and setting the parameters is also challenging.
In this paper, we propose a small object detection method for water surfaces based on the decision-level fusion of vision and mmWave radar data. Compared to traditional methods, the proposed method has the following advantages: (1) Starting from an initial offline-calibrated extrinsic parameter, the proposed method adapts, to some degree, to changes in the extrinsic parameters during the USV's online operation; (2) The method has lower computational complexity and can run in real time on embedded systems; (3) The method achieves higher detection accuracy in the water surface small object detection task.
The contribution of this paper mainly lies in the following aspects:
We propose a new mmWave radar-aided visual small object detection method.
We propose a new image–radar association model based on metric learning, which can robustly associate mmWave radar data with images even when the extrinsic parameters are somewhat inaccurate.
We test the proposed method on real-world data, and the results show that our method achieves significantly better performance than current vision detection methods.
The remainder of this paper is organized as follows. In Section 2, we discuss the related works, including object detection on water surfaces and visual–radar fusion-based detection methods. In Section 3, we introduce the proposed mmWave radar-aided visual small object detection method in detail. Section 4 gives the results of experiments based on real-world data. Finally, Section 5 concludes this paper.
3. Our Method
For the task of small object detection on water surfaces, vision-based detection methods often generate false detections due to sunlight reflection and reflections of the surrounding scene. The mmWave radar is robust to different lighting conditions but contains limited semantic information compared to the RGB image, which makes it difficult to distinguish objects of similar sizes using a radar-based detection method. Besides, a radar-based detection method may generate false detections on water surfaces due to water clutter. Therefore, to improve the accuracy and robustness of small object detection on water surfaces, we propose a radar-aided visual small object detection method.
3.1. Network Overview
Due to the inherent shortcomings of camera and radar sensors, both vision-based and radar-based detection methods produce false detections in the water surface small object detection task. However, the two sensors generate false detections for different reasons, and the error statistics of the detection methods based on the two sensors are largely independent. Hence, we adopt a detection method based on the decision-level fusion of vision and radar data. The visual object detection results are obtained first, and then these detection results are associated with the radar data to reduce false detections.
However, for the decision-level fusion method, the spatial correlation between different sensors is of vital importance and requires accurate extrinsic parameters. Due to the sparse and glittery characteristics of mmWave radar point clouds, corner reflectors or LiDAR are usually needed as auxiliaries in the extrinsic calibration between radar and camera, which involves complex calibration procedures [31]. For USV applications, there can be certain variations in the extrinsic parameters between the radar and the camera due to vibrations, shocks, or structural deformations of the USV during operation. In this case, we propose a new image–radar association model based on metric learning. By training the model on data generated with the provided initial extrinsic parameters, the model becomes adaptable to variations in the extrinsic parameters in practical applications.
As shown in Figure 1, there are two main stages in the proposed radar-aided visual small object detection method: the detection stage and the association stage. Next, we will introduce more details about the two stages.
3.2. Detection Stage
The detection stage includes a vision-based detection model and a radar-based detection algorithm. We adopt YOLOv5-l [32] as the vision-based model. YOLOv5-l shows good performance in visual object detection tasks, and it is a lightweight model that can perform real-time inference on an embedded system.
3.2.1. Vision-Based Detection
To specialize the object detection model for our fusion algorithm, we modify the original YOLOv5-l [32] as the vision-based model. As our fusion detection algorithm can efficiently remove false positive detections by combining the radar-detection results and the vision detection results, we let the vision-based model generate more detection results to improve its recall rate. The framework of the enhanced YOLOv5-l is illustrated in Figure 2. We adjust the prediction head of YOLOv5-l using a double prediction head and a transformer decoder module; next, we introduce the architecture of the prediction head in detail.
(1) Double prediction heads. The YOLOv5 object detector uses a single prediction head to predict the location and classification of the detected bounding box at the same time. In our vision-based model, we design a double prediction head, consisting of a classification head and a location regression head, to predict the classification and location of objects, respectively. Independent double prediction heads benefit the search for both the location and the classification of objects. The classification head uses fully connected (FC) layers to obtain richer semantic information about the objects, while the location regression head predicts the positions of the detected objects.
(2) Transformer decoder module. Inspired by the vision transformer [33], we use a transformer decoder module to replace the convolution blocks in the prediction head. Compared with the convolution operation, the transformer decoder module can capture global information and abundant contextual information. Each transformer decoder contains a multi-head attention layer and a fully connected layer, and there are residual connections around each sublayer. As the prediction head is at the end of the network and the feature map has low resolution, applying a transformer decoder module to a low-resolution feature map exploits the feature representation potential of the self-attention mechanism and enlarges the receptive field of the prediction head at low computation and memory cost.
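The following PyTorch sketch illustrates one possible realization of such a decoupled prediction head with a transformer decoder block applied to a low-resolution feature map. The layer widths, number of attention heads, and anchor count are illustrative assumptions rather than the exact configuration used in our model.

```python
import torch
import torch.nn as nn

class TransformerDecoderBlock(nn.Module):
    """Multi-head self-attention + feed-forward sublayers, each with a residual connection."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                                   # x: (B, H*W, C) flattened feature map
        x = self.norm1(x + self.attn(x, x, x)[0])           # residual around attention
        x = self.norm2(x + self.ffn(x))                     # residual around feed-forward
        return x

class DoublePredictionHead(nn.Module):
    """Separate classification (FC) and box-regression branches on one feature scale."""
    def __init__(self, dim, num_classes, num_anchors=3):
        super().__init__()
        self.decoder = TransformerDecoderBlock(dim)
        self.cls_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, num_anchors * (num_classes + 1)))
        self.reg_head = nn.Conv2d(dim, num_anchors * 4, kernel_size=1)

    def forward(self, feat):                                # feat: (B, C, H, W) low-resolution map
        b, c, h, w = feat.shape
        tokens = self.decoder(feat.flatten(2).transpose(1, 2))          # (B, H*W, C)
        cls_out = self.cls_head(tokens).transpose(1, 2).reshape(b, -1, h, w)
        reg_out = self.reg_head(tokens.transpose(1, 2).reshape(b, c, h, w))
        return cls_out, reg_out
```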
After applying the vision-based detection model to an RGB image, the image anchors, expressed as $B_1, B_2, \ldots, B_m$, where $m$ denotes the number of image anchors, are extracted. Each image anchor contains four parameters: the u-axis position, v-axis position, box width, and box height in the u-v image coordinate system. Therefore, for each image, the output size of the detection stage is $m \times 4$.
3.2.2. Radar-Based Detection
A mmWave radar system senses its surroundings by transmitting and receiving FMCW signals. The transmitted and reflected signals are mixed using a frequency mixer to obtain beat signals. Then, a 1D (range) fast Fourier transform (FFT) and a 2D (velocity) FFT are applied to the sampled beat signals along the fast time and slow time, respectively, resulting in the well-known range–Doppler matrix (RDM). The cells with strong energy in the RDM are detected as targets. The most commonly employed detector for FMCW signal processing is the constant false alarm rate (CFAR) detector, which adaptively estimates the noise level from the cells near the cell under test. After detection, the direction of arrival (DOA) is estimated for each detected target using signals from multiple antennas. Consequently, we obtain what are referred to as 4D radar point clouds, representing the detected targets with distinct 3D positions and Doppler velocities. The radar signal processing chain is illustrated in Figure 3.
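The following simplified NumPy sketch illustrates the core of this processing chain for one antenna: a range FFT, a Doppler FFT, and a basic 1D cell-averaging CFAR. Window functions, calibration, and DOA estimation are omitted, and the guard/training cell counts and threshold scale are illustrative assumptions.

```python
import numpy as np

def range_doppler_map(beat_signal):
    """beat_signal: (num_chirps, num_samples) sampled beat signal for one antenna."""
    range_fft = np.fft.fft(beat_signal, axis=1)                        # 1D FFT along fast time -> range bins
    rdm = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)       # 2D FFT along slow time -> Doppler bins
    return np.abs(rdm) ** 2                                            # power of the range-Doppler matrix

def ca_cfar_1d(power, guard=2, train=8, scale=4.0):
    """Cell-averaging CFAR along one dimension: compare each cell against the mean
    of its training cells (guard cells excluded) multiplied by a scaling factor."""
    detections = np.zeros_like(power, dtype=bool)
    n = len(power)
    for i in range(train + guard, n - train - guard):
        left = power[i - train - guard:i - guard]
        right = power[i + guard + 1:i + guard + 1 + train]
        noise = np.mean(np.concatenate([left, right]))
        detections[i] = power[i] > scale * noise
    return detections
```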
For radar-based detection, we use the spatial information of the mmWave point clouds. The size of the input radar point cloud is $N \times 3$, where $N$ denotes the number of radar points in the current frame and each point contains the three coordinates $x$, $y$, $z$. The radar point clouds are clustered into groups, and the discrete radar clutter points are removed, using DBSCAN [34]. The point clouds are divided into $n$ clusters $C_1, C_2, \ldots, C_n$, where $n$ denotes the number of radar point clusters. Then, we use farthest point sampling (FPS) [35] to sample the point cloud of each group $C_i$ down to a fixed number of 32 points. Therefore, the final output size of the radar-based detection is $n \times 32 \times 3$.
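The clustering and sampling step can be sketched as follows, using scikit-learn's DBSCAN and a simple FPS implementation. The eps and min_samples values are illustrative assumptions, and padding clusters smaller than 32 points by repetition is shown as one possible choice rather than the exact procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def farthest_point_sampling(points, k):
    """Iteratively pick the point farthest from the already-selected set."""
    selected = [0]
    dists = np.linalg.norm(points - points[0], axis=1)
    for _ in range(1, k):
        idx = int(np.argmax(dists))
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(points - points[idx], axis=1))
    return points[selected]

def cluster_radar_points(points_xyz, eps=1.0, min_samples=3, num_per_cluster=32):
    """Cluster an (N, 3) radar point cloud with DBSCAN, drop clutter (label -1),
    and resample each cluster to a fixed number of points with FPS."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_xyz)
    clusters = []
    for lbl in set(labels) - {-1}:                          # -1 marks discrete clutter points
        pts = points_xyz[labels == lbl]
        if len(pts) >= num_per_cluster:
            pts = farthest_point_sampling(pts, num_per_cluster)
        else:                                               # pad small clusters by repeating points
            pad = pts[np.random.choice(len(pts), num_per_cluster - len(pts))]
            pts = np.vstack([pts, pad])
        clusters.append(pts)
    return np.stack(clusters) if clusters else np.empty((0, num_per_cluster, 3))
```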
Through the detection stage, the vision-based and radar-based detection results are gained. Then, the detection results are sent to the fusion association stage to generate fusion detection results.
3.3. Fusion Association Stage
The fusion association stage extracts image feature vectors from the object detection bounding boxes in the image plane and extracts radar feature vectors from the radar point clouds. The image features represent the position and size of the detection bounding boxes of the corresponding objects in the image plane. The radar features contain the spatial information of objects in the radar coordinate system as well as the shape information of the objects. Therefore, by measuring the similarity between image and radar features, the association of vision and radar-detection results can be achieved. The Hungarian algorithm [36] is used for matching image and radar data according to the L2 distance between the two feature vectors. Thus, an end-to-end spatial correlation between image and radar data can be achieved without the extrinsic parameter calibration procedure.
Next, we will introduce the radar and vision feature extraction models in detail. For a frame of RGB image, $m$ bounding boxes are generated from the detection stage, and the size of each bounding box is $1 \times 4$. We use a multi-layer perceptron (MLP) to extract the image feature $f_I^i$, whose size is $1 \times d$ with $d$ the feature dimension, from each vision detection result $B_i$:

$$f_I^i = \mathrm{MLP}(B_i)$$

where $f_I^i$ is the feature tensor of the $i$th vision detection result $B_i$.
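A minimal PyTorch sketch of this image feature branch is given below; the hidden layer sizes and the feature dimension $d$ (set to 64 here) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImageFeatureMLP(nn.Module):
    """Maps each (u, v, w, h) detection box to a d-dimensional embedding."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, boxes):            # boxes: (m, 4) box parameters
        return self.mlp(boxes)           # (m, feat_dim) image features
```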
A frame of radar data contains $n$ point cloud groups, and the output size of the radar-detection result is $n \times 32 \times 3$, where each group $C_i$ consists of 32 points, with each point containing $(x, y, z)$ coordinates. For radar feature extraction, we adopt the mini-PointNet [37] architecture, which is a well-known method for extracting point cloud features. Through the shared weighted MLPs, the max-pooling, and another MLP, each point cluster generates a feature of size $1 \times d$. The $n \times d$ radar feature tensor of a whole frame is generated by combining the $n$ cluster features. The radar feature extraction can be represented as follows:

$$f_R^i = \mathrm{MLP}\big(\mathrm{MaxPool}(\mathrm{MLP}(C_i))\big)$$

where $f_R^i$ denotes the radar point cloud feature of the $i$th cluster.
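A minimal PyTorch sketch of this mini-PointNet-style radar branch is given below; as before, the layer sizes and the feature dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MiniPointNet(nn.Module):
    """Shared per-point MLP -> max-pooling -> MLP, producing one feature per cluster."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.point_mlp = nn.Sequential(          # shared weights applied to every point
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.cluster_mlp = nn.Sequential(
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, clusters):                 # clusters: (n, 32, 3) sampled radar clusters
        per_point = self.point_mlp(clusters)     # (n, 32, 128) per-point features
        pooled = per_point.max(dim=1).values     # symmetric max-pooling over the 32 points
        return self.cluster_mlp(pooled)          # (n, feat_dim) radar features
```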
After obtaining a frame of image features $\{f_I^i\}_{i=1}^{m}$ and the corresponding radar features $\{f_R^j\}_{j=1}^{n}$, we compute the L2 distance between each object's image feature and each object's radar feature and obtain a cost matrix of size $m \times n$. Based on the cost matrix, within the minimum distance threshold, the matching results are gained using the Hungarian assignment algorithm.
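This matching step can be sketched as follows with SciPy's Hungarian solver; the distance threshold value is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(image_feats, radar_feats, max_dist=1.0):
    """Match m image features to n radar features by L2 distance.
    Pairs whose distance exceeds max_dist (an assumed threshold) are rejected."""
    cost = np.linalg.norm(image_feats[:, None, :] - radar_feats[None, :, :], axis=-1)  # (m, n) cost matrix
    rows, cols = linear_sum_assignment(cost)               # Hungarian assignment
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_dist]
```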
Through the fusion association stage, the final fusion detection results which contain the vision-based detection box, object classification result, and the range and azimuth of the objects can be gained.
3.4. Loss Function
In our method, the detection model and the image–radar association model are trained separately. The training loss function of the vision-based detection model, $L_{det}$, is the same as that of the YOLOv5 object detection model, which is computed as:

$$L_{det} = \lambda_{loc} L_{loc} + \lambda_{conf} L_{conf} + \lambda_{cls} L_{cls}$$

where $L_{loc}$ denotes the location loss, $L_{conf}$ denotes the confidence loss, and $L_{cls}$ denotes the classification loss. The three loss weights $\lambda_{loc}$, $\lambda_{conf}$, and $\lambda_{cls}$ are constants. For the training of the image–radar association model, we choose the triplet loss [38], which is commonly used as the training loss function in metric learning. Each training data pair for the triplet loss contains three samples: a vision-based detection bounding box $B_i$ as the base anchor, a positive radar sample $C_i^{+}$, which is the radar-detection cluster corresponding to $B_i$, and a negative radar sample $C_i^{-}$, which is randomly selected from the rest of the radar-detection clusters. The image and radar features are extracted from the training data pair, and the triplet loss is used to minimize the L2 distance $d^{+}$ between the image feature and the positive radar feature while maximizing the L2 distance $d^{-}$ between the image feature and the negative radar feature:

$$L_{tri} = \max(d^{+} - d^{-} + \alpha, 0)$$

where $\alpha$ is a constant margin that enforces a minimum separation between the positive and negative distances.
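A minimal PyTorch sketch of this triplet objective is given below; the margin value is an illustrative assumption, and PyTorch's built-in nn.TripletMarginLoss could be used equivalently.

```python
import torch
import torch.nn.functional as F

def triplet_association_loss(img_feat, radar_pos, radar_neg, margin=0.5):
    """Triplet loss over (image anchor, positive radar cluster, negative radar cluster)
    features; the margin is an assumed hyperparameter."""
    d_pos = F.pairwise_distance(img_feat, radar_pos)   # L2 distance to the matching cluster
    d_neg = F.pairwise_distance(img_feat, radar_neg)   # L2 distance to a non-matching cluster
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```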