1. Introduction
Ice along the Arctic shipping waterways is gradually thawing under the influence of global warming, and new shipping routes to polar areas are becoming available [
1]. This could greatly reduce the navigation time and increase safety [
2]. Glacial surges, fog, and ice flow will affect the navigation safety and may result in collisions with ice and ship damage. Therefore, it is important to promptly detect sea ice, icebergs, and passing ships to avoid ship–ice and ship–ship collisions [
3]. A detection system should provide information about the position and size of the objects on navigation routes, so as to support polar ship path planning and make ship navigation safer and more energy efficient.
Field observation relies mostly on ships and buoys. In earlier work, visual observation was combined with field measurements [
4], determining for instance, ice thickness through the on-site drilling of ice samples. However, on-site detection in the harsh polar environment is challenging, and data collection is limited [
5,
6]. In recent years, image processing and remote sensing technology have been applied to the acquisition of polar information, and indirect detection techniques have been developed [
7]. Methods such as ship walk observation, shipborne radar observation, and unmanned aircraft observation are used for local-scale detection, while active and passive microwave remote sensing is mainly used for large-scale observations [
8].
For local-scale environmental information, shipboard cameras are commonly used to acquire and analyze optical images. Weissling et al. [
9] developed a ship-based, ice condition imagery acquisition, processing, and analysis system. Worby et al. [
10] evaluated the ice distribution characteristics in the Antarctic based on 20,000 images acquired during Antarctic ship voyages. In addition, researchers are studying how to apply machine learning and deep learning to polar target detection. Li et al. [
11] proposed a two-stream radiative transfer model for ponded sea ice. The upwelling irradiance from the pond surface was determined and then its spectrum was transformed into RGB color space. Cai et al. [
12] employed convolutional neural networks to detect sea ice by instance segmentation using a simulation ice pool dataset and estimated ice size and concentration.
For large-scale ice detection, passive and active microwave remote sensing images are mostly used. Some algorithms for calculating ice concentration were proposed, including NASA Team (National Aeronautics and Space Administration), Bootstrap, and ASI (ARTIST Sea Ice) [
13]. For the identification and classification of sea ice, techniques such as the maximum likelihood method, SVM (support vector machines), Markov random field model, and neural networks have been utilized. Belchansky et al. [
14] used SSM/I (Special Sensor Microwave/Imager) brightness temperature data and remote sensing ice images acquired by the ERS and Okean satellites as inputs to train neural networks. Karvonen et al. [
15] segmented and classified six types of ice from Synthetic Aperture Radar (SAR) images using a pulse-coupled neural network. Ressel et al. [
16] utilized an artificial neural network to classify ice, and the results demonstrated that the method was resistant to image noise. However, generally, the models used were not modified and improved according to the characteristics of the remote sensing ice images to be analyzed.
The detection based on shipboard optical images is characterized by high resolution, rapidity, and the ability to provide rich information [
17], but it cannot provide continuous monitoring of the environment and is affected by adverse weather conditions. The detection based on remote sensing images can be applied to wide polar regions and is independent of the weather conditions, but its spatial resolution is relatively low, and it is not sufficiently accurate to distinguish small targets. Most studies focused on ship detection rather than on ice detection, and those that investigated ice detection systems mainly used a single data source consisting of either remote sensing or optical images.
In this paper, we combined data of local-scale optical images and remote sensing images to integrate their specific strengths. Polar datasets at different scales were constructed. The SSD model was used for polar target detection at the local scale. For remote sensing detection, the YOLOv5 model was improved according to the characteristics of the sea ice, and ablation and comparison experiments were conducted to verify the model. We performed a slicing operation on the images to ensure that small sea ice targets could be detected and we constructed hybrid datasets to verify the proposed model.
2. Polar Multi-Target Detection at the Local Scale
2.1. Target Detection
Deep learning detection algorithms fall into two primary families: region proposal-based methods and end-to-end methods. Overfeat, R-CNN (Region-CNN), Faster R-CNN [18], etc., belong to the region proposal-based family, while YOLO and SSD belong to the end-to-end family [19]. The region proposal-based method has a significant advantage in detection accuracy over the end-to-end method because its two-step design localizes and classifies targets more accurately. On the other hand, it has a significant disadvantage in detection speed because generating the region proposals takes a long time. The end-to-end detection method directly extracts features for object localization and classification using convolution. The SSD borrows the anchor mechanism of the RPN (Region Proposal Network) in Faster R-CNN and thereby combines the detection speed of the end-to-end method with the detection accuracy of the region proposal method. Therefore, in this paper, we chose the SSD model for polar multi-target detection on the local scale.
2.2. SSD Model
The SSD model consists of two major components, a base network and additional network layers, as shown in
Figure 1. The base network uses the structure of VGG16 (Visual Geometry Group) and converts its last two fully connected layers into convolutional layers; Conv4_3 and Fc7 serve as detection feature layers. The additional network layers include four sets of convolutional layers: Conv6_2, Conv7_2, Conv8_2, and Conv9_2. The SSD detection model operates as follows.
Firstly, the input image is converted to a three-channel RGB (Red Green Blue) image with a resolution of 300 × 300 or 500 × 500. The image is fed into the network to extract multi-scale feature information; for a 300 × 300 input, the scales of the six feature layers are 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3 and 1 × 1.
Then, target feature extraction is performed through six feature layers of different scales. Default boxes are generated for each point of the feature map, and the number of default boxes is different for each layer.
Finally, the predictions from all the generated default boxes are gathered and filtered by non-maximum suppression (NMS) with an intersection over union (IoU) threshold of 0.5. The final output contains information about the location, category, and confidence level of the target.
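The NMS filtering in the last step can be illustrated with a minimal sketch. It assumes greedy NMS on boxes given as (x1, y1, x2, y2) and a single category; the function names `iou` and `nms` are hypothetical and are not taken from the SSD implementation used in this paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

In a multi-category detector this routine would typically be run once per category, so that an iceberg box does not suppress an overlapping icebreaker box.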
The SSD is characterized by its efficiency as a single-stage detector, performing detection directly in a single forward pass without the need for region proposals, which results in a faster detection compared to other models. It leverages multi-scale features and default boxes and can detect objects of various sizes. These advantages make SSD an effective model widely applied in practical scenarios.
2.3. Construction of a Local-Scale Polar Multi-Target Dataset
Due to the lack of a publicly accessible dataset for polar targets, constructing a new dataset was a necessary step. A total of 650 images were obtained through searching, de-duplication, annotation, and review to create a local-scale polar multi-target dataset. Some of the images were downloaded from the Norwegian Meteorological Institute (
https://icewatch.met.no, accessed on 19 August 2022). The dataset was divided into 5 categories, namely, sea ice (first-year ice), icebreakers, icebergs, inter-ice waterways, and melt ponds on ice. LabelImg, an image annotation tool, was used to label the images as fy, icebreaker, iceberg, channel, and pool, respectively [
20,
21]. Finally, the dataset was randomly divided into training and testing sets at a ratio of 8:2. The details are shown in
Table 1.
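A random 8:2 split of this kind can be sketched as follows. The file names and the fixed seed are hypothetical and serve only to make the example reproducible; the actual split procedure used for the dataset is not specified beyond the ratio.

```python
import random


def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle the items and split them into training and testing lists."""
    items = list(items)
    rng = random.Random(seed)  # fixed seed (assumption) for a reproducible split
    rng.shuffle(items)
    n_train = round(len(items) * train_ratio)
    return items[:n_train], items[n_train:]


# Hypothetical file names standing in for the 650 annotated images.
image_names = [f"img_{i:04d}.jpg" for i in range(650)]
train_set, test_set = split_dataset(image_names)
```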
The majority of the images were captured by shipboard cameras and UAVs (Unmanned Aerial Vehicles), and the photographed scenes corresponded to polar ship navigation scenarios. Some of the sample images are shown in
Figure 2.
2.4. Results
The model training and testing configurations are shown in
Table 2. The detailed training parameters are shown in
Table 3.
The training proceeded as follows. Firstly, forward propagation was used to predict the results and calculate the loss value. Secondly, the parameter gradients were calculated by backward propagation, and the parameters were optimized and updated. Finally, gradient descent was iterated until the loss converged or the maximum number of iterations was reached; the resulting model was then used for the subsequent testing and target detection tasks.
Average precision (AP), F1 score, and mean average precision (mAP) were determined to evaluate the detection accuracy [
22]. The precision (P) value can quantify the effectiveness of sample classification, and the recall (R) value can evaluate the capacity to detect positive samples. Considering only precision or only recall is not sufficient to evaluate a model; so, the F1 score was used to harmonize P and R. The calculation of mAP can be divided into two steps: the first step consists of the calculation of the AP (average precision) of each category, while the second step involves determining the sum of the average precision values of each category and then its average value to obtain mAP. These parameters were calculated according to Equations (1)–(5):
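$$P = \frac{TP}{TP + FP} \qquad (1)$$
$$R = \frac{TP}{TP + FN} \qquad (2)$$
$$F1 = \frac{2 \times P \times R}{P + R} \qquad (3)$$
$$AP = \int_{0}^{1} P(R)\,dR \qquad (4)$$
$$mAP = \frac{1}{k}\sum_{i=1}^{k} AP_i \qquad (5)$$
where TP (true positives) is the number of positive samples correctly classified as positive, FP (false positives) is the number of negative samples incorrectly classified as positive, TN (true negatives) is the number of negative samples correctly classified as negative, and FN (false negatives) is the number of positive samples incorrectly classified as negative; k is the number of categories.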
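As a hedged illustration of how these metrics are computed in practice, the sketch below uses the standard definitions of precision, recall, F1 score, and mAP. The function names are hypothetical, and the per-category AP values are the rounded figures reported for the test set in this section.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from raw detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def mean_average_precision(ap_per_class):
    """mAP: arithmetic mean of the per-category AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Rounded per-category AP values as reported for the local-scale test set.
reported_ap = {"icebreaker": 0.92, "iceberg": 0.85, "fy": 0.77,
               "channel": 0.52, "pool": 0.45}
map_value = mean_average_precision(reported_ap)
```

With the rounded per-category values the mean is about 0.702, in line with the reported mAP of 70.19%.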
After training, the model was applied to the test set, and finally, an mAP value of 70.19% was obtained. The accuracy of the icebreaker category was the highest at 92%, followed by those of the iceberg category, which was 85%, and of the fy (first-year ice) category, which reached 77%. The accuracies of channel and pool were the lowest, 52% and 45%, respectively, due to the low number of images or labels for these two categories. The detection results for each category are shown in
Figure 3. Some of the test results are shown in
Figure 4. The SSD model works well for the detection of large targets at close range, but it is not effective in detecting small targets at a distance.
3. Sea Ice Detection by Remote Sensing
The detection on a local scale does not fully meet the requirements of navigating in polar regions, and using a single data source has certain limitations. The ship optical cameras cannot obtain large-scale and long-time series images and cannot monitor non-navigable areas. If sea ice in remote sensing images can be identified and located, and data fusion between local-scale and large-scale data can be performed, the advantages of different data sources can be fully utilized [
23].
3.1. Introduction of the YOLOv5 Model
In remote sensing images, the ice masses appear very small and densely clustered, and the SSD model is not able to analyze them. Therefore, the YOLOv5 model was adopted and, after its accuracy and efficiency were improved, applied to the detection of ice through remote sensing. The backbone, neck, and head are the three basic structural components of the YOLOv5 model, as shown in
Figure 5.
The YOLOv5 backbone uses CSPDarknet, which is composed of cross stage partial networks, to extract features from images. The Focus module is responsible for efficiently downsampling the images. It moves spatial information into the channel dimension while preserving the original information. The backbone also incorporates the C3, C3_F, and Spatial Pyramid Pooling Fast (SPPF) modules. The C3 and C3_F modules enhance the extraction of image features and increase the overall speed.
The neck module in YOLOv5 utilizes PANet to produce a feature pyramid network. These aggregated features are subsequently forwarded to the head module for prediction. The neck layer integrates the structures of the feature pyramid network (FPN) and the path aggregation network (PAN). Deep-feature images possess a higher degree of semantic information but a lower degree of location information, whereas shallow-feature images exhibit the reverse characteristics. The FPN model can transmit semantic information from a deep-feature image to a shallow-feature image. In contrast, PAN can transmit location information from a shallow-feature image to a deep-feature image. The integration of FPN and PAN enables the consolidation of parameters across various detection layers.
The YOLOv5 head is composed of layers that produce predictions from the anchor boxes. Two components are central to the head: the loss function and non-maximum suppression (NMS). The binary cross entropy loss function is employed for the computation of classification loss and confidence loss, whereas the complete IoU (CIoU) loss function is utilized for the estimation of location loss. The CIoU loss function incorporates three crucial terms: the overlap area, the distance between the box centers, and the aspect ratio. NMS is employed to eliminate redundant detections while retaining the candidate box with the highest prediction probability as the final prediction box.
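The three terms of the CIoU loss can be made concrete with a small sketch. It follows the commonly published CIoU formulation (IoU, squared center distance normalized by the squared diagonal of the enclosing box, and an aspect ratio penalty); the function name `ciou_loss` and the `eps` stabilizer are assumptions, and this is not the YOLOv5 source code.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss between a predicted box and a ground-truth box, both (x1, y1, x2, y2)."""
    # Overlap area term: plain IoU.
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    iou = inter / (w * h + wg * hg - inter + eps)

    # Center distance term: squared center distance over the squared
    # diagonal of the smallest box enclosing both boxes.
    rho2 = ((box[0] + box[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
         + ((box[1] + box[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect ratio term.
    v = (4 / math.pi ** 2) * (math.atan(wg / (hg + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```

Unlike a plain IoU loss, this loss still provides a useful signal for non-overlapping boxes, because the center distance term remains positive when the IoU is zero.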
3.2. Improved YOLOv5 Model
The YOLOv5 model was improved in three aspects. Firstly, the Squeeze-and-Excitation Networks (SE) attention module was added to the backbone of the original model. Secondly, the Fast Spatial Pyramid Pooling and Cross Stage Partial Network module (SPPCSPC-F) was used to augment the characterization capabilities. Finally, Funnel Activation (FReLU) was introduced to replace the Sigmoid-Weighted Linear Unit (SiLU) and improve the accuracy of ice detection.
3.2.1. Squeeze-and-Excitation Networks (SE) Attention Mechanism
Due to the large size of the remote sensing images and the small size of the ice targets, it is easy to lose some useful information. The Squeeze-and-Excitation Networks (SE) attention mechanism was added to the YOLOv5 backbone [
24]. The SE module was inserted after the convolutional layers. The module consists of two operations: squeeze and excitation. It is integrated to adaptively adjust the importance of each channel by learning their weights. The structure of the SE attention mechanism is shown in
Figure 6.
In the squeeze phase (Fsq), global average pooling is applied to the input feature map, compressing it from three dimensions to one dimension. This one-dimensional tensor captures global information for each channel. In the excitation phase (Fex), a set of fully connected layers operates on the output of the squeeze phase. These layers model the importance of each channel and generate a channel attention vector. Finally, a rescale operation (Fsc) normalizes the weights and multiplies them onto each feature channel.
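The squeeze, excitation, and rescale operations can be sketched in plain Python on a feature map stored as nested lists. The sketch assumes two bias-free fully connected layers with a ReLU between them and a sigmoid at the end, as in the original SE design; the function name `se_block` and the weight layout are hypothetical.

```python
import math


def se_block(x, w1, w2):
    """Apply SE attention.

    x:  feature map as nested lists with shape [C][H][W]
    w1: weights of the first FC layer, shape [C_r][C] (channel reduction)
    w2: weights of the second FC layer, shape [C][C_r] (channel restoration)
    Returns the rescaled feature map and the channel attention vector.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields weights in (0, 1).
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * hc for w, hc in zip(row, hidden))))
         for row in w2]
    # Rescale: multiply every value of a channel by that channel's weight.
    y = [[[v * s_c for v in row] for row in ch] for ch, s_c in zip(x, s)]
    return y, s
```

Because the attention vector has one entry per channel, the module adds few parameters relative to the convolutional layers it follows.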
3.2.2. SPPCSPC-F
The Spatial Pyramid Pooling Fast (SPPF) is a module designed to enhance feature representation. SPPF is the improved version of Spatial Pyramid Pooling (SPP) and is faster than SPP under the same conditions. The structure of the SPPF module is shown in
Figure 7.
The input feature map passes through three 5 × 5 maximum pooling layers, and three different sizes of receptive fields are obtained. Although maximum pooling can expand the receptive field, it will reduce the resolution of the feature map and cause the loss of some useful information. SPPCSPC is a structural module that combines the concepts of SPP and Cross Stage Partial Network (CSP) [
25]. In this paper, following the idea of SPPCSPC, we propose the SPPCSPC-F module to replace the SPPF. The structure of SPPCSPC-F is shown in
Figure 8.
The input feature map is passed through the SPPCSPC-F module, with one path performing convolutional operations to extract features and the other path preserving the original features. Next, the module performs multi-scale pooling operations on the feature map to capture features with different receptive fields. Finally, the fused features are further processed by subsequent convolutional layers. The order of pooling is modified to increase the speed while keeping the receptive field constant.
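Why serial pooling keeps the receptive field unchanged can be checked with a small sketch. It assumes stride-1 max pooling with "same" padding in which out-of-bounds cells are ignored, and it compares chained 5 × 5 pools with single 9 × 9 and 13 × 13 pools, the kernel sizes commonly used in the parallel SPP layout. The function name `max_pool_same` is hypothetical.

```python
import random


def max_pool_same(x, k):
    """Stride-1 k x k max pooling on a 2D list; out-of-bounds cells are ignored."""
    h, w, r = len(x), len(x[0]), k // 2
    return [[max(x[a][b]
                 for a in range(max(0, i - r), min(h, i + r + 1))
                 for b in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)]
            for i in range(h)]


# Random 12 x 12 feature map (illustrative data only).
rng = random.Random(0)
feature = [[rng.randint(0, 99) for _ in range(12)] for _ in range(12)]

# Serial layout: each pool reuses the previous result.
p1 = max_pool_same(feature, 5)
p2 = max_pool_same(p1, 5)
p3 = max_pool_same(p2, 5)

# Parallel layout: larger kernels applied directly to the input.
same_as_9 = p2 == max_pool_same(feature, 9)
same_as_13 = p3 == max_pool_same(feature, 13)
```

The serial version touches 25 cells per output three times instead of 25 + 81 + 169 cells, which is the source of the speed gain.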
3.2.3. FReLU Activation Function
In the YOLOv5, the Sigmoid-Weighted Linear Unit (SiLU) is used as the activation function. When the input values are far below zero, the derivative of the SiLU approaches zero, leading to gradient saturation, which can make it difficult for the network to converge or cause training instability. Therefore, the FReLU was used to replace the SiLU. The FReLU activation function incorporates learnable parameters, enabling the network to adaptively adjust the shape of the activation function through learning [
26]. This flexibility enhanced the model’s learning capacity and improved its adaptation to the sea ice characteristics. Combining SE attention with FReLU enables YOLOv5 to extract high-quality features, concentrate on key objects, reduce overfitting, and improve generalization ability, especially for detecting small objects in polar regions. The FReLU is defined by Equations (6) and (7):
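$$f(x_{c,i,j}) = \max\left(x_{c,i,j},\ \mathbb{T}(x_{c,i,j})\right) \qquad (6)$$
$$\mathbb{T}(x_{c,i,j}) = x^{\omega}_{c,i,j} \cdot p^{\omega}_{c} \qquad (7)$$
where $\mathbb{T}(x_{c,i,j})$ denotes the funnel condition, $x^{\omega}_{c,i,j}$ denotes a $k_h \times k_w$ Parametric Pooling Window centered on $x_{c,i,j}$, $p^{\omega}_{c}$ denotes the coefficient on this window which is shared in the same channel, and (·) denotes dot multiply. The FReLU activation function is shown in
Figure 9.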
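A minimal sketch of the funnel activation is given below. It assumes a 3 × 3 window, zero padding at the borders, and one coefficient window per channel, and it omits any normalization of the window response; the function name `frelu` and the nested-list layout are hypothetical.

```python
def frelu(x, weights):
    """Funnel activation: max(x, T(x)) applied per channel.

    x:       feature map as nested lists with shape [C][H][W]
    weights: one 3 x 3 coefficient window per channel, shape [C][3][3]
    """
    out = []
    for ch, k in zip(x, weights):
        h, w = len(ch), len(ch[0])
        out_ch = []
        for i in range(h):
            row = []
            for j in range(w):
                # Funnel condition: weighted sum over the window centered
                # on (i, j); cells outside the map count as zero.
                t = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        a, b = i + di, j + dj
                        if 0 <= a < h and 0 <= b < w:
                            t += ch[a][b] * k[di + 1][dj + 1]
                row.append(max(ch[i][j], t))
            out_ch.append(row)
        out.append(out_ch)
    return out
```

With all coefficients set to zero the function reduces to ReLU, which shows how the learnable window generalizes a fixed threshold into a condition that depends on the spatial neighborhood.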
3.3. Construction of a Remote Sensing Sea Ice Dataset
The remote sensing sea ice dataset was mainly derived from the Google Earth (
http://earthengine.google.com/, accessed on 25 December 2022) and the Northwestern Polytechnical University (NWPU) datasets [
27]. A total of 600 images, obtained after de-duplication, annotation, and review, constituted the dataset. The tag name was ice, and the number of tags was 15,948. The dataset was randomly divided into a training set and a test set at a ratio of 8:2. Some of the sample images in the dataset are shown in
Figure 10.
Neural networks need a large amount of data and a high data quality to improve their performance and robustness. The YOLOv5 uses Mosaic, adaptive cutout, and other data processing methods for data enhancement [
28].
The main idea of Mosaic is to randomly crop and scale several images and then randomly arrange and splice them into a single image, which enriches the dataset and improves the training speed of the network. In the normalization operation, the statistics of several images are calculated at one time, which reduces the need for a large batch and thus the demand for computer memory. The data augmentation process is shown in
Figure 11.
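The splicing idea behind Mosaic can be sketched in a simplified form. The sketch assumes four single-channel images stored as 2D lists, each at least as large as the output, and it only crops and arranges them around a random split point; the random scaling and the remapping of bounding box labels performed by the real augmentation are omitted. The function name `mosaic` is hypothetical.

```python
import random


def mosaic(images, size, rng):
    """Splice random crops of four 2D images into one size x size image."""
    # Random split point, kept away from the borders so every quadrant is non-empty.
    cx = rng.randint(size // 4, 3 * size // 4)  # column of the split
    cy = rng.randint(size // 4, 3 * size // 4)  # row of the split
    # Output regions as (row_start, row_end, col_start, col_end).
    quads = [(0, cy, 0, cx), (0, cy, cx, size),
             (cy, size, 0, cx), (cy, size, cx, size)]
    out = [[0] * size for _ in range(size)]
    for img, (r0, r1, c0, c1) in zip(images, quads):
        # Random crop of the source image with the shape of the quadrant.
        top = rng.randint(0, len(img) - (r1 - r0))
        left = rng.randint(0, len(img[0]) - (c1 - c0))
        for r in range(r0, r1):
            for c in range(c0, c1):
                out[r][c] = img[top + r - r0][left + c - c0]
    return out, (cx, cy)
```

Each output image therefore contains context from four scenes, which is one reason the augmentation is helpful when individual targets are small.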
There are many challenges in the detection of remote sensing images, as some targets are relatively small in size and usually clustered together. If the images are directly sent into the network for detection, many small targets cannot be effectively identified.
To solve this problem, in the detection stage, a sliding window was used to cut the image into slices of a specified size (e.g., 416 × 416), which served as the network input. Adjacent slices had a 15% overlap. The slicing operation on the remote sensing image is shown in
Figure 12. The purpose of the overlap is to ensure that every region is completely detected. Although this causes duplicate detections, the duplicates in the overlapping sections can be filtered out by the NMS. Finally, the results of all slices were combined to obtain the detection results.
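The slicing step can be sketched as follows, assuming square slices, a stride derived from the overlap ratio, and a last slice that is shifted back so that it stays inside the image. The function names `slice_windows` and `to_global` are hypothetical, and the 3660 × 3660 size is used only as an example.

```python
def slice_windows(width, height, tile=416, overlap=0.15):
    """Return (x1, y1, x2, y2) windows that cover the image with overlap."""
    stride = int(tile * (1 - overlap))

    def starts(length):
        if length <= tile:
            return [0]
        s = list(range(0, length - tile, stride))
        s.append(length - tile)  # last window is shifted back to fit the image
        return s

    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height) for x in starts(width)]


def to_global(box, window):
    """Shift a box detected inside a window back to image coordinates."""
    x1, y1 = window[0], window[1]
    return (box[0] + x1, box[1] + y1, box[2] + x1, box[3] + y1)


windows = slice_windows(3660, 3660)
```

After every slice has been processed, the boxes are shifted with `to_global` and a single NMS pass over the whole image removes the duplicates that arise in the overlapping strips.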
In order to verify the accuracy of the improved YOLOv5 model, we combined simulated sea ice images and real sea ice images into a hybrid dataset. The simulated images were constructed as follows. Firstly, we built a large flat ice field. Secondly, we fragmented the flat ice field to obtain a broken ice field. The Voronoi diagram is morphologically similar to an ice field with large pieces of broken ice and consists of a set of continuous polygons formed by the perpendicular bisectors of lines connecting two neighboring points. We used the RayFire plug-in of 3ds Max to fragment the flat ice field according to the Voronoi diagram, as shown in
Figure 13a. Finally, the size of the broken ice field was reduced by 80% to enlarge the gaps between the ice blocks, as shown in
Figure 13b.
3.4. Results
3.4.1. Ablation Study
The ablation study was conducted to compare the different improvement methods. All models were trained with the same configuration used for local-scale polar object detection. The number of epochs was set to 300, the initial learning rate was 0.001, the momentum parameter was 0.9, the weight decay parameter was 0.0005, and the NMS threshold was 0.5. The evaluation was carried out after every 30 training epochs. The results are shown in
Table 4.
In Table 4, it can be observed that the mAP of the original YOLOv5 model was 0.719, the lowest among those of the evaluated models. Adding SE increased the mAP by 1.9 percentage points, to 0.738. Adding SPPCSPC-F increased the mAP by 2.4 percentage points, to 0.743; however, the R value was relatively low, i.e., 0.688. Adding FReLU increased the mAP by 2.8 percentage points, to 0.747. When SE, SPPCSPC-F, and FReLU were added together, the mAP improved by 3.5 percentage points, reaching the highest value among those of all the examined models.
Similarly, the P, R, and F1-scores of the original YOLOv5 model were 0.719, 0.684, and 0.701, whereas those of the proposed method were 0.753, 0.703, and 0.727, that is, they increased by 3.4, 1.9, and 2.6 percentage points, respectively. Therefore, the improved YOLOv5 model showed superior accuracy and enhanced performance in remote sensing sea ice detection.
3.4.2. Contrast Study
In order to further validate the advantages and efficacy of the improved YOLOv5 model, which incorporates the three modules mentioned above, a comparative experiment was conducted. We compared the improved model with other conventional models, such as Faster-RCNN, YOLOv3, and YOLOv4-tiny; the values of loss and mAP are shown in
Figure 14.
In the first 40 epochs, the loss of each model fell quickly, indicating that the training had not yet reached a stable state. Once the training is stable, the loss curve flattens. The loss of our model was lower than that of the others when training reached a steady stage. The mAP rose sharply in the first 80 epochs. All models tended to become more stable after 250 training epochs, and the mAP of our model was the highest.
The values of the evaluation indicators are shown in
Table 5. Among YOLOv3, YOLOv4-tiny, Faster-RCNN, the original YOLOv5, and our model, YOLOv3 had the lowest mAP, at 60.4%, whereas our model had the highest, at 75.4%. YOLOv3 and YOLOv4-tiny showed a higher P value but a lower R value, which indicated that these two models miss many ice targets when detecting sea ice. Based on the above results, the improved YOLOv5 performs better in sea ice detection.
The improved YOLOv5 was used to test a remote sensing image with a resolution of 3660 × 3660. Since some sea ice targets were too dense, the confidence degree was hidden in the results. The detection results of the original YOLOv5 are shown in
Figure 15a.
Figure 15b shows zoomed-in local images using the original YOLOv5, in which the number of detected sea ice masses was 14 and 55.
Figure 15c shows zoomed-in local views of the image detected by the improved YOLOv5, in which the number of detected sea ice masses was 53 and 88. When using the improved YOLOv5, the number of detected ice targets increased by 39 and 33, respectively, and most of the additional targets were small.
The results with the confidence degree are shown in
Figure 16. Both a real image and a simulated sea ice image are presented. The results demonstrated that the improved YOLOv5 model was able to detect ice targets in simulated sea ice images with strong generalization ability and robustness.
Local-scale detection covers ranges from tens to hundreds of meters, whereas remote sensing detection covers ranges from tens to hundreds of kilometers. If a ship navigates in the polar regions using only local-scale data, the planned path may be optimal at the local scale but not on the whole, as it could be unnecessarily long. If only remote sensing data are used, the planned path may be the best on a large scale, but it may miss some obstacles that will jeopardize the safety of ship navigation on a local scale. In this paper, local-scale and remote sensing data were combined to take advantage of their respective strengths. Our results indicated that the use of this combination for the detection of obstacles can improve the safety and efficiency of polar navigation.
4. Discussion
The instability of polar conditions makes navigation difficult. Sea ice floating on the surface is difficult to detect and is prone to colliding with the hull or the propeller. In this paper, polar datasets at different scales were constructed. The SSD model was used for multi-target detection at the local scale. For remote sensing images, hybrid datasets were constructed, a slicing operation was performed, and the YOLOv5 model was improved and tailored to detect sea ice. Ablation and comparison experiments were conducted to verify the proposed model.
Regarding the data source, most studies adopt either remote sensing or optical images as a single data source. For example, Li et al. [29] developed a novel method to extract sea ice cover from Sentinel-1 data based on the support vector machine (SVM). Xu et al. [
30] proposed a Recurrent Attention Convolutional Neural Network (RA-CNN) to classify different ships. In this paper, the fusion of remote sensing and optical images is used to take advantage of the complementary strengths.
For ice detection, some studies did not adapt their models to the characteristics of the ice. Moreover, many studies used only real datasets to verify the accuracy of their models. For example, Frederik et al. [
31] proposed a deep learning model based on YOLOv3 for distinguishing icebergs and ships. Markus et al. [
32] detected the ice on rotor blades. In this paper, the YOLOv5 model was improved to ensure that small ice targets can be detected. The hybrid dataset was constructed to verify the proposed model, and the results showed that the model had a good generalization ability.
Although this study successfully detected multi-scale polar objects, it still has some limitations. The lower detection accuracy of some categories on the local scale was due to the small amount of data. The datasets used can be expanded to increase the accuracy [
33]. This study focused on rectangular detection boxes; if more detailed sea ice information is needed, in the future, the ice images can be processed with instance segmentation [
34].