1. Introduction
The all-weather capability of infrared imaging offers extensive applications in ship management, including nighttime anomaly detection, illegal fishing surveillance, and smuggling monitoring. By capturing thermal radiation, infrared imaging acquires clear images without the need for visible light and can penetrate atmospheric obstacles such as smoke and haze. These advantages make it a superior solution compared to traditional visual monitoring technologies like SAR, ORS, and RGB cameras [1]. However, two major challenges remain unresolved in existing studies: the precise segmentation of ship targets from complex, low-contrast maritime infrared backgrounds under thermal noise interference, as illustrated in Figure 1, and the accurate prediction of temperature anomalies through quantitative analysis of the segmented thermal signatures.
Infrared image segmentation faces significant challenges due to inherent limitations such as low contrast, small target sizes, and blurred boundaries, particularly when distinguishing ships from complex backgrounds in nighttime thermal imagery with elevated noise levels [2,3,4,5,6,7,8,9]. Research in this domain has evolved from traditional methods to deep learning-based approaches. Conventional techniques, including threshold-based [10,11,12,13,14], clustering [15,16,17], and active contour segmentation [18,19], exhibit critical shortcomings: threshold methods fail to exploit spatial information despite enhancements like 2D maximum entropy [13]; clustering approaches like mean shift [15] and FCM [16] are sensitive to noise; and active contours struggle with dynamic backgrounds. Recent advances in convolutional neural networks (CNNs) have improved feature extraction through architectures like PSANet's adaptive attention masks [20], DeepLab-based multi-scale attention models [21], and IRDCLNet's fog interference mitigation [22]. Further advancements include Yu et al. [23], who enhanced segmentation accuracy for turbine blades by integrating layered depth-separable convolution blocks into U-Net, and Xiong et al. [24], whose multi-level correction network (MCNet) combines attention and edge enhancement modules to better utilize shallow and edge features. Specialized solutions for ship segmentation include Sun et al. [25], who proposed Global Mask R-CNN with Precise RoI Pooling and a Global Mask Head to preserve instance-level global information, and Bai et al. [26], who developed a fuzzy inference system integrating thresholding, region growing, and morphology for complete ship extraction. Li et al. [27] introduced a thermal imaging method combining grayscale morphology reconstruction, multi-feature saliency detection, and novel contour descriptors for small ship detection, while Xu et al. [28] demonstrated significant improvements in nighttime ocean observation with their CGSegNet model. Persistent issues in edge quality and small target segmentation call for specialized solutions, particularly given the inherently weak edge features in infrared imagery, where strengthening edge information improves segmentation accuracy [29,30,31]. Edge detection research combines traditional operators (Roberts [32], Prewitt [33], Sobel [34], Kirsch [35]) with deep learning innovations like DeepLabv1 [36], Gated-SCNN [37], and DeepLabv3 [38], which enhance boundary precision through dilated convolutions and shape learning. Building upon these foundations, our enhanced IERNet framework addresses these limitations through multi-modal temperature adaptation and hierarchical edge enhancement. By synergistically integrating the Sobel operator's gradient detection with deep feature perception, we establish a dual-strategy architecture optimized for infrared characteristics. As demonstrated in Figure 2, this approach overcomes existing methods' deficiencies in small target segmentation (Figure 2a) through an edge-assisted detection module that significantly improves boundary delineation (Figure 2b) while maintaining computational efficiency. The framework further incorporates innovations from multi-level attention networks [39] and thermal feature utilization [40], achieving robust performance across diverse observational modes through adaptive multi-scale fusion and edge-aware feature refinement.
Secondly, in acquiring brightness temperature data of ship compartments through infrared imaging, effectively detecting abnormal thermal variations emerges as a critical challenge for maritime safety monitoring, necessitating comprehensive evaluation of detection methodologies. Current approaches for abnormal temperature identification primarily encompass single-parameter methods, multi-parameter fusion techniques, and visual detection strategies. Single-parameter solutions such as thresholding [41,42], trend analysis, and slope analysis prioritize computational efficiency but exhibit limited accuracy. While threshold-based detection identifies anomalies through predefined thermal boundaries and trend analysis monitors temporal temperature patterns, these simple methods generally underperform compared to advanced alternatives. Recent machine learning advancements have introduced sophisticated prediction models, including Xu et al.'s Multi-LSTM Convolutional Neural Network (M-LCNN) for sea surface temperature forecasting [43], Gong et al.'s XGBoost-based ceramic thermal analysis [44], and Ding et al.'s XGBoost application in wildfire brightness threshold estimation [45]. Building upon these developments, our work employs XGBoost [46] to model ship cabin temperature transfer processes using infrared-derived thermal data, capitalizing on its generalization capacity, computational efficiency, and overfitting resistance for abnormal temperature prediction. Conversely, multi-parameter fusion methods enhance detection accuracy through comprehensive feature integration, as evidenced by Feng et al.'s extreme gradient boosting for sea temperature analysis [47], Qu et al.'s CNN-based ensemble learning for fire detection [48], and Deng et al.'s fuzzy neural network combining temperature, smoke, and CO concentrations [49]. Addressing challenges specific to ship compartment monitoring, particularly signal loss and false alarms arising from parameter complexity, our proposed framework extends these multi-feature fusion principles by developing a temperature-centric detection architecture optimized for maritime thermal anomaly identification in infrared imagery.
To address the above challenges, this paper makes the following contributions. We propose a novel ship segmentation framework designed specifically for infrared ship scenes that incorporates contour enhancement and multi-level feature fusion; it accurately predicts ship masks in infrared images and significantly improves segmentation precision, especially for small targets and ship edges. To address the scarcity of infrared data for ships on the water, we captured over 2000 images of ships in the Pearl River region and finely annotated 1000 of them, yielding more than 1000 target masks. To tackle edge blurring and low background contrast in infrared images, we developed an edge enhancement network that strengthens edge features, improving segmentation accuracy. For segmenting small maritime targets, we employed a multi-level feature fusion structure that retains shallow feature information, effectively achieving segmentation of small ships. Finally, we constructed a transverse temperature model for ships, extracting temperature information from infrared images to enable the prediction of ship temperatures.
The proposed Infrared Image Edge-Enhanced Segmentation Network (IERNet), detailed in Section 2, integrates an encoder–decoder architecture with an edge enhancement module and multi-level fusion strategies to address ship segmentation and temperature anomaly prediction. The model combines gradient perception and deep feature learning to refine boundaries while predicting both the spatial locations and developmental trends of thermal anomalies. Section 3 validates IERNet's performance through comparative experiments on the Pearl River dataset, benchmarking segmentation accuracy and temperature prediction robustness. Finally, Section 4 concludes with experimental insights and discusses future directions, including temporal thermal modeling and lightweight deployment for real-time maritime monitoring.
2. Method
2.1. IERNet Architecture
We propose an enhanced version of IERNet, a novel data fusion network consisting of two encoders and one decoder. The encoders are responsible for extracting features from input images, while the decoder restores the resolution. The two encoders extract thermal imaging data from two different observation modes.
Figure 3 illustrates the overall architecture of our IERNet. We use ResNet as the backbone of the encoders and incorporate an edge feature enhancement branch into the network. This branch emphasizes thermal imaging data by using pseudo-color images to highlight edge information. The pseudo-color effect is achieved by setting a temperature threshold to clearly identify high-temperature areas, which are then highlighted in red to distinguish them from the surrounding environment. Additionally, noise reduction is applied to the original data using bilateral filtering, and non-uniformity correction is performed to compensate for the response differences between pixels in the infrared detector. These operations significantly improve image quality and the accuracy of temperature information. The temperature values of pixels are mapped to grayscale values, and regions with temperatures exceeding the threshold are highlighted in red, ensuring more accurate processing of high-temperature areas. This method extracts richer edge features from pseudo-color data, which contain stronger edge information, and enhances spatial perception through the CBAM. These features are then merged with grayscale image feature maps. Furthermore, in the deep feature layers, edge features are extracted using the Sobel operator and concatenated with low-level feature maps from the encoder, further enhancing feature fusion. The Fusion Unit strategically combines these diverse feature sets using techniques like Depthwise Separable Convolution, improving computational efficiency and reducing model complexity. Finally, the fused feature output undergoes deeper semantic information extraction in the Atrous Spatial Pyramid Pooling (ASPP) module, which is then fused with shallow information. These operations enable the final output features to preserve spatial details while incorporating higher semantic information. The fused features are then passed to the decoder module, enhancing the segmentation performance of the image.
In comparison to the DeepLabv3+ algorithm, our segmentation method excels in both edge data acquisition and multi-scale data integration. By emphasizing edge features through the edge feature enhancement branch and utilizing the Sobel operator, our method captures finer edge details, significantly improving segmentation accuracy along object boundaries. This combination of edge enhancement and multi-level fusion is particularly relevant, as it allows for more precise edge detection while also leveraging multi-scale information for a comprehensive understanding of the scene. The integration of these two strategies provides a more holistic approach to segmentation, where edge details are preserved while maintaining the ability to handle objects at different scales. Additionally, our approach effectively utilizes the ASPP module to collect multi-scale information, enhancing performance in handling objects of various sizes and complex structures. These improvements not only make our method more precise but also highlight its novelty in addressing edge sensitivity and multi-scale data integration more effectively than existing algorithms like DeepLabv3+. The details of the encoder–decoder structure, CBAM, Fusion Unit, Depthwise Separable Convolution, Atrous Spatial Pyramid Pooling, and the loss function will be explained in the following sections. The specific workflow is as follows:
Figure 3. The IERNet framework for infrared ship image segmentation. The infrared ship image and its pseudo-color processed counterpart are input separately; the pseudo-color image is sent to the edge processing branch to extract edge features, which are integrated into the backbone network through the feature fusion modules to enhance the ship's edge information. The multi-layer perception network fuses the deep features, after processing by ASPP, with the shallow network features.
2.2. Encoder Architecture
In our image processing network, we utilize ResNet-50, leveraging its stronger feature extraction capability and lower complexity compared with VGG-19. Although ResNet-50 is much deeper than VGG-19 (50 weight layers versus 19), it has roughly one-fifth the parameters, making it the foundation of our feature extraction for the image fusion process. Our design adopts a layered approach, using ResNet-50's Res1 to Res4 stages to extract image features gradually, from simple to complex; this hierarchical feature extraction mechanism is vital for deeply understanding and processing images.
Initially, the Res1 layer primarily captures primary features such as edges and texture details, crucial for fundamental image structures. Upon entering the Res2 layer, the network introduces the CBAM, focusing attention both spatially and channel-wise, thus extracting more complex mid-level features. In tandem, the network employs two types of ResNet architectures within the encoder to handle standard thermal imaging and edge-enhanced thermal imaging, respectively.
For applications involving black-hot mode in infrared imaging, we adapt the input by modifying the number of channels in the initial convolutional block of the thermal encoder to accept only one channel. After this adaptation, the initial block is followed by a max-pooling layer and four residual layers. The network thus sequentially reduces the resolution while increasing the channel count, facilitating deep feature extraction.
Moving into Res3 and Res4, the network uses the CBAM to intensify focus on key regions and integrates previous and current layer features through element-wise fusion, enhancing spatial continuity and feature richness. The edge-processing module, applied post-ResNet initial block, employs CBAM in three residual layers to process the extracted features, which are then fused with corresponding layers in the main network to generate feature maps with enhanced edge details.
In the final stage, our Fusion Unit uses max-pooling to extract high-frequency information from edge-enhanced features, while average pooling is used to pull color block information from the main network features. These diverse feature layers are then combined, offering a fused feature map enriched with precise edge details and seamlessly blended features. This advanced feature integration ensures the reconstructed image has sharp edges, smoothed textures, and eliminated shadows—characteristics crucial for high-quality and functional image fusion, notably in sophisticated infrared imaging scenarios.
To better extract edge features, the CBAM [50] is utilized, as illustrated in Figure 4. The CBAM sequentially infers attention maps along the channel and spatial dimensions and then multiplies these attention maps with the input feature maps containing edge information for adaptive feature optimization. With a low parameter count, the additional time overhead incurred by adding this module is negligible. The module comprises two parts, channel attention $M_c$ and spatial attention $M_s$, which can be represented by the following formulas:

$$F' = M_c(F) \otimes F$$
$$F'' = M_s(F') \otimes F'$$

where $F$ represents the input feature maps, $M_c$ represents channel-based attention, $M_s$ represents spatial-based attention, $\otimes$ represents element-wise multiplication, $F'$ represents the feature maps calculated through channel attention, and $F''$ represents the feature maps calculated through spatial attention. The tensor sizes of the input and output features are equal.
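For reference, a minimal PyTorch sketch of this two-step attention is given below; the reduction ratio of 16 and the 7 × 7 spatial kernel follow the original CBAM design [50] and are assumptions here, as they are not specified in our text.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max-pooled descriptor
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)     # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, f):
        f = self.ca(f) * f   # F'  = Mc(F)  ⊗ F
        f = self.sa(f) * f   # F'' = Ms(F') ⊗ F'
        return f
```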
The Sobel operator, a commonly used edge detection operator, can effectively extract edge information from images. It consists of two 3 × 3 convolution kernels that are convolved with the image to extract edge information. The specific formulas are as follows:

$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} \ast A, \qquad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix} \ast A, \qquad G = \sqrt{G_x^2 + G_y^2}$$

where $A$ is the input image. The $G_x$ and $G_y$ kernels calculate the gradient values in the horizontal and vertical directions, respectively, and these gradient values are then combined to detect edges in the image. Before the input layer of the ResNet, a Sobel convolution layer is added to introduce the Sobel operator as an additional feature extractor into the network. The benefit of this approach is that it fully utilizes the Sobel operator to extract edge information from the image, thereby enhancing the network's ability to learn edge features. Additionally, in the edge processing encoder, the average pooling layer is replaced with a max-pooling layer to better extract edge information. This results in higher edge detection accuracy and stronger generalization ability.
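The fixed Sobel convolution layer placed before the ResNet input can be sketched in PyTorch as follows; the gradient-magnitude output and the single-channel (black-hot) input are assumptions consistent with Section 2.2.

```python
import torch
import torch.nn as nn

class SobelConv(nn.Module):
    """Fixed (non-trainable) Sobel layer prepended to the encoder input."""
    def __init__(self):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()                                   # vertical-gradient kernel
        kernels = torch.stack([gx, gy]).unsqueeze(1)  # shape (2, 1, 3, 3)
        self.conv = nn.Conv2d(1, 2, kernel_size=3, padding=1, bias=False)
        self.conv.weight = nn.Parameter(kernels, requires_grad=False)

    def forward(self, x):
        g = self.conv(x)  # channel 0: Gx, channel 1: Gy
        return torch.sqrt(g[:, :1] ** 2 + g[:, 1:] ** 2 + 1e-6)  # magnitude G
```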
To effectively utilize edge feature information, we constructed a Fusion Unit, as illustrated in Figure 5. The input consists of two parts: features from the edge extraction branch and those from the backbone network. The edge features are processed using a 2 × 2 Max-Pool operation, while the backbone features undergo 2 × 2 AvgPool. This choice of a 2 × 2 pooling kernel effectively reduces dimensionality while retaining crucial local features, enhancing computational efficiency and preventing overfitting. The pooled feature tensors are then concatenated to combine both average and maximum information, enriching the feature representation. This fused tensor is further refined through two 3 × 3 convolutional layers and one 1 × 1 convolutional layer, which extract and enhance high-level features. These operations transform the original feature representations into more enriched and discriminative forms, laying a solid foundation for subsequent tasks. Finally, the refined features pass through a sigmoid activation function and are element-wise multiplied with another MaxPooled result from the backbone features, introducing an attention mechanism that amplifies important features while suppressing less relevant ones. This mechanism enhances the model's representation and generalization capabilities.
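To make the data flow concrete, here is a minimal PyTorch sketch of the Fusion Unit as described above; the channel counts and ReLU placements are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class FusionUnit(nn.Module):
    """Sketch of the Fusion Unit in Figure 5 (channel sizes are assumptions)."""
    def __init__(self, channels):
        super().__init__()
        self.edge_pool = nn.MaxPool2d(2)   # keeps high-frequency edge cues
        self.body_pool = nn.AvgPool2d(2)   # keeps smooth "color block" cues
        self.gate_pool = nn.MaxPool2d(2)   # second pooled view of the backbone
        self.refine = nn.Sequential(       # two 3x3 convs + one 1x1 conv
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, edge_feat, body_feat):
        fused = torch.cat([self.edge_pool(edge_feat),
                           self.body_pool(body_feat)], dim=1)
        attn = torch.sigmoid(self.refine(fused))   # attention weights in (0, 1)
        return attn * self.gate_pool(body_feat)    # gate the backbone features
```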
2.3. Decoder Architecture
In the decoder, we extract low-level features from the intermediate layers of the backbone network ResNet [51]. These low-level features have higher spatial resolution and richer detail information. Then, through the ASPP module, we capture contextual information at different scales, generating features with different dilation rates. Next, the feature maps generated by the ASPP module are upsampled to align with the input image's dimensions. Subsequently, the low-level features and the upsampled high-level features are fused to combine fine-grained feature information with global context information. In the improved decoder, before fusing the encoder features with the shallow features from the backbone network, we first stack and fuse them with the fused features from the encoder's feature fusion branch and then fuse them with the shallow features. The original DeepLabv3+ [52] refines the final feature map, after fusion with the shallow features, through a 3 × 3 standard convolution. To reduce the model's parameter count, we replace the 3 × 3 standard convolution with a 3 × 3 depthwise separable convolution [53], as shown in Figure 6. Finally, the fused features are fed into the final classifier to classify each pixel, producing the final semantic segmentation result.
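A minimal sketch of the 3 × 3 depthwise separable convolution used in place of the standard convolution is shown below; for C_in input and C_out output channels it needs 9·C_in + C_in·C_out weights instead of 9·C_in·C_out, which is where the parameter saving comes from. The batch normalization and ReLU placement are assumptions.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv followed by a 1x1 pointwise conv (cf. Figure 6)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # groups=in_ch makes each kernel act on a single input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```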
The ASPP module, as shown in Figure 7, applies one 1 × 1 × 256 convolution and three 3 × 3 × 256 convolutions with dilation rates of 6, 12, and 18 to capture multi-scale information in parallel. By combining multi-scale features, it can classify objects of different sizes. The module performs global average pooling on the final feature map to obtain global image-level features, which helps mitigate the issue of weight vanishing when the dilation rate increases.
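A compact PyTorch sketch of this ASPP head follows; batch normalization and activations are omitted for brevity, and the input channel count is left as a parameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """1x1 conv, three dilated 3x3 convs (rates 6/12/18), and image pooling,
    each producing 256 channels, concatenated and projected back to 256."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6, bias=False),
            nn.Conv2d(in_ch, out_ch, 3, padding=12, dilation=12, bias=False),
            nn.Conv2d(in_ch, out_ch, 3, padding=18, dilation=18, bias=False),
        ])
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.project = nn.Conv2d(5 * out_ch, out_ch, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        # broadcast the global image-level feature back to the full map
        gp = F.interpolate(self.image_pool(x), size=(h, w),
                           mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [gp], dim=1))
```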
In the decoder, two concat operations are performed to effectively fuse features from different scales and levels. The first concat operation merges the high-level semantic features obtained from the ASPP module with the low-level details from earlier network layers. This fusion allows the model to combine the context of the image with fine-grained spatial details, improving segmentation accuracy, especially for small objects and object boundaries. The second concat operation combines the multi-scale features extracted from the ASPP module at different dilation rates with the fused low-level features. This enables the model to handle various object sizes and more complex structures, thereby enhancing its ability to segment objects in a diverse range of contexts.
Through the edge extraction branch, we obtained rich edge information and employed an edge-feature-weighted Dice Loss [54], based on a statistical metric used to measure the similarity between two sets. The advantage of Dice Loss lies in its robustness to class imbalance, making it particularly suitable for segmentation tasks where there is a significant difference in the number of positive and negative samples; this ensures that the model places more emphasis on the segmentation of small classes. The Dice coefficient, a similarity measure used to calculate the similarity between two samples, is defined as follows:

$$\mathrm{Dice} = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

where $|X \cap Y|$ represents the intersection between $X$ and $Y$, and $|X|$ and $|Y|$ represent the number of elements in $X$ and $Y$, respectively. The coefficient 2 in the numerator accounts for the double counting of the overlap between $X$ and $Y$ in the denominator. In semantic segmentation tasks, $X$ represents the ground truth segmentation image and $Y$ represents the predicted segmentation image.
The Dice coefficient calculation approximates $|X \cap Y|$ as the element-wise product between the predicted and ground truth segmentation images, summed over all elements. The calculation proceeds as follows:
- (1) Take the element-wise product of the predicted segmentation map (Pred) and the ground truth segmentation map (GT) to approximate the intersection:
$$X \cap Y \approx \mathrm{Pred} \odot \mathrm{GT}$$
- (2) Sum the elements of the resulting matrix to obtain the scalar intersection value:
$$|X \cap Y| \approx \sum_{i,j}\left(\mathrm{Pred} \odot \mathrm{GT}\right)_{ij}$$
For a binary classification problem, where the ground truth segmentation map (GT) contains only values of 0 and 1, all pixels in the predicted segmentation map (Pred) that are not activated in the ground truth are effectively set to zero. For activated pixels, the focus is primarily on penalizing low-confidence predictions, with higher predicted values receiving better Dice coefficients.
Due to the blurred nature of edge features in infrared images, an edge weighting function is introduced to enhance the importance of edge pixels. The edge datum E, obtained by combining the edge features processed through the Sobel operator and filtering out high-frequency data interference from water bodies and background regions in the ground truth (GT) target area, represents the intensity of edge pixels. These edge features capture the most critical information regarding the boundaries of objects, which are vital for accurate segmentation. By focusing on the transitions between different regions, edges provide key structural cues that help the model distinguish between the object and background. The edge datum E is then normalized to the range [0, 1], ensuring that it fits within a standard scale for further processing.
The incorporation of edge information enhances segmentation by emphasizing finer details, especially at the boundaries of the target objects. This is particularly useful in scenarios where the contrast between the object and the background is low, or the edges are blurred. Edge detection helps the model focus on the critical boundaries that define the object, leading to more precise segmentation, particularly for small or intricate structures.
To improve the accuracy of edge segmentation, a weighted Dice coefficient is used, calculated with the following formula:

$$\mathrm{Dice}_{w} = \frac{2\sum_{i,j}\left(1 + E_{ij}\right) G_{ij} P_{ij} + \varepsilon}{\sum_{i,j}\left(1 + E_{ij}\right)\left(G_{ij} + P_{ij}\right) + \varepsilon}$$

where $E_{ij}$ represents the value of the edge data in the $i$-th row and $j$-th column, with a range of $[0, 1]$; $G_{ij}$ and $P_{ij}$ represent the corresponding values in the GT and prediction segmentation matrices, respectively; and $\varepsilon$ is a smoothing term to avoid a zero denominator.
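A PyTorch sketch of this edge-weighted Dice loss follows; the per-pixel weight form $w = 1 + E$ and the use of sigmoid probabilities for Pred are assumptions consistent with the formula above.

```python
import torch

def edge_weighted_dice_loss(pred, gt, edge, eps=1.0):
    """Soft Dice loss with per-pixel edge weights w = 1 + E.

    pred: predicted probabilities in [0, 1] (e.g., after sigmoid)
    gt:   binary ground truth mask
    edge: Sobel-derived edge map normalized to [0, 1]
    """
    w = 1.0 + edge
    inter = (w * pred * gt).sum()          # weighted intersection
    denom = (w * (pred + gt)).sum()        # weighted |X| + |Y|
    return 1.0 - (2.0 * inter + eps) / (denom + eps)
```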
2.4. Construction of a Ship Cabin Abnormal Temperature Prediction Model Based on GBDT
In this study, we utilize XGBoost as the regression model. The model’s objective function incorporates regularization terms that help control overfitting by penalizing overly complex models. Additionally, XGBoost employs a second-order Taylor expansion of the loss function to better approximate the objective during training. From a computational perspective, the algorithm is highly efficient, utilizing a multi-threaded, parallel processing strategy to compute feature values. This speeds up the training process and reduces computational costs. By integrating key temperature measurements from various points on the ship’s body with environmental factors into composite variables, we aim to predict abnormal temperature variations within the ship’s cabin. This approach enables the construction of a predictive model specifically designed to estimate the transverse cabin temperatures. The objective function is as follows:
$$\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{t=1}^{K} \Omega\left(f_t\right)$$

where $n$ represents the number of training samples, $y_i$ represents the observed values, $\hat{y}_i$ represents the values simulated by the model, $K$ represents the number of decision trees, and $f_t$ represents the $t$-th tree model.
After Taylor expansion derivation, defining tree complexity, and determining leaf node grouping, the final objective function can be written as

$$\mathrm{Obj}^{(t)} = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T$$

where $T$ is the number of leaf nodes, $G_j$ and $H_j$ are the sums of the first- and second-order gradients of the samples assigned to leaf $j$, and $\gamma$ and $\lambda$ are regularization coefficients.
After each iteration, the XGBoost algorithm assigns a learning rate to the leaf nodes to reduce the weight of each tree, diminishing the influence of each tree and providing better learning space for subsequent iterations.
3. Experiment
The software environment for this experiment includes the Win10 operating system and PyTorch (https://pytorch.org/), while the hardware environment consists of an AMD R7-5900X CPU and an NVIDIA GeForce RTX 3090Ti GPU. To verify the effectiveness of our method, we conducted extensive comparative experiments on the infrared ship thermal dataset and performed visual analysis of the experimental results. Additionally, to validate the effectiveness of the proposed modules, we present relevant ablation experiments with detailed analysis for each. When evaluating the performance of semantic segmentation algorithms, we use the most common metrics, mean Intersection over Union (mIoU) and mean Pixel Accuracy (mPA), as evaluation metrics.
mIoU is calculated using a confusion matrix, as shown in Table 1, where TP represents the intersection of the predicted and ground truth regions, FN the ground truth region without overlap, FP the predicted region without overlap, and TN the region belonging to neither. mIoU calculates the accuracy of segmented images by measuring the overlap between predicted values and ground truth values, averaged over all classes. Its calculation method is as follows:

$$mIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{TP_i}{TP_i + FP_i + FN_i}$$

where $k+1$ is the number of classes, including the background. mPA represents the average percentage of correctly classified pixels per class. Its calculation method is as follows:

$$mPA = \frac{1}{k+1}\sum_{i=0}^{k}\frac{TP_i}{TP_i + FN_i}$$
In evaluating regression algorithms, we use the Mean Squared Error (MSE) as the evaluation metric. Its calculation method is as follows:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(T_i - \hat{T}_i\right)^2$$

where $n$ is the number of longitudinal sampling temperatures of the ship, $T_i$ is the temperature of the $i$-th compartment in the sampled temperatures, and $\hat{T}_i$ is the predicted temperature of the $i$-th compartment.
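All three metrics can be computed directly from integer label maps; the NumPy sketch below mirrors the definitions above (function and variable names are illustrative).

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes=2):
    """pred, gt: integer label maps. Rows: ground truth, columns: prediction."""
    idx = num_classes * gt.reshape(-1) + pred.reshape(-1)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                num_classes)

def miou_mpa(cm):
    tp = np.diag(cm)
    iou = tp / (cm.sum(0) + cm.sum(1) - tp)  # TP / (TP + FP + FN) per class
    pa = tp / cm.sum(1)                      # TP / (TP + FN) per class
    return iou.mean(), pa.mean()

def mse(t_true, t_pred):
    return np.mean((np.asarray(t_true) - np.asarray(t_pred)) ** 2)
```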
3.1. The Thermal Ship Dataset
The thermal images in our dataset were collected from the Pearl River waterway scene. By capturing nighttime river scenes, we created a dataset of thermal imaging data for ships navigating river channels at long distances, specifically within the range of 800 to 1500 m. This allowed us to focus on ships at a considerable distance, ensuring that the dataset reflects real-world maritime navigation scenarios. We selected 1000 of the 2000 captured thermal images for fine annotation, using the popular public image annotation tool LabelMe for semantic segmentation annotation of the original data. The image resolution is 640 × 512, and the dataset contains 800 images for training and 200 images for testing. The data include two classes, background and ship, as shown in Figure 8.
Because the distances to ships vary, apparent target resolution differs across the dataset, and distant ships can have very low resolution. To ensure a balanced distribution of images at different distances, we augmented the data through techniques such as cropping, rotation, and scaling to simulate different perspectives and distances; a sketch of one possible pipeline follows. This helps the model generalize better, making it more robust to changes in ship size and environmental conditions. In addition, we conducted careful quality control during the annotation process to avoid inconsistencies and to ensure accurate segmentation boundaries for the ships in each image.
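As an illustration, a torchvision pipeline along these lines could implement the cropping, rotation, and scaling described above; the specific ranges are placeholders rather than the values used in our experiments, and for segmentation the identical geometric transform must also be applied to the mask.

```python
import torchvision.transforms as T

# Placeholder augmentation pipeline for the 640x512 thermal frames.
augment = T.Compose([
    T.RandomResizedCrop((512, 640), scale=(0.6, 1.0)),  # simulate distance changes
    T.RandomRotation(degrees=10),                       # simulate viewpoint changes
    T.RandomHorizontalFlip(p=0.5),
])
```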
3.2. The XGBoost Dataset
The normal ship data were obtained by processing thermal infrared images of ships captured in river channels, while the abnormal temperature data were obtained by simulating abnormal temperature occurrences using experimental small boats in a water pool, as shown in the figure. Because ship morphology determines the main trend of temperature change in the lateral direction, we processed the thermal imaging temperature data of ships through lateral sampling. Considering the influence of water temperature on ships, we filtered out abnormal temperature values and used the average temperature as the sampled temperature of each compartment to reflect the heat exchange between the ship and the water. The temperature sampling method is shown in Figure 9. We take the temperatures at the bow and stern of the ship as the boundary temperatures and the lowest compartment temperature as the boundary temperature of the water body. Given that normal cargo ships usually have only the engine as a heat source, we assume that the engine is the only heat source and take the compartment with the highest sampled temperature as the engine position and heat source temperature. In addition, we label the distances from each compartment to the engine compartment and to the ship's hull to further analyze the relationship between heat distribution and ship structure.
The samples of abnormal ship cabin temperatures were created using experimental small boats. We added a high-temperature heat source to the rear part of the small boat to simulate the temperature of the engine compartment. Additionally, we placed a heat source and a cold source at the bow of the boat to mimic an abnormal ship cabin, as shown in Figure 10. The abnormal samples after sampling are depicted in Figure 11.
Ship body temperature is closely related to engine temperature, water temperature, and ambient temperature. Therefore, we chose the heat transfer distance, left edge temperature, right edge temperature, left edge temperature of the heat zone, right edge temperature of the heat zone, water edge temperature, ambient temperature, heat zone temperature, and the ratio of heat zone temperature to distance as the input variables for the model. Initially, the processed data contained a large number of feature variables, some of which may have low correlation with the target, meaning that training on them may not achieve the expected predictive performance. Therefore, we first selected feature variables with good correlation with temperature change by calculating correlation coefficients, as shown in the heatmap in Figure 12. From the heatmap, it can be observed that the adjacent temperatures, heat transfer distance, distance from heat zone edge point 3, distance from heat zone edge point 2, and boundary temperature have relatively large correlations; these elements were therefore chosen as training features.
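This correlation-based screening can be reproduced with pandas as sketched below; the file name, column names, and the retention threshold are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical file and column names; one row per sampled compartment.
df = pd.read_csv("ship_lateral_temperature.csv")

# Absolute correlation of every candidate feature with the target.
corr = df.corr(numeric_only=True)["cabin_temperature"].abs()

# Keep features above a chosen threshold (0.3 is a placeholder value).
selected = corr[corr >= 0.3].index.drop("cabin_temperature")
df_train = df[selected.tolist() + ["cabin_temperature"]]
```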
3.3. Semantic Segmentation Experiment
We compared the segmentation performance of our algorithm with several state-of-the-art methods, including FCN-32s, FCN-16s, DFN, PSPNet, PSANet, DeepLabv3, and DeepLabv3+, to demonstrate its improvement. Experimental results on the infrared ship dataset show that our algorithm performs better and is more suitable for segmenting infrared ship images.
As shown in Table 2, our algorithm achieves a segmentation performance of 89.17% mIoU. Compared to state-of-the-art algorithms such as DeepLabv3+, PSPNet, and DANet, our algorithm improves segmentation performance by 6.38%, 13.21%, and 9.16%, respectively. Our overall segmentation accuracy is also higher than that of DeepLabv3 and DenseASPP.
3.4. Visualization Analysis
To further validate the effectiveness of our algorithm, we tested the segmentation performance of the trained model on the ship dataset and visualized some of the segmentation results, as shown in Figure 13. The image segmentation results of our method are superior to those of DeepLabv3+, Pyramid Scene Parsing Network (PSPNet), and FCN. Compared to DeepLabv3+, our method better preserves detail such as small objects and edges, segmenting both more clearly. For example, in the first row of images in the figure, the segmentation of the ship's edge and of the waterline has improved. FCN's results, by contrast, are too coarse: large objects like ships are roughly segmented at close range, and pixel labels inside object contours are confused, resulting in a loss of detail for most objects and leading to internal discontinuities.
For the images in the second and third rows, compared to DeepLabv3+, the segmentation of each category is more refined, and our proposed method also performs better on details such as edges and small objects. For the images in the fourth and fifth rows, FCN and PSPNet still only coarsely segment objects of each category. The segmentation results of the DeepLabv3+ algorithm show that some pixel categories are confused for objects such as ships near the dock. Our method not only enhances the localization of boundary pixels for these targets but also alleviates the discontinuity and label confusion problems of large targets.
3.5. Visualization Results on the Features of Each Component of Our Algorithm
We also visualized the features of each module of the algorithm, including the Sobel module and the Fusion Unit. As shown in Figure 14, the Sobel edge detection module extracts edge information along both the X-axis and Y-axis directions, which are then fused together. The lateral information of the ship is the most abundant, and the extracted information alleviates the problem of missing edges in infrared images.
The Fusion Unit integrates edge information, enriching the high-frequency details contained in the feature maps. As shown in Figure 15, taking the ship in the first row as an example, after passing through the feature fusion module, the edge features between the ship and the water body become clearer. Additionally, edge features within the ship and some edge information from heat sources are better extracted. Moreover, features of small target ships can also be extracted. For example, in the fourth row of the image, by fusing shallow-level feature information with edge information, we obtain their contour features as well as some high-frequency features on the ship.
3.6. Ablation Experiments
3.6.1. Enhancement on the Backbone Network
In the ablation experiments, we chose the FCN as the baseline and used ResNet50 and ResNet101 as backbone networks to demonstrate the effectiveness of the proposed improvements. As shown in Table 3, when using ResNet50 as the backbone network, the segmentation performance of the original FCN was 56.42% mIoU. By adding the ASPP module to FCN, the segmentation performance improved to 76.59% mIoU. When incorporating the Fusion Unit and CBAM to fuse edge data, the accuracy increased to 83.15% mIoU. Furthermore, adding ASPP on top of this further improved the accuracy to 86.48% mIoU. With ResNet101 as the backbone network, the final segmentation accuracy increased by 0.81% mIoU compared to ResNet50, indicating that the improvement in accuracy does not solely come from increased network depth. This demonstrates that the edge detection branch can obtain more discriminative features to improve semantic segmentation accuracy.
3.6.2. Ablation Experiments on the Effect of Different Loss Functions
The baseline models include FCN and DeepLabv3+, and the loss functions include Cross-Entropy Loss, Focal Loss, and Dice Loss with edge weights. The experimental results are shown in Table 4. The performance of Focal Loss and the edge-weighted Dice Loss on FCN and DeepLabv3+ is slightly better than that of Cross-Entropy Loss, as the edges of infrared ship targets are weak, the targets are small, and the noise is high. Strengthening the supervision of edge pixels through the loss function can thus enhance segmentation performance.
3.7. Ship Temperature Regression Experiment
The predictive performance of machine learning models depends mainly on data quality and model parameter tuning. Therefore, to further improve prediction effectiveness, this study fine-tunes the ship temperature prediction model based on the XGBoost algorithm. To train the XGBoost model, we first prepare the dataset by splitting it into training and testing sets. Then, we initialize the XGBoost model and fit it to the training data, feeding the features into the model and allowing it to learn the underlying patterns. We apply the model iteratively, adjusting the parameters based on performance metrics to improve prediction accuracy.
Five key parameters significantly impact the predictive performance of the XGBoost model. Among them, n_estimators determines the number of weak learners in the ensemble; a higher value increases the model's learning capacity but also raises the risk of overfitting. max_depth controls the maximum depth of the trees; a higher value leads to a more complex model and likewise increases the risk of overfitting. The subsample parameter controls the proportion of randomly selected data for training, while the learning_rate parameter regulates the iteration step size; both help prevent overfitting. The gamma parameter sets the minimum loss reduction required for node splitting, reducing model complexity and accelerating algorithm convergence. We adjusted the key parameters of the XGBoost algorithm through multiple rounds of testing, using cross-validation results as the criteria for parameter selection. Cross-validation evaluates the model's performance by training it on different subsets of the data, ensuring that the model generalizes well to unseen data. Random search or grid search techniques are applied to optimize these hyperparameters, systematically testing different combinations to find the most effective configuration. The final optimal values for the five key parameters are shown in Table 5.
First, the ship’s lateral temperature dataset is randomly split into training and testing sets in an 80%:20% ratio. Then, based on the XGBoost algorithm, the model is trained on the training set to construct a regression prediction model for ship temperature. Finally, the model’s predictive ability is validated using the testing set. This study utilizes the Python programming language environment (Python v3.9) to build the XGBoost model, primarily based on the scikit-learn library. Following the prediction algorithm workflow, the preprocessed dataset is imported into the XGB model. After finding the optimal parameters, the model is trained to obtain the prediction results.
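This workflow can be sketched with the scikit-learn API of XGBoost as follows; the file and column names and the candidate parameter grids are placeholders rather than the tuned values reported in Table 5.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBRegressor

# Hypothetical file/column names for the preprocessed lateral temperature data.
df = pd.read_csv("ship_lateral_temperature.csv")
X, y = df.drop(columns=["cabin_temperature"]), df["cabin_temperature"]

# 80%/20% train-test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Grid search over the five key parameters with cross-validation.
search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid={
        "n_estimators": [100, 300, 500],
        "max_depth": [3, 5, 7],
        "subsample": [0.7, 0.9, 1.0],
        "learning_rate": [0.05, 0.1, 0.2],
        "gamma": [0.0, 0.1, 0.3],
    },
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)
print("test MSE:", -search.score(X_test, y_test))
```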
Table 6 shows that the model achieves a high goodness of fit and small error values on both the training and testing sets, demonstrating good performance in predicting ship temperature. Regarding goodness of fit, the XGBoost model achieves over 70%. In terms of error, the RMSE values for both sets are 3.472, indicating minor errors. Statistical analysis reveals that the proportion of ship temperature differences predicted by the XGBoost model exceeding 2 °C is less than 0.020%. Additionally, the R² and MAE values further demonstrate the model's robustness and accuracy in both the training and testing phases.
As shown in Figure 16, the longitudinal fitting curves of the ship are depicted. In (a), the two sets represent simulated curves of normal propagation, exhibiting a high goodness of fit and minimal fluctuation. In (b), the first sample simulates the appearance of an abnormal heat source in the midsection of the ship. According to the model prediction, there is a temperature difference of over 100 °C between the predicted and actual temperatures at sampling positions 49 to 56. This discrepancy indicates that the temperature in this area does not comply with the ship temperature model, which assumes the engine as the single heat source, thus flagging an abnormal temperature range. In the second sample of (b), a simulated abnormal cold source appears in the midsection of the ship. The model predicts an abnormal temperature decrease over sampling positions 19–73, which deviates from the ship temperature model. Consequently, it is determined that an abnormal cold zone exists in the 19–73 range, affecting the transfer of heat from the engine compartment.
4. Conclusions
In this paper, we propose a multi-feature fusion encoder–decoder semantic segmentation network for infrared ship images. Considering the blurred edges and sparse target features of infrared images, we introduce an edge feature extraction network to enhance the edge information of the images. At the same time, we design a multi-scale feature fusion module to retain more shallow features. To extract sufficient features, we use ResNet50 as the base network and incorporate the ASPP module to detect ships at multiple scales. Since ship targets are usually small and occupy a small proportion of the image, with background pixels in the majority, we use the data from the edge extraction network as weighting parameters; by adding edge weights to the Dice Loss function, we achieve segmentation of small ship targets in infrared images. In the decoder, we replace standard convolutions with depthwise separable convolutions, which reduces the number of parameters while maintaining detection accuracy. Due to the scarcity of infrared ship segmentation datasets, we collected and annotated a large number of infrared images of ships in waterways at night. Experimental results demonstrate that our method significantly outperforms DeepLabv3+ in edge segmentation, achieving an mIoU of 89.17%. Compared to state-of-the-art algorithms such as DeepLabv3+, PSPNet, and DANet, our method improves segmentation performance by 6.38%, 13.21%, and 9.16%, respectively.
The prediction of cabin temperature is based on brightness temperature information, assuming the ship as a single heat source. We establish a transverse temperature regression model for ships based on extreme gradient boosting trees. By sampling the transverse temperature from the collected infrared images of ships, we obtain the transverse temperature distribution gradient. Using the engine room, water body, ship boundary brightness temperature, and temperature transfer distance as elements, we predict the overall temperature of the ship. Experimental results indicate that the XGBoost model demonstrates a high degree of fit, with goodness of fit exceeding 70% and an RMSE value of 3.472, reflecting minimal errors in both the training and testing sets. Statistical analysis also shows that less than 0.020% of the predicted ship temperature differences exceed 2 °C, confirming the model’s reliability and accuracy in predicting cabin temperatures.
Looking ahead, several potential directions can be explored to enhance this system. One promising avenue is to incorporate attention mechanisms, particularly the spatial attention network, to focus on small or distant ships more effectively, especially in scenarios with complex backgrounds or low visibility. This could allow for better differentiation between ships and surrounding objects by prioritizing the most relevant regions of the image. Another improvement would be the integration of temporal data through a recurrent neural network (RNN) or long short-term memory (LSTM) network, which could capture dynamic changes over time and improve the robustness of segmentation in real-world maritime scenarios, where ships may move across frames or appear intermittently. Moreover, while the model primarily focuses on infrared images, a valuable next step would be to implement multi-modal fusion techniques, combining infrared, visible light, and radar imagery to create a robust detection system that operates effectively in various lighting and weather conditions. By fusing these different data types, the system could enhance its resilience to environmental challenges such as fog, rain, or nighttime conditions, where infrared images alone might be insufficient. In terms of cabin temperature prediction, future work could incorporate a more sophisticated model that factors in environmental variables such as ambient weather conditions, wind speed, and water temperature, which influence the ship’s temperature dynamics. This could improve the accuracy of the temperature prediction, especially under changing weather conditions. Additionally, integrating sensor-based data, such as from on-board temperature sensors, could allow for real-time adjustments to predictions, making the model more adaptable and responsive. Lastly, integrating this system into a real-time operational platform would allow for continuous, automatic ship temperature monitoring, enabling operators to detect temperature anomalies and perform predictive maintenance, ensuring operational safety and proactively identifying potential issues before they affect the vessel’s performance.