Article

Maritime Electro-Optical Image Object Matching Based on Improved YOLOv9

1 School of Physics and Electronic Information, Yantai University, Yantai 264005, China
2 Information Fusion Institute, Naval Aviation University, Yantai 264001, China
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(14), 2774; https://doi.org/10.3390/electronics13142774
Submission received: 5 June 2024 / Revised: 8 July 2024 / Accepted: 12 July 2024 / Published: 15 July 2024
(This article belongs to the Section Computer Science & Engineering)

Abstract

The offshore environment is complex during automatic target annotation at sea, and the difference between the focal lengths of visible and infrared sensors is large, which makes matching multitarget electro-optical images at sea difficult. This study proposes a target-matching method for visible and infrared maritime images based on decision-level topological relations. First, YOLOv9 is used to detect targets. To obtain sufficiently accurate target positions for establishing reliable topological relations, the YOLOv9 model is improved to address its poor accuracy on small targets, high computational complexity, and difficulty of deployment. To improve the detection accuracy of small targets, an additional small target detection head is added to detect shallow feature maps. To reduce the network size and enable lightweight deployment, the Conv module in the model is replaced with DWConv, and the RepNCSPELAN4 module in the backbone network is replaced with the C3Ghost module. These replacements significantly reduce the number of parameters and the computation of the model while retaining the feature extraction capability of the backbone network. Experimental results on the electro-optical dataset show that the proposed method improves detection accuracy by 8%, while the computation and the number of parameters of the model are reduced by 5.7% and 44.1%, respectively. Lastly, topological relationships are established for the detected targets, and targets in visible and infrared images are matched based on topological similarity.

1. Introduction

Maritime target identification plays an important role in promoting maritime trade, maintaining maritime transport, and ensuring national defense security. The offshore zone usually has dense routes and numerous vessels, poses a high security risk, and requires the monitoring and identification of cargo ships, fishing vessels, ferries, and other vessels in the sea area [1]. Traditional maritime target detection relies on radar, AIS (Automatic Identification System), visible imagery, infrared imagery, and other technologies. However, a single sensor is susceptible to harsh environments and thus has difficulty providing a comprehensive and detailed description of the scene; the integrated use of multiple sensors has become the key to solving this problem [2]. Visible light imagery presents extensive target details and surroundings by capturing light in the visible range, whereas infrared imagery can provide insight into the target by detecting the temperature difference between the target and its surroundings when light is insufficient [3]. How to make comprehensive use of visible and infrared images to obtain more accurate and richer target scene information is therefore an important issue for automatic target annotation at sea. Pixel- and feature-level image-matching methods have received considerable attention. Zhu et al. proposed a robust and accurate infrared and visible image alignment method using a concentric-circle-based feature description algorithm to enhance the descriptiveness and rotational invariance of the feature points. They also proposed a multilevel feature matching algorithm to improve the matching accuracy, with significant advantages in feature point localization accuracy and correct matching rate [4]. Zhang et al. proposed an efficient pixel-based fusion network for infrared and visible image fusion. The network adaptively learns pixel-by-pixel weights and combines them with a detection model to achieve joint optimization. The method outperforms existing methods in quality and speed, especially on the Jetson Xavier NX, making it well suited for embedded systems because only 27 ms is needed to fuse images at a resolution of 512 × 512 [5]. However, when the sensor focal lengths differ greatly, matching against the reference image can over-scale the aligned image, thereby reducing the similarity of pixel values or features. Conventional matching therefore cannot establish the correspondence of targets between visible and infrared images. In this case, considering that the positional relationship between the targets in the visible and infrared images does not change, a decision-level matching method can fully exploit the spatial information between targets without comparing the entire image, while achieving high accuracy and computational efficiency.
Decision-level matching methods first need to complete image target recognition. Given that traditional target detection is susceptible to the complexity of the shore background, mutual occlusion between ships, changes in attitude angle, and other factors, separating a ship from its complex background is difficult. Target detection methods based on deep learning have therefore become a popular research topic. To improve detection performance, most researchers start from the perspectives of reducing the number of parameters, improving detection accuracy, and accelerating image processing. Yao et al. proposed a multidimensional information fusion network (MIFNet) to improve the detection accuracy of infrared small targets in sea defense early warning and maritime border reconnaissance. MIFNet improves the detection accuracy of small targets through an attention mechanism that fuses semantic, detail, and edge information, making it especially suitable for maritime target detection [6]. Zhang et al. proposed an enhanced YOLOv7-tiny ship detection algorithm to cope with missed detection and false detection in multitarget situations. The main improvements include introducing a convolutional block attention module to enhance feature extraction, replacing the standard convolution with GSConv convolution to reduce the computational load, using lightweight operators to reduce the loss of feature information during upsampling, and using the SIoU loss function to improve training accuracy [7]. Wang et al. proposed an improved YOLOv5 method to increase multiship target detection accuracy. By optimizing the input size of SAR images and the anchor frame settings and by introducing asymmetric pyramid-shaped non-local blocks and the SimAM attention mechanism, ship target detection performance is enhanced and the model parameters are reduced. The method achieves 91.3% and 95.8% detection performance on the high-resolution SAR image dataset and the SAR ship detection dataset, respectively, outperforming existing methods in specific offshore scenarios, and its effectiveness is validated on the AIR-SARShip-1 dataset [8]. Zhao et al. proposed E2YOLOX-VFL to address the multiscale challenges of ship detection and identification against complex ocean and land backgrounds. The method integrates an efficient channel attention module and an effective force-IoU (EFIoU) loss function, improves the confidence loss function, and proposes Balanced Gaussian NMS (BG-NMS) [9]. Zhang et al. proposed a time-prior-based stacked ensemble deep learning model (TPSM), which integrates the features of multiple base models and uses a meta-model to inherit these features, thereby obtaining complementary advantages. To mitigate the effect of complex spatial distributions, the training set is divided by day, night, and mixed time attributes, and the ensemble framework is constructed by training and selecting base models from samples with different time attributes. This approach improves the generalization ability of the models in different environments and makes full use of temporal prior information, rendering the models more adaptable to changing maritime scenarios [10]. Nithya et al. proposed an effective algorithm, MSOD-PT, by combining YOLOv8 and DeepSORT, thereby achieving the localization and tracking of small targets in maritime surveillance; priority ranking further improves situational awareness and surveillance capability for maritime operations [11].
Although the preceding methods improve on the general algorithms to some degree, two key problems remain: (1) the network size is large, making deployment difficult on lightweight mobile devices with low computing power; (2) the networks lack a sufficient detection and recognition rate for small targets against complex backgrounds. Typically, reducing model computation degrades the model's feature extraction ability, and accelerating image processing degrades detection accuracy and recall [12]. Therefore, improving detection accuracy while keeping the model lightweight has become an important consideration for model improvement.
This study proposes establishing decision-level topological relations to match targets across electro-optical images in multitarget scenes. First, YOLOv9 is improved to identify targets and obtain their specific locations. Thereafter, the topological structure is used to establish the correspondence between the same target in infrared and visible images. The main innovations can be summarized as follows.
  • Improvements to YOLOv9
    (1)
    Adds a P2 detection head to improve the detection accuracy of small targets and alleviate the poor fusion effect of the ordinary feature fusion layer.
    (2)
    Replaces the RepNCSPELAN4 module in the backbone network with the C3Ghost module, which extracts more refined features at a lower cost, improving the inference speed of the network model while reducing the parameter volume and computational complexity.
    (3)
    Replaces the standard convolution with the lightweight DWConv module to reduce the model's computation and parameter count, thereby enabling lightweight terminal deployment.
  • Decision-level target matching
    (4)
    Proposes a decision-level target correspondence method based on a topological structure, effectively solving the problem of difficult target matching caused by the large difference in focal length between visible and infrared images.

2. YOLOv9 Algorithm and Improvement

2.1. YOLOv9 Network and Improvement

YOLO has gone through continuous iterations. In particular, the YOLOv9 model, released on 21 February 2024, has made significant advances in performance and localization accuracy. The model integrates programmable gradient information (PGI) and a generalized efficient layer aggregation network (GELAN), focusing on the information loss that may occur in the input data during feed-forward processing. Moreover, the new network structure allows different computational modules to be used, thereby improving the flexibility of the neural network.
As a state-of-the-art target detection algorithm, YOLOv9 offers high efficiency and accuracy. However, it performs worse on small targets than on large ones and is prone to misdetection and missed detection in complex scenarios. In addition, the model requires substantial computational resources, which makes it unsuitable for deployment on mobile devices. The current study improves the model as shown in Figure 1; the modules highlighted in bold in the figure are the improved modules. The main improvements are the replacement of the original Conv module with DWConv, the replacement of the RepNCSPELAN4 module in the backbone with C3Ghost, and the addition of the P2 detection head, which is labeled "improved" in the figure.

2.2. Small Target Detection Head

Maritime electro-optical image datasets contain numerous small targets, such as buoys and fishing boats. In infrared images with small focal lengths, distant targets are often difficult to photograph clearly, resulting in an extreme reduction in target size and resolution. In a convolutional neural network, generalized and abstract feature information is enriched as the network deepens. However, because of repeated downsampling and other operations, the deep layers extract semantic information, such as the target category, target features, and the relationship between the target and the surrounding environment, at a considerably lower resolution, and some location-related details, specific shapes, and other information may be lost. Moreover, a small target itself contains relatively few features, so the location and detail information of many small targets may be lost. This is especially the case in infrared images with a small focal length, which rely more heavily on location information, further exacerbating the difficulty of small target detection [13,14].
The YOLOv9 model contains three detection heads that fuse the P3, P4, and P5 feature layers. However, for small targets, the original detection heads often have difficulty determining locations and features accurately because the representations of such targets in images are usually blurred and small. For this reason, this study introduces an additional detection head on top of the original model, which locates small targets accurately by detecting them on the shallower feature layer P2 and making full use of its strong positional information. The improved detection layer is shown in Figure 2. Although this improvement increases the amount of model computation, additional small target features can be obtained, thereby improving the detection of small targets, such as buoys and fishing boats, and especially improving the detection effect on infrared images.
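As a quick, illustrative aside (not taken from the paper), the following sketch prints the feature-map resolution of each detection level for a 640 × 640 input; a target that spans only a few pixels still occupies several grid cells at stride 4 but can collapse into a fraction of a single cell at stride 32, which is the intuition behind adding the P2 head.

```python
# Feature-map resolution per detection level for a 640 x 640 input.
# P2 (stride 4) preserves far more spatial detail than P5 (stride 32),
# which is why an extra P2 head helps tiny targets such as buoys.
for level, stride in {"P2": 4, "P3": 8, "P4": 16, "P5": 32}.items():
    cells = 640 // stride
    print(f"{level}: stride {stride:2d} -> {cells}x{cells} grid, "
          f"each cell covers {stride}x{stride} input pixels")
```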

2.3. Depthwise Convolution Module

The convolutional layer is an indispensable part of traditional CNN models and consists of numerous convolutional kernels. It effectively extracts image feature information and feeds it into the classification layers of the CNN, thereby realizing image classification. Standard convolution has the characteristics of weight sharing and local connection, and each convolution kernel is locally connected to the output of the previous layer; however, its number of parameters is usually large. Depthwise convolution (DWConv) is a special form of convolution that is particularly suitable for lightweight detection models [15]. DWConv is widely applicable, can be applied to convolution operations of arbitrary size, and significantly reduces the amount of computation and the number of parameters, making it promising for mobile devices and resource-constrained environments. The convolution operation is shown in Figure 3. As a strategy for reducing the computational requirements of the model, DWConv may introduce potential drawbacks, such as information loss, weakened feature expression, and an overly local receptive field, while improving computational efficiency. However, these problems can be effectively mitigated through reasonable network design and optimization, so the efficiency of depthwise separable convolution can be enjoyed while maintaining model accuracy.
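A minimal PyTorch sketch of a depthwise separable convolution is given below for illustration; the module name and hyperparameters are assumptions, and the DWConv used in YOLO-style implementations may differ (for example, it may omit the pointwise step and use a single grouped convolution).

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one filter per input channel) followed by a
    1x1 pointwise convolution that mixes channels. Illustrative sketch only."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, s, padding=k // 2,
                                   groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Parameter count compared with a standard 3x3 convolution (64 -> 128 channels)
std = nn.Conv2d(64, 128, 3, padding=1, bias=False)
dws = DepthwiseSeparableConv(64, 128)
print(sum(p.numel() for p in std.parameters()))   # 73,728 weights
print(sum(p.numel() for p in dws.parameters()))   # roughly 9,000 weights
```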

2.4. C3Ghost Module

GhostNet is a lightweight neural network that significantly reduces the computation and the number of parameters while keeping accuracy essentially unaffected, thereby realizing low-cost convolution operations. The network generates feature maps in two main steps. First, a standard convolution with only part of the convolution kernels is applied to obtain the initial feature maps. Second, a linear transformation with low computational complexity and few parameters is applied to the initial feature maps to generate the remaining feature maps. The convolution process of GhostNet is shown in Figure 4.
The C3Ghost module is formed by integrating the GhostNet structure into the C3 module of the YOLOv5 [16] algorithm. Its main role is to reduce the amount of computation and the number of parameters by replacing the Bottleneck in the original C3 module with the GhostBottleneck while maintaining functionality similar to that of the original C3 module. The result is a new lightweight module suitable for mobile and resource-constrained application scenarios [17]. The improved C3Ghost detection layer is shown in Figure 5. The module increases the depth of the network through repeated iterations and increases the width of the network by adjusting the number of channels in the feature map.
The GhostBottleneck module compresses the network structure in combination with the lightweight convolution module GhostConv. By first changing the number of channels of the input feature map and then restoring the previous number of channels, channel mismatch is avoided when the Ghost features are connected to the input features [18], and the features are finally fused with the residual branch processed by a 3 × 3 depthwise convolution. In GhostConv, the first convolution reduces the number of channels of the input feature map using a 1 × 1 convolution kernel with a stride of 1; the resulting feature map is then depthwise-convolved with a 5 × 5 kernel [19], and the two feature maps are finally concatenated.
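The following PyTorch sketch illustrates this two-branch structure under the assumptions above (a 1 × 1 primary convolution, a 5 × 5 depthwise cheap operation, then concatenation); it is an illustrative reconstruction rather than the exact module used in the paper.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost convolution sketch: a 1x1 'primary' convolution produces half of the
    output channels; a 5x5 depthwise 'cheap operation' generates the remaining
    (ghost) feature maps; the two halves are concatenated along the channel axis."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

x = torch.randn(1, 64, 80, 80)
print(GhostConv(64, 128)(x).shape)   # torch.Size([1, 128, 80, 80])
```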

3. Experimental Design and Result Analysis

3.1. Dataset

The dataset was collected at the first sea bathing beach in Zhifu District, Yantai City, Shandong Province. Maritime videos acquired by radar-guided visible and infrared devices with different focal lengths were split into individual frames, producing a maritime target dataset of 5025 images under different weather, lighting, and sea conditions. The resolutions of the visible and infrared images are 2688 × 1520 and 1920 × 1080, respectively. Because the focal length of the visible equipment is larger, the number of targets in the visible images is lower than that in the infrared images. In addition, all images in the dataset are multitarget scenes covering five types of targets: fishing vessels, buoys, ferries, cargo ships, and tugboats. This diversity ensures that the dataset is representative of the complexity of the real-world maritime environment. Sample images are provided in Figure 6.
The images were annotated using the LabelImg (1.8.6) tool. First, rectangular boxes were drawn around the targets of interest. Second, the appropriate label was selected or entered for each object, and the annotation file was saved in XML format. To ensure that the annotation does not introduce bias affecting model performance, the labeled boxes closely enclose the objects to avoid redundant background, labels are used consistently to avoid confusion, and all relevant objects are labeled to avoid omissions. The quality of the labels was reviewed several times to ensure the accuracy and consistency of the annotation. Lastly, the dataset was randomly divided in an 8:2 ratio: 80% of the images were used to train all methods, and the remaining 20% formed the validation set used to evaluate the detection performance of all methods.
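For readers reproducing this preprocessing, the sketch below shows one way to convert LabelImg XML (Pascal VOC) annotations into the normalized label format expected by YOLO-style trainers and to perform the 8:2 split; the directory layout, file names, and class list are assumptions, not part of the released dataset.

```python
import glob
import random
import xml.etree.ElementTree as ET

# Assumed class list and directory layout (hypothetical).
CLASSES = ["fishing vessel", "buoy", "ferry", "cargo ship", "tugboat"]

def voc_to_yolo(xml_path):
    """Convert one LabelImg XML file to YOLO lines: class cx cy w h (normalized)."""
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls = CLASSES.index(obj.find("name").text)
        box = obj.find("bndbox")
        x1, y1, x2, y2 = (float(box.find(t).text)
                          for t in ("xmin", "ymin", "xmax", "ymax"))
        lines.append(f"{cls} {(x1 + x2) / 2 / w:.6f} {(y1 + y2) / 2 / h:.6f} "
                     f"{(x2 - x1) / w:.6f} {(y2 - y1) / h:.6f}")
    return lines

# 8:2 random split of the annotated images into training and validation sets.
files = sorted(glob.glob("annotations/*.xml"))
random.seed(0)
random.shuffle(files)
split = int(0.8 * len(files))
train_files, val_files = files[:split], files[split:]
```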

3.2. Test Environment and Parameter Configuration

The experiments were run on Windows 10 with an Intel(R) Core(TM) i9-10900K processor (Intel Corporation, Santa Clara, CA, USA), an NVIDIA RTX 3080 graphics card (NVIDIA Corporation, Santa Clara, CA, USA), and 64 GB of RAM. The PyCharm IDE (JetBrains, Prague, Czech Republic) was used, and the software environment comprised CUDA 11.6, Python 3.9, and PyTorch 1.13. The specific experimental parameters are shown in Table 1.

3.3. Evaluation Index

The experiments assess the performance of the detection algorithms from two perspectives: detection accuracy and network lightness. Detection accuracy is evaluated using the mean average precision (mAP@0.5, i.e., at an IoU threshold of 0.5) to ensure performance in the target detection task. Network lightness is evaluated using metrics such as the amount of computation (GFLOPs), the number of parameters, and the number of frames processed per second (FPS). GFLOPs reflect the computational complexity of the model, the parameter count reflects the model size, and FPS reflects the real-time performance of the model. By combining these metrics, the performance of the model can be comprehensively evaluated and the most appropriate model selected for practical applications, such as maritime target identification. The metrics are defined by the following formulas [20,21,22]:
P = \frac{TP}{TP + FP}
R = \frac{TP}{TP + FN}
AP = \int_{0}^{1} P(R) \, dR
mAP = \frac{1}{k} \sum_{i=1}^{k} AP_i, \quad k = \text{number of classes}
where TP represents the number of correctly detected targets, FP represents the number of targets detected but not correctly identified, FN represents the number of missed targets, AP represents the average precision for a single class, and mAP represents the average of the AP values for all labels.
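As a concrete illustration of the AP integral above, the short sketch below computes the all-point interpolated area under a toy precision-recall curve; the numbers are invented, and this interpolation scheme is one common convention rather than necessarily the exact one used by the evaluation code in this paper.

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP: area under the precision-recall curve."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # make precision non-increasing
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# Toy precision-recall points for a single class (hypothetical values)
recall = np.array([0.2, 0.4, 0.6, 0.8])
precision = np.array([1.0, 0.9, 0.75, 0.6])
print(average_precision(recall, precision))    # 0.65
# mAP is then the mean of the per-class AP values.
```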
Demand for deploying target detection algorithms on terminals is increasing but is limited by the memory and computing resources of embedded devices. Moreover, the computational overhead and communication time incurred when deploying models with numerous parameters and complex structures are urgent problems to be solved. There is a trade-off among the number of model parameters, the amount of computation, and the average precision. To evaluate the comprehensive performance of the improved YOLOv9 model for target recognition, this study proposes a comprehensive detection capability indicator M based on the average precision, the number of parameters, the amount of computation, and the detection speed. In addition, different weights are set according to the performance aspects that the improved model emphasizes. To avoid limiting the application scenarios, the subindicators can be extended according to actual needs. The indicator is defined as follows:
M = \begin{bmatrix} W_1 & \cdots & W_\alpha & \cdots & W_N \end{bmatrix} \begin{bmatrix} \lambda_1 & \cdots & \lambda_\alpha & \cdots & \lambda_N \end{bmatrix}^{T} = \sum_{\alpha=1}^{N} W_\alpha \lambda_\alpha
\sum_{\alpha=1}^{N} W_\alpha = 1
where W_\alpha denotes the weight of the \alpha-th subindicator, with values in [0, 1]; \lambda_\alpha denotes the corresponding subindicator among the preceding metrics; and N denotes the total number of subindicators.
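A hypothetical numerical illustration of M is given below; the weights, the normalization of each subindicator, and the resulting value are assumptions for demonstration only, since the paper does not fix them (the raw figures are taken from Table 2).

```python
import numpy as np

# Hypothetical weights for [mAP, parameter saving, computation saving, FPS];
# chosen only for illustration, and they must sum to 1.
weights = np.array([0.4, 0.2, 0.2, 0.2])
# Hypothetically normalized subindicators (all mapped into [0, 1]):
indicators = np.array([
    0.947,                 # mAP@0.5 of the improved model
    1 - 28.4 / 50.8,       # relative reduction in parameters vs. baseline
    1 - 223.1 / 236.7,     # relative reduction in GFLOPs vs. baseline
    44.9 / 44.9,           # FPS normalized by the best observed FPS
])
assert abs(weights.sum() - 1.0) < 1e-9
M = float(weights @ indicators)
print(f"M = {M:.3f}")
```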

3.4. Ablation Test

Ablation experiments are conducted on the self-built dataset to analyze the contribution of each module in the improved method to the overall performance. The experimental results are shown in Table 2, from which the following conclusions can be drawn.
Adding the small target detection head achieves higher detection accuracy, although using it alone increases the model parameters and computational complexity. The DWConv and C3Ghost modules successfully reduce the amount of computation and the number of parameters without affecting the detection accuracy, thereby improving the overall performance of the model.
First, the small target detection head P2 is introduced, which yields higher accuracy than the baseline YOLOv9 algorithm. This verifies that the P2 detection head expands the field of view for capturing feature information, enhancing the model's ability to obtain feature information, improving the accuracy of detected targets, and significantly reducing the missed-detection rate. However, adding the small target detection head inevitably increases the computational complexity of the model. To ensure that the model can run efficiently on resource-constrained devices, meet real-time requirements, and reduce power consumption and storage requirements, new convolution modules are introduced to make the model lightweight.
Second, the C3 module from the YOLOv5 algorithm is combined with the Ghost module to obtain the C3Ghost module. Adding C3Ghost alone maintains the model accuracy while significantly reducing the amount of computation and the number of parameters. This result verifies that the C3Ghost module is able to capture the connection between image feature information and deeper feature information, thereby improving the algorithm's detection performance.
Adding the DWConv module alone has little impact on model accuracy, computation, or the number of parameters. However, introducing this module together with the P2 and C3Ghost modules into the YOLOv9 model achieves a greater degree of lightweighting than the configuration without DWConv. This result shows that combining the small target detection head with the DWConv and C3Ghost modules reduces the model computation while maintaining its feature extraction capability, and improves the image processing speed while maintaining the detection accuracy and recall. A smaller model requires fewer computational resources for inference, extends the device's battery life, improves the system's responsiveness, and significantly reduces the embedded device's power consumption and operational latency, thereby improving the model's practical deployability.
To ensure that the improved model achieves the best detection results, experiments are conducted on the number of channels of each module. The effect of the channel number can be seen in the gradual increase in the amount of computation and the number of parameters; hence, only the results around the optimal number of channels are presented in this paper. First, to ensure that the experimental results are not random, experiments are conducted with two different groups of channel numbers, gradually increasing the stride to determine the optimal stride of the C3Ghost module, as shown in Table 3. A stride of 2 yields higher accuracy than the other strides; therefore, the stride is set to 2 in subsequent experiments. Second, the number of channels of the C3Ghost module is modified, with results shown in Table 4. The optimal channels are determined using the comprehensive detection capability indicator M, thereby ensuring that the module achieves the best performance. The final conclusion is that the four C3Ghost modules in the modified backbone network achieve the best detection performance when their channel numbers are 128, 256, 512, and 512.
As noted above, adding the DWConv module alone has little impact on model accuracy, computation, or the number of parameters. The channel-count experiments in Table 5 show that increasing the number of channels gradually increases the amount of computation, and accuracy is affected accordingly. The M indicator is used to determine the number of channels that gives the module the best performance; the original number of channels is retained. Experiments on the number of channels, convolution kernel size, and stride of the small target detection head are also carried out, with results shown in Table 6. The best detection effect is achieved when the settings are [64, 256, 128] and [64, 128, 64].
Lastly, to verify that the combined modules also perform well, several experiments are conducted by combining modules and adjusting the channel numbers and strides. The experimental results are shown in Table 7. The final combination confirms the optimal channel number, convolution kernel, and stride for C3Ghost and DWConv: the best channel numbers for C3Ghost are 128, 256, 512, and 512 with a stride of 2, and the best channel number for DWConv is 64 with a 3 × 3 convolution kernel and a stride of 2. Meanwhile, the P2 detection layer settings of [256, 256, 128] and [128, 128, 64] give better results.

3.5. Comparison Experiment

To further confirm the effectiveness of the proposed algorithm improvements, a comparison is conducted with mainstream target detection algorithms that are widely used at present, including Faster RCNN [23], RetinaNet [24], YOLOXm [25], YOLOv7m [26], YOLOv8x [27], and the model from the literature [28]. The comparison results are shown in Table 8 and verify the superior detection performance of the algorithm proposed in this study.

4. Electro-Optical Images Target Matching

This section constructs a topological structure to establish target correspondences between visible and infrared images by analyzing and visualizing the target detection results of the two modalities. First, the types of targets present in the visible and infrared images are determined. Second, target matching between the electro-optical images is solved by calculating the angle between the X-axis and the line connecting two targets and by constructing the mesh topological structure of the targets. The flowchart of the method is shown in Figure 7.

4.1. Single Correspondence

When the target category information obtained from the detection results of the visible and infrared images contains a single (unique) correspondence relationship, the matching can mainly be divided into the following cases.
One-to-One: A single target type is present in both images, so the corresponding matching relationship can be determined directly. The visualized result is shown in Figure 8.
One-to-Many: A single type of target is present in the visible image, whereas multiple types of targets are present in the infrared image. As shown in Figure 9, two cargo ships in the infrared image correspond to the cargo ships in the visible image, so the matching relationship cannot be judged directly. The uniquely occurring target (the ferry in the figure) is found first. Using this target as a benchmark, the angle formed between the X-axis and the line connecting the benchmark to the center point of each other target in the infrared image is calculated. By comparing the angles in the visible and infrared images, similar angle pairs are found, and the corresponding matched target pairs are recorded. The visualized result is shown in Figure 10.
Many-to-Many: Neither the visible nor the infrared image contains only a single target type. As shown in Figure 11, there are two cargo ships in the visible image and three cargo ships in the infrared image, so the matching relationship cannot be judged directly. Taking the uniquely occurring target as a benchmark, the angle formed between the X-axis and the line connecting each other target in the image to the benchmark is calculated. The corresponding targets are then matched by comparing similar angle pairs across the electro-optical images; the visualized result is shown in Figure 12.
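The following Python sketch illustrates the benchmark-angle matching described above; the coordinates are hypothetical bounding-box centers invented for the example, and the nearest-angle assignment is a simplified stand-in for the angle-pair comparison used in the paper.

```python
import math

def angle_to_x_axis(ref, pt):
    """Angle (degrees) between the X-axis and the line from the benchmark
    target's center `ref` to another target's center `pt`."""
    return math.degrees(math.atan2(pt[1] - ref[1], pt[0] - ref[0]))

# Hypothetical centers: the ferry is the uniquely occurring benchmark target.
ferry_vis, ferry_ir = (820, 410), (540, 300)
cargo_vis = [(1300, 430), (300, 380)]             # cargo ships in the visible image
cargo_ir = [(900, 335), (1500, 310), (150, 270)]  # candidate cargo ships in the infrared image

for cv in cargo_vis:
    a_vis = angle_to_x_axis(ferry_vis, cv)
    # choose the infrared candidate whose benchmark angle is closest to a_vis
    best = min(cargo_ir, key=lambda ci: abs(angle_to_x_axis(ferry_ir, ci) - a_vis))
    print(f"visible target at {cv} matched to infrared target at {best}")
```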

4.2. Non-Unitary Correspondence

When no single correspondence exists in the electro-optical images, that is, both images contain multiple targets of the same type, the matching can be divided into the following two cases.
If the number of targets in the visible image is 2, the angle to the X-axis is established between the two targets; the target types present in the visible image are extracted, and only targets of these same types are considered in the infrared image. Correspondence is then established through similar angle pairs in the electro-optical images. The visualization is shown in Figure 13.
If the number of targets in the visible image is above 2, the centroid of each vessel in the visible and infrared images is calculated, and a mesh topological structure is drawn in the electro-optical images, as shown in Figure 14. Thereafter, only the target types present in the visible image are considered in the infrared image, the triangle information in the electro-optical images is computed, and the triangles in the visible and infrared images are compared to find similar triangle pairs whose angular differences are less than a preset threshold (set to 0.5° in this study), as shown in Figure 15. Lastly, the vertices of the similar triangles in the infrared and visible images are connected. The visualized result is shown in Figure 16.
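The sketch below illustrates this triangle-based matching under the stated 0.5° threshold; the centroid coordinates are hypothetical, and for brevity it only checks angle similarity, whereas a full implementation would also resolve the vertex-to-vertex correspondences.

```python
import math
from itertools import combinations

def triangle_angles(p1, p2, p3):
    """Sorted interior angles (degrees) of the triangle formed by three centroids."""
    a, b, c = math.dist(p2, p3), math.dist(p1, p3), math.dist(p1, p2)
    A = math.degrees(math.acos((b**2 + c**2 - a**2) / (2 * b * c)))
    B = math.degrees(math.acos((a**2 + c**2 - b**2) / (2 * a * c)))
    return sorted([A, B, 180.0 - A - B])

def is_similar(tri1, tri2, thresh=0.5):
    """Two triangles are considered similar if corresponding angles differ by < thresh."""
    return all(abs(u - v) < thresh
               for u, v in zip(triangle_angles(*tri1), triangle_angles(*tri2)))

# Hypothetical centroids of same-type targets in the two images.
vis = [(200, 300), (600, 320), (420, 500)]
ir = [(150, 220), (430, 234), (304, 360), (700, 400)]  # includes one extra candidate

for tri_ir in combinations(ir, 3):
    if is_similar(vis, tri_ir):
        print("matched triangle in infrared image:", tri_ir)
```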
Using topological relations for target matching between different types of images shows strong robustness to changes in environmental conditions and, in particular, maintains stable matching performance under illumination and weather changes. However, when a target in the image with the larger focal length is substantially occluded, its centroid can easily shift, which complicates threshold selection and affects the matching accuracy.

5. Conclusions

To address the challenge of matching multitarget electro-optical images at sea, this study significantly improves the detection accuracy and computational efficiency of the proposed model. These gains are attributed to the improvements made to the YOLOv9 model: the introduction of the P2 detection head to improve small target detection accuracy, the use of the C3Ghost module to replace the RepNCSPELAN4 module in the backbone network for lighter yet effective feature extraction, and the use of the lightweight convolution module DWConv. Furthermore, a decision-level target correspondence method based on a topological structure is introduced to effectively solve the problem of inconsistent target size and number caused by the difference in focal lengths between the visible and infrared sensors. This method provides an innovative and feasible solution to the problem of target matching in maritime electro-optical images. However, this study still has limitations. In complex environments with large focal length differences, the decision-level matching method, although it makes better use of the spatial information between targets, is still constrained by threshold selection. Future research can focus on the following issues: (1) exploring additional electro-optical image datasets and validating the method under different environments and conditions to ensure its universality and robustness; (2) combining deep learning with traditional image alignment methods to further improve the accuracy and efficiency of visible and infrared image matching. These undertakings will help further develop the capability of the proposed model and improve the performance and reliability of automatic maritime target annotation systems.

Author Contributions

Methodology, S.Y. and N.L.; validation, Z.C. and Z.W.; investigation, Y.S. and N.L.; writing—original draft preparation, S.Y., Z.C. and Z.W.; writing—review and editing, S.Y., N.L., Y.S. and Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Project: National Natural Science Foundation of China (62388102, 62101583, 61871392); Taishan Scholar Project (tsqn202211246); and Fund Project of National Defense Key Laboratory of Science and Technology (2021-JCJQ-LB-018).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. (The data are not publicly available due to privacy or ethical restrictions.)

Acknowledgments

The authors would like to thank all reviewers for their helpful comments and suggestions on this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Idiri, B.; Napoli, A. The automatic identification system of maritime accident risk using rule-based reasoning. In Proceedings of the 2012 7th International Conference on System of Systems Engineering (SoSE), Genova, Italy, 16–19 July 2012; pp. 125–130. [Google Scholar] [CrossRef]
  2. Yifan, L. Visible Light and Infrared Fusion Algorithm Applied to Surface Unmanned Vessel. Ph.D. Thesis, Harbin Engineering University, Harbin, China, 2021. [Google Scholar]
  3. Wu, R.; Yu, D.; Liu, J.; Wu, H.; Chen, W.; Gu, Q. An improved fusion method for infrared and low-light level visible image. In Proceedings of the 2017 14th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 15–17 December 2017; pp. 147–151. [Google Scholar] [CrossRef]
  4. Zhu, D.; Zhan, W.; Fu, J.; Jiang, Y.; Xu, X.; Guo, R.; Chen, Y. RI-MFM: A Novel Infrared and Visible Image Registration with Rotation Invariance and Multilevel Feature Matching. Electronics 2022, 11, 2866. [Google Scholar] [CrossRef]
  5. Zhang, X.; Zhai, H.; Liu, J.; Wang, Z.; Sun, H. Real-time infrared and visible image fusion network using adaptive pixel weighting strategy. Inf. Fusion 2023, 99, 101863. [Google Scholar] [CrossRef]
  6. Yao, J.; Xiao, S.; Deng, Q.; Wen, G.; Tao, H.; Du, J. An Infrared Maritime Small Target Detection Algorithm Based on Semantic, Detail, and Edge Multidimensional Information Fusion. Remote Sens. 2023, 15, 4909. [Google Scholar] [CrossRef]
  7. Zhang, H.; Yu, H.; Tao, Y.; Zhu, W.; Zhang, K. Improvement of ship target detection algorithm for YOLOv7-tiny. IET Image Process. 2024, 18, 1710–1718. [Google Scholar] [CrossRef]
  8. Wang, Z.; Hou, G.; Xin, Z.; Liao, G.; Huang, P.; Tai, Y. Detection of SAR image multiscale ship targets in complex inshore scenes based on improved YOLOv5. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5804–5823. [Google Scholar] [CrossRef]
  9. Zhao, Q.; Wu, Y.; Yuan, Y. Ship Target Detection in Optical Remote Sensing Images Based on E2YOLOX-VFL. Remote Sens. 2024, 16, 340. [Google Scholar] [CrossRef]
  10. Zhang, H.; Liu, C.; Ma, J.; Sun, H. Time-prior-based stacking ensemble deep learning model for ship infrared automatic target recognition in complex maritime scenarios. Infrared Phys. Technol. 2024, 137, 105168. [Google Scholar] [CrossRef]
  11. Nithya, B.; Subash, N.; Sivapriya, K.; Devadharshini, R. Multi Small Object Detection and Prioritized Tracking for Navy Operations using Deep Learning Techniques. In Proceedings of the 2023 International Conference on Quantum Technologies, Communications, Computing, Hardware and Embedded Systems Security (iQ-CCHESS), Kottayam, India, 15–16 September 2023; pp. 1–7. [Google Scholar] [CrossRef]
  12. Wang, C.; Zhu, Y. Ship Crack Detection Based on Lightweight Fast Convolution and Bidirectional Weighted Feature Fusion Network. Chin. J. Ship Res. 2023, 19, 1–12. (In Chinese) [Google Scholar]
  13. Tan, L.; Liang, Y.; Xia, J.; Wu, H.; Zhu, J. Detection and Diagnosis of Small Target Breast Masses Based on Convolutional Neural Networks. Tsinghua Sci. Technol. 2024, 29, 1524–1539. [Google Scholar] [CrossRef]
  14. Zhu, Y.; Dong, E.; Tong, J.; Yang, S.; Zhang, Z.; Li, W. Deep Neural Network Based Object Detection Algorithm With optimized Detection Head for Small Targets. In Proceedings of the 2023 IEEE International Conference on Mechatronics and Automation (ICMA), Harbin, China, 6–9 August 2023; pp. 2378–2382. [Google Scholar] [CrossRef]
  15. Qin, S.; Pu, Y.; Tang, J.; Yao, S.; Chen, K.; Huang, W. Intelligent Edge Gearbox Faults Diagnosis System via Multiscale Depthwise Separable Convolution Network. In Proceedings of the 2023 International Conference on Sensing, Measurement & Data Analytics in the era of Artificial Intelligence (ICSMD), Xi’an, China, 2–4 November 2023; pp. 1–6. [Google Scholar] [CrossRef]
  16. Lin, Q.; Zhang, S.; Xu, S. Construction of Traffic Moving Object Detection System Based on Improved YOLOv5 Algorithm. In Proceedings of the 2023 2nd International Conference on 3D Immersion, Interaction and Multi-sensory Experiences (ICDIIME), Madrid, Spain, 27–29 June 2023; pp. 268–272. [Google Scholar] [CrossRef]
  17. Xu, J.; Yang, H.; Wan, Z.; Mu, H.; Qi, D.; Han, S. Wood Surface Defects Detection Based on the Improved YOLOv5-C3Ghost With SimAm Module. IEEE Access 2023, 11, 105281–105287. [Google Scholar] [CrossRef]
  18. He, Z.; He, D.; Li, X.; Qu, R. Blind Superresolution of Satellite Videos by Ghost Module-Based Convolutional Networks. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5400119. [Google Scholar] [CrossRef]
  19. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar] [CrossRef]
  20. He, Q.; Mei, Z.; Zhang, H.; Xu, X. Automatic Real-Time Detection of Infant Drowning Using YOLOv5 and Faster R-CNN Models Based on Video Surveillance. J. Soc. Comput. 2023, 4, 62–73. [Google Scholar] [CrossRef]
  21. Zheng, X.; Lu, X. BPH-YOLOv5: Improved YOLOv5 based on biformer prediction head for small target cigarette detection. In Proceedings of the Jiangsu Annual Conference on Automation (JACA 2023), Changzhou, China, 10–12 November 2023; pp. 77–82. [Google Scholar] [CrossRef]
  22. Pandey, S.; Chen, K.-F.; Dam, E.B. Comprehensive Multimodal Segmentation in Medical Imaging: Combining YOLOv8 with SAM and HQ-SAM Models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France, 2–6 October 2023; pp. 2584–2590. [Google Scholar] [CrossRef]
  23. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  24. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  25. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  26. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  27. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO, version 8.0.0 [Computer Software]. Available online: https://github.com/ultralytics/ultralytics (accessed on 10 January 2023).
  28. Zhang, M.Y.; Liu, N.B.; Wang, Z.X.; Yu, H.L. A method of photoelectric ship image detection based on improved SSD. In Proceedings of the International Conference on Signal Processing and Communication Technology (SPCT 2022), Harbin, China, 23 December 2022; Volume 12615, pp. 130–135. [Google Scholar]
Figure 1. Main framework diagram.
Figure 2. Improved detection layer.
Figure 3. Convolution operation.
Figure 4. GhostNet convolution process.
Figure 5. Improved detection layer.
Figure 6. Dataset example. (a) Visible light image; (b) Infrared image.
Figure 7. Target matching method flowchart.
Figure 8. Electro-optical image target matching 1.
Figure 9. Electro-optical image target matching 2.1.
Figure 10. Electro-optical image target matching 2.2.
Figure 11. Electro-optical image target matching 3.1.
Figure 12. Electro-optical image target matching 3.2.
Figure 13. Electro-optical image target matching 4.
Figure 14. Electro-optical image target matching 5.1.
Figure 15. Electro-optical image target matching 5.2.
Figure 16. Electro-optical image target matching 5.3.
Table 1. Training parameter configuration.
Parameters | Settings | Argument | Settings
Epoch | 100 | lrf | 0.01
Batch | 2 | lr0 | 0.01
Imgsz | 640 | momentum | 0.937
workers | 8 | Weight_decay | 0.0005
Table 2. Evaluation metric results. ("√" indicates the modified part of the model.)
Number | P2 | C3Ghost | DWConv | Parameter/M | GFLOPs | mAP@0.5/% | FPS
No.1 |  |  |  | 50.8 | 236.7 | 86.7 | 44.7
No.2 | √ |  |  | 35.6 | 245.3 | 95.4 | 44.9
No.3 |  | √ |  | 44.7 | 206.2 | 86.7 | 43.9
No.4 |  |  | √ | 50.8 | 230.6 | 87.6 | 43.7
No.5 | √ | √ |  | 30.2 | 250.8 | 94.7 | 44.9
No.6 | √ | √ | √ | 28.4 | 223.1 | 94.7 | 44.9
Table 3. C3Ghost module stride experiment.
Stride | Accuracy of Original Channel | Accuracy of Original Channel 2
1 | 0.856 | 0.860
2 | 0.867 | 0.861
3 | 0.863 | 0.858
4 | 0.863 | 0.857
Table 4. C3Ghost module channel count experiment.
Procedure | Parameter/M | GFLOPs | mAP@0.5/% | Procedure | Parameter/M | GFLOPs | mAP@0.5/%
1 | 42.0 | 199.9 | 86.1 | 8 | 43.7 | 216.5 | 86.5
2 | 42.0 | 199.9 | 86.2 | 9 | 44.6 | 216.3 | 85.6
3 | 42.0 | 199.9 | 86.2 | 10 | 44.6 | 217.3 | 86.4
4 | 43.0 | 199.5 | 86.1 | 11 | 44.6 | 217.3 | 86.4
5 | 43.3 | 207.2 | 86.2 | 12 | 46.0 | 220.1 | 86.6
6 | 44.7 | 206.2 | 86.7 | 13 | 46.0 | 220.1 | 86.5
7 | 43.6 | 209.8 | 86.2 | 14 | 45.2 | 293.3 | 85.7
Table 5. DWConv module channel count experiment.
Procedure | Parameter/M | GFLOPs | mAP@0.5
1 | 50.8 | 230.6 | 0.876
2 | 50.9 | 232.3 | 0.875
3 | 50.8 | 230.6 | 0.876
4 | 50.9 | 231.5 | 0.874
5 | 50.9 | 235.8 | 0.878
6 | 51.0 | 239.5 | 0.876
7 | 51.0 | 242.7 | 0.883
Table 6. P2 detection head channel count experiment.
Procedure | Parameter/M | GFLOPs | mAP@0.5/% | Procedure | Parameter/M | GFLOPs | mAP@0.5/%
1 | 35.6 | 245.3 | 95.4 | 6 | 40.8 | 277.6 | 95.4
2 | 35.7 | 246.3 | 95.5 | 7 | 37.6 | 292.8 | 95.3
3 | 36.6 | 260.3 | 95.1 | 8 | 39.6 | 317.7 | 95.5
4 | 37.0 | 265.7 | 95.7 | 9 | 41.0 | 332.7 | 95.2
5 | 37.2 | 267.8 | 95.1 | 10 | 41.2 | 337.0 | 95.6
Table 7. Composite model experiment.
Procedure | Parameter/M | GFLOPs | mAP@0.5/% | Procedure | Parameter/M | GFLOPs | mAP@0.5/%
1 | 35.6 | 245.3 | 95.4 | 6 | 40.8 | 277.6 | 95.4
2 | 35.7 | 246.3 | 95.5 | 7 | 37.6 | 292.8 | 95.3
3 | 36.6 | 260.3 | 95.1 | 8 | 39.6 | 317.7 | 95.5
4 | 37.0 | 265.7 | 95.7 | 9 | 41.0 | 332.7 | 95.2
5 | 37.2 | 267.8 | 95.1 | 10 | 41.2 | 337.0 | 95.6
Table 8. Comparison of experimental results of different network models.
Network Model | Parameter/M | GFLOPs | mAP@0.5/% | FPS
Faster RCNN | 137.1 | 370.2 | 56.2 | 23.5
RetinaNet | 38.0 | 170.1 | 45.9 | 20.1
YOLOXm | 54.2 | 156.0 | 84 | 42.1
YOLOv7m | 37.62 | 106.5 | 84.5 | 42.3
YOLOv8x | 68.2 | 258 | 85.1 | 43
Literature [28] | 56.38 | 90.2 | 85.6 | 41.4
Ours | 28.4 | 223.1 | 94.5 | 44.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
