1. Introduction
As deep learning technology advances, the capabilities and accuracy of object detection are steadily improving. This progress is being applied in the defense field, both in terrestrial applications and in coastal border and maritime surveillance. Maritime surveillance aims to monitor the activities of various objects at sea, detect abnormal behaviors, and raise alarms as necessary. To achieve this goal, maritime surveillance must be accompanied by object detection, which plays an essential role in securing maritime areas by locating ships and aircraft navigating maritime regions, tracking their movement paths, preparing for security threats, and enabling early warnings by recognizing illegal activities in advance [1,2,3].
The most common maritime surveillance method is radar-based surveillance, which covers vast maritime areas and enables wide-area coverage and the detection of distant objects. Santi et al. [4] demonstrated the feasibility of maritime surveillance using multitransmitter GNSS-based passive radar in the range-Doppler domain. Similarly, Nasso and Santi [5] proposed a method for detecting vessels over a wide area by fusing multiple bistatic channels and multiple frames received from a multitransmitter GNSS-based passive radar. Zhang et al. [6] proposed a low-cost maritime surveillance system using high-frequency surface wave radar and a hierarchical key-area perception model. However, radar-based surveillance struggles to detect nonmetallic or low-reflectivity objects, such as small fishing boats carrying six to eight passengers; furthermore, it provides little information about an object's shape or size [7]. Another maritime surveillance method employs acoustic sensors, which can identify the location and type of an object from its sound characteristics [8,9]. However, acoustic sensors are vulnerable to interference from wind, waves, and other background noise, and their accuracy is degraded by noise generated by the sensor itself.
Given these limitations, various studies have been conducted on vision-based maritime surveillance, which can accurately analyze a detected object's shape, size, color, and other attributes from the rich visual information and high-quality images acquired by vision sensors such as RGB or infrared (IR) cameras. Tran and Le [10] proposed a method to find dynamic and static objects using background subtraction and saliency detection, but it showed the limitation that small objects are often absorbed into the background and left undetected. Nita and Vandewal [11] proposed a method to detect large vessels, such as tankers or container ships, using Mask R-CNN (mask region-based convolutional neural network). However, it is difficult to extract detailed features for small fishing boats, and the detection performance drops sharply because the object's boundary cannot be accurately identified. Qi et al. [12] proposed an improved Faster R-CNN that uses image downscaling and scene narrowing to obtain more valuable object features and improve processing speed. While it improves the detection accuracy and processing speed for short-range ships near the harbor compared to the original Faster R-CNN, it does not consider low-resolution ships located near the horizon and is therefore unsuitable for surveillance near the horizon. Ma et al. [13] proposed a method of finding objects by extracting patches in areas where objects are likely to exist in IR images and applying nonlocal dissimilarity measures and multi-frame saliency measures. However, this method's detection performance degrades when large objects partially obscure small ones.
Similarly, Wang et al. [14] proposed a method of detecting maritime targets by segmenting objects with grayscale-distribution-curve shift binarization after enhancing the IR image through local peak singularity measurement. However, because detection relies on IR reflection, performance degrades for objects with poor reflectivity or with low reflection intensity due to distance from the sensor. Despite these endeavors in vision-based maritime surveillance, detection performance remains unsatisfactory for low-resolution objects, objects near the horizon, and small objects that are difficult to identify from a distance.
The detection of objects near the horizon, which can pose a serious threat to ships, crews, and port security, plays a crucial role in maritime surveillance, but due to its inherent difficulty, limited research has been conducted on this subject. Yang et al. [15] introduced an approach for detecting small objects in infrared images by utilizing the local heterogeneity property and the nonlocal self-correlation property to suppress highly bright background clutter caused by strong waves or sunlight. However, the detection performance diminished in scenarios with many objects or when larger objects partially obscured smaller ones. Su et al. [16] proposed a method to detect the horizon using a panoramic vision system and the Hough circle transformation, and to detect objects near the horizon using local region complexity. However, this method has a notable drawback: the horizon detection performance depends heavily on the preset threshold used to remove unnecessary boundaries, which in turn makes the object detection performance unstable. Yoneyama and Dake [17] proposed a method that uses a large-object detector to detect large vessels and then segments the horizon region into patches, within which a small-object detector identifies small vessels. However, when part of the horizon is obscured by a large vessel, the horizon becomes difficult to find, which causes the small-object detector to stop working and miss small objects.
Unidentified objects located near the horizon in maritime surveillance environments can be a potential threat; however, they are challenging to detect due to their small size and low resolution. To overcome this limitation, existing approaches first detect the horizon and then focus on detecting objects near it; however, their performance remains unsatisfactory. Thus, in this study, we propose a multifocal object detection associative network (MODAN) for more reliable and improved maritime surveillance. The proposed network first performs a horizon search to detect objects located near the horizon that occupy a very small area of the entire image. The horizon detector initially employs K-means-based color quantization to segment the input image into two regions, the ocean and the sky; noise is then eliminated through mathematical morphology, ensuring stable horizon detection even when parts of the horizon are obscured. Because the extracted horizon area is horizontally elongated, feeding it directly into an object detection network that expects square inputs distorts the objects' shapes and degrades detection performance. Thus, the region of interest (ROI) selector extracts the ROI, the area around the detected horizon, and segments it into square patches. Afterwards, the super-resolution CNN (SRCNN) is used to improve the resolution of the ROI images. Then, the far-field detector (FFD) searches for small objects near the horizon while the near-field detector (NFD) simultaneously searches for relatively large objects in the original image. Objects detected redundantly by the two independent detectors are merged using weighted box fusion to finally estimate the objects' bounding boxes. To verify the superiority of the proposed MODAN, comparative experiments were conducted with similar models.
The comparison results showed that MODAN achieved an average precision (AP) improvement of more than 7% over existing single detection models and of more than 10% over multifocal state-of-the-art detection models, with a false detection rate reduced by more than 59%, validating the stability and efficiency of the proposed model. The main contributions of this study can be summarized as follows:
− A horizon detector based on color quantization is proposed, ensuring stable horizon detection even when parts of the horizon are obscured. As a result, the rate of undetected objects located near the horizon is effectively reduced.
− An approach to enhance the resolution of segmented ROI images through SRCNN is proposed, improving object detection performance by restoring the shapes of small objects.
− A multifocal detector is proposed to detect both near-field large objects and far-field small objects near the horizon.
− Finally, weighted box fusion is introduced to improve detection performance by reliably removing duplicate detected objects from segmented ROI images.
2. Horizon Detection and ROI Selection
In the maritime environment, horizon detection provides essential information for navigation and exploration and plays a vital role in locating and identifying the destination for autonomous navigation of vehicles such as ships and aircraft [18,19]. In particular, objects of interest in maritime surveillance operations often appear near the horizon, making horizon detection an imperative step. Consequently, significant efforts have been devoted to detecting the horizon [16,17,20,21]. In this study, we propose a horizon detection and ROI selection structure similar to the existing methods, as depicted in Figure 1, to more reliably detect objects near the horizon.
To detect the horizon, the ocean and sky regions’ colors are first binarized using color quantization based on K-means, and then background and object noise are removed through mathematical morphology. Based on this, the detected horizon is rotated to align with the horizontal axis of the image. Then, an ROI is extracted evenly, ensuring the visibility of small objects within the ROI. This ROI is subsequently passed to the object detector through ROI segmentation for object detection.
2.1. Color Quantization
Color quantization is a method that converts an image's colors into a limited set of colors, mapping 24-bit colors to a fixed number of colors. This not only reduces the number of colors in the image, which reduces the amount of computation required for image processing, but also removes noise caused by clouds or waves by eliminating high-frequency components from the image and generating smooth and consistent colors [22].
In this study, traditional K-means clustering is employed for color quantization to cluster the color vectors in the image into two colors, ocean and sky; each pixel in the input image is a three-dimensional color vector (R, G, B). The proposed quantization method randomly extracts N sample vectors from the total color vectors and defines a set of centroid vectors, v_i (i = 1, 2), initialized as any two of the extracted sample vectors. Each centroid vector comes to represent the sky or ocean color, and based on the Euclidean distance to the other sample vectors, these vectors are divided into two groups. The set of color vectors clustered around the centroid v_i, S_i, can be calculated as follows:

S_i = { x_k : ||x_k − v_i||^2 ≤ ||x_k − v_j||^2, ∀j ∈ {1, 2} }

Here, x_k refers to the k-th color vector among the N sample vectors and v_i refers to the centroid vector that represents the sky or the sea. For each group, the mean of the clustered color vectors is selected as the new centroid vector:

v_i = (1 / |S_i|) Σ_{x_k ∈ S_i} x_k

Here, |S_i| is the number of color vectors belonging to that cluster. This process is repeated until the centroid vectors stabilize, so that the optimal centroid vectors distinctly differentiate the sky and sea colors. Clustering all color vectors using the optimal centroid vectors results in two colors, as shown in
Figure 2.
In this way, the quantization process divides the image into two regions, but some color vectors are incorrectly clustered, resulting in small holes or dots, as seen in
Figure 2b. These holes or dots can cause errors in the horizon extraction process. To address them, a closing operation (dilation followed by erosion) from mathematical morphology is performed to merge adjacent objects, reduce the gaps between objects, and remove noise or holes inside objects so that the horizon can be detected reliably.
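The two-color quantization described above can be sketched as a small NumPy routine. The function name `quantize_two_colors` and its sampling parameters are illustrative choices, not the paper's implementation:

```python
import numpy as np

def quantize_two_colors(image, n_samples=1000, n_iters=20, seed=0):
    """Cluster RGB pixels into two colors (e.g., sky vs. ocean) with K-means.

    image: (H, W, 3) array. Returns (labels, centroids), where labels is
    an (H, W) map with values in {0, 1}.
    """
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, 3).astype(float)
    # Randomly extract N sample vectors from the full set of color vectors.
    samples = pixels[rng.choice(len(pixels), size=n_samples, replace=True)]
    # Initialize the two centroids as any two of the extracted samples.
    centroids = samples[rng.choice(n_samples, size=2, replace=False)]
    for _ in range(n_iters):
        # Assign each sample to the nearest centroid (Euclidean distance).
        d = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its cluster.
        for i in range(2):
            if np.any(labels == i):
                centroids[i] = samples[labels == i].mean(axis=0)
    # Quantize the full image with the converged centroids.
    d = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1).reshape(image.shape[:2]), centroids
```

On a synthetic image whose top half is bright and bottom half is dark, the routine assigns the two halves to different clusters, which is exactly the sky/ocean binarization used here.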
2.2. Mathematical Morphology
Mathematical morphology is a technique that extracts the desired parts of an image using a kernel of a predefined geometric shape. It extracts the elements necessary to represent the shape of objects, such as boundaries and skeletons, and is often used to enhance ambiguous object regions produced by image binarization through operations such as noise removal, hole filling, and connecting broken lines [23]. The basic operations are erosion, which removes object protrusions and background noise, and dilation, which removes internal noise such as holes in objects. Opening and closing are built from these operations. Among them, closing performs dilation followed by erosion on a binary image, filling the holes inside objects, refining object shapes, removing background noise, and smoothing the edges of the image. The enhanced binary image is obtained by removing both internal and background noise.

Here, B is a predefined geometric kernel. A ⊕ B denotes the dilation of the binary image A, A ⊕ B = { z : (B)_z ∩ A ≠ ∅ }, where (B)_z is the translation of B by z. This means that, as the kernel passes over the binarized image, the overlapping region is filled with 1 whenever the kernel touches the 1-filled region. This process expands the area of the object and removes noise such as irregularities at object boundaries or holes inside objects. Conversely, A ⊖ B denotes the erosion process, the opposite of dilation, in which the overlapping region is set to 0 unless the kernel completely overlaps the 1-filled region, i.e., A ⊖ B = { z : (B)_z ⊆ A }. This removes small background noise and also restores objects expanded by dilation to their original size. Thus, the noise disappears and the sea and sky regions are cleanly separated.
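A minimal NumPy sketch of the closing operation (dilation followed by erosion) on a binary image. The loop-based helpers are illustrative, assume a small odd-sized kernel, and zero-pad at the borders, which slightly erodes objects touching the image edge:

```python
import numpy as np

def dilate(A, B):
    """Binary dilation: a pixel becomes 1 if the kernel window hits any 1."""
    kh, kw = B.shape
    ph, pw = kh // 2, kw // 2
    P = np.pad(A, ((ph, ph), (pw, pw)))
    out = np.zeros_like(A)
    for y in range(A.shape[0]):
        for x in range(A.shape[1]):
            out[y, x] = np.any(P[y:y + kh, x:x + kw] & B)
    return out

def erode(A, B):
    """Binary erosion: a pixel stays 1 only if the kernel fits inside the 1s."""
    kh, kw = B.shape
    ph, pw = kh // 2, kw // 2
    P = np.pad(A, ((ph, ph), (pw, pw)))
    out = np.zeros_like(A)
    for y in range(A.shape[0]):
        for x in range(A.shape[1]):
            out[y, x] = np.all(P[y:y + kh, x:x + kw][B.astype(bool)])
    return out

def closing(A, B):
    """Closing = dilation then erosion: fills small holes, smooths edges."""
    return erode(dilate(A, B), B)
```

Closing a solid block that contains a one-pixel hole fills the hole while leaving the block's footprint unchanged, which is the behavior relied on before horizon extraction.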
2.3. Horizon Search and Rotation
In the separated binary image, the pixel’s position is moved from the top to the bottom to find the point where the pixel value changes from 0 to 1, and the horizontal line coordinates are detected. However, there may be cases in which the horizon is undetected due to large objects near the horizon or close to the sensor. Therefore, to reliably detect the horizon coordinates,
horizontal candidate coordinates are extracted at a regular interval from the left and right sides of the image center, respectively, and the average of these coordinates is calculated to finally estimate the
-coordinates of the left and right endpoints of the horizon,
and
.
Here, and refer to the -th -coordinate values of the coordinates that are candidates for the horizon from the left and right sides, respectively. As a result, the coordinates of the left and right endpoints of the horizon are and , respectively, where is the width of the image.
A portion of the area around the horizon, the ROI, should be separated from the original image using the estimated horizon coordinates, but if the horizon is tilted, the ROI cannot be separated evenly. In this case, to minimize losses in the object detection result and in matching the segmented ROI images back to the original, the angle θ of the tilted horizon depicted in
Figure 3 is found as follows:

θ = tan⁻¹((y_r − y_l) / W)

Assuming that the horizon in the image should be parallel to the x-axis, the right endpoint of the horizon, y_r, is set to the y-coordinate of the left endpoint, y_l. The image is then rotated by θ to make y_l and y_r equal, as follows, and the ROI around the horizon is extracted:

[x', y']ᵀ = R(θ)([x, y]ᵀ − [c_x, c_y]ᵀ) + [c_x, c_y]ᵀ,  R(θ) = [[cos θ, −sin θ], [sin θ, cos θ]]

Here, (c_x, c_y) is the center of rotation, and (x, y) and (x', y') are the pixel coordinates before and after rotation, respectively. The rotated image is obtained through this rotation matrix, making the horizon line parallel to the horizontal axis.
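The tilt estimation and leveling rotation can be illustrated as follows. The helper names are hypothetical, and the sign convention may need to be flipped depending on whether the y-axis grows upward or downward in the chosen image coordinate system:

```python
import numpy as np

def horizon_angle(y_left, y_right, width):
    """Tilt angle of the horizon from its two endpoint y-coordinates."""
    return np.arctan2(y_right - y_left, width)

def rotate_points(points, angle, center):
    """Rotate (x, y) points by -angle about `center` to level the horizon."""
    c, s = np.cos(-angle), np.sin(-angle)
    R = np.array([[c, -s], [s, c]])
    # Translate to the rotation center, rotate, translate back.
    return (np.asarray(points) - center) @ R.T + center
```

Rotating the right endpoint of a tilted horizon about the left endpoint by the negative tilt angle brings both endpoints to the same height, which is the condition the ROI extraction step requires.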
2.4. ROI Selection and Segmentation
When extracting the ROI near the horizon from the rotated image I_r, if the extracted image is rectangular, the object's shape may be distorted, and detection performance may be degraded for an object detector that takes a square image as input. Thus, the image is split into square patches as follows so that the object detector can detect objects near the horizon:

P_i = I_r(y_h − h_a : y_h + h_b, (i − 1)S : iS),  i = 1, …, D

Here, D refers to the maximum integer that can divide the image's horizontal axis into ROIs of size S × S, where S = h_a + h_b; i indexes the patches from 1 to D; and h_a and h_b represent the ROI ranges above and below the horizon, respectively. Consequently, as shown in the example in
Figure 4, an ROI image divided into square patches near the horizon is obtained from the rotated image after the horizon is detected.
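A sketch of the ROI selection and segmentation step, assuming the horizon row `y_horizon` and the ranges above and below it are given; the function name and the clamping policy at the top image border are illustrative:

```python
import numpy as np

def segment_roi(rotated, y_horizon, h_above, h_below):
    """Split the strip around the horizon into square patches.

    The patch side is S = h_above + h_below; D = floor(W / S) patches are
    taken left to right, each covering rows [y_horizon - h_above,
    y_horizon + h_below) of the rotated image.
    """
    H, W = rotated.shape[:2]
    S = h_above + h_below
    D = W // S
    top = max(0, y_horizon - h_above)  # clamp at the top border
    strip = rotated[top:top + S]
    return [strip[:, i * S:(i + 1) * S] for i in range(D)]
```

For a 100 x 640 image with a horizon at row 50 and ranges of 32 pixels above and below, this yields ten square 64 x 64 patches, each suitable as input to a square-input detector.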
3. Multifocal Object Detection
Existing object detection networks have focused on relatively large objects, such as large ships, as they attempt to detect objects across the entire input image. Consequently, small ships near the horizon, far from the sensor, have low resolution and occupy a very small part of the entire image; such objects lack a distinct shape and are prone to misclassification or even missed detection. To address this limitation, a strategy is required to train object detectors on images extracted or transformed to focus on small objects. To this end, we propose a multifocal object detector that simultaneously searches for small objects near the horizon by means of the FFD and for relatively large objects by means of the NFD based on the original image, detecting all objects at sea after improving the resolution of the segmented ROI images through SRCNN. The block diagram is shown in
Figure 5.
The segmented ROI images input into the FFD are enlarged around the horizon, so objects appear larger than in the original image, but their resolution is reduced, making their shapes hard to interpret with a general object detector. Thus, SRCNN, a deep-learning-based image resolution enhancement network, is used to increase the resolution of the segmented ROI images while restoring object shapes so that small objects can be detected. However, both the FFD and the NFD can detect the same large object redundantly. Therefore, weighted box fusion (WBF) removes overlapping bounding boxes and finally estimates each object from the concatenation of the objects inferred independently by the NFD and FFD.
3.1. Resolution Enhancement
Objects located near the horizon are characterized by very low resolution because they are small enough to be indistinguishable from the naked eye and occupy only a small portion of the entire image, as shown in
Figure 6a. These small, low-resolution objects are frequently encountered while searching for distant objects. For this reason, research on image resolution enhancement has inevitably proceeded alongside distant object detection [24,25]. In this study, similar to the method proposed by Dong et al. [26], SRCNN is used to restore the low-resolution segmented ROI image to high resolution, making object shapes clear before the image is fed into the object detector to improve detection performance. SRCNN consists of three layers: patch extraction and representation, nonlinear mapping, and reconstruction. First, patches are extracted from the low-resolution segmented ROI image Y and fed into the first layer, which represents each patch as a high-dimensional vector. These vectors form a set of feature maps:

F_1(Y) = max(0, W_1 * Y + B_1)

Here, W_1 and B_1 refer to the filters and biases with learned weights, respectively, and * refers to the convolution operation. Since the feature map F_1(Y) produced by the first layer cannot by itself realize the nonlinear mapping that restores high-resolution images from low-resolution ones, the second layer maps F_1(Y) to another high-dimensional vector with added nonlinearity:

F_2(Y) = max(0, W_2 * F_1(Y) + B_2)

The high-resolution representations created through this process must then be aggregated to create the final high-resolution image. Thus, the final high-resolution image F(Y) is produced by convolving the high-resolution patches from the previous layer with a linear filter in the last layer:

F(Y) = W_3 * F_2(Y) + B_3
Figure 6 shows an example of the image resolution enhancement process. The vessels, which could not be confirmed in the original image, could be identified to some extent through the ROI selector, although they were blurred. After passing through SRCNN, the mosaic-like noise was significantly reduced, and it was confirmed that the resolution of the objects was greatly improved.
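For illustration, the three-layer SRCNN forward pass can be written directly in NumPy. The filter counts and kernel sizes below are toy values (the original network uses 64 and 32 filters with 9 x 9, 1 x 1, and 5 x 5 kernels), and the weights would come from training rather than random initialization:

```python
import numpy as np

def conv2d(x, w, b):
    """'Same' convolution: x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    H, W = x.shape[1:]
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for y in range(H):
            for xx in range(W):
                out[o, y, xx] = np.sum(xp[:, y:y + k, xx:xx + k] * w[o]) + b[o]
    return out

def srcnn_forward(y, weights):
    """Three stages: patch extraction, nonlinear mapping, reconstruction."""
    (w1, b1), (w2, b2), (w3, b3) = weights
    f1 = np.maximum(0.0, conv2d(y, w1, b1))   # F1 = max(0, W1*Y + B1)
    f2 = np.maximum(0.0, conv2d(f1, w2, b2))  # F2 = max(0, W2*F1 + B2)
    return conv2d(f2, w3, b3)                 # F  = W3*F2 + B3 (linear)
```

With 'same' convolutions the output retains the spatial size of the input ROI patch, so the enhanced patch can be passed straight to the detector.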
3.2. YOLO Detection Network
As deep learning technology advances, image recognition accuracy using CNNs is steadily increasing. Similarly, the performance of object detection, which determines the location and type of a specific object in an image, has also significantly improved. Initially, two-stage detectors such as the region-based convolutional neural network (R-CNN), Fast R-CNN, and Faster R-CNN emerged and demonstrated high accuracy. However, with the emergence of one-stage detectors such as You Only Look Once (YOLO) and the single-shot multibox detector, which focus on real-time object detection, attention has shifted toward these methods. To improve detection speed, one-stage detectors cast the bounding boxes and class probabilities in the image as a single regression problem, which increased inference speed and allowed the neural network to learn the entire process of estimating an object's type and location.
In maritime surveillance, it is important to quickly identify objects that pose a potential threat in order to develop security measures or take preemptive action. Therefore, the latest one-stage detector, YOLOv7, is adopted in this study. YOLOv7 exhibits faster and better detection performance by reducing the parameters to about 40-50% of previous networks [27]. YOLOv7 maintains the original gradient route and enables efficient learning through an extended efficient layer aggregation network (E-ELAN) block. It also reduces inference cost by efficiently decreasing parameters through re-parameterized convolution [27]. YOLO divides the input image into an S × S grid and predicts B preset bounding boxes per grid cell. If the actual and predicted bounding boxes overlap by at least a specific amount, the predicted bounding box is considered to contain an object. The probability that the predicted bounding box belongs to one of the classifiable object classes is then found as follows:

Pr(Class_i | Object) × Pr(Object) × IOU = Pr(Class_i) × IOU

Here, Pr(Object) refers to the probability that the predicted bounding box contains an object; IOU refers to the intersection over union, i.e., the overlapping region between the predicted bounding box and the actual bounding box; and Pr(Class_i | Object) refers to the probability that a specific object among the objects of interest is being classified. Finally, the bounding box with the highest class-specific confidence among the predicted B bounding boxes is selected as the bounding box for the object [28].
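The class-specific confidence computation can be sketched as follows; `best_box` and its tuple layout are illustrative helpers, not YOLOv7's actual output format:

```python
import numpy as np

def class_confidence(p_object, iou, p_class_given_object):
    """Class-specific confidence: Pr(Class_i|Object) * Pr(Object) * IOU."""
    return p_class_given_object * p_object * iou

def best_box(boxes):
    """Pick the predicted box with the highest class-specific confidence.

    boxes: list of (box, p_object, iou, class_probs) tuples, where
    class_probs holds Pr(Class_i|Object) for each class of interest.
    """
    scores = [p * i * np.max(c) for _, p, i, c in boxes]
    return boxes[int(np.argmax(scores))][0]
```

A box with a slightly lower objectness score can still win if its localization (IOU) is better, which is the point of folding IOU into the confidence.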
3.3. Weighted Box Fusion
Generally, an object detection model produces a bounding box, indicating the location of the object, and a confidence score, indicating the confidence regarding the detected object. Typically, for a single object, multiple bounding boxes with confidence scores ranging from low to high are generated. Thus, the final part of an object detection model includes removing overlapping bounding boxes, such as through non-maximum suppression (NMS). Similarly, overlapping bounding boxes are inevitably encountered when multiple sensors are used for object detection, necessitating the development of techniques with which to estimate the optimal bounding boxes.
The MODAN proposed in this study consists of the NFD, which detects large objects in the entire image, and the FFD, which finds small objects by enlarging the surroundings of the horizon, so the two independent detectors may estimate overlapping bounding boxes for the same object. In particular, as shown in
Figure 7, for large objects, the FFD's detection results based on the divided ROI images may appear as two different bounding boxes for the same object. In this scenario, because the same object is detected twice by different detection models or in different segmented images, each bounding box may possess a high confidence score; applying NMS then risks eliminating accurately detected bounding boxes, as shown in
Figure 7a. To overcome this, WBF is applied to improve prediction accuracy and obtain robust detection results by generalizing the predicted results through an ensemble of the doubly detected bounding boxes.
Given the concatenated detection results of the NFD and FFD, WBF sorts the detected bounding boxes by confidence score and then extracts the bounding boxes overlapping above a certain level by calculating the IOU between the first bounding box and all other bounding boxes. The IOU between the first bounding box b_1 and any other bounding box b_k is calculated as follows:

IOU = |b_1 ∩ b_k| / |b_1 ∪ b_k|

Here, b_1 ∩ b_k and b_1 ∪ b_k represent the intersection and union of bounding boxes b_1 and b_k, respectively. Then, a fused bounding box is generated whose location coordinates are the average of the location coordinates of the extracted overlapping bounding boxes.
The class probability of the fused bounding box is set to the highest class probability among the overlapping bounding boxes. After iterating over all bounding boxes, the optimal bounding boxes are selected through NMS [29].
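A simplified weighted-box-fusion sketch, assuming axis-aligned `[x1, y1, x2, y2]` boxes; the clustering and score-weighted averaging follow the idea described above rather than the exact reference implementation of Solovyev et al.:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def weighted_box_fusion(boxes, scores, iou_thr=0.55):
    """Fuse overlapping boxes from several detectors (simplified WBF).

    Unlike NMS, which keeps one box and discards the rest, each cluster of
    overlapping boxes is averaged, weighted by confidence, into one box.
    """
    order = np.argsort(scores)[::-1]
    clusters = []  # each cluster: list of (box, score)
    for i in order:
        b, s = np.asarray(boxes[i], float), scores[i]
        for c in clusters:
            fused = np.average([m[0] for m in c], axis=0,
                               weights=[m[1] for m in c])
            if iou(b, fused) > iou_thr:
                c.append((b, s))
                break
        else:
            clusters.append([(b, s)])
    fused_boxes, fused_scores = [], []
    for c in clusters:
        ws = [m[1] for m in c]
        fused_boxes.append(np.average([m[0] for m in c], axis=0, weights=ws))
        fused_scores.append(float(np.mean(ws)))
    return fused_boxes, fused_scores
```

Two near-duplicate detections of one ship are averaged into a single box, while a distant detection survives as its own cluster, so no correctly detected object is discarded outright.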
3.4. Proposed MODAN
Based on the multifocal object detector described above, the block diagram of the MODAN proposed in this study to detect both near-field and far-field objects located near the horizon is shown in
Figure 8. In summary, the proposed maritime object detection strategy enlarges objects near the horizon, which occupy a small portion of the entire image, by selecting and segmenting ROIs based on the detected horizon. Afterwards, the resolution is improved to convert the low-resolution images of small objects into high-quality images, and the multifocal object detector is used to reliably find objects at various focal points. Here, WBF is used to efficiently remove overlapping objects and further optimize the estimation of bounding boxes, improving detection performance. The proposed MODAN is summarized in Algorithm 1.
Algorithm 1. Proposed MODAN.
Require: Pretrained near-field detector NFD, pretrained far-field detector FFD, pretrained resolution enhancement weights W_1, W_2, W_3 and biases B_1, B_2, B_3
Input: Input image I, set of centroid vectors v_i, predefined geometric kernel B, width of input image W, maximum integer that can divide the image D, ROI range above the horizon h_a, ROI range below the horizon h_b
Output: Detected objects O
1: Q = Quantize(I, v_i) // Color quantization
2: A = (Q ⊕ B) ⊖ B // Mathematical morphology
3: Set y_l = 0, y_r = 0
4: for i = 1 to image height // Loop for horizon detection
5:   if (A(i − 1, left) == 0 && A(i, left) == 1) // Detect left side of horizon
6:     y_l ← i
7:   if (A(i − 1, right) == 0 && A(i, right) == 1) // Detect right side of horizon
8:     y_r ← i
9: y_l = mean of left candidates, y_r = mean of right candidates
10: Left endpoint of the horizon (0, y_l)
11: Right endpoint of the horizon (W − 1, y_r)
12: Angle of the tilted horizon θ = tan⁻¹((y_r − y_l)/W) // Calculate the angle of tilt
13: Rotated image I_r = R(θ)(I) // Rotate image to align horizon
14: ROI images P_1, …, P_D = Segment(I_r, h_a, h_b) // ROI selection and segmentation
15: for i = 1 to D // Loop for resolution enhancement
16:   F_1 = max(0, W_1 * P_i + B_1)
17:   F_2 = max(0, W_2 * F_1 + B_2)
18:   P_i ← W_3 * F_2 + B_3
19: for i = 1 to D // Loop for NFD and FFD
20:   O_f = FFD(P_i) // Find object in far-field
21: O_n = NFD(I) // Find object in near-field
22: O = WBF(O_f ∪ O_n) // Remove overlapping objects
4. Experimental Results
In maritime operations, detecting objects near the horizon is important for ensuring security and safety by locating potentially dangerous objects in nearby waters. However, objects near the horizon occupy a very small region of the entire image and are themselves small, making their detection with conventional object detectors challenging. Various methods have emerged that first detect the horizon and then define the surrounding area as the ROI, but their detection performance has been unsatisfactory because resolution is lost during object magnification, discarding some object information. Thus, in this study, we aimed to improve the detection of small objects near the horizon by first detecting the horizon; then enlarging the area around the horizon to include small objects that were not visible; and, finally, improving the resolution of the segmented, enlarged images.
The proposed MODAN was implemented using an NVIDIA RTX 3060 GPU and an Intel Core i7-12700 CPU. Real maritime surveillance performance was evaluated on the Singapore Maritime Dataset [30] to test the ship detection process. Network performance was evaluated using the AP, the area under the precision-recall curve, which illustrates the accuracy of the object detection network's predictions for the detected objects. Precision was calculated as TP/(TP + FP), the ratio of correctly detected instances to the total positive detections. Recall was calculated as TP/(TP + FN), the ratio of correctly detected instances to all instances that should have been detected. Here, TP, FP, and FN refer to true positives, false positives, and false negatives, respectively.
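The metrics above can be computed as follows; `average_precision` accumulates the running precision over recall increments, a common approximation of the area under the precision-recall curve:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(scores, matches, n_gt):
    """Area under the precision-recall curve for ranked detections.

    scores: confidence per detection; matches: 1 if that detection is a
    true positive, else 0; n_gt: number of ground-truth objects.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for i in order:
        tp += matches[i]
        fp += 1 - matches[i]
        recall = tp / n_gt
        precision = tp / (tp + fp)
        # Add a rectangle of height `precision` over the recall increment.
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

Because detections are ranked by confidence, a false positive late in the ranking lowers precision without adding recall, which is exactly how a high FP count drags down the AP reported in Table 1.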
The geometric kernel used in mathematical morphology was set to a matrix of ones whose size exhibited the best performance as the kernel size was varied, since the size of the input image and the average size of the objects are essential factors for performance. The size of the segmented ROI images was set to ensure that the vertical extent of small objects in the horizon region exceeded 15% of the patch, although this can vary with the size of the input image. Furthermore, the size of the input image applied to YOLOv7 after passing through the SRCNN was fixed so as to minimize the loss incurred by image resizing.
To evaluate the performance of the proposed MODAN, object detection performance comparison tests were conducted using state-of-the-art models with similar structures, and the results are summarized in
Table 1. As presented in the table, MODAN exhibited more than 7% higher AP than Wang (2022) [
27], the most common single detection model, thus verifying the validity of the proposed multifocal detection structure. In addition, MODAN, which was proposed together with the Yoneyama (2022) [
17] detection model with a multifocal structure, also detected objects that the existing single models would miss by segmenting and enlarging the horizon area, thereby achieving a higher TP than that of the single models. However, ROI segmentation can cause an object to appear in multiple segmented images, so it should be used in parallel with algorithms that remove overlapping bounding boxes, such as NMS or WBF. As a result, the Yoneyama (2022) [
17] model, which uses a multifocal structure, was able to find more objects than the existing single detection models, resulting in a high TP. However, even though NMS was used to remove duplicate bounding boxes, some correctly detected bounding boxes were also removed due to the characteristic of the NMS algorithm that only leaves the bounding boxes with the highest
. This led to an increase in false alarms, resulting in a much higher FP than other models, and at the same time, the detection performance was degraded, resulting in a significantly lower AP. The Solovyev (2021) [
29] model is a single detection model that uses WBF instead of NMS. While it has a slightly lower TP than the Wang (2022) [
27] model, which uses NMS, it also has the advantage of a lower FP. However, although it showed a lower false detection rate than MODAN, its overall AP was more than 7% lower than that of MODAN, confirming that it did not outperform MODAN. In conclusion, the proposed MODAN restored the shapes of low-resolution objects near the horizon through SRCNN-based resolution enhancement within the multifocal structure, and effectively removed overlapping bounding boxes using WBF instead of NMS. As a result, it reduced the false detection rate by more than 59% and improved the AP by more than 10% compared to the existing model with a multifocal structure. Furthermore, MODAN exhibited an AP increase of more than 7% over the existing single-detection models, verifying its improved detection performance.
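The contrast between NMS and WBF discussed above can be illustrated with a minimal NumPy sketch: NMS discards every overlapping box except the highest-confidence one, whereas WBF fuses overlapping boxes by confidence-weighted averaging. The IoU thresholds (0.5 and 0.55) are assumed values, and this simplified WBF omits details of the reference algorithm such as per-model score normalization.

```python
import numpy as np

def pair_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_thr=0.5):
    """Keep only the highest-confidence box in each overlapping group;
    every other overlapping box is discarded outright."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(int(i))
        order = [j for j in order if pair_iou(boxes[i], boxes[j]) < iou_thr]
    return keep

def fuse(boxes, scores):
    """Confidence-weighted average of a cluster of boxes."""
    w = scores / scores.sum()
    return (boxes * w[:, None]).sum(axis=0), scores.mean()

def wbf(boxes, scores, iou_thr=0.55):
    """Simplified weighted boxes fusion: greedily cluster boxes by IoU
    against the running fused box, then average each cluster, so
    overlapping detections contribute instead of being dropped."""
    clusters = []
    for i in np.argsort(scores)[::-1]:
        for cl in clusters:
            fused_box, _ = fuse(boxes[cl], scores[cl])
            if pair_iou(fused_box, boxes[i]) >= iou_thr:
                cl.append(int(i))
                break
        else:
            clusters.append([int(i)])
    return [fuse(boxes[cl], scores[cl]) for cl in clusters]
```

On two heavily overlapping detections of the same ship plus one distant ship, NMS keeps one box per ship, while WBF returns one fused box per ship whose coordinates blend both overlapping candidates; this blending is what avoids discarding correct detections outright.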
Moreover, examples of the ship detection results of the proposed MODAN and similar state-of-the-art models are shown in
Figure 9,
Figure 10,
Figure 11 and
Figure 12. In these figures, the yellow dashed line indicates the ROI, the green solid lines indicate correctly detected objects, the red solid lines indicate undetected objects, and the white solid lines indicate the ground truth. As shown in
Figure 9, a small ship, which is difficult to distinguish with the naked eye, is located next to a large ship near the horizon, denoted by the ROI. The existing single detection models of Wang (2022) [
27] and Solovyev (2021) [
29] failed to detect the small ship, while the multifocal detection models, Yoneyama (2022) [
17] and the proposed MODAN, were able to successfully detect it. In
Figure 10 and
Figure 11, similarly to the previous results, the multifocal detection model, Yoneyama (2022) [
17], was able to detect one additional ship among the two or three small ships that the single detection models missed. Nonetheless, the remaining ship was still undetected due to its low resolution, whereas MODAN detected it by increasing the resolution using SRCNN. In
Figure 12, some small ships overlap across the divided ROI images. In the Yoneyama (2022) [17] model, NMS is used to remove such duplicate detections, but owing to the characteristics of NMS, they were not handled correctly and some ships remained undetected. In contrast, MODAN detects these objects more reliably using WBF.
5. Conclusions
In maritime surveillance systems, detecting objects near the horizon plays a crucial role in ensuring the security of nearby waters by tracking the paths of moving objects at sea, detecting illegal activities, and preemptively countering or predicting potentially dangerous situations. Vision-based maritime surveillance, which is widely used for this purpose, can perform accurate analyses by identifying an object's shape, size, and color. However, its detection performance degrades for small, low-resolution objects located far from the sensor, as these objects' shapes are distorted. Thus, in this study, we proposed a horizon detector and an ROI selector as preprocessors to reliably find objects near the horizon, aiming to detect all objects at sea using the multifocal object detector. In particular, SRCNN was applied to the segmented ROI images to improve detection performance by increasing the resolution of small objects near the horizon. Comparative experiments between the proposed MODAN and similar state-of-the-art models showed that MODAN detected not only large ships at sea but also small ships near the horizon, achieving an AP more than 7% higher than that of the existing single detection models. Additionally, compared to the state-of-the-art multifocal-based detection model, MODAN improved the AP by more than 10% and reduced the false detection rate by more than 59%, demonstrating that it can detect small objects near the horizon more reliably and effectively in maritime surveillance.
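As a closing illustration of the resolution-enhancement stage referenced throughout, the original SRCNN layout is a 9-1-5 stack of convolutions with ReLU activations applied to a bicubic-upscaled input. The minimal NumPy sketch below shows only that layout; the filter counts (64/32) follow the original SRCNN paper, while the random weights and 16x16 single-channel input are illustrative assumptions, not the trained model used in MODAN.

```python
import numpy as np

def conv2d(x, w, b):
    """'Same' convolution of an (H, W, Cin) tensor with (k, k, Cin, Cout) weights."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W, _ = x.shape
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2])) + b
    return out

def srcnn(y, params):
    """SRCNN: patch extraction (9x9), nonlinear mapping (1x1),
    reconstruction (5x5); y is the bicubic-upscaled low-resolution image."""
    w1, b1, w2, b2, w3, b3 = params
    h = np.maximum(conv2d(y, w1, b1), 0)   # ReLU
    h = np.maximum(conv2d(h, w2, b2), 0)   # ReLU
    return conv2d(h, w3, b3)

# Randomly initialized weights for shape demonstration only.
rng = np.random.default_rng(0)
params = (rng.normal(0, 0.01, (9, 9, 1, 64)), np.zeros(64),
          rng.normal(0, 0.01, (1, 1, 64, 32)), np.zeros(32),
          rng.normal(0, 0.01, (5, 5, 32, 1)), np.zeros(1))
sr = srcnn(rng.random((16, 16, 1)), params)
```

Because the network operates on an already-upscaled input, the output retains the input's spatial size; only the detail is refined, which is what makes small horizon objects recoverable for the downstream detector.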