Article

Object Detection and Classification Based on YOLO-V5 with Improved Maritime Dataset

Jun-Hwa Kim, Namho Kim, Yong Woon Park and Chee Sun Won
1 Department of Electrical and Electronic Engineering, Dongguk University-Seoul, 30, Pildong-ro 1-gil, Jung-gu, Seoul 04620, Korea
2 Department of Autonomous Things Intelligence, Graduate School, Dongguk University-Seoul, 30, Pildong-ro 1-gil, Jung-gu, Seoul 04620, Korea
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2022, 10(3), 377; https://doi.org/10.3390/jmse10030377
Submission received: 17 January 2022 / Revised: 3 March 2022 / Accepted: 4 March 2022 / Published: 6 March 2022

Abstract

The Singapore Maritime Dataset (SMD) is a public dataset with annotated videos and is nearly the only such dataset available for training deep neural networks (DNNs) to recognize maritime objects. However, the ground truth of the SMD contains noisy labels and imprecisely located bounding boxes. In this paper, for the benchmarking of DNN algorithms, we correct the annotations of the SMD and present an improved version, which we have coined the SMD-Plus. We also propose augmentation techniques designed especially for the SMD-Plus. More specifically, an online transformation of training images via Copy & Paste is applied to alleviate the class-imbalance problem in the training dataset. Furthermore, the Mix-up technique is adopted in addition to the basic augmentation techniques of YOLO-V5. Experimental results show that the detection and classification performance of the modified YOLO-V5 with the SMD-Plus has improved in comparison to the original YOLO-V5. The ground truth of the SMD-Plus and our experimental results are available for download.

1. Introduction

Public image datasets such as COCO [1] and Pascal visual object classes (VOC) [2] have made a great contribution to the development of deep neural networks (DNNs) for computer vision problems [3,4,5,6,7,8]. These datasets include many different categories of objects. On the other hand, a domain-specific dataset usually contains only a relatively small number of sub-categories under a parent category. For domain-specific applications, obtaining a sufficient number of annotated images is a difficult task. Moreover, most domain-specific datasets suffer from class imbalance and noisy labels. Thus, to overcome the overfitting caused by these inherent problems of domain-specific datasets, a DNN model pre-trained on one of the public image datasets mentioned above is usually adopted and then fine-tuned on the domain-specific data.
The application areas that make use of domain-specific datasets have been expanding and now include road condition recognition [9,10], face detection [11,12], and food recognition [13,14], among others. Object recognition [15,16] in maritime environments is another important domain-specific problem for various security and safety purposes. For example, an autonomous ship equipped with an Automatic Identification System (AIS) requires safe navigation, which is achieved by detecting surrounding objects [17]. This is a difficult problem simply because the appearance of objects at sea changes dynamically due to environmental factors such as illumination, fog, rain, wind, and light reflection. In addition, depending on the viewpoint, the same ship can appear with quite different shapes. Since the ocean usually offers a wide-open view, ships on the sea can be seen at a variety of sizes and occlusions. That is, large intra-class variances in the size and shape of maritime objects make the recognition problem very challenging. To tackle these difficulties, we rely on recent advancements in DNNs. However, the immediate problem of the DNN-based approach is the lack of annotated training data in maritime environments.
Maritime video datasets with annotated bounding boxes and object labels are hardly available. Only a few published datasets have been collected specifically for object detection in maritime environments [18,19,20]. Among them, only the Singapore Maritime Dataset (SMD), introduced by Prasad et al. [20], provides sufficiently large video data with labeled bounding boxes for 10 maritime object classes. The SMD consists of onboard and onshore video shots captured by Visual-Optical (VIS) and Near Infrared (NIR) sensors, which can be used for tracking as well as detecting ships at sea. Although the SMD can be used for the training and testing of DNNs, it is hard to find completely reproducible results published with the SMD for comparative studies. This is because the SMD has the following problems. First, there are bounding boxes in the ground truth of the SMD with inaccurate object boundaries. Some of the bounding boxes are so loose that they include background along with the whole object, while others are so tight that they cover only a part of the object. Since maritime images are usually taken from a wide-open view, a faraway object can appear tiny; in this case, a small difference at the border of the bounding box can make a big difference when testing the accuracy of object detection. Second, there are incorrectly labeled classes in the ground truth of the SMD. These noisy labels may not be a big problem for distinguishing the foreground object from the background, but they certainly affect the training and testing of a DNN for the object classification problem. Third, there exists a serious class imbalance in the SMD. The class imbalance can bias the training of the DNN in favor of the majority classes and deteriorate the generalization ability of the model. Fourth, there is no proper train/test split in the original SMD.
Note that the authors of [15] split the SMD into training, validation, and testing subsets. Using these splits, they also provided benchmark results for object detection with the Mask R-CNN model. However, their benchmark results cover only object detection, with no further classification of each detected object. In fact, most of the previous research that used the dataset dealt only with object detection [15,21,22]. However, for maritime security applications such as Unmanned Surface Vehicles (USVs), we also need to identify the type of the detected object [23]. Since the original SMD includes the class labels of the objects as well as their bounding box information, we can use the SMD for both object detection and classification problems.
Although the SMD provides the class label for each object with a bounding box, as already mentioned, it still contains noisy labels. Furthermore, the split dataset provided by [15] suffers from the class-imbalance problem (e.g., no data are assigned to some object classes, such as Kayak and Swimming person, in the training subset). In this paper, to make the SMD usable as a benchmark dataset for both detection and classification tasks, we fix its imprecisely determined bounding boxes and noisy labels. To alleviate the class-imbalance problem, we discard rare classes such as ‘Swimming person’ and ‘Flying bird and plane’. In addition, we merge the ‘Boat’ and ‘Speed boat’ labels and thus propose a modified SMD (coined the SMD-Plus) with seven maritime object classes.
Hence, with the SMD-Plus dataset, we are able to provide benchmark results for the detection and classification (detection-then-classification) problem. Specifically, based on the YOLO-V5 model [24], we modify its augmentation techniques to take the maritime environment into account. An Online Copy & Paste is applied to alleviate the imbalance problem in the training process. Likewise, the original YOLO-V5 augmentation techniques, such as geometric transformation, mosaic, and Mix-up, are adjusted especially for the SMD-Plus.
The contributions of this paper can be summarized as follows:
(i)
We have improved the existing SMD dataset by removing noisy labels and fixing the bounding boxes. It is expected that the improved dataset of the SMD-Plus will be used as a benchmark dataset for the detection and classification of objects in maritime environments.
(ii)
In addition to the YOLO-V5 augmentation techniques, we proposed the Online Copy & Paste and Mix-up methods for the SMD-Plus. Our Online Copy & Paste scheme has significantly improved the classification performance for the minority classes, thus alleviating the class-imbalance problem in the SMD-Plus.
(iii)
The ground truth table for the SMD-Plus and the results of the detection and classification are open to the public and may be downloaded from the following website (accessed on 2 March 2022): https://github.com/kjunhwa/Singapore-Maritime-Dataset-Plus.

2. Related Work

2.1. Maritime Dataset

In domain-specific DNN applications, it is of vital importance to obtain a proper dataset for training. However, for some domain-specific problems, it is quite difficult to obtain publicly available datasets. Depending on the target domain, it is often expensive to collect images for specific classes and annotate them. Moreover, security and proprietary rights often prevent the owners from opening their datasets. One such domain-specific dataset is the maritime dataset. Maritime datasets can be classified into three groups [25]: (i) datasets for object detection [19], (ii) datasets for object classification [26], and (iii) datasets for both object detection and classification [20]. A dataset for object detection provides the location of each object in the image with a bounding box, while no class label is given. In a dataset for both object detection and classification, each image includes multiple objects with their bounding boxes and class labels. Finally, an image from a dataset for object classification contains only a single maritime object.
Although the SMD [20] provides the ground truth of video objects and their class labels for both object detection and classification, no detection-and-classification benchmark results have been reported for the SMD. This is because the original SMD is not quite ready for training DNN models. Moosbauer et al. [15] analyzed the SMD and proposed split sub-datasets of ‘train, validation, and test’. After applying Mask R-CNN to their split sub-datasets, they reported foreground object detection results. However, for both object detection and classification tasks, their split sub-datasets may not be appropriate for training the DNNs. Note that there certainly exist noisy labels in the SMD, which cause no problems in detection but negatively affect the DNN training for classification. Additionally, due to the class-imbalance problem of the SMD, some of the split sub-datasets in [15] have only a few, or even no, samples in certain classes of the test dataset. The SMD has been combined with other existing maritime datasets to resolve these limitations. For example, to expand the SMD, Shin et al. [22] exploited public classification datasets such as MARVEL [18] by pasting copies of the objects in MARVEL into the SMD. Furthermore, in Nalamati et al. [23], the SMD was combined with the SeaShips [19] dataset. However, these combined datasets were only used for detection. Moreover, due to the lack of dataset-combining details, it is hard to reproduce and compare the results. The Maritime Detection Classification and Tracking benchmark (MarDCT) [27] provides maritime datasets for detection, classification, and tracking separately; therefore, it is not suitable for the classification of detected objects with bounding boxes.

2.2. Object Detection Models

Although improved versions of R-CNN [3], such as Faster R-CNN [4] and Cascade R-CNN [28], were proposed to speed up inference, the two-stage architecture of the R-CNN family inherently limits the processing speed. This has motivated researchers to develop one-stage DNNs such as YOLO [29], SSD [8], and RetinaNet [7] for object detection. Unlike the R-CNN, YOLO performs classification and bounding box regression at the same time, thus reducing the processing time. To further improve accuracy and speed, the first YOLO has been refined into YOLO-V3 [6], YOLO-V4 [30], and YOLO-V5 [24]. The SSD [8] is another one-stage object detector. In place of YOLO's anchor boxes, the SSD uses predefined default boxes and achieves scale invariance by using feature maps obtained from intermediate layers of the backbone. RetinaNet [7] also adopts the one-stage framework with the focal loss, which assigns small weights to easily detectable objects and large weights to objects that are difficult to handle.
The detectors based on anchor boxes have the disadvantage of being sensitive to hyper-parameters. To solve this problem, anchor-free methods such as FCOS [31] have been proposed. However, since FCOS [31] performs pixel-wise bounding box prediction, it takes more time to execute the detection-then-classification task. Since the real-time requirement is essential for autonomous surveillance, we focus on using the fast one-stage method of YOLO-V5 [24] as the baseline object detection model.
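For illustration, a YOLO-V5 baseline of this kind can be instantiated in a few lines of Python through the public Ultralytics torch.hub entry point. The sketch below is not the training setup used in this paper; the model size, confidence threshold, and image path are illustrative choices.

```python
import torch

# Load a pretrained YOLO-V5 "small" model through the public Ultralytics
# torch.hub entry point (network access is required on first use).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.conf = 0.25  # confidence threshold; an illustrative value, not the paper's setting

# Run inference on a single maritime frame (the file name is hypothetical).
results = model("onshore_frame.jpg")
results.print()           # summary of detected boxes and classes
boxes = results.xyxy[0]   # tensor of [x1, y1, x2, y2, confidence, class]
```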

3. Improved SMD: SMD-Plus

The SMD provides high-quality videos with ground truth for 10 types of objects in marine environments. Since the ground truth of the SMD was created by non-expert volunteers, it includes some label errors and imprecise bounding boxes. These ambiguous and incorrect class labels make it difficult to use the SMD as a benchmark dataset for maritime object classification. Therefore, most of the research making use of the SMD deals only with object detection, with no classification of the detected objects. To make the SMD usable for the detection-then-classification purpose, our first task was to revise and improve its imprecise annotations.
To train a DNN for object detection, we need the location and size of the bounding boxes. Note that, unlike datasets of general objects, the background regions of sea and sky in maritime datasets such as the SMD usually occupy much larger areas of the image than the target ships. Therefore, precise bounding box annotations for small maritime objects are important, and even a small mislocation of the bounding box of a small object can make a huge difference in the training and testing of the DNNs. Figure 1 shows examples of inaccurate bounding boxes in the original SMD. More specifically, the yellow bounding boxes within the zoomed red, green, and purple boxes in the top image of Figure 1 are too loose and mislocated. These bounding boxes are refined in the bottom part of the figure.
The ground truth annotation of the SMD provides, for each maritime object, one of ten class labels as well as the location and size of its bounding box. However, there are quite a few noisy labels in the SMD. In addition, there are indistinguishable classes that need to be merged. For example, as shown in Figure 2, two ships that apparently belong to the same class are assigned the different labels ‘Speed boat’ and ‘Boat’. Therefore, in our improved SMD-Plus, we merge the two classes ‘Speed boat’ and ‘Boat’ into a single ‘Boat’ class. Another motivation for combining these two classes is that neither class alone has a sufficient number of images for training and testing.
The similar-looking ships in the top part of Figure 3b carry the two different labels ‘Speed boat’ and ‘Ferry’, and one of them must be incorrect. In the SMD, most of the ships labeled ‘Ferry’ are ones that can carry many passengers, as shown in Figure 3a. By this definition of ‘Ferry’, we correct the mislabeled ‘Ferry’ to ‘Boat’, as seen in the bottom part of Figure 3b.
Next, we point out the problem of the ‘Other’ classification in the SMD. We noticed that the SMD included a clearly identifiable ‘Person’ in the ‘Other’ class, as seen in Figure 4a, as well as blurred unidentifiable objects, as seen in Figure 4b. This makes the definition of the label ‘Other’ rather fuzzy. Therefore, we assigned the ‘Other’ classification only to unidentifiable objects, excluding rare objects such as the ‘Person’ from the class.
Since there exist no actual labeled objects for the ‘Flying bird and plane’ and ‘Swimming person’ classes in the SMD, we discarded these two classes. Therefore, putting all the above modifications together, we can summarize the criteria for our SMD revisions as follows:
(i)
‘Swimming person’ class is empty and is deleted;
(ii)
Non-ship ‘Flying bird and plane’ class is deleted;
(iii)
Visually similar classes of ‘Speed boat’ and ‘Boat’ are merged;
(iv)
Bounding boxes of the original SMD are tightened;
(v)
Some of the missing bounding boxes in ‘Kayak’ are added;
(vi)
According to our redefinitions of the ‘Ferry’ and ‘Other’ classes, some of the misclassified objects in these classes are corrected.
Our final version of the SMD, coined as SMD-Plus, is quantitatively compared with the original SMD in Table 1.
We needed to split the SMD-Plus into training and testing subsets for the DNNs. Note that the separation of the SMD into train, validation, and test subsets proposed by [15] is suitable for detection, but not for detection-then-classification. Furthermore, some of the classes in the test subset of that split were empty. Hence, we carefully re-separated the SMD video clips such that the objects of all classes are distributed as evenly as possible between the train and test subsets (see Table 2).

4. Data Augmentation for YOLO-V5

In this section, we address our detection-then-classification method based on YOLO-V5 with the SMD-Plus dataset. We focus mainly on image augmentation techniques designed especially for the maritime dataset of the SMD-Plus.
Considering the relatively small size and class-imbalance problems of the SMD-Plus, data augmentation plays an important role in alleviating overfitting when training the DNNs. As shown in Figure 5, in addition to the basic YOLO-V5 augmentation techniques such as mosaic and geometric transformation, we employ the Online Copy & Paste and Mix-up techniques. That is, to a set of four training images $\{I_1, I_2, I_3, I_4\}$, we first apply color jittering by randomly altering the brightness, hue, and saturation components of the images. Then, Copy & Paste is performed by inserting objects copied from other training images into the input images. Next, taking another set of four training images $\{J_1, J_2, J_3, J_4\}$, a random mosaic is applied to both sets $\{I_1, I_2, I_3, I_4\}$ and $\{J_1, J_2, J_3, J_4\}$. Then, the two mosaic images are geometrically transformed by translation, horizontal flip, rotation, and scaling. Finally, after the geometric transformations, the two images are fused by the Mix-up process. Among the augmentations mentioned above, Copy & Paste and Mix-up are the techniques newly added to the basic YOLO-V5 augmentations. We elaborate on these two techniques in the following subsections.
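The overall flow can be summarized by the following schematic Python sketch, in which the five callables are placeholders for the steps detailed in Sections 4.1, 4.2, and 4.3 rather than actual implementations:

```python
import random
from typing import Callable, Sequence

def augment(images_I: Sequence, images_J: Sequence,
            color_jitter: Callable, copy_paste: Callable,
            mosaic4: Callable, geometric: Callable, mixup: Callable):
    """Order of operations in Figure 5. The five callables stand in for the
    steps detailed in Sections 4.1-4.3; only their ordering is fixed here."""
    # Color jittering and Online Copy & Paste on the first set of four images
    I = [copy_paste(color_jitter(img)) for img in images_I]
    # Random mosaic on both sets of four images
    m1, m2 = mosaic4(I), mosaic4(list(images_J))
    # Geometric transforms: translation, horizontal flip, rotation, scaling
    m1, m2 = geometric(m1), geometric(m2)
    # Mix-up of the two mosaic images with a random mixing ratio (Eqs. (4)-(5))
    lam = random.uniform(0.0, 1.0)
    return mixup(m1, m2, lam)
```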

4.1. Copy & Paste Augmentation

Copy & Paste augmentation is an effective means of increasing the number of objects in the minority classes, thus alleviating the class-imbalance problem. Here, to enhance the recognition performance for small objects, we preferentially choose smaller objects to be copied. To this end, we first divide the objects in the training images into three groups: small (s), medium (m), and large (l). The criterion for the division is the rectangular area of the bounding box (see Table 3). Moreover, based on Table 1, we choose more objects from the minority classes for the Copy & Paste to mitigate the class-imbalance problem. Consequently, we first choose the class $k \in \{1, 2, \ldots, K\}$ out of the $K$ object classes with the following probability $P_{class}(k)$:

$$P_{class}(k) = \frac{w_c(k)}{\sum_{i=1}^{K} w_c(i)}, \quad k = 1, 2, \ldots, K \qquad (1)$$
where $w_c(k) = N_{min}/N_k$, $N_{min} = \min\{N_1, \ldots, N_K\}$, and $N_k$ is the number of objects in class $k$. By choosing the class to be copied according to (1), the minority classes have higher chances of being selected. Once the class $k$ is chosen by (1), we need to select the final object to be copied from one of the three size groups of small (s), medium (m), and large (l), determined according to Table 3. The probability $P_{size}(j)$ of choosing group $j \in \{s, m, l\}$ for class $k$ is given by the following equation:

$$P_{size}(j) = \frac{w_s(j)}{\sum_{i \in \{s, m, l\}} w_s(i)} \qquad (2)$$

where $w_s(j) = \min\{N_k(s), N_k(m), N_k(l)\}/N_k(j)$, and $N_k(j)$ is the number of objects of size $j \in \{s, m, l\}$ in object class $k$. Note that $P_{size}(j)$ in (2) also gives a higher probability to the minority group among small (s), medium (m), and large (l). Since the small-sized (s) group of each class usually has the smallest number of objects in the SMD-Plus, the objects in the small group have more chances of being selected than those in the m and l groups.
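As a small illustration of how these two sampling rules can be implemented, consider the following sketch. The function names are ours, the class counts in the example are taken from Table 1 (SMD-Plus), and the size counts are made-up numbers used only to illustrate the sampling.

```python
import random

def class_probs(class_counts):
    """Class-selection probabilities of Eq. (1): w_c(k) = N_min / N_k,
    normalized over all K classes, so minority classes are favored."""
    n_min = min(class_counts.values())
    w = {k: n_min / n for k, n in class_counts.items()}
    total = sum(w.values())
    return {k: wk / total for k, wk in w.items()}

def size_probs(size_counts):
    """Size-group probabilities of Eq. (2) over {'s', 'm', 'l'} for one class;
    size groups with zero objects are simply skipped."""
    nonzero = {g: n for g, n in size_counts.items() if n > 0}
    n_min = min(nonzero.values())
    w = {g: n_min / n for g, n in nonzero.items()}
    total = sum(w.values())
    return {g: wg / total for g, wg in w.items()}

# Example: class counts from Table 1 (SMD-Plus); size counts are illustrative.
p_class = class_probs({"Ferry": 3431, "Kayak": 3798, "Buoy": 3657,
                       "Sail Boat": 1926, "Boat": 14021,
                       "Vessel/Ship": 125872, "Others": 24993})
chosen_class = random.choices(list(p_class), weights=list(p_class.values()))[0]
p_size = size_probs({"s": 120, "m": 800, "l": 2500})
```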
In previous methods, Copy & Paste was executed before training as an offline pre-processing technique. As a consequence, the images pre-processed by Copy & Paste were reused for every epoch of the training process. To provide more diversified images when training the DNN, in this paper we apply Copy & Paste on the fly, yielding an Online Copy & Paste scheme. This Online Copy & Paste creates differently pasted objects for every training epoch, which allows the DNN to be trained with maritime objects of many different sizes and locations.
Next, we need to locate the position in the training image where the copied object is to be pasted, avoiding any overlap between the copied object and the existing ones. This can be performed by calculating the Intersection over Union (IoU) between the candidate bounding box for the paste and each bounding box already in the image. That is, with the equation below, we check that the IoU for the paste is equal to zero. In object detection, the IoU measures the overlapping area between the to-be-pasted bounding box $B_p$ and an existing ground-truth bounding box $B_{gt}$, divided by the area of their union:

$$IoU = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})}. \qquad (3)$$
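A minimal sketch of this placement check is given below. The retry budget of 50 attempts is an assumption of ours; the actual implementation may sample candidate positions differently.

```python
import random

def iou(box_a, box_b):
    """Intersection over Union of two boxes (x1, y1, x2, y2), as in Eq. (3)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_paste_box(obj_w, obj_h, img_w, img_h, gt_boxes, max_tries=50):
    """Randomly sample a paste location whose box has zero IoU with every
    existing ground-truth box; returns None if no such location is found."""
    for _ in range(max_tries):
        x1 = random.randint(0, img_w - obj_w)
        y1 = random.randint(0, img_h - obj_h)
        candidate = (x1, y1, x1 + obj_w, y1 + obj_h)
        if all(iou(candidate, g) == 0.0 for g in gt_boxes):
            return candidate
    return None
```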

4.2. Mix-up Augmentation

The Mix-up technique [32] is a means of generating a new image by the weighted linear interpolation of two images and their labels. It is known to be effective for mislabeled data because the labels of the two images are mixed, just as their images are. More specifically, for given input images and label pairs $(x_i, y_i)$ and $(x_j, y_j)$ from the training data, the Mix-up can be implemented as follows:

$$\bar{x} = \lambda x_i + (1 - \lambda) x_j \qquad (4)$$

$$\bar{y} = \lambda y_i + (1 - \lambda) y_j \qquad (5)$$

where $(\bar{x}, \bar{y})$ is the Mix-up output and $\lambda \in [0, 1]$ is the mixing ratio.
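For reference, a minimal NumPy sketch of this operation for detection data is shown below. The Beta-distribution parameter and the handling of the box labels (keeping the boxes of both images rather than interpolating them) are assumptions on our side, since Eqs. (4) and (5) are stated for classification-style labels.

```python
import numpy as np

def mixup(img_a, boxes_a, img_b, boxes_b, alpha=32.0):
    """Mix-up of Eqs. (4)-(5): blend two equally sized images with a ratio
    lambda drawn from a Beta(alpha, alpha) distribution and keep the boxes
    of both images (the alpha value here is an assumption)."""
    lam = float(np.random.beta(alpha, alpha))   # lambda in [0, 1]
    mixed = (lam * img_a.astype(np.float32)
             + (1.0 - lam) * img_b.astype(np.float32)).astype(img_a.dtype)
    boxes = list(boxes_a) + list(boxes_b)       # detection labels are concatenated
    return mixed, boxes, lam
```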

4.3. Basic Augmentations from YOLO-V5

We also use the basic geometric transformations of YOLO-V5, such as flipping, rotation, translation, and scaling. Another basic augmentation adopted from YOLO-V5 is the mosaic augmentation, which was first introduced in [30]. The mosaic augmentation combines four training images into a single training image so that four different contexts appear at once. According to [30], the mosaic augmentation allows the model to learn how to identify objects at a smaller-than-usual scale, and it is useful for training because it greatly reduces the need for large mini-batch sizes.
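A simplified sketch of the 2x2 tiling at the core of the mosaic augmentation is given below; the canvas size, padding value, and the omission of label remapping are simplifications of ours, not the actual YOLO-V5 implementation, which also jitters the mosaic center and remaps the bounding boxes of each tile.

```python
import numpy as np

def mosaic4(imgs, out_size=1280):
    """Tile four images into one 2x2 mosaic canvas (simplified sketch of the
    mosaic augmentation of [30]; bounding-box remapping is omitted here)."""
    cell = out_size // 2
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # gray padding
    for k, img in enumerate(imgs[:4]):
        h, w = min(img.shape[0], cell), min(img.shape[1], cell)
        y0, x0 = (k // 2) * cell, (k % 2) * cell
        canvas[y0:y0 + h, x0:x0 + w] = img[:h, :w]
    return canvas
```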

5. Experiment Results

As explained in the previous section, we revised the SMD to obtain the SMD-Plus. As a tool for modifying the ground truth of the SMD, we used the MATLAB ImageLabeler. The ImageLabeler provides an application interface that makes it easy to step through video clips and attach an annotation to each object.
Our experiments were conducted on an Intel i7-9900 processor with 32 GB of main memory and an NVIDIA GeForce RTX 2080 Ti GPU. Based on the YOLO-V5, we trained the model with the SMD-Plus. The hyper-parameters for the YOLO-V5 training are as follows: the stochastic gradient descent (SGD) optimizer with a momentum of 0.9, a learning rate of 0.01, and a batch size of 8. We also used the following values for the augmentation parameters (collected into a single configuration sketch after the list):
  • For color jittering: hue ranges from 0 to 0.015; saturation, from 0 to 0.7; and brightness, from 0 to 0.4;
  • The probability of generating a mosaic is 0.5;
  • Translate shifts range from 0 to 0.1;
  • The probability of a horizontal flip is 0.5;
  • Random rotation within angles from −10 to +10 degrees;
  • Random scaling in the range of 0.5×∼1.5×.
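Collected in one place, the settings above can be written as a single configuration dictionary. The key names below follow the Ultralytics YOLO-V5 hyperparameter-file convention and are our assumption, not the authors' released training configuration; values not stated above are omitted.

```python
# Training and augmentation settings from Section 5, written as one Python
# dict (the key names are assumed to follow the YOLO-V5 hyp-file convention).
hyperparams = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "lr0": 0.01,        # initial learning rate
    "batch_size": 8,
    "hsv_h": 0.015,     # hue jitter range
    "hsv_s": 0.7,       # saturation jitter range
    "hsv_v": 0.4,       # brightness (value) jitter range
    "mosaic": 0.5,      # probability of generating a mosaic
    "translate": 0.1,   # translation shift range
    "fliplr": 0.5,      # probability of a horizontal flip
    "degrees": 10.0,    # random rotation within +/- 10 degrees
    "scale": 0.5,       # random scaling gain, i.e., 0.5x to 1.5x
}
```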
For the sake of comparison, we also conducted experiments with YOLO-V4 [30] using the same augmentation parameters listed above. Table 4 compares the detection performance on the SMD and the SMD-Plus. As shown in Table 4, the detection performance on the SMD-Plus is more than 10% higher than on the SMD for YOLO-V4 and all versions of YOLO-V5. Here, as in the previous benchmarks [15,21,22], only foreground-versus-background detection was performed. Note that foreground/background detection alone can only evaluate the accuracy of the bounding boxes, not the recognition accuracy for the class labels. Therefore, we use the results of Table 4 to verify the bounding box accuracy of the SMD-Plus.
Table 5 shows the results of the detection-then-classification task for the train/test split of the SMD suggested by [15]. In this train/test split, however, there exist classes with no test data. Therefore, the corresponding columns c1, c5, c7, and c10 are blank. The non-empty classes of the test set in [15] are ‘Speed boat’, ‘Vessel/ship’, ‘Ferry’, ‘Buoy’, ‘Others’, and ‘Flying bird and plane’. Fixing the IoU threshold at 0.5, the mAPs over the six non-empty classes are 0.186 for YOLO-V4, 0.22 for YOLO-V5-S, 0.182 for YOLO-V5-M, and 0.304 for YOLO-V5-L.
Next, Table 6 shows the results of the detection-then-classification task for the SMD-Plus. In the table, we can evaluate the effect of the Copy & Paste scheme. More specifically, the detection-then-classification results for ‘None’ (no Copy & Paste), ‘Online Copy & Paste’, and ‘Offline Copy & Paste’ are compared in Table 6. As one can see in the table, our proposed ‘Online Copy & Paste’ outperformed the ‘None’ and ‘Offline Copy & Paste’ settings for YOLO-V4 and all versions of YOLO-V5. Furthermore, the proposed ‘Online Copy & Paste’ proved to be particularly effective for minority classes such as ‘Kayak’ (column c5).

6. Conclusions

In this paper, we provided an improved SMD-Plus dataset for future research works on maritime environments. We also adjusted the augmentation techniques of the original YOLO-V5 for the SMD-Plus. In particular, the proposed ‘Online Copy & Paste’ method was proven to be effective in alleviating the class-imbalance problem. Our SMD-Plus dataset and the modified YOLO-V5 are open to the public for future research. We hope that our detection-then-classification model of YOLO-V5 based on the SMD-Plus serves as a benchmark for future research and development initiatives for automated surveillance in maritime environments.

Author Contributions

Conceptualization, C.S.W. and J.-H.K.; methodology, J.-H.K.; software, N.K.; validation, J.-H.K. and N.K.; formal analysis, C.S.W.; investigation, J.-H.K. and N.K.; resources, N.K.; data curation, J.-H.K.; writing—original draft preparation, J.-H.K. and N.K.; writing—review and editing, C.S.W.; visualization, J.-H.K. and N.K.; supervision, C.S.W.; project administration, Y.W.P.; funding acquisition, Y.W.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Future Challenge Program through the Agency for Defense Development funded by the Defense Acquisition Program Administration.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Acknowledgments

For C.S.W., this work was supported by the Dongguk University Research Fund of 2021.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
2. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
3. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 23–28 June 2014; pp. 580–587.
4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497.
5. Cai, Z.; Vasconcelos, N. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1483–1498.
6. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
7. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
8. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
9. Shim, S.; Cho, G.C. Lightweight semantic segmentation for road-surface damage recognition based on multiscale learning. IEEE Access 2020, 8, 102680–102690.
10. Yuan, Y.; Islam, M.S.; Yuan, Y.; Wang, S.; Baker, T.; Kolbe, L.M. EcRD: Edge-cloud computing framework for smart road damage detection and warning. IEEE Internet Things J. 2020, 8, 12734–12747.
11. Li, X.; Lai, S.; Qian, X. DBCFace: Towards pure convolutional neural network face detection. IEEE Trans. Circuits Syst. Video Technol. 2021; early access.
12. Zhang, S.; Chi, C.; Lei, Z.; Li, S.Z. RefineFace: Refinement neural network for high performance face detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4008–4020.
13. Zhao, H.; Yap, K.H.; Kot, A.C.; Duan, L. JDNet: A joint-learning distilled network for mobile visual food recognition. IEEE J. Sel. Top. Signal Process. 2020, 14, 665–675.
14. Won, C.S. Multi-scale CNN for fine-grained image recognition. IEEE Access 2020, 8, 116663–116674.
15. Moosbauer, S.; Konig, D.; Jakel, J.; Teutsch, M. A benchmark for deep learning based object detection in maritime environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 916–925.
16. Liu, T.; Pang, B.; Zhang, L.; Yang, W.; Sun, X. Sea surface object detection algorithm based on YOLO v4 fused with reverse depthwise separable convolution (RDSC) for USV. J. Mar. Sci. Eng. 2021, 9, 753.
17. Gao, M.; Shi, G.; Li, S. Online prediction of ship behavior with automatic identification system sensor data using bidirectional long short-term memory recurrent neural network. Sensors 2018, 18, 4211.
18. Gundogdu, E.; Solmaz, B.; Yücesoy, V.; Koc, A. MARVEL: A large-scale image dataset for maritime vessels. In Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 165–180.
19. Shao, Z.; Wu, W.; Wang, Z.; Du, W.; Li, C. SeaShips: A large-scale precisely annotated dataset for ship detection. IEEE Trans. Multimed. 2018, 20, 2593–2604.
20. Prasad, D.K.; Rajan, D.; Rachmawati, L.; Rajabally, E.; Quek, C. Video processing from electro-optical sensors for object detection and tracking in a maritime environment: A survey. IEEE Trans. Intell. Transp. Syst. 2017, 18, 1993–2016.
21. Zhang, Y.; Li, Q.Z.; Zang, F.N. Ship detection for visual maritime surveillance from non-stationary platforms. Ocean Eng. 2017, 141, 53–63.
22. Shin, H.C.; Lee, K.I.; Lee, C.E. Data augmentation method of object detection for deep learning in maritime image. In Proceedings of the 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), Busan, Korea, 19–22 February 2020; pp. 463–466.
23. Nalamati, M.; Sharma, N.; Saqib, M.; Blumenstein, M. Automated monitoring in maritime video surveillance system. In Proceedings of the 2020 35th International Conference on Image and Vision Computing New Zealand (IVCNZ), Wellington, New Zealand, 25–27 November 2020; pp. 1–6.
24. YOLO-V5. Available online: ultralytics/yolov5:V3.0 (accessed on 13 August 2020).
25. Qiao, D.; Liu, G.; Lv, T.; Li, W.; Zhang, J. Marine vision-based situational awareness using discriminative deep learning: A survey. J. Mar. Sci. Eng. 2021, 9, 397.
26. Zhao, R.; Wang, J.; Zheng, X.; Wen, J.; Rao, L.; Zhao, J. Maritime visible image classification based on double transfer method. IEEE Access 2020, 8, 166335–166346.
27. Bloisi, D.D.; Iocchi, L.; Pennisi, A.; Tombolini, L. ARGOS-Venice boat classification. In Proceedings of the 2015 12th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Karlsruhe, Germany, 25–28 August 2015; pp. 1–6.
28. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
29. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
30. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
31. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 9627–9636.
32. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412.
Figure 1. The original bounding boxes of the original SMD in the top image are refined in those at the bottom.
Figure 2. Integration of ‘Speed boat’ into ‘Boat’.
Figure 3. Example of noisy label correction in the SMD: (a) A typical image for ‘Ferry’, (b) Noisy labels in the top and their corrected ones at the bottom.
Figure 4. Examples of the ‘Other’ class in the SMD: (a) Object deleted from the ‘Other’ class, (b) Objects remaining in the ‘Other’ class.
Figure 5. Flow of image augmentations for our YOLO-V5.
Table 1. The number of objects in each class label for the original SMD and the SMD-Plus.

SMD Class | Objects (#) | SMD-Plus Class | Objects (#)
Boat | 1499 | Boat | 14,021
Speed Boat | 7961 | (merged into Boat) | -
Vessel/Ship | 117,436 | Vessel/Ship | 125,872
Ferry | 8588 | Ferry | 3431
Kayak | 4308 | Kayak | 3798
Buoy | 3065 | Buoy | 3657
Sail Boat | 1926 | Sail Boat | 1926
Others | 12,564 | Others | 24,993
Flying bird and plane | 650 | Removed | -
Swimming Person | 0 | Removed | -
Table 2. Proposed train and test split for VIS video in the SMD (c1: Ferry, c2: Buoy, c3: Vessel_ship, c4: Boat, c5: Kayak, c6: Sail_boat, c7: Other).
Table 2. Proposed train and test split for VIS video in the SMD (c1: Ferry, c2: Buoy, c3: Vessel_ship, c4: Boat, c5: Kayak, c6: Sail_boat, c7: Other).
SetSubsetVideo NameConditionNumber of Objects
c1c2c3c4c5c6c7Total
Train (37)OnShore (28)MVI_1451Hazy329025243370003190
MVI_1452Hazy001020003403391609
MVI_1470Daylight0266186230200202450
MVI_1471Daylight0299172343300582513
MVI_1478Daylight00143147704775162901
MVI_1479Daylight0082423700571118
MVI_1481Daylight040912271002004093047
MVI_1482Daylight001362105900242445
MVI_1483Daylight008970000897
MVI_1484Daylight006876870013742748
MVI_1485Daylight01048321040001040
MVI_1486Daylight063050326300006292
MVI_1578Dark/twilight0030300005053535
MVI_1582Dark/twilight007560540005408640
MVI_1583Dark/twilight00251050200973109
MVI_1584Dark/twilight00645688100322810,565
MVI_1609Daylight505055554433115050510,123
MVI_1610Daylight001086974054302603
MVI_1619Daylight0023650004732838
MVI_1612Daylight002069154002612484
MVI_1617Daylight00430900021636472
MVI_1620Daylight00200800011513159
MVI_1622Daylight21406180002361068
MVI_1623Daylight5220152800010443094
MVI_1624Daylight4310148200001913
MVI_1625Daylight00506600046949760
MVI_1626Daylight00285400026055459
MVI_1627Daylight002975595008134383
OnBoard (9)MVI_0788Daylight007960000796
MVI_0789Daylight00881190011218
MVI_0790Daylight01470500897
MVI_0792Daylight00604000100704
MVI_0794Daylight292000000292
MVI_0795Daylight510000000510
MVI_0796Daylight005040000504
MVI_0797Daylight0011290001131242
MVI_0801Daylight005962750043914
Test (14)OnShore (12)MVI_1469Daylight06003600941006005741
MVI_1474Daylight0133535608900035609345
MVI_1587Dark/twilight006000600005867186
MVI_1592Dark/twilight0028500683003533
MVI_1613Daylight0057500009046654
MVI_1614Daylight005464582009346980
MVI_1615Dark/twilight003277005665664409
MVI_1644Daylight0010080007561764
MVI_1645Daylight00321000003210
MVI_1646Daylight0046100003734533
MVI_1448Hazy16503624159000195398
MVI_1640Daylight30201756000382096
OnBoard (2)MVI_0799Daylight161037900040580
MVI_0804Daylight004840009801464
Table 3. The size criterion for grouping small, medium, and large objects.

Object Group | Min Rectangle Area | Max Rectangle Area
Small object | 0 × 0 | 32 × 32
Medium object | 32 × 32 | 96 × 96
Large object | 96 × 96 | unbounded (∞)
Table 4. Comparison of foreground and background detection of the SMD and the SMD-Plus. mAP(0.5) represents the mean average precision (mAP) for IoU = 0.5, while mAP(0.5:0.95) is the averaged mAP for IoU threshold values increasing from 0.5 to 0.95 in steps of 0.05.

Dataset | Network | mAP(0.5) | mAP(0.5:0.95)
SMD | YOLO-V4 | 0.704 | 0.297
SMD | YOLO-V5-S | 0.772 | 0.386
SMD | YOLO-V5-M | 0.750 | 0.403
SMD | YOLO-V5-L | 0.766 | 0.407
SMD-Plus | YOLO-V4 | 0.847 | 0.428
SMD-Plus | YOLO-V5-S | 0.898 | 0.522
SMD-Plus | YOLO-V5-M | 0.867 | 0.528
SMD-Plus | YOLO-V5-L | 0.878 | 0.527
Table 5. Detection-then-classification results for the SMD dataset: c1: Boat, c2: Speed Boat, c3: Vessel/ship, c4: Ferry, c5: Kayak, c6: Buoy, c7: Sail Boat, c8: Others, c9: Flying bird and plane, c10: Swimming Person.

Dataset | Network | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 | c10 | mAP(0.5) | mAP(0.5:0.95)
SMD | YOLO-V4 | - | 0.0205 | 0.657 | 0.271 | - | 0.148 | - | 0.00223 | 0.000 | - | 0.186 | 0.0807
SMD | YOLO-V5-S | - | 0.0285 | 0.657 | 0.249 | - | 0.379 | - | 0.00671 | 0.000 | - | 0.22 | 0.0903
SMD | YOLO-V5-M | - | 0.0627 | 0.706 | 0.249 | - | 0.0538 | - | 0.0213 | 0.000 | - | 0.182 | 0.0817
SMD | YOLO-V5-L | - | 0.0879 | 0.678 | 0.357 | - | 0.594 | - | 0.11 | 0.000 | - | 0.304 | 0.128
Table 6. Detection-then-classification results for the SMD-Plus dataset: c1: Ferry, c2: Buoy, c3: Vessel_ship, c4: Boat, c5: Kayak, c6: Sail_boat, c7: Others. Columns P and R represent the precision and the recall performance, respectively, for IoU = 0.5.

Dataset | Copy & Paste | Network | c1 | c2 | c3 | c4 | c5 | c6 | c7 | P(0.5) | R(0.5) | mAP(0.5) | mAP(0.5:0.95)
SMD-Plus | None | YOLO-V4 | 0.160 | 0.622 | 0.868 | 0.632 | 0.00995 | 0.995 | 0.274 | 0.476 | 0.566 | 0.509 | 0.258
SMD-Plus | None | YOLO-V5-S | 0.372 | 0.691 | 0.827 | 0.569 | 0.00573 | 0.995 | 0.089 | 0.716 | 0.517 | 0.507 | 0.254
SMD-Plus | None | YOLO-V5-M | 0.588 | 0.882 | 0.816 | 0.615 | 0.00063 | 0.97 | 0.111 | 0.741 | 0.513 | 0.569 | 0.298
SMD-Plus | None | YOLO-V5-L | 0.673 | 0.789 | 0.846 | 0.571 | 0.0123 | 0.995 | 0.131 | 0.803 | 0.505 | 0.574 | 0.286
SMD-Plus | Online | YOLO-V4 | 0.172 | 0.539 | 0.868 | 0.721 | 0.114 | 0.995 | 0.243 | 0.486 | 0.621 | 0.522 | 0.308
SMD-Plus | Online | YOLO-V5-S | 0.471 | 0.864 | 0.869 | 0.549 | 0.162 | 0.995 | 0.123 | 0.650 | 0.536 | 0.576 | 0.291
SMD-Plus | Online | YOLO-V5-M | 0.588 | 0.706 | 0.842 | 0.607 | 0.259 | 0.991 | 0.123 | 0.709 | 0.486 | 0.588 | 0.338
SMD-Plus | Online | YOLO-V5-L | 0.714 | 0.806 | 0.828 | 0.582 | 0.232 | 0.995 | 0.147 | 0.811 | 0.534 | 0.615 | 0.33
SMD-Plus | Offline | YOLO-V4 | 0.217 | 0.445 | 0.881 | 0.647 | 0.108 | 0.995 | 0.172 | 0.481 | 0.610 | 0.495 | 0.284
SMD-Plus | Offline | YOLO-V5-S | 0.475 | 0.386 | 0.887 | 0.603 | 0.0985 | 0.994 | 0.152 | 0.582 | 0.482 | 0.514 | 0.291
SMD-Plus | Offline | YOLO-V5-M | 0.49 | 0.809 | 0.852 | 0.603 | 0.0592 | 0.995 | 0.169 | 0.724 | 0.788 | 0.568 | 0.309
SMD-Plus | Offline | YOLO-V5-L | 0.618 | 0.789 | 0.847 | 0.667 | 0.0319 | 0.995 | 0.231 | 0.688 | 0.541 | 0.597 | 0.316
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
