1. Introduction
Public image datasets such as COCO [1] and Pascal Visual Object Classes (VOC) [2] have contributed greatly to the development of deep neural networks (DNNs) for computer vision problems [3,4,5,6,7,8]. These datasets include many different categories of objects. A domain-specific dataset, on the other hand, usually contains only a relatively small number of sub-categories under a parent category. For domain-specific applications, obtaining a sufficient number of annotated images is difficult. Moreover, most domain-specific datasets suffer from class imbalance and noisy labels. Thus, to overcome the overfitting caused by these inherent problems, a DNN model pre-trained on one of the public image datasets mentioned above is usually adopted and then fine-tuned on the domain-specific dataset.
The application areas that make use of domain-specific datasets have been expanding and now include road condition recognition [9,10], face detection [11,12], and food recognition [13,14], among others. Object recognition [15,16] in maritime environments is another important domain-specific problem for various security and safety purposes. For example, an autonomous ship equipped with an Automatic Identification System (AIS) requires safe navigation, which is achieved by detecting surrounding objects [17]. This is a difficult problem because the appearance of objects at sea changes dynamically due to environmental factors such as illumination, fog, rain, wind, and light reflection. In addition, depending on the viewpoint, the same ship can appear with quite different shapes. Since the ocean usually offers a wide-open view, ships can be seen at a variety of sizes and with various occlusions. That is, large intra-class variances in the size and shape of maritime objects make the recognition problem very challenging. To tackle these difficulties, we rely on recent advances in DNNs. However, the immediate problem with a DNN-based approach is the lack of annotated training data for maritime environments.
Maritime video datasets with annotated bounding boxes and object labels are hardly available. Only a few published datasets exist, collected especially for object detection in maritime environments [18,19,20]. Among them, only the Singapore Maritime Dataset (SMD), introduced by Prasad et al. [20], provides sufficiently large video data with labeled bounding boxes for 10 maritime object classes. The SMD consists of onboard and onshore video shots captured by Visual-Optical (VIS) and Near Infrared (NIR) sensors, which can be used for tracking as well as detecting ships at sea. Although the SMD can be used for training and testing DNNs, it is hard to find completely reproducible results published with the SMD for comparative studies. This is because the SMD has the following problems. First, some bounding boxes in the ground truth of the SMD have inaccurate object boundaries: some are too loose, enclosing background as well as the whole object, while others are too tight, covering only part of the object. Since maritime images are usually taken from a wide-open view, a faraway object can appear tiny, and a small difference at the border of its bounding box can make a big difference when testing detection accuracy. Second, there are incorrectly labeled classes in the ground truth of the SMD. These noisy labels may not be a big problem for distinguishing foreground objects from background, but they certainly affect the training and testing of a DNN for object classification. Third, there is a serious class imbalance in the SMD. Class imbalance can bias the training of a DNN in favor of the majority classes and deteriorate the generalization ability of the model. Fourth, there is no proper train/test split in the original SMD.
Note that in [15], the authors split the SMD into training, validation, and testing subsets. Using this split, they also provided benchmark results for object detection via the Mask R-CNN model. However, their benchmark results covered only object detection, with no further classification of each detected object. In fact, most previous research using the dataset dealt only with object detection [15,21,22]. However, for maritime security applications such as Unmanned Surface Vehicles (USVs), we also need to identify the type of the detected object [23]. Since the original SMD includes class labels as well as bounding box information, it can be used for both object detection and classification.
Although the SMD provides a class label for each object with a bounding box, as already mentioned, there are still noisy labels. Furthermore, the split provided by [15] suffers from the class-imbalance problem (e.g., no data are assigned to some object classes, such as Kayak and Swimming Person, in the training subset). In this paper, to use the SMD as a benchmark dataset for both detection and classification tasks, we fix its imprecise bounding boxes and noisy labels. To alleviate the class-imbalance problem, we discard rare classes such as ‘Swimming person’ and ‘Flying bird and plane’. In addition, we merge the ‘Boat’ and ‘Speed boat’ labels and thus propose a modified SMD (coined SMD-Plus) with seven maritime object classes.
With the SMD-Plus dataset, we are able to provide benchmark results for the detection and classification (detection-then-classification) problem. That is, based on the YOLO-V5 model [24], we modify its augmentation techniques to account for maritime environments. More specifically, an Online Copy & Paste is applied to alleviate the imbalance problem during training. Likewise, the original YOLO-V5 augmentation techniques, such as geometric transformation, mosaic, and mix-up, are adjusted especially for the SMD-Plus.
The contributions of this paper can be summarized as follows:
- (i)
We have improved the existing SMD by removing noisy labels and fixing imprecise bounding boxes. We expect the resulting SMD-Plus to serve as a benchmark dataset for the detection and classification of objects in maritime environments.
- (ii)
In addition to the basic YOLO-V5 augmentation techniques, we propose Online Copy & Paste and Mix-up methods for the SMD-Plus. Our Online Copy & Paste scheme significantly improves the classification performance for the minority classes, thus alleviating the class-imbalance problem in the SMD-Plus.
- (iii)
3. Improved SMD: SMD-Plus
The SMD provides high-quality videos with ground truth for 10 types of objects in marine environments. Since the ground truth of the SMD was created by non-expert volunteers, it includes some label errors and imprecise bounding boxes. These ambiguous and incorrect class labels make it difficult to use the SMD as a benchmark dataset for maritime object classification. Consequently, most research making use of the SMD deals only with object detection, without classification of the detected objects. To use the SMD for detection-then-classification, our first task was to revise and improve its imprecise annotations.
To train a DNN for object detection, we need the location and size of the bounding boxes. Note that unlike datasets with general objects, the background regions of sea and sky in maritime datasets such as the SMD usually occupy much larger areas of the image than the target ships. Therefore, precise bounding box annotations for small maritime objects are important, and even a small mislocation of a small object's bounding box can make a huge difference in the training and testing of DNNs.
Figure 1 shows examples of inaccurate bounding boxes in the original SMD. More specifically, the yellow bounding boxes within the zoomed red, green, and purple boxes in the top image of Figure 1 are too loose and mislocated. These bounding boxes are refined in the bottom part of the figure.
The ground truth annotation of the SMD provides, for each maritime object, one of ten class labels as well as the location and size of its bounding box. However, there are quite a few noisy labels in the SMD. In addition, there are indistinguishable classes that need to be merged. For example, as shown in Figure 2, two ships from an apparently identical class are assigned the different labels of ‘Speed boat’ and ‘Boat’. Therefore, in our improved SMD-Plus, we merge the two classes of ‘Speed boat’ and ‘Boat’ into a single ‘Boat’ class. Another motivation for combining these two classes is that the number of images for each class alone is not sufficient for training and testing.
The similar-looking ships in the top part of Figure 3b have the two different labels of ‘Speed boat’ and ‘Ferry’, and one of them must be incorrect. In the SMD, most ships labeled ‘Ferry’ are ones that can carry many passengers, as shown in Figure 3a. By this definition of ‘Ferry’, we can correct the class label from ‘Ferry’ to ‘Boat’, as seen in the bottom part of Figure 3b.
Next, we point out the problem of the ‘Other’ class in the SMD. We noticed that the SMD placed a clearly identifiable ‘Person’ in the ‘Other’ class, as seen in Figure 4a, alongside blurred unidentifiable objects, as seen in Figure 4b. This makes the definition of the ‘Other’ label rather fuzzy. Therefore, we assign the ‘Other’ label only to unidentifiable objects, excluding rare but identifiable objects such as the ‘Person’ from the class.
Since no actual labeled objects exist for the ‘Flying bird and plane’ and ‘Swimming person’ classes in the SMD, we discarded these two classes. Putting all the above modifications together, the criteria for our SMD revisions can be summarized as follows:
- (i)
‘Swimming person’ class is empty and is deleted;
- (ii)
Non-ship ‘Flying bird and plane’ class is deleted;
- (iii)
Visually similar classes of ‘Speed boat’ and ‘Boat’ are merged;
- (iv)
Bounding boxes of the original SMD are tightened;
- (v)
Some of the missing bounding boxes in ‘Kayak’ are added;
- (vi)
According to our redefinitions of the ‘Ferry’ and ‘Other’ classes, misclassified objects in these classes are corrected.
Our final version of the SMD, coined SMD-Plus, is quantitatively compared with the original SMD in Table 1.
We needed to split the SMD-Plus into training and testing subsets for the DNNs. Note that the separation of the SMD into train, validation, and test subsets proposed by [15] is suitable for detection, but not for detection-then-classification. Furthermore, some of the classes in that test subset were empty. Hence, we carefully re-split the SMD video clips such that all classes are distributed as evenly as possible over both the train and test subsets (see Table 2).
4. Data Augmentation for YOLO-V5
In this section, we describe our detection-then-classification method based on YOLO-V5 and the SMD-Plus dataset. We focus mainly on image augmentation techniques designed especially for this maritime dataset.
Considering the relatively small size of, and class imbalance in, the SMD-Plus, data augmentation plays an important role in alleviating overfitting when training the DNNs. As shown in Figure 5, in addition to the basic YOLO-V5 augmentation techniques such as mosaic and geometric transformation, we employ the Online Copy & Paste and Mix-up techniques. That is, to a first set of four training images, we apply color jittering by randomly altering the brightness, hue, and saturation of the images. Then, Copy & Paste is performed by inserting objects copied from other training images into the input images. Next, taking a second set of four training images, a random mosaic is applied to each of the two sets. The two mosaic images are then geometrically transformed by translation, horizontal flipping, rotation, and scaling. Finally, after the geometric transformations, the two images are fused by the Mix-up process. Among the augmentations mentioned above, Copy & Paste and Mix-up are newly adopted on top of the basic YOLO-V5 augmentations. We elaborate on these two techniques in the following subsections.
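The pipeline described above can be sketched as follows. This is a minimal illustration, not the actual YOLO-V5 code: the `aug` object and its method names (`color_jitter`, `copy_paste`, `mosaic`, `geometric`, `mixup`) are hypothetical stand-ins for the individual augmentations.

```python
def augment_batch(images, aug):
    """Sketch of the augmentation pipeline described above.

    `images` holds eight training images; `aug` is a hypothetical object
    exposing the individual augmentations (color jitter, Copy & Paste,
    mosaic, geometric transforms, and Mix-up).
    """
    set1, set2 = images[:4], images[4:8]
    # 1. Color jittering, then Online Copy & Paste on the first set.
    set1 = [aug.color_jitter(im) for im in set1]
    set1 = [aug.copy_paste(im) for im in set1]
    # 2. Random mosaic on each set of four images.
    m1 = aug.mosaic(set1)
    m2 = aug.mosaic(set2)
    # 3. Geometric transforms: translation, horizontal flip, rotation, scaling.
    m1, m2 = aug.geometric(m1), aug.geometric(m2)
    # 4. Fuse the two mosaic images with Mix-up.
    return aug.mixup(m1, m2)
```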
4.1. Copy & Paste Augmentation
Copy & Paste augmentation is an effective means of increasing the number of objects in the minority classes, thus alleviating the class-imbalance problem. Here, to enhance the recognition performance for small objects, we choose smaller objects to be copied whenever possible. To this end, we first divide the objects in the training images into three groups: small ($s$), medium ($m$), and large ($l$). The criterion for the division is the area of the bounding box rectangle (see Table 3). Moreover, guided by Table 1, we choose more objects from the minority classes for the Copy & Paste to mitigate the class-imbalance problem. Consequently, we first choose the class $k$ out of the $K$ object classes with the following probability $p_k$:

$$p_k = \frac{w_k}{\sum_{j=1}^{K} w_j}, \qquad w_k = \frac{1}{n_k}, \tag{1}$$

where $n_k$ is the number of objects in class $k$. By choosing the class of the object to be copied according to (1), the minority classes have higher chances of being selected. Once class $k$ is chosen by (1), we select the final object to be copied from one of the three size groups small ($s$), medium ($m$), and large ($l$), determined according to Table 3. The probability $q_g^k$ of choosing group $g \in \{s, m, l\}$ for class $k$ is given by the following equation:

$$q_g^k = \frac{1 / n_g^k}{\sum_{g' \in \{s, m, l\}} 1 / n_{g'}^k}, \tag{2}$$

where $n_g^k$ is the number of objects of size $g$ in class $k$. Note that $q_g^k$ in (2) also gives a higher probability to the minority group among small ($s$), medium ($m$), and large ($l$). Since the small ($s$) group of each class usually contains the fewest objects in the SMD-Plus, the objects in the small group have more chances of being selected than those in the $m$ and $l$ groups.
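The inverse-frequency sampling of (1) and (2) can be sketched as follows. This is a minimal illustration assuming per-class and per-size-group object counts are available; the function names are ours, not from the original implementation.

```python
import random

def inverse_freq_probs(counts):
    """Turn raw object counts into selection probabilities that favor
    the minority entries (inverse-frequency weighting)."""
    weights = {key: 1.0 / n for key, n in counts.items() if n > 0}
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}

def sample_copy_source(class_counts, size_counts, rng=random):
    """Pick a (class, size group) pair for Copy & Paste.

    class_counts: {class_name: n_k}, total objects per class.
    size_counts:  {class_name: {'s': n, 'm': n, 'l': n}}, counts per size group.
    """
    # Class choice in the spirit of (1): minority classes get higher probability.
    p = inverse_freq_probs(class_counts)
    k = rng.choices(list(p), weights=list(p.values()))[0]
    # Size-group choice in the spirit of (2), within the chosen class.
    q = inverse_freq_probs(size_counts[k])
    g = rng.choices(list(q), weights=list(q.values()))[0]
    return k, g
```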
In previous methods, Copy & Paste was executed before training as an offline pre-processing step. As a consequence, the same pre-processed images were reused for every epoch of the training process. To provide more diversified images during training, in this paper we apply Copy & Paste on the fly, yielding an Online Copy & Paste scheme. This Online Copy & Paste creates differently pasted objects for every training epoch, which allows the DNN to be trained with maritime objects of many different sizes and locations.
Next, we need to locate a position in the training image where the copied object can be pasted without overlapping any existing object. This can be done by calculating the Intersection over Union (IoU) between the candidate paste position and each bounding box in the ground truth, and checking, with the equation below, that the IoU is zero. The IoU measures the overlapping area between the to-be-pasted bounding box $B_p$ and an existing ground-truth bounding box $B_g$, divided by the area of their union:

$$\mathrm{IoU}(B_p, B_g) = \frac{\left| B_p \cap B_g \right|}{\left| B_p \cup B_g \right|}. \tag{3}$$
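A minimal paste-placement check along these lines, with boxes represented as (x1, y1, x2, y2) tuples (the function names are illustrative):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def can_paste(candidate, existing_boxes):
    """Accept a candidate paste position only if it overlaps no existing box."""
    return all(iou(candidate, b) == 0 for b in existing_boxes)
```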
4.2. Mix-up Augmentation
The Mix-up technique [32] generates a new image by the weighted linear interpolation of two images and their labels. It is known to be effective for mislabeled data because the labels of the two images are mixed just as their images are. More specifically, for given image-label pairs $(x_i, y_i)$ and $(x_j, y_j)$ from the training data, Mix-up can be implemented as follows:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \tag{4}$$

where $(\tilde{x}, \tilde{y})$ is the Mix-up output and $\lambda \in [0, 1]$ is the mixing ratio.
4.3. Basic Augmentations from YOLO-V5
We also use the basic geometric transformations of YOLO-V5, such as flipping, rotation, translation, and scaling. Another basic augmentation adopted from YOLO-V5 is mosaic augmentation, first introduced in [30]. Mosaic augmentation mixes four training images into a single training image so that it contains four different contexts. According to [30], mosaic augmentation allows the model to learn to identify objects at a smaller-than-usual scale, and it is useful for training as it greatly reduces the need for large mini-batch sizes.
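A toy sketch of the 2x2 stitching behind mosaic augmentation is shown below. The real YOLO-V5 implementation additionally picks a random mosaic center, crops, and remaps the box labels; here, images are simply nested lists of pixel rows of equal size.

```python
def mosaic4(imgs):
    """Stitch four equally sized images (as nested lists of pixel rows)
    into one 2x2 mosaic: imgs[0] top-left, imgs[1] top-right,
    imgs[2] bottom-left, imgs[3] bottom-right."""
    top = [a + b for a, b in zip(imgs[0], imgs[1])]
    bottom = [a + b for a, b in zip(imgs[2], imgs[3])]
    return top + bottom
```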
5. Experiment Results
As explained in the previous section, we revised the SMD to obtain the SMD-Plus. As a tool for modifying the ground truth of the SMD, we used the MATLAB ImageLabeler, which provides an application interface that makes it easy to create video clips and attach annotations to each object.
Our experiments were conducted on an Intel I7-9900 Processor with a main memory of 32GB and an NVIDIA GeForce RTX 2080Ti. Based on the YOLO-V5, we trained the model with the SMD-Plus. The hyper-parameters for the YOLO-V5 training are as follows: the stochastic gradient descent (SGD) optimizer with a momentum of 0.9, a learning rate of 0.01, and a batch size of 8. We also used the following values for the augmentation parameters:
For color jittering: hue ranges from 0 to 0.015; saturation, from 0 to 0.7; and brightness, from 0 to 0.4;
The probability of generating a mosaic is 0.5;
Translate shifts range from 0 to 0.1;
The probability of a horizontal flip is 0.5;
Random rotation within angles from −10 to +10 degrees;
Random scaling in the range of 0.5×∼1.5×.
Using the same augmentation parameters listed above, for the sake of comparison, we also conducted experiments with YOLO-V4 [30].
Table 4 compares the detection performance on the SMD and the SMD-Plus. As shown in Table 4, the detection performance on the SMD-Plus increased by more than 10% over the SMD for both YOLO-V4 and all versions of YOLO-V5. Here, as in the previous benchmarks [15,21,22], only foreground/background detection was performed. Note that foreground/background detection alone can evaluate the accuracy of the bounding boxes, but not the recognition accuracy for the class labels. Therefore, we use the results of Table 4 to verify the bounding box accuracy of the SMD-Plus.
Table 5 shows the results of the detection-then-classification task for the train/test split of the SMD suggested by [15]. In this train/test split, however, there are classes with no test data; the corresponding columns c1, c5, c7, and c10 are therefore blank. The non-empty classes in the test set of [15] are ‘Speed boat’, ‘Vessel/ship’, ‘Ferry’, ‘Buoy’, ‘Others’, and ‘Flying bird and plane’. With the IoU threshold fixed at 0.5, the mAPs over the six non-empty classes are 0.186 for YOLO-V4, 0.22 for YOLO-V5-S, 0.182 for YOLO-V5-M, and 0.304 for YOLO-V5-L.
Next, Table 6 shows the results of the detection-then-classification task for the SMD-Plus. In the table, we evaluate the performance of the Copy & Paste scheme. More specifically, the detection-then-classification results for ‘No Copy&Paste’, ‘Online Copy&Paste’, and ‘Offline Copy&Paste’ are compared in Table 6. As one can see, our proposed ‘Online Copy&Paste’ outperformed ‘No Copy&Paste’ and ‘Offline Copy&Paste’ for YOLO-V4 and all versions of YOLO-V5. Furthermore, the proposed ‘Online Copy&Paste’ proved quite effective for the minority classes, such as ‘Kayak’ (c6).