Article

UnityShip: A Large-Scale Synthetic Dataset for Ship Recognition in Aerial Images

1 School of Aerospace Engineering, Xiamen University, Xiamen 361102, China
2 School of Aeronautics and Astronautics, Shanghai Jiao Tong University, Shanghai 200240, China
3 School of Informatics, Xiamen University, Xiamen 361005, China
* Author to whom correspondence should be addressed.
Remote Sens. 2021, 13(24), 4999; https://doi.org/10.3390/rs13244999
Submission received: 9 November 2021 / Revised: 30 November 2021 / Accepted: 7 December 2021 / Published: 9 December 2021
(This article belongs to the Special Issue Artificial Intelligence and Remote Sensing Datasets)

Abstract

As a data-driven approach, deep learning requires a large amount of annotated data for training to obtain a sufficiently accurate and generalized model, especially in the field of computer vision. However, compared with generic object recognition datasets, aerial image datasets are more challenging to acquire and more expensive to label. Obtaining a large amount of high-quality aerial image data for object recognition and image understanding therefore remains an urgent problem. Existing studies show that synthetic data can effectively reduce the amount of training data required. Therefore, in this paper, we propose the first synthetic aerial image dataset for ship recognition, called UnityShip. This dataset contains over 100,000 synthetic images and 194,054 ship instances, including 79 different ship models in ten categories and six large virtual scenes with different time periods, weather environments, and altitudes. The annotations include environmental information, instance-level horizontal bounding boxes, oriented bounding boxes, and the type and ID of each ship. This provides the basis for object detection, oriented object detection, fine-grained recognition, and scene recognition. To investigate the applications of UnityShip, the synthetic data were validated for model pre-training and data augmentation using three different object detection algorithms and six existing real-world ship detection datasets. Our experimental results show that for small-sized and medium-sized real-world datasets, the synthetic data yield improvements in both model pre-training and data augmentation, showing the value and potential of synthetic data in aerial image recognition and understanding tasks.

Graphical Abstract

1. Introduction

In the past decade, deep learning methods have achieved milestones in various fields. For object recognition and scene understanding in aerial images, significant achievements have been made as a result of deep-learning-based algorithms. However, deep learning algorithms require a large amount of data for training and testing, and the scale of the data largely determines the upper bound of model performance. For generic object recognition, it is relatively easy to obtain a large amount of data from the real world to constitute a sufficiently large dataset, such as VOC [1], COCO [2], OID [3], and Objects365 [4], which usually contain tens of thousands or even millions of images and dozens or even hundreds of object categories. These large-scale datasets have greatly promoted the development of computer vision algorithms and related applications.
However, aerial image datasets are usually more difficult to collect because of their distinctive perspectives and relatively scarce sources. Existing aerial image datasets, such as VEDAI [5], COWC [6], HRSC2016 [7], DOTA [8], DIOR [9], and LEVIR [10], are relatively small in scale, and the datasets dedicated to any single application of aerial imagery are much smaller than generic datasets. In addition, compared with generic object recognition datasets, aerial images often have resolutions of thousands or tens of thousands of pixels, whereas their object instances are much smaller and the proportion of small objects is higher. These characteristics make accurate instance labeling more difficult and often more expensive. Moreover, because of differences in data sources, viewpoints, acquisition heights, and environmental scenes, aerial image datasets usually differ greatly from one another; even data from a single source can exhibit regional differences, so it is difficult to mix or reuse different aerial image object recognition datasets. This further aggravates the data shortage for aerial imagery.
One solution to the data shortage is the use of synthetic data, and realistic game engines or 3D engines are often used to create the data needed. Finding ways to use these tools to acquire data and to complete annotation rapidly and automatically is a very active area of research. Some recent works have provided much inspiration in this field, including the validation and testing of autonomous driving in virtual environments [11,12,13,14] and the use of reinforcement learning to accomplish tasks in a virtual environment [15]. In computer vision, recent works have constructed synthetic data for detection [16], segmentation [17,18], pedestrian re-identification [19,20], and vehicle re-identification [21] with the help of virtual engines, and have studied synthetic data for tasks such as action recognition using 3D models [22]. The excellent results of these works demonstrate that existing tasks can be implemented and improved using synthetic data. In a 3D virtual environment, a large amount of data can be generated from different scenes, environments, and objects, and the locations of the object instances can be obtained and labeled automatically, simultaneously addressing the difficulties of data acquisition and annotation.
Building on previous works, this work used the Unity 3D engine to create a synthetic dataset for ship recognition in aerial images, called UnityShip. This is the first large-scale synthetic dataset for ship recognition in aerial images. The UnityShip dataset consists of 79 ship models in ten categories and six virtual scenes, and it contains over 100,000 images and 194,054 instance objects. Each instance is annotated with a horizontal bounding box, an oriented bounding box, a fine-grained category, a specific instance ID, and scene information, supporting tasks such as horizontal object detection, rotated object detection, fine-grained object recognition, ship re-identification, and scene understanding.
To generate a large-scale, realistic, and generalized dataset, it is necessary to take into account the variations in the spatial resolutions of sensors and their observation angles, the time of day for collection, object shadows, and the variation of lighting environments caused by the position of the sun relative to the sensor. Additionally, gravity, collision, and other effects need to be implemented by the physics engine. Therefore, we designed two weather environments (with or without clouds), four time periods (morning, noon, afternoon, and night) and their corresponding four light intensities, as well as shooting at different altitudes to obtain a rich and diverse synthetic dataset. This method can be used not only for the generation of ship object datasets in aerial images but also for the generation of other aerial image objects such as vehicles and buildings, bringing new inspiration for object recognition and dataset collection in aerial images.
To validate the application and effectiveness of the synthetic data in specific recognition tasks, we divided the created synthetic dataset into UnityShip-100k and UnityShip-10k. UnityShip-100k was used to validate the synthetic data for model pre-training, and UnityShip-10k was used to validate the synthetic data for data augmentation. We collected six real-world aerial image datasets containing ship objects for validating the synthetic datasets: AirBus-Ship [23], DIOR [9], DOTA [8], HRSC2016 [7], LEVIR [10], and TGRS-HRRSD [24]. The details of these datasets are given in Section 4.1. Three representative object detection algorithms, the two-stage object detection algorithm Faster RCNN [25], the single-stage object detection algorithm RetinaNet [26], and the anchor-free object detection algorithm FCOS [27], were used to comprehensively validate the effects of the synthetic data on model pre-training and data augmentation for real-world data. The specific setup and results of the experiments are presented in Section 4.
The main contributions of the paper can be summarized as follows:
  • The first large-scale synthetic dataset for ship object recognition and analysis in aerial images, UnityShip, was created. This dataset contains over 100,000 images and 194,054 instance objects collected from 79 ship models in ten categories and six large virtual scenes with different weather, time-of-day, light-intensity, and altitude settings. It exceeds existing real-world ship recognition datasets in terms of the number of images and instances.
  • UnityShip contains horizontal bounding boxes, oriented bounding boxes, fine-grained category information, instance ID information, and scene information. It can be used for tasks such as horizontal object detection, rotated object detection, fine-grained recognition, re-identification, scene recognition, and other computer vision tasks.
  • Along with six existing real-world datasets, we have validated the role of synthetic data in model pre-training and data augmentation. The results of our experiments conducted on three prevalent object detection algorithms show that the inclusion of synthetic data provides greater improvements for small-sized and medium-sized datasets.

2. Related Work

2.1. Aerial Image Datasets

The recognition of specific objects in aerial imagery is a fascinating and challenging task, and it depends heavily on relevant aerial image datasets. Aerial images typically include high-resolution remote sensing images from satellites and maps, as well as optical images from onboard acquisition equipment on aircraft. We collected publicly available and widely used object detection datasets of aerial imagery and collated their specific information, including data source, number of instance categories, number of images, number of instances, image size (generally the longest side of the image), and instance annotation type, listed in order of publication time in Table 1.
The early datasets for object recognition in aerial images mostly have fewer categories, images, and instances. As the applications of deep learning in computer vision have become increasingly widespread, the scale and quality requirements of datasets have also increased; the scale of aerial image datasets has, therefore, been expanding, with typical datasets represented by DOTA [8], DIOR [9], and VisDrone [36]. These datasets have greatly contributed to the development of algorithms and applications for object recognition in aerial images.
Compared with generic datasets, aerial image datasets are more difficult to recognize because of their special overhead views and altitudes; they typically have higher resolutions, larger sizes, denser object distributions, and a higher proportion of small objects. For these reasons, aerial image datasets are smaller in size and more difficult and expensive to annotate. A simple comparison is shown in Table 2.

2.2. Synthetic Data

The approach of training, validating, and testing algorithms in virtual simulation environments has been widely used in various fields, including autonomous driving and reinforcement learning. The construction of synthetic image data based on virtual environments has attracted much attention in computer vision and has resulted in a series of representative works. In 2016, Ros et al. [17] built the SYNTHIA dataset using the Unity 3D engine with the aim of investigating the use of synthetic data in semantic segmentation. In the same year, Johnson-Roberson et al. [16] captured and automatically annotated data in the video game GTA-V for tasks such as object detection, instance segmentation, semantic segmentation, and depth estimation. Subsequently, research on synthetic data has attracted increasing attention. For example, based on the GTA-V game engine, some works [18,38,39,40,41] have focused on generating synthetic datasets for detection, segmentation, estimation, and visual navigation.
Synthetic data is increasingly being used in other vision tasks. For example: Varol et al. [42] and Zhang et al. [22] investigated methods and applications of synthetic data for human detection, recognition, and action analysis; Sun et al. [19] and Wang et al. [20] used synthetic data for pedestrian re-identification tasks; the Unreal engine was used to build UnrealText, a synthetic dataset for text recognition [42]; Roberts and Paczan [43] built Hypersim, a synthetic dataset for indoor image understanding; and Zhang et al. [44] published a synthetic dataset for the defogging task. In addition, some works have focused on using visual algorithms such as generative adversarial networks [21] or reinforcement learning [45] to improve the quality and broaden the applications of synthetic data.
Since the start of the deep learning era, three trends have become noteworthy in research on synthetic data: (1) synthetic data are becoming higher in quality and larger in scale, with image data expanding from 2D to 3D; (2) the target tasks have broadened from simple ones (e.g., semantic segmentation) to a wide range of computer vision tasks (detection, segmentation, re-identification, scene understanding, depth estimation, tracking, pose recognition, etc.); and (3) the tools have moved from fixed game engines (GTA-V) to virtual engines with a higher degree of customizability (Unreal, Unity). In addition, to allow data generation and algorithm improvement to be optimized jointly, the focus of research on synthetic data has gradually shifted from high-quality data generation to learnable data generation.

2.3. Synthetic Aerial Image Datasets

Within the last few years, some research has focused on applications of synthetic data in aerial imagery. For example: Bondi et al. [46] used the AirSim tool to generate BIRDSAI, an infrared object recognition dataset that blends real and synthetic data; Chen et al. [47] published VALID, a synthetic dataset for instance segmentation, semantic segmentation, and panoramic segmentation from an unmanned aircraft perspective; Kong et al. [48] investigated the application of synthetic data to semantic segmentation for remote sensing and published the Synthinel-1 dataset; Shermeyer et al. [49] released RarePlanes, a hybrid real- and synthetic-image dataset for aircraft detection and fine-grained attribute recognition in aerial imagery; Clement et al. [50] also released a synthetic image dataset for aircraft recognition, mainly intended for rotated object recognition; Lopez-Campos et al. [51] released ESPADA, the first synthetic dataset for depth estimation in aerial images; and Uddin et al. [52] proposed a GAN-based method for converting optical videos to infrared videos and investigated its impact on classification and detection tasks.
Compared with the large number of synthetic datasets of natural images, which are applied to a wide range of computer vision tasks, synthetic datasets for aerial images are fewer in number, smaller in size, and address simpler vision tasks. This indicates that synthetic data for aerial images have great research potential and application value.

2.4. Ship Detection in Aerial Images

As a long-standing fundamental and challenging problem in computer vision, object detection has become a hot research topic and has made great progress in the past few years. The goal of object detection is to determine the presence or absence of a specified class of objects in a given image and to localize their spatial locations. As a fundamental problem in image understanding and computer vision, object detection is the basis for other advanced and complex tasks, such as segmentation, scene understanding, and object tracking.
Ship detection in aerial images (such as remote sensing images and drone-captured images) is challenging compared with generic object detection. Several works have focused on this task: for example, [53,54,55] investigated oriented ship detection, and [56,57,58,59] addressed the scale and feature problems of ship detection in aerial images.

3. UnityShip Dataset

In Section 3.1, we describe the collection and annotation processes used to create the UnityShip dataset and give a detailed description of the scene resources, ship models, and environmental information. In Section 3.2, we provide statistics regarding the detailed attributes of the UnityShip dataset, including the location distribution, density distribution, scale distribution, and category distribution of the instance objects, as well as the scene statistics for each different attribute. In Section 3.3, we give visual examples of the UnityShip synthetic dataset compared with the real-world aerial images from the ship object detection datasets.

3.1. Collection of UnityShip Dataset

We used the Unity engine as a tool for creating 3D virtual environments. Unity (https://unity.com, accessed on 6 December 2021) is a platform for creating 3D content, and it provides authoring tools to freely build the terrain and models needed for a scene using real-time rendering technology. The Unity engine can use components such as cameras to constantly acquire photos of a scene and output them in layers according to where the content is located, which simplifies image processing. It has a rich application programming interface (API) that can be used in scripts to control the components within a scene.
Using the Unity 3D engine, it is possible to build scenes and objects comparable to those encountered in real applications and to obtain a large number of composite images. However, to create a large, sufficiently realistic, and generalized synthetic dataset, one must take into account factors including variation in sensor spatial resolution and viewing angle, the collection time, object shadows, changes in lighting due to the position of the sun relative to the sensor, and environmental changes. In this work, individual attributes, including parameters such as weather, collection time, sunlight intensity, viewing angle, ship model, and distribution density, were randomized to create a diverse and heterogeneous dataset. A total of 79 ship models from ten categories were included, covering different sizes and types of ships, as shown in Figure 1. There were six large virtual scenes, including sea, harbor, island, river, and beach areas. Two weather environments (with or without clouds) and four time periods (morning, noon, afternoon, and night) with their corresponding light intensities were used in each virtual scene, as well as different altitudes. This approach allowed rich and varied virtual data to be obtained. Examples of images at the same location with different random attribute settings are shown in Figure 2.
We used the built-in Unity API to control the generation of virtual cameras, ships, clouds, and other items in the virtual engine; two shooting modes were used: spot shooting and scan shooting at fixed spatial intervals. The specific component-generation methods and settings were as follows (a minimal sketch of the per-capture randomization is given after this list).
  • Camera model. In the Unity virtual environment, we used the camera component to generate the camera model. To simulate the perspective of aerial images, the camera was set to a top-down view with a horizontal field of view of 60 degrees, and images with a resolution of 2000 × 1600 pixels were captured. To obtain as many multi-scale ship objects as possible, the camera was placed at a random but reasonable altitude for each capture.
  • Ship model. The number of ships generated was determined by the camera altitude and increased with it. Before each acquisition, a certain number of ships were randomly selected from the 79 pre-imported models, and the coordinates and rotation angles of the ships were randomly sampled within the camera's field of view. Because the terrain in the scene is hollow, ships could spawn inside the terrain or intersect other objects and cause a capture to fail; to prevent this, we added collider components to all ship models and set collision triggers so that a model and its data were automatically cleared when the ship hit another collider. A transparent geometric collider exactly filling the land was used to detect collisions, preventing ship models from being generated on land.
  • Environment model. We randomly generated direct sunlight using a range of variables, including the position of the sun, the angle of direct sunlight, and the intensity of the light, to characterize the local time and season in a physical sense. In addition, we used random values to determine whether the clouds should be generated and their concentration. To reduce the amount of computer rendering effort required and the time taken to generate the scenes, the fog was only generated within a range that the camera could capture.
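The per-capture randomization can be summarized by the following minimal Python sketch. It is illustrative only: the actual generation scripts run inside Unity (C#), and the scene names, value ranges, and the altitude-to-ship-count mapping shown here are assumptions rather than the settings used to build UnityShip.

```python
import random

# Hypothetical scene names and value ranges; only the randomized attributes themselves
# (scene, clouds, time of day, altitude, ship model/position/rotation) follow the paper.
SCENES = ["scene_1", "scene_2", "scene_3", "scene_4", "scene_5", "scene_6"]
TIMES_OF_DAY = ["morning", "noon", "afternoon", "night"]

def sample_capture(num_ship_models=79, min_altitude=300.0, max_altitude=1500.0):
    """Sample one set of scene, environment, and ship parameters for a single capture."""
    altitude = random.uniform(min_altitude, max_altitude)
    num_ships = max(1, int(altitude / 200))  # more ships are placed at higher altitudes
    ships = [
        {
            "model_id": random.randrange(num_ship_models),
            "x": random.uniform(-altitude, altitude),  # keep ships roughly inside the camera footprint
            "z": random.uniform(-altitude, altitude),
            "yaw_deg": random.uniform(0.0, 360.0),
        }
        for _ in range(num_ships)
    ]
    return {
        "scene": random.choice(SCENES),
        "time_of_day": random.choice(TIMES_OF_DAY),
        "cloudy": random.random() < 0.5,  # whether clouds are rendered for this capture
        "altitude": altitude,
        "ships": ships,  # terrain/ship collisions are resolved separately by colliders
    }

print(sample_capture())
```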
While constructing the various models for generating virtual composite images, we used the component layering function in Unity to extract ship information from the object layer and obtained the image annotations through threshold segmentation and image morphology operations; each ship was accurately annotated with a horizontal bounding box and an oriented bounding box (a minimal sketch of this step is given below). Moreover, for additional research purposes, we also annotated each ship's subdivision category, instance ID, and scene information, building a foundation for subsequent tasks such as fine-grained object recognition, re-identification, and scene analysis.
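The annotation-extraction step can be sketched with OpenCV as follows, assuming the object layer is rendered as an image in which only the ships are visible on a dark background and their silhouettes do not touch; the threshold value and kernel size are illustrative, and OpenCV 4 is assumed for the findContours return signature.

```python
import cv2

def boxes_from_object_layer(layer_path, thresh=10):
    """Recover horizontal and oriented bounding boxes from a rendered ship-only layer.

    Assumes each ship silhouette is a separate connected component on a dark background.
    Returns a list of (hbb, obb): hbb = (x, y, w, h), obb = 4 corner points (float array).
    """
    layer = cv2.imread(layer_path)
    gray = cv2.cvtColor(layer, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    # Morphological closing fills small holes inside each silhouette.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # OpenCV 4 returns (contours, hierarchy).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for cnt in contours:
        hbb = cv2.boundingRect(cnt)                # horizontal bounding box
        obb = cv2.boxPoints(cv2.minAreaRect(cnt))  # oriented bounding box corners
        boxes.append((hbb, obb))
    return boxes
```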

3.2. Statistics of the UnityShip Dataset

Using the above method, a dataset of 105,086 synthetic images was generated for ship recognition in aerial images. As noted, the ship objects in each image are annotated with horizontal and rotated bounding boxes, along with the subcategory and instance ID of each ship, as shown in Figure 3, in which the red boxes represent rotated box annotations and the green boxes represent horizontal box annotations.
We placed ship models in six large virtual scenes with different environmental parameters, including the presence or absence of clouds and different capture altitudes and times. The statistics of this scene information are shown in Figure 4. Among the six scenes, the first and second account for the majority of the images because they are larger in scale and offer a greater range of ocean area available for capture; the remaining images were mainly collected from restricted-range areas such as islands and rivers and are less numerous. Around 40% of the images contain clouds, far exceeding the proportion of natural scenes that contain clouds and providing research data for ship recognition in complex and extreme environments. Analysis of the acquisition altitudes shows that a higher proportion of the data were collected at higher altitudes. For the acquisition time, a random selection from a set period was made before each capture, so the distribution of times is even.
The most significant feature of the UnityShip dataset, and its most prominent advantage over real-world datasets, is the availability of accurate and detailed ship categories and specific instance IDs, which can be used for tasks such as fine-grained ship detection and ship re-identification. As shown in Figure 5, UnityShip contains ten specific categories (Battleship, Cruise Ship, Cargo Ship, Yacht, Sailship, Warship, Aircraft Carrier, Passenger Ship, Cargo Ship, and Strip-Shape Ship) and a total of 79 different ship models, with the distribution of models per category shown in Figure 5a and the distribution of instances per category shown in Figure 5b. The number of instances per category is roughly proportional to the number of models it contains, because each placement was randomly selected from the 79 models. We also counted the distribution of ship instances over the subcategories, as shown in Figure 6.
For the instance-level objects, we also calculated statistics on the attributes of the dataset, including the location, scale, and density distributions of the ship instances. As shown in Figure 7, the distribution of ship objects within the images is relatively even, as we randomly placed them in suitable locations. We also counted the density distribution of all instances, and the vast majority of the images contain fewer than four ships. In addition, we counted the distributions of relative width (ratio of instance width to image width), relative height (ratio of instance height to image height), and relative area (ratio of instance area to image area) of all ship instances. The results indicate that most ships have a relative width and relative height below 0.1 and a relative area below 0.02, as shown in Figure 8.
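As an illustration, these relative-size statistics can be computed from instance annotations as in the following sketch; the assumption that the annotations are stored in COCO-style JSON (with "bbox" = [x, y, w, h]) is ours, and the file name is hypothetical.

```python
import json
import numpy as np

def relative_box_stats(coco_json_path):
    """Relative width/height/area of every instance, given COCO-style annotations."""
    with open(coco_json_path) as f:
        data = json.load(f)
    sizes = {img["id"]: (img["width"], img["height"]) for img in data["images"]}
    rel_w, rel_h, rel_a = [], [], []
    for ann in data["annotations"]:
        img_w, img_h = sizes[ann["image_id"]]
        _, _, w, h = ann["bbox"]  # COCO bbox format: [x, y, w, h]
        rel_w.append(w / img_w)
        rel_h.append(h / img_h)
        rel_a.append((w * h) / (img_w * img_h))
    return np.array(rel_w), np.array(rel_h), np.array(rel_a)

# e.g., rel_w, rel_h, rel_a = relative_box_stats("unityship_train.json")
```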

3.3. Comparison with Real-World Datasets

We took two existing real-world datasets for ship object detection in aerial images (AirBus-Ship [23] and HRSC2016 [7]) and four aerial image datasets containing ship categories (DOTA [8], DIOR [9], LEVIR [10], and TGRS-HRRSD [24]) and extracted the parts of those datasets containing ship objects. The extremely large images of the DOTA [8] dataset were cropped to smaller images of 1000 × 1000 pixels on average; the datasets are described and divided in detail in Section 4.1.
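For reference, a sliding-window cropping routine such as the following sketch produces such patches. The 0.5 overlap ratio follows the setting given in Section 4.1; the border handling and the implementation itself are our assumptions, not the exact tool used for DOTA.

```python
def crop_positions(img_w, img_h, patch=1000, overlap=0.5):
    """Top-left corners of sliding-window crops with the given overlap ratio."""
    stride = int(patch * (1 - overlap))
    xs = list(range(0, max(img_w - patch, 0) + 1, stride))
    ys = list(range(0, max(img_h - patch, 0) + 1, stride))
    # Make sure the right and bottom borders are covered.
    if xs[-1] + patch < img_w:
        xs.append(img_w - patch)
    if ys[-1] + patch < img_h:
        ys.append(img_h - patch)
    return [(x, y) for y in ys for x in xs]

# e.g., crop_positions(4000, 3000) gives the corners of 1000 x 1000 patches with a 500-pixel stride
```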
Table 3 compares the properties of these six datasets with those of the UnityShip synthetic dataset. Our synthetic dataset is the largest in terms of the number of images and contains the largest number of ship instances. Another noticeable difference is the average number of instances per image: the UnityShip dataset has an instance density distribution closer to the AirBus-Ship, HRSC2016, LEVIR, and TGRS-HRRSD datasets, whereas the DOTA and DIOR datasets have a higher density of ship distribution. We selected some representative images from these six datasets, as shown in Figure 9 and Figure 10; it is clear that the difference in instance density arises because the ship objects in the DOTA and DIOR datasets are mainly distributed near coasts and in ports, where large numbers of ships often gather. We also selected some images from UnityShip, and visually, the synthetic dataset is most similar to the AirBus-Ship, LEVIR, and TGRS-HRRSD datasets.

4. Experiments and Results

To explore the use of synthetic data for augmenting existing datasets, we combined the UnityShip synthetic dataset with six existing real-world datasets: UnityShip-100k (containing all synthetic images) was used for model pre-training experiments, and UnityShip-10k (containing 10,000 synthetic images) was used as additional data in data augmentation experiments. Section 4.1 details the dataset partitioning and gives other details of the six real-world datasets and the synthetic dataset. Section 4.2 explains the environment and detailed settings of the experiments. Section 4.3 and Section 4.4 present the results of model pre-training and data augmentation, respectively; Section 4.5 summarizes and discusses the experimental results.

4.1. Dataset Partitioning

  • AirBus Ship. The AirBus Ship dataset was released in 2019 for the Kaggle competition "Airbus Ship Detection Challenge"; it is one of the largest publicly available datasets for ship detection in aerial images. The original dataset is labeled in mask form for the instance segmentation task; in this experiment, we converted these masks into horizontal bounding boxes. The original training set (the test set and its labels are not available) has a total of 192,556 images, of which 150,000 are negative samples that do not contain any ships. We selected only the positive samples, resulting in 42,556 images and 81,011 instances. We randomly divided the dataset in a 6:2:2 ratio into training, validation, and testing sets (a minimal sketch of this split is given after this list).
  • DIOR. The DIOR dataset is a large-scale benchmark for object detection in optical remote sensing images. It contains 20 categories and is large in terms of the number of object classes, images, and instances. We extracted the ship data, totaling 2702 images and 61,371 instance objects. Most of the ship objects are located in ports and near coasts with a dense distribution. Again, we randomly divided the dataset into training, validation, and testing sets in a 6:2:2 ratio.
  • DOTA. DOTA is a large-scale dataset containing 15 categories for object recognition in aerial images. In this work, we extracted images containing ship categories and cropped them to 1000 × 1000 pixels with an overlap rate of 0.5, creating a total of 2704 images and 80,164 ship objects. Once more, these were randomly divided into training, validation, and testing sets in the ratio 6:2:2.
  • HRSC2016. HRSC2016 is a dataset collected from Google Earth for ship detection in aerial imagery; it contains 1070 images and 2971 ship objects. All images are from six well-known ports, with image resolutions between 2 m and 0.4 m and image sizes ranging from about 300 to 1500 pixels. The training, validation, and testing sets contain 436, 181, and 444 images, respectively. All annotations include horizontal and rotated bounding boxes, as well as detailed annotations of geographic location and ship attributes.
  • LEVIR. LEVIR is mainly used for aerial remote sensing object detection; it includes aircraft, ships (including offshore and surface ships), and oil tanks. The LEVIR dataset consists of a large number of Google Earth images with ground resolutions of 0.2 to 1.0 m; it also contains common human habitats and environments, including countryside, mountains, and oceans. We extracted the fraction containing ships, resulting in 1494 images and 2961 ship objects. This was divided into training, validation, and testing sets in the ratio 6:2:2.
  • TGRS-HRRSD. TGRS-HRRSD is a large-scale dataset for remote sensing image object detection; all images are from Baidu Maps and Google Maps, with ground resolutions from 0.15 to about 1.2 m, and it includes 13 categories of objects. The sample size is relatively balanced among the categories, and each category has about 4000 samples. We extracted those images containing ships from them, creating a total of 2165 images and 3954 ship objects. These were again divided into training, validation, and testing sets in the ratio 6:2:2.
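For the datasets split 6:2:2 above, a minimal random split sketch is shown below; the random seed and the operation on image indices are our assumptions, since the exact split files are dataset-specific.

```python
import random

def split_indices(num_images, ratios=(0.6, 0.2, 0.2), seed=0):
    """Randomly split image indices into training/validation/testing sets (6:2:2)."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    n_train = int(ratios[0] * num_images)
    n_val = int(ratios[1] * num_images)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

# e.g., for the AirBus-Ship positive samples:
train_ids, val_ids, test_ids = split_indices(42556)
```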

4.2. Implementation

The experiments described in this section were implemented in PyTorch with MMDetection [60]. By convention, all images were scaled to between 600 and 800 pixels without changing their aspect ratio. SGD was used as the default optimizer, and four NVIDIA GeForce RTX 2080 Ti GPUs were used to train the models. The initial learning rate was set to 0.01 (two images per GPU) or 0.05 (one image per GPU) for Faster RCNN [25], and 0.05 (two images per GPU) or 0.0025 (one image per GPU) for RetinaNet [26] and FCOS [27]. All models were trained with a ResNet50 backbone [61] and the FPN feature fusion structure [62] by default. Unless explicitly stated otherwise, all hyperparameters and structures follow the default settings in MMDetection.
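For illustration, a fine-tuning configuration in MMDetection 2.x style might look like the sketch below; the base config name, dataset paths, and checkpoint path are hypothetical, and only the per-GPU batch size, optimizer, and epoch schedule mirror the settings described above.

```python
# Hypothetical file and path names; only the keys follow MMDetection 2.x config conventions.
_base_ = "configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py"

data = dict(
    samples_per_gpu=2,  # two images per GPU, as in the text
    workers_per_gpu=2,
    train=dict(ann_file="data/hrsc2016_ship/train.json", img_prefix="data/hrsc2016_ship/images/"),
    val=dict(ann_file="data/hrsc2016_ship/val.json", img_prefix="data/hrsc2016_ship/images/"),
    test=dict(ann_file="data/hrsc2016_ship/test.json", img_prefix="data/hrsc2016_ship/images/"),
)
# The resize step of the data pipeline would likewise be overridden so that images are
# rescaled to 600-800 pixels with keep_ratio=True.

optimizer = dict(type="SGD", lr=0.01, momentum=0.9, weight_decay=0.0001)
runner = dict(type="EpochBasedRunner", max_epochs=12)

# Initialize the whole detector from a UnityShip-100k pre-trained checkpoint
# instead of only the ImageNet backbone weights (path is illustrative).
load_from = "checkpoints/unityship100k_faster_rcnn_r50_fpn.pth"
```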
All reported results follow the COCO [2] metrics, including AP, the average precision averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, and AP50, the average precision at an IoU threshold of 0.5. The average recall (AR), averaged over the same IoU thresholds from 0.5 to 0.95, was also calculated.
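These metrics can be reproduced with pycocotools, as in the following minimal sketch; the annotation and result file names are hypothetical.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Hypothetical file names: ground-truth annotations and detections in COCO format.
coco_gt = COCO("annotations/ship_test.json")
coco_dt = coco_gt.loadRes("results/ship_test_detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

ap = evaluator.stats[0]    # AP averaged over IoU = 0.50:0.05:0.95
ap50 = evaluator.stats[1]  # AP at IoU = 0.50
ar = evaluator.stats[8]    # AR averaged over IoU = 0.50:0.05:0.95 (maxDets = 100)
print(f"AP={ap:.3f}, AP50={ap50:.3f}, AR={ar:.3f}")
```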

4.3. Model Pre-Training Experiments

To verify the utility of the synthetic dataset for model pre-training in a detailed and comprehensive manner, the following experiments were conducted on the six datasets using Faster RCNN, RetinaNet, and FCOS: (1) training the object detection algorithm from scratch, using GN [63] in both the detection head and the feature extraction backbone, with training schedules of 72 and 144 epochs; (2) using the default ImageNet pre-trained ResNet50 according to the default settings in MMDetection, with training schedules of 12 and 24 epochs; (3) pre-training the models on the synthetic UnityShip-100k dataset and then fine-tuning with the same settings as in (2) for 12 and 24 epochs.
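Setting (3) amounts to initializing each detector from a UnityShip-100k checkpoint before fine-tuning. A minimal PyTorch sketch of such an initialization is shown below; the checkpoint path, the "state_dict" wrapping, and the choice to drop classification-head weights are illustrative assumptions.

```python
import torch

def load_unityship_pretrained(detector: torch.nn.Module, ckpt_path: str) -> None:
    """Initialize a detector from a UnityShip-pretrained checkpoint before fine-tuning."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)  # MMDetection checkpoints wrap weights in 'state_dict'
    # Optionally drop classification-head weights if the downstream class count differs.
    state_dict = {k: v for k, v in state_dict.items() if "fc_cls" not in k}
    missing, unexpected = detector.load_state_dict(state_dict, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```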
The results of the model pre-training experiments are shown in Table 4 and Table 5. Compared with pre-training the backbone network on ImageNet, training from scratch achieved better results on the larger datasets, and pre-training with the UnityShip dataset achieved better results on the small- and medium-sized datasets. Specifically, compared with ImageNet pre-training: (1) in the experiments on the large AirBus-Ship and DOTA datasets, training from scratch achieved more significant improvements for all three algorithms, and pre-training with the UnityShip dataset achieved similar results for all three algorithms; (2) in the experiments on the medium- and small-sized datasets DIOR, HRSC2016, LEVIR, and TGRS-HRRSD, pre-training with the UnityShip dataset brought more significant improvements. The mAP values under the different settings for each dataset are compared in detail in Table 4.

4.4. Data Augmentation Experiments

In the data augmentation experiments, we used the training-set portion of UnityShip-10k (containing 6000 images) as additional data and added it to each of the six existing real-world datasets to verify the role of synthetic data in data augmentation. The results are shown in Table 6 and Table 7, which compare the results with and without data augmentation.
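In practice, this amounts to merging the synthetic training annotations with each real-world training set. A minimal sketch assuming COCO-format annotation files that share a single "ship" category is shown below; the file paths and the ID-offset scheme are illustrative.

```python
import json

def merge_coco(real_path: str, synth_path: str, out_path: str) -> None:
    """Append synthetic images/annotations to a real training set (same category list assumed)."""
    with open(real_path) as f:
        real = json.load(f)
    with open(synth_path) as f:
        synth = json.load(f)

    img_offset = max(img["id"] for img in real["images"]) + 1
    ann_offset = max(ann["id"] for ann in real["annotations"]) + 1
    for img in synth["images"]:
        img["id"] += img_offset
    for ann in synth["annotations"]:
        ann["id"] += ann_offset
        ann["image_id"] += img_offset

    real["images"].extend(synth["images"])
    real["annotations"].extend(synth["annotations"])
    with open(out_path, "w") as f:
        json.dump(real, f)

# e.g., merge_coco("levir_ship_train.json", "unityship10k_train.json", "levir_plus_synth_train.json")
```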
Similar to the results presented in Section 4.3, compared with the results without data augmentation, using UnityShip synthetic data as additional data achieved comparable results on the large datasets (AirBus-Ship and DOTA), whereas more significant improvements were seen on the small- and medium-sized datasets (DIOR, HRSC2016, LEVIR, and TGRS-HRRSD).

4.5. Discussion

For the larger datasets (AirBus-Ship and DOTA), training from scratch gives better and more stable results, whereas the smaller datasets (HRSC2016, LEVIR, and TGRS-HRRSD) gain more from UnityShip pre-training. The reason may be that the larger datasets have enough samples to train from scratch and obtain a sufficiently accurate and generalized model without relying on the initialization of the detector parameters, whereas the smaller datasets rely more on a good parameter initialization because of their relative lack of samples. In addition, similar results were obtained when using ImageNet pre-training; pre-training with the UnityShip dataset yielded bigger boosts for the single-stage algorithms (RetinaNet and FCOS) than for the two-stage algorithm (Faster RCNN). One possible reason is that the accuracy of two-stage algorithms is usually higher, decreasing the pre-training gain when compared with single-stage algorithms. Another possible reason is that, in contrast to single-stage algorithms, two-stage algorithms include a region proposal network, making the training process smoother and more balanced, so the gains from pre-training are less obvious.
In the data augmentation experiments, a larger improvement was obtained for the small datasets, and a larger improvement was obtained for the single-stage algorithms (FCOS and RetinaNet) than for the two-stage algorithm (Faster RCNN). For the larger datasets, the additional data represent a relatively small proportion of the images and instances and thus bring limited improvements. In addition, there are still large differences between the synthetic dataset and the real-world datasets; simply using the synthetic data directly as additional data for augmentation therefore does not take full advantage of its potential.
Overall, the results of this paper demonstrate the beneficial impact of using larger-scale synthetic data for relatively small-scale real-world tasks. For model pre-training, existing work has demonstrated that pre-training on large datasets followed by transfer learning and fine-tuning improves downstream tasks. However, for object recognition in aerial images, the significant difference between natural and aerial images is unfavorable for classifying and localizing small and dense objects. UnityShip has features and distributions similar to those of ship objects in real aerial images, so using UnityShip for pre-training benefits both the classification and the localization of ship objects. Similarly, using the synthetic dataset, with its large number of images covering varied environments and scenes, as additional data for augmentation effectively compensates for the scarcity of aerial images and yields a more generalizable model.

5. Conclusions

In this paper, we presented UnityShip, the first synthetic dataset for ship recognition in aerial images, captured and annotated using the Unity virtual engine; it comprises over 100,000 synthetic images and 194,054 ship instances. The annotation information includes environmental information, instance-level horizontal bounding boxes, oriented bounding boxes, and the category and ID of each ship, providing the basis for object detection, rotated object detection, fine-grained recognition, scene recognition, and other tasks.
To investigate the use of the synthetic dataset, we validated the synthetic data for model pre-training and data augmentation using three different object detection algorithms. The results of our experiments indicate that for small- and medium-sized real-world datasets, the synthetic dataset provides a large improvement in model pre-training and data augmentation, demonstrating the value and application potential of synthetic data in aerial images.
However, there are some limitations of this work that deserve consideration and improvement. First, although the UnityShip dataset exceeds existing real-world datasets in terms of the numbers of images and instances, it still lacks scene diversity and generalization, has insufficient ship distribution in scenes such as ports and near coasts, and differs from real-world images in detail and texture. Due to time and space constraints, this paper does not explore other possible uses of the synthetic dataset, such as ship re-identification in aerial images, fine-grained recognition of ships, or ship recognition under different weather conditions and against different backgrounds. In addition, among the experiments completed, only the relatively simple model pre-training and data augmentation settings were considered; studies of topics such as dataset bias, domain bias, and domain adaptation have not been carried out. In general, there are still relatively few studies on synthetic datasets for aerial images at this stage, and more in-depth research on the construction and use of synthetic data is needed.

Author Contributions

B.H. (Boyong He) conceived and designed the idea; B.H. (Boyong He) performed the experiments; B.H. (Bo Huang) and X.L. analyzed the data and helped with validation; B.H. (Boyong He) wrote the paper; E.G. and W.G. helped to create and check the dataset; and L.W. supervised the study and reviewed this paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by National Natural Science Foundation of China (No.51276151).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study, including the UnityShip synthetic dataset, are available from the corresponding author on reasonable request.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and helpful suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2009, 88, 303–338. [Google Scholar] [CrossRef] [Green Version]
  2. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  3. Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A. The open images dataset v4. Int. J. Comput. Vis. 2020, 128, 1956–1981. [Google Scholar] [CrossRef] [Green Version]
  4. Shao, S.; Li, Z.; Zhang, T.; Peng, C.; Yu, G.; Zhang, X.; Li, J.; Sun, J. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 8430–8439. [Google Scholar]
  5. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203. [Google Scholar] [CrossRef] [Green Version]
  6. Mundhenk, T.N.; Konjevod, G.; Sakla, W.A.; Boakye, K. A large contextual dataset for classification, detection and counting of cars with deep learning. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 785–800. [Google Scholar]
  7. Liu, Z.; Wang, H.; Weng, L.; Yang, Y. Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1074–1078. [Google Scholar] [CrossRef]
  8. Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
  9. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. Isprs J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  10. Zou, Z.; Shi, Z. Random access memories: A new paradigm for target detection in high resolution aerial remote sensing images. IEEE Trans. Image Process. 2017, 27, 1100–1111. [Google Scholar] [CrossRef] [PubMed]
  11. Zhu, M.; Hu, J.; Pu, Z.; Cui, Z.; Yan, L.; Wang, Y. Traffic Sign Detection and Recognition for Autonomous Driving in Virtual Simulation Environment. arXiv 2019, arXiv:1911.05626. [Google Scholar]
  12. Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Proceedings of the Field and Service Robotics, Zurich, Switzerland, 12–15 September 2017; pp. 621–635. [Google Scholar]
  13. Li, W.; Pan, C.; Zhang, R.; Ren, J.; Ma, Y.; Fang, J.; Yan, F.; Geng, Q.; Huang, X.; Gong, H. AADS: Augmented autonomous driving simulation using data-driven algorithms. Sci. Robot. 2019, 4. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  14. Best, A.; Narang, S.; Pasqualin, L.; Barber, D.; Manocha, D. Autonovi-sim: Autonomous vehicle simulation platform with weather, sensing, and traffic control. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1048–1056. [Google Scholar]
  15. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  16. Johnson-Roberson, M.; Barto, C.; Mehta, R.; Sridhar, S.N.; Rosaen, K.; Vasudevan, R. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? arXiv 2016, arXiv:1610.01983. [Google Scholar]
  17. Ros, G.; Sellart, L.; Materzynska, J.; Vazquez, D.; Lopez, A.M. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3234–3243. [Google Scholar]
  18. Angus, M.; ElBalkini, M.; Khan, S.; Harakeh, A.; Andrienko, O.; Reading, C.; Waslander, S.; Czarnecki, K. Unlimited road-scene synthetic annotation (URSA) dataset. In Proceedings of the 21st International Conference on Intelligent Transportation Systems (ITSC), Salt Lake City, UT, USA, 18–22 June 2018; pp. 985–992. [Google Scholar]
  19. Sun, X.; Zheng, L. Dissecting person re-identification from the viewpoint of viewpoint. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 608–617. [Google Scholar]
  20. Wang, Y.; Liao, S.; Shao, L. Surpassing real-world source training data: Random 3d characters for generalizable person re-identification. In Proceedings of the 28th ACM International Conference on Multimedia, Augsburg, Germany, 23–28 September 2020; pp. 3422–3430. [Google Scholar]
  21. Yao, Y.; Zheng, L.; Yang, X.; Naphade, M.; Gedeon, T. Simulating content consistent vehicle datasets with attribute descent. arXiv 2019, arXiv:1912.08855. [Google Scholar]
  22. Zhang, Z.; Wang, C.; Qiu, W.; Qin, W.; Zeng, W. AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild. Int. J. Comput. Vis. 2020, 129, 703–718. [Google Scholar] [CrossRef]
  23. Airbus. Airbus Ship Detection Challenge. 2019. Available online: https://www.kaggle.com/c/airbus-ship-detection (accessed on 6 December 2021).
  24. Zhang, Y.; Yuan, Y.; Feng, Y.; Lu, X. Hierarchical and Robust Convolutional Neural Network for Very High-Resolution Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5535–5548. [Google Scholar] [CrossRef]
  25. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef] [Green Version]
  26. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  27. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 9627–9636. [Google Scholar]
  28. Heitz, G.; Koller, D. Learning spatial context: Using stuff to find things. In Proceedings of the European Conference on Computer Vision, Marseille, France, 12–18 October 2008; pp. 30–43. [Google Scholar]
  29. Benedek, C.; Descombes, X.; Zerubia, J. Building development monitoring in multitemporal remotely sensed image pairs with stochastic birth-death dynamics. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 33–50. [Google Scholar] [CrossRef] [Green Version]
  30. Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
  31. Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional neural network. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 3735–3739. [Google Scholar]
  32. Liu, K.; Mattyus, G. Fast multiclass vehicle detection on aerial images. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1938–1942. [Google Scholar]
  33. Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
  34. Yang, M.Y.; Liao, W.; Li, X.; Cao, Y.; Rosenhahn, B. Vehicle detection in aerial images. Photogramm. Eng. Remote Sens. 2019, 85, 297–304. [Google Scholar] [CrossRef] [Green Version]
  35. Lam, D.; Kuzma, R.; McGee, K.; Dooley, S.; Laielli, M.; Klaric, M.; Bulatov, Y.; McCord, B. xView: Objects in context in overhead imagery. arXiv 2018, arXiv:1802.07856. [Google Scholar]
  36. Zhu, P.; Wen, L.; Bian, X.; Ling, H.; Hu, Q. Vision meets drones: A challenge. arXiv 2018, arXiv:1804.07437. [Google Scholar]
  37. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
  38. Martinez, M.; Sitawarin, C.; Finch, K.; Meincke, L.; Yablonski, A.; Kornhauser, A. Beyond grand theft auto V for training, testing and enhancing deep learning in self driving cars. arXiv 2017, arXiv:1712.01397. [Google Scholar]
  39. Richter, S.R.; Hayder, Z.; Koltun, V. Playing for benchmarks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2213–2222. [Google Scholar]
  40. Wrenninge, M.; Unger, J. Synscapes: A photorealistic synthetic dataset for street scene parsing. arXiv 2018, arXiv:1810.08705. [Google Scholar]
  41. Hurl, B.; Czarnecki, K.; Waslander, S. Precise synthetic image and lidar (presil) dataset for autonomous vehicle perception. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 2522–2529. [Google Scholar]
  42. Varol, G.; Romero, J.; Martin, X.; Mahmood, N.; Black, M.J.; Laptev, I.; Schmid, C. Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 109–117. [Google Scholar]
  43. Roberts, M.; Paczan, N. Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding. arXiv 2020, arXiv:2011.02523. [Google Scholar]
  44. Zhang, J.; Cao, Y.; Zha, Z.-J.; Tao, D. Nighttime dehazing with a synthetic benchmark. In Proceedings of the 28th ACM International Conference on Multimedia, Augsburg, Germany, 23–28 September 2020; pp. 2355–2363. [Google Scholar]
  45. Xue, Z.; Mao, W.; Zheng, L. Learning to simulate complex scenes. arXiv 2020, arXiv:2006.14611. [Google Scholar]
  46. Bondi, E.; Jain, R.; Aggrawal, P.; Anand, S.; Hannaford, R.; Kapoor, A.; Piavis, J.; Shah, S.; Joppa, L.; Dilkina, B. BIRDSAI: A dataset for detection and tracking in aerial thermal infrared videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Village, CO, USA, 1–5 March 2020; pp. 1747–1756. [Google Scholar]
  47. Chen, L.; Liu, F.; Zhao, Y.; Wang, W.; Yuan, X.; Zhu, J. Valid: A comprehensive virtual aerial image dataset. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 2009–2016. [Google Scholar]
  48. Kong, F.; Huang, B.; Bradbury, K.; Malof, J. The Synthinel-1 dataset: A collection of high resolution synthetic overhead imagery for building segmentation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1814–1823. [Google Scholar]
  49. Shermeyer, J.; Hossler, T.; Van Etten, A.; Hogan, D.; Lewis, R.; Kim, D. RarePlanes: Synthetic Data Takes Flight. arXiv 2020, arXiv:2006.02963. [Google Scholar]
  50. Clement, N.; Schoen, A.; Boedihardjo, A.; Jenkins, A. Synthetic Data and Hierarchical Object Detection in Overhead Imagery. arXiv 2021, arXiv:2102.00103. [Google Scholar]
  51. Lopez-Campos, R.; Martinez-Carranza, J. ESPADA: Extended Synthetic and Photogrammetric Aerial-image Dataset. IEEE Robot. Autom. Lett. 2021, 6, 7981–7988. [Google Scholar] [CrossRef]
  52. Uddin, M.S.; Hoque, R.; Islam, K.A.; Kwan, C.; Gribben, D.; Li, J. Converting Optical Videos to Infrared Videos Using Attention GAN and Its Impact on Target Detection and Classification Performance. Remote Sens. 2021, 13, 3257. [Google Scholar] [CrossRef]
  53. Liu, Z.; Hu, J.; Weng, L.; Yang, Y. Rotated region based CNN for ship detection. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 13–17 September 2017; pp. 900–904. [Google Scholar]
  54. Zhang, Z.; Guo, W.; Zhu, S.; Yu, W.J.I.G.; Letters, R.S. Toward arbitrary-oriented ship detection with rotated region proposal and discrimination networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1745–1749. [Google Scholar] [CrossRef]
  55. Liu, W.; Ma, L.; Chen, H.J.I.G.; Letters, R.S. Arbitrary-oriented ship detection framework in optical remote-sensing images. IEEE Geosci. Remote Sens. Lett. 2018, 15, 937–941. [Google Scholar] [CrossRef]
  56. Li, Q.; Mou, L.; Liu, Q.; Wang, Y.; Zhu, X.X. HSF-Net: Multiscale deep feature embedding for ship detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2018, 56, 7147–7161. [Google Scholar] [CrossRef]
  57. Zhu, X.; Cheng, D.; Zhang, Z.; Lin, S.; Dai, J. An empirical study of spatial attention mechanisms in deep networks. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 6688–6697. [Google Scholar]
  58. Qi, S.; Wu, J.; Zhou, Q.; Kang, M. Low-resolution ship detection from high-altitude aerial images. In MIPPR 2017: Automatic Target Recognition and Navigation; International Society for Optics and Photonics: Bellingham, WA, USA, 2018; p. 1060805. [Google Scholar]
  59. Zhang, Y.; Guo, L.; Wang, Z.; Yu, Y.; Liu, X.; Xu, F. Intelligent Ship Detection in Remote Sensing Images Based on Multi-Layer Convolutional Feature Fusion. Remote Sens. 2020, 12, 3316. [Google Scholar] [CrossRef]
  60. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
  61. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  62. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  63. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Figure 1. Examples of ship models, including ten categories and 79 kinds of ship models.
Figure 2. Examples of random environment property settings at the same location.
Figure 3. Example of bounding box labeling, where red indicates rotated bounding boxes and green indicates horizontal bounding boxes.
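Because each ship instance carries both annotation types, the horizontal box is simply the axis-aligned hull of the rotated box. The following is a minimal sketch of that relationship, assuming the rotated box is given as four (x, y) corner points in pixel coordinates; the function and variable names are illustrative and not part of any released toolkit.

from typing import List, Tuple

def obb_to_hbb(corners: List[Tuple[float, float]]) -> Tuple[float, float, float, float]:
    """Convert an oriented bounding box, given as four (x, y) corner points,
    into the axis-aligned (horizontal) box (x_min, y_min, x_max, y_max)."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)

# Example: a ship annotated with a rotated box
obb = [(120.0, 80.0), (180.0, 95.0), (170.0, 135.0), (110.0, 120.0)]
print(obb_to_hbb(obb))  # (110.0, 80.0, 180.0, 135.0)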
Figure 4. Scene statistics for the UnityShip dataset, including: (a) the number of images from different scenes, (b) whether they contain clouds, (c) different collection altitudes, and (d) different collection times.
Figure 5. Category information of ship models: (a) the number of ship models contained in each category, and (b) the number of ship instances per category.
Figure 6. Area distribution map of ship instances. Different colors represent different categories, and the size of each region is proportional to the number of ship instances in that category.
Figure 7. Location distribution (a) and density distribution (b) of ship instances.
Figure 8. (a) Width and height distribution and (b) area distribution of the ship instance bounding boxes.
Figure 9. Examples of real-world aerial image datasets for ship recognition. From left to right: (a) AirBus-Ship, (b) DOTA, (c) DIOR, (d) TGRS-HRRSD, (e) LEVIR, and (f) HRSC2016 datasets.
Figure 10. Examples from the UnityShip synthetic dataset.
Table 1. Existing publicly available aerial image datasets. HBB indicates horizontal bounding box and OBB indicates oriented bounding box.

Dataset | Data Sources | Categories | Images | Instances | Size | Type of Annotation | Publication Year
TAS [28] | Aircraft | 1 | 30 | 1319 | 792 | HBB | 2008
SZTAKI-INRIA [29] | Satellite/Maps | 1 | 9 | 665 | ~800 | OBB | 2012
NWPU VHR-10 [30] | Satellite/Maps | 10 | 800 | 3775 | ~1000 | HBB | 2014
VEDAI [5] | Satellite/Maps | 9 | 1210 | 3640 | 1024 | OBB | 2015
UCAS-AOD [31] | Satellite/Maps | 2 | 910 | 6029 | 1280 | HBB | 2015
DLR 3K [32] | Aircraft | 2 | 20 | 14,235 | 5616 | OBB | 2015
COWC [6] | Aircraft | 4 | 53 | 32,716 | 2000~20,000 | HBB | 2016
HRSC2016 [7] | Satellite/Maps | 1 | 1070 | 2976 | ~1000 | OBB | 2016
RSOD [33] | Satellite/Maps | 4 | 976 | 6950 | ~1000 | HBB | 2017
DOTA [8] | Satellite/Maps | 15 | 2806 | 188,282 | 800~4000 | OBB | 2017
ITCVD [34] | Aircraft | 1 | 173 | 29,088 | 1000~2000 | HBB | 2018
DIU xView [35] | Satellite/Maps | 60 | 1413 | 800,636 | / | HBB | 2018
DIOR [9] | Satellite/Maps | 20 | 23,463 | 192,472 | 800 | HBB | 2018
VisDrone2018 [36] | Aircraft | 10 | 10,209 | 542,000 | 1500~2000 | HBB | 2018
UAVDT [37] | Aircraft | 3 | 40,409 | 798,755 | ~1024 | HBB | 2018
LEVIR [10] | Satellite/Maps | 3 | 21,952 | 10,467 | 600–800 | HBB | 2018
AirBus-Ship [23] | Satellite/Maps | 1 | 42,556 | 81,011 | 768 | HBB | 2018
TGRS-HRRSD [24] | Satellite/Maps | 13 | 21,761 | 55,740 | 800–2000 | HBB | 2019
Table 2. Comparison between generic datasets and aerial datasets.

Type | Dataset | Categories | Images | Instances per Image
Aerial dataset | DOTA [8] | 15 | 2806 | 67.1
Aerial dataset | VisDrone [36] | 10 | 10,209 | 53.0
Generic dataset | VOC [1] | 20 | 22,531 | 2.4
Generic dataset | COCO [2] | 80 | 163,957 | 7.3
Generic dataset | OID [3] | 500 | 1,743,042 | 9.8
Generic dataset | Objects365 [4] | 365 | 638,000 | 15.8
Table 3. Comparison of the real-world datasets and UnityShip synthetic datasets.

Dataset | Images | Size | Instances | Instances per Image | Type of Annotation
AirBus-Ship | 42,556 | 768 | 81,011 | 1.90 | HBB
DIOR | 2702 | 800 | 61,371 | 22.71 | HBB
DOTA | 2704 | 1000 | 80,164 | 29.65 | OBB
HRSC2016 | 1070 | 800~1200 | 2971 | 2.78 | OBB
LEVIR | 1494 | 800 × 600 | 2961 | 1.98 | HBB
TGRS-HRRSD | 2165 | 600~2000 | 3954 | 1.83 | HBB
UnityShip 100k | 105,806 | 1600 × 2000 | 194,054 | 1.83 | HBB, OBB
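The instances-per-image column above is the ratio of instance count to image count. As a quick illustration, the sketch below recomputes that ratio from a COCO-style annotation file; the file name is a placeholder, and the COCO JSON layout is an assumption about how a given dataset is packaged rather than a statement about the released annotation format.

import json

def instances_per_image(ann_path: str) -> float:
    """Return the mean number of annotated instances per image
    for a COCO-style annotation file (assumed layout)."""
    with open(ann_path) as f:
        coco = json.load(f)
    n_images = len(coco["images"])
    n_instances = len(coco["annotations"])
    return n_instances / n_images if n_images else 0.0

# Placeholder path; e.g., 194,054 instances over 105,806 images gives ~1.83.
print(round(instances_per_image("unityship_train.json"), 2))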
Table 4. Comparison of AP for different datasets with different pre-training and training schedules. The red indicates results for longer training schedules (24 and 144 epochs) and the green indicates results for shorter training schedules (12 and 72 epochs).

Method | Pre-Training | Epochs | AirBus-Ship | DOTA | DIOR | HRSC2016 | LEVIR | TGRS-HRRSD
Faster RCNN | ImageNet | 12 | 50.79 | 40.67 | 47.68 | 68.36 | 63.97 | 72.30
Faster RCNN | ImageNet | 24 | 55.19 | 42.49 | 47.99 | 74.07 | 64.99 | 75.39
Faster RCNN | None | 72 | 59.49 | 42.05 | 44.33 | 60.64 | 57.58 | 64.79
Faster RCNN | None | 144 | 61.40 | 43.81 | 43.25 | 63.17 | 57.59 | 67.59
Faster RCNN | UnityShip | 12 | 51.53 | 41.63 | 48.40 | 76.64 | 65.21 | 76.99
Faster RCNN | UnityShip | 24 | 55.76 | 43.10 | 48.10 | 78.48 | 66.02 | 78.18
RetinaNet | ImageNet | 12 | 53.65 | 22.55 | 34.54 | 48.74 | 56.44 | 63.33
RetinaNet | ImageNet | 24 | 60.37 | 27.23 | 40.30 | 59.32 | 63.56 | 72.32
RetinaNet | None | 72 | 64.60 | 33.00 | 40.44 | 53.10 | 56.83 | 65.91
RetinaNet | None | 144 | 69.53 | 36.18 | 41.18 | 55.25 | 56.31 | 69.28
RetinaNet | UnityShip | 12 | 55.07 | 27.82 | 42.96 | 70.95 | 64.02 | 78.47
RetinaNet | UnityShip | 24 | 60.67 | 32.10 | 43.76 | 74.75 | 66.94 | 79.09
FCOS | ImageNet | 12 | 52.64 | 30.59 | 42.65 | 44.80 | 59.73 | 61.97
FCOS | ImageNet | 24 | 60.69 | 36.79 | 46.19 | 60.22 | 62.07 | 69.84
FCOS | None | 72 | 63.37 | 35.81 | 45.89 | 66.75 | 61.03 | 63.70
FCOS | None | 144 | 69.89 | 42.10 | 44.60 | 68.58 | 48.78 | 69.78
FCOS | UnityShip | 12 | 52.31 | 34.98 | 46.92 | 73.34 | 64.42 | 76.30
FCOS | UnityShip | 24 | 60.77 | 39.01 | 48.95 | 74.58 | 65.97 | 76.91
Table 5. Comparison of mAP, AP0.5, and AR for different datasets with different model-pretraining settings and different training schedules. Each dataset column reports mAP / AP0.5 / AR.

Method | Pre-Training | Epochs | AirBus-Ship | DIOR | DOTA | HRSC2016 | LEVIR | TGRS-HRRSD
Faster RCNN | ImageNet | 12 | 50.79/70.22/53.74 | 47.68/76.32/52.45 | 40.67/62.33/44.64 | 68.36/91.91/73.31 | 63.97/93.37/68.31 | 72.30/96.37/75.47
Faster RCNN | ImageNet | 24 | 55.19/70.32/57.80 | 47.99/75.38/52.25 | 42.49/61.77/45.92 | 74.07/93.49/77.95 | 64.99/92.89/68.58 | 75.39/95.80/78.22
Faster RCNN | None | 72 | 59.49/69.00/61.52 | 44.33/71.21/49.17 | 42.05/59.86/45.45 | 60.64/83.53/66.50 | 57.58/86.48/62.43 | 64.79/86.48/68.99
Faster RCNN | None | 144 | 61.40/67.18/62.70 | 43.25/68.28/48.25 | 43.81/59.91/47.08 | 63.17/84.34/68.82 | 57.59/86.93/63.36 | 67.59/88.56/72.00
Faster RCNN | UnityShip | 12 | 51.53/69.50/54.35 | 48.40/75.48/52.88 | 41.63/62.51/45.42 | 76.64/93.99/80.64 | 65.21/92.81/69.37 | 76.99/96.50/79.76
Faster RCNN | UnityShip | 24 | 55.76/70.37/58.37 | 48.10/74.49/52.47 | 43.10/61.84/46.61 | 78.48/93.72/81.91 | 66.02/92.45/69.58 | 78.18/97.63/80.65
RetinaNet | ImageNet | 12 | 53.65/81.99/61.78 | 34.54/66.77/42.98 | 22.55/49.24/31.66 | 48.74/86.76/63.26 | 56.44/88.10/63.53 | 63.33/93.07/69.03
RetinaNet | ImageNet | 24 | 60.37/86.74/67.41 | 40.30/71.80/47.71 | 27.23/53.07/35.12 | 59.32/90.75/70.10 | 63.56/91.90/69.29 | 72.32/96.77/77.42
RetinaNet | None | 72 | 64.60/89.76/70.84 | 40.44/70.34/47.84 | 33.00/56.11/39.13 | 53.10/81.77/62.43 | 56.83/86.58/63.71 | 65.91/90.45/71.40
RetinaNet | None | 144 | 69.53/91.42/75.46 | 41.18/71.44/48.08 | 36.18/57.40/41.52 | 55.25/81.35/63.47 | 56.31/85.63/63.00 | 69.28/92.94/74.84
RetinaNet | UnityShip | 12 | 55.07/83.65/62.62 | 42.96/72.31/49.60 | 27.82/53.97/35.83 | 70.95/92.52/79.76 | 64.02/91.99/69.83 | 78.47/97.79/82.00
RetinaNet | UnityShip | 24 | 60.67/87.47/67.51 | 43.76/72.16/50.73 | 32.10/56.21/38.77 | 74.75/92.20/81.64 | 66.94/92.80/71.43 | 79.09/97.60/82.50
FCOS | ImageNet | 12 | 52.64/78.38/57.31 | 42.65/77.85/49.09 | 30.59/59.41/36.62 | 44.80/82.87/58.75 | 59.73/90.66/65.50 | 61.97/93.26/69.24
FCOS | ImageNet | 24 | 60.69/83.91/64.82 | 46.19/79.29/52.21 | 36.79/62.67/41.71 | 60.22/89.71/67.85 | 62.07/89.92/67.72 | 69.84/96.26/74.66
FCOS | None | 72 | 63.37/84.46/67.41 | 45.89/78.95/52.30 | 35.81/62.01/41.06 | 66.75/90.51/72.30 | 61.03/89.21/66.73 | 63.70/88.53/69.19
FCOS | None | 144 | 69.89/87.14/73.47 | 44.60/77.70/51.47 | 42.10/63.93/46.92 | 68.58/89.42/74.19 | 48.78/79.72/58.26 | 69.78/90.33/74.47
FCOS | UnityShip | 12 | 52.31/77.99/57.34 | 46.92/79.26/52.83 | 34.98/61.81/40.01 | 73.34/91.68/79.08 | 64.42/92.16/69.37 | 76.30/96.62/79.99
FCOS | UnityShip | 24 | 60.77/83.61/64.95 | 48.95/80.43/54.75 | 39.01/63.77/43.61 | 74.58/91.58/80.01 | 65.97/91.62/70.74 | 76.91/95.65/80.87
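Tables 4 and 5 compare three initialization strategies (ImageNet backbone weights, no pre-training, and UnityShip pre-training) for detectors trained with the MMDetection framework [60]. Below is a minimal sketch of how such a comparison could be set up in an MMDetection 2.x-style config; the base config name and keys are standard MMDetection conventions, but the UnityShip checkpoint path is hypothetical and the exact schedules used in the paper may differ.

# Variant A: ImageNet-initialized ResNet-50 backbone (the common default).
_base_ = ['faster_rcnn_r50_fpn_1x_coco.py']
model = dict(
    backbone=dict(init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')))

# Variant B: training from scratch (no pre-trained weights, longer schedule).
# model = dict(backbone=dict(init_cfg=None))
# runner = dict(type='EpochBasedRunner', max_epochs=72)

# Variant C: initialize the whole detector from a model pre-trained on UnityShip.
# load_from = 'work_dirs/unityship_pretrain/latest.pth'  # hypothetical path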
Table 6. Comparison of AP for different datasets with and without augmentation and different training schedules. The red indicates results for longer training schedules (24 epochs) and the green indicates results for shorter training schedules (12 epochs).

Method | Augmentation | Epochs | AirBus-Ship | DOTA | DIOR | HRSC2016 | LEVIR | TGRS-HRRSD
Faster RCNN | / | 12 | 50.79 | 40.67 | 47.68 | 68.36 | 63.97 | 72.30
Faster RCNN | / | 24 | 55.19 | 42.49 | 47.99 | 74.07 | 64.99 | 75.39
Faster RCNN | UnityShip | 12 | 50.51 | 41.09 | 47.92 | 74.01 | 64.23 | 73.38
Faster RCNN | UnityShip | 24 | 54.69 | 42.88 | 48.25 | 76.89 | 65.01 | 76.12
RetinaNet | / | 12 | 53.65 | 22.55 | 34.54 | 48.74 | 56.44 | 63.33
RetinaNet | / | 24 | 60.37 | 27.23 | 40.30 | 59.32 | 63.56 | 72.32
RetinaNet | UnityShip | 12 | 53.72 | 25.39 | 39.93 | 64.65 | 64.23 | 70.49
RetinaNet | UnityShip | 24 | 60.20 | 30.32 | 42.53 | 69.17 | 65.01 | 74.39
FCOS | / | 12 | 52.64 | 30.59 | 42.65 | 44.80 | 59.73 | 61.97
FCOS | / | 24 | 60.69 | 36.79 | 46.19 | 60.22 | 62.07 | 69.84
FCOS | UnityShip | 12 | 51.69 | 30.66 | 42.38 | 74.01 | 62.79 | 71.31
FCOS | UnityShip | 24 | 59.28 | 36.10 | 46.14 | 76.89 | 63.39 | 72.41
Table 7. Comparison of mAP, AP0.5, and AR for different datasets with and without augmentation and different training schedules. Each dataset column reports mAP / AP0.5 / AR.

Method | Augmentation | Epochs | AirBus-Ship | DIOR | DOTA | HRSC2016 | LEVIR | TGRS-HRRSD
Faster RCNN | / | 12 | 50.79/70.22/53.74 | 47.68/76.32/52.45 | 40.67/62.33/44.64 | 68.36/91.91/73.31 | 63.97/93.37/68.31 | 72.30/96.37/75.47
Faster RCNN | / | 24 | 55.19/70.32/57.80 | 47.99/75.38/52.25 | 42.49/61.77/45.92 | 74.07/93.49/77.95 | 64.99/92.89/68.58 | 75.39/95.80/78.22
Faster RCNN | UnityShip | 12 | 50.51/69.37/53.49 | 47.92/75.40/52.53 | 41.09/61.48/44.91 | 74.01/93.27/78.26 | 64.23/93.29/68.43 | 73.38/95.36/76.79
Faster RCNN | UnityShip | 24 | 54.69/70.18/57.16 | 48.25/74.58/52.64 | 42.88/61.82/46.36 | 76.89/93.30/80.65 | 65.01/92.38/68.80 | 76.12/95.65/78.83
RetinaNet | / | 12 | 53.65/81.99/61.78 | 34.54/66.77/42.98 | 22.55/49.24/31.66 | 48.74/86.76/63.26 | 56.44/88.10/63.53 | 63.33/93.07/69.03
RetinaNet | / | 24 | 60.37/86.74/67.41 | 40.30/71.80/47.71 | 27.23/53.07/35.12 | 59.32/90.75/70.10 | 63.56/91.90/69.29 | 72.32/96.77/77.42
RetinaNet | UnityShip | 12 | 53.72/82.35/61.57 | 39.93/70.50/47.04 | 25.39/51.33/33.08 | 64.65/89.81/71.25 | 64.23/93.29/68.43 | 70.49/95.74/76.03
RetinaNet | UnityShip | 24 | 60.20/86.94/67.20 | 42.53/72.57/49.52 | 30.32/55.63/37.06 | 69.17/90.29/77.37 | 65.01/92.38/68.80 | 74.39/97.00/79.28
FCOS | / | 12 | 52.64/78.38/57.31 | 42.65/77.85/49.09 | 30.59/59.41/36.62 | 44.80/82.87/58.75 | 59.73/90.66/65.50 | 61.97/93.26/69.24
FCOS | / | 24 | 60.69/83.91/64.82 | 46.19/79.29/52.21 | 36.79/62.67/41.71 | 60.22/89.71/67.85 | 62.07/89.92/67.72 | 69.84/96.26/74.66
FCOS | UnityShip | 12 | 51.69/77.25/56.53 | 42.38/77.27/48.43 | 30.66/59.80/36.11 | 74.01/93.27/78.26 | 62.79/91.00/68.01 | 71.31/93.41/75.85
FCOS | UnityShip | 24 | 59.28/82.70/63.54 | 46.14/79.31/51.90 | 36.10/62.60/40.80 | 76.89/93.30/80.65 | 63.39/91.77/68.78 | 72.41/93.73/76.49
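In the augmentation experiments (Tables 6 and 7), synthetic images are added to each real training set rather than used for a separate pre-training stage. One simple way to realize this, assuming both datasets ship COCO-style annotation files sharing a single ship category, is to merge the two files with re-assigned IDs before training; the sketch below illustrates the idea with placeholder file names.

import json

def merge_coco(real_path: str, synthetic_path: str, out_path: str) -> None:
    """Concatenate two COCO-style datasets (assumed to share one category),
    offsetting image and annotation IDs so they do not collide."""
    with open(real_path) as f:
        merged = json.load(f)
    with open(synthetic_path) as f:
        extra = json.load(f)

    img_offset = max(img["id"] for img in merged["images"]) + 1
    ann_offset = max(ann["id"] for ann in merged["annotations"]) + 1

    for img in extra["images"]:
        img["id"] += img_offset
        merged["images"].append(img)
    for ann in extra["annotations"]:
        ann["id"] += ann_offset
        ann["image_id"] += img_offset
        merged["annotations"].append(ann)

    with open(out_path, "w") as f:
        json.dump(merged, f)

# Placeholder file names: a real training set plus a UnityShip synthetic subset.
merge_coco("hrsc2016_train.json", "unityship_subset.json", "hrsc2016_plus_synthetic.json")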
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
