**Application of Multi-Sensor Fusion Technology in Target Detection and Recognition**

Editors

**Jukka Heikkonen Fahimeh Farahnakian**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors* Jukka Heikkonen Computing University of Turku Turku Finland

Fahimeh Farahnakian Computing University of Turku Turku Finland

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Remote Sensing* (ISSN 2072-4292) (available at: www.mdpi.com/journal/remotesensing/special_issues/application_multi-sensor_fusion_technology_target_detection_recognition).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-1352-2 (Hbk) ISBN 978-3-0365-1351-5 (PDF)**

© 2022 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.


## **About the Editors**

### **Jukka Heikkonen**

Jukka Heikkonen is a full professor and head of the Algorithms and Computational Intelligence Research Lab, University of Turku, Finland. His research focuses on data analytics, machine learning and autonomous systems. He has worked at top-level research laboratories and Centers of Excellence in Finland and at international organizations (European Commission, Japan), and has led many international and national research projects. He has authored more than 150 peer-reviewed scientific articles. He has served as an organizing/program committee member for numerous conferences and acted as a guest editor for five special issues of scientific journals.

### **Fahimeh Farahnakian**

Fahimeh Farahnakian is currently an adjunct professor (docent) in the Algorithms and Computational Intelligence Research Lab, Department of Future Technologies, University of Turku, Finland. Her research interests include the theory and algorithms of machine learning, computer vision and data analysis methods, and their applications in various fields. She has published more than 30 articles in journals and conference proceedings. She is a member of the IEEE and has served on the program committees of numerous scientific conferences.

## **Preface to "Application of Multi-Sensor Fusion Technology in Target Detection and Recognition"**

Multi-sensor fusion technology has drawn a lot of industrial and academic interest in recent years. Multi-sensor fusion methods are widely used in applications such as autonomous systems, remote sensing, video surveillance and the military. By considering multiple sensors, these methods can capture complementary properties of targets. They can also achieve a detailed description of the environment and accurate detection of targets of interest based on the information from different sensors.

This book collects novel developments in the field of multi-sensor, multi-source and multi-process information fusion. Articles are expected to emphasize one or more of three facets: architectures, algorithms, and applications. The published papers include fundamental theoretical analyses as well as demonstrations of their application to real-world problems.

> **Jukka Heikkonen, Fahimeh Farahnakian** *Editors*

## *Article* **Deep Learning Based Multi-Modal Fusion Architectures for Maritime Vessel Detection**

## **Fahimeh Farahnakian \*,† and Jukka Heikkonen †**

Department of Future Technologies, University of Turku, 20500 Turku, Finland; jukhei@utu.fi


Received: 20 July 2020; Accepted: 2 August 2020; Published: 5 August 2020

**Abstract:** Object detection is a fundamental computer vision task for many real-world applications. In the maritime environment, this task is challenging due to varying light, view distances, weather conditions, and sea waves. In addition, light reflection, camera motion and illumination changes may cause false detections. To address this challenge, we present three fusion architectures to fuse two imaging modalities: visible and infrared. These architectures can provide complementary information from the two modalities at different levels: pixel-level, feature-level, and decision-level. They employ deep learning for performing fusion and detection. We investigate the performance of the proposed architectures on a real marine image dataset, which was captured by color and infrared cameras on board a vessel in the Finnish archipelago. The cameras are employed for developing autonomous ships, and collect data in a range of operating and climatic conditions. Experiments show that the feature-level fusion architecture outperforms the other state-of-the-art fusion-level architectures.

**Keywords:** multi-sensor fusion; object detection; deep learning; convolutional neural networks; autonomous vehicles; marine environment

#### **1. Introduction**

Object detection is a crucial problem for autonomous vehicles and has been studied for years to make it more efficient and faster. A reliable autonomous driving system relies on accurate object detection to provide robust perception of the environment. In addition, the performance of subsequent tasks such as object classification and tracking depends strongly on object detection. In the marine environment, object detection is challenging due to varying light, view distances, weather conditions, and the dynamic nature of the sea. In addition, light reflection, camera motion and illumination changes may cause false detections [1].

Multi-sensor fusion technology is a promising solution for achieving accurate object detection by obtaining the complementary properties of objects based on multiple sensors. The multi-sensor fusion architectures are generally classified into three groups that are based on the level of data abstraction used for fusion [2]. (1) Early fusion, also called pixel-level fusion, combines raw data from the sensors before applying any information extraction strategies. (2) Middle fusion, also called feature-level fusion, fuses the extracted features from each raw sensor data and then performs detection on the fused data. (3) Late fusion, also called decision-level fusion, independently performs detection from each sensor and the outputs of each sensor are fused at the decision level for final detection.

Among the possible combinations of sensor types, InfRared (IR) and visible (RGB) image fusion is superior in many aspects [3]. Firstly, image sensors are cheap compared with other sensors, such as radar and LiDAR (Light Detection And Ranging). Secondly, collecting and annotating image data is much easier than for LiDAR point clouds. Thirdly, IR and RGB images have complementary properties, thus producing robust and informative fused images. Finally, RGB images typically have high spatial resolution and considerable detail compared with the images obtained from other sensors. However, these images can be easily degraded by severe conditions, such as poor illumination, fog, and other effects of bad weather. Meanwhile, thermal IR cameras capture relative temperature, which allows warm objects, such as people, to be distinguished from cold objects, such as navigation buoys or islands. Moreover, IR cameras can improve navigation safety by day and night and in all weather conditions by identifying objects of interest based on radiation differences [1–3].

Convolutional Neural Networks (CNNs, or ConvNets) have significantly improved the performance of computer vision tasks, such as object classification [4], detection [5,6], and segmentation [7]. Moreover, various fusion approaches have employed CNNs in autonomous vehicles [1,8,9]. While the majority of these approaches have focused on RGB images, some have also used infrared images for object detection. We use a CNN to address the object detection problem in the marine environment, both to fill this gap and because CNNs are very powerful models for computer vision tasks.

In this work, we present three CNN architectures (early, middle and late fusion) to carry out vessel detection in the marine environment. These architectures fuse the images from the visible and thermal infrared cameras at different levels of data abstraction. In addition, they employ a deep CNN as a detector to generate bounding box proposals for vessels of interest. We did not consider any semantic segmentation algorithms in this study. The CNN is trained on data from a single sensor or from both sensors, according to the proposed fusion strategies. In other words, we investigate the training of uni-modal as well as multi-modal architectures. We also evaluated the proposed fusion architectures on a real marine dataset that was collected by a vessel in the Finnish archipelago. The data consist of images captured by RGB and IR cameras in different marine environmental conditions (i.e., weather conditions, light conditions, daytime/nighttime). To the best of our knowledge, no work has been done on studying the effectiveness of the three different levels of fusion in the marine environment. To summarize, the main contributions of this paper are threefold:


The remainder of this work is organized as follows. Section 2 discusses some of the most important related works. The proposed architectures are presented in Sections 3–5. Sections 6 and 7 describe the experimental setup and the results of our implementations, respectively. Finally, we present our conclusions in Section 8.

#### **2. Related Work**

In this section, we briefly review the related work on infrared and visible image fusion and on object detection using CNNs. In addition, maritime vessel detection is discussed.

**CNNs for fusion:** Many image fusion techniques have been developed in recent years. The main idea of these techniques is to obtain salient features from the input images and then combine them to generate a fused image [10]. Deep Learning (DL) is one of the widely used approaches recently adopted by these techniques, since it can explore the features from the data efficiently [8]. It is able to obtain features from the input images and then reconstruct a fused image with more detail.

Multi-Scale CNN (MS-CNN) is one of these techniques that uses DL for performing pixel-level image fusion. It uses a proposal sub-network to perform target detection at multiple output layers, so that receptive fields match objects of different scales. These complementary scale-specific detectors are combined to create a strong multi-scale object detector. In [9], a middle fusion approach is proposed for fusing LiDAR and RGB data in order to classify objects in an autonomous vehicle application. This approach first converts the LiDAR point cloud data into a depth map and then feeds the data to a CNN for object classification. In a similar work [11], the dense depth map from LiDAR data and color imagery are fused for pedestrian detection using a CNN. Their results show that fusing LiDAR data can improve the detection results. In another work, a DL-based fusion method [10] is presented to generate a fused image containing the features of both source images, IR and RGB. We describe the details of this method in Section 4.1.

DenseFuse [8] is another well-known DL-based fusion architecture for extracting and preserving most of the deep features of both RGB and IR images in a middle fusion fashion. In [1], a late fusion method based on Probabilistic Data Association (PDA) [12] is proposed to produce object region proposals by fusing detection results from RGB, IR, radar and LiDAR. Then, a CNN is applied on top of the region proposals to classify the objects of interest within the regions. DyFusion [13] is a decision-level fusion method for maritime vessel classification. It first uses a CNN to generate the probabilities over maritime vessel classes for each input sensor. Subsequently, a fusion part updates the sensor probabilities by considering the contextual data.

PointFusion [14] leverages both image and three-dimensional (3D) point cloud data in a late fusion architecture to perform target detection. The image data and point cloud data are processed independently by a CNN, and their results are then combined to estimate object bounding boxes. The main contribution of PointFusion is its use of heterogeneous network architectures. Moreover, the raw point cloud data is handled directly by a PointNet model, which avoids time-consuming input pre-processing such as quantization or projection.

**CNNs for object detection:** CNNs have recently been used in the development of object detection, as they are capable of exploiting unknown structures in training data to discover good representations [15]. CNN-based object detectors are divided into two categories: two-stage detectors and one-stage detectors. Two-stage detectors employ an external module for generating object region proposals, and they are usually slower than one-stage detectors. In contrast, one-stage object detectors integrate region proposal and classification into a single stage. However, two-stage detectors usually have higher detection accuracy than one-stage detectors. Popular two-stage detectors include R-CNN [16], Fast/Faster R-CNN [17,18], and R-FCN [19]. Among one-stage detectors, SSD [20] and YOLO [21] are the most common.

The Region-based Convolutional Neural Network (R-CNN) [16] led to substantial gains in object detection accuracy. R-CNN first identifies region proposals and then classifies these regions into object categories or background using a CNN. One disadvantage of R-CNN is that it performs an exhaustive search and proposes a large number of regions from an image. Therefore, R-CNN is time-consuming and energy-inefficient. Its extension, Fast R-CNN [17], uses a CNN to generate a feature map directly from the input image instead of from the regions. Both R-CNN and Fast R-CNN use selective search for obtaining the region proposals. To reduce the running time of Fast R-CNN, Faster R-CNN [18] replaces selective search with a separate network that identifies the object region proposals.

**Maritime vessel detection:** A few studies have applied object detection algorithms to waterborne images, beyond maritime vessel detection from spaceborne imagery [22]. Some of these works have focused on separating the objects of interest from the background [23], while others employed the Histogram of Oriented Gradients (HOG) approach using sliding windows [24]. Recently, CNNs have been used for seaborne vessel detection. However, more new datasets and applications need to be developed for autonomous maritime navigation. For instance, the Singapore Maritime Dataset is used in [25] for ship detection with a newly proposed model, YOLO [21]. In [26], a contextual region-based convolutional neural network with multi-layer fusion is proposed for ship detection. It consists of a region proposal network (RPN) and an object detection network with contextual features. Their results show that the additional contextual features provide more information for detection. However, this method cannot detect small objects efficiently. In [27], an approach based on selective search is presented in order to extract the initial region proposals from RGB images. Subsequently, the initial proposals are filtered using the information from other sensors in order to find denser proposals. Finally, a CNN is employed to identify the class of the objects within the final proposals. The results are based on the marine data that were collected for the Advanced Autonomous Waterborne Applications Initiative (AAWA) project [28].

In [29], another novel dataset, SeaShips, consisting of a collection of in-shore and offshore ship images, is introduced. Moreover, the authors used three object detectors (Faster R-CNN [18], SSD [20], and YOLO [21]) for detecting maritime vessels. In [30], a maritime vessel image dataset is collected from a Vessel Tracking System (VST). This dataset contains authentic situations from traffic management operators. In addition, the authors proposed an SSD detector to identify vessels.

#### **3. The Proposed Early Fusion Architecture**

In this architecture, fusion happens at a very low abstraction level. As shown in Figure 1, the early fusion architecture concatenates the RGB and IR images and produces a tensor with four channels (three channels from RGB and one channel from IR). This four-channel tensor is used as the input of a detector network. The intuition is simple: the features of the concatenated image should contain information from both RGB and IR. The detector produces Bounding Boxes (BBs) from the feature maps to localize the vessels. Each box is defined by the coordinates of its top-left corner (*x*<sub>1</sub>, *y*<sub>1</sub>) and bottom-right corner (*x*<sub>2</sub>, *y*<sub>2</sub>). Moreover, each bounding box is associated with a confidence score *s*, which indicates how likely the bounding box is to contain a vessel. The bounding boxes with the highest confidence are kept and then filtered by Non-Maximum Suppression (NMS). NMS is a popular post-processing method in object detection [5,18] for filtering redundant bounding boxes and obtaining the final detections.


**Figure 1.** An overview of the proposed early fusion architecture. (**A**) The 3-channel RGB input image and 1-channel IR image are concatenated. (**B**) Subsequently, the produced four-channel input data is processed by a detector in order to robustly detect vessels. (**C**) The output image consists of the predicted BBs and corresponding scores and labels.
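Step (A) of Figure 1 can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the two images are already registered and have the same height and width; the function name `early_fuse` and the channel-last layout are illustrative choices, not taken from the original implementation.

```python
import numpy as np

def early_fuse(rgb: np.ndarray, ir: np.ndarray) -> np.ndarray:
    """Stack an H x W x 3 RGB image and an H x W IR image into H x W x 4."""
    if ir.ndim == 2:
        ir = ir[..., np.newaxis]  # H x W -> H x W x 1
    if rgb.shape[:2] != ir.shape[:2]:
        raise ValueError("RGB and IR images must be aligned to the same size")
    # Channels 0-2 hold RGB, channel 3 holds IR.
    return np.concatenate([rgb, ir], axis=-1)

# Dummy images at the resized resolution reported in Section 6.1 (1200 x 400).
rgb = np.zeros((400, 1200, 3), dtype=np.uint8)
ir = np.full((400, 1200), 255, dtype=np.uint8)
fused = early_fuse(rgb, ir)
print(fused.shape)  # (400, 1200, 4)
```

A detector pre-trained on three-channel images would need its first convolution adapted to accept four input channels; how that adaptation was done is not specified here.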

#### **4. The Proposed Middle Fusion Architecture**

The middle fusion architecture consists of two layers, as illustrated in Figure 2. The first is a fuse layer that combines the information given by the RGB and IR cameras and constructs a fused image (Figure 2C). The fused image carries the thermal radiation information of the infrared image and the detailed texture information of the visible image. Afterwards, a detector layer (Figure 2D) performs detection on the fused image to generate the object bounding box proposals.

**Figure 2.** An overview of the proposed middle fusion architecture. The original input images (**A**,**B**) are fused by an image fusion method in order to provide complementary information for object detection. (**C**) The image fusion method can be any of the methods described in Sections 4.1–4.7. (**D**) Subsequently, the fused image is processed by a detector in order to detect and localize marine vessels. (**E**) The output image localizes the detected vessels with the corresponding scores and labels.

To generate the fused image in the fuse layer, we employed three DL-based image fusion methods (see Sections 4.1–4.3) and four traditional image fusion methods (see Sections 4.4–4.7), which we briefly review here. The DL-based methods are: a deep learning framework based on VGG19 and Multi-Layer (VGG-ML) [10], DenseFuse [8], and a ResNet and Zero-phase Component Analysis-based method (ResNet-ZCA) [31]. The traditional fusion algorithms are categorized into two main groups according to their fusion strategies: Multi-Scale Decomposition (MSD)-based methods [32] and Sparse Representation (SR)-based methods [33,34]. The MSD-based methods usually use different transform functions, such as pyramid and discrete wavelet transforms. The SR-based methods calculate the activity level of the input images in a sparse domain. In this work, we used weighted least squares [32] as an MSD-based method and convolutional sparse representation [35] as a common SR-based method.

#### *4.1. Deep Learning Framework Based on VGG19 and Multi-Layers*

The deep learning framework based on VGG19 and Multi-layer (VGG-ML) [10] combines the features of the two source images, IR and RGB, and generates a fused image. For this purpose, the source images are first decomposed into base and detail parts using the image decomposition method of [36]. The base part of each source image contains the common features and redundant information and is obtained with an average filter. The detail part represents the detail content of the source image and is produced by subtracting the base part from the source image. The base parts of both images are then fused by a weighted average strategy. For the detail parts, a pre-trained VGG19 network [37] extracts deep features from the source images. Finally, the fused base and detail parts are added to create the final fused image.
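The base/detail decomposition and the weighted-average fusion of the base parts can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the 5 × 5 window size is arbitrary, the helper names are hypothetical, and the VGG19-based detail fusion is replaced by a simple per-pixel max-magnitude rule that serves only as a placeholder so that the pipeline runs end to end.

```python
import numpy as np

def box_filter(img, k=5):
    """Mean (average) filter with a k x k window, k odd, edges replicated."""
    r = k // 2
    padded = np.pad(img.astype(np.float64), r, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def decompose(img, k=5):
    """Split an image into a smooth base part and a residual detail part."""
    base = box_filter(img, k)
    return base, img.astype(np.float64) - base

def fuse_base(base_ir, base_rgb, a1=0.5, a2=0.5):
    """Weighted average of the two base parts."""
    return a1 * base_ir + a2 * base_rgb

def fuse_detail_max_abs(detail_ir, detail_rgb):
    """Placeholder only: keep the larger-magnitude detail at each pixel.
    This is NOT the VGG19-based detail fusion used by VGG-ML."""
    return np.where(np.abs(detail_ir) >= np.abs(detail_rgb), detail_ir, detail_rgb)

rng = np.random.default_rng(0)
ir = rng.integers(0, 256, size=(32, 48)).astype(np.float64)
gray = rng.integers(0, 256, size=(32, 48)).astype(np.float64)  # grayscale version of RGB

base_ir, detail_ir = decompose(ir)
base_gray, detail_gray = decompose(gray)
fused = fuse_base(base_ir, base_gray) + fuse_detail_max_abs(detail_ir, detail_gray)
```

By construction, adding the base and detail parts of a single image reproduces that image exactly, which is why the final fused image can be formed by a plain sum of the fused parts.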

#### *4.2. DenseFuse*

DenseFuse [8] is a deep network with three elements: encoder, fusion layer, and decoder. At test time, the encoder first extracts and preserves most of the deep features of both input images, RGB and IR, using the DenseBlock [38] architecture. DenseBlock contains three cascaded convolutional layers. Subsequently, the fusion layer uses either the addition fusion [38] or the *l*<sub>1</sub>-norm fusion strategy to fuse the feature maps extracted from both source images. Finally, the decoder, which consists of three convolutional layers, receives the fused feature maps and creates the fused image. For training, only the encoder and decoder are employed to reconstruct the training images and fix the weights of the network. To reconstruct the images, DenseFuse minimizes a *λ*-weighted combination of pixel and structural similarity losses.
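The two fusion strategies operate on encoder feature maps and can be sketched as below. This is a simplified illustration, assuming feature maps of shape C × H × W; random arrays stand in for real encoder outputs, the function names are hypothetical, and any local smoothing of the activity maps is omitted.

```python
import numpy as np

def addition_fusion(f_rgb, f_ir):
    """Element-wise sum of the two feature maps."""
    return f_rgb + f_ir

def l1_norm_fusion(f_rgb, f_ir, eps=1e-12):
    """Weight each source by its per-pixel l1-norm across channels."""
    act_rgb = np.abs(f_rgb).sum(axis=0)  # activity map, H x W
    act_ir = np.abs(f_ir).sum(axis=0)
    total = act_rgb + act_ir + eps       # eps avoids division by zero
    w_rgb = act_rgb / total
    w_ir = act_ir / total
    # The H x W weights broadcast over the channel axis.
    return w_rgb * f_rgb + w_ir * f_ir

rng = np.random.default_rng(0)
f_rgb = rng.normal(size=(4, 6, 6))  # stand-in for encoder features of the RGB image
f_ir = rng.normal(size=(4, 6, 6))   # stand-in for encoder features of the IR image

fused_add = addition_fusion(f_rgb, f_ir)
fused_l1 = l1_norm_fusion(f_rgb, f_ir)
```

The *l*<sub>1</sub>-norm strategy gives more weight, pixel by pixel, to whichever source has the stronger feature response, whereas addition treats both sources equally everywhere.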

#### *4.3. ResNet and Zero-Phase Component Analysis-Based Fusion*

The ResNet and Zero-Phase Component Analysis-based (ResNet-ZCA) method [31] has been shown to be an efficient method for image fusion. First, it employs ResNet [39] to extract deep features from the source images. Subsequently, ZCA [40] and the *l*<sub>1</sub>-norm are used to project the deep features into a sparse domain, and the initial weight maps are obtained by the *l*<sub>1</sub>-norm. Next, bicubic interpolation is used to resize the initial weight maps to the source image size. Finally, the final weight maps are generated by soft-max, and the fused image is reconstructed from the final weight maps and the source images.
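The weight-map pipeline described above can be sketched in NumPy as follows. Several simplifications are assumptions of this sketch rather than details of [31]: random arrays stand in for ResNet features, whitening is done across channels, nearest-neighbour upsampling via `np.kron` replaces bicubic interpolation, and a standard exponential soft-max is used, which may differ from the normalization in the original method.

```python
import numpy as np

def zca_whiten(feat, eps=1e-8):
    """ZCA-whiten C x H x W features across the channel dimension."""
    c, h, w = feat.shape
    x = feat.reshape(c, -1).astype(np.float64)
    x = x - x.mean(axis=1, keepdims=True)          # centre each channel
    cov = x @ x.T / x.shape[1]                     # C x C covariance
    eigval, eigvec = np.linalg.eigh(cov)
    scale = 1.0 / np.sqrt(np.maximum(eigval, 0.0) + eps)
    whitener = eigvec @ np.diag(scale) @ eigvec.T  # U diag(1/sqrt(L)) U^T
    return (whitener @ x).reshape(c, h, w)

def fusion_weights(feat_rgb, feat_ir, out_shape):
    """Soft-max weight maps at image resolution, one per source."""
    maps = []
    for feat in (feat_rgb, feat_ir):
        act = np.abs(zca_whiten(feat)).sum(axis=0)  # l1-norm over channels
        fy = out_shape[0] // act.shape[0]
        fx = out_shape[1] // act.shape[1]
        # Nearest-neighbour upsampling; a stand-in for bicubic interpolation.
        maps.append(np.kron(act, np.ones((fy, fx))))
    stacked = np.stack(maps)
    stacked = stacked - stacked.max(axis=0, keepdims=True)  # numerical stability
    expd = np.exp(stacked)
    return expd / expd.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
feat_rgb = rng.normal(size=(8, 10, 15))  # stand-in for deep features
feat_ir = rng.normal(size=(8, 10, 15))
img_rgb = rng.uniform(0, 255, size=(40, 60))
img_ir = rng.uniform(0, 255, size=(40, 60))

w_rgb, w_ir = fusion_weights(feat_rgb, feat_ir, img_rgb.shape)
fused = w_rgb * img_rgb + w_ir * img_ir
```

Because the two weight maps sum to one at every pixel, each fused pixel is a convex combination of the corresponding source pixels.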

#### *4.4. Visual Saliency Map and Weighted Least Square*

Visual Saliency Map and Weighted Least Square (VSM-WLS) [32] is a multi-scale fusion method based on WLS optimization and VSM. To perform Multi-Scale Decomposition (MSD), it first employs the rolling guidance filter [41] and a Gaussian filter to decompose both source images, IR and RGB, into base and detail parts. Afterwards, the base parts are fused using a weighted average technique in order to enhance the contrast of the fused image. The detail parts are fused using WLS optimization. Finally, the inverse MSD is applied to the fused base and detail parts to construct the output fused image.

#### *4.5. Cross Bilateral Filter*

Cross Bilateral Filtering (CBF) [42] is a non-iterative, local, nonlinear method that combines an edge-stopping function with a low-pass filter, reducing the filtering effect wherever the intensity difference between neighbouring pixels is large. It can therefore filter the images while preserving the edges. Initially, CBF is applied to both the RGB and IR source images to extract the base images. Subsequently, the detail images are obtained by subtracting the base images from the original IR and RGB images. Finally, the fused image is obtained by multiplying the input images by their weights, followed by weight normalization.
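A cross (joint) bilateral filter can be sketched as below: a Gaussian spatial kernel is multiplied by a Gaussian range kernel computed on a guide image, and the product is used to average a target image. The defaults follow the kernel size and *σ* values reported in Section 6.2; the function name, the edge-replication padding, and the choice of which image guides which are assumptions of this sketch. The weight computation from the detail images is not shown.

```python
import numpy as np

def cross_bilateral_filter(guide, target, size=11, sigma_s=1.8, sigma_r=25.0):
    """Smooth `target` with range weights taken from `guide`."""
    r = size // 2
    guide = guide.astype(np.float64)
    target = target.astype(np.float64)
    g_pad = np.pad(guide, r, mode="edge")
    t_pad = np.pad(target, r, mode="edge")
    h, w = guide.shape
    num = np.zeros((h, w))
    den = np.zeros((h, w))
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            g_shift = g_pad[r + dy:r + dy + h, r + dx:r + dx + w]
            t_shift = t_pad[r + dy:r + dy + h, r + dx:r + dx + w]
            # Low-pass part: weight decays with spatial distance.
            spatial = np.exp(-(dy * dy + dx * dx) / (2.0 * sigma_s ** 2))
            # Edge-stopping part: weight decays with intensity difference in the guide.
            range_w = np.exp(-((g_shift - guide) ** 2) / (2.0 * sigma_r ** 2))
            num += spatial * range_w * t_shift
            den += spatial * range_w
    return num / den  # den >= 1 because the centre pixel always has weight 1

# A sharp step edge survives filtering because cross-edge weights are near zero.
step = np.zeros((20, 20))
step[:, 10:] = 200.0
base = cross_bilateral_filter(step, step)
detail = step - base
```

With an intensity step of 200 and *σ<sub>r</sub>* = 25, pixels on the other side of the edge receive a range weight of about exp(−32), so the edge is left essentially untouched.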

#### *4.6. Convolutional Sparse Representation*

Convolutional Sparse Representation (ConvSR) [35] addresses the shortcomings of SR-based image fusion methods by taking a global approach that computes the sparse representation of the whole image rather than of local image patches. The global approach improves detail preservation and reduces sensitivity to mis-registration. ConvSR consists of hierarchical layers, where each layer includes an image decomposition that divides the input images into base and detail parts. The detail parts are combined using a choose-max strategy. An averaging strategy is applied to fuse the base parts and build the fused coefficient maps. The output fused image is built by combining the base and detail layers.

#### *4.7. Guided Filtering Based Fusion Method*

Guided Filtering based Fusion (GFF) [36] method can generate a highly informative fused image based on a two-scale decomposition strategy. This strategy produces base and detail layers containing large scale variations in intensity and small scale details, respectively. Finally, a guided filtering-based weighted average technique is employed in order to make full use of spatial consistency for fusion of the base and detail layers.

**Figure 3.** An overview of the proposed late fusion architecture. (**A**) The input RGB image and (**B**) IR image are fed into Detector1 and Detector2, respectively. (**C**) These detectors independently extract features from the corresponding input image. (**D**) The architecture concatenates the outputs of the detectors (*O*<sub>RGB</sub>, *O*<sub>IR</sub>), and a final set of object proposals is then obtained after non-maximum suppression. (**E**) The final output contains the predicted BBs, each associated with a category label and a confidence score.

#### **5. The Proposed Late Fusion Architecture**

Figure 3 illustrates the proposed late fusion architecture, which combines the detection results of two detectors with similar architecture. One detector takes an RGB image as input and the other takes the corresponding IR image. More specifically, a separate detector processes each camera image independently and extracts features from it. This process involves estimating the bounding box proposals, which indicate the locations of the objects in the image. Subsequently, the output bounding boxes of the two detectors (*O*<sub>RGB</sub>, *O*<sub>IR</sub>) are concatenated to explicitly capture the complementary information of RGB and IR. In this case, fusion happens at the decision level. After that, the following steps are applied to all boxes (*O*<sub>RGB</sub> + *O*<sub>IR</sub>) in order to generate the final boxes and remove redundant detections:


$$f(b_i, b_{best}) = \begin{cases} 0, & \text{if } IoU < \alpha \\ 1, & \text{if } IoU > \alpha \end{cases} \tag{1}$$

where *α* is the Intersection over Union (IoU) threshold between two bounding boxes, which is determined experimentally. Based on a series of preliminary experiments, we obtained the best performance with *α* = 0.5. IoU is the area of the intersection of two boxes divided by the area of their union:

$$IoU(b_i, b_{best}) = \frac{S_{b_i} \cap S_{b_{best}}}{S_{b_i} \cup S_{b_{best}}} \tag{2}$$

where *S*<sub>*b*</sub> represents the area of bounding box *b*.
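Equations (1) and (2) translate directly into code for boxes given as (*x*<sub>1</sub>, *y*<sub>1</sub>, *x*<sub>2</sub>, *y*<sub>2</sub>). The sketch below is illustrative; the function names are hypothetical, and because Equation (1) does not define the case IoU = *α*, treating it as 0 is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes, Equation (2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def overlap_indicator(b_i, b_best, alpha=0.5):
    """Equation (1): 1 if the boxes overlap by more than alpha, else 0.
    Assumption: IoU exactly equal to alpha maps to 0."""
    return 1 if iou(b_i, b_best) > alpha else 0

# Two 10 x 10 boxes shifted by 5 pixels: intersection 50, union 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))                # 0.333...
print(overlap_indicator((0, 0, 10, 10), (5, 0, 15, 10)))  # 0
# Shifted by 2 pixels: intersection 80, union 120, IoU = 2/3.
print(overlap_indicator((0, 0, 10, 10), (2, 0, 12, 10)))  # 1
```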

**Figure 4.** An example of applying NMS in the proposed late fusion architecture: (**A**) the predicted BBs whose score is lower than 0.6 are removed, and then (**B**) each of the remaining boxes is taken as an output box if the IoU between the ground truth BB and the predicted BB is greater than or equal to 0.5.
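The decision-level fusion can be sketched as a score filter followed by standard greedy NMS over the concatenated detections. This is a generic NMS sketch under stated assumptions, and it may differ in detail from the authors' procedure: detections are assumed to be rows of (*x*<sub>1</sub>, *y*<sub>1</sub>, *x*<sub>2</sub>, *y*<sub>2</sub>, score) with positive area, the 0.6 score threshold is taken from the Figure 4 caption, and the function names and example boxes are hypothetical.

```python
import numpy as np

def pairwise_iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an N x 4 array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def late_fusion_nms(o_rgb, o_ir, score_thr=0.6, alpha=0.5):
    """Concatenate both detectors' outputs, drop weak boxes, apply greedy NMS."""
    dets = np.concatenate([o_rgb, o_ir], axis=0)
    dets = dets[dets[:, 4] >= score_thr]   # remove boxes scoring below the threshold
    dets = dets[np.argsort(-dets[:, 4])]   # highest score first
    keep = []
    while len(dets) > 0:
        best = dets[0]
        keep.append(best)
        rest = dets[1:]
        overlaps = pairwise_iou(best[:4], rest[:, :4])
        dets = rest[overlaps <= alpha]     # suppress boxes with IoU > alpha
    return np.array(keep).reshape(-1, 5)

# Hypothetical detections: both detectors see the same vessel near (100, 100).
o_rgb = np.array([[100, 100, 200, 160, 0.90],
                  [400, 120, 460, 150, 0.55]])
o_ir = np.array([[105, 102, 198, 158, 0.80],
                 [700,  90, 760, 130, 0.70]])
final = late_fusion_nms(o_rgb, o_ir)
print(final[:, 4])  # [0.9 0.7]
```

In this example the 0.55 box is removed by the score filter, and the IR box at 0.80 is suppressed because its IoU with the 0.90 RGB box is about 0.87.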

#### **6. Experimental Setup**

#### *6.1. Datasets*

We collected a real marine dataset with a vessel in the Finnish archipelago to evaluate our proposed fusion architectures. The dataset was recorded continuously by two sensors, providing data from various environmental and geographical scenarios. The sensor system includes RGB (visible spectrum) and IR (thermal) camera arrays, providing output that can be synchronized and stitched to form panoramic images. The individual visible cameras have full HD resolution, while the thermal cameras have VGA resolution. Both camera types have a horizontal field of view of approximately 35 degrees. For image alignment in this sensor set, the registration parameters are determined manually by finding corresponding features in calibration images and minimizing the alignment mismatch. Therefore, our dataset contains well-aligned IR/RGB images. The images were sampled at one frame per second and stored in MPEG format. The images show maritime scenarios under different illumination conditions with various marine vessels. We manually annotated all vessels (passenger vessel, motorboat, sailboat, or docked vessel) within each RGB sequence with a bounding box as accurately as possible. However, all of the vessels have the general label "Vessel" in our dataset. The bounding box should contain all pixels that belong to the object and, at the same time, be as tight as possible. In addition, two scenarios are defined in order to evaluate the proposed architectures under different light conditions, imaging times and locations.

**Scenario1**: the training dataset is collected by the visible and infrared cameras in the daytime. In this scenario, the training dataset consists of 7250 pairs of well-aligned multispectral images. For evaluation, a separate test dataset of 1750 RGB/IR image pairs was gathered under the same light and weather conditions. Figure 5a shows a sample RGB image and the corresponding IR image in this scenario. The number of vessels in the training and test datasets is given in Table 1.

**Scenario2**: RGB and IR images are collected by a vessel operating near the harbour at night. These data are challenging for the marine environment because they contain dark and oversaturated areas. The source videos for generating the training and test images are different. The training and test datasets consist of 2250 and 1000 RGB/IR image pairs, respectively. Table 1 shows the number of vessels in each dataset. Furthermore, Figure 5b illustrates a sample IR/RGB pair from our data in this scenario.

The original size of all images is 3240 × 944 pixels for both scenarios. To reduce the computation time, we re-sized the original images into 1200 × 400 pixels.
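Resizing from 3240 × 944 to 1200 × 400 scales the two axes by different factors (about 0.370 horizontally and 0.424 vertically), so bounding-box annotations have to be scaled per axis as well. The sketch below shows one way to do this; how the annotations were actually transformed is not stated, so the helper `rescale_box` is purely illustrative.

```python
ORIG_W, ORIG_H = 3240, 944   # original image size in pixels
NEW_W, NEW_H = 1200, 400     # resized image size in pixels

def rescale_box(box):
    """Map an (x1, y1, x2, y2) box from original to resized pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 * NEW_W / ORIG_W, y1 * NEW_H / ORIG_H,
            x2 * NEW_W / ORIG_W, y2 * NEW_H / ORIG_H)

print(round(NEW_W / ORIG_W, 3), round(NEW_H / ORIG_H, 3))  # 0.37 0.424
# The bottom-right quadrant of the original maps to the bottom-right quadrant.
print(rescale_box((1620, 472, 3240, 944)))  # (600.0, 200.0, 1200.0, 400.0)
```

Because the aspect ratio changes from about 3.43 to 3.0, objects appear slightly compressed horizontally relative to their height after resizing.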


**Table 1.** Number of vessels in our training and testing marine datasets for each Scenario.

**Figure 5.** Example of RGB and InfRared (IR) pair images in the real maritime dataset at (**a**) Scenario1 and (**b**) Scenario2.

#### *6.2. Implementation Details*

Here, we give more information regarding the method parameters. The parameter settings of (1) the image fusion methods in the middle fusion architecture and (2) the CNN-based detector in all architectures are as follows:

**Image fusion methods**: we selected all parameters of the image fusion methods based on the existing works described in Section 4. VGG-ML fuses the detail parts by using VGG-19 [37] with four ReLU layers. In VGG-ML, the pixel weights for the two base parts are *α*<sub>1</sub> = 0.5 and *α*<sub>2</sub> = 0.5. DenseFuse is pre-trained on MS-COCO [43] and uses two fusion strategies: addition and *l*<sub>1</sub>-norm. DenseFuse achieves the minimum pixel and structural similarity losses when *λ* is 100. For ResNet-ZCA, we used ResNet50 with the *l*<sub>1</sub>-norm. ResNet50 is pre-trained on ImageNet [44]. In VSM-WLS, the initial spatial weight, *σ*<sub>*s*</sub>, is 2, the number of decomposition levels *N* is 4, and *λ* = 0.01. CBF uses a neighborhood kernel of size 11 × 11, as it achieves good enough fusion results [42]. The values of *σ*<sub>*s*</sub> and *σ*<sub>*r*</sub> are 1.8 and 25, respectively. Moreover, the parameter *λ* is fixed at 0.01 in ConvSR. In the GFF experiment, the parameters of the guided filter are set to *r*<sub>1</sub> = 45, *ε*<sub>1</sub> = 0.3, *r*<sub>2</sub> = 7 and *ε*<sub>2</sub> = 10<sup>−6</sup>. All of the image fusion methods except DenseFuse and VSM-WLS require grayscale images converted from the input RGB images.

**CNN-based detectors**: we use Faster R-CNN as the detector in all proposed architectures. The CNN parameters are chosen based on several experimental results. Faster R-CNN is trained for 900k iterations with a learning rate of 0.0003 and then 1200k iterations with a learning rate of 0.000003. We use four sub-octave scales (0.25, 0.5, 1.0, 2.0) and three aspect ratios (0.5, 1.0, 2.0), mainly motivated by the need to handle small objects in this dataset.
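To make the anchor configuration concrete, the sketch below enumerates the anchor shapes produced by the four scales and three aspect ratios listed above. The base anchor size of 256 pixels and the convention that the aspect ratio is width divided by height are assumptions for illustration; the text does not state either.

```python
import math


def anchor_shapes(base_size=256, scales=(0.25, 0.5, 1.0, 2.0),
                  aspect_ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) for every scale/aspect-ratio combination.

    Each anchor keeps the area (base_size * scale) ** 2 while its
    width / height ratio equals the aspect ratio.
    base_size is an assumed value; it is not given in the text.
    """
    shapes = []
    for scale in scales:
        for ratio in aspect_ratios:
            side = base_size * scale
            width = side * math.sqrt(ratio)
            height = side / math.sqrt(ratio)
            shapes.append((width, height))
    return shapes


shapes = anchor_shapes()  # 4 scales x 3 ratios = 12 anchor shapes per location
```

Under these assumptions the smallest square anchor is 64 × 64 pixels, which illustrates why small scales matter when many targets cover only a few pixels.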

Since the Microsoft COCO dataset [43] contains 3146 images of marine vessels, Faster R-CNN is pre-trained on it to learn a better feature representation. Subsequently, the model is fine-tuned on our data. We use different source videos to train and test the architectures. These fixed parameter settings give good results for the experiments in this work.

#### **7. Experimental Results**

In this work, three multi-modal architectures were considered for vessel detection: early fusion, middle fusion, and late fusion. In addition, two uni-modal architectures were considered, which use either RGB or IR camera images. We conducted three experiments: (1) evaluation of seven image fusion methods in the middle fusion architecture, (2) evaluation of all fusion architectures, and (3) a visual comparison between all architectures in each scenario.

#### *7.1. Comparison of Image Fusion Methods*

In the proposed middle fusion architecture, an image fusion method is first employed to combine the source RGB and IR images and produce a fused image (see Sections 4.1–4.7). Subsequently, a CNN is applied to the fused image for detection. The image fusion method therefore provides essential functionality in our proposed middle fusion architecture. For this reason, we first evaluated the performance of three DL-based image fusion methods and four traditional methods. The details of our experiment are introduced in Section 6.2. These methods are compared using six common assessment metrics in qualitative and quantitative experiments. These metrics include:


Figures 6 and 7 show the average values of the performance metrics over the whole test dataset in the two scenarios. In Scenario1 (Figure 6), the results show that the DL-based fusion methods perform better than the traditional methods, with larger values of *FMI<sub>w</sub>*, *FMI<sub>dct</sub>*, and *SSIM*. The reason is that these methods (VGG-ML, DenseFuse, and ResNet-ZCA) can extract richer and more structural features thanks to their DL architectures. Among the DL-based methods, ResNet-ZCA has the highest values of *FMI<sub>w</sub>*, *FMI<sub>dct</sub>*, and *SSIM*. However, DenseFuse provides more natural results that contain less artificial noise, as it has the minimum values of *N<sup>AB/F</sup>*, *Q<sup>AB/F</sup>*, *EN* and *SCD*. Among the traditional methods, GFF retains more complementary information in the fused image, since it has the maximum values of *FMI<sub>w</sub>*, *FMI<sub>dct</sub>*, and *SSIM*.

Figure 7 shows the average values of the six quality metrics for Scenario2. We can observe that the DL-based methods generally produce more natural results with less noise than the traditional methods. Furthermore, the results indicate that DenseFuse generates fused images with less artificial information and noise, as its value of *N<sup>AB/F</sup>* is low. However, ResNet-ZCA provides more structural information and features, as it has the highest values of *FMI<sub>w</sub>*, *FMI<sub>dct</sub>*, and *SSIM*. GFF performs better than the other traditional image fusion methods in terms of *FMI<sub>w</sub>*, *FMI<sub>dct</sub>*, and *SSIM*. This is because GFF can effectively keep the contrast in the fused image.

**Figure 6.** The average values of six quality metrics for test images obtained by the deep and traditional methods in Scenario1.

**Figure 7.** The average values of six quality metrics for test fused images obtained by the deep and traditional methods in Scenario2.

Moreover, we performed a visual comparison of all image fusion methods on an example test image in each scenario. In Scenario1, the fused images obtained by the DL-based methods contain more frequency details and better-preserved edges (Figure 8A–D). The fused images generated by VSM-WLS, CBF, ConvSR, and GFF include more artificial noise, and their saliency features are not clear. CBF and ConvSR produce fused images with more artifacts as well. On the contrary, the fused images obtained by VGG-ML, DenseFuse, ResNet-ZCA and VSM-WLS look more natural and contain less noise. Generally, the results of the DL-based methods are clearer than those of the traditional methods in Scenario1.

Figure 9 shows the fused images obtained by the DL-based and traditional image fusion methods in Scenario2. From Figure 9A–E, it can be observed that VGG-ML, DenseFuse, ResNet-ZCA, and VSM-WLS provide a more pleasing image with clear texture details. In the red box (part of the land), the fused image produced by VGG-ML contains less noise, and its details are clearer than those of the other image fusion methods. In contrast, CBF, ConvSR, and GFF (Figure 9F–H) produce results with more noise, color distortion and contrast loss.

*Remote Sens.* **2020**, *12*, 2509

**Figure 8.** Qualitative results of the fused image in Scenario1 by (**A**) VGG-ML, (**B**) DenseFuse-add, (**C**) DenseFuse-l1, (**D**) ResNet-ZCA, (**E**) VSM-WLS, (**F**) CBF, (**G**) ConvSR, and (**H**) GFF on the original RGB and IR images.

**Figure 9.** Qualitative results of the fused image in Scenario2 by (**A**) VGG-ML, (**B**) DenseFuse-add, (**C**) DenseFuse-l1, (**D**) ResNet-ZCA, (**E**) VSM-WLS, (**F**) CBF, (**G**) ConvSR, and (**H**) GFF on the original RGB and IR images.

**Processing Time:** Table 2 shows the running time (in seconds) of all image fusion methods for one image. The test system is an Intel(R) Core(TM) i7-4702MQ CPU @ 2.20 GHz × 8 with 15.4 GB of RAM. ResNet-ZCA needs 4.9 s per image to produce the fused image, which is the shortest time among the DL-based methods. In addition, GFF can generate a fused image in 0.4 s, which is faster than ResNet-ZCA.


**Table 2.** The running time (seconds) of the deep and traditional image fusion methods for one image.

#### *7.2. Multi-Modal Architectures vs. Uni-Modal Architectures*

We compared the fusion architectures on the test dataset in terms of Average Precision (AP) as the main detection accuracy metric. For this purpose, we measured the intersection over union (IoU) between detected bounding boxes and the matching ground truth annotations. A detected bounding box is counted as a true positive if its IoU with a ground truth box exceeds 50%. Unmatched detected bounding boxes are counted as false positives, and unmatched ground truth boxes are counted as false negatives.
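The matching rule described above can be sketched as follows. This is a minimal illustration and not the evaluation code used in the paper: the greedy, score-ordered one-to-one matching and the `(x_min, y_min, x_max, y_max)` box format are assumptions, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def count_matches(detections, ground_truth, threshold=0.5):
    """Greedy one-to-one matching; returns (tp, fp, fn).

    detections: list of (score, box), processed in descending score order.
    A detection is a true positive if its best IoU with a still-unmatched
    ground truth box exceeds the threshold.
    """
    unmatched = list(ground_truth)
    tp = fp = 0
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best = max(unmatched, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) > threshold:
            unmatched.remove(best)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (1, 1, 10, 10)), (0.8, (50, 50, 60, 60))]
tp, fp, fn = count_matches(dets, gt_boxes)
```

In this toy example the first detection overlaps a ground truth box with an IoU of 0.81 and counts as a true positive, the second detection matches nothing, and one ground truth box is left undetected.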

Table 3 shows the AP of the proposed architectures in each scenario. The best results are highlighted in bold. These results show the effect of fusion on the object detection performance, as we compared uni-modal and multi-modal architectures. The multi-modal middle fusion architecture generates reliable detection results (bold font in Table 3) for both scenarios (Scenario1: 79.1% and Scenario2: 61.6%), as it can provide complementary information when compared with the uni-modal architectures. However, the performance can be improved when the dataset contains larger targets. Our dataset contains a large number of small targets that occupy areas smaller than 16 by 16 pixels. Detecting very small objects with only a few pixels is still challenging because less information is associated with them.

In addition, the results show that the uni-modal RGB-based architecture provides higher accuracy than the uni-modal IR-based architecture. For instance, the accuracy of the uni-modal RGB-based architecture is 9.0% and 9.7% higher than that of the uni-modal IR-based architecture for Scenario1 and Scenario2, respectively. This is because it can learn richer features from color images than from infrared images. Moreover, the results show that DenseFuse overall has higher accuracy than the other middle-fusion architectures.


**Table 3.** Average Precision (AP) (in %) on the test dataset of two scenarios.

#### *7.3. Qualitative Results*

Figure 10 shows examples of the detection results from the visible-only architecture, the infrared-only architecture and the multi-modal architectures in each scenario (day-time and night-time). We observe that the proposed fusion architectures detect objects better than the uni-modal architectures, because they can integrate information from both color and infrared images. The fusion architectures successfully detected the size and location of the bounding boxes. In the third row, our middle-fusion architecture has detected marine vessels that other architectures have missed. Moreover, the middle-fusion architecture is able to detect many of the small objects that cover only a few pixels, as shown in Figure 10. This shows the generalisation capability of the proposed middle-fusion architecture and indicates its potential for two-dimensional (2D) object detection in real situations beyond a pre-designed dataset.

**Figure 10.** Qualitative vessel detection results for (**A**) Scenario1 and (**B**) Scenario2 obtained with the uni-modal RGB, uni-modal IR, multi-modal early fusion, multi-modal middle fusion, and multi-modal late fusion architectures. The ground truth bounding boxes are shown as green rectangles. Boxes predicted by the architectures are depicted as red bounding boxes. Each output box is associated with a category label and a score value in [0, 1].

#### **8. Conclusions**

In this paper, we proposed three image fusion architectures for vessel detection in marine environments. The architectures combine the thermal radiation information in infrared images with the texture detail information in visible images. They also use a simple and fast CNN-based detector, Faster R-CNN, to carry out the final detection task. The evaluation on our real marine dataset shows that the proposed middle-fusion architecture is able to detect vessels with state-of-the-art accuracy.

For future work, we plan to improve the detection network of these architectures in order to address the detection problem of very small objects (less than 16 by 16 pixels) in our data. We will investigate the effect of using transfer learning and domain-specific data on the detection performance. We also plan to extend our fusion framework by considering other common sensors in autonomous vessels, such as LiDAR and radar, besides IR and RGB cameras.

**Author Contributions:** F.F. conceived and designed the methodology; performed the experiments; analyzed the data; wrote the paper. J.H. supervised the study and reviewed this paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is funded by the Tekes (Finnish Funding Agency for Technology and Innovation) as a part of autonomous Ships and Machines project at Turku university.

**Acknowledgments:** Computational resources were provided by CSC-IT Center for Science Ltd., Espoo, Finland.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **ABOships-An Inshore and Offshore Maritime Vessel Detection Dataset with Precise Annotations**

**Bogdan Iancu \*,†, Valentin Soloviev † , Luca Zelioli † and Johan Lilius †**

> Faculty of Science and Engineering, Åbo Akademi University, 20500 Åbo, Finland; valentin.soloviev@abo.fi (V.S.); luca.zelioli@abo.fi (L.Z.); johan.lilius@abo.fi (J.L.)

**\*** Correspondence: bogdan.iancu@abo.fi

† Current address: Åbo Akademi, Agora, Informationsteknologi, Vattenborgsvägen 3, 20500 Åbo, Finland.

**Abstract:** Availability of domain-specific datasets is an essential problem in object detection. Datasets of inshore and offshore maritime vessels are no exception, with a limited number of studies addressing maritime vessel detection on such datasets. For that reason, we collected a dataset consisting of images of maritime vessels taking into account different factors: background variation, atmospheric conditions, illumination, visible proportion, occlusion and scale variation. Vessel instances (including nine types of vessels), seamarks and miscellaneous floaters were precisely annotated: we employed a first round of labelling and we subsequently used the CSRT tracker to trace inconsistencies and relabel inadequate label instances. Moreover, we evaluated the out-of-the-box performance of four prevalent object detection algorithms (Faster R-CNN, R-FCN, SSD and EfficientDet). The algorithms were previously trained on the Microsoft COCO dataset. We compared their accuracy based on feature extractor and object size. Our experiments showed that Faster R-CNN with Inception-Resnet v2 outperforms the other algorithms, except in the large object category where EfficientDet surpasses the latter.

**Keywords:** maritime vessel dataset; ship detection; object detection; convolutional neural network; deep learning; autonomous marine navigation

### **1. Introduction**

Maritime vessel detection from waterborne images is a crucial aspect in various fields involving maritime traffic supervision and management, marine surveillance and navigation safety. Prevailing ship detection techniques exploit either remote sensing images or radar images, which can hinder the performance of real-time applications [1]. Satellites can provide near real-time information, but satellite image acquisition can be unpredictable, since it is challenging to determine which satellite sensors can provide the relevant imagery in a narrow collection window [2]. Hence, seaborne visual imagery can tremendously help in essential tasks in both civilian and military applications, since it can be collected in real time, for instance from surveillance videos.

Ship detection in a traditional setting depends extensively on human monitoring, which is highly expensive and unproductive. Moreover, the complexity of the maritime environment makes it difficult for humans to focus on video footage for prolonged periods of time [3]. Machine vision, however, can take the strain from human resources and provide solutions for ship detection. Traditional methods based on feature extraction and image classification, involving background subtraction and foreground detection, as well as directional gradient histograms, are highly affected by datasets exhibiting challenging environmental factors (glare, fog, clouds, high waves, rain etc.), background noise or lighting conditions.

Convolutional neural networks (CNNs) contributed massively to the image classification and object detection tasks in the past years [4–8]. They incorporate feature extractors and classifiers in multilayer architectures, whose number of layers regulates their selectiveness and feature invariance. CNNs exploit convolutional and pooling layers to extract local features, gradually advancing the object representation from simple features to complex structures across multiple layers. CNN-based detectors can extract compelling distinguishable features automatically, unlike more traditional methods, which use predefined, manually selected features. However, integrating ship features into detection proves to be challenging even in this context, given the complexity of environmental factors, object occlusion, ship size variation, occupied pixel area etc. This often leads to unsatisfactory performance of detectors on ship datasets.

**Citation:** Iancu, B.; Soloviev, V.; Zelioli, L.; Lilius, J. ABOships-An Inshore and Offshore Maritime Vessel Detection Dataset with Precise Annotations. *Remote Sens.* **2021**, *13*, 988. https://doi.org/10.3390/rs13050988

Academic Editors: Pedro Melo-Pinto

Received: 4 February 2021 Accepted: 1 March 2021 Published: 5 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

To address ship detection in a range of operating scenarios, including various atmospheric conditions, background variations and illumination, we introduce a new dataset consisting of 9880 images, and annotations comprising 41,967 carefully annotated objects.

The paper is organized as follows. Section 2 describes related work, including notable results in vessel detection and maritime datasets comprising waterborne images. Section 3 describes data acquisition, dataset diversity, dataset design and our relabelling algorithm along with basic dataset statistics based on the final annotation data. In Section 4, we discuss evaluation criteria and present experimental results; we investigate four CNN-based detectors and discuss the effect of feature extractors and object size on the performance of the detectors. Section 5 provides a qualitative overview of the experimental results. In Section 6, we provide a brief analysis of our dataset specifications in comparison with other similar datasets. Conclusions are presented in Section 7.

#### **2. Related Work**

#### *2.1. Object Detection*

Object detection is one of the fundamental visual recognition problems, where the requirement is to predict whether there are any objects from given categories in an image and, if any are found, to provide their location (bounding boxes, or pixel-level localization in the case of instance segmentation). Generally, this is achieved by extracting features in an image and matching them against features from training images. Traditional approaches use sliding windows to generate proposals, then visual descriptors to generate an embedding, which is subsequently classified (with methods such as SVM, bagging, cascade learning and AdaBoost). The best-performing traditional algorithms focus on carefully designing the descriptors for extracting the features (SIFT, Haar, SURF). However, since 2008, more and more limitations of this approach have become evident [7]. We list below the most notable ones:


In the early 2010s, deep learning approaches came to prominence and started replacing the traditional ones. Object detection networks can be roughly categorized into two types: one-stage detectors and two-stage detectors. The structure of the latter resembles traditional object detectors in that they generate region proposals and then classify them, while the former consider positions within an image as potential objects and attempt to classify them immediately. The traditional approach of sliding windows for proposal generation is still used in CNNs, but other notable advances have emerged that allow for more efficient proposal generation, such as anchor-based and key-point approaches (CenterNet being one of the more notable examples of the kind) [7].
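As a rough illustration of sliding-window proposal generation, the sketch below enumerates fixed-size windows over an image. The window size, stride and function name are arbitrary choices for illustration; practical detectors also vary the window scale and aspect ratio.

```python
def sliding_windows(image_w, image_h, win_w, win_h, stride):
    """Yield (x_min, y_min, x_max, y_max) windows that lie fully inside the image."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)


# Toy example: an 8 x 4 image scanned with a 4 x 4 window and a stride of 2.
windows = list(sliding_windows(8, 4, 4, 4, 2))
```

Every window would then be described and classified independently, which is why the number of proposals, and hence the cost, grows quickly with image size and with the number of scales considered.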

However, the key difference between traditional object detection and CNNs stems from the manner in which visual descriptors are generated. In deep learning, instead of defining visual descriptors and feature extractors by hand, basic CNNs train multiple convolutional layers to extract both high- and low-level features, which are then classified with the help of fully-connected layers. The resulting network essentially solves all the main limitations of a traditional approach, but the trade-off is that it requires a significantly larger number of training images for hyperparameter optimization [7,8].

While the requirement of a large number of training samples can prove to be a large obstacle, one of the benefits of CNN-based models is that they can be generalized into other fields with similar characteristics with the help of transfer learning. By training a model on a specific dataset, the backbone of the model can be later used to extract features in other tasks with similar features. For this reason, the aim of recent CNN-models was to be as generic as possible, since with the help of transfer learning, they can be specialized for the field of interest. The challenge, however, appears when those generic models are not suitable feature extractors for a new field and there is not enough data to train them [6]. For those specific cases, the only viable solution is creation of new datasets.

#### *2.2. General Object Detection Datasets*

The two main reasons for the remarkable progress computer vision made in the past decades are the availability of large-scale datasets and powerful GPUs that made it possible for deep learning to take off considerably [9]. Deep learning made notable contributions to the field of computer vision, the tasks of image classification and object detection being in the forefront of research areas that benefited from it. International competitions such as ILSVRC, PASCAL VOC, and Microsoft COCO motivated the community tremendously, each of their contributions offering large-scale datasets that have been exploited ever since. These general object detection datasets have been extensively used for object detection with deep neural networks. They are essential for testing and training computer vision algorithms. We will discuss below some of the most prominent general-purpose object detection datasets.

Microsoft COCO [10] provides a selection of 330,000 images with 2.5 million labelled object instances over 91 object classes. The dataset labeling used per-instance segmentation to ensure precise object localization. Two crucial aspects of the dataset are that it exhibits abundant contextual information and that its images contain multiple objects each.

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) ran annually for a number of years and was established as one of the typical benchmarks for object classification and detection. The ImageNet dataset [4], the foundation of the challenge, is an image collection based on the WordNet hierarchy [11], which provides on average 1000 manually verified images for every synset (synonym set) in the hierarchy. These images are subjected to quality control and are human-annotated. The dataset consists of over 14 million images annotated to denote what objects are present in the image; for over a million of them, bounding boxes are provided too.

Pattern Analysis, Statistical Modelling and Computational Learning (PASCAL) Visual Object Classes (VOC) is a prominent project in the computer vision community, which provided publicly available image datasets including ground truth annotations and standardized evaluation metrics. These datasets were exploited as part of a number of challenges on various tasks such as classification, detection and segmentation. Most scientific publications on object detection use the PASCAL VOC challenges to benchmark their proposed algorithms. The reason is that these challenges introduced a number of evaluation methods: bootstrapping to decide significant differences among algorithms, a normalised average precision across classes, etc. The dataset released by the last PASCAL VOC challenge includes 11,530 images with 27,450 annotated objects over 20 classes. Table 1 shows a variety of object detection datasets, with their total number of images and classes. ImageNet is by far the largest of those listed in the table, encompassing the greatest number of images and classes.


**Table 1.** Different object detection datasets comprising various object classes, with their corresponding annotations.

Among the general-purpose object detection datasets in Table 1, the number of maritime vessels is limited; only Microsoft COCO contains a considerable number of vessels (3146). All vessel counts can be found in Table 2.


**Table 2.** Maritime vessel instances in general object detection datasets.

#### *2.3. Maritime Vessel Detection Datasets*

Maritime vessel detection from satellite imagery has been employed in many studies over the past 40 years; a 2018 review [12] gathered 119 papers on ship detection and classification from optical satellites alone. At the same time, studies on maritime vessel detection from waterborne images are still quite scarce. Some studies proposed algorithms utilizing the idea of background subtraction and detection of the foreground in maritime images. This class of techniques is predominantly used in surveillance applications due to their ability to perform well with unexpected changes in illumination, frequency or background noise [13]. Other studies proposed solutions for ship detection based on the Histogram of Oriented Gradients (HOG) and sliding windows [14].

However, with the rise of deep learning in the past 15 years, CNNs have been employed in ship detection from waterborne images. Even so, datasets of seaborne images are scarce; we briefly discuss the most notable ones below.

The Singapore Maritime Dataset, introduced in [15], consists of 80 videos recorded during daytime and nighttime, and provides ground truth labels for every frame of every video, comprising bounding boxes and the corresponding object classes. The annotations for the Singapore Maritime Dataset include 10 object classes, of which 6 are ship types. This dataset is used for ship detection employing the YOLO v.2 algorithm [16].

Another recent ship dataset, SeaShips [3], consists of over 31,455 inshore and offshore images of ships, comprising 6 ship types. The authors of [3] employ three object detectors (Faster R-CNN [17], SSD [18] and YOLO [16]) to detect ships.

One of the most recent datasets published is MCShips [19], comprising 14,709 images of ships, whose annotations cover 6 warship classes and 7 civilian ship classes. The authors of [19] also use the object detection algorithms above (Faster R-CNN [17], SSD [18] and YOLO [16]) to evaluate the dataset over the 13 ship classes.

We compared our ABOships dataset against other existing ship datasets. Table 3 illustrates the main differences. Our dataset has the smallest number of images (9880) among the four datasets; however, it contains a large number of annotations (41,967) given the image total. With more than 4 annotated objects per image on average, it represents real scenarios of maritime imagery well.
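The annotation density quoted above follows directly from the two totals given in the text, as this short check shows (the variable names are illustrative only):

```python
n_images = 9880        # images in ABOships, as stated in the text
n_annotations = 41967  # annotated objects, as stated in the text

objects_per_image = n_annotations / n_images
print(f"{objects_per_image:.2f} annotated objects per image on average")
```

The ratio is roughly 4.25, which is consistent with the statement of more than 4 annotated objects per image.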


**Table 3.** Comparison of ABOships with other maritime datasets.

#### **3. Materials and Methods**

#### *3.1. Camera System*

The dataset was acquired from a set of 135 videos, collected from a sightseeing watercraft by a camera with a field of view of 65° and stored in FullHD (1920 × 720) resolution at 15 FPS in MPEG format. The route of the watercraft extended from the city of Turku to Ruissalo in South-West Finland, with the videos covering the urban area along the Aura river, the port and the Finnish Archipelago, over 13 days (26 June 2018–8 July 2018). The watercraft ran each day between 10:15 and 16:45. The videos were captured in 30-min segments consisting of footage from the route that the watercraft took. While the route remained largely the same, the data contain a variety of typical maritime scenarios in a range of weather conditions.

In addition to camera video data, the platform had a LiDAR attached to it (SICK LD-MRS, FoV 110 degrees, 2 × 4 planes, up to 300 m detection, at 5 Hz). The data from the LiDAR was captured alongside the video at a rate of 5 entries of up to 800 points per 0.2 s. Given that the LiDAR had a detection range of up to 300 m, it was very useful for detecting other objects in the harbor environment. However, because it had only 2 times 4 lasers in the height direction, the provided data was not reliable enough for discerning the nature of the object (i.e., what object was detected). It was nevertheless useful for determining distances to the objects perceived in the videos. For the purpose of creating the dataset presented in this paper, we used the LiDAR data to filter out video segments that were captured in the harbor area (usually the ones that had too many points for a prolonged period of time).

To evaluate the models, we acquired 9880 images from the videos. First, we annotated all images with 11 categories: seamarks, 9 types of maritime vessels, and miscellaneous floaters. In a second round, we relabelled all the inconsistencies we found, using an algorithm based on the CSRT tracker [20].

#### *3.2. Dataset Diversity*

Maritime environments are inherently intricate; hence, a range of factors have to be accounted for when designing a dataset. Dataset design must ensure that the dataset characterizes vessels in the environment well. Of course, data augmentation methods can be considered for reproducing certain environmental conditions; however, authentic conditions may be difficult to anticipate.

*Background variation.* Particular object detection tasks are more prone to be affected by changes in the background of the picture. For instance, facial recognition is less susceptible to background variations, because given the similar shape of most faces, it is easier to fit them into bounding boxes in a congruous manner. However, the shapes of maritime vessels are highly heterogeneous, making them more difficult to separate from the background due to the potentially large amount of background information in the bounding box. The accuracy of ship detection would be significantly affected if background information were classified as ship features. Figure 1 illustrates the background variation of images in our dataset, including urban landscapes and an open sea environment.

**Figure 1.** Example image of background variations in the ABOships dataset: (**a**) View of maritime vessels on Aura river including the urban landscape; (**b**) View of a maritime vessel in the Finnish Archipelago.

*Atmospheric conditions.* Atmospheric conditions were specific to Finnish summers, with very sunny periods, alternating with rainy intervals and cloudy skies. The dataset includes a variety of images of different atmospheric conditions throughout a day.

*Illumination.* Lighting variations can significantly impact image capture. Illumination throughout the day, in various geographical areas and with specific daylight hours in a given region can dramatically influence image detection.

*Visible proportion.* A great number of the images in our dataset consists of moving ships, with objects being only partially captured in the camera field of view. However, they still represent objects that were annotated since one has to detect them as well. The annotation should comprise different visible proportions of the maritime vessels.

*Occlusion.* Because our dataset was captured in an open sea environment and in the harbor area, and also comprises urban landscapes, there are many occasions when maritime vessels occlude each other or occlude other objects in the harbor area or the urban landscape. In a subset of pictures, especially in the harbor area, there is significant occlusion due to a high number of maritime vessels in proximity to each other. Two examples of occlusion are shown in Figure 2.

**Figure 2.** Example images of occlusion: (**a**) Boat in front of a military ship; (**b**) Several sailboats occluding each other while docked, on the right half of the image.

*Scale variation.* Detection of small objects can prove to be quite difficult, especially in a complex environment like the sea, where ships that occupy a small pixel area in the picture can be confused with other objects in the background. Maintaining a high level of detection for ships requires including several scales of ship sizes in the dataset. For more information regarding the annotation and the size of the bounding boxes, please refer to Section 3.4.

Figure 3 illustrates a sailboat from two different perspectives, a lateral and a frontal view, which shows a variation in both the occupied pixel area and the visible proportion.

**Figure 3.** Example image of a sailboat, view from two perspectives: (**a**) Lateral; (**b**) Frontal.

#### *3.3. Dataset Design*

The raw data acquired from the camera on the sightseeing watercraft is captured as MPEG videos, with 720p resolution at 15 FPS. The videos include some footage exhibiting content that is irrelevant for the scope of vessel detection (especially footage captured when the watercraft was docked, either at the start of its route on the Aura river or at the Port of Turku) or sensitive content, such as faces of people. To address the latter issue, we performed face detection on all videos and blurred all detected faces. Addressing the former issue, on the other hand, required additional data from the LiDAR.

In a maritime environment, LiDAR data is relatively sparse, and we observed that a high number of points detected for a prolonged duration correlates with the watercraft being docked in the harbor. By setting a point threshold to detect these docked/harbor cases, we were able to filter out most of them and extract mainly images of the maritime environment. The images were extracted at an interval of 15 s (one image every 225 frames). The extracted set still contained some images captured during docking, but most of these faced outwards from the harbor, so they still contain useful maritime data. As a result, we acquired 9880 images in the maritime environment.
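The filtering and sampling step described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the per-frame point-count input, and the `point_threshold` and `min_duration` values are assumptions, since the text does not state the actual threshold or how "prolonged" was defined. Only the 15 FPS frame rate and the 15 s sampling interval come from the text.

```python
from typing import List, Sequence

FPS = 15                               # frame rate stated in the text
SAMPLE_INTERVAL_S = 15                 # one image every 15 s
FRAME_STEP = FPS * SAMPLE_INTERVAL_S   # 225 frames between extracted images


def docked_mask(point_counts: Sequence[int],
                point_threshold: int,
                min_duration: int) -> List[bool]:
    """Flag frames that lie inside a run of at least `min_duration`
    consecutive frames whose LiDAR point count exceeds `point_threshold`.
    Both parameters are hypothetical tuning values."""
    mask = [False] * len(point_counts)
    run_start = None
    # A trailing None acts as a sentinel that closes a run ending at the last frame.
    for i, count in enumerate(list(point_counts) + [None]):
        dense = count is not None and count > point_threshold
        if dense and run_start is None:
            run_start = i
        elif not dense and run_start is not None:
            if i - run_start >= min_duration:
                for j in range(run_start, i):
                    mask[j] = True
            run_start = None
    return mask


def sample_frames(point_counts: Sequence[int],
                  point_threshold: int,
                  min_duration: int,
                  step: int = FRAME_STEP) -> List[int]:
    """Return indices of frames to extract: every `step`-th frame that is
    not flagged as docked."""
    mask = docked_mask(point_counts, point_threshold, min_duration)
    return [i for i in range(0, len(point_counts), step) if not mask[i]]
```

A short burst of dense returns (for example, a vessel passing close by) is not flagged, because only runs lasting at least `min_duration` frames are treated as docking.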

The acquired images were subsequently separated into workpackages in such a manner that chronologically adjacent pictures were separated into different workpackages. The workpackages were then manually labelled by different annotators. After the initial labelling was completed, we used the CSRT tracker [20] to combine labels of the same object into traces, i.e., collections of chronologically adjacent images containing a bounding-box for that object. Due to inaccuracies in the tracking process and discrepancies in labelling, the produced traces were not always accurate. After viewing the labels in these traces, we identified the main cause of the discrepancies in labelling: different interpretations of label annotations. We refined those annotations to eliminate the discrepancies and separated the data into a second collection of workpackages that were provided to annotators, who then relabelled the data according to the refined annotations. After the relabelling was completed, the images and their refined labels were compiled into a dataset of maritime images with refined annotations.

#### *3.4. Annotation*

To perform the annotation task, we first investigated the captured videos and identified the vessel types that appeared most often. Due to the fact that the videos were captured at locations with a significant number of passenger ships, there is a certain level of bias for labelers towards those types of ships. This is different from the Seaships database, for instance, which comprises a higher variety of cargo ships. For the purposes of future use in machine vision, rather than using maritime terminology as such (depicting ship scale and purpose), we selected labels that had some clearly distinct visual characteristics. A visual representation of the labels is illustrated in Figure 4. The label categories are discussed below, with more specific details for every category:

• boat—rowing boats or oval-shaped boats (from a lateral perspective), or small-sized boats, visual distinction – rowing-like boats even if they possess engine power;


**Figure 4.** Example images of annotated objects in the ABOships dataset: (**a**) boat, (**b**) cargoship, (**c**) cruiseship, (**d**) ferry, (**e**) militaryship, (**f**) miscboat, (**g**) miscellaneous (floater), (**h**) motorboat, (**i**) passengership, (**j**) sailboat and (**k**) seamark.

#### *3.5. Relabelling Algorithm*

The labelling was performed by multiple annotators with different backgrounds, hence some label types were interpreted differently among them. To increase the consistency of labelling, we used the continuous nature of the raw data by tracking the labels between frames using the CSRT tracker [20]. For every labelled frame, a tracker instance was created. The aim was to track an object until the next labelled frame. At that point, the existing traces would be mapped onto the labels of the new frame, based on the *IoU* metric. During this mapping, it was assumed that labellers would not confuse seamarks with vessels, hence ship labels were not mapped onto seamarks or vice versa. More importantly, previous labels were not taken into consideration, so even if annotators gave the same object conflicting labels in different frames, these labels would still belong to the same trace as long as the tracker could identify them. For cases where the mapping could not be found, the trace would assign a new label, <*Unlabeled*>, to denote that even though nothing was labeled in that specific case, the tracker indicated that the object should belong to the trace.
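The mapping of tracked boxes onto the labels of the next labelled frame could look roughly like the sketch below. It is an assumption-laden illustration rather than the authors' code: the greedy highest-*IoU*-first assignment, the `min_iou` cut-off of 0.5, the box format and all function names are hypothetical. The only elements taken from the text are the use of *IoU*, the rule that seamarks and vessels are never mapped onto each other, and the `<Unlabeled>` fallback.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format
UNLABELED = "<Unlabeled>"


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def map_traces_to_labels(traces: Dict[int, Tuple[Box, bool]],
                         labels: List[Tuple[Box, str]],
                         min_iou: float = 0.5) -> Dict[int, str]:
    """Assign each trace the label of the best-overlapping annotation.

    `traces` maps a trace id to (tracked box in the new frame, is_seamark).
    `labels` lists the (box, class name) annotations of the new frame.
    Apart from the seamark/vessel split, class names are ignored, so
    conflicting vessel labels can end up in the same trace.
    """
    candidates = []
    for trace_id, (t_box, trace_is_seamark) in traces.items():
        for idx, (l_box, name) in enumerate(labels):
            if (name == "seamark") != trace_is_seamark:
                continue  # never map vessels onto seamarks or vice versa
            score = iou(t_box, l_box)
            if score >= min_iou:
                candidates.append((score, trace_id, idx))

    assigned: Dict[int, str] = {}
    used_labels = set()
    for _, trace_id, idx in sorted(candidates, reverse=True):
        if trace_id in assigned or idx in used_labels:
            continue
        assigned[trace_id] = labels[idx][1]
        used_labels.add(idx)

    # Traces without a matching annotation receive the placeholder label.
    return {tid: assigned.get(tid, UNLABELED) for tid in traces}
```

A one-to-one greedy assignment is only one possible design; an optimal assignment (for example, the Hungarian method) would be an alternative when many boxes overlap.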

After a certain number of frames, either the tracker would lose the object (the most common reasons for this being object occlusion, or due to the object being either too far or exiting the frame altogether) or the tracker would have none of the defined labels mapped to it enough times (which would mean it most likely drifted onto another object). In both of those cases, the tracker was stopped and the resulting trace was saved to a file for further processing as described below.

To reduce the number of errors caused by occlusion and by the tracker drifting towards objects other than the current object of interest, we performed a second tracking in the backwards direction. By comparing labels identified in the traces acquired from tracking videos in both directions, one could detect situations where traces could not be mapped onto each other. Those cases signify that the object was occluded or that the tracker drifted to another object, so the affected traces had to be split further into smaller sequences until no more conflicts could be detected.

The resulting traces (after the backwards tracking) were provided as batches for relabelling. Traces containing a single entry were batched together with other singular traces from the same category. This setup was done with the purpose of preventing and removing accidental labels (mislabeling), while, at the same time, providing more information about the objects being annotated. This allowed us to accurately label even the objects at a longer distance as a consequence of tracking history. Traces obtained in this manner were then provided for relabelling as a collection of labels belonging to the same trace and annotators were asked to refine the labels so that labelling would be consistent with the labelling specifications. Singular entries that did not belong to any trace were subsequently batched together with other objects of the same category. The process described above is illustrated in Figure 5, while the relabelling software application is depicted in Figure 6.

**Figure 5.** The video collection was separated into 48 workpackages of images (1), which were labelled in an initial labelling step (2). Using the OpenCV Tracker, the objects were tracked across frames to produce traces (3) and then relabelled to fix inconsistencies and fill in the labels that might have been skipped (4). The resulting labels were then compiled into the maritime imagery dataset (5).

**Figure 6.** The relabelling process utilized our relabelling software application. Its GUI (graphical user interface) shows the annotator traces of tracked images between annotation frames (1). The annotator is required to either relabel every instance by selecting the correct label from the right panel, or edit an annotation (by selecting a label that emerged distinct from others (2)) and change the label of each image individually and possibly fix the bounding box to fit the object more tightly (3). Special attention was required in certain situations when the tracker would drift onto other objects, in which case that particular entry of the trace might have had a different label from the rest (4). When all labels belonging to a trace were verified and steps (1)–(4) were completed (5), the changes were saved into a new file and the annotator was provided with the next trace.

#### *3.6. Dataset Statistics*

Table 4 shows the number of images of each category in our dataset and the number of annotations. The column Images represents the number of images that contain that particular object class and then the percentage of images that comprise that class follows. Then the column Objects represents the number of annotations for that particular class in the dataset, along with the percentage of objects annotated for that specific class out of all the annotated objects in the dataset. One can notice from Table 4 that the highest representation of labels in the images from ABOships dataset is reached by three categories: motorboats (present in 41.11% of the images), sailboats (present in 38.88% of the images), and seamarks (present in 37.89% of the images). Conversely, the lowest representation is registered for cargoships (in 1.58% of the images) and miscellaneous floaters (in 1.30% of the images).

Moreover, Figure 7 illustrates the distribution of annotated objects in our dataset based on occupied pixel area at log2-scale, for every object category, and separates every object category by size in small, medium and large objects based on the Microsoft COCO variants (small: log2(area) < 10, medium: 10 < log2(area) < 13.16 and large: log2(area) > 13.16).

**Figure 7.** Histograms of occupied pixel area at log2-scale for all annotated objects by object category, divided into three groups for each category: small, medium and large according to Microsoft COCO variants (small: log2(area) < 10, medium: 10 < log2(area) < 13.16 and large: log2(area) > 13.16). The vertical colored lines represent the following values: the red line represents the mean of the distribution, the yellow line represents the threshold for small objects and the green line delineates the threshold for large objects. In each histogram, entries to the left of the yellow line represent the small objects group, entries between the yellow and the green lines show the medium-sized objects group and those to the right of the green line depict the large objects group.

**Table 4.** The table shows the number of images and annotations in the ABOships dataset for every object category, along with their overall percentages.


#### **4. Results**

#### *4.1. Evaluation Criteria*

To evaluate the performance of different object detection algorithms on specific datasets, one can employ various quantitative indicators. One of the most popular measures in object detection is the *IoU* (Intersection over Union), which defines the extent of overlap of two bounding boxes as the area of the intersection between the predicted bounding box *B<sub>p</sub>* and the ground truth bounding box *B<sub>gt</sub>*, divided by the area of their union [21]:

$$IoU = \frac{|B\_p \cap B\_{gt}|}{|B\_p \cup B\_{gt}|} \tag{1}$$

Given an overlap threshold *t*, one can estimate whether a predicted bounding box belongs to the background (*IoU* < *t*) or to the given classification system (*IoU* > *t*). With this measure, one can proceed to assess the average precision (*AP*) by calculating the precision and recall. The precision reflects the capability of a given detector to identify only relevant objects and it is calculated as the proportion of correctly identified bounding boxes over the total number of detected boxes. The recall reflects the capability of a detector to find all relevant cases and it is calculated as the proportion of correct positive predictions to all ground truth bounding boxes. Based on these two metrics one can draw a precision-recall curve, which encloses an area representing the average precision. However, in a majority of cases, this curve is highly irregular (zigzag pattern), making it challenging to estimate the area under it, i.e., the *AP*. To address this, one can approach it as an interpolation problem, either as an 11-point interpolation or an all-point interpolation [21].
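As a minimal sketch of how Equation (1) and the threshold *t* translate into precision and recall, consider the single-class example below. The box format, the function names, the default *t* = 0.5 and the greedy matching of confidence-ranked detections to still-unmatched ground truth boxes are assumptions in the spirit of common evaluation practice; the text does not specify the exact matching protocol used in the experiments.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def iou(box_p: Box, box_gt: Box) -> float:
    """Equation (1): area of the intersection divided by area of the union."""
    iw = max(0.0, min(box_p[2], box_gt[2]) - max(box_p[0], box_gt[0]))
    ih = max(0.0, min(box_p[3], box_gt[3]) - max(box_p[1], box_gt[1]))
    inter = iw * ih
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections: Sequence[Tuple[float, Box]],
                     ground_truths: Sequence[Box],
                     t: float = 0.5) -> Tuple[List[float], List[float]]:
    """Cumulative precision and recall over detections ranked by confidence.

    A detection counts as a true positive if its best IoU with a not yet
    matched ground truth box exceeds `t`; otherwise it is a false positive.
    """
    matched = set()
    true_positives = 0
    precisions: List[float] = []
    recalls: List[float] = []
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    for rank, (_, box) in enumerate(ranked, start=1):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            score = iou(box, gt)
            if score > best_iou:
                best_iou, best_idx = score, idx
        if best_idx is not None and best_iou > t:
            matched.add(best_idx)
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truths) if ground_truths else 0.0)
    return precisions, recalls
```

Plotting the returned precision values against the recall values yields the zigzag-shaped precision-recall curve mentioned above.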

The 11-point interpolation averages the maximum values of precision over 11 recall levels that are uniformly distributed [21], as depicted below:

$$AP\_{11} = \frac{1}{11} \sum\_{\mathcal{R} \in \{0, 0.1, \dots, 0.9, 1\}} P\_i(\mathcal{R}), \tag{2}$$

with

$$P\_i(\mathcal{R}) = \max\_{\mathcal{R}^\* \mid \mathcal{R}^\* \ge \mathcal{R}} P\_i(\mathcal{R}^\*). \tag{3}$$

*AP*<sub>11</sub> is calculated using the interpolated precision *P<sub>i</sub>*(*R*), i.e., the maximum precision attained at any recall greater than or equal to *R*.
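The 11-point interpolation can be written in a few lines. The sketch below assumes the precision and recall values of a detector are already available as two equally long lists (one entry per ranked detection); the function name and the convention of using a precision of 0 when no point reaches a given recall level are assumptions.

```python
from typing import Sequence


def ap_11_point(precisions: Sequence[float], recalls: Sequence[float]) -> float:
    """Average of the interpolated precision at recall levels 0, 0.1, ..., 1.

    For each level R, the interpolated precision is the maximum precision
    among all points whose recall is greater than or equal to R.
    """
    total = 0.0
    for k in range(11):
        level = k / 10
        reachable = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(reachable) if reachable else 0.0
    return total / 11
```

For instance, with precisions `[1.0, 0.5, 2/3]` at recalls `[0.5, 0.5, 1.0]`, the six levels up to 0.5 contribute 1.0 each and the five levels above contribute 2/3 each, giving (6 + 10/3) / 11 = 28/33.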

#### *4.2. Baseline Detection*

To explore the performance of CNN-based object detectors on our dataset, we focused on prevalent detectors: one-stage (SSD [18] and EfficientDet [22]) and two-stage detectors (Faster R-CNN [17] and R-FCN [23]). The detectors were previously trained on the Microsoft COCO object detection dataset, which comprises 91 object categories. The training dataset contains 3146 images of marine vessels. We investigated the performance of different feature extractors in the aforementioned detectors. We collected maritime vessel detection results based on SSD over different feature extractors (ResNet101, MobileNet v1, MobileNet v2). Moreover, we evaluated the performance of a new state-of-the-art detector, EfficientDet, on our dataset, which used EfficientNet D1 as feature extractor. We also evaluated two-stage detectors: Faster R-CNN and R-FCN with different feature extractors. Combining all proposed detectors with the feature extractors, a total of 8 algorithms were investigated. All information regarding the specific configuration of these detectors can be found in [24].

We estimated the performance of these algorithms in detecting maritime vessels, so we excluded seamark and miscellaneous labels from our experiments and focused on detecting vessels. Moreover, we considered only objects with an occupied pixel area larger than 16<sup>2</sup> pixels. The results of these experiments are reported in Table 5.

Our experiments indicated that the object size impacts the detection accuracy. To corroborate this observation, we divided all vessel labels (with an occupied pixel area larger than 16<sup>2</sup> pixels) in our dataset into three categories, based on the Microsoft COCO challenge's variants: small (16<sup>2</sup> < area < 32<sup>2</sup>), medium (32<sup>2</sup> < area < 96<sup>2</sup>) and large (area > 96<sup>2</sup>). Out of the annotated vessels with an occupied pixel area larger than 16<sup>2</sup> pixels in our dataset, 30.25% are small, 49.37% are medium and 20.37% are large.
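The size grouping can be expressed as a small helper, shown here as an illustrative sketch. The function name and the handling of areas that fall exactly on a boundary are assumptions, since the text only gives strict inequalities. The last lines show how the pixel-area thresholds relate to the log2-scale thresholds used for the histograms.

```python
import math


def size_category(area_px: float) -> str:
    """Group an annotation by occupied pixel area, following the COCO-style
    thresholds quoted in the text. Boundary handling is an assumption."""
    if area_px <= 16 ** 2:
        return "excluded"   # not part of the experiments reported in Table 5
    if area_px < 32 ** 2:
        return "small"
    if area_px < 96 ** 2:
        return "medium"
    return "large"


# The same thresholds on a log2 scale:
small_threshold_log2 = math.log2(32 ** 2)   # exactly 10
large_threshold_log2 = math.log2(96 ** 2)   # about 13.17 (quoted as 13.16 in the text)
```

For example, a 50 × 50 pixel bounding box has an area of 2500 pixels and therefore falls into the medium group.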

Analyzing the results from our experiments, we observe that detection accuracy decreases as object size decreases. For the best-performing detector on the ABOships dataset (Faster R-CNN with Inception ResNet v2 as feature extractor, with an overall *AP* of 35.18%), the *AP* more than doubles from small (*AP<sub>S</sub>* = 23.16%) to large objects (*AP<sub>L</sub>* = 46.84%). The second best detector on the whole dataset (EfficientDet with EfficientNet as feature extractor), however, had the best performance on the large-objects category, with an *AP<sub>L</sub>* = 55.48%. In general, detecting small objects turns out to be more difficult than detecting larger objects, given that there is less information associated with a smaller occupied pixel area. For medium-sized objects, the best performance is attained by SSD with ResNet101 as feature extractor (*AP<sub>M</sub>* = 31.18%). For small objects, the best-performing detector, Faster R-CNN with Inception ResNet v2, outperforms the other detectors with a registered *AP<sub>S</sub>* = 23.16%. Among the SSD configurations, the best performing one, in general, was the one with ResNet101 as feature extractor.

**Table 5.** Average Precision (AP) (in %) of the proposed CNN-based detectors on ABOships dataset, with different feature extractors and object sizes, for all objects with an occupied pixel area > 16<sup>2</sup> pixels.


#### **5. Qualitative Results**

Figure 8 illustrates an example of detection results for the proposed methods, selecting for each detector the feature extractor that scored the highest AP. We can notice in Figure 8 that Faster R-CNN with an Inception-ResNet-v2 feature extractor (a) and R-FCN with a ResNet101 feature extractor (c) provide detected regions registering high scores ranging from 0.91 to 0.99. The other two detectors in Figure 8, EfficientDet with EfficientNet as feature extractor (b) and SSD with ResNet101 as feature extractor (d), register satisfactory results, with scores ranging from 0.55 to 0.67.

**Figure 8.** Qualitative detection results for the ABOships dataset on (**a**) Faster R-CNN and Inception-ResNet-v2 as feature extractor, (**b**) EfficientDet with EfficientNet as feature extractor, (**c**) R-FCN with ResNet101 as feature extractor, and (**d**) SSD with ResNet101 as feature extractor. The ground truth bounding-boxes are shown as red rectangles. Predicted boxes by these methods are depicted as green bounding boxes. Each output box is associated with a class label and a score with a value in the interval [0, 1].

#### **6. Discussion**

Maritime vessel detection of inshore and offshore images is a topical issue in many areas, such as maritime surveillance and safety, marine and coastal area management, etc. Many of these fields require intricate management of disparate activities, which in practice often necessitate real-time monitoring. This implies, among other aspects, real-time detection of inshore and offshore ships. However, ship detection studies and methodology are mostly concerned with either satellite or radar imagery, which can prove to be unreliable in a real-time setting. For this very reason, algorithms, and specifically CNNs, employed on waterborne imagery are especially beneficial either on their own, or in fusion architectures.

Traditional ship detection methods using either background separation or histograms of oriented gradients provide satisfactory results under favorable sea conditions. However, the complexity of the marine environment, including challenging environmental factors (glare, fog, clouds, high waves, rain etc.), renders the extraction of low-level features unreliable. Recent studies involving CNNs address this issue, but deep learning requires domain-specific datasets to produce satisfactory performance. However, public datasets specifically designed for maritime vessel detection are scarce to this day [1]. We discuss this in more detail in Section 2.

Performing exploratory analysis on our dataset, in comparison with other recent maritime object detection datasets (Singapore Maritime Dataset [15], SeaShips [3], MC-Ships [19]), a few aspects emerge that we discuss as follows. Comparing our dataset to the Singapore Maritime Dataset, one can notice (from Table 3) that ABOships registers a higher number of ship types (9 vs. 6). However, considering the number of annotations per image, the Singapore dataset registers almost 3 times more annotations on average per image (11.05 vs. 4.2). The SeaShips dataset consists of 31,455 images, more than 3 times the image total of our dataset, but ABOships provides more annotations, with a greater average number of annotations per image (4.2 vs. 1.2). SeaShips consists mostly of images with one annotation per image. MC-Ships provides 13 ship categories (vs. 9 ship categories in ABOships), but only offers just over 26K annotations, with an average of 1.8 annotations per image, see Table 3. We note that our dataset annotations also comprise seamarks and miscellaneous floaters in addition to the 9 ship types.

We tested our relabelling software application on the Singapore Maritime Dataset, as suggested by our reviewers, and the tracker was able to consistently map object labels from one frame to another correctly (without drifting from the object of interest to other objects), which did not always occur when we performed the tracking on the ABOships dataset. There are a few aspects that can influence the tracker's performance and those most probably affected its performance on the ABOships dataset. First, the videos included in the Singapore Maritime Dataset have a higher frame rate (30 FPS), double that of the videos in our dataset (15 FPS). Moreover, the videos from the Onshore dataset (one part of the Singapore Maritime Dataset) have higher resolution. Videos in the Onshore dataset do not have a high density of annotations per video. Furthermore, the environment present in the images of our dataset is far more complex, including urban landscapes and complicated backgrounds, especially in the port area.

#### **7. Conclusions**

This paper provides a solution for addressing the annotation inconsistencies that arise from manual labeling of images, using the CSRT tracker [20]. We build traces of the images in the videos they originated from and use the CSRT tracker to traverse these videos in both directions and identify the possible inconsistencies. After this step, we employed a second round of labeling and obtained a set of 41,967 carefully annotated objects, comprising 9 types of maritime vessels (boat, miscboat, cargoship, passengership, militaryship, motorboat, ferry, cruiseship, sailboat), miscellaneous floaters and seamarks.

We ensured the dataset consists of images taking into account the following factors: background variation, atmospheric conditions, illumination, visible proportion, occlusion and scale variation. We performed a comparison of the out-of-the-box performances of four state-of-the-art CNN-based detectors (Faster R-CNN [17], R-FCN [23], SSD [18] and EfficientDet [22]). These detectors were previously trained on the Microsoft COCO dataset. We assessed the performance of these detectors based on feature extractor and object size. Our experiments show that Faster R-CNN with Inception-ResNet v2 outperforms the other algorithms for objects with an occupied pixel area > 16<sup>2</sup> pixels, except in the large object category, where EfficientDet registers the best performance with an *AP<sub>L</sub>* = 55.48%.

For future research, we plan to investigate different types of errors in the manual labelling, for cases where the labels still have inconsistencies, such as: fine-grained recognition (which makes it more difficult even for humans to detect objects, even when they are in plain view [25]), class unawareness (some annotators become unaware of certain classes as ground truth options) and insufficient training data (not enough training data for the annotators).

Moreover, we plan to investigate in more detail the detection of small and very small objects, including those with an occupied pixel area < 16<sup>2</sup> pixels. Furthermore, distinguishing between different vessel types in our dataset will be an essential focus of the next steps in our experiments. In order to do this, we plan to exploit transfer learning in the form of both heterogeneous transfer learning and homogeneous domain adaptation.

To further our research, we will employ maritime vessel tracking detectors on the original videos captured in the Finnish Archipelago and examine the impact on autonomous navigation and navigational safety.

**Author Contributions:** V.S. and J.L. planned video capture and collection in the Finnish Archipelago. B.I. and V.S. planned the annotation process and wrote the annotation requirements. L.Z. and V.S. supervised and participated in the annotation process. V.S. implemented the relabelling algorithm. B.I. planned the experiments on the relabelled dataset and supervised their implementation. L.Z. wrote the software for the evaluation of the algorithms on the datasets. V.S. wrote the software for AP calculations. All authors contributed to the interpretation of results. B.I. wrote the following sections and subsections: Introduction, Conclusion, Experimental Results, Dataset Statistics, Dataset Diversity, Annotation. L.Z. and V.S. wrote the Related Work section. Valentin Soloviev wrote the following subsections in Materials and Methods: Dataset Design, Relabelling Algorithm. The annotation subsection was written by V.S. and B.I. B.I. planned the manuscript writing, and revised the final writing of each section. B.I. and J.L. supervised the evaluation of the algorithms, AP calculation. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author. The data are not yet publicly available because they are in the process of being published; publication is planned at: https://www.fairdata.fi/en/ (accessed on 4 February 2021). For reviewers we can provide a separate package with data and any necessary code in the meantime.

**Acknowledgments:** The annotation of the ABOships dataset was completed with the help of the following persons: Sabina Bäck, Imran Shahid, Joel Sjöberg and Alina Torbunova.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **A Co-Operative Autonomous Offshore System for Target Detection Using Multi-Sensor Technology**

#### **Jose Villa 1,\* , Jussi Aaltonen <sup>1</sup> , Sauli Virta <sup>2</sup> and Kari T. Koskinen <sup>1</sup>**


Received: 16 November 2020; Accepted: 11 December 2020; Published: 16 December 2020

**Abstract:** This article studies the design, modeling, and implementation challenges for a target detection algorithm using multi-sensor technology of a co-operative autonomous offshore system, formed by an unmanned surface vehicle (USV) and an autonomous underwater vehicle (AUV). First, the study develops an accurate mathematical model of the USV to be included as a simulation environment for testing the guidance, navigation, and control (GNC) algorithm. Then, a guidance system is addressed based on an underwater coverage path for the AUV, which uses a mechanical imaging sonar as the primary AUV perception sensor and ultra-short baseline (USBL) as a positioning system. Once the target is detected, the AUV sends its location to the USV, which creates a straight-line for path following with obstacle avoidance capabilities, using a LiDAR as the main USV perception sensor. This communication in the co-operative autonomous offshore system includes a decentralized Robot Operating System (ROS) framework with a master node at each vehicle. Additionally, each vehicle uses a modular approach for the GNC architecture, including target detection, path-following, and guidance control modules. Finally, implementation challenges in a field test scenario involving both AUV and USV are addressed to validate the target detection algorithm.

**Keywords:** target detection; co-operative; autonomous; multi-robot; USV; AUV

### **1. Introduction**

In recent years, the use of autonomous offshore vehicles, which include autonomous underwater vehicles (AUVs) and unmanned surface vehicles (USVs), for marine interventions has attracted increasing interest from research scientists, maritime industries, and the military. These interventions include several activities such as offshore surveillance, offshore target detection, seabed explorations, or search and rescue (SAR) missions. Additionally, the use of multi-robot platforms can improve the performance in these activities, as they can include above and below-water characterization. Regarding multi-robot platforms, Vasilijević et al. [1] presented a co-operative robotic system consisting of an AUV and a USV for ocean sampling and environmental monitoring. In [2], the study used a heterogeneous collaborative system of above, surface, and underwater robots to obtain a multi-domain awareness on a floating target. The heterogeneous system consists of a USV, an AUV, and an unmanned aerial vehicle (UAV). Additionally, Gu et al. [3] studied a homogeneous system, presenting a guidance and control law design method for coordinated path following of networked under-actuated robotic USVs under directed communication links. In [4], the control scenario simulated a homogeneous AUV fleet to study formation tracking control and collision-obstacle avoidance.

To accomplish target detection in the offshore environment, the availability of accurate USV and AUV mathematical models is crucial for simulation study purposes, controller design, and development. The theoretical six-degrees-of-freedom (DOFs) dynamic model [5], based on nonlinear equations of motion, can be used for the design and modeling of the AUV. Equally, the USV can use the same dynamic model as the AUV but with reduced order for three-DOF horizontal plane control (surge, sway, and yaw motions). Several tools can help to obtain the coefficients of the dynamic model equations and the necessary transfer functions of each vehicle. These tools can include the parameter estimation from MATLAB-Simulink [6], and the system identification (SI) [7,8], introduced to develop the mathematical model using field test data. In [9], SI of the maneuvering data determined the hydrodynamic coefficients of a USV. Also, the mathematical model of the USV includes the propulsion and power system. Commonly, the rudder and propeller, or waterjet propulsion systems, provide the heading and the speed control of most existing USVs. In [10], a twin waterjet propelled USV was modeled based on SI, but the model neglects the dynamics of the propulsion system.

Target detection in offshore environments is a fundamental activity that combines different perception sensors. Numerous studies use passive (stereo cameras) or active (LiDAR or radar) perception methods to obtain situational awareness of a USV. Nonetheless, most obstacle detection methods rely on depth measurements, for which LiDAR sensors are the most reliable source. Correspondingly, sonar devices are still the most convenient option for collecting data on underwater environments. Mechanical imaging sonar, multibeam, profiler, or sidescan are some of the main sonar imaging and ranging devices. For target detection with sonar devices, how detectable a target is depends mainly on the physical characteristics of the target and the acoustic signal. Some studies use sonar devices for target detection capabilities, as in [11], where a profiler sonar was adopted for obstacle detection. In [12], a method for underwater obstacle detection (standard buoy) was developed using forward-looking sonar and a probabilistic local occupancy grid.

Correct localization and navigation are crucial to ensure the accuracy of the gathered data for all these applications. Above the water surface, most autonomous systems rely on radio or global positioning and spread-spectrum communications, such as a GPS-compass installed on the USV platform. However, those signals propagate only over short distances in an underwater scenario, where acoustic-based systems perform better. Regarding underwater navigation, the three fundamental methods are dead-reckoning (DR) and inertial navigation systems (INS), acoustic navigation, and geophysical navigation techniques [13]. These navigation methods require specific survey and navigation sensors installed in the AUV. The Girona 500 [14] is an example of an AUV that performs traditional dead-reckoning navigation utilizing a Doppler velocity log (DVL) and a solid-state attitude and heading reference system (AHRS). Also, the absolute position can be obtained through a GPS when the vehicle is on the surface and using an ultra-short baseline (USBL) while underwater. The high-accuracy USBL system allows the localization of the AUV and the communication between the vehicle and the surface unit. In [15], the study provided a navigation algorithm for an underwater vehicle with a Kalman filter to estimate the error state via measurement residuals from aiding sensors. These aiding sensors incorporate an attitude sensor, a DVL, a long-baseline (LBL) system, and a pressure sensor. In acoustic navigation techniques, acoustic transponders and modems perform localization by measuring the time-of-flight of signals from acoustic beacons or modems. USBL navigation allows an AUV to localize itself relative to a USV, and it provides an efficient and reliable acoustic communication network [16]. In [17], the study presented the design and implementation of a USBL-aided navigation approach for an AUV in a two-parallel extended Kalman filter (EKF). It also includes the measurements provided by a DVL, a visual odometer, an inertial measurement unit (IMU), a pressure sensor, and a GPS.

Safe and adequate control of the offshore vehicles depends notably on proper guidance, navigation, and control (GNC) systems. This study adopts path-following as the guidance approach for both offshore platforms. Path-following is closer to practical engineering and easier to implement than trajectory tracking. A widely used method for path-following in autonomous vehicles is the so-called line-of-sight (LOS) guidance. LOS guidance is classified as a three-point guidance scheme, involving a commonly stationary reference point along with the interceptor and the target [5]. In [18], the study developed a guidance-based algorithm for path-following using the LOS algorithm in offshore operations. Additionally, in [10], path-following with obstacle avoidance based on the safety boundary box approach was implemented in a USV with a LOS-based guidance system.

Because the offshore system in this study is co-operative, the information obtained from the individual vehicles must be fused. The Robot Operating System (ROS) has been an effective tool when working with multi-robot systems. It is a flexible framework for writing robot software and provides the tools to acquire sensor data, process it, and generate the necessary response for the vehicle actuators [19]. Multi-robot systems can either be centralized, with a ROS master node at the ground control station (GCS), or decentralized, with each autonomous vehicle (AV) running an independent ROS master. Decentralized control techniques are more flexible and profitable, and they generally reduce the communication network requirements compared with centralized control [20]. However, they are also more challenging due to obstacles, uncertainties, and communication constraints such as noise, delays, dropouts, or failures. In this case, the multi-master approach provides a solution where each vehicle keeps its own ROS master and exchanges the necessary information with the other components of the multi-robot system. In [21], a package for efficiently developing multi-master architectures was proposed.

In the presented manuscript, the mathematical model of the USV consists of the simplified three DOFs dynamic model [5], where their parameters are obtained from field test data using the parameter estimation tool. Additionally, the waterjet model has been included in the mathematical model of the USV using data from the manufacturer and transfer functions based on SI. The AUV platform considered in this study does not incorporate a DVL, neglecting the velocity feedback of the vehicle. However, the installed USBL provides an absolute position and a communication link between the USV and the AUV. Thus, the AUV platform includes a basic setup for underwater localization, but it is not able to precisely locate the vehicle underwater. The path-following algorithm uses the LOS approach for heading control to simplify the guidance control of the AUV, keeping a constant depth and constant surge speed. The target detection algorithm uses a modular ROS architecture to provide a computationally cheap and simple implementation in both offshore platforms. Furthermore, the offshore system includes two different perception sensors based on the same target detection algorithm. Finally, a multi-master architecture is in charge of the interaction between the AUV and USV, providing an easy plug-and-play solution for the multi-robot system.

In this work, a model-based GNC architecture for a co-operative autonomous offshore system is proposed for target detection using multi-sensor technology. In Section 2, the USV modeling and simulation are presented using the parameter estimation tool to define the waterjet and USV maneuvering model. Furthermore, this section includes an overview of the USV and AUV platforms. Then, in Section 3, the GNC system for the co-operative tasks is included using the LOS-based guidance system for control. The target detection algorithm is developed using a mechanical imaging sonar at the AUV and a LiDAR at the USV as the primary perception sensor for underwater and surface inspection, respectively. Finally, in Section 4, the implementation of a GNC architecture is described as modular and multilayer for the multi-robot system. A control scenario in a field test is shown in this section to validate the proposed target detection algorithm.

#### **2. Modeling and Simulation for the Offshore Vehicles**

The co-operative autonomous offshore system consists of two different vehicles: a USV and an AUV. This section gives an overview of both subsystems, and it describes the simulation model of the USV, which provides the capability to develop the GNC algorithms.

#### *2.1. Overview of Under-Actuated USV*

This article uses an under-actuated USV as the primary vehicle in the co-operative autonomous offshore system. The USV has an aluminum hull with a thrust-vectoring waterjet propulsion system, which provides optimal maneuverability using a twin waterjet configuration. Figure 1 shows a simplified model of the vehicle, where the port and starboard (STDB) waterjets produce the thrust forces needed to move forward, backward, and sideways, or to perform turns. Additionally, Figure 1 includes the position and orientation of the USV in the North-East-Down (NED) coordinate system. The NED coordinate system is related to planar Cartesian coordinates, so a coordinate transformation is performed on the GPS-compass output to get the USV's absolute position. This transformation converts longitude and latitude (*l*, *µ*) in the World Geodetic System 84 (WGS84) coordinate system to ETRS-TM35FIN [22], which gives the NED position (*x*USV, *y*USV). The Euler angles provide the USV heading or yaw angle *ψ*. The motion of the USV has three DOFs, which are surge, sway, and yaw (linear (*u*, *v*) and angular *r* velocities), while roll, pitch, and heave motions are ignored.

**Figure 1.** Simplified model of the unmanned surface vehicle (USV) using the North-East-Down (NED) coordinate system. USV motion is described by surge *u* (linear longitudinal motion), sway *v* (linear transverse), and yaw motion *r* (turning rotation about its z-axis).

#### *2.2. USV Modeling*

The development of an adequate maneuvering model will simplify the GNC algorithms design and simulation. The three DOFs horizontal plane model for maneuvering of a USV consists of the rigid-body kinetics [5]

$$\mathbf{M}\dot{\boldsymbol{\nu}} + \mathbf{C}(\boldsymbol{\nu})\boldsymbol{\nu} + \mathbf{D}(\boldsymbol{\nu})\boldsymbol{\nu} = \boldsymbol{\tau} + \boldsymbol{\tau}\_{\text{wind}} + \boldsymbol{\tau}\_{\text{wave}} \tag{1}$$

where *ν* = [*u*, *v*, *r*]<sup>T</sup> is the velocity vector composed of surge, sway, and yaw. *τ* = [*τ*u, 0, *τ*r] is the vector of forces and moments generated by the twin waterjet configuration, while *τ*wind and *τ*wave are the environmental forces. *M*, *C*(*ν*), and *D*(*ν*) are the mass, Coriolis, and damping matrices, respectively, where *M* and *C*(*ν*) combine added and rigid-body terms. The mass matrix *M* is defined by

$$\mathbf{M} = \mathbf{M}\_{\rm RB} + \mathbf{M}\_{\rm A} = \begin{bmatrix} m - X\_{\dot{u}} & 0 & 0 \\ 0 & m - Y\_{\dot{v}} & m x\_{\rm g} - Y\_{\dot{r}} \\ 0 & m x\_{\rm g} - Y\_{\dot{r}} & I\_{\rm z} - N\_{\dot{r}} \end{bmatrix},\tag{2}$$

where *m* is the mass of the vehicle, *I*<sub>z</sub> is the moment of inertia about the *z*<sub>b</sub> axis, *r*<sup>b</sup><sub>g</sub> = [*x*<sub>g</sub>, *y*<sub>g</sub>, *z*<sub>g</sub>]<sup>T</sup> is the vector from the origin *o*<sub>b</sub> to the centre of gravity (CG), and *X*<sub>u̇</sub>, *Y*<sub>v̇</sub>, *Y*<sub>ṙ</sub>, and *N*<sub>ṙ</sub> represent hydrodynamic added mass. The moment of inertia *I*<sub>z</sub> at the pivot point has been estimated based on the calculation of the moments of inertia in the rear *I*<sub>z,rear</sub> and front *I*<sub>z,front</sub> of the USV. These moments of inertia are defined by

$$I\_{\rm z,rear} = m\_{\rm pt} l\_{\rm pt}^2 + \left(\frac{1}{3} \, m\_{\rm hull} \, c\_{\rm g} \right) l\_{\rm pivot}^2 \tag{3}$$

$$I\_{\rm z,front} = \frac{1}{3} \, m\_{\rm hull} \left(1 - c\_{\rm g}\right) \kappa \left(l\_{\rm USV} - l\_{\rm pivot}\right)^2, \tag{4}$$

where *m*pt is the estimated powertrain mass (engines, waterjets, fuel, etc.), *l*pt is the estimated location of the powertrain mass, *m*hull is the hull weight without powertrain mass, *c*<sup>g</sup> is the relative center of mass point having one as the front of the USV, *l*pivot is the pivot point location, *κ* is a scaling factor as the mass is not evenly distributed from the pivot point to the front of the USV, and *l*USV is the length of the USV. The total moment of inertia *I*<sup>z</sup> is defined by

$$I\_{\rm z} = \left(I\_{\rm z,rear} + I\_{\rm z,front}\right) I\_{\rm cor} \tag{5}$$

where *I*cor is the tuning factor for the moment of inertia.
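As a rough illustration of Equations (3)–(5), the following Python sketch evaluates the yaw moment of inertia from the quantities defined above. The function name and all numeric values are hypothetical and are not taken from the study; the scaling factor *κ* is assumed to multiply the front term, since the text defines it for the section between the pivot point and the front of the USV.

```python
def yaw_inertia(m_pt, l_pt, m_hull, c_g, l_pivot, l_usv, kappa=1.0, i_cor=1.0):
    """Estimate I_z following Equations (3)-(5).

    kappa is assumed to scale the front term; i_cor is the tuning factor.
    """
    # Equation (3): powertrain point mass plus rear part of the hull.
    i_rear = m_pt * l_pt ** 2 + (m_hull * c_g / 3.0) * l_pivot ** 2
    # Equation (4): front part of the hull, scaled by kappa (assumed placement).
    i_front = (m_hull * (1.0 - c_g) / 3.0) * kappa * (l_usv - l_pivot) ** 2
    # Equation (5): total inertia with the tuning factor.
    return (i_rear + i_front) * i_cor


# Hypothetical values, chosen only to exercise the formula.
i_z = yaw_inertia(m_pt=1000.0, l_pt=1.0, m_hull=2000.0, c_g=0.4,
                  l_pivot=3.0, l_usv=8.0)
```

In practice, `kappa` and `i_cor` would be tuned so that the simulated yaw response matches the field test data.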

The Coriolis-centripetal matrix *C*(*ν*) can always be parameterized such that *C*(*ν*) = −*C*<sup>T</sup>(*ν*) [23]. However, linearization of the Coriolis and centripetal forces *C*RB(*ν*) and *C*A(*ν*) about zero angular velocity (*p* = *q* = *r* = 0) implies that the Coriolis and centripetal terms can be removed from the above expressions, that is, *C*RB(*ν*) = *C*A(*ν*) = 0 [24]. Additionally, the mathematical model is simplified to take into account only surge and yaw motions, so the Coriolis and centripetal terms have been removed from the three DOFs dynamic model in this study.

The different damping terms contribute to linear and quadratic damping [5]. Nonetheless, it is generally difficult to distinguish these effects. The total hydrodynamic damping matrix *D*(*νr*) is the sum of the linear part *D*lin and the nonlinear part *D*nlin(*νr*) such that

$$D(\nu\_r) = D\_{\text{lin}} + D\_{\text{nlin}}(\nu\_r) \tag{6}$$

where *D*lin is the linear damping matrix produced by potential damping and possible skin friction, and *D*nlin(*νr*) is the nonlinear damping matrix as a result of the quadratic damping and higher-order terms, defined by

$$D\_{\text{lin}} = \begin{bmatrix} -X\_{u} & 0 & 0 \\ 0 & -Y\_{v} & -Y\_{r} \\ 0 & -Y\_{r} & -N\_{r} \end{bmatrix} \tag{7}$$

$$D\_{\text{nlin}}(\nu\_r) = \begin{bmatrix} -X\_{|u|u} & 0 & 0 \\ 0 & -Y\_{|v|v} & 0 \\ 0 & 0 & -N\_{|r|r} \end{bmatrix} \left| \nu\_r \right|. \tag{8}$$
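To make the role of the damping terms concrete, the sketch below integrates only the surge component of Equation (1) with the Coriolis terms dropped, that is (*m* − *X*<sub>u̇</sub>) *u̇* + (−*X*<sub>u</sub> − *X*<sub>|u|u</sub>|*u*|) *u* = *τ*<sub>u</sub>. It is a simplified illustration rather than the Simulink model of the study: it assumes no current (so the relative velocity equals *u*), uses forward-Euler integration, and all coefficient values are hypothetical.

```python
def simulate_surge(tau_u, m_eff, x_u, x_uu, t_end=60.0, dt=0.01, u0=0.0):
    """Forward-Euler integration of the surge equation.

    m_eff stands for (m - X_udot); x_u and x_uu are the linear and
    quadratic damping coefficients (negative for a dissipative hull).
    """
    u = u0
    for _ in range(int(round(t_end / dt))):
        # Surge damping force from Equations (7) and (8).
        damping = (-x_u - x_uu * abs(u)) * u
        u += dt * (tau_u - damping) / m_eff
    return u


# Hypothetical coefficients, not the estimated values of Table 4.
u_final = simulate_surge(tau_u=1000.0, m_eff=1000.0, x_u=-50.0, x_uu=-100.0)
```

With these numbers, the surge speed settles where thrust balances damping, 50 *u* + 100 *u*² = 1000, which gives roughly 2.92 m/s.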

The USV used in this study includes the AJ245 waterjet units [25]. The nozzle position *P*nozzle varies the direction of the jet flow, which generates the force needed for turning. Thus, the total thrust force *F*total combines the engine rpm of the waterjet *n*rpm and *P*nozzle. The variable *n*rpm is directly gathered from the waterjet engine, and *P*nozzle is a variable from −10,000 to 10,000, with 0 as the neutral position and equal to forward motion. Table 1 shows the data obtained from the manufacturer Alamarin-Jet Oy for these waterjet units at a specific operating point. This operating point is selected at 1800 rpm, nozzle in the neutral position, and bucket in the full up position.

**Table 1.** Data obtained from manufacturer for an operating point of a single AJ245 waterjet unit.


The thrust forces and torques for the mathematical model of the USV are defined according to the manufacturer's data and an affinity law. Thus, a two-dimensional (2D) lookup table can include the relation between the shaft rotational speed of the waterjet engine *N* with the thrust force per waterjet *F*. The affinity law used to obtain the thrust force at the waterjet units is defined by

$$\frac{F\_1}{F\_2} = \left(\frac{N\_1}{N\_2}\right)^2. \tag{9}$$

Figure 2 shows the results for the affinity law with the manufacturer's data for a waterjet engine from 600 to 2400 rpm, which match the operational engine speeds of this study.

**Figure 2.** Thrust force *F* generated by the waterjet propulsion system depending on the shaft rotational speed *N*.
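The affinity law in Equation (9) can be sketched in a few lines of Python. The reference thrust used below is a hypothetical placeholder, since the manufacturer's figures of Table 1 are not reproduced here; only the 1800 rpm operating point is taken from the text.

```python
def thrust_from_affinity(n_rpm, n_ref_rpm, f_ref):
    """Scale a reference thrust to another shaft speed using Equation (9)."""
    return f_ref * (n_rpm / n_ref_rpm) ** 2


N_REF_RPM = 1800.0   # operating point stated in the text
F_REF_N = 4000.0     # hypothetical thrust at the operating point

# Thrust curve over the operational engine speeds (600 to 2400 rpm).
thrust_curve = {n: thrust_from_affinity(n, N_REF_RPM, F_REF_N)
                for n in range(600, 2401, 300)}
```

Halving the shaft speed reduces the predicted thrust to a quarter, which is the quadratic relation plotted in Figure 2.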

In the mathematical model, a 2D lookup table provides the engine rpm and the surge speed of the USV as inputs, and the total thrust generated by the waterjet unit as output. Also, a one-dimensional (1D) lookup table *f*(*Joyu*) obtains the engine rpm depending on the joystick input for surge motion, and a second-order transfer function adds the waterjet dynamics of the engine rpm into the mathematical model. This transfer function is obtained using the SI tool from MATLAB and the field test data of the USV. Thus, the engine rpm is calculated based on the combination of the 1D lookup table and the engine rpm transfer function, defined by

$$n\_{\rm rpm}(s) = \frac{0.317s^2 + 2.793s + 1.828}{s^2 + 3.499s + 1.828} f(Joy\_u). \tag{10}$$
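The behavior of the transfer function in Equation (10) can be checked with a small time-domain simulation. The sketch below is not the MATLAB implementation used in the study: it assumes a constant lookup-table output `command`, converts the transfer function to a controllable canonical state-space form, and integrates it with forward Euler. The function name and step sizes are illustrative choices.

```python
def rpm_step_response(command, t_end=20.0, dt=0.001):
    """Simulate Equation (10) for a constant lookup-table output `command`."""
    b2, b1, b0 = 0.317, 2.793, 1.828   # numerator coefficients
    a1, a0 = 3.499, 1.828              # denominator coefficients (monic)
    # Split off the direct feed-through b2 so the remainder is strictly proper.
    c1 = b1 - b2 * a1
    c0 = b0 - b2 * a0
    x1 = x2 = 0.0
    response = []
    for _ in range(int(round(t_end / dt))):
        response.append(c0 * x1 + c1 * x2 + b2 * command)
        dx1 = x2
        dx2 = -a0 * x1 - a1 * x2 + command
        x1 += dt * dx1
        x2 += dt * dx2
    return response


y = rpm_step_response(1.0)
```

Because the numerator and denominator share the constant term 1.828, the steady-state gain is one, so the engine rpm converges to the commanded value; the feed-through term gives an immediate jump to about 32% of the command.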

In the case of the heading motion of the USV, the total efficiency *η*nozzle of the thrust force depends on the nozzle position (which corresponds to the angle of the waterjet thrust force *α*nozzle). According to the waterjet manufacturer, if the nozzle is deflected to the maximum nozzle angle *α*nozzle = ±25° (corresponding to *P*nozzle = ±10,000), the efficiency drops exponentially to 30–40% of the maximum (center) value. The exponential function is obtained using the general exponential model

$$
\eta\_{\text{nozzle}}(P\_{\text{nozzle}}) = a \exp(b \, P\_{\text{nozzle}}),
\tag{11}
$$

where *a* = 1 and *b* = −9.163 × 10<sup>−5</sup>.
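A short Python sketch shows how Equation (11) behaves with these constants. Taking the absolute value of the nozzle position is an assumption made here so that the efficiency is symmetric for port and starboard deflections; Equation (11) as printed uses *P*nozzle directly.

```python
import math

A = 1.0
B = -9.163e-5


def nozzle_efficiency(p_nozzle):
    """Thrust efficiency from Equation (11).

    abs() is an assumption so that negative nozzle positions are
    penalized the same way as positive ones.
    """
    return A * math.exp(B * abs(p_nozzle))


eta_center = nozzle_efficiency(0)        # neutral nozzle
eta_full = nozzle_efficiency(10_000)     # full deflection
```

At full deflection the function returns about 0.40, consistent with the 30–40% range quoted from the manufacturer.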

Similarly to the dynamics of the waterjet calculation for the engine rpm, the nozzle position includes a 1D lookup table *f*(*Joyr*) and a first-order transfer function. This transfer function is obtained also from the SI tool from MATLAB based on field test data. The nozzle position of each waterjet is defined by

$$P\_{\text{nozzle}}(s) = \frac{-\exp(-0.25s)}{0.1s + 1} f(Joy\_r). \tag{12}$$

Regarding the behavior of the transfer functions for both engine rpm and nozzle position, Figure 3 shows the comparison between the SI tool transfer functions and the field test data for both the *n*rpm and *P*nozzle variables.

**Figure 3.** Comparison between the USV field test data and the system identification (SI) transfer functions: (**a**) Engine rpm *n*rpm. (**b**) Nozzle position *P*nozzle.

Additionally, the parameters for the 1D Lookup table are obtained from field test data and are presented in Table 2.


**Table 2.** 1D Lookup Table parameters.

Finally, the vector *τ* = [*τ*u, 0, *τ*r], which represents the forces and moments generated by the two waterjets, is defined by

$$\begin{cases} \tau\_{\rm u} = (F\_{\rm PORT} + F\_{\rm STDB}) \eta\_{\rm nozzle} \\ \tau\_{\rm r} = l\_{\rm pivot} \sin(\alpha\_{\rm nozzle}) (F\_{\rm PORT} + F\_{\rm STDB}) \eta\_{\rm nozzle} \end{cases} \tag{13}$$
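Equation (13) can be illustrated with the sketch below, which combines the thrust of both waterjets with the nozzle efficiency of Equation (11). Two points are assumptions rather than statements from the study: the nozzle angle is taken to vary linearly with the nozzle position (±10,000 mapping to ±25°), and the efficiency uses the absolute nozzle position. The function name and the example values are hypothetical.

```python
import math

MAX_NOZZLE_ANGLE_DEG = 25.0      # stated maximum nozzle angle
MAX_NOZZLE_POSITION = 10_000.0   # stated maximum nozzle position


def waterjet_forces(f_port, f_stdb, p_nozzle, l_pivot):
    """Return (tau_u, tau_r) following Equation (13)."""
    # Assumed linear map from nozzle position to nozzle angle.
    alpha = math.radians(MAX_NOZZLE_ANGLE_DEG) * p_nozzle / MAX_NOZZLE_POSITION
    # Equation (11), with abs() assumed for symmetry.
    eta = math.exp(-9.163e-5 * abs(p_nozzle))
    total = (f_port + f_stdb) * eta
    tau_u = total
    tau_r = l_pivot * math.sin(alpha) * total
    return tau_u, tau_r


# Hypothetical thrust of 1000 N per waterjet and a 3 m pivot arm.
tau_u, tau_r = waterjet_forces(1000.0, 1000.0, 10_000.0, 3.0)
```

With the nozzle centered, the yaw moment is zero and the full thrust acts in surge; at full deflection the surge force drops to about 40% while a yaw moment appears.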

Figure 4 shows the schematic with all the necessary functions for the USV dynamic model, from the joystick controller input to the vehicle's position output. The waterjet model includes the 1D lookup table that translates joystick commands to rpm, the second-order transfer function, and the 2D lookup table related to the thrust force of each waterjet unit. Furthermore, it also includes the 1D lookup table that translates joystick commands to the nozzle position, the first-order transfer function, the thrust force efficiency depending on the nozzle position, and the calculation of the total torque. Both the thrust force *τ*<sub>u</sub> and the torque *τ*<sub>r</sub> are the inputs of the mathematical model of the USV based on the three DOFs dynamic model. The position and orientation of the USV are obtained by integrating the velocity vector *ν*.

**Figure 4.** Schematic of the mathematical model of the USV including both waterjet propulsion system and USV dynamic models.

#### *2.3. USV Model-Validation Using Parameter Estimation*

The matrices *M* and *D*(*ν*) of the three DOFs Dynamic model are estimated with the parameter estimation tool from MATLAB-Simulink. The matrices are defined in the Simulink model by creating the matrices from input values. Then, the MATLAB-Simulink tool can estimate the individual coefficients of the dynamic matrices.

There are two different parameter estimation runs, related to surge and yaw motion. Table 3 shows the constant values shared in both experiments, while Table 4 shows the coefficients obtained from the parameter estimation tool with their results. Only the surge and yaw motion coefficients, *X*<sub>u</sub>, *X*<sub>u̇</sub>, *X*<sub>|u|u</sub> and *N*<sub>r</sub>, *N*<sub>ṙ</sub>, *N*<sub>|r|r</sub>, respectively, have been considered and estimated in this study, as the mathematical model focuses on these two USV motions.


**Table 3.** Principal characteristics of the under-actuated USV.

**Table 4.** Parameter estimation results for the surge and yaw motion coefficients.


Figure 5 compares the field test data (raw and filtered USV linear and angular velocities), the three DOFs dynamic model with the coefficients obtained from the parameter estimation, and the SI results from [10], for the joystick controller input shown in Figure 3. For both linear and angular velocities, the parameter estimation improves on the previous SI approach, giving an accurate reproduction of the USV maneuvering compared to the field test results.

**Figure 5.** Comparison plot between SI tool, parameter estimation (PE) app, and field test data: (**a**) Surge motion. (**b**) Heading motion.

#### *2.4. Overview of the AUV*

This article uses a highly configurable AUV platform that can carry different scientific instrumentation. The vehicle contains basic instrumentation and sensors for localization and target detection, including a USBL and a depth sensor for underwater localization and navigation, an AHRS from the flight controller for the navigation of the AUV, and a mechanical imaging sonar (Tritech Micron [26]) as the main underwater perception sensor.

Figure 6a shows a simplified model of the AUV. This AUV uses a six-thruster configuration to provide thrust forces when moving in the surge, sway, heave motions, or performing turns. Also, the position and velocities of the AUV are illustrated in Figure 6a. The general AUV motion in six DOFs is modeled by using the NED local coordinate system. AUV position and velocities are considered with the following vectors

$$\boldsymbol{\eta} = \left[ N, E, D, \phi, \theta, \psi \right]^{\top}, \quad \boldsymbol{\nu} = \left[ u, v, w, p, q, r \right]^{\top}, \tag{14}$$

where *N*, *E*, *D* denote the NED positions in Earth-fixed coordinates, *φ*, *θ*, *ψ* are the Euler angles, *u*, *v*, *w* are the body-fixed linear velocities, and *p*, *q*,*r* are the body-fixed angular velocities [5].

The design and modeling of the AUV should be studied using a theoretical six DOFs dynamic model [27]. However, due to the lack of instrumentation, it is not possible to obtain accurate navigation data. Thus, the AUV is not fully simulated, and only simple control commands are established for navigation. Once navigation data is available, the same approach as for the USV mathematical model can be used to obtain the six DOFs dynamic model, using the parameter estimation or SI tools based on field test data. Regarding the control of the AUV, the thrusters are located as shown in Figure 6b, where thrusters *T*<sub>1</sub>, *T*<sub>2</sub>, *T*<sub>3</sub>, and *T*<sub>4</sub> act on surge, sway, and yaw motions, and thrusters *T*<sub>5</sub> and *T*<sub>6</sub> act on heave and roll motions.

**Figure 6.** Six-thruster configuration in the AUV: (**a**) Simplified model of the considered vehicle using the NED coordinate system. (**b**) Thrust forces with their direction for each thruster.

#### **3. GNC System for the Co-Operative Tasks**

This study has the target detection and the guidance algorithms as main modules of the GNC architecture of the offshore multi-vehicle system. This section describes both of these algorithms for each platform and the description of the multi-vehicle guidance system.

#### *3.1. Target Detection System*

The mechanical imaging sonar installed at the AUV and the LiDAR at the USV are the primary perception sensors in the co-operative autonomous offshore system. The target detection algorithm includes the application in both perception sensors, depending on the position of the objects (underwater or over the water surface).

For the mechanical imaging sonar, the employed algorithm analyzes the acoustic intensity at every bin to determine the presence of an underwater target. The Tritech Micron sonar [26] has an operating frequency chirp centered on 700 kHz, a beamwidth of 35° vertical and 3° horizontal, a range from 0.3 to 75 m, a range resolution of approximately 7.5 mm, and a configurable mechanical resolution of 0.45°, 0.9°, 1.8°, and 3.6°. In this study, the sonar is configured with a maximum detection range of 10 m, a forward field-of-view (FoV) of 90°, and a mechanical resolution of 1.8°. If the target is known a priori to be narrow, the imaging sonar can be configured with a lower resolution to detect the object.

The data obtained from the mechanical imaging sonar contains the heading of the beam *θ*scan, the location of the specific point in Cartesian coordinates *P*scan, and the intensity at every bin *I*scan. The dynamic range of the mechanical imaging sonar is 80 dB. The dynamic range controls allow the position of a sampling window to be adjusted within the defined dynamic range of the received signal, and the intensity at every bin is translated to an integer value between 0 and 255.

After data acquisition from the mechanical imaging sonar, Algorithm 1 shows the post-processing steps for target detection. This algorithm includes the position of the highest intensity value for each bin in polar coordinates, filtering the data in the range of [0,1.5] meters to avoid possible noise from the AUV structure.

Algorithm 1 provides the post-processing of a single bin of a specific angle. An additional function builds the complete array of *n*scans scans from the sonar, where *n*scans follows from the *θ*scan,min, *θ*scan,max, and *θ*scan,increment parameters of the mechanical imaging sonar. After gathering the scan array, the position of the targets needs to be calculated. The data from the perception sensors is obtained in the body-fixed reference frame (BODY), and it requires a translation into an absolute coordinate system. This translation is defined by

$$
\begin{bmatrix} x\_{\rm obs} \\ y\_{\rm obs} \end{bmatrix} = \mathbf{R}\_{\mathbf{z}}(\psi\_{\rm AV}) \begin{bmatrix} x\_{\rm scan} \\ y\_{\rm scan} \end{bmatrix} \tag{15}
$$

where *Rz*(*ψ*AV) is the rotation matrix around the z-axis using the heading angle *ψ*AV of the selected AV. This rotation matrix translates between the BODY and the East-North-Up (ENU) coordinate system. The rotation matrix *Rz*(*ψ*AV) in 2D is defined by

$$\mathbf{R}\_z(\psi\_{\rm AV}) = \begin{bmatrix} \cos(\psi\_{\rm AV}) & \sin(\psi\_{\rm AV}) \\ -\sin(\psi\_{\rm AV}) & \cos(\psi\_{\rm AV}) \end{bmatrix} \tag{16}$$
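The rotation of Equations (15) and (16) can be written out explicitly as in the sketch below. The function name is hypothetical, and the heading is assumed to be given in radians.

```python
import math


def body_to_enu(x_scan, y_scan, psi_av):
    """Rotate a BODY-frame scan point with R_z(psi_AV) from Equation (16)."""
    c, s = math.cos(psi_av), math.sin(psi_av)
    x_obs = c * x_scan + s * y_scan
    y_obs = -s * x_scan + c * y_scan
    return x_obs, y_obs


# A point 1 m ahead of the vehicle, for a heading of 90 degrees.
x_obs, y_obs = body_to_enu(1.0, 0.0, math.pi / 2.0)
```

For a zero heading the matrix reduces to the identity, so the scan point is returned unchanged.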

**Algorithm 1:** Post-processing of the mechanical imaging sonar data for target detection.

**Input :** Intensities *I*scan, positions *P*scan in Cartesian coordinates [*X*, *Y*], and current heading *θ*scan value obtained from the mechanical imaging sonar.

**Output :** Position *micron* of the highest intensity value in polar coordinates.

/\* Remove data in the range from 0 to 1.5 m to avoid possible noise from the AUV structure. *n*scan is equal to the number of scans. \*/
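Based only on the inputs, output, and filtering comment of Algorithm 1, the post-processing of one beam might look like the following Python sketch. It is an interpretation under stated assumptions, not the authors' implementation: the function name, the data layout (one intensity and one Cartesian position per bin), and the `None` return for an empty beam are all hypothetical.

```python
import math

MIN_RANGE_M = 1.5  # returns closer than this are treated as AUV structure noise


def strongest_return(intensities, positions, theta_scan, min_range=MIN_RANGE_M):
    """Return (range, heading) of the highest-intensity bin beyond min_range.

    intensities: one value (0 to 255) per bin along the beam.
    positions:   matching (x, y) Cartesian points in the BODY frame.
    Returns None when no bin lies beyond min_range (assumed behavior).
    """
    best = None
    for intensity, (x, y) in zip(intensities, positions):
        rng = math.hypot(x, y)
        if rng <= min_range:
            continue  # discard near-field returns
        if best is None or intensity > best[0]:
            best = (intensity, rng)
    if best is None:
        return None
    return best[1], theta_scan


detection = strongest_return([250, 10, 90, 40],
                             [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)],
                             theta_scan=0.1)
```

In this example the strongest bin (intensity 250) is ignored because it lies within 1.5 m, so the detection is placed at 3 m.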

After locating the obstacle by the mechanical imaging sonar in the ENU coordinate system, the target's origin position (*N*o, *E*o) is defined by

$$
\begin{bmatrix} N\_{\rm o} \\ E\_{\rm o} \end{bmatrix} = \begin{bmatrix} N\_{\rm AV} \\ E\_{\rm AV} \end{bmatrix} + \mathbf{R}\_{\rm x}(\gamma) \begin{bmatrix} \frac{\mathbf{x}\_{\rm obs,init} + \mathbf{x}\_{\rm obs,end}}{2} \\ \frac{\mathbf{y}\_{\rm obs,init} + \mathbf{y}\_{\rm obs,end}}{2} \end{bmatrix} \tag{17}
$$

where *Rx*(*γ*) is the rotation matrix around the x-axis with *γ* = *π* rad. This matrix is used to translate from the ENU to the NED coordinate system used for the offshore navigation. The *Rx*(*γ*) rotation matrix in 2D is defined by

$$\mathbf{R}\_{\mathbf{x}}(\gamma) = \begin{bmatrix} 1 & 0\\ 0 & \cos(\gamma) \end{bmatrix}. \tag{18}$$

Algorithm 2 includes the detected target localization for the perception sensor data array. This algorithm distinguishes between different targets depending on the consecutive elements in the data array, and the origin position of the targets is sent to the GNC algorithm to proceed with the autonomous navigation of the offshore system.

#### **Algorithm 2:** Localization of the detected targets.

**Input :** *scan* data array in Cartesian coordinates and *RobotPose* (position and orientation).

**Output :** Obstacle origin [*N*o, *E*o] calculated in absolute NED coordinates.

```
 5 create vector to distinguish between different obstacles;
 9 for i = 1 to nobs do
10     calculate the obstacle origin [No(i), Eo(i)] in NED according to (17);
11 end
14 end
```
Figure 7 shows the steps from the scan data obtained from the mechanical imaging sonar in the BODY reference frame to the final origin position of the detected targets. Figure 7a shows the raw data from the mechanical imaging sonar. Then, Figure 7b shows the post-processing described in Algorithm 1. Finally, Figure 7c,d represent the origin position of the targets in the NED coordinate system, relative to the origin [0,0] and in absolute coordinates, respectively.

**Figure 7.** Post-processing of the mechanical imaging sonar data in the target detection algorithm: (**a**) Scan data acquired from sonar. (**b**) Post-processing based on Algorithm 1. (**c**) Relative position in NED with origin as [0,0] and calculation of target's origin. (**d**) Absolute position in NED of the targets.
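The localization step of Algorithm 2 and Equations (17) and (18) can be sketched as follows. Since only part of the algorithm is described, the grouping rule used here (consecutive beams with a detection belong to the same target) is an assumption, as are the function name and the use of `None` for beams without a detection.

```python
def locate_targets(scan, n_av, e_av):
    """Group consecutive detections and compute each target origin in NED.

    scan: one entry per beam, either (x_obs, y_obs) after Equation (15)
          or None when the beam holds no detection.
    n_av, e_av: absolute NED position of the vehicle.
    """
    groups, current = [], []
    for point in scan:
        if point is not None:
            current.append(point)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)

    origins = []
    for group in groups:
        (x_init, y_init), (x_end, y_end) = group[0], group[-1]
        x_mid = 0.5 * (x_init + x_end)
        y_mid = 0.5 * (y_init + y_end)
        # R_x(pi) in 2D is diag(1, cos(pi)) = diag(1, -1), Equation (18).
        origins.append((n_av + x_mid, e_av - y_mid))
    return origins


scan = [None, (2.0, 1.0), (2.0, 3.0), None, None, (5.0, -1.0), None]
targets = locate_targets(scan, n_av=10.0, e_av=20.0)
```

Here the scan contains two separate runs of detections, so two target origins are returned, each taken as the midpoint between the first and last point of its run.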

Regarding the USV platform, the SICK MRS1000 LiDAR [28] is the primary perception sensor. This LiDAR has four spread-out scan planes and multi-echo analysis for use in harsh environments, as it can filter out the noise produced by fog, rain, or dust. Also, this device has a 275° aperture angle and a working range from 0.2 to 64 m. Thus, if the target is above the water surface, it can be detected by the LiDAR sensor.

The algorithm for target detection is similar to the one described for the mechanical imaging sonar. The only difference is that the LiDAR contains four spread-out scan planes, acquiring three-dimensional (3D) scan data (see Figure 8a). The target detection algorithm is simplified by translating the received data to 2D, discarding the z-axis of the sensor data (see Figure 8b). Figure 8c shows the maximum detection range and aperture angle with the scan data in the BODY reference frame. Finally, Figure 8d shows the origin position of the targets in the NED coordinate system after applying Algorithm 2.

**Figure 8.** Post-processing of the LiDAR in the target detection algorithm: (**a**) LiDAR scan data in 3D. (**b**) LiDAR scan data in 2D. (**c**) Detection area of the USV in BODY including scan data. (**d**) Absolute position in NED of the targets.

The same procedure detects obstacles from the LiDAR for the path-following with the obstacle avoidance algorithm. After obtaining the origin position [*N*o, *E*o] from Algorithm 2, the obstacle avoidance algorithm can define a safety boundary box around the obstacle [10].

#### *3.2. Guidance System for Multi-Vehicle System*

The multi-vehicle system first aims to detect a target using the AUV in a specific offshore area and then sends the location to the USV for further exploration of the target. Thus, a path-following algorithm is essential for both the AUV and USV subsystems. This algorithm aims to reach every waypoint of a specific path independently of time. A commonly used method for path-following is the so-called LOS guidance, which is chosen as a reference trajectory in this study.

#### 3.2.1. AUV Guidance System

The heading control can use a LOS vector from the AUV position to the next waypoint, similar to [5]. The LOS path-following controller used in this study is the same as the one defined in [10]. However, the AUV movement includes a heave motion, which is avoided by keeping a constant depth for the path-following algorithm. This controller computes the course angle *ψ*<sup>d</sup> based on the path-tangential angle *χ*<sup>p</sup> and the velocity-path relative angle *χ*r. The lookahead-based steering can be implemented related to the heading controller applying the transformation defined as

$$
\psi\_{\rm d} = \chi\_{\rm p} + \chi\_{\rm r} - \beta, \tag{19}
$$

where the sideslip (drift) angle *β* [5] has been omitted in this study to simplify the steering law. The velocity-path relative angle *χ*<sub>r</sub> ensures that the velocity is directed toward a point on the path located a lookahead distance ∆(*t*) > 0 ahead of the direct projection [29]. The path-tangential angle *χ*<sub>p</sub> and the velocity-path relative angle *χ*<sub>r</sub> are defined as

$$\chi\_{\rm p} = \operatorname{atan2}(E\_{k+1} - E\_{k}, N\_{k+1} - N\_k), \tag{20}$$

$$\chi\_{\rm r}(e) = \arctan\left(-K\_{\rm P} e - K\_{\rm I} \int\_{0}^{t} e(\tau)d\tau\right),\tag{21}$$

where (*N*<sub>k</sub>, *E*<sub>k</sub>) and (*N*<sub>k+1</sub>, *E*<sub>k+1</sub>) are the positions of the passed and next waypoint, respectively, the proportional gain is *K*<sub>P</sub> = 1/∆(*t*) > 0, and *K*<sub>I</sub> > 0 represents the integral gain. The cross-track error *e*(*t*) is given by

$$e(t) = -[N\_{\rm AUV}(t) - N\_{k}]\sin(\chi\_{\rm p}) + [E\_{\rm AUV}(t) - E\_{k}]\cos(\chi\_{\rm p}).\tag{22}$$
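Equations (19)–(22) can be combined into a single function, as in the sketch below. It is a minimal illustration rather than the controller of the study: the sideslip angle is omitted as stated in the text, the integral of the cross-track error is assumed to be accumulated by the caller and passed in as `e_int`, and the function name and argument layout are hypothetical.

```python
import math


def los_course(n, e, wp_prev, wp_next, lookahead, k_i=0.0, e_int=0.0):
    """Return (desired course angle, cross-track error) for one time step.

    (n, e):    current vehicle position in NED.
    wp_prev:   passed waypoint (N_k, E_k); wp_next: next waypoint.
    lookahead: lookahead distance, so that K_P = 1 / lookahead.
    e_int:     integral of the cross-track error, supplied by the caller.
    """
    n_k, e_k = wp_prev
    n_k1, e_k1 = wp_next
    chi_p = math.atan2(e_k1 - e_k, n_k1 - n_k)            # Equation (20)
    cross_track = (-(n - n_k) * math.sin(chi_p)
                   + (e - e_k) * math.cos(chi_p))         # Equation (22)
    k_p = 1.0 / lookahead
    chi_r = math.atan(-k_p * cross_track - k_i * e_int)   # Equation (21)
    return chi_p + chi_r, cross_track                     # Equation (19), beta omitted


# Path heading north; the vehicle is 5 m east of it with a 5 m lookahead.
psi_d, cte = los_course(10.0, 5.0, (0.0, 0.0), (100.0, 0.0), lookahead=5.0)
```

In this example the vehicle is commanded to a course of −45°, steering it back west toward the path.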

The switching mechanism is declared as a sphere of acceptance for AUVs [30]. This mechanism selects the next waypoint as a lookahead point if the AUV position lies within a sphere with a radius *R* around the position (*N*<sub>k+1</sub>, *E*<sub>k+1</sub>, *D*<sub>k+1</sub>). The sphere of acceptance is defined as

$$\left[N\_{k+1} - N(t)\right]^2 + \left[E\_{k+1} - E(t)\right]^2 + \left[D\_{k+1} - D(t)\right]^2 \le R\_{k+1}^2, \tag{23}$$

where, if the AUV position (*N*(*t*), *E*(*t*), *D*(*t*)) at time *t* satisfies Equation (23), the next waypoint (*N*<sub>k+1</sub>, *E*<sub>k+1</sub>, *D*<sub>k+1</sub>) needs to be selected. The radius *R* is equal to three AUV lengths *L*<sub>AUV</sub> (*R* = 3*L*<sub>AUV</sub>), as the position is only obtained from the USBL system.
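The acceptance test of Equation (23) is short enough to sketch directly. The helper names and the index-based waypoint bookkeeping below are assumptions made for illustration; because the check works on coordinate tuples of any length, the same function also covers a planar circle of acceptance.

```python
def reached_waypoint(position, waypoint, radius):
    """Check Equation (23): is the position inside the acceptance sphere?"""
    squared_distance = sum((w - p) ** 2 for p, w in zip(position, waypoint))
    return squared_distance <= radius ** 2


def next_waypoint_index(position, waypoints, index, vehicle_length, lengths=3.0):
    """Advance to the following waypoint once the current one is reached.

    lengths=3.0 reflects R = 3 L_AUV; the last waypoint is never skipped.
    """
    radius = lengths * vehicle_length
    if index < len(waypoints) - 1 and reached_waypoint(position, waypoints[index], radius):
        return index + 1
    return index


# Hypothetical path at a constant depth of 2 m and a 1 m long vehicle.
path = [(0.0, 0.0, 2.0), (20.0, 0.0, 2.0), (40.0, 0.0, 2.0)]
idx = next_waypoint_index((18.0, 1.0, 2.0), path, index=1, vehicle_length=1.0)
```

A radius of several vehicle lengths tolerates the coarse USBL position fixes, at the cost of cutting corners near each waypoint.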

After the LOS path-following algorithm obtains the course angle, it sends the heading commands to the yaw controller to follow the desired path. The main control system of the AUV is formed by three separate PID controllers for surge, heave, and yaw motions. Apart from the heading controller, the heave controller keeps the AUV at a constant depth. The PID parameters of the heading controller are obtained by rapid control prototyping based on Ziegler-Nichols PID tuning [31] during field tests. Both the amplitude *K*zn and the period *T*zn are determined for the AUV in the water tank, and the PID parameters are then defined based on Table 5. Furthermore, a simple proportional controller has been selected for the heave controller. The surge motion is implemented as a constant PWM value sent to the thrusters.


**Table 5.** PID parameters for AUV.

#### 3.2.2. USV Guidance System

Same as the AUV guidance system, USV heading control uses a LOS vector from the USV position to the next waypoint. The LOS path-following controller used in this study is the same as the one defined in [10], including the obstacle avoidance capabilities with the safety boundary box approach. The LOS path-following controller of the USV uses the same path-tangential angle *χ*<sup>p</sup> defined in Equation (20), the velocity-path relative angle defined in Equation (21), and the total lookahead-based steering from Equation (19). The switching mechanism is selected as a circle of acceptance for surface vehicles [5]. It selects the next waypoint as a lookahead point if the position of the USV lies within a circle with radius *R* around (*Nk*+<sup>1</sup> , *Ek*+<sup>1</sup> ). This circle of acceptance is defined as

$$\left[N_{\text{USV}}(t) - N_{k+1}\right]^2 + \left[E_{\text{USV}}(t) - E_{k+1}\right]^2 \le R_{k+1}^2, \tag{24}$$

where, if the surface vehicle position at time *t*, (*N*<sub>USV</sub>(*t*), *E*<sub>USV</sub>(*t*)), satisfies Equation (24), the next waypoint (*N*<sub>*k*+1</sub>, *E*<sub>*k*+1</sub>) is selected. The radius *R* is set to two USV lengths *L*<sub>USV</sub> (*R* = 2*L*<sub>USV</sub>).
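The switching rule of Equation (24) can be sketched in a few lines. The function names, the route, and the USV length used below are hypothetical and only serve to illustrate the test; positions are (north, east) pairs in metres.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (north, east) in metres


def in_circle_of_acceptance(pos: Point, waypoint: Point, radius: float) -> bool:
    """Equation (24): squared horizontal distance to the waypoint <= R^2."""
    dn = pos[0] - waypoint[0]
    de = pos[1] - waypoint[1]
    return dn * dn + de * de <= radius * radius


def update_active_waypoint(pos: Point, waypoints: List[Point],
                           active: int, l_usv: float) -> int:
    """Return the index of the waypoint to aim at after the switching test."""
    radius = 2.0 * l_usv  # R = 2 L_USV, as stated in the text
    is_last = active == len(waypoints) - 1
    if not is_last and in_circle_of_acceptance(pos, waypoints[active], radius):
        return active + 1
    return active


# Hypothetical route and vessel length (3 m), so R = 6 m.
route = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
switched = update_active_waypoint((5.0, 0.0), route, active=1, l_usv=3.0)
held = update_active_waypoint((3.0, 0.0), route, active=1, l_usv=3.0)
```

In this example the vehicle is 5 m from the active waypoint in the first call, which is inside the 6 m circle, so the next waypoint is selected; at 7 m the active waypoint is kept.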

#### 3.2.3. Multi-Vehicle Guidance System

At the beginning of the control scenario, the USV keeps its position in dynamic positioning (DP) mode while the AUV searches for targets in the coverage area. A DP vessel is a vessel that maintains its position exclusively using active thrusters [24]. This study uses conventional controllers in cascade with low-pass and notch filters to simplify the implementation. The control problem is solved by using PID controllers for surge, sway, and yaw motions.

The AUV in this study aims to detect a target in a specific offshore area. The coverage area is defined as a set of waypoints that spans the region of interest. However, this coverage area has been replaced by a straight path to simplify the control scenario. Once the target detection system detects the object, it sends a stop command to the AUV, and the vehicle stays in its position until it receives further instructions from the USV. As the AUV does not carry enough instrumentation for precise localization, it stops its thrusters instead of using DP control to hold its final position. Additionally, if the target detection algorithm does not recognize any target in the coverage area, the AUV stops after reaching the last waypoint of the predefined path.

After the USV receives the target position, the path-following algorithm creates waypoints along a straight-line trajectory. The first waypoint matches the current position of the USV at the time that the target position is received, and the last waypoint is the target position itself. With a constant distance of 10 m between waypoints, the number of waypoints depends on the length of the straight-line path. These waypoints are sent to the LOS path-following algorithm to calculate the course angle of the USV. Furthermore, an additional switching mechanism is included using the same principle as the circle of acceptance defined in (24) to stop the LOS path-following controller once the USV has reached the last waypoint of the predefined path. Then, the guidance system does not send any heading or surge commands to the controllers, and there is no output from the target detection algorithm. In this case, the USV switches to its internal DP algorithm to keep its position constant.
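The waypoint generation described above can be sketched as follows. This is a minimal illustration, assuming positions are (north, east) pairs in metres and that the final segment may be shorter than 10 m when the path length is not a multiple of the spacing; the function name is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (north, east) in metres


def straight_line_waypoints(start: Point, target: Point,
                            spacing: float = 10.0) -> List[Point]:
    """Waypoints from the USV position to the target, `spacing` metres apart."""
    dn = target[0] - start[0]
    de = target[1] - start[1]
    length = math.hypot(dn, de)
    if length == 0.0:
        return [(float(start[0]), float(start[1]))]
    n_full = int(length // spacing)  # number of full-length segments
    waypoints = [
        (start[0] + dn * i * spacing / length,
         start[1] + de * i * spacing / length)
        for i in range(n_full + 1)
    ]
    last = waypoints[-1]
    if math.hypot(target[0] - last[0], target[1] - last[1]) > 1e-9:
        # Path length is not a multiple of the spacing: add the target itself.
        waypoints.append((float(target[0]), float(target[1])))
    else:
        # Snap the final point onto the target to avoid rounding drift.
        waypoints[-1] = (float(target[0]), float(target[1]))
    return waypoints


path_a = straight_line_waypoints((0.0, 0.0), (30.0, 40.0))  # 50 m path
path_b = straight_line_waypoints((0.0, 0.0), (0.0, 25.0))   # 25 m path
```

The first waypoint is always the USV position and the last is always the target, so the number of waypoints grows with the length of the straight-line path.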

Figure 9 shows the priority control level for the multi-vehicle guidance system. First, the AUV starts the path-following of the coverage area based on predefined waypoints. The vehicle continues to the next waypoint until the mechanical imaging sonar detects a target. Then, the AUV stops its operation, and the target position is transmitted to the USV. The USV keeps its position in DP mode and, when the target position is received, it starts the path-following with obstacle avoidance operation with the target position as the final waypoint of the USV trajectory. After reaching the last waypoint, the USV stops and uses the DP mode to keep its position, allowing the GCS to inspect the detected target further. Additionally, the steering wheel and 3-axis joystick, which together form the manual control of the USV, provide a safety feature for the autonomous algorithm.

**Figure 9.** Stateflow diagram for priority control level in the multi-vehicle guidance system. The target detection algorithm at the AUV enables the autonomous operation of the USV.
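The priority logic described for Figure 9 can be summarised as two small mode-selection functions. This is only a sketch: the mode names and function signatures are hypothetical, and it assumes that manual input always overrides the autonomous modes, which is one reading of the safety feature described in the text.

```python
from enum import Enum, auto


class AUVMode(Enum):
    PATH_FOLLOWING = auto()
    STOPPED = auto()


class USVMode(Enum):
    MANUAL = auto()
    DYNAMIC_POSITIONING = auto()
    PATH_FOLLOWING = auto()


def auv_mode(target_detected: bool, last_waypoint_reached: bool) -> AUVMode:
    """The AUV stops on detection, or at the end of the predefined path."""
    if target_detected or last_waypoint_reached:
        return AUVMode.STOPPED
    return AUVMode.PATH_FOLLOWING


def usv_mode(manual_input: bool, target_received: bool,
             last_waypoint_reached: bool) -> USVMode:
    """Manual control has the highest priority (assumption), then autonomy."""
    if manual_input:
        return USVMode.MANUAL
    if target_received and not last_waypoint_reached:
        return USVMode.PATH_FOLLOWING
    # Before a target arrives, and after the last waypoint, the USV holds position.
    return USVMode.DYNAMIC_POSITIONING
```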

#### **4. Experimental Validation**

#### *4.1. System Implementation*

For this particular study, the USV and AUV platforms incorporate multiple mechatronic systems to implement the target detection algorithm. Both vehicles include high-level control (computers with ROS), which performs complex computations and processes the data obtained from localization and perception sensors, and low-level control (sensor and actuator units), which acts as the basic interface for vehicle operations. An intermediate-level (or mid-level) control is also included, which is the main link between low-level data acquisition and high-level logic operations.

Figure 10 shows the mechatronic systems used in the USV, including the connection to the AUV and to the external MATLAB-Simulink computer through the main network switch. These devices are the link to the co-operative autonomous offshore system. In general, the USV platform is equipped with a payload for navigation (high-precision GPS-Compass), a LiDAR as the main perception sensor, a SeaTrac acoustic system for USBL localization and communication with the AUV, and WiFi for communication with the GCS. The USV system implementation is the same as the one studied in [10]. For the high-level control, the ROS master includes the necessary stand-alone ROS nodes for the path-following with obstacle avoidance. The display computers act as intermediate-level control for translation between CAN bus and ROS messages. They are also in charge of sending joystick commands to the waterjet control units based upon priority levels.

**Figure 10.** System overview of the USV platform with high-level (blue boxes), intermediate-level (white boxes), and low-level control (purple boxes), including the connection to the AUV platform (adapted from [10]).

Figure 11 shows the mechatronic systems used in the AUV platform. The AUV is connected to the USV via a neutrally buoyant tether to have a direct connection between the vehicles. Similarly to the USV platform, the AUV contains high-level control with the ROS computer and an intermediate-level control as a bridge between the main ROS computer and the companion computer, which communicates using the MAVLink protocol. The low-level control includes actuators and sensors, formed by six thrusters and their respective electronic speed controllers (ESCs), a pressure sensor for depth measurements, a mechanical imaging sonar as the perception sensor, and the USBL SeaTrac acoustic system for positioning and communication. Finally, the AUV includes a companion computer with the flight controller and the ROS computer (Linux computer) connected to a network switch. The ROS computer performs the complex computations for autonomous operation and target detection.

**Figure 11.** System overview of the AUV: High-level (Robot Operating System (ROS) computers), intermediate-level (companion computer and Pixhawk flight controller), and low-level control (thrusters, ultra-short baseline (USBL), pressure sensor, and mechanical imaging sonar).

The approach used in this study for the multi-robot architecture is multimaster-fkie, which provides simplicity and ROS compatibility [21]. This package is a fully compatible multi-master implementation for topic and services transactions. Nevertheless, this implementation can cause some drawbacks due to the continuous master state scanning and the delay between changes in advertising, as well as information exchange. As this study requires a total of three ROS topics, this package is useful as an easy plug-and-play solution.

Figure 12 illustrates the communication between the USV and AUV platforms, including the nodes for the multimaster-fkie architecture. The exchanged topics are /*target*, which is the position of the detected target, /*usv*\_*gps* obtained from the USV GPS-compass and used to get the absolute Cartesian coordinates of the AUV position, and /*usv*\_*heading* which rotates the USBL coordinate system according to the heading of the USV. The diagram also includes the links between the high-level, mid-level, and low-level control in both platforms.

**Figure 12.** Communication of the autonomous offshore system based on the multimaster-fkie architecture. Each vehicle shows the internal connection between the sensors and actuators with the rest of the system.

#### *4.2. Modular System for Multi-Sensor Technology*

The target detection algorithm uses a modular approach to include target detection from each perception sensor, path-following, and guidance control from both USV and AUV platforms. Each of these modules runs a separate ROS node in the autonomous offshore system. This approach has been previously studied and successfully implemented in [10,32]. However, the algorithms of the mentioned studies did not include co-operative capabilities between multiple autonomous vehicles.

Figure 13 illustrates the modular architecture with all topics involved, defining the subscribers and publishers of each topic. The only difference between the two vehicles is the path-following model at the USV for obstacle avoidance, which is in charge of modifying the USV trajectory using the safety boundary box approach.

The GPS-Compass obtains the absolute position of the USV in global coordinates, while the USBL collects the position of the AUV in the BODY reference frame of the USBL. The ROS topic /*odometry* in the AUV is based on the low-level serial messages accepted and generated by the SeaTrac USBL beacons [16]. These serial messages are ASCII-Hex characters of the message string, which are decoded into an array of bytes representing their values. The ROS topic is generated using the Serial package [33], which translates the RS232 messages to a ROS topic array. After that, PING messages are sent from the main USBL #1 beacon located at the USV, and the response from the AUV (USBL #2) produces the necessary serial messages containing the AUV position in the BODY USBL coordinate system. Finally, the change from this reference frame to the NED coordinate system is defined by the combination of a translation and a rotation matrix. These matrices use the initial heading of the USBL and the /*heading* and /*gps* variables from the GPS-Compass.
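The change of reference frame described above can be sketched as a planar rotation followed by a translation. The sketch below is an illustration under several assumptions that are not stated in the text: the USBL BODY frame has *x* forward, *y* to starboard and *z* down, roll and pitch of the USV are neglected, and the GPS position has already been converted to local NED metres. The function and parameter names are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def usbl_body_to_ned(p_body: Vec3, usv_ned: Vec3, heading_rad: float,
                     usbl_offset_rad: float = 0.0) -> Vec3:
    """Rotate a USBL BODY-frame fix by the USV heading, then translate it.

    p_body          -- AUV position (x, y, z) in the USBL BODY frame [m]
    usv_ned         -- USV position (N, E, D) from the GPS-Compass [m]
    heading_rad     -- USV heading, clockwise from north [rad]
    usbl_offset_rad -- fixed mounting/initial heading offset of the USBL [rad]
    """
    psi = heading_rad + usbl_offset_rad
    c, s = math.cos(psi), math.sin(psi)
    x, y, z = p_body
    north = usv_ned[0] + c * x - s * y  # rotation about the down axis
    east = usv_ned[1] + s * x + c * y
    down = usv_ned[2] + z               # depth is unaffected by yaw
    return (north, east, down)


# Hypothetical example: USV heading due east, AUV 10 m ahead and 2 m deep.
auv_ned = usbl_body_to_ned((10.0, 0.0, 2.0), (100.0, 50.0, 0.0), math.pi / 2)
```

With the USV heading due east, a fix 10 m ahead of the USBL maps to a point 10 m east of the USV, as expected.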

**Figure 13.** Schematic of the modular multi-vehicle guidance system with target detection. All different modules from USV and autonomous underwater vehicle (AUV) were included. ROS topics /*gps*, /*heading* and /*target* (purple connectors) are the exchange topics in the control scenario.

The predefined path for the AUV is defined as the ROS topic /*path*\_*coverage*, which includes the waypoints for the GNC algorithm in the control module. The GNC guidance algorithm generates the required AUV heading command, sending this parameter to the AUV controller. The controller generates the required inputs /*rc*\_*channel*3, /*rc*\_*channel*4, /*rc*\_*channel*5, and sends them to the companion computer for surge, heave, and yaw motions, respectively, based on the BlueRov-ROS-playground ROS package [34].

Regarding the USV, the exchanged ROS topic /*target* contains the target's origin position. Thus, once this topic is received in the path-following model, it defines the necessary waypoints to perform the autonomous mission. These waypoints are sent to the GNC model, where the LOS-algorithm calculates the required course angle for the controller. Finally, the controller generates the required joystick commands for surge /*Joy<sup>u</sup>* and yaw /*Joy<sup>r</sup>* to reach the LOS values. These joystick commands are sent to the low-level control (display computers) to perform the autonomous USV operation, using the same outputs as a manual three-axis joystick.

#### *4.3. Experimental Results*

The control scenario for this study includes target detection, path-planning, and guidance control in both offshore vehicles. However, even though the modular ROS architecture provides a computationally cheap and easy implementation in both offshore platforms, the operation of both platforms in an offshore scenario depends highly on environmental elements such as wind or wave drift forces. As the guidance control relies on simple PID controllers that do not compensate for these environmental elements, it is highly challenging to gather useful field test data from the complete offshore system. Thus, the experimental results of this study are shown in a modular way, testing each of the subsystems separately to validate the target detection algorithm using multi-sensor technology. Figure 14a illustrates the location of the AUV and USV field tests at the Pyhäjärvi lake in Tampere, Finland. The water-flow direction from a hydro-power plant is also indicated to show the environmental drift forces. Figure 14b shows the setup for the AUV path-following test, where the USV stays stationary at the harbor. The USV field test is carried out in an obstacle-free area of the lake.

**Figure 14.** Control scenario: (**a**) Location of the AUV (red arrow) and USV (green arrow) field tests at the Pyhäjärvi lake in Tampere, Finland, being affected by the water-flow (blue arrow) from a hydro-power plant. (**b**) AUV and USV platforms at the harbor during the AUV field tests.

The first step in the target detection algorithm is the AUV path-following. This module is tested at the harbor with a set of three waypoints defined in the NED coordinate system. The surge motion has a constant PWM value to the thrusters, and the yaw and heave motions are implemented using separate PID controllers. The LOS-based guidance system calculates the necessary course angle to reach every waypoint of the predefined path. Figure 15 shows the AUV trajectory using the USBL data for navigation, where the AUV initial position and orientation are chosen at random. The AUV drifts slightly to the left of the desired path due to the environmental drift forces. As shown in Figure 14a, the field tests were carried out in an estuary area of a narrow and shallow lake, where the flow from a hydro-power plant has a considerable effect. These flow conditions vary depending on the river discharge rate. At the time of testing, the river discharge was 38 m<sup>3</sup>/s towards the south, and the wind speed was 6 m/s with a southwest wind direction.

**Figure 15.** AUV Control scenario: AUV trajectory for the path-following algorithm.

Figure 16a shows the comparison between the input control values for the yaw angle and the field test data, and Figure 16b displays the same comparison for heave motion. In this case, the multi-vehicle system contributes to the GPS-Compass data at the USV, providing the ROS topics /*gps* and /*heading* to the USBL acoustic system for positioning.

During the implementation of the GNC model, the target detection algorithm processes the mechanical imaging sonar data to detect and locate any possible obstacle around the AUV. Figure 7 illustrates the adequate performance of this module, where a static obstacle (buoy) is detected and located in absolute NED coordinates.

**Figure 16.** AUV Control scenario: (**a**) Comparison of heading angle from the LOS guidance system with field-test data. (**b**) Comparison of the constant depth of 1.5 m with field-test data.

Once the AUV detects and locates the target, it sends the target's position to the USV platform via the multimaster-fkie architecture. The last control scenario in the experimental results demonstrates the co-operative autonomous offshore system using the USV path-following with obstacle avoidance. The USV main computer receives the /*target* ROS topic from the AUV main computer. Then, the GNC model provides the necessary surge and yaw motions to reach the target's position based on the LiDAR and path-following models. Figure 17 shows the USV trajectory once the path has been defined according to the ROS topic /*target*. Additionally, Figure 18 shows the comparison between the LOS guidance system and the field test data for yaw motion, and Figure 19 shows the corresponding LOS cross-track error *e*(*t*), which demonstrates the correct performance of the guidance control, even though environmental variables are not considered in this study. During the USV field tests, the river discharge was 30 m<sup>3</sup>/s towards the south, and the wind speed was 3.7 m/s with a south-southwest wind direction.

The experimental results of this study indicate the correct performance of the target detection algorithm using multi-sensor technology. These results are presented in a modular way, and they show the appropriate implementation of each module, including target detection, path-following, and guidance control. The path-following algorithms in the AUV and USV platforms include some error due to environmental variables, such as wind and wave drift forces. These variables need to be considered to increase the accuracy of the system, and their effect can be reduced by improving the GNC controllers. Furthermore, the AUV navigation relies only on the USBL beacons for positioning, which cannot locate the vehicle precisely underwater. Improving the navigation system will therefore enhance the performance of the path-following algorithm.

**Figure 17.** USV Control scenario: USV trajectory for the path-following algorithm, where the last waypoint is equal to the ROS topic /*target*.

**Figure 18.** USV Control scenario: Comparison of heading angle from the LOS guidance system with field-test data. After reaching the /*target* position, the yaw angle is equal to the constant velocity-path relative angle *χ*<sub>r</sub> for DP mode.

**Figure 19.** USV Control scenario: LOS cross-track error *e*(*t*) for the lookahead-based steering law defined in (22). This error is produced by the environmental variables, as the drift angle *β* is not included in the LOS-based guidance control.
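For readers who want to reproduce the cross-track error and the desired course from logged positions, the sketch below uses the standard lookahead-based LOS formulation for straight-line segments. It is written under the assumption that the steering law of this study follows that standard form; the function name and the lookahead distance are hypothetical, and the drift angle *β* is not included, in line with the caption above.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (north, east) in metres


def los_guidance(pos: Point, wp_prev: Point, wp_next: Point,
                 lookahead: float) -> Tuple[float, float]:
    """Return (desired course [rad], cross-track error e [m])."""
    # Path-tangential angle of the segment wp_prev -> wp_next.
    chi_p = math.atan2(wp_next[1] - wp_prev[1], wp_next[0] - wp_prev[0])
    # Cross-track error: lateral offset of the vehicle from the segment,
    # positive when the vehicle is to the right of the path direction.
    e = (-(pos[0] - wp_prev[0]) * math.sin(chi_p)
         + (pos[1] - wp_prev[1]) * math.cos(chi_p))
    # Velocity-path relative angle steering towards a point `lookahead` ahead.
    chi_r = math.atan(-e / lookahead)
    return chi_p + chi_r, e


# Hypothetical example: path due north, vehicle 5 m to the east of it.
course, cross_track = los_guidance((10.0, 5.0), (0.0, 0.0), (100.0, 0.0),
                                   lookahead=20.0)
```

In the example the vehicle is to the right of a northbound path, so the cross-track error is positive and the desired course is slightly west of north, steering the vehicle back towards the path.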

#### **5. Conclusions and Future Work**

This article was concerned with the target detection using multi-sensor technology in a co-operative autonomous offshore system. The offshore system had a USV and an AUV, and the fundamental purpose of the algorithm was to detect an underwater target in a preplanned coverage area. The mathematical model of the USV, including also the waterjet propulsion system model, was presented to verify the designed GNC architecture. This model included parameter estimation methods to obtain the dynamic coefficients using field test data for both surge and yaw motions. This study developed a basic target detection algorithm for any offshore perception sensors, showing the results for a mechanical imaging sonar at the AUV and a LiDAR at the USV. The guidance system included the LOS model for path-following on both platforms. After designing the GNC architecture, both vehicles incorporated a system implementation of the modular approach with high, intermediate, and low-level controls. The experimental results showed a field test control scenario that presents the capabilities and adequate performance of the target detection algorithm.

Future work will include an accurate mathematical model of the AUV for simulation, which requires the complete navigation data (position, velocity, and acceleration feedback) from the vehicle. Additionally, coverage path planning can replace the straight-line trajectory of this study, enlarging the coverage area and increasing the capabilities of the system. The AUV scenario will include decision making in the presence of several obstacles, and further navigational sensors (e.g., a DVL) will be installed for more precise localization of the AUV. Finally, future work will also add further platforms to the system, such as other USVs or AUVs, or even a UAV, which would extend the capabilities of the system to the air.

**Author Contributions:** J.V. conceptualized and designed the methodology, developed the software and validation of the model, performed the experiments, analyzed the data, and wrote the paper; S.V. performed the experiments; J.A. and K.T.K. supervised the study and made the writing—review and editing of this paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research is based on the Autonomous and Collaborative Offshore Robotics (aColor) project, funded by the Technology Industries of Finland Centennial and Jane & Aatos Erkko Foundations.

**Acknowledgments:** The authors would like to thank the contributions from Alamarin-Jet Oy for facilitating their research surface vehicle as platform in this study.

**Conflicts of Interest:** The authors declare no conflict of interest.

*Remote Sens.* **2020**, *12*, 4106

## **Abbreviations**

The following abbreviations are used in this manuscript:



#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Technical Note* **Inversion of Phytoplankton Pigment Vertical Profiles from Satellite Data Using Machine Learning**

**Agathe Puissant <sup>1</sup>, Roy El Hourany <sup>2,</sup>\*, Anastase Alexandre Charantonis <sup>1,3</sup>, Chris Bowler <sup>2</sup> and Sylvie Thiria <sup>1,4</sup>**


**Abstract:** Observing the vertical dynamics of phytoplankton in the water column is essential to understand the evolution of the ocean primary productivity under climate change and the efficiency of the CO<sub>2</sub> biological pump. This is usually done through in-situ measurements. In this paper, we propose a machine learning methodology to infer the vertical distribution of phytoplankton pigments from surface satellite observations, allowing their global estimation with a high spatial and temporal resolution. After imputing missing values through iterative completion Self-Organizing Maps, and smoothing and reducing the vertical distributions through principal component analysis, we used a Self-Organizing Map to cluster the reduced profiles with satellite observations. These referent vector clusters were then used to invert the vertical profiles of phytoplankton pigments. The methodology was trained and validated on the MAREDAT dataset and tested on the Tara Oceans dataset. The coefficients of determination *R*<sup>2</sup> between observed and estimated vertical profiles of pigment concentration are, on average, greater than 0.7. We could expect to monitor the vertical distribution of phytoplankton types in the global ocean.

**Keywords:** machine learning; inversion; ocean colour; phytoplankton; pigment vertical profile; deep chlorophyll maximum; Tara Oceans; MAREDAT; pigments; ITCOMP-SOM; Self Organizing Maps

**Citation:** Puissant, A.; El Hourany, R.; Charantonis, A.A.; Bowler, C.; Thiria, S. Inversion of Phytoplankton Pigment Vertical Profiles from Satellite Data Using Machine Learning. *Remote Sens.* **2021**, *13*, 1445. https://doi.org/10.3390/rs13081445

Academic Editors: Fahimeh Farahnakian and Edoardo Pasolli

Received: 30 January 2021 Accepted: 1 April 2021 Published: 8 April 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## **1. Introduction**

Phytoplankton is a key player in ocean biodiversity, with consequences for fish catch potential and for climate regulation through carbon dioxide storage [1–4]. A decline in total phytoplankton population has been observed in Northern hemisphere basins over the last decade [5] and is projected to strengthen over the 21st century over wide oceanic regions under all global warming scenarios [6]. This decline is one of the most alarming consequences of anthropogenic climate change, as highlighted by recent policy-relevant reports [7] and by a "scientists' warning to humanity" consensus statement in Nature Reviews [8]. However, an important question is how phytoplankton composition responds to changes in ocean characteristics (temperature, nutrients, currents, stratification, ...) since phytoplankton diversity constrains the societal impacts on both climate and fisheries.

Methods to observe the phytoplankton diversity from remote sensing data have greatly progressed during the last two decades [9,10]. New algorithms have been developed [11,12] that extract phytoplankton pigments and Phytoplankton Functional Types (PFTs) at the sea surface from satellite ocean color data. A major limitation of ocean color observations is that they only provide information on the sea surface and miss subsurface peaks of phytoplankton abundance that can represent a large proportion of the total depth-integrated quantity. In fact, Morel and Berthon [13] classified the vertical variability into "trophic categories" following the surface Chlorophyll-A (Chla) concentration, and showed that there is a relationship between this concentration and the integrated concentration of Chla in the water column. Subsequently, based on this previous work, Uitz et al. [14] determined from surface satellite data the variability of different phytoplankton size classes (PSC) in the water column based on their contribution to the Chla. However, these studies are constrained by the empirical relationships between Chla and secondary pigments and by assumptions on the shape of the vertical pigment profiles, and cannot predict atypical associations [15]. Charantonis et al. [16] presented a combined use of a Self-Organizing Map with Hidden Markov Models to infer Three-Dimensional (3D) Chla fields starting from Two-Dimensional (2D) imaging of several variables (surface Chla, Sea Surface Elevation, solar radiation and wind). Furthermore, Cortivo et al. [17] proposed a neural network methodology to estimate the sub-surface Chla concentration in open waters from the upwelling radiation. A similar attempt to infer the vertical Chla profile, by using a Multi-Layer-Perceptron (MLP), was shown in Sauzède et al. [18], in which the output is predicted from surface ocean-color estimates and depth-resolved physical properties derived from profiling floats, such as SST and salinity. Finally, Sammartino et al. [19,20] proposed a regional neural network approach to reconstruct the 3D variability of Chla in the Mediterranean Sea. All of these works have targeted the Chla reconstruction as the main proxy of phytoplankton biomass. However, Uitz et al. [14] and Sauzède et al. [18] pushed their approach one step further to reconstruct phytoplankton community structure in terms of cell size.

In the present work, we introduce a new machine learning (ML) methodology to estimate several phytoplankton pigment profiles from ocean-color data, addressing a multidimensional problem based on the co-estimation of six different pigments. The novelty of this work lies in the ability to observe the 3D variability of phytoplankton functional types using these pigments.

Indeed, recent developments in artificial intelligence, combined with the availability of large datasets of satellite observations, provide enormous potential to learn the hidden structure of geophysical phenomena such as the one faced in this paper. ML methods have started to allow the intelligent investigation of such multi-dimensional data sets in oceanography and biogeochemical studies [21–23]. ML algorithms are now used to exploit spatial and temporal complex data structures, find patterns, and fuse heterogeneous sources of information efficiently. The survey in Reichstein et al. [24] describes the recent achievements and research challenges in the field of geophysics. Cross-fertilization of the ML with physical and biogeochemical contexts should allow the extraction of relevant knowledge from the dataset encountered in this study. This functioning is crucial for a better joint exploitation of observational data for understanding the phytoplankton variability as observed from space.

To achieve this aim, we used a large global database of pigment concentrations measured by high-performance liquid chromatography (HPLC) at the surface and through the water column, the Marine Ecosystem Data (MAREDAT) database [25], alongside daily satellite ocean colour matchups. After a series of training and validation experiments on MAREDAT, we use, as a final test, the HPLC data provided by the Tara Oceans Expedition [26], a pan-oceanic expedition that deployed a holistic sampling of phytoplankton communities, coupled with comprehensive in situ biogeochemical measurements which provide the detailed environmental contexts necessary for ecological interpretation of the phytoplankton ecosystem.

#### **2. Materials and Methods**

#### *2.1. Data*

This section is devoted to the data we used, which can be split into two distinct parts: in-situ observations and remotely sensed signals. Remotely sensed data are abundant and easy to acquire, but the in-situ observations that are gathered during oceanic campaigns all around the world are sparse and represent a limited dataset. Due to the difficulty inherent to measurements at sea, the in-situ dataset is heterogeneously sampled in both pigments and depths. Moreover, both datasets are imperfect and have a percentage of missing data that can be substantial. The challenge is thus to gather the available information (in-situ and remotely sensed) and to build a limited but robust dataset allowing the use of machine learning techniques. This requires the fusion of the two datasets.

#### 2.1.1. Pigment Observations

The MAREDAT database contains concentration measurements obtained at different depths and different stations at sea and analysed by HPLC for Chla and secondary pigments. The stations, defined by their longitude, latitude, and date (day/month/year), come from 136 scientific cruises around the world which have been compiled and quality controlled [25].

Besides the Chla concentration, we used 5 pigments that provide information on the main groups of phytoplankton: Divinyl-Chlorophyll-A (DVchla), 19'hexanoyloxyfucoxanthin (19hex), fucoxanthin (fucox), peridinin (perid) and zeaxanthin (zeax). These pigments were chosen based on their ability to distinguish the main groups of phytoplankton determined from the scientific literature [14,27–29]; Fucoxanthin for diatoms [30], Peridinin for dinoflagellates [30,31], 19'Hexanoyl-Fucoxanthin for Haptophytes [32], Zeaxanthin for Cyanobacteria [33,34] and Divinyl Chlorophyll-a for Prochlorococcus [33,34].

The measurements corresponding to depths greater than 300 m have been eliminated due to low pigment concentration and variability in light-limited environments. A quality control check was performed to filter the data, described in the following paragraph.

First, measurements with Chla concentrations greater than 3 mg m<sup>−3</sup> were rejected, as they correspond to rare and abnormally high concentrations encountered in open waters [11]. Afterwards, values of secondary pigments above the 95th percentile for each pigment were considered outliers and were replaced by missing values [11,29]. Finally, due to their specific physical, optical and biogeochemical properties, stations in the Antarctic below 50 degrees south were excluded [35–37]. These differences are often explained by the adaptation or acclimation of polar phytoplankton to extreme environmental characteristics or by alterations in the relative abundances and characteristics of other optically-significant constituents resulting from particular geographical settings, specifically in the Southern Ocean [35,37–45]. To capture the greater variability of Chla within the sunlit surface layer, a 9-point logarithmic depth grid was defined between the surface and 300 m: 5 m, 8.34 m, 13.92 m, 23.23 m, 38.75 m, 64.63 m, 107.81 m, 179.84 m and 300 m. For each station, multiple measurements occurring at the same depth point were averaged. From the initial longitude, latitude and date of the HPLC measurements, 6807 stations were found and then reduced to 3903 stations which are collocated with satellite observations whose resolution is 4 km × 4 km. The stations that contained more than 50% missing pigment values were excluded, resulting in a final total of 1614 retained stations. The geographical distribution of the stations is shown in red in Figure 1.
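The filtering steps and the depth grid described above can be sketched with NumPy. This is a minimal illustration rather than the processing chain used for the study: the function names and array layouts are hypothetical, the 95th percentile is assumed to be computed after the Chla rejection, and a geometric spacing between 5 m and 300 m is assumed for the grid, which matches the listed depths to within a few centimetres.

```python
import numpy as np

# 9-point logarithmic depth grid between 5 m and 300 m (assumed geometric).
depth_grid = np.geomspace(5.0, 300.0, num=9)


def quality_control(chla, secondary):
    """Apply the Chla threshold, then mask secondary-pigment outliers.

    chla      -- array of shape (n_samples,), in mg m^-3
    secondary -- array of shape (n_samples, n_pigments), NaN where missing
    """
    chla = np.asarray(chla, dtype=float)
    secondary = np.asarray(secondary, dtype=float)
    keep = ~(chla > 3.0)        # reject samples with Chla > 3 mg m^-3
    chla = chla[keep]
    secondary = secondary[keep]  # boolean indexing returns a copy
    p95 = np.nanpercentile(secondary, 95, axis=0)  # one threshold per pigment
    secondary[secondary > p95] = np.nan            # outliers become missing
    return chla, secondary


def station_is_retained(profile) -> bool:
    """Keep a station only if at most 50% of its pigment values are missing."""
    profile = np.asarray(profile, dtype=float)
    return bool(np.isnan(profile).mean() <= 0.5)


# Tiny hypothetical example: the second sample exceeds the Chla threshold.
chla_f, sec_f = quality_control([0.5, 4.0, 1.0], [[1.0], [2.0], [3.0]])
```

In the toy example the sample with 4.0 mg m<sup>−3</sup> is dropped, and the largest remaining secondary-pigment value lies above the 95th percentile of its column, so it is replaced by a missing value.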

A separate database was used in the last section of the paper to test the proposed methodology. The Tara Oceans HPLC pigment concentration database from the Tara Oceans Expedition [26] contains HPLC measurements for several pigments at different depths, from which we selected the data corresponding to the 6 pigments of interest (Chla, DVchla, fucox, perid, 19hex and zeax). The measurements comprise 211 stations distributed over the globe, which were combined into 143 stations according to the satellite resolution and after excluding Antarctic stations. This dataset was cleaned in the same way as MAREDAT, resulting in 66 stations whose geographic distribution is shown in green in Figure 1.

**Figure 1.** Geographical distribution of the stations. Red dots represent the 1614 stations from MAREDAT constituting the training set; green stars represent the 66 stations from Tara constituting the test set. The magenta diamonds represent the Biosope trajectory, a subset of the MAREDAT dataset, and the blue square indicates the location where satellite data were obtained in order to test the developed method.

#### 2.1.2. Satellite Observations

The ocean colour satellite data originate from the GlobColour project, carried out by the European Space Agency (ESA), which creates and maintains a long time series of ocean colour data from satellite measurements (from 1997 to the present). This database is the result of the fusion of data from various satellite sensors: Sea-viewing Wide Field-of-view Sensor (SeaWiFS), Moderate Resolution Imaging Spectroradiometer (MODIS), Visible Infrared Imaging Radiometer Suite (VIIRS), Medium Resolution Imaging Spectrometer (MERIS), and Ocean and Land Colour Instrument (OLCI).

The sensors measure the backscatter and spectral absorption coefficients of light by the ocean, and the reflectance is then calculated from these parameters. The reflectances are generated by each sensor from level 2 data (data pre-processed according to sensor and geophysical parameters). The reflectances are then merged by taking a weighted average of each sensor output. Meanwhile, Sea Surface Temperature (SST) was obtained from the Advanced Very High-Resolution Radiometer (AVHRR) instruments on board the National Oceanic and Atmospheric Administration (NOAA) satellites (Pathfinder version 5.3) [46,47]. The satellite data have undergone quality and flag checks and are generated with a spatial resolution of 4 km and a temporal resolution of one day.

Eleven satellite variables were considered for retrieving the 6 pigment concentration profiles that constitute the pigment database: remote sensing reflectances at 4 wavelengths (RRS412, RRS443, RRS490, and RRS555), satellite Chla (chla\_sat), Sea Surface Temperature (SST), light attenuation coefficient at 490 nm (KD490), depth of the euphotic layer (ZEU), depth of the warmed layer (ZHL), photosynthetically available radiation (PAR) and its coefficient of attenuation (KDPAR).

The choice of the satellite variables is based on the findings of previous studies [13,14]. These studies showed that surface Chla and the euphotic depth (ZEU) are the main variables explaining the vertical variability of the Chla in the water column. However, since we are dealing with several pigments, it is essential to use the surface reflectance at different wavelengths rather than only satellite-derived Chla, in order to consider the influence of other pigments' variability on the satellite-detected signal. Physical factors are also investigated to take into account the influence of light (PAR, KDPAR, KD490, ZEU) and heating (SST, ZHL) on this vertical variability. To validate our use of the satellite data, we compared the Chla in situ data (Section 2.1.1) to the GlobColour Chla product. The calculated regression coefficient and the Spearman correlation were 0.67 and 0.77, respectively.

The two separate datasets were merged into a final reduced database collocating the in situ observations with the satellite data. The database subsequently used for the construction of the method, noted *D*, has dimension (1614, 65), where 1614 is the number of in situ profiles (stations) collocated with satellite images, noted *z<sub>i</sub>*, and 65 is the number of variables: 54 in situ HPLC pigment variables (6 pigments at 9 depths each) and 11 satellite variables.

#### 2.1.3. Combined Dataset

The dataset resulting from the merging of the two databases is of high dimension, owing to the inclusion of the concentrations of the six pigments at nine depths, and contains many missing values, as can be seen in Table 1. Location variables such as latitude and longitude were omitted from this study because the available data are insufficient to prevent over-fitting. Furthermore, since phytoplankton are associated with nonlinear population dynamics [48], there exist strong nonlinear relationships among the different concentrations of photosynthetic pigments. We are therefore working on high-dimensional and sparse pigment data with strong nonlinear relationships. The development of a method for in-depth reconstruction thus requires a technique that can manage these nonlinear relations.


**Table 1.** Missing data for each pigment (among the 9 depths) and for the satellite variables of the experimental dataset *D*.

#### *2.2. Inverse Method: From Satellite Data to Vertical Profiles*

In order to infer the vertical profiles from satellite data, we need to chain together several methodological phases that rely on Artificial Neural Networks and dimension reduction techniques. These methods are briefly outlined in this section, before the specific implementation is detailed.

#### 2.2.1. Algorithms

Neural approaches can be used to study nonlinear interactions within complex self-adaptive systems, such as marine ecosystems in relation to remote sensing measurements. Unsupervised approaches make it possible to extract these nonlinear relationships without any *a priori* assumptions.

Self-Organizing Maps (SOM) [49] are unsupervised neural networks whose objective is to cluster a high-dimensional dataset *D* ⊂ ℝ<sup>*n*</sup> into a discrete representation in reduced dimensions, generally on a two-dimensional neural grid called a "map". This grid layout introduces the notion of a neighborhood between neurons during the clustering, so that two clusters that are close on the topological map gather similar data, thanks to the topological ordering of the map. SOMs are highly interpretable and make it possible to find relationships between the distribution of data on the map and the main explanatory variables. This is particularly useful for complex and noisy data, as is the case with climatology and oceanography data, where SOMs have been used in a large variety of studies [50,51].

After training, each cluster *C* is defined by a referent vector *W<sub>C</sub>*, which represents the mean value of the data assigned to it, and by its position on the topological map, which indicates the clusters that are close to it. A data vector *Z* is assigned to a class by comparing it to the set of referent vectors {*W<sub>C</sub>*; *C* ∈ *SOM*} and attributing it to the nearest referent vector *W<sub>C</sub>* according to the Euclidean distance (*C* is then called the *Best-Matching Unit* or BMU) (1):

$$BMU(Z, SOM) = \arg\min_{C \in SOM} \sqrt{\sum_{i=1}^{n} (Z_i - W_{C_i})^2},\tag{1}$$

where *Z* ∈ ℝ<sup>*n*</sup>. The SOM can be used in the context of completing missing data [52] by considering a modification of this distance. In that case, the projected vectors *Z* can have components *Z<sub>i</sub>* whose values are missing. Under these conditions, the distance between a vector *Z* and the referent vectors *W<sub>C</sub>* of the map is the Euclidean distance that considers only the existing components (the Truncated Distance or TD hereinafter). The use of the TD allows for taking into account the information embedded in the incomplete data.
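The BMU search with the truncated distance, and the completion step that follows from it, can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes an already trained map given as a plain list of referent vectors, missing values encoded as `None`, and hypothetical function names. It is not the authors' implementation, which relies on the SOM Toolbox for Matlab.

```python
import math


def truncated_distance(z, w):
    """Euclidean distance over the components of z that are not missing (None)."""
    return math.sqrt(sum((zi - wi) ** 2 for zi, wi in zip(z, w) if zi is not None))


def best_matching_unit(z, referents):
    """Index of the referent vector closest to z under the truncated distance."""
    return min(range(len(referents)),
               key=lambda c: truncated_distance(z, referents[c]))


def complete(z, referents):
    """Fill the missing components of z with those of its best-matching referent."""
    w = referents[best_matching_unit(z, referents)]
    return [wi if zi is None else zi for zi, wi in zip(z, w)]


# Toy map with three referent vectors in R^3 (hypothetical values).
referents = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 5.0]]
z = [1.9, 2.2, None]  # the third component is missing
print(best_matching_unit(z, referents), complete(z, referents))
```

Observed components are left untouched; only the missing ones are taken from the referent vector.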

The Iterative Completion SOM (ITCOMPSOM) method is an iterative data completion method derived from the SOM. When a data vector presents missing values, the method uses a modified TD, denoted *TD<sub>s</sub>* and defined in the equation below. The modified TD uses the correlations between the missing variables and those present to weight the Euclidean distance, so that the variables most correlated with the missing values more strongly influence the attribution to a cluster:

$$TD_s^{C}(Z, W_C) = \sum_{i \in \text{avail.}} \left( \left( 1 + \sum_{j \in \text{missing}} \left( cor_{ij} \right)^2 \right) \times \left( Z_i - W_{C_i} \right)^2 \right),$$

where *avail.* corresponds to the components of *Z* without missing values, and *missing* to those with missing values. The correlations *cor<sub>ij</sub>* are calculated pairwise between all variables over the training dataset before the method is applied.
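As a worked illustration of the weighted distance, the Python sketch below follows the formula term by term. The function name, the `None` encoding of missing values and the toy correlation matrix are assumptions made for the example. As in the equation as written, no square root is taken.

```python
def weighted_truncated_distance(z, w, cor):
    """Correlation-weighted truncated distance between a vector z and a referent w.

    z   : data vector; missing components are None
    w   : referent vector of the same length
    cor : pairwise correlation matrix of the variables (list of lists)
    """
    avail = [i for i, zi in enumerate(z) if zi is not None]
    missing = [j for j, zj in enumerate(z) if zj is None]
    total = 0.0
    for i in avail:
        # Available variables that correlate with missing ones get more weight.
        weight = 1.0 + sum(cor[i][j] ** 2 for j in missing)
        total += weight * (z[i] - w[i]) ** 2
    return total


# Hypothetical example: variable 0 is strongly correlated with the missing variable 2.
cor = [[1.0, 0.1, 0.9],
       [0.1, 1.0, 0.0],
       [0.9, 0.0, 1.0]]
z = [1.0, 2.0, None]
w = [0.0, 0.0, 0.0]
print(weighted_truncated_distance(z, w, cor))
```

In this example the squared difference on variable 0 is weighted by 1 + 0.9², whereas variable 1, uncorrelated with the missing variable, keeps a weight of 1.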

Furthermore, ITCOMPSOM iteratively completes the dataset, imputing the missing values of a data vector several times during the iterations, by training successively bigger topological maps, which combine previously completed data and new data with missing values at each iteration. This method allows a better data completion than the basic SOM method, for data with up to 75% missing data. Moreover, it is adapted to the completion of oceanographic data in which the variables are linked [23,53].

Finally, we also used Principal Component Analysis (PCA) [54], which is an orthogonal linear transformation of a dataset that projects the values onto new axes that best fit the data. These new axes are selected to explain a maximum amount of the variability of the initial data. PCA can also be seen as a filtering tool: the first axes represent most of the information embedded in the dataset, while the remaining axes are associated with dataset noise. The specific number of modes was selected by cross-validation and is presented in Section 3.1.

#### 2.2.2. Sat2Profile Methodology

The aim of *Sat2Profile* is to retrieve the vertical profile using the satellite data only. Owing to the large amount of missing data and the level of noise in the observation data, this requires a complete methodology that takes each problem into account. *Sat2Profile* can be split into three main phases:


At the end of these 3 phases, we perform a variable selection. We fix the hyperparameters *naxes<sub>i</sub>* and the size of the map, and we test all the possible combinations of explanatory variables, repeating the *Sat2Profile* inversion for each subset. Figure 2 summarizes the methodological process.

**Figure 2.** Flow diagram of the *Sat2Profile*. A 500-fold cross-validation was performed on the training data.

In our study, the different phases were implemented in the way presented below.

*1st Phase*: We chose to use the satellite variables RRS412, RRS443, RRS555, KD490, ZEU, and ZHL, which we expected to have the best ability to retrieve the vertical distributions of the pigment concentrations. As described in Section 2.1.2, the surface reflectance at different wavelengths is used to consider the influence of the pigments' variability on the satellite-detected signal. KD490, ZEU and ZHL are also used to take into account the sunlight and heating effects.

*2nd Phase*: The learning dataset *D* has two distinct components: the satellite data that can have missing data and the pigment profiles. The pigment profiles were completed using ITCOMPSOM. The most complete part of the dataset (106 observations from the 1640, across the globe) is set aside as a validation set. Parts of these data were artificially masked. The ITCOMPSOM method was trained with the rest of the dataset and used to complete the validation set. The completed data and the corresponding observed data were compared by computing *R*<sup>2</sup> and RMSE. This process was repeated a large number of times (500 times) and an average assessment of completion was obtained, shown in Table 2.

*3rd Phase*: The completed pigment data were collocated with the satellite measurements and combined into a single dataset. Then, a smoothed version of the pigment dataset was built using PCAs. For a given number of axes *naxes*, a learning dataset was built with 1614 rows and 11 + *naxes* columns corresponding, respectively, to the satellite variables and the smoothed PCA profiles. All the variables of the resulting dataset are standardized (centered and scaled) and are used as a training set for a SOM. The cross-validation (described in Section 2.2.3) resulting from the 9 experiments (dimension of the profiles) allows the determination of *naxes<sub>i</sub>*.

Finally, after selecting the optimal number of axes to keep, we analyzed the whole *Sat2Profile* methodology by testing all the combinations of the 11 satellite variables as inputs, in order to find those allowing the best retrieval of the pigments' vertical profiles. The exact hyperparameter values are provided in the code (https://github.com/AgathePuissant/SOM\_PCA (accessed on 1 March 2020)). In the end, we found that the 6 selected variables (RRS412, RRS443, RRS555, KD490, ZEU and ZHL) were the optimal combination of variables to be used.


**Table 2.** Validation results for the completion of the data by ITCOMPSOM.

#### 2.2.3. Methodological Workflow

#### Training Phase

First, the training dataset was completed using the ITCOMPSOM method. A PCA was performed on the matrix of in situ data for each pigment, of dimension (1614, 9). These PCAs resulted in 9 principal components for each pigment. A certain number *naxes* of these principal components were kept (the precise number for each pigment was chosen through optimization), resulting in 6 pigment datasets of dimensions (1614, *naxes<sub>i</sub>*), with *i* ∈ [1 . . . 6]. The pigment data were collocated with the satellite measurements and combined in a single dataset. All the variables of the resulting dataset were standardized (centered and scaled) and were used as a training set for a SOM.
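The PCA step for a single pigment can be written compactly. In the sketch below, the symbols $x$ (a 9-depth profile of pigment $i$), $\bar{x}$ (the mean profile over the training set) and $v_k$ (the orthonormal principal axes) are notation introduced here for illustration, assuming a PCA on mean-centered profiles:

$$c_k = v_k^{\top}(x - \bar{x}), \qquad \hat{x} = \bar{x} + \sum_{k=1}^{naxes_i} c_k \, v_k .$$

Under this reading, only the coordinates $c_1, \ldots, c_{naxes_i}$ enter the SOM training set, and the second expression maps a set of coordinates back to a smoothed 9-depth profile $\hat{x}$.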

#### Retrieval Phase

After the initial training, the SOM can be used to reconstruct the missing ∑<sub>*i*</sub> *naxes<sub>i</sub>* variables of in situ data from the available *nsatvar* variables of satellite-derived data. Each observation was assigned to its Best Matching Unit, the neuron in the map whose referent vector was the closest in the Euclidean sense (Equation (1)). The missing data were then replaced by the values of the corresponding components of the assigned referent vector. The PCA coordinates of the profiles were retrieved from the satellite data input, and then the profiles were reconstructed in the data space using the determined PCA parameters.

#### Cross-Validation of the Model

To assess the performance of the method, a 500-fold cross-validation procedure was set up: the preprocessed database is randomly segmented into 500 blocks. In each iteration, 499 of the 500 blocks are used as the training set and the remaining block is used as the validation set. The pigment data from the validation set are masked; only the satellite variables are kept and used to infer the missing values.

The SOM is trained on the training set, and the retrieval procedure is applied to the validation set. The estimated pigment data from the validation set is compared to the corresponding observed data that had been masked beforehand. This process is repeated on the 500 blocks.
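The block structure of such a cross-validation can be sketched as follows. The sketch assumes the usual k-fold convention, in which each block serves once as the validation set while the remaining 499 blocks form the training set. The function names and the fixed random seed are hypothetical, and the SOM training and retrieval calls are deliberately left out.

```python
import random


def k_fold_blocks(n_samples, n_blocks, seed=0):
    """Randomly partition the sample indices into n_blocks near-equal blocks."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives block sizes that differ by at most one.
    return [indices[b::n_blocks] for b in range(n_blocks)]


def cross_validation_splits(n_samples, n_blocks, seed=0):
    """Yield (training, validation) index lists, one pair per held-out block."""
    blocks = k_fold_blocks(n_samples, n_blocks, seed)
    for held_out, validation in enumerate(blocks):
        training = [i for b, block in enumerate(blocks) if b != held_out
                    for i in block]
        yield training, validation


# 1614 stations split into 500 blocks, as in the procedure described above.
blocks = k_fold_blocks(1614, 500)
print(len(blocks), min(len(b) for b in blocks), max(len(b) for b in blocks))
```

With 1614 stations and 500 blocks, each validation block holds only three or four stations, so the procedure is close to a leave-one-out scheme.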

The performance of the retrieval is assessed by computing the *R*<sup>2</sup> (2), Root-Mean Squared Error (*RMSE*) (3) and Spearman correlation coefficient (4) between each observed and estimated profile. They are then averaged for each pigment.

$$R^2(Obs_i, Est_i) = 1 - \frac{\sum_{j=1}^{n} (Obs_{ij} - Est_{ij})^2}{\sum_{j=1}^{n} (Obs_{ij} - \overline{Obs_i})^2}, \ i \in [1, m] \tag{2}$$

$$RMSE(Obs_i, Est_i) = \sqrt{\frac{\sum_{j=1}^{n} (Obs_{ij} - Est_{ij})^2}{n}}, \ i \in [1, m] \tag{3}$$

$$\rho_{Spear}(Obs_i, Est_i) = 1 - \frac{6 \sum_{j=1}^{n} d_j^2}{n(n^2 - 1)}, \ i \in [1, m] \tag{4}$$

where *d<sub>j</sub>* is the difference between the ranks of the *j*-th components of the two vectors, *n* the number of components in the vector (*n* = 9 because the profiles are composed of 9 depths) and *m* the number of observations in *D* (*m* = 1614).

The *R*<sup>2</sup> and *RMSE* are computed from the linear regression between the observed and estimated values for each profile and allow the quantification of the error committed during the profile retrieval. The Spearman correlation coefficient accounts for nonlinear relationships among variables, and thus allows an assessment of the correspondence between the shapes of the estimated and observed profiles.
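The three per-profile scores can be illustrated with a short Python sketch that implements the closed-form expressions of Equations (2)–(4) directly. The function names and the two toy 9-depth profiles are hypothetical, and the rank-based Spearman formula used here assumes that a profile contains no tied values.

```python
import math


def r_squared(obs, est):
    """Coefficient of determination between one observed and one estimated profile."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - e) ** 2 for o, e in zip(obs, est))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def rmse(obs, est):
    """Root-mean-squared error between one observed and one estimated profile."""
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(obs, est)) / len(obs))


def ranks(values):
    """Ranks from 1 to n in ascending order (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(obs, est):
    """Spearman correlation from squared rank differences (no ties assumed)."""
    n = len(obs)
    d2 = sum((ro - re) ** 2 for ro, re in zip(ranks(obs), ranks(est)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical 9-depth profiles (arbitrary concentration units).
obs = [0.10, 0.12, 0.15, 0.22, 0.35, 0.30, 0.18, 0.08, 0.02]
est = [0.11, 0.12, 0.16, 0.20, 0.31, 0.33, 0.17, 0.09, 0.03]
print(r_squared(obs, est), rmse(obs, est), spearman(obs, est))
```

In this toy case the estimate places the concentration maximum one level too deep, which lowers the Spearman coefficient slightly below 1 even though the magnitudes remain close.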

#### 2.2.4. Test of Spatial and Temporal Coherence

Once the inversion method has been implemented, the results obtained must be spatially and temporally consistent. To test the results of the method on spatially varying data, the inversion method was applied to observations along a particular ocean cruise transect. The Biosope cruise transect (http://www.obs-vlfr.fr/proof/vt/op/ec/biosope/bio.htm (accessed on 1 March 2020)) was selected based on the quantity of satellite data available from which to invert pigment profiles. The Biosope transect is composed of 49 stations, 28 of which contain enough satellite data to perform an inversion. These transect data come from the training set and were therefore used to verify the spatial consistency of the results from our inversion method. On the other hand, to validate the consistency over time of the data obtained by inversion, we selected a station located in a temperate zone (47°N, 8°W), where phytoplankton therefore show a well-marked seasonality. The weekly satellite data (averaged over 8 days) observed during the year 2019 from January to December were extracted from a 6 × 6 pixel box around the location coordinates. Pigment profiles were inverted from satellite data and then the profiles were spatially averaged for each week, resulting in 46 weekly average pigment profiles.

#### **3. Results**

#### *3.1. Parameters of the Method*

The data were completed using the ITCOMPSOM method with a two-dimensional hexagonal grid with a final size of 27 × 15 (405 neurons) on the SOM and 10 iterations. The SOM consists of the same structure of a two-dimensional hexagonal grid with a size of 27 × 15 (405 neurons), determined heuristically by taking into account the number of observations in the training set and the number of observations per class, to have a good distribution of data on the neural map. Cross-validation experiments of the performance of the method helped to determine the number of PCA coordinates to keep for each pigment. The first two PCA coordinates were kept for each pigment, corresponding to between 69% and 82% of the explained variance depending on the pigment. After cross-validating the method for every combination of the considered eleven satellite variables, the six selected variables were RRS412, RRS443, RRS555, KD490, ZEU and ZHL.

#### *3.2. Cross Validation Performance*

The results of the cross-validation of the method using the PCA preprocessing with two axes were compared with the results of the cross-validation of the method without the smoothing of the profiles by the PCAs given in Table 3. The average *R* <sup>2</sup> and average Spearman's correlation coefficient per profile increase with the use of profile smoothing by PCA, while the average RMSE per profile decreases. As an example, for fucoxanthin, the average *R* <sup>2</sup> per profile increases from 0.4 to 0.83 with the use of PCA smoothing in the inversion method. On average, the Spearman's per profile correlation coefficient increased by 0.26, the *R* <sup>2</sup> per profile increased by 0.31, and the RMSE per profile was divided by 2.17. Globally, for the method using a PCA reduction, the average *R* <sup>2</sup> per profile ranges from 0.68 to 0.83, and the average Spearman correlation coefficients per profile range from 0.77 to 0.84.

**Table 3.** Cross-validation results for the method without PCA preprocessing, and with PCA preprocessing (two axes).


To assess the order of magnitude of the information lost by the PCA smoothing, the initial profiles were compared before and after the PCA preprocessing with two axes, using the RMSE averaged over all the observations for each pigment. The results are presented below in Table 4 along with the RMSE estimates from the cross-validation, and represent the uncertainties associated with each estimated pigment vertical profile. The percentages of error for the two steps, PCA and SOM, have the same order of magnitude.

**Table 4.** Mean RMSE results for the PCA step of the method and the SOM step of the method.


#### *3.3. Test Performance*

Once the method had been trained on the ITCOMPSOM-completed and PCA-preprocessed data, the retrieval procedure was applied to satellite data collocated with the 66 Tara dataset stations. The Tara profiles were completed using ITCOMPSOM to allow the comparison between observed and estimated profiles. The estimated pigment profiles were compared to the completed observed ones. The results are shown in Table 5. The comparison criteria are of the same order of magnitude as the results of the cross-validation experiment. These results suggest a good generalization capability of the method to external data.


**Table 5.** Results of the inversion of the Tara test set using the method with PCA preprocessing (two axes).

#### *3.4. Spatial and Temporal Coherence*

The pigment profiles of the Biosope cruise trajectory were estimated from the daily satellite data using our method. The results for the main pigment (Chla) and a secondary pigment (DVChla) are shown in Figures 3 and 4. In these figures, as the cruise trajectory crosses the Pacific Ocean longitudinally, we chose to represent the pigment concentration values along the longitude on the *x*-axis and the depth values on the *y*-axis. The profiles, smoothed using PCAs, which are represented in Figures 3a and 4a, are the final profiles that we aimed at retrieving from satellite data. The inverted profiles are represented in Figures 3b and 4b, the black areas corresponding to the longitudes where there were no matched satellite data available for any of the six selected satellite variables. Figures 3c and 4c show the difference between observed and estimated profiles.

In Figures 3b and 4b, we show the profiles estimated by inversion, which can be compared with Figures 3a and 4a. Globally, we find the same zones and the same depths for the concentration maxima. The same pattern of the maximum concentration depth as a function of longitude is found in both the estimated and observed profiles, i.e., close to the surface in the west, then reaching greater depths between 107.81 m and 179.84 m at intermediate longitudes and again close to the surface in the eastern longitudes. However, some profiles are overestimated and others underestimated, which is shown in red and blue, respectively, in Figures 3c and 4c. This test of the inversion method on the Biosope cruise trajectory satisfactorily accounts for the inter-pigment dynamics along a continuous spatial observation. The spatial coherence of the trajectory is preserved after the inversion from satellite data.

The weekly pigment profiles in the ocean area (47°N, 8°W) were inverted from satellite data by our method for the year 2019. The inversion was performed using satellite data not included in the training dataset. Only satellite data were available at this location, but the temporal characteristics of phytoplankton are known: the region corresponds to the North Atlantic Biogeochemical province, with a temperate climate and a seasonal variation of phytoplankton. Therefore, a spring bloom of phytoplankton is expected. This inversion thus allows us to test the method on new data and to verify the temporal coherence of the results obtained with the environmental characteristics. We show the results for the estimated Chla, fucox, and zeax profiles with respect to time. The Chla concentration represents the occurrence of the phytoplankton as a whole, and the fucox and zeax represent the composition of the phytoplankton community. These two secondary pigments are indicators of two main groups of phytoplankton, fucox being a diagnostic pigment for the diatoms [30] and zeax being a diagnostic pigment for the prokaryotes [33,34].

Figure 5 shows Chla profiles as a function of depth and time (in weeks). Between weeks 10 and 18, which corresponds to mid-March to early May, the Chla reaches high concentrations in the water column with a maximum at the surface between 5 and 8 m. Following that, the surface Chla concentration decreases, showing a DCM between 23 and 64 m. As seen in Figure 6, there is a concentration peak of fucox at a depth of about forty meters at the same time as the Chla peak, between weeks 10 and 18. In Figure 7, we observe a different dynamic for the zeax concentration with respect to the two other pigments: the concentration peak occurs later, during weeks 19–37, corresponding to the late spring and summer seasons. The increase of zeax happens in the surface layers (between 5 and 23 m).

**Figure 3.** Result of the inversion of Chla profiles from the satellite data of the Biosope trajectory. (**a**) Smoothed observed Chla profiles; (**b**) estimated Chla profiles; (**c**) difference between estimated and observed.

**Figure 4.** *Cont*.

**Figure 7.** zeax inverted profiles over time.

#### **4. Discussion and Conclusions**

We presented, in this paper, robust estimations of the vertical variability of six phytoplankton pigments (Chla, fucox, 19hex, perid, zeax and DVChla) from the surface to a depth of 300 m, using satellite surface measurements at high spatial (global, 4 km) and temporal (daily) resolution. These estimations are derived from a new machine learning methodology proposed in this paper, *Sat2Profile*, based on a SOM, and trained and validated using the fusion of an in situ global HPLC database, MAREDAT, and an ocean colour satellite database. After a series of cross-validations and checking the coherence of the results, a validation experiment was performed on a new database introduced as a test set from Tara Oceans measurements. The different experiments show a satisfying performance. The different regression coefficients *R* <sup>2</sup> between observed and estimated vertical profiles of pigment concentration and the Spearman correlation coefficient are greater than 0.7. The reconstruction of the 3D distribution of phytoplankton pigments is an innovative result that gives a better understanding of the PFTs distribution in the water column.

Previous works attempting to predict vertical pigment profiles from surface data targeted the Chla and were based on the surface Chla, sometimes combined with other physical factors such as SST and currents [13,14,17–20]. However, during the optimization process of *Sat2Profile*, we showed that the problem is more complex when dealing with different pigments at the same time, each with its own particular variability. SST and Chla surface information were not enough to estimate the vertical profile of the pigments. Therefore, several bio-optical parameters, such as remote sensing reflectance at several wavelengths, and the information about the euphotic layer were essential to infer pigment vertical variability from surface data. The necessity of having euphotic depth as an input aligns our study with the reasoning of [13]. In addition, in [55], the authors showed that optical and radiometric information are effective indicators of the vertical dynamics of pigments. Estimating phytoplankton pigment variability using a temporal dataset of satellite data within the North Atlantic biogeochemical province showed that pigments such as Zeaxanthin and Fucoxanthin exhibit different temporal variability. Furthermore, the depth of the pigment concentration maximum is not the same for each pigment; this was observed in in situ studies [56] and has also been observed in the MAREDAT database. These findings can be related to the community shift in response to seasonal changes and variations of environmental factors. The fucox peak concentrations indicate a bloom dominated by diatoms. The overall low zeax concentration highlights that the fraction of prokaryotes at this time is limited. Later, with the heating of the surface layer from the beginning of the summer until the end of September, the fucox decreases while the zeax remains the same. In such events of stratification of the water column in response to higher SST, prokaryotes are the most favored [57,58]. This analysis of pigment dynamics over time is consistent with studies done in the North Atlantic Biogeochemical region [59,60].

The Biosope experiment to reconstruct the pigment variability along the ship transect using *Sat2Profile* showed satisfying concordance. The transect crosses a region characterized by the presence of the southern sub-tropical gyre, which is known for its ultra-oligotrophic environment. In other words, this nutrient-poor environment is reflected in the lowering of the overall Chla concentration in this gyre and the deepening of the DCM, as seen in the in situ database. The *Sat2Profile* estimation of Chla and DVChla shows an interesting ability of the method to capture the deep DCM and the variability of pigments using surface satellite data in that region of the southern Pacific.

Indeed, the inter-pigment relationships are specific to regions and to trophic states [13], and the variability of these pigments can reflect the influence of environmental factors such as nutrient dependency and water masses on the phytoplankton community structure [61,62].

Uitz et al. and Sauzede et al. [14,18] exploited the data obtained by HPLC to determine the different phytoplankton size classes occurring in the water column based on their contribution to the total Chla [14]. The pigment variability seen in our previously described analysis can be compared to the results of both studies. Indeed, fucox is usually used to estimate microphytoplankton relative abundance and zeax for picophytoplankton. The variability of these two size classes is seen to be antagonistic in the work of [14,18]; more microphytoplankton in a Chla-rich water column, and more picophytoplankton in poor oligotrophic waters. This corresponds also to the variability of fucox and zeax in our temporal study.

However, the difference brought by the presented method is that PSC estimations in [14,18] were constrained by the empirical relationships between Chla and secondary pigments and by a priori hypotheses on the shape of the vertical pigments profiles [15]. In order to avoid biases introduced with these inter-pigment empirical relationships, *Sat2Profile* aims to estimate phytoplankton pigments as a first step. In a later stage, *Sat2Profile* unfolds the opportunity to observe phytoplankton groups derived from pigments and to assess the retrieval of these PFTs from empirical relationships.

The method we present is globally applicable (excluding the Southern Ocean) and generates daily products from 1997–present; this opens the way for multiple new studies. However, several limitations cannot be denied. There are uncertainties resulting from the error propagation in the *Sat2Profile*: through the data completion and the loss of information during the PCA filtering until the retrieval from satellite data. These errors were quantified and addressed in this paper. However, the information retrieved using *Sat2Profile* is one step toward closing the gap of knowledge in the distribution of phytoplankton groups, especially below the surface where sampling of phytoplankton diversity measures has been very scarce.

The existence of direct links between pigment concentrations and phytoplankton functional types implies that we can use this approach to attempt to study their global vertical distribution. This would improve the global spatio-temporal monitoring of the biological pump, crucial in constraining our estimations of the ocean's absorption capacity in a changing climate.

**Author Contributions:** Conceptualization, S.T., R.E.H. and A.A.C.; methodology, S.T., R.E.H., A.P. and A.A.C.; software, A.P.; validation, A.P.; formal analysis, S.T., R.E.H., A.P. and A.A.C.; investigation, S.T., R.E.H., A.P. and A.A.C.; resources, S.T. and C.B.; data curation, R.E.H. and C.B.; writing—original draft preparation, A.P.; writing—review and editing, S.T., R.E.H., A.P. and A.A.C.; visualization, A.P.; supervision, S.T., R.E.H., C.B. and A.A.C.; project administration, S.T. and R.E.H.; funding acquisition, S.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This project was carried out with the support of the Sorbonne Center for Artificial Intelligence (SCAI) of Sorbonne University. A.P. is supported by l'Ecole Universitaire de Recherche IPSL-Climate Graduate School, funding ANR entitled: Programme des Investissements d'Avenir (reference ANR-11-IDEX-0004-17-EURE-0006). R.E.H. is supported by a postdoctoral fellowship from the CNES.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The MAREDAT and Tara Oceans Expedition HPLC data used in this study can be found at https://doi.pangaea.de/10.1594/PANGAEA.793246 (accessed on 1 March 2020). The different merged satellite ocean color data were obtained from the GlobColour project portal (www.globcolour.info) (accessed on 1 March 2020). All the GlobColour products are described in the product user guide version 4.2.1 (https://www.globcolour.info/CDR\_Docs/GlobCOLOUR\_PUG.pdf (accessed on 1 March 2020)) found on the GlobColour portal. The Pathfinder Level 3 Daily Daytime SST Version 5.3 data set was obtained from http://doi:10.7289/V52J68XX/ (accessed on 1 March 2020). Following best practices, the code was deposited into a public domain repository accessible at https://github.com/AgathePuissant/SOM\_PCA (accessed on 1 March 2020). The prerequisite software library SOM Toolbox 2.0 for Matlab, which implements the self-organizing map algorithm, is required (Copyright (C) 1999 by Esa Alhoniemi, Johan Himberg, Jukka Parviainen, and Juha Vesanto); it is accessible at https://github.com/ilarinieminen/SOMToolbox (accessed on 1 March 2020).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


## *Article* **Specular Reflection Detection and Inpainting in Transparent Object through MSPLFI**

**Md Nazrul Islam <sup>1,2,</sup>\*, Murat Tahtali <sup>1</sup> and Mark Pickering <sup>1</sup>**


**Abstract:** Multispectral polarimetric light field imagery (MSPLFI) contains significant information about a transparent object's distribution over spectra, the inherent properties of its surface and its directional movement, as well as intensity, which all together can distinguish its specular reflection. Due to multispectral polarimetric signatures being limited to an object's properties, specular pixel detection of a transparent object is a difficult task because the object lacks its own texture. In this work, we propose a two-fold approach for determining the specular reflection detection (SRD) and the specular reflection inpainting (SRI) in a transparent object. Firstly, we capture and decode 18 different transparent objects with specularity signatures obtained using a light field (LF) camera. In addition to our image acquisition system, we place different multispectral filters from visible bands and polarimetric filters at different orientations to capture images from multisensory cues containing MSPLFI features. Then, we propose a change detection algorithm for detecting specular reflected pixels from different spectra. A Mahalanobis distance is calculated based on the mean and the covariance of both polarized and unpolarized images of an object in this connection. Secondly, an inpainting algorithm that captures pixel movements among sub-aperture images of the LF is proposed. In this regard, a distance matrix for all the four connected neighboring pixels is computed from the common pixel intensities of each color channel of both the polarized and the unpolarized images. The most correlated pixel pattern is selected for the task of inpainting for each sub-aperture image. This process is repeated for all the sub-aperture images to calculate the final SRI task. The experimental results demonstrate that the proposed two-fold approach significantly improves the accuracy of detection and the quality of inpainting. 
Furthermore, the proposed approach also improves the SRD metrics (with mean F1-score, G-mean, and accuracy as 0.643, 0.656, and 0.981, respectively) and SRI metrics (with mean structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), mean squared error (IMMSE), and mean absolute deviation (MAD) as 0.966, 0.735, 0.073, and 0.226, respectively) for all the sub-apertures of the 18 transparent objects in MSPLFI dataset as compared with those obtained from the methods in the literature considered in this paper. Future work will exploit the integration of machine learning for better SRD accuracy and SRI quality.

**Keywords:** specular reflection detection; specular reflection inpainting; transparent object; multispectral polarimetric imagery; light field

#### **1. Introduction**

Specular reflection detection and inpainting (SRDI) has been actively pursued in the computer vision community over the last few decades. The presence of specular reflection creates potential difficulties for tasks such as detection, segmentation, and matching, as it captures significant information about an object's distribution, shape, texture, and roughness features that cause discontinuity in its omnipresent, object-determined diffuse part [1]. Once specular reflection is detected, it may be used to synthesize a scene [2] or to estimate lighting direction and surface roughness [3,4]. At the surface of a transparent object, some of the incoming light is immediately reflected back into space; this is called surface or specular reflection. The rest penetrates the surface and is then reflected back into the air; this is called body or diffuse reflection [5]. Because a transparent object lacks its own texture, detecting and inpainting its specular reflections is always a challenging task [6]. The potential applications of specular reflection detection and inpainting in transparent objects through multispectral polarimetric light field imagery (MSPLFI) include 3D shape reconstruction, detection and segmentation, surface normal generation, and defect analysis.

**Citation:** Islam, M.N.; Tahtali, M.; Pickering, M. Specular Reflection Detection and Inpainting in Transparent Object through MSPLFI. *Remote Sens.* **2021**, *13*, 455. https://doi.org/10.3390/rs13030455

Academic Editors: Tiziana D'Orazio and Jukka Heikkonen Received: 24 November 2020 Accepted: 26 January 2021 Published: 28 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

By integrating advanced communication tools and techniques, multispectral polarimetric imagery (MSPI) can extract meaningful information about an object, such as surface features, shapes, and roughness, from optical sensing images [7]. Its potential applications include imaging systems that perform image denoising [8], image dehazing [9], and semantic segmentation [10]. Multispectral imaging is commonly reported in the literature for enhancing color reproduction [11], illuminant estimation [12], vegetation phenology [13,14], shadow detection [15], and background segmentation [16,17]. However, although a multispectral cue can generate information by penetrating deeper into an object, it sometimes cannot extract the object's inherent features. Together with a polarimetric cue, where specific photoreceptors are used for polarized light vision, MSPI is applied in applications such as specular and diffuse separation [18], material classification [19], shape estimation [20], target detection [21–23], anomaly detection [24], man-made object separation [25], and camouflaged object separation [26]. Recently, the light field (LF) cue has gained popularity in the graphics community for complex tasks such as transparent object recognition [27], classification [28], and segmentation [29] from a background, by analyzing the distortion features of a single shot captured by an LF camera. Each pixel in an LF image has six degrees of freedom, which helps to extract hidden information that MSPI cues cannot capture. The aim of the proposed research is to use the multisensory cues of MSPLFI to effectively detect specular reflection in a transparent object and to suppress it.

Firstly, it is necessary to separate specular reflection from diffuse reflection. Each pixel in MSPLFI can be defined as the sum of specular and diffuse reflections following the dichromatic reflection model [30] as

$$L(\lambda,\,\rho,\,\mathcal{L},\,\theta_{i},\,\theta_{r},\,g) = L_{\text{Spec}}(\lambda,\,\rho,\,\mathcal{L},\,\theta_{i},\,\theta_{r},\,g) + L_{\text{Diff}}(\lambda,\,\rho,\,\mathcal{L},\,\theta_{i},\,\theta_{r},\,g),\tag{1}$$

where *L<sub>Spec</sub>*(*λ*, *ρ*, ℒ, *θ<sub>i</sub>*, *θ<sub>r</sub>*, *g*) is the specular reflection, *L<sub>Diff</sub>*(*λ*, *ρ*, ℒ, *θ<sub>i</sub>*, *θ<sub>r</sub>*, *g*) the diffuse reflection, *λ* the wavelength in the multispectral visible band (400 nm–700 nm), *ρ* the orientation of the polarimetric filter (rotating at 0°, 45°, 90°, 135°), ℒ the LF direction in which the light rays are traveling in space, and *θ<sub>i</sub>*, *θ<sub>r</sub>*, *g* the geometric parameters indicating incidence, viewing, and phase angles, respectively.

The individual components in Equation (1) can be further decomposed into two parts, composition and magnitude, as in Equation (2). Composition is a relative spectral power distribution (*c<sub>Spec</sub>* (surface reflection) or *c<sub>Diff</sub>* (body reflection)) that depends only on wavelength, polarization, and LF and is independent of geometry. Magnitude is a geometric scale factor (*ω<sub>Spec</sub>* or *ω<sub>Diff</sub>*) that depends only on geometry and is independent of the wavelength, polarization, and LF.

$$L(\lambda,\rho,\mathcal{L},\theta_{i},\theta_{r},g) = \omega_{\text{Spec}}(\theta_{i},\theta_{r},g)\,c_{\text{Spec}}(\lambda,\rho,\mathcal{L}) + \omega_{\text{Diff}}(\theta_{i},\theta_{r},g)\,c_{\text{Diff}}(\lambda,\rho,\mathcal{L}).\tag{2}$$
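As a rough numerical illustration of Equation (2), the sketch below composes one pixel value from geometry-only magnitudes and geometry-independent composition vectors. The function name, the three-band sampling, and all numbers are assumptions chosen for illustration; they are not values from the MSPLFI dataset.

```python
import numpy as np

def dichromatic_radiance(w_spec, c_spec, w_diff, c_diff):
    """Equation (2): L = w_spec * c_spec + w_diff * c_diff.

    w_spec, w_diff: scalar geometric scale factors (geometry only).
    c_spec, c_diff: composition vectors sampled over the non-geometric
    variables (here, hypothetically, three spectral bands).
    """
    c_spec = np.asarray(c_spec, dtype=float)
    c_diff = np.asarray(c_diff, dtype=float)
    return w_spec * c_spec + w_diff * c_diff

# Hypothetical compositions over three bands.
c_spec = np.array([1.0, 1.0, 1.0])  # spectrally flat surface reflection
c_diff = np.array([0.2, 0.5, 0.1])  # body reflection

# Changing only the geometry rescales each term but leaves c_spec and c_diff untouched.
pixel = dichromatic_radiance(0.5, c_spec, 2.0, c_diff)
```

The separation into a scalar weight and a vector is the point of the decomposition: a change of viewing geometry alters only the two weights.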

As the appearance of a transparent object is highly biased by its background's texture and color, it is a challenging task to detect, segment, and suppress the specular reflections on it. Through predicting multispectral changes per sub-aperture image in the LF, the proposed research detects specular reflected pixels. In terms of inpainting, as it can be predicted that a pixel in an LF image has six degrees of freedom and can appear within any surrounding four-connected pixels in a sub-aperture image, a pixel pattern with maximum acceptability is selected to suppress an SRD pixel. Briefly, this work first describes the significance of the joint utilization of multisensory cues, then captures an MSPLFI object dataset, proposes a two-fold algorithm for detecting and suppressing specular reflections, evaluates both detection accuracy and suppression quality in terms of distinct statistical metrics and, finally, compares performance with those of some other methods in the existing literature.

The main contribution of this research is two-fold. Firstly, we propose an SRD algorithm that predicts changes in MSPLFI by calculating the mean (*µ*) and covariance (Σ) for each sub-aperture index of the LF and applying the Mahalanobis distance to predict specular reflections. The predicted changes in the unpolarized and polarized images are then averaged, and a threshold is applied to obtain the final SRD pixel mask (SRD-PM). Because no multisensory 6D dataset is publicly available for evaluating the proposed research, we first built an image acquisition system to capture an MSPLFI object dataset. Secondly, we propose an SRI algorithm that extends the final SRD-PM to the immediately neighboring pixels using the RGB channels of both polarized and unpolarized sub-apertures in the LF. For a pixel in the SRD-PM, all the four-connected neighboring pixel patterns per sub-aperture of the LF, excluding those already in the SRD-PM, are carefully selected and a distance matrix is computed based on their intensities. Finally, the pixel pattern with the minimum distance is chosen for the task of inpainting. The performances of these approaches are evaluated and compared using a private MSPLFI object dataset to demonstrate the significance of this research.
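The detection step summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each pixel of one sub-aperture is treated as a vector over the spectral bands, the mean and covariance are estimated globally over all pixels, and the function names, synthetic data, and threshold value are hypothetical.

```python
import numpy as np

def mahalanobis_map(bands):
    """Per-pixel Mahalanobis distance for a (B, H, W) stack of band images."""
    b, h, w = bands.shape
    x = bands.reshape(b, -1).T.astype(float)       # (H*W, B) pixel vectors
    mu = x.mean(axis=0)                            # mean over all pixels
    cov = np.atleast_2d(np.cov(x, rowvar=False))   # (B, B) covariance
    cov_inv = np.linalg.pinv(cov)                  # pseudo-inverse tolerates a singular covariance
    diff = x - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0)).reshape(h, w)

def srd_pixel_mask(unpolarized, polarized, threshold):
    """Average the unpolarized and polarized distance maps, then threshold."""
    score = 0.5 * (mahalanobis_map(unpolarized) + mahalanobis_map(polarized))
    return score > threshold

# Synthetic example: four bands, flat background, one bright pixel at (8, 8).
rng = np.random.default_rng(0)
unpol = rng.normal(0.2, 0.01, size=(4, 16, 16))
pol = rng.normal(0.1, 0.01, size=(4, 16, 16))
unpol[:, 8, 8] += 0.5
pol[:, 8, 8] += 0.5
mask = srd_pixel_mask(unpol, pol, threshold=8.0)   # threshold chosen by hand for this toy data
```

In this toy setting only the injected bright pixel stands far enough from the background statistics to exceed the threshold. On real data the threshold would need to be tuned.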

This paper is organized as follows. In Section 2, the background to SRD and SRI is fully described. In Section 3, the details of the private MSPLFI dataset, including image acquisition setup, multisensory cues, and pixels' degrees of freedom, are analyzed. In Section 4, a complete two-fold SRDI framework and corresponding algorithms are presented with proper mathematical and logical explanations. In Section 5, the performances of the proposed SRD and SRI algorithms are evaluated by distinct statistical metrics. Additionally, detection accuracy and suppression quality of the proposed SRDI are visualized and compared with those of existing approaches. Finally, concluding remarks and suggested future directions are provided in Section 6.

#### **2. Related Works**

SRD techniques usually assume that the intensities of specular pixels vary from those of diffuse ones in multiple spectra as

$$P_{(x,y,c,\lambda,\rho \mid i)} = \begin{cases} 1 & \text{if } d\left(I_{(x,y,c,\lambda,\rho \mid i)},\, S_{(x,y,c,\lambda,\rho \mid i)}\right) > \tau_{G} \\ 0 & \text{otherwise,} \end{cases} \tag{3}$$

where *τ<sub>G</sub>* is a global threshold, *P*<sub>(*x*, *y*, *c*, *λ*, *ρ* | *i*)</sub> the final SRD-PM at pixel (*x*, *y*) of a fused spectrum (*λ*) at a polarimetric orientation (*ρ*) in sub-aperture index *i* of the LF (ℒ), and *d* the distance between the predicted specular pixel (*S*) and the corresponding pixel of the fused image (*I*) in spectrum *λ* at orientation *ρ*. In this section, a brief review of the literature related to SRDI techniques for multisensory cues of MSPLFI is provided.
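The thresholding rule in Equation (3) can be illustrated with a few lines of NumPy. The distance *d* is not fixed by the equation, so the absolute intensity difference is assumed here purely for illustration; the function name and the sample values are likewise hypothetical.

```python
import numpy as np

def srd_mask(fused, predicted_specular, tau_g):
    """Equation (3): 1 where d(I, S) > tau_g, else 0.

    The distance d is assumed to be the absolute difference |I - S|.
    """
    d = np.abs(np.asarray(fused, dtype=float) - np.asarray(predicted_specular, dtype=float))
    return (d > tau_g).astype(np.uint8)

fused = np.array([[0.10, 0.90], [0.30, 0.35]])   # I: fused image patch
pred = np.array([[0.12, 0.20], [0.30, 0.80]])    # S: predicted specular patch
mask = srd_mask(fused, pred, tau_g=0.25)
```

A single global threshold keeps the rule simple but makes the result sensitive to the overall brightness of the scene.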

#### *2.1. Specular Reflection Detection (SRD)*

Recent works on SRD fall into two major categories, single and multiple image-based, where the latter depends on specific conditions such as lighting direction and viewpoint. Based on a single-textured color image, Tan [31] iteratively shifts the maximum chromaticity of each pixel between two neighboring ones. An iteration stops when the chromaticity difference satisfies a certain threshold value and generates a specular-free (SF) image. The final SF image ensures a similar geometrical distribution even though it contains only diffuse reflections. However, for a large image with more specularity, this technique may lead to erroneous diffuse reflections with excessive and inaccurate removal as well as higher computational complexity. Subtracting the minimum color channel value from each channel, Yoon [32] obtains an SF two-band image. Sato [33] integrates the dichromatic reflection model for separation by analyzing color signatures in many images captured with a moving light source. Lin [34] introduces a series of linear basis functions and changes the lighting direction to decompose the reflection components.

The modified SF (MSF) technique introduced by Shen [35] ensures robustness to the influence of noise on chromaticity. It subtracts the minimum RGB value from an input image and works in an iterative manner by selecting a predefined offset value using the least-squares criterion. Nguyen [36] proposes an MSF method that integrates tensor voting to obtain the dominant color and distribution of diffuse reflections in a region. To improve the separation performance, Yamamoto [37] applies a high-emphasis filter on individual reflection components to separate them [35]. However, all these methods suffer from artifacts and inaccuracy if the brightness of the input image is high.
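The minimum-channel subtraction that underlies the SF and MSF images can be sketched as below. This is a simplified illustration and not the exact procedure of [32] or [35]: the iterative, least-squares selection of the offset is omitted, and the function name and `offset` parameter are assumptions.

```python
import numpy as np

def specular_free(image, offset=0.0):
    """Subtract each pixel's minimum RGB value from all channels, then add an offset.

    image: array of shape (..., 3) holding RGB intensities.
    With offset = 0 at least one channel of every pixel becomes zero,
    which is why the result is described as a two-band image.
    """
    image = np.asarray(image, dtype=float)
    min_channel = image.min(axis=-1, keepdims=True)
    return image - min_channel + offset

pixel = np.array([[[0.9, 0.6, 0.5]]])        # one RGB pixel, shape (1, 1, 3)
sf = specular_free(pixel)                    # plain specular-free value
msf = specular_free(pixel, offset=0.1)       # same, shifted by a hypothetical offset
```

Because the subtracted value is the same for all three channels of a pixel, a spectrally flat specular term is removed along with part of the diffuse term, which is one source of the inaccuracies mentioned above.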

Recent literature on SRD reveals that the specular reflection of an object's area has a stronger polarization signature than its diffuse reflection. Placing a polarization filter in front of an imaging sensor, Nayar [18] proposes separating the specular reflection components from an object's surface with heavy textures. Considering the textures and the surface colors of neighboring pixels, many authors [31,38,39] could separate specular reflections through neighboring pixel patterns. Applying a bilateral filter with coefficients, Yang [39] proposes an extension of Tan's [31] method in which the diffuse chromaticity is maximized. Although it provides faster separation and better accuracy, it still suffers from some problems for separating specular reflections in a transparent object. Akashi [40] also employs the dichromatic reflection model to separate specular reflections in single images based on sparse non-negative matrix factorization (NMF) composed of only non-negative values regulated by parameters such as sparse regularization, pixel color, and convergence. Although this method demonstrates better separation accuracy than those of Tan [31] and Yang [39], inaccurate parameter settings may lead to artifacts in the separation of specular reflections.

An SUV color space for separating specular and diffuse reflections from S and UV channels, respectively, of a single image or image sequence in an iterative manner is proposed by Mallick [38]. However, discontinuities in the surface color may lead to erroneous detection of specular reflections. In [41], Arnold applies image segmentation based on non-linear filtering and thresholding to separate specular and diffuse reflections in medical imaging. Saint [42] proposes increasing the gap between two reflection components and then applying a non-linear filter to isolate spike components in an image histogram. In [43], Meslouhi integrates the dichromatic reflection model to detect specular reflections. In our research, we use multisensory cues to detect specular reflections by predicting changes among multiband data.

#### *2.2. Specular Reflection Inpainting (SRI)*

SRI refers to restoring an SRD pixel pattern with semantically and visually believable content through analyzing neighboring pixel patterns. Recent works in the literature on SRI depend mainly on patch-based similarity, with similar patch- or diffusion-based inpainting proposed to fill an SRD pixel pattern by spreading color intensities from its background to its holes [8,9,44,45]. Traditional inpainting approaches apply an interpolation technique on the surrounding pixels to restore an SRD pixel pattern [46,47]. Based on temporal information in an endoscopic video image sequence, Vogt [48] proposes a well-inpainting method. Cao [49] develops an inpainting technique that averages the pixels in a sliding rectangular window and then replaces the SRD pixel with this average. Although this method is simple and relatively fast to compute, it lacks robustness because the window size varies with the SRD's connected pixels. In [50], Oh calculates the average intensity of a contour to replace the SRD pixels, but this may lead to strong gradients.
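A window-average inpainting of the kind described above can be sketched as follows. This is a simplified illustration rather than the method of [49]: the window size is fixed, and excluding other masked pixels from the average is an assumption made here so that specular values do not leak into the result. The function name and parameters are hypothetical.

```python
import numpy as np

def window_mean_inpaint(image, mask, radius=1):
    """Replace each masked pixel by the mean of the unmasked pixels in its window.

    image: (H, W) or (H, W, C) array; mask: (H, W) boolean SRD pixel mask.
    """
    image = np.asarray(image, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    out = image.copy()
    h, w = mask.shape
    for y, x in zip(*np.nonzero(mask)):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        valid = ~mask[y0:y1, x0:x1]
        if valid.any():  # leave the pixel unchanged if the window has no clean pixel
            out[y, x] = image[y0:y1, x0:x1][valid].mean(axis=0)
    return out

image = np.full((3, 3), 0.2)
image[1, 1] = 1.0                      # a single specular pixel
mask = np.zeros((3, 3), dtype=bool)
mask[1, 1] = True
restored = window_mean_inpaint(image, mask)
```

The fixed radius illustrates the robustness issue noted above: a connected specular region wider than the window leaves its interior pixels without any clean neighbor.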

In [41], Arnold proposes a two-level inpainting technique which replaces SRD pixels with the centroid color within a certain distance and applies a Gaussian kernel for smoothing using a binary weight mask. Although the inpainting quality is better than those of other methods, it may produce some artifacts and blur for large spectral areas by integrating a partial differential equation with gradient thresholding. In [51], Yang proposes a convex model for suppressing the reflection from a single input image. In [52], Criminisi describes an image inpainting method in which an affected region is filled by some exemplars. As these techniques may produce artifacts and fail to suppress large reflection areas, our proposed method reconstructs the specular reflected pixels through analyzing their four-connected neighbors in the sub-apertures of the 4D-LF.

#### **3. Analysis of MSPLFI Transparent Object Dataset**

Regarding SRD and SRI, the proposed research uses multisensory cues through capturing different objects in MSPLFI, each of which is defined as a function of six dimensions as

$$L_{6D} = L(u, v, s, t, \lambda, \rho),\tag{4}$$

where (*u*, *v*) is the image plane referring to an image's spatial dimensions, (*s*, *t*) the viewpoint plane referring to the direction in which the light rays are traveling in space, *λ* the wavelength in the multispectral visible band (400 nm–700 nm), and *ρ* the orientation of the polarimetric filter (rotating at 0°, 45°, 90°, 135°).
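One way to picture Equation (4) is as a six-axis array indexed in the same order as the function arguments. The sketch below is only an illustration: the spatial size, the number of bands, and the helper name `view` are assumptions, and the array is filled with zeros rather than real data.

```python
import numpy as np

U, V = 8, 6                       # spatial size (hypothetical, kept tiny)
S, T = 11, 11                     # viewpoint grid (hypothetical size)
N_BANDS = 7                       # number of spectral bands (hypothetical)
ORIENTATIONS = (0, 45, 90, 135)   # polarizer orientations in degrees

# Axes follow Equation (4): (u, v, s, t, lambda, rho).
lf6d = np.zeros((U, V, S, T, N_BANDS, len(ORIENTATIONS)), dtype=np.float32)

def view(lf, s, t, band, orientation_deg):
    """Return the 2D image seen from viewpoint (s, t) in one band and orientation."""
    rho = ORIENTATIONS.index(orientation_deg)
    return lf[:, :, s, t, band, rho]

# Zero-based index (5, 5) is the center of an 11 x 11 viewpoint grid.
center = view(lf6d, 5, 5, band=0, orientation_deg=90)
```

Fixing the last four indices leaves an ordinary 2D image, which is the form in which individual sub-aperture images are processed.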

In this section, acquisition of the MSPLFI object dataset and then its use for detecting and suppressing specular reflections in a transparent object are described.

#### *3.1. Experimental Setup*

As there is no dataset available for the evaluation of SRDI in a transparent object that integrates multiple cues of MSPLFI, Figure 1 illustrates our setup for image acquisition to generate a problem-specific object dataset in a constrained environment with a plenoptic camera, Lytro Illum, used to capture all the LF images. We place different band filters in front of the camera to capture multispectral images and a linear polarization filter rotating at 0°, 45°, 90°, and 135° to manually obtain different polarimetric images with two light sources used to obtain accurate spectral reflections. The lighting is similar for different objects, and we retain the same background for them, which completely matches most of the objects in most of the area with the purpose of creating a complex environment from which to segment a whole object. One of the light sources is located beside the camera lens at a 45° angle and the other is located above the object's location. The energy levels of multiple spectra are not similar; however, individual cues contain a usable amount of information when capturing MSPLFI.

**Figure 1.** Schematic diagram of the proposed image acquisition in multispectral polarimetric light field imagery (MSPLFI).

#### *3.2. MSPLFI Transparent Object Dataset*

In Figure 2, the median specular reflections of the sub-aperture images of 18 transparent objects (O#1–O#18) captured through MSPLFI are presented with their corresponding labels. To evaluate the performance of the image inpainting technique, some balls are placed inside object O#1.

*Remote Sens.* **2021**, *13*, x FOR PEER REVIEW 6 of 30

**Figure 2.** Median specular reflections of MSPLFI for different objects: (O#1) round container with ball; (O#2) classic jug; (O#3) empty round container; (O#4) jar with cork lid; (O#5) sauce container; (O#6) ice glass; (O#7) clear glass jar; (O#8) coffee cup; (O#9) cuvee tumbler; (O#10) glass tumbler; (O#11) teacup; (O#12) water glass; (O#13) Bordeaux wine glass; (O#14) red wine glass; (O#15) hi-ball glass; (O#16) food box; (O#17) jar with cork handle; (O#18) port wine glass.

We consider five different shots for each spectrum of each object. One corresponds to the unpolarized version of the image, captured without a polarization filter, and the other four to the four polarization filter orientations (0°, 45°, 90°, and 135°) of a linear polarizer. We consider multiple spectra in the visible range (400 nm–700 nm) to obtain images in the multispectral environment. Figure 3 shows the center sub-aperture images of object O#8 in multiple color bands of violet, blue, green, yellow, orange, red, pink, and RGB in polarized and unpolarized versions. As can be seen, due to the nature of polarization, on average, 50% of the photons get blocked while passing through a lossless polarizer at different orientations.
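The 50% figure can be checked numerically with Malus's law, under the assumption that unpolarized light behaves as an equal mix of all linear polarization angles and that the polarizer is ideal. The function name and the sampling density are hypothetical choices for this sketch.

```python
import numpy as np

def transmitted_fraction(polarizer_deg, n_angles=3600):
    """Mean transmission of an ideal linear polarizer for unpolarized light.

    Each polarization angle is transmitted with cos^2 of its offset from
    the polarizer axis (Malus's law); the mean is taken over a uniform
    sweep of input angles.
    """
    angles = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    return float(np.mean(np.cos(angles - np.deg2rad(polarizer_deg)) ** 2))

# Fraction of light passed at the four orientations used in the dataset.
fractions = {deg: transmitted_fraction(deg) for deg in (0, 45, 90, 135)}
```

Under these assumptions the transmitted fraction is one half regardless of the polarizer orientation, so the polarized shots are darker than the unpolarized shot by roughly a factor of two on average.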

The LF images are 4D data obtained from different viewpoints, with each image presented as a sub-aperture plane (*s*, *t*) with its tangent direction (*u*, *v*). In our experiments, we consider 11 × 11 sub-aperture images, including their center viewpoints, with their spatial representations denoted by (*u*, *v*). Figure 4 shows the 4D-LF images of object O#8 in the violet color band, with the center viewpoint image at the cross-section of the S and the T lines denoted as the (6,6) position in the hyperplane (*s*, *t*, *u*, *v*).

**Figure 3.** Multiband polarimetric images of object O#8 (seven individual bands and RGB band in visible range (400 nm–700 nm) at four polarimetric orientations (0°, 45°, 90°, and 135°) and in the no-polarization setting). Row labels: NP, LP-0°, LP-45°, LP-90°, LP-135°.

**Figure 4.** Captured 4D light field images through Lytro camera (sample object O#8 with 121 sub-aperture images).

#### *3.3. Degrees of Freedom*

Figure 5 presents an example of object O#1's scene flow among its sub-aperture images and their relative directions. In Figure 5a, the arrow indicates that all the viewpoint images' motion flows to the center viewpoint image and, in Figure 5b, each pixel has six degrees of freedom in the LF images, with the region of interest (ROI) regarding the scene flow indicated by a yellow rectangle. In Figure 5c, the pixel displacements are shown with their corresponding intensity flow plots, which confirm that the intensity of the ROI varies in different viewpoints.

*Remote Sens.* **2021**, *13*, x FOR PEER REVIEW 8 of 30

**Figure 5.** Scene flows in light field (LF) imagery: (**a**) views of positional and directional movements corresponding to central viewpoint; (**b**) each pixel in LF imagery has six degrees of freedom, with region of interest (ROI) indicated by yellow rectangle; and (**c**) example of ROI displacement and corresponding intensity plot.

#### **4. Proposed Two-fold SRDI Framework**

In this section, the proposed two-fold SRDI framework, which is based on the distinctive features of MSPLFI cues, is discussed; it is presented in Figure 6. Firstly, a 6D dataset of different transparent objects is captured, and then the Reed-Xiaoli (RX) detector [53] is applied to obtain the actual specular reflection of an object by predicting changes among the multiple bands. Secondly, a pixel neighborhood-based inpainting method for suppressing this reflection is proposed.

**Figure 6.** Proposed two-fold framework for specular reflection detection (SRD) and specular reflection inpainting (SRI).

#### *4.1. Specular Reflection Detection (SRD)*

The proposed system detects specular reflected pixels in transparent objects by predicting multiband changes. Firstly, a raw lenslet (.LFR) image is decoded into a 4D (*s*, *t*, *u*, *v*) LF image, where (*s*, *t*) denotes the image's position in the hyperplane and (*u*, *v*) its spatial region. The MSPLF imagery was captured by the Lytro Illum camera, which can capture 15 × 15 sub-apertures per shot. However, because the main lens of the camera is circular, vignetting occurs at its edge. Hence, only the inner 11 × 11 sub-apertures are retained. It could be argued that a few more sub-apertures at the top, bottom, left, and right could be as good as, if not better than, the corner sub-apertures kept in the 11 × 11 array, but excluding them keeps the retained sub-apertures in a square array for simplicity. As our main purpose is to detect and suppress specularity in a transparent object, we maximize the object's area with a minimum of surrounding background. To compute the specular reflections in unpolarized images, we convert all the multiband unpolarized 4D LF images into their corresponding grayscale images. For each sub-aperture index, we store the individual band images in a column vector, with their mean (*µ*) and covariance (Σ) calculated for the Mahalanobis distance as

$$\sqrt{\left(\mathbf{x} - \boldsymbol{\mu}\right)^{T} \boldsymbol{\Sigma}^{-1} \left(\mathbf{x} - \boldsymbol{\mu}\right)},\tag{5}$$
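As an illustration of Equation (5), the following is a minimal sketch of an RX-style distance map for one sub-aperture index. It assumes the grayscale band images are stacked in a NumPy array of shape (bands, height, width); the function name `rx_distance`, the use of the sample covariance, and the pseudo-inverse are assumptions made for this example rather than details taken from the paper.

```python
import numpy as np


def rx_distance(bands):
    """Per-pixel Mahalanobis distance across bands, as in Equation (5).

    bands: array of shape (n_bands, H, W) holding the grayscale band
    images of a single sub-aperture (layout assumed for this sketch).
    Returns an (H, W) distance map.
    """
    n_bands, h, w = bands.shape
    # One row per pixel, one column per band.
    x = bands.reshape(n_bands, -1).T.astype(np.float64)
    mu = x.mean(axis=0)                      # mean vector over all pixels
    sigma = np.cov(x, rowvar=False)          # band-by-band sample covariance
    # Pseudo-inverse guards against a singular covariance (assumption).
    sigma_inv = np.linalg.pinv(sigma)
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) for every pixel at once.
    d2 = np.einsum("ij,jk,ik->i", diff, sigma_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0)).reshape(h, w)
```

Pixels whose band-wise intensities deviate strongly from the scene statistics receive large distances, which is the behavior the detector relies on.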

The 2D distance matrix represents the changes among the multiband images per sub-aperture index, which is also observed as specular reflection. We also predict the maximum specularity in unpolarized 4D images. To obtain the specular reflections in polarized images, we first calculate the Stokes parameters (*S*<sub>0</sub>–*S*<sub>2</sub>) [54], which describe the linear polarization characteristics using a three-element vector (*S*), as shown in Equation (6), where *S*<sub>0</sub> represents the total intensity of light, *S*<sub>1</sub> the difference between the horizontal and vertical polarizations, and *S*<sub>2</sub> the difference between the linear +45° and –45° polarizations. *I*<sub>0°</sub>, *I*<sub>45°</sub>, *I*<sub>90°</sub>, and *I*<sub>135°</sub> are the input images for the system at polarization angles of 0°, 45°, 90°, and 135°, respectively.

$$\mathbf{S} = \begin{bmatrix} S\_0 \\ S\_1 \\ S\_2 \end{bmatrix} = \begin{bmatrix} I\_{0^\circ} + I\_{90^\circ} \\ I\_{0^\circ} - I\_{90^\circ} \\ I\_{45^\circ} - I\_{135^\circ} \end{bmatrix}, \tag{6}$$

The degree of linear polarization (*DoLP*) is a measure of the proportion of the linear polarized light relative to the light's total intensity, and the angle of linear polarization (*AoLP*) is the orientation of the major axis of the polarization ellipse, which represents the polarizing angle where the intensity should be the strongest. They are derived from the Stokes vector according to Equations (7) and (8), respectively. To calculate the linear polarized image, firstly, the polarimetric components are concatenated, as shown in Equation (9). Then, a concatenated image is generated in the hue, saturation, value (HSV) color space and converted to the RGB color space, as in Equation (10), where *LP* stands for linear polarization.

$$DoLP = \frac{I\_{pol}}{I\_{tot}} = \frac{\sqrt{S\_1^2 + S\_2^2}}{S\_0} \,\tag{7}$$

$$AoLP = \frac{1}{2} \tan^{-1} \left(\frac{S\_2}{S\_1}\right) \tag{8}$$

$$hsv = \left(\left(AoLP + \text{ }\pi/2\right)/\pi\right) \left(DoLP \times \text{ }\newline \right) \text{ S}\_{0\prime} \tag{9}$$

$$LP = RGB \text{ (}hsv\text{)},\tag{10}$$
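The chain from Equations (6) to (10) can be sketched per pixel as follows. This is only an illustrative sketch: intensities are assumed to lie in [0, 1], `atan2` is used in place of the plain arctangent so that *S*<sub>1</sub> = 0 is handled, the saturation channel uses *DoLP* without any extra scaling factor, and the parameter `s0_max` used to normalize the value channel is hypothetical.

```python
import colorsys
import math


def stokes(i0, i45, i90, i135):
    """Equation (6): linear Stokes components from four polarizer angles."""
    return i0 + i90, i0 - i90, i45 - i135


def dolp(s0, s1, s2):
    """Equation (7); a dark pixel (S0 = 0) is mapped to 0 (assumption)."""
    return math.hypot(s1, s2) / s0 if s0 > 0 else 0.0


def aolp(s1, s2):
    """Equation (8), via atan2; the result lies in [-pi/2, pi/2]."""
    return 0.5 * math.atan2(s2, s1)


def linear_polarization_rgb(i0, i45, i90, i135, s0_max=2.0):
    """Equations (9)-(10): map (AoLP, DoLP, S0) to HSV, then to RGB.

    s0_max is a hypothetical normalizer for the value channel; with
    intensities in [0, 1], S0 cannot exceed 2.
    """
    s0, s1, s2 = stokes(i0, i45, i90, i135)
    hue = (aolp(s1, s2) + math.pi / 2) / math.pi   # angle mapped to [0, 1]
    sat = min(dolp(s0, s1, s2), 1.0)               # polarized fraction
    val = min(s0 / s0_max, 1.0)                    # total intensity
    return colorsys.hsv_to_rgb(hue, sat, val)
```

For example, a fully horizontally polarized pixel (`i0 = 1`, `i90 = 0`, `i45 = i135 = 0.5`) has *DoLP* = 1 and *AoLP* = 0, whereas an unpolarized pixel has *DoLP* = 0 and is rendered as a gray level.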

For each sub-aperture index of *DoLP* and *LP*, we store the individual band images in a separate column vector. Then, a procedure similar to that used for unpolarized specular detection is followed to calculate the maximum specularity in the *LP* and the *DoLP* 4D imagery. The average of the three specularities (RX-NP, RX-LP, and RX-DoLP) gives the overall predicted specularity in an object of MSPLFI, with a threshold (Otsu's method, in the range (0–1)) applied to obtain the SRD pixels in binary form. The complete process for detecting specular pixels in a transparent object is described in Algorithm 1.



21: Apply threshold (*τ*) to binarize SRD pixels
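The fusion and thresholding described above (averaging the three RX maps and binarizing with Otsu's method) might be sketched as below. The min-max normalization of each map to (0–1), the 256-bin histogram, and the names `normalize`, `otsu_threshold`, and `srd_mask` are assumptions of this sketch, not details confirmed by the paper.

```python
import numpy as np


def normalize(m):
    """Min-max scale a map to [0, 1] (assumed pre-processing step)."""
    m = m.astype(np.float64)
    span = m.max() - m.min()
    return (m - m.min()) / span if span > 0 else np.zeros_like(m)


def otsu_threshold(img, bins=256):
    """Threshold in [0, 1] that maximizes the between-class variance."""
    hist, edges = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                 # weight of the class below the cut
    m = np.cumsum(p * centers)        # cumulative first moment
    w1 = 1.0 - w0                     # weight of the class above the cut
    valid = (w0 > 0) & (w1 > 0)
    between = np.zeros(bins)
    between[valid] = (m[-1] * w0[valid] - m[valid]) ** 2 / (w0[valid] * w1[valid])
    return edges[int(np.argmax(between)) + 1]


def srd_mask(rx_np, rx_lp, rx_dolp):
    """Average the three specularity maps and binarize with threshold tau."""
    fused = (normalize(rx_np) + normalize(rx_lp) + normalize(rx_dolp)) / 3
    fused = np.clip(fused, 0.0, 1.0)
    tau = otsu_threshold(fused)
    return fused > tau                # True marks a predicted SRD pixel
```

The returned boolean mask plays the role of the binary SRD prediction that is passed on to the inpainting stage.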

#### *4.2. Specular Reflection Inpainting (SRI)*

In this research, the SRD pixels are suppressed by analyzing the distances among four connected neighboring pixels. Firstly, four different regions in an image are identified, as shown in Figure 7. Algorithm 1 predicts region A as SRD pixels but, for better inpainting quality, both regions A and B are considered specular reflected pixels. It is to be noted that region B contains the pixel patterns (color channels) that are the immediate neighbors of region A. Then, all the connected regions are identified and labeled for the task of inpainting. The complete process for inpainting the detected specular pixels in a transparent object is described in Algorithm 2.

**Algorithm 2.** SRI in Transparent Object

**Input:** MSPLFI Object Dataset, SRD-PM

**Output:** SRD Pixel Inpainting in RGB

1: Strengthen SRD-PM (output from Algorithm 1) by labeling all neighboring pixels as SRD ones

2: Compute connected components and label them



3: Calculate baseline image per sub-aperture index by taking minimum pixel intensities of both polarized and unpolarized images in RGB channels

4: **for** all common sub-aperture images **do**

5: **for** all labels **do**

… minimum sum of *dM*<sub>(row, *k*)</sub>, as in Equations (13) and (14), for inpainting of specular reflections

10: **end if**

11: **end for**

12: **end for**

13: **end for**

14: **repeat** steps 4 to 13 to calculate maximum specular reflection in suppressed image of transparent object from already suppressed sub-apertures

A baseline image per sub-aperture index is computed by taking the minimum pixel intensities in both polarized and unpolarized RGB channels. The aim is to suppress the specular reflected areas in the image, with the distance between two pixel patterns calculated by

$$d\_{\left(j,k\mid x,y\right)} = \sqrt{\sum\_{c=R,G,B} \left(P\_{\left(x,y,c,j\mid i\right)} - P\_{\left(x,y,c,k\mid i\right)}\right)^2}, \tag{11}$$

where *P*<sub>(*x*, *y*, *c*, *j* | *i*)</sub> and *P*<sub>(*x*, *y*, *c*, *k* | *i*)</sub> are two of the four-connected neighbors of the pixel pattern *P*<sub>(*x*, *y*, *c* | *i*)</sub> in sub-aperture index *i*, and *d*<sub>(*j*, *k* | *x*, *y*)</sub> is the distance between the two pixel patterns corresponding to *P*<sub>(*x*, *y*, *c* | *i*)</sub> in sub-aperture index *i*. A 2D matrix [55] of the distances among the pixel patterns is calculated by Equation (12). The pattern corresponding to the lowest column-wise sum of the distances is selected as the winning one (*P*<sub>(*x*, *y*, *c*, *IDX* | *i*)</sub>) for the task of SRI in Equations (13) and (14).

$$dM\_{(\text{row},\text{col})} = \begin{pmatrix} d\_{(j-4,k-4 \mid x,y)} & \dots & d\_{(j+4,k-4 \mid x,y)} \\ \vdots & \ddots & \vdots \\ d\_{(j-4,k+4 \mid x,y)} & \dots & d\_{(j+4,k+4 \mid x,y)} \end{pmatrix} \tag{12}$$

$$IDX = \mathop{\arg\min}\_{k} \sum dM\_{(\text{row}, k)} \tag{13}$$

$$P\_{(x,y,c \mid i)} = P\_{(x,y,c,IDX \mid i)} \tag{14}$$
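The selection rule in Equations (11) to (14) can be illustrated with a small sketch. It assumes each candidate pixel pattern is an (R, G, B) tuple taken from the neighborhood of the pixel being inpainted; the function names and the way the candidates are gathered are hypothetical and are not taken from Algorithm 2.

```python
import math


def pattern_distance(p, q):
    """Equation (11): Euclidean distance between two RGB pixel patterns."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def winning_pattern(neighbors):
    """Equations (12)-(13): pick the pattern with the lowest column sum.

    neighbors: list of (R, G, B) tuples (candidate patterns, assumed to
    be collected from the neighborhood of the pixel being inpainted).
    Returns (index, pattern) of the winner; ties go to the first index.
    """
    # Equation (12): pairwise distance matrix among the candidates.
    dm = [[pattern_distance(p, q) for q in neighbors] for p in neighbors]
    # Column-wise sums of the distance matrix.
    col_sums = [sum(row[k] for row in dm) for k in range(len(neighbors))]
    # Equation (13): index of the smallest column sum.
    idx = min(range(len(neighbors)), key=col_sums.__getitem__)
    return idx, neighbors[idx]


def inpaint_pixel(neighbors):
    """Equation (14): the pixel takes the value of the winning pattern."""
    return winning_pattern(neighbors)[1]
```

Because the winner minimizes the summed distance to all other candidates, an outlying candidate (for example, a residual highlight) is unlikely to be chosen as the replacement value.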


#### **5. Experimental Results**

In this section, performance evaluations and comparisons of the proposed two-fold SRDI and other approaches using different metrics for specular pixel detection and inpainting are discussed. Additionally, analyses of their computational times are conducted.

#### *5.1. Selection of Performance Evaluation Metric*

Both SRD and SRI are evaluated by commonly used statistical evaluation metrics for quantifying detection accuracy and inpainting quality.

#### 5.1.1. Selection of SRD Metric

The SRD method is evaluated at the pixel level of a binarized scene in which the pixels related to the specular and the diffuse reflections are white and black, respectively. Its performance can be divided into four pixel-wise classification results: true positive (*T<sub>p</sub>*), which means a correctly detected diffuse pixel; false positive (*F<sub>p</sub>*), that is, a specular reflected pixel incorrectly detected as a diffuse reflected one; true negative (*T<sub>n</sub>*), which indicates a correctly detected pixel with specularity; and false negative (*F<sub>n</sub>*), that is, a diffuse reflected pixel incorrectly detected as a specular reflected one. The binary classification metrics used are precision, recall or sensitivity, F1-score, specificity, geometric-mean (G-mean), and accuracy. Precision is the proportion of pixels detected as diffuse reflected that are actually diffuse reflected, while recall is the proportion of actual diffuse reflected pixels that are detected as such (recall and sensitivity are the same quantity). The F1-score (a boundary F1 measure) is the harmonic mean of the precision and recall values; it measures how closely the predicted boundary of an object matches its ground truth and is an overall indicator of the performance of binary segmentation. Specificity (a *T<sub>n</sub>* fraction) is the proportion of actual negatives predicted as negatives, sensitivity (a *T<sub>p</sub>* fraction) is the proportion of actual positives predicted as positives, G-mean is the square root of the product of specificity and sensitivity, and accuracy is the proportion of true results obtained, either *T<sub>n</sub>* or *T<sub>p</sub>*. The mathematical definitions of these metrics are given in Equations (15) to (20) [17,56].

$$Precision\ \left(PR\right) = \frac{T\_p}{T\_p + F\_p} \,\tag{15}$$

$$Recall\ \left(\text{RC}\right)\text{ or }\text{Sensitivity}\left(\text{SN}\right) = \frac{T\_p}{T\_p + F\_n},\tag{16}$$

$$F1 - Score\,\left(F1S\right) = 2 \times \frac{Precision \times Recall}{Precision + Recall} \,\tag{17}$$

$$Specificity\ (SP) = \frac{T\_n}{T\_n + F\_p} \,\tag{18}$$

$$\text{Geometric} - \text{Mean (GM)} = \sqrt{\text{Specificity} \times \text{Sensitivity}},\tag{19}$$

$$Accuracy\ \left(AC\right) = \frac{T\_p + T\_n}{T\_p + F\_n + T\_n + F\_p},\tag{20}$$
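Equations (15) to (20) can be computed directly from the four pixel counts, as in the minimal sketch below. The function name `srd_metrics` and the convention of returning 0 when a denominator is zero are assumptions of this example.

```python
import math


def _ratio(num, den):
    """Safe division; an empty denominator yields 0.0 (assumption)."""
    return num / den if den else 0.0


def srd_metrics(tp, fp, tn, fn):
    """Binary classification metrics of Equations (15)-(20)."""
    precision = _ratio(tp, tp + fp)                       # Eq. (15)
    recall = _ratio(tp, tp + fn)                          # Eq. (16), sensitivity
    f1 = _ratio(2 * precision * recall, precision + recall)  # Eq. (17)
    specificity = _ratio(tn, tn + fp)                     # Eq. (18)
    g_mean = math.sqrt(specificity * recall)              # Eq. (19)
    accuracy = _ratio(tp + tn, tp + fn + tn + fp)         # Eq. (20)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
        "g_mean": g_mean,
        "accuracy": accuracy,
    }
```

As a quick consistency check of Equation (17) against Table 1, the Ak. [40] entries for O#1 (precision 0.178, recall 0.628) give 2 × 0.178 × 0.628 / (0.178 + 0.628) ≈ 0.277, which agrees with the tabulated F1-score.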

#### 5.1.2. Selection of Inpainting Quality Metric

Currently, the quality of a fused image can be quantitatively evaluated using the following metrics [57]: structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), mean squared error (IMMSE), and mean absolute deviation (MAD). The SSIM is an assessment index of image quality based on computations of the luminance, contrast, and structural components of the reference and the reconstructed images, with the overall index a multiplicative combination of these three components. The PSNR between the reference and the suppressed images is computed in decibels (dB), with higher values of SSIM and PSNR indicating better quality of the reconstructed or the suppressed image. The IMMSE is the average squared error between the reference and the reconstructed images, while MAD is the sum of the absolute differences between the pixel values of these images divided by the total number of pixels, which is used to measure the standard error of the reconstructed image. Lower values of IMMSE and MAD indicate better quality of the reconstructed image. Considering two images (*x* and *y*), these metrics are defined in Equations (21) to (24).

$$SSIM(x, y) = \left[l(x, y)^{\alpha}\right] \cdot \left[c(x, y)^{\beta}\right] \cdot \left[s(x, y)^{\gamma}\right],\tag{21}$$

where

$$l(x, y) = \frac{2\mu\_x\mu\_y + C\_1}{\mu\_x^2 + \mu\_y^2 + C\_1}, \quad c(x, y) = \frac{2\sigma\_x\sigma\_y + C\_2}{\sigma\_x^2 + \sigma\_y^2 + C\_2}, \quad s(x, y) = \frac{\sigma\_{xy} + C\_3}{\sigma\_x\sigma\_y + C\_3},$$

and *µ<sub>x</sub>*, *µ<sub>y</sub>*, *σ<sub>x</sub>*, *σ<sub>y</sub>*, and *σ<sub>xy</sub>* are the local means, standard deviations, and cross-covariance of images *x* and *y*.

$$PSNR(x, y) = 10 \cdot \log\_{10} \left( \frac{MAX\_I^2}{IMMSE(x, y)} \right), \tag{22}$$

where *MAX<sub>I</sub>* denotes the range of the image (*x* or *y*) data type.

$$IMMSE(x, y) = \frac{1}{n} \sum\_{i=1}^{n} (x\_i - y\_i)^2 \tag{23}$$

$$MAD\ (x, y) = \frac{1}{n} \sum\_{i=1}^{n} \left| x\_i - y\_i \right|, \tag{24}$$
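Equations (22) to (24) reduce to a few lines of arithmetic, sketched below for two images given as flat sequences of pixel values of equal length. The default `max_i = 255.0` (an 8-bit range) and the convention of returning infinity for identical images are assumptions of this example; SSIM is omitted because it depends on windowing choices and constants that are not specified here.

```python
import math


def immse(x, y):
    """Equation (23): mean squared error between two pixel sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def mad(x, y):
    """Equation (24): mean absolute deviation between two pixel sequences."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def psnr(x, y, max_i=255.0):
    """Equation (22): peak signal-to-noise ratio in dB.

    max_i is the range of the image data type (255 assumed for 8-bit
    images); identical images give infinity by convention (assumption).
    """
    mse = immse(x, y)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_i ** 2 / mse)
```

For a four-pixel reference `[0, 0, 0, 0]` and a reconstruction `[0, 0, 0, 10]`, the squared error is 25 and the absolute deviation is 2.5, so a single badly inpainted pixel affects IMMSE (and therefore PSNR) more strongly than MAD.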

#### *5.2. Generation of Ground Truth*

To evaluate the performance of the proposed two-fold SRDI, we generate two different ground truths for each object, as shown in Figure 8. The SRD and the SRI ones are created manually by an expert, with the maximum possible specular reflected area in the MSPLFI object dataset covered. Figure 8 shows the two-way SRD ground truth generation, where a pixel with an intensity above a threshold (Otsu's method and in the range (0–1)) level is considered a specular reflected pixel. The final column in Figure 13 presents the objects' SRD binary ground truths, with black and white pixels indicating their diffuse and specular reflected pixels, respectively. The final column in Figure 18 shows the objects' SRI ground truths. Due to the real scene in the MSPLFI object dataset, some pixels in an object may exhibit amounts of both specular and diffuse reflections but, to measure the performance in terms of quantity and enable further comparisons, each pixel is classified manually as either specular or diffuse reflected, and the ground truth is re-named as the quasi-ground truth.

**Figure 8.** Quasi-ground truth of SRD.

#### *5.3. Performance Evaluation of SRD*

#### 5.3.1. Analysis of SRD Rate

Figures 9 and 10 show the SRD rates in terms of the SRD metrics of precision, recall, F1-score, G-mean, and accuracy for nine sample objects separately (Figure 9) and for all objects (O#1–O#18) together (Figure 10) using the proposed method. For each object, a total of 121 sub-aperture images are used to measure its specularity, and box plots are used to statistically analyze our experiments. Figure 9 exhibits the SRD metric values obtained for nine sample objects separately. The remaining objects are presented in Appendix A (Figure A1). Accuracy has a higher median value than the F1-score and the G-mean for all the objects, with O#9 and O#3 having superior median values of 0.804, 0.832, and 0.996, and 0.874, 0.882, and 0.991 for F1-score, G-mean, and accuracy, respectively, compared with those of the other objects.

Similarly, Figure 10 shows the combined SRD rates for 121 sub-aperture + 1 maximum images × 18 objects = 2196 images. Accuracy has better overall median and 75th percentile values for all the objects combined (0.981 and 0.992, respectively) compared to the F1-score (0.643 and 0.770, respectively) and the G-mean (0.656 and 0.752, respectively).

**Figure 9.** Evaluation results for SRD performances of proposed method for 122 specular reflected images (121 sub-apertures + 1 maximum) of nine sample objects separately using different SRD metrics.

**Figure 10.** Evaluation results for SRD performances of proposed method for 122 specular reflected images (121 sub-aperture + 1 maximum) × 18 objects = 2196 images for all objects (O#1–O#18) combined using different SRD metrics.

5.3.2. Comparison of SRD Rates of Proposed Method and Those in Literature

**Table 1.** Performance evaluation of different methods in terms of various SRD metrics for 18 objects (O#1–O#18) in MSPLFI

**Methods Metrics Object Index (Maximum SRD) Overall** 

**O#1 O#2 O#3 O#4 O#5 O#6 O#7 O#8 O#9 O#10 O#11 O#12 O#13 O#14 O#15 O#16 O#17 O#18 Mean (SA)** 

Precision 0.178 0.348 0.686 0.445 0.600 0.354 0.460 0.382 0.655 0.519 0.240 0.311 0.336 0.124 0.522 0.542 0.504 0.123 **0.362 ± 0.24**  Recall 0.628 0.629 0.662 0.427 0.514 0.345 0.426 0.417 0.771 0.536 0.598 0.866 0.658 0.622 0.466 0.727 0.328 0.747 **0.512 ± 0.14**  F1-Score 0.277 0.448 0.673 0.436 0.554 0.350 0.443 0.398 0.708 0.528 0.342 0.457 0.445 0.207 0.493 0.621 0.398 0.211 **0.377 ± 0.16**  G-Mean 0.769 0.781 0.810 0.644 0.710 0.578 0.644 0.634 0.874 0.722 0.749 0.917 0.795 0.754 0.676 0.835 0.567 0.834 **0.689 ± 0.10**  Accuracy 0.935 0.962 0.981 0.943 0.957 0.939 0.946 0.939 0.986 0.948 0.928 0.970 0.951 0.910 0.960 0.944 0.940 0.929 **0.926 ± 0.05** 

Precision 0.220 0.610 0.759 0.509 0.613 0.437 0.527 0.477 0.602 0.579 0.462 0.447 0.574 0.388 0.590 0.642 0.622 0.505 **0.655 ± 0.15**  Recall 0.667 0.590 0.639 0.392 0.493 0.301 0.411 0.335 0.831 0.546 0.513 0.848 0.457 0.474 0.476 0.647 0.275 0.599 **0.483 ± 0.15**  F1-Score 0.330 0.600 0.694 0.443 0.546 0.357 0.462 0.393 0.698 0.562 0.486 0.586 0.509 0.426 0.527 0.644 0.381 0.548 **0.527 ± 0.13**  G-Mean 0.797 0.764 0.797 0.620 0.696 0.543 0.635 0.573 0.906 0.730 0.709 0.913 0.672 0.683 0.685 0.794 0.522 0.771 **0.681 ± 0.11**  Accuracy 0.946 0.981 0.983 0.949 0.958 0.948 0.952 0.950 0.984 0.954 0.966 0.983 0.974 0.976 0.964 0.955 0.946 0.988 **0.969 ± 0.01** 

Precision 0.220 0.396 0.603 0.402 0.476 0.269 0.382 0.364 0.595 0.438 0.274 0.224 0.288 0.166 0.416 0.494 0.448 0.156 **0.433 ± 0.19**  Recall 0.817 0.638 0.673 0.457 0.562 0.430 0.442 0.475 0.831 0.571 0.630 0.884 0.671 0.652 0.484 0.758 0.383 0.754 **0.529 ± 0.16**  F1-Score 0.346 0.488 0.636 0.428 0.515 0.331 0.410 0.413 0.694 0.496 0.382 0.358 0.403 0.265 0.447 0.598 0.413 0.258 **0.446 ± 0.14**  G-Mean 0.877 0.789 0.815 0.664 0.737 0.636 0.652 0.675 0.906 0.739 0.772 0.919 0.798 0.782 0.686 0.848 0.609 0.845 **0.707 ± 0.11**  Accuracy 0.939 0.968 0.977 0.937 0.946 0.917 0.936 0.935 0.984 0.937 0.936 0.954 0.941 0.931 0.950 0.936 0.934 0.945 **0.953 ± 0.02** 

Recall 0.645 0.634 0.665 0.435 0.547 0.384 0.456 0.458 0.778 0.565 0.646 0.875 0.680 0.647 0.492 0.791 0.328 0.755 **0.559 ± 0.15** 

**Ym. [37]** Precision 0.199 0.409 0.657 0.435 0.531 0.282 0.302 0.357 0.631 0.406 0.243 0.222 0.296 0.122 0.403 0.364 0.513 0.143 **0.307 ± 0.23** 

5.3.2. Comparison of SRD Rates of Proposed Method and Those in Literature It is worth mentioning that the performances of the existing SRD methods considered are not exactly comparable, as each reports its accuracy for a specific image set using different contexts. Moreover, the accuracy values obtained from them and the color-mapping It is worth mentioning that the performances of the existing SRD methods considered are not exactly comparable, as each reports its accuracy for a specific image set using different contexts. Moreover, the accuracy values obtained from them and the colormapping techniques used for segmentation may vary.

In Table 1, the performances of SRD in terms of different evaluation metrics for the

proposed and other methods are compared for the 18 individual objects. For visualization purposes, short forms of the authors' names are written in the first column, that is, Ak., Sn., Yn., Ym., Ar., St., and Ms. refer to Akashi, Shen, Yang, Yamamoto, Arnold, Saint, and Meslouhi, respectively. The SRD metric values in the object index columns correspond to the maiden specular image among the sub-aperture ones. The final column (overall mean ?SA)) corresponds to the mean ± SD values of the 121 sub-aperture + 1 maximum images

× 18 objects = 2196 images together.

object dataset and overall means (all sub-aperture images in 4D LF).

**Ak. [40]** 

**Sn. [35]** 

**Yn. [1]** 

In Table 1, the performances of SRD in terms of different evaluation metrics for the proposed and other methods are compared for the 18 individual objects. For visualization purposes, short forms of the authors' names are written in the first column, that is, Ak., Sn., Yn., Ym., Ar., St., and Ms. refer to Akashi, Shen, Yang, Yamamoto, Arnold, Saint, and Meslouhi, respectively. The SRD metric values in the object index columns correspond to the maiden specular image among the sub-aperture ones. The final column (overall mean ?SA)) corresponds to the mean ± SD values of the 121 sub-aperture + 1 maximum images × 18 objects = 2196 images together.

**Table 1.** Performance evaluation of different methods in terms of various SRD metrics for 18 objects (O#1–O#18) in MSPLFI object dataset and overall means (all sub-aperture images in 4D LF).


As can be seen, the overall mean SRD different metric values are higher for the proposed method than the studies discussed in this paper, as shown in the final column in Table 1. Additionally, considering all the sub-aperture images of the 18 distinct objects, mean F1-score, G-mean, and accuracy values for the proposed method are 0.546 ± 0.13, 0.654 ± 0.11 and 0.974 ± 0.01, respectively. In Figure 11, the SRD metric values for the 18 individual objects (O#1–O#18) and their maximum specular reflections obtained from different methods are compared. As can be seen, the proposed method achieves superior median values for the F1-score, G-mean and accuracy of 0.662, 0.816 and 0.971, respectively.

**Ar. [41]** 

**St. [42]** 

**Ms. [43]** 

**Proposed** 

dif-ferent m

objects in terms of precision, recall, F1-score, G-mean and accuracy.

F1-Score 0.304 0.497 0.661 0.435 0.539 0.325 0.363 0.401 0.697 0.472 0.353 0.355 0.412 0.205 0.443 0.499 0.400 0.240 **0.346 ± 0.17**  G-Mean 0.782 0.787 0.811 0.649 0.730 0.604 0.656 0.663 0.877 0.734 0.777 0.914 0.804 0.767 0.691 0.847 0.567 0.843 **0.709 ± 0.10**  Accuracy 0.941 0.969 0.980 0.942 0.952 0.924 0.920 0.934 0.985 0.932 0.925 0.954 0.942 0.905 0.948 0.900 0.940 0.939 **0.908 ± 0.06** 

Precision 0.189 0.520 0.463 0.471 0.529 0.258 0.436 0.383 0.410 0.468 0.308 0.191 0.287 0.178 0.366 0.496 0.413 0.255 **0.561 ± 0.12**  Recall 0.594 0.587 0.668 0.394 0.391 0.351 0.422 0.449 0.763 0.526 0.609 0.877 0.353 0.281 0.467 0.727 0.371 0.447 **0.434 ± 0.16**  F1-Score 0.287 0.552 0.547 0.429 0.450 0.298 0.428 0.414 0.534 0.495 0.409 0.314 0.317 0.218 0.410 0.590 0.391 0.325 **0.466 ± 0.10**  G-Mean 0.750 0.761 0.808 0.620 0.619 0.577 0.640 0.658 0.863 0.713 0.763 0.910 0.586 0.524 0.671 0.831 0.598 0.663 **0.644 ± 0.12**  Accuracy 0.941 0.977 0.967 0.946 0.951 0.921 0.943 0.939 0.971 0.942 0.945 0.944 0.955 0.962 0.944 0.936 0.930 0.976 **0.966 ± 0.01** 

Precision 0.461 0.679 0.680 0.597 0.692 0.344 0.609 0.392 0.586 0.616 0.340 0.237 0.491 0.360 0.421 0.631 0.487 0.193 **0.702 ± 0.12**  Recall 0.592 0.535 0.637 0.357 0.502 0.321 0.400 0.381 0.771 0.462 0.558 0.876 0.457 0.394 0.495 0.567 0.315 0.724 **0.422 ± 0.15**  F1-Score 0.518 0.598 0.658 0.447 0.582 0.332 0.483 0.387 0.666 0.528 0.423 0.373 0.473 0.376 0.455 0.597 0.383 0.305 **0.507 ± 0.11**  G-Mean 0.764 0.729 0.795 0.593 0.704 0.558 0.628 0.608 0.873 0.674 0.734 0.916 0.671 0.624 0.693 0.744 0.555 0.834 **0.637 ± 0.12**  Accuracy 0.978 0.983 0.980 0.954 0.963 0.939 0.957 0.942 0.983 0.955 0.952 0.957 0.970 0.975 0.950 0.952 0.938 0.958 **0.971 ± 0.01** 

Precision 0.646 0.878 0.914 0.876 0.765 0.592 0.754 0.585 0.847 0.702 0.557 0.557 0.556 0.348 0.692 0.657 0.729 0.660 **0.868 ± 0.09**  Recall 0.580 0.367 0.502 0.248 0.485 0.212 0.393 0.307 0.568 0.445 0.507 0.831 0.489 0.572 0.366 0.627 0.240 0.338 **0.283 ± 0.11**  F1-Score 0.611 0.518 0.648 0.387 0.593 0.312 0.517 0.403 0.680 0.545 0.530 0.667 0.520 0.433 0.479 0.642 0.361 0.447 **0.412 ± 0.13**  G-Mean 0.759 0.606 0.708 0.498 0.694 0.459 0.625 0.551 0.753 0.664 0.707 0.907 0.695 0.748 0.603 0.783 0.489 0.581 **0.520 ± 0.11**  Accuracy 0.985 0.983 0.984 0.959 0.966 0.956 0.963 0.956 0.988 0.960 0.972 0.988 0.973 0.971 0.967 0.956 0.949 0.989 **0.971 ± 0.01** 

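| Metric | O#1 | O#2 | O#3 | O#4 | O#5 | O#6 | O#7 | O#8 | O#9 | O#10 | O#11 | O#12 | O#13 | O#14 | O#15 | O#16 | O#17 | O#18 | Overall mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Precision | 0.630 | 0.666 | 0.728 | 0.622 | 0.668 | 0.643 | 0.798 | 0.563 | 0.756 | 0.678 | 0.485 | 0.624 | 0.470 | 0.422 | 0.665 | 0.658 | 0.719 | 0.614 | **0.776 ± 0.10** |
| Recall | 0.630 | 0.585 | 0.737 | 0.798 | 0.946 | 0.281 | 0.767 | 0.452 | 0.808 | 0.613 | 0.526 | 0.720 | 0.554 | 0.718 | 0.553 | 0.784 | 0.320 | 0.578 | **0.444 ± 0.15** |
| F1-Score | 0.630 | 0.623 | 0.732 | 0.699 | 0.783 | 0.391 | 0.782 | 0.501 | 0.781 | 0.644 | 0.504 | 0.668 | 0.509 | 0.531 | 0.604 | 0.715 | 0.442 | 0.596 | **0.546 ± 0.13** |
| G-Mean | 0.791 | 0.762 | 0.855 | 0.881 | 0.960 | 0.528 | 0.871 | 0.666 | 0.896 | 0.777 | 0.718 | 0.846 | 0.737 | 0.839 | 0.739 | 0.873 | 0.563 | 0.759 | **0.654 ± 0.11** |
| Accuracy | 0.985 | 0.983 | 0.984 | 0.965 | 0.973 | 0.958 | 0.978 | 0.957 | 0.990 | 0.963 | 0.967 | 0.990 | 0.968 | 0.976 | 0.970 | 0.961 | 0.951 | 0.990 | **0.974 ± 0.01** |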

As can be seen from the final column of Table 1, the overall mean values of the SRD metrics are higher for the proposed method than for the other methods discussed in this paper. Additionally, considering all the sub-aperture images of the 18 distinct objects, the mean F1-score, G-mean, and accuracy values for the proposed method are 0.546 ± 0.13, 0.654 ± 0.11 and 0.974 ± 0.01, respectively. In Figure 11, the SRD metric values for the 18 individual objects (O#1–O#18) and their maximum specular reflections obtained from different methods are presented.

**Figure 11.** Evaluation results for SRD performances of different methods for maximum specular reflected images of 18 objects in terms of precision, recall, F1-score, G-mean and accuracy.

In Figure 12, the SRD metric values for (121 sub-aperture + 1 maximum) images × 18 objects = 2196 images with their specular reflections obtained by different methods are presented. As can be seen, the proposed method has superior median values for F1-score, G-mean, and accuracy of 0.643, 0.676, and 0.981, respectively, to those of the others.

**Figure 12.** Evaluation results for SRD performances of different methods for (121 sub-aperture + 1 maximum) images × 18 objects = 2196 images with specular reflections in terms of precision, recall, F1-score, G-mean, and accuracy.
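As an illustration of how the five SRD metrics relate to one another, the following minimal Python sketch computes them from pixel-level confusion-matrix counts. It assumes the usual definitions, with G-mean taken as the geometric mean of recall and specificity; the function names and the example counts are hypothetical and are not taken from the paper.

```python
import math

def srd_metrics(tp, fp, fn, tn):
    """Detection metrics from confusion-matrix counts (assumed standard definitions)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true-positive rate
    specificity = tn / (tn + fp)     # true-negative rate
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "g_mean": math.sqrt(recall * specificity),  # assumption: sqrt(TPR * TNR)
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Image count quoted in the text: (121 sub-aperture + 1 maximum) x 18 objects
n_images = (121 + 1) * 18

# Hypothetical counts for an image in which specular pixels are rare
example = srd_metrics(tp=60, fp=40, fn=40, tn=860)

# Consistency check against one reported precision/recall pair (0.189, 0.594)
f1_check = round(f1_from(0.189, 0.594), 3)
```

With these hypothetical counts the accuracy (0.92) is well above the F1-score (0.60) because true negatives dominate, which may be why the reported accuracies sit far above the F1-scores. The consistency check reproduces the reported F1-score of 0.287 for that precision/recall pair.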

#### 5.3.3. Visualization of SRD Rates of Different Methods

In Figure 13, the SRD accuracies obtained by different methods for the maximum specular reflected images of sample objects in the MSPLFI object dataset are presented. As can be seen, the proposed approach reports fewer SRD errors than the others. Remaining objects are presented in Appendix A (Figure A2).

**Figure 13.** Comparison of SRD accuracies of different methods for sample objects in MSPLFI dataset.

#### *5.4. Performance Evaluation of SRI*

#### 5.4.1. Analysis of SRI Quality

The SRI qualities in terms of the normalized SRI metrics SSIM, PSNR, IMMSE, and MAD for the nine sample objects using the proposed method are presented separately in Figure 14 and then together for all objects (O#1–O#18) in Figure 15. For each object, a total of 121 sub-aperture + 1 maximum images are considered to measure its SRI, and box plots are used to statistically analyze our experiments. It is to be noted that a suppressed image with high SSIM and PSNR values and low IMMSE and MAD ones is close to the quasi-ground truth.

Figure 14 shows that, for all the objects, the SSIM has a higher median value than the PSNR and the IMMSE a lower one than the MAD, while object O#1 has superior median values of 0.966, 0.820, 0.038, and 0.131 for SSIM, PSNR, IMMSE, and MAD, respectively, to those of the other objects. Remaining objects are presented in Appendix B (Figure A3). Similarly, Figure 15 shows the normalized SRI qualities of (121 sub-aperture + 1 maximum) × 18 objects = 2196 images together. The SSIM has better overall median and 75th percentile values for all the objects combined (0.966 and 0.980, respectively) than the PSNR (0.735 and 0.778, respectively), and the IMMSE has better overall median and 75th percentile values for all the objects (0.073 and 0.118, respectively) than the MAD (0.226 and 0.273, respectively).

*Remote Sens.* **2021**, *13*, x FOR PEER REVIEW 19 of 30


**Figure 14.** Evaluation results for SRI performances of proposed method for 122 specular reflection suppressed images (121 sub-aperture + 1 maximum ones) of nine sample objects separately using different SRI metrics.

**Figure 15.** Evaluation results for SRI performances of proposed method for 121 sub-aperture + 1 maximum images × 18 objects = 2196 images for all objects (O#1–O#18) combined using different SRI metrics.

#### 5.4.2. Comparison of SRI Rates of Proposed Method and Those in Literature

It is worth mentioning that the performances of the existing SRI methods are not exactly comparable, as each reports its accuracy for a specific image set in a different context. Additionally, the quality obtained by the methods and the color-mapping techniques used for inpainting may vary.

In Table 2, the performances of SRI in the proposed and other methods for the 18 individual objects are compared using different evaluation metrics. For visualization, short forms of the authors' names written in the first column as Ar., Yg., Cr., St., Ak., Sn., and Ym. refer to Arnold, Yang, Criminisi, Saint, Akashi, Shen, and Yamamoto, respectively. The SRI metric values in the object index columns correspond to the maiden image of the 121 sub-aperture specular reflected suppressed ones. The final column (overall mean (SA)) corresponds to the mean ± SD values of the 121 sub-aperture + 1 maximum images × 18 objects = 2196 images together. As can be seen, the SRI metric values are significantly better for the proposed method than for the others considered, as shown in the final column in Table 2. For all the sub-aperture images of the 18 distinct objects, the mean SSIM, PSNR, IMMSE, and MAD values obtained from the proposed method are 0.956 ± 0.02, 24.51 ± 2.11, 257.6 ± 119, and 8.427 ± 2.51, respectively.
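The relation between the un-normalized PSNR and IMMSE values can be checked with a short sketch. It assumes 8-bit images (peak value 255), the standard definition PSNR = 10 log10(peak² / MSE), and MAD taken as the mean absolute pixel difference from the reference image. The function names and the toy pixel values are illustrative only, and the normalization used for the box plots is not reproduced here.

```python
import math

def immse(reference, suppressed):
    """Mean squared error between two equally sized sequences of pixel intensities."""
    return sum((r - s) ** 2 for r, s in zip(reference, suppressed)) / len(reference)

def mad(reference, suppressed):
    """Mean absolute difference between the two images (assumed meaning of MAD)."""
    return sum(abs(r - s) for r, s in zip(reference, suppressed)) / len(reference)

def psnr_from_mse(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy 4-pixel example (hypothetical intensities)
reference = [10, 20, 30, 40]
suppressed = [10, 22, 26, 46]
toy_mse = immse(reference, suppressed)
toy_mad = mad(reference, suppressed)

# IMMSE entries reported for O#1 and O#2 in Table 2 (29.54 and 157.4)
psnr_o1 = round(psnr_from_mse(29.54), 2)
psnr_o2 = round(psnr_from_mse(157.4), 2)
```

Under these assumptions, the two IMMSE entries map to 33.43 dB and 26.16 dB, which agrees with the PSNR entries reported for the same objects. This suggests, although the text does not state it, that the tabulated PSNR values were computed with a peak value of 255.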

**Table 2.** Performance evaluations of different methods using different SRI metrics for 18 objects (O#1–O#18) and overall mean (all sub-aperture images in 4D LF) in MSPLFI object dataset.

| Metric | O#1 | O#2 | O#3 | O#4 | O#5 | O#6 | O#7 | O#8 | O#9 | O#10 | O#11 | O#12 | O#13 | O#14 | O#15 | O#16 | O#17 | O#18 | Overall mean (SA) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSNR | 33.43 | 26.16 | 29.24 | 22.79 | 22.26 | 29.85 | 25.60 | 25.06 | 29.27 | 23.84 | 22.76 | 24.62 | 26.52 | 24.91 | 21.76 | 22.37 | 27.64 | 25.36 | **24.51 ± 2.11** |
| IMMSE | 29.54 | 157.4 | 77.50 | 341.9 | 386.7 | 67.34 | 179.2 | 202.9 | 76.95 | 268.6 | 344.8 | 224.2 | 145.1 | 209.9 | 433.9 | 376.7 | 112.1 | 189.2 | **257.6 ± 119** |
| MAD | 1.172 | 7.903 | 5.205 | 7.680 | 8.536 | 4.529 | 8.723 | 8.277 | 4.959 | 10.07 | 11.07 | 8.888 | 5.257 | 7.880 | 13.26 | 12.97 | 5.545 | 7.665 | **8.427 ± 2.51** |

SSIM: structural similarity index; PSNR: peak signal-to-noise ratio; IMMSE: mean squared error; MAD: mean absolute deviation.

In Figure 16, comparisons of the SRI metric values of individual methods in terms of SSIM, PSNR, IMMSE, and MAD of 18 individual objects (O#1–O#18) with their maiden specular inpainting are presented. It can be seen that the proposed method has superior median values for SSIM and PSNR of 0.985 and 0.754 and the lowest median values for IMMSE and MAD of 0.063 and 0.217, respectively.

**Figure 16.** Evaluation results for SRI performances of individual methods for each maiden specular suppressed image of 18 objects in terms of SSIM, PSNR, IMMSE, and MAD.

Figure 17 shows the SRI metric values of individual methods in terms of SSIM, PSNR, IMMSE, and MAD of (121 sub-aperture + 1 maiden) images × 18 objects = 2196 images. As can be seen, the proposed method has superior median values for SSIM and PSNR of 0.966 and 0.735, respectively, and the lowest median values for IMMSE and MAD of 0.073 and 0.226, respectively, compared with those of the other methods.

**Figure 17.** Evaluation results for SRI performances of different methods for (121 sub-aperture + 1 maiden) images × 18 objects = 2196 images in terms of SSIM, PSNR, IMMSE, and MAD.

#### 5.4.3. Visualization of SRI Quality Assessment

Figure 18 presents the SRI qualities obtained by different methods for the maiden specular reflected images of sample scenes in the MSPLFI object dataset. Remaining objects are presented in Appendix B (Figure A4). As can be seen, the proposed approach demonstrates better SRI quality than the others.


**Figure 18.** Comparison of SRI accuracies of different methods for sample objects in MSPLFI dataset.

#### **6. Conclusions**

In this paper, a two-fold SRDI framework is proposed. As transparent objects lack their own textures, combining multisensory imagery cues improves their levels of specular detection and inpainting. Based on the private MSPLFI object dataset, the proposed SRD and SRI algorithms demonstrate better detection accuracy and suppression quality, respectively, than other techniques. In SRD, predictions of multiband changes in the sub-apertures in both polarized and unpolarized images are calculated and combined to obtain the overall specularity in transparent objects. In SRI, firstly, a distance matrix based on four-connected neighboring pixel patterns is calculated, and then the most similar one is selected to replace the specular pixel. The proposed algorithms achieve better detection accuracy and inpainting quality in terms of F1-score, G-mean, accuracy, SSIM, PSNR, IMMSE, and MAD than other techniques reported in this paper. The experimental results illustrate the validity and the efficiency of the proposed method based on diverse performance evaluation metrics. They also demonstrate that it significantly improves the SRD metrics (with mean F1-score, G-mean, and accuracy 0.643, 0.656, and 0.981, respectively) and SRI ones (with the mean SSIM, PSNR, IMMSE, and MAD 0.966, 0.735, 0.073, and 0.226, respectively) for 18 transparent objects, each with 121 sub-apertures, in MSPLFI compared with those in the existing literature referenced in this paper.

As an extension of this work, we will investigate a machine learning technique for feature extraction and learning and testing of SRD and SRI performances on the MSPLFI object dataset. As it is known that a transparent object contains the same texture as its background, developing an automatic algorithm for segmenting it from its background in multisensory imagery will also be explored.

**Author Contributions:** Conceptualization, M.N.I. and M.T.; methodology, M.N.I. and M.T.; software, M.N.I.; validation, M.N.I. and M.T.; investigation, M.T. and M.P.; data curation, M.N.I.; writing (original draft preparation), M.N.I.; writing (review and editing), M.N.I., M.T. and M.P.; supervision, M.T. and M.P.; funding acquisition, M.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors would like to thank Denise Russell for her assistance with English expression.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

Visualizations of SRD Methods.

**Figure A1.** Evaluation results for SRD performances of proposed method for 122 specular reflected images (121 sub-apertures + 1 maximum) of nine sample objects separately using different SRD metrics.

**Figure A2.** Comparison of SRD accuracies of different methods for sample objects in MSPLFI dataset.

#### **Appendix B**

Visualizations of SRI Methods.

**Figure A3.** Comparison of SRI accuracies of different methods for sample objects in MSPLFI dataset.

**Figure A4.** Evaluation results for SRI performances of proposed method for 122 specular reflection suppressed images (121 sub-aperture + 1 maximum ones) of nine sample objects separately using different SRI metrics.

### **References**

1. Yang, Q.; Wang, S.; Ahuja, N. Real-time specular highlight removal using bilateral filtering. In Proceedings of the European Conference on Computer Vision, Crete, Greece, 6–9 September 2010.
2. Xin, J.H.; Shen, H.L. Accurate color synthesis of three-dimensional objects in an image. *JOSA A* **2004**, *21*, 713–723.
3. Lin, S.; Lee, S.W. Estimation of diffuse and specular appearance. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999.
4. Hara, K.; Nishino, K.; Ikeuchi, K. Determining reflectance and light position from a single image without distant illumination assumption. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 3 April 2008.
5. Tan, R.T.; Ikeuchi, K. Separating reflection components of textured surfaces using a single image. In Proceedings of the Digitally Archiving Cultural Objects, Nice, France, 14–17 October 2003.
6. Kalra, A.; Taamazyan, V.; Rao, S.K.; Venkataraman, K.; Raskar, R.; Kadambi, A. Deep polarization cues for transparent object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
7. Tyo, J.S.; Goldstein, D.L.; Chenault, D.B.; Shaw, J.A. Review of passive imaging polarimetry for remote sensing applications. *Appl. Opt.* **2006**, *45*, 5453–5469.
8. Yan, Q.; Shen, X.; Xu, L.; Zhuo, S.; Zhang, X.; Shen, L.; Jia, J. Crossfield joint image restoration via scale map. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013.
9. Schaul, L.; Fredembach, C.; Susstrunk, S. Color image dehazing using the near-infrared. In Proceedings of the 16th IEEE International Conference on Image Processing (ICIP), Chiang Mai, Thailand, 7 November 2009.
10. Salamati, N.; Larlus, D.; Csurka, G.; Süsstrunk, S. Semantic image segmentation using visible and near-infrared channels. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012.
11. Berns, R.S.; Imai, F.H.; Burns, P.D.; Tzeng, D.Y. Multispectral-based color reproduction research at the Munsell Color Science Laboratory. In Proceedings of the Electronic Imaging: Processing, Printing, and Publishing in Color, Proceedings of the SPIE, Zurich, Switzerland, 7 September 1998.
12. Thomas, J.B. Illuminant estimation from uncalibrated multispectral images. In Proceedings of the 2015 Colour and Visual Computing Symposium (CVCS), Gjovik, Norway, 25–26 August 2015.
13. Motohka, T.; Nasahara, K.N.; Oguma, H.; Tsuchida, S. Applicability of green-red vegetation index for remote sensing of vegetation phenology. *Remote Sens.* **2010**, *2*, 2369–2387.
14. Dandois, J.P.; Ellis, E.C. Remote sensing of vegetation structure using computer vision. *Remote Sens.* **2010**, *2*, 1157–1176.


## *Article* **Towards Semantic SLAM: 3D Position and Velocity Estimation by Fusing Image Semantic Information with Camera Motion Parameters for Traffic Scene Analysis**

**Mostafa Mansour 1,2,\* , Pavel Davidson 1,3 , Oleg Stepanov 2,4 and Robert Piché <sup>1</sup>**


**Abstract:** In this paper, an EKF (Extended Kalman Filter)-based algorithm is proposed to estimate the 3D position and velocity components of different cars in a scene by fusing the semantic information and car model, extracted from successive frames, with camera motion parameters. First, a 2D virtual image of the scene is made using prior knowledge of the 3D Computer Aided Design (CAD) models of the detected cars and their predicted positions. Then, a discrepancy, i.e., distance, between the actual image and the virtual image is calculated. The 3D position and the velocity components are recursively estimated by minimizing the discrepancy using EKF. The experiments on the KITTI dataset show a good performance of the proposed algorithm, with a position estimation error up to 3–5% at 30 m and a velocity estimation error up to 1 m/s.

**Keywords:** semantic SLAM; object detection; YOLOv3; object based map; EKF

### **1. Introduction**

In recent years, significant progress has been made in vision-based Simultaneous Localization and Mapping (SLAM) to allow a robot to map its unknown environment and localize itself in it [1]. Many works have been dedicated to the use of geometric entities such as corners and edges to produce a dense feature map in the form of a 3D point cloud. A robot then uses this point cloud to localize itself. The geometric aspect of SLAM has reached a level of maturity allowing it to be implemented in real time with high accuracy [2,3] and with an outcome consisting of a camera pose and sparse map in the form of a point cloud.

Despite the maturity and accuracy of geometric SLAM, it is inadequate when it comes to any interaction between a robot and its environment. To interact with an environment, a robot should have a meaningful map with object-based entities instead of geometric ones. The robot should also reach a level of semantic understanding allowing it not only to distinguish between different objects and their properties but also to distinguish between different instances of the same object.

The required granularity of semantic understanding, i.e., object or place identity, depends on the task. For collision avoidance, it is important to distinguish between different objects while distinguishing between different instances of the same object may not be required. In contrast, robots manipulating different instances of the same object, as in warehouses, should have a deeper level of understanding allowing them to distinguish between different instances of the same object and their geometric characteristics. Semantic understanding can sometimes be considered as place understanding instead of object understanding: a robot that moves among different kinds of places (room, corridor, elevator, etc.) should be able to distinguish between them.

**Citation:** Mansour, M.; Davidson, P.; Stepanov, O.; Piché, R. Towards Semantic SLAM: 3D Position and Velocity Estimation by Fusing Image Semantic Information with Camera Motion Parameters for Traffic Scene Analysis. *Remote Sens.* **2021**, *13*, 388. https://doi.org/10.3390/rs13030388

Academic Editors: Jukka Heikkonen and Fahimeh Farahnakian

Received: 9 December 2020

Accepted: 20 January 2021

Published: 23 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In autonomous driving, traffic scene analysis is a crucial task. An autonomous vehicle should not only understand the position of different objects but should also be able to predict their trajectory, even in the case of occlusion. In recent years, traffic scene analysis has leveraged the maturity of deep learning approaches to detect different objects in the scene. The improvements in deep learning-based 2D object detection algorithms [4,5] enable a better understanding of scene content. However, they do not provide a 3D description of the scene. Therefore, many recent works have been devoted to augmenting the results of 2D object detectors to obtain a 3D representation of the scene in the form of 3D coordinates of the cars in the scene.

A number of recent works on 3D representation of cars in a scene [6–13] utilize prior information about 3D Computer Aided Design (CAD) models of the cars. After the cars in a scene are detected using a 2D detector, they are matched against a set of 3D CAD models to choose the corresponding models. Then, a virtual (calculated) image of the scene is generated using the 3D CAD models of the detected cars. After that, the virtual image is compared with the actual image captured by the camera and the *discrepancy* between the two images is calculated. The cars' poses are estimated by minimizing the discrepancy between the virtual and actual images. These works can be divided into two groups depending on the way they compute the discrepancy between the virtual and actual images. One group computes the discrepancy as the difference between two contours [7,10]. The first contour represents the virtual image generated using the 3D CAD model while the other contour represents the corresponding detected car in the scene. The second group computes the discrepancy as the difference between some key points (such as window points, wheels, etc.) in the virtual and actual images [6,11–14]. It was found that this approach led to more accurate results than the contour approach [9].
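The key-point variant of the discrepancy can be sketched as follows. This is a minimal illustration under stated assumptions rather than the formulation of any cited work: it assumes a pinhole camera, a car model that is only translated (rotation is ignored for brevity), and a sum of squared pixel distances as the discrepancy. All names, intrinsics, and coordinates are hypothetical.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D point given in the camera frame (Z forward)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def place(model_point, car_position):
    """Move a model key point into the camera frame (translation only, an assumption)."""
    return tuple(m + c for m, c in zip(model_point, car_position))

def keypoint_discrepancy(model_keypoints, car_position, detected_pixels, intrinsics):
    """Sum of squared pixel distances between virtual and detected key points."""
    total = 0.0
    for model_point, (u_det, v_det) in zip(model_keypoints, detected_pixels):
        u, v = project(place(model_point, car_position), *intrinsics)
        total += (u - u_det) ** 2 + (v - v_det) ** 2
    return total

# Hypothetical intrinsics (fx, fy, cx, cy) and four model key points, in metres
intrinsics = (700.0, 700.0, 600.0, 180.0)
model_keypoints = [(-0.8, 0.5, 1.2), (0.8, 0.5, 1.2), (-0.8, 0.5, -1.2), (0.8, 0.5, -1.2)]

# Simulated detections: key points of a car located at true_position
true_position = (2.0, 1.0, 20.0)
detected_pixels = [project(place(p, true_position), *intrinsics) for p in model_keypoints]

d_true = keypoint_discrepancy(model_keypoints, true_position, detected_pixels, intrinsics)
d_wrong = keypoint_discrepancy(model_keypoints, (2.5, 1.0, 21.0), detected_pixels, intrinsics)
```

In this sketch the discrepancy vanishes at the true position and grows for a wrong hypothesis, so a pose estimate can be obtained by searching for the position that minimizes it.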

Regardless of the approach used to compute the discrepancy, such works have two drawbacks. **The first drawback** is that they estimate the pose of the detected cars only and do not give information on velocities. **The second drawback** is that they produce a one-shot estimate that infers each frame separately and does not take into account the temporal dynamic behavior of the objects between successive frames. The dynamic behavior of different objects is important for trajectory prediction, especially in the case of full occlusion.
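To illustrate why a temporal model addresses both drawbacks, the sketch below runs a linear Kalman filter with a constant-velocity model on a single coordinate. It is a deliberate simplification and an assumption on our part: position is measured directly, whereas an EKF based on an image discrepancy would use a nonlinear measurement model. Function names, noise levels, and the simulated track are hypothetical.

```python
def predict(state, cov, dt, q):
    """Constant-velocity prediction for state (position, velocity) and 2x2 covariance."""
    p, v = state
    (a, b), (c, d) = cov
    new_state = (p + v * dt, v)
    # F P F^T + Q with F = [[1, dt], [0, 1]] and Q = diag(q, q)
    new_cov = (
        (a + dt * (b + c) + dt * dt * d + q, b + dt * d),
        (c + dt * d, d + q),
    )
    return new_state, new_cov

def update(state, cov, z, r):
    """Measurement update with a direct position measurement z of variance r."""
    p, v = state
    (a, b), (c, d) = cov
    s = a + r                    # innovation variance
    k0, k1 = a / s, c / s        # Kalman gain
    y = z - p                    # innovation
    new_state = (p + k0 * y, v + k1 * y)
    new_cov = (((1 - k0) * a, (1 - k0) * b), (c - k1 * a, d - k1 * b))
    return new_state, new_cov

# Simulated car starting at 10 m and moving at 2 m/s, observed once per second
state, cov = (10.0, 0.0), ((1.0, 0.0), (0.0, 100.0))
dt, q, r = 1.0, 1e-3, 0.01
for k in range(1, 21):
    state, cov = predict(state, cov, dt, q)
    state, cov = update(state, cov, 10.0 + 2.0 * k * dt, r)
velocity_estimate = state[1]

# Full occlusion: three frames without measurements, prediction only
for _ in range(3):
    state, cov = predict(state, cov, dt, q)
occluded_position = state[0]
```

The filter recovers the velocity although only positions are measured, and the prediction step keeps propagating the track while the car is occluded, at the cost of a growing covariance.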

The previous drawbacks inspired us to introduce an approach that has the following contributions:


The rest of the paper is organized as follows. Some important related works are discussed in Section 2. In Section 3, we go through the proposed approach, describing its main components. Section 3.2 introduces the mathematical problem statement used in our approach and its solution using EKF (Extended Kalman Filter). Sections 4 and 5 present experiment results and a discussion about the proposed algorithm, respectively. Finally, we present our conclusions in Section 6.

#### **2. Related Work**

In the following lines, some works related to 2D object detection, semantic SLAM without using a 3D CAD model, and image-based 3D object detection using a 3D CAD model will be discussed.

#### *2.1. 2D Object Detection*

Identifying objects is a crucial step in the semantic SLAM pipeline. Therefore, we use state-of-the-art deep learning methods to detect and identify the objects. There is a trade-off between the speed and accuracy of the Convolutional Neural Networks (CNNs) used in object detection. On the one hand, techniques such as R-CNN [15], Fast R-CNN [16], and Faster R-CNN [17] are accurate. They are region-based techniques that first produce candidate regions containing potential objects and then classify these objects. Although accurate, they are computationally expensive and are not suitable for real-time applications. On the other hand, bounding box-based techniques such as You Only Look Once (YOLO) [4,18] and Single Shot Multibox Detector (SSD) [5] are less accurate but are suitable for real-time applications. In this paper, we use YOLOv3 [18] as the object detector because it is considered the state of the art among bounding box techniques and it supports both CPU and GPU implementations.

#### *2.2. Semantic SLAM without Using 3D CAD Model*

In recent years, a few works have been dedicated to semantic SLAM without using a prior 3D CAD model. Bowman et al. [19] used an Expectation Maximization algorithm to optimize the joint distribution of camera pose and detected objects locations. Doherty et al. [20,21] addressed the problem of data association in semantic SLAM. In [20], the authors decomposed the problem into a discrete inference problem to estimate the object category and a continuous inference problem to estimate camera and object location. In [21], the authors proposed a proactive max-marginalization procedure for the data association problem in semantic SLAM. Unlike the previous works which did not benefit from a prior knowledge of some objects, in our approach we use prior known models of the objects.

#### *2.3. Image Based 3D Object Detection Using 3D CAD Models*

In [22], Davison argued that 3D object model fitting is an active choice to produce high level semantic mapping. Many works have been dedicated to utilizing prior information about the 3D model of the object. Chabot et al. introduced one of the pioneering works for 3D object detection from monocular camera images [6]. Their approach consists of two phases. In the first phase, they used a cascade Faster R-CNN to regress 2D key points of the detected car and produce a template similarity. In the second phase, they selected the matching 3D CAD model from the database based on the template similarity obtained in the first phase. Having a 3D CAD model and the corresponding 2D key points, they used the Efficient Perspective-n-Point (EPnP) algorithm [23] to compute the discrepancy and estimate the poses of the detected cars. Kundu et al. estimated the shape and the pose of the cars using "Render-and-compare" loss components [7]. They rendered each potential 3D CAD model with OpenGL and compared it with the images of the detected cars to find the most similar model. However, this approach is computationally heavy. In [12], Qin et al. regressed the center of the 3D bounding box from a 2D image using sparse supervision. They did not use any prior 3D CAD models. Barabanau et al. improved the previous approach by using a 3D CAD model to infer the depth to the detected cars [13]. Wu et al. [8] extended Mask R-CNN by adding customized heads, i.e., additional output layers, for predicting the vehicle's finer class, rotation, and translation. None of the previous approaches takes into account the dynamic nature of the moving cars from frame to frame. Therefore, in our approach, we extend these works by fusing the output of the object detector with the motion parameters to estimate the position and velocity of different cars in the scene.

#### **3. Proposed Approach**

Our proposed approach aims to have a 3D semantic representation of the traffic scene by estimating the 3D position and velocity components of different cars in the scene. It leverages the advances in deep learning-based algorithms to detect the semantic class and different important key points of different cars in the scene. Then, the detected key points are fused with the motion parameters of the camera, i.e., linear velocity and angular velocity, measured by sensors on board to get a 3D representation of the scene with respect to the ego car frame. This section will introduce, first, the proposed pipeline. Then, the mathematical problem statement and its solution will be presented.

#### *3.1. Proposed Pipeline*

Figure 1 illustrates different stages of the proposed approach. In the following lines, we discuss the main steps of the proposed pipeline.

**Figure 1.** The algorithm pipeline focusing on an ego lane car. It is divided into two parts: extracting the semantic information (in red) and temporal fusing (in blue). (1) The car is detected in the scene using an object detection algorithm like YOLOv3. (2) The key points of the car are extracted and matched against several 3D CAD models to select the corresponding model [6–8]. (3) The extracted semantic information is converted to 3D coordinates using the 3D CAD model. (4) The car position is predicted using the ego motion parameters of the camera. (5) Virtual 2D key points are created using the predicted car position and the 3D coordinates of the key points obtained from the CAD model. (6) The distance between actual 2D key points and their corresponding 2D virtual key points is computed. (7) The filter is updated using the computed discrepancy.

#### 3.1.1. Object Detection

YOLOv3 can be used for object detection [18]. It is much faster than other object detection algorithms while achieving comparable accuracy, which makes it very suitable for real-time implementation. The output of the object detection algorithm is a number of bounding boxes, each of which contains a detected car.

#### 3.1.2. Key Points Detection and 3D CAD Model Matching

Having a car inside a bounding box, a number of 2D key points can be detected. These key points can be rear and front windshield corners, centers of the wheels, the corners of the door windows, etc. These key points and the shape of the car are used to match the car with a corresponding 3D CAD model stored in the database [9]. The 3D CAD model consists of the 3D coordinates of the key points resolved in the car coordinate frame. Once the detected car is matched correctly with its corresponding 3D CAD model, we have a number of 2D key points and their corresponding 3D points resolved in the car coordinate frame. Some works have utilized neural networks for key point detection and 3D CAD model matching [6,7]. However, the configurations and the weights of their implementations are not open sourced. Therefore, in order to use these approaches, one has to reproduce their results and retrain the networks from scratch, which is a time- and resource-consuming process. In addition, the contribution of this paper is not related to this part. Our main contribution is the temporal fusing of the detected key points with the motion parameters of the car. Therefore, we decided to obtain the results of this stage manually.

#### 3.1.3. Semantic to Metric Information Conversion

The detected key points with their associated IDs are matched against their 3D CAD model to get their corresponding 3D coordinates resolved in their car coordinate frame. By doing so, we convert the semantic information (object class, car model, and the identity of the detected 2D key points) to a metric information in the form of a set of 3D coordinates.
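The conversion described above can be pictured as a table lookup. The following is a minimal sketch under stated assumptions, not the implementation used in this work: the model name `hatchback_a`, the key point IDs, and all coordinates are hypothetical placeholders chosen for illustration.

```python
import numpy as np

# Hypothetical CAD database: model name -> {key point ID -> 3D coordinates in the
# car coordinate frame, in metres}. All names and numbers are illustrative only.
CAD_DATABASE = {
    "hatchback_a": {
        0: np.array([-0.6, -0.3, 0.0]),  # rear windshield, top-left corner
        1: np.array([0.6, -0.3, 0.0]),   # rear windshield, top-right corner
        2: np.array([0.6, 0.1, 0.0]),    # rear windshield, bottom-right corner
        3: np.array([-0.6, 0.1, 0.0]),   # rear windshield, bottom-left corner
    }
}


def semantic_to_metric(model_name, keypoint_ids):
    """Return an (N, 3) array of car-frame coordinates for the given key point IDs."""
    model = CAD_DATABASE[model_name]
    return np.stack([model[k] for k in keypoint_ids])


points_3d = semantic_to_metric("hatchback_a", [0, 1, 2, 3])
print(points_3d.shape)  # (4, 3)
```

The semantic labels (car model and key point identity) act only as keys; everything passed on to the filter is metric.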

#### 3.1.4. State Prediction

At this stage, the motion parameters of the ego car measured by the on-board GPS and IMU sensors are used to predict the 3D position and velocity components of the detected cars with respect to the ego car coordinate frame. For state prediction, we use the motion model presented in Section 3.2.1.

#### 3.1.5. Forming Virtual 2D Key Points

The key points obtained previously in Section 3.1.3 are projected onto the image plane to get 2D virtual key points. To do so, the 3D key points are first represented in the camera frame instead of the body frame. Then, the points are projected onto the image plane using the pinhole camera model (see Section 3.2.2).

#### 3.1.6. Discrepancy Formulation

Having a number of true key points from Section 3.1.2 and their corresponding virtual key points from Section 3.1.5, the distance between them can be computed. This distance is called the discrepancy between the true image and virtual image.
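As a small illustration of this step, the sketch below computes the per-point pixel residuals and their Euclidean lengths for two sets of key points. The function name `discrepancy` and the pixel values are assumptions made for the example.

```python
import numpy as np


def discrepancy(true_kp, virtual_kp):
    """Per-point pixel residuals and Euclidean distances between (N, 2) key point arrays."""
    residuals = np.asarray(true_kp, dtype=float) - np.asarray(virtual_kp, dtype=float)
    distances = np.linalg.norm(residuals, axis=1)
    return residuals, distances


# Illustrative pixel coordinates (hypothetical values).
true_kp = np.array([[640.0, 200.0], [700.0, 200.0]])
virtual_kp = np.array([[637.0, 204.0], [700.0, 200.0]])
residuals, distances = discrepancy(true_kp, virtual_kp)
print(distances)  # [5. 0.]
```

The signed residuals, rather than the scalar distances, are what enter the filter update as the innovation.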

#### 3.1.7. Temporal Fusing of the Discrepancy with the Motion Parameters of the Camera (Filter Update)

The detected car 3D position and velocities can be estimated by minimizing this discrepancy. An Extended Kalman Filter (EKF) is used to recursively fuse the discrepancy calculated in Section 3.1.6 with the predicted 3D position and velocity components obtained in Section 3.1.4. In the following section the mathematical problem statement will be discussed in detail.

#### *3.2. Problem Statement and Its Solution*

Consider a car equipped with a monocular camera and motion sensors moving on a road, capturing successive images of the scene. The cars in the scene are detected using a 2D object detection algorithm like YOLOv3. The goal is to estimate the 3D position and the velocity components of each detected car with respect to the ego car (camera) coordinate frame. By doing so, a 3D object-based map for the scene, with respect to the ego car frame, can be created and updated over time.

#### 3.2.1. Motion Model

Suppose $X = [P^{car}, V^{car}]^T$ is the state vector of a detected car (object). It consists of two subvectors: $P^{car} = [P_x^{car}, P_y^{car}, P_z^{car}]^T$ is the 3D position of a detected car and $V^{car} = [V_x^{car}, V_y^{car}, V_z^{car}]^T$ is the 3D velocity vector. Both vectors are resolved with respect to the ego car coordinate frame and they can be described using Singer's model as follows [24,25],

$$\begin{aligned} \dot{P}_{x}^{car} &= V_{x}^{car} - \tilde{V}_{x}^{ego} + v_{x}, \\ \dot{P}_{y}^{car} &= V_{y}^{car} - \tilde{V}_{y}^{ego} + v_{y}, \\ \dot{P}_{z}^{car} &= V_{z}^{car} - \tilde{V}_{z}^{ego} + v_{z}, \\ \dot{V}_{x}^{car} &= w_{x}, \\ \dot{V}_{y}^{car} &= w_{y}, \\ \dot{V}_{z}^{car} &= w_{z}, \end{aligned} \tag{1}$$

where $\tilde{V}^{ego}$ is the ego car velocity measured by the on-board motion sensors and used as an input to the model in (1) [26]. $v_i$ and $w_i$, where $i \in \{x, y, z\}$, are uncorrelated random white noise components. $v$ represents the measurement error of the ego car velocity and the process white noise related to $P^{car}$, while $w$ represents the process white noise related to $V^{car}$. The model in (1) can be written at any time step $t$ in a discrete form as follows,

$$X_t = AX_{t-1} - B\tilde{V}_{t-1}^{ego} + q_t,\tag{2}$$

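where

$$A = \begin{bmatrix} 1 & 0 & 0 & \Delta t & 0 & 0 \\ 0 & 1 & 0 & 0 & \Delta t & 0 \\ 0 & 0 & 1 & 0 & 0 & \Delta t \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \tag{3}$$

and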

$$B = \begin{bmatrix} \Delta t & 0 & 0 \\ 0 & \Delta t & 0 \\ 0 & 0 & \Delta t \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}. \tag{4}$$

$\tilde{V}_{t-1}^{ego}$ is measured by motion sensors at time step $(t-1)$. $q_t \sim N(\mathbf{0}_{6\times 1}, \mathbf{Q}_{6\times 6})$ is a random vector that models the process noise; it has a Gaussian distribution with zero mean and covariance matrix $\mathbf{Q}$. The predicted state using this model will be updated using the semantic and metric information extracted from camera images.
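The discrete model in (2) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the code used in this work: $A$ is taken to be the constant-velocity transition matrix implied by (1), the covariance propagation is the standard Kalman prediction, and the numerical values (time step, state, covariances) are made up for the example.

```python
import numpy as np


def motion_matrices(dt):
    """Constant-velocity transition matrix A (6x6) and ego-velocity input matrix B (6x3)."""
    A = np.eye(6)
    A[:3, 3:] = dt * np.eye(3)   # position integrates the detected car velocity
    B = np.zeros((6, 3))
    B[:3, :] = dt * np.eye(3)    # ego velocity is subtracted from the position rate
    return A, B


def predict(x, P, v_ego, dt, Q):
    """One prediction step: x <- A x - B v_ego, P <- A P A^T + Q."""
    A, B = motion_matrices(dt)
    x_pred = A @ x - B @ v_ego
    P_pred = A @ P @ A.T + Q
    return x_pred, P_pred


# Illustrative numbers (assumed): car 10 m ahead moving at 5 m/s, ego car at 3 m/s.
x0 = np.array([0.0, 0.0, 10.0, 0.0, 0.0, 5.0])
P0 = np.eye(6)
Q = 1e-4 * np.eye(6)
x1, P1 = predict(x0, P0, v_ego=np.array([0.0, 0.0, 3.0]), dt=0.1, Q=Q)
print(x1)
```

In this example the gap along $z$ grows by $(5 - 3) \times 0.1 = 0.2$ m in one step, while the velocity components are left unchanged by the prediction.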

#### 3.2.2. Observation Model and Semantic Information Fusing

The semantic information, i.e., the car model, is used to obtain metric information that can be fused with the motion parameters. The semantic information involves three parts: object category (car or not a car), model category (what is the corresponding 3D CAD model of a detected car), and the identity of the detected key points (which part of the car is represented by a specific key point). For each camera frame, the object detection algorithm is used to detect different cars in the scene, which is the first part of the semantic information. Once a car is detected, it is matched against the 3D CAD models in the database. This can be done using any of the approaches described in Section 3.1.2. This addresses the last two parts of the semantic information, i.e., the 3D CAD model and the identity of the detected key points. The identity of the detected key points is the semantic information that should be converted to metric information so that it can be used in the observation model. To do so, the 3D CAD model is used to determine the 3D coordinates of each detected point resolved in the detected car coordinate frame. Suppose $P^{kp} = [P_x^{kp}, P_y^{kp}, P_z^{kp}]^T$ is the 3D position of a detected key point resolved in the detected car coordinate frame. $P^{kp}$ can be found using the 3D CAD model of the detected car. However, to get an image point from $P^{kp}$, it should be represented in the camera coordinate frame. Let $P_{ego}^{kp} = [P_{ego,x}^{kp}, P_{ego,y}^{kp}, P_{ego,z}^{kp}]^T$ denote the key point resolved in the camera frame, which can be obtained as follows,

$$P_{ego}^{kp}(t) = P^{car}(t) + R_{ego}^{car}(t)P^{kp},\tag{5}$$

where $R_{ego}^{car}$ is the rotation matrix from the detected car coordinate frame to the camera frame. Using the detected 2D key points and their corresponding 3D key points from the 3D CAD model, $R_{ego}^{car}$ can be found using the EPnP algorithm [23]. Having a real camera image with a detected 2D key point, the observation model can be formulated as follows,

$$Z_{2d}(t) = \begin{bmatrix} f \dfrac{P_{ego,x}^{kp}(t)}{P_{ego,z}^{kp}(t)} + c_x\\ \\ f \dfrac{P_{ego,y}^{kp}(t)}{P_{ego,z}^{kp}(t)} + c_y \end{bmatrix} + \epsilon(t),\tag{6}$$

where $f$ and $(c_x, c_y)$ are the camera focal length and principal point, respectively. These parameters are assumed to be known from prior camera calibration. $\epsilon(t) \sim N(\mathbf{0}, \mathbf{R})$ describes the measurement error. $\mathbf{R} = \sigma^2 \mathbf{I}_{2\times 2}$ is the covariance matrix of a pixel point, where $\sigma = 5$ pixels. The model in (6) depends on $P^{car}(t)$ as $P_{ego}^{kp}$ depends on $P^{car}(t)$. Since $P^{car}(t) \subset X_t$, the model in (6) can be written as:

$$Z_{2d}(t) = \pi(X_t, R_{ego}^{car}(t), P^{kp}) + \epsilon(t),\tag{7}$$

where *π*(.) is the measurement model that describes the relation between the current measurement and the state vector at a specific time step.
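The noise-free part of the observation model, i.e., the frame change in (5) followed by the pinhole projection in (6), can be written as a short function. The sketch below is illustrative only: the function name `project_keypoint`, the intrinsics, and the geometry are hypothetical values chosen to make the arithmetic easy to follow.

```python
import numpy as np


def project_keypoint(p_car, R_car_ego, p_kp, f, cx, cy):
    """Map a car-frame key point to pixel coordinates (noise-free part of the model)."""
    p_kp_ego = p_car + R_car_ego @ p_kp          # key point in the camera frame, Eq. (5)
    u = f * p_kp_ego[0] / p_kp_ego[2] + cx       # pinhole projection, Eq. (6)
    v = f * p_kp_ego[1] / p_kp_ego[2] + cy
    return np.array([u, v])


# Hypothetical intrinsics and geometry.
f, cx, cy = 700.0, 600.0, 180.0
p_car = np.array([0.0, 0.0, 10.0])    # car origin 10 m in front of the camera
R_car_ego = np.eye(3)                 # car axes aligned with the camera axes
p_kp = np.array([1.0, 0.5, 0.0])      # key point in the car frame

print(project_keypoint(p_car, R_car_ego, p_kp, f, cx, cy))  # [670. 215.]
```

Here $u = 700 \times 1/10 + 600 = 670$ and $v = 700 \times 0.5/10 + 180 = 215$; moving the car further away shrinks both offsets from the principal point, which is what makes the depth observable from several key points with known spacing.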

#### 3.2.3. Solution Using EKF

The state vector $X_t$ can be found by minimizing the discrepancy between the detected key points and the 2D virtual key points that can be obtained using the model in (7) and the 3D CAD model of the detected car. Taking into account the dynamic constraints provided by the motion model in (2), the state vector $X_t$ can be estimated by finding $\hat{X}_t$ that minimizes the following cost function,

$$\hat{X}_t = \underset{X_t}{\text{argmin}} \sum_{i=1}^N \|Z_{2d}(t) - \pi(X_t, R_{ego}^{car}(t), P_i^{kp})\|^2, \tag{8}$$

where $N$ is the number of detected key points. The term $\pi(X_t, R_{ego}^{car}(t), P^{kp})$ describes a 2D virtual key point $p_{2d}^{kp}$ calculated using the predicted position $(P^{car})_{t|t-1}$ from the motion model in (2) as follows,

$$P_{ego}^{kp}(t|t-1) = (P^{car})_{t|t-1} + R_{ego}^{car}(t)P^{kp}.\tag{9}$$

After that, the model in (6) is used to get the virtual point as follows,

$$p_{2d}^{kp}(t) = \begin{bmatrix} f\dfrac{(P_{ego,x}^{kp})_{t|t-1}}{(P_{ego,z}^{kp})_{t|t-1}} + c_x\\ \\ f\dfrac{(P_{ego,y}^{kp})_{t|t-1}}{(P_{ego,z}^{kp})_{t|t-1}} + c_y \end{bmatrix}.\tag{10}$$

An EKF can be used to minimize the cost function in (8) using the motion model presented in (2).

• Prediction

$$X_t^- = A\hat{X}_{t-1} - B\tilde{V}_{t-1}^{ego}, \tag{11}$$

$$P_t^{-} = AP_{t-1}A^T + Q.\tag{12}$$

• Update

$$K_t = P_t^{-} H_t^T \left( H_t P_t^{-} H_t^T + R \right)^{-1},\tag{13}$$

$$\hat{X}_t = X_t^- + K_t \left( Z_{2d}(t) - p_{2d}^{kp}(t) \right),\tag{14}$$

$$P_t = (I - K_t H_t) P_t^{-},\tag{15}$$

where

$$H_t = \left.\frac{\partial Z_{2d}}{\partial X}\right|_{X = X_t^-} = \begin{bmatrix} \dfrac{f}{P_{ego,z}^{kp}(t)} & 0 & \dfrac{-f P_{ego,x}^{kp}(t)}{P_{ego,z}^{kp}(t)^2} & 0 & 0 & 0 \\ 0 & \dfrac{f}{P_{ego,z}^{kp}(t)} & \dfrac{-f P_{ego,y}^{kp}(t)}{P_{ego,z}^{kp}(t)^2} & 0 & 0 & 0 \end{bmatrix}_{X = X_t^-}.$$

Here, the discrepancy between the real image point $Z_{2d}$ and the virtual image point $p_{2d}^{kp}$ is represented as an innovation term in the update step, as illustrated in (14).
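The update step for a single key point can be sketched as follows. This is a hedged illustration rather than the implementation used in this work: it uses the standard Kalman gain $P H^T (H P H^T + R)^{-1}$, takes the Jacobian entries from differentiating the pinhole model ($\partial u / \partial P_x = f / P_z$, $\partial u / \partial P_z = -f P_x / P_z^2$, and likewise for $v$), and all function names and numerical values are assumptions.

```python
import numpy as np


def project(p_kp_ego, f, cx, cy):
    """Pinhole projection of a camera-frame point, as in (6) without noise."""
    return np.array([f * p_kp_ego[0] / p_kp_ego[2] + cx,
                     f * p_kp_ego[1] / p_kp_ego[2] + cy])


def jacobian(p_kp_ego, f):
    """2x6 Jacobian of the projection with respect to X = [P_car, V_car]."""
    x, y, z = p_kp_ego
    H = np.zeros((2, 6))
    H[0, 0] = f / z
    H[0, 2] = -f * x / z**2
    H[1, 1] = f / z
    H[1, 2] = -f * y / z**2
    return H   # velocity columns stay zero: one image does not observe velocity directly


def ekf_update(x_pred, P_pred, z_meas, p_kp, R_car_ego, f, cx, cy, R_meas):
    """EKF update for one detected key point."""
    p_kp_ego = x_pred[:3] + R_car_ego @ p_kp              # predicted key point, Eq. (9)
    H = jacobian(p_kp_ego, f)
    innovation = z_meas - project(p_kp_ego, f, cx, cy)    # discrepancy (real - virtual)
    S = H @ P_pred @ H.T + R_meas
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ innovation
    P_new = (np.eye(6) - K @ H) @ P_pred
    return x_new, P_new


# Hypothetical numbers for a single update.
f, cx, cy = 700.0, 600.0, 180.0
R_meas = 25.0 * np.eye(2)                                 # sigma = 5 pixels
x_pred = np.array([0.5, 1.0, 20.0, 0.0, 0.0, 10.0])
P_pred = np.eye(6)
p_kp = np.array([0.6, -0.3, 0.0])
# Measurement generated with the car 0.2 m further to the right than predicted.
z_meas = project(np.array([1.3, 0.7, 20.0]), f, cx, cy)
x_new, P_new = ekf_update(x_pred, P_pred, z_meas, p_kp, np.eye(3), f, cx, cy, R_meas)
print(x_new[:3])
```

In this example the innovation is 7 pixels along $u$, and the update shifts the lateral position estimate from 0.5 m toward the value of 0.7 m used to generate the measurement.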

The steps of the pipeline are summarized in Algorithm 1. The proposed algorithm should be run for each detected car. The whole pipeline is therefore a bank of identical filters, with one copy of the algorithm attached to each detected car.


**Algorithm 1**

**Result:** $\hat{X}_t$, $t = 1, \ldots, K$

Initialize $X_0$;

**for** $t = 1, \ldots, K$ **do**

- **if** *measurements from motion sensor* ($\tilde{V}_{t-1}^{ego}$) **then**
    - $X_{t|t-1} = A\hat{X}_{t-1} - B\tilde{V}_{t-1}^{ego}$.
- **end**
- **if** *camera image* **then**
    - **Extract semantic information**:
        - Detect a car in the scene,
        - Extract $N$ key points and match them with the 3D CAD models to get a number of key points and their associated identity numbers,
        - Convert the semantic information to metric information to get 3D coordinates of the key points resolved in the body frame.
    - **Fusing the semantic information with the motion parameters**:
        - **for** *each point in N key points* **do**
            - Make a 2D virtual key point using the 3D point coordinates and the predicted car position $P_{t|t-1}^{car}$ using (10),
            - Calculate the discrepancy between the virtual key point and the true key point extracted from the image,
            - Update $X_{t|t-1}$ to get $\hat{X}_t$ using the computed discrepancy.
        - **end**
- **end**

**end**
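The loop in Algorithm 1 can be sketched for a single detected car as follows. This is a simplified illustration under several assumptions, not the implementation evaluated in this paper: the key point detection and CAD matching are replaced by noise-free synthetic measurements, the rotation $R_{ego}^{car}$ is taken as the identity, $A$ is the constant-velocity matrix implied by (1), and the intrinsics, key point coordinates, and noise covariances are made-up values.

```python
import numpy as np

F, CX, CY = 700.0, 600.0, 180.0            # hypothetical camera intrinsics
DT = 0.1                                    # 10 Hz data rate
KEYPOINTS = np.array([[-0.6, -0.3, 0.0], [0.6, -0.3, 0.0],
                      [0.6, 0.1, 0.0], [-0.6, 0.1, 0.0]])   # assumed CAD corners (m)
A = np.eye(6)
A[:3, 3:] = DT * np.eye(3)
B = np.zeros((6, 3))
B[:3] = DT * np.eye(3)


def project(p):
    """Pinhole projection of a camera-frame point."""
    return np.array([F * p[0] / p[2] + CX, F * p[1] / p[2] + CY])


def jacobian(p):
    """2x6 Jacobian of the projection with respect to the state."""
    H = np.zeros((2, 6))
    H[0, 0] = H[1, 1] = F / p[2]
    H[0, 2] = -F * p[0] / p[2] ** 2
    H[1, 2] = -F * p[1] / p[2] ** 2
    return H


def run_filter(x0, P0, v_ego_seq, images, Q, R_meas):
    """Run the predict/update loop; images[t] is an (N, 2) array or None (no camera frame)."""
    x, P = x0.copy(), P0.copy()
    history = []
    for v_ego, kps in zip(v_ego_seq, images):
        # Prediction from the motion sensors.
        x = A @ x - B @ v_ego
        P = A @ P @ A.T + Q
        if kps is not None:
            # Sequential update, one key point at a time.
            for p_kp, z in zip(KEYPOINTS, kps):
                p_ego = x[:3] + p_kp
                H = jacobian(p_ego)
                K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R_meas)
                x = x + K @ (z - project(p_ego))
                P = (np.eye(6) - K @ H) @ P
        history.append(x.copy())
    return np.array(history)


# Noise-free synthetic scenario (all values assumed for illustration).
steps = 50
v_ego = np.array([0.0, 0.0, 10.0])
truth = np.array([0.5, 1.0, 20.0, 0.0, 0.0, 12.0])
v_ego_seq, images, truth_seq = [], [], []
for _ in range(steps):
    truth = A @ truth - B @ v_ego
    v_ego_seq.append(v_ego)
    images.append(np.array([project(truth[:3] + k) for k in KEYPOINTS]))
    truth_seq.append(truth.copy())

x0 = np.array([0.5, 1.0, 22.0, 0.0, 0.0, 10.0])    # deliberately wrong initial guess
est = run_filter(x0, np.diag([1.0, 1.0, 9.0, 4.0, 4.0, 4.0]), v_ego_seq, images,
                 Q=1e-4 * np.eye(6), R_meas=25.0 * np.eye(2))
print(np.abs(est[-1] - truth_seq[-1]))
```

Passing `None` for some entries of `images` skips the update for those frames, which is one way to mimic the camera being unavailable. Running one such loop per detected car gives the bank of filters described above.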

#### **4. Results**

In this paper, we have used some scenarios from the KITTI dataset [28,29]. According to [28], the motion parameters of the camera are measured by an OXTS RT GNSS-aided inertial measurement system, which has an RMS velocity error of 0.1 km/h [30]. This value was used for $v_i$. $w_i$ was tuned to be equal to 0.1 m/s<sup>2</sup>. The on-board sensors are synchronized with a data rate of 10 Hz. In this paper, we focused on a detected car in the ego lane as a proof of concept. To start the pipeline, the filter can be initialized by a direct 3D position measurement from a stereo camera or from a monocular camera [31,32] depending on the distance of the object [33]. To continue the pipeline, a body coordinate frame should be attached to the detected car. The geometry configuration of the detected car frame is presented in Figure 2. The 3D coordinates of the four corners of the rear windshield are given unique identity numbers and their 3D positions are represented with respect to the detected car body frame and saved to be used as a database in the pipeline. After that, the four corners are detected manually in each frame and fed to the pipeline. For information fusing, an EKF is used. In this section, we present some qualitative and quantitative results. For the quantitative results, the ground truth is obtained using an on-board 3D HDL-64E Velodyne LiDAR [28].

Figure 3 presents the results of the proposed algorithm. After estimating the position of the detected car, the four corners of the windshield are calculated using the estimated position and the 3D CAD model. Then, the estimated corners are superimposed on the image against the ground truth. It can be seen that they are well aligned, which indicates a good performance of the proposed algorithm. We can also notice that at far distances, more than 25 m, the proposed algorithm still works. This means that the algorithm can cope with the uncertainties in the 3D car models and the measurement errors even at far distances. Figure 4 presents the estimated 3D coordinates of the detected car with respect to the camera. Figure 5 presents the estimation error in the detected car position. It shows a good performance of the proposed algorithm in estimating the car's 3D position, as the proposed algorithm has an error of 3–5% at 30 m distance. To examine the behavior of the algorithm during occlusion, the camera is switched off for some time. As presented in Figure 6, the error in estimation increases in the case of occlusion. However, once the camera measurements become available again, the error decreases and the algorithm converges quickly to the correct estimation.
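The growth of the error during occlusion is what the prediction step alone would suggest: without camera updates, only the covariance propagation $P \leftarrow A P A^T + Q$ is applied. The sketch below illustrates this for an assumed constant-velocity $A$, an assumed identity covariance at the start of the occlusion, and $Q$ set to zero to isolate the effect of the velocity uncertainty; none of these values are taken from the experiments.

```python
import numpy as np

dt = 0.1                       # 10 Hz data rate
A = np.eye(6)
A[:3, 3:] = dt * np.eye(3)     # assumed constant-velocity transition matrix

P = np.eye(6)                  # assumed covariance when the occlusion starts
Q = np.zeros((6, 6))           # process noise set to zero for this illustration
sigma_z = [np.sqrt(P[2, 2])]
for _ in range(10):            # 10 frames (1 s) without camera updates
    P = A @ P @ A.T + Q
    sigma_z.append(np.sqrt(P[2, 2]))

print(round(sigma_z[-1], 3))   # 1.414
```

With these numbers the position variance after $n$ prediction-only steps is $1 + (n \Delta t)^2$, so the standard deviation along $z$ grows from 1 m to about 1.41 m after one second. A nonzero $Q$ would make it grow faster.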

Figure 7 presents the estimation error in $P_z^{car}$ and $V_z^{car}$. It indicates a good performance of the algorithm in velocity estimation even at far distances, with an estimation error of up to 1 m/s.

**Figure 2.** The used coordinate systems for the camera (in black) and the detected car (in blue).

**Figure 3.** The results of the algorithm pipeline focusing on an ego lane car. The estimated 3D position of the rear windshield corners (in blue) are superimposed on the image against the ground truth position (in green).

**Figure 4.** The estimated 3D coordinates of the detected car using the proposed algorithm.

**Figure 5.** Estimation error in the car position presented as error bars around the estimated value.

**Figure 6.** Estimation error in car coordinates in the case of occlusion. The period of occlusion is highlighted with a red oval.

**Figure 7.** Estimation error in $P_z^{car}$ and the longitudinal velocity $V_z^{car}$.

#### **5. Discussion**

Based on the proposed algorithm and the obtained results, the following points need to be emphasized.

- Other approaches do not take into account the dynamic motion constraints of the detected cars while the proposed approach fuses these motion constraints with semantic information to increase estimation accuracy;
- Other approaches do not depend on temporal fusing. They depend on one shot estimation which does not work in the case of occlusion. In contrast, our approach depends on temporal fusing which allows it to predict car position and velocity in the case of occlusion;
- Unlike other approaches which estimate the 3D position of a detected car only, the proposed approach estimates the velocity as well. Including car velocity in the state vector increases the accuracy of the estimation process due to the natural correlation between position and velocity of a detected car.

the estimation process. This can be done by augmenting the state vector to include, in addition to position and velocity variables, a class ID and a model ID and consider them as random variables to be estimated. The implementation of this point is out of the scope of the current paper and will be considered in future work;

• In this paper, Singer's model (constant velocity model) was used to describe the dynamic motion of the detected car. This model is not enough to describe the motion in some cases, as it leads to incorrect results when the constant velocity condition is violated like in the case of turns. A more robust solution can be achieved by using Interacting Multiple Model filter (IMM filter) [25] where several motion models can be used to describe different types of motion scenarios. This point will be investigated in future work.

**Figure 8.** Distance estimation using the proposed algorithm and EPnP algorithm [23].

**Figure 9.** Distance estimation with different levels of image point noise.

#### **6. Conclusions**

In this paper, we proposed an algorithm to estimate the 3D position and velocity components of different cars in a scene. The algorithm fuses semantic information extracted from an object detection algorithm with camera motion parameters measured by on-board sensors. The algorithm uses a prior known 3D CAD model to convert semantic information to metric information, which can be used in an EKF. The experiments on the KITTI dataset confirmed the proof of concept. They showed that the proposed algorithm works well, with a position error of 3–5% at 30 m distance and a velocity estimation error of up to 1 m/s. This position error is small and allows the algorithm to produce a rough idea about the traffic scene at far distances. In addition, the results showed that the algorithm was able to converge quickly after a period of occlusion.

For future work, we plan to use the IMM filter to describe different motion models of the detected cars. In addition, more experiments with automated key point detection on a larger number of detected cars will be done to gain more insight into the performance of the proposed approach in different driving scenarios.

**Author Contributions:** Study conception, P.D.; software development, M.M.; data acquisition, M.M.; data processing, M.M.; visualization, M.M.; writing the original draft, M.M.; development of methodology, M.M. and P.D.; validation, P.D., O.S. and R.P.; manuscript drafting and revision, P.D., O.S. and R.P.; project administration P.D.; data interpretation, R.P. and O.S.; supervision of the project, R.P.; funding acquisition, R.P. and O.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was partially financially supported by the Government of the Russian Federation (Grant 08-08).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Gudalur Spectral Target Detection (GST-D): A New Benchmark Dataset and Engineered Material Target Detection in Multi-Platform Remote Sensing Data**

## **Sudhanshu Shekhar Jha and Rama Rao Nidamanuri \***

Department of Earth and Space Sciences, Indian Institute of Space Science and Technology, Valiamala, Thiruvananthapuram, Kerala 695547, India; sudhanshushekhar.16@res.iist.ac.in

**\*** Correspondence: rao@iist.ac.in; Tel.: +91-471-256-8519

Received: 20 May 2020; Accepted: 18 June 2020; Published: 3 July 2020

**Abstract:** Target detection in remote sensing imagery, i.e., the mapping of sparsely distributed materials, has vital applications in defense security and surveillance, mineral exploration, agriculture, environmental monitoring, etc. The detection probability and the quality of retrievals are functions of various parameters of the sensor, platform, target–background dynamics, targets' spectral contrast, and atmospheric influence. Generally, target detection in remote sensing imagery has been approached using various statistical detection algorithms with an assumption of linearity in the image formation process. Knowledge of the image acquisition geometry, and of the spectral features and their stability across different imaging platforms, is vital for designing a spectral target detection system. We carried out an integrated target detection experiment for the detection of various artificial target materials. As part of this work, we acquired a benchmark multi-platform hyperspectral and multispectral remote sensing dataset named the 'Gudalur Spectral Target Detection (GST-D)' dataset. Positioning artificial targets on different surface backgrounds, we acquired remote sensing data by terrestrial, airborne, and space-borne sensors on 20 March 2018. Various statistical and subspace detection algorithms were applied on the benchmark dataset for the detection of targets, considering the different sources of reference target spectra, background, and the spectral continuity across the platforms. We validated the detection results using the receiver operating characteristic (ROC) curve for different cases of detection algorithms and imaging platforms. Results indicate, for some combinations of algorithms and imaging platforms, consistent detection of specific material targets with a detection rate of about 80% at a false alarm rate between 10<sup>−2</sup> and 10<sup>−3</sup>. Target detection results in satellite imagery obtained using reference target spectra from airborne hyperspectral imagery closely match those obtained using reference spectra derived from the satellite imagery. The ground-based in-situ reference spectra offer a quantifiable detection in airborne or satellite imagery. However, ground-based hyperspectral imagery has also provided an equivalent target detection in the airborne and satellite imagery, paving the way for rapid acquisition of reference target spectra. The benchmark dataset generated in this work is a valuable resource for addressing intriguing questions in target detection using hyperspectral imagery from a realistic landscape perspective.

**Keywords:** target detection; multi-platform imaging; spectral matching; terrestrial-hyperspectral imagery; automated image analysis; spectral library

#### **1. Introduction**

Technological advancements in remote sensing systems have led to the availability of compact and high-resolution imaging sensors deployable on ground, airborne, and space-borne platforms. Because the spectral reflective signatures of different materials are distinct in the optical range of the electromagnetic (EM) spectrum, remote sensing data have been used for land surface characterization from local to global levels. Building upon the broader application domain of hyperspectral remote sensing, various organizations have developed spectral libraries of reference spectral signatures for thousands of natural and human-made materials [1–3]. Target detection is one of the general approaches of remote sensing, which has a broader application perspective. Detecting targets, i.e., specific material objects (natural or engineered) of interest with a sparse spatial distribution, in remote sensing imagery has been an active area of research. Various mapping and surveillance requirements in defense, mineralogy, and precision agriculture can be addressed quickly from a target detection perspective in remote sensing imagery. In principle, target pixels are sparse (about 10 pixels in a million), thus making their detection challenging. Target detection is influenced by the choice of the detection algorithm, sensor, target–background dynamics, and atmospheric perturbance [4–6]. From a target detection perspective, high-resolution multispectral imagery has been used for identifying common land use objects such as buildings, roads, vehicles, and ships [7,8]. Hyperspectral imagery offers appropriate baseline spectral data with the finer spectral bandwidth required for typical target detection problems.

There have been some attempts at using hyperspectral data for target detection for military infrastructure [9], surveillance [10], and mineral mapping [11–13]. However, a comprehensive evaluation of target detection in remote sensing data, particularly from the perspective of the vertical continuum of target spectral footprints in remote sensing imagery acquired from multiple platforms (ground, airborne, and space-borne), has not been carried out. In addition, most of the reported works have approached the target detection problem from general classification theory, wherein a target object is one among the multiple land use categories mapped. In addition to using a single source of remote sensing imagery, the land cover category considered as the "target" to be detected has an abundant spatial distribution and extent, which in theory does not qualify it to be called a target. One of the major impediments in this direction has been the lack of benchmark datasets in the public domain. Most of the recent works on target detection have used the Cooke City, USA, dataset made available by Rochester Institute of Technology (RIT), NY, USA [14] for the evaluation of existing and in-development target detection algorithms. In particular, reference remote sensing imagery for multi-platform based target detection has not been reported so far. Further, most of the experimental data on target detection available to the research community are from a single platform, either airborne or space-borne. Multi-platform target detection experimental data that encompass remote sensing data from different sensors will enhance our understanding of the potential of target detection per se and the dynamics involved in a composite framework.

We have carried out a comprehensive experiment for the acquisition of multispectral imagery (only from a space-borne platform) and hyperspectral imagery from ground, airborne, and space-borne platforms on several engineered/artificial target materials in a complex urban neighborhood. The objective of this research is to explore the target detection problem across imaging platforms and the detection of targets in optical remote sensing data. The key research questions are: How does the detection performance vary as a function of the imaging platform? What is the impact of local background–target interaction on the detection rate? Is the detection rate reproducible for two identical targets? Multi-platform remote sensing datasets were experimentally evaluated for target detection under various scenarios, and the results were validated by computing various statistical measures and the graphical receiver operating characteristic (ROC) curves, since ROC is one of the most robust target detection metrics and is used ubiquitously [4,15,16].

#### **2. Materials and Methods**

#### *2.1. Experimental Design*

The conceptual design of the experimental setup is shown in Figure 1.


**Figure 1.** Conceptual design of the experimental set up used for the acquisition of multi-platform remote sensing data.

The experimental set up consisted of positioning five targets of different artificial thin-sheet materials of different colors (base material: nylon and cotton), each of the size 10 m × 10 m (Figure 2). For ease of referencing throughout the paper, we designate a distinct name for each target used in this study in Table 1. The third letter in the name of a target indicates the color of the target (G: green, R: red, W: white, Y: yellow, B: black).

**Figure 2.** (**a**) True color composite of the AVIRIS-NG hyperspectral imagery with the locations of the artificial targets earmarked; (**b**) location of targets N3Y and N4B; (**c**) location of targets C1W, N1G, and N2R; (**d**) ground truth map, and (**e**–**f**) enlarged view of the ground truth map for different targets. Field photographs (**g**–**k**) showing the artificial targets placed in the study area for imagery acquisition.



**Table 1.** Target materials and naming convention used in the paper.

Out of the five different target materials, we positioned three on natural grass and vegetation features as the background, and two on reflective soil background. To introduce a moderate degree of background resemblance to natural camouflage in the visible spectral range of the electromagnetic spectrum, we positioned two targets (N1G and N3Y) on the grass and soil background. To assess the target detection of materials with broadly similar spectral reflectance characteristics, we chose multiple targets with a single base material but in different colors. Ensuring an overlapping areal extent of the imagery from both the airborne and space-borne platforms, we extracted a subset of the data acquired. The datasets maintain SNR ratio close to one in a million for different scene elements under the different spatial-spectral variabilities of materials in the scene. A true color composite of the airborne hyperspectral imagery marked with footprints of the targets and the corresponding ground truth imagery are shown in Figure 2.

#### *2.2. Data Pre-Processing*

#### 2.2.1. Reference Spectral Data Sources and Pre-Processing

On 20th March 2018, we acquired multi-platform remote sensing data from a ground-based terrestrial hyperspectral imager (THI), an airborne hyperspectral imager (AVIRIS-NG) [17], and a space-borne multispectral sensor (Sentinel-2). The THI is a push-broom hyperspectral imager (Headwall Photonics Inc., USA) mounted on a movable tripod-like platform. The THI acquires hyperspectral imagery in the VNIR region (400–1000 nm) at about 1 nm spectral resolution. In the present setup, imagery with a nominal spatial resolution of 1 cm, further approximated to 20 cm across the targeted area, was acquired in a nadir to oblique view. The AVIRIS-NG hyperspectral sensor was operated to acquire imagery with 4 m spatial resolution and 5 nm spectral resolution in the 400–2500 nm spectral range. The airborne hyperspectral data acquisition was part of the NASA and ISRO research collaboration for the HYPSIRI hyperspectral satellite [18]. The satellite imagery was acquired about one hour before the acquisition of the airborne hyperspectral imagery. Apart from the spectral imagery, we collected point-based in-situ hyperspectral reflectance measurements on the target materials using a field spectroradiometer (Spectra Vista Corporation, HR-1024i, USA) as per the standard procedures [19]. The in-situ measurements are considered pure spectral signatures of the target materials, free of atmospheric and target–surface–neighborhood interactions. Plots of in-situ reference spectral signatures of the target materials are shown in Figure 3. There are two sources of ground-based target reference spectra: the ground-based hyperspectral imagery (THI) (reference in-situ pixels) and the point-based in-situ spectral reflectance from the spectroradiometer. Since the THI collects hyperspectral imagery at a finer spatial resolution, we generated the reference target spectra by sampling target pixels corresponding to different places on the target materials. As the THI imager is sensitive to sensor noise beyond 900 nm, we used the THI data acquired in the spectral range 400 nm to 900 nm. After the initial pre-processing, which included calibration using the concurrent measurements acquired on white reference panels, all the spectral data were convolved and resampled using the sensor response function (SRF) of the respective sensor for analysis across the datasets.
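
The SRF-based convolution and resampling step can be illustrated with a minimal sketch. It assumes a Gaussian SRF defined by a band center and a full width at half maximum (FWHM); the function name `resample_to_sensor`, the synthetic flat spectrum, and the 5 nm band grid are hypothetical and are not taken from the actual sensor specifications used in the study.

```python
import numpy as np

def resample_to_sensor(wavelengths, reflectance, centers, fwhm):
    """Convolve a finely sampled spectrum with Gaussian SRFs (assumed shape)
    and return one reflectance value per target band center."""
    wavelengths = np.asarray(wavelengths, dtype=float)
    reflectance = np.asarray(reflectance, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Convert FWHM to the Gaussian standard deviation, one value per band
    sigma = np.broadcast_to(np.asarray(fwhm, dtype=float), centers.shape)
    sigma = sigma / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    # One row of SRF weights per target band, normalized to sum to 1
    weights = np.exp(-0.5 * ((wavelengths[None, :] - centers[:, None])
                             / sigma[:, None]) ** 2)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ reflectance

# Hypothetical example: 1 nm input spectrum resampled to 5 nm bands
fine_wl = np.arange(400.0, 901.0, 1.0)
fine_refl = np.full(fine_wl.shape, 0.3)          # flat synthetic spectrum
band_centers = np.arange(450.0, 851.0, 5.0)
resampled = resample_to_sensor(fine_wl, fine_refl, band_centers, fwhm=5.0)
```

Because the weights of each band are normalized, a spectrally flat input keeps its reflectance value after resampling, which is a convenient sanity check.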

**Figure 3.** Reference spectral signatures of the artificial target materials acquired from in-situ reflectance measurements.

#### 2.2.2. Pre-Processing of Airborne and Spaceborne Imagery


The airborne AVIRIS-NG hyperspectral imagery was corrected for atmospheric distortions using the radiative transfer based Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes (FLAASH) model [20]. We then removed the noisy and uncalibrated spectral bands in the ranges 1348–1443 nm, 1804–1954 nm, and 2485–2500 nm, resulting in effective imagery with 370 spectral bands. The Sentinel-2 satellite acquires multispectral imagery at different spatial resolutions: 10 m, 20 m, and 60 m. We used the imagery acquired at 10 m and 20 m resolution, corresponding to the blue (490 nm), green (560 nm), red (665 nm), NIR (842 nm), vegetation red edge (705 nm, 740 nm, 783 nm, 865 nm), and SWIR (1610 nm, 2190 nm) bands of the Sentinel-2 product, centered at the given wavelengths. To generate vertically conforming surface reflectance data, we corrected the Sentinel-2 imagery for atmospheric distortions using the same model and sensor-surface hyper-parameters used for the airborne imagery. The imagery acquired at 20 m spatial resolution was resampled to 10 m resolution to conform to the other imagery datasets.
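
As an illustration of the band-removal step, the following sketch builds a boolean mask that drops every band whose center falls inside the three wavelength ranges listed above. The regular 5 nm wavelength grid and the tiny zero-filled cube are hypothetical stand-ins, so the number of retained bands here will not match the 370 bands of the real AVIRIS-NG product.

```python
import numpy as np

# Wavelength ranges (nm) of noisy/uncalibrated bands, as listed in the text
BAD_RANGES_NM = [(1348.0, 1443.0), (1804.0, 1954.0), (2485.0, 2500.0)]

def good_band_mask(wavelengths, bad_ranges=BAD_RANGES_NM):
    """Return True for bands whose center lies outside every bad range."""
    wavelengths = np.asarray(wavelengths, dtype=float)
    mask = np.ones(wavelengths.shape, dtype=bool)
    for lo, hi in bad_ranges:
        mask &= ~((wavelengths >= lo) & (wavelengths <= hi))
    return mask

# Hypothetical band centers and image cube (rows, columns, bands)
wl = np.arange(400.0, 2501.0, 5.0)
cube = np.zeros((3, 3, wl.size))
mask = good_band_mask(wl)
clean_cube = cube[:, :, mask]   # keep only the usable bands
```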

#### *2.3. Experimental Implementation of Target Detection*

An outline of the methodological process flow adopted for the study is shown in Figure 4. The ground position of the targets was recorded using a GPS device. Since the targets used in the experiments were considerably large, we designated the target footprint as a 16-pixel region of interest (ROI) for the airborne imagery and a 4-pixel ROI for the space-borne imagery, on a similar basis as suggested in [15]. It must be noted that, due to the different sensor resolutions (4 m and 10 m for the airborne and space-borne sensors, respectively) and imaging geometry, the target ROI for airborne imagery contains both full-pixel and sub-pixel targets, while the target ROI for space-borne imagery contains predominantly sub-pixel targets. Since part of our aim was to evaluate the target detection possibility from multiple platforms, the input signal sources for the detector algorithms were collected from various sensors, as shown in Figure 4. We visualize three different scenarios: (i) the use of ground-based target spectra for detection from airborne and space-borne imagery, (ii) the use of ground-based hyperspectral imager target spectra for detection from airborne and space-borne imagery, and (iii) the use of airborne-based target spectra for detection from space-borne imagery, which can represent the essence of the target detection problem from multiple civil and defense application perspectives.

**Figure 4.** Methodological framework adopted for the target detection in multi-platform remote sensing imagery.

#### 2.3.1. Target Detection Algorithms

Apart from the target's optical-spectral features and environmental settings, the target detection problem has two other primary perspectives: appropriate spectral imagery and detection algorithms. Given appropriate spectral imagery, target recognition and identification are substantially controlled by the nature of the algorithms used for target detection. While the development of advanced target detection algorithms is not within the purview of this study, it is valuable to analyze the variations of target detections as a function of the detection algorithm. We, therefore, studied the target detection in the datasets with popular detection algorithms available in the literature, evaluating the quality and sensitivity of the target detections based on the algorithms used.

We considered six different detection algorithms: spectral angle mapper (SAM), matched filter (MF), adaptive cosine estimator (ACE), constrained energy minimization (CEM), orthogonal subspace projection (OSP), and target constrained interference minimization filter (TCIMF) for evaluating the target detections on the experimental dataset. The SAM, MF, ACE, and CEM are spectral detectors and hence do not require any prior knowledge of the background. However, OSP and TCIMF require prior scene background characterization. Typically, this is approached by heuristically estimating the number of distinct background materials or endmembers. The number of distinct background materials represents the complexity of the scene and hence is a scene-dependent parameter. We used the SMACC algorithm [21] for the estimation of background endmembers. The detection performance of the OSP and TCIMF was evaluated for three different numbers of background endmembers (5, 10, and 15). We present a summary of the mathematical aspects of target detection and the formulation of the different target detection algorithms used in this study.

#### *2.4. Quantitative Description of Target Detection Algorithms*

The taxonomy of detection algorithms depends on various factors such as target-pixel occupancy (full pixel vs. sub-pixel target), considerations for spectral variability (either for target or background), and modeling the combination of pixel and sub-pixel targets [22]. Given an image $\chi(m,n)$ having $k$ spectral channels and $m \times n$ pixels such that each pixel $\mathbf{x}_i = \{x_1, x_2, x_3, x_4, \ldots, x_k\}^T \in \mathbf{X}_{k \times mn}$, target detection can be expressed as a binary hypothesis testing problem:

$$\begin{aligned} \mathbf{H}_0\ (\text{null hypothesis}) &: \mathbf{x}_i \text{ is noise (target absent)}, \\ \mathbf{H}_1\ (\text{alternate hypothesis}) &: \mathbf{x}_i \text{ is a target}. \end{aligned}$$

Assuming a multivariate normal distribution for target and background, the target detection is represented as a hypothesis testing:

$$\begin{aligned} \mathbf{H}\_0: \mathbf{x} &= \mathbf{n} \\ \mathbf{H}\_1: \mathbf{x} &= \mathbf{s} + \mathbf{n} \end{aligned} \tag{1}$$

where s is the known target spectrum and n is the noise or background with mean vector 'm' and covariance matrix C such that n ∼ *N*(m, C). Since the target and background are assumed to follow a multivariate normal distribution, the probability density function *p*(**x**, θ) for a k-dimensional Gaussian vector **x** is given by:

$$p(\mathbf{x}, \ \Theta) = \frac{1}{\left(2\pi\right)^{\mathbf{k}/2} |\mathbf{C}|^{1/2}} \exp\left\{-\frac{1}{2} [\mathbf{x} - \mathbf{m}]^T \mathbf{C}^{-1} [\mathbf{x} - \mathbf{m}]\right\}.\tag{2}$$

At a given false alarm rate (Neyman–Pearson criterion), the probability of detection is maximized by using a likelihood ratio (LR) type of detectors [23] expressed as:

$$l(\mathbf{x}) = \frac{\mathbf{p}(\mathbf{x}|\mathbf{H}\_1)}{\mathbf{p}(\mathbf{x}|\mathbf{H}\_0)} \overset{\mathbf{H}\_1}{\underset{\mathbf{H}\_0}{\gtrless}} \eta \tag{3}$$

where η is the threshold. If *l*(x) is greater than η, then alternate hypothesis (target-present) is declared true. Equation (1) describes the basic statistical model in case of a full pixel under the ideal assumption of the same covariance estimate for both target and background. However, at times target pixel gets mixed up due to the targets being spatially unresolved. In such cases the appropriate statistical model (also known as replacement model) is:

$$\begin{aligned} \mathbf{H}_0 &: \mathbf{x} = \mathbf{n} \\ \mathbf{H}_1 &: \mathbf{x} = \alpha \mathbf{s} + \beta \mathbf{n} \end{aligned} \tag{4}$$

where $\mathbf{x} \sim N(\mathbf{0}, \mathbf{C})$ under $\mathbf{H}_0$ and $\mathbf{x} \sim N(\alpha\mathbf{s}, \beta^2\mathbf{C})$ under $\mathbf{H}_1$; $\alpha$ refers to the fill fraction of the target, or to the abundances if $\mathbf{s}$ represents a matrix containing endmembers.

Our experimental study involved both kinds of the detection problem, full pixel and sub-pixel targets. Several full and sub-pixel target detection algorithms such as spectral angle mapper (SAM) [24], matched filter (MF) [25], constrained energy minimization (CEM) [26], adaptive cosine estimator (ACE) [27], orthogonal subspace projection (OSP) [28], and target constrained interference minimization filter (TCIMF) [29] were implemented for the detection of targets in this experiment.

Spectral Angle Mapper (SAM):

Modifying the signal model given by Equation (1), we have the hypothesis testing:

$$\begin{aligned} \mathbf{H}_0 &: \mathbf{x} = \mathbf{n} \\ \mathbf{H}_1 &: \mathbf{x} = \alpha \mathbf{s} + \mathbf{n} \end{aligned} \tag{5}$$

where $\alpha$ represents the strength of the target signal in the acquired imagery and $\mathbf{n} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$, with $\sigma^2$ being the variance. We estimated $\alpha$ using the maximum likelihood estimate (MLE) under the modified signal model as:

$$\frac{\partial p(\mathbf{x}|\mathbf{H}\_1)}{\partial \alpha} = \frac{\partial}{\partial \alpha} \left\{ \exp \left( \frac{-1}{2} (\mathbf{x} - \alpha \mathbf{s})^\mathsf{T} \left( \mathbf{x} - \alpha \mathbf{s} \right) \right) \right\}. \tag{6}$$

Setting the derivative in Equation (6) to zero and solving for α, we obtained the MLE of α as follows:

$$
\hat{\alpha} = \frac{\mathbf{s}^{\mathrm{T}}\mathbf{x}}{\mathbf{s}^{\mathrm{T}}\mathbf{s}}.\tag{7}
$$
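
The intermediate step can be written out explicitly. This is a sketch under the assumption that the noise variance does not depend on $\alpha$, so that maximizing the likelihood under the model of Equation (5) is equivalent to minimizing the squared residual:

$$\frac{\partial}{\partial \alpha}\left(\mathbf{x} - \alpha\mathbf{s}\right)^{\mathrm{T}}\left(\mathbf{x} - \alpha\mathbf{s}\right) = -2\,\mathbf{s}^{\mathrm{T}}\mathbf{x} + 2\alpha\,\mathbf{s}^{\mathrm{T}}\mathbf{s} = 0 \quad\Rightarrow\quad \hat{\alpha} = \frac{\mathbf{s}^{\mathrm{T}}\mathbf{x}}{\mathbf{s}^{\mathrm{T}}\mathbf{s}}.$$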

It is usual to estimate the variance ($\sigma^2$) from the image pixel, i.e., the pixel under test, as $\hat{\sigma}^2 = \mathbf{x}^{\mathrm{T}}\mathbf{x}$. Substituting the estimated parameters in Equation (3) and taking the log-likelihood of the distribution functions, the test statistic is given by:

$$r(\mathbf{x}) = \ln \left( \frac{\mathbf{p}(\mathbf{x}|\mathbf{H}\_1)}{\mathbf{p}(\mathbf{x}|\mathbf{H}\_0)} \right) = \frac{\left(\mathbf{s}^\mathrm{T}\mathbf{x}\right)^2}{\left(\mathbf{s}^\mathrm{T}\mathbf{s}\right)\left(\mathbf{x}^\mathrm{T}\mathbf{x}\right)}\,\mathrm{.}\tag{8}$$

We reframed Equation (8) to represent the test statistic known as the spectral angle mapper (SAM) as:

$$r\_{SAM}(\mathbf{x}) = \cos^{-1}\left[\frac{\mathbf{s}^T \mathbf{x}}{\sqrt{(\mathbf{s}^T \mathbf{s})(\mathbf{x}^T \mathbf{x})}}\right].\tag{9}$$

SAM is one of the widely used algorithms in hyperspectral remote sensing for solving spectral classification and matching problems and works on the assumption of a zero-mean and white background. Geometrically, SAM measures the similarity between two n-dimensional vectors based on the cosine of the angle between two vectors.
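
A minimal NumPy sketch of the SAM statistic in Equation (9) is given below. The function name `sam_angle` and the three-band toy spectra are illustrative assumptions; pixels are stored as rows.

```python
import numpy as np

def sam_angle(pixels, target):
    """Spectral angle (radians) between each pixel row and the target spectrum."""
    pixels = np.atleast_2d(np.asarray(pixels, dtype=float))
    target = np.asarray(target, dtype=float)
    cos_theta = (pixels @ target) / (
        np.linalg.norm(pixels, axis=1) * np.linalg.norm(target))
    # Clipping guards against round-off pushing the cosine outside [-1, 1]
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

s = np.array([0.2, 0.4, 0.6])            # hypothetical target spectrum
X = np.array([[0.1, 0.2, 0.3],           # scaled copy of s: angle close to 0
              [0.6, 0.4, 0.2]])          # different spectral shape
angles = sam_angle(X, s)
```

Smaller angles indicate a closer spectral match, and the first pixel shows that the angle is insensitive to a pure scaling of the spectrum.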

Matched Filter (MF):

The assumption of a zero-mean and white background is unrealistic for target detection in a real-world scenario. Allowing a moderate degree of flexibility in this aspect, the MF represents the background with a normal distribution of finite mean and covariance. The signal model then becomes:

$$\begin{aligned} \mathbf{H}_0 &: \mathbf{x} = \mathbf{n} \\ \mathbf{H}_1 &: \mathbf{x} = \alpha\mathbf{s} + \mathbf{n} \end{aligned} \tag{10}$$

where $\mathbf{n} \sim N(\mathbf{m}, \mathbf{C})$, and $\mathbf{m}$, $\mathbf{C}$, and $\alpha$ are the unknown parameters. For the given model, we have:

$$p(\mathbf{x}|\mathbf{H}_0) = \frac{1}{(2\pi)^{k/2} |\hat{\mathbf{C}}|^{1/2}} \exp\left\{ -\frac{1}{2} [\mathbf{x} - \hat{\mathbf{m}}]^T \hat{\mathbf{C}}^{-1} [\mathbf{x} - \hat{\mathbf{m}}] \right\} \tag{11}$$

$$p(\mathbf{x}|\mathbf{H}_1) = \frac{1}{(2\pi)^{k/2} |\hat{\mathbf{C}}|^{1/2}} \exp\left\{ -\frac{1}{2} \left[ \mathbf{x} - \hat{\alpha}\mathbf{s} - \hat{\mathbf{m}} \right]^T \hat{\mathbf{C}}^{-1} [\mathbf{x} - \hat{\alpha}\mathbf{s} - \hat{\mathbf{m}}] \right\} \tag{12}$$

Applying the MLE estimation technique similar to Equation (6) we get:

$$\hat{\alpha} = \frac{\mathbf{s}^T \hat{\mathbf{C}}^{-1} (\mathbf{x} - \hat{\mathbf{m}})}{\mathbf{s}^T \hat{\mathbf{C}}^{-1} \mathbf{s}}, \quad \hat{\mathbf{m}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i, \quad \hat{\mathbf{C}} = \frac{1}{N} \sum_{i=1}^{N} [\mathbf{x}_i - \hat{\mathbf{m}}] \left[\mathbf{x}_i - \hat{\mathbf{m}}\right]^T. \tag{13}$$

Since the detector assumes an additive model, setting α = 1 under the alternate hypothesis gives a mean of s + m rather than s, which is incorrect for a full-pixel target. In addition, α, by definition, is not constrained to be positive and may cause negative test statistics (Eismann et al., 2009). Correcting for these two problems and using the estimates from Equation (13), we can express the MF score *r* as:

$$r\_{\rm MF}(\mathbf{x}) = \frac{(\mathbf{s} - \mathbf{\hat{m}})^T \mathbf{\hat{C}}^{-1} (\mathbf{x} - \mathbf{\hat{m}})}{\sqrt{(\mathbf{s} - \mathbf{\hat{m}})^T \mathbf{\hat{C}}^{-1} (\mathbf{s} - \mathbf{\hat{m}})}} \,\tag{14}$$
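
The MF score of Equation (14) can be sketched in NumPy as follows. The background mean and covariance are estimated from all image pixels as in Equation (13); the function name `matched_filter`, the random synthetic background, and the implanted target spectrum are assumptions made only for illustration.

```python
import numpy as np

def matched_filter(pixels, target):
    """MF score for each pixel row, with mean and covariance estimated from the scene."""
    pixels = np.asarray(pixels, dtype=float)          # shape (N, k)
    target = np.asarray(target, dtype=float)          # shape (k,)
    m_hat = pixels.mean(axis=0)
    centered = pixels - m_hat
    c_hat = centered.T @ centered / pixels.shape[0]   # 1/N covariance estimate
    c_inv = np.linalg.inv(c_hat)
    d = target - m_hat                                # mean-removed target
    return (centered @ c_inv @ d) / np.sqrt(d @ c_inv @ d)

# Synthetic scene: 500 background pixels plus one implanted full-pixel target
rng = np.random.default_rng(0)
background = rng.normal(0.3, 0.05, size=(500, 4))
s = np.array([0.8, 0.1, 0.7, 0.2])
X = np.vstack([background, s])
scores = matched_filter(X, s)
```

In this toy scene the implanted target (the last row) receives the highest score, while background scores scatter around zero.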

Adaptive Cosine Estimator (ACE):

Modifying Equation (4) so that the scale factor β also appears in the null hypothesis yields the following replacement model:

$$\begin{aligned} \mathbf{H}_0 &: \mathbf{x} = \beta\mathbf{n} \\ \mathbf{H}_1 &: \mathbf{x} = \alpha\mathbf{s} + \beta\mathbf{n} \end{aligned} \tag{15}$$

where $\mathbf{n} \sim N(\mathbf{0}, \mathbf{C})$ and α, β are the unknown parameters. The above model is similar to Kelly's detector (Kelly, 1986), except for the introduction of the unknown parameter β in the null hypothesis. The ACE detector was derived based on the assumption of different covariance estimates ($\hat{\mathbf{C}}_0$, $\hat{\mathbf{C}}_1$) under the null and alternate hypotheses. It is assumed that the data under the null hypothesis correspond to training data for noise/background estimation and that the pixel under test (under the alternate hypothesis) is the testing data. Maximizing the joint probability density function of the training and test data yields the following estimates:

$$\hat{\alpha} = \frac{\mathbf{s}^T \hat{\mathbf{C}}^{-1} \mathbf{x}}{\mathbf{s}^T \hat{\mathbf{C}}^{-1} \mathbf{s}}, \quad \hat{\beta}_0^2 = \frac{N - k + 1}{N k}\, \mathbf{x}^T \hat{\mathbf{C}}^{-1} \mathbf{x}, \quad \hat{\beta}_1^2 = \frac{N - k + 1}{N k}\, (\mathbf{x} - \hat{\alpha} \mathbf{s})^T \hat{\mathbf{C}}^{-1} (\mathbf{x} - \hat{\alpha} \mathbf{s}),$$

and

$$\hat{\mathbf{C}}_{0} = \frac{1}{N + 1} \left[ \frac{1}{\hat{\beta}_{0}^{2}} \mathbf{x} \mathbf{x}^{\mathrm{T}} + N \hat{\mathbf{C}} \right], \quad \hat{\mathbf{C}}_{1} = \frac{1}{N + 1} \left[ \frac{1}{\hat{\beta}_{1}^{2}} (\mathbf{x} - \hat{\alpha} \mathbf{s}) (\mathbf{x} - \hat{\alpha} \mathbf{s})^{\mathrm{T}} + N \hat{\mathbf{C}} \right] \tag{16}$$

where $\hat{\beta}_0$, $\hat{\mathbf{C}}_0$ and $\hat{\beta}_1$, $\hat{\mathbf{C}}_1$ are the estimates under the null and alternate hypotheses, respectively. Plugging the derived estimates into the general form of the log-likelihood ratio test detector (Equation (3)), we get the ACE score *r* as:

$$r\_{\rm ACE}(\mathbf{x}) = \frac{(\mathbf{s}^{\rm T}\hat{\mathbf{C}}^{-1}\mathbf{x})^2}{(\mathbf{s}^{\rm T}\hat{\mathbf{C}}^{-1}\mathbf{s})(\mathbf{x}^{\rm T}\hat{\mathbf{C}}^{-1}\mathbf{x})} \,. \tag{17}$$
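
A minimal NumPy sketch of the ACE score in Equation (17) follows. Because the model of Equation (15) assumes a zero-mean background, this sketch subtracts the scene mean from both the pixels and the target before scoring; that preprocessing choice, the function name `ace`, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def ace(pixels, target):
    """ACE score (squared cosine in whitened space) for each pixel row."""
    pixels = np.asarray(pixels, dtype=float)            # shape (N, k)
    m_hat = pixels.mean(axis=0)
    x = pixels - m_hat                                  # assumed mean removal
    s = np.asarray(target, dtype=float) - m_hat
    c_inv = np.linalg.inv(x.T @ x / pixels.shape[0])
    numerator = (x @ c_inv @ s) ** 2
    # Row-wise quadratic form x_i^T C^-1 x_i
    x_energy = np.einsum("ij,jk,ik->i", x, c_inv, x)
    return numerator / ((s @ c_inv @ s) * x_energy)

rng = np.random.default_rng(1)
background = rng.normal(0.3, 0.05, size=(500, 4))
s = np.array([0.8, 0.1, 0.7, 0.2])                      # hypothetical target
X = np.vstack([background, s])
scores = ace(X, s)
```

The score is bounded between 0 and 1, and a pixel identical to the target reaches the upper bound.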

Constrained Energy Minimization (CEM):

The aforementioned spectral detectors assume that the target and background subspaces follow a particular statistical distribution, and the parameters of the detector are derived from that assumed distribution function. This assumption may lead to ambiguous results if the target or background deviates from the assumed distribution. In such situations, it is desirable to design a detector that does not depend upon the target–background distribution function and eliminates the interferer from the target signal. The CEM is one such detector and is functionally equivalent to a finite impulse response (FIR) filter that minimizes the detector output for the background pixels.

Given an image $\chi(m,n)$ with $k$ spectral channels and $N$ pixels such that each pixel $\mathbf{x}_i = \{x_1, x_2, x_3, x_4, \ldots, x_k\}^T \in \mathbf{X}_{k \times N}$, the average energy of the FIR filter output can be written as:

$$\begin{aligned} \frac{1}{N} \sum_{i=1}^{N} \phi_{i}^{2} &= \frac{1}{N} \sum_{i=1}^{N} \left( \mathbf{x}_{i}^{T} \mathbf{W} \right)^{T} \left( \mathbf{x}_{i}^{T} \mathbf{W} \right) \\ &= \mathbf{W}^{T} \left\{ \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_{i} \mathbf{x}_{i}^{T} \right\} \mathbf{W} = \mathbf{W}^{T} \mathbf{R} \mathbf{W} \end{aligned} \tag{18}$$

where $\phi_i = \mathbf{x}_i^T\mathbf{W}$ is the filter output for the pixel vector $\mathbf{x}_i$, $\mathbf{W} = (w_1, w_2, w_3, w_4, \ldots, w_k)^T$ is the weight vector for the designed filter, and $\mathbf{R}$ is the $k \times k$ background correlation matrix. The CEM problem statement then becomes a constrained optimization problem, i.e., $\min_{\mathbf{W}} \mathbf{W}^T\mathbf{R}_{k \times k}\mathbf{W}$ subject to $\mathbf{s}^T\mathbf{W} = 1$. Solving this constrained optimization problem with the method of Lagrange multipliers gives the CEM score *r* as:

$$r_{\rm CEM}(\mathbf{x}) = \frac{(\mathbf{R}^{-1}\mathbf{s})^{\rm T}\mathbf{x}}{\mathbf{s}^{\rm T}\mathbf{R}^{-1}\mathbf{s}}\,. \tag{19}$$
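
The CEM filter can be sketched in NumPy as below. The weight vector is the closed-form Lagrangian solution of the constrained problem, $\mathbf{W} = \mathbf{R}^{-1}\mathbf{s} / (\mathbf{s}^T\mathbf{R}^{-1}\mathbf{s})$, and the score is the filter output $\mathbf{x}^T\mathbf{W}$. The function name `cem` and the synthetic scene are assumptions for illustration.

```python
import numpy as np

def cem(pixels, target):
    """Return the CEM score for each pixel row and the filter weight vector."""
    pixels = np.asarray(pixels, dtype=float)          # shape (N, k)
    s = np.asarray(target, dtype=float)               # shape (k,)
    r_mat = pixels.T @ pixels / pixels.shape[0]       # sample correlation matrix R
    r_inv_s = np.linalg.solve(r_mat, s)               # R^-1 s without explicit inverse
    w = r_inv_s / (s @ r_inv_s)                       # enforces s^T w = 1
    return pixels @ w, w

rng = np.random.default_rng(2)
X = rng.uniform(0.1, 0.6, size=(500, 4))              # hypothetical scene pixels
s = np.array([0.8, 0.1, 0.7, 0.2])                    # hypothetical target
scores, w = cem(X, s)
```

By construction, a pixel equal to the target spectrum produces a filter output of exactly one, while the average output energy over the scene is minimized.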

Orthogonal subspace projection (OSP):

In most of the practical hyperspectral target detection problems, the target size is less than a full pixel. In such cases, spectral mixture models are useful to estimate the material abundances. The OSP assumes a linear mixture model expressed as:

$$\mathbf{x} = \mathbf{M}\boldsymbol{\alpha} + \mathbf{n} \tag{20}$$

where M is a matrix of target/known spectral signatures, α is the abundance, and n is the noise. The OSP first separates the desired target signature from the undesired signatures and then projects each pixel onto the subspace orthogonal to the undesired/interferer signatures before matching it against the desired target. Mathematically, OSP is given by:

$$r\_{\rm OSP} = \mathbf{d}^{\rm T} P\_{\mathbf{U}}^{\perp} \mathbf{x} \tag{21}$$

where d is the desired target, *P*<sub>U</sub><sup>⊥</sup> = I<sub>k×k</sub> − UU<sup>#</sup> is the projection operator that projects an image pixel onto the space orthogonal to U (undesired targets/interferers), U<sup>#</sup> = (U<sup>T</sup>U)<sup>−1</sup>U<sup>T</sup> is the pseudo-inverse of U, and I<sub>k×k</sub> is the identity matrix.
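The OSP score can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the code used in this study: pixels are assumed to be stored as rows of an N × k array, the undesired signatures as columns of a k × q array, and the name `osp_scores` is hypothetical.

```python
import numpy as np


def osp_scores(pixels, desired, undesired):
    """Return r_OSP = d^T P x for every pixel, shape (N,).

    pixels:    array of shape (N, k), one spectrum per row (assumed layout).
    desired:   array of shape (k,), the desired target signature d.
    undesired: array of shape (k, q), undesired signatures U as columns.
    """
    X = np.asarray(pixels, dtype=float)
    d = np.asarray(desired, dtype=float)
    U = np.asarray(undesired, dtype=float)
    k = U.shape[0]
    # Orthogonal-complement projector P = I - U U^#, where U^# is the
    # Moore-Penrose pseudo-inverse of U.
    P = np.eye(k) - U @ np.linalg.pinv(U)
    # P is symmetric, so d^T P x equals x^T (P d) for each row x of X.
    return X @ (P @ d)
```

A pixel that consists only of undesired signatures lies in the column space of U and is therefore projected to a score of (numerically) zero.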

Target constrained interference minimization filter (TCIMF):

In this approach, the image is assumed to be a combination of three signal components, i.e., desired (targets), undesired (unwanted/background), and interferer component. Like the CEM, the desired component is accentuated while suppressing the interference signal. The TCIMF is a theoretical superset of CEM and capable of detecting multiple targets at once, unlike CEM and OSP. Mathematically, TCIMF score is given as:

$$r\_{\rm TCIMF}(\mathbf{x}) = \left\{ \mathbf{R}\_{\mathbf{k}\times\mathbf{k}}^{-1}[\mathbf{D}\,\mathbf{U}] \left( [\mathbf{D}\,\mathbf{U}]^{\rm T} \mathbf{R}\_{\mathbf{k}\times\mathbf{k}}^{-1}[\mathbf{D}\,\mathbf{U}] \right)^{-1} \begin{bmatrix} \mathbf{1}\_{\mathbf{p}\times 1} \\ \mathbf{0}\_{\mathbf{q}\times 1} \end{bmatrix} \right\}^{\rm T} \mathbf{x} \tag{22}$$

where D = [d<sub>1</sub>, d<sub>2</sub>, . . . , d<sub>p</sub>] is the set of desired/known target signals and U = [u<sub>1</sub>, u<sub>2</sub>, . . . , u<sub>q</sub>] is the set of known background/unwanted signals in the image.
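A minimal NumPy sketch of the TCIMF weights is given below. It assumes pixels as rows of an N × k array and the signatures in D and U as columns; the function name `tcimf_filter` is hypothetical and the code is illustrative, not the implementation used in this study. The weights are built so that every desired signature is passed with unit gain and every undesired signature is nulled.

```python
import numpy as np


def tcimf_filter(pixels, D, U):
    """Return the TCIMF weight vector, shape (k,).

    pixels: array of shape (N, k), one spectrum per row (assumed layout).
    D:      array of shape (k, p), desired target signatures as columns.
    U:      array of shape (k, q), undesired signatures as columns.
    """
    X = np.asarray(pixels, dtype=float)
    D = np.asarray(D, dtype=float)
    U = np.asarray(U, dtype=float)
    # Sample correlation matrix R, shape (k, k).
    R = X.T @ X / X.shape[0]
    DU = np.hstack([D, U])                      # [D U], shape (k, p + q)
    # Constraint vector: ones for desired, zeros for undesired signatures.
    c = np.concatenate([np.ones(D.shape[1]), np.zeros(U.shape[1])])
    R_inv_DU = np.linalg.solve(R, DU)           # R^{-1} [D U]
    # w = R^{-1}[D U] ([D U]^T R^{-1} [D U])^{-1} c
    return R_inv_DU @ np.linalg.solve(DU.T @ R_inv_DU, c)
```

The score of each pixel is then `pixels @ tcimf_filter(pixels, D, U)`. With a single desired signature and an empty U, these weights reduce to the CEM weights, which is the sense in which TCIMF generalizes CEM.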

#### *2.5. Validation, and Quantitative Spectral Analysis*

The detection results from the different detection algorithms were compared against the ground truth map prepared for each case. Graph-based measures have been increasingly used for quantifying accuracy in various pattern recognition applications, especially in cases of skewed class distributions [30]. Because targets occur rarely, target detection approximates a skewed class distribution [31]. We adopted the widely used ROC graphical measure for accuracy assessment. Based on the verified labels of the detections, ROC curves were drawn between the probability of false alarm (P<sub>FA</sub>) and the probability of detection (P<sub>D</sub>), expressed as:

$$\begin{aligned} \text{P}\_{\text{D}} &= \frac{\text{Number of correctly identified target pixels}}{\text{Total number of actual target pixels}}\\ \text{P}\_{\text{FA}} &= \frac{\text{Number of pixels identified as false targets}}{\text{Total number of non-target pixels}}. \end{aligned} \tag{23}$$
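The two rates in Equation (23) can be computed from a detection score image and a ground truth mask as sketched below. This is a plain-Python illustration with hypothetical function names; it assumes both inputs have been flattened to equal-length sequences and that a pixel is flagged as a target when its score is greater than or equal to the threshold, which is a convention chosen here rather than one stated in the text.

```python
def detection_rates(scores, is_target, threshold):
    """Return (P_D, P_FA) when pixels with score >= threshold are flagged."""
    true_pos = false_pos = n_target = n_background = 0
    for score, target in zip(scores, is_target):
        flagged = score >= threshold
        if target:
            n_target += 1
            true_pos += flagged      # correctly identified target pixel
        else:
            n_background += 1
            false_pos += flagged     # non-target pixel flagged as target
    return true_pos / n_target, false_pos / n_background


def roc_points(scores, is_target):
    """Return (P_D, P_FA) pairs, sweeping each distinct score as threshold."""
    thresholds = sorted(set(scores), reverse=True)
    return [detection_rates(scores, is_target, t) for t in thresholds]
```

Plotting P<sub>D</sub> against P<sub>FA</sub> for the returned pairs gives the ROC curve; the operating points at P<sub>D</sub> = 75% and 100% are the first pairs whose detection rate reaches those levels.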

The possibility and quality of target detections from multi-platform remote sensing imagery depend upon the existence and quantification of inherent spectral matching between target spectra from different platforms. Quantitative analysis of the spectral matching between the various combinations of reference target spectra and imaging platform deciphers the basis of target detections by detection algorithms. For each of the possible scenarios considered, we applied multiple spectral matching metrics: spectral angle (SA) [24], spectral information divergence (SID) [32], and spectral gradient angle (SGA) [33] on the spectral data extracted from the ground reference (ground hyperspectral imagery, and point-based spectral measurements) and the airborne and space-borne imagery. We present a brief description of the spectral matching metrics considered.

Consider any two n-dimensional vectors P = (*p*<sub>1</sub>, *p*<sub>2</sub>, *p*<sub>3</sub>, . . . , *p*<sub>n</sub>)<sup>*T*</sup> and Q = (*q*<sub>1</sub>, *q*<sub>2</sub>, *q*<sub>3</sub>, . . . , *q*<sub>n</sub>)<sup>*T*</sup>. The spectral matching metrics SA, SID, and SGA are defined as:

$$\text{SA} \left( \mathbf{P}, \mathbf{Q} \right) = \cos^{-1} \left( \frac{\langle \mathbf{P}, \mathbf{Q} \rangle}{\| \mathbf{P} \|\_{2} \, \| \mathbf{Q} \|\_{2}} \right) \tag{24}$$

where ⟨·, ·⟩ denotes the dot product of two vectors and ‖·‖<sub>2</sub> denotes the Euclidean norm of a vector.
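The spectral angle is straightforward to compute. The sketch below is a plain-Python illustration with a hypothetical function name; it assumes both spectra have the same number of bands and are not all-zero.

```python
import math


def spectral_angle(p, q):
    """Return the spectral angle (in radians) between spectra p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_q)))
    return math.acos(cosine)
```

Because the angle depends only on the direction of the two vectors, scaling either spectrum by a positive constant leaves the result unchanged, which is the basis of the metric's insensitivity to illumination.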

$$\begin{aligned} \text{SID}(\mathbf{P}, \mathbf{Q}) &= D(\mathbf{P} \,\|\, \mathbf{Q}) + D(\mathbf{Q} \,\|\, \mathbf{P}) \\ &= \sum\_{i=1}^{n} \left( \frac{p\_i}{\sum\_{j=1}^{n} p\_j} - \frac{q\_i}{\sum\_{j=1}^{n} q\_j} \right) \left[ \log \left( \frac{p\_i}{\sum\_{j=1}^{n} p\_j} \right) - \log \left( \frac{q\_i}{\sum\_{j=1}^{n} q\_j} \right) \right] \end{aligned} \tag{25}$$

where *D*(P ∥ Q) and *D*(Q ∥ P) are called the relative entropy of Q with respect to P and the relative entropy of P with respect to Q, respectively.
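A plain-Python sketch of SID follows. The function name is hypothetical, and the code assumes strictly positive band values so that every logarithm is defined; handling of zero or negative reflectance values is left out of this illustration.

```python
import math


def spectral_information_divergence(p, q):
    """Return SID(p, q) for two spectra with strictly positive values."""
    sum_p, sum_q = sum(p), sum(q)
    sid = 0.0
    for a, b in zip(p, q):
        # Normalise each spectrum so that it sums to one.
        pa, qb = a / sum_p, b / sum_q
        # Each term is the sum of the two relative-entropy contributions.
        sid += (pa - qb) * (math.log(pa) - math.log(qb))
    return sid
```

Every term of the sum is non-negative, so the score is zero only when the two normalised spectra coincide, and it is symmetric in its two arguments.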

SID is a probabilistic approach to measure the spectral similarity between two spectra. Each pixel is represented in the probabilistic space defined by their spectral histogram. Thus, the SID score is an indication of the behavioral difference in the probability distribution function of any two pixels. A score close to zero from the SA and SID indicates that the spectra are similar [26,34]. The spectral gradient angle can be expressed as:

$$\begin{aligned} \text{SGA } (\mathcal{P}, \mathcal{Q}) &= \text{SA } (abs(\text{SG}(\mathcal{P})), \, abs(\text{SG}(\mathcal{Q}))) \text{ and} \\ \text{SG } (\mathcal{P}) &= (p\_2 - p\_1, \, p\_3 - p\_2, \dots, p\_n - p\_{n-1}), \end{aligned} \tag{26}$$

where SG (.) is the spectral gradient of a given vector. The SGA computes the change of slope of the pixel vectors and is thus invariant to illumination condition similar to SA; a lower value of SGA suggests closer matching of the spectra compared.
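The spectral gradient angle can be sketched in the same way. The function names below are hypothetical; the spectral angle helper is repeated so that the block runs on its own, and the code assumes that neither spectrum is flat, since a flat spectrum has an all-zero gradient and an undefined angle.

```python
import math


def spectral_gradient(p):
    """Return the band-to-band differences (p2 - p1, ..., pn - p(n-1))."""
    return [b - a for a, b in zip(p, p[1:])]


def spectral_angle(p, q):
    """Return the angle (in radians) between vectors p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_q)))
    return math.acos(cosine)


def spectral_gradient_angle(p, q):
    """Return SGA(p, q): the spectral angle between absolute gradients."""
    grad_p = [abs(g) for g in spectral_gradient(p)]
    grad_q = [abs(g) for g in spectral_gradient(q)]
    return spectral_angle(grad_p, grad_q)
```

Since the gradient removes any constant offset and the angle removes any positive scale factor, two spectra that differ only by an offset or a gain give an SGA of (numerically) zero.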

#### **3. Results**

Our experimental setup was aimed at examining three critical perspectives in remote sensing-based target detection: (i) platform: the probability and consistency of target detection vis-à-vis platforms, (ii) reference target spectra: the relevance and level of agreement of cross-platform target reference spectra, and (iii) detection algorithm: the variation of detection due to detection algorithms. The first component was approached by quantifying the magnitude and patterns of variation of *P<sub>D</sub>* with the three levels of platforms considered. The second component was addressed by comparing the levels of target detection rates between two sets of reference target spectra generated: from the same dataset and the cross-platform dataset. The third perspective, the influence of algorithms on the detection results, was assessed by measuring the change in patterns and detection rates from the different detection algorithms considered. As different detection algorithms characterize scene background at varying levels of land cover composition, the sensitivity of detection rates relative to the scene complexity (characterized by the number of endmembers) and the contrast between the target and its neighborhood was also assessed. The spectral analysis assessing the matching, or lack of it, in the multi-platform target spectra, a quantitative comparison of the ground-based target reference spectra with the image-based target spectra, was also performed using three different spectral matching metrics. We present the results organized based on the source of the target reference spectra. We considered target detection successful at detection probabilities (*P<sub>D</sub>*) of 100% and 75%, recognizing the fact that the datasets encompass a wider range of spectral variability. The detection and false alarm rates from different combinations of the platforms and algorithms are described in detail.

#### *3.1. In-Situ Measurements as Reference Target Spectra*

In this section, we present the results of target detection experiments when the in-situ reflectance measurements were used as the reference target spectra for target detection in airborne and space-borne imagery.

#### 3.1.1. Target Detection in Airborne Hyperspectral Imagery

Results of the target detection in airborne hyperspectral imagery are summarized in Figure 5, and the corresponding representative detection score image is shown in Figure 6. The detection score image is a raster image that contains a scalar value, also known as the score, for each pixel. The value represents the likelihood of the pixel being flagged as target or non-target. Results indicate successful target detections for the different types of target materials, meeting the detection threshold of *P<sub>D</sub>* = 100% for some materials. Overall, the detection rate is consistent across the types of materials. Except for SAM, all the detectors produced an average detection rate of 75% at nearly zero false alarm rate.

**Figure 5.** Target detection performance comparison in airborne imagery for the in-situ target reference spectra. Receiver operating characteristic (ROC) curves for the detection from spectral angle mapper (SAM), adaptive cosine estimator (ACE), constrained energy minimization (CEM), and matched filter (MF) for the (**a**) N1G, (**b**) N2R, (**c**) C1W, (**d**) N3Y, and (**e**) N4B targets. ROC curves for the detection from orthogonal subspace projection (OSP) and target constrained interference minimization filter (TCIMF) for the N1G, N2R, C1W, N3Y, and N4B targets for (**f**–**j**) 5, (**k**–**o**) 10, and (**p**–**t**) 15 background materials.

*Remote Sens.* **2020**, *12*, 2145 13 of 30

**Figure 6.** Target detection score image from (**a**) airborne imagery using in-situ reference target spectra, and the enlarged detection score footprint for (**b**) N3Y, (**c**) N4B, (**d**) N1G, (**e**) N2R, and (**f**) C1W target (In all the target detection score images, a brighter pixel indicates a higher target detection score and thus a higher probability for it to be declared as a target).

Detection rate vs. scene complexity: In contrast to the generally acceptable levels of detection rates for a broader approximation of the scene background, detection rates vary substantially with the scene complexity and the target–neighborhood contrast. The detection rates are consistent and satisfy the lower threshold when the scene complexity was represented by five endmembers. When the scene complexity increased to 15 endmembers, the false alarm rate increased steeply, indicating substantial performance degradation in some detection algorithms. The rise in the false alarm rate was not uniform and varied across the classes of detection algorithms.

Identical materials vs. background contrast: It is expected that targets of identical material, even if of a different color or on a different background, are recognizable in hyperspectral imagery. Results indicate that an identical base material target in a different color or on a different background introduces substantial ambiguity in the quality of target detection. For example, at *P<sub>D</sub>* of 75%, the *P<sub>FA</sub>* from the CEM method is 0.0685 and 1.02 × 10<sup>−4</sup>, respectively, for the targets N2R and N1G placed on the same background. Similarly, the *P<sub>FA</sub>* for the ACE method is 0.017 and 2 × 10<sup>−6</sup>, respectively, for the N4B and N1G targets placed on different backgrounds. During the detection of the N2R, the N1G was also flagged as a potential target and vice versa (see Figure 6d,e). The failure to suppress targets of identical color but of physically different materials is one of the challenging problems encountered for spectrally close materials. In absolute terms, *P<sub>FA</sub>* appears too low for the relevant target detections to be considered ambiguous. However, when the corresponding *P<sub>FA</sub>* estimates are converted into actual pixel counts, the certainty of detection is far from the ideal case. For instance, for the N1G target, the CEM flags ~70 false alarm pixels distributed across the imagery. If the required detection rate is increased to 100% (i.e., *P<sub>D</sub>* = 100%), almost all the detectors show substantially poorer detection results in terms of completeness of the targets. Overall, results suggest that, apart from the target–background interaction, the spectral contrast of targets plays a substantial role in the detectability.
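The conversion from a false alarm probability to a pixel count follows directly from Equation (23): the number of false alarm pixels is the false alarm probability multiplied by the number of non-target pixels. The sketch below illustrates the arithmetic; the function name is hypothetical, and the scene size of 700,000 non-target pixels is an assumed value chosen for illustration, not the size of the imagery used in this study.

```python
def false_alarm_pixels(p_fa, n_non_target_pixels):
    """Return the number of false-alarm pixels implied by a given P_FA."""
    return round(p_fa * n_non_target_pixels)


# Hypothetical scene size (an assumption, not taken from the study).
N_NON_TARGET = 700_000

for p_fa in (1.02e-4, 0.0685):
    count = false_alarm_pixels(p_fa, N_NON_TARGET)
    print(f"P_FA = {p_fa:g} -> {count} false-alarm pixels")
```

A false alarm probability on the order of 10<sup>−4</sup> can thus still correspond to tens of flagged pixels once a scene contains several hundred thousand pixels.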

#### 3.1.2. Target Detection in Spaceborne Remote Sensing Imagery

Results of the target detection in space-borne multispectral imagery are summarized in Figure 7, and the corresponding representative detection score image is shown in Figure 8. Due to coarse spectral and spatial resolutions and the substantially higher level of atmospheric influences, target detection in space-borne multispectral imagery is challenging compared to airborne hyperspectral imagery. Use of the in-situ reflectance measurements, considered a pure form of reference spectra, as target reference spectra elicited no quantifiable spectral discrimination of target pixels in the satellite imagery. As evident from Figure 8, the detection scores of the target pixels and of the surrounding pixels are similar for targets N1G and N2R, resulting in higher false alarm rates across all the algorithms (Figure 7). While the detection results included the pixels of targets, the apparent gross overestimation indicates the detection results to be unreliable. The detection algorithms either fail to detect or the respective false alarm rates are higher due to the relatively small number of estimated background endmembers. However, when the probability of detection was set at 75% and the scene complexity was increased by representing it with a large number of endmembers (10 or more), the sub-pixel target detection algorithms (e.g., CEM, TCIMF, Figure 7p) produced stable detection results. It is interesting to note that, unlike target detection in airborne imagery, there was no change in the false alarm rate when the probability of detection was increased from 75% to 100%.

**Figure 7.** Target detection performance comparison from space-borne imagery for the in-situ target reference spectra. ROC curves for the detection from SAM, ACE, CEM, and MF for the (**a**) N1G, (**b**) N2R, (**c**) N3Y, (**d**) C1W, and (**e**) N4B targets. ROC curves for the subspace-based detector OSP and TCIMF for the N1G, N2R, N3Y, C1W, and N4B targets for (**f**–**j**) 5, (**k**–**o**) 10, and (**p**–**t**) 15 endmember/background materials.

**Figure 8.** Target detection score image (**a**) from space-borne imagery using in-situ target reference spectrum and the enlarged detection score footprint for (**b**) N3Y, (**c**) N4B, (**d**) N1G, (**e**) N2R, and (**f**) C1W targets.


#### *3.2. Ground-Based Hyperspectral Imagery (THI) as Reference Target Spectra*

In remote sensing, in-situ or laboratory-based measurement of spectral reflectance is considered to be the pure form of the spectral signature of a material. While the emphasis on the purity of the spectral signature rests on a theoretically sound basis, the results presented in this section indicate that a pixel-based reference spectrum is a viable substitute for the in-situ spectra.

#### 3.2.1. Target Detection in Airborne Hyperspectral Imagery

The results of target detection in airborne hyperspectral imagery and a representative detection score image are shown in Figures 9 and 10. Results indicate the possibility of target detection, suggesting the existence of a spatially distinct spectral matching between the ground hyperspectral imagery and the airborne hyperspectral imagery. As shown in Figure 10e, in the case of the THI reference spectrum, suppression of similar but different targets (N1G suppressed when N2R was detected and vice versa) is superior compared to the results from in-situ reference spectra (see Figure 6). However, the false alarm rate is higher compared to the extent and spatial distribution of the target pixels in the airborne hyperspectral imagery. This may be due to the limited spectral coverage of the reference spectra (400–1000 nm) compared to the full optical spectrum of the airborne hyperspectral imagery (400–2500 nm). As the targets considered are inorganic artificial materials, spectral reflectance in the shortwave infrared region (1000–2500 nm) may provide characteristic spectral discrimination. Compared to the case of using in-situ reference target spectra, spectral matching-based detection algorithms showed a relatively better detection rate, consistent across the targets. In addition, contextually camouflaged targets were also detected, as indicated by the relatively higher scores of *P<sub>D</sub>* and negligible scores of *P<sub>FA</sub>*.

The detection rate of the targets by background-characterization-based algorithms is ambiguous. In-scene estimation of background material spectra was poor. For example, for the N3Y target, detection by TCIMF improved when the estimated number of background materials increased from 5 to 15 but degraded at the same time for the N2R target. As observed, if the *P<sub>D</sub>* rate is required to be high (*P<sub>D</sub>* = 100%), the detection rate from all the detectors is unacceptable for any practical system.

**Figure 9.** Target detection performance comparison in airborne imagery for the terrestrial hyperspectral imager (THI) target reference spectra. ROC curves for the detection from SAM, ACE, CEM, and MF for the (**a**) N1G, (**b**) N2R, (**c**) N3Y, and (**d**) N4B targets. ROC curves for the subspace-based detector OSP and TCIMF for the N1G, N2R, N3Y, and N4B targets for (**e**–**h**) 5, (**i**–**l**) 10, and (**m**–**p**) 15 endmember/background materials.


**Figure 10.** Target detection score image from (**a**) airborne imagery using THI target reference spectrum and the enlarged detection score footprint for (**b**) N3Y, (**c**) N4B, (**d**) N1G, and (**e**) N2R target.

#### 3.2.2. Target Detection in Spaceborne Remote Sensing Imagery

With THI pixel spectra as target reference spectra, the results of target detection in space-borne multispectral imagery and a representative detection score image are shown in Figures 11 and 12, respectively. Similar to the results obtained with the point-based in-situ target reference spectra, the target detection in space-borne multispectral imagery is ambiguous across the types of targets. A couple of detection algorithms (e.g., CEM, OSP) produced detection scores meeting the threshold limit. However, the corresponding disproportionately high false alarm rate indicates that the detection is by chance.

**Figure 11.** Target detection performance comparison in space-borne imagery for the THI target reference spectra. ROC curves for the detection from SAM, ACE, CEM, and MF for the (**a**) N1G, (**b**) N2R, (**c**) N3Y, and (**d**) N4B targets. ROC curves for the subspace-based detector OSP and TCIMF for the N1G, N2R, N3Y, C1W, and N4B targets for (**e**–**h**) 5, (**i**–**l**) 10, and (**m**–**p**) 15 endmember/background materials.


**Figure 12.** Target detection score image from (**a**) space-borne imagery using THI target reference spectra and the enlarged detection score footprint for (**b**) N3Y, (**c**) N4B, (**d**) N1G, and (**e**) N2R target.

#### *3.3. Target Reference Spectra from the Airborne Hyperspectral Imagery*

Target detection experiments were carried out on the airborne hyperspectral imagery and space-borne multispectral imagery using pixel-based spectra extracted from the airborne hyperspectral imagery as target reference spectra.

#### 3.3.1. Target Detection in Airborne Hyperspectral Imagery

Figure 13 shows the target detection scores for the different types of targets in the airborne hyperspectral imagery. Targets were detected with detection scores exceeding 90% with negligible false alarm rates. The accurate detections at the lowest false alarm rates across the target types and detection algorithms indicate the possibility of consistent target detections in airborne hyperspectral imagery. However, the relatively higher rate of false positives for the contextually camouflaged targets suggests the dominance of local background–target interactions (as evident in Figure 14) on the radiance measurements. The limitations of the present suite of detection algorithms in discerning complex background–target interactions might also be a reason for the higher false alarm rate when detecting contextually camouflaged targets.

the dominance of local background–target interactions (as evident in Figure 14) on the radiance measurements. The limitations of the present suite of detection algorithms in discerning complex background–target interactions might also be a reason higher false alarm rate for detecting contextually

**Figure 13.** Target detection performance comparison in airborne imagery for the airborne target reference spectra. ROC curves for the detection from SAM, ACE, CEM, and MF for the (**a**) N1G, (**b**) N2R, (**c**) N3Y, (**d**) C1W, and (**e**) N4B targets. ROC curves for the subspace-based detector OSP and TCIMF for the N1G, N2R, N3Y, C1W, and N4B targets for (**f**–**j**) 5, (**k**–**o**) 10, and (**p**–**t**) 15 endmember/background materials.

**Figure 14.** Target detection score image from (**a**) airborne imagery using airborne target reference spectrum and the enlarged detection score footprint for (**b**) N3Y, (**c**) N4B, (**d**) N1G, (**e**) N2R, and (**f**) C1W target.

#### 3.3.2. Target Detection in Spaceborne Multispectral Imagery

The target reference spectra extracted from the airborne hyperspectral imagery were transferred and convolved to the space-borne level for target detection in the space-borne multispectral imagery. The detection results are summarized in Figure 15, and a representative detection score image is shown in Figure 16. Most of the detection results are ambiguous with a higher rate of false alarms. However, when compared to the detection results from using in-situ target reference spectra, detection in satellite imagery increased substantially across the targets and algorithms. For instance, in the case of MF and ACE, the rate of false positives at a $P_D$ of 75% is very low ($10^{-2}$ to $10^{-5}$). Further, contrary to the influence of background types observed in the airborne imagery, target detection in space-borne imagery does not seem sensitive to the local background. For example, for two different targets (e.g., N1G and N2R) placed against the same background, the difference in false alarm rate is relatively low. However, this sensitivity is not stable across the detection algorithms. The subspace detectors continued to yield ambiguous detection results for most of the targets. The differences in the spatial and spectral resolutions, coupled with acquisition geometry and enhanced atmospheric effects, may have led to the relatively weaker target localization in the space-borne imagery.
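Convolving an airborne hyperspectral reference spectrum to the space-borne level can be pictured as a response-weighted average over each broad band. The sketch below is only an illustration under stated assumptions: it uses Gaussian spectral response functions defined by a band center and full width at half maximum, and the function names, the synthetic spectrum, and the band parameters are hypothetical. It is not the sensor's published response or the procedure used in this study.

```python
import math

def gaussian_srf(wavelength, center, fwhm):
    """Gaussian spectral response (an assumed shape) at one wavelength."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return math.exp(-0.5 * ((wavelength - center) / sigma) ** 2)

def resample_to_bands(wavelengths, reflectance, bands):
    """Convolve a narrow-band spectrum to broad bands.

    bands: list of (center, fwhm) pairs in the same unit as wavelengths.
    Returns one response-weighted mean reflectance per band.
    """
    resampled = []
    for center, fwhm in bands:
        weights = [gaussian_srf(w, center, fwhm) for w in wavelengths]
        total = sum(weights)
        if total == 0.0:
            raise ValueError("band lies outside the sampled wavelengths")
        weighted = sum(wt * r for wt, r in zip(weights, reflectance))
        resampled.append(weighted / total)
    return resampled

# Hypothetical spectrum sampled every 5 nm between 400 and 1000 nm
wavelengths = [400 + 5 * i for i in range(121)]
reflectance = [0.1 + 0.0005 * (w - 400) for w in wavelengths]  # linear ramp
bands = [(490, 65), (560, 35), (665, 30), (842, 115)]  # illustrative values only
broadband = resample_to_bands(wavelengths, reflectance, bands)
```

With measured response functions available, the Gaussian weights would simply be replaced by the tabulated response values at each wavelength.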

**Figure 15.** Target detection performance comparison in space-borne imagery for the airborne target reference spectra. ROC curves for the detection from SAM, ACE, CEM, and MF for the (**a**) N1G, (**b**) N2R, (**c**) N3Y, (**d**) C1W, and (**e**) N4B targets. ROC curves for the subspace-based detector OSP and TCIMF for the N1G, N2R, N3Y, C1W, and N4B targets for (**f**–**j**) 5, (**k**–**o**) 10, and (**p**–**t**) 15 endmember/background materials.

**Figure 16.** Target detection score image from (**a**) space-borne imagery using airborne target reference spectrum and the enlarged detection score footprint for (**b**) N3Y, (**c**) N4B, (**d**) N1G, (**e**) N2R, and (**f**) C1W target.

#### *3.4. Target Reference Spectra from the Spaceborne Multispectral Imagery*

The results of target detection in space-borne imagery obtained using in-scene target reference spectra are shown in Figure 17, and a detection score image for the best-case detection is shown in Figure 18. Results indicate improved detection scores and low false alarms compared to the detection performance obtained from using the target reference spectra from in-situ spectral measurements or airborne hyperspectral pixel spectra. The performance of all the statistical detectors is similar, and detection rates meet the 75% level of probability. However, detection performance from the subspace target detectors is random and unreliable. The overall detection results show substantial viability in the detection of the engineered targets using the in-scene multispectral target spectra from the space-borne imagery.

**Figure 17.** Target detection performance comparison in space-borne imagery for the space-borne target reference spectra. ROC curves for the detection from SAM, ACE, CEM, and MF for the (**a**) N1G, (**b**) N2R, (**c**) N3Y, (**d**) C1W, and (**e**) N4B targets. ROC curves for the subspace-based detector OSP and TCIMF for the N1G, N2R, N3Y, C1W, and N4B targets for (**f**–**j**) 5, (**k**–**o**) 10, and (**p**–**t**) 15 endmember/background materials.

**Figure 18.** Target detection score image from (**a**) space-borne imagery using space-borne target reference spectrum and the enlarged detection score footprint for (**b**) N3Y, (**c**) N4B, (**d**) N1G, (**e**) N2R, and (**f**) C1W target.


#### *3.5. Quantitative Spectral Similarity Analysis*

Results of the spectral similarity assessment between the possible pairs of ground, airborne, and space-borne target reference spectra are presented in Tables 2–4. For visual comparison, spectral signatures of the targets from imagery and reference sources are shown in Figure 19. We found considerable spectral variability in the in-scene target spectra, particularly in the case of in-situ reference spectra compared to the airborne image spectra (Figure 19a–e (I)).
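The three similarity measures reported in the tables (SA, SID, and SGA) can be sketched as follows. This is a minimal illustration using common textbook definitions, not necessarily the exact formulation used in this study. In particular, the sketch assumes strictly positive reflectance for SID and computes SGA on signed first differences, whereas some formulations take absolute gradients first. The spectra in the example are hypothetical.

```python
import math

def spectral_angle(x, y):
    """Spectral angle (SA) between two spectra, in radians."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Clamp to guard against rounding slightly outside [-1, 1]
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def spectral_information_divergence(x, y):
    """SID: symmetric relative entropy of spectra normalized to sum to 1.

    Assumes every band value is strictly positive.
    """
    sum_x, sum_y = sum(x), sum(y)
    p = [a / sum_x for a in x]
    q = [b / sum_y for b in y]
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi)
               for pi, qi in zip(p, q))

def spectral_gradient_angle(x, y):
    """SGA: spectral angle between band-to-band differences, in radians."""
    grad_x = [b - a for a, b in zip(x, x[1:])]
    grad_y = [b - a for a, b in zip(y, y[1:])]
    return spectral_angle(grad_x, grad_y)

# Hypothetical four-band reference and image spectra
reference = [0.10, 0.20, 0.40, 0.35]
image = [0.30, 0.25, 0.20, 0.15]

sa_degrees = math.degrees(spectral_angle(reference, image))
sid = spectral_information_divergence(reference, image)
sga_radians = spectral_gradient_angle(reference, image)
```

All three measures are zero for spectra that differ only by a constant scale factor, which is why they are commonly used to compare spectral shape rather than brightness.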

**Table 2.** Spectral similarity measures between the point-based in-situ target reference spectra and the corresponding airborne and space-borne target image spectra (spectral angle (SA) is measured in degrees and spectral gradient angle (SGA) in radians). Values in bold are statistically significant.

| Comparison | Metric | N1G | N2R | C1W | N3Y | N4B |
|---|---|---|---|---|---|---|
| In-Situ Reference Spectra vs. Airborne Image Spectra | SA | **7.623** | 10.386 | 12.273 | 8.503 | 11.617 |
| | SID | 0.031 | 0.050 | 0.050 | **0.028** | 0.105 |
| | SGA | 0.650 | 0.839 | **0.523** | 0.678 | 0.744 |
| In-Situ Reference Spectra vs. Satellite Imagery Spectra | SA | 8.338 | 14.111 | 15.246 | **8.008** | 19.219 |
| | SID | 0.045 | 0.126 | 0.074 | **0.019** | 0.306 |
| | SGA | 0.688 | 1.040 | 0.904 | **0.667** | 0.887 |

**Table 3.** Spectral similarity between the THI target reference spectra and the corresponding airborne and space-borne target image spectra (SA is measured in degrees and SGA in radians). Values in bold are statistically significant.

**Table 4.** Spectral similarity between the airborne target reference spectra and the space-borne target image spectra (SA is measured in degrees and SGA in radians). Values in bold are statistically significant.

The relatively higher accuracy of target detections observed in the airborne imagery (Section 3.1.1) while using the in-situ spectral measurement as reference target spectra can be attributed to the inherent spectral similarity between in-situ reference spectra and airborne image spectra (Table 2; lower SID and SGA values across all target materials). Further, the score for the in-situ target reference spectra and space-borne target image spectra shows stark dissimilarities across the targets, explaining the apparent unsatisfactory detection performance across the algorithms (Section 3.1.2). Similarly, the detection performance observed in Section 3.2 conforms to the similarity measure seen in Table 3. Comparing the similarity scores from Tables 2 and 4, we found a closer similarity between the airborne reference spectra and space-borne image spectra than between the in-situ and the space-borne image spectra. This matching is aptly reflected in the detection performance observed in Section 3.3. It may be noted that the similarity measures employed for quantifying spectral matching are designed mainly for hyperspectral resolution data. Use of these measures for quantitative spectral matching in multispectral data may not be optimal. Therefore, we recommend caution while arriving at conclusions on detection performance based on similarity measures alone.

**Figure 19.** Spectral comparison of the reference target spectra with the corresponding image target spectra for: (I) in-situ measurements of (**a**,**f**) N1G, (**b**,**g**) N2R, (**c**,**h**) C1W, (**d**,**i**) N3Y, and (**e**,**j**) N4B compared to airborne and space-borne image spectra, respectively; (II) THI measurements of (**a**,**e**) N1G, (**b**,**f**) N2R, (**c**,**g**) N3Y, and (**d**,**h**) N4B compared to airborne and space-borne image spectra, respectively; and (III) airborne measurements of (**a**) N1G, (**b**) N2R, (**c**) C1W, (**d**) N3Y, and (**e**) N4B compared to space-borne image spectra.

#### **4. Discussion**

With spectral profiles available a priori, targeted detection of artificial/engineered materials using remote sensing is emerging as a data paradigm for a host of civil and strategic applications. Among the recent developments in hyperspectral remote sensing, target detection has the potential to be deployed on a broader application base. There have been a few seminal efforts on acquiring benchmark airborne hyperspectral datasets and making them freely available (Cooke City, and 'Viareggio 2013 trial' [16]), which have further attempted detecting specific information classes/materials of interest. There have also been a few studies on target detection in synthetic or simulated hyperspectral imagery [35].

While these datasets and experiments provide a solid base for classification-oriented exploration, targets and their landscape neighborhoods in these datasets are set in a relatively controlled environment. They may not represent typical landscapes and target conditions. Apart from that, the criterion used for labeling a pixel detection as 'true' or 'false' has a substantial bearing on the magnitude of detection accuracy. For example, the best accuracy estimates for the case of airborne imagery in this study are equal to or slightly lower than the accuracy reported in the state-of-the-art literature [14,36]. The potential target detection performance in our experiments, considered only from the pixel-labeling perspective, would be substantially higher than the values presented in this paper and the values reported in the literature. The difference between our potential accuracy and the reported accuracy is due to the relatively liberal criterion used for accuracy estimation in the literature. Past studies define a target guard window, representing a neighborhood region at three different levels of proximity to the core 'target pixel', for labeling a detection true or false. The detection of even a single pixel within any of these three levels is considered 100% correct detection of the whole target, which may lead to overestimation of detection performance. To avoid this uncertainty, we used a stringent pixel-for-pixel matching count of target pixels for computing the performance metrics $P_D$ and $P_{FA}$.
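The pixel-for-pixel criterion can be illustrated with a short sketch. It assumes a per-pixel detector score and a per-pixel ground-truth mask, and it treats a pixel as detected when its score reaches the threshold. The function names and the toy data are hypothetical and are not taken from the study.

```python
def detection_rates(scores, truth, threshold):
    """Pixel-for-pixel P_D and P_FA at one threshold.

    scores: per-pixel detector outputs.
    truth:  per-pixel booleans, True where the pixel is a target pixel.
    """
    hits = false_alarms = targets = background = 0
    for score, is_target in zip(scores, truth):
        detected = score >= threshold
        if is_target:
            targets += 1
            hits += detected
        else:
            background += 1
            false_alarms += detected
    return hits / targets, false_alarms / background

def roc_points(scores, truth):
    """(P_FA, P_D) pairs, using every distinct score as a threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        p_d, p_fa = detection_rates(scores, truth, threshold)
        points.append((p_fa, p_d))
    return points

# Toy scene: three target pixels followed by five background pixels
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.6, 0.05]
truth = [True, True, True, False, False, False, False, False]

p_d, p_fa = detection_rates(scores, truth, 0.5)
curve = roc_points(scores, truth)
```

Under a guard-window criterion, a single hit near the target would count as a full detection, so the same score image would yield a higher $P_D$ than this count does.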

Furthering the experimental landscapes and the benchmark reference datasets for target detection, the goal of our research is the acquisition and exploration of a multi-platform—ground, airborne, and space-borne remote sensing dataset for target detection of artificial/engineered materials. Our experiments were aimed at assessing the dynamics of target detection in terms of (i) spectral attribute conformity of reference target spectra from the ground to space-borne, (ii) target–background interaction: identical target material on similar, and different backgrounds, and (iii) the relevance of detection algorithms and their functional categorization. We present in the following sub-sections the relevance and importance of the results organized according to the three perspectives mentioned above.

#### *4.1. Spectral Conformity of the Reference Target Spectra from the Ground to Spaceborne Platform*

The continued detection of the engineered material targets from ground to space-borne imagery, though at different levels of confidence, while preserving location adherence and material-specific identification, indicates the presence of material-specific spectral features. Results from the airborne hyperspectral imagery exhibit successful target detections from both the point-based in-situ and pixel-based THI reference target spectra. However, target detections using the in-situ target reference spectra are valid only for ground and airborne imagery. As evident from Figure 7, the target detections in the space-borne imagery drop to that of a random process. In contrast to this trend, detection results from the pixel-based reference target spectra indicate patterns in the target detection in both the airborne and space-borne imagery. However, the point-based in-situ and the pixel-based THI reference target spectra yield comparable levels of target detections in the airborne hyperspectral imagery.

Target detection and the quantitative spectral assessment of the pixel-based THI reference target spectra with the airborne (AVIRIS-NG imagery) and the space-borne (Sentinel-2 imagery) spectra suggest stable spectral conformity of material spectra at the ground, airborne, and space-borne platforms. The pixel-based THI spectral conformity leads to two practical implications: (i) a new source of in-situ reference spectra, and (ii) a potential syllogism that an impure contextual spectrum is better than the laboratory-grade pure spectrum. Ground-based hyperspectral image acquisitions can replace the spectroradiometer-based in-situ or laboratory spectral measurements. Image-based reference spectra acquisition is particularly advantageous in surveying inaccessible terrain or in acquiring rapid reference measurements for dynamic image-based target detection systems. The concept of spectral purity, considered to be inherent in the spectral endmembers of reference spectral library databases, needs to be revisited to allow for some degree of spectral-contextual impurity for further usage in image-based detection systems. Compared to a point measurement, a pixel has the inherent structure to infuse geometrical, illumination, and micro-environmental settings of material-energy interactions in the reflectance spectra. The pixel spectra may help represent the dynamics of material target spectra acquired at different platforms.

Target detection in space-borne imagery using the reference target spectra from airborne imagery helps evaluate detection possibilities over a wider geographical region. Successful target detections in the space-borne imagery using the reference target spectra from airborne imagery suggest the existence of a spectral continuum between airborne and space-borne imagery. Compared to the results from in-situ or pixel-based THI spectra, the airborne image-based reference spectra produced relatively fewer false alarms in space-borne imagery. For example, in the case of the lowest target detection scenario (N2R; algorithm: CEM), the false alarms reduced from 5624 to 1712 when the probability of detection is set at 75%. Target detections in the airborne imagery using the reference target spectra from the airborne imagery itself are accurate and unambiguous across all the detection algorithms at the 100% probability of detection rate. However, the target detections in space-borne imagery using the reference target spectra from the space-borne imagery itself are comparable with the results obtained from using the pixel-based THI reference target spectra. At the 75% probability of detection rate, the target detections are erroneous mainly by overestimation; most of the targets are detected, albeit with substantial proportions of false alarms. Overall, the results confirm that the strength of spectral conformity of the input reference target spectra determines the quality of the target detection in imagery acquired from different platforms.
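Counting false alarms at a fixed probability of detection, as in the 75% comparisons above, can be sketched as follows. The sketch assumes per-pixel scores with a ground-truth mask and picks the highest threshold that still detects the requested fraction of target pixels. The function name and the toy data are hypothetical.

```python
import math

def false_alarms_at_pd(scores, truth, target_pd=0.75):
    """Return (threshold, false alarm count) at the requested P_D.

    The threshold is the highest score cut-off at which at least
    target_pd of the target pixels are still detected.
    """
    target_scores = sorted(
        (s for s, is_target in zip(scores, truth) if is_target), reverse=True
    )
    if not target_scores:
        raise ValueError("no target pixels in the ground truth")
    needed = max(1, math.ceil(target_pd * len(target_scores)))
    threshold = target_scores[needed - 1]
    false_alarms = sum(
        1 for s, is_target in zip(scores, truth)
        if not is_target and s >= threshold
    )
    return threshold, false_alarms

# Toy scene: four target pixels followed by five background pixels
scores = [0.9, 0.8, 0.6, 0.2, 0.7, 0.65, 0.5, 0.1, 0.3]
truth = [True] * 4 + [False] * 5

threshold_75, false_alarms_75 = false_alarms_at_pd(scores, truth, 0.75)
```

Fixing $P_D$ in this way makes false alarm counts comparable across detectors and reference spectra, since each detector is evaluated at the threshold that yields the same detection rate.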

#### *4.2. Target–Background Interaction—Role of Context*

To test the impact of contextual background–target spectral interactions on the repeatability of the target detections, we placed targets of identical material in different colors on different backgrounds. Considering the background–target spectral interactions, the detection of identical materials on an identical background varies from systematic and successful to random and failed. With marginal to moderate variations in the false alarm rate ($P_{FA}$), our results suggest unambiguous target detection of identical materials on an identical background in both the airborne and space-borne imagery (see Figure 20). Compared to the case of identical materials on an identical background, detection rates of identical material targets positioned on different backgrounds vary mainly with the local contrast between target material and background. Accordingly, the detection rates vary from chance matching to consistent detection. A similar observation has been reported by [6], confirming the substantial effect of scene parameters on the target detection accuracy. In addition, we find that the potential of background interference for altering the detection scores depends substantially on the source of reference target spectra and the detection algorithm.

The variability in the detection rate of identical materials poses a plausible question: How do we standardize the detection rate and ensure detection reproducibility under different environmental, background, and other geometrical factors? The inconsistency in the detection performance needs to be addressed from an algorithmic design perspective, modeling and incorporating the sources of uncertainty in the reference target reflectance spectra as observed by different sensors. One of the primary causes for the different detection rates is the non-linearity in the contextual background reflectance recorded by sensors at different platforms, as shown in Figure 21a. Modeling the reference target spectra with possible background mixtures and developing contextual-background sensitive algorithms may enhance target detections across platforms and sensors. Overall, we observe that targets placed on a comparatively reflective local background are detected with lower false alarms ($P_{FA} \sim 10^{-4}$) by all the algorithms. Although a detailed analysis of the role of background is not in the purview of this paper, our results support the theoretical perspectives of different target-background interactions outlined by [37], and we suggest maintaining a balance between model sophistication and its real-time applicability.
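One simple way to picture reference target spectra that include possible background mixtures is a linear mixing model. The sketch below is a deliberate simplification: it assumes linear mixing by fill fraction, whereas the interaction discussed above is non-linear. The function names and all reflectance values are hypothetical.

```python
def mix_spectra(target, background, fill_fraction):
    """Linear mixture: fill_fraction of target, the remainder background."""
    if not 0.0 <= fill_fraction <= 1.0:
        raise ValueError("fill_fraction must lie in [0, 1]")
    return [fill_fraction * t + (1.0 - fill_fraction) * b
            for t, b in zip(target, background)]

def mixed_reference_set(target, backgrounds, fractions):
    """One candidate reference spectrum per (background, fraction) pair."""
    return {(name, fraction): mix_spectra(target, spectrum, fraction)
            for name, spectrum in backgrounds.items()
            for fraction in fractions}

target = [0.40, 0.35, 0.30, 0.45]            # hypothetical target reflectance
backgrounds = {
    "vegetation": [0.05, 0.08, 0.04, 0.50],  # hypothetical values
    "soil": [0.15, 0.20, 0.25, 0.30],        # hypothetical values
}
references = mixed_reference_set(target, backgrounds, [1.0, 0.75, 0.5])
```

A detector could then be run once per candidate spectrum, so that a target partly blended with its local background still has a matching reference.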


**Figure 20.** False alarms at different levels of $P_D$ for (I) identical target material (N1G and N2R) in the same context (vegetative) for the (**a**) best-case and (**b**) worst-case detection performance; (II) identical target material (N1G and N3Y) in a different context (vegetation and soil, respectively) for the (**c**) best-case and (**d**) worst-case detection performance.

**Figure 21.** (**a**) Visualization of the non-linear interaction of background signal with the target spectrum for the N2R and N1G targets, and (**b**) best-case target detection continuum results of detection performance across imagery from all the platforms (G-ground, A-airborne, S-space-borne) for all the targets used, at a false alarm rate of $10^{-3}$ for the in-situ target reference spectra.

#### *4.3. Detection Algorithms and Their Functional Categorization*


Apart from the spectral-geometrical-imaging platform dynamics of the target materials, detection algorithms play a key role in recognizing and identifying material targets. Given the acquisition of appropriate spectral imagery and a target material that meets the minimum dimension, the detection algorithm employed determines the possibility and quality of target detections. For the given target reference spectra, the functional characterization expected from a potential detection algorithm is the ability to deal with target–background interactions and spectral pattern discrimination in imagery. Based on the functional characteristics, we used three types of detection algorithms, belonging to the categories of geometric approach, spectral matching, and background characterization. Target detection of materials in the airborne imagery, with target reference spectra extracted from the same imagery, is accurate and complete (at *P<sub>D</sub>* = 75%) for most of the detection algorithms and material targets. However, the major performance shortfalls of the detection algorithms can be attributed to their sensitivity to backgrounds. The detection rate of an identical material target positioned on two different backgrounds varied substantially with the detection algorithm. Among the spectral matching based detectors, CEM consistently detected material targets across sources of reference target spectra and imagery platforms. Yet, the average number of false alarms is ~50, predominantly in the urban areas (see Figure 5), which may be too many for practical target detection purposes. The performance of subspace-based detectors is determined by the quality of the extracted endmembers, which in turn depends upon the endmember extraction algorithm used. For example, OSP and TCIMF yielded the lowest false alarms for some materials (*P<sub>FA</sub>* ~ 10<sup>−5</sup> for N1G and C1W), but high false alarms for other materials (N4B and N3Y, with *P<sub>FA</sub>* ~ 10<sup>−2</sup> to 10<sup>−4</sup>) (see results in Section 3.1.1).
However, for two similar materials placed on different contextual backgrounds, the detection rate varied drastically between the spectral and subspace-based detectors. For example, for the MF, the difference in the detection rate between N4B and N1G is about 20-fold, whereas for ACE it is about 10,000-fold.
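As an illustration of the spectral matching detectors discussed above, the following minimal sketch implements the textbook forms of CEM, MF, and ACE with NumPy. It is not the implementation used in this study; the function names, the synthetic six-band scene, and the target spectrum are assumptions made purely for illustration.

```python
import numpy as np

def cem_scores(pixels, target):
    """Constrained energy minimization: w = R^-1 d / (d^T R^-1 d)."""
    n_pixels = pixels.shape[0]
    R = pixels.T @ pixels / n_pixels           # sample correlation matrix (bands x bands)
    Rinv_d = np.linalg.solve(R, target)
    w = Rinv_d / (target @ Rinv_d)             # filter constrained to w^T d = 1
    return pixels @ w

def _centered(pixels, target):
    """Remove the global background mean and estimate the covariance."""
    mu = pixels.mean(axis=0)
    C = np.cov(pixels, rowvar=False)
    return pixels - mu, target - mu, C

def mf_scores(pixels, target):
    """Matched filter, normalized so that a pixel equal to the target scores 1."""
    Xc, dc, C = _centered(pixels, target)
    Cinv_d = np.linalg.solve(C, dc)
    return Xc @ Cinv_d / (dc @ Cinv_d)

def ace_scores(pixels, target):
    """Adaptive coherence estimator: squared cosine in whitened space, in [0, 1]."""
    Xc, dc, C = _centered(pixels, target)
    Cinv_d = np.linalg.solve(C, dc)
    Cinv_X = np.linalg.solve(C, Xc.T).T
    numerator = (Xc @ Cinv_d) ** 2
    denominator = (dc @ Cinv_d) * np.einsum("ij,ij->i", Xc, Cinv_X)
    return numerator / denominator

# Hypothetical synthetic scene: 500 background pixels, 6 bands (illustrative values only).
rng = np.random.default_rng(0)
background = rng.normal(0.3, 0.05, size=(500, 6))
target = np.array([0.10, 0.15, 0.45, 0.60, 0.55, 0.50])

scene = background.copy()
scene[0] = target                                  # full-pixel target
scene[1] = 0.3 * target + 0.7 * background[1]      # sub-pixel (mixed) target

cem = cem_scores(scene, target)
mf = mf_scores(scene, target)
ace = ace_scores(scene, target)
```

In this sketch, a pixel identical to the reference spectrum scores 1 under all three detectors, while the mixed pixel scores lower. All three rely on global background statistics (`R` or `C`), which is one reason the same material can score very differently on different backgrounds.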

The application of sub-pixel detection algorithms, such as CEM, TCIMF, ACE, and OSP, to the detection of engineered materials from space-borne imagery is hampered by a large number of false alarms. While the pixels of target materials are detected, the false alarms outnumber the detections at *P<sub>D</sub>* = 75%. For instance, when *P<sub>D</sub>* is 75%, CEM yielded 3260 false alarms for the detection of N1G from the space-borne imagery. In addition, the effect of target–background interaction (due to mixed pixels) on the algorithms' performance seems pronounced in space-borne imagery (Figure 7). However, when the required detection rate *P<sub>D</sub>* is relaxed to 50%, the results from the space-borne imagery (Sentinel-2 at 10 m resolution) are consistent, indicating the potential utility of space-borne imagery for target reconnaissance. We find that the state-of-the-art target detectors need substantial refinement for target detection problems. A couple of studies suggest the use of local mean and covariance estimation, and quantification of interaction effects, for improved detection [4,36]. Algorithms with adaptive target–background signal modeling that incorporate non-linear signal mixing models for sub-pixel/mixed-pixel targets can provide better results than the traditional statistical detectors.
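The trade-off between the detection rate and the number of false alarms described above can be sketched as follows. This is a minimal illustration and not the evaluation code of this study: the function name `false_alarms_at_pd`, the thresholding rule (the lowest threshold that still detects the required fraction of target pixels), and the toy scores are all assumptions.

```python
import math
import numpy as np

def false_alarms_at_pd(scores, is_target, pd=0.75):
    """Count background pixels that pass the threshold needed to reach a given P_D.

    scores    : detector output, one value per pixel
    is_target : boolean ground-truth mask, True for target pixels
    pd        : required fraction of target pixels to detect (0 < pd <= 1)
    """
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)

    target_scores = np.sort(scores[is_target])[::-1]      # descending order
    n_required = max(math.ceil(pd * target_scores.size), 1)
    threshold = target_scores[n_required - 1]              # lowest score still detected

    background_scores = scores[~is_target]
    n_false_alarms = int(np.count_nonzero(background_scores >= threshold))
    return n_false_alarms, n_false_alarms / background_scores.size

# Toy example: 4 target pixels followed by 6 background pixels (illustrative values).
scores = [0.9, 0.8, 0.4, 0.2, 0.85, 0.5, 0.3, 0.1, 0.05, 0.6]
is_target = [True] * 4 + [False] * 6

n_fa_75, pfa_75 = false_alarms_at_pd(scores, is_target, pd=0.75)  # 3 false alarms, P_FA = 0.5
n_fa_50, pfa_50 = false_alarms_at_pd(scores, is_target, pd=0.50)  # 1 false alarm
```

Relaxing the required detection rate raises the threshold, so fewer background pixels pass it. In this toy case the count drops from three false alarms at 75% to one at 50%.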

#### *4.4. Key Elements of Influence in Target Detection*

Based on our analyses of the extensive target detections observed under different combinations of background, material, and detection algorithms, we present an empirical estimation of the relative contributions of the three key elements of a remote sensing-based target detection system: ground (including local background), sensor (spectral properties), and target (types and positioning), arranged as the vertices of an isosceles triangle. As illustrated in Figure 22, the target detection space represents the possibility of detecting material targets under the full detection possibility (area of the triangle), considering the possible levels of the three key elements. The quality of detections depends upon finding the optimal range in each of the key elements and modeling the appropriate weights. Background contrast (as defined from the target spectral attributes) and the sophistication of the detection algorithm (its ability to localize the target–background spectral attributes) contribute more than the spectral dimensionality of the imagery. The spectral features and detection algorithms have equal participation (about 35% each) in the detection, as represented by the sides of the triangle (Figure 22). The base of the triangle, the target–background, has about a 30% contribution to the detection and is a landscape-driven parameter, not amenable to prior human intervention. Improvement in the precision and detection scores, represented by the height of the triangle, comes from the sophistication of detection algorithms relative to the optimal spectral dimensionality. A stable target detection system will be the weighted combination of the three key elements and will have its detection scores in the part of the triangle labeled 'realistic detection space'. Reaching the most optimized combination of the key elements (indicated by the green circular dot) is the theoretical upper limit of the target detection system.

*Remote Sens.* **2020**, *12*, 2145 27 of 30

**Figure 22.** Various elements of a target detection system and their mutual correlation in the detection space.

#### *4.5. Experimental Dataset*

The multi-source, multi-platform dataset for target detection will be a valuable resource for the ongoing efforts on target detection using hyperspectral and multispectral remote sensing data. The high-quality in-situ reference spectral data, acquired in both point and pixel mode, will be helpful for testing the nuances of detection-related problems and for assessing detection algorithms. Since the present dataset was acquired from an urban neighborhood, the complexity of the imagery would provide a rigorous test of the existing theories about the detection problems.
The detection of engineered material at the pixel level from satellite data is vital for strategic purposes, and the dataset acquired in this research can be used for validating such endeavors. For all practical purposes, we propose that the detection metric (*P<sub>D</sub>*) of target detectors should be relaxed and re-evaluated according to the imaging complexity of the scene. Target detection can be undertaken in both the reflectance and radiance modes. However, for the present work, we have only tested the detection performance in the reflectance domain. Radiance domain target detection will be pursued as future work. The experimental dataset used in this study will be made available on an appropriate freely accessible public platform.

#### **5. Conclusions**

Detection of a specific material of interest/target has been one of the promising applications of remote sensing. Contributing to the public availability of benchmark and comprehensive datasets for target detection studies, we have acquired a benchmark multi-platform remote sensing dataset for exploring the various perspectives of target detection and for algorithm development and evaluation. We have carried out experiments on target detections as a function of sensor, platform, target–background, and the source of reference target spectra. We observe unambiguous detection of targets in the airborne imagery. The false alarm rate is substantially low if the probability of detection (*P<sub>D</sub>*) is reduced to 75%. The continuity and the quality of target detections are found to be influenced by the source of reference target spectra. While the target–background interaction is one of the key components determining the quality of detection, it is not a decisive constraint on the overall detection of targets. Target detection results from the ground-level hyperspectral imagery based target reference spectra are on par with those from point-based in-situ target reference spectra. The ground-based hyperspectral imaging sensor is a viable source for rapid acquisition of target reference spectra. An in-situ reference spectrum generated by a non-imaging spectroradiometer may not conform to the landscape area element based target pixel spectrum in spectral imagery. The continuity of target detections from the ground to space, though with different proportions of false positives, suggests the viability of satellite imagery-based target detection. However, further experiments are required to generalize this observation.

Notwithstanding the quality of the spectral data sources, the detection algorithm determines the quality of target detections. The false-positive rate is substantial in most of the detection algorithms evaluated, calling for the development of multi-resolution, spectral dimensionality invariant target detection algorithms. Since remote sensing-based target detection is used in various strategic and civilian applications, the dataset generated in our experiment will help the research community to validate detection algorithms.

**Author Contributions:** Conceptualization, S.S.J. and R.R.N.; methodology, S.S.J.; validation, S.S.J.; formal analysis, S.S.J. and R.R.N.; writing—original draft preparation, S.S.J.; writing—review and editing, R.R.N.; supervision, R.R.N.; funding acquisition, R.R.N. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Department of Science and Technology, Government of India (Grant Number: BDID/01/23/2014-HSRS/37) as part of the Network Programme on Imaging Spectroscopy and Applications (NISA).

**Acknowledgments:** The authors would like to thank the Space Application Centre (SAC), India, and the Jet Propulsion Lab (JPL), USA, for facilitating the airborne hyperspectral imagery, which was acquired as part of the collaboration between ISRO, India, and NASA, USA. We acknowledge the European Space Agency (ESA) for providing the Sentinel-2 satellite imagery. We express our sincere gratitude to the anonymous reviewers for their critical suggestions, which helped improve the quality of our article.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Remote Sensing* Editorial Office E-mail: remotesensing@mdpi.com www.mdpi.com/journal/remotesensing