**Advanced Computational Intelligence for Object Detection, Feature Extraction and Recognition in Smart Sensor Environments**

Editor

**Marcin Woźniak**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editor* Marcin Woźniak, Silesian University of Technology, Poland

*Editorial Office* MDPI, St. Alban-Anlage 66, 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Sensors* (ISSN 1424-8220) (available at: https://www.mdpi.com/journal/sensors/special\_issues/computational\_intelligence\_object\_detection).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-1268-6 (Hbk) ISBN 978-3-0365-1269-3 (PDF)**

© 2021 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy, and build upon published articles as long as the author and publisher are properly credited. This ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

### **Contents**




### **About the Editor**

**Marcin Woźniak** received an M.Sc. degree in applied mathematics, a Ph.D. degree in computational intelligence, and a D.Sc. degree in computational intelligence in 2007, 2012, and 2019, respectively. He is currently an associate professor with the Faculty of Applied Mathematics, Silesian University of Technology. He has been a scientific supervisor in editions of "The Diamond Grant" and "The Best of the Best" programs for highly talented students, run by the Polish Ministry of Science and Higher Education. He has participated in various scientific projects (as lead investigator, scientific investigator, manager, or participant) at Polish and Italian universities and in the IT industry, and he has been a visiting researcher at universities in Italy, Sweden, and Germany. He has authored or coauthored over 150 research papers in international conferences and journals. His current research interests include neural networks and their applications, together with various aspects of applied computational intelligence. He was awarded a scholarship for outstanding young scientists by the Polish Ministry of Science and Higher Education, and in 2020 he was listed among the "Top 2% of Scientists in the World" by Stanford University for his career achievements.

M. Woźniak has been an editorial board member or editor for Sensors, IEEE Access, Measurement, Frontiers in Human Neuroscience, PeerJ Computer Science, the International Journal of Distributed Sensor Networks, Computational Intelligence and Neuroscience, the Journal of Universal Computer Science, etc., and a session chair at various international conferences and symposiums, including the IEEE Symposium Series on Computational Intelligence and the IEEE Congress on Evolutionary Computation.

### *Editorial* **Advanced Computational Intelligence for Object Detection, Feature Extraction and Recognition in Smart Sensor Environments**

**Marcin Woźniak**

Faculty of Applied Mathematics, Silesian University of Technology, 44-100 Gliwice, Poland; marcin.wozniak@polsl.pl

#### **1. Special Issue**

Recent years have seen vast development of various methodologies for object detection, feature extraction, and recognition, both in theory and in practice. When processing images, videos, or other types of multimedia, efficient solutions are needed to perform fast and reliable processing. Computational intelligence is used in medical screening to detect disease symptoms, in preventive monitoring to detect suspicious behavior, in agriculture systems to help with growing plants and animal breeding, in transportation systems to control incoming and outgoing traffic, in unmanned vehicles to detect obstacles and avoid collisions, in optics and materials science to detect surface damage, etc. In many cases, such techniques help us to recognize particular features of interest. In the context of this innovative research on computational intelligence, the contributions to the Special Issue "Advanced Computational Intelligence for Object Detection, Feature Extraction and Recognition in Smart Sensor Environments" present an excellent opportunity to disseminate recent results and achievements for further innovation and development.

Of the 88 manuscripts submitted to this Special Issue, 24 were accepted after a rigorous reviewing process and published in final form as a separate MDPI *Sensors* volume collection at https://www.mdpi.com/journal/sensors/special\_issues/computational\_intelligence\_object\_detection. This corresponds to an acceptance rate of 27.2%, which confirms the high level of the presented research and the outstanding interest of researchers in contributing their innovative work to this venue. The published articles present innovative research results from authors in Europe, Asia, the Americas, and Africa, showing worldwide interest in the topic of this Special Issue and the importance of the proposed contributions. They cover important fields of science and technology, with models and applications for medical image processing, automated drone and vehicle driving systems, marine object detection and recognition, and agriculture and harvesting, along with many interesting theoretical aspects of new training models and data augmentation. Additionally, the published articles bring new data sets to the scientific community, i.e., for defect detection from optical fabric images and Industrial10 for industrial area image processing.

#### **2. Contributions**

The topic of using computer vision for autonomous driving systems, aerial vehicles, and vessel classification was covered by many innovative ideas. In [1], a system for the detection of flying objects in automatic drone protection systems was presented. The proposed solution combines a background subtraction model with a convolutional neural network (CNN); as a result, the system detects flying drones and provides their initial recognition to the operator. In [2], a model was proposed for ship type classification; the proposed complex neural architecture was based on a time convolutional layer model, which helped to compare the extracted ship features. In [3], the authors discuss models of vehicular traffic congestion using various approaches and present a set of comparative results for different deep learning models. In [4], a real-time drone-based vehicle detection system was developed that can detect a car from a bird's-eye perspective; the model was based on an adapted DRFBNet300 structure. In [5], the YOLOv2 model was adapted to the task of multi-scale vehicle detection, and the adopted neural network was enhanced with a proposed foreground–background imbalance estimation. Another interesting model, for non-conventional vessel detection, was presented in [6]. The applied system, using a convolutional neural network (CNN), was trained with the Adam algorithm; the authors compared various architectures and drew conclusions about the best applications in the automatic detection system.

**Citation:** Woźniak, M. Advanced Computational Intelligence for Object Detection, Feature Extraction and Recognition in Smart Sensor Environments. *Sensors* **2021**, *21*, 45. https://dx.doi.org/10.3390/s21010045

Received: 7 December 2020 Accepted: 22 December 2020 Published: 24 December 2020

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Among the interesting propositions for industrial applications, we find solutions for various types of images, from object surfaces to whole-scene processing. In [7], a new approach to correlating scene images in industrial areas was discussed. In this model, a regression model of nested markers was used for viewpoints in augmented reality. As a result, the research presented a more efficient image capturing technique for industrial applications, as well as a new data set called Industrial10. We kindly encourage the scientific community to adopt this data set in research on camera pose regression methods. In [8], a model to detect surface regions of interest (ROI) in 3D was presented. As the processing mechanism, a deep convolutional neural network (CNN) was adapted with the Adam training algorithm; this combination was applied in industrial processes for optimal CCD laser image scanning, with very good results. In [9], the idea of a composite interpolating feature pyramid (CI-FPN) was applied in a model of fabric defect detection. The result was processed by a cascaded guided-region proposal network (CG-RPN) to classify the detected regions. In addition to the model, this research article also introduced a new data set for defect detection from optical fabric images. In [10], a convolutional neural network (CNN) model for the industrial task of tool wear identification was presented, where parts from the face milling process can be evaluated for potential damage. An application in farming and plant growing was proposed in [11], where a weakly dense connected convolution network (WeaklyDenseNet-16) was used to detect plant diseases from images. In [12], a system model for robotic inspection tasks was proposed, which enables drones to detect novelty in inspected areas from a distant viewpoint.

This Special Issue also received interesting research concerning human pose detection and recognition. In [13], an innovative video frame analysis model for surveillance and security applications was presented. The model uses a support vector machine (SVM) or a convolutional neural network (CNN) as an extractor and detector of key features from CCTV and operation units, achieving faster detection of situations that may require legal action. In [14], a model of active player detection for a sport vision system was presented. The solution was based on the idea of a bounding box area associated with the motion centroids of the human body pose; as a result, a model of active support for sport broadcasts, annotating players during the game, was developed. In [15], a hand gesture recognition model was proposed. Such a development can be very useful for human–machine interaction systems, where the computer should read human intention, e.g., from a hand gesture presented to the camera. The proposed model was based on the EMGNet architecture, processing data collected with electronic devices such as the Myo armband.

Another important category is new models of image processing, feature extraction, and detection developed with computational intelligence. In [16], a new approach to remote sensing image processing was presented, in which the image is cleared of radio-frequency interference (RFI) artefacts. The model used a proposed pixel value conversion from RGB to greyscale to detect such artifacts and remove them with the adapted neural network. In [17], a semantic segmentation approach to object extraction from images was examined. The proposed model adapted the WASPnet architecture, built on the Waterfall Atrous Spatial Pooling (WASP) module; experiments showed high efficiency for various types of images. In [18], a comparative review of traffic sign detection systems based on various computational intelligence techniques was presented.

The Special Issue received several interesting articles in the domain of medical image processing, proposing new models for the detection and recognition of tissue features. In [19], a SegNet convolutional neural network with an encoder–decoder construction was applied for more efficient medical image processing; as a result, a processing model for tumor segmentation in CT liver scans in the DICOM format was proposed. In [20], a human embryo image generator based on generative adversarial networks (GAN) and trained with the Adam algorithm was proposed. The resulting model makes it possible to manipulate the size, position, and number of artificially generated embryo cells in the composed image. In [21], acute brain hemorrhages in computed tomography scans were detected using an adapted 3-dimensional convolutional neural network. The main goal of such a system is to reduce the time between diagnosis and treatment.

The Special Issue also received some interesting propositions for various pattern analysis tasks. In [22], simulation results for the vibration signals of high-speed trains, modeled as non-stationary objects, were presented; the research demonstrates the use of intelligent modeling for signal noise reduction. The model proposed in [23] addressed information retrieval from large-scale text data using the BERT (CLS) representation; to improve efficiency, the method was based on reasoning paths over a composed cognitive graph structure. In [24], a multi-view approach was discussed for visual question answering (VQA) systems, which appear in complex artificial intelligence systems operating on both text conversation and image processing and recognition. The proposed approach boosts such systems by processing several images of one scene, enabling the system to consider more aspects on the way to a final decision.

In summary, we congratulate the authors of the articles accepted in this Special Issue on their outstanding research results and wish them great success in the continuation of their research and projects. The topic of the Special Issue was clearly well received by the worldwide scientific community, which points to future research directions and technology trends in the field of computational intelligence for object detection, feature extraction, and recognition in smart sensor environments.

**Acknowledgments:** The editor would like to thank all the authors who have contributed their innovative works to the Special Issue "Advanced Computational Intelligence for Object Detection, Feature Extraction and Recognition in Smart Sensor Environments". Thanks are also given to all the hardworking reviewers for their detailed comments and suggestions in the revision process. I also acknowledge the great help and involvement of the technical editorial team for managing all the publications.

#### **References**


### *Article* **Real-Time and Accurate Drone Detection in a Video with a Static Background**

#### **Ulzhalgas Seidaliyeva 1,†,‡, Daryn Akhmetov 2,‡, Lyazzat Ilipbayeva 2,‡ and Eric T. Matson 3,\***


Received: 6 June 2020; Accepted: 7 July 2020; Published: 10 July 2020

**Abstract:** With the increasing number of drones, the danger of their illegal use has become relevant, necessitating the creation of automatic drone protection systems. One of the important tasks solved by these systems is the reliable detection of drones near guarded objects. This problem can be solved using various methods. From the point of view of the price–quality ratio, the use of video cameras for drone detection is of great interest. However, drone detection using visual information is hampered by the large similarity of drones to other objects, such as birds or airplanes. In addition, drones can reach very high speeds, so detection should be done in real time. This paper addresses the problem of real-time drone detection with high accuracy. We divided the drone detection task into two separate tasks: the detection of moving objects and the classification of the detected object as a drone, a bird, or background. The moving object detection is based on background subtraction, while classification is performed with a convolutional neural network (CNN). The experimental results showed that the proposed approach can achieve accuracy comparable to existing approaches at high processing speed. We also concluded that the main limitation of our detector is the dependence of its performance on the presence of a moving background.

**Keywords:** unmanned aerial vehicles; object detection; deep learning; computer vision; image processing; drone detection; UAV detection; visual detection

#### **1. Introduction**

With the constant development of technology, drone companies such as DJI, Parrot, and 3DRobotics are producing different types of unmanned aerial vehicles (UAVs) and unmanned aerial systems (UAS). Because of their accessibility and ease of use, UAVs are widely used for commercial purposes such as the delivery of goods and medicines, surveying, the monitoring of public places, cartography, search and rescue (SAR), first aid, and agriculture. However, their wide and rapid spread creates danger when illegal drone flights are used for crimes such as smuggling (the illegal transportation of goods across borders, into restricted areas, prisons, etc.), illegal video surveillance, and interference with aircraft. In recent years, unmanned aerial vehicles, publicly known as drones, have hit the headlines by flying over restricted zones and entering high-security areas. In January 2015, a drone flown by an intoxicated government officer crashed on the White House lawn [1]. Another accident happened in 2017 in the Canadian province of Quebec, where a light aircraft collided with a UAV at an altitude of 450 m during landing [2]. Fortunately, the plane only suffered

small damage and was able to land safely. In December 2018, London's Gatwick Airport was shut down for 36 h after reports of drones over the runway, which strangely reappeared whenever the airport attempted to reopen [3]. Because of this incident, approximately 1000 flights had to be cancelled, affecting 140,000 passengers. Because they are difficult to detect, drones can also be ideal tools for illegal smuggling: in April 2020, in the state of Georgia, three people were accused of arranging to transport tobacco and phones by drone to a convict at Hays State Prison in Trion [4].

These drone incidents show the need to monitor drone flights. To guarantee security, some drone manufacturers have set up no-fly zones that prohibit drones from flying within a 25 km radius of sensitive sites such as airports, prisons, power plants, and other critical facilities [5]. However, the impact of no-fly zones is quite limited, and not all drones have these built-in safeguards. Anti-drone systems are therefore being vigorously developed, and the problem of real-time drone detection has become relevant [6].

Drone detection technologies are usually divided into four categories: acoustic, visual, radio-frequency signal-based, and radar [7]. A good balance between price and detection range is achieved by visual drone detection technologies that use images of surveillance areas from cameras. One of the main disadvantages of visual drone detection is the high rate of false positives caused by the visual similarity of different objects, especially when they occupy only a few pixels in an image [8]. As a result, a drone can be mistaken for a bird or the background and vice versa. The task becomes even more difficult due to large changes in images caused by varying weather and lighting conditions.
At the same time, drones can reach speeds of up to 100 miles per hour, which imposes additional requirements on detection speed. To address these problems, the Drone-vs-Bird detection challenge [5] was established. The challenge provides video sequences in which drones are present along with birds. The goal is to detect all drones that appear in the video, while birds should not be mistaken for drones. The challenge focuses on the detection accuracy of the proposed algorithms, but their execution time is not considered. Our objective was to develop a real-time drone detection algorithm that could achieve competitive accuracy.
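As a rough sketch of the detect-then-classify idea described above, the fragment below flags moving pixels against a background model estimated as the temporal median of the frames; each connected blob would then be cropped and handed to a classifier (drone, bird, or background). This is an illustrative stand-in, not the authors' implementation: the function name and the threshold value are assumptions.

```python
import numpy as np

# Illustrative detect-then-classify front end: a temporal-median background
# model for a static camera. The threshold is arbitrary, chosen for the
# sketch, not taken from the paper.
def foreground_mask(frames, threshold=25):
    """frames: uint8 array of shape (T, H, W), grayscale, static camera.
    Returns a boolean mask per frame marking pixels that moved."""
    background = np.median(frames, axis=0)                 # per-pixel background estimate
    diff = np.abs(frames.astype(np.int16) - background.astype(np.int16))
    return diff > threshold                                # True where motion is detected
```

In practice, each `True` region would be cleaned up with morphological filtering and then cropped from the original frame before being passed to the CNN classifier.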

#### *1.1. Drone Detection Modalities*

Based on investigations conducted by academia and commercial industries, the primary modalities that can be used for drone detection and classification tasks are radar, radio frequency (RF), acoustic sensors, and camera sensors supported by computer vision algorithms [7].

#### 1.1.1. Radar-Based Drone Detection

Radar is considered a traditional sensor that provides robust detection of flying objects at long range, with performance almost unaffected by unfavorable light and weather conditions [9,10]. However, because radar sensors are mostly designed to detect high-velocity, ballistic-trajectory targets such as military drones, aircraft, and missiles, they are not well suited to detecting small commercial UAVs that fly at relatively low, non-ballistic velocities [11]. While radar sensors are well-known as reliable detection solutions, their classification abilities are not optimal [9]. Since UAVs and birds share key characteristics that often make them difficult to distinguish, this drawback makes radar a poor choice for the task of classifying UAVs and birds. The complexity of installation and the high cost of radar sensors are further reasons to seek a relatively low-cost anti-drone system.

#### 1.1.2. Acoustic-Based Drone Detection

Relatively low-cost acoustic detection systems use an array of acoustic sensors or microphones to recognize the specific acoustic patterns of UAV rotors, even in low-visibility environments [7]. However, the maximum operational range of these systems remains below 200–250 m. Additionally, their sensitivity to environmental noise, especially in loud urban areas and windy conditions, degrades detection performance.

#### 1.1.3. RF-Based Drone Detection

RF-based UAV detection systems are among the most popular anti-drone systems on the market; they detect and classify drones by their RF signatures [12]. An RF sensor is a passive listener between a UAV and its controller: unlike radar-based systems, it does not transmit any signal, which makes RF-based detection energy efficient. Unlike acoustic sensors, RF sensors overcome the limited detection range by using high-gain receiver antennas together with highly sensitive receiver systems to listen to UAV controller signals, while the environmental noise problem is suppressed with de-noising methods such as band-pass filtering and wavelet decomposition [13]. However, not all drones use RF transmission, and this approach is not suitable for detecting UAVs operating autonomously without communication channels [7].

#### 1.1.4. Camera-Based Drone Detection

The detection of drones that do not use RF transmission can be performed with low-cost camera sensors supported by computer vision algorithms. Detection and classification abilities are highest when the target is visible, and camera sensors have the advantage of providing visual verification that the detected object is a drone, along with extra information, such as the drone model, dimensions, and payload, that other drone detection systems cannot provide [14]. A medium detection range, good localization, affordable price, and easy human interpretation are achieved by visual drone detection technologies that use images of surveillance areas from cameras. However, this modality performs poorly at night and in limited-visibility conditions such as clouds, fog, and dust. To address some of these scenarios, thermal cameras can be used in combination. Thermal cameras solve the detection issue of nighttime surveillance and, depending on the technology used, can sometimes even operate better in rain, snow, and fog. However, high-quality thermal cameras are reserved for military applications, and low-cost commercial thermal cameras may fail in high humidity or other unfavorable environmental conditions [9].

#### 1.1.5. Bi- and Multimodal Drone Detection Systems

As we can see, each of these modalities has specific limitations, and a robust anti-drone system may need to fuse several of them. To develop a cost-efficient drone monitoring system, some researchers [15] have considered composing a sensor network from different types of sensors. Depending on the number of sensor types used for the detection task, bimodal and multimodal drone detection systems can be distinguished [7]. To improve detection accuracy, a bimodal drone detection system can combine two modalities, such as a camera array with audio assistance [16], camera and radar sensors [17], or radar and audio sensors [18]. Meanwhile, a multimodal drone detection system can simultaneously use acoustic arrays, optical and radar sensors [19], or simple radar, infrared, and visible cameras together with an acoustic microphone array [20]. Maximal system performance can therefore be achieved by fusing several drone detection modalities. However, our focus is the approach that uses camera images and computer vision algorithms.

#### *1.2. Related Work*

#### 1.2.1. UAV Detection and Classification

Drone detection based on visual data (images or video) can be performed using handcrafted feature-based methods [8,21,22] or deep learning-based algorithms [6,23–25]. Handcrafted feature-based methods rely on traditional machine learning: descriptors such as the scale-invariant feature transform (SIFT), histogram of oriented gradients (HOG), Haar features, local binary patterns (LBP), the deformable parts model (DPM), and the generic Fourier descriptor (GFD) provide low-level handcrafted features (edges, drops, blobs, and color information) for classical classifiers (support vector machine (SVM), AdaBoost), whereas the second category relies on learned features, using two-stage (region-based convolutional neural network (R-CNN), Fast R-CNN, Faster R-CNN, and Mask R-CNN) or single-stage (single shot detector (SSD), RetinaNet, and you only look once (YOLO)) deep object detectors.
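To make the handcrafted-feature route concrete, the sketch below computes a heavily simplified HOG-style descriptor: a single gradient-orientation histogram over the whole image, rather than the per-cell histograms with block normalization that real HOG uses. In practice one would use a library implementation such as `skimage.feature.hog` and feed the descriptor to an SVM; everything here is an illustrative assumption, not code from the cited works.

```python
import numpy as np

# Simplified HOG-style descriptor: one orientation histogram over the whole
# image, weighted by gradient magnitude. Real HOG uses per-cell histograms
# with overlapping block normalization before classification.
def orientation_histogram(image, bins=9):
    gy, gx = np.gradient(image.astype(float))        # image gradients
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)        # unsigned orientation in [0, pi)
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, np.pi), weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist       # L1-normalized descriptor
```

A vertical edge concentrates weight in the first bin (horizontal gradients) and a horizontal edge near the middle bin; descriptors like this, computed per cell and concatenated, are what an SVM or AdaBoost classifier would consume.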

Drone detection using handcrafted feature-based methods: Unlu et al. [21] developed GFD vision-based features, invariant to translation and rotation, to describe the binary forms (such as silhouettes) of drones and birds. In the system proposed by the authors, the silhouette of a moving object is obtained using a fixed wide-angle camera and a background subtraction algorithm; the region growing algorithm is used to separate the pixels of the object from the background. To avoid losing any shape information, morphological operations are not used after the image segmentation phase. The GFD is calculated after normalization and centering of the silhouette; finally, the GFD signals are classified into birds and drones by a neural network consisting of approximately 10,000 neurons. To train their system, the authors created a dataset of 410 drone images and 930 bird images collected from open sources. Training and testing on this custom dataset were performed using five-fold cross-validation. On the test data, CNN classification accuracy was 85.08%, whereas the proposed GFD method reached 93.10%; including the GFD signal vector before the classifying neural network thus significantly increased classification efficiency on a small dataset. In [8], the authors proposed two methods for detecting and tracking an unmanned aerial vehicle at a distance of no more than 350 feet during the daytime, using image processing and motion detection to track movement and machine learning to extract the detected drone. The authors made a comparative analysis of the MATLAB, OpenCV, and EmguCV packages currently used in image processing and object detection, and they used OpenCV in their work.
In the proposed system, to reduce memory use, the RGB image captured by a USB camera is converted to grayscale, and an adaptive threshold method is used to adjust for the noise level of the image depending on lighting conditions, with the threshold value set to 60. To eliminate noise, a morphological dilation operation enlarges the object until it is clearly visible; a blob-tracking algorithm holds the object, and the Dlib technique is applied if the object moves across three frames. The authors tested the proposed system on four types of drones (Phantom 4 Pro, Agras MG-1s, Pocket Drone JY019, and Mavic Pro), as well as on other objects (birds and balloons). Wang et al. [22] proposed a simple, fast, and efficient detection system for unmanned aerial vehicles based on video from static cameras, which covers a large area and is very economical. Temporal median background subtraction was used to identify moving objects in static camera video, and then global Fourier descriptors and local HOG features were extracted from the images of moving objects. The combined Fourier descriptor (FD) and HOG features were fed to an SVM classifier, which performed the classification and recognition. To prepare a dataset, the authors converted 10 videos of the Dajiang Phantom 4 quadcopter, taken in various positions, into a series of images; the drone images formed the positive class, and other objects, such as leaves and buildings, were manually annotated as the negative class. For the recognition of "drone" and "non-drone" objects, FD, HOG, and the proposed combined FD and HOG algorithms were used; the overall accuracy of the proposed recognition method was 98%. The authors also experimentally showed that the proposed FD and HOG algorithm, even with a small dataset, could classify birds and drones with greater accuracy than the GFD algorithm.
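The grayscale-threshold-dilate preprocessing chain described for [8] can be sketched roughly as follows. The paper uses OpenCV and an adaptive threshold; this stand-in uses a plain luminance conversion, the fixed threshold value of 60 quoted in the text, and a naive 3 × 3 dilation, so all implementation details here are assumptions made for illustration.

```python
import numpy as np

# Rough re-creation of the preprocessing from [8]: grayscale conversion,
# thresholding (value 60, as quoted in the text), and a 3x3 morphological
# dilation that merges the detected object into a solid blob.
def preprocess(rgb_frame, threshold=60):
    gray = rgb_frame @ np.array([0.299, 0.587, 0.114])   # luminance grayscale
    binary = (gray > threshold).astype(np.uint8)
    padded = np.pad(binary, 1)                            # zero border for the 3x3 window
    dilated = np.zeros_like(binary)
    h, w = binary.shape
    for dy in (-1, 0, 1):                                 # a pixel is set if any of its
        for dx in (-1, 0, 1):                             # 3x3 neighbours is set
            dilated |= padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
    return dilated
```

In the original OpenCV pipeline, `cv2.cvtColor`, `cv2.adaptiveThreshold`, and `cv2.dilate` would perform these steps, and the resulting blobs would feed the blob tracker.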

Drone detection using deep learning-based methods: Manja Wu et al. [6] developed a real-time drone detector using deep learning. Since training a reliable detector requires a large number of training images, the authors first built a dataset semi-automatically with a KCF (kernelized correlation filter) tracker instead of manual labeling; this semi-automatic labeling method considerably accelerated the preprocessing of the training images. The authors adapted the YOLOv2 deep learning model by changing the resolution of the input images and adjusting the size parameters of the anchor boxes. To obtain the detection network, they removed the last convolution layer of Darknet19, which had been pre-trained on the ImageNet dataset, and added three 3 × 3 convolution layers with 1024 filters and one 1 × 1 convolution layer with 30 filters at the end of the network. The network was trained on a public-domain USC (University of Southern California) drone dataset and an anti-drone dataset labeled with the KCF tracker. Configurations with 2 and 4 GB of graphics processing unit (GPU) random-access memory (RAM) were used to test the detector's real-time operation at low cost: with 2 GB of GPU RAM, the processing speed reached 19 frames per second (FPS), whereas with 4 GB it reached 33 FPS. Through various experiments, the authors achieved good real-time detection results with the proposed detector at an affordable system price. Yoshihashi et al. [23] proposed a new integrated detection and tracking system that uses information on the movement of small flying objects. This system, called the recurrent convolutional network (RCN), consists of four modules, each performing a specific task: a convolutional layer, convolutional long short-term memory (ConvLSTM), a cross-correlation layer, and a fully connected layer.
The authors fine-tuned AlexNet- and Visual Geometry Group-16 (VGG-16)-based systems rather than training from scratch. To evaluate the system, it was first tested on a bird dataset to detect birds around a wind farm, and then on a drone dataset of 20 manually captured videos to check whether the system could be applied to other flying objects. The experimental results, presented as receiver operating characteristic (ROC) curves, showed that the proposed system outperformed previous solutions. In [24], the authors proposed a deep learning-based drone tracking system that provides the exact location of the drone in the input image. The proposed system consists of detection and tracking modules that complement each other to achieve high performance. While the Faster R-CNN drone detection module detects and localizes the drone in static images, an object tracking module called MDNet (multi-domain network) determines the position of the drone in the next frame based on its position in the current frame, which makes it possible to search for the drone only in a certain area rather than over the entire frame. To prepare the dataset, the authors used a publicly available drone dataset consisting of 30 YouTube videos shot with different drone models and the USC drone dataset consisting of 30 videos shot with one drone model. Since the number of static drone images for the drone tracking problem was very limited and labeling is laborious, the authors developed a model-based data augmentation method that generates training images and automatically annotates the position of the drone in each frame. The main idea of this technique is to cut out drone images and place them on top of background images. The experimental results showed that, despite being trained on synthetic data, the proposed system worked well on realistic images of drones against complex backgrounds. Peng et al.
[25] used the physically based rendering tool (PBRT) to address the problem of limited visual data by creating photorealistic images of UAVs. The authors built a large-scale training set of 60,480 rendered images by varying UAV positions and orientations, 3D models, surface materials, intrinsic and extrinsic camera parameters, environment maps, and the post-processing of the rendered images. To detect UAVs, a Faster R-CNN network was fine-tuned in Detectron, the framework released by Facebook artificial intelligence (AI) research, using the weights of the base ResNet-101 model. The Faster R-CNN network trained on rendered images showed an average precision of 80.69% on the manually annotated UAV test set, 43.03% on the pre-trained COCO (Common Objects in Context) 2014 dataset, 43.36% on the PASCAL VOC (Visual Object Classes) 2012 dataset, and 56.28% on the rendered training set. According to these results, the average precision (AP) of the Faster R-CNN detection network trained on rendered images was relatively high compared to other methods.

Hu et al. [26] adapted and fine-tuned the YOLOv3 detector [27] to detect drones. The authors collected a dataset consisting of images of drones on which they trained the detector. The video processing speed reached 56.3 frames per second.

The best results in drone detection have been achieved by detectors based on deep learning. This is evidenced by the fact that most studies on the detection of drones [28–31] have partially or totally relied on CNNs for solving the problem.

#### 1.2.2. Drone-vs.-Bird Challenge

The primary related works on UAV detection are the methods proposed specifically for the Drone-vs-Bird detection challenge [5], which was organized in 2017 and 2019. The main goal of the challenge is to detect and distinguish drones from birds in short videos taken from a large distance by a static camera. To perform flying object detection, Saqib et al. [32] evaluated the Faster R-CNN [33] object detector with different CNN backbones. In their experiments, the Faster R-CNN with a VGG-16 backbone performed better than the other networks, reaching 0.66 mAP. The authors concluded that the results might be improved by annotating birds as a separate class, which could reduce false positives and enable the trained model to distinguish birds and drones accurately. Aker et al. [34] addressed the problem of predicting the location of a drone in video and distinguishing drones from birds by adapting and fine-tuning the single-stage YOLOv2 [35] algorithm. An artificial dataset was created by mixing real images of drones and birds, segmented from their backgrounds, with frames of coastal-area videos. The proposed network was evaluated using precision–recall (PR) curves, on which precision and recall simultaneously reached the value of 0.9. Nalamati et al. [28] examined the problem of detecting small drones using state-of-the-art deep learning methods such as the Faster R-CNN [33] and SSD [36], with Inception v2 [37] and ResNet-101 [31] as backbone networks. The authors fine-tuned the backbones on the Drone-vs-Bird challenge dataset, which consisted of 8771 frames extracted from 11 Moving Picture Experts Group (MPEG-4)-coded videos. For each algorithm, two cases were considered: when the drone is close to the camera, i.e., large, and when the drone is far from the camera, i.e., small.
According to the results of the experiment, in the first case all the algorithms were able to detect the drone. In the second case, when two drones appeared in the frame simultaneously at a large distance, the Faster R-CNN with the ResNet-101 backbone could successfully detect both drones, whereas the Faster R-CNN with the Inception v2 backbone was only able to detect one of the two. The single-stage SSD detector could not detect either of the long-range drones and was ineffective at detecting small objects. According to the evaluation and the challenge results, the Faster R-CNN with ResNet-101 performed best, achieving recall and precision values of 0.15 and 0.1, respectively. The authors did not consider the detection time, but in future work they plan to use it as a key indicator for evaluating the effectiveness of the proposed model in real-time drone detection applications. De la Iglesia et al. [38] proposed an approach based on the RetinaNet [39] object detector. To perform drone predictions at different scales, the feature pyramid network (FPN) [40] architecture was used, in which the lower pyramid levels are responsible for detecting small objects and the upper levels focus on larger objects. ResNet-50-D [29] was used as the backbone network and was trained on the Drone-vs-Bird and Purdue UAV [41] datasets. The precision attained on the Drone-vs-Bird challenge was around 0.52, while recall was 0.34. In addition, the experimental results in this work showed that the use of motion information can significantly increase detection accuracy. According to the challenge results, the F1 score of this approach reached 0.41. To improve the detection accuracy of existing detectors, some approaches applied additional processing to the data. For example, Magoulianitis et al.
[30] pre-processed images with the deep CNN with skip connection and network in network (DCSCN) super-resolution technique [42] before applying the Faster R-CNN detector. As a result, the detector became capable of detecting very distant drones, increasing its recall. The results obtained in the challenge were 0.59 for recall and 0.79 for precision. In some works [14,43], the solution was divided into two stages: in the first stage, all objects that are highly likely to be drones are detected; in the second stage, a high-precision classifier is applied to the detections to reduce the number of false positives. For example, Schumann et al. [43] designed a flying object detector using either median background subtraction or a deep neural network-based region proposal network (RPN). The detected flying objects were then classified into drones, birds, and clutter by the proposed CNN classifier optimized for small targets. To train a robust classifier, the authors created their own dataset containing a total of 10,386 images of drones, birds, and backgrounds. The proposed
framework was evaluated using five static-camera sequences and one moving-camera sequence of the Drone-vs-Bird challenge dataset. All birds appearing in the video sequences were manually annotated. The classification of flying objects for different input image sizes, such as 16 × 16, 32 × 32, and 64 × 64, was performed separately for the authors' own dataset and the Drone-vs-Bird challenge dataset. For the 2017 Drone-vs-Bird challenge, the authors proposed a VGG-conv5 RPN detector optimized for a 16 × 16 image size; based on the challenge metric results, this team took first place in that competition. Craye et al. [14] developed two separate networks: the semantic segmentation network U-Net [44] for the detection stage and a ResNetV2 network [45] for the classification stage. Achieving a recall of 0.71 and a precision of 0.76, their approach won the Drone-vs-Bird detection challenge in 2019.

#### **2. Proposed Approach**

In this work, we focused on the real-time detection of drones in scenes with a static background. As illustrated in Figure 1, our approach consists of two modules: a moving object detector and a drone-bird-background classifier.

**Figure 1.** The proposed drone detection pipeline.

The motion detector was based on a background subtraction method. The outputs of this module are all moving objects in the scene. All the detections are fed into a classifier, which differentiates drones from other moving objects. The classifier is a CNN that was trained on the dataset of images of birds, drones, and backgrounds we collected.

#### *2.1. Background Subtraction Method*

Several methods are used for detecting flying or moving objects in a video sequence, such as background subtraction, optical flow, edge detection, and frame differencing [45]. Optical flow is used for motion estimation in video and detects moving objects on the basis of the objects' relative velocities in the scene. The heavy computation of the optical flow method makes it inapplicable to real-time detection tasks [45]. By calculating the difference between the current and previous frames of a video sequence, a frame differencing algorithm extracts the moving objects. Despite its advantages, including quick implementation, flexibility to dynamic changes of the scene, and relatively low computation, frame differencing is generally inefficient at extracting all the relevant pixels of the moving regions [45]. The background subtraction method is used to detect foreground objects against the background of a video sequence. It is considered one of the most widely used detection methods because of its fast and accurate detection, which makes it applicable to real-time detection; additionally, it is easy to implement. The main drawback of this method is that it is invalid for moving cameras, because each frame has a different background. In our case, we focused on video with a static background, and all the short videos of the Drone-vs-Bird detection challenge dataset were taken from a large distance by a static camera. Therefore, our motion detector was based on a background subtraction method.

#### Moving Objects Detection

The task of the motion detector is to detect all objects moving in a scene. The performance of this module was evaluated by its recall. We conducted experimental studies of various motion detectors using the Drone-vs-Bird challenge dataset. The greatest recall was achieved by the motion detector based on the two-point background subtraction algorithm [46]. The output of a common background subtraction algorithm is a binary image in which the pixels that change their values in the next frame take the value of 1, while unchanged pixels are set to zero. In addition to moving objects, the output image contains noise in the form of single pixels distributed throughout the image. To remove this noise, the output binary image is filtered. An example of a filtering result is shown in Figure 2.

**Figure 2.** All steps of the proposed drone detection algorithm.

Next, dilation is performed to connect closely spaced pixels. This operation reduces the number of individual regions to be checked by the classifier, thereby increasing the processing speed of the detector. The last step of the moving object detector is to find the bounding boxes covering the regions found in the previous step. All found bounding boxes are sent to the drone-bird-background classifier.
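The steps above (thresholded change mask, dilation, bounding box extraction) can be sketched in plain NumPy. This is an illustrative toy, not our implementation: a fixed-threshold frame difference stands in for the two-point background subtraction of [46], the noise filtering step is omitted, and the numeric values are arbitrary assumptions.

```python
import numpy as np
from collections import deque

def motion_bounding_boxes(prev_gray, curr_gray, thresh=30):
    """Sketch of the moving object detector: background subtraction,
    dilation, and bounding box extraction (illustrative only)."""
    # 1. pixels that change between frames become 1, others 0
    diff = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
    mask = diff > thresh
    # 2. dilate with a cross-shaped 3x3 element to connect nearby pixels
    p = np.pad(mask, 1)
    mask = p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:] | p[1:-1, 1:-1]
    # 3. connected components (BFS) -> one bounding box per moving region
    h, w = mask.shape
    visited = np.zeros_like(mask, dtype=bool)
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not visited[y, x]:
                queue = deque([(y, x)])
                visited[y, x] = True
                ys, xs = [y], [x]
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] \
                                and not visited[ny, nx]:
                            visited[ny, nx] = True
                            queue.append((ny, nx))
                            ys.append(ny)
                            xs.append(nx)
                # bounding box (x_min, y_min, x_max, y_max) of the region
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

Each returned box would then be cropped from the frame and passed to the drone-bird-background classifier.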

#### *2.2. CNN Image Classification*

Audio, image, and text classification tasks are mostly performed using artificial neural networks; for image classification, CNNs are mainly used [45]. A CNN usually consists of three primary layer types: convolution, pooling, and fully connected (FC) layers. Convolution layers are the main building blocks of CNN models. Convolution is a mathematical operation for combining two sets of information. Convolution layers consist of filters and feature maps. A convolution operation is performed by sliding a filter along the input; at each position, the element-wise multiplication of the matrices is performed, and the result is summed. This sum is written into the feature map. That is, trained filters are used to extract important features of the input, and the feature map is the output of the filter applied to the previous layer. The size of the filter that performs the convolution is typically odd (1 × 1, 3 × 3, 5 × 5, 7 × 7, 11 × 11, etc.). Deep learning is commonly used to solve non-linear problems, but the values obtained from the matrix products in the convolutional layer are linear. To introduce non-linearity, a non-linear activation function (ELU, SELU, ReLU, tanh, sigmoid, etc.) is usually applied after each convolutional layer. The pooling layer is periodically inserted into a CNN's architecture. Its main function is to reduce the image size and compress each feature map by taking the maximum pixel values in a grid. Most CNN architectures use max pooling, which typically uses a 2 × 2 window with a stride of 2 and takes the largest element of the input feature map in each window; as a result, the output feature map is half the size. After going through the processes described above, the model is able to learn the features. The fully connected layer comes after the convolution, activation, and pooling layers. The outputs of convolution and pooling layers are always three-dimensional (3D), but
a FC layer expects a one-dimensional vector of numbers. Therefore, we flatten the output of the pooling layer into a vector, which becomes the input of the FC layer [47]. It is then fed into the nodes of the neural network, which performs the classification. Different CNN architectures exist, such as LeNet, AlexNet, VGGNet (VGG-16 and VGG-19), GoogLeNet (Inception v1), ResNet (ResNet-18, ResNet-34, ResNet-50, etc.), and MobileNet (MobileNetV1 and MobileNetV2). They differ in the number of layers and the sizes of their trainable parameters. These networks are very deep and can have thousands or even millions of parameters. The huge number of parameters lets a network learn more difficult patterns, which improves classification accuracy; on the other hand, it affects the training speed, the memory required for saving the network, and the computational complexity. MobileNet [48] is an efficient convolutional neural network architecture that reduces the amount of memory used for computing while maintaining high predictive accuracy. It was developed by Google and trained on the ImageNet dataset, and it is suitable for mobile devices or any devices with low computational power. MobileNetV1 has two types of layers: a depthwise convolution layer for lightweight filtering, with a single convolutional filter per input channel, and a 1 × 1 (pointwise) convolution layer for building new features by computing linear combinations of the input channels. MobileNetV2 consists of two kinds of blocks: a residual block with a stride of 1 and a block with a stride of 2 used for downsizing [49]. Each of these blocks has three layers: a 1 × 1 convolution layer with a rectified linear unit (ReLU6), a depthwise convolution, and another 1 × 1 convolution layer without a non-linearity.
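The parameter savings of MobileNet's depthwise separable design can be checked with a quick count (a back-of-the-envelope sketch; bias terms are ignored):

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution layer (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise separable layer: one k x k filter per input
    channel (depthwise) plus a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

# example: a 3 x 3 layer mapping 32 channels to 64
standard = conv_params(3, 32, 64)                   # 18432 weights
separable = depthwise_separable_params(3, 32, 64)   # 2336 weights
```

For this layer, the separable version needs roughly eight times fewer weights, which is why MobileNet variants run well on devices with low computational power.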

#### Moving Objects Classification

One of the most important parts of our approach is the classification of the found objects. In real-world scenarios, the detected moving objects are drones, birds, airplanes, insects, and moving parts of scenes. Therefore, we decided to use a classifier that divides all found objects into three classes: drones, birds, and background. The MobileNetV2 [50] CNN was chosen as the classifier because of its low inference time and high accuracy. According to [51], the highest detection speed was achieved by a detector with the MobileNet [48] backbone network, and MobileNetV2 is an improved version of MobileNet that significantly improves its accuracy [50]. The MobileNetV2 architecture consists of 19 original basic blocks named bottleneck residual blocks (see Figure 3). These blocks are followed by a 1 × 1 convolution layer and an average pooling layer; the last layer is a classification layer. We used a modified version of the MobileNetV2 network [52], in which the author made the network more suitable for tiny images by changing the stride, padding, and filter sizes. We changed the classification layer so that the network classifies three classes.

**Figure 3.** The architecture of the MobileNetv2 network.

#### **3. Experiments and Results**

#### *3.1. Data Preparation*

The amount of data is crucial for CNN training. Insufficient data affect the generalization ability of the network; as a result, classification accuracy decreases when the network receives new data. As training data, we used the Drone-vs-Bird challenge dataset, which consists of 11 videos recorded by a static camera. In addition to a drone, birds and other moving objects may appear in the videos. For each video, annotations of the drones appearing in the frames are provided in the form of coordinates and sizes of the ground truth bounding boxes. Using the videos and their annotations, we extracted 10,155 images of drones. To extract images of birds and background from the videos, we applied the moving object detector described in the previous subsection to the entire dataset and then manually labeled all the images of the detected moving objects. As a result, 1921 images of birds and 9348 images of the background were obtained. Since the number of bird images was low compared to the other two classes, we additionally used 2651 bird images from Wild Birds in a Wind Farm: Image Dataset for Bird Detection [53]. Thus, the total number of images in our dataset was 24,075. The input size of the network was 32 × 32 × 3, so we resized all the images to match the input layer of the network. Some examples of the resized images are shown in Figure 4.
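The extraction of training crops can be sketched as follows. The nearest-neighbour resize is a minimal stand-in for whatever interpolation was actually used, and the (x, y, w, h) box format is an assumption about the annotation layout.

```python
import numpy as np

def crop_and_resize(frame, box, size=32):
    """Cut a bounding box out of a frame and resize it to the 32 x 32
    network input using nearest-neighbour sampling (illustrative)."""
    x, y, w, h = box
    crop = frame[y:y + h, x:x + w]
    # nearest-neighbour index maps for rows and columns
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    return crop[rows][:, cols]
```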

**Figure 4.** Some of the collected images for training. The first row shows the drones, the second row consists of background images, and the third row is the images of birds.

#### *3.2. Training*

We trained the MobileNetV2 CNN from scratch using the dataset described in the previous section. The dataset was divided into training and testing sets in a proportion of 80 to 20. To train the network, we used the stochastic gradient descent (SGD) optimization algorithm with a starting learning rate of 0.05, a momentum of 0.9, and a weight decay of 0.001. The training was done on an NVIDIA GeForce GT 1030 2 GB GPU with a batch size of 88. We decreased the learning rate by a factor of 10 every 50 epochs during training.
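The step decay described above corresponds to the following schedule (a trivial sketch; in a framework this would be an SGD optimizer with momentum 0.9 and weight decay 0.001 plus a step scheduler with a decay factor of 0.1 every 50 epochs):

```python
def learning_rate(epoch, base_lr=0.05, drop=0.1, step=50):
    """Learning rate at a given epoch: start at 0.05,
    divide by 10 every 50 epochs."""
    return base_lr * drop ** (epoch // step)
```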

#### *3.3. Evaluation Metrics*

For the evaluation of any object detection approach, several statistical and machine learning metrics can be used: ROC curves, precision and recall, F-scores, and false positives per image [54]. Generally, the results of an object detector are first compared to the given list of ground-truth bounding boxes. To answer the question of when a detection can be considered correct, most studies related to object detection have used the overlap criterion introduced by Everingham et al. [55] for the Pascal VOC challenge. As noted above, the detections are assigned to ground truth objects and, based on the bounding box overlap, are judged to be true or false positives. To be considered a correct detection according to [55], the overlap ratio between the predicted and ground truth boxes must exceed 0.5 (50%). The Pascal VOC overlap criterion is defined as the intersection over union (IoU) and computed as follows:

$$IoU = a_0 = \frac{\mathrm{area}\left(B_p \cap B_{gt}\right)}{\mathrm{area}\left(B_p \cup B_{gt}\right)}\tag{1}$$

where the IoU is the intersection over union; *a*<sub>0</sub> is the overlap ratio; *B<sub>p</sub>* and *B<sub>gt</sub>* are the predicted and ground truth bounding boxes, respectively; area(*B<sub>p</sub>* ∩ *B<sub>gt</sub>*) is the overlap or intersection of the predicted and ground truth bounding boxes; and area(*B<sub>p</sub>* ∪ *B<sub>gt</sub>*) is the area of the union of these two bounding boxes. Having matched detections to ground truth, we can determine the number of correctly detected objects, called true positives (TPs); incorrect detections, or false positives (FPs); and ground truth objects missed by the detector, or false negatives (FNs). Using the total numbers of TPs, FPs, and FNs, we can compute a wide range of evaluation metrics.
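Equation (1) translates directly into code; the sketch below assumes axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_p, box_gt):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    # corners of the intersection rectangle
    ix1 = max(box_p[0], box_gt[0])
    iy1 = max(box_p[1], box_gt[1])
    ix2 = min(box_p[2], box_gt[2])
    iy2 = min(box_p[3], box_gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0
```

A detection counts as a true positive when this value exceeds 0.5.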

We evaluated our approach using the Drone-vs-Bird challenge's [5] metrics. The challenge provided three test videos for evaluation that were named gopro\_001, gopro\_004, and gopro\_006. The first video contained frames with two drones and a moving background. A feature of the second video was a static background and a very small size of the drone. In the third video, in addition to the drone, several birds were present on the frames. The main evaluation metric used in the challenge is the F1-score:

$$F_1 = 2 \ast \frac{Precision \ast Recall}{Precision + Recall} \tag{2}$$

To calculate Precision and Recall, we applied our drone detector to the test videos and counted the total numbers of TPs, FPs, and FNs.

$$Precision = \frac{TP}{TP + FP} \tag{3}$$

$$Recall = \frac{TP}{TP + FN} \tag{4}$$

Precision and recall are metrics that can be used to evaluate most information extraction algorithms. Sometimes they are used on their own, and sometimes as a basis for derived metrics such as the F-score and precision–recall curves. Precision is the proportion of true positives among all positively predicted objects, whereas recall is the fraction of correctly detected objects among all ground truth positive objects; it shows how many of all positive examples were classified correctly. Based on these two metrics, we can calculate the F1-score, which combines information about precision and recall. A detection was counted as a true positive if the IoU between the detected and ground truth bounding boxes exceeded 0.5.
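Equations (2)-(4) can be combined into one helper; this sketch also guards against empty denominators, a case the formulas leave undefined:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```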

#### *3.4. Results*

As a result of training, the accuracy of the classifier on the entire dataset was 99.83%. The confusion matrix is shown in Figure 5.

**Figure 5.** Confusion matrix of the trained convolutional neural network (CNN).

The experimental results obtained by applying our detector to all test videos are shown in Figure 6. True positive and false positive values were counted for an IoU of 0.5. The results were divided into three ranges, depending on the drone size.

**Figure 6.** The results of the experiment for various drone sizes.

In Figure 6, w and h are the width and height of the ground truth bounding box in pixels, respectively. The value √(*w* ∗ *h*) reflects the size of the drone in the image: the lower this value, the farther the drone is from the camera. Based on these data, precision and recall values were calculated. Then, we calculated the F1-score by Equation (2) and added it to the last row of Table 1. The same sequence was performed individually for each video.

**Table 1.** The results of the evaluation for an intersection over union (IoU) = 0.5.


For a more detailed analysis of the detector, we conducted experiments for various values of the IoU. The curves were plotted based on the obtained results of recall, precision, and F1-score, as shown in Figure 7.

**Figure 7.** Evaluation metrics values for various IoU values.

Qualitative detection results are depicted in Figure 8.

**Figure 8.** Qualitative results of our detector. Green bounding boxes are ground truth bounding boxes. Red bounding boxes are the results of applying our detector.

Eighty-five percent of all false positives were caused by inaccurate estimations of the bounding boxes, resulting in a calculated IoU value of less than 0.5. The remaining 15% were classification errors, as a result of which other moving objects were misclassified as drones. Examples of false detections caused by incorrect classification are shown in Figure 9.

**Figure 9.** Examples of false detections.

The images shown in Figure 9 were classified by the detector as drones. In most cases, the misclassified objects were birds, clouds, swaying tree branches, and grass. On average, the detector processed nine frames of size 1920 × 1080 per second. A third of the processing time was spent on the classification of moving objects, and the rest was spent on their detection. We noticed that the detection speed depended on the background change rate: as the background changed more, the number of bounding boxes fed into the classifier grew and the processing speed dropped. This dependency is shown in more detail in Figure 10.

**Figure 10.** The evaluation results of detection speed.

#### **4. Discussion**

Our findings suggest that dividing the drone detection task into detecting moving objects and classifying the detections can be effective for accurate and fast drone detection. However, the use of motion information for detecting moving objects has several drawbacks. First, as shown in Figure 10, a moving background causes an increase in the number of detected objects, which leads to an increase in the classification time and in the number of false positives. Second, if the drone flies close to moving objects, it becomes impossible to distinguish it from the other objects, as can be seen in Figure 11.

**Figure 11.** The result of applying background subtraction to a video segment in which a drone was flying near swaying grass.

As a result, the drone was not detected, which led to an increase in the number of false negatives. At the same time, the number of false positives also increased because more images were fed to the classifier; since the accuracy of the classifier was not 100%, this caused a greater number of classification errors. Using the metrics of the Drone-vs-Bird detection challenge [5] allowed us to compare our results with those of the other teams participating in the challenge. According to the experimental results, the accuracy of our approach was comparable with the approaches proposed in [30] and [14]. Compared to [30], in which the application of super-resolution alone was
performed at a speed of 0.58 FPS, our detector had a significantly higher detection speed. For an IoU of 0.3, our approach performed much worse than [14] but still better than [30]. The comparison of previous works and our approach is shown in Table 2.


**Table 2.** Comparison results.

#### **5. Conclusions**

In this paper, we presented a real-time drone detection algorithm whose accuracy is comparable to that of existing algorithms. We provided further evidence that the task of drone detection can be successfully solved by dividing it into detection and classification stages. The experimental results showed the advantages and disadvantages of this approach. The most important limitation of our detector lies in the fact that its performance is highly dependent on the presence of a moving background. We believe that the accuracy of our detector can be improved by using a larger dataset to train the classifier. For future work, we suggest combining visual information with motion information to detect candidates in the detection stage.

**Author Contributions:** Conceptualization, U.S., D.A., L.I., and E.T.M.; methodology, U.S. and D.A.; software, D.A.; validation, U.S., D.A., and E.T.M.; formal analysis, U.S. and D.A.; investigation, U.S. and D.A.; resources, U.S. and D.A.; data curation, U.S., D.A., and E.T.M.; writing—original draft preparation, U.S. and D.A.; writing—review and editing, U.S. and E.T.M.; visualization, D.A.; supervision, E.T.M. and L.I. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** We would like to express our great appreciation to the SafeShore project for providing the Drone-vs-Bird challenge dataset. Special thanks to Eric Matson for his patient guidance and useful critiques of this research work.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Ship Type Classification by Convolutional Neural Networks with Auditory-Like Mechanisms**

#### **Sheng Shen, Honghui Yang \*, Xiaohui Yao, Junhao Li, Guanghui Xu and Meiping Sheng**

School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an 710072, China; shensheng@mail.nwpu.edu.cn (S.S.); yaoxiaohui@mail.nwpu.edu.cn (X.Y.); ljhhjl@mail.nwpu.edu.cn (J.L.); hsugh@mail.nwpu.edu.cn (G.X.); smp@nwpu.edu.cn (M.S.)

**\*** Correspondence: hhyang@nwpu.edu.cn

Received: 1 November 2019; Accepted: 26 December 2019; Published: 1 January 2020

**Abstract:** Ship type classification with radiated noise helps monitor the noise of shipping around the hydrophone deployment site. This paper introduces a convolutional neural network with several auditory-like mechanisms for ship type classification. The proposed model mainly includes a cochlea model and an auditory center model. In the cochlea model, acoustic signal decomposition at the basilar membrane is implemented by a time convolutional layer with auditory filters and dilated convolutions. The transformation into neural patterns at the hair cells is modeled by a time-frequency conversion layer to extract auditory features. In the auditory center model, auditory features are first selectively emphasized in a supervised manner. Then, spectro-temporal patterns are extracted by a deep architecture with multistage auditory mechanisms. The whole model is optimized with an objective function of ship type classification to form the plasticity of the auditory system. The contributions compared with an auditory-inspired convolutional neural network include the improvements in dilated convolutions, deep architecture, and target layer. The proposed model can extract auditory features from a raw hydrophone signal and identify types of ships under different working conditions. The model achieved a classification accuracy of 87.2% on four ship types and ocean background noise.

**Keywords:** machine learning; neural network; ship radiated noise; underwater acoustics

#### **1. Introduction**

The soundscape of the oceans is heavily affected by human activities, especially in coastal waters. The increase in ocean noise levels is correlated with burgeoning global trade and the accompanying expansion of shipping. Automatic recognition of ship type from ship radiated noise is complicated not only by the mechanism of noise generation, but also by the complex underwater sound propagation channel. Conventional recognition methods based on machine learning generally include three stages: feature extraction, feature selection and classifier design.

Conventional feature extraction methods for ship radiated noise include waveform features [1,2], auditory features [3,4], wavelet features [5] and so on. Because they rely on fixed parameters or filters, these manually designed features are limited in their ability to capture variations in complex ocean environments and ship operative conditions. Biophysically based models [3,4] are limited to early auditory stages for extracting auditory features. Auditory features designed from perceptual evidence typically focus on describing the signal rather than on the classification purpose [6]. These features do not exploit the plastic mechanisms and representations at the various auditory stages to improve recognition performance. Although noisy or redundant features can be removed by feature selection methods [7], the inherent problem of manually designed features cannot be solved this way.

A support vector machine (SVM) was used to recognize ship noise from manually designed features [7]. With increasing data size, hierarchical architectures have been shown to outperform shallow models. Spectrograms [8], probabilistic linear discriminant analysis and i-vectors [9] were used as the input of neural classifiers to detect ship presence and classify ship types. Applying deep learning, Kamal [10] used a deep belief network and Cao [11] used a stacked autoencoder. A competitive learning mechanism [12,13] was used to improve clustering performance during the training of the deep network. In these works, classifier design and feature extraction were separated from each other, with the drawback that the designed features may not be appropriate for the classification model.

Deep learning has made it possible to model the original signal and predict targets within a single model [6,14], much as the auditory system is thought to do. The time convolutional layer in an auditory inspired convolutional neural network (CNN) [6] provided a new way of modeling underwater acoustic signals. However, that network was not deep enough to match the growing acquired dataset. Moreover, its conventional convolutional layer and fully connected layer led to numerous parameters.

In this paper, we present a deep architecture that aims to capture the functionality and robustness of the auditory system to improve recognition performance. Its key element is a deep architecture with time- and frequency-tolerant feature detectors, inspired by neural cells along the auditory pathway. The early-stage auditory model is derived from the auditory inspired CNN, in which the time convolutional layer is improved by dilated convolution. The construction of the deep network draws on the Inception and residual network architectures [15] from machine vision, whose core ideas also have analogues in the auditory pathway. Thus, the frequency convolutional layers in Reference [6] are improved to increase the depth of the network. At the final stage, substituting global average pooling for the fully connected layer at the target layer greatly reduces the number of parameters. The main findings of this paper are briefly summarized as follows:


This paper is organized as follows. Section 2 gives an overview of auditory mechanisms and the structure of the proposed model. Section 3 describes details of the model, which include the cochlea model for ship radiated noise modeling and the multistage auditory center model for feature learning and target classification. Section 4 includes the experimental data description, experimental setup and results. An overall discussion and directions for future work are given in Section 5.

#### **2. Model**

#### *2.1. Auditory Mechanisms*

Decades of physiological and psychoacoustical studies [16–18] have revealed elegant strategies at various stages of the mammalian auditory system for representation of acoustic signals. Sound is first analyzed in terms of relatively few perceptually significant attributes, followed by higher level integrative processes.

When sound arrives at the ears, the vibration of the eardrum caused by the sound wave is transmitted to the cochlea in the inner ear via ossicular chain in the middle ear. The cochlea performs two fundamental functions. First, through the vibration of different parts of the basement membrane, the cochlea effectively separates the frequency components of sound. The second function of the cochlea is to transform these vibrations into neural patterns with the help of hair cells distributed along the cochlea [18].

The auditory center is one of the longest central pathways in the sensory system. The acoustic spectrum is extracted early in the auditory pathway at the cochlear nucleus, the first stage beyond the auditory nerve. Multiple pathways emerge from the cochlear nucleus up through the midbrain and thalamus to the auditory cortex. Each pathway passes through different neural structures and repeatedly converges onto and diverges from other pathways along the way [18]. This complex structure extracts rich and varied auditory percepts from the sound, to be later interpreted by the brain. Neurons in the primary cortex have been shown to be sensitive to specific spectro-temporal patterns in sounds [19].

As a result of auditory experience, systematic long-term changes in the responses of neurons to sound are defined as plasticity. Plasticity and learning probably occur at all stages of the auditory system [20].

By reviewing the process of auditory perception, we can identify the following four mechanisms of the early and higher auditory stages that are useful for establishing an auditory computational model.


#### *2.2. Model Structure*

The purpose of an auditory computational model is to transform the raw acoustical signal into representations that are useful for auditory tasks. In this paper, mechanisms of the auditory system are formulated mathematically in a deep CNN for ship type classification. The model mainly includes the cochlea model and the auditory center model. A complete description of the network's specifications is given in Figure 1.

**Figure 1.** Model structure. The model mainly includes the cochlea model and the auditory center model. In the time convolutional layer, four colors represent the four groups. Dilated convolutions are represented by parallel lines at equal intervals. The time frequency conversion layer includes a permute layer and a max-pooling layer. At the top of the graph, auditory feature recalibration is implemented by global max-pooling and fully connected layers. Frequency convolutional layers are implemented by Inception-ResNet. The final stage includes global average pooling and a softmax layer.

Based on research on the human cochlea, a series of multiscale gammatone auditory filters [21] is used to initialize the time convolutional layer with dilated convolutions [22]. Ship noise signals are decomposed by convolution with these auditory filters. Inspired by the function of hair cells, the time frequency conversion layer transforms the decomposed components into the amplitudes of their corresponding frequency components, that is, the frequency spectrum [18]. We introduce an auxiliary classifier with the goal of enhancing the gradient and recalibrating the learned spectrum in a supervised manner. Learned spectra are further processed by a deep architecture with inception structures and residual connections [15] to model the multistage auditory pathway. These layers are defined as frequency convolutional layers. The resulting feature maps of the last frequency convolutional layer are fed into a global average pooling layer [23]. Ship types are then predicted in the softmax layer. During the training of the network, the auditory filters and features are adapted to the classification task while remaining grounded in the human auditory system.

#### **3. Methodology**

#### *3.1. Cochlea Model for Ship Radiated Noise Modeling*

The cochlea model is the first stage of the proposed model; it includes the time domain signal decomposition of the basement membrane and the time frequency conversion of the hair cells. The cochlea model creates a frequency-organized axis known as the tonotopic axis of the cochlea.

#### 3.1.1. Time Convolutional Layer with Dilated Auditory Filters

Much is known about the representation of spectral profile in the cochlea [18,21]. The physiologically derived gammatone filter *g*(*t*) is shown in (1).

$$g(t) = at^{n-1}e^{-2\pi bt}\cos(2\pi ft + \phi),\tag{1}$$

where *a* is the amplitude, *t* is the time in seconds, *n* is the filter's order, *b* is the bandwidth in Hz, *f* is the center frequency in Hz, and *φ* is the phase of the carrier in radians. The center frequency *f* and bandwidth *b* are set by the equivalent rectangular bandwidth (ERB) [24] cochlea model in (2) and (3).

$$ERB(f) = 24.7(4.37f/1000 + 1)\tag{2}$$

$$b = 1.019 \times ERB(f). \tag{3}$$
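As a concrete illustration, Equations (1)–(3) can be implemented directly. The sketch below is a minimal NumPy version; the filter order *n* = 4, the 50 ms duration and the unit amplitude are plausible defaults chosen here for illustration, not values stated in the paper:

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth of Eq. (2), f in Hz."""
    return 24.7 * (4.37 * f / 1000 + 1)

def gammatone(f, fs=16000, n=4, duration=0.05, a=1.0, phi=0.0):
    """Gammatone impulse response g(t) of Eq. (1), with b from Eq. (3)."""
    t = np.arange(int(duration * fs)) / fs          # time axis in seconds
    b = 1.019 * erb(f)                              # Eq. (3)
    return a * t**(n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t + phi)

g = gammatone(1000.0)  # a 1 kHz filter: 800 samples at 16 kHz
```

Such impulse responses can then serve as the initial weight vectors of the time convolutional layer described next.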

Convolutional kernels of the time convolutional layer represent a population of auditory nerve spikes. A series of gammatone filters of different sizes is used to initialize the weight vectors of this layer, which forms the primary feature extraction base of the network. This layer performs convolutions over a raw time domain waveform. Suppose the input signal **S** (**S** ∈ R^(L×N)) has *N* frames, each of length *L*. As shown in (4), the signal **S** is convolved with a kernel **k**m and an additive bias *b*m is added. The output then passes through an activation function *f* to form the output feature map **t**m (**t**m ∈ R^(L×N)).

$$\mathbf{t}_m = f(\mathbf{S} \ast \mathbf{k}_m + b_m), \quad m = 1, 2, \dots, M. \tag{4}$$

For a time convolutional layer with *M* kernels, the output **T**1 = [**t**1, **t**2, ..., **t**M] is obtained. We use 128 gammatone filters with center frequencies ranging from 20 to 8000 Hz. For a 16 kHz sampling frequency, the impulse widths range from approximately 50 to 800 points. These filters are divided into 4 groups by quartering according to impulse width. The convolutional kernel widths of the 4 groups are 100, 200, 400 and 800, respectively. Thus, the number of parameters is reduced from 128 × 800 to 32 × (100 + 200 + 400 + 800).
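The parameter counts quoted above follow from simple arithmetic; a quick check in plain Python, mirroring the figures in the text:

```python
# Naive layer: 128 kernels, each 800 taps wide
naive = 128 * 800                        # 102,400 parameters

# Grouped layer: 4 groups of 32 kernels with widths 100, 200, 400, 800
grouped = 32 * (100 + 200 + 400 + 800)   # 48,000 parameters

# With dilation (next paragraph): every kernel reduced to 100 taps
dilated = 128 * 100                      # 12,800 parameters
```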

A bigger kernel size means more parameters, which makes the network more prone to overfitting. Dilated convolutions reduce network parameters while the receptive fields, center frequencies and bandwidths remain unchanged. To give the 4 groups an equal number of parameters, the dilation rates are 1, 2, 4, and 8, respectively. The number of parameters in this layer is thus further reduced to 128 × 100. Figure 2 illustrates the signal decomposition using underwater noise radiated from a passenger ship. During the recording period, the ship was 1.95 km from the hydrophone and its navigational speed was 18.4 kn.
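The reason dilation preserves the receptive field is that a 1-D kernel with *k* taps at dilation rate *d* spans (*k* − 1)·*d* + 1 input samples. A small sketch confirms that 100-tap kernels at rates 1, 2, 4 and 8 approximately cover the original widths 100, 200, 400 and 800 with equal parameter counts:

```python
def receptive_field(k, d):
    """Span of a 1-D dilated convolution kernel: k taps spaced d samples apart."""
    return (k - 1) * d + 1

# 100-tap kernels at dilation rates 1, 2, 4, 8 approximate the
# original kernel widths 100, 200, 400, 800
spans = [receptive_field(100, d) for d in (1, 2, 4, 8)]  # [100, 199, 397, 793]
```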

**Figure 2.** Signal decomposition of the time convolutional layer. The black line at the bottom is the input signal. Four colors (orange, yellow, green and blue) in the upper and middle parts represent the four groups. Four capsules at the top represent the outputs of the 4 groups. The four colors of straight lines in the middle part represent dilated convolutional kernels with dilation rates of 8, 4, 2 and 1.

#### 3.1.2. Time Frequency Conversion Layer

Stronger vibrations of the basement membrane lead to more vigorous neural responses, which are further transformed by the hair cells. The amplitude of the decomposed signal can be regarded as a neural response or frequency spectrum. The proposed time frequency conversion layer transforms the output of the time convolutional layer into the frequency domain. This is accomplished by a permute layer and a max-pooling layer. As shown in (5), **T**1 is permuted to **T**2 = [*τ*1, *τ*2, ..., *τ*N], where *τ*n ∈ R^(L×M), n = 1, 2, ..., N.

$$\mathbf{T}_2 = \mathrm{permute}(\mathbf{T}_1). \tag{5}$$

As shown in (6), the amplitude of *τ*n within regular time bins is calculated as **fr**n (**fr**n ∈ R^(K×M), n = 1, 2, ..., N) by max-pooling along the time axis. The output of this layer is **Fr** = [**fr**1, **fr**2, ..., **fr**N]. Thus, the internal representation of the sound is calculated as a spectro-temporal excitation pattern, which provides a clearer picture of the sound.

$$\mathbf{fr}_n = \text{max-pooling}(\tau_n). \tag{6}$$
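A minimal NumPy sketch of the permute and max-pooling steps in Equations (5) and (6); the pooling bin width of 500 samples (hence K = 32 bins per 1 s frame) is an assumed value for illustration, as the paper does not state it:

```python
import numpy as np

L, N, M, pool = 16000, 3, 128, 500        # frame length, frames, filters, bin width (assumed)
T1 = np.abs(np.random.randn(M, L, N))     # stand-in for time-conv outputs t_m ∈ R^(L×N)

# Permute layer, Eq. (5): T1 (M, L, N) -> T2 (N, L, M)
T2 = T1.transpose(2, 1, 0)

# Max-pooling along the time axis in bins of `pool` samples, Eq. (6)
K = L // pool
Fr = T2.reshape(N, K, pool, M).max(axis=2)   # (N, K, M): spectro-temporal pattern
```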

#### *3.2. Multistage Auditory Center Model for Feature Extraction and Classification*

The output of the time convolutional layer is divided into two routes. One route goes through the time frequency conversion layer and then directly through the deep neural network to model the multistage auditory pathway. The other route performs auditory feature recalibration.

#### 3.2.1. Supervised Auditory Feature Recalibration

Given the relatively large depth of the network, the ability to propagate gradients back through all the layers should be enhanced, especially for the time convolutional layer at the front of the network. Therefore, we propose an auditory feature recalibration block on the basis of the recalibration block of [25]. This block takes **T**1 as its input. As shown in (7), global max pooling is used to aggregate each **t**m across the frame length.

$$\mathbf{r}_m = \text{max-pooling}(\mathbf{t}_m), \quad m = 1, 2, \dots, M. \tag{7}$$

The output **r**m (**r**m ∈ R^N) is the amplitude of all the frames in **t**m. The output of this layer is **R**1 = [**r**1, **r**2, ..., **r**M]. It is permuted to **R**2 = [*γ*1, *γ*2, ..., *γ*N], where *γ*n ∈ R^M, n = 1, 2, ..., N. Two fully connected layers then follow to capture the dependencies of the frequency components. The activation functions of the fully connected layers are the Rectified Linear Unit (ReLU) and the sigmoid, respectively. The equation of the two layers is shown in (8):

$$\mathbf{fc}_n = \mathrm{sigmoid}(\upsilon_n\,\mathrm{ReLU}(\omega_n \gamma_n)), \quad n = 1, 2, \dots, N. \tag{8}$$

**Fc** = [**fc**1, **fc**2, ..., **fc**N] is the output of the fully connected layers, where **fc**n (**fc**n ∈ R^M) corresponds to the *n*th frame. **W** = [*ω*1, *ω*2, ..., *ω*N] and **U** = [*υ*1, *υ*2, ..., *υ*N] are the weight vectors of the two layers. **Fc** is also divided into two routes. One route is fed directly into a softmax layer to form an auxiliary classifier. We expect to strengthen the gradient signal by adding the loss of the auxiliary classifier to the total loss of the network. The other route performs the auditory feature recalibration shown in (9).

$$\mathbf{fr}'_n = \mathbf{fc}_n \times \mathbf{fr}_n, \quad n = 1, 2, \dots, N, \tag{9}$$

where **fr**′n (**fr**′n ∈ R^(K×M)) is the channel-wise multiplication between **fc**n and **fr**n. The output of the layer is **Fr**′ = [**fr**′1, **fr**′2, ..., **fr**′N] (**Fr**′ ∈ R^(K×M×N)). This operation can be interpreted as selecting the most informative frequency components of a signal in a supervised manner. The recalibrated auditory features establish a correlation between features and categories.
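Equations (7)–(9) can be sketched with plain NumPy as follows. The bottleneck width M/16 in the two fully connected layers is an assumption borrowed from typical squeeze-and-excitation designs, not a value given in the paper, and the random weights stand in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, N, K = 128, 16000, 3, 32
T1 = np.abs(rng.standard_normal((M, L, N)))      # time-conv outputs t_m
Fr = np.abs(rng.standard_normal((N, K, M)))      # time-frequency conversion output

relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Eq. (7): global max pooling across the frame length L -> r_m ∈ R^N
R1 = T1.max(axis=1)                              # (M, N)
R2 = R1.T                                        # permute: (N, M), rows are γ_n

# Eq. (8): two fully connected layers, squeeze to M//16 units then excite back
W = rng.standard_normal((M // 16, M)) * 0.1      # ω: squeeze weights (hypothetical)
U = rng.standard_normal((M, M // 16)) * 0.1      # υ: excite weights (hypothetical)
Fc = sigmoid(relu(R2 @ W.T) @ U.T)               # (N, M), gates in (0, 1)

# Eq. (9): channel-wise recalibration of the auditory features
Fr_prime = Fc[:, None, :] * Fr                   # (N, K, M)
```

Each frequency channel of **fr**n is thus scaled by a learned gate between 0 and 1, emphasizing the most informative components.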

#### 3.2.2. Deep Architecture for Feature Learning

Auditory perception depends on the integration of many neurons along the multistage auditory pathway. These neurons likely form topographic frequency maps across most hearing regions [26]. The proposed frequency convolutional layers perform convolution along both the frequency and time axes to extract the spectro-temporal patterns embedded in a ship radiated noise signal.

However, a drawback of a deep network built from standard convolutional layers is the dramatically increased use of computational resources. Inception-ResNet [15] has been shown to achieve very good performance in image recognition at a relatively low computational cost. In this paper, a deep neural network with inception structures and residual connections is introduced in the frequency convolutional layers to perform the auditory task. The multiscale convolutional kernels in an inception block can be interpreted as simulating the different neural structures of the auditory pathways. Inception structures can learn spectro-temporal patterns of different scales with fewer parameters. Residual connections can be interpreted as simulating the convergence and divergence between different pathways. The architecture allows the number of layers and units to increase, forming the multiple stages of the auditory system. These layers also preserve locality and reduce spectral variations of the line spectrum in ship radiated noise.

We use a global average pooling layer [23] at the final stage to generate one feature map for each ship type. The resulting vector is fed directly into a softmax layer to predict targets. Compared with fully connected layers, the global average pooling layer is more native to the convolutional structure, as it enforces correspondences between feature maps and categories. Moreover, there are no parameters to optimize in the global average pooling layer, so overfitting is avoided at this layer.
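The final stage can be sketched as follows; in this hedged NumPy illustration, five random 8×8 feature maps stand in for the last frequency convolutional layer's output (one map per class, the 8×8 size being an arbitrary choice):

```python
import numpy as np

n_classes = 5                                # four ship types + background noise
feats = np.random.randn(n_classes, 8, 8)     # one final feature map per class

# Global average pooling: each map collapses to a single score, with no
# trainable parameters in this stage
scores = feats.mean(axis=(1, 2))

# Softmax over the pooled scores yields the class probabilities
p = np.exp(scores - scores.max())
p /= p.sum()
```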

#### **4. Experiment**

#### *4.1. Experimental Dataset*

Our experiments were performed on hydrophone acoustic data acquired by the Ocean Networks Canada observatory. The data were measured using an Ocean Sonics icListen AF hydrophone placed 144–147 m below sea level. Ship radiated noise data came from ships within a 2 km radius while no other ships were present within a 3 km radius. The duration of the recordings varies from about 5 to 20 min, depending on navigational speed and position. Each recording was sliced into several segments to make up the input of the neural network. Each sample was a 3 s segment and was divided into short frames of 1 s. Acoustic data were resampled to a sampling frequency of 16 kHz. Classification experiments were performed on ocean background noise and four ship types (Cargo, Passenger ship, Tanker, and Tug). The four ship type categories were designated by the World Shipping Encyclopedia from Lloyd's Registry of Ships. About 29 months of data were collected: the first 18 months for training and the remaining 11 months for testing. The dataset comprised about 140,000 training samples (771 recordings) and 82,000 validation samples (449 recordings).
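The slicing described above can be sketched as follows, where a synthetic 5-minute array of zeros stands in for a real hydrophone recording:

```python
import numpy as np

fs = 16000                                # resampled rate, Hz
recording = np.zeros(fs * 60 * 5)         # a 5-minute recording (synthetic stand-in)

seg_len, frame_len = 3 * fs, 1 * fs       # 3 s samples, split into 1 s frames
n_seg = len(recording) // seg_len
segments = recording[: n_seg * seg_len].reshape(n_seg, seg_len)
frames = segments.reshape(n_seg, seg_len // frame_len, frame_len)  # (segments, 3, 16000)
```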

#### *4.2. Classification Experiment*

To evaluate the classification performance of the proposed model, we report classification accuracy against the previously proposed auditory inspired CNN and manually designed features. These hand designed features included waveform features [1,2], Mel-frequency cepstral coefficients (MFCC) [3], wavelet features [5], auditory features [7] and spectra [11,27]. Waveform features included peak-to-peak amplitude features, zero-crossing wavelength features and zero-crossing wavelength difference features. MFCC features were extracted based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. Auditory features were extracted based on the Bark scale and the masking properties of the human auditory system. Wavelet features contained the low frequency envelope of the wavelet decomposition and the entropy of the zero-crossing wavelength distribution density of all levels of wavelet signals. For the calculation of spectra, a two-pass split window (TPSW) filter was applied after a short-time fast Fourier transform. Signals were windowed into frames of 256 ms before extracting features. The extracted features on frames were stacked or averaged and fed into a support vector machine (SVM), a back propagation neural network (BPNN) or a CNN to classify ship types. The kernel function of the SVM was the radial basis function (RBF). The penalty factor and kernel parameter of the RBF were selected by grid search. The SVM ensemble was built with the AdaBoost algorithm. The BPNN had one hidden layer with 30 hidden units. The structure of the CNN, from the bottom up, was a convolutional layer with 128 feature maps, a max pooling layer, a convolutional layer with 64 feature maps, a max pooling layer and a fully connected layer with 32 units. The kernel size was 5 × 5 and the pooling size was 2 × 2. The learning rate was 0.1 and the momentum was 0.9.
When training the proposed network, optimization was performed using RMSprop with a learning rate of 0.001, momentum of 0.9, and a minibatch size of 50. The results are shown in Table 1. Our experiments demonstrate that the proposed method remarkably outperforms manually designed features. Benefiting from the improvements in the network structure, accuracy is greatly improved compared with the auditory inspired CNN.
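For reference, a single RMSprop update has the following form. This is a plain NumPy sketch of the basic algorithm; the decay constant 0.9 used here is the usual RMSprop rho, which may or may not correspond to the "momentum 0.9" the authors report, so treat that mapping as an assumption:

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update: scale the gradient by a running RMS of past gradients."""
    cache = rho * cache + (1 - rho) * grad**2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Toy check: minimize f(w) = w^2, gradient 2w, starting from w = 1.0
w, cache = 1.0, 0.0
for _ in range(200):
    w, cache = rmsprop_step(w, 2 * w, cache)
```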

The confusion matrix of the proposed model on the test data is shown in Table 2, with the overall accuracy at the bottom right corner. Both the precision and recall of background noise are higher than those of all ship types, indicating that it is easier to detect ship presence than to classify ship types. The confusion between Cargo and Tanker is larger than between other categories. This may be because these two categories often have similar propulsion systems, gross tonnage and ship size.


**Table 1.** Comparison of the proposed model and other methods.



We also evaluated the performance of the proposed model on whole recordings by majority voting: a recording is classified into the category to which most of its samples are assigned. The confusion matrix is shown in Table 3. The obtained accuracy is 94.75%, at the bottom right corner. The recall and precision of all categories are improved compared with Table 2. Although individual samples may be misidentified, the whole recording can still be recognized correctly by majority voting.
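Majority voting over a recording's segment predictions is straightforward; a minimal sketch, with hypothetical labels:

```python
from collections import Counter

def classify_recording(segment_predictions):
    """Majority vote: the recording takes the most common segment label."""
    return Counter(segment_predictions).most_common(1)[0][0]

# Three of five segments say "Tug", so the recording is labeled "Tug"
label = classify_recording(["Tug", "Cargo", "Tug", "Tug", "Tanker"])  # -> "Tug"
```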


**Table 3.** Confusion matrix of recordings.

#### *4.3. Operative Conditions Analysis*

We analyzed the recognition results and operative conditions together in order to observe how accuracy varies with ship operative conditions. Speed over ground (SOG), course over ground (COG) and distance to the hydrophone were analyzed. Different ship types usually had different speeds and routes. Most ships were northbound, whereas only some passenger ships and tugs went in other directions. Because of these differences between ship types, it is necessary to analyse each ship type separately. From Figure 3, we can see that the recall rates of Cargo, Passenger ships and Tanker decrease as the distance between ship and hydrophone increases. No obvious trends emerge for the influence of SOG and COG. The results indicate that the proposed model is robust to ship operative conditions. The detection results for passenger ships in each operative condition were clearly better than for the other ship types. This may be because the samples of passenger ships are uniformly distributed across operative conditions, so the classification model can fit them better.

**Figure 3.** Operative condition and classification results analysis. Four rows from top to bottom represent Cargo, Tanker, Passenger ship and Tug, respectively. Histograms are the number of samples under different operative conditions. Yellow and green histograms represent the false negative samples and true positive samples, respectively. Orange lines are recall rates.

#### *4.4. Visualization*

#### 4.4.1. Learned Auditory Filter Visualization

To observe the learned auditory filters in the time convolutional layer, we selected one convolutional kernel (learned filter) from each of the 4 groups. The output feature maps corresponding to the 4 selected kernels were also extracted. As shown in Figure 4, training modified the shapes of these filters. The use of dilated convolution enables the 4 groups to have the same kernel width. The output feature maps are decomposed signals whose center frequencies are consistent with those of the original auditory filters. Dilated convolution thus reduces parameters while preserving the center frequency of the auditory filter.

**Figure 4.** Visualization of learned auditory filters and their outputs in the time convolutional layer. Blue lines are learned auditory filters and orange lines are the corresponding output feature maps. The 4 panels from top to bottom represent the dilation rate of 1, 2, 4, and 8, respectively. For each panel, vertical axis represents the amplitude of filter and decomposed signal and horizontal axis represents the time domain sampling points of the filter and decomposed signal.

#### 4.4.2. Learned Spectrogram Visualization

Outputs of the time frequency conversion layer were extracted as a learned spectrogram and compared with the gammatone spectrogram in Figure 5. The frame length and hop time for the gammatone spectrogram were the same as the kernel size and strides in the time convolutional layer, so the two spectrograms have the same dimensions. The learned spectrogram generated by the network is similar to the gammatone spectrogram. The network preserved the low frequency components of the signal and smoothed noise in the high frequency components.

**Figure 5.** Comparison of learned spectrogram and gammatone spectrogram. (**a**) Learned spectrogram. (**b**) Gammatone spectrogram.

The data visualization method t-distributed stochastic neighbor embedding (t-SNE) [28] was used to visualize the extracted features by giving each sample a location on a two-dimensional map. In Figure 6, the output of the whole network, the learned spectrogram from the time frequency conversion layer and the gammatone spectrogram are visualized. As shown in Figure 6a, the last layer constructs a map in which most classes are separated from the others. Figure 6b,c show the sample distributions of the learned spectrograms and gammatone spectrograms, respectively; the overlaps between classes are larger than in Figure 6a. The sample distribution in Figure 6b is slightly better than that in Figure 6c. The results indicate that the proposed model provides better insight into the class structure of ship radiated noise data, and that features extracted by deeper layers are more discriminative than those from shallow layers.

**Figure 6.** Feature visualization by t-distributed stochastic neighbor embedding (t-SNE). Five thousand samples selected randomly from test data were used to perform the experiments. Dots of different colors represent different types of ships. (**a**) Output of the whole network. (**b**) Learned spectrogram. (**c**) Gammatone spectrogram.

#### **5. Conclusions**

A deep convolutional neural network with auditory-like mechanisms is proposed to simulate the processing procedure of an auditory system for ship type classification. The integrated auditory mechanisms from early to higher auditory stages include auditory filters at basement membrane, neural pattern transformation by hair cells, spectro-temporal patterns along hierarchical structure, multiple auditory pathways and plasticity.

The classification experiments demonstrate that the proposed method outperforms manually designed features and classifiers. This study analyzes the recognition results in a way that is closer to the real-world scenario. The accuracy on recordings obtained by majority voting is much higher than the accuracy on segments. Increasing distance between ship and hydrophone has a negative effect on the recognition results in most cases. The proposed method is robust to ship operative conditions. The network generates a spectrogram that is similar to the gammatone spectrogram but smooths noise in the high frequency components. The auditory filter banks in the network adapt their shapes to ship radiated noise.

The proposed method facilitates the development of a smart hydrophone that could not only measure underwater acoustic signals but also send alerts when it detects a specific underwater acoustic event, making it easier for researchers to listen to the ocean.

**Author Contributions:** Conceptualization, S.S. and H.Y.; Data curation, S.S., X.Y., J.L. and G.X.; Investigation, H.Y.; Methodology, S.S.; Project administration, H.Y.; Writing–original draft, S.S.; Writing–review and editing, S.S., H.Y., J.L. and M.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by a grant from the Science and Technology on Blind Signals Processing Laboratory.

**Acknowledgments:** Data used in this work were provided by Ocean Networks Canada. The authors would like to gratefully acknowledge the support from Ocean Networks Canada.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### **Vehicular Traffic Congestion Classification by Visual Features and Deep Learning Approaches: A Comparison**

#### **Donato Impedovo \*, Fabrizio Balducci, Vincenzo Dentamaro and Giuseppe Pirlo**

Dipartimento di Informatica, Università degli Studi di Bari Aldo Moro, 70125 Bari, Italy; fabrizio.balducci@uniba.it (F.B.); vincenzo@gatech.edu (V.D.); giuseppe.pirlo@uniba.it (G.P.) **\*** Correspondence: donato.impedovo@uniba.it; Tel.: +39-080-544-2280

Received: 17 October 2019; Accepted: 26 November 2019; Published: 28 November 2019

**Abstract:** Automatic traffic flow classification is useful to reveal road congestion and accidents. Nowadays, roads and highways are equipped with a huge number of surveillance cameras, which can be used for real-time vehicle identification and thus for traffic flow estimation. This research provides a comparative analysis of state-of-the-art object detectors, visual features, and classification models useful for implementing traffic state estimation. More specifically, three different object detectors are compared for identifying vehicles. Four machine learning techniques are then employed to explore five visual features for classification. These classic machine learning approaches are compared with deep learning techniques. This research demonstrates that, when methods and resources are properly implemented and tested, results are very encouraging for both approaches, but the deep learning method performs most accurately, reaching an accuracy of 99.9% for binary traffic state classification and 98.6% for multiclass classification.

**Keywords:** vehicular traffic flow detection; vehicular traffic flow classification; vehicular traffic congestion; deep learning; video classification; benchmark

#### **1. Introduction**

Urban roads and highways are nowadays equipped with plenty of surveillance cameras, initially installed for various security reasons. Traffic videos coming from these cameras can be used to estimate the traffic state and to automatically identify congestion, accidents, and infractions, thus helping transport management to face critical aspects of mobility. At the same time, this information can also be used to plan mid- and long-term road mobility strategies. This is a clear example of a smart city application that also has a strong impact on citizens' security [1].

Studies dealing with traffic state estimation by videos adopt a common processing pipeline, which includes the following:


It is difficult to identify and select the best algorithms to be adopted because, often, systems reported in many studies are different at many stages, as well as adopt different datasets and testing conditions. This research provides a brief review of the most used techniques and reports an extended and systematic experimental comparison under common set-up conditions, thus highlighting strengths and weaknesses of each approach and supporting interested readers in the most profitable choices. More specifically:


The article is organized as follows: Section 2 describes related studies; Section 3 presents the candidate object detectors for vehicle identification, the visual features useful to characterize a traffic video frame, and the classifiers. The video datasets and evaluation metrics, along with the experimental results, are presented in Section 4. Section 5 draws conclusions and outlines future research.

#### **2. Related Studies**

Among the different methods that can be adopted to provide traffic flow (congestion) estimation, surveillance cameras play a crucial role. These systems can be installed without interfering with road infrastructure. Moreover, in many cases a wide range of retrofit solutions is available, or systems have already been installed for some other initial purpose. In any case, surveillance cameras can supply real-time information. The traffic estimate can be provided to users and police patrols to help in planning departures and avoiding congestion. Road panels or integrated vehicular monitors can also be used for this purpose [2].

One of the first steps within the pipeline is vehicle identification [3]. Vehicles can be identified using features such as colors, pixels, and edges along with machine learning algorithms. More specifically, detectors able to locate and classify objects in video frames by exploiting visual features must be considered. State-of-the-art features are SURF (Speeded Up Robust Features) and bag-of-features [4], Haar features [5], edge features [6], shape features [7], and Histograms of Oriented Gradients [8]. Approaches based on visual features and machine learning models greatly benefited in efficiency from the introduction of Convolutional Neural Networks (CNNs) [9].

Choudhury et al. [10] proposed a system based on a Dynamic Bayesian Network (DBN). It was tested on three videos after extracting visual features such as the number of moving vehicles [11]. Other examples of vehicle detectors are in [12] and [13], where features characterizing the traffic density are extracted from video footage. Li et al. [14] exploited the texture difference between congested and unobstructed images, while similar approaches were used to recognize tow-away road signs from videos [15].

Vehicular traffic behavior can also be revealed by observable motion [16]; in fact, motion can be used to determine the number of vehicles performing the same action (e.g., by exploiting trajectory clustering for scene description). Drones and radio-controlled model aircraft are exploited in the works of Liu et al. [17] and Gao et al. [18] to shoot live traffic flow videos and to study road conditions. Shen and He [19] analyzed vehicular behaviors and movement at traffic bottlenecks to verify the decision-making process. In the research of Avinash et al. [20], the Multiple Linear Regression (MLR) technique was adopted to understand the factors influencing pedestrian safety. Koganti et al. [21] adopted the Lane Distribution Factor (LDF) to describe the distribution of vehicular traffic across the roadway. Thomas et al. [22] presented a perceptual video summarization technique on a stack of videos for accident detection.

Real-time traffic videos have been used to observe the temporal state of vehicles on a pedestrian crossing lane by using several image processing algorithms, connected components, and a ray-casting technique [23]. Lin and Wang [24] implemented a Vehicular Digital Video Recorder System able to support an online real-time navigator and an offline data viewer: the system was able to secure vehicular parameter data and to depict the instant status of vehicles.

#### **3. Methods and Materials**

Two approaches are compared in this research: the first relies on visual features evaluated from traffic videos through computer vision algorithms using state-of-the-art object detectors and classifiers; the second considers deep learning models able to automatically extract the features needed for the final classification from the videos.

#### *3.1. Object Detectors*

Four different object detectors have been explored: Haar Cascade, You Only Look Once (YOLO), Single Shot MultiBox Detector (SSD), and Mask R-CNN.

The Haar Cascade object detector, originally developed by Viola and Jones [25], relies on a set of visual features that exploit rectangular regions at specific locations, measuring pixel intensity differences between the regions. The term "cascade" refers to the decision sequencing: the algorithm constructs a "strong" classifier as a combination of weighted weak classifiers using a boosting technique. A search window moves through the whole image to cover all its parts and, at the same time, it is resized to different scales to find objects of different sizes. The detector requires a set of positive and negative samples. Haar features are extracted in the test phase and then compared with those used in the training phase, in order to obtain a positive or negative identification. Haar Cascade has been successfully used for vehicle detection [26], and also to evaluate a traffic congestion index [27].
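As an illustrative sketch (not the paper's implementation), the rectangular Haar-like features above can be computed in constant time per rectangle through an integral image, which is the trick that makes the Viola-Jones detector fast; function names here are ours:

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, as used by Viola-Jones."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) using the integral image ii."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def haar_two_rect_vertical(img, r, c, h, w):
    """A two-rectangle Haar-like feature: left half minus right half."""
    ii = integral_image(img.astype(np.int64))
    half = w // 2
    left = rect_sum(ii, r, c, r + h, c + half)
    right = rect_sum(ii, r, c + half, r + h, c + w)
    return left - right
```

In the real detector, thousands of such features feed the boosted cascade of weak classifiers; this sketch only shows the feature evaluation itself.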

The YOLO (You Only Look Once) object detector consists of a CNN called Darknet [28] with an architecture made of 24 convolutional layers working as feature extractors and 2 dense layers for the prediction. Successively, YOLO v2 introduced anchors, a set of prior boxes used to predict the bounding boxes, while in YOLO v3 the prediction is performed at different scales. The algorithm looks at the image only once and splits it into an N×N grid, where each cell predicts a fixed number of bounding boxes, associating each object with the supposed class and providing a confidence score. Many of the bounding boxes have a very low confidence; therefore, they can be filtered by applying a threshold. YOLO is fast because it requires only a single pass over the image; at the same time, accuracy decreases when two different objects are very close to each other. YOLO has been successfully used for vehicle detection in [29], where the focal loss is exploited and validated on the BIT-Vehicle dataset, and in [30], where a mean Average Precision (mAP) of 94.78% is reported on the same dataset.
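The confidence-threshold filtering mentioned above is usually followed by non-maximum suppression to discard duplicate boxes on the same object. A minimal sketch, assuming boxes in (x1, y1, x2, y2) form and default thresholds chosen by us:

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_detections(boxes, scores, conf_thresh=0.5, nms_thresh=0.45):
    """Drop low-confidence boxes, then apply greedy non-maximum suppression."""
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < nms_thresh for j in keep):
            keep.append(i)
    return keep
```

Greedy suppression of overlapping boxes is also why YOLO struggles with two distinct objects that are very close: their boxes overlap strongly and one of them may be suppressed.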

A Convolutional Neural Network is also at the basis of the *Single Shot MultiBox Detector* (SSD), which produces a set of fixed-size bounding boxes; for each box, a confidence score is provided representing the probability that the object in the box belongs to a specific class. The CNN is constituted by several layers that progressively decrease in size so that objects can be predicted at several scales. SSD has been successfully used for vehicle detection in [12], while Sreekumar et al. [31] developed a multi-algorithm method for real-time traffic pattern generation and Asvadi et al. [32] fed a Dense Reflection Map (DRM) to a deep convolutional neural network for vehicle detection.

Mask R-CNN is the evolution of R-CNNs [33]. The original R-CNN detector used selective search to extract a huge number of candidate (proposal) regions (about 2000) from the input image. The algorithm first generates many candidate regions by performing initial sub-segmentation; then, a greedy approach combines similar (adjacent) regions into larger ones. The selective search and the greedy approach result in very slow computation [34]. Fast R-CNN, developed at Microsoft Research, addresses this speed problem: one single model is used to extract features from each region, predict the class, and compute the box coordinates, and the fully connected layers are compressed through a low-rank approximation obtained with Singular Value Decomposition (SVD). Fast R-CNN is 10× faster than R-CNN. Faster R-CNN [34] improves Fast R-CNN by using an additional network, called the Region Proposal Network, in place of the selective search (originally derived from R-CNN) used for generating the regions of interest. This increases the prediction speed by about 10× [35]. Mask R-CNN is built on top of Faster R-CNN and, in addition, provides object segmentation. The mask of the segmented object can be used to infer the actual shape of the classified object.
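The truncated-SVD compression used by Fast R-CNN can be sketched in a few lines of numpy; this is an illustration of the low-rank approximation itself, not of the detector:

```python
import numpy as np

def low_rank(W, k):
    """Rank-k approximation of a weight matrix W via truncated SVD;
    Fast R-CNN applies this to its fully connected layers to cut cost."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```

A rank-k factorization replaces one m×n matrix multiply with an m×k plus a k×n multiply, which is cheaper whenever k is well below min(m, n); the reconstruction error shrinks as k grows.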

It is important to note that, while SSD used pre-trained weights on the Pascal Visual Object Classes (VOC) dataset [36], YOLO v3 and Mask R-CNN used pre-trained weights on the Coco dataset [37].

#### *3.2. Visual Features*

The output produced by a specific detector can be exploited to evaluate the traffic state and its density. In this research, five visual descriptors were considered: Total Vehicles, Traffic Velocity, Traffic Flow, Road Occupancy, and the Texture Feature.

The *Total Vehicles* feature is the number of bounding boxes provided by the vehicle detector and evaluated for each video frame [10,13,20,21,24,38–43].

The *Traffic Velocity* (average speed of all vehicles in the frame) is evaluated by tracking the centers of the vehicular bounding boxes and calculating the Euclidean distance to the corresponding position in the next frame [31]. The distance traveled by each vehicle is normalized according to the fps (frames per second) rate, thus providing the individual speed; the global average speed is the sum of all vehicle velocities divided by their number. The latter can be considered an estimate of the global traffic velocity [12,13,42,44,45].
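A minimal sketch of this feature, assuming nearest-centroid matching between consecutive frames and a hypothetical `px_per_meter` calibration factor (neither is specified in the paper):

```python
import math

def track_speeds(centers_prev, centers_curr, fps, px_per_meter=1.0):
    """Per-vehicle speed estimate: match each current bounding-box center
    to its nearest center in the previous frame, then scale the Euclidean
    displacement by the frame rate. Returns speeds and their average."""
    speeds = []
    for cx, cy in centers_curr:
        px, py = min(centers_prev,
                     key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
        dist_px = math.hypot(cx - px, cy - py)
        speeds.append(dist_px / px_per_meter * fps)  # meters per second
    return speeds, (sum(speeds) / len(speeds) if speeds else 0.0)
```

Nearest-centroid matching is the simplest possible tracker and fails when vehicles cross paths; it only serves to make the fps normalization concrete.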

The third visual feature is the *Traffic Flow*; it is calculated as the difference between the incoming and the outgoing traffic, respectively evaluated as the number of vehicles entering the camera field of view and the number leaving it [15,39,40,42,45].

Background suppression and morphological transformations (i.e., opening and closing) can be used to isolate vehicle shapes: this process returns the *Road Occupancy*, the relationship between the flat road surface (white pixels in Figure 1) and vehicles (black pixels in Figure 1). To obtain this result, frames are converted to grayscale. Successively, pixel values are subtracted from those of the next frame, highlighting non-modified areas (background) and modified ones (foreground): pixels belonging to vehicle shapes are those related to changes in the scene. A threshold operator (i.e., regional minimum) is adopted to distinguish flat areas from areas occupied by vehicles; the occupancy results from the ratio between black and white pixels, which changes dynamically according to the number of vehicles [9,12,16,41,42,44,45].
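The core of this feature is simple frame differencing followed by thresholding; a minimal numpy sketch (the threshold value is our assumption, and the morphological opening/closing steps are omitted):

```python
import numpy as np

def road_occupancy(prev_frame, frame, diff_thresh=25):
    """Fraction of pixels that changed between two grayscale frames
    (uint8 arrays); changed (foreground) pixels are taken as vehicle
    area, so the returned ratio approximates road occupancy."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    foreground = diff > diff_thresh
    return foreground.mean()  # occupied pixels / total pixels
```

In practice the difference mask would first be cleaned with opening and closing, as the text describes, so that noise specks and holes inside vehicle blobs do not distort the ratio.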

**Figure 1.** The morphological operator applied to the frame of the vehicular traffic video.

The *Texture Feature* was calculated according to the Gray Level Co-occurrence Matrix (GLCM) method [14,45–47]. This parameter is typically used to estimate vehicle density by exploiting the corresponding 'energy' and 'entropy' values. More specifically, energy reveals whether the texture is coarse (high value) or fine, while entropy expresses how uniformly the elements are distributed (a high value indicating a uniform distribution) [48]. The value of energy is inversely proportional to vehicle density, while the value of entropy is proportional to it. In other words, when there are many vehicles in a frame, the gray histogram of the image tends to be distributed uniformly and the texture tends to be fine; therefore, the energy value should be small and the entropy value should be large.
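A compact sketch of the GLCM with its energy and entropy, considering only horizontally adjacent pixel pairs (the offset and quantization level are our choices, not the paper's):

```python
import numpy as np

def glcm(img, levels=8):
    """Gray Level Co-occurrence Matrix for horizontally adjacent pixels,
    normalized into a probability distribution. img holds integer gray
    levels in [0, levels)."""
    m = np.zeros((levels, levels))
    for a, b in zip(img[:, :-1].ravel(), img[:, 1:].ravel()):
        m[a, b] += 1
    return m / m.sum()

def energy_entropy(p):
    """Energy (sum of squared probabilities) and entropy of the GLCM."""
    nz = p[p > 0]
    return (p ** 2).sum(), -(nz * np.log2(nz)).sum()
```

A perfectly flat frame (empty road) concentrates all co-occurrence mass in one cell, giving maximal energy and zero entropy, matching the inverse relationship with vehicle density described above.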

#### *3.3. Machine Learning Classifiers*

In this research, four classifiers were considered and compared: k-Nearest Neighbors, Support Vector Machine, Random Forest, and Extremely Randomized Trees.

The k-Nearest Neighbors (KNN) algorithm is used for classification and regression tasks, exploiting sample vectors in a multi-dimensional feature space. K is a user-defined parameter specifying the number of nearest training samples considered: the unknown input vector is assigned the class most common among its K nearest neighbors. Many distance measures can be considered, for example, the Euclidean or the Manhattan distance. KNN has been applied to this specific field in [38] and [49].
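The KNN decision rule fits in a few lines; a minimal numpy sketch with Euclidean distance (class labels here are illustrative only):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples
    under the Euclidean distance."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```

Swapping `np.linalg.norm(..., axis=1)` for `np.abs(X_train - x).sum(axis=1)` would give the Manhattan variant mentioned in the text.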

The Support Vector Machine (SVM) maps the feature vectors of two different classes into a hyperspace and searches for the best separating hyperplane, taking into account only a reduced subset of the initial examples (called support vectors), which are those most difficult to classify. According to the data distribution, different kernel functions can be considered. In this research, the linear kernel and the *Gaussian Radial Basis Function* (RBF) kernel were considered, as in [50], where traffic congestion is classified through a comparison between AlexNet+SVM and VGGNet+SVM.

The Random Forest (RF) classifier relies on a bagging method to build several "base learners", usually decision trees, which are successively combined to provide the final classification. RF repeatedly selects a bootstrap sample from the training set, selects a random subset of features, and then fits a decision tree to this sample. Due to this randomness, the bias of the forest increases slightly, but, due to averaging, its variance decreases. In Extremely Randomized Trees (ET), randomness is taken a step further by also completely randomizing the cut-point choice when splitting each node of a tree. This reduces the variance a little more at the expense of a slight increase in bias. In this research, 100 decision trees were used as base learners for both RF and ET.
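The two sources of randomness and the final vote can be sketched directly (tree fitting itself is omitted; this only makes the sampling scheme concrete):

```python
import numpy as np
from collections import Counter

def bootstrap_and_features(n_samples, n_features, max_features, rng):
    """The two randomness sources of a Random Forest: a bootstrap sample
    of the training rows (drawn with replacement) and a random subset of
    the features used when fitting each tree."""
    rows = rng.integers(0, n_samples, size=n_samples)        # with replacement
    feats = rng.choice(n_features, size=max_features, replace=False)
    return rows, feats

def ensemble_vote(per_tree_predictions):
    """Majority vote over the predictions of the individual trees."""
    return Counter(per_tree_predictions).most_common(1)[0][0]
```

Because each bootstrap sample repeats some rows and omits others, the individual trees are decorrelated, which is what lets averaging reduce the variance.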

#### *3.4. Deep Learning Classification Models*

He et al. [51] proposed a residual deep learning framework for image classification, where layers are reformulated to learn residual functions with respect to the layer input. The proposed ResNet has 34 layers following the same pattern, performing 3 × 3 convolutions with a fixed feature-map dimension, with the input bypassed every two convolutions. Moreover, the width and height dimensions remain constant throughout each stage, reducing the per-layer complexity of the network. The output is a binary classification (congested/not congested). The ResNet has also been re-trained in [52] on the Shaanxi Province dataset.

Kurniawan et al. [53] used two convolutional layers, a max pooling layer, and a fully connected layer: the first two layers perform convolutions with 3 × 3 filters and 32 feature maps, the third is a 2 × 2 max pooling layer used for down-sampling, and the last is a fully connected layer with 128 neurons. The Rectified Linear Unit (ReLU) activation function is exploited in both the convolutional and fully connected layers, while a sigmoid activation function is used for the output layer [45,50].
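The feature-map arithmetic of this small network can be traced by hand; the sketch below assumes a hypothetical 100×100 grayscale input and 'valid' padding (neither is stated above):

```python
def conv_out(n, k=3, stride=1, pad=0):
    """Spatial output size of a convolution ('valid' padding by default)."""
    return (n + 2 * pad - k) // stride + 1

# Feature-map sizes through the network of Kurniawan et al. [53]:
n = conv_out(100)      # first 3x3 conv, 32 maps  -> 98
n = conv_out(n)        # second 3x3 conv, 32 maps -> 96
n = n // 2             # 2x2 max pooling          -> 48
flat = n * n * 32      # units flattened into the 128-neuron dense layer
```

The same arithmetic determines the size of the dense layer's weight matrix, which dominates the parameter count of such shallow CNNs.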

#### **4. Experiments and Discussion**

#### *4.1. Video Datasets*

Different datasets, used for different aims, were adopted in this research. The GRAM Road Traffic Monitoring (RTM) is generally used for vehicle detection and it was adopted here to evaluate and compare performance of object (vehicles) detectors [54]. Trafficdb contains annotations related to the state of the traffic and it is here used to compare classification techniques [55,56].

#### 4.1.1. GRAM RTM

The Road-Traffic Monitoring dataset [54] is specifically used for vehicular detection in a traffic environment. It consists of three video sequences whose individual frames were labeled with bounding boxes around vehicles. The "M-30" video includes 7520 frames recorded on a sunny day with a Nikon Coolpix L20 camera at a resolution of 640 × 480 pixels at 30 fps. The second video, "M-30-HD", includes 9390 frames recorded in the same place as the previous video but on a cloudy day and at a higher resolution (1280 × 720 pixels at 30 fps, using a Nikon DX3100 camera). The last video, "Urban1", contains 23,435 frames in low resolution (480 × 320 pixels at 25 fps). This dataset offers the possibility to evaluate vehicle detectors under different working conditions.

For each video of the dataset, the ground truth consists of bounding boxes around all vehicles in each frame. Information about the acquisition properties is provided, together with pixel masks useful to extract regions of interest (ROIs) and decrease the computational load of the subsequent processing phases (Figure 2).

**Figure 2.** GRAM dataset: (**left**) a pixel mask that highlights the region of interest for the vehicles present, which are annotated (and subsequently trackable) in each video frame (**right**).

#### 4.1.2. Trafficdb

The Trafficdb dataset is a state-of-the-art dataset used for vehicular traffic state classification, since it is provided with specific annotations [55,56]. It consists of 254 videos acquired between 8 April and 8 June 2004 on Seattle (USA) highway segments. The ground truth includes three classes (Figure 3): 'Heavy'—very congested traffic, 'Medium'—low vehicular flow, and 'Light'—normal travel velocity.

**Figure 3.** Three images from Trafficdb that depict a traffic state classified as Light, Medium, and Heavy.

#### *4.2. Evaluation Metrics*

Vehicle bounding boxes provided by the object detector can be compared to the real vehicle annotations, highlighting the best-performing detector. It is quite clear that the result of this phase has a strong impact on all subsequent stages of the processing pipeline.

The metric *Correct detections* refers to correctly detected vehicles in each frame and is measured as the pixel intersection between the original ground-truth bounding box area (named G) and the predicted one (named P). The *Jaccard Index*, or *Intersection over Union* (1), was considered.

$$f(G, P) = \frac{|G \cap P|}{|G \cup P|}. \tag{1}$$
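Equation (1) can be made concrete for axis-aligned boxes; a short sketch, with box coordinates in (x1, y1, x2, y2) form (a representation we assume for illustration):

```python
def jaccard(G, P):
    """Intersection over Union of ground-truth box G and predicted box P,
    both given as (x1, y1, x2, y2), as in Equation (1)."""
    ix1, iy1 = max(G[0], P[0]), max(G[1], P[1])
    ix2, iy2 = min(G[2], P[2]), min(G[3], P[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((G[2] - G[0]) * (G[3] - G[1])
             + (P[2] - P[0]) * (P[3] - P[1]) - inter)
    return inter / union
```

A detection is then typically counted as correct when this value exceeds some threshold (0.5 is a common choice, though the paper does not state one).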

The second performance metric here considered is the *Computing time* evaluated by adding the processing time for each video frame at each iteration. It is useful to select the most suitable detector for a specific problem (real-time, on site, off-line, etc.).

The accuracy was considered and evaluated as follows:

$$\text{accuracy} = (\text{TP} + \text{TN}) / (\text{TP} + \text{TN} + \text{FP} + \text{FN}), \tag{2}$$

where:

• TP: the true positive samples, object of class X and classified by the system as X;

• TN: the true negative samples, object not of class X and not classified by the system as X;

• FP: the false positive samples, object not of class X but classified by the system as X;

• FN: the false negative samples, object of class X but not classified by the system as X.
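Equation (2) expressed in code, using the four counts defined above:

```python
def accuracy(tp, tn, fp, fn):
    """Equation (2): fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, 90 true positives, 5 true negatives, 3 false positives, and 2 false negatives give an accuracy of 0.95.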

Experiments were performed on a system running Ubuntu 18.04, with an AMD Ryzen Threadripper 1920X (12 cores), an NVIDIA Titan RTX with 24 GB of memory, and 64 GB of DDR4 RAM.

#### *4.3. Vehicle Detectors Evaluation*

Table 1 reports results of the object detectors on the Road-Traffic Monitoring GRAM dataset.


**Table 1.** Performance comparison according to the processing time and vehicle detection accuracy of the four object detectors on the three videos in the GRAM dataset.

The experimental phase pointed out that the Haar Cascade detector is the fastest one on each dataset. However, it provides good accuracy only on M-30-HD (75%).

The lowest processing time is achieved on the 'Urban1' video due to the low frame resolution, which also impacts the number of correctly identified vehicles: accuracy drops to 40%, partly due to incorrect multiple detections.

The Single Shot MultiBox Detector (SSD) is the slowest object detector. The experiment on the "M-30" video reported an accuracy of 22%, the lowest among the four solutions. On the "M-30-HD" video, the processing time is much higher (11 to 14 s, with frequent peaks between 14 and 17 s), while the reported accuracy reaches 70%.

The YOLO detector offers the best compromise, with acceptable execution times and very good performance on all datasets.

Mask R-CNN exhibits very inconsistent results in terms of accuracy. In particular, the poor performance on the Urban1 dataset is probably due to the poor image quality: JPEG compression smeared the colors, misleading the Region Proposal Network.

Considering processing time alone, the best-performing detector is Haar Cascade, while YOLO represents the best compromise between time resources and detection accuracy.

#### *4.4. Traffic State Classification: Visual Features and Machine Learning Classification*

The results obtained in the previous section clearly indicate YOLO and Haar Cascade as the best-performing vehicle detectors in terms of accuracy and processing time. For this reason, they were chosen to support the visual feature extraction used to build the input vector for the traffic state classifiers.

The first experimental session involves the five visual features calculated using the two selected object detectors combined with the different machine learning classifiers described in Section 3.3. A 10-fold cross-validation setup was adopted to minimize the effect of variance when choosing the training and testing examples. Classification results on the Trafficdb dataset are reported in Table 2. Visual features were extracted with a sampling step of 30 frames.


**Table 2.** Traffic state classification accuracy on the Trafficdb dataset.

Table 2 shows that the YOLO detector combined with the Random Forest is the best-performing solution, with an accuracy of 84%. The confusion matrix provided by this solution is reported in Table 3. Due to the class imbalance in Trafficdb (the 'Heavy' class has about four times as many instances as each of the other two), the classification results are provided in normalized form.

**Table 3.** Normalized confusion matrix of the traffic state classification reached by the Random Forest classifier on the Trafficdb video dataset.


#### *4.5. Traffic State Classification: Deep Learning*

The deep learning models described in Section 3.4 were implemented and re-trained on the Trafficdb video dataset in a 10-fold cross-validation setup. The ResNet [51] and the deep network architecture proposed in [52] were originally tested by their respective authors on a two-class (Heavy vs. Light) traffic state classification. To perform similar tests, in a first stage, the samples labeled 'Medium' were removed from Trafficdb: results are shown in Table 4.

**Table 4.** Normalized confusion matrix about the binary traffic state classification performed by the deep neural network of Kurniawan et al. [53] on the Trafficdb video dataset.


Finally, to compare results with the cases of the previous section, the two deep learning architectures were extended to perform traffic state classification on three classes. In this case, the best performance was reached by the ResNet [51] with an accuracy of 98.61% (results are in Table 5).

**Table 5.** Normalized confusion matrix about the multiclass traffic state classification performed by the deep neural network of Kurniawan et al. [53] on the Trafficdb video dataset.


#### **5. Conclusions**

A pipeline to develop state-of-the-art traffic state classification systems from videos has been presented in this research. The pipeline is made up of three main steps: vehicle detection, feature extraction, and classification. Several state-of-the-art approaches have been considered and compared. A preliminary comparison between object detectors, performed on the GRAM Road Traffic Monitoring video dataset, showed that YOLO v3 can be used as a real-time vehicle detector, exhibiting a detection accuracy of over 80%.

For the traffic state classification, two different approaches were studied and tested on the Trafficdb video dataset. The first relies on visual features calculated through computer vision techniques and machine learning classifiers, while the second exploits deep learning, which embeds the feature extraction while training the model on the annotated dataset. In the classic approach (visual features and machine learning classifiers), the Random Forest reached 84% accuracy, while the deep learning approach reached an accuracy of over 98% with the same experimental setup, a noticeable improvement of about 14 percentage points.

The problem considered here is obviously complex, and the provided results call for further work, for example, refinements of the object detection algorithms and validation on more traffic datasets. The last aspect is non-trivial because different road settings (country roads, city roads, road junctions and crossroads, double-lane roads, etc.) and weather conditions (rainy nights, fog, snow, gusts of wind, and so on) could have a significant impact on such systems.

**Author Contributions:** Conceptualization, D.I. and G.P.; methodology, D.I.; software, F.B. and V.D.; validation, D.I. and G.P.; investigation, D.I. and G.P.; resources, D.I.; data curation, F.B. and V.D.; writing—original draft preparation, F.B. and D.I.; writing—review and editing, F.B., D.I., and V.D.; supervision, D.I.; project administration, D.I.; funding acquisition, D.I.

**Funding:** This research is within the "Metrominuto Advanced Social Games" project funded by POR Puglia FESR-FSE 2014–2020—Fondo Europeo Sviluppo Regionale—Asse I—Azione 1.4—Sub-Azione 1.4.b—Avviso pubblico "Innolabs".

**Conflicts of Interest:** The authors declare no conflict of interest.

**Code:** https://gitlab.com/islabuniba/traffic\_benchmark.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Real-Time Vehicle-Detection Method in Bird-View Unmanned-Aerial-Vehicle Imagery**

#### **Seongkyun Han 1, Jisang Yoo <sup>1</sup> and Soonchul Kwon 2,\***


Received: 31 July 2019; Accepted: 10 September 2019; Published: 13 September 2019

**Abstract:** Vehicle detection is an important research area that provides background information for a diversity of unmanned-aerial-vehicle (UAV) applications. In this paper, we propose a vehicle-detection method using a convolutional-neural-network (CNN)-based object detector. We design our method, DRFBNet300, with a Deeper Receptive Field Block (DRFB) module that enhances the expressiveness of feature maps to detect small objects in UAV imagery. We also propose the UAV-cars dataset, which captures the composition and angular distortion of vehicles in UAV imagery, to train our DRFBNet300. Lastly, we propose a Split Image Processing (SIP) method to improve the accuracy of the detection model. Our DRFBNet300 achieves 21 mAP at 45 FPS on the MS COCO metric, the highest score among lightweight single-stage methods running in real time. In addition, DRFBNet300, trained on the UAV-cars dataset, obtains the highest AP score at altitudes of 20–50 m. The accuracy gain from applying the SIP method grows larger as the altitude increases. The DRFBNet300 trained on the UAV-cars dataset with the SIP method operates at 33 FPS, enabling real-time vehicle detection.

**Keywords:** vehicle detection; object detection; UAV imagery; convolutional neural network

#### **1. Introduction**

In recent years, studies have been conducted to apply the large amount of information obtained from Unmanned Aerial Vehicles (UAVs) to various systems. Representative UAV applications exist in social-safety, surveillance, military, and traffic systems, and the field is increasingly expanding [1–7]. In these applications, vehicle detection, i.e., detecting the position and size of vehicles in UAV imagery, is very important background information. Zhu et al. [5] and Ke et al. [6] proposed vehicle-flow- and density-calculating systems in UAV imagery using a vehicle-detection model. Yang et al. [7] proposed an Intelligent Transportation System (ITS).

Traditional methods are less accurate because of poor generalization performance, and they can only detect vehicles on asphalt roads in top-view images, where relatively standardized vehicle shapes are shown [8–10]. However, AlexNet [11] won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [12] and showed the excellent generalization performance of Convolutional Neural Networks (CNNs). As a result, in 2015, a CNN was able to classify more accurately than humans [12]. Various CNN-based object-detection models have been proposed, such as the Single Shot MultiBox Detector (SSD) series [13–15] and the Region proposals with CNNs (R-CNN) series [16–18]. A diversity of vehicle-detection methods has been proposed using various CNN-based object detectors, but existing UAV imagery vehicle-detection methods fail to find small objects or operate only at low altitudes, as in [19], which uses YOLOv2 [20]. There is also a real-time problem when using complex models to improve accuracy [4,5,21,22], such as Faster R-CNN [18], deep and complex SSDs [13], and the YOLO family [15,20].

The MS COCO [23] and PASCAL VOC [24] datasets used to train general object-detection models consist of front-view images. In addition, each dataset contains labels that are not needed for UAV imagery vehicle detection, such as suitcase, fork, wine glass, bottle, potted plant, and sofa. Vehicles in UAV imagery captured at high altitude have different composition and distortion peculiarities than in general front-view images. For these reasons, vehicle detection using a model trained on a general object-detection dataset is not accurate in UAV imagery.

Therefore, in this paper, we propose a real-time vehicle-detection method in bird-view UAV imagery using a lightweight single-stage CNN-based object detector. First, we designed a DRFB module and DRFBNet300, which is a light and fast detection model that uses the MobileNet v1 backbone [25]. Our DRFB module has multi-Receptive Field-size branches to improve the expressive power of feature maps, using dilated convolution [26] to minimize increments of computational complexity. We propose the UAV-cars dataset, which includes distortion peculiarities of UAV imagery to train object-detection models. We also propose a Split Image Processing (SIP) method to improve the accuracy of the detection model. Our SIP method improves accuracy by using divided input frames, different from existing CNN-based object-detection methods. Thus, using the DRFBNet300 trained on the UAV-cars dataset and by using the SIP method, we propose a real-time bird-view UAV imagery vehicle-detection method.
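The simplest reading of the SIP idea, splitting each frame into tiles so that small vehicles occupy a larger fraction of the detector's input, can be sketched as follows (the 2×2 grid is our assumption; the paper's exact scheme may differ):

```python
import numpy as np

def split_frame(frame, rows=2, cols=2):
    """Split a frame into rows x cols equal tiles; each tile is then fed
    to the detector separately, so small vehicles cover more of the
    (fixed-size) network input."""
    h, w = frame.shape[:2]
    th, tw = h // rows, w // cols
    return [frame[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            for r in range(rows) for c in range(cols)]
```

Detections from each tile must afterwards be offset back into full-frame coordinates and merged, typically with non-maximum suppression along the tile seams.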

In Section 3.1, we describe DRFBNet300 using the DRFB module, which is optimized for small-object detection. Section 3.2 outlines the SIP method, which improved the accuracy of the object-detection model. In Section 3.3, we describe the UAV-cars dataset that contains the distortion peculiarities of vehicles in UAV imagery. Section 3.4 describes the overall flow of our vehicle-detection method. In Section 4, we lay out the environment of the experiments. Section 5.1 outlines a performance comparison between each model using the MS COCO dataset. In Section 5.2, we compare the performance of the models trained on our UAV-cars dataset with and without the SIP method.

All models used MobileNet v1 [25] with a *Width Multiplier* (*α*) and *Resolution Multiplier* (*ρ*) of one as the backbone network for fast detection. In the MS COCO experiment, the single-stage object detectors SSD300 [13], RFBNet300 [14], YOLOv3 320 [27], FSSD300 [28], and our DRFBNet300 were used for the performance comparison. In the UAV-cars dataset experiment, the lightweight single-stage methods SSD300, RFBNet300, and our DRFBNet300 were compared. In each model name, the number indicates the input size.

#### **2. Related Work**

Vehicle-detection algorithms require high-end computing hardware that consumes considerable power, which is difficult to mount directly on small battery-powered UAVs. Therefore, most methods run the detection algorithm on a server PC using precaptured or transmitted UAV images [1–10,19,21,22,29–34].

Vehicle detection has been widely studied as a background research area for various applications, such as surveillance and traffic systems [1–7]. Various studies using traditional handcrafted methods have been carried out. Zhao et al. [8] detected vehicles using their edge-contour information on the road in top-view low-resolution images. Breckon et al. [29] detected vehicles using a cascaded Haar classifier [35] in bird-view UAV imagery. Shao et al. [31] found vehicles using algorithms such as the Histogram of Oriented Gradients (HOG) [36], Local Binary Patterns (LBP) [37], and exhaustive search in top-view high-resolution images. Yang et al. [9] tracked vehicles with the Scale Invariant Feature Transform (SIFT) [38] and the Kanade–Lucas–Tomasi (KLT) feature tracker [39] after finding a vehicle using blob information in top-view images. Xu et al. [10] improved the original Viola–Jones object-detection method [35] to enhance the accuracy of UAV imagery vehicle detection at low altitudes. However, these handcrafted methods are not robust and are only accurate in certain environments, such as on one-way roads. They are also optimized only for top-view images, where vehicles appear as a formulaic square shape, or for low-altitude UAV imagery.

To overcome the low generalization performance of traditional handcrafted methods, various UAV imagery vehicle-detection methods using CNN-based object detectors have been proposed. Yang et al. [6] proposed Enhanced-SSD, which modifies the SSD structure [13], to detect vehicles in ultrahigh-resolution UAV imagery. Tang et al. [33] proposed a UAV imagery vehicle-detection model using the original structure of YOLOv2 [20]. Radovic et al. [19] used the structure of YOLO [15] to detect vehicles in UAV imagery. Xu et al. [22] proposed a deeper YOLO model, DOLO, using the structure of YOLOv2. Xie et al. [34] proposed a UAV imagery vehicle-detection model by modifying the structure of RefineDet [40]. Fan et al. [21] proposed a vehicle-detection model using Faster-RCNN [18], a two-stage method. However, most of these methods use simple top-view images; they were designed to operate at low altitudes [19] or with ultrahigh-resolution images [6], or rely on heavyweight models with high computational complexity to achieve high accuracy [6,21,22,33,34].

#### **3. Proposed Method**

Figure 1 shows the schematic concept of imagery capture for the proposed bird-view UAV vehicle-detection method. The camera angle was 30 degrees from the ground, and video was taken at various altitudes, 10–50 m above ground, while maintaining this angle. The vehicle-detection model infers the location and size of vehicles on the server PC from prerecorded bird-view UAV imagery. Imaging was done with the built-in UAV camera; its detailed specifications are covered in Section 4.1, and those of the server PC used in the experiment in Section 4.2.

**Figure 1.** Unmanned-aerial-vehicle (UAV) imagery-capturing schematic concept.

Figure 2 shows the overall flowchart of the proposed UAV imagery vehicle-detection method. First, the input image was separated into left and right halves by the Image Separation part of the SIP method. The two separated images were fed, respectively, to DRFBNet300 trained on the UAV-cars dataset. Next, using the coordinates indicating the position and size of objects found in the overlapped area of each separated image, the overlapping results were combined in the Box Merging part. Finally, result boxes were drawn on the input image using the generated coordinate values. In Section 3.1, we explain the DRFB module and DRFBNet300, used for vehicle detection. Section 3.2 describes the SIP method, which separates the input image of DRFBNet300 and combines or removes duplicated coordinates from the inference result. Section 3.3 describes the UAV-cars dataset, which is used to train and validate the proposed vehicle model. In Section 3.4, we discuss the overall framework.

**Figure 2.** Overall flowchart of our UAV imagery vehicle-detection method.

#### *3.1. DRFBNet300*

Speed and precision, the main performance indices of an object detector, are directly related to the structure of the backbone network and the meta-architecture of the detector. Therefore, there are performance differences according to the meta-architecture even for the same backbone network [41]. To improve the accuracy of the object-detection model, several studies use a heavyweight backbone network such as ResNet [42] or VGG [43], or a meta-architecture like the RCNN series [16–18]. Such heavyweight structures are computationally complex and expensive, preventing real-time operation of the object-detection model. This problem can be solved using a lightweight backbone like MobileNet v1 [25] or a single-stage meta-architecture such as SSD [13]. However, a lightweight structure results in low accuracy because of the lack of network capacity.

In this paper, we designed a DRFB module to improve the accuracy of MobileNet v1 backbone SSD300 [13], which is a light and fast detector. The values of *Width Multiplier* (*α*) and *Resolution Multiplier* (*ρ*) of MobileNet v1 were 1. The proposed DRFB module was designed based on the human population Receptive Field (pRF) [44], Inception series [45–47], and RFBNet [14]. The DRFB module improved the quality of feature maps with weak expressive power. DRFBNet300 is an object-detection model using our DRFB module and RFB basic module [14] on the MobileNet v1 backbone SSD300 structure.

DRFB module. Using a multisized Receptive Field (RF) branch structure rather than a fixed-sized RF layer in CNN increases scale invariance and produces better-quality feature maps [48,49]. In addition, if the Inception family-based module [45–47] that concatenates the feature maps generated by the multisized RF convolution is applied to the CNN, the expressiveness of the feature maps and the accuracy of the model are improved, with faster training speed [46,47]. These Inception module-based approaches have been verified in classification, semantic-segmentation, and object-detection tasks [14,47–49].

The proposed DRFB (Deeper Receptive Field Block) module was connected to feature maps for detecting small objects and consists of branches with variously sized RFs. The left-hand side of Figure 3 shows the structure of our DRFB module. Each branch uses dilated convolution [26] to generate good-quality feature maps using large RF. The module has a shortcut branch of ResNet [42] and Inception-ResNet V2 [46], and follows the multibranch structure of inception [45,46]. This makes it possible to enhance the expressiveness of the feature maps and speed up model training while minimizing parameter increase.

Our DRFB module used 1 × 1 convolutions to increase nonlinearity and depth, which minimizes the increase in computation while improving the capacity of the structure [50]. Instead of 3 × 3 convolutions, 1 × 3 and 3 × 1 convolutions were used to reduce computational complexity while adding nonlinearity. The 5 × 5-dilated convolution branch was deeper than the other branches. The SSD series of object-detection models deduces the position, size, and label of multiple objects in a single input image at once; we therefore used a deeper structure to increase the capacity of the large-RF branch by adding nonlinearity, in order to extract better features from objects scattered across the image. We also used a cascaded structure to enhance the expressiveness of the feature maps. The deeper branch had a bottleneck structure based on Szegedy et al. [45,46] to increase efficiency while minimizing the increase in the number of parameters.
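The branch design described above (a 1 × 1 bottleneck, factorized 1 × 3/3 × 1 convolutions, then a dilated convolution for a large RF) can be sketched in PyTorch. This is an illustrative simplification, not the authors' exact DRFB code; the class name, channel counts, and dilation rate are placeholder choices:

```python
import torch
import torch.nn as nn

class LargeRFBranch(nn.Module):
    """Illustrative large-RF branch: 1x1 bottleneck, factorized 1x3/3x1
    convolutions, then a dilated 3x3 convolution to enlarge the RF."""
    def __init__(self, in_ch, mid_ch, out_ch, dilation=5):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),  # bottleneck
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=(1, 3), padding=(0, 1), bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=(3, 1), padding=(1, 0), bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            # a 3x3 kernel with dilation d covers a (2d+1)x(2d+1) area
            nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=dilation,
                      dilation=dilation, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.branch(x)

x = torch.randn(1, 512, 19, 19)  # feature-map shape used for small objects
print(LargeRFBranch(512, 128, 128)(x).shape)  # torch.Size([1, 128, 19, 19])
```

The factorized and dilated layers preserve the 19 × 19 spatial size, so several such branches can be concatenated as in an Inception-style module.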

**Figure 3.** Structure of the (**left**) DRFB module and (**right**) RFB basic module.

In each structure in Figure 3, every layer in every branch includes batch normalization and ReLU activation after the convolution layer (**Conv**). However, the separable convolution (**Sep Conv**), shortcut, and concatenation layers did not include an activation function. Table 1 shows the number of channels in each layer before the DRFB module was cascaded. In Table 1, the top and bottom rows denote input and output, respectively, and each entry sequentially indicates the number of input/output channels, the application of batch normalization (**BN**), and the ReLU activation function (**ReLU**). The DRFB module was composed of the cascaded structure in Table 1. The spatial size of all inputs and outputs was 19 × 19. The shortcut branch, shown in Figure 3, was multiplied by a scale factor (0.1 [14]) and added to each feature map. The structure of the RFB basic module was the same as that of Liu et al. [14].

**Table 1.** Input and output channels before the cascaded structure of the DRFB module.


DRFBNet300. The SSD object-detection model has been combined with various backbones. Among them, the MobileNet v1 backbone SSD300 uses depthwise convolution [25], which reduces the number of parameters and the computational complexity while preserving accuracy. However, the SSD object detector detects small objects using feature maps from the front side of the feature extractor, and these feature maps have relatively low expressive power. Therefore, the SSD detects objects quickly, but its overall accuracy is low.

In this paper, we propose the MobileNet v1 backbone SSD300 with our DRFB module and the RFB basic module [14] applied, and define it as DRFBNet300. The right-hand side of Figure 3 shows the structure of the RFB basic module. Figure 4 shows the structure of the proposed DRFBNet300. For the backbone network, we used ImageNet-pretrained [51] MobileNet v1. All of the structures in Figure 4 are identical to the MobileNet v1 backbone SSD300 except for the RFB basic and DRFB modules. The feature extractor consists of the MobileNet v1 backbone, the DRFB module, the RFB basic module, and six additional convolution layers. The quality of the 19 × 19 × 512 feature maps used for small-object detection is enhanced by the DRFB module. The RFB basic module is connected to the front side of the extra layers; as a result, the expressiveness of the feature maps for large-object detection is also enhanced, improving the overall accuracy of the detection model.

**Figure 4.** DRFBNet300 structure.

#### *3.2. Split Image Processing*

In general object-detection methods, the input image of a single-stage detector is resized to a specific size; an SSD is classified as SSD300 or SSD512 according to this resized input size [13]. Single-stage object-detection models deduce the position, size, and label of objects in the input image in a single network forward pass. Therefore, SSD512, which uses a high-resolution input, detects objects relatively well, but SSD300 does not. With its small input, SSD300 deduces results from only 9.7% of the pixels (90,000) of a 720P input image (921,600 pixels). This makes SSD300 relatively fast, but low accuracy is inevitable.
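The arithmetic behind the 9.7% figure is easy to verify:

```python
# Pixels seen by a 300x300 network input vs. a full 720P frame
resized, source = 300 * 300, 1280 * 720
print(resized, source)            # 90000 921600
print(f"{resized / source:.2%}")  # 9.77%
```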

Therefore, in this paper, we propose the SIP method, which reduces information loss in the input-image-resizing step of the network. The bottom of Figure 5 shows the schematic concept of the SIP method. Unlike the conventional method shown in the upper part of Figure 5, the detection method with SIP separates the input image into two segments in the Image Separation part and produces the final result through the Box Merging part. The overall flow is shown in Figure 2.

Image Separation. A single input image is separated into two images such that 12.5% of the original width overlaps at the center. For a 720P input (1280 × 720 pixels), 160 × 720 pixels overlap at the center, and the image is separated into two 720 × 720 pixel images. The separated images are fed to the object-detection model, DRFBNet300, after normalization. The network infers the positions of the objects in the left and right images, and the Box Merging part of the SIP method merges the coordinates of objects in the overlapped area to generate the final result.
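As an illustration (not the authors' code), the separation step for a 720P frame can be sketched with NumPy slicing; the function name and the NumPy dependency are our own choices:

```python
import numpy as np

def separate_image(frame, overlap_ratio=0.125):
    """Split a frame into left/right square crops overlapping at the center.
    For a 1280x720 input, each crop is 720x720 with a 160-px-wide overlap."""
    h, w = frame.shape[:2]
    overlap = int(w * overlap_ratio)   # 160 px for w = 1280
    half = (w + overlap) // 2          # 720 px for w = 1280
    left = frame[:, :half]
    right = frame[:, w - half:]
    return left, right

frame = np.zeros((720, 1280, 3), dtype=np.uint8)
left, right = separate_image(frame)
print(left.shape, right.shape)  # (720, 720, 3) (720, 720, 3)
```

The two crops share columns 560–720 of the original frame, which is exactly the 160-pixel overlapped region the Box Merging part later reconciles.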

**Figure 5.** Schematic concept of (**top**) existing object-detection method and (**bottom**) proposed Split Image Processing (SIP) method.

Box Merging. Figure 6 shows a flowchart of the Box Merging part. All thresholds are referenced to a 720P image, and the optimal values were obtained through experiments. The object-detection model outputs results in a coordinate format; the values used in the Box Merging part are the coordinates of objects in the overlapped area. If the detector locates the same object in the overlapped area in both the left and right images, the box is truncated or overlapped, as shown in the left image in Figure 7. This happens when objects in the overlapped area are detected simultaneously in the left and right images.

**Figure 7.** Experiment results (**a**) before and (**b**) after applying the Box Merging part.

In general, when the same object is detected in both the left and right UAV images, the difference in its *Y* coordinates is small. Exploiting this, the top (minimum) and bottom (maximum) *Y* values of the left and right boxes are compared separately; if either difference is larger than 20 pixels, the boxes are decided to be different objects. When the *Y*-coordinate condition is satisfied, the Box Merging part compares the *X*-coordinate center points of the two boxes; if the distance between them is less than 40 pixels, they are decided to be the same object. Even if all of the previous conditions are satisfied, a bounding box smaller than 30 × 30 pixels is decided to be a separate box; this condition accounts for very small vehicles at high altitude. Conversely, if the box is larger than 160 × 160 pixels, the maximum *X* value of the left-image box lies in the range of 710–720 pixels, and the minimum *X* value of the right-image box lies in the range of 560–570 pixels, the boxes are decided to be the same object; this condition accounts for very large vehicles at low altitude. Figure 7 shows the results before and after applying the Box Merging part.
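A simplified sketch of the merging decision, covering only the *Y*-difference, *X*-center, and small-box conditions above (the large-box 160 × 160 condition is omitted); all names are illustrative, not the authors' implementation:

```python
def same_object(left_box, right_box):
    """Decide whether two boxes (x_min, y_min, x_max, y_max), one from the
    left crop and one from the right crop, in full-frame 720P coordinates,
    are the same vehicle in the overlapped area."""
    lx0, ly0, lx1, ly1 = left_box
    rx0, ry0, rx1, ry1 = right_box
    # Y condition: top and bottom edges must each differ by < 20 px
    if abs(ly0 - ry0) >= 20 or abs(ly1 - ry1) >= 20:
        return False
    # X condition: box centers must be closer than 40 px
    if abs((lx0 + lx1) / 2 - (rx0 + rx1) / 2) >= 40:
        return False
    # very small boxes (high altitude) are treated as separate objects
    if (lx1 - lx0) < 30 or (ly1 - ly0) < 30:
        return False
    return True

def merge(left_box, right_box):
    """Union of the two partial boxes into one result box."""
    return (min(left_box[0], right_box[0]), min(left_box[1], right_box[1]),
            max(left_box[2], right_box[2]), max(left_box[3], right_box[3]))

a, b = (600, 100, 700, 180), (610, 102, 715, 183)
print(same_object(a, b), merge(a, b))  # True (600, 100, 715, 183)
```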

#### *3.3. UAV-Cars Dataset*

To train general object-detection models, most studies use datasets such as MS COCO [23] or PASCAL VOC [13–18,20,24,40,50]. These datasets have 81 and 21 labels, respectively, including background, with classes such as frisbee, hot dog, and potted plant that are irrelevant to UAV imagery vehicle detection. Furthermore, UAV imagery is captured with a wide-angle camera, which distorts object composition, ratio, and angle. Most general object-detection datasets consist of front-view images, so even identically labeled objects have different characteristics. Figure 8 shows the feature differences of vehicles between MS COCO and UAV imagery. If a general object-detection dataset is used to train a UAV imagery vehicle-detection model, detection accuracy deteriorates because the dataset lacks the peculiarities of UAV imagery.

**Figure 8.** Vehicle-feature difference between (**a**) MS COCO and (**b**) UAV imagery.

In this paper, we propose UAV-cars, a dataset for vehicle detection in bird-view UAV imagery. The UAV-cars dataset includes a training and a validation set. The training set consists of 4407 images containing 18,945 objects, and the validation set of 628 images containing 2637 objects. To generate the dataset, vehicles on the road were captured directly with the built-in UAV camera, and the video was then sampled into images at a constant frame rate. Ground truth (GT) was generated for a total of 5035 images using LabelImg [52], which we used to draw the GT boxes and save their coordinates as XML files. The 628 validation images were not included in the training set, and the validation set covers altitudes at 10 m intervals from 10 to 50 m.
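LabelImg saves annotations in PASCAL VOC-style XML. Assuming that format, reading the GT boxes back could look like the following (a sketch, not the authors' tooling):

```python
import xml.etree.ElementTree as ET

def load_gt_boxes(xml_path):
    """Read (label, x_min, y_min, x_max, y_max) tuples from a
    LabelImg-generated PASCAL VOC-style XML annotation file."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append((obj.findtext("name"),
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes
```

Each training image then contributes a list of labeled corner-format boxes that can be matched against detections at evaluation time.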

We used a UAV with a camera equipped with a wide-angle lens to capture the UAV imagery. Detailed specifications of the UAV and camera are covered in Section 4.1. All images were taken with various vehicle compositions at altitudes between 10 and 50 m. As a result, the UAV-cars dataset contains the various distortion peculiarities of UAV imagery.

#### *3.4. Vehicle Detection in UAV Imagery*

Network Training. We implemented DRFBNet300 with GPU-enabled PyTorch 1.0.1, a deep learning library. Our training strategies were similar to those of SSD, including data augmentation and the aspect ratios of the default boxes. The size of the default boxes was modified to better detect small objects. For weight-parameter initialization, the backbone network used the weights of ImageNet-pretrained [51] MobileNet v1, and all remaining layers were initialized using the MSRA method [53]. The loss function used in the training phase was the multibox loss [13], optimized with Stochastic Gradient Descent (SGD) with momentum. DRFBNet300 was trained on the UAV-cars training set for 150 epochs. Further details are covered in Section 4.3.

Vehicle detection. The proposed vehicle-detection model for bird-view UAV imagery is implemented by applying the SIP method to DRFBNet300 trained on the UAV-cars dataset. Figure 2 shows a flowchart of the entire vehicle-detection method. We use precaptured bird-view UAV imagery as input. The input video is divided into frames and conveyed to our vehicle-detection model. Each input frame is separated into left and right images by the Image Separation part. The separated images are fed to DRFBNet300, pretrained on the UAV-cars dataset, to infer coordinates and scores. The Box Merging part uses the coordinates of the bounding boxes inferred by DRFBNet300 to merge redundant detection results when objects fall in the overlapped area. Finally, the completed coordinate values are drawn as bounding boxes on the input image, and the results are displayed on the screen and saved.

#### **4. Experimental Environment**

#### *4.1. UAV Specification*

The experiment used images taken with the DJI Phantom 4 Advanced model (Shenzhen, China). The fuselage weighed 1368 g and measured 350 mm diagonally, excluding the propellers. It was equipped with four front- and bottom-side cameras, a GPS, and a gyroscope for the autonomous flight system; using these sensors, the UAV flew with a vertical error of ±10 cm. The built-in camera used a 20-megapixel one-inch CMOS sensor and an 8.8/24 mm lens with an 84° FOV. The gimbal connecting the camera to the fuselage has three axes to compensate for yaw, pitch, and roll motion. All images were shot at 720P (1280 × 720 pixels) at 30 FPS. Figure 9 shows the UAV and its built-in camera used in the experiment.

**Figure 9.** DJI Phantom 4 Advanced UAV (**left**) and its built-in camera (**right**).

#### *4.2. Experiment Environment*

To implement the proposed method, we used GPU-enabled PyTorch 1.0.1, which relies on the CUDA 9.0 and cuDNN v7.5 GPU libraries. Table 2 shows the specifications of the server PC used for training and running our vehicle-detection model.


**Table 2.** Server PC specification table.

#### *4.3. Training Strategies*

The same training strategies were applied when training models on MS COCO [23] and on the UAV-cars dataset. Data augmentation, including distortions such as cropping, expanding, and mirroring, was applied in the training phase. Data normalization using the mean RGB values (104, 117, 124) of MS COCO [14] was applied for fast training and global-minima optimization. The models were trained with a batch size of 32 for 150 epochs. The learning rate started at 2 × 10<sup>−3</sup> and was reduced by a factor of 10 at epochs 90, 120, and 140. We applied a warm-up [54], which helps global-minima convergence, linearly increasing the learning rate from 1 × 10<sup>−6</sup> to 2 × 10<sup>−3</sup> over the first five epochs. SGD with a momentum coefficient of 0.9 and a weight-decay coefficient of 5 × 10<sup>−4</sup> was used as the optimizer.
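The schedule above can be written out explicitly; a minimal sketch (ours, not the authors' code) that reproduces the linear warm-up and the step decays:

```python
def learning_rate(epoch, base_lr=2e-3, warmup_epochs=5, warmup_start=1e-6,
                  milestones=(90, 120, 140)):
    """Step schedule with linear warm-up: the rate ramps from 1e-6 to 2e-3
    over the first 5 epochs, then is divided by 10 at epochs 90, 120, 140."""
    if epoch < warmup_epochs:
        # linear interpolation from warmup_start toward base_lr
        return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    # one factor of 0.1 per milestone already passed
    return base_lr * 0.1 ** sum(epoch >= m for m in milestones)

print(learning_rate(0))   # 1e-06
print(learning_rate(10))  # 0.002
```

In practice the same effect is obtained by updating the optimizer's learning rate once per epoch with this function's value.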

Different methods were applied to the backbone and the remaining layers to initialize the weight parameters of the network. The initial weight parameter of the backbone network used ImageNet [51] pretrained MobileNet v1 [25], and all other layers were initialized using the MSRA method [53].

#### **5. Experimental Results**

#### *5.1. MS COCO*

In this experiment, we used the MobileNet v1 [25] backbone SSD300 [13], RFBNet300 [14], YOLOv3 320 [27], FSSD300 [28] and our DRFBNet300, which are single-stage object-detection methods. We trained each model using MS COCO *trainval35k* [23]. The training strategies in Section 4.3 were used for each model. All models in this experiment were evaluated using MS COCO *minival2014* [23].

Table 3 shows the speed and mean Average Precision (mAP) of each model trained on MS COCO. The proposed DRFBNet300 achieved 21 mAP, the highest score among the lightweight single-stage object-detection models running in real time, SSD300 and RFBNet300. Network inference with DRFBNet300 took only 22.3 ms, allowing real-time detection at about 45 FPS. FSSD300 and YOLOv3 320 were accurate, but have 19.1M and 24.4M parameters and inference times of 60.9 and 40.1 ms, respectively, making real-time object detection impossible. Figure 10 shows detection results of DRFBNet300 on MS COCO val2017 [23].

Figure 11 shows the results of person detection in bird-view UAV imagery using MS COCO-trained SSD300, RFBNet300, and our DRFBNet300. Comparing the experiment results, the detection results of DRFBNet300 were better than those of the other models. This is because DRFBNet300 has the best generalization performance and was designed to detect small objects well.


**Table 3.** Experiment results of the MS COCO dataset.

**Figure 10.** Object-detection results of DRFBNet300 on MS COCO val2017.

**Figure 11.** Person detection in bird-view UAV imagery of each model trained on MS COCO. (**a**,**d**) SSD300 results; (**b**,**e**) RFBNet300 results; and (**c**,**f**) DRFNet300 results.

Figure 12 shows the experiment results of applying the SIP method to DRFBNet300 trained on MS COCO. The model with the SIP method is slower because the amount of computation increases. However, whereas the model without SIP missed or misdetected objects, the accuracy of the SIP-applied model was greatly improved.

**Figure 12.** Experiment results of adjacent-frame object-detection (**top**) before and (**bottom**) after applying SIP of MS COCO-trained DRFBNet300.

#### *5.2. UAV-Cars Dataset*

In this experiment, we trained SSD300, RFBNet300, and our DRFBNet300, which are lightweight single-stage object detectors running in real time, using the training strategies described in Section 4.3. The UAV-cars training set of 4407 images containing 18,945 objects was used for each model's training. The trained models were evaluated on the UAV-cars validation set of 628 images containing 2637 objects, divided into five cases at altitude intervals of 10 m from 10 to 50 m. The True Positive criterion was an Intersection over Union (IoU) threshold of 0.5, the same value used in PASCAL VOC [24].
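For reference, the IoU criterion used here can be computed as follows (an illustrative sketch with corner-format boxes):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as a True Positive when IoU with the GT box >= 0.5
print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # ≈ 0.33, below the threshold
```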

Table 4 and Figure 13 show the AP and detection results of models trained on the UAV-cars training set. In Table 4, we can see that DRFBNet300 achieved the highest AP score at all altitudes except for at 10 m. In addition, since inference time was only 17.5 ms, real-time vehicle detection was possible at 57 FPS. In Figure 13, we can see that DRFBNet300 detected small-sized vehicles better than the other models.

In Table 4, accuracy at all altitudes except 10 m was greatly improved when the SIP method was applied, and the gain grew with altitude. Even at 50 m, the AP score of DRFBNet300 with SIP was 57.28, an increase of 30.07 AP over the 27.21 AP obtained without it, i.e., more than double.

Figure 14 shows practical UAV imagery vehicle-detection results of DRFBNet300 with and without SIP. With SIP, DRFBNet300 detected vehicles at high altitudes in bird-view UAV imagery that the plain DRFBNet300 could not. In addition, it ran in real time at 33 FPS even with SIP applied.



*Sensors* **2019**, *19*, 3958

**Figure 13.** Experiment results on the UAV-cars validation set without applying the SIP method. (Left to right) Altitudes of 30, 40, and 50 m. Results of (**a**–**c**) SSD300; (**d**–**f**) RFBNet300; and (**g**–**i**) our DRFBNet300.

**Figure 14.** Experiment results of DRFBNet300 trained on UAV-cars dataset (**top**) before and (**bottom**) after applying SIP.

#### **6. Conclusions**

In this paper, we proposed DRFBNet300 with the DRFB module for bird-view UAV imagery vehicle detection, the UAV-cars dataset to train it, and the SIP method to improve the accuracy of our vehicle detector. The single-stage object-detection model SSD has low computational complexity and is fast, but does not detect small objects well, and accuracy drops further when a lightweight backbone network is used for speed. DRFBNet300 attaches the DRFB and RFB basic modules to the MobileNet v1 backbone SSD300, a lightweight object-detection model. The DRFB module was designed with multisized RF branches, using dilated convolution to minimize the increase in computation. The multibranch, cascaded structure of our DRFB module improved the quality of the feature maps, which improved the accuracy of the vehicle-detection model. We also proposed the UAV-cars dataset, consisting of 5035 images containing 21,582 objects, including the distortion peculiarities of vehicles in bird-view UAV imagery. Lastly, we proposed the SIP method to improve DRFBNet300 accuracy. DRFBNet300 with the DRFB module achieved the highest score among lightweight single-stage methods running in real time, with 21 mAP at 45 FPS on the MS COCO metric. In the experiment on the UAV-cars dataset, DRFBNet300 also obtained the highest AP score at 20–50 m altitudes, regardless of whether the SIP method was applied. The accuracy gain from the SIP method grew as UAV altitude increased, and accuracy more than doubled at an altitude of 50 m. With DRFBNet300 and the SIP method, the proposed approach detects vehicles in UAV imagery accurately and in real time at 33 FPS.

**Author Contributions:** Conceptualization, S.H.; methodology, S.K.; software, S.H.; investigation, S.H.; writing—original-draft preparation, S.H.; writing—review and editing, J.Y., and S.K.; supervision, S.K.; project administration, S.K.

**Funding:** This research received no external funding.

**Acknowledgments:** This research was supported by the Ministry of Science and ICT (MSIT), Korea, under the Information Technology Research Center (ITRC) support program (IITP-2019-2016-0-00288), supervised by the Institute for Information and Communications Technology Planning and Evaluation (IITP).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Multi-Scale Vehicle Detection for Foreground-Background Class Imbalance with Improved YOLOv2**

**Zhongyuan Wu 1,2, Jun Sang 1,2,\*, Qian Zhang 1,2, Hong Xiang 1,2, Bin Cai 1,2 and Xiaofeng Xia 1,2,\***


Received: 22 June 2019; Accepted: 26 July 2019; Published: 30 July 2019

**Abstract:** Vehicle detection is a challenging task in computer vision, and numerous vehicle detection methods have been proposed in recent years. Since the vehicles in a scene may vary in size, and the vehicles and the background may occupy imbalanced areas, detection performance suffers. To obtain better performance, this paper proposes a multi-scale vehicle detection method that improves YOLOv2. The main contributions are: (1) a new anchor-box generation method, Rk-means++, is proposed to adapt to vehicles of varying sizes and achieve multi-scale detection; (2) Focal Loss is introduced into YOLOv2 for vehicle detection to reduce the negative influence on training of the imbalance between vehicles and background. Experimental results on the Beijing Institute of Technology (BIT)-Vehicle public dataset demonstrate that the proposed method achieves better vehicle localization and recognition performance than other existing methods.

**Keywords:** vehicle detection; YOLOv2; focal loss; anchor box; multi-scale

#### **1. Introduction**

Vehicle detection is an essential part of computer vision, aiming to locate vehicles and recognize their types. In recent years, vehicle detection has been applied in numerous fields, such as traffic surveillance, unmanned vehicles, and gate monitoring. However, due to complicated backgrounds, greatly varying illumination intensity, occlusion, and the small variations among vehicle types, vehicle detection remains a challenging task and a hot research field in Artificial Intelligence (AI).

Numerous vehicle detection methods have been proposed, which can be divided into two categories: traditional machine learning methods and deep learning-based methods. Among the traditional methods, Tsai et al. [1] used a new color transformation model to obtain the key vehicle colors and locate candidate objects; a multichannel classifier was then adopted to recognize the candidates. For vehicle detection in video, Jazayeri et al. [2] argued that the temporal information of the features should be combined with the vehicle motion behaviors, which can compensate for the complexity of recognizing vehicle colors and shapes. In Refs. [3–5], the Histogram of Oriented Gradients (HOG) method was applied to extract vehicle features from the image, yielding a lower false positive rate. In recent years, with the continuous improvement of computing power, deep learning [6] has gradually become the main approach to vehicle detection, and deep learning-based methods have surpassed traditional methods [7–10] in detection and recognition performance. Different from designing features by hand, Refs. [11–13] proposed to learn features automatically with a Convolutional Neural Network (CNN), which only requires large numbers of labeled vehicle images to train the network with supervision and does not need manually designed features. The first systematic framework for object detection was R-CNN [14]. In R-CNN, the selective search algorithm [15] was used to generate regions of interest, and then a CNN was applied to recognize whether each region of interest was object or background. After R-CNN, numerous methods [16–19] adopted the same framework with slight changes. In Refs. [20,21], Faster R-CNN was applied to detect vehicles, surpassing the previous methods. Ref. [22] developed an accurate vehicle proposal network to extract vehicle-like targets; a coupled R-CNN method was then proposed to extract the vehicle's location and attributes simultaneously. To improve detection speed, Redmon et al. [23] proposed YOLO, which was the first to cast object detection as regression and achieved end-to-end detection. Other methods, such as SSD [24], YOLOv2 [25] and YOLOv3 [26], were subsequently proposed, which greatly reduce the detection time. Ref. [27] proposed an improved YOLOv2 for vehicle detection, in which k-means++ [28] was used to generate anchor boxes instead of k-means [29] and the loss function was improved with normalization. In Ref. [30], by modifying the net resolution and depth of YOLOv2, the proposed model gave near Faster R-CNN performance at more than 4 times the speed. In object detection, the objects usually occupy only a small part of the image, while the majority of the image is background. This imbalance between objects and background can hinder object detection models from converging in the correct direction during training. Therefore, to reduce its impact, Lin et al. [31] proposed Focal Loss to focus training on a sparse set of hard examples. Their experiments validated that the method improves detection results for end-to-end methods.

To improve the accuracy of localization and recognition simultaneously, an improved vehicle detection method based on YOLOv2 was proposed in this paper. A new anchor box generation method was applied to enhance the network localization ability. In addition, Focal Loss was introduced in YOLOv2 to improve the network recognition ability.

#### **2. Brief Introduction on YOLOv2**

In YOLOv2 [25], the input image is divided into S × S grids. Each grid predicts the location information, the class probability and the object confidence for each anchor box. The location information includes *x*, *y*, *w* and *h* of the bounding box, where *x*, *y* represent the abscissa and ordinate of the center pixel of the bounding box, and *w*, *h* represent the width and height of the bounding box. The class probability indicates which class the object in the current grid most likely belongs to. The object confidence represents the confidence that an object exists in the current grid. Then, Non-Maximum Suppression (NMS) is applied to remove the bounding boxes with low object confidence. Finally, the remaining bounding boxes are decoded to obtain the detection boxes. In this section, YOLOv2 is introduced briefly, mainly covering the generation of anchor boxes, the network structure and the loss function.
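
As a rough illustration of the NMS step described above, a greedy suppression over scored boxes can be sketched as follows; the corner box format and the IoU threshold of 0.5 are illustrative assumptions, not taken from the paper:

```python
def iou_xyxy(a, b):
    """IoU of two boxes given in (x1, y1, x2, y2) corner format."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [j for j in order if iou_xyxy(boxes[best], boxes[j]) <= iou_thresh]
    return keep
```

For example, two heavily overlapping boxes collapse to the higher-scoring one, while a distant box survives.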

#### *2.1. The Generation of Anchor Boxes*

Anchor boxes were first proposed in Faster R-CNN [18], which aims to generate bounding boxes with a certain ratio instead of predicting the sizes of bounding boxes directly. The authors of YOLOv2 indicated that designing anchor boxes by hand was suboptimal. Instead, they applied k-means clustering on the training set to obtain better anchor boxes.

When implementing k-means, instead of the traditional Euclidean distance, YOLOv2 adopted the Intersection over Union (IoU) distance to measure the closeness of two bounding boxes. The main reason is that, if the Euclidean distance were adopted, large anchor boxes would produce more errors than small ones. By adopting the IoU distance, the errors become independent of the sizes of the anchor boxes. The distance of k-means in YOLOv2 can be expressed as Equation (1).

$$d(box, centroid) = 1 - \mathrm{IoU}(box, centroid) \tag{1}$$
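
Equation (1) can be sketched in a few lines. Following the YOLOv2 clustering setup, boxes are represented by width and height only, with their corners aligned (an assumption of the clustering step, not of detection itself); the function names are ours:

```python
def iou_wh(box, centroid):
    """IoU of two boxes given as (w, h), with top-left corners aligned,
    as used when clustering anchor box shapes."""
    w, h = box
    cw, ch = centroid
    inter = min(w, cw) * min(h, ch)
    union = w * h + cw * ch - inter
    return inter / union

def kmeans_distance(box, centroid):
    """d(box, centroid) = 1 - IoU(box, centroid), Equation (1)."""
    return 1.0 - iou_wh(box, centroid)
```

Identical shapes are at distance 0, and the distance does not grow with box size, which is the point of using IoU instead of Euclidean distance.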

#### *2.2. The Network Structure*

In recent years, most detection methods take VGG [7] or ResNet [8] as the base feature extractor. The authors of YOLOv2 argued that these networks were accurate and powerful, but needlessly complex. Therefore, Darknet19 was adopted as the backbone of YOLOv2, which has fewer parameters and may obtain better performance than VGG and ResNet. The network of YOLOv2 is shown in Figure 1.

**Figure 1.** The network of YOLOv2.

As shown in Figure 1, the network includes 32 layers, comprising 19 convolutional layers and five max-pooling layers. Similar to VGG, a 3 × 3 convolutional layer is used to double the channels of the feature maps after each pooling layer, and a 1 × 1 convolutional layer is used to halve the channels and fuse the features. Feature fusion is applied on the feature maps from the direct path and path (a) in Figure 1, which retains features from the shallower layers to improve the ability to detect small objects.

#### *2.3. The Loss Function*

In YOLOv2, multi-part loss function is adopted, which includes the bounding box loss, the confidence loss and the class loss. The bounding box loss includes the coordinate loss and the size loss of the bounding box. The confidence loss includes the confidence loss of bounding box with objects and without objects. The class loss is calculated by softmax to obtain the class probability. The loss function in YOLOv2 can be expressed as Equation (2).

$$\begin{aligned}
&\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2 \\
&+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned} \tag{2}$$

As shown in Equation (2), $x_i$ and $y_i$ denote the center coordinates of the box relative to the bounds of the $i$-th grid cell. $w_i$ and $h_i$ denote the width and height of the bounding box relative to the whole image. $C_i$ denotes the confidence of the bounding box in the $i$-th grid cell, and $p_i(c)$ denotes its class probability. $\hat{x}_i$, $\hat{y}_i$, $\hat{w}_i$, $\hat{h}_i$, $\hat{C}_i$ and $\hat{p}_i(c)$ denote the corresponding predicted values. $S^2$ denotes the S × S grid cells and $B$ denotes the bounding boxes per cell. $\lambda_{coord}$ denotes the weight of the coordinate loss and $\lambda_{noobj}$ denotes the weight of the loss of bounding boxes without objects. $\mathbb{1}_{i}^{obj}$ denotes whether an object appears in the $i$-th grid cell, and $\mathbb{1}_{ij}^{obj}$ denotes whether the $j$-th box predictor in the $i$-th grid cell is "responsible" for that prediction.

In the loss function, the first line computes the coordinate loss, the second line computes the bounding box size loss, the third line computes the confidence loss of bounding boxes containing objects, the fourth line computes the confidence loss of bounding boxes not containing objects and the last line computes the class loss. To prevent the sizes of the bounding boxes from having an outsized impact on the loss, the square roots of the width and height of the bounding boxes are used to decrease their magnitudes. Since usually only a few bounding boxes with objects exist in real pictures, the confidence loss of bounding boxes with objects is much smaller than the other losses. Consequently, weighting is applied to balance the different kinds of losses. Usually, λ*coord* is set to 5 and λ*noobj* is set to 0.5. Otherwise, the losses would contribute very unevenly to the total loss, which could render some of them ineffective for network training.

#### **3. The Proposed Method**

The proposed vehicle detection method is based on YOLOv2, in which a new anchor box generation method was proposed and Focal Loss was introduced. In this section, these two improvements are introduced in detail.

#### *3.1. The Generation of Anchor Boxes*

The quality of the anchor boxes is important for end-to-end detection methods. Using k-means or k-means++ to generate anchor boxes is efficient. However, the anchor boxes generated by such methods are usually suited to the common sizes of objects, while they may be far from the sizes of objects with unusual dimensions. By analyzing the detection boxes of YOLOv2, we found that vehicles of common sizes were predicted well, while some vehicles of unusual sizes, such as trucks, which are usually much bigger than common vehicles, were predicted poorly. As shown in Figure 2, the black boxes denote the ground truths and the colored boxes denote the detection boxes predicted by YOLOv2. It is obvious that the sedan was detected well and the minivan poorly, since the size of the minivan is unusual while that of the sedan is common.

**Figure 2.** The detection boxes generated by YOLOv2.

To improve the localization accuracy, it is better to generate anchor boxes that match most sizes of the ground truths, instead of only the common sizes. Therefore, a clustering method called Rk-means++ was proposed in this paper. As shown in Figure 3, in Rk-means++, two ratios of the width and height of the ground truths are obtained by applying k-means++. Then, a certain proportion is applied to the two ratios to generate anchor boxes at different scales. Compared with methods that cluster anchor boxes directly, such as k-means and k-means++, Rk-means++ generates anchor boxes on different hierarchies, which may match most sizes of the ground truths much better. In our experiments, the proportion was set to 1:2:4:6:8:10 and six anchor boxes were obtained.
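
One possible reading of the scaling step can be sketched as follows. How the two base ratios from k-means++ and the six scales combine into exactly six anchors is not fully specified in the text, so the alternating pairing below is an assumption for illustration only; the function name is ours:

```python
def rk_anchors(base_ratios, proportions=(1, 2, 4, 6, 8, 10)):
    """Sketch of the Rk-means++ scaling step: take the base (w, h) ratios
    returned by k-means++ and multiply them by the proportion 1:2:4:6:8:10,
    pairing each scale with the base ratios in turn (illustrative choice)."""
    anchors = []
    for i, scale in enumerate(proportions):
        w, h = base_ratios[i % len(base_ratios)]
        anchors.append((w * scale, h * scale))
    return anchors
```

With two base ratios and six scales, this yields six anchors spread over different hierarchies, which is the stated goal of the method.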

**Figure 3.** The computing procedures of k-means++ and Rk-means++. (**a**) k-means++; (**b**) Rk-means++.

The anchor boxes obtained by Rk-means++ ensure that each size of vehicle at a different scale can be matched with one of the anchor boxes. In other words, the proposed anchor box generation method can enhance the robustness of YOLOv2 to different scales of vehicles and improve the localization accuracy.

#### *3.2. Focal Loss*

For vehicle detection, the vehicles usually occupy only a small part of the whole image, while the majority of the image is background. Consequently, thousands of candidate bounding boxes are generated in one image with YOLOv2, while only a few of them contain vehicles. Obviously, the imbalance between the positives (i.e., vehicles) and the negatives (i.e., background) is serious. In the training stage of YOLOv2, because of the large number of candidate bounding boxes not containing objects, the loss is overwhelmed by them, and the few candidate bounding boxes containing objects cannot influence the loss effectively. This imbalance makes training inefficient, since most candidate bounding boxes tend to be easy negatives, which are useless for CNN learning. The numerous easy negatives overwhelm some important training examples, which leads the model to converge in a wrong direction. Therefore, in order to reduce the imbalance between positives and negatives, Focal Loss [31] was introduced into YOLOv2 to detect the vehicles.

Before Focal Loss was introduced, the cross entropy loss was commonly employed to calculate the classification loss; it is shown in Equation (3) for binary classification.

$$L_{ce} = -y \log y' - (1 - y) \log(1 - y') = \begin{cases} -\log y', & y = 1 \\ -\log(1 - y'), & y = 0 \end{cases} \tag{3}$$

In Equation (3), *y* denotes the ground truth, which is 1 for positives and 0 for negatives, and *y*′ denotes the predicted value ranging from 0 to 1. For positives, the higher the predicted probability, the smaller the cross entropy loss; for negatives, the lower the predicted probability, the smaller the loss. Consequently, training dominated by the iteration of numerous easy examples is inefficient; worse, the model may not be optimized to a good state.
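
For concreteness, Equation (3) can be written directly as a small function (a sketch using the same *y*/*y*′ convention as above; the function name is ours):

```python
import math

def cross_entropy(y, y_pred):
    """Binary cross entropy, Equation (3): y is the 0/1 ground truth,
    y_pred is the predicted probability in (0, 1)."""
    return -math.log(y_pred) if y == 1 else -math.log(1.0 - y_pred)
```

An easy positive (e.g., predicted 0.95) still contributes a non-negligible loss, which is the behavior Focal Loss is designed to dampen.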

Focal Loss is based on cross entropy loss, which aims to reduce the weights of easy examples in loss and make the model focus on distinguishing the hard examples. Focal Loss can be expressed as Equation (4).

$$L_{fl} = \begin{cases} -\alpha (1 - y')^{\gamma} \log y', & y = 1 \\ -(1 - \alpha)\, y'^{\gamma} \log(1 - y'), & y = 0 \end{cases} \tag{4}$$

As shown in Equation (4), two factors, γ and α, are added to the cross entropy loss. γ is used to reduce the loss of easy examples. For instance, with γ set to 2, for an easy positive with predicted value 0.95 or an easy negative with predicted value 0.05, the loss is much smaller than the cross entropy loss. The factor γ thus makes training focus more on the hard examples and reduces the impact of the easy ones. α is used to balance the weights of positives and negatives, and controls the decline rate of the example weights. Within a limited number of experiments, we found that setting γ to 2 and α to 0.25 gave the best accuracy.
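
A minimal sketch of Equation (4) with the settings reported above (γ = 2, α = 0.25); the function name is ours:

```python
import math

def focal_loss(y, y_pred, gamma=2.0, alpha=0.25):
    """Focal Loss, Equation (4), with the paper's settings gamma=2, alpha=0.25.
    y is the 0/1 ground truth, y_pred the predicted probability in (0, 1)."""
    if y == 1:
        return -alpha * (1.0 - y_pred) ** gamma * math.log(y_pred)
    return -(1.0 - alpha) * y_pred ** gamma * math.log(1.0 - y_pred)
```

The easy positive with predicted value 0.95 now contributes roughly 0.25 × 0.05² of its cross entropy loss, so easy examples barely affect training while hard ones remain significant.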

#### **4. Experiments**

The implementation of our experiments was based on Darknet, a light deep learning framework provided by the author of YOLOv2. The experiments were conducted on a GPU server with eight GPUs, 24 GB of video memory and 64 GB of RAM. The experimental platform was equipped with 64-bit Ubuntu 14.04, OpenCV 2.7.0 and CUDA Toolkit 8.0.

#### *4.1. Dataset*

The experiments were conducted on the BIT-Vehicle [32] public dataset provided by the Beijing Institute of Technology. The BIT-Vehicle dataset contains 9580 images of six vehicle types, i.e., bus, microbus, minivan, sedan, truck and SUV, with 558, 883, 476, 5922, 822 and 1392 images, respectively. As shown in Figure 4, the BIT-Vehicle dataset includes day and night scenes, and the images are strongly influenced by variations in illumination intensity.

**Figure 4.** Some images in BIT-Vehicle dataset.

In our experiments, the BIT-Vehicle dataset was divided into a training set and a test set with a ratio of 8:2; namely, 7880 images were used for training and 1970 images for testing.

#### *4.2. Implementation*

In our experiments, the size of the input image was 416 × 416. The initial learning rate was set to 0.001. The model was trained for 120 epochs, and the learning rate was divided by 10 when training reached 60 and 90 epochs, respectively. The batch size was set to 8 and the momentum to 0.9. In order to make the model detect well on images of different sizes, the multi-scale training trick was applied in our experiments: a new input image size was selected randomly for training every 10 epochs. Since the downsampling factor of YOLOv2 is 32, the randomly selected input sizes were all multiples of 32, with a maximum of 608 × 608 and a minimum of 352 × 352. Other hyperparameters followed those of YOLOv2.
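
The multi-scale training trick described above amounts to sampling a resolution that is a multiple of the downsampling factor; a sketch (the helper name is ours, not Darknet's):

```python
import random

def random_input_size(min_size=352, max_size=608, stride=32):
    """Pick a random training resolution that is a multiple of the network
    stride (32 in YOLOv2), between 352x352 and 608x608 as described above."""
    steps = (max_size - min_size) // stride
    return min_size + stride * random.randint(0, steps)
```

Every sampled size is one of the nine values 352, 384, ..., 608, so the network regularly sees both coarse and fine input resolutions.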

#### *4.3. Experimental Results and Analysis*

The proposed method was compared with YOLOv2\_Vehicle [27] and the other methods for vehicle detection. In addition, all of these methods were implemented on our platform, so the results of these methods were obtained under the same experimental environment. The experimental results are shown in Table 1.


**Table 1.** Experimental results for different methods.

As shown in Table 1, Class Average Precision (AP) denotes the recognition performance for each vehicle class, mean Average Precision (mAP) denotes the average recognition performance and IoU denotes the localization performance. The mAP and IoU obtained with our proposed method are 97.30% and 92.97%, respectively; both are higher than those of the other methods by nearly 1 percentage point. Notably, except for speed and the APs of bus and microbus, all the metrics obtained by our proposed method are better than those of the others. For the APs of bus and microbus, our method is not the best; the reason may be that different models focus on different classes because the random initial parameters, the network architecture or the anchor boxes differ. Consequently, it is acceptable that two class APs are slightly smaller than those of others. For speed, our method is also not the best, which may result from the fact that more candidate bounding boxes are proposed before NMS, slowing detection slightly. In general, our proposed method showed better performance than the others in the accuracy of localization and recognition, while maintaining the speed.

Figure 5 shows the detection results of our proposed method. Whether in daytime or at night, the localization and recognition abilities were not affected. Whether a vehicle is small or large, our method can locate and recognize it as well. In addition, as can be seen from the second and last pictures of the second column in Figure 5, even though the vehicles were incomplete, they were detected accurately. Consequently, regardless of the variation in illumination intensity, the variation in vehicle sizes or the completeness of the vehicle, our method can recognize and locate vehicles accurately, demonstrating its strong robustness.

**Figure 5.** Examples of detection results.

#### *4.4. The Validation of Rk-means*++

In order to validate the effect of Rk-means++ for YOLOv2 in vehicle detection, numerous experiments were conducted to compare the performance of different methods with different anchor box generation methods. In the experiments, YOLOv2 and YOLOv3 were adopted as the basic model. The experimental results are shown in Table 2.


**Table 2.** Experimental results on comparing different anchor box generation methods.

As shown in Table 2, the IoU with Rk-means++ is the best with both YOLOv2 and YOLOv3. The reason could be that each size of vehicle at a different scale can be matched with one of the anchor boxes, since the anchor boxes generated by Rk-means++ lie on different hierarchies. However, the mAP with Rk-means++ is slightly lower than that of the others. This may result from the fact that more candidate bounding boxes are proposed with Rk-means++, which makes the detection precision drop slightly. In general, with Rk-means++, the model obtains better localization performance at the cost of slightly lower precision.

#### *4.5. The Validation of Focal Loss*

In order to validate the effect of Focal loss for YOLOv2 in vehicle detection, YOLOv2 was adopted as the basic model with different anchor box generation methods. The experimental results are shown in Table 3, where 'Wo' denotes 'Without' and 'FL' denotes 'Focal Loss'.


**Table 3.** Experimental result on comparing YOLOv2 with Focal Loss (FL).

As shown in Table 3, k-means, k-means++ and Rk-means++ were applied to compare the performance without and with Focal Loss. All the YOLOv2 models with Focal Loss surpass those without it, regardless of the anchor box generation method, which indicates that Focal Loss reduces the influence of the imbalance between positives and negatives. It also validates that Focal Loss is useful for improving both mAP and IoU.

#### **5. Conclusions**

In this paper, a multi-scale vehicle detection method based on YOLOv2 was proposed to improve the accuracy of localization and recognition simultaneously. The main contributions of this paper include:


Although the method proposed in this paper shows good performance on vehicle detection, the scale of the dataset and the number of vehicle types are relatively small, which cannot reveal the distinctions between different methods well. In future work, we will collect more vehicle data and try to improve the accuracy on a larger vehicle dataset.

**Author Contributions:** Conceptualization, Z.W. and J.S.; Data curation, Q.Z. and B.C.; Funding acquisition, H.X.; Investigation, Q.Z.; Methodology, Z.W. and Q.Z.; Supervision, J.S.; Validation, B.C.; Visualization, X.X.; Writing—original draft, Z.W. and X.X.; Writing—review & editing, J.S. and H.X.

**Funding:** This research was funded by National Key R&D Program of China (No. 2017YFB0802400).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Automatic Classification Using Machine Learning for Non-Conventional Vessels on Inland Waters**

#### **Marta Wlodarczyk-Sielicka 1,\* and Dawid Polap <sup>2</sup>**


**\*** Correspondence: m.wlodarczyk@am.szczecin.pl; Tel.: +48-513-846-391

Received: 23 May 2019; Accepted: 8 July 2019; Published: 10 July 2019

**Abstract:** The prevalent methods for monitoring ships are based on automatic identification and radar systems. This applies mainly to large vessels. Additional sensors that are used include video cameras with different resolutions. Such systems feature cameras that capture images and software that analyzes the selected video frames. The analysis involves the detection of a ship and the extraction of features to identify it. This article proposes a technique to detect and categorize ships through image processing methods that use convolutional neural networks. Tests to verify the proposed method were carried out on a database containing 200 images of four classes of ships. The advantages and disadvantages of implementing the proposed method are also discussed in light of the results. The system is designed to use multiple existing video streams to identify passing ships on inland waters, especially non-conventional vessels.

**Keywords:** machine learning; image analysis; feature extraction; ship classification; marine systems

#### **1. Introduction**

Monitoring marine vessels is crucial to navigational safety, particularly in limited waters. At sea, vessel traffic services (VTS) are the systems responsible for supervision and traffic management [1]. In the case of inland waters, the support system used to track ships is called river information services (RIS) [2]. These systems use a variety of technologies to assist in observation, e.g., the automatic identification system (AIS), radars, and cameras [3]. The AIS is an automated tracking system that uses information concerning the position of the ship and additional data entered by the navigator, provided that the ship has an appropriate transmitter [4,5]. The international convention for the safety of life at sea (SOLAS convention) [6] developed by the International Maritime Organization requires that AIS transmitters be installed on vessels with a gross tonnage of 300 and more engaged in international voyages, vessels with a gross tonnage of 500 and more not engaged in international voyages, and passenger vessels. All other vessels navigating at sea and in inland waters cannot be unambiguously identified by traffic monitoring systems (for example, non-commercial craft, recreation craft, yachts, and other small boats).

No system or method is available at present for the automatic identification of small vessels. The only way to recognize them is to observe markings on their sides. RIS and VTS information systems have video monitoring, where cameras are mounted mainly on river bridges and in ports. This provides a useful opportunity to observe ships from different perspectives. However, the entire process is not automated. These systems are operated round the clock or during the day (in the case of daily navigation) by human workers. Therefore, to minimize cost, the automation of system operations is desirable. This research problem is a part of the SHREC project, the main purpose of which is to develop a prototype system for the automatic identification of vessels in limited areas [7,8]. At this stage, the authors propose a method to detect and classify vessels belonging to four classes based on an image database. The proposed method is a component of the system and forms part of the detection and recognition layer.

Recognition is one of the most popular tasks in image analysis due to the number of open problems it involves. Feature extraction is particularly important, since it makes it possible to assign the extracted features to a given object. However, quite often, two-dimensional images contain various kinds of additional elements such as sky, trees, or houses. When an image is taken with a good camera, distant objects can be blurred, a distortion that allows a better extraction of the features of the closer elements. Unfortunately, this is not always the case. Hence, scientists from around the world are modeling different solutions to get the best results. First of all, artificial intelligence techniques have developed quite dynamically in recent years. In [9], the authors presented recent advances in convolutional neural networks, which are a popular tool for image classification. The authors noticed the growing popularity trend as well as the modeling of more hybrids and modifications. In [10], this mathematical model of neuron nets was used in medical diagnosis. In [11], the idea of long short-term memory was introduced as an improvement of the classic approach, which can be used to obtain better efficiency. The problem of the classification of water facilities was analyzed in [12,13], where deep representations of different vessels were examined. The problem of the amount of data in the classification process was analyzed in [14] by the use of data augmentation. Existing classification solutions apply to marine waters for large ships. It should be emphasized that the designed system will use multiple existing video streams to identify passing ships on inland waters, especially non-conventional vessels.

The remainder of this article is organized as follows: Section 2 presents the problem of detecting the features of ships in images using different filters, and Section 3 contains a description of the convolutional neural network used for classification. Section 4 details the proposed ship classification system model, and Section 5 describes our experiments and the conclusions of this study. The main goal of this paper is to propose a detection algorithm based on parallel image processing, which allows the creation of two different samples from one image, together with the architecture of a deep convolutional neural network. Our tests form the basis for the construction of the SHREC system architecture.

#### **2. Detecting Features of Ships**

The proposed technique for detecting the features of ships in images is based on the parallel processing of a given image using different filters. This processing distorts the images in such a way that only the important, feature-bearing elements are mapped. The images obtained from the parallel filtering process are used to find features that appear in both differently filtered images. The Hessian matrix was used for comparison and detection. The proposed technique relies on classical image processing to reduce the area used for further classification. The features of vessels were detected in order to minimize the amount of irrelevant data in the images, such as background or surroundings. We used the idea of key points to detect only significant elements. An additional advantage of the proposed detection method is that it extends the number of samples in each class: the technique returns two rectangles, each interpreted as a separate image. Before being used in the subsequent classification process, they are scaled down in size, resulting in additional distortion.

The first of the images is blurred using a fast box algorithm [15]. To formulate this operation, convolution needs to be defined. The convolution of two functions is an operation whose result is a new function *h*(·). Image convolution thus consists of moving a window (matrix) of function values *g*(·), called a filter, along the function values *f*(·), and calculating the sum of the products of the obtained values. This can be formally described as follows (in a discrete way):

$$h[i,j] = (f * g)[i,j] = \sum_{y=i-r}^{i+r} \sum_{x=j-r}^{j+r} f[y,x] \cdot g[y-i+r,\, x-j+r] \tag{1}$$

The discreteness implies that the functions *f*(·) and *g*(·) are matrices. The first one stores the pixel values of the processed image, and the second one is much smaller (typically 3 × 3; with the radius *r* used in Equation (1), its size is (2*r* + 1) × (2*r* + 1)). Its coefficients are constant and predetermined for the corresponding filter. For Gaussian blur, the following matrix is obtained:
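
A direct, unoptimized transcription of Equation (1) might look as follows. Indexing the kernel relative to the window is our reading of the formula; for symmetric kernels such as the box and Gaussian filters used here, this coincides with correlation, so no kernel flip is shown:

```python
def convolve(f, g):
    """Discrete 2-D convolution of image f with kernel g per Equation (1),
    computed over the valid region only; g has odd side length 2r+1."""
    n, m = len(f), len(f[0])
    r = len(g) // 2
    out = []
    for i in range(r, n - r):
        row = []
        for j in range(r, m - r):
            s = 0.0
            for y in range(i - r, i + r + 1):
                for x in range(j - r, j + r + 1):
                    s += f[y][x] * g[y - i + r][x - j + r]
            row.append(s)
        out.append(row)
    return out
```

With a 3 × 3 box (mean) filter on a 3 × 3 image, the single valid output is the mean of all nine pixels.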

$$g[\cdot,\cdot] = \begin{pmatrix} \frac{1}{16} & \frac{1}{8} & \frac{1}{16} \\ \frac{1}{8} & \frac{1}{4} & \frac{1}{8} \\ \frac{1}{16} & \frac{1}{8} & \frac{1}{16} \end{pmatrix} \tag{2}$$

Box blur blurs areas the size of the filter and allows a given area to be blurred multiple times (as an approximation) instead of passing over the complete image several times. This solution reduces the number of calculations. The blur distorts the image; to enhance the features, the image is then converted into grayscale. This is done by replacing each pixel, *pxl* (composed of the three components of the RGB (Red–Green–Blue) model), according to the following formula:

$$R(pxl) = G(pxl) = B(pxl) = \frac{R(pxl) + G(pxl) + B(pxl)}{3} \tag{3}$$

Then, the contrast and brightness are increased. Changing the contrast allows for the darkening of the dark pixels and the brightening of the bright ones, and can be achieved using

$$C(pxl) = (C(pxl) - 128) \cdot \eta + 128 \tag{4}$$

where η ≥ 0 is a contrast coefficient, and *C*(*pxl*) denotes a component of the RGB model of the pixel. Next, the brightness of the image is increased, so that all light colors are pushed toward white; the effect is the elimination of pixels close to white in color:

$$C(pxl) = C(pxl) + \sigma \tag{5}$$

where σ is a threshold. The colors of the pixels are then inverted, which allows for better visualization of the other elements. To this end, the process involves converting each component of the pixel as

$$C(pxl) = 255 - C(pxl) \tag{6}$$

It is easy to see that the inverted color highlights small elements that would otherwise be invisible. They can be removed using two operations: improving the gamma value and binarization. Gamma correction with a coefficient γ allows for the scaling of the brightness of the image (that is, of the remaining elements):

$$C(pxl) = 255 \cdot \left(\frac{C(pxl)}{255}\right)^{\gamma} \tag{7}$$

The last step in processing the first image is its binarization, i.e., the conversion of all pixels into white or black. The conversion is carried out by calculating the average value of all components of a given pixel, *pxl*, and checking whether it falls within the range [0, α] or (α, 255] as

$$\begin{cases} \text{if } \dfrac{\sum\_{C\in\{R,G,B\}} C(p\text{xl})}{3} \le \alpha & \text{then } C(p\text{xl}) = 0 \\ \text{if } \dfrac{\sum\_{C\in\{R,G,B\}} C(p\text{xl})}{3} > \alpha & \text{then } C(p\text{xl}) = 255 \end{cases} \tag{8}$$
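The preprocessing chain of Equations (3)–(8) can be sketched as follows. This is an illustrative NumPy implementation; the parameter values and the function name are our own placeholders, not the configuration used by the authors:

```python
import numpy as np

def preprocess(img, eta=1.5, sigma=40, gamma=1.2, alpha=128):
    """Grayscale/contrast/brightness/inversion/gamma/binarization chain
    of Equations (3)-(8). `img` is an H x W x 3 uint8 RGB array; the
    default parameter values here are illustrative only."""
    x = img.astype(np.float64)
    # Eq. (3): grayscale as the mean of the RGB components
    x = x.mean(axis=2)
    # Eq. (4): contrast stretching around the mid-point 128
    x = (x - 128.0) * eta + 128.0
    # Eq. (5): brightness shift
    x = np.clip(x + sigma, 0, 255)
    # Eq. (6): color inversion
    x = 255.0 - x
    # Eq. (7): gamma correction
    x = 255.0 * (x / 255.0) ** gamma
    # Eq. (8): binarization against threshold alpha
    return np.where(x <= alpha, 0, 255).astype(np.uint8)
```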

The second image is processed in almost the same way using other coefficients, such as the threshold values. The most significant difference is the absence of binarization on the second image. Images created in this way can be used to detect features that can help locate the ship. For this purpose, the SURF algorithm (Speeded-Up Robust Features) is used [16]. In a given image, it searches for features by analyzing the neighborhood based on the determinant of the Hessian matrix defined as follows:

$$H(p,\omega) = \begin{bmatrix} L\_{xx}(p,\omega) & L\_{xy}(p,\omega) \\ L\_{xy}(p,\omega) & L\_{yy}(p,\omega) \end{bmatrix} \tag{9}$$

where *Lij(p, ω)* represents the convolution of the integral image *I* at point *p* = *(x, y)* with a Gaussian kernel smoothed through parameter ω. The coefficients of this matrix can be represented by the following formulae:

$$L\_{xx}(p,\omega) = I(p)\frac{\partial^2}{\partial x^2}g(\omega) \tag{10}$$

$$L\_{yy}(p,\omega) = I(p)\frac{\partial^2}{\partial y^2}g(\omega) \tag{11}$$

$$L\_{xy}(p,\omega) = I(p)\frac{\partial^2}{\partial x\,\partial y}g(\omega) \tag{12}$$

The aforementioned integral image is an indirect representation called rectangle features. For a given position *p* = *(x, y)*, the sum of all pixels of the original image *I'*(·) is calculated as

$$I(p) = I(x, y) = \sum\_{i=0}^{x} \sum\_{j=0}^{y} I'(i, j) \tag{13}$$
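A minimal sketch of the rectangle-feature (integral image) representation of Equation (13), together with the constant-time rectangle-sum lookup it enables; the function names are ours:

```python
import numpy as np

def integral_image(gray):
    """Equation (13): I(x, y) is the sum of all pixels I'(i, j)
    of the original image with i <= x and j <= y."""
    return gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of the original pixels over the rectangle [x0..x1, y0..y1],
    computed in O(1) from the integral image `ii`."""
    total = ii[x1, y1]
    if x0 > 0:
        total -= ii[x0 - 1, y1]
    if y0 > 0:
        total -= ii[x1, y0 - 1]
    if x0 > 0 and y0 > 0:
        total += ii[x0 - 1, y0 - 1]
    return total
```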

The algorithm selects points at which the determinant of the matrix shown in Equation (9) reaches a local extremum. For this purpose, the values of the Hessian determinant and trace are determined:

$$\det(H(p,\omega)) = \omega^2\left(L\_{xx}(p)L\_{yy}(p) - L\_{xy}^2(p)\right) \tag{14}$$

$$\text{trace}(H(p,\omega)) = \omega(L\_{\text{xx}}(p) + L\_{yy}(p))\tag{15}$$
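The responses of Equations (14) and (15) can be sketched with Gaussian derivative filters as follows. Note that SURF itself approximates these derivatives with box filters over the integral image for speed; exact Gaussian derivatives are used here only for clarity, and the function name is ours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_response(gray, omega=1.2):
    """Determinant and trace of the Hessian (Equations (14)-(15)):
    the image is filtered with second-order Gaussian derivatives
    at scale omega, then weighted by omega as in the equations."""
    g = gray.astype(np.float64)
    Lxx = gaussian_filter(g, omega, order=(2, 0))  # d^2/dx^2
    Lyy = gaussian_filter(g, omega, order=(0, 2))  # d^2/dy^2
    Lxy = gaussian_filter(g, omega, order=(1, 1))  # d^2/dxdy
    det = omega ** 2 * (Lxx * Lyy - Lxy ** 2)      # Eq. (14)
    trace = omega * (Lxx + Lyy)                    # Eq. (15)
    return det, trace
```

Key points are then taken where `det` attains a local extremum.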

In this way, the key points are determined, i.e., the second image is processed without binarization. Then, for each key point found, its color is checked on the second image (after binarization). As a result, two sets of points are obtained belonging to one of two colors—black or white. However, the number of points can be very large, and it is therefore worth reducing it using the average distance for the entire set.

It is assumed that the set of key points for one of the colors is denoted as *p*(0), *p*(1), ..., *p*(*n*−1). The average distance ξ for all points in the set is calculated using the Euclidean metric defined as

$$\xi = \frac{\sum\_{i=0}^{n-1} \sum\_{j=0}^{n-1} \sqrt{\left(p\_{x}^{(i)} - p\_{x}^{(j)}\right)^2 + \left(p\_{y}^{(i)} - p\_{y}^{(j)}\right)^2}}{n^2 - n}\tag{16}$$

The calculated distance is used to analyze all points in the set. If, for any point, there is a neighbor at a distance smaller than ξ, the given point remains in the set. Otherwise, it is removed.
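The average-distance computation of Equation (16) and the pruning rule above can be sketched as follows (an illustrative NumPy implementation; the function name is ours):

```python
import numpy as np

def filter_keypoints(points):
    """Keep only points that have at least one neighbor closer than
    the mean pairwise distance xi of Equation (16)."""
    pts = np.asarray(points, dtype=np.float64)
    n = len(pts)
    if n < 2:
        return pts
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))   # all pairwise Euclidean distances
    xi = dist.sum() / (n * n - n)             # mean over the pairs with i != j
    np.fill_diagonal(dist, np.inf)            # ignore each point's zero self-distance
    keep = (dist < xi).any(axis=1)            # has a neighbor closer than xi
    return pts[keep]
```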

Finally, the method returns two sets of points that can be called features of the processed image. These points are used to find the maximum number of dimensions of the found object by finding the largest and smallest values of the coordinates *x* and *y*. The image between these extremes is cut out and used as a sample in the ship classification process. The model of image processing is shown in Figure 1.

**Figure 1.** Graphical hierarchy of image processing.

#### **3. Convolutional Neural Network for Classification**

The image in computing is a tensor, i.e., a mathematical object described by three components *w* × *h* × *d*. The first two components form the dimensions of the image, i.e., width and height, and the third is its depth. In the classic approach to this subject, the depth of the image is the number of components. Thus, for the RGB model, there are three values and the depth is three. As a result, three matrices are created in which the numerical values of a given color are stored. Such data can be used in the recognition process using artificial neural networks called convolutional neural networks. Such structures are inspired by the action and mechanisms of the cerebral cortex [10,17].

This input structure accepts a graphic file and processes it to classify it. It is composed of several types of layers, where the first is called the convolutional layer. It is interpreted as a filter of a given size *k* × *k*. This filter moves on the image and modifies it. However, it is worth noting that by dividing the image according to filters of different sizes, a grid of rectangles can be obtained. Each rectangle represents a part of the image to which a weight w is assigned, that is, a numerical value used for further training. Each part is interpreted as a neuron and returns the following value:

$$f\left(\sum\_{i=0}^{k}\sum\_{j=0}^{k}w\_{i,j}\circ a\_{x+i,y+j}\right)\tag{17}$$

where *f* (·) is the activation function, and *ax,y* is the activation value for the output from position *(x, y)*.

The convolutional layer is responsible for the extraction of features and eliminating redundancy in the data. It is easy to see that the grids can overlap, which may result in multiple appearances of the same fragment. To prevent this, a pooling layer is used to reduce the size of the image by extracting only the most appropriate features. For this selection, the maximum value is typically used. Both layers may occur several times depending on the problem and its size (understood as image size). A third type of layer, a fully connected layer, is also used. It is a classical neural network, the input to which consists of data from the previous layer (for each pixel, there are three values). The operation of the classical neural network can be described as the processing of numerical values. Neurons are arranged in columns called layers, where neurons between neighboring layers are related to one another by weighted connections. For each neuron, the output value *xi* and weight *w* from the previous neurons are sent. The value of each neuron is calculated as in Equation (17), that is, as the sum of the products of weights and values of neurons from the previous layer rescaled by the activation function.
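A minimal sketch of the neuron output of Equation (17) and of max pooling, assuming a single-channel activation map and a ReLU activation (our choice for illustration; function names are ours):

```python
import numpy as np

def conv_activation(a, w, f=lambda z: np.maximum(z, 0.0)):
    """Equation (17): slide the k x k filter w over the activation map a;
    each position returns f(sum_ij w[i, j] * a[x+i, y+j])."""
    k = w.shape[0]
    h, wd = a.shape
    out = np.empty((h - k + 1, wd - k + 1))
    for x in range(h - k + 1):
        for y in range(wd - k + 1):
            out[x, y] = (w * a[x:x + k, y:y + k]).sum()
    return f(out)

def max_pool(a, s=2):
    """Pooling layer: keep the maximum of each s x s block."""
    h, w = a.shape[0] // s * s, a.shape[1] // s * s
    return a[:h, :w].reshape(h // s, s, w // s, s).max(axis=(1, 3))
```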

As the output of the entire network, values are obtained that are not yet a probability distribution. They are normalized using the softmax function. The approximate values should indicate the assignment to the class predicted by the classifier.
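The softmax normalization mentioned above can be sketched as:

```python
import numpy as np

def softmax(z):
    """Normalize raw network outputs into a probability distribution."""
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()
```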

However, all weights in the neural network are assigned in a random manner, which causes the classifier to return meaningless information. To remedy this, an important step is to train the classifier. The training process consists of checking the assignment returned by the network and comparing it with the correct one. The comparison is made using a loss function. The idea of training is to minimize the loss function by modifying weights in the network. An example of an optimization algorithm used for training is adaptive moment estimation, often called the Adam optimizer [18]. The algorithm is based on calculating estimates of the mean *m* and variance *v* of the gradient for each iteration *t*. If a single weight is *w(t)* and *gt* is its gradient, these estimates have the following form:

$$
m\_t = \beta\_1 m\_{t-1} + (1 - \beta\_1) g\_t \tag{18}
$$

$$
v\_t = \beta\_2 v\_{t-1} + (1 - \beta\_2) g\_t^2 \tag{19}
$$

where β<sub>1</sub> and β<sub>2</sub> are exponential decay rates. Using them, the following bias-corrected estimates can be defined:

$$
\hat{m}\_t = \frac{m\_t}{1 - \beta\_1^t} \tag{20}
$$

$$
\hat{v}\_t = \frac{v\_t}{1 - \beta\_2^t} \tag{21}
$$

These bias-corrected estimates are used to update the weights in the network using the following equation:

$$
\theta\_{t+1} = \theta\_t - \frac{\eta}{\sqrt{\hat{v}\_t} + \varepsilon} \hat{m}\_t \tag{22}
$$

where θ*t* denotes the weight at iteration *t*, η is the learning rate, and ε is a very small value preventing division by zero.
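One Adam update for a weight vector, following Equations (18)–(22), can be sketched as follows; the β<sub>1</sub>, β<sub>2</sub>, and ε defaults are the commonly used Adam settings, not values from this work:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for weights theta with gradient g at iteration t >= 1.
    m and v carry the moment estimates between iterations."""
    m = beta1 * m + (1 - beta1) * g                       # Eq. (18)
    v = beta2 * v + (1 - beta2) * g ** 2                  # Eq. (19)
    m_hat = m / (1 - beta1 ** t)                          # Eq. (20)
    v_hat = v / (1 - beta2 ** t)                          # Eq. (21)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # Eq. (22)
    return theta, m, v
```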

#### **4. Model of Ship Classification System**

The idea of the proposed system is to create software that can process real-time video samples and categorize ships appearing in them. The camera recording the image can be placed in any position, such as a water buoy or a bridge. The video being recorded is composed of many frames, which may not necessarily feature a vessel. Moreover, it makes no sense to process every frame: one second of recorded video usually contains 24 frames, and it is not possible to process this amount of data in real time.

Hence, image processing can be minimized to one frame every one or two seconds. This prevents the task queue from overloading and blocking. However, keeping the camera in one position leads it to record the same elements in space. Therefore, to avoid searching for ships in nearly identical images, a pattern was created to avoid the need for continuous processing. The idea of this solution is to find key points at different times of the day and in different weather (for this purpose, the SURF algorithm was used). If a given frame has only points that coincide with the previous pattern, the frames do not need to be analyzed in terms of searching for a floating vehicle. If there are greater deviations from the standard (for example, a greater number of points), image processing should be performed as described in Section 2, and classification should then be performed using the convolutional neural network described in Section 3.
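The one-frame-per-interval sampling described above can be sketched as follows (an illustrative helper; the function name and default values are ours):

```python
def frames_to_process(total_frames, fps=24, interval_s=1.0):
    """Indices of the frames that would be analyzed under the sampling
    rule above: at most one frame per `interval_s` seconds of video."""
    step = max(1, int(fps * interval_s))
    return list(range(0, total_frames, step))
```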

The proposed data processing technique allows the area of the incoming image to be reduced. An additional advantage is the creation of two images through parallel image processing. In the classification process, the images are downscaled in order to normalize the size of all samples. The effect of this action is to obtain two images that are distorted by size reduction.

#### **5. Experiments**

The previous sections described the image processing model to extract only fragments and the model of the classifier that categorizes incoming images into selected classes. Depending on the number of classes, this is the output of the network. For *n* classes, there are *n* neurons in the last layer, and the classes are saved as *n*-element vectors consisting of *n* − 1 zeros and a single one. For each class, the "1" is in a different position in the vector. The experimental part is divided into two parts: Tests on public images and tests on real data collected by means of cameras mounted on bridges and placed on the bank in the port of Szczecin.
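The class encoding described above (an *n*-element vector of *n* − 1 zeros and a single one) can be sketched as:

```python
import numpy as np

def one_hot(label, n):
    """n-element class vector with a single one at the class index."""
    vec = np.zeros(n)
    vec[label] = 1.0
    return vec
```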

#### *5.1. Experiments on Test Data*

To check the operation of the proposed method, the authors used an image database divided into four classes, each of them composed of 200 samples. The classes represented four types of ships—a boat, a motor yacht, a navy ship, and a historic ship. The image processing method presented in Section 2 was tested with parameter values in the range [0, 15]. The adjustment of these values for the system led to the setting of an average number of key points. All coefficients were used in different combinations, and the authors selected the one that returned approximately 15–20 key points (after processing). The selected configuration for image processing is presented in Table 1. This number is not accidental: the authors noticed that if the number of key points was small, a poorer region was returned. Conversely, when the number of key points was large, the original image was returned. This is why the configuration was chosen in an empirical manner.

In this way, the best configuration was selected and the files were processed. It is worth noting that two images were always generated from one sample, as presented in Figure 2. Before these samples were used for training, images among them that did not contain vessels were removed. The database of each class (composed of 200 samples) was thus extended to an average value of 320 images (first class, 343; second class, 316; third class, 324; and fourth class, 299). The obtained samples did not always contain the entire vessel, but only fragments or other objects in the image. This was important for the training process because it could not be guaranteed that an incoming vessel would be oriented in a particular manner. In addition, using the same method of extraction yielded a similar sample.


**Table 1.** The selected configuration for image processing.

**Figure 2.** Sample examples of images, where the rectangles indicate the extracted images: (**a**) boat; (**b**) motor yacht; (**c**) navy ship; and (**d**) historic ship.

The proposed technique for processing and extracting features allowed for an average of 320 samples, whereas the ideal solution would return exactly 400 (double the amount from the entry database). However, the different conditions in which the images were taken (in terms of quality, distance, and exposure) and the background posed problems. For ships in high seas, the extraction was perfect. However, a large part of these samples was set against the background of forests or cities. Thus, the recorded average extraction efficiency of 80% is a good result. It might have been affected by incorrect filter configuration. An empirical approach to the evaluation of the obtained data indicated the best adjustment of these values.

Table 2 presents the structure of the classification network. The layers were selected in order to obtain the highest effectiveness. It was noted that the proposed technique for the division of samples in the ratio 80:20 (training:verification) and the learning rate of η = 0.001 gives the best results. We trained the classifier during 100 epochs. The resulting efficiency was 89%. A graph of the training is shown in Figure 3.
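The 80:20 training:verification split can be sketched as follows (an illustrative helper; the function name and random seed are our own choices):

```python
import numpy as np

def split_dataset(samples, labels, ratio=0.8, seed=0):
    """Shuffle and split arrays into training and verification parts
    in the given ratio (80:20 by default, as in the experiments)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    cut = int(len(samples) * ratio)
    return (samples[idx[:cut]], labels[idx[:cut]],
            samples[idx[cut:]], labels[idx[cut:]])
```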


**Table 2.** Architecture of convolutional neural network.

**Figure 3.** Learning history of the classifier obtained using the created database.

To more accurately verify the operation of the classifier, 118 false samples were added to the database. In our experiments, different sizes of images were analyzed. For a given architecture, we tested the classifier in order to achieve the best possible accuracy. The obtained measurements are presented in Figure 4.

The classification results were then checked and are presented as a confusion matrix in Figure 5. In addition, matrices for individual classes were made, as shown in Figure 6. The proposed classifier categorized every wrong sample into one of the classes of vessels. A condition was thus added (to the extraction stage) that if no key point was found in an image, the image was rejected.

**Figure 5.** The confusion matrix of the classification model.

**Figure 6.** The confusion matrices of the classification model for each class: (**a**) boat; (**b**) motor yacht; (**c**) navy ship; and (**d**) historic ship.

Based on the results, the authors determined the accuracy, overlap, sensitivity, specificity, and the F-measure. The results are presented in Table 3.


**Table 3.** Mean quality measures for the trained classifier.

The resulting effectiveness was nearly 92%, which is high considering that a small number of samples were used for training. A higher value was obtained for sensitivity, which implies accurate prediction of true samples. The F1 score was also high, which indicates that the classifier was efficient. Considerably worse results were obtained for specificity, nearly 0.541, implying that the classification of true negatives was substandard. The same result was obtained for the rate of false omission, indicating that a large portion (almost 30%) of the false samples had been rejected.

As part of the conducted research, classification tests on a given database were performed with and without the proposed detection method and the results are presented in Figure 7.

**Figure 7.** Comparison of different classification techniques on a selected database. SVM—support vector machine; KNN—k-nearest neighbors; CNN— convolutional neural network.

First, we ran a test using a convolutional neural network without detection. The efficiency was about 80%. For classic classifiers such as k-nearest neighbors (KNN) and support vector machine (SVM), the results were similar. In both cases, the proposed detection technique increased the effectiveness. However, the highest results were obtained for the proposed technique combined with a convolutional neural network.

#### *5.2. Experiments on Real Data*

In the next step, we analyzed the proposed solution on real data. To test the proposed solution in real conditions, we created a database containing 18,361 image files covering six classes of boats—barge (1703 images), yacht (1489 images), marine service (2362 images), motorboat (6980 images), passenger (3629 images), and other kinds (2198 images). The data were obtained from cameras mounted on bridges and placed on the bank in Szczecin, Poland. The resulting video data were processed with detection to obtain video frames containing ships passing on inland waters. Within one second, a maximum of one frame was collected. For the needs of the experiments, a maximum of 20 photos of one ship crossing the bridge were kept in the database. Examples of these images are presented in Figure 8.

**Figure 8.** Sample examples of images: (**a**) barge; (**b**) yacht; (**c**) marine service; (**d**) motorboat; (**e**) passenger; and (**f**) other.

Images obtained at the stage of detection differ from each other in size and quality. As part of the research, the same classification architecture was used, which is presented in Table 2, with one difference: there were six neurons in the output layer because of the number of classes. For the purposes of conducting experiments, the images within each class were divided in the ratio 80:20 (training:verification), with a training coefficient η = 0.001 and 100 iterations. After the training process, the database was enlarged by an additional 300 pictures that did not fit into any of the six classes. For each of the classes, plus the previously mentioned fake images, the efficiency of the trained classifier was checked. The results in the form of confusion matrices are presented in Figure 9 for the classifier and for each class in Figure 10.

**Figure 9.** The confusion matrix of the classification model for real data.

**Figure 10.** The confusion matrices of the classification model for each class: (**a**) barge; (**b**) yacht; (**c**) marine service; (**d**) motorboat; (**e**) passenger; and (**f**) other.

Statistical tests for each class are presented in Table 4. In the case of the average performance of the proposed classification, accuracy reached 63%. As in the case of test data, a higher value was obtained for sensitivity. In addition, the recall is at the same level. The F1 score was also high, which indicates that the classifier is efficient. Here, we also see that worse results were obtained for specificity and miss rate.


**Table 4.** Mean quality measures for the trained classifier.

Next, for the database thus created, the operation of the proposed method was verified with other classic architectures of the convolution network as shown in Figure 11.

**Figure 11.** Comparison of different convolutional neural networks on a selected database.

In our experiments, we used a strategy of training some layers and leaving others frozen in existing CNN models such as VGG16 and VGG19 [19]. Such a strategy was used to transfer knowledge from previously trained classifiers so as not to train classifiers from scratch. While choosing which layers to leave frozen, we paid attention to possible overfitting. We chose a few examples of architectures to compare the proposed solution with existing ones. In the case of the first version (v1 on the chart), the last five layers were frozen, and in the second case (v2 on the chart), the last six layers were frozen. In both cases, two fully connected layers were added to the architecture—in both cases, the first layer was composed of 512 neurons with a ReLU function, and the last one had six neurons. These architectures yielded lower effectiveness than the proposed technique. In summary, the proposed solution can be used in the system being built. Although the classes are similar, the network was able to classify the images. However, the number of samples in the database should be increased.

#### **6. Conclusions**

The identification and categorization of ships in images can be very useful for creating a register of passing ships, observing boundaries, and even recording information on a catastrophic recorder. This paper proposed a model to extract features to identify marine vessels from a single frame of an image and a deep learning-based method to categorize them. The authors analyzed only four categories of vessels, each of which was initially composed of 200 sample images. Features of ships were detected in images through the parallel processing of an image using different filters. The aim was to distort images so that only important elements yielded feature maps. Both images obtained from the parallel filtering process were used to find features common to them using different filters. The Hessian matrix was used for this. For feature extraction, the database was extended to include additional samples used to train the classifier. During classification, an image was assumed to be a tensor with its number of components as its depth (RGB model). A convolutional neural network was used for the recognition process. Its structure was inspired by the action and mechanisms of the cerebral cortex. Following tests and the reconfiguration of the network, the obtained efficiency for test data was nearly 90%, which is a very good result considering the small number of samples. In the case of a real database with six classes, the average accuracy reached 63%. The proposed solution will be implemented in the identification system being built. However, the database should be increased.

In the SHREC system, the camera stream is analyzed using image processing methods for detection, classification, and text recognition to properly identify a vessel. The method for ship classification presented in this article will form part of ongoing research in the SHREC project. Future work will focus on improving the proposed classification method. The authors intend to modify the parameters of the method to be able to classify more types of ships. For this purpose, a more extensive database of images of ships is being gathered using video cameras at varying resolutions.

**Author Contributions:** Conceptualization, M.W.-S. and D.P.; methodology, D.P.; software, D.P.; data curation, M.W.-S. and D.P.; writing—original draft preparation, M.W.-S. and D.P.; writing—review and editing, M.W.-S. and D.P.

**Funding:** This work was supported by the National Centre for Research and Development (NCBR) of Poland under grant no. LIDER/17/0098/L-8/16/NCBR/2017 and grant no. 1/S/IG/16, financed by a subsidy from the Ministry of Science and Higher Education for statutory activities.

**Conflicts of Interest:** The authors have no conflicts of interest to declare.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **WatchPose: A View-Aware Approach for Camera Pose Data Collection in Industrial Environments**

#### **Cong Yang 1, Gilles Simon 1, John See 2, Marie-Odile Berger <sup>1</sup> and Wenyong Wang 3,\***


Received: 29 April 2020; Accepted: 25 May 2020; Published: 27 May 2020

**Abstract:** Collecting correlated scene images and camera poses is an essential step towards learning absolute camera pose regression models. While the acquisition of such data in living environments is relatively easy by following regular roads and paths, it is still a challenging task in constricted industrial environments. This is because industrial objects have varied sizes and inspections are usually carried out with non-constant motions. As a result, regression models are more sensitive to scene images with respect to viewpoints and distances. Motivated by this, we present a simple but efficient camera pose data collection method, WatchPose, to improve the generalization and robustness of camera pose regression models. Specifically, WatchPose tracks nested markers and visualizes viewpoints in an Augmented Reality- (AR) based manner to properly guide users to collect training data from broader camera-object distances and more diverse views around the objects. Experiments show that WatchPose can effectively improve the accuracy of existing camera pose regression models compared to the traditional data acquisition method. We also introduce a new dataset, Industrial10, to encourage the community to adapt camera pose regression methods for more complex environments.

**Keywords:** data acquisition; augmented reality; pose estimation; deep learning; industrial environments

#### **1. Introduction**

Camera pose (location and orientation) estimation is a fundamental task in Simultaneous Localization and Mapping (SLAM) and Augmented Reality (AR) applications [1–3]. Recently, end-to-end approaches based on convolutional neural networks have become popular [4–13]. Instead of using machine learning for only specific parts of the estimation pipeline [14–17], these methods aim to learn the full pipeline with a set of training images and their corresponding poses. Building on that, the trained models directly regress the camera pose from an input image. Several works [4,7] in the literature report that those methods are plausible in regular living environments (e.g., along the street or path), achieving around a 9∼25 m and 4∼17◦ accuracy in localization and orientation, respectively.

However, the utility of such methodology is limited in industrial environments. This is because people in such environments usually exhibit two kinds of typical motions: (1) They inspect industrial objects with non-constant motions and views and (2) they inspect industrial objects from different distances. For the first case, traditional data collection methods cannot cover enough viewpoints to properly train a generalized pose regression model. For example, in [5] a smartphone was used by a pedestrian to take videos around each scene while in [7,18], robot cars were designed to take pictures along the street. Meanwhile, viewpoints in industrial environments may be restricted to only random collection and limited moving directions. Figure 1a visualizes the viewpoints of the video collection method and it is obvious that most regions around the industrial object are uncovered. As a result, the trained models easily over-fit to such limited training data and may not generalize well to uncovered scenes. For the second case, a simple idea is to apply data augmentation techniques such as image zooming to imitate different camera-object distances (hereinafter referred to as "camera distance"). However, over-zooming could reduce the quality of training data and degrade the regression accuracy [19]. To solve these problems, one possible way is to reconstruct a 3D model of the target object and then generate training data via rendering [20–22]. Nevertheless, the problem we tackle here involves more complex scenarios: 3D object models are not always available in industrial environments and in many cases, they are hard to construct because of the presence of textureless and specular surfaces under sharp artificial lights [1]. Figure 1b shows an example of an industrial object and its reconstructed 3D model using Kinect Fusion [23,24]. We can easily observe the missing parts (the black holes) in the model.
In fact, as shown in Figure 2, industrial environments are typically inundated with such textureless and occluded objects and specular surfaces, etc.

**Figure 1.** Challenges of collecting training data in industrial environments. (**a**) There are some uncovered regions within the training data using traditional video collection [6]. (**b**) There are some missing parts within the reconstructed 3D model on specular surfaces using Kinect Fusion [23,24].

**Figure 2.** Sample images in industrial environments: (**a**) textureless objects; (**b**) specular surfaces; (**c**) artificial lights.

Thus, we seek to find an efficient approach that can collect pose training data from sufficient viewpoints and camera distances. Towards this goal, as shown in Figure 3, a view-aware approach, WatchPose, is introduced. The basic idea of WatchPose is to place a marker close to the target object and the training data is collected through marker tracking [25,26] with dynamic viewpoint control around the object. Here, we introduce three strategies to improve the efficiency of this process: (1) We propose Nestmarker, a combination and nesting of traditional markers, for marker tracking. The Nestmarker performs detection and tracking from different camera distances since it contains two markers with fixed relative positions and different sizes. Thus, we can flexibly collect training data from a range of larger and smaller camera distances as compared to traditional markers. (2) During the data collection process, virtual imagery is drawn on the marker for checking the correctness of marker tracking (the blue box). (3) A virtual ball is drawn for visualizing the captured viewpoints (the red points) and navigating the uncovered regions. With these strategies, our data collection approach is applied in a real-time setting and it is robust towards variations in objects, surfaces, and lights since the camera poses are directly generated from marker tracking. Moreover, it is easier for us to visually control the camera distances, viewpoint locations, and densities since this approach can navigate the camera controller (e.g., people, robot, etc.) to move the camera dynamically so that most regions around the object can be covered. Once the training data are collected, we apply a set of post-processing steps such as calibrating the camera poses from the marker to the target object, data augmentation, etc.

**Figure 3.** Main idea of WatchPose. (**a**) The Nestmarker is placed close to the target object. (**b**) A virtual imagery is drawn for checking the correctness of marker tracking (the blue box) from different camera distances. (**c**) For each camera distance interval, a virtual ball is drawn and automatically switched for visualizing the captured viewpoints (the red points) and navigating the uncovered regions.

To encourage the community to adapt camera pose regression methods towards more complex domains, we collected a new benchmark dataset, Industrial10, to reflect the challenges of industrial environments. Industrial10 comprises training and testing data of 10 industrial objects. To assess the efficiency of the proposed method, five actively used pose regression methods are employed in our experiments. Evaluations show that the proposed method can effectively improve the camera pose regression accuracy compared to the traditional data collection method.

The contributions of this work are summarized as follows: (1) We propose a novel training data collection method called WatchPose to improve the performance of camera pose regression in industrial environments. The proposed method is sufficiently general and can be extended to other scenarios. (2) We introduce a new dataset, Industrial10, to spur the computer vision systems community towards innovating and adapting existing camera pose regression approaches to cope with more complex environments.

#### **2. Related Works**

Here, we briefly glance through several existing camera pose data collection strategies, followed by a review of camera pose regression methods. For a more detailed treatment of this topic, the recent compilation by Shavit and Ferens [27] offers a good review.

#### *2.1. Camera Pose Data Collection*

In general, camera pose collection methods can be classified into two categories: direct and indirect approaches. With direct approaches, camera poses are acquired from markers or physical sensors. For example, Brachmann et al. [28] collected the ground-truth camera poses by integrating a set of traditional markers densely surrounding the target object. Tiebe et al. [29] collected the camera poses via a finely controlled robot arm. In most cases, camera poses are collected indirectly. For instance, the actively used dataset 7Scenes [30] was first recorded with a handheld Kinect RGB-D camera [31]. After that, the ground-truth camera poses were extracted by the KinectFusion technique [24]. To generate the Cambridge Landmarks dataset, Kendall et al. [5] first captured high-definition videos around each scene using a smartphone and then extracted ground-truth camera poses using Structure from Motion (SfM) techniques [32]. Alternatively, Wohlhart et al. [20] first reconstructed 3D models of the target objects and then extracted training data (images and their pose annotations) by rendering the 3D models. More recently, the WoodScape [18] dataset was collected from a car mounted with a set of sensors (e.g., Inertial Measurement Unit (IMU), GPS, LiDAR, multiple cameras, etc.) along selected streets in the USA, Europe, and China. Camera poses for this dataset could be extracted by fusing the LiDAR, IMU, and camera sensors. However, these indirect collection methods cannot be properly applied in industrial environments since industrial objects are normally textureless and may possess specular surfaces under strong artificial lights. For direct methods, it is also not feasible to use traditional markers or robot arms. This is because, in addition to high costs, the space around industrial objects is normally limited, and people often inspect them from varying distances. Therefore, we introduce WatchPose, which can better imitate watchers' motion and avoid the aforementioned problems. WatchPose extends the idea of traditional markers [28] to support data collection from both close and far distances. Moreover, it is robust to textureless and specular surfaces since the camera poses are collected directly from the Nestmarker. Finally, as more viewpoints are covered during WatchPose data collection, the trained models can generalize more robustly.

#### *2.2. Camera Pose Estimation*

Leveraging the idea of transfer learning, more and more researchers attempt to learn models for pose estimation tasks. Generally, these methods function by training descriptors, classifiers, and regressors in a variety of ways. For descriptors, Gu et al. [33] built discriminative models by training a mixture of HOG templates, while Aubry et al. [34] employed them for 3D object detection and pose estimation. In contrast to mixed descriptors, Masci et al. [35] trained a single-layer neural network with different hashing approaches to compute discriminative descriptors for omnidirectional image matching. Wohlhart et al. [20] extended this idea further and trained a LeNet [36] to compute features for rendered object views that capture both the object identity and the camera pose. Though these approaches can efficiently handle a large number of objects under a large range of poses, they rely heavily on handcrafted representations [33,34] and 3D object models [20–22]. For classifiers, Schwarz et al. [22] computed image features with a pre-trained Convolutional Neural Network (CNN) and then fed them to a Support Vector Machine (SVM) to determine object class, instance, and pose. Brachmann et al. [28,37] employed image features from [30] to train a random forest for object detection, tracking, and pose estimation. These approaches achieved promising performance in cluttered scenes, but they normally require additional reconstruction steps to generate dense scenes and object models. For regressors, Shotton et al. [30] introduced a regression forest capable of inferring an estimate of each pixel's correspondence from a given image to 3D points in the scene's world coordinate frame. With this, the computation of feature descriptors is not required. Unlike their approach, Gupta et al. [38] proposed a 3-layer convolutional neural network to regress coarse poses using detected and segmented object instances.
However, these algorithms are constrained by the use of RGB-D images to generate the scene coordinate labels, which in practice limits their use in industrial environments. To improve on this, Kendall et al. [5,6] proposed PoseNet, which directly regresses the camera pose from RGB images. However, it easily overfits its training data, and its localization error on indoor and outdoor datasets was an order of magnitude larger. Such limitations motivated a surge of absolute pose estimation methods, such as Bayesian PoseNet [39], MapNet [7], LSTM-Pose [40], Hourglass-Pose [8], SVS-Pose [9], BranchNet [10], and NNnet [11], which involve deeper encoders, better loss functions, and more sensor data. However, as discussed in [4], these pose regression models are more closely related to pose approximation via image retrieval. In other words, the training data should cover as many viewpoints and camera distances as possible. Thus, we try to improve the performance of pose regression models by feeding them better data. Experiments in Section 4 show that the performance improvement with better training data is pronounced, and particularly suitable for industrial types of objects.

#### **3. WatchPose**

WatchPose employs a Nestmarker-based strategy to generate training images and their camera pose labels, built on AR frameworks [1]. The reason is that AR frameworks have powerful libraries for robust and real-time marker tracking [25]. In other words, we can directly collect accurate camera poses during the marker tracking process. The proposed pose generation strategy is general and can be built on top of most existing pose training and estimation methods.

#### *3.1. Nestmarker*

Inspired by [41], the Nestmarker is a combination and nesting of traditional markers (Figure 4). In particular, the inner 50% of the Nestmarker is a traditional 40 mm square marker plus a 10 mm white square gap. We call this inner marker the small marker. Using it as a pattern, we add a continuous border to make a 100 mm square marker (the big marker). In other words, a Nestmarker contains two square markers: a small marker and a big marker. Hence, the proposed Nestmarker has three distinct features: (1) Since the patterns of the small and big markers are rotationally asymmetric, both markers can be trained and tracked independently. (2) By marker tracking, the collected camera poses from the small and big markers are correlated since their relative positions are fixed and their sizes are known. (3) The big marker can be tracked from larger camera distances than the small marker. Based on our preliminary experiment using a 1080p HD webcam and 1024 × 768 frame resolution, the small and big markers could be detected at camera distances between 5 cm and 80 cm and between 13 cm and 300 cm, respectively. With these features, we developed a Nestmarker tracking system, as illustrated in Figure 4c. In particular, the big marker is triggered and tracked (e.g., red box) when the small marker cannot be detected due to a large camera distance. Otherwise, the small marker is triggered and tracked (e.g., blue box). In both cases, the square marker, from the black continuous border to the inner part, is tracked.
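The small/big marker fallback described above can be sketched as a simple selection rule. This is a hypothetical helper for illustration only: in practice, the detection flags and distance estimate would come from the AR tracking library, and the distance bounds are those from the preliminary experiment above.

```python
def select_marker(small_detected: bool, big_detected: bool,
                  camera_distance_cm: float) -> str:
    """Pick which nested marker to track (illustrative helper).

    Prefers the small (inner) marker inside its working range
    (roughly 5-80 cm), otherwise falls back to the big marker
    (roughly 13-300 cm)."""
    if small_detected and camera_distance_cm <= 80:
        return "small"   # small marker visible and in range
    if big_detected:
        return "big"     # fall back to the big marker
    return "none"        # neither marker trackable
```

With this rule, moving the camera away from the object automatically hands tracking over from the small to the big marker, matching the behavior shown in Figure 4c.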

#### *3.2. Data Collection*

In our work, the ARToolKit [25] library is employed for Nestmarker learning, tracking, and camera pose collection. Figure 5 shows an overview of our strategy. Firstly, the small and big marker patterns (Figure 5a,b) are independently learned by our system so that they can be detected in the subsequent steps. After that, the Nestmarker is placed close to an object (Figure 5c) and is tracked by a camera moving around the object. During tracking (the second row in Figure 5), a virtual image (blue box) is synchronously drawn above the marker so that we can visually check whether the marker has been detected. If the marker is correctly detected, a viewpoint (shown with a red point) is simultaneously plotted on the virtual ball, and the current image and the real camera pose are saved. In some cases, the collected image could be badly blurred due to fast camera motion or an out-of-focus condition. For this, we quantitatively evaluate its blur based on the method in [42] and discard the image and its camera pose if its blur value exceeds a predefined threshold, which is set to 0.5 in our experiments.
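The blur-based filtering step can be sketched as follows. This is a minimal sketch: the blur score itself is assumed to come from an external no-reference blur estimator such as the method in [42], which is outside the scope of this snippet.

```python
def filter_capture(image, pose, blur_score, threshold=0.5):
    """Keep an (image, pose) pair only if its blur score does not
    exceed the threshold (0.5 in our experiments); otherwise discard
    both. `blur_score` is assumed to be in [0, 1], from an external
    no-reference blur estimator."""
    if blur_score > threshold:
        return None           # badly blurred: discard image and pose
    return (image, pose)      # accepted: save image with its pose
```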

During the capturing process, the virtual ball is automatically switched and rolled depending on the camera position and orientation. This strategy can effectively navigate the camera controller by determining the direction that the camera should be moved in order to capture training data from uncovered regions and distances. The pose computed using our algorithm is expressed in the Marker Coordinate System (MCS). The labels and/or added graphics must be expressed in this frame. To assist in this placement of the virtual scene in the MCS, reference points representing the object may be added in that frame. To do this, two views of the object with the marker are all that is needed. Using the marker, the poses for both views can be calculated and the 3D reference points can be reconstructed in the MCS by triangulation [43]. Pairing the points between the two views can be done manually if a few reference points are sufficient, or with the help of a Scale-invariant Feature Transform (SIFT)-like descriptor [44] if a denser point cloud is desired.

**Figure 5.** Camera pose training data collection using WatchPose.

In order to properly plot viewpoints on the virtual ball, as shown in Figure 6a, a viewpoint (red point) is mapped to the virtual ball surface (arc) from a real camera position (triangle) with respect to the ball center point (black point). In our case, the real 3D camera position **x** is calculated by **x** = −*R*^T · *T*, where *R* is the 3 × 3 camera rotation matrix and *T* is the 3 × 1 camera translation vector, both of which are directly generated in our system based on the ARToolKit library.
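In code, the camera-centre computation **x** = −*R*^T · *T* amounts to a small matrix-vector product. The following is a plain-Python sketch; a real system would typically take the pose directly from the AR library.

```python
def camera_position(R, T):
    """Camera centre x = -R^T * T, with R a 3x3 rotation matrix
    (list of rows) and T a length-3 translation vector."""
    # (R^T * T)[c] = sum_r R[r][c] * T[r]; negate each component.
    return [-sum(R[r][c] * T[r] for r in range(3)) for c in range(3)]
```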

**Figure 6.** Viewpoint calculation and density control.

As shown in Figure 6a, we introduce a threshold *λ* to control the minimal viewpoint distance. With this, we can dynamically control the viewpoint distribution density by changing the *λ* value. In other words, if the current viewpoint is too close to an already plotted one based on the threshold *λ* (i.e., the distance is < *λ*), the system will discard the current camera pose. It should be noted that there is a high probability that viewpoints from different camera positions could overlap on the virtual ball (Figure 6b) if they have the same orientation and their locations lie on the same line with respect to the object. As such, we use multiple virtual balls to visualize and control the viewpoint distribution based on real camera distance intervals (Figure 5). In this work, we predefine three camera distances, 0.5 m, 1.0 m, and 2.0 m, to control and visualize viewpoint distributions separately. For the overlapping and uncovered viewpoints at other distances (e.g., 0.3 m, 1.3 m, etc.), we introduce a data augmentation strategy, described in Section 3.3, to enrich the collected data. There is another possibility that the Nestmarker may be occluded by the object under certain viewpoints, so that it cannot be properly detected (Figure 6c). In this case, the small marker is triggered if the big marker is occluded. If both markers are occluded, we can place multiple markers around the object in advance, depending on the scenario. In our case, we only place another small marker close to the object to avoid occlusion, since this phenomenon normally appears when the camera is close to the object.
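The viewpoint density check driven by *λ* can be sketched as follows (a hypothetical helper: viewpoints are taken as 3D points on the virtual ball, with *λ* in the same units).

```python
import math

def accept_viewpoint(candidate, plotted, lam):
    """Accept a candidate viewpoint only if it is at least `lam`
    away from every already-plotted viewpoint; with lam = 0,
    everything (including duplicates) is accepted."""
    for p in plotted:
        if math.dist(candidate, p) < lam:
            return False   # too close to an existing viewpoint: discard
    return True
```

Note that `lam = 0` never rejects a candidate, which matches the *λ* = 0 setting used later in the density-control experiment.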

In practice, some industrial objects may have limited space around them (e.g., located in a corner or occluded) for data collection. In such a case, as shown in the example in Figure 7a, a flexible Nestmarker is proposed in which the black continuous border and the nested small marker are detachable. In particular, we first capture training data from large camera distances using the big marker (Figure 7b). After that, the black continuous border is removed and only the small marker is kept for capturing training data from smaller camera distances (such as 5 cm to 80 cm in our case) (Figure 7c). Using this strategy, the proportion of the marker in an image can remain small.

**Figure 7.** Flexible use of Nestmarker in narrow environments.

#### *3.3. Parametrization and Augmentation*

For a collected camera pose **p**, we parameterize it with a 7-dimensional pose vector **p** = [**x**, **q**], where **x** contains 3 values representing the camera position and the quaternion **q** contains 4 values representing the camera orientation. Here, the quaternion **q** can be directly converted from the camera rotation matrix *R*. Consequently, for each collected image *I*, a 7-dimensional **p** is constructed to represent the camera pose of *I*. We also apply an inpainting process using Coherency Sensitive Hashing (CSH) [45] to remove the marker from each image. The main reason is that in practice there is a time gap between the training and application phases: it is quite normal to have a marker present during training and to remove it at application time. CSH relies on hashing to seed the initial local matching and then on image coherence to propagate good matches. As the marker location is automatically detected and saved by ARToolKit, CSH can quickly find matching parts in the marker's neighbourhood in the image plane and thereby replace the marker region. Moreover, to reduce the inpainting error, markers can be placed on surfaces with a homogeneous color or texture. This is not challenging in industrial environments since most walls and surfaces are painted in uniform colors.
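The rotation-matrix-to-quaternion conversion used to build **p** = [**x**, **q**] can be sketched with a standard trace-based method, assuming *R* is a proper rotation matrix. This is a generic textbook conversion, not necessarily the exact implementation used in the paper.

```python
import math

def quat_from_rotation(R):
    """Convert a 3x3 rotation matrix (list of rows) to a unit
    quaternion (w, x, y, z), using the standard trace-based branches."""
    t = R[0][0] + R[1][1] + R[2][2]
    if t > 0:
        s = math.sqrt(t + 1.0) * 2                               # s = 4w
        return (0.25 * s,
                (R[2][1] - R[1][2]) / s,
                (R[0][2] - R[2][0]) / s,
                (R[1][0] - R[0][1]) / s)
    if R[0][0] >= R[1][1] and R[0][0] >= R[2][2]:
        s = math.sqrt(1.0 + R[0][0] - R[1][1] - R[2][2]) * 2     # s = 4x
        return ((R[2][1] - R[1][2]) / s, 0.25 * s,
                (R[0][1] + R[1][0]) / s, (R[0][2] + R[2][0]) / s)
    if R[1][1] >= R[2][2]:
        s = math.sqrt(1.0 + R[1][1] - R[0][0] - R[2][2]) * 2     # s = 4y
        return ((R[0][2] - R[2][0]) / s, (R[0][1] + R[1][0]) / s,
                0.25 * s, (R[1][2] + R[2][1]) / s)
    s = math.sqrt(1.0 + R[2][2] - R[0][0] - R[1][1]) * 2         # s = 4z
    return ((R[1][0] - R[0][1]) / s, (R[0][2] + R[2][0]) / s,
            (R[1][2] + R[2][1]) / s, 0.25 * s)
```

The 7-dimensional pose vector is then simply the concatenation of the 3D position and this quaternion.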

After inpainting, we apply a data augmentation process to imitate different camera locations and orientations (Figure 8). We enrich each collected image by employing rotation and slight zooming (i.e., scaling along the camera axis). It should be noted that the number of times zooming is applied within a particular camera distance interval is highly dependent on the scenario. For instance, our experiment in Section 4.3.2 suggests that augmenting four times within a camera distance interval is enough to achieve promising pose estimation accuracy. Thus, if the training data are densely captured (with small distance intervals), the number of augmentations can be reduced or the augmentation even omitted altogether. To properly combine viewpoint density control and augmentation, we suggest that the viewpoint distance threshold *λ* (Figure 6a) should be reduced when the camera is closer to the object so as to avoid redundant images resulting from augmentation.
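The zoom part of the augmentation can be sketched as a centre crop. This is a simplified stand-in: `image` is represented as a 2D list of pixel values, and the caller would resize the crop back to the original resolution to imitate a smaller camera distance.

```python
def zoom_crop(image, zoom):
    """Imitate moving the camera closer by a factor `zoom` (> 1):
    centre-crop the image to 1/zoom of its height and width.
    `image` is a 2D list of pixel values (illustrative only)."""
    h, w = len(image), len(image[0])
    ch, cw = int(h / zoom), int(w / zoom)          # crop size
    top, left = (h - ch) // 2, (w - cw) // 2       # centre the crop
    return [row[left:left + cw] for row in image[top:top + ch]]
```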

**Figure 8.** Camera pose augmentation.

#### **4. Experiments**

In this section, we first introduce the proposed Industrial10 dataset and its properties. Building on that, we evaluate WatchPose together with widely used pose regression approaches.

#### *4.1. Industrial10*

To the best of our knowledge, there is no existing dataset specially designed for evaluating camera pose estimation in industrial environments. Therefore, we collected a dataset containing 10 industrial objects using our proposed WatchPose approach. As shown in Figure 9, the main purpose of selecting these objects is to reflect the real challenges of industrial environments. For example, Object1 and Object10 have similar appearances under certain viewpoints, which may confuse object detection algorithms. Object6 has limited training data since it is located in a narrow environment. Since Object9 is relatively big, only part of its appearance was captured at small camera distances. Object7 is occluded by a green pipe, so its appearance can look different depending on the camera pose. For each object, we collected the training data from three camera distances, 0.5, 1.0, and 2.0 m (Figure 8), using the proposed WatchPose approach. Figure 10 presents an example depicting the viewpoint distributions of Object1 at different camera distances. We can observe that most regions around Object1 are covered. The number of original training images for the 10 objects is also shown in Figure 9 after their respective names. For testing, each object has a set of 200 testing images collected at random camera distances between 0.5 and 2.0 m. The Industrial10 dataset is publicly available to the community (please check congyang.de for more details).

**Figure 10.** Viewpoints of Object1 in different camera distances.

In addition to the Industrial10 training data collected with WatchPose (named Industrial10-WatchPose), we also extracted another set of training data, named Industrial10-Traditional, using the traditional data acquisition approach. Specifically, Industrial10-Traditional was collected by capturing a high-definition video around each object from camera distances of 0.5, 1.0, and 2.0 m. In total, 3 × 10 videos were recorded, and each video was then sub-sampled at 2 Hz to generate its frames. These frames were then input to the SfM pipeline to extract the camera poses. For a fair comparison, two post-processing steps were followed: (1) We converted the camera poses from SfM to the same format as that of Industrial10-WatchPose based on the pre-measured datum line. (2) We uniformly selected the same number of training images according to the training data distribution in Industrial10-WatchPose. For example, Object1 has 1793 training images in both the Industrial10-WatchPose and Industrial10-Traditional datasets.
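The 2 Hz sub-sampling step amounts to keeping every *n*-th frame of each video. The sketch below is a hypothetical helper that assumes an (approximately) integer frame step of `fps / rate_hz`.

```python
def subsample_indices(n_frames, fps, rate_hz=2.0):
    """Indices of frames kept when a video recorded at `fps` frames
    per second is sub-sampled at `rate_hz` (2 Hz in our experiments)."""
    step = max(1, int(round(fps / rate_hz)))   # frames between kept samples
    return list(range(0, n_frames, step))
```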

#### *4.2. Target Object Detection*

Given a target object detector, the pre-trained pose regression models can easily be switched based on the detected object. For this, we trained a Faster R-CNN [46] using additionally collected and annotated images. For each object, 1000 and 300 images were used for training and verification, respectively. For target object annotation, the popular annotation tool LabelMe [47] was used. We did not train the detector from scratch since the industrial objects are mostly rigid and textureless, and they resemble some objects from ImageNet. Thus, a model pretrained on ImageNet with the lightweight ResNet18 [48] backbone network was selected, while the other parameter settings introduced in [46] were kept. After 20 epochs, some detection results using the trained model and testing data are presented in Figure 11. We evaluated the detector's Average Precision (AP) on the testing images of each object, and the results are presented in Table 1. We also calculated the mean Average Precision (mAP) by averaging the detection APs across the 10 objects. The mAP was around 99.98%, which is considerably higher than the results reported in [46]. This is because most objects' surroundings in industrial environments are much simpler than images taken in natural environments such as those in PASCAL VOC [49] and MS COCO [50]. Moreover, objects are more distinguishable from their backgrounds in industrial environments since they normally have different paint surfaces and specular effects. Considering the number of training images and the performance we achieved, the training and testing images for pose regression could, in practice, also be annotated for training the object detector.

**Figure 11.** Object detection results (shown in red rectangles) using the fine-tuned Faster R-CNN [46].


**Table 1.** Average precision of detecting object-of-interest on the testing images.

#### *4.3. Ablation Studies of WatchPose*

To fully assess the proposed WatchPose scheme and its properties, we performed a set of experiments with the pose regressor fixed to PoseNet [5]. Specifically, we first evaluate and compare the trained PoseNet using original images from both Industrial10-Traditional and Industrial10-WatchPose. The main purpose is to show the performance improvement of our proposed method over the traditional collection method. We also evaluated the usability of our proposed data augmentation strategy on both datasets. It should be noted that the *β* value in PoseNet's loss function should be calibrated for the scenario. As reported in [6], different *β* values lead to considerable performance gaps. Based on the preliminary experiments and discussion in [5], *β* is mainly influenced by the camera location unit and the scene type (i.e., indoor or outdoor). Since our locations are labeled at millimeter (mm) scale, as compared to the original PoseNet experiments in [5], the computed Euclidean loss values are normally much bigger than those from orientation. Thus, we systematically searched for appropriate *β* values in ascending order. Figure 12 illustrates some pose estimation results with different *β* values using a subset of the Industrial10-WatchPose training and testing data (for ease of experimentation). As mentioned in [6], a balance must be struck with *β* between the orientation and translational errors, both of which are highly coupled as they are regressed from the same model weights. Therefore, we considered both location and orientation estimates and finally selected *β* = 40,000 for the experiments below.
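The loss being calibrated here follows PoseNet's form: position error plus *β* times orientation error, with the ground-truth quaternion normalized to unit length. The following is a plain-Python sketch of that weighting for intuition; the actual training computes it on network tensors.

```python
import math

def posenet_loss(x_pred, x_true, q_pred, q_true, beta=40000.0):
    """PoseNet-style loss sketch: Euclidean position error plus
    beta times the quaternion error (q_true normalized to unit
    length). Illustrative only, not the original implementation."""
    pos_err = math.dist(x_pred, x_true)
    n = math.sqrt(sum(c * c for c in q_true))
    q_unit = [c / n for c in q_true]        # normalize ground truth
    ori_err = math.dist(q_pred, q_unit)
    return pos_err + beta * ori_err
```

Because positions are in millimeters while quaternion components are bounded by 1, a large *β* is needed before the orientation term influences the loss at all, which is why the search proceeds in ascending order.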

**Figure 12.** Estimated camera location (mm) and orientation (◦) errors using different *β* values (x axis). *β* = 40,000 was finally selected for our experiments.

#### 4.3.1. Original Images

Here, we compare the pose regression performance using the original training data from Industrial10-WatchPose and Industrial10-Traditional; in other words, the images were not augmented. The mean results for each object are detailed in Table 2. We can clearly observe that our proposed WatchPose scheme significantly improved the pose regression performance on this challenging dataset, by around 4.7 times for location error and 3.9 times for orientation error, compared to the traditional data collection approaches [5,7].

Based on Industrial10-WatchPose, we also experimentally verified the necessity and usability of image inpainting for pose regression in our scenario. Sample images before and after inpainting are shown in Figure 13a,c and b,d, respectively. In particular, we ran experiments with cross combinations of with-inpainting and without-inpainting data for training and testing. Table 3 presents the mean pose estimation results over the 10 objects. We can clearly see that with-inpainting training and testing as well as without-inpainting training and testing performed similarly to each other (around 60∼70 mm location error and 20◦ orientation error), but both apparently outperformed the other combinations (more than 150 mm location error and 5◦ orientation error). As the markers are removed after pose generation in practice, it is necessary to apply the inpainting process to the training images to achieve promising performance. The following experiments were all performed on marker-inpainted training and testing images.

**Table 2.** Estimated camera location (mm) and orientation (◦) errors of objects using traditional and WatchPose data collection methods.


**Figure 13.** Sample images before (**a**,**c**) and after inpainting (**b**,**d**).

**Table 3.** Pose estimation using with-marker/no-marker data for training/testing.


#### 4.3.2. Augmentation

In this section, we experimentally explore the efficacy of the data augmentation strategy introduced in Section 3.3 on the Industrial10-WatchPose and Industrial10-Traditional datasets. Following the rotation augmentation introduced in existing reports [6,39], all images were first rotated 4 times. After that, we empirically enriched each training image 4 times within each camera distance interval. For instance, augmenting 4 times within the 0.5 m∼1.0 m camera distance interval meant that the camera distances 0.6, 0.7, 0.8, and 0.9 m were imitated by zooming in and center-cropping the 1.0 m image. Evaluations were done on the same testing data as in Section 4.1, and the results are detailed in Table 4.

We can observe that the improvement in pose regression accuracy using the augmented Industrial10-Traditional was limited, around 4%. In contrast, the pose regression accuracy improved by around 5.3 times in location and 4.2 times in orientation using the augmented Industrial10-WatchPose. The main reason is that the regions left uncovered by the traditional approach were still not properly covered after augmentation. As a result, the performance gap between the traditional and proposed WatchPose approaches was enlarged after data augmentation. Specifically, we achieved errors of around 13.3 mm in location and 4.7◦ in orientation using Industrial10-WatchPose after augmentation, which is around 19 times better than Industrial10-Traditional.

In Figure 14, we plot the pose estimation results at each iteration during the training process using the augmented Industrial10-WatchPose. The horizontal axis represents the iteration number and the vertical axis the estimated median performance with respect to camera location and orientation. For clearer visualization, results from 5 objects are plotted together in each sub-figure. We can clearly observe that both location and orientation errors dropped dramatically after 10,000 training iterations and then stabilized after around 15,000 training iterations.

**Table 4.** Estimated camera location (mm) and orientation (◦) errors of each object using models trained on the augmented Industrial10-WatchPose and Industrial10-Traditional.


**Figure 14.** Estimated camera location and orientation errors of 10 objects in each training iteration.

In addition to the quantitative comparison, we also compare the pose estimation results based on the marker-reprojection approach (some samples are shown in Figure 15) so that we can visually observe the differences in performance. For better visualization, we employed the original images before marker inpainting and transferred the relative camera pose of the object back to that of the Nestmarker. If an estimated pose is closer to the ground truth (blue box), the reprojected marker (green box) covers the physical location more accurately. The promising reprojections shown in the lower row of images in Figure 15 demonstrate the robustness of WatchPose. As shown in the upper row, we find that the poses estimated by the model trained with Industrial10-Traditional performed less satisfactorily. The main reason is that Industrial10-Traditional did not cover enough regions around the target objects, so the trained PoseNet model could not generalize well to the testing data from uncovered regions.
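The marker-reprojection check amounts to projecting the marker's 3D corners through the estimated pose with the pinhole model u ∼ K(RX + T). The following is a minimal plain-Python sketch with an assumed intrinsic matrix K; in practice the intrinsics come from camera calibration.

```python
def project_point(K, R, T, X):
    """Project a 3D point X with the pinhole model u ~ K(RX + T).
    K is the 3x3 intrinsic matrix, R the 3x3 rotation, T the
    translation (all as lists); returns pixel coordinates (u, v)."""
    # Transform the point into the camera frame.
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + T[i] for i in range(3)]
    # Apply intrinsics and dehomogenize by the depth component.
    uvw = [sum(K[i][j] * Xc[j] for j in range(3)) for i in range(3)]
    return (uvw[0] / uvw[2], uvw[1] / uvw[2])
```

Reprojecting all four marker corners with both the estimated and the ground-truth pose, and comparing the resulting quadrilaterals, gives exactly the green-versus-blue overlay shown in Figure 15.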

**Figure 15.** Comparison of marker reprojections (green) between the Industrial10-Traditional (upper) and Industrial10-WatchPose (lower) trained PoseNet. The ground truth is marked in blue.

#### 4.3.3. Density Control

In this section, we quantitatively evaluate the influence of the viewpoint density control parameter *λ* introduced in Section 3.3 on the final pose regression performance. To this end, we employ Object1 (*λ* = 0.2) from the Industrial10-WatchPose dataset for the experiment. We also fix the other factors such as *β* in PoseNet, the augmentation strategy, and the testing data. As the *λ* value increased (from 0 to 1), the original viewpoints of Object1 were proportionally and randomly selected from the virtual ball to imitate data collection from dense to sparse. For *λ* = 0, 20% of the original viewpoints were randomly selected and duplicated to imitate overlapping views. The corresponding viewpoints of original images were selected (or duplicated) to generate the different training sets. Table 5 presents the pose estimation results with different *λ* values (the original number of training images before augmentation is also provided for reference). We observed that the trained model performed less and less satisfactorily as *λ* increased. This is expected since more and more regions were left uncovered by the training data. We also found that both orientation and location errors were, surprisingly, slightly higher at *λ* = 0 compared to the performance at *λ* = 0.2. The main reason is that with *λ* = 0 the viewpoints in some regions were densely distributed and redundant. As a result, the trained model was slightly overfitted to these regions.

**Table 5.** Estimated camera location (mm) and orientation (◦) errors of Object1 in using training data with different viewpoint density controlled by *λ*.


Based on the evaluation in Table 5, we can observe that *λ* is important for balancing the coverage and redundancy of viewpoints. In practice, *λ* is configured empirically depending on the target object size and the collection conditions. If neither can be determined in advance, *λ* = 0 is a reasonably acceptable choice to ensure sufficient coverage of different viewpoints, though redundant training images could be generated with this value.

#### *4.4. Deep Absolute Pose Estimators*

Using the augmented Industrial10-Traditional and Industrial10-WatchPose datasets, we compare the pose regression performance of five existing approaches. Specifically, Bayesian-PoseNet [39] was released by the same authors as the PoseNet approach. Bayesian-PoseNet first generates samples by dropping out the activation units of PoseNet's convolutional layers based on a probability value. The final pose is then computed by averaging the individual samples' predictions. Meanwhile, PoseNet+ [6] introduced a loss with learned uncertainty parameters (learnable-weights pose loss) for optimizing PoseNet. Hourglass-Pose [8] focused on optimizing the architecture of PoseNet by suggesting an encoder-decoder architecture implemented with a ResNet34 [48] encoder (removing the average pooling and softmax layers). In contrast to the aforementioned approaches, a more recent method called MapNet [7] suggested including additional data sources to constrain the loss. It is trained with both absolute and relative ground-truth data.

On the Industrial10-Traditional dataset, we analyzed the estimation performance for each object category so that deeper observations could be made. The final results are detailed in Table 6. We found that the performance varied among the different objects. We also observed that the improved loss optimization in PoseNet+ and the architecture design in Hourglass-Pose both improved PoseNet's accuracy by around 2 times, with the location errors in particular being dramatically reduced. Overall, MapNet achieved the best performance, around 3 times better in location and 2 times better in orientation, compared to the poses estimated by the original PoseNet.

**Table 6.** Estimated camera location (mm) and orientation (◦) errors of different pose regression approaches using the augmented Industrial10-Traditional dataset.


In contrast to Table 6, the estimation errors in Table 7 are smaller across the five approaches using the Industrial10-WatchPose dataset. This is because the WatchPose data covers more camera poses, thereby leading to better generalization in the pose regression models. In particular, while Bayesian-PoseNet only marginally improved the pose regression accuracy (over PoseNet), the Hourglass-Pose and PoseNet+ approaches achieved around 14% improvement in location and 4% in orientation estimation. Once again, MapNet achieved the best performance, with around 226.187 mm location error and 57.551◦ orientation error. It also shows that the proposed WatchPose data collection method enabled much greater improvements in both location and orientation performance for MapNet compared to the traditional data collection method. This shows that the proposed protocol is particularly suitable for complex industrial applications, as the efficacy of the pose estimation methods is broadly improved.

**Table 7.** Estimated camera location (mm) and orientation (◦) errors of different pose regression approaches using the post-processed Industrial10-WatchPose dataset.


#### *4.5. Discussion on Restrictions*

Theoretically, WatchPose is applicable to a scenario when it fulfills the following conditions: (1) Inspections are carried out from within 0.2 to 2 meters of the target object; otherwise, the Nestmarker cannot be properly tracked. (2) There should be enough space around the target object for data acquisition via a handheld camera. (3) There should be a homogeneous place close to the target object for pasting Nestmarkers. However, there could be more factors that influence the efficiency of WatchPose and the performance of pose regression in the application phase. To explore these factors, we employed the marker reprojection approach (similar to Figure 15) so that the pose estimation performance of MapNet could be visually observed, as shown in Figure 16. We can see that the reprojected markers (in blue) on most objects cover the ground truth (the original marker) quite accurately. However, there were still some testing images with poor reprojection results, as shown in Figure 16b. These images were mainly affected by specific types of challenging conditions: under- and over-exposed images, partially blurred regions, incorrectly detected objects, and also limited training data. This is partly attributable to the lighting conditions of the industrial environment. Other factors such as camera configuration, moving speed, shaking, and camera resolution could also impact the image quality and the regression performance. In practice, a camera with high resolution and fixed exposure time is recommended. During the capturing process, the camera should also be moved as slowly as possible. Moreover, tools such as handheld gimbals [51] could be used to stabilize the acquisition. In Section 5, we also introduce future work to deal with these challenging problems.

(a) Promising estimation results

(b) Poor estimation results

**Figure 16.** Marker reprojection results (blue) using the estimated camera poses from MapNet [7]. For better visualization, the original marker is attached in each image.

Our experiments were performed on two platforms: A laptop and a desktop machine. Training images and camera pose labels were processed on a laptop with Intel Core i7 2.9 GHz CPU, 16 GB installed memory and 64-bit Windows 7 OS. An ELP 1080P HD Webcam was connected to the laptop via a 2 m USB-cable for data collection. Model training, object detection, and pose estimation experiments were accomplished on the desktop machine with 6 Intel Xeon Core 3.5 GHz CPUs, 64 GB installed memory, a Quadro M4000 GPU (8 GB global memory and 1664 CUDA Cores), and Ubuntu 14.04 LTS. The pose generation was implemented with C++. The object detection, pose training, and estimation tasks were implemented with Python. For object detection, the entire training process took around 26 hours and the detection (at inference time) took about 0.28 seconds per testing image. For camera pose estimation, each camera pose could be regressed in about 5 to 100 ms depending on the model, which puts the system at real-time speeds.

#### **5. Conclusions**

In this paper, we introduced a simple but efficient data collection method for complex industrial environments, named WatchPose, for learning effective absolute camera pose regression models. The proposed method integrates the advantageous properties of marker tracking and viewpoint visualization approaches. The features of WatchPose properly handle the challenges of industrial-type objects: textureless and specular surfaces under strong artificial lights, and highly variable distances and view angles. We also proposed two post-processing steps (inpainting and augmentation) to improve the robustness and stability of the trained models. To advance pose estimation research in industrial environments, we introduced a new challenging dataset, Industrial10, representing the aforementioned challenges of industrial-type objects and environments. Experiments showed that the proposed WatchPose method effectively improves the pose regression performance of five widely used approaches.

In the future, we propose three further directions. Firstly, we will collect more data to cover other kinds of challenges in industrial environments; for instance, we can enrich the training data by varying image brightness [52] to imitate different lighting conditions. Secondly, we will compare the motion tracking performance of ARToolKit and ARCore [53], which can be applied with and without markers, respectively. Finally, we will release further baselines for the Industrial10 dataset using a few other camera pose estimation methods such as SfM [32], 3D Scene [30], etc.

**Author Contributions:** Conceptualization, C.Y.; Data curation, C.Y.; Funding acquisition, M.-O.B. and W.W.; Methodology, C.Y.; Project administration, G.S.; Software, C.Y.; Supervision, G.S., M.-O.B. and W.W.; Validation, G.S. and J.S.; Writing—original draft, C.Y.; Writing—review & editing, G.S., J.S. and M.-O.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was funded by the Region Lorraine and Université de Lorraine. Grant Number: 1.87.02.99.216.331/09

**Acknowledgments:** The authors would like to thank Pierre Rolin, Antoine Fond, Vincent Gaudilliere, and Philippe Vincent from INRIA/LORIA for their invaluable help.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **A Deep-Learning-based 3D Defect Quantitative Inspection System in CC Products Surface**

#### **Liming Zhao, Fangfang Li, Yi Zhang \*, Xiaodong Xu, Hong Xiao and Yang Feng**

Research Center of Intelligent System and Robotics, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; zhaolm@cqupt.edu.cn (L.Z.); liffang123@gmail.com (F.L.); xxd@cqupt.edu.cn (X.X.); xiaohong@cqupt.edu.cn (X.H.); fy1222111@163.com (Y.F.)

**\*** Correspondence: zhangyi@cqupt.edu.cn; Tel.: +01-267-366-4048 or +86-188-8376-9122

Received: 27 December 2019; Accepted: 7 February 2020; Published: 12 February 2020

**Abstract:** To make an intelligent surface region-of-interest (ROI) 3D quantitative inspection strategy a reality in the continuous casting (CC) production line, an improved 3D laser image scanning system (3D-LDS) was established based on binocular imaging and deep-learning techniques. In 3D-LDS, firstly, to meet the requirements of the industrial application, the CCD laser image scanning method was optimized in high-temperature experiments. Secondly, we proposed a novel region proposal method based on 3D ROI initial depth location to effectively suppress redundant candidate bounding boxes generated by pseudo-defects in a real-time inspection process. Thirdly, a novel two-step defect inspection strategy was presented by devising a fusion deep CNN model which combines fully connected networks (for defect classification/recognition) and fully convolutional networks (for defect delineation). The 3D-LDS dichotomous inspection method of defect classification and delineation is helpful in understanding and addressing challenges of defect inspection on CC product surfaces. The applicability of the presented methods is mainly tied to surface quality inspection for slab, strip and billet products.

**Keywords:** continuous casting; surface defects; 3D imaging; neural network; deep learning; defect detection

#### **1. Introduction**

In recent years, with the advent of Industry 4.0 and enterprises undergoing transformation and upgrading of manufacturing processes, continuous casting (CC), as a main solidification process for molten steel, has been widely popularized to produce semi-finished metal products [1]. In the iron and steel industry, with the maturity of CC technology, hot charging and direct rolling (HC-DR) as an energy-efficient production pattern is currently experiencing rapid development [2,3]. Technically, to implement HC-DR, defect-free CC products are undoubtedly an essential prerequisite [4,5]. Although the technical objectives to be improved have been identified, no manufacturer in the world has reported one-hundred-percent defect-free CC semi-product manufacturing technology in such a complex and systematic setting [6]. Therefore, complementary technologies such as automatic nondestructive examination (NDE) for CC product surface quality evaluation have become essential in the promotion of HC-DR [7,8]; eliminating flawed segments according to accurate NDE evaluation results is an advisable method [9]. Machine vision (MV) in NDE combined with AI algorithms is becoming a burgeoning method which can deliver a fast response, a high signal-to-noise ratio and a strong anti-jamming capability [10,11] compared with ultrasonic, eddy-current and other contact methods. These merits make MV more competitive in harsh-environment applications like the CC manufacturing field [12,13]. On the other hand, MV-based 3D optical metrology has gradually demonstrated its superiority, with techniques such as stereoscopic triangulation (mm), interferometry (nm), confocal vertical scanning, and fringe projection (µm) [14–16].

ArcelorMittal Corp. developed a conoscopic holography rangefinder system tested at ACERALIA Corp. (Spain). Cognex Corp. in the US developed the SmartView detection system, applied to a wide variety of surface defect inspection tasks. Elkem Corp. in Norway and Honeywell Corp. in the United States developed infrared and visible-light MV detection methods [17]. Xu et al. [18], based on MV technology, carried out extensive research on CC slab and rolled strip surface defect inspection. To obtain effective 3D defect shapes, Zhao et al. [19] combined line-array and area-array CCD imaging methods and devised an informative image scanning method. As a fast-developing subfield of machine learning, multilayer convolutional neural networks combined with deep-learning (CNN-DL) strategies have shown state-of-the-art performance in the MV inspection field [20]. CNN-DL methods do not require laborious hand-crafted features for classifier design [21] and, as a branch of ANNs, they make complex function approximation feasible by learning a deep nonlinear network. He Di et al. [22] trained a classifier for strip defect recognition based on a convolutional auto-encoder (CAE) and a devised semi-supervised Generative Adversarial Network. To overcome the trivial image pre-processing and feature extraction process, Wangzhe Du et al. [23] presented an X-ray defect detection system based on the Feature Pyramid Network and a data augmentation method for model generalization training. Veitch-Michaelis et al. [24] studied a 3D crack recognition method combining morphological detection and an SVM classifier. Hongwen Dong at Northeastern University proposed a pyramid feature fusion and global context attention network for pixel-wise detection of surface defects in the industrial production process [25]. Fatima A. Saiz et al. [26] reported a deep-learning-based automatic defect recognition system in which a CNN was utilized in the model design, achieving an outstanding classification rate.
CNN-DL strategies need to make full use of training datasets and learning algorithms to make the detection results relatively stable. Therefore, they generally require a large number of training samples as input. In high-noise environments, MV based-intelligent inspection methods as mainstream schemes have been successfully applied in the CC products line, although the accuracy and mechanism of the AI algorithm require in-depth research with the improvement of application requirements.

In the CC production line, with the improvement of quality requirements, the defect depth has become a significant factor which, especially for the CC slab, may sometimes cause potential security problems. In other words, some defects can be ignored or repaired by the follow-up finishing process if their depth does not exceed a certain value. Furthermore, conventional optical-imaging 2D inspection methods are susceptible to high-temperature radiation interference. In this work, we treat the entire defect inspection process as two separate steps, recognition and delineation, and based on our previous work in [6], a novel two-step defect inspection strategy was presented by devising a fusion deep CNN model (fully connected CNN with fully convolutional CNN). The entire scheme, as shown in Figure 1, was implemented by the devised flexible binocular 3D quantitative inspection deep-learning system (3D-LDS). In this system, unlike traditional inspection methods, 3D depth point-cloud mapping images are fed into 3D-LDS. Furthermore, a region proposal method was designed using 3D-LDS ROI location that can effectively suppress redundant candidate bounding boxes in a real-time defect recognition process. Systematically, a 3D-LDS-based CNN-DL strategy was attempted for CC product surface defect inspection, allowing a feasible class of AI algorithms and powerful ROI recognition and delineation strategies to be further studied in industrial applications.

**Figure 1.** The scheme of binocular CCD-based 3D image deep-learning CC products surface defects inspection.

#### **2. An Improved 3D Image Scanning System**

#### *2.1. Optimal Image Laser Scanning Method*

In image-based ROI inspection methods, it is a prerequisite for the imaging sensor to be able to capture objects informatively and adjust imaging parameters adaptively as the peripheral environment changes. Therefore, the 3D-LDS, as a structured-light-assisted active imaging system, needs a laser stripe with maximal color contrast and the most homogeneous gray level. Namely, the imaging sensor should be set to an appropriate optical integral time (OIT) and focus status. For a rigid system architecture, the focus status can be fixed, since the imaging distance and imaging depth of field (DOF) are constants. However, the automatic OIT controlling method requires attention if the imaging sensor works in an unstable high-temperature radiation environment. According to Planck's law [27], we took the CC product as a blackbody and assumed that its surface emissivity is equal to 1. When T > 500 °C (the CC slab roughly varies between 600 °C and 900 °C when it comes out of the secondary cooling area), the visible red-light radiation can be sensed by unaided eyes. We tested the optical spectrum radiation interference at different temperatures on the hot CC slab surface, from 720 °C to 1021 °C, as shown in Figure 2. We can observe the regular patterns of light strength distribution with different OITs and object surface temperatures. The experiments provide quantitative guidance for determining the laser luminous wavelength and controlling the imaging sensor's parameters. In 3D-LDS, to minimize radiation interference, we selected a 532 nm green laser emitter. On the one hand, this ensures that the CCD sensor is in the imaging spectral sensitive range; on the other hand, it avoids high-temperature radiation interference as much as possible. We can observe that the radiation intensity of the laser stripe at 3 ms is easily distinguishable from the hot slab surface (1000 °C) at an integral time of 10 ms.

**Figure 2.** The spectral radiation intensity at different temperatures in slab surface and light intensity distribution of the laser stripe. (**a**) High temperature spectrum measurement; (**b**) Spectral intensity comparison.

On the basis of the light radiation principle, we present an improved method to determine the threshold *TL* in 3D-LDS, which allows the CCD cameras to scan the laser stripe precisely without interference from high-temperature radiation. Based on the CCD imaging principle, theoretically, the object's luminance can be formulated as follows [28]:

$$E\_0 = \left(\frac{n'}{n}\right)^2 K\pi L \sin^2 U', \tag{1}$$

where *n* and *n*′ denote, respectively, the refractive index in object space and image space, *K* is the optical system transmittance, *L* is the light luminance, and *U*′ represents the image aperture angle. Supposing that the laser-reflected luminance can be expressed by *L* = ρ*E*, where *E* represents the laser transmitter luminance and ρ is the reflectivity (0 < ρ < 1), the diffuse reflection of the laser stripe on the object surface can be formulated by

$$E\_0' = \left(\frac{n'}{n}\right)^2 \mathcal{K}\pi\rho E\sin^2\mathcal{U}',\tag{2}$$

Apparently, as shown in Figure 3, quantitatively determining the best color distance between the slab surface and the laser stripe depends on the threshold at the optimal light integration time [29]. It also shows that in the figure, the laser stripe shape is easily extracted when the light intensity is concentrated.

Therefore, *TL* can be found by the following method. Firstly, we convert the 24-bit color image into gray level directly by assigning *R* = *G* and *B* = *G*. While the CCD sensor's images have pixel levels [1, ..., *T*, ..., *L*], let *ni* and *N* denote the number of pixels at level *i* and the total number in one frame; then *TL* should lie between μ*<sup>b</sup>* and μ*<sup>f</sup>* [30]:

$$\begin{cases} \mu\_b = \sum\_{i=1}^T i p\_i / \omega\_b = \mu(T) / \omega(T) \\ \quad \mu\_f = \sum\_{i=T+1}^L i p\_i / \omega\_f = \frac{\mu - \mu(T)}{1 - \omega(T)} \end{cases} \tag{3}$$

where $p\_i = n\_i/N$, $\mu(T) = \sum\_{i=1}^{T} i p\_i$, $\mu\_t = \sum\_{i=1}^{L} i p\_i$, $\omega\_b = \sum\_{i=1}^{T} p\_i = \omega(T)$, and $\omega\_f = \sum\_{i=T+1}^{L} p\_i = 1 - \omega(T)$. The variances of the laser stripe foreground and the background are formulated as follows:

$$\begin{cases} \begin{aligned} \sigma\_b^2 &= \sum\_{i=1}^T (i - \mu\_b)^2 p\_i / \omega\_b \\ \sigma\_f^2 &= \sum\_{i=T+1}^L \left( i - \mu\_f \right)^2 p\_i / \omega\_f \end{aligned} \end{cases} \tag{4}$$

**Figure 3.** Laser stripe shape and intensity distribution counts at different CCD integral times.

Based on the Otsu and CCD imaging definition variance evaluation function, the optimal scanning threshold *T* can be determined by the following discriminate criterion measure:

$$f(T) = \sigma\_B^2(T) / \sigma\_t^2, \tag{5}$$

where $\sigma\_B^2(T) = \omega\_b(\mu\_b - \mu\_t)^2 + \omega\_f(\mu\_f - \mu\_t)^2$ denotes the between-class variance and $\sigma\_t^2 = \sum\_{i=1}^{L} (i - \mu\_t)^2 p\_i$ represents the total variance of the current frame. In fact, the optimal $T\_L$ can be computed by searching the threshold interval $[1, \ldots, L]$ to meet the requirement:

$$T\_L = \max\_{1 \le T \le L} \sigma\_B^2(T),\tag{6}$$
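The threshold search in Eqs. (3)–(6) can be condensed into a few lines of NumPy. The sketch below is a minimal re-implementation of the Otsu-style criterion (not the production 3D-LDS code) and uses the standard identity that the between-class variance can be written in terms of the cumulative histogram:

```python
import numpy as np

def otsu_threshold(gray, levels=256):
    """Search the threshold T that maximizes the between-class
    variance sigma_B^2(T), as in Eqs. (3)-(6)."""
    # Normalized histogram: p_i = n_i / N
    hist, _ = np.histogram(gray, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    i = np.arange(levels)
    mu_t = (i * p).sum()          # global mean
    omega = np.cumsum(p)          # omega(T), background weight
    mu = np.cumsum(i * p)         # mu(T), background first moment
    # Between-class variance for every candidate T, avoiding /0
    valid = (omega > 0) & (omega < 1)
    sigma_b2 = np.zeros(levels)
    sigma_b2[valid] = (mu_t * omega[valid] - mu[valid]) ** 2 / (
        omega[valid] * (1 - omega[valid]))
    return int(np.argmax(sigma_b2))
```

On a frame containing a bright laser stripe over a darker slab background, the returned level separates the two gray-level populations.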

Figure 4 displays the laser imaging results; experiments show that Figure 4b, obtained under the optimal imaging state, gives the most convenient stripe shape for data processing.

**Figure 4.** Laser stripe imaging features at different CCD imaging states. (**a**) Short CCD integral time; (**b**) Optimal laser imaging; (**c**) Long CCD integral time.

#### *2.2. System Construction*

To implement the deep-learning 3D inspection method and create a reliable detection system meeting the special requirements, we devised an improved experimental system based on our previous research. Figure 5a is the schematic principle of the devised binocular CCD laser image scanning system. Figure 5b is the corresponding experimental system, updated from our previous multi-source CCD imaging system in the literature [6]. The previous system mainly utilized traditional inspection methods, and the 3D laser scanning system played only an auxiliary role in defect location. In the new 3D-LDS system, the integrity of defects can be captured properly without the line-scanning CCD. In this system, we employed two MERCURY CCD cameras (model: MER-500-14GC-P), and the lens model selected was M0814-MP2. Here, the deep-learning defect recognition process was conducted on the fused image from the two imaging sensors. The two laser scanning images were overlaid informatively by a registration method; this process is a rigid transformation of rotation and translation. Once the system calibration was completed, the imaging parameters between the two CCD cameras were settled. Notice that the applicability of the proposed experimental system is not tied exclusively to CC product surface defect inspection.

**Figure 5.** The schematic illustration of the binocular CCD laser image scanning system. (**a**) The schematic working principle diagram; (**b**) The corresponding devised experimental system.

In the system, the 3D image pixels (12-bit) are indirectly mapped from the calibrated laser triangulation strategies (the metric is millimeters). Therefore, the image ROI was reconstructed by converting the 3D distance point cloud of the object surface. From the experiments in Figure 6, we can visually observe that the system can change its detection accuracy and sensitivity for depth information by finely adjusting θ according to the detection requirements. Generally, CNN-DL model training requires a large number of labeled examples. We utilized the angular fine adjustment to acquire different scanning images of the testing samples as an auxiliary data augmentation method, so depth variation was explicitly added to the training samples. Alongside this method, we also used typical variations, including changes in contrast, rotations and translations. Deep learning is extremely data-hungry, and performance grows only logarithmically with the amount of data used; this is one of the main limitations that the field is currently facing.
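As an illustration of the augmentation step, the sketch below applies contrast, rotation and translation jitter to a depth patch. The parameter ranges, the right-angle rotations and the circular-shift translation are our assumptions for illustration, not values taken from the paper:

```python
import numpy as np

def augment_depth_image(img, rng):
    """Hypothetical augmentation pass mirroring the text: contrast
    scaling, rotation (here by right angles), and translation."""
    out = img.astype(np.float32)
    # Contrast jitter around the mean depth value
    alpha = rng.uniform(0.8, 1.2)
    out = (out - out.mean()) * alpha + out.mean()
    # Random rotation by a multiple of 90 degrees
    out = np.rot90(out, k=rng.integers(0, 4))
    # Random translation via a circular shift (crude stand-in)
    dy, dx = rng.integers(-4, 5, size=2)
    out = np.roll(out, (dy, dx), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
patch = rng.random((64, 64))      # stand-in for a depth patch
aug = augment_depth_image(patch, rng)
```

In practice such transforms would be applied on the fly during training to multiply the effective number of labeled samples.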

**Figure 6.** The influence of different oblique angles on detection accuracy and sensitivity (left:θ/2 = 60◦ and right: θ/2 = 45◦ ).

#### **3. CNN-DL Inspection Method Design in 3D-LDS**

#### *3.1. CNN Networks Design in 3D-LDS*

In neural networks, a neuron is the fundamental unit that takes a bias *w*<sup>0</sup> and a weight vector ω = (*w*<sup>1</sup>, ... , *w*<sup>n</sup>) as parameters of a decision model *f*(*x*) = *h*(ω*<sup>T</sup>x* + *w*<sup>0</sup>), where *h* is a non-linear activation function. More complex nonlinear mappings are usually built from combinations of many neurons arranged in layers. Commonly, a single-layer network can be expressed as a linear combination of *N* individual neurons [31]:

$$f(\mathbf{x}) = \sum\_{i=0}^{N-1} v\_i h(w\_i^T \mathbf{x} + w\_{0,i}),\tag{7}$$

where the trainable parameters of this network are the output weights $v\_i$, the weight vectors $w\_i$, and the biases $w\_{0,i}$ for $i = 0, \ldots, N-1$. Appropriate parameters can decrease the gap between the ideal function and its approximation, $|f(x) - \tilde{f}(x)|$. Theoretically, any function on a compact set can be approximated using a single-layer network, provided that a large enough number of neurons is used and the network can be trained to the proper parameters. The more layers the network has (deeper networks), the stronger the network's modeling capacity; however, the deeper the network, the more challenging it is to train its parameters. In recent years, deep-learning technology has been widely used in many fields; in particular, the convolutional and pooling layers give the model a robust ability to extract local and macro characteristics. In Figure 7, the convolution and pooling process in DL networks achieves locality perception and a parameter-sharing mechanism, which dramatically reduces the number of model training parameters. In addition, the end-to-end training strategy integrates feature extraction-selection and classifier design in a streamlined process. Hand-crafted features are no longer required, as everything is learned by the network model in a data-driven mode.
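Eq. (7) can be made concrete with a tiny NumPy sketch; tanh is chosen arbitrarily as the activation *h*, and all shapes are illustrative:

```python
import numpy as np

def single_layer(x, V, W, w0):
    """f(x) = sum_i v_i * h(w_i^T x + w_0i), with h = tanh (Eq. 7)."""
    return V @ np.tanh(W @ x + w0)

# Tiny illustration with N = 3 hidden neurons and a 2-D input
rng = np.random.default_rng(1)
x = rng.standard_normal(2)
W = rng.standard_normal((3, 2))   # one weight vector w_i per neuron
w0 = rng.standard_normal(3)       # per-neuron biases w_0i
V = rng.standard_normal(3)        # output combination weights v_i
y = single_layer(x, V, W, w0)     # scalar network output
```

Stacking several such layers, with the output of one fed as the input of the next, yields the deeper networks discussed above.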

Based on the end-to-end training mechanism, we built a complete deep neural network model in 3D-LDS. As shown in Figure 8, we devised a dichotomous defect inspection strategy that includes two steps and a two-branch deep neural network for defect type classification (recognition) and ROI delineation. In the overall inspection process, the input images mapped from the laser triangulation are finally converted to a prediction map and a classification label. The proposed methodology is helpful in understanding and addressing challenges of CC product surface inspection. In the recognition process, 3D point-cloud images in 3D-LDS were utilized to locate the defect positions accurately according to the depth detection results. Through the initial location of possible ROIs (defects), candidate bounding boxes (BBoxes) are generated; we define this process as depth-based ROI initial location and BBox generation. In the last two steps, the BBox is classified by fully connected neural networks, which output the defect type at image level, and a pixel-wise prediction map is output by fully convolutional neural networks for delineation.

**Figure 7.** The convolution and pooling process in CNNs for locality perception and parameters sharing mechanism.

**Figure 8.** Defects recognition and delineation process based on deep CNN modeling mechanism.

A significant characteristic of DL strategies is automatic feature learning for data representations through an end-to-end training process. To realize the two-step defect recognition and delineation in 3D-LDS, we constructed a novel network architecture by integrating blocks of ResNet [32] and U-Net [33]. The aim is to exploit the merits of deep CNNs in classifier design and fuzzy ROI delineation. Among them, ResNet was designed to enable the training of very deep networks thanks to the introduced residual block, while Ronneberger's fully convolutional idea was a breakthrough towards automatic image segmentation. In fact, the ROI segmentation can be expressed as an auto-encoder and decoder process: it consists of a contracting and an expanding branch and enables multi-resolution analysis. Figure 9 shows the schematic network architectures for defect classification (recognition) and ROI segmentation (delineation). A novel idea here is the devised multi-model-based recognition and delineation: in the defect inspection process, the system automatically selects among different trained models according to the input image size. Usually, the detected candidate ROIs have different sizes; to reduce the computational complexity in 3D-LDS, only the BBox is input into the system, as shown in Figure 8. In the experimental testing process, we trained classifier and delineation DL models for five different BBox sizes (input sizes: 32×32, 48×48, 64×64, 80×80, 128×128); each candidate depth-ROI-based BBox is resized to one of the five sizes according to size proximity. Note that the images are reconstructed after recognition and delineation are finished, because the real location on the CC product surface is predicted through the system measurement calibration parameters.

**Figure 9.** The schematic network architectures for defects recognition and ROI delineation.
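The size-proximity model selection described above can be sketched as follows; since the text does not specify how proximity is measured, the longer-side heuristic below is our assumption for illustration:

```python
def select_model_size(w, h, sizes=(32, 48, 64, 80, 128)):
    """Pick the trained input size closest to a candidate BBox,
    using its longer side as the proximity measure (an assumed
    interpretation of 'size proximity')."""
    side = max(w, h)
    return min(sizes, key=lambda s: abs(s - side))
```

For example, a 70×90 candidate BBox would be routed to the 80×80 classifier/delineation models, and a 150×140 BBox to the 128×128 models.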

#### *3.2. Model Training Strategies in 3D-LDS*

Generally, a CNN consists of convolutional layers, pooling layers, fully connected layers and loss layers, etc.; among them, the algorithms in the fully connected layer and the loss layer are basic parts of the network. CNN-based recognition methods have been widely used in image analysis fields, and CNN-based modeling capability has gradually strengthened owing to improvements in the loss function and optimization algorithm in the model training process. In this work, as shown in Figure 10, we utilized the softmax function to train the multi-classification model [34]:

$$P(y = j | z^{(i)}) = \phi\_{softmax}(z^{(i)}) = \frac{e^{z\_j^{(i)}}}{\sum\_{k=1}^{t} e^{z\_k^{(i)}}}, \tag{8}$$

**Figure 10.** The system multi-classification method for different defects types in training and testing process.

We can see that the range of this function value is [0, 1], where $z = w\_0 x\_0 + w\_1 x\_1 + \cdots + w\_n x\_n = \sum\_{i=0}^{n} w\_i x\_i = \mathbf{w}^T \mathbf{x}$, $t$ represents the total number of defect categories, $w$ is the weight vector, $x$ is the feature vector of a training sample, and $w\_0$ is the bias unit. $z\_k$ denotes the output value of class $k$; in the experimental process we tested five classes of defects: transversal cracks, longitudinal cracks, star cracks, hole-shaped defects and others, respectively. The softmax function computes the probability that the current training sample $x^{(i)}$ belongs to class $j$ given the weights and net input $z^{(i)}$. Therefore, we compute the probability $P(y = j | x^{(i)}; w\_j)$ for each class label $j = 1, \ldots, t$. Note that the normalization term in the denominator causes the class probabilities to sum to one, under the assumption that the training samples are independent of each other.
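A minimal NumPy version of Eq. (8), using the five defect classes from our experiments as an example (the logit values themselves are made up for illustration):

```python
import numpy as np

def softmax(z):
    """Eq. (8): class probabilities from net inputs z_k. The shift
    by max(z) is a standard numerical-stability trick and does not
    change the result."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative logits for: transversal crack, longitudinal crack,
# star crack, hole-shaped defect, others
z = np.array([2.0, 1.0, 0.5, 0.2, -1.0])
p = softmax(z)   # probabilities over the five defect classes
```

The predicted class is simply the index of the largest probability, here the first (transversal crack).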

Based on the softmax function, we can introduce the softmax *loss*, formulated as below:

$$L = -\sum\_{j=1}^{T} y\_j \log s\_j, \tag{9}$$

Here, $s\_j$ is the $j$-th value of the output vector $s$ of the softmax function, indicating the probability that the testing sample belongs to the $j$-th category; $y$ is a $1 \times T$ one-hot vector in which only the position corresponding to the real label equals 1. Therefore, this formula reduces to a simpler form when $j$ is the real label of the current sample:

$$L = -\log s\_j, \tag{10}$$

Next, we can give the concept of cross entropy, which is formulated as below:

$$E = -\sum\_{j=1}^{T} y\_j \log p\_j, \tag{11}$$

Here, cross entropy is equal to the softmax loss when its input $p\_j$ is the output of the softmax. In our work, we set the activation function to softmax in the dense layer. Based on the above discussion, we can define the optimization objective as minimizing the loss function $E$ during training. Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. There are three basic variants of gradient descent, which differ in how much data is used to compute the gradient of the objective function [35]: batch gradient descent (BGD), stochastic gradient descent (SGD) and mini-batch gradient descent (MBGD). Some challenges remain for these three optimization methods, but since they are not the focus of this paper we do not dwell on them here. In our experiments, we utilized adaptive moment estimation (Adam) to compute adaptive learning rates for the network parameters. Adam keeps an exponentially decaying average of past gradients, similar to momentum, besides storing an exponentially decaying average of past squared gradients like Adadelta and RMSprop [36]. Adam prefers flat minima in the error surface, and the decaying averages of past and past squared gradients $m\_t$ and $v\_t$ are computed as follows [37]:

$$\begin{aligned} m\_t &= \beta\_1 m\_{t-1} + (1 - \beta\_1) g\_t \\ v\_t &= \beta\_2 v\_{t-1} + (1 - \beta\_2) g\_t^2 \end{aligned} \tag{12}$$

where *mt* and *vt* are estimates of the first and second moments of the gradients, respectively. Since *mt* and *vt* are initialized as zero vectors, they are biased toward zero; these biases are counteracted by computing bias-corrected first and second moment estimates:

$$
\hat{m}\_t = \frac{m\_t}{1 - \beta\_1^t}, \quad \hat{v}\_t = \frac{v\_t}{1 - \beta\_2^t} \tag{13}
$$

Therefore, based on the bias-corrected estimates, the Adam gradient update rule is generated as below:

$$
\theta\_{t+1} = \theta\_t - \eta \frac{\hat{m}\_t}{\sqrt{\hat{v}\_t} + \varepsilon}, \tag{14}
$$

The authors propose default values of 0.9 for β1, 0.999 for β2, and 10<sup>−8</sup> for ε.
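The update rule in Equations (12)–(14) can be sketched in plain NumPy; the toy objective f(x) = x², the learning rate, and the iteration count below are our own illustrative choices, not settings from the paper:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # Eq. (12): first-moment average
    v = beta2 * v + (1 - beta2) * grad ** 2     # Eq. (12): second-moment average
    m_hat = m / (1 - beta1 ** t)                # Eq. (13): bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # Eq. (14)
    return theta, m, v

# minimize the toy objective f(x) = x^2, whose gradient is 2x
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    x, m, v = adam_step(x, 2.0 * x, m, v, t)
```

With the bias correction in place, the early steps take roughly the full learning rate even though *mt* and *vt* start at zero, which is the practical benefit of Equation (13).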

#### *3.3. Experimental Results Analysis*

The all-pervading oxide scales on the surface of CC products have characteristics similar to real defects, especially in 2D images processed by image-processing algorithms. We call this pseudo-defect interference in the inspection process, as presented in Figure 11b. The steel plate displays confusing ROIs with a crack as well as other outliers, which makes ROI extraction very challenging even at room temperature. In Figure 11b we clustered the ROIs and found 1400 candidate ROIs. Figure 11c shows the laser scanning image for Figure 11b; processed in the same way, the counterpart of Figure 11b given by Figure 11c contains only 3 candidate ROIs. Therefore, the selective patches given by the locations of the candidate ROIs are computed and returned by the recognition model in 3D-LDS. Region proposal algorithms such as objectness, randomized Prim or selective search are often employed to identify prospective objects in an image. In this paper we followed the region proposal idea but devised a more effective approach that exploits the depth locations in the laser scanning image, as shown in Figure 11c. The candidate bounding boxes for defect recognition are proposed and resized to the closest image patch for recognition.

**Figure 11.** Searching for ROI in optical image and laser scanning (depth) image in 3D-LDS. (**a**) CCD image (object depth: 3 mm); (**b**) Searching results for ROI; (**c**) Laser scanning image and search results for ROI.
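The depth-guided ROI search described above can be illustrated with a minimal sketch: abnormal-depth pixels are thresholded against a reference surface and grouped into candidate bounding boxes by connected-component labeling. The threshold, minimum area, and flood-fill labeling below are hypothetical stand-ins for the clustering actually used in 3D-LDS:

```python
import numpy as np
from collections import deque

def candidate_rois(depth, ref_depth, tol=0.5, min_area=4):
    """Group pixels whose depth deviates from the reference plane into boxes."""
    mask = np.abs(depth - ref_depth) > tol
    H, W = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    boxes = []
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not seen[i, j]:
                # BFS flood fill over the 4-connected abnormal-depth region
                q = deque([(i, j)]); seen[i, j] = True; pts = []
                while q:
                    y, x = q.popleft(); pts.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True; q.append((ny, nx))
                if len(pts) >= min_area:
                    ys, xs = zip(*pts)
                    boxes.append((min(ys), min(xs), max(ys), max(xs)))
    return boxes
```

Each returned tuple is a (top, left, bottom, right) candidate bounding box that would then be resized to the closest recognition patch.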

Figure 12 illustrates the ROI depth location method. For abnormal depth areas, we extract only the centroid line as the positional depth values for 3D image reconstruction during scanning. Figure 12a shows the artificial defects: for convenience of calculation, we made samples of different depths and sizes for four defect types plus a randomly made "others" class. Figure 12b,c show the laser location process, in which pixel offsets are reflected on the image. Figure 12d shows the ROI-depth-based candidate bounding box generation method.

Figure 13a shows the training samples for the L crack generated by 3D-LDS at different scanning angles, distances and optical integration times. The labels (ground truth) in the second row are mainly delineated manually and generated by an interactive method to ensure accuracy. A data augmentation strategy was utilized; the parameters used for generating a new image are rotation\_range, translation\_shift\_range, zoom\_range and a blur operation. The training and testing data sets were split roughly 7:3 from different original data. Figure 13b shows the testing results, which are in fact reconstructed images built from the mapped pixels' prediction values. Different classification numbers can be set for the softmax *function* to obtain different outputs; however, the final binary image is segmented by a fixed threshold.
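A dependency-free sketch of such an augmentation step is given below; 90-degree rotations, integer shifts, and a 3 × 3 box blur stand in for the paper's rotation\_range, translation\_shift\_range, zoom\_range, and blur parameters, whose exact implementations are not specified:

```python
import numpy as np

def augment(img, rng, max_shift=2, blur=True):
    # random integer translation (proxy for translation_shift_range)
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    # random 90-degree rotation (coarse proxy for rotation_range)
    out = np.rot90(out, k=int(rng.integers(0, 4)))
    if blur:  # 3x3 box blur
        h, w = out.shape
        p = np.pad(out, 1, mode="edge")
        out = sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    return out
```

In practice each label mask would be transformed with the same random parameters as its image so that the ground truth stays aligned.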

**Figure 12.** The candidate ROI location in different areas. (**a**) The testing sample defects; (**b**) ROI location in linear defect; (**c**) ROI extraction in star defect; (**d**) The candidate bounding box.

**Figure 13.** Experimental results for training data sets and predicted reconstruction images. (**a**) Training samples for crack based on the devised data augmentation in 3D-LDS; (**b**) Prediction map (left) with its binary image (middle) and the 3D visualization (right) for the inspected L-crack defect.

In the 3D-LDS defect inspection process, there is a sensitive parameter: the radius of the candidate bounding box (BBox-R), which determines the size of the ROI relative to the size of the BBox. A relatively large radius can be set to ensure that the candidate BBox accurately encloses the ROI. However, this leads to regional imbalance (RI) and consequently brings about two main issues, especially in fully convolutional network training and testing:


Table 1 lists the testing results for five types of defects, where L-110(440) means the type is longitudinal cracks with 440 training and 110 testing samples. T means transverse crack, S denotes star-shaped defects and H means hole defects. To facilitate quantitative analysis, we employed image segmentation evaluation methods to validate the delineation step: the Dice coefficient (DICE), false positives (FP), false negatives (FN) and the mean Hausdorff distance (M-HD). Dice is twice the area of overlap between the ground truth (A) and the prediction (B), divided by the total number of pixels in both regions [38]:

$$Dice = \frac{2|A \cap B|}{|A| + |B|} \times 100\%, \tag{15}$$

The Dice value ranges from 0 to 1, with 1 signifying the greatest similarity between the prediction and the ground truth.
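Equation (15) translates directly into NumPy for binary masks; the convention for two empty regions is our own:

```python
import numpy as np

def dice(truth, pred):
    # Dice = 2|A ∩ B| / (|A| + |B|), computed on boolean masks
    truth = truth.astype(bool)
    pred = pred.astype(bool)
    denom = truth.sum() + pred.sum()
    if denom == 0:        # both masks empty: define perfect agreement
        return 1.0
    return 2.0 * np.logical_and(truth, pred).sum() / denom
```

Multiplying the result by 100 gives the percentage form used in Equation (15).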


**Table 1.** Experimental quantitative results.

We also used FP and FN to gain an overall understanding of the predicted results, since both are errors in which a test result improperly indicates the presence or absence of a condition. In general, the result is over-segmented if FP is greater than FN, and under-segmented in the opposite case. Meanwhile, we utilized M-HD to check the predicted boundary, as it is sensitive to boundary errors. However, we use the *mean* instead of the *max* to prevent interference from isolated noise points:

$$d\_H(X,Y) = \operatorname{mean}\{d\_{XY}, d\_{YX}\} = \operatorname{mean}\Big\{\operatorname{mean}\_{x \in X}\min\_{y \in Y} d(x,y),\ \operatorname{mean}\_{y \in Y}\min\_{x \in X} d(y,x)\Big\}, \tag{16}$$
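A minimal sketch of this mean Hausdorff distance, assuming each boundary is given as an (n, 2) array of pixel coordinates:

```python
import numpy as np

def mean_hausdorff(X, Y):
    # pairwise Euclidean distances between every point of X and every point of Y
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    d_xy = d.min(axis=1).mean()  # mean over X of the distance to its nearest point in Y
    d_yx = d.min(axis=0).mean()  # mean over Y of the distance to its nearest point in X
    return 0.5 * (d_xy + d_yx)
```

Replacing the two inner means with maxima would recover the classical (noise-sensitive) Hausdorff distance that the text argues against.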

In the model training process, we utilized the basic quantitative quality indicator ACC to validate the system:

$$\text{ACC} = (TP + TN) / (TP + TN + FP + FN),\tag{17}$$

ACC reflects the classifier's overall prediction correctness, where TP is the number of observations correctly assigned to the positive class and TN is the number of observations correctly assigned to the negative class. FP denotes the number of observations assigned by the model to the positive class that in reality belong to the negative class, and FN is the number of observations assigned to the negative class that in reality belong to the positive class. Figure 14 shows the validation process for the training and testing errors. Table 1 shows the quantitative experimental results; we used the additional FP and FN values as feedback on over-segmentation and under-segmentation so that the model parameters could be adjusted.
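The four counts and Equation (17) can be computed from binary masks as follows (the helper name is illustrative):

```python
import numpy as np

def accuracy(pred, truth):
    # confusion counts for binary classification / segmentation masks
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()    # correctly predicted positives
    tn = np.logical_and(~pred, ~truth).sum()  # correctly predicted negatives
    fp = np.logical_and(pred, ~truth).sum()   # predicted positive, actually negative
    fn = np.logical_and(~pred, truth).sum()   # predicted negative, actually positive
    return (tp + tn) / (tp + tn + fp + fn)    # Eq. (17)
```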

Regarding running time, we tested on a computer with two GPU cards, a GeForce GTX 1080 and a GeForce RTX 2080Ti; the 2080Ti was used for delineation and tested on the maximum BBox (320 × 320). It can perform 15 image segmentation tasks per second, which meets the requirements of the online CC production detection process. With regard to image scanning speed, we tested an image size of 1200 × 600 (the selected CCD camera delivers 14 fps at its full resolution of 2592 × 1944). The system can finish laser scanning at 45 fps because only the laser ROI is processed in the image. Therefore, the casting speed should be less than 0.8 m/min if the scanning spacing is 0.3 mm. In real applications, a high-performance image workstation or multi-machine distributed processing is preferred. The quantitative experimental results are given in Table 1.

**Figure 14.** Validation on model training and testing; (**a**) Training and testing errors; (**b**) ACC results on training and testing data sets.

#### **4. Conclusions and Future Work**

In this paper, an improved binocular vision-based 3D laser image scanning deep-learning system (3D-LDS) was established for CC products surface evaluation. The main work is as below:


Future work: Based on the experimental analysis, we found that optimizing the network architecture is a long-term task; there is no unified network model for different detection tasks and targets. It is therefore essential to conduct field experimental studies to construct a more robust network architecture, especially for the defect classification network. The aim is to solve the common over-fitting problem of current networks and to reduce the dependence on data source quality during model training. Furthermore, improved optimization algorithms for deep CNN model training should be studied through research on deep neural network mechanisms in the specific application context. In follow-up work, we will carry out field experiments and application research on a continuous casting production line.

**Author Contributions:** Conceptualization and methodology, L.Z.; data analysis and writing, F.L. and L.Z.; CNN networks architectural design and algorithm Improvement, Y.Z.; design and improvement of motion control system, X.X.; mechanical structure design of high precision experimental platform, H.X.; literature search and system validation, Y.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work was sponsored by National Natural Science Foundation of China (51604056 and 51605064) and Chongqing Science & Technology commission Foundation (cstc2016jcyjA0537). Chongqing Science and Technology Commission Industrial Application demonstration Project (cstc2017zdcy-zdyfX0025).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Automatic Fabric Defect Detection Using Cascaded Mixed Feature Pyramid with Guided Localization**

#### **You Wu, Xiaodong Zhang and Fengzhou Fang \***

State Key Laboratory of Precision Measuring Technology & Instruments, Centre of Micro/Nano Manufacturing Technology—MNMT, Tianjin University, Tianjin 300072, China; 13943173135@163.com (Y.W.); zhangxd@tju.edu.cn (X.Z.)

**\*** Correspondence: fzfang@tju.edu.cn

Received: 17 January 2020; Accepted: 4 February 2020; Published: 6 February 2020

**Abstract:** Generic object detection algorithms for natural images have been proven to have excellent performance. In this paper, fabric defect detection on optical image datasets is systematically studied. In contrast to generic datasets, defect images are multi-scale, noise-filled, and blurred, and visual perception is also sensitive to back-light intensity. To address these imbalance issues, large-scale fabric defect datasets are collected, selected, and employed to fulfill the requirements of detection in industrial practice. An improved two-stage defect detector is constructed for achieving better generalization. In stage one, stacked feature pyramid networks are set up to aggregate cross-scale defect patterns over interpolating mixed depth-wise blocks. By sharing feature maps, the center-ness and shape branches merge cascaded modules with deformable convolution to filter and refine the proposed guided anchors. In stage two, after balanced sampling, the proposals are down-sampled by position-sensitive region-of-interest pooling in order to characterize interactions among fabric defect images. The experiments show that the end-to-end architecture improves the occluded defect performance of region-based object detectors as compared with current detectors.

**Keywords:** fabric defect; object detection; mixed kernels; cross-scale; cascaded center-ness; deformable localization

#### **1. Introduction**

Industrial defect detection is important in manufacturing. Specifically, fabric defect control is a central part of quality control in the textile industry, as defects significantly increase the additional processing costs of the fabric. These costs derive from manually locating and inspecting defects and from suspending production to remove them. On the one hand, manual quality inspections are inefficient and usually require good backlighting. On the other hand, there is no quantitative defect classification indicator or boundary. This can result in false detections or missed detections, and it is not conducive to the late repair of defects or their removal before they occur.

With the popularization of artificial intelligence, traditional detection methods based on hand-crafted feature values and low-dimensional pixel features are gradually being replaced by data-driven intelligent learning algorithms. Compared with traditional algorithms, heuristic learning algorithms have the advantages of high recognition precision, strong generalization ability, no need to construct complex analytical relations, and low sensitivity to hyper-parameters. Intelligent detection methods are divided into unsupervised and supervised learning, both of which have achieved good performance in defect detection. For the former, Ahmed et al. [1] proposed to conduct low-rank and sparse decomposition jointly and extract weaker defect features based on a wavelet-integrated alternating dictionary matrix transformation; Gao et al. [2] utilized an unsupervised sparse component extraction algorithm to detect micro defects in a thermography imaging system by building an internal sub-sparse grouping mechanism and an adaptive fine-tuning strategy; Wang et al. [3] established successive optical flow for projecting thermal diffusion and applied principal component analysis to further mine spatial-transient patterns, strengthening detectability and sensitivity; Hamdi et al. [4] utilized non-extensive standard deviation filtering and K-means to cluster fabric defect blocks; Mei et al. [5] reconstructed fabric defect image patches with a convolutional denoising auto-encoder network at multiple Gaussian pyramid levels and synthesized the detection results from the corresponding resolution channels. For the latter, detection methods are mainly based on convolutional deep learning [6], and researchers have focused on optimizing architectures. Li et al. introduced DetNet [7], which was specifically designed to keep high-resolution feature maps for prediction, with dilated convolutions to increase receptive fields. Ren et al. proposed the Region Proposal Network (RPN) [8] to generate proposals in a supervised way based on sliding convolution filters; for each position, anchors (initial estimates of bounding boxes) of varying sizes and aspect ratios are proposed. Liu et al. [9] learned a lightweight scale-aware network to resize images such that all objects are at a similar scale. Singh et al. [10] conducted comprehensive experiments on small and blurred object detection. Girshick et al. [11] proposed the Region of Interest Pooling (RoI-Pooling) layer to encode region features, which is similar to max-pooling but operates across (potentially) different-sized regions. He et al. [12] proposed the RoI-Align layer, which addresses the quantization issue caused by the misalignment of object positions under down-sampling by bilinear interpolation at fractionally sampled positions within each grid. Based on RoI-Align, Jiang et al. [13] presented PrRoI-Pooling, which avoids any quantization of coordinates and has a continuous gradient on the bounding box coordinates.

Although generic models are simple and easy to deploy, they either over-fit or extract features insufficiently. More targeted detectors for defect images need to be established, owing to the dependence on the quality and manifold distribution of the dataset; the variety of imbalances in the image data structure makes defect identification difficult. The main contributions of this work lie in several aspects. First, a large imbalanced fabric defect dataset is collected and specially selected for training and validating a robust detector. Second, an efficient architecture for defect detection is designed, along with improved sub-modules adapted from generic object detection architectures. Third, the experiments reveal that over-fitting and insufficient feature extraction are the main causes of accuracy loss in defect detection. By simplifying models, not only is accuracy improved, but the parameter space is also compressed, which reduces inference delay and lays the foundation for industrial real-time defect detection.

The remainder of this paper is structured as follows: Section 2 introduces a dataset designed around the imbalanced variance of real-scene fabric defect images, aiming to explore a robust detector framework and validate its generalization compared with others. Section 3 presents the general network configuration for fabric defect detection: a simplified backbone with mixed convolution is proposed to avoid over-fitting, a composite interpolating pyramid is used for deep feature fusion, and a cascaded center-ness refining block is provided for localization regression. Section 4 states the experimental settings and related work in comparing the proposed network with other generic ones; an ablation study shows that the proposed modules are effective for defect localization. Finally, Section 5 concludes the paper.

#### **2. Data Space**

The feature unwrapping of a high-dimensional abstract data space is the core task of deep learning, and the data set connects the real scene with the semantic algorithm. The sample space therefore approximates the real space according to the maximum likelihood principle. This study trains generalization ability against the corresponding imbalance of real data by constructing imbalance in the sample data. To date, there is still no fabric defect dataset that is adequate and classic in size, class, and background variation. To address this problem, this paper proposes a large-scale optical fabric image dataset, named Fabric Defects (FBDF), consisting of 2575 images covering 20 defect categories and 4193 instances. All raw defect instances and fabric images were collected from textile mills in Guangdong Province, China. The images were selected from hundreds of fabric products with classical defects and labeled in 20 classes according to product demand and expert experience. Details and access are available at https://github.com/WuYou950731/Fabric-Defect-Detection.

#### *2.1. Defect Class Selection*

Selecting appropriate texture defect classes is the first step in constructing the dataset. Ambiguous categories and boundaries are a major issue for industrial datasets; in other words, some defects are too blurred to label accurately even when experts are sure they are defects. Therefore, the selected defect classes need to be high-resolution and relatively salient compared with the background and other categories. Some categories that are not common in real-world applications are not included in FBDF, and some fine-grained categories are treated as child categories. For example, stains that are common, clear, and play an important role in textile manufacturing environment analysis, such as oil stains, rust stains, and dye stains, are labeled identically. In addition, most defect categories in existing datasets are selected from gray cloth, i.e., from substrates with primary colors rather than decorative patterns. However, defects appear not only in the production stage, but also in transportation, sorting, and even cutting. Therefore, different patterns and backgrounds are taken into account in FBDF besides the texture of the fabric, as these can interfere with detectors. Table 1 shows samples of FBDF selected across classes and image properties.

#### *2.2. Characteristics of FBDF Dataset*

A good detector should have high generalization ability, and a good dataset should serve as a benchmark in testing and as guidance in training. When a dataset is relatively internally balanced [14], especially for natural images such as MS-COCO (Microsoft Common Objects in Context) [15] and Pascal-VOC (Pattern Analysis, Statistical Modelling and Computational Learning Visual Object Classes) [16], the various published detector architectures perform similarly. However, for an industrial dataset that is seriously imbalanced in image quality, size, and background, some techniques no longer work. It is necessary to introduce this imbalance into FBDF in order to train the recognition capability of the defect detector. Four key characteristics are highlighted.



**Figure 1.** Imbalance distribution of instances for FBDF. (**a**) Area and number distribution of instances per class. (**b**) Size ratio distribution of bounding boxes per class.


To sum up, FBDF is designed to contain imbalanced image data that are common and practical in textile scenes. FBDF provides a criterion data space for detectors to learn, fit, and represent, so that they adapt to various environments, sizes, and classes.

#### **3. Methodology**

The state-of-the-art deep learning object detectors can be divided into two major categories: two-stage detectors and one-stage detectors. In a two-stage detector, a sparse set of proposals is generated in the first stage; in the second stage, deep convolutional neural networks encode the feature vectors of the generated proposals, followed by object class predictions. A one-stage detector has no separate stage for proposal generation (or for learning proposal generation); it typically considers all positions on the image as potential objects and tries to classify each region of interest as either background or a target object. Although two-stage detectors generally have lower inference speeds, they often report state-of-the-art results on dark and low-saliency defect detection.

As shown in Figure 2, an end-to-end defect detector is composed of the data input, a backbone for feature extraction, a neck for feature fusion and enhancement, an RPN for anchor generation, and a head for training or inference. In this section, details of the framework and the learning strategies for the fabric defect detection application are introduced.

**Figure 2.** End-to-end fabric defect detection architecture.

#### *3.1. Backbone for Feature Extraction*

Detectors are usually trained on high-dimensional semantic information, adopting convolutional weights to preserve image spatial transformations. R-CNN (Region Convolutional Neural Network) [11] showed that the classification ability of the backbone is consistent with its localization ability in detection. Moreover, the backbone parameters, which correlate positively with detection performance, form the majority of the parameters in a detector. In this section, the trade-off between latency and accuracy of the final forward inference is the main design principle. A lightweight mixed depth-wise convolutional network is introduced to enhance feature extraction within a single layer, given the imbalance of multi-scale features found in fabric defects.

Mixed convolution [17] utilizes different receptive fields to fuse multi-scale local information through diverse kernel and group sizes. A squeeze-and-excitation [18] branch distinguishes the salience of feature layers with a visual attention mechanism [19], and residual skip connections [20] deepen semantic extraction and decoding. Inspired by these works, the network configuration contains these sub-modules and is adjusted for the textile dataset in order to lower the burden on the RoI extractor when recognizing occluded and blurred objects.

Mixed convolution (MC) partitions channels into groups and applies a different kernel size to each group, as shown in Figure 3. The group size g determines how many different types of kernels are used for a single input tensor. In the extreme case of g = 1, a mixed convolution becomes equivalent to a vanilla depth-wise convolution [21–23]. The experiments reveal that g = 5 is generally a safe choice for defect detection, which is size-imbalanced with a maximum area ratio near 25, as illustrated in Figure 1a. For the kernel size per group, if two groups share the same kernel size, they are equivalent to a single merged group; hence, each group is restricted to a different kernel size. Furthermore, the kernel size is designed to start from 3 × 3 and monotonically increase by two per group, since small kernels generally have fewer parameters and floating-point operations per second (FLOPS). Under this scheme, the kernel size for each group is predefined for any group size g, simplifying the design process. On the other hand, for the channel size per group, exponential partition is more general than equal partition, since a smaller kernel fuses less global information but acquires more channels to compensate with local details.

**Figure 3.** Mixed convolutional module for feature extraction.
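The grouping scheme above can be sketched as follows; the pure-NumPy averaging kernels and the exponential partition (with the remainder assigned to the first group) are illustrative simplifications of the actual learned filters:

```python
import numpy as np

def depthwise_conv(x, k):
    # x: (C, H, W); one k x k averaging kernel per channel, "same" padding, stride 1
    C, H, W = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.empty_like(x, dtype=float)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = xp[c, i:i + k, j:j + k].mean()
    return out

def exp_partition(channels, groups):
    # exponential channel partition: roughly halve the count per successive group
    sizes = [max(channels // 2 ** (g + 1), 1) for g in range(groups)]
    sizes[0] += channels - sum(sizes)  # remainder goes to the smallest-kernel group
    return sizes

def mixconv(x, groups=3):
    # split channels into groups and filter each with kernel sizes 3, 5, 7, ...
    sizes = exp_partition(x.shape[0], groups)
    outs, start = [], 0
    for g, s in enumerate(sizes):
        outs.append(depthwise_conv(x[start:start + s], 3 + 2 * g))
        start += s
    return np.concatenate(outs, axis=0)
```

With groups = 1 this reduces to a plain depth-wise convolution, matching the degenerate case described in the text.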

Table 2 states the main specification of the feature extraction backbone. SE denotes whether there is a squeeze-and-excitation module in the block. AF is the type of nonlinear activation function; here, HS stands for h-swish [24] and RE for ReLU. Batch normalization is used after convolution operations. The stride can be deduced from the other layer information and is omitted here. EXP Size denotes the expansion of the convolution inherited from MobileNetv2 [23], which avoids the loss of pixel features seen in ResNet, and the number of elements in the list indicates how many times the Bneck [25] is used. The FPN column indicates whether there is a head introducing the feature map into the FPN layers [26]. The Operators column mainly gives the mixed convolution parameters: {3 × 3, 5 × 5, 7 × 7} means the group number is 3 and the groups are filtered by these kernels, respectively.


**Table 2.** Specification for mixed convolution (MC) backbone.

#### *3.2. Neck for Feature Integrating and Refining*

Multi-scale feature fusion aims to aggregate features at different resolution necks. Formally, the multi-scale feature map of conventional FPN is defined as an iteration term:

$$P\_i^{out} = Conv(P\_i^{in} + Sample(P\_{i+1}^{out})) \tag{1}$$

in which,

$$
\stackrel{\rightarrow}{P}^{in} = \left( P\_{l\_1}^{in}, P\_{l\_2}^{in}, \ldots \right) \tag{2}
$$

*Sample* is an up-sampling or down-sampling operation for resolution matching, and *Conv* represents convolutional feature processing. In Equation (1), $P\_i^{in}$ is the *i*-th input feature layer at a given pixel resolution, and $P\_i^{out}$ is the *i*-th output feature layer in the backward propagation path with the same resolution as $P\_i^{in}$. In Equation (2), $\vec{P}^{in}$ represents the parallel feature input flow at interpolating scales. As the foundation of scale fusion, FPN offers two conclusions crucial for defect detection. First, a defect instance of any scale can be unified to the same resolution as long as the stride and kernel are well designed. Second, the sampled features preserve the most useful information in the defect image, and a defect position differs from the background mainly in its pixel brightness; this is slightly inconsistent with natural image datasets and closer to the principle of max pooling. However, the simple top-down FPN is inherently limited by its one-way feature flow.
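The iteration in Equation (1) can be sketched without a deep-learning framework by letting *Sample* be nearest-neighbour upsampling and standing in a 3 × 3 box filter for *Conv*:

```python
import numpy as np

def upsample2x(p):
    # nearest-neighbour upsampling: the Sample term of Eq. (1)
    return p.repeat(2, axis=0).repeat(2, axis=1)

def smooth3x3(p):
    # dependency-free stand-in for the Conv term: a 3x3 box filter
    h, w = p.shape
    pp = np.pad(p, 1, mode="edge")
    return sum(pp[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def top_down_fpn(pyramid):
    # pyramid: single-channel maps ordered fine-to-coarse, each half the previous size
    outs = [pyramid[-1]]                                     # coarsest level passes through
    for p_in in reversed(pyramid[:-1]):
        outs.append(smooth3x3(p_in + upsample2x(outs[-1])))  # Eq. (1)
    return outs[::-1]                                        # fine-to-coarse order again
```

Each output keeps the resolution of its input level while accumulating semantics from all coarser levels, which is exactly the one-way flow criticized in the last sentence above.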

Fusing layers need to be cross-scale connected with each other, derived from compression or interpolation, to continue strengthening the features. Fusion operations focus on two aspects: the aggregation path and the expansion path. For the former, PANet [27] adds an extra bottom-up path and CBNet [28] overlays parallel feature maps of different sizes; CBNet outperforms PANet for defect classes in detection, since it possesses more parameters and more aggregated features, as illustrated in Table 3. For the latter, as shown in Figure 4, NAS-FPN [29] treats up-sampling equally to convolution by employing neural architecture search and utilizes large-ratio connections to deepen the above two operations. However, the simplified configurations, and especially the unexplained topologies, of NAS [30] detectors and the EfficientDet [31] series trade semantic precision for efficiency.

**Table 3.** Performance of the state-of-the-art FPN baselines on FBDF.

**Figure 4.** Feature pyramid network frameworks.

For low-salience defects, this paper proposes several designing principles for neck feature fusion:


**Figure 5.** Composite interpolating feature pyramid (CI-FPN).

After setting the general fusion operation, weighted pixel analysis is necessary for feature refinement. Since different input layers are at different resolutions, they usually contribute unequally to the output. Previous fusion methods treat all inputs equally, without distinction, and cause bounding regression drift. Building on conventional resolution resizing and summation, this paper proposes to weight the salience of each feature layer as follows:

$$B\_L^j(i) = \text{Conv}\left(\frac{w\_1 \cdot B\_L^{j-1}(i) + w\_2 \cdot S\_L^j(B\_L^j(i)) + w\_3 \cdot S\_H^{j-1}\left(B\_H^{j-1}(i-1)\right)}{w\_1 + w\_2 + w\_3 + \epsilon}\right) \tag{3}$$

$$B\_H^j(i) = \text{Conv}\left(\frac{w\_1 \cdot S\_L^j \left(\mathcal{B}\_L^j(i+1)\right) + w\_2 \cdot S\_H^j \left(\mathcal{B}\_H^j(i+1)\right)}{w\_1 + w\_2 + \epsilon}\right) \tag{4}$$

In Equations (3) and (4), the feature flow between *Mi* and *Mi*+1 mainly builds on two kinds of blocks, *BL* and *BH*, denoting the *j*-th module and *i*-th layer on the extraction path and the interpolating path, respectively. Moreover, the learnable weight *wi* is a scalar at the feature level, which is comparably accurate to a tensor at the pixel level yet has minimal time cost. Normalization bounds the data fluctuation, and h-swish replaces Softmax to assign each weight a probability ranging from 0 to 1, alleviating the truncation error at the origin. ε = 0.001 is a disturbance constant to avoid numerical instability.
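The core of Equations (3) and (4), a normalized weighted sum of already-resized feature maps, can be sketched as follows; the non-negativity clamp stands in for the h-swish gating described above, and the names are illustrative:

```python
import numpy as np

def weighted_fuse(features, weights, eps=1e-3):
    # features: same-resolution maps; weights: one learnable scalar per map
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # keep weights non-negative
    w = w / (w.sum() + eps)                                # normalize as in Eqs. (3)-(4)
    return sum(wi * f for wi, f in zip(w, features))
```

Because each weight is a single scalar per feature level, the extra cost over an unweighted sum is negligible, which is the trade-off the paragraph above argues for.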

#### *3.3. Anchor Sampling and Refining*

Region anchors, the cornerstone of learning-based detection, predict proposals from predefined fixed-size candidates. Selecting positive instances manually from a large set of densely distributed anchors is time-consuming and limited to finite size variance, and some defect instances have extreme sizes, so the regression distance between ground truth and anchor may be large. Therefore, in the first stage, the detection pipeline of this paper adopts the guided anchor (GA-Net) [32] mechanism to predict the centers and sizes of proposals from the FPN outputs; in the second stage, regression and classification are conducted after feature fusion and alignment by position-sensitive (PS) RoI-Align [33], as shown in Figure 6.

**Figure 6.** Cascaded Guided-Region Proposal Network (CG-RPN) with semantics-guided refinement.

#### 3.3.1. Stage for Proposal Generation

All proposals are regressed to the bounding boxes of the final prediction, so the quality of the generator is crucial. Following the paradigm of GA-Net, the RPN comprises two branches for location and shape prediction, respectively. Given an FPN input F, the location prediction branch Ct (Center-ness) yields a probability map of Sigmoid scores for object centers via a 1 × 1 convolution, while the shape prediction branch Sp (Shape) predicts the location-dependent sizes. This branch leads the shapes to the highest coverage with the nearest ground-truth bounding box. Two channels represent the two variables, height and width, which must be transformed by Equation (5) because of their large range and instability.

$$B\_H^j(i) = \text{Conv}\left(\frac{w\_1 \cdot S\_L^j\left(B\_L^j(i+1)\right) + w\_2 \cdot S\_H^j\left(B\_H^j(i+1)\right)}{w\_1 + w\_2 + \varepsilon}\right) \tag{5}$$

in which,

$$\beta\_w = \frac{w\_{\max}}{w\_{\min}} \cdot \prod\_{i=1}^{L} s\_i, \qquad \beta\_h = \frac{h\_{\max}}{h\_{\min}} \cdot \prod\_{i=1}^{L} s\_i \tag{6}$$

where (*w*, *h*) are the outputs of shape prediction, *si* is the stride of layer *i* among the *L* layers, and β is a scale factor that depends on the size of the image data. The nonlinear mapping normalizes the shape space from approximately (0, 2000) to (0, 1), yielding an easier and more stable learning target. Since it allows arbitrary aspect ratios, our scheme can better capture extremely tall or wide defect instances and encode them into a consistent representation.

A further size-adaptation offset map is introduced, as the anchor shapes should be adjustable to capture defects over different ranges. With these branches, a set of anchors is generated by selecting the center positions whose predicted probabilities exceed a slightly lower threshold, together with the several shapes of top probability at each chosen feature position. Subsequently, the center-ness threshold is increased for the refinement of the next anchor-selection module, while the policy for anchor shapes is unchanged. Increasing thresholds are set in the successive sub-modules to include more probable central points and to deal with the misalignment of extremely shaped defects. Center-ness is shifted and updated after DF convolution. In this way, with the aid of a dense sampler, a large number of anchors are selected, suppressed, and regressed to 256 proposals for stage two. As shown in Figure 7, the yellow boxes are the maximum-IOU (Intersection Over Union) candidates chosen after coarsely locating irregularly shaped defect instances; the module is named cascaded guided RPN (CG-RPN). The red points denote strongly semantic feature positions and the blue triangles represent their centroids.

**Figure 7.** Effective receptive field from deformable extraction in guided localization.

#### 3.3.2. Stage for Bounding Box Generation

The bounding boxes need to be regressed and filtered from a large number of low-quality anchors, so Non-Maximum Suppression (NMS) [34] is applied to filter the overlapping ones by local maximum search; the result is shown in Figure 8. On the other hand, the multi-class branch fixes the output size of the fully connected layer, so RoI-Align together with adaptive pooling aggregates different receptive fields into a feature map of identical shape.

**Figure 8.** From positive anchors to proposals and finally to bounding boxes.

Moreover, the loss of the proposed detector is divided into four parts: the location branch and the final classification branch are similar and are optimized by Focal Loss (FL) [35] and Cross-Entropy (CE) Loss, respectively. The shape branch, which compares IOUs by manually assigning central area ranges, uses a Bounded IOU Loss (BI) computed from height and width, while the regression branch uses the common Smooth L1 Loss (SM), as follows:

$$L = FL\_{\text{loc}} + BI\_{\text{shape}} + CE\_{\text{cls}} + SM\_{\text{reg}} \tag{7}$$

where *CE* and *SM* are applied in stage two, and the smoothing factor is set to 0.04 to avoid sensitivity to outliers and suppress gradient explosion. *BI* is derived from *SM* together with the parameters of the bounding boxes. γ in the *FL* loss is 2 to balance positive and negative samples.

#### *3.4. Evaluation Metrics for Imbalanced Detection for Defects*

Imbalanced detection needs to be evaluated by the average recognition precision and by the variance of fluctuation among different categories. The similarity between ground-truth and predicted bounding boxes is proportional to the recognition ability of a detector. Similarity is denoted by IOU based on the Jaccard Index, which evaluates the overlap between two bounding boxes, as shown in Figure 9 and Equation (8). By comparison with a confidence threshold, the IOU of every instance in every category divides the prediction results into three cases: True Positive (TP) denotes a correct detection with IOU ≥ threshold, False Positive (FP) denotes a wrong detection with IOU < threshold, and False Negative (FN) denotes a ground truth that is not detected. After counting the number of instances of each kind, a balanced metric, AP (Average Precision), can be calculated to represent the average performance of detection. In Figure 9, the dashed curve is the Recall-Precision curve, which is approximated by the blue bins and whose area is no more than 1, for consistency with probability. AP is calculated as in Equation (9), where *n* is the number of bins (here 11) and *P* and Δ*r* are the height and width of each bin, respectively.

$$IOU\_{(c,i)} = \frac{\text{area}(S\_{Pred} \cap S\_{GT})}{\text{area}(S\_{Pred} \cup S\_{GT})} \tag{8}$$
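Equation (8) can be sketched in a few lines of plain Python; the box coordinates below are hypothetical examples, not values from the paper.

```python
# Minimal sketch of Equation (8): IOU (Jaccard index) between an
# axis-aligned predicted box and a ground-truth box, each given as
# (x1, y1, x2, y2).

def iou(pred, gt):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50/150 ≈ 0.333
```

Comparing this score against the confidence threshold yields the TP/FP/FN split described above.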

$$AP = \sum\_{b=1}^{n} P(b)\Delta r(b) \tag{9}$$

**Figure 9.** Metrics for performance of imbalanced detection.

For each single class, precision is the ability of a model to identify only the relevant objects: it is the percentage of correct positive predictions, given by TP/(TP + FP). Recall is the ability of a model to find all of the relevant cases (all ground-truth bounding boxes): it is the percentage of true positives detected among all relevant ground truths, given by TP/(TP + FN). AP differs between categories owing to the inter-class and intra-class imbalance of fabric defects. Firstly, the mean AP over all categories can be used as the overall performance of a detector; it is named mAP (often written simply as AP). Secondly, for inter-class imbalance, meaning that the data distribution of each class differs significantly from the others, the PR curve is better suited than the ROC (Receiver Operating Characteristic) curve, since ROC considers both positive and negative examples while AP focuses on the positive ones; Variance Precision (VP) illustrates the inter-class accuracy stability, as in Equation (10). Thirdly, intra-class imbalance mainly lies in size variance, so AP is reported separately for small, medium, and large objects, divided at the cross scales 96<sup>2</sup> and 256<sup>2</sup>.

$$VP = \frac{1}{C} \sqrt{\sum\_{k=1}^{C} \left( AP\_k - mAP \right)^2} \tag{10}$$
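Equations (9) and (10) can be sketched as follows; the bin width Δ*r* and the per-class AP values are illustrative assumptions, not numbers from the paper.

```python
import math

# Hedged sketch of Equations (9) and (10): AP as the area of a binned
# Recall-Precision histogram, and VP as the normalized spread of the
# per-class APs around mAP.

def ap_from_bins(precisions, delta_r=0.1):
    """Equation (9): sum over bins of height P(b) times width Δr(b)."""
    return sum(p * delta_r for p in precisions)

def vp(ap_per_class):
    """Equation (10): sqrt of squared deviations from mAP, divided by C."""
    c = len(ap_per_class)
    m_ap = sum(ap_per_class) / c
    return math.sqrt(sum((ap - m_ap) ** 2 for ap in ap_per_class)) / c

aps = [0.72, 0.61, 0.55, 0.80]       # hypothetical per-class APs
print(round(sum(aps) / len(aps), 3))  # mAP
print(round(vp(aps), 4))              # inter-class stability
```

A low VP together with a high mAP indicates an accurate and evenly balanced detector, matching the combined evaluation advocated in Section 4.2.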

#### **4. Experiments and Discussion**

#### *4.1. Experimental Settings*

The experiments are performed on FBDF to validate whether the modules above can resolve the imbalance of textile industrial scenes. The validation dataset is evenly split from the whole at a splitting ratio of 0.2. The images do not need to be resized, and the aspect ratio is unchanged. Mini-batch stochastic gradient descent and batch normalization [36] are implemented over two TITAN RTX GPUs with 18 images per worker on each GPU. Training is uniformly run for 20 epochs, and the learning rate is decreased every four epochs by a factor of 0.1. The evaluation metric is AP at different IOU thresholds (from 0.5 to 0.95). 200 instances of every class are randomly split from these images as a pre-training classification dataset, with which all backbones of the architectures are initialized in order to further strengthen feature extraction.
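The step schedule described above (decay by 0.1 every four epochs over 20 epochs) can be sketched as below; the base learning rate of 0.01 is an assumption for illustration, as the paper does not state it.

```python
# Sketch of the training schedule: the learning rate is decayed by a
# factor of 0.1 every four epochs. base_lr = 0.01 is a hypothetical
# starting value, not taken from the paper.

def lr_at(epoch, base_lr=0.01, step=4, gamma=0.1):
    return base_lr * gamma ** (epoch // step)

for epoch in range(0, 20, 4):
    print(epoch, lr_at(epoch))
```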

#### *4.2. Main Result*

The proposed scheme is evaluated against other state-of-the-art methods, as compared in Table 4. MC-Net and CI-FPN along with Mixed-16 (short for Defect-Net and composite interpolating FPN along with a 16-layer mixed convolution network) achieve a remarkable improvement, especially for small defects. They report a testing AP of 72.6%, an improvement of 12.1% over the 60.5% achieved by cascaded FPN under the same setting. When using the light-weight backbone (i.e., Mixed-16), the AP improvement over ResNet-50 is 6.7% and APS improves by 29.1%, which proves the effectiveness of mixed convolution. The increase from 60.5% (Cascaded R-CNN) to 65.9% (DF-Net + CI-FPN + ResNet50) and from 65.9% to 72.6% (MC-Net + CI-FPN + Mixed-16) supports the conjecture of over-fitting from large-scale backbones, especially for small defects. The VP of the generic baselines is larger than that of the proposed architecture on average. The VP of MC-Net along with CI-FPN and Mixed-16 is the lowest, with FCOS second, which reveals that the range of AP across categories is narrow and the distribution is relatively even. However, this does not mean that a lower VP always implies a more accurate detector. Taking YOLOv3 as an example, the experiments show that the APs of 11 classes are below 30% in spite of a VP of 10.8. Additionally, Cascaded-FPN has a VP 0.3 larger than Libra-FPN-RetinaNet, but an AP 4.5% larger. Therefore, the ability of a detector to address imbalance should be evaluated by a combination of AP and VP.


**Table 4.** Accuracy of different detectors on FBDF testing set.

#### *4.3. Ablation Experiments*

Backbone extraction. As MC-Net uses a light-weight yet powerful backbone, Figure 10 reveals how much each component contributes to the accuracy and efficiency improvements. Faster R-CNN with FPN is our baseline for the comparison of different backbones. First, the ResNet series is heavy and inefficient, achieving relatively low accuracy; ResNet-101 and ResNet-152 are even worse and are thus not shown in the figure. When replacing it with MobileNet, AP increases from 53% to 61% with MobileNetv3-Large [44], without cropping the images. The Mixed series achieves performance similar to the EfficientNet series, and Mixed-16 attains the top AP of 67.2% on FBDF; the slight decrease for Mixed-20 is due to the redundancy of its weights.

Beyond this improvement for Faster R-CNN, the MC module remains effective for other defect detectors. In Table 5, Cascaded FPN gains a 6.0% improvement from 66.5% with the ResNet-50 backbone, and the average improvement on small instances is 5.5%, which proves that MC-Net can extract more, and deeper, features from low-salience defects.

Neck fusion. In Figure 11, the AP of composite interpolating FPN rises as model complexity expands. Bn is short for n blocks of two-way information flow. Notably, when three blocks are employed along with inter-layer and intra-block cross-scale connections, the scheme is the most accurate, with 72.3% AP, 50.9% APS, and 36.4 MB of training parameters.

**Figure 10.** Model sizes with Average Precision (AP) for FBDF of different high-efficiency backbones upon Faster R-CNN and FPN.

**Table 5.** Performance of Mixed-16 applied in some generic detectors.


**Figure 11.** The performance of different configurations of CI-FPN with CG-RPN modules.

Proposals generation. With the deployment of CG-RPN, three feature adaptation modules refine the anchor centers and shapes in stage one. The center-ness thresholds are 0.3, 0.5, and 0.7 in successive cycles and, in the regression branch, every position in the feature map selects three anchors with aspect ratios of 0.5, 1.0, and 2.0 to enlarge the search space. In Figure 12, the left result is from common Faster R-CNN and the right one from CG-RPN, where fewer low-quality proposals are retained. In Table 6, different center-ness configurations are displayed; the tuple (0.3, 0.5, 0.7) surpasses the others in AP, since it introduces more computing parameters and relaxes the hard border of whether a sample belongs to the positive instances.
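The cascaded center-ness filtering can be sketched as follows; this is an illustrative simplification, not the authors' code, and the anchor scores are hypothetical Sigmoid outputs.

```python
# Illustrative sketch of cascaded center-ness filtering: each
# sub-module keeps only the anchor positions whose predicted center
# probability clears an increasing threshold, as in the tuple
# (0.3, 0.5, 0.7) from Table 6.

def cascade_filter(center_scores, thresholds=(0.3, 0.5, 0.7)):
    """Return the surviving (index, score) pairs after each stage."""
    survivors = list(enumerate(center_scores))
    stages = []
    for t in thresholds:
        survivors = [(i, s) for i, s in survivors if s >= t]
        stages.append(survivors)
    return stages

scores = [0.9, 0.2, 0.55, 0.75, 0.4]   # hypothetical center-ness scores
for t, kept in zip((0.3, 0.5, 0.7), cascade_filter(scores)):
    print(t, [i for i, _ in kept])
```

In the full pipeline, the surviving positions would additionally be refined by deformable convolution before the next stage.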

**Table 6.** Performance of CG-RPN with different center-ness threshold tuples.


Additionally, from the pre-trained model of MC-Net with CI-FPN, several bounding boxes and confidence values are shown in Figures 12 and 13.

**Figure 12.** Sample RPN proposals from Faster R-CNN and C-G RPN.

**Figure 13.** Visualization of the best bounding boxes of defects from MC-Net along with CI-FPN.

For Area Under Curve (AUC), MC-Net along with Mixed-16 scores higher than under AP: 76.9% mean AUC, 55.4% for small defects, and 82.6% for large defects. For CornerNet and CenterNet, the mean AUC increases by 3.5% and 4.3%, and small gains in accuracy appear in the other detection systems. However, in the textile industry, positive examples draw more attention than negative examples, and a detector should be robust to sensitive metrics. When the negative examples increase greatly, the ROC curve changes little, which is equivalent to generating a large number of FPs. In the context of imbalanced categories, the large number of negative cases makes the growth of FPR (FPR = FP/(FP + TN); TPR = TP/(TP + FN)) inconspicuous, resulting in an ROC curve that gives an overly optimistic estimate. Finally, misdetection would lead to constant interruptions of machine tools, resulting in low manufacturing efficiency. Therefore, in this work, the ROC curve is replaced by the PR curve.
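The argument above can be illustrated numerically; the counts below are hypothetical, chosen only to mimic a heavily imbalanced test set.

```python
# Numeric illustration: with few positives and many negatives, a large
# jump in false positives barely moves FPR (so the ROC curve looks
# optimistic), while precision collapses.

def fpr(fp, tn):
    return fp / (fp + tn)

def precision(tp, fp):
    return tp / (tp + fp)

tp, tn = 90, 100_000          # hypothetical: few positives, many negatives
for fp in (10, 1_000):        # 100x more false positives
    print(fp, round(fpr(fp, tn), 4), round(precision(tp, fp), 4))
```

Here FPR stays below 1% even after a hundredfold increase in false positives, while precision drops from 0.9 to under 0.1, which is why the PR curve is the more sensitive choice.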

#### **5. Conclusions**

This study addresses the problem of imbalanced detection of fabric defects. Firstly, a large-scale, publicly available dataset of optical fabric defect images is released, which enables the community to validate and develop data-driven defect detection methods. Secondly, several modules are designed to refine traditionally inefficient networks, including a mixed convolutional backbone, an interpolating FPN, and a cascaded guided anchor, in order to improve the recognition of occluded and size-variant defects. Finally, the study shows the importance of these frameworks for defect detection and provides a scheme that precisely meets the needs of the textile industry.

**Author Contributions:** Conceptualization, Y.W. and X.Z.; methodology, Y.W.; software, Y.W.; validation, Y.W.; formal analysis, Y.W., X.Z. and F.F.; investigation, Y.W. and F.F.; resources, X.Z.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W., X.Z. and F.F.; visualization, Y.W.; supervision, F.F.; project administration, X.Z. and F.F.; funding acquisition, F.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the "111" Project by the State Administration of Foreign Experts Affairs and the Ministry of Education of China (No. B07014).

**Acknowledgments:** The authors would like to thank the financial support received from National Natural Science Foundation of China (NSFC) (No. 51320105009 & 61635008) and the "111" Project by the State Administration of Foreign Experts Affairs and the Ministry of Education of China (No. B07014). Thanks go to the Alibaba Group for the dataset of fabric defect images.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Automatic Identification of Tool Wear Based on Convolutional Neural Network in Face Milling Process**

#### **Xuefeng Wu \*, Yahui Liu, Xianliang Zhou and Aolei Mou**

Key Laboratory of Advanced Manufacturing and Intelligent Technology, Ministry of Education, Harbin University of Science and Technology, Harbin 150080, China

**\*** Correspondence: wuxuefeng@hrbust.edu.cn; Tel.: +86-451-8639-0538

Received: 13 August 2019; Accepted: 2 September 2019; Published: 4 September 2019

**Abstract:** Monitoring tool wear in the machining process is important for predicting tool life, reducing equipment downtime, and lowering tool costs. Traditional visual methods require expert experience and human resources to obtain accurate tool wear information. With the development of charge-coupled device (CCD) image sensors and deep learning algorithms, it has become possible to use a convolutional neural network (CNN) model to automatically identify the wear types of high-temperature alloy tools in the face milling process. In this paper, a CNN model is developed based on our image dataset. A convolutional auto-encoder (CAE) is used to pre-train the network model, and the model parameters are fine-tuned by the back propagation (BP) algorithm combined with stochastic gradient descent (SGD). The established ToolWearnet network model can identify the tool wear types. The experimental results show that the average recognition precision of the model reaches 96.20%. At the same time, the automatic detection algorithm for the tool wear value is improved by incorporating the identified tool wear types. In order to verify the feasibility of the method, an experimental system is built on the machine tool. By matching the frame rate of the industrial camera to the machine tool spindle speed, the wear images of all the inserts can be obtained in the machining gap. Compared with manual detection by a high-precision digital optical microscope, the automatic detection method has a mean absolute percentage error of 4.76%, which effectively verifies the effectiveness and practicality of the method.

**Keywords:** tool wear monitoring; superalloy tool; convolutional neural network; image recognition

#### **1. Introduction of Tool Condition Monitoring (TCM)**

In milling operations, the quality of the machined workpiece is highly dependent on the state of the cutting insert. Factors such as wear, corrosion, or fatigue affect tool wear. Therefore, monitoring tool wear in the machining process is important for predicting tool life, reducing equipment downtime, and optimizing machining parameters [1]. Detection methods for tool wear can be categorized into two groups: direct and indirect [2]. Indirect measurement uses sensors to record a signal related to tool wear, from which the wear is obtained by signal analysis. Applicable signals widely used for tool wear measurement include acoustic emission, force, vibration, current, and power signals, etc. For example, Yao et al. [3] used the acoustic emission signal, the spindle motor current signal, and the feed motor current signal to monitor the tool wear state. Li et al. [4] used acoustic emission (AE) signals to realize TCM and tool life prediction. Jauregui et al. [5] used cutting force and vibration signals to monitor the tool wear state in high-speed micro-milling. Prasad et al. [6] analyzed the sound and light-emission signals during milling and obtained the relationship between tool wear and surface roughness. However, all of these signals are heavily contaminated by the inherent noise of the industrial environment, reducing their performance [7].

Recent advances in digital image processing have suggested that machine vision should be used for TCM. In this case, the direct method of measuring tool wear has higher accuracy and reliability than the indirect method [7]. An unsupervised classification was used to segment the tool wear area through an artificial neural network (ANN) and then to predict tool wear life [8]. Garcia-Ordas et al. [9] used computer vision technology to extract shape descriptors of the tool wear area, combined with a machine learning model, to achieve TCM. D'Addona et al. [10] used an ANN and DNA-based computing to predict the tool wear degree from image information extracted in pre-processing. Alegre-Gutierrez et al. [7] proposed a method based on image texture analysis for TCM in edge contour milling. The extracted descriptors of the tool wear area, together with the cutting parameters, serve as inputs to a wavelet neural network (WNN) model that predicts the tool wear degree [11]. Mikolajczyk et al. [12] developed a system that can automatically measure the wear degree of the cutting edge by analyzing its image. The application of machine vision methods to TCM has therefore become increasingly mature.

Since deep learning is a network structure with multiple layers of nonlinear processing units in series, it is possible to omit data preprocessing and use the original data for model training and testing directly [13]. With deep learning methods, enormous breakthroughs have been made in image recognition and classification [14–16], fault diagnosis [17,18], and medical health [19,20], etc., and CNNs have become one of the research focuses in artificial intelligence. In the mechanical manufacturing industry, CNNs can be used to monitor the operating conditions of gearboxes, bearings, etc., to identify faults intelligently and classify the diagnosis [21–23]. Regarding TCM, Fatemeh et al. [24] used extracted features to train and test a Bayesian network, a support vector machine, a K-nearest-neighbor regression model, and an established CNN model. The experimental results show that the CNN model has a higher recognition precision for TCM. Zhang et al. [25] converted the original vibration signal into an energy spectrum map by wavelet packet transform and trained and tested a CNN on the energy spectrum maps to classify tool wear.

The above methods realize real-time monitoring of tool wear, but the research is mainly concerned with qualitative monitoring of whether the tool is worn; there are relatively few studies of quantitative measurement, especially for each cutting edge of a milling tool. Conventional measurement methods, such as laser measuring instruments and optical projectors, are precise, as is the evaluation criterion for tool wear, but the measurement process must be separated from production, with the drawbacks of low efficiency, significant labor intensity, and shut-down measurement. It is challenging to meet the modern production requirements of high efficiency and automated processing [26].

The purpose of this paper is to develop a detection system that automatically recognizes tool wear types and obtains the wear value. It can obtain the wear images of all the inserts of a face milling cutter in the machining gap, with no downtime for measurement, only a reduction of the spindle speed. In the future, the entire system could be integrated into the machine tool computer numerical control (CNC) system to achieve high automation of the machining and measurement processes. We present an on-machine measurement method for tool wear. A tool wear image acquisition device installed on the working platform of the CNC machine tool is used, and the images of each cutting edge are obtained by matching the camera frame rate to the rotating speed. This paper draws on the VGGNet-16 network model [27], with its strong classification ability and high recognition precision, combined with the characteristics of tool wear images and the actual processing environment, to construct a CNN model that identifies wear types in the machining of superalloy tools. Based on the model, the image processing method is developed according to the type of wear to improve the wear value extraction algorithm. The Roberts operator is used to coarsely locate the wear boundary and refine the edge image, thus improving the precision of automatic tool wear value detection (named ATWVD). A sample set of tool wear images is obtained from the milling experiments, and the tool wear value can be measured by the method. Its precision and efficiency are estimated to meet the requirements of automation and precision in the milling process.

#### **2. Materials and Methods**

The tool wear collection device consists of a Daheng industrial digital camera (model: MER-125-30UC, Daheng company, Beijing, China) with a resolution of 1292 (H) × 964 (V) and a frame rate of 30 fps; a telecentric lens with a distortion rate below 0.08% (model: BT-2336, BTOS company, Xi'an, China); a ring light source (model: TY63HW, Bitejia Optoelectronics company, Suzhou, China) with regulator; a multi-function adjustment bracket; and a laptop computer. A three-axis vertical machining center CNC milling machine (model: VDL-1000E, Dalian machine tool company, Dalian, China), 490 R series CVD-coated carbide tools, and PVD-coated carbide inserts produced by Sandvik are used in the milling of Inconel 718. The dimensions of the workpiece are 160 mm × 100 mm × 80 mm. The chemical composition of the Inconel 718 material is shown in Table 1. No coolant is used during processing. The parameters of the cutting tool are shown in Table 2.

**Table 1.** Chemical composition of Inconel 718 material.



**Table 2.** Parameters of the cutting tool.

Since the CNC milling machine spindle usually does not have an angle control function, the tool wear area cannot be stopped accurately at the capture position of the CCD, so all the wear images of all the inserts are collected while the spindle is rotating. Image acquisition is performed every two minutes in the milling process. It is necessary to reduce the spindle speed of the machine tool and move the face milling cutter to the coordinate measurement position. Wear images are automatically acquired by the tool wear collection device installed on the CNC machine, and the generated image files are saved on the computer hard disk. After the automatic acquisition is completed, the inserts are removed during downtime, and the actual tool flank wear value is measured using a precision digital optical microscope (model: Keyence VHX-1000).

The cutting parameters used for the face milling operation are shown in Table 3. There are a total of 18 possible combinations of experimental conditions (ADG, AEG, AFG, ADH, AEH, AFH, BDG, BEG, BFG, BDH, BEH, BFH, CDG, CEG, CFG, CDH, CEH, CFH) based on the given cutting parameters. Each image is labeled with a combination of cutting conditions and processing time. For example, the image "3ADG08" represents the third insert of the face milling cutter, at a cutting speed of 60 m/min, a feed rate of 0.03 mm/z, and a depth of cut of 0.3 mm, with the tool wear image obtained after 8 min of machining. The acquired images are used as the datasets.
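The labeling scheme above can be decoded mechanically. The sketch below is illustrative: only the parameter values stated in the text (A = 60 m/min, D = 0.03 mm/z, G = 0.3 mm) are filled in; the remaining letters are defined in Table 3 and are left unmapped here.

```python
import re

# Sketch of decoding an image label such as "3ADG08": insert number,
# speed/feed/depth condition letters, and machining time in minutes.
SPEED = {"A": 60}     # cutting speed, m/min (other letters: see Table 3)
FEED = {"D": 0.03}    # feed rate, mm/z
DEPTH = {"G": 0.3}    # depth of cut, mm

def parse_label(label):
    m = re.fullmatch(r"(\d)([ABC])([DEF])([GH])(\d{2})", label)
    if not m:
        raise ValueError(f"unexpected label: {label}")
    insert, s, f, d, minutes = m.groups()
    return {
        "insert": int(insert),
        "speed_m_min": SPEED.get(s),
        "feed_mm_z": FEED.get(f),
        "depth_mm": DEPTH.get(d),
        "time_min": int(minutes),
    }

print(parse_label("3ADG08"))
```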



Each set of tests uses the same cutting parameters until the maximum flank wear of the cutting tool reaches 0.4 mm. The face milling cutter has six inserts. At the end of each set of milling tests, wear images of the six inserts are acquired, and six sets of complete tool wear life cycle data are obtained. Both coated tools are used in this way. The experimental environment for milling and tool wear detection is shown in Figure 2. In the process of collecting the tool wear images, the trigger mode is a soft trigger, the exposure time is 10 ms, and the exposure mode is timed. If the spindle speed is 2 r/min, the time interval for saving pictures is 250 ms and the number of saved pictures is 120; if the spindle speed is 4 r/min, the interval is 150 ms and the number of saved pictures is 100, which guarantees a complete acquisition of the wear image of each cutting edge. The process of matching the camera frame rate with the machine spindle speed is shown in Figure 1. The matching between the rotational speed and the acquisition frame rate is studied, and a maximum speed of 4 r/min is obtained for the 30 fps industrial camera. If the spindle speed is increased further, the acquired tool wear images show varying degrees of distortion; increasing the frame rate of the camera would allow a higher spindle speed and improve detection efficiency.
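A quick back-of-envelope check, under the settings stated above, confirms that each acquisition run spans exactly one spindle revolution (at *n* r/min, one revolution takes 60/*n* seconds):

```python
# Check of the frame-rate matching described above: time covered by the
# saved pictures (interval x count) versus the time of one revolution.

def seconds_per_rev(rpm):
    return 60.0 / rpm

for rpm, interval_s, n_saved in ((2, 0.250, 120), (4, 0.150, 100)):
    covered_s = interval_s * n_saved
    print(rpm, seconds_per_rev(rpm), covered_s)
```

At 2 r/min, 120 pictures every 250 ms cover 30 s, one full revolution; at 4 r/min, 100 pictures every 150 ms cover 15 s, likewise one revolution.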

**Figure 1.** Process of matching the camera frame rate with the machine spindle speed. (**A**,**C**) Images of different numbered inserts at different speeds; (**B**) numbered picture of inserts for face milling cutter.

**Figure 2.** Experimental environment for milling and tool wear detection. (**A**,**C**) Actual detection images; (**B**) The overall image of camera bracket.

#### **3. Tool Wear Types Identification Based on CNN**

#### *3.1. The Overall Process for Tool Wear Types Identification*

The tool wear images, after size normalization, form the datasets. Wear characteristics of the tool wear images are extracted adaptively; part of the data is used for training and the rest for testing. After the network model is pre-trained using a convolutional auto-encoder (CAE), its output is used as the initial value of the CNN parameters, and the CNN training continues to obtain the optimal solution of the entire network parameters. Finally, a softmax classifier is used to identify and classify the wear types of the milling tools. The whole network continuously iterates and feeds back the actual error of the calculation while updating the network weights, so that the whole network develops in the optimal direction, and the optimal wear type identification model is finally obtained. The overall process for tool wear identification is shown in Figure 3.

**Figure 3.** The overall process for tool wear identification.

#### *3.2. Network Structure for Tool Wear Types Identification*

The CNN structure is mainly composed of convolutional layers, pooling layers (also referred to as down-sampling layers), and fully connected layers. This paper constructs a recognition model of tool wear types (named ToolWearnet) based on a CNN, with reference to the VGGNet-16 model for image classification developed by the Visual Geometry Group of Oxford University. The parameters of the network are set according to the characteristics of the tool wear identification task. The ToolWearnet model consists of 12 layers, including 11 convolutional layers and one fully connected layer, and each convolution kernel is 3 × 3 in size.

Relative to a convolutional layer, a fully connected layer has more connections and generates more parameters. In order to reduce the network parameters, only one fully connected layer is used in the model, and the last pooling layer before it is set to a mean pooling layer with a kernel size of 4 × 4. This configuration effectively reduces the network parameters while preserving the network's feature extraction capability as much as possible. The acquired tool wear images are 256 × 256 × 3 in size and are randomly cropped to a uniform 224 × 224 resolution as the input to the network model. The tool wear images are divided into four categories according to wear type, and the fully connected layer is set to four output classes. The framework of the tool wear identification method and the network structure based on CNN are shown in Figure 4.

**Figure 4.** The framework of wear type identification method and network structure based on CNN.

#### 3.2.1. Convolution Layer

The convolution operation "slides" the convolution kernel over the image at a set step size, extracting features from the image to obtain feature maps. Batch normalization (BN) layers are added after Conv1, Conv3, Conv5, Conv7, Conv9, Conv10, and Conv11 to prevent over-fitting and speed up model convergence. A large image *xl* (*rl* × *cl*) is convolved with kernels of size *xs* (*rs* × *cs*); with *kc* learned features, multiple feature maps *Cli* (*i* = 1, 2, 3 ... *kc*) are obtained [24] as in the following Equation (1).

$$Cl\_i = \operatorname{ReLu}\left(\mathcal{W}^{(1)}\mathbf{x}\_s + b^{(1)}\right) \tag{1}$$

where *W*(1) and *b*(1) are the weight and offset from the visible units to the hidden layer units, respectively, and *ReLu*(*x*) is the nonlinear activation function. The actual size of the feature map *Cli* is then:

$$S(Cl\_i) = \left[ \left( (r\_l + 2 \times padding - r\_s)/stride \right) + 1 \right] \times \left[ \left( (c\_l + 2 \times padding - c\_s)/stride \right) + 1 \right] \times k\_c \tag{2}$$

where *kc* represents the number of convolution kernels, *padding* is the edge extension parameter, and *stride* is the step size. For example, in the CNN model for tool wear identification shown in Figure 4, a 224 × 224 × 3 tool wear image is input, and a 3 × 3 × 64 convolution kernel is used in the convolution layer Conv1. With a step size of 1 and an edge expansion parameter of 1, the actual size of the output feature map after convolution is 224 × 224 × 64.
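Equation (2) can be sketched directly; the example reproduces the Conv1 case from the text (224 × 224 × 3 input, 3 × 3 × 64 kernels, stride 1, padding 1).

```python
# Sketch of Equation (2): spatial size of a convolution output given
# input size (r, c), kernel size (rs, cs), kernel count k, stride, and
# padding.

def conv_output_size(r, c, rs, cs, k, stride=1, padding=0):
    out_r = (r + 2 * padding - rs) // stride + 1
    out_c = (c + 2 * padding - cs) // stride + 1
    return out_r, out_c, k

print(conv_output_size(224, 224, 3, 3, 64, stride=1, padding=1))
# the Conv1 case: (224, 224, 64)
```

The same formula, with the pooling kernel in place of the convolution kernel, gives the pooling output sizes of Section 3.2.2.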

#### 3.2.2. Pooling Layer

Pooling methods include maximum pooling, average pooling, and random-value pooling. Dropout is added after the AvgPool layer and the MaxPool4 layer to reduce the complexity of interactions between neurons, so that the neurons learn more robust features, generalization improves, and over-fitting of the network is avoided. The maximum pooling and average pooling methods are used here. The pooling kernel size is set as *pp* (*rp* × *cp*), and *Cl* (*r* × *c*) is the feature map obtained after the convolution operation. The multiple feature maps *pi* (*i* = 1, 2, 3, ... *kp*) obtained after the pooling operation [25] are given by Equations (3) and (4).

$$p\_i = \max\_{r\_p \times c\_p} (Cl) \tag{3}$$

$$p\_i = \underset{r\_p \times c\_p}{\text{average}} (Cl) \tag{4}$$

Then the actual size of the feature map *pi* is:

$$S(p\_i) = \left[ \left( \left( r + 2 \times \text{padding} - r\_p \right) / \text{stride} \right) + 1 \right] \times \left[ \left( \left( c + 2 \times \text{padding} - c\_p \right) / \text{stride} \right) + 1 \right] \times k\_p \tag{5}$$

where *kp* represents the number of pooling kernels. For example, when a feature map of 224 × 224 × 64 is input to the pooling layer MaxPool1 shown in Figure 4, a pooling kernel of 2 × 2 × 128 is used for the pooling operation on the tool wear images. With a step size of 2 and an edge expansion parameter of 0, the actual size of the output feature map after the pooling operation is 112 × 112 × 128.
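Equations (3) and (4) can be illustrated with a minimal, framework-free pooling routine over a nested-list "feature map"; `pool2d` and its arguments are illustrative names, not the paper's implementation.

```python
# Minimal 2x2 pooling with stride 2, showing max (Eq. 3) and
# average (Eq. 4) pooling over non-overlapping windows.
def pool2d(fmap, k=2, stride=2, mode="max"):
    rows = (len(fmap) - k) // stride + 1
    cols = (len(fmap[0]) - k) // stride + 1
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            window = [fmap[i * stride + a][j * stride + b]
                      for a in range(k) for b in range(k)]
            row.append(max(window) if mode == "max"
                       else sum(window) / len(window))
        out.append(row)
    return out

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(pool2d(fmap))              # [[6, 8], [14, 16]]
print(pool2d(fmap, mode="avg"))  # [[3.5, 5.5], [11.5, 13.5]]
```

As in the MaxPool1 example above, a 2 × 2 kernel with stride 2 halves each spatial dimension.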

#### 3.2.3. Fully Connected Layer

Each neuron in the fully connected layer is fully connected to all neurons in the previous layer. After training on the tool wear images and the corresponding labels, the classification results of tool wear types are output together with the recognition precision rate and loss function value of the CNN. The ToolWearnet model is constructed using the network structure described above. The size of the input tool wear image is 224 × 224. The specific parameters of each layer are shown in Table 4. The ToolWearnet model, which has a 12-layer network structure not counting the input layer, contains 11 convolutional layers, five pooling layers, and one fully connected layer. Conv1 to Conv11 are convolutional layers, MaxPool1 to MaxPool4 are max pooling layers, AvgPool is the average pooling layer, and FC is the fully connected layer.

**Table 4.** Structural parameters of the ToolWearnet model.


#### 3.2.4. Parameter Training and Fine Tuning

The training process of a CNN consists mainly of two stages: forward propagation and back propagation [28]. The network model is pre-trained by a convolutional auto-encoder (CAE) and then optimized and fine-tuned by the back propagation (BP) algorithm combined with the stochastic gradient descent (SGD) method. In batch optimization, the value and gradient of the loss function are computed over the entire training dataset, which is difficult to run on a PC for large datasets. Applying the SGD optimization method in the CNN avoids this problem and allows the network to converge more quickly.

The standard gradient descent algorithm updates the parameter θ of the objective function *J*(θ) using the following formula [24]:

$$\theta = \theta - \eta \nabla\_{\theta} E[J(\theta)] \tag{6}$$

where the expectation *E*[*J*(θ)] is the estimate of the loss function value and its gradient over all training data, and η is the learning rate.

The SGD method cancels the expected term when updating and calculating the gradient of the parameters. The formula is as follows:

$$\theta = \theta - \eta \nabla\_{\theta} l(\theta; m^{(i)}, n^{(i)}) \tag{7}$$

where (*m*(*i*) , *n*(*i*) ) is the training set data pair.

When updating the model parameters, the objective function of the deep network structure has a locally optimal form, and local convergence of the network may be led by the SGD method. Increasing the momentum parameter is an improved method that can speed up the global convergence of the network. The momentum parameter updated formula is as follows [18]:

$$v\_t = \alpha v\_{t-1} + \eta \nabla\_{\theta} l(\theta; m^{(i)}, n^{(i)}) \tag{8}$$

$$
\theta = \theta - \upsilon\_t \tag{9}
$$

where *vt* represents the current velocity vector and *vt−1* the velocity vector at the previous moment, both having the same dimensions as the parameter θ; α ∈ (0, 1) is the momentum factor, set here to α = 0.5.
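The momentum update of Equations (8) and (9) can be sketched on a toy quadratic loss; the learning rate below is chosen for the toy problem, not the paper's 10⁻³, while the momentum factor 0.5 follows the text.

```python
# SGD with momentum (Equations (8) and (9)) minimizing the toy
# quadratic J(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
theta, v = 0.0, 0.0
eta, alpha = 0.1, 0.5          # eta is illustrative; alpha = 0.5 as in the text
for _ in range(100):
    grad = 2 * (theta - 3.0)   # stand-in for the stochastic gradient
    v = alpha * v + eta * grad # Equation (8): velocity update
    theta = theta - v          # Equation (9): parameter update
print(round(theta, 4))         # converges close to the minimum at 3.0
```

The momentum term accumulates past gradients, which is what speeds up convergence compared with plain SGD on ill-conditioned or locally flat objectives.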

#### *3.3. Method of ATWVD Based on Tool Wear Types Identification*

The minimum bounding rectangles (MBR) method [29] is used to study the tool wear and tool breakage in milling of superalloy mentioned above. For the tool wear in the case of abrasive wear or chipping, the value of tool wear can be detected accurately, but in the case of adhesive wear and sheet peeling, the value obtained by using the MBR method directly has a significant error. This is because when the tool wear is in adhesive wear or sheet peeling, the image obtained from the camera is affected by the chips adhered on the cutting edge, as shown in Figure 5a,d, resulting in a larger rectangle surrounding the wear zone than the actual one.

**Figure 5.** Comparison of detection methods for the value of tool wear. (**a**,**d**) Roughly processed images of adhesive wear and sheet peeling; (**b**,**e**) cropped images; (**c**,**f**) edge refined images.

In order to remedy the deficiencies of the above detection methods, an ATWVD method based on tool wear type identification is proposed; the specific algorithm is as follows:


#### **4. Results and Discussion**

#### *4.1. Dataset Description*

The tool wear and tool failure in the milling of high-temperature superalloys are analyzed as an example. Tool wear is severe due to the high hardness, excellent fatigue performance, and fracture toughness of superalloys. At lower cutting speeds, flank wear and rake face wear are the main forms of wear, as shown in Figure 6a,d; at high cutting speeds, the heat generated causes significant chemical and adhesive wear on the tool, as shown in Figure 6c. In actual cutting, the wear form is often complicated, and multiple wear types coexist. For example, flank wear and crater wear (rake face wear) can be observed simultaneously; the two interacting wear forms accelerate the wear of the tool and finally lead to breakage, as shown in Figure 6b. Adhesive wear at the edge causes chips to adhere to the rake face, and finally sheet peeling occurs at the edge.

**Figure 6.** Form of tool wear and tool breakage (**a**) Flank wear; (**b**) tool breakage; (**c**) adhesive wear; (**d**) rake face wear.

After the tool wear images were acquired in the milling process using the above methods, the dataset was divided into three parts: 70% as the training set, 15% as the validation set, and 15% as the test set. We randomly selected two different cutting parameters and obtained twelve sets of wear images with complete tool wear life as the test set. Twelve sets of wear images were then selected in the same way as the validation set. The remaining tool wear images were all used as the training set. The test set and training set are mutually exclusive (the test samples do not appear in the training set and are not used in the training process). The training set was used for model training, the validation set was used to select model hyperparameters, and the test set was used to evaluate model performance.
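A 70/15/15 split of this kind can be sketched as below. This is a generic random split; the authors actually grouped whole wear-life image sets by cutting parameters, which is not reproduced here.

```python
# Random 70/15/15 train/validation/test split over image indices.
import random

random.seed(0)
images = list(range(8400))  # stand-ins for the 8400 tool wear images
random.shuffle(images)

n = len(images)
n_train = int(n * 0.70)
n_val = int(n * 0.15)
train_set = images[:n_train]
val_set = images[n_train:n_train + n_val]
test_set = images[n_train + n_val:]

print(len(train_set), len(val_set), len(test_set))  # 5880 1260 1260
```

The slicing guarantees the three sets are mutually exclusive, matching the requirement that test samples never appear in training.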

The CNN model is prone to overfitting due to its powerful fitting ability, especially when the dataset is small [30]. Since the number of tool wear images collected in this paper is limited and the ToolWearnet model has many layers, more tool wear images are needed for training. Rotation, flipping, scaling, translation, color transformation, and noise disturbance were used to extend the dataset five-fold, giving 8400 images in total. The tool wear images were classified into adhesive wear, tool breakage, rake face wear, and flank wear. After random cropping to a uniform size, the image datasets were obtained. The specific number of images for each wear type is shown in Table 5.
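The geometric augmentations mentioned above (rotation and flipping) can be illustrated on a plain nested-list "image", with no image library required; the function names are illustrative only.

```python
# Simple geometric augmentations of the kind used to extend the dataset.
def rot90(img):
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def hflip(img):
    """Mirror an image left-to-right."""
    return [row[::-1] for row in img]

img = [[1, 2],
       [3, 4]]
augmented = [img, rot90(img), rot90(rot90(img)), hflip(img)]
print(len(augmented))  # 4 variants derived from one source image
```

Scaling, translation, color transformation, and noise disturbance work the same way in principle: each transform yields an extra labeled sample from an existing one.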


**Table 5.** Statistics of the tool wear image data.

The tool wear type identification is a multi-classification task. Since tool wear types are identified on the milling machine during machining, recognition is likely to be affected by pollutants and fragmented chips in the processing environment. Irregular chips adhere to the tool during adhesive wear; if a chip is small and not easily observed, the classifier is likely to recognize it as rake face wear, as shown in Figure 7a. To reduce the probability of such errors, we collected images that are prone to recognition errors and used them as a sample set to re-train the CNN model. Similarly, if contaminants are attached to the main cutting edge of the rake face, the classifier is likely to recognize them as adhesive wear, as shown in Figure 7b. To avoid such errors, we will configure a high-pressure blowing device on one side of the machine spindle, using high-pressure gas to blow away contaminants and fragmented chips before performing tool wear type detection. In actual cutting, the wear form is often complicated, and multiple wear types coexist. The case shown in Figure 7c has the characteristics of both adhesive wear and tool breakage, which will also cause recognition errors. Because the current sample set is small, this mixed form of wear has a low occurrence probability; as the sample set expands, the number of its occurrences will increase. In the future, we will further study a specialized algorithm to reduce the above error cases.

**Figure 7.** Images of the detection error cases. (**a**) adhesive wear, (**b**) flank wear, (**c**) multiple wear coexist.

The CNN used in the deep learning experiments was built using the deep learning development kit provided by MATLAB 2017b combined with the Caffe deep learning framework. A Keyence VHX-1000 (KEYENCE, Osaka, Japan) high precision digital optical microscope was used to measure the actual tool wear values and verify the feasibility of the ATWVD method. Images manually collected with the high precision digital optical microscope are also part of the dataset. Partial images of the dataset and the process of depth search are shown in Figure 8.

**Figure 8.** The partial images of the datasets and the process of depth search.

#### *4.2. Evaluation Indications*

After training the ToolWearnet network model, the test set was used to evaluate its precision in identifying tool wear images. We used the precision and average recognition precision metrics [31], as shown in Equations (10) and (11). The ToolWearnet network model constructed in this paper achieves an average recognition precision (*AP*) of about 96% for tool wear types.

$$precision = \frac{TP}{TP + FP} \times 100\% \tag{10}$$

$$AP = \left( \frac{1}{n} \sum\_{i=1}^{n} \frac{TP\_i}{TP\_i + FP\_i} \right) \times 100\% \tag{11}$$

where *TP* is the number of true positive samples in the test samples, *FP* is the number of false positive samples in the test samples, *n* is the number of categories in the test sample, *n* = 4.
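Equations (10) and (11) amount to a per-class precision averaged over the four wear types; the (TP, FP) counts below are made-up illustrations, not data from the paper.

```python
# Per-class precision (Eq. 10) and average precision AP (Eq. 11).
def precision(tp, fp):
    return tp / (tp + fp) * 100.0

# Illustrative (TP, FP) pairs for the n = 4 wear-type classes.
per_class = [(97, 3), (96, 4), (95, 5), (96, 4)]
ap = sum(precision(tp, fp) for tp, fp in per_class) / len(per_class)
print(round(ap, 2))  # 96.0
```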

In order to test the performance of the automatic tool wear value detection (ATWVD) method based on tool wear types in measuring the degree of tool wear, the error rate (*ek*) and the mean absolute percentage error (*MAPE*) were used for evaluation, as shown in Equations (12) and (13). In this paper, the mean absolute percentage error (*MAPE*) of the ATWVD method is about 5%.

$$e\_k = \frac{|F\_k - A\_k|}{A\_k} \times 100\% \tag{12}$$

$$MAPE = \frac{1}{k} \sum\_{i=1}^{k} e\_i \tag{13}$$

where *Fk* represents the tool wear degree detected by the ATWVD method, *Ak* represents the wear value measured by the precision optical microscope, and *k* represents the number of measured tool wear images.

#### *4.3. Analysis and Results*

The initial learning rate of the training sets in the experimental phase is set to 10<sup>−3</sup>, the maximum number of training iterations is 10<sup>3</sup>, the dropout rate is 0.2, the weight decay is 5 × 10<sup>−4</sup>, and the batch size is 128. The neural network parameters are optimized with stochastic gradient descent (SGD). When training this network model, the batch normalization (BN) method is used to accelerate training. Because the BN method normalizes the data around the center, it effectively avoids the problem of vanishing gradients. When testing this network model, evaluation is performed once every six training sessions.
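The normalization step at the heart of BN can be sketched in a few lines; this is a stdlib-only illustration of the zero-mean, unit-variance transform, omitting BN's learned scale and shift parameters (gamma, beta).

```python
# Batch normalization's core step: normalize a batch of activations.
import math

def batch_norm(batch, eps=1e-5):
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

acts = [1.0, 2.0, 3.0, 4.0]
normed = batch_norm(acts)
print([round(x, 3) for x in normed])  # symmetric around zero
```

Centering and scaling the activations this way keeps gradients in a healthy range, which is why BN speeds up convergence and mitigates vanishing gradients.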

#### 4.3.1. Comparison of Two CNN Models for Tool Wear Recognition

In order to reflect the reliability of the ToolWearnet network model built in this paper in identifying tool wear types, and to make a reasonable evaluation of its recognition precision rate and efficiency, the VGGNet-16 network model was used for a comparative experiment. VGGNet-16 is a CNN model developed by the VGG (visual geometry group) at Oxford University for image classification. The VGGNet-16 network model has a 16-layer network structure with 5 maximum pooling layers, 13 convolutional layers, and 3 fully connected layers. The convolution kernels are all 3 × 3, and the pooling kernels are all 2 × 2. The first and second fully connected layers use dropout, and all hidden layers are equipped with ReLU layers. The model has strong image feature extraction ability. The detailed training parameters for the two models are shown in Table 6. After training and testing, different recognition precision rates and loss function values were obtained, as shown in Figure 9. A comparison of the recognition precision rates, training times, and recognition times of the two models for the different tool wear types is shown in Table 7.



**Table 6.** The detailed training parameters for the two models.

**Figure 9.** Comparison of two models testing precision rate and loss function values.

**Table 7.** The comparison of the recognition precision rate, training time, and recognition time of the two models.


The results show that the ToolWearnet network model quickly reaches a high precision rate at the beginning of the test. In the first 24 iteration periods, the recognition precision rates of both models increase continuously and the loss functions decrease continuously; when the iteration period exceeds 24, the recognition precision rate of the ToolWearnet model stabilizes, while that of the VGGNet model keeps increasing and its loss function keeps decreasing, only stabilizing after the iteration period exceeds 76. Therefore, the convergence speed of this model is faster than that of the VGGNet-16 network model. The VGGNet-16 model has more network layers, a more complex network structure, and more network parameters, resulting in a very large amount of calculation and slow convergence. For both models, the recognition precision rate trends upward and the loss function trends downward with the number of iterations. When the recognition precision rate and loss function of both models stabilize, the recognition precision rate curve of the VGGNet model lies above that of the ToolWearnet model; thus, the recognition precision rate of this model is slightly lower than that of the VGGNet model. Compared with the ToolWearnet model, VGGNet-16 has greater network depth and more feature maps, so the feature space the network can represent is larger, the learning ability of the network is stronger, and deeper feature information can be mined, but the network structure is also more complicated, and the storage memory and training time increase.

The ToolWearnet network model has an average recognition precision rate of 96.20% for tool wear types, with the highest recognition precision rate for adhesive wear, at 97.27%, and the lowest for flank wear, at 95.41%. This is because irregular chips adhere to the cutting edge during adhesive wear, making the contour features of the wear area most apparent, while the contour features of flank wear are less evident because the differences in the shape of the wear area are small. Although the recognition precision rate of the ToolWearnet network model is slightly lower than that of the VGGNet-16 network model, its training time and recognition time are shorter and its computational requirements on hardware are low, which can meet industrial needs for fast and accurate identification during cutting.

#### 4.3.2. Tool Wear Value Detection Comparison Experiment

In the experiment, the spindle speed was matched with the frame rate of the camera, and the wear image of each insert on the face milling cutter was collected. Each acquisition took the insert directly below the label on the cutter as the starting point and numbered the six inserts sequentially in the direction of tool rotation. Images with good quality and complete tool wear areas were acquired after filtering by blur detection and insert position detection. Then, the flank wear value of each insert was obtained by the ATWVD method combined with the CNN model. After the automatic acquisition of the tool wear images, the CNC machine was stopped and the inserts were measured with a high precision digital optical microscope to obtain the exact tool wear values. A comparison of the pictures collected in the two ways is shown in Figure 10. We randomly selected one set of tool wear images with a complete tool wear life cycle from the test set for comparison, taking "2CEG" as an example. The maximum wear value VB detected by the ATWVD method and by manual high precision digital microscope measurement, together with the error rate of the ATWVD method, are shown in Figure 11.

**Figure 10.** Comparison of two acquisition methods. (**a**) image is manually collected by high precision digital microscope (model: Keyence VHX-1000); (**b**) image is collected by the automatic acquisition device.

The results show that the wear value detected by the ATWVD method is close to that measured manually with the high precision digital optical microscope; the error rate of the wear value is in the range of 1.48~7.39%, and the mean absolute percentage error (*MAPE*) is 4.76%. The method of automatic tool wear type identification based on CNN is therefore verified. In practice, the method is suitable for intermittent detection of the tool, such as collecting tool information while the tool is in the tool magazine or during the tool change gap. It can automatically and effectively obtain the tool wear types and wear values, which can provide data for future tool life prediction, tool selection, and tool design.

**Figure 11.** Comparison of the maximum wear value measured by the two methods and the error rate of the ATWVD method.

#### **5. Conclusions**

In this paper, an automatic recognition model of tool wear types based on CNN is proposed by using Caffe deep learning framework. The model considers the characteristics of the actual processing environment and tool wear images in milling of superalloy. The tool wear images obtained through the milling experiments of Nickel-based superalloy Inconel 718 are used as the datasets to train and test the ToolWearnet network model. The results show that the model has a robust feature extraction ability. The recognition precision rate of different wear types of high-temperature alloy tools is in the range of 95.41~97.27%, and the average recognition precision rate is 96.20%. It has the advantages of high recognition precision rate and robustness.

Furthermore, an ATWVD method based on the ToolWearnet network model is developed, and the value of tool wear obtained by this method is compared with the wear value detected by a high precision digital optical microscope. The error rate of this method is in the range of 1.48~7.39%, and the mean absolute percentage error is 4.76%, which demonstrates the reliability of the method. Although the recognition precision rate of the network model is slightly lower than that of the VGGNet-16 network model, the training time and recognition time are shorter, and the network has fewer parameters. It can be applied in scenarios that require accurate and fast identification of tool wear types with low hardware consumption.

In practice, the method is suitable for intermittent detection of the tool, such as collecting tool information while the tool is in the tool magazine or during the tool change gap. It can automatically and effectively obtain the tool wear types and wear values, which can provide data for future tool life prediction, tool selection, and tool design. The proposed tool wear identification method can also be extended to other machining processes, such as drilling and turning. Meanwhile, future work will consider more extensive experimental data with different cutting tools, as well as extending the applications to predicting tool life and optimizing machining parameters with the method.

**Author Contributions:** Conceptualization, X.W. and Y.L.; Data curation, Y.L., A.M., and X.Z.; Formal analysis, X.W. and Y.L.; Investigation, X.W. and Y.L.; Methodology, X.W. and Y.L.; Project administration, X.W.; Resources, A.M. and X.Z.; Software, X.W., Y.L., and X.Z.; Supervision, X.W.; Writing—original draft, Y.L.; Writing—review and editing, X.W., Y.L., A.M., and X.Z.

**Funding:** This work was financially supported by the National Science Foundation of China (Grant No.51575144).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Nomenclature**


#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Citrus Pests and Diseases Recognition Model Using Weakly Dense Connected Convolution Network**

#### **Shuli Xing 1, Marely Lee 1,\* and Keun-kwang Lee 2,\***


Received: 18 June 2019; Accepted: 16 July 2019; Published: 19 July 2019

**Abstract:** Pests and diseases can cause severe damage to citrus fruits. Farmers used to rely on experienced experts to recognize them, which is a time-consuming and costly process. With the popularity of image sensors and the development of computer vision technology, using convolutional neural network (CNN) models to identify pests and diseases has become a recent trend in the field of agriculture. However, many researchers apply models pre-trained on ImageNet to different recognition tasks without considering their own dataset scale, resulting in a waste of computational resources. In this paper, a simple but effective CNN model was developed based on our image dataset. The proposed network was designed for parameter efficiency. To achieve this goal, the complexity of cross-channel operations was increased and the frequency of feature reuse was adapted to network depth. Experimental results showed that Weakly DenseNet-16 achieved the highest classification accuracy with fewer parameters. Because this network is lightweight, it can be used in mobile devices.

**Keywords:** citrus; pests and diseases identification; convolutional neural network; parameter efficiency

#### **1. Introduction**

Pests and diseases are the two most important factors affecting citrus yields. Types of citrus pests and diseases are numerous in nature. Some of them are similar in appearance, making it difficult for farmers to precisely recognize them in time. In recent years, developments of convolutional neural networks (CNNs) have dramatically improved the state of the art in computer vision. These new network structures have enabled researchers to obtain high accuracy for image classification, object detection, and semantic segmentation [1]. Therefore, some studies have adopted the CNN model to identify the category of pests or diseases based on images. Liang et al. [2] have proposed a novel network consisting of residual structures and shuffle units for plant disease diagnosis and severity estimation. Cheng et al. [3] have compared the classification performance of CNN models of different depths for 10 classes of crop pests with complex shooting backgrounds. The highest classification accuracy in both studies was obtained with the deepest network. For detection tasks, people are also more willing to select a very deep network architecture instead of a shallow one. Shen et al. [4] have applied a faster R-CNN [5] framework with improved Inception-V3 [6] to detect stored-grain insects under field conditions with impurities. The same feature extractor network and SSD [7] model have been utilized by Zhuang et al. [8] to evaluate the health status of farm broilers.

In theory, the complexity of the CNN model depends on the scale of dataset. However, deep convolutional networks mentioned above were all over-fitted because they were proposed based on ImageNet [9] initially. Although a fine-tuned method [10] can be used to reduce the divergence between training and testing, the space required for model storage is so large that they cannot be deployed on mobile devices with little memory.

In this paper, a simple but effective network architecture was developed to classify pictures of citrus pests and diseases. Our network design principles focused on improving the utilization of model parameters. There has been evidence suggesting that some feature maps generated by convolutions are not useful [11,12]. To decrease the impact of redundant features on classification, Hu et al. [13] and Woo et al. [14] have introduced an attention mechanism to suppress unnecessary channels. Their approaches are more adaptable than Dropout [15] and stochastic depth [16]. However, the extra branch in each building block increases the overhead of a network. Unlike these approaches, the channel selection of this paper was implemented through cross-channel feature fusion. In Network in Network [17], two consecutive 1 × 1 convolutional layers were regarded as a way to enhance model discriminability for local patches. From another perspective, this structure is also a good choice for refining feature maps. Highway network [18] first provided the idea of feature reuse to ease the optimization difficulty suffered by deep networks. ResNet [19] generalized it with identity mappings. DenseNet [20] further boosted the frequency of skip-connections. DenseNet has a better representation ability than ResNet because it produces higher accuracy with fewer parameters. We followed the concatenation operation of DenseNet but removed some connections between long-range layers. Because of this weakly dense pattern, our network is called Weakly DenseNet.
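The difference between DenseNet's full connectivity and the weakly dense pattern can be shown schematically; the connection window of 2 below is an assumed illustration, not the paper's actual topology.

```python
# Which earlier layers feed layer i: DenseNet vs. a weakly dense pattern.
def dense_inputs(layer_idx):
    """DenseNet: layer i concatenates the outputs of ALL earlier layers."""
    return list(range(layer_idx))

def weakly_dense_inputs(layer_idx, max_range=2):
    """Weakly dense (assumed window of 2): drop long-range connections."""
    return list(range(max(0, layer_idx - max_range), layer_idx))

print(dense_inputs(4))          # [0, 1, 2, 3]
print(weakly_dense_inputs(4))   # [2, 3]
```

Removing long-range connections reduces the channel count entering each concatenation, which is one way to cut parameters while keeping the benefits of feature reuse.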

Experimental results showed that Weakly DenseNet achieved the highest accuracy in classifying citrus pests and diseases. With regard to computational complexity, our proposed model is also lightweight. These results indicate that the optimization of network structure is more important than blindly increasing the depth or width. The main contributions of this work are summarized as follows:

- A specific image dataset of citrus pests and diseases is created. It is a relatively complete image dataset for the diagnosis of citrus pests and diseases.
- A novel and lightweight convolutional neural network model is proposed to recognize the types of citrus pests and diseases. The network design is based on improving parameter efficiency.
- A new data augmentation method is developed to reinforce model generalization ability, which can significantly reduce the similarity between generated images.

#### **2. Related Work**

Pests and diseases can cause great damage to crops if they are not controlled. To recognize them, farmers used to rely on experienced experts. With the popularity of image sensors, using computer vision methods to identify pests and diseases has become a trend. Boniecki et al. [21] have proposed using image analysis techniques and an artificial neural network model to classify images of apple pests against simple backgrounds. Their dataset included 12,000 images from six species of apple pests most commonly found in Polish orchards. For training and testing the proposed artificial neural network model, seven selected shape coefficients and 16 color characteristics were extracted from each pest image as inputs. Sun et al. [22] have combined SLIC (simple linear iterative clustering) with an SVM (support vector machine) classifier to detect diseases on tea plants. Their algorithm improved the prediction accuracy for disease images taken against complex backgrounds but needed more pre-treatment to reduce interference. A total of 1308 pictures from five common tea plant diseases were included in their dataset. These images were divided into two parts with a ratio of 4:1 for training and testing. Ferentinos [23] has employed deep CNN models to perform plant disease detection and diagnosis. They used an open database which contains 87,848 photographs of leaves to train each model. Images without pre-processing were regarded as inputs in his study. Compared with other selected models, VGG achieved the highest success rate, 99.48%. These advantages of deep CNNs have encouraged more researchers to apply them in the agricultural field.

A wide range of CNN architectures has been proposed to improve performance. VGGNets [24] first used small convolution filters to reduce parameters and increase depth. ResNet exploits a simple identity skip-connection to ease optimization issues of deep networks. WideResNet [25] replaces the bottleneck structure in ResNet with two broad 3 × 3 convolutional layers to reduce depth. DenseNet enhances deep supervision [26] by iteratively concatenating input features with output features. Xception [27] introduces a depthwise separable convolution to decrease the number of parameters in a regular convolution: the depthwise convolution is responsible for feature extraction, and the pointwise convolution (a regular 1 × 1 convolution) is used for cross-channel feature fusion. This new convolution operation has become a core component of many lightweight networks, such as MobileNets [28,29] and ShuffleNets [30,31]. The structure of MobileNet-v1 is similar to that of VGG. MobileNet-v2 develops an inverted residual block to increase memory efficiency. To maintain the representational power of the narrow layer in each inverted residual block, the ReLU activation [32] behind it is removed. ShuffleNet-v1 employs group convolution [33] to further reduce the computational cost of depthwise separable convolution; a channel shuffle operation is adopted to enhance the information exchange between subgroups. ShuffleNet-v2 is constructed based on ShuffleNet-v1. However, it suggests splitting channels into two equal parts and using concatenation instead of addition to execute feature reuse. People tend to use architectures designed for ImageNet without considering their own dataset scale. This behavior may lead to overfitting problems and a waste of computing resources. Different from previous approaches, a novel and lightweight network was constructed to classify images in our dataset.
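The parameter saving of a depthwise separable convolution over a regular convolution can be verified by a direct count; the channel sizes below are an arbitrary example.

```python
# Parameter counts: regular k x k convolution vs. depthwise separable
# (a k x k depthwise convolution followed by a 1 x 1 pointwise one).
def regular_params(k, c_in, c_out):
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128  # example layer sizes
print(regular_params(k, c_in, c_out))    # 73728
print(separable_params(k, c_in, c_out))  # 8768
```

For this example the separable form uses roughly 8× fewer parameters, which is why it underpins MobileNets and ShuffleNets.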

#### **3. Dataset**

The dataset used in our experiment included 17 species of citrus pests and seven types of citrus diseases. Pest images were mainly collected from the Internet. Images of diseases were taken in a tangerine orchard on Jeju Island using a high-resolution camera. Our image dataset is available at the website listed in Appendix B. Table 1 shows the name and number of images of each kind of pest and disease.


**Table 1.** The description of citrus pests and diseases image dataset.

#### *3.1. Image Collection of Citrus Pests*

Insect pests undergo metamorphosis. We focused on images of adults, because the other stages of their life cycles are short and rarely observed. The appearance of the same pest can vary significantly from one viewing angle to another (refer to Figure 1). To reduce the effect of shooting angle on classification accuracy, photos of pests taken from different angles were gathered. Some citrus pests, such as aphids, mealybugs, and scales, are small; it is difficult to capture images of individuals, and most of them live in groups to resist predators. For these species, pictures of groups living on a tree were collected (refer to Figure 2).

**Figure 1.** Pictures of brown marmorated stink bug taken from different angles.

**Figure 2.** Examples of small size citrus pests: (**a**), (**c**), and (**e**) are the images of their individuals, (**b**), (**d**), and (**f**) are those of their groups.

#### *3.2. Image Collection of Citrus Diseases*

Compared with pests, the features of citrus diseases are more regular. Pictures of citrus diseases were mainly taken in summer after heavy rain, when the incidence was higher than usual. To preserve detail, the camera was positioned close to the diseased area. Some diseases cause leaf holes at a later phase; for comparison, images of leaf holes created by pests (PH) were included as a disease label. Figure 3 displays sample images of each disease.

**Figure 3.** Representative images of citrus diseases. (**a**) citrus anthracnose, (**b**) citrus canker, (**c**) citrus melanose, (**d**) citrus scab, (**e**) sooty mold, (**f**) leaf miner, and (**g**) pest holes.

#### *3.3. Data Augmentation*

The problem of imbalanced data classification has been discussed by Das et al. [34]. This prompted us to increase the number of images in classes with fewer samples than the others. For augmenting image data, the generic practice is to perform geometric transformations such as rotation, reflection, shift, and flip. However, images generated by a single type of operation are similar to each other, which increases the probability of overfitting. To avoid this situation, a new data augmentation method was proposed that randomly selects three kinds of operations and combines them to produce new images. The available operations and their parameter values are shown in Table 2. Figure 4 presents pictures obtained with this approach.


**Table 2.** Parameter set for data augmentation.
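The random three-operation combination can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the operation set and parameter ranges here are placeholders standing in for Table 2.

```python
import numpy as np

# Placeholder geometric operations on an image array of shape (H, W, C);
# the actual operations and parameter ranges follow Table 2 of the paper.
OPERATIONS = {
    "rotate":          lambda img, rng: np.rot90(img, k=rng.integers(1, 4)),
    "horizontal_flip": lambda img, rng: img[:, ::-1],
    "vertical_flip":   lambda img, rng: img[::-1, :],
    "shift":           lambda img, rng: np.roll(img, rng.integers(-20, 21), axis=1),
}

def augment(image, rng):
    """Randomly pick three distinct operations and apply them in sequence."""
    names = rng.choice(list(OPERATIONS), size=3, replace=False)
    out = image
    for name in names:
        out = OPERATIONS[name](out, rng)
    return out

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
augmented = augment(image, rng)
print(augmented.shape)
```

Because three operations are sampled without replacement and composed, two augmented copies of the same source image rarely coincide, which is the similarity-reduction effect the method aims for.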


**Figure 4.** Comparison between different data augmentation methods. (**a**) Original image, (**b**), (**c**), and (**d**) images generated from proposed algorithm, (**e**), (**f**), and (**g**) pictures produced by a single rotation operation.

#### **4. Weakly DenseNet Architecture**

Convolutional layers in a CNN model are responsible for feature extraction and generation. Therefore, many researchers have focused on increasing depth and width to improve classification accuracy. In contrast, the proposed Weakly DenseNet was designed to improve parameter utilization. To reach this goal, a complex cross-channel operation was adopted to refine feature maps, and concatenation was used for feature reuse.

#### *4.1. The 1* × *1 Convolution for Feature Refinement*

A regular convolution involves two aspects: the local receptive field and weight sharing. From the local receptive field point of view, a 1 × 1 convolution regards each pixel of a feature map as an input. When weight sharing is considered, however, it is equivalent to multiplying the whole feature map by a learnable weight. Therefore, this kind of convolution can refine feature maps.

One layer of 1 × 1 convolution only implements a linear transformation, and many network architectures use it merely to alter the channel dimension [19,20]. To extend the functionality of the 1 × 1 convolution, two such layers were stacked after each 3 × 3 convolutional layer. The proposed structure takes each whole feature map as input and thus does not need an extra branch to execute feature recalibration, which reduces the optimization difficulty compared with SENet [13]. Figure 5 illustrates the difference between them.

**Figure 5.** Feature refinement. (**a**) Squeeze-and-excitation block, (**b**) the proposed method.
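The per-pixel view of the 1 × 1 convolution described above can be sketched in a few lines of NumPy. The function name and tensor shapes are illustrative, not from the paper:

```python
import numpy as np

def conv1x1(x, w):
    """1 x 1 convolution: every pixel's channel vector is multiplied by the
    same weight matrix w of shape (c_in, c_out)."""
    return np.einsum('hwc,cd->hwd', x, w)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 16))   # feature maps: 8 x 8 spatial, 16 channels
w1 = rng.standard_normal((16, 16))
w2 = rng.standard_normal((16, 16))

# Two stacked 1 x 1 convolutions with a ReLU in between refine the maps
# non-linearly while leaving the spatial resolution untouched.
refined = conv1x1(np.maximum(conv1x1(x, w1), 0), w2)
print(refined.shape)  # (8, 8, 16)
```

Note that a single 1 × 1 layer is just a linear map per pixel; only the interleaved ReLU makes the stacked pair a non-linear cross-channel refinement.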

#### *4.2. Feature Reuse*

As network depth increases, gradient propagation becomes more difficult. ResNet addresses this issue by adding input features to output features across a few layers, Equation (1). DenseNet replaces the addition with concatenation, which allows feature maps from different depths to be combined along the channel dimension, Equation (2). Considering operational efficiency, the concatenation method of DenseNet was selected for feature reuse:

$$
\overline{X} = X + H(X) \tag{1}
$$

$$\overline{X} = \{X, H(X)\}\tag{2}$$

where *X* represents the inputs, *H*(*X*) is the underlying mapping, and $\overline{X}$ denotes the outputs. In the addition operation, *X* and $\overline{X}$ must have the same dimension.
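The difference between Equations (1) and (2) is easy to see on toy tensors; the following NumPy sketch is purely illustrative:

```python
import numpy as np

X = np.ones((4, 4, 8))        # input feature maps
HX = np.full((4, 4, 8), 2.0)  # output of the underlying mapping H(X)

# ResNet-style reuse, Equation (1): element-wise addition, channels unchanged
added = X + HX

# DenseNet-style reuse, Equation (2): concatenation along the channel axis
concatenated = np.concatenate([X, HX], axis=-1)

print(added.shape, concatenated.shape)  # (4, 4, 8) (4, 4, 16)
```

Addition requires matching shapes and merges the signals, while concatenation preserves both feature sets at the cost of a growing channel dimension, which is why DenseNet needs transition layers.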

However, overuse of the features of previous layers can increase network overhead. To address this, DenseNet employs a transition layer to reduce the number of input features for a dense block. The dense connectivity pattern of DenseNet also causes low-level features to be reused many times. Yosinski et al. [35] showed that features generated by convolutions far from the classification layer are general and thus contribute little to classification accuracy. Following their conclusion, some connections between long-range layers were removed.

#### *4.3. Network Architecture*

The network architecture was divided into three parts during design. Figure 6 demonstrates the building block of each part. As for the frequency (*v*) of feature reuse, it was adapted to the depth of the network:

**Figure 6.** Building blocks of the Weakly DenseNet. (**a**) initial building block, (**b**) and (**c**) intermediate building blocks, (**d**) final classification building block.

Features generated by adjacent convolutions are highly correlated [6], and the final classification layer concentrates on high-level features [20]. Therefore, it is necessary to keep connections between short-distance layers and to reduce the flow of low-level features into the classification layer.

Middle layers produce features that are between general and specific [35]. It is assumed that if the network depth is increased, the value of *v* in the middle layers should be enlarged. Figure 7 illustrates the building block of DenseNet for fitting the ImageNet dataset, in which *v* > 1.

We set *v* to 1 because, given the scale of the citrus pest and disease image data, our network was not constructed very deep. Table 3 summarizes the architecture of the network.

Each convolution in the building block is followed by a batch normalization layer [36] and a ReLU layer [32].

**Figure 7.** Building block of DenseNet.

**Table 3.** Network architecture for citrus pests and diseases.


#### **5. Experiments and Results**

The original dataset was split into three parts, a training set, validation set, and test set, at a ratio of 4:1:1. The images in each set were resized to 224 × 224 by bilinear interpolation. To evaluate the effectiveness of the Weakly DenseNet, it was compared with several baseline networks from different aspects. The software implementation was based on Keras with a TensorFlow backend, running on a GTX 1080 Ti GPU. Code and models are provided in Appendix C.
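A 4:1:1 split such as the one used here can be sketched with a generic helper (our illustration, not the authors' code):

```python
import random

def split_dataset(samples, ratios=(4, 1, 1), seed=42):
    """Shuffle and split samples into train/validation/test by the given ratios."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train, val, test = split_dataset(range(600))
print(len(train), len(val), len(test))  # 400 100 100
```

Fixing the shuffle seed makes the split reproducible across runs, which matters when comparing several models on the same partitions.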

#### *5.1. Training*

All networks were trained with SGD, and a Nesterov momentum [37] of 0.9 was introduced to accelerate convergence. To improve the models' generalization performance, a small batch size of 16 was selected during training [38]. The initial learning rate was set to 0.001 and was adjusted according to Algorithm 1.

**Algorithm 1.** Learning Rate Schedule

```
Input: Patience P, decay θ, validation loss L
Output: Learning rate γ
1: Initialize L = L0, γ = γ0
2: i ← 0
3: while i < P do
4: if L ≤ Li then
5: i = i + 1
6: else
7: L = Li
8: i = i + 1
9: end if
10: end while
11: if L = L0 then
12: γ = γ ∗ θ
13: end if
```
In the experiment, *P* = 5 and θ = 0.8. The weight initialization proposed by He et al. [39] was followed, and a weight decay of 10<sup>−4</sup> was used to alleviate overfitting. The maximum number of training epochs for each model was 300. The dropout rate was set to 0.5.
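The schedule of Algorithm 1 behaves like a standard reduce-on-plateau rule; the following plain-Python sketch (our reconstruction, not the authors' implementation) mimics it with *P* = 5 and θ = 0.8:

```python
def schedule_learning_rate(val_losses, lr0=0.001, patience=5, decay=0.8):
    """Reduce-on-plateau schedule in the spirit of Algorithm 1: if the best
    validation loss has not improved for `patience` consecutive epochs,
    multiply the learning rate by `decay`."""
    lr = lr0
    best = float('inf')
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:       # improvement: remember it, reset the wait counter
            best = loss
            wait = 0
        else:                 # no improvement this epoch
            wait += 1
            if wait >= patience:
                lr *= decay   # gamma = gamma * theta
                wait = 0
        history.append(lr)
    return history

# Losses improve for 3 epochs, then plateau for 6 epochs
lrs = schedule_learning_rate([0.9, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7])
print(lrs[-1])
```

The same behavior is available in Keras as the `ReduceLROnPlateau` callback with `patience=5` and `factor=0.8`, which is presumably how such a schedule would be wired into the training loop described here.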

The VGG-16 in this paper used a global average pooling layer [17] to reduce the number of parameters in the fully connected layers. In the original SE block, the size of the hidden layer is reduced by a ratio (*r* > 1) to limit model complexity. To keep the computational cost the same as that of NIN-16, we set *r* to 1.

Table 4 shows the training results of each model. With regard to classification accuracy, WeaklyDenseNet-16 was the highest, followed by VGG-16 and NIN-16. The higher accuracy of NIN-16 compared with SENet-16 indicates that two layers of 1 × 1 convolution refine feature maps better than the SE block. By concatenating the features of previous layers, the recognition performance of NIN-16 was significantly strengthened: the accuracy of WeaklyDenseNet-16 was 1.58 percentage points higher than that of NIN-16. As for computational cost, VGG-16 was the largest and SENet-16 the smallest. It should be noted that the ShuffleNets and MobileNets overfitted the citrus pest and disease image dataset: even though their model sizes were similar to that of WeaklyDenseNet-16, their greater depth produced a higher error rate on the validation dataset. As for training speed per batch, larger models took more time, except for ShuffleNet-v2 [31]. The training accuracy plots of the benchmark models are displayed in Figure 8; each model converges completely within 300 epochs.



'NIN' represents Network in Network.

**Figure 8.** Training plot of each model. (**a**) MobileNet-v1, (**b**) MobileNet-v2, (**c**) ShuffleNet-v1, (**d**) ShuffleNet-v2, (**e**) NIN-16, (**f**) SENet-16, (**g**) VGG-16, (**h**) WeaklyDenseNet-16.

#### *5.2. Test*

Test accuracy results of the selected models are shown in Figure 9, which shows the same accuracy trend as Table 4. The confusion matrix of WeaklyDenseNet-16 on the test dataset is presented in Figure A1 (refer to Appendix A). Figure A1a shows that the recall rate (3) of the citrus root weevil is the lowest and that of the citrus swallowtail is the highest. Among the misclassified images, citrus anthracnose and citrus canker were most easily confused by the proposed model: nine images of citrus anthracnose were classified as citrus canker, and four images of citrus canker were classified as citrus anthracnose. The two diseases show a similar appearance on leaves at a later phase, so WeaklyDenseNet-16 gave some incorrect predictions. The precision rate (4) of PH is the highest, while that of the citrus flatid planthopper is the lowest. Figure A1b displays the wrong predictions between citrus pests and diseases: ten images of citrus diseases were misclassified into pest labels, and seventeen pictures of citrus pests were mistakenly identified as diseases. Among these, the probability of citrus soft scale being falsely recognized as citrus sooty mold was the highest: adult citrus soft scales secrete honeydew on which sooty mold grows around them, producing a pattern similar to the symptoms of the sooty mold disease.

$$\text{Recall} = \frac{\text{Number of true positive samples}}{\text{Number of true positive samples} + \text{Number of false negative samples}} \times 100\% \tag{3}$$

$$\text{Precision} = \frac{\text{Number of true positive samples}}{\text{Number of true positive samples} + \text{Number of false positive samples}} \times 100\% \tag{4}$$
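Equations (3) and (4) can be computed directly from confusion matrix counts; the counts below are hypothetical, chosen only to illustrate the formulas:

```python
def recall(tp, fn):
    """Equation (3): recall in percent."""
    return tp / (tp + fn) * 100

def precision(tp, fp):
    """Equation (4): precision in percent."""
    return tp / (tp + fp) * 100

# Hypothetical counts for one class: 45 true positives,
# 5 false negatives, 15 false positives
print(recall(45, 5), precision(45, 15))  # 90.0 75.0
```

Recall is read along a row of the confusion matrix (how many true samples of a class were found), while precision is read along a column (how many predictions for that class were correct).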

**Figure 9.** Comparison of the test accuracy.

The hierarchical structure of the CNN model allows features generated from layers of different depths to show significant differences [40]. To better understand the learning capacity of intermediate building blocks of Weakly DenseNet-16, several important feature maps of them were visualized and compared. From Figure 10, it can be noticed that:



**Figure 10.** Visualization of features. (**a**) and (**e**) input images, (**b**) and (**f**) output features of the intermediate building block 1, (**c**) and (**g**) sampled features of the intermediate building block 2, (**d**) and (**h**) examples of the feature maps in the intermediate building block 3. Brighter color in images corresponds to higher value.

A bank of convolutional filters in the same layer can extract features of different parts of the target object. This feature extraction method allows the CNN model to acquire sufficient visual information for subsequent analysis.

With increasing depth, the background features become less visible. Therefore, CNN models do not require additional pre-processing techniques to reduce background noise. They are more convenient to use than conventional machine learning algorithms.

Features of deeper layers are more abstract than those of shallow layers: as depth increases, more convolution and max pooling operations have been applied to the shallow-layer features, resulting in higher-level features that are more suitable for classification.

#### **6. Conclusions and Future Work**

Pests and diseases can reduce citrus output. To control their impact, a new image dataset of citrus pests and diseases was created, and a novel CNN architecture was proposed to recognize them. The network was constructed to improve parameter utilization rather than depth and width. The structure of two 1 × 1 convolutional layers was revisited and applied to refine feature maps. To relieve the optimization difficulty of deep networks, the idea of feature reuse was followed, and, considering operational efficiency, the concatenation method of DenseNet was employed. However, a high frequency of feature reuse increases network overhead, so the feature reuse frequency was set based on network depth to save computational cost. To further improve the robustness of the CNN model, a new data augmentation algorithm was provided, which significantly lessens the similarity between generated images. In the experimental studies, NIN-16 achieved a test accuracy of 91.66%, much higher than that of SENet-16 (88.36%), indicating that a two-layer 1 × 1 convolution refines feature maps better than the SE block. The higher accuracy of WeaklyDenseNet-16 (93.33%) compared with NIN-16 indicates that the feature reuse method can further enhance network performance. VGG-16 achieved the second-highest classification accuracy (93%) but consumed the most computing resources, with a model size of 120.2 MB. This fact implies the importance of optimizing the network structure to fit different datasets.

The object scale in an image is an essential factor that influences the classification accuracy of a CNN model. Using extremely deep networks to identify large-scale objects wastes computational resources, while shallow networks cannot give accurate results for small objects. Future work is to build a CNN model that can adapt to the size of the object in the image.

**Author Contributions:** Conceptualization, S.X. and M.L.; methodology, S.X.; software, S.X.; validation, M.L., S.X., K.-k.L.; formal analysis, M.L., K.-k.L.; data curation, S.X.; writing—original draft preparation, S.X.; writing—review and editing, all authors.

**Funding:** This work was supported by a grant (No: 2017R1A2B4006667) of National Research Foundation of Korea (NRF) funded by the Korea government (MSP).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**


(**a**)


(**b**)

**Figure A1.** The confusion matrix for the test set. (**a**) test result of each class, (**b**) prediction for pest and disease label.

#### **Appendix B**

Image dataset is available at: https://files.mycloud.com/home.php?brand=webfiles#23a3c71/ device\_30757105/ARlab/xingshuli600/.

#### **Appendix C**

Models and code are available at: https://github.com/xingshulicc/xingshulicc/tree/master/citrus\_ pest\_diseases\_recognition.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

### **Vision-Based Novelty Detection Using Deep Features and Evolved Novelty Filters for Specific Robotic Exploration and Inspection Tasks**

#### **Marco Antonio Contreras-Cruz 1, Juan Pablo Ramirez-Paredes 1, Uriel Haile Hernandez-Belmonte 2 and Victor Ayala-Ramirez 1,\***


Received: 8 May 2019; Accepted: 28 June 2019; Published: 5 July 2019

**Abstract:** One of the essential abilities in animals is to detect novelties within their environment. From the computational point of view, novelty detection consists of finding data that differ in some aspect from the known data. In robotics, researchers have incorporated novelty modules into robots to develop automatic exploration and inspection tasks. The visual sensor is one of the preferred sensors for this task; however, problems such as illumination changes, occlusion, and scale variations arise. Moreover, novelty detectors vary in performance depending on the specific application scenario. In this work, we propose a visual novelty detection framework for specific exploration and inspection tasks based on evolved novelty detectors. The system uses deep features to represent the visual information captured by the robots and applies a global optimization technique to design novelty detectors for specific robotic applications. We verified the performance of the proposed system against well-established state-of-the-art methods in a challenging scenario: an outdoor environment covering typical computer vision problems such as illumination changes, occlusion, and geometric transformations. The proposed framework achieved high novelty detection accuracy, with competitive or even better results than the baseline methods.

**Keywords:** visual inspection; one-class classifier; grow-when-required neural network; evolving connectionist systems; automatic design; bio-inspired techniques; artificial bee colony

#### **1. Introduction**

Novelty detection is the task of recognizing data that are different in some aspects from the already known data [1]. This is a challenging problem because the datasets may have a large number of examples of the normal class and an insufficient number of examples of the novel class (in almost all cases, no novelty examples are available). Having robust methods for this type of problem is of great importance in practical applications such as fraud detection [2,3], fault detection [4], medical diagnosis [5–7], video surveillance [8,9], and robotic tasks [10–12], among others. For these applications, it is not common to have access to data labeled as novel. Another complication is that even when using the same type of information across different applications (e.g., visual information), the concept of novelty varies among them. For these reasons, multi-class classifiers are infeasible for novelty detection. As an alternative, there are dedicated methods for novelty detection that provide all the elements to solve the problem.

In general, novelty detection methods construct a model from examples of the normal class and use this model on unknown data to detect novelties. The methods can be classified into five categories [1]: probabilistic, distance-based, reconstruction-based, domain-based, and information-theoretic techniques. One-class classification techniques have been broadly applied to novelty detection, with successful results in environments where no dynamic adaptation of the models is required. Recently, advances in deep learning algorithms have opened a new area in novelty detection [9,13]. Deep-learning-based methods for novelty detection combine the ability of deep neural networks to extract features with the ability of one-class classifiers to model the normal data. The main drawback of these techniques is the need for large-scale datasets and the high computational load required to train the models.

Inspired by the ability of animals to detect novelties and to respond to changes in their environment [14], researchers have tried to incorporate novelty detection methods into robots to improve their adaptation capability to the dynamic environments that are often present in real-world robotic tasks. Presently, it is possible to capture useful information to perform this process with the use of sensors incorporated into the robots (e.g., sonar, laser, camera, GPS, etc.). Among them, visual sensors are one of the most popular devices to extract information for novelty detection [10,11,15], perhaps because humans use visual information unconsciously as a central component to detect novelties.

In robotics, a novelty detection module is beneficial for several applications (e.g., exploration, inspection, vigilance, etc.). Specifically, in exploration and inspection tasks [11], the robot should explore its environment, building a model of normality from the sensed information. After the model construction, the robot patrols (inspection phase) the same route as in the exploration phase in order to detect novelties. It is worth noting that the number of path executions is limited. Although the routes are the same in both phases, due to the operating conditions it is not possible to ensure the same robot positions across different path executions.

For the above problem, the robot needs online novelty detectors to cope with dynamic environments, as well as approaches with fast learning capabilities to detect novelties in scenarios with a reduced amount of information. Most traditional one-class classifiers operate offline, which makes them difficult to adapt to dynamic environments, while deep learning approaches need large-scale datasets and a huge computational load to train the models. Alternatively, online approaches based on evolving connectionist systems [11] and grow-when-required neural networks [16] meet the above conditions. These methods not only build a model of normality incrementally, but also adapt the model to dynamic changes in the input data; that is, they can insert new information and forget old information. However, challenges remain in applying online novelty detectors to exploration and inspection tasks based on visual information. First, current robotic applications use low-level visual features that are sensitive to illumination changes, occlusion, or geometric transformations; examples used in robotic applications are RGB histograms [11], color angular indexing [17], and the GIST descriptor [15], among others. Second, in different exploration and inspection tasks, the robots use the same parameters in the novelty detection module, without considering that the performance of the detector depends on the specific task to be solved. These issues have restricted the application of the above online novelty detectors to indoor environments where many conditions are controlled.

Motivated by the previous issues, in this work we propose the application of novelty detectors based on evolving connectionist systems and grow-when-required neural networks, with visual descriptions drawn from deep convolutional networks, for exploration and visual inspection tasks. In contrast with existing deep learning approaches for novelty detection, we propose using already-trained networks to extract visual features, instead of learning new visual features, in order to reduce the computational load of the feature extraction phase. We prefer deep descriptions over traditional visual descriptions due to their reliability in generating robust features for classification tasks. Additionally, we propose a framework to design novelty detectors automatically via the selection of the best parameters, depending on the specific robotic exploration and inspection task. This framework uses a global optimization technique as the main component to find the most appropriate parameters for the task. We verified the utility of the proposed visual novelty detection system in outdoor applications, where an unmanned aerial vehicle (UAV) captured images in challenging environments (i.e., environments with illumination changes, geometric transformations of the objects in the environment, and occlusions). In summary, this work presents the following contributions:


The rest of this document is structured as follows. Section 2 reviews some works related to visual novelty detection in robotics. Section 3 presents our visual-based novelty detection approach. Section 4 describes the experimental setup and compares our experimental results against traditional visual novelty detectors. In Section 5, we discuss the results and limitations of this work. Finally, in Section 6 we share our main conclusions and perspectives for future work.

#### **2. Related Work**

Marsland et al. [14] proposed a self-organizing map (SOM) with a habituation model embedded into the nodes to detect novelty. The system uses sonar readings as inputs, and the nodes habituate to similar inputs. The habituation level of the nodes represents the novelty value of the input. Crook and Hayes [18] developed a novelty detection system based on the Hopfield network—a type of fully-connected recurrent neural network. They implemented the novelty detector in a robot to detect cards in a gallery. The robot captures a color image and through simple processing finds the orange cards. The binary image (detection of the orange color) enters the network to perform the novelty detection process. The operation of the detector consists of updating the weights of the network every time a new input is fed into the network. The system uses a threshold value and the energy level of the network to decide if the input is novel.

Both detectors have restrictions in their operation because they keep a fixed network structure. Therefore, they cannot adapt their behaviors to dynamic changes in the inputs. For this reason, Marsland et al. [16] proposed a novelty detection system for mobile robots based on a grow-when-required (GWR) neural network. The GWR network topologically connects nodes subject to habituation and incorporates new nodes based on their habituation level and the activation level of the nearest node to the given input. Besides, the GWR network can forget patterns, deleting nodes without topological connections. Crook et al. [19] compared the Hopfield-based novelty detector against the GWR network for novelty detection. In this study, they performed two experiments: the first experiment used sonar readings as input, and the second one used images (the problem of card detection in galleries). The results showed that both approaches could construct an appropriate model of the environment. However, the GWR-based approach produced more precise models because of its lower sensitivity to noise, more flexible representation of the inputs, and ability to adapt to dynamic changes in the inputs.
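The habituation mechanism underlying GWR-style detectors can be sketched as a simple first-order dynamic of the form τ·dh/dt = α(h₀ − h) − S: repeated stimulation drives the habituation value h down from its resting level. The constants and the Euler step below are illustrative, not taken from the cited works:

```python
def habituate(steps, h0=1.0, alpha=1.05, tau=3.33, stimulus=1.0, dt=0.1):
    """Euler integration of a habituation model of the kind used in
    GWR-style novelty detectors: h starts at h0 (fully novel) and, under
    a constant stimulus S, decays toward an equilibrium (habituated)."""
    h = h0
    for _ in range(steps):
        h += dt * (alpha * (h0 - h) - stimulus) / tau
    return h

fresh = habituate(0)        # 1.0: never stimulated, maximally novel
habituated = habituate(500) # close to the equilibrium h0 - S/alpha
print(fresh, round(habituated, 3))
```

In a GWR network, such a value per node lets the detector treat inputs near well-habituated (frequently seen) nodes as familiar, while inputs that activate only fresh nodes signal novelty.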

Afterwards, Neto et al. [20] applied a GWR network with visual information as input. They proposed a framework that combines a visual attention model with a visual description of the most salient points in the image, based on color angular indexing and the standard deviation of the intensity. This type of description is invariant to illumination changes; however, it cannot detect new objects outside the attention regions. Neto and Nehmzow [17] used novelty detectors based on GWR and incremental principal component analysis (IPCA) with two interest point detectors: saliency-based detection and the Harris detector. They compared two ways of representing the patches in the visual input (raw pixels of the image): the first kept a fixed patch size, while the second found the patch size automatically. The results showed that the fixed-size approach usually performed best. Inspired by evolving connectionist systems [21] and the habituation model proposed in GWR networks, Özbielge [11] proposed a recurrent neural network for novelty detection in exploration and inspection tasks. This method predicts the next input and computes a novelty threshold value during its operation; the prediction is compared with the observed input to decide whether it is novel. The system uses laser readings, motor outputs, and RGB color histograms as input information. Also, Özbielge [22] proposed a dynamic neural network for static and dynamic environments. The method computes novelty in a similar way to the previous approach: it computes the error between the input observation and the prediction of the network, and if the error is higher than the evolved threshold, the object is considered a novelty.

Apart from the above detectors, Kato et al. [15] implemented a reconstruction-based system that takes advantage of the position where the robot captured the images. The novelty detector used the GIST descriptor and a reconstruction-based approach to build a system invariant to illumination changes. A principal limitation of their system is the absence of a threshold value to detect novelties (no optimization is provided for tuning the threshold). Gonzalez-Pacheco et al. [23] developed a novelty filter to detect new human poses. The system uses visual information from the Kinect sensor and four one-class classifiers: the Gaussian mixture model, K-means, one-class support vector machines, and least squares anomaly detection. For this task, the Gaussian mixture model performed better than the other novelty detectors; however, its performance depended on the number of specified Gaussians (defined by the user in the experiment). Recently, Gatsoulis and McGinnity [24] proposed an online expandable neural network similar to the GWR network. The method uses speeded-up robust features (SURF) and an ownership vector. The main difference between the GWR approach and this method is that habituation is defined per object rather than per feature vector.

All the above novelty detectors have been applied for indoor environments, and few works have been proposed for outdoor environments. For instance, Wang et al. [25] implemented an approximation to the nearest neighbor via search trees to detect novelties in indoor and outdoor environments (they used a static camera for the outdoor environment). The inputs were visual features extracted from patches—for example, color histograms in the HSV space (hue, saturation, value) and texture information (Gabor filters). They compared the performance of their system against the GWR network. The results showed that their proposed approach was better than the GWR network in their particular experiments. Ross et al. [12] presented a vision system for obstacle detection based on novelty for field robotics. The motivation in the use of novelty is that in agricultural applications, it is infeasible to train a system with all types of obstacles. The inputs of the detector were color, texture, and position of the patches in stereo images. The system detects novelty by using the probability density estimated by a weighted version of Parzen windows.

Previous works have explored low-level visual features for image description, such as color angular indexing, the GIST descriptor, raw RGB values, RGB color histograms, HSV histograms, and Gabor filters, among others. Few efforts have been made to take advantage of emerging deep convolutional neural networks for feature description in visual novelty detection. One such effort is the robotic system proposed by Richter and Roy [26], whose objective was to equip a robot with a safe navigation module. The novelty detection module is an autoencoder network with three hidden layers that automatically finds a compressed representation of the image captured by the robot. The goal of the network is to reconstruct the input image; if the input image cannot be reconstructed (i.e., the error between the input and the output exceeds an error tolerance), the system detects a novelty and uses it to maintain the safety of the robot.

In summary, most of the existing visual novelty detectors have been configured manually by humans, or no specific procedure for configuring the detector has been provided. Also, most visual novelty detectors use traditional feature extraction techniques; few works have applied recent advances in convolutional neural networks as visual feature descriptors. Both the lack of automatic configuration of novelty detectors and the use of low-level traditional visual features have restricted exploration and inspection tasks to indoor environments with controlled conditions (e.g., illumination) and to simple visual novelty detection problems (i.e., conspicuous objects). The proposed work presents an approach that addresses these issues.

#### **3. Materials and Methods**

In this section, we describe the proposed system for visual exploration and inspection tasks. In this work, we used images captured by a UAV operating in outdoor environments. Figure 1 illustrates the proposed system. In the exploration phase, the UAV follows a fixed trajectory and captures images of the environment. The system represents the captured images via deep features by using a pre-trained convolutional neural network called MobileNetV2 [27]. The novelty detector processes the feature vector and constructs a model of the environment. The user can select between two detectors: simple evolving connectionist systems (SECoS) or the GWR network. Finally, in the inspection phase, the UAV executes its path again and searches for novel objects, using the above model to identify novelties. In the remainder of this section, we describe the components of the proposed visual novelty detection system in more detail.

**Figure 1.** Graphical description of the proposed system for visual exploration and inspection tasks. SECoS: simple evolving connectionist systems.

#### *3.1. Visual Feature Extraction*

One way to represent images is via visual feature vectors. Among these, traditional features such as RGB color histograms [11], color angular indexing [10], and the GIST descriptor [15] have been applied to visual novelty detection in robotics. However, traditional visual features are highly sensitive to illumination changes, noise, occlusion, and geometric transformations. Recently, convolutional neural networks have been applied successfully as powerful tools to extract features from images [28], achieving robust performance in a wide variety of classification tasks.

Motivated by the success of convolutional neural networks as feature extraction methods, we propose the application of a convolutional neural network to extract features from images for the task of visual novelty detection in robotics. In this work, we selected MobileNetV2 [27] because it is the network with the lowest number of parameters in the Keras API and the TensorFlow engine. In our implementation, we used a pre-trained network with the weights trained on the ImageNet dataset. In order to extract the visual features, we resized the input image to the default size in the Keras API of 224 × 224 pixels. We also deactivated the classification layer and activated the average pooling mode for feature extraction. We obtained visual feature vectors of 1280 elements.

#### *3.2. Novelty Detectors*

We selected two online novelty detection methods that are used as the base to develop exploration and inspection tasks with real robots [10,11,16]. Both techniques are constructive and can evolve the structures of the models and their parameters during their operation. We selected the SECoS and the GWR network.

#### 3.2.1. Simple Evolving Connectionist Systems

The evolving connectionist systems (ECoS) proposed by Kasabov [21] are a type of neural network that can evolve their parameters and their structure over time. Below, we show the characteristics of the ECoS that make them attractive to address the problem of visual novelty detection in robotics [29]:


The SECoS retain these characteristics [30], but they present two advantages over other ECoS implementations: they are easy to implement because they have a low number of layers to learn the input data, and they work directly on the input space. Figure 2 shows a graphical description of the SECoS network. The network comprises three layers: the input layer, which transfers the inputs to the nodes of the next layer; the hidden layer (evolving layer), which incorporates new nodes to represent novel data; and the output layer, which uses saturated linear activation functions to compute the output. In a SECoS network, there are two connection layers: the connections between the nodes of the input layer and the nodes of the evolving layer (incoming connections), and the connections between the nodes of the evolving layer and the nodes of the output layer (outgoing connections).

In this work, we used the SECoS learning algorithm proposed by Watts and Kasabov [30]. The algorithm receives as input the weights of the connections in the network, the input features, and the desired output. The proposed approach uses a SECoS implementation with the same number of nodes in the input layer and the output layer. The objective of the approach is to generate a system able to reconstruct the input vector. When the model generated by the SECoS implementation is not able to represent an input, it should add a new node in the evolving layer with the incoming weight values equal to the input vector and the outgoing weight values equal to the desired output. Also, it should add a new node to the model when the reconstructed output is significantly different from the desired output, that is, when the Euclidean distance between the desired output and the current output of the network is greater than the threshold *Ethr*. When the model can represent a given input successfully, the SECoS implementation only updates the model (updating of the connection weights) to better represent the input data. The parameters of this learning model include the learning coefficients (*η*1, *η*2), the sensitivity threshold (*Sthr*), and the error threshold (*Ethr*). For more details about this learning algorithm, the readers can refer to the work by Watts and Kasabov [30].
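The reconstruction-based behavior described above can be sketched as follows. This is a minimal illustration, not the authors' C++ implementation: the class name `SECoSNovelty`, the normalized-distance activation, and the node-addition rule are our reading of the algorithm by Watts and Kasabov [30].

```python
import numpy as np

class SECoSNovelty:
    """Minimal autoassociative SECoS sketch; parameter names follow the
    paper (eta1, eta2, s_thr, e_thr), the details are assumptions."""
    def __init__(self, eta1=0.05, eta2=0.5, s_thr=0.5, e_thr=0.3):
        self.eta1, self.eta2 = eta1, eta2
        self.s_thr, self.e_thr = s_thr, e_thr
        self.w_in, self.w_out = [], []       # incoming / outgoing weights

    def step(self, x):
        """Process one input vector; returns True if it is novel."""
        x = np.asarray(x, dtype=float)
        if not self.w_in:                    # first input founds the model
            self.w_in.append(x.copy()); self.w_out.append(x.copy())
            return True
        d = [np.linalg.norm(x - w) / (np.linalg.norm(x) + 1e-12)
             for w in self.w_in]
        j = int(np.argmin(d))                # closest evolving-layer node
        activation = 1.0 - d[j]
        err = np.linalg.norm(x - self.w_out[j])   # reconstruction error
        novel = activation < self.s_thr or err > self.e_thr
        if novel:                            # add a node for the novel input
            self.w_in.append(x.copy()); self.w_out.append(x.copy())
        else:                                # update the winning node
            self.w_in[j] += self.eta1 * (x - self.w_in[j])
            self.w_out[j] += self.eta2 * (x - self.w_out[j])
        return bool(novel)
```

In the autoassociative configuration used here, the desired output equals the input, so a new node stores the input vector in both its incoming and outgoing weights.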

**Figure 2.** Graphical description of the SECoS network. Adaptation of the general ECoS representation from Watts [29].

#### 3.2.2. Grow-When-Required Neural Network

GWR is an online self-organizing neural network proposed to solve the novelty detection problem [31]. Figure 3 shows a graphical representation of the GWR neural network. The network comprises a clustering layer of nodes and a single output node. The nodes in the clustering layer use weight vectors to represent the centers of the clusters. The GWR network can add and remove nodes in its structure, specifically in the clustering layer, to adapt to changes in the inputs. The connection synapses to the clustering layer are subject to a habituation model, which reduces a node's response to repeated similar inputs.

**Figure 3.** Graphical representation of the grow-when-required (GWR) neural network. Adaptation of the network architecture presented by Neto et al. [20].

In the proposed framework, we use the algorithm of the GWR network for novelty detection as described by Neto [10]. The network starts with two dishabituated nodes whose weight vectors are initialized to the positions of the first two input vectors. At the beginning, there is no topological connection between the two nodes. From the third input vector onward, the best matching node *s* and the second best matching node *t* of the clustering layer are found (i.e., the nearest nodes to the input vector). If there is a topological connection between both nodes, its age is set to zero; otherwise, a connection between them is created with age zero. The GWR network uses the activation and habituation levels of node *s* to decide whether the input is novel. If the input vector is novel, a new node is created in the clustering layer, with its weight vector initialized to the average position between the input vector and the best matching node. Also, the topological connections of the nodes in the clustering layer are updated by removing the connection between the two best matching nodes and inserting connections from each of them to the new node. Then, the best matching node and its topological neighbors move toward the input vector and update their habituation levels. Finally, the ages of all the connections increase, and all connections with ages higher than the maximum age are removed. A node is also removed when it has no topological connections (i.e., the ability to forget). The parameters that impact the behavior of the network are the parameters of the habituation model, the activation threshold (*aT*), the habituation threshold (*hT*), the proportionality factor (*η*), and the learning rate. A detailed description of the learning algorithm of the GWR neural network can be found in [10].
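The novelty test at the core of the GWR algorithm can be sketched as follows. This is a simplified fragment, not the full learning algorithm: `gwr_is_novel` is a hypothetical helper, and the exponential activation follows the original GWR formulation, where habituation starts near 1 and decreases with exposure.

```python
import numpy as np

def gwr_is_novel(x, nodes, habituation, a_thr=0.8, h_thr=0.3):
    """Sketch of the GWR novelty decision.

    nodes: list of weight vectors (cluster centers);
    habituation: habituation level per node (1 = fresh, decays toward 0).
    Returns (novel?, index of the best matching node).
    """
    dists = [np.linalg.norm(np.asarray(x) - w) for w in nodes]
    s = int(np.argmin(dists))            # best matching node
    activation = np.exp(-dists[s])       # close inputs -> high activation
    # Novel when the best node fires weakly while still relatively
    # unhabituated (it has not seen inputs like this one often).
    return bool(activation < a_thr and habituation[s] > h_thr), s
```

In the full algorithm, a positive novelty decision triggers node insertion and topology updates as described above; this fragment isolates only the decision itself.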

#### *3.3. Global Optimization of Novelty Detectors*

One of the main problems in the application of novelty detectors is the proper selection of their parameters in order to obtain the best results regarding the detection accuracy. With this in mind, we propose a framework to tune the novelty detectors automatically for a specific task (see Figure 4). Our optimization approach not only searches for parameters of the novelty detectors, but also finds the best size of the visual feature vector.

In this work, we propose the use of the artificial bee colony (ABC) algorithm [32] as the optimization tool. Note that although we show the use of the ABC algorithm here, the proposed framework can incorporate different algorithms to find the most appropriate parameters of the filters for specific tasks. The ABC algorithm is a population-based approach for numerical optimization in which artificial bees update their positions over time to find the best food sources. This algorithm has been shown to be better than, or competitive with, other bio-inspired optimization techniques. Moreover, the ABC algorithm has been applied to a wide variety of engineering problems, such as image processing, data mining, control, and mobile robotics [32]. The implementation details of the algorithm can be found in Mernik et al. [33]. In the proposed methodology, we use an implementation with a termination condition based on the number of iterations, also known as ABC*imp1*.

In our implementation of the ABC algorithm, each food position represents a set of parameter values of the novelty detector. Table 1 shows the parameters that should be adjusted by using the ABC algorithm. The search range of all the decision variables is within [0, 1]. In the case of the GWR novelty filter, we set the parameters of the habituation model to the default values, and we also keep the maximum age value constant. For the ABC algorithm, we used a population of 20 food positions and a total number of 100 iterations.
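A compact sketch of an iteration-bounded ABC search over decision variables in [0, 1] might look as follows. This is a simplification, not the authors' C++ implementation: it collapses the employed and onlooker phases into one greedy local search per food source, and `abc_minimize` is a hypothetical helper.

```python
import numpy as np

def abc_minimize(f, dim, n_food=20, iters=100, limit=10, rng=None):
    """Minimize f over [0, 1]^dim with an ABC-style search.

    Each food position encodes one candidate parameter set, mirroring the
    encoding described for the novelty detectors; termination is by
    iteration count, as in the ABC_imp1 variant.
    """
    rng = np.random.default_rng(rng)
    foods = rng.random((n_food, dim))        # food positions = parameter sets
    fit = np.array([f(x) for x in foods])
    trials = np.zeros(n_food, dtype=int)
    for _ in range(iters):
        for i in range(n_food):
            k = rng.integers(n_food - 1)
            k = k + (k >= i)                 # random partner, k != i
            j = rng.integers(dim)            # perturb one dimension
            cand = foods[i].copy()
            phi = rng.uniform(-1, 1)
            cand[j] = np.clip(cand[j] + phi * (cand[j] - foods[k][j]), 0, 1)
            fc = f(cand)
            if fc < fit[i]:                  # greedy selection
                foods[i], fit[i], trials[i] = cand, fc, 0
            else:
                trials[i] += 1
        worn = trials > limit                # scout phase: abandon sources
        if worn.any():
            foods[worn] = rng.random((int(worn.sum()), dim))
            fit[worn] = [f(x) for x in foods[worn]]
            trials[worn] = 0
    best = int(np.argmin(fit))
    return foods[best], float(fit[best])
```

In the framework above, `f` would be the fitness of Equation (5) evaluated by training and validating a detector with the candidate parameters, with `n_food=20` and `iters=100` as reported.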

**Design of a novelty detector for specific tasks**

**Figure 4.** Flowchart of the visual novelty detection for specific tasks. In the training phase, the novelty filter learns to detect a specific object. In the inspection phase, the evolved model is used to detect the object(s) in the environment.


**Table 1.** Parameters to be tuned for each novelty detector.

#### **4. Experimental Preparation**

We validated the performance of the proposed method using images captured by a real robot in outdoor environments. We constructed the datasets using these images to train and test the novelty-detection system. We designed an experiment to compare the deep visual feature extraction technique against commonly used visual features for the problem of visual exploration and inspection. In this section, we describe the datasets, the methods for comparison, the experimental setup, and the evaluation metrics.

#### *4.1. Datasets*

In this work, we constructed a dataset with images captured by the visual sensor of a UAV. For this purpose, we used a Parrot Bebop 2 drone with a 14-Mpx flight camera. The captured images had dimensions of 1920 × 1080 pixels, but we constrained the search to the center region of the images with a reduced field of view of 640 × 480 pixels. Figure 5 shows the UAV used for data acquisition. Note that the novelty detection system received images of the environment every 250 ms.


**Figure 5.** Parrot Bebop 2 Drone with a 14-Mpx flight camera. In the bottom-left corner, we show its visual sensor system.

Figure 6 illustrates the outdoor environment used in this experiment. The UAV executed its default control module to fly over the environment along a rectangular path. To generate the datasets, the UAV executed the same path several times with different environment setups.

**Figure 6.** Experimental setup: the outdoor environment, and some sample captured images. UAV: unmanned aerial vehicle.

In the first set of experiments, the UAV flew at 2 m above the ground under morning light conditions (between 11:00 and 12:00). The original environment contained an orange trash can (we called this environment "O-1"). First, the UAV explored the O-1 environment, executing its path two times. The UAV captured a total of 896 images—448 for each execution. Then, it executed the inspection phase and captured another 896 images. In this inspection phase, a person appeared in the environment (we denoted this new environment as O-2). The sequence contains 60 frames with the person. In the second experiment, we added a tire to the O-1 environment (we denoted this environment as O-3). The UAV captured a total of 896 images. The tire is present in 58 frames. Finally, the UAV executed its path in the environment with the person and the tire at the same time. The UAV captured another 896 images in its two path executions. In total, the person is present in 37 frames, and the tire is present in 64 frames. We identified this environment as O-4.

We developed a second set of experiments to test the robustness of the proposed method, considering different scales, types of occlusion, novel objects, and light conditions. In this new set, the UAV flew at 4 m above the ground under afternoon light conditions (between 16:00 and 17:00). The methodology for capturing the image sequences was similar to that of the first set of experiments, but with some differences in the settings of the environments. We introduced environment O-5, where the orange trash can was removed. We designed another environment with the person in a different position, and named it O-6. To test the robustness of the proposed method, we added inconspicuous novel objects (brown boxes) to environment O-5; we denoted this environment as O-7. Finally, we set up a new environment, O-8, where the UAV could observe the person occluding the boxes in the environment.

Figure 7 shows some sample images of the above environments. Table 2 summarizes the environments used for novelty detection, and Table 3 reports the data partition of the environments to perform the training and test phases.


**Figure 7.** Sample images captured by the UAV in the environments: (**a**) original in the morning (O-1), (**b**) the person in the morning (O-2), (**c**) the tire in the morning (O-3), (**d**) the person and the tire in the morning (O-4), (**e**) empty environment in the afternoon (O-5), (**f**) the person in the afternoon (O-6), (**g**) the boxes in the afternoon (O-7), and (**h**) the person and the boxes in the afternoon (O-8).


**Table 2.** Summary of the environments used in the experiments for novelty detection.

**Table 3.** Data partition for novelty detection.


In all the experiments, the novelty detectors used both loops of the training environment for exploration, while only one loop of the test environment was used for inspection. The other loop of the test environment was used to evolve the novelty detectors.

#### *4.2. Evaluation Metrics*

To measure the performance of the novelty detectors, we used the confusion matrix shown in Table 4. *TP* represents the number of true positives (normal data labeled as normal), *TN* represents the number of true negatives (novel data labeled as novel), *FP* represents the number of false positives (novel data labeled as normal), and *FN* represents the number of false negatives (normal data labeled as novel).

**Table 4.** Confusion matrix to evaluate the performance of the novelty detectors. *FN*: false negative; *FP*: false positive; *TN*: true negative; *TP*: true positive.


Different metrics have been proposed to reflect the performance of novelty detectors in a single quantity. Three of the most commonly adopted are the *F*<sup>1</sup> score, accuracy (*ACC*), and Matthews correlation coefficient (*MCC*). Similar to Özbilge [11], we used these three metrics to evaluate the performance of the novelty detectors. These metrics are respectively defined as:

$$F\_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}, \tag{1}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}, \tag{2}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. \tag{3}$$
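For example, Equations (1)-(3) can be computed directly from the confusion counts in Table 4 (recall that, in this paper's convention, the positive class is the normal data). The helper name `novelty_metrics` is ours:

```python
def novelty_metrics(tp, tn, fp, fn):
    """F1, accuracy, and MCC from the confusion counts of Table 4."""
    f1 = 2 * tp / (2 * tp + fp + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, acc, mcc
```

Note that *MCC* lies in [−1, 1] and is undefined when any marginal is zero; the sketch returns 0.0 in that case.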

In novelty detection, it is important to correctly label all the novel data as novel. It is tolerable to label normal data as novel, but it is inadmissible to label novel data as normal. For example, suppose a thief, representing novel data, enters a warehouse. We prefer a system that can detect the thief at all times in order to prevent theft; if the system reports a thief when there is none in the scene, there is no problem concerning theft. To reflect this desired behavior of novelty detectors, we also incorporated two additional metrics: the true negative rate (*TNR*) and the true positive rate (*TPR*).

To establish the quality of a detector with a single number, we used the average ranking of the measures in all the metrics, inspired by Bianco et al. [34]. Let us consider a set of detectors to be compared, denoted as M = {*M*1, *M*2, ... , *Mm*}, where *m* is the number of detectors; a set of test images denoted as T ; and a set of *P* performance metrics, in this study *P* = 5. We can compute the average ranking of a detector *Mi* as:

$$R\_i = \frac{1}{P} \sum\_{j=1}^{P} \text{rank} \left( M\_i; \text{measure}\_j(M\_k(\mathcal{T})), \,\forall k \neq i \right), \tag{4}$$

where *rank*(*Mi*; ·) computes the rank of the detector *Mi* considering the results of the rest of the detectors in the measure *measurej*.
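Equation (4) amounts to ranking each detector on every metric and averaging the ranks. A small sketch, assuming higher metric values are better and ties take the best rank (`average_rank` is a hypothetical helper):

```python
def average_rank(scores):
    """scores[i][j]: value of metric j for detector i (higher = better).
    Returns the average rank R_i of each detector per Equation (4)."""
    m, p = len(scores), len(scores[0])
    ranks = []
    for i in range(m):
        r = 0.0
        for j in range(p):
            # Rank of detector i on metric j among all detectors.
            col = sorted((scores[k][j] for k in range(m)), reverse=True)
            r += col.index(scores[i][j]) + 1   # rank 1 = best
        ranks.append(r / p)
    return ranks
```

With the five metrics used in this study (*TPR*, *TNR*, *F*<sup>1</sup>, *ACC*, *MCC*), `p` would be 5 and a perfect detector would have an average rank of 1.0.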

#### *4.3. Experimental Setup*

All the novelty detection algorithms under study can operate online. However, to compare the detectors, we used the same data partition shown in Table 3. We implemented the SECoS, GWR, and ABC algorithms in the C++ programming language. The developed ABC library used the 32-bit Mersenne Twister pseudo-random number generator. For the deep feature extraction technique, we used the pre-trained MobileNetV2 available in the Keras API and the TensorFlow engine. The experiments were run on a computer with an Intel Core i5 processor at 2.9 GHz and 16 GB of RAM.

To verify the performance of the detectors, we used three traditional visual feature extraction techniques: the RGB color histograms used by Özbilge [11], the color angular indexing used by Neto [10], and the GIST descriptor used by Kato et al. [15]. We compared the performance of the detectors with these feature extraction techniques against the features extracted by the MobileNetV2 network. In this experiment, the system for automatic design used the two image sequences of the exploration phase for training and one sequence of the inspection phase for validation. The goal of the optimization process was to maximize the performance of the detector with respect to the *F*<sup>1</sup> score, the *ACC*, and the *MCC*. Therefore, we used the following fitness function:

$$f = 1 - \frac{1}{3} \left( F\_1 + ACC + \frac{1 + MCC}{2} \right), \tag{5}$$

where *f* ∈ [0, 1], *f* = 1 represents the worst case with no data classified correctly and *f* = 0 indicates that the novelty detector under study classifies all the data from the validation correctly. In this experiment, we executed 30 simulations for each novelty detector, and we report the average results to perform the comparison.
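Equation (5) can be checked directly: a perfect detector (*F*<sup>1</sup> = *ACC* = 1, *MCC* = 1) yields *f* = 0, while the worst case (*F*<sup>1</sup> = *ACC* = 0, *MCC* = −1) yields *f* = 1. A one-line sketch (`abc_fitness` is our name for it):

```python
def abc_fitness(f1, acc, mcc):
    """Fitness of Equation (5): 0 = perfect detector, 1 = worst case.
    MCC is rescaled from [-1, 1] to [0, 1] before averaging."""
    return 1.0 - (f1 + acc + (1.0 + mcc) / 2.0) / 3.0
```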

#### **5. Results and Discussion**

This section shows and discusses the results of the experiments. We designed the specific novelty detectors for each visual feature independently. We found the most suitable size of the feature vector and the parameters of the novelty detection methods for the particular visual exploration and inspection tasks. In the first part of this section, we compare the results of the proposed feature extraction technique against well-established feature extraction techniques for visual novelty detection. Then, we present an analysis of the optimization process of the novelty detectors that use the MobileNetV2 feature extractor. We also show some sample novelty detectors (evolved detectors) generated by the proposed framework and their visual results. Finally, we discuss some limitations of the proposed methodology.

#### *5.1. Deep Features and Traditional Visual Features in Novelty Detection*

We used well-known visual feature extraction techniques in the problem of novelty detection to compare the performance of the MobileNetV2. We used as reference the RGB color histograms used by Özbilge [11], the color angular indexing applied by Neto [10], and the GIST descriptor implemented by Kato et al. [15]. Table 5 reports the average performance of the novelty detectors in the inspection phase for each dataset, where CAI represents the color angular indexing technique, hRGB represents the RGB color histograms, and MNF represents the feature extraction method based on MobileNetV2. In the table, we also report the average vector size of the features (*VSize*) and the average size of the learned models of the environment (*MSize*)—that is, the average number of nodes in the models. Note that the CAI descriptor produces feature vectors of four elements. In the rest of the descriptors, the optimization process can produce feature vectors of different sizes. In the table, we mark the best-performing method for each metric, according to the specific detector and the particular dataset. The ranking metric uses the *TPR*, *TNR*, *F*1, *ACC*, and *MCC* values to compare the different descriptors for each dataset and detector.

For the D-1 dataset, the objective was to learn a model of the original environment O-1, and to detect a dynamic object represented by a person. In this dataset, the feature extraction technique MNF showed the best performance compared to all other visual extraction techniques. The detectors that used the MNF descriptor could generate compact models of the environment and keep higher performance. They showed accuracies greater than 98%, and *MCC* near 0.9. On the second dataset (D-2), the novelty detectors had to learn a model of the environment O-1 and identify the black tire as the new object. The proposed method achieved the best performance over all others in this dataset—see the ranking of the D-2 dataset in Table 5. The average *ACC* by using both detectors with the MNF technique was around 98%, and the *MCC* was 0.87. Dataset D-3 presents a more challenging situation because the detector was required to learn a model of the environment with a person and detect a black tire. The environment in the inspection phase included both the person and the black tire. Under this situation, the novelty detectors that used the MNF also achieved the best performance, with *ACC* values around 96% for both detectors, and *MCC* values of 0.79 and 0.76 for the SECoS and GWR detectors, respectively. On dataset D-4, the objective was to learn a model of the environment with a tire. In the inspection phase, the person represented the novel object and the black tire represented a normal object. The results indicate that the MNF technique was the second best (the first was the GIST descriptor) with 96% *ACC* and 0.6 *MCC* for both detectors. On dataset D-5, the novelty detectors were required to learn a model of environment O-1 and detect multiple novel objects (both the tire and the person). The MNF description achieved the best performance, with *ACC* values around 97% for both novelty detectors and *MCC* values of 0.89 and 0.88 for the SECoS and GWR detectors, respectively.

On the above datasets, the novelty detectors were tested with novel objects that differed strongly from the environment, which could facilitate their detection. In the following, we tested the detectors in more challenging situations. To this end, we used datasets D-6 and D-7, generated by the UAV at a different height (4 m) and under a different light condition (images captured in the afternoon). In the inspection phase of dataset D-6, we used inconspicuous brown boxes to represent the novel objects. In this dataset, the detectors with MNF feature extraction were the best methods to detect novelties, with a ranking of 1.2. Finally, we show the results of the detectors on dataset D-7. The objective in this dataset was to learn a model of an environment with a person and a tire and to detect the brown boxes that were occluded by the person in some frames. The results show the superiority of the MNF descriptor for novelty detection, with *MCC* values above 0.9 and *ACC* values around 98% for both detectors.

**Table 5.** Average results in the inspection phase over the 30 runs. Bold values indicate the best result for each metric according to the specific dataset and the specific novelty detector. CAI: color angular indexing; hRGB: RGB color histogram; MNF: feature extraction based on MobileNetV2; *VSize*: average vector size; *MSize*: average model size; *TPR*: true positive rate; *TNR*: true negative rate; *F*1: *F*<sup>1</sup> score; *ACC*: accuracy; *MCC*: Matthews correlation coefficient; *R*: ranking of the detector.




We then compared the average CPU time to generate the visual features per image on all the datasets. The average time excludes the reading of the image and the post-processing of the visual features. The post-processing only consisted of reducing the feature vector to the size found by the optimization process, by averaging contiguous sectors of equal length. Figure 8 shows the average time to generate visual features in all the datasets. hRGB was the fastest method, mainly because it only needs to count the number of pixels that belong to each intensity value. The CAI method was the second fastest because its computation consists of simple image operations such as averages, standard deviations, inverse cosines, and dot products. Meanwhile, the GIST descriptor involves more advanced operations, including convolution of the image with Gabor filters at different scales and orientations. The MNF was the slowest feature extraction technique because it includes more complex operations on the image (i.e., it is a deep structure with several convolutional layers). However, all the feature extraction techniques in this work could generate visual features in less than 200 ms—a time that is acceptable for the proposed visual exploration and inspection tasks.
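The sector-averaging reduction can be sketched as follows. This reflects our reading of the post-processing step; how sizes that do not divide evenly are handled is an assumption, here via `numpy.array_split`, and `reduce_vector` is a hypothetical helper.

```python
import numpy as np

def reduce_vector(v, target_size):
    """Shrink a feature vector by averaging contiguous sectors of
    (near-)equal length, producing target_size elements."""
    v = np.asarray(v, dtype=float)
    sectors = np.array_split(v, target_size)   # contiguous, near-equal chunks
    return np.array([s.mean() for s in sectors])
```

For example, reducing a 1280-element MobileNetV2 vector to the 256 elements found by the optimizer would average sectors of 5 consecutive features each.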

**Figure 8.** Average time (seconds) to generate the visual features using different descriptors on all datasets.

Overall, MNF produced balanced results in contrast with the baseline methods. The models found with the MNF descriptor and the novelty detectors were compact, with no more than 35 nodes. In most cases, MNF detected novelties better than the traditional visual descriptors. We also found that the traditional visual features required a low number of nodes to represent the environment; however, their low *ACC* and *MCC* performance indicates that the extracted features were insufficient to differentiate the images in the sequences.

#### *5.2. Analysis of the Optimization Process*

Figure 9 presents the average fitness value of the best-evolved novelty detectors per iteration over the 30 runs on dataset D-2 for both novelty detectors using the MNF feature extraction technique. In this figure, we also present the standard deviation of the fitness values as bars. At the beginning, the best detectors in the different runs varied more among themselves, and this variation decreased as the number of iterations increased. Analyzing the curves, we can observe that the detectors evolved easily on this dataset, reaching fitness values near the perfect score (zero); that is, the optimization process found appropriate parameter values of the detector for the specific novelty detection task. For the GWR, the average fitness decreased by 0.2591 from the initial to the final iteration; the most notable change occurred in the first 20 iterations, with a change of 0.2532. For the SECoS detector, the optimization process showed a decrease of 0.3386 in the average fitness from the initial to the final iteration; the most significant change occurred in the first 14 iterations, with a change in the average fitness of 0.3341. For the rest of the datasets, the optimization process showed similar behavior.

**Figure 9.** Average fitness value of the best-evolved detectors by using the artificial bee colony (ABC) algorithm in the 30 independent runs on dataset D-2. The detectors used the MNF feature extraction technique: (**a**) GWR detector; (**b**) SECoS detector.

Now, we compare the CPU time used in evolving the novelty detectors for specific exploration and inspection tasks of the different feature extraction techniques. Figure 10 shows the average CPU time to evolve the novelty detectors in all the datasets. The search cost excludes the feature extraction phase and includes the post-processing time of the feature vectors. In the figure, we can observe that the GWR detector evolved faster than the SECoS detector. One reason is that the SECoS detectors need to reconstruct the input data and compute the distance to the nearest neighbor node in the novelty detection process, while the GWR method only requires the computation of the distance between the input data and the closest node and the habituation level of this node (without reconstruction).

It is not surprising that the CAI descriptor was the fastest method to evolve the detectors, because it keeps the number of inputs to the detector fixed (four features) during the entire optimization process. For the rest of the approaches, the vector size varied during the optimization. The maximum number of elements was 768 (3 channels with 256 intensity values), 512, and 256 features for the hRGB, GIST, and MNF descriptors, respectively.

**Figure 10.** Average CPU time (seconds) to generate a specific novelty detector for each dataset by using different feature extraction techniques: (**a**) GWR detectors; (**b**) SECoS detectors.

#### *5.3. Evolved Novelty Detectors*

We used an evolved SECoS detector with deep features on dataset D-3 to illustrate the effects of task-specific novelty detectors. The evolved detector had the following characteristics: *η*<sub>1</sub> = 0.0183574, *η*<sub>2</sub> = 0.4830270, *Athr* = 0.4651190, *Ethr* = 0.7776980, and *VSize* = 256. These parameters were obtained by the proposed global optimization process. In dataset D-3, the training of the detector consisted of generating a model of the O-2 environment (an environment with a person), and the objective was to detect a black tire in an environment with both a tire and a person (this new environment was called "O-4").

Figure 11 presents the exploration and inspection phases using the evolved SECoS novelty detector. In the exploration phase, the detector constructs the model of the environment, finding the most relevant information such as the football goal, the orange trash can, the basketball court, and the person. As is commonly adopted for novelty detectors, the first input becomes part of the learned model; the image to the left of the football goal in Loop 1 represents this first input image. We used two loops of the same normal environment (O-2) to train the detector. The evolved detector found a model of 18 nodes to represent the O-2 environment. In the inspection phase, the detector uses this model on environment O-4 to detect novelties. In this new environment, the detector found the tire as the novel object in almost all cases, with a single false novelty detection. The performance of this particular detector was *TPR* = 0.9976, *TNR* = 0.9677, *F*<sub>1</sub> = 0.9976, *ACC* = 0.9955, and *MCC* = 0.9653.
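The performance figures above follow the standard confusion-matrix definitions of these metrics. As an illustrative sketch (our own code, not the paper's implementation), the values can be computed from raw counts as follows; the function name is hypothetical:

```python
import math

def detection_metrics(tp, tn, fp, fn):
    """Compute the reported evaluation metrics from raw
    confusion-matrix counts (standard definitions)."""
    tpr = tp / (tp + fn)                      # true positive rate (recall)
    tnr = tn / (tn + fp)                      # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {"TPR": tpr, "TNR": tnr, "F1": f1, "ACC": acc, "MCC": mcc}
```

MCC is a useful summary here because, unlike accuracy, it stays informative on the heavily unbalanced data typical of novelty detection.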

Figure 12 shows some image frames captured by the UAV at different time steps that the evolved SECoS detector classified as normal images in the inspection phase. The first row represents sample images from the exploration phase, and the second row the corresponding image frames in the inspection phase. Although there was considerable variation caused by the dynamic object and slight perspective changes in the images, the evolved detector classified both situations as part of the normal class. Figure 13 shows some image frames where the evolved SECoS detected novelty: image frames used in the exploration phase at different time steps (see Figure 13a), and sample images captured in the inspection phase where the detector found the novelty (see Figure 13b). We can observe the black tire at different scales in the images captured in the inspection phase.



**Figure 11.** Illustration of the visual exploration and inspection task on dataset D-3 to detect the black tire as the novel object. In the exploration phase, the SECoS detector constructs a model of the environment with the person. In the inspection phase, the detector uses this model to detect the black tire.

**Figure 12.** Sample image frames labeled as normal images by the evolved SECoS detector in the inspection phase: (**a**) sample image frames used to learn the model of the environment, and (**b**) sample images detected as normal images in the inspection phase.

**Figure 13.** Sample image frames labeled as novelty images by the evolved SECoS detector in the inspection phase: (**a**) sample image frames used to learn the model of the environment, and (**b**) sample images detected as novelty in the inspection phase.

Table 6 presents a set of sample novelty detectors generated by the proposed framework for each dataset. We show the parameter values of *η*1, *η*2, *Sthr*, and *Ethr* for the SECoS detectors, and the parameter values of *aT*, *hT*, *η*, and  for the GWR detectors. The table also reports the vector size of the deep features found for each detector.

**Table 6.** Set of sample evolved detectors generated by the proposed global optimization framework on all the datasets.


In Table 7, we report the performance of the above-evolved detectors. We can observe that the SECoS detectors behaved similarly to the GWR detectors in terms of novelty detection (see the *TNR* values), except on dataset D-5, where the SECoS detector outperformed the GWR. Moreover, on datasets D-1, D-3, D-4, D-6, and D-7, the SECoS detectors exceeded the GWR in terms of the *TPR* values.


**Table 7.** Results in the inspection phase (unseen data) of the sample evolved detectors. Bold values indicate the best result for each metric.

Now, we present some visual results of the evolved detectors in the environments in the morning. In Figure 14, the novelty detectors learned a model of the original environment O-1 and detected the person as the novel object. The figure shows the novelty indication of both methods, an image frame in the exploration phase (picture in the upper left corner), and a picture at the same time step in the inspection phase. We mark the novel object with a yellow ellipse. The figure also presents some successful novelty detections on the right side. From these samples, we can observe the advantage of the evolved detectors: they could detect the person at different scales, perspectives, and occlusion levels.

**Figure 14.** Visual results in novelty detection on dataset D-1, with the person as the novel object.

Figure 15 shows another example of the visual exploration and inspection task. The task consists of learning a model of the original environment O-1 and detecting the black tire in environment O-3 during the inspection phase. The detectors found the tire as the novel object in all cases; the methods could even detect novelties under occlusion; see the last detection sample (*t* = 334), where the tire is only partially visible.

**Figure 15.** Visual results in novelty detection on dataset D-2, with the tire as the novel object.

A more challenging example is presented in Figure 16. In this figure, the detectors should have found that the black tire was the novel object and the person was the normal object. In almost all cases, the methods could detect the novel object. However, some false novelty detections appeared with the person; the SECoS was less sensitive to this phenomenon than the GWR. Another challenging problem is to detect the person as the novel object and the tire as the normal object. Figure 17 illustrates the performance of both detectors in this situation. As in the previous example, the methods detected the person in almost all cases but produced some false novelty detections on the tire.

**Figure 16.** Visual results in novelty detection on dataset D-3 (the tire as the novel object, and the person as the normal object).

**Figure 17.** Visual results in novelty detection on dataset D-4 (the person as the novel object and the tire as the normal object).

We then present the visual results of detecting both the tire and the person as the novel objects (multiple novel object detection). In this case, both methods could identify the tire and the person with only one false novelty detection; see Figure 18.

**Figure 18.** Visual results in novelty detection on dataset D-5, with the person and the tire as the novel objects.

While the previous cases showed results on novel objects that differed clearly from the environment, the next cases show visual exploration and inspection tasks with inconspicuous novel objects (i.e., brown boxes in this experiment). To capture the image frames, the UAV flew at a height of 4 m under afternoon light conditions. In Figure 19, the problem was to detect the images with the brown boxes using a learned model of the empty environment in the afternoon (called environment "O-5"). We can observe that the evolved detector found the brown boxes in almost all cases, with only two false novelty indications.

Finally, we show the results of the evolved detectors when a person occluded the brown boxes. Figure 20 presents this situation. The results show that the evolved detectors learned a model of the environment with the person and detected the images with the brown boxes, even when the person occluded them.

**Figure 19.** Visual results in novelty detection on dataset D-6 (the brown boxes as the novel objects).

**Figure 20.** Visual results in novelty detection on dataset D-7 (the brown boxes as the novel objects).

In summary, the visual results show that the evolved detectors identified the novelty in almost all cases. The detectors produced some false novelty detections; however, in this type of problem, it is more important to detect the novelties than to classify all the normal data correctly. Furthermore, the proposed detectors showed excellent capabilities in challenging scenarios with illumination changes, scale variations, and occlusions.

#### *5.4. Limitations*

The proposed framework addresses visual novelty detection in exploration and inspection tasks. Although our method was robust to illumination changes, scale, and occlusion, the evolved detectors presented some issues with abrupt perspective changes in the images induced by the flight control of the UAV.

Figure 21 shows some failure samples of novelty detection. In the first row, we present some sample images used for training the evolved novelty detector (GWR in this case). In the second row, we show some sample images from the inspection phase, with a change in perspective induced by the flight control of the UAV. In the exploration phase, the GWR system builds a model of normality of the environment with the tire (environment O-3). In the inspection phase, the system should detect the person as the novelty in the environment with the tire and the person (environment O-4). Due to the change in perspective of the image frames in the inspection phase, induced by the flight control module of the UAV, these frames were encoded with information not represented in the learned model of normality. Therefore, the system detected them as novelty. A possible solution to this problem is to evolve the novelty detectors online so that they adapt to dynamic changes in the environment. Another possible solution is to learn ad hoc visual features for the problem. We could also incorporate information from other UAV sensors to complement the visual information; with this new information, we could detect new types of novelty, such as novelty based on object position. All these issues will be the subject of future studies.

**Figure 21.** Failure cases of the evolved GWR detector on dataset D-4: (**a**) sample image frames in the exploration phase, and (**b**) false novelty indications in the inspection phase. In the exploration phase, the UAV explores environment O-3. Then, it should detect the person as the novelty in environment O-4. In the inspection phase, due to changes in perspective in the frames induced by the UAV's flight, some false novelty detections occurred because the encoded frame information was too different from the learned model.

#### **6. Conclusions**

The proposed methodology addresses the problem of the automatic design of novelty detectors in visual exploration and inspection tasks, facing the challenge of unbalanced data. We proposed a new framework that uses deep features extracted by a pre-trained convolutional neural network. The methodology exploits the robust capability of deep features to represent the images. A significant contribution of the work is the design of novelty detectors for specific tasks based on a global optimization technique: the proposed methodology simultaneously finds the size of the feature vector and the parameters of the novelty detectors. The methodology was tested in an outdoor environment with images captured by an unmanned aerial vehicle. We considered different types of novelties to verify the performance of the proposed methodology, including conspicuous and inconspicuous novel objects, static and dynamic novel objects, and multiple novel objects. We also considered two different light conditions in the outdoor environment (morning and afternoon) and two different flight heights (2 m and 4 m). We compared the proposed methodology with well-established feature extraction techniques on visual exploration and inspection tasks under the above conditions. The results showed that the proposed methodology is competitive with, or even better than, these traditional techniques. Based on the results, we observed that the evolved detectors are robust to illumination changes, scale changes, and some levels of occlusion. Although they presented some problems with perspective changes produced by the flight control module of the unmanned aerial vehicle, the proposed evolved methods could detect the novelties in almost all cases, which is a desirable characteristic of novelty detection methods.

As future work, we will develop an online technique for designing novelty detectors to address dynamic changes in the environment. Further studies are needed to test the performance of the methodology under abrupt perspective changes of the objects. Another exciting research direction would be to use sensor fusion to detect novelties when it is difficult to do so with visual information alone.

**Author Contributions:** Conceptualization, M.A.C.-C. and V.A.-R.; Methodology, M.A.C.-C.; Software, M.A.C.-C.; Validation, V.A.-R., U.H.H.-B., and J.P.R.-P.; Investigation, M.A.C.-C.; Resources, J.P.R.-P. and U.H.H.-B.; Data Curation, U.H.H.-B. and J.P.R.-P.; Writing—Original Draft Preparation, M.A.C.-C.; Writing—Review and Editing, M.A.C.-C., V.A.-R., U.H.H.-B., and J.P.R.-P.; Supervision, V.A.-R.

**Funding:** This research received no external funding.

**Acknowledgments:** Marco A. Contreras-Cruz thanks the National Council of Science and Technology (CONACYT) for the scholarship with identification number 568675. The authors thank the Program for the Strengthening of Educational Quality (PFCE) 2019 of the University of Guanajuato for providing the publication costs.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Training Data Extraction and Object Detection in Surveillance Scenario †**

#### **Artur Wilkowski \*, Maciej Stefańczyk and Włodzimierz Kasprzak**

Institute of Control and Computation Engineering, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warszawa, Poland; maciej.stefanczyk@pw.edu.pl (M.S.); wlodzimierz.kasprzak@pw.edu.pl (W.K.)

**\*** Correspondence: artur.wilkowski@pw.edu.pl; Tel.: +48-22-234-7276

† This paper is an extended version of our paper published in: Wilkowski, A.; Kasprzak, W.; Stefańczyk, M. Object detection in the police surveillance scenario. In Proceedings of the 14th Federated Conference on Computer Science and Information Systems, Leipzig, Germany, 1–4 September 2019.

Received: 31 March 2020; Accepted: 5 May 2020; Published: 8 May 2020

**Abstract:** Police and various security services use video analysis for securing public space, mass events, and when investigating criminal activity. Due to the huge amount of data supplied to surveillance systems, some automatic data processing is a necessity. In one typical scenario, an operator marks an object in an image frame and searches for all occurrences of the object in other frames or even image sequences. This problem is hard in general. Algorithms supporting this scenario must reconcile several seemingly contradictory factors: training and detection speed, detection reliability, and learning from small data sets. In the system proposed here, we use a two-stage detector. The first, region proposal, stage is based on a Cascade Classifier, while the second, classification, stage is based on either Support Vector Machines (SVMs) or Convolutional Neural Networks (CNNs). The proposed configuration ensures both speed and detection reliability. In addition, an object tracking and background–foreground separation algorithm, supported by the GrabCut algorithm and a sample synthesis procedure, is used to collect rich training data for the detector. Experiments show that the system is effective, useful, and applicable to practical surveillance tasks.

**Keywords:** object detection; few shot learning; SVM; CNN; cascade classifier; video surveillance

#### **1. Introduction**

Police and various security services use video analysis when investigating criminal activity. Long surveillance videos are increasingly searched by dedicated image analysis software to detect criminal events, to store them, and to initiate proper security actions. One prominent example is the P-REACT project [1] (Petty cRiminality diminution through sEarch and Analysis in multi-source video Capturing and archiving plaTform). Solutions for the automatic analysis of surveillance videos already seem mature enough, as the research community has recently been involved in significant benchmark initiatives [2,3]. The computer vision research focus has now shifted to the analysis of video data coming from handheld, body-worn, and dashboard cameras, and to the integration of such analysis results with police and public databases.

In typical object detection scenarios, there is a great deal of data to learn from, and a major objective is to use it effectively. In a security-oriented environment, the user interaction should be kept as simple as possible. The optimal solution would be marking only a single object in a selected image frame and initiating a search to find occurrences of similar objects in other frames of the processed sequence or in different sequences. This imposes several constraints on the machine vision solution that need to be addressed.

First of all, the system should learn on-line or nearly on-line. The system must also perform per-frame detection quickly and provide approximate results in a short time, but should also be tuned in such a way that no occurrence of the interesting object is skipped. Last but not least, the system must be able to learn from small data sets.

This paper is an extension of our conference paper [4], which describes an effective and time-efficient algorithm for instance search and detection in images from handheld video cameras. The system described there uses a discriminant approach to differentiate the object from its background. To do so, a combined Haar–Cascade detector and Histogram of Oriented Gradients–Support Vector Machine (HOG-SVM) classifier are used. We argued that this provides a desirable trade-off between detection quality and training/detection times. Both positive and negative samples are extracted only from training images.

The extended version presented here includes new system elements and experiments. In particular, the additions are as follows:


The main components of the system that are affected by the additions are marked in bold in Figure 1.

**Figure 1.** Structure of the training procedure.

Comparable detector solutions based on CNNs provide excellent detection performance [5]. Such solutions, however, rely on off-line training, and the training/detection speed is still a bottleneck for such systems. This effect is, to some extent, ameliorated by GPU utilization. Recent developments aim at reducing detection times by cascading CNNs [6] or by first detecting salient regions using fuzzy logic [7]. However, a significant reduction of training time is still an open area of research.

This work can be categorized as a few-shot learning system. The most popular approaches in this field focus on distance metric learning (which, on its own, has a long history [8]) with further data clustering using the *L*<sub>2</sub> metric. In recent years, new solutions based on deep learning have emerged [9], with relational networks being a popular choice for few-shot learning [10]. A popular choice for the loss function, in that case, is the triplet loss [11]. The application of distance metric learning to classification is straightforward: a metric is produced based on training samples, and the query embeddings are compared with the class representatives using the *L*<sub>2</sub> metric and/or *k*-Nearest Neighbors. This approach has proved successful in multiple classification tasks [12–15].
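For readers unfamiliar with the triplet loss mentioned above, a minimal plain-Python sketch of its standard form may help; the margin value below is an illustrative choice, not one taken from the cited works:

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss over embedding vectors: pulls the
    anchor toward the positive sample and pushes it away from the
    negative one, using squared L2 distances."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Loss is zero once the negative is farther than the positive
    # by at least `margin`.
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
```

Minimizing this loss over many triplets shapes the embedding space so that *L*<sub>2</sub> distance (or *k*-NN) can then separate classes.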

Applying few-shot learning to detection is a harder problem, as the marked object in the training set usually occupies only a small portion of the image. Hence, the training data set is heavily unbalanced. One possibility is to transfer knowledge from existing detectors [16]. Another alternative is to use semi-supervised learning, with few labelled samples and a larger set of unlabelled data iteratively used in training [17]. A different approach is described in [18], where imitations are used in the training process for a robot-grasping task. Finally, the direct application of distance metric learning was proposed in [19], where the InceptionV3 network is used as a backbone for metric learning.

Using the taxonomy proposed in [20] (data-, model-, or algorithm-based), our solution is data-based, with handcrafted rules for transforming the training dataset.

One contribution of the paper is the procedure for collecting as much realistic training data as possible while requiring limited user interaction. Another contribution is the proposition of a complete computer-aided video surveillance procedure and a thorough evaluation of the applicability of particular computer vision methods to specific stages of processing, keeping in mind system requirements such as the ability to learn from short video sequences, quick learning, and fast detection times. The evaluation comprised both classic computer vision methods as well as contemporary CNN approaches.

Ideally, the modeling stage should be able to start from a single selection of a Region of Interest (ROI), with all additional examples obtained automatically. Such least-user-effort approaches have already been discussed, e.g., for semi-automatic video annotation and detection systems such as [21,22]. In the cited method, however, the user may be asked to annotate the video several times (to decide about samples lying on the decision boundary), which is not necessarily acceptable for all end users. An example of another successful detector that works on a single selection is given in [23]. The detector operates on a sparse image representation (a collection of Scale Invariant Feature Transform (SIFT) descriptors), so it is very time efficient. Our initial experiments have shown that descriptor-based approaches work best for highly textured and fairly complex objects that occupy a large part of the image, which is not always the case in surveillance scenarios.

The procedure for collecting training data given in this paper combines object tracking and background subtraction methods for the semi-supervised collection of training windows and their foreground masks. This step is supported by the GrabCut algorithm [24], with the possibility of smoothly mixing the results of both. The samples collected during tracking are further synthetically generalized (augmented) to enrich the training set. Scenarios where tracking results are utilized for collecting a detector's training data have already been covered in the literature, especially regarding tracking, with prominent examples [25,26] and more recent CNN approaches [27,28]. In such approaches, the exact foreground–background separation (which is crucial for the effective synthesis of samples) is often neglected, since the algorithms typically have enough frames to collect rich training data.

The proposed methods were evaluated on a corpus of surveillance videos, and the results proved that their efficiency is good enough to support a user (police officer or security official) in their everyday working tasks. The proposed two-level detector architecture was evaluated using alternative methods for feature calculation, namely Histogram of Oriented Gradients (HOG) features and VGGNet-based deep features. In the former case, the SVM classifier was used; in the latter, the neural network classification layers were utilized. For the final evaluation, we compared the accuracy of our proposed system with the state-of-the-art Faster R-CNN [29] detector trained on the same data.

The paper is organized as follows: Section 2 presents the technical background and methodology used in our system, Section 3 provides experimental results, and Section 4 contains conclusions. For the reader's convenience, a short dictionary of the abbreviations used in the paper is provided at the end.

#### **2. Methods**

#### *2.1. Detector Overview*

In the system described in this paper, we utilize a classic detection framework, where a sliding window of varying size is moved over each frame and, for each location, the selected image part is evaluated against information gathered from training samples. A crucial part of the detector is formed by a classifier (SVM or NN), which is responsible for the evaluation of each selected image part. A pure classifier, applied to hundreds of thousands of candidate areas, would be too slow to learn and detect. In our scenario, a pre-classification step utilizing a Haar-like feature-based cascade classifier is applied to limit the number of candidate windows to several hundred. We claim that this simple structure combines a good detection rate with acceptable detection speed (about ten full-HD frames per second on a modest computer) as well as acceptable training speed in typical scenarios (less than a few minutes per pattern).

In essence, the two-stage detector architecture resembles some significant modern CNN approaches, where the detection is divided into a region proposal part and a region recognition part [29]. In our approach, region proposal is performed by the cascade classifier, and the (SVM/NN) classifier makes the final classification. Both methods offer the reasonable training and detection speeds required for this application.
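The two-stage flow can be sketched as follows; this is an illustrative skeleton, with `propose_regions` and `classify` standing in for the cascade classifier and the SVM/NN classifier, respectively (both names are ours, not identifiers from the system's code):

```python
def detect(frame, propose_regions, classify, score_thr=0.5):
    """Two-stage detection: a fast region-proposal stage (e.g., a
    Haar cascade) prunes the candidate windows, and a slower but
    more accurate classifier (SVM/NN) makes the final decision on
    each surviving window."""
    detections = []
    for window in propose_regions(frame):   # stage 1: cheap pruning
        score = classify(frame, window)     # stage 2: accurate scoring
        if score >= score_thr:
            detections.append((window, score))
    return detections
```

The design point is that the expensive classifier runs on hundreds of windows rather than the hundreds of thousands a raw sliding window would produce.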

In our scenario, sources of data are naturally sparse. Depending on the user's decision, the detector can be trained either on one training image or on a short sequence of them. Therefore, a critical part of our system is a set of tools aiding the user in the effortless collection of training examples from short image sequences, as well as methods for the artificial synthesis and generalization of training samples to provide the detector with training data as rich as possible. These tools and methods are discussed in subsequent sections. The overall structure of the training procedure is given in Figure 1.

The first stage of processing is the interactive collection of training samples (*Interactive ROI selection*). The user marks an object in the image and initializes the tracking procedure to collect more training data for the selected object (*Tracking of marked object*). In the case of un-stabilized images (e.g., from a hand-held camera), an additional stabilization step can be applied (*Short sequence stabilization*). The tracking is open to user intervention, e.g., correction of the ROI in a single frame or several frames.

The next step of the procedure is foreground–background segmentation (*Fg./Bg. segmentation*), which helps to recover the precise outline of the object from the rectangular ROI based on motion and color information. After this comes *Samples synthesis*; here, the collected training samples are jittered in different ways to enrich the training set. Both steps enable the operator to correct algorithm parameters with visual feedback. The last step of detector generation is the training of a 2-level classifier (*2-level classifier training*).

#### *2.2. Collection of Positive Training Samples*

Although for some patterns (including, e.g., flat patterns) good detection results can be obtained using only one selected sample that is further generalized and synthesized into a set with larger variability, in most cases detection results depend strongly on the size and diversity of the input training set. In the scenario discussed in this paper, these properties of the training set can (at least partially) be achieved by collecting samples from a short sequence of input images. Our scenario is organized as follows: (1) the user selects an object of interest using a rectangular area, (2) the application tracks the object in subsequent frames of the sequence (with optional manual reinitialization), and (3) object foreground masks are established using motion information and image region properties.

#### 2.2.1. Object Tracking and Foreground–Background Separation Using Motion Information

For tracking the rectangular area, an optimized version of the Circulant Structure of Kernels (CSK) tracker [30] that utilizes color-names features [31] is used. As a result of the tracking procedure, we obtain a sequence of rectangular areas that encompass the object of interest in subsequent frames. In most cases, both the object foreground and the background will be present in the tracked rectangle. However, if the object is moving against a moderately static background, we can exploit motion information to effectively separate the object foreground from the background by background subtraction.

Let the tracking results be described by a sequence of rectangular areas {*R*<sup>1</sup>, ... , *R*<sup>T</sup>}, and let us denote the coordinates of pixel *i* as *p<sub>i</sub>*, the color attributes of pixel *i* at time *t* as *c<sub>i</sub><sup>t</sup>*, and the mean of the color attributes in the background as:

$$\bar{c}\_i = \frac{1}{n\_i} \sum\_{t: p\_i \notin R^t} c\_i^t \tag{1}$$

where the averaging factor *n<sub>i</sub>* is the number of frames in which the tracking window does not contain pixel *i*; it can be computed as *n<sub>i</sub>* = |{*t* : *p<sub>i</sub>* ∈/ *R<sup>t</sup>*}|.

Now we can specify a background training sequence {*c*ˆ<sub>*i*</sub><sup>*t*</sup>} for each pixel:

$$\hat{c}\_{i}^{t} = \begin{cases} c\_{i}^{t} & \text{if } p\_{i} \notin R^{t} \\ \bar{c}\_{i} & \text{if } p\_{i} \in R^{t} \end{cases} \tag{2}$$

Following the rule above, only pixels that do not belong to the tracked area at a given time step contribute to the background model computed for the image. Each pixel that always belongs to the tracked area is conservatively treated as foreground, as we are unable to establish a background model for these areas.
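For a single pixel with scalar color values, Equations (1) and (2) can be illustrated with the following sketch (our own illustrative code, not the paper's implementation):

```python
def background_training_sequence(values, inside_roi):
    """Per-pixel background sequence following Eqs. (1)-(2):
    `values` is the pixel's color over time, and `inside_roi[t]` is
    True when the pixel lies inside the tracked rectangle R^t.
    Frames where the pixel is inside the ROI are replaced by the
    background mean computed over the remaining frames."""
    outside = [v for v, inside in zip(values, inside_roi) if not inside]
    if not outside:            # pixel always tracked: treat as foreground
        return None
    mean = sum(outside) / len(outside)          # Eq. (1)
    return [v if not inside else mean           # Eq. (2)
            for v, inside in zip(values, inside_roi)]
```

Returning `None` for always-tracked pixels mirrors the conservative rule above: no background model can be established for them.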

The background model adopted here follows the algorithms from [32]. In this method, the scene color is represented independently for each pixel. The color of each pixel (from both the background and the foreground, *BG* + *FG*), given the training sequence *C<sub>T</sub>*, is modeled as:

$$p(c\_i \mid C\_T, BG + FG) = \sum\_{m=1}^{M} \hat{\pi}\_m \mathcal{N}(c\_i; \hat{\mu}\_m, \hat{\sigma}\_m) \tag{3}$$

where *μ*ˆ*<sub>m</sub>* and *σ*ˆ*<sub>m</sub>* are the estimated means and standard deviations of the color mixture components, *π*ˆ*<sub>m</sub>* are the mixing coefficients, *M* is the total number of mixture components, and N(*c<sub>i</sub>*; *μ*ˆ*<sub>m</sub>*, *σ*ˆ*<sub>m</sub>*) denotes the Gaussian density function evaluated at *c<sub>i</sub>*.

In the algorithm, a (generally correct) assumption is made that background pixels, appearing most often, will dominate the mixture. Therefore, the background model (*BG*) is built from a selected number of the largest clusters in the color mixture:

$$p(c\_i \mid C\_T, BG) = \sum\_{m=1}^{B} \hat{\pi}\_m \mathcal{N}(c\_i; \hat{\mu}\_m, \hat{\sigma}\_m) \tag{4}$$

where *B* is the selected number of background components. A pixel is classified as background when

$$p(c\_i \mid C\_T, BG) > c\_{thr} \tag{5}$$

The threshold *c<sub>thr</sub>* can be interactively adjusted by the user. The exact algorithms for updating the mixture parameters are given in [32]. A sample result of the background subtraction procedure is shown in Figure 2.

**Figure 2.** Results of automatic foreground–background separation.

Some modern developments in foreground–background separation using the Robust Principal Component Analysis (RPCA) approach were proposed, e.g., in [33,34]. They are founded on extensive optimization in 3D spatio-temporal volumes and offer excellent accuracy at the expense of processing speed. Since our system relies on interaction between the human operator and the computer, processing speed is very important, so purely local methods currently seem to be the best choice. However, this topic will be investigated in future versions of the system, since some ideas from [33,34] are likely to be complementary to our developments presented in Section 2.2.3.

#### 2.2.2. GrabCut Algorithm

GrabCut [24] is a widely acclaimed method for (semi-)automatic foreground/background segmentation. The method takes into account several properties of image regions: color distribution, coherence of regions, and contrast between regions. These factors are combined into an image-wide energy function to be optimized, which assumes the following form:

$$\mathbf{E}(\underline{\alpha}, \underline{\theta}, \mathbf{z}) = U(\underline{\alpha}, \underline{\theta}, \mathbf{z}) + V(\underline{\alpha}, \mathbf{z})$$

where $\alpha_n$ describes the segmentation information for pixel *n* (can be binary), *θ* is the set of parameters of the Gaussian Mixture Models representing the color distributions of the background and foreground, and **z** are the observed image pixels. Estimated parameters are underlined. The energy term *U* describes how well the current estimate of foreground and background pixels matches the assumed color distribution of the foreground and background, while the term *V* evaluates the spatial consistency of regions by penalizing discontinuities (except in areas of high contrast).

The best configuration of parameters is the one minimizing the term **E**(*α*, *θ*, **z**). The components are designed so that the energy can be minimized using an efficient graph-cut algorithm. The optimization is iterative and alternates between (re)estimation of the region color distributions and (re)estimation of the segmentation.

The input of the algorithm is defined in [24] as a trimap {*TB*, *TU*, *TF*}. *TB* stands for sure background, *TF* stands for sure foreground, and *TU* is an unknown area (to be estimated). *TB* and *TF* are fixed and cannot change during the algorithm. A typical initialization is to set *TB* to the area outside the object ROI, set *TF* to ∅, and let *TU* be the remaining part of the image. In the first iteration, all pixels from *TB* are initialized as background and all pixels from the unknown area *TU* as foreground (which is subject to change). Implementations like [35], however, allow specifying additional areas within *TU*: the likely foreground $\hat{T}_F$ and likely background $\hat{T}_B$, as a convenient starting point for the optimization.

#### 2.2.3. Object Tracking and Foreground–Background Separation Using Hybrid Motion Information and GrabCut

While pure object motion information is sufficient to perform foreground–background segmentation in most cases, it fails altogether for static objects. In addition, the precision of such an approach varies from case to case and strongly depends on the manual selection of the background subtraction threshold *cthr*. Therefore, we propose a modified procedure that fine-tunes the results of background subtraction using GrabCut. The GrabCut trimap is initialized as follows:

	- the foreground region is used to initialize the G-C *TF*, with the exception of areas for which no background model could be reliably established (areas that belong to every collected tracking ROI),
	- the G-C *TB* is initialized outside the tracked ROI border; to provide enough pixels for background estimation, the tracking area is enlarged uniformly by 50%,
	- the remaining area of the ROI becomes the *TU*.
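The trimap initialization above can be sketched in NumPy using OpenCV's GrabCut label convention. The helper name, the 50% enlargement expressed as a `scale` factor, and the mask arguments are our assumptions:

```python
import numpy as np

GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD = 0, 1, 2, 3   # OpenCV GrabCut labels

def init_trimap(shape, roi, fg_mask, unreliable_mask, scale=1.5):
    """Trimap for the hybrid B-S + GrabCut procedure. `roi` is the tracked
    window (x0, y0, x1, y1); `fg_mask` marks the foreground found by
    background subtraction; `unreliable_mask` marks pixels with no reliable
    background model. The ROI is enlarged by 50% to supply sure-background
    pixels outside it."""
    H, W = shape
    trimap = np.full((H, W), GC_BGD, dtype=np.uint8)   # outside enlarged ROI: T_B

    x0, y0, x1, y1 = roi
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    hw, hh = (x1 - x0) * scale / 2, (y1 - y0) * scale / 2
    ex0, ey0 = max(0, int(cx - hw)), max(0, int(cy - hh))
    ex1, ey1 = min(W, int(cx + hw)), min(H, int(cy + hh))

    trimap[ey0:ey1, ex0:ex1] = GC_PR_BGD               # enlarged ROI: unknown, T_U
    fixed_fg = fg_mask & ~unreliable_mask              # reliable B-S foreground: T_F
    trimap[fixed_fg] = GC_FGD
    return trimap
```

The resulting array could then be passed to `cv2.grabCut` in mask-initialization mode.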

Our solution is somewhat similar to the one proposed quite recently in [36]. However, in the cited approach the trimap is initialized differently: the G-C *TU* area is limited to the area of the morphological gradient (the difference between dilation and erosion) of the foreground area established by background subtraction. In our solution we safely assume that *TB* is always outside the tracked ROI, so there is no risk of incorporating the foreground object into *TB*. Additionally, in contrast to our approach, in [36] the background subtraction is not discussed within a tracking context.

The solution proposed here is parametrized by a single background subtraction threshold *cthr* and provides a smooth user experience when transitioning between different threshold values. By specifying very low thresholds, the user obtains a foreground mask covering the whole tracked ROI. For larger thresholds, we obtain the results of the G-C algorithm with the foreground constrained to contain at least the mask generated by the background subtractor. For very large, extreme values of *cthr*, the foreground seeds from the background subtractor become small, and the method converges to the output of the vanilla G-C algorithm for static images.

It is interesting to note that for static ROIs we use exactly the same procedure. For such ROIs, the area with an uncertain background model (the area where no background model could be reliably established) is very large and covers the whole ROI. In such a situation, the entire ROI area is simply subject to the classic G-C algorithm. Note also that using only the methods from Section 2.2.1, the whole ROI area would inevitably be labelled as foreground.

#### 2.2.4. Image Stabilization in a Short Sequence

The foreground–background segmentation procedure works best when a stable camera position is available (or the image sequence is stabilized before segmentation). The system proposed here uses a stabilization procedure based on matching SURF features [37] and computing the homography transformation between pairs of images. The stabilization works on short subsequences of the original sequence. The first frame to stabilize is the one used for marking the initial region of interest. The procedure then aligns all subsequent frames to the first frame by estimating the homography relating the two images. To do so, the matching methods from [38] and the Least Median of Squares principle [39] are utilized. To increase stabilization efficiency, GPU-accelerated procedures for keypoint/descriptor extraction and matching from the OpenCV library are utilized [35].
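As an illustration of the core step, the homography between two frames can be estimated from matched keypoints with the Direct Linear Transform. This minimal NumPy sketch omits the SURF matching and the Least Median of Squares robustness loop used in the actual system:

```python
import numpy as np

def estimate_homography(src, dst):
    """Minimal DLT homography fit from >= 4 point correspondences.
    src, dst: (N, 2) arrays; returns 3x3 H with H @ [x, y, 1]^T ~ [x', y', 1]^T.
    The production pipeline wraps a fit like this in a robust (LMedS) loop."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)           # null-space vector = homography entries
    return H / H[2, 2]

def warp_point(H, p):
    """Apply homography H to a 2-D point p."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]
```

In practice, OpenCV's `findHomography` with the LMEDS flag performs both the fit and the robust outlier rejection in one call.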

#### *2.3. Collection of Negative Training Samples*

Negative samples used in detector training are extracted from the same sequence images that the positive samples originated from. For each training image, one fragment is used to extract a positive sample, while the remaining part of the image is divided into at most four sources of negative samples, as shown in Figure 3. Thus, an assumption is made that these remaining parts of the training sequence images do not contain positive samples. This assumption is not always valid, but it may be strengthened by asking the user to mark all positive examples in the training sequence.

**Figure 3.** Division into positive (P) and negative (N1–N4) examples.

#### *2.4. Positive Samples Generalization and Synthesis*

#### 2.4.1. Geometric Generalization

In this step, 3D rotations are applied to the collected pattern images and their masks. It is assumed that patterns are planar, so this generalization method is useful only to some extent for non-planar objects. The rotation effect is obtained by applying a homography transformation imitating the application of three rotation matrices *Rx*(*α*), *Ry*(*β*), *Rz*(*γ*) to a 3D object. The matrices correspond to rotations around the *x*, *y*, and *z* axes, respectively. The 3D rotation matrices are defined classically:

$$\begin{aligned} R\_x(\theta) &= \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos \theta & -\sin \theta \\ 0 & \sin \theta & \cos \theta \end{pmatrix} \\ R\_y(\theta) &= \begin{pmatrix} \cos \theta & 0 & \sin \theta \\ 0 & 1 & 0 \\ -\sin \theta & 0 & \cos \theta \end{pmatrix} \\ R\_z(\theta) &= \begin{pmatrix} \cos \theta & -\sin \theta & 0 \\ \sin \theta & \cos \theta & 0 \\ 0 & 0 & 1 \end{pmatrix} \end{aligned} \tag{6}$$

To compute the transformation, first a homography matrix is computed using formula:

$$H = R - \frac{\mathbf{t}\mathbf{n}^T}{d} \tag{7}$$

where **n** is a vector normal to the pattern plane (we set it to $\mathbf{n} = (0, 0, 1)^T$), *d* is the distance from the virtual camera to the pattern (we set it arbitrarily to *d* = 1, since it only scales 'real-world' units of measurement), and *R* is the 3D rotation matrix composed as:

$$R = \left( R\_x(\alpha) \cdot R\_y(\beta) \cdot R\_z(\gamma) \right)^{-1} \tag{8}$$

In order for the image center (having world coordinates $C = (0, 0, d)^T$) to remain fixed during the transformation, we define a 'correcting' translation vector as:

$$\mathbf{t} = -RC + C \tag{9}$$

Then we can specify the artificial camera matrices $K_1$ and $K_2$:

$$K\_1 = \begin{pmatrix} f & 0 & c\_{\text{in}}^x \\ 0 & f & c\_{\text{in}}^y \\ 0 & 0 & 1 \end{pmatrix}, K\_2 = \begin{pmatrix} f & 0 & c\_{\text{out}}^x \\ 0 & f & c\_{\text{out}}^y \\ 0 & 0 & 1 \end{pmatrix} \tag{10}$$

where $(c^x\_{in}, c^y\_{in})^T$ and $(c^x\_{out}, c^y\_{out})^T$ are the pixel coordinates of the input and output image centers, respectively, while *f* is the artificial camera focal length given in pixels. In this application, we set *f* to *fmul* times the larger input image dimension. The multiplier *fmul* determines the virtual distance of the virtual camera from the object: smaller values introduce larger perspective distortions of the transformation, larger values smaller ones. We arbitrarily set *fmul* to 10, implying only slight perspective distortions.

The final homography transformation applied to the pixels of the input image is given by

$$P = K\_2HK\_1^{-1} \tag{11}$$

The rotation angles *α*, *β*, and *γ* are selected randomly from the uniform distribution (denoted here as U). The amount of rotation around the *y* axis is twice the amount of rotation around the remaining axes, to better reflect the dominant rotations in human movement:

$$\begin{aligned} \alpha &\sim \mathcal{U}(-1,1) \cdot \delta\_{\max} \cdot 0.5, \\ \beta &\sim \mathcal{U}(-1,1) \cdot \delta\_{\max}, \\ \gamma &\sim \mathcal{U}(-1,1) \cdot \delta\_{\max} \cdot 0.5 \end{aligned}$$

and *δmax* is the parameter specifying the maximum extent of allowed rotation.
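Eqs. (6)-(11) can be combined into a single routine. In this NumPy sketch, both camera principal points are placed at the image center, and the sign of the correcting translation is chosen so that the center indeed stays fixed under Eq. (7); the function names and defaults are our assumptions:

```python
import numpy as np

def rotation_homography(alpha, beta, gamma, w, h, f_mul=10.0, d=1.0):
    """Homography P = K2 * H * K1^{-1} (Eqs. (6)-(11)) imitating a 3D rotation
    of a planar w x h pattern; here K1 = K2 = K, principal point at center."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    R = np.linalg.inv(Rx @ Ry @ Rz)                 # Eq. (8)

    C = np.array([0.0, 0.0, d])
    t = R @ C - C      # correcting translation; sign chosen so that H below
                       # maps the pattern center to itself (cf. Eq. (9))
    n = np.array([0.0, 0.0, 1.0])
    H = R - np.outer(t, n) / d                      # Eq. (7)

    f = f_mul * max(w, h)                           # focal length in pixels
    K = np.array([[f, 0, w / 2], [0, f, h / 2], [0, 0, 1.0]])
    return K @ H @ np.linalg.inv(K)                 # Eq. (11)

def sample_angles(delta_max, rng=None):
    """Random rotation angles: twice the range around the y axis."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(-1, 1, size=3)
    return u[0] * delta_max * 0.5, u[1] * delta_max, u[2] * delta_max * 0.5
```

The resulting 3 × 3 matrix can be applied to an image with a standard perspective-warp routine such as OpenCV's `warpPerspective`.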

#### 2.4.2. Intensity and Contrast Synthesis

In the proposed approach, image intensity and contrast synthesis is applied in addition to the geometric transformations. It is especially important for Haar-like features, which lack intensity normalization.

First, the intensity values of pixels *Iin* are retrieved from the RGB image by extracting the V component from the HSV representation of the image and setting *Iin* = V. The intensity and contrast adjustment affects only the V channel. After adjustment, the RGB image is reconstructed from HSV′, where V′ = *Iout*.

For adjustment, a simple linear formula is used. For each pixel gray value *Iin* we have

$$I\_{out} = a \cdot I\_{in} + b \tag{12}$$

where

$$a = 1 + c\_{dev}, \\ b = I\_{dev} - \mu\_I \cdot c\_{dev} \tag{13}$$

where *μI* is the average intensity of the sample. The contrast deviation *cdev* and the intensity deviation *Idev* are sampled from the uniform distribution: *cdev* ∼ U(−1, 1) · *cmax* and *Idev* ∼ U(−1, 1) · *Imax*, where *cmax* is a parameter denoting the maximum allowed contrast change and *Imax* the maximum allowed intensity change. Changes in contrast preserve the mean intensity of the image. After applying the formula, the results are saturated to the valid intensity range.
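Eqs. (12) and (13) applied to the V channel can be sketched as below. The parameter defaults are arbitrary examples (not the paper's values), and saturation is implemented by clipping to [0, 255]:

```python
import numpy as np

def synthesize_intensity(v, c_max=0.3, i_max=50, rng=None):
    """Eqs. (12)-(13) on the V channel of an HSV image. c_max / i_max bound
    the contrast and intensity deviations; b is chosen so that a pure
    contrast change (i_dev = 0) preserves the mean intensity."""
    rng = rng or np.random.default_rng()
    c_dev = rng.uniform(-1, 1) * c_max
    i_dev = rng.uniform(-1, 1) * i_max
    a = 1.0 + c_dev
    b = i_dev - v.mean() * c_dev
    out = a * v.astype(float) + b
    return np.clip(out, 0, 255).astype(np.uint8)   # saturate to valid range
```

A full pipeline would convert RGB → HSV, replace V with the adjusted channel, and convert back.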

#### 2.4.3. Application of Blur

Training and test samples may differ in the quality of image details due to factors such as deficiencies of the optics used, motion blur, or distance. In our case, we apply a simple Gaussian filter to simulate natural blur effects, with its strength drawn as

$$\sigma \sim \mathcal{U}(0, 1) \cdot \sigma\_{\text{max}} \cdot \min(I\_{width}, I\_{height}) \tag{14}$$

where *Iwidth* and *Iheight* are the image sample dimensions and *σmax* controls the maximum size of the Gaussian kernel.
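The sampling of Eq. (14) and the resulting (separable) Gaussian kernel can be sketched as follows; the default `sigma_max` and the 3σ kernel radius are our assumptions:

```python
import numpy as np

def sample_blur_sigma(width, height, sigma_max=0.01, rng=None):
    """Eq. (14): draw the Gaussian blur strength relative to sample size."""
    rng = rng or np.random.default_rng()
    return rng.uniform(0, 1) * sigma_max * min(width, height)

def gaussian_kernel_1d(sigma, radius=None):
    """Separable 1-D Gaussian kernel, normalized to sum to 1."""
    radius = radius or max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / max(sigma, 1e-9)) ** 2)
    return k / k.sum()
```

Convolving the image with this kernel along each axis reproduces the Gaussian blur; a library call such as OpenCV's `GaussianBlur` does the same in one step.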

#### 2.4.4. Merging with the Background

Generalized training images are superimposed on background samples extracted from negative examples, with sizes ranging from about 0.25 to 4 times the positive sample size. Gray-level masks are used for the seamless incorporation of positive samples into the background images.
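The gray-level mask blending can be sketched as a per-pixel alpha composite; the function name and the 0-255 mask convention are our assumptions:

```python
import numpy as np

def merge_with_background(fg, bg, mask):
    """Superimpose a positive sample onto a background crop using a
    gray-level (0-255) mask as the per-pixel alpha (Section 2.4.4)."""
    alpha = mask.astype(float) / 255.0
    if fg.ndim == 3:                   # broadcast the mask over color channels
        alpha = alpha[..., None]
    out = alpha * fg + (1.0 - alpha) * bg
    return np.clip(out, 0, 255).astype(np.uint8)
```

Intermediate mask values (from the soft foreground–background boundary) yield a smooth transition instead of a hard paste edge.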

#### *2.5. Detector Training*

The detector training procedure is divided into two steps. In the first step, a cascade classifier using Haar-like features is trained. The classifier is trained on training samples resampled to a fixed size of 24 × 24 pixels. In our scenario, 300 positive and 100 negative samples are utilized for each cascade stage. The minimum true positive rate for each cascade level is set to 0.995, and the maximum false positive rate is set to 0.5. The classifier is trained for a maximum of 15 stages or until reaching an FPR of ≈0.00003. The expected TPR is at least 0.995<sup>15</sup> ≈ 0.93. With these settings, up to about 1000 detections are generated for each Full-HD test image.
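The quoted overall rates follow directly from the per-stage targets, since stage decisions multiply along the cascade; a quick check:

```python
# Per-stage targets: TPR 0.995 and FPR 0.5, cascaded over k = 15 stages.
k, tpr_stage, fpr_stage = 15, 0.995, 0.5
print(round(tpr_stage ** k, 3), fpr_stage ** k)   # → 0.928 3.0517578125e-05
```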

#### *2.6. Detector Training Using HOG+SVM*

During the second stage of training, an SVM classifier is trained to handle samples that passed the first, cascade-based classification. For most experiments, the SVM classifier is trained on 300 positive and 300 negative samples (or 600 and 600, respectively). The SVM classifier uses the Gaussian RBF kernel:

$$K(\mathbf{x}, \mathbf{y}) = \exp(-\gamma ||\mathbf{x} - \mathbf{y}||^2) \tag{15}$$

The Gaussian kernel size *γ* and the SVM regularization parameter *C* are adjusted using an automatic cross-validation procedure performed on the training data. For SVM classification, Histogram of Oriented Gradients (HOG) features [40] are extracted. Two resolutions of training images were used: 24 × 24 and 32 × 32. For each sample, a 9-bin histogram is created in 4 × 4-pixel cells, with a 16 × 16 histogram normalization window overlapping by 8 pixels, thus giving 4 × 16 × 9 = 576 HOG features in total in the 24 × 24 case and 9 × 16 × 9 = 1296 in the 32 × 32 case.
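The feature counts quoted above can be verified under one plausible reading of the layout (16 × 16 normalization blocks of 4 × 4-pixel cells, 8-pixel stride); Eq. (15) is included for reference:

```python
import numpy as np

def hog_feature_count(img_size, cell=4, block=16, stride=8, bins=9):
    """HOG feature count for a square sample: overlapping `block`-pixel
    normalization windows, each holding (block/cell)^2 cells of `bins` bins."""
    blocks_per_axis = (img_size - block) // stride + 1
    cells_per_block = (block // cell) ** 2
    return blocks_per_axis ** 2 * cells_per_block * bins

def rbf_kernel(x, y, gamma=0.5):
    """Eq. (15): Gaussian RBF kernel."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.exp(-gamma * np.sum((x - y) ** 2))

print(hog_feature_count(24), hog_feature_count(32))   # → 576 1296
```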

Negative samples are extracted from near the Cascade Classifier decision boundary (samples that were positively verified by the CC but are actually negative) if possible. If not, image fragments used as background images for positive samples or (as a last resort) other randomly selected samples are used. In all experiments, the OpenCV 3.1 [35] Cascade Classifier and SVM implementations are utilized.

Given our test data, the number of resulting support vectors in the SVM classifier varies between 200 and 400 in the 24 × 24 case but can be twice as large in the 32 × 32 case. Let us review one specific configuration: the 'hat' pattern trained on 55 images of size 24 × 24 with masks and pattern generalization settings *σmax* = *cmax* = 0, *δmax* = 0.7, *Imax* = 50. After SVM metaparameter optimization we obtain the SVM regularization parameter *C* = 2.5, RBF kernel size *γ* = 0.5, and 233 support vectors.

#### *2.7. Detector Training Using Tuned VGG16 Network*

The set of features processed by our detector is limited to HOG and Haar-like features. There exist quite a few other solutions that use a much richer set of features for object detection, e.g., [41], however, at the expense of increased computational complexity, which is an important practical concern for our system. As an alternative to evaluating complex sets of hand-crafted features, we decided to resort to state-of-the-art methods of automatic feature computation based on Convolutional Neural Networks.

Therefore, in addition to the classifier described in Section 2.6, we evaluate a VGG16 network [42] trained on the ImageNet dataset [43] and tuned to our problem using the transfer-learning principle [44]. Transfer learning is a common technique to overcome the problem of training data availability. In the case of limited access to training data for a given problem, one can use an existing network trained on another dataset and subsequently fine-tune (retrain) this network to accommodate the samples of the specific problem. Since the original network was usually trained on millions of examples and hundreds of classes, it is capable of extracting robust and usually quite universal features. Thus, the new network can benefit from the pre-trained feature layers and train only the classification layers.

For our task, we utilize the convolutional part of the VGG16 network (obtained from the original network by removing the fully connected layers). The network is augmented with two fully connected layers (1024 neurons with ReLU activation and a single neuron with sigmoid activation, respectively) and one dropout layer (with dropout rate 0.5) to avoid overfitting. The VGG16 network was originally trained on input images of 224 × 224 pixels; however, the convolutional part accepts any multiple of 32 for the image width and height. In this paper, we verify the classifier performance for input image sizes of 32 × 32, 64 × 64, 128 × 128, and finally 224 × 224 pixels. The total number of parameters in the fully connected layers ranges from 526 337 to 25 692 161 depending on the input image size. The overall network structure is given in Figure 4.
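The parameter range can be reproduced by counting the weights and biases of the two added dense layers. This check assumes the stated architecture (1024 ReLU neurons plus one sigmoid neuron on the flattened (n/32) × (n/32) × 512 feature map; dropout adds no parameters) and gives 25,692,161 for 224 × 224, matching the figure above, and 526,337 for 32 × 32:

```python
def fc_param_count(n, channels=512, hidden=1024):
    """Weights + biases of the two dense layers added on top of the VGG16
    convolutional part for an n x n input."""
    flat = (n // 32) ** 2 * channels   # flattened (n/32) x (n/32) x 512 map
    dense1 = flat * hidden + hidden    # 1024 ReLU neurons
    dense2 = hidden + 1                # single sigmoid output neuron
    return dense1 + dense2

for n in (32, 64, 128, 224):
    print(n, fc_param_count(n))        # 224 x 224 gives 25,692,161
```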

**Figure 4.** Neural Network Structure, *n* ∈ {1, 2, 4, 7}.

The selected loss function is binary cross-entropy. During training, all convolutional layers are frozen and only the parameters of the fully connected layers are adapted.

#### *2.8. Detection and Post-Processing*

During the detection phase, each test image is first processed by the cascade classifier, typically returning several hundred candidate areas. After this, each candidate area is examined by the second-stage classifier and a score is assigned to each detection. In the case of the SVM, the score is computed as the signed distance from the separating plane in support vector space, with the lowest negative scores treated as the best matches and high positive scores as the worst. For the VGG16-based neural network, the negated output of the sigmoid function is used as the score.

For each image, only the best-scoring area is considered for further processing. Frames from the test sequence are sampled and processed with increasing density (first, last, and middle frame for a start, then intermediate frames) to quickly produce some results for the user to review (non-minima suppression is used to reduce clutter).
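The best-score selection combined with a minimum-frame-distance filter (cf. Figure 7, which uses 25 frames) can be sketched as a greedy non-minima suppression over per-frame scores; the function and its arguments are hypothetical:

```python
def filter_detections(scores, min_distance=25):
    """Keep the best-scoring hits (lower = better, as with the SVM margin
    score) while suppressing hits closer than `min_distance` frames to an
    already accepted, better-scoring one (non-minima suppression).
    `scores` maps frame index -> best detection score in that frame."""
    kept = []
    for frame in sorted(scores, key=scores.get):   # best (lowest) scores first
        if all(abs(frame - f) >= min_distance for f in kept):
            kept.append(frame)
    return sorted(kept)
```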

#### **3. Experiments**

#### *3.1. Experimental Setup*

In order to evaluate the detector performance, the following tools are utilized.


In the subsequent sections, each ROC and PR curve was plotted based on a single run of the detector, whereas the majority of the tables contain data aggregated over multiple runs of the algorithm to compensate for the stochastic nature of the Machine Learning algorithms used. Computational performance experiments were performed on an AMD Ryzen 5 2600X processor (32 GB of memory) and a GeForce RTX 2070 GPU (8 GB of memory) unless stated otherwise.

#### *3.2. Preliminary Experiments*

During the first stage of experiments, a single test sequence, '00012', with 1776 Full-HD frames was selected. Using this sequence, various parameter configurations were evaluated in order to assess the basic properties of the proposed solution. Based on these experiments, some answers can be given to questions such as the impact of the two-layer detector on detection results and detection/training speed, the impact of the training sample selection method on detection accuracy, or the influence of image synthesis parameter values on overall quality. These questions are discussed in the following paragraphs. During the first three experiments, one sample pattern, 'hat', was utilized; in the last experiment three other patterns, 'logo', 'helmet', and 'shirt', were introduced. Examples of training samples are given in Figure 5, and samples marked in the full-frame image are given in Figure 6. Filtered detection results for one test sequence, presented in the form of a simple GUI, are given in Figure 7.

**Figure 5.** Example training samples of hat, logo, helmet, and shirt.

**Figure 6.** Frame with marked hat, logo, helmet, and shirt samples.

**Figure 7.** Detection results filtered by minimum distance (25 frames) between hits.

#### 3.2.1. Two-Layer Detector

In the first experiment, the trade-off between detection and training speed for different numbers of expected cascade stages *k* was evaluated (Figure 8a). In this experiment, the HOG+SVM second-stage classifier was used. Identical parameters were used for all *k* values, except for the number of SVM training samples. For *k* < 15, a larger number of 900 positive and negative samples was used; for *k* ≥ 15, the default of 300 positive and negative samples was utilized. This balances the total number of training samples consumed by the detector (for smaller *k*, the task of the SVM is harder, since more negative samples pass the first stage of detection).

The experiment shows that for low *k* the training time is dominated by SVM training, while for large *k* cascade training dominates. A good compromise for our data is obtained for *k* = 15. Larger *k* obviously means faster detection (Figure 8a), but also slightly worse detection results (Figure 8b), likely because for smaller *k* more of the work is done by the more robust HOG features in the second stage. This statement is also generally supported by the shape of the corresponding ROC curves in Figure 9. These preliminary performance experiments were performed on an Intel Core i5 computer.

#### 3.2.2. Collection of Training Samples

In the next experiments, detector performance for different training data collection methods was evaluated. First, the data samples were collected using the automatic tracking and foreground–background separation methods presented in this paper. In the process, 55 data samples of the hat pattern were collected, together with their automatically generated masks (using the methods from Section 2.2.1). The data consisted of images of a white hat on top of a head, while the head was making a full 180-degree rotation around its central axis. For comparison, a short sequence of training samples representing only 3 extreme head positions (*en face* and two profiles) was utilized. For both sequences, either the appropriate foreground–background masks or no masks were used, giving 4 different combinations of settings. The detection results are given in Figure 10.

Not surprisingly, the richest data source (55 frames with generated masks) gives the best results. It is worth noting that, for our data, applying both object tracking and automatic mask generation is essential to obtaining optimal results.

**Figure 9.** Hat in '00012' detection results with respect to number of the requested cascade stages.

**Figure 10.** Hat in '00012' detection results for different training data collection methods.

#### 3.2.3. Application of the GrabCut Algorithm

In this paragraph, we evaluate the hybrid method for foreground/background segmentation utilizing the background subtractor and GrabCut that was described in Section 2.2.3. Evidently, for single training images the proposed segmentation method reduces to pure GrabCut, which then becomes the only segmentation option; this should be beneficial in most cases (see the discussion on masks in the previous paragraph). However, the more interesting question is whether GrabCut can effectively cooperate with the background subtractor and enhance its results. Intuitively, it should be the case, since both methods operate on different principles.

A quick glance at Figure 11 shows that this is indeed the case: GrabCut, building on the initial foreground/background segmentation, can effectively smooth out imperfections in the initial estimate. To evaluate this quantitatively, we propose an experiment where the detection results are evaluated with the GrabCut module turned on and off for different values of the background cut-off threshold *cthr*. The results are summarized in Tables 1 and 2 and Figure 12.

**Figure 11.** Foreground/background segmentation results using only background subtraction (on the left) and background subtraction + GrabCut (on the right) for background cut-off threshold *cthr* = 600.


**Table 1.** Detection results with application of GrabCut.

**Table 2.** Detection results without application of GrabCut.


**Figure 12.** Comparison of detector accuracy with GrabCut turned on or off for different thresholds of background cut-off (pattern hat).

The results can be summarized twofold. Firstly, the compound algorithm (mixing background subtraction and GrabCut) offers better accuracy for all of the analyzed thresholds, especially for larger background cut-off thresholds. However, since the GrabCut result must always contain the area initialized by B-S, this need not hold for extremely small thresholds (where the initial area is already very large).

Secondly, the performance of the pure background subtractor strongly depends on the cut-off threshold, while GrabCut seems to loosen this dependence. This may make it possible (in a future version of the system) to remove the threshold *cthr* altogether and fix some reasonable default.

Finally, we decided to evaluate the performance of 3 methods: the pure background subtractor, the mixture of the background subtractor with GrabCut, and a successful contemporary tracker and mask generator [45]. The latter method relies on a Siamese Network (a popular approach in tracking applications), extended with the ability to generate binary foreground/background masks for the tracked object. In our experiments we used a network trained on the DAVIS dataset [46]. A comparison of the three methods is given in Table 3.

**Table 3.** Comparison of detection results using Background Subtraction + GrabCut (BS + GC), pure Background Subtraction (BS), and SiamMask [45] method for collecting training samples.


Although the results are not fully comparable (e.g., we had to approximate the more complex ROI returned by the SiamMask algorithm with a rectangular ROI for further utilization in training), we can observe that the performance of SiamMask is somewhat worse than that of the other two methods. Although the tracking process itself is fine, the generated mask does not correspond well to the ground truth data. This effect can easily be attributed to a natural bias of the method towards the specific classes of objects that it was trained on. If the method were trained on data from a domain similar to our test data, the results could be much better. This is an open area for further research.

#### 3.2.4. Synthetic Generalization of Training Data

In these experiments, different measures and intensities of synthetic sample augmentation were evaluated. The results are given in Figures 13 and 14. They show that moderate geometric as well as contrast and sharpness generalization provides the best results. However, the selection of appropriate parameters is object- and sequence-specific. It may be observed that near-flat surfaces (like the logo) benefit from aggressive geometric distortions (i.e., larger rotation angles). In addition, the reduction of sharpness proved to work best for computer-graphics-generated patterns.

**Figure 13.** Hat in '00012' detection results for different levels of geometric synthesis.

**Figure 14.** Hat in '00012' detection results for different contrast and sharpness synthesis levels.

#### 3.2.5. Selection of Training Data Size

The selection of an appropriate training dataset size is crucial for detection accuracy as well as time performance. This applies especially to the second stage of detection, since this is the layer that performs the final evaluation of samples. To tune our system in this respect, we performed an analysis using one pattern: hat. A HOG+SVM classifier was evaluated. The input data was augmented based on the data resulting from the foreground–background segmentation from Section 2.2.3 (*cthr* = 130). The evaluated input image sizes were 24 × 24 and 32 × 32. The tested numbers of training images were 300, 600, 1200, and 2400. In all cases, the SVM hyperparameters were trained using a 10-fold cross-validation procedure. The results obtained are given in Tables 4 and 5. A visual comparison of accuracies is given in Figure 15.

**Table 4.** Detector performance for 24 × 24 HOG features and different number of training samples per class.


**Table 5.** Detector performance for 32 × 32 HOG features and different number of training samples per class.


**Figure 15.** Comparison of detector performance using 24 × 24 and 32 × 32 HOG features for different sizes of the training set and hat pattern.

As can be deduced from the figure, the 24 × 24 HOG descriptor performs significantly better for smaller dataset sizes (like 300/300). This is quite natural, taking into account the difference in the size of the parameter space. For 1200/1200 samples and above, both resolutions offer similar performance. The accuracy (measured by AUC) saturates at about 1200/1200 samples, so for this pattern there is little justification for using larger training sets. An attractive choice of parameters seems to be the 24 × 24 HOG descriptor with 600/600 training samples. In this experiment, this solution offered almost top accuracy accompanied by very quick training (only 49 s) and detection (63 ms per frame). The 24 × 24 HOG descriptor with 300/300 samples may also look attractive, but it depends more on the actual data selection (high standard deviation of results across different runs), and its performance is expected to differ more between consecutive runs.

It should be noted that the number of required samples strongly depends on the complexity of the pattern and its uniqueness, so it is hard to establish this value in general. Therefore, this discussion should rather be treated as a proposal of sane default values for the dataset size parameter of the system.

#### 3.2.6. VGG16 as 2-nd Stage Classifier

In this part of the experiments, the VGG16 network was utilized as the last-stage classifier. After pre-detection by the cascade classifier, the results were fed to the VGG. The classifier, pre-trained on the ImageNet dataset, was then trained on the set of gathered training samples. The number of training samples was set to 2400 for each class. Four variants of the classifier were evaluated, accepting images of resolution 32 × 32, 64 × 64, 128 × 128, and 224 × 224. The selection of a particular image size induces a different number of parameters in the fully connected layers. The set of training images was divided into training and validation subsets in the proportion 5:1. Only the fully connected layers of the network were trained. The network was trained using Stochastic Gradient Descent for a maximum of 6 epochs (for the 224 × 224 resolution) or 10 epochs for smaller resolutions, or up to the moment when the validation error started to increase (the early-stopping principle). In practically all cases, the network was able to train up to 100% accuracy on the validation set.

The results obtained are given in Table 6. The overall performance of the detector using the VGG16 network can be regarded as acceptable, with an AUC value greater than 0.7. It should be noted that the AUC value is quite similar for the different network configurations, but the differences between them manifest in the *average precision–recall* measure. As could be expected, the full-width version, accepting 224 × 224 images, where the output of the convolutional part is 7 × 7 × 512 (as in the original paper), offers the best performance. The worst performance is obtained by the network version accepting 32 × 32 input, where the output of the convolutional part is 1 × 1 × 512. The performance hit is probably due to the loss of feature-location information (the output is only 1 pixel wide). In such a configuration, the network is able to take into account the *presence* of features but not their position.

The full-width version of the network additionally has a prohibitive computational complexity of detection (even on a good GPU), consuming almost 1.5 s per frame (compare to the HOG+SVM classifier, also given in Table 6). The overall accuracy of VGG16 turns out to be worse than that of HOG+SVM. The VGG network suffers from a hidden overfitting problem (not manifesting itself in the validation set error) due to incomplete training data. On the other hand, it seems that the HOG+SVM classifier benefits more from the fully generic HOG feature extractor. The VGG16 network should perform better after some retraining of the specialized convolutional layers using enough quality data; our unreported experiments show that the training data available in our problem is not sufficient for this purpose.

**Table 6.** Performance of the VGG16 network as a second-stage classifier. The last column contains HOG+SVM results for reference.


#### 3.2.7. Application of Faster-RCNN to Preprocessed Data

As the final experiment involving neural networks, we evaluated the Faster-RCNN [29] detector on our test sequence '00012' using as the training input the data collected and preprocessed using methods described in this paper. In order to prepare data acceptable by RCNN, augmented training images for hat pattern and their masks were superimposed on the negative training images in a regular grid-like fashion using masks for effective blending. The ROIs of input images were defined accordingly.
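The grid-style superimposition described above can be sketched as follows; this is a hypothetical NumPy helper with an illustrative blending scheme and signature, not the authors' code:

```python
import numpy as np

def superimpose_grid(negative, patterns, masks, step):
    """Blend pattern crops onto a negative image in a regular grid.

    negative : HxWx3 uint8 background image
    patterns : list of hxwx3 uint8 pattern crops
    masks    : list of hxw float masks in [0, 1] used for blending
    step     : grid spacing in pixels
    Returns the composited image and the list of ROI boxes (x, y, w, h).
    """
    out = negative.astype(np.float32).copy()
    rois, idx = [], 0
    H, W = negative.shape[:2]
    for y in range(0, H, step):
        for x in range(0, W, step):
            pat = patterns[idx % len(patterns)]
            msk = masks[idx % len(masks)][..., None]  # broadcast over RGB
            h, w = pat.shape[:2]
            if y + h > H or x + w > W:
                continue  # pattern would not fit at this grid position
            region = out[y:y + h, x:x + w]
            out[y:y + h, x:x + w] = msk * pat + (1.0 - msk) * region
            rois.append((x, y, w, h))
            idx += 1
    return out.astype(np.uint8), rois
```

The returned ROI list corresponds to the ROI definitions mentioned above for the RCNN training input.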

The original Faster-RCNN network was then re-trained with respect to the new data for 3000 training loops (which took 15 minutes in the Google Colab [47] environment). Then, the detection was performed, which took about 2 s per frame.

The Faster-RCNN network was able to achieve AUC = 0.67 and AVGPR = 0.56. These results are comparable to those obtained using the VGG16 classifier and worse than those of the HOG+SVM pair.

#### 3.2.8. Detection of Various Patterns

In the last of our preliminary experiments, we evaluated how the detector handles different types of patterns. The logo pattern was trained on a single training example with no mask, the shirt pattern on a sequence of 30 samples without a mask, and the helmet pattern on 41 samples, also without a mask, using HOG+SVM as the second-stage classifier. The results are given in Figure 16.

The relatively worse performance for the shirt pattern is mainly due to numerous occlusions. Even for the 'shirt' pattern, about 90% of hits are successful at a recall rate of 0.3. For the best patterns, such as the helmet, about 50% of the positive examples are detected with still zero false positives.

In the course of the experiments, it was observed that motion blur (inherent or originating from de-interlacing) is the most destructive type of noise for both the training and detection phases. In addition, due to the quite severe subsampling of the pattern (down to 24 × 24), the detector may have problems distinguishing between patterns differing only in small details. On the other hand, thanks to this property, the detector should also handle well small patterns, only slightly bigger than the nominal 24 × 24 pattern size.

**Figure 16.** Receiver Operator Characteristic (ROC) and Precision-Recall (PR) curves of hat, logo, helmet, and shirt detections in '00012' sequence.

#### *3.3. Large-Scale Experiments*

Tests of the presented algorithm were conducted on a dataset containing 11 recordings, with nearly 30 thousand frames in total, in full HD resolution. Three patterns were created (Figure 17), and all sequences were carefully labeled by hand to create ground-truth data. All patterns were created from a single frame (one positive sample). As training data, a high-quality still picture was used, with its resolution scaled down to full HD.

**Figure 17.** Three tested t-shirt logo patterns: (**a**) pattern P1, (**b**) pattern P2, (**c**) pattern P3.

Results of the experiments (ROC curve) for the selected pattern P1 are presented in Figure 18a. The EER is similar for all patterns P1–P3, equaling 25.3%, 28.3%, and 28.0%, respectively. The accumulated EER equals 27.4%. The obtained results resemble those from the smaller dataset. Even though the training sample and query images were obtained from different devices and had different quality, the algorithm gave satisfactory results.
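For reference, the EER (equal error rate) reported here is the operating point at which the false positive rate equals the false negative rate. A minimal sketch of locating it on the discrete threshold grid of a score list (illustrative only, not the authors' evaluation code):

```python
def equal_error_rate(scores, labels):
    """Find the operating point where FPR is closest to FNR.

    scores : per-sample detector responses (higher = more positive)
    labels : 1 for positive samples, 0 for negative samples
    Returns the EER approximated as (FPR + FNR) / 2 at that point.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    best_gap, eer = float("inf"), 1.0
    for i in order:  # sweep the threshold from high to low
        tp += labels[i]
        fp += 1 - labels[i]
        fpr, fnr = fp / N, 1 - tp / P
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer
```

A perfectly separating detector yields an EER of 0; the 25–28% values above indicate the overlap between positive and negative responses on this dataset.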

**Figure 18.** (**a**) ROC curve for the pattern P1. (**b**) Accumulated ROC curve for 5-elements sequence analysis. (**c**) ROC curve for cereal\_1 object in desk\_3 sequence.

The final addition to the testing scenario was the utilization of short sequences. For every short time window, only the result with the best response was taken as the final detection and passed to further processing. Accumulated results for windows of length 5 are presented in Figure 18b. The EER values are, respectively, 27.4%, 15.1%, and 14.3%. It was observed that the longer the window, the smaller the quality gain.
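The best-response selection over short windows can be sketched as follows; the helper name and the (frame, score, detection) tuple format are illustrative assumptions:

```python
def best_per_window(responses, window=5):
    """Keep only the strongest detector response in each time window.

    responses : iterable of (frame_index, score, detection) tuples
    window    : window length in frames
    Returns one (frame_index, score, detection) per non-empty window,
    in temporal order.
    """
    buckets = {}
    for frame, score, det in responses:
        key = frame // window  # which window this frame belongs to
        if key not in buckets or score > buckets[key][1]:
            buckets[key] = (frame, score, det)
    return [buckets[k] for k in sorted(buckets)]
```

Only these per-window winners would then be passed to further processing, suppressing weaker duplicate responses within each window.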

From the point of view of a surveillance system operator, the system does not need to correctly annotate each and every frame in the sequence. In reality, what the operator expects is at least a single detection per continuous segment of appearances of the object. If the system can issue an alert that the object is present in a single frame of such a sub-sequence, the operator can perform further investigation manually.

To evaluate this, experiments were performed concerning the detection of two selected patterns in seven different sequences. The detection was evaluated on a per-segment rather than a per-frame basis. The ground truth (GT) for the movie sequence was composed of positive and negative segments: a positive segment was a continuous run of positive samples, and a negative segment a continuous run of negative samples. Similar positive and negative segments were extracted from the detection result by applying hysteresis thresholds to the detector response. If a positively evaluated segment overlapped a true segment, it was counted as a Hit (H); otherwise, it was counted as a Miss (M). If a negatively evaluated segment overlapped a true positive segment, it was counted as a Skip (S).
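The segment extraction and Hit/Miss counting described above can be sketched as follows; this is an illustrative reimplementation under assumed conventions (inclusive frame indices, one score per frame), not the authors' evaluation code:

```python
def hysteresis_segments(scores, hi, lo):
    """Extract positive segments from a per-frame detector response using
    hysteresis thresholds: a segment starts where the score exceeds `hi`
    (extended backwards while it stays above `lo`) and ends when the
    score drops below `lo`. Segments are (first_frame, last_frame)."""
    segs, start = [], None
    for i, s in enumerate(scores):
        if start is None:
            if s >= hi:
                start = i
                while start > 0 and scores[start - 1] >= lo:
                    start -= 1
        elif s < lo:
            segs.append((start, i - 1))
            start = None
    if start is not None:
        segs.append((start, len(scores) - 1))
    return segs

def count_hits_misses(gt_segments, det_segments):
    """A GT segment is a Hit if any detected segment overlaps it;
    a detected segment overlapping no GT segment is a Miss."""
    overlaps = lambda a, b: a[0] <= b[1] and b[0] <= a[1]
    hits = sum(1 for g in gt_segments if any(overlaps(g, d) for d in det_segments))
    misses = sum(1 for d in det_segments if not any(overlaps(d, g) for g in gt_segments))
    return hits, misses
```

A Skip would analogously be a GT positive segment overlapped only by negatively evaluated segments, i.e., a GT segment with no Hit at all.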

Results are presented in Table 7. Fragmentation describes the ratio of Hits (multiple hits within a single segment are possible) to the number of segments in the ground-truth data (GT). For each sequence, we had information about the total number of times the target appears (the number of continuous time segments, GT).


**Table 7.** Continuous sequence detection results.

The most important conclusion here is that every continuous segment with the query pattern was hit at least once. The total fragmentation of 1.53 means that the operator will, on average, be notified only slightly more often than expected. The number of Misses, although not very small (0.42 of expected detections), is balanced by the absence of Skips, which is the most important factor in surveillance systems.

Additional experiments were conducted using one of the publicly available datasets. As our system relies on masks, and one of its key concepts is to model a 3D object based on a small number of 2D views, we decided to use the Washington RGB-D dataset [48]. This dataset contains training sequences with a single labeled object (with masks) and testing sequences with complex scenes containing multiple objects. In the experiments, the operator used the training sequence to pick the object of interest, and our system continued with the creation of the masks from the provided short sequence. The masks available in the dataset were used to measure the accuracy impact of automatic mask generation. After creating the model from a few images, the object was searched for in the complex scenes. Figure 18c presents sample results obtained for the cereal\_1 object in the desk\_3 sequence. In this case, the model was created using only 7 views of the object, without using GrabCut for mask creation.

#### **4. Conclusions**

In this paper, we presented a solution that can support the work of surveillance system operators. The system proved to positively address difficult task requirements concerning small training data sets, quick learning, and fast and reliable detection. An attractive trade-off between training/detection speed and recognition rate was obtained by the application of a 2-layer cascade/SVM classifier. Additionally, a combination of a cascade classifier and a CNN was evaluated. The proposed system can learn from a single training sample, but it can also collect samples from short image sequences with only minor user supervision in order to obtain rich training data. The addition of background subtraction and GrabCut for pattern generation made the process even more reliable. The performance of the system varies depending on the type and quality of the training/test data, but we argue that, on average, the results are satisfactory, and even the weaker results provide sufficient information to be useful in practical surveillance scenarios.

**Author Contributions:** Conceptualization, A.W. and W.K.; methodology, A.W. and M.S.; software, A.W., M.S., and W.K.; validation, A.W., W.K., and M.S.; formal analysis, A.W. and W.K.; investigation, A.W., M.S., and W.K.; resources, W.K.; data curation, M.S. and A.W.; writing—original draft preparation, A.W. and M.S.; writing—review and editing, A.W., M.S., and W.K.; visualization, A.W. and M.S.; supervision, W.K.; project administration, W.K.; funding acquisition, W.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** Work done as part of the CYBERSECIDENT/369195/I/NCBR/2017 project supported by the National Centre of Research and Development in the frame of CyberSecIdent Programme.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Active Player Detection in Handball Scenes Based on Activity Measures**

#### **Miran Pobar and Marina Ivasic-Kos \***

Department of Informatics, University of Rijeka, Rijeka 51000, Croatia; mpobar@uniri.hr **\*** Correspondence: marinai@uniri.hr; Tel.: +385-51-584-700

Received: 31 January 2020; Accepted: 5 March 2020; Published: 8 March 2020

**Abstract:** In team sports training scenes, it is common to have many players on the court, each with their own ball, performing different actions. Our goal is to detect all players on the handball court and determine the most active player, the one performing the given handball technique. This is a very challenging task, which requires, apart from an accurate object detector able to deal with complex cluttered scenes, additional information to determine the active player. We propose an active player detection method that combines the Yolo object detector, activity measures, and tracking methods to detect and track active players in time. Different ways of computing player activity were considered, and three activity measures are proposed, based on optical flow, spatiotemporal interest points, and convolutional neural networks. For tracking, we consider the use of the Hungarian assignment algorithm and the more complex Deep SORT tracker, which uses additional visual appearance features to assist the assignment process. We propose an evaluation measure to assess the performance of the proposed active player detection method. The method was successfully tested on a custom handball video dataset acquired in the wild and on basketball video sequences. The results are discussed and some typical cases and issues are shown.

**Keywords:** object detector; object tracking; activity measure; Yolo; deep sort; Hungarian algorithm; optical flows; spatiotemporal interest points; sports scene

#### **1. Introduction**

Many applications of computer vision, such as action recognition, security, surveillance, or image retrieval, depend on object detection in images or videos. The task of object detection is to find the instances of real-world objects, such as people, cars, or faces. Detecting an object includes determining the location of that object and predicting the class to which it belongs. Object location and object classification problems both pose individual challenges, and the choice of the right object detection method can depend on the problem that needs to be solved.

In our case, object detection, or specifically, leading player detection, can be very helpful for the recognition of actions in scenes from the handball domain. Handball is a team sport played indoors on a handball court. Two teams, each consisting of a goalkeeper and six players, participate in the game. The goal of the game is to score more points than the opposing team by throwing the ball into the opposing team's goal, or by defending one's own goal so that the opposing team does not score. The game is very fast and dynamic, with each team trying to arrange actions to defeat the defense and score a goal, or to prevent the attack and defend the goal. All of the players can move all over the court, except in the 6 m area in front of each goal, which is reserved for the goalkeepers. The players move fast, change positions, and combine different techniques and handball actions depending on their current role, which may be attack or defense. A player can shoot the ball toward the goal, dribble it, or pass it to a teammate during the attack, or block the ball or a player when playing defense.

Different techniques and actions that players perform, as well as the frequent changes of position and occlusions, cause the shapes and appearances of players to change significantly during the game, which makes the detection and tracking process more challenging. The indoor court environment with relatively reflective floors is often illuminated with harsh lighting, so shadows and reflections of players are usually present in the recordings; due to the speed of action performance, motion blur also occurs, which further complicates the tracking and detection problems. Additionally, the positions of the players and their distance to the fixed-mounted camera change constantly, so players can be close to the camera, covering most of the image, or at a distance from the camera, occupying just a few pixels. The ball is also an important object for recognizing a handball scene, but it is even harder to detect and track than the players, because it is smaller, moves quickly in different directions, and is often covered by a player's body or not visible because it is in a player's hands [1].

Handball rules are well-defined and prescribe the permitted actions and techniques, but during training, when adopting and practicing different techniques or game elements, the coach changes the rules to maintain a high level of player activity and a lot of repetition. Players then perform several sets of exercises in a row to train and improve certain techniques or actions, and to learn to position themselves properly in situations that imitate real game situations. In training sessions, each player also usually has his own ball, to reduce the waiting time to practice techniques and to train handball skills as quickly as possible. For example, a shooting technique involves a player that shoots the ball towards the goal and the goalkeeper that moves to defend the goal. Other players wait their turn to perform this activity, or gather their balls on the playground and then run to their position in the queue, as shown in Figure 1.

**Figure 1.** A typical training situation. The goalkeeper and the player in the center are performing the current task, while the rest are waiting, returning to queue, collecting the ball, etc.

Although all of the players move and interact at any given moment, not all of them contribute to the action that is most relevant for the exercise or for the interpretation of a handball situation, such as passing the ball or shooting at the goal. Thus, the focus should be on the players responsible for the currently most relevant handball action, whose label can then be used to mark the whole relevant part of the video. Here, these players are referred to as active players.

The goal of the proposed method is to determine which of the players present in the video sequences are active at a certain time of the game or training session, and which of them performs the requested action at a given moment.

Detecting the player who performs an action of interest is particularly demanding during training sessions: there are more players on the field than allowed during a normal game; there is more than one ball, since each player has their own; and certain actions take place in parallel so that the players get the most out of the time to adopt or perfect the technique. The given goal is too complex for simple people-segmentation methods, such as background subtraction or chroma keying, so, for this purpose, we suggest an active player detection method that combines player detection and tracking with an activity measure to determine the leading player(s).

For player detection, we suggest the use of deep convolutional neural networks (CNNs), which have proven successful in classification and detection tasks on real-world images. In this paper, we use YOLOv3 [2], which achieves high accuracy when detecting objects in images.

We propose three activity measures to determine the level of activity of a particular player. The first, named DT+STIP, is based on the density of spatiotemporal interest points (STIPs) detected within the area of a player; the second, DT+OF, is computationally simpler and based on the optical flow (OF) motion field between consecutive video frames; and the third, DT+Y, uses a convolutional neural network (CNN) to determine player activity and classify active and inactive players.

We consider the use of the Hungarian assignment algorithm based on the bounding box positions of detected players, and the more complex Deep SORT tracker [3], which uses additional visual appearance features to assist the assignment process, to track the detected players along the time dimension.

We have defined an evaluation measure to evaluate the performance of the proposed active player detection method. The proposed method is tested on videos taken in the gym during several training sessions, in a real, uncontrolled indoor environment with cluttered background, complex light conditions, and shadows, and with multiple players (12 on average) who practice and repeat different handball actions, dynamically change positions, and occlude each other.

The rest of the paper is organized as follows: in Section 2, we briefly review action localization, sports player tracking, and salient object detection. In Section 3, the proposed method, which combines the Yolo object detector, three proposed activity measures, and tracking methods to determine the most active player in sports scenes, is described. We have defined the evaluation measure and applied the proposed method on a custom dataset that consists of handball scenes recorded during handball training lessons. The performance evaluation of the proposed active player detection method and the comparison of different method setups, with respect to the three proposed activity measures and two player tracking methods, are given and discussed in Section 4. The paper ends with a conclusion and proposals for future research.

#### **2. Related Work**

The goal of detecting the active or leading player in a sports video is to find the sequence of bounding boxes that correspond to the player currently performing the most relevant action. The task can be separated into player detection and player tracking, which produces sequences of bounding boxes that correspond to individual players, combined with saliency detection to determine which of the sequences belong to the active player. Alternatively, the first part of the problem can be viewed as a task of action localization in general, where the goal is to find the sequences of bounding boxes in videos that likely contain actions, not necessarily by explicitly trying to detect players, but possibly using other cues, such as only the foreground motion.

Visual object tracking, including player tracking, is a very active research area that attracts dozens of papers to computer vision conferences annually [4], and numerous approaches have been developed for both the problem of multiple object tracking in general [5] and specifically for player tracking in sports videos. For the sports domain, player tracking is usually considered together with detection, e.g. in [6] for the case of hockey, [7] for handball, [8,9] for indoor, and [10–13] for outdoor soccer. These methods commonly attempt to exploit the specific knowledge regarding the domain of the given sport and about the video conditions, such as the distribution of the colors of the playing field or players to isolate potential areas that contain players [6–8,11–13], or the layout of the playing field to recover depth information [8–11]. Player detection ranges from template matching with handcrafted features [7] to machine learning approaches using e.g., a SVM classifier [13] or Adaboost [12], often with particle filter based tracking [6,9,13].

Recently, deep learning-based approaches to player detection, e.g., [14], are becoming increasingly attractive due to the increased detection performance as well as the reduced need for domain-specific knowledge. With the increased performance of convolutional neural network-based object detectors, relatively simple tracking-by-detection schemes that use the Hungarian algorithm to assign the detected bounding boxes to tracks based only on box dimensions have achieved good performance in the multiple object tracking task [15], such as tracking the leading player [16]. This work uses a similar approach.

Determining who is the leading player among the detected players is related to the problem of salient object detection, where the aim is to detect and segment the object in an image that visually stands out from its surroundings and would be in the center of attention in the human visual system. In addition to finding naturally (visually) distinct regions (bottom-up saliency), the criteria for saliency may be based on a specific task, expectations, prior knowledge, etc. (top-down saliency) [17]. In this sense, finding the active player among the detected players can be regarded as a top-down saliency task. Different cues have been used for saliency detection, such as those based on intensity or color contrast, informed by the sensitivity of human perception to color, or those based on location information, e.g., treating the center region of an image as more likely to contain salient objects [18]. With the popularization of deep learning models for object detection, similar models are also increasingly used for salient object detection [19]. In the video context, motion-based saliency cues can be used, e.g., based on directional masks applied to the image [18] or on optical flow estimation [20,21]. In this work, motion cues based on optical flow and STIPs are used to inform the saliency of a player. A survey of salient object detection work can be found in [17,19].

In the area of action localization, the goal is to find the sequence of bounding boxes in a sequence of video frames that correspond to an action. This is commonly performed by generating a set of candidate spatiotemporal regions and then classifying each candidate region into an action class or the non-action class. Jain et al. [22] first segment a video sequence into supervoxels, and then use a notion of independent motion evidence to find the supervoxels where motion deviates from the background motion, which signals the potential area of interest where an action is likely to be detected. Similarly, in [23], a motion saliency measure based on optical flow is used along with image region proposals based on selective search to select candidate spatiotemporal regions in videos with actions and to disregard areas with little movement where no action is likely. Kläser et al. [24] explicitly split the action localization task into first detecting persons in video, using a HOG [25] descriptor and a sliding-window support vector machine classifier, and then tracking the detected persons.

In [26], basketball players are detected and tracked in video for the task of event detection, and the most relevant player is implicitly learned from the correspondence of video frames to the event label in an LSTM model trained to classify these events.

#### **3. Proposed Method for Active Player Detection**

The goal of the proposed method is to automatically determine the player or players in the scene that are responsible for the current action, among all of the players present in the scene. Figure 2 presents the overview of the proposed method for active player detection.

**Figure 2.** An overview of the active player track detection. Player detector provides bounding boxes for each input video frame (**A**). Player activity measure is computed for each detected player bounding box in each frame (**B**) and players are tracked across frames (**C**). Combining the track information from (C) and activity measure from (B), final decision about the active player track is made (**D**).

First, the players are detected in the input video using a CNN-based object detector. Here, the YOLOv3 detector pre-trained on the MS-COCO dataset was used to mark the players on each frame of the video sequence with their bounding boxes (Figure 2A), using the detector's general person class. In this case, the object detector does not distinguish between active and inactive players, i.e., those who perform the action of interest and those who do something else. The activity measures for each detected player in a frame are computed based on motion features extracted within each player's bounding box in that video frame (Figure 2B). To determine a player's overall activity in a time period, it is necessary to know their activity measure in each frame, but also to uniquely identify that player throughout the period of interest, i.e., to track them through the sequence. Considering that the object detector only determines the class of a detected object (e.g., person), without distinguishing between objects of the same class, players' identities are tracked between frames outside the detection step (Figure 2C). Finally, the information about player activity in each frame and the track information are joined to determine the track of the most active, or leading, player in the whole video (Figure 2D).

Each of the steps is described in more detail below.

The Hungarian assignment algorithm was used to track the players, with a distance function based on properties of bounding boxes as the baseline method. As an alternative, the Deep SORT [3] method, which uses additional visual features to aid the tracking, is considered. Since Deep SORT gave better tracking results, only this method was used when testing the activity measures.

Different ways of computing player activity were considered and three activity measures are proposed. The first measure, named DT+OF, is based on the optical flow calculated within a player's bounding box and considers the velocity component of the motion, while the second measure, named DT+STIP, uses the density of the spatiotemporal interest points (STIPs) within the bounding boxes to determine the activity measure. The third method, called DT+Y, uses the YOLOv3 network to classify a player into an active or inactive class.
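The DT+OF measure can be sketched as follows, assuming a dense flow field has already been computed by some optical flow estimator (e.g., Farneback's method); the helper name and box format are illustrative assumptions:

```python
import numpy as np

def flow_activity(flow, box):
    """Mean optical-flow magnitude inside a player's bounding box.

    flow : HxWx2 array of per-pixel (dx, dy) displacements between
           two consecutive frames (from a dense optical flow estimator)
    box  : (x, y, w, h) bounding box of the detected player
    Returns a scalar activity measure: the mean motion magnitude.
    """
    x, y, w, h = box
    patch = flow[y:y + h, x:x + w]
    # per-pixel magnitude sqrt(dx^2 + dy^2), averaged over the box
    return float(np.linalg.norm(patch, axis=2).mean())
```

Comparing this scalar across all detected players in a frame gives a ranking from which the most active player can be picked.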

The input videos were recorded during handball training sessions involving 15 or more players. Each video sequence was selected and trimmed so that it contains all the steps of the given action.

#### *3.1. Player Detection*

The goal of object detectors is to locate and classify objects on the scene. The detected objects are typically marked with a bounding box and labeled with corresponding class labels.

The current focus in object detection is on convolutional neural networks (CNNs), which were at first used for image classification but have later been extended to both detect and localize individual objects in the scene. This can be achieved by independently processing image parts isolated using a sliding window moved over the image [27]. Multiple window scales are used in order to take into account different sizes of objects in the image. If the classifier recognizes the contents of the window as an object, the window is used as the bounding box of the object and labeled with the corresponding class label. After processing the entire image, the result is a set of bounding boxes and corresponding class labels. As a certain object can be partially or fully contained within multiple sliding windows, a large number of duplicate predictions can be generated, which are then rejected using non-maximum suppression and a confidence threshold. Since the sliding window approach essentially performs an image classification for each window position, a naive implementation can be much slower and more computationally expensive compared to simple image classification.
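The duplicate-rejection step mentioned above is typically implemented as greedy non-maximum suppression; a minimal sketch (boxes given as (x1, y1, x2, y2) corners, names illustrative):

```python
def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    any remaining box whose IoU with it exceeds `iou_thresh`.

    boxes  : list of (x1, y1, x2, y2) tuples
    scores : matching list of confidence scores
    Returns indices of the kept boxes, in descending score order.
    """
    def iou(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        ix1, iy1 = max(ax1, bx1), max(ay1, by1)
        ix2, iy2 = min(ax2, bx2), min(ay2, by2)
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((ax2 - ax1) * (ay2 - ay1)
                 + (bx2 - bx1) * (by2 - by1) - inter)
        return inter / union if union else 0.0

    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep
```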

Here, the role of the object detector is to detect the players in each frame of the handball videos. Ideally, the detector should be as precise as possible, yet computationally light enough to detect objects in real time. It has previously been shown [28,29] that Mask R-CNN [30] and Yolo have comparable player detection performance in handball scenes but, since Yolo was significantly faster and was successful for player detection in previous work [31], it is used in this experiment. The YOLOv3 detector was used here with parameters pre-trained on the COCO dataset and the standard Darknet-53 backbone configuration, with no additional training.

YOLOv3 is the third iteration of the YOLO object detector [32], which performs a single pass through a neural network for both detecting the potential regions in the image where certain objects are present, and for classifying those regions into object classes. The object detection task is framed as a problem of regression from image pixels to objects' bounding box coordinates and associated class probabilities.

The YOLOv3 detector uses a convolutional network consisting of 53 convolutional layers of 3 × 3 and 1 × 1 filters, with shortcut connections between layers (residual blocks), for feature extraction. The last convolutional layer of the network predicts the bounding boxes, the confidence scores, and the predicted class. YOLOv3 predicts candidate bounding boxes at three different scales, using a structure similar to feature pyramid networks; three sets of boxes are predicted at each feature-map cell for each scale, to improve the detection of objects that appear at different sizes.

For the task of active player detection, only the bounding boxes and confidence values for objects of the class "person" were used. An experimentally determined confidence threshold of 0.55 was used to filter out unwanted detections and obtain a good balance of high detection and low false-positive rates. Figure 3 shows an example of detection results for the "person" class.

**Figure 3.** Bounding boxes with confidence values as results of person detection with YOLOv3.

#### *3.2. Player Tracking*

The detections obtained with the Yolo detector are computed independently in each processed frame and thus contain no information about which bounding boxes correspond to the same objects in consecutive frames. Finding this correspondence between bounding boxes and player identities across frames is required to obtain the player trajectories, so an additional post-processing step for player tracking is used. Two methods are considered for this task. The first is a simpler approach utilizing the Hungarian algorithm, which assigns player bounding boxes detected in different frames to player tracks while taking the dimensions and positions of the bounding boxes into account. The second approach uses the Deep SORT method [3], which relies on additional visual features to determine which detection in a frame matches a previously detected player.

#### 3.2.1. Tracking Using the Hungarian Algorithm

In the first video frame, the player tracks are initialized so that a track ID is assigned to each detected player bounding box. In subsequent frames, the individual detected bounding boxes are assigned to previously initialized tracks using Munkres' version of the Hungarian algorithm [33], whose objective is to minimize the total cost of assigning detections to tracks. Assigning a bounding box to a track has a cost that depends on the scale difference and the relative position of the candidate bounding box and the box already assigned to the track in the previous frame, taking simplified visual and spatial distances into account.


More precisely, a new bounding box $B_b$ will be assigned to the player track $T_{b-1}$ if it has the minimal cost, computed as a linear combination of the Euclidean distance between the centroid $C_{b-1}$ of the last bounding box assigned to the track $T_{b-1}$ and the detected centroid $C_b$, and the absolute difference between the detected bounding box area $P_b$ and the area $P_{b-1}$ of the last bounding box assigned to the track $T_{b-1}$ (1):

$$B_b(C_b, P_b):\ b = \underset{i \in F}{\operatorname{argmin}} \left( w\, d_2(C_{b-1}, C_i) + (1 - w)\left|P_i - P_{b-1}\right| \right);\quad w \in [0, 1];\ d_2(C_{b-1}, C_i) < T, \tag{1}$$

where the bounding box $B_b(C_b, P_b)$ is represented by its centroid $C_b$ and area $P_b$, $T \in \mathbb{R}$ is a distance threshold between centroids in consecutive frames, $w$ is an adjustable parameter that determines the relative influence of the displacement and the change of bounding box area in consecutive frames, and $F$ is the set of all bounding boxes within the current frame.

The number of tracks can also change throughout the video, since players can enter or exit the camera field of view at any time. Additionally, the detection of players is not perfect, so some tracks should resume after a period in which no detection was assigned. The distance threshold $T$ controls the maximum allowed distance between a detected bounding box and the last bounding box assigned to a track. Any box whose distance to a track is greater than this threshold cannot be assigned to that track, even if it is the closest one. If no detections are assigned to a track for $M \in \mathbb{N}$ consecutive frames, the track is considered completed and no further detections can be added to it. The values of $M$ and $T$ were experimentally determined and set to 20 and 100, respectively.
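The assignment step from Equation (1) can be sketched with SciPy's Hungarian-algorithm implementation. This is a minimal sketch, not the authors' code: tracks and detections are reduced to `(centroid, area)` pairs, and infeasible assignments (centroid distance above $T$) are blocked with a large cost; the function and variable names are illustrative.

```python
# Hungarian assignment of detections to tracks per Equation (1),
# using scipy.optimize.linear_sum_assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_detections_to_tracks(tracks, detections, w=0.5, T=100.0):
    """tracks/detections: lists of ((x, y) centroid, area).
    Returns (track_idx, det_idx) pairs; detections farther than T from
    every track stay unassigned (they would start new tracks)."""
    BIG = 1e9  # forbids assignments that violate the distance threshold T
    cost = np.full((len(tracks), len(detections)), BIG)
    for i, (c_t, p_t) in enumerate(tracks):
        for j, (c_d, p_d) in enumerate(detections):
            d = np.hypot(c_t[0] - c_d[0], c_t[1] - c_d[1])
            if d < T:
                # linear combination of centroid distance and area change
                cost[i, j] = w * d + (1 - w) * abs(p_d - p_t)
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < BIG]

tracks = [((10.0, 10.0), 200.0), ((300.0, 40.0), 180.0)]
dets   = [((305.0, 42.0), 185.0), ((12.0, 11.0), 210.0), ((900.0, 900.0), 50.0)]
matches = assign_detections_to_tracks(tracks, dets)
```

Here the far-away third detection is left unassigned, so in the full method it would either start a new track or be discarded.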

Figure 4 shows an example of a sequence of frames from a video, with player bounding boxes additionally associated with track IDs across five different frames, such that the same player has the same ID in different sequential frames. Note that the player with track ID 6 is not detected in several frames, but the tracking continues with the correct ID after the player is detected again in the last shown frame.

**Figure 4.** Player tracking across frames.

#### 3.2.2. Tracking Algorithm—DeepSORT

DeepSORT [3] is a tracking algorithm based on the Hungarian algorithm that, in addition to the parameters of the detected bounding boxes, also considers appearance information about the tracked objects when associating new detections with previously tracked objects. The appearance information should be useful, in particular, for re-identifying players that were occluded or temporarily left the scene. As with the basic application of the Hungarian algorithm, it can perform the tracking online, i.e., it does not need to process the whole video at once, but only needs to consider information about the current and previous frames to assign detections to tracks.

As in the previous case, in the first frame a unique track ID is assigned to each bounding box that represents a player and has a confidence value higher than a set threshold, and the Hungarian algorithm is used to assign new detections to existing tracks so that the assignment cost function reaches the global minimum. The cost function combines the spatial (Mahalanobis) distance $d^{(1)}$ of a detected bounding box from the position predicted from its previously known position, and a visual distance $d^{(2)}$ that compares the appearance of a detected object with a history of appearances of the tracked objects. The cost of associating a detected bounding box $B_b$ with a player track $T_{b-1}$ that ends with the bounding box $B_{b-1}$ is given by the expression:

$$c_{b,b-1} = \lambda d^{(1)}(B_{b-1}, B_b) + (1 - \lambda)\, d^{(2)}(B_{b-1}, B_b), \tag{2}$$

where $\lambda$ is a settable parameter that determines the relative influence of the spatial and visual distances $d^{(1)}$ and $d^{(2)}$.

The distance $d^{(1)}$ is given by the expression:

$$d^{(1)}(B_{b-1}, B_b) = \left(d_b - y_{b-1}\right)^T S_{b-1}^{-1} \left(d_b - y_{b-1}\right), \tag{3}$$

where $y_{b-1}$ and $S_{b-1}$ are the mean and the covariance matrix of the last bounding box observation assigned to player track $T_{b-1}$, and $d_b$ is the detected bounding box.

The visual distance $d^{(2)}$ is given by the expression:

$$d^{(2)}(B_{b-1}, B_b) = \min\left\{ 1 - r_b^T r_k^{(b-1)} \,\middle|\, r_k^{(b-1)} \in \mathcal{T}_{b-1} \right\}, \tag{4}$$

where $r_b$ is the appearance descriptor obtained from the part of the image within the detected bounding box $B_b$, and $\mathcal{T}_{b-1}$ is the set of the last 100 appearance descriptors $r_k^{(b-1)}$ associated with the track $T_{b-1}$.

The goal of the $d^{(2)}$ measure is to select the track in which the detection visually most similar to the current detection was previously found. The similarity is computed via the cosine distance between the appearance descriptor of the detected bounding box $B_b$ and those of the track $T_{b-1}$. The 128-element descriptors are extracted using a wide residual neural network with two convolutional layers and six residual blocks that was pre-trained on a person re-identification dataset of more than one million images of 1261 pedestrians. The appearance descriptor vectors are normalized to fit within a unit hypersphere so that the cosine distance can be used [3].
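The visual distance of Equation (4) can be sketched as the minimum cosine distance between a new detection's descriptor and a gallery of the track's previous descriptors. This is a toy sketch: the two-dimensional vectors stand in for the real 128-element CNN descriptors, and the function name is illustrative.

```python
# Sketch of the visual distance d(2) from Equation (4): minimum cosine
# distance against the track's descriptor gallery.
import numpy as np

def visual_distance(r_b, gallery):
    """r_b: appearance descriptor of the new detection.
    gallery: 2-D array of descriptors previously assigned to the track.
    All descriptors are L2-normalized so 1 - dot() is the cosine distance."""
    r_b = r_b / np.linalg.norm(r_b)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return float(np.min(1.0 - gallery @ r_b))

gallery = np.array([[1.0, 0.0], [0.8, 0.6]])   # past descriptors of one track
same_player = np.array([0.9, 0.1])             # visually similar detection
other_player = np.array([0.0, 1.0])            # visually dissimilar detection
d_same = visual_distance(same_player, gallery)
d_other = visual_distance(other_player, gallery)
```

A visually similar detection yields a much smaller distance, so the Hungarian assignment in Equation (2) favors keeping it on the same track.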

When a detection cannot be assigned to any track, because it is too far from every track according to the distance $d^{(1)}$ or not visually similar enough to any previous detection according to the distance $d^{(2)}$, a new track is created. The maximum allowed $d^{(1)}$ and $d^{(2)}$ distances for which an assignment is still possible are settable parameters. A new track is also created whenever there are more detected players in a frame than there are existing tracks. A track is abandoned if no assignment has been made to it for $M \in \mathbb{N}$ consecutive frames, and a new track ID will be generated if the same object re-appears later in the video.

#### *3.3. Activity Detection*

The object detector provides the locations of all players present in the scene as bounding boxes, but it carries no information about the movements or activities of the players. Some information about movement can be obtained by first tracking the players and then calculating the shift of the bounding box centroid across frames. However, various actions in sports videos are expected to be characterized by strong and sudden changes in both velocity and appearance that cannot be described by the shift of the bounding box centroid alone. To better capture information about the activity of the players, three activity measures are proposed (DT+OF, DT+STIP, and DT+Y), as noted earlier.

Both the optical flow estimation method and STIP detection by themselves only consider sections of video frames, without taking object identities into account. Thus, the information about player locations obtained with Yolo is combined with the activity measure obtained from either the optical flow or the STIPs within the player's bounding box to measure an individual player's activity.

#### 3.3.1. Optical Flow-Based Activity Measure—DT+OF

In the proposed DT+OF activity measure, the optical flow estimated from time-varying image intensity is used to capture information about the speed and direction of movement of image patches within player bounding boxes $B_b$, which may correspond to player activity [21]. The movement of any point on the image plane produces a two-dimensional (2D) path $\mathbf{x}(t) \equiv (x(t), y(t))^T$ in camera-centered coordinates, with instantaneous velocity $V = d\mathbf{x}(t)/dt$. The 2D velocities of all visible surface points make up the 2D motion field.

The optical flow motion field in consecutive video frames is estimated using the Lucas–Kanade method [34]. In the Lucas–Kanade method, the original images are first divided into smaller sections, and it is assumed that the pixels within each section have similar velocity vectors. The result is a vector field $V$ of velocities, where, at each point $(x, y)$, the magnitude of the vector represents the movement speed and its angle represents the movement direction.

Figure 5 visualizes an example of the optical flow vectors calculated between corresponding bounding boxes in two video frames from the dataset. The direction and length of the blue arrows represent the direction and magnitude of the optical flow at each point.

**Figure 5.** Player bounding box (yellow) and optical flow vectors (blue).

The activity measure $A_b^{OF}$ of a player detected with bounding box $B_b$ is calculated in each frame as the maximum optical flow magnitude $V_{x,y}$ within the area $P_b$ of the bounding box:

$$A\_b^{OF} = \max\_{B\_b} \big| V\_{x,y} \big| ; \ x, y \text{ within } B\_b. \tag{5}$$

The assumption is that the players performing a sports action make more sudden movements that will result in larger optical flow vectors within the player bounding boxes. The maximum value of the optical flow is used in order to make the comparison between players with different bounding box sizes simple.
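Equation (5) reduces to taking the maximum flow magnitude inside the box. The sketch below assumes the flow field has already been computed (e.g., by the Lucas–Kanade estimator) and is stored as an `(H, W, 2)` NumPy array of per-pixel velocity components; the function name and array layout are illustrative.

```python
# Sketch of the DT+OF activity measure from Equation (5): maximum optical
# flow magnitude within a player's bounding box.
import numpy as np

def activity_of(flow, box):
    """flow: (H, W, 2) array of per-pixel (vx, vy); box: (x, y, w, h)."""
    x, y, w, h = box
    patch = flow[y:y + h, x:x + w]                    # crop to the bounding box
    return float(np.max(np.hypot(patch[..., 0], patch[..., 1])))

flow = np.zeros((10, 10, 2))
flow[4, 5] = (3.0, 4.0)              # one fast-moving point, magnitude 5
a_inside = activity_of(flow, (3, 2, 5, 5))   # box covers the moving point
a_outside = activity_of(flow, (0, 0, 3, 3))  # box with no motion
```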

#### 3.3.2. STIPs-Based Activity Measure—DT+STIP

The DT+STIP activity measure is based on spatiotemporal interest points (STIPs). STIPs extend the idea of local interest points in the spatial domain, i.e., points with significant local variation of image intensities, to the joint spatial and temporal domain, by requiring that image values have large variation in both the spatial and temporal directions around a point.

It is expected that a higher density of STIPs in a certain area indicates a region of stronger movement, which might correspond to a higher level of player activity, since STIPs capture the spatiotemporal "interestingness" at different points in the image, and most actions in sports videos are characterized by strong variations in velocity and appearance over time [16]. Thus, an activity measure $A_b^{STIP}$ based on STIP density is calculated for a player with bounding box $B_b$ and area $P_b$ in a given frame as:

$$A\_b^{STIP} = \frac{\#STIP}{P\_b}; \text{ STIP within } B\_b. \tag{6}$$
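Equation (6) can be sketched as a simple count of detected interest points falling inside the box, normalized by the box area. The point coordinates below are illustrative, not real detector output.

```python
# Sketch of the DT+STIP activity measure from Equation (6): STIP density
# within a player's bounding box.
def activity_stip(stips, box):
    """stips: list of (x, y) interest-point coordinates in the frame;
    box: (x, y, w, h) player bounding box with area P_b = w * h."""
    x, y, w, h = box
    inside = sum(1 for px, py in stips
                 if x <= px < x + w and y <= py < y + h)
    return inside / float(w * h)

stips = [(12, 14), (13, 15), (50, 60)]       # two points inside, one outside
a = activity_stip(stips, (10, 10, 10, 10))   # area P_b = 100
```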

Figure 6 shows an example of the STIPs detected in a frame of video, as well as the detected STIPs superimposed on the detected players' bounding boxes.

**Figure 6.** Fusion of bounding box and spatiotemporal interest points.

A number of methods have been proposed to detect STIPs: the method proposed in [35] is based on spatiotemporal corners derived from the Harris corner operator (Harris3D); [36] uses Dollar's detector, which applies a Gaussian filter in the space domain and a Gabor band-pass filter in the time domain and obtains denser sampling by avoiding scale selection; [37] uses Hessian3D, derived from SURF; and the selective STIPs detector [38] focuses on detecting STIPs that likely belong to persons rather than to a possibly moving background. The Harris3D detector is used in the experiments described here.

In the Harris3D STIPs detector, a linear scale-space representation *L* is first computed by convolving the input signal *f* with an anisotropic Gaussian kernel *g*, i.e., a kernel with independent spatial and temporal variances [35]:

$$L\left(\cdot; \sigma^2, \tau^2\right) = g\left(\cdot; \sigma^2, \tau^2\right) * f(\cdot), \tag{7}$$

where $\sigma^2$ and $\tau^2$ represent the spatial and temporal variance, respectively, and the kernel $g$ is defined as:

$$g\left(x, y, t; \sigma^2, \tau^2\right) = \frac{1}{\sqrt{(2\pi)^3 \sigma^4 \tau^2}} \, e^{-\frac{x^2 + y^2}{2\sigma^2} - \frac{t^2}{2\tau^2}}. \tag{8}$$

A spatiotemporal second-moment matrix $\mu$, composed of first-order spatial and temporal derivatives of $L$ averaged using a Gaussian weighting function $g$, is then computed:

$$\mu = g\left(\cdot; \sigma_i^2, \tau_i^2\right) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix}, \tag{9}$$

where $L_x$, $L_y$, and $L_t$ are the first-order derivatives of $g * f$ in the $x$, $y$, and $t$ directions.

The spatiotemporal interest points are obtained from the local maxima of the corner function H:

$$H = \det(\mu) - k \cdot \text{trace}^3(\mu),\tag{10}$$

where $k$ is a parameter that was experimentally set to $k \approx 0.005$ [35].
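The corner function of Equation (10) can be evaluated directly once $\mu$ is available. The sketch below applies it to single hand-crafted matrices for intuition; in the full detector, $H$ is computed at every point and STIPs are its local maxima. The example matrices are illustrative, not real video statistics.

```python
# Sketch of the Harris3D corner response from Equation (10):
# H = det(mu) - k * trace(mu)^3, for one second-moment matrix mu.
import numpy as np

def harris3d_response(mu, k=0.005):
    return float(np.linalg.det(mu) - k * np.trace(mu) ** 3)

# Strong variation in all three directions (a spatiotemporal corner)
# scores high ...
corner = np.diag([1.0, 1.0, 1.0])
# ... while variation in only one direction (an edge, not a corner) does not.
edge = np.diag([1.0, 0.0, 0.0])
h_corner = harris3d_response(corner)
h_edge = harris3d_response(edge)
```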

The STIPs are extracted from the whole video using the Harris3D detector with default parameters.

#### 3.3.3. CNN-Based Activity Measure—DT+Y

The DT+Y activity measure uses YOLOv3 to recognize the level of activity of a detected player, i.e., to determine whether the player is active or not.

The YOLOv3 network, which in the previous cases was only used for player detection, was additionally trained on custom data to classify each detected player as either active or inactive, so that the person class was replaced with two classes: active player and inactive player. Aside from the number of detected classes, the architecture of the network is the same as that used for player detection in DT+OF and DT+STIP.

The training was completed for 70,000 iterations on 3232 frames extracted from 61 videos in our training dataset. The frames were manually labeled as either active or inactive player. The learning rate was 0.001, momentum 0.9, and decay 0.0005. The input image size was 608 × 608 pixels, without using YOLO's multiscale training. Data augmentation in the form of random image saturation and exposure transformations with maximum factors of 1.5, and random hue shifts of at most 0.1, was used.

The activity measure $A_b^Y$ for each bounding box $B_b$ corresponds to the confidence value $c$ of the active player class:

$$A_b^Y = c(\text{active player class}(B_b)). \tag{11}$$

Additionally, a threshold value can be defined for the confidence of the *active player class*, above which the player is considered active, and inactive otherwise.

#### *3.4. Determining the Most Active Player*

Throughout a sequence, different players will have the highest activity scores at different times. To select the most active player in the whole sequence, the activity scores of each player in individual frames are first aggregated into track scores. In Figure 7, a thick white solid bounding box indicates the most active player in each frame (here, the player with ID 6). As shown in Figure 7, in the fourth frame, where multiple players have the same activity level, none is marked as active.

**Figure 7.** Indicating the most active player on each frame (marked with thick white bounding box).

The track scores are represented by a vector whose elements represent the ranking by activity measure in each frame. For example, if in the first frame the track with ID 6 (track 6) has the highest activity measure and track 1 has the second largest, the track score at frame 1 will be 1 for track 6 and 2 for track 1. Table 1 provides an example of ranking the players in each individual frame by their activity level and determining the most active player in a given time frame.

Finally, the active player's track is the one most often ranked first by activity level in the whole sequence, i.e., the one whose track score vector contains the most ones among all track score vectors.
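The aggregation described above can be sketched as counting, per track, the frames in which that track had the highest activity measure. This is a minimal sketch with illustrative activity values; tied frames produce no winner, mirroring the fourth frame of Figure 7.

```python
# Sketch of the track-score aggregation: the active player's track is the
# one ranked first (highest activity) in the most frames.
def most_active_track(per_frame_activity):
    """per_frame_activity: list of dicts {track_id: activity}, one per frame.
    Ties for the highest activity in a frame yield no winner for that frame."""
    wins = {}
    for frame in per_frame_activity:
        best = max(frame.values())
        leaders = [tid for tid, a in frame.items() if a == best]
        if len(leaders) == 1:               # skip tied frames
            wins[leaders[0]] = wins.get(leaders[0], 0) + 1
    return max(wins, key=wins.get)

frames = [
    {1: 0.2, 6: 0.9},
    {1: 0.8, 6: 0.3},
    {1: 0.5, 6: 0.5},   # tie: no winner in this frame
    {1: 0.1, 6: 0.7},
]
winner = most_active_track(frames)
```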

**Table 1.** Track score vectors for each detected player in the video sequence presented in Figure 7. The number of wins for each player in the sequence and the final rankings are highlighted.


The result of the proposed method for determining the most active player is a series of active player bounding box representations $B_b(C_b, P_b)$ stored in the player track $T_b = (B_1, B_2, \ldots, B_b)$ from the beginning to the end of the action. The graphical representation of the centroids $C_b$ of each bounding box $B_b(C_b, P_b)$ through the motion sequence corresponds to the trajectory of the player's movement, shown with the yellow line in Figure 8.

**Figure 8.** Detected most active player (white thick bounding box) and his trajectory through the whole sequence (yellow line).

The active player track can ultimately be represented by a collage of thumbnails that correspond to the bounding boxes of the most active player, showing the stages of an action (Figure 9).

**Figure 9.** Extracted collage of the most active player's action.

#### **4. Performance Evaluation of Active Player Detection and Discussion**

The proposed method was tested on a custom dataset consisting of scenes recorded during handball training sessions. The training sessions took place in a sports hall or sometimes on an outdoor field, with a variable number of players without uniform jerseys, challenging artificial lighting or strong sunshine, cluttered backgrounds, and other adverse conditions.

The dataset consists of 751 videos, each containing one of the handball actions, such as passing, shooting, jump-shot, or dribbling. Multiple players, on average, 12 of them, appear in each video, as shown in Figure 10. Table 2 provides the statistics of the number of players per frame.

**Figure 10.** Distribution of number of players per frames.

**Table 2.** Number of detected players per frame.


Each player can move in any direction, but mostly one player performs the action of interest, and that player is considered the active player. Therefore, each file is marked with only one action of interest. The scenes were shot with stationary GoPro cameras mounted on the left or right side of the playing field, from different angles and in different lighting conditions (indoor and outdoor). In the indoor scenes, the camera was mounted at a height of 3.5 m, and in the outdoor scenes at 1.5 m. The videos were recorded in full HD (1920 × 1080) at 30 frames per second. The total duration of the recordings used for the experiment was 1990 s.

In the experiment, the individual steps of the proposed method and the method as a whole were tested, from person detection to selecting the trajectory of an active player.

For the player detection task, the Yolo detector was tested on full-resolution videos without frame skipping. The performance of the Yolo detector was evaluated in terms of recall, precision, and F1 score [39]. A detection is considered a true positive when the intersection over union of the detected bounding box and the ground truth box exceeds the threshold of 0.5. The intersection over union (IoU) measure is defined as the ratio of the area of intersection of the detected bounding box and the ground truth (GT) bounding box to the area of their union, as in Figure 11.
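The IoU criterion can be sketched as a standard axis-aligned overlap computation; boxes here are `(x, y, w, h)` tuples, a representation chosen for illustration.

```python
# Standard intersection-over-union for axis-aligned (x, y, w, h) boxes,
# matching the criterion described above.
def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))   # intersection width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))   # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A detection shifted by half its width overlaps the ground truth with IoU 1/3,
# below the 0.5 threshold, so it would not count as a true positive.
gt = (0, 0, 10, 10)
det = (5, 0, 10, 10)
v = iou(gt, det)
```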

**Figure 11.** Visual representation of intersection over union (IoU) criteria equal to or greater than 50%.

The results of the detector depend greatly on the complexity of the scene, which in the used dataset depends on the number of players in the scene and their size, as well as on the distance from the camera and the number of occlusions. For example, if there are up to eight players in the scene, close to the camera, the results are significantly better than when there are nine or more players far from the camera and occlusions exist, as shown in Figure 12. The background is similar in both cases, and includes the ground surface and its boundaries, the advertisements along the edge of the field, and chairs in the auditorium.

**Figure 12.** Results of player detection with Yolo in simple and complex scenarios.

The obtained results confirm that player detection, which is the first step of the active player detection method, provides a good starting point for the next phases of the algorithm.

In the next step, we tested the performance of each activity measure in terms of precision, recall, and F1. In the case of DT+OF and DT+STIP, the activity measure for each detected player in a frame is computed from motion features extracted within the player's bounding box in that frame; in the case of DT+Y, it is given by the confidence value of the active player class detection. The performance of all measures is evaluated on each frame by comparing the bounding box with the highest activity measure with the ground truth active player bounding box: the DT+OF and DT+STIP measures are used in combination with the bounding boxes of players detected by Yolo, while the DT+Y measure outputs the bounding box of the active player directly.

A detection was considered a true positive if it was correctly labeled as active and had the largest IoU with the ground truth active player bounding box among the detected boxes. When the goal is to evaluate the performance of an activity measure, that is, how often the measure selects the right player as the most active, an exact match between the detected box and the ground truth is not crucial; a close match (large IoU) matters only when one wants to monitor player postures while an action is performed. The minimum IoU overlap between the detected bounding box and the ground truth was therefore set to 10%, since, in this part, the focus was on the ability of the measure to distinguish between activity and inactivity of the player and not on correct localization.

Figure 13 shows the evaluation results for all activity measures and indicates significantly higher values for the DT+Y activity measure. DT+Y achieves the best F1 score of 73%, which is significantly better than the results achieved by the other measures. DT+OF has a precision of 51%, DT+STIP of 67%, and DT+Y of 87%. All of the measures achieve lower recall than precision: DT+OF achieves only 20%, DT+STIP 23%, and DT+Y 63%, an even larger gap in favor of DT+Y compared to the other measures.

**Figure 13.** Evaluation results for DT+OF, DT+STIP, and DT+Y activity measures.

Figure 14 shows examples of positive and negative detections of an active player. The negative detections occurred because another player on the field had a greater activity measure than the player performing the required action, who was supposed to be the most active.

**Figure 14.** Marking of the leading player (white thick bounding box); (**a**) Correct detections; (**b**) Wrong detections. Examples were taken in different handball halls and under different indoor and outdoor lighting conditions.


To determine a player's overall activity over a period of time, it is necessary not only to know the player's activity measure in each frame, but also to uniquely identify that player throughout the period of interest, i.e., to track them through the sequence. To evaluate the results of the proposed method for detecting an active player over a period of time, the track of the active player $T_b = (B_1, B_2, \ldots, B_b)$ on sequential frames is compared with the ground truth.

To quantify the performance, a true positive rate was calculated as the number of true positives divided by the number of tested sequences. An active player track is considered a true positive if the IoU of the detected active player is equal to or greater than a threshold value α = 50% in more than θ of the frames (here, θ was set to 50%). Formally, the measure for active player track evaluation is defined as follows:

Let $T$ be a sequence of bounding box representations $B$, $T_{GT}$ a ground truth sequence, $IoU: T \times T_{GT} \to [0, 1]$ a given function, and $f: T \to \{0, 1\}$ defined as:

$$f(\mathbf{x}) = \begin{cases} 1 & \text{IoU}(\mathbf{x}, \mathbf{g}) \ge \alpha \\ 0 & \text{IoU}(\mathbf{x}, \mathbf{g}) < \alpha \end{cases} \tag{12}$$

then the true positive indicator $TP_T$ for an active player track $T$ is defined as:

$$TP\_T = \begin{cases} 1 & \frac{\sum\_{B\_i \in T} f(B\_i)}{|T|} \ge \theta \\ 0 & \text{otherwise} \end{cases}. \tag{13}$$

$TP_T$ is computed for every track in the test set, and TPR% is the ratio of true positive tracks to the total number of tracks in the test set.
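The track-level evaluation of Equations (12) and (13) can be sketched as follows; the per-frame IoU values below are illustrative, and the function names are not from the original implementation.

```python
# Sketch of the evaluation in Equations (12)-(13): a track is a true positive
# if at least a fraction theta of its frames overlap the ground truth with
# IoU >= alpha; TPR% is the share of true positive tracks.
def track_true_positive(ious, alpha=0.5, theta=0.5):
    """ious: per-frame IoU of the detected active player vs. ground truth."""
    good = sum(1 for v in ious if v >= alpha)   # f(x) from Equation (12)
    return 1 if good / len(ious) >= theta else 0

def tpr_percent(all_tracks, alpha=0.5, theta=0.5):
    tp = sum(track_true_positive(t, alpha, theta) for t in all_tracks)
    return 100.0 * tp / len(all_tracks)

tracks = [
    [0.9, 0.8, 0.2, 0.7],   # 3/4 frames above alpha -> true positive
    [0.4, 0.3, 0.6, 0.1],   # 1/4 frames above alpha -> not counted
]
score = tpr_percent(tracks)
```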

This measure is strictly defined, because only those results that have a true positive detection of the player in at least θ of the frames, for the duration of the action, are considered positive, with the additional condition that the action should be the given handball action. Thus, the case in which a player is detected and properly tracked during their movement but does not perform an action of interest in more than θ of the frames is not counted as a true result. Likewise, a case in which the player is correctly identified and followed in only part of the sequence (fewer than θ of the frames) is not counted as a positive result.

Table 3 provides the evaluation results for the active player detection method along the entire motion trajectory for all activity measures, with IoU set to α = 50% and at least θ = 50% of frames per track correctly detected.


**Table 3.** True positive rates (TPR %) for active player detection.

The best overall score was achieved with the DT+Y activity measure used in conjunction with the Deep Sort tracker, with 67% correct active player sequences, while the second-best result was achieved using the DT+STIP activity measure and the Deep Sort tracker, with 53% correct sequences. The performance gap between the DT+STIP and DT+OF measures and DT+Y is smaller here, where whole sequences are taken into account, than when their activity detection performance is examined in isolation. When considering whole sequences, if player detection and tracking perform well, some mislabeling of players as inactive in certain frames will not affect the label of the whole sequence, so the result does not degrade either. This is important to note because, even though the DT+Y activity measure performs best for determining the activity of the player, using it requires annotated training data, which, if not available, can be time consuming and tedious to acquire; this is not required for the DT+OF and DT+STIP measures. An even larger training set with data balancing would be needed to further improve the results of the DT+Y measure, because the number of inactive players in the scene is much higher than the number of active ones.

With all activity measures, better results were obtained while using the Deep Sort tracker than with Hungarian algorithm-based tracking, so it seems a clear choice, since it does not require much more effort to use.

We tested the proposed methods with different values of the α threshold parameter; Figure 15 presents the results. When only the information that the player is active for a certain period is relevant, the accuracy of localization is not as important (low overlap with the ground truth is acceptable) as when we want to observe the player's movements and postures while performing an action (large overlap with the ground truth is desirable).

Similar rankings hold for most values of the minimum required IoU (parameter α), as can be seen in Figure 15.

**Figure 15.** Evaluation results for player tracks obtained with DT+OF, DT+STIP, and DT+Y activity measures using Deep sort or Hungarian algorithm for tracking. The results at minimum θ = 50% correct frames per sequence.

When considering different values of the minimum fraction of correct frames per sequence (parameter θ), similar rankings can again be observed, but, as expected, with different achieved TPR%. For example, Figure 16 shows that the TPR% is lower for various values of α when θ is increased to 70% than in the case shown in Figure 15. Here, the TPR score at the lowest values of α is mostly limited by the performance of the tracking and activity detection, while at higher values of α it is limited by the precision of player detection.

In the following, some typical cases and issues are shown and commented on. Figure 17 shows a collage of thumbnails that present good examples of player tracks that include all correctly detected bounding boxes on the frames in the sequence. As each player performs a particular action in a specific way, the number of frames is variable, even when performing the same handball action or technique according to the same rules.

**Figure 16.** Evaluation results for player tracks obtained with DT+OF, DT+STIP, and DT+Y activity measures using Deep sort or Hungarian algorithm for tracking. The results at minimum θ = 70% correct frames per sequence.

**Figure 17.** Examples of shooting action with and without jump performed by different players in different positions with respect to the camera in both outdoor and indoor fields.

The number of frames in a player track depends on detector and tracking performance at all stages of the action. Figure 18 shows the problem of imprecise detection in some parts of the player track, where the detector correctly detected the player but the IoU is less than the threshold because the player's legs are outside the bounding box. Figure 19 shows a negative example of an active player track caused by poor player detection that propagated to other stages of the algorithm. The thumbnails show that the player is active and well tracked for as long as they were detected. Additional training of the person class on samples from the handball domain could reduce the problem of imprecise detection.

**Figure 18.** Examples of imprecise player detection.

**Figure 19.** Tracking errors due to poor player detection.

Figure 20 shows a negative example of an active player's track caused by improper tracking, which occurs due to cluttered scenes, players overlapping, or players changing position.

**Figure 20.** Examples of tracking problems.

The detector detects all players in each frame, and the activity level is calculated for each one while their movement is monitored and recorded. The proposed method selects the player with the highest level of activity in the most frames as the most active player; however, it is possible that the most active player is not the one performing the given action, in which case there is no match to the GT, even though everything is done correctly. Figure 21 shows cases when players who were preparing to perform an action or running in the queue had a higher level of activity than the player who was performing the assigned action and so were wrongly selected as active players.
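The selection rule described in this paragraph (pick the track that has the highest activity in the largest number of frames) might be sketched as follows; the data layout, one dict of per-track activity scores per frame, is an assumption for illustration:

```python
from collections import Counter

def most_active_track(activity_per_frame):
    """activity_per_frame: a list with one dict per frame, mapping
    track_id -> activity score. For each frame, find the track with the
    highest score; the track that wins the most frames is returned."""
    wins = Counter(max(frame, key=frame.get)
                   for frame in activity_per_frame if frame)
    return wins.most_common(1)[0][0]
```

As the text notes, this rule is correct by construction yet can still disagree with the GT when a player outside the action of interest is genuinely the most active one.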

**Figure 21.** Tracking of players that perform actions outside of the scope.

As the two activity measures DT+OF and DT+STIP do not require any learning, we tested them on completely different scenes from another team sport to assess how general the proposed activity measures are.

We used the basketball event detection dataset [26] for this experiment. The dataset contains automatically obtained player detections and tracks for basketball events in 11 classes, such as three-point success, three-point failure, free-throw success, free-throw failure, and slam dunk. The scenes were taken from a basketball game, so only one ball was present on the court. The clips are four seconds long and subsampled to 6 fps from the original frame rate. The clips are tailored to include the entire event, from throwing the ball to the outcome of the shot. A subset of 850 clips also has the known ball location in the frame in which the ball leaves the shooter's hands.

We used the two activity measures, DT+OF and DT+STIP, on the basketball event dataset without any modification. The DT+YOLO method was not tested, because it needs to be trained before use and was not trained on the basketball domain. The player detections and tracks provided with the dataset were used, so as to evaluate only the impact of the activity measures on active player detection (Figure 22).

**Figure 22.** Marking of the leading player in basketball scenes (white thick bounding box).

In [26], the authors selected the player(s) closest in image space to the ball as the leading player ("shooter") to evaluate the attention mechanism. The handball scenes, on the other hand, correspond to individual handball techniques and were recorded during training, so that each player has his own ball; therefore, possession of the ball could not be used as a measure of player activity and was not included in the design of the proposed activity measures. For this reason, we selected a part of the basketball test set, determined the active player for the entire duration of the given action, and then evaluated the active player detection metrics against these criteria.

The evaluation results for active basketball player detection along the entire motion trajectory for the DT+OF and DT+STIP activity measures, where at least θ = 50% of frames per track are correctly detected, are given in Table 4.

**Table 4.** True positive rates for active player detection in basketball event detection scenes.


The obtained results are lower than in the handball scenes (by up to 8%), primarily because the video sequences in the basketball dataset are event-driven rather than determined by the action for which the method was designed. This is most noticeable in actions such as the three-point throw, where the player who threw the ball may not even be present in most of the frames, as the camera zooms in on the ball's trajectory and on the surrounding players and basket to track the outcome of the throw. In contrast to the basketball event detection method, the proposed methods are defined from the perspective of the player and correspond to specific handball techniques; for example, a jump shot lasts from the moment the player receives the ball, takes up to three steps, and throws the ball towards the goal, regardless of the outcome of the action (whether or not the ball enters the goal).

The results are significant and promising, because they are comparable to, and even slightly better than, the results reported in [26], although the method in [26] was purposefully trained to detect an active player on the basketball dataset using 20 GPUs.

The track results exhibit similar properties as on the handball dataset, with problems stemming from the same sources: player detection, tracking, or the activity measure. The difference is that basketball has fewer players on the field, a smaller court, and a bigger ball, and player occlusion in basketball is even greater than in the handball scenes because of the speed and differences in the rules of the game. Additionally, in our handball dataset, the actions are defined from the player's perspective and correspond to individual handball techniques.

Finally, Figure 23 presents a visual sample of results on the basketball set.

**Figure 23.** Example of results on the basketball dataset. (**a**) Correct tracks for the three-point failure event; (**b**) Wrong result due to imprecise player detection; (**c**) Incorrect result due to higher activity of player who is not performing the action of interest; (**d**) Incorrect results due to bias to longer tracks.

For this reason, for the final track selection, our method assumes that the player should be tracked throughout the whole video sequence, so it is, by design, biased towards longer tracks. It often fails to choose the shooter in cases such as the three-point shot, where the player who throws the ball might not even be present in the majority of the clip (Figure 23d). Track fragmentation due to poor tracking, where the player who should be labeled as active is actually labeled with more than one track ID, impairs the ability to select the correct track for the same reason. The wrong track can be selected even though the correct player might have been the most active in the frame in which the ball leaves the shooter's hands. A similar bias toward longer tracks is reported in [26] for the tracking-based attention model. In future work, the method should be able to switch the leading track from player to player to handle this problem and to extend the application to longer, temporally unsegmented clips.

#### **5. Conclusions**

There is great interest in the automatic interpretation of sports video for the analysis of game tactics and individual athletes' performance, for retrieval, and for monitoring and refining player action techniques and playing style. To facilitate automatic analysis in team sports such as handball, detecting and tracking the players currently performing a relevant action is of particular interest, as the interpretation of the whole scene often depends on that player's action. Additionally, during training sessions, time should be used efficiently, so that all players are active for a sufficient proportion of the session.

In this paper, an active player detection method is proposed that combines frame-by-frame object detection, player activity measures, and tracking to detect and track active players in video sequences. The CNN-based YOLO object detector was used for player detection; a Hungarian-assignment-based method and Deep SORT were considered for tracking the detected players along the time dimension; and three different measures are proposed for detecting player activity in a given frame: DT+OF, DT+STIP, and DT+Y. The DT+STIP measure is calculated from the spatiotemporal interest points (STIPs) detected within the area of a player, the simpler DT+OF method depends on the computed optical flow (OF) motion field, and the DT+Y method uses a convolutional neural network (CNN) to determine player activity and classify players as active or inactive.
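For illustration, frame-to-frame association of detections by bounding-box overlap, as performed in the tracking stage, can be sketched with a simple greedy matcher; the paper itself uses the Hungarian algorithm and Deep SORT, so this is a simplified stand-in, not the authors' implementation:

```python
def associate(prev_boxes, new_boxes, iou_thresh=0.3):
    """Greedily match boxes from the previous frame to boxes in the new
    frame by descending IoU. Returns a list of (prev_idx, new_idx)
    pairs; unmatched detections would start new tracks."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    pairs = sorted(((iou(p, n), i, j)
                    for i, p in enumerate(prev_boxes)
                    for j, n in enumerate(new_boxes)), reverse=True)
    matches, used_p, used_n = [], set(), set()
    for score, i, j in pairs:
        if score >= iou_thresh and i not in used_p and j not in used_n:
            matches.append((i, j))
            used_p.add(i)
            used_n.add(j)
    return matches
```

The Hungarian algorithm solves the same assignment optimally over the full cost matrix, and Deep SORT additionally weighs in appearance features, which is what enables re-identification after occlusion.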

The proposed method and its components were tested on a set of handball videos from several training sessions acquired in the wild. A sequence-based evaluation measure was proposed for overall evaluation; the best result, a rate of 67% correct sequences on the test set, was achieved with the Deep SORT tracking method combined with the CNN-based DT+Y activity measure. The DT+Y method requires training data; if such data are not available, the DT+STIP or DT+OF methods can be used, which achieved 53% and 50% TPR on the test set, respectively.

To test how general the proposed activity measures are, we chose completely different sports scenes from the basketball domain and evaluated the performance of the DT+OF and DT+STIP measures, since they do not require any learning. The obtained results are lower than in the handball scenes (by up to 8%), primarily because the video sequences are event-driven rather than determined by the action, as in our case, and there is even greater occlusion of the players, as basketball is played faster and on a smaller court. The results are promising, because they are comparable to those of a method that was trained on that basketball dataset to detect an active player in a video sequence.

The achieved results confirm that active player detection is a very challenging task that could be further improved by improving all three components (player detection, tracking, and the activity measure). In particular, player detection should be improved first, because the other two components directly depend on its performance. The activity measure is calculated inside the detected bounding box, so imprecise detection negatively affects it, since parts of the player's body may fall outside the detected box while irrelevant parts of the background fall within it. Similarly, in addition to being unable to track objects that are not detected at all, the Deep SORT tracker uses the visual history of previous player appearances, so imprecise detection impairs its ability to correctly re-identify players after occlusion.

For application in a match setting, where a single ball is in play, ball detection could be incorporated as an additional significant cue for detecting the leading player. The ball's role in the game is undoubtedly important for determining the action and the player, but the ball is a small object that is hidden in a player's hand most of the time and, when thrown, moves quickly and changes direction, so it is often blurred; its tracking thus remains a challenge that is being intensively researched.

In the future, the method should be extended to better deal with multi-player actions, such as crossing or defense, and the player activity could be monitored in relation to the type of action the player is performing. Additionally, the method should be extended to handle long-term video clips, where the leading player changes in different time intervals.

**Author Contributions:** Conceptualization, M.P. and M.I.-K.; Data curation, M.P.; Formal analysis, M.I.-K.; Investigation, M.P. and M.I.-K.; Methodology, M.P. and M.I.-K.; Validation, M.P. and M.I.-K.; Visualization, M.P.; Writing—original draft, M.P. and M.I.-K.; Writing—review & editing, M.I.-K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Croatian Science Foundation, grant number HRZZ-IP-2016-06-8345 "Automatic recognition of actions and activities in multimedia content from the sports domain" (RAASS).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Hand Gesture Recognition Using Compact CNN via Surface Electromyography Signals**

**Lin Chen 1,2,**†**, Jianting Fu 1,2,**†**, Yuheng Wu 1,3, Haochen Li 1,2 and Bin Zheng 1,\***


Received: 25 December 2019; Accepted: 22 January 2020; Published: 26 January 2020

**Abstract:** By training a deep neural network model, the hidden features in Surface Electromyography (sEMG) signals can be extracted, and the motion intention of a human can be predicted by analyzing the sEMG. However, models recently proposed by researchers often have a large number of parameters. We therefore designed a compact Convolutional Neural Network (CNN) model that not only improves classification accuracy but also reduces the number of parameters in the model. Our proposed model was validated on the NinaPro DB5 Dataset and the Myo Dataset, achieving good gesture recognition accuracy.

**Keywords:** surface electromyography (sEMG); convolution neural networks (CNNs); hand gesture recognition

#### **1. Introduction**

In recent years, Surface Electromyography (sEMG) signals have been widely used in artificial limb control, medical devices, human-computer interaction, and other fields. With the development of artificial intelligence and robotics, the intention of human hand movements can be obtained by using an artificial intelligence algorithm to analyze the sEMG signals collected from the residual limb. Robotics and artificial intelligence can thus help disabled people to independently complete basic interactions in their daily lives. The sEMG signals, which are non-stationary, represent the sum of subcutaneous action potentials generated through muscular contraction [1]. They are also one of the main physiological signals used by intelligent algorithms to identify motion intention.

Distinguishing sEMG signals collected from different gestures is the core part of applications that use sEMG signals as an intermediate medium. At present, the literature on gesture recognition and artificial limb control via sEMG signals primarily focuses on time- and frequency-domain feature extraction, which aims to distinguish sEMG signals by feature recognition [1–3]. After years of exploration, some effective feature combinations have been proposed in both the time and frequency domains [4–6], and fruitful results have been achieved on the respective datasets. Choosing the right features is particularly important, since this is how traditional methods distinguish different gestures. However, it is difficult to further improve the performance of sEMG-based gesture recognition with traditional methods: the process of designing and selecting features can be complicated, and the possible combinations of features are diverse, leading to increased workload and unsatisfactory results [7].

Using deep neural networks to distinguish sEMG signals has been proposed by several researchers. Wu et al. [7] proposed the LCNN and CNN\_LSTM models (which can be thought of as autoencoders for automatic feature extraction), which do not require traditional feature extraction. In recent years, deep learning has achieved great success in the field of image recognition. An important idea was put forward in [8,9]: the signals of a channel can form an image after a short-time Fourier transform or wavelet transform of the sEMG signals. Converting the sEMG signal into an image was a good idea and inspired our own transform of the sEMG signal. Researchers such as Côté-Allard et al. [8], who regarded the original sEMG signals as an image, constructed the ConvNet model to further improve the classification accuracy of sEMG signals. However, the LCNN and CNN\_LSTM models proposed by Wu et al. [7] and the ConvNet model used by Côté-Allard et al. [8] contain a large number of parameters.

For deep learning algorithms, the final test accuracy is directly affected by the size of the training data, but one participant cannot be expected to generate tens of thousands of examples in one data collection session. However, a large amount of data can be obtained by aggregating the records of multiple participants, so that the model can be pre-trained to reduce the amount of data required from new participants. On the other hand, designing a compact network structure with fewer parameters also reduces the demand for data.

To reduce the number of model parameters and improve classification accuracy, we present a new compact deep convolutional neural network model for gesture recognition, called EMGNet. On the Myo Dataset, EMGNet achieves an average recognition accuracy of 98.81%. The NinaPro DB5 dataset has often been used to test classical machine learning methods; the accuracy of EMGNet on this dataset was higher than that of the traditional machine learning methods. Figure 1 shows the overall flow chart of sEMG signal acquisition and identification.

**Figure 1.** Surface Electromyography (sEMG) signal collection and classification process. The Myo armband was used to collect the original signal, which was then filtered and sampled to obtain the sEMG signals. A continuous wavelet transform was applied to obtain the signal spectrum, and the neural network model was used to classify the spectrum to achieve gesture recognition.

The rest of this paper is organized as follows. The related work of gesture recognition through deep learning is outlined in Section 2. The data processing and network architecture are described in detail in Section 3. The proposed network model is compared with the current excellent deep network framework and the classical machine learning classification method in Section 4. Finally, we present the conclusion in Section 5.

#### **2. Related Work**

People produce different signals when completing the same action, even when many precisely controlled electrodes are used to sense them [10]; therefore, it is difficult to recognize sEMG signals. Since the AlexNet network proposed by Krizhevsky et al. [11] won the ImageNet challenge in 2012, deep learning has achieved great success in image classification, speech recognition, and other fields. Images can be accurately classified by training a neural network model to learn their characteristics. Nowadays, exploring network architectures has become an important part of deep learning research.

Currently, some researchers have successfully applied deep learning to sEMG signal classification and explored several effective network frameworks [8,12–16]. Using CNNs to classify sEMG signals, the studies in [12,17,18] took the raw signals as the input space. In [13,19], the spectrograms of raw sEMG signals were obtained by the Short-Time Fourier Transform (STFT) and fed into convolutional networks (ConvNets). The study in [8] used ConvNets to classify characterizations of the sEMG signals extracted by an STFT-based spectrogram and the Continuous Wavelet Transform (CWT). Since sEMG signals are timing signals, in our previous work [7] we proposed classifying them by combining Long Short-Term Memory (LSTM) and CNN, which retains the temporal information in the signal while exploiting the ability of the CNN to extract features. We take advantage of the complementarity of CNNs and LSTMs by combining them into one unified architecture, and we analyze the effect of adding a CNN before the LSTM. The resulting LCNN and CNN-LSTM models can directly take pre-processed sEMG signals as network input [7]. In practical work, we verified that the LCNN model performs better than CNN-LSTM. Figure 2 depicts the architecture of the LCNN model. We use PReLU [20] as the non-linear activation function and ADAM [21] for model optimization.

**Figure 2.** LCNN architecture diagram. The LCNN consists of 2 LSTM layers, 2 one-dimensional convolution layers, and 1 output layer. Each of the 2 LSTM layers has 52 cells, and every cell has 64 hidden units.

However, the ConvNets model (shown in Figure 3) in [8] is complicated, and the LSTM model introduced in [7] leads to expensive computation in gesture recognition. Therefore, a new network model is proposed in this paper, and experiments show that this model not only improves recognition accuracy but also reduces the complexity of the network model.

**Figure 3.** Schematic diagram of the ConvNet architecture. In this figure, Conv refers to Convolution and F.C. to Fully Connected layers.

#### **3. sEMG Signals Recognition Algorithm**

#### *3.1. sEMG Signals Feature Extraction*

Designing an sEMG signal feature is one of the main tasks of the algorithm: a good feature not only easily distinguishes the sEMG signals produced by different movements, but also maintains a small variance between the sEMG signals produced by the same movement. Four feature sets—Time Domain characteristics (TD) [5], Enhanced TD [4], NinaPro Features [6,22], and the SampEn Pipeline [23]—were selected as features for the classical machine learning algorithms compared with the proposed method. The TD feature set includes four features: Mean Absolute Value (MAV), Zero Crossing (ZC), Slope Sign Changes (SSC), and Waveform Length (WL) [5]. Enhanced TD includes features such as Root Mean Square (RMS) and the Autoregressive Coefficient (AC) [3]. NinaPro Features includes Root Mean Square, the TD features, etc. [6,22]. The SampEn Pipeline includes features such as Sample Entropy (SampEn) [24], Root Mean Square, and Waveform Length [23].

Since the sEMG signal is non-stationary [25], analysis of these signals using the Fourier transform is limited. One technique to solve this problem is the Short-Time Fourier Transform (STFT), which separates the signal into smaller segments by applying a sliding window and calculates the Fourier transform for each segment separately. The spectrogram of the signal is the squared magnitude of the STFT. Given the signal *x*(*t*) and the window function *w*(*t*), the spectrogram is calculated as follows:

$$\text{spectrogram}(x(t), w(t)) = \left| STFT\_{x}(t, f) \right|^2 \tag{1}$$

$$STFT\_{x}(t,f) = \int\_{-\infty}^{+\infty} [x(u)w(u-t)]e^{-j2\pi fu}\, du \tag{2}$$

where *f* represents frequency. The wavelet transform (WT) is similar to the STFT, but it overcomes the shortcoming that the STFT window does not change with frequency. The WT adapts to frequency changes in the signal by adjusting the width of the window: when the frequency in the signal increases, the WT increases the resolution by narrowing the time window. In addition, the WT, an ideal signal analysis tool, can capture the amplitude and frequency of abrupt changes in a signal.

$$X(a,b) = \frac{1}{\sqrt{b}} \int\_{-\infty}^{+\infty} x(t)\, \phi\left(\frac{t-a}{b}\right) dt \tag{3}$$

$$\int\_{-\infty}^{+\infty} \frac{\left| \Phi(\omega) \right|^2}{\omega}\, d\omega < \infty \tag{4}$$

where Φ(ω), the Fourier transform of ϕ(*t*), must satisfy Equation (4). ϕ(*t*), also known as the mother wavelet function, is a signal with a limited duration, varying frequency, and a mean of zero. The scaling factor *b* controls the scaling of the wavelet function, and the translation amount *a* controls its translation.

After the continuous wavelet transform of the sEMG signal, the corresponding spectrum information is obtained, which is similar to an image in scale and also contains the frequency-domain information of the timing data. The datasets tested in this article were all collected with the Myo armband, shown in Figure 4. Myo is an 8-channel, dry-electrode, low-sampling-rate (200 Hz), low-cost consumer-grade sEMG armband, which is convenient to wear and easy to use [7]. The data of each channel were segmented by applying sliding windows of 52 samples (260 ms). The Mexican hat wavelet function was adopted as the mother wavelet of the continuous wavelet transform. The CWTs were calculated with 32 scales, yielding a 32 × 52 matrix, which was then downsampled by a factor of 0.5 and taken as the input of the EMGNet model. Thus, the input of the EMGNet model has 8 channels, each of which is a matrix of size 15 × 25. Figure 5b shows the spectrum of the signal shown in Figure 5a after wavelet transformation.
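The windowing and CWT step described above might be sketched as follows with a Mexican hat wavelet; the wavelet normalization and kernel length are illustrative choices, not the exact implementation used by the authors:

```python
import numpy as np

def mexican_hat(points, a):
    # Mexican hat (Ricker) wavelet sampled at `points` positions, scale `a`.
    t = np.arange(points) - (points - 1) / 2.0
    x = t / a
    return (2 / (np.sqrt(3 * a) * np.pi ** 0.25)) * (1 - x ** 2) * np.exp(-x ** 2 / 2)

def cwt(signal, scales):
    # One row of coefficients per scale: convolve the signal with the
    # scaled wavelet, keeping the original length ("same" mode).
    out = np.empty((len(scales), len(signal)))
    for k, a in enumerate(scales):
        n = min(10 * int(a), len(signal))
        out[k] = np.convolve(signal, mexican_hat(n, a), mode="same")
    return out

# A 52-sample window (260 ms at 200 Hz) over 32 scales gives a 32 x 52 matrix.
window = np.random.randn(52)
coeffs = cwt(window, scales=np.arange(1, 33))
```

Downsampling this 32 × 52 matrix by 0.5 per channel then yields the 8 × 15 × 25 input described in the text.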

**Figure 4.** The 5 hand/wrist gestures and Myo armband. In this figure, the left is a schematic diagram of five gestures, and the right is the Myo armband.

**Figure 5.** Part (**a**) shows the waveform of sEMG signals and Part (**b**) the spectrum of the sEMG signals shown in (**a**) after wavelet transformation.

#### *3.2. EMGNet Architecture*

A new network model called EMGNet is proposed in this paper and shown in Figure 6. It consists of four convolutional layers and a pooling layer, without using a fully connected layer as the final output. The ConvNet architecture (shown in Figure 3) used in the CWT+TL method of [8] contains 67,179 learnable parameters. As shown in Table 1, the model proposed in this paper contains fewer parameters than the models used in current advanced methods. In view of the better performance of the EMGNet model in actual measurements, this section mainly introduces the EMGNet model.

**Figure 6.** The EMGNet architecture contains four convolutional layers and a pooling layer, without using a fully connected layer as the final output. In this figure, Conv refers to Convolution and avg\_pool to the adaptive average pooling layer.

**Table 1.** Number of parameters used by various models.


The cost function Loss can be computed as follows:

$$\text{Loss} = -\sum\_{i=1}^{n} y\_i \log(y\_i') \tag{5}$$

where *yi* is the true value of the *i*th category, *n* is the number of categories, and *yi'* is the predicted value of the *i*th category of the output. Because we adopted One-Hot Encoding, the true value of one category is 1 while the others are 0.
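With one-hot targets, Equation (5) reduces to the negative log of the predicted probability of the true class; a minimal sketch (the small `eps` term, added for numerical safety, is our assumption):

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss = -sum_i y_i * log(y_i'); with a one-hot y_true only the
    true class contributes, giving -log(predicted probability)."""
    return -sum(y * math.log(p + eps) for y, p in zip(y_true, y_pred))
```

For example, a prediction of 0.8 on the true class gives a loss of about 0.223, and the loss grows without bound as that probability approaches zero.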

The Adam optimization method is used for the backpropagation of EMGNet, and our final target is to minimize the cost function Loss. In the field of image recognition, the sizes of convolution kernels commonly used by researchers are 3 × 3 and 5 × 5 [11,26,27]. We found that the experimental results obtained with 3 × 3 kernels were better, so the size of all convolution kernels in the EMGNet model is set to 3 × 3. Meanwhile, to reduce the number of model parameters as much as possible, the feature maps of each layer of the EMGNet model are also kept small. When the number of feature maps increases, the stride of the convolution is set to 2 to halve the spatial resolution; in other cases, the stride and padding are set to 1, so that the feature size remains unchanged. To further reduce the number of network parameters, the output of the model does not use a fully connected layer; instead, adaptive mean pooling is performed first, and then a convolution layer is used for classification.
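To illustrate why replacing the fully connected output with adaptive pooling followed by a convolutional classifier saves parameters, the following comparison uses hypothetical layer sizes (64 feature maps, 7 classes, a 15 × 25 map), not the actual EMGNet dimensions:

```python
def conv_params(c_in, c_out, k=3):
    # Weights k*k*c_in*c_out plus c_out biases.
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    # A dense layer: one weight per input-output pair plus biases.
    return n_in * n_out + n_out

# After global average pooling, each of the 64 maps collapses to one
# value, so a 1x1 convolutional classifier for 7 classes needs only:
head_conv = conv_params(64, 7, k=1)    # 64*7 + 7 = 455 parameters
# versus flattening a 15 x 25 feature map into a fully connected layer:
head_fc = fc_params(64 * 15 * 25, 7)   # 24000*7 + 7 = 168007 parameters
```

The pooled-then-convolved head is also independent of the input resolution, which is why adaptive pooling is used before the final classification convolution.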

#### **4. Experiment and Results**

We evaluated our model on two publicly available hand gesture recognition datasets: the Myo Dataset [8,9] and the NinaPro DB5 Dataset [6]. First, we compared the proposed method with classical machine learning methods and three other methods (CNN-LSTM, LCNN, CWT+TL) on the Myo Dataset [8,9]. Then, we compared it with classical machine learning methods and our previously proposed methods (CNN-LSTM, LCNN) on the NinaPro DB5 Dataset [6].

#### *4.1. Evaluation Dataset*

This gesture dataset (the Myo Dataset [8,9]) contains two sub-datasets collected using the Myo armband. The first sub-dataset was used as the pre-training dataset in [8], and the second comprises the training and testing parts. The former, collected from 19 subjects, is mainly used to establish, verify, and optimize the classification model. The latter, collected from 17 participants, is used only for final training and validation. In [8,9], the second sub-dataset contains one training section and two test sections, an arrangement that reduces the amount of training data; to facilitate a comparative experiment, this article uses the same settings. The Myo Dataset contains 7 types of gestures with significant differences between them. It provides a sufficient amount of data, and its gestures are shown in Figure 7.

**Figure 7.** The 7 hand/wrist gestures in the Myo Dataset: Neutral, Hand Close, Wrist Extension, Ulnar Deviation, Hand Open, Wrist Flexion, and Radial Deviation.

The NinaPro DB5 dataset [6] is a benchmark for sEMG-based gesture recognition algorithms containing data from 10 able-bodied participants, divided into three exercise sets: Exercises A, B, and C contain 12, 17, and 23 different movements (including neutral), respectively. It was recorded with two Myo armbands, only one of which is used in this work. A characteristic of this dataset is that it contains some similar gestures and the training data are limited; therefore, the model is prone to overfitting during training. Figure 8 shows the gesture categories in the Exercise A dataset.


**Figure 8.** The gesture categories in the Exercise A dataset.

#### *4.2. Method of Training*

Adam [21] is used as the optimization method for network model training in this work. The length of data collected for each gesture in the two datasets was the same; after identical segmentation, the amount of data per gesture was the same, and the samples were balanced. In the Myo dataset, each person has 2280 samples per gesture, with a total of 19 participants, while in the NinaPro dataset, each person has 1140 samples per gesture, with a total of 10 participants.

After segmentation, we used a shuffle algorithm to shuffle the samples of each gesture and then took 60% as the training set, 10% as the validation set, and the remaining 30% of each gesture as the test set.
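The shuffle-and-split procedure above can be sketched as follows; the fixed seed is an illustrative choice for reproducibility:

```python
import random

def split_samples(samples, seed=0):
    """Shuffle the samples of one gesture, then split them 60/10/30
    into train / validation / test sets, as described in the text."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.6 * n), int(0.1 * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])
```

Applied per gesture, this keeps the three sets class-balanced, since each gesture contributes the same number of samples to each split.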

Each training batch fed 128 samples into the network, and a total of 50 epochs of iterative training were conducted. We initially set the learning rate to 0.01 and divided it by 10 at epochs 20 and 40. To prevent over-fitting of the network, L2 regularization is used. The parameter settings for the training process are shown in Table 2.
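The learning-rate policy above (start at 0.01, divide by 10 at epochs 20 and 40, for 50 epochs) corresponds to a simple step decay:

```python
def learning_rate(epoch, base_lr=0.01):
    # Step decay over 50 epochs: divide by 10 at epochs 20 and 40.
    if epoch < 20:
        return base_lr
    if epoch < 40:
        return base_lr / 10
    return base_lr / 100
```

Most frameworks express this as a built-in step or multi-step scheduler; the explicit function just makes the schedule in Table 2 concrete.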

**Table 2.** Training policy parameter setting (lr represents the learning rate).


#### *4.3. Myo Dataset Classification Results*

The loss and accuracy curves during training and testing on the Myo Dataset using the EMGNet model are shown in Figure 9.

**Figure 9.** The loss and accuracy curves during training and testing on the Myo Dataset. When training reaches convergence, there is no over-fitting, i.e., no phenomenon where the accuracy of the training set is high while the accuracy of the test set is low.

As shown in Figure 9, the EMGNet network successfully completed the classification task on the Myo Dataset, and no over-fitting appeared during training and testing. We tested the accuracy of our model and compared it with the three most advanced current methods; Table 3 shows the accuracy of each method. According to the results in Table 3, the accuracy of our proposed model is better than that of the current advanced methods.


The STD represents the standard deviation in accuracy for the 20 runs over the 17 participants.

#### *4.4. NinaPro DB5 Dataset Classification Results*

Figure 10 shows the loss and accuracy curves during training and testing on Exercise A of the NinaPro DB5 Dataset. During training of the EMGNet model, overfitting appears, and reducing the number of layers in EMGNet does not solve this problem.

**Figure 10.** The loss and accuracy curves during training and testing on exercise A of the NinaPro DB5 Dataset; over-fitting appears.

Table 4 shows the accuracy on the three subsets of the DB5 Dataset. Time-domain characteristics (TD) [5], Enhanced TD [4], NinaPro Features [6,22] and the SampEn Pipeline [23] were selected as the classification features for the classical machine learning algorithms (LDA and SVM). The LCNN and CNN\_LSTM we proposed previously did not perform feature extraction and processed the raw sEMG signal directly. The proposed method uses the continuous wavelet transform (CWT) to process the data as the input of EMGNet. From the experimental results, we conclude that the classification accuracy of our proposed EMGNet model is higher than that of the classical machine learning algorithms.
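A CWT front end of this kind can be sketched as follows; the real Morlet-like wavelet, the scales and the normalization here are illustrative assumptions, not the authors' exact configuration:

```python
import numpy as np

def cwt_scalogram(signal, scales, w0=5.0):
    """Naive continuous wavelet transform with a real Morlet-like wavelet.

    Returns an array of shape (len(scales), len(signal)) that can be
    treated as a time-frequency image and fed to a CNN.  The signal is
    assumed to be longer than the widest wavelet (support +/- 4*scale).
    """
    out = np.empty((len(scales), len(signal)))
    for k, s in enumerate(scales):
        t = np.arange(-4 * s, 4 * s + 1)
        wavelet = np.cos(w0 * t / s) * np.exp(-0.5 * (t / s) ** 2) / np.sqrt(s)
        out[k] = np.convolve(signal, wavelet, mode="same")
    return out
```

In practice each sEMG channel would be transformed separately and the per-channel scalograms stacked as input channels of the network.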

As the number of gesture categories increases, the accuracy of all classification algorithms declines to varying degrees. The decline of EMGNet is smaller than that of the classical machine learning algorithms (see Figure 11 and Table 4).

**Figure 11.** The average accuracy of three subsets on the DB5 Dataset.


*Sensors* **2020** , *20*, 672

The classification accuracy of both the classical machine learning methods and EMGNet is lower than that obtained on the Myo Dataset (see Table 4). Two reasons are as follows:


**Figure 12.** Parts (**a**) and (**b**) show pictures of two different gestures in exercise A; parts (**c**) and (**d**) show the spectrum diagrams of the sEMG signals generated by (**a**) and (**b**), respectively.

#### **5. Conclusions**

This paper presents a novel CNN architecture, consisting of four convolutional layers and a max-pooling layer, with a compact structure and few parameters. The experimental results show that the proposed EMGNet not only reduces model complexity but also improves the accuracy of sEMG signal classification. It is highly competitive with the classical classifiers and deep learning frameworks currently used to classify sEMG signals, which demonstrates the potential of deep learning for this task. (The elimination of transition time between successive gestures is explained in Appendix A.)

**Author Contributions:** Writing—original draft, L.C.; Investigation, L.C. and Y.W.; Conceptualization and methodology, J.F.; Software, Y.W. and H.L.; Visualization, H.L.; Experiment design, Y.W.; Data curation, Y.W.; Writing—review and editing, B.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Priority research & design program of Chongqing technology innovation and application demonstration, Grant No. cstc2017zdcy-zdyfX0036.


**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A**

We assume that each gesture is held for a certain time. As shown in the figure below, if three of four consecutive sliding windows are classified as gesture A, we judge the signal as gesture A; if not, we keep the previously recognized gesture. We use the same rule for the other gestures. (The numbers 4 and 3 above are only examples, not the actual values.) The proposed method can thus identify the signals generated by different gestures during continuous signal acquisition.
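This decision rule can be sketched as follows, using the illustrative values of 4 windows and 3 votes from the text (not the actual numbers used):

```python
def smooth_predictions(window_preds, k=4, m=3, initial="rest"):
    """Decide the current gesture from per-window classifier outputs.

    A gesture is accepted only if at least m of the last k consecutive
    sliding windows were classified as that gesture; otherwise the
    previously accepted gesture is kept.  k=4 and m=3 mirror the example
    in the text and are assumptions, not the authors' actual values.
    """
    current = initial
    decisions = []
    for i in range(len(window_preds)):
        recent = window_preds[max(0, i - k + 1): i + 1]
        for g in set(recent):
            if recent.count(g) >= m:
                current = g
                break
        decisions.append(current)
    return decisions
```

This suppresses spurious single-window predictions during the transition between two gestures, which is how the transition time is eliminated.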

**Figure A1.** Schematic of continuous signal recognition.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **RFI Artefacts Detection in Sentinel-1 Level-1 SLC Data Based On Image Processing Techniques**

**Agnieszka Chojka <sup>1,\*</sup>, Piotr Artiemjew <sup>2</sup> and Jacek Rapiński <sup>1</sup>**


Received: 31 March 2020; Accepted: 19 May 2020; Published: 21 May 2020

**Abstract:** Interferometric Synthetic Aperture Radar (InSAR) data are often contaminated by Radio-Frequency Interference (RFI) artefacts that make processing them more challenging. Therefore, easy-to-implement techniques for artefact recognition have the potential to support the automatic Permanent Scatterers InSAR (PSInSAR) processing workflow, during which faulty input data can lead to misinterpretation of the final outcomes. To address this issue, an efficient methodology was developed to mark images with RFI artefacts and, as a consequence, remove them from the stack of Synthetic Aperture Radar (SAR) images required in the PSInSAR processing workflow to calculate the ground displacements. The techniques presented in this paper for RFI detection are based on image processing methods, using feature extraction involving pixel convolution, thresholding and nearest neighbor structure filtering. As the reference classifier, a convolutional neural network was used.

**Keywords:** RFI; artefacts; InSAR; image processing; pixel convolution; thresholding; nearest neighbor filtering; deep learning

#### **1. Introduction**

Since NASA launched the first satellite of the United States' Landsat program (then known as the Earth Resources Technology Satellite) in 1972 [1], satellite remote sensing has developed strongly. Space agencies are currently placing ever newer satellites in orbit around Earth, equipped with advanced sensors for Earth monitoring. Among them, satellites carrying Synthetic Aperture Radar (SAR) instruments are worthy of emphasis.

SAR is an active microwave remote sensing tool [2] with wide spatial coverage, fine resolution, and all-weather, day-and-night image acquisition capability [3,4]. These features make it possible to use SAR images for a multitude of scientific research, commercial and defense applications ranging from geoscience to Earth system monitoring [2,4,5].

Interferometric Synthetic Aperture Radar (InSAR) technology exploits differences in the phase of the waves returning to the satellite of at least two complex SAR images to generate, for instance, surface topography and deformation maps or Digital Elevation Models (DEMs) [6–8].

Among various InSAR techniques, satellite Differential Interferometric Synthetic Aperture Radar (DInSAR) has emerged as a powerful instrument to measure surface deformation associated with ground subsidence on a large scale with centimeter to millimeter accuracy [9–11].

DInSAR exploits the phase information of SAR images to calculate the ground displacements between two different satellite acquisitions [12]. The Permanent Scatterers InSAR (PSInSAR) method, an upgrade of DInSAR, uses long stacks of co-registered SAR images for analytical purposes to identify coherent points that provide a consistent and stable response to the radar on board a satellite [13]. Phase information obtained from these persistent scatterers is used to derive the ground displacement information and its temporal evolution [9].

Thanks to the increasing availability of a large amount of SAR data from such missions like ALOS-2, COSMO-SkyMed, PAZ, RADARSAT-2, Sentinel-1 or TerraSAR-X [3,11] and due to high-quality images covering a wide area [14,15], in the last decades, Earth observation techniques have become a valuable and indispensable remote sensing tool in geophysical monitoring of natural hazards such as earthquakes [16], volcanic activity [17] or landslides [18], mine subsidence monitoring [19] and structural engineering, especially monitoring of subsidence [20] and structural stability of buildings [15] or bridges [10].

DInSAR and PSInSAR are complementary methods, and both have essential advantages and some disadvantages [16]. By way of illustration, PSInSAR is considered more precise than DInSAR, but it requires about 15–20 SAR data acquisitions for a successful result, whereas DInSAR needs only 2 [13].

Despite the obvious benefits, InSAR technology has some limitations. InSAR measurements are often affected by various artefacts that not only make interpreting them more challenging, but also affect the reliability and accuracy of its outcomes.

One of the most significant is the effect of the atmosphere. In general, it results from electromagnetic waves being delayed when traveling through the troposphere and accelerated when traveling through the ionosphere [3]. Atmospheric artefacts are usually strongly correlated with topography (elevation) and proximity to the sea [11,21]. Over the past decades, numerous methods have been investigated to identify and mitigate these artefacts, e.g., [3,8,11,22,23].

Another type of InSAR data failure described in the literature is border noise [24–27]. This undesired processing artefact appeared in all Sentinel-1 GRD products generated before March 2018 [24]. Although the problem has been solved for newly generated products, the fix did not cover the entire range of products, and researchers still develop new methods and tools to effectively detect and remove this particular type of noise [24–27].

In contrast to the unwanted influence of artefacts on InSAR data, Bouvet [28] proposed a new indicator of deforestation based on a geometric artefact called the shadowing effect, which appears in SAR images as a shadow at the border of a deforested patch.

The primary subject of this paper is one more contamination frequently appearing in SAR images, called Radio-Frequency Interference (RFI). Since work is in progress on the development of an automatic monitoring system for high-energy para-seismic events, the urgent matter is to elaborate an effective method supporting the automatic PSInSAR processing workflow by removing faulty SAR data (with artefacts) that can lead to misinterpretation of the final results.

Our main goal is to find the easiest-to-implement technique for marking images with RFI artefacts. The solution presented in this paper for RFI detection is based on image processing methods, using feature extraction involving pixel convolution, thresholding and nearest neighbor structure filtering techniques. As the reference classifier, we used a convolutional neural network.

After a short introduction to the characteristics, common forms and sources of RFI in Section 2, the materials used are presented in Section 3. The applied methodology and techniques are described in Section 4. Their results are presented and comprehensively discussed in Section 5. The main findings and recommendations for future work are summarized in Section 6.

#### **2. RFI Artefact**

RFI is defined as 'the effect of unwanted energy due to one or a combination of emissions, radiations or inductions upon reception in a radio communication system, manifested by any performance degradation, misinterpretation or loss of information which could be extracted in the absence of such unwanted energy' according to Article 1.166 of the International Telecommunication Union Radio Regulations [29].

In SAR images, these incoherent electromagnetic interference signals usually appear as various kinds of bright linear features [30,31], such as bright stripes with curvature or dense raindrops [2,4]. RFI also introduces slight haziness into the image [2,31], which acutely degrades its quality [4,30]. Such affected images may lead to a wrong interpretation process and results [2].

The reason for RFI contamination of SAR is that many different radiation sources operate in the same frequency band as the SAR system [2,30]. In general, they can be grouped into terrestrial and space-borne sources [2]. Most of these incoherent electromagnetic interference signals are emitted by terrestrial commercial or industrial radio devices [2], e.g., communication systems, television networks, air-traffic surveillance radars, meteorological radars, radiolocation radars, amateur radios and other, mainly military, radiation sources [2,4,30,31]. Examples of space-borne RFI sources are signals broadcast from other satellites, such as global navigation satellite system (GNSS) constellations, communication satellites or other active remote sensing systems [2].

Over the past years, great efforts have been made to better understand RFI effects and to develop robust methods for detecting and mitigating this artefact, in particular in SAR data, e.g., [32–37]; see [2] for a general review. In the majority of cases, however, research has focused on the recognition and removal of RFI signatures from L-band SAR data, where this artefact is commonly observed [30]. In the case of SAR systems, the signals most susceptible to RFI effects are those operating in the low-frequency bands, such as P, L, S and even C-band [2,30,31]. Studies are usually conducted on raw data [2,4,30,31,38]. To fill this gap in the literature, our research addresses the detection of RFI artefacts in SAR data, especially in recently available Sentinel-1 data. Additionally, our work follows the recommendations, proposed among others by Tao [2] and Itschner [39], concerning the application of artificial intelligence techniques, such as deep learning methods, to RFI recognition.

#### **3. Materials**

This section briefly describes the materials used in our investigation: the source data and the datasets used to carry out the experiments and tests of our solution.

#### *3.1. Source Data*

A set of dedicated satellites, the Sentinel families, are developed by the European Space Agency (ESA) for the operational needs of the Copernicus programme [40]. The European Union's Earth observation programme delivers operational data and information services openly and freely in a wide range of applications in a variety of areas, such as urban area management, agriculture, tourism, civil protection, infrastructure and transport [41].

The Sentinel-1 mission is a polar-orbiting, all-weather, day-and-night C-band synthetic aperture radar imaging mission for land and ocean services. It is based on a constellation of two satellites: Sentinel-1A was launched on 3 April 2014 and Sentinel-1B on 25 April 2016 [40,42].

Sentinel-1 data are intended to be available systematically and free of charge, without limitations, to all data users, including the general public and scientific and commercial users [40]. The most accessible products for the majority of users are the Level-1 products, provided as Single Look Complex (SLC) and Ground Range Detected (GRD) [40].

Level-1 Single Look Complex (SLC) products, as other Sentinel-1 data products, are delivered as a package containing, among others, metadata, measurement data sets and previews [40].

The Sentinel-1 SAR instrument mainly operates in the Interferometric Wide (IW) swath mode over land [26,40], which is one of four Sentinel-1 acquisition modes. IW uses the Terrain Observation with Progressive Scans SAR (TOPSAR) technique [43] and provides data with a large swath width of 250 km at 5 m × 20 m (range × azimuth) spatial resolution for single look data.

In this study, we used the quick-look images contained in the preview folder of the Sentinel-1 data package. As our products have dual polarization, the data are represented by a single composite color image in RGB [40]. Because the quick-look data are a lower-resolution version of the source image, it is easy to detect RFI artefacts and then exclude data contaminated by RFI from further processing, thus improving the whole PSInSAR processing workflow, which in our case required a stack of 25–30 SAR images to calculate the ground displacements. Faulty data included in this dataset would lead to misinterpretation of the processing results. Moreover, the elimination of faulty data does not affect the quality and reliability of the final processing results, as data are acquired continuously and observations can be repeated with a frequency better than 6 days, considering the two satellites (Sentinel-1A and 1B) and both ascending and descending passes [40].

#### *3.2. Experimental Datasets*

Due to these advantages of Sentinel-1 data products, namely high and free-of-charge availability, wide ground coverage and fine resolution, these data are used in developing the automatic monitoring system for high-energy para-seismic events and their influence on the surface in the study area, the Żelazny Most, with the use of GNSS, seismic and PSInSAR techniques. The Żelazny Most is the largest tailings storage facility in Europe and the second largest in the world [44,45]. It is an integral part of the copper production technological chain [46] that, since 1977, has collected the tailings from the three mines of KGHM Polska Miedź (formerly the Polish State Mining and Metallurgical Combine) [45]. This project exploits Level-1 SLC products in IW mode. Images covering the area of interest have been regularly acquired since September 2018 and processed using the PSInSAR technique.

RFI characteristics depend strongly on the geographic position of data acquisition [2]. According to the RFI power map with continental coverage over Europe [38], our study area was located in a potentially RFI-affected zone. On the other hand, the IEEE GRSS RFI Observations Display System [47] did not indicate any irregularities in the C-band frequency range (4–8 GHz) over this area.

To carry out experiments and verify the proposed approach to RFI artefact detection, we used 3 different quick-look datasets, named RIYADH, MOSCOW and ASMOW. This selection was based on the above-mentioned RFI power map and the IEEE GRSS RFI Observations Display System. Each collection includes both correct images and images with various levels of RFI contamination. The RIYADH dataset consists of 136 images that were collected between May 2015 and January 2019 and cover the area of the capital of Saudi Arabia. The MOSCOW set includes 99 images acquired between January and December 2019; these images cover the area surrounding the capital of Russia. The ASMOW collection consists of 53 images covering the study area, located in southwest Poland, in Lower Silesia, east of the town of Polkowice in the municipality of Rudna. These data were acquired from September 2018 to February 2020.

#### **4. Methodology**

In this section, we introduce basic information about the techniques we used in the experimental part. Here we start by discussing an example of how to store an image in digital form.

#### *4.1. Digital Representation of the Image*

A digital image can be simply represented by binary numbers encoding pixel color saturation in the relevant system [48,49]. Greyscale image pixels are represented by single bytes that take decimal values from 0 to 255, i.e., from 00000000 to 11111111 in binary. A sample change 01101011 ⇒ 01101010 would not be noticeable to the human eye, and this shortcoming of the human eye can be exploited to hide data. RGB image pixels are represented by triples of bytes, each taking decimal values from 0 to 255. For example, white is represented by (255, 255, 255) in decimal and by (11111111, 11111111, 11111111) in binary. Data in digital systems are often shown in hexadecimal form for reading convenience; white is then (*FF*, *FF*, *FF*), where *F* is the hexadecimal digit with decimal value 15.
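These byte representations can be checked directly in Python:

```python
# A greyscale pixel is one byte: decimal 0-255, binary 00000000-11111111.
pixel = 0b01101011            # decimal 107
flipped = 0b01101010          # decimal 106: only the least significant bit changed
print(pixel, flipped)         # 107 106 -- visually indistinguishable shades

# An RGB pixel is a triple of bytes; white is (255, 255, 255).
white = (255, 255, 255)
print(tuple(format(c, "08b") for c in white))  # ('11111111', '11111111', '11111111')
print(tuple(format(c, "02X") for c in white))  # ('FF', 'FF', 'FF')
```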

In the pre-processing step, the images are converted to a greyscale representation. This conversion makes it much easier to apply feature extraction techniques. Next, we discuss a sample conversion.

#### *4.2. Conversion from RGB to Greyscale*

$$new\_{pixel} = 0.292 \cdot R\_{old\_{pixel}} + 0.594 \cdot G\_{old\_{pixel}} + 0.114 \cdot B\_{old\_{pixel}} \tag{1}$$

The above formula is a transformation from OpenCV library [50]. An example of a conversion is in Figure 1.
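Applied per pixel, the conversion is a weighted sum of the three channels. The sketch below uses the coefficients exactly as printed in Equation (1); note that OpenCV's documented BT.601 weights are 0.299, 0.587 and 0.114:

```python
import numpy as np

def rgb_to_grey(img, weights=(0.292, 0.594, 0.114)):
    """Weighted-sum greyscale conversion of an H x W x 3 uint8 image.

    Default weights follow Equation (1); OpenCV's BT.601 weights would
    be (0.299, 0.587, 0.114).
    """
    grey = img.astype(np.float64) @ np.array(weights)
    return np.clip(np.rint(grey), 0, 255).astype(np.uint8)
```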

**Figure 1.** ASMOW quick-look—demonstration of conversion to greyscale using the formula (Section 4.2).

In the next section we discuss the technique of image features extraction based on pixel convolution [51–53].

#### *4.3. Feature Extraction Based on Pixel Convolution*

Figure 2 shows an example of how pixel convolution works, using a 3 × 3 Gaussian blur mask.

**Figure 2.** Exemplary convolution—first two steps based on a 3 × 3 Gaussian blur mask; $A_{x,y} \otimes mask = \sum_{i=1}^{mask\_width} \sum_{j=1}^{mask\_height} A_{x+i-1,\,y+j-1} \cdot mask_{ij}$, where $(x, y)$ are the top-left coordinates of the convolved pixels.
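A minimal "valid" (no padding) convolution matching Figure 2 can be sketched in NumPy as follows; this is an illustration, not the authors' implementation:

```python
import numpy as np

def convolve2d_valid(image, mask):
    """Slide the mask over the image with no padding, as in Figure 2."""
    mh, mw = mask.shape
    h = image.shape[0] - mh + 1
    w = image.shape[1] - mw + 1
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = np.sum(image[y:y + mh, x:x + mw] * mask)
    return out

# The 3x3 Gaussian blur mask used in the example; it sums to 1,
# so constant regions are left unchanged.
gauss = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0
```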

Other useful techniques applied here are image filtering methods based on thresholding [51,52,54]. These methods allowed us to focus on areas that are visible with a fixed, properly defined frequency or that belong to a defined range.

#### *4.4. Thresholding*

We used two types of thresholding [52]. The first type assigns a fixed value (e.g., black) to image pixels that do not exceed a certain fixed normalized threshold of color saturation.

The second method uses a histogram of pixel values and filters out (e.g., by assigning black) those values that belong to a fixed frequency range.
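Both thresholding variants can be sketched as follows; the concrete thresholds and the histogram-range interpretation are our assumptions, as the text does not give exact values:

```python
import numpy as np

def threshold_saturation(grey, t=0.5):
    """Type 1: black out pixels below a normalized saturation threshold t."""
    out = grey.copy()
    out[grey < t * 255] = 0
    return out

def threshold_histogram(grey, low, high):
    """Type 2: black out pixel values whose histogram frequency lies in
    [low, high] occurrences (one illustrative reading of the method)."""
    counts = np.bincount(grey.ravel(), minlength=256)
    banned_values = np.where((counts >= low) & (counts <= high))[0]
    out = grey.copy()
    out[np.isin(grey, banned_values)] = 0
    return out
```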

The last method we tested for an artefact classifier was the use of structural filtering with the nearest pixel neighbors [51,52,55,56].

#### *4.5. Nearest Neighbor Structure Filtering*

We also applied a technique of filtering the image structure by replacing a pixel with its closest neighbor in the sense of the Manhattan distance [57,58]. This filtering eliminates unstructured pixels in the image and allowed us to better discriminate images with artefacts.
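One possible reading of this filter, removing foreground pixels that have no foreground neighbor within a given Manhattan distance, can be sketched as follows; the authors' exact rule is not spelled out in the text, so this is purely an illustration:

```python
import numpy as np

def remove_isolated(binary, d=1):
    """Drop foreground pixels with no other foreground pixel within
    Manhattan distance d (an illustrative notion of 'unstructured')."""
    h, w = binary.shape
    out = binary.copy()
    ys, xs = np.nonzero(binary)
    for y, x in zip(ys, xs):
        has_neighbor = False
        for dy in range(-d, d + 1):
            # |dy| + |dx| <= d bounds the Manhattan neighborhood.
            for dx in range(-d + abs(dy), d - abs(dy) + 1):
                if (dy or dx) and 0 <= y + dy < h and 0 <= x + dx < w \
                        and binary[y + dy, x + dx]:
                    has_neighbor = True
        if not has_neighbor:
            out[y, x] = 0
    return out
```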

As a reference classifier defined in the following section, we used the Convolutional Neural Network (CNN) [59,60].

#### *4.6. Exemplary Deep Neural Network Architecture as Referenced Classifier*

To verify the effectiveness of our artefact extraction method, we used a simple deep neural network [59,61], defined in this section, for classification. We used Python-related tools such as PyTorch, TorchVision and NumPy. The visualization of results was performed using Seaborn and Matplotlib. A transformation was applied when loading images, scaling everything to 400 × 600 pixels to ensure the same input size for the network. The data were randomly divided into training and test sets in an 80/20 ratio. A simple network with two convolutional layers, linear transformations and pooling was proposed. The activation function was ReLU (*f*(*colour saturation*) = *max*(0, *colour saturation*)), and the loss function was categorical cross-entropy (thus it can be higher than one).

The Adam optimizer [62] was used. The training was done over 15 epochs. To split the data, the following function was used:

*train*\_*dataset*, *test*\_*dataset* = *torch*.*utils*.*data*.*random*\_*split*(*dataset*, [*train*\_*size*, *test*\_*size*])

A detailed definition of the neural network is given in Listing 1.

Listing 1: Neural network configuration.

```