**Visual and Camera Sensors**

Editors

**Kang Ryoung Park Sangyoun Lee Euntai Kim**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors* Kang Ryoung Park Division of Electronics and Electrical Engineering Dongguk University Seoul, South Korea

Sangyoun Lee School of Electrical and Electronic Engineering Yonsei University Seoul, South Korea

Euntai Kim School of Electrical and Electronic Engineering Yonsei University Seoul, South Korea

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Sensors* (ISSN 1424-8220) (available at: www.mdpi.com/journal/sensors/special_issues/vcSensors).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-1584-7 (Hbk) ISBN 978-3-0365-1583-0 (PDF)**

© 2021 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**



## **About the Editors**

#### **Kang Ryoung Park**

Kang Ryoung Park received his B.S. and M.S. degrees in electronic engineering from Yonsei University, Seoul, South Korea, in 1994 and 1996, respectively. He received his Ph.D. degree in electrical and computer engineering from Yonsei University in 2000. He has been a Professor in the Division of Electronics and Electrical Engineering at Dongguk University since March 2013. His research interests include image processing and deep learning.

#### **Sangyoun Lee**

Sangyoun Lee received B.S. and M.S. degrees in electrical and electronic engineering from Yonsei University, Seoul, South Korea, in 1987 and 1989, respectively, and a Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 1999. He is currently a Professor and the Head of Electrical and Electronic Engineering at the Graduate School and the Head of the Image and Video Pattern Recognition Laboratory at Yonsei University. His research interests include all aspects of computer vision, with a special focus on pattern recognition for face detection and recognition, advanced driver-assistance systems, and video codecs.

#### **Euntai Kim**

Euntai Kim received B.S., M.S., and Ph.D. degrees in Electronic Engineering, all from Yonsei University, Seoul, Korea, in 1992, 1994, and 1999, respectively. From 1999 to 2002, he was a Full-Time Lecturer in the Department of Control and Instrumentation Engineering, Hankyong National University, Kyonggi-do, Korea. Since 2002, he has been with the faculty of the School of Electrical and Electronic Engineering, Yonsei University, where he is currently a Professor. He was a Visiting Researcher with the Berkeley Initiative in Soft Computing, University of California, Berkeley, CA, USA, in 2008, and with the Korea Institute of Science and Technology (KIST), Korea, in 2018. His current research interests include computational intelligence, statistical machine learning, and deep learning, and their applications to intelligent robotics, autonomous vehicles, and robot vision.

## **Preface to "Visual and Camera Sensors"**

Recent developments have led to the widespread use of visual and camera sensors, such as visible light, near-infrared (NIR), and thermal camera sensors, in a variety of applications in video surveillance, biometrics, image compression, computer vision, image restoration, etc. While the existing technology has matured, its performance is still affected by various environmental conditions, and recent approaches have attempted to use multimodal camera sensors and to fuse deep learning techniques with conventional methods to guarantee higher accuracy. The goal of this Special Issue was to invite high-quality, state-of-the-art research papers dealing with challenging issues in visual and camera sensors. We solicited original, unpublished, completed research papers not under review by any other conference, magazine, or journal. Topics of interest included, but were not limited to, the following:





> **Kang Ryoung Park, Sangyoun Lee, Euntai Kim** *Editors*

## *Article* **An Optimum Deployment Algorithm of Camera Networks for Open-Pit Mine Slope Monitoring**

**Hua Zhang 1,†, Pengjie Tao 1,†, Xiaoliang Meng 1,\*, Mengbiao Liu 1 and Xinxia Liu 2**


† These authors contributed equally to this work.

**Abstract:** With the growth in demand for mineral resources and the increase in open-pit mine safety and production accidents, the intelligent monitoring of open-pit mine safety and production is becoming more and more important. In this paper, we elaborate on the idea of combining the technologies of photogrammetry and camera sensor networks to make full use of open-pit mine video camera resources. We propose the Optimum Camera Deployment algorithm for open-pit mine slope monitoring (OCD4M) to meet the requirements of high photogrammetric overlap and full monitoring coverage. The OCD4M algorithm is validated and analyzed under simulated conditions of camera quantity, view angle, and focal length at different monitoring distances. To demonstrate the availability and effectiveness of the algorithm, we conducted field tests and developed a mine safety monitoring prototype system which can alert people to slope collapse risks. The simulation results show that the algorithm can effectively calculate the optimum quantity of cameras and the corresponding coordinates, with an accuracy of 30 cm at 500 m (for a given camera). Additionally, the field tests show that the algorithm can effectively guide the deployment of mine cameras and support 3D inspection tasks.

**Keywords:** camera networks; open-pit mine slope monitoring; optimum deployment; close range photogrammetry; three-dimensional reconstruction; OCD4M

#### **1. Introduction**

Slope damage results in serious disasters that cause thousands of deaths and injuries and extensive property damage every year [1]. This poses a serious threat to people working in open-pit mines with slopes. Slope damage in open-pit mines occurs at three scales: bench damage, interslope damage, and overall damage [2]. With economic development and the rapid growth of the demand for mineral resources, the exploitation of mine enterprises continues to increase. Many hillside open-pit mines are transformed into deep mines, which increases the overall slope angles and, consequently, the landslide risk [3]. Statistics on industrial accidents that occurred during the 2005–2010 open-pit coal production period in Turkish coal companies indicate that the most likely risks in open-pit mines are related to mine slopes [4]. According to US Centers for Disease Control and Prevention (CDC) statistics for mine disasters in 2014, mine slope-related accidents were most often reported in quarry operations, accounting for 33.3% of all accidents [5]. In total, 40% of Chinese open-pit mines have slope stability problems. According to statistics, there were 240 slope collapses and 369 fatalities from 2013 to 2017, ranking second amongst noncoal mine accidents in China [6]. Consequently, it is especially important to monitor the slopes of open-pit mines.

**Citation:** Zhang, H.; Tao, P.; Meng, X.; Liu, M.; Liu, X. An Optimum Deployment Algorithm of Camera Networks for Open-Pit Mine Slope Monitoring. *Sensors* **2021**, *21*, 1148. https://doi.org/10.3390/s21041148

Academic Editor: Kang Ryoung Park Received: 30 December 2020 Accepted: 4 February 2021 Published: 6 February 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

For mine slope safety monitoring, scholars have used geodetic methods, 3S technology, photogrammetry, Synthetic Aperture Radar (SAR), 3D laser scanning, and other methods in associated research. Geodetic survey methods [7], such as levels, theodolites, rangefinders, and other distance measurement equipment, are widely used in the establishment of high-precision control networks of mines [8]. In addition, high-precision deformation monitoring is mature, reliable, and accurate; however, for open-pit mine slope monitoring, it has the disadvantages of requiring a large degree of manual involvement, being influenced by terrain access and climate, and being unable to automate monitoring, which results in low detection efficiency. Scholars have therefore begun to focus more on other techniques for mine slope safety monitoring. Manconi et al. [9] used surveying robot data to monitor the mine slope and simplify the complex deformation of the slope, which gained support from relevant departments due to good experimental results. Wang et al. [10] integrated Global Positioning System (GPS)/pseudo-satellite (PL) positioning technology to improve the position settlement accuracy and provide a high-precision monitoring model for slope monitoring in open-pit mines. Akbar et al. [11] employed the integration of GPS, the Geographic Information System (GIS), and Remote Sensing (RS) to produce a landslide disaster map. Zeybek et al. [12] used a long-range terrestrial laser scanner to monitor the Taşkent landslide (Konya, Turkey). Liao et al. [13] applied high-resolution SAR data to monitor landslides in the Three Gorges Reservoir area in China and were able to identify the precise location, deformation, and time range of the landslides more accurately. Tang et al. [14] used new-generation SAR satellites (Sentinel-1 and TerraSAR-X) to map surface displacements and slope instability at three open-pit mines in the Rhenish coalfield in Germany, in order to provide a long-term monitoring solution for open-pit mining and its operations. Wang et al. [15] employed inclined photogrammetry to generate a mine Digital Surface Model (DSM) and carried out the construction of a Digital Elevation Model (DEM) for an open-pit mine. Alameda-Hernández et al. [16] used ultra-close-range terrestrial digital photogrammetry to monitor the stability of soft foliated rocky slopes and analyzed the errors arising from rock weathering, using the example of soft rocks in Alpujarras (Andalusia, Spain). Tong et al. [17] used Unmanned Aerial Vehicle (UAV) photogrammetry and ground-based laser scanning for open-pit mine inspection and three-dimensional (3D) mapping, aligning image data with point cloud data and classifying land cover. González-Díez et al. [18] elaborated on methods which use digital photogrammetry to accurately measure slope changes caused by landslides.

Considering the use of each monitoring method and the research results of the above-mentioned references, the characteristics of each method and its corresponding scope of application are summarized in Table 1. Among the methods applied for open-pit mine monitoring, close-range photogrammetry is a moderate option which provides efficient and inexpensive measurement, especially compared to the other usual methods such as laser scanning, Interferometric Synthetic Aperture Radar (InSAR), Light Detection and Ranging (LiDAR), etc. [19–21]. Photogrammetry requires capturing images with a degree of overlap, which places demands on the deployment of cameras.

The visual camera network, as a type of sensor network, is a spatially distributed network of smart cameras that collects and processes multimedia information to transform scene images into a more useful form [22]. Visual sensors can perceive more information than ordinary sensors, and visual sensor networks can handle higher-level visual tasks than single vision sensors [23]; for example, Kulkarni et al. [24] designed the multilayer camera network senseEye for object monitoring, identification, and tracking. In addition, coverage is an important aspect when evaluating the quality of detection of multiple regions of interest in visual sensor networks and is an important research direction for camera networks. Related studies on the coverage problem of visual sensor networks have been conducted, and algorithms have been designed to obtain the maximum coverage with the most optimal camera deployment scheme [25,26]. Based on this research, it is a natural choice to introduce the idea of visual sensors into the field of slope risk monitoring; to make full use of the multimedia resources of the cameras; and to realize functions such as 3D slope monitoring, tramcar positioning, and video monitoring.

**Table 1.** Characteristics and application scope of the main monitoring methods employed for open-pit mine slopes.


The intelligent application of multicamera video data in mining is an important aspect of smart mine construction, and a number of studies have applied photogrammetry to the digitization of mines [27], reconstructing visual data in three dimensions and measuring parameters such as slope deformation and slope gradient. Giacomini et al. [21] used close-range photogrammetry to continuously monitor the rock surface, in order to assess the potential rockfall risk and estimate the area of impact. Aggarwal et al. [28] implemented an Internet of Things (IoT) landslide monitoring system based on a Raspberry Pi with a camera; it analyzed the area in real time based on the video stream obtained from the camera and applied computer vision algorithms to detect landslides. This method can only monitor the occurrence of large landslides, cannot provide an early warning, and only works over short distances. A camera can also be mounted on a UAV to reconstruct the mine pit in 3D and obtain a DSM, extracting the comparative elevation, slope, slope direction, surface fluctuations, and surface roughness distribution and performing crack analysis [29–31]. Kromer et al. [32] used digital Single Lens Reflex (SLR) cameras to form a camera system; their photogrammetric workflow for mine slope monitoring achieved a high level of accuracy, but it does not describe camera network deployment options for given mine conditions and camera parameters, nor the corresponding budgets. The intelligent use of video camera data at a mine site for slope collapse risk monitoring is a trend in smart mine construction, and the rationalization of camera deployment is one of its most important aspects. At present, optimized camera network deployment rarely occurs in open-pit mines.

To solve the above problem, we propose an optimum deployment algorithm of camera networks for open-pit mine slope monitoring. The remainder of the paper is structured as follows: Section 2 presents the optimum deployment algorithm for open-pit mine landslide monitoring; Section 3 describes the experimental simulation of the algorithm, the field test, and the demonstration; finally, Section 4 provides the discussion and conclusions.

#### **2. Materials and Methods**

This study introduces the idea of combining visual camera observation with digital photogrammetry, and designs the Optimum Camera Deployment algorithm for open-pit mine slope monitoring (OCD4M). For a given mine slope, an observation platform, and camera parameters, the algorithm determines the deployment scheme with the minimum quantity of cameras and the optimal camera positions, in order to meet the need for overlap in 3D monitoring by close-up photogrammetry and the need for full coverage of safety monitoring, as shown in a simple deployment schematic in Figure 1. In this section, we introduce the OCD4M algorithm in terms of monitoring object description, mathematical model, mine surface preprocessing, and deployment algorithm workflow.

**Figure 1.** Sketch deployment diagram for open-pit mine slope monitoring. At the top left, the blue area represents the bench work surface, the yellow area represents the bench slope and the red area represents the risk exceeding the threshold.

#### *2.1. Description of Targets and Criteria for Monitoring*

Figure 2 shows the composition of the mine's slope. The object of monitoring is to calculate the bench width, bench height, bench slope angle, and overall slope angle of the open-pit slope, in order to meet safety requirements, and to provide early warning if the threshold values are exceeded. The required calculation accuracy is at the decimeter level.

#### *2.2. Model Description and Problem Definition*

As shown in Figure 3, the camera sensor network contains a number (N) of sensors (S), which are deployed at the observation platform (B) that monitors the target surface (A).

$$S = \{s_1, s_2, \dots, s_N\}, \quad A: f(x, y, z) = 0, \quad B: g(x, y, z) = 0 \tag{1}$$

From a geometric point of view, for each sensor, the sensing area is defined by the tuple C(i) : (s<sub>i</sub>(x, y, z), D, α, θ), where s<sub>i</sub>(x, y, z) is the sensor's position, D is the photography distance, α is the sensing angle, and θ is the angle by which the camera deviates from the normal-case photography direction. The aims are to ensure that the target surface is fully covered by the sensing area and that the image overlap rate between adjacent sensors is more than 80%. These aims are defined by the following three formulas:

$$A \subseteq C_{s(1)} \cup C_{s(2)} \cup \dots \cup C_{s(N)}, \tag{2}$$

$$S_{C_{s(i)} \cap C_{s(i+1)}} \ge 0.8\,S_{C_{s(i)}}, \tag{3}$$

$$s_i(x, y, z) \subseteq B. \tag{4}$$

**Figure 2.** Schematic diagram of the monitoring object. (**a**) Schematic diagram of the monitoring of the open-pit slope by cameras. (**b**) Section of the mine slope with the various parts of the open-pit mining.

**Figure 3.** Geometric deployment of the camera sensor network, where curve A is the surface, B is the observation platform, s(i) is the camera sensor, the area corresponding to the two dashed lines is the shooting range C(i), α is the field of view, and D is the shooting distance.
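To make these constraints concrete, the sketch below checks them for a one-dimensional simplification in which each camera footprint on the target surface is an interval; the interval representation and all function names are illustrative assumptions rather than code from the paper.

```python
# Minimal sketch of constraints (2)-(4), assuming a 1D simplification of
# the surface A in which each camera footprint C(i) is an interval
# [lo, hi]. All names here are illustrative, not from the paper.

def covers(footprints, a_lo, a_hi):
    """Constraint (2): the union of footprints must contain [a_lo, a_hi]."""
    reach = a_lo
    for lo, hi in sorted(footprints):
        if lo > reach:              # a gap in coverage was found
            return False
        reach = max(reach, hi)
    return reach >= a_hi

def overlaps_ok(footprints, ratio=0.8):
    """Constraint (3): adjacent footprints overlap by at least `ratio`."""
    fps = sorted(footprints)
    for (lo1, hi1), (lo2, hi2) in zip(fps, fps[1:]):
        overlap = max(0.0, min(hi1, hi2) - max(lo1, lo2))
        if overlap < ratio * (hi1 - lo1):
            return False
    return True

def on_platform(positions, b_lo, b_hi):
    """Constraint (4): every camera position must lie on the platform B."""
    return all(b_lo <= x <= b_hi for x in positions)
```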

#### *2.3. Discussion of Different Situations*

The mine coordinate system is established as shown in Figure 4, with the vertical direction as the *Z*-axis, the photographic direction as the *Y*-axis, and the plane perpendicular to the plane formed by *Z* and *Y* as the *X*-axis.

In order to sense the whole *Z*-direction (Figure 4) area, one or more cameras need to be deployed. As shown in Figure 5a, when the sensing area of a single camera can cover the *Z*-direction of the mine face, one camera is sufficient; if not, K cameras, as shown in Figure 5b, need to be deployed at the same monitoring point to cover the whole *Z*-direction, and the overlap between the monitoring areas of the K deployed cameras must also satisfy the photogrammetry requirements.

**Figure 4.** Coordinate system established within the mine.

**Figure 5.** Observation coverage in the Z-direction. (**a**) The case where a single camera can cover the Z-direction of the slope of the open-pit mine, where a single camera is deployed on the right, with the dashed line showing the range of shots and the black curve showing the Z-direction of the mine slope. (**b**) The case where K cameras cover the Z-direction of the broken face of the mine, similar to (**a**), where multiple cameras are deployed on the right, with multiple dashed lines and ellipses indicating the range of the multiple cameras.


In the XY plane, we simplify the mine surface. In consideration of the complexity of the stereoscopic surface, and for computational convenience, the mining surface is simplified twice. The slope toe of the mine is closest to the observation platform, and for camera sensors, the closer the observation, the smaller the observation range. Consequently, we can simplify the surface in Figure 6 to a curve located at the bottom of the mine.

**Figure 6.** Mine surface simplification: in the XY plane, the red curve is the slope-toe line closest to the camera. The closer the distance, the smaller the camera's sensing range; if this line can be fully covered, the other heights of the slope must also be covered, so we chose this line to represent the mine surface in the XY plane.

In order to reduce the complexity of the calculation, we replace the curve obtained from the simplification in Figure 7 with a straight line. Specifically, least squares is used to find the best-fitting line: several points are selected on the curve at fixed intervals and used to solve the linear equation

$$y = ax + b, \tag{5}$$

for the parameters a and b, where a and b are calculated by the following equations:

$$a = \frac{\mathrm{Num} \sum x_i y_i - \sum x_i \sum y_i}{\mathrm{Num} \sum x_i^2 - \left(\sum x_i\right)^2}, \quad b = \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}{\mathrm{Num} \sum x_i^2 - \left(\sum x_i\right)^2}, \quad i = 1, 2, \dots, \mathrm{Num}, \tag{6}$$


where Num is the number of data points (red points in Figure 7).


**Figure 7.** Slope toe simplification: the diagram shows how the broken line is finally reduced to a straight line by line fitting.
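As a worked example of Equations (5) and (6), the closed-form least-squares estimates can be computed directly; the sample points below are invented purely for illustration.

```python
# Closed-form least-squares fit of y = a*x + b, following Equation (6).
# The sampled points are fictitious stand-ins for points taken at fixed
# intervals along the simplified slope-toe curve.

xs = [0.0, 10.0, 20.0, 30.0, 40.0]
ys = [1.2, 1.9, 3.1, 3.8, 5.1]

num = len(xs)                                  # "Num" in Equation (6)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

denom = num * sxx - sx * sx
a = (num * sxy - sx * sy) / denom              # slope
b = (sy * sxx - sx * sxy) / denom              # intercept
print(f"fitted line: y = {a:.4f} x + {b:.4f}")
```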

Depending on the observation platform and the mine surface, two situations arise, as Figure 8 shows. In the X-direction, when the length of the observation is greater than that of the mine surface, we adopt normal case photography; otherwise, we adopt convergent photography, which has a larger coverage area, in order to meet the requirement.

#### *2.4. The Optimum Deployment Algorithm*

The OCD4M algorithm is used to solve the camera sensor deployment problem. The algorithm starts with the input of the camera sensor parameters, mine face parameters, observation platform parameters, and photographic distances. The next step is to simplify the corresponding mine face in conjunction with the mining extent. Finally, the minimum quantity of cameras and their coordinates are calculated. The workflow of the algorithm is shown in Figure 9.

**Figure 8.** Camera sensor deployment in the X-direction: (**a**) the use of normal case photography and (**b**) the use of convergent photography to cover the entire mine surface. A represents the observed mine surface, the dotted line represents the photographic area, α represents the field of view, s(i) represents the camera monitoring point, and B represents the observation platform.

**Figure 9.** Workflow of the Optimum Camera Deployment algorithm for open-pit mine slope monitoring (OCD4M) algorithm.

The input parameters for the algorithm are the parameters of the camera sensors (mainly the camera's field of view α and focal length f), the shooting distance D, the mine plane A to be observed, and the observation platform B. The sensing range C of the camera sensor can be calculated from the camera parameters and the shooting distance using the following formula:

$$C_i = 2 \cdot D_{\mathrm{avg}} \cdot \tan\frac{\alpha}{2}. \tag{7}$$

In the above equation, C<sub>i</sub> is the camera sensing range, comprising C<sub>ix</sub> and C<sub>iz</sub>; D<sub>avg</sub> is the camera distance; and α is the field of view of the camera.

The next step calculates the number of cameras K per surveillance point to achieve full coverage of the open-pit slope in the Z-direction, calculated according to the following formula:

$$C_{iz} + 0.2 \cdot K \cdot C_{iz} \ge A_Z, \tag{8}$$

where C<sub>iz</sub> is the length of the photographic area C<sub>i</sub> in the Z-direction, K is the quantity of cameras, and A<sub>Z</sub> is the extent of the mine slope in the Z-direction.

When the X-direction of the open-pit mine slope A is greater than the X-direction range of the observation platform B, normal case photography is used, and the minimum quantity of monitoring points N is calculated according to the following equation:

$$C_{ix} + 0.2 \cdot N \cdot C_{ix} \ge A_X. \tag{9}$$

In the above equation, C<sub>ix</sub> is the camera area in the X-direction, N is the minimum quantity of monitoring points, and A<sub>X</sub> is the extent of the mine slope in the X-direction.

Otherwise, convergent photography is used, and the minimum quantity of cameras N can be obtained from

$$C_i = \begin{cases} D_{\mathrm{avg}} \cdot \tan\left(\theta + \frac{\alpha}{2}\right) - D_{\mathrm{avg}} \cdot \tan\left(\theta - \frac{\alpha}{2}\right), & \theta \ge \frac{\alpha}{2} \\ D_{\mathrm{avg}} \cdot \tan\left(\frac{\alpha}{2} - \theta\right) + D_{\mathrm{avg}} \cdot \tan\left(\theta + \frac{\alpha}{2}\right), & \theta < \frac{\alpha}{2} \end{cases} \tag{10}$$

In the above equation, C<sub>i</sub> is the camera area in the X-direction, D<sub>avg</sub> is the camera distance, α is the field of view of the camera, θ is the angle of deviation with respect to the orthogonal direction, and A<sub>X</sub> is the extent of the mine slope in the X-direction. The range of camera areas C<sub>min</sub>~C<sub>max</sub> is calculated to solve for N<sub>min</sub>~N<sub>max</sub> according to Formula (9). Based on the resulting quantity of cameras, the coordinates of each camera on the mining plane can be calculated through the following formula:

$$x = x_0 + \frac{D}{D_{\mathrm{max}}} \cdot \frac{B_x}{N - 1}, \quad y = B_{y\mathrm{max}}, \tag{11}$$

where y takes its maximum value in order to achieve greater coverage, and the total quantity of cameras is K × N.

The whole workflow of Algorithm 1 can also be described by the following pseudocode.

#### **Algorithm 1.** OCD4M

**Require:**

Camera sensor set S = {s<sub>1</sub>, s<sub>2</sub>, . . . , s<sub>N</sub>}; target surface A : f(x, y, z) = 0; observation platform B : g(x, y, z) = 0; photography distance D; sensing angle α.

**Ensure:**

A is covered by the set of C(i) : (s<sub>i</sub>(x, y, z), D, α, θ), with the degree of overlap of adjacent *C*(*i*) greater than 80%: (1) A ⊆ C<sub>s(1)</sub> ∪ C<sub>s(2)</sub> ∪ . . . ∪ C<sub>s(N)</sub>; (2) S<sub>Cs(i)∩Cs(i+1)</sub> ≥ 0.8S<sub>Cs(i)</sub>; (3) s<sub>i</sub>(x, y, z) ⊆ B.

#### **Process:**

1: Compute the range of A: A<sub>Z</sub> = A<sub>zmax</sub> − A<sub>zmin</sub>; A<sub>X</sub> = A<sub>xmax</sub> − A<sub>xmin</sub>
2: Simplify the mine surface in the XY plane and fit the slope-toe line (Equations (5) and (6))
3: Compute the sensing range C<sub>i</sub> from D and α (Equation (7))
4: Compute the quantity of cameras K per monitoring point for full Z-direction coverage (Equation (8))
5: If A<sub>X</sub> > B<sub>X</sub>, use normal case photography and compute N (Equation (9))
6: Else, use convergent photography: compute C<sub>min</sub>~C<sub>max</sub> (Equation (10)) and solve N<sub>min</sub>~N<sub>max</sub> (Equation (9))
7: Compute the coordinates of each monitoring point (Equation (11)), with y = B<sub>ymax</sub>
8: Return the K × N cameras and their coordinates
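The computation in Algorithm 1 can be sketched by stringing Equations (7)–(11) together. The Python below is our reconstruction under stated assumptions: the choice between normal and convergent photography follows the rule in this section, coverage counts are rounded up, and the even spacing of monitoring points across the platform is one reading of the ambiguous Equation (11). It is not the authors' implementation.

```python
import math

def ocd4m(alpha_h, alpha_v, d_avg, a_x, a_z, b_x, theta=0.0):
    """Sketch of OCD4M reconstructed from Eqs. (7)-(11).

    alpha_h, alpha_v: horizontal/vertical field of view (rad);
    d_avg: photography distance (m); a_x, a_z: mine-face extent (m);
    b_x: observation-platform extent (m); theta: deviation from the
    normal-case direction, used for convergent photography only.
    """
    # Eq. (7): sensing range per axis at distance d_avg.
    c_x = 2.0 * d_avg * math.tan(alpha_h / 2.0)
    c_z = 2.0 * d_avg * math.tan(alpha_v / 2.0)

    # Eq. (8): cameras per monitoring point for full Z coverage with
    # 80% overlap (each extra camera adds 0.2 * c_z of new coverage).
    k = max(1, math.ceil((a_z - c_z) / (0.2 * c_z)))

    if a_x <= b_x:
        # Convergent photography enlarges the footprint (Eq. (10)).
        if theta >= alpha_h / 2.0:
            c_x = d_avg * (math.tan(theta + alpha_h / 2.0)
                           - math.tan(theta - alpha_h / 2.0))
        else:
            c_x = d_avg * (math.tan(alpha_h / 2.0 - theta)
                           + math.tan(theta + alpha_h / 2.0))

    # Eq. (9): minimum number of monitoring points along X.
    n = max(2, math.ceil((a_x - c_x) / (0.2 * c_x)))

    # Eq. (11), as we read it: spread the N points evenly across the
    # platform extent b_x, with y fixed at its maximum for coverage.
    xs = [i * b_x / (n - 1) for i in range(n)]
    return k, n, xs

# Field-test parameters from Section 3.2 (approximate values).
k, n, xs = ocd4m(alpha_h=math.radians(32.69), alpha_v=math.radians(24.81),
                 d_avg=398.97, a_x=493.22, a_z=72.86, b_x=225.72)
print(k, n)   # -> 1 camera per point, 6 monitoring points
```

Under these assumptions, Equation (9) yields six monitoring points with one camera each for the Section 3.2 field-test parameters, consistent with the six cameras reported in Table 3.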

#### **3. Implementation and Results**

#### *3.1. Simulation of Experimental Tests*

In this section, the results of the simulation experiments are given. The experimental tests focus on simulating the quantity of cameras and camera accuracy at different distances, and the quantity of cameras and camera accuracy at different field of view angles. For small to medium-sized open-pit mines and some large mines, 500 m is a relatively adequate monitoring distance. A greater distance means a lower accuracy, so beyond 500 m it is necessary to improve the quality of the camera or add other means to control the accuracy, which means an increase in cost. This is unacceptable for small to medium-sized mines with low revenues, so we chose 500 m as the maximum camera distance for our simulations. The main test condition parameters are as follows:

- A<sub>Z</sub> = 100 m
- A<sub>X</sub> = 500 m

#### 3.1.1. Quantity and Precision Analysis

The camera focal length was 8 mm, the horizontal field of view angle was 32.69°, and the vertical field of view angle was 24.81°. Assuming that the monitoring distance was between 50 and 500 m, the minimum quantity of cameras and the resolutions were calculated as shown in Figure 10.


**Figure 10.** Quantity and accuracy of cameras at different distances. (**a**) The minimum quantity of cameras calculated for different distances; the further the distance, the fewer cameras are needed. (**b**) The field distance each pixel represents increases with distance, which means that the accuracy decreases.

As shown in Figure 10a, according to Equation (7), the coverage becomes larger as the distance becomes larger whilst the camera remains the same, which means that the entire mine surface can be covered using fewer cameras. It can be seen from Figure 10b that the resolution of the object decreases as the distance increases, according to Formula (12):

$$\frac{f}{D} = \frac{\mathrm{pixel}}{S}, \tag{12}$$

where f is the focal length, D is the photographic distance, pixel is the size of each image element of the image, and S is the field distance represented by an image element.

As the distance increases, the actual distance of an object represented by a pixel becomes larger, which means that the accuracy decreases. The accuracy can reach 30 cm at 500 m, which meets the requirements for the calculation of slope parameters in open-pit mines and can be used for early warnings on slopes.
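For instance, Equation (12) rearranges to S = pixel · D / f. With the 8 mm lens used in this simulation and an assumed pixel pitch of 4.8 µm (the paper does not state the sensor's pixel size), the ground resolution at several distances works out as follows.

```python
# Ground resolution from Equation (12): S = pixel * D / f.
# The 4.8 um pixel pitch is an assumed value chosen for illustration;
# the paper does not report the sensor's pixel size.

F = 0.008         # focal length in metres (8 mm)
PIXEL = 4.8e-6    # assumed pixel pitch in metres

for d in (50, 100, 200, 500):      # photographic distance in metres
    s = PIXEL * d / F              # field distance covered by one pixel
    print(f"D = {d:3d} m  ->  S = {s * 100:4.1f} cm/pixel")
# At 500 m this gives 30.0 cm/pixel, matching the reported accuracy.
```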

#### 3.1.2. Focal Length, Field of View Angle, and Quantity Analysis

Assuming that the distance was 200 m, we analyzed the quantity of cameras and object resolution results for different focal lengths (2.8–25 mm) and field of view angle conditions.

The calculation results are shown in Table 2. It can be seen that, as the focal length increases, the field of view angle decreases accordingly and the quantity of cameras required gradually increases. This is because, as the field of view angle decreases, the sensing range of the camera decreases, and more camera sensors are needed to meet the coverage requirements. In addition, it can be seen that the accuracy increases as the focal length increases, because, at the same distance, the field range that each pixel represents becomes smaller, allowing the accuracy to increase. With a camera focal length of 2.8 mm, an accuracy of 34 cm can be achieved at 200 m.


Commonly, for open-pit mine slope damage alerts, the resolution of the monitoring should not be coarser than 50 cm. Most open-pit mines have high-definition cameras which can meet this requirement, but they are only used for manual monitoring; moreover, the current deployment of these cameras is not suitable for slope monitoring, and more cameras facing the mine slope would need to be deployed to realize slope damage risk monitoring. With our OCD4M algorithm, such deployments can be made optimal: the algorithm calculates the minimum quantity of cameras needed to achieve the large overlap and full coverage required, making the most of the mine's video and multimedia resources.

#### *3.2. Field Testing and Demonstration*

The testing field is the Shunxing Quarry in Guangzhou, Guangdong, China, located at (113.518859 E, 23.405858 N), as shown in Figure 11. This is a typical medium-sized open-pit mine. The observed slope length of the mine is 493.22 m, the length of the mining platform is 225.72 m, and the slope height difference is 72.86 m. The photography distance is 398.97 m. The proposed camera model is the HIKVISION DS-IPC-B12H-I, with an 8 mm focal length, 1/2.7" sensor size, 32.69° horizontal field of view angle, and 24.81° vertical field of view angle. The mine parameters, camera parameters, etc. were input into the OCD4M algorithm to calculate the minimum quantity of cameras and the coordinates of the monitoring points. A network of cameras was built in the field according to the coordinates, and the 3D monitoring of the mine was automated, without human participation, by the data processing system we developed. Working at the decimeter level of accuracy, we measured the slope of the open pit, the height and width of the mining benches, and the volume of material mined.

**Figure 11.** Study field located in Shunxing Quarry in Guangzhou, Guangdong, China.

In the field, GPS sampling was used to locate camera points and set up the cameras. Multiple photos extracted from camera videos were prepared as the inputs. We used Smart3D software for the data preprocessing, aerial triangulation, dense matching of oblique images, DSM point cloud generation, triangulated irregular network (TIN) construction, texture mapping, model modification, and other processes, in order to produce a realistic 3D model, as Figures 12 and 13 show.

**Figure 12.** 3D model of the study field.

**Figure 13.** Schematic diagram of the camera field deployment. ① Schematic view of the slope of the mine. ② Pictures of the field installation, including pole tower transportation, drilling, and camera installation. ③ Schematic view of the Shunxing Quarry monitoring.

Figure 14 illustrates the open-pit mine slope monitoring system which we developed for visualizing the results of the algorithm and for issuing early warnings during safety monitoring. It can be used to generate the field camera deployment solution, analyze the slope of the mine, and measure the width and height of the working bench to assess and warn about mine slope damage risks. It also integrates 3D reconstruction, 3D monitoring, and 3D visualization of the mine slope.


As calculated by the OCD4M algorithm, the camera sensing range is between 234.62 and 290.58 m, the minimum quantity of cameras is six, and the coordinate results are shown in Table 3. In the given conditions, these six monitoring point coordinates are the best locations for deploying the cameras to monitor their opposite slopes.

**Figure 14.** Software interfaces of the demonstration system.

**Table 3.** Camera coordinates for the 6 different monitoring points shown in Figure 13.


Zhang et al. [33] explored the relationship between overlap and accuracy. They assumed that each point is covered by at least five photos, as shown in Figure 15. When the overlap exceeds 80%, the accuracy still improves, but the rate of improvement slows down.

**Figure 15.** Relative overlap (80%), where each point is covered by at least five photos.

In engineering practice, using more cameras is not conducive to cost control, and 80% overlap is a reasonable compromise between accuracy and cost. Figure 16 shows a comparison of the results when one of the conditions is not met.

**Figure 16.** Comparison of different results when lacking one of the conditions: (**a**) 80% overlap at five cameras, no full coverage, and (**b**) full coverage at five cameras, overlap < 80%.

To verify the result of the algorithm, we conducted a comparison test by reducing and increasing the quantity of cameras. The comparison of overlap and coverage length for different quantities is shown in Table 4. When the quantity of cameras is less than six, the cameras cannot cover the whole mine surface while maintaining the required overlap, and the overlap between adjacent photos falls below 80%.


**Table 4.** Overlap and coverage length for different quantity of cameras.

As shown in Table 4, to meet both the 493.2 m coverage length and the 80% overlap requirements, at least six cameras need to be deployed. Considering actual engineering costs, more cameras mean a higher cost. This confirms that the deployment result (Table 3) calculated by our OCD4M algorithm is reasonable for this study case.

#### **4. Discussion and Conclusions**

In summary, the OCD4M algorithm is proposed for the deployment of camera sensor networks for slope monitoring: it finds the minimum quantity of cameras and their deployment coordinates, optimizing the deployment, enabling 3D monitoring, and making full use of the multimedia data obtained from the cameras in the open-pit mine. We conducted experimental validation under simulated conditions of camera quantity, view angle, and focal length at different monitoring distances. The OCD4M algorithm was tested in a medium-sized mine, using HIKVISION DS-IPC-B12H-I cameras with an 8 mm focal length for mine surface photography, and the scene was reconstructed in Smart3D software. The field test results show that an accuracy of 30 cm can be achieved at a monitoring distance of 500 m. We also developed visualization system software, through which the camera deployment scheme for a mine scenario can be generated automatically. Based on the results of the algorithm, 3D monitoring of the working platform (e.g., calculating the slope angle and the height and width of the mine bench) can be realized at the decimeter level.

Some considerations need to be emphasized in terms of the deployment process, application scenarios, engineering costs, and limitations of the method. Since the physical deployment of control points is an unstable and costly solution for actual mining work, high-precision calibration of the cameras is an important task [34]. Our method guarantees decimeter-level accuracy at a 500 m monitoring distance without control points; thus, it allows decimeter-resolution safety monitoring in open-pit mines, such as bench damage and interslope risk monitoring. However, overall landslide monitoring requires millimeter-scale resolution, which no current camera model on the market can provide, especially when the monitored slope is further than 500 m away. Due to topographical and other shading issues at the mine site, front-to-surface photography may leave some areas without images, which then cannot be reconstructed in 3D, so the observation platform should be selected with due consideration of whether the area to be observed can be captured by all of the corresponding cameras. Considering the actual cost of a project, the low-cost solution of camera photogrammetry is easy for small and medium-sized mines to accept [7,21]. Additionally, our solution allows automated monitoring after deployment, which reduces the investment in manpower; it is more efficient than 3D laser scanning and traditional manual measurements. For large mines, where the site is large, distant, or complex, more cameras are required to ensure coverage and higher-quality cameras are needed to ensure accuracy, which increases costs; these costs are acceptable given the revenue of large mines.

In addition, the algorithm can be used for the calculation of other slope deployment scenarios (e.g., modeling of cultural heritage objects [35], 3D robot localization [36], monitoring coastal morphology [37], etc.). The algorithm can be improved by considering more photogrammetric geometry factors (e.g., the angle of intersection, the length of the photographic baseline, etc.) to optimize deployment scenarios for higher measurement accuracy, and by considering the mine topography and the actually deployable locations to perform more complex deployment calculations. The next step of this study will focus on the identification of, and warnings about, landslide areas using smart video image recognition based on the deployed system, so that the deployment algorithm can serve both the monitoring of slope collapse risks and the identification of landslide areas.

**Author Contributions:** Conceptualization, X.M.; methodology, X.M. and H.Z.; software, H.Z. and M.L.; validation, P.T., H.Z. and M.L.; formal analysis, P.T. and H.Z.; data work, H.Z. and X.L.; writing—original draft preparation, X.M. and H.Z.; writing—review and editing, H.Z., P.T., M.L., X.M. and X.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by National Natural Science Foundation of China (NSFC) under grant number 41971352 and 42071246, National Key Research and Development Program of China under grant number 2018YFB0505003, Ecological SmartMine Joint Fund of Hebei Natural Science Foundation under grant number E2020402086 and Hebei Natural Science Foundation under grant number E2020402006.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data sharing not applicable.

**Acknowledgments:** The authors are very grateful to the many people who helped to comment on the article, and the Large Scale Environment Remote Sensing Platform (Facility No. 16000009, 16000011, 16000012) provided by Wuhan University. Thanks to Shunxing Quarry (Guangdong, China) for providing us with the test field. Special thanks to the editors and reviewers for providing valuable insight into this article.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Sky Monitoring System for Flying Object Detection Using 4K Resolution Camera**

#### **Takehiro Kashiyama 1,\*, Hideaki Sobue 2 and Yoshihide Sekimoto 1**


Received: 9 November 2020; Accepted: 8 December 2020; Published: 10 December 2020

**Abstract:** The use of drones and other unmanned aerial vehicles has expanded rapidly in recent years. These devices are expected to enter practical use in various fields, such as taking measurements through aerial photography and transporting small, lightweight objects. Simultaneously, concerns over these devices being misused for terrorism or other criminal activities have increased. In response, several sensor systems have been developed to monitor drone flights; in particular, with the recent progress of deep neural network technology, monitoring systems based on image processing have been proposed. This study developed a monitoring system for flying objects using a 4K camera and a state-of-the-art convolutional neural network model to achieve real-time processing. We installed the monitoring system in a high-rise building in an urban area during this study and evaluated the precision with which it could detect flying objects at different distances under different weather conditions. The results obtained provide important information for determining the accuracy of monitoring systems with image processing in practice.

**Keywords:** flying object detection; drone; convolutional neural network; image processing

#### **1. Introduction**

In recent years, the use of drones and other unmanned aerial vehicles (UAVs) has expanded rapidly. These devices are currently used in various fields worldwide. In Japan, where natural disasters are a frequent occurrence, these devices are expected to be used for applications such as scanning disaster sites and searching for evacuees. In societies with shrinking populations, especially in depopulating areas suffering from continuous labor shortages, there is hope that these devices can serve as new means for delivering medical supplies to emergency patients and food supplies to those without easy access to grocery stores.

However, there are growing concerns over these devices being misused for terrorism or other criminal activities. An incident occurred in April 2015 in Japan, when a small drone carrying radioactive material landed on the roof of the Prime Minister's official residence. Moreover, there have been incidents where drones have crashed during large events or at tourist spots. In response to such incidents, legislation has been passed in Japan to impose restrictions on flying UAVs near sensitive locations such as national facilities, airports, and population centers and to set rules and regulations on safely flying these devices. However, this has not stopped the unauthorized use of drones by foreign tourists unaware of these regulations or users who deliberately ignore them. Hence, the number of accidents and problems continues to increase each year.

As the number of drones in use can only increase, it becomes important to ensure that the skies are safe. Therefore, a framework must be developed to prevent the malicious use of drones, such as for criminal activities or terrorism. Sky monitoring systems must address this important responsibility. Currently, methods using radar, acoustics, and RF signals to detect UAVs or radio-controlled aircraft [1] have been proposed. However, the systems [2–5] that use radar technology can impact the surrounding environment (such as blocking radio waves from nearby devices) and can only be used in limited areas. The method that uses acoustics [6] cannot detect a drone in a noisy urban environment, limiting the types of environments where the system can be installed. The methods in which drones are detected based on RF signals between the drone and operator [7] cannot detect autonomously controlled drones that do not transmit RF signals.

Methods employing images [8–11] have likewise been proposed, reflecting the wide use of image processing in various fields [12–18]. These methods are advantageous because they do not affect the monitoring targets or the surrounding environment and allow systems to be built at a comparatively lower cost. It is also easy to miniaturize such systems, allowing them to be installed almost anywhere. Another benefit is that an indoor monitoring system can be used to monitor an outdoor environment, reducing maintenance costs. Sobue et al. [8] and Seidaliyeva et al. [11] used background differencing to detect flying objects. Schumann et al. [9] used either background differencing or a deep neural network (DNN) to detect flying objects and used a separate convolutional neural network (CNN) model for categorization. Unlu et al. [10] proposed a framework in which the YOLOv3 [19] model scans a wide area captured by a wide-angle camera to detect possible targets, and subsequently a zoom camera inspects the details of suspicious objects.

Although the aforementioned studies explain their frameworks sufficiently, the flying objects detected during evaluation were large, and factors such as changes in the environment were not considered to a sufficient degree when evaluating performance. It therefore remains unclear under which conditions detection would succeed or fail in these methods, making it difficult to determine whether they can be put to actual use. In this study, we propose a system capable of achieving a wide range of observations in real time, at a lower cost than existing systems, using a single 4K resolution camera and YOLOv3, a state-of-the-art CNN. We installed the proposed system in a high-rise building located in the Tokyo metropolitan area, collected images that include flying objects, and evaluated the monitoring precision under changing weather and at various distances from the monitoring targets. It should be noted that the monitoring system was tested only during the day.

The remainder of this paper is organized as follows. We describe the proposed sky monitoring system in Section 2. The detection accuracy is presented in Section 3, and the results are discussed in Section 4. Finally, Section 5 concludes the study.

#### **2. Proposed System**

#### *2.1. System Overview*

We developed a sky monitoring system for use in urban areas that present a substantial risk concerning devices such as UAVs. The following restrictions must be assumed for use in an urban area:


Due to these restrictions, we installed a video camera indoors and developed a system that uses only image processing for monitoring. Using a video camera instead of special equipment, such as lasers, relaxed the installation requirements and allowed the system to be built at a reasonable cost. This also allowed us to install multiple systems.

In addition to monitoring nearby UAVs' flight status, we decided to monitor flying objects in general. We assumed that the system would also be used for building management and would need to monitor the sky to prevent damage by birds. The system's objective was to detect objects with a total length of approximately 50 cm or longer, which is the typical size of most general-purpose drones. The system was designed to monitor an area extending 150 m from it. This distance was selected for two reasons. First, UAV regulations within Japan permit flight only at an altitude of 150 m or lower. Second, Amazon has proposed a drone fly zone from 200 to 400 feet (61 to 122 m) for high-speed UAVs, and a fly zone up to 200 feet (61 m) for low-speed UAVs [20]. Although it would also be possible for a UAV used for malicious purposes (such as terrorism or criminal activity) to approach from an altitude of 150 m or higher, we considered that detecting small flying objects at a greater distance is not realistic when using only image processing. In addition, if a drone's goal is surreptitious photography or dropping hazardous materials, it must eventually approach the facility. Therefore, we believe that the proposed system is suitable for practical use under the aforementioned distance limitations.

#### *2.2. System Configuration*

The proposed system was configured from an indoor monitoring system that includes a video camera and a control PC for transmitting the captured video, and a processing server that detects flying objects using a CNN. Figure 1 shows the proposed system configuration.

**Figure 1.** Proposed system configuration.

The monitoring system was equipped with a 4K video camera (the highest resolution offered by current consumer video cameras) to detect flying objects up to a distance of 150 m, and the widest angle was used to capture footage of the sky and cover a wide area. It was assumed that the system would be installed indoors; hence, we could not ignore the fact that light is often reflected from windows in typical office buildings as blinds are opened and closed or lights are switched on. A covering hood was installed between the camera and the window to reduce the impact of window reflections, and the system was painted mostly black to reduce reflected light. This is a simple physical measure against reflected light, but it is essential for indoor installation.

A c5.4xlarge instance with 16 virtual CPUs was selected from Amazon EC2 as the processing server. In our proposed system, one 4K frame per second is sent from the monitoring system to the processing server. This instance type was selected because it is sufficient to process 4K images in real time.

#### *2.3. CNN-Based Detection*

Recently, CNNs, which represent a deep learning approach, have been applied in numerous computer vision tasks, such as image classification and object detection [21]. Many models [19,22,23] have been proposed and compared in terms of accuracy on the COCO [24] and KITTI [25] benchmark datasets. In this study, we use the YOLOv3 [19] model to detect flying objects. YOLOv3 extracts the features of an image by down-sampling the input image at three scales, with strides of 8, 16, and 32, to detect objects of different sizes. The training process uses a loss calculated from the bounding box coordinates (x, y, w, h), the objectness score, and the class score. The advantage of YOLOv3 is its balance between processing speed and accuracy. Therefore, we presume that YOLOv3 is well suited for use in the security field, which requires both real-time processing and high detection accuracy. In 2020, YOLOv4 [26] and YOLOv5 [27] were released; however, in consideration of stable operation, we adopted YOLOv3, which boasts abundant practical results.

A collection of images of airplanes, birds, and helicopters was used to train the model instead of drone images. This was done for two reasons. First, there is no publicly available drone dataset for use in CNN training. Second, it would likely be impossible to collect images of all types of drones because, as opposed to automobiles and people, they assume various shapes. Therefore, images of airplanes, birds, and helicopters were collected for model training and evaluation during this study.

The proposed system uses a 4K video camera to capture a wide view of the sky. A drone with a size of approximately 50 cm flying at a maximum distance of 150 m would span approximately 10 pixels in an image. If the system can detect airplanes, birds, and helicopters under such strict conditions, it is expected to be capable of detecting drones.

The monitoring system was installed indoors on the 43rd floor of Roppongi Hills located in the center of Tokyo, to gather training data. Images of external flying objects were captured indoors through a window. Figure 2 shows an example of an image captured.

**Figure 2.** Image captured by the monitoring system and system installation.

#### *2.4. Input to Model*

We determined that compressing 4K images and importing them into the YOLO model would not be an appropriate way to input high-resolution images of distant flying objects into the current model. Therefore, when detecting flying objects on the processing server, 600 × 600 pixel squares were extracted from the 3840 × 2160 4K images and input into the YOLO model, as illustrated in Figure 3. These extracted tiles partially overlapped, so that the precision of the detected objects would not decrease in the image boundary areas. The detection processes for the tiles were run in parallel on different virtual CPUs, allowing detection over the entire area of a 4K image to be completed within one second (i.e., in real time).

**Figure 3.** 4K image input.
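A minimal sketch of this tiling step is shown below: the 3840 × 2160 frame is cut into 600 × 600 crops whose positions overlap, so an object falling on a tile boundary appears whole in at least one tile. The 100-pixel overlap (stride of 500) is an assumed value; the paper states only that the extracted images are partially duplicated.

```python
# Sketch of splitting a 4K frame into overlapping 600x600 tiles for the
# YOLO model. The 100 px overlap is an assumption; the paper does not
# specify the stride.

TILE, OVERLAP = 600, 100
STRIDE = TILE - OVERLAP   # 500 px between tile origins

def tile_origins(width=3840, height=2160):
    """Yield (x, y) top-left corners of 600x600 tiles covering the frame."""
    xs = list(range(0, width - TILE, STRIDE)) + [width - TILE]
    ys = list(range(0, height - TILE, STRIDE)) + [height - TILE]
    for y in ys:
        for x in xs:
            yield x, y

tiles = list(tile_origins())
print(len(tiles))   # crops processed in parallel for each frame
# Each tile's detections are mapped back to frame coordinates by adding
# the tile's (x, y) origin to the predicted boxes before merging.
```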

#### *2.5. CNN Model Training*

Images of flying objects were extracted from videos captured by the proposed system; consecutive frames are therefore similar, although not completely identical. If such continuous groups of images were randomly split into training and evaluation datasets, the two datasets would contain nearly identical images; the trained detection model would then not be evaluated accurately, and its precision would be overestimated. Therefore, the image data were first separated into continuous frame groups, as shown in Figure 4, and each group was randomly assigned to the training or evaluation dataset. Notably, 75% of the data was used to train the CNN model, while the remaining 25% was used for evaluation. Furthermore, the sizes of the flying objects and the weather conditions were adjusted to be evenly distributed in this process.
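A sketch of this group-wise split, assuming frames are keyed by a sequence identifier, is shown below; splitting by group prevents near-duplicate frames from leaking between the training and evaluation sets. The data structure and names are illustrative.

```python
import random

# Group-wise 75/25 split: all frames from one continuous sequence land
# in the same split, so near-identical frames never appear in both the
# training and evaluation sets. `groups` maps a sequence id to its frame
# file names (an illustrative structure, not the paper's format).

def group_split(groups, train_frac=0.75, seed=0):
    ids = sorted(groups)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_frac)
    train = [f for g in ids[:cut] for f in groups[g]]
    evaluation = [f for g in ids[cut:] for f in groups[g]]
    return train, evaluation

groups = {
    "seq_001": ["f0001.jpg", "f0002.jpg"],
    "seq_002": ["f0101.jpg", "f0102.jpg", "f0103.jpg"],
    "seq_003": ["f0201.jpg"],
    "seq_004": ["f0301.jpg", "f0302.jpg"],
}
train_set, eval_set = group_split(groups)
```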

For the hyperparameters, we used the default settings published by the developers of YOLOv3 when training the model; additional details are provided on the developers' GitHub page [28]. Accordingly, basic data augmentation, such as flips, was performed during training. Furthermore, the brightness of the training images was scaled between 0.3 and 1.2, expanding the number of training images approximately 10-fold, to increase robustness to changes in brightness due to the time of day and weather.

#### **3. Evaluation**

#### *3.1. Categorization of Data for Evaluation*

The purpose of the evaluation was to determine how the weather and the size of a flying object affect detection precision. The images gathered as described in Section 2.3 were categorized by condition; the collected data contained 1392 helicopters, 190 airplanes, and 74 birds. Note that the backgrounds contained only sky, without buildings or trees, as most flying objects were airplanes and helicopters.

The images were categorized into four weather types: clear, partly cloudy, hazy, and cloudy/rainy. The categorization was performed by eye, based on whether the images contained clouds with contours, haze, blue sky, or rain clouds (dark clouds). Figure 5 shows the results of categorizing the images by weather type.

**Figure 5.** Categorization of data by type of weather.

The images were then sorted into six categories corresponding to the sizes of the flying objects based on the number of horizontal pixels shown in the image: SS (less than 12 pixels), S (12 to 16 pixels), M (17 to 22 pixels), L (23 to 30 pixels), LL (31 to 42 pixels), and 3L (43 pixels or more). These categories can also be expressed as the distance in meters assuming a drone of approximately 50 cm size (SS: 150 m or greater, S: 110 m to less than 150 m, M: 75 m to less than 110 m, L: 55 m to less than 75 m, LL: 38 m to less than 55 m, and 3L: less than 38 m).
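These bins translate directly into a lookup; the thresholds and approximate distances below are taken verbatim from the text, not recomputed from camera parameters.

```python
def size_category(width_px):
    """Map a detection's horizontal pixel span to the paper's size classes
    and the approximate distance for a ~50 cm drone."""
    bins = [(12, "SS", ">=150 m"), (17, "S", "110-150 m"),
            (23, "M", "75-110 m"), (31, "L", "55-75 m"),
            (43, "LL", "38-55 m")]
    for upper, label, dist in bins:
        if width_px < upper:
            return label, dist
    return "3L", "<38 m"

print(size_category(10))  # ('SS', '>=150 m'): at the edge of detectability
```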

Table 1 lists the categorization results. Relatively few images were categorized as clear and many as hazy: 253 clear, 433 partly cloudy, 601 hazy, and 368 cloudy/rainy. By size, there were comparatively few SS, S, and 3L images and roughly equal numbers in the remaining categories: 58 SS, 182 S, 368 M, 488 L, 395 LL, and 152 3L.


**Table 1.** Number of images by size and weather.

#### *3.2. Evaluation Indicators*

A precision-recall (PR) curve, with precision on the vertical axis and recall on the horizontal axis, was used for evaluation. Rectangular bounding boxes are generally used when detecting objects, and numerous studies apply a fixed threshold to the intersection over union (IoU) calculated from the estimated and ground-truth bounding boxes. However, because the detected objects in this study are extremely small, the bounding boxes contain a large proportion of measurement error, caused by lens blur and by misalignment in the human annotation. Therefore, during evaluation we counted a flying object as detected if the ground-truth and estimated bounding boxes overlapped even slightly.
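This overlap criterion amounts to testing IoU > 0 rather than IoU above a fixed threshold; a minimal check for axis-aligned boxes:

```python
def boxes_overlap(a, b):
    """Return True if two (x1, y1, x2, y2) boxes overlap at all.

    Used here in place of the usual fixed IoU threshold, since tiny,
    blurred targets make tight box agreement unreliable.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2
```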

Figures 6 and 7 present the precision results by weather type. In Figure 6, images in which a flying object was successfully detected and then correctly categorized as a helicopter, bird, or airplane were counted as true positives; the remainder (detected but categorized incorrectly) were counted as false positives. In Figure 7, images in which a flying object was successfully detected, even if categorization failed, were counted as true positives, and the remainder as false positives. This was done under the assumption that detection alone already provides the minimum functionality required for monitoring flying objects.

**Figure 6.** Detection precision by type of weather (including categorization).

**Figure 7.** Detection precision by type of weather (not including categorization).

Figures 8 and 9 show the precision results by flying object size. As in the evaluation by weather type, in Figure 8 images in which a flying object was successfully detected and then correctly categorized as a helicopter, bird, or airplane were counted as true positives, and the remainder as false positives. In Figure 9, images in which a flying object was successfully detected, even if categorization failed, were counted as true positives, and the remainder as false positives.

**Figure 8.** Detection precision by flying object size (including categorization).

**Figure 9.** Detection precision by flying object size (not including categorization).

In Figures 6–9, the results over all evaluation data are depicted as black lines, while the results for each weather type or object size are color-coded.

#### **4. Discussion**

We first discuss the precision results by weather type. Figure 6 shows excellent results for clear weather, with almost no false detections. The results for partly cloudy weather (some clouds in the sky) and cloudy/rainy weather (the entire sky covered in clouds) are also satisfactory, although not as good as for clear weather: at a recall of 0.8, the precision remains above 0.8. In contrast, the results are worse for hazy weather: at a recall of 0.6, the precision is approximately 0.6. We attribute this to haze blurring the contours of flying objects. Because the targets in this evaluation were helicopters and aircraft, which may be anywhere from several hundred meters to several kilometers from the monitoring system, haze can strongly affect detection accuracy. Drones would fly considerably closer than these targets, so the accuracy loss in hazy weather may be smaller; this must be investigated in the future. Notably, Figure 6 also shows that although clouds and rain lower the overall image brightness and make objects harder to see by eye, no decrease in precision is observed for the monitoring system. Hence, it is the degradation of the flying object's appearance, not the change in sky brightness, that affects detection accuracy. In contrast, Figure 7 shows excellent results in all weather types, with precision near 1.0 at all recall values. This demonstrates that categorization mistakes, rather than missed detections, are the main cause of the lower precision and recall. If human support is available for the classification task, an image-processing-based system can therefore contribute significantly to sky monitoring.

Next, we discuss the precision results by flying object size. Figure 8 shows excellent results for the larger sizes L, LL, and 3L: at a recall of 0.8, the precision is above 0.8. The smaller sizes SS, S, and M yield differing results. The SS category produces an anomalous curve; the evaluation data were divided into six size categories with differing image counts (SS: 58, S: 182, M: 368, L: 488, LL: 395, and 3L: 152), and the SS category contains very few images, so its precision was likely not estimated reliably. Nevertheless, it indicates some capability to detect even the smallest flying objects, which should be investigated further. The S category yields a comparatively good PR curve, on par with the larger sizes (L, LL, 3L): at a recall of 0.8, the precision is above 0.8. This size corresponds to a distance of 110 to 150 m for a drone of roughly 50 cm, the minimum observation target of the monitoring system, indicating that the image-processing monitoring system achieves detection accuracy sufficient for the 150 m requirement stated in Section 2.1. The M category shows markedly poor results, worse than all other categories. We attribute this to hazy weather occurring more often for this object size than for the others: from Table 1, 246 of the 368 M-size evaluation images, more than half, were captured in hazy weather. Hence, the reduced accuracy is considered an effect of weather rather than size. We can thus conclude that high detection accuracy is achieved for sizes S and above, hazy weather excluded. For the trials without categorization, Figure 9, like Figure 7, shows particularly excellent results, with precision near 1.0 at all recall values. This again indicates that categorization mistakes, rather than missed detections, are responsible for the lower precision and recall by size seen in Figure 8.

In summary, the results indicate that precision is poor in hazy weather, and that the difference in accuracy by object size is small for sizes S and above. Overall, except in hazy weather, the system maintains a precision of approximately 0.8 at a recall of 0.8, confirming that it offers sufficient precision to function as a monitoring system. Further improvement of the algorithm in hazy weather is therefore required for practical use. The results also confirm that detection precision is limited by categorization mistakes rather than missed detections: under the conditions of this experiment, if classification is ignored, almost all flying objects are detected. Assuming that there are few flying objects in the sky, unlike pedestrian detection in a city, the system can contribute to real scenes as a monitoring system even if its classification accuracy is poor. However, the flying object images used in this study contain only sky as background. Images captured in an urban area would likely include buildings behind flying objects, creating a more complicated background and making category detection significantly more difficult. Therefore, the precision of this system must be verified under a wider range of conditions before use in urban areas.

Furthermore, although not examined in this study, capturing clear images is an important issue for monitoring systems that use image processing. When the system is installed outdoors, rain, insects, and dust soil the camera lens, preventing clear images; even indoors, dirt on the glass in front of the camera has an effect. For relatively large objects, such as pedestrians or cars, this has little negative impact, but for extremely small detection targets such as flying objects, the detection accuracy is substantially affected. Because physical solutions to this problem are cumbersome, software-based methods will be necessary for practical use.

#### **5. Conclusions**

We developed a sky monitoring system based on image processing, consisting of a monitoring system and a cloud processing server. The monitoring system captures a wide area of the sky at high resolution using a 4K camera and transfers a frame image to the server every second. The processing server uses the YOLOv3 model for flying object detection, achieving real-time processing through parallel processing of multiple cropped images. We installed the monitoring system in a high-rise building in the Tokyo metropolitan area and collected CNN model training and evaluation data to evaluate the system's detection precision.

Existing research has employed CNN models; however, the accuracy was not analyzed under each condition and was generally insufficient for practical use. Furthermore, no simple system combining a single 4K camera with a recent CNN had been reported to date. We designed a real-time monitoring system prototype using a recent CNN model and a 4K camera and characterized its flying object detection accuracy across object sizes and weather conditions.

We found that detection accuracy is significantly reduced in hazy weather: the blurring of a flying object's contours in haze degrades accuracy, whereas the reduced brightness in rainy or cloudy weather has little effect. In this study, the flying object's size in the image served as a proxy for its distance, and we showed that an object spanning roughly 10 pixels, corresponding to a 50 cm drone at 150 m, can be detected with high accuracy. Moreover, if classification is ignored, almost all flying objects were detected under all weather conditions and sizes considered in this experiment. From these results, a surveillance system using image processing can be expected to contribute to sky surveillance, since the cost of human support for classification is low in an environment with few objects in the sky.

However, the monitoring system has several limitations: it was verified only during the daytime, and the effects of dirt and dust on the camera lens were not considered. Moreover, the evaluation data covered long-distance aircraft and helicopters and did not include drones flying at relatively short distances. In the future, we plan to evaluate the monitoring accuracy in further detail by collecting evaluation data covering these conditions. We also plan to improve the proposed system and to develop a system that detects light emitted from LEDs installed on drones, together with one that uses a comparatively inexpensive infrared projector, for drone detection at night.

#### **6. Patents**

Japan patent: JP6364101B.

**Author Contributions:** Conceptualization, T.K. and Y.S.; methodology, T.K.; software, T.K. and H.S.; validation, T.K. and H.S.; formal analysis, T.K. and H.S.; investigation, H.S.; resources, T.K.; data curation, H.S.; writing—original draft preparation, T.K.; writing—review and editing, T.K. and Y.S.; visualization, T.K.; supervision, Y.S.; project administration, Y.S.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** MORI Building Co., Ltd. supported this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **Recognition and Grasping of Disorderly Stacked Wood Planks Using a Local Image Patch and Point Pair Feature Method**

#### **Chengyi Xu 1,2, Ying Liu 1,\*, Fenglong Ding 1 and Zilong Zhuang 1**


Received: 27 August 2020; Accepted: 28 October 2020; Published: 31 October 2020

**Abstract:** Considering the difficult problem of robot recognition and grasping of disorderly stacked wooden planks, a recognition and positioning method based on local image features and point pair geometric features is proposed here, and we define a local patch point pair feature. First, we used self-developed scanning equipment to collect images of wood boards and a robot-mounted RGB-D camera to collect images of disorderly stacked wooden planks. Image patches cut from these images were input to a convolutional autoencoder to train a local texture feature descriptor that is robust to changes in perspective. Then, small image patches around point pairs of the plank model are extracted and input into the trained encoder to obtain the feature vector of each image patch, which is combined with the point pair geometric feature information to form a feature description code expressing the characteristics of the plank. After that, the robot drives the RGB-D camera to collect local image patches of point pairs in the area to be grasped in the stacked-plank scene, obtaining the feature description code of the planks to be grasped in the same way. Finally, through point pair feature matching, pose voting, and clustering, the pose of the plank to be grasped is determined. The robot grasping experiments here show that both the recognition rate and the grasping success rate are high, reaching 95.3% and 93.8%, respectively. Compared with the traditional point pair feature (PPF) method and other methods, the method presented here has obvious advantages and can be applied to stacked wood plank grasping environments.

**Keywords:** convolutional auto-encoders; local image patch; point pair feature; plank recognition; robotic grasping

#### **1. Introduction**

At present, the handling of disorderly stacked wooden planks, such as sorting, transport, and paving, is mainly completed manually, which suffers from low efficiency and high labor costs. With the rapid development and wide application of robotic technology, vision-based intelligent robot grasping has high theoretical significance and practical value for this work. The visual recognition and positioning of disorderly stacked wooden planks is an important prerequisite for successful robot grasping, and the vision sensor is the key component of the implementation. An RGB-D camera can be regarded as a combination of a color monocular camera and a depth camera; it can collect both texture information and depth information at the same time [1], which gives it obvious application advantages.


The recognition, detection, and location of objects is a research hotspot both in China and overseas. Hinterstoisser et al. [2] extracted features from the color gradient of the target and the normal vector of its surface and matched them to obtain robust target detection. Rios-Cabrera et al. [3] used clustering to find the target, achieving faster detection. Rusu et al. [4] calculated the angle between the normal vector of a point cloud and the viewpoint direction based on the viewpoint feature histogram (VFH), but this was not robust to occlusion; the orthogonal viewpoint feature histogram (OVFH) [5] was proposed in response. Birdal et al. [6–8] extracted point pair features (PPFs) from a point cloud and used shape information to identify and detect targets, mainly for objects with complex shapes. Lowe et al. [9,10] proposed the scale-invariant feature transform (SIFT), a 2D local feature point with good stability, but it is not suitable for multiple similar targets in the same scene. There are also the feature descriptors SURF [11], spin image [12], signature of histograms of orientations (SHOT) [13], etc. Choi [14] added color information to the traditional four-dimensional geometric point pair feature and obtained better accuracy than the original PPF method. Ye et al. [15] proposed a fast hierarchical template matching strategy for texture-less objects, which takes less time than the original method. Muñoz et al. [16] proposed a novel part-based method using an efficient template matching approach in which each template independently encodes the similarity function using a forest trained over the templates. With reference to a multi-boundary appearance model, Liu [17] proposed fitting the tangent at the edge points of the model as the direction vector of each point; through four-dimensional point pair feature matching and positioning, good recognition results were obtained. Li [18] proposed a new descriptor, the curve set feature (CSF), which describes a point via the surface fluctuations around it and can evaluate the pose. CAD-based pose estimation has also been used to solve recognition and grasping problems: Wu et al. [19] proposed constructing 3D CAD models of objects via a virtual camera, generating a point cloud database for object recognition and pose estimation, and Chen et al. [20] proposed a CAD-based multi-view pose estimation algorithm using two depth cameras to capture the 3D scene.

Recently, deep learning has been applied to robotic recognition and grasping. Kehl [21] proposed using convolutional neural networks trained end-to-end to obtain the pose of an object. Caldera et al. [22] proposed a novel approach to multi-fingered grasp planning leveraging learned deep neural network models. Kumara et al. [23] proposed a novel robotic grasp detection system that predicts the best grasping pose of a parallel-plate gripper for novel objects from an RGB-D image of the scene. Levine et al. [24] proposed a learning-based approach to hand-eye coordination for robotic grasping from monocular images. Zeng et al. [25] proposed a robotic pick-and-place system capable of grasping and recognizing both known and novel objects in cluttered environments. Kehl [26] proposed a 3D object detection method that uses regressed descriptors of locally sampled RGB-D patches for 6D vote casting. Zhang [27] proposed a recognition and positioning method that uses deep learning to combine the overall image and local image patches. Le et al. [28] applied an instance-segmentation-based deep learning approach using 2D image data to classify and localize the target object while generating a mask for each instance. Tong [29] proposed a method of target recognition and localization using local edge image patches. Jiang et al. [30] used a deep convolutional neural network (DCNN) model, trained on 15,000 annotated depth images synthetically generated in a physics simulator, to directly predict grasp points without object segmentation.

Even in the current era of deep learning, the point pair feature (PPF) method [7] retains strong vitality in the bin-picking problem, and its performance is still no worse than that of deep learning. Many scholars have improved PPF [6–8,14,15,31] because of these advantages, although its general framework has not changed significantly at either the macro or micro level. Vidal et al. [31] proposed an improved matching method for point pair features using the discriminative value of surface information.

At present, most robot visual recognition and grasping scenes contain different objects placed together, and the target objects to be recognized have rich contour feature information. However, current research methods are not effective for identifying and grasping disorderly stacked wooden planks. The main reason is that the shape of a wooden plank is regular and symmetrical, consisting mainly of a large plane; its contour information is not rich, different planks have no obvious distinguishing features, and many features such as shape and texture are similar. When such planks are stacked together, it is difficult to identify and locate an individual plank using conventional methods, making it difficult for a robot to grasp one plank among many. Therefore, we use PPF combined with other features to recognize and locate disorderly stacked planks.

The local image patches of the wooden plank images, both from the self-developed scanning equipment and from the disorderly stacked plank scene, are taken as the data set. Using the strong fitting ability of a deep convolutional autoencoder, the autoencoder is trained to obtain stable local texture feature descriptors. The overall algorithm flow for robot grasping is shown in Figure 1. In the offline stage, a pair of feature points is randomly selected on the wooden plank image and local image patches are intercepted; these two patches are input in sequence to the trained feature descriptor to obtain local feature vectors, which are combined with the geometric feature information of the point pair to build the feature code database of the plank model. In the online stage, the disorderly stacked plank scene is segmented and the plank area to be grasped is extracted; the geometric feature information of the point pairs and the feature description codes of the local image patches are extracted as in the offline stage. Then, point pair matching is performed against the established plank feature database. Finally, after all point pairs complete pose voting and pose clustering, the robot performs the positioning and grasping operation.

**Figure 1.** The overall algorithm flow of robot grasping.

#### **2. Methods of the Local Feature Descriptor Based on the Convolutional Autoencoder**

Traditional autoencoders [32–34] are generally fully connected, which generates a large number of redundant parameters; the extracted features are global, while local features, which are more important for wood texture recognition, are ignored. Convolutional neural networks have the characteristics of local connection and weight sharing [35–40], which accelerate network training and facilitate the extraction of local features. The deep convolutional autoencoder designed in this paper is shown in Figure 2. The robot drives the camera to collect small image patches at different angles around the same feature point in the scene of disorderly stacked wood planks, which serve as the input and expected output for training the convolutional autoencoder. The network consists of two stages, encoding and decoding. The encoding stage has four layers, each performing feature extraction through convolution and ReLU activation operations on the input data. The decoding stage also has four layers, each reconstructing feature data through transposed convolution and ReLU activation operations. This network model combines the advantages of traditional autoencoders and convolutional neural networks, with residual learning connections added to improve performance.

**Figure 2.** Convolutional autoencoder model.

In the encoding stage, the input of the convolutional layer has *D* groups of feature matrices, where the two-dimensional *M* × *M* feature matrix of the *d*th group is $x^d$, $1 \le d \le D$. The two-dimensional *N* × *N* convolution kernel connecting the *d*th input group to the *k*th output channel is $W^{k,d}$, and the *k*th feature map of the output convolution layer is $h^k \in R^{M' \times M'}$, where $M' = M - N + 1$. Then, $h^k$ can be expressed as follows:

$$h^k = f\left(\sum\_{d=1}^D x^d \ast W^{k,d} + b^k\right) \tag{1}$$

In Equation (1), *f* is the activation function, \* is the two-dimensional convolution, and $b^k$ is the offset of the *k*th channel of the convolutional layer.

The decoding stage is designed to reconstruct the feature maps obtained in the encoding stage. It takes a total of *K* groups of feature matrices as input and outputs the *d*th feature map $y^d \in R^{M'' \times M''}$, where $M'' = M' - N + 1$. Then, $y^d$ can be expressed as follows:

$$y^d = f\left(\sum\_{k=1}^K h^k \star \widetilde{W}^{k,d} + c^d\right) \tag{2}$$

In Equation (2), $\widetilde{W}^{k,d}$ is the horizontal and vertical flip of the convolution kernel $W^{k,d}$, with $1 \le k \le K$. Additionally, $c^d$ is the offset of the *d*th channel of the deconvolution layer.

The convolutional autoencoder encodes and decodes the original intercepted image patches to continuously learn the convolution kernel parameters and offsets of the network, so that the output *y* is as close as possible to the given output expectation $x'$, minimizing the reconstruction error function in Equation (3):

$$E = \frac{1}{2n} \sum\_{i=1}^{n} \left\| y\_i - \mathbf{x}'\_i \right\|^2 \tag{3}$$

where *n* is the number of samples input for model training, $y\_i$ represents the actual output samples, and $x'\_i$ represents the expected output values.

A back-propagation (BP) error algorithm is used to adjust the network parameters. Once training converges, the encoding part of the trained autoencoder constitutes a local texture feature descriptor of the wood plank that is robust to changes in perspective; that is, inputting a local image patch into the encoding part of the convolutional autoencoder yields the corresponding feature description code.
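The architecture can be sketched in PyTorch as follows. The four unpadded convolutions and four transposed convolutions with ReLU activations follow the description above; the 3 × 3 kernels, channel widths, three-channel input, and the global average pool used to read out the 32-dimensional code are assumptions, as the actual layer sizes are given in the paper's Table 1.

```python
import torch
import torch.nn as nn

class PatchAutoencoder(nn.Module):
    """Sketch of the convolutional autoencoder for 22x22 image patches."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # spatial size: 22 -> 14
            nn.Conv2d(3, 8, 3), nn.ReLU(),
            nn.Conv2d(8, 16, 3), nn.ReLU(),
            nn.Conv2d(16, 32, 3), nn.ReLU(),
            nn.Conv2d(32, 32, 3), nn.ReLU(),
        )
        self.decoder = nn.Sequential(              # spatial size: 14 -> 22
            nn.ConvTranspose2d(32, 32, 3), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3), nn.ReLU(),
            nn.ConvTranspose2d(16, 8, 3), nn.ReLU(),
            nn.ConvTranspose2d(8, 3, 3),
        )

    def forward(self, x):
        feats = self.encoder(x)
        code = feats.mean(dim=(2, 3))   # assumed 32-d descriptor readout
        return self.decoder(feats), code

model = PatchAutoencoder()
x = torch.rand(4, 3, 22, 22)            # patch seen from one viewpoint
x_expected = torch.rand(4, 3, 22, 22)   # same point, another viewpoint
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x_expected)  # Eq. (3), up to a constant
```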

If only the local image features of the wooden planks are used to match and recognize a single plank, problems arise: since there are multiple, similar planks in the visual scene of stacked planks, the algorithm cannot tell whether multiple local image patch features belong to the same plank, and they are likely to be scattered across different planks, as shown in Figure 3. This easily leads to mismatches. We therefore want local image patches on the same plank to be in a paired relation, as shown in Figure 4.

**Figure 3.** Local image patches matching.

**Figure 4.** Local image patches of paired relation.

#### **3. Offline Calculation Process: Model Feature Description Using Point Pair Features and Local Image Patch Features**

Consider adding a point pair geometric feature constraint relationship to two points on the same wooden plank model, as shown in Figure 5.

**Figure 5.** Schematic diagram of point pair features.

For any two points $m\_1$ and $m\_2$ with respective normal vectors $n\_1$, $n\_2$, define the constraint relationship describing the local texture feature point pair as shown in Equation (4) [7]:

$$F(m\_1, m\_2) = \left[\|d\|, \angle(n\_1, d), \angle(n\_2, d), \angle(n\_1, n\_2)\right] \tag{4}$$

where $\|d\|$ is the distance between the two points, $\angle(n\_1, d)$ is the angle between the normal vector $n\_1$ and the line *d* connecting the two points, $\angle(n\_2, d)$ is the angle between the normal vector $n\_2$ and the line *d*, and $\angle(n\_1, n\_2)$ is the angle between the normal vectors $n\_1$ and $n\_2$; namely:

$$\begin{cases} F\_1 = \|d\| = \|m\_2 - m\_1\| \\ F\_2 = \angle(n\_1, d) = \arccos\frac{n\_1 \cdot d}{\|n\_1\|\|d\|} \\ F\_3 = \angle(n\_2, d) = \arccos\frac{n\_2 \cdot d}{\|n\_2\|\|d\|} \\ F\_4 = \angle(n\_1, n\_2) = \arccos\frac{n\_1 \cdot n\_2}{\|n\_1\|\|n\_2\|} \end{cases} \tag{5}$$

We first performed point pair feature sampling on the point cloud of a single plank model: we took a feature point and established point pair relationships with all other feature points in turn, then took a new feature point and repeated the process until point pair relationships among all feature points had been established, calculating the characteristic parameter *F* for each non-repeated point pair.
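The four-dimensional feature of Equation (5) is straightforward to compute; a sketch for a single point pair:

```python
import numpy as np

def ppf(m1, n1, m2, n2):
    """Point pair feature of Eq. (5): the distance between the two points
    and the three angles among the normals and the connecting line."""
    d = m2 - m1

    def angle(u, v):
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(c, -1.0, 1.0))  # clip guards rounding

    return np.array([np.linalg.norm(d),
                     angle(n1, d), angle(n2, d), angle(n1, n2)])

# example: two surface points with unit normals
f = ppf(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]),
        np.array([0.1, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
```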

Since the shape of the plank is mainly a large rectangular plane with symmetrical regularity, the characteristic parameter information of different point pairs on a single plank is identical or similar; point pair features alone cannot match plank features uniquely, so the local image patch features of the point pairs were added here. We input each intercepted local image patch into the previously trained encoder (only the encoding part of the original convolutional autoencoder is needed) to obtain the feature vector corresponding to the patch, and combined the feature vectors of all non-repeated point pairs' local image patches with the point pair geometric feature information.

As shown in Figure 6, we define a more comprehensive description of the characteristic parameters of the wood plank, namely:

**Figure 6.** Local patch point pair feature (LPPPF).

$$F\_l(m\_1, m\_2) = \left[\|d\|, \angle(n\_1, d), \angle(n\_2, d), \angle(n\_1, n\_2), l\_1, l\_2\right] \tag{6}$$

where $l\_1$, $l\_2$ represent the feature codes extracted from the local image patches by the encoder.

The KD-tree (k-dimensional tree) method is used to build a feature code database reflecting the characteristics of the wood plank model, as shown in Figure 7. In summary, this model description method combining point pair features and local image features avoids the drawbacks of using either method alone.

**Figure 7.** Establishment of feature code database of the wood plank model.

#### **4. Online Calculation Process**

#### *4.1. Generating the Feature Code of the Plank to Be Grasped*

Since stacked wooden planks have obvious hierarchical characteristics, the robot's grasping generally proceeds from top to bottom, and the plank grasped each time should be on the top layer of the scene at that time. First, the Euclidean distance clustering method [41] was used to segment the scene under robot vision; the average depth of each cluster was then calculated, and the cluster with the smallest average depth was selected as the area to be grasped, as shown in Figure 8. This removes the need to match all planks in the scene later, saving computing time. Next, we randomly extracted point pair feature information and the corresponding local image patches from the area to be grasped, input the patches into the trained encoder to generate local image patch feature vectors, and combined these with the point pair geometric feature information to form feature description codes.

**Figure 8.** Generation of feature code of the wood plank to be grasped.
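The top-layer selection can be sketched as follows; DBSCAN from scikit-learn stands in for PCL's Euclidean cluster extraction [41], and the `eps`/`min_samples` values are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def top_layer_cluster(points, eps=0.01, min_samples=30):
    """Return the cluster with the smallest average depth.

    points: (N, 3) array in the camera frame with z as depth. DBSCAN with a
    small eps behaves like Euclidean distance clustering on the cloud.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    best_label, best_depth = None, np.inf
    for lab in set(labels) - {-1}:                 # -1 marks noise points
        mean_depth = points[labels == lab, 2].mean()
        if mean_depth < best_depth:
            best_label, best_depth = lab, mean_depth
    return points[labels == best_label]            # area to grasp first
```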

#### *4.2. Plank Pose Voting and Pose Clustering*

We extracted point pairs similar to the feature codes generated in the scene from the feature code database established offline, measuring their similarity by the Euclidean distance to complete the point pair matching:

$$\text{dist}(F\_{off}, F\_i) = \left\| F\_{off} - F\_i \right\|\_2 \tag{7}$$

where $F\_i$ represents the feature code extracted and generated in the scene and $F\_{off}$ is a feature code from the database established offline.
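The nearest-neighbour search of Equation (7) over the KD-tree database (Figure 7) might look as follows; the 68-dimensional code (four geometric values plus two 32-dimensional patch codes) and the random placeholders are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

# offline: one row per model point pair (4 geometric + 2 x 32 patch dims)
model_codes = np.random.rand(1000, 68)   # placeholder database
tree = cKDTree(model_codes)              # KD-tree of Figure 7

# online: match a scene feature code F_i to its nearest F_off, Eq. (7)
scene_code = np.random.rand(68)
dist, idx = tree.query(scene_code)       # Euclidean nearest neighbour
```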

To determine the pose, we used a local coordinate system voting scheme in a two-dimensional space, as proposed by Drost et al. [7]. We selected a reference point in the scene point cloud to form point pairs with every other point in the scene, calculated the feature values according to Equation (6), and searched for matching point pairs $(m\_r, m\_i)$ in the feature code database via Equation (7). A successful match indicates that for the feature point $s\_r$ extracted in the scene there is a corresponding point $m\_r$ in the feature code database. We place them in the same coordinate system, as shown in Figure 9: both points are moved to the origin of coordinates and rotated so that their normal vectors align with the x-axis. The transformation matrix applied to $m\_r$ is $T\_{m \to g}$ and that applied to $s\_r$ is $T\_{s \to g}$. At this stage, the other points of the pairs, $s\_i$ and $m\_i$, are not yet aligned; they must be rotated by an angle α, with transformation matrix $R\_x(\alpha)$, which gives the following [7]:

$$s\_i = T\_{s \to g}^{-1} R\_x(\alpha) T\_{m \to g} m\_i \tag{8}$$

where $T\_{s \to g}^{-1} R\_x(\alpha) T\_{m \to g}$ is the temporary plank pose matrix.

**Figure 9.** Model and scene coordinate system transformation.

To reduce the calculation time, the rotation angle can be computed as follows:

$$
\alpha = \alpha\_m - \alpha\_s \tag{9}
$$

where $\alpha\_m$ is determined only by the model point pair features and $\alpha\_s$ only by the scene point pair features.

From Equation (8), the complex pose-solving problem is transformed into the problem of matching model point pairs and corresponding angles α, so it can be solved by exhaustive voting. We created a two-dimensional accumulator whose number of rows is the number of model points *M* and whose number of columns is the number *q* of discretized angle values. When a point pair extracted from the scene correctly matches a model point pair, one cell $(m\_r, \alpha)$ of the two-dimensional accumulator corresponds to it; that position receives a vote. When all point pairs composed of the scene point $s\_r$ and the other scene points $s\_i$ have been processed, the position with the peak vote in the accumulator is the desired one: the angle α estimates the plank's orientation and the model point estimates its position.
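A sketch of the accumulator, assuming the matching step already yields (model point index, α) pairs:

```python
import numpy as np

def vote(matches, n_model_points, n_angle_bins=30):
    """Two-dimensional voting of Drost et al. [7].

    matches: iterable of (model_point_index, alpha) pairs, with alpha from
    Eq. (9). The bin count n_angle_bins (the value q) is an assumed setting.
    """
    acc = np.zeros((n_model_points, n_angle_bins), dtype=np.int32)
    for m_r, alpha in matches:
        a_bin = int((alpha % (2 * np.pi)) / (2 * np.pi) * n_angle_bins)
        acc[m_r, a_bin % n_angle_bins] += 1
    peak = np.unravel_index(np.argmax(acc), acc.shape)
    return peak, acc[peak]   # (model point, angle bin) and its vote count
```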

To ensure the accuracy and precision of the pose, multiple non-repeated reference points in the scene are selected and the above voting is repeated. Multiple model points then appear in the two-dimensional accumulators, producing multiple voting peaks for different model points and eliminating clearly incorrect pose votes, which improves the accuracy of the final result. Multiple voting peaks mean that the generated poses must be clustered. The pose with the highest vote is used as the initial cluster center; a newly added pose must have translation and rotation angles within preset thresholds of a cluster center, and when a pose differs significantly from every current cluster center, a new cluster is created. The score of a cluster is the sum of the scores of the poses it contains, and the score of each pose is the number of votes it obtained in the voting scheme. After the cluster with the largest score is determined, the average of the poses in that cluster is used as the final pose of the plank to be grasped. Pose clustering improves the stability of the algorithm by excluding poses with lower scores and ensures that only one pose is produced per recognition, so that the robot grasps only one plank at a time. Averaging the poses of the highest-scoring cluster also directly improves the accuracy of the plank's pose, and this value can serve as the initial value for the iterative closest point (ICP) method [42] to further refine the pose. In summary, the process of determining the plank's pose is shown in Figure 10.

**Figure 10.** The final pose determination process of the plank.

#### **5. Experimental Results and Discussion**

The computer hardware used in the experiments comprised an Intel Xeon W-2155 3.30 GHz CPU, 16.00 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti GPU. The whole framework is based on C++, OpenCV, the Point Cloud Library (PCL), and other open-source algorithm libraries. A visual grasping scene model of stacked wooden planks was built on the ROS (Robot Operating System) with the Gazebo platform. As shown in Figure 11, the RGB-D camera (Xtion Pro Live, Suzhou, China) is fixed at the end of the robot arm, and the disorderly stacked wooden boards to be grasped are placed below. The robot hand is a suction gripper with six suction points.

**Figure 11.** Visual grasping scene model of stacked wooden planks.

#### *5.1. Data Preparation and Convolutional Autoencoder Training*

To set up the data set, we used self-developed mechatronic equipment to collect images of wood boards (Figure 12). This device mainly comprises a strip light source, a transmission device, a CCD industrial camera (LA-GC-02K05B, DALSA, Waterloo, ON, Canada), and a photoelectric sensor (ES12-D15NK, LanHon, Shanghai, China) mounted on top. When the conveyor belt moves a wood board to the scanning position, the photoelectric sensor detects it and triggers the CCD camera to capture an image of the board's surface. We collected 100 images of red pine and camphor pine planks (Figure 13) and eventually divided them into about 8000 small local image patches (Figure 14).

**Figure 12.** Wood image acquisition equipment.

**Figure 13.** Collected image of the surface of a wood.

**Figure 14.** Some local images, which were intercepted from collected wood images.

The data collected by the self-developed equipment accounts for 75% of the whole data set, while the remaining 25% was collected in the ROS system: different poses of the robotic arm's end bring the RGB-D camera to collect scene information from different perspectives, yielding images of the same scene from different viewpoints. First, feature points are collected at different positions in each image and 22 × 22 pixel local image patches are intercepted around the points. From the robot's forward kinematics and the hand-eye calibration relationship, the camera pose is known; when the scene is fixed, the corresponding points in images from different perspectives can therefore be identified, and local image patches are intercepted around the same corresponding point in each view. We took pairs of such patches as the input sample and expected output value for training the convolutional autoencoder, collecting a total of 4000 sets of local image patches.

The network training used the deep learning autoencoder model designed in this paper; the structure and size of each layer are shown in Table 1. The encoding stage contains four convolutional layers and the decoding stage four transposed convolutional layers. Training used full-batch learning for 160 epochs; the curve of training error versus iteration number is shown in Figure 15. For fewer than 20 epochs, the loss of the network model decreased quickly; beyond 20 epochs it decreased slowly, and between epochs 120 and 160 it remained essentially stable, i.e., the model converged. Through this training, a 32-dimensional local texture feature descriptor of the wood plank, stable under viewing angle changes, was finally obtained. During the experiments, four feature dimensions for the local image patches (16, 32, 64, and 128) were tested on the recognition of 600 planks with different poses, where a final pose meeting the pose accuracy requirement of the plank to be grasped counted as a correct recognition. We calculated the recognition rate and point pair matching time to evaluate these four feature dimensions, as shown in Figure 16. As the feature dimension increased through 16, 32, 64, and 128, the feature code became richer and the recognition rate of the wood plank increased accordingly, reaching 83.2%, 95.6%, 95.9%, and 96.4%, respectively. Beyond 32 dimensions the recognition rate improved only marginally, while the computing time increased; at 64 and 128 dimensions, the calculation time of the plank pose increased even more. Weighing recognition rate against computing time, the 32-dimensional local image feature descriptor of the wooden plank was finally selected.

**Figure 15.** Iterative loss function curve of the deep convolution auto-encoder.

**Table 1.** Deep convolutional auto-encoder construction.


**Figure 16.** The influence of feature dimensions of the local image patch on recognition performance.

#### *5.2. Grasping of Planks*

The robot first grasps the top plank of the stack. The original point cloud of the stacked plank scene under the RGB-D camera, visualized with the Rviz tool, is shown in Figure 17. Using the Euclidean distance clustering method [41], the point cloud under the camera was divided into different areas so that different planks correspond to different point cloud clusters; the visualization in Rviz is shown in Figure 18. The cluster with the smallest average depth was confirmed as the current priority grasping area, and point pair matching was completed using the aforementioned method based on point pair features and local image patch features. Then, we performed pose voting and clustering and finally determined the pose ${}^{c}\_{o}M$ of the plank to be grasped.

**Figure 17.** Original point cloud.

**Figure 18.** The segmentation result of point cloud after regional clustering.

After obtaining the pose ${}^{c}\_{o}M$ of the plank to be grasped in the camera coordinate system, the current pose ${}^{b}\_{h}M$ of the robot end (from the forward kinematic solution of the robot), and the hand-eye calibration result ${}^{c}\_{h}M$, the plank's pose was converted to the robot base coordinate system:

$${}^{b}\_{o}M = {}^{b}\_{h}M \, {}^{c}\_{h}M^{-1} \, {}^{c}\_{o}M \tag{10}$$
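Equation (10) is a composition of 4 × 4 homogeneous transforms; a minimal sketch with placeholder matrices:

```python
import numpy as np

def plank_pose_in_base(M_bh, M_ch, M_co):
    """Eq. (10): express the plank pose in the robot base frame.

    M_bh: end-effector pose in the base frame (forward kinematics);
    M_ch: hand-eye calibration result; M_co: plank pose in the camera frame.
    """
    return M_bh @ np.linalg.inv(M_ch) @ M_co

I4 = np.eye(4)                                # placeholder transforms
M_bo = plank_pose_in_base(I4, I4, I4)
```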

The grasping operation was then realized by driving the end of the robotic arm to this pose. As shown in Figure 18, the top cluster contains only one piece of wood, which is easy to locate and grasp with our method. After the robot hand grasps and removes the top board, the original second cluster layer becomes the top layer. That cluster contains two planks lying close together; as shown in Figure 19, the two planks are not fully presented in the camera view and their respective images do not include a whole plank, which is similar to an occlusion effect. In this case, our method can still obtain recognition results for the cluster with multiple boards and select the target to be grasped (Figure 19): the multiple voting peaks obtained by the PPF algorithm give the poses of the multiple planks in the same cluster, and the pose with the highest voting score is selected as the target for the robot to grasp. As shown in Figure 20, the recognition and positioning of the plank to be grasped is accurate, and the robot's grasping action proceeds smoothly.

**Figure 19.** Wood plank recognition results: (**a**) Recognition results of some wood planks in the same cluster; (**b**) The board to be grasped.

**Figure 20.** The result of the robot visually grasping wooden planks: (**a**) Identifying the pose of the plank to be grasped; (**b**) positioning the end of the robot and grasping; (**c**) during the robot handling process; (**d**) preparing to place the grasped wooden plank.

We carried out 1000 grasping experiments on randomly placed planks in stacked plank piles, performing the same number of experiments with other methods, and compared the recognition rate, average recognition time, and grasping success rate. If the positioning accuracy of the recognition result is within 3 mm and the rotation angle error within 2°, the result is suitable for successful grasping and is regarded as a correct recognition. The results are shown in Table 2, with the corresponding recognition performance comparison in Figure 21. Here, "PPF" [7] is the traditional point pair method; "CPPF" [14] is the point pair feature with added color information; "SSD-6D" [21] uses convolutional neural networks trained end-to-end to obtain the pose of an object; "LPI" uses only the local image patches proposed here to match feature points, with ICP calculating the pose and no point pair matching; and "LPPPF" is the method we propose, determining the pose from local image patches combined with point pair feature matching, pose voting, and clustering.


**Table 2.** Performance of different combination methods in grasping wood planks. PPF: Point pair feature.

From Table 2 and Figure 21, it can be seen that the plank recognition rate is closely related to the robot grasping success rate: the higher the recognition rate, the greater the success rate. Feature descriptors using only the PPF method describe the surface features of the wood plank poorly, so its recognition rate is significantly lower. Even though the CPPF method adds color information to the traditional point pair feature, color is only one feature of the plank, and its recognition rate is also not high here. SSD-6D, which uses convolutional neural networks trained end-to-end to obtain the object pose, likewise does not achieve a high recognition rate because of its low positioning accuracy. Using only local image patches has certain advantages in describing wood texture features, and the recognition rate improves to an extent; however, the feature description is not comprehensive enough, resulting in a recognition rate of only 85.1%. The LPPPF method proposed here improves the recognition rate of the plank to be grasped relative to the other methods: it is about 11 percentage points higher than the deep-learning-based SSD-6D method and about 9 percentage points higher than using local image patch features alone. Its average computing time is also relatively short at 396 ms. This shows that the method has clear application advantages for grasping in disorderly stacked plank scenes. Extracting the texture features of local wood image patches with the convolutional autoencoder and combining them with point pair features expresses the surface features of the wood better. At the same time, in view of the hierarchical nature of wood stacking, Euclidean distance clustering is used for segmentation first, which avoids collecting image patches over the entire scene for matching, greatly reducing the number of local image patches to extract and ultimately the recognition time.

**Figure 21.** Comparison of the recognition performance of different methods.

#### **6. Conclusions**

The main shape of a plank is a large, symmetrical, regular plane, so conventional methods find it difficult to identify and locate planks to be grasped in scenes of disorderly stacked planks, which in turn makes robot grasping difficult. A recognition and positioning method combining local image patches and point pair features was proposed here. Image patches were collected from disorderly stacked wooden boards in the robot vision scene, and a convolutional autoencoder was trained to obtain a 32-dimensional local texture feature descriptor robust to viewing angle changes. The local image patches around point pairs of the single-plank model were extracted, their feature codes were obtained through the trained encoder, and the point pair geometric features were combined with them to form a feature code describing the plank. In the stacked plank scene, the area of the plank to be grasped was segmented by Euclidean distance clustering, its feature codes were extracted, and the plank to be grasped was identified through point pair matching, pose voting, and clustering. The robot grasping experiments show that the recognition rate of this method is 95.3% and the grasping success rate is 93.8%. Compared with PPF and other methods, the method presented here has obvious advantages; it is suitable for grasping disorderly stacked wood planks and provides a useful reference for recognition and grasping in similar conditions.

**Author Contributions:** Methodology, writing—original draft, C.X.; conceptualization, funding acquisition, supervision, Y.L.; validation, F.D., and Z.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** Key Research & Development Plan of Jiangsu Province (Industry Foresight and Key Core Technologies) Project (grant no. BE2019112), Jiangsu Province Policy Guidance Program (International Science and Technology Cooperation) Project (grant no. BZ2016028), Qing Lan Project of the Jiangsu Province Higher Education Institutions of China, Natural Science Foundation of Jiangsu Province (grant no. BK20191209), Nantong Science and Technology Plan Fund Project (grant no. JC2019128).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*
