#### *2.1. Faster R-CNN for Target Detection*

The Faster R-CNN is chosen for preliminary detection because of its high accuracy in chimney detection compared with other methods [21]. As mentioned before, the Faster R-CNN contains two steps [10]. The first step is the Region Proposal Network (RPN), which takes an image as input and outputs a set of rectangular object proposal regions, each with an objectness score. The second step is Fast R-CNN detection in the proposed regions. Rather than learning two separate networks, the RPN and Fast R-CNN share the same convolutional layers. Figure 2 shows the structure of Faster R-CNN. It first applies a deep fully convolutional network to the input image to obtain feature maps. The feature maps are then used by the RPN to generate proposal regions. Fast R-CNN applies region of interest (ROI) pooling to the feature maps and proposal regions, after which fully connected layers perform the classification and regression operations.

**Figure 2.** Faster region-based convolutional neural networks (R-CNN) structure diagram.

Different types of targets correspond to different anchors, which are a series of reference boxes placed at each sliding-window position when region proposals are generated. Anchor sizes can be chosen from previous experience. To fit chimney and condensing tower detection, we set four anchor scales, 32², 64², 128², and 256², and five aspect ratios, 1:1, 1:2, 1:3, 2:1, and 3:1. ResNet-101 [29] trained on COCO [30] is selected as the pre-training model; it is one of the most widely used models in the field of target detection because of its high accuracy and speed.
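The anchor configuration above can be sketched as follows. This is an illustrative generator, not the actual RPN implementation; the function name and the (width, height) parameterization are assumptions:

```python
import numpy as np

def generate_anchors(scales=(32, 64, 128, 256),
                     ratios=(1/1, 1/2, 1/3, 2/1, 3/1)):
    """Generate the 4 x 5 = 20 reference anchors used at each
    sliding-window position, as (width, height) pairs in pixels."""
    anchors = []
    for s in scales:          # anchor area is s * s pixels
        area = float(s * s)
        for r in ratios:      # r = width / height
            h = np.sqrt(area / r)
            w = r * h
            anchors.append((w, h))
    return np.array(anchors)

anchors = generate_anchors()
print(anchors.shape)  # (20, 2)
```

Each anchor keeps the area of its scale while the aspect ratio stretches the box, so a 1:3 anchor at scale 32² is tall and narrow, matching the shape of a chimney.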

#### *2.2. The Elevation Filtering Using Local DTM*

DTM is a digital description of the shape, size, and elevation of terrain. Chimneys and condensing towers are usually higher than the surrounding features. Where there is a chimney or a condensing tower, the DTM shows obvious fluctuations, and the height difference can be as large as 20 m. Where a false detection appears, the DTM changes much more gradually.

To get the DTM slice images, which are pieces of the whole DTM cut out at the target bounding boxes, the detection results of Faster R-CNN are first registered to the DTM. The bounding boxes are then used to cut slices from the DTM, and statistics are computed on each slice. The maximum and mean height of each DTM slice are calculated as follows:

$$V\_{\text{mean}} = \frac{1}{m \times n} \sum\_{i=1}^{m} \sum\_{j=1}^{n} f(x\_i, y\_j) \tag{1}$$

$$V\_{\max} = \max\left(f(x\_1, y\_1), f(x\_2, y\_2), \dots, f(x\_m, y\_n)\right) \tag{2}$$

where *Vmean* is the average value of the slice, *Vmax* is the maximum value of the slice, *f*(*xi*, *yj*) is the pixel value of the slice, and *m* and *n* are the numbers of rows and columns of the slice, respectively. The filter condition is given by:

$$\begin{cases} \ V\_{\max} - V\_{\text{mean}} > T, & \text{true positive}\\ V\_{\max} - V\_{\text{mean}} \le T, & \text{false positive} \end{cases} \tag{3}$$

*T* is the threshold value. The difference between the maximum and mean height in the slice should be larger than the threshold; otherwise, the detected object is considered a false positive and removed from the set of detected chimneys. The threshold is set to 20 m according to the National Standard of China, the Emission Standard of Air Pollutants for Boilers [31], which states that a coal combustion chimney should not be lower than 20 m. Moreover, we also experimentally tested five threshold values. The results are shown in Table 1. When the threshold is 16 m or 18 m, the number of false positive targets is still too large. When the threshold increases to 20 m, although 3 chimneys are mis-removed, the number of false positive targets is greatly reduced. When the threshold is 22 m or 24 m, too many chimneys are mis-removed. Thus, a 20 m threshold reaches a good compromise between low mis-removal and effective deletion of false positive targets.

**Table 1.** Threshold experiments.
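Equations (1)–(3) can be sketched as a minimal NumPy filter. The function and variable names are illustrative, and the bounding boxes are assumed to be already registered to the DTM grid:

```python
import numpy as np

def elevation_filter(dtm, boxes, threshold=20.0):
    """Keep only detections whose DTM slice shows a max-minus-mean
    height difference above the threshold (Equations (1)-(3)).
    `dtm` is a 2-D elevation array registered to the image; `boxes`
    are (row_min, col_min, row_max, col_max) bounding boxes."""
    kept = []
    for (r0, c0, r1, c1) in boxes:
        s = dtm[r0:r1, c0:c1]            # DTM slice for this detection
        v_mean = s.mean()                # Equation (1)
        v_max = s.max()                  # Equation (2)
        if v_max - v_mean > threshold:   # Equation (3): true positive
            kept.append((r0, c0, r1, c1))
    return kept

# Synthetic check: a 50 m spike in one box, flat terrain in the other.
dtm = np.zeros((100, 100))
dtm[15, 15] = 50.0
kept = elevation_filter(dtm, [(10, 10, 20, 20), (40, 40, 60, 60)])
print(kept)  # [(10, 10, 20, 20)]
```

A tall, thin structure raises the slice maximum far above the slice mean, while a false positive on flat or gently sloping ground does not, which is exactly the contrast the threshold exploits.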


#### *2.3. Main Direction Test*

The chimney is a long, vertical object, so the image slice of a bounding box that contains a chimney shows obvious directional texture features. Moreover, the chimneys and condensing towers in one high-resolution remote sensing image all point in approximately the same direction, which is called the main direction in this paper. We found that many mis-detected targets do not share this feature. Therefore, false chimneys can be further removed by testing their consistency with the main direction. Principal component analysis (PCA) is used to calculate the main direction of each image slice: the leading eigenvector of the covariance matrix of the slice's pixel coordinates gives its dominant orientation.


Figure 3 shows two examples of using this method to find the main direction of each detected target. After calculating the main directions of all slices, their distribution histogram is computed at intervals of 5°. The maximum value in the histogram is taken as the main direction *d* of the entire image. A detected target whose main direction is close to the main direction of the image is then considered a true detection. The decision criterion is set to *d* ± 5° for chimneys and *d* ± 8° for condensing towers, since the condensing tower is much wider than the chimney in the image.

**Figure 3.** Main direction rotation image. The green arrow represents the main direction, while the yellow arrow represents the direction perpendicular to the main direction.
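Assuming the per-slice flow is the standard PCA-on-pixel-coordinates computation, the main direction test can be sketched as follows; the foreground threshold and function names are illustrative:

```python
import numpy as np

def main_direction(slice_img, fg_threshold=128):
    """Estimate the dominant orientation of a target slice via PCA:
    the leading eigenvector of the covariance matrix of the foreground
    pixel coordinates.  Returns an angle in degrees in [0, 180)."""
    ys, xs = np.nonzero(slice_img > fg_threshold)   # foreground pixels
    coords = np.stack([xs, ys]).astype(float)
    coords -= coords.mean(axis=1, keepdims=True)    # centre the cloud
    cov = coords @ coords.T / coords.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    vx, vy = eigvecs[:, -1]                         # leading eigenvector
    return np.degrees(np.arctan2(vy, vx)) % 180.0

def image_main_direction(angles, bin_width=5.0):
    """Histogram all slice directions at 5-degree intervals and return
    the centre of the most populated bin as the image main direction d."""
    bins = np.arange(0.0, 180.0 + bin_width, bin_width)
    hist, edges = np.histogram(angles, bins=bins)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])

# A synthetic vertical stripe should yield a direction near 90 degrees.
img = np.zeros((50, 50))
img[5:45, 24:26] = 255
print(round(main_direction(img)))  # 90
```

A target would then be accepted if its slice direction falls within *d* ± 5° (chimney) or *d* ± 8° (condensing tower) of the histogram peak.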

#### **3. Results**

#### *3.1. Dataset, Experimental Area, and Data*

The dataset used in this experiment is BUAA-FFPP60, which is collected and produced by Beihang University. The dataset is composed of chimneys and condensing towers distributed among power plants in a 123 km² area of the Beijing–Tianjin–Hebei region. There are 318 original pictures, of which 31 are test pictures. The remaining 287 pictures are mirrored or rotated by 90° to generate 861 training pictures. The pictures come from Google Maps with a resolution of 1 m, ranging in size from 500 × 500 to 700 × 1250 pixels. The working state of a chimney or condensing tower is determined by whether there is smoke. The four labels in the dataset are working chimney, non-working chimney, working condensing tower, and non-working condensing tower. Figure 4 shows some examples from the dataset.

**Figure 4.** BUAA-FFPP60 dataset samples. The four subfigures show working chimneys, non-working chimneys, working condensing towers, and non-working condensing towers, respectively.
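The 287 originals expand to 861 training pictures, i.e., three variants per original. As a hedged illustration (the exact combination of mirroring and 90° rotations used for BUAA-FFPP60 is an assumption), the augmentation can be sketched with NumPy:

```python
import numpy as np

def augment(image):
    """Produce three augmented variants of one training image by
    mirroring and 90-degree rotation, so 287 originals expand to
    287 * 3 = 861 training pictures.  The specific flip/rotation
    combination is assumed, not taken from the dataset paper."""
    return [
        np.fliplr(image),            # horizontal mirror
        np.rot90(image, k=1),        # rotate 90 degrees counter-clockwise
        np.rot90(np.fliplr(image)),  # mirror, then rotate 90 degrees
    ]

img = np.arange(12).reshape(3, 4)
variants = augment(img)
print(len(variants))  # 3
```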

The area selected for this experiment is Tangshan City, Hebei Province, located 180 km southeast of Beijing. It is a regional core city of the Beijing–Tianjin–Tangshan city group and bears the task of relieving the industrial pressure of Beijing, the capital of China. Tangshan City is a typical industrial city in North China; its total crude steel production in 2018 was 133 million tons, about 7.35% of the world's total. Meanwhile, it is also one of the cities with the worst air quality in the country. According to the "Tangshan City Environmental Status Report", in 2011 the emissions of sulfur dioxide and nitrogen oxides in Tangshan City were 336.54 thousand tons and 40.59 thousand tons, respectively [32]. Numerous steel factories and power plants with a large number of chimneys and condensing towers in Tangshan have contributed the most to the hazardous air pollutants. Therefore, investigating the positions and working status of chimneys and condensing towers is very important to regional environmental governance.

Three Google Maps images with 1 m resolution covering about 600 km² are used for the final detection. The sizes of the images are 16,000 × 25,000 pixels, 10,000 × 10,000 pixels, and 10,000 × 10,000 pixels, respectively. The images cover Lubei District, Guye District, Kaiping District, and Fengrun District. An image from the ZiYuan-3 satellite with a size of 24,500 × 20,000 pixels is used to generate the DTM.

#### *3.2. Experimental Results and Analysis*

#### 3.2.1. Accuracy of Faster R-CNN Trained Model

We performed the experiments on a computer with a 2.5 GHz Central Processing Unit (CPU) and an NVIDIA GeForce RTX 2080 Ti Graphics Processing Unit (GPU). The memory sizes of the CPU and GPU are 8 GB and 11 GB, respectively. The TensorFlow [33] deep learning framework was used to train on the 861 Google Maps images of the BUAA-FFPP60 dataset. The pre-training model is the ResNet-101 [29] model trained on COCO [30]. The number of training iterations is 170,000 and the learning rate is 0.001.

To evaluate the detection accuracy of the Faster R-CNN model, we test the trained model on the test images of the BUAA-FFPP60 dataset. When a detected target is true, the result is a true positive (TP); when a detected target is false, the result is a false positive (FP). A false negative (FN) is an undetected true target in the image. These counts are combined into three metrics, precision (P), recall (R), and quality (Q):

$$P = \frac{TP}{TP + FP} \tag{4}$$

$$R = \frac{TP}{TP + FN} \tag{5}$$

$$Q = \frac{TP}{TP + FP + FN} \tag{6}$$
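Equations (4)–(6) can be computed directly from the TP/FP/FN counts; a minimal sketch with illustrative counts:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and quality from true-positive,
    false-positive, and false-negative counts (Equations (4)-(6))."""
    p = tp / (tp + fp)       # Equation (4)
    r = tp / (tp + fn)       # Equation (5)
    q = tp / (tp + fp + fn)  # Equation (6)
    return p, r, q

# Example with hypothetical counts: 8 correct detections,
# 2 false alarms, 2 missed targets.
p, r, q = detection_metrics(8, 2, 2)
print(p, r, round(q, 4))  # 0.8 0.8 0.6667
```

Note that quality is always the strictest of the three, since it penalizes both false alarms and missed targets in one denominator.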

For the test samples, the precisions of the working chimney, non-working chimney, working condensing tower, and non-working condensing tower classes are 0.7210, 0.7326, 0.9482, and 0.9551, respectively. The recall rates are 0.8674, 0.8642, 0.9707, and 0.9659, respectively. The qualities are 0.6451, 0.6629, 0.9423, and 0.9473, respectively.
