Article

Infrared Bird Target Detection Based on Temporal Variation Filtering and a Gaussian Heat-Map Perception Network

1 Department of Information Science, Xi’an University of Technology, Xi’an 710048, China
2 Shaanxi Provincial Key Laboratory of Printing and Packaging Engineering, Xi’an University of Technology, Xi’an 710048, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(11), 5679; https://doi.org/10.3390/app12115679
Submission received: 14 April 2022 / Revised: 18 May 2022 / Accepted: 31 May 2022 / Published: 2 June 2022

Abstract

Flying bird detection has recently attracted increasing attention in computer vision. However, compared to conventional object detection tasks, it is much more challenging to detect flying birds in infrared videos due to small target size, complex backgrounds, and dim shapes. To address the poor detection performance caused by the insufficient feature information of small and dim birds, this paper proposes a method for detecting birds in outdoor environments that combines image pre-processing and deep learning, namely temporal variation filtering (TVF) and a Gaussian heatmap perception network (GHPNet), respectively. TVF separates the dynamic background from moving creatures. Exploiting the appearance of a bird target, which is brightest at the center and becomes gradually darker outwards, a size-adaptive Gaussian kernel is used to generate the ground truth of the region of interest (ROI). In order to fuse features from different scales and to highlight the saliency of the target, the GHPNet network integrates VGG-16 and a maximum-no-pooling filter into a U-Net network. Comparative experiments demonstrate that the proposed method outperforms state-of-the-art methods in detecting bird targets in real-world infrared images.

1. Introduction

Birds have various effects on human-related activities. In aviation, bird strikes can cause serious flight accidents, so it is very important to detect all surrounding birds before they collide with an aircraft. In addition, knowing the activity information of birds not only helps prevent the mass death of birds from an ecological standpoint but also protects crops from bird damage in agriculture. In this paper, we specifically focus on detecting the presence of all birds in infrared images. If all birds can be accurately detected in the monitoring area, various bird-related issues can be addressed.
Detection of flying birds from video sequences has gained a lot of interest in recent years due to its wide range of applications, such as aviation safety, avian protection, and the ecological study of migratory bird species [1,2,3,4]. However, detecting infrared birds in outdoor environments remains challenging due to low signal-to-noise ratio, small target size, illumination variation, shadows, occlusion, shape deformation, etc., so it is of great significance to study efficient bird target detection and location methods.
In recent decades, researchers have conducted numerous studies on infrared dim target detection algorithms for image sequences, which mainly include traditional image processing techniques and deep learning methods.
Traditional infrared small target detection methods are mainly divided into two categories: spatial domain detection methods and time domain detection methods. Spatial domain detection methods [5,6,7,8,9,10,11] normally utilize the contrast between the targets and the surrounding background to distinguish small targets from the background. For a simple background, these methods can effectively suppress clutter and enhance the target. However, when large amounts of noise and clutter resemble the target, these algorithms face problems such as a low detection rate, a high false alarm rate, and poor real-time performance. The time domain detection methods [12,13] mainly use the strong correlation between adjacent image frames to estimate the background and then compute the difference between the observed image and the background to obtain the target. Because additional temporal information is used, these detection methods generally achieve higher detection accuracy; however, when non-target objects such as clouds and trees move in front of the camera, or when the camera itself moves, the target detection performance is greatly reduced.
With the rapid development of deep convolutional neural networks (CNNs [14]), CNN-based object detection methods have achieved significant performance improvements in object detection tasks. Target detection methods based on CNNs mainly include two-stage and one-stage methods. The two-stage methods first generate candidate regions and then classify and regress them; they mainly include R-CNN [15], Fast R-CNN [16], Faster R-CNN [17], Mask R-CNN [18], and Cascade R-CNN [19]. The one-stage methods, such as the Single Shot MultiBox Detector (SSD [20]), YOLO [21], YOLO9000 [22], YOLOv3 [23], YOLOv4 [24], RFB-Net [25], and RefineDet [26], do not generate candidate regions but directly perform classification and regression, which gives them better real-time performance than the two-stage methods. SSD [20] is one of the fastest algorithms in the current target detection field and has achieved good results, but it suffers from poor extraction of features in shallow layers and loss of features in deep layers [27]. By applying a context information scene perception (CISP) module to the network structure of SSD, CISPNet [28] detects small targets with a higher detection accuracy than the original SSD [20]. By integrating feature enhancement and feature fusion into the conventional SSD [20] structure, a single-shot object detection method named FFESSD [27] achieves better performance in fuzzy target detection. However, for small objects that are dense and overlapping in the scene, the detection performance of both algorithms is not ideal. Because of their powerful modeling ability, CNN-based methods have achieved promising results in general target detection. However, existing CNN-based methods cannot be directly applied to infrared small targets because the weak texture and low contrast lead to loss of the target in deep layers.
Since U-Net [29] can realize multi-scale feature fusion by connecting the down-sampled feature maps with the up-sampled feature maps of the same scale to enhance the descriptive power of features, many improved algorithms based on U-Net [29] have been proposed for small infrared target detection [30,31,32]. A lightweight convolutional neural network, called TBC-Net [30], was proposed for infrared small target detection. TBC-Net consists of a target extraction module (TEM) and a semantic constraint module (SCM), which are used to extract small targets from infrared images and to classify the extracted target images during training, respectively. By integrating global and local dilated residual convolution blocks into U-Net [29], the residual learning CNN model DRU-Net (dilated residual U-Net) [31] predicts the residual image (i.e., background, clutter, and noise) rather than the clean image. Once the residual image is estimated, the targets are obtained by subtracting the residual image from the input image. However, the performance of DRU-Net [31] suffers when multiple targets flock together or overlap each other. To handle the problem that pooling layers can lead to the loss of targets in deep layers, a dense nested attention network (DNA-Net) [32] was designed to maintain infrared small targets in deep layers through the repeated fusion and enhancement of high-level and low-level features.
Although many density estimation schemes [33,34] have been proposed to solve the crowd counting problem, these counting-by-density-estimation approaches focus on measuring the number of people per unit area and compromise the accurate detection of single individuals.
Our work is inspired by the density learning exploited for crowd counting. However, in crowd counting the density map corresponding to a training sample is calculated by summing all the 2D Gaussian kernels centered at each person’s location. Different from those methods, we use a size-adaptive Gaussian kernel instead of the sum of multiple Gaussian kernels to generate the ground truth of each bird. The advantage of this is that it not only makes use of the brightness distribution of an infrared bird target, which resembles a Gaussian kernel, but also facilitates the extraction of individual birds from crowded populations. To solve the problem that pooling layers in the network can lead to the loss of targets in deep layers, we propose a model that incorporates max-no-pooling convolution blocks into the U-Net. The introduction of the max-no-pooling convolution blocks enhances the salience of bird targets. Furthermore, 1 × 1 convolution is employed to improve calculation speed.
In summary, the main contributions of this paper are listed as follows.
(1)
We propose a two-stage flying bird detection method for infrared video, which consists of a pre-processing stage and a deep learning stage. The former is used for background separation, and the latter is used for re-detection of overlapping targets;
(2)
We propose a method for generating ground truth of bird targets, which can be automatically generated by using a size-adaptive Gaussian kernel;
(3)
We propose a novel Gaussian heatmap perception network (GHPNet) to predict individual birds in highly crowded and occluded scenes;
(4)
We replace the traditional maxpooling filter with maximum-no-pooling filtering to maintain small target features in deeper network layers, thus avoiding the loss of small objects;
(5)
The experimental results show that our method is not only superior to state-of-the-art methods in several infrared bird videos but also has near real-time performance.

2. The Proposed Method

The flow of our algorithm is shown in Figure 1, which is mainly composed of first-stage moving target detection based on TVF and second-stage target confirmation by GHPNet. Firstly, using the motion characteristics of the bird flock, TVF is performed on the infrared video sequence to obtain the candidate motion regions, that is, a single bird or multiple birds with a certain overlap in space. Secondly, the size consistency of birds in the flock is used to decide whether a candidate target enters the second-stage detection. To address the problem that traditional detection networks have difficulty distinguishing individuals in a dense flock of birds, a bird target detection network based on Gaussian heat map perception, called GHPNet, is then designed, which can predict Gaussian heat maps with clear boundaries between targets. Finally, the predicted Gaussian heat map is post-processed by the watershed algorithm to obtain each bird in the overlapping region.

2.1. TVF and Screening Candidate Targets

Over the past few decades, a range of approaches have been developed for the detection of small moving targets in infrared (IR) sequences. Among them, by using information from consecutive frames to calculate the difference between the background and the target, the temporal contrast filter (TCF) has been used successfully to detect point targets. The key part of the TCF is the background signature estimation by a minimum filter to maximize the signal-to-noise ratio. However, when the illumination flickers, using the minimum brightness over consecutive frames as the background brightness value degrades the detection. In order to avoid the influence of illumination change on the detection results, unlike TCF, our moving target detection uses temporal variation filtering rather than minimum filtering, and is therefore called TVF.
If $I(i,j,k)$ denotes the intensity of the $(i,j)$ pixel in the current k-th frame, the TVF at $(i,j,k)$ is defined by Equation (1), in which the buffer size k is selected by trading off the accuracy of the background estimation against the complexity of the TVF algorithm; it is an empirical value set to six in the experiments, so that k − 1 frames are used to estimate the background intensity by the averaging function mean(·). One reason for choosing the mean rather than the median for background estimation is that the former is faster. Another reason is that when a flicker slightly lower than the target brightness value appears only once in the consecutive frames, the median method reduces the difference between the current pixel and the predicted background, resulting in a failure of target detection.
$$TVF(i,j,k) = I(i,j,k) - \underset{n=1,2,\ldots,k-1}{\mathrm{mean}}\, I(i,j,n) \tag{1}$$

$$Bi(i,j,k) = \begin{cases} 1, & TVF(i,j,k) > Thr(k) \\ 0, & TVF(i,j,k) \le Thr(k) \end{cases} \tag{2}$$

$$Thr(k) = \mu_{TVF}(k) + a\,\sigma_{TVF}(k) \tag{3}$$
The adaptive threshold segmentation is conducted by using Equations (2) and (3), where $\mu_{TVF}(k)$ and $\sigma_{TVF}(k)$ represent the mean and the standard deviation of the k-th frame of TVF over the spatial dimensions, respectively. If $TVF(i,j,k)$ is greater than the adaptive threshold $Thr(k)$, the current pixel $Bi(i,j,k)$ is binarized to 1; otherwise, it is binarized to 0. We execute the open-source OpenCV library function cv2.findContours(Bi, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) to extract the target contours, in which the arguments Bi, cv2.RETR_TREE, and cv2.CHAIN_APPROX_SIMPLE denote the input image, the retrieval mode, and the approximation method, respectively; the region within each contour is declared a detected candidate target.
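For concreteness, the following is a minimal sketch of the first-stage TVF detection (Equations (1)–(3)) in Python with NumPy and OpenCV; the threshold coefficient value a = 3.0 and the OpenCV 4 return signature of cv2.findContours are assumptions, not values stated above.

```python
import cv2
import numpy as np

def tvf_detect(frames, a=3.0):
    """Minimal sketch of first-stage TVF detection (Equations (1)-(3)).

    `frames` is a sequence of k grayscale frames (the buffer, k = 6 in the
    paper); the last entry is the current frame, and the preceding k - 1
    frames are averaged to estimate the background.
    """
    frames = np.asarray(frames, dtype=np.float32)
    current = frames[-1]
    background = frames[:-1].mean(axis=0)      # Equation (1): mean of the k-1 previous frames
    tvf = current - background

    thr = tvf.mean() + a * tvf.std()           # Equation (3): adaptive threshold
    bi = (tvf > thr).astype(np.uint8)          # Equation (2): binarization

    # Contour extraction as described above (OpenCV 4 return signature assumed).
    contours, _ = cv2.findContours(bi, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]   # candidate boxes (x, y, w, h)
```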
It can be seen from the first-stage detection results in Figure 2 that the TVF algorithm can accurately detect moving regions. However, it is difficult to distinguish birds that are close to each other or overlap in space, and identifying multiple birds as a single bird will inevitably lead to a certain degree of deviation in airport bird detection and prediction. To address the problem that the first-stage detection algorithm has difficulty segmenting clustered birds, this paper proposes a Gaussian heat map-aware bird target detection network to re-segment the candidate targets screened from the primary detection results, so as to accurately locate single birds appearing in a clustered area. Because birds flying together basically belong to the same species and are similar in size, whether a detection result is a single bird can be judged according to the similarity between the candidate target area and the average bird area. Equation (4) is used to screen the candidate regions that enter the second-stage GHPNet network for further detection of dense objects, where $S_m$ is the area of the m-th candidate target $C_m$, $S_{mean}$ is the average area of all the bounding boxes of targets detected by the TVF algorithm, and b is the preset screening coefficient, which is 0.8 in the experiments.
$$CS = \{\, C_m \mid S_m > b \times S_{mean},\ m \in [1, C] \,\} \tag{4}$$
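As a small illustrative sketch, the screening of Equation (4) can be written over the (x, y, w, h) boxes produced by the contour step above; the helper name is hypothetical.

```python
def screen_candidates(boxes, b=0.8):
    """Sketch of Equation (4): keep candidates whose area exceeds b times the
    average area of all detected boxes; these regions are sent to GHPNet."""
    areas = [w * h for (_, _, w, h) in boxes]
    s_mean = sum(areas) / max(len(areas), 1)
    return [box for box, s in zip(boxes, areas) if s > b * s_mean]
```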

2.2. Gaussian Heat Map Sample Production

As can be seen from Figure 3, in the infrared video sequence the bird target is brightest in its central area and gradually becomes darker as it expands outward, which is consistent with the distribution of a Gaussian model. Inspired by this observation, and aiming at the difficulty of distinguishing individual birds in gathering and overlapping regions, Gaussian heat map annotations are automatically generated for the manually selected bird targets. By using a size-adaptive two-dimensional Gaussian kernel, the ground-truth Gaussian heat map can be quickly and automatically generated for multiple targets in a given candidate target area. For the m-th candidate target image block $C_m$ screened by the TVF algorithm, all birds existing in the scene are first manually labeled with a set of $Z_m$ rectangle boxes $R_m^{obj} = \{R_m^z\}_{1 \le z \le Z_m}$, whose corresponding center point set and maximum side length set are $P_m^{obj} = \{P_m^z\}_{1 \le z \le Z_m}$ and $L_m^{obj} = \{L_m^z\}_{1 \le z \le Z_m}$, respectively. The generation process of the ground-truth Gaussian heat map is as follows: given a candidate dense region $C_m$ screened by the TVF algorithm, after manually selecting bounding boxes for all targets, the Gaussian heat map $D_m^{gt}$ of image patch $C_m$ is automatically obtained by assigning the corresponding Gaussian value within each target box and assigning zero outside all the bounding boxes. In other words, for any pixel p in the image block $C_m$, the bounding box where it is located is determined. If it belongs to a certain box $R_m^z$, the ground-truth value of its Gaussian heat map is calculated with the Gaussian kernel function $\mathcal{N}(p \mid \mu, \sigma)$ from its distance to the center point $P_m^z$ and the side length $L_m^z$ as
$$D_m^{gt}(p) = \mathcal{N}(p \mid \mu, \sigma) = \mathcal{N}\!\left(p \,\Big|\, P_m^z, \frac{L_m^z}{2}\right) \tag{5}$$

Here, $\mathcal{N}(p \mid \mu, \sigma)$ denotes the Gaussian kernel function, and $\mu$ and $\sigma$ represent the mean and the standard deviation of the normal distribution, respectively.
Figure 3 shows the process of generating Gaussian heat maps for the targets in an image of a bird gathering or overlapping region. It can be seen from Figure 3 that the edges of targets in the original image are blurred and the boundaries between them are unclear, resulting in a partial overlap between the manually selected rectangular boxes. However, after Gaussian heat map processing, each target is not only complete but also clearly distinguishable from the others, which demonstrates the effectiveness of the proposed Gaussian heat map sample labeling method to a certain extent.
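The following is a minimal sketch of this ground-truth generation for one candidate patch, assuming annotation boxes in (x, y, w, h) form; how overlapping boxes are resolved (here, a per-pixel maximum) and the unit peak value of the unnormalized Gaussian are assumptions not specified above.

```python
import numpy as np

def gaussian_heatmap_gt(patch_shape, boxes):
    """Size-adaptive Gaussian heat map ground truth (Equation (5)).

    Each annotated box contributes a 2D Gaussian centred at the box centre
    P_m^z with sigma = L_m^z / 2 (half the longest side); pixels outside all
    boxes stay zero."""
    h, w = patch_shape
    heatmap = np.zeros((h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for (x, y, bw, bh) in boxes:
        cx, cy = x + bw / 2.0, y + bh / 2.0        # box centre P_m^z
        sigma = max(bw, bh) / 2.0                  # sigma = L_m^z / 2
        gauss = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        inside = (xs >= x) & (xs < x + bw) & (ys >= y) & (ys < y + bh)
        # Assumption: overlapping boxes keep the larger of the two Gaussian values.
        heatmap[inside] = np.maximum(heatmap[inside], gauss[inside])
    return heatmap
```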

2.3. GHPNet Network Architecture

The constructed network needs to predict the heat map information of all bird targets: the predicted value should be highest at the center of each bird and gradually decrease outward from the center, which is a very challenging task. The GHPNet network takes an image of size 128 × 128 as input and outputs a pixel-wise Gaussian heatmap of the same size, rather than a binary map of whether each pixel belongs to a target. As shown in Figure 4, the GHPNet consists of three components: a feature encoder (FE), multi-scale fusion (MF), and a Gaussian heat estimator (GHE). The parameter settings of each module are shown in Table 1.
FE comprises the first nine pretrained layers of the VGG-16 [35] network, which extract multi-scale features and multi-level semantic information by combining stacks of 3 × 3 convolution layers and 2 × 2 max-pooling layers. Specifically, it consists of five convolution layers and four max-pooling layers. Each of the first four convolution layers is followed by a max-pooling layer with a stride of two, so the output of the CNN layers is down-sampled by a total factor of 16. The top feature maps (1/16 the size of the original input) from the fifth convolution layer are fed into MF to learn image-level bird features and contextual information at multiple rates.
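A minimal PyTorch sketch of this encoder is given below; the exact slice of the torchvision VGG-16 feature stack (the first four blocks plus two convolutions of the fifth block, with only four pooling layers) is an assumption inferred from Table 1, and the ImageNet-pretrained weights mentioned in Section 3.1 would be loaded separately.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class FeatureEncoder(nn.Module):
    """Sketch of the FE module: VGG-16 blocks Conv1_x-Conv5_x with four
    max-pooling layers, so a 128 x 128 input yields an 8 x 8 x 512 map."""
    def __init__(self):
        super().__init__()
        layers = list(vgg16().features.children())   # load pretrained weights in practice
        # Indices 0-27 cover conv1_1..conv5_2 and the four pooling layers
        # (assumed cut-off; standard VGG-16 has a third convolution in block 5).
        self.blocks = nn.Sequential(*layers[:28])

    def forward(self, x):
        return self.blocks(x)

if __name__ == "__main__":
    fe = FeatureEncoder()
    print(fe(torch.randn(1, 3, 128, 128)).shape)   # torch.Size([1, 512, 8, 8])
```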
The MF module is composed of several groups of skip connection layers, up-sampling layers, and convolution layers. The skip connections encourage the network to learn multi-scale features and help avoid vanishing gradients during training. The up-sampling layers recover high-resolution feature maps. In order to solve the problem that the pooling layers of the network can lead to the loss of small targets, we use a maximum-no-pooling layer to enhance the targets. Specifically, the output feature map of the Conv5 layer of the FE module is used as the input of the Conv6 layer of the MF module, and before the convolution layers of Conv6, the fifth convolution layer from FE is connected to a max-pooling layer with both a stride and a padding of one. The insight behind this maximum-no-pooling layer is to highlight the salient target features without changing their scale.
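The contrast with ordinary pooling can be seen in the short PyTorch sketch below; the 3 × 3 pooling window is an assumption, since only the stride and padding of one are stated above.

```python
import torch
import torch.nn as nn

# Maximum-no-pooling: max pooling with stride 1 and padding 1, so each response
# is replaced by its local maximum while the spatial scale is preserved.
max_no_pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
# Ordinary VGG-style max pooling halves the resolution.
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 512, 8, 8)             # e.g. the Conv5 output of FE
print(max_no_pool(x).shape)               # torch.Size([1, 512, 8, 8])  scale kept
print(max_pool(x).shape)                  # torch.Size([1, 512, 4, 4])  scale halved
```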
Figure 5 shows two original target feature maps and the corresponding results obtained by max-pooling and by maximum-no-pooling. As can be seen from Figure 5, maximum-no-pooling highlights the target area better than max-pooling, which helps ensure that the target is not easily lost in the deep layers of the network. After the convolutions, the output features of each scale from MF are concatenated with the output features from FE at the same scale before passing through a specified fusion layer, which comprises a 1 × 1 convolution layer with a batch-norm layer and a 3 × 3 convolution layer with a batch-norm layer. Additionally, to reduce the computational cost, a 1 × 1 × (Out × 2) convolution is added before each 3 × 3 × Out convolution layer, where Out is the number of channels of the output feature map. Figure 6a,b show the processing procedures without and with 1 × 1 convolution: the input is 8 × 8 × 1024 and the output is 8 × 8 × 256. The numbers of multiplications corresponding to the two different convolution schemes are Num1 and Num2, respectively, calculated as Num1 = 8 × 8 × 1024 × 3 × 3 × 256 = 150,994,944 and Num2 = 8 × 8 × 1024 × 1 × 1 × 512 + 8 × 8 × 512 × 3 × 3 × 256 = 109,051,904. It can be seen that applying 1 × 1 convolution saves 27.8% of the computation compared to not using it. By adding a 1 × 1 convolution before the 3 × 3 convolution of each layer, the MF module can effectively reduce the amount of computation and thus the time complexity of the network.
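The arithmetic behind these two counts can be checked with a few lines of Python:

```python
# Multiplication counts for mapping an 8 x 8 x 1024 feature map to 8 x 8 x 256
# (the example of Figure 6), with and without the 1 x 1 bottleneck.
h, w = 8, 8
num1 = h * w * 1024 * 3 * 3 * 256                               # direct 3 x 3 convolution
num2 = h * w * 1024 * 1 * 1 * 512 + h * w * 512 * 3 * 3 * 256   # 1 x 1 bottleneck, then 3 x 3
print(num1, num2)          # 150994944 109051904
print(1 - num2 / num1)     # ~0.278, i.e. about 27.8% fewer multiplications
```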
In the GHE module, we first use three convolutions to progressively refine the details of the feature map from the MF module, with channel sizes going from 32 to 16. Then, two convolution layers are used consecutively to estimate the Gaussian heat value at each position. Since the values of the Gaussian map are always non-negative, we apply a ReLU activation after the last convolution layer. Finally, GHE generates a high-resolution Gaussian heat map with the same size as the input image. Let $G_{u,v}^{est}$ be the estimated Gaussian heat value of the v-th pixel in the u-th training sample and $G_{u,v}^{gt}$ be the corresponding ground-truth Gaussian heat value. Our network is trained using the $L_2$ loss defined as:
$$L = \frac{1}{N} \cdot \frac{1}{H \times W} \sum_{u=1}^{N} \sum_{v=1}^{H \times W} \left\| G_{u,v}^{est} - G_{u,v}^{gt} \right\|_2^2 \tag{6}$$
where N is the number of samples for batch training and H × W is the spatial dimension of the original input image.
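For per-pixel scalar heat values, this loss reduces to a mean squared error over the batch and the spatial positions, as in the following PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def heatmap_loss(est, gt):
    """Equation (6): squared error between estimated and ground-truth heat maps,
    averaged over the N samples of the batch and the H x W pixels of each map."""
    # est, gt: tensors of shape (N, 1, H, W)
    return F.mse_loss(est, gt, reduction="mean")
```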
Finally, the Gaussian heatmap is processed by calling the watershed segmentation function cv2.watershed() of the open-source OpenCV library, so as to obtain the detection results of the birds in the aggregation area.
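A minimal post-processing sketch is shown below; the way the markers are seeded from the heat map (two relative thresholds, here 0.7 and 0.2) is an illustrative assumption rather than a procedure stated above.

```python
import cv2
import numpy as np

def split_birds(patch, heatmap, peak_thr=0.7, fg_thr=0.2):
    """Split the predicted Gaussian heat map of a candidate patch into
    individual birds with cv2.watershed.  `patch` is the uint8 grayscale
    candidate region, `heatmap` the predicted float heat map."""
    fg = (heatmap > fg_thr * heatmap.max()).astype(np.uint8)       # whole-bird regions
    peaks = (heatmap > peak_thr * heatmap.max()).astype(np.uint8)  # one seed per bird
    unknown = cv2.subtract(fg, peaks)

    # Label each peak as a separate marker; 0 marks the band to be flooded.
    n, markers = cv2.connectedComponents(peaks)
    markers = markers + 1
    markers[unknown == 1] = 0

    # cv2.watershed expects an 8-bit 3-channel image and int32 markers.
    color = cv2.cvtColor(patch, cv2.COLOR_GRAY2BGR)
    markers = cv2.watershed(color, markers.astype(np.int32))
    # Labels >= 2 correspond to individual birds; -1 marks watershed boundaries.
    return [(markers == lbl) for lbl in range(2, n + 1)]
```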

3. Experiments and Analysis

3.1. Datasets, Experiment Setup and Performance Evaluation

Two datasets are used in our experiments: one from the largest publicly released thermal infrared video dataset, TIV [36], and the other a dataset captured by ourselves. Three videos of bats with different densities taken from three viewpoints are selected from TIV [36], called Davis08-sparse, Davis08-dense, and Davis13-medium, respectively; they describe real-world scenarios of bats emerging from their caves in large numbers, with a resolution of 640 × 512. In addition, we also provide two 5-min infrared videos of a flock of birds of an unknown species taken near an airport at different time periods, with the number of birds ranging from 50 to 200.
In the training phase, we used the VGG-16 [35] weights pretrained on ImageNet as the initial weights of the FE block of GHPNet. The initial learning rate was set to 0.01 and decayed by a factor of 0.1 every 160 epochs. We used the Adam algorithm with a mini-batch size of four for optimization.
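In PyTorch terms, this schedule corresponds to the following sketch; a single convolution stands in for GHPNet purely so the snippet is self-contained, and the total number of epochs (320 here) is a placeholder not stated above.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 1, kernel_size=3, padding=1)    # placeholder for GHPNet
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Decay the learning rate by a factor of 0.1 every 160 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=160, gamma=0.1)

for epoch in range(320):
    # ... one pass over the training set with mini-batches of size four ...
    scheduler.step()
```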
In this paper, the precision rate P, the recall rate R, and the weighted harmonic mean F1 of P and R, which are commonly used in target detection, are adopted as the bird detection evaluation indicators [37]. They are calculated as follows.
$$P = \frac{TP}{TP + FP} \tag{7}$$

$$R = \frac{TP}{TP + FN} \tag{8}$$

$$F_1 = \frac{2 \times P \times R}{P + R} \tag{9}$$
where TP (True Positive) is the number of positive samples correctly identified as positive, TN (True Negative) is the number of negative samples correctly identified as negative, FP (False Positive) is the number of negative samples incorrectly identified as positive, and FN (False Negative) is the number of positive samples incorrectly identified as negative.
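A short sketch of these metrics, checked against the values our method reports in Table 2:

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (7)-(9) from counts of true positives, false positives and
    false negatives."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# With P = 91.2% and R = 89.8% (Table 2), F1 = 2 * 0.912 * 0.898 / (0.912 + 0.898) ≈ 0.905.
```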

3.2. Comparisons with the Baseline Methods

In order to verify the performance of the proposed method, we compare it with several state-of-the-art methods: Faster R-CNN [17], Mask R-CNN [18], YOLOv4 [24], U-Net [29], DRU-Net [31], and DNA-Net [32]. From the above datasets, we randomly extract 500 images containing different numbers of birds and bats as training samples for all of the compared methods.
We first show the detection results of all methods on our own bird dataset in Table 2. As shown in Table 2, our method obtains the highest P, R, and F1 scores of 91.2%, 89.8%, and 90.5%, respectively, which are much higher than those of the other methods. In terms of the F1 score, our method is the best, DNA-Net [32] and YOLOv4 [24] are second, and Faster R-CNN [17] is the worst. DNA-Net [32] has a performance comparable to YOLOv4 [24] in F1 score. Although the F1 scores of these two methods are improved by at least 6.0% compared to Mask R-CNN [18] and U-Net [29], they are still 2.3% and 2.5% lower than our method.
We then show the detection results on the bat video sequences of the TIV dataset in Table 3. As shown in Table 3, for the more densely distributed bats, the P, R, and F1 scores of all methods are worse than the corresponding results in the relatively sparse case of Table 2, with an average drop of 2.7 percentage points. For the densely appearing bats in the TIV dataset, Faster R-CNN [17] has the worst detection performance, followed by U-Net [29]. The performance of Mask R-CNN [18] and DNA-Net [32] is slightly better. Compared with the previous four methods, the performance of DRU-Net [31] and YOLOv4 [24] is greatly improved, but it is still worse than that of our method. This state-of-the-art detection performance shows that our method can detect dense small objects well.
To further demonstrate the advantages of our method, we show subjective performance comparisons in Figure 7. Three test images and the enlarged versions of their red boxes are shown in Figure 7a,b, respectively. The detection results of the different methods are given in Figure 7c–g. The actual numbers of birds in the left and the right regions cropped from the first image in Figure 7a are 35 and 47, shown in the first and the second columns of Figure 7b, respectively. As shown in the first and the second columns of Figure 7c–g, the numbers of birds detected are 30 and 41 by Mask R-CNN [18], 15 and 35 by DRU-Net [31], 27 and 37 by DNA-Net [32], 28 and 42 by YOLOv4 [24], and 32 and 45 by our algorithm. The actual number of birds in the second image in Figure 7b is 38; as shown in the third column of Figure 7c–g, the numbers of birds detected by Mask R-CNN [18], DRU-Net [31], DNA-Net [32], YOLOv4 [24], and our method are 33, 30, 34, 34, and 37, respectively. The number of birds in the third image in Figure 7b is 74. For this case, where the birds appear more densely, the detection numbers of Mask R-CNN [18], DRU-Net [31], DNA-Net [32], YOLOv4 [24], and our method are 57, 62, 67, 80, and 73, as shown in the fourth column of Figure 7c–g, respectively. It can be seen that the number of bird objects detected by our method is closest to the ground truth on both our captured dataset and the denser TIV [36] dataset. In order to further illustrate the detection effect of each algorithm, four white dashed boxes are marked in the three images in Figure 7. Taking the first white dashed box in the first image as an example, the numbers of birds detected by Mask R-CNN [18], DRU-Net [31], DNA-Net [32], YOLOv4 [24], and our method are 5, 1, 5, 4, and 6, respectively. It can be seen that all of the comparison methods miss a certain number of targets compared to the actual count of six, and only the detection result of our method is completely correct. These comparison methods have difficulty identifying multiple birds in dense areas as independent individuals. Compared to them, our algorithm has clear advantages.

3.3. Speed Analysis

The execution times of all methods are measured as frames per second (FPS). All of the FPS are generated on a personal computer with a 2.10 GHz six-core i7-9750H CPU and GeForce GTX1660Ti-6G graphics card.
Table 4 shows that, with an input image size of 720 × 576, YOLOv4 [24] is the fastest at 31 FPS, followed by our method at 21.1 FPS. These speed results demonstrate the near real-time performance of our method. In addition, the frame rates of the GHPNet network with and without 1 × 1 convolution are also compared. It can be seen that the frame rate after adding the 1 × 1 convolution increases by 2.2 FPS, indicating that our algorithm can improve the efficiency of target detection by reducing the computational cost of the network.

3.4. Comparison and Analysis of Ablation Experiments

Our method is mainly composed of two modules: the TVF and the GHPNet networks. Firstly, in order to verify the effectiveness of the TVF algorithm, Table 5 gives the target detection comparison results of the median and mean methods used for background estimation. It can be seen from Table 5 that, compared with the median method, the F1 value of target detection using the mean method for background estimation is increased by 9.1%, and the frame rate is also much higher. The experimental results demonstrate the effectiveness of the TVF algorithm using the mean method for background estimation. Secondly, in order to verify the contribution of each module to the detection performance, Figure 8 and Figure 9 show the subjective target detection results using TVF alone and GHPNet combined with TVF on our captured image and an image from the TIV dataset, respectively. It can be seen from the experimental results that the TVF algorithm can accurately detect a single flying bird that exists independently in space. However, when multiple flying birds are clustered together with a certain degree of overlap and occlusion, they are detected as a single individual. For this problem, where TVF cannot accurately distinguish the individual birds in an aggregation area, GHPNet successfully perceives the individual Gaussian heatmaps, thereby realizing high-precision detection of dense birds.
Lastly, it can also be seen from the ablation results in Table 6 that the detection performance of GHPNet is improved by at least 7.9% in F1 value on the two test datasets compared to the TVF algorithm. Additionally, compared with not using maximum-no-pooling, using maximum-no-pooling improves the F1 performance by 0.9% and 3.4% on our captured dataset and the TIV dataset, respectively. It can be seen that, for a denser spatial distribution of birds, maximum-no-pooling maintains the features of small targets in deeper network layers and avoids losing them.

4. Challenges and Future Work

The main challenge for the proposed method is the detection of weak and small targets in the presence of complex background interference. Figure 10 shows the target detection results in a complex background. The left image is the original image, and the right image is an enlarged view of the cyan box, in which the purple box marks a detected target and the yellow dotted box marks a missed target. As can be seen from the right image of Figure 10, due to the small size, poor contrast, and complex background, the target detection performance decreases to a certain extent, which is a challenge faced by most infrared target detection algorithms and is one of the directions we will focus on in future work.

5. Conclusions

In this paper, a near real-time infrared bird target detection method is proposed. Firstly, temporal variation filtering (TVF) is applied to consecutive frames to extract the moving targets in the scene. Secondly, a Gaussian heatmap perception network (GHPNet) is proposed to predict the pixel-wise Gaussian heat map of each candidate region. Finally, the watershed algorithm is used to segment each target from the predicted heatmap. Experimental results show that the proposed method achieves the best infrared target detection performance in both sparse and dense cases.

Author Contributions

Conceptualization, F.Z.; methodology, F.Z., R.W. and S.S.; software, R.W., Y.C. and S.S.; validation, S.S.; formal analysis, S.S. and Y.C.; writing—original draft preparation, F.Z.; writing—review and editing, C.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (NSFC), grant number 52075435, and by the Key R&D Project of Shaanxi Province, China, grant number 2022GY-305.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset TIV in this paper can be obtained from the task of Multi-view Multi-Object Tracking in the link: http://csr.bu.edu/BU-TIV/BUTIV.html, accessed on 25 September 2014.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dolbeer, R.; Wright, S.; Weller, J.; Begier, M. Wildlife Strikes to Civil Aircraft in the United States 1990–2013; Department of Transportation, Federal Aviation Administration and U.S. Department of Agriculture Animal and Plant Health Inspection Services: Washington, DC, USA, 2014.
  2. Bhusal, S.; Khanal, K.; Goel, S.; Karkee, M.; Taylor, M. Bird deterrence in a vineyard using an unmanned aerial system (UAS). Trans. ASABE 2019, 62, 561–569.
  3. Boudaoud, L.; Maussang, F.; Garello, R.; Chevallier, A. Marine bird detection based on deep learning using high-resolution aerial images. In Proceedings of the OCEANS 2019-Marseille, Marseille, France, 17–20 June 2019.
  4. Hong, S.; Han, Y.; Kim, S.; Lee, A.; Kim, G. Application of deep-learning methods to bird detection using unmanned aerial vehicle imagery. Sensors 2019, 19, 1651.
  5. Chen, C.; Li, H.; Wei, Y.; Xia, T.; Tang, Y. A Local Contrast Method for Small Infrared Target Detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581.
  6. Han, J.; Liang, K.; Zhou, B.; Zhu, X.; Zhao, J.; Zhao, L. Infrared small target detection utilizing the multiscale relative local contrast measure. IEEE Geosci. Remote Sens. Lett. 2018, 15, 612–616.
  7. Wu, L.; Ma, Y.; Fan, F.; Wu, M.; Huang, J. A double-neighborhood gradient method for infrared small target detection. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1476–1480.
  8. He, Y.; Zhang, C.; Mu, Y.; Yan, T.; Chen, Z. Multiscale Local Gray Dynamic Range Method for Infrared Small-Target Detection. IEEE Geosci. Remote Sens. Lett. 2020, 18, 1846–1850.
  9. Wan, M.; Kan, R.; Gu, G.; Zhang, X.; Qian, W.; Chen, Q.; Yu, S. Infrared Small Moving Target Detection via Saliency Histogram and Geometrical Invariability. Appl. Sci. 2017, 7, 569.
  10. Ren, X.; Wang, J.; Ma, T.; Bai, K.; Ge, M.; Wang, Y. Infrared dim and small target detection based on three-dimensional collaborative filtering and spatial inversion modeling. Infrared Phys. Technol. 2019, 101, 13–24.
  11. Han, J.; Moradi, S.; Faramarzi, I.; Liu, C.; Zhang, H.; Zhao, Q. A Local Contrast Method for Infrared Small-Target Detection Utilizing a Tri-Layer Window. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1822–1826.
  12. Sun, S.; Kim, K.; Kim, S. Highly efficient supersonic small infrared target detection using temporal contrast filter. Electron. Lett. 2014, 50, 81–83.
  13. Deng, L.; Zhu, H.; Tao, C. Infrared moving point target detection based on spatial-temporal local contrast filter. Infrared Phys. Technol. 2016, 76, 168–173.
  14. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. NIPS 2012, 60, 1097–1105.
  15. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
  16. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
  17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  18. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
  19. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
  20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016.
  21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  22. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
  23. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  24. Bochkovskiy, A.; Wang, C.; Liao, H. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
  25. Liu, S.; Huang, D. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 385–400.
  26. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212.
  27. Shi, W.; Bao, S.; Tan, D. FFESSD: An Accurate and Efficient Single-Shot Detector for Target Detection. Appl. Sci. 2019, 9, 4276.
  28. Shi, W.; Jiang, J.; Bao, S.; Tan, D. CISPNet: Automatic Detection of Remote Sensing Images from Google Earth in Complex Scenes Based on Context Information Scene Perception. Appl. Sci. 2019, 9, 4836.
  29. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015.
  30. Zhao, M.; Cheng, L.; Yang, X.; Feng, P.; Liu, L.; Wu, N. TBC-Net: A real-time detector for infrared small target detection using semantic constraint. arXiv 2019, arXiv:2001.05852.
  31. Fang, H.; Xia, M.; Zhou, G.; Chang, Y.; Yan, L. Infrared Small UAV Target Detection Based on Residual Image Prediction via Global and Local Dilated Residual Networks. IEEE Geosci. Remote Sens. Lett. 2021, 9, 1–5.
  32. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense Nested Attention Network for Infrared Small Target Detection. arXiv 2021, arXiv:2106.00487.
  33. Thanasutives, P.; Fukui, K.; Numao, M.; Kijsirikul, B. Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting. arXiv 2020, arXiv:2003.05586.
  34. Liu, W.; Salzmann, M.; Fua, P. Context-Aware Crowd Counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
  35. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  36. Wu, Z.; Fuller, N.; Theriault, D.; Betke, M. A Thermal Infrared Video Benchmark for Visual Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 23–28 June 2014.
  37. Nayef, N.; Yin, F.; Bizid, I.; Choi, H.; Ogier, J. ICDAR2017 Robust Reading Challenge on Multi-Lingual Scene Text Detection and Script Identification-RRC-MLT. In Proceedings of the IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 1454–1459.
Figure 1. The flow chart of the proposed bird detection algorithm.
Figure 2. The original image and the target detection results, (a) original image; (b) target detection results of TVF algorithm; (c) enlarged target detection results of TVF algorithm.
Figure 3. Gaussian heat map generation process, (a) bird gathering or overlapping region image; (b) targets manually annotated with rectangle boxes; (c) the resulting Gaussian heatmap ground truth; (d) 2D Gaussian kernel.
Figure 4. The structure diagram of the GHPNet.
Figure 5. The target features and their features obtained by max-pooling and maximum-no-pooling, respectively: (a) target original features; (b) features by max-pooling; (c) features by maximum-no-pooling.
Figure 6. The processing procedures with and without 1 × 1 convolution, (a) without 1 × 1 convolution; (b) with 1 × 1 convolution.
Figure 7. Subjective detection results of several methods, (a) original image; (b) enlarged version of the selected region; (c) the detection results of Mask R-CNN [18]; (d) the detection results of DRU-Net [31]; (e) the detection results of DNA-Net [32]; (f) the detection results of YOLOv4 [24]; (g) the detection results of our method.
Figure 8. Comparison of the first-stage and the second-stage detection results on our dataset, (a) the detection result of TVF; (b) the detection result of GHPNet combined with TVF.
Figure 9. Comparison of first-stage and second-stage detection results on TIV dataset, (a) the detection result of TVF; (b) the detection result of GHPNet combined with TVF.
Figure 10. The target detection results in a complex background, (a) on our dataset, (b) on TIV dataset.
Table 1. Parameter setting of the GHPNet.

| Module | Network Layer | Parameter Setting | Output Dimension |
|---|---|---|---|
| FE module | Conv1_x | [3 × 3, 64; 3 × 3, 64] | 128 × 128 × 64 |
| | Conv2_x | [3 × 3, 128; 3 × 3, 128] | 64 × 64 × 128 |
| | Conv3_x | [3 × 3, 256; 3 × 3, 256; 3 × 3, 256] | 32 × 32 × 256 |
| | Conv4_x | [3 × 3, 512; 3 × 3, 512; 3 × 3, 512] | 16 × 16 × 512 |
| | Conv5_x | [3 × 3, 512; 3 × 3, 512] | 8 × 8 × 512 |
| MF module | Conv6_x | [3 × 3, 1024; 1 × 1, 1024] | 8 × 8 × 1024 |
| | Conv7_x | [1 × 1, 512; 3 × 3, 256] | 16 × 16 × 256 |
| | Conv8_x | [1 × 1, 256; 3 × 3, 128] | 32 × 32 × 128 |
| | Conv9_x | [1 × 1, 128; 3 × 3, 64] | 64 × 64 × 64 |
| | Conv10_x | [1 × 1, 64; 3 × 3, 32] | 128 × 128 × 32 |
| GHE module | Conv11_x | [3 × 3, 32; 3 × 3, 32; 3 × 3, 16; 1 × 1, 16; 1 × 1, 1] | 128 × 128 × 1 |
Table 2. Comparison of detection performance on the bird dataset.

| Methods | Year | Backbone | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|---|
| U-Net [29] | 2015 | U-Net | 82.6 | 79.7 | 81.1 |
| Faster R-CNN [17] | 2017 | VGG16 | 80.3 | 73.4 | 76.7 |
| Mask R-CNN [18] | 2017 | ResNet-101-FPN | 83.3 | 77.9 | 80.5 |
| YOLOv4 [24] | 2020 | CSPDarknet-53 | 89.4 | 86.6 | 88.0 |
| DRU-Net [31] | 2021 | U-Net | 87.0 | 86.1 | 86.5 |
| DNA-Net [32] | 2021 | U-Net | 89.0 | 87.4 | 88.2 |
| Our method | 2022 | VGG16+U-Net | 91.2 | 89.8 | 90.5 |
Table 3. Comparison of detection performance on the TIV dataset.

| Methods | Year | Backbone | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|---|
| U-Net [29] | 2015 | U-Net | 78.5 | 76.3 | 77.3 |
| Faster R-CNN [17] | 2017 | VGG16 | 75.8 | 71.0 | 73.3 |
| Mask R-CNN [18] | 2017 | ResNet-101-FPN | 80.4 | 77.6 | 79.0 |
| YOLOv4 [24] | 2020 | CSPDarknet-53 | 87.7 | 85.5 | 86.2 |
| DRU-Net [31] | 2021 | U-Net | 86.8 | 83.6 | 85.2 |
| DNA-Net [32] | 2021 | U-Net | 84.4 | 78.4 | 81.2 |
| Our method | 2022 | VGG16+U-Net | 88.3 | 87.4 | 87.8 |
Table 4. Frame rate comparison results.

| Methods | Year | F1 (%) | FPS |
|---|---|---|---|
| U-Net [29] | 2015 | 77.3 | 9 |
| Faster R-CNN [17] | 2017 | 76.7 | 7 |
| Mask R-CNN [18] | 2017 | 80.5 | 5 |
| YOLOv4 [24] | 2020 | 88.0 | 31 |
| DRU-Net [31] | 2021 | 86.5 | 5.2 |
| DNA-Net [32] | 2021 | 88.2 | 4.4 |
| Our method without 1 × 1 convolution | 2022 | 90.5 | 18.9 |
| Our method with 1 × 1 convolution | 2022 | 90.5 | 21.1 |
Table 5. Comparison of detection results of the median and mean modes used in TVF on our dataset.

| Mode | P (%) | R (%) | F1 (%) | FPS |
|---|---|---|---|---|
| Median | 69.2 | 63.4 | 66.2 | 10 |
| Mean | 78.0 | 72.7 | 75.3 | 71 |
Table 6. Ablation experiment.

| Dataset | Configuration | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|
| Bird | TVF | 78.0 | 72.7 | 75.3 |
| Bird | GHPNet without maximum-no-pooling | 90.2 | 89.0 | 89.6 |
| Bird | GHPNet with maximum-no-pooling | 91.2 | 89.8 | 90.5 |
| TIV | TVF | 77.6 | 75.4 | 76.5 |
| TIV | GHPNet without maximum-no-pooling | 86.4 | 82.5 | 84.4 |
| TIV | GHPNet with maximum-no-pooling | 88.3 | 87.4 | 87.8 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
