1. Introduction
SAR is a powerful and promising remote sensing system, capable of working in both weak natural light and adverse weather conditions [1]. With the purpose of detecting and classifying targets accurately and efficiently from SAR imagery, ATR is playing an increasingly important part in both military and civilian applications [2]. The design of an ATR system comprises three steps: detection, discrimination and classification [3]. In this paper, we mainly focus on the fundamental step of detection. Current ATR algorithms can be divided into three main types: template-based, model-based and deep-learning-based algorithms [4]. The template-based ones can hardly meet real-time requirements, and the emerging deep learning methods require abundant samples. Model-based algorithms extract and screen image features, and then identify target types with specific classifiers. The more stable and recognizable the features are, the more reliable the recognition results can be [5].
However, because of the backscatter imaging mechanism, features extracted from SAR images are highly sensitive to the SAR acquisition geometry [6]. With a slight change in target pose or position, the scattering intensity and other characteristics of the same kind of target can vary quite abruptly. On the other hand, as the aspect angle is limited in a single radar observation, some targets in the scene may be partly or even completely invisible, because the radar cross section (RCS) is partly determined by the corresponding aspect angle [7]. This is especially true for man-made targets such as buildings, in which the scattering characteristics of dihedral angles are usually found [8]. In that case, it is commonly reckoned that multi-aspect images of the same scene contain richer information and can provide better performance in target detection tasks than any single one of them [9]. The data acquisition geometry in this paper is shown in Figure 1, where the images are captured by the same airborne sensor at consecutive aspects. Figure 2 takes a building area as an example to show its different configurations at different aspects.
Many existing studies have shown the enhancement effects of image combination in the field of target detection [10,11]. Ref. [11] proposes a novel object detection framework, integrating diverse features by jointly considering features in the frequency domain and the original spatial domain. In [12,13], multimodal images are combined via deep learning techniques to show the superiority of diverse data. Multi-aspect SAR image utilization methods can be divided into the following three categories. The first category works on finding features that remain unchanged as the aspect changes. For example, Bhanu et al. [14] compare the positions of strong scattering centers in different images, and select the scattering point pairs that roughly stay still as features for model construction. Zhang et al. [15] believe that the intrinsic dimension of the target will remain the same when the aspect changes within a wide range of degrees. Therefore, man-made targets can be identified by averaging the intrinsic dimensions in the region of interest (ROI) of different images. The second category aggregates the different appearances of the target at different aspects to enrich the reference sample base. Brendel et al. [10] compose one grand image from images at wide angle separations, which is later used as the reference image in a mean squared error (MSE)-template-based ATR system, so that the reference image contains more comprehensive information about the target. The third category pays attention to the inner connections of multi-aspect images. In this strategy, the images are fused through mutual influence, and the internal relevance between them is regarded as an effective criterion for recognition. As an example, Huan et al. [16] put vectors representing different images into the same matrix, which is then processed with PCA and wavelet fusion methods. The resulting vectors separated from the processed matrix are used as features for classifiers. Zhang et al. [8] take advantage of sparse representation classification (SRC) among multiple aspects for a single joint recognition decision. Its ability to describe each sample precisely under the inner correlation constraints among samples has brought it wide acceptance. The deep learning methods applied to multi-aspect SAR are usually based on connection analysis as well. Pei et al. [17] propose the multi-view deep convolutional neural network (MVDCNN), where they compare images from adjacent aspects step by step with a parallel network topology. Relationship exploration is completed progressively in different network layers. Zhang et al. [18] propose a deep neural network containing a Bi-LSTM model, so they can learn the connections of the training samples in both forward and backward directions independently. In the above literature, the utilization of multi-aspect images has been demonstrated to bring remarkable improvement compared with single-aspect methods. However, there are still some limitations in their practical applications. The first category has strict requirements on the interval and quantity of image samples in each class: the interval is usually recommended to be one degree, with no missing aspects over a wide range. In the second category, few variations are allowed in either the target itself or the surrounding environment. When there are not enough training samples, targets at intermediate aspect positions are still hard to identify. The third category emphasizes the internal relationship between the images, but it may not work well when the relationship happens to be weak, especially when the aspects are quite separated.
In all the presented methods, it is the major target signature variations among different aspects that cause trouble for detection. In this paper, we propose a new method for building detection with multi-aspect SAR images. With this method we turn these variations into recognizable and essential features in the detection procedure, instead of avoiding them by requiring small aspect separations or stable environment conditions. We have noticed that as the aspect changes, some statistical characteristics of the background tend to stay relatively steady, while the same characteristics vary sharply in building areas of the same scene. The different variation patterns between target and background can contribute to target discrimination amid the complex disturbances of urban areas. In our method, the holistic scene to be detected is partitioned into a fixed number of grids, and their respective local variation patterns are taken for discrimination. As a single feature has only limited potential, we adopt five indexes derived from three complementary characteristics to get a comprehensive description. By calculating and integrating variances from the different indexes, we are able to put the grids into a K-means classifier for prescreening. After that, in order to reduce the information loss incurred when the statistical histograms drop directly to the single dimension of variance, we recalculate two variation patterns in vector form based on PCA via correlation and fluctuation analyses. Separate SVM classifiers work independently under the resulting two variation patterns, whose training sets are provided by modified K-means clustering results instead of manual labeling. At last, the SVM detection results are fused according to a maximum probability rule. Experiments show that the method adapts well to significant target signature variabilities and has no strict requirements on the number and intervals of images.
The remaining part of the paper is structured as follows: we first introduce the common difficulties in multi-aspect target detection in Section 2. Then, in Section 3, the proposed method for building detection is presented. Extensive experiments are conducted on airborne SAR images in Section 4. Finally, conclusions are drawn in Section 5.
2. Significant Target Signature Variabilities in Multi-Aspect Images
In addition to target deformations such as the affine transformations naturally caused by radar perspective conversion, there are also some significant target signature variabilities in the multi-aspect image sequence [19,20]. These variabilities include target scintillation, both intentional and unintentional target obscuration, changing background surfaces caused by inherent speckle noise and shadowing, etc. [21]. In the following part, we illustrate these variabilities with specific examples.
The stated variabilities make the images in the time series carry discrepant information to some extent. As a result, the targets become harder to discriminate or to fit into uniform descriptions. Because of these variabilities, we have decided not to search for stable features or fixed association relationships between all aspects, but simply to focus on describing the variation patterns contained in the image sequence. By discriminating the targets through the difference of variation patterns, we can ensure the robustness of the algorithm in cluttered or fuzzy images.
2.1. Target Scintillation
In SAR images, flat surfaces such as building roofs in urban areas are often shown as dark areas at many aspects because of their surface scattering properties. They are only highlighted at some specific aspects, depending mainly on their incline angles to the ground and their positions relative to the radar platform. As an example, Figure 3 shows three different scattering conditions of the same group of buildings at different aspects. In Figure 3a a large part of the building group is highlighted, but the remaining parts are still more ambiguous and weaker than the surroundings. In Figure 3b the buildings are partly shown, and these parts are almost complementary to Figure 3a. In Figure 3c the buildings are almost invisible and hard to recognize.
2.2. Target Obscuration
Radar detection has a certain ability of penetration. This ability is generally related to the wavelength and polarization mode used by the detector, but is also inevitably related to the aspect angle of the current image. In the image in Figure 4a, the buildings are obscured by the trees nearby, while in Figure 4b,c, parts of the buildings under the trees are visible. The appearance and disappearance of the obscurations are also responsible for the variabilities of the targets.
2.3. Background Changing: Speckle Noise and Shadowing
Speckle noise cannot be completely eliminated from SAR images and will always cause trouble in SAR target detection. However, when the variation features are taken as the detection criteria, the problem of speckle noise can be avoided to a large extent. Speckle noise usually has a relatively uniform distribution in the whole scene and hence little influence on regional statistical characteristics. It has even less influence on variation features as it just changes randomly with aspects, which is very different from the changing patterns of targets in the same scene.
When images change significantly with aspect, the reliability of some traditional methods tends to be greatly affected. Shadowing happens to be one of the main factors that cause this degree of change. The change in shadow with aspect is immediate and noticeable, and can bring unavoidable interference to the work of target detection. For instance, geometrical properties are commonly used in building detection methods [22]. However, when the targets are partly shadowed by urban greening vegetation or other buildings nearby, their areas, contours, shapes and connectivity can be affected considerably. The presence of complex objects in the background presents great challenges for the detection. Figure 5 shows the influence of shadows at different aspects on building forms in SAR images. Therefore, a more robust approach that is not sensitive to these factors is needed.
3. Proposed Method of Multi-Aspect Building Detection
3.1. Multi-Aspect Building Detection Framework
There are three steps in our method. First, we quantify the variations of five indexes from three different categories and analyze them to roughly define the areas where targets are likely to appear. Then, the features from these categories are refined in two different ways and put into separate SVM classifiers, respectively, to determine the exact building locations. At last, the results obtained from the SVM classifiers are fused at the decision level to get our final detection results. The block diagram of the algorithm is shown in Figure 6.
3.2. Variances Derived from Statistic Characteristics as Prescreening Features
To achieve fully automatic target detection, we need to address the problem that unsupervised learning can fail to meet the accuracy requirements while supervised learning needs massive manual sample labeling. Hence, we have decided to take a prescreening step with K-means to roughly define the target area locations, whose results are later taken as training sets for the SVM classifiers after some proper modifications. In the process of prescreening, we tend to prioritize strict constraint conditions to ensure the correctness of the results. The use of one single feature has only a limited constraint effect; for better performance we need to seek fusion approaches for multiple features.
We consider the comparison within a group of multi-aspect sequential images a kind of time domain analysis for a fixed scene. In order to achieve a comprehensive description of the targets, it is essential to find more characteristics covering spatial and time-frequency domain analyses at the image level. For this purpose, we choose characteristics of three categories by experimental investigation, with the aim of ensuring that they are aspect-sensitive, complementary to each other, and easy to acquire and store. Five specific indexes are derived from the three characteristics: mean amplitudes and highlighted pixel proportions from intensity, regional homogeneity and dissimilarity from texture, and the norm of low frequency components from wavelet decomposition. For a certain index, the variance among the multi-aspect images is calculated as a feature value, and different features are combined to form the criterion for prescreening. As we can see, between the target and non-target regions there is not much difference in the average and range of the indexes, but an unignorable difference in their variances.
3.2.1. Intensity Variance
The intensity of pixels is the most intuitive feature of SAR images. The signature variabilities in multi-aspect images have great influence on the intensity of the targets. So, we examine the variances of indexes derived from intensity, and look for the difference in their representation forms between building areas and background. We first divide the holistic scene into grids; for each grid, the intensity histograms from the different aspects are obtained. Then, the variances of the mean values and bright pixel proportions are calculated, respectively, from the different aspect histograms. By now, each grid has two scalar feature values under the same category of intensity:

$$\mu_j = \frac{\sum_{i=1}^{N} i \, h_{ij}}{\sum_{i=1}^{N} h_{ij}}, \qquad p_j = \frac{\sum_{i=T}^{N} h_{ij}}{\sum_{i=1}^{N} h_{ij}}$$

$$V_\mu = \frac{1}{J}\sum_{j=1}^{J}\left(\mu_j - \bar{\mu}\right)^2, \qquad V_p = \frac{1}{J}\sum_{j=1}^{J}\left(p_j - \bar{p}\right)^2$$

where $i$ is the sequence number of the bins in the histogram and $j$ is the sequence number of the multi-aspect images. $N$ is the total number of bins in the histogram and $J$ is the number of images involved. $h_{ij}$ is the amplitude of the $i$-th bin in the $j$-th histogram, and $T$ is the threshold set to distinguish bright pixels from the others. $\mu_j$ is the mean value index of the $j$-th image and $p_j$ is the highlighted pixel proportion index of the $j$-th image, with $\bar{\mu}$ and $\bar{p}$ their means over the $J$ images. $V_\mu$ is the variance of the mean values and $V_p$ is the variance of the bright pixel proportions; $V_\mu$ and $V_p$ are the two features derived from the characteristic of intensity. In Figure 7a, which shows the mean intensity index at different aspects, the diagram on the left comes from a grid in the background area. We can see that it experiences a slow change as the aspect changes. The diagram on the right shows how the same index changes sharply in a grid of a building area. In addition, we can see that the mean values of the two grids are quite close, indicating that there is no obvious difference based on the index amplitude alone. Figure 7b shows that the highlighted proportions behave in the same way.
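As an illustrative sketch of these two intensity features (NumPy, with a toy four-aspect grid and a hypothetical bright-pixel threshold; bin indexes stand in for gray levels):

```python
import numpy as np

def intensity_variances(hists, bright_threshold):
    """hists: (J, N) array, one N-bin intensity histogram per aspect image.
    Returns the variances of the mean-value index and of the
    highlighted-pixel-proportion index across the J aspects."""
    hists = np.asarray(hists, dtype=float)
    bins = np.arange(hists.shape[1])              # bin index as gray level
    totals = hists.sum(axis=1)                    # pixels per histogram
    mu = (hists * bins).sum(axis=1) / totals      # mean value index per aspect
    p = hists[:, bright_threshold:].sum(axis=1) / totals  # bright proportion
    return mu.var(), p.var()

# toy example: a stable background-like grid vs a scintillating building-like grid
stable = np.full((4, 32), 10.0)                   # 4 aspects, identical histograms
flicker = np.zeros((4, 32))
flicker[np.arange(4), [4, 12, 20, 28]] = 100.0    # bright mass moves with aspect
v_mu_bg, v_p_bg = intensity_variances(stable, bright_threshold=24)
v_mu_tg, v_p_tg = intensity_variances(flicker, bright_threshold=24)
```

The background-like grid yields zero variance for both indexes, while the scintillating grid yields large ones, which is exactly the contrast the prescreening relies on.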
3.2.2. Texture Variance
Texture reflects the different organization forms of the pixels within different parts of the images. The gray level co-occurrence matrix (GLCM) is generally used to describe image texture by studying the spatial correlation of the pixels [23]. To use the GLCM principle, we first convert the radar image to a gray level image by grading the pixel intensity into $L$ levels. Then, the occurrence frequency of pixel pairs at each gray level is counted according to a specified direction and distance. At last, the co-occurrence matrixes obtained in the different directions are averaged to serve the subsequent feature extraction steps. The final co-occurrence matrix $P$ of a gray level pair $(a, b)$ is shown as:

$$P(a,b) = \frac{1}{|\Theta|} \sum_{\theta \in \Theta} \#\left\{ (x, y) \mid I(x, y) = a,\ I(x + \Delta x, y + \Delta y) = b \right\}, \qquad a, b = 1, \dots, L$$

where $\Delta x$ and $\Delta y$ are the specified displacements of a pixel pair in the row and column directions, $L$ is the total number of gray level grades, and $\theta \in \Theta$ is the direction along which the pixel pairs are counted.
Our purpose is to obtain the texture of the grids in general for comparison between different images, rather than the elaborate characteristics of a certain image. Under this condition, the GLCM is only formed at the central pixel of each grid to represent the grid's texture characteristic. Of all the texture values calculated from the co-occurrence matrix, we find through experiments that the indexes of homogeneity and dissimilarity lead to the best distinction results.
Figure 8 shows the normalized texture variations in multi-aspect images. Figure 8a compares the homogeneity variances in target and background grids, while Figure 8b compares the dissimilarity variances under the same conditions. The indexes and their variances across the images are calculated as follows:

$$\mathrm{hom}_j = \sum_{a=1}^{L}\sum_{b=1}^{L} \frac{P_j(a,b)}{1 + (a-b)^2}, \qquad \mathrm{dis}_j = \sum_{a=1}^{L}\sum_{b=1}^{L} |a-b| \, P_j(a,b)$$

$$V_{\mathrm{hom}} = \frac{1}{J}\sum_{j=1}^{J}\left(\mathrm{hom}_j - \overline{\mathrm{hom}}\right)^2, \qquad V_{\mathrm{dis}} = \frac{1}{J}\sum_{j=1}^{J}\left(\mathrm{dis}_j - \overline{\mathrm{dis}}\right)^2$$

where $P_j$ is the averaged co-occurrence matrix of the grid in the $j$-th image, and $V_{\mathrm{hom}}$ and $V_{\mathrm{dis}}$ are the indexes derived from the characteristic of texture.
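A minimal NumPy sketch of these texture features, assuming $L = 8$ gray levels, four averaged directions, and the standard homogeneity/dissimilarity definitions:

```python
import numpy as np

def glcm(img, L, dy, dx):
    """Gray level co-occurrence matrix of an L-level image for offset (dy, dx)."""
    P = np.zeros((L, L))
    H, W = img.shape
    for y in range(max(0, -dy), min(H, H - dy)):
        for x in range(max(0, -dx), min(W, W - dx)):
            P[img[y, x], img[y + dy, x + dx]] += 1
    return P / P.sum()

def texture_indexes(img, L=8):
    """Homogeneity and dissimilarity from a direction-averaged GLCM."""
    offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]       # 0, 90, 45, 135 degrees
    P = np.mean([glcm(img, L, dy, dx) for dy, dx in offsets], axis=0)
    a, b = np.indices((L, L))
    hom = (P / (1.0 + (a - b) ** 2)).sum()
    dis = (np.abs(a - b) * P).sum()
    return hom, dis

# variance of each texture index across the aspect images of one grid
rng = np.random.default_rng(1)
grids = [rng.integers(0, 8, size=(16, 16)) for _ in range(4)]
homs, diss = zip(*(texture_indexes(g) for g in grids))
v_hom, v_dis = np.var(homs), np.var(diss)
hom0, dis0 = texture_indexes(np.zeros((16, 16), dtype=int))  # flat grid sanity check
```

For a perfectly flat grid the homogeneity is 1 and the dissimilarity is 0; any heterogeneous grid falls between those extremes.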
3.2.3. Variance of Wavelet Low Frequency Components
Wavelet decomposition extracts features in the image domain through time-frequency analysis. In wavelet decomposition, the low frequency wavelet components are not sensitive to insignificant disturbance and can reflect the intrinsic signatures of an image [16]. In this paper, we perform a wavelet decomposition at 3 levels for each divided grid as shown in Figure 9. Figure 9d shows the decomposition results of Figure 9a in principle, where $LL_k$ denotes the low frequency component of the $k$-th level decomposition while $LH_k$, $HL_k$ and $HH_k$ denote the high frequency components.

By column-stacking the $LL_3$ components from the different aspect images, we get a matrix $M$ representing the wavelet low frequency components. We calculate the mixed $\ell_{p,q}$ norm of $M$ by first calculating the $\ell_q$ norm of each row in $M$ and then the $\ell_p$ norm of the resulting vector:

$$V_w = \left( \sum_{r} \left( \sum_{j=1}^{J} \left| M_{rj} \right|^{q} \right)^{p/q} \right)^{1/p}$$

where $M_{rj}$ is the $r$-th low frequency coefficient from the $j$-th aspect. The value of $V_w$ is taken as the variance of the wavelet components for each grid, in order to properly reflect the variation relationship among the components [22]. For intuitive observation, the mean value of each low frequency component in the different images is shown in Figure 10.
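The LL-component stacking and mixed norm could be sketched as follows; the Haar basis and the $(p, q) = (1, 2)$ norm pair are assumptions, since the text does not fix them:

```python
import numpy as np

def haar_ll(img, levels=3):
    """Low frequency (LL) component of a 2-D Haar decomposition; Haar stands
    in here for the paper's unstated wavelet basis."""
    ll = np.asarray(img, dtype=float)
    for _ in range(levels):
        # orthonormal 2-D Haar LL: half the sum of each 2 x 2 block
        ll = 0.5 * (ll[0::2, 0::2] + ll[0::2, 1::2]
                    + ll[1::2, 0::2] + ll[1::2, 1::2])
    return ll

def wavelet_variance(grids, p=1, q=2):
    """Column-stack each grid's LL3 into M, then take a mixed norm:
    the q-norm of each row of M, then the p-norm of the resulting vector.
    The exact (p, q) pair is assumed, not specified by the text."""
    M = np.column_stack([haar_ll(g).ravel() for g in grids])
    row_norms = np.linalg.norm(M, ord=q, axis=1)
    return np.linalg.norm(row_norms, ord=p)

rng = np.random.default_rng(2)
grids = [rng.random((16, 16)) for _ in range(5)]   # one 16 x 16 grid at 5 aspects
V_w = wavelet_variance(grids)
ll = haar_ll(np.ones((8, 8)))                      # sanity: flat 8 x 8 grid
```

Three Haar levels shrink an 8 × 8 grid to a single LL coefficient, so each grid contributes one short column to M regardless of the aspect count.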
3.3. Prescreening Based on Fused Features by K-Means
After obtaining the variances of the characteristics of image intensity, texture and wavelet, we integrate them into a vector for each grid and put it into the K-means classifier to determine preliminarily whether the grid belongs to the target areas or not. K-means is one of the most widely used unsupervised classifiers, which can make full use of existing features to give effective predictions. This procedure is regarded as prescreening in our work.
Because we have considered the image features from quite comprehensive perspectives, the results of the prescreening also prove to have a low false alarm rate. Still, we cannot be entirely sure about the correctness of the results offered by K-means. In the procedure of variance calculation, the indexes are transformed from multidimensional vectors directly into scalars, and the information loss thus becomes unignorable. Therefore, in the following steps, the features will be refined and the detected results will be used as training sets for SVM classifiers for finer discrimination.
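The prescreening can be sketched with a minimal NumPy 2-means on synthetic variance features (in practice a library implementation such as scikit-learn's KMeans would serve equally well):

```python
import numpy as np

def kmeans2(X, iters=100, seed=0):
    """Minimal 2-means prescreening: X is (n_grids, n_features), each row the
    stacked variance features of one grid. Returns a boolean target mask."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in (0, 1)])
        if np.allclose(new, centers):
            break
        centers = new
    # call the cluster with the larger overall variance features the target one
    target = int(centers.sum(axis=1).argmax())
    return labels == target

rng = np.random.default_rng(3)
background = rng.normal(0.1, 0.02, size=(40, 5))  # low-variance feature vectors
buildings = rng.normal(0.9, 0.05, size=(10, 5))   # high-variance feature vectors
is_target = kmeans2(np.vstack([background, buildings]))
```

The cluster with the larger feature values is interpreted as the candidate target set, matching the observation that building grids carry the larger variances.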
However, mistakes in the training set are likely to be enlarged in the SVM classification outcomes. To address this problem, we modify the training samples based on the areas and aggregation conditions reflected in the relative positions of the detected regions, as a supplement to further ensure the reliability of the samples. Taking into account the characteristics of buildings, we delete fifteen percent of the isolated small areas in the detected regions, as buildings are more likely to appear in the form of large connected areas in principle. The modification can be described by the following steps:
1. Binarization: the pixels judged as targets in prescreening are set to 1, while the pixels judged as background are set to 0;
2. Count the areas of all the connected regions in the scene, and arrange them from smallest to largest;
3. Take the smallest thirty percent of the regions and calculate the sum of the Euclidean distances from each of them to the center of mass of the largest regions;
4. Among this smallest thirty percent, the half with the greater sum of distances is discarded as background after the modification.
This step may occasionally discard true target regions, but it does more good than harm in the long run.
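The modification steps could be sketched as below; the thirty percent fraction and the use of the joint center of mass of the large regions are assumptions chosen to be consistent with the stated fifteen percent deletion:

```python
import numpy as np

def connected_components(mask):
    """4-connected component labeling of a binary mask (pure NumPy, DFS)."""
    labels = np.zeros(mask.shape, dtype=int)
    cur = 0
    for sy, sx in zip(*np.nonzero(mask)):
        if labels[sy, sx]:
            continue
        cur += 1
        labels[sy, sx] = cur
        stack = [(sy, sx)]
        while stack:
            y, x = stack.pop()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = cur
                    stack.append((ny, nx))
    return labels, cur

def modify_training_mask(mask, small_frac=0.3):
    """Drop the more scattered half of the smallest regions (small_frac and
    the joint center of mass are assumptions, not fixed by the text)."""
    labels, n = connected_components(mask)
    if n < 2:
        return mask
    areas = np.array([(labels == k).sum() for k in range(1, n + 1)])
    order = np.argsort(areas)                       # smallest to largest
    n_small = max(1, int(round(small_frac * n)))
    small, large = order[:n_small] + 1, order[n_small:] + 1
    cy, cx = np.argwhere(np.isin(labels, large)).mean(axis=0)
    dist = [np.linalg.norm(np.argwhere(labels == k) - (cy, cx), axis=1).sum()
            for k in small]
    drop = small[np.argsort(dist)[len(small) // 2:]]  # farther half discarded
    out = mask.copy()
    out[np.isin(labels, drop)] = False
    return out

mask = np.zeros((20, 20), dtype=bool)
mask[2:10, 2:10] = True        # one large building-like region
mask[12, 4] = True             # small region near the large one
mask[18, 18] = True            # small region far away
cleaned = modify_training_mask(mask, small_frac=0.67)
```

Of the two small regions, only the one far from the mass of large detections is reassigned to background; the near one survives as a plausible building fragment.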
3.4. Refining Features for Accuracy Improvement
In this part, to locate the targets more precisely based on the prescreening results, we first provide finer features than scalar variables for each grid. Recall that in Section 3.2 we obtained the histograms of intensity, GLCM texture and wavelet low frequency components as statistical characteristics; instead of deriving scalar indexes from them, we now use them directly as vectors to explore the variation patterns among the different aspects.
For each grid, if we joined the histograms of the different characteristics end to end, the newly constructed feature vector would be detailed but easily redundant. Without further optimization of these vectors, too much calculation would be required due to the high feature dimensions. Moreover, these histograms naturally have different dimensions, which would eventually lead to unnecessary differences in their weights and influence when put together.
Under these conditions, we have decided to use the PCA method for feature selection. This method unifies and reduces the dimensions of these histograms while retaining the decisive features with appropriate dimensions. It concentrates the feature energies and extracts features by selecting an appropriate basis in a low dimensional space. Taking the characteristic of intensity as an example, for a certain grid, we arrange the multi-aspect histograms as column vectors into an $N \times J$ matrix $H$. Then the correlation matrix $C$ of $H$ is calculated and its eigenvalue equation is solved:

$$C = \frac{1}{J} H H^{T}, \qquad C w_k = \lambda_k w_k$$

The solved eigenvectors $w_k$ corresponding to the maximum $p$ eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ are taken to form an orthogonal vector basis $W = [w_1, w_2, \dots, w_p]$, which is used as the transformation matrix to perform the dimension reduction:

$$S = W^{T} H$$

The vectors are thus reduced from $N$ dimensions to $p$ in the resulting matrix $S$. Setting $p$ to the same constant for all the features, we can assure them the same dimension and importance.
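A NumPy sketch of this reduction for one grid (symbols $N$, $J$, $p$ as in the text; computing the matrix from centered histograms is an assumption):

```python
import numpy as np

def pca_reduce(H, p):
    """Project the N-bin aspect histograms (columns of the N x J matrix H)
    onto the p leading eigenvectors of their correlation/scatter matrix."""
    Hc = H - H.mean(axis=1, keepdims=True)         # center each bin across aspects
    C = Hc @ Hc.T / H.shape[1]                     # N x N matrix
    eigvals, eigvecs = np.linalg.eigh(C)           # eigh: ascending eigenvalues
    W = eigvecs[:, np.argsort(eigvals)[::-1][:p]]  # top-p eigenvectors as basis W
    return W.T @ H                                 # S: p x J reduced matrix

rng = np.random.default_rng(4)
H = rng.random((32, 6))        # 32-bin histograms from J = 6 aspects
S = pca_reduce(H, p=4)
```

Every characteristic (intensity, texture, wavelet) would pass through the same routine with the same p, which is what puts them on an equal footing before fusion.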
After the features of reduced dimensions are obtained by PCA for each grid, we can use them as materials to study the variation patterns along the aspects. A proper definition is needed here to explicitly describe the variation relationship among the feature vectors from different aspects. There are two options, both proved to be effective and complementary to each other in this situation. One of them focuses on the correlation of the vectors and the other analyzes the fluctuation between the vectors. For the first one, we calculate the covariance matrix of $S$:

$$\Sigma = \frac{1}{p-1}\left(S - \bar{S}\right)^{T}\left(S - \bar{S}\right)$$

where $\bar{S}$ contains the column means of $S$. As we can see, the covariance matrix has a definition similar to the variance of scalars; it is an extension of variance to the multi-dimensional case. In fact, the diagonal elements of $\Sigma$ represent the variances of the respective column vectors of $S$, while the non-diagonal elements reflect the degree of correlation among the columns. The latter are negatively correlated with the variation amplitudes of the columns, so we make them into one of the criteria we are looking for. Because of the symmetry of the covariance matrix, in order to avoid repetition, we take the upper triangle elements of $\Sigma$ to form a new vector representing the correlation variation pattern for a certain feature.
The principle of the second way is more straightforward in comparison. The variation of the different columns is captured as a direct combination of the variances of each row of the matrix $S$:

$$v = \left( \sigma^2(s_{1,:}),\ \sigma^2(s_{2,:}),\ \dots,\ \sigma^2(s_{p,:}) \right)$$

where $s_{r,:}$ denotes the $r$-th row of $S$ and $\sigma^2(\cdot)$ its variance across the $J$ aspects.
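Both variation patterns can be sketched directly from the reduced matrix S (whether the diagonal belongs in the upper-triangle vector is not stated; it is included here):

```python
import numpy as np

def correlation_pattern(S):
    """Upper triangle (diagonal included, an assumption) of the covariance
    matrix of the columns of S, i.e., of the per-aspect feature vectors."""
    Sigma = np.cov(S, rowvar=False)        # J x J covariance between aspects
    return Sigma[np.triu_indices(Sigma.shape[0])]

def fluctuation_pattern(S):
    """Variance of each row of S across the aspects, stacked as a vector."""
    return S.var(axis=1)

rng = np.random.default_rng(5)
S = rng.random((4, 6))                     # p = 4 reduced dims, J = 6 aspects
v_corr = correlation_pattern(S)            # length J(J+1)/2 = 21
v_fluc = fluctuation_pattern(S)            # length p = 4
```

The two vectors have different lengths and carry complementary information, which is why they feed two separate SVM classifiers rather than one.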
After that, for the same characteristic, feature vectors from different aspects are summarized as an available variation vector by their variation relationship. The feature vectors from different characteristics are then connected end to end, forming the new criterion that will be adopted by SVM.
3.5. SVM as Classifier for Accuracy Improvement
Corresponding to the above two criteria from different variation pattern analytical perspectives, two separate classifiers are adopted to form independent classification results. These results will be then fused to make the final detection decisions. When it comes to the classifier types, it is commonly agreed that supervised classifiers can achieve better performance with proper samples. SVM is a binary classifier widely used in SAR classification due to its conspicuous performance in feature learning and class separating [
24,
25,
26]. The basic principle of SVM can be stated as follows [
27]: SVM first transforms its samples into a high-dimensional Euclidean space, and then separates them with a decision surface found in this new space with its kernel function.
where
is support vector,
is class label of
,
is Lagrange multiplier of
,
b is the threshold used in this classification,
K is the Gaussian kernel function and
stands for the final classification results by SVM.
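Evaluating this decision rule with a Gaussian kernel can be sketched as follows; the support vectors, multipliers, and γ below are illustrative values, not a trained model:

```python
import numpy as np

def svm_decision(x, sv, y, alpha, b, gamma=1.0):
    """f(x) = sgn(sum_i alpha_i * y_i * K(x_i, x) + b) with a Gaussian (RBF)
    kernel; sv, y, alpha and b would come from training in practice."""
    K = np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))  # RBF kernel values
    return np.sign(np.dot(alpha * y, K) + b)

# toy "trained" model: one support vector per class (illustrative values only)
sv = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0])
alpha = np.array([1.0, 1.0])
b = 0.0
label = svm_decision(np.array([0.9, 0.9]), sv, y, alpha, b)
```

A query near the positive support vector is labeled +1, one near the negative support vector −1; in the detection pipeline these labels correspond to target and background grids.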
As mentioned in Section 3.3, the SVM classifiers use the regions detected in prescreening as training samples to avoid manual labeling. Besides, we also delineate several districts of random size in the same scene without any targets present as control terms. These districts are processed with the same dividing and feature extracting steps as above, and the obtained grids are added to the training set as negative samples. The grids not considered as targets in the prescreening step are all put into the test set and reclassified. The detection results coming from the different SVM classifiers are combined in the following part according to a maximum probability rule.
3.6. Fusion Strategy for SVM
By now, we have adopted two different methods to calculate the variations of the same set of feature vectors and thus obtained detection results from two separate classifiers. In this part, we fuse these results at the decision level according to a maximum probability rule [4], in which the proposition with the highest probability is adopted. The probabilities are provided by the SVM classifiers. The detection results obtained by the different classifiers are formed into a set $T$:

$$T = \left\{ \left( P_h(r,c),\ Q_h(r,c) \right) \mid h = 1, \dots, H \right\}$$

where $H$ is the number of classifiers used and $P_h(r,c)$ is the probability that classifier $h$ regards the grid at the position of row $r$ and column $c$ as a target region. $Q_h(r,c) = 1 - P_h(r,c)$ is the probability that the same grid is regarded as background by classifier $h$. $P_{\max}(r,c) = \max_h P_h(r,c)$ is the maximum probability that the grid $(r,c)$ is considered to be a target over all the classifiers, and $Q^{*}(r,c)$ is the probability that the same grid is considered to be background by the union of the remaining classifiers except the one contributing to $P_{\max}(r,c)$. All the grids satisfying $P_{\max}(r,c) > Q^{*}(r,c)$ constitute the final decided target regions.
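One plausible NumPy reading of this fusion rule (how the remaining classifiers' background probabilities are combined into a union is not fully specified; their maximum is assumed here):

```python
import numpy as np

def fuse_max_probability(P):
    """P: (H, R, C) array, P[h, r, c] = probability that classifier h calls
    the grid at (r, c) a target. A grid is kept when its best target
    probability beats the background probability of the other classifiers."""
    H = P.shape[0]
    best = P.argmax(axis=0)                        # classifier giving P_max
    p_max = P.max(axis=0)
    Q = 1.0 - P                                    # background probabilities
    # background probability from the union of the remaining classifiers,
    # taken here as their maximum background probability (an assumption)
    q_rest = np.where(np.arange(H)[:, None, None] == best, -np.inf, Q).max(axis=0)
    return p_max > q_rest

P = np.array([[[0.9, 0.2]],
              [[0.6, 0.4]]])                       # 2 classifiers, 1 x 2 grids
decision = fuse_max_probability(P)
```

In the toy array, the first grid's best target probability (0.9) beats the other classifier's background probability (0.4) and is kept; the second grid's 0.4 loses to 0.8 and is rejected.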
5. Conclusions
Most of the existing multi-aspect detection methods are designed for isolated targets with relatively simple backgrounds. The proposed method provides a new choice at the image level for complex application scenarios. Based on the variations between different images, it can work effectively in the presence of diverse information, and can thus be applied in cluttered backgrounds such as urban areas for their monitoring and planning.
Our method contains three steps. Firstly, we calculate the variances of different indexes derived from different characteristics, and integrate the variances as criteria for prescreening. Secondly, we remodel the variations of the same indexes into vectors for finer feature fusion. The vectors are then put into two SVM classifiers, respectively, according to two different variation pattern definitions. Thirdly, the independent results of the SVMs are fused at the decision level for the final judgment. It is not necessary to know the aspect of each image in advance in the proposed method. There are also no strict restrictions on the number of images or their aspect intervals. The method may be improved in several respects in the future: new registration methods specifically developed for multi-aspect images may benefit the subsequent detection steps; different feature screening methods or attempts with other emerging classification algorithms could provide additional performance improvement; and further measures can be taken in the processing of target area boundaries. Finally, we expect to combine multi-aspect SAR images with optical images for multi-modal applications.