Article

Learning Static-Adaptive Graphs for RGB-T Image Saliency Detection

1 School of Computer and Information Engineering, Fuyang Normal University, Fuyang 236037, China
2 Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China
3 School of Computer Science and Technology, Anhui University, Hefei 230601, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Information 2022, 13(2), 84; https://doi.org/10.3390/info13020084
Submission received: 25 December 2021 / Revised: 10 February 2022 / Accepted: 10 February 2022 / Published: 12 February 2022
(This article belongs to the Topic Soft Computing)

Abstract
Many works on image saliency detection have been proposed to handle challenging issues such as low illumination, cluttered backgrounds, and low contrast. Although these algorithms achieve good performance, detection based on the RGB modality alone remains poor in such scenes. Inspired by the recent progress of multi-modality fusion, we propose a novel RGB-thermal saliency detection algorithm that learns static-adaptive graphs. Specifically, we first extract superpixels from the two modalities and calculate their affinity matrix. Then, we learn the affinity matrix dynamically and construct a static-adaptive graph. Finally, the saliency maps are obtained by a two-stage ranking algorithm. Our method is evaluated on the RGBT-Saliency dataset with eleven kinds of challenging subsets. Experimental results show that the proposed method has better generalization performance. The complementary benefits of RGB and thermal images, together with the more robust feature expression of the learned static-adaptive graphs, provide an effective way to improve saliency detection in complex scenes.

1. Introduction

Image saliency detection aims to quickly capture the most important and useful information in a scene by exploiting the human visual attention mechanism. It can reduce the complexity of subsequent image processing and has been applied to numerous vision problems, including image classification [1], image retrieval [2], image encryption [3,4], video summarization [5], and so on. In the past few decades, researchers have proposed many saliency detection algorithms, which can be divided into bottom-up data-driven models and top-down task-driven methods. Bottom-up models [6,7,8,9] take the underlying image features and some priors into consideration, such as color, texture, orientation, and brightness. Itti et al. [10] proposed a visual attention mechanism, which opened research on saliency detection in the field of computer vision. Cheng et al. [11] introduced a regional contrast-based salient object detection algorithm, which simultaneously evaluates global contrast differences and spatially weighted coherence scores. Wang et al. [12] improved saliency detection by optimizing seeds. Top-down models [13,14] are task driven: they use a large amount of training data with category labels and supervised learning to conduct a task-oriented analysis. Recently, most of these methods have been based on deep learning; they achieve better performance, but their training processes are time-consuming. We focus on bottom-up models. Scholars have made many attempts to improve image saliency detection and have obtained good performance in simple scenes. However, the effectiveness of traditional RGB saliency detection methods decreases sharply in complex scenes, such as those with poor lighting or salient objects whose color and texture resemble the background. In recent years, multi-source sensor technology has become popular in image processing. Li et al. [15,16,17] simultaneously extracted RGB and thermal features for tracking, which effectively improved video target tracking at night or in rainy, hazy, and foggy weather. Zhang et al. [18] extracted deep features of RGB and thermal images and then fused the two for saliency detection, which greatly improved detection effectiveness in the case of poor illumination or when the object color and texture are similar to the background. The fusion of RGB and thermal images has thus proven effective for image saliency detection. RGB images provide texture details with high definition in a manner consistent with the human visual system in simple scenes. By contrast, thermal images work well in low illumination and also discriminate well when the target and the background have similar colors or shapes. RGB-T saliency detection algorithms can therefore obtain better results under challenging conditions including low illumination, cluttered background, low contrast, and so on. Graph-based models [19,20,21] use pixels or superpixels as nodes and the similarity between nodes as edge weights to generate a graph, which captures the structural character of the input images for RGB-T saliency detection. However, the existing graph-based fusion models only use a static graph. The limitation of this kind of method is that it cannot explore the relationship between nodes at the target level or achieve a better fusion of multi-modality information.
Inspired by these methods, we consider the spatial connectivity feature of graph nodes to learn a static-adaptive graph, and propose a novel RGB-thermal saliency detection algorithm to obtain more effective results, as in Figure 1.
Specifically, we first extract superpixels from the two modalities and calculate their affinity matrix. Then, we learn the affinity matrix dynamically and construct a static-adaptive graph. Finally, the saliency maps can be obtained by a two-stage ranking algorithm. The contributions of this paper are summarized as follows.
  • We construct an adaptive graph by sparse representation and carry out the optimization solution;
  • We learn a novel static-adaptive graph model to increase the fusion capacity by considering the spatial connectivity features of graph nodes in RGB-T saliency detection;
  • We compare our method with the state-of-the-art methods on an RGB-T dataset with 11 kinds of challenging subsets. The experimental results verify the effectiveness of our method.

2. Related Work

In this section, we give a brief review of methods closely related to our work. The relevant work in this paper mainly includes the graph-based method, multi-modality fusion method, and subspace-based method.
Graph-based method. In the past few decades, graph-based models have been successfully used for saliency detection problems. Harel et al. [19] proposed a graph model that takes pixel points as graph nodes, constructs edges between them according to spatial distance and feature distance, and uses a Markov random field for feature fusion. Yang et al. [20] proposed a manifold ranking algorithm based on a static graph, which is a typical two-stage model to gain more accurate saliency maps. Jiang et al. [21] calculated a preliminary saliency map by the Markov absorption probability on a weighted graph, using partial image borders as the background prior. Zhang et al. [22] used a multi-scale strategy to improve the manifold ranking algorithm. Xiao et al. [23] proposed a prior-regularized multi-layer graph ranking model, in which the prior is calculated from boundary connectivity. Aytekin et al. [24] proposed a graph model that uses a convolution kernel function network to learn the connection weight coefficients.
Multi-modality fusion method. In recent years, with the development of multi-sensors, multi-modality fusion has become an effective new means to improve computer vision tasks. Li et al. [25] combined grayscale and thermal information to deal with target tracking in complex scenes. Li et al. [15] used multispectral (RGB and thermal) data to improve visual tracking effectiveness. Li et al. [26] established a unified RGB-T dataset and proposed a new algorithm to fuse RGB and thermal images for saliency detection, which incorporates cross-modality consistency constraints to integrate the different modalities collaboratively. RGB-D is an effective multi-modal fusion method in many areas, such as manufacturing [27], semantic segmentation [28,29,30], and saliency detection [31,32]. Liu et al. [33] used three transformer encoders with shared weights to enhance multi-level features, and their algorithm effectively improves saliency detection.
Subspace-based method. Subspace-based methods represent high-dimensional data in a low-dimensional subspace. The purpose of subspace representation is to obtain a similarity matrix in the basic subspace of the original data. In a dataset, each data point can be reconstructed by an effective combination of the other points, which is often helpful for data processing, because data reveal their intrinsic characteristics more clearly in a low-dimensional subspace. Guo et al. [34] proposed a subspace segmentation method that jointly learns the data representation and the affinity matrix in a single model. Li et al. [35] represented each patch with a linear combination of the remaining ones and learned the weights of the global and local features of the detected object, achieving good results in visual tracking.
We learn static-adaptive graphs for saliency detection. The static graph is the traditional graph: its structure is fixed, and it only considers the relationship between adjacent nodes. The adaptive graph is obtained by the subspace method to mine the internal relationship between superpixels. Therefore, our algorithm considers both local and global features and is more effective than saliency detection algorithms based only on the static graph. For the modalities, we fuse RGB and thermal images because they are naturally complementary. Compared with RGB-D saliency detection algorithms, the RGB-T saliency detection algorithm has much lower hardware requirements and can run well on a computer with an i3 3.3 GHz CPU and 4 GB of RAM.

3. Brief Review of Manifold Ranking

A manifold ranking (MR) model [20] is a typical graph-based method for saliency detection. For an image, simple linear iterative clustering (SLIC) [36] is usually used to obtain $n$ superpixels as graph nodes. Consider a graph $G=(V,E)$, where $V$ is the node set. Some nodes are labeled as queries, and the rest need to be ranked according to their relevance to the queries. Let $X=[x_1, x_2, \dots, x_n]\in\mathbb{R}^{d\times n}$ be the feature matrix, where $d$ is the dimensionality of the feature vector and $n$ is the number of superpixels. $E$ is the set of undirected edges, and $W_{ij}$ is the weight of the edge between nodes $i$ and $j$, computed from the feature vectors of the two nodes. Let $q=[q_1, q_2, \dots, q_n]^T$ denote an indication vector, where $q_i=1$ if node $i$ is a labeled query and $q_i=0$ otherwise. The aim of MR is to obtain a ranking value $f_i$ for each graph node, which can be computed by solving Equation (1),
$\min_{f} \frac{1}{2}\left(\sum_{i,j=1}^{n} W_{ij}\left\|\frac{f_i}{\sqrt{D_{ii}}}-\frac{f_j}{\sqrt{D_{jj}}}\right\|^{2}+\mu\sum_{i=1}^{n}\left\|f_i-q_i\right\|^{2}\right)$
where $D_{ii}=\sum_{j=1}^{n}W_{ij}$.
To obtain more effective results, Yang et al. [20] obtained the ranking value by using the un-normalized Laplacian matrix in Equation (2),
$f=(D-\lambda W)^{-1}q,$
where $D=\mathrm{diag}\{D_{11},\dots,D_{nn}\}$ is the degree matrix and $\lambda=1/(1+\mu)$.
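To make the ranking step concrete, the following is a minimal numpy sketch of Equation (2); the function name and the toy graph are our own, and the closed-form solve mirrors the formula above.

```python
import numpy as np

def manifold_ranking(W, q, mu=0.01):
    """Rank graph nodes against the query vector q (Equation (2)).

    W  : (n, n) symmetric affinity matrix of the superpixel graph.
    q  : (n,) indicator vector, 1 for query (seed) nodes, 0 otherwise.
    mu : regularization weight from Equation (1); lambda = 1 / (1 + mu).
    """
    D = np.diag(W.sum(axis=1))           # degree matrix
    lam = 1.0 / (1.0 + mu)
    return np.linalg.solve(D - lam * W, q)  # f = (D - lambda * W)^{-1} q

# Toy usage: 4 nodes on a chain graph, node 0 labeled as a query.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
q = np.array([1.0, 0.0, 0.0, 0.0])
print(manifold_ranking(W, q))
```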

4. Static-Adaptive Graph Learning

4.1. Static-Adaptive Graph Construction

The graph in traditional models is static; most such graphs only consider adjacent nodes and boundary nodes. The limitation of this kind of method is that it cannot explore the relationship between nodes at the target level. Therefore, we consider the spatial connectivity features of graph nodes to construct a static-adaptive graph, in which superpixels with similar features in a region are also connected. Consider multiple graphs $G^m=(V^m,E^m)$, $m=1,2,\dots,M$, where $V^m$ is a node set and $E^m$ is the set of undirected edges. Let $X^m=[x_1^m,x_2^m,\dots,x_N^m]\in\mathbb{R}^{d\times N}$, $m=1,2,\dots,M$, be the feature matrix of the $m$-th modality, where $N$ is the number of graph nodes and $d$ is the dimensionality of the feature vector. As in traditional static graphs [20], two nodes are considered to have an edge when they meet one of the following three conditions.
(1) Two nodes are directly adjacent;
(2) There is a common edge between the two nodes;
(3) The superpixels are on the four boundaries.
If there is an edge between two nodes, the weight of the edge is calculated by Equation (3).
$W_{i,j}^{m}=e^{-\gamma_{0}\left\|x_i^{m}-x_j^{m}\right\|},\quad m=1,2,\dots,M,$
where $x_i^m$ denotes the mean of the $i$-th superpixel in the $m$-th modality, and $\gamma_0$ is a parameter.
We add the adaptive graph weight matrix to obtain the weight matrix of the static-adaptive graph, as in Figure 2, which is calculated by Equation (4).
$W=W^{a}+\sum_{m=1}^{M}t^{m}W^{m},$
where $W^a$ is the weight matrix of the adaptive graph, which is obtained by adaptive graph learning, and $W^m=[W_{ij}^m]_{N\times N}$, $m=1,2,\dots,M$, is the initial weight matrix of the $m$-th modality. The coefficient $t^m$ indicates the relative importance of the static graph of each modality and the adaptive graph.
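As an illustration of Equations (3) and (4), the sketch below builds the static weight matrix of one modality from superpixel mean features and combines it with an adaptive weight matrix. The function names, the boolean adjacency input (encoding the three edge rules above), and the weights $t^m$ are our own assumptions, not part of the released implementation.

```python
import numpy as np

def static_graph_weights(features, adjacency, gamma0=10.0):
    """Static graph weights of one modality (Equation (3)).

    features  : (N, d) mean feature of each superpixel in this modality.
    adjacency : (N, N) boolean matrix, True where the three static-graph
                rules above place an edge between two superpixels.
    """
    # pairwise feature distances ||x_i - x_j||
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    W = np.exp(-gamma0 * dist)
    return W * adjacency                 # keep weights only on static edges

def static_adaptive_weights(W_adaptive, static_list, t):
    """Combine the adaptive graph and per-modality static graphs (Equation (4))."""
    W = W_adaptive.copy()
    for t_m, W_m in zip(t, static_list):
        W = W + t_m * W_m
    return W
```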

4.2. Adaptive Graph Learning Model Formulation

For $M$ graphs $G^m=(V^m,E^m)$, $m=1,2,\dots,M$, we assume that all nodes in each graph belong to the same sparse subspace, in which each node can be sparsely represented by the remaining nodes. We can write $X^m=X^mZ^m$, $m=1,2,\dots,M$, where $Z^m\in\mathbb{R}^{N\times N}$ is the sparse coefficient matrix. Sparse constraints automatically select the most informative neighbor nodes for each node and make the graph more powerful. Since the nodes are often disturbed by noise, we introduce a noise matrix $E^m\in\mathbb{R}^{d\times N}$ to improve robustness. The joint sparse representation with convex relaxation for all modalities can be written as,
$\min_{Z,E^{m}}\ \alpha\|Z\|_{1}+\beta\sum_{m=1}^{M}\|E^{m}\|_{2,1},\quad \mathrm{s.t.}\ X^{m}=X^{m}Z^{m}+E^{m},$
where $\alpha$ and $\beta$ are balancing parameters, and $Z=[Z^1;\dots;Z^M]\in\mathbb{R}^{N\times(MN)}$ is the joint sparse representation coefficient matrix.
We consider the spatial connectivity of the graph nodes and use $C\in\mathbb{R}^{N\times N}$ to indicate the spatial connections of neighboring nodes, as in Equation (6):
$C_{ij}=\begin{cases}1, & \text{if nodes } i \text{ and } j \text{ are 8-neighboring},\\ 0, & \text{otherwise}.\end{cases}$
The closer the distance, the greater the relevance. Inspired by [35], to capture the global and local structure information, we employ Equation (7) to learn the adaptive graph affinity matrix.
$\min_{W^{a}}\ \frac{\gamma}{2}\sum_{i,j=1}^{N}\|Z_i-Z_j\|_{F}^{2}W_{ij}^{a}+\frac{\delta}{2}\sum_{i,j=1}^{N}C_{ij}\|Z_i-Z_j\|_{F}^{2}+\lambda_{1}\|W^{a}\|_{F}^{2},\quad \mathrm{s.t.}\ (W^{a})^{T}\mathbf{1}=\mathbf{1},\ W^{a}\geq 0,$
where $\gamma$ and $\delta$ are balancing parameters. The first term reflects the probability $W_{ij}^a$ that nodes $i$ and $j$ come from the same cluster, based on the distance between their representations $Z_i$ and $Z_j$. The second term indicates that two spatially close nodes should have similar representations. The term $\lambda_1\|W^a\|_F^2$ avoids over-fitting of $W^a$. $\mathbf{1}$ denotes the all-ones vector, and the constraints $(W^a)^T\mathbf{1}=\mathbf{1}$, $W^a\geq 0$ guarantee the probability property of $W_{ij}^a$. Combining Equations (5) and (7), we obtain the following objective function,
$\min_{Z,E^{m},W^{a}}\ \alpha\|Z\|_{1}+\frac{\gamma}{2}\sum_{i,j=1}^{N}\|Z_i-Z_j\|_{F}^{2}W_{ij}^{a}+\frac{\delta}{2}\sum_{i,j=1}^{N}C_{ij}\|Z_i-Z_j\|_{F}^{2}+\lambda_{1}\|W^{a}\|_{F}^{2}+\beta\sum_{m=1}^{M}\|E^{m}\|_{2,1},\quad \mathrm{s.t.}\ X^{m}=X^{m}Z^{m}+E^{m},\ (W^{a})^{T}\mathbf{1}=\mathbf{1},\ W^{a}\geq 0.$
To simplify the problem, let $D_{ii}^{a}=\sum_{j=1}^{N}W_{ij}^{a}$ and $D_{ii}^{c}=\sum_{j=1}^{N}C_{ij}$. After a slight algebraic transformation, Equation (8) becomes,
$\min_{Z,E^{m},W^{a}}\ \alpha\|Z\|_{1}+\gamma\,\mathrm{tr}(ZL^{a}Z^{T})+\delta\,\mathrm{tr}(ZL^{c}Z^{T})+\lambda_{1}\|W^{a}\|_{F}^{2}+\beta\sum_{m=1}^{M}\|E^{m}\|_{2,1},\quad \mathrm{s.t.}\ X^{m}=X^{m}Z^{m}+E^{m},\ (W^{a})^{T}\mathbf{1}=\mathbf{1},\ W^{a}\geq 0,$
where $L^{a}=D^{a}-W^{a}$ and $L^{c}=D^{c}-C$ are the Laplacian matrices of $W^{a}$ and $C$, respectively.
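For reference, below is a small numpy sketch of the spatial connectivity matrix $C$ of Equation (6) and of the unnormalized Laplacians used in Equation (9); it assumes a SLIC-style integer label map as input, and the helper names are ours.

```python
import numpy as np

def connectivity_matrix(labels):
    """Spatial connectivity C (Equation (6)): C[i, j] = 1 when superpixels i
    and j are 8-neighboring on the label map, 0 otherwise.

    labels : (H, W) integer superpixel label map (e.g., SLIC output).
    """
    N = int(labels.max()) + 1
    C = np.zeros((N, N))
    # pairs of pixels that are horizontal, vertical, or diagonal neighbors
    pairs = [
        (labels[:, :-1],   labels[:, 1:]),    # right neighbor
        (labels[:-1, :],   labels[1:, :]),    # bottom neighbor
        (labels[:-1, :-1], labels[1:, 1:]),   # bottom-right neighbor
        (labels[:-1, 1:],  labels[1:, :-1]),  # bottom-left neighbor
    ]
    for a, b in pairs:
        mask = a != b                         # only cross-superpixel borders
        C[a[mask], b[mask]] = 1
        C[b[mask], a[mask]] = 1
    return C

def laplacian(A):
    """Unnormalized Laplacian L = D - A, used for L^a and L^c in Equation (9)."""
    return np.diag(A.sum(axis=1)) - A
```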

4.3. Optimization

The variables in Equation (9) are not jointly convex, but the subproblem of each variable is convex when the others are fixed and has a closed-form solution. We introduce two auxiliary variables, $P^m$ and $Q^m$, to make Equation (9) separable, and then use the alternating direction method of multipliers (ADMM) [37] for iterative optimization, yielding Equation (10).
$\min_{Z,E^{m},W^{a}}\ \alpha\|Z\|_{1}+\gamma\,\mathrm{tr}(ZL^{a}Z^{T})+\delta\,\mathrm{tr}(ZL^{c}Z^{T})+\lambda_{1}\|W^{a}\|_{F}^{2}+\beta\sum_{m=1}^{M}\|E^{m}\|_{2,1},\quad \mathrm{s.t.}\ P^{m}=Z^{m},\ Q^{m}=Z^{m},\ X^{m}=X^{m}Z^{m}+E^{m},\ (W^{a})^{T}\mathbf{1}=\mathbf{1},\ W^{a}\geq 0.$
Thus, we obtain the augmented Lagrange function [38] as Equation (11),
$\min_{Z,E^{m},W^{a},P,Q}\ \alpha\|Q\|_{1}+\gamma\,\mathrm{tr}(PL^{a}P^{T})+\delta\,\mathrm{tr}(PL^{c}P^{T})+\lambda_{1}\|W^{a}\|_{F}^{2}+\sum_{m=1}^{M}\Big(\beta\|E^{m}\|_{2,1}+\frac{\mu}{2}\Big\|X^{m}-X^{m}Z^{m}-E^{m}+\frac{Y_{1}^{m}}{\mu}\Big\|_{F}^{2}+\frac{\mu}{2}\Big\|P^{m}-Z^{m}+\frac{Y_{2}^{m}}{\mu}\Big\|_{F}^{2}+\frac{\mu}{2}\Big\|Q^{m}-Z^{m}+\frac{Y_{3}^{m}}{\mu}\Big\|_{F}^{2}-\frac{1}{2\mu}\big(\|Y_{1}^{m}\|_{F}^{2}+\|Y_{2}^{m}\|_{F}^{2}+\|Y_{3}^{m}\|_{F}^{2}\big)\Big),$
where $P=[P^{1};P^{2};\dots;P^{M}]$ and $Q=[Q^{1};Q^{2};\dots;Q^{M}]$; $\mu$ is a penalty parameter, and $Y_{1}^{m}$, $Y_{2}^{m}$, and $Y_{3}^{m}$ are Lagrange multipliers.
Five variables, $Z$, $E^m$, $W^a$, $P$, and $Q$, need to be solved in Equation (11). The solver iteratively updates one variable at a time while fixing the others.
Z-subproblem: To update $Z$, we fix the other variables in Equation (11); the $Z$-subproblem can then be written as Equation (12). Taking the derivative with respect to $Z$ and setting it to zero yields Equation (13),
$\min_{Z}\ \sum_{m=1}^{M}\Big(\frac{\mu}{2}\Big\|X^{m}-X^{m}Z^{m}-E^{m}+\frac{Y_{1}^{m}}{\mu}\Big\|_{F}^{2}+\frac{\mu}{2}\Big\|P^{m}-Z^{m}+\frac{Y_{2}^{m}}{\mu}\Big\|_{F}^{2}+\frac{\mu}{2}\Big\|Q^{m}-Z^{m}+\frac{Y_{3}^{m}}{\mu}\Big\|_{F}^{2}\Big),$
$Z^{m,k+1}=\big(\mu^{k}(X^{m})^{T}X^{m}+2\mu^{k}I\big)^{-1}\big(\mu^{k}(X^{m})^{T}X^{m}-\mu^{k}(X^{m})^{T}E^{m,k}+(X^{m})^{T}Y_{1}^{m,k}+\mu^{k}P^{m,k}+\mu^{k}Q^{m,k}-Y_{2}^{m,k}-Y_{3}^{m,k}\big).$
P-subproblem: To update $P$, we fix the other variables in Equation (11); the $P$-subproblem can then be written as Equations (14) and (15). Taking the derivative with respect to $P$ and setting it to zero yields Equation (16),
$\min_{P}\ \gamma\,\mathrm{tr}(PL^{a}P^{T})+\delta\,\mathrm{tr}(PL^{c}P^{T})+\sum_{m=1}^{M}\frac{\mu}{2}\Big\|P^{m}-Z^{m}+\frac{Y_{2}^{m}}{\mu}\Big\|_{F}^{2},$
$\min_{P}\ \gamma\,\mathrm{tr}(PL^{a}P^{T})+\delta\,\mathrm{tr}(PL^{c}P^{T})+\frac{\mu}{2}\Big\|P-Z+\frac{Y_{2}}{\mu}\Big\|_{F}^{2},$
$P^{k+1}=\big(\mu^{k}Z^{k+1}-Y_{2}^{k}\big)\big(\gamma(L^{a})^{k}+\gamma((L^{a})^{k})^{T}+\delta(L^{c})^{k}+\delta((L^{c})^{k})^{T}+\mu^{k}I\big)^{-1}.$
Q-subproblem: To update $Q$, we fix the other variables in Equation (11); the $Q$-subproblem can then be written as Equations (17) and (18). Its closed-form solution is given by the soft-thresholding (shrinkage) operator [39] in Equation (19),
$\min_{Q}\ \alpha\|Q\|_{1}+\sum_{m=1}^{M}\frac{\mu}{2}\Big\|Q^{m}-Z^{m}+\frac{Y_{3}^{m}}{\mu}\Big\|_{F}^{2},$
$\min_{Q}\ \alpha\|Q\|_{1}+\frac{\mu}{2}\Big\|Q-Z+\frac{Y_{3}}{\mu}\Big\|_{F}^{2},$
$Q^{k+1}=\mathrm{soft\_thr}\Big(Z^{k+1}-\frac{Y_{3}^{k}}{\mu^{k}},\ \frac{\alpha}{\mu^{k}}\Big).$
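The Q update of Equation (19) is the standard element-wise shrinkage; a minimal numpy version (the name soft_thr mirrors the notation above):

```python
import numpy as np

def soft_thr(X, tau):
    """Element-wise soft-thresholding used in Equation (19):
    soft_thr(x, tau) = sign(x) * max(|x| - tau, 0)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)
```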
$E^m$-subproblem: To update $E^m$, we fix the other variables in Equation (11); the $E^m$-subproblem can then be written as Equation (20). Its closed-form solution is given by the shrinkage operator for the $\ell_{2,1}$ norm [39] in Equation (21),
$\min_{E^{m}}\ \sum_{m=1}^{M}\beta\|E^{m}\|_{2,1}+\frac{\mu}{2}\Big\|X^{m}-X^{m}Z^{m}-E^{m}+\frac{Y_{1}^{m}}{\mu}\Big\|_{F}^{2},$
$E^{m,k+1}=S_{\frac{\beta}{\mu^{k}}}\Big(X^{m}-X^{m}Z^{m,k+1}+\frac{Y_{1}^{m,k}}{\mu^{k}}\Big).$
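Similarly, the $E^m$ update of Equation (21) is a column-wise shrinkage; a minimal sketch, under the assumption that the $\ell_{2,1}$ norm is taken over columns:

```python
import numpy as np

def shrink_21(A, tau):
    """Column-wise shrinkage solving min_E tau * ||E||_{2,1} + 0.5 * ||E - A||_F^2,
    the closed form behind the E^m update in Equation (21)."""
    norms = np.linalg.norm(A, axis=0)                       # l2 norm of each column
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return A * scale[None, :]
```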
$W^a$-subproblem: To update $W^a$, we fix the other variables in Equation (11); the $W^a$-subproblem can then be written as Equation (22). Taking the derivative with respect to $W^a$ and setting it to zero yields Equation (23),
$\min_{W^{a}}\ \gamma\,\mathrm{tr}(PL^{a}P^{T})+\lambda_{1}\|W^{a}\|_{F}^{2}=\min_{W^{a}}\ \frac{\gamma}{2}\sum_{i,j=1}^{N}\|P_i-P_j\|_{F}^{2}W_{ij}^{a}+\lambda_{1}\|W^{a}\|_{F}^{2},$
$(W^{a}_{ij})^{k+1}=\Big(\frac{1+\sum_{j=1}^{s}\hat{U}_{ij}}{s}-U_{ij}\Big)_{+},$
where $U_{j}\in\mathbb{R}^{N\times 1}$ is a vector whose $i$-th element is $U_{ij}=\gamma\|P_i-P_j\|_{F}^{2}/\lambda_{1}$.
The Lagrange multipliers are updated by Equation (24),
$Y_{1}^{m,k+1}=Y_{1}^{m,k}+\mu^{k}\big(X^{m}-X^{m}Z^{m,k+1}-E^{m,k+1}\big),\quad Y_{2}^{m,k+1}=Y_{2}^{m,k}+\mu^{k}\big(Z^{m,k+1}-P^{m,k+1}\big),\quad Y_{3}^{m,k+1}=Y_{3}^{m,k}+\mu^{k}\big(Z^{m,k+1}-Q^{m,k+1}\big).$
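The dual step of Equation (24) for one modality can be written compactly as follows; one ADMM iteration runs the five primal updates above and then this step (the schedule for the penalty $\mu$ is not specified here).

```python
import numpy as np

def update_multipliers(Y1, Y2, Y3, X, Z, E, P, Q, mu):
    """Lagrange multiplier updates of Equation (24) for one modality."""
    Y1 = Y1 + mu * (X - X @ Z - E)   # constraint X = X Z + E
    Y2 = Y2 + mu * (Z - P)           # constraint P = Z
    Y3 = Y3 + mu * (Z - Q)           # constraint Q = Z
    return Y1, Y2, Y3
```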

5. RGB-T Salient Detection

Given a pair of RGB-T images, and considering that the thermal image has a stronger anti-interference ability in complex scenes, we first fuse the RGB and thermal images at a ratio of 1:4. We then apply the simple linear iterative clustering (SLIC) algorithm to the fused image to generate $N$ non-overlapping superpixels. A two-stage ranking model is adopted to calculate the final saliency map. In the first stage, we take the boundary as a prior and select the nodes along the image border as background seed queries. We use the top, bottom, left, and right sides of the image as four kinds of queries, $q^t$, $q^b$, $q^l$, $q^r$, which are selected separately to obtain four different detection results, $f^t$, $f^b$, $f^l$, $f^r$, by Equation (2). Considering the symmetry of the image and that salient objects often cross the left and bottom boundaries, we use the k-means method to obtain two clusters on the left and bottom boundaries separately and select the larger cluster of nodes as queries. Then, we normalize $f^k$ ($k=t,b,l,r$) to the range between 0 and 1 as $\hat{f}^k$. The saliency value vector of the $N$ nodes, $s^k$, is obtained by $s^k=1-\hat{f}^k$ ($k=t,b,l,r$). The saliency ranking value vector of all nodes in the first stage, $s^1$, is calculated by Equation (25).
$s^{1}=s^{t}\times s^{b}\times s^{l}\times s^{r}.$
By exploiting the object characteristics, a second ranking is performed to improve the first-stage saliency values. Given $s^1$, we set an adaptive threshold to generate foreground queries $q^2$. Then, Equation (2) is used to obtain the second ranking result $s^2$, which is normalized to the range between 0 and 1 as $\hat{s}^2$. To further reduce the background noise, we let $s=\hat{s}^1\times\hat{s}^2$ be the final saliency value and obtain the final saliency map $S$. The main steps of the two-stage RGB-T salient object detection algorithm are summarized in Algorithm 1.
Algorithm 1 The static-adaptive graph-based RGB-T salient object detection procedure.
Require: the static-adaptive graph weight matrix $W$ and the indicator vectors of the four boundary queries $q^t$, $q^b$, $q^l$, $q^r$.
1: Use Equation (2) to obtain $f^t$, $f^b$, $f^l$, $f^r$ separately;
2: Normalize $f^t$, $f^b$, $f^l$, and $f^r$ to the range [0, 1];
3: Set $s^t=1-\hat{f}^t$, $s^b=1-\hat{f}^b$, $s^l=1-\hat{f}^l$, $s^r=1-\hat{f}^r$;
4: Obtain the first saliency value vector $s^1=s^t\times s^b\times s^l\times s^r$;
5: Normalize $s^1$ to the range [0, 1] to obtain $\hat{s}^1$;
6: Binarize $\hat{s}^1$ with an adaptive threshold to obtain the foreground query $q^2$;
7: Use Equation (2) to obtain the second saliency value vector $s^2$;
8: Normalize $s^2$ to the range [0, 1] to obtain $\hat{s}^2$;
9: Set $s=\hat{s}^1\times\hat{s}^2$ to suppress the background noise of the image;
10: Assign each superpixel value $s_i$ to its pixels and obtain the final saliency map $S$.
Ensure: $S$ is the saliency map of the static-adaptive graph model for RGB-T saliency detection.
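The two-stage procedure can be sketched compactly on top of Equation (2). This is a simplified version under our own assumptions: the k-means refinement of the left and bottom queries is omitted, the adaptive threshold is taken as the mean of the first-stage saliency, and the helper names are ours.

```python
import numpy as np

def rank(W, q, mu=0.01):
    """Equation (2): f = (D - lambda * W)^{-1} q with lambda = 1 / (1 + mu)."""
    D = np.diag(W.sum(axis=1))
    return np.linalg.solve(D - W / (1.0 + mu), q)

def normalize(v):
    return (v - v.min()) / (v.max() - v.min() + 1e-12)

def two_stage_saliency(W, q_t, q_b, q_l, q_r, mu=0.01):
    """Two-stage ranking of Algorithm 1 on the static-adaptive graph W.

    q_t, q_b, q_l, q_r : boundary query indicator vectors (top/bottom/left/right).
    Returns a per-superpixel saliency vector.
    """
    # Stage 1: background (boundary) queries.
    s1 = np.ones(W.shape[0])
    for q in (q_t, q_b, q_l, q_r):
        s1 *= 1.0 - normalize(rank(W, q, mu))
    s1 = normalize(s1)

    # Stage 2: foreground queries from an adaptive threshold on s1 (here: its mean).
    q2 = (s1 >= s1.mean()).astype(float)
    s2 = normalize(rank(W, q2, mu))
    return s1 * s2
```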

6. Experiment

6.1. Datasets and Experimental Settings

The RGBT-Saliency dataset [26] includes 821 pairs of images with ground truth; the images are highly diverse and were recorded under different scenes and environmental conditions.
The dataset can be downloaded from http://chenglongli.cn/people/lcl/journals.html (accessed on 20 December 2021).
The initial segmentation number of the superpixel N is set to 250. The edge weight coefficient θ is set to 29. Other parameters in this paper are set to α = 0.11 , β = 0.15 , γ = 0.04 , δ = 0.3 , and λ 1 = 0.6 .

6.2. Measuring Standard

To verify the effectiveness of our algorithm, we compared it with other methods using precision, recall, and F-measure (PRF) values, mean absolute error (MAE) values, and PR curves.
PR (precision–recall) curve. The PR curve plots precision on the ordinate against recall on the abscissa. We binarize the saliency map S to obtain M, and then calculate the precision and recall values by comparing M and the ground truth G pixel by pixel with the following formulas,
$\mathrm{Precision}=\frac{|M\cap G|}{|M|},$
$\mathrm{Recall}=\frac{|M\cap G|}{|G|}.$
PRF (precision, recall, F-measure). The P and R indicators can be contradictory, so they need to be considered comprehensively. The most common way is the F-measure (also known as the F-score), a weighted harmonic mean of precision and recall:
$F_{\beta}=\frac{(1+\beta^{2})\times P\times R}{\beta^{2}\times P+R},$
where $\beta^{2}=0.3$.
MAE (mean absolute error). MAE directly computes the average absolute error between the saliency map output by the model and the ground truth. Both maps are first normalized to [0, 1], and the MAE is then calculated with the following formula:
$\mathrm{MAE}=\frac{1}{W\times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left|\bar{S}(x,y)-\bar{G}(x,y)\right|,$
where W and H are the width and height of the saliency map S and the ground truth map G.
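As a reference for the evaluation protocol, here is a small numpy sketch of the above metrics; the fixed binarization threshold is our own simplification (PR curves are obtained by sweeping this threshold).

```python
import numpy as np

def prf_mae(S, G, beta2=0.3, thresh=0.5):
    """Precision, recall, F-measure, and MAE for one saliency map.

    S : predicted saliency map normalized to [0, 1];  G : binary ground-truth map.
    The saliency map is binarized at `thresh` for P/R/F; MAE uses the maps directly.
    """
    M = S >= thresh
    Gb = G > 0.5
    inter = np.logical_and(M, Gb).sum()
    precision = inter / max(M.sum(), 1)
    recall = inter / max(Gb.sum(), 1)
    f_measure = (1 + beta2) * precision * recall / max(beta2 * precision + recall, 1e-12)
    mae = np.abs(S.astype(float) - Gb.astype(float)).mean()
    return precision, recall, f_measure, mae
```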

6.3. Comparison Results

We compared our model with eight methods including BR [40], CA [41], MCI [42], NFI [43], SS-KDE [44], GMR [20], GR [45], and MTMR [26] on the RGBT-Saliency dataset.
We generated PR curves for the 11 challenging subsets and the entire dataset and list their F values. The eleven subsets correspond to eleven different challenges: big salient object (BSO), bad weather (BW), center bias (CB), cross image boundary (CIB), image clutter (IC), low illumination (LI), multiple salient objects (MSO), out of focus (OF), similar appearance (SA), small salient object (SSO), and thermal crossover (TC). Table 1 describes the division criteria of the eleven subsets in detail [26].
As can be seen from Figure 3, only in the BSO and CIB subsets were our F-measures slightly lower than the best detection result; they were the best in the other nine subsets. The advantage is especially obvious in the CB subset, where our PR curve does not cross any of the other curves.
The comparison results of the precision, recall, and F-measure values against other methods in different modalities are shown in Table 2. We only provide the detection results of MTMR [26] after multi-modality fusion, because this model is designed to integrate multi-modal information and uses multi-modal adaptive weights to detect salient objects. From Table 2, we can see that the proposed algorithm is better than the other methods in terms of the P value and the comprehensive F-measure.
Sample results. From the dataset, we selected four images with various challenges as the data source and compared the detection results of our algorithm with those of the other algorithms. It can be seen from Figure 4 that our algorithm achieves very robust detection in challenging scenes such as blurred images, large targets, small targets, complex backgrounds, and center bias.
Runtime results. All results were obtained on a Windows 10 64-bit operating system running Matlab 2014b on an i3 3.3 GHz CPU with 4 GB of RAM. Table 3 compares the average running time with the other algorithms. Compared with the algorithm in [20], the extra time we spend is mainly on learning the adaptive graph.

6.4. Analysis of Our Approach

In our method, we compared the following four variants of salient detection: (1) learning static-adaptive graphs for RGB image salient detection, called our1; (2) learning static-adaptive graphs for thermal image salient detection, called our2; (3) fusing the RGB and thermal images without learning static-adaptive graphs, called our3; (4) learning static-adaptive graphs for RGB-T image salient detection, called our4. It can be seen from Figure 5 that the fusion of multiple modalities and the use of learned static-adaptive graphs are both effective means of improving salient detection.
Advantages. We fuse thermal and RGB images for image salient detection, which overcomes the limitations of illumination, ambient temperature, background clutter, and color similarity in a single modality. Through the static-adaptive learning method, we not only retain the local features of superpixels but also mine their internal relations to obtain a better affinity matrix, which greatly improves the detection accuracy of image saliency.
Limitations. Through the experiments, we found that in complex scenes, multi-modality fusion effectively improves detection in general. However, in some cases a single modality yields better detection accuracy. Our future work will set the modality weights according to the image characteristics to further improve saliency detection in complex scenes.

7. Conclusions

In this paper, we combine RGB and thermal modality information for image salient detection, which effectively improves the detection performance over single-modality RGB images under poor illumination and when the background and foreground colors are similar. At the same time, our method improves the detection accuracy over thermal images under normal lighting conditions, especially when the temperature difference between the environment and the target is small. The graph is dynamically learned, taking both global and local cues into account, and thus our method is capable of capturing the intrinsic relationships of superpixels. In the future, we will assign different weights to the different modalities according to the characteristics of their images.

Author Contributions

Z.X. and J.T. proposed the idea, designed and performed the simulations, and wrote the paper; A.Z. and H.L. analyzed the data. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is funded by the following foundations: National Natural Science Foundation of China (61906044), Natural Science Foundation of Anhui Higher Education Institution of China (KJ2019A0536, KJ2019A0529, KJ2020ZD46), Natural Science Foundation of Fuyang Normal University (2019FSKJ02ZD), Fuyang Normal University Scientific Research Project (2020KYQD0032), the Young Talents Projects of Fuyang Normal University (rcxm202001, 2021FSKJ01ZD), and the Fuyang City School Cooperation Project (SXHZ202103).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset can be downloaded from http://chenglongli.cn/people/lcl/journals.html (accessed on 20 December 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wang, Q.; Lin, J.; Yuan, Y. Salient band selection for hyperspectral image classification via manifold ranking. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 1279–1289.
2. Yang, X.; Qian, X.; Xue, Y. Scalable mobile image retrieval by exploring contextual saliency. IEEE Trans. Image Process. 2015, 24, 1709–1721.
3. Wen, W.; Zhang, Y.; Fang, Y.; Fang, Z. A novel selective image encryption method based on saliency detection. In Proceedings of the Visual Communications and Image Processing (VCIP), Chengdu, China, 27–30 November 2016; pp. 1–4.
4. Wen, W.; Zhang, Y.; Fang, Y.; Fang, Z. Image salient regions encryption for generating visually meaningful ciphertext image. Neural Comput. Appl. 2018, 29, 653–663.
5. Jacob, H.; Padua, F.L.C.; Lacerda, A.; Pereira, A.C.M. A video summarization approach based on the emulation of bottom-up mechanisms of visual attention. J. Intell. Inf. Syst. 2017, 49, 193–211.
6. Zhang, L.; Ai, J.; Jiang, B.; Lu, H.; Li, X. Saliency detection via absorbing Markov chain with learnt transition probability. IEEE Trans. Image Process. 2018, 27, 987–998.
7. Borji, A.; Cheng, M.M.; Jiang, H.; Li, J. Salient object detection: A benchmark. IEEE Trans. Image Process. 2015, 24, 5706–5722.
8. Tong, N.; Lu, H.; Ruan, X.; Yang, M.H. Salient object detection via bootstrap learning. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1884–1892.
9. Zhou, X.; Liu, Z.; Sun, G.; Wang, X. Adaptive saliency fusion based on quality assessment. Multimed. Tools Appl. 2017, 76, 23187–23211.
10. Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259.
11. Cheng, M.M.; Mitra, N.J.; Huang, X.; Torr, P.H.; Hu, S.M. Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 569–582.
12. Wang, H.; Xu, L.; Wang, X.; Luo, B. Learning optimal seeds for ranking saliency. Cogn. Comput. 2018, 10, 347–358.
13. Hou, Q.; Cheng, M.M.; Hu, X.; Borji, A.; Tu, Z.; Torr, P.H. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3203–3212.
14. Han, J.; Zhang, D.; Cheng, G.; Liu, N.; Xu, D. Advanced deep-learning techniques for salient and category-specific object detection: A survey. IEEE Signal Process. Mag. 2018, 35, 84–100.
15. Li, C.; Zhao, N.; Lu, Y.; Zhu, C.; Tang, J. Weighted sparse representation regularized graph learning for RGB-T object tracking. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1856–1864.
16. Li, C.; Zhu, C.; Huang, Y.; Tang, J.; Wang, L. Cross-modal ranking with soft consistency and noisy labels for robust RGB-T tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 808–823.
17. Li, C.; Liang, X.; Lu, Y.; Zhao, N.; Tang, J. RGB-T object tracking: Benchmark and baseline. Pattern Recognit. 2019, 96, 106977.
18. Zhang, Q.; Huang, N.; Yao, L.; Zhang, D.; Shan, C.; Han, J. RGB-T salient object detection via fusing multi-level CNN features. IEEE Trans. Image Process. 2019, 29, 3321–3335.
19. Harel, J.; Koch, C.; Perona, P. Graph-based visual saliency. Adv. Neural Inf. Process. Syst. 2006, 19, 545–552.
20. Yang, C.; Zhang, L.; Lu, H.; Ruan, X.; Yang, M.H. Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, 23–28 June 2013; pp. 3166–3173.
21. Sun, J.; Lu, H.; Liu, X. Saliency region detection based on Markov absorption probabilities. IEEE Trans. Image Process. 2015, 24, 1639–1649.
22. Zhang, L.; Yang, C.; Lu, H.; Ruan, X.; Yang, M.H. Ranking saliency. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1892–1904.
23. Xiao, Y.; Wang, L.; Jiang, B.; Tu, Z.; Tang, J. A global and local consistent ranking model for image saliency computation. J. Vis. Commun. Image Represent. 2017, 46, 199–207.
24. Aytekin, Ç.; Iosifidis, A.; Kiranyaz, S.; Gabbouj, M. Learning graph affinities for spectral graph-based salient object detection. Pattern Recognit. 2017, 64, 159–167.
25. Li, C.; Cheng, H.; Hu, S.; Liu, X.; Tang, J.; Lin, L. Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Trans. Image Process. 2016, 25, 5743–5756.
26. Li, C.; Wang, G.; Ma, Y.; Zheng, A.; Luo, B.; Tang, J. A unified RGB-T saliency detection benchmark: Dataset, baselines, analysis and a novel approach. arXiv 2017, arXiv:1701.02829.
27. Giacomo, C.; Grazia, L.S.; Christian, N.; Rafi, S.; Marcin, W. Optimizing the organic solar cell manufacturing process by means of AFM measurements and neural networks. Energies 2018, 11, 1221.
28. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019.
29. Hu, X.; Yang, K.; Fei, L.; Wang, K. ACNet: Attention based network to exploit complementary features for RGBD semantic segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, China, 22–25 September 2019; pp. 1440–1444.
30. Zhang, J.; Yang, K.; Constantinescu, A.; Peng, K.; Müller, K.; Stiefelhagen, R. Trans4Trans: Efficient transformer for transparent object segmentation to help visually impaired people navigate in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 19–25 June 2021; pp. 1760–1770.
31. Liu, N.; Han, J.; Yang, M.H. PiCANet: Learning pixel-wise contextual attention for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3089–3098.
32. Liu, Z.; Tan, Y.; He, Q.; Xiao, Y. SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2021.
33. Liu, Z.; Wang, Y.; Tu, Z.; Xiao, Y.; Tang, B. TriTransNet: RGB-D salient object detection with a triplet transformer embedding network. In Proceedings of the 29th ACM International Conference on Multimedia, New York, NY, USA, 20–24 October 2021; pp. 4481–4490.
34. Guo, X. Robust subspace segmentation by simultaneously learning data representations and their affinity matrix. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), Buenos Aires, Argentina, 25–31 July 2015; AAAI Press: Palo Alto, CA, USA, 2015; pp. 3547–3553.
35. Li, C.; Wu, X.; Bao, Z.; Tang, J. ReGLe: Spatially regularized graph learning for visual tracking. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 252–260.
36. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282.
37. Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers; Now Publishers Inc.: Hanover, MA, USA, 2010; Volume 3, pp. 1–122.
38. Lin, Z.; Chen, M.; Ma, Y. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv 2010, arXiv:1009.5055.
39. Chen, M.; Ganesh, A.; Lin, Z.; Ma, Y.; Wright, J.; Wu, L. Fast Convex Optimization Algorithms for Exact Recovery of a Corrupted Low-Rank Matrix; Report No. UILU-ENG-09-2214; Coordinated Science Laboratory: Urbana, IL, USA, 2009.
40. Rahtu, E.; Kannala, J.; Salo, M.; Heikkilä, J. Segmenting salient objects from images and videos. In Proceedings of the European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 366–379.
41. Qin, Y.; Lu, H.; Xu, Y.; Wang, H. Saliency detection via cellular automata. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 110–119.
42. Goferman, S.; Zelnik-Manor, L.; Tal, A. Context-aware saliency detection. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1915–1926.
43. Erdem, E.; Erdem, A. Visual saliency estimation by nonlinearly integrating features using region covariances. J. Vis. 2013, 13, 11.
44. Tavakoli, H.R.; Rahtu, E.; Heikkilä, J. Fast and efficient saliency detection using sparse sampling and kernel density estimation. In Proceedings of the Scandinavian Conference on Image Analysis, Ystad, Sweden, 23–25 May 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 666–675.
45. Yang, C.; Zhang, L.; Lu, H. Graph-regularized saliency detection with convex-hull-based center prior. IEEE Signal Process. Lett. 2013, 20, 637–640.
Figure 1. Comparative results of the static-adaptive graph-based method with the traditional static graph model. (a) RGB image; (b) thermal image; (c) the saliency map generated by the static graph-based model; (d) the saliency map generated by our model; (e) ground truth.
Figure 2. The general view of the static-adaptive graph on the multi-modality fusion image. The blue edges are obtained by the traditional static graph. The green edges are obtained by our adaptive graph learning model.
Figure 3. PR curves of the proposed approach and other baseline methods with RGB-T input on the eleven subsets and the entire dataset. The $F_{0.3}$ values are shown in the legend.
Figure 4. Sample results of the proposed approach and other baseline methods with the fusion of RGB and thermal inputs. (a) The first two columns are the original RGB images and thermal images. (b–i) The results of the baseline methods with RGB and thermal inputs; (j) the result of our approach; (k) ground truth.
Figure 5. PR curves of our approach and its variants on the entire dataset.
Table 1. List of the 11 challenging subsets of the RGBT-Saliency dataset.

| Challenge | Description |
| --- | --- |
| BSO | The ratio of the ground truth salient objects over the image is more than 0.26. |
| BW | The image pairs are recorded in bad weather, such as snowy, rainy, hazy, or cloudy weather. |
| CB | The centers of the salient objects are far away from the image center. |
| CIB | The salient objects cross the image boundaries. |
| IC | The image is cluttered. |
| LI | The environmental illumination is low. |
| MSO | The number of salient objects in the image is more than one. |
| OF | The image is out of focus. |
| SA | The salient objects have a similar color or shape to the background. |
| SSO | The ratio of the ground truth salient objects over the image is less than 0.05. |
| TC | The salient objects have a similar temperature to the background. |
Table 2. Average precision (P), recall (R), F-measure (F), and mean absolute error (MAE) of our method against different kinds of methods on the RGBT-Saliency dataset. For P, R, and F, larger values indicate better detection; for MAE, smaller values are better. Red font indicates the best performance and green the second best.

| Algorithm | RGB (P↑, R↑, F↑, MAE↓) | Thermal (P↑, R↑, F↑, MAE↓) | RGB-T (P↑, R↑, F↑, MAE↓) |
| --- | --- | --- | --- |
| BR [40] | 0.724, 0.260, 0.411, 0.269 | 0.648, 0.413, 0.488, 0.323 | 0.804, 0.366, 0.520, 0.297 |
| CA [41] | 0.592, 0.667, 0.568, 0.163 | 0.623, 0.607, 0.573, 0.225 | 0.648, 0.697, 0.618, 0.195 |
| MCI [42] | 0.526, 0.604, 0.485, 0.211 | 0.445, 0.585, 0.435, 0.176 | 0.547, 0.652, 0.515, 0.195 |
| NFI [43] | 0.557, 0.639, 0.532, 0.126 | 0.581, 0.599, 0.541, 0.124 | 0.564, 0.665, 0.544, 0.125 |
| SS-KDE [44] | 0.581, 0.554, 0.532, 0.122 | 0.510, 0.635, 0.497, 0.132 | 0.528, 0.656, 0.515, 0.127 |
| GMR [20] | 0.644, 0.603, 0.587, 0.172 | 0.700, 0.574, 0.603, 0.232 | 0.694, 0.624, 0.615, 0.202 |
| GR [45] | 0.621, 0.582, 0.534, 0.197 | 0.639, 0.544, 0.545, 0.199 | 0.705, 0.593, 0.600, 0.199 |
| MTMR [26] | -, -, -, - | -, -, -, - | 0.716, 0.713, 0.680, 0.107 |
| Ours | 0.697, 0.536, 0.603, 0.107 | 0.715, 0.569, 0.629, 0.112 | 0.804, 0.627, 0.716, 0.095 |
Table 3. Average runtime comparison on the RGBT-Saliency dataset.

| Method | BR [40] | CA [41] | MCI [42] | NFI [43] | SS-KDE [44] | GMR [20] | GR [45] | MTMR [26] | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Runtime (s) | 21.95 | 3.13 | 58.37 | 33.16 | 2.51 | 2.96 | 6.48 | 3.71 | 5.18 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
