Article

High Edge-Quality Light-Field Salient Object Detection Using Convolutional Neural Network

1 College of Mechatronics and Control Engineering, Shenzhen University, Shenzhen 518060, China
2 Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(7), 1054; https://doi.org/10.3390/electronics11071054
Submission received: 23 February 2022 / Revised: 11 March 2022 / Accepted: 14 March 2022 / Published: 28 March 2022

Abstract

The detection results of current light-field salient object detection methods suffer from a loss of edge details, which significantly limits the performance of subsequent computer vision tasks. To solve this problem, we propose a novel convolutional neural network that accurately detects salient objects by mining effective edge information from light-field data. Our method consists of four steps. Firstly, the network extracts multi-level saliency features from light-field data. Secondly, edge features are extracted from low-level saliency features and optimized under ground-truth guidance. Then, to sufficiently leverage the high-level saliency features and edge features, the network hierarchically fuses them in a complementary manner. Finally, spatial correlations between different levels of fused features are considered to detect salient objects. By extracting clear edge information and accurate saliency information and fully fusing them, our method can accurately locate salient objects with exquisite edge details. We conduct extensive evaluations on three widely used benchmark datasets. The experimental results demonstrate the effectiveness of our method and show that it is superior to eight state-of-the-art methods.

Graphical Abstract

1. Introduction

Salient object detection (SOD) aims to detect the most attention-grabbing objects in a scene and segment them completely [1]. It plays a key role in many computer vision tasks, such as image segmentation [2,3], object tracking [4,5,6], image retrieval [7,8], and content-aware image editing [9,10]. With the development of light-field imaging technology, light-field SOD has also received increasing attention [11]. Accurate edge detection is a part of light-field SOD and is crucial for numerous subsequent light-field-based vision applications. For example, in 3D display [12], salient objects with fine edges can guide precise depth correction and enhance the 3D display of the main content; in light-field image compression [13], salient objects with fine edges help to preserve the details of important objects and improve the compression quality of light-field images.
Existing light-field SOD methods can be roughly divided into two categories: traditional methods [14,15,16] and deep learning-based methods [17,18,19,20]. Traditional methods [14,15,16] employ hand-crafted features such as color contrast, texture contrast, and background priors to detect salient objects, but these features lack global contextual information. Consequently, their detection performance is limited in complex scenes such as low light and cluttered backgrounds. Deep learning-based methods [17,18,19,20] leverage the powerful feature learning capability of CNNs to detect salient objects and make up for the shortcomings of traditional methods. Zhang and Ji et al. [17] refined the complementary information in light-field features and reduced background interference with the salient objects. Wang [18] proposed a feature fusion module that adaptively fuses cross-modal visual features to enhance network robustness. Zhang and Li et al. [19] detected complete salient objects by introducing both an attention mechanism and ConvLSTM to fuse light-field features hierarchically. Piao and Rong et al. [20] optimized the variability between light-field features to accurately locate salient objects. These methods improved detection accuracy in complex scenes and showed excellent performance among previous light-field SOD methods [20]. However, they focus more on the accurate saliency information in the deep layers of the network and lack effective utilization of the rich edge information in the shallow layers, which leads to poor quality of the detected edge details.
Therefore, we propose a novel convolutional neural network to exploit the edge information in the shallow layers of the network. Firstly, the network extracts multi-level saliency features from all-in-focus images and focus stacks. Secondly, edge features are extracted from low-level saliency features, and the invalid edge information in them is sieved out. Then, high-level saliency features and edge features are hierarchically integrated to take full advantage of their complementary strengths. Finally, the network learns the spatial correlations between different levels of fused features to detect salient objects. To verify the effectiveness of our proposed method, we compare it with eight state-of-the-art light-field SOD methods [14,15,16,17,19,20,21,22] on three widely used datasets [14,19,23]. The results show that our method achieves the highest detection accuracy.
The main contributions of this paper are summarized as follows:
  • We propose a novel light-field salient object detection network, which hierarchically extracts saliency features from light-field data. The high-level semantic information and low-level edge information in the saliency features are fully exploited to detect accurate salient objects.
  • We propose an edge feature extraction strategy to extract clear edge information from light-field data and optimize it by ground-truth guidance.
  • Extensive experiments conducted on three benchmark datasets show that our method achieves state-of-the-art performance.
The remainder of this paper is organized as follows. Section 2 reviews related work on light-field SOD to summarize existing methods and suggest ideas for improvement. In Section 3, we describe our proposed method in terms of network structure and loss function. To validate the effectiveness of our proposed method, we then present the experimental setup, comparisons, and results in Section 4. Section 5 discusses limitations and possible directions for the development of light-field SOD technology. Finally, concluding remarks are given in Section 6.

2. Related Works

In this section, we briefly summarize related work on light-field SOD. In recent years, research on light-field images, such as light-field subjective evaluation [24], light-field quality modeling [25], and light-field stitching [26], has effectively improved light-field image quality and promoted the development of light-field SOD. Light-field SOD methods can be roughly divided into two categories: traditional methods and deep learning-based methods.
Traditional methods use hand-crafted features such as color contrast, texture contrast, and background priors to detect salient objects. Li and Ye et al. [14] proposed the first light-field SOD method, which computes saliency candidates from a background prior and color contrast, respectively, and then weights the candidates with objectness cues. This motivated Li and Sun et al. [15] to propose a weighted sparse coding framework that constructs saliency and non-saliency dictionaries from light-field data and then detects salient objects by iterative optimization. Zhang and Wang et al. [16] computed depth-induced contrast, color contrast, and a background prior from light-field data to enhance light-field SOD. They later proposed a random-search-based weighting strategy to fuse multiple saliency cues extracted from light-field data [23]. Wang and Li et al. [27] proposed a two-stage Bayesian framework to fuse saliency features extracted from light-field data, exploiting the complementarity between the saliency features to improve detection performance. In addition, Wang and Yan et al. [28] performed a coarse detection using the depth information in light-field data and then calculated the final saliency map based on color contrast and texture contrast. Piao and Li et al. [29] took into account the interactions and complementarities of color, depth, focusness, and location cues to detect salient objects and improved detection accuracy by enforcing spatial consistency. However, these traditional methods rely on hand-crafted features and ignore the semantic information of salient objects, leading to object errors and partial misses in complex scenes such as low light, low contrast, and cluttered backgrounds.
Deep learning-based methods leverage the powerful feature learning capability of CNNs to detect salient objects. Wang and Piao et al. [30] first proposed a deep learning-based light-field SOD network, in which a specially designed recurrent attention mechanism adaptively fuses focal-slice image features according to their importance. However, they ignored the complementarity between light-field features of different modalities. Inspired by this work, Zhang and Ji et al. [17] proposed a light-field refinement module to refine light-field features and then fused them sufficiently by introducing a recurrent attention mechanism. The main advantage of this method is that it strengthens the connection between light-field features of different modalities, but it fails to exploit the features at each level. Therefore, Zhang and Li et al. [19] extracted features from light-field data hierarchically to exploit the expressiveness of features at different levels, and then fused them using a recurrent attention mechanism to detect complete salient objects. However, this also increases the computational complexity. To improve the efficiency of light-field SOD, Piao and Rong et al. [20] designed a teacher network to exploit focus stack images and transfer comprehensive focusness knowledge to a student network. However, the method cannot guarantee both accuracy and efficiency. Piao and Jiang et al. [21] explored light-field data in a region-wise manner under the guidance of saliency, boundary, and location, and improved the detection accuracy of small objects. In addition, Zhang and Liu et al. [22] designed an angle change module to learn multi-view information in light-field data and used dilated convolution to extract multi-scale spatial features, which improves the robustness of light-field SOD. Wang [18] proposed a feature fusion module to leverage multi-modal features, which can effectively reduce the impact of light-field data quality; the limitation of this method is that three types of light-field data are required as input. Jing and Zhang et al. [31] extracted accurate scene depth and occlusion information from epipolar plane images to detect complete salient objects. However, the method lacks exploitation of the visual features in the shallow layers of the network. Liang and Qin et al. [32] presented a dual guidance enhanced network that improves detection accuracy through progressive global and boundary guidance of light-field features. Unfortunately, the two guidance strategies limit each other, and the edge information in light-field features cannot be fully utilized. Benefiting from the powerful feature learning capability of CNNs [33,34], these deep learning-based methods are able to accurately locate salient objects in complex scenes, but their detection results generally suffer from a loss of edge details.

3. Methodology

In this section, we introduce our high edge-quality light-field SOD network, which consists of four parts: saliency feature extraction, edge feature extraction module (EFEM), multi-feature fusion module (MFM), and multi-feature joint detection module (MJDM). The network structure is illustrated in Figure 1. Firstly, the two sets of convolutional blocks in Section 3.1 extract multi-level saliency features from all-in-focus images and focus stacks. Secondly, the EFEM in Section 3.2 is proposed to extract rich edge features from low-level saliency features. Then, the MFM in Section 3.3 hierarchically fuses edge features and high-level saliency features. Finally, the MJDM in Section 3.4 detects salient objects from the fused features. In addition, the saliency ground-truth and the edge ground-truth calculated from the saliency ground-truth are used to guide the generation of accurate saliency information and clear edge information.

3.1. Saliency Feature Extraction

Different levels of convolutional features provide different representations of an image [35]; thus, we adopt two sets of convolutional blocks to extract multi-level saliency features from light-field data. As shown in Figure 1, we drop the last pooling layer and the fully connected layers of the ResNet-50 network [36] and build 2D convolutional blocks and 3D convolutional blocks [37] with the remaining convolutional layers to extract features from all-in-focus images and focus stacks. As in Piao and Jiang et al. [21], each focus stack consists of 12 focal slices, and we retain 3 focal slices in the high-level 3D convolutional blocks after multiple pooling operations. In addition, the resolution of all input images is 372 × 372 pixels. The saliency features extracted by Conv-1 and Conv-2 are kept at 88 × 88 pixels, while the spatial resolution of the saliency features extracted by each of the other convolutional blocks is successively halved. We denote the outputs of the convolutional blocks for the all-in-focus images and focus stacks as $\{f_A^i\}_{i=1}^{5}$ and $\{f_F^i\}_{i=1}^{5}$, respectively, where $i$ denotes the convolutional block level.
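As an illustration of this feature extraction scheme, the following is a minimal PyTorch sketch (not the authors' released code) of the all-in-focus branch. How the ResNet-50 stages are grouped into the five blocks Conv-1 to Conv-5 is our own assumption; the focus-stack branch would mirror it with 3D convolutional blocks [37].

```python
# Minimal sketch: the all-in-focus branch of the saliency feature extractor,
# built from a truncated ResNet-50 (last pooling and fully connected layers dropped).
import torch
import torch.nn as nn
from torchvision.models import resnet50

class AiFBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet50(pretrained=True)
        # Keep only the convolutional stages, grouped into five feature blocks.
        self.blocks = nn.ModuleList([
            nn.Sequential(net.conv1, net.bn1, net.relu),   # Conv-1 -> f_A^1
            nn.Sequential(net.maxpool, net.layer1),        # Conv-2 -> f_A^2
            net.layer2,                                    # Conv-3 -> f_A^3
            net.layer3,                                    # Conv-4 -> f_A^4
            net.layer4,                                    # Conv-5 -> f_A^5
        ])

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats                                       # [f_A^1, ..., f_A^5]

if __name__ == "__main__":
    aif = torch.randn(1, 3, 372, 372)                      # one all-in-focus image
    for i, f in enumerate(AiFBackbone()(aif), start=1):
        print(f"f_A^{i}:", tuple(f.shape))                 # resolution shrinks level by level
```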

3.2. Edge Feature Extraction Module (EFEM)

We propose an edge feature extraction module to make effective use of the edge information in light-field data. On the one hand, the low-level features extracted by Conv-1 and Conv-2 contain rich edge texture information [35]. On the other hand, all-in-focus images contain less data than focus stacks, which facilitates the extraction of high-quality edge features. Hence, EFEM extracts edge features from the all-in-focus image features $f_A^1$ and $f_A^2$ produced by Conv-1 and Conv-2. As shown in Figure 1, $f_A^1$ and $f_A^2$ are first concatenated and then passed three times through a convolutional layer, a batch normalization layer, and a ReLU activation function. This procedure can be defined as:
$f_E = \left( ReLU\left( BN\left( W_E \ast Cat\left( f_A^1, f_A^2 \right) + b_E \right) \right) \right)_{\times 3}$
where $\ast$, $W_E$, and $b_E$ represent the convolution operator and the convolution parameters (weights and bias), respectively; $f_E$ denotes the extracted edge features; $BN(\cdot)$, $ReLU(\cdot)$, and $Cat(\cdot)$ denote batch normalization, the activation function, and the concatenation operation; and $(\cdot)_{\times 3}$ means the operation is repeated three times.
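For concreteness, the sketch below gives one possible PyTorch realization of this block together with the single-channel prediction head described next; the channel widths are our own assumptions, and we rely on Conv-1 and Conv-2 sharing the same 88 × 88 spatial resolution (Section 3.1).

```python
# Minimal sketch of the EFEM as we read it: three stacked Conv-BN-ReLU layers on the
# concatenation of f_A^1 and f_A^2, plus a 1x1 head for the single-channel edge map.
import torch
import torch.nn as nn

class EFEM(nn.Module):
    def __init__(self, in_channels, mid_channels=64):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(3):                                  # the "(.)x3" repetition
            layers += [nn.Conv2d(c, mid_channels, 3, padding=1),
                       nn.BatchNorm2d(mid_channels),
                       nn.ReLU(inplace=True)]
            c = mid_channels
        self.body = nn.Sequential(*layers)
        self.edge_head = nn.Conv2d(mid_channels, 1, 1)      # single-channel edge prediction

    def forward(self, f_a1, f_a2):
        f_e = self.body(torch.cat([f_a1, f_a2], dim=1))     # Cat(f_A^1, f_A^2) -> f_E
        edge_logit = self.edge_head(f_e)                    # supervised against G_E
        return f_e, edge_logit

# Example: f_A^1 with 64 channels and f_A^2 with 256 channels, both 88 x 88.
f_a1, f_a2 = torch.randn(1, 64, 88, 88), torch.randn(1, 256, 88, 88)
f_e, edge_logit = EFEM(in_channels=64 + 256)(f_a1, f_a2)
```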
In addition, to ensure that EFEM learns valid edge information, we optimize the extracted edge features through supervision. First, the edge feature $f_E$ is reduced to a single channel by a convolutional layer, which corresponds to the probability that each pixel belongs to a salient edge. Then, we supervise the single-channel edge feature with an edge ground-truth map to sieve out invalid background edge information. Specifically, we convert the 8-bit saliency ground-truth into a binary image by adaptive thresholding, and then convolve it with the Sobel operator to obtain the edge ground-truth. Given the saliency ground-truth $G_S$, this operation can be defined as follows:
$G_E = \sqrt{ \left( W_{sobel_x} \ast G_S \right)^2 + \left( W_{sobel_y} \ast G_S \right)^2 }$
where $\ast$ represents the convolution operator, and $W_{sobel_x}$ and $W_{sobel_y}$ represent the Sobel operators in the x and y directions, which are defined as:
$W_{sobel_x} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \qquad W_{sobel_y} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$
We employ a cross-entropy loss to supervise the extracted edge features. Given an edge ground-truth $G_E$ of size $M \times N$, the loss function can be defined as:
$L_E = - \sum_{m=1}^{M} \sum_{n=1}^{N} \left( G_E^{mn} \log f_E^{mn} + \left( 1 - G_E^{mn} \right) \log \left( 1 - f_E^{mn} \right) \right)$
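To make this supervision concrete, the following sketch shows how the edge ground-truth and the edge loss above could be computed in PyTorch. Treating the Sobel response as a binary map and using a fixed 0.5 threshold in place of the adaptive threshold are our simplifying assumptions.

```python
# Minimal sketch of the edge supervision: build G_E from the saliency ground-truth G_S
# with Sobel kernels, then apply binary cross-entropy to the predicted edge map.
import torch
import torch.nn.functional as F

def edge_ground_truth(g_s, threshold=0.5):
    """g_s: (B, 1, H, W) saliency ground-truth with values in [0, 1]."""
    g_s = (g_s > threshold).float()                         # binarize the saliency GT
    sobel_x = torch.tensor([[-1., 0., 1.],
                            [-2., 0., 2.],
                            [-1., 0., 1.]], device=g_s.device).view(1, 1, 3, 3)
    sobel_y = torch.tensor([[-1., -2., -1.],
                            [ 0.,  0.,  0.],
                            [ 1.,  2.,  1.]], device=g_s.device).view(1, 1, 3, 3)
    gx = F.conv2d(g_s, sobel_x, padding=1)
    gy = F.conv2d(g_s, sobel_y, padding=1)
    g_e = torch.sqrt(gx ** 2 + gy ** 2)                     # gradient magnitude G_E
    return (g_e > 0).float()

def edge_loss(edge_logit, g_e):
    # Cross-entropy between the single-channel edge prediction and G_E (the loss L_E).
    return F.binary_cross_entropy_with_logits(edge_logit, g_e)
```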

3.3. Multi-Feature Fusion Module (MFM)

We adopt a multi-feature fusion module (MFM) to fuse high-level saliency features and edge features hierarchically. The high-level saliency features and the edge features contain accurate location information and rich edge detail information, respectively. To take full advantage of both, we introduce the MFM shown in Figure 2 to fuse different levels of all-in-focus image features, focus stack features, and edge features.
The design of MFM is inspired by the memory-oriented spatial fusion module (Mo-SFM) [19], whose parameters are adjusted according to the number of input features. MFM consists of three parts: attention mechanism, ConvLSTM [38], and global perception module (GPM). The attention mechanism assigns corresponding weights to each input feature based on contribution, in order to enhance useful features and suppress unnecessary features. This procedure can be defined as:
$f_{att}^{i}(t) = F^{i}(t) \otimes \delta \left( W_{att} \ast AvgPooling \left( F^{i} \right) + b_{att} \right)$
where $F^{i}$ denotes the cascaded input features, consisting of 1 edge feature, 1 all-in-focus image feature, and 3 focal-slice features; $\ast$, $W_{att}$, and $b_{att}$ represent the convolution operator and the convolution parameters; $AvgPooling(\cdot)$ denotes the global average pooling operation and $\otimes$ denotes feature-wise multiplication; $\delta(\cdot)$ denotes the softmax function; and $F^{i}(t)$ and $f_{att}^{i}(t)$ denote an individual feature in $F^{i}$ and its corresponding output. Then, ConvLSTM fuses the saliency features and edge features in a complementary manner by learning the connections between them. Compared to Mo-SFM, the ConvLSTM in MFM contains 5 time steps corresponding to the 5 feature inputs. Furthermore, by enlarging the receptive field, the global perception module (GPM) captures global contextual information in the output feature $f_s^i$ of the ConvLSTM. This operation is defined as follows:
$f_r^i = Conv_{1 \times 1} \left( Cat \left( f_s^i, \left\{ Conv_{rate=d} \left( f_s^i \right) \right\}_{d} \right) \right), \quad d = 1, 3, 5, 7$
where $Conv_{rate=d}(\cdot)$ denotes convolution with dilation rate $d$, $\{\cdot\}_d$ means the operation is performed for each value of $d$ and the outputs are gathered, and $Cat(\cdot)$ denotes the concatenation operation.
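As a concrete reading of the attention weighting and the GPM, the sketch below gives one possible PyTorch implementation. The way the five inputs are stacked for the attention convolution and all channel sizes are our assumptions, and the ConvLSTM fusion step between the two components is omitted.

```python
# Minimal sketch of two MFM components: attention weighting of the five cascaded
# inputs, and the GPM with dilation rates 1, 3, 5, 7 followed by a 1x1 fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAttention(nn.Module):
    """Weights the 5 inputs (1 edge, 1 all-in-focus, 3 focal-slice features)."""
    def __init__(self, channels, num_inputs=5):
        super().__init__()
        self.conv = nn.Conv2d(num_inputs * channels, num_inputs, 1)

    def forward(self, feats):                               # list of 5 (B, C, H, W) tensors
        pooled = F.adaptive_avg_pool2d(torch.cat(feats, dim=1), 1)
        weights = torch.softmax(self.conv(pooled), dim=1)   # delta(W_att * AvgPooling(F^i) + b_att)
        return [f * weights[:, t:t + 1] for t, f in enumerate(feats)]

class GPM(nn.Module):
    """Global perception module: parallel dilated convolutions, then Conv_1x1."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in (1, 3, 5, 7))
        self.fuse = nn.Conv2d(5 * channels, channels, 1)    # fuse f_s^i with the 4 branches

    def forward(self, f_s):
        return self.fuse(torch.cat([f_s] + [b(f_s) for b in self.branches], dim=1))
```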

3.4. Multi-Feature Joint Detection Module (MJDM)

We utilize a multi-feature joint detection module (MJDM) to detect salient objects by learning the spatial correlations between multi-level fused features. MJDM is derived from the memory-oriented feature integration module [19], and its architecture is shown in Figure 3.
Firstly, MJDM emphasizes the important channels in the current fused feature $f_r^i$ by learning channel attention from the higher-level memory feature $f_c^{i+1}$ and the current fused feature $f_r^i$. The channel attention can be formulated as follows:
$f_{ch}^{i} = f_r^{i} \otimes \delta \left( AvgPooling \left( \left( W_{t1} \ast f_c^{i+1} + b_{t1} \right) \oplus \left( W_r \ast f_r^{i} + b_r \right) \right) \right)$
where $\oplus$ and $\otimes$ denote pixel-wise addition and pixel-wise multiplication, respectively. Then, ConvLSTM generates new memory features by learning the spatial correlation between the high-level memory feature $f_c^{i+1}$ and the optimized current fused feature $f_{ch}^{i}$. The calculation process is similar to Equation (4) with the inputs replaced. After four iterations, MJDM outputs the memory feature $f_c^3$, which is followed by a transition convolutional layer and an upsampling operation to obtain the final saliency map. To ensure that ConvLSTM can explicitly learn the most important information, we perform intermediate supervision on each memory feature. Specifically, we adopt a cross-entropy loss $L_S$ to supervise the generation of the memory features, which is defined as:
$L_S = - \sum_{x=1}^{X} \sum_{y=1}^{Y} \left( G_S^{xy} \log \left( f_c^i \right)^{xy} + \left( 1 - G_S^{xy} \right) \log \left( 1 - \left( f_c^i \right)^{xy} \right) \right)$
where $X$ and $Y$ represent the size of the saliency ground-truth $G_S$. Therefore, the total loss function can be expressed as:
$L = L_E + L_S$
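To make the channel-attention step above concrete, here is a minimal PyTorch sketch of our reading of it; the 3 × 3 transition convolutions standing in for $(W_{t1}, b_{t1})$ and $(W_r, b_r)$ and the shared channel count are assumptions rather than the authors' implementation.

```python
# Minimal sketch of the MJDM channel-attention step: pool the mixture of the
# higher-level memory feature and the current fused feature, softmax over channels,
# and reweight the current fused feature.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.w_t1 = nn.Conv2d(channels, channels, 3, padding=1)   # applied to f_c^{i+1}
        self.w_r = nn.Conv2d(channels, channels, 3, padding=1)    # applied to f_r^{i}
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, f_c_next, f_r):
        mixed = self.w_t1(f_c_next) + self.w_r(f_r)               # pixel-wise addition
        weights = torch.softmax(self.pool(mixed), dim=1)          # softmax over channels
        return f_r * weights                                      # f_ch^i = f_r^i (x) weights
```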
The whole procedure is summarized in Algorithm 1.
Algorithm 1. High Edge-Quality Light-Field Salient Object Detection
Input: All-in-focus images, focus stacks, saliency ground-truth $G_S$, iteration number T, Sobel operators $W_{sobel_x}$ and $W_{sobel_y}$
Output: Parameters of the network model $\theta$
1. Initialize the network parameters with pre-trained ResNet-50, t = 1
2. for t = 1: T do
3.      Generate multi-level saliency features $\{f_A^i\}_{i=1}^{5}$ and $\{f_F^i\}_{i=1}^{5}$
4.      Compute $f_E$ from $(f_A^1, f_A^2)$, as in Equation (1)
5.      Compute $G_E$ from $G_S$, as in Equation (2)
6.      for i = 3: 5 do
7.            Generate $F^i$ by concatenating $f_E$, $f_A^i$, and $f_F^i$
8.      end for
9.      Compute $f_c^i$ from $F^i$
10.    Calculate $L_E$ and $L_S$ from $(f_E, G_E)$ and $(f_c^i, G_S)$
11.     $L = L_E + L_S$
12.    Update $\theta$ using the Adam algorithm
13. end for
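Read as code, Algorithm 1 corresponds roughly to the PyTorch training loop sketched below. The `model` interface, the data-loader format, and the helpers `edge_ground_truth` / `edge_loss` (from the sketch in Section 3.2) are hypothetical names rather than the released code; the optimizer and learning-rate schedule follow the settings reported in Section 4.1.3.

```python
# Minimal sketch of the training procedure in Algorithm 1.
import torch
import torch.nn.functional as F

def train(model, loader, epochs=160, lr=5e-6, device="cuda"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)
    for _ in range(epochs):
        for aif, focus_stack, g_s in loader:                # all-in-focus, focus stack, saliency GT
            aif, focus_stack, g_s = aif.to(device), focus_stack.to(device), g_s.to(device)
            edge_logit, saliency_logits = model(aif, focus_stack)
            g_e = edge_ground_truth(g_s)                    # Sobel-based edge GT (Section 3.2 sketch)
            loss_e = edge_loss(edge_logit, g_e)             # L_E
            loss_s = sum(F.binary_cross_entropy_with_logits(s, g_s)
                         for s in saliency_logits)          # intermediate supervision, L_S
            loss = loss_e + loss_s                          # L = L_E + L_S
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```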

4. Experiments

In this section, we conduct experiments to evaluate the performance of our proposed method. First, the experimental setup is outlined in terms of datasets, evaluation metrics, and parameter settings. Next, our proposed method is compared with 8 state-of-the-art light-field SOD methods, with respect to both quantitative and qualitative evaluation. Finally, different edge feature extraction approaches are compared to verify the effectiveness of the EFEM.

4.1. Experimental Setup

4.1.1. Datasets

To evaluate our method, we conduct experiments on three public benchmark datasets: LFSD [14], HFUT [23], and DUT-LF [19]. LFSD is the first light-field saliency dataset, consisting of 100 light-fields captured by a Lytro camera. The light-fields in this dataset typically contain only a single large object against a simple background. HFUT consists of 255 light-fields captured by a Lytro camera, most of which contain multiple objects or complex backgrounds. DUT-LF consists of 1462 light-fields acquired by a Lytro Illum camera. Compared to LFSD and HFUT, DUT-LF includes more complex scenes, such as low light, low contrast, and cluttered backgrounds, with higher light-field quality. Similar to recent work [20,21], we choose 1000 samples from DUT-LF and 100 samples from HFUT to train our proposed model. The remaining samples and LFSD are used for testing.

4.1.2. Evaluation Metrics

We adopt four widely-used metrics to comprehensively evaluate the performance of our proposed method, including F-measure (Fβ) [39], Mean Absolute Error (MAE) [40], S-measure (Sα) [41], and E-measure (Eϕ) [42]. They are widely accepted in light-field SOD methods [19,20,21].
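For reference, the sketch below shows how two of these metrics, MAE and F-measure, are commonly computed. The adaptive threshold (twice the mean saliency value) and β² = 0.3 follow the usual conventions associated with the cited works, not values taken from this paper's code.

```python
# Minimal sketch of MAE and F-measure for a single saliency map.
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground-truth, both in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3):
    """F-measure with an adaptive threshold on the predicted saliency map."""
    binary = pred >= min(2.0 * pred.mean(), 1.0)            # adaptive binarization
    gt = gt > 0.5
    tp = np.logical_and(binary, gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```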

4.1.3. Parameter Settings

Our network is implemented on the PyTorch framework and optimized with the Adam algorithm. The two sets of convolutional blocks used for saliency feature extraction are initialized with the corresponding pre-trained ResNet-50 network. The parameters of the newly added convolutional layers are initialized with Gaussian kernels, and the biases are initialized to 0. In the training phase, the batch size is set to 6, and the learning rate is initialized to $5 \times 10^{-6}$ and then reduced by a factor of 10 after 60 epochs. The whole training procedure takes 160 epochs. We do not use a validation dataset during training, and it takes about 7 h to train our network on an RTX 3090 GPU.

4.2. Comparison with the State-Of-The-Art Methods

We compare our proposed method with 8 previous state-of-the-art methods, including 5 deep learning-based methods (PANet [21], ERNet [20], MoLF [19], LFNet [17], and MAC [22]) and 3 traditional methods (WSC [15], DILF [16], and LFS [14]). To ensure fairness, all saliency results of the competing methods are produced by running the released source codes or are the pre-computed results provided by the authors.

4.2.1. Quantitative Evaluation

As shown in Table 1, we evaluate and compare our proposed method with other light-field SOD methods in terms of F-measure, MAE, S-measure, and E-measure. Our proposed method achieves the highest scores on the DUT-LF, HFUT, and LFSD datasets across almost all four evaluation metrics. In particular, it far outperforms the state-of-the-art methods on DUT-LF, the largest and most challenging dataset (2.1% and 12.8% improvements in F-measure and MAE, respectively). On a few measures, our method performs worse than other algorithms. This is because some light-field samples in the HFUT and LFSD datasets provide only a few focal slices, and our method generalizes poorly to these samples. In addition to the numerical comparison, we also plot the precision-recall curves of all methods over the three datasets, as shown in Figure 4. Our proposed method, denoted by the solid red line, achieves the best results on the three benchmark datasets.

4.2.2. Qualitative Evaluation

Figure 5 shows some representative visualization results comparing our proposed method with other state-of-the-art light-field SOD methods. Our proposed method locates and segments salient objects more accurately in various complex scenes, for example, low contrast (rows 1, 2), cluttered backgrounds (rows 3, 4), multiple objects (rows 5, 6), small objects (row 7), and transparent objects (row 8). Notably, thanks to the effective utilization of edge information, the salient objects detected by our proposed method retain fine edge details.

4.3. Comparison of Edge Feature Extraction Approaches

We compare different edge feature extraction methods in terms of both feature extractors and extraction sources to verify the effectiveness of EFEM. First, we remove EFEM from the network shown in Figure 1 to build a baseline network. Then, we conduct experiments with different edge feature extraction approaches.

4.3.1. Comparison of Edge Feature Extractors

We compare the effect of a single convolutional layer and our proposed EFEM on the network. The quantitative results are shown in rows (b) and (d) of Table 2: EFEM yields better performance on all datasets (12.8%, 7.1%, and 6.0% improvements in MAE). The qualitative results are shown in Figure 6b,d, which demonstrate that EFEM is more effective at improving the edge details of salient objects.

4.3.2. Comparison of Edge Feature Extraction Sources

We compare the impact of extracting edge features from all-in-focus images (AiF) and from focus stacks (FS). The quantitative results in rows (c) and (d) of Table 2 are consistent with our assertion that all-in-focus images are more beneficial for high-quality edge feature extraction. The improvement between them can also be seen in Figure 6c,d. These results suggest that more effective edge features can be extracted from all-in-focus images than from focus stacks.
It is worth noting that different edge feature extraction methods all improve the network performance compared to no edge feature extraction. This indicates that edge information is extremely important for light-field SOD.

5. Discussion

Although our proposed method significantly improves detection accuracy, it still has shortcomings. In Table 3, we compare the model complexity with several representative methods. Our proposed method offers only a marginal improvement in inference speed (0.3 FPS over ERNet), and its model size is much larger than that of the other methods, which limits further applications. We argue that adopting lightweight backbones such as MobileNet [43] and ShuffleNet [44], which reduce computational complexity by optimizing the network structure, can alleviate this problem.
Existing light-field SOD methods are mainly used in 3D imaging, 3D display and light-field image compression. Our proposed method can accurately locate the key contents in scenes for subsequent visual processing. However, there are some other applications that deserve to be explored, such as transparent object segmentation and defect detection. The rich scene information in light-field data can provide useful clues.

6. Conclusions

In this paper, we propose a novel light-field SOD method to address the loss of edge details in existing methods. It hierarchically leverages light-field features to reduce information loss. Specifically, we employ two sets of convolutional blocks based on the ResNet-50 network to extract multi-level saliency features. For the low-level saliency features, we propose an edge feature extraction module to extract high-quality edge features. For the high-level saliency features, we employ the MFM to fuse them with the edge features, taking full advantage of both saliency information and edge information. Finally, we use the MJDM to detect salient objects from the fused features. Different from other methods, we adopt a dual supervision strategy to guide the generation of edge information and saliency information. Experiments demonstrate that our method can detect salient objects with fine edge details even in complex scenes such as cluttered backgrounds, low light, and low contrast. Comprehensive experiments on three benchmark datasets show that our method outperforms eight state-of-the-art methods both quantitatively and qualitatively.

Author Contributions

Investigation, X.W.; Methodology, X.W.; Software, S.C.; Validation, J.L.; Writing—original draft, S.C.; Writing—review and editing, J.L. and G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of Guangdong Province (2020A1515011559, 2021A1515012287), and in part by the Shenzhen Science and Technology Research Fund (JCYJ20180306174120445, 20200810150441003).

Data Availability Statement

The data presented in this study are openly available in the DUT-LF dataset [19] at https://github.com/OIPLab-DUT/ICCV2019_Deeplightfield_Saliency (accessed on 27 January 2022), the HFUT dataset [23] at https://github.com/pencilzhang/MAC-light-field-saliency-net (accessed on 27 January 2022), and the LFSD dataset [14] at https://sites.duke.edu/nianyi/publication/saliency-detection-on-light-field/ (accessed on 27 January 2022).

Acknowledgments

The authors would like to thank the anonymous reviewers.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Borji, A.; Cheng, M.-M.; Jiang, H.; Li, J. Salient Object Detection: A Benchmark. IEEE Trans. Image Process. 2015, 24, 5706–5722. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Shoitan, R.; Moussa, M.M. Unsupervised Cosegmentation Model Based on Saliency Detection and Optimized Hue Saturation Value Features of Superpixels. J. Comput. Sci. 2021, 17, 670–682. [Google Scholar] [CrossRef]
  3. Lee, S.; Lee, M.; Lee, J.; Shim, H. Railroad Is Not a Train: Saliency As Pseudo-Pixel Supervision for Weakly Supervised Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 5495–5505. [Google Scholar]
  4. Lee, H.; Kim, D. Salient Region-Based Online Object Tracking. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1170–1177. [Google Scholar]
  5. Zhou, Z.; Pei, W.; Li, X.; Wang, H.; Zheng, F.; He, Z. Saliency-Associated Object Tracking. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9866–9875. [Google Scholar]
  6. Yang, X.; Li, S.; Ma, J.; Yang, J.; Yan, J. Co-Saliency-Regularized Correlation Filter for Object Tracking. Signal Process. Image Commun. 2022, 103, 116655. [Google Scholar] [CrossRef]
  7. Zhao, H.; Wu, J.; Zhang, D.; Liu, P. Toward Improving Image Retrieval via Global Saliency Weighted Feature. ISPRS Int. J. Geo-Inf. 2021, 10, 249. [Google Scholar] [CrossRef]
  8. Al-Azawi, M.A.N. Saliency-Based Image Retrieval as a Refinement to Content-Based Image Retrieval. ELCVIA Electron. Lett. Comput. Vis. Image Anal. 2021, 20, 1–15. [Google Scholar] [CrossRef]
  9. Achanta, R.; Susstrunk, S. Saliency Detection for Content-Aware Image Resizing. In Proceedings of the 2009 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; pp. 1005–1008. [Google Scholar]
  10. Ahmadi, M.; Karimi, N.; Samavi, S. Context-Aware Saliency Detection for Image Retargeting Using Convolutional Neural Networks. Multimed Tools Appl. 2021, 80, 11917–11941. [Google Scholar] [CrossRef]
  11. Fu, K.; Jiang, Y.; Ji, G.-P.; Zhou, T.; Zhao, Q.; Fan, D.-P. Light Field Salient Object Detection: A Review and Benchmark. arXiv 2021, arXiv:2010.04968. [Google Scholar]
  12. Wang, S.; Liao, W.; Surman, P.; Tu, Z.; Zheng, Y.; Yuan, J. Salience Guided Depth Calibration for Perceptually Optimized Compressive Light Field 3D Display. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2031–2040. [Google Scholar]
  13. Wu, K.; Liao, Z.; Liu, Q.; Yin, Y.; Yang, Y. A Global Co-Saliency Guided Bit Allocation for Light Field Image Compression. In Proceedings of the 2019 Data Compression Conference (DCC), Snowbird, UT, USA, 26–29 March 2019; p. 608. [Google Scholar]
  14. Li, N.; Ye, J.; Ji, Y.; Ling, H.; Yu, J. Saliency Detection on Light Field. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1605–1616. [Google Scholar] [CrossRef]
  15. Li, N.; Sun, B.; Yu, J. A Weighted Sparse Coding Framework for Saliency Detection. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5216–5223. [Google Scholar]
  16. Zhang, J.; Wang, M.; Gao, J.; Wang, Y.; Zhang, X.; Wu, X. Saliency Detection with a Deeper Investigation of Light Field. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; pp. 2212–2218. [Google Scholar]
  17. Zhang, M.; Ji, W.; Piao, Y.; Li, J.; Zhang, Y.; Xu, S.; Lu, H. LFNet: Light Field Fusion Network for Salient Object Detection. IEEE Trans. Image Process. 2020, 29, 6276–6287. [Google Scholar] [CrossRef]
  18. Wang, A. Three-Stream Cross-Modal Feature Aggregation Network for Light Field Salient Object Detection. IEEE Signal Process. Lett. 2021, 28, 46–50. [Google Scholar] [CrossRef]
  19. Zhang, M.; Li, J.; Ji, W.; Piao, Y.; Lu, H. Memory-Oriented Decoder for Light Field Salient Object Detection. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 898–908. [Google Scholar]
  20. Piao, Y.; Rong, Z.; Zhang, M.; Lu, H. Exploit and Replace: An Asymmetrical Two-Stream Architecture for Versatile Light Field Saliency Detection. AAAI 2020, 34, 11865–11873. [Google Scholar] [CrossRef]
  21. Piao, Y.; Jiang, Y.; Zhang, M.; Wang, J.; Lu, H. PANet: Patch-Aware Network for Light Field Salient Object Detection. IEEE Trans. Cybern. 2021, 1–13. [Google Scholar] [CrossRef] [PubMed]
  22. Zhang, J.; Liu, Y.; Zhang, S.; Poppe, R.; Wang, M. Light Field Saliency Detection With Deep Convolutional Networks. IEEE Trans. Image Process. 2020, 29, 4421–4434. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  23. Zhang, J.; Wang, M.; Lin, L.; Yang, X.; Gao, J.; Rui, Y. Saliency Detection on Light Field: A Multi-Cue Approach. ACM Trans. Multimed. Comput. Commun. Appl. 2017, 13, 1–22. [Google Scholar] [CrossRef]
  24. Shi, L.; Zhao, S.; Zhou, W.; Chen, Z. Perceptual Evaluation of Light Field Image. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 41–45. [Google Scholar]
  25. Zhou, W.; Shi, L.; Chen, Z.; Zhang, J. Tensor Oriented No-Reference Light Field Image Quality Assessment. IEEE Trans. Image Process. 2020, 29, 4070–4084. [Google Scholar] [CrossRef]
  26. Zhang, W.; Zhao, S.; Zhou, W.; Chen, Z. None Ghosting Artifacts Stitching Based on Depth Map for Light Field Image. In Proceedings of the Advances in Multimedia Information Processing–PCM 2018, Hefei, China, 21–22 September 2018; pp. 567–578. [Google Scholar]
  27. Wang, A.; Wang, M.; Li, X.; Mi, Z.; Zhou, H. A Two-Stage Bayesian Integration Framework for Salient Object Detection on Light Field. Neural Processing Lett. 2017, 46, 1083–1094. [Google Scholar] [CrossRef]
  28. Wang, H.; Yan, B.; Wang, X.; Zhang, Y.; Yang, Y. Accurate Saliency Detection Based on Depth Feature of 3D Images. Multimed Tools Appl. 2018, 77, 14655–14672. [Google Scholar] [CrossRef]
  29. Piao, Y.; Li, X.; Zhang, M.; Yu, J.; Lu, H. Saliency Detection via Depth-Induced Cellular Automata on Light Field. IEEE Trans. Image Process. 2020, 29, 1879–1889. [Google Scholar] [CrossRef]
  30. Wang, T.; Piao, Y.; Lu, H.; Li, X.; Zhang, L. Deep Learning for Light Field Saliency Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 8837–8847. [Google Scholar]
  31. Jing, D.; Zhang, S.; Cong, R.; Lin, Y. Occlusion-Aware Bi-Directional Guided Network for Light Field Salient Object Detection. In Proceedings of the 29th ACM International Conference on Multimedia, New York, NY, USA, 20–24 October 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1692–1701. [Google Scholar]
  32. Liang, Y.; Qin, G.; Sun, M.; Qin, J.; Yan, J.; Zhang, Z. Dual Guidance Enhanced Network for Light Field Salient Object Detection. Image Vis. Comput. 2022, 118, 104352. [Google Scholar] [CrossRef]
  33. Zhao, Y.; Cheng, J.; Zhou, W.; Zhang, C.; Pan, X. Infrared Pedestrian Detection with Converted Temperature Map. In Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 18–21 November 2019; pp. 2025–2031. [Google Scholar]
  34. Shi, K.; Bao, H.; Ma, N. Forward Vehicle Detection Based on Incremental Learning and Fast R-CNN. In Proceedings of the 2017 13th International Conference on Computational Intelligence and Security (CIS), Hong Kong, China, 15–18 December 2017; pp. 73–76. [Google Scholar]
  35. Li, X.; Song, D.; Dong, Y. Hierarchical Feature Fusion Network for Salient Object Detection. IEEE Trans. Image Process. 2020, 29, 9165–9175. [Google Scholar] [CrossRef]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  37. Zhang, Y.; Chen, G.; Chen, Q.; Sun, Y.; Xia, Y.; Deforges, O.; Hamidouche, W.; Zhang, L. Learning Synergistic Attention for Light Field Salient Object Detection. arXiv 2021, arXiv:2104.13916. [Google Scholar]
  38. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; Woo, W. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Cambridge, MA, USA, 7–12 December 2015; pp. 802–810. [Google Scholar]
  39. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-Tuned Salient Region Detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1597–1604. [Google Scholar]
  40. Perazzi, F.; Krahenbuhl, P.; Pritch, Y.; Hornung, A. Saliency Filters: Contrast Based Filtering for Salient Region Detection. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 733–740. [Google Scholar]
  41. Fan, D.-P.; Cheng, M.-M.; Liu, Y.; Li, T.; Borji, A. Structure-Measure: A New Way to Evaluate Foreground Maps. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4548–4557. [Google Scholar]
  42. Fan, D.-P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.-M.; Borji, A. Enhanced-Alignment Measure for Binary Foreground Map Evaluation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 698–704. [Google Scholar]
  43. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  44. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
Figure 1. The overall structure of our proposed network.
Figure 2. The architecture of MFM.
Figure 3. The architecture of MJDM.
Figure 4. Precision-recall curves of our proposed method and other state-of-the-art light-field methods on three popular datasets.
Figure 5. Qualitative comparison with the state-of-the-art methods.
Figure 6. Visual comparison of different edge feature extraction approaches. (a) denotes the baseline network with no edge features extracted. (b) represents using a convolutional layer to extract edge features from AiF images. (c) means using EFEM to extract edge features from FS. (d) denotes using EFEM to extract edge features from AiF images.
Table 1. Quantitative comparisons on three light-field datasets. The best results are marked in bold. ↑ and ↓ denote that larger and smaller is better, respectively. Each dataset column lists Fβ ↑ / MAE ↓ / Sα ↑ / Eϕ ↑.

| Methods | DUT-LF [19] | HFUT [23] | LFSD [14] |
|---|---|---|---|
| Ours | **0.911** / **0.034** / **0.917** / **0.947** | **0.733** / 0.081 / **0.805** / **0.833** | 0.825 / **0.078** / **0.847** / 0.876 |
| PANet [21] | 0.892 / 0.039 / 0.908 / 0.932 | 0.727 / **0.074** / 0.795 / 0.831 | 0.820 / 0.081 / 0.835 / 0.871 |
| ERNet [20] | 0.891 / 0.040 / 0.899 / 0.943 | 0.709 / 0.082 / 0.778 / 0.832 | **0.829** / 0.083 / 0.830 / **0.879** |
| MoLF [19] | 0.855 / 0.052 / 0.887 / 0.921 | 0.639 / 0.095 / 0.742 / 0.790 | 0.800 / 0.092 / 0.825 / 0.864 |
| LFNet [17] | 0.843 / 0.054 / 0.878 / 0.912 | 0.628 / 0.093 / 0.736 / 0.777 | 0.794 / 0.092 / 0.821 / 0.867 |
| MAC [22] | 0.746 / 0.103 / 0.804 / 0.806 | 0.620 / 0.108 / 0.732 / 0.733 | 0.753 / 0.118 / 0.789 / 0.790 |
| WSC [15] | 0.505 / 0.181 / 0.589 / 0.705 | 0.487 / 0.155 / 0.609 / 0.677 | 0.716 / 0.152 / 0.699 / 0.748 |
| DILF [16] | 0.450 / 0.181 / 0.624 / 0.606 | 0.513 / 0.152 / 0.673 / 0.656 | 0.685 / 0.153 / 0.787 / 0.738 |
| LFS [14] | 0.333 / 0.261 / 0.556 / 0.515 | 0.306 / 0.229 / 0.560 / 0.515 | 0.497 / 0.209 / 0.671 / 0.560 |
Table 2. Quantitative results of different edge feature extraction methods. (a) denotes the baseline network with no edge features extracted. (b) represents using a convolutional layer to extract edge features from all-in-focus (AiF) images. (c) means using EFEM to extract edge features from focus stacks (FS). (d) denotes using EFEM to extract edge features from AiF images. Each dataset column lists Fβ ↑ / MAE ↓ / Sα ↑ / Eϕ ↑.

| Indexes | Methods | DUT-LF [19] | HFUT [23] | LFSD [14] |
|---|---|---|---|---|
| (a) | Baseline | 0.896 / 0.042 / 0.902 / 0.932 | 0.716 / 0.091 / 0.788 / 0.819 | 0.807 / 0.091 / 0.826 / 0.853 |
| (b) | (a) + Conv (AiF) | 0.902 / 0.039 / 0.914 / 0.937 | 0.721 / 0.087 / 0.796 / 0.826 | 0.811 / 0.083 / 0.835 / 0.867 |
| (c) | (a) + EFEM (FS) | 0.905 / 0.037 / 0.916 / 0.941 | 0.725 / 0.084 / 0.799 / 0.826 | 0.823 / 0.081 / 0.843 / 0.872 |
| (d) | (a) + EFEM (AiF) | 0.911 / 0.034 / 0.917 / 0.947 | 0.733 / 0.081 / 0.805 / 0.833 | 0.825 / 0.078 / 0.847 / 0.876 |
Table 3. Comparison of model complexity.

| Methods | Size (M) ↓ | FPS (frames/s) ↑ | Sα ↑ (DUT-LF [19]) | Sα ↑ (HFUT [23]) | Sα ↑ (LFSD [14]) |
|---|---|---|---|---|---|
| Ours | 298.7 | 26.4 | 0.917 | 0.805 | 0.847 |
| PANet [21] | 60.4 | 11.4 | 0.908 | 0.795 | 0.835 |
| ERNet [20] | 88.2 | 26.1 | 0.899 | 0.778 | 0.830 |
| MoLF [19] | 177.8 | 23.8 | 0.887 | 0.742 | 0.825 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
