Article

Multiplexing Multi-Scale Features Network for Salient Target Detection

1 School of Electronic Information Engineering, Liaoning Technical University, Huludao 125105, China
2 School of Electronic and Electrical Engineering, Bohai Shipbuilding Vocational College, Huludao 125105, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7940; https://doi.org/10.3390/app14177940
Submission received: 11 July 2024 / Revised: 11 August 2024 / Accepted: 15 August 2024 / Published: 5 September 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

This paper proposes a multiplexing multi-scale features network (MMF-Network) for salient target detection to tackle the issue of incomplete detection structures when identifying salient targets across different scales. The network, based on an encoder–decoder architecture, integrates a multi-scale aggregation module and a multi-scale visual interaction module. Initially, a multi-scale aggregation module is constructed, which, despite potentially introducing a small amount of noise, significantly enhances the high-level semantic and geometric information of features. Subsequently, SimAM is employed to emphasize feature information, thereby highlighting the significant target. A multi-scale visual interaction module is designed to enable compatibility between low-resolution and high-resolution feature maps, with dilated convolutions utilized to expand the receptive field of high-resolution feature maps. Finally, the proposed MMF-Network is tested on three datasets, DUTS-Te, HKU-IS, and PASCAL-S, achieving scores of 0.887, 0.811, and 0.031 in terms of its F-value, SSIM, and MAE, respectively. The experimental results demonstrate that the MMF-Network exhibits a superior performance in salient target detection.

1. Introduction

Salient object detection (SOD) refers to the detection and extraction of significant objects with the most focused human visual attention from complex scenes [1]. With the advancement of science and technology, salient object detection has become a prevalent method of image pre-processing in a variety of computer vision tasks, such as image classification [2], semantic segmentation [3], traffic control [4], object detection [5,6], pedestrian detection [7], etc., which helps to find effective objects or regions in a scene.
Significance detection methods can be classified into traditional methods and deep learning-based methods. Traditional methods define significance detection as a binary segmentation problem, as shown in Figure 1.
They utilize position, orientation, spatial, and depth priors to construct an initial computational framework model, which is then used to complete the task of salient target detection. These methods often require manual labeling as a means of guidance, and their detection results are often inaccurate [9,10]. To address this limitation, researchers have introduced neural networks into salient target detection algorithms and integrated multi-scale features into the network to enhance the capture of salient information. Deng et al. [11] proposed a multi-scale feature pyramid network that controlled the sampling step size of a basic network to enhance the fusion strength of multi-scale features. Zhou et al. [12] introduced dual feature aggregation modules into their multi-scale-feature deep-reuse salient target detection network to reuse CNN features through two pathways and perform feature enhancement using residual connections. Luo et al. [13] introduced a new fusion of multi-scale semantic information to enhance the expression of shallow features and strengthened the shallow features of targeted areas through an attention mechanism, so as to obtain better detection results. Zhao et al. [14] proposed a method for processing input images, which involves subjecting the images to superpixel segmentation, extracting and optimizing pre-significant regions using data processing, and then performing a principal component analysis on high-dimensional depth features using feature extraction; finally, the significance value is calculated. However, most of these methods merely involve the simple reuse of multi-scale features, representing only a preliminary fusion of different features. They do not deeply explore the intricate relationships between multi-scale features. Establishing a new network structure that, building on this initial feature fusion, further investigates the deeper connections between multi-scale features to enhance the performance of salient object detection is a topic worthy of thorough research. This forms the first motivation of this study.
The multiplexing of multi-scale features can prompt networks to generate more accurate saliency maps [15]. However, the excessive integration of feature maps with different resolutions not only increases the computational burden, but also leads to the dilution of useful features, which can ultimately compromise the algorithm’s performance [16]. In order to address this issue, numerous alternative approaches have been put forth by researchers. Ji et al. [17] proposed a two-way multi-scale pooling target detection network to extract the semantic information of their data and realize feature mapping, so as to enhance the feature representation ability of weak targets and avoid performance loss. Zhang et al. [18] proposed an infrared image weak-target detection algorithm with high computing efficiency. This algorithm integrates a mathematical formula into its process, preprocesses the data, and employs a two-way multi-scale pooling target detection network to enhance the feature representation ability of weak targets. Xu et al. [19] selected Darknet53 as their backbone network for deep convolution feature extraction and designed a four-scale feature pyramid network responsible for the positioning and classification of targets. This approach improved the detection performance of small-scale pedestrian targets by introducing lower-layer high-resolution feature maps. Although the aforementioned networks attempt to extract information from feature maps at different resolutions using various methods, they only achieve the preliminary reuse of multi-scale features, resulting in suboptimal salient object edge detection. Investigating how to appropriately incorporate multi-scale features into salient object detection networks to achieve the clearer edge detection of salient objects is a topic deserving of in-depth research. This forms the second motivation of this study.
In conclusion, integrating multi-scale features into salient target detection networks in different ways will result in varying outcomes. In order to make the edges of salient target detection results clearer, this paper proposes the MMF-Network, based on an encoder–decoder architecture, which gradually realizes the effective integration of features. The primary objectives of this study are to (1) design a multi-scale aggregation module to enhance the high-level semantic information of features; (2) design a multi-scale visual interaction module to connect the low-resolution and high-resolution feature maps; and (3) integrate the SimAM attention mechanism to optimize the features’ weight allocation.
The rest of this article is organized as follows. Section 2 illustrates the structural design of the MMF-Network. Section 3 describes the experimental setup and presents the analysis and discussion of the experimental results. Section 4 summarizes the paper.

2. MMF-Network

The MMF-Network uses ResNet50 to preprocess the input data and acquire backbone network features. ResNet50 is a variant of the deep residual network (ResNet). It is 50 layers deep and contains 16 bottleneck residual blocks, each with 3 convolutional layers. These residual blocks allow the model to learn more complex features and effectively alleviate the gradient vanishing problem in deep neural networks. By removing ResNet50’s global average pooling layer and using the remaining network as the backbone, five sets of feature maps at different scales can be obtained and fed into the MMF-Network for deep feature extraction. With its stability, scalability, powerful feature extraction ability, and moderate computational complexity, ResNet50 as a backbone network can fully extract data features while making gradient vanishing or gradient explosion problems unlikely to occur.
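To make the backbone construction concrete, the following is a minimal sketch, assuming torchvision’s standard ResNet50 implementation and a five-stage split of its layers; the exact grouping and channel handling used by the authors are not specified here, so this split is an illustrative assumption:

```python
# A minimal sketch (illustrative, not the authors' exact code): ResNet50 with
# its global average pooling and fully connected head removed, returning five
# feature maps E_0 ... E_4 at decreasing spatial resolutions.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ResNet50Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet50()  # pretrained weights can be loaded here if desired
        self.stage0 = nn.Sequential(net.conv1, net.bn1, net.relu)  # 1/2 resolution
        self.stage1 = nn.Sequential(net.maxpool, net.layer1)       # 1/4 resolution
        self.stage2 = net.layer2                                    # 1/8 resolution
        self.stage3 = net.layer3                                    # 1/16 resolution
        self.stage4 = net.layer4                                    # 1/32 resolution

    def forward(self, x):
        feats = []
        for stage in (self.stage0, self.stage1, self.stage2,
                      self.stage3, self.stage4):
            x = stage(x)
            feats.append(x)
        return feats  # [E_0, E_1, E_2, E_3, E_4]

if __name__ == "__main__":
    shapes = [f.shape for f in ResNet50Backbone()(torch.randn(1, 3, 320, 320))]
    print(shapes)  # channel counts: 64, 256, 512, 1024, 2048
```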
The MMF-Network adopts an encoder–decoder structure. In order to obtain a greater abundance of multi-scale features, a multi-scale aggregation module (MAM) is designed in the encoder to sample adjacent features and fuse their channels. This process allows the low-level features to acquire high-level semantic information together with the geometric information of the image. Furthermore, it enhances the network’s sensitivity to the image’s textural feature information. Finally, it allows the network to expand its receptive field in the feature layers and to restore salient target information step by step. Figure 2 shows the network model diagram of this method.

2.1. Multi-Scale Aggregation Module

In conventional FPNs [20], multi-scale features are aggregated in a top–down manner, which prevents the underlying features from fully acquiring the semantic information of the high-level features. This structure therefore often lacks the ability to effectively integrate the underlying features with the high-level features. The underlying features usually contain rich spatial detail information, while the high-level features contain more semantic information. Because the FCN mainly relies on high-level features for upsampling and prediction and lacks a full utilization of the underlying features and their effective combination with the high-level features, its performance is often limited in tasks that require both spatial detail and semantic information. To address this shortcoming, the aggregation interaction module of MINet [21] completes feature transmission from top to bottom by integrating the features of adjacent layers. However, the extent to which the bottom features can participate in this transfer process is limited. Zhou et al. [12] proposed a two-way dense feature aggregation module to address this issue. This module facilitates multi-scale feature extraction by connecting paths across scales to avoid the dilution of the underlying features.
The multi-scale aggregation module (MAM) is inspired by the structure of this model and employs a novel approach to feature aggregation. Unlike traditional cascade operations, the MAM does not rely on a simple top–down or bottom–up approach. Instead, it continuously stacks and expands the feature aggregation module, thereby enabling the extraction of more significant information. This method not only avoids increasing the number of parameters, but also minimizes the loss of underlying characteristics. The objective is to achieve a more accurate and significant detection effect. The fundamental principle of the MAM is to take the original features $E_i$ (i = 0, 1, 2, 3, 4) produced by the backbone network, aggregate them through the operation $B_i = c(E_i)$, and then obtain the new feature sequence $f_{MAM}^{i}$ through iterative transmission, where $E_i$ represents the layer-i feature and $B_i$ represents the aggregated feature of each layer. A diagram of its structure is shown in Figure 3.
As shown in the figure above, the input of the module is $E_i$ (i = 0, 1, 2, 3, 4), where $E_0$ is the lowest-layer feature map and $E_4$ is the highest-layer feature map. A convolution layer, a batch normalization layer, and a ReLU activation layer produce feature maps at the same scale as the input features. In this process, the model enhances the nonlinear features of the image and reduces the number of channels to generate a complete branch $B_i$. $f_{MAM}^{i}$ is then obtained by stacking the $B_i$ branches. This stacking process is as follows:
$$
f_{MAM}^{i} =
\begin{cases}
L(B_i), & i = 0 \\
L\left(\sum_{j=0}^{i-1} D(B_j)\right) + L(B_i), & i = 1, 2, 3, 4
\end{cases}
$$
where $f_{MAM}^{i}$ denotes the feature obtained after the input feature $B_i$ is processed by the MAM, $L$ denotes the loss layer, and $D$ denotes the downsampling operation.
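As a concrete illustration of this stacking rule, the sketch below assumes that each branch already shares a common channel count, that L(·) is a 3 × 3 convolution with batch normalization and ReLU, and that D(·) is bilinear resizing to the current branch’s resolution; these are assumptions for illustration, not the authors’ exact design:

```python
# Schematic sketch of the MAM stacking rule above (illustrative assumptions:
# every branch B_i already has the same channel count; L(.) is a 3x3
# convolution + batch normalization + ReLU; D(.) is bilinear resizing of each
# lower-level branch to the spatial size of the current branch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MAM(nn.Module):
    def __init__(self, channels: int = 64, num_levels: int = 5):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_levels)
        ])

    def forward(self, branches):                     # branches = [B_0, ..., B_4]
        outputs = [self.blocks[0](branches[0])]      # f_MAM^0 = L(B_0)
        for i in range(1, len(branches)):
            size = branches[i].shape[-2:]
            # sum of all lower-level branches, resized to the current scale
            agg = sum(
                F.interpolate(branches[j], size=size, mode="bilinear",
                              align_corners=False)
                for j in range(i)
            )
            # f_MAM^i = L(sum_j D(B_j)) + L(B_i)
            outputs.append(self.blocks[i](agg) + self.blocks[i](branches[i]))
        return outputs                                # [f_MAM^0, ..., f_MAM^4]
```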

2.2. SimAM

To highlight significant features of the image without reducing computational efficiency, the SimAM attention mechanism was introduced into the MMF-Network. The fundamental principle underlying the SimAM approach is an energy function based on well-known neuroscience theories. When this energy function is used in a neural network, it can quickly assign importance weights to each pixel in the image. Adding the SimAM module to our network can therefore emphasize the important pixels of an image feature map and avoid generating redundant parameters, without increasing the burden on the network.
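The following is a minimal, parameter-free SimAM block following the published energy-based formulation; the regularization value λ = 1e-4 is the original SimAM paper’s default and is assumed here:

```python
# Parameter-free SimAM attention: per-channel energy is computed over spatial
# positions, and lower-energy (more distinctive) pixels receive larger weights.
import torch
import torch.nn as nn

class SimAM(nn.Module):
    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x):
        # x: (B, C, H, W)
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        # lower energy -> more important pixel -> larger weight after sigmoid
        return x * torch.sigmoid(e_inv)
```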

2.3. Multi-Scale Visual Interaction Module

The shallow features of convolutional neural networks contain a wealth of information regarding image texture, space, and other characteristics, which helps the network to locate salient targets. However, they lack visual detail information. In contrast, high-level features contain semantic visual information, which can refine the edge details of salient targets. Li et al. [22] proposed a global attention upsampling module based on a simple low-level network model. This module weights the low-level features with the high-level features, providing high-level guidance information for the low-level feature maps in a simple way. Inspired by this model, a multi-scale visual interaction module (MVIM) is designed to integrate the underlying and high-level features through bottom–up feature transmission. Its structure is shown in Figure 4.
The module first reduces the number of high-level feature channels through a dilated convolution layer and a downsampling operation. It then applies a conversion–interaction–fusion strategy to the low-level features and multiplies them with the processed high-level features, ensuring that high-resolution information is maintained with a low number of parameters. Finally, it performs this operation for each output feature of the encoder and transmits the results upward to realize the fusion of features.
The MVIM module can be regarded as having multiple branches, and each branch $V_i$ (i = 0, 1, 2, 3, 4) is calculated by the following formula:
$$
V_i =
\begin{cases}
f_{MAM}^{i}, & i = 4 \\
D\left(f_{MAM}^{i}\right) \otimes R\left(f_{MAM}^{i}\right), & i = 0, 1, 2, 3
\end{cases}
$$
where $D$ represents operations such as downsampling, $R$ represents operations such as batch normalization and the activation function, and $\otimes$ denotes element-wise multiplication.
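A schematic sketch of one MVIM branch under the equation above is given below. The channel count, the dilation rate of the D(·) path, and the use of element-wise multiplication for the interaction are assumptions made for illustration; the authors’ exact layer configuration is not specified here:

```python
# Schematic MVIM branch: for i = 4 the feature passes through unchanged;
# otherwise a dilated-convolution path D(.) is multiplied element-wise with a
# conversion path R(.) (conv + batch norm + ReLU). This is a sketch under the
# stated assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn

class MVIMBranch(nn.Module):
    def __init__(self, channels: int = 64, dilation: int = 2):
        super().__init__()
        # D(.): dilated convolution that enlarges the receptive field
        self.d_path = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=dilation, dilation=dilation)
        # R(.): conversion path with batch normalization and activation
        self.r_path = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_mam, top_level: bool = False):
        if top_level:                 # i = 4: V_4 = f_MAM^4
            return f_mam
        return self.d_path(f_mam) * self.r_path(f_mam)   # V_i = D(.) (x) R(.)
```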

2.4. Loss Function

The binary cross-entropy function $L_{BCE}$ is commonly used in salient object detection tasks. When used as the loss function of this network, it accumulates and calculates the loss of each pixel, but it ignores the connections between pixels, which may cause the network to focus on highlighting the image’s foreground area and thus results in an unbalanced processing of samples. Pang et al. [21] proposed the consistency enhancement loss (CEL) as the loss function of their network. Building on the binary cross-entropy function, this loss imposes global constraints on the prediction results, so as to produce more effective gradient propagation. It is calculated as follows:
$$
L_{CEL} = \frac{\left| FP + FN \right|}{\left| FP + 2TP + FN \right|}
        = \frac{\sum (p - pg) + \sum (g - pg)}{\sum p + \sum g}
$$
where $TP$, $FP$, and $FN$ represent true positives, false positives, and false negatives, respectively, and the area is calculated by $|\cdot|$. $FP + FN$ corresponds to the symmetric difference between the predicted foreground region and the ground truth, and $FP + 2TP + FN$ corresponds to the sum of their union and their intersection. When the prediction $\{p \mid p > 0, p \in P\}$ and the ground truth $\{g \mid g = 1, g \in G\}$ do not overlap, the loss reaches its maximum value, namely $L_{CEL} = 1$. Since $p$ is continuous, $L_{CEL}$ is differentiable with respect to $p$, and it can therefore be combined with $L_{BCE}$ so that the network better highlights the image’s foreground information.
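A minimal sketch of this loss, combined with the binary cross-entropy term as described above, could look as follows; the small stabilizing constant is an assumption added only to avoid division by zero:

```python
# BCE + CEL sketch: pred holds saliency probabilities in [0, 1], gt holds the
# binary ground truth, both shaped (B, 1, H, W).
import torch
import torch.nn as nn

class BCEWithCEL(nn.Module):
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.bce = nn.BCELoss()
        self.eps = eps

    def forward(self, pred, gt):
        inter = (pred * gt).sum(dim=(1, 2, 3))                 # ~ |TP|
        total = pred.sum(dim=(1, 2, 3)) + gt.sum(dim=(1, 2, 3))  # ~ |FP + 2TP + FN|
        # L_CEL = (|FP| + |FN|) / (|FP| + 2|TP| + |FN|)
        cel = (total - 2.0 * inter) / (total + self.eps)
        return self.bce(pred, gt) + cel.mean()
```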

3. Experimental Results and Analysis

3.1. Experimental Environment Settings and the Dataset

In this paper, the PyTorch deep learning framework was selected as the experimental framework. The following hardware and software were used in the experiments:
Processor: Intel(R) Core(TM) i9-12900H @ 2.50 GHz;
Running memory: 16 GB;
Graphics card: NVIDIA GeForce RTX 3060 Ti (6 GB);
System: Windows 11;
Related software: Anaconda3-4.3.14, CUDA 10.0.132, PyCharm, etc.
Dataset and implementation details: DUTS-Tr was used as the training set; this dataset is specially designed for salient target detection tasks, is widely used in the field, and has data quality and annotation accuracy recognized by the industry. DUTS-Te, HKU-IS, and PASCAL-S were used as the test sets. These test sets are likewise widely used for salient target detection, so using them ensures the objectivity and accuracy of the assessment results. Considering the memory capacity of the GPU and the size of the dataset, our network was trained for 50 epochs with a mini-batch size of 4. To reduce the computational complexity and avoid falling into local optimum solutions, we used a momentum SGD optimizer with a weight decay of 5 × 10−4, an initial learning rate of 1 × 10−3, and a momentum of 0.9; a minimal configuration sketch of this training setup is given after the dataset descriptions below.
(1)
DUTS-Tr: This dataset is a part of the DUTS dataset and contains a total of 10,553 images. It is currently the largest and most commonly used training dataset for salient target detection. The dataset was augmented by horizontal flipping, yielding a total of 21,106 training images. The images in DUTS-Tr cover a variety of complex scenes, such as urban landscapes, indoor environments, and natural scenery. This diversity enables the model to cope with salient target detection tasks in different environments and conditions, improving its robustness and generalization ability.
(2)
DUTS-Te: This dataset is the test portion of the DUTS dataset and consists of 5019 images. DUTS-Te contains a large number of real images with pixel-level annotations covering a variety of complex scenarios, providing rich data support for evaluating the performance of the tested models in multiple scenarios.
(3)
HKU-IS: The HKU-IS dataset is known for its high-quality annotation, whose high accuracy contributes to the accurate evaluation of model performance. The images in this dataset have a high complexity, including multiple salient targets and complex backgrounds, which challenges the detection ability of the model.
(4)
PASCAL-S: The PASCAL-S dataset not only provides pixel-level annotation, but also combines this with structural information to help to more comprehensively evaluate the performance of the model. This dataset covers a variety of object categories and scenarios, making its evaluation results more representative.
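The training configuration sketch referred to above is shown below; the model, dataset, and loss objects are hypothetical placeholders, and only the stated hyperparameters (SGD with momentum 0.9, weight decay 5 × 10−4, learning rate 1 × 10−3, mini-batch size 4, 50 epochs) are taken from the text:

```python
# Placeholder training loop reflecting the stated setup; `model`,
# `train_dataset`, and `criterion` are stand-ins (for example, the BCE + CEL
# loss sketched in Section 2.4 could serve as `criterion`).
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, criterion, device="cuda"):
    loader = DataLoader(train_dataset, batch_size=4, shuffle=True, num_workers=4)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=5e-4)
    model.to(device).train()
    for epoch in range(50):                       # 50 training epochs
        for images, masks in loader:              # dataset yields (image, mask) pairs
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
    return model
```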

3.2. Experimental Evaluation Index

In our salient object detection experiments, in order to distinguish the accuracy of the experimental results of different networks and to comprehensively evaluate the performance of our salient object detection model, the proposed network was assessed through both subjective perception and objective evaluation. From the objective point of view, three metrics were used as the objective evaluation indicators of this network: the mean absolute error (MAE), the F-measure, and the structural similarity (SSIM). These three measures are introduced as follows:
(1) Mean Absolute Error
The mean absolute error (MAE) reflects the error of the predicted value and is calculated using Equation (4).
$$
\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| \bar{S}(x, y) - \bar{G}(x, y) \right|
$$
where W and H are the width and height of the image, respectively; $\bar{S}$ denotes the saliency detection map; $\bar{G}$ denotes the ground truth; and $(x, y)$ denotes the pixel location. The MAE ranges from 0 to 1, and a smaller value indicates a better network performance.
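A direct computation of this metric, assuming both maps are normalized to [0, 1] and share the same resolution, is sketched below:

```python
# MAE between a saliency map and its ground truth, both in [0, 1].
import numpy as np

def mae(saliency_map: np.ndarray, ground_truth: np.ndarray) -> float:
    s = saliency_map.astype(np.float64)
    g = ground_truth.astype(np.float64)
    return float(np.abs(s - g).mean())
```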
(2) F value
The F value is a comprehensive evaluation index which requires the joint calculation of the accuracy rate (precision) and the recall rate (recall) of the significance detection map. The precision value and the recall value of the significance detection map are as follows:
$$
\mathrm{Precision} = \frac{|S \cap G|}{|S|}, \qquad
\mathrm{Recall} = \frac{|S \cap G|}{|G|}
$$
Here, the significance detection map is binarized to obtain S, while the true value is binarized to obtain G.
In this evaluation index, the F-measure is the weighted harmonic mean of precision and recall, and a parameter $\beta^2$ is set to control the relative importance of the accuracy and recall rates to meet different needs. The larger the F-measure, the better the network’s performance. To ensure the fairness of the experiments, and consistent with the comparison algorithms for salient target detection, $\beta^2$ was set to 0.3.
$$
F_{\beta} = \frac{(1 + \beta^{2}) \cdot \mathrm{Precision} \times \mathrm{Recall}}{\beta^{2} \cdot \mathrm{Precision} + \mathrm{Recall}}
$$
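A simple implementation of these quantities is sketched below; the fixed 0.5 binarization threshold is an assumption (adaptive thresholds are also common in the salient object detection literature):

```python
# Precision, recall, and the weighted F-measure with beta^2 = 0.3, computed
# after binarizing the saliency map at an assumed fixed threshold.
import numpy as np

def f_measure(saliency_map, ground_truth, threshold=0.5, beta_sq=0.3, eps=1e-8):
    s = (saliency_map >= threshold).astype(np.float64)
    g = (ground_truth >= 0.5).astype(np.float64)
    inter = (s * g).sum()
    precision = inter / (s.sum() + eps)
    recall = inter / (g.sum() + eps)
    return float((1 + beta_sq) * precision * recall /
                 (beta_sq * precision + recall + eps))
```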
(3) Structural Similarity
Structural similarity (SSIM) is an indicator of the structural similarity between the original image and the test image. The values of this indicator range from 0 to 1. The closer the predicted map and the true value, the closer the value of the SSIM is to 1. The value of the SSIM is calculated using Equation (7):
$$
\mathrm{SSIM} = \frac{2 \bar{x} \bar{y}}{(\bar{x})^{2} + (\bar{y})^{2}} \times \frac{2 \sigma_x \sigma_y}{\sigma_x^{2} + \sigma_y^{2}} \times \frac{\sigma_{xy}}{\sigma_x \sigma_y}
$$
where x and y represent the saliency detection map and the ground truth, respectively; $\bar{x}$ and $\bar{y}$ are their means; $\sigma_x$ and $\sigma_y$ are their standard deviations; and $\sigma_{xy}$ is their covariance.
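A direct implementation of this expression (luminance, contrast, and structure terms without stabilizing constants) is sketched below; the small epsilon is added only to avoid division by zero and is an assumption:

```python
# SSIM computed exactly as the product of the three terms in the formula above.
import numpy as np

def ssim(x: np.ndarray, y: np.ndarray, eps: float = 1e-8) -> float:
    x = x.astype(np.float64).ravel()
    y = y.astype(np.float64).ravel()
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()
    sxy = ((x - mx) * (y - my)).mean()
    luminance = 2 * mx * my / (mx ** 2 + my ** 2 + eps)
    contrast = 2 * sx * sy / (sx ** 2 + sy ** 2 + eps)
    structure = sxy / (sx * sy + eps)
    return float(luminance * contrast * structure)
```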

3.3. Performance Analysis

In order to verify the effectiveness of the proposed algorithm in salient target detection, it was compared with five current mainstream salient target detection networks: MINet [21], EGNet [23], BASNet [24], BANet [25], and PiCANet [26]. The following table presents the results of this comparison.
As demonstrated in Table 1, to fully compare the proposed method with existing models, detailed experimental results for the three indicators are presented. From the results, our method performs well, outperforming its competitors in terms of the three indicators on most datasets. In terms of MAE, F-measure, and SSIM, its performance improvements averaged 0.0065, 0.036, and 0.042, respectively.
The test set contains a variety of different types of images, including large detection targets, small detection targets, numerous detection targets, and complex detection backgrounds. To validate the efficacy of this method, a subjective analysis of these image types is conducted.
(A)
The saliency map results of detecting a small target are shown in Figure 5.
Figure 5 illustrates that the present method achieves satisfactory results in the context of small-target detection. Due to the limited number of target features, it is challenging to accurately extract the salient target’s position. Furthermore, saliency information is prone to being processed as background, which hinders the precise localization of the target. The MMF-Network effectively addresses the issue of distinguishing between significant and insignificant pixels in these images by combining multi-scale features using SimAM. This integration enhances the accuracy of target detection.
(B)
The image results of detecting multiple salient targets are shown in Figure 6.
Figure 6 shows that the method proposed in this paper is capable of more accurately identifying multiple significant targets in the context of multi-object detection, whereas the detection results of the other algorithms are slightly inferior to those of the MMF-Network. The discrepancy between the detected multi-object contour information and its true value may be attributed to the proximity of the objects in the image, which makes it difficult for the network to correctly identify their edge details.
(C)
The results of detecting a significant target image are shown in Figure 7.
It can be seen from Figure 7 that our method is better than other methods in detecting significant targets, although the MMF-Network approach may result in missing significant targets and incomplete edges in certain instances. This is attributed to the limitations of the current salient detection algorithm, which cannot fully encompass all significant pixels.
(D)
The detection results for images with complex detection scenes are shown in Figure 8.
Figure 8 illustrates that our detection results outperform those of the other methods when the salient target is not prominent. In this type of image, the objective is not only to identify the target pixels, but also to distinguish them from the background pixels; this is a challenging task, and the proposed method has demonstrated its efficacy in this case.

3.4. Ablation Experiment

In order to verify the influence of the MAM, SimAM, and MVIM on the performance of the network model, its performance was evaluated by gradually adding each module to the network. The MAE and F-measure were calculated on the DUTS-Te and HKU-IS datasets to prove the effectiveness of each module, where the baseline is the original FCN network. The data presented in Table 2 indicate that the network’s performance is best when the MAM, SimAM, and MVIM are all added.
The limitations of this study are worth noting. First, due to the limited datasets used, we only compared the network’s detection performance on three datasets; although this demonstrates the superiority of our method, tests on more datasets could further demonstrate its robustness. Second, although the proposed method outperforms the existing methods in almost all performance metrics, it is still slightly inferior to PiCANet on the PASCAL-S dataset, which is mainly related to the internal structural design of the network; the proposed method therefore still has considerable room for improvement. In future work, we plan to introduce more advanced feature extraction methods to achieve a better detection performance.

4. Conclusions

This paper proposes a salient target detection network that incorporates multi-scale information to mine more of the semantic and spatial features of images. The approach utilizes a multi-scale aggregation module to capture spatial features at different levels of granularity and a multi-scale visual interaction module to integrate the underlying features and high-level visual features within the network, thereby realizing the mutual enhancement of spatial and semantic information. The test results indicate that the proposed method outperforms five other popular methods, in terms of both objective evaluation indexes and subjective perception. Nevertheless, there is still room for further improvement of the method proposed in this paper, including, but not limited to, the introduction of new network structures to mine more of the semantic and spatial feature information of images at different granularity levels more efficiently and quickly. This will be the focus of our future research.

Author Contributions

Conceptualization, methodology, software, investigation, writing—original draft preparation, and funding acquisition, X.L.; validation, visualization, and supervision, Y.P.; formal analysis and resources, G.W.; data curation, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported, in part, by the National Natural Science Foundation of China under Grant 61772249 and, in part, by the Liaoning Provincial Colleges and Universities Basic Scientific Research Project under Grant LJKZ0358.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ieracitano, C.; Mammone, N.; Spagnolo, F.; Frustaci, F.; Perri, S.; Corsonello, P.; Morabito, F.C. An explainable embedded neural system for on-board ship detection from optical satellite imagery. Eng. Appl. Artif. Intell. 2024, 133, 108517. [Google Scholar] [CrossRef]
  2. Zhenzhen, L.; Baojun, Z.; Linbo, T.; Zhen, L.; Fan, F. Ship classification based on convolutional neural networks. J. Eng. 2019, 2019, 7343–7346. [Google Scholar] [CrossRef]
  3. Su, W.; Wang, Z. Widening residual refine edge reserved neural network for semantic segmentation. Multimed. Tools Appl. 2019, 78, 18229–18247. [Google Scholar] [CrossRef]
  4. Wang, P.; Wang, C.; Lai, J.; Huang, Z.; Ma, J.; Mao, Y. Traffic control approach based on multi-source data fusion. IET Intell. Transp. Syst. 2019, 13, 764–772. [Google Scholar] [CrossRef]
  5. Zhou, T.; Li, Z.; Zhang, C. Enhance the recognition ability to occlusions and small objects with Robust Faster R-CNN. Int. J. Mach. Learn. Cybern. 2019, 10, 3155–3166. [Google Scholar] [CrossRef]
  6. Xiang, Y.; Liu, Z.; Huang, Y.; Xu, Y. Moving target detection with polarimetric distributed MIMO radar in heterogeneous clutter. J. Eng. 2019, 2019, 8009–8012. [Google Scholar] [CrossRef]
  7. Zhang, X.; Shangguan, H.; Ning, A.; Wang, A.; Zhang, J.; Peng, S. Pedestrian detection with EDGE features of color image and HOG on depth images. Autom. Control. Comput. Sci. 2020, 54, 168–178. [Google Scholar] [CrossRef]
  8. Achanta, R.; Estrada, F.; Wils, P.; Süsstrunk, S. Salient region detection and segmentation. In Proceedings of the Computer Vision Systems: 6th International Conference, ICVS 2008, Santorini, Greece, 12–15 May 2008; Proceedings 6. Springer: Berlin/Heidelberg, Germany, 2008; pp. 66–75. [Google Scholar]
  9. Zhao, K.; Liu, Z.; Zhao, B.; Shao, H. Class-Aware Adversarial Multiwavelet Convolutional Neural Network for Cross-Domain Fault Diagnosis. IEEE Trans. Ind. Inform. 2024, 20, 4492–4503. [Google Scholar] [CrossRef]
  10. Yu, J.; Liu, M.; Rodríguez-Andina, J.J. Zonotope-Based Asynchronous Fault Detection for Markov Jump Systems Subject to Deception Attacks via Dynamic Event-Triggered Communication. IEEE Open J. Ind. Electron. Soc. 2022, 3, 304–317. [Google Scholar] [CrossRef]
  11. Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote. Sens. 2018, 145, 3–22. [Google Scholar] [CrossRef]
  12. Zhou, W.; Bai, W.; Ji, J.; Yi, Y.; Zhang, N.; Cui, W. Dual-path multi-scale context dense aggregation network for retinal vessel segmentation. Comput. Biol. Med. 2023, 164, 107269. [Google Scholar] [CrossRef] [PubMed]
  13. Luo, H.; Wang, P.; Chen, H.; Xu, M. Object detection method based on shallow feature fusion and semantic information enhancement. IEEE Sensors J. 2021, 21, 21839–21851. [Google Scholar] [CrossRef]
  14. Zhao, X.; Pang, Y.; Zhang, L.; Lu, H. Joint learning of salient object detection, depth estimation and contour extraction. IEEE Trans. Image Process. 2022, 31, 7350–7362. [Google Scholar] [CrossRef] [PubMed]
  15. Lalithadevi, B.; Krishnaveni, S. Diabetic retinopathy detection and severity classification using optimized deep learning with explainable AI technique. Multimed. Tools Appl. 2024, 1–65. [Google Scholar] [CrossRef]
  16. Liu, M.; Yu, J.; Rodríguez-Andina, J.J. Adaptive Event-Triggered Asynchronous Fault Detection for Nonlinear Markov Jump Systems with Its Application: A Zonotopic Residual Evaluation Approach. IEEE Trans. Netw. Sci. Eng. 2023, 10, 1792–1808. [Google Scholar] [CrossRef]
  17. Ji, Y.; Zhang, H.; Wu, Q.J. Salient object detection via multi-scale attention CNN. Neurocomputing 2018, 322, 130–140. [Google Scholar] [CrossRef]
  18. Zhang, T.; Zhuang, Y.; Wang, G.; Dong, S.; Chen, H.; Li, L. Multiscale semantic fusion-guided fractal convolutional object detection network for optical remote sensing imagery. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–20. [Google Scholar] [CrossRef]
  19. Xu, D.; Zhang, N.; Zhang, Y.; Li, Z.; Zhao, Z.; Wang, Y. Multi-scale unsupervised network for infrared and visible image fusion based on joint attention mechanism. Infrared Phys. Technol. 2022, 125, 104242. [Google Scholar] [CrossRef]
  20. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  21. Pang, Y.; Zhao, X.; Zhang, L.; Lu, H. Multi-scale interactive network for salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9413–9422. [Google Scholar]
  22. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid attention network for semantic segmentation. arXiv 2018, arXiv:1805.10180. [Google Scholar]
  23. Zhao, J.X.; Liu, J.J.; Fan, D.P.; Cao, Y.; Yang, J.; Cheng, M.M. EGNet: Edge guidance network for salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8779–8788. [Google Scholar]
  24. Qin, X.; Zhang, Z.; Huang, C.; Gao, C.; Dehghan, M.; Jagersand, M. Basnet: Boundary-aware salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7479–7489. [Google Scholar]
  25. Su, J.; Li, J.; Zhang, Y.; Xia, C.; Tian, Y. Selectivity or invariance: Boundary-aware salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3799–3808. [Google Scholar]
  26. Liu, N.; Han, J.; Yang, M.H. Picanet: Learning pixel-wise contextual attention for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3089–3098. [Google Scholar]
Figure 1. Traditional methods for finding salient regions. (a) Input image. (b) Saliency maps, at different scales, are computed, added pixel-wise, and normalized to obtain the final saliency map. (c) The final saliency map and the segmented image. (d) The output image containing the salient object, which is made of only those segments that have an average saliency value greater than the threshold T [8].
Figure 2. MMF-Network’s network model diagram.
Figure 3. B_i model structure.
Figure 4. Multi-scale visual interaction module.
Figure 5. Comparison of small-target detection results.
Figure 6. Comparison of detection results for multiple salient targets.
Figure 7. Comparison of significant-target detection results.
Figure 8. Comparison of the image results of an unclear target.
Table 1. Performance comparison of MMF-Network with five other models.

Model    | DUTS-Te                  | HKU-IS                   | PASCAL-S
         | MAE↓   F-Measure↑ SSIM↑  | MAE↓   F-Measure↑ SSIM↑  | MAE↓   F-Measure↑ SSIM↑
Ours     | 0.039  0.880      0.803  | 0.031  0.886      0.811  | 0.066  0.887      0.798
MINet    | 0.041  0.877      0.798  | 0.034  0.882      0.805  | 0.068  0.873      0.790
EGNet    | 0.049  0.793      0.797  | 0.037  0.874      0.802  | 0.078  0.795      0.562
BASNet   | 0.044  0.860      0.776  | 0.034  0.887      0.791  | 0.072  0.744      0.723
BANet    | 0.040  0.881      0.731  | 0.033  0.884      0.784  | 0.073  0.708      0.702
PiCANet  | 0.052  0.866      0.781  | 0.046  0.840      0.795  | 0.077  0.868      0.799
Table 2. Ablation experiment.

Ablation Algorithm       | DUTS-Te              | HKU-IS
                         | F-Measure↑  MAE↓     | F-Measure↑  MAE↓
MAM + Baseline           | 0.701       0.074    | 0.691       0.070
MAM + SimAM + Baseline   | 0.794       0.059    | 0.784       0.061
MAM + SimAM + MVIM       | 0.880       0.039    | 0.886       0.031