Article

A Two-Stage Approach for Infrared and Visible Image Fusion and Segmentation

by Wang Ren 1, Lanhua Luo 1,2 and Jia Ren 3,*

1 Faculty of Data Science, City University of Macau, Macao 999078, China
2 School of Artificial Intelligence, Hezhou University, Hezhou 542899, China
3 School of Information and Communication Engineering, Hainan University, Haikou 570100, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10698; https://doi.org/10.3390/app151910698
Submission received: 10 August 2025 / Revised: 18 September 2025 / Accepted: 29 September 2025 / Published: 3 October 2025

Abstract

Early studies rarely considered cascades among multiple tasks, such as image fusion and semantic segmentation. Most image fusion methods do not consider the interrelationship between image fusion and segmentation. We propose a new two-stage infrared and visible image fusion and segmentation method called TSFS. By cascading the fusion module and the segmentation module in the first stage, we obtain better fusion results and enhance the semantic information transferred to the second stage. The fusion module in the first stage uses a feature extraction module (FEM) to extract deep features and then fuses these features through a feature mixture fusion module (FMFM). To enhance the fusion of multimodal data in the second-stage segmentation network, we propose a Cross-Semantic Fusion Attention Module (CSFAM) to cross-fuse these features. Experimental evaluations on public datasets show that, compared with state-of-the-art methods, the proposed TSFS improves segmentation mIoU by 3.3% and 1.5% on the FMB and MFNet datasets, respectively, and produces visually better fused images.

1. Introduction

With the rapid advance of technology, image fusion has shown enormous application potential in computer vision and has become important to the development of many sectors. Image fusion aims to combine images obtained by different imaging systems into a single image that is rich in detailed texture information and visually pleasing. Image fusion improves the usability and quality of multimodal images and provides a strong basis for further processing and analysis.
Visible sensors are easily affected by adverse conditions such as low light or nighttime. Infrared sensors, by contrast, offer particular benefits at night or in low light and can sensitively detect the thermal radiation emitted by objects. Visible sensors capture the light reflected by object surfaces; under good lighting conditions, they produce images with extensive texture and detail that are close to human visual perception. However, visible sensors cannot achieve all-day, all-weather imaging at night or under partial occlusion, and they cannot penetrate obstructions to monitor a target object. Example visible images are shown in Figure 1a,d. As shown in Figure 1b,e, infrared sensors are easily disturbed by background noise, and their images lack texture detail.
As imaging sensors have continued to develop, the richness of the information they capture has increased. Fusing images from different sensors not only provides more effective technical support for subsequent high-level vision tasks but also improves the performance of those tasks. Given the complementary characteristics of visible and infrared imaging, fusion results are generated from features containing complementary information extracted from the infrared and visible images. As shown in Figure 1, the fusion result effectively integrates the rich detailed texture information of the visible image and the salient infrared information of the person in the infrared image. The fusion of infrared and visible images explicitly handles feature information with complementary characteristics. By integrating multimodal information, their advantages can be combined, which is crucial in areas such as target detection [1], target recognition [2], and target tracking [3].
Generally, image fusion methods can be classified into two main categories: traditional fusion methods and deep learning-based methods [4,5]. Since Liu et al. [6] first proposed a CNN-based multi-focus fusion method that outperformed the most advanced multi-focus fusion methods of the time, subsequent work has proposed numerous CNN-based infrared and visible image fusion (IVIF) methods to further improve fusion quality. TC-GAN [7] constructs a single-channel generator consisting of an encoder, a squeeze-and-excitation (SE-Net) [8] attention module, and a decoder to generate a texture map, which is passed through adaptive guided filtering to obtain a decision map that guides image fusion. NestFuse [9] fuses the encoder outputs through a spatial attention model based on regional energy and a channel attention model based on regional pooling before feeding them into the decoder.
Although the deep learning-based fusion methods mentioned above have advantages in feature extraction, the lack of semantic constraints means that the fused image still contains insufficient semantic information, and the semantic content of both the fused and source images is not handled well. Advanced computer vision tasks and image fusion can benefit from being cascaded in a mutually reinforcing manner.
Slight misalignment of the source images may cause serious artifacts in the fusion results. Wang et al. proposed UMF-CMGR [10], a preliminary attempt at multi-task mixing that unifies the data into the infrared domain for image registration using CycleGAN [11]. UMF-CMGR alleviates the impact of small deformations through unimodal registration, but the fusion result is still affected by artifacts when the source images are significantly misaligned. Although UMF-CMGR combines image style transfer with image registration, it only trains registration and fusion sequentially, without cascading them or using fusion feedback to guide training. In work from the same period, Tang et al. proposed SeAFusion, a cascaded fusion and segmentation network that computes a semantic loss to guide the training of the fusion network [12]. SeAFusion is a further attempt at hybrid training of fusion and another task, but it only feeds the fusion result into a unimodal segmentation network, without feeding the multimodal source images together with the fusion result into a multimodal segmentation network. SuperFusion [13], also proposed by Tang et al., extends SeAFusion by combining registration, fusion, and segmentation into one network: the cascaded registration and fusion network is fed into the segmentation network for fine-tuning. However, the pre-trained segmentation model is the unimodal one from SeAFusion, and the segmentation of the multimodal source images is still not included in training. Although UMF-CMGR, SeAFusion, and SuperFusion have begun to explore combining high-level vision tasks with image fusion, they only couple the two tasks through loss functions. A more self-consistent approach is needed to study how the semantic information in the multimodal source images and the fusion results affects multimodal segmentation. A multi-stage fusion strategy can be used to focus on training the fusion network [14,15], and a two-stage reconstruction-fusion training strategy can be used to train reconstruction and fusion separately [16,17,18]: the first stage focuses on the encoding and reconstruction performance of the encoder and decoder, and the second stage inserts the fusion part to train the feature fusion capability of the network. This paper adopts a two-stage training approach, training the fusion module, the segmentation module, and the segmentation network separately.
Despite the significant progress made in image fusion, many challenges remain. Information loss during the fusion process cannot be overlooked: due to the differences between multimodal images and the limitations of fusion methods, the fused result may lose important information from the source images, degrading its visual quality. Crucial detail may also be lost during feature extraction and fusion, leaving the fused image with insufficient semantic information and hampering subsequent high-level vision tasks. By cascading two different task modules in the first stage, we not only improve the quality of the fusion results but also pass the segmentation results to the multimodal segmentation model, which improves its ability to extract and fuse features from the multimodal source images. TSFS effectively integrates multimodal data, combines the image fusion and segmentation tasks, and achieves mutual optimization between the two.
The main contributions of this paper are as follows:
1.
We propose a new two-stage image fusion and segmentation method called TSFS, which effectively combines the two tasks. In addition to enhancing the semantic information of the fused image, the method guides the semantic segmentation of the multimodal images through parameter transfer.
2.
We designed the FEM and FMFM in the fusion module. To improve the feature fusion performance of the model, we designed the MFB, which uses multiple convolutional layers, and the CAB, which uses a channel attention mechanism.
3.
We use the Cross-Semantic Fusion Attention Module (CSFAM) in the second stage to fuse features through cross fusion, thereby improving segmentation accuracy.

2. Materials and Methods

We first briefly introduce the overall method in Section 2, then describe the fusion module in the first stage S1 in detail in Section 2.1, then describe the semantic segmentation modules in the first stage S1 and the second stage S2 in Section 2.2, and finally describe the loss functions of the entire method in Section 2.3.
Our main goal is to explore a new two-stage method for image fusion and image segmentation. To this end, we propose several key components: the FEM and FMFM in the fusion module of the first stage S1 and the CSFAM in the second stage S2. This design is intended to make more effective use of multimodal features and to enhance the quality and accuracy of the fusion and semantic segmentation results. Figure 2 shows the TSFS network framework. The fusion result produced by the fusion module in S1 is fed into the segmentation module in S1, and the parameters of the resulting segmentation model are passed to the multimodal segmentation model. The following subsections describe the fusion module in S1, the segmentation in the first and second stages, and the loss functions in detail.

2.1. The Fuse Module in the First Stage S1

TSFS consists of two stages, S1 and S2. As shown in Figure 3, the first stage S1 consists of a fusion module and a segmentation module. In this section, we explain the fusion module in S1 in detail.
Feature Extraction Module (FEM): We use the Residual in Residual Dense Block (RRDB) [19] to build the FEM in the fusion module of S1, which strengthens the network's ability to extract features and to use them effectively. The infrared image and the visible image are fed into their respective FEMs to obtain deep features in the latent space, which are then passed to the FMFM for fusion. For visible images, the FEM extracts deep features from their rich detail and texture information so that subsequent processing can better exploit this information. The FEM for the infrared image has the same structure, and the deep features of the infrared image are extracted through the same process. The two modalities are fed into their FEMs through separate input branches, which allows each FEM to focus on the characteristics specific to its input image. Each FEM consists of six RRDBs, each containing three residual dense units, and the output of the residual dense flow is the final feature vector.
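To make the FEM concrete, the sketch below shows one plausible PyTorch implementation. The six-RRDB / three-unit structure follows the description above, while the channel width, growth rate, residual scaling factor, and activation are assumptions that are not specified in the paper.

```python
# A minimal sketch of an RRDB-based feature extraction module (FEM).
# Channel counts, growth rate, and the 0.2 residual scaling are assumptions.
import torch
import torch.nn as nn

class ResidualDenseUnit(nn.Module):
    """One residual dense unit: densely connected 3x3 convolutions with a local residual."""
    def __init__(self, channels=64, growth=32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, growth, 3, padding=1)
        self.conv2 = nn.Conv2d(channels + growth, growth, 3, padding=1)
        self.conv3 = nn.Conv2d(channels + 2 * growth, channels, 3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        d1 = self.act(self.conv1(x))
        d2 = self.act(self.conv2(torch.cat([x, d1], dim=1)))
        out = self.conv3(torch.cat([x, d1, d2], dim=1))
        return x + 0.2 * out  # local residual with scaling, in the style of ESRGAN RRDBs

class RRDB(nn.Module):
    """Residual in Residual Dense Block: three residual dense units plus an outer residual."""
    def __init__(self, channels=64):
        super().__init__()
        self.units = nn.Sequential(*[ResidualDenseUnit(channels) for _ in range(3)])

    def forward(self, x):
        return x + 0.2 * self.units(x)

class FEM(nn.Module):
    """Feature extraction module: a shallow stem followed by six RRDBs."""
    def __init__(self, in_channels=1, channels=64):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.rrdbs = nn.Sequential(*[RRDB(channels) for _ in range(6)])

    def forward(self, x):
        return self.rrdbs(self.stem(x))

# Separate branches for the two modalities, as described above (hypothetical usage).
fem_ir, fem_vis = FEM(in_channels=1), FEM(in_channels=3)
feat_ir = fem_ir(torch.randn(1, 1, 256, 256))
feat_vis = fem_vis(torch.randn(1, 3, 256, 256))
```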
Feature Mix Fusion Module (FMFM): We concatenate the separately extracted deep features of the infrared and visible images and feed them into the FMFM, which consists of the Mix Fuse Branch (MFB) and the Channel Attention Branch (CAB). The FMFM acts as an information processor that performs mixed fusion of the deep features of the infrared and visible images, producing a fusion in the deep feature space. We designed the convolutional layers in the MFB with kernel sizes of 3, 1, 3, 5, 7, 3, 1, and 3, respectively. The first convolutional layer performs a preliminary fusion of the deep features, which are then fed into parallel convolutional layers with kernel sizes of 1, 3, 5, and 7 for multi-scale mixed fusion, achieving a more comprehensive feature fusion. The deep features aggregated and summed from these four layers are activated by GELU and then passed to the last convolutional layer for a 3 × 3 convolution. The final output of the CAB is an activation vector used to guide the fusion of deep spatial features; we multiply the features entering the CAB by this activation vector channel-wise to generate the CAB-processed fused features. No pooling layers follow the convolution operations in the MFB, which reduces information loss in the FMFM.
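The following sketch illustrates one way the MFB and CAB described above could be realized in PyTorch. The channel counts, the exact placement of GELU, and the reduction ratio in the CAB are assumptions; only the preliminary 3 × 3 convolution, the parallel 1/3/5/7 branches, the final 3 × 3 convolution, and the channel-wise rescaling follow the text.

```python
# A minimal sketch of the FMFM, with assumed channel counts and layer ordering.
import torch
import torch.nn as nn

class MFB(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.pre = nn.Conv2d(channels, channels, 3, padding=1)        # preliminary fusion
        self.scales = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5, 7)
        ])                                                             # multi-scale mixed fusion
        self.act = nn.GELU()
        self.post = nn.Conv2d(channels, channels, 3, padding=1)        # final 3x3 convolution

    def forward(self, x):
        x = self.pre(x)
        mixed = sum(conv(x) for conv in self.scales)                   # aggregate and sum the four branches
        return self.post(self.act(mixed))

class CAB(nn.Module):
    def __init__(self, channels=128, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.GELU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                              # per-channel activation vector
        )

    def forward(self, x):
        return x * self.fc(x)                                          # channel-wise multiplication

class FMFM(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.mfb, self.cab = MFB(channels), CAB(channels)

    def forward(self, feat_ir, feat_vis):
        x = torch.cat([feat_ir, feat_vis], dim=1)                      # concatenated deep features
        return self.cab(self.mfb(x))

fused = FMFM(channels=128)(torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64))
```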
In summary, the processing flow of the fusion module in S1 is as follows. First, two FEMs extract deep features from the infrared and visible images. Then, the deep features are interactively fused through the MFB and CAB in the FMFM. Finally, two convolution operations produce the fusion result. Image fusion exploits not only the complementary characteristics of infrared and visible images but also the characteristic information unique to each single-modal image. In TSFS, feature extraction and interactive fusion are critical steps: the FEM is a key component of the fusion module, and the FMFM performs more effective fusion through feature mixing so that more semantic information can be passed from the fusion module to the segmentation in the first and second stages.

2.2. Segmentation in the S1 and S2

The segmentation module in S1 plays a key role in connecting the two stages. Since its parameters are transferred to S2, it not only completes the initial segmentation but also guides the semantic segmentation of the multimodal images in S2.
The cascaded optimization of image fusion and semantic segmentation is therefore of great significance. By cascading these two tasks into a unified method, their mutual reinforcement can be exploited to its full potential. We therefore propose TSFS, which cascades image fusion and segmentation, breaking through the limitations of existing single-task models and providing new insights for the development of computer vision. The structure of the segmentation network in S2 is shown in Figure 4. The method extracts, fuses, and segments multimodal information effectively through a two-stage training design built around the FEM, FMFM, and CSFAM. The fusion module and segmentation module in S1 and the segmentation network in S2 are the three main parts of TSFS. Through this design, TSFS combines the image fusion and semantic segmentation tasks and achieves cascaded optimization of the two.
Cross-Semantic Fusion Attention Module (CSFAM): As the key module in the second stage S2 of TSFS, the CSFAM is important for cross-modal semantic feature fusion. From the perspective of intra-modality enhancement and inter-modality complementarity, the module first enhances local visual features within each modality; emphasizing the refinement of local features increases their importance in the fusion process and allows effective feature regions to be attended to more efficiently. The CSFAM computes, in order, unimodal self-attention, cross-modal cross-attention, cross-modal self-attention, and a further cross-modal cross-attention, achieving effective fusion of semantic information through cross fusion. Unimodal self-attention captures the unimodal features of each image and can be expressed as follows:
$A_I^X = V_X \, S(K_X, Q_X)$
where $S(\cdot)$ denotes the similarity operation. Unimodal self-attention, cross-modal cross-attention, cross-modal self-attention, and the further cross-modal cross-attention are all computed with $S(\cdot)$. Taking unimodal self-attention as an example, the Value, Key, and Query [20] generated from the feature $X$ of the visible image are denoted as $V$, $K$, and $Q$, respectively, and the feature of the infrared image is denoted as $Y$. $S(\cdot)$ is expressed as follows:
$S(K_X, Q_X) = \mathrm{softmax}\!\left(\dfrac{K_X^{T} Q_X}{\sqrt{d_n}}\right)$
$A_I^X$ enhances the aggregation of features of the same class within a modality by performing self-attention on the image's own deep features, making the unimodal semantic information obtained by the model more stable and accurate. Cross-modal cross-attention enhances the semantic information by integrating fine-grained information extracted across modalities, yielding $A_I^{XY}$. Cross-modal self-attention further strengthens the cross-modal features captured by the cross-modal cross-attention. Finally, the result is fed into the further cross-modal cross-attention to obtain the fused features.
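As an illustration of the attention sequence described above, the sketch below chains single-head scaled dot-product attention blocks in the stated order. The token layout, projection sizes, and the exact way the four steps are wired together are assumptions, not the authors' exact design.

```python
# A minimal sketch of the CSFAM attention steps on flattened feature maps of shape
# (batch, tokens, dim); single-head attention and the chaining order are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Scaled dot-product attention, equivalent to softmax(K^T Q / sqrt(d_n)) up to transposition."""
    def __init__(self, dim=64):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.scale = dim ** -0.5

    def forward(self, query_src, key_value_src):
        q = self.q(query_src)                                  # queries from one feature stream
        k, v = self.k(key_value_src), self.v(key_value_src)    # keys/values from the other (or same) stream
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

class CSFAM(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.self_attn = Attention(dim)     # unimodal self-attention (shared here for brevity)
        self.cross_attn1 = Attention(dim)   # cross-modal cross-attention
        self.cross_self = Attention(dim)    # cross-modal self-attention
        self.cross_attn2 = Attention(dim)   # further cross-modal cross-attention

    def forward(self, x_vis, y_ir):
        ax = self.self_attn(x_vis, x_vis)          # enhance features within the visible modality
        ay = self.self_attn(y_ir, y_ir)            # enhance features within the infrared modality
        axy = self.cross_attn1(ax, ay)             # inject infrared cues into visible features
        axy = self.cross_self(axy, axy)            # strengthen the captured cross-modal features
        return self.cross_attn2(axy, ay)           # final cross fusion of semantic information

fused = CSFAM()(torch.randn(2, 1024, 64), torch.randn(2, 1024, 64))
```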

2.3. Loss Function

The loss function of this paper comprises the loss $L_{S1}$ of the first stage S1 and the loss $L_{S2}$ of the second stage S2. In $L_{S1}$, we adopt a pixel intensity distribution loss $L_{pixel}$, a contrastive loss $L_{cr}$, a structural similarity loss $L_{ssim}$, and the conventional cross-entropy loss $L_{ce}$. The loss functions are expressed as follows:
$L_{S1} = L_{fus} + L_{seg}^{S1} = L_{pixel} + L_{cr} + L_{ssim} + L_{ce}$
$L_{S2} = L_{ce}$
$L_{total} = L_{S1} + L_{S2}$
Infrared and visible images are paired multimodal images with complementary characteristics, but neither the FMB dataset nor the MFNet dataset provides officially released fused images as a reference standard. To compute $L_{cr}$, we therefore have to address a key issue in image fusion: how to generate high-quality positive samples and matching negative samples. We manually generated masks for the FMB and MFNet datasets. We use the intermediate weight maps $a_1 = S(I_v)/S(I_r)$ and $a_2 = S(I_r)/S(I_v)$ to compute the visual saliency weight maps of the visible and infrared images, denoted $m_v$ and $m_r$, respectively. $m_v$ and $m_r$ are complementary, with $m_v + m_r = 1$. The expression for $m_v$ is as follows:
$m_v = \dfrac{e^{a_1}}{e^{a_1} + e^{a_2}}$
We adopt $L_{cr}$ in the first-stage loss $L_{S1}$ and use the fixed-weight pre-trained model $G$ (VGG-19 [21]) commonly used in image processing, aiming to make the fusion result as close as possible to the positive sample while keeping it away from the corresponding negative sample. $L_{cr}$ is defined as follows:
$L_{cr} = \sum_{i=1}^{n} w_i \dfrac{\left\| G_i(I_f) - G_i(I_v \cdot m_v + I_r \cdot m_r) \right\|}{\left\| G_i(I_f) - G_i(I_v \cdot m_r + I_r \cdot m_v) \right\|}$
The four loss terms in the first stage S1 constrain the model from different perspectives and together guide the fusion results generated by TSFS to contain more semantic information. This improves the quality of the fusion results and provides strong support for a segmentation task that reinforces, and is reinforced by, image fusion.
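The sketch below illustrates how the weight maps and the contrastive term could be computed in practice. The form of the intermediate weight maps, the choice of VGG-19 stages, the per-layer weights $w_i$, and the use of an L1 distance are assumptions.

```python
# A minimal sketch of the saliency-based weight maps and the contrastive loss L_cr.
import torch
import torch.nn.functional as F
import torchvision

def weight_maps(sal_vis, sal_ir):
    """Softmax-style weight maps m_v, m_r from per-pixel saliency of the two modalities."""
    a1 = sal_vis / (sal_ir + 1e-6)      # intermediate weight maps (assumed form)
    a2 = sal_ir / (sal_vis + 1e-6)
    m_v = torch.exp(a1) / (torch.exp(a1) + torch.exp(a2))
    return m_v, 1.0 - m_v               # complementary maps, m_v + m_r = 1

def contrastive_loss(fused, vis, ir, m_v, m_r, feat_stages, layer_weights):
    """L_cr: pull the fused image toward the positive sample and away from the negative one."""
    pos = vis * m_v + ir * m_r          # saliency-consistent positive sample
    neg = vis * m_r + ir * m_v          # weight-swapped negative sample
    loss, f_f, f_p, f_n = 0.0, fused, pos, neg
    for w_i, stage in zip(layer_weights, feat_stages):
        f_f, f_p, f_n = stage(f_f), stage(f_p), stage(f_n)   # cumulative fixed feature extractor G_i
        loss = loss + w_i * F.l1_loss(f_f, f_p) / (F.l1_loss(f_f, f_n) + 1e-6)
    return loss

# Hypothetical usage: a frozen VGG-19 split into feature stages. In practice the fixed
# ImageNet-pretrained weights are used; they are omitted here to keep the sketch self-contained.
vgg = torchvision.models.vgg19().features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)
stages = [vgg[:5], vgg[5:10], vgg[10:19]]
```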

3. Experiments

Our method is implemented in PyTorch 1.11, and both training and testing are performed on a machine with an NVIDIA GeForce RTX 4090 (NVIDIA, Santa Clara, CA, USA) with 24 GB of video memory and a Xeon Platinum 8352V processor (Intel, Santa Clara, CA, USA) with 50 GB of memory. During training, the initial learning rate of the fusion module is set to $1 \times 10^{-3}$, and the initial learning rate of the segmentation module and the second-stage segmentation network is set to $1 \times 10^{-2}$. We use the Adam optimizer and stochastic gradient descent to optimize the segmentation module and the segmentation network in S2.
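A minimal sketch of the optimizer configuration described above is given below. The module objects are stand-ins, and the assignment of Adam versus SGD to the individual parts is an assumption, since the text states only that both optimizers are used for the segmentation components.

```python
# A minimal sketch of the training configuration; the three modules are placeholders.
import torch
import torch.nn as nn

fusion_module = nn.Conv2d(2, 1, 3, padding=1)     # stand-in for the S1 fusion module
seg_module_s1 = nn.Conv2d(1, 9, 3, padding=1)     # stand-in for the S1 segmentation module
seg_net_s2 = nn.Conv2d(4, 9, 3, padding=1)        # stand-in for the S2 segmentation network

opt_fusion = torch.optim.Adam(fusion_module.parameters(), lr=1e-3)   # initial lr 1e-3
opt_seg_s1 = torch.optim.SGD(seg_module_s1.parameters(), lr=1e-2)    # initial lr 1e-2
opt_seg_s2 = torch.optim.SGD(seg_net_s2.parameters(), lr=1e-2)
```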
All the datasets used in this study are divided into training and test sets: the FMB dataset [22] contains 1220 training and 280 test pairs, and the MFNet dataset [23] contains 1176 training and 393 test pairs of infrared and visible images. In the comparative experiments, we evaluate the proposed TSFS on the FMB and MFNet datasets, conducting extensive qualitative and quantitative comparisons against state-of-the-art methods. The compared methods are tested with the public code and parameter settings from their original papers: the fusion methods MDLatLRR [24], DenseFuse [25], LTS [26], MetaLearning [27], SeAFusion, TarDAL [28], and Diff-IF [29], and the segmentation methods EGFNet [30] and LASNet [31].
For the fusion results in the first stage, we select five objective evaluation metrics: spatial frequency (SF), mutual information (MI), visual information fidelity (VIF) [32], nonlinear correlation information entropy (NCIE), and normalized mutual information ($Q_{MI}$) [33]. For the segmentation evaluation, we use the metric most commonly adopted for semantic understanding, the mean Intersection over Union (mIoU) [12].
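For reference, the mIoU metric can be computed from a confusion matrix accumulated over all test pixels, as in the sketch below; the class count and ignore index used in the example are assumptions rather than dataset-specific values.

```python
# A minimal sketch of mean Intersection over Union (mIoU) from a confusion matrix.
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU across classes, ignoring unlabeled pixels; classes absent from both maps are skipped."""
    mask = target != ignore_index
    pred, target = pred[mask], target[mask]
    # Confusion matrix: rows are ground-truth classes, columns are predicted classes.
    cm = np.bincount(num_classes * target.astype(int) + pred.astype(int),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = inter / union
    return np.nanmean(iou), iou

# Hypothetical usage on flattened label maps with an assumed class count of 9:
miou, per_class = mean_iou(np.random.randint(0, 9, 10000),
                           np.random.randint(0, 9, 10000), num_classes=9)
```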

3.1. Image Fusion Results Analysis

3.1.1. Subjective Analysis

We chose two daytime and two nighttime images, 00085 and 00381 from the FMB dataset and 01314 and 01538 from the MFNet dataset, for the subjective analysis of the experimental results.
For the pair of night images from the FMB dataset, the infrared targets are obvious in LTS, MetaLearning, TarDAL, Diff-IF, and TSFS, with TSFS showing the most salient, highest-contrast infrared target. In MDLatLRR, DenseFuse, LTS, MetaLearning, and TarDAL, the dark sky makes the scene difficult to distinguish from daytime. Comparing the results of the seven fusion algorithms with the fusion result of TSFS, it is clear that, in a well-lit daytime environment, TSFS effectively integrates the information of the scene, especially in the area marked by the blue box. The infrared target person in TSFS integrates the information of the corresponding region in the infrared image together with some texture detail from the visible image, as shown in the yellow box of the nighttime fusion result in Figure 5. Furthermore, in the daytime fusion results in Figure 5, the infrared target person in the yellow box is enhanced in both visual quality and clarity by incorporating the texture details of the visible image. Under night-vision conditions, objects in the visible images are dim and hard to recognize, whereas in the infrared images not only are the target people highly recognizable, but the thermal radiation of objects such as cars is also captured. This can be seen in the two pairs of nighttime infrared and visible images in Figure 5 and Figure 6. In the nighttime scene of Figure 6, the blue box shows that although the car in the visible image is dim and hard to recognize, the thermal radiation information in the infrared image compensates for this, and TSFS improves the recognizability of the car in the fused image.
The fusion of infrared and visible images requires the effective fusion of complementary information. Because infrared and visible sensors image differently in different environments, night and day scenes behave differently, as Figure 5 and Figure 6 show. In the dark, the visibility of objects in visible images drops sharply, while the recognizability of objects in infrared images improves markedly; effective image fusion can improve the recognizability of both the background and the objects. In the daytime, because of the single-channel imaging characteristics of infrared images, the clarity of the background information obtained from the visible image is degraded if effective features are not extracted correctly in the feature extraction stage. In summary, in the image fusion task, our method fuses both the salient infrared target information from the infrared image and the detailed texture information from the visible image, obtaining fusion results that are friendly to human visual perception.

3.1.2. Objective Analysis

The results in Table 1 show that our method achieves the best performance on SF, VIF, and NCIE and the second-best performance on MI and $Q_{MI}$. From the perspective of information transfer, the highest NCIE and the second-best MI and $Q_{MI}$ indicate that a significant amount of information is transferred from the source images. In addition, in terms of the statistical characteristics of the fused image itself, the best SF indicates that our approach yields higher image contrast, a clearer image, and richer image details. In terms of image quality relative to natural scenes, the highest VIF shows that our approach aligns better with the human visual system and is therefore more friendly to the human eye. In summary, the fused image has richer detailed texture and transfers more information from the source images, which demonstrates that TSFS is effective and accomplishes the image fusion task well, especially in harsh environments such as nighttime.

3.2. Segmentation Results Analysis

3.2.1. Subjective Analysis

To confirm the efficacy of TSFS in the segmentation experiments, we again selected a pair of night images and a pair of day images: 00634 and 00359 from the FMB dataset and 01342 and 01477 from the MFNet dataset. Lights in the visible images produce halos, which degrade the quality of the segmentation results. As shown in Figure 7, the segmentation results of all compared methods except TSFS are affected by this factor, with MetaLearning affected the most. Compared with the MFNet dataset, the scene annotations in the FMB dataset are more detailed; since MFNet has more unlabeled regions than FMB, people and cars occupy a large proportion of the MFNet annotations. As can be seen from Figure 8, although DenseFuse, LTS, MetaLearning, SeAFusion, and TarDAL all segment the car and the infrared target person relatively well, none of them correctly segments the curve on the left. Although EGFNet and LASNet segment the roadside on the left correctly, they produce redundant, incorrect segmentation in the area to the right of the infrared target person compared with the ground truth. Comparing the segmentation results in Figure 8, especially with EGFNet and LASNet, the semantic segmentation result of TSFS shows a more complete target outline, and the posture of the infrared target person is rendered more vividly.

3.2.2. Objective Analysis

The results in Table 2 show that our method achieves consistent advantages in the car and building categories and ranks second in the person category. Comparative analysis on the FMB dataset shows that TSFS improves mIoU by 3.3%. The results in Table 3 show that our method achieves consistent advantages in five categories, the exceptions being the person and cone categories, and ranks second in the person category. On the MFNet dataset, TSFS outperforms all other methods, with a 1.5% improvement in mIoU over LASNet. Cars and people occupy a large proportion of both the FMB and MFNet datasets, and TSFS performs particularly well in these categories. Compared with the per-category results on the FMB dataset, the improvements across categories on the MFNet dataset are more pronounced. In summary, TSFS demonstrates superior segmentation performance, obtaining the best mIoU while presenting complete target outlines.

3.3. Ablation Studies

To assess the necessity of the two stages in TSFS, we performed ablation experiments on the MFNet dataset and examined six variants. In experiment 1, we replaced the MFB with concatenation. In experiment 2, we removed the negative samples in $L_{cr}$. In experiment 3, we removed $L_{cr}$. In experiment 4, we retained S1 and removed S2. In experiment 5, we removed S1 and retained S2. In experiment 6, we replaced the CSFAM with local attention.
Figure 9 and Figure 10 show the effects of the different designs in the two stages on image fusion and semantic segmentation, respectively; removing any module leads to a performance decline. The results in Table 4 show that, except for the MI indicator, on which TSFS ranks second, the other four indicators, SF, VIF, NCIE, and $Q_{MI}$, all show a clear and consistent advantage. The experimental data indicate that, without $L_{cr}$, image contrast and clarity decrease and the mIoU also drops. If we eliminate the negative samples of the contrastive loss, some of the image's texture and detail is lost, as shown in Figure 9. The person behind the yellow warning sign in the yellow box in Figure 9 can also be easily identified in TSFS. If the second stage S2 is missing, that is, only S1 is present, the mIoU drops by 1.3%; if the first stage S1 is missing, that is, only S2 is present, the mIoU drops by 1.7%. A comparison of experiments 4, 5, and 6 with TSFS shows that, in TSFS, the target person's posture is more vivid and the edge contours are clearer; only our method obtains relatively accurate segmentation results for the target person. These results all emphasize the positive impact of the different designs in the two stages on the fusion and segmentation results.

4. Discussion

We designed a new two-stage image fusion and segmentation method, TSFS, using the FEM, FMFM, and CSFAM. We fuse multimodal information in a two-stage manner to enhance the accuracy of segmentation and the quality of the fused results. In the fusion module, the FEM and FMFM extract multimodal image information and fuse features; the proposed FMFM fuses features at multiple levels through the MFB and CAB. Compared with existing methods, TSFS shows clear advantages on most indicators. The experimental data show that TSFS is effective for image fusion and segmentation, as it considerably improves the visual effect of the fused image, the objective evaluation values, and the segmentation mIoU. Although the experimental results are strong, we only conducted comparative experiments on FMB and MFNet; further verification could be carried out on other infrared and visible image datasets. In future work, we aim to enhance the generalization ability of the model, consider its practical applications, and continue to improve its generalization and robustness across other datasets.

5. Conclusions

To explore the intrinsic connection between image fusion and segmentation in a more self-consistent way, we propose a new method called TSFS. The semantic information in the fused image is enhanced by cascading the fusion module and the segmentation module. The segmentation results of the fused image are passed to the multimodal segmentation model in the form of pre-trained model parameters, reducing the loss of semantic information caused by the absence of the multimodal source images. We designed the FEM and FMFM in the fusion module and fed the fusion results of the S1 fusion module into the segmentation module to obtain a semantic loss, guiding the fused images generated in S1 to carry more semantic information. We use the CSFAM in the second stage S2 to fuse features in a cross-fusion manner along different directions to improve segmentation accuracy. Extensive comparisons and ablation studies on the FMB and MFNet datasets show that the proposed TSFS improves both the visual quality of the fused images and the segmentation mIoU. Future research will continue to investigate the joint training of image fusion with other high-level computer vision tasks and to improve the generalization ability of the model.

Author Contributions

Conceptualization, W.R.; methodology, W.R.; software, W.R.; validation, W.R. and L.L.; formal analysis, W.R.; data curation, W.R.; writing—original draft preparation, W.R.; writing—review and editing, W.R., J.R. and L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (62262016, 62302132) and the 14th Five-Year Plan Civil Aerospace Technology Preliminary Research Project (D040405).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. Han, J.; Bhanu, B. Fusion of color and infrared video for moving human detection. Pattern Recognit. 2007, 40, 1771–1784. [Google Scholar] [CrossRef]
  2. Sun, Y.; Yan, K.; Li, W. CycleGAN-based SAR-optical image fusion for target recognition. Remote Sens. 2023, 15, 5569. [Google Scholar] [CrossRef]
  3. Chandrakanth, V.; Murthy, V.; Channappayya, S. Siamese cross-domain tracker design for seamless tracking of targets in RGB and thermal videos. IEEE Trans. Artif. Intell. 2022, 4, 161–172. [Google Scholar]
  4. Maqsood, S.; Javed, U.; Riaz, M.; Muzammil, M.; Muhammad, F.; Kim, S. Multiscale image matting based multi-focus image fusion technique. Electronics 2020, 9, 472. [Google Scholar] [CrossRef]
  5. Li, X.; Li, Y.; Chen, H.; Peng, Y.; Chen, L.; Wang, M. RITFusion: Reinforced interactive transformer network for infrared and visible image fusion. IEEE Trans. Instrum. Meas. 2023, 73, 1–16. [Google Scholar] [CrossRef]
  6. Liu, Y.; Chen, X.; Peng, H.; Wang, Z. Multi-focus image fusion with a deep convolutional neural network. Inf. Fusion. 2017, 36, 191–207. [Google Scholar] [CrossRef]
  7. Yang, Y.; Liu, J.; Huang, S.; Wan, W.; Wen, W.; Guan, J. Infrared and visible image fusion via texture conditional generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4771–4783. [Google Scholar] [CrossRef]
  8. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  9. Li, H.; Wu, X.; Durrani, T. NestFuse: An infrared and visible image fusion architecture based on nest connection and spatial/channel attention models. IEEE Trans. Instrum. Meas. 2020, 69, 9645–9656. [Google Scholar] [CrossRef]
  10. Wang, D.; Liu, J.; Fan, X.; Liu, R. Unsupervised misaligned infrared and visible image fusion via cross-modality image generation and registration. arXiv 2022, arXiv:2205.11876. [Google Scholar] [CrossRef]
  11. Zhu, J.; Park, T.; Isola, P.; Efros, A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  12. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion. 2022, 82, 28–42. [Google Scholar] [CrossRef]
  13. Tang, L.; Deng, Y.; Ma, Y.; Huang, J.; Ma, J. SuperFusion: A versatile image registration and fusion network with semantic awareness. IEEE/CAA J. Autom. Sin. 2022, 9, 2121–2137. [Google Scholar] [CrossRef]
  14. Chang, K.; Huang, J.; Sun, X.; Luo, J.; Bao, S.; Huang, H. Infrared and visible image fusion network based on multistage progressive injection. Complex Intell. Syst. 2025, 11, 367. [Google Scholar] [CrossRef]
  15. Wang, C.; Wu, J.; Zhu, Z.; Chen, H. MSFNet: MultiStage Fusion Network for infrared and visible image fusion. Neurocomputing 2022, 507, 26–39. [Google Scholar] [CrossRef]
  16. Cao, Y.; Luo, X.; Tong, X.; Yang, J.; Cao, Y. Infrared and visible image fusion based on a two-stage class conditioned auto-encoder network. Neurocomputing 2023, 544, 126248. [Google Scholar] [CrossRef]
  17. Zheng, X.; Yang, Q.; Si, P.; Wu, Q. A Multi-Stage Visible and Infrared Image Fusion Network Based on Attention Mechanism. Sensors 2022, 22, 3651. [Google Scholar] [CrossRef]
  18. Huang, S.; Kong, X.; Yang, Y.; Wan, W.; Song, Z. FTSFN: A Two-Stage Feature Transfer and Supplement Fusion Network for Infrared and Visible Image Fusion. IEEE Trans. Instrum. Meas. 2025, 74, 1–15. [Google Scholar] [CrossRef]
  19. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  20. Wang, J.; Mao, Y.; Ma, X.; Guo, S.; Shao, Y.; Lv, X.; Han, W.; Christopher, M.; Zangwill, L.M.; Bi, Y. ODFormer: Semantic Fundus Image Segmentation Using Transformer for Optic Nerve Head Detection. Inf. Fusion. 2024, 112, 102533. [Google Scholar] [CrossRef]
  21. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  22. Liu, J.; Liu, Z.; Wu, G.; Ma, L.; Liu, R.; Zhong, W.; Luo, Z.; Fan, X. Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–8 October 2023; pp. 8115–8124. [Google Scholar]
  23. Ha, Q.; Watanabe, K.; Karasawa, T.; Ushiku, Y.; Harada, T. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 5108–5115. [Google Scholar]
  24. Li, H.; Wu, X.; Kittler, J. MDLatLRR: A novel decomposition method for infrared and visible image fusion. IEEE Trans. Image Process. 2020, 29, 4733–4746. [Google Scholar] [CrossRef] [PubMed]
  25. Li, H.; Wu, X. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2018, 28, 2614–2623. [Google Scholar] [CrossRef]
  26. Liu, J.; Wu, Y.; Wu, G.; Liu, R.; Fan, X. Learn to search a lightweight architecture for target-aware infrared and visible image fusion. IEEE Signal Process. Lett. 2022, 29, 1614–1618. [Google Scholar] [CrossRef]
  27. Li, H.; Cen, Y.; Liu, Y.; Chen, X.; Yu, Z. Different input resolutions and arbitrary output resolution: A meta learning-based deep framework for infrared and visible image fusion. IEEE Trans. Image Process. 2021, 30, 4070–4083. [Google Scholar] [CrossRef] [PubMed]
  28. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 5802–5811. [Google Scholar]
  29. Li, H.; Cen, Y.; Liu, Y.; Chen, X.; Yu, Z. Diff-IF: Multi-modality image fusion via diffusion model with fusion knowledge prior. Inf. Fusion. 2024, 110, 102450. [Google Scholar]
  30. Zhou, W.; Dong, S.; Xu, C.; Qian, Y. Edge-aware guidance fusion network for rgb–thermal scene parsing. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI), Virtual, 22–28 February 2022; pp. 3571–3579. [Google Scholar]
  31. Li, G.; Wang, Y.; Liu, Z.; Zhang, X.; Zeng, D. RGB-T semantic segmentation with location, activation, and sharpening. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 1223–1235. [Google Scholar] [CrossRef]
  32. Han, Y.; Cai, Y.; Cao, Y.; Xu, X. A new image fusion performance metric based on visual information fidelity. Inf. Fusion. 2013, 14, 127–135. [Google Scholar] [CrossRef]
  33. Liu, Z.; Blasch, E.; Xue, Z.; Zhao, J.; Laganiere, R.; Wu, W. Objective assessment of multiresolution image fusion algorithms for context enhancement in night vision: A comparative study. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 94–109. [Google Scholar] [CrossRef]
Figure 1. Examples of visible images, infrared images, and fusion results.
Figure 2. The network frame of TSFS.
Figure 3. The network frame of S1.
Figure 4. The network frame of S2.
Figure 5. Fusion results on the FMB dataset.
Figure 6. Fusion results on the MFNet dataset.
Figure 7. Segmentation results on the FMB dataset.
Figure 8. Segmentation results on the MFNet dataset.
Figure 9. Fusion results of ablation studies on the MFNet dataset. (a) Visible image. (b) Infrared image. (c) TSFS. (d) Replaced MFB with concatenate. (e) w/o Negative sample. (f) w/o $L_{cr}$. (g) Only S1.
Figure 10. Segmentation results of ablation studies on the MFNet dataset. (a) Visible image. (b) Infrared image. (c) Ground Truth. (d) Replaced MFB with concatenate. (e) w/o Negative sample. (f) w/o $L_{cr}$. (g) Only S1. (h) Only S2. (i) Replaced CSFAM with Local Attention. (j) TSFS.
Table 1. Objective evaluation results of the fusion methods on the FMB and MFNet datasets.

Dataset  Method        SF     MI     VIF    NCIE   Q_MI
FMB      MDLatLRR      0.056  2.855  0.873  0.807  0.425
         DenseFuse     0.033  3.175  0.779  0.808  0.472
         LTS           0.049  3.287  0.985  0.808  0.468
         MetaLearning  0.047  4.111  0.854  0.812  0.604
         SeAFusion     0.053  3.833  0.953  0.810  0.569
         TarDAL        0.046  3.384  0.850  0.809  0.492
         Diff-IF       0.054  4.509  0.882  0.815  0.679
         TSFS          0.066  4.443  0.997  0.815  0.647
MFNet    MDLatLRR      0.057  2.168  0.859  0.805  0.336
         DenseFuse     0.029  2.277  0.761  0.805  0.368
         LTS           0.046  2.617  0.887  0.806  0.411
         MetaLearning  0.046  3.199  0.763  0.809  0.522
         SeAFusion     0.052  2.559  0.902  0.806  0.400
         TarDAL        0.039  2.439  0.770  0.806  0.387
         Diff-IF       0.051  3.679  0.926  0.811  0.597
         TSFS          0.058  3.496  0.989  0.812  0.535
Table 2. Objective evaluation results (per-class IoU, %) of the segmentation methods on the FMB dataset.

Method        Car   Person  T-Lamp  T-Sign  Building  Vegetation  Pole  mIoU
DenseFuse     72.8  54.2    40.3    71.0    78.6      83.6        40.6  52.2
LTS           72.5  58.5    40.9    69.7    76.3      82.8        40.5  51.7
MetaLearning  70.0  50.2    34.8    62.8    76.6      75.8        36.4  48.1
SeAFusion     74.0  59.3    40.4    73.4    81.2      84.8        43.3  54.3
TarDAL        72.0  60.0    39.9    59.8    76.9      82.7        34.1  49.3
Diff-IF       72.0  58.3    38.3    71.9    78.7      83.3        41.7  51.2
TSFS          81.1  59.3    39.1    70.0    83.9      83.2        30.3  57.6
Table 3. Objective evaluation results (per-class IoU, %) of the segmentation methods on the MFNet dataset.

Method        Car   Person  Bike  Curve  Car Stop  Cone  Bump  mIoU
EGFNet        87.6  69.8    58.8  42.8   33.8      7.0   47.1  54.8
LASNet        84.2  67.1    56.9  41.1   39.6      18.9  40.1  54.9
DenseFuse     83.4  69.9    60.8  29.5   28.4      51.7  45.9  51.9
LTS           83.4  68.7    59.8  27.5   24.7      50.6  44.7  50.8
MetaLearning  77.9  34.5    56.0  21.1   23.5      45.9  38.3  43.9
SeAFusion     85.5  71.3    60.0  35.4   29.9      51.1  46.9  53.1
TarDAL        82.1  69.2    56.5  34.1   27.3      49.8  42.8  50.5
Diff-IF       85.4  72.2    59.6  32.4   25.8      48.9  48.6  52.3
TSFS          88.0  71.8    61.2  44.3   45.0      46.3  53.1  56.4
Table 4. Objective evaluation results of the ablation studies on the MFNet dataset.

Variant                  SF     MI     VIF    NCIE   Q_MI   mIoU
Concatenate              0.048  3.383  0.946  0.809  0.514  55.2
w/o Neg                  0.051  3.499  0.907  0.811  0.528  55.2
w/o L_cr                 0.029  2.549  0.814  0.806  0.412  55.6
Only S1                  0.058  3.496  0.989  0.812  0.535  55.1
Only S2                  -      -      -      -      -      54.7
Local Attention (Exp 6)  0.058  3.496  0.989  0.812  0.535  52.3
TSFS                     0.058  3.496  0.989  0.812  0.535  56.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
