Article

Dual-Branch Cross-Fusion Normalizing Flow for RGB-D Track Anomaly Detection

School of Physical Science and Technology, Southwest Jiaotong University, Chengdu 610031, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(8), 2631; https://doi.org/10.3390/s25082631
Submission received: 18 February 2025 / Revised: 16 April 2025 / Accepted: 18 April 2025 / Published: 21 April 2025
(This article belongs to the Section Sensing and Imaging)

Abstract

With the ease of acquiring RGB-D images from line-scan 3D cameras and the development of computer vision, anomaly detection is now widely applied to railway inspection. Because 2D anomaly detection is susceptible to capture conditions, combining depth maps is now being explored in industrial inspection to reduce these interferences. To this end, this paper proposes a novel approach for RGB-D anomaly detection called Dual-Branch Cross-Fusion Normalizing Flow (DCNF). In this work, we exploit a fusion strategy for a dual-branch normalizing flow with multi-modal inputs, applied to the field of track inspection. On the one hand, we introduce the Mutual Perception module to acquire cross-complementary prior knowledge at an early stage. On the other hand, we exploit the effectiveness of the Fusion Flow to fuse the dual branches of RGB-D inputs. We experiment on the real-world Track Anomaly (TA) dataset, where DCNF achieves an impressive AUROC score of 98.49%, which is 3.74% higher than the second-best method.

1. Introduction

The popularization of high-speed trains has made rail travel increasingly convenient. As running time accumulates, regular track inspection becomes critical to train safety. Early industrial inspection was usually performed manually. However, the efficiency of this approach is low, and it fails to meet the increasing demand for inspection [1,2]. With the development of deep learning, computer vision has shown excellent performance in automatic industrial detection [3,4], and deep learning is gradually being applied to high-speed train inspection. With a line-scan 3D camera mounted on the bottom of the inspection train, track images can be collected automatically, and deep learning helps us automate anomaly detection.
In anomaly detection tasks, it is hard to build a supervised model because labeled anomaly data are scarce, which presents significant challenges in industrial detection [5,6]. Consequently, most works model the distribution of normal data, and samples that do not match the normal distribution are considered anomalous. Recent works focus on anomaly detection at the 2D level, usually using the color information of the image to determine whether there is an anomaly [7,8]. However, in industrial detection tasks, using 2D images alone may cause false positives or missed detections. In this paper, we extend 2D anomaly detection to 3D by combining depth maps to take advantage of their shape information.
In industrial applications, the acquisition of depth maps by line-scan 3D cameras has obvious advantages over the acquisition of point clouds by 3D cameras. Firstly, the line-scan 3D camera has the capacity for fast scanning, which can simultaneously obtain the RGB images and depth maps of the whole object in a short time. Secondly, the line-scan 3D camera provides the depth information of each pixel, which ensures high-precision measurement and detection results. In addition, for the shape scanning of moving objects, line-scan 3D cameras are more convenient and suitable for scenes that require real-time monitoring or measurement. At the same time, due to the simplicity of depth map data processing, it can be directly applied to computer vision algorithms, which greatly simplifies the subsequent data processing flow. In this case, the objective of this paper is to build an RGB-D anomaly detection network for RGB images and depth maps, with the aim of enhancing the efficiency and performance of industrial detection.
The fusion strategies for RGB images and depth maps can be categorized as either early fusion or late fusion [9]. In early fusion, the RGB images and depth maps are concatenated to form a four- or six-channel input. Late fusion typically establishes two branches for RGB images and depth maps, subsequently integrating their high-level features. The present paper introduces a novel cross-fusion strategy for RGB-D anomaly detection called Dual-Branch Cross-Fusion Normalizing Flow (DCNF). Inspired by [10], we adopt the concept of multi-scale normalizing flow and set up a dual-branch normalizing flow framework to process the RGB images and depth maps separately. Besides this, we propose the Mutual Perception module (MP) to facilitate early-stage information exchange between the two branches, followed by the fusion of the two branches through a Fusion Flow in order to derive the anomaly score.
We evaluate the performance of DCNF on the Track Anomaly (TA) dataset, which is collected from real-world track anomalies. DCNF achieves state-of-the-art accuracy on this dataset, and the results demonstrate its effectiveness.
The contributions of this paper are outlined as follows:
  • Introduction of a novel RGB-D anomaly detection method in combination with depth maps to take advantage of the shape information;
  • Proposal of a dual-branch normalizing flow with cross-fusion strategy that fuses RGB images and depth maps;
  • Introduction of the Mutual Perception module (MP) and Fusion Flow to improve the compatibility of RGB-D data;
  • Achievement of advanced accuracy on the TA dataset.
The remainder of this article is structured as follows: Section 2 provides a concise overview of prior research relevant to our methodology. The proposed DCNF is elaborated in detail in Section 3. Section 4 presents extensive experiments and analyses conducted to evaluate the effectiveness of our approach. Finally, in Section 5, a comprehensive conclusion summarizing the key findings of this article is provided.

2. Materials and Methods

2.1. 2D Anomaly Detection

Since the proposal of MVTec AD [11], which has become a benchmark for 2D anomaly detection, 2D anomaly detection has received increasing attention. Most works train only on defect-free data, so the problem is treated as an unsupervised task.

2.1.1. Reconstruction-Based Methods

Reconstruction-based methods model the normal distribution and are used to reconstruct unseen data. Such a model can generate defect-free data that accord with the normal distribution well, but it fails to reconstruct anomalies [12]. The autoencoder (AE) is a typical neural network model of this kind. It realizes feature learning and information reconstruction of the input data by constructing a symmetric "encoding–decoding" structure. Specifically, the encoder module maps the high-dimensional input data to a low-dimensional latent space through nonlinear transformations, and the decoder module restores the compressed feature vector to the original data form. GANomaly is a classic reconstruction-based method [13] that trains generative adversarial networks (GANs) to model the distribution of normal data and subsequently reconstructs the input data. However, these methods rely heavily on the quality and quantity of training data, which may limit their performance on small datasets.

2.1.2. Embedding Similarity-Based Methods

Another category is embedding similarity-based methods, wherein an ImageNet-pretrained feature extractor is employed to extract image feature vectors [14]. Cohen and Hoshen introduced SPADE [8], which incorporates a multiresolution feature pyramid and employs k-nearest neighbor (kNN) search to effectively exploit deep pretrained features. Defard et al. introduced PaDiM [15], which maps the features from the pretrained network to a Gaussian distribution and builds a corresponding memory bank. Despite the use of patch anomaly insertion techniques, these methods still encounter difficulties in detecting minute or intricate anomalies. Therefore, these approaches may not be suitable for tasks involving the segmentation of fine anomalies [16].

2.1.3. Normalizing Flows-Based Methods

In recent years, normalizing flows (NFs) have been gaining increasing attention by virtue of their efficient reversible transformations. Following the same principle as Real-NVP [17], NFs map the original distribution into a latent space and map it back to the original sample distribution through invertible functions. With the rise of computer vision, NFs are gradually being used in anomaly detection. DifferNet [18] leverages NFs to estimate the density of descriptive features extracted by convolutional neural networks. Following this, Gudovskiy et al. proposed CFlow [19], a more computationally and memory-efficient anomaly detection framework based on conditional NFs. This structure not only retains the likelihood estimation capability of DifferNet, but also realizes explicit probability estimation of the encoded features by introducing a conditional normalizing flow. Specifically, CFlow uses a convolutional neural network (CNN) as the encoder in the feature extraction stage and a reversible normalizing flow network in the decoding stage. Through this design, anomaly localization at the pixel level is realized.
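As a brief recap, all of the NF-based detectors above rely on the change-of-variables formula used by Real-NVP [17]: for an invertible mapping $z = f(x)$ with a simple base density $p_Z$, the exact log-likelihood of a sample is

$$\log p_X(x) = \log p_Z\big(f(x)\big) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|$$

so a low likelihood under a flow trained only on normal data can be read directly as an anomaly score.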

2.2. RGB-D Fusion Strategies

Early RGB-D fusion strategies tended to utilize hand-crafted feature extractors and then fuse the RGB images and depth maps [20,21]. Despite the effectiveness of hand-crafted features, such low-level features still lack generalizability. With the increasing number of works on deep learning, the question emerges of how to leverage multi-modal correlations and multi-level information.
Currently, RGB-D fusion strategies are widely investigated in salient object detection tasks. Existing RGB-D fusion strategies are broadly categorized into early fusion and late fusion. Early fusion simply concatenates RGB images and depth maps into four- or six-channel inputs [22,23]. Late fusion usually builds two branches that process RGB images and depth maps separately and fuses the resulting high-level features at a later stage [24,25]. Wu et al. [26] introduced layer-wise attention and trident spatial attention mechanisms to weigh the early and late fusion of RGB and depth features and to address depth dissonance. Inspired by the salient object detection task, the goal of our paper is to build an RGB-D fusion strategy that effectively incorporates depth maps to take advantage of their shape information.

3. Method

The proposed Dual-Branch Cross-Fusion Normalizing Flow (DCNF) is a dual-branch normalizing flow strategy with cross-fusion. As shown in Figure 1, we first feed the RGB images and depth maps into the two branches of the dual-branch network. Subsequently, the Mutual Perception module (MP) allows each branch to acquire complementary information from the other branch, followed by a pretrained extractor that extracts the cross-complementary features. The dual-branch feature maps are then sent into the fully convolutional normalizing flow branches, and we exchange the cross-complementary features through the Fusion Flow.

3.1. Mutual Perception Module (MP)

The fusion of RGB and depth features encounters two primary challenges. One is the compatibility issue arising from multi-modal differences, and the other is the presence of redundancy and interference in low-quality depth features. Building upon [27], we introduce the MP to enhance the compatibility among multi-modal features and extract informative cues from the depth features. The architecture of MP is illustrated in Figure 2.
Specifically, $x_1$ and $x_2$ denote the inputs of the RGB and depth branches, respectively. We define $x_1$ as the primary input and $x_2$ as the secondary input. The objective is to enable $x_1$ to incorporate complementary information from $x_2$, thereby acquiring cross-complementary prior knowledge. The calculation process is delineated as follows:

$$F_{MP}(x_1) = CA(x_1) \otimes x_2, \qquad x_1' = SA\big(F_{MP}(x_1)\big) \otimes F_{MP}(x_1) \oplus x_2$$

where $CA(\cdot)$ denotes the Channel Attention module, $SA(\cdot)$ denotes the Spatial Attention module, and $\otimes$ and $\oplus$ denote element-wise multiplication and addition. $CA(\cdot)$ aims to utilize the inter-channel relations between the depth and RGB features, and $SA(\cdot)$ determines the positions that carry the depth features. $CA(\cdot)$ and $SA(\cdot)$ are implemented as

$$CA(x_i) = M\big(P_{Ada}(x_i)\big), \qquad SA(x_i) = Conv\big(P_{max}(x_i)\big)$$

where $M(\cdot)$ denotes the multi-layer perceptron, $P_{Ada}(\cdot)$ represents the adaptive max pooling applied to each branch of features, and $P_{max}(\cdot)$ signifies the global max pooling applied at each point of the feature map across the channel axis.
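To make the module concrete, a minimal PyTorch sketch of this MP design is given below, assuming element-wise products for $\otimes$ and addition for $\oplus$; the class name, channel-reduction ratio, and 7 × 7 spatial-attention kernel are illustrative choices rather than details taken from the paper.

```python
import torch
import torch.nn as nn


class MutualPerception(nn.Module):
    """Sketch of MP: the primary input x1 absorbs complementary cues from x2."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # CA: adaptive max pooling followed by a small MLP M (1x1 convolutions).
        self.channel_att = nn.Sequential(
            nn.AdaptiveMaxPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # SA: channel-wise max pooling followed by a convolution (kernel size assumed).
        self.spatial_att = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        # F_MP(x1) = CA(x1) (x) x2: channel weights from the primary branch re-weight x2.
        f_mp = self.channel_att(x1) * x2
        # SA(F_MP(x1)) locates the positions carrying useful depth cues.
        attn = self.spatial_att(f_mp.max(dim=1, keepdim=True).values)
        # x1' = SA(F_MP(x1)) (x) F_MP(x1) (+) x2, as in the reconstruction above.
        return attn * f_mp + x2
```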

3.2. Feature Extraction

The study conducted by Schirrmeister et al. [28] has demonstrated that the feature extractors, which are pre-trained on the ImageNet dataset, effectively capture the representative distribution of the data. Relevant features for anomaly detection can be extracted by pretrained CNNs [8]. The third layer of Wide-Resnet50-2 is utilized in this paper for feature extraction, enabling the inclusion of more intricate spatial details. While pixel-level anomaly localization tasks inherently prioritize spatial structure over semantic understanding, image-level anomaly detection tasks rely heavily on localized anomaly regions, making detail-rich feature maps more suitable. In order to minimize computational expenses, we incorporate a pooling layer following feature extraction to down-sample the extracted features. The features can be expressed as:
$$y_i = AvgPool\big(E(x_i)\big)$$
where $E(\cdot)$ denotes the CNN feature extractor and $AvgPool(\cdot)$ denotes the average pooling layer.
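As a concrete reference, the extraction stage can be sketched as follows, assuming torchvision's wide_resnet50_2 as the ImageNet-pretrained backbone (the layer names below are torchvision's); the pooling kernel and stride are illustrative choices.

```python
import torch
import torch.nn as nn
from torchvision.models import wide_resnet50_2, Wide_ResNet50_2_Weights


class Layer3Extractor(nn.Module):
    """Frozen ImageNet-pretrained backbone truncated after layer3 (1024 channels)."""

    def __init__(self):
        super().__init__()
        backbone = wide_resnet50_2(weights=Wide_ResNet50_2_Weights.IMAGENET1K_V1)
        self.stem = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
            backbone.layer1, backbone.layer2, backbone.layer3,
        )
        for p in self.stem.parameters():
            p.requires_grad_(False)  # the extractor is frozen during training
        # Average pooling to down-sample the features (kernel/stride assumed).
        self.pool = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y_i = AvgPool(E(x_i))
        return self.pool(self.stem(x))
```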

3.3. Fully Convolutional Normalizing Flow

The DCNF model consists of two independent fully convolutional normalizing flow branches with distinct weights. Each branch consists of $n$ flow blocks, which encode the feature map $y_i$ and transform it into the latent feature map $z_i = f_{1\,\mathrm{or}\,2}(y_i)$. The two normalizing flow branches share the same architecture, as illustrated in Figure 3; we use $3 \times 3$ convolutions to construct the normalizing flow branches, which automatically capture the spatial context information that is ignored in [19]. Specifically, the S-T Net is used to calculate the scale weights $w_i^s$ and shift weights $w_i^t$:
$$Net_{ST}(\cdot) = Conv_{3\times3}\Big(ReLU\big(LN\big(Conv_{3\times3}(\cdot)\big)\big)\Big)$$
where $Conv_{3\times3}(\cdot)$ represents the $3 \times 3$ convolution, $ReLU$ refers to the ReLU activation function, and $LN$ denotes layer normalization. The incorporation of $LN$ enhances training stability and elevates model performance.
The bijective reversible normalizing flow can be implemented as follows:
$$\mathrm{forward:}\quad X_1 = Z_1, \qquad X_2 = Z_2 \odot \exp\big(W_s(Z_1)\big) + W_t(Z_1)$$
$$\mathrm{reverse:}\quad Z_1 = X_1, \qquad Z_2 = \big(X_2 - W_t(X_1)\big) \odot \exp\big(-W_s(X_1)\big)$$
where $W_s$ and $W_t$ generate the scale weights $w_i^s$ and shift weights $w_i^t$. The Jacobian determinant can be calculated as
$$\log\left|\det J_{f^{-1}}\right| = \sum W_s(X_1)$$
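As a concrete reference, a minimal PyTorch sketch of one such affine coupling block is shown below; the even channel split, the unclamped scale, and the use of GroupNorm(1, C) as a layer-normalization stand-in are illustrative assumptions rather than the paper's exact implementation (an invertible-network library such as FrEIA could equally be used).

```python
import torch
import torch.nn as nn


def st_net(c_in: int, c_out: int) -> nn.Sequential:
    # Net_ST = Conv3x3 -> LN -> ReLU -> Conv3x3; GroupNorm(1, C) stands in for LN here.
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, kernel_size=3, padding=1),
        nn.GroupNorm(1, c_in),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
    )


class CouplingBlock(nn.Module):
    """One fully convolutional affine coupling step of a normalizing flow branch."""

    def __init__(self, channels: int):
        super().__init__()
        self.c1 = channels // 2
        self.c2 = channels - self.c1
        # A single S-T net predicts both scale W_s and shift W_t from the first half.
        self.subnet = st_net(self.c1, 2 * self.c2)

    def forward(self, z: torch.Tensor):
        z1, z2 = z.split([self.c1, self.c2], dim=1)
        w_s, w_t = self.subnet(z1).chunk(2, dim=1)
        x2 = z2 * torch.exp(w_s) + w_t              # X2 = Z2 * exp(W_s(Z1)) + W_t(Z1)
        log_det = w_s.flatten(1).sum(dim=1)         # log|det J| = sum of the W_s terms
        return torch.cat([z1, x2], dim=1), log_det

    def reverse(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.split([self.c1, self.c2], dim=1)
        w_s, w_t = self.subnet(x1).chunk(2, dim=1)
        z2 = (x2 - w_t) * torch.exp(-w_s)           # Z2 = (X2 - W_t(X1)) * exp(-W_s(X1))
        return torch.cat([x1, z2], dim=1)
```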

3.4. Fusion Flow

The initial stage of network operation involves only a simple interaction between the two branches. However, the primary concern lies in effectively integrating the features from both branches. To this end, we propose the Fusion Flow to fuse the features, followed by the normalizing flow. The proposed Fusion Flow is able to capture different parts of the information perceived by the two branches.
The Fusion Flow first fuses the dual-branch inputs $a_1$ and $a_2$ and then trains the scale weights $w_i^s$ and shift weights $w_i^t$ of the fusion flow:
$$z_1^{fuse},\ z_2^{fuse} = g_{fuse}(a_1, a_2)$$
The $g_{fuse}(\cdot)$ is depicted in Figure 4; the Fusion Flow integrates the branches by means of an average pooling layer and establishes their connection. Subsequently, the two branches are merged using a $3 \times 3$ convolution. The resulting feature map is then restored to its original size through up-sampling. Finally, the fused features are combined with their respective inputs through element-wise addition to obtain the fused features $z_1^{fuse}$ and $z_2^{fuse}$.
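The data path of Figure 4 can be sketched as follows. The sketch illustrates only the pooling, convolution, up-sampling and element-wise addition steps and omits the invertibility bookkeeping (scale/shift weights and Jacobian terms) that the actual Fusion Flow trains; the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionFlowMixer(nn.Module):
    """Sketch of g_fuse: pool both branches, mix with a 3x3 conv, upsample, add back."""

    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=2)
        # Mix the pooled RGB and depth features, concatenated along the channel axis.
        self.mix = nn.Conv2d(2 * channels, 2 * channels, kernel_size=3, padding=1)

    def forward(self, a1: torch.Tensor, a2: torch.Tensor):
        pooled = self.pool(torch.cat([a1, a2], dim=1))
        mixed = self.mix(pooled)
        # Restore the original spatial size before the element-wise addition.
        mixed = F.interpolate(mixed, size=a1.shape[-2:], mode="bilinear",
                              align_corners=False)
        m1, m2 = mixed.chunk(2, dim=1)
        return a1 + m1, a2 + m2   # z1_fuse, z2_fuse
```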

3.5. Learning Objective and Post Processing

Similar to [29], our training objective function is derived as:
$$loss = \sum_{i=1}^{2} \frac{\left\| z_i^{fuse} \right\|_2^2}{2} - \log\left|\det J_{f^{-1}}\right|$$
where $\left\| \cdot \right\|_2^2$ represents the squared $\ell_2$-norm of a vector in $n$-dimensional Euclidean space and $\det J_{f^{-1}}$ denotes the Jacobian determinants of the fully convolutional normalizing flows and fusion flows.
The main objective of the anomaly detection task is to locate the anomaly. Previous works have typically mapped $z$ to a latent space and generated an anomaly score. In this paper, we define anomaly scores for each local position. Specifically, we upsample $z$ back to the original size and compute the mean of its square over the channel dimension to obtain the anomaly score at each location, as follows:
$$Score_i = \mathrm{mean}\left(\left(z_i^{fuse}\right)^2,\ dim = 1\right)$$
Different from previous methods that take either the maximum score of a pixel-by-pixel anomaly score map or the average score as the global anomaly score [19,20], our DCNF takes the average of the $TopK$ largest anomaly scores in the spatial dimension as the global anomaly score, as follows:
$$S = \frac{1}{K}\sum_{i=1}^{K} TopK\left(Score_i\right)$$
where $Score_i$ represents the anomaly score of each pixel; the maximum and the average are the special cases obtained when $K$ is set to $1$ or $H_{img} \times W_{img}$, respectively.
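To make the objective and post-processing concrete, the following is a minimal PyTorch sketch under illustrative assumptions about tensor shapes (batch-first feature maps) and bilinear upsampling; it is not the paper's released code.

```python
import torch
import torch.nn.functional as F


def flow_loss(z_fuse_list, log_det_jac):
    # loss = sum_i ||z_i_fuse||_2^2 / 2 - log|det J_f^-1|
    squared = sum(z.flatten(1).pow(2).sum(dim=1) for z in z_fuse_list)
    return (squared / 2.0 - log_det_jac).mean()


def anomaly_score(z_fuse, img_size, top_k_ratio=0.05):
    # Per-pixel score: upsample z back to image size, then average its square over channels.
    z_up = F.interpolate(z_fuse, size=img_size, mode="bilinear", align_corners=False)
    score_map = z_up.pow(2).mean(dim=1)                     # shape (B, H_img, W_img)
    # Global score: mean of the TopK largest per-pixel scores, K = ratio * H_img * W_img.
    k = max(1, int(top_k_ratio * img_size[0] * img_size[1]))
    global_score = score_map.flatten(1).topk(k, dim=1).values.mean(dim=1)
    return score_map, global_score
```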

4. Experiment

4.1. Datasets and Metrics

Since there is no public dataset for this RGB-D anomaly detection task, we collected a real-world Track Anomaly (TA) dataset, which covers foreign objects on rail tracks and provides an RGB image and a depth map for each sample. The TA dataset contains 580 defect-free images in the training set and 330 images in the test set, which contains both defect and defect-free images. Figure 5 shows examples of the images in the TA dataset. The majority of anomalies in track anomaly detection are foreign objects, and the TA dataset contains both large-scale and small-scale foreign objects, which we consider suitable for industrial track detection scenarios.
Similar to current anomaly detection methods, we assess the effectiveness of our proposed DCNF using the Area Under Receiver Operator Curve (AUROC) [30] and evaluate its anomaly detection performance. Besides this, we compute the recall as the detection rate of anomalies for industrial evaluation.

4.2. Experimental Details

We utilize the output of layer 3, with 1024 channels, of a WideResNet-50 (WRN-50) pretrained on ImageNet as the feature extractor for extracting higher-level image features. To reduce the computational effort, we feed the extracted features into an average pooling layer for down-sampling. In the fully convolutional normalizing flow branches, we set the number of normalizing flow blocks to $n = 8$ for all experiments. We resize the images to $512 \times 512$ for the TA dataset. In our implementation, we adopt the Adam optimizer with a learning rate of $1 \times 10^{-4}$. In the post-processing stage, we set $TopK$ to $0.05 \times H_{img} \times W_{img}$. We train the DCNF for 120 epochs on the TA dataset. These experiments are carried out on an NVIDIA RTX 3090 24G GPU, which is manufactured by NVIDIA and procured in China.
Specifically, we initialize the feature extractor using the initial weights pretrained from ImageNet and freeze its parameters during training. All other modules use manual seed 3407 to initialize the weights and update them during training.
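For reference, the hyperparameters reported in this subsection can be gathered into a small configuration sketch; the dataclass and its field names are our own convenience, not identifiers from the paper's code.

```python
from dataclasses import dataclass

import torch


@dataclass
class TrainConfig:
    image_size: int = 512        # images resized to 512 x 512
    n_flow_blocks: int = 8       # normalizing flow blocks per branch
    lr: float = 1e-4             # Adam learning rate
    epochs: int = 120            # training epochs on the TA dataset
    top_k_ratio: float = 0.05    # TopK = 0.05 * H_img * W_img
    seed: int = 3407             # manual seed for the trainable modules


cfg = TrainConfig()
torch.manual_seed(cfg.seed)
# optimizer = torch.optim.Adam(model.parameters(), lr=cfg.lr)  # 'model' assembled elsewhere
```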

4.3. Quantitative Comparison

The performance of DCNF is compared with that of prior methods, including GANomaly [13], PaDiM [15], CFA [31], CFlow [19] and MSFlow [10]. Since these methods are all built on 2D images, we evaluate them using the RGB images in the TA dataset. The results presented in Table 1 demonstrate the exceptional performance of DCNF on the TA dataset. For instance, DCNF achieves a 3.74% higher AUROC than MSFlow, which obtains the second-best AUROC on TA. Figure 6 exhibits the visualization of DCNF on the TA dataset.
In order to showcase the effectiveness of our fusion strategy, we also compare the outcomes obtained from the RGB branch, depth branch, and the fusion of the two branches in Table 2. The results demonstrate that the dual-branch cross-fusion strategy we proposed could effectively fuse the RGB images and depth maps through dual-branch normalizing flow.
To demonstrate that the dual-branch network is able to take full advantage of both RGB images and depth maps, we present visualizations of each branch. The results are compared with the fusion results shown in Figure 7. The visualization results show that the fused results have less noise than each branch of the DCNF.
In industrial applications, we pay more attention to the detection rate of anomalies. Therefore, we evaluate the recall of DCNF on the TA dataset in Table 3. Notably, DCNF outperforms previous methods and achieves a 2.17% higher recall than the second-best method. Considering that our goal is to judge whether a track is abnormal or not, we believe it is essential to employ recall as an evaluation metric.
Table 4 presents a recall performance comparison of the RGB branch, the depth branch, and their fusion using the DCNF framework on the TA dataset. The results demonstrate that the fusion strategy significantly outperforms individual modalities, achieving a 96.47% recall score, which is highlighted as the best performance.
The RGB branch alone achieves 94.13% recall, slightly surpassing the depth branch (93.85%). While both unimodal branches deliver strong results, the marginal superiority of the RGB branch suggests that visual features (e.g., color, texture) may provide slightly more discriminative cues for the task on this dataset. However, the depth branch’s competitive performance (93.85%) underscores the importance of geometric and spatial information in enhancing recognition robustness. The fused result (96.47%) exhibits a notable improvement over both unimodal branches, with gains of +2.34% over RGB and +2.62% over depth. This indicates strong complementary characteristics between RGB and depth modalities, where their combination effectively mitigates individual limitations (e.g., RGB sensitivity to lighting variations, depth ambiguity in textureless regions). The fusion mechanism in DCNF successfully leverages cross-modal synergies to achieve state-of-the-art performance.

4.4. Ablation Study

We here investigate the influence of individual components of DCNF, including the addition of different modules, the number of normalizing flow blocks, different feature extractors, and the value of $K$ in $TopK$.

4.4.1. Influence of Different Modules

The results in Table 5 show a great improvement in AUROC using both MP and the Fusion Flow module. As can be seen, both MP and the Fusion Flow module have a lifting effect on the dual-branch normalizing flow.
By separating the modal inputs, we can verify whether multi-modal fusion is significantly better than single-modal detection, and demonstrate the complementarity between the shape information provided by the depth map (such as the three-dimensional structure of a foreign object) and the RGB texture information. The experimental results show that the fusion strategy can effectively improve robustness to complex anomalies (such as small-scale foreign objects and illumination changes) by combining the advantages of the two modalities.

4.4.2. Influence of the Number of Normalizing Flow Blocks

Existing methods tend to stack as many normalizing flow blocks as they can, but an increasing number of blocks raises the computational cost. Therefore, we here seek an optimal setting by conducting experiments with varying numbers of normalizing flow blocks. As demonstrated in Table 6, utilizing a greater number of normalizing flow blocks can enhance the accuracy of the model. It is worth noting that the accuracy with $n = 8$ is the same as with $n = 10$, so we adopt $n = 8$ in all our experiments.

4.4.3. Influence of Different Feature Extractors

In this subsection, we experiment with different feature extractors, including ResNet-18 (RN-18), ResNet-50 (RN-50) and WideResNet-50 (WRN-50), and compare the results with those from the prior methods shown in Table 7. The results demonstrate that our DCNF achieves a significant improvement over the prior methods, even with the lightweight RN-18. The layer-3 features of the ImageNet-pretrained WRN-50 balance semantic expression and spatial detail; combined with average-pooling down-sampling, they reduce computational complexity and improve inference efficiency while maintaining high accuracy.
However, the features of WRN-50 pretrained on ImageNet may be sensitive to domain differences (e.g., medical images, satellite images), thus necessitating additional fine-tuning or domain adaptation strategies. Therefore, in subsequent work, the extractor may be appropriately fine-tuned for railway applications.

4.4.4. Influence of the K Set in TopK

$K$ is a hyperparameter of the post-processing stage, and we conduct an ablation study to determine its optimal value. Different from other methods that compute anomaly scores over the whole image, we select only the $TopK$ pixels for the calculation of the global anomaly score, which helps to minimize the interference of noise. The results of these experiments are presented in Figure 8. $TopK$ represents the ratio of pixels with the maximum scores in the image; the specific formula is $K \times H_{img} \times W_{img}$, and the average score is the special case obtained when $TopK$ is set to $100\% \times H_{img} \times W_{img}$. As shown in Figure 8, when $TopK$ is set to $5\%$, the accuracy of DCNF achieves the best value in most categories.
It is noteworthy that when $TopK$ is set to $100\% \times H_{img} \times W_{img}$, the accuracy decreases markedly. According to our analysis, this is because only the anomalous region exhibits a significantly high score in the final anomaly map. Consequently, for images containing small anomalies, averaging the anomaly scores over the entire image results in a lower global anomaly score.

5. Conclusions

In this paper, we have proposed a novel dual-branch cross-fusion strategy for RGB-D anomaly detection that combines depth maps to take advantage of their shape information. Specifically, we propose the Mutual Perception module and the Fusion Flow module to utilize cross-complementary prior knowledge and the different information provided by the two branches. The proposed DCNF achieves state-of-the-art performance on the TA dataset, demonstrating its effectiveness. Additionally, we explored the effect of different settings of $K$ in $TopK$ on accuracy, and conclude that for images with small anomalies, smaller $K$ values are more helpful for overall detection. Future work could explore adaptive fusion weights to further optimize modality contributions under varying scenarios, as well as investigating whether the depth branch's marginally lower standalone performance stems from inherent data limitations (e.g., noise in depth sensors) or architectural constraints in feature extraction. We hope our work will inspire future research on industrial anomaly detection.

Author Contributions

Conceptualization, X.G. and P.W.; methodology, X.G., P.W. and J.L.; data curation, J.L.; validation, X.G., J.L. and L.L.; writing—original draft preparation, P.W.; writing—review and editing, P.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Key Project of International Cooperation and Exchanges Natural Science Foundation of China, grant number 61960206010.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The generated data supporting the findings are currently not publicly accessible, but may be obtained from the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
AE      Autoencoder
AUROC   Area Under Receiver Operator Curve
DCNF    Dual-Branch Cross-Fusion Normalizing Flow
TA      Track Anomaly
MP      Mutual Perception module
GAN     Generative Adversarial Network
kNN     k-Nearest Neighbor
NF      Normalizing Flow
RN-18   ResNet-18
RN-50   ResNet-50
WRN-50  WideResNet-50

References

  1. Liu, J.; Song, K.; Feng, M.; Yan, Y.; Tu, Z.; Zhu, L. Semi-supervised anomaly detection with dual prototypes autoencoder for industrial surface inspection. Opt. Lasers Eng. 2021, 136, 106324. [Google Scholar] [CrossRef]
  2. Du, X.; Li, B.; Zhao, Z.; Jiang, B.; Shi, Y.; Jin, L.; Jin, X. Anomaly-prior guided inpainting for industrial visual anomaly detection. Opt. Laser Technol. 2024, 170, 110296. [Google Scholar] [CrossRef]
  3. Wang, Z.; Zhang, Y.; Luo, L.; Wang, N. Anodfdnet: A deep feature difference network for anomaly detection. J. Sens. 2022, 2022, 3538541. [Google Scholar] [CrossRef]
  4. Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 14318–14328. [Google Scholar]
  5. Pang, G.; Shen, C.; Cao, L.; Hengel, A.V.D. Deep learning for anomaly detection: A review. ACM Comput. Surv. (CSUR) 2021, 54, 1–38. [Google Scholar] [CrossRef]
  6. Yang, W.; Song, K.; Wang, Y.; Wei, X.; Tong, L.; Chen, S.; Yan, Y. NFCF: Industrial Surface Anomaly Detection with Normalizing Flow Cross-Fitting Network. Opt. Lasers Eng. 2023, 168, 107655. [Google Scholar] [CrossRef]
  7. Yang, J.; Xu, R.; Qi, Z.; Shi, Y. Visual anomaly detection for images: A systematic survey. Procedia Comput. Sci. 2022, 199, 471–478. [Google Scholar] [CrossRef]
  8. Cohen, N.; Hoshen, Y. Sub-image anomaly detection with deep pyramid correspondences. arXiv 2020, arXiv:2005.02357. [Google Scholar]
  9. Zhou, T.; Fan, D.P.; Cheng, M.M.; Shen, J.; Shao, L. RGB-D salient object detection: A survey. Comput. Vis. Media 2021, 7, 37–69. [Google Scholar] [CrossRef] [PubMed]
  10. Zhou, Y.; Xu, X.; Song, J.; Shen, F.; Shen, H.T. MSFlow: Multiscale Flow-Based Framework for Unsupervised Anomaly Detection. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 2437–2450. [Google Scholar] [CrossRef] [PubMed]
  11. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD—A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9592–9600. [Google Scholar]
  12. Wen, P.; Gao, X.; Wang, Y.; Li, J.; Luo, L. Normalizing Flow-Based Industrial Complex Background Anomaly Detection. J. Sens. 2023, 2023, 6690190. [Google Scholar] [CrossRef]
  13. Akcay, S.; Atapour-Abarghouei, A.; Breckon, T.P. Ganomaly: Semi-supervised anomaly detection via adversarial training. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018, Revised Selected Papers, Part III; Springer International Publishing: Cham, Switzerland, 2019; pp. 622–637. [Google Scholar]
  14. Bergman, L.; Cohen, N.; Hoshen, Y. Deep nearest neighbor anomaly detection. arXiv 2020, arXiv:2002.10445. [Google Scholar]
  15. Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. Padim: A patch distribution modeling framework for anomaly detection and localization. In International Conference on Pattern Recognition; Springer International Publishing: Cham, Switzerland, 2021; pp. 475–489. [Google Scholar]
  16. Shah, R.A.; Urmonov, O.; Kim, H. Two-stage coarse-to-fine image anomaly segmentation and detection model. Image Vis. Comput. 2023, 139, 104817. [Google Scholar] [CrossRef]
  17. Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using real nvp. arXiv 2016, arXiv:1605.08803. [Google Scholar]
  18. Rudolph, M.; Wandt, B.; Rosenhahn, B. Same same but differnet: Semi-supervised defect detection with normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1907–1916. [Google Scholar]
  19. Gudovskiy, D.; Ishizaka, S.; Kozuka, K. Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 98–107. [Google Scholar]
  20. Lang, C.; Nguyen, T.V.; Katti, H.; Yadati, K.; Kankanhalli, M.; Yan, S. Depth matters: Influence of depth cues on visual saliency. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012, Proceedings, Part II; Springer: Berlin/Heidelberg, Germany, 2012; pp. 101–115. [Google Scholar]
  21. Desingh, K.; Krishna, K.M.; Rajan, D.; Jawahar, C.V. Depth really Matters: Improving Visual Salient Region Detection with Depth. In Proceedings of the BMVC, Bristol, UK, 9–13 September 2013; pp. 1–11. [Google Scholar] [CrossRef]
  22. Ren, J.; Gong, X.; Yu, L.; Zhou, W.; Ying Yang, M. Exploiting global priors for RGB-D saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 25–32. [Google Scholar]
  23. Peng, H.; Li, B.; Xiong, W.; Hu, W.; Ji, R. RGBD salient object detection: A benchmark and algorithms. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part III; Springer International Publishing: Cham, Switzerland, 2014; pp. 92–109. [Google Scholar]
  24. Han, J.; Chen, H.; Liu, N.; Yan, C.; Li, X. CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE Trans. Cybern. 2017, 48, 3171–3183. [Google Scholar] [CrossRef] [PubMed]
  25. Wang, N.; Gong, X. Adaptive fusion for RGB-D salient object detection. IEEE Access 2019, 7, 55277–55284. [Google Scholar] [CrossRef]
  26. Wu, Z.; Gobichettipalayam, S.; Tamadazte, B.; Allibert, G.; Paudel, D.P.; Demonceaux, C. Robust rgb-d fusion for saliency detection. In Proceedings of the 2022 International Conference on 3D Vision (3DV), Prague, Czech Republic, 12–16 September 2022; IEEE: New York, NY, USA, 2022; pp. 403–413. [Google Scholar]
  27. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  28. Schirrmeister, R.; Zhou, Y.; Ball, T.; Zhang, D. Understanding anomaly detection with deep invertible networks through hierarchies of distributions and features. Adv. Neural Inf. Process. Syst. 2020, 33, 21038–21049. [Google Scholar]
  29. Rudolph, M.; Wehrbein, T.; Rosenhahn, B.; Wandt, B. Fully convolutional cross-scale-flows for image-based defect detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 1088–1097. [Google Scholar]
  30. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4183–4192. [Google Scholar]
  31. Lee, S.; Lee, S.; Song, B.C. Cfa: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. IEEE Access 2022, 10, 78446–78454. [Google Scholar] [CrossRef]
Figure 1. Structural image of DCNF network: Illustrating Mutual Perception, feature extraction, pooling layer, NF Blocks, fusion flow and post-processing for RGB-depth data-driven anomaly detection.
Figure 2. The overall architecture of the Mutual Perception module. The CA denotes the Channel Attention module, and the SA denotes the Spatial Attention module.
Figure 3. The architecture of the fully convolutional normalizing flow block.
Figure 4. The architecture of the fusion flow module.
Figure 5. Examples of normal and anomaly images in the TA dataset.
Figure 6. Examples of test results on the TA dataset. From left to right are RGB images, depth maps, Ground Truths and the visualization of the results.
Figure 7. Comparison of the visualization results of different branches.
Figure 8. Ablation study of AUROC in % with different K sets of TopK on the TA dataset. The best result is plotted in the figure.
Table 1. Comparison of area under ROC (AUROC) in % of different methods on the TA dataset. The best is highlighted in bold, and the second is highlighted with underlining.
Method | GANomaly [13] | PaDiM [15] | CFA [31] | CFlow [19] | MSFlow [10] | DCNF
AUROC  | 80.30         | 91.70      | 94.28    | 93.12      | 94.75       | 98.49
Table 2. AUROC scores (%) compared for RGB branch, depth branch, and the fusion of the two branches on the TA dataset. The best is highlighted in bold, and the second is highlighted with underlining.
Method | DCNF (RGB) | DCNF (Depth) | DCNF
AUROC  | 97.02      | 95.94        | 98.49
Table 3. The recall score (%) for the TA dataset compared with previous methods. The best is highlighted in bold, and the second best is highlighted with underlining.
Method | GANomaly | PaDiM | CFA   | CFlow | MSFlow | DCNF
Recall | 83.40    | 90.81 | 94.30 | 93.99 | 92.93  | 96.47
Table 4. A comparison of recall scores (%) for the RGB branch, the depth branch, and the fusion of the two branches on the TA dataset. The best is highlighted in bold, and the second best is highlighted with underlining.
Method | DCNF (RGB) | DCNF (Depth) | DCNF
Recall | 94.13      | 93.85        | 96.47
Table 5. Ablation study of AUROC in % on the TA dataset when adding different modules. The best is highlighted in bold.
MP | Fusion Flow | TA
×  | ×           | 97.26
✓  | ×           | 97.76
×  | ✓           | 98.08
✓  | ✓           | 98.49
Table 6. Ablation study of AUROC in % with the number of normalizing flow blocks n on the TA dataset. Bold values denote the best result on the TA dataset.
n     | 2     | 5     | 8     | 10
AUROC | 97.76 | 97.87 | 98.49 | 98.49
Table 7. Comparison study of AUROC in % with different extractions. The best is highlighted in bold, and the second best is highlighted with underlining.
Method | GANomaly | PaDiM | CFA   | CFlow | MSFlow | DCNF (RN-18) | DCNF (RN-50) | DCNF (WRN-50)
TA     | 80.3     | 91.7  | 94.28 | 93.12 | 94.75  | 97.79        | 98.13        | 98.49
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
