Article

Cross-Domain Object Detection through Consistent and Contrastive Teacher with Fourier Transform

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3292; https://doi.org/10.3390/electronics13163292
Submission received: 15 July 2024 / Revised: 15 August 2024 / Accepted: 16 August 2024 / Published: 19 August 2024
(This article belongs to the Special Issue Neuromorphic Computing: Devices, Chips, and Algorithm)

Abstract
The teacher–student framework has been employed in unsupervised domain adaptation, which transfers knowledge learned from a labeled source domain to an unlabeled target domain. However, this framework suffers from two serious challenges: the domain gap, which causes performance degradation, and noisy teacher pseudo-labels, which tend to mislead the student. In this paper, we propose a Consistent and Contrastive Teacher with Fourier Transform (CCTF) method to address these challenges for high-performance cross-domain object detection. To mitigate the negative impact of domain shifts, we use the Fourier transform to exchange the low-frequency components of the source and target domain images, replacing the source domain inputs with the transformed images, thereby reducing domain gaps. In addition, we encourage the localization and classification branches of the teacher to make consistent predictions to minimize the noise in the generated pseudo-labels. Finally, contrastive learning is employed to resist the impact of residual noise in pseudo-labels. Extensive experiments show that our method achieves state-of-the-art performance; for example, our model outperforms previous methods by 3.0% mAP on FoggyCityscapes.

1. Introduction

In recent years, deep learning algorithms have achieved excellent performance in various tasks such as object detection, and these algorithms typically assume that the training and inference data are independent and identically distributed (i.i.d.). However, real-world data distributions rarely match this assumption, and the distributional differences between domains can lead to severe performance degradation when pre-trained models are deployed to new domains. In addition, manually annotating all samples is highly time-consuming and labor-intensive, so models are often deployed to new domains that lack labels for supervision. As a result, unsupervised domain adaptation (UDA) has received widespread attention from the research community. As a semi-supervised method that effectively utilizes abundant unlabeled data, the teacher–student framework is widely employed for UDA, e.g., in the Unbiased Mean Teacher (UMT) [1] and the Adaptive Teacher (AT) [2]. Within the teacher–student framework, the teacher model generates pseudo-labels for the target domain, allowing the student model to be supervised without additional annotations.
Nevertheless, since the teacher–student framework was not initially introduced to solve UDA problems, it suffers from two serious problems: domain gaps and noisy pseudo-labels. Due to the domain gap, it is difficult for the teacher trained in the source domain to provide adequate guidance to the student in the target domain. Moreover, the pseudo-labels generated by the teacher in the target domain usually contain noise, as shown in Figure 1, including many false positives and errors. Noisy pseudo-labels mislead the student model and cause further performance degradation. Even when the quality of the pseudo-labels is improved, it is difficult to eliminate the noise entirely, which inherently limits the student’s performance.
To address these problems, we propose a Consistent and Contrastive Teacher with Fourier Transform (CCTF) for cross-domain object detection. First, we point out that style mismatches caused by low-frequency spectral variations reduce the perception of high-frequency semantics, leading to large domain gaps. For this reason, we introduce the Fourier transform to exchange the low-frequency components between the spectra of the source and target domain images, allowing the style of the target image to be transferred to the source image, minimizing the style mismatch, and reducing the domain gap. As shown in Figure 2, the discrepancy between the source and target distributions is narrowed after the Fourier transform. Second, we maximize the consistency of the localization scores and the classification scores in both supervised and unsupervised ways. Due to discrepancies in domain distributions, the pseudo-labels contain significant noise. Noisy pseudo-labels suggest that the IoU between predictions and ground truth may be low. Intuitively, a low IoU indicates the erroneous detection of an instance as an object, so the classification score attributed to this bounding box should be correspondingly low. Conversely, a high IoU implies the accurate detection of an object, which encourages a correspondingly high classification score for the bounding box. In short, the consistency between the classification score and the IoU score of a predicted instance directly correlates with the probability of accurately identifying an object. Thus, the quality of pseudo-labels improves as the original classification scores evolve into consistent predictions that reflect both classification and localization quality. Third, we employ cross-domain contrastive learning at the object level. Although the quality of pseudo-labels is enhanced, they inevitably still contain a certain degree of noise. Inspired by the fact that contrastive learning can extract powerful representations without relying on accurate labels, we introduce contrastive learning to mitigate the influence of noise in pseudo-labels and to facilitate consistency between student and teacher representations.
The contributions of our work can be summarized as follows:
  • We propose to swap low-frequency image spectra between the source and target domains to narrow the domain gaps, thereby facilitating the cross-domain performance of the model.
  • We propose to improve the consistency between the classification and localization branches of the teacher model in both supervised and unsupervised ways to optimize the quality of its generated pseudo-labels and reduce noise.
  • We use contrastive learning to reduce the distance between samples belonging to the same category in the source and target domains, improving knowledge transfer between teacher and student and counteracting the negative effects of noise in pseudo-labels.
  • We carried out extensive experiments to demonstrate the state-of-the-art performance of our method. For instance, our method achieves 54.9% mAP on FoggyCityscapes, which is 3.0% higher than the previous best method.

2. Related Work

  • Unsupervised domain adaptation. UDA aims to transfer knowledge from a labeled source domain to an unlabeled target domain, and various methods have been designed for it. Discrepancy-based methods try to minimize the distribution difference between domains to align them. Maximum mean discrepancy (MMD) [3] is an effective loss on which many such methods are based. For example, to balance source data distributions, WDAN [4] assigns an auxiliary weight to each class. JAN [5] measures the Hilbert–Schmidt norm between the kernel mean embeddings of the empirical joint distributions of source and target data. Inspired by generative adversarial networks [6], adversarial-based methods introduce an adversarial objective for domain confusion through a domain discriminator. Ref. [7] excludes globally dissimilar images and only aligns globally similar images. Ref. [8] utilizes gradient reversal layers (GRLs) [9] to apply the adversarial objective and align the two domains at both the image level and the instance level. Ref. [10] aligns discriminative regions that are mined with clustering methods. Lately, researchers have discovered the potential of the teacher–student framework in UDA. Ref. [2] introduces adversarial-based methods to the teacher–student framework to reduce domain gaps and alleviate the noise in pseudo-labels. Ref. [1] effectively alleviates model biases toward the source domain by healing the teacher with distillation and healing the student with style transfer.
  • Cross-domain object detection. Cross-domain object detection (CDOD) extends UDA from classification to object detection tasks. Curriculum learning methods [11], annotation-level methods [12], and graph matching methods [13] have been proposed for it. Adversarial learning is also employed to align the two domains in [14,15,16,17]. As an effective semi-supervised method, the teacher–student framework has also been introduced to this field to help exploit target domain samples [1,2,18,19]. Through mutual learning between the teacher and the student, the framework gains stronger robustness against data variance [1]. However, when facing large domain gaps and distribution discrepancies, most of these methods still suffer from large performance drops.
  • Contrastive learning. Contrastive learning pushes the model to learn generalized features that are insensitive to transformations, improving the representation of the model [20]. Recent studies have shown that pre-trained models based on contrastive proxy tasks can replace their supervised learning counterparts [21,22,23,24] with even better performance [25]. Naturally, contrastive learning can be employed for UDA. For example, [26] utilizes contrastive learning to align the text and vision encoders to disentangle the domain information. Therefore, in this paper, we use contrastive learning to reduce the distance between samples belonging to the same class in the source and target domains to counteract the negative effects of residual noise in pseudo-labels.

3. Method

3.1. Problem Formulation

Given labeled source images $D_s = \{(X_s, B_s, C_s)\}$, where $X_s$, $B_s$, and $C_s$ represent the source image set, the bounding boxes, and the class labels, respectively, and unlabeled target images $D_t = \{X_t\}$, where $X_t$ represents the target image set, UDA aims at transferring knowledge learned from the labeled source domain to the unlabeled target domain and obtaining an effective model for the target domain.

3.2. Fourier Transform

The teacher–student framework was originally proposed for semi-supervised problems, whereas in UDA, the performance of the vanilla mean teacher is limited by the domain gap. To unleash the full potential of the teacher–student framework in UDA, we introduce the Fourier transform to mitigate the negative impact of the domain gap on the teacher model.
Referring to [27], given an RGB source image $x_s$ and an RGB target image $x_t$, we denote the amplitude and phase components of the Fourier transform as $\mathcal{F}^A$ and $\mathcal{F}^P$. After the Fourier transform, an image $x$ in RGB space is transformed to $\mathcal{F}(x)(a, b)$ in the frequency domain:
$$\mathcal{F}(x)(a, b) = \sum_{h, w} x(h, w)\, e^{-j 2\pi \left( \frac{h}{H} a + \frac{w}{W} b \right)}, \qquad j^2 = -1.$$
We then swap the low-frequency components of the source and target inputs in the frequency domain and transform the result back into RGB space using the inverse Fourier transform $\mathcal{F}^{-1}$:
$$x_{s \to t} = \mathcal{F}^{-1}\left( \left[ M_\beta \circ \mathcal{F}^A(x_t) + (1 - M_\beta) \circ \mathcal{F}^A(x_s),\; \mathcal{F}^P(x_s) \right] \right),$$
where the Fourier transform $\mathcal{F}(x)$ is implemented following [28] using the Fast Fourier Transform (FFT) algorithm, and $M_\beta$ denotes a binary mask defined as follows:
$$M_\beta(h, w) = \mathbb{1}_{(h, w) \,\in\, [-\beta H : \beta H,\; -\beta W : \beta W]},$$
whose value is one inside the central region determined by $\beta \in (0, 1)$ and zero elsewhere. Note that $\beta$ is not measured in pixels, so its choice does not depend on image size or resolution; we assume the center of the spectrum is at $(0, 0)$. By applying this binary mask in the frequency domain, the low-frequency portions of the source and target image amplitudes are exchanged, and the distribution difference between the hybridized image and the target image is reduced, facilitating domain adaptation. After that, without changing the phase component of $x_s$, the modified spectral representation is mapped back to the image $x_{s \to t}$, which contains the same content as $x_s$ but resembles the appearance of $x_t$.
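As a concrete illustration, below is a minimal NumPy sketch of this low-frequency amplitude swap under the stated assumptions (the function name and the single-image interface are ours; the paper's implementation applies the FFT routine of [28] inside its training pipeline, and the two images are assumed to share the same spatial size).

```python
import numpy as np

def fourier_low_freq_swap(x_src, x_tgt, beta=0.1):
    """Replace the low-frequency amplitude of a source image with that of a
    target image and keep the source phase (FDA-style spectrum swap).
    x_src, x_tgt: float arrays of shape (C, H, W) with the same H and W."""
    fft_src = np.fft.fft2(x_src, axes=(-2, -1))
    fft_tgt = np.fft.fft2(x_tgt, axes=(-2, -1))
    amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)
    amp_tgt = np.abs(fft_tgt)

    # Shift the zero frequency to the center so the mask M_beta covers the
    # central band [-beta*H : beta*H, -beta*W : beta*W].
    amp_src = np.fft.fftshift(amp_src, axes=(-2, -1))
    amp_tgt = np.fft.fftshift(amp_tgt, axes=(-2, -1))

    _, H, W = x_src.shape
    b_h, b_w = int(beta * H), int(beta * W)
    c_h, c_w = H // 2, W // 2
    amp_src[:, c_h - b_h:c_h + b_h, c_w - b_w:c_w + b_w] = \
        amp_tgt[:, c_h - b_h:c_h + b_h, c_w - b_w:c_w + b_w]
    amp_src = np.fft.ifftshift(amp_src, axes=(-2, -1))

    # Recombine the swapped amplitude with the unchanged source phase.
    x_s2t = np.fft.ifft2(amp_src * np.exp(1j * pha_src), axes=(-2, -1)).real
    return x_s2t
```

The swapped image keeps the source content (and therefore its annotations) while borrowing the low-level appearance of the target domain.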
With the Fourier transform exchanging the low-frequency components between the image spectra of the two domains, the style mismatches and domain gaps between the two domains are reduced. As a result, the potential of the vanilla mean teacher can be further unleashed, thus facilitating the performance of the student model.

3.3. Consistent Teacher

In the vanilla mean teacher, the teacher model is updated with the exponential moving average (EMA) of the student model to output robust target domain pseudo-labels. The student model is updated with both source domain ground truth and target domain pseudo-labels. Therefore, we can define the following optimization objective for the vanilla mean teacher:
$$\mathcal{L}_{MT} = \mathcal{L}_{source} + \mathcal{L}_{target},$$
where $\mathcal{L}_{source}$ and $\mathcal{L}_{target}$ denote the loss computed with source ground truth and the loss computed with target pseudo-labels, respectively.
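For reference, a minimal PyTorch sketch of the EMA update of the teacher is given below (the helper name and the parameter-only update are our assumptions; the smoothing coefficient 0.9996 is the value reported later in Section 4.2, and the teacher and student are assumed to share the same architecture).

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.9996):
    """Update the teacher as an exponential moving average of the student:
    theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)
```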
With the employment of the Fourier transform, we initially improved the performance of the vanilla mean teacher. However, the vanilla mean teacher will produce low-quality pseudo-labels, which are harmful to the training process. Therefore, we propose the consistent teacher to generate high-quality pseudo-labels to help the student learn stable representations in both supervised (source domain samples) and unsupervised (target domain samples) ways.
Normally, the teacher–student framework optimizes the classification branch and the localization branch of the object detection loss separately. We argue that the inconsistency between the predicted outputs of these two branches is the source of the noise in the teacher’s pseudo-labels. Thus, we encourage the consistency of these two branches by making localization quality the learning target of the classification branch. Specifically, in the source domain, we set the classification target $y$ of the ground-truth class to the IoU produced by the localization branch and keep the targets of other classes at 0, which is formulated as follows:
$$\mathcal{L}(y, pred) = \begin{cases} (y^2 - y)\log(1 - pred) - y^2 \log(pred), & y > 0, \\ -\,pred^{\,\gamma} \log(1 - pred), & y = 0, \end{cases}$$
where $pred$ is the classification prediction. The supervised classification loss can then be written as follows:
$$\mathcal{L}_{cls} = \sum_{l_s} \sum_{c_s} \mathcal{L}\left( y_{l_s, c_s},\; pred_{l_s, c_s} \right),$$
where $l_s$ and $c_s$ index the $l$-th object of class $c$ in the source domain. By changing the learning target from one-hot vectors to the localization IoU, the classification score now represents both classification quality and localization quality. In addition, this softened learning objective acts as a form of label smoothing, further contributing to classification accuracy.
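The following PyTorch sketch shows one way to implement this IoU-as-target classification loss (the helper is hypothetical; placing the weighting factor α of Section 4.2 on the negative branch follows the common varifocal-loss convention and is our assumption, since only γ appears in the equation above).

```python
import torch

def consistency_cls_loss(pred, y, alpha=0.75, gamma=2.0):
    """Classification loss with the localization IoU as a soft target.
    pred: predicted class probabilities in (0, 1); y: IoU targets, with 0 for
    background/other classes. Both tensors have the same shape."""
    pred = pred.clamp(1e-6, 1.0 - 1e-6)
    pos = y > 0
    loss = torch.zeros_like(pred)
    # Positives: soft cross-entropy against the IoU target y.
    loss[pos] = (y[pos] ** 2 - y[pos]) * torch.log(1.0 - pred[pos]) \
                - y[pos] ** 2 * torch.log(pred[pos])
    # Negatives: focal-style down-weighting of easy background predictions
    # (the alpha placement here is an assumption, see the note above).
    loss[~pos] = -alpha * pred[~pos] ** gamma * torch.log(1.0 - pred[~pos])
    return loss.sum()
```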
In the target domain, since we cannot access the annotations, we instead calculate the IoU between a box and all other box predictions from the student and choose the largest IoU as $\hat{y}$. Similarly to the supervised loss, the unsupervised loss can be reformulated as follows:
$$\hat{\mathcal{L}}_{cls}^{\,t} = \sum_{l_t} \sum_{c_t} \mathcal{L}\left( \hat{y}_{l_t, c_t},\; pred_{l_t, c_t} \right),$$
where $l_t$ and $c_t$ index the $l$-th object of class $c$ in the target domain, and $\hat{y}_{l_t, c_t}$ and $pred_{l_t, c_t}$ represent the pseudo-IoU target and the classification prediction for that object.
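A short sketch of how the pseudo target $\hat{y}$ can be computed for one target image, assuming torchvision is available (the helper name is ours):

```python
import torch
from torchvision.ops import box_iou

def pseudo_iou_targets(student_boxes):
    """For each predicted box, take the largest IoU with any *other* student
    prediction as its pseudo localization-quality target.
    student_boxes: (N, 4) tensor in (x1, y1, x2, y2) format, N >= 1."""
    iou = box_iou(student_boxes, student_boxes)  # (N, N) pairwise IoU
    iou.fill_diagonal_(0.0)                      # ignore self-overlap
    return iou.max(dim=1).values                 # one pseudo target per box
```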
With these modifications in both the supervised and the unsupervised settings, the teacher model outputs pseudo-labels with less noise, and the teacher–student framework becomes more robust in the target domain.

3.4. Cross-Domain Contrastive Learning

Due to the lack of supervision, the Fourier transform and the consistent teacher cannot eliminate the noise in pseudo-labels. To this end, we introduce cross-domain contrastive learning to optimize feature representations and maximize beneficial learning signals.
To conduct contrastive learning, we first feed the same target mini-batch $MB$ to both the teacher’s RoIAlign [29] and the student’s RoIAlign to extract object-level features $OBF^{T}$ and $OBF^{ST}$. Meanwhile, by feeding $MB$ to the teacher, we also obtain the corresponding pseudo-classes $C^t = \{C_1, \ldots, C_N\}$ ($N$ is the number of objects in $MB$). With the object-level features and pseudo-classes, the contrastive loss can be formulated as follows:
$$\mathcal{L}_{cont} = -\frac{\lambda_{cont}}{N} \sum_{i=1}^{N} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left( OBF_i^{ST} \cdot OBF_p^{T} / \tau \right)}{\sum_{j=1}^{N} \exp\left( OBF_i^{ST} \cdot OBF_j^{T} / \tau \right)},$$
where the positive set $P(i) = \{ p \mid C_p = C_i,\; p \in \{1, \ldots, N\} \}$ consists of all objects that have the same pseudo-class as object $i$ in mini-batch $MB$.
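A minimal PyTorch sketch of this object-level contrastive loss is shown below; the L2 normalization of the RoI features and the function name are our assumptions.

```python
import torch
import torch.nn.functional as F

def cross_domain_contrastive_loss(obf_student, obf_teacher, pseudo_classes,
                                  tau=0.07, lambda_cont=0.001):
    """Object-level contrastive loss. obf_student, obf_teacher: (N, D) RoI
    features from the student and teacher for the same mini-batch;
    pseudo_classes: (N,) pseudo-class indices produced by the teacher."""
    s = F.normalize(obf_student, dim=1)
    t = F.normalize(obf_teacher, dim=1)
    logits = s @ t.T / tau                                   # (N, N) similarities
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # P(i): objects sharing the same pseudo-class as object i.
    pos_mask = pseudo_classes.unsqueeze(0) == pseudo_classes.unsqueeze(1)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)

    loss_per_object = -(log_prob * pos_mask).sum(dim=1) / pos_count
    return lambda_cont * loss_per_object.mean()
```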
With the employment of contrastive learning, our student model can maximize the similarity of related samples and minimize that of unrelated samples to learn target sample representations.

3.5. Learning Objective

The overview of our framework is presented in Figure 3, and it is optimized as follows:
$$\mathcal{L} = \hat{\mathcal{L}}_{cls} + \mathcal{L}_{reg} + \lambda_1 \left( \hat{\mathcal{L}}_{cls}^{\,t} + \mathcal{L}_{reg}^{t} \right) + \lambda_2 \mathcal{L}_{cont},$$
where $\hat{\mathcal{L}}_{cls}$ and $\mathcal{L}_{reg}$ are the supervised classification and localization losses, $\hat{\mathcal{L}}_{cls}^{\,t}$ and $\mathcal{L}_{reg}^{t}$ are the unsupervised classification and localization losses, and $\mathcal{L}_{cont}$ is the object-level cross-domain contrastive loss. $\lambda_1$ and $\lambda_2$ are hyper-parameters that control the weights of the losses.
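For completeness, a small sketch of how the terms are combined (the individual loss values are assumed to be computed elsewhere in the training loop):

```python
def overall_loss(l_cls_src, l_reg_src, l_cls_tgt, l_reg_tgt, l_cont,
                 lambda_1=1.0, lambda_2=0.001):
    """Overall objective: supervised source losses, pseudo-label-supervised
    target losses weighted by lambda_1, and the contrastive term by lambda_2."""
    return (l_cls_src + l_reg_src
            + lambda_1 * (l_cls_tgt + l_reg_tgt)
            + lambda_2 * l_cont)
```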

4. Experiments

4.1. Datasets

  • Cityscapes. Cityscapes contains 2975 training images and 500 validation images collected from 50 different urban scenes, with 8 object categories used for detection. Annotations are converted from segmentation masks.
  • FoggyCityscapes. Foggy Cityscapes shares the same city scenes, categories, and annotations with Cityscapes, and Foggy Cityscapes adds fog to Cityscapes with depth information. Foggy Cityscapes simulates three fog levels (0.005, 0.01, 0.02) corresponding to different visibility ranges.
  • PASCAL VOC. PASCAL VOC [30] combines PASCAL VOC 2007 and 2012 and contains 17,125 real-world images of 20 categories with corresponding annotations.
  • Comic2k. Comic2k [31] contains 2000 images of 6 categories that are in comic styles.
  • Clipart1k. Clipart1k [31] consists of 1000 clipart images that share the same categories with PASCAL VOC.
  • Watercolor2k. Watercolor2k [31] contains watercolor-style images and shares the same categories with the Comic2k dataset.
For Comic2k, Clipart1k, and Watercolor2k, we follow [7,32] and equally divide three datasets into the training set and the testing set. Visualizations of samples of VOC, Clipart, Watercolor, and Comic are shown in Figure 4, and large domain gaps can be observed.
Experiments are conducted on two domain adaptation settings: from normal images to foggy images (Cityscapes → Foggy Cityscapes) and from realistic images to artistic images (Pascal VOC → Watercolor2k, Pascal VOC → Clipart1k, and Pascal VOC → Comic2k). We use the training sets of both the Fourier-transformed source samples and the target samples for training, and the validation sets of the target domain samples for evaluation. For comparison, we use the mean average precision (mAP) as our evaluation metric.

4.2. Implementation Details

In our experiments, we employ Faster R-CNN as the base object detection model. The backbones of the Faster R-CNN are VGG16 [33] and ResNet-101 [34] pre-trained on ImageNet. The input images are scaled by resizing their shorter sides to 600 while keeping their original aspect ratios. For the loss hyper-parameters, we set $\lambda_1 = 1.0$ and $\lambda_2 = 0.001$. For the consistency loss, we set $\alpha = 0.75$ and $\gamma = 2.0$. For the temperature in the contrastive loss, we set $\tau = 0.07$ following [35,36]. We first train the student model for 60k iterations. Then, the teacher model is updated with EMA every iteration, and the teacher and the student are trained together for another 40k iterations. We set the learning rate to 0.016 and optimize our model using Stochastic Gradient Descent (SGD). Several data augmentation methods are applied, including random horizontal flipping, random color jittering, gray-scaling, Gaussian blurring, and random erasing, to improve the generalization of our model. The weight smoothing coefficient of the EMA is set to 0.9996. We conduct our experiments on two NVIDIA TITAN RTX GPUs. The experiments are all implemented in PyTorch.
  • Hyper-parameter discussions. We believe that performance drops in the teacher–student framework are mainly caused by noise in pseudo-labels. The consistent teacher is introduced to effectively reduce this noise, hence we set $\lambda_1$ as large as 1.0. We set $\lambda_2$ as small as 0.001 since contrastive learning is introduced to complement the consistent teacher.
  • Avoiding strong augmentation issues. We observe that some objects are completely cut off in the strongly augmented image while they are kept in the weakly augmented image. Under such circumstances, the object features extracted by the teacher and the student cannot be matched. To enforce the consistency between the teacher and the student, for each bounding box, we count the number of pixels where the RGB difference between the teacher’s view and the student’s view is larger than 40 [37]. If the ratio of such pixels is higher than 50%, the object is considered cut off and is excluded from the contrastive learning, as sketched below.
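A sketch of this cut-off filter, assuming the weakly and strongly augmented crops of a box are available and spatially aligned (the function name, array layout, and per-pixel channel-maximum test are our assumptions):

```python
import numpy as np

def is_cut_off(weak_crop, strong_crop, diff_thresh=40, ratio_thresh=0.5):
    """Flag a box whose content was largely destroyed by strong augmentation.
    weak_crop, strong_crop: uint8 RGB crops of the same box, shape (H, W, 3)."""
    diff = np.abs(weak_crop.astype(np.int16) - strong_crop.astype(np.int16))
    changed = diff.max(axis=-1) > diff_thresh   # per-pixel RGB difference test
    return changed.mean() > ratio_thresh        # exclude if >50% of pixels changed
```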

4.3. Results and Comparisons

  • Normal weather to severe weather. In this experiment setting, we evaluate our model on the benchmark normal weather → foggy weather (Cityscapes → FoggyCityscapes). In this setting, various complex scenes will bring challenges to the adaptation.
The results are shown in Table 1. For a complete demonstration of our model, we perform experiments on both the foggy level “0.02” split (the foggiest split) and the foggy level “ALL” split. We can observe from Table 1 that our model outperforms the previous best performance by 2.1% on the foggy level “0.02” split and by 3.0% on the foggy level “ALL” split, demonstrating our model’s ability to extract robust features from both labeled and unlabeled data. In addition, by achieving the best mAP in six categories, our model shows its effectiveness. Furthermore, these results demonstrate the practical significance of our model. Sufficient unlabeled data can be easily collected for real-world applications. With more unlabeled data, our model is capable of extracting more target domain learning signals, and its robustness can further increase, leading to better performance in the target domain.
  • Real to art. In this experiment setting, we evaluate our model on adaptation from real-world images to artistic images. Compared with the “Severe Weather” setting, the domain shifts in this setting are much larger, bringing more challenges to our model. We use Pascal VOC (real-world images) as the source dataset and Watercolor2k, Clipart1k, and Comic2k (artistic images) as the target datasets. The results are shown in Table 2, Table 3 and Table 4. Specifically, on Clipart1k our method achieves the highest mAP of 48.6% among all compared methods and achieves the highest mAP in 10 categories. Similar results for Pascal VOC to Comic2k and Watercolor2k can be observed in Table 3 and Table 4: our model achieves 39.7% and 55.9% mAP, surpassing the previous best by 2.0% and 0.7%, respectively, on Comic2k and Watercolor2k.
We also provide some visualization results for our model. In Figure 5, our model demonstrates the ability to accurately detect objects within images.
Table 1. The results for Cityscapes → Foggy Cityscapes. Our proposed method (in red) outperforms other methods on both foggy level “0.02” split and foggy level “ALL” split. SIGMA and MTOR (in blue) are based on ResNet-50 while other methods are based on VGG-16. The highest mAP of each class is bold.
Method | Split | Person | Rider | Car | Truck | Bus | Train | Motor | Bike | mAP
DM [38] | 0.02 | 30.8 | 40.5 | 40.5 | 27.2 | 38.4 | 34.5 | 28.4 | 32.3 | 34.6
HTCN [14] | 0.02 | 33.2 | 47.5 | 47.9 | 31.6 | 47.4 | 40.9 | 32.3 | 37.1 | 39.8
MeGA-CDA [39] | 0.02 | 37.7 | 49.0 | 52.4 | 25.4 | 49.2 | 46.9 | 34.5 | 39.0 | 41.8
TIA [40] | 0.02 | 34.8 | 46.3 | 49.7 | 31.1 | 52.1 | 48.6 | 37.7 | 38.1 | 42.3
SIGMA [41] | 0.02 | 46.9 | 48.4 | 63.7 | 27.1 | 50.7 | 35.9 | 34.7 | 41.4 | 43.5
MTOR [19] | 0.02 | 30.6 | 41.4 | 44.0 | 21.9 | 43.4 | 40.2 | 31.7 | 33.2 | 35.1
UMT [1] | 0.02 | 33.0 | 46.7 | 48.6 | 34.1 | 56.5 | 46.8 | 30.4 | 37.3 | 41.7
PT+CMT [37] | 0.02 | 42.3 | 51.7 | 64.0 | 26.0 | 42.7 | 37.1 | 42.5 | 44.0 | 43.8
AT+CMT [37] | 0.02 | 45.9 | 55.7 | 63.7 | 39.6 | 66.0 | 38.8 | 41.4 | 51.2 | 50.3
Ours | 0.02 | 49.5 | 60.3 | 65.6 | 40.2 | 65.8 | 41.0 | 40.3 | 56.5 | 52.4
Source | 0.02 | 22.4 | 26.6 | 28.5 | 9.0 | 16.0 | 4.3 | 15.2 | 25.3 | 18.4
Oracle | 0.02 | 39.5 | 47.3 | 59.1 | 33.1 | 47.3 | 42.9 | 38.1 | 40.8 | 43.5
PDA [42] | ALL | 36.0 | 45.5 | 54.4 | 24.3 | 44.1 | 25.8 | 29.1 | 35.9 | 36.9
ICR-CCR [17] | ALL | 32.9 | 43.8 | 49.2 | 27.2 | 36.4 | 36.4 | 30.3 | 34.6 | 37.4
PT+CMT [37] | ALL | 45.6 | 55.1 | 66.5 | 34.0 | 59.4 | 42.4 | 43.9 | 47.4 | 49.3
AT+CMT [37] | ALL | 47.0 | 55.7 | 64.5 | 39.4 | 63.2 | 51.9 | 40.3 | 53.1 | 51.9
Ours | ALL | 50.9 | 54.8 | 68.1 | 45.3 | 65.1 | 55.8 | 42.9 | 56.3 | 54.9
Source | ALL | 27.9 | 33.4 | 40.4 | 12.1 | 23.2 | 10.1 | 20.7 | 30.9 | 24.8
Oracle | ALL | 41.2 | 49.1 | 61.6 | 32.6 | 56.6 | 49.0 | 37.9 | 42.4 | 46.3
Table 2. The results for real images to artistic images on Pascal VOC → Clipart1k. All methods are based on ResNet-101 (ours in red). The highest mAP of each class is bold.
Method | Aero | Bcycle | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow | Table | Dog | Horse | m-cyc | Person | Plant | Sheep | Sofa | Train | tv | mAP
Source | 23.0 | 39.6 | 20.1 | 23.6 | 25.7 | 42.6 | 25.2 | 0.9 | 41.2 | 25.6 | 23.7 | 11.2 | 28.2 | 49.5 | 45.2 | 46.9 | 9.1 | 22.3 | 38.9 | 31.5 | 28.8
SCL [32] | 44.7 | 50.0 | 33.6 | 27.4 | 42.2 | 55.6 | 38.3 | 19.2 | 37.9 | 69.0 | 30.1 | 26.3 | 34.4 | 67.3 | 61.0 | 47.9 | 21.4 | 26.3 | 50.1 | 47.3 | 41.5
UMT [1] | 39.6 | 59.1 | 32.4 | 35.0 | 45.1 | 61.9 | 48.4 | 7.5 | 46.0 | 67.6 | 21.4 | 29.5 | 48.2 | 75.9 | 70.5 | 56.7 | 25.9 | 28.9 | 39.4 | 43.6 | 44.1
HTCN [14] | 33.6 | 58.9 | 34.0 | 23.4 | 45.6 | 57.0 | 39.8 | 12.0 | 39.7 | 51.3 | 21.1 | 20.1 | 39.1 | 72.8 | 63.0 | 43.1 | 19.3 | 30.1 | 50.2 | 51.8 | 40.3
DM [38] | 25.8 | 63.2 | 24.5 | 42.4 | 47.9 | 43.1 | 37.5 | 9.1 | 47.0 | 46.7 | 26.8 | 24.9 | 48.1 | 78.7 | 63.0 | 45.0 | 21.3 | 36.1 | 52.3 | 53.4 | 41.8
CMT [37] | 39.8 | 56.3 | 38.7 | 39.7 | 60.4 | 35.0 | 56.0 | 7.1 | 60.1 | 60.4 | 35.8 | 28.1 | 67.8 | 84.5 | 80.1 | 55.5 | 20.3 | 32.8 | 42.3 | 38.2 | 47.0
Ours | 41.6 | 55.9 | 39.8 | 43.1 | 61.8 | 37.2 | 59.4 | 8.1 | 62.6 | 62.9 | 37.0 | 27.7 | 71.2 | 87.9 | 81.4 | 55.3 | 21.7 | 36.2 | 44.9 | 37.6 | 48.6
Oracle | 33.3 | 47.6 | 43.1 | 38.0 | 24.5 | 82.0 | 57.4 | 22.9 | 48.4 | 49.2 | 37.9 | 46.4 | 41.1 | 54.0 | 73.7 | 39.5 | 36.7 | 19.1 | 53.2 | 52.9 | 45.0
Table 3. The results for real images to artistic images on PASCAL VOC → Comic2k. All methods are based on ResNet-101 (ours in red). The highest mAP of each class is bold.
Method | Bike | Bird | Car | Cat | Dog | Person | mAP
Source | 32.5 | 12.0 | 21.1 | 10.4 | 12.4 | 29.9 | 19.7
SWDA [7] | 36.4 | 21.8 | 29.8 | 15.1 | 23.5 | 49.6 | 29.4
HTCN [14] | 50.3 | 15.0 | 27.1 | 9.4 | 18.9 | 46.2 | 27.8
DT+PL [31] | 53.0 | 23.7 | 34.4 | 27.4 | 27.2 | 44.0 | 35.0
AT [2] | 34.0 | 9.1 | 56.2 | 24.6 | 38.8 | 63.6 | 37.7
Ours | 34.0 | 10.7 | 61.1 | 24.9 | 40.2 | 67.4 | 39.7
Oracle | 61.9 | 38.9 | 50.8 | 48.9 | 45.2 | 76.6 | 53.7
Table 4. The results for real images to artistic images on PASCAL VOC → Watercolor2k. All methods are based on ResNet-101 (ours in red). The highest mAP of each class is bold.
Method | bike | bird | car | cat | dog | person | mAP
Source | 68.8 | 46.8 | 37.2 | 32.7 | 21.3 | 60.7 | 44.6
BDC-Faster [7] | 68.6 | 48.3 | 47.2 | 26.5 | 21.7 | 60.5 | 45.5
DA-Faster [8] | 75.2 | 40.6 | 48.0 | 31.5 | 20.6 | 60.0 | 46.0
WST-BSR [43] | 75.6 | 45.8 | 49.3 | 34.1 | 30.3 | 64.1 | 49.9
HTCN [14] | 78.6 | 47.5 | 45.6 | 35.4 | 31.0 | 62.2 | 50.1
AT [2] | 83.8 | 34.3 | 54.2 | 36.1 | 31.6 | 71.3 | 51.9
SCL [32] | 82.2 | 55.1 | 51.8 | 39.6 | 38.4 | 64.0 | 55.2
Ours | 82.0 | 56.5 | 52.5 | 40.0 | 39.1 | 65.0 | 55.9
Oracle | 61.9 | 38.9 | 50.8 | 48.9 | 45.2 | 76.6 | 50.6
While Clipart1k, Comic2k, and Watercolor2k are assessed within the same setting, it should be noted that the domain distance from Pascal VOC to Watercolor2k is considerably smaller than those between Pascal VOC and Clipart1k or Comic2k. To verify this statement, we calculate the domain distances between Pascal VOC (source domain) and Clipart1k, Comic2k, and Watercolor2k (target domains). Strictly speaking, domain divergence should be measured with the $\mathcal{H}\Delta\mathcal{H}$-divergence. Since calculating the $\mathcal{H}\Delta\mathcal{H}$-divergence is complex, we employ the $\mathcal{A}$-distance, which is written as follows:
$$\mathcal{A}(D_s, D_t) = 2(1 - 2\theta),$$
where $\theta$ denotes the domain classification error of a binary classifier that separates the source domain from the target domain. A smaller $\mathcal{A}$-distance indicates a smaller domain distance. Table 5 shows the resulting domain distances. We can observe that the domain distance between Pascal VOC and Watercolor2k is the smallest, corresponding to the best performance. Accordingly, the domain distance between Pascal VOC and Comic2k is the largest, corresponding to the worst performance. Intuitively, watercolor paintings are more similar to real-world images than clipart or comics. Hence, the performances on Watercolor2k of all mentioned methods, including ours, are higher than those on Clipart1k and Comic2k.
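A common way to estimate this proxy A-distance is to train a linear domain classifier on image features and measure its held-out error; the sketch below assumes pre-extracted feature vectors and uses a linear SVM, which is a typical choice rather than necessarily the classifier used in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def proxy_a_distance(feat_src, feat_tgt):
    """Estimate A(D_s, D_t) = 2 * (1 - 2 * theta), where theta is the held-out
    error of a binary classifier separating source from target features."""
    X = np.vstack([feat_src, feat_tgt])
    y = np.hstack([np.zeros(len(feat_src)), np.ones(len(feat_tgt))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y)
    theta = 1.0 - LinearSVC(C=1.0).fit(X_tr, y_tr).score(X_te, y_te)
    return 2.0 * (1.0 - 2.0 * theta)
```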

4.4. Ablation Studies

Experiments conducted in the previous sections demonstrate that our model is robust under various circumstances. In this section, we further conduct ablation studies on each of our important modules.
  • Fourier transform. The effectiveness of our Fourier transform is benchmarked in Table 6. From the results, we observe performance drops of up to 2.0% for the “Severe Weather” setting when the Fourier transform is removed. In addition, the performance drop on the “ALL” split is larger than that on the “0.02” split. Meanwhile, the performance drops for the “Real to Art” setting are only up to 1.3%, which is much smaller. Our Fourier transform is proposed to exchange the low-frequency parts of the amplitudes of images from the two domains. As shown in Figure 6, it works more naturally in the “Severe Weather” setting than in the “Real to Art” setting. In the “Severe Weather” setting, the Fourier transform can effectively convert an original image to its foggy version. In the “Real to Art” setting, however, the Fourier transform can only convert the original car to match the color of the target car. Based on these observations, we infer that the Fourier transform works more effectively when the low-level statistics of the two domains are more similar.
  • The consistent teacher–student framework. We further analyze the importance of our consistent teacher–student framework. The results are shown in Table 6. We replace our consistent teacher–student framework with the vanilla mean teacher and evaluate it on all datasets; varying degrees of performance drops (up to 1.6%) are observed on all datasets. The results show that our model is more robust after enforcing the consistency between the classification and the localization branches in both supervised and unsupervised ways.
  • The contrastive learning. We also analyze the effectiveness of our cross-domain contrastive learning from two aspects. On the one hand, in each iteration, we perturb the pseudo-labels by randomly re-assigning random labels to a fraction of the pseudo-labeled objects. By introducing noise to the pseudo-labels, the domain adaptation pipeline may be compromised. The results are shown in Figure 7. As we perturb the pseudo-labels, the performance of the model without contrastive learning drops considerably (a 4.6% performance drop). In contrast, our model can partially resist the noise and recover accuracy (a 2.8% performance drop). Therefore, contrastive learning helps our model improve stability and avoid collapsing. On the other hand, when we directly remove contrastive learning from our model, we observe performance drops from 1.3% to 2.0% in Table 6. The results demonstrate that contrastive learning can resist the noise in labels and acquire robust visual representations to improve performance.

4.5. Discussion

We propose a novel method for cross-domain object detection that seamlessly integrates the Fourier transform, contrastive learning, and the consistent teacher. First, the Fourier transform alleviates domain gaps, thereby facilitating the effectiveness of the subsequent modules. Then, the enhanced representations derived from contrastive learning and the higher-quality pseudo-labels generated by the consistent teacher mutually reinforce each other. Large performance improvements in both settings demonstrate the effectiveness and adaptability of our model.
  • Training time. We also compare the training time of our method with some state-of-the-art methods. For fair comparison without severe performance drops, we warm up each teacher–student-framework-based method for 20k iterations and then train them for another 10k iterations. For other methods, we train them for 30k iterations. Table 7 shows the training time (in red) of each method. In practice, we notice that the time consumed by different methods during the validation process within the training process is different, so we also present the validation time (in blue) for fair comparison. It can be seen that our method requires only marginally more time compared with other methods. The results in Table 7 and Section 4.3 demonstrate that our method can enhance performance while minimizing the increase in complexity.
  • Application prospects. Given our model’s superior performance, it exhibits broad applicability across various scenarios, especially within autonomous driving. In practice, the weather can change from sunny to severe conditions, e.g., fog, requiring the model to adapt across different domains. Large performance improvements over previous DA methods on Cityscapes and FoggyCityscapes demonstrate that our model adapts well to foggy weather. Therefore, our model may suit the field of autonomous driving.

5. Conclusions

In this paper, we introduce a novel method for cross-domain object detection. First, we introduce the Fourier transform to align the two domains: it exchanges the low-frequency spectra of the source and target domains to transfer target styles to source samples, achieving an initial alleviation of the domain shift between the two domains. Then, to improve the quality of pseudo-labels, we enforce the consistency between the classification and the localization branches in both supervised and unsupervised ways. Finally, to further resist the noise in pseudo-labels and improve accuracy, we introduce contrastive learning to extract powerful visual representations. Extensive experiments and ablation studies demonstrate our method’s effectiveness and its broad applicability, particularly within autonomous driving.

Author Contributions

Conceptualization, X.T., L.Z. and W.L.; methodology, L.J.; validation, X.T.; writing—original draft preparation, L.J.; writing—review and editing, M.J.; visualization, X.T.; supervision, X.T., L.Z. and W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 61877009).

Data Availability Statement

Acknowledgments

Thanks to my seniors and juniors, who firmly supported me when I was under pressure and facing difficulties.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CCTF: Consistent and Contrastive Teacher with Fourier Transform

References

  1. Deng, J.; Li, W.; Chen, Y.; Duan, L. Unbiased mean teacher for cross-domain object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021, Online, 19–25 June 2021; pp. 4091–4101.
  2. Li, Y.J.; Dai, X.; Ma, C.Y.; Liu, Y.C.; Chen, K.; Wu, B.; He, Z.; Kitani, K.; Vajda, P. Cross-domain adaptive teacher for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–22 June 2022; pp. 7581–7590.
  3. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773.
  4. Yan, H.; Ding, Y.; Li, P.; Wang, Q.; Xu, Y.; Zuo, W. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 2272–2281.
  5. Long, M.; Zhu, H.; Wang, J.; Jordan, M.I. Deep transfer learning with joint adaptation networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 2208–2217.
  6. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 63, 139–144.
  7. Saito, K.; Ushiku, Y.; Harada, T.; Saenko, K. Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6956–6965.
  8. Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; Van Gool, L. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3339–3348.
  9. Ganin, Y.; Lempitsky, V. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1180–1189.
  10. Zhu, X.; Pang, J.; Yang, C.; Shi, J.; Lin, D. Adapting object detectors via selective cross-domain alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 687–696.
  11. Soviany, P.; Ionescu, R.T.; Rota, P.; Sebe, N. Curriculum self-paced learning for cross-domain object detection. Comput. Vis. Image Underst. 2021, 204, 103166.
  12. Bao, Z.; Luo, Y.; Tan, Z.; Wan, J.; Ma, X.; Lei, Z. Deep domain-invariant learning for facial age estimation. Neurocomputing 2023, 534, 86–93.
  13. Li, W.; Liu, X.; Yao, X.; Yuan, Y. Scan: Cross domain object detection with semantic conditioned adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, British Columbia, Canada, 18 February–1 March 2022; Volume 36, pp. 1421–1428.
  14. Chen, C.; Zheng, Z.; Ding, X.; Huang, Y.; Dou, Q. Harmonizing transferability and discriminability for adapting object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 14–19 June 2020; pp. 8869–8878.
  15. He, Z.; Zhang, L. Multi-adversarial faster-rcnn for unrestricted object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 27 October–2 November 2019; pp. 6668–6677.
  16. Su, P.; Wang, K.; Zeng, X.; Tang, S.; Chen, D.; Qiu, D.; Wang, X. Adapting object detectors with conditional domain normalization. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XI 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 403–419.
  17. Xu, C.D.; Zhao, X.R.; Jin, X.; Wei, X.S. Exploring categorical regularization for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 11724–11733.
  18. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv 2017.
  19. Cai, Q.; Pan, Y.; Ngo, C.W.; Tian, X.; Duan, L.; Yao, T. Exploring object relation in mean teacher for cross-domain detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 11457–11466.
  20. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 1735–1742.
  21. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML), Online, 12–18 July 2020; pp. 1597–1607.
  22. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved baselines with momentum contrastive learning. arXiv 2020, arXiv:2003.04297.
  23. Chen, X.; He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 15750–15758.
  24. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
  25. Tomasev, N.; Bica, I.; McWilliams, B.; Buesing, L.; Pascanu, R.; Blundell, C.; Mitrovic, J. Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? arXiv 2022, arXiv:2201.05119.
  26. Lai, Z.; Bai, H.; Zhang, H.; Du, X.; Shan, J.; Yang, Y.; Chuah, C.N.; Cao, M. Empowering unsupervised domain adaptation with large-scale pre-trained vision-language models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 8–12 January 2024; pp. 2691–2701.
  27. Yang, Y.; Soatto, S. Fda: Fourier domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4085–4095.
  28. Frigo, M.; Johnson, S.G. FFTW: An adaptive software architecture for the FFT. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), Washington, DC, USA, 12–15 May 1998; Volume 3, pp. 1381–1384.
  29. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
  30. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
  31. Inoue, N.; Furuta, R.; Yamasaki, T.; Aizawa, K. Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5001–5009.
  32. Shen, Z.; Maheshwari, H.; Yao, W.; Savvides, M. Scl: Towards accurate domain adaptive object detection via gradient detach based stacked complementary losses. arXiv 2019, arXiv:1911.02559.
  33. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27 June–1 July 2016; pp. 770–778.
  35. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673.
  36. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738.
  37. Cao, S.; Joshi, D.; Gui, L.Y.; Wang, Y.X. Contrastive Mean Teacher for Domain Adaptive Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 23839–23848.
  38. Kim, T.; Jeong, M.; Kim, S.; Choi, S.; Kim, C. Diversify and match: A domain adaptive representation learning paradigm for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12456–12465.
  39. Vs, V.; Gupta, V.; Oza, P.; Sindagi, V.A.; Patel, V.M. Mega-cda: Memory guided attention for category-aware unsupervised domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 4516–4526.
  40. Zhao, L.; Wang, L. Task-specific inconsistency alignment for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 14217–14226.
  41. Li, W.; Liu, X.; Yuan, Y. Sigma: Semantic-complete graph matching for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 5291–5300.
  42. Hsu, H.K.; Yao, C.H.; Tsai, Y.H.; Hung, W.C.; Tseng, H.Y.; Singh, M.; Yang, M.H. Progressive domain adaptation for object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Colorado Springs, CO, USA, 1–5 March 2020; pp. 749–757.
  43. Kim, S.; Choi, J.; Kim, T.; Kim, C. Self-training and adversarial background regularization for unsupervised domain adaptive one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6092–6101.
Figure 1. The impact of domain gaps. (a) shows results output by a biased model; (b) shows the ground truth. The results demonstrate that models exhibit performance degradation when facing domain gaps, leading to inaccurate outputs.
Figure 2. Examples of Fourier transform. After the Fourier transform, the low-frequency part of the amplitude of source samples is replaced by that of target samples, and the domain distance is narrowed.
Figure 3. Overview of our method. We first introduce the Fourier transform to exchange the low-frequency part of the image amplitude of two domains. Then, we optimize the teacher–student framework by enforcing the consistency between the classification branch and the localization branch. Simultaneously, we utilize cross-domain contrastive learning between source images and target images to acquire robust and reliable visual representations.
Figure 4. Visualizations of the datasets for domain adaptation from realistic images to artistic images (VOC as the source domain and the others as target domains). Large domain gaps can be observed.
Figure 5. Visualizations of detection results for our model. It can be observed that our model can accurately detect objects within images.
Figure 6. Fourier transform on the “Severe Weather” setting and the “Real to Art” setting. For the “Severe Weather” setting, the original image is well converted to its foggy version. For the “Real to Art” setting, Fourier transform adaptation is limited to converting the original car to match the color of the target car.
Figure 7. Impacts of pseudo-label noise on Clipart1k performance. Models w/o contrastive learning suffer great performance drops caused by noise in pseudo-labels. Our model introduces contrastive learning to help resist the noisy pseudo-labels and reduce performance instability.
Table 5. The results of the A-distance between the real-image domain and the artistic-image domains. A smaller A-distance value indicates a smaller domain distance.
Domains | Clipart1k | Comic2k | Watercolor2k
A-distance | 1.15 | 1.27 | 0.73
Table 6. Results of the ablation studies on our Consistent and Contrastive Teacher. Mean average precision (mAP, %) is reported on each module (performance drops in red).
Model | Foggy “0.02” | Foggy “ALL” | Clipart1k | Comic2k | Watercolor2k
Ours | 52.4 | 54.9 | 48.6 | 39.7 | 55.9
w/o Fourier transform | 50.9 (−1.5) | 52.9 (−2.0) | 47.5 (−1.1) | 38.7 (−1.0) | 54.6 (−1.3)
w/o consistent teacher | 51.0 (−1.4) | 53.3 (−1.6) | 47.3 (−1.3) | 38.5 (−1.2) | 54.3 (−1.6)
w/o contrastive learning | 51.0 (−1.4) | 53.4 (−1.5) | 48.2 (−1.4) | 38.4 (−1.3) | 53.9 (−2.0)
Table 7. Results (in red) of the training time of our method and some SOTA methods. For fair comparison, the validation time within the training process (blue) is also reported.
Method | City → Foggy | VOC → Clipart
AT | 4.5 h (−1.5 h) | 2.5 h (−10 min)
CMT+AT | 4.5 h (−1.0 h) | 2.5 h (−7.5 min)
CMT+PT | 5.0 h (−1.5 h) | —
SIGMA | 5.0 h (−1.0 h) | —
Ours | 4.75 h (−1.0 h) | 2.75 h (−7.5 min)
