*Article* **BiFDANet: Unsupervised Bidirectional Domain Adaptation for Semantic Segmentation of Remote Sensing Images**

**Yuxiang Cai 1, Yingchun Yang 1,\*, Qiyi Zheng 1, Zhengwei Shen 2,3, Yongheng Shang 2,3, Jianwei Yin 1,4,5 and Zhongtian Shi <sup>6</sup>**


**Abstract:** When segmenting massive amounts of remote sensing images collected from different satellites or geographic locations (cities), pre-trained deep learning models cannot always output satisfactory predictions. To deal with this issue, domain adaptation has been widely utilized to enhance the generalization ability of segmentation models. Most existing domain adaptation methods, which are based on image-to-image translation, first transfer the source images to pseudo-target images and then adapt the classifier from the source domain to the target domain. However, these unidirectional methods suffer from two limitations: (1) they do not consider the inverse procedure, so they cannot fully exploit the information from the other domain, which our experiments confirm is also beneficial; (2) they may fail in cases where transferring the source images to pseudo-target images is difficult. In this paper, to solve these problems, we propose BiFDANet, a novel framework for unsupervised bidirectional domain adaptation in the semantic segmentation of remote sensing images. It optimizes the segmentation models in two opposite directions. In the source-to-target direction, BiFDANet learns to transfer the source images to pseudo-target images and adapts the classifier to the target domain. In the opposite direction, BiFDANet transfers the target images to pseudo-source images and optimizes the source classifier. At the test stage, we make the best of the source classifier and the target classifier, which complement each other, with a simple linear combination method, further improving the performance of BiFDANet. Furthermore, we propose a new bidirectional semantic consistency loss for BiFDANet to maintain semantic consistency during the bidirectional image-to-image translation process.
Experiments on two datasets, including satellite images and aerial images, demonstrate the superiority of our method over existing unidirectional methods.

**Keywords:** unsupervised domain adaptation; bidirectional domain adaptation; convolutional neural networks (CNNs); image-to-image translation; generative adversarial networks (GANs); remote sensing images; semantic segmentation

**Citation:** Cai, Y.; Yang, Y.; Zheng, Q.; Shen, Z.; Shang, Y.; Yin, J.; Shi, Z. BiFDANet: Unsupervised Bidirectional Domain Adaptation for Semantic Segmentation of Remote Sensing Images. *Remote Sens.* **2022**, *14*, 190. https://doi.org/10.3390/rs14010190

Academic Editors: Fahimeh Farahnakian, Jukka Heikkonen and Pouya Jafarzadeh

Received: 30 November 2021; Accepted: 28 December 2021; Published: 1 January 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

In the last few years, it has become possible to collect a mass of remote sensing images, thanks to the continuous advancement of remote sensing techniques. For example, Gaofen satellites can capture a large number of satellite images with high spatial resolution on a large scale. In remote sensing, such a large amount of data has enabled many image analysis tasks, for example, semantic segmentation [1], change detection [2] and scene classification [3]. Among these tasks, the semantic segmentation of remote sensing images has become one of the most interesting and important research topics because it is widely used in many applications, such as dense labeling, city planning, urban management, environment monitoring, and so on.

For the semantic segmentation of remote sensing images, CNNs [4] have become one of the most effective methods in the past decade, and several CNN models have shown their effectiveness, such as DeepLab [5] and its variants [6,7]. However, these methods have limitations, because CNN-based architectures tend to be sensitive to the distributions and features of the training and test images. Although they give satisfactory predictions when the distributions of the training and test images are similar [1], when we attempt to use such a model to classify images obtained from other satellites or cities, the classification accuracy decreases severely due to the different distributions of the source and target images, as shown in Figure 1. In the literature, this problem is known as domain adaptation [8]. In remote sensing, domain gaps often arise for many reasons, such as illumination conditions, imaging times, imaging sensors, geographic locations and so on. These factors change the spectral characteristics of objects and result in large intra-class variability. For instance, images acquired from different satellite sensors may have different colors, as shown in Figure 1a,b. Similarly, due to differences in the imaging sensors, images may have different types of channels. For example, some images may consist of near-infrared, green, and red channels while others may have green, red, and blue bands.

In typical domain adaptation problems, the distributions of the source domain differ from those of the target domain. In remote sensing, we assume that images collected from different satellites or locations (cities) constitute different domains. Unsupervised domain adaptation assumes that only annotations of the source domain are available and aims at generating satisfactory predicted labels for the unlabeled target domain, even if the domain shift between the source and target domains is huge. To improve the performance of segmentation models in such settings, one of the most common approaches in remote sensing is to diversify the training images of the source domain by performing data augmentation techniques, such as random color change [9], histogram equalization [10], and gamma correction [11]. However, even if these methods slightly increase the generalization capabilities of the models, the improvement is unsatisfactory when there exist huge differences between the distributions of different domains. For example, it is difficult to adapt a classifier from one domain with near-infrared, red, and green bands to another with red, green and blue channels by using simple data augmentation techniques. To overcome this limitation, generative adversarial networks [12] have been applied to transfer images between the source and target domains and have made significant progress in unsupervised domain adaptation for semantic segmentation [13,14]. These image-translation-based approaches can be divided into two steps. First, a model learns to transfer the source images to the target domain. Second, the translated images and the labels of the corresponding source images are used to train the classifier, which will be tested on the unlabeled target domain. When the first step reduces the domain shift, the second step can effectively adapt the segmentation model to the target domain.
In addition, inverse translations, which adapt the segmentation model from the target domain to the source domain, have been implemented as well [15]. In our experiments, we find that these two translations in opposite directions should be complementary rather than alternative. Furthermore, a unidirectional (e.g., source-to-target) setting might ignore the information from the inverse direction. For example, Benjdira et al. [16] adapted the source classifier to the unlabeled target domain, but they only simulated the distributions of the target images instead of making the target images fully participate in domain adaptation. Therefore, these unidirectional methods cannot take full advantage of the information from the target domain. Meanwhile, the key to domain adaptation methods based on image translation is the similarity between the distributions of the pseudo-target images and the target images. Given fixed image translation models, this depends on the difficulty of converting between the two domains: there might be situations where transferring the target images to the source domain is more difficult, and situations where transferring the source images to the target domain is more difficult. By combining the two opposite directions, we acquire an architecture more general than those unidirectional methods. Furthermore, recent image translation networks (e.g., CycleGAN [17]) are bidirectional, so we usually obtain two image generators, in the source-to-target and target-to-source directions, when the training of the image translation model is done. We can use both generators to make the best of the information from the two directions.


**Figure 1.** An example of domain adaptation. We show the source image and the target image, which are obtained from different satellites, the label of the target image and the prediction of DeepLabV3+. In the label and the prediction, black and white pixels represent background and buildings, respectively. (**a**) Source image. (**b**) Target image. (**c**) Label of the target image. (**d**) Prediction for the target image.

However, solving the aforementioned problems presents a few challenges. First, the transformed images must have the same semantic contents as their corresponding original images. For instance, if the image-to-image translation model replaces buildings with bare land during the translation, the labels of the original images will no longer match the transformed images. As a result, semantic changes in either direction will affect our models. If semantic changes occur in the source-to-target direction, the target domain classifier will perform poorly. If the approach replaces some objects with others in the target-to-source direction, the predicted labels of the source domain classifier will be unsatisfactory. Second, when we transfer the source images to the target domain, the data distributions of the pseudo-target images should be as similar as possible to those of the target images, and the data distributions of the pseudo-source and source images should be similar as well. Otherwise, the transformed images of one domain cannot represent the other domain. Finally, the predicted labels of the two directions complement each other, and the method of combining them is crucial because it affects the final predicted labels. Simply combining the two predicted labels may leave out some correct objects or add some wrong objects.

In this article, we propose a new bidirectional model to address the above challenges. The framework involves two opposite directions. In the source-to-target direction, we generate pseudo-target transformed images which are semantically consistent with the original images. For this purpose, we propose a bidirectional semantic consistency loss to maintain semantic consistency during the image translation. Then we employ the labels of the source images and their corresponding transformed images to adapt the segmentation model to the target domain. In the target-to-source direction, we optimize the source domain classifier to predict labels for the pseudo-source transformed images. These two classifiers may make different types of mistakes and assign different confidence ranks to the predicted labels. Overall, the two classifiers are complementary instead of alternative. We make full use of them with a simple linear method which fuses their probability outputs. Our contributions are as follows:


This article is organized as follows: Section 2 summarizes the related works. Section 3 presents the theory of our proposed framework. Section 4 describes the dataset and the experimental design and discusses the obtained results. Section 5 provides the discussion and Section 6 draws our conclusions.

#### **2. Related Work**

#### *2.1. Domain Adaptation*

Tuia et al. [8] explained that, in the research literature, adaptation methods can be grouped as: the selection of invariant features [18–21], the adaptation of classifiers [22–27], the adaptation of the data distributions [28–31] and active learning [32–34]. Here we focus on methods that align the data distributions by performing image-to-image translation [35–39] between the different domains [40–43]. These methods usually match the data distributions of different domains by transferring the images from the source domain to the target domain. Next, the segmentation model is trained on the transferred images to classify the target images. In the field of computer vision, Gatys et al. [40] proposed a style transfer method to synthesize fake images by combining the source contents with the target style. Similarly, Shrivastava et al. [41] generated realistic samples from synthetic images, and the synthesized images could train a classification model on real images. Bousmalis et al. [42] learned the source-to-target transformation in the pixel space and transformed source images to target-like images. Taigman et al. [44] proposed a compound loss function to enforce the image generation network to map target images to themselves. Hoffman et al. [14] used CycleGAN [17] to transfer the source images into the target style, and the transformed images were input into the classifier to improve its performance in the target domain. Zhao et al. [45] transformed fake images to the target domain, performing pixel-level and feature-level alignments with sub-domain aggregation. The segmentation model trained on such transformed images with the style of the target domain outperformed several unsupervised domain adaptation approaches. In remote sensing, graph matching [46] and histogram matching [47] were employed to perform the abovementioned image-to-image translation. Benjdira et al. [16] generated fake target-like images by using CycleGAN [17]; then, the target-like images were used to adapt the source classifier to segment the target images. Similarly, Tasar et al. proposed ColorMapGAN [48], SemI2I [49] and DAugNet [50] to perform image-to-image translation between satellite image pairs to reduce the impact of the domain gap. All the above-mentioned methods focus on adapting the source segmentation model to the target domain without taking into account the opposite target-to-source direction, which is also beneficial.

#### *2.2. Bidirectional Learning*

Bidirectional learning was first used to approach the neural machine translation problem [51,52], where a language translation system is trained in the two opposite directions of a language pair. Compared with unidirectional learning, it can effectively improve the performance of the model. Recently, bidirectional learning has been applied to image-to-image translation problems as well. Li et al. [53] learned the image translation model and the segmentation adaptation model alternately with a bidirectional learning method. Chen et al. [54] presented a bidirectional cross-modality adaptation method that aligned different domains from the feature and image perspectives. Zhang et al. [55] adapted the model by minimizing the pixel-level and feature-level gaps. These methods do not optimize the segmentation model in the target-to-source direction. Yang et al. [56] proposed a bi-directional generation network that trained a simple framework for image translation and classification from source to target and from target to source. Jiang et al. [57] proposed a bidirectional adversarial training method which performs adversarial training with adversarial examples generated from source to target and back. These methods only use bidirectional learning techniques in the training process; at test time, they do not make full use of the two domains even though they have optimized the classifiers in both directions. Russo et al. [58] proposed a bidirectional image translation approach which trained two classifiers on the different domains respectively and finally fused the classification results. However, semantic segmentation is sensitive to per-pixel categories, while classification focuses on an image-level category. Their method can therefore only deal with classification tasks and cannot be applied to semantic segmentation directly, because it may not preserve the semantic contents.

#### **3. Materials and Methods**

The unsupervised domain adaptation setting assumes that the labeled source domain (*XS*, *YS*) and the unlabeled target domain *XT* are available. The goal is to train a framework which correctly predicts labels for the unlabeled target domain *XT*.

The proposed BiFDANet consists of bidirectional image translation and bidirectional segmentation adaptation. It learns to transfer source images to the target domain and target images to the source domain, and then optimizes the source classifier *FS* and the target classifier *FT* in the two opposite directions. In this section, we detail how we transfer images between the source and target domains. Then we introduce how we adapt the classifier *FT* to the target domain and optimize the classifier *FS* in the target-to-source direction. Thereafter, we describe how we combine the predicted results of the two classifiers *FS* and *FT*. Finally, we illustrate the implementations of the network architectures.

#### *3.1. Bidirectional Image Translation*

To perform bidirectional image translation between different domains, we use two generators and two discriminators based on the GAN [12] architecture, and we add two classifiers to extract the contents from the images. *GS*→*<sup>T</sup>* denotes the target generator, which generates pseudo-target images, while *GT*→*<sup>S</sup>* denotes the source generator, which generates pseudo-source images. *DS* and *DT* denote the discriminators, and *FS* and *FT* are the classifiers.

First of all, we want the source images *xs* and the pseudo-source images *GT*→*S*(*xt*) to be drawn from similar data distributions, while the target images *xt* and the pseudo-target images *GS*→*T*(*xs*) have similar data distributions. To this end, we enforce the data distributions of the pseudo-target images *GS*→*T*(*xs*) and the pseudo-source images *GT*→*S*(*xt*) to be similar to those of the target domain and the source domain, respectively, by applying adversarial learning (see Figure 2, blue portion). The discriminator *DS* discriminates between the source images and the pseudo-source images, while the discriminator *DT* distinguishes the pseudo-target images from the target domain. We train the generators to fool the discriminators, while the discriminators *DT* and *DS* attempt to identify the real images of the target domain and the source domain. The adversarial loss for the target generator *GS*→*<sup>T</sup>* and the discriminator *DT* in the source-to-target direction is as follows:

$$\mathcal{L}_{adv}^{S\rightarrow T}(D_T, G_{S\rightarrow T}) = \mathbb{E}_{\mathbf{x}_t\sim X_T}[\log D_T(\mathbf{x}_t)] + \mathbb{E}_{\mathbf{x}_s\sim X_S}[\log(1 - D_T(G_{S\rightarrow T}(\mathbf{x}_s)))] \tag{1}$$

where E*xs*∼*XS* and E*xt*∼*XT* denote the expectations over *xs* and *xt* drawn from the distributions *XS* and *XT*, respectively. *GS*→*<sup>T</sup>* tries to generate pseudo-target images *GS*→*T*(*xs*) whose data distribution is similar to that of the target images *xt*, while *DT* learns to discriminate the pseudo-target images from the target domain.
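As an illustration, the expectations in Equation (1) can be approximated with batch averages of the discriminator outputs. The following NumPy sketch (function and argument names are ours, not from the paper) evaluates the source-to-target adversarial objective:

```python
import numpy as np

def adv_loss_s2t(d_real_target, d_fake_target):
    """Monte Carlo estimate of Eq. (1).

    d_real_target: D_T(x_t)           -- discriminator scores in (0, 1) on real target images
    d_fake_target: D_T(G_{S->T}(x_s)) -- scores on pseudo-target images
    D_T maximizes this value, while G_{S->T} minimizes the second term.
    """
    return np.mean(np.log(d_real_target)) + np.mean(np.log(1.0 - d_fake_target))
```

A confident discriminator (scoring real images near 1 and fakes near 0) yields a value near 0; a fully fooled one (scoring both at 0.5) yields 2·log 0.5, which is the minimax equilibrium of the original GAN objective.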

**Figure 2.** BiFDANet, training: The **top row** (black solid arrow) shows the source-to-target direction while the **bottom row** (black dashed arrow) shows the target-to-source direction. The colored dashed arrows correspond to different losses. The generator *Gs*→*<sup>T</sup>* transfers the images to the pseudo-target images while the generator *GT*→*<sup>S</sup>* transfers the images to the source domain. *DS* and *DT* discriminate the images from the source domain and the target domain. *FS* and *FT* segment the images which are drawn from source domain and target domain, respectively.

This objective ensures that the pseudo-target images *GS*→*T*(*xs*) will resemble the images drawn from the target domain *XT*. We use a similar adversarial loss in the target-tosource direction:

$$\mathcal{L}_{adv}^{T\rightarrow S}(D_S, G_{T\rightarrow S}) = \mathbb{E}_{\mathbf{x}_s\sim X_S}[\log D_S(\mathbf{x}_s)] + \mathbb{E}_{\mathbf{x}_t\sim X_T}[\log(1 - D_S(G_{T\rightarrow S}(\mathbf{x}_t)))] \tag{2}$$

This objective ensures that the pseudo-source images *GT*→*S*(*xt*) will resemble the images drawn from the source domain *XS*. We compute the overall adversarial loss for the generators and the discriminators as:

$$\mathcal{L}_{adv}(D_S, D_T, G_{S\rightarrow T}, G_{T\rightarrow S}) = \mathcal{L}_{adv}^{S\rightarrow T}(D_T, G_{S\rightarrow T}) + \mathcal{L}_{adv}^{T\rightarrow S}(D_S, G_{T\rightarrow S}) \tag{3}$$

Another purpose is to keep the original images and the transformed images semantically consistent. Otherwise, the transformed images will not match the labels of the original images, and the performance of the classifiers will significantly decrease. To keep the semantic consistency between the transformed images and the original images, we define three constraints.

Firstly, we introduce a cycle-consistency constraint [17] to preserve the semantic contents during the translation process (see Figure 2 red portion). We encourage that transferring the source images from source to target and back reproduces the original contents. At the same time, transferring the target images from target to source and back to the target domain reproduces the original contents. These constraints are satisfied by imposing the cycle-consistency loss defined in the following equation:

$$\begin{split} \mathcal{L}_{cyc}(G_{S\rightarrow T}, G_{T\rightarrow S}) = \\ \mathbb{E}_{\mathbf{x}_s\sim X_S}[\|G_{T\rightarrow S}(G_{S\rightarrow T}(\mathbf{x}_s)) - \mathbf{x}_s\|_1] + \mathbb{E}_{\mathbf{x}_t\sim X_T}[\|G_{S\rightarrow T}(G_{T\rightarrow S}(\mathbf{x}_t)) - \mathbf{x}_t\|_1] \end{split} \tag{4}$$
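A minimal NumPy sketch of Equation (4), with the generators passed in as callables (helper names are ours):

```python
import numpy as np

def cycle_loss(x_s, x_t, g_s2t, g_t2s):
    """Cycle-consistency loss of Eq. (4): mean L1 distance between each
    image batch and its round trip through both generators."""
    source_roundtrip = g_t2s(g_s2t(x_s))  # S -> T -> S
    target_roundtrip = g_s2t(g_t2s(x_t))  # T -> S -> T
    return (np.mean(np.abs(source_roundtrip - x_s))
            + np.mean(np.abs(target_roundtrip - x_t)))
```

If the two generators were perfect inverses of each other, the loss would be exactly zero; any content dropped or altered during translation shows up as a positive L1 residual.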

Secondly, we require that *GT*→*<sup>S</sup>* applied to the source images *xs* and *GS*→*<sup>T</sup>* applied to the target images *xt* reproduce the original images, thereby enforcing identity consistency (see Figure 2, orange portion). This constraint is implemented by the identity loss defined as follows:

$$\begin{aligned} \mathcal{L}_{idt}(G_{S\rightarrow T}, G_{T\rightarrow S}) &= \\ \mathbb{E}_{\mathbf{x}_t\sim X_T}[\|G_{S\rightarrow T}(\mathbf{x}_t) - \mathbf{x}_t\|_1] + \mathbb{E}_{\mathbf{x}_s\sim X_S}[\|G_{T\rightarrow S}(\mathbf{x}_s) - \mathbf{x}_s\|_1] \end{aligned} \tag{5}$$

The identity loss L*idt* can be divided into two parts: the source-to-target identity loss Equation (6) and the target-to-source identity loss Equation (7). These two parts are as follows:

$$\mathcal{L}\_{\text{idt}}^{\text{S}\rightarrow T}(\mathbf{G}\_{\text{S}\rightarrow T}) = \mathbb{E}\_{\mathbf{x}\_{t}\sim X\_{T}}[||\mathbf{G}\_{\text{S}\rightarrow T}(\mathbf{x}\_{t}) - \mathbf{x}\_{t}||\_{1}] \tag{6}$$

$$\mathcal{L}_{idt}^{T\rightarrow S}(G_{T\rightarrow S}) = \mathbb{E}_{\mathbf{x}_s\sim X_S}[\|G_{T\rightarrow S}(\mathbf{x}_s) - \mathbf{x}_s\|_1] \tag{7}$$
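Equations (6) and (7) can be sketched the same way (names ours; each generator is fed an image that already belongs to its output domain):

```python
import numpy as np

def idt_loss_s2t(x_t, g_s2t):
    """Eq. (6): G_{S->T} fed a real target image should leave it unchanged."""
    return np.mean(np.abs(g_s2t(x_t) - x_t))

def idt_loss_t2s(x_s, g_t2s):
    """Eq. (7): G_{T->S} fed a real source image should leave it unchanged."""
    return np.mean(np.abs(g_t2s(x_s) - x_s))
```

The identity terms discourage gratuitous changes (e.g., global color shifts) when no translation is actually needed.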

Thirdly, we enforce the transformed images to be semantically consistent with the original images. CyCADA [14] proposed a semantic consistency loss to maintain the semantic contents, in which the source images *xs* and the transformed images *GS*→*T*(*xs*) are fed into the source classifier *FS* pretrained on the labeled source domain. However, since the transformed images *GS*→*T*(*xs*) are drawn from the target domain, a classifier trained on the source domain cannot effectively extract the semantic contents from the transformed images. As a result, computing the semantic consistency loss in this way is not conducive to the image generation. Ideally, the transformed images *GS*→*T*(*xs*) should be input to the target classifier *FT*. However, this is impractical because the labels of the target domain are not available. Instead of using the source classifier *FS* to segment the transformed images *GS*→*T*(*xs*), MADAN [45] proposed to dynamically adapt the source classifier *FS* to the target domain by taking the transformed images *GS*→*T*(*xs*) and the source labels as input. They then employed the classifier trained on the transformed domain as *FT*, which performs better than the original classifier. The semantic consistency loss computed by *FT* promotes the generator *GS*→*<sup>T</sup>* to generate images that preserve more semantic contents of the original images. However, MADAN only considers the generator *GS*→*<sup>T</sup>* and ignores the generator *GT*→*S*, which is crucial to bidirectional image translation. For bidirectional domain adaptation, we expect both the source generator *GT*→*<sup>S</sup>* and the target generator *GS*→*<sup>T</sup>* to maintain semantic consistency during the image-to-image translation process. Therefore, we propose a new bidirectional semantic consistency loss (see Figure 2, green portion). The proposed bidirectional semantic consistency loss is:

$$\begin{aligned} \mathcal{L}_{sem}(G_{S\rightarrow T}, G_{T\rightarrow S}, F_S, F_T) &= \\ \mathbb{E}_{\mathbf{x}_s\sim X_S} KL(F_S(\mathbf{x}_s) \,\|\, F_T(G_{S\rightarrow T}(\mathbf{x}_s))) + \mathbb{E}_{\mathbf{x}_t\sim X_T} KL(F_T(\mathbf{x}_t) \,\|\, F_S(G_{T\rightarrow S}(\mathbf{x}_t))) \end{aligned} \tag{8}$$

where *KL*(·‖·) is the Kullback–Leibler (KL) divergence.

Our proposed bidirectional semantic consistency loss can be divided into two parts: source-to-target semantic consistency loss Equation (9) and target-to-source semantic consistency loss Equation (10). These two parts are as follows:

$$\mathcal{L}\_{\text{sem}}^{\mathcal{S}\rightarrow T}(\mathcal{G}\_{\mathcal{S}\rightarrow T}, F\_T) = \mathbb{E}\_{\mathbf{x}\_{\mathcal{S}}\sim\mathcal{X}\_{\mathcal{S}}} KL(F\_{\mathcal{S}}(\mathbf{x}\_{\mathcal{S}}) \| F\_T(\mathcal{G}\_{\mathcal{S}\rightarrow T}(\mathbf{x}\_{\mathcal{S}}))) \tag{9}$$

$$\mathcal{L}_{sem}^{T\rightarrow S}(G_{T\rightarrow S}, F_S) = \mathbb{E}_{\mathbf{x}_t\sim X_T} KL(F_T(\mathbf{x}_t) \,\|\, F_S(G_{T\rightarrow S}(\mathbf{x}_t))) \tag{10}$$
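Treating the classifiers as producing per-pixel logits, Equations (9) and (10) can be sketched in NumPy as follows (the softmax/KL helpers and all argument names are ours, not from the paper):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_div(p, q, eps=1e-8):
    """Mean KL(p || q) over pixels; p and q are (N, C) class probabilities."""
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def bidir_sem_loss(fs_on_xs, ft_on_gxs, ft_on_xt, fs_on_gxt):
    """Eqs. (9) + (10): the source prediction on x_s should match the target
    prediction on G_{S->T}(x_s), and symmetrically for the other direction.
    All four inputs are (N, C) logit arrays."""
    s2t = kl_div(softmax(fs_on_xs), softmax(ft_on_gxs))  # Eq. (9)
    t2s = kl_div(softmax(ft_on_xt), softmax(fs_on_gxt))  # Eq. (10)
    return s2t + t2s
```

The loss vanishes only when both generators leave the predicted class distributions unchanged, which is exactly the semantic-preservation property the paper requires of the bidirectional translation.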

#### *3.2. Bidirectional Segmentation Adaptation*

Our adaptation includes the source-to-target direction and the target-to-source direction as shown in Figure 2.

#### 3.2.1. Source-to-Target Adaptation

To reduce the domain gap, we train the generator *GS*→*<sup>T</sup>* with the adversarial loss of Equation (1), the cycle-consistency loss of Equation (4), the identity loss of Equation (6) and the semantic consistency loss of Equation (9) to map the source images *xs* to the pseudo-target images (see Figure 2, top row). Note that the semantic labels are not changed by the generator *GS*→*T*. Therefore, we can train the target classifier *FT* with the transformed images *GS*→*T*(*xs*) and the ground truth segmentation labels of the original source images *xs* (see Figure 2, gray portion). For C-way semantic segmentation, the classifier loss is defined as:

$$\begin{split} \mathcal{L}_{F_T}(G_{S\rightarrow T}(\mathbf{x}_s), F_T) &= \\ -\mathbb{E}_{G_{S\rightarrow T}(\mathbf{x}_s)\sim G_{S\rightarrow T}(X_S)} \sum_{c=1}^{C} \mathbb{I}_{[c=y_s]} \log(softmax(F_T^{(c)}(G_{S\rightarrow T}(\mathbf{x}_s)))) \end{split} \tag{11}$$

where *C* denotes the number of categories and I[*c*=*ys*] is an indicator that keeps the loss term only for the ground truth class *c* = *ys*.
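Equation (11) is a standard per-pixel cross-entropy; a NumPy sketch over flattened pixels (names and shapes are ours):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def classifier_loss(logits, labels):
    """Eq. (11): cross-entropy over C classes.
    logits: (N, C) classifier outputs for N pixels of G_{S->T}(x_s);
    labels: (N,) ground-truth class ids of the original source pixels.
    The indicator I_[c = y_s] simply picks the ground-truth class column."""
    probs = softmax(logits)
    n = labels.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
```

With uniform logits over C classes, the loss equals log C, a useful sanity check for an untrained classifier.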

Putting the above together, the framework optimizes the following objective function in the source-to-target direction:

$$\begin{aligned} \min_{\substack{G_{S\rightarrow T} \\ F_T}} \max_{D_T} & \; \lambda_1 \mathcal{L}_{adv}(G_{S\rightarrow T}, D_T) + \lambda_2 \mathcal{L}_{cyc}(G_{S\rightarrow T}, G_{T\rightarrow S}) \\ & + \lambda_3 \mathcal{L}_{idt}^{S\rightarrow T}(G_{S\rightarrow T}) + \lambda_4 \mathcal{L}_{sem}^{S\rightarrow T}(G_{S\rightarrow T}, F_T) + \lambda_5 \mathcal{L}_{F_T}(G_{S\rightarrow T}(\mathbf{x}_s), F_T) \end{aligned} \tag{12}$$

#### 3.2.2. Target-to-Source Adaptation

We take into account the opposite target-to-source direction and employ a symmetrical framework (Figure 2, black dashed arrow). In this direction, we optimize the generator *GT*→*<sup>S</sup>* with the adversarial loss of Equation (2), the cycle-consistency loss of Equation (4), the identity loss of Equation (7) and the semantic consistency loss of Equation (10) to map the target images *xt* to the pseudo-source images *GT*→*S*(*xt*) (see Figure 2, bottom row). Then, we use the source classifier *FS* to segment the pseudo-source images *GT*→*S*(*xt*) and compute the semantic consistency loss of Equation (10) instead of a classifier loss, because the ground truth segmentation labels for the target images are not available. The segmentation model *FS* is trained on the labeled source images *xs* with the following classifier loss (see Figure 2, gray portion):

$$\mathcal{L}_{F_S}(X_S, F_S) = -\mathbb{E}_{\mathbf{x}_s\sim X_S} \sum_{c=1}^{C} \mathbb{I}_{[c=y_s]} \log(softmax(F_S^{(c)}(\mathbf{x}_s))) \tag{13}$$

Collecting the above components, the target-to-source part of the framework optimizes the objective function as follows:

$$\begin{aligned} \min_{\substack{G_{T\rightarrow S} \\ F_S}} \max_{D_S} & \; \lambda_1 \mathcal{L}_{adv}(G_{T\rightarrow S}, D_S) + \lambda_2 \mathcal{L}_{cyc}(G_{T\rightarrow S}, G_{S\rightarrow T}) \\ & + \lambda_3 \mathcal{L}_{idt}^{T\rightarrow S}(G_{T\rightarrow S}) + \lambda_4 \mathcal{L}_{sem}^{T\rightarrow S}(G_{T\rightarrow S}, F_S) + \lambda_6 \mathcal{L}_{F_S}(X_S, F_S) \end{aligned} \tag{14}$$

#### *3.3. Bidirectional Domain Adaptation*

Combining the above two directions, we obtain the complete loss function of BiFDANet:

$$\begin{aligned} \mathcal{L}\_{BiFDANet}(G\_{S\rightarrow T}, G\_{T\rightarrow S}, D\_{S}, D\_{T}, F\_{S}, F\_{T}) = {} & \lambda\_{1} \mathcal{L}\_{adv} + \lambda\_{2} \mathcal{L}\_{cyc} + \lambda\_{3} \mathcal{L}\_{idt} \\ & + \lambda\_{4} \mathcal{L}\_{sem} + \lambda\_{5} \mathcal{L}\_{F\_{T}} + \lambda\_{6} \mathcal{L}\_{F\_{S}} \end{aligned} \tag{15}$$

where *λ*1, *λ*2, *λ*3, *λ*4, *λ*5 and *λ*6 control the relative weights of the six objectives.
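The weighted sum in Equation (15) can be assembled as follows, assuming the six individual loss terms have already been computed; the weight values are those reported in Section 4.2, and the helper names are our own:

```python
# Weights from Section 4.2: lambda_1..lambda_6 = 1, 10, 5, 10, 10, 10
# (lambda_4 is set to 0 during the first training stage).
LAMBDAS = dict(adv=1.0, cyc=10.0, idt=5.0, sem=10.0, f_t=10.0, f_s=10.0)

def bifdanet_loss(terms, lambdas=LAMBDAS):
    """Linear combination of the six objectives in Eq. (15).

    terms: dict mapping 'adv', 'cyc', 'idt', 'sem', 'f_t', 'f_s'
    to already-computed scalar losses (floats or tensors).
    """
    return sum(lambdas[k] * terms[k] for k in lambdas)

# toy check with unit losses
total = bifdanet_loss({k: 1.0 for k in LAMBDAS})
```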

The training process corresponds to solving for the generators *GS*→*<sup>T</sup>* and *GT*→*S*, the source classifier *FS* and the target classifier *FT* according to the optimization:

$$G\_{S\rightarrow T}^{\*}, G\_{T\rightarrow S}^{\*}, F\_{S}^{\*}, F\_{T}^{\*} = \arg\min\_{\begin{subarray}{c}G\_{S\rightarrow T}, G\_{T\rightarrow S} \\ F\_{S}, F\_{T}\end{subarray}} \max\_{D\_{S}, D\_{T}} \mathcal{L}\_{BiFDANet} \tag{16}$$

#### *3.4. Linear Combination Method*

The target classifier *FT* is trained on the pseudo-target domain, whose data distribution is similar to that of the target domain, and segments the target images. The source segmentation model *FS* is optimized on the source domain and segments the pseudo-source images *GT*→*S*(*xt*). These two classifiers make different types of mistakes and assign different confidence ranks to the predicted labels. In short, the predictions of the two classifiers are complementary rather than alternative. When fusing them, we should remove as many wrong objects as possible from both predictions while preserving the correct objects. For this purpose, we design a simple method which linearly combines their probability outputs as follows:

$$output = \lambda F\_S(G\_{T \to S}(\mathbf{x}\_t)) + (1 - \lambda)F\_T(\mathbf{x}\_t) \tag{17}$$

where *λ* is a hyperparameter in the range (0, 1).

Then, we convert the probability output to the predicted labels. A schematic illustration of the linear combination method is shown in Figure 3.
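The fusion step of Equation (17) followed by the conversion to label maps can be sketched as below (function name and toy shapes are illustrative assumptions):

```python
import torch

def fuse_predictions(prob_s, prob_t, lam=0.5):
    """Eq. (17): linearly combine the softmax outputs of F_S (applied to
    the pseudo-source image G_{T->S}(x_t)) and F_T (applied to the target
    image x_t), then take the channel-wise argmax to obtain labels.

    prob_s, prob_t: (N, C, H, W) probability maps; lam in (0, 1).
    """
    fused = lam * prob_s + (1.0 - lam) * prob_t
    return fused.argmax(dim=1)  # (N, H, W) predicted label map

# toy usage: two-class (background / building) probability maps
prob_s = torch.softmax(torch.randn(1, 2, 4, 4), dim=1)
prob_t = torch.softmax(torch.randn(1, 2, 4, 4), dim=1)
labels = fuse_predictions(prob_s, prob_t, lam=0.3)
```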

#### *3.5. Network Architecture*

Our proposed BiFDANet consists of two generators, two discriminators and two classifiers. We choose DeeplabV3+ [7] as the segmentation model and use ResNet34 [59] as its backbone. The encoder applies atrous convolution at multiple scales to acquire multi-scale features, while the simple yet effective decoder module produces the predicted results. We use dropout in the decoder module to avoid overfitting. Figure 4 shows the architecture of the classifier.

As shown in Figure 5, the generators use the nine residual blocks adopted in [17]. Four convolutional layers downsample the features, while four deconvolutional layers upsample them. We use instance normalization rather than batch normalization and apply ReLU to activate all layers.

**Figure 3.** BiFDANet, test: the target classifier *FT* and the source classifier *FS* are used to segment the target images and the pseudo-source images respectively. And then the probability outputs are fused with a linear combination method and converted to the predicted labels.

**Figure 4.** The architecture of the classifier (DeeplabV3+ [7]). The encoder acquires multi-scale features from the images while the decoder provides the predicted results from the multi-scale features and low-level features.

Similar to the discriminator in [17], our discriminators use five convolution layers, as shown in Figure 6. The discriminators encode the input images into a feature vector. We then compute a mean squared error loss instead of using a Sigmoid to convert the feature vector into a binary output (real or fake). We use instance normalization rather than batch normalization. Unlike the generator, leaky ReLU is applied to activate the layers of the discriminator.
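A minimal sketch of such a five-convolution discriminator with a least-squares (MSE) adversarial objective; the channel widths, kernel sizes and strides are our own assumptions in the spirit of Figure 6, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Five-convolution discriminator (illustrative widths).
    Instance normalization and leaky ReLU are used, and the raw
    output is scored with a mean squared error (least-squares GAN)
    loss instead of a Sigmoid."""
    def __init__(self, in_ch=4):                     # 4 bands for Gaofen images
        super().__init__()
        layers, ch = [], in_ch
        for out_ch in (64, 128, 256, 512):
            layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(out_ch),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = out_ch
        layers += [nn.Conv2d(ch, 1, 4, padding=1)]   # 5th conv: real/fake map
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def lsgan_loss(pred, is_real):
    """MSE toward 1 for real images, 0 for generated ones."""
    target = torch.ones_like(pred) if is_real else torch.zeros_like(pred)
    return nn.functional.mse_loss(pred, target)

# toy usage on a 4-channel 64x64 patch
disc = Discriminator()
score = disc(torch.randn(1, 4, 64, 64))
```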

**Figure 5.** The architecture of the generator. ks, s, p and op correspond to kernel size, stride, padding and output padding parameters of the convolution and deconvolution respectively. ReLU and IN stand for rectified linear unit and instance normalization. The generator uses nine residual blocks.

**Figure 6.** The architecture of the discriminator. LReLU and IN correspond to leaky rectified linear unit and instance normalization respectively. We use mean squared error loss instead of Sigmoid.

#### **4. Results**

In this section, we introduce the two datasets, illustrate the experimental settings, and analyse the obtained results both quantitatively and qualitatively.

#### *4.1. Data Set*

To conduct our experiments, we employ the Gaofen satellite dataset and the ISPRS (WG II/4) 2D semantic segmentation benchmark dataset [60]. In the rest of this paper, we abbreviate them as the Gaofen dataset and the ISPRS dataset, respectively.

#### 4.1.1. Gaofen Data Set

The Gaofen dataset consists of images from the Gaofen-1 (GF-1) and Gaofen-1B (GF-1B) satellites, civilian optical satellites of China equipped with two sets of multi-spectral and panchromatic cameras. We reduce the spatial resolution of the images to 2 m and convert the images to 10 bit. The images from both satellites contain 4 channels (i.e., red, green, blue and near-infrared). Labels are provided for buildings. We assume that only the labels of the source domain can be accessed. We cut the images and their labels into 512 × 512 patches. Table 1 reports the number of patches and the class percentages for each satellite. Figure 7a,b show samples from the GF-1 and GF-1B satellites.



**Figure 7.** Example patches from two datasets. (**a**) GF-1 satellite image of the Gaofen dataset. (**b**) GF-1B satellite image of the Gaofen dataset. (**c**) Potsdam image of ISPRS dataset. (**d**) Vaihingen image of the ISPRS dataset.

#### 4.1.2. ISPRS Data Set

The ISPRS dataset includes aerial images acquired from [61,62], which are publicly available to the community. The Vaihingen images have a spatial resolution of 0.09 m, while the Potsdam images have a spatial resolution of 0.05 m. The Potsdam images contain red, green and blue channels, whereas the Vaihingen images have 3 different channels (i.e., red, green and infrared). All images in both datasets are converted to 8 bit. Some images are manually labeled with land cover maps, providing labels for impervious surfaces, buildings, trees, low vegetation and cars. We cut the images and their labels into 512 × 512 patches. Table 1 reports the number of patches and the class percentages for the ISPRS dataset. Figure 7c,d show samples from each city.

#### 4.1.3. Domain Gap Analysis

The domain shift between different domains is caused by many factors such as illumination conditions, camera angle, imaging sensors and so on.

In the Gaofen dataset, the same objects (e.g., buildings) have similar structures, but the colors of the GF-1 satellite images differ from those of the GF-1B satellite images, as shown in Figure 7a,b. Moreover, we depict histograms to represent the data distributions of the two satellites; there are some differences between the histograms of the GF-1 and GF-1B satellite images, as shown in Figure 8a,b.

**Figure 8.** Color histograms of the Gaofen dataset and the ISPRS dataset. Different colors represent the histograms for different channels. (**a**) GF-1 images. (**b**) GF-1B images. (**c**) Potsdam images. (**d**) Vaihingen images.

In the ISPRS dataset, the Potsdam and Vaihingen images differ in many respects, such as imaging sensors, spatial resolutions and structural representations of the classes. Because of the different imaging sensors, the two sets of images contain different kinds of channels, so the same objects appear in different colors: for example, vegetation and trees are green in the Potsdam images but red in the Vaihingen images because of the infrared band. Besides, the Potsdam and Vaihingen images are captured at different spatial resolutions, which leads to the same objects being of different sizes. Moreover, the structural representations of the same objects in the two datasets might differ; for example, buildings in different cities may look different. We also depict histograms to represent the data distributions of the Potsdam and Vaihingen datasets. As shown in Figure 8c,d, the histograms of the Potsdam images are quite different from those of the Vaihingen images.

#### *4.2. Experimental Settings*

We train BiFDANet in two stages. First, we minimize the overall objective L*BiFDANet*(*GS*→*T*, *GT*→*S*, *DS*, *DT*, *FS*, *FT*) without the bidirectional semantic consistency loss by setting the *λ*4 parameter in Equation (15) to 0, because without a trained target segmentation model the bidirectional semantic consistency loss would not be helpful during training. The *λ*1, *λ*2, *λ*3, *λ*5 and *λ*6 parameters in Equation (15) are set to 1, 10, 5, 10 and 10, respectively; we found these values through repeated experiments. We train the framework for 100 epochs in this stage. Second, after obtaining the well-trained target classifier, we add the bidirectional semantic consistency loss by setting *λ*4 to 10, keeping the other parameters the same as in the first stage, and optimize the network for 200 epochs. For all the methods, the networks are implemented in the PyTorch framework. We train the models with the Adam optimizer [63], using a batch size of 12. The learning rates for the generators, the discriminators and the classifiers are all set to 10−4. At test time, the parameter used to combine the segmentation models is chosen from *λ* ∈ {0, 0.05, 0.1, 0.15, 0.2, ..., 0.95, 1} on a validation set of 20% of the patches from the target domain.
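The grid search for the fusion weight *λ* of Equation (17) on the validation split can be sketched as follows; `score_fn` is a hypothetical callback standing in for "fuse the two classifiers' validation outputs with this weight and return a quality score such as IoU":

```python
def select_lambda(score_fn, step=0.05):
    """Pick the fusion weight for Eq. (17) by grid search over
    {0, 0.05, ..., 1.0}. score_fn(lam) returns a validation score
    (higher is better) for a candidate weight."""
    grid = [round(i * step, 2) for i in range(int(1 / step) + 1)]
    return max(grid, key=score_fn)

# toy usage: a synthetic score peaking at lam = 0.3
best = select_lambda(lambda lam: -(lam - 0.3) ** 2)
```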

#### *4.3. Methods Used for Comparison*

(1) DeeplabV3+ [7]: We do not apply any domain adaptation methods and directly segment the unlabeled target images with a DeeplabV3+ trained on the labeled source domain.

(2) Color Matching: For each channel of the images, we adjust the average brightness values of the source images to that of the target images. Then, we train the target segmentation model on the transformed domain.

(3) CycleGAN [17]: This method uses two generators *G* and *F* to perform image translation. The generator *G* learns to transfer the source images to the target domain, while *F* learns to transfer the target images to the source domain. Cycle consistency forces translating from source to target and back, and from target to source and back, to reproduce the original contents. The generated target-like images are then used to train the target classifier.

(4) For BiFDANet, besides the full approach, we also give the results obtained by the segmentation models *FS* and *FT* before the linear combination method. At the same time, to show the effectiveness of the linear combination method, we also show the results obtained by simply taking the intersection or union of the two results.

For the above approaches, we use the same training parameters and architecture to make a fair comparison.
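The color matching baseline in (2) can be sketched as a per-channel brightness shift (the function name and array layout are our own assumptions):

```python
import numpy as np

def color_match(src, tgt):
    """Color matching baseline: for each channel, shift the source
    image so its average brightness equals that of the target domain.

    src, tgt: arrays of shape (H, W, C); returns a float copy of src.
    """
    out = src.astype(np.float64).copy()
    for c in range(src.shape[-1]):
        out[..., c] += tgt[..., c].mean() - src[..., c].mean()
    return out

# toy usage: an all-zero source shifted toward an all-one target
matched = color_match(np.zeros((2, 2, 3)), np.ones((2, 2, 3)))
```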

#### *4.4. Evaluation Metrics*

To evaluate all the methods quantitatively and comprehensively, we use the scalar metrics *Precision*, *Recall*, *F1-score (F1)* and *IoU* [64], defined as follows:

$$Precision = \frac{TP\_b}{TP\_b + FP\_b} \tag{18}$$

$$Recall = \frac{TP\_b}{TP\_b + FN\_b} \tag{19}$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{20}$$

$$IoU = \frac{TP\_b}{TP\_b + FN\_b + FP\_b} \tag{21}$$

where *b* denotes the category. *FP* (false positive) is the number of pixels which are classified as category *b* but do not belong to it. *FN* (false negative) is the number of pixels which belong to category *b* but are classified as other categories. *TP* (true positive) is the number of pixels which are correctly classified as category *b*, and *TN* (true negative) is the number of pixels which are classified as other categories and indeed belong to other categories. The aforementioned evaluation metrics are computed for each category (except the background). In particular, because we only segment buildings in our experiments, all the evaluation results reported in the tables correspond to the building category.
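Equations (18)–(21) can be computed directly from the pixel counts; a minimal sketch with a worked toy example:

```python
def segmentation_metrics(tp, fp, fn):
    """Precision, recall, F1 and IoU for one category (Eqs. 18-21),
    given pixel counts of true positives, false positives and
    false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fn + fp)
    return precision, recall, f1, iou

# toy check: tp=80, fp=20, fn=20 gives precision = recall = 0.8
p, r, f1, iou = segmentation_metrics(80, 20, 20)
```

Note that F1 and IoU are monotonically related for a single category, but IoU penalizes errors more heavily, which is why both are reported.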

#### *4.5. Quantitative Results*

To report fair and reliable results, we repeat training our framework and the comparison methods with the same parameters and architecture five times, and report the average precision, recall, F1-score and IoU values in Tables 2 and 3, which show the comparison results on the Gaofen dataset and the ISPRS dataset, respectively. The DeeplabV3+ row corresponds to the no-adaptation case. For BiFDANet, we report the results obtained by the source classifier *FS* and the target classifier *FT* separately before the linear combination method, as well as those obtained by simply taking the intersection or union of the predicted results of the two classifiers.

**Table 2.** Comparison results on Gaofen dataset. The best values are in bold.


**Table 3.** Comparison results on ISPRS dataset. The best values are in bold.


#### *4.6. Visualization Results*

Figures 9–12 depict the predicted results for DeeplabV3+, CycleGAN, color matching and BiFDANet. Our proposed BiFDANet which considers distribution alignment and bidirectional semantic consistency obtains the best predicted results, and the contours of the predicted buildings are more accurate than those acquired by color matching and CycleGAN.

**Figure 9.** Segmentation results in GF-1 → GF-1B experiment. White and black pixels represent buildings and background. (**a**) GF-1B. (**b**) Label. (**c**) DeeplabV3+. (**d**) Color matching. (**e**) CycleGAN. (**f**) BiFDANet.

**Figure 10.** Segmentation results in GF-1B → GF-1 experiment. White and black pixels represent buildings and background. (**a**) GF-1. (**b**) Label. (**c**) DeeplabV3+. (**d**) Color matching. (**e**) CycleGAN. (**f**) BiFDANet.

**Figure 11.** Segmentation results in Potsdam → Vaihingen experiment. White and black pixels represent buildings and background. (**a**) Vaihingen. (**b**) Label. (**c**) DeeplabV3+. (**d**) Color matching. (**e**) CycleGAN. (**f**) BiFDANet.

**Figure 12.** Segmentation results in Vaihingen → Potsdam experiment. White and black pixels represent buildings and background. (**a**) Potsdam. (**b**) Label. (**c**) DeeplabV3+. (**d**) Color matching. (**e**) CycleGAN. (**f**) BiFDANet.

#### **5. Discussion**

In this section, we compare our results with those of the other methods in detail, and discuss the effect of our proposed bidirectional semantic consistency (BSC) loss and the role of each component of BiFDANet.

#### *5.1. Comparisons with Other Methods*

As shown in Tables 2 and 3, the DeeplabV3+ method, which directly applies the source segmentation model to the target images, performs worst in all settings. Color matching obtains better performance than DeeplabV3+, which indicates the effectiveness of domain adaptation for semantic segmentation of remote sensing images. CycleGAN performs better than both DeeplabV3+ and color matching. Among all the compared methods, BiFDANet achieves the highest F1-score and IoU in all settings, and the separate segmentation models *FS* and *FT* also significantly outperform the other adaptation methods. When combining the two segmentation models with the linear combination method, the performance of BiFDANet is further enhanced. Moreover, in the Vaihingen → Potsdam experiment, BiFDANet *FS* performs much better than BiFDANet *FT*, because transferring from Vaihingen to Potsdam is more difficult than transferring from Potsdam to Vaihingen. There are far more Potsdam images than Vaihingen images; the widely variable target domain (Potsdam) contains a greater variety of shapes and textures, and therefore it is harder to adapt the classifier from Vaihingen to Potsdam. Thanks to its bidirectionality, which is disregarded in previous methods, BiFDANet achieves a performance gain of +7 percentage points, while the gain of BiFDANet *FT* alone is only +1 percentage point. In this experiment, our proposed method makes full use of the information from the inverse target-to-source translation to produce much better results.

#### 5.1.1. BiFDANet versus DeeplabV3+

BiFDANet performs much better than DeeplabV3+ in all four cases. Because of the domain gap, there are significant differences between the source domain and the target domain, and without domain adaptation the segmentation model cannot cope with them.

#### 5.1.2. BiFDANet versus CycleGAN

In order to reduce the domain gap, CycleGAN and BiFDANet perform image-to-image translation to align the data distributions of different domains. Figures 13–16 show some original images and the corresponding transformed images generated by color matching, CycleGAN and BiFDANet. As shown in Figures 13 and 14, the semantic contents of the images are changed by CycleGAN, because no constraint forces CycleGAN to preserve semantic consistency during image generation. For instance, during the translation, CycleGAN replaces buildings with bare land, as shown in the yellow rectangles in Figures 13 and 14. Besides, when generating transformed images, CycleGAN produces some buildings which did not exist before, as indicated in the green rectangles in Figures 13 and 14. By contrast, the pseudo images transformed by BiFDANet have the same semantic contents as their corresponding original images, and their data distributions are similar to those of the target images. Similarly, as shown in Figure 15, we observe some objects which look like red trees on the rooftops of the buildings, as highlighted by the green rectangles, and CycleGAN generates a few artificial objects in the outlined areas. Moreover, in Figure 16, CycleGAN transfers the gray ground to orange buildings, as highlighted by the cyan rectangles. On the contrary, we do not observe such artificial objects or semantic inconsistency in the transformed images generated by BiFDANet in the vast majority of cases, because the bidirectional semantic consistency loss enforces the classifiers to maintain semantic consistency during the image-to-image translation process. For CycleGAN, because the transformed images do not match the labels of the original images, the segmentation model *FT* learns wrong information during training.
Such wrong information may affect the performance of the classifiers significantly. As a result, the domain adaptation method based on CycleGAN performs worse than our proposed method at test time, as confirmed by Figures 13–16.

#### 5.1.3. BiFDANet versus Color Matching

Figures 13 and 14 illustrate that color matching can efficiently reduce the color difference between domains. At first sight, color matching works well: it preserves the semantic contents of the original source images in the transformed images, and the color of the target images is transferred to the transformed images. Besides, the transformed images generated by color matching look similar to the images generated by BiFDANet in Figure 14. However, Tables 2 and 3 show relatively large gaps between the performances of BiFDANet and color matching; the quantitative results for color matching are even worse than those for CycleGAN, which cannot keep the semantic contents well. To better understand this difference in performance, we further analyse the differences between BiFDANet and color matching. The main problem of color matching is that it only tries to match the color of the images, without considering the differences in features and data distributions. On the contrary, BiFDANet learns high-level features of the target images by using the discriminators to distinguish the features and data distributions of the pseudo-target images from those of the original target images. In other words, the generators of BiFDANet generate pseudo-target images whose high-level features and data distributions are similar to those of the target images. For this reason, our proposed BiFDANet outperforms color matching substantially.

Furthermore, to support this point, Figure 17 shows color histograms of the GF-1 images, the pseudo GF-1 images generated by color matching and BiFDANet, the GF-1B images, and the pseudo GF-1B images generated by color matching and BiFDANet. Figure 18 depicts the corresponding histograms for the Potsdam and Vaihingen images and their pseudo counterparts. Since the source domain and the target domain are drawn from different data distributions, the histograms of the pseudo-target images and the target images cannot be exactly the same; however, we want them to be as similar as possible. Although color matching tries to match the color of the source images with the color of the target images, it does not learn the data distributions, so the histograms of the pseudo-target images remain quite different from those of the target images.

**Figure 13.** GF-1 to GF-1B: Original GF-1 images and the transformed images which are used to train the classifier for GF-1B images. (**a**) GF-1 images. (**b**) Color matching. (**c**) CycleGAN. (**d**) BiFDANet (ours).

**Figure 14.** GF-1B to GF-1: Original GF-1B images and the transformed images which are used to train the classifier for GF-1 images. (**a**) GF-1B images. (**b**) Color matching. (**c**) CycleGAN. (**d**) BiFDANet (ours).

**Figure 15.** Potsdam to Vaihingen: Original Potsdam images and the transformed images which are used to train the classifier for Vaihingen images. (**a**) Potsdam images. (**b**) Color matching. (**c**) CycleGAN. (**d**) BiFDANet (ours).

**Figure 16.** Vaihingen to Potsdam: Original Vaihingen images and the transformed images which are used to train the classifier for Potsdam images. (**a**) Vaihingen images. (**b**) Color matching. (**c**) CycleGAN. (**d**) BiFDANet (ours).

**Figure 17.** Color histograms of the Gaofen dataset. (**a**) GF-1. (**b**) Pseudo GF-1 transformed by color matching. (**c**) Pseudo GF-1 transformed by BiFDANet. (**d**) GF-1B. (**e**) Pseudo GF-1B transformed by color matching. (**f**) Pseudo GF-1B transformed by BiFDANet.

**Figure 18.** Color histograms of the ISPRS dataset. It is worth noting that Potsdam and Vaihingen have different kinds of bands. (**a**) Potsdam. (**b**) Pseudo Potsdam transformed by color matching. (**c**) Pseudo Potsdam transformed by BiFDANet. (**d**) Vaihingen. (**e**) Pseudo Vaihingen transformed by color matching. (**f**) Pseudo Vaihingen transformed by BiFDANet.

As shown in Figures 17 and 18, color matching does not match the data distributions of the pseudo-target images with those of the target images. For the Gaofen dataset, there are still some differences between the histograms of the pseudo-target images generated by color matching and those of the real target images, as shown in Figure 17. In contrast, the histograms of the pseudo-target images transformed by BiFDANet are similar to those of the real target images, which is why BiFDANet performs better than color matching. For the ISPRS dataset, the histograms of the pseudo-target images generated by color matching differ considerably from those of the target images, as shown in Figure 18, whereas BiFDANet effectively matches the histograms of the pseudo-target images with those of the real target images. Therefore, the performance gap between BiFDANet and color matching becomes larger, as confirmed by Figures 13–16.

#### 5.1.4. Linear Combination Method versus Intersection and Union

In the GF-1 → GF-1B, Vaihingen → Potsdam and Potsdam → Vaihingen experiments, simply taking the intersection or union of the results of the two classifiers *FS* and *FT* obtains the highest precision or recall values; these results prove that the two opposite directions are complementary rather than alternative. However, the F1-score and IoU values cannot reach their highest levels through the intersection and union operations. In the Vaihingen → Potsdam and Potsdam → Vaihingen experiments, simply taking the intersection or union of the outputs of the two classifiers even results in performance degradation. This shows that the intersection and union operations on the two predicted results are not always stable, because they may leave out some correct objects or introduce some wrong objects during the combination. In comparison, the linear combination method leads to further improvements in all four experiments, because combining the probability outputs is more reliable.

#### *5.2. Bidirectional Semantic Consistency Loss*

We replace the bidirectional semantic consistency (BSC) loss in BiFDANet with semantic consistency (SC) loss [14] and dynamic semantic consistency (DSC) loss [45], and report the evaluation results in Tables 4 and 5.

As shown in Tables 4 and 5, for all adaptations in both directions on the Gaofen dataset and the ISPRS dataset, our proposed bidirectional semantic consistency loss achieves better results. It is worth noting that our framework with the SC loss [14] or the DSC loss [45] also performs well in the source-to-target direction, but the performance of BiFDANet *FS* degrades. This illustrates the necessity of the proposed bidirectional semantic consistency loss when optimizing the classifier *FS* in the target-to-source direction. Moreover, our framework with the proposed bidirectional semantic consistency (BSC) loss outperforms our framework with the dynamic semantic consistency (DSC) loss in the source-to-target direction even though the semantic constraints are the same in that direction. This shows that keeping semantic consistency in the target-to-source direction helps maintain semantic consistency in the source-to-target direction. At the same time, the source classifier *FS* in our framework with the semantic consistency loss [14] or the dynamic semantic consistency loss [45] performs better than the source classifier *FS* in our framework without any semantic consistency loss, even though these methods impose no semantic constraints in the target-to-source direction. This means that the semantic consistency constraints in the source-to-target direction are also beneficial for preserving the semantic contents in the target-to-source direction. In conclusion, the two transferring directions promote each other in keeping semantic consistency.

#### *5.3. Loss Functions*

We study the role of each part of BiFDANet in the Vaihingen → Potsdam experiment. We start from the base source-to-target GAN model with the adversarial loss L*adv* and the classification loss L*FT*. Then we test the symmetric target-to-source GAN model with the adversarial loss L*adv* and the classification loss L*FS*. We combine the two symmetric models to form a closed loop. In the next steps, we add the cycle consistency loss L*cyc* and the identity loss L*idt* in turn. Finally, the framework is completed by introducing the bidirectional semantic consistency loss L*sem*. The results are shown in Table 6. We observe that every component helps our framework achieve better IoU and F1 scores, and the proposed bidirectional semantic consistency loss further improves the performance of the models, which again demonstrates its effectiveness.


**Table 4.** Evaluation results of different semantic consistency loss on Gaofen dataset. The best values are in bold.

**Table 5.** Evaluation results of different semantic consistency loss on ISPRS dataset. The best values are in bold.


**Table 6.** Evaluation results of each component on ISPRS dataset.


#### **6. Conclusions**

In this article, we present a novel unsupervised bidirectional domain adaptation framework to overcome the limitations of unidirectional methods for semantic segmentation in remote sensing. First, while unidirectional domain adaptation methods do not consider the inverse adaptation, we take full advantage of the information from both domains by performing bidirectional image-to-image translation to minimize the domain shift and by optimizing the source and target classifiers in two opposite directions. Second, unidirectional domain adaptation methods may perform badly when transferring from one domain to the other is difficult; to make the framework more general and robust, we employ a linear combination method at test time, which linearly merges the softmax outputs of the two segmentation models, providing a further gain in performance. Finally, to keep the semantic contents in the target-to-source direction, which was neglected by existing methods, we propose a novel bidirectional semantic consistency loss and supervise the translation in both directions. We validate our framework on two remote sensing datasets, consisting of satellite images and aerial images, where we perform one-to-one domain adaptation in each dataset in two opposite directions. The experimental results confirm the effectiveness of our BiFDANet. Furthermore, the analysis reveals that the proposed bidirectional semantic consistency loss performs better than the other semantic consistency losses used in previous approaches. In future work, we will redesign the combination method to make our framework more robust and further improve the segmentation accuracy. Moreover, since large collections of remote sensing images usually contain several domains, we will extend our approach to multi-source and multi-target domain adaptation.

**Author Contributions:** Conceptualization, Y.C.; methodology, Y.C. and Q.Z.; formal analysis, Y.Y. and Y.S.; resources, J.Y. and Z.S. (Zhongtian Shi); writing—original draft preparation, Y.C.; writing—review and editing, Y.Y., Y.S., Z.S. (Zhengwei Shen) and J.Y.; visualization, Y.C.; data curation, Z.S. (Zhengwei Shen); funding acquisition, J.Y. and Z.S. (Zhongtian Shi). All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was funded by the National Natural Science Foundation of China under Grant 61825205 and Grant 61772459 and the Key Research and Development Program of Zhejiang Province, China under grant 2021C01017.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The satellite dataset presented in this study is available on request from the China Resources Satellite Application Center, and the aerial dataset used in our research is openly available; see references [60–62] for details.

**Acknowledgments:** We acknowledge the National Natural Science Foundation of China (Grant 61825205 and Grant 61772459) and the Key Research and Development Program of Zhejiang Province, China (grant 2021C01017).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

