**1. Introduction**

Scene classification is the process of automatically assigning an image a class label that correctly describes its content. In the field of remote sensing, scene classification has gained a lot of attention, and several methods have been introduced, such as the bag-of-words model [1], compressive sensing [2], sparse representation [3], and lately deep learning [4]. To classify a scene correctly, effective features are extracted from a given image and then mapped by a classifier to the correct label. Early studies of remote sensing scene classification were based on handcrafted features [1,5,6]. In this context, deep learning techniques have proved to be very efficient compared to standard solutions based on handcrafted features. Convolutional neural networks (CNNs) are the most common deep learning technique for learning visual features, and they are widely used to classify remote sensing images [7–9]. Several approaches were built around these methods to boost the classification results, such as integrating local and global features [10–12], recurrent neural networks (RNNs) [13,14], and generative adversarial networks (GANs) [15].

The number of remote sensing images has been steadily increasing over the years. These images are collected using different sensors mounted on satellites or airborne platforms. The type of sensor determines the image resolution: spatial, spectral, and temporal. This leads to a huge number of images with different spatial and spectral resolutions [16]. In many real-world applications, the training data used to learn a model may have a different distribution from the data used for testing, as images are acquired over different locations and with different sensors. In such cases, it becomes necessary to reduce the distribution gap between the source and target domains to obtain acceptable classification results [17]. The main goal of domain adaptation is to learn a classification model from a labeled source domain and apply this model to classify an unlabeled target domain. Several domain adaptation methods have been introduced in the field of remote sensing [18–21].

The above methods assume that each sample belongs to one of a fixed number of known classes. This is called closed set domain adaptation, where the source and target domains have shared classes only. This assumption is violated in many cases. In fact, many real applications follow an open set environment, where some test samples belong to classes that are unknown during training [22]. These samples should be classified as unknown, instead of being assigned to one of the shared classes; classifying an unknown image to one of the shared classes leads to negative transfer. Figure 1 shows the difference between open and closed set classification. This problem is known in the machine learning community as open set domain adaptation. Thus, in open set domain adaptation one has to learn robust feature representations for the labeled source data, reduce the data-shift problem between the source and target distributions, and detect the presence of new classes in the target domain.

**Figure 1.** (**a**) Closed set domain adaptation: source and target domains share the same classes. (**b**) Open set domain adaptation: target domain contains unknown classes (in the grey boxes).

Open set classification is a new research area in the remote sensing field, and few works have addressed the open set problem. Pavy and Zelnio [23] introduced a method to classify Synthetic Aperture Radar (SAR) images: test samples whose classes appear in the training set are classified accordingly, while those not in the training set are rejected as unknown. The method uses a CNN as a feature extractor and an SVM for classification and rejection of unknowns. Wang et al. [24] addressed the open set problem in high range resolution profile (HRRP) recognition. Their method is based on random forest (RF) and extreme value theory: the RF is used to extract the high-level features of the input, which are then fed to the open set module. Both approaches use a single domain for training and testing.

To the best of our knowledge, no previous works have addressed the open-set domain adaptation problem for remote sensing images where the source domain is different from the target domain. To address this issue, we propose a method based on adversarial learning and pareto-based ranking. In particular, the method reduces the distribution discrepancy between the source and target domains using min-max entropy optimization. During the alignment process, it identifies candidate samples of the unknown class from the target domain through a pareto-based ranking scheme that uses ambiguity criteria based on entropy and the distance to source class prototypes.

#### **2. Related Work on Open Set Classification**

Open set classification is a more challenging and more realistic setting; thus it has gained a lot of attention from researchers lately, and many works have been done in this field. Early research on open set classification relied on traditional machine learning techniques, such as support vector machines (SVMs). Scheirer et al. [25] first proposed the 1-vs-Set method, which uses a binary SVM classifier to detect unknown classes. The method introduced a new open set margin to decrease the region of the known class for each binary SVM. Jain et al. [26] invoked extreme value theory (EVT) to present a multi-class open set classifier to reject unknowns; the authors introduced the Pi-SVM algorithm to estimate the un-normalized posterior class inclusion likelihood. The probabilistic open set SVM (POS-SVM) classifier proposed by Scherreik et al. [27] empirically determines a unique rejection threshold for each known class. Sparse representation techniques have also been used in open set classification, where the sparse representation-based classifier (SRC) [28] looks for the sparsest possible representation of the test sample to classify it correctly [29]. Bendale and Boult [30] presented the nearest non-outlier (NNO) method to actively detect and learn new classes, taking into account the open space risk and metric learning.

Deep neural networks (DNNs) have lately achieved impressive results in several tasks, including open set classification. Bendale and Boult [31] first introduced the OpenMax model, a DNN for open set classification: the OpenMax layer replaces the softmax layer in a CNN to check whether a given sample belongs to an unknown class. Hassen and Chan [32] presented a method that addresses the open set problem by keeping instances belonging to the same class near each other, and instances that belong to different or unknown classes farther apart. Shu et al. [33] proposed deep open classification (DOC), which builds a multi-class classifier that replaces the last softmax layer with a 1-vs-rest layer of sigmoids to make the open space risk as small as possible. Later, Shu et al. [34] presented a model for discovering unknown classes that combines two neural networks: an open classification network (OCN) for seen-class classification and unseen-class rejection, and a pairwise classification network (PCN) that learns a binary classifier to predict whether two samples come from the same class or different classes.

In recent years, generative adversarial networks (GANs) [35] have been introduced to the field of open set classification. Ge et al. [36] presented the Generative OpenMax (G-OpenMax) method, which adapts OpenMax to generative adversarial networks for open set classification. The GAN trains the network by generating unknown samples, which are then combined with an OpenMax layer to reject samples belonging to the unknown class. Neal et al. [37] proposed another GAN-based algorithm that generates counterfactual images belonging to no known class; these images are used to train a classifier to correctly classify unknown images. Yu et al. [38] also proposed a GAN that generates negative samples for known classes to train the classifier to distinguish between known and unknown samples.

Most of the previous studies in the scene classification literature assume that one domain is used for both training and testing. This assumption is not always satisfied: some domains have labeled images, while many new domains suffer from a shortage of labeled images, and generating and collecting large datasets of labeled images is time-consuming and expensive. One way to address this issue is to use labeled images from one domain as training data for different domains. Domain adaptation is a branch of transfer learning in which knowledge is transferred between two domains, source and target. Domain adaptation approaches differ from each other in the percentage of labeled images available in the target domain. Some works have been done in the field of open set domain adaptation. Busto et al. [22] first introduced open set domain adaptation by allowing the target domain to have samples of classes not belonging to the source domain and vice versa. The classes not shared between the domains are joined into a negative class called "unknown". The goal is to classify each target sample to the correct class if it is shared between source and target, and to the unknown class otherwise. Saito et al. [39] proposed a more challenging setting where unknown samples appear only in the target domain. Their approach introduces adversarial learning in which the generator separates target samples into known and unknown classes: the generator decides to accept or reject a target image; if accepted, the image is classified to one of the source domain classes, and if rejected it is classified as unknown.

Cao et al. [40] introduced a new partial domain adaptation method, in which the target dataset classes are assumed to be a subset of the source dataset classes. This makes the domain adaptation problem more challenging because of the extra source classes, which could result in negative transfer. The authors used a multi-discriminator domain adversarial network, where each discriminator is responsible for matching the source and target domain data after filtering out unshared source classes. Zhang et al. [41] also addressed the problem of transferring from a big source domain to a target domain with a subset of its classes; their method requires only two domain classifiers instead of the multiple classifiers used by the previous method. Furthermore, Baktashmotlagh et al. [42] proposed an approach that factorizes the data into shared and private subspaces: source and target samples coming from the same known classes are represented by a shared subspace, while target samples from unknown classes are modeled with a private subspace.

Lian et al. [43] proposed the Known-class Aware Self-Ensemble (KASE), which is able to reject unknown classes. The model consists of two modules that effectively identify known and unknown classes and perform domain adaptation based on the likelihood of target images belonging to known classes. Liu et al. [44] presented Separate to Adapt (STA), a method to separate known from unknown samples in a progressive way. The method works in two steps: first, a classifier is trained to measure the similarity between target samples and each source class; then, target samples with high and low similarity values are selected as known and unknown samples, respectively, and used to train a classifier to correctly classify target images. Tan et al. [45] proposed a weakly supervised method, where both the source and target domains have some labeled images. The two domains learn from each other through the few labeled images to correctly classify the unlabeled images in both domains. The method aligns the source and target domains in a collaborative way and then maximizes the margin between the shared and unshared classes.

In the context of remote sensing, open set domain adaptation is a new research field and, to our knowledge, no previous work has been reported.

#### **3. Description of the Proposed Method**

Assume a labeled source domain $D_s = \left\{X_i^{(s)}, y_i^{(s)}\right\}_{i=1}^{n_s}$ composed of images $X_i^{(s)}$ and their corresponding class labels $y_i^{(s)} \in \{1, 2, \ldots, K\}$, where $n_s$ is the number of images and $K$ is the number of classes. Additionally, we assume an unlabeled target domain $D_t = \left\{X_j^{(t)}\right\}_{j=1}^{n_t}$ with $n_t$ unlabeled images. In an open set setting, the target domain contains $K+1$ classes: $K$ classes shared with the source domain, plus an additional unknown class (there can be many unknown classes, but they are grouped into one class). The objective of this work is twofold: (1) reduce the distribution discrepancy between the source and target domains, and (2) detect the presence of the unknown class in the target domain. Figure 2 shows the overall description of the proposed adversarial learning method, which relies on the idea of min-max entropy for carrying out the domain adaptation and uses an unknown class detector based on pareto ranking.

**Figure 2.** Proposed open-set domain adaptation method.

#### *3.1. Network Architecture*

Our model uses the EfficientNet-B3 network [46] from Google as a feature extractor, although other networks could be used as well since the method is independent of the pre-trained CNN. The choice of this network is motivated by its ability to generate high classification accuracies with fewer parameters than other architectures. EfficientNets are based on the concept of scaling up CNNs by means of a compound coefficient, which jointly integrates the width, depth, and resolution: each dimension is scaled in a balanced way using a set of scaling coefficients. We truncate this network by removing its original ImageNet-based softmax classification layer. For convenience, we denote by $\left\{h_i^{(s)}\right\}_{i=1}^{n_s}$ and $\left\{h_j^{(t)}\right\}_{j=1}^{n_t}$ the feature representations of the source and target data obtained at the output of this trimmed CNN (each feature is a vector of dimension 1536). These features are further subject to dimensionality reduction via a fully-connected layer $F$ acting as a feature extractor, yielding new feature representations $\left\{z_i^{(s)}\right\}_{i=1}^{n_s}$ and $\left\{z_j^{(t)}\right\}_{j=1}^{n_t}$, each of dimension 128. The output of $F$ is further normalized using $\ell_2$-normalization and fed as input to a decoder $D$ and a similarity-based classifier $C$.
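As a rough illustration of this pipeline (not the authors' code), the NumPy sketch below uses a random matrix as a stand-in for both the truncated EfficientNet-B3 backbone and the trained weights of $F$, and shows the 1536 → 128 projection followed by $\ell_2$-normalization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the truncated EfficientNet-B3 backbone: in the
# paper it outputs a 1536-dimensional feature vector per image.
def backbone_features(n_images, dim=1536):
    return rng.normal(size=(n_images, dim))

# Fully connected layer F (1536 -> 128); the weights here are random
# placeholders, not trained parameters.
W_F = rng.normal(scale=0.01, size=(1536, 128))
b_F = np.zeros(128)

def extract_z(h):
    z = h @ W_F + b_F
    return z / np.linalg.norm(z, axis=1, keepdims=True)  # l2 normalization

h_s = backbone_features(4)   # four "source images"
z_s = extract_z(h_s)
print(z_s.shape)             # (4, 128): each row is a unit-norm feature
```

After the $\ell_2$ step every feature lies on the unit sphere, which is what makes the dot products used by the similarity classifier behave as cosine similarities.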

The decoder $D$ has the task of constraining the mapping space of $F$, by requiring the ability to reconstruct the original features provided by the pre-trained CNN, in order to reduce the overlap between classes during adaptation. On the other side, the similarity classifier $C$ aims to assign images to the corresponding classes, including the unknown one (identified using ranking criteria), by computing the similarity of their representations to its weights $W = [w_1, w_2, \ldots, w_K, w_{K+1}]$. These weights are viewed as estimated prototypes for the $K$ source classes and the unknown class with index $K+1$.
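A minimal sketch of such a similarity-based classifier is shown below. The columns of `W` play the role of the $K+1$ class prototypes; the temperature `T` and the softmax over cosine similarities are our assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def similarity_classifier(z, W, T=0.05):
    """Sketch of a similarity-based classifier C: each column of W is a
    class prototype; scores are cosine similarities between the
    l2-normalized features z and the prototypes, converted to
    probabilities with a temperature-scaled softmax (T is assumed)."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)     # normalize prototypes
    scores = (z @ Wn) / T                                 # cosine similarities
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)               # rows sum to 1

rng = np.random.default_rng(1)
z = rng.normal(size=(2, 128))
z /= np.linalg.norm(z, axis=1, keepdims=True)             # l2-normalized features
W = rng.normal(size=(128, 13))                            # e.g., K = 12 shared + 1 unknown
probs = similarity_classifier(z, W)
print(probs.shape)                                        # (2, 13)
```

The temperature controls how sharply the probability mass concentrates on the nearest prototype; a smaller `T` gives more confident predictions.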

#### *3.2. Adversarial Learning with Reconstruction Ability*

We reduce the distribution discrepancy using an adversarial learning approach with reconstruction ability based on min-max entropy optimization [47]. To learn the weights of $F$, $C$, and $D$, we use both labeled source and unlabeled target samples. We train the network to discriminate between the labeled classes in the source domain and the unknown class samples identified iteratively in the target domain (using the proposed pareto ranking scheme), while clustering the remaining target samples to the most suitable class prototypes. For this purpose, we jointly minimize the following loss functions:

$$\begin{cases} L_F = L_s\left(y^{(s)}, \hat{y}^{(s)}\right) + L_{K+1}\left(y^{(t)}, \hat{y}^{(t)}\right) + \lambda H\left(\hat{y}^{(t)}\right)\\ L_C = L_s\left(y^{(s)}, \hat{y}^{(s)}\right) + L_{K+1}\left(y^{(t)}, \hat{y}^{(t)}\right) - \lambda H\left(\hat{y}^{(t)}\right)\\ L_D = \frac{1}{n_s}\sum_{i=1}^{n_s}\left\|h_i^{(s)} - \hat{h}_i^{(s)}\right\|^2 + \frac{1}{n_t}\sum_{j=1}^{n_t}\left\|h_j^{(t)} - \hat{h}_j^{(t)}\right\|^2 \end{cases} \tag{1}$$

where $\lambda$ is a regularization parameter which controls the contribution of the entropy to the total loss. $L_s$ is the categorical cross-entropy loss computed for the source domain:

$$L_s\left(y^{(s)}, \hat{y}^{(s)}\right) = -\frac{1}{n_s}\sum_{i=1}^{n_s}\sum_{k=1}^{K}\mathbf{1}\left(k = y_i^{(s)}\right)\log\left(\hat{y}_{ik}^{(s)}\right) \tag{2}$$

$L_{K+1}$ is the cross-entropy loss computed for the samples iteratively identified as belonging to the unknown class:

$$L_{K+1}\left(y^{(t_{K+1})}, \hat{y}^{(t_{K+1})}\right) = -\frac{1}{n_{t_{K+1}}}\sum_{j=1}^{n_{t_{K+1}}} y_j^{(t_{K+1})}\log\left(\hat{y}_j^{(t_{K+1})}\right) \tag{3}$$

$H\left(\hat{y}^{(t)}\right)$ is the entropy computed for the samples of the target domain:

$$H\left(\hat{y}^{(t)}\right) = -\frac{1}{n_t}\sum_{j=1}^{n_t}\sum_{k=1}^{K+1}\hat{y}_{jk}^{(t)}\log\left(\hat{y}_{jk}^{(t)}\right) \tag{4}$$

and $L_D$ is the reconstruction loss.
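As a small numerical illustration of the entropy term (ours, not the authors' code): confident class-probability vectors contribute near-zero entropy, while near-uniform ones, which signal ambiguous target samples, are maximal. Three classes are used here purely as a toy example:

```python
import numpy as np

def entropy_term(y_hat_t):
    """Average entropy of target predictions, in the style of Equation (4):
    mean over samples of -sum_k p_k log p_k."""
    return -np.mean(np.sum(y_hat_t * np.log(y_hat_t + 1e-12), axis=1))

confident = np.array([[0.98, 0.01, 0.01]])   # near one-hot prediction
uniform   = np.array([[1/3, 1/3, 1/3]])      # maximally ambiguous prediction
print(entropy_term(confident) < entropy_term(uniform))  # True
```

This is the quantity the classifier $C$ maximizes and the feature extractor $F$ minimizes in the min-max game of Equation (1).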

From Equation (1), we observe that both $C$ and $F$ are used to learn discriminative features for the labeled samples. The classifier $C$ makes the target samples closer to the estimated prototypes by increasing the entropy, while the feature extractor $F$ tries to decrease it by assigning the target samples to the most suitable class prototype. On the other side, the decoder $D$ constrains the projection with a reconstruction ability to control the overlap between the samples of the different classes. In the experiments, we will show that this learning mechanism boosts the classification accuracy of the target samples. In practice, we place a gradient reversal layer [48] between $C$ and $F$ to flip the sign of the gradient of $H\left(\hat{y}^{(t)}\right)$ and simplify the training process. The gradient reversal layer multiplies the gradient by a negative scalar during backpropagation, while leaving its input unchanged in the forward pass.
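The behavior of the gradient reversal layer can be sketched in a few lines, independently of any autograd framework (the class name and the scaling factor `lam` are our illustrative choices):

```python
import numpy as np

class GradReverse:
    """Minimal gradient reversal layer: identity in the forward pass,
    gradient multiplied by -lam in the backward pass."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                       # forward pass: unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output # backward pass: flipped and scaled

grl = GradReverse(lam=0.1)
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(grl.forward(x), x)          # identity forward
assert np.allclose(grl.backward(x), -0.1 * x)  # reversed gradient
```

Inserting this layer between $F$ and $C$ lets a single backward pass implement the opposite-signed entropy terms of $L_F$ and $L_C$ in Equation (1).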

#### Pareto Ranking for Unknown Class Sample Selection

During the alignment of the source and target distributions, we strive to detect the $r < n_t$ most ambiguous samples and assign a soft label to them (unknown class $K+1$). Indeed, the adversarial learning will push the target samples to the most suitable class prototypes in the source domain, while the most ambiguous ones will potentially indicate the presence of a new class. In this work, we use the entropy measure and the distance from the class prototypes as a possible solution for identifying these samples. In particular, we propose to rank the target samples using both measures.

An important aspect of pareto ranking is the concept of dominance, widely applied in multi-objective optimization, which involves finding a set of pareto optimal solutions rather than a single one. This set contains solutions that cannot be improved on one objective function without degrading another. In our case, we formulate the problem as finding a subset $P$ of the unlabeled samples that maximizes the two objective functions $f_1$ and $f_2$, where

$$f_1^j = \text{cosine}\left(z_j^{(t)}, \overline{z}_k^{(s)}\right), \quad j = 1, \ldots, n_t \tag{5}$$

$$f_2^j = -\sum_{k=1}^{K+1}\hat{y}_{jk}^{(t)}\log\left(\hat{y}_{jk}^{(t)}\right), \quad j = 1, \ldots, n_t \tag{6}$$

where $f_1^j$ is the cosine distance of the representation $z_j^{(t)}$ of the target sample $X_j^{(t)}$ with respect to the class prototypes $\overline{z}_k^{(s)} = \frac{1}{n_{s_k}}\sum_{i=1}^{n_{s_k}} z_{ik}^{(s)}$ of each source class, and $f_2^j$ is the entropy computed for the target sample $X_j^{(t)}$.

Many samples in Figure 3 are undesirable choices because they have low values of entropy and distance, and are therefore dominated by other points. The samples of the pareto set $P$ dominate all other samples in the target domain; thus, the samples in this set are said to be non-dominated and form the so-called pareto front of optimal solutions.

**Figure 3.** Pareto-front samples potentially indicating the presence of the unknown class.
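A straightforward way to compute such a pareto front over the two objectives is sketched below (an $O(n^2)$ illustration, not the authors' implementation; the toy values are made up):

```python
import numpy as np

def pareto_front(f1, f2):
    """Return indices of non-dominated samples when maximizing both
    objectives (distance to prototypes f1 and prediction entropy f2).
    Sample i dominates j if it is >= on both objectives and > on one."""
    n = len(f1)
    front = []
    for j in range(n):
        dominated = any(
            (f1[i] >= f1[j] and f2[i] >= f2[j]) and
            (f1[i] > f1[j] or f2[i] > f2[j])
            for i in range(n)
        )
        if not dominated:
            front.append(j)
    return front

f1 = np.array([0.9, 0.2, 0.8, 0.1])   # distance to the closest prototype
f2 = np.array([0.5, 0.1, 0.9, 0.05])  # prediction entropy
print(pareto_front(f1, f2))           # -> [0, 2]
```

Samples 1 and 3 have low entropy and low distance, so they are dominated; samples 0 and 2 form the front and are the candidates for the unknown class.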

The following Algorithm 1 provides the main steps for training the open-set DA with its nominal parameters:

#### **Algorithm 1**: Open-Set DA

Input: Source domain $D_s = \left\{X_i^{(s)}, y_i^{(s)}\right\}_{i=1}^{n_s}$ and target domain $D_t = \left\{X_j^{(t)}\right\}_{j=1}^{n_t}$

Output: Target labels

1: Network parameters:

- Number of iterations: *num\_iter* = 100
- Mini-batch size: *b* = 100
- Adam optimizer: learning rate 0.0001, exponential decay rates for the first and second moments $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and $\epsilon = 1e^{-8}$

2: Get feature representations from EfficientNet-B3: $h_i^{(s)} = Net\left(X_i^{(s)}\right)$ and $h_j^{(t)} = Net\left(X_j^{(t)}\right)$

5.1: Shuffle the labeled samples and organize them into $r_b = \frac{n_s + r}{b}$ groups, each of size $b$

5.2: for $k = 1 : r_b$

5.3: Feed the target domain samples to the network and form a new pareto set $P$

6: Classify the target domain data.

#### **4. Experimental Results**

#### *4.1. Dataset Description*

To test the performance of the proposed architecture, we used two benchmark datasets. The first dataset consists of very high resolution (VHR) images customized from three well-known remote sensing datasets. The Merced dataset [1] consists of 21 category classes, each with 100 images of size 256 × 256 pixels and 0.3-m resolution. The AID dataset [49] contains more than 10,000 images of size 600 × 600 pixels with a pixel resolution varying from 8 m to about 0.5 m, classified into 30 different classes. The NWPU dataset [50] contains images of size 256 × 256 pixels with spatial resolutions varying from 30 m to 0.2 m per pixel; these images correspond to 45 category classes with 700 images each. From these three heterogeneous datasets, we build cross-domain datasets by extracting 12 common classes (see Figure 4), where each class contains 100 images.

**Figure 4.** Example of samples from cross-scene dataset 1 composed of very high resolution (VHR) images.

The second dataset consists of extremely high resolution (EHR) images collected by two different aerial vehicle platforms. The Vaihingen dataset was captured using a Leica ALS50 system at an altitude of 500 m over the city of Vaihingen in Germany; every image in this dataset is represented by three channels: near infrared (NIR), red (R), and green (G). The Trento dataset contains unmanned aerial vehicle (UAV) images acquired over the city of Trento in Italy; these images were captured using a Canon EOS 550D camera with 2 cm resolution. Both datasets contain seven classes, as shown in Figure 5, with 120 images per class.

**Figure 5.** Example of samples from cross-scene dataset 2 composed of extremely high resolution (EHR) images.

#### *4.2. Experiment Setup*

For training the proposed architecture, we used the Adam optimization method with a fixed learning rate of 0.001. We fixed the mini-batch size to 100 samples and we set the regularization parameter of the reconstruction error and entropy terms to 1 and 0.1, respectively.

We evaluated our approach using three proposed ranking criteria for detecting the samples of the unknown class including entropy, cosine distance, and the combination of both measures using pareto-based ranking.

We present the results in terms of (1) the closed set (CS) accuracy related to the shared classes between the source and target domains, i.e., the number of correctly classified samples divided by the total number of tested samples of the shared classes only; (2) the open set (OS) accuracy, including known and unknown classes; (3) the accuracy of the unknown class itself, termed Unk, i.e., the number of correctly classified unknown samples divided by the total number of tested samples of the unknown class only; and (4) the F-measure, which is the harmonic mean of precision and recall:

$$F = 2 \times \frac{Precision \times Recall}{Precision + Recall} \tag{7}$$

where Recall is calculated as

$$Recall = \frac{TP}{TP + FN} \tag{8}$$

and Precision is calculated as

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{9}$$

where TP, FN, and FP stand for true positives, false negatives, and false positives, respectively. The F-measure takes values between 0 and 1; higher values indicate better classification performance.
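Equations (7)–(9) combine into a few lines of code; the counts below are made-up values for illustration:

```python
def f_measure(tp, fp, fn):
    """F-measure from Equations (7)-(9): harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # Equation (9)
    recall = tp / (tp + fn)      # Equation (8)
    return 2 * precision * recall / (precision + recall)  # Equation (7)

# Example: 80 true positives, 20 false positives, 20 false negatives
# gives precision = recall = 0.8, hence F = 0.8.
print(f_measure(tp=80, fp=20, fn=20))  # 0.8
```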

The openness measure is the percentage of classes that appear in the target domain but are unknown in the source domain. We define it as

$$openness = 1 - \frac{C_s}{C_s + C_u} \tag{10}$$

where $C_s$ is the number of classes in the source domain shared with the target domain, and $C_u$ is the number of unknown classes in the target domain. For example, removing three classes from the source domain leaves nine classes in the source ($C_s = 9$); with $C_u = 3$ unknown classes, the openness is $1 - \frac{9}{9+3} = 0.25$. Increasing the openness increases the number of unknown classes in the target domain that are not shared with the source domain. Setting the openness to 0 corresponds to the closed set setting, where all classes are shared between the source and target domains and the target domain contains no unknown classes.
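Equation (10) is a one-liner; the worked example above can be checked directly:

```python
def openness(c_s, c_u):
    """Openness (Equation (10)): fraction of target classes unknown to the source.
    c_s: shared source classes; c_u: unknown target classes."""
    return 1 - c_s / (c_s + c_u)

print(openness(9, 3))    # 0.25, matching the nine-shared / three-unknown example
print(openness(12, 0))   # 0.0, the closed set case
```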
