#### *2.4. N-Mode Product*

The *n*-mode product is the multiplication of a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ by a matrix $\mathbf{U} \in \mathbb{R}^{J \times I_n}$ or a vector $\mathbf{u} \in \mathbb{R}^{I_n}$ in mode *n*; i.e., along axis *n*. It is written $\mathcal{B} = \mathcal{A} \times_n \mathbf{U}$, where $\mathcal{B} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}$ [17].
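As a concrete illustration, the *n*-mode product can be sketched in numpy; the helper name `mode_n_product` is ours, not from the paper:

```python
import numpy as np

def mode_n_product(A, U, n):
    """Multiply tensor A by matrix U along mode n (0-indexed)."""
    # Contract U's columns with A's axis n, then move the new axis
    # back to position n so the remaining modes keep their order.
    return np.moveaxis(np.tensordot(U, A, axes=(1, n)), 0, n)

# Example: a 3x4x5 tensor times a 2x4 matrix in mode 1 gives 3x2x5.
A = np.random.rand(3, 4, 5)
U = np.random.rand(2, 4)
B = mode_n_product(A, U, 1)
print(B.shape)  # (3, 2, 5)
```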

#### *2.5. Rank-One Tensor*

A tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is rank one if it can be written as the outer product of *N* vectors; i.e., $\mathcal{X} = \mathbf{a}^{(1)} \circ \cdots \circ \mathbf{a}^{(N)}$.

#### *2.6. Rank-R Tensor*

The rank of a tensor, $\text{rank}(\mathcal{X})$, is the smallest number of components in a CPD; i.e., the smallest number of rank-one tensors that generate $\mathcal{X}$ as their sum [17].

#### *2.7. N-Rank*

The *n*-rank of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, denoted $\text{rank}_n(\mathcal{X})$, is the column rank of $\mathbf{X}_{(n)}$; i.e., the dimension of the vector space spanned by the mode-*n* fibers. Hence, if $R_n \equiv \text{rank}_n(\mathcal{X})$ for $n = 1, \ldots, N$, we say that $\mathcal{X}$ is a rank-$(R_1, \ldots, R_N)$ tensor.
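A minimal numpy sketch of the *n*-rank as the matrix rank of the mode-*n* unfolding (the helper name `n_rank` is ours); a rank-one tensor has *n*-rank 1 in every mode:

```python
import numpy as np

def n_rank(X, n):
    """Rank of the mode-n unfolding X_(n), whose columns are mode-n fibers."""
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
    return np.linalg.matrix_rank(Xn)

# A rank-one third-order tensor built as an outer product of 3 vectors.
a, b, c = np.random.rand(4), np.random.rand(5), np.random.rand(6)
X = np.einsum('i,j,k->ijk', a, b, c)
print([n_rank(X, n) for n in range(3)])  # [1, 1, 1]
```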

All the tensor algebra notation presented up to this point is summarized in Table 2 for easier reference.


**Table 2.** Tensor algebra notation summary.

### *2.8. Tucker Decomposition (TKD)*

The TKD can be seen as a form of higher-order PCA [17]. This method decomposes a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ into a core tensor $\mathcal{G} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$ multiplied by a matrix along each mode $n = 1, \ldots, N$ as

$$\mathcal{X} \approx \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \cdots \times_N \mathbf{U}^{(N)} \tag{1}$$

where the core tensor preserves the level of interaction for each factor or projection matrix $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times J_n}$. These matrices are usually, but not necessarily, orthogonal, and can be thought of as the principal components in each mode [17] (see Figure 1). $J_n$ represents the number of components in the decomposition; i.e., the rank-$(R_1, \ldots, R_N)$. We compute a rank-$(R_1, \ldots, R_N)$ approximation, where $\text{rank}_n(\mathcal{X}) = R_n$ for every mode $n$, which generally does not reproduce $\mathcal{X}$ exactly. Starting from (1), the reconstruction of an approximated tensor is given by (4), where $\hat{\mathcal{X}}$ is the reconstructed tensor. Then, we can acquire the core tensor $\mathcal{G}$ by the multilinear projection

$$\mathcal{G} = \mathcal{X} \times_1 \mathbf{U}^{(1)T} \times_2 \cdots \times_N \mathbf{U}^{(N)T}, \tag{2}$$

where $\mathbf{U}^{(n)T}$ denotes the transpose of $\mathbf{U}^{(n)}$ for $n = 1, \ldots, N$. The reconstruction error $\xi$ can be computed as

$$\xi(\mathcal{X}) = \|\mathcal{X} - \hat{\mathcal{X}}\|_F^2, \tag{3}$$

where $\|\cdot\|_F$ represents the Frobenius norm. To compress data effectively, the reconstructed lower-rank tensor $\hat{\mathcal{X}}$ should be close to the original tensor $\mathcal{X}$; this can be achieved by an iterative algorithm such as HOOI, described in Section 5.1.

$$\hat{\mathcal{X}} = \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \cdots \times_N \mathbf{U}^{(N)}, \tag{4}$$

**Figure 1.** Tucker decomposition for a third-order tensor.
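A minimal numpy sketch of (2)–(4), assuming truncated-HOSVD factor matrices (the leading left singular vectors of each mode unfolding); the helper names are ours, not from the paper:

```python
import numpy as np

def mode_n_product(A, U, n):
    return np.moveaxis(np.tensordot(U, A, axes=(1, n)), 0, n)

def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

# Truncated-HOSVD factors: leading J_n left singular vectors per mode.
X = np.random.rand(6, 7, 8)
ranks = (3, 3, 3)
U = [np.linalg.svd(unfold(X, n))[0][:, :J] for n, J in enumerate(ranks)]

# Core tensor via the multilinear projection (2).
G = X
for n in range(3):
    G = mode_n_product(G, U[n].T, n)

# Reconstruction (4) and Frobenius reconstruction error (3).
Xhat = G
for n in range(3):
    Xhat = mode_n_product(Xhat, U[n], n)
xi = np.linalg.norm(X - Xhat) ** 2
print(G.shape, xi)
```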

#### **3. Problem Statement and Mathematical Definition**

Spectral images are third-order arrays, which provide not only spatial but also spectral features from RS scenes of interest. These properties help CNNs easily find features that characterize the behavior of different materials over the Earth's surface. However, the large amount of spectral data causes a huge computational load and, therefore, long processing times for machine learning algorithms.

It is important to preserve the three-dimensional array structure of the RS spectral input image in order to effectively classify each pixel of the image. In RS multi- or hyperspectral images, the spectral bands are highly correlated and contain a lot of redundancy. Therefore, we propose a TKD-based method as a preprocessing step to provide a better-suited input for the CNN-based semantic segmentation. This will also considerably reduce the high number of parameters and, in turn, the processing time during training and testing. Our problem statement for RS spectral images can be described as follows.

#### *3.1. Problem Statement*

Given a pair $(\mathcal{X}, \mathbf{Y})$, where the tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ denotes a CNNMSI or HSI, and $\mathbf{Y} \in \mathbb{R}^{I_1 \times I_2}$ is its corresponding ground-truth matrix for a specific number of classes $C$, find another pair $(\mathcal{G}, \hat{\mathbf{Y}})$, where the tensor $\mathcal{G} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, used for classification, is representative of $\mathcal{X}$, and $\hat{\mathbf{Y}}$ is its associated matrix of predicted classes; preserving the spatial domain ($J_1 = I_1$, $J_2 = I_2$) but with fewer new tensor bands, i.e., $J_3 < I_3$, achieving higher or competitive performance metrics for pixel-wise classification while reducing the dimensionality and, therefore, the computational complexity of the classification task.

### *3.2. Mathematical Definition*

We can describe the problem stated in the previous subsection mathematically as the following optimization problem:

$$\begin{array}{ll}
\underset{\mathbf{U}^{(1)},\mathbf{U}^{(2)},\mathbf{U}^{(3)}}{\min} & \|\mathcal{X}-\mathcal{G}\times_{1}\mathbf{U}^{(1)}\times_{2}\mathbf{U}^{(2)}\times_{3}\mathbf{U}^{(3)}\|_{F}^{2} \\
\text{subject to} & \mathbf{U}^{(n)}\in St_{I_{n}\times J_{n}}\equiv\{\mathbf{U}^{(n)}\in\mathbb{R}^{I_{n}\times J_{n}} \mid \mathbf{U}^{(n)T}\mathbf{U}^{(n)} = \mathbf{I}^{(n)}\}, \\
& J_{1}=I_{1},\ J_{2}=I_{2} \quad \text{preserving the pixel domain}, \\
& J_{3}<I_{3} \quad \text{with reconstruction error } \xi(\mathcal{X})\le\psi,
\end{array} \tag{5}$$

where $\psi$ denotes an error threshold defined depending on the accuracy or performance metrics required for each application, and $St_{I_n \times J_n}$ represents the Stiefel manifold [30]. Embedding $\mathcal{G}$ into the objective function, as De Lathauwer proved in [31] (Theorems 3.1, 4.1, and 4.2), (5) can be written equivalently, under the same constraints, as

$$\max_{\mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \mathbf{U}^{(3)}} \|\mathcal{X} \times_1 \mathbf{U}^{(1)T} \times_2 \mathbf{U}^{(2)T} \times_3 \mathbf{U}^{(3)T}\|_F^2 \tag{6a}$$

$$\text{where} \quad \mathcal{G} = \mathcal{X} \times_1 \mathbf{U}^{(1)T} \times_2 \mathbf{U}^{(2)T} \times_3 \mathbf{U}^{(3)T}. \tag{6b}$$
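The equivalence of (5) and (6a) can be checked numerically: for column-orthonormal projection matrices, $\|\mathcal{X}\|_F^2 = \|\mathcal{X} - \hat{\mathcal{X}}\|_F^2 + \|\mathcal{G}\|_F^2$, so minimizing the reconstruction error is the same as maximizing the core norm. A small numpy sketch (helper names ours):

```python
import numpy as np

def mode_n_product(A, U, n):
    return np.moveaxis(np.tensordot(U, A, axes=(1, n)), 0, n)

def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

X = np.random.rand(5, 6, 7)
# Column-orthonormal factors from truncated SVDs of the unfoldings.
U = [np.linalg.svd(unfold(X, n))[0][:, :3] for n in range(3)]

G = X
for n in range(3):
    G = mode_n_product(G, U[n].T, n)    # core (6b)
Xhat = G
for n in range(3):
    Xhat = mode_n_product(Xhat, U[n], n)  # reconstruction

lhs = np.linalg.norm(X) ** 2
rhs = np.linalg.norm(X - Xhat) ** 2 + np.linalg.norm(G) ** 2
print(abs(lhs - rhs))  # numerically ~0: the two objectives are equivalent
```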

The subtensors $\mathcal{G}_{i_n}$ of the core tensor $\mathcal{G}$ satisfy the all-orthogonality property [32], which establishes that two subtensors $\mathcal{G}_{i_n=\alpha}$ and $\mathcal{G}_{i_n=\beta}$ are all-orthogonal:

$$
\langle \mathcal{G}_{i_n=\alpha}, \mathcal{G}_{i_n=\beta} \rangle = 0 \tag{7}
$$

for all possible values of $n$, $\alpha$, and $\beta$ subject to $\alpha \neq \beta$, and the ordering property:

$$\|\mathcal{G}_{i_n=1}\|_F \ge \|\mathcal{G}_{i_n=2}\|_F \ge \cdots \ge \|\mathcal{G}_{i_n=I_n}\|_F. \tag{8}$$

Our optimization problem can be solved by several algorithms. In this work, the HOOI algorithm (described in Section 5.1) was selected due to its convergence and orthogonality performance. Once a tensor $\mathcal{G}$ is obtained, a classifier $f$ that belongs to the hypothesis space $H$ maps the input data $\mathcal{G}$ into the output data $\hat{\mathbf{Y}}$; that is,

$$
\hat{\mathbf{Y}} = f(\mathcal{G}) \tag{9}
$$

where $f$ is a pixel-wise classifier. In this paper, an FCN for semantic segmentation was used as the classifier, due to the need to classify each pixel of the input image and to its performance in pixel accuracy. The FCN used in this work is described in Section 4.

#### **4. Convolutional Neural Networks (CNNs)**

CNNs are supervised feed-forward DL-ANNs for computer vision. Applying a sort of convolution of the synaptic weights of a neural network over the input data preserves spatial features, which alleviates the hard task of classification and, in turn, semantic segmentation. This type of ANN works under the same linear regression model as every machine learning (ML) algorithm. Since images are three-dimensional arrays, we can use tensor algebra notation to describe the input of a CNN as a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$, and $I_3$ represent the height, width, and depth of the third-order array, respectively; i.e., the spatial and spectral domains of an image. We can generally write the linear regression model used for ANNs as

$$
\hat{\mathbf{y}} = \sigma \left( \mathbf{W} \mathbf{g} + \mathbf{b} \right) \tag{10}
$$

where $\hat{\mathbf{y}}$ represents the output prediction of the network; $\sigma$ denotes an activation function; $\mathbf{g}$ is the input dataset; and $\mathbf{W}$ and $\mathbf{b}$ are the matrix of synaptic weights and the bias vector, respectively. These parameters are adjustable; i.e., their values are modified at every iteration, seeking convergence, in order to minimize the prediction loss through optimization algorithms [33]. For simplicity, the bias vector can be ignored, assuming that the matrix $\mathbf{W}$ will update until convergence independently of any other parameter [33]. Considering that the input dataset of a CNN is a multidimensional array, we can represent (9) and (10) using tensor algebra notation as

$$
\hat{\mathcal{Y}} = \sigma\left(\mathcal{W} \odot \mathcal{G}\right) \tag{11}
$$

where $\hat{\mathcal{Y}}$ represents the prediction output tensor of the ANN (in our case, a second-order tensor or matrix $\hat{\mathbf{Y}}$), $\mathcal{G}$ is the input dataset, and $\mathcal{W}$ is a $K_1 \times K_2 \times F_1$ tensor called the filter or kernel, holding the adaptable synaptic weights. Unlike in conventional ANNs, in CNNs $\mathcal{W}$ is a shiftable square tensor much smaller in height and width than the input data, i.e., $K_1 = K_2$ and $K_s \ll I_s$ for $s = 1, 2$; $F_1$ denotes the number of input channels, i.e., $F_1 = I_3$. For hidden layers, instead of the prediction tensor $\hat{\mathcal{Y}}$, the output is a matrix called an activation map $\mathbf{M} \in \mathbb{R}^{I_1 \times I_2}$, which preserves features from the original data in each domain. In fact, it is necessary to use as many kernels $\mathcal{W}^{(f_2)}$ as activation maps, with different initialization values, to preserve diverse features of the image. Hence, we can also define the activation maps as a tensor $\mathcal{M} \in \mathbb{R}^{I_1 \times I_2 \times F_2}$, where $F_2$ denotes the number of activation maps, one produced by each filter (see Figure 2). Kernels are displaced over the whole input image as a discrete convolution operation. Then, each element $m_{i_1 i_2 f_2}$ of the output activation map is computed as the sum of the Hadamard product of the kernel $\mathcal{W}^{(f_2)}$ and a subtensor of the input tensor $\mathcal{G}$ centered at position $(i_1, i_2)$ and with the same dimensions as $\mathcal{W}$, as follows

$$m_{i_1 i_2 f_2} = \sigma \left[ \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} \sum_{f_1=1}^{F_1} w_{k_1 k_2 f_1} \, g_{i_1 + k_1 - o_1,\, i_2 + k_2 - o_2,\, f_1} \right] \tag{12}$$

where $m_{i_1 i_2 f_2}$ denotes the value of the output activation map $f_2$ at position $(i_1, i_2)$; $\sigma$ represents the activation function; and $o_1$ and $o_2$ are offsets in the spatial dimensions, which depend on the kernel size and equal $\frac{K_1+1}{2}$ and $\frac{K_2+1}{2}$, respectively (see Figure 2).

**Figure 2.** Convolutional layer with a $K_1 \times K_2 \times F_1 \times F_2$ kernel. The number of input channels $F_1$ must equal the number of spectral bands $I_3$. To preserve the original dimensions at the output, zero padding is needed [18]. Output dimensions also depend on the stride; here $S = 1$, to consider every piece of pixel information and preserve the original dimensions.
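A minimal numpy sketch of Eq. (12), assuming zero padding and stride 1 as in Figure 2 (odd kernel sizes; the helper name `conv_layer` is ours):

```python
import numpy as np

def conv_layer(G, W, sigma=lambda z: np.maximum(z, 0.0)):
    """Activation maps per Eq. (12): zero padding, stride 1, 'same' output."""
    I1, I2, F1 = G.shape
    K1, K2, F1w, F2 = W.shape
    assert F1 == F1w, "input channels must match kernel depth"
    p1, p2 = K1 // 2, K2 // 2          # padding derived from kernel size
    Gp = np.pad(G, ((p1, p1), (p2, p2), (0, 0)))
    M = np.zeros((I1, I2, F2))
    for f2 in range(F2):               # one activation map per filter
        for i1 in range(I1):
            for i2 in range(I2):
                patch = Gp[i1:i1 + K1, i2:i2 + K2, :]
                # sum of the Hadamard product of kernel and window
                M[i1, i2, f2] = sigma(np.sum(W[:, :, :, f2] * patch))
    return M

G = np.random.rand(8, 8, 3)            # toy input: 8x8 pixels, 3 bands
W = np.random.randn(3, 3, 3, 4)        # 3x3 kernels, 3 channels, 4 filters
M = conv_layer(G, W)
print(M.shape)                          # (8, 8, 4)
```

In practice this triple loop is replaced by vectorized or GPU convolution primitives; the sketch only mirrors the index arithmetic of (12).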

An ANN is trained using iterative gradient-based optimizers, such as stochastic gradient descent, Momentum, RMSProp, and Adam [33]. These drive the cost function $L(\mathcal{W})$ to a very low value by updating the synaptic weights $\mathcal{W}$. The cost function can be any function that measures the difference between the training data and the prediction, such as the Euclidean distance or cross-entropy [10]. The same function is also used to measure the performance of the model during testing and validation. In order to avoid overfitting [33], the total cost function used to train an ANN combines one of the cost functions mentioned above with a regularization term:

$$J(\mathcal{W}) = L(\mathcal{W}) + R(\mathcal{W}), \tag{13}$$

where *J*(W) denotes the total cost function and *R*(W) represents a regularization function. Then, we can decrease *J*(W) by updating the synaptic weights in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.

$$\mathcal{W}' = \mathcal{W} - \alpha \nabla_{\mathcal{W}} J(\mathcal{W}), \tag{14}$$

where $\mathcal{W}'$ represents the synaptic weights tensor in the next training iteration, $\alpha$ denotes the learning rate parameter, and $\nabla_{\mathcal{W}} J(\mathcal{W})$ the cost function gradient. Gradient descent converges when every element of the gradient is zero or, in practice, very close to zero [10].
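A toy numpy sketch of the update (14), using a hypothetical quadratic cost $J(\mathcal{W}) = \|\mathcal{W} - \mathcal{T}\|_F^2$ (target $\mathcal{T}$ is ours, for illustration), whose gradient is $2(\mathcal{W} - \mathcal{T})$:

```python
import numpy as np

T = np.ones((3, 3))        # hypothetical target weights
W = np.zeros((3, 3))       # initial weights
alpha = 0.1                # learning rate

for _ in range(100):
    grad = 2.0 * (W - T)           # gradient of J(W) = ||W - T||_F^2
    W = W - alpha * grad           # W' = W - alpha * grad J(W), Eq. (14)

print(np.abs(W - T).max())         # near zero: the gradient has vanished
```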

CNNs have been successfully used in many image classification frameworks. This variation in architecture from other typical ANN models enables the network to learn spatial and spectral features, which are highly profitable for image classification. Moreover, FCNs, constructed with only convolutional layers, are able to classify each element of the input image; i.e., they yield pixel-wise classification or, in other words, semantic segmentation.

### **5. HOOI-FCN Framework**

In this work, we propose a TKD-CNN-based framework called HOOI-FCN, which maps the original highly correlated spectral image into a low-rank core tensor, preserving enough statistical information to facilitate pixel-wise image classification. The aim is to improve performance while reducing processing time in semantic segmentation ANNs by compressing CNNMSI third-order tensors. By applying TD methods, relevant information, mainly from the spectral domain, is preserved, which is convenient for the classification FCN. In summary, this novel framework is a two-step structure composed of an HOOI TD and an FCN for semantic segmentation, described below (see Figure 3).

### *5.1. Higher Order Orthogonal Iteration (HOOI) for Spectral Image Compression*

Quoting Kolda, "The truncated higher order singular value decomposition (HOSVD) is not optimal in terms of giving the best fit as measured by the norm of the difference, but it is a good starting point for an iterative alternating least squares algorithm" [17]. HOOI is an iterative algorithm to compute a rank-$(R_1, \ldots, R_N)$ TKD. Let $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ be an $N$-th order tensor and $R_1, \ldots, R_N$ a set of integers satisfying $1 \le R_n \le I_n$ for $n = 1, \ldots, N$; the rank-$(R_1, \ldots, R_N)$ approximation problem is to find a set of column-wise orthogonal $I_n \times R_n$ matrices $\mathbf{U}^{(n)}$ and an $R_1 \times \cdots \times R_N$ core tensor $\mathcal{G}$ by computing

$$\min_{\mathcal{G}, \mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(N)}} \|\mathcal{X} - \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \cdots \times_N \mathbf{U}^{(N)}\|_F^2 \tag{15}$$

and from the matrices $\mathbf{U}^{(n)}$, where $\mathbf{U}^{(n)T}\mathbf{U}^{(n)} = \mathbf{I}^{(n)}$, the core tensor $\mathcal{G}$ is found so as to satisfy (2) [34]. For a third-order tensor decomposition, we can rewrite (4) as

$$\hat{\mathcal{X}} = \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \times_3 \mathbf{U}^{(3)}, \tag{16}$$

where $\hat{\mathcal{X}}$ denotes the reconstruction approximation of the input spectral image $\mathcal{X}$, $\mathcal{G}$ is the $J_1 \times J_2 \times J_3$ core tensor, and $\mathbf{U}^{(1)} \in \mathbb{R}^{I_1 \times J_1}$, $\mathbf{U}^{(2)} \in \mathbb{R}^{I_2 \times J_2}$, and $\mathbf{U}^{(3)} \in \mathbb{R}^{I_3 \times J_3}$ are the projection matrices. Algorithm 1 shows HOOI for a third-order tensor decomposition; the extension to higher-order tensors is straightforward. Thus, with Algorithm 1 we compute the tensor $\mathcal{G}$ with rank-$(J_1, J_2, J_3)$ for each spectral image as a third-order tensor.

**Figure 3.** The big picture of the fast semantic segmentation framework proposed, with a fully convolutional network encoder-decoder architecture and a preprocessing HOOI tucker decomposition stage.

#### **Algorithm 1:** HOOI for MSI. ALS algorithm to compute the core tensor G.

**Function** HOOI($\mathcal{X}, R_1, R_2, R_3$)**:**
- initialize $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ for $n = 1, 2, 3$ using HOSVD
- **repeat**
  - **for** $n = 1, 2, 3$ **do**
    - $\mathcal{D} \leftarrow \mathcal{X} \times_{m \neq n} \mathbf{U}^{(m)T}$ (project on every mode except $n$)
    - $\mathbf{U}^{(n)} \leftarrow R_n$ leading left singular vectors of $\mathbf{D}_{(n)}$
  - **end**
- **until** *fit ceases to improve or maximum iterations exhausted*
- $\mathcal{G} \leftarrow \mathcal{X} \times_1 \mathbf{U}^{(1)T} \times_2 \mathbf{U}^{(2)T} \times_3 \mathbf{U}^{(3)T}$
- **Output:** $\mathcal{G}$, $\mathbf{U}^{(1)}$, $\mathbf{U}^{(2)}$, $\mathbf{U}^{(3)}$
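A minimal numpy sketch of Algorithm 1 for a third-order tensor (HOSVD initialization, truncated-SVD factor updates projecting on all modes except $n$); the helper names are ours, not from the paper:

```python
import numpy as np

def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def mode_n_product(A, U, n):
    return np.moveaxis(np.tensordot(U, A, axes=(1, n)), 0, n)

def hooi(X, ranks, max_iter=50, tol=1e-8):
    """Rank-(R1, R2, R3) Tucker decomposition via HOOI."""
    # HOSVD initialization: leading left singular vectors per unfolding.
    U = [np.linalg.svd(unfold(X, n))[0][:, :R] for n, R in enumerate(ranks)]
    fit_prev = -np.inf
    for _ in range(max_iter):
        for n in range(3):
            # Project on every mode except n, then update U(n).
            D = X
            for m in range(3):
                if m != n:
                    D = mode_n_product(D, U[m].T, m)
            U[n] = np.linalg.svd(unfold(D, n))[0][:, :ranks[n]]
        # Core tensor and fit (norm of the core) for the stopping test.
        G = X
        for m in range(3):
            G = mode_n_product(G, U[m].T, m)
        fit = np.linalg.norm(G)
        if abs(fit - fit_prev) < tol:
            break
        fit_prev = fit
    return G, U

X = np.random.rand(6, 7, 8)
G, U = hooi(X, (2, 3, 4))
print(G.shape)  # (2, 3, 4)
```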

#### *5.2. FCN for Semantic Segmentation of Spectral Images*

We use an FCN model for semantic segmentation based on the one proposed by Badrinarayanan et al. in [35], called SegNet. Each core tensor $\mathcal{G}$ obtained after decomposition is the input to the SegNet for training and testing the network. Hence, the feature activation maps $\mathcal{M} \in \mathbb{R}^{I_1 \times I_2 \times F_2}$ for each hidden layer of the SegNet encoder-decoder FCN are computed by displacing the filters $\mathcal{W}$ over the whole input core tensor with stride $S = 1$. It is worth noting that the kernel $\mathcal{W}$ is a fourth-order tensor $\mathcal{W} \in \mathbb{R}^{K_1 \times K_2 \times F_1 \times F_2}$, where $K_1$ and $K_2$ represent its spatial dimensions (height and width); $F_1$ its depth, i.e., the spectral domain; and $F_2$ denotes the number of filters used to produce $F_2$ activation maps (Figure 2). We express this convolution operation as

$$\mathbf{M}^{(f_2)} = \sigma\left(\mathcal{W} \odot \mathcal{G}\right), \tag{17}$$

where $\mathbf{M}^{(f_2)}$ represents each activation map for $f_2 = 1, \ldots, F_2$, and each value $m_{i_1 i_2 f_2}$ is computed as in (12). $\sigma$ denotes the rectified linear unit (ReLU) [33] function; i.e., $\sigma(z) = \max\{0, z\}$. The symbol $\odot$ is used in this paper to represent the convolution; i.e., the whole operation applied in convolutional layers (see Figure 2). These activation maps are the input for the subsequent layer in the SegNet FCN.

The last layer uses the softmax activation function [33] to produce a probability distribution and thus predict values relating each pixel to one of the $C$ classes of interest. Hence, for the last layer, we rewrite (17) as

$$
\hat{\mathbf{Y}} = \delta\left(\mathcal{W} \odot \mathcal{M}\right), \tag{18}
$$

where $\hat{\mathbf{Y}}$ represents the output prediction, $\mathcal{M}$ is the feature activation map tensor from the previous layer, $\delta$ the softmax activation function, and $\mathcal{W}$ the filter or kernel tensor with the adaptable synaptic weights.
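A minimal numpy sketch of the per-pixel softmax and class prediction in (18), on hypothetical toy scores rather than real network outputs:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis (last axis).
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical per-pixel scores for C = 3 classes on a 2x2 image.
scores = np.random.rand(2, 2, 3)
probs = softmax(scores)            # probability distribution per pixel
Yhat = probs.argmax(axis=-1)       # most likely class per pixel
print(Yhat.shape)                  # (2, 2)
```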

The output of the FCN is a matrix $\hat{\mathbf{Y}}$ with the same spatial dimensions as the input, holding the most likely class for each pixel. Figure 4 shows the architecture of the SegNet model used in this work. The experiments present the behavior of this FCN with and without data compression in the spectral domain.

**Figure 4.** SegNet FCN. Encoder-decoder architecture with convolutional, pooling, and upsampling layers with their corresponding activation functions and batch normalization [33].

#### **6. Experimental Results**

#### *6.1. Our Data*

As a case study, a CNNMSI dataset with 100 RS images was used for training and 10 for testing, all from central Europe with 128 × 128 pixels. These images are partitions of the original Sentinel-2 images, without modification, all semi-manually labeled and with abundant presence of the elements of interest. In Table 3, the 10 scenarios correspond to our 10 test images. We used only nine of the 13 available spectral bands, from visible and NIR to SWIR wavelengths. Bands 2, 3, 4, and 8 have 10 m resolution, and bands 5, 6, 7, 11, and 12 have 20 m (oversampled to 10 m [18]). These bands provide decisive information for the discrimination of different classes. Bands 1, 9, and 10 were dismissed because of their lower spatial resolution of 60 m. Band 8A, also with 20 m spatial resolution, was dismissed due to wavelength overlap with band 8. It is worth mentioning that the framework proposed in this work can be applied to any kind of spectral image and to multitemporal datasets [36].

#### 6.1.1. The Training Space

For training, the input data was a tensor $\mathcal{X} \in \mathbb{R}^{128 \times 128 \times 9 \times 100}$, where 128 × 128 are the spatial dimensions, 9 is the number of spectral bands, and 100 is the number of training images. Although the number of images seems low, taking into account that we work in the pixel domain, the real number of training points or vectors is high. Indeed, our FCN for semantic segmentation was trained with 128 × 128 × 100 = 1,638,400 samples or vectors. To test whether the training set was sufficiently large, a smaller subtensor of $\mathcal{X}$, $\mathcal{X}_p \in \mathbb{R}^{128 \times 128 \times 9 \times 80}$, equivalent to 1,310,720 points or vectors, was used for a second training, obtaining, on the same test set, an average PA of 91.48%; i.e., only 0.08% less than with 100 images (91.56%). We also tested these results with a third training on an extended dataset of 120 images, $\mathcal{X}_q \in \mathbb{R}^{128 \times 128 \times 9 \times 120}$, equivalent to 1,966,080 vectors, and found only a slight variation of +0.01% in the PA (91.57%), while the training execution time increased significantly.

#### 6.1.2. The Labels

Our labels were acquired using the scene classification algorithm developed by the ESA [19]; misclassified pixels were subsequently corrected semi-manually.
