**1. Introduction**

Nowadays, remote sensing (RS) is used in numerous applications [1–3] for the following main reasons. Different types of useful information can potentially be retrieved from RS images, especially high-resolution and multichannel data (i.e., a set of co-registered component images of the same territory acquired at different wavelengths, polarizations, or even by different sensors [3–6]). Modern RS sensors often offer the possibility of fast and frequent data collection; good examples are the multichannel sensors Sentinel-1 and Sentinel-2, which have been launched recently and have started to produce a great amount of valuable RS data [7,8].

Therefore, the volume of RS data greatly increases due to the aforementioned factors: better spatial resolution, a larger number of channels, and more frequent observations. This causes challenges that relate to all basic stages of RS data processing: co-registration, calibration, pre- or post-filtering, compression, segmentation, and classification [9,10]. One of the most serious challenges is the compression of multichannel RS images [11–13]. Compression is applied to diminish data size before downlink transfer from spaceborne (more rarely, airborne or UAV) sensors, to store acquired images in on-land RS data collection centers or special depositories, and to pass images to potential customers. Lossless compression is often unable to meet the requirements to compression ratio (CR), since, even in the most favorable situations of high inter-band correlation of component images [14], the CR attained by the best existing lossless compression techniques reaches only 4 ... 5 [12].

Lossy compression can provide considerably larger CR values [12,13,15–17] but at the expense of introduced distortions. It is always a problem to reach an appropriate compromise between compressed image quality (characterized in many different ways) and CR [11–13,18–20]. There are several reasons for this. First, compression providing a fixed CR, often used in practice [21,22], leads to compressed images whose quality can vary within wide limits [23]. For a given CR, the introduced distortions are smaller and compressed image quality is higher for simple-structure images, whilst for images having a more complex structure (containing many small-sized details and textures), the losses are larger and, thus, image quality can be inappropriate since some important information is inevitably lost, which is undesired. Certainly, some improvement can be reached by employing a better coder, adapting to image content [24], exploiting inter-channel correlation via three-dimensional (3D) compression [12,13,21,22], or other means. However, the positive effect can be limited, and/or there can be restrictions, e.g., the necessity to apply an image compression standard [12,19,21].

Thus, we are more interested in a different approach that presumes lossy compression of multichannel RS images with a certain control of introduced distortions, combined with a simultaneous desire to provide a larger CR. Let us explain the advantages of such an approach, when it can be reasonable, and what the conditions for its realization are. Recall here that RS data compressed in a lossy manner can be useful if: (a) the introduced losses do not lead to a significant reduction of the accuracy of solving the final tasks of RS image processing, such as segmentation, object detection, spectral unmixing, classification, and parameter estimation [17,25–34]; (b) the distortions due to lossy compression do not manifest themselves as artifacts leading to undesired negative effects (e.g., the appearance of ghost artifacts) in solving the aforementioned final tasks.

Below, we consider the impact of lossy compression on multichannel RS image classification [27,28,33–35]. Specifically, we focus on the case of a limited number of channels, e.g., color, multi-polarization, or multispectral images, due to the following reasons: (a) it is simpler to demonstrate the effects that take place in images due to lossy compression and how these effects influence classification just for the case of a small number of image components; (b) an accurate classification of multichannel images with a small number of components is usually a more difficult task than the classification of hyperspectral data (images with a relatively large number of components) because of a limited number of available features and the necessity to reliably discriminate classes in feature space.

Returning to the advantages of lossy compression of multichannel RS data, we can state the following. First, images compressed in a lossy manner might be classified better than original images [27,34]. Such an effect usually takes place if original images are noisy and coder parameters are adjusted so that the noise removal effect due to lossy compression [13,36–38] is maximal. However, even if the noise in images is absent (or, more exactly, peak signal-to-noise ratio (PSNR) values in original images are high and noise is not seen in visualized component images), the accuracy of image classification may remain practically the same for a relatively wide range of CR variation [27,34,38].

This range width can vary [39]; it depends on the following main factors:


Therefore, researchers have a wide field of studies aimed at understanding how to set a CR or a coder's parameter that controls compression (PCC) for a given multichannel image, used lossy compression technique, and considered classifier. Here we would like to recall some already known aspects and dependencies that will be exploited below. First, for the same CR, considerably larger distortions can be introduced into images with higher complexity [40]. Moreover, larger distortions can be introduced into more complex and/or noisier images for the same PCC, for example, the quantization step (QS) of coders based on orthogonal transforms (discrete cosine transform (DCT) or wavelets [40]). Second, the desired quality of a compressed image (a value of a metric that characterizes image quality) can nowadays be predicted and provided with an appropriate accuracy [40,41], at least for some DCT-based coders such as the coder AGU [42]. Third, there exist certain dependencies between the values of conventional (e.g., PSNR) and/or visual quality metrics and the accuracy of correct classification of an entire image or of individual classes [43]. Classification accuracy for classes mainly represented by large-size homogeneous objects (water surfaces, meadows) correlates better with the conventional metrics, whilst the probability of correct classification for classes mainly represented by textures and small-sized or prolonged objects (forests, urban areas, roads, narrow rivers) correlates with visual quality metrics. Besides, classification accuracies for different classes can depend on PCC or CR differently [38]. For the latter type of classes, the reduction of the probability of correct classification with an increase of CR is usually faster.
Fourth, different classifiers produce classification results that can significantly differ both in the sense of obtained classification maps and quantitative characteristics as the total probability of correct classification, probabilities for classes, and confusion matrices [43,44]. Fifth, there are quite successful attempts to predict the total probability of correct classification (TPCC) of compressed RS images based on the fractal models [28].

Aggregating all these, we assume the following:


The goal of this paper is to carry out a preliminary analysis of whether these assumptions are valid. The main contributions are the following. First, we show that it is worth applying lossy compression with distortions around their invisibility threshold to provide either some improvement of TPCC (achieved for simple-structure images) or an acceptable reduction of TPCC (for complex-structure images) compared to the TPCC for the corresponding original image. Moreover, we show how such compression can be carried out for a particular coder. Second, we demonstrate that classifier training on compressed data usually provides slightly better results than training on original (uncompressed) images.

The paper structure is the following. Section 2 deals with a preliminary analysis of the performance criteria used in lossy compression of RS images. In Section 3, two popular methods of RS data classification based on the maximum likelihood method (MLM) and neural network (NN) are described. Section 4 contains the results of classifier testing for the synthetic three-channel test image. Section 5 provides the results of experiments carried out for real-life three-channel image. Discussion and analysis of some practical peculiarities are presented in Section 6; the conclusions are shown in Section 7.

### **2. Performance Criteria of Lossy Compression and Their Preliminary Analysis**

Consider an image $I_{ij}^{t}$, where i and j denote pixel indices and $I_{Im}$ and $J_{Im}$ define the processed image size. If one deals with a multichannel image, a channel index *q* can be used. Let us start with the case of a single-component image. To characterize compression efficiency, we have used the following three metrics. The first one, the traditional PSNR, is defined by

$$\mathrm{PSNR} = 10\log_{10}\left(\mathrm{DR}^2 \Big/ \left(\sum_{i=1}^{I_{Im}}\sum_{j=1}^{J_{Im}}\left(I_{ij}^{c}-I_{ij}^{t}\right)^2 \Big/ (I_{Im}\times J_{Im})\right)\right) = 10\log_{10}\left(\mathrm{DR}^2/\mathrm{MSE}\right) \tag{1}$$

where DR denotes the range of image representation (dynamic range), $I_{ij}^{c}$ is the compressed image, and MSE is the mean square error of the distortions introduced by lossy compression. In general, DR in remote sensing applications might differ significantly from 255, which is common for processing RGB color images. Below, we will mainly consider single- and multichannel RS images having DR = 255 but will also analyze what should be done if DR is not equal to 255.
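For clarity, Equation (1) can be computed directly; a minimal sketch in Python (the function name and interface are our own, not from the paper):

```python
import numpy as np

def psnr(original, compressed, dr=255.0):
    """PSNR per Equation (1): 10*log10(DR^2 / MSE).

    `dr` is the range of the image representation; it equals 255 for
    8-bit data but may differ for other RS data formats.
    """
    original = np.asarray(original, dtype=np.float64)
    compressed = np.asarray(compressed, dtype=np.float64)
    mse = np.mean((compressed - original) ** 2)
    if mse == 0.0:
        return np.inf  # identical images
    return 10.0 * np.log10(dr ** 2 / mse)
```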

PSNR is a conventional metric widely used in many image processing applications. Meanwhile, its drawbacks are well known; the main one is that it is not adequate in characterizing image visual quality [45,46]. Currently, there are numerous other, so-called HVS (human visual system) metrics that are considered more adequate [45–47]. Some of the existing HVS metrics can be applied only to color images. As we need metrics that are good enough and applicable to single-component images, we apply the metrics PSNR-HVS and PSNR-HVS-M. The former takes into account the lower sensitivity of human vision to distortions in high-frequency spectral components; the latter also takes into consideration the masking effect of image texture and other heterogeneities [47]. These metrics are defined by

$$\text{PSNR-HVS} = 10\log_{10}\left(\mathrm{DR}^2/\mathrm{MSE}_{\mathrm{HVS}}\right) \tag{2}$$

$$\text{PSNR-HVS-M} = 10\log_{10}\left(\mathrm{DR}^2/\mathrm{MSE}_{\mathrm{HVS\text{-}M}}\right) \tag{3}$$

where $\mathrm{MSE}_{\mathrm{HVS}}$ and $\mathrm{MSE}_{\mathrm{HVS\text{-}M}}$ are the MSEs determined with the aforementioned effects taken into account. If PSNR-HVS is approximately equal to PSNR, then the distortions due to lossy compression have properties similar to additive white Gaussian noise. If PSNR-HVS is smaller than PSNR, this evidences that the distortions are more similar to non-Gaussian (heavy-tailed) noise, spatially correlated noise, or both [47] (see also [48] for details). If PSNR-HVS-M is considerably larger than PSNR, then the masking effect of image content is significant (most probably, the image is highly textural). In turn, if PSNR-HVS-M is approximately equal to or smaller than PSNR, then the masking effect is absent and/or the distortions are like spatially correlated noise.
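These qualitative rules can be gathered into a small helper; a sketch in which the 0.5 dB tolerance is an illustrative choice of ours, not a value from the paper:

```python
def diagnose_distortions(psnr, psnr_hvs, psnr_hvs_m, tol=0.5):
    """Interpret metric differences per the rules in the text.

    `tol` (dB) decides when two metric values count as "approximately
    equal"; it is an assumed value for illustration only.
    Returns a list of plain-text observations.
    """
    notes = []
    if abs(psnr_hvs - psnr) <= tol:
        notes.append("distortions behave like additive white Gaussian noise")
    elif psnr_hvs < psnr:
        notes.append("distortions look heavy-tailed and/or spatially correlated")
    if psnr_hvs_m > psnr + tol:
        notes.append("image content (texture) masks the distortions")
    else:
        notes.append("little or no masking effect")
    return notes
```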

Figure 1a,c,e present three grayscale test RS images of different complexity: the image Frisco has the simplest structure, whilst the image Diego is the most textural. Figure 1b,d,f represent the dependences of the considered metrics in Equations (1)–(3) on the quantization step, which serves as the PCC in the coder AGU [42] (this compression method employs 2D DCT in 32 × 32 pixel blocks, an advanced algorithm of coding uniformly quantized DCT coefficients, and embedded deblocking after decompression). As one may expect, all metrics become smaller (worse) as QS increases. Meanwhile, for the same QS, compressed image quality is not the same. As an example, for QS = 20, PSNR and PSNR-HVS-M for the test image Frisco are practically the same (about 40 dB). For the test images Airfield and Diego, which have more complex structures, the PSNR-HVS-M values are also about 40 dB whilst the PSNR values are about 34 dB. This means the following: (1) there are masking effects, i.e., textures and heterogeneities still significantly mask the introduced distortions; (2) the introduced distortions can be noticed by visual inspection (comparison of original and compressed images). Recall that distortion visibility thresholds are about 35 ... 38 dB according to the metric PSNR and about 40 ... 42 dB according to the metric PSNR-HVS-M [45].

**Figure 1.** Test grayscale images Frisco (**a**), Airfield (**c**), and Diego (**e**), all of size 512 × 512 pixels, and dependences of the considered metrics (PSNR, PSNR-HVS, PSNR-HVS-M, all in dB) on the quantization step (QS) for the coder AGU for these test images ((**b**), (**d**), and (**f**), respectively).

Note that the distortions are invisible if QS is smaller than 18 ... 20 for grayscale images represented as 8-bit data arrays [23] (more generally, if QS ≤ DR/(12 ... 13)). One should also keep in mind that the use of the same QS leads to significantly different CR. For example, for QS = 20, the CR values are equal to 26.3, 4.6, and 4.5 for the considered test images Frisco, Airfield, and Diego, respectively. These facts and the dependences given in Figure 1 confirm the basic statements presented in the Introduction.

Alongside the differences in metric values for the same QS considered above for QS ≈ 20 (similar differences take place for QS > 20; compare the corresponding plots in Figure 1b,d,f), there are interesting observations for QS ≤ 10 (more generally, QS ≤ DR/25). In this case, $\mathrm{MSE} \approx \mathrm{QS}^2/12$ [20], PSNR exceeds 39 dB, and PSNR-HVS-M is not smaller than PSNR and exceeds 45 dB. Thus, the introduced distortions are not visible, and the desired $\mathrm{PSNR}_{des}$ (or, respectively, $\mathrm{MSE}_{des}$) can easily be provided by a proper setting of QS as $\mathrm{QS} \approx (12\,\mathrm{MSE}_{des})^{1/2}$.
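This rule for setting QS from a desired quality level can be expressed directly; a sketch (names are ours) that is valid only in the near-lossless zone QS ≤ DR/25, where MSE ≈ QS²/12 holds:

```python
import math

def qs_for_target_psnr(psnr_des, dr=255.0):
    """Quantization step providing a desired PSNR in the near-lossless
    zone, where the high-rate approximation MSE ~= QS^2 / 12 applies.

    From PSNR_des = 10*log10(DR^2 / MSE_des):
        MSE_des = DR^2 / 10**(PSNR_des / 10)
        QS     ~= sqrt(12 * MSE_des)
    """
    mse_des = dr ** 2 / 10.0 ** (psnr_des / 10.0)
    return math.sqrt(12.0 * mse_des)
```

For example, a target PSNR of 45 dB for 8-bit data gives QS close to 5, well inside the invisibility zone discussed above.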

Certainly, numerous quality metrics have been proposed so far. Below, we carry out our study based on PSNR-HVS-M for the following reasons. First, it is one of the best component-wise quality metrics [49] and can be calculated efficiently. Second, the cross-correlation between the best existing quality metrics is high (see the data in the Supplementary Materials). Therefore, other quality metrics can be used in our approach, provided a corresponding preliminary analysis of their properties is carried out. Note that we have earlier widely used PSNR-HVS-M in image compression and denoising applications [23], exploiting its main properties.

### **3. Considered Approaches to Multichannel Image Classification**

As has been mentioned above, there are numerous approaches to the classification of multichannel images. Below we consider two of them. The first one is based on the maximum likelihood method (MLM) [50,51] and the second one relies on a neural network (NN) training [50]. In both cases, the pixel-wise classification is studied. There are many reasons behind using the pixel-wise approach and just these classifiers: (a) to simplify the classification task and to use only Q features (q = 1, ... , Q), i.e., the values of a given multichannel image in each pixel; (b) to show the problems of pixel-wise classification; (c) MLM and NN based classifiers are considered to be among the best ones [50].

Note that classifiers of RS data can be trained in different ways. One option is to train a classifier in advance using earlier acquired images, e.g., uncompressed ones stored for training or other purposes. Another option is to use a part of the obtained compressed image for classifier training and then use the trained classifier for entire image classification [39]. One can expect that the classification results would differ, and both options have their advantages and drawbacks.

### *3.1. Maximum Likelihood Classifier*

The classification (recognition) function for MLM is based on the calculation of some metric in the feature space

$$\Phi = \begin{cases} 1, & \rho\left(\vec{x}^{\,*}, \vec{x}_s\right) \ge \mathrm{tr};\\ 0, & \rho\left(\vec{x}^{\,*}, \vec{x}_s\right) < \mathrm{tr}, \end{cases} \tag{4}$$

where $\vec{x}^{\,*}$ and $\vec{x}_s$ are the feature (attribute) vectors for the current and sample objects (image pixels in our case); $\rho(\cdot,\cdot)$ is the used metric of vector similarity; tr denotes the decision threshold.

As components of the feature vector, in general, one can consider both original ones (pixel (voxel) values of a multichannel image) and "derivative" features calculated on their basis (e.g., ratios of these values).

If Φ = 1, then the current object (pixel) determined by the feature vector $\vec{x}^{\,*}$ is assigned to the class $a_s$.

Taking into account the stochastic nature of the features $\vec{x} \in R^c$, the information about each object class is contained in the parameters $\vec{\theta}$ of the joint probability density functions (PDFs) $f_c\left(\vec{x}; \vec{\theta}\right)$.

This information is used for creating statistical decision rules

$$L = \frac{f_c\left(\vec{x}^{\,*}; \vec{\theta} \,\middle|\, a_2\right)}{f_c\left(\vec{x}^{\,*}; \vec{\theta} \,\middle|\, a_1\right)} \ge \mathrm{tr} \tag{5}$$

where L is the likelihood ratio. The threshold tr is determined by the statistical criterion used; for example, the Bayesian criterion provides the optimal classifier when the following information is available: the PDFs for all sets of patterns, the probabilities of each class occurrence, and the losses connected with the probabilities of misclassification.

In most practical cases, for object description, statistical sample information is used, i.e., statistical estimates of PDFs obtained at the training stage are employed in likelihood ratio L. Then, sample size, reliability of information about observed objects, and efficiency of using this information in decision rule mainly determine the quality of undertaken decisions.

Training samples are formed using pixels representing a given class; usually, it is some area (or areas, i.e., a set of pixels) in an image identified based on ground-truth data for the sensed terrain. The main requirement for training samples is their representativeness: the pixels of the sample must correspond to one class on the ground, and such a class should occupy a territory that is fairly well represented by pixels in the image with the given resolution. In other words, the number of pixels in the selected area of the image should ensure the adoption of statistically significant decisions.

When constructing a multi-alternative decision rule (K > 2), the maximum likelihood criterion is used; the threshold tr in Equation (5) is assumed to be equal to 1. The maximum likelihood criterion makes it possible to eliminate the uncertainty of the solution (when none of the classes can be considered preferable to the others), does not require knowledge of the a priori probabilities of the classes or the loss function, allows evaluating the reliability of the solutions, and can easily be generalized to the case of many classes. In accordance with this criterion, the control sample (the measured values of the features in the current image pixel) is considered to belong to the class $a_u$, 1 ≤ u ≤ K, for which the likelihood function (or rather, its estimate obtained at the training stage) is maximal:

$$f_c\left(\vec{x}^{\,*}; \vec{\theta} \,\middle|\, a_u\right) = \max_{1\le k\le K}\left\{ f_c\left(\vec{x}^{\,*}; \vec{\theta} \,\middle|\, a_k\right) \right\} \;\Rightarrow\; \vec{x}^{\,*} \in a_u. \tag{6}$$

If the assumption of a normal distribution law for the feature vector holds, the maximum likelihood method in Equation (6) provides optimal recognition. The assumption of normality of the features, on the one hand, simplifies the estimation of the parameters of the distribution laws (DL) and makes it possible to take into account the mutual correlation between data in different channels in the class models. On the other hand, it is known that the distributions of real multichannel data are often non-Gaussian. Even if the observed DL tends to normality due to the central limit theorem, the nonlinearity introduced during the reception and primary processing of information, which manifests itself in a change in the shape and parameters of the observed distribution, cannot be ignored.
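Under the Gaussian assumption, the rule in Equation (6) reduces to comparing per-class log-likelihoods. A minimal numpy sketch (illustrative only; the interface is ours, and the class models actually used below are Johnson SB distributions, not Gaussians):

```python
import numpy as np

def gaussian_ml_classify(pixels, means, covs):
    """Pixel-wise maximum likelihood rule of Equation (6) under a
    Gaussian class model.

    pixels: (N, c) feature vectors; means: list of (c,) class means;
    covs: list of (c, c) class covariance matrices.
    Returns the index of the most likely class for each pixel.
    """
    pixels = np.asarray(pixels, dtype=np.float64)
    log_liks = []
    for mu, cov in zip(means, covs):
        diff = pixels - np.asarray(mu, dtype=np.float64)
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        # Mahalanobis distance per pixel; the (2*pi)^(c/2) constant
        # is the same for all classes and cancels in the argmax.
        maha = np.einsum("ni,ij,nj->n", diff, inv, diff)
        log_liks.append(-0.5 * (maha + logdet))
    return np.argmax(np.stack(log_liks, axis=1), axis=1)
```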

To increase the accuracy of approximating empirical non-Gaussian distributions of features and to take into consideration the possible presence of strong correlation relationships, it is advisable to apply multiparameter distribution laws based on non-linear transformations of normal random variables. These transformations include Johnson's transformations.

For describing a class of spectral features, we propose to apply Johnson SB-distribution [52,53]:

$$f(x) = \frac{\eta\lambda}{\sqrt{2\pi}\,(x-\varepsilon)(\varepsilon+\lambda-x)}\, e^{-\frac{1}{2}\left(\gamma + \eta\ln\frac{x-\varepsilon}{\varepsilon+\lambda-x}\right)^2} \tag{7}$$

where η and γ are two shape parameters connected with skewness and kurtosis (η > 0, γ ∈ (−∞, +∞)); ε and λ are two scale parameters of the random variable x (ε ∈ (−∞, +∞), ε ≤ x ≤ ε + λ).

Due to the large number of parameters, the model in Equation (7) is quite universal and able to approximate almost any unimodal distribution as well as a wide spectrum of bimodal distributions. Besides, since Johnson distributions are based on nonlinear transformations of normal random variables, their use in describing empirical distributions allows staying within the correlation theory framework, which is important for the considered features characterized by non-Gaussian distributions (see the examples below) and the presence of strong inter-channel correlation.
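Equation (7) is straightforward to evaluate numerically; a sketch (parameter names follow the text, the interface is ours):

```python
import numpy as np

def johnson_sb_pdf(x, eta, gamma, eps, lam):
    """Johnson SB density of Equation (7).

    eta > 0 and gamma are the shape parameters; eps and lam define the
    support (eps < x < eps + lam). Expects an array input and returns 0
    outside the support.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    inside = (x > eps) & (x < eps + lam)
    xi = x[inside]
    z = gamma + eta * np.log((xi - eps) / (eps + lam - xi))
    out[inside] = (eta * lam
                   / (np.sqrt(2.0 * np.pi) * (xi - eps) * (eps + lam - xi))
                   * np.exp(-0.5 * z ** 2))
    return out
```

As a sanity check, the density integrates to one over its support, and with γ = 0 it is symmetric about the midpoint ε + λ/2.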

The estimates of the parameters of the PDFs, $\vec{\theta}\,|\,a_k = (\varepsilon_k, \lambda_k, \eta_k, \gamma_k)$, can be found by numerical optimization of a loss function minimizing the integral MSE between the empirical distribution (histogram) and the theoretical model in Equation (7).

The multidimensional variant of Equation (7) is the following [52]:

$$\begin{split} f_c(\vec{x}\,|\,a) &= (2\pi)^{-c/2}\,|\Xi|^{-1/2} \prod_{v=1}^{c} \frac{\eta_v \lambda_v}{(x_v-\varepsilon_v)(\varepsilon_v+\lambda_v-x_v)} \times\\ &\times \exp\left[-\frac{1}{2}\sum_{v,k=1}^{c} \Xi_{vk}^{-1}\left(\gamma_v + \eta_v\ln\frac{x_v-\varepsilon_v}{\varepsilon_v+\lambda_v-x_v}\right)\left(\gamma_k + \eta_k\ln\frac{x_k-\varepsilon_k}{\varepsilon_k+\lambda_k-x_k}\right)\right] \end{split} \tag{8}$$

where *c* is the size of the feature vector $\vec{x}$ and $\mathbf{\Xi}$ is the sample correlation matrix.

Thus, to construct a multidimensional statistical model of correlated data, processing consists of several stages. At the first stage, a traditional statistical analysis is performed by estimating the moments of the distributions and the histograms of the components of the random vector. Sample estimates of the cross-correlation coefficients of the components of $\vec{x}$ are also found. At the second stage, one-dimensional statistical models of the form of Equation (7) are constructed for each of the components of the multidimensional data. The initial estimate of the distribution density (the histogram of the feature $x_v$) is used as a target for the desired model $f(x_v)$ described by the Johnson DL parameter vector $\vec{\theta}$. The multidimensional statistical model is built based on multidimensional histograms $\hat{f}_c(\vec{x})$. To construct a multidimensional model as in Equation (8) for a class $a_k$, one needs to estimate the $4 \times c$ matrix of distribution parameters $\mathbf{\Theta}_k = [\varepsilon_{ck}, \lambda_{ck}, \eta_{ck}, \gamma_{ck}]$ and the sample correlation matrix $\mathbf{\Xi}$.

After the training samples are created and their parameters estimated, image pixels are assigned to the classes based on the decision rule in Equation (6). If pixel-wise classification is applied, the data for each pixel with coordinates (i, j) are processed separately (independently). The value vector for the current pixel, $\vec{x}^{\,*}_{ij}$, is substituted into the mathematical models of the class targets in Equation (5). The obtained estimates $\hat{f}_c\left(\vec{x}^{\,*}_{ij} \,\middle|\, a_k\right)$, k = 1 ... K, are compared with each other, and the maximal estimate of the likelihood function is chosen; its index is the index of the class to which the current pixel (i, j) is assigned.

### *3.2. NN Classifier Description*

The second classifier is based on a feedforward neural network. Nowadays, there exist many NNs and approaches to their training and optimization [54]. We have chosen a simple but efficient NN for our application that can be easily implemented or deployed on different platforms and devices. The employed classifier is a fully connected perceptron combined with a self-organizing map that treats the obtained weights, i.e., the probabilities of a pixel belonging to one of the classes. The architecture details are the following: an input layer for the incoming data of three color components and one hidden layer with 30 fully connected neurons (this number has been shown to perform equally well compared with larger numbers of hidden layer neurons). For the hidden layer, the tanh activation function is exploited. Scaled conjugate gradient backpropagation is used as the global training function. As has been mentioned above, the output layer of the used NN is a self-organizing map fully connected with the neurons of the hidden layer. It provides the mapping of the output probabilities of the decision to the list of classes as vectors.

Training and validation have been performed in the following way. A given image is divided into a set of pixel color value vectors. The obtained set of vectors is then permuted to introduce a non-ordinary allocation for each training run. For NN training, we have taken 70% of the produced set; the other 30% of the data is used for validation. The self-dataset validation at this first stage has to be performed to verify the chosen NN architecture, the NN parameters, and other peculiarities of the training process. We have considered different configurations of NN-based classifiers, varying, in particular, the number of hidden layers, neurons, and connections among them. It has been established that complex architectures, like a multi-layer perceptron, do not provide better classification efficiency and even cause over-fitting on the same training dataset. As a result, the chosen architecture was an NN with one hidden layer, with the optimal number of training epochs equal to 50 for the given NN. The proposed classifier is easy to use and fast (which is crucial for the application of classification to remotely sensed data). The overall training process was repeated 100 times with full permutation of the dataset, and the obtained classifier has been applied to the test images.
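The setup described above can be approximated as follows; a sketch with loudly labeled substitutions: scikit-learn's `MLPClassifier` stands in for the paper's network, its `lbfgs` solver replaces scaled conjugate gradient backpropagation (which scikit-learn does not provide), and a plain softmax output replaces the self-organizing output map.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def train_pixel_classifier(pixels, labels, seed=0):
    """Approximate sketch of the NN classifier described in the text.

    pixels: (N, 3) pixel color-component vectors; labels: (N,) class
    indices. Permutes the data and holds out 30% for validation, as in
    the 70/30 split described above. Returns the trained network and
    its validation accuracy.
    """
    x_tr, x_val, y_tr, y_val = train_test_split(
        pixels, labels, test_size=0.3, shuffle=True, random_state=seed)
    net = MLPClassifier(hidden_layer_sizes=(30,),  # one hidden layer, 30 neurons
                        activation="tanh",
                        solver="lbfgs",            # stand-in solver (assumption)
                        max_iter=500,
                        random_state=seed)
    net.fit(x_tr, y_tr)
    return net, net.score(x_val, y_val)
```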
