#### *2.3. Parameter Learning and Entropy Model Estimation*

#### 2.3.1. Loss Function: Rate-Distortion Trade-Off

The autoencoder parameters (filter weights, GDN/IGDN parameters and representation distribution model) are jointly learned by optimizing a loss function involving the rate *R*(**yˆ**) and the distortion *D*(**x**, **xˆ**) between the original image **x** and the reconstructed image **xˆ**. The rate-distortion criterion, denoted *J*(**x**, **xˆ**, **yˆ**), is expressed as the weighted sum:

$$J(\mathbf{x}, \hat{\mathbf{x}}, \hat{\mathbf{y}}) = \lambda D(\mathbf{x}, \hat{\mathbf{x}}) + R(\hat{\mathbf{y}}),\tag{3}$$

where *λ* is a key parameter that tunes the rate-distortion trade-off.

• The rate *R* achieved by an entropy coder is lower-bounded by the entropy derived from the actual discrete probability distribution *m*(**y**ˆ) of the quantized vector **y**ˆ. The rate increase comes from the mismatch between the probability model *p***y**ˆ(**y**ˆ) required for the coder design and *m*(**y**ˆ). The bit-rate is given by the Shannon cross entropy between the two distributions:

$$H(\hat{\mathbf{y}}) = \mathbb{E}\_{\hat{\mathbf{y}} \sim m} \left[ -\log\_2 p\_{\hat{\mathbf{y}}}(\hat{\mathbf{y}}) \right],\tag{4}$$

where ∼ means "distributed according to". The bit-rate is thus minimized if the distribution model *p***yˆ**(**yˆ**) is equal to the distribution *m*(**yˆ**) arising from the actual distribution of the input image and from the analysis transform *Ga*. This highlights the key role of the probability model.

• The distortion measure *D* is chosen to account for image quality as perceived by a human observer. Due to its many desirable computational properties, the mean square error (MSE) is generally selected. However, a measure of perceptual distortion may also be employed such as the multi-scale structural similarity index (MS-SSIM) [24].

The loss function defined in Equation (3) is minimized through gradient descent with back-propagation [7] on a representative image training set. However, this requires the loss function to be differentiable. In the specific context of compression, a major hurdle is that the derivative of the quantization function is zero everywhere except at integers, where it is undefined. To overcome this difficulty, a relaxation of the quantization is considered in the backward pass (i.e., when back-propagating the gradient of the error). Ballé et al. (2016) [13] proposed to relax the quantization into the addition of independent and identically distributed (i.i.d.) uniform noise, while Theis et al. (2017) [14] proposed to replace the derivative of the quantization function with a smooth approximation. In both cases, the quantization is kept as it is in the forward pass (i.e., when processing the input data).
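To make this concrete, the sketch below illustrates the two relaxations under simple assumptions (unit quantization step, TensorFlow); the function names are hypothetical and the actual implementations of [13,14] differ in their details.

```python
import tensorflow as tf

def noisy_quantize(y, training):
    """Relaxation in the spirit of [13]: during training, quantization is
    replaced by additive i.i.d. uniform noise on [-1/2, 1/2] (unit step);
    at inference, hard rounding is applied."""
    if training:
        return y + tf.random.uniform(tf.shape(y), minval=-0.5, maxval=0.5)
    return tf.round(y)

@tf.custom_gradient
def quantize_ste(y):
    """Relaxation in the spirit of [14]: hard rounding in the forward pass,
    while the zero/undefined derivative is replaced by a smooth surrogate
    (here the identity, one common choice) in the backward pass."""
    def grad(dy):
        return dy
    return tf.round(y), grad
```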

#### 2.3.2. Entropy Model

As stressed above, a key element in the end-to-end learned image compression frameworks is the entropy model defined through the probability model *p***yˆ**(**yˆ**) assigned to the quantized representation for coding.

• Fully factorized model: For simplicity, in [13,14], the approximated quantized representation was assumed independent and identically distributed within each channel and the channels were assumed independent of each other, resulting in a fully factorized distribution:

$$p\_{\hat{\mathbf{y}}|\boldsymbol{\psi}}(\hat{\mathbf{y}} \mid \boldsymbol{\psi}) = \prod\_{i} p\_{\hat{y}\_{i}|\psi^{(i)}}(\hat{y}\_{i} \mid \psi^{(i)}) \, , \tag{5}$$

where the index *i* runs over all elements of the representation, through channels and through spatial locations, and *ψ*(*i*) is the distribution model parameter vector associated with each element. As mentioned previously, for the back-propagation derivation during the training step, the quantization process (**yˆ** = *Q*(**y**)) is approximated by the addition of an i.i.d. uniform noise Δ**y**, whose range is defined by the quantization step. Due to the adaptive local normalization performed by the GDN non-linearities, the quantization step can be set to one without loss of generality. Hence the quantized representation **yˆ**, which is a discrete random vector taking values in Z, is modelled by the continuous random vector **y˜** defined by:

$$
\tilde{\mathbf{y}} = \mathbf{y} + \Delta \mathbf{y} \tag{6}
$$

taking values in R. The addition of the uniform quantization noise leads to the following expression for the density of *y*˜*i* given *ψ*(*i*), defined through a convolution with a uniform distribution on the interval [−1/2, 1/2]:

$$p\_{\tilde{y}\_i|\psi^{(i)}}(\tilde{y}\_i) = p\_{y\_i|\psi^{(i)}}(y\_i) \* \mathcal{U}(-1/2, 1/2). \tag{7}$$

For generality, in [13], the distribution of *yi* given *ψ*(*i*) is assumed non-parametric, namely without a predefined shape. In [13,14], the parameter vectors are learned from data during the training phase. This learning, performed once and for all, prohibits adaptivity to the input images during the operational phase. Moreover, the simplifying hypothesis of a fully factorized distribution is very strong and is not satisfied in practice, since the elements of **yˆ** exhibit strong spatial dependencies, as observed in [16]. To overcome these limitations, and thus to obtain a more realistic and more adaptive entropy model, [16] proposed a hyperprior model, derived through a variational autoencoder, which takes into account possible spatial dependencies within each input image.

• Hyperprior model: Auxiliary random variables **z**˜, conditioned on which the elements of the quantized representation **y**˜ are independent, are derived from **y** by an auxiliary autoencoder connected in parallel with the bottleneck (right column of Figure 1 (right)). The hyper-parameters of this hierarchical model are learned for each input image in the operational phase. Firstly, the hyperprior analysis transform *Ha* produces the set of auxiliary random variables **z**. Secondly, **z** is transformed by the hyperprior synthesis transform *Hs* into a second set of random variables *σ*. In [16], the distribution of **z** is assumed fully factorized and each representation element *y*˜*i*, knowing **z**, is modeled by a zero-mean Gaussian distribution with its own standard deviation *σi*. Finally, taking into account the quantization process, the conditional distribution of each quantized representation element is given by:

$$\tilde{y}\_{i} \mid \tilde{\mathbf{z}} \sim \mathcal{N}\left(0, \sigma\_{i}^{\;2}\right) \* \mathcal{U}\left(-\frac{1}{2}, \frac{1}{2}\right). \tag{8}$$

The rate computation must take into account the prior distribution of **z**˜, which has to be transmitted to the decoder with the compressed data, as side information.
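As an illustration of how Equation (8) can be evaluated in practice, the sketch below computes the probability mass of an integer-quantized representation element under a zero-mean Gaussian convolved with a unit-width uniform, i.e., the Gaussian mass of the unit interval centred on the quantized value. The function names are hypothetical and this is only a minimal sketch of the principle, not the implementation of [16].

```python
import numpy as np
from scipy.stats import norm

def gaussian_quantized_pmf(y_hat, sigma):
    """Eq. (8): N(0, sigma^2) convolved with U(-1/2, 1/2), evaluated at the
    integer y_hat, i.e. the Gaussian mass of [y_hat - 1/2, y_hat + 1/2]."""
    upper = norm.cdf((y_hat + 0.5) / sigma)
    lower = norm.cdf((y_hat - 0.5) / sigma)
    return np.maximum(upper - lower, 1e-12)   # floored to avoid log(0)

def representation_rate_bits(y_hat, sigma):
    """Estimated bit cost of the quantized representation given the
    per-element standard deviations sigma produced by the hyperprior."""
    return float(np.sum(-np.log2(gaussian_quantized_pmf(y_hat, sigma))))
```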

In the following, we derive the complexity of the analysis and synthesis transforms involved both in the main and in the auxiliary autoencoders of [16]. We also perform a statistical analysis of the learned transform on a representative set of satellite images. The objective is to propose an entropy model that is simpler than the two previous ones while remaining adaptive, since it is more specifically tailored to the considered images.

#### **3. Reduced-Complexity Variational Autoencoder**

In the literature, the design of learned image compression frameworks hardly takes the computational complexity into account: the objective is merely to obtain the best performance in terms of rate-distortion trade-off. However, in the context of on board compression, a trade-off between performance and complexity also has to be considered to account for the strong computational constraints. Our focus here is to propose a reduced-complexity alternative to the state-of-the-art structure [16] while minimizing the impact on performance. Please note that this complexity reduction must be essentially targeted at the coding part of the framework, which is most subject to the on board constraints. A meaningful indicator of the complexity reduction is the number of network parameters. Indeed, its reduction has a positive impact not only on the memory complexity, but also on the time complexity and on the training difficulty, since the optimization problem involved in the training step applies to a smaller number of parameters. This is an advantage even if the training is performed on ground: convergence is obtained with fewer iterations and thus within a shorter time. Moreover, the complexity reduction also eases the upload of the final model to the spacecraft (once commissioning is finished and the training with real data completed), as the up-link transmission is severely limited.

#### *3.1. Analysis and Synthesis Transforms*

#### 3.1.1. Complexity Assessment

First, let us characterize the computational complexity of a convolutional layer composing the analysis and synthesis transforms. Let *Nin* denote the number of features at the input of the considered layer. In the particular case of the first layer of the network, *Nin* is the number of channels of the input image (*Nin* = 1 for a panchromatic image); otherwise, *Nin* is the number of filters of the previous layer. Let *Nout* denote the number of features at the output of this layer, i.e., the number of filters of this layer. As detailed in Section 2, in [16], *Nout* = *N* for each layer of the analysis and synthesis transforms, except for the last layer of the main autoencoder analysis transform and the last layer of the auxiliary autoencoder synthesis transform, which are composed of *M* filters with *M* > *N*; for these layers, *Nout* = *M*. As in [13,16], we consider square filters of size *n* × *n*. The number of parameters associated with the filtering part of the layer is:

$$\text{Param}^f = (n \times n \times N\_{in} + \delta) \times N\_{out}.\tag{9}$$

The term *δ* is equal to 1 when a bias is introduced and to 0 otherwise. Please note that this bias is rarely used in the considered architectures (except in Tconv3, as displayed in Figure 1). The filtering is applied to each input channel after downsampling (respectively upsampling). The downsampled (resp. upsampled) input channel is of size *sout* × *sout* with *sout* = *sin*/*D* (respectively *sout* = *sin* × *D*), where *D* denotes the downsampling (respectively upsampling) factor and *sin* × *sin* is the size of a feature at the filter input. The number of floating-point operations Operation*<sup>f</sup>* for the filtering is thus:

$$\text{Operation}^f = \text{Param}^f \times s\_{out} \times s\_{out} \,. \tag{10}$$

GDN/IGDN perform a normalization of a filter output with respect to the other filter outputs. According to Section 2, the number of parameters and the number of operations of each GDN/IGDN are expressed by:

$$\begin{aligned} \text{Param}^{g} &= (N\_{out} + 1) \times N\_{out} \\ \text{Operation}^{g} &= \text{Param}^{g} \times s\_{out} \times s\_{out} \end{aligned} \tag{11}$$
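The per-layer counts of Equations (9)–(11) are easy to script; the sketch below does so for a single layer, with the 5 × 5 kernel and the input size chosen purely for illustration.

```python
def layer_complexity(n, n_in, n_out, s_out, bias=False, gdn=True):
    """Parameter and operation counts of one convolutional layer with an
    optional GDN/IGDN, following Eqs. (9)-(11); s_out is the spatial size of
    an output feature map after down/upsampling."""
    delta = 1 if bias else 0
    param_f = (n * n * n_in + delta) * n_out      # Eq. (9)
    op_f = param_f * s_out * s_out                # Eq. (10)
    param_g = (n_out + 1) * n_out if gdn else 0   # Eq. (11)
    op_g = param_g * s_out * s_out                # Eq. (11)
    return param_f + param_g, op_f + op_g

# Illustrative first layer: panchromatic input (N_in = 1), N_out = 128 filters,
# hypothetical 5x5 kernels, downsampling factor D = 2 on a 512x512 image.
params, ops = layer_complexity(n=5, n_in=1, n_out=128, s_out=256)
print(f"{params} parameters, {ops} operations")
```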

Since the number of layers is already very low for the considered architectures, the reduction of the complexity of the analysis and synthesis transforms may target, according to the previous complexity assessment, the number of filters per layer, the size of these filters and the choice of the activation functions. Our proposal below details our strategy for complexity reduction.

#### 3.1.2. Proposal: Simplified Analysis and Synthesis Transforms

Our approach to reduce the complexity of the analysis and synthesis transforms, while maintaining an acceptable rate-distortion trade-off, is to fine-tune the parameters of each layer (number of filters and filter sizes) and to consider the replacement of GDN/IGDN by simpler non-parametric activation functions.

In a first step, we propose a fine-tuning of the number of filters composing the convolutional layers of the analysis and synthesis transforms. The state-of-the-art frameworks generally involve a large number of filters with the objective of increasing the network approximation capability [13,16]. However, a high number of filters also implies a high number of parameters and operations in the GDN/IGDN, as it increases the depth of the tensors (equal to the number of filters) at their input. Apart from making the training harder, a large number of filters comes with a high number of operations as well as a high memory complexity, which is problematic in the context of on board compression. In [16], the number of filters at the bottleneck (*M*) and in the other layers (*N*) is fixed according to the target bit-rate: the higher the target bit-rate, the higher *M* and *N*. Indeed, higher bit-rates mean lower distortion and thus require an increased network approximation capability to learn accurate analysis and synthesis transforms [16]. This principle also applies to the auxiliary autoencoder implementing the hyperprior, illustrated in Figure 1 (right column of the right part). In the present paper, we propose and evaluate a reduction of the number of filters in each layer for different target rate ranges. In particular, we investigate the impact of *M* on the attainable performance when imposing a drastic reduction of *N*. The question is whether there is a real need for a high global number of filters (high *N* and *M*) or whether a large bottleneck (low *N* and high *M*) is sufficient to achieve good performance at high rates. For that purpose, we impose a low value of *N* (typically *N* = 64) and we consider increasing values of *M*, defined by *M* = 2*N*, 3*N*, 4*N*, 5*N*, to determine the minimum value of *M* that leads to an acceptable performance in a given rate range.

In a second step, we investigate the replacement of the GDN/IGDN by non-parametric activation functions. As previously mentioned, according to [19], GDN/IGDN make it possible to obtain good performance even with a low number of layers. However, for the sake of completeness, we also test their replacement by ReLU functions. Finally, we propose to evaluate the effect of the filter kernel support. The main autoencoder in [16] is entirely composed of filters with kernel support *n* × *n*. The idea is then to test different kernel supports, namely (*n* − 2) × (*n* − 2) and (*n* + 2) × (*n* + 2).
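A minimal TensorFlow sketch of such a reduced analysis transform is given below. It assumes, for illustration, four stride-2 convolutional layers and ReLU activations in place of GDN (a GDN layer, e.g. from an external package such as tensorflow_compression, could be substituted); the function name, the layer count and the default kernel size are assumptions for the sketch, not a description of [16].

```python
import tensorflow as tf

def reduced_analysis_transform(N=64, M=128, kernel=5, activation="relu"):
    """Sketch of a reduced-complexity analysis transform G_a: a low number of
    filters N in the first layers and M filters at the bottleneck; the kernel
    support can be varied (e.g. n - 2, n, n + 2)."""
    layers = [tf.keras.layers.Conv2D(N, kernel, strides=2, padding="same",
                                     activation=activation, use_bias=False)
              for _ in range(3)]
    layers.append(tf.keras.layers.Conv2D(M, kernel, strides=2, padding="same",
                                         use_bias=False))   # bottleneck layer
    return tf.keras.Sequential(layers)

# Hypothetical exploration of the bottleneck size with a fixed, low N:
for M in (2 * 64, 3 * 64, 4 * 64, 5 * 64):
    g_a = reduced_analysis_transform(N=64, M=M)
    g_a.build(input_shape=(None, 256, 256, 1))
    print(M, g_a.count_params())
```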

#### *3.2. Reduced Complexity Entropy Model*

#### 3.2.1. Statistical Analysis of the Learned Transform

This section first presents a statistical analysis of each feature of the learned representation in the particular case of satellite images. A similar statistical analysis was conducted for natural images in [25] with the objective of properly designing the quantization in [13]. The probability density function related to each feature, averaged over a representative set of natural images, was estimated through a normalized histogram. The study showed that most features can be accurately modelled as Laplacian random variables. Interestingly, a similar result has also been analytically demonstrated in [26] for block-DCT coefficients of natural images, under the assumption that the variance is constant on each image block and that its values on the different blocks are distributed according to an exponential or a half-normal distribution. We conducted the statistical analysis on the representation obtained by the main autoencoder as defined in [16], with *N* = 128 and *M* = 192, but used alone, as in [13], without the auxiliary autoencoder. Indeed, on the one hand, the main autoencoder in [16] benefits from improvements with respect to the one in [13]; on the other hand, the auxiliary autoencoder is not necessary for this statistical study. This autoencoder is trained on a representative dataset of satellite images and for rates between 2.5 bits/pixel and 3 bits/pixel. First, as an illustration, let us consider the satellite image displayed in Figure 2. This image of the city of Cannes (French Riviera) is a 12-bit simulated panchromatic Pléiades image with size 512 × 512 and resolution 70 cm.

**Figure 2.** Simulated 12-bit Pléiades image of Cannes with size 512 × 512 and resolution 70 cm.

Figure 3 shows the first feature map (of size 32 × 32) derived from the Cannes image, together with its normalized histogram and the Laplacian fitting.

**Figure 3.** First feature of Cannes image representation, its normalized histogram with Laplacian fitting.

In this example, the fitting with an almost centered Laplacian seems appropriate. According to the Kolmogorov-Smirnov goodness-of-fit test [27], 94% of the features derived from this image follow a Laplacian distribution with a significance level *α* = 5%. Recall that the Laplacian distribution Laplace(*μ*, *b*) is defined by:

$$f(\zeta, \mu, b) = \frac{1}{2b} \exp\left( -\frac{|\zeta - \mu|}{b} \right) \text{ for } \zeta \in \mathbb{R}, \tag{12}$$

where *μ* is the mean value and *b* > 0 is a scale parameter related to the variance by *Var*(*ζ*) = 2*b*<sup>2</sup>.
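The fitting and goodness-of-fit procedure used above can be reproduced with standard tools; the sketch below is one possible implementation with SciPy, not the exact procedure used for the reported figures (note in particular that applying the KS test with parameters estimated from the same data is an approximation).

```python
import numpy as np
from scipy import stats

def laplace_fit_and_test(feature_map, alpha=0.05):
    """Fit Laplace(mu, b) (Eq. 12) to one feature map by maximum likelihood
    and assess the fit with a Kolmogorov-Smirnov test at level alpha."""
    values = np.asarray(feature_map, dtype=np.float64).ravel()
    mu, b = stats.laplace.fit(values)
    _, p_value = stats.kstest(values, "laplace", args=(mu, b))
    return mu, b, p_value > alpha   # True: the Laplacian fit is not rejected
```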

To extend this result, the representation (composed of *M* = 192 feature maps) was derived for 16 simulated 512 × 512 Pléiades images. Figure 4 shows the normalized histograms (derived from the 16 observations) of some feature maps (each of size 32 × 32) and the Laplacian fitting.

**Figure 4.** Normalized histogram of the *i*th feature map and Laplacian fitting *f*(., *μ*, *b*).

Most of the feature maps have a similar normalized histogram. According to the Kolmogorov-Smirnov goodness-of-fit test [27], 94% of the features follow a Laplacian distribution with a significance level *α* = 5%. Please note that the Gaussian distribution N(*μ*, *σ*<sup>2</sup>) with a small value of *μ* also fits the features, albeit to a lesser extent. The remaining non-Laplacian feature maps (6% of the maps for this example) stay close to the Laplacian distribution. Figure 5 displays two representative examples of non-Laplacian feature maps. Please note that the first one (19th feature map) is far from Laplacian: this feature appears as a low-pass approximation of the input image. However, this is a very particular case: the second displayed feature has a typical behavior, not so far from a Laplacian distribution.

(**a**) 19th feature. (**b**) Normalized histogram and Laplacian fitting.

(**c**) 55th feature. (**d**) Normalized histogram and Laplacian fitting.

**Figure 5.** Normalized histogram of the *i*th feature map and Laplacian fitting *f*(., *μ*, *b*).

#### 3.2.2. Proposal: Simplified Entropy Model

The entropy model simplification aims at achieving a compromise between simplicity and performance while preserving the adaptability to the input image. In [13], the representation distribution is assumed fully factorized and the statistical model for each feature is non-parametric, to avoid a prior choice of a distribution shape. This model is learned once, during the training. In [16], the strong independence assumption leading to a fully factorized model is avoided by the introduction of the hyperprior distribution, whose parameters are learned for each input image even in the operational phase. Both models are general and thus suitable for a wide variety of images; however, the first one implies a strong hypothesis of independence and prohibits adaptivity, while the second one is computationally expensive. Based on the previous analysis, we propose the following parametric model for each of the *M* features. Consider the elements *yij* of the *j*th feature, for *ij* ∈ *Ij*, where *Ij* denotes the set of indexes covering this feature:

$$\begin{aligned} y\_{i\_j} &\sim \text{Laplace}(0, b\_j) \ \left(\text{resp. } y\_{i\_j} \sim \mathcal{N}(0, \sigma\_j^{\,2})\right)\\ \text{with: } b\_j &= \sqrt{Var(y\_{i\_j})/2} \ \left(\text{resp. } \sigma\_j^{\,2} = Var(y\_{i\_j})\right).\end{aligned} \tag{13}$$

The problem then boils down to the estimation of a single parameter per feature, referred to as the scale *bj* (respectively the standard deviation *σj*) in the case of the Laplacian (resp. Gaussian) distribution. Starting from [16], this proposal reduces the complexity at two levels. First, the hyperprior autoencoder, including the analysis *Ha* and synthesis *Hs* transforms, is removed. Second, the side information, initially composed of the compressed auxiliary random variable set (**z**) of size 8 × 8 × *M*, now reduces to an *M* × 1 vector of variances. The auxiliary network simplification is displayed on the right part of Figure 6.

**Figure 6.** Proposed architecture after entropy model simplification: main autoencoder (**left column**) and simplified auxiliary autoencoder (**right column**).
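A minimal sketch of the proposed per-feature model of Equation (13) is given below, for the Laplacian case: the single scale parameter is estimated from the feature itself and the bit cost of the quantized elements follows from the Laplacian mass of unit-width intervals. The function name and the choice of estimating the scale from the quantized values are assumptions of the sketch.

```python
import numpy as np
from scipy.stats import laplace

def feature_rate_bits(y_hat_j):
    """Bit cost of the j-th quantized feature map under Eq. (13): a zero-mean
    Laplacian whose scale b_j (the only side information, one value per
    feature) is estimated from the feature variance, Var = 2 b^2."""
    b_j = np.sqrt(np.var(y_hat_j) / 2.0)
    upper = laplace.cdf(y_hat_j + 0.5, loc=0.0, scale=b_j)
    lower = laplace.cdf(y_hat_j - 0.5, loc=0.0, scale=b_j)
    p = np.maximum(upper - lower, 1e-12)          # floored to avoid log(0)
    return float(np.sum(-np.log2(p))), b_j
```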

In the next section, the performance of these proposals and of their combinations is studied for different rate ranges.

#### **4. Performance Analysis**

This section assesses the performance, in terms of rate-distortion, of the proposed method in comparison with the CCSDS 122.0-B [6], JPEG2000 [5] and with the reference methods [13,16]. Beforehand, a subjective image quality assessment is proposed. Although informal, it allows comparing the artefacts produced by learned compression and by the CCSDS 122.0-B [6].

#### *4.1. Implementation Setup*

To assess the relevance of the proposed complexity reductions, experiments were conducted using TensorFlow. The batch size (i.e., the number of training samples to work through before the parameters are updated) was set to 8 and up to 1M iterations were performed. Both training and validation datasets are composed of simulated 12-bit Pléiades panchromatic images provided by the CNES, covering various landscapes (i.e., desert, water, forest, industrial, cloud, port, rural, urban). The reference learned frameworks for image compression are designed to handle 8-bit RGB natural images and they generally target low rates (typically up to a maximum of 1.5 bits/pixel). In contrast, on board satellite compression handles 12-bit panchromatic images and targets higher rates (from 2 bits/pixel to 3.5 bits/pixel). The training dataset is composed of 8M patches (of size 256 × 256) randomly cropped from 112 images (of size 585 × 585). The validation dataset is composed of 16 images (of size 512 × 512). The MSE was used as the distortion metric for training. The rate and distortion measurements were averaged across the validation dataset for a given value of *λ*. Please note that the value of *λ* has to be set by trial and error for a targeted rate range.
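For illustration, a hedged sketch of one training step implementing the rate-distortion loss of Equation (3) with the additive-noise relaxation of Section 2.3.1 is shown below; the module names (analysis, synthesis, entropy_model) are placeholders, not the actual code used in these experiments.

```python
import tensorflow as tf

def train_step(x, analysis, synthesis, entropy_model, optimizer, lam):
    """One gradient step on J = lam * MSE + rate (Eq. 3), with additive
    uniform noise standing in for quantization during training."""
    with tf.GradientTape() as tape:
        y = analysis(x)
        y_tilde = y + tf.random.uniform(tf.shape(y), minval=-0.5, maxval=0.5)
        x_hat = synthesis(y_tilde)
        bits = entropy_model(y_tilde)          # estimated bit cost of the batch
        num_pixels = tf.cast(tf.reduce_prod(tf.shape(x)[:3]), tf.float32)
        rate = bits / num_pixels               # bits per pixel
        mse = tf.reduce_mean(tf.square(x - x_hat))
        loss = lam * mse + rate
    variables = (analysis.trainable_variables + synthesis.trainable_variables
                 + entropy_model.trainable_variables)
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss, rate, mse
```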

In addition to the MSE, we also evaluate the results in terms of MS-SSIM. Please note that they exhibit a similar behavior even though the models were trained for the MSE only. The proposed framework is compared with the CCSDS 122.0-B [6], JPEG2000 [5] and with the reference methods [13,16], implemented with the values of *N* and *M* recommended by their authors for particular rate ranges.


#### *4.2. Subjective Image Quality Assessment*

At low rates, JPEG2000 is known to produce quite prominent blurring and ringing artifacts, which are particularly visible in high-frequency textures [5]. This is also the case for the CCSDS 122.0-B [6]. Figure 7a,b shows the original image of the city of Blagnac, which is a 12-bit simulated panchromatic Pléiades image with size 512 × 512 and resolution 70 cm. The figure also shows the image compressed by the CCSDS 122.0 (c) and the image compressed by the reference learned compression architecture Ballé(2017)–non-parametric-N (d), for a low rate (1.15 bits/pixel). The image obtained through learned compression appears closer to the original one than the image obtained through the CCSDS.

(**a**) Original image. (**b**) Zoom on the original image.

(**c**) Zoom on the CCSDS compressed image. (**d**) Zoom on the end-to-end compressed image.

**Figure 7.** Subjective image quality analysis (R = 1.15 bits/pixel).

For medium luminances, far fewer artifacts, such as blurring and flattening effects, are observed. In particular, the building edges are sharper. The same is true for low luminances, corresponding to shaded areas: the ground markings are sharp and fewer flattened areas are observed. As shown in Figure 8, for a higher rate of 1.66 bits/pixel, the two reconstructed images are very close. However, the image obtained through learned compression remains closer to the original one, especially in shaded areas.

(**a**) Original image. (**b**) Zoom on the original image.

(**c**) Zoom on the CCSDS compressed image. (**d**) Zoom on the end-to-end compressed image.

**Figure 8.** Subjective image quality analysis (R = 1.66 bits/pixel).

Finally, as shown in Figure 9, for an even higher rate of 2.02 bits/pixel, the CCSDS 122.0 still produces flattening effects, especially in low-variance areas. The learned compression method leads to a reconstructed image that is closer to the original one. Even though the stadium ground markings slightly differ from the original, the image quality is overall preserved.

(**a**) Original image. (**b**) Zoom on the original image.

(**c**) Zoom on the CCSDS compressed image. (**d**) Zoom on the end-to-end compressed image.

**Figure 9.** Subjective image quality analysis—R = 2.02 bits/pixel.

Overall, for various rates, the learned compression method [13] does not suffer from the troublesome artifacts induced by the CCSDS 122.0, leading to a more uniform image quality. The same behavior was observed for the proposed method.

In the following, an objective performance analysis is carried out in terms of rate-distortion trade-off for the CCSDS 122.0-B [6], JPEG2000 [5], the reference methods [13,16] and the proposed methods.
