#### *4.3. Impact of the Reduction of the Number of Filters*

#### 4.3.1. At Low Rates

We first consider architectures devoted to low rates, say up to 2 bits/pixel. Starting from [16], denoted Ballé(2018)-hyperprior-N128-M192, the number of filters *N* is reduced from *N* = 128 to *N* = 64 in all layers except the one just before the bottleneck, for which *M* = 192 is kept. This reduction is applied jointly to the main autoencoder and to the hyperprior one. The resulting simplified architecture is termed Ballé(2018)-s-hyperprior-N64-M192. Its complexity is evaluated in terms of number of parameters and of floating-point operations per pixel (FLOPp) in Table 1.
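To make the parameterization concrete, the sketch below (PyTorch) shows how *N* and *M* control the analysis transform. It is a hypothetical illustration assuming four stride-2 5 × 5 convolutional layers, a 3-channel input, and ReLU standing in for the GDN non-linearities discussed in Section 4.5; it is not the authors' implementation.

```python
import torch.nn as nn

def analysis_transform(N: int = 64, M: int = 192, in_ch: int = 3) -> nn.Sequential:
    """Four stride-2 5x5 convolutions: N filters in every layer except the
    last one, which outputs the M channels feeding the bottleneck."""
    return nn.Sequential(
        nn.Conv2d(in_ch, N, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(N, N, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(N, N, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(N, M, kernel_size=5, stride=2, padding=2),  # M channels before the bottleneck
    )

g_a_ref        = analysis_transform(N=128, M=192)  # Ballé(2018)-hyperprior-N128-M192
g_a_simplified = analysis_transform(N=64,  M=192)  # Ballé(2018)-s-hyperprior-N64-M192
```

The same reduction of *N* applies to the synthesis transform and to the hyperprior autoencoder.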


**Table 1.** Detailed complexity of Ballé(2018)-s-hyperprior-N64-M192.

Table 2 compares the complexity of Ballé(2018)-s-hyperprior-N64-M192 to the reference Ballé(2018)-hyperprior-N128-M192.

**Table 2.** Comparative complexity of the global architectures: case of target rates up to 2 bits/pixel.
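The parameter and FLOPp counts reported in Tables 1 and 2 can be approximated from the layer dimensions alone. The helper below is a rough sketch: it assumes the four stride-2 5 × 5 layers of the analysis transform, counts one multiply-accumulate as two floating-point operations, and ignores the non-linearities and the entropy model, so its figures only approximate the tabulated ones.

```python
def conv_cost(k: int, c_in: int, c_out: int, downscale: int):
    """Parameters and FLOPs-per-input-pixel of one k x k convolution whose
    output grid is downscaled by `downscale` in each spatial dimension."""
    params = k * k * c_in * c_out + c_out             # weights + biases
    flopp = 2 * k * k * c_in * c_out / downscale**2   # 2 FLOPs per MAC, per input pixel
    return params, flopp

def transform_cost(N: int, M: int, k: int = 5, in_ch: int = 3):
    channels = [in_ch, N, N, N, M]                    # four stride-2 layers
    costs = [conv_cost(k, channels[i], channels[i + 1], 2 ** (i + 1)) for i in range(4)]
    return tuple(map(sum, zip(*costs)))

for N, M in [(128, 192), (64, 192)]:
    params, flopp = transform_cost(N, M)
    print(f"N={N:3d}, M={M}: {params / 1e6:.2f} M parameters, {flopp / 1e3:.1f} kFLOPp")
```

The quadratic dependence of both counts on the number of filters explains why halving *N* removes roughly three quarters of the cost of the inner layers.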


The proposed simplified architecture is 73% less complex in terms of FLOPp than the reference method. Let us now consider the impact of the reduction of *N* on compression performance. Figure 10 displays the performance, in terms of MSE and MS-SSIM, of our different proposed solutions, of the reference learned methods, and of the JPEG2000 [5] and CCSDS 122.0-B [6] standards. The gray curve portions indicate that the corresponding values of *N* and *M* are not recommended for this rate range (above 2 bits/pixel). In this first experiment, we are mainly concerned with the comparison of the blue lines: the solid one for the reference method Ballé(2018)-hyperprior-N128-M192 and the dashed one for the proposed Ballé(2018)-s-hyperprior-N64-M192.

**Figure 10.** Rate-distortion curves for the considered learned frameworks and for the CCSDS 122.0-B [6] and JPEG2000 [5] standards, in terms of MSE and MS-SSIM (dB) (derived as −10 log10(1−MS-SSIM)): case of rates up to 2 bits/pixel.

As expected, Ballé(2018)-s-hyperprior-N64-M192 achieves a rate-distortion performance close to that of Ballé(2018)-hyperprior-N128-M192 [16], both in terms of MSE and MS-SSIM, for rates up to 2 bits/pixel. We can conclude that the decrease in performance is very small, keeping in mind the huge complexity reduction. Note also that our proposal far outperforms the CCSDS 122.0-B [6] and JPEG2000 [5] standards as well as Ballé(2017)-non-parametric-N192 [13].

#### 4.3.2. At High Rates

Let us now consider the architectures devoted to higher rates, say above 2 bits/pixel. For such rates, the reference architectures involve a high number of filters (*N* = 256 in [13]; *N* = 192 and *M* = 320 in [16]). Starting from [16], we reduced the number of filters to *N* = 64 in all layers except the one before the bottleneck, keeping *M* = 320. The proposed Ballé(2018)-s-hyperprior-N64-M320 is compared to the reference Ballé(2018)-hyperprior-N192-M320, but also to Ballé(2017)-non-parametric-N256 and to the JPEG2000 [5] and CCSDS 122.0-B [6] standards. Table 3 compares the complexity of Ballé(2018)-s-hyperprior-N64-M320 to that of the reference Ballé(2018)-hyperprior-N192-M320.
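In terms of the illustrative builder sketched in Section 4.3.1, only the two constructor arguments change with respect to the low-rate setting:

```python
# Hypothetical instantiation, reusing analysis_transform from Section 4.3.1:
g_a_high_ref        = analysis_transform(N=192, M=320)  # Ballé(2018)-hyperprior-N192-M320
g_a_high_simplified = analysis_transform(N=64,  M=320)  # Ballé(2018)-s-hyperprior-N64-M320
```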

**Table 3.** Comparative complexity of the global architectures: case of target rates above 2 bits/pixel.


The proposed simplified architecture is 87% less complex in terms of FLOPp than the reference method. Let us now consider the impact of the reduction of *N* on compression performance. Figure 11 displays the performance, in terms of MSE only, of our different proposed solutions, of the reference learned methods, and of the JPEG2000 and CCSDS standards; the MS-SSIM curves show the same behaviour.

**Figure 11.** Rate-distortion curves at higher rates for the learned frameworks and for the CCSDS 122.0-B [6] and JPEG2000 [5] standards, in terms of MSE in log-log scale: case of high rates (above 2 bits/pixel).

These curves show that the simplified architectures (i.e., resulting from a decrease of *N*) far outperform the JPEG2000 and CCSDS standards even at high rates, while showing only a small decrease in performance with respect to the reference architectures. Note, however, that for both the reference (Ballé(2018)-hyperprior-N192-M320) and the simplified (Ballé(2018)-s-hyperprior-N64-M320) variational models, a training of 1M iterations seems insufficient for the highest rates. Indeed, due to the auxiliary autoencoder implementing the hyperprior, the training conceivably has to be longer, which can be a disadvantage in practice. This may be an additional argument in favor of a simplified entropy model.

#### 4.3.3. Summary

As an intermediate conclusion, for either low or high bit rates, a drastic reduction of *N*, starting from the reference architecture [16], does not significantly decrease the performance, in terms of either MSE or MS-SSIM, while it leads to a complexity decrease of more than 70%. These results are interesting since it was suggested in [13,16,19] that structures of reduced complexity would not be able to perform well at high rates.

#### *4.4. Impact of the Bottleneck Size*

As previously highlighted, the bottleneck size (*M*) plays a key role in the performance of the considered architectures. Thus, we now fix a low value of *N* (*N* = 64) and vary the bottleneck size (*M* = 128, 192, 256 and 320). This experiment, performed on the proposed architecture integrating the simplified entropy model, Ballé(2018)-s-laplacian-N64-M, quantifies the impact of *M* on the performance in terms of both MSE and MS-SSIM for increasing values of the target rate. Figure 12 shows the rate-distortion performance averaged over the validation dataset. According to the literature, high bit rates require a large overall number of filters [16].


**Figure 12.** Impact of the bottleneck size in terms of MSE and MS-SSIM (dB) (derived as −10 log10(1−MS-SSIM)).

Figure 12 shows that increasing the bottleneck size *M* only, while keeping *N* very small, allows maintaining the performance as the rate increases. As displayed in Figure 12, when the performance saturates for a given bottleneck size, it can be revived by increasing *M* only. This result is consistent, since the number of output channels (*M*) just before the bottleneck corresponds to the number of features that must be compressed and transmitted: it therefore makes sense to produce more features at high rates for a better reconstruction of the compressed images. Interestingly, this figure allows establishing in advance the convolutional layer dimensions (*N* and *M*) for a given rate range, taking complexity concerns into account, as illustrated below.
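For illustration, the sweep of Figure 12 can be mimicked with the hypothetical builder of Section 4.3.1, which also makes the modest parameter cost of a larger *M* visible (training and evaluation are of course omitted):

```python
# Bottleneck-size sweep at fixed N = 64, as in Figure 12 (illustrative only).
for M in (128, 192, 256, 320):
    g_a = analysis_transform(N=64, M=M)
    n_params = sum(p.numel() for p in g_a.parameters())
    print(f"M={M}: {n_params / 1e6:.2f} M parameters in the analysis transform")
```

Since *M* only enters the last layer, enlarging it increases the cost far less than enlarging *N*, which affects every layer.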

#### *4.5. Impact of the GDN/IGDN Replacement in the Main Autoencoder*

The original architecture Ballé(2018)-hyperprior-N128-M192 of [16], involving GDN/IGDN non-linearities, is compared with the architecture obtained after a full replacement by ReLU, except for the last layer of the decoder. Indeed, this layer involves a sigmoid activation function that constrains the reconstructed pixel values to the interval [0, 1] before quantization. Figure 13 shows the rate-distortion performance averaged over the validation dataset in terms of both MSE and MS-SSIM.
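For reference, a GDN layer can be written compactly as a 1 × 1 convolution over squared activations, following the definition y_i = x_i / (β_i + Σ_j γ_ij x_j²)^(1/2). The sketch below illustrates that formula; it is not the implementation of [16], and it omits the reparameterizations used in practice to keep β and γ in their valid ranges. IGDN multiplies by the normalization term instead of dividing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDN(nn.Module):
    """Minimal GDN sketch: y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2)."""
    def __init__(self, channels: int):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(channels))
        self.gamma = nn.Parameter(0.1 * torch.eye(channels))

    def forward(self, x):
        # A 1x1 convolution of x^2 with gamma computes sum_j gamma_ij * x_j^2.
        weight = self.gamma.view(*self.gamma.shape, 1, 1)
        norm = F.conv2d(x * x, weight, bias=self.beta)
        return x * torch.rsqrt(norm.clamp_min(1e-10))
```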

As claimed in [19], GDN/IGDN perform better than ReLU at all rates, and especially at high rates. Although GDN/IGDN increase the number of parameters to be learned and stored, as well as the number of FLOPp, this increase represents, on the one hand, only a small percentage of the overall structure with respect to conventional non-linearities [19]; on the other hand, GDN/IGDN lead to a dramatic performance boost. In view of these considerations, the complexity reduction in this paper does not target the GDN/IGDN. However, their replacement by simpler activation functions can be envisioned in future work to take on-board hardware requirements into account.


**Figure 13.** Impact of the GDN/IGDN replacement and of the filter kernel support on performance in terms of MSE and MS-SSIM (dB) (derived as −10 log10(1−MS-SSIM)).

#### *4.6. Impact of the Filter Kernel Support in the Main Autoencoder*

The original architecture Ballé(2018)-hyperprior-N128-M192 of [16] is also compared with variants in which the 5 × 5 filters composing the convolutional layers of the main autoencoder are replaced by 3 × 3 and 7 × 7 filters. It is worth mentioning that all the variant architectures considered in this part share the same entropy model, obtained through the same auxiliary autoencoder in terms of number of filters and kernel supports, since the objective here is not to assess the impact of the entropy model. According to Figure 13, a kernel support reduction from 5 × 5 to 3 × 3 leads to a performance decrease. This result is expected, in the sense that filters with a smaller kernel support have a reduced approximation capability. On the other hand, a kernel support increase from 5 × 5 to 7 × 7 does not lead to a significant performance improvement. This result indicates that the approximation capability obtained with a 5 × 5 kernel support is sufficient.
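Since both the parameter count and the FLOPp of a convolutional layer grow with the square of the kernel support (cf. the counting helper of Section 4.3.1), this sufficiency has a direct complexity reading; a quick back-of-the-envelope comparison:

```python
# Relative per-layer cost of each tested kernel support (all else equal).
for k in (3, 5, 7):
    print(f"{k}x{k}: {100 * k**2 / 5**2:.0f}% of the 5x5 cost")
# -> 3x3: 36%, 5x5: 100%, 7x7: 196%
```

In other words, the 3 × 3 variant trades a 64% per-layer cost reduction for the observed performance loss, while the 7 × 7 variant nearly doubles the cost for no significant gain.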
