#### *3.2. Reconstruction*

The BigBiGAN neural network was trained for 200,000 steps with batches of 32 randomly selected patches from the training set. The model was saved at every reconstruction period, which occurred every 1000 steps. During each period, patches from the validation set were fed, at inference time, to the encoder and generator to measure their ability to create artificial samples and to produce an approximately invertible encoding in terms of spatial features. Three types of metrics were calculated for each saved model to evaluate the reconstruction quality: pixel-wise mean absolute error (MAE) of image values normalized between –1 and 1, Fréchet inception distance (FID) [52] computed with a pre-trained InceptionV3 model, and a perceptual evaluation similar to that presented in the Human Eye Perceptual Evaluation (HYPE) paper [53]. An MAE above 0.5 was used to discard low-quality models that were not able to effectively reconstruct input images in the early stages of training. Then, the FID values of all preserved models were compared and the 20 with the best (lowest) scores were selected. The average FID score was 86.36 ± 7.28, in contrast to the state-of-the-art BigBiGAN baseline FID of 31.19 ± 0.37.
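The MAE gate described above can be sketched as follows; `pixelwise_mae` and `passes_mae_filter` are hypothetical helper names, and the images are assumed to be NumPy arrays already normalized to [–1, 1]:

```python
import numpy as np

def pixelwise_mae(real, fake):
    """Pixel-wise mean absolute error for images normalized to [-1, 1]."""
    return float(np.mean(np.abs(real - fake)))

def passes_mae_filter(real_batch, fake_batch, threshold=0.5):
    """Keep only checkpoints whose reconstructions stay at or under the threshold."""
    return pixelwise_mae(real_batch, fake_batch) <= threshold

# Toy check: constant images 0.5 apart give an MAE of exactly 0.5.
a = np.full((4, 256, 256, 3), 0.25)
b = np.full((4, 256, 256, 3), -0.25)
print(pixelwise_mae(a, b))      # 0.5
print(passes_mae_filter(a, b))  # True
```

Checkpoints failing this gate are dropped before the more expensive FID comparison.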

The final model was selected by comparing the results of a human evaluation of 21 arbitrarily chosen samples from the validation dataset with their reconstructed counterparts created by the network for each model. The human reader was asked to assess whether each of the 42 images was real or artificial. This last verification phase resulted in selecting the model from the 170th reconstruction period, which yielded the lowest accuracy during human perceptual evaluation (accuracy: 59.5%, F-score: 0.6663). Samples and their reconstruction results are presented in Figure 6.

The overall quality of the reconstruction was assessed as sufficient during both quantitative and qualitative verification. For the selected model evaluated on non-scaled images (pixel values between 0 and 255), MAE was 27.213, the structural similarity index (SSIM) [54] was 0.942, and the peak signal-to-noise ratio (PSNR) [55] was 42.731. From the analysis of the human reader's misclassifications, it was clear that the chosen model is exceptionally good at reproducing areas such as forests, land abandonment, and farmlands. The characteristic spatial features are preserved after encoding. Shadows cast by trees are consistent and natural. In the majority of cases, artificial and real images are indistinguishable. Mediocre results were achieved for urbanized areas. Reconstructed roads keep their linear character and surface type information. Although the model is capable of generating buildings, due to the high variety of housing types present in the research area and possible undersampling, the results are far from realistic. Interestingly, the link between residential areas and roads was maintained in multiple samples. Unfortunately, the generator is not capable of producing samples that contain water areas such as rivers or lakes. Of all the analyzed images from the training and validation sets, only a few presented water, which indicates weak encoding capabilities; furthermore, all of them were significantly disrupted. The authors confirmed that this is related to undersampling and the insufficient information present in the RGB orthophoto. To tackle this issue, access to rich, multispectral imagery or a digital terrain model (DTM) is required, or the model itself needs to be enriched to utilize additional class embeddings that could be derived from existing thematic maps or projects such as Geoportal TBD [56].
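The non-scaled evaluation metrics can be reproduced with plain NumPy; the helper names below are illustrative, and SSIM is omitted here because it requires a windowed structural comparison (e.g., `skimage.metrics.structural_similarity`) rather than a one-line formula:

```python
import numpy as np

def mae_255(gt, rec):
    """Pixel-wise MAE on non-scaled (0-255) images."""
    return float(np.mean(np.abs(gt.astype(np.float64) - rec.astype(np.float64))))

def psnr_255(gt, rec, data_range=255.0):
    """Peak signal-to-noise ratio in dB for 0-255 images."""
    mse = np.mean((gt.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float(20.0 * np.log10(data_range) - 10.0 * np.log10(mse))

# Toy example: a uniform offset of 16 grey levels.
gt = np.zeros((256, 256, 3), dtype=np.uint8)
rec = np.full((256, 256, 3), 16, dtype=np.uint8)
print(mae_255(gt, rec))   # 16.0
print(psnr_255(gt, rec))  # ~24.05 dB
```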

**Figure 6.** Reconstruction result of 21 validation samples. Ground truth is represented by real tile images placed on the left. Images on the right were reconstructed by the generator from real images latent codes acquired through the encoder.

#### *3.3. Feature Engineering*

The BigBiGAN encoder possesses an interesting capability: it shifts the input image into the latent space constructed during network training. The encoding, a 120-dimensional vector, should be considered simultaneously a compressed version of the input orthophoto and a recipe for generating an artificial image that is similar in terms of spatial features. The latter phenomenon is called representation learning. Importantly, due to the nature of the latent space, similar data points, i.e., those encoded from similar images, are closer to each other. This opens an interesting possibility: understanding the structural similarity between images by performing the analysis not on the raw image input but only on the latent codes.

In the research, the authors utilized the trained encoder to perform inference on a set of 256 px × 256 px test patches (see Figure 7). The 1224 test patches were converted into their latent space codes and represented as a geopandas [57] data frame containing 1224 rows, 120 encoding value columns, an identifier, and a geometry column. Afterward, distance weights between patch centroids were calculated utilizing the *k*-NN algorithm [58]. The data frame and distance weights served as input parameters to the agglomerative clustering algorithm. Figure 8 represents the results for a specified number of clusters.
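The clustering step might be sketched as below with scikit-learn; the function and variable names are illustrative, plain NumPy arrays stand in for the geopandas data frame, and synthetic latent codes stand in for real encoder output:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

def cluster_latent_patches(latents, centroids, n_clusters=6, k=8):
    """Agglomerative clustering of patch latent codes, constrained by a
    k-NN connectivity graph built from the patch centroid coordinates."""
    connectivity = kneighbors_graph(centroids, n_neighbors=k, include_self=False)
    model = AgglomerativeClustering(n_clusters=n_clusters, connectivity=connectivity)
    return model.fit_predict(latents)

# Synthetic stand-in: a 10 x 10 patch grid whose left and right halves
# have clearly different 120-dimensional latent codes.
rng = np.random.default_rng(0)
centroids = np.array([(x, y) for x in range(10) for y in range(10)], dtype=float)
latents = rng.normal(0.0, 0.01, size=(100, 120))
latents[centroids[:, 0] >= 5, 0] += 5.0  # shift the right half in latent space
labels = cluster_latent_patches(latents, centroids, n_clusters=2)
```

The connectivity constraint keeps merges spatially contiguous, so clusters form coherent regions rather than scattered patches.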

**Figure 7.** Test area 72961\_840430\_M-34-40-B-a-2-3 [42].

Simultaneously, ground truth segmentation masks were prepared by manually dividing the test image into a fixed number of regions. For numbers of clusters between 2 and 10, there was an average patch-wise difference of 17.97% ± 8.7% between the ground truth and the unsupervised approach results. The larger the number of predicted clusters, the larger the difference. Figure 9 presents the best result, acquired for six clusters, where the unsupervised approach misclassified 6% of the patches.
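One way to compute such a patch-wise agreement, assuming arbitrary cluster ids on the unsupervised side, is to take the best accuracy over all relabelings of the predicted clusters (`patchwise_agreement` is a hypothetical helper; the brute-force search is only practical for small cluster counts such as the 2-10 range used here):

```python
import numpy as np
from itertools import permutations

def patchwise_agreement(pred, truth, n_clusters):
    """Best patch-wise accuracy over all relabelings of the predicted
    clusters, since unsupervised cluster ids need not match ground truth ids."""
    best = 0.0
    for perm in permutations(range(n_clusters)):
        remapped = np.array([perm[p] for p in pred])
        best = max(best, float(np.mean(remapped == truth)))
    return best

pred = np.array([0, 0, 1, 1, 2, 2])
truth = np.array([2, 2, 0, 0, 1, 1])  # identical partition, different ids
print(patchwise_agreement(pred, truth, 3))  # 1.0
```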

**Figure 8.** Agglomerative clustering of 72961\_840430\_M-34-40-B-a-2-3 sample encoded patches using different cluster numbers.

**Figure 9.** Clustering results of sample orthophoto (72961\_840430\_M-34-40-B-a-2-3 [42]) patches for fixed number of clusters (*n* clusters = 6). Each color represents a different cluster. Filled squares are the result of latent space clustering. The line pattern indicates the difference between the latent space clustering results and ground truth prepared by manual annotation.

## **4. Discussion**

Utilizing a neural network as a key element of a feature engineering pipeline is a promising idea. The concept of learning the internal representation of data is not new and was extensively studied after the introduction of autoencoders (AE) [59]. Unlike regular autoencoders, bidirectional GANs do not make assumptions about the structure or distribution of the data, making them agnostic to the data's domain [38]. This makes them perfectly suited for use beyond working with RGB images and opens the opportunity to apply them in remote sensing, where processing hyperspectral imagery is a standard use case.

One of the main challenges when utilizing a GAN is determining how large a dataset is needed to feed the network to obtain the required result. The performance of the generator and, therefore, the overall quality of the reconstruction process and the network's encoding capabilities are tightly coupled with the input data. To properly encode an image, BigBiGAN needs to learn different types of spatial features and discover how they interact with each other. In the early stages of the research, we identified that the size of the dataset had a positive influence on reconstruction quality. We initially worked with around 10% of the final dataset in order to rapidly prototype the solution. The results were not satisfying, i.e., we were not able to produce artificial samples that resembled the ground truth data. This prompted us to gradually increase the dataset's size. The authors are far from estimating the correct dataset size that could yield the best possible result for a specific research area. We are sure that addressing this issue will be important in the future development of this method.

Measuring the training progression of generative models remains problematic. The standard approach of monitoring the loss value during training and validation is not applicable because all GAN components interact with each other, and the loss value is calculated against a specific point in time during the training process and is, therefore, ephemeral and incomparable with previous epochs. There are multiple ways of controlling how the training should progress, e.g., by using Wasserstein loss [60], applying a gradient penalty [61], or spectral normalization [62]. Nevertheless, it is difficult to state clearly what loss value identifies a perfectly trained network. Furthermore, applying GANs to problems within the remote sensing domain is still a novelty. It is difficult to find references in the scientific literature or open-source projects that could help determine the proper course of model training.

Although nontrivial, measuring the quality of a bidirectional GAN's image reconstruction capabilities seems to be a valid approach to model quality assurance. An encoder, by design, always yields a result; this is just as true for a state-of-the-art model as for its poorly trained counterparts. Encoder output cannot be directly interpreted, which makes it hard to evaluate its quality. The generator, on the other hand, produces a visible result that can be measured. According to the assumptions of bidirectional models, the encoding and decoding process should, to some extent, be reversible [38]. Hence, the artificially produced image should resemble, in terms of features, its reconstruction origin, i.e., the real image whose latent code was used to create the artificial sample. In other words, checking the generator's output for strictly defined latent codes determines the quality of the entire GAN.

A naive method of verifying the degree to which a generated orthophoto image looks realistic would be to directly compare it to its reconstruction origin. Pixel-wise mean absolute error (MAE) or a similar metric can give researchers insight, to a limited extent, into the quality of the produced samples. Unfortunately, this technique only allows getting rid of obvious errors such as a significant mistake in the overall color of the land cover. This is because MAE does not promote textural and structural correctness, which may lead to poor diagnostic quality in some conditions [63]. A similar problem arises when using PSNR. To some extent, SSIM addresses the issue of measuring absolute errors by analyzing structural information. On the other hand, this method does not take into account the location of spatial features. The BigBiGAN reconstruction process preserves only the features and their interactions, not their specific placement in the analyzed image. The inception score (IS) and Fréchet inception distance (FID) address this problem by measuring the quality of the artificial sample through scoring the GAN's capability to produce realistic features [34]. The main drawback of the IS is that it can be misinterpreted in the case of mode collapse [64], i.e., when the generator produces only a single sample regardless of the latent code used as input. FID is much stronger in terms of assessing the quality of the generator. Importantly, both metrics utilize a pre-trained Inception classifier [50] to capture relevant image features and are therefore dependent on its quality. There are multiple pre-trained Inception models available, many of them created using large datasets such as ImageNet [65]. The authors are not aware of whether a similar dataset for aerial imagery exists.
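For reference, FID [52] measures the distance between Gaussian fits to the Inception feature distributions of real and generated samples, with means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

A lower FID indicates that the distribution of generated features is closer to that of the real images.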
The use of FID is advisable and, as confirmed during the research, it is valuable in proving the capabilities of the generator, but to be reliable it needs an Inception network trained on a dedicated aerial imagery dataset. This way, the calculated score would depend on real spatial features existing in the geographical space. Moreover, this approach is only applicable to RGB images; to perform the FID calculation for hyperspectral images, a fine-tailored classifier would have to be trained. Not surprisingly, one of the most effective ways of verifying the quality of artificial images is through human judgment. This takes on even greater importance when approaching the research subject requires specialized knowledge and skills, as exemplified by the analysis of aerial or satellite imagery. Unfortunately, qualitative verification is time-consuming and has to be supported by a quantitative method, which can aid in preselecting potentially good samples.

BigBiGAN accompanied by hierarchical clustering can be effectively used as a building block of an unsupervised orthophoto segmentation pipeline. The results of performing this procedure on a test orthophoto (see Figure 9) prove that the solution is powerful enough to divide the area into a meaningful predefined number of regions. Particularly noteworthy is the precise separation of forests, arable lands, and built-up areas. There is also room for improvement. Currently, the network is not capable of segmenting out the tree felling areas located in the northwest or the river channel, which would be very beneficial from the point of view of landscape analysis. Furthermore, it also incorrectly combined pastures and arable lands. The main drawback of this method is the need to predefine the number of clusters. Moreover, when increasing the number of clusters, artifacts started to occur, and the algorithm predicted small areas that were not identified as distinct regions in the ground truth image (Figure 8, *n* clusters = 7–10). Further analysis of the latent codes and the features they represent is needed to understand the origin of this issue.

The BigBiGAN clustering results resemble, to some extent, the segmentation of the area performed during the Corine Land Cover project in 2018 (Figure 10). It is interesting that the proposed GAN procedure shows a better fit with the boundaries of individual areas than CLC. Nevertheless, CLC has a great advantage over the result generated using the GAN, i.e., each tile possesses information about the land cover types it represents. CLC land cover codes are consistent across all areas involved in the study, which makes this dataset very useful, even for sophisticated analysis. This does not mean, however, that the GAN cannot be rearmed to carry information about land cover types. In the initial BigGAN paper, the authors proposed a solution to enrich each part of the neural network with a mechanism enabling work with class embeddings [44]. The authors did not use this solution in order to maintain the unsupervised nature of the procedure. An interesting alternative would be to compare the latent codes of patches located within different regions to check how similar they are and use this information to join similar, distant regions. To achieve this, a more advanced dataset is needed to cover a larger area and prevent undersampling of less frequently occurring but spatially significant features. The comparison with CLC is also interesting due to the differences in how both sets were created: CLC is prepared using a semi-supervised procedure that involves multiple different information sources, whereas the GAN approach utilizes only orthophotos and is fully unsupervised. Another interesting approach would be to utilize Corine Land Cover (CLC) as the source of model labels and retrain the network to also possess a notion of land cover types. This way, we would gain an interesting solution offering a way of producing CLC-like annotations at different precision levels and using different data sources.
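The proposed comparison of latent codes across regions could be sketched, for example, as the cosine similarity between per-region mean latent vectors; `region_similarity` is a hypothetical helper, and the synthetic latents below merely illustrate the idea:

```python
import numpy as np

def region_similarity(latents, labels):
    """Cosine similarity between per-region mean latent vectors. High
    off-diagonal entries flag distant regions that may represent the
    same land cover and could be joined."""
    regions = sorted(set(labels.tolist()))
    means = np.stack([latents[labels == r].mean(axis=0) for r in regions])
    unit = means / np.linalg.norm(means, axis=1, keepdims=True)
    return unit @ unit.T

# Three regions: 0 and 2 share a latent direction, 1 is orthogonal.
latents = np.zeros((6, 120))
latents[[0, 1, 4, 5], 0] = 1.0   # patches of regions 0 and 2
latents[[2, 3], 1] = 1.0         # patches of region 1
labels = np.array([0, 0, 1, 1, 2, 2])
sim = region_similarity(latents, labels)
print(round(float(sim[0, 2]), 2))  # 1.0 -> candidates for merging
print(round(float(sim[0, 1]), 2))  # 0.0
```

Regions whose similarity exceeds a chosen threshold could then be merged into a single, possibly non-contiguous, land cover class.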

**Figure 10.** Orthophoto sheet 72961\_840430\_M-34-40-B-a-2-3 and Corine Land Cover Project (2018) segmentation [66].

## **5. Conclusions**

Generative adversarial networks are a powerful tool that has definitely found its place in both geographical information systems (GIS) and machine learning toolboxes. In the case of remote sensing imagery processing, they provide a data augmentation mechanism for creating decent-quality artificial data samples, enhancing or even fixing existing images, and they can also actively participate in feature extraction. The latter gives researchers access to new information encoded in the latent space. During the research, the authors confirmed that the bidirectional generative adversarial network (BigBiGAN) encoder module can be successfully used to compress RGB orthophoto patches to lower-dimensional latent vectors.

The encoder performance was assessed indirectly by evaluating the network reconstruction capabilities. Pixel-wise comparison between ground truth and reconstruction output yielded the following results: mean absolute error (MAE) 27.213, structural similarity index (SSIM) 0.942, peak signal-to-noise ratio (PSNR) 42.731, and Fréchet inception distance (FID) 86.36 ± 7.28. Furthermore, the encoder was tested by utilizing output latent vectors to perform geospatial clustering of a chosen area from the Pilica River region (94% patch-wise accuracy against manually prepared segmentation mask). The case study proved that orthophoto latent vectors, combined with georeferences, can be used during spatial analysis, e.g., in region delimitation or by producing reliable segmentation masks.

The main advantage of the proposed procedure is that the whole training process is unsupervised. The utilized neural network is capable of discovering even complex spatial features and encoding them in the network's underlying latent space. In addition, handling relatively lightweight latent vectors during analysis, rather than raw orthophotos, proved to significantly facilitate the study. During processing and analysis, there was no need to store the real image (37 MB), only a recipe to compute it on the fly (3 MB). The authors think this feature has great potential in commercial applications of the procedure, lowering disk space and network transfer requirements when processing large remote sensing datasets.

On the other hand, the presented method is substantially difficult to implement, configure, and train; it is prone to errors and demanding in terms of computation costs. To achieve a decent result, one must be ready for a long run of trial and error, mainly related to tuning the model and estimating the required dataset size. Regarding latent vectors, the authors have identified a major flaw: it is not possible to precisely describe the meaning of each dimension. The main disadvantage of the proposed procedure is that the majority of steps during model evaluation involve human engagement.

The authors are certain that utilizing BigBiGAN on a more robust and rich dataset, such as multispectral imagery backed by a digital terrain model (DTM), while at the same time reducing the internal complexity of the network to enable processing larger patches, will result in a number of valuable discoveries. The main focus of the research team in the future will be the verification of the proposed method on a greater scale. Future work will involve performing geospatial clustering of latent codes acquired for all Polish geographic regions and presenting a comparison between classically distinguished regions and their automatically generated counterparts.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/2072-4292/13/2/306/s1. The encoder model in h5 format with sample data is available on github.com (maciej-adamiak/bigbigan-feature-engineering).

**Author Contributions:** Conceptualization, M.A.; Methodology, M.A.; Software, M.A.; Validation, K.B. and A.M.; Formal analysis, M.A.; Investigation, M.A.; Resources, M.A.; Data curation, M.A.; Writing—original draft preparation, M.A., K.B., and A.M.; Writing—review and editing, M.A., K.B., and A.M.; Visualization, M.A.; Supervision, K.B. and A.M.; Project administration, M.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. This data can be found here: https://www.geoportal.gov.pl/.

**Acknowledgments:** We would like to thank Mikołaj Koziarkiewicz, Maciej Opała, Kamil Rafałko and Tomasz Napierała for helpful remarks and an additional linguistic review.

**Conflicts of Interest:** The authors declare no conflict of interest.
