*3.1. Semantic Image Segmentation Architectures*

Currently, the most commonly used network architectures for image segmentation tasks are variants of UNET and DeepLab. SegNet [30,31] and PSPNet [32] are also used; however, they were excluded from further analysis because they are usually less efficient. Choosing the optimal architecture was not the aim of this paper, but both families are described briefly below.

UNET consists of two main segments: the encoder and the decoder. The encoder consists of convolutional blocks, each followed by a pooling layer that reduces dimensionality. After reaching the bottleneck (bridge), dimensionality is increased by deconvolution or upsampling. In addition, feature maps from the corresponding encoder block are passed through skip connections and concatenated with the decoder features, followed by a convolution block. The operation is repeated until the input dimensions are restored, and a predicted mask is obtained using a final convolution layer with an appropriate activation function [33]. The above description covers the foundation of the network. In the years following the emergence of the UNET network, various research teams have modified it to provide even better results for different applications. Examples include DeepUNET [34], DeepResUNET [5,8], and combining UNET with solutions such as ASPP (Atrous Spatial Pyramid Pooling) [10]. From the point of view of information extraction from aerial images or satellite imagery, the results presented in [35,36] are particularly interesting. They show that specific objects can be extracted with varied accuracy: the mean Intersection over Union value in most of the cited publications is around 90% in the case of building segmentation.
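The encoder/bridge/decoder structure described above can be sketched in Keras. This is a minimal illustration, not the network used in the paper: the depth, filter counts, and input size are arbitrary assumptions, and the final sigmoid assumes a binary mask.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions with ReLU, as in the original U-Net
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(128, 128, 3), n_classes=1):
    inputs = layers.Input(input_shape)
    # Encoder: convolution blocks, each followed by pooling
    c1 = conv_block(inputs, 32)
    p1 = layers.MaxPooling2D()(c1)
    c2 = conv_block(p1, 64)
    p2 = layers.MaxPooling2D()(c2)
    # Bridge (bottleneck)
    b = conv_block(p2, 128)
    # Decoder: deconvolution, skip-connection concatenation, convolution block
    u2 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(b)
    d2 = conv_block(layers.Concatenate()([u2, c2]), 64)
    u1 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(d2)
    d1 = conv_block(layers.Concatenate()([u1, c1]), 32)
    # Final 1x1 convolution; sigmoid activation for a binary mask
    outputs = layers.Conv2D(n_classes, 1, activation="sigmoid")(d1)
    return Model(inputs, outputs)

model = build_unet()
print(model.output_shape)  # (None, 128, 128, 1)
```

The predicted mask has the same spatial dimensions as the input, which is the defining property of the architecture.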

DeepLab, on the other hand, is a family of network architectures based on atrous convolution in its initial version, DeepLabv1 [37]; followed by the introduction of atrous spatial pyramid pooling, DeepLabv2 [38]; its extension, DeepLabv3 [39]; the addition of a segmentation decoder, DeepLabv3+ [40]; and networks designed with NAS (Neural Architecture Search), Auto-DeepLab [41]. The DeepLab architecture is particularly effective when used with pre-trained backbones for feature extraction. The ASPP module captures context by analysing relationships in both the local and the wider neighbourhood. This approach was used, among others, in [24]. The DeepLab architecture, particularly DeepLabv3+, is also often used to segment information from satellite or aerial images, e.g., [42–44].
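The ASPP module at the heart of DeepLabv2/v3 can be sketched as parallel atrous convolutions at several dilation rates plus an image-level pooling branch. This is an illustrative sketch, assuming a 32x32x512 backbone feature map and the dilation rates (6, 12, 18) from the DeepLab papers; it is not the exact configuration used by the authors.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp(x, filters=256, rates=(6, 12, 18)):
    # 1x1 branch plus parallel 3x3 atrous convolutions at growing rates:
    # small rates see the local neighbourhood, large rates the wider context
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(x)]
    for r in rates:
        branches.append(
            layers.Conv2D(filters, 3, padding="same",
                          dilation_rate=r, activation="relu")(x))
    # Image-level branch: global average pooling, then broadcast back
    pool = layers.GlobalAveragePooling2D(keepdims=True)(x)
    pool = layers.Conv2D(filters, 1, activation="relu")(pool)
    pool = layers.UpSampling2D(size=(x.shape[1], x.shape[2]),
                               interpolation="bilinear")(pool)
    branches.append(pool)
    # Fuse all branches and project with a 1x1 convolution
    y = layers.Concatenate()(branches)
    return layers.Conv2D(filters, 1, padding="same", activation="relu")(y)

inp = layers.Input((32, 32, 512))
model = tf.keras.Model(inp, aspp(inp))
print(model.output_shape)  # (None, 32, 32, 256)
```

Because every branch preserves the spatial resolution, the module enriches each location with multi-scale context without further downsampling.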

Therefore, it was decided to test both architectures described above for solving the segmentation problem using open data resources. Part of the solution was implemented in Python using the Keras library, with the TensorFlow framework as the backend. For this purpose, a publicly available repository was created on GitHub [29]. A visualisation of the architectures used is shown in Figure 5.

**Figure 5.** Used model architectures.

Several variants of the UNET and DeepLabv3+ networks were implemented there, as well as the metrics and loss functions described later in this paper. A GitHub implementation of UNET with a ResNet34 backbone [45] was also used. The backbones were initialised with publicly available weights obtained from classification on the ImageNet dataset. Table 2 presents the network architectures used.
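Loading a backbone with ImageNet classification weights and truncating it for feature extraction can be sketched as follows. Note the assumptions: `tf.keras.applications` ships ResNet50 (not the ResNet34 used via [45]), so ResNet50 serves here as a stand-in, and the layer name `conv4_block6_out` is the last activation of its fourth stage.

```python
import tensorflow as tf

def resnet_encoder(input_shape=(224, 224, 3), weights="imagenet"):
    # ResNet50 backbone without its classification head (include_top=False),
    # truncated at the conv4 stage (output stride 16) to feed a decoder
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights=weights, input_shape=input_shape)
    features = backbone.get_layer("conv4_block6_out").output
    return tf.keras.Model(backbone.input, features)

# weights=None here only to avoid downloading; pass "imagenet" in practice
encoder = resnet_encoder(weights=None)
print(encoder.output_shape)  # (None, 14, 14, 1024)
```

In a full segmentation model, activations from several encoder stages would also be exposed to serve as the skip connections of the decoder.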

**Table 2.** Summary of used architectures.

