**4. Results**

Before training began, all pixels were rescaled to values between 0 and 1 during data loading, and the images were additionally normalised. These procedures were intended to speed up network training. The calculations themselves were performed according to the algorithm described in Section 3. The presented results were obtained on the test set; the validation set was used as a control during network training.
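The paper does not state the exact normalisation constants used, so the preprocessing step can only be sketched. The snippet below assumes 8-bit imagery and per-channel mean/standard-deviation normalisation; the `mean` and `std` values are illustrative, not the authors':

```python
import numpy as np

def preprocess(image, mean, std):
    """Rescale raw pixel values to [0, 1], then normalise per channel.

    Assumes 8-bit input and channel-wise constants; the actual values
    used in the paper are not published.
    """
    scaled = image.astype(np.float32) / 255.0   # rescale to [0, 1]
    return (scaled - mean) / std                # channel-wise normalisation

# toy 2x2 RGB tile with extreme and mid-range values
tile = np.array([[[0, 128, 255]] * 2] * 2, dtype=np.uint8)
out = preprocess(tile, mean=np.array([0.5, 0.5, 0.5]), std=np.array([0.5, 0.5, 0.5]))
```

With these constants the output is centred around zero, which is what helps gradient-based training converge faster.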

### *4.1. Dataset with 0.5 m Terrain Pixel*

The results obtained for the dataset with a 0.5 m pixel are presented in Table 3. The best results were achieved by the UNET network with the Resnet34 backbone. In contrast, the worst results were obtained for the DeepLabV3+ network, which may be due to the amount of input data this architecture requires. DeepLabV3+ includes ASPP (Atrous Spatial Pyramid Pooling), which uses dilated convolution. This is computationally demanding [51] and requires a sufficient amount of input data of satisfactory quality. The main purpose of ASPP is to extract features of larger objects and maintain their consistency [7]. Too little input data resulted in false-positive artifacts in parts of the predicted images. This can be seen in Figure 7, especially in example (c).
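The dilated (atrous) convolution underlying ASPP can be illustrated with a small sketch. This is not the authors' implementation; it is a plain-NumPy, "valid"-padding toy showing how dilation enlarges the receptive field without adding weights:

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=2):
    """Valid-mode 2-D convolution with a dilated (atrous) kernel.

    A k x k kernel with dilation d covers a (d*(k-1)+1)^2 receptive
    field while keeping only k*k weights -- the idea behind ASPP's
    parallel branches with different dilation rates.
    """
    k = kernel.shape[0]
    span = dilation * (k - 1) + 1            # effective kernel extent
    h, w = x.shape
    out = np.zeros((h - span + 1, w - span + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sample the input on a dilated grid before weighting
            patch = x[i:i + span:dilation, j:j + span:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
y = dilated_conv2d(x, np.ones((3, 3)), dilation=2)  # 3x3 kernel sees a 5x5 area
```

Because each branch samples a wider context, ASPP needs enough varied training data to learn useful large-object features, which is consistent with its poor showing on the small 0.5 m dataset.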

**Table 3.** Results for 0.5 m dataset [%].


**Figure 7.** Visualisation of the results for the 0.5 m dataset. From left: input image, ground truth, DeepLab, DeepLab with augmentation, UNET, UNET with augmentation, UNET\_bb, UNET\_bb with augmentation. Colours: dark green—TP, orange—TN, light green—FP, red—FN. (**a**–**e**) various examples from the test dataset.

Table 3 also shows the results for each architecture when using the data augmentation described in Section 3.2. Better results on the test set were obtained only for the DeepLabV3+ architecture, which may confirm the conjecture from the previous paragraph about this network's data requirements. For both UNET architectures the results were slightly worse. This may be due to the specifics of the dataset and the information propagated inside the network. Since the dataset contains labelling inaccuracies (described in Section 2.2), augmentation may have presented the network with more erroneous patterns during training, so the learned weights were fitted to those errors. In addition, the data were highly heterogeneous, a consequence of the lack of spatial planning in the area. It can therefore be concluded that data augmentation does not always produce positive results.
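A key property of segmentation augmentation, relevant to the label-noise argument above, is that every geometric transform must be applied jointly to the image and its mask. The exact transforms of Section 3.2 are not reproduced here; this sketch uses generic flips and rotations to show how any labelling error is replicated along with the pixels:

```python
import numpy as np

def augment(image, mask, rng):
    """Apply the same random flip/rotation to an image and its label mask.

    Illustrative only -- the actual transform set from Section 3.2 is
    not reproduced. Note that errors in the mask (Section 2.2) are
    copied into every augmented sample as well.
    """
    if rng.random() < 0.5:                       # random horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    k = rng.integers(0, 4)                       # random 90-degree rotation
    return np.rot90(image, k), np.rot90(mask, k)

rng = np.random.default_rng(0)
img = np.arange(16).reshape(4, 4)
msk = (img > 7).astype(int)                      # toy "building" mask
aug_img, aug_msk = augment(img, msk, rng)
```

Whatever transform is drawn, the image-mask correspondence is preserved, and so is any mislabelling, which is why augmentation amplified the dataset's inaccuracies for the UNET models.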

Figure 7 presents the predicted masks for several images from the test set, which mirror the results in Table 3. However, not all objects were classified correctly. The reasons for this are discussed in detail in Section 4.3.

Ultimately, the UNET architecture with the Resnet34 backbone performed best in terms of both metrics and visualisation. Regular UNET was slightly worse (mIoU lower by 2 points). The worst results were obtained for the DeepLabV3+ network, whose mIoU was over 10 points lower than UNET's on the dataset without augmentation and about 7 points lower with augmentation.
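The mIoU values being compared throughout can be computed as below. This is a generic sketch for the two-class (building/background) case, not the authors' evaluation code:

```python
import numpy as np

def mean_iou(pred, truth, n_classes=2):
    """Mean intersection-over-union across classes.

    For building segmentation, class 1 = building, class 0 = background.
    Classes absent from both prediction and ground truth are skipped.
    """
    ious = []
    for c in range(n_classes):
        inter = np.sum((pred == c) & (truth == c))
        union = np.sum((pred == c) | (truth == c))
        if union:
            ious.append(inter / union)
    return float(np.mean(ious))

truth = np.array([[1, 1], [0, 0]])
pred  = np.array([[1, 0], [0, 0]])   # one building pixel missed
score = mean_iou(pred, truth)        # (2/3 + 1/2) / 2
```

Averaging over both classes penalises missed buildings and false alarms symmetrically, which is why mIoU is the headline metric in Tables 3 and 4.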

### *4.2. Dataset with 0.1 m Terrain Pixel*

The results obtained by the individual architectures for the dataset with a 0.1 m terrain pixel are presented in Table 4. As data augmentation did not significantly improve the results for the dataset with a 0.5 m pixel, it was omitted when analysing the dataset with the smaller terrain pixel.

**Table 4.** Results for 0.1 m dataset [%].


As with the 0.5 m dataset, the best performance on the 0.1 m dataset was achieved by the UNET network with the backbone, but the mIoU difference with respect to DeepLabV3+ is small at 0.14 points. This suggests that the larger dataset satisfied the input-data requirements of the DeepLabV3+ network. The regular UNET network performs slightly worse here; its mIoU is about 2 points lower than that of the other networks. However, its biggest advantage is computation speed, which follows from the significantly smaller number of weights to be determined.

The results are visualised in Figure 8. Judging by the visualisations alone, the best results were obtained for the DeepLabV3+ architecture, which produced the most TP and TN areas. The other architectures also give satisfactory results. Importantly, all architectures perform well in both lower-density areas (Figure 8a,c–g) and higher-density areas (Figure 8b,f).

The FP-labelled pixels at the edges of buildings are largely due to the ground truth not being entirely accurate (errors described in Section 2.2). FNs, on the other hand, are often caused by occlusion by trees or by shadows cast by taller buildings. Furthermore, varied roof shapes can cause the model to miss parts of a building (FN).
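The per-pixel TP/TN/FP/FN categories used in the colour-coded visualisations can be derived directly from the predicted and reference masks. A minimal sketch, assuming binary masks with 1 = building:

```python
import numpy as np

def confusion_mask(pred, truth):
    """Label each pixel TP/TN/FP/FN for visualisations like Figures 7 and 8."""
    tp = (pred == 1) & (truth == 1)   # building correctly detected
    tn = (pred == 0) & (truth == 0)   # background correctly detected
    fp = (pred == 1) & (truth == 0)   # e.g. artefacts at noisy building edges
    fn = (pred == 0) & (truth == 1)   # e.g. roofs occluded by trees or shadow
    return tp, tn, fp, fn

truth = np.array([[1, 1], [0, 0]])
pred  = np.array([[1, 0], [1, 0]])
tp, tn, fp, fn = confusion_mask(pred, truth)
```

Each pixel falls into exactly one category, so the four boolean masks can be mapped straight to the colour scheme used in the figures.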

**Figure 8.** Visualisation of the results for the 0.1 m dataset. From left: input image, ground truth, DeepLabV3+, UNET, UNET with backbone. Colours: dark green—TP, orange—TN, light green—FP, red—FN. (**a**–**h**) various examples from the test dataset.
