*2.3. Data Preprocessing*

Publicly available datasets for object segmentation are most often created manually from OSM vector data or by manually correcting publicly available building outlines. Datasets are also created from commercially acquired data. Our aim, however, was to create a dataset entirely free of charge.

The input orthophoto data comprised six images in GTiff format with a ground pixel resolution of 10 cm. The areas selected for analysis were diverse in terms of architecture and building density. Each image measured 22,477 × 23,162 pixels, i.e., 2247.7 m × 2316.2 m, so the analysis covered an area of over 31 square kilometres. The study area contained 21,010 buildings in vector format, available in the EGiB database.

From the input data, which consisted of orthophotomaps and vector data from EGiB, two datasets were created according to the algorithm presented in Figure 3: the first with a terrain pixel of 10 cm and the second with a terrain pixel of 50 cm. The second dataset was created by resampling the first, main dataset. The algorithm was programmed in Python using the gdal, ogr, opencv and patchify libraries. Two different terrain pixels were used so that the performance of the algorithms could be compared with respect to terrain pixel size and the feasibility of segmenting buildings at each resolution could be evaluated.
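The resampling from a 10 cm to a 50 cm terrain pixel can be illustrated with a minimal sketch. The function below is an assumption about the approach (simple 5 × 5 block averaging in NumPy), not the exact gdal-based implementation used in the study:

```python
import numpy as np

def downsample_block_mean(image: np.ndarray, factor: int) -> np.ndarray:
    """Resample a raster to a coarser terrain pixel by block averaging.

    Going from a 10 cm to a 50 cm terrain pixel corresponds to factor=5.
    The image is cropped so that its dimensions are divisible by the factor.
    """
    h, w = image.shape[:2]
    h, w = h - h % factor, w - w % factor
    cropped = image[:h, :w]
    # Group pixels into factor x factor blocks and average each block.
    blocks = cropped.reshape(h // factor, factor, w // factor, factor, -1)
    return blocks.mean(axis=(1, 3)).astype(image.dtype)

# A synthetic 10 x 10 RGB tile resampled by a factor of 5 becomes 2 x 2.
tile = np.arange(10 * 10 * 3, dtype=np.float32).reshape(10, 10, 3)
small = downsample_block_mean(tile, 5)
print(small.shape)  # (2, 2, 3)
```

In practice, gdal.Warp with an averaging resampling method would perform the same operation directly on the GTiff files while preserving georeferencing.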

The data were split into smaller images suitable for neural networks; this is recommended practice, as it reduces the computing power required. Only those images in which buildings were present were then retained.
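The tiling and filtering steps above can be sketched as follows. This is a simplified, pure-NumPy stand-in for the patchify-based step described in the text; the function name and tile layout are illustrative assumptions:

```python
import numpy as np

def tile_and_filter(image: np.ndarray, mask: np.ndarray, tile: int = 512):
    """Split an orthophoto and its building mask into fixed-size tiles,
    keeping only tiles in which at least one building pixel is present.

    Simplified stand-in for the patchify-based step: non-overlapping
    tiles, with any remainder at the right/bottom edges discarded.
    """
    kept = []
    h, w = mask.shape
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            m = mask[y:y + tile, x:x + tile]
            if m.any():  # skip tiles containing only background
                kept.append((image[y:y + tile, x:x + tile], m))
    return kept

# Toy example: a 1024 x 1024 scene where buildings occupy one quadrant,
# so only 1 of the 4 tiles survives the filter.
img = np.zeros((1024, 1024, 3), dtype=np.uint8)
msk = np.zeros((1024, 1024), dtype=np.uint8)
msk[:512, :512] = 1  # one "building" quadrant
print(len(tile_and_filter(img, msk)))  # 1
```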

The first dataset contained 6365 images of 512 × 512 pixels with 3 channels (RGB) and corresponding labels in the form of binary image masks (1—buildings, 0—background) obtained by rasterization. The data were divided into a training set (80%), a validation set (10%) and a testing set (10%), containing, respectively, 5092 training images, 636 validation images and 637 test images. Similar work was carried out with the second dataset, with the larger terrain pixel. Dividing the large images into smaller ones yielded 1263 images of 256 × 256 pixels with 3 channels (RGB) and corresponding binary image masks (1—buildings, 0—background), likewise obtained by rasterization. The data were divided in the same ratio as the first dataset, giving 1010 training images, 126 validation images and 127 test images. A summary of both datasets is shown in Table 1. Figure 4 shows raster data visualisations of the two datasets for visual comparison. Clearly, more blurring is visible for the larger terrain pixel, so worse performance of the proposed architectures is to be expected for this dataset. The datasets have been made available in a repository [29] via the GitHub platform.
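The 80/10/10 division can be reproduced with a short sketch. The seed and rounding convention are assumptions for illustration; with floor rounding of the training and validation shares and the remainder assigned to the test set, the counts reported above (5092/636/637) follow directly:

```python
import numpy as np

def split_indices(n: int, seed: int = 0):
    """Shuffle n sample indices and split them 80/10/10 into training,
    validation and test subsets. The training and validation sizes are
    rounded down; the remainder goes to the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(6365)   # first dataset
print(len(train), len(val), len(test))   # 5092 636 637
```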

**Figure 3.** Algorithm for preparing datasets.

**Figure 4.** Comparison of datasets: (**a**) with 0.1 m pixel, (**b**) with 0.5 m pixel.



**Table 1.** Summary of created datasets.

We are aware that imperfect ground-truth data (containing the errors mentioned in the previous section) were used for testing. Testing the algorithm with manually produced data (true ground truth) is planned as future work. We anticipate that this may affect the accurate extraction of building edges but, as mentioned, our aim is to test the feasibility of using fully open data for building segmentation.
