*2.1. Data*

To evaluate the performance of different methods, we select a study area in Christchurch, New Zealand. The original aerial imagery and the annotated building polygons are hosted by Land Information New Zealand (LINZ) (https://data.linz.govt.nz/layer/53413-nz-building-outlines-pilot/). The aerial images have a spatial resolution of 0.075 m. Prior to performing our experiments, we evenly partition the study area into two areas, one for training (Figure 1a, left) and one for testing (Figure 1a, right). The original annotations provided by LINZ are registered to the building footprints on the ground rather than to the rooftops (confirmed by visual interpretation using the QGIS GUI (https://qgis.org/)). For accurate outline extraction, we manually adjust the vectorized building outlines so that all building polygons are roughly registered to the rooftops in the aerial imagery (Figure 1b). Because of the large number of buildings and occasional human error, misalignments ranging from sub-pixel to several pixels are inevitable. Thus, the models have to be trained with imperfect "ground truth".
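The even partition described above can be illustrated with a minimal sketch. It assumes that the split follows the vertical midline of the study area and that each building is assigned by the mean of its polygon vertices. Both are assumptions made for illustration only, and the function name `split_by_midline` is hypothetical.

```python
def split_by_midline(polygons, x_min, x_max):
    """Assign polygons to the left (training) or right (testing) half.

    polygons: list of polygons, each a list of (x, y) vertex tuples.
    x_min, x_max: horizontal extent of the study area (assumed known).
    """
    x_mid = (x_min + x_max) / 2.0
    train, test = [], []
    for poly in polygons:
        # Mean of the vertex x-coordinates, used as a simple stand-in
        # for the polygon centroid (an assumption of this sketch).
        cx = sum(x for x, _ in poly) / len(poly)
        if cx < x_mid:
            train.append(poly)
        else:
            test.append(poly)
    return train, test
```

A building that straddles the midline ends up entirely in one half under this rule; how such cases are handled in the actual partition is not stated in the text.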

**Figure 1.** (**a**) Aerial imagery of the study area ranging from 172°33′E to 172°40′E and 43°30′S to 43°32′S, encompassing approximately 32 km². (**b**) Manual adjustment of provided annotation (e.g., from red to green polygon). (**c**) Sample pairs of the extracted patches.

As shown in Figure 1a, the study area is covered mainly by residential buildings, with sparsely distributed factories, trees, and lakes. From the training and testing areas, 16,635 and 14,834 patches are extracted, respectively. Each patch is 224 × 224 pixels. As shown in Figure 1c, each pair of patches contains buildings in its central area.
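As a rough illustration of how a 224 × 224 patch with a building near its center could be cut out, the sketch below computes the pixel window centered on a given building location. The function name `patch_window`, the choice of center pixel, and the policy of skipping windows that cross the image border are assumptions of this sketch, not details given in the text.

```python
PATCH_SIZE = 224  # patch side length in pixels, as stated in the text


def patch_window(center_row, center_col, height, width, size=PATCH_SIZE):
    """Return (top, left, bottom, right) of a size x size window centered
    on (center_row, center_col), or None if the window would extend
    beyond an image of shape (height, width).
    """
    half = size // 2
    top = center_row - half
    left = center_col - half
    bottom = top + size
    right = left + size
    # Skipping border-crossing windows is an assumption; padding the
    # image would be an equally plausible policy.
    if top < 0 or left < 0 or bottom > height or right > width:
        return None
    return top, left, bottom, right
```

With an image and its label mask stored as arrays, the same window can be applied to both (e.g., `image[top:bottom, left:right]` and `mask[top:bottom, left:right]`) so that each pair stays aligned.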
