Appendix A
During the design of the experiments and the analysis of the results, pixel thresholds were applied prior to determining whether the classification performed by the network was acceptable. More specifically, since the segmentation analysis was conducted at the pixel level, we introduced these thresholds to exclude cases that may be of limited relevance to the overall geometry of an area (e.g., isolated or sparse pixel misclassifications within an image region) or to the evaluation of boundary properties (e.g., minor discrepancies in border delineation between the network outputs and expert annotations).
This appendix elaborates on the rationale behind the selection of appropriate pixel thresholds for performance evaluation in semantic segmentation. Although a general threshold of 1000 pixels—corresponding to approximately 0.38% of a standard 512 × 512 image—was found to be optimal in most scenarios, class-specific analysis revealed slight adjustments that enhance robustness. The objective was to determine suitable cutoff values that effectively filter out marginal predictions while preserving meaningful, class-specific detections.
To this end, we performed a per-class analysis on the dataset and generated representative histograms showing the distributions of true positives (TPs), false positives (FPs), and false negatives (FNs) across the dataset images. These histograms provided insight into the prevalence of small-area misclassifications for each class and supported the derivation of appropriate thresholds tailored to different segmentation scenarios.
A multi-step evaluation strategy was adopted. Initially, TP, FP, and FN pixel distributions were calculated for each land cover class across the training images. Threshold candidates were then assessed based on their ability to exclude minimally relevant FP and FN errors. These values were validated both visually and statistically to identify the class-specific breakpoints that yielded the most favorable trade-offs between precision and recall.
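The per-image counting step described above can be sketched as follows. This is a minimal illustration, not the authors' actual script; the class list and function name are assumptions:

```python
import numpy as np

# Class names as used in this appendix (illustrative ordering).
CLASSES = ["background", "buildings", "woodland", "water", "roads"]

def per_class_counts(gt, pred, class_id):
    """Count TP, FP, and FN pixels for one class in a single image.

    gt, pred: 2-D integer label masks of identical shape.
    """
    gt_mask = gt == class_id
    pred_mask = pred == class_id
    tp = np.count_nonzero(gt_mask & pred_mask)   # correctly labeled pixels
    fp = np.count_nonzero(~gt_mask & pred_mask)  # predicted but not in GT
    fn = np.count_nonzero(gt_mask & ~pred_mask)  # in GT but missed
    return tp, fp, fn
```

Collecting these triples over all training images per class yields exactly the distributions from which the histograms were drawn, e.g., via Matplotlib's `plt.hist`.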
The analysis was conducted using segmentation models developed with the MMSegmentation framework, which were applied to a land cover dataset annotated for five semantic classes. Custom Python scripts were used to compute per-image pixel distributions and generate evaluation metrics. Histograms were produced to visualize TP, FP, FN, and total ground-truth pixel counts per image. Additionally, confusion maps were generated to spatially analyze misclassifications, with particular attention given to boundary regions, where errors were most frequently observed.
The visual tools used included Matplotlib (version 3.8.2) for generating histogram plots and OpenCV for creating confusion maps. Each map overlaid misclassification patterns using color-coded pixels to separately highlight true positives (TPs), false positives (FPs), and false negatives (FNs). This dual-visualization approach offered both statistical insight and spatial interpretability.
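A minimal sketch of how such a color-coded confusion map could be assembled with NumPy is given below. The palette is an assumption (the paper does not specify its colors); the resulting BGR array could then be saved with OpenCV's `cv2.imwrite`:

```python
import numpy as np

# Illustrative BGR color coding; the paper's actual palette is not specified.
TP_COLOR = (0, 255, 0)   # green: true positives
FP_COLOR = (0, 0, 255)   # red: false positives
FN_COLOR = (255, 0, 0)   # blue: false negatives

def confusion_map(gt, pred, class_id):
    """Render a per-class confusion map as an H x W x 3 BGR image."""
    h, w = gt.shape
    out = np.zeros((h, w, 3), dtype=np.uint8)  # unclassified pixels stay black
    gt_mask = gt == class_id
    pred_mask = pred == class_id
    out[gt_mask & pred_mask] = TP_COLOR
    out[~gt_mask & pred_mask] = FP_COLOR
    out[gt_mask & ~pred_mask] = FN_COLOR
    return out
```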
In parallel with the analysis, we also extracted a representative image from the dataset for each segmented class, where the number of misclassified pixels slightly exceeded the selected threshold. For each case, we present a confusion map to support a detailed misclassification analysis, illustrating the actual discrepancies between the ground truth (GT) and the predicted mask.
Figures A1–A10 present the complete spectrum of the analysis; each figure focuses on a specific class or threshold configuration. For the background class, Figure A1 demonstrates that most true positive (TP) pixels were preserved and that a threshold of approximately 1500 pixels effectively minimized errors while avoiding under-segmentation. The histogram in Figure A3 supports this finding for the buildings class, although a threshold of around 1000 pixels was sufficient in less complex cases. Figure A2 and Figure A4 display the corresponding confusion maps, confirming that false positives (FPs) and false negatives (FNs) were primarily concentrated along class boundaries or restricted to small, isolated regions.
Figure A1.
(a) A histogram of the number of true positives per image for the class ‘background’. (b) A histogram of the number of false positives per image for the class ‘background’. (c) A histogram of the number of false negatives per image for the class ‘background’. (d) A histogram of the number of ground-truth pixels classified as ‘background’.
Figure A2.
A confusion map for the class “background”. The false positives and false negatives are predominantly attributable to the borders.
Figure A3.
(a) A histogram of the number of true positives per image for the class ‘buildings’. (b) A histogram of the number of false positives per image for the class ‘buildings’. (c) A histogram of the number of false negatives per image for the class ‘buildings’. (d) A histogram of the number of ground-truth pixels classified as ‘buildings’.
Figure A4.
A confusion map for the class “buildings”. The false positives and false negatives are predominantly attributable to the borders.
Figure A5.
(a) A histogram of the number of true positives per image for the class ‘woodland’. (b) A histogram of the number of false positives per image for the class ‘woodland’. (c) A histogram of the number of false negatives per image for the class ‘woodland’. (d) A histogram of the number of ground-truth pixels classified as ‘woodland’. (e) A histogram of the number of ground-truth pixels classified as ‘woodland’ limited to the range 0–5000 pixels.
Figure A6.
A confusion map for the class “woodland”. The false positives and false negatives are predominantly attributable to the borders.
Figure A7.
(a) A histogram of the number of true positives per image for the class ‘water’. (b) A histogram of the number of false positives per image for the class ‘water’. (c) A histogram of the number of false negatives per image for the class ‘water’. (d) A histogram of the number of ground-truth pixels classified as ‘water’.
Figure A8.
A confusion map for the class “water”. The false positives and false negatives are predominantly attributable to the borders.
Figure A9.
(a) A histogram of the number of true positives per image for the class ‘roads’. (b) A histogram of the number of false positives per image for the class ‘roads’. (c) A histogram of the number of false negatives per image for the class ‘roads’. (d) A histogram of the number of ground-truth pixels classified as ‘roads’.
Figure A10.
A confusion map for the class “roads”. The false positives and false negatives are predominantly attributable to the borders.
The background class was the most “noisy” because it encompassed all regions not attributed to the other classes. It included large cultivated areas, uncultivated fields, and occasionally vegetation in urban areas not classified as woodland. This ambiguity in the class content resulted in a wide variety of error types and scales. However, by applying the previously identified threshold, nearly all misclassifications involving small-area segments were eliminated.
In the case of woodland segmentation (Figure A5), the findings differed slightly: a threshold of approximately 800 pixels already yielded optimal segmentation performance by effectively excluding erroneous boundary detections. Nevertheless, the number of images containing fewer than 1000 woodland pixels was relatively limited compared to the buildings class and was approximately equal (as shown in the histogram) to the number of images below the 800-pixel threshold. Figure A6 illustrates the spatial distribution of the segmentation discrepancies for this class and presents insights consistent with Figure A2 and Figure A4.
The water segmentation results are presented in Figure A7 and Figure A8. Owing to the higher geometric regularity of water bodies relative to the other classes, the number of small-sized errors was lower: the histogram analysis revealed fewer areas with small spatial footprints and more regular boundary delineations. These combined factors resulted in a significantly lower optimal threshold of around 300 pixels. In Figure A8, the class boundaries also exhibit reduced ambiguity.
The road class, depicted in Figure A9 and Figure A10, showed an optimal threshold of approximately 600 pixels. The true-positive coverage remained stable, while false positives and false negatives were substantially reduced, particularly near image boundaries. Across all five classes, the class-specific thresholds improved segmentation clarity and reduced evaluation noise.
The confusion maps and histograms consistently indicate that most segmentation errors occurred along class boundaries. These transitional zones often displayed mixed features, leading to prediction uncertainty. For background, the 1500-pixel threshold shown in Figure A1 reflected the need for a higher cutoff due to the class's dominant spatial extent. For buildings, which presented less complex conditions, the 1000-pixel threshold demonstrated in Figure A3 was sufficient. Woodland, being less dominant and spatially sparse, benefited from a lower, 800-pixel threshold (Figure A5). The 300-pixel threshold for water (Figure A7) effectively eliminated false positives and false negatives while preserving the number of correct predictions. Similarly, for roads, a 600-pixel threshold (Figure A9) balanced the removal of isolated misclassifications with the preservation of linear continuity. The consistency of these results supports the hypothesis that class-specific thresholds enhance evaluation interpretability.
Although these tailored thresholds yielded incremental improvements, applying a general 1000-pixel threshold remained a reliable and interpretable default. It performed consistently across most classes and simplified the evaluation process. Especially in operational scenarios or standardized benchmarking, a unified threshold ensures fairness and comparability.
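The thresholding logic discussed above amounts to a simple lookup. The sketch below uses the class-specific values identified in this appendix; the dictionary layout and function name are illustrative, not the authors' actual implementation:

```python
# Class-specific pixel thresholds identified in this appendix.
CLASS_THRESHOLDS = {
    "background": 1500,
    "buildings": 1000,
    "woodland": 800,
    "water": 300,
    "roads": 600,
}
GENERAL_THRESHOLD = 1000  # general default: ~0.38% of a 512 x 512 image

def is_relevant_error(class_name, error_pixels, use_class_specific=True):
    """Decide whether a misclassified area is large enough to count.

    Areas below the cutoff are treated as marginal (e.g., border
    discrepancies) and excluded from the evaluation.
    """
    cutoff = (CLASS_THRESHOLDS.get(class_name, GENERAL_THRESHOLD)
              if use_class_specific else GENERAL_THRESHOLD)
    return error_pixels >= cutoff
```

Switching `use_class_specific` to `False` reproduces the unified 1000-pixel evaluation, which is the configuration recommended for standardized benchmarking.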
This extended analysis demonstrates that, while class-specific thresholds enhanced per-class clarity, a general threshold of 1000 pixels provided the most consistent overall results. The identified values, namely 1500 pixels for background (Figure A1), 1000 pixels for buildings (Figure A3), 800 pixels for woodland (Figure A5), 300 pixels for water (Figure A7), and 600 pixels for roads (Figure A9), illustrate how spatial and structural characteristics guide the selection of appropriate thresholds.
The 1000-pixel threshold eliminated nearly all false positives and false negatives while preserving the number of correct detections. Visual inspection supported these findings, revealing that most FPs and FNs occurred in transition zones.
Figures A1–A10 illustrate the practical impact of applying both general and class-specific thresholds. While class-specific tuning provides enhanced precision, the 1000-pixel threshold performs effectively across the various land cover categories, confirming its role as the most balanced and reliable choice. This threshold ensures fair evaluation and improves the interpretability of segmentation results in real-world applications.
Appendix B
This appendix investigates the rationale behind the decision not to employ pretrained weights in the evaluated segmentation models. The analysis was based on a comparative evaluation of pretrained and non-pretrained versions of two widely used architectures, PSPNet and DANet. The objective was to assess the influence of pretraining on the models' behavior in the presence of extreme misclassifications, commonly referred to as outliers, across a range of error thresholds. The analysis was conducted using the same semantic segmentation framework and evaluation pipeline applied throughout this study: the models were either trained from scratch or initialized with pretrained weights and then tested under identical conditions. Outliers were identified by counting the test images in which the per-class error exceeded fixed thresholds, expressed as percentages of the image area: 0.19%, 0.27%, 0.38%, and 0.57%. Additional annotations were included to distinguish between clear model errors, ground-truth inconsistencies, and ambiguous cases.
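The outlier-counting criterion can be sketched as follows. This is an illustration under the assumption that the percentage thresholds refer to misclassified area as a fraction of the image; the function names are not from the paper:

```python
import numpy as np

# Thresholds listed in this appendix, as fractions of the image area.
THRESHOLDS = [0.0019, 0.0027, 0.0038, 0.0057]

def error_fraction(gt_mask, pred_mask):
    """Fraction of image pixels misclassified (FP + FN) for one class.

    gt_mask, pred_mask: boolean masks for a single class.
    """
    fp = np.count_nonzero(~gt_mask & pred_mask)
    fn = np.count_nonzero(gt_mask & ~pred_mask)
    return (fp + fn) / gt_mask.size

def count_outliers(fractions, threshold):
    """Number of test images whose error fraction exceeds the threshold."""
    return sum(1 for f in fractions if f > threshold)
```

Raising the threshold tightens the outlier definition, which is why the totals in Tables A1 and A2 decrease monotonically from the 0.19% column to the 0.57% column.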
Two summary tables are provided. Table A1 reports the numbers of network mistakes, ground-truth mistakes, and ambiguous cases for the non-pretrained versions of the two models across all four thresholds, while Table A2 presents the same metrics for the pretrained versions. The datasets and class structure remained identical, ensuring a direct comparison of model robustness under different initialization conditions. Each table includes four threshold levels to capture the progressive effect of relaxing or tightening the outlier definition.
The results reveal that pretraining did not consistently enhance model robustness to extreme errors. In several cases, models initialized with pretrained weights exhibited equal or higher outlier rates than their non-pretrained counterparts. For example, when pretrained, PSPNet showed a slight increase in the number of outliers detected at the 0.38% threshold, rising from 21 network mistakes for the non-pretrained version to 24 for the pretrained one. This trend was observed across other thresholds and architectures as well. These findings suggest that pretraining may introduce latent biases, possibly inherited from the source dataset, that affect the network's ability to generalize under noisy or ambiguous conditions.
Based on the comparative outlier analysis, pretraining does not universally guarantee improved performance in terms of robustness against severe segmentation failures. The results indicate that training from scratch can, in some cases, yield more stable behavior under strict evaluation criteria. Consequently, the decision not to use pretrained weights in this study was justified, as it avoided unintended biases and ensured that the networks’ performance was attributable solely to learning from the target dataset.
Table A1.
Outlier analysis when PSPNet and DANet were not pretrained.
|  | PSPNet |  |  |  | DANet |  |  |  |
|---|---|---|---|---|---|---|---|---|
| Thresholds | 0.19% (*) | 0.27% (*) | 0.38% (*) | 0.57% (*) | 0.19% (*) | 0.27% (*) | 0.38% (*) | 0.57% (*) |
| **Network mistakes** |  |  |  |  |  |  |  |  |
| background | 6 | 6 | 5 | 5 | 4 | 4 | 3 | 3 |
| buildings | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| woodland | 6 | 5 | 4 | 4 | 7 | 7 | 7 | 6 |
| water | 1 | 0 | 0 | 0 | 9 | 9 | 8 | 7 |
| road | 13 | 13 | 12 | 10 | 11 | 10 | 10 | 7 |
| Total network mistakes | 26 | 24 | 21 | 19 | 32 | 31 | 29 | 24 |
| **Ground-truth mistakes** |  |  |  |  |  |  |  |  |
| background | 16 | 16 | 16 | 16 | 7 | 7 | 7 | 7 |
| buildings | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| woodland | 16 | 15 | 15 | 15 | 17 | 17 | 17 | 16 |
| water | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| road | 11 | 11 | 11 | 10 | 14 | 14 | 13 | 11 |
| Total ground-truth mistakes | 43 | 42 | 42 | 41 | 40 | 40 | 38 | 35 |
| **Ambiguous mistakes** |  |  |  |  |  |  |  |  |
| background | 22 | 21 | 16 | 16 | 56 | 56 | 47 | 42 |
| buildings | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| woodland | 80 | 80 | 66 | 64 | 69 | 62 | 59 | 51 |
| water | 1 | 1 | 0 | 0 | 9 | 9 | 7 | 6 |
| road | 37 | 32 | 20 | 16 | 36 | 29 | 20 | 14 |
| Total ambiguous mistakes | 141 | 134 | 102 | 96 | 172 | 156 | 133 | 113 |
| **Total** |  |  |  |  |  |  |  |  |
| background | 44 | 43 | 37 | 37 | 67 | 67 | 57 | 52 |
| buildings | 1 | 0 | 0 | 0 | 4 | 2 | 2 | 2 |
| woodland | 102 | 100 | 85 | 83 | 93 | 86 | 83 | 73 |
| water | 2 | 1 | 0 | 0 | 19 | 19 | 15 | 13 |
| road | 61 | 56 | 43 | 36 | 61 | 53 | 43 | 32 |
| Total | 210 | 200 | 165 | 156 | 244 | 227 | 200 | 172 |
Table A2.
Outlier analysis when PSPNet and DANet were pretrained.
|  | Pretrained PSPNet |  |  |  | Pretrained DANet |  |  |  |
|---|---|---|---|---|---|---|---|---|
| Thresholds | 0.19% (*) | 0.27% (*) | 0.38% (*) | 0.57% (*) | 0.19% (*) | 0.27% (*) | 0.38% (*) | 0.57% (*) |
| **Network mistakes** |  |  |  |  |  |  |  |  |
| background | 7 | 7 | 6 | 6 | 8 | 7 | 7 | 7 |
| buildings | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| woodland | 8 | 8 | 8 | 7 | 4 | 4 | 4 | 3 |
| water | 2 | 2 | 1 | 1 | 4 | 4 | 3 | 3 |
| road | 10 | 9 | 9 | 9 | 11 | 9 | 9 | 8 |
| Total network mistakes | 27 | 26 | 24 | 23 | 27 | 24 | 23 | 21 |
| **Ground-truth mistakes** |  |  |  |  |  |  |  |  |
| background | 16 | 16 | 16 | 16 | 7 | 7 | 7 | 7 |
| buildings | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| woodland | 16 | 15 | 15 | 15 | 17 | 17 | 17 | 16 |
| water | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| road | 11 | 11 | 11 | 10 | 14 | 14 | 13 | 11 |
| Total ground-truth mistakes | 43 | 42 | 42 | 41 | 40 | 40 | 38 | 35 |
| **Ambiguous mistakes** |  |  |  |  |  |  |  |  |
| background | 32 | 26 | 21 | 17 | 40 | 39 | 28 | 22 |
| buildings | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| woodland | 63 | 59 | 49 | 44 | 62 | 51 | 46 | 38 |
| water | 2 | 2 | 2 | 1 | 4 | 4 | 3 | 3 |
| road | 22 | 15 | 11 | 8 | 24 | 20 | 14 | 6 |
| Total ambiguous mistakes | 120 | 103 | 83 | 70 | 131 | 115 | 91 | 69 |
| **Total** |  |  |  |  |  |  |  |  |
| background | 55 | 49 | 43 | 39 | 55 | 53 | 42 | 36 |
| buildings | 1 | 1 | 0 | 0 | 2 | 2 | 1 | 1 |
| woodland | 87 | 82 | 72 | 66 | 83 | 72 | 67 | 57 |
| water | 4 | 4 | 3 | 2 | 9 | 9 | 6 | 6 |
| road | 43 | 35 | 31 | 27 | 49 | 43 | 36 | 25 |
| Total | 190 | 171 | 149 | 134 | 198 | 179 | 152 | 125 |