**4. Experimental Results**

We report results separately at the pixel level (semantic segmentation task) and at the object level (object detection task).

For the semantic segmentation task, we group the metrics into Dataset Metrics and Class Metrics [33].

The group of **Dataset Metrics** includes semantic segmentation metrics aggregated over the dataset: *Global Accuracy*, *Mean Accuracy* (the mean of the accuracies calculated per class), *Mean IoU* (the mean of the IoUs calculated per class), *Weighted IoU* (the mean of the IoUs, weighted by the number of pixels in each class) and *Mean F-score* (the mean of the F-measures calculated per class).

The group of **Class Metrics** includes semantic segmentation metrics calculated for each class, namely: *Accuracy* (2), *IoU* (3) and *Mean F-score* (F-measure for each class, averaged over all images).

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

$$IoU = \frac{TP}{TP + FP + FN} \tag{3}$$
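Here TP, TN, FP and FN denote the numbers of true positive, true negative, false positive and false negative pixels for a given class. As an illustration, Equations (2) and (3), together with the Mean Accuracy, Mean IoU and Weighted IoU aggregates, could be evaluated from a pixel-level confusion matrix as in the following sketch. The function names `class_metrics` and `dataset_metrics` and the confusion-matrix layout (rows are ground-truth classes, columns are predicted classes) are assumptions made for this example, not the implementation used in our experiments.

```python
import numpy as np


def class_metrics(conf):
    """Per-class Accuracy (Eq. 2) and IoU (Eq. 3) from a pixel confusion matrix.

    Assumed layout: conf[i, j] is the number of pixels whose ground-truth
    class is i and whose predicted class is j.
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    tp = np.diag(conf)
    fn = conf.sum(axis=1) - tp  # pixels of the class that were missed
    fp = conf.sum(axis=0) - tp  # pixels wrongly assigned to the class
    tn = total - tp - fn - fp

    accuracy = (tp + tn) / total  # Eq. (2)
    union = tp + fp + fn
    # Eq. (3); classes that never occur (empty union) get NaN.
    iou = np.divide(tp, union, out=np.full_like(tp, np.nan), where=union > 0)
    return accuracy, iou


def dataset_metrics(conf):
    """Aggregate the per-class metrics over the classes."""
    conf = np.asarray(conf, dtype=float)
    accuracy, iou = class_metrics(conf)
    # Weight each class by its share of ground-truth pixels.
    weights = conf.sum(axis=1) / conf.sum()
    return {
        "mean_accuracy": float(np.nanmean(accuracy)),
        "mean_iou": float(np.nanmean(iou)),
        "weighted_iou": float(np.nansum(weights * iou)),
    }
```

For example, `dataset_metrics([[50, 10], [5, 35]])` treats the first class as having 60 ground-truth pixels, of which 50 are correctly labeled.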

For the object detection task, confusion matrices are calculated by counting a predicted mask and a ground truth mask as a true positive match when their pixel-wise IoU (3) is at least 0.2. Besides confusion matrices, the metrics used for assessing the results of the object detection task are:

$$Precision = \frac{TP}{TP + FP}, \tag{4}$$

$$Recall = \frac{TP}{TP + FN}, \tag{5}$$

$$F_1\ Score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}. \tag{6}$$
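The object-level evaluation can be sketched as follows for a single class. Only the IoU threshold of 0.2 and the three formulas come from the text above. The greedy one-to-one matching strategy, the function names `mask_iou` and `detection_scores`, and the convention of returning 0 when a denominator is empty are assumptions made for this example and may differ from the procedure actually used.

```python
import numpy as np


def mask_iou(a, b):
    """Pixel-wise IoU (Eq. 3) between two binary masks of equal shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return np.logical_and(a, b).sum() / union


def detection_scores(pred_masks, gt_masks, iou_threshold=0.2):
    """Precision, Recall and F1 Score (Eqs. 4-6) for one class.

    Assumption: each prediction is greedily matched to the unmatched
    ground-truth mask with the highest IoU, and the pair counts as a
    true positive when that IoU reaches iou_threshold.
    """
    matched_gt = set()
    tp = 0
    for pred in pred_masks:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_masks):
            if j in matched_gt:
                continue
            iou = mask_iou(pred, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= iou_threshold:
            matched_gt.add(best_j)
            tp += 1

    fp = len(pred_masks) - tp  # predictions without a matching object
    fn = len(gt_masks) - tp    # objects that were not detected

    precision = tp / (tp + fp) if tp + fp else 0.0  # Eq. (4)
    recall = tp / (tp + fn) if tp + fn else 0.0     # Eq. (5)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # Eq. (6)
    return precision, recall, f1
```

In a two-class setting such as ours, the same routine would be applied once per class, with the per-class counts filling the corresponding confusion-matrix entries.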

The best results on non-sclerotic glomeruli were obtained with DeepLab v3+, while for sclerotic glomeruli the best model was SegNet. An example of the output of our semantic segmentation framework is shown in Figure 9.

**Figure 9.** Top Left: original image. Top Right: ground truth. Bottom Left: SegNet prediction. Bottom Right: DeepLab v3+ prediction. Sclerotic and non-sclerotic glomeruli are shown in white and gray, respectively.
