## 3.2. Proposed Workflow

### 3.2.1. CAD Architecture

A high-level overview of the proposed CAD is depicted in Figure 2. Physicians can visualize the WSIs using the Aperio ImageScope software. Supervised learning requires labelled data: pathologists can annotate the slides in ImageScope and export the results as XML files, which we use to feed our neural networks. After training, our models export their output in the same XML format, so physicians can review the CAD annotations directly in ImageScope, with seamless integration. To accomplish the task of calculating the Karpinski histological score, the network architecture must be chosen carefully. All the models have been trained and validated on a machine with the characteristics reported in Table 2.
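The XML interface between ImageScope and the models can be sketched as below. This is a minimal illustration, assuming the usual Aperio layout of `Annotation`/`Region`/`Vertex` elements; the exact attributes in a given export may differ, so tag names here should be treated as assumptions.

```python
import xml.etree.ElementTree as ET

def parse_imagescope_annotations(xml_text):
    """Extract annotated region outlines from an ImageScope XML export.

    Assumes the common Annotations/Annotation/Regions/Region/Vertices/Vertex
    nesting; adjust the tag names if a particular export differs.
    """
    root = ET.fromstring(xml_text)
    regions = []
    for region in root.iter("Region"):
        vertices = [(float(v.get("X")), float(v.get("Y")))
                    for v in region.iter("Vertex")]
        regions.append(vertices)
    return regions

# Hypothetical two-vertex region, for illustration only.
sample = """<Annotations><Annotation><Regions><Region>
<Vertices><Vertex X="10" Y="20"/><Vertex X="30" Y="40"/></Vertices>
</Region></Regions></Annotation></Annotations>"""
print(parse_imagescope_annotations(sample))
```

Writing model predictions back follows the same structure in reverse: build the element tree and serialize it with `ET.tostring`.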

**Figure 2.** CAD architecture. Physicians can visualize and annotate the WSIs using Aperio ImageScope software. The developed Deep Learning models can interact with ImageScope through an XML interface.


**Table 2.** System Details.

### 3.2.2. Semantic Segmentation Workflow

To obtain an estimate of the Karpinski score, we must detect and classify all the glomeruli which appear in the WSI. We first use a semantic segmentation CNN to obtain a pixel-level classification, distinguishing between pixels which belong to the background, sclerotic glomeruli, and non-sclerotic glomeruli. Then, we must turn these pixel-level classifications into object detections, so that we can count the number of sclerotic and non-sclerotic glomeruli. The general schema for our semantic segmentation-based glomerular detector is depicted in Figure 3.

**Figure 3.** Semantic Segmentation approach architecture. The top part describes how to train the CNN. The bottom part explains how to use the trained model for performing inference, and the related morphological and clustering post-processing steps.

The first step in our workflow consists of segmenting the sections present in the WSI. For this purpose, we used classical Image Processing techniques such as thresholding, morphological operators, connected components labelling and, eventually, clustering. A similar preprocessing step has also been performed by Ledbetter et al. [3]. We refer to the module performing this step as the Sections Extractor. To reduce the very large dimension of WSIs, which can be overwhelming for Deep Learning algorithms, we undersampled the sections by a factor of 4. The original WSIs have a magnification of 20×; after undersampling, this becomes equivalent to a magnification of 5×. This operation effectively downsamples the images from a resolution of about 8000 × 8000 pixels to about 2000 × 2000 pixels. Since a section obtained this way was still too large to fit in our GPU, we divided it into patches. During training, we randomly sampled patches of size 656 × 656 pixels, with a mechanism to avoid taking too many patches containing only negative samples. The random patches sampled during training are then fed to a data augmentation block that performs the augmentations reported in Table 3. Augmentations are generated on-the-fly for each epoch within random ranges, so the network always processes slightly different input data, thus reducing the risk of overfitting. In the inference phase, we take patches of size 656 × 656 pixels, with an overlap of 200 × 200 pixels between successive windows. Please note that in semantic segmentation it is important to have a larger context when performing inference with a sliding-window approach [28]. After we get the predicted masks for glomeruli at patch level, we project them onto the original WSI to get the WSI-level predicted mask. At this point, we apply morphological operators to remove noisy points and smooth the glomeruli shapes.
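The Sections Extractor step can be sketched as follows. This is an illustrative approximation, not the exact implementation: the threshold and minimum-area values are placeholders, and strided subsampling stands in for the bicubic resizing described later.

```python
import numpy as np
from scipy import ndimage

def extract_sections(wsi_gray, threshold=200, min_area=5000, factor=4):
    """Locate tissue sections in a grayscale WSI and undersample them.

    Thresholds out the bright background, labels connected components,
    and keeps components above min_area pixels; each section is then
    undersampled by `factor`. Threshold/min_area are illustrative values.
    """
    tissue = wsi_gray < threshold              # tissue is darker than background
    labels, _ = ndimage.label(tissue)
    sections = []
    for obj in ndimage.find_objects(labels):
        region = wsi_gray[obj]
        if region.size >= min_area:
            # Naive strided undersampling; the real pipeline would use
            # bicubic interpolation for intensity images.
            sections.append(region[::factor, ::factor])
    return sections
```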
We then analyze shape descriptors to determine whether a clustering operation is necessary. Finally, the obtained mask is projected back to the 20× resolution, corresponding to oversampling by a factor of 4, using nearest-neighbour interpolation. Please note that in this work, all resizing operations involving the digital pathology images use bicubic interpolation, while all resizing operations involving the categorical masks use nearest-neighbour interpolation.
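The overlapping sliding-window inference can be sketched as below. The patch size and overlap match the text; the averaging of scores in overlapping regions and the border handling are our assumptions, since the paper does not specify how overlapping predictions are merged.

```python
import numpy as np

def sliding_window_predict(image, predict_fn, patch=656, overlap=200):
    """Tile an image with overlapping windows and stitch per-pixel scores.

    predict_fn maps a (patch, patch) array to a per-pixel score map.
    Overlapping scores are averaged; pixels past the last full window
    are left uncovered in this simplified sketch.
    """
    h, w = image.shape
    stride = patch - overlap
    score = np.zeros((h, w), dtype=float)
    count = np.zeros((h, w), dtype=float)
    for y in range(0, max(h - patch, 0) + 1, stride):
        for x in range(0, max(w - patch, 0) + 1, stride):
            tile = image[y:y + patch, x:x + patch]
            score[y:y + patch, x:x + patch] += predict_fn(tile)
            count[y:y + patch, x:x + patch] += 1
    return score / np.maximum(count, 1)   # average where windows overlap
```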

### 3.2.3. Morphological Operators and Clustering

Adapting a semantic segmentation network to perform object detection poses some challenges. Semantic segmentation labels individual pixels, which mainly captures textural information. In contrast to architectures explicitly tailored to Object Detection, such as Faster R-CNN [29] or Mask R-CNN [30], which rely on anchor boxes, a semantic segmentation network does not look for objects; it just tries to classify individual pixels. To extend the semantic segmentation model into an instance segmentation one, we apply morphological operators and clustering algorithms as post-processing steps.

Morphological operators are applied to the binary masks obtained as output of the semantic segmentation networks. First, we smooth the shapes of the objects with a morphological closing operation, using a disk of radius 5 pixels as structuring element, followed by a morphological flood-fill operation. Then, we delete small objects and noisy points using an opening operator, with a disk of radius 10 pixels as structuring element, and an area opening operator, which removes connected regions with an area below 1000 pixels. Examples are depicted in Figure 4, where the binary masks are overlaid on the biopsy images for visualization purposes. Masks corresponding to non-sclerotic and sclerotic glomeruli are colored green and red, respectively. Lastly, we analyze the shape descriptors of each object to determine whether there are touching objects that need to be clustered. The sequence of morphological operators is depicted in Figure 5.
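The sequence above (closing with a radius-5 disk, flood fill, opening with a radius-10 disk, area opening below 1000 pixels) can be sketched with `scipy.ndimage` primitives. The disk construction and the area-opening-via-labelling trick are our implementation choices, not taken from the paper.

```python
import numpy as np
from scipy import ndimage

def clean_mask(mask, close_radius=5, open_radius=10, min_area=1000):
    """Post-process a binary glomerulus mask following the described
    sequence: closing + flood fill, then opening and small-object removal."""
    def disk(r):
        # Boolean disk structuring element of radius r.
        y, x = np.ogrid[-r:r + 1, -r:r + 1]
        return x * x + y * y <= r * r

    m = ndimage.binary_closing(mask, structure=disk(close_radius))
    m = ndimage.binary_fill_holes(m)                     # flood-fill interior holes
    m = ndimage.binary_opening(m, structure=disk(open_radius))
    # Area opening: drop connected components below min_area pixels.
    labels, _ = ndimage.label(m)
    sizes = np.bincount(labels.ravel())
    keep = sizes >= min_area
    keep[0] = False                                      # background stays off
    return keep[labels]
```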

**Figure 4.** (**Left**) Semantic Segmentation output. (**Right**) After Morphological Operators.

**Figure 5.** Morphological operators sequence applied to the output masks from the semantic segmentation network. The output of the morphological post-processing is used for calculating shape descriptors to eventually perform clustering.

An important observation is that individual glomeruli have convex shapes, so their area is very close to their convex hull area. We perform a *K*-means clustering, with the number of clusters chosen according to the difference between the convex hull area and the area, as specified in Equation (1).

$$
\text{deltaArea} = \text{convexHullArea} - \text{area} \tag{1}
$$

We choose the number *K* of clusters according to deltaArea: if deltaArea ≤ 900, *K* = 1; if 900 < deltaArea ≤ 5000, *K* = 2; if deltaArea > 5000, *K* = 3. The values of deltaArea and the corresponding *K* have been empirically determined on the trainval set. The confusion matrices reported later have been obtained after the clustering with this deltaArea-based configuration. Examples of glomeruli before clustering are depicted in Figure 6a,c. The corresponding images after clustering are shown in Figure 6b,d.
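The deltaArea rule and the subsequent split can be sketched as follows. We assume the K-means runs on the foreground pixel coordinates of a connected blob (the paper does not state the feature space explicitly), and we use `scipy` in place of whatever library the authors used.

```python
import numpy as np
from scipy.spatial import ConvexHull
from scipy.cluster.vq import kmeans2

def split_touching_glomeruli(mask):
    """Pick K from deltaArea (Equation (1)) and split a connected blob
    into K labelled objects via K-means on the foreground coordinates."""
    coords = np.argwhere(mask)                 # (row, col) foreground pixels
    area = len(coords)
    hull_area = ConvexHull(coords).volume      # .volume is the area in 2-D
    delta_area = hull_area - area
    if delta_area <= 900:
        k = 1
    elif delta_area <= 5000:
        k = 2
    else:
        k = 3
    if k == 1:
        return mask.astype(int)                # single glomerulus, label 1
    _, assignment = kmeans2(coords.astype(float), k, minit="++")
    out = np.zeros(mask.shape, dtype=int)
    out[coords[:, 0], coords[:, 1]] = assignment + 1
    return out
```

The empirical thresholds (900, 5000) are taken directly from the text; everything else is illustrative.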

**Figure 6.** Examples of K-means clustering for both sclerotic and non-sclerotic glomeruli. The number *K* of clusters is determined according to deltaArea defined in (1). (**a**) Sclerotic glomeruli before clustering. (**b**) Sclerotic glomeruli after clustering, with *K* = 2. (**c**) Non-sclerotic glomeruli before clustering. (**d**) Non-sclerotic glomeruli after clustering, with *K* = 3.
