#### *2.2. NDG-CAM*

In this section, we introduce the methodology adopted for NDG-CAM, which consists of several steps. First, a semantic segmentation architecture trained for nuclear segmentation is required. Different experimental configurations of the datasets and network architectures were compared in order to find the most suitable model, with details reported in Sections 2.2.1 and 2.2.2. Then, the Grad-CAM technique for semantic segmentation, which is still underexplored compared to Grad-CAM for classification, was employed to obtain saliency maps of the nuclei, with higher intensity values corresponding to positions closer to the centroids. Subsequently, a search for local maxima, combined with post-processing and clustering, allowed for the detection and, eventually, instance segmentation of the nuclei. This process is presented in Section 2.2.3. Compared to specialized architectures, such as those used for instance segmentation, semantic segmentation networks are simpler and faster to train. In addition, our system can be trained even when labels do not distinguish between different nuclear instances, which would not be possible for instance segmentation models.

#### 2.2.1. Semantic Segmentation Workflow

Starting from the datasets described in the previous sections, the following experiments were carried out, all with images at a size of 512 × 512:


In the first two experiments, images were padded from 500 × 500 to 512 × 512 using mirror padding. In the last experiment, the images were padded from 1000 × 1000 to 1024 × 1024 with mirror padding and subsequently divided into four tiles of 512 × 512. For each experiment, different deep network architectures were trained and compared: U-Net [24], SegNet [25], and DeepLab v3+ [26] in three different backbone configurations, namely ResNet18, ResNet50 [27], and MobileNet-v2 [28]. The aforementioned experiments were carried out in MATLAB R2021a.
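The padding-and-tiling step can be sketched as follows (a minimal NumPy version for illustration; the paper used MATLAB, and splitting the padding evenly between the two borders is our assumption):

```python
import numpy as np

def pad_and_tile(img, target=512):
    """Mirror-pad an image so each side is a multiple of `target`,
    then split it into non-overlapping target x target tiles."""
    h, w = img.shape[:2]
    new_h = int(np.ceil(h / target)) * target
    new_w = int(np.ceil(w / target)) * target
    pad_h, pad_w = new_h - h, new_w - w
    # symmetric (mirror) padding, split evenly between the two borders
    pads = ((pad_h // 2, pad_h - pad_h // 2),
            (pad_w // 2, pad_w - pad_w // 2)) + ((0, 0),) * (img.ndim - 2)
    padded = np.pad(img, pads, mode="symmetric")
    return [padded[r:r + target, c:c + target]
            for r in range(0, new_h, target)
            for c in range(0, new_w, target)]
```

With these sizes, a 500 × 500 image yields a single 512 × 512 tile, and a 1000 × 1000 image yields four, matching the experimental configurations described above.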

#### 2.2.2. Network Architectures

The segmentation phase is a milestone for the detection phase; this step aims to discriminate between cell nuclei and the background. Semantic segmentation architectures play a role of pivotal importance in deep learning-based medical image analysis [9,29–31]. Semantic segmentation associates a label or category with each pixel of an input image, thus allowing the pixelwise spatial localization of each object category appearing in the scene.

In the specific case under analysis, the goal was to segment the cell nuclei in a robust way, so as to provide satisfactory results even when the algorithm is applied to different images of the same type. For this reason, it was decided to carry out the same experiments with several convolutional architectures.

The considered architectures include:


An example of semantic segmentation prediction from DeepLab v3+ with backbone ResNet18 is shown in Figure 3.

**Figure 3.** Semantic segmentation output for nuclei images. (**Left**) Original image. (**Middle**) Ground truth. (**Right**) Prediction of experiment (b) with DeepLab v3+ and backbone ResNet18.

#### 2.2.3. Nuclei Detection with Grad-CAM

Once the best-performing network was identified, the output returned by the semantic segmentation was a mask in which the pixels of the input image were classified as belonging to the foreground (i.e., nucleus) or background class. As mentioned previously, this did not allow us to distinguish multiple instances of the same object, and therefore to separate multiple nuclei adjacent to each other.

In this scenario, the detection phase begins. In fact, after the semantic segmentation, post-processing was carried out in order to solve this problem. The first step was to calculate the Grad-CAM of the input image according to the chosen network. A CNN is often seen as a black box, or rather, as a model with parameters *W* that, given an input image *X*, is able to map to the related output *y* through a function *f*(*X*, *W*). XAI techniques have been designed in order to unveil the underlying mechanisms involved in the processing stages of deep neural networks, and are recently gaining a lot of attention in medical imaging and clinical decision support systems [32–35].

During the training phase, even if we are capable of achieving high performance according to the considered metrics, we do not know which image features are more determinant for the network to make its choices. One of the ways to visually solve this problem is Grad-CAM [35].

Grad-CAM is typically used in image-classification scenarios [36], but it can also be extended to semantic segmentation problems [37]. In general, the heatmap *L<sup>c</sup>* for class *c* is generated by using the weights *a<sub>c</sub><sup>k</sup>* (as defined in Equation (1)) to sum the feature maps *A<sup>k</sup>*, as in Equation (2).

$$a\_c^k = \frac{1}{N} \sum\_{u, v} \frac{\partial y^{c}}{\partial A\_{u, v}^k} \tag{1}$$

$$L^c = \text{ReLU}\left(\sum\_k a\_c^k A^k\right) \tag{2}$$

*N* is the number of pixels and (*u*, *v*) are the pixel indices. ReLU is applied pixelwise to clip negative values at zero, so as to highlight only the areas that positively contribute to the decision for class *c*. The difference with the classification task is that for semantic segmentation the scalar class score *y<sup>c</sup>* is obtained by reducing the pixelwise class scores for the class of interest to a scalar [37], as in Equation (3).

$$y^c = \sum\_{(u, v) \in P} Y^c\_{(u, v)} \tag{3}$$

*P* is the set of pixel indices of interest in the output layer: in our case, the softmax layer before the pixel classification layer. Higher values of the *L<sup>c</sup>* map indicate which areas of the image are important for the decision to classify pixels as class *c*.
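Equations (1) and (2) can be expressed compactly given precomputed activations and gradients (a NumPy sketch for illustration; it assumes the gradients were taken with respect to the reduced score *y<sup>c</sup>* of Equation (3), and the function name is ours):

```python
import numpy as np

def grad_cam_map(activations, gradients):
    """Equations (1)-(2): heatmap L^c from feature maps A^k and the
    gradients of the reduced class score y^c with respect to them.

    activations, gradients: arrays of shape (K, H, W)."""
    # Eq. (1): global-average-pool the gradients to obtain the weights a_c^k
    weights = gradients.mean(axis=(1, 2))            # shape (K,)
    # Eq. (2): weighted sum of the feature maps, clipped at zero (ReLU)
    cam = np.einsum("k,khw->hw", weights, activations)
    return np.maximum(cam, 0.0)
```

In an actual framework (e.g., PyTorch), the activations and gradients would be captured with forward and backward hooks on the chosen layer.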

In the proposed approach, the activation of the network for the nucleus class was analyzed, obtaining a probability map that we denote as CAM-Map. As visible in Figure 4C, activations are higher in correspondence with the centroids of the nuclei, even when they are adjacent to each other.

From the CAM-Map, we applied a grayscale morphological dilation operator with a spherical structuring element of radius 7. The result is depicted in Figure 4D. This step enlarged the activation areas so that no false nuclei were identified in nearby regions where activations were not high enough compared to the maximum point.

Then, as portrayed in Figure 4E, we proceeded with the calculation of the local maxima of the regions and the localization of all the connected components, whose geometric centroids correspond to the identified nuclei.
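The dilation and local-maxima steps described above can be sketched with SciPy as follows (an illustrative version, assuming a 2D disk as the structuring element; `detect_centroids` is a hypothetical helper name, not the authors' implementation):

```python
import numpy as np
from scipy import ndimage as ndi

def detect_centroids(cam_map, radius=7):
    """Grayscale dilation of the CAM-Map with a disk of the given radius,
    regional-maxima extraction, and centroids of the resulting
    connected components."""
    # disk-shaped structuring element of the given radius
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disk = (xx ** 2 + yy ** 2) <= radius ** 2
    dilated = ndi.grey_dilation(cam_map, footprint=disk)
    # a pixel is a regional maximum where the original map already
    # equals the dilated (local-maximum-filtered) value
    maxima = (cam_map == dilated) & (cam_map > 0)
    labels, n = ndi.label(maxima)
    return ndi.center_of_mass(maxima, labels, range(1, n + 1))
```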

Once the centroids were found, K-means clustering, with K equal to the number of connected components, was exploited to associate the adjacent pixels with each nucleus, so as to obtain the overall predicted mask of the original image. The final mask is reported in Figure 4F.
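The assignment step can be sketched as follows (a simplified version: with K fixed and the clusters seeded at the detected centroids, the clustering essentially assigns each foreground pixel to its nearest centroid, which is what this hypothetical helper performs):

```python
import numpy as np
from scipy.spatial.distance import cdist

def assign_pixels(mask, centroids):
    """Associate every foreground pixel of the semantic mask with a
    nucleus, producing an instance label map (0 = background)."""
    labels = np.zeros(mask.shape, dtype=int)
    pix = np.argwhere(mask)                       # foreground coordinates
    d = cdist(pix, np.asarray(centroids))         # pixel-to-centroid distances
    labels[tuple(pix.T)] = d.argmin(axis=1) + 1   # 1-based instance ids
    return labels
```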

**Figure 4.** NDG-CAM Detection workflow. (**A**) Zone with multiple neighboring instances of nuclei. (**B**) Failure to recognize adjacent nuclei. (**C**) Grad-CAM for semantic segmentation. (**D**) Dilated image. (**E**) Connected components. (**F**) Detection prediction.

#### *2.3. Instance Segmentation*

Object detection involves localizing, with a bounding box, all the different objects of interest present in a scene. Instance segmentation further extends this task by also delineating a precise mask around each object. Architectures for object detection are usually divided into one-stage and two-stage models, with the former being faster and the latter being more accurate. Within the realm of two-stage object detectors, a pivotal role has been played by architectures from the R-CNN family [38].

Mask R-CNN evolves the R-CNN family by adding a semantic segmentation branch, making the model capable of performing instance segmentation [17]. The overall Mask R-CNN architecture is composed of two parts: the backbone architecture, which performs feature extraction, and the head architecture, which performs classification, bounding box regression, and mask prediction.

We employed Detectron2 [39], a platform powered by the PyTorch framework that provides state-of-the-art detection and segmentation algorithms. It includes high-quality implementations of the most popular object detection algorithms, comprising different variants of the pioneering Mask R-CNN model. Detectron2 has an extensible design, so it can easily be employed to implement cutting-edge research projects.

The NuCLS dataset [22] was chosen to train the instance segmentation model mask\_rcnn\_R\_50\_DC5\_1x. Annotations were converted into the COCO annotation format for adoption in the Detectron2 framework.
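A single COCO-style instance record can be built as in the following sketch (field names follow the COCO annotation format; the helper itself is illustrative, not the authors' conversion script):

```python
def to_coco_annotation(ann_id, image_id, polygon, category_id=1):
    """Build one COCO-format instance annotation from a nucleus contour
    given as [(x1, y1), (x2, y2), ...] in pixel coordinates."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    x, y = min(xs), min(ys)
    w, h = max(xs) - x, max(ys) - y
    # shoelace formula for the polygon area
    area = 0.5 * abs(sum(polygon[i][0] * polygon[i - 1][1]
                         - polygon[i - 1][0] * polygon[i][1]
                         for i in range(len(polygon))))
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "segmentation": [[c for p in polygon for c in p]],  # flat x,y list
        "bbox": [x, y, w, h],                               # COCO uses x,y,w,h
        "area": area,
        "iscrowd": 0,
    }
```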

#### *2.4. Implementation Details*

All the semantic segmentation networks have been trained on a laptop with a GeForce GTX960M. For the training, the chosen optimizer was SGDM, with a starting learning rate of 0.05. The learning rate schedule was piecewise, with a drop factor of 0.94 and a drop period of 2. The *L*<sup>2</sup> regularization parameter was set to 0.0005. With a batch size of 2, 15 epochs lasted roughly 105 min for the best-performing architecture, DeepLab v3+ with ResNet18 as the backbone.

The Mask R-CNN model, being heavier, has been trained on a Google Colab Pro environment. With a Tesla P100, 20,000 iterations were carried out in roughly 110 min. The chosen optimizer was SGDM, as set by default in the Detectron2 environment, with a starting learning rate of 0.00025.

#### *2.5. Combined Model*

In order to obtain the advantages of both approaches, a combined model has been developed.

It exploits a criterion for merging the outputs of NDG-CAM detection and Mask R-CNN. In detail, a distance criterion is used to check whether a nucleus was found by only one of the approaches; in that case, the nucleus is simply retained. Instead, if nuclei centroids from both methods are found in proximity, only the ones found by Mask R-CNN are retained. The combined methodology aims to increase recall, which is very important because nuclei detection is the first stage of further analyses.
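The merging criterion can be sketched as follows (the proximity radius is left as a parameter, since its exact value is not stated here; the helper name is hypothetical):

```python
import numpy as np
from scipy.spatial.distance import cdist

def merge_detections(ndg_cam, mask_rcnn, radius):
    """Merge centroid lists from the two detectors: a nucleus found by
    only one method is retained; where both fire within `radius`, only
    the Mask R-CNN centroid is kept."""
    ndg_cam = np.asarray(ndg_cam, dtype=float)
    mask_rcnn = np.asarray(mask_rcnn, dtype=float)
    if len(ndg_cam) == 0:
        return mask_rcnn
    if len(mask_rcnn) == 0:
        return ndg_cam
    d = cdist(ndg_cam, mask_rcnn)
    # keep the NDG-CAM centroids with no Mask R-CNN centroid nearby
    lone = d.min(axis=1) > radius
    return np.vstack([mask_rcnn, ndg_cam[lone]])
```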

#### *2.6. Evaluation Metrics*

Each semantic segmentation architecture described in Section 2.2.1 was tested in all three experimental configurations mentioned. In order to assess the goodness of pixelwise classification performed by semantic segmentation networks, the pixelwise precision, recall, and Dice coefficient were considered as performance indices. Given pixelwise true positives (TP), false positives (FP) and false negatives (FN), then precision, recall, and Dice coefficient can be defined as in Equations (4)–(6), respectively:

$$Precision = \frac{TP}{TP + FP} \tag{4}$$

$$Recall = \frac{TP}{TP + FN} \tag{5}$$

$$Dice = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}.\tag{6}$$

For all these metrics, a higher value denotes a better segmentation result; that is, predicted masks are more similar to ground truth ones.
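Equations (4)–(6) translate directly into code (a minimal helper; it assumes the denominators are nonzero):

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and Dice coefficient of Equations (4)-(6)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, dice
```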

Instead, for assessing the detection procedure, we considered two kinds of metrics. The first is based on the simple calculation of the number of detected nuclei with respect to the ground truth. The error (*ea*), defined in Equation (7), is given by the difference in absolute value between the number of nuclei found and the real number, divided by the latter. An example of the prediction vs. ground truth result, which is the basis for enumerating nuclei, is depicted in Figure 5A. Because we were also interested in understanding if our algorithm was more prone toward overdetection or underdetection, a signed error (*es*), defined in Equation (8), was also evaluated:

$$e\_a = \frac{|d-g|}{g} \tag{7}$$

$$e\_s = \frac{d - g}{g}.\tag{8}$$

In these two equations, *d* denotes the number of detected nuclei, whereas *g* is the number of ground truth nuclei.

**Figure 5.** Example of calculation of evaluation metrics for object detection. (**A**) Prediction vs ground truth. Yellow, ground truth; green, prediction; (**B**) Differences between prediction and ground truth. Yellow, detection FN; red, detection FP.

The second category of metrics includes Dice coefficient, precision, and recall for object detection, which can provide more information about the quality of the detection results. In this case, we are not simply rewarding predicting as many nuclei as are present in the ground truth, but we also want to ensure that the detected nuclei are in the right place. To achieve this, we need to determine the object detection FP and FN, as can be seen in Figure 5B. As the first step, we computed the distance matrix between the centroids of the detected nuclei and the real ones. To decide whether a detection actually corresponds to a nucleus centroid, a distance threshold *ξ* was considered, equal to the mean radius of the nuclei of each image [16]. If the distance between a prediction and a ground truth annotation is less than or equal to *ξ*, the prediction is counted as a TP. If more than one detection verifies this condition, the one closest to the ground truth position is counted as a TP and the others as FP. Detections farther than *ξ* from any ground truth location are counted as FP, and all ground truth annotations without close detections are marked as FN. Lastly, the following control condition was added: if the distance between an FP and an FN is less than a second distance threshold, set to 6 (a value close to the nuclear radius), the counts of FP and FN are each decreased by one, whereas TP is increased by one. The pseudocode for determining TP, FP, and FN is reported in Algorithm 1.

In order to assess the statistical significance of the obtained results calculated per case, we determined the *p*-value with the two-tailed Wilcoxon signed-rank test. The threshold for significance has been set to 0.05.
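The significance test corresponds to `scipy.stats.wilcoxon` applied to the paired per-case scores (the per-case values below are illustrative placeholders, not results from the paper):

```python
from scipy.stats import wilcoxon

# per-case Dice coefficients of two methods (hypothetical values)
dice_a = [0.82, 0.85, 0.80, 0.88, 0.83, 0.86, 0.81, 0.84]
dice_b = [0.78, 0.80, 0.79, 0.82, 0.80, 0.83, 0.77, 0.81]

# two-tailed Wilcoxon signed-rank test on the paired per-case scores
stat, p_value = wilcoxon(dice_a, dice_b, alternative="two-sided")
significant = p_value < 0.05
```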

```
Algorithm 1: Object Detection TP, FP, FN calculation.
  input : gt, the ground truth nuclei centroids, an array of g coordinate pairs
         pred, the predicted nuclei centroids, an array of d coordinate pairs
         ξ, the mean radius of the ground truth nuclei
         ε, the secondary distance threshold // set to 6
  output: TP, the true positives
         FP, the false positives
         FN, the false negatives
  g = size(gt)
  TP = 0
  FP = 0
  FN = 0
  idxFP = list() // a list of false positive indexes into pred
  idxFN = list() // a list of false negative indexes into gt
  δ = distance(gt, pred) // the g × d distance matrix
  i = 0
  while i < g do
     v = δ[i, :] // distances from ground truth nucleus i to all predictions
     idx = where(v ≤ ξ) // a (possibly empty) array of prediction indexes
     if size(idx) == 1 then
        TP = TP + 1
     else if size(idx) > 1 then
        TP = TP + 1 // the closest prediction is the TP
        FP = FP + (size(idx) − 1)
        idxFP.extend(idx \ {argmin of v over idx}) // the others are FPs
     else // size(idx) == 0
        FN = FN + 1
        idxFN.append(i)
     end
     i = i + 1
  end
  arrFN = filter(gt, idxFN) // extract the false negative centroids
  p = 0
  while p < size(idxFP) do
     a = 0
     while a < size(arrFN) do
        Δ = distance(pred[idxFP[p]], arrFN[a])
        if (Δ ≤ ε) then // rescue: this FP actually matches a missed nucleus
           FP = FP − 1
           FN = FN − 1
           TP = TP + 1
           arrFN.remove(a) // do not rescue the same FN twice
           break
        end
        a = a + 1
     end
     p = p + 1
  end
```
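A runnable version of Algorithm 1, following the textual description above rather than the pseudocode line by line (the function name and the greedy closest-first matching order are our assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist

def detection_counts(gt, pred, xi, eps=6.0):
    """TP/FP/FN for centroid detection, with primary threshold xi (the
    mean nuclear radius) and secondary rescue threshold eps (set to 6)."""
    gt, pred = np.asarray(gt, float), np.asarray(pred, float)
    d = cdist(pred, gt)                  # d x g distance matrix
    matched, tp, fp_pts = set(), 0, []
    for p in range(len(pred)):
        # unmatched ground truth nuclei within xi, closest first
        cand = [j for j in np.argsort(d[p])
                if d[p, j] <= xi and j not in matched]
        if cand:
            matched.add(cand[0])         # closest free nucleus -> TP
            tp += 1
        else:
            fp_pts.append(pred[p])       # nothing nearby -> FP
    fn_pts = [gt[j] for j in range(len(gt)) if j not in matched]
    fp, fn = len(fp_pts), len(fn_pts)
    # rescue pass: an FP within eps of a missed nucleus becomes a TP
    rescued = set()
    for p in fp_pts:
        for k, q in enumerate(fn_pts):
            if k not in rescued and np.linalg.norm(p - q) <= eps:
                rescued.add(k)
                tp, fp, fn = tp + 1, fp - 1, fn - 1
                break
    return tp, fp, fn
```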
#### **3. Results**

The automatic segmentation of cell nuclei has attracted significant interest from the scientific community, as their identification is an important starting point for many medical analyses based on histopathological images. In this work, for the semantic segmentation phase, different architectures were elaborated and tested on different datasets, for a total of 15 experiments. For each of them, performance indices were calculated to identify the best model with which to proceed to the subsequent phases. From this comparison, it emerged that the best performance can be obtained with the experimental configuration (b) defined in Section 2.2.1.

Table 2 reports the results obtained for each network architecture in the semantic segmentation task. For DeepLab v3+, the backbone architecture is included within square brackets.


**Table 2.** Performance comparison between considered network architectures for semantic segmentation.

It therefore emerges that the best solution coincides with experiment (b) conducted with DeepLab v3+ using the ResNet18 network as the backbone. It allowed us to obtain a pixelwise Dice coefficient of 74.23 ± 4.85%, a precision of 76.42 ± 8.69%, and a recall of 74.25 ± 11.23%.

DeepLab v3+ was hence chosen as the base model to be exploited in the detection phase. By exploiting the Grad-CAM for semantic segmentation, it was possible to retrieve nuclei centroids via local maxima of the obtained saliency maps.

On the V1 dataset, the experimental results demonstrated an *ea* of the identified nuclei equal to 2.11%, 2.43%, and 11.50% for the NDG-CAM, Mask R-CNN, and combined method, respectively. When calculated per case, the values for *es* were 1.84 ± 13.05%, 3.46 ± 6.15%, and 14.45 ± 11.22%, indicating that the models generally tend to overdetect on this dataset.

In the V4 dataset, the *ea* had a value of 15.26%, 59.22%, and 14.10% for the NDG-CAM, Mask R-CNN, and combined method, respectively. When calculated per case, the values for *es* were −16.86 ± 13.79%, −60.13 ± 13.88%, and −14.88 ± 12.86%, showing that the models have a tendency to underdetect on this dataset. In particular, it was noticed that very small nuclei, such as those of lymphocytes, and elongated ones, such as those of fibrocytes, were underdetected.

For the detection task, the results are reported in Table 3. In the V1 dataset, NDG-CAM, Mask R-CNN, and the combined method were capable of achieving a Dice coefficient of 0.824, 0.878, and 0.884, respectively. Thus, the combined method obtained slightly better results than the other methods. As for the recall, the combined method decisively surpasses the other approaches, with a value of 0.934.

In the V4 dataset, the combined method proves to be the best, achieving a recall of 0.850 and a Dice coefficient of 0.914. Mask R-CNN performs poorly in this case, with a recall of 0.403 and a Dice coefficient of 0.573.


**Table 3.** Comparison of detection methods, extending the one proposed by Alom et al. [4] and Sirinukunwattana et al. [14].

The violin plots calculated per tile are reported in Figure 6 for the V1 and V4 datasets, comparing the NDG-CAM detection method, Mask R-CNN, and the combined approach. It is worth noting that the Mask R-CNN model works very well on the V1 dataset but performs poorly on the V4 one. On the other hand, the NDG-CAM and the combined methods maintain high levels of performance in all scenarios.

**Figure 6.** Violin plots for the detection metrics calculated per case. (**Left**) V1 dataset. (**Right**) V4 dataset. In the figure, ns stands for nonsignificant; \* denotes *p*-value < 0.05; \*\* indicates *p*-value < 0.01; and \*\*\* means *p*-value < 0.001.

In the V1 dataset, the combined model does not show a Dice coefficient that is higher in a statistically significant way than the Mask R-CNN approach, with a *p*-value of 0.07. On the other hand, the recall was much higher for the combined method, resulting in a *p*-value < 0.001 for both NDG-CAM and Mask R-CNN. In the V4 dataset, both the NDG-CAM and the combined method showed much stronger results than Mask R-CNN, with a *p*-value less than 0.001 in both cases for Dice coefficient and recall. Moreover, the combined approach shows a statistically significant advantage over NDG-CAM (*p*-value = 0.048) for the Dice coefficient.

#### **4. Discussion**

In order to show the effectiveness of the proposed method, we compared it with existing state-of-the-art approaches. It has to be noted that our method allows exploiting semantic segmentation architectures to realize nuclei detection, whereas other approaches usually involve networks specialized for this task. Several approaches proposed in the literature try to localize the centers of the nuclei or proximity maps to those centers [3,14,16]. Although the results are promising, these approaches require instance-level annotations. On the other hand, the proposed method exploits an XAI technique, Grad-CAM for semantic segmentation, to reconstruct post hoc saliency maps that are related to the centers of the nuclei, showing that semantic segmentation networks can perform detection tasks without specialized modifications.

The most widespread metrics employed for assessing object detection algorithms are precision, recall, and Dice coefficient. These metrics also account for the position of the detected nuclei, not only for their counts.

A quantitative comparison between considered approaches and existing ones from the literature is presented in Table 3.

From this comparative analysis, it emerges that the proposed method is perfectly aligned with the state of the art, without the need to implement specific kinds of specialized loss functions [24] or architectures for detection [17,40].

Indeed, the NDG-CAM method alone was capable of achieving a Dice coefficient for object detection of 0.824, whereas UD-Net [4], the top-performing method among those selected from the literature, had a Dice coefficient of 0.828. When the proposed NDG-CAM detection method is used in combination with Mask R-CNN, the recall increases to 0.934 and the Dice coefficient to 0.884, surpassing the current state-of-the-art methods for nuclei detection. On the collected external validation set, metrics are even higher, with a Dice coefficient of 0.914, showing the generalization capabilities of the proposed workflow.

Qualitative results for the object detection pipeline involving semantic segmentation and Grad-CAM on the images of the independent external validation set V4 are depicted in Figure 7. Instead, Figure 8 shows the final detection results on the validation datasets V1 and V4 with the NDG-CAM method, the Mask R-CNN architecture, and the combined adoption of both methods.

It can be seen from the images of Figure 7, taken from the V4 dataset, that precision is very high. Indeed, virtually all detected nuclei are real. Some small or elongated nuclei, such as lymphocytic or fibrocytic nuclei, are underdetected. This may be due to a lack of proper training datasets with a large variety of nuclear shapes.

**Figure 7.** Examples of the NDG-CAM method on the data from the Pathology Department of IRCCS Istituto Tumori Giovanni Paolo II. Results are shown for the best architecture (DeepLab v3+ with ResNet18 backbone). (**First row**) Original images. (**Second row**) Semantic segmentation. (**Third row**) Instance segmentation after detection of centroids of the nuclei, with each color denoting a different nuclear instance.

The two methods show similar performance on the V1 dataset, as can be observed in Figure 8. Mask R-CNN achieves slightly better performance on this dataset, also considering that it was trained on a larger training set; nevertheless, the combined method proved to be superior. From the same figure, it can be observed that, on the V4 dataset, Mask R-CNN does not generalize properly, resulting in many missed nuclei (low recall).

**Figure 8.** Examples of centroid detection on the validation sets V1 and V4. (**Top row**) Green, NDG-CAM method detections; red, Mask R-CNN detections. (**Bottom row**) Blue, combined method detections. First and second columns show data from V4, whereas the third and fourth columns depict data from V1.

#### **5. Conclusions and Future Works**

In this work, a novel method was presented with the aim of nuclei identification from histological H&E images. In our multi-stage pipeline, the first phase involved semantic segmentation. After various experiments, DeepLab v3+ (ResNet18 backbone) emerged as the best-performing architecture. Subsequently, because this analysis did not allow the distinction of multiple instances of the same object, we proposed a novel detection algorithm, NDG-CAM, which exploited Grad-CAM to solve the problem of separating the instances. Even without the need to use specialized loss functions or architectures, it allowed us to achieve satisfactory results in the detection task, comparable to or even better than more sophisticated training setups [3,6,12,16]. When the method is combined with the Mask R-CNN instance segmentation architecture, results exceed the state-of-the-art methods for nuclei detection.

Even though the local validation set includes only colorectal cancer H&E slides, it has to be considered that each slide contains several tissue types (e.g., stroma, immune infiltration), and the proposed method is able to detect nuclei not only of the tumor or normal epithelium of the colon but also of other cytotypes.

Indeed, we noticed underdetection of lymphocytic or fibrocytic nuclei, and this could be explained by a lack of datasets enriched in these nuclei subtypes. For such a reason, a direction for future works includes the collection of a dataset with multiple and balanced nuclei annotations.

On the clinical side, the proposed workflow could be a valid tool to support pathologists in the detection and reporting of histological samples, thus allowing a considerable saving of time and resources, besides providing an objective tool that is more reliable than manual assessment. Future works will concern the classification of the detected nuclei, in order to estimate how many are malignant or subjected to specific lesions, so that important clinical parameters, such as neoplastic cellularity, can be determined quantitatively.

**Author Contributions:** Conceptualization, N.A., A.B. and V.B.; methodology, N.A., E.P., M.G.T., A.B. and V.B.; software, N.A., E.P. and M.G.T.; validation, C.S., F.A.Z. and S.D.S.; data curation, C.S., F.A.Z. and S.D.S.; writing—original draft preparation, N.A., E.P. and M.G.T.; writing—review and editing, all authors; visualization, N.A., E.P. and M.G.T.; supervision, N.A., A.B. and V.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** The study has been partially funded by the projects "Tecnopolo per la Medicina di Precisione, CUP: B84I18000540002" and "CustOm-made aNTibacterical/bioactive/bioCoated prostheses (CONTACT), CUP: B99C20000300005".

**Institutional Review Board Statement:** The institutional Ethic Committee of the IRCCS Istituto Tumori Giovanni Paolo II approved the study (Prot n. 780/CE).

**Informed Consent Statement:** Patient consent was waived due to the fact that this was a retrospective observational study with anonymized data, already acquired for medical diagnostic purposes.

**Data Availability Statement:** The MoNuSeg [18], CRCHistoPhenotypes [21], and NuCLS [22] datasets are publicly available. The local dataset from IRCCS Istituto Tumori Giovanni Paolo II presented in this study is available upon request from the corresponding author.

**Conflicts of Interest:** The authors declare no conflict of interest.
