#### **4. Discussion**

Finding the most appropriate loss function for prostate segmentation is challenging. In this study we compared the performance of nine loss functions on a 37-patient data set. These nine loss functions were chosen because they are commonly used in medical image segmentation tasks [14]. The 37-patient data set comprised locally acquired data with a common imaging protocol (two resolutions) and a single MRI scanner [16] to avoid variations due to image acquisition. These data were co-registered with whole-mount pathology to provide ground truth delineations of the prostate [16], in contrast to many publicly available datasets that rely on clinician-generated segmentations, which are subject to interobserver variation [4]. A limitation of the generalizability of our study is the small sample size and the homogeneity of the methods used to acquire the MRI data. We therefore recommend that future studies intending to use data from a variety of sources and scanning protocols confirm our findings using the methodology we describe, and consider the most appropriate metric for their evaluation. Publicly available data can be sourced from a variety of locations, such as those described by Ma et al. [14]; however, the purpose of our study was to remove uncertainties due to heterogeneity in data source and clinician contouring, and to focus only on the relative performance of the selected loss functions and a range of metrics for their evaluation. Our study found that the proposed architecture performed with notable variations when different loss functions were applied. As the base and the apex of the prostate are particularly challenging to segment manually due to the lack of a clear boundary [1,17], we also evaluated performance at the mid-gland, apex, and base of the prostate independently.

Focal Tversky had the highest scores for the whole gland in terms of DSC and sensitivity. However, W (BCE + Dice) outperformed all competing methods in precision, HD95, and RAVD, followed by Tversky. Measured by the median and standard deviation, the best performance was achieved with the W (BCE + Dice), Tversky, and Focal Tversky loss functions, although the performance of models with the Focal Tversky, Tversky, W (BCE + Dice), Dice, and IoU loss functions was very close on our dataset. Lower performance was observed using the Surface, BCE, and Focal loss functions. The Focal Tversky and Tversky loss functions have been recommended by other researchers as returning optimal results when their parameters are set to the correct values [15]. For challenging medical segmentation tasks, we suggest using Focal Tversky and W (BCE + Dice): both have adjustable parameters that allow the loss function to be tuned to the application requirements, for example to define the trade-off between false negatives and false positives and thereby address the under- and over-segmentation issues that may arise with other loss functions. As a result, in the future, we plan to investigate the effectiveness of a combination of the Tversky and BCE loss functions for prostate segmentation.

All models achieved higher performance for the mid-gland and lower performance in the apex and base regions. When considering model performance for individual patients, we observed that all models performed similarly on each image, but performance varied across the patient cohort. This may be related to patient-specific image quality; however, all models generalized toward the average prostate shape and failed to perform well for outlier shapes.

Intuitively, model performance can be expected to depend on the choice of the metric used to measure performance and on the principal components driving the loss function. For example, DSC measures the overlap between two regions. If the Dice loss is used, the training process is guided by exactly the same quantity as the final metric, which in theory should yield good performance. This can be seen in Table 2; the Dice loss achieved a consistently high DSC in the whole prostate gland (0.73) as well as the sub-volumes (0.65–0.93). In addition, close variants of the Dice loss, including the Tversky, Focal Tversky, and IoU losses, also obtained high performance (0.63–0.92), only slightly inferior to the Dice loss. Among the losses that are not purely region-based, the compound losses BCE + Dice and W (BCE + Dice) showed relatively high DSC (0.62–0.93), as they contain a Dice loss component. In contrast, Surface loss (boundary-based) and BCE (distribution-based) demonstrated the lowest DSC (0.38–0.75). However, this pattern does not hold across all metrics and categories. For example, HD95 is a boundary-based metric, so Surface loss was expected to achieve high performance; as shown in Table 2, however, Dice loss had the lowest (best) HD95, while Surface loss had the highest. One possible reason is that the Surface loss is relatively hard to train, requiring more epochs to converge. Since the training process was consistent across all loss functions, this may explain why some functions did not perform as well as expected.

To overcome variability in the performance of individual loss functions, compound loss functions can be considered. For example, in prostate segmentation, data imbalance is a major problem, and loss functions such as BCE that assume balanced data are not suitable on their own. As shown in our study, however, weighting BCE and combining it with Dice can improve model performance significantly.

Tuning the hyper-parameters of U-Net, such as the learning rate and number of iterations, requires significant computational time. To address this, we defined the best learning rate for the Dice and BCE loss functions, as most of the other loss functions are variations of these two. We used a grid search to optimize the learning rate and to define the optimal loss function parameters for the Focal, W (Dice + BCE), and Focal Tversky loss functions on the validation data set. The optimal learning rate was selected as α = 0.0001 from the candidates 0.001, 0.0001, and 0.00001. The parameters of the W (Dice + BCE) loss function allocated a higher contribution to the cross-entropy term (α = 0.6) than to the Dice term (weight 0.4). The optimum value of β for the weighted cross-entropy term was found to be 0.7, which penalizes false negatives more. This aligns with other recommendations for segmentation problems on MRI data [24]. Different values of α and β can be applied to obtain the best model result and handle the imbalance problem of each dataset appropriately.
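The grid search over loss parameters can be sketched as follows. This is an illustrative stand-in, not the authors' code: `validation_score` is a hypothetical score function that peaks at the reported optimum; in practice it would be the mean validation DSC of a model trained with each candidate parameter pair.

```python
import itertools

def grid_search(score_fn, grid):
    """Exhaustively evaluate every parameter combination and return
    the best-scoring one as a dict of parameter values."""
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        candidate = dict(zip(grid.keys(), values))
        score = score_fn(**candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params

# Hypothetical validation score peaking at alpha = 0.6, beta = 0.7;
# a real run would train a model per candidate and score it on the
# validation fold.
def validation_score(alpha, beta):
    return -((alpha - 0.6) ** 2) - ((beta - 0.7) ** 2)

best = grid_search(validation_score,
                   {"alpha": [0.4, 0.5, 0.6], "beta": [0.5, 0.6, 0.7]})
```

The search is exhaustive, which is affordable here because each axis has only a handful of candidate values.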

Models were trained using the T2w axial data and performed better visually in the axial view. Training a model using axial, sagittal, and coronal views (or a 3D data set) might improve performance; however, adding more inputs also adds complexity and extra computational cost. In this study, we used the 2D U-Net model, which has fewer parameters to optimize than a 3D U-Net. In addition, 3D U-Net models underfit when trained on a small number of datasets [6]. Furthermore, it is easier to identify the loss function's contribution to model performance when there is less model complexity. It has been shown that a simple network with a proper loss function can outperform more complex architectures, including networks with specific up-sampling or skip connections [24].

Regarding implementation, Keras offers a number of tools to construct a U-Net through its sequential and functional interfaces, so the model itself can be constructed and set up for training in a straightforward manner. For the loss function, however, a potential challenge is choosing the exact equation to implement, because even for the same loss function there are slight variations. For example, the denominator of a Dice loss can be the sum of squared signal intensities, while another form leaves out the square operation. Such subtle differences can add confounding factors when comparing model performance reported in the literature.
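The two Dice denominators mentioned above can be written side by side. This is an illustrative NumPy sketch, not the Keras code used in the study; it shows that the two forms agree on hard binary masks but diverge on soft probability maps:

```python
import numpy as np

def dice_loss_squared(s, g, eps=1.0):
    # Denominator sums squared probabilities and labels.
    return -(2.0 * np.sum(s * g)) / (np.sum(s ** 2) + np.sum(g ** 2) + eps)

def dice_loss_plain(s, g, eps=1.0):
    # Denominator leaves out the square operation.
    return -(2.0 * np.sum(s * g)) / (np.sum(s) + np.sum(g) + eps)

g = np.array([1.0, 1.0, 0.0, 0.0])     # binary ground truth
soft = np.array([0.5, 0.5, 0.0, 0.0])  # soft sigmoid output
hard = np.array([1.0, 1.0, 0.0, 0.0])  # thresholded output
```

For `hard`, s² = s and the two losses coincide; for `soft`, the squared form divides by a smaller denominator, so the same prediction yields a different loss value, which is exactly the kind of confounding factor described above.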

A model's output can be improved using post-processing methods that reduce false positives and false negatives in segmented images [25]. CNN segmentation results improve with energy-based refinement post-processing steps [26]. We applied threshold-based refinement to cope with false positives [27]. A threshold value of 0.5 was found to return the highest Dice score with the fewest false positives.
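Threshold-based refinement of this kind reduces to a single binarization step. The sketch below is illustrative only, assuming a sigmoid probability map as input:

```python
import numpy as np

def threshold_refine(prob_map, threshold=0.5):
    """Binarize a sigmoid probability map. Voxels below the threshold
    are suppressed, removing low-confidence false positives."""
    return (prob_map >= threshold).astype(np.uint8)

mask = threshold_refine(np.array([0.2, 0.49, 0.5, 0.9]))  # -> [0, 0, 1, 1]
```

Raising the threshold trades false positives for false negatives, which is why the value was tuned against the Dice score.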

#### **5. Conclusions**

The performance of a 2D U-Net model with nine different loss functions for prostate gland segmentation was compared. The ranking of model performance was found to depend on the metric used to measure performance. Performance was also found to vary with the region of the prostate being considered, with the base and apex generally performing worse than the mid-gland and the entire prostate gland. There was some evidence that performance was also affected by the cross-sectional area of the image, with peak performance in the range of 600–2100 mm². The performance of models using different loss functions varied by approximately 34% using the DSC metric. The Focal Tversky, Tversky, and W (Dice + BCE) loss functions achieved better performance on the majority of metrics, although the performance of models with the Focal Tversky, Tversky, W (Dice + BCE), Dice, and IoU losses was close. Lower performance was observed using the distribution-based and boundary-based loss functions (Surface, BCE, and Focal). Based on this 37-patient data set, we suggest that the Focal Tversky and W (Dice + BCE) loss functions are most suitable for the task of prostate segmentation, as their parameters allow the user to adapt the loss function to a specific dataset.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/bioengineering10040412/s1, Figure S1: Box plots of all metrics used in this study for the *prostate mid-gland* on validation data from the five-fold cross-validation for models with different loss functions. Figure S2: Box plots of all metrics used in this study for the *apex region* on validation data from the five-fold cross-validation for models with different loss functions. Figure S3: Box plots of all metrics used in this study for the *prostate base region* on validation data from the five-fold cross-validation for models with different loss functions. Figure S4: Dice similarity coefficient (DSC) score of all the models for each patient. Figure S5: DSC score vs. prostate volume (mm³) for the model using Focal Tversky loss. Table S1: W (BCE + Dice), Tversky and Focal Tversky performances (DSC score) for each patient.

**Author Contributions:** Conceptualization, M.M.; methodology, M.M.; software, M.M.; validation, M.M. and Y.S.; formal analysis, M.M.; investigation, M.M., Y.S., G.S., A.H.; resources, A.H.; data curation, Y.S. and M.M.; writing—original draft preparation, M.M.; writing—review and editing, M.M., Y.S., G.S., A.H.; supervision, A.H.; project administration, A.H.; funding acquisition, A.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by National Health and Medical Research Council (Grant No.: 1126955), Cancer Institute of New South Wales: TPG 182165 and SW-TCRC Partner Program grant 2019.

**Institutional Review Board Statement:** The study was conducted in accordance with the Declaration of Helsinki, and approved by the Peter MacCallum Cancer Centre Human Research Ethics Committee (HREC/15/PMCC125).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The data presented in this study and the code used to perform the calculations with these data are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

**Acknowledgments:** The authors acknowledge the Sydney Informatics Hub and the University of Sydney's high-performance computing cluster Artemis for providing the high-performance computing resources, and the technical assistance of Nathaniel Butterworth who contributed to the research results reported within this paper. We would also like to thank Nym Vandenberg for their critical review of the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. G.S. has received consultancy fees and/or travel support paid to his institution from MSD Ltd. New Zealand, Siemens Healthcare GmbH, RaySearch Laboratories, Elekta New Zealand Ltd., and Accuray Ltd. G.S. has received scientific grant support from Bayer and Astellas pharmaceuticals for clinical trials administered through the Australia and New Zealand Urogenital and Prostate (ANZUP) clinical trials group. G.S. also has research supported by Bristol-Myers Squibb, the Prostate Cancer Foundation (USA), Prostate Cancer Foundation Australia, the PeterMac Foundation, Cancer Council Victoria, and the Victorian Cancer Agency. None of these relate to the current work. A.H. has a non-financial research agreement with Siemens Healthcare which does not relate to the current work. G.S. has a non-financial research agreement with MIM Software Inc. and RaySearch Laboratories; their software was used for the purposes of this work.

#### **Appendix A Definition of Loss Functions Used in This Study**

Loss functions are key drivers in determining the success of neural network models. They define how neural network models calculate the overall error between the prediction and the ground truth. During training, the loss is calculated for each batch and minimized using optimization algorithms. Selecting an appropriate loss function has a larger effect on model performance than using a complex architecture [17]. Loss functions can generally be classified into four groups: distribution-based, region-based, boundary-based, and compound loss [14]. Compound loss is the combination of different types of loss functions. The main role of loss functions is to quantify the mismatch region between ground truth and segmentation; the main differences between them are the weighting methods [14].

The following equations use these generic notations; function-specific parameters are explained where they appear.

$g_i$, $s_i$: voxel $i$ in the ground truth and the segmentation output, respectively;

$C$: the number of classes;

$c$: notation for an individual class. If class $c$ is the correct classification for voxel $i$, $g_i^c$ is equal to 1 and $s_i^c$ is the corresponding predicted probability;

$N$: the total number of samples.

**Distribution-based loss functions:** Distribution-based loss functions aim to minimize the dissimilarity between two distributions. From this category we used binary cross-entropy (BCE) and Focal loss. The fundamental function in this category is cross-entropy; the other functions are derived from it.

**Cross-entropy loss:** Cross-entropy (CE) loss is the most commonly used loss function for training deep learning models. It measures the dissimilarity between the predicted distribution and the data distribution, which comes from the training set. The CE loss function is formulated as:

$$Loss\_{CE} = -\frac{1}{N} \sum\_{i=1}^{N} \sum\_{c}^{C} g\_i^c \log \left( s\_i^c \right)$$

In this study the segmentation task was binary classification; therefore, the loss function reduces to binary cross-entropy (BCE).

A CE loss function can control output imbalance and the false positive and false negative rates. However, model performance with a cross-entropy loss function is not optimal for segmentation tasks with highly class-imbalanced input images [28]. Several loss-function-based techniques therefore use weighted cross-entropy [29].

A variation is the weighted cross-entropy (WCE):

$$Loss\_{WCE} = -\frac{1}{N} \sum\_{i=1}^{N} \sum\_{c}^{C} w\_{c} g\_{i}^{c} \log \left( s\_{i}^{c} \right),$$

where $w_c$ is the weight for each class. Weighting classes inversely proportional to their frequencies down-weights the majority class and emphasizes the minority class.
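A minimal NumPy sketch of the weighted cross-entropy above, assuming the binary case with a single foreground class; the weight values are illustrative only, not those used in the study:

```python
import numpy as np

def weighted_bce(s, g, w_fg=0.7, w_bg=0.3):
    """Binary WCE: w_fg weights foreground (g = 1) errors,
    w_bg weights background (g = 0) errors."""
    s = np.clip(s, 1e-7, 1.0 - 1e-7)  # avoid log(0)
    return -np.mean(w_fg * g * np.log(s) + w_bg * (1 - g) * np.log(1 - s))
```

A larger `w_fg` increases the penalty on missed foreground voxels (false negatives), which is useful when the prostate occupies only a small fraction of each slice.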

**Focal loss:** The focal loss function is a WCE-style loss shown to better manage unbalanced classes in a dataset [30]. It reduces the loss contribution of well-classified examples, using a scaling factor to allocate higher weights to examples that are difficult to classify.

$$Loss\_{focal} = -\frac{1}{N} \sum\_{i=1}^{N} \sum\_{c}^{C} \left(1 - s\_i^c\right)^{\gamma} g\_i^c \log \left(s\_i^c\right),$$

where *γ* is a hyperparameter called focusing parameter.
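A binary NumPy sketch of the focal loss, for illustration only. The modulating factor, (1 minus the true-class probability) raised to γ, shrinks the contribution of confidently correct voxels while leaving hard, misclassified voxels nearly at full weight:

```python
import numpy as np

def focal_loss(s, g, gamma=2.0):
    s = np.clip(s, 1e-7, 1.0 - 1e-7)
    # p_t is the predicted probability of the true class
    p_t = np.where(g == 1, s, 1 - s)
    # (1 - p_t)^gamma down-weights easy, well-classified voxels
    return -np.mean((1 - p_t) ** gamma * np.log(p_t))

def bce(s, g):
    # plain binary cross-entropy, for comparison
    s = np.clip(s, 1e-7, 1.0 - 1e-7)
    return -np.mean(g * np.log(s) + (1 - g) * np.log(1 - s))
```

With γ = 2, a voxel predicted at 0.9 for its true class contributes only 1% of its BCE loss, whereas a voxel predicted at 0.1 keeps 81% of it.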

**Region-based loss:** Region-based loss functions aim to minimize the mismatch by maximizing the overlap between the segmentation output (*S*) and the ground truth (*G*). Dice loss is the key element of this category.

**Dice loss:** Dice loss aims to directly maximize the Dice coefficient, which is the most commonly used segmentation evaluation metric [31]. Segmentation models with Dice loss functions have shown superior performance for binary segmentation [29,31,32]. The loss function is formulated as the negative DSC:

$$Loss\_{Dice} = -\frac{2\sum\_{i=1}^{N} s\_i g\_i}{\sum\_{i=1}^{N} s\_i^2 + \sum\_{i=1}^{N} g\_i^2 + \varepsilon}$$

where *ε* is a small number to avoid division by zero. In this study, *ε* = 1 was used for all models.

**IoU loss:** The IoU loss function aims to maximize the intersection-over-union (IoU) coefficient, also known as the Jaccard coefficient. IoU is a segmentation evaluation metric similar to the Dice coefficient [33]:

$$Loss\_{IoU} = 1 - \frac{\sum\_{i=1}^{N} s\_i g\_i}{\sum\_{i=1}^{N} (s\_i + g\_i - s\_i g\_i)}$$
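The equation above translates directly to code; this NumPy version is illustrative and assumes binary or soft masks as input:

```python
import numpy as np

def iou_loss(s, g):
    intersection = np.sum(s * g)
    union = np.sum(s + g - s * g)
    return 1.0 - intersection / union

g = np.array([1.0, 1.0, 0.0])
perfect = g.copy()                    # loss 0: full overlap
partial = np.array([1.0, 0.0, 0.0])   # loss 0.5: half the union overlaps
```

As with the Dice loss, an epsilon term can be added to the denominator to guard against empty masks.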

**Tversky:** The Tversky loss function reshapes the Dice loss, weighting false negatives more heavily to achieve a better trade-off between precision and recall [33]. Background voxels labelled as the target object are false positives; false negatives are voxels of the target object misclassified as background. A segmentation with fewer false positives and false negatives is ideal, but it is not easy to decrease both at the same time.

$$Loss\_{Tversky} = \frac{\sum\_{i=1}^{N} s\_i g\_i}{\sum\_{i=1}^{N} s\_i g\_i + \alpha \sum\_{i=1}^{N} s\_i (1 - g\_i) + \beta \sum\_{i=1}^{N} g\_i (1 - s\_i)}$$

where *α* and *β* are weighting factors to weight the contribution of false positives and false negatives. For certain applications, reducing the false positive (FP) rate is more important than reducing the false negative (FN) rate or vice versa [34].

**Focal Tversky:** Focal Tversky applies the concept of focal loss to improve model performance for cases with low probabilities [35]:

$$Loss\_{FTL} = \left(1 - Loss\_{Tversky}\right)^{1/\gamma}$$

where *γ* varies in the range [1, 3].
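The Tversky index and its focal extension can be sketched as below (illustrative NumPy; the α, β, and γ values are example choices, not the study's tuned parameters). With β > α, a missed target voxel (FN) lowers the index more than a spurious one (FP):

```python
import numpy as np

def tversky_index(s, g, alpha=0.3, beta=0.7):
    tp = np.sum(s * g)
    fp = np.sum(s * (1 - g))  # background predicted as target
    fn = np.sum(g * (1 - s))  # target predicted as background
    return tp / (tp + alpha * fp + beta * fn)

def focal_tversky_loss(s, g, alpha=0.3, beta=0.7, gamma=2.0):
    # Raising (1 - index) to 1/gamma (gamma in [1, 3]) amplifies the
    # loss for hard, low-overlap examples.
    return (1.0 - tversky_index(s, g, alpha, beta)) ** (1.0 / gamma)

g = np.array([1.0, 1.0, 0.0, 0.0])
one_fn = np.array([1.0, 0.0, 0.0, 0.0])  # misses one target voxel
one_fp = np.array([1.0, 1.0, 1.0, 0.0])  # adds one background voxel
```

With α = β = 0.5 the index reduces to the DSC, which is how the Tversky loss "reshapes" the Dice loss.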

**Boundary-based loss functions:** Boundary-based loss functions are a new type of loss function that aims to minimize the distance between two boundaries of the ground truth and segmentation output.

**Boundary (BD) loss (Surface loss):** A boundary (BD) loss (or surface loss) function aims to minimize the mean surface distance, Dist (*∂G*, *∂S*), between two boundaries (surfaces) of the ground truth *G* and segmentation output *S*. The boundary of the ground truth (*G*) is denoted as *∂G*, and *∂S* represents the boundary of segmentation (*S*). This means that BD loss minimizes the mean of the distance between surface voxels in *S* and the closest voxels in *G*.

Boundary loss uses an integral over the boundary between regions instead of integrals within the regions.

$$\text{Dist}(\partial G, \partial S) = \int\_{\partial G} \left\| y\_{\partial S}(p) - p \right\|^2 dp$$

where $p$ is a point on the boundary $\partial G$ and $y_{\partial S}(p)$ is the corresponding point on the segmentation boundary $\partial S$.
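A toy NumPy sketch of the idea: precompute a signed distance map to the ground-truth boundary, then average it weighted by the predicted probabilities. The brute-force distance helper here is for illustration only; practical implementations typically precompute the map with a distance transform such as `scipy.ndimage.distance_transform_edt`.

```python
import numpy as np

def signed_distance_map(g):
    """Crude signed distance to the mask boundary: for each voxel, the
    distance to the nearest voxel of the opposite region, negative
    inside the mask. O(N^2) brute force; illustration only."""
    fg = np.argwhere(g > 0)
    bg = np.argwhere(g == 0)
    dmap = np.zeros(g.shape, dtype=float)
    for p in np.argwhere(np.ones_like(g, dtype=bool)):
        inside = g[tuple(p)] > 0
        other = bg if inside else fg
        d = np.min(np.linalg.norm(other - p, axis=1))
        dmap[tuple(p)] = -d if inside else d
    return dmap

def boundary_loss(s, dmap):
    # Probability mass placed far from the ground-truth boundary is
    # penalized in proportion to its distance.
    return np.mean(s * dmap)

g = np.array([[0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
dmap = signed_distance_map(g)
```

Because the loss is linear in the (fixed) distance map, it provides gradients even when the prediction and ground truth barely overlap, which is the regime where region-based losses struggle.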

**Compound loss:** Compound loss functions combine different types of loss functions, most commonly cross-entropy and the Dice similarity coefficient. The compound loss used in this study is built from the WCE and Dice loss functions:

$$Loss\_{Combo} = \alpha \left( -\frac{1}{N} \sum\_{i=1}^{N} \left[ \beta \left( g\_i \log s\_i \right) + (1 - \beta)(1 - g\_i) \log (1 - s\_i) \right] \right) - (1 - \alpha) \left( \frac{2 \sum\_{i=1}^{N} s\_i g\_i + \varepsilon}{\sum\_{i=1}^{N} s\_i^2 + \sum\_{i=1}^{N} g\_i^2 + \varepsilon} \right)$$

where *α* controls the contribution of the WCE loss and the Dice terms; *β* controls the contribution from positive voxels within WCE. Values of *α* and *β* can be defined from a grid search. In this study, two configurations are used. One has equal weights on BCE and Dice, referred to as BCE + Dice. The other uses grid search to determine the best combination (*α* = 0.6, *β* = 0.7), known as weighted BCE and Dice, or W (BCE + Dice). The latter applies more penalty to false negatives. This aligns with the observation that under-segmentation (false negative) is a common problem for MRI data [23].
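A NumPy sketch of the W (BCE + Dice) loss with the reported parameter values; this is illustrative, not the training code used in the study:

```python
import numpy as np

def combo_loss(s, g, alpha=0.6, beta=0.7, eps=1.0):
    """alpha weights the WCE term against the Dice term;
    beta > 0.5 penalizes false negatives more heavily."""
    s = np.clip(s, 1e-7, 1.0 - 1e-7)
    wce = -np.mean(beta * g * np.log(s) + (1 - beta) * (1 - g) * np.log(1 - s))
    dice = (2 * np.sum(s * g) + eps) / (np.sum(s ** 2) + np.sum(g ** 2) + eps)
    return alpha * wce - (1 - alpha) * dice
```

Increasing β raises the loss of an under-segmented (FN-heavy) prediction, steering training away from missing prostate voxels, which matches the study's choice of β = 0.7.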
