1. Introduction
Prostate cancer is one of the most common cancer types in the world, affecting roughly one in every eight men according to the American Cancer Society. Despite its high survival rate (5-year relative rates of ≈
for localized and regional,
for distant), it was the third most prominent cancer in 2020 [
1]. Therefore, there is an urgent need to develop methods that may aid in early detection and better characterization of disease aggressiveness [
2,
3]. The latter will make possible the avoidance of over-treatment in patients with non-aggressive disease and the amplification of treatment in patients with aggressive disease. Manual segmentation is still the most common practice, a time-consuming task also limited by the subjectiveness of the radiologists’ expertise, resulting in high interobserver variability.
Most automatic segmentation processes developed for clinical applications are based on convolutional neural networks (CNNs), many being variations of the classic Unet architecture [
4]. For prostate segmentation, the proposed models continue to improve as shown by the mean dice score used for the following results. Guo et al. [
5] presented a two-step pipeline where a stacked sparse auto-encoder was used to learn deep features, and then a sparse patch matching method used those features to infer prostate likelihood, obtaining a Dice score of
. Milletari et al. [
6] presented Vnet, a volumetric adaptation of unet, achieving a Dice score of
. Zhu et al. [
7] used deep supervised layers on Unet with additional 1 × 1 convolutions, obtaining
Dice. Dai et al. [
8] used the object detection and segmentation model Mask-RCNN, achieving
and
Dice in the segmentation of prostatic gland and intraprostatic lesions, respectively. Zavala-Romero et al. [
9] presented a multistream 3D model that employed the three standard planes of an MRI volume as inputs and was trained with data from different scanners, achieving Dice scores of
for the full gland and
for the peripheral zone when using data from one scanner, and
and
when combining data from both scanners. Aldoj et al. [
10] presented a Unet variation with two stacked dense blocks in each level of the downsampling and upsampling paths, obtaining Dice scores of
,
and
for full gland, peripheral and central zones, respectively. Duran et al. [
11] presented an attention Unet [
12] variation, where an additional decoder was added to perform both gland segmentation as well as multi-class lesion segmentation, achieving a Dice score of
. Recently, Hung et al. [
13] proposed a new take on skip connections, using a transformer to capture the cross-slice information at multiple levels, with the main advantage being that this can be incorporated into most architectures, such as the nnU-Net model.
As of late, more complex and novel architectures and paradigms have been adopted for biomedical imaging segmentation. When dealing with large amounts of data, vision transformer architectures have shown great promise for biomedical segmentation [
14,
15,
16,
17,
18]. These visions transformer models have the advantage of using self-attention mechanisms, allowing them to learn complex patterns, and to also capture the global spatial dependencies of the images, which is highly advantageous for segmentations. As previously mentioned, since manual segmentation suffers from several limitations, most data end up being unlabeled. Self-Supervised learning strategies have been used to leverage these unlabeled data to improve the performance of segmentation models by first pre-training them on pretext tasks [
19,
20,
21]. Lastly, another paradigm that has been recently used for biomedical image segmentation is knowledge distillation, where models are first trained on a larger task (e.g., multi-centre data ) and are then used to help train other models for more domain-specific tasks [
22] (e.g., single-centre data). This technique has already been adopted by some researchers for prostate segmentation [
23,
24,
25] and is shown to improve overall model performance.
One major problem that presents with most previously addressed Unet-based automatic prostate segmentation works are the different evaluation conditions presented, where the same model produces different results depending on the new architecture being proposed and what it needs to outperform the previous models, regardless of the correctness of the experimental settings. What we propose is not a new method but a comparative study of several of the most common, as well as some new some variations, of Unet-based segmentation models for prostate gland and zone segmentation. Additionally, we present and evaluate another research question, regarding the impact of using an object detector model as a pre-processing step in order to crop the MRI volumes around the prostate gland, reducing computational strain and improving segmentation quality by reducing the redundant area of the volumes without simply resizing the data. We examined this question by comparing the results of the different models in both full and cropped T2W MRI volumes, during cross-validation and in an additional external dataset.
2. Materials and Methods
2.1. Data
We used two publicly available datasets containing retrospective data, one to train both the object detection and the segmentation models (ProstateX), the other to serve as an external test set to assess the quality of the segmentation models (Medical Decathlon prostate dataset). The ProstateX dataset (
https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=23691656SPIE-AAPM-NCIPROSTATExchallenge, accessed on 2 January 2023) is a collection of prostate MRI volumes that include T2W, DWI and ADC modalities. These volumes were obtained by the Prostate MR Reference Center—Radboud University Medical Centre (Radboudumc) in the Netherlands, using two Siemens 3T MR scanners (MAGNETOM Trio and Skyra). Regarding the acquisition of the images, the following description was provided by the challenge’s organizers: “T2-weighted images were acquired using a turbo spin echo sequence and had a resolution of around 0.5 mm in plane and a slice thickness of 3.6 mm. The DWI series were acquired with a single-shot echo planar imaging sequence with a resolution of 2-mm in-plane and 3.6-mm slice thickness and with diffusion-encoding gradients in three directions. A total of three b-values were acquired (50, 400, and 800), and subsequently, the apparent diffusion coefficient (ADC) map was calculated by the scanner software. All images were acquired without an endorectal coil”.
Regarding the segmentations, the prostate gland segmentations were performed by a senior radiologist from the Champalimaud Foundation, while the transition and peripheral zone segmentations were obtained from the public dataset repository. We applied bias field correction, using N4ITK ([
26]), to the T2W volumes, and all the masks were resampled to be on the same space and have the same orientation and spacing as the volumes. A total of 153 volumes were used for the gland segmentation, while for the peripheral and transition zone segmentation, a total of 139 volumes was used.
As an external test dataset, we used the publicly available prostate segmentation dataset from the Medical Segmentation Decathlon ([
27]), which was acquired at the Radboud University Medical Centre (Radboudumc) in the Netherlands. This dataset consists of 32 MRI volumes of coregistered T2W and ADC modalities, along with segmentation masks with distinct classes for the transition and peripheral zones. We extracted the T2W volumes and, similarly to the ProstateX, performed bias field correction and resampled the masks, and made new masks by joining both classes to obtain a prostate mask. The model of the scanner used to take these MRIs is not disclosed.
2.2. Prostate Detection
To prepare the data for object detection, the volumes were converted into 2D 16-bit PNG images (we chose to use 16-bit images so we keep the notion of depth while working in 2D) where the empty slices were discarded. The edge coordinates of the ground truth bounding boxes were obtained from the prostate gland masks. We used these images to train a variation of the YoLo-v4 Open source code available here:
https://github.com/ultralytics/yolov5, accessed on 2 January 2023, [
28] object detection architecture to locate the prostate gland and accurately draw a bounding box around it (Figure 3). After predicting the bounding boxes, we extracted the coordinates of the edges of the largest bounding box in the volumes, added padding of 40 pixels in each direction, and cropped the entire volume with that box. Using the largest bounding box plus an additional padding, we ensure that all slices contain the entire prostate gland and some additional area.
Figure 1 provides a comparison between a standard and cropped slice of a T2W volume.
2.3. T2W Pre-Processing and Augmentation
Before feeding the T2W volumes to the segmentation models, the following pre-processing techniques were applied: Orientation transformation to ensure the models were in RAS+ orientation; Intensity rescaling transformation to ensure the voxel intensity of the volumes was in
;
Z normalization transformation; Cropping the images into a smaller 160 × 160 × 32 size. This last transformation was done to reduce the total unused area, making it less computationally expensive, and to ensure that the volumes had an appropriate number of slices for the segmentation models. It was only applied to the volumes that had not been previously cropped by the object detector. In addition, the following augmentation transformations were applied: Random affine transformations, including rescaling, translation and rotation; Random changes to the contrast of the volumes by raising their values to the power of
, where
; Application of random MRI motion artifacts, and random bias field artifacts. Examples of volumes after pre-processing and augmentation are shown in
Figure 2.
2.4. Segmentation Models
In this study, we conducted an extensive analysis of various popular Unet-based models from the literature that either had a publicly available implementation or provided enough detail to be reproducible. These models utilized mechanisms falling into three distinct categories: Dense blocks, Recurrent connections, and Attention mechanisms. The inclusion of a diverse range of mechanisms facilitated a thorough investigation of the utility of each mechanism, given that most Unet variations exhibit only minor differences between one another. Additionally, we introduced new networks that build upon the previously published models by incorporating additional combinations of mechanisms. This enabled us to evaluate the performance of these models against established ones, leading to more reliable and informative results.
In total, 13 segmentation models were compared, including: Unet, Unet++ [
29], Residual Unet (ResNet), Attention Unet (aunet) [
12], Dense Attention Unet (daunet), Dense-2 Unet (d2unet) [
10], Dense-2 Attention Unet (d2aunet), Recurrent Residual Unet (r2unet) [
30], Recurrent Residual Attention Unet (r2aunet), nnU-Net [
31], Vnet [
6], highResNet [
32], SegResNet [
33].
For the Unets, the standard convolution blocks in the downwards path are composed of two convolution operators with a kernel size 3 and stride of 1, followed by a batch normalization, a ReLU activation function and lastly a maxpool operation for dimensionality reduction. The convolution blocks on the upwards path are composed of an upsample operation with scale 2 to double the size of the previous input, and a convolution of kernel size 2 and stride of 1, followed by a batch normalization and ReLU activation.
For the dense unets, the dense blocks are composed of four blocks, with each one having two convolution operations with kernel size 1 and stride of 1, where after each convolution there is a batch normalization and a ReLU activation function. For the transition blocks also used in the dense unets, we perform an upsample operation with scale 2 and a convolution with kernel size 3 and stride of 1, followed by batch normalization and ReLU activation. Regarding the Dense-2 Unet, our implementation differs from the one presented in the original article. As this article does not include enough information to fully replicate the model, we chose the parameters for the dense blocks and convolution blocks to be similar to the ones of the remaining networks.
The recurrent residual blocks are equal to the ones described in Zahangir et al. [
30], and are composed of two residual operation and two convolution blocks, each one having one convolution operation with kernel size 3 and stride of 1, followed by a batch normalization and ReLU activation.
The attention mechanisms are equal to those described in Oktay et al. [
12], where we calculate
,
and
, each composed of a convolution of kernel size 1 and stride of 1, and then compute
.
At the start of each Unet, a single convolution block, composed of a convolution of kernel size 3 and stride of 1, followed by a batch normalization and ReLU activation, is applied to double the channels from 32 to 64, so we end up with 1024 channels at the bottleneck area of the unets.
Regarding the Vnet, SegResNet and highResNet, we used the models made available by the MONAI package [
34]. We used the 3D full resolution version of the nnunet, which is equal to the one described in [
31], and publicly available at
https://github.com/MIC-DKFZ/nnUNet, accessed on 2 January 2023. All other models were implemented in Pytorch [
35].
2.5. Training and Evaluation
2.5.1. Detection
The detection model was trained with the ProstateX volumes, which differed in size (, , ) with a varying number of slices (minimum 19, maximum 27).
All volumes were normalized to zero mean and one unit of standard deviation, resized to , and separated in 2D slices. The slices with no prostate were removed, ending with a total of 2801 images. The remaining 2801 images were split into 70% training and 30% validation. To avoid data leakage, we ensured all slices belonging to a volume were only present in one of the sets.
To set the initial parameters for the object detector model, we used the genetic algorithm based hyperparameter evolution included in the package, that ran for 100 generations, with
mutation and
crossover probabilities. The fitness function used to evaluate this evolution is the weighted average between the mean average precision with a threshold of
(
[email protected]), which contributed with
, and the mean average precision with different thresholds, from
to
in steps of
(
[email protected]:0.95), which contributed with the remaining
. Then, after having set the initial values for the hyperparameters, the model was trained for 350 epochs.
2.5.2. Segmentation
The datasets were split with 5-fold cross-validation. All models except the nnU-Net were trained for a maximum number of 1500 epochs, using early stopping with patience 30. The optimizer was Weighted Adam (AdamW) with a starting learning rate of 1 × 10
, Cosine Annealing learning rate decay, and weight decay of 4 × 10
. For nnU-Net, we used the default parameters, training for 1000 epochs with the Ranger optimizer [
36]. All models were trained using Pytorch-Lightning.
To assess the quality of the models, we use Mean Dice score (MDS) with 95% confidence interval (CI), Mean Hausdorf distance (MHD) and Mean surface distance (MSD).
The Dice score is a widely used metric for segmentation tasks, and it measures the overlap between the ground truth and predicted masks. The higher the score, the better the segmentation.
The Hausdorff distance (HD) measures how far two points are in two images. In this case, how far a point from the predicted mask if from its nearest point in the ground truth mask, essentially indicating the largest segmentation error present in the predicted mask. Low values mean small errors.
The Surface distance (SD)measures the difference between the surface of the predicted mask and the ground truth mask. Low values mean small differences.
2.6. Loss Function
Initially, our loss function was the standard averaged sum of Dice (Equation (
2)) and Crossentropy (Equation (
1)), called Dice CE loss (Equation (
3)). It produced good results for the mid regions of the prostate, but struggled with the small and irregular shapes present in both apex and base. Therefore, we included both Focal (Equation (
4)) ([
37]) and Tversky (Equation (
5)) losses, as they mitigate this issue by focusing on the hard negative examples, reducing both false negatives and false positives. The final loss function was named Focal Tversky Dice Crossentropy loss (FTDCEL) (Equation (
6)), an averaged sum of all previously mentioned loss functions, all having the same weight (FTDCEL) (Equation (
6)). All of these loss functions were implemented using the MONAI package [
34].
4. Discussion
We conducted an extensive analysis of several commonly used segmentation models for prostate gland and zone segmentation on a unified pipeline with the same settings for all. In addition, we answered the research question regarding the effectiveness of using an object detector to crop the MRI volumes around the prostate gland to aid during the segmentation process.
First, we trained an object detector model to perform bounding box detection on the prostate gland, in order to later crop the images. We employed a variation of the Yolo-v4 model to detect the prostate on 2D slices, which provided very good results, with box and confidence validation losses of and , respectively. These results show that it is fairly easy to train a robust prostate detection model, raising the hypothesis for a feature work of leveraging the learned local representations of such a model and using them to aid in a segmentation task.
Using the results obtained from the object detection, we produced a new version of the ProstateX dataset where the volumes were cropped around the gland, reducing the overall size of the image and computational power required. We then trained 13 different commonly used segmentation models, with some additional new variations, on both the full and cropped volumes. The models were subsequently validated on the prostate dataset from the medical imaging decathlon, both on the original and on a cropped version.
When comparing the performance of the models on the full volume segmentation tasks, two main conclusions can be drawn. The first being that the nnU-Net is the overall best model, outperforming all other during cross-validation on the three different segmentation tasks, while still being the model that better generalized on the external test set, on two out of three tasks. Interestingly, despite not being the worst performing model on the external transitional data, it was the one with the highest ASD value, meaning that while it did not make the largest mistakes, it did make the most mistakes out of all models. The second conclusion is that, excluding the nnU-Net, when choosing between the other models, the decision is almost inconsequential. While SegResNet and Vnet produced results significantly worse than some other models, the remaining show no significant differences between each other for the three segmentation tasks.
When comparing the performance of the models on the cropped volume segmentation tasks, the results are not as clear as when using the full volumes. One common aspect among all three segmentation tasks is that the nnU-Net is either the best-performing model, or is at least one of the best. While for the transitional segmentation task, the nnU-Net is the clear winner, for both the gland and peripheral segmentation tasks, the choice of the model is almost inconsequential, with the exception of only four models on the gland task, which perform significantly worse. Regarding the performance on the external test set, there is no clear indication of an overall better model. While for the transition segmentation task, the nnU-Net is clearly better than the other models, and for the remaining two tasks, the results are very similar.
Lastly, comparing the performance of the models when using the two types of data, it is observable that overall the model performance during cross-validation is either the same, or statistically significantly worse. The clearest case being the nnU-Net, where it consistently hinders performance. This is most likely due to the way nnU-Net processes the data, since it is based on a set of heuristics that are applied based on the characteristics of the data, characteristics that are changed when the volumes are cropped. However, in both the gland and peripheral zone segmentation task, it is also observable that the cropped models generalize better, despite their poorer performance during cross-validation, hinting that it is worthwhile to further explore this approach, ideally on a larger dataset and with data to rule out any hypothesis regarding data quantity.