Accurate Prostate Segmentation in Large-Scale Magnetic Resonance Imaging Datasets via First-in-First-Out Feature Memory and Multi-Scale Context Modeling

Zhu, Jingyi; Zhang, Xukun; Luo, Xiao; Zheng, Zhiji; Zhou, Kun; Kang, Yanlan; Li, Haiqing; Geng, Daoying

doi:10.3390/jimaging11020061

Open AccessArticle

Accurate Prostate Segmentation in Large-Scale Magnetic Resonance Imaging Datasets via First-in-First-Out Feature Memory and Multi-Scale Context Modeling

by

Jingyi Zhu

¹

,

Xukun Zhang

¹,

Xiao Luo

¹

,

Zhiji Zheng

¹,

Kun Zhou

¹,

Yanlan Kang

¹,

Haiqing Li

^2,† and

Daoying Geng

^1,2,*,†

¹

Academy for Engineering and Technology, Fudan University, Shanghai 200433, China

²

Department of Radiology, Huashan Hospital, Fudan University, Shanghai 200400, China

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

J. Imaging 2025, 11(2), 61; https://doi.org/10.3390/jimaging11020061

Submission received: 10 December 2024 / Revised: 25 January 2025 / Accepted: 10 February 2025 / Published: 16 February 2025

(This article belongs to the Section Medical Imaging)

Download

Browse Figures

Versions Notes

Abstract

:

Prostate cancer, a prevalent malignancy affecting males globally, underscores the critical need for precise prostate segmentation in diagnostic imaging. However, accurate delineation via MRI still faces several challenges: (1) The distinction of the prostate from surrounding soft tissues is impeded by subtle boundaries in MRI images. (2) Regions such as the apex and base of the prostate exhibit inherent blurriness, which complicates edge extraction and precise segmentation. The objective of this study was to precisely delineate the borders of the prostate including the apex and base regions. This study introduces a multi-scale context modeling module to enhance boundary pixel representation, thus reducing the impact of irrelevant features on segmentation outcomes. Utilizing a first-in-first-out dynamic adjustment mechanism, the proposed methodology optimizes feature vector selection, thereby enhancing segmentation outcomes for challenging apex and base regions of the prostate. Segmentation of the prostate on 2175 clinically annotated MRI datasets demonstrated that our proposed MCM-UNet outperforms existing methods. The Average Symmetric Surface Distance (ASSD) and Dice similarity coefficient (DSC) for prostate segmentation were 0.58 voxels and 91.71%, respectively. The prostate segmentation results closely matched those manually delineated by experienced radiologists. Consequently, our method significantly enhances the accuracy of prostate segmentation and holds substantial significance in the diagnosis and treatment of prostate cancer.

Keywords:

prostate segmentation; context modeling module; dynamic adjustment mechanism; T2-weighted imaging

1. Introduction

Prostate cancer is the second most frequently diagnosed cancer in men and the fifth leading cause of death worldwide [1]. The annual incidence of prostate cancer has increased in recent years. Predictions suggest that the annual number of new cases will rise from 1.4 million in 2020 to 2.9 million by 2040 [2]. The early detection and risk assessment of prostate cancer are crucial for effective treatment planning and for improving patient outcomes [3]. Common diagnostic methods for prostate cancer include digital rectal examination (DRE) and prostate-specific antigen (PSA) testing, with a definitive diagnosis typically confirmed by prostate biopsy [4]. However, these tests often cause physical discomfort to patients. Magnetic resonance imaging (MRI) has become a crucial method for detecting prostate cancer, as it offers clear anatomical images. MRI encompasses various imaging modalities, including T2-weighted (T2W), diffusion-weighted imaging (DWI), and dynamic contrast-enhanced (DCE) imaging. External beam radiation therapy (EBRT) is a common treatment modality for prostate cancer [5,6]. Precise MRI segmentation of the prostate is essential for effective prostate cancer management, enabling accurate radiation therapy and minimizing radiation-related damage to the surrounding healthy tissues [7].

However, achieving accurate delineation using MRI presents several challenges. First, the boundaries between the prostate and surrounding soft tissues are often ambiguous in MRI images, leading to time-consuming manual annotations that are prone to inter-operator variability. Second, the apex and base of the prostate inherently display blurriness, which complicates edge extraction and precise segmentation [8]. Figure 1 shows a T2W sequence image of the prostate, with the red area indicating the prostate region. As illustrated in Figure 1, the boundaries of the prostate tissue are notably blurred, particularly in the apex and base regions. This blurriness significantly complicates the automatic segmentation.

With the rapid advancement of artificial intelligence and computer vision, these technologies have found wide-spread application in medical image processing. Given the distinct characteristics of natural and medical images, particularly prostate MRI scans, applying deep learning algorithms to localize and segment these images is crucial to enhance the accuracy and efficiency of prostate cancer diagnosis [9]. Numerous medical image segmentation techniques based on traditional machine learning have been extensively investigated, including atlas-based approaches [10], graph-cut algorithms [11], and watershed transformations [12]. These methods have been employed for the quantitative assessment of regions within various medical images. However, most of these techniques rely on manually constructed features that do not effectively capture the robust visual cues required to overcome the challenges inherent in the segmentation task. Since 2012, radiomics has emerged as a complementary approach to medical image analysis, extracting a large number of quantitative features, such as texture, shape, and intensity, from medical images. These features have been used to predict clinical outcomes and disease characteristics more effectively [13,14].

In recent years, deep learning, particularly convolutional neural networks (CNN) [15], has become increasingly prevalent in medical image segmentation and has demonstrated remarkable success. Foundational models such as the Fully Convolutional Network (FCN) [16], U-Net [17], and Residual Networks (ResNet) [18] have achieved significant milestones in the domain of medical image segmentation. In line with this, the dilated One-to-Many U-Net model has been proposed to address segmentation challenges posed by diverse imaging modalities and varied target sizes, achieving impressive results on datasets such as the HC18 ultrasound dataset and the Multi-site MRI dataset, with Dice coefficients of 96.54% and 96.76% for fetal head and prostate segmentation, respectively [19]. Among them, nnU-Net [20,21], the most widely recognized convolutional network-based model, has proven to be suitable for most medical segmentation tasks. U-Net++ [22] utilizes more nested and densely connected skip connections to better capture the fine-grained details of foreground objects [23]. Although models based on U-Net [17] have significantly advanced medical segmentation, they often struggle to capture long-range relationships and global contextual information due to the limited receptive field of the convolutional kernels. Consequently, researchers have shifted their focus to self-attention mechanisms [24]. Trans-UNet [25] was the first to incorporate a Vision Transformer (ViT) [26,27] into medical image segmentation by combining transformer and U-Net architectures. It integrates the self-attention mechanism of the transformer to capture global contextual information and enhance feature representation capabilities. Swin-UNet combines the advantages of the Swin transformer [28,29] and U-Net [17,30], introducing cross-layer communication mechanisms. This enables efficient feature information flow between different slices, thereby outperforming models based on the FCN method [16]. The Point-wise Multi-scale Fusion Network (PMF-Net) has been proposed to address these challenges by effectively integrating multi-scale features with a point-wise fusion mechanism. PMF-Net improves segmentation performance by capturing both fine-grained details and long-range dependencies, making it suitable for more complex medical image segmentation tasks, especially when contextual information is crucial [31,32]. Although deep learning technology has recently been applied to prostate image segmentation, its accuracy has not yet fully met clinical application requirements. Therefore, algorithms must be adjusted to address the issues of blurred boundaries and insufficient spatial information in the prostate MRI data. This study proposes a multi-scale context modeling-based U-Net (MCM-UNet) to effectively address the challenge of segmenting the prostate on T2W images. The main contributions of this study are as follows:

We introduce a novel multi-scale context modeling (MCM) module specifically designed for MRI prostate segmentation. This innovative module enhances pixel representation in boundary areas by minimizing the influence of irrelevant features, thereby improving segmentation results [33];
We employed a first-in-first-out (FIFO) strategy to dynamically adjust the dataset-level feature vectors and select the optimal ones. This strategy enhances the segmentation accuracy, especially in the challenging apex and base regions of the prostate;
We compiled data on 2175 prostate cases from 14 different local hospitals, constituting the largest private prostate dataset to date. The novelty, effectiveness, and robustness of the proposed model was validated using this dataset.

The remainder of this paper is organized as follows. Section 2 details the private prostate dataset and describes the prostate segmentation method. Section 3 describes the experimental setup and evaluates the performance and robustness of the proposed method on both private and public datasets using four assessment metrics. It also presents a visual analysis of the segmentation results and compares them with those of other mainstream segmentation methods. Section 4 discusses the innovations and limitations of the method, and Section 5 summarizes the contents of this article.

2. Dataset and Methods

2.1. Dataset

This retrospective study was reviewed and approved on 4 April 2023 by the Ethical Review Board of Huashan Hospital affiliated with Fudan University. Data were accessed for research purposes on 6 April 2023. These data were strictly anonymized during the collection process and personal information of participants was not available to the authors during the experiment, and the requirement of informed consent was therefore waived. All data for this retrospective study came from 2175 cases in 14 public hospitals in China from March 2012 to March 2022, including 984 scans of healthy prostates and 1191 scans of prostate cancer patients. This dataset is currently the largest known private T2W prostate dataset. All patients with prostate cancer underwent prostate biopsy and were diagnosed with prostate cancer. The pathological diagnoses were performed by hospital-certified pathologists using the Gleason grading system. The dataset includes patients aged 40–85 years, with an average age of 62 years, encompassing both healthy individuals and those at various stages of prostate cancer, from early to advanced stages. A detailed description of the data is presented in Table 1. Notably, Center-1, Huashan Hospital, which is affiliated with Fudan University and is one of China’s top public hospitals, served as our primary data source. All 984 healthy prostate scans originated from Center-1, and the 556 prostate cancer patient scans from this center constituted nearly half of all prostate cancer scans. The remaining 635 cases were sourced from 13 different hospitals that utilized various MRI machines and scanning parameters, thereby ensuring data diversity and verifying the robustness of the segmentation model. All the data were annotated by two experienced radiologists and reviewed by an authoritative imaging expert. The annotating physicians possessed 11 and 13 years of professional clinical diagnostic experience, respectively, while the reviewing expert possessed 31 years of professional experience. Prior to annotation, both experts underwent internships and task-specific training.

All data were independently annotated by two experts, and the correlation between their annotations was assessed using the correlation coefficient (CC) [31], which yielded a value of 0.97. For any annotations in which the correlation coefficient fell below 0.9, the reviewing expert conducted specialized quality control to ensure annotation accuracy and consistency. The correlation coefficient (CC) is defined as follows:

C C_{i} = \frac{\sum_{i = 1}^{n} (X_{i} - \bar{X}) (Y_{i} - \bar{Y})}{\sqrt{\sum_{i = 1}^{n} {(X_{i} - \bar{X})}^{2}} \sqrt{\sum_{i = 1}^{n} {(Y_{i} - \bar{Y})}^{2}}}

(1)

where

X_{i}

and

Y_{i}

denote the annotations of the two experts in the prostate segmentation area in the i-th scan slice.

2.2. Methods

This section first introduces the overall structure of the proposed method, followed by descriptions of the context modeling module, dynamic feature storage module, and loss function. Finally, we describe the evaluation metrics.

2.2.1. Architecture Overview

The proposed MCM-UNet, shown in Figure 2, consists of four main components: a feature encoding module, a context modeling (CM) module, a feature storage module, and a feature decoding module. Our focus is on accurately segmenting the prostate using MRI. The input to the network is a 3D T2W image of the prostate, set as

D = \{D_{1}, \dots, D_{i}, \dots, D_{n}\}

, where

D_{i} \in R^{H \times W}

is the i-th slice, and H and W represent the size of the T2W image slices. The network outputs a probability map for prostate and background slices using nnU-Net as the backbone. As an enhanced version of U-Net, nnU-Net places more emphasis on image pre-processing. Through each encoding step, the network obtained feature maps with semantic information at various scales. To enhance the ability of the network to segment prostate boundaries, we integrated a CM module into every decoding step except for the final layer. Utilizing the deep supervision segmentation results and the upsampling outcomes from each layer of nnU-Net, we conducted attention computations to derive richer intra-image multi-scale contextual semantic features. In the final downsampling layer, we integrated a memory bank that dynamically retained intra-image semantic features from each batch by using a FIFO mechanism. We defined the collection of image-level semantic features within the memory bank as dataset-level semantic features. Combined with the CM module, the memory bank improves the model’s ability to extract intra-image contextual features and capture long-range semantic features across images, enhancing segmentation accuracy, particularly at the prostate apex and base. The CM module and the FIFO feature memory bank are detailed in Section 2.2.2 and Section 2.2.3, respectively.

2.2.2. Context Modeling Module

The CM module refines pixel representation by extracting contextual semantic information in the image, minimizing irrelevant features, and improving boundary segmentation for a high-resolution output. As shown in Figure 3, the CM module has two inputs. We define the two inputs as

f_{s e g}

and

f_{s f}

, where

f_{s e g}

represents the deep supervision feature obtained from the category probability distribution D, which comprises two channels.

f_{s e g} = \sum_{n = 1}^{N_{l}} Nor (D_{l}) \cdot R_{l}, l \in \{0, 1\}

(2)

where the size of

f_{s e g}

is

H \times W \times C

,

H \times W

represents number of regional representations in the current stage, and C denotes the number of channels. l represents the category classification of the foreground and background images obtained by deep supervision. The size of

D_{l}

is

N_{l} \times 1

, which represents the prediction probability of all pixels belonging to l, and Nor denotes the normalization function.

f_{s f}

refers to semantic features at the image level and represents the semantic characteristics of the current training image. We define

f_{s f_{j}}

as the semantic features obtained at each upsampling stage, where

j \in [1, n]

and n represents the number of the upsampling stage. When j = 1,

f_{s f_{1}}

is the result of upsampling the features obtained from the last encoding layer and the result of feature merging and channel compression with the penultimate layer’s skip connection. When

1 < j < n

,

f_{s f_{j}}

is the result of concatenate and channel compression of

f_{s f_{j - 1}}

with the skip connection features of the current stage.

Next, to reduce the impact of unrelated features, all pixels aggregate

f_{s f}

and

f_{s e g}

together. The self-attention mechanism is then utilized to compute the similarity between

f_{s f}

and

f_{s e g}

:

W_{s f} = Softmax (\frac{f_{s f}^{H W \times C} \otimes f_{s e g}^{C \times H W}}{\sqrt{C}})

(3)

where the size of

W_{s f}

is

H W \times H W

and ⊗ stands for matrix multiplication. Finally, the semantic features were aggregated based on similarity to obtain the image-level representation

R_{s f}

:

R_{s f} = W_{s f}^{H W \times H W} \otimes f_{s f}^{H W \times C}

(4)

In each skip connection, we perform identical steps to enhance the image-level features.

2.2.3. First-in-First-Out Feature Update Strategy

Image-level semantic features alone lack robustness for current applications. Our network combines image-level and dataset-level features to improve the robustness and applicability of boundary region features. As shown in Figure 4, we defined the dataset-level semantic feature as

f_{d l}

, which represents the region derived from the training data across the entire dataset.

f_{d l}

is derived from

f_{s f}

and is more robust than

f_{s f}

as it assimilates and continuously updates with more data to yield new dataset-level semantic features throughout the training process. Initially, we establish an

N \times H \times W \times C

memory bank to store

f_{d l}

dynamically, where N is the number of

f_{d l}

and

H \times W \times C

is the size of the

f_{d l}

. Subsequently, we compute the similarity between the image-level semantic features

f_{s f}

and

f_{d l}

, where

f_{s f}

is obtained in the last stage of encoding and the

f_{d l}

is obtained from the memory bank. We then select three

f_{d l}

with the highest similarity with the input

f_{s f}

in the memory bank and perform feature fusion on the three

f_{d l}

and

f_{s f}

to obtain a new

f_{d l}^{'}

:

f_{d l}^{'} = δ (f_{d l_{m}} \oplus f_{d l_{n}} \oplus f_{d l_{h}} \oplus f_{s f})

(5)

where ⊕ denotes the concatenation operation and

δ

is a transform function used to reduce the channels of the input matrix tensors. Simultaneously, the new

f_{d l}^{'}

is pushed into the memory bank, and the

f_{d l_{n}}

is popped out of the memory bank to complete the update of the feature container. This strategy enhances the extraction of semantic features at the dataset level, thereby improving segmentation of the apex and base of the prostate.

2.2.4. Loss Function

During the training of our network, we utilized a combination of the Dice loss and cross-entropy loss as the loss function.

L = L_{d i c e} + L_{C E}

(6)

We calculated the Dice loss for each sample in the batch and determined the average value for that batch, where the Dice loss is defined by the following formula:

L_{d i c e} = - \frac{2}{|K|} \sum_{k \in K} \frac{\sum_{i \in I} u_{i}^{k} v_{i}^{k}}{\sum_{i \in I} u_{i}^{k} + \sum_{i \in I} v_{i}^{k}}

(7)

where u is the softmax output of the network and v is the one-hot encoding of the ground truth segmentation map. Both u and v have the shape

I \times K

, with

i \in I

being the number of pixels in the training patch/batch and

k \in K

the class.

To evaluate the performance of our chosen loss function, we compared it to other commonly used loss functions, including focal loss and Tversky loss. Focal loss is particularly beneficial in cases of class imbalance, as it reduces the relative loss for well-classified examples and focuses more on hard-to-classify examples. Tversky loss, on the other hand, is specifically designed for handling imbalanced datasets and can be adjusted to emphasize either false positives or false negatives. However, despite their advantages, we found that the combination of Dice loss and cross-entropy loss yielded superior results in our specific task. The Dice loss emphasizes overlap, which is crucial for the accurate segmentation of medical images, particularly in small or irregularly shaped regions. The addition of cross-entropy loss further helps to fine-tune the boundary delineation by penalizing large discrepancies between predicted and true labels. This combination provides a balanced approach that not only improves segmentation accuracy but also ensures robustness in handling regions with unclear boundaries. Based on these findings, we believe that the combination of Dice loss and cross-entropy loss is well suited for our study’s objective of medical image segmentation.

2.2.5. Evaluation Metrics

To comprehensively evaluate the performance of our proposed model, we utilized four performance metrics to assess the prostate segmentation results: the Average Symmetric Surface Distance (ASSD), 95% Hausdorff Distance (HD95), Jaccard index, and Dice similarity coefficient (DSC). Their definitions are as follows:

ASSD (A_{i}, B_{i}) = \frac{1}{|S (A_{i})| + |S (B_{i})|} \times D

(8)

D = (\sum_{a \in S (A_{i})} min_{b \in S (B_{i})} ||a - b|| + \sum_{b \in S (B_{i})} min_{a \in S (A_{i})} ||b - a||)

(9)

where

A_{i}

represents the ground truth of the prostate for the i-th sample,

B_{i}

represents the corresponding output from the model, and ASSD is a measure of the average surface distance between the ground truth and segment outputs. The edge pixel set of

A_{i}

is denoted by

S (A_{i})

, and the edge pixel set of

B_{i}

is denoted by

S (B_{i})

.

HD 95 (X, Y) = max \{95_{x \in X}^{t h} d (x, Y), 95_{y \in Y}^{t h} d (X, y)\}

(10)

HD assesses the segmentation quality by calculating the maximum shortest distance between a point on the predicted contour and a point on the target contour. As HD is sensitive to outliers, we employed a more robust variant, HD95, which considers the 95th percentile instead of the absolute maximum. Thus,

d (x, Y)

is the minimum distance from the boundary pixel x to region Y.

Jaccard = \frac{\sum_{i = 1}^{n} X_{i} ⋂ Y_{i}}{\sum_{i = 1}^{n} X_{i} ⋃ Y_{i}}

(11)

The Jaccard index is commonly used to measure the accuracy of segmentation by quantifying the overlap between the predicted segmentation and the ground truth. Here,

X_{i}

represents ground truth for the prostate of the i-th sample and

Y_{i}

denotes the corresponding output from the model.

DSC = \frac{2 T P}{F P + F N + 2 T P}

(12)

Among the overlap-based metrics, we utilize the well-known DSC, which ranges from 0% (no overlap) to 100% (complete overlap). Here, TP denotes true positives, TN denotes true negatives, FP denotes false positives, and FN denotes false negatives.

3. Experiments and Results

In this section, we describe the implementation of our experiment. We then present the results of our methods, offering both quantitative and qualitative analyses, and compare them with those of other methods. Finally, we conducted an ablation study to analyze the impact of different scenarios on network performance.

3.1. Implementation Details

To evaluate our methods, we used a substantial private dataset comprising 2175 MRI T2W scans and a public dataset known as PROMISE12 [34], which includes 50 MRI T2W scans. The PROMISE12 dataset, released for the MICCAI 2012 Prostate Segmentation challenge, serves as a well-established benchmark for evaluating prostate segmentation methods. The use of this publicly available dataset as an external validation set allows us to assess the generalization and transferability of our model to an independent dataset, thereby strengthening the credibility and robustness of our approach. Within the private dataset, 220 scans were used for external testing. The remaining 1955 scans were subjected to a 5-fold validation process, with the training validation set randomly and equally divided into five groups. In each fold, four groups were used for training, whereas the remaining group was used for validation.

For the implementation, we used a server equipped with a GeForce RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA) with 24 GB of memory. All the experiments were conducted using the PyTorch 2.1.2 framework. MRI scans in the private dataset were interpolated to an isotropic voxel spacing of [0.66 × 0.66 × 5] mm³, followed by Z-score normalization. Subsequently, to train the 2D models, voxel patches were sliced along the axial direction to produce images of size 512 × 512, which served as input data. For the 3D models, each MRI scan was cropped to a voxel patch with dimensions of 320 × 320 × 16, centered around the prostate area.

All training sessions ran for a fixed duration of 1000 epochs, with each epoch comprising 250 training iterations, as recommended by the nnU-Net. The batch size was set to 12 for the 2D models and 2 for 3D models. The learning rate adhered to a ‘poly’ policy, decaying according to the specified formula

{(1 - e p o c h / e p o c h_{max})}^{0.9}

[35]. Stochastic gradient descent was employed as the optimizer, with the Nesterov momentum (

μ

) set to 0.99. Deep supervision was employed to enhance the training efficiency. The loss function combined binary cross-entropy loss and Dice loss in an equal weight (1:1 ratio).

3.2. Result Visualization

Figure 5 shows the segmentation results of the proposed network for the test dataset. The first column of Figure 5 displays the original images, featuring four challenging cases that are difficult to discern, one apex area, two mid-gland areas, and one base area. The boundaries of the prostate were blurred, particularly in the apex and base areas, which were virtually indistinguishable to the naked eye. The second column presents the ground truth, which was manually annotated by the radiology experts. The third column displays the results segmented using MCM-UNet. The fourth column shows the overlap between the segmentation results and the ground truth, where red areas denote complete overlap, green represents false positives (indicative of under-segmentation), and blue signifies false negatives (indicative of over-segmentation). The fifth column compares the boundaries of the segmentation results with the ground truth; the yellow lines represent the ground truth, and the purple lines represent the segmentation outcomes.

Figure 6 presents the 3D visualization results of the proposed method on the test dataset. The first row shows the ground truth, while the second row displays the 3D segmentation results of MCM-UNet. The third row illustrates the overlap between the segmentation results and the ground truth, where the red region indicates perfect overlap, green represents false positives (indicating under-segmentation), and blue signifies false negatives (indicating over-segmentation). From the segmentation results, it can be observed that MCM-UNet successfully segmented the entire prostate organ, with minimal occurrences of under-segmentation or over-segmentation, except in the peripheral areas.

3.3. Comparative Experiment

3.3.1. Quantitative Analysis

To validate the effectiveness of our MCM-UNet, we conducted a comprehensive comparison with state-of-the-art 2D, 3D, and transformer-based medical image segmentation models. Specifically, our model was benchmarked against U-Net and U-Net++ for 2D segmentation, 3D-UNet [36] for 3D segmentation, and Swin-UNet and Trans-UNet for transformer-based segmentation. In additional, nnU-Net served as the baseline for comparison. All models, except nnU-Net, were trained from scratch under conditions identical to those used for MCM-UNet.

The quantitative results presented in Table 2 highlight the superior performance of MCM-UNet over other models. On private datasets, ASSD, HD95, Jaccard index, and DSC for our model were 0.58 voxels, 1.80 voxels, 83.17%, and 91.71%, respectively. Compared to the baseline nnU-Net, our model shows improvements of 0.43 voxels in ASSD, 4.38% in Jaccard index, and 3.58% in mean DSC. On the PROMISE12 prostate segmentation task, our method also demonstrated improvements across all metrics of 0.98 voxels, 2.72 voxels, 3.38%, and 1.15%, respectively.

The CM module and memory bank are crucial in extracting and leveraging both image-level and dataset-level contextual features, enhancing the model’s effectiveness, particularly at the prostate boundary and challenging apex and base areas. This strategic integration significantly boosts the performance across key metrics, underscoring the advanced capabilities of our MCM-UNet in handling complex segmentation tasks.

3.3.2. Qualitative Evaluation

Figure 7 shows the representative results of the proposed MCM-UNet and comparison methods, highlighting its superior accuracy and consistency. Basic models like U-Net, 3D U-Net [36], and U-Net++ faced significant challenges with under-segmentation, particularly in the apex and base regions, where anatomical structures such as the vas deferens were frequently misidentified as the prostate. These models also struggled with class imbalance, as they were not robust enough to handle the varied intensity distribution and small size of prostate structures in certain regions. Furthermore, U-Net and its variants suffered from boundary ambiguity, especially in regions with blurred or poorly defined margins. Swin-UNet and Trans-UNet, while incorporating transformer architecture for better context learning, still showed performance gaps in the apex and base regions. These models also displayed a sensitivity to computational efficiency, as the added transformer layers significantly increased training time and memory consumption without a proportional improvement in segmentation accuracy. nnU-Net, as our baseline, performed relatively well, but still struggled to accurately delineate detailed prostate boundaries, especially in regions with low contrast. In contrast, the proposed MCM-UNet addresses these limitations by effectively utilizing both intra-image and inter-image contextual information, which enables precise segmentation even in regions with unclear boundaries. Furthermore, our model is more robust to class imbalance and computationally efficient, providing superior segmentation across all regions, regardless of prostate shape variations.

3.4. Ablation Study

3.4.1. Hyper-Parameter Ablation Study

In our designed network structure, there are two important hyper-parameters: the number of CM modules n, and the number of features stored in the memory bank m. To study the effects of different parameter settings on the segmentation performance of the network, we set different parameter values to train the network. As shown in Table 3, we set n to one and six. The value of m is 32, 64, 128. We used HD95 and DSC as evaluation metrics to discuss the effects of the parameters on the prostate model. Table 3 shows that when n is 6 and m is 64, our proposed model obtains the lowest HD95 value and the highest DSC value for prostate segmentation.

3.4.2. Network Structure Ablation Study

We conducted an ablation study, as detailed in Table 4, utilizing our private dataset. To ascertain the contributions of the proposed CM module and FIFO feature update strategy with the memory bank (MB), we disabled various components within the entire network and trained the models accordingly. According to Table 4, integrating the CM module into nnU-Net improved the DSC from 88.13% to 88.62%, an increase of 0.49%. Similarly, the FIFO update strategy with the MB enhanced the DSC to 89.21%, indicating an increase of 1.08%. Furthermore, when both the CM module and MB were added to U-Net, the DSC increased to 91.71%, which is an increase of 3.58%. This highlights the significant positive impact of both the CM module and the MB on segmentation performance, particularly noting the enhanced improvements through their synergistic utilization. While the CM module enhances the segmentation of the boundary regions, the MB’s capability to extract dataset-level semantic features substantially benefits the network, especially in improving segmentation at the apex and base of the prostate.

4. Discussion

In this study, we enhanced the network architecture based on classical U-Net and assessed its performance in segmenting prostate MRI T2W images. Accurate segmentation of the prostate is crucial for effective prostate cancer treatment, and provides radiologists with essential indicators for diagnosis and prognosis. We propose a network that utilizes multi-scale context modeling to optimize the extraction of contextual features within each image layer. By integrating a CM module into each layer of skip connections, we reduced the influence of irrelevant features on the segmentation outcomes and enhanced the boundary pixel representation. Additionally, a storage module was incorporated to dynamically store feature vectors through a FIFO mechanism, facilitating the capture of inter-image features and enhancing the segmentation of critical prostate regions, particularly the apex and base. As shown in Table 2 and Figure 7, our method achieves a more accurate segmentation than other common methods and excels in all evaluation metrics.

Currently, most prostate segmentation methods are trained and validated on public datasets or small-scale private datasets, typically containing fewer than 100 samples, such as PROMISE12 and MSD prostate. These datasets are generally limited in size and lack representation for patients with prostate cancer. We collected 2175 T2W MRI scans of the prostate from 14 hospitals, encompassing both healthy individuals and patients with prostate cancer, which were accurately annotated by experienced radiologists. This marks the first instance in which prostate segmentation has been performed and validated on such a large dataset, enhancing the generalization and robustness of our method.

In this study, we utilized the CM module, which enhances the pixel representation of boundaries and reduces the influence of irrelevant features on segmentation outcomes. We investigated the impact of the number of CM modules on performance by comparing the addition of a single CM module to integrating CM modules in every layer through hyper-parameter experimentation. We discovered that layer-by-layer addition not only yields the highest segmentation accuracy but also increases the computational time by only 10% compared to a single-layer addition, a marginal increase. Consequently, we adopted a layer-by-layer addition of CM modules to enhance the segmentation accuracy of the model for the prostate region. Another hyper-parameter in our method is the size of the memory bank, which determines the number of stored image features. According to Table 3, a memory bank size of 64 offers optimal performance without excessive computational demand. Increasing the size to 128 or 256 does not significantly enhance performance but does lead to higher computational loads and reduced training efficiency. This effect occurs because the median number of prostate image layers is 18, allowing the size 64 memory bank to accommodate nearly four groups of distinct data. Larger sizes such as 128 and 256 can store 7 and 14 groups, respectively, but increase computation time during feature similarity calculations and are inefficient at managing long-distance feature relations. Therefore, a memory bank size of 64 is deemed most appropriate.

We trained the model on a private dataset and conducted external validation on both this dataset and the open-access PROMISE12 dataset, achieving favorable results in both cases. When examining Table 2, it is evident that the same method yields different performances on various datasets for HD95 and ASSD. The metrics are higher on PROMISE12 not due to poor model generalization, but because PROMISE12 comprises thin-sliced data with a larger slice count. This affects the HD95 and ASSD performance, though the DSC values remain consistent. In subsequent work, we will incorporate a greater volume of thin-slice data to enhance the diversity of the experimental datasets.

In this study, we focused solely on the T2W modality. However, employing multiparametric MRI, which includes T1-weighted (T1W) images, may enhance prostate segmentation accuracy by providing details that T2W images do not capture. In future studies, we plan to introduce multiparametric MRI images as separate input channels for the network, with each channel representing a different MRI modality. It is important to note that T1W typically offers poor contrast to the prostate tissue. Consequently, channel weighting is crucial during model training to optimize performance.

Regarding the clinical applicability of our method, we foresee several steps necessary to integrate it into clinical environments. First, the model would need to be validated on a diverse set of clinical datasets to ensure its robustness across different populations and scanner configurations. Additionally, integration with existing clinical systems, such as Picture Archiving and Communication Systems (PACS), is essential for seamless usage by radiologists. Real-time processing is another crucial consideration, as clinical workflows demand fast and accurate results. Thus, optimizing the model for inference speed without sacrificing accuracy is a key area for future work. Finally, user interface (UI) considerations are vital for clinical adoption. The model should be incorporated into a user-friendly platform that allows radiologists to easily visualize the segmentation results, make adjustments if necessary, and incorporate the model’s output into their diagnostic process. Designing an intuitive and efficient UI will be essential for ensuring that the tool enhances, rather than disrupts, the workflow in clinical settings.

5. Conclusions

In this study, we introduce a novel network: the multi-scale context modeling module-based UNet (MCM-UNet). This network adopts a multi-scale optimization strategy by integrating a CM module into each skip connection. The CM module enhanced pixel representation in boundary regions by selectively reducing irrelevant features. Furthermore, by employing a FIFO update strategy, feature vectors are dynamically adjusted to capture dataset-level semantic features. Our MCM-UNet demonstrated significant improvements in segmentation performance, notably in boundary regions and in the apex and base regions of the prostate. We trained and evaluated the model using 2175 high-quality clinical prostate images, yielding precise segmentation results. This model provides a reliable tool for enhancing the accuracy of radiation therapy in prostate cancer treatment.

Author Contributions

Conceptualization, D.G.; methodology, J.Z. and X.Z.; software, J.Z.; validation, X.L.; formal analysis, K.Z.; investigation, J.Z.; resources, Y.K.; data curation, H.L.; writing—original draft preparation, J.Z.; writing—review and editing, D.G., X.Z. and X.L.; visualization, Z.Z.; supervision, H.L.; project administration, H.L.; funding acquisition, D.G. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China 82372048 and 82102132, in part by Science and Technology Commission of Shanghai Municipality 22TS1400900, 23S31904100, 24SF1904200, and in part by the Medical Engineering Jiont Fund of Fudan University yg2022-22.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved on 4 April 2023 by the Ethical Review Board of Huashan Hospital affiliated with Fudan University. (HIRB-2023-489).

Informed Consent Statement

These data were strictly anonymized during the collection process and personal information of participants was not available to the authors during the experiment, and the requirement of informed consent was therefore waived.

Data Availability Statement

Data cannot be shared publicly due to patient privacy reasons. The data are available from the Ethics Committee of Huashan Hospital affiliated to Fudan University (contact via Prof. Daoying Geng) and are suitable for researchers who meet the criteria for accessing confidential data.

Acknowledgments

We would like to thank the National Natural Science Foundation of China, the Science and Technology Commission of Shanghai Municipality, and the Institute of Academy for engineering and Technology of Fudan University for their financial support.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DRE	Digital Rectal Examination
PSA	Prostate-Specific Antigen
T2W	T2-Weighted
MRI	Magnetic Resonance Imaging
DWI	Diffusion-Weighted Imaging
DCE	Dynamic Contrast-Enhanced
EBRT	External Beam Radiation Therapy
CNN	Convolutional Neural Networks
FCN	Fully Convolutional Network
ResNet	Residual Network
ViT	Vision Transformer
PMF-Net	Multi-Scale Fusion Network
CC	Correlation Coefficient
T1W	T1-Weighted
UI	User Interface
PACS	Picture Archiving and Communication Systems

References

Rawla, P. Epidemiology of Prostate Cancer. World J. Oncol. 2019, 10, 63–89. [Google Scholar] [CrossRef] [PubMed]
James, N.D.; Tannock, I.; N’Dow, J.; Feng, F.; Gillessen, S.; Ali, S.A.; Trujillo, B.; Al-Lazikani, B.; Attard, G.; Bray, F.; et al. The Lancet Commission on prostate cancer: Planning for the surge in cases. Lancet 2024, 403, 1683–1722. [Google Scholar] [CrossRef] [PubMed]
Bergengren, O.; Pekala, K.R.; Matsoukas, K.; Fainberg, J.; Mungovan, S.F.; Bratt, O.; Bray, F.; Brawley, O.; Luckenbaugh, A.N.; Mucci, L.; et al. 2022 Update on Prostate Cancer Epidemiology and Risk Factors—A Systematic Review. Eur. Urol. 2023, 84, 191–206. [Google Scholar] [CrossRef]
Vikal, S.; Haker, S.; Tempany, C.; Fichtinger, G. Prostate contouring in MRI guided biopsy. Proc. SPIE Int. Soc. Opt. Eng. 2009, 7259, 72594a. [Google Scholar]
Daly, T. Evolution of definitive external beam radiation therapy in the treatment of prostate cancer. World J. Urol. 2020, 38, 565–591. [Google Scholar] [CrossRef]
D’Amico, A.V.; Whittington, R.; Malkowicz, S.B.; Schultz, D.; Blank, K.; Broderick, G.A.; Tomaszewski, J.E.; Renshaw, A.A.; Kaplan, I.; Beard, C.J.; et al. Biochemical outcome after radical prostatectomy, external beam radiation therapy, or interstitial radiation therapy for clinically localized prostate cancer. JAMA 1998, 280, 969–974. [Google Scholar] [CrossRef] [PubMed]
Lemaître, G.; Martí, R.; Freixenet, J.; Vilanova, J.C.; Walker, P.M.; Meriaudeau, F. Computer-Aided Detection and diagnosis for prostate cancer based on mono and multi-parametric MRI: A review. Comput. Biol. Med. 2015, 60, 8–31. [Google Scholar] [CrossRef] [PubMed]
Hung, A.L.Y.; Zheng, H.; Miao, Q.; Raman, S.S.; Terzopoulos, D.; Sung, K. CAT-Net: A Cross-Slice Attention Transformer Model for Prostate Zonal Segmentation in MRI. IEEE Trans. Med. Imaging 2023, 42, 291–303. [Google Scholar] [CrossRef]
Li, C.; Qiang, Y.; Sultan, R.I.; Bagher-Ebadian, H.; Khanduri, P.; Chetty, I.J.; Zhu, D. FocalUNETR: A Focal Transformer for Boundary-Aware Prostate Segmentation Using CT Images. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2023, Proceedings of the 26th International Conference, Vancouver, BC, Canada, 8–12 October 2023; Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda-Mahmood, T., Taylor, R., Eds.; Springer: Cham, Switzerland, 2023; pp. 592–602. [Google Scholar]
Gao, Q.; Rueckert, D.; Edwards, E. An Automatic Multi-Atlas Based Prostate Segmentation Using Local Appearance-Speci(cid:28)c Atlases and Patch-Based Voxel Weighting. In Proceedings of the MICCAI Grand Challenge: Prostate MR Image Segmentation, Nice, France, 1 October 2012; Available online: https://api.semanticscholar.org/CorpusID:12598986 (accessed on 9 February 2025).
Chen, X.; Udupa, J.K.; Bagci, U.; Zhuge, Y.; Yao, J. Medical Image Segmentation by Combining Graph Cuts and Oriented Active Appearance Models. IEEE Trans. Image Process. 2012, 21, 2035–2046. [Google Scholar] [CrossRef] [PubMed]
Grau, V.; Mewes, A.; Alcaniz, M.; Kikinis, R.; Warfield, S. Improved watershed transform for medical image segmentation using prior information. IEEE Trans. Med. Imaging 2004, 23, 447–458. [Google Scholar] [CrossRef] [PubMed]
Cairone, L.; Benfante, V.; Bignardi, S.; Marinozzi, F.; Yezzi, A.; Tuttolomondo, A.; Salvaggio, G.; Bini, F.; Comelli, A. Robustness of Radiomics Features to Varying Segmentation Algorithms in Magnetic Resonance Images. In Image Analysis and Processing. ICIAP 2022 Workshops, Proceedings of the ICIAP International Workshops, Lecce, Italy, 23–27 May 2022; Revised Selected Papers, Part I; Springer: Berlin/Heidelberg, Germany, 2022; pp. 462–472. [Google Scholar] [CrossRef]
Ali, M.; Benfante, V.; Cutaia, G.; Salvaggio, L.; Rubino, S.; Portoghese, M.; Ferraro, M.; Corso, R.; Piraino, G.; Ingrassia, T.; et al. Prostate Cancer Detection: Performance of Radiomics Analysis in Multiparametric MRI. In Image Analysis and Processing-ICIAP 2023 Workshops, Proceedings of the ICIAP International Workshops, Udine, Italy, 11–15 September 2023; Foresti, G.L., Fusiello, A., Hancock, E., Eds.; Springer: Cham, Switzerland, 2024; pp. 83–92. [Google Scholar]
Vieira, J.P.A.; Moura, R.S. An analysis of convolutional neural networks for sentence classification. In Proceedings of the 2017 XLIII Latin American Computer Conference (CLEI), Cordoba, Argentina, 4–8 September 2017; pp. 1–5. [Google Scholar] [CrossRef]
Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef] [PubMed]
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
Chenarlogh, V.A.; Hassanpour, A.; Grolinger, K.; Parsa, V. Performance Analysis of Dilated One-to-Many U-Net Model for Medical Image Segmentation. IEEE Access 2024, 12, 197259–197274. [Google Scholar] [CrossRef]
Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef]
Zaridis, D.I.; Mylona, E.; Tachos, N.; Kalantzopoulos, C.; Pezoulas, V.C.; Koutsouris, D.D.; Matsopoulos, G.K.; Marias, K.; Tsiknakis, M.; Fotiadis, D.I. Assessing the Robustness of nnU-Net in the Detection of Prostate Lesions via Bi-Parametric MRI. In Proceedings of the 2023 IEEE EMBS Special Topic Conference on Data Science and Engineering in Healthcare, Medicine and Biology, St. Julians, Malta, 7–9 December 2023; pp. 33–34. [Google Scholar] [CrossRef]
Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Proceedings of the 4th International Workshop on Deep Learning in Medical Image Analysis, DLMIA 2018 and 8th International Workshop on Multimodal Learning for Clinical Decision Support, ML-CDS 2018 Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Stoyanov, D., Taylor, Z., Carneiro, G., Syeda-Mahmood, T., Martel, A., Maier-Hein, L., Tavares, J.M.R., Bradley, A., Papa, J.P., Belagiannis, V., et al., Eds.; Springer: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
Yu, M.; Pei, K.; Li, X.; Wei, X.; Wang, C.; Gao, J. FBCU-Net: A fine-grained context modeling network using boundary semantic features for medical image segmentation. Comput. Biol. Med. 2022, 150, 106161. [Google Scholar] [CrossRef] [PubMed]
Tang, G.; Müller, M.; Rios, A.; Sennrich, R. Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J., Eds.; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 4263–4272. [Google Scholar] [CrossRef]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems-NIPS’17, Red Hook, NY, USA, 4–7 December 2017; pp. 6000–6010. [Google Scholar]
Santhirasekaram, A.; Winkler, M.; Rockall, A.; Glocker, B. Robust prostate disease classification using transformers with discrete representations. Int. J. CARS 2024, 20, 11–20. [Google Scholar] [CrossRef] [PubMed]
Tang, Y.; Yang, D.; Li, W.; Roth, H.; Landman, B.; Xu, D.; Nath, V.; Hatamizadeh, A. Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis. arXiv 2022, arXiv:2111.14791. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar]
Zabihollahy, F.; Schieda, N.; Krishna Jeyaraj, S.; Ukwatta, E. Automated segmentation of prostate zonal anatomy on T2-weighted (T2W) and apparent diffusion coefficient (ADC) map MR images using U-Nets. Med. Phys. 2019, 46, 3078–3090. [Google Scholar] [CrossRef]
Li, Y.; Wu, Y.; Huang, M.; Zhang, Y.; Bai, Z. Automatic prostate and peri-prostatic fat segmentation based on pyramid mechanism fusion network for T2-weighted MRI. Comput. Methods Prog. Biomed. 2022, 223, 106918. [Google Scholar] [CrossRef] [PubMed]
Donati, O.F.; Jung, S.I.; Vargas, H.A.; Gultekin, D.H.; Zheng, J.; Moskowitz, C.S.; Hricak, H.; Zelefsky, M.J.; Akin, O. Multiparametric Prostate MR Imaging with T2-weighted, Diffusion-weighted, and Dynamic Contrast-enhanced Sequences: Are All Pulse Sequences Necessary to Detect Locally Recurrent Prostate Cancer after Radiation Therapy? Radiology 2013, 268, 440–450. [Google Scholar] [CrossRef] [PubMed]
Ma, X.; Su, J.; Wang, C.; Ci, H.; Wang, Y. Context Modeling in 3D Human Pose Estimation: A Unified Perspective. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6234–6243. [Google Scholar] [CrossRef]
Litjens, G.; Toth, R.; van de Ven, W.; Hoeks, C.; Kerkstra, S.; van Ginneken, B.; Vincent, G.; Guillard, G.; Birbeck, N.; Zhang, J.; et al. Evaluation of prostate segmentation algorithms for MRI: The PROMISE12 challenge. Med. Image Anal. 2014, 18, 359–373. [Google Scholar] [CrossRef] [PubMed]
Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016, Proceedings of the 19th International Conference, Athens, Greece, 17–21 October 2016; Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W., Eds.; Springer: Cham, Switzerland, 2016; pp. 424–432. [Google Scholar]

Figure 1. Challenges in automated prostate segmentation in T2W images. This figure illustrates the primary challenges in the automated segmentation of the prostate in T2W images. Displayed are the original image sequence and the corresponding reference standard for a specific instance. In the reference images, white arrows highlight areas of the prostate with vague and irregular borders. The prostate regions are segmented in red.

Figure 2. Overview algorithm framework of the proposed multi-scale context modeling-based U-Net (MCM-UNet) for prostate segmentation. The original T2W image dataset is fed into the encoder for high-level features. Then, the CM module captures the image-level features from multiple scales. Next, the encoder fuses the dataset-level features in the memory bank with the image-level features obtained after processing by the CM module to obtain global contextual features. Finally, the segmentation probability maps are obtained.

Figure 3. Image-level contextual feature acquisition process. The black dashed box part is the context modeling module.

Figure 4. Schematic diagram of first-in-first-out-based memory bank operation mechanism.

Figure 5. Visual presentation of the segmentation results of MCM-UNet. The first column shows the original images; the second column displays the ground truth annotations by radiologists. The third column presents the prostate segmentation results using our proposed MCM-UNet. The fourth column illustrates the overlap between the segmentation results and the ground truth, with red indicating areas of over-lap, green representing false positives, and blue indicating false negatives. The fifth column compares the boundaries of the segmentation results with the ground truth, where yellow lines represent the ground truth and purple lines indicate the segmentation outcomes.

Figure 6. Visual presentation of the 3D segmentation results of MCM-UNet. The red region indicates perfect overlap, green represents under-segmentation, and blue signifies over-segmentation.

Figure 7. Visualization comparison of segmentation results of different methods in different parts of the prostate. The yellow lines and purple lines are the ground truth and segmentation results, respectively. Each column is a 2D slice image of different samples, where the first three columns represent the apex parts of the prostate, the middle three columns are the mid-gland, and the last three columns represent the base parts of the prostate. Each row is the segmentation results of different methods. From top to bottom: ground truth, our method, U-Net, U-Net++, 3D U-Net, Swin-UNet, Trans-UNet, nnU-Net.

Table 1. The hospitals, MR scanners, and acquisition parameters of the data. * denotes that the case from the center consisted of a healthy prostate.

Muticentre	MR Scanner	Shape	Spacing (mm³)	FOV (mm³)	Training Set	External Test Set	Total Cases
Center-1 *	GE SIGNA EXCITE	512 × 512 × 16	0.625 × 0.625 × 6	320 × 320 × 96	1386	154	1540
	GE Signa HDxt	512 × 512 × 16	0.586 × 0.586 × 7	300 × 300 × 112
	GE Discovery MR750	512 × 512 × 16	0.547 × 0.547 × 4	280 × 280 × 64
	SIEMENS Verio	256 × 256 × 20	0.781 × 0.781 × 3.6	200 × 200 × 72
Center-2	SIEMENS Skyra	640 × 640 × 20	0.312 × 0.312 × 3.6	200 × 200 × 72	48	6	54
Center-3	Philips Ingenia	480 × 480 × 25	0.375 × 0.375 × 3.85	180 × 180 × 96	47	5	52
Center-3	SIEMENS Avanto	512 × 488 × 25	0.429 × 0.429 × 3.6	220 × 210 × 90	47	5	52
Center-4	SIEMENS Skyra	640 × 640 × 24	0.359 × 0.359 × 5.5	230 × 230 × 132	60	7	67
Center-5	SIEMENS Skyra	640 × 640 × 20	0.375 × 0.375 × 4.2	240 × 240 × 84	84	9	93
Center-6	GE Signa HDxt	512 × 512 × 17	0.586 × 0.586 × 4.3	300 × 300 × 73	43	5	48
	Philips Ingenia	480 × 480 × 20	0.437 × 0.437 × 3.3	210 × 210 × 66
Center-7	UIH uMR uMR560	384 × 384 × 21	0.52 × 0.52 × 3.6	200 × 200 × 76	16	2	18
Center-8	GE Signa	512 × 512 × 22	0.391 × 0.391 × 3.5	200 × 200 × 77	48	6	54
Center-8	GE Discovery MR750w	512 × 512 × 24	0.469 × 0.469 × 3.5	240 × 240 × 84	48	6	54
Center-9	GE Discovery MR750w	512 × 512 × 12	0.391 × 0.391 × 4.8	200 × 200 × 58	46	5	51
Center-9	SIEMENS TrioTim	320 × 320 × 16	0.719 × 0.719 × 4.4	230 × 230 × 70	46	5	51
Center-10	GE Signa HDxt	512 × 512 × 20	0.566 × 0.566 × 4.4	290 × 290 × 88	12	2	14
Center-11	GE Signa HDxt	512 × 512 × 24	0.508 × 0.508 × 6.0	260 × 260 × 144	47	6	53
Center-12	GE Signa HDxt	512 × 512 × 20	0.469 × 0.469 × 4.0	240 × 240 × 80	44	5	49
Center-13	SIEMENS Skyra	320 × 320 × 20	0.75 × 0.75 × 3.85	240 × 240 × 77	57	6	63
Center-14	SIEMENS Prisma	320 × 320 × 30	0.812 × 0.812 × 5.2	260 × 260 × 156	17	2	19
Center-14	UIH uMR 770	576 × 576 × 24	0.417 × 0.417 × 6	240 × 240 × 144	17	2	19

Table 2. Quantitative performance comparison of our method with classic medical image segmentation network on private and PROMISE12 datasets.

Method	Private				PROMISE12
Method	ASSD (voxel)	HD95 (voxel)	Jaccard (%)	DSC (%)	ASSD (voxel)	HD95 (voxel)	Jaccard (%)	DSC (%)
U-Net [17]	5.07	18.86	61.85	76.43	2.61	7.89	70.45	81.34
U-Net++ [22]	0.81	2.82	64.79	78.63	1.71	6.76	69.62	80.20
3D U-Net [36]	0.79	1.41	68.37	81.21	1.95	6.68	71.51	82.91
Swin-UNet [28]	0.85	2.23	75.24	85.87	1.32	4.73	70.89	82.43
Trans-UNet [25]	0.83	3.31	71.01	83.05	1.51	6.86	72.74	83.72
nnU-Net [20,21]	1.01	1.73	78.79	88.13	2.05	5.74	78.20	89.32
MCM-UNet	0.58	1.80	83.17	91.71	1.07	3.02	81.58	90.47

Table 3. Impact of hyper-parameter settings on network performance (HD95 and DSC as evaluation metrics).

Parameters		Prostate Segmentation
m	n	HD95 (voxel)	DSC (%)	Times (epoch/s)
1	32	3.86	89.27	33.1
6	32	1.98	91.24	35.3
1	64	3.52	90.32	34.2
6	64	1.80	91.71	37.6
1	128	3.14	90.25	36.4
6	128	2.82	91.42	40.5

Table 4. Ablation study of comparison with baseline.

Backbone	CM	MB	DSC (%)
nnU-Net			88.13
nnU-Net	✔		88.62
nnU-Net		✔	89.21
nnU-Net	✔	✔	91.71

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhu, J.; Zhang, X.; Luo, X.; Zheng, Z.; Zhou, K.; Kang, Y.; Li, H.; Geng, D. Accurate Prostate Segmentation in Large-Scale Magnetic Resonance Imaging Datasets via First-in-First-Out Feature Memory and Multi-Scale Context Modeling. J. Imaging 2025, 11, 61. https://doi.org/10.3390/jimaging11020061

AMA Style

Zhu J, Zhang X, Luo X, Zheng Z, Zhou K, Kang Y, Li H, Geng D. Accurate Prostate Segmentation in Large-Scale Magnetic Resonance Imaging Datasets via First-in-First-Out Feature Memory and Multi-Scale Context Modeling. Journal of Imaging. 2025; 11(2):61. https://doi.org/10.3390/jimaging11020061

Chicago/Turabian Style

Zhu, Jingyi, Xukun Zhang, Xiao Luo, Zhiji Zheng, Kun Zhou, Yanlan Kang, Haiqing Li, and Daoying Geng. 2025. "Accurate Prostate Segmentation in Large-Scale Magnetic Resonance Imaging Datasets via First-in-First-Out Feature Memory and Multi-Scale Context Modeling" Journal of Imaging 11, no. 2: 61. https://doi.org/10.3390/jimaging11020061

APA Style

Zhu, J., Zhang, X., Luo, X., Zheng, Z., Zhou, K., Kang, Y., Li, H., & Geng, D. (2025). Accurate Prostate Segmentation in Large-Scale Magnetic Resonance Imaging Datasets via First-in-First-Out Feature Memory and Multi-Scale Context Modeling. Journal of Imaging, 11(2), 61. https://doi.org/10.3390/jimaging11020061

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Accurate Prostate Segmentation in Large-Scale Magnetic Resonance Imaging Datasets via First-in-First-Out Feature Memory and Multi-Scale Context Modeling

Abstract

1. Introduction

2. Dataset and Methods

2.1. Dataset

2.2. Methods

2.2.1. Architecture Overview

2.2.2. Context Modeling Module

2.2.3. First-in-First-Out Feature Update Strategy

2.2.4. Loss Function

2.2.5. Evaluation Metrics

3. Experiments and Results

3.1. Implementation Details

3.2. Result Visualization

3.3. Comparative Experiment

3.3.1. Quantitative Analysis

3.3.2. Qualitative Evaluation

3.4. Ablation Study

3.4.1. Hyper-Parameter Ablation Study

3.4.2. Network Structure Ablation Study

4. Discussion

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI