1. Introduction
Medical image segmentation is a core task in biomedical image processing: the precise delineation and localization of specific structures or regions within medical images [1]. It underpins a range of clinical applications, including lesion detection and quantification, analysis of anatomical structures, and surgical planning and navigation. Recent advances in deep learning have driven automated segmentation techniques across modalities: in brain MRI analysis, Zhang et al. [2] developed a Modified Recurrent Residual Attention U-Net achieving a 94.3% Dice score for tumor segmentation through spatial-channel attention fusion, while Li et al. [3] proposed a hybrid transformer–CNN architecture for pancreatic CT segmentation with 89.2% vessel boundary accuracy. These results demonstrate how adaptive feature fusion and hybrid architectures can address modality-specific challenges.
Notably, cell nucleus segmentation presents technical demands distinct from those of organ-level segmentation. While Caicedo et al. [4] have contributed extensively to stained nucleus segmentation through their 2018 Data Science Bowl dataset with enhanced edge contrast, Uka et al. [5] explored techniques for unstained brightfield images in microfluidic chambers. This dichotomy mirrors the modality adaptation challenges seen in cross-organ segmentation [3], but with added complexity: nuclear boundaries exhibit <1 μm transitional zones compared to >5 mm tumor margins [2], and stain-induced domain shifts exceed typical cross-modality variations [6].
The nucleus, a cell’s central component, stores genetic information and regulatory mechanisms, which makes it central to understanding cellular structure and function and to investigating the mechanisms underlying diseases. Through accurate segmentation and quantitative analysis of cell nuclei, clinicians and researchers can obtain data on the contours, counts, locations, and distributions of nuclei. This, in turn, supports precise judgments in cancer treatment, genetic disease diagnosis, cytological investigation, and clinical decision-making. Nevertheless, accurate cell nucleus segmentation remains a formidable challenge because medical images are intricate and noisy. These images typically exhibit low contrast, indistinct boundaries (~2–5 pixel transitional zones), and complex tissue architectures, all of which complicate cell nucleus segmentation. Moreover, variations in nuclear shape (polymorphism index up to 0.38 [7]), size (5–50 μm diameter range), and hue (stain color deviation ΔE > 15 in CIELAB space [6]) across tissues and pathological conditions add further difficulty. Hence, the development of efficient and precise cell nucleus segmentation methods is of pivotal importance in medical image analysis and diagnosis.
In recent years, many deep learning-based image segmentation models have emerged, achieving significant success across vision tasks. The fully convolutional network (FCN) is a neural network architecture designed primarily for image segmentation [2,3,4,5,6]. Unlike conventional convolutional neural networks, FCN consists solely of convolutional and pooling layers and omits fully connected layers, which allows it to produce a segmentation output with the same dimensions as the input image. FCN also incorporates an upsampling operation to restore feature maps to the original input dimensions. This design allows FCN to classify each pixel of the input image, thereby achieving pixel-level segmentation.
U-Net, a seminal image segmentation model, was introduced by Ronneberger et al. in 2015 [7]. A pivotal component of U-Net is the skip connection, which links corresponding layers of the encoder and decoder and transmits low-level details to the decoder. Through these skip connections, U-Net combines low-level detail with high-level semantic information, preserving rich detail and context across multiple scales. This is particularly advantageous for multi-scale objects and boundary delineation. However, U-Net has limitations. Firstly, its reliance on fixed-scale pooling and upsampling operations constrains its adaptability to objects of varying scales. Secondly, U-Net may run into memory and computational limits when processing large images. Additionally, its symmetrical encoding–decoding structure may perform suboptimally on fine-grained boundaries and small objects.
DeepLab [8], another pivotal image segmentation model, was originally proposed by Chen et al. in 2016. DeepLab uses dilated convolutions to expand the receptive field. By introducing different dilation rates within the convolution layers, DeepLab achieves multi-scale information fusion, a technique used throughout the DeepLab series with commendable results. Nonetheless, a fixed dilation rate may not suit objects of every scale, which constrains fine-grained segmentation.
In addition to conventional CNN architectures, models based on the transformer architecture [9,10,11,12] have gained prominence. The transformer, known for modeling global dependencies through self-attention, has also been applied to medical image segmentation. FCBFormer [13], a hybrid model merging transformer and CNN, is an architecture for polyp segmentation in colonoscopy images. In this design, the upsampling branch employs deconvolution layers to upsample low-resolution feature maps to the input image dimensions, preserving spatial information. Meanwhile, the downsampling branch employs convolutional layers to downsample high-resolution feature maps to match the dimensions of the upsampling branch, reducing computational and memory demands. Skip connections link the two sub-branches, combining feature maps across different scales. This configuration enables the FCB branch to process images of varying scales and yield high-quality, dense predictions. This paper combines the structural attributes of DeepLab and FCBFormer to tackle the challenge of multi-scale feature fusion.
Chen et al. [14] framed algorithm discovery as program search and applied it to discover optimization algorithms for deep neural network training. Using efficient search techniques, the authors explored the infinite and sparse program space, adding program selection and simplification strategies to bridge the large generalization gap between proxy and target tasks. This work led to the discovery of a simple yet highly effective optimization algorithm known as Lion (EvoLved Sign Momentum) [14]. The Lion optimizer uses the sign function to compute updates of equal magnitude: the sign operation maps each component of the update, whether positive or negative, to +1 or −1, which simplifies the computation. Because the sign operation treats every component uniformly, the model can exploit the potential of all components, which improves its generalization. Nonetheless, once the model approaches convergence, each component has exploited its potential, and a shift towards a gradient-based update strategy becomes necessary. In this paper, we introduce the Lion optimizer to medical image segmentation and combine sign-based and gradient-based updates. Early-stage training adopts the sign-based update to fully explore the model’s potential; once stability is achieved, the gradient-based update takes over.
The loss function plays a pivotal role in deep learning, measuring the disparity between model predictions and actual labels. As deep learning has progressed, a growing array of loss functions has been applied to diverse tasks. For instance, the cross-entropy loss is widely used in classification, whereas the Dice loss [15] is common in segmentation. Nonetheless, existing loss functions have limitations. Notably, in boundary detection tasks, commonly used losses such as binary cross-entropy (BCE) loss or Dice loss do not adequately penalize boundary misalignment. To address this limitation, Bokhovkin and Burnaev introduced a boundary loss function [16] that penalizes boundary misalignment more effectively and thereby improves segmentation quality. This paper introduces the boundary loss function to medical image segmentation. During the initial stages of training, the conventional loss function is used to build basic region segmentation ability. It is then complemented by the boundary loss function, which corrects the boundary region more finely, improving the precision of boundary extraction and, by extension, the accuracy of medical image segmentation.
The remainder of this paper is structured as follows: Section 2 reviews related work, Section 3 describes the principal methods and details of the segmentation model, Section 4 presents implementation details and experimental results, and Section 5 discusses the study’s limitations and concludes.
3. Materials and Methods
This study used the 2018 Data Science Bowl dataset, which is publicly available from Kaggle. In the preceding section, we discussed related work on cell nucleus segmentation in medical image analysis. In this section, we present our proposed methodology, which fuses the Hollow Convolution Branch (HCB) and the Full Convolution Branch (FCB), and describe our approach to loss function design and optimizer fusion. The following subsections explain these methods in turn, focusing on key components.
3.1. The AL-Net Architecture
The architecture of the cell nucleus segmentation model presented in this paper is shown in Figure 1. The model comprised a Hollow Convolution Branch (HCB) and a Full Convolution Branch (FCB). To address suboptimal cell nucleus segmentation across multiple scales, we adopted the FCB [13], which captured cell nucleus information at varying scales by combining feature maps of different dimensions. Alongside it, we incorporated the DeepLabV3+ module, in which dilated convolutions enlarged the convolution kernel’s receptive field and so captured broader contextual information. By applying dilated convolutions with different dilation rates at different levels, we achieved multi-scale information fusion, improving our ability to handle image features of varying scales.
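As a small illustration of why dilation widens the receptive field, the sketch below computes the span covered by one dilated kernel along one axis, using the standard relation k + (k − 1)(r − 1) for kernel size k and dilation rate r. The dilation rates in the loop are illustrative assumptions, not values reported in this paper.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span in pixels covered by one dilated kernel along one axis."""
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive")
    # A dilation of r inserts (r - 1) gaps between adjacent kernel taps.
    return kernel_size + (kernel_size - 1) * (dilation - 1)


if __name__ == "__main__":
    # Hypothetical dilation rates chosen only for illustration.
    for rate in (1, 6, 12, 18):
        span = effective_kernel_size(3, rate)
        print(f"3x3 kernel, dilation {rate}: covers {span}x{span} pixels")
```

The number of weights stays at 3 × 3 in every case; only the spacing between taps grows, which is how context is gained without extra parameters.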
To extract more comprehensive feature information, we augmented the output of the DeepLabV3+ module to 32 layers, relocating the sigmoid operation to the Post-Processing (PH) module. Specifically, the initial input image underwent preprocessing, resulting in a 3 × 256 × 256 image, which was subsequently processed through both the DeepLabV3+ module and the FCB module. Since the raw output of DeepLabV3+ has a lower spatial resolution, it was first upsampled to 256 × 256 to match the resolution of the FCB output. This yielded two 32 × 256 × 256 feature maps, subsequently concatenated to produce a single 64 × 256 × 256 feature map. The PH module was employed to fine-tune the number of feature map layers, culminating in the ultimate output result.
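The shape bookkeeping in this paragraph can be traced with a few lines of plain Python. This is only a sketch of the tensor shapes, not of the network itself: the first dimension is what the text calls the number of layers, and the 64 × 64 resolution of the raw DeepLabV3+ output is a hypothetical value, since the text states only that it is lower than 256 × 256.

```python
from typing import Tuple

Shape = Tuple[int, int, int]  # (feature layers, height, width)


def upsample_to(shape: Shape, height: int, width: int) -> Shape:
    """Spatial upsampling changes height and width, not the layer count."""
    return (shape[0], height, width)


def concat_layers(a: Shape, b: Shape) -> Shape:
    """Concatenation along the layer axis requires equal spatial sizes."""
    if a[1:] != b[1:]:
        raise ValueError(f"spatial sizes differ: {a[1:]} vs {b[1:]}")
    return (a[0] + b[0], a[1], a[2])


if __name__ == "__main__":
    hcb_raw: Shape = (32, 64, 64)      # assumed low-resolution DeepLabV3+ output
    fcb_out: Shape = (32, 256, 256)    # FCB output as stated in the text
    hcb_out = upsample_to(hcb_raw, 256, 256)
    fused = concat_layers(hcb_out, fcb_out)
    print(fused)  # (64, 256, 256)
```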
3.2. Fully Convolutional Branch
The configuration of the FCB (fully convolutional branch) module, illustrated in Figure 2, resembles the U-Net architecture but incorporates several more advanced modules and methods. The basic downsampling module used a convolution with a stride of 2, a choice intended to limit information loss. The upsampling module used nearest-neighbor interpolation (‘nearest’). Operating on inputs of dimensions 3 × 256 × 256, this module generated three sets of outputs, each measuring 32 × 256 × 256. Furthermore, the Post-Processing (PH) component, inherited from FCBFormer, fine-tuned the layer count and handled information fusion within the final feature map. It comprised two RB modules followed by a 1 × 1 convolution that adjusted the layer count. The final output was obtained through the sigmoid activation function.
3.3. Loss Function
In the context of nucleus segmentation, the judicious selection of an appropriate loss function assumes paramount significance in the model’s training and performance. To holistically address both nucleus segmentation and boundary delineation, we adopted a weighted loss function. This function seamlessly integrated the Dice loss function and the boundary loss function, thereby augmenting the model’s capacity for precise nucleus segmentation.
3.3.1. Dice Loss Function
The Dice loss function is widely used for segmentation tasks and quantifies the similarity between the predicted and actual segmentation outcomes. This similarity is computed as twice the size of the intersection of the predicted and true segmentation results divided by the sum of their sizes. We used the Dice loss function to assess the overall accuracy of nucleus segmentation and to encourage the model to learn the shape and positional information of the nucleus more effectively. It is expressed in Equation (1):
$$\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}, \qquad L_{Dice} = 1 - \mathrm{Dice}(X, Y) \tag{1}$$
where $|X \cap Y|$ denotes the size of the intersection of sets $X$ and $Y$, and $|X|$ and $|Y|$ denote the respective cardinalities of $X$ and $Y$. The numerator carries a coefficient of 2 because the denominator counts the elements common to $X$ and $Y$ twice. The Dice coefficient ranges from 0 to 1, with values closer to 1 indicating a higher degree of overlap between the segmentation result and the ground truth label; the Dice loss correspondingly approaches 0 as accuracy improves.
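To make the overlap measure concrete, the following is a minimal sketch of a Dice coefficient and the corresponding loss (taken here as one minus the coefficient) on flattened masks. The function names and the smoothing constant `eps` are assumptions introduced for illustration; the paper does not specify its implementation.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[float], target: Sequence[float],
                     eps: float = 1e-7) -> float:
    """2 * |X intersect Y| / (|X| + |Y|) on flattened masks or probabilities."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # eps (an assumed smoothing term) avoids division by zero on empty masks.
    return (2.0 * intersection + eps) / (total + eps)


def dice_loss(pred: Sequence[float], target: Sequence[float],
              eps: float = 1e-7) -> float:
    """Loss decreases towards 0 as the overlap increases."""
    return 1.0 - dice_coefficient(pred, target, eps)


if __name__ == "__main__":
    pred = [1, 1, 0, 0]    # two pixels predicted as nucleus
    target = [1, 0, 0, 0]  # one pixel labelled as nucleus
    # Intersection is 1 and |X| + |Y| is 3, so the coefficient is about 2/3.
    print(dice_coefficient(pred, target), dice_loss(pred, target))
```

Because the products `p * t` also work for probabilities in [0, 1], the same expression can serve as a soft Dice loss during training.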
3.3.2. Boundary Loss Function
To improve the accuracy of nucleus boundary segmentation, we introduced a boundary loss function, which evaluates the consistency between the predicted boundary and the actual boundary. We adopted the boundary loss function proposed by Bokhovkin et al. to guide the model in learning the fine features of nucleus boundaries. First, precision and recall for boundary prediction are defined as in Equation (2):
$$P^{c} = \frac{1}{|B_{pd}^{c}|} \sum_{x \in B_{pd}^{c}} [\![\, d(x, B_{gt}^{c}) < \theta \,]\!], \qquad R^{c} = \frac{1}{|B_{gt}^{c}|} \sum_{x \in B_{gt}^{c}} [\![\, d(x, B_{pd}^{c}) < \theta \,]\!] \tag{2}$$
where $c$ denotes the category, $B_{pd}^{c}$ and $B_{gt}^{c}$ denote the boundaries of the prediction map and the label map, respectively, $[\![\cdot]\!]$ is the indicator function, $d(\cdot)$ is the Euclidean distance measured in pixels, and $\theta$ is a predefined threshold on that distance. The boundary score of category $c$ and the corresponding boundary loss function are then given by Equation (3):
$$BF_{1}^{c} = \frac{2\,P^{c} R^{c}}{P^{c} + R^{c}}, \qquad L_{BF_1}^{c} = 1 - BF_{1}^{c} \tag{3}$$
This score is the harmonic mean of precision and recall, and the purpose of the boundary loss function is to raise both jointly, thereby improving the agreement between predicted and actual boundaries. Such optimization helps the model learn more precise boundary features, ultimately improving the quality of segmentation results.
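The boundary precision, recall, and their harmonic mean can be illustrated on small binary masks. The sketch below is a hard, non-differentiable version intended only to show what the score measures; the loss used for training must be differentiable, and the source does not describe that implementation. The 4-neighbour boundary definition, the function names, and the distance threshold `theta` are all assumptions.

```python
from math import hypot
from typing import List, Set, Tuple

Mask = List[List[int]]
Pixel = Tuple[int, int]


def boundary_pixels(mask: Mask) -> Set[Pixel]:
    """Foreground pixels with a 4-neighbour that is background or off-image."""
    h, w = len(mask), len(mask[0])
    found: Set[Pixel] = set()
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]:
                    found.add((r, c))
                    break
    return found


def fraction_within(src: Set[Pixel], dst: Set[Pixel], theta: float) -> float:
    """Share of src pixels lying closer than theta (Euclidean) to any dst pixel."""
    if not src:
        return 0.0
    hits = sum(
        1 for (r, c) in src
        if any(hypot(r - r2, c - c2) < theta for (r2, c2) in dst)
    )
    return hits / len(src)


def boundary_f1_loss(pred: Mask, gt: Mask, theta: float = 2.0) -> float:
    """1 - harmonic mean of boundary precision and recall."""
    b_pd, b_gt = boundary_pixels(pred), boundary_pixels(gt)
    precision = fraction_within(b_pd, b_gt, theta)
    recall = fraction_within(b_gt, b_pd, theta)
    if precision + recall == 0.0:
        return 1.0
    return 1.0 - 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    def square(top: int, left: int) -> Mask:
        return [[1 if top <= r < top + 3 and left <= c < left + 3 else 0
                 for c in range(6)] for r in range(6)]

    gt, pred = square(1, 1), square(1, 2)   # prediction shifted by one pixel
    print(boundary_f1_loss(pred, gt, theta=1.0))  # 0.5: only exact matches count
    print(boundary_f1_loss(pred, gt, theta=2.0))  # 0.0: a one-pixel shift is tolerated
```

The two printed values show how the threshold sets the tolerance: a one-pixel shift is penalized when only exact matches count and forgiven once the threshold exceeds the shift.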
3.3.3. Loss Function Fusion Method
The loss function fusion method combines distinct loss functions to exploit their individual strengths. For nucleus segmentation, we propose a fusion method that selects the loss function appropriate to the task objective of each training phase. The formal definition is given in Equation (4):
$$L = \begin{cases} L_{Dice}, & \text{epoch} < \text{threshold} \\ \alpha L_{Dice} + \beta L_{BF_1}, & \text{epoch} \geq \text{threshold} \end{cases} \tag{4}$$
where $L_{Dice}$ is the Dice loss, $L_{BF_1}$ is the boundary loss, and the weight coefficients α and β regulate the relative importance of the two loss functions. By adjusting these parameters, we could calibrate the emphasis placed on overall nucleus segmentation versus boundary segmentation to match the requirements of our tasks. The parameter ‘threshold’ is a hyperparameter whose value is the epoch at which the model converges during the initial training stage. Specifically, for training epochs below the threshold, we employed the Dice loss function, helping the model learn nucleus shape and positional information. Once the training epoch passed the threshold, we switched to the weighted fusion of the Dice loss function and the boundary loss function, which steered the model more effectively towards learning nucleus boundary features. By balancing the importance of shape, position, and boundary information and drawing on the strengths of both loss functions, we enabled the model to obtain progressively more precise segmentation results across training phases.
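The epoch-dependent switch described above can be sketched as a small function. The default values of `alpha` and `beta` and the handling of the epoch exactly equal to the threshold are assumptions; the text specifies only the behavior below and above the threshold.

```python
def fused_loss(epoch: int, dice: float, boundary: float, threshold: int,
               alpha: float = 1.0, beta: float = 1.0) -> float:
    """Dice loss alone before `threshold`; weighted Dice + boundary loss afterwards.

    `dice` and `boundary` are loss values already computed for the current batch.
    alpha and beta default to 1.0 purely for illustration.
    """
    if epoch < threshold:
        # Early phase: learn overall nucleus shape and position.
        return dice
    # Later phase: add the boundary term to refine contours.
    return alpha * dice + beta * boundary


if __name__ == "__main__":
    for epoch in (0, 49, 50, 80):
        print(epoch, fused_loss(epoch, dice=0.4, boundary=0.2,
                                threshold=50, alpha=0.5, beta=0.5))
```

Note that the total loss value can jump at the switch because a new term enters, so loss curves before and after the threshold are not directly comparable.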
3.4. Optimizer
In the realm of deep learning, the selection of an appropriate optimizer bears utmost significance in shaping the model’s training and subsequent performance. Conventional optimizers encompass the traditional Stochastic Gradient Descent (SGD), alongside adaptive optimizers such as Adam, Adagrad, and RMSprop. Notably, a recent addition to the optimizer repertoire, the Lion optimizer, was introduced in a Google publication. This section serves to elucidate the Lion optimizer and explore avenues for its enhancement.
3.4.1. Lion Optimizer
The Lion optimizer is built on the sign function, which maps each component of the update direction to +1 or −1 according to whether it is positive or negative. The signed direction, together with a weight decay term, is then scaled by the learning rate to give the parameter update. Unlike conventional optimization algorithms, the Lion optimizer uses an update rule that achieves a better balance between weight decay and learning rate, and in certain scenarios it performs better. Through the sign operation, the Lion optimizer treats each parameter component equally, allowing the model to exploit the potential of every component and thereby improving generalization. With large batch sizes, the Lion optimizer converges faster and generalizes better. It is also robust to varying hyperparameter values, and on some tasks it achieves higher accuracy and stability. Formula (5) outlines the formal definition of the Lion optimizer:
$$c_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad \theta_t = \theta_{t-1} - \eta_t \left( \mathrm{sign}(c_t) + \lambda \theta_{t-1} \right), \qquad m_t = \beta_2 m_{t-1} + (1 - \beta_2)\, g_t \tag{5}$$
where $g_t$ is the gradient of the loss function and $\mathrm{sign}(\cdot)$ is the sign function. $\beta_1$ controls the trade-off between the current gradient and the momentum when forming the update direction $c_t$: larger $\beta_1$ values give more consideration to historical gradients, while smaller values pay more attention to the current gradient. $\beta_2$ is the exponential moving average factor used to update the momentum, that is, it controls the decay rate of the moving average of gradients. $\lambda$ is the factor that controls the weight decay, which helps prevent the model from overfitting. $\eta$ is the learning rate, which controls the step size of each parameter update. $\theta$ represents the parameter values of the model, while $m$ represents the value of the momentum. The corresponding pseudocode is shown in Algorithm 1 (Pseudocode of Lion optimizer):
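Algorithm 1 Lion Optimizer
given $\beta_1$, $\beta_2$, $\lambda$, $\eta$, $f$
initialize $\theta_0$, $m_0 \leftarrow 0$
while $\theta_t$ not converged do
    $g_t \leftarrow \nabla_\theta f(\theta_{t-1})$
    update model parameters
    $c_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
    $\theta_t \leftarrow \theta_{t-1} - \eta_t (\mathrm{sign}(c_t) + \lambda \theta_{t-1})$
    update EMA of $g_t$
    $m_t \leftarrow \beta_2 m_{t-1} + (1 - \beta_2)\, g_t$
end while
return $\theta_t$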
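One Lion update can be written out in plain Python on flat parameter lists. The sketch follows the published Lion rule: interpolate momentum and gradient, take the sign, apply decoupled weight decay, then update the momentum. The function name, the list-based interface, and the default learning rate and weight decay are assumptions for illustration; the β defaults of 0.9 and 0.99 are those suggested in the Lion publication.

```python
from typing import List, Tuple


def sign(x: float) -> float:
    """Return +1.0, -1.0, or 0.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def lion_step(theta: List[float], grad: List[float], momentum: List[float],
              lr: float = 1e-4, beta1: float = 0.9, beta2: float = 0.99,
              weight_decay: float = 0.0) -> Tuple[List[float], List[float]]:
    """One Lion update; returns the new parameters and the new momentum."""
    new_theta: List[float] = []
    new_momentum: List[float] = []
    for p, g, m in zip(theta, grad, momentum):
        c = beta1 * m + (1.0 - beta1) * g            # update direction
        new_theta.append(p - lr * (sign(c) + weight_decay * p))
        new_momentum.append(beta2 * m + (1.0 - beta2) * g)  # EMA of gradients
    return new_theta, new_momentum


if __name__ == "__main__":
    theta, m = [1.0, -2.0], [0.0, 0.0]
    grad = [0.5, -3.0]  # gradients of very different magnitude
    theta, m = lion_step(theta, grad, m, lr=0.1)
    # Both parameters move by exactly lr = 0.1, regardless of gradient size.
    print(theta, m)
```

The usage example shows the property the text emphasizes: every component receives a step of the same magnitude, and only its direction depends on the gradient.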
3.4.2. Improvement Method
A distinctive attribute of the Lion optimizer, which is founded on the sign operation, is its capacity to exploit the potential of each parameter component. However, once the model approaches convergence, this potential is exhausted. At that point, it becomes necessary to revert to the conventional gradient-based update strategy, which is more precise and can accommodate custom-tailored training requirements. The difference between the gradient-based update strategy and the sign-based update strategy is illustrated in Figure 3. In the gradient method, the update magnitude is directly proportional to the gradient’s magnitude, whereas in the sign-based method the update magnitude is the same for every component and only its direction depends on the gradient.
In this paper, we introduce an enhanced optimizer built upon the Lion optimizer. Specifically, during the initial training phase, we employed the Lion optimizer with the sign function for parameter updates. This approach allowed us to exploit the Lion optimizer’s characteristics, draw on every parameter component, and provide a more extensive optimization space. As training progressed and the model converged, we removed the sign operation and continued training with the Lion update applied without the sign function. This strategy retained the advantages of the Lion optimizer in the early training stages and further improved performance once the model had converged. Refer to Algorithm 2 (Pseudocode of improved method based on Lion optimizer) for the pseudocode:
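Algorithm 2 Lion Optimizer (Updated)
given $\beta_1$, $\beta_2$, $\lambda$, $\eta$, $f$
initialize $\theta_0$, $m_0 \leftarrow 0$
while $\theta_t$ not converged do
    $g_t \leftarrow \nabla_\theta f(\theta_{t-1})$
    update model parameters
    $c_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
    if epoch < threshold then
        $\theta_t \leftarrow \theta_{t-1} - \eta_t (\mathrm{sign}(c_t) + \lambda \theta_{t-1})$
    else
        $\theta_t \leftarrow \theta_{t-1} - \eta_t (c_t + \lambda \theta_{t-1})$
    end if
    update EMA of $g_t$
    $m_t \leftarrow \beta_2 m_{t-1} + (1 - \beta_2)\, g_t$
end while
return $\theta_t$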
In the pseudocode, we introduced a new hyperparameter called “threshold”, which denotes the epoch by which training with the sign-based Lion update is taken to have converged and which therefore governs when the sign function stops being used.
This section combined the sign-based operation with the gradient-based update strategy. The enhanced approach, built upon the Lion optimizer, uses the sign function in the early phase of training to speed up convergence and then switches to the gradient-based update to ensure model stability and generalization capability. Specific hyperparameter configurations and implementation details may require tuning for particular problems and experimental outcomes to achieve optimal training results. In the subsequent sections, we provide a comprehensive account of our experimental setup and a detailed analysis of the results.
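The switch between the two update rules can be sketched by extending a plain-Python Lion step with an epoch check. This is an illustrative reading of the method, assuming that the update direction after the threshold is the interpolated momentum itself with the sign removed; the function name, list-based interface, and default hyperparameters are hypothetical.

```python
from typing import List, Tuple


def _sign(x: float) -> float:
    return float((x > 0) - (x < 0))


def hybrid_lion_step(theta: List[float], grad: List[float], momentum: List[float],
                     epoch: int, threshold: int, lr: float = 1e-4,
                     beta1: float = 0.9, beta2: float = 0.99,
                     weight_decay: float = 0.0) -> Tuple[List[float], List[float]]:
    """Sign-based Lion update before `threshold`, sign-free update afterwards."""
    new_theta: List[float] = []
    new_momentum: List[float] = []
    for p, g, m in zip(theta, grad, momentum):
        c = beta1 * m + (1.0 - beta1) * g
        # Early epochs: equal-magnitude steps; later epochs: steps scale with c.
        direction = _sign(c) if epoch < threshold else c
        new_theta.append(p - lr * (direction + weight_decay * p))
        new_momentum.append(beta2 * m + (1.0 - beta2) * g)
    return new_theta, new_momentum


if __name__ == "__main__":
    theta, m, grad = [1.0], [0.0], [0.5]
    early, _ = hybrid_lion_step(theta, grad, m, epoch=3, threshold=10, lr=0.1)
    late, _ = hybrid_lion_step(theta, grad, m, epoch=12, threshold=10, lr=0.1)
    print(early, late)  # the early step is 0.1; the late step is 0.1 * 0.05
```

As the example shows, the effective step size can change sharply at the switch because the magnitude of the interpolated momentum is generally not 1, so the learning rate may need separate consideration for the second phase; the text does not state how this was handled.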