Article

CellGAN: Generative Adversarial Networks for Cellular Microscopy Image Recognition with Integrated Feature Completion Mechanism

School of Software, Jiangxi Agricultural University, Nanchang 330045, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(14), 6266; https://doi.org/10.3390/app14146266
Submission received: 1 July 2024 / Revised: 16 July 2024 / Accepted: 17 July 2024 / Published: 18 July 2024

Abstract

In response to the challenges of high noise, high adhesion, and a low signal-to-noise ratio in microscopic cell images, as well as the difficulty that existing deep learning models such as UNet, ResUNet, and SwinUNet have in producing high-resolution segmentations with clear boundaries, this study proposes CellGAN, a semantic segmentation method based on a generative adversarial network with a Feature Completion Mechanism. The method incorporates a Transformer to supplement long-range semantic information. In the self-attention module of the Transformer generator, bilinear interpolation is introduced for feature completion, reducing the computational complexity of self-attention to O(n). Additionally, two-dimensional relative positional encoding is employed in the self-attention mechanism to supplement positional information and facilitate position recovery. Experimental results demonstrate that this method outperforms ResUNet and SwinUNet in segmentation performance on the rice leaf cell, MoNuSeg, and Nucleus datasets, achieving up to 23.45% and 19.90% improvements in the Intersection over Union and Dice similarity metrics, respectively. The method provides an automated and efficient analytical tool for cell biology, enabling more accurate segmentation of cell images and contributing to a deeper understanding of cellular structure and function.

1. Introduction

Microscopic imaging techniques are an indispensable component of biological research [1]. Cells, as the fundamental units of life, contain a wealth of biological information. The application of confocal microscopy [2] and single-molecule techniques [3] to cellular image analysis, enabling precise monitoring of cell size and morphology, holds profound significance for advancing our understanding of various aspects of the life sciences. These advanced technological approaches provide powerful tools for elucidating intracellular structures and functions, thereby driving progress in life science research. Traditionally, the segmentation of cell images required manual processing. However, images obtained under different lighting and observation conditions often exhibit high noise [4], blurred boundaries, and stacked, adhered cells, posing challenges for segmentation and necessitating the involvement of professionals. Computational vision techniques for segmenting microscopic cell images enable rapid and accurate image processing and analysis. Compared to traditional manual observation, these techniques allow for large-scale, automated data acquisition and analysis, significantly improving work efficiency. Fine-grained analysis can be achieved through detailed image processing and feature extraction of microscopic cell images, and the obtained cell data can be used for long-term research and provide data support for biological modeling [5].
In machine vision, medical image segmentation algorithms can be broadly categorized into unsupervised and supervised segmentation. Unsupervised segmentation methods include threshold-based, region-based, and edge-based approaches [6], relying solely on image brightness and color information. Among these, the threshold method is a non-contextual segmentation technique [7] that classifies pixels into two categories based on their grayscale values. Region-growing methods determine seed points and aggregate pixels or sub-regions into larger regions. Edge-based segmentation methods, on the other hand, determine edges by thresholding the image gradient [7]. These unsupervised methods perform well when edges are distinct, but are susceptible to illumination and noise, leading to missing boundaries when noise and artifacts are present. In contrast, supervised segmentation methods, such as probabilistic graphical models and statistical shape models, can better capture organ shapes and produce more accurate results [6].
With the rapid advancement of computational capabilities and the growth of deep learning technologies, deep learning-based image segmentation methods have emerged as the predominant approach in cellular recognition. For instance, DeepCell has efficiently segmented large-scale cellular data [8], while Cellpose, leveraging a UNet backbone network, has achieved good results in the semantic segmentation of cellular images [9]. Nevertheless, the precise computer-based recognition of biological cell microscopy images remains a formidable challenge, primarily due to the multifaceted influences of imaging equipment performance, photobleaching effects, and the intricate morphological features of cells at the microscopic level. To address this issue, this study proposes CellGAN, a semantic segmentation network with a Feature Completion Mechanism based on a generative adversarial network for cell images. The main contributions include the following:
  • Constructing a feature enhancement generator that incorporates a Transformer to supplement long-range semantic information;
  • Introducing bilinear interpolation for feature completion in the self-attention module to reduce computational complexity and enhance inference speed;
  • Employing two-dimensional positional encoding in the self-attention mechanism to supplement positional information and facilitate position recovery;
  • Integrating segmentation loss and discriminator error for a generator accuracy correction strategy to optimize segmentation accuracy continuously.

2. Related Work

With the rapid development of computer vision techniques, deep learning semantic segmentation methods have been widely applied in various fields. Semantic segmentation technology has been applied in agricultural areas such as crop pest and disease leaf segmentation [10], soil clod segmentation in farmland images [11], and locating pruning points of apple trees [12]. In medical image analysis, semantic segmentation technology has also shown great potential. The encoder–decoder architecture of U-Net achieved an outstanding segmentation performance on medical cell datasets [13]. The network realized multi-scale feature fusion through four skip connections, achieving remarkable segmentation results even with only 30 training images in the cell dataset. Apart from U-Net, semantic segmentation architectures, such as DeepLabV3+ [14], with its atrous spatial pyramid pooling (ASPP) module and multi-scale design, have also achieved excellent results. U-Net excels at capturing fine-grained target boundary information in images, while DeepLabV3+ is more suitable for capturing multi-scale semantic context information. UNet++ [15] improved the long connections in UNet, combining long and short connections, enabling the model to benefit from high-level semantic information from long connections and gradient backpropagation optimization from short connections. ResUnet [16] introduced residual connections into the UNet structure, improving gradient propagation and alleviating gradient vanishing and semantic information loss.
In addition to these classic convolutional neural network semantic segmentation architectures, Generative Adversarial Networks (GANs) have gradually been applied to image segmentation tasks. Luc [17] first applied GANs to semantic segmentation by replacing the generator with the Dilated8 module and proposing a way to process adversarial network input by multiplying the label or segmentation network output with the original image to form a multi-channel input. Ramwala [18] incorporated depthwise separable convolutions in the discriminator for efficiency and adopted the Nadam [19] optimization algorithm to accelerate loss function convergence. The ability of CycleGAN [20] to perform pairwise image style transfer has also been used for semantic segmentation, such as the semi-supervised CycleGAN semantic segmentation network proposed by Mondal et al. [10], which leverages cycle consistency constraints to learn bidirectional mappings between unpaired images and segmentation masks. Although the exploration of GANs in semantic segmentation is limited, their promising performance in segmentation tasks demonstrates the great potential of GANs in image segmentation [21].
Although GANs have certain advantages in semantic segmentation tasks, the generator’s capabilities limit their segmentation performance: GANs can optimize generator parameters effectively only when the generated images are of high quality. The earliest semantic segmentation GAN networks used U-Net as the generator, and researchers have since made various improvements based on U-Net. For example, Chen et al. [22] proposed a task-driven GAN-based image segmentation algorithm using U-Net as the generator, while the discriminator employed a multi-scale classification network [23] with different receptive fields to guide the model in generating more detailed information. Guo et al. [24] used DenseU-Net [25] containing Inception [26] modules as the GAN generator, achieving a seamless fusion of features from shallow to deep layers. Although these methods improved network segmentation accuracy to some extent, the convolutional neural network (CNN)-based U-Net encoder and decoder are limited by the inherent receptive field of convolutional kernels, allowing only local neighborhood information to be captured and failing to adaptively capture long-range dependencies in the input content [27]. This limited long-range modeling capability leads to semantic information loss in tasks such as cell segmentation [28,29,30].
To overcome the limitations of convolutional neural networks in microscopy image semantic segmentation, some researchers have attempted to introduce Transformers [31] into this field. The self-attention mechanism [31] has demonstrated excellent performance in computer vision tasks, making the Transformer a popular model architecture. Early Transformers were mainly applied to machine translation tasks in natural language processing [32], while the emergence of Vision Transformers (ViT) [33] proved their effectiveness in computer vision. When the dataset scale reached the 300 million images of JFT-300M, ViT outperformed ResNet [34]. However, training Transformers typically requires massive datasets [35], and computing self-attention for high-resolution images consumes substantial computational resources [36]. To address this issue, Cao et al. [37] combined the Transformer with the U-Net encoder–decoder architecture and proposed SwinUnet, dividing the input image into non-overlapping patches and feeding them into the Transformer encoder. The SwinTransformerBlock enhanced the network’s global modeling capability and long-range semantic information interaction, enabling SwinUnet [37] to outperform U-Net in multi-organ and cardiac segmentation tasks. However, when processing microscopy cell image datasets with limited data, blurred boundaries, and low signal-to-noise ratios, the Transformer alone struggles to model local feature relationships due to the lack of position information, leading to network overfitting. Not only does this introduce additional computation, but it also makes it difficult to achieve ideal segmentation results. This issue still needs further study.

3. Methods

This study proposes a cell semantic segmentation method called CellGAN, based on Generative Adversarial Networks (GANs), to address the cell segmentation task in biological microscopy images. This method mainly includes the following two innovative aspects: the Feature Completion Mechanism in the generator and the Adversarial Error Fusion for Accuracy Rectification strategy.
As shown in Figure 1, the original image undergoes feature extraction and upsampling through the improved U-Net generator network to generate a cell segmentation result image. Subsequently, the generated segmentation image and the Ground Truth annotation image are concatenated with the original image channels, forming a pair of real and fake images, which are input into the discriminator network. The discriminator then discriminates their authenticity and obtains the discrimination error. This discrimination error is backpropagated to optimize the parameters of the generator and discriminator. The generator and discriminator promote each other’s generation and discrimination abilities through adversarial gameplay. After iterative training, this method ultimately generates high-resolution cell segmentation images with clear boundaries.
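To make the adversarial workflow of Figure 1 concrete, the sketch below outlines one possible training step. It assumes a generator that outputs a single-channel logit map, binary floating-point masks, and a weighting coefficient `alpha` corresponding to the balance in Equation (1); it is an illustration of the described procedure, not the authors' released code.

```python
import torch
import torch.nn.functional as F

bce = torch.nn.BCEWithLogitsLoss()

def train_step(image, mask, generator, discriminator, optim_G, optim_D, alpha=0.1):
    """One CellGAN-style training iteration on a batch of images and binary cell masks."""
    # Generator forward pass: per-pixel cell probabilities
    pred = torch.sigmoid(generator(image))

    # Channel-concatenate with the original image to form real (label 1) and fake (label 0) pairs
    real_pair = torch.cat([image, mask], dim=1)
    fake_pair = torch.cat([image, pred.detach()], dim=1)

    # Discriminator update: learn to separate real pairs from generated pairs
    d_real, d_fake = discriminator(real_pair), discriminator(fake_pair)
    loss_D = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    optim_D.zero_grad(); loss_D.backward(); optim_D.step()

    # Generator update: adversarial feedback plus pixel-wise cross-entropy, balanced by alpha
    d_gen = discriminator(torch.cat([image, pred], dim=1))
    loss_adv = bce(d_gen, torch.ones_like(d_gen))
    loss_seg = F.binary_cross_entropy(pred, mask)
    loss_G = loss_adv + alpha * loss_seg
    optim_G.zero_grad(); loss_G.backward(); optim_G.step()
    return loss_G.item(), loss_D.item()
```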
Regarding the generator loss function, this study adopts cross-entropy loss as the evaluation criterion for generation loss. Specifically, this loss function calculates the loss value pixel-by-pixel for each input image. It constructs the generation loss for the entire image by weighted summation, as mathematically expressed in Equation (1).
$$L_{Generator} = L_{Discriminator} + \alpha \left( -\frac{1}{m} \sum_{pixel}^{m} \left[ c \log (p_i) + (1 - c) \log (1 - p_i) \right] \right) \tag{1}$$
where $m$ is the total number of pixels, the summation runs over individual pixels, $c$ is the class to which the pixel belongs in the Ground Truth label image, $p_i$ is the value predicted by the generator at that pixel, and $\alpha$ is the balancing coefficient between the generation loss and the discrimination loss.
The discriminator’s loss function is based on the binary cross-entropy loss, used to discriminate whether the input is a real sample or a generated sample, mathematically expressed in Equation (2).
$$L_{Discriminator} = -\frac{1}{n} \sum_{x}^{n} \left[ y \ln a + (1 - y) \ln (1 - a) \right] \tag{2}$$
where n is the total number of samples, x represents a single sample, y represents the actual label (real sample or generated sample) input to the discriminator, and a represents the probability of authenticity predicted by the discriminator for that sample.
This loss design effectively guides the adversarial training of the generator and discriminator, encouraging the generator to produce high-quality cell segmentation results through the adversarial game between the two networks.

3.1. Feature Completion Mechanism

As shown in Figure 2, the generator adopts a Transformer-based structure for semantic feature completion, consisting of an encoder, a feature extractor, and a decoder. The encoder and decoder each contain four upsampling and four downsampling modules. The downsampling part uses convolution operations for four-time downsampling, and the downsampled feature layers are input into the TransformerBlock for feature extraction. Convolutions are then used for upsampling to achieve feature recovery. This structure combines the following advantages of convolutions and Transformers: the convolution operation avoids the inductive bias caused by large-scale pre-training [38], while the global modeling capability of the Transformer helps capture long-range dependencies and establish global context associations; the TransformerBlock compensates for the limited receptive field of convolutions, enabling the network to search for global information and expand the receptive field; and the convolution operation retains position information, which helps enhance the network’s generalization ability, and helps it avoid overfitting [39]. Compared to pure Transformer architectures (such as SwinUNet [37]), the fusion of convolutions and Transformers provides the network with richer position information, which is beneficial for restoring image details during the upsampling process.
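A structural sketch of this hybrid generator is shown below. It follows the description above (convolutional downsampling, a Transformer block on the compressed features, and convolutional upsampling with skip connections), but the channel widths, number of stages, and the stubbed `TransformerBlock` are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Stand-in for the bottleneck block detailed in Section 3.1.1, so this sketch runs on its own."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Identity()

    def forward(self, x):
        return self.body(x)

def conv_block(c_in, c_out):
    """Two 3x3 convolutions with BatchNorm and ReLU, as in a standard UNet stage."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class HybridGenerator(nn.Module):
    def __init__(self, in_ch=3, out_ch=1, widths=(64, 128, 256, 512)):
        super().__init__()
        self.enc1, self.enc2 = conv_block(in_ch, widths[0]), conv_block(widths[0], widths[1])
        self.enc3, self.enc4 = conv_block(widths[1], widths[2]), conv_block(widths[2], widths[3])
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = TransformerBlock(widths[3])          # long-range semantic completion
        self.up3 = nn.ConvTranspose2d(widths[3], widths[2], 2, stride=2)
        self.dec3 = conv_block(widths[3], widths[2])
        self.up2 = nn.ConvTranspose2d(widths[2], widths[1], 2, stride=2)
        self.dec2 = conv_block(widths[2], widths[1])
        self.up1 = nn.ConvTranspose2d(widths[1], widths[0], 2, stride=2)
        self.dec1 = conv_block(widths[1], widths[0])
        self.head = nn.Conv2d(widths[0], out_ch, 1)            # per-pixel segmentation logits

    def forward(self, x):
        e1 = self.enc1(x)                                      # full resolution
        e2 = self.enc2(self.pool(e1))                          # 1/2
        e3 = self.enc3(self.pool(e2))                          # 1/4
        e4 = self.enc4(self.pool(e3))                          # 1/8, fed to the Transformer
        b = self.bottleneck(e4)
        d3 = self.dec3(torch.cat([self.up3(b), e3], dim=1))    # skip connections restore detail
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)
```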

3.1.1. Transformer Block Feature Completion

Considering the dependence of Transformer networks on large amounts of data, directly using an entire Transformer as the backbone network makes it difficult to train on small datasets (such as rice leaf cells) and introduces excessive parameters, reducing runtime efficiency. Therefore, this method embeds a self-attention module after three convolutional downsampling layers to enhance the ability to capture global semantic information, enabling the network to accurately segment cell boundaries in images where parenchyma cells are mutually adhered and stacked. As shown in Figure 3, the Transformer Block first normalizes the input and then uses multi-head self-attention (MHSA) [31] to capture long-range dependencies; the result is added to the input through a residual connection, passed through a BatchNorm layer for normalization and a 1 × 1 convolution for linear mapping to strengthen semantic features, and finally added to the residual output.
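A minimal sketch of this block, operating on a (B, C, H, W) feature map, is given below. PyTorch's built-in `nn.MultiheadAttention` and LayerNorm stand in for the improved MHSA described next in Section 3.1.2; the head count and normalization details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Normalize -> multi-head self-attention -> residual -> BatchNorm + 1x1 conv -> residual."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.bn = nn.BatchNorm2d(dim)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)       # 1x1 convolution for linear mapping

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                # (B, HW, C) token sequence
        t = self.norm(tokens)                                # normalize the input
        attn, _ = self.mhsa(t, t, t)                         # capture long-range dependencies
        x = x + attn.transpose(1, 2).reshape(b, c, h, w)     # first residual connection
        return x + self.proj(self.bn(x))                     # strengthen semantics, second residual
```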

3.1.2. Self-Attention Mechanism Enhancement Strategy

Traditional self-attention is computed pairwise over all positions, so its cost grows quadratically with the input image size, reaching a computational complexity of O(n²). Although the self-attention matrix is theoretically low-rank for long sequences [40], most pixels in the feature maps of images have similar features, making pairwise self-attention computation redundant and inefficient. This study adopts a more efficient self-attention mechanism, as shown in Figure 4, using a 1 × 1 convolution and bilinear interpolation [41] to downsample the keys and values, avoiding redundant computation by reducing the spatial size from H and W to h and w and lowering the computational complexity to O(n).
The specific computation is shown in Equation (3).
$$\mathrm{SimplifyAttention}(Q, \bar{K}, \bar{V}) = \underbrace{\mathrm{softmax}\!\left( \frac{Q \bar{K}^{T}}{\sqrt{d}} \right)}_{\bar{P}:\; n \times k} \; \underbrace{\bar{V}}_{k \times d} \tag{3}$$
Here, $\mathrm{SimplifyAttention}(Q, \bar{K}, \bar{V})$ denotes the simplified self-attention, $Q$ is the query matrix, $\bar{K}$ is the downsampled key matrix and $\bar{K}^{T}$ its transpose, $\bar{V}$ is the downsampled value matrix carrying the learned information, $\bar{P}$ is the context aggregation matrix, $d$ is the embedding dimension of each head, $n = HW$, and $k$ is the spatial dimension after bilinear interpolation of the input, with $k \ll n$. In addition, for relative position encoding in cell microscopy images, this study adds relative height and width information to construct a two-dimensional relative positional encoding [27]. As shown in Figure 5, taking the pixels $i = (Cell_{ix}, Cell_{iy})$ and $j = (Cell_{jx}, Cell_{jy})$ as an example, the positional encoding is calculated as shown in Equation (4).
$$l_{Cell_i, Cell_j} = \frac{q_{Cell_i}^{T}}{\sqrt{d}} \left( k_{Cell_j} + r^{W}_{Cell_{jx} - Cell_{ix}} + r^{H}_{Cell_{jy} - Cell_{iy}} \right) \tag{4}$$
where $Cell_i$ is the $i$-th pixel of the cell image, with horizontal and vertical coordinates $Cell_{ix}$ and $Cell_{iy}$; $Cell_j$ is the $j$-th pixel, with coordinates $Cell_{jx}$ and $Cell_{jy}$; $l_{Cell_i, Cell_j}$ is the relative positional encoding of the $i$-th and $j$-th pixels; $r^{W}_{Cell_{jx} - Cell_{ix}}$ and $r^{H}_{Cell_{jy} - Cell_{iy}}$ are the relative-width and relative-height embeddings of the two pixels obtained after low-dimensional projection; $k_{Cell_j}$ is the key vector of the $j$-th pixel; and $q_{Cell_i}^{T}$ is the transpose of the query vector of the $i$-th pixel. The multi-head simplified self-attention with two-dimensional relative positional encoding is given in Equation (5).
$$\mathrm{SimplifyAttention}(Q, \bar{K}, \bar{V}) = \underbrace{\mathrm{softmax}\!\left( \frac{Q \bar{K}^{T} + S^{rel}_{H} + S^{rel}_{W}}{\sqrt{d}} \right)}_{\bar{P}:\; n \times k} \; \underbrace{\bar{V}}_{k \times d}, \qquad S^{rel}_{W}[Cell_i, Cell_j] = q_{Cell_i}^{T} r^{W}_{Cell_{jx} - Cell_{ix}}, \quad S^{rel}_{H}[Cell_i, Cell_j] = q_{Cell_i}^{T} r^{H}_{Cell_{jy} - Cell_{iy}} \tag{5}$$
Finally, the multi-head self-attention is calculated as shown in Equation (6).
$$\mathrm{MultiHead}(Q, \bar{K}, \bar{V}) = \mathrm{Concat}(head_1, \ldots, head_h)\, W^{O} \tag{6}$$
Here, $head_i = \mathrm{Attention}(Q_i, \bar{K}_i, \bar{V}_i)$ is the output of the $i$-th head; $W^{O}$ is the output projection matrix; $Q_i$, $\bar{K}_i$, and $\bar{V}_i$ are the query, key, and value matrices of the $i$-th head; and after concatenating all heads, a linear transformation is performed to obtain the output.
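The sketch below illustrates Equations (3)–(6) together: keys and values are reduced with a 1 × 1 convolution followed by bilinear interpolation so that the attention matrix is only n × k with k ≪ n, and a learned two-dimensional relative positional bias (the $S^{rel}_{H}$ and $S^{rel}_{W}$ terms) is added before the softmax. The head count, reduced grid size, maximum relative offset, and embedding initialisation are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMHSA(nn.Module):
    """Multi-head self-attention with bilinearly downsampled keys/values and a 2-D relative bias."""
    def __init__(self, dim, num_heads=4, reduced=(16, 16), max_offset=64):
        super().__init__()
        assert dim % num_heads == 0
        self.heads, self.d = num_heads, dim // num_heads
        self.q_proj = nn.Conv2d(dim, dim, 1)
        self.kv_proj = nn.Conv2d(dim, 2 * dim, 1)             # 1x1 convolution before reduction
        self.out_proj = nn.Conv2d(dim, dim, 1)                # output mapping, W^O in Equation (6)
        self.reduced = reduced                                # target (h, w) for keys and values
        # learnable relative-width and relative-height embeddings, indexed by pixel offset
        self.rel_w = nn.Parameter(torch.randn(2 * max_offset - 1, self.d) * 0.02)
        self.rel_h = nn.Parameter(torch.randn(2 * max_offset - 1, self.d) * 0.02)

    def _rel_bias(self, q, H, W):
        """Content-to-position term q^T (r^W + r^H) between the full query grid and the reduced key grid."""
        hr, wr = self.reduced
        dev = q.device
        qy, qx = torch.meshgrid(torch.arange(H, device=dev), torch.arange(W, device=dev), indexing="ij")
        ky = (torch.arange(hr, device=dev, dtype=torch.float32) + 0.5) * H / hr
        kx = (torch.arange(wr, device=dev, dtype=torch.float32) + 0.5) * W / wr
        ky, kx = torch.meshgrid(ky, kx, indexing="ij")
        centre = self.rel_h.shape[0] // 2
        dy = ((ky.flatten()[None, :] - qy.flatten().float()[:, None]).round().long() + centre).clamp(0, self.rel_h.shape[0] - 1)
        dx = ((kx.flatten()[None, :] - qx.flatten().float()[:, None]).round().long() + centre).clamp(0, self.rel_w.shape[0] - 1)
        r = self.rel_h[dy] + self.rel_w[dx]                   # (n, k, d) relative embeddings
        return torch.einsum("bhnd,nkd->bhnk", q, r)

    def forward(self, x):
        b, c, H, W = x.shape
        n, k = H * W, self.reduced[0] * self.reduced[1]
        q = self.q_proj(x).view(b, self.heads, self.d, n).transpose(-1, -2)   # (B, heads, n, d)
        key, val = self.kv_proj(x).chunk(2, dim=1)
        # bilinear interpolation reduces the spatial size, avoiding redundant pairwise computation
        key = F.interpolate(key, size=self.reduced, mode="bilinear", align_corners=False)
        val = F.interpolate(val, size=self.reduced, mode="bilinear", align_corners=False)
        key = key.view(b, self.heads, self.d, k)                              # (B, heads, d, k)
        val = val.view(b, self.heads, self.d, k).transpose(-1, -2)            # (B, heads, k, d)
        bias = self._rel_bias(q, H, W)                                        # S_H^rel + S_W^rel
        attn = ((q @ key + bias) / self.d ** 0.5).softmax(dim=-1)             # Equation (5)
        out = (attn @ val).transpose(-1, -2).reshape(b, c, H, W)              # concatenate heads
        return self.out_proj(out)                                             # Equation (6)
```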

3.2. Accuracy Rectification Strategy

This method integrates the discrimination error with the generator’s segmentation error to further enhance CellGAN’s ability to recognize cell boundaries and improve performance metrics. The generator and discriminator compete against each other, improving their respective generation and discrimination capabilities. After iterative training, the network generates high-resolution cell segmentation images with clear boundaries, optimizing the segmentation performance. In the discriminator part of the GAN, when the segmentation loss of the improved UNet continues to decrease, the discriminator finds it challenging to provide adequate feedback information due to the lack of prior knowledge. As shown in Figure 6, this study concatenates the generated image and the original image channels, with the real image pair labeled as 1, and the generated image pair labeled as 0, and inputs the concatenated real/fake image pair into the discriminator for discrimination. This approach avoids early convergence of the discriminator during network training. Including the original image provides an additional standard for the GAN discriminator to evaluate the generated results, allowing the discrimination error to be accurately transmitted to the improved UNet, and optimizing the generated image.
This study proposes a novel loss function $L_{Generator}$, composed of a pixel-wise segmentation loss and the discriminator loss $L_{Discriminator}$, to enhance the segmentation capability of the generator. In designing the segmentation loss, we employ cross-entropy loss to compare the predicted probability of each pixel with its actual label, enabling the calculated loss to more directly reflect the quality of the model’s output. The mathematical expression is given in Equation (7).
$$L_{Generator}(G, D) = \underbrace{-\frac{1}{n} \sum_{x}^{n} \left[ y \ln a + (1 - y) \ln (1 - a) \right]}_{L_{Discriminator}} + \alpha \underbrace{\left( -\frac{1}{m} \sum_{pixel}^{m} \left[ c \log (p_i) + (1 - c) \log (1 - p_i) \right] \right)}_{L_{Segmentation}} \tag{7}$$
where $\alpha$ is the balancing coefficient between the two terms, $n$ is the number of discriminator samples, $y$ is the true label input to the discriminator, $a$ is the predicted output of the discriminator, $m$ is the number of pixels, $c$ is the class of the pixel in the label, and $p_i$ is the output value of the generator at that pixel. Adding the discrimination loss and the segmentation loss, balanced by $\alpha$, yields the total loss value of the generator.
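As a compact illustration of Equation (7), the sketch below assumes `d_out` is the discriminator's raw score for the generated image pair, `pred` the generator's per-pixel probability map, and `mask` the binary Ground Truth; using the non-saturating form of the adversarial term is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def cellgan_generator_loss(d_out, pred, mask, alpha=0.1):
    """Total generator loss: adversarial feedback plus alpha-weighted pixel-wise cross-entropy."""
    # Discriminator term: -1/n * sum[y ln a + (1-y) ln(1-a)], with the target set to "real"
    loss_disc = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    # Segmentation term: -1/m * sum[c log(p_i) + (1-c) log(1-p_i)] over all pixels
    loss_seg = F.binary_cross_entropy(pred, mask)
    return loss_disc + alpha * loss_seg
```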

4. Results

4.1. Experimental Environment

The hardware environment used in the experiments is configured with 24 Intel(R) Xeon(R) Gold 6226 CPUs and an NVIDIA Quadro RTX 4000 GPU (16 GB memory). Python 3.7 is employed as the programming language for the software environment. The deep learning computational framework utilized is PyTorch 2.0.0, and CUDA version 11.7 is invoked for GPU acceleration. The model training used the Stochastic Gradient Descent (SGD) optimizer, with a learning rate of 0.001, a batch size of 4, and 1000 iterations.
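For reference, a minimal setup matching the reported hyper-parameters (SGD, learning rate 0.001, batch size 4) might look as follows; the shuffle setting and the absence of momentum or weight decay are assumptions, as they are not stated above.

```python
import torch
from torch.utils.data import DataLoader

def build_training(model, train_dataset):
    """Optimizer and data loader with the hyper-parameters reported above."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)       # SGD, learning rate 0.001
    loader = DataLoader(train_dataset, batch_size=4, shuffle=True)  # batch size 4
    return optimizer, loader
```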

4.2. Datasets

Four microscopy cell image datasets are used in the experiments. The MoNuSeg dataset [42] comprises tissue slice images from patients with tumors in different organs. Because samples are scarce, this dataset is relatively small, and geometric transformations such as rotation and cropping were applied to augment it. The GlandCell dataset [43] includes images of colorectal glandular tissue, initially released as part of the MICCAI 2015 Gland Segmentation Challenge, with images sourced from colorectal gland tissue specimens of multiple patients. The Nucleus dataset [44] consists of cell nucleus images obtained from different cell lines, imaging facilities, and staining methods, with annotations completed manually by a biology expert. A dataset of rice leaf cell images (RiceCell) constructed by the authors is also used. Preprocessing operations such as scaling, cropping, brightness transformation, and the addition of random noise are applied to the network’s input images.
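A hedged sketch of such a preprocessing pipeline, written with torchvision transforms, is shown below; the magnitudes, probabilities, and output size are illustrative assumptions, and in practice the geometric steps must be applied identically to each image and its segmentation mask.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add zero-mean Gaussian noise to a tensor image and clamp back to [0, 1]."""
    def __init__(self, std=0.02):
        self.std = std

    def __call__(self, x):
        return (x + torch.randn_like(x) * self.std).clamp(0.0, 1.0)

# Geometric steps (resize, crop, rotation) must be mirrored on the segmentation mask.
train_transform = transforms.Compose([
    transforms.Resize(286),                  # scaling
    transforms.RandomCrop(256),              # cropping
    transforms.RandomRotation(15),           # rotation, used to augment the small MoNuSeg set
    transforms.ColorJitter(brightness=0.2),  # brightness transformation
    transforms.ToTensor(),
    AddGaussianNoise(std=0.02),              # random noise
])
```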

4.3. Evaluation Metrics

This study adopted three metrics—Intersection over Union (IoU), Dice coefficient (Dice), and Accuracy (Acc)—to evaluate the segmentation performance of the model, with the calculation formulas shown in Equations (8)–(10).
$$IoU = \frac{TP}{TP + FN + FP} \tag{8}$$

$$Dice = \frac{2 \times TP}{2 \times TP + FN + FP} \tag{9}$$

$$Acc = \frac{TP + TN}{TP + FN + TN + FP} \tag{10}$$
where TP (True Positive) represents the overlap between the actual cell area and the predicted cell area; FP (False Positive) represents the region predicted as cells that is actually the background; FN (False Negative) represents the region predicted as the background that is actually cells; and TN (True Negative) represents the overlap between the actual background area and the predicted background area.
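The three metrics can be computed directly from binary prediction and label masks, as in the sketch below; tensors of identical shape and a small epsilon to guard against empty masks are assumed.

```python
import torch

def segmentation_metrics(pred, label, eps=1e-7):
    """IoU, Dice, and Accuracy from binary (0/1) prediction and label masks, per Equations (8)-(10)."""
    pred, label = pred.bool(), label.bool()
    tp = (pred & label).sum().float()        # predicted cell, actually cell
    fp = (pred & ~label).sum().float()       # predicted cell, actually background
    fn = (~pred & label).sum().float()       # predicted background, actually cell
    tn = (~pred & ~label).sum().float()      # predicted background, actually background
    iou = tp / (tp + fn + fp + eps)
    dice = 2 * tp / (2 * tp + fn + fp + eps)
    acc = (tp + tn) / (tp + fn + tn + fp + eps)
    return iou.item(), dice.item(), acc.item()
```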
Furthermore, this study also employed three metrics—Root Mean Square Error (RMSE), Mean Relative Error (MRE), and Coefficient of Determination (R2)—to measure the error between the label values and predicted values, with the calculation formulas shown in Equations (11)–(13).
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{11}$$

$$MRE = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{y_i} \tag{12}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{13}$$
where $y_i$ is the proportion of cell pixels in the label image, $\hat{y}_i$ is the proportion of correctly predicted cell pixels relative to the total number of pixels in the image, $\bar{y}$ is the mean of the label values, and $n$ is the number of samples. RMSE reflects the deviation between predicted and true values and is sensitive to outliers. MRE represents the magnitude of the relative error. R² reflects the goodness of fit between the predicted and true data, with values closer to one indicating a better fit.
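A corresponding sketch for Equations (11)–(13) is shown below, where `y` and `y_hat` are 1-D tensors holding, per test image, the label-derived and prediction-derived cell-pixel proportions defined above.

```python
import torch

def error_metrics(y, y_hat, eps=1e-7):
    """RMSE, MRE, and R^2 between label values y and predicted values y_hat."""
    rmse = torch.sqrt(torch.mean((y - y_hat) ** 2))
    mre = torch.mean(torch.abs(y - y_hat) / (y + eps))
    r2 = 1 - torch.sum((y - y_hat) ** 2) / (torch.sum((y - y.mean()) ** 2) + eps)
    return rmse.item(), mre.item(), r2.item()
```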

4.4. Experimental Results

The model’s loss and Intersection over Union (IoU) metrics on the training and validation sets are comparatively evaluated. As shown in Table 1, with increasing iterations, CellGAN gradually achieves a stable accuracy across all datasets. Due to differences in dataset scale, the smaller MoNuSeg and GlandCell datasets converge after 400 training rounds, whereas the larger RiceCell and Nucleus datasets require 1000 rounds to obtain stable results. Only on the rice leaf cell dataset is the validation set IoU higher than the training set IoU; on the other three datasets (MoNuSeg, GlandCell, and Nucleus) the validation IoU is lower than the training IoU. In particular, the small MoNuSeg and GlandCell datasets and the simple-texture Nucleus dataset exhibit a certain degree of overfitting during training.
Figure 7 presents the segmentation results of several validation set samples to demonstrate the segmentation effect visually. The segmentation results for the rice leaf cell, GlandCell, and Nucleus datasets are relatively ideal, aligning well with the labels. However, in the MoNuSeg dataset, although the cell locations can be roughly located, the cell sizes cannot be accurately segmented due to the insufficient model generalization ability caused by the small dataset size.
This study predicted the test set images and statistically analyzed the relationship between the proportion of correctly predicted cell pixels (t1) and the proportion of cell pixels in the label (t2), as shown in Figure 8. For the rice leaf cell and Nucleus datasets, this method yields small RMSE and MRE values with R² approaching 1, indicating minor relative errors and a good fit between the predicted and true data. In the MoNuSeg and GlandCell datasets, there are more outliers and the network fit is slightly worse.

5. Discussion

5.1. Model Comparative Analysis

This study uses UNet as the baseline; its evaluation metrics are slightly inferior to those of the subsequent improved network models, as shown in Table 2. DeepLabV3+ [14] employs atrous convolution and depthwise separable convolution in the encoder and decoder to encode multi-scale context, achieving a better performance while reducing computational complexity. UNet++ [15] replaces the skip connections of UNet with dense connections, integrating features at different levels, but it performs poorly on the rice leaf cell dataset and is generally inferior to UNet. ResUNet [16] introduces a residual structure in each convolutional block, deepening the network and outperforming UNet. SwinUNet [37] replaces all convolutions in the UNet architecture with SwinTransformer modules, but this weakens position information, and it performs far worse than the other methods on the rice leaf cell dataset, where textures and boundaries are indistinct. UTNet [45] fuses convolutions and Transformers, but its performance is still inferior to CellGAN.
On the rice leaf cell dataset, as shown in Figure 9, compared to UNet, this method improves IoU by 3.52%, Dice by 2.06%, and Acc by 1.55%; compared to the pure Transformer architecture SwinUNet, IoU is improved by 13.93%, Dice by 8.54%, and Acc by 7.86%; and compared to the convolutional–Transformer hybrid architecture UTNet, IoU is improved by 2.21%, Dice by 1.20%, and Acc by 1.19%. Overall, the proposed CellGAN method performs the best. Figure 10 visually demonstrates the differences in segmentation performance of the various models on the rice leaf cell dataset. Although UNet has limitations in modeling long-range relationships, it performs well in fine-grained feature recovery, which is related to its simple 3 × 3 convolutional feature extraction; although the other networks have slightly higher metrics than UNet on the validation set, UNet’s feature extraction for cell boundaries is the most accurate among them. Compared to the other methods, the proposed method achieves the highest Dice coefficient, and its boundary distinction is closest to the label, verifying its effectiveness.

5.2. Ablation Experiment

An ablation experiment is conducted to verify the effectiveness of each module, with the results shown in Table 3. Scheme 1 is the original UNet; Scheme 2 is the semantic segmentation GAN network; Scheme 3 is UNet with the Transformer added; Scheme 4 is a GAN using UNet as the generator; Scheme 5 is a GAN using UNet with the Transformer as the generator; Scheme 6 is UNet with the Transformer and the improved MHSA; and Scheme 7 is a GAN using the fully improved generator. Due to the limited generator capability, Scheme 2 has poor segmentation performance, but it improves greatly once UNet is introduced as the generator (Scheme 4). Scheme 5 uses a GAN generator with a Transformer, and its IoU is improved by 1.36% compared to Scheme 4 without a Transformer. Scheme 7 uses the improved MHSA, with lower computational complexity and two-dimensional positional encoding, in the Transformer; its IoU is improved by 1.91% compared to Scheme 5 without the improved MHSA, with only 43.9% of the latter’s model parameters. Comparing Schemes 6 and 7, Scheme 7 has an IoU improvement of 0.71% over Scheme 6, indicating that introducing the discriminator and discrimination loss can optimize the generated image quality and improve segmentation performance.

5.3. Model Generalization Analysis

To verify the recognition accuracy of CellGAN on different datasets, Table 4 presents comparative results against other models on the MoNuSeg, GlandCell, and Nucleus datasets. CellGAN outperforms the other convolutional or Transformer-based methods on all datasets. On the MoNuSeg dataset, this method achieves an IoU of 63.72% and a Dice of 77.84%, improvements of 22.74% and 19.90%, respectively, over ResUNet, and of 1.44% and 1.18%, respectively, over SwinUNet; on the GlandCell dataset, the IoU is 83.72% and the Dice is 91.14%, improvements of 3.23% and 2.15% over ResUNet and of 4.43% and 3.19% over SwinUNet; on the Nucleus dataset, the IoU is 86.10% and the Dice is 92.53%, improvements of 23.45% and 15.96% over ResUNet and of 0.70% and 0.72% over SwinUNet. The experimental results show that this method achieves stable and superior performance compared to current mainstream methods on various cell microscopy image semantic segmentation datasets, with only the IoU on the GlandCell dataset slightly lower than that of the convolutional–Transformer hybrid architecture UTNet.
Figure 10 demonstrates the good segmentation performance of the CellGAN model on the Nucleus dataset under low-contrast conditions. For instance, the results within the green bounding box indicate that the CellGAN model outperforms the UNet model.

5.4. Model Interpretability Analysis

This study uses Grad-CAM [46] to extract gradient information from the third upsampling layer, showing the model’s attention at different image locations in order to verify the interpretability of the model for the cell semantic segmentation task. As shown in Figure 11, the red areas indicate significant attention, while the blue areas represent redundant feature regions. By transparently overlaying the obtained heatmap on the original image, the attention regions of UNet and of this method can be observed. In the rice leaf cell and GlandCell datasets, UNet can only focus on the approximate region of the cells and cannot clearly distinguish between cells and cell boundaries. In contrast, this method accurately captures cell locations and distinguishes cells from boundaries in all datasets except MoNuSeg, demonstrating the model’s strong modeling capability. The failure to focus on effective cell locations in the MoNuSeg dataset may be related to its small data size.
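For readers who wish to reproduce this kind of visualisation, the sketch below shows a generic Grad-CAM computation using forward and backward hooks on a chosen layer; the choice of target layer (e.g., the generator's third upsampling layer) and a single-image batch are assumptions, and this is a standard implementation of the technique rather than the authors' script.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer):
    """Return a normalized Grad-CAM heatmap for a (1, C, H, W) input image."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    model.zero_grad()
    out = model(image)                      # segmentation logits
    out.sum().backward()                    # gradient of the cell-class score
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)                     # pooled gradients
    cam = F.relu((weights * feats[0].detach()).sum(dim=1, keepdim=True))  # weighted activations
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-7)             # normalize for overlay
```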

6. Conclusions

This study proposes an improved generative adversarial network for cell image semantic segmentation, called CellGAN, which incorporates the Transformer Feature Completion Mechanism. The proposed CellGAN model outperforms the comparative models regarding segmentation performance on the rice leaf cell and three other public datasets of cell microscopy images. The CellGAN generator adopts a Transformer encoder–decoder architecture, facilitating long-range dependency modeling and global semantic information capture. The improved MHSA module introduces bilinear interpolation operations to replace standard dot-product attention computation. It extends the original one-dimensional position encoding to a two-dimensional one, enriching position information expression. The generative adversarial network’s discriminator and discrimination loss can optimize the generated image quality, thereby improving segmentation accuracy, and making cell boundaries clearer and distinct.
However, it is noteworthy that while the improved MHSA module reduces the number of parameters by 1.787 million compared to the original MHSA module, its parameter count in the self-attention computation still significantly exceeds that of convolutional operations. Subsequent work should therefore explore more concise attention calculation methods that maintain CellGAN’s segmentation accuracy while reducing the model’s parameter count and computational complexity, aiming to develop a more lightweight and efficient cell image recognition model.

Author Contributions

Conceptualization, X.L. and W.Y.; methodology, W.Y.; software, X.L.; validation, W.Y.; formal analysis, W.Y.; investigation, X.L. and W.Y.; resources, W.Y.; data curation, X.L.; writing—original draft preparation, X.L. and W.Y.; writing—review and editing, X.L., and W.Y.; visualization, X.L.; supervision, W.Y.; project administration, W.Y.; funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

The study was financially supported by the National Natural Science Foundation of China (Grant No. 61762048); the Natural Science Foundation of Jiangxi Province (Grant No. 20212BAB202015), and the National College Student Innovation and Entrepreneurship Training Program (Grant No. 202410410023X).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data from this study can be obtained upon request from the first or corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lei, X.; Lei, Y.; Li, J.K.; Du, W.X.; Li, R.G.; Yang, J.; Li, J.; Li, F.; Tan, H.B. Immune cells within the tumor microenvironment: Biological functions and roles in cancer immunotherapy. Cancer Lett. 2020, 470, 126–133. [Google Scholar] [CrossRef]
  2. Poole, J.J.; Mostaço-Guidolin, L.B. Optical Microscopy and the Extracellular Matrix Structure: A Review. Cells 2021, 10, 1760. [Google Scholar] [CrossRef]
  3. Magazzù, A.; Marcuello, C. Investigation of Soft Matter Nanomechanics by Atomic Force Microscopy and Optical Tweezers: A Comprehensive Review. Nanomaterials 2023, 13, 963. [Google Scholar] [CrossRef]
  4. Chen, J.; Sasaki, H.; Lai, H.; Su, Y.; Liu, J.; Wu, Y.; Zhovmer, A.; Combs, C.A.; Rey-Suarez, I.; Chang, H.Y.; et al. Three-dimensional residual channel attention networks denoise and sharpen fluorescence microscopy image volumes. Nat. Methods 2021, 18, 678–687. [Google Scholar] [CrossRef]
  5. Palla, G.; Fischer, D.S.; Regev, A.; Theis, F.J. Spatial components of molecular tissue biology. Nat. Biotechnol. 2022, 40, 308–318. [Google Scholar] [CrossRef]
  6. Seo, H.; Badiei Khuzani, M.; Vasudevan, V.; Huang, C.; Ren, H.; Xiao, R.; Jia, X.; Xing, L. Machine learning techniques for biomedical image segmentation: An overview of technical aspects and introduction to state-of-art applications. Med. Phys. 2020, 47, e148–e167. [Google Scholar] [CrossRef]
  7. Kumar, S.N.; Fred, A.L.; Varghese, P.S. An Overview of Segmentation Algorithms for the Analysis of Anomalies on Medical Images. J. Intell. Syst. 2020, 29, 612–625. [Google Scholar] [CrossRef]
  8. Bannon, D.; Moen, E.; Schwartz, M.; Borba, E.; Kudo, T.; Greenwald, N.; Vijayakumar, V.; Chang, B.; Pao, E.; Osterman, E.; et al. DeepCell Kiosk: Scaling deep learning–enabled cellular image analysis with Kubernetes. Nat. Methods 2021, 18, 43–45. [Google Scholar] [CrossRef]
  9. Stringer, C.; Wang, T.; Michaelos, M.; Pachitariu, M. Cellpose: A generalist algorithm for cellular segmentation. Nat. Methods 2021, 18, 100–106. [Google Scholar] [CrossRef]
  10. Mondal, A.K.; Agarwal, A.; Dolz, J.; Desrosiers, C. Revisiting CycleGAN for semi-supervised segmentation. arXiv 2019, arXiv:1908.11569. [Google Scholar] [CrossRef]
  11. Goncalves, J.P.; Pinto, F.A.; Queiroz, D.M.; Villar, F.M.; Barbedo, J.G.; Del Ponte, E.M. Deep learning architectures for semantic segmentation and automatic estimation of severity of foliar symptoms caused by diseases or pests. Biosyst. Eng. 2021, 210, 129–142. [Google Scholar] [CrossRef]
  12. Tong, S.; Zhang, J.; Li, W.; Wang, Y.; Kang, F. An image-based system for locating pruning points in apple trees using instance segmentation and RGB-D images. Biosyst. Eng. 2023, 236, 277–286. [Google Scholar] [CrossRef]
  13. Qian, L.; Zhou, X.; Li, Y.; Hu, Z. Unet#: A Unet-like redesigning skip connections for medical image segmentation. arXiv 2022, arXiv:2205.11759. [Google Scholar] [CrossRef]
  14. Eissa, M.M.; Napoleon, S.A.; Ashour, A.S. DeepLab V3+ Based Semantic Segmentation of COVID-19 Lesions in Computed Tomography Images. J. Eng. Res. 2022, 6, 184–191. [Google Scholar] [CrossRef]
  15. Bousias Alexakis, E.; Armenakis, C. Evaluation of UNet and UNet++ architectures in high resolution image change detection applications. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, 43, 1507–1514. [Google Scholar] [CrossRef]
  16. Liu, Z.; Yuan, H. An Res-Unet method for pulmonary artery segmentation of CT images. J. Phys. Conf. Ser. 2021, 1924, 012018. [Google Scholar] [CrossRef]
  17. Luc, P.; Couprie, C.; Chintala, S.; Verbeek, J. Semantic Segmentation using Adversarial Networks. arXiv 2016, arXiv:1611.08408. [Google Scholar] [CrossRef]
  18. Ramwala, O.A.; Dhakecha, S.A.; Ganjoo, A.; Visiya, D.; Sarvaiya, J.N. Leveraging Adversarial Training for Efficient Retinal Vessel Segmentation. In Proceedings of the 13th International Conference on Electronics, Computers and Artificial Intelligence (ECAI), Pitesti, Romania, 1–3 July 2021; pp. 1–6. [Google Scholar] [CrossRef]
  19. Tato, A.; Nkambou, R. Improving Adam Optimizer. 2018. Available online: https://openreview.net/pdf?id=HJfpZq1DM (accessed on 7 July 2024).
  20. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar] [CrossRef]
  21. You, A.; Kim, J.K.; Ryu, I.H.; Yoo, T.K. Application of generative adversarial networks (GAN) for ophthalmology image domains: A survey. Eye Vis. 2022, 9, 6. [Google Scholar] [CrossRef]
  22. Chen, Z.; Wei, J.; Zeng, X.; Xu, L. Retinal vessel segmentation based on task-driven generative adversarial network. IET Image Process. 2020, 14, 4599–4605. [Google Scholar] [CrossRef]
  23. Ding, Y.; Zhang, Z.; Zhao, X.; Hong, D.; Cai, W.; Yang, N.; Wang, B. Multi-scale receptive fields: Graph attention neural network for hyperspectral image classification. Expert Syst. Appl. 2023, 223, 119858. [Google Scholar] [CrossRef]
  24. Guo, X.; Chen, C.; Lu, Y.; Meng, K.; Chen, H.; Zhou, K.; Wang, Z.; Xiao, R. Retinal vessel segmentation combined with generative adversarial networks and Dense U-Net. IEEE Access 2020, 8, 194551–194560. [Google Scholar] [CrossRef]
  25. Li, X.; Chen, H.; Qi, X.; Dou, Q.; Fu, C.W.; Heng, P.A. H-DenseUNet: Hybrid Densely Connected UNet for Liver and Tumor Segmentation From CT Volumes. IEEE Trans. Med. Imaging 2018, 37, 2663–2674. [Google Scholar] [CrossRef]
  26. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar] [CrossRef]
  27. Schlemper, J.; Oktay, O.; Schaap, M.; Heinrich, M.; Kainz, B.; Glocker, B.; Rueckert, D. Attention gated networks: Learning to leverage salient regions in medical images. Med. Image Anal. 2019, 53, 197–207. [Google Scholar] [CrossRef]
  28. Gao, Y.; Huang, R.; Yang, Y.; Zhang, J.; Shao, K.; Tao, C.; Chen, Y.; Metaxas, D.N.; Li, H.; Chen, M. Focusnetv2: Imbalanced large and small organ segmentation with adversarial shape constraint for head and neck Ct images. Med. Image Anal. 2021, 67, 101831. [Google Scholar] [CrossRef]
  29. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar] [CrossRef]
  30. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar] [CrossRef]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (accessed on 7 July 2024).
  32. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020. [Google Scholar] [CrossRef]
  33. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
  34. Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted res-unet for high-quality retina vessel segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education (ITME), Hangzhou, China, 19–21 October 2018; IEEE: New York, NY, USA; pp. 327–331. [Google Scholar] [CrossRef]
  35. Liu, Y.; Sangineto, E.; Bi, W.; Sebe, N.; Lepri, B.; Nadai, M. Efficient training of visual transformers with small datasets. Adv. Neural Inf. Process. Syst. 2021, 34, 23818–23830. [Google Scholar]
  36. Shi, J.; Wang, Y.; Yu, Z.; Li, G.; Hong, X.; Wang, F.; Gong, Y. Exploiting multi-scale parallel self-attention and local variation via dual-branch transformer-cnn structure for face super-resolution. IEEE Trans. Multimed. 2023, 26, 2608–2620. [Google Scholar] [CrossRef]
  37. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland; pp. 205–218. [Google Scholar] [CrossRef]
  38. Cao, Y.H.; Yu, H.; Wu, J. Training vision transformers with only 2040 images. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland; pp. 220–237. [Google Scholar] [CrossRef]
  39. Salehi, A.W.; Khan, S.; Gupta, G.; Alabduallah, B.I.; Almjally, A.; Alsolai, H.; Siddiqui, T.; Mellit, A. A study of CNN and transfer learning in medical imaging: Advantages, challenges, future scope. Sustainability 2023, 15, 5930. [Google Scholar] [CrossRef]
  40. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar] [CrossRef]
  41. Zhou, R.G.; Wan, C. Quantum image scaling based on bilinear interpolation with decimals scaling ratio. Int. J. Theor. Phys. 2021, 60, 2115–2144. [Google Scholar] [CrossRef]
  42. Sirinukunwattana, K.; Raza, S.E.; Tsang, Y.W.; Snead, D.R.; Cree, I.A.; Rajpoot, N.M. Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE Trans. Med. Imaging 2016, 35, 1196–1206. [Google Scholar] [CrossRef] [PubMed]
  43. Sirinukunwattana, K.; Pluim, J.P.; Chen, H.; Qi, X.; Heng, P.A.; Guo, Y.B.; Wang, L.Y.; Matuszewski, B.J.; Bruni, E.; Sanchez, U.; et al. Gland segmentation in colon histology images: The glas challenge contest. Med. Image Anal. 2017, 35, 489–502. [Google Scholar] [CrossRef]
  44. Phoulady, H.A.; Mouton, P.R. A New Cervical Cytology Dataset for Nucleus Detection and Image Classification (Cervix93) and Methods for Cervical Nucleus Detection. arXiv 2018, arXiv:1811.09651. [Google Scholar] [CrossRef]
  45. Gao, Y.; Zhou, M.; Metaxas, D.N. UTNet: A hybrid transformer architecture for medical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part III 24. Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 61–71. [Google Scholar] [CrossRef]
  46. Huff, D.T.; Weisman, A.J.; Jeraj, R. Interpretation and visualization techniques for deep learning models in medical imaging. Phys. Med. Biol. 2021, 66, 04TR01. [Google Scholar] [CrossRef]
Figure 1. CellGAN Network Framework.
Figure 2. Feature Completion Mechanism of the Generator.
Figure 3. Transformer Block.
Figure 4. Optimized Self-Attention Mechanism.
Figure 5. Two-Dimensional Relative Positional Encoding Strategy.
Figure 6. Generator Accuracy Correction Process.
Figure 7. Recognition Effects on Four Different Datasets.
Figure 8. Comparison of Four Different Datasets.
Figure 9. Model Performance Comparison on the Rice Leaf Dataset.
Figure 10. Experimental Effects on the Nucleus Dataset.
Figure 11. Model Heatmap.
Table 1. Training and Validation Results of the Proposed Model on Different Datasets.

| Dataset | Training Set IoU | Validation Set IoU |
|---|---|---|
| MoNuSeg | 89.11% | 77.50% |
| GlandCell | 94.11% | 83.52% |
| Nucleus | 92.06% | 86.08% |
| RiceCell | 91.50% | 93.21% |
Table 2. Comparison Experiments of Different Models.

| Model | IoU% ↑ | Dice% ↑ | Acc% ↑ |
|---|---|---|---|
| UNet [13] | 89.69 | 94.43 | 95.77 |
| DeepLabV3+ [14] | 90.06 | 94.67 | 95.98 |
| UNet++ [15] | 88.32 | 93.52 | 94.32 |
| ResUNet [16] | 90.24 | 94.82 | 96.23 |
| SwinUnet [37] | 79.28 | 87.95 | 89.46 |
| UTNet [45] | 91.00 | 95.29 | 96.13 |
| CellGAN | 93.21 | 96.49 | 97.32 |

Note: "↑" indicates that CellGAN demonstrates improved performance compared to other models in the comparative results.
Table 3. Ablation Results Comparison.

| Scheme | UNet | Transformer | Improved MHSA | GAN | IoU | Dice |
|---|---|---|---|---|---|---|
| 1 | ✓ | | | | 89.69 | 94.43 |
| 2 | | | | ✓ | 70.26 | 82.53 |
| 3 | ✓ | ✓ | | | 91.00 | 95.29 |
| 4 | ✓ | | | ✓ | 89.94 | 94.71 |
| 5 | ✓ | ✓ | | ✓ | 91.30 | 95.45 |
| 6 | ✓ | ✓ | ✓ | | 92.49 | 96.10 |
| 7 | ✓ | ✓ | ✓ | ✓ | 93.21 | 96.49 |
Table 4. Model Generalization Experimental Results.

| Model | MoNuSeg IoU ↑ | MoNuSeg Dice ↑ | GlandCell IoU ↑ | GlandCell Dice ↑ | Nucleus IoU ↑ | Nucleus Dice ↑ |
|---|---|---|---|---|---|---|
| DeepLabV3+ [14] | 45.80 | 61.82 | 78.41 | 87.55 | 63.20 | 77.02 |
| UNet++ [15] | 47.01 | 63.06 | 80.58 | 88.90 | 63.11 | 76.64 |
| ResUNet [16] | 40.98 | 57.94 | 80.49 | 88.99 | 62.65 | 76.57 |
| SwinUnet [37] | 62.28 | 76.66 | 79.29 | 87.95 | 85.40 | 91.81 |
| UTNet [45] | 18.34 | 30.41 | 83.97 | 91.09 | 65.90 | 79.00 |
| CellGAN | 63.72 | 77.84 | 83.72 | 91.14 | 86.10 | 92.53 |
