1. Introduction
In the development of effective convolutional neural networks (CNNs) for polyp segmentation, numerous approaches have been proposed and have demonstrated satisfactory performance over time [1,2,3]. Traditional deep learning models typically assume that training and testing data are independent and identically distributed. However, in real-world scenarios, especially when deploying segmentation models to predict polyps from entirely new centers, it becomes crucial to segment them accurately regardless of variations in styles, features, shapes, or illumination not present in the training data. Unfortunately, a significant challenge arises from the domain shift problem, where the distribution of data in the testing environment differs from that of the training data, leading to a drop in the performance of trained CNN models [4,5]. Style discrepancy is one of the factors that can impede the generalization ability of deep learning models [6]. The discrepancy in style features between datasets exacerbates the domain shift problem, causing issues in real-time applications such as inaccurate polyp diagnosis and analysis, potentially impacting patient screening and treatment plans. Style discrepancy refers to differences in visual characteristics, such as texture, color, or contrast, between the datasets used for training and testing CNN models. These style features encompass various aspects of image appearance that may vary significantly across different sources or environments. For instance, variations in lighting conditions, imaging equipment, or image processing techniques can lead to distinct visual styles in the data, as shown in Figure 1. The presence of style discrepancies between training and testing datasets can pose a significant challenge to CNN models, as they may struggle to generalize across these divergent visual styles, ultimately leading to performance degradation when deployed in real-world settings.
To address the challenge of domain shift, extensive research has been conducted to develop generalized models capable of performing effectively in novel environments. Domain adaptation is one approach that aims to align the feature distributions between the training source data and the target data in a domain-invariant setting [7]. However, this method typically requires access to target domain samples during training, which may not always be feasible in medical contexts. Alternatively, domain generalization (DG) incorporates multiple domains from various sources without directly using target domain data: it aims to train a model on one or more related domains so that it generalizes directly to any unseen target domain without additional adjustments [8]. The primary objective of DG is to enhance the generalization ability of trained models across diverse domains and facilitate adaptation to new scenarios. Nonetheless, a common assumption in DG is that testing data share the same distribution as the training set, which may not always hold true in real-world medical applications.
Among recent advances in domain generalization (DG) models, those based on data manipulation show promising performance [9]. These models utilize data augmentation techniques with a learning-based approach to generate diverse data, complementing the original data and simulating unseen domains to enhance the learning process. Existing DG methods in data augmentation primarily focus on image-level manipulation in the source domain, such as translating images based on styles from auxiliary datasets [10] or converting images between different training domain styles [11]. In computer vision tasks, applying neural style transfer for image-level data augmentation has been shown to improve robustness against domain shift [12]. Geirhos et al. [6] highlighted that CNNs tend to be biased towards textural features rather than class-specific features, such as shape, leading to challenges in adapting to unseen styles. They proposed stylized versions of datasets using style transfer to enhance accuracy and generalizability.
Similarly, image-level data augmentation using Generative Adversarial Networks (GANs) has been employed, but it cannot be applied universally across tasks [10]. In [13], Adaptive Instance Normalization (AdaIN) was employed to match the mean and variance of the content features with those of the style features. It takes both a content image and a style image as inputs, encoding them into the feature space on the encoder side. These encoded representations are then passed to an AdaIN layer, which adjusts the mean and variance of the content feature maps to match those of the style feature maps, thereby producing stylized feature maps. The final output is generated by a decoder from these stylized feature maps. In contrast, methods such as MixStyle [14] aim to boost style diversity by augmenting features directly at the feature level, generating new styles by blending existing styles from the seen source domains. However, existing style augmentation methods have limitations in fully representing the real styles of unseen target domains, leading to reduced sample diversity and potential performance decrements, particularly when significant differences exist between the generated virtual styles and the real unseen styles.
In this study, we introduce a novel style-based data augmentation module operating at the feature-space level, tailored to the task of polyp generalization. Our approach addresses challenges stemming from differences in style distribution between source and target images, which hinder generalization in the polyp domain, as shown in Figure 2. To tackle this, we extract style statistics, such as the mean and standard deviation, from the early layers of a convolutional neural network (CNN). The basic assumption behind style transfer between polyp images is that the network may develop a confirmation bias towards style features (color and textural information) during learning. Transferring such features facilitates learning style-agnostic representations, which ultimately alleviates the generalization problem. Our proposed method employs a style-aware encoder–decoder UNet network, integrating style information in the feature space. We utilize Adaptive Instance Normalization (AdaIN) to transfer or generate diverse style features, thereby reducing the style gap between the training source and unseen testing sources of polyp images while preserving the original semantic features for segmentation. Through experiments, we demonstrate the effectiveness of transferring style features from unseen target images to source images during training, enhancing model generalizability. Additionally, increasing style diversity by mixing the style features of two images improves model performance. We also introduce a novel approach to generating synthetic yet plausible styles, ensuring minimal deviation in the generated style features. Our primary objective is to mitigate domain shift in polyp segmentation tasks while transferring knowledge from one source domain to multiple unseen domains. This scenario can be viewed as a single-domain generalization problem, wherein the segmentation model is trained on a single polyp dataset and applied to multiple unseen datasets. Extensive experiments conducted on five public polyp datasets validate the efficacy of our proposed method. Additionally, we evaluate our method by comparing it with the style augmentation technique applied at the image level, as demonstrated in the study by Yamashita et al. [15].
The contributions are listed below:
We introduce a novel style-aware UNet approach for the task of polyp generalization. This method enables the model to learn diverse style features from target style source images, thereby enhancing its generalization ability and effectiveness in unseen target sources.
We propose a novel style synthesis module (NSSM) aimed at generating diverse yet plausible style features dynamically during training, while also constraining the transfer of unnecessary and highly deviated styles to the source features.
Our evaluation encompasses five public polyp datasets: Kvasir-SEG [16] (used for training), CVC-ClinicDB [17], CVC-ColonDB [18], ETIS [19], and KvasirCapsule-SEG [20] (used for testing). The experiments conducted demonstrate the effectiveness of our proposed method in the generalization task.
Finally, we conduct qualitative analysis and quantitative studies to validate the efficacy of our method.
3. Methodology
We train the model on a single source domain $\mathcal{D}_S$ and generalize it to multiple unseen target domains $\mathcal{D}_T$. All of the datasets may have different data distributions but share the same label space. To solve the domain shift problem, we propose the style conversion and generation module, which has three parts: the (1) Style Conversion Module, (2) Style Generation Module, and (3) Novel Style Synthesis Module. We utilize the AdaIN method to transfer the styles of the target style images during training. An overview of the proposed method is shown in Figure 3.
3.1. Background
In style transfer, computing instance-specific feature statistics, such as the mean and standard deviation, when normalizing feature tensors is a widely accepted technique known as instance normalization (IN) [34]. Let $x \in \mathbb{R}^{B \times C \times H \times W}$ be a feature tensor, where $B$, $C$, $H$, and $W$ denote the batch size, number of channels, and height and width of the tensor, respectively. Instance normalization (IN) is formulated as

$$\mathrm{IN}(x) = \gamma \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \beta, \tag{1}$$

where $\gamma, \beta \in \mathbb{R}^{C}$ are learnable parameters and $\mu(x)$, $\sigma(x)$ are the mean and standard deviation of the tensor computed across the spatial dimensions within each channel. $\mu(x)$ and $\sigma(x)$ can be computed as:

$$\mu_c(x) = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{chw} \tag{2}$$

and

$$\sigma_c(x) = \sqrt{\frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( x_{chw} - \mu_c(x) \right)^2 + \epsilon}. \tag{3}$$

Adaptive Instance Normalization (AdaIN) was designed to achieve arbitrary style transfer by replacing the scale and shift parameters in Equation (1) with the mean and standard deviation of the style image $y$:

$$\mathrm{AdaIN}(x, y) = \sigma(y) \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \mu(y). \tag{4}$$
In this manuscript, we use these feature statistics, namely the channel-wise mean and standard deviation, to generate unique style features in the feature space. We employ AdaIN to replace the existing style features with the generated unique style features.
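To make the notation concrete, a minimal PyTorch sketch of these statistics and of feature-space AdaIN might look as follows; the function names and the small `eps` term added for numerical stability are our own additions, not part of the original formulation:

```python
import torch

def mu_sigma(x: torch.Tensor, eps: float = 1e-6):
    # Channel-wise mean and standard deviation over the spatial
    # dimensions, as in Equations (2) and (3); x has shape (B, C, H, W).
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = (x.var(dim=(2, 3), keepdim=True) + eps).sqrt()
    return mu, sigma

def adain(content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
    # AdaIN, Equation (4): normalize the content features, then rescale
    # and shift them with the style feature statistics.
    mu_c, sigma_c = mu_sigma(content)
    mu_s, sigma_s = mu_sigma(style)
    return sigma_s * (content - mu_c) / sigma_c + mu_s
```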
3.2. Style Conversion and Generation Module (SCGM)
In this section, we present the architecture of the SCGM, as illustrated in Figure 3a. Because the style conversion and generation module operates in the feature space, more diverse transformations of the input images are expected, ultimately covering a wider style distribution than image-level augmentation.
The general polyp generalization framework consists of a pre-trained encoder and a decoder. We employ the Efficient UNet model, which comprises two main components: (1) a UNet encoder leveraging EfficientNet [35] as its backbone, which facilitates the extraction of diverse semantic details across multiple stages; and (2) a decoder module that amalgamates spatial information from the various stages to produce a highly accurate segmentation mask. To accomplish the style mixing task, we draw inspiration from MixStyle [14] and apply it to the first two encoder layers. Our main aim is to train the encoder and decoder to focus on source-invariant semantic features across the polyp datasets through style feature conversion and generation.
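As an illustration of where such a style module could sit, the following sketch uses timm's `features_only` interface as a stand-in for the EfficientNet backbone; the class name, the two-stage placement, and the module interface are our assumptions rather than the exact implementation:

```python
import timm
import torch
import torch.nn as nn

class StyleAwareEncoder(nn.Module):
    # Sketch: an EfficientNet feature extractor whose two earliest
    # feature maps are passed through a style-augmentation module
    # (e.g., the SCM, SGM, or NSSM) during training.
    def __init__(self, style_module: nn.Module):
        super().__init__()
        self.backbone = timm.create_model(
            "efficientnet_b0", pretrained=True, features_only=True)
        self.style_module = style_module

    def forward(self, x: torch.Tensor, style_x: torch.Tensor = None):
        feats = self.backbone(x)  # multi-stage feature maps for the decoder
        if self.training and style_x is not None:
            style_feats = self.backbone(style_x)
            for i in range(2):  # augment only the early, style-heavy stages
                feats[i] = self.style_module(feats[i], style_feats[i])
        return feats  # consumed by a UNet decoder via skip connections
```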
3.2.1. Style Conversion Module (SCM)
More specifically, the Style Conversion Module (SCM) was inspired by Adaptive Instance Normalization (AdaIN), which replaces the learnable parameters in Equation (1) with the feature statistics of target style images. It transfers the feature statistics of the target style image to the source training images and can easily be integrated into the mini-batch during training (as shown in Figure 3a). Given a batch of images $\{x_1, \dots, x_B\}$, the SCM first integrates the style image $x_t$ into the source training images. After concatenation into the same mini-batch, the SCM computes the feature statistics of each image and transfers them from the style image to the source images:

$$\mathrm{SCM}(x_s) = \sigma(x_t) \left( \frac{x_s - \mu(x_s)}{\sigma(x_s)} \right) + \mu(x_t), \tag{5}$$

where $\mu(x_s)$, $\sigma(x_s)$ are computed across the spatial dimensions for the source images and $\mu(x_t)$, $\sigma(x_t)$ are computed similarly for the target style images. An overview of the SCM is shown in Figure 3b.
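A minimal sketch of the SCM step, assuming the style image has already been concatenated into the mini-batch and encoded alongside the source images (all names here are illustrative):

```python
import torch

def style_conversion(src_feats: torch.Tensor,
                     style_feats: torch.Tensor,
                     eps: float = 1e-6) -> torch.Tensor:
    # SCM, Equation (5): replace the source feature statistics with
    # those of the target style features, as in AdaIN.
    mu_src = src_feats.mean(dim=(2, 3), keepdim=True)
    sd_src = (src_feats.var(dim=(2, 3), keepdim=True) + eps).sqrt()
    mu_sty = style_feats.mean(dim=(2, 3), keepdim=True)
    sd_sty = (style_feats.var(dim=(2, 3), keepdim=True) + eps).sqrt()
    return sd_sty * (src_feats - mu_src) / sd_src + mu_sty

# Hypothetical mini-batch integration: the style image is appended to
# the batch, encoded jointly, split off, and its statistics are
# broadcast onto every source item.
#   batch = torch.cat([src_images, style_image], dim=0)  # (B+1, 3, H, W)
#   feats = encoder_stage(batch)                         # (B+1, C, h, w)
#   stylized = style_conversion(feats[:-1], feats[-1:])
```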
3.2.2. Style Generation Module (SGM)
Our method, the SGM, drew inspiration from MixStyle, which was designed to regularize CNNs by mixing the style information of the source domains during training. In our case, however, we collected unique style images from different datasets. Given a batch of images $\{x_1, \dots, x_B\}$, we sampled two random images $x_i$ and $x_j$ from the collection of style images into the mini-batch and performed a novel style generation step. We computed the mixed style statistics as follows:

$$\mu_{mix} = \lambda \mu(x_i) + (1 - \lambda)\mu(x_j) \tag{6}$$

$$\sigma_{mix} = \lambda \sigma(x_i) + (1 - \lambda)\sigma(x_j) \tag{7}$$

Here, we set $\lambda$ to 0.5 throughout the experiments. This formulation ensures an equal contribution from each input style, resulting in a balanced proportion in the style mixing process.
Finally, the mixed style feature space is calculated by the following equation:

$$\mathrm{SGM}(x) = \sigma_{mix} \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \mu_{mix} \tag{8}$$

An overview of the SGM is shown in Figure 3c.
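A sketch of the SGM under these equations, with `lam` playing the role of $\lambda$ (the helper names are ours):

```python
import torch

def feat_stats(x: torch.Tensor, eps: float = 1e-6):
    # Channel-wise mean and standard deviation over spatial dimensions.
    mu = x.mean(dim=(2, 3), keepdim=True)
    sd = (x.var(dim=(2, 3), keepdim=True) + eps).sqrt()
    return mu, sd

def style_generation(x: torch.Tensor,
                     style_a: torch.Tensor,
                     style_b: torch.Tensor,
                     lam: float = 0.5) -> torch.Tensor:
    # SGM: mix the statistics of two style images (Equations (6)-(7))
    # and inject the mixed style into the source features (Equation (8)).
    mu_a, sd_a = feat_stats(style_a)
    mu_b, sd_b = feat_stats(style_b)
    mu_mix = lam * mu_a + (1.0 - lam) * mu_b    # Equation (6)
    sd_mix = lam * sd_a + (1.0 - lam) * sd_b    # Equation (7)
    mu_x, sd_x = feat_stats(x)
    return sd_mix * (x - mu_x) / sd_x + mu_mix  # Equation (8)
```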
3.2.3. Novel Style Synthesis Module (NSSM)
An overview of the NSSM is shown in Figure 3d. The problem with the SCM and SGM is that they cannot produce diverse yet plausible style features across training iterations, as they only utilize a single style statistic or a mere combination of two style feature statistics from the batch. Therefore, we propose seeking novel styles that do not deviate too much from the source styles and still look realistic. To achieve this, we first build a queue of style features, generated by combining the style statistics of two images within the batch over all possible pairings (see Figure 4). We employ a technique similar to the SGM but within each pair in the batch. We compute the means and standard deviations and mix the feature statistics following Equations (6) and (7). Then, a subset of distinct styles in the queue is chosen, and the maximum mean discrepancy (MMD) between the chosen styles and the remaining target style features is computed.
Given a batch of images $\{x_1, \dots, x_B\}$ drawn from both the training and target style images, we randomly select one image as a base image and perform style mixing individually with each target style image by applying Equations (6) and (7). Next, we concatenate the mean and standard deviation values of each mixed statistic and store them in a queue $S$. We then compare the discrepancy of each computed distribution with the base feature statistics. Let $S_1$ represent the style feature distribution of the base image and $S_2$ represent the style queue. We adopt the squared maximum mean discrepancy (MMD) between the two distributions ($S_1$ and $S_2$) using a radial basis function (RBF) kernel $k$ as follows:

$$\mathrm{MMD}^2(S_1, S_2) = \mathbb{E}_{s, s' \sim S_1}\left[k(s, s')\right] + \mathbb{E}_{t, t' \sim S_2}\left[k(t, t')\right] - 2\,\mathbb{E}_{s \sim S_1,\, t \sim S_2}\left[k(s, t)\right] \tag{9}$$
Note that we apply Equation (9) to all of the combinations and compare each original style feature with the mixed one, keeping those that have the minimum discrepancy. We then normalize the feature maps following Equation (1) and inject the novel dynamic styles chosen by applying the MMD to the mixed features. The plausible style injection can be formulated as:

$$\mathrm{NSSM}(x) = \sigma_{novel} \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \mu_{novel}, \tag{10}$$

where $\mu_{novel}$ and $\sigma_{novel}$ are the selected novel style statistics.
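A sketch of the NSSM selection step is given below. We assume each style in the queue is represented as a set of row vectors formed by concatenating its channel-wise means and standard deviations, and the kernel bandwidth `gamma` is an illustrative choice:

```python
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, gamma: float = 1.0):
    # RBF kernel matrix between row vectors: k(a, b) = exp(-gamma ||a - b||^2).
    return torch.exp(-gamma * torch.cdist(a, b) ** 2)

def mmd2(s1: torch.Tensor, s2: torch.Tensor, gamma: float = 1.0):
    # Squared MMD, Equation (9), between two sets of style vectors,
    # each of shape (n, d).
    return (rbf_kernel(s1, s1, gamma).mean()
            + rbf_kernel(s2, s2, gamma).mean()
            - 2.0 * rbf_kernel(s1, s2, gamma).mean())

def pick_novel_style(base: torch.Tensor, queue: list, gamma: float = 1.0):
    # Select from the queue the mixed style with the smallest
    # discrepancy from the base style statistics.
    scores = torch.stack([mmd2(base, cand, gamma) for cand in queue])
    return queue[int(torch.argmin(scores))]
```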
We adopted a combination of the Dice loss and the binary cross-entropy loss to train the network parameters. The Dice loss was proposed by [36] and is defined as follows:

$$\mathcal{L}_{Dice} = 1 - \mathrm{Dice}, \tag{11}$$

where Dice denotes the Dice coefficient score, which represents the spatial overlap between the ground truth ($Y$) and the predicted mask ($Y'$). It can be calculated as follows:

$$\mathrm{Dice} = \frac{2 \sum (Y * Y') + e}{\sum Y + \sum Y' + e} \tag{12}$$

In Equation (12), $*$ indicates element-wise multiplication and $e$ is a very small constant that guards against division by zero. The combination of the binary cross-entropy loss and the Dice loss has been proven efficient in handling the gradient problem [37]. We can formulate the binary cross-entropy loss as follows:

$$\mathcal{L}_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log y'_i + (1 - y_i) \log (1 - y'_i) \right] \tag{13}$$

Finally, we combine the two loss functions as follows:

$$\mathcal{L}_{seg} = \lambda_{1} \mathcal{L}_{Dice} + \lambda_{2} \mathcal{L}_{BCE}, \tag{14}$$

where $\lambda_{1}$ and $\lambda_{2}$ are the weights for the Dice loss and the binary cross-entropy loss, respectively.
Lastly, we apply both the original features and the augmented stylized features to train the network using the combined loss in Equation (14). The cumulative loss function is:

$$\mathcal{L}_{total} = \mathcal{L}_{seg}(Y, Y'_{orig}) + \mathcal{L}_{seg}(Y, Y'_{styled}) \tag{15}$$
Similar settings were applied for the SCM and the SGM.
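A sketch of this training objective, with equal loss weights as an assumed default (the actual values of $\lambda_1$ and $\lambda_2$ are not fixed here):

```python
import torch
import torch.nn.functional as F

def dice_loss(pred: torch.Tensor, target: torch.Tensor, e: float = 1e-7):
    # Dice loss (Eq. (11)) with the Dice coefficient of Eq. (12);
    # `e` is the small smoothing constant.
    pred = torch.sigmoid(pred)
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = (2.0 * inter + e) / (union + e)
    return 1.0 - dice.mean()

def combined_loss(pred, target, w_dice: float = 0.5, w_bce: float = 0.5):
    # Equation (14): weighted sum of Dice and binary cross-entropy losses.
    bce = F.binary_cross_entropy_with_logits(pred, target)
    return w_dice * dice_loss(pred, target) + w_bce * bce

def total_loss(pred_orig, pred_styled, target):
    # Equation (15): the loss is applied to both the original and the
    # stylized forward passes, then accumulated.
    return combined_loss(pred_orig, target) + combined_loss(pred_styled, target)
```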
3.3. Datasets
We conducted experiments using five different datasets to demonstrate the effectiveness of our proposed method for polyp generalization. Our experimental settings closely followed those outlined in PraNet. Unlike some previous studies that utilized both Kvasir-SEG and CVC-ClinicDB as training sets and the other datasets as testing sets, we exclusively utilized Kvasir-SEG for training. We divided Kvasir-SEG and CVC-ClinicDB into training, validation, and testing subsets with ratios of 80%, 10%, and 10%, respectively. Additionally, we utilized the other test datasets, CVC-ColonDB, ETIS, and KvasirCapsule-SEG, which contain 380, 196, and 55 images, respectively, for testing purposes. In this manuscript, we present the testing results obtained from models trained on Kvasir-SEG and evaluated on the other datasets. However, we also include results from models trained on CVC-ClinicDB and evaluated on the other datasets, including Kvasir-SEG.
For the experiment, we resized the images to 384 × 384 pixels, consistent with the size used in many prior works. During training, we performed augmentation on the fly while loading the data into the model. This augmentation included rotation, scaling, flipping and shearing.
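A minimal torchvision sketch of such on-the-fly augmentation; the parameter ranges are our assumptions, and in practice the same geometric transform must be applied jointly to the image and its mask:

```python
from torchvision import transforms

# On-the-fly geometric augmentation: rotation, scaling, flipping,
# and shearing, applied while loading the data.
train_transform = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomAffine(degrees=90, scale=(0.8, 1.2), shear=10),
    transforms.ToTensor(),
])
```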
3.4. Implementation Details
The proposed method was implemented using the PyTorch framework [38] v1.10.2. We utilized two 32 GB V100 GPUs for training. As a baseline, we employed the EfficientUNet network, which uses EfficientNet as a pre-trained backbone. The hyperparameters chosen were consistent with those used in prior work [23]. Specifically, we trained the model using stochastic gradient descent (SGD) with a batch size of 16, a momentum of 0.9, and weight decay. The total number of epochs was set to 200.
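The corresponding optimizer setup might be sketched as follows; note that the learning rate and the weight-decay value (which did not survive extraction above) are placeholders, not the original settings:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for EfficientUNet

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-2,            # assumed; the learning rate is not stated in the text
    momentum=0.9,       # as stated
    weight_decay=1e-4,  # placeholder: the original value is not given here
)
```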
3.5. Evaluation Metrics
We employed various metrics to evaluate and compare our proposed method with state-of-the-art (SOTA) methods. These metrics include the mean Intersection over Union (mIoU), Dice coefficient score (Dice), weighted F-measure (FM), structure measure (SM), mean absolute error (MAE), and max enhanced-alignment measure (EM). Among these metrics, Dice and mIoU are similar, as both assess the degree of similarity between the prediction and the ground truth at the region level. To compute the Dice, FM, and IoU, we utilized 256 pairs of recall and precision values between the predicted mask and the ground truth. Specifically, we transformed the predicted output into 256 binary masks by varying the threshold from 0 to 255.
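A NumPy sketch of this thresholding procedure (the small `eps` guard and the function name are our additions):

```python
import numpy as np

def dice_iou_curves(prob_map: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    # Sweep thresholds 0..255 over the predicted probability map,
    # producing 256 binary masks and their Dice / IoU scores.
    scores = []
    for t in range(256):
        pred = prob_map * 255.0 >= t
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        scores.append((2.0 * inter / (pred.sum() + gt.sum() + eps),  # Dice
                       inter / (union + eps)))                       # IoU
    return scores
```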
The F-measure evaluates a segmentation model's performance by considering precision and recall simultaneously, providing a single metric that reflects the model's overall effectiveness. The structure measure (SM) quantifies the structural consistency between the predicted mask and the ground truth [39]. Additionally, the E-measure (maxE) evaluates the output at both the region and pixel levels [40]. Similarly, the MAE serves as a pixel-level similarity metric, computing the average absolute per-pixel difference between the predicted mask and the ground truth:

$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| P(x, y) - G(x, y) \right| \tag{20}$$

In Equation (20), $P(x, y)$ denotes the pixel value of the ground truth, and $G(x, y)$ represents the pixel value of the predicted polyp mask.
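A corresponding one-line NumPy sketch of Equation (20):

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    # Equation (20): average absolute per-pixel difference between the
    # prediction and the ground truth, both normalized to [0, 1].
    return float(np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean())
```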
5. Discussion
We have introduced augmentation strategies operating in the feature space, which exhibit superior performance across various polyp datasets and generalize to challenging, unseen environments better than traditional data augmentation techniques and previous methodologies. We hypothesize that our proposed method enables deep learning models to acquire domain-agnostic and content-specific visual representations by substituting or exchanging the original style components with new ones, thereby focusing primarily on domain-irrelevant and class-specific aspects. Indeed, simply transferring style statistics at both the image level and the feature-space level has led to significant performance improvements. Training on stylized versions of the polyp dataset has resulted in notably enhanced performance compared to traditional augmentation methods. Similarly, experiments involving the transfer of style features at the feature-space level have yielded comparable performance gains. Furthermore, the experiments on generating a more diverse style transfer technique (the NSSM) demonstrated that the proposed method can deal with arbitrary styles, in contrast to traditional augmentation approaches, the SCM and SGM, which rely on a fixed set of style transformations.
To the best of our knowledge, prior research has not employed style-based data augmentation, either at the image level or at the feature-space level, for the task of polyp segmentation and generalization using deep learning models. Some earlier studies have applied style transfer techniques in other domains, such as skin lesion classification [44] and histology datasets [15]. However, these studies utilized transformations relevant to medical contexts to address image scarcity and imbalance issues in datasets. One variant of our proposed method (the SCM) bears some resemblance to this approach. Moreover, the study by [15] emphasized the use of medically irrelevant transformations with natural images and demonstrated their superiority over previous approaches. This could be attributed to the potential for diverse transformations enabled by employing a wider range of style features, unlike medically relevant transformations, which are inherently limited. Our proposed methods, particularly the SGM and NSSM, align with this concept. We achieved style diversification by leveraging target images solely from different polyp datasets and generating plausible styles similar to the source image. Learning features that are class-specific and domain-independent is crucial for deep learning models, akin to human cognition. Utilizing style transfer techniques at the feature-space level with extensive diversification can significantly enhance the model's representations.
While data augmentation at the image level is recognized as an effective technique for enhancing the performance and generalization of deep learning models, its application and potential in medical imaging remain largely unexplored, warranting further investigation. Additionally, the optimal configuration for data augmentation methods can vary depending on the datasets being utilized. As our proposed method suggests, employing style data augmentation at the feature-space level holds promise for learning domain-agnostic and class-specific representations. The findings presented in Table 1 underscore the need for future research to explore optimal settings for data augmentation and style transformation in a diverse and plausible manner, akin to the approach proposed in the NSSM, alongside existing methods.
In clinical scenarios, the performance of deep learning models often diminishes due to domain shift. Models that can generalize across multiple datasets are therefore highly advantageous. While it is commonly believed that training deep learning models on diverse multi-institutional datasets is required to generalize to unseen datasets, our proposed method demonstrates that a well-curated dataset and a thoughtfully designed architecture can compel a model to learn features that are both class-specific and invariant to domain shifts. Specifically, the NSSM achieved comparable performance within its own dataset but exhibits superior performance on other, unseen datasets. For instance, when trained on Kvasir-SEG and tested on CVC-ClinicDB, the NSSM achieved the highest Dice score of 0.933 and a mean IoU of 0.893. Similar performance gains are observed on the CVC-ColonDB and ETIS datasets. Notably, the proposed method outperforms prior methods significantly, as evidenced in Table 6, where it attains a Dice score of 0.844 and a mean IoU of 0.750, surpassing prior methods on the other major metrics as well, while those methods struggle with generalization. Similarly, when trained on CVC-ClinicDB and tested on the other datasets, the proposed method achieves the highest scores on all metrics. Through extensive experimentation, the results indicate that the proposed method exhibits superior generalizability, attributable to its style augmentation approach at the feature-space level, which consistently generates diverse style features while preserving key content features.
We conducted ablation studies to determine the optimal settings. We experimented with different mixing ratios between the source training images and the target style images in the feature space. Our findings indicate that an equal mixing ratio resulted in greater feature diversity, leading to improved performance on unseen datasets. The quantitative results of these experiments are presented in Table 8.
One significant limitation of our study is that we tested our approach solely on a simple UNet architecture. This decision was made to avoid potential interpretability issues that could arise from using more complex models. However, future research endeavors are necessary to investigate whether our approach can effectively address these limitations and demonstrate its robustness and efficiency in various scenarios. Specifically, it would be valuable to explore the applicability of our method within the frameworks of prior methods’ backbones. Additionally, extending the evaluation to other domains such as classification and detection tasks, as well as different medical imaging domains including histopathology, dermatology and radiology, would provide further insights into the versatility of our approach. Another limitation worth mentioning is that our proposed method (the NSSM) requires longer training times compared to previous works, primarily due to the style augmentation being performed at the feature space level on the fly.
In summary, we have introduced the NSSM, a novel style-based data augmentation method designed to learn diverse style features from a comprehensive set of medically relevant polyp images originating from various sources. Our approach aims to facilitate the acquisition of domain-agnostic and class-specific feature representations within the polyp domain. Through our experiments, we have demonstrated notable enhancements in the performance of segmentation tasks across different unseen datasets, particularly when confronted with domain shift challenges. Our investigation underscores two key findings. Firstly, CNNs exhibit a bias towards style features and may rely on low-level attributes such as color and texture, rendering them susceptible to domain shifts within polyp domains. Secondly, we posit that the incorporation of a medically relevant NSSM can serve as a practical strategy to alleviate this reliance, thereby offering a potential avenue for acquiring domain-agnostic representations.