Article

An Adaptive Semantic Segmentation Network for Adversarial Learning Domain Based on Low-Light Enhancement and Decoupled Generation

1 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
2 Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(8), 3295; https://doi.org/10.3390/app14083295
Submission received: 2 March 2024 / Revised: 28 March 2024 / Accepted: 9 April 2024 / Published: 13 April 2024

Abstract

Nighttime semantic segmentation suffers from significant mask degradation owing to low contrast, blurry imaging, and low-quality annotations. In this paper, we introduce a domain adaptive approach for nighttime semantic segmentation that overcomes the reliance on low-light image annotations by transferring the source domain model to the target domain. At the front end, a low-light image enhancement sub-network combining lightweight deep learning with mapping curve iteration is adopted to enhance nighttime foreground contrast. In the segmentation network, body generation and edge preservation branches are implemented to generate consistent representations within the same semantic region. Additionally, a pixel weighting strategy is embedded to increase the prediction accuracy for small targets. During training, a discriminator is implemented to distinguish features between the source and target domains, thereby guiding the segmentation network in adversarial transfer learning. The effectiveness of the proposed approach is verified through testing on Dark Zurich, Nighttime Driving, and CityScapes, including evaluations of mIoU, PSNR, and SSIM. The results confirm that our approach surpasses existing baselines in segmentation scenarios.

1. Introduction

Semantic segmentation is a fundamental task in computer vision where each pixel of a given image is labeled with an object category. It is widely used in various applications such as autonomous driving [1], medical imaging [2], and human parsing [3]. In recent years, the performance of semantic segmentation of daytime scene images has substantially improved due to the rapid progress in deep learning and computing power. As researchers have tackled more challenging image segmentation scenarios under various limited, adverse, and degraded conditions, semantic segmentation of nighttime images [4] has emerged as a prominent research focus. However, nighttime semantic segmentation poses unique challenges; for example, the low contrast of the input images makes it difficult to obtain clear and complete segmentation boundaries, and variation in lighting conditions can change the brightness and color of objects within the same scene. Additionally, the manual labeling of a high-quality training set of nighttime images is a formidable task, contributing to the degradation of segmentation model performance. The present study seeks to address the above-mentioned bottlenecks by proposing a nighttime semantic segmentation network suitable for real-world applications such as autonomous driving and security monitoring.
Nighttime images, as a class of low-light images, contain many foreground regions whose pixels are not obvious or recognizable to the human eye, and it is difficult to perform high-quality pixel-level annotation on these parts of the image. Yet a sufficient quantity of accurately annotated segmentation instances is the basis for efficient learning of segmentation models. To tackle this, schemes such as domain adaptation [5,6], synthetic datasets [7], and style transfer [8] are commonly employed. Because of the low brightness and contrast of nighttime images, this paper adopts domain adaptation to transfer models trained on daytime images to nighttime images. However, large differences in scene feature distributions and foreground types between the source and target domains, primarily concerning light intensity, can often lead to the distortion of crucial spatial semantic details during domain transfer. In view of this, some studies have proposed to establish domain transfer from the daytime domain to the nighttime domain using an intermediate domain, such as the twilight domain, as a smooth transition. In [4,9,10], the twilight domain serves as a bridge, allowing the model trained in the daytime domain to adapt progressively to the nighttime domain by extracting features from twilight images and performing transfer alignment learning. In [11], a model adaptation method based on curriculum learning is proposed, which adapts the model to light changes and noise in nighttime scenes by gradually increasing the complexity of the nighttime images. Building on [11], the research conducted in [12] utilizes feature maps to provide prior knowledge about nighttime scenes, aiding the model in understanding objects and structures and guiding adaptive training. In [13], an encoder–decoder structure for semantic segmentation of nighttime images is introduced which uses a domain map approach for mapping synthetic to real data. In [14,15,16], a generator network is trained using adversarial learning to translate daytime images to nighttime images. Subsequently, the feature extractor of the generator network is adopted for the semantic segmentation network to extract transform-based regularized features from nighttime images. Furthermore, to enhance the generalization performance of the model, research in [17] employs Generative Adversarial Networks (GANs) to translate daytime images to nighttime images, and random transformations are then applied to those images, followed by joint training using the adversarial and semantic segmentation loss functions.
Among the above-mentioned methods, refs. [4,9,10,11,12] leverage intermediate domains to create a smooth transition, thereby improving model generalization and potentially reducing dataset labeling costs. Nonetheless, introducing intermediate domains may entail additional preprocessing and model training and fail to fully cover all variations from daytime to nighttime. On the other hand, refs. [13,14,15,17] employ techniques such as style transfer and build synthetic datasets to address the difficulty of labeling nighttime (low-light) samples. While building synthetic datasets offers advantages, it also carries the risk of introducing bias and noise. Furthermore, these schemes focus only on the statistical representation of the overall image style in style transfer and thus are prone to the loss of spatial details. In addition, generating a transferred image with the same semantics as the original image, especially when dealing with a relatively large domain gap, remains a challenging aspect of image translation.
To address these problems, this paper uses pairs of daytime and nighttime images of similar scenes as target domains and transfers the source domain generalized model to the scene-specific multi-target domain without introducing an intermediate domain, synthetic datasets, or style transfer, improving segmentation quality through joint adversarial learning and multi-domain co-training. On this basis, this paper proposes an adaptive semantic segmentation network for the adversarial learning domain based on low-light enhancement and decoupled generation (DLA-Net). At the front end of the model, a lightweight low-light image enhancement network (LIE-SubNet) is embedded to elevate foreground contrast in nighttime images and accomplish spatial feature alignment across datasets with different illuminance. Existing segmentation models typically treat the foreground target as a unified entity; however, foreground boundary regions usually contain richer spatial details with higher-frequency feature information, whereas non-boundary regions exhibit fewer spatial details characterized by low-frequency distributions. Inspired by [5], this paper leverages a generative network capable of decoupling the foreground body and edge to predict segmentation masks and uses two discriminators for adversarial training between the source and target domains. Additionally, a small pixel reweighting strategy [18] is implemented to process the input images and reduce prediction uncertainty, thus improving the segmentation accuracy for small targets. In our experiments, the Dark Zurich dataset [4] is employed, which contains pairs of daytime and nighttime images based on rough GPS positional alignment. Through extensive testing on the Dark Zurich, CityScapes, and Nighttime Driving [9] datasets, the proposed method is verified to deliver improved performance in low-light nighttime semantic segmentation. The primary contributions of our work are summarized as follows:
In this paper, a multi-domain model joint training network for semantic segmentation, DLA-Net, is introduced which transfers the source domain to the multi-target domain of a specific scene without requiring an intermediate domain. It accomplishes joint adversarial training of the multi-domain model, supported by the low-light image enhancement sub-network, on the multi-target domain;
The low-light image enhancement sub-network, LIE-SubNet, which combines deep learning and mapping curve iteration, is proposed to enhance pixel contrast and spatial feature alignment of nighttime images. In the segmentation network, a generative network capable of decoupling body and edge is utilized to guide segmentation prediction by exploiting the adversarial loss in the daytime and nighttime domains;
To effectively utilize both low-frequency and high-frequency information of foreground targets, the segmentation mask is decoupled into the body generation branch and the edge preservation branch. These branches can focus on different attributes of the regional features during training. The resulting masks are then composited and reconstructed to achieve a complete semantic segmentation mask capable of retaining the details while removing the void noise.

2. Related Work

Domain adaptation for semantic segmentation: Domain adaptation seeks to transfer knowledge learned in the source domain to the target domain, where the object classes are similar but the distribution of data statistics differs. Currently, a portion of domain adaptive schemes adopt adversarial learning frameworks, introducing an adversarial loss function between the source and target domains to guide the model in aligning feature representations across different data domains. For example, in [19], Hoffman et al. proposed a new approach to semantic segmentation using category-constrained [20] fully convolutional domain adversarial learning. AdaptSegNet [5] utilizes adversarial training to achieve feature alignment between the source and target domains. Additionally, several approaches employ joint training and multi-task learning strategies to improve model generalization by sharing parameters between source and target domains. In BDL [21], images from both source and target domains are input into a shared convolutional neural network, with the last layer divided into two branches for semantic segmentation tasks in the respective source and target domains. By sharing the feature extraction layers of the network, the source and target domains can leverage the underlying image feature representation, thereby improving adaptation to redundant representations of the target domain.
Unlike adversarial learning, style transfer [17] and image translation from source images to target images are also widely used for domain adaptation. They typically incorporate a domain invariance loss function into the generator network to enforce domain-invariant image generation in the target domain [22,23]. This type of loss function commonly comprises both an adversarial loss and a domain invariance loss. Specifically, the generator network is trained using the adversarial loss, while the domain invariance loss teaches the generator network to learn the features shared between the source and target domains, resulting in domain-invariant representations. Some other studies have explored the combination of self-training and fine-tuning strategies through multiple rounds of network training. However, the self-training strategy may introduce noise when using pseudo-labeling, thus impacting model performance [24]. To mitigate the influence of noise, researchers have proposed several improved self-training techniques [25,26], such as using model ensembles to reduce noise or refining pseudo-labels to reduce mislabeling. Alternatively, some studies have employed curriculum-based learning [27,28] to acquire simple attributes in the target domain before using them to regularize semantic segmentation models. However, the significant visual disparities between daytime and nighttime images pose a formidable challenge for these methods, making them ill suited to handling domain adaptation in scenarios with markedly different illumination intensities. Consequently, they often fall short of delivering satisfactory performance in nighttime semantic segmentation. This paper explores more efficient techniques to minimize the domain gap so that transferred models can achieve accurate segmentation predictions.
Nighttime (low-light) semantic segmentation: Some studies have demonstrated the effectiveness of employing intermediate domains for the progressive adaptation of semantic models trained on daytime scenes to nighttime scenes. For example, Dai et al. [9] proposed a step-by-step adaptive approach based on intermediate domains. This approach leverages an intermediate twilight domain as a bridge between daytime and nighttime scenes and trains an intermediate model on the twilight domain, which is then applied to the semantic segmentation of nighttime scenes. Later, Sakaridis et al. [4,10] extended the approach to a class of guided curriculum adaptation frameworks, incorporating synthetic and unlabeled real images to establish correspondences among scene images captured at various times. However, it is worth noting that this progressive adaptation approach often necessitates training several semantic segmentation models. For instance, in [4], three models were trained separately for three different domains, potentially making the training process less efficient. Building upon this methodology of using intermediate domains for progressive domain adaptation, some studies have trained additional image translation models. CycleGAN [17] is a good case in point: it enables the inter-translation of daytime and nighttime images before training semantic segmentation models, thus introducing different visual features through diverse augmented data and aiding the adaptation of the transferred models to various scenes and environments. Furthermore, ref. [29] introduced a nighttime semantic segmentation method based on image translation, which translates nighttime images into daytime ones and performs semantic segmentation with the model trained on the daytime domain.
More recently, to improve the semantic segmentation of night scenes, researchers have explored the use of different sensors capturing the same scene as an auxiliary input. Vertens et al. [30] proposed to exploit the insensitivity of thermal infrared to changes in illumination as a supplemental input to the segmented images, providing additional information for nighttime semantic segmentation. Additionally, other studies have devised specialized scene semantic segmentation methods. For example, Ref. [31] proposed a two-stage adversarial training approach that employs domain adaptation techniques to transform between pairs of daytime and nighttime scenes, particularly for rainy and nighttime scenarios. Likewise, Ref. [32] introduced an adaptive network capable of automatically adapting its internal architecture to different environmental conditions, including nighttime and rainy ones, based on the attributes of the input images. Differing from the above methods, this paper introduces a network structure designed to train semantic segmentation for low-light images via end-to-end adversarial learning without resorting to intermediate domains or auxiliary images.

3. Method

3.1. Framework Overview

The domain adaptive method proposed in this paper involves two key domains: a source domain S for pre-training, which can be any normal-lighting scene, and a target domain T = {T_d, T_n} containing two roughly aligned subdomains, T_d and T_n, representing daytime and nighttime scenes, respectively. In the pre-training phase, a labeled image set {X_S, Y_S} from the source domain S was used to optimize the semantic segmentation network parameters. Subsequently, two discriminators, D_{S→T_d} and D_{S→T_n}, were employed to bootstrap the domain adaptive model transfer from S to T_d and from S to T_n in order to efficiently model semantic segmentation of the nighttime scene T_n in the target domain. The domain adaptive semantic segmentation network in this paper comprises three modules: (1) a low-light image enhancement network N_{en}; (2) a pre-trained semantic segmentation network G_{pre} and a transferred semantic segmentation network G_{seg}, which decouples the body and edge during segmentation and outputs predictions of dimension R^{H×W×C}, with C denoting the total number of image categories; and (3) a segmented mask activation network N_Y, which consists of a convolutional layer and a sigmoid normalization function, as shown in Figure 1. The network input consists of three types of domain samples: the source domain image X_S and the target domain images X_{T_d} and X_{T_n}. Among them, X_{T_n} was additionally passed through the nighttime (low-light) enhancement network N_{en}, which generated an enhancement loss L_{en} to optimize the enhancement result and bring the output closer to the daytime domain. The network uses the image annotations {X_S, Y_S} of the source domain dataset to compute the segmentation loss L_{seg} and then obtains the segmentation prediction masks F̃ = {F̃_S, F̃_{T_d}, F̃_{T_n}} and the decoupling loss L_{de} through G_{seg}. After that, the two discriminators D_{S→T_d} and D_{S→T_n} perform adversarial transfer learning, and the final segmentation masks Ŷ = {Ŷ_S, Ŷ_{T_d}, Ŷ_{T_n}} are obtained via the activation network N_Y, i.e., Ŷ = {N_Y(F̃_S), N_Y(F̃_{T_d}), N_Y(F̃_{T_n})}. The whole network guides the domain adaptive alignment of the model based on the composite total loss L_{total}.
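To make the data flow described above concrete, the following minimal PyTorch-style sketch wires the three modules together for one forward pass. The function and argument names (enhance_net, seg_pre, seg_net, mask_act) are illustrative stand-ins for N_{en}, G_{pre}, G_{seg}, and N_Y, not the authors' released implementation.

```python
def forward_all_domains(x_s, x_td, x_tn, enhance_net, seg_pre, seg_net, mask_act):
    """One forward pass of the DLA-Net pipeline (sketch).

    x_s, x_td, x_tn : source, daytime-target, and nighttime-target batches,
                      each of shape (B, 3, H, W).
    enhance_net     : low-light enhancement sub-network N_en (nighttime only).
    seg_pre         : pre-trained source-domain segmentation network G_pre.
    seg_net         : transferred decoupled segmentation network G_seg.
    mask_act        : mask activation network N_Y (conv layer + sigmoid).
    """
    # Only the nighttime target images are brightened before segmentation.
    x_tn_enh = enhance_net(x_tn)

    # Segmentation predictions F~ for the three domain inputs.
    f_s = seg_pre(x_s)
    f_td = seg_net(x_td)
    f_tn = seg_net(x_tn_enh)

    # Final masks Y^ = N_Y(F~); the discriminators D_{S->T_d} and D_{S->T_n}
    # (not shown here) compare f_td / f_tn against f_s during training.
    return {"S": mask_act(f_s), "T_d": mask_act(f_td), "T_n": mask_act(f_tn)}
```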

3.2. Low-Light Image Enhancement Sub-Network

In the realm of image illumination enhancement, most research employs either mapping curves or neural networks. In this paper, we instead fit the mapping curves with a neural network to design a low-light image enhancement sub-network. The objective was to homogenize the intensity distribution of the input image X_{T_n} from the nighttime target domain T_n and generate the enhanced image X̂_{T_n}, ensuring that the predictions of samples from different domains align after passing through the segmentation network. Inspired by [13], we utilized an iterative pixel enhancement mapping curve to adjust the brightness and contrast of the image through a pixel grayscale mapping relationship, as shown in Equation (1).
I_e(X_{T_n}(x); \alpha) = \log\big(1 + X_{T_n}(x) + \alpha\, X_{T_n}(x)\,(1 - X_{T_n}(x))\big),
where x is the pixel coordinate, and the parameter α ensures that each pixel value in the enhanced image falls within the normalized range [0, 1], preventing any loss of information due to overflow. By setting α to a value between −1 and 1, the curve I_e can be controlled within the range [0, 1]. For example, when α = −1, I_e(X_{T_n}(x); −1) = log(1 + X_{T_n}(x)²), i.e., each value is within [0, 1].
To adapt to more challenging low-light conditions, the quadratic curve I_e can be iterated to yield a higher-order curve. Although the higher-order curve is able to adjust the image over a wider range, it still applies a global adjustment because a single α value is applied to all pixels, resulting in over-enhancement or attenuation of localized regions. To solve this problem, we used a separate curve for each RGB channel of the input image to perform the iterative transformation, so that each channel has a corresponding optimal α value for image enhancement, as shown in Equation (2).
I_e^m(X_{T_n}(x); A) = \log\big(1 + I_e^{m-1}(X_{T_n}(x)) + A_m\, I_e^{m-1}(X_{T_n}(x))\,(1 - I_e^{m-1}(X_{T_n}(x)))\big),
where m, set to 8 in this paper, signifies the number of iterations and controls the curvature, and A is a parameter map of the same size as the given image that represents the optimal α value for each pixel and channel. To obtain the mapping between the input image and its optimal curve parameter map, this paper proposes a depth curve fitting network, as illustrated in Figure 2.
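A minimal PyTorch sketch of the iterated curve in Equation (2) is given below; the per-channel layout of the parameter maps (three channels per iteration, stacked along the channel axis) is an assumption borrowed from common curve-estimation implementations, not a detail stated in the paper.

```python
import torch

def iterative_enhance(x, curve_params, iterations=8):
    """Apply the iterative log mapping curve of Eq. (2) (sketch).

    x            : low-light input X_Tn with values in [0, 1], shape (B, 3, H, W).
    curve_params : predicted alpha maps A, assumed shape (B, 3 * iterations, H, W)
                   with values in [-1, 1] (one 3-channel map A_m per iteration).
    """
    enhanced = x
    for m in range(iterations):
        a_m = curve_params[:, 3 * m: 3 * (m + 1)]
        # I_e^m = log(1 + I^{m-1} + A_m * I^{m-1} * (1 - I^{m-1}))
        enhanced = torch.log1p(enhanced + a_m * enhanced * (1.0 - enhanced))
    # Safety clamp; the curve itself is designed to stay within [0, 1].
    return enhanced.clamp(0.0, 1.0)
```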
To evaluate the quality of the enhanced image, we used the following three losses to train the image enhancement network.
To suppress overexposure or underexposure in certain areas, we designed an exposure control loss L_{ec} to regulate the level of exposure. L_{ec} quantifies the disparity between the mean luminance value of a local region and the intended exposure level e. Following existing methods [33,34], e was set to a gray level in the RGB color space. This loss brings the enhancement closer to the desired exposure level, mitigates overexposure and underexposure, and hence yields a better visualized, higher-quality image, as shown in Equation (3).
L_{ec} = \frac{1}{V} \sum_{i=1}^{V} \left| \hat{I}_i - e \right|,
where V denotes the number of non-overlapping regions of size 16 × 16, and Î_i represents the average luminance value of the i-th local region in the enhanced image X̂_{T_n}. e was set to 0.5 in the experiments.
The color constancy loss employed in this paper was based on the Gray-World [35] color constancy assumption, which posits that each color channel averages to gray over the whole image. This loss rectifies potential color deviations in the enhanced image, recovers color information affected by changes in illumination, improves the quality and visual perception of the image, and constrains the relationship among the three color channels, as shown in Equation (4).
L_{cc} = \sum_{(a,b)\in\tau} \left( \bar{E}(a) - \bar{E}(b) \right)^2,
where Ē(a) and Ē(b) denote the average intensity values of channel a and channel b, respectively, in the enhanced image X̂_{T_n}, and (a, b) denotes a pair of channels with τ = {(R, G), (R, B), (G, B)}. A smaller L_{cc} indicates that the color of the brightened image is more balanced, whereas a larger L_{cc} indicates that the brightened image may suffer from color bias.
In this paper, an illumination smoothness loss [36] was applied to each curve parameter map A to maintain a monotonic relationship between adjacent pixels. The loss helps the model learn that illumination changes in neighboring regions should be consistent and transition smoothly, improving image processing performance and image quality. It is shown in Equation (5).
L_{is} = \frac{1}{M} \sum_{m=1}^{M} \sum_{s=1}^{H \times W \times C} \left( \left| \nabla_x A_m(s) \right| + \left| \nabla_y A_m(s) \right| \right)^2,
where M stands for the number of iterations, C denotes the RGB color channels (C = 3), and ∇_x and ∇_y denote the horizontal and vertical gradient operations, respectively. The smaller the value of L_{is}, the smoother the illumination of the brightened image; a larger value indicates abrupt changes or artifacts in the illumination of the brightened image.
The total enhancement loss is shown in Equation (6).
L_{en} = L_{ec} + \lambda_1 L_{cc} + \lambda_2 L_{is},
where λ_1 and λ_2 are hyperparameters used to balance the magnitude of the losses and were set to 0.5 and 20 in the experiments, respectively.
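The three losses can be sketched in PyTorch as follows; the reductions (means over patches, images, and gradient maps) are simplifications of Equations (3)–(5) chosen for brevity, so constant factors may differ from the exact formulation.

```python
import torch
import torch.nn.functional as F

def exposure_control_loss(enhanced, e=0.5, patch=16):
    """L_ec of Eq. (3): deviation of 16x16 patch luminance from the target exposure e."""
    luma = enhanced.mean(dim=1, keepdim=True)       # per-pixel gray level
    patch_mean = F.avg_pool2d(luma, patch)          # non-overlapping region means
    return (patch_mean - e).abs().mean()

def color_constancy_loss(enhanced):
    """L_cc of Eq. (4): squared differences between channel means (Gray-World)."""
    r, g, b = enhanced.mean(dim=(2, 3)).unbind(dim=1)
    return ((r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2).mean()

def illumination_smoothness_loss(curve_params):
    """Simplified total-variation form of L_is in Eq. (5) on the alpha maps A."""
    dx = curve_params[:, :, :, 1:] - curve_params[:, :, :, :-1]
    dy = curve_params[:, :, 1:, :] - curve_params[:, :, :-1, :]
    return (dx ** 2).mean() + (dy ** 2).mean()

def enhancement_loss(enhanced, curve_params, lam1=0.5, lam2=20.0):
    """Total enhancement loss L_en of Eq. (6)."""
    return (exposure_control_loss(enhanced)
            + lam1 * color_constancy_loss(enhanced)
            + lam2 * illumination_smoothness_loss(curve_params))
```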
As its contribution to low-light enhancement, the LIE-SubNet combines a set of iterable higher-order curves with a deep learning network, exploring different numbers of iterations to identify the best-performing configuration and enhance nighttime pixel contrast. The method reduces the domain gap between the daytime and nighttime domains without resorting to an intermediate domain or the training of multiple distinct models, and it feeds the segmentation network images with smaller illumination differences.

3.3. Semantic Segmentation Network for Decoupling Body and Edge

Currently, mainstream semantic segmentation methods primarily focus on enhancing the internal consistency of objects through global modeling or on refining object details along the boundaries through multi-scale feature fusion. However, foreground boundary regions typically harbor more spatial detail and higher-frequency feature information. In view of this, we introduce the semantic segmentation network G_{seg} for decoupling body and edge, which contains a body generation branch ρ and an edge preservation branch δ. Unlike previous studies, we do not require an edge ground-truth map of the input image; instead, the two branches are trained with distinct losses to predict the body feature map and the edge feature map, respectively. The implementation details are described below.
Decoupling segmentation framework: In this paper, we assume that the spatial features of the image obey an additive decomposition, i.e., F̃ = F_{body} + F_{edge}. Accordingly, the body feature F_{body} can be generated first, and the edge feature F_{edge} can be obtained by subtraction. If we let F_{body} = ρ(F), then F_{edge} = F − F_{body}, as shown in Equation (7).
\tilde{F} = \rho(F) + \delta(F_{edge}) = F_{body} + \delta(F - F_{body}),
where ρ represents the body generation branch mapping which is used to aggregate contextual information within objects to form a distinct body for each object. On the other hand, δ denotes the edge preservation branch mapping, which is designed to extract spatially detailed features from the boundary region.
Body generation branch: This branch is responsible for generating more consistent feature representations for pixels that belong to the same object in an image. Low-resolution feature maps typically contain low-frequency terms, with the low-spatial-frequency portion representing the image as a whole; therefore, the low-resolution feature maps capture the most salient parts. To achieve this goal, as illustrated in Figure 3, X is the input image, and we utilized an encoder–decoder architecture after the backbone to extract F. Specifically, the encoder downsamples F using dilated convolution into a low-resolution representation of the low-spatial-frequency portion, denoted as F_{low}. In some cases, low-resolution features might still contain high-frequency information; we assume that this compressed representation encapsulates the most salient object portions and yields a rough representation that ignores details or high-frequency portions. Therefore, we used bilinear interpolation to upsample F_{low} to the same size as F to obtain F_{up}. Then, we concatenated F and F_{up} and used a 1 × 1 convolution to adjust the channel dimension to R^{H×W×C}, obtaining F_{conv}, i.e., F_{conv} = h_{conv}(F || F_{up}), where h_{conv} denotes the 1 × 1 convolutional layer and || denotes concatenation along the channel dimension. This branch also contains an average pooling layer, which pools F_{conv} to generate a feature map F_{ap} with a more distinct body, i.e., F_{ap} = h_{ap}(F_{conv}) with F_{ap} ∈ R^{H×W×C}, where h_{ap} denotes the average pooling operation.
To increase the spatial accuracy of body features in the segmentation results, we first mapped each pixel p in the default spatial grid Ω_l on F_{ap} to a new pixel point p̂ via feature relocation. Then, we used a variable bilinear sampling mechanism [37,38] to approximate the value of each pixel point p̂ in F_{body}, i.e., F_{body}(p̂) = Σ_{p∈l} ω_p F_{ap}(p), where l denotes the four neighboring pixels around p̂, ω_p are the bilinear sampling weights relative to the center point o, and F_{body} ∈ R^{H×W×C}. In addition, to ensure smoother body features and reduce noise and discontinuities in the prediction results, we applied the L2 loss [39] to guide the learning of the body generation branch, as shown in Equation (8).
L_{body} = \sum_{s=1}^{H \times W \times C} \left( F_{body}(s) \right)^2,
where s denotes the positional index of an element in F ∈ R^{H×W×C}, with s = 1, 2, ..., H × W × C.
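A compact sketch of the body generation branch and its loss is shown below. The kernel sizes, strides, and dilation rates are assumptions; for simplicity the branch keeps the backbone feature width rather than projecting to the class dimension, and the flow-based feature relocation (the variable bilinear sampling step) is omitted, with the pooled map used directly as F_body.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BodyGenerationBranch(nn.Module):
    """Sketch of the body generation branch (rho)."""

    def __init__(self, channels):
        super().__init__()
        # Encoder: dilated, strided conv downsamples F into the
        # low-resolution, low-frequency representation F_low.
        self.down = nn.Conv2d(channels, channels, kernel_size=3,
                              stride=2, padding=2, dilation=2)
        # 1x1 conv applied after concatenating F with the upsampled F_low.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat):
        f_low = self.down(feat)                                      # F_low
        f_up = F.interpolate(f_low, size=feat.shape[-2:],
                             mode="bilinear", align_corners=False)   # F_up
        f_conv = self.fuse(torch.cat([feat, f_up], dim=1))           # F_conv
        # Average pooling yields the smoother, more distinct body map F_ap.
        f_body = F.avg_pool2d(f_conv, kernel_size=3, stride=1, padding=1)
        return f_body

def body_loss(f_body):
    """L_body of Eq. (8): an L2 penalty summed over all elements of F_body."""
    return (f_body ** 2).sum()
```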
Edge preservation branch: This branch is dedicated to handling the high-frequency terms F_{high} in the image, where high-frequency features usually encompass more detailed edge information. To obtain the high-frequency edge feature, we subtracted the body feature F_{body} from the original feature F, i.e., (F − F_{body}). Drawing inspiration from recent work on decoder design [40], we output a low-level feature F_{detail} from a low layer of the backbone, which serves as a complement to the missing fine-detail information and augments the high-frequency terms in F_{edge}. Finally, (F − F_{body}) and F_{detail} were concatenated, and a 1 × 1 convolution was used for channel adjustment to obtain F_{edge}. The implementation is expressed in Equation (9).
F_{edge} = h_{conv}\big( (F - F_{body}) \,||\, F_{detail} \big),
where F_{edge} ∈ R^{H×W×C}.
The edge preservation branch focuses on edge detail features and does not require body features. Unlike the L2 loss, the L1 loss yields a sparse solution so that certain features have zero weight. This makes the boundary sparser and reduces unnecessary body features, contributing to an accurate boundary prediction feature map. Therefore, the L1 loss [39] was utilized to guide the learning of the edge preservation branch, as shown in Equation (10).
L_{edge} = \sum_{s=1}^{H \times W \times C} \left| F_{edge}(s) \right|.
The final decoupling loss is:
L_{de} = L_{body} + \lambda_3 L_{edge}.
The L_{body} and L_{edge} losses complement each other by sampling pixels from different regions of the image, which benefits the experimental results. Since the edge portion constitutes only a small part of the overall image, λ_3 is used to balance the weight of L_{edge} in L_{de} and was set to 0.4 in the experiments.
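The edge preservation branch and the combined decoupling loss can be sketched as follows, continuing the assumptions of the body-branch sketch above; the channel widths and the fusion convolution are illustrative, and F_detail is assumed to have been resized to the resolution of F.

```python
import torch
import torch.nn as nn

class EdgePreservationBranch(nn.Module):
    """Sketch of the edge preservation branch (delta), Eq. (9)."""

    def __init__(self, channels, detail_channels):
        super().__init__()
        # 1x1 conv fusing (F - F_body) with the low-level detail feature
        # F_detail taken from an early backbone stage.
        self.fuse = nn.Conv2d(channels + detail_channels, channels,
                              kernel_size=1)

    def forward(self, feat, f_body, f_detail):
        # F_edge = h_conv((F - F_body) || F_detail)
        return self.fuse(torch.cat([feat - f_body, f_detail], dim=1))

def decoupling_loss(f_body, f_edge, lam3=0.4):
    """L_de = L_body + lambda_3 * L_edge  (Eqs. (8), (10), and (11))."""
    l_body = (f_body ** 2).sum()   # L2 term keeps the body smooth
    l_edge = f_edge.abs().sum()    # L1 term keeps the edge map sparse
    return l_body + lam3 * l_edge
```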
In this paper, G_{seg} acquires the body feature map F_{body} and edge feature map F_{edge} of the input image X through the body generation branch ρ and the edge preservation branch δ, respectively. Moreover, the edge features are supplemented by the high-frequency detail features F_{detail} output from a lower layer of the backbone network. By employing distinct body and edge losses, the segmentation performance is enhanced, and the final segmentation map F̃ is obtained as F_{body} + F_{edge}.

3.4. Multi-Target Domain Adversarial Learning Strategy

In the multi-target domain adversarial learning strategy, in order to keep the feature distributions across different domains relatively close and to better achieve transfer alignment between the source and target domains, this paper adds the adversarial loss terms L_{S→T_d} and L_{S→T_n} to the outputs of the daytime domain T_d and the nighttime domain T_n, respectively. The two discriminators have identical structures, weights, and training protocols, with source domain outputs labeled 1 and target domain outputs labeled 0. The binary cross-entropy loss function [41] was utilized to push both F̃_{T_d} and F̃_{T_n} toward F̃_S. The adversarial loss is defined as:
L_{adv} = L_{S \rightarrow T_d}(X_S, X_{T_d}) + L_{S \rightarrow T_n}(X_S, X_{T_n}).
In the experiments, we trained the generator and the discriminators alternately. The generator used in the source domain, G_{pre}, was pre-trained, and the target domain generator G_{seg} was transferred. The objective functions of D_{S→T_d} and D_{S→T_n} are defined as:
L_{S \rightarrow T_d}(X_S, X_{T_d}) = \min_{G_{seg}} \max_{D_{S \rightarrow T_d}} \Big( \mathbb{E}_{X_S \sim p_{data}(X_S)}\big[ \log D_{S \rightarrow T_d}(G_{pre}(X_S)) \big] + \mathbb{E}_{X_{T_d} \sim p_{data}(X_{T_d})}\big[ \log\big(1 - D_{S \rightarrow T_d}(G_{seg}(X_{T_d}))\big) \big] \Big),
L_{S \rightarrow T_n}(X_S, X_{T_n}) = \min_{G_{seg}} \max_{D_{S \rightarrow T_n}} \Big( \mathbb{E}_{X_S \sim p_{data}(X_S)}\big[ \log D_{S \rightarrow T_n}(G_{pre}(X_S)) \big] + \mathbb{E}_{X_{T_n} \sim p_{data}(X_{T_n})}\big[ \log\big(1 - D_{S \rightarrow T_n}(G_{seg}(X_{T_n}))\big) \big] \Big).
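In practice, min–max objectives of this form are commonly implemented with binary cross-entropy terms, as in AdaptSegNet-style training; the sketch below follows that convention and assumes the discriminators output raw logits.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_out_source, d_out_target):
    """Discriminator objective: source predictions -> label 1, target -> label 0."""
    real = F.binary_cross_entropy_with_logits(
        d_out_source, torch.ones_like(d_out_source))
    fake = F.binary_cross_entropy_with_logits(
        d_out_target, torch.zeros_like(d_out_target))
    return real + fake

def generator_adversarial_loss(d_out_target):
    """Adversarial term for G_seg: push target-domain outputs toward the
    'source' label so the discriminator cannot tell the domains apart."""
    return F.binary_cross_entropy_with_logits(
        d_out_target, torch.ones_like(d_out_target))
```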
We used the cross-entropy loss as the semantic segmentation loss for the source domain. Moreover, we introduced a small-pixel reweighting factor w_k to address the class imbalance of small targets, as shown in Equation (15).
L_{seg} = \frac{1}{N C} \sum_{t=1}^{H \times W} \sum_{k=1}^{C} \left\| w_k \, GT(t,k) \cdot \log\big( \hat{Y}_S(t,k) \big) \right\|_1,
where N is the total number of image pixels, k denotes the category, ||·||_1 is the L1 norm that sums over all pixels, w_k is the pixel weight, Ŷ_S(k) is the prediction map of the kth channel for the source domain image obtained from the activation network N_Y, i.e., Ŷ = N_Y(F̃), and GT(k) is the one-hot-encoded ground truth of the kth category. Specifically, for each category k, we first defined a weight w_k = −log(p_k), where p_k denotes the percentage of all valid pixels that are labeled as category k in the source domain. Then, w_k was further normalized by w_k = ((w_k − w̄)/θ_k) · std + avg, where w̄ and θ_k are the mean and standard deviation of w_k, respectively, and std and avg are preset constants that keep the value of w_k positive. Finally, w_k was multiplied by the corresponding category channel in F̃ to generate the weighted probability map, and the segmentation result was then obtained via N_Y, as shown in Equation (16).
\hat{Y}(k) = N_Y\big( w_k \cdot \tilde{F}(k) \big),
where Ŷ(k) ∈ R^{H×W×C}.
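A sketch of the small-pixel reweighting is given below; the normalization here uses a single standard deviation over all class weights, and the ignore label 255 follows the CityScapes convention (both are assumptions of this sketch rather than details given in the paper).

```python
import torch

def class_weights(label_map, num_classes, std=0.05, avg=1.0, ignore_index=255):
    """Compute w_k = -log(p_k), then normalize to keep the weights positive."""
    valid = label_map[label_map != ignore_index]
    counts = torch.bincount(valid.long(), minlength=num_classes).float()
    p_k = counts / counts.sum().clamp(min=1.0)        # class pixel frequencies
    w = -torch.log(p_k.clamp(min=1e-6))               # rarer class -> larger weight
    w = (w - w.mean()) / w.std().clamp(min=1e-6) * std + avg
    return w                                          # shape: (num_classes,)

def reweighted_prediction(logits, weights, mask_act):
    """Eq. (16): scale each class channel of F~ by w_k, then apply N_Y."""
    return mask_act(logits * weights.view(1, -1, 1, 1))
```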
Therefore, the total loss of the whole network is:
L_{total} = L_{en} + L_{seg} + L_{de} + L_{adv}.
In summary, we designed a segmentation network for decoupling body and edge. It predicts the body and edge features of the input image, constrains them with the L2 and L1 losses, respectively, and then synthesizes them into a segmentation feature map. After that, two discriminators are used in the multi-target domain adversarial learning strategy to distinguish the outputs of the source and daytime domains and of the source and nighttime domains. Additionally, probabilistic reweighting is used to optimize the segmentation prediction for small targets.
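The alternating optimization of the generator and the two discriminators can then be sketched as a single training step, reusing the adversarial loss helpers above. The container layout (nets, optims), the helper weighted_segmentation_loss, and the decoupling_loss accessor are hypothetical conveniences of this sketch, not interfaces defined by the paper.

```python
def train_step(batch, nets, optims):
    """One alternating generator/discriminator update (sketch).

    batch  : (x_s, y_s, x_td, x_tn) tensors for the three domains.
    nets   : dict with "enhance", "seg_pre", "seg", "d_day", "d_night".
    optims : dict with "gen", "d_day", "d_night" optimizers.
    """
    x_s, y_s, x_td, x_tn = batch

    # ---- Generator step: total loss of Eq. (17). ----
    optims["gen"].zero_grad()
    x_tn_enh, l_en = nets["enhance"](x_tn)             # enhanced image + L_en
    f_s = nets["seg_pre"](x_s)                         # pre-trained source branch G_pre
    f_td, f_tn = nets["seg"](x_td), nets["seg"](x_tn_enh)
    l_seg = weighted_segmentation_loss(f_s, y_s)       # Eq. (15), hypothetical helper
    l_de = nets["seg"].decoupling_loss()               # Eq. (11), assumed accessor
    l_adv = (generator_adversarial_loss(nets["d_day"](f_td))
             + generator_adversarial_loss(nets["d_night"](f_tn)))
    (l_en + l_seg + l_de + l_adv).backward()
    optims["gen"].step()

    # ---- Discriminator step: source outputs -> 1, target outputs -> 0. ----
    for key, f_t in (("d_day", f_td), ("d_night", f_tn)):
        optims[key].zero_grad()
        d_loss = discriminator_loss(nets[key](f_s.detach()),
                                    nets[key](f_t.detach()))
        d_loss.backward()
        optims[key].step()
```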

4. Experiments

4.1. Experimental Settings

To assess the performance of the proposed DLA-Net and its components, the mean Intersection over Union (mIoU), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM) were employed as the evaluation metrics in the experiments and compared with state-of-the-art methods. The mIoU is a widely used metric for evaluating the accuracy of pixel-level semantic segmentation models; it calculates the ratio of the intersection to the union between the predicted segmentation results and the true labels. PSNR and SSIM are commonly used metrics for evaluating image enhancement. Both compare the differences between the original and the compressed/distorted images: PSNR measures image quality via the peak signal-to-noise ratio, while SSIM evaluates similarity in terms of structure, brightness, and contrast. In addition, the following datasets were used for training all segmentation models and for performance evaluation during the daytime–nighttime domain adaptive transfer process.
  • CityScapes [24]: The CityScapes dataset comprises 5000 street view images with a resolution of 2048 × 1024, divided into 2975 training images, 500 validation images, and 1525 testing images. Each image is annotated at the pixel level with 19 categories. We used the CityScapes training images as the training data in the training phase and as the comparison dataset for the decoupled body and edge segmentation module;
  • Dark Zurich [4]: The Dark Zurich dataset comprises 2416 nighttime images, 2920 twilight images, and 3041 daytime images for training, all with a resolution of 1920 × 1080 . The images in these three domains are roughly aligned using GPS localization of neighboring locations and panning/zooming operations in all directions. In this paper, 2416 nighttime images were utilized to train the network model (without utilizing twilight images). In addition to the above images used for training, Dark Zurich contains 201 annotated nighttime images, of which 50 were used for validation (Dark Zurich-val) and 151 for testing (Dark Zurich-test) and evaluation;
  • Nighttime Driving [9]: In the experiments, we exclusively utilized the Nighttime Driving test set, which comprises 50 nighttime images with a resolution of 1920 × 1080. All images are pixel-level annotated using the 19 CityScapes categories;
  • SICE dataset [42]: The Part 2 subset of the SICE dataset was utilized in this paper, which comprises 229 multi-exposure sequences and the reference image corresponding to each sequence. In the experiments, only the low-light images from the Part 2 subset were used.
In this study, we implemented the proposed adversarial learning domain adaptive semantic segmentation network in PyTorch on a single NVIDIA 3060 GPU, and all networks were trained with the same settings. Following [43], we trained the networks using an SGD optimizer with a momentum of 0.9 and a weight decay of 5 × 10⁻⁴. The base learning rate of the network was 2.5 × 10⁻⁴, and the learning rate was then reduced using a polynomial learning rate strategy with a decay power of 0.9. The batch size was 2. We used an Adam optimizer [44] to train the discriminators, with β set to (0.9, 0.99). The learning rate of the discriminators followed the same decay strategy as the generator. The total enhancement loss L_{en} incorporates the weights λ_1 and λ_2, which were selected from the intervals [0.1, 1.0] and [20, 25], respectively. These values were chosen based on previous similar work and the characteristics of the different losses. After experimentation on the validation set, λ_1 and λ_2 were set to 0.4 and 20, respectively. In L_{de}, the weight of L_{edge} is determined by λ_3. This hyperparameter is needed because L_{body} and L_{edge} are complementary, with far fewer pixels in the edge part than in the body part. The default value of λ_3 is 1, but, in our experiments, we found that a value of 0.4 produced the best segmentation. To ensure the positivity of w_k, we set std = 0.05 and avg = 1.0 in the experiments. ResNet-101 [45] was used as the backbone. To facilitate smoother convergence during training, we used a total of 180,000 pre-training epochs on the CityScapes dataset with three different semantic segmentation models. Table 1 presents the performance of the three distinct semantic segmentation models on the validation sets of CityScapes and Dark Zurich.
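The optimizer and learning-rate schedule described above can be reproduced with a few lines of PyTorch; the discriminator learning rate below is an assumed placeholder, since only its decay strategy (and not its value) is specified.

```python
import torch

def build_optimizers(seg_net, discriminators, base_lr=2.5e-4, d_lr=1e-4):
    """Optimizers mirroring the reported settings (sketch)."""
    gen_opt = torch.optim.SGD(seg_net.parameters(), lr=base_lr,
                              momentum=0.9, weight_decay=5e-4)
    disc_opts = [torch.optim.Adam(d.parameters(), lr=d_lr, betas=(0.9, 0.99))
                 for d in discriminators]
    return gen_opt, disc_opts

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Polynomial learning-rate decay with power 0.9."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```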

4.2. Comparison with Other Methods

Comparison on Dark Zurich-test: In this paper, we first compare the proposed DLA-Net with several state-of-the-art methods on Dark Zurich-test [4], including CPSL [49], ProCA [50], and DiGA [51], as well as some other domain adaptation methods [5,21,52]. The performance results are summarized in Table 2.
In Table 2, CPSL, ProCA, and DiGA are shown to have utilized the same baseline RefineNet, while the other methods employed DeepLab v3+. Additionally, all methods utilized ResNet-101 as the backbone [45], and the experimental dataset was Dark Zurich-test. DLA-Net with DeepLab-v3+, RefineNet, or PSPNet achieved superior or equivalent performance compared to existing methods on this dataset. It attained an overall improvement in mIoU of 2.2% compared to the highest score obtained by existing methods (DiGA). Furthermore, the DLA-Net proposed in this paper excelled in various categories, such as roads, sidewalks, and sky. For example, in the sky category, DLA-Net outperformed ProCA and DiGA by 51.6 mIoU and 27.9 mIoU, respectively, demonstrating its ability to accurately segment these categories despite a large daytime–nighttime domain gap. Figure 4 provides the visualization results of the comparison experiments with ProCA [50] and DiGA [51], highlighting the superior performance of DLA-Net in the categories of sky, road, and sidewalk.
Comparison on Nighttime Driving: We compared the proposed DLA-Net with other baseline methods on the Nighttime Driving test set [9], and the results are reported in Table 3.
It is important to note that the Nighttime Driving test dataset is not as finely labeled as the Dark Zurich test dataset as some elements like buildings and vegetation are not labeled. Of the 50 images in the Nighttime Driving test dataset, only two are labeled with the sky category. Despite the limited number of labeled categories and the small dataset size, the DLA-Net with DeepLab-v3+ still achieved the second-best performance (DiGA was the top performer) on this dataset. Figure 5 reports the visualization results of the ProCA [50] and DiGA [51] comparison experiments. This underscores that the DLA-Net proposed in this paper can produce superior results even when working with a small number of samples and labeled categories for segmentation tasks.
Comparison of decoupling body and edge segmentation module with other advanced segmentation methods: This paper uses ResNet-101 as a backbone [45] on the CityScapes dataset [24] to compare the decoupling body and edge segmentation module with some state-of-the-art techniques. The experimental results are listed in Table 4.
As shown in Table 4, the decoupling body and edge segmentation method proposed in this paper achieved the highest mIoU among all methods, reaching 83.1 mIoU. This demonstrates the effectiveness of the body generation branch and the edge preservation branch in segmentation. The body branch with L 2 loss constraints obtained prominent body features, while the edge branch with L 1 loss constraints captured clear edge features. The combination of these features resulted in an overall segmentation map that has been experimentally proven to yield better segmentation performance. Ablation experiments between different components are discussed in the next section.
Comparison of low-light image enhancement sub-network with other methods: In this paper, reference image quality assessment metrics PSNR and SSIM were used to quantitatively compare the performance of different methods on the SICE Part 2 test set [42]. Higher values of SSIM and PSNR indicate that the enhanced image is closer to the ground truth in terms of structural properties and pixel-level image content, respectively. The experimental results are presented in Table 5.
Table 5 reveals that, despite not using any paired or unpaired training data, the LIE-SubNet proposed in this paper still achieved the best PSNR and second-best SSIM results (EnlightenGAN was the top performer). Combining the mapping curve and the depth network resulted in a 1.1% improvement compared to the second-best performance, and the mapping curve with multiple iterations made the overall pixels of the low-light images more uniform (see Figure 5).
In this subsection, we present comparative experiments of the overall method using the Dark Zurich dataset and the Nighttime Driving dataset. The experimental results validate the excellent performance of the proposed method. However, it should be noted that the Dark Zurich dataset mostly contains unobstructed image foregrounds, and there exists a one-to-one correspondence between the daytime and nighttime images. As a result, DLA-Net is able to perform optimally and obtain excellent results. When there is occlusion in the foreground of an image, the LIE-SubNet and the decoupled body and edge segmentation network may not be able to effectively enhance the image and perform decoupled segmentation.

4.3. Ablation Study

To demonstrate the effectiveness of the different components of the proposed DLA-Net, a series of ablation experiments were conducted on several model variants. The results of the ablation experiments on the different components are detailed below.
Ablation study on decoupling body and edge modules: The effectiveness of the two branches in the decoupling body and edge segmentation network is illustrated in Table 6, where ρ and δ denote the body generation branch and the edge preservation branch, respectively. The direct addition of the body generation and edge preservation branches in DeepLab-V3+ [48] improved the segmentation effect by 1.7%, implying that both branches are effective. After adding L b o d y and L e d g e , respectively, there were further improvements of 0.5% and 0.4% in performance, demonstrating that using L 2 and L 1 loss constraints can facilitate the model’s learning of different features. Finally, when all losses were combined, the performance was further improved by 1.7%. This paper also investigated the necessity of the F d e t a i l module, and its removal resulted in a decrease in segmentation performance of about 0.7%.
The results of the ablation studies for each component of the decoupling body and edge segmentation module are shown in Table 7. After removing the average pooling layer and the encoder–decoder in the body generation branch, the model performance decreased by 1.5% and 1.0%, respectively. After removing the edge preservation branch, the model performance decreased by 0.4%. Therefore, removing the three modules individually leads to varying degrees of degradation in segmentation performance. This indicates that the average pooling and the encoder–decoder help predict the body feature in the body generation branch, with average pooling contributing more, while the entire edge preservation branch also elevates the performance of the segmentation network.
Ablation study for each loss in the low-light image enhancement sub-network: The visualization results of the ablation experiments with different loss functions in the LIE-SubNet are presented in Figure 6. After removing the exposure control loss L_{ec}, the image exhibited overexposure in areas with strong lighting, underscoring the effectiveness of the exposure constraint in the network. The removal of the color constancy loss L_{cc} resulted in severe color deviation across the whole image. With the removal of the illumination smoothness loss L_{is}, artifacts appeared between adjacent regions in the image. These experiments highlight the critical contribution of each loss function used in the LIE-SubNet.
Ablation study on different components of DLA-Net: As shown in Table 8, AdaptSegNet [5] was used as the baseline and DLA-Net as the full model. Although X_{T_d} is unlabeled, using the roughly aligned X_{T_d} to help predict X_{T_n} proved quite important and played a key role in DLA-Net: removing X_{T_d} reduced the segmentation results by 36.8%, indicating that training on the daytime domain is critical in the network. The LIE-SubNet and the corresponding loss function L_{en} also contribute to the whole network. Meanwhile, utilizing the decoupling body and edge loss L_{de} yielded superior results compared to applying the cross-entropy loss directly to compute the segmentation loss; the performance disparity between the two is notable, with the L_{de} loss outperforming the cross-entropy loss by 38%. In addition, the adoption of probabilistic reweighting in the experiments enhanced the segmentation performance, affirming its effectiveness as an auxiliary tool.
This section presents comparison and ablation experiments on DLA-Net and its components across multiple datasets. The results demonstrate that DLA-Net efficiently segments images in nighttime scenes without using labeled nighttime images or synthetic datasets. LIE-SubNet effectively brightens low-light images, and the decoupling body and edge segmentation network effectively predicts feature maps with a uniform body and clear edges. However, DLA-Net struggles with domain adaptation when faced with large domain gaps caused by differences in style and inherent variations between datasets, such as in urban street scenes. Future work will focus on in-depth research to better understand these differences and to adapt to a wider range of scenes and datasets.

5. Conclusions

In this paper, we proposed DLA-Net, an adversarial learning domain adaptive semantic segmentation network capable of performing semantic segmentation of nighttime low-light images. DLA-Net leverages a combination of mapping curve iteration and a deep network to enhance low-light images, ensuring that their distributions align with those of other domains. The segmentation network employs decoupled body and edge modules that efficiently obtain body and edge features, respectively. After the segmentation network, two discriminators are used to differentiate the outputs from different domains. A multi-target domain adversarial learning strategy is thus constituted between the generator and the discriminators to realize adversarial learning domain adaptation for multi-target domains. The experimental results underscore the efficacy of each designed component, showcasing outstanding performance on datasets such as Dark Zurich and Nighttime Driving. State-of-the-art performance is also obtained on unlabeled or sparsely labeled datasets, and segmentation performance is better on recognizable classes with large domain gaps. However, DLA-Net does not adapt well to the different styles and inherent differences between datasets. Future work will investigate how to adapt to more scenarios and datasets.

Author Contributions

Conceptualization, M.W. and Z.Z.; methodology, M.W. and Z.Z.; software, M.W. and Z.Z.; validation, M.W., Z.Z. and H.L.; formal analysis, M.W. and Z.Z.; investigation, M.W. and Z.Z.; resources, M.W. and Z.Z.; data curation, M.W., Z.Z. and H.L.; Writing—original draft, M.W. and Z.Z.; Writing—review and editing, M.W., Z.Z. and H.L.; visualization, Z.Z.; supervision, M.W. and H.L.; project administration, M.W.; funding acquisition, M.W. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (62062048) and Yunnan Provincial Science and Technology Plan Project (202201AT070113). This work is also supported by the Faculty of Information Engineering and Automation, Kunming University of Science and Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in CityScapes at https://www.cityscapes-dataset.com/, Dark Zurich at https://trace.ethz.ch/publications/2019/GCMA_UIoU/, Nighttime Driving at http://people.ee.ethz.ch/~daid/NightDriving/ and SICE dataset at https://github.com/csjcai/SICE?tab=readme-ov-file.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  2. Chen, X.; Williams, B.M.; Vallabhaneni, S.R.; Czanner, G.; Williams, R.; Zheng, Y. Learning active contour models for medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11632–11640. [Google Scholar]
  3. Zhang, X.; Chen, Y.; Zhu, B.; Wang, J.; Tang, M. Part-aware context network for human parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8971–8980. [Google Scholar]
  4. Sakaridis, C.; Dai, D.; Gool, L.V. Guided curriculum model adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7374–7383. [Google Scholar]
  5. Tsai, Y.H.; Hung, W.C.; Schulter, S.; Sohn, K.; Yang, M.H.; Chandraker, M. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7472–7481. [Google Scholar]
  6. Bang, G.; Lee, J.; Endo, Y.; Nishimori, T.; Nakao, K.; Kamijo, S. Semantic and Geometric-Aware Day-to-Night Image Translation Network. Sensors 2024, 24, 1339. [Google Scholar] [CrossRef] [PubMed]
  7. Manettas, C.; Nikolakis, N.; Alexopoulos, K. Synthetic datasets for Deep Learning in computer-vision assisted tasks in manufacturing. Procedia CIRP 2021, 103, 237–242. [Google Scholar] [CrossRef]
  8. Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423. [Google Scholar]
  9. Dai, D.; Van Gool, L. Dark model adaptation: Semantic image segmentation from daytime to nighttime. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 3819–3824. [Google Scholar]
  10. Sakaridis, C.; Dai, D.; Van Gool, L. Map-guided curriculum domain adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 3139–3153. [Google Scholar] [CrossRef] [PubMed]
  11. Kurmi, V.K.; Bajaj, V.; Subramanian, V.K.; Namboodiri, V.P. Curriculum based dropout discriminator for domain adaptation. arXiv 2019, arXiv:1907.10628. [Google Scholar]
  12. Sun, L.; Wang, K.; Yang, K.; Xiang, K. See clearer at night: Towards robust nighttime semantic segmentation through day-night image conversion. In Artificial Intelligence and Machine Learning in Defense Applications; SPIE: Strasbourg, France, 2019; Volume 11169, pp. 77–89. [Google Scholar]
  13. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  14. Hung, W.C.; Tsai, Y.H.; Liou, Y.T.; Lin, Y.Y.; Yang, M.H. Adversarial learning for semi-supervised semantic segmentation. arXiv 2018, arXiv:1802.07934. [Google Scholar]
  15. Nag, S.; Adak, S.; Das, S. What’s there in the dark. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 2996–3000. [Google Scholar]
  16. Fan, R.; Xie, J.; Yang, J.; Hong, Z.; Xu, Y.; Hou, H. Multiscale Change Detection Domain Adaptation Model Based on Illumination–Reflection Decoupling. Remote Sens. 2024, 16, 799. [Google Scholar] [CrossRef]
  17. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  18. Wu, X.; Wu, Z.; Guo, H.; Ju, L.; Wang, S. Dannet: A one-stage domain adaptation network for unsupervised nighttime semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15769–15778. [Google Scholar]
  19. Hoffman, J.; Wang, D.; Yu, F.; Darrell, T. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv 2016, arXiv:1612.02649. [Google Scholar]
  20. Pathak, D.; Krahenbuhl, P.; Darrell, T. Constrained convolutional neural networks for weakly supervised segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1796–1804. [Google Scholar]
  21. Li, Y.; Yuan, L.; Vasconcelos, N. Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6936–6945. [Google Scholar]
  22. Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.Y.; Isola, P.; Saenko, K.; Efros, A.; Darrell, T. Cycada: Cycle-consistent adversarial domain adaptation. In Proceedings of the International Conference on Machine Learning, Chengdu, China, 15–18 July 2018; pp. 1989–1998. [Google Scholar]
  23. Wu, Z.; Han, X.; Lin, Y.L.; Uzunbas, M.G.; Goldstein, T.; Lim, S.N.; Davis, L.S. Dcan: Dual channel-wise alignment networks for unsupervised scene adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 518–534. [Google Scholar]
  24. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  25. Liu, M.Y.; Breuel, T.; Kautz, J. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems; MIT Press: Long Beach, CA, USA, 2017; Volume 30. [Google Scholar]
  26. Xie, Q.; Luong, M.T.; Hovy, E.; Le, Q.V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10687–10698. [Google Scholar]
  27. Lian, Q.; Lv, F.; Duan, L.; Gong, B. Constructing self-motivated pyramid curriculums for cross-domain semantic segmentation: A non-adversarial approach. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6758–6767. [Google Scholar]
  28. Zhang, Y.; David, P.; Gong, B. Curriculum domain adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2020–2030. [Google Scholar]
  29. Anoosheh, A.; Sattler, T.; Timofte, R.; Pollefeys, M.; Van Gool, L. Night-to-day image translation for retrieval-based localization. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5958–5964. [Google Scholar]
  30. Vertens, J.; Zürn, J.; Burgard, W. Heatnet: Bridging the day-night domain gap in semantic segmentation with thermal images. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 8461–8468. [Google Scholar]
  31. Di, S.; Feng, Q.; Li, C.G.; Zhang, M.; Zhang, H.; Elezovikj, S.; Tan, C.C.; Ling, H. Rainy night scene understanding with near scene semantic adaptation. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1594–1602. [Google Scholar] [CrossRef]
  32. Valada, A.; Vertens, J.; Dhall, A.; Burgard, W. Adapnet: Adaptive semantic segmentation in adverse environmental conditions. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4644–4651. [Google Scholar]
  33. Lv, F.; Lu, F.; Wu, J.; Lim, C. MBLLEN: Low-Light Image/Video Enhancement Using CNNs. In Proceedings of the BMVC, Newcastle, UK, 3–6 September 2018; Volume 220, p. 4. [Google Scholar]
  34. Bychkovsky, V.; Paris, S.; Chan, E.; Durand, F. Learning photographic global tonal adjustment with a database of input/output image pairs. In Proceedings of the CVPR 2011, Providence, RI, USA, 20–25 June 2011; pp. 97–104. [Google Scholar]
  35. Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2599–2613. [Google Scholar] [CrossRef] [PubMed]
  36. Li, C.; Guo, C.; Loy, C.C. Learning to enhance low-light image via zero-reference deep curve estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4225–4238. [Google Scholar] [CrossRef] [PubMed]
  37. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. In Advances in Neural Information Processing Systems; Google DeepMind: London, UK, 2015; Volume 28. [Google Scholar]
  38. Zhu, X.; Xiong, Y.; Dai, J.; Yuan, L.; Wei, Y. Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2349–2358. [Google Scholar]
  39. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 1735–1742. [Google Scholar]
  40. Li, X.; Yang, Y.; Zhao, Q.; Shen, T.; Lin, Z.; Liu, H. Spatial pyramid based graph reasoning for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8950–8959. [Google Scholar]
  41. Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. Enlightengan: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349. [Google Scholar] [CrossRef] [PubMed]
  42. Chen, Y.S.; Wang, Y.C.; Kao, M.H.; Chuang, Y.Y. Deep photo enhancer: Unpaired learning for image enhancement from photographs with gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6306–6314. [Google Scholar]
  43. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  44. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  46. Lin, G.; Milan, A.; Shen, C.; Reid, I. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1925–1934. [Google Scholar]
  47. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  48. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  49. Li, R.; Li, S.; He, C.; Zhang, Y.; Jia, X.; Zhang, L. Class-balanced pixel-level self-labeling for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11593–11603. [Google Scholar]
  50. Zhang, P.; Zhang, B.; Zhang, T.; Chen, D.; Wang, Y.; Wen, F. Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12414–12424. [Google Scholar]
  51. Shen, F.; Gurram, A.; Liu, Z.; Wang, H.; Knoll, A. DiGA: Distil to Generalize and then Adapt for Domain Adaptive Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15866–15877. [Google Scholar]
  52. Vu, T.H.; Jain, H.; Bucher, M.; Cord, M.; Pérez, P. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2517–2526. [Google Scholar]
  53. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Learning a discriminative feature network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1857–1866. [Google Scholar]
  54. Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Loy, C.C.; Lin, D.; Jia, J. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 267–283. [Google Scholar]
  55. Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684–3692. [Google Scholar]
  56. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  57. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  58. Ding, H.; Jiang, X.; Liu, A.Q.; Thalmann, N.M.; Wang, G. Boundary-aware feature propagation for scene segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6819–6829. [Google Scholar]
  59. Zhang, F.; Chen, Y.; Li, Z.; Hong, Z.; Liu, J.; Ma, F.; Han, J.; Ding, E. Acfnet: Attentional class feature network for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6798–6807. [Google Scholar]
  60. Li, X.; Zhao, H.; Han, L.; Tong, Y.; Yang, K. Gff: Gated fully fusion for semantic segmentation. arXiv 2019, arXiv:1904.01803. [Google Scholar]
  61. Li, X.; Li, X.; Zhang, L.; Cheng, G.; Shi, J.; Lin, Z.; Tan, S.; Tong, Y. Improving semantic segmentation via decoupled body and edge supervision. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 435–452. [Google Scholar]
  62. Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep retinex decomposition for low-light enhancement. arXiv 2018, arXiv:1808.04560. [Google Scholar]
  63. Liu, R.; Ma, L.; Zhang, J.; Fan, X.; Luo, Z. Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10561–10570. [Google Scholar]
  64. Ma, L.; Ma, T.; Liu, R.; Fan, X.; Luo, Z. Toward fast, flexible, and robust low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5637–5646. [Google Scholar]
Figure 1. Overall structure of the proposed network (DLA-Net). The network takes three types of domain-related samples as input: the source-domain image X_S and the target-domain images X_{T_d} and X_{T_n}. Within the framework, L_{S→T_d} and L_{S→T_n} are the adversarial losses between S and T_d and between S and T_n, obtained from the discriminators D_{S→T_d} and D_{S→T_n}, respectively.
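To connect the notation in Figure 1 to an implementation, the following is a minimal, illustrative PyTorch sketch of output-space adversarial alignment with a fully convolutional (patch) discriminator, in the spirit of the losses L_{S→T_d}/L_{S→T_n}. The class and function names (PatchDiscriminator, generator_adv_loss), the layer widths, and the use of softmax segmentation probabilities as discriminator input are our assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: adversarial alignment between source and target predictions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDiscriminator(nn.Module):
    """Small fully convolutional discriminator over softmax segmentation maps."""
    def __init__(self, num_classes: int, ndf: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_classes, ndf, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf, ndf * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 2, 1, 4, stride=2, padding=1),  # per-patch real/fake logits
        )

    def forward(self, seg_probs: torch.Tensor) -> torch.Tensor:
        return self.net(seg_probs)

def generator_adv_loss(disc: nn.Module, target_probs: torch.Tensor) -> torch.Tensor:
    """Segmentation-network side: push target-domain predictions to look source-like (label 1)."""
    logits = disc(target_probs)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

def discriminator_loss(disc: nn.Module, source_probs, target_probs) -> torch.Tensor:
    """Discriminator side: label source predictions 1 and target predictions 0."""
    s_logits, t_logits = disc(source_probs.detach()), disc(target_probs.detach())
    return (F.binary_cross_entropy_with_logits(s_logits, torch.ones_like(s_logits))
            + F.binary_cross_entropy_with_logits(t_logits, torch.zeros_like(t_logits)))
```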
Figure 2. Architecture of the LIE-SubNet. The network estimates a set of optimal light-enhancement curves (I_e curves) that iteratively enhance the input image. The deep curve-fitting network is a plain CNN with six alternately connected convolutional layers, each consisting of 32 3 × 3 convolutional kernels with a stride of 1; a ReLU activation is applied at the end of the network.
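For orientation, below is a hedged sketch of the kind of curve iteration described above, in the spirit of zero-reference curve estimation [36]: a small CNN predicts per-pixel curve parameters A, and the quadratic mapping I ← I + A·I·(1 − I) is applied for a fixed number of iterations. The layer count and width follow the caption; the tanh head, the number of iterations, and the plain sequential connectivity are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CurveEstimator(nn.Module):
    """Six 3x3 conv layers with 32 kernels and stride 1, as in Figure 2.
    The 3-channel tanh head and 8 curve iterations are assumptions."""
    def __init__(self, channels: int = 32, n_iter: int = 8):
        super().__init__()
        layers, in_ch = [], 3
        for _ in range(6):
            layers += [nn.Conv2d(in_ch, channels, 3, padding=1), nn.ReLU(inplace=True)]
            in_ch = channels
        self.features = nn.Sequential(*layers)
        self.head = nn.Conv2d(channels, 3, 3, padding=1)
        self.n_iter = n_iter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: low-light image scaled to [0, 1]
        a = torch.tanh(self.head(self.features(x)))   # per-pixel curve parameters in (-1, 1)
        out = x
        for _ in range(self.n_iter):                   # iterative light-enhancement curve
            out = out + a * out * (1.0 - out)
        return out
```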
Figure 3. Decoupling module for body generation and edge preservation. In this module, X is the input image, F is derived from the backbone network and a dilated convolution, and F_{detail} is the high-frequency detail feature output by a low layer of the backbone. In the body-generation branch, F and F_{up} are concatenated and fed to a 1 × 1 Conv; notably, F_{low} is not added to the concatenation. Average pooling and feature relocation are then performed to obtain the body feature F_{body}, which has a distinct body but blurred edges. In the edge-preservation branch, (F − F_{body}) is concatenated with F_{detail} and fed to a 1 × 1 Conv to obtain the edge feature F_{edge} with a clear boundary. The final segmentation prediction is F̃ = F_{body} + F_{edge}.
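The decoupling in Figure 3 reduces to a few tensor operations. Below is an illustrative simplification under our own naming (BodyEdgeDecoupling): the body branch smooths F and relocates the coarse context back to full resolution, the edge branch combines the residual F − F_{body} with the low-level detail feature F_{detail}, and the two parts are summed. The pooling factor, the ordering of pooling versus the 1 × 1 convolution, and the channel widths are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class BodyEdgeDecoupling(nn.Module):
    """Simplified body/edge decoupling (cf. Figure 3)."""
    def __init__(self, ch: int, detail_ch: int):
        super().__init__()
        self.body_conv = nn.Conv2d(2 * ch, ch, kernel_size=1)          # fuses F with its relocated coarse copy
        self.edge_conv = nn.Conv2d(ch + detail_ch, ch, kernel_size=1)  # fuses residual with F_detail

    def forward(self, F_feat: torch.Tensor, F_detail: torch.Tensor) -> torch.Tensor:
        h, w = F_feat.shape[-2:]
        # Body branch: average-pooled context relocated (upsampled) back to full resolution.
        F_low = F_nn.adaptive_avg_pool2d(F_feat, (max(h // 4, 1), max(w // 4, 1)))
        F_up = F_nn.interpolate(F_low, size=(h, w), mode="bilinear", align_corners=False)
        F_body = self.body_conv(torch.cat([F_feat, F_up], dim=1))
        # Edge branch: high-frequency residual fused with the low-level detail feature.
        F_detail = F_nn.interpolate(F_detail, size=(h, w), mode="bilinear", align_corners=False)
        F_edge = self.edge_conv(torch.cat([F_feat - F_body, F_detail], dim=1))
        return F_body + F_edge    # final decoupled representation, F~ = F_body + F_edge
```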
Figure 4. Visualization results of DLA-Net and some other baseline methods on Dark Zurich-val.
Figure 5. Visualization results of ablation experiments with different loss functions.
Figure 6. Visualization results of ablation experiments with different loss functions.
Table 1. Performance of three semantic segmentation models on the validation sets of CityScapes and Dark Zurich.

| Method | Dark Zurich-val (mIoU) | CityScapes-val (mIoU) |
| --- | --- | --- |
| RefineNet [46] | 14.46 | 64.50 |
| PSPNet [47] | 11.44 | 64.97 |
| DeepLab-v3+ [48] | 11.58 | 63.77 |
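The mIoU values in Tables 1–3 are the mean of per-class intersection-over-union scores. As a reference, here is a hedged NumPy sketch of how mIoU is typically computed from a confusion matrix; the function name and the 19-class example are ours, not taken from the paper.

```python
import numpy as np

def mean_iou(conf: np.ndarray) -> float:
    """conf[i, j] counts pixels of ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)  # ignore classes absent from GT and prediction
    return float(np.nanmean(iou))

# Example with the 19 CityScapes classes:
# conf = np.zeros((19, 19), dtype=np.int64)
# np.add.at(conf, (gt_labels.ravel(), pred_labels.ravel()), 1)
# print(mean_iou(conf))
```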
Table 2. Per-class results of current state-of-the-art methods and the proposed DLA-Net on the Dark Zurich test set. CityScapes→DZ-night denotes adaptation from CityScapes to Dark Zurich-night. Bold font indicates the best result and underlining the second-best.

| Method | Road | Sidewalk | Building | Wall | Fence | Traffic Light | Traffic Sign | Vegetation | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | Motorcycle | Bicycle | Pole | mIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RefineNet-CityScapes [46] | 68.8 | 23.2 | 46.8 | 20.8 | 12.6 | 30.4 | 26.9 | 43.1 | 14.3 | 0.3 | 36.9 | 49.7 | 63.6 | 6.8 | 0.2 | 24.0 | 33.6 | 9.3 | 29.8 | 28.5 |
| PSPNet-CityScapes [47] | 79.0 | 21.8 | 53.0 | 13.8 | 11.2 | 20.2 | 21.9 | 43.5 | 10.4 | 20.2 | 37.4 | 33.8 | 64.1 | 6.4 | 0.0 | 52.3 | 30.4 | 7.4 | 22.5 | 28.8 |
| DeepLab-v3+-CityScapes [48] | 78.2 | 19.0 | 51.2 | 15.5 | 10.6 | 28.9 | 22.0 | 56.7 | 13.3 | 20.8 | 38.2 | 21.8 | 52.1 | 1.6 | 0.0 | 53.2 | 23.2 | 10.7 | 30.3 | 28.8 |
| AdaptSegNet-CityScapes→DZ-night [5] | 86.1 | 44.2 | 55.1 | 22.2 | 4.8 | 5.6 | 16.7 | 37.2 | 8.4 | 1.2 | 35.9 | 26.7 | 68.2 | 45.1 | 0.0 | 50.1 | 33.9 | 15.6 | 22.1 | 30.4 |
| ADVENT-CityScapes→DZ-night [52] | 85.8 | 37.9 | 55.5 | 27.7 | 14.5 | 14.0 | 21.1 | 32.1 | 8.7 | 2.0 | 39.9 | 16.6 | 64.0 | 13.8 | 0.0 | 58.8 | 28.5 | 20.7 | 23.1 | 29.7 |
| BDL-CityScapes→DZ-night [21] | 85.3 | 41.1 | 61.9 | 32.7 | 17.4 | 11.4 | 21.3 | 29.4 | 8.9 | 1.1 | 37.4 | 22.1 | 63.2 | 28.2 | 0.0 | 47.7 | 39.4 | 15.7 | 20.6 | 30.8 |
| CPSL [49] | 75.0 | 28.6 | 48.1 | 20.8 | 13.8 | 36.3 | 29.4 | 48.9 | 13.3 | 0.4 | 42.8 | 49.7 | 68.9 | 17.9 | 0.0 | 27.1 | 34.4 | 11.4 | 33.8 | 31.6 |
| ProCA [50] | 81.2 | 46.4 | 58.3 | 21.5 | 19.5 | 40.0 | 41.1 | 64.3 | 30.5 | 31.6 | 53.0 | 47.0 | 75.0 | 38.7 | 0.0 | 49.1 | 30.2 | 20.5 | 40.7 | 41.5 |
| DiGA [51] | 79.8 | 48.8 | 65.7 | 7.3 | 10.5 | 38.4 | 38.4 | 63.6 | 17.5 | 55.3 | 51.6 | 53.0 | 74.2 | 62.0 | 0.0 | 37.0 | 28.6 | 22.0 | 40.6 | 42.1 |
| DLA-Net (RefineNet) | 88.5 | 53.3 | 69.7 | 33.9 | 19.9 | 31.4 | 35.8 | 69.4 | 32.1 | 82.2 | 44.1 | 43.6 | 54.0 | 21.9 | 0.0 | 40.8 | 35.9 | 24.0 | 24.9 | 42.4 |
| DLA-Net (PSPNet) | 89.2 | 53.0 | 74.0 | 40.2 | 20.3 | 26.0 | 29.4 | 71.2 | 25.4 | 83.2 | 46.2 | 33.1 | 67.4 | 18.2 | 0.3 | 65.6 | 37.5 | 22.8 | 24.2 | 43.5 |
| DLA-Net (DeepLab-v3+) | 89.5 | 59.2 | 70.1 | 32.7 | 22.0 | 33.4 | 32.8 | 69.6 | 30.9 | 79.3 | 44.8 | 40.7 | 66.5 | 15.9 | 0.1 | 72.1 | 30.7 | 22.0 | 29.7 | 44.3 |
Table 3. Comparison of the proposed DLA-Net and baseline methods on the Nighttime Driving test set. Bold font indicates the best result and underlining the second-best.

| Method | mIoU |
| --- | --- |
| RefineNet-CityScapes [46] | 32.75 |
| PSPNet-CityScapes [47] | 25.44 |
| DeepLab-v3+-CityScapes [48] | 27.65 |
| AdaptSegNet-CityScapes→DZ-night [5] | 34.5 |
| ADVENT-CityScapes→DZ-night [52] | 34.7 |
| BDL-CityScapes→DZ-night [21] | 34.7 |
| CPSL [49] | 38.2 |
| ProCA [50] | 46.7 |
| DiGA [51] | 49.9 |
| DLA-Net (RefineNet) | 43.82 |
| DLA-Net (PSPNet) | 44.59 |
| DLA-Net (DeepLab-v3+) | 47.08 |
Table 4. Comparison of the decoupled body and edge segmentation module with other advanced segmentation methods. Bold font indicates the best result and underlining the second-best.

| Method | Backbone | mIoU |
| --- | --- | --- |
| DFN [53] | ResNet-101 | 79.3 |
| PSANet [54] | ResNet-101 | 80.1 |
| DenseASPP [55] | DenseNet-161 | 80.6 |
| DANet [56] | ResNet-101 | 81.5 |
| CCNet [57] | ResNet-101 | 81.4 |
| BAFNet [58] | ResNet-101 | 81.4 |
| ACFNet [59] | ResNet-101 | 81.9 |
| GFFnet [60] | ResNet-101 | 82.3 |
| X. Li et al. [61] | ResNet-101 | 82.8 |
| Ours | ResNet-101 | 83.1 |
Table 5. Comparison of the low-light image enhancement sub-network with other methods. Bold font indicates the best result and underlining the second-best.

| Method | PSNR↑ | SSIM↑ |
| --- | --- | --- |
| MBLLEN [33] | 14.78 | 0.534 |
| RetinexNet [62] | 15.56 | 0.525 |
| RUAS [63] | 16.40 | 0.500 |
| ZeroDCE [36] | 14.86 | 0.559 |
| SCI [64] | 14.78 | 0.522 |
| EnlightenGAN [41] | 17.48 | 0.651 |
| Ours | 18.10 | 0.638 |
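PSNR and SSIM in Table 5 are standard full-reference image-quality metrics (higher is better for both). As a reference, a minimal sketch using scikit-image (≥0.19 for channel_axis); the assumption here is that the reference and enhanced images are float arrays scaled to [0, 1].

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def quality_metrics(reference: np.ndarray, enhanced: np.ndarray):
    """Both inputs: H x W x 3 float arrays in [0, 1]."""
    psnr = peak_signal_noise_ratio(reference, enhanced, data_range=1.0)
    ssim = structural_similarity(reference, enhanced, channel_axis=-1, data_range=1.0)
    return psnr, ssim
```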
Table 6. Ablation of the decoupled body and edge segmentation module under different loss configurations. ✓ indicates that L_{body} or L_{edge} was used.

| Method | L_{body} | L_{edge} | mIoU | Δ (%) |
| --- | --- | --- | --- | --- |
| DeepLab-v3+ [48] | | | 74.6 | – |
| + ρ & δ | – | – | 75.9 | 1.7↑ |
| | ✓ | – | 76.3 | 0.5↑ |
| | – | ✓ | 76.6 | 0.4↑ |
| | ✓ | ✓ | 77.6 | 1.7↑ |
| w/o F_{detail} | ✓ | ✓ | 77.1 | 0.7↓ |
Table 7. Ablation study on the effect of each component.

| Method | mIoU | Δ (%) |
| --- | --- | --- |
| DeepLab-v3+ [48] + ρ & δ | 77.6 | – |
| w/o ρ average pooling | 76.4 | 1.5↓ |
| w/o ρ encoder–decoder | 76.8 | 1.0↓ |
| w/o δ | 77.2 | 0.4↓ |
Table 8. Ablation study of the DLA-Net (DeepLab-v3+) modules proposed in this paper on Dark Zurich-val.

| Method | mIoU |
| --- | --- |
| ProCA [50] | 25.47 |
| DiGA [51] | 25.21 |
| AdaptSegNet-CityScapes→DZ-night [5] | 19.13 |
| w/o X_{T_d} | 22.58 |
| w/o LIE-SubNet & L_{en} | 32.45 |
| w/o L_{en} | 33.86 |
| w/o L_{de} | 20.19 |
| with cross-entropy loss in L_{de} | 32.96 |
| w/o probability reweighting | 31.68 |
| w/o pre-trained segmentation model | 29.78 |
| DLA-Net | 35.74 |