The exceptional performance of generative adversarial networks (GANs) in image generation tasks primarily relies on the powerful nonlinear fitting capabilities of neural networks. Consequently, the quality and diversity of generated images are directly influenced by the design of the network architecture [21]. To address the limited scale of the dataset in this study and provide richer data samples for the training of the subsequent instance segmentation model, an improved model, StyleGAN-ALL, based on StyleGAN2-ADA, is proposed; the basic structure of the network is shown in Figure 4. This model further optimizes the quality and diversity of generated images by introducing a decoupled mapping network (DMN) and a convolutional coupling transfer block (CCTB). Specifically, the DMN reduces the coupling of the latent space encoding, while the CCTB module enhances the model's ability to capture complex context information by combining the advantages of convolutional neural networks (CNNs) and transformers.
2.2.1. StyleGAN2-ADA Network Model Architecture
The core idea of the traditional GAN is to make the generator (G) and the discriminator (D) compete with each other through adversarial training so that the generator produces realistic data that fool the discriminator. The generator outputs data, such as images and texts, from an input random noise vector $z$. The discriminator takes real data $x$ or generated data $G(z)$ as input and outputs a probability value (0~1) indicating the likelihood that the input is real. The training process is a minimax game, and the optimization objective function is as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$
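As a concrete illustration of this adversarial objective, the following is a minimal PyTorch-style sketch (not the authors' implementation; the generator `G`, discriminator `D`, optimizers, and data loading are assumed) of the discriminator loss and the non-saturating generator loss commonly used in practice:

```python
import torch
import torch.nn.functional as F

# Hypothetical generator G and discriminator D (nn.Module instances) are assumed.
def discriminator_loss(D, G, real_images, z):
    # D should assign high scores to real images x and low scores to G(z).
    real_logits = D(real_images)
    fake_logits = D(G(z).detach())            # detach: do not update G in this step
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def generator_loss(D, G, z):
    # Non-saturating generator loss: push D(G(z)) toward "real".
    fake_logits = D(G(z))
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```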
StyleGAN is NVIDIA's (NVIDIA Corporation, Santa Clara, California) improved GAN model for generating high-resolution, high-quality images. Its innovations include the mapping network and adaptive instance normalization (AdaIN). The main process is to map the random noise vector $z$ to the intermediate latent space $\mathcal{W}$ through the mapping network $f$ (composed of 8 fully connected layers), generating a latent code $w$ with more semantic and structural information, and then to generate the style parameters $y_s$ and $y_b$ through an affine transformation. These parameters are used to perform adaptive instance normalization on the features of each layer of the generator. The mathematical formula for AdaIN is as follows:

$$\text{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$
Here, $x_i$ is the activation feature of the current layer, and $\mu(x_i)$ and $\sigma(x_i)$ denote the mean and standard deviation of $x_i$, respectively. Through this module, the model can fine-tune the image at different scales, but it has two shortcomings: premature statistical normalization loses local details and produces artifacts, and each dimension of the latent code output by the mapping network comes from a single source, so there is a strong coupling effect. In view of the above problems, AdaIN and progressive training were found to be the main causes of abnormalities in the generated images. Therefore, StyleGAN2 proposes a modulated convolution mechanism that replaces the AdaIN module with weight modulation and demodulation to solve the droplet-artifact problem. It is calculated as follows:

$$w'_{ijk} = s_i \cdot w_{ijk}, \qquad w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} \left(w'_{ijk}\right)^2 + \epsilon}}$$
Here, $s_i$ is the modulation factor for input channel $i$, which scales that input channel; $w'_{ijk}$ is the modulated weight; and $w''_{ijk}$ is the demodulated (normalized) weight of $w'_{ijk}$, grouped by output channel $j$. Its role is to constrain the variance of the output feature map to 1 so as to avoid the instability of feature amplitudes caused by modulation. The constant $\epsilon$ is a very small value that prevents the denominator from being zero. At the same time, path length regularization (PLR) is introduced to constrain the change of the latent code $w$ to be linearly related to the change of the generated image, so that the gradient length of the generator remains stable in all directions. It is calculated as follows:

$$\mathcal{L}_{\text{PLR}} = \mathbb{E}_{w,\, y \sim \mathcal{N}(0, \mathbf{I})}\left( \left\| \mathbf{J}_w^{\mathsf{T}} y \right\|_2 - a \right)^2$$

Here, $w = f(z)$ means that the latent code is generated by the mapping network $f$ from the noise $z$; $y \sim \mathcal{N}(0, \mathbf{I})$ means that the random vector $y$ is sampled from a standard Gaussian distribution and is used to project the generated image $g(w)$; $\mathbf{J}_w^{\mathsf{T}} y$ is the gradient of the generated image in a random direction, obtained by differentiating with respect to $w$ (with $\mathbf{J}_w = \partial g(w)/\partial w$); and $a$ is a hyperparameter representing the target path length.
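To make the modulation and demodulation step above concrete, the following is a minimal PyTorch sketch (an illustration under assumed tensor shapes, not the official StyleGAN2 code): it scales the convolution kernel per input channel by a style-derived factor $s_i$ and then renormalizes each output channel.

```python
import torch

def modulated_demodulated_weight(weight, style, eps=1e-8):
    """
    weight: (out_ch, in_ch, k, k) convolution kernel
    style:  (batch, in_ch) modulation factors s_i, e.g., an affine projection of w
    Returns per-sample demodulated weights of shape (batch, out_ch, in_ch, k, k).
    """
    # Modulation: scale each input channel i of the kernel by s_i.
    w_mod = weight.unsqueeze(0) * style[:, None, :, None, None]
    # Demodulation: normalize each output channel j so that the expected
    # output variance is ~1 (sum over input channels and spatial taps).
    demod = torch.rsqrt(w_mod.pow(2).sum(dim=[2, 3, 4], keepdim=True) + eps)
    return w_mod * demod
```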
StyleGAN2-ADA [22,23] is a further improvement on StyleGAN2 that mainly addresses the tendency of the discriminator to overfit when training on small datasets. Its core innovations are the adaptive data augmentation (ADA) strategy and an improved discriminator training mechanism. During training, data augmentation operations such as flipping, rotation, and color jitter are applied to the input images, and the augmentation probability $p$ is dynamically adjusted according to the degree of discriminator overfitting. The adaptive augmentation probability formula is as follows:
where $r$ is the overfitting index; the larger its value, the more serious the overfitting. $\mathbb{E}[D_{\text{train}}]$ is the discriminator's prediction on the (augmented) training set, and $\mathbb{E}[D_{\text{val}}]$ is the discriminator's prediction on the held-out real images (unaugmented). $p_{\text{new}}$ is the augmentation probability after updating, $\eta$ is the adjustment step (learning rate), $r_{\text{target}}$ is the target overfitting level, and the clipping term limits the change per update to avoid sharp fluctuations.
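The update rule can be illustrated with a short sketch. Note that this is an assumed, simplified form of the adaptive adjustment described above (the exact overfitting index and step size may differ from the paper and from the official ADA implementation); the names `d_train`, `d_val`, and `p` are hypothetical.

```python
def update_augmentation_probability(p, d_train, d_val, eta=0.01,
                                    r_target=0.6, max_step=0.05):
    """
    p:        current augmentation probability (float in [0, 1])
    d_train:  discriminator outputs on augmented training images (tensor)
    d_val:    discriminator outputs on held-out, unaugmented real images (tensor)
    Returns the updated probability p_new.
    """
    # A simple overfitting index: the gap between training and held-out scores.
    # Larger r -> stronger overfitting -> augment more aggressively.
    r = (d_train.mean() - d_val.mean()).item()
    # Move p according to (r - r_target), with the step size clipped so that
    # p does not fluctuate sharply between updates.
    step = max(-max_step, min(max_step, eta * (r - r_target)))
    return min(1.0, max(0.0, p + step))
```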
However, this model still has some shortcomings. Although ADA alleviates the small-dataset problem to some extent, generative models may still struggle to capture rich style variation when data are very scarce. To further optimize the quality and diversity of the generated images, we introduce the convolutional coupling transfer block (CCTB) in the improved model StyleGAN-ALL proposed in this study. The CCTB module is inserted between modulation and demodulation, as shown by the orange module in Figure 4. This module combines the advantages of convolutional neural networks (CNNs) and transformers and can effectively enhance the model's ability to capture complex context information. In addition, the coupling effect of the $\mathcal{W}$-space latent coding in the original design may limit independent control of diverse styles. To solve this problem, a decoupled mapping network (DMN) is proposed in Section 2.2.2 to reduce feature coupling through a grouped design.
2.2.2. Decoupled Mapping Network (DMN)
Since all dimensions of the $\mathcal{W}$-space in the original design are generated by the same mapping network, there may be coupling effects between different style features, which can hinder fine control of diversified image generation. Given the diversity of category styles contained in the rural road dataset, this coupling must be reduced to better achieve fine-grained generation of independent styles for each category. In this study, we improved the mapping network by dividing the original 8-layer fully connected (FC) mapping network into four groups of independent subnetworks $f_1, f_2, f_3, f_4$, which are calculated as shown in Formula (6). Each group of subnetworks contains two FC layers, and each group outputs an intermediate latent code $w_i$. Finally, the outputs of the four groups are concatenated into a complete $w$. Each group of subnetworks focuses on different features: $w_1$ corresponds to road structure and geometric layout, such as straight, curved, and forked roads and terrain relief; $w_2$ corresponds to the natural environment and vegetation, such as vegetation types on both sides of the road, vegetation density, and vegetation wilting due to seasonal changes; $w_3$ corresponds to weather and lighting conditions, such as sunny days, cloudy days, rainy days, early morning, and evening, and the lighting direction and shadow position can be changed by adjusting $w_3$; and $w_4$ corresponds to artificial objects, such as road signs, surrounding buildings, utility poles, and fences. These latent codes are then used, after their respective affine transformations, to modulate the convolution kernels of each convolutional layer in the generator (i.e., through the modulated convolution and demodulation mechanism). To verify the effectiveness of the proposed method, we added a set of DMN ablation experiments to the experimental section. It is found that the DMN effectively reduces the strong coupling within the $\mathcal{W}$-space latent coding, thereby promoting the generation of images with greater diversity and higher style discrimination.
In Formula (6), $W_l^{(i)}$ and $b_l^{(i)}$ denote the weights and bias of the $l$-th layer in group $i$, $\sigma(\cdot)$ denotes the activation function, and the final latent code $w$ is the concatenation of each group's output, namely $w = [w_1, w_2, w_3, w_4]$.
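For clarity, the following is a minimal PyTorch sketch of the grouped mapping network described above (a simplified illustration with assumed dimensions and activation; the exact layer widths and normalization used in the paper are not specified here):

```python
import torch
import torch.nn as nn

class DecoupledMappingNetwork(nn.Module):
    """Four independent 2-layer FC subnetworks whose outputs are concatenated
    into the full latent code w = [w1, w2, w3, w4]."""
    def __init__(self, z_dim=512, group_dim=128, n_groups=4):
        super().__init__()
        self.groups = nn.ModuleList([
            nn.Sequential(
                nn.Linear(z_dim, group_dim), nn.LeakyReLU(0.2),
                nn.Linear(group_dim, group_dim), nn.LeakyReLU(0.2),
            )
            for _ in range(n_groups)
        ])

    def forward(self, z):
        # Each subnetwork maps the same noise z to its own style sub-code w_i
        # (road layout, vegetation, weather/lighting, artificial objects).
        return torch.cat([g(z) for g in self.groups], dim=1)

# Usage sketch:
# w = DecoupledMappingNetwork()(torch.randn(8, 512))   # -> shape (8, 512)
```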
2.2.3. Convolutional Coupling Transfer Block (CCTB)
In StyleGAN2-ADA, the feature learning modules at each layer still rely on basic convolutional modules. However, convolutional modules have limited receptive fields and can only extract local information from the multi-scale features progressively generated by the generator. This results in suboptimal performance in terms of continuity and generation quality when producing images with strong semantic information. Particularly in rural road image datasets, the generation of objects such as roads, trees, and skies depends on the perception of long-range dependencies. Compared to convolutional modules, transformers [24,25] exhibit stronger capabilities in modeling long-range dependencies, making them more suitable for generating such objects. Nevertheless, transformers also have certain limitations, such as slower convergence rates and the need for large amounts of training data. Therefore, to combine the strengths of both CNNs and transformers, this study designs a convolutional coupling transfer block (CCTB).
The module is primarily divided into five components: a local encoding module (LEM), cross-shaped window self-attention (CSWin-Attention), batch normalization (BN), layer normalization (LN), and a multilayer perceptron (MLP). Its structure is illustrated in Figure 5. The computational process of the CCTB module is as follows:

where $\hat{X}_l$ represents the intermediate output after applying the CSWin-Attention mechanism to the normalized features from the LEM; it captures enhanced contextual information by combining local features and long-range dependencies. $X_l$ is the final output of the CCTB module at layer $l$, representing a comprehensive feature map that combines local and global information.
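As an illustration of how these five components might be composed, the following PyTorch-style sketch shows one plausible forward pass of a CCTB block. The exact ordering of BN/LN and the residual connections is assumed here, since only the components are listed above; `lem` and `attn` stand in for the modules described in the following subsections and are expected to preserve the (B, C, H, W) shape.

```python
import torch.nn as nn

class CCTB(nn.Module):
    """Sketch: LEM for local features, CSWin-Attention for long-range context,
    then an MLP, with normalization and residual connections."""
    def __init__(self, dim, lem, attn):
        super().__init__()
        self.lem = lem                      # local encoding module (CNN branch)
        self.bn = nn.BatchNorm2d(dim)       # normalizes the locally encoded features
        self.attn = attn                    # cross-shaped window self-attention
        self.ln = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                                   # x: (B, C, H, W)
        x = self.lem(x)                                     # fine-grained local details
        x_hat = x + self.attn(self.bn(x))                   # X̂_l: add global context
        tokens = x_hat.flatten(2).transpose(1, 2)           # (B, H*W, C) for LN/MLP
        tokens = tokens + self.mlp(self.ln(tokens))         # X_l: final output
        return tokens.transpose(1, 2).reshape_as(x_hat)
```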
In image processing tasks for rural road scenes, variations in object appearance and location often pose a significant challenge for model recognition. In particular, the boundaries between objects in such unstructured scenes are usually not obvious and present fuzzy characteristics. To address this problem, combining the dual advantages of CNNs and transformers is an effective approach. First, CNNs have robust local feature extraction ability through their local receptive fields and weight-sharing mechanism, which allows them to effectively identify the edge information of these objects. Second, transformers incorporate position encodings that represent the relative positions of objects, which enhances the model's robustness to translation, scaling, and distortion.
To achieve effective fusion of local details and global context, the LEM is used to extract fine-grained local features. These local features are then fused with the global feature map generated by the transformer branch through a residual connection for multi-scale fusion, as shown in Formula (9). This not only retains the sensitivity of the CNN to local structural information but also integrates the long-range dependency modeling ability of transformers, finally forming a comprehensive feature representation containing both local details and global context. This composite feature representation mechanism improves the robustness of the model to appearance changes and location shifts of objects in complex rural road scenes.
where DWConv represents the depthwise separable convolution.
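A minimal sketch of such a local encoding module, assuming the residual form $\text{LEM}(X) = X + \text{DWConv}(X)$ implied by Formula (9) (the kernel size and the trailing pointwise convolution are assumptions for illustration):

```python
import torch.nn as nn

class LocalEncodingModule(nn.Module):
    """Depthwise separable convolution with a residual connection:
    fine-grained local features are added back onto the input feature map."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)  # depthwise
        self.pwconv = nn.Conv2d(dim, dim, kernel_size=1)               # pointwise

    def forward(self, x):                 # x: (B, C, H, W)
        return x + self.pwconv(self.dwconv(x))
```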
2. Cross-Shaped Window Self-Attention (CSWin-Attention)
Rural road scenes usually contain diverse objects, such as trees, traffic signs, and vehicles. The spatial relationships between these objects are complex, and there may be partial occlusions or uneven distributions. Traditional convolutional operations have limited receptive fields, making it difficult to capture global contextual information. The CSWin-Attention module divides the feature map into non-overlapping vertical and horizontal windows through a dynamic window offset at different levels, randomly offsetting the windows by $\Delta$ pixels ($sw$ is the window size). Through multi-level CCTB stacking, the dependencies of the cross regions can be progressively modeled, alleviating the problem of artifacts in areas of interleaved vegetation. With this dynamic window offset compensation mechanism, the shallow layers use a small $\Delta$ value, which preserves local details (such as the texture of vegetation leaves), while the deep layers use a larger $\Delta$ value, which models the global semantic associations of intersection regions (such as the topological relationship between vegetation and roads). Through multi-level concatenated stacking, the context information of the cross regions is gradually fused, avoiding the feature breakage caused by fixed window partitioning. In addition, CSWin-Attention uses multi-head attention grouping, assigning part of the attention heads to the vertical windows and the rest to the horizontal windows, so that the model attends to contexts in different directions simultaneously. This combination covers the intersection area at the junction of the two windows, reducing the artifacts caused by misses of single-direction attention. Moreover, because the long-range dependency modeling ability of the transformer is integrated, the features of intersection areas can be modeled more reasonably at different scales, leading to a better understanding of the spatial layout and semantic associations between objects and improving the overall coherence of the generated images. Additionally, rural road images often have high resolution, and traditional global self-attention mechanisms [26] have high computational complexity, making them difficult to apply directly to such scenarios. CSWin-Attention reduces computational complexity significantly by dividing the feature map into windows and computing self-attention within these windows. At the same time, the cross-shaped window design, which combines horizontal and vertical computations, ensures the capture of global information, achieving a balance between efficiency and performance in high-resolution image processing. This mechanism splits the input feature map $X$ along the channel dimension into horizontal and vertical attention branches, allowing independent computation of attention weights in both directions. The main structure of this mechanism is shown in Figure 6.
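One way to picture the window offset and head splitting described above is the following sketch, which cyclically shifts the feature map by a depth-dependent offset before partitioning and splits the channels between a vertical and a horizontal branch. This is an interpretation of the mechanism rather than the exact implementation, and the offset schedule `delta = min(sw, depth)` is an assumption.

```python
import torch

def split_for_cswin(x, depth, sw=8):
    """
    x: (B, C, H, W) feature map; depth: layer index (deeper -> larger offset).
    Returns the two channel halves, shifted by a depth-dependent offset, which
    would feed the vertical-window and horizontal-window attention branches.
    """
    # Depth-dependent dynamic offset: small in shallow layers, larger in deep
    # ones, never exceeding the window size sw.
    delta = min(sw, depth)                      # assumed schedule
    x = torch.roll(x, shifts=(delta, delta), dims=(2, 3))
    x_vert, x_horiz = x.chunk(2, dim=1)         # half the heads/channels per branch
    return x_vert, x_horiz
```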
Vertical attention captures long-range dependencies in the vertical direction, helping the model understand up-down relationships such as those between sky and ground or trees and roads. It is calculated by dividing the input feature map $X \in \mathbb{R}^{H \times W \times C}$ into multiple non-overlapping vertical bands $X^1, X^2, \ldots, X^M$ along the vertical direction, where the width of each vertical band is $sw$ and the number of bands is $M = W/sw$. The index range for each vertical band is defined as follows:
Then we apply a linear projection to each vertical band $X^i$ to obtain the input matrices for self-attention, as shown in the following formula:

$$Q^i = X^i W^Q, \qquad K^i = X^i W^K, \qquad V^i = X^i W^V$$

where $Q^i$, $K^i$, and $V^i$ correspond to the queries, keys, and values, respectively, and $W^Q$, $W^K$, and $W^V$ are learnable weight matrices. Through Formula (11), the self-attention weights in the vertical direction are calculated as follows:

$$\text{Attention}(Q^i, K^i, V^i) = \text{Softmax}\!\left(\frac{Q^i (K^i)^{\mathsf{T}}}{\sqrt{d_k}}\right) V^i + \text{LePE}(V^i)$$

where $d_k$ is the dimension of the key vectors and $\text{LePE}(\cdot)$ is the local position encoding implemented by a depthwise separable convolution.
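The following sketch illustrates vertical-band attention with a convolutional position-encoding term in PyTorch. It is a simplified single-head version with assumed shapes (W must be divisible by the band width `sw`); a full module would use multiple heads, apply the position encoding to the values, and mirror the computation for horizontal bands.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VerticalBandAttention(nn.Module):
    """Single-head sketch: split the map into vertical bands of width sw,
    run self-attention inside each band, and add a convolutional LePE term."""
    def __init__(self, dim, sw=8):
        super().__init__()
        self.sw = sw
        self.to_qkv = nn.Linear(dim, 3 * dim, bias=False)           # W_Q, W_K, W_V stacked
        self.lepe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # depthwise position term

    def forward(self, x):                                   # x: (B, C, H, W), W % sw == 0
        B, C, H, W = x.shape
        # LePE computed here from the input for brevity; conceptually it
        # encodes the local positions of the values.
        lepe = self.lepe(x)
        # Partition into M = W // sw non-overlapping vertical bands of width sw,
        # then flatten each band into a token sequence of length H * sw.
        bands = x.reshape(B, C, H, W // self.sw, self.sw)
        bands = bands.permute(0, 3, 2, 4, 1).reshape(B * (W // self.sw), H * self.sw, C)
        q, k, v = self.to_qkv(bands).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / (C ** 0.5), dim=-1)
        out = attn @ v                                      # (B*M, H*sw, C)
        out = out.reshape(B, W // self.sw, H, self.sw, C).permute(0, 4, 2, 1, 3)
        return out.reshape(B, C, H, W) + lepe               # add local position encoding
```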
Horizontal attention primarily focuses on the semantic associations of images in the horizontal direction. It not only captures dependencies between left and right adjacent regions, helping the model understand the extension direction of roads and the arrangement order of vehicles in rural image datasets, but also enhances global context information. By computing horizontal attention, the model can better understand the distribution and interactions of objects in the horizontal direction, such as the relative positions of vehicles and pedestrians. Its calculation is the same as that of vertical attention, and the final output feature map $Y$ is obtained through the weighted fusion of vertical and horizontal attention:

$$Y = \alpha\, Y_v + (1 - \alpha)\, Y_h$$
where $Y_v$ and $Y_h$ are the outputs of vertical attention and horizontal attention, respectively, obtained by globally aggregating the local self-attention results $\text{Attention}(Q^i, K^i, V^i)$. $\alpha$ is a learnable fusion weight, initialized to 0.5 and optimized automatically by backpropagation, which balances the contributions of vertical and horizontal attention in the final output. In addition, the fusion in Formula (13) has an indirect effect on the diagonal direction: when the weights $\alpha$ and $(1 - \alpha)$ take different proportions, they form feature combinations in different directions in the feature space. For example, when $\alpha$ is 0.6 and $(1 - \alpha)$ is 0.4, the synthesized direction lies between the vertical and horizontal directions and approaches the diagonal, so the mechanism has a certain ability to capture diagonal information.
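A minimal sketch of this learnable weighted fusion, assuming the convex-combination form above ($\alpha$ is stored as a raw parameter and squashed into (0, 1) here, which is an implementation choice rather than something stated in the paper):

```python
import torch
import torch.nn as nn

class DirectionalFusion(nn.Module):
    """Fuse the vertical and horizontal attention outputs with a single
    learnable weight alpha, initialized so that both branches start at 0.5."""
    def __init__(self):
        super().__init__()
        self.raw_alpha = nn.Parameter(torch.zeros(1))   # sigmoid(0) = 0.5

    def forward(self, y_vert, y_horiz):
        alpha = torch.sigmoid(self.raw_alpha)           # keeps alpha in (0, 1)
        return alpha * y_vert + (1.0 - alpha) * y_horiz
```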
The fusion of vertical and horizontal attention enables comprehensive modeling of complex scenes, effectively addressing issues such as uneven object distribution, occlusion, and multi-scale feature fusion in rural road image datasets. This provides robust support for image generation and instance segmentation tasks, particularly in high-resolution image processing scenarios.