1. Introduction
Semantic segmentation [1,2] is a core task in computer vision, aimed at assigning a semantic label to every pixel in an image to achieve a detailed understanding of its content. This task is of significant importance in fields such as autonomous driving, medical image analysis [3], remote sensing image processing, and intelligent surveillance. For example, in the medical field, semantic segmentation is particularly advantageous in surgical procedures, as observed by Marullo et al. [4]. Traditional image segmentation methods [5], such as thresholding, edge detection, region growing, and watershed algorithms, rely on low-level features (e.g., color, grayscale, edges) and struggle to capture high-level semantic information. Moreover, these methods are sensitive to noise, changes in lighting, and complex backgrounds, leading to unstable results. Their adaptability is limited by the need for manually designed feature extraction and segmentation criteria, which is particularly challenging when dealing with complex textures and multiple objects. Advances in deep learning, especially the application of convolutional neural networks (CNNs) [2], have greatly enhanced the accuracy and robustness of semantic segmentation. The introduction of fully convolutional networks (FCNs) [2] represents a breakthrough in applying deep learning to semantic segmentation. FCNs replace the fully connected layers in traditional CNNs with convolutional layers, enabling pixel-level prediction for images of arbitrary size and supporting end-to-end training and prediction.
Building on the foundation of FCNs, numerous models have emerged with various improvements and extensions aimed at enhancing segmentation accuracy, reducing computational complexity, boosting feature representation [6] capabilities, and adapting to different application scenarios. The U-Net [7] architecture enhances feature representation through its distinctive symmetric shape and skip connections, making it particularly well-suited for handling small objects and fine details in images. RefineNet [8] employs a multi-path refinement strategy, effectively combining high-level semantic information with low-level detail through cross-layer connections, thereby improving segmentation accuracy and edge sharpness. DeepLabv3+ [9] features an encoder-decoder structure, where the encoder captures contextual information of the image and the decoder restores spatial details, particularly enhancing the accuracy of object boundaries. ERFNet [10] achieves a good balance between efficiency and accuracy through the use of residual connections and factorized convolutions. PSPNet [11] integrates context information at different scales using a pyramid pooling module, significantly enhancing the model’s ability to understand complex scenes and providing more precise scene interpretation. When comparing these models, mIoU measures segmentation accuracy, while FPS indicates processing speed; plotting the two metrics against each other reveals the trade-off between accuracy and efficiency and aids in selecting the most suitable model for a specific application scenario.
MFNet [12] employs a multi-branch structure, including attention branches, semantic branches, and spatial information branches, along with heterogeneous decomposition (AF) blocks to effectively fuse multi-level features, thereby improving both segmentation accuracy and real-time performance. LCNet [13] introduces a PCT (partial-channel transformation) strategy. The PCT block incorporates a TCA (three-branch context aggregation) module, expanding the receptive field of features and capturing multi-scale contextual information. MS-SRSS [14] proposes a multi-resolution learning mechanism, which enhances the feature extraction capabilities of the semantic segmentation network’s encoder. Despite significant advances in accuracy and efficiency achieved by deep learning-based semantic segmentation methods, several challenges remain. For instance, street scene images often encompass multiple object categories, such as pedestrians, various types of vehicles, and road markings, each with diverse sizes and shapes and often subject to mutual occlusion. Moreover, street scene datasets typically cover urban environments under a range of lighting and weather conditions. Variations in lighting and seasonal changes can lead to significant differences in the appearance of street scenes, necessitating models with robust generalization capabilities.
In response to the complex challenges associated with the semantic segmentation of street scene images, this paper presents an improved Fast-SCNN [15] architecture. The main contributions of this study are as follows:
The integration of SimAM enables the model to more effectively highlight critical features. Additionally, the enhanced DAB is incorporated into the high-resolution branch of the feature fusion module to enhance the model’s ability to capture image details. Simultaneously, the SE attention mechanism is integrated into the low-resolution branch of the feature fusion module, further refining the model’s focus on important information.
In the classifier, depthwise separable convolutions and additional convolutional layers are introduced to enhance the model’s capacity to process features. The RPP (refined pyramid pooling) module is extended by adding finer-grained levels, providing richer contextual information for the semantic segmentation task.
The proposed method achieves a mean Intersection over Union (mIoU) of 71.7% and 69.4% on the challenging Cityscapes and CamVid test datasets, respectively, with inference speeds reaching 81.4 fps and 113.6 fps.
The structure of this paper is as follows: Section 2 reviews related work; Section 3 presents the methodology; Section 4 presents the experimental results and comparative analysis; and Section 5 provides the conclusions.
3. Proposed Method
Fast-SCNN [15] is a lightweight and efficient semantic segmentation model designed specifically for real-time performance on high-resolution images. Unlike traditional dual-branch networks, Fast-SCNN shares the initial layers between branches, thereby reducing redundant computations. Building upon the foundation of Fast-SCNN, the proposed network introduces several enhancements: the SimAM [42] module improves feature representation; the global feature extraction module enhances contextual awareness; and the feature fusion module integrates SE attention with the enhanced DAB module. These improvements collectively contribute to generating the final output through the classifier. Figure 2 illustrates the architecture of the proposed network in detail. To provide a clearer depiction of the network structure, Table 1 lists the detailed composition of each module in the proposed network.
3.1. Efficient Down-Sample
The primary objective of the downsampling module is to reduce the spatial dimensions of the input data while retaining as much important feature information as possible. This module achieves this through a series of convolutional operations, including standard convolutions and depthwise separable convolutions. Specifically, the input image is first processed by a convolutional layer with a kernel size of 3 and a stride of 2, reducing the spatial dimensions of the output feature map to half the original size. Subsequently, it passes through two depthwise separable convolution layers, each with a stride of 2, further reducing the feature map dimensions sequentially. Ultimately, the spatial resolution of the output feature map is 1/8 of that of the input image. Each convolution operation is followed by batch normalization and ReLU activation functions to ensure the preservation of critical visual information during feature compression and to enhance the non-linearity of feature representation. By progressively downsampling the input image, the computational load of subsequent layers can be significantly reduced, thus improving the computational efficiency of the model. This is particularly important for processing high-resolution street scene images. The downsampling process effectively retains important visual features while filtering out insignificant details, thereby extracting more abstract and high-level features.
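The downsampling path described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' code: the channel widths (32, 48, 64) follow the Fast-SCNN defaults and are an assumption here, since the text does not restate them.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1,
    each followed by batch normalization and ReLU, as described in the text."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class LearningToDownsample(nn.Module):
    """Standard 3x3 conv (stride 2) + two depthwise separable convs (stride 2 each),
    reducing spatial resolution to 1/8 of the input."""
    def __init__(self, in_ch=3, chs=(32, 48, 64)):  # channel widths are assumed defaults
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, chs[0], 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(chs[0]),
            nn.ReLU(inplace=True),
        )
        self.ds1 = DSConv(chs[0], chs[1], stride=2)
        self.ds2 = DSConv(chs[1], chs[2], stride=2)
    def forward(self, x):
        return self.ds2(self.ds1(self.conv(x)))
```

For a 2048 × 1024 input this produces a 256 × 128 feature map, so every subsequent layer operates on 1/64 as many spatial positions.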
3.2. SimAM
Existing attention modules typically refine features only along the channel or spatial dimensions, limiting their flexibility in learning attention weights across both dimensions. Unlike traditional attention mechanisms such as CBAM, SimAM [42] does not introduce additional parameters to the network. Instead, it uses an energy optimization function to assess the importance of each neuron, considering both spatial and channel dimensions. This energy function, grounded in neuroscience theory, reflects the competitive and inhibitory effects between neurons. By solving the closed-form solution of this energy function, SimAM efficiently computes attention weights, which are then used to weight the feature map, highlighting important features and suppressing less significant ones. The mechanism of SimAM is detailed in Figure 3 to facilitate a clearer understanding of its operational principles.
The above process can be expressed using the following equations.

Equation (1) defines the basic form of the energy function, which aims to identify the linear separability between the target neuron and the other neurons in the same channel by minimizing this energy:

e_t(w_t, b_t, y, x_i) = (y_t − t̂)² + (1/(M − 1)) Σ_{i=1}^{M−1} (y_o − x̂_i)²,  (1)

where t represents the target neuron in the feature map, x_i represents the other neurons in the same channel, t̂ = w_t·t + b_t represents the linear transformation of the target neuron, x̂_i = w_t·x_i + b_t represents the linear transformation of the surrounding neurons, M refers to the total number of neurons in a channel, and y_t and y_o represent the label assigned to the target neuron and the label assigned to the surrounding neurons, respectively.

Equation (2) is an extension of Equation (1), adopting binary labels (y_t = 1, y_o = −1) and incorporating a regularization term λw_t² to prevent overfitting and simplify the model’s complexity:

e_t(w_t, b_t, y, x_i) = (1/(M − 1)) Σ_{i=1}^{M−1} (−1 − (w_t·x_i + b_t))² + (1 − (w_t·t + b_t))² + λw_t².  (2)

Equations (3) and (4) are the closed-form solutions to Equation (2), providing specific methods for computing w_t and b_t:

w_t = −2(t − μ_t) / ((t − μ_t)² + 2σ_t² + 2λ),  (3)

b_t = −(1/2)(t + μ_t)·w_t,  (4)

where μ_t is the mean of the surrounding neurons in the same channel, σ_t² is the variance of the surrounding neurons in the same channel, λ is a regularization term to avoid overfitting, and w_t represents the weight. The two equations adjust the weights and biases of each neuron based on the statistical properties (mean and variance) of the input features, thereby normalizing the features.

Equation (5) calculates the minimum energy value e_t* for each neuron based on the solutions w_t and b_t:

e_t* = 4(σ̂² + λ) / ((t − μ̂)² + 2σ̂² + 2λ),  (5)

where (t − μ̂) represents the difference between the target neuron and the mean of the surrounding neurons. This minimum energy value reflects the degree of separation between the neuron and other neurons in the feature space: the lower the energy, the more distinctive the neuron.

Equation (6) utilizes the minimum energy values obtained from Equation (5) to adjust the responses of each neuron in the feature map. This adjustment is achieved through a sigmoid function:

X̃ = sigmoid(1/E) ⊙ X,  (6)

where E is essentially a collection of the minimum energy values e_t* for all neurons and ⊙ denotes element-wise multiplication.
SimAM enhances feature representation by computing pixel-wise attention weights, allowing the network to focus on key feature regions and reducing background noise interference.
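Because the energy function has a closed-form solution, SimAM reduces to a handful of tensor operations. The sketch below follows the published reference implementation of SimAM; the default λ = 1e−4 is taken from the original SimAM paper and should be treated as an assumption here.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention: weights each activation by the inverse of its
    closed-form minimal energy e_t* (Equations (1)-(6)), passed through a sigmoid."""
    def __init__(self, lam=1e-4):  # λ default taken from the original paper (assumption)
        super().__init__()
        self.lam = lam
    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w - 1  # number of "other" neurons per channel
        # (t - μ)² per position, with μ the per-channel spatial mean
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # channel variance σ̂² estimated over the surrounding neurons
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # inverse minimal energy 1/e_t* (up to the constant offset 0.5)
        e_inv = d / (4 * (v + self.lam)) + 0.5
        return x * torch.sigmoid(e_inv)
```

Because the sigmoid output lies in (0, 1), the module rescales activations without adding any learnable parameters, which is consistent with the unchanged parameter count reported in the ablation study.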
3.3. Global Feature Extractor
The global feature extraction module is primarily responsible for extracting deep global features from the network. Specifically, this is achieved through the use of multiple levels of linear bottleneck modules and the final refined pyramid pooling (RPP) module. The linear bottleneck module operates as follows: the input features first undergo an expansion convolution, which increases the number of channels to t times the original, thereby enlarging the feature dimensions. This is followed by depthwise convolutions applied in the high-dimensional feature space to capture more detailed features while maintaining computational efficiency. Finally, a compression convolution reduces the number of feature channels back to the specified output channels. Figure 4 illustrates the refined pyramid pooling (RPP) module.
In semantic segmentation, capturing multi-scale information is crucial for improving accuracy. The refined pyramid pooling module extracts global and local features through pooling operations at five different scales, significantly improving the understanding of multi-scale context. These operations provide multi-level information, while the 1 × 1 pooling operation captures global context, aiding in the recognition of large objects and overall structures. This module up-samples the feature maps from five scales to the original size and concatenates them along the channel dimension to achieve effective multi-scale feature fusion. Despite the multi-scale pooling and convolution operations, the design of upsampling and convolution maintains computational efficiency while enhancing network performance. Multi-scale feature fusion enables the network to capture boundary information of objects in street scene images more accurately, thus improving boundary segmentation precision. This is particularly important in complex urban scenes, aiding in the clear differentiation of various object categories. By integrating features at different scales, the network maintains consistency in handling details and the overall structure, and can identify both large objects and small objects. This multi-scale feature extraction capability ensures efficient segmentation on street scene datasets and provides robustness and adaptability to targets with varying scales.
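A minimal PyTorch sketch of such a five-scale pooling module is given below. The bin sizes (1, 2, 3, 6, 9) and branch widths are illustrative assumptions: the text specifies five scales including global 1 × 1 pooling but does not state the remaining bin sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPP(nn.Module):
    """Refined pyramid pooling: pool at five scales, reduce each branch with a
    1x1 conv, upsample to the input size, concatenate, and fuse."""
    def __init__(self, in_ch, out_ch, bins=(1, 2, 3, 6, 9)):  # bin sizes assumed
        super().__init__()
        branch_ch = in_ch // len(bins)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                       # pool to b x b
                nn.Conv2d(in_ch, branch_ch, 1, bias=False),    # channel reduction
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            ) for b in bins
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + branch_ch * len(bins), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        size = x.shape[2:]
        feats = [x] + [F.interpolate(br(x), size=size, mode='bilinear',
                                     align_corners=False) for br in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```

The 1 × 1 bin carries the global context described above, while the finer bins contribute progressively more local multi-scale cues before the channel-wise concatenation.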
3.4. Feature Fusion Module
Low-resolution features, typically derived from the deeper layers of the network, provide rich semantic information [43] but lack spatial detail. Conversely, high-resolution features from the shallower layers contain more spatial details but less semantic information. To effectively fuse these two types of features, we first apply 1 × 1 convolution and batch normalization to the high-resolution feature maps and enhance their expressiveness with the refined DAB module. Low-resolution features are upsampled to match the size of the high-resolution features using bilinear interpolation and then refined with depthwise convolution to preserve key information while reducing the number of parameters. Subsequently, 1 × 1 convolution is used to adjust the channel dimensions, combined with batch normalization and SE attention [44] mechanisms to improve feature representation and importance adjustment. Finally, the high-resolution and low-resolution feature maps are integrated effectively through element-wise fusion. This module integrates the deep semantic information with shallow spatial details, which is crucial for handling complex street scene scenarios, particularly for multi-scale objects and detailed boundaries. Figure 5 illustrates a comparison of the implementation details between the ERFNet non-bottleneck-1D module, the DAB module, and the proposed enhanced DAB module in this paper.
Compared to the original DAB module, enhanced DAB captures multi-scale features through various convolution operations, including standard convolutions, depthwise convolutions, and dilated convolutions. These convolution operations extract information from different receptive fields, helping the model understand different contextual information and details in the image. Depthwise convolutions reduce computational complexity while maintaining the spatial resolution of feature maps, whereas dilated convolutions increase the receptive field, aiding in capturing a broader range of contextual information. The additional 3 × 3 and 1 × 1 convolution layers further enhance the feature representation capabilities. These layers, applied in the final stage of the module, help extract higher-level feature information.
The SE (squeeze-and-excitation) attention is illustrated in Figure 6. The attention enhances feature representation by introducing channel attention mechanisms. First, the input feature map undergoes adaptive average pooling to compress the spatial dimensions to 1 × 1, thereby capturing global channel information. Subsequently, this information is processed through two fully connected layers with an intermediate ReLU activation function, and the channel attention weights are output through a sigmoid function. Finally, these weights are multiplied channel-wise with the original input feature map, recalibrating the importance of each channel and producing a recalibrated feature map. Integrating SE attention into the low-resolution branch of feature fusion can enhance important features and improve feature representation. During the feature fusion process, after emphasizing significant features in the low-resolution branch, these features can be better preserved and utilized when fused with the high-resolution branch, thereby enhancing the fusion effectiveness.
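The SE branch described above can be written compactly in PyTorch. The reduction ratio of 16 is the common default from the original SE paper and is assumed here, as the text does not specify it.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global average pool -> FC -> ReLU -> FC -> sigmoid,
    then channel-wise reweighting of the input feature map."""
    def __init__(self, channels, reduction=16):  # reduction ratio assumed
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global channel descriptor (b, c)
        return x * w.view(b, c, 1, 1)     # excite: recalibrate each channel
```

Placed on the low-resolution branch, the block reweights channels before the element-wise fusion with the high-resolution branch.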
3.5. Enhanced Classifier
The classifier module maps high-dimensional feature maps to the class space for pixel-level classification. It optimizes computational efficiency and model performance by combining depthwise separable convolutions with depthwise convolutions. Specifically, the input feature maps first pass through three depthwise separable convolutional layers to reduce computational load while maintaining effective feature extraction. Subsequently, the feature maps are processed through a depthwise convolutional layer, and finally refined through a module containing 1 × 1 convolutions, batch normalization, ReLU activation, and dropout layers to complete the class mapping. Depthwise separable convolutions significantly reduce computation and parameter count by decomposing standard convolutions into depthwise and pointwise convolutions, which is particularly effective for handling high-resolution street scene images. Additionally, the combination of batch normalization and ReLU enhances training speed and stability. Batch normalization reduces internal covariate shift and accelerates the training process, while ReLU activation enhances the model’s representational capacity, allowing it to capture complex features more effectively.
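A minimal sketch of this classifier head is shown below. The channel width, dropout rate, and the exact composition of the final 1 × 1 module are assumptions beyond what the text specifies.

```python
import torch
import torch.nn as nn

def ds_conv(in_ch, out_ch):
    """Depthwise separable convolution: depthwise 3x3 + pointwise 1x1, with BN + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class EnhancedClassifier(nn.Module):
    """Three depthwise separable convs, one depthwise conv, then a 1x1-conv module
    with BN, ReLU, and dropout that maps features to per-pixel class scores."""
    def __init__(self, in_ch, num_classes, p=0.1):  # dropout rate assumed
        super().__init__()
        self.features = nn.Sequential(
            ds_conv(in_ch, in_ch), ds_conv(in_ch, in_ch), ds_conv(in_ch, in_ch),
            # plain depthwise convolutional layer
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        )
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 1, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Dropout(p),
            nn.Conv2d(in_ch, num_classes, 1),  # per-pixel class logits
        )
    def forward(self, x):
        return self.head(self.features(x))
```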
4. Experiments and Results
4.1. Data Sets and Evaluation Metrics
The Cityscapes dataset [45] is widely used for autonomous driving and urban scene understanding. It contains 5000 high-resolution images (2048 × 1024 pixels) from 50 cities. The dataset is divided into 2975 training images, 500 validation images, and 1525 test images, with precise pixel-level annotations for 30 classes, 19 of which are used for evaluation. The annotations cover common urban elements such as roads, pedestrians, and vehicles.
The CamVid [45] dataset is a high-quality resource widely used for semantic segmentation and autonomous driving scene understanding. It comprises 701 video frames, partitioned into 367 for training, 101 for validation, and 233 for testing. With a resolution of 960 × 720 pixels, the dataset ensures clear visual detail. Covering a range of weather and lighting conditions, it provides pixel-level annotations for 11 key categories, including road, building, vehicle, pedestrian, and tree.
The mean Intersection over Union (mIoU) is a commonly used evaluation metric in semantic segmentation tasks, assessing model performance across different categories. The calculation formula is as follows:

mIoU = (1/k) Σ_{i=1}^{k} p_ii / (Σ_{j=1}^{k} p_ij + Σ_{j=1}^{k} p_ji − p_ii),

where k is the number of classes, p_ii denotes the number of pixels of class i correctly identified as class i, and p_ij represents the number of pixels of class i that have been misclassified as class j.
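As a runnable illustration of this metric (a straightforward confusion-matrix implementation, not the authors' evaluation code):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Compute mIoU from flattened label arrays via a confusion matrix.

    IoU_i = p_ii / (sum_j p_ij + sum_j p_ji - p_ii). Classes absent from both
    prediction and ground truth are excluded from the average."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(target.ravel(), pred.ravel()):
        cm[t, p] += 1                      # rows: ground truth, columns: prediction
    tp = np.diag(cm)                       # p_ii
    union = cm.sum(axis=1) + cm.sum(axis=0) - tp
    valid = union > 0                      # ignore classes never seen
    return (tp[valid] / union[valid]).mean()
```

For example, with ground truth `[0, 1, 1, 1]` and prediction `[0, 0, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, giving an mIoU of 7/12.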
In subsequent experiments, the network’s performance was comprehensively evaluated from multiple perspectives by assessing metrics such as the number of model parameters, frames per second (FPS), and floating-point operations (GFLOPs).
4.2. Implementation Details
This study was implemented using Python 3.8 and the PyTorch 2.3.1 deep learning framework. Experiments were conducted on a system equipped with CUDA 12.1 and CuDNN v8 to ensure efficient computation on an NVIDIA GeForce RTX 4060 GPU. The model was evaluated on two benchmark datasets: Cityscapes and CamVid. For the Cityscapes dataset, a training batch size of 4 was used; for the CamVid dataset, a batch size of 8 was employed. Data augmentation strategies included random cropping, scaling, horizontal flipping, and additional cropping to enhance model robustness. During training, the cross-entropy loss function was used as the loss metric, and the Adam optimizer was applied to optimize the network parameters. The momentum value was set to 0.9, the weight decay rate was set to 0.0005, and the initial learning rate was set to 0.001. To improve training efficiency and stability, the network parameters were initialized using the Kaiming initialization strategy.
The learning rate adjustment strategy followed a Poly schedule, which gradually reduces the learning rate during training to fine-tune model parameters in the later stages. The specific formula is

lr = lr_0 × (1 − iter / max_iter)^power,

where lr_0 is the initial learning rate, iter and max_iter denote the current and total numbers of training iterations, and power is the decay exponent, set to 0.9.
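In code, the schedule is a one-liner (using lr_0 = 0.001 and power = 0.9 as stated above):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly learning-rate schedule: lr = base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```

The rate starts at `base_lr`, decays smoothly, and reaches zero at `max_iter`; with power < 1 the decay is slower early in training and faster near the end.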
4.3. Ablation Study
To systematically evaluate the contributions of the optimization modules and related improvements to the performance of the semantic segmentation network, we conducted ablation experiments using the Cityscapes dataset. The purpose of these experiments is to quantify the impact of each component and validate its effectiveness. The experiments illustrate the trade-offs between parameter count, processing speed, and mIoU (mean Intersection over Union) across different configurations. The baseline model has a parameter count of 1.18 M, a processing speed of 123.5 FPS, and an mIoU of 68%. The introduction of additional components resulted in a series of performance changes in the model.
Table 2 presents a comparative analysis of the impact of various components on model performance.
SimAM module: The introduction of the SimAM attention mechanism enhances the feature representation capability within the semantic segmentation network. By leveraging self-attention and adaptive weighting, it improves the distinctiveness and accuracy of the features. This mechanism dynamically adjusts the weights of the feature maps, allowing the network to better capture and utilize critical feature information. Consequently, the mIoU increases to 68.6%, though the processing speed decreases to 117.4 FPS, and the GFLOPs increase to 18.7 G, while the parameter count remains unchanged, indicating a notable performance improvement.
RPP module: The RPP module enhances the performance of the network in segmenting distant small objects and complex backgrounds by improving the pyramid pooling module in the global feature extraction. Specifically, the refined pyramid pooling module increases the number of pooling scales, which enables more effective extraction and fusion of multi-scale features. As a result, the mIoU improves by 0.4% compared to the baseline model, while the processing speed decreases to 119 FPS, and the parameter count increases to 1.55 M. The GFLOPs increase to 21.8 G.
FFM module: Incorporating DABBlock in the high-resolution branch and SE attention in the low-resolution branch of the feature fusion module (FFM) improved the model’s mIoU from 68% to 68.8% and 68.5%, respectively, demonstrating effective adaptation to different scales of objects. When DABBlock and SE attention are used together, the mIoU further increases to 69.1%, highlighting the significant advantage of the FFM module in enhancing the segmentation of small objects and edge regions. DABBlock combines standard and dilated convolutions to achieve effective multi-scale feature extraction and integration, improving the network’s ability to recognize objects of various sizes while maintaining computational efficiency. SE attention generates weights through global pooling, enhancing features, suppressing irrelevant information, and optimizing the low-resolution branch. The GFLOPs for FFM with DABBlock increase to 23.9 G, and with SE attention, this increases to 25.2 G.
After integrating SimAM, RPP, FFM, and the enhanced classifier, the proposed algorithm significantly improves the overall segmentation performance in complex urban scenes, achieving an mIoU of 71.7% with GFLOPs of 58.5 G.
4.4. Comparison with the Existing Model
In this section, we compare the proposed algorithm with several state-of-the-art segmentation models and present a series of performance metrics derived from the Cityscapes and CamVid datasets.
Figure 7 illustrates the relationship between model inference speed and segmentation accuracy for various segmentation networks on the Cityscapes dataset.
4.4.1. Results on Cityscapes Dataset
Table 3 presents a comparison of the performance metrics for different algorithms on the Cityscapes dataset. OCNet, with 62.6 M parameters and a computational cost of 549 G, achieves an mIoU of 80.1%, indicating that its high accuracy comes at the expense of a substantial computational overhead. In contrast, the ENet and ESPNet models have minimal parameters (0.36 M) and lower computational costs, but their mIoU values are 58.3% and 60.3%, respectively, highlighting that efficient models designed for resource-constrained environments may sacrifice accuracy. BiSeNet1, with 5.8 M parameters and 14.8 G of computational cost, operates at 101.6 FPS with an mIoU of 68.4%. On the other hand, BiSeNet2, which has 49 M parameters and 55.3 G of computational cost, achieves 60.4 FPS and 74.7% mIoU. This demonstrates that a moderate increase in model complexity can enhance accuracy without significantly compromising speed. The network proposed in this study features 2.59 M parameters and a computational cost of 58.5 G, which is significantly lower than that of OCNet and RefineNet, making it more suitable for real-time applications. Additionally, this network maintains a high segmentation accuracy with an mIoU of 71.7% while achieving a processing speed of 81.4 FPS, ensuring both fast response and high segmentation performance.
As illustrated in Figure 8, the proposed algorithm demonstrates relatively better performance in detail capture and boundary handling. The proposed method retains more details and achieves more accurate segmentation of object boundaries when handling complex scenes. For instance, as shown in the second and third rows of Figure 8, the proposed algorithm provides clearer and more natural contours for dense crowds compared to Fast-SCNN and DABNet. The final row of Figure 8 highlights that the proposed method segments large objects, such as buses, with clearer boundaries and fewer segmentation misalignments. The third row of Figure 8 indicates that the proposed algorithm achieves higher precision in segmenting slender objects, such as traffic signs, demonstrating its excellence in capturing image details. In contrast, the first and fourth rows of Figure 8 reveal that both Fast-SCNN and DABNet exhibit some degree of segmentation inaccuracies when segmenting bicycles. Fast-SCNN shows errors in distinguishing between people riding bicycles and the bicycles themselves, while DABNet’s weaker detail capture results in a loss of fine image details. Additionally, as seen in the third row of Figure 8, Fast-SCNN and DABNet perform less effectively than the proposed algorithm in segmenting slender objects like traffic signs.
Based on Table 4, our method achieves a mean Intersection over Union (mIoU) of 71.7% on the Cityscapes dataset, outperforming the other algorithms. Table 4 also provides IoU evaluations for different algorithms across various categories, offering a detailed reflection of model performance. Among the 19 categories, our method shows superior performance compared to competing algorithms in most categories, particularly demonstrating significant advantages in “Road” (97.8%), “Building” (91.6%), and “Person” (82.0%).
4.4.2. Results on CamVid Dataset
Table 5 compares the performance metrics of different algorithms on the CamVid dataset. Our proposed method achieves 113.6 FPS with only 2.59 M parameters while maintaining an mIoU of 69.4%, demonstrating its ability to effectively balance model complexity, computational efficiency, and segmentation accuracy in practical applications. In contrast, although PSPNet achieves a comparable mIoU of 69.1%, it requires a significantly larger number of parameters (250.8 M) and offers only 5.4 FPS, making it unsuitable for real-time applications. ENet and ESPNet represent another extreme, with very small parameter sizes (0.36 M each) and FPS rates of 96.4 and 190.3, respectively, but their segmentation accuracy is relatively low, with mIoUs of 51.3% and 58.3%. DABNet, with an extremely small parameter size of 0.76 M, achieves 162 FPS and an mIoU of 65.7%, demonstrating a good balance between efficiency and accuracy. In summary, our proposed method achieves an optimal trade-off between high accuracy, low computational overhead, and real-time performance.
Figure 9 presents the segmentation results of various algorithms on the CamVid test set, highlighting the advantages of the proposed method in street scene analysis. Firstly, the proposed algorithm achieves precise object boundary segmentation, effectively avoiding the edge blurring issues seen in Fast-SCNN and DABNet, and demonstrates superior detail retention. For instance, in the third row of Figure 9, the proposed method shows greater accuracy in segmenting bicycles compared to Fast-SCNN. Secondly, the proposed algorithm excels in segmenting small objects. The CamVid dataset contains many small objects, such as bicycles and traffic signs, which are often missed or misclassified by other algorithms. In contrast, the proposed method utilizes refined feature extraction and attention mechanisms to accurately identify and segment these small targets. As seen in the first two rows of Figure 9, the proposed method demonstrates improved accuracy in segmenting poles. Moreover, despite the low illumination in the first input image of Figure 9, the proposed method maintains consistent performance across different categories, indicating its robustness under complex lighting conditions. The algorithm maintains stable segmentation results across various lighting conditions through deep feature learning and effective normalization. For occluded objects, which typically lead to segmentation errors, the proposed method effectively addresses this issue through contextual information integration and enhanced feature representation. Overall, the proposed algorithm exhibits outstanding performance in segmentation stability and accuracy, demonstrating significant practical value.
Table 6 presents the segmentation accuracy for each class achieved by different networks on the CamVid dataset.
5. Conclusions
In this study, we made several key improvements to the Fast-SCNN model to enhance its semantic segmentation performance on street scene images. First, we introduced the SimAM module to boost the network’s sensitivity to critical spatial features. This module, applied after the downsampling phase, significantly improves the expressiveness of feature maps. Second, we extended the pyramid pooling module (PPM) by incorporating more fine-grained layers and designed the enhanced DAB module, which is integrated into the high-resolution branch of the feature fusion module. This provides richer contextual information and improves the accuracy of small object recognition. Additionally, the low-resolution branch includes SE attention, which significantly enhances the fusion of low-level and high-level features. The classifier module was improved by adding depthwise separable convolutions and depthwise convolutions, which deepens the network structure and better captures and classifies multi-scale features. These enhancements increase the network’s ability to handle image details and improve segmentation accuracy and robustness. The experimental results show that our network achieves significantly higher segmentation accuracy in complex scenarios, such as urban street scenes, compared to the original model. On the Cityscapes and CamVid datasets, our model achieved 71.7% and 69.4% mIoU, respectively, while maintaining inference speeds of 81.4 fps and 113.6 fps.
Future work will focus on further enhancing the efficiency and performance of the model. This includes exploring more lightweight network architectures to improve the inference speed while maintaining high accuracy, thereby making the model more suitable for deployment on resource-constrained embedded devices. Additionally, efforts will be directed towards investigating methods to enhance the robustness and generalization capabilities of the model across various challenging scenarios, such as extreme weather conditions and nighttime environments.