Article

AODs-CLYOLO: An Object Detection Method Integrating Fog Removal and Detection in Haze Environments

by Xinyu Liang, Zhengyou Liang *, Linke Li and Jiahong Chen
School of Computer and Electronic Information, Guangxi University, Nanning 530004, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 7357; https://doi.org/10.3390/app14167357
Submission received: 23 July 2024 / Revised: 15 August 2024 / Accepted: 19 August 2024 / Published: 20 August 2024
(This article belongs to the Section Ecology Science and Engineering)

Abstract:
Foggy and hazy weather conditions can significantly reduce the clarity of images captured by cameras, making it difficult for object detection algorithms to accurately recognize targets. This degradation can cause failures in autonomous or assisted driving systems, posing severe safety threats to both drivers and passengers. To address the issue of decreased detection accuracy in foggy weather, we propose an object detection algorithm specifically designed for such environments, named AODs-CLYOLO. To effectively handle images affected by fog, we introduce an image dehazing model, AODs, which is more suitable for detection tasks. This model incorporates a Channel–Pixel (CP) attention mechanism and a new Contrastive Regularization (CR), enhancing the dehazing effect while preserving the integrity of image information. For the detection network component, we propose a learnable Cross-Stage Partial Connection Module (CSPCM++), which is used before the detection head. Alongside this, we integrate the LSKNet selective attention mechanism to improve the extraction of effective features from large objects. Additionally, we apply the FocalGIoU loss function to enhance the model’s performance in scenarios characterized by sample imbalance or a high proportion of difficult samples. Experimental results demonstrate that the AODs-CLYOLO detection algorithm achieves up to a 10.1% improvement in the mAP (0.5:0.95) metric compared to the baseline model YOLOv5s.

1. Introduction

The detection of objects in foggy conditions plays a crucial role in the field of autonomous driving [1]. However, the complex and unpredictable traffic scenarios demand high stability and accuracy from autonomous detection systems. Issues such as image blurring, reduced contrast, and color distortion caused by fog significantly impair the recognition tasks of detection systems, thereby increasing the risk of traffic accidents. Therefore, research into foggy weather object detection holds significant practical value and societal importance.
In recent years, researchers have proposed various methods to address object detection under adverse weather conditions. Some studies apply object detection algorithms directly to foggy images, using, for example, the YOLO series [2,3,4,5] and the R-CNN series [6]. Although a detector can be trained or fine-tuned directly on foggy data, the low contrast, blurred details, and color distortion of foggy images make feature extraction challenging, so the model may struggle to learn target features effectively. This feature discrepancy degrades detection performance, and because existing detectors are typically trained on clear images, they adapt poorly to foggy environments. Training a model directly on foggy data may therefore be less effective than dehazing the images first and then performing detection. One strategy for preprocessing images before detection is to apply a defogging algorithm [7,8], which restores contrast and clarity, whether through physical model-based methods (such as the dark channel prior) or deep learning methods, and thereby improves detection accuracy. This paper proposes the AODs-CLYOLO model based on this idea of defogging before detection.
AODs-CLYOLO combines the dehazing model AODs with the object detection model CLYOLO. Firstly, AODNet [7] is enhanced for better compatibility with object detection, resulting in AODs; AODs integrates a new parameter estimation module and dehazing module and introduces a novel Contrastive Regularization term in its loss function. Secondly, we develop CLYOLO based on the YOLOv5 framework, incorporating the learnable Cross-Stage Partial Connection Module (CSPCM++) and the LSKNet selective attention mechanism in its detection head. Finally, AODs and CLYOLO are merged to create AODs-CLYOLO, a dehazed object detection model (Figure 1).
The main contributions of this paper are summarized as follows:
  • Introduced a new dehazing model, AODs, which incorporates a deeper multiscale feature fusion structure and a novel loss function with Contrastive Regularization. This reduces the loss of image details during dehazing operations, making the output images more suitable for detection tasks.
  • Designed a novel Learnable Cross-Stage Partial Connection Module (CSPCM++), utilizing an improved learnable ConvMixer module and enhanced gradient combinations. This significantly improves metrics with minimal increase in model parameters.
  • Proposed a new object detection network, CLYOLO, which takes AODs-processed images as input. CLYOLO integrates the LSKNet selective attention mechanism, depth-wise separable convolutions, and the CSPCM++ detection head.
  • Constructed a synthetic foggy weather dataset using clear images from the commonly used VOC dataset [9], with foggy counterparts generated by randomly adding fog according to the atmospheric scattering model.

2. Related Work

In this section, we present an overview of existing dehazing models, object detection models, and research on the combination of dehazing and object detection in three parts and discuss the shortcomings of related work on object detection in inclement weather.
A. Dehazing
Traditional dehazing algorithms primarily rely on various types of prior knowledge [8,10,11,12] to solve for the parameters needed to restore haze-free images. Among these, the most representative is the Dark Channel Prior (DCP) method proposed by He et al. [8], which assumes that in non-sky regions, there exist pixels with very low values (approaching zero) in at least one color channel. With the advancement of deep learning techniques, research combining Convolutional Neural Networks (CNNs) [7,13,14] and Generative Adversarial Networks (GANs) [15] into dehazing algorithms has grown. AODNet [7], for instance, utilizes a CNN model to estimate atmospheric light and scattering coefficients from hazy images for dehazing. Additionally, methods like MSCNN [16] use CNNs to learn image priors to enhance dehazing effects. Models such as EPDN [15], which combine a CNN’s feature extraction capabilities with a GAN’s generative power, can generate more realistic and clearer haze-free images. Compared to the large model sizes of the latest deep learning dehazing algorithms, AODNet offers advantages in computational efficiency and processing speed, making it the chosen baseline model for dehazing in this paper.
B. Object Detection
In recent years, Convolutional Neural Networks (CNNs) have been widely applied to object detection tasks, with the evolution of detection frameworks primarily distinguishing between single-stage and two-stage detectors. Common two-stage detectors include the Faster R-CNN [6] series algorithms, which generate candidate boxes and then classify and refine their positions. Two-stage detectors typically achieve better accuracy but may compromise on speed. On the other hand, single-stage algorithms such as YOLO [2,3,4,5] (You Only Look Once) and SSD [17,18] (Single-Shot Multibox Detector) directly detect and locate objects in images using a single neural network model. These models are known for their speed and simple design but may face challenges in precise localization and detecting small objects. Notably, Facebook’s DETR [19] integrated Transformers into the object detection framework, eliminating the need for post-processing like NMS and outperforming Faster R-CNN in metrics, albeit showing poorer performance in detecting small objects. TPH-YOLOv5 [20] introduced Transformers into the detection head and applied it in drone capture scenarios. Subsequently, Wang et al. [21] optimized this detection head. Considering the application scenarios of foggy weather detection tasks, this paper adopts YOLOv5s due to its significant advantages in real-time processing, end-to-end training, and stability.
C. Integration of Dehazing and Object Detection
In addition to independently optimizing dehazing and detection algorithms, the technique of jointly optimizing dehazing with detection has gained popularity in recent years. AODNet [7] first proposed embedding CNN-based dehazing algorithms into object detection algorithms for joint training. However, the use of two-stage detection algorithms was time-consuming and impractical for real-world applications. IA-YOLO [22], GDIP-YOLO [23], and YOLO-GW [24] algorithms incorporate dark channel dehazing algorithms before object detection. IA-YOLO introduced adaptive image enhancement techniques for dehazing hazy images, but the dark channel dehazing algorithm may distort images in practice, which is detrimental to detection performance. Ding et al. [25] developed an unsupervised training strategy with a unique activation function to address issues such as snowy and blurry conditions that hinder detection.

3. Method

In this section, we introduce our method, AODs-CLYOLO, in detail; its structure is shown in Figure 2. The AODs dehazing model processes foggy images and feeds the de-fogged images to the detection model CLYOLO, which then predicts and outputs the categories and locations of all objects of interest. Section 3.1 introduces the AODs dehazing model and Section 3.2 introduces the CLYOLO detection model.

3.1. AODs: Advanced AODNet Series for Dehazing

In foggy conditions, haze limits how well targets can be detected. AODNet [7] is effective at removing haze but sometimes makes images too dark or excessively high in contrast, especially across varying image scales. To improve this, we developed AODs, a dehazing algorithm tailored for object detection. AODs processes foggy images before they are analyzed by the detection network. Inspired by AODNet, AODs is designed as an end-to-end model with two main parts: the K-Estimation Module and the Clean Image Generation Module.
The K-Estimation Module calculates the parameter K(x), which represents the haze transmission map necessary for restoring clear images. This parameter K(x) is crucial as it helps in estimating the thickness of the haze and compensates for the loss of image clarity. By accurately determining K(x), the module can adjust the image’s contrast and brightness effectively. The Clean Image Generation Module uses the parameter K(x) to produce fog-free images by enhancing the visual quality and clarity. Figure 3 shows the structure of AODs.
Dehazing data preprocessing is a common step before object detection. Let the input be the foggy image I(x), and the output be the processed, fog-free image J(x) (I(x) and J(x) are images, i.e., matrices of dimensions H × W × 3). According to AODNet, the fog-free image can be obtained using Equation (1):
$J(x) = K(x)\,I(x) - K(x) + b$  (1)
Here, b is a constant bias with a default value of 1. K(x) is the transmission map, which is a grayscale image with dimensions H × W × 1, defined as shown in Equation (2):
$K(x) = \dfrac{\frac{1}{t(x)}\left(I(x) - A\right) + (A - b)}{I(x) - 1}$  (2)
Here, t(x) is the medium transmission rate, with dimensions H × W × 1, where H and W are the height and width of the image, respectively; A represents the global atmospheric light, which is typically a scalar value. In Equation (2), $\frac{1}{t(x)}$ denotes the element-wise reciprocal of t(x), and the division by $I(x) - 1$ is performed element-wise.
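For illustration, the following minimal PyTorch sketch applies Equations (1) and (2) element-wise to recover a clean image from a hazy one; the transmission map, the atmospheric light value, and the tensor shapes are placeholder assumptions, and in AODs the unified parameter K(x) is estimated directly by the K-Estimation Module rather than computed from t(x) and A:

```python
import torch

def recover_clean_image(I, t, A=0.9, b=1.0, eps=1e-6):
    """I: hazy image (N, 3, H, W) in [0, 1]; t: transmission map (N, 1, H, W)."""
    # Equation (2): K(x) = [ (I(x) - A) / t(x) + (A - b) ] / (I(x) - 1), element-wise
    K = ((I - A) / (t + eps) + (A - b)) / (I - 1.0 - eps)
    # Equation (1): J(x) = K(x) * I(x) - K(x) + b
    J = K * I - K + b
    return J.clamp(0.0, 1.0)

hazy = torch.rand(1, 3, 480, 640)                 # placeholder hazy image
trans = torch.rand(1, 1, 480, 640) * 0.8 + 0.2    # placeholder transmission map
print(recover_clean_image(hazy, trans).shape)     # torch.Size([1, 3, 480, 640])
```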
We have redesigned the deep network module of the parameter estimation module, as shown in Figure 4. In this module, we introduce a Channel–Pixel (CP) attention mechanism [14] and a multi-scale feature fusion module. The deeper cross-fusion enhances the network’s ability to focus on different channel features and important pixels, thereby improving the model’s feature representation capability, which is beneficial for detection tasks. Additionally, AODs employs a composite loss function that includes an image reconstruction loss function (L1) and Contrastive Regularization (CR) based on contrastive learning [26]. This enables the model to effectively capture the correlations of multi-scale features. The experimental details of the dehazing model are presented in Appendix A.
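A hedged sketch of such a composite loss is given below: an L1 reconstruction term plus a Contrastive Regularization term that, in the feature space of a frozen VGG-16, pulls the dehazed output toward the clear image (positive) and away from the hazy input (negative). The choice of VGG layers and the weight lambda_cr are illustrative assumptions rather than the exact configuration used in AODs:

```python
import torch
import torch.nn as nn
import torchvision

class ContrastiveRegularization(nn.Module):
    """CR term: pull the dehazed output toward the clear image and away from the hazy input."""
    def __init__(self, layers=(3, 8, 15), eps=1e-7):
        super().__init__()
        # frozen, ImageNet-pretrained VGG-16 feature extractor (downloads weights on first use)
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.layers, self.eps = vgg, set(layers), eps
        self.l1 = nn.L1Loss()

    def _feats(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layers:
                feats.append(x)
            if i >= max(self.layers):
                break
        return feats

    def forward(self, dehazed, clear, hazy):
        cr = 0.0
        for f_anchor, f_pos, f_neg in zip(self._feats(dehazed), self._feats(clear), self._feats(hazy)):
            cr = cr + self.l1(f_anchor, f_pos) / (self.l1(f_anchor, f_neg) + self.eps)
        return cr

def dehazing_loss(dehazed, clear, hazy, cr_module, lambda_cr=0.1):
    # composite loss: L1 reconstruction + weighted contrastive regularization
    return nn.functional.l1_loss(dehazed, clear) + lambda_cr * cr_module(dehazed, clear, hazy)
```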

3.2. CLYOLO: YOLOv5 Enhanced with LSKNet Attention Mechanism and CSPCM Module

YOLOv5 consists of three parts: the backbone, the neck, and the head. We have made the following improvements to the YOLOv5s model.
  • Backbone: added the LSKNet attention mechanism and the CSPCM++ module before the SPP layer to enhance the focus on important features.
  • Neck: introduced the CSPCM++ module before the detection heads of different sizes and added the LSKNet attention mechanism before the layer that detects large objects. Additionally, we replaced standard convolutions with depth-wise separable convolutions to reduce the model’s parameter count (a sketch follows this list).
  • Loss function: utilized the FocalGIoU loss function, which incorporates Focal Loss, to improve the model’s performance in scenarios with imbalanced data.
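The following minimal sketch illustrates the depth-wise separable convolution used in the neck: a per-channel 3 × 3 depth-wise convolution followed by a 1 × 1 point-wise convolution. The channel sizes and the SiLU activation are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DWSeparableConv(nn.Module):
    """Depth-wise 3x3 convolution followed by a 1x1 point-wise convolution."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, s, k // 2, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Parameter count versus a standard 3x3 convolution (illustrative channel sizes)
std = nn.Conv2d(256, 256, 3, padding=1, bias=False)
dws = DWSeparableConv(256, 256)
print(sum(p.numel() for p in std.parameters()), sum(p.numel() for p in dws.parameters()))
```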

3.2.1. CSPCM++

To enhance network generalization and improve performance while reducing computational costs, we propose a Learnable Cross-Stage Partial Connection Module (CSPCM++) based on ConvMixer++. In our method, the CSPCM++ module is added before the SPP layer in the backbone network and before the three prediction heads, to achieve cross-stage local connections and more efficient feature extraction, thereby improving the performance and effectiveness of object detection. As illustrated in Figure 5a, CSPCM++ divides the input into two parts along the channel dimension: one part undergoes processing through a convolution layer and a ConvMixer++ module (with the number of ConvMixer++ modules controlled by the parameter n), while the other part is processed only through a convolution layer. The outputs of these two pathways are then concatenated and passed through a third convolution layer before being output.
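A hedged PyTorch sketch of this layout is given below; the ConvMixer++ block it wraps is sketched after Equations (3) and (4), and the 1 × 1 kernel sizes and equal channel split are assumptions rather than the exact configuration in Figure 5a:

```python
import torch
import torch.nn as nn

class CSPCMpp(nn.Module):
    """Cross-stage partial layout: one branch runs n mixer blocks, the other is a shortcut."""
    def __init__(self, c_in, c_out, n=1, block=None):
        super().__init__()
        self.c1, self.c2 = c_in // 2, c_in - c_in // 2
        self.conv1 = nn.Conv2d(self.c1, self.c1, 1, bias=False)    # branch with mixer blocks
        self.conv2 = nn.Conv2d(self.c2, self.c1, 1, bias=False)    # shortcut branch
        blocks = [block(self.c1) if block is not None else nn.Identity() for _ in range(n)]
        self.mixers = nn.Sequential(*blocks)
        self.conv3 = nn.Conv2d(2 * self.c1, c_out, 1, bias=False)  # fuse the two pathways

    def forward(self, x):
        x1, x2 = x.split([self.c1, self.c2], dim=1)   # split along the channel dimension
        y1 = self.mixers(self.conv1(x1))
        y2 = self.conv2(x2)
        return self.conv3(torch.cat([y1, y2], dim=1))

# Usage with a placeholder mixer block (n controls the number of stacked blocks)
m = CSPCMpp(256, 256, n=2)
print(m(torch.rand(1, 256, 40, 40)).shape)   # torch.Size([1, 256, 40, 40])
```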
ConvMixer++ modules, as part of the CSPCM++ module, play a crucial role in feature extraction and input convolution operations. Inspired by ConvMixer [27], we designed ConvMixer++ as a purely convolutional architecture. As shown in Figure 5b, ConvMixer++ combines convolutional modules with residual connections. Each convolution is followed by a Leaky GELU [28] activation function and normalization. Assuming the input feature map is $z_{l-1}$, it first passes through the depth-wise convolution block (DWConv). Specifically, before the residual connection, we introduce learnable residual connection parameters, weight and bias. These parameter values are adaptively learned by the Parameter Weighting Module, enhancing feature representation:
$z_l = \mathrm{BN}\left(\sigma\left(\mathrm{Conv}_{\mathrm{Depthwise}}(z_{l-1})\right)\right) + \left(z_{l-1} \times weight + bias\right)$  (3)
Afterwards, the output from the residual network enters the pointwise convolution block. This block uses 1 × 1 convolutions to reduce dimensionality and improve computational efficiency, facilitating deeper processing in subsequent modules:
$z_{l+1} = \mathrm{BN}\left(\sigma\left(\mathrm{Conv}_{\mathrm{Pointwise}}(z_l)\right)\right)$  (4)
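The following sketch puts Equations (3) and (4) into a single PyTorch block; nn.GELU stands in for the “Leaky GELU” activation, and the depth-wise kernel size of 9 follows the original ConvMixer and is an assumption here:

```python
import torch
import torch.nn as nn

class ConvMixerPP(nn.Module):
    """ConvMixer++ block: depth-wise conv + learnable weighted residual, then point-wise conv."""
    def __init__(self, dim, kernel_size=9):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)
        self.act = nn.GELU()
        self.bn1, self.bn2 = nn.BatchNorm2d(dim), nn.BatchNorm2d(dim)
        # learnable residual-connection parameters of Equation (3)
        self.weight = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        z_mid = self.bn1(self.act(self.dw(z))) + (z * self.weight + self.bias)   # Equation (3)
        return self.bn2(self.act(self.pw(z_mid)))                                # Equation (4)

# Such blocks can be stacked inside the CSPCM++ sketch, e.g. CSPCMpp(256, 256, n=2, block=ConvMixerPP).
```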

3.2.2. LSKNet Attention Mechanism

Despite YOLOv5’s excellent performance in terms of speed and accuracy, it still has certain limitations when dealing with complex scenes or small objects. LSKNet [29] is a multi-level attention mechanism network originally designed for remote sensing image detection tasks. It dynamically adjusts the weights of feature maps through embedding overlapping image blocks and a series of attention mechanism blocks to enhance the model’s ability to capture key features. Considering that both haze images and remote sensing images face issues such as varying resolutions and various noise types, this paper integrates LSKNet (Large Selective Kernel Network) as an attention mechanism added before the SPP layer and before the output of the large object detection layer.
As shown in Figure 6, LSKNet processes the input image X using a decomposition and sequential method: The input image X is sequentially processed by two decomposed one-dimensional kernels, which significantly reduce the model’s parameter count while maintaining the effectiveness of a large, undecomposed kernel. Following this, X undergoes depth-wise convolutions F1 and F2 to produce the feature maps U1 and U2. These feature maps are then concatenated along the channel dimension and pass through a series of feedforward operations. The resulting feature map is fed into the Spatial Selection mechanism. This mechanism takes the refined spatial feature maps SA and dynamically selects appropriate kernels for various objects, ultimately outputting S, the attention-enhanced feature map.
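A simplified, hedged sketch of this spatial selection is shown below; the kernel sizes and dilation follow the LSKNet paper, but the code is an illustration rather than the authors’ implementation:

```python
import torch
import torch.nn as nn

class LSKBlock(nn.Module):
    """Two depth-wise branches with different receptive fields, fused by spatial selection."""
    def __init__(self, dim):
        super().__init__()
        self.f1 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)                  # small field
        self.f2 = nn.Conv2d(dim, dim, 7, padding=9, groups=dim, dilation=3)      # large field
        self.squeeze1 = nn.Conv2d(dim, dim // 2, 1)
        self.squeeze2 = nn.Conv2d(dim, dim // 2, 1)
        self.spatial = nn.Conv2d(2, 2, 7, padding=3)   # attention over avg/max channel maps
        self.proj = nn.Conv2d(dim // 2, dim, 1)

    def forward(self, x):
        u1 = self.squeeze1(self.f1(x))                 # U1: small receptive field
        u2 = self.squeeze2(self.f2(self.f1(x)))        # U2: sequentially enlarged field
        u = torch.cat([u1, u2], dim=1)
        avg_map = u.mean(dim=1, keepdim=True)
        max_map = u.max(dim=1, keepdim=True).values
        sa = torch.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))   # SA maps
        s = u1 * sa[:, 0:1] + u2 * sa[:, 1:2]          # kernel selection per spatial location
        return x * self.proj(s)                        # attention-enhanced output S
```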

3.2.3. CLYOLO Loss Function

Due to the difficulty of detecting targets in foggy samples, this study adopts FocalGIoU as the position loss function for the detection model. FocalGIoU integrates the idea of Focal Loss [30] into the GIoU loss [31], focusing the regression process on high-quality anchor boxes. Compared to the CIoU loss [32], GIoU incorporates information about the discrepancy between bounding boxes, enhancing the model’s ability to handle objects of various shapes. Additionally, the FocalEIoU proposed by Zhang et al. [33] combines bounding box regression with effective example mining, which helps improve performance in scenarios with sample imbalance or numerous difficult samples. $L_{GIoU}$ is the GIoU loss, as expressed in Equation (5). $L_{FocalGIoU}$ is the location loss function, which incorporates a parameter γ to control the degree of outlier suppression (set to 0.5 in this paper). In Equation (6), $IoU^{\gamma}$ acts as a weighting factor that adjusts the influence of $L_{GIoU}$ according to the quality of the bounding box prediction: low-IoU (outlier) predictions contribute less, so the regression gradient concentrates on high-quality boxes. CLYOLO’s total loss function is given by Equation (7), where the location loss is $L_{FocalGIoU}$.
$L_{GIoU} = 1 - IoU(A, B) + \dfrac{\left|C \setminus (A \cup B)\right|}{\left|C\right|}$  (5)
$L_{FocalGIoU} = IoU^{\gamma} \cdot L_{GIoU}$  (6)
$L_{CLYOLO} = L_{location} + L_{class} + L_{confidence}$  (7)
where A and B are two arbitrary shapes, and C is the smallest convex shape enclosing both A and B; $L_{class}$ is the class loss, computed with a cross-entropy loss that measures the discrepancy between the predicted class probabilities and the actual class labels for each bounding box; and $L_{confidence}$ is the confidence loss, computed with a binary cross-entropy loss that compares the predicted confidence scores with the actual presence or absence of an object.
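For reference, a minimal sketch of Equations (5) and (6) for axis-aligned boxes in (x1, y1, x2, y2) format is given below; it is illustrative rather than the training code used in CLYOLO:

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2). Returns (L_GIoU, IoU)."""
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # smallest enclosing box C
    cx1, cy1 = torch.min(pred[:, 0], target[:, 0]), torch.min(pred[:, 1], target[:, 1])
    cx2, cy2 = torch.max(pred[:, 2], target[:, 2]), torch.max(pred[:, 3], target[:, 3])
    area_c = (cx2 - cx1) * (cy2 - cy1) + eps
    giou = iou - (area_c - union) / area_c
    return 1.0 - giou, iou                              # Equation (5)

def focal_giou_loss(pred, target, gamma=0.5):
    l_giou, iou = giou_loss(pred, target)
    return (iou.detach() ** gamma) * l_giou             # Equation (6); IoU^gamma as a constant weight
```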

4. Results

This section provides a detailed overview of the experimental process and results to validate the effectiveness of the proposed algorithm. Comparative experiments and ablation studies are conducted on object detection datasets with random fog levels, evaluating the model’s performance against recent advanced models and verifying the efficacy of its individual components.

4.1. Dataset Creation

The scarcity of existing foggy weather datasets makes it challenging to train stable and robust object detection models under foggy conditions. In this study, we selected five common object categories in foggy environments for training: person, bicycle, car, bus, and motorcycle. We constructed two datasets simulating foggy weather conditions based on the classic PASCAL VOC dataset combined with an atmospheric scattering model. These datasets are the training set VOC_train and the testing set VOC_test. Detailed information about the datasets and the objects within them is shown in Table 1.
The VOC_train dataset consists of images from the VOC2007_trainval and VOC2012_trainval sets, which have been randomly augmented with fog elements for the aforementioned five categories (fog density is controlled by adjusting the medium’s transmittance). Similarly, the VOC_test dataset is composed of images from the VOC2007_test set, randomly augmented with fog for the same five categories. Examples of images from the datasets are shown in Figure 7.
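As an illustration, fog can be synthesized from a clear image with the atmospheric scattering model I(x) = J(x)·t(x) + A·(1 − t(x)); in the sketch below, the mapping from the fog level n to the transmittance t and the atmospheric light value are assumptions about how such augmentation might be implemented, not the exact procedure used to build VOC_train and VOC_test:

```python
import numpy as np

def add_synthetic_fog(clear_img, n=4, atmospheric_light=0.9, rng=None):
    """clear_img: float array (H, W, 3) in [0, 1]; n: fog level (higher = denser fog)."""
    rng = rng if rng is not None else np.random.default_rng()
    # illustrative mapping from fog level n to a (jittered) scalar transmittance t
    t = float(np.clip(1.0 - 0.08 * n + rng.uniform(-0.05, 0.05), 0.05, 1.0))
    # atmospheric scattering model: I(x) = J(x) * t + A * (1 - t)
    return np.clip(clear_img * t + atmospheric_light * (1.0 - t), 0.0, 1.0)

# Example: one random-fog training image in the n = 1..10 range
rng = np.random.default_rng(0)
clear = rng.random((480, 640, 3))
foggy = add_synthetic_fog(clear, n=int(rng.integers(1, 11)), rng=rng)
```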

4.2. Experimental Setup

The experiments in this study, including the training of both the defogging model and the detection model, were conducted on an Ubuntu 20.04 operating system. The setup employed an Intel(R) Xeon(R) Platinum 8255C processor (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA RTX 3090 GPU with 24 GB of VRAM. The deep learning framework used was PyTorch 1.11.0, with CUDA version 11.3. For the object detection experiments, the YOLOv5 model was trained for 300 epochs on the specified dataset. The Adam optimizer was applied with an initial learning rate of 0.001, and the batch size was set to 16.

4.3. Evaluation Metrics

We evaluate the performance of object detection using four metrics: Precision (P), Recall (R), Average Precision (AP), and Mean Average Precision (mAP), as shown in Equations (8)–(11). Recall measures the proportion of actual positive samples detected by the model, while Precision measures the proportion of true positives among the samples predicted as positive; the two metrics are interrelated. Average Precision (AP) is the area under the Precision–Recall (PR) curve; for detection algorithms, a higher AP value indicates better performance. The mean of the AP values over the K categories is the Mean Average Precision (mAP); likewise, a higher mAP value signifies better performance. The mAP@x used in this paper refers to the Mean Average Precision value at an IoU threshold of x.
$P = \dfrac{TP}{TP + FP}$  (8)
$R = \dfrac{TP}{TP + FN}$  (9)
$AP = \int_{0}^{1} P(R)\, dR$  (10)
$mAP = \dfrac{\sum_{k=1}^{K} AP_k}{K}$  (11)
TP (True Positives) refers to the number of instances where positive samples are correctly identified, while FP (False Positives) indicates the number of instances incorrectly identified as positive samples. FN (False Negatives) represents the number of instances incorrectly identified as negative samples.
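A minimal sketch of how Equations (8)–(11) can be computed from per-class detection results (confidence scores, true-positive flags from IoU matching, and ground-truth counts) is given below; the all-point PR-curve interpolation is a common convention and an assumption here:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """scores: (M,) confidences; is_tp: (M,) 1 if the detection matched a GT box; num_gt: GT count."""
    order = np.argsort(-scores)
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(1 - is_tp[order])
    recall = tp / max(num_gt, 1)                     # Equation (9)
    precision = tp / np.maximum(tp + fp, 1e-9)       # Equation (8)
    # all-point interpolation of the PR curve, then integrate (Equation (10))
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

def mean_average_precision(per_class):
    """per_class: list of (scores, is_tp, num_gt) tuples, one per category (Equation (11))."""
    return float(np.mean([average_precision(s, m, g) for s, m, g in per_class]))
```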

4.4. Performance of CLYOLO

The proposed AODs-CLYOLO model was trained and validated using dehazed datasets, which we refer to as VOC_train_dehaze and VOC_test_dehaze. To ensure the fairness and reliability of the results, all models were trained and tested in the same experimental environment: the comparison models were trained under the same hardware and software conditions, followed the same training configurations, and were trained for 300 epochs to ensure comparability of the results. Specifically, the comparison models YOLOv7-tiny, YOLOv4, YOLOv3, Faster R-CNN, and SSD were trained with a batch size of 16, and other settings remained consistent with those in the original papers. Additionally, the training datasets for the comparison models all used VOC_train with randomly added haze, and the validation datasets used VOC_test.
From Table 2, we can see that our model’s Mean Average Precision (mAP@0.5) improves by 3.75% compared to the baseline model YOLOv5s across all categories. For applications requiring high-precision detection in autonomous driving scenarios, mAP@0.5:0.95 provides a stringent evaluation of the model’s fine-grained detection capabilities, reflecting its actual performance in these contexts. As shown in the table, our model achieves a mAP@0.5:0.95 of 59.80, which is 10.13% higher than the baseline. Additionally, our model also shows improvements in Precision and Recall, reaching 88.00 and 74.70, respectively. Overall, the enhanced AODs-CLYOLO model outperforms the compared algorithms in all metrics.
To comprehensively evaluate the computational performance of our proposed model, we also compared several mainstream object detection algorithms under the same experimental environment. Table 3 summarizes the performance of the different models in terms of the number of parameters, floating point operations (FLOPs), and memory usage. Through the comparison of these metrics, we can clearly demonstrate the differences in computational efficiency and resource consumption among the models. The data in the table show that our model outperforms models such as Faster R-CNN, SSD, YOLOv3, and YOLOv4 in terms of computational cost and memory usage, but it falls short of lightweight models such as YOLOv5n, YOLOv5s, and YOLOv7-tiny. However, our model keeps the increase in parameters controlled while demonstrating significant advantages in object detection accuracy (as seen in the experimental results in Table 2). These results indicate that, in the trade-off between computational performance and detection accuracy, our model offers unique value in tasks where high detection precision is crucial.
To more intuitively demonstrate the superior detection performance, we compared the object detection results with prediction boxes of our algorithm against YOLOv5s. As illustrated in Figure 8, the eight example images highlight issues caused by haze that lead to misdetection, missed detection, and lower detection metrics in the original model. In images 1–4, the original model shows misdetection problems, such as misclassifying a small dog as a human in image 2. In images 5–7, the original model misses detections, failing to detect humans, bicycles, and cars in each respective image, errors not present in our AODs-CLYOLO model. Furthermore, the last image shows the original model detecting the doctor with a confidence score of only 0.39, while our AODs-CLYOLO model improves this to 0.63. Similar improvements in confidence scores can be observed in other images, demonstrating our model’s enhanced predictive confidence. In summary, compared to the baseline YOLOv5s model, our model significantly improves detection performance in hazy conditions, effectively reducing missed and misdetection issues faced by the original model.

4.5. Ablation Experiments

This section validates the effectiveness of the different optimization modules through ablation experiments. We sequentially added the AODs dehazing preprocessing module, the CSPCM++ module, the LSKNet attention mechanism, the FocalGIoU loss function, and the DWConv module to the baseline model YOLOv5s, creating several improved models. These improved models were then compared using the same test data.
The experimental results are shown in Table 4. Among these, Model A refers to the baseline YOLOv5s model. Model B uses the dataset after applying AODs de-fogging. Models C through H all involve de-fogging using the AODs model. Specifically, Models C, D, and E incorporate the CSPCM++ module, the LSKNet attention mechanism, and the FocalGIoU loss function, respectively. Models F, G, and H are combinations of these modules, with Model H representing the final model proposed in this paper. The table shows that the baseline model YOLOv5s achieves a mAP@0.5 of 80.80% and a mAP@0.5–0.95 of 54.30%, while our AODs-CLYOLO model reaches 83.80% and 59.80% for these metrics, respectively. Analyzing the detection samples reveals that the CSPCM++ module and the LSKNet attention mechanism contribute significantly to the improvement in accuracy. Our method effectively integrates multi-level features by combining global features, which positively impacts detection in foggy scenes. By adding the four optimization modules, the AODs-CLYOLO model (Model H) achieves the best performance, with mAP@0.5 and mAP@0.5–0.95 improving by 3.7% and 10.1%, respectively, compared to the baseline model (Model A).
To validate the effectiveness of the CSPCM++ module, we conducted comparative experiments using four different modules: Convmixer, Convmixer++, CSPCM, and CSPCM++. The results are presented in Table 5. The data indicate that, compared to the original baseline model, adding the improved CSPCM++ module to the detection head significantly enhances the mAP (0.5–0.95), reaching 58.80. This demonstrates the effectiveness of the detection head proposed in this study.

5. Conclusions

This study aims to enhance target detection under foggy conditions by proposing a novel model, AODs-CLYOLO, which integrates dehazing and detection capabilities. In the data preprocessing phase, we introduce a new dehazing model, AODs, which not only improves the dehazing effect but also enhances detection accuracy. During the detection phase, AODs-CLYOLO introduces a new cross-stage local network, CSPCM++, as a key component of the detection head. To address the issue of poor detection performance on large targets, AODs-CLYOLO incorporates the LSKNet selective attention mechanism before the SPP layer and at the output of the large target detection layer. Moreover, considering the application scenarios of target detection in foggy conditions, we have upgraded the model to be more lightweight. We replaced the convolutional layers in the neck with depth-wise separable convolutions, reducing the model’s parameter count while improving detection metrics. Experimental results show that compared to the baseline model YOLOv5s, our model achieves an increase of 3.7% in mAP@0.5 and 10.1% in mAP@0.5–0.95 on the synthetic foggy dataset VOC_fog. These results demonstrate that the improved model exhibits superior performance in target detection under adverse weather conditions such as haze and low visibility. This advancement is particularly significant for fields like autonomous driving and surveillance systems, as it enhances the robustness and accuracy of the systems in challenging weather conditions.
Although our model has achieved improvements in target detection performance, there remains room for further research. For instance, we have not yet explored joint training of the dehazing and target detection models. Future work will focus on exploring more complex model structures, a more diverse range of datasets, and more refined model optimization and training strategies in the context of joint training to improve the model’s robustness and generalization capabilities.

Author Contributions

Conceptualization, X.L. and Z.L.; methodology, L.L.; software, J.C.; validation, Z.L., X.L. and L.L.; formal analysis, X.L.; investigation, J.C.; resources, L.L.; data curation, X.L.; writing—original draft preparation, X.L.; writing—review and editing, X.L.; visualization, X.L.; supervision, J.C.; project administration, L.L.; funding acquisition, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC) under the General Program, grant number 62171145 (“Research on Joint Image Enhancement and Object Detection Methods under Low Light Conditions”), with a funding amount of CNY 560,000 and a research period from January 2022 to December 2025. The APC was funded by the same grant.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this paper were generated from the publicly available PASCAL VOC dataset (http://host.robots.ox.ac.uk/pascal/VOC/, accessed on 1 January 2023). The data files related to this study, and the data supporting its conclusions, are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Based on experimental validation, the dehazing algorithm proposed in this paper not only benefits object detection but also enhances the dehazing effect on images. The training of the dehazing model utilized a training dataset, VOC_fog_train_10, which includes data augmented with 10 different levels of haze. To verify the dehazing performance of the improved model, we conducted extensive experimental evaluations on AODs. We tested the model’s performance on four different datasets, each with varying levels of haze (light, moderate, heavy, and random haze), and compared its performance with mainstream dehazing algorithms. Table A1 presents our experimental results.
Table A1. Evaluation of defogging effects of different models on datasets with different levels of fogging.
| Fog Level | Index | DCP | MSRCR | AODNet | Ours |
|---|---|---|---|---|---|
| Slight fog (n = 1) | PSNR/dB | 14.642 | 14.887 | 20.444 | 23.166 |
| | SSIM/% | 78.563 | 73.873 | 81.191 | 84.384 |
| Moderate fog (n = 4) | PSNR/dB | 13.812 | 14.782 | 21.606 | 24.094 |
| | SSIM/% | 72.735 | 71.707 | 81.890 | 86.351 |
| Heavy fog (n = 8) | PSNR/dB | 13.503 | 14.484 | 19.368 | 20.992 |
| | SSIM/% | 66.325 | 67.091 | 77.016 | 82.143 |
| Random fog (n = 1–10) | PSNR/dB | 13.635 | 14.708 | 20.466 | 22.729 |
| | SSIM/% | 70.247 | 70.709 | 79.687 | 84.123 |
As shown in the experimental results, our proposed AODs demonstrates significant advantages across various metrics. On the randomly hazed dataset, our model achieved a PSNR of 22.729 and an SSIM of 0.84123, outperforming several other comparative algorithms. Additionally, our model excels in dehazing under moderate haze conditions. Beyond quantitative metrics, we also conducted subjective visual comparisons. The dehazing effects of different algorithms on the VOC_fog test dataset are illustrated in Figure A1.
Figure A1. De-fogging effect of different algorithms on the test set VOC_fog. (a) Original. (b) DCP. (c) Retinex. (d) AODNet. (e) Ours.
From the results shown in Figure A1, the improved AODs dehazing algorithm outperforms the other algorithms in terms of dehazing effectiveness and realism. It better eliminates the color distortion and blurriness caused by haze. On one hand, it avoids issues such as the color shift and excessive darkness seen in images processed by the DCP (Dark Channel Prior) algorithm. On the other hand, it addresses the low saturation problem of the original AODNet algorithm. Furthermore, compared to the Retinex dehazing algorithm, the improved AODs excels in preserving detail and maintaining scene structure in the dehazed results, providing clearer and more realistic images. In summary, the proposed AODs performs well in terms of brightness and texture restoration.
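For reference, the PSNR and SSIM figures reported in Table A1 can be computed as in the following sketch, which uses scikit-image (version ≥ 0.19 for the channel_axis argument); data ranges and any cropping are assumptions rather than the exact evaluation script:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_dehazing(dehazed, reference):
    """dehazed, reference: float arrays (H, W, 3) in [0, 1]."""
    psnr = peak_signal_noise_ratio(reference, dehazed, data_range=1.0)
    ssim = structural_similarity(reference, dehazed, data_range=1.0, channel_axis=-1)
    return psnr, ssim * 100.0   # SSIM reported as a percentage, as in Table A1

ref = np.random.rand(256, 256, 3)
out = np.clip(ref + np.random.normal(0.0, 0.02, ref.shape), 0.0, 1.0)
print(evaluate_dehazing(out, ref))
```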

References

  1. Bijelic, M.; Gruber, T.; Mannan, F.; Kraus, F.; Ritter, W.; Dietmayer, K.; Heide, F. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  2. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  3. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  4. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  5. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; pp. 1137–1149. [Google Scholar] [CrossRef]
  7. Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. An All-in-One Network for Dehazing and Beyond. arXiv 2017, arXiv:1707.06543. [Google Scholar]
  8. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2341–2353. [Google Scholar] [PubMed]
  9. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  10. Zhu, Q.; Mai, J.; Shao, L. A fast single image haze removal algorithm using color attenuation prior. IEEE Trans. Image Process. 2015, 24, 3522–3533. [Google Scholar] [PubMed]
  11. Su, B.; Lu, S.; Tan, C.L. Binarization of historical document images using the local maximum and minimum. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems 2010, Boston, MA, USA, 9–11 June 2010; pp. 159–166. [Google Scholar]
  12. Zhang, W.; Dong, L.; Pan, X.; Zhou, J.; Qin, L.; Xu, W. Single image defogging based on multi-channel convolutional MSRCR. IEEE Access 2019, 7, 72492–72504. [Google Scholar] [CrossRef]
  13. Cai, B.; Xu, X.; Jia, K.; Qing, C.; Tao, D. DehazeNet: An End-to-End System for Single Image Haze Removal. IEEE Trans. Image Process. 2016, 25, 5187–5198. [Google Scholar] [CrossRef] [PubMed]
  14. Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature Fusion Attention Network for Single Image Dehazing. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11908–11915. [Google Scholar] [CrossRef]
  15. Qu, Y.; Chen, Y.; Huang, J.; Xie, Y. Enhanced pix2pix dehazing network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 8160–8168. [Google Scholar]
  16. Ren, W.; Liu, S.; Zhang, H.; Pan, J.; Cao, X.; Yang, M.H. Single image dehazing via multi-scale convolutional neural networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 154–169. [Google Scholar]
  17. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar] [CrossRef]
  18. Cao, G.; Xie, X.; Yang, W.; Liao, Q.; Shi, G.; Wu, J. Feature-fused SSD: Fast detection for small objects. Comput. Vis. Pattern Recognit. 2018, 10615. [Google Scholar]
  19. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  20. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Virtual Conference, 11–17 October 2021. [Google Scholar] [CrossRef]
  21. Wang, Q.; Feng, W.; Yao, L.; Zhuang, C.; Liu, B.; Chen, L. TPH-YOLOv5-Air: Airport Confusing Object Detection via Adaptively Spatial Feature Fusion. Remote Sens. 2023, 15, 3883. [Google Scholar] [CrossRef]
  22. Liu, W.; Ren, G.; Yu, R.; Guo, S.; Zhu, J.; Zhang, L. Image-Adaptive YOLO for Object Detection in Adverse Weather Conditions. Proc. AAAI Conf. Artif. Intell. 2022, 36, 1792–1800. [Google Scholar] [CrossRef]
  23. Kalwar, S.; Patel, D.; Aanegola, A.; Konda, K.R.; Garg, S.; Krishna, K.M. GDIP: Gated Differentiable Image Processing for Object-Detection in Adverse Conditions. In Proceedings of the 2022 IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022. [Google Scholar]
  24. Liu, X.; Lin, Y. YOLO-GW: Quickly and Accurately Detecting Pedestrians in a Foggy Traffic Environment. Sensors 2023, 23, 5539. [Google Scholar] [CrossRef] [PubMed]
  25. Ding, Q.; Li, P.; Yan, X.; Shi, D.; Liang, L.; Wang, W.; Xie, H.; Li, J.; Wei, M. CF-YOLO: Cross Fusion YOLO for Object Detection in Adverse Weather with a High-Quality Real Snow Dataset. IEEE Trans. Intell. Transp. Syst. 2023, 24, 10749–10759. [Google Scholar] [CrossRef]
  26. Wu, H.; Qu, Y.; Lin, S.; Zhou, J.; Qiao, R.; Zhang, Z.; Xie, Y.; Ma, L. Contrastive learning for compact single image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 19–25 June 2021; pp. 10551–10560. [Google Scholar]
  27. Ng, D.; Chen, Y.; Tian, B.; Fu, Q.; Chng, E.S. ConvMixer: Feature Interactive Convolution with Curriculum Learning for Small Footprint and Noisy Far-field Keyword Spotting. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022. [Google Scholar] [CrossRef]
  28. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier Nonlinearities Improve Neural Network Acoustic Models; Computer Science Department, Stanford University: Stanford, CA, USA, 2013; Volume 30, p. 3. [Google Scholar]
  29. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large Selective Kernel Network for Remote Sensing Object Detection. arXiv 2023, arXiv:2303.09030. [Google Scholar]
  30. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
  31. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  32. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI conference on artificial intelligence 2020, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  33. Zhang, Y.-F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
Figure 1. Step-by-step diagram of the AODs-CLYOLO model, wherein the output of the defogging module is used as an input to the detection module, i.e., the defogging model is added to the pre-processing module of the detection module.
Figure 2. The AODs-CLYOLO model. The AODs-CLYOLO object detection algorithm consists of inputs, an AODs dehazing model, and a CLYOLO detection model.
Figure 3. The network diagram and configuration of AODs.
Figure 4. The K-Estimation Module. The module utilizes a multi-scale design approach to capture features at different scales in the image. The multi-scale feature fusion module consists of convolutional layers with kernel sizes of 3 × 3, 5 × 5, and 7 × 7, followed by a concatenation layer that combines the outputs of these convolutional layers.
Figure 5. Introduction to (a) CSPCM++ and (b) ConvMixer++ module.
Figure 6. Large Selective Kernel Network (LSKNet) attention mechanism.
Figure 7. Example graph of the dataset in this article. The first row of photos are haze-free images from the PASCAL VOC dataset; the second row is a synthetic fog map with fog added. The image in the third row is the clean image after AODs dehazing, which is VOC_train.
Figure 8. Plot of object detection results for this paper’s algorithm and YOLOv5s. The first and third rows show the detection results of YOLOv5s, while the second and fourth rows show the detection results of the proposed model.
Table 1. Statistics of the used datasets, including VOC_train and VOC_test.
| Dataset | Image | Person | Bicycle | Car | Bus | Motorbike | Total |
|---|---|---|---|---|---|---|---|
| VOC_train | 8111 | 13,256 | 1064 | 3267 | 822 | 1052 | 19,561 |
| VOC_test | 2734 | 4528 | 337 | 1201 | 213 | 325 | 6604 |
Table 2. Comparative experiments on the object detection part of the evaluation on the VOC_test dataset.
| Method | mAP@0.5 | mAP@0.5:0.95 | AP@0.5 (Person) | AP@0.5 (Car) | AP@0.5 (Bicycle) | AP@0.5 (Bus) | AP@0.5 (Motorbike) | P | R |
|---|---|---|---|---|---|---|---|---|---|
| Faster-RCNN | 76.87 | 44.90 | 76.85 | 70.53 | 75.05 | 84.20 | 77.73 | 52.87 | 79.20 |
| SSD | 78.78 | 51.50 | 77.80 | 73.17 | 84.78 | 80.53 | 77.62 | 88.53 | 69.85 |
| YOLOv3 | 82.12 | 59.10 | 81.70 | 80.60 | 86.00 | 79.20 | 83.10 | 89.40 | 74.30 |
| YOLOv4 | 81.80 | 53.90 | 86.34 | 76.44 | 80.33 | 86.51 | 85.01 | 90.25 | 72.32 |
| YOLOv5n | 78.10 | 50.50 | 80.00 | 81.30 | 77.30 | 73.90 | 78.10 | 84.30 | 68.80 |
| YOLOv7-tiny | 83.10 | 56.20 | 83.60 | 85.30 | 83.10 | 79.60 | 83.80 | 84.40 | 76.30 |
| IA-YOLO [22] | 72.03 | - | - | - | - | - | - | - | - |
| YOLOv5s | 80.80 | 54.30 | 82.50 | 83.90 | 81.10 | 77.50 | 78.70 | 85.50 | 73.20 |
| Ours | 83.80 | 59.50 | 83.10 | 85.60 | 83.80 | 80.50 | 83.80 | 88.00 | 74.70 |
Table 3. Comparison of computational performance across various object detection algorithms.
| Method | Parameters (Million) | FLOPs (Billion) | Memory Usage (MB) |
|---|---|---|---|
| Faster-RCNN | 66.0 | 90.0 | 113.5 |
| SSD | 26.0 | 30.0 | 97.1 |
| YOLOv3 | 61.9 | 35.0 | 89.40 |
| YOLOv4 | 64.2 | 65.0 | 256.6 |
| YOLOv5n | 1.77 | 4.2 | 3.7 |
| YOLOv7-tiny | 6.03 | 13.2 | 12.3 |
| YOLOv5s | 7.02 | 15.8 | 14.4 |
| Ours * | 8.6 | 17.7 | 17.7 |
* ‘Ours’ refers to the object detection model CLYOLO. Since the impact of AODs on the parameters and other metrics of AODs-CLYOLO is minimal, it has been disregarded here.
Table 4. Ablation experiments with object detection modules.
| Method | AODs | CSPCM++ | LSKNet | FocalGIoU | DWConv | mAP@0.5 (%) | mAP@0.5–0.95 (%) |
|---|---|---|---|---|---|---|---|
| A | - | - | - | - | - | 80.80 | 54.30 |
| B | ✓ | - | - | - | - | 80.90 | 55.10 |
| C | ✓ | ✓ | - | - | - | 82.80 | 58.10 |
| D | ✓ | - | ✓ | - | - | 83.10 | 56.90 |
| E | ✓ | - | - | ✓ | - | 81.30 | 55.30 |
| F | ✓ | ✓ | ✓ | - | - | 83.50 | 59.20 |
| G | ✓ | ✓ | ✓ | ✓ | - | 82.90 | 59.00 |
| H | ✓ | ✓ | ✓ | ✓ | ✓ | 83.80 | 59.80 |
Table 5. ConvMixer, ConvMixer++, CSPCM, and CSPCM++ experimental results.
| Method | P | R | mAP@0.5 (%) | mAP@0.5–0.95 (%) |
|---|---|---|---|---|
| YOLOv5s | 85.50 | 73.20 | 80.80 | 54.30 |
| ConvMixer | 88.60 | 72.10 | 81.40 | 57.10 |
| ConvMixer++ | 87.20 | 74.60 | 82.22 | 57.70 |
| CSPCM | 88.10 | 73.60 | 82.80 | 58.10 |
| CSPCM++ | 88.90 | 74.00 | 83.30 | 58.80 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
