Article

IDP-YOLOV9: Improvement of Object Detection Model in Severe Weather Scenarios from Drone Perspective

1 Artificial Intelligence Security Innovation Research, Beijing Information Science and Technology University, Beijing 100192, China
2 Department of Information Security, Beijing Information Science and Technology University, Beijing 100192, China
3 National Computer System Engineering Research Institute of China, Beijing 100083, China
4 School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
5 Shanghai International School of Chief Technology Officer, East China Normal University, Shanghai 200062, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(12), 5277; https://doi.org/10.3390/app14125277
Submission received: 17 April 2024 / Revised: 7 June 2024 / Accepted: 13 June 2024 / Published: 18 June 2024

Abstract:
Despite their proficiency with typical environmental datasets, deep learning-based object detection algorithms struggle when faced with diverse adverse weather conditions. Moreover, existing methods often address single adverse weather scenarios, neglecting situations involving multiple concurrent adverse conditions. To tackle these challenges, we propose an enhanced approach to object detection in power construction sites under various adverse weather conditions, dubbed IDP-YOLOV9. This model leverages a parallel architecture comprising the Image Dehazing and Enhancement Processing (IDP) module and an improved YOLOV9 object detection module. Specifically, for images captured in adverse weather, our approach employs a parallel architecture that includes the Three-Weather Removal Algorithm (TRA) module and the Deep Learning-based Image Enhancement (DLIE) module, which, together, filter multiple weather factors to enhance image quality. Subsequently, we introduce an improved YOLOV9 detection network module that incorporates a three-layer routing attention mechanism for object detection. Experiments demonstrate that the IDP module significantly improves image quality by mitigating the impact of various adverse weather conditions. Compared to traditional single-processing models, our method improves recognition accuracy on complex weather datasets by 6.8% in terms of mean average precision (mAP50).

1. Introduction

In the field of electric power, especially in complex environments or secluded areas, the utilization of unmanned aerial vehicles (UAVs) for surveillance, evidence collection, and external risk assessment is crucial for ensuring the safe operation of electrical facilities [1]. However, the quality of images acquired by UAVs is frequently compromised by adverse weather conditions, resulting in diminished clarity and subsequently reducing the accuracy and efficacy of detection [2]. This limitation significantly hinders the surveillance capacity of UAVs, potentially delaying the identification of risks [3] or problems and posing a threat to the safety of electrical infrastructure. Consequently, enhancing the object detection performance of UAVs under adverse weather conditions is imperative.
While mainstream detection algorithms such as Faster R-CNN [4], the YOLO series [5,6,7], and CenterNet [8] perform satisfactorily on images captured under normal weather conditions, adverse weather conditions often cause image blurring, insufficient illumination, or the overlapping of weather artifacts with objects [9,10]. Existing image dehazing methods, which utilize multi-scale feature aggregation networks, have shown inefficiency in processing images [11,12,13]. Similarly, rain removal techniques employ window self-attention networks and global residual convolution to produce rain-free images [14,15], while snow removal approaches use restoration algorithms to eliminate the influence of snowflakes [16,17]. These methods primarily address individual weather-related issues and are inadequate for dealing with the complexity of real-world adverse weather effects.
To enhance object detection, the proper supplementation of image data is required. Some studies have attempted to combine image enhancement with object detection using adaptive enhancement modules composed of context branches and edge branches to improve degraded images [18,19]. These methods dynamically adjust enhancement settings based on the illumination distribution in the input images [20,21,22]. However, these approaches often struggle to adapt to various image types, and the adjustment of enhancement parameters heavily relies on manual intervention, leading to suboptimal results. Building on these foundations, some research has integrated image enhancement with object detection [23,24], proposing the Adaptive Enhancement Model for Object Detection Network (ARODNet) [25] to improve detection performance under adverse conditions [26,27]. Nonetheless, these methods do not optimize the structure of the detection network itself, limiting their effectiveness during training.
Existing object detection methods struggle with degraded image quality and reduced effectiveness under adverse weather conditions such as rain, snow, and fog. These limitations stem from their inability to adequately address individual weather-related issues and adapt to the complexities of real-world scenarios, resulting in suboptimal detection performance. To address these challenges in the field of electric power and overcome the limitations of current object detection methods, this paper introduces an improved deep learning-based object detection model called IDP-YOLOV9. This algorithm integrates advanced image processing techniques and enhanced parallel architectures, utilizing the parameter estimation of convolutional neural networks, along with an improved YOLOV9 detection network model equipped with a three-layer routing attention mechanism. The objective is to enhance image processing capabilities under complex weather conditions, thereby improving object detection accuracy.
The primary contributions of this paper are as follows:
  • We designed a parallel optimization architecture for the image processing module TRA (Three-Weather Removal Algorithm) and the image enhancement module DLIE (Deep Learning-based Image Enhancement). The TRA module employs a dynamically adjusted correlation graph construction strategy, allowing it to flexibly adapt to feature relationships in different scenes. The DLIE module introduces self-learning parameters, optimizing deep learning methods for image features and enabling adaptive modifications to the image enhancement procedure. This enhancement significantly boosts the model’s detection capabilities.
  • We propose an improved YOLOV9 detection network, incorporating a three-layer routing attention mechanism. This mechanism captures the features of the restored clear images from the TRA and DLIE modules through joint learning, enhancing the network’s ability to detect objects.
  • We introduce a comprehensive loss function where the parameters of the routing attention loss function and the IoU loss function are dynamically adjusted based on the weather factor features extracted by the TRA module. This approach allows for the dynamic adjustment of scale factor ratios and blur factors, refining bounding box generation and loss function computation. Consequently, the model better adapts to various weather conditions, achieving more accurate object detection under different weather scenarios.
This paper is organized as follows: Section 2 includes an overview of relevant work on image dehazing, deraining, desnowing, and image enhancement. Section 3 elaborates on the specific improvements proposed in this paper. The experimental results and analysis are presented in Section 4. Finally, in Section 5, we engage in a discussion and summary of the research findings.

2. Related Work

Deep learning has demonstrated remarkable performance in various tasks such as denoising [28], image inpainting, super-resolution [29,30], deblurring, and style transfer [31]. In the domain of adverse weather restoration, such as defogging, deraining, and desnowing, deep neural networks significantly outperform traditional methods. For instance, in deraining, CNN networks capture features from rain-affected images, enabling the learning of the physical characteristics of raindrops and rain streaks [32,33]. Concurrently, CNNs learn paired images of rain-free and rain-degraded conditions [34,35]. However, this method may leave residual blurry regions in images. Other approaches employ GAN (Generative Adversarial Network) networks to eliminate raindrops, but this requires obtaining effective attention maps [36,37]. Similarly, to address the issue of removing irregular snowflakes from images, GAN networks are used to focus on the features of snowflake patterns [38].
Before performing object detection tasks in adverse weather scenarios, preprocessing the images is imperative. One direct approach is to remove weather factors from the images and then apply image enhancement techniques before inputting them into the detection network [39,40,41,42]. However, solely employing this method may not achieve detection accuracy comparable to that under normal weather conditions. Another approach relies on unsupervised priors [43], combining image enhancement and detection while learning feature representations of weather to eliminate interference from weather-specific information [44,45]. For example, Ju et al. [46] proposed a single-image defogging detection framework based on region-line prior [47]. Although improving the quality of adverse weather images is beneficial, it does not necessarily translate to high-precision object detection models [48,49]. Therefore, some methods connect image processing modules and detection modules end-to-end to address this issue, while others utilize domain adaptation techniques [50,51].
To bolster the effectiveness of object detection networks in challenging weather environments, an object detection layer known as SwinFocus has been introduced [52]. This layer enhances both the feature extraction and representation capabilities of YOLOv5, leading to improved detection accuracy, particularly for small and blurry objects in foggy conditions [53]. Additionally, methods have been proposed to optimize the structure of YOLOv5 by reducing the depth of the feature pyramid and limiting the maximum downsampling factor to better recognize small objects [54,55]. With the emergence of the YOLOv8 detection network, an occlusion-aware attention mechanism has been designed, and deformable convolutions have been utilized to enhance the feature extraction capabilities of the YOLOv8 network [56,57].
Despite these significant improvements in object detection performance, notable deficiencies remain in parameter tuning due to the lack of focus on feature scales after image processing enhancement and insufficient modifications to object detection models for image processing and enhancement. Therefore, this study introduces an enhanced object detection methodology tailored to unmanned aerial vehicles (UAVs) navigating through inclement weather conditions, referred to as IDP-YOLOV9.
In this study, we primarily employed a parallel architecture based on image defogging, desnowing, and deraining modules (TRA) and image enhancement modules (DLIE) to process and enhance images captured under various adverse weather conditions at power construction sites by drones. Additionally, we introduced joint training of the improved YOLOV9 module with a three-layer routing attention mechanism to capture features of clear images restored by the TRA and DLIE modules. Finally, object detection was performed on images captured under various adverse weather conditions at power construction sites, thereby enhancing detection accuracy under complex adverse weather conditions.

3. Proposed Method

Under severe weather conditions, the visibility of images captured by drones is significantly reduced, seriously affecting the accuracy of object detection and posing technical challenges to the risk identification needs of power construction sites. To address this issue, this section details the adverse weather conditions in datasets of power scenes captured by drones and proposes the IDP-YOLOV9 object detection algorithm specifically designed for such conditions. The algorithm aims to reveal more latent information within images by eliminating the interference of weather factors.
The entire network framework consists of the Three-Weather Removal Algorithm (TRA) module, the Deep Learning-based Image Enhancement (DLIE) module, and the improved YOLOV9 detection module. The TRA module first estimates the atmospheric scattering model and the parameters related to rain and snow to obtain preliminary dehazed, derained, and desnowed images. The model then rescales these images to 256 × 256 before feeding them into the DLIE module, which optimizes the enhancement parameters to further improve the quality of the dehazed, derained, and desnowed images. Finally, the weakly supervised enhanced images produced by the DLIE module serve as inputs for the YOLOV9 detector, leading to improved detection accuracy under adverse weather conditions.

3.1. The Framework of IDP-YOLOV9

This study optimized the original network architecture to address the issues of error accumulation in the parameter estimation process, which can lead to incomplete image processing and distortion. To overcome these challenges, we employed a parallel architecture that combined the TRA image dehazing, deraining, and desnowing modules with the adaptive image enhancement module (DLIE). The resulting features were then integrated with the YOLOV9 detection network, creating an end-to-end object detection algorithm suitable for power scene images captured by drones under adverse weather conditions. The DLIE module consisted of five adaptable image enhancement parameters, essentially functioning as pixel-level filters. These parameters included the following: white balance (WB), which eliminated color deviations caused by atmospheric light; gamma correction, which restored details in darker areas; contrast enhancement, which improved overall visibility in regions affected by heavy fog, raindrops, or snowflakes; hue adjustment, which emphasized the overall atmosphere or produced specific effects; and image sharpening, which effectively enhanced the visual clarity of dehazed, derained, and desnowed images.
Figure 1 illustrates the IDP-YOLOV9 network architecture. This comprehensive approach ensures more accurate and reliable object detection in power scene images captured under adverse weather conditions.

3.2. TRA Module

Due to the difficulty in obtaining paired images under various adverse weather conditions and normal conditions at the same construction site, this study addressed this challenge by artificially synthesizing construction site images with fog, rain, and snow. The influence of rain and snow was also considered. Raindrops and snowflakes scatter and absorb light, thereby reducing image quality. These extensions enhanced the mathematical model, making it more inclusive when considering the image formation process under various weather conditions. Specifically, in the image restoration module, the parameters obtained from the estimation part were utilized to generate the final restored image through element-wise addition and multiplication layers. This process can be regarded as achieving high-quality image dehazing, deraining, and desnowing through joint learning based on the modeling parameters of fog, rain, and snow.
Below is the specific design of the dehazing, deraining, and desnowing filters. The formation of blurry images in foggy weather can be expressed as follows:
I_1(x, \lambda) = e^{-\beta(\lambda) d(x)} K(x, \lambda) + A \left( 1 - e^{-\beta(\lambda) d(x)} \right)
where I₁(x, λ) represents the foggy image, x is the position of a pixel in the image, λ is the wavelength of light, K(x, λ) denotes the scene radiance (clean image), A is the global atmospheric light, and e^(−β(λ)d(x)) is the medium transmission map, where β(λ) represents the atmospheric scattering coefficient and d(x) is the scene depth. To restore the clean image K(x, λ), it was crucial to obtain the atmospheric light A and the transmission map e^(−β(λ)d(x)). To do this, we first computed the dark channel map of the foggy image I₁(x, λ) and selected the brightest 1000 pixels. We estimated A by averaging the corresponding positions of these 1000 pixels in the foggy image I₁(x, λ). Furthermore, we introduced a parameter ε₁ to control the degree of defogging. The defogging filter is expressed as follows:
K(x, \lambda) = \frac{I_1(x, \lambda) - A}{\max\left( e^{-\beta(\lambda) d(x)}, \varepsilon_1 \right)}
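As a concrete illustration of this step, the NumPy sketch below estimates A from the 1000 brightest dark-channel pixels and applies the defogging filter of Equation (2). The window size, the dehazing strength ω, and the default value of ε₁ are illustrative choices, not the settings used in the paper.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def defog(foggy, window=15, eps1=0.1, n_bright=1000):
    """Dark-channel-prior defogging filter (illustrative sketch).
    foggy: H x W x 3 image with values in [0, 1]."""
    # Dark channel: per-pixel channel minimum followed by a local minimum filter.
    dark = minimum_filter(foggy.min(axis=2), size=window)

    # Atmospheric light A: average of the foggy pixels at the n_bright
    # brightest dark-channel positions.
    idx = np.argsort(dark.ravel())[-n_bright:]
    A = foggy.reshape(-1, 3)[idx].mean(axis=0)

    # Transmission map e^(-beta(lambda) d(x)), estimated from the dark channel
    # of the normalized image; omega is a conventional dehazing-strength constant.
    omega = 0.95
    t = 1.0 - omega * minimum_filter((foggy / A).min(axis=2), size=window)

    # Defogging filter of Equation (2), with the transmission clipped by eps1.
    K = (foggy - A) / np.maximum(t, eps1)[..., None]
    return np.clip(K, 0.0, 1.0)
```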
Considering the impact of raindrops on images, we can represent the blurred rainy image as follows:
I_2(x, \lambda) = (1 - M(x)) G(x, \lambda) + Q(x, \lambda)
where I₂(x, λ) is the rainy image, G(x, λ) denotes the scene radiance (clean image), and M(x) is the binary mask for raindrops, based on the scattering and reflection effects of raindrops and used to mark the positions and intensities of raindrops. Q(x, λ) represents the blurred image formed by the light reflected from raindrops. To restore the clear image G(x, λ), we needed to estimate the effect of raindrops and remove their influence. First, we computed the dark channel map of the rainy image I₂(x, λ) and selected the brightest pixel. Then, by averaging the corresponding pixels in the rainy image I₂(x, λ), we could estimate the light reflected by raindrops Q(x, λ).
We introduced a parameter ε₂ to control the degree of deraining, similar to the design of the defogging filter. The final deraining filter can be expressed as follows:
G(x, \lambda) = I_2(x, \lambda) - \varepsilon_2 Q(x, \lambda)
By adjusting the parameter ε₂, we could effectively mitigate the impact of raindrops and restore a clearer image.
For the desnowing filter design, the snowy image model can be represented as follows:
I_3(x, \lambda) = (1 - z) S(x) + H(x, \lambda)
where I₃(x, λ) is the snowy image, H(x, λ) represents the scene radiance (clean image), and z is the binary mask for snowflakes, based on the reflection effects of snowflakes and used to form spots on the image. S(x) denotes the blurred image formed by the light reflected from snowflakes. Similarly to the deraining filter, we first computed the dark channel map of the snowy image I₃(x, λ) and selected the brightest pixel. Then, by averaging the corresponding pixels in the snowy image I₃(x, λ), we could estimate the light reflected by snowflakes S(x).
The final desnowing filter can be expressed as follows:
H(x, \lambda) = I_3(x, \lambda) - \varepsilon_3 S(x)
By adjusting the parameter ε₃, we could control the degree of desnowing and restore a clearer image. Through the above design, we utilized the dark channel prior and the atmospheric scattering model, combined with the learned-parameter approach, to design the defogging, deraining, and desnowing filters. These filters could mitigate the impact of weather on images to some extent, thereby improving image quality and clarity.
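Under the same assumptions, the deraining and desnowing filters of Equations (3)–(6) reduce to estimating the reflected-light component from the brightest dark-channel pixel and subtracting a scaled version of it. A minimal sketch follows; the default values of ε₂ and ε₃ are illustrative.

```python
import numpy as np

def estimate_reflection(img):
    """Estimate the reflected-light component Q(x) or S(x) from the pixel at
    the brightest dark-channel position, as described above."""
    dark = img.min(axis=2)
    y, x = np.unravel_index(np.argmax(dark), dark.shape)
    return img[y, x]                      # one RGB value, broadcast over the image

def derain(rainy, eps2=0.5):
    """Deraining filter of Equation (4): G = I2 - eps2 * Q."""
    return np.clip(rainy - eps2 * estimate_reflection(rainy), 0.0, 1.0)

def desnow(snowy, eps3=0.5):
    """Desnowing filter of Equation (6): H = I3 - eps3 * S."""
    return np.clip(snowy - eps3 * estimate_reflection(snowy), 0.0, 1.0)
```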
The synthetic dataset was built offline from power scene data captured by drones under normal weather conditions. Foggy images were obtained by setting the two fog model parameters to 0.6 and 0.1. Rainy images were generated by setting the droplet diameter to 2 pixels, distributing the droplets randomly, and applying a Gaussian blur with a standard deviation of 1 to simulate environmental reflective light and replicate the lighting effect of raindrops. Snowy images were obtained by setting the snowflake diameter to 5 pixels, the transparency to 0.8, and the color to white.
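A sketch of this offline synthesis is given below. The stated values (fog parameters 0.6 and 0.1, 2-pixel raindrops blurred with σ = 1, 5-pixel snowflakes at 0.8 transparency) come from the text, while the particle counts, the constant scene depth for fog, and the streak shape of the raindrops are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

def add_fog(img, beta=0.6, depth=0.1, airlight=1.0):
    """Synthesize fog with the scattering model. The paper sets two fog
    parameters to 0.6 and 0.1; treating them as a scattering coefficient
    and a flat depth scale is an assumption."""
    t = np.exp(-beta * depth)                      # constant transmission for simplicity
    return img * t + airlight * (1.0 - t)

def add_rain(img, n_drops=800, diameter=2, sigma=1.0):
    """Raindrops of ~2 px diameter at random positions, blurred with a
    Gaussian of std 1 to mimic reflected environmental light."""
    h, w = img.shape[:2]
    layer = np.zeros((h, w))
    for y, x in zip(rng.integers(0, h, n_drops), rng.integers(0, w, n_drops)):
        layer[max(0, y - diameter):y + diameter, max(0, x - 1):x + 1] = 1.0
    layer = gaussian_filter(layer, sigma)
    return np.clip(img + layer[..., None], 0.0, 1.0)

def add_snow(img, n_flakes=400, diameter=5, alpha=0.8):
    """White snowflakes of ~5 px diameter with transparency 0.8."""
    h, w = img.shape[:2]
    mask = np.zeros((h, w))
    r = diameter // 2
    for y, x in zip(rng.integers(0, h, n_flakes), rng.integers(0, w, n_flakes)):
        mask[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1] = 1.0
    m = alpha * mask[..., None]
    return np.clip(img * (1 - m) + m, 0.0, 1.0)
```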
The TRA module was designed for image defogging, deraining, and desnowing, initially involving the physical modeling of fog, raindrops, and snowflakes in the images. This module primarily focused on parameter estimation and image degradation. Through joint learning, it utilized convolutional networks to estimate key operators for fog, rain, and snow, facilitating image restoration. In the parameter estimation module, five convolutional layers were employed to fuse multiscale information, effectively integrating coarse and fine-scale features through the concatenation of parallel convolutional layers (Concat layers). This process also involved estimating the parameters of the input image. The convolutional network learned the specific values of these parameters, which were integral components of the atmospheric scattering model and significantly impacted the results of image restoration. Notably, the design of the Concat layer incorporated hierarchical connections, with each Concat layer progressively connected to other convolutional layers. This design effectively compensated for any information loss during the convolution process, ensuring comprehensive feature detail acquisition. The output of the parameter estimation module included key parameters in the atmospheric scattering model, playing a critical role in the subsequent image restoration process.
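The following PyTorch sketch shows one way such a five-layer estimator with hierarchical Concat connections could look. The channel widths, kernel sizes, and the number of regressed parameters are assumptions; the paper does not specify them here.

```python
import torch
import torch.nn as nn

class TRAParamEstimator(nn.Module):
    """Five convolutional layers whose outputs are progressively concatenated,
    fusing coarse and fine scales before regressing the restoration parameters
    (e.g., atmospheric light, transmission scale, eps_1..eps_3)."""

    def __init__(self, n_params=5):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 1)
        self.conv2 = nn.Conv2d(8, 8, 3, padding=1)
        self.conv3 = nn.Conv2d(16, 8, 5, padding=2)   # input: cat(f1, f2)
        self.conv4 = nn.Conv2d(16, 8, 7, padding=3)   # input: cat(f2, f3)
        self.conv5 = nn.Conv2d(32, 8, 3, padding=1)   # input: cat(f1, f2, f3, f4)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(8, n_params))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        f1 = self.act(self.conv1(x))
        f2 = self.act(self.conv2(f1))
        f3 = self.act(self.conv3(torch.cat([f1, f2], dim=1)))
        f4 = self.act(self.conv4(torch.cat([f2, f3], dim=1)))
        f5 = self.act(self.conv5(torch.cat([f1, f2, f3, f4], dim=1)))
        return self.head(f5)              # per-image restoration parameters
```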

3.3. DLIE Module

Typically, image correction and enhancement operations involve manual adjustment of filter parameters based on experience, which poses several challenges. Firstly, manual adjustment is prone to subjective factors and experience limitations, leading to subjective parameter choices and significant errors. Secondly, parameters adjusted manually are often optimized for specific scenes or image collections, lacking generality across different scenes and diverse images, thereby limiting the algorithm’s applicability. Thirdly, manually tuned parameters may not adapt well to changes in different environments and image conditions, resulting in poor system adaptability when facing new data.
To address these challenges, this study proposes an automated method using a small CNN network to estimate filter parameters. This approach improves the performance and applicability of image correction and enhancement operations by enabling a more comprehensive search of the parameter space. It reduces subjective errors, enhances the system’s adaptability, and achieves better results across different scenes and image conditions.
The DLIE module comprised pixel-level filters and a sharpening filter. The pixel-level filter involved four adjustable parameters: white balance, gamma correction, hue adjustment, and contrast enhancement. Its primary purpose was to smooth the image after dehazing, deraining, and desnowing to improve visual quality. Table 1 presents descriptions of the four filters and their parameters. The white balance (WB) filter achieved white balance by adjusting the weights (W_r, W_g, W_b) of the red r_i, green g_i, and blue b_i channels of the input image; the output X_o was the weighted sum of the input channels. The gamma filter adjusted the image by multiplying each pixel value X_i by the parameter G. The contrast filter adjusted the image contrast by varying the brightness values, where α is a parameter ranging between 0 and 1. The tone filter adjusted the image hue by applying different hue functions (L_t^r, L_t^g, L_t^b) to each color channel; the parameter t_i represents the hue function.
The parameters of the contrast filter mapping function were defined as follows. Equation (7) computed the brightness value L(X_i) of the input pixel as a weighted sum of the three channels, where ω₁, ω₂, and ω₃ were adjustable weight parameters for the corresponding channels. Equation (8) adjusted the brightness non-linearly using a cosine transformation of the brightness value, producing the enhanced contrast Ma(L(X_i)). Equation (9) adjusted the entire pixel by multiplying the input pixel value X_i by the ratio of the enhanced brightness Ma(L(X_i)) to the original brightness L(X_i).
L(X_i) = \omega_1 r_i + \omega_2 g_i + \omega_3 b_i
\mathrm{Ma}(L(X_i)) = \frac{1}{2}\left(1 - \cos\left(\alpha \times L(X_i)\right)\right)
\mathrm{Ma}(X_i) = X_i \times \frac{\mathrm{Ma}(L(X_i))}{L(X_i)}
By processing the contrast parameters as described above, the contrast filter mapping function became more flexible and suitable for various image scenes. This enabled effective enhancement of image contrast, thereby improving image quality and visual effects.
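A direct transcription of Equations (7)–(9) in NumPy is given below; the default channel weights ω₁–ω₃ are set to the standard Rec. 601 luminance coefficients, which is an assumption.

```python
import numpy as np

def contrast_filter(img, alpha, w=(0.299, 0.587, 0.114), eps=1e-6):
    """Contrast mapping of Equations (7)-(9). img: H x W x 3 in [0, 1];
    alpha in (0, 1] controls the cosine tone curve."""
    # Eq. (7): per-pixel brightness as a weighted sum of the channels.
    L = w[0] * img[..., 0] + w[1] * img[..., 1] + w[2] * img[..., 2]
    # Eq. (8): non-linear cosine transformation of the brightness.
    Ma_L = 0.5 * (1.0 - np.cos(alpha * L))
    # Eq. (9): rescale each pixel by the ratio of enhanced to original brightness.
    return np.clip(img * (Ma_L / (L + eps))[..., None], 0.0, 1.0)
```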
The image sharpening filter compensated for contours and highlighted edge information, enhancing image clarity after haze removal. The sharpening process can be described as follows:
F(x, \lambda) = I(x) + \beta \cdot \lambda \cdot \left( I(x) - \mathrm{EnhanceFunc}(I(x)) \right)
Equation (10) represents the image sharpening function, where EnhanceFunc(I(x)) is a function used to enhance details. λ is a weight parameter that balances the original image against the detail-optimized image, and β is a newly introduced parameter used to adjust the balance between enhancement and detail strengthening. Overall, this function achieved local contrast enhancement and detail strengthening of the image.
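A minimal sketch of Equation (10) follows; since the paper does not define EnhanceFunc, a Gaussian blur is substituted here (i.e., unsharp masking), which is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def sharpen(img, lam=0.5, beta=1.0, sigma=2.0):
    """Sharpening filter of Equation (10): F = I + beta * lambda * (I - EnhanceFunc(I))."""
    blurred = gaussian_filter(img, sigma=(sigma, sigma, 0))  # blur spatial dims only
    return np.clip(img + beta * lam * (img - blurred), 0.0, 1.0)
```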
The DLIE module optimized parameters based on a CNN network. Due to the high computational cost of extracting features from high-resolution images using CNNs, there is a risk of resource wastage. Therefore, we downsampled the high-resolution adverse weather synthetic images and extracted image filtering parameters based on this downsampled version. After the images were processed by the TRA module to remove fog, rain, and snow, the filters were applied to the downsampled processed images to enhance their quality.
To minimize computational overhead and enhance network efficiency, we employed a compact CNN network for downsampling images in adverse weather conditions, considering the relatively small number of parameters required for the filters. Before parameter estimation, the images, after fog, rain, and snow removal, underwent bilinear interpolation. This approach ensured that parameter estimation was both reasonable and effective, even with low-resolution images.

3.4. Improved YOLOV9 Detection Module

Figure 2 illustrates the proposed enhanced YOLOV9 detection network architecture. Conv was used to extract image features, RepNCSPELAN4 was the feature extraction fusion module in YOLOv9, Concat was used to concatenate feature maps from different layers, and SPPF was used to enhance the network’s detection capability for targets at different scales. It pooled feature maps at different scales and then concatenated them to capture multi-scale information of targets. Conv CLS (Convolutional Layer for Classification) was the convolutional layer for target classification, classifying each bounding box. Detect transformed the output into the final object detection results, applying non-maximum suppression (NMS) to remove overlapping bounding boxes and filtering the final detection results based on class confidence. Arrows indicated the direction of data flow, starting from the input image and passing through a series of convolutional layers, concatenation operations, and detection layers and, finally, obtaining the object detection results.
Uncontrollable factors such as the size, shape, position, and orientation of objects, as well as occlusions under adverse weather conditions, make it difficult to achieve accurate detection using traditional convolutional operations on images that have already been processed and enhanced. Issues such as extensive false positives or negatives may arise. To address these challenges and improve the model's ability to recognize objects like power lines and trees in processed images, this paper proposes incorporating the Three-Layer Routing Attention module at the last part of the backbone. The Three-Layer Routing Attention module was integrated into the entire network and underwent end-to-end training with appropriate loss functions, ensuring that the model could simultaneously learn image features, processing parameters, and attention weights. This adaptation enhanced the model's focus and generalization capabilities for different image processing and enhancement tasks.
The proposed Three-Layer Routing Attention module, as shown in Figure 3, demonstrated an improved three-layer routing attention mechanism, comprising the following levels. The first layer was the Region Routing Attention Mechanism (RRAM), which operated at a macroscopic region level and introduced a method for constructing association graphs. Utilizing dynamic and adaptive association graph construction strategies, RRAM flexibly adapted to feature relationships in different scenes. This adaptability was achieved through an adaptive threshold mechanism.
Equation (11) defines the feature tensor of the input image. X represents the feature tensor of the input image, H is the height of the feature tensor, W is its width, C is the number of channels, and R denotes the number of regions after partitioning. Equation (12) describes the process of region partitioning and input projection, transforming the input tensor X into X^r, where S₁S₂ represents the size of the region partition. The Reshape operation rearranges the elements of a tensor into a new shape while preserving their order; in this context, it reshapes the input tensor X into X^r, which consists of S₁S₂ regions, each of size HW/(S₁S₂) with C channels.
X \in \mathbb{R}^{H \times W \times C}
X^r = \mathrm{Reshape}\left( X, \; S_1 S_2 \times \frac{HW}{S_1 S_2} \times C \right)
W^q, W^k, and W^v are the projection weights used to project X^r into queries, keys, and values, respectively, in Equation (13). They were learned parameters of the model and were determined during the training process through backpropagation. The projection aimed to map the original feature space into a space where queries, keys, and values could be computed efficiently for the subsequent attention mechanism.
Q = X^r W^q, \quad K = X^r W^k, \quad V = X^r W^v
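The region partitioning and projection of Equations (11)–(13) can be sketched in PyTorch as follows; the feature-map size, region grid, and channel dimension in the usage example are illustrative (H and W are assumed divisible by S₁ and S₂).

```python
import torch

def region_partition_qkv(X, S1, S2, Wq, Wk, Wv):
    """Equations (11)-(13): reshape X in R^{H x W x C} into S1*S2 regions of
    HW/(S1*S2) tokens each, then project to queries, keys, and values."""
    H, W, C = X.shape
    # Split H into S1 blocks and W into S2 blocks, then flatten each region.
    Xr = (X.view(S1, H // S1, S2, W // S2, C)
            .permute(0, 2, 1, 3, 4)
            .reshape(S1 * S2, (H * W) // (S1 * S2), C))
    Q, K, V = Xr @ Wq, Xr @ Wk, Xr @ Wv       # learned projection weights
    return Q, K, V

# Usage with illustrative sizes: a 32 x 32 x 64 feature map split into a 4 x 4 region grid.
X = torch.randn(32, 32, 64)
Wq, Wk, Wv = (torch.randn(64, 64) for _ in range(3))
Q, K, V = region_partition_qkv(X, 4, 4, Wq, Wk, Wv)   # each: (16 regions, 64 tokens, 64 channels)
```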
Building upon the RRAM, the Token Routing Attention Mechanism (TRAM) was introduced to further optimize the selection of attention regions. TRAM employs advanced graph pruning algorithms combined with deep learning techniques to enhance routing effectiveness by learning more complex relationships between each node.
Average Pooling is a pooling operation that calculates the average value of the input data (usually a tensor) within each region. Specifically, in Equations (14) and (15), Q^r and K^r denote the results of average pooling applied to the queries Q and keys K, respectively. Through average pooling, the dimensionality of the input data could be reduced while retaining its important features. TopKIndex is a selection operation that retrieves the indices of the top K values from the input data. In Equation (16), A^r is the attention matrix between queries and keys, where each element represents the attention from one region to another. Equation (17) applied the TopKIndex operation to the attention matrix A^r to select the indices I^r of the regions with the highest attention. These indices represented the most important and noteworthy regions and were used for further processing or analysis:
Q^r = \mathrm{AveragePooling}(Q)
K^r = \mathrm{AveragePooling}(K)
A^r = Q^r (K^r)^T
I^r = \mathrm{TopKIndex}(A^r)
In Equations (18) and (19), we used the Gather operation to retrieve region representations from the global keys K corresponding to the pre-determined top-k indices I^r. This retained only the representations of keys associated with the most important regions, thereby reducing computational complexity. Following Equation (20), we performed a token-to-token attention operation between the queries Q and the retrieved token keys K^g and values V^g. This operation assigned attention weights based on the similarity between queries and keys and used these weights to compute a weighted sum of values, obtaining context-relevant representations for each query. We then added a local contextual enhancement (LCE) term to obtain O₁, providing richer local context information and enhancing the model's understanding of the input image.
K^g = \mathrm{Gather}(K, I^r)
V^g = \mathrm{Gather}(V, I^r)
O_1 = \mathrm{TokenAttention}(Q, K^g, V^g) + \mathrm{LCE}(V)
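The token routing steps of Equations (14)–(20) can be sketched as below. The top-k value and the scaled-dot-product form of the token attention are illustrative, and the LCE term is approximated by the values themselves rather than the local (e.g., depthwise convolution) enhancement the full module would use.

```python
import torch
import torch.nn.functional as F

def token_routing_attention(Q, K, V, topk=4):
    """Equations (14)-(20). Q, K, V: (R, T, C) with R regions of T tokens each."""
    R, T, C = Q.shape
    # Eqs. (14)-(15): region-level descriptors via average pooling over tokens.
    Qr, Kr = Q.mean(dim=1), K.mean(dim=1)                  # (R, C)
    # Eqs. (16)-(17): region-to-region attention and top-k routing indices.
    Ar = Qr @ Kr.t()                                       # (R, R)
    Ir = Ar.topk(topk, dim=-1).indices                     # (R, topk)
    # Eqs. (18)-(19): gather the tokens of the selected regions for each query region.
    Kg = K[Ir].reshape(R, topk * T, C)
    Vg = V[Ir].reshape(R, topk * T, C)
    # Eq. (20): token-to-token attention within the routed regions.
    attn = F.softmax(Q @ Kg.transpose(1, 2) / C ** 0.5, dim=-1)
    O1 = attn @ Vg
    # LCE(V): approximated here by V itself; the full module would apply a
    # local enhancement such as a depthwise convolution over V.
    return O1 + V
```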
At the global level, the Global Routing Attention Mechanism (GRAM) was established, introducing a more extensive modeling of positional relationships. GRAM not only performed routing within each region but also introduced a more global routing mechanism spanning the entire input image.
Equation (21) computed global representations by performing average pooling on the region-level queries Q^r and keys K^r. Equation (22) calculated the global attention matrix A^g by taking the dot product of the global queries and keys. Equation (23) selected the global regions of interest by retrieving the global indices I^g with the highest attention.
Q^g = \mathrm{AveragePooling}(Q^r), \quad K^g = \mathrm{AveragePooling}(K^r)
A^g = Q^g (K^g)^T
I^g = \mathrm{TopKIndex}(A^g)
Equations (24) and (25) retrieved the corresponding global representations from the keys and values based on the global indices I^g. Equation (26) performed a global attention operation between the queries Q and the gathered global token keys K^T and values V^T.
K^T = \mathrm{Gather}(K, I^g)
V^T = \mathrm{Gather}(V, I^g)
O_2 = \mathrm{GlobalAttention}(Q, K^T, V^T) + \mathrm{LCE}(V)
The proposed three-layer routing attention mechanism comprehensively captured the correlations and feature relationships at different levels of the input data, thereby enhancing the model’s performance.

3.5. Loss Function

The improved YOLOV9 incorporated a three-layer routing attention mechanism comprising region routing, token routing, and global routing, aiming to achieve optimal recognition and detection capability under fog, rain, snow, and normal weather conditions. To adapt to image features under different weather conditions, IDP-YOLOV9 adopted a comprehensive loss function during training, combining detection loss and routing attention loss to enhance the model's adaptability. The entire network was trained end-to-end under the improved YOLOV9 detection loss to ensure mutual adaptation between the internal modules of IDP-YOLOV9. To further address potential domain shifts introduced by synthetic data, IDP-YOLOV9 combined mixed training with real datasets to bring the model closer to real-world environments, thus improving its robustness under adverse weather conditions.
The overall loss function consisted of the detection loss derived from YOLOV9 and the routing attention loss across different layers, forming a composite measure. In addition to detection and routing attention losses, an IoU loss function was introduced to measure the overlap between predicted bounding boxes and ground truth bounding boxes. The incorporation of the IoU loss function enabled the model to predict the position and shape of objects more accurately, thereby improving the accuracy of object detection.
The parameters of the routing attention loss layer and the parameters in the IoU loss function were dynamically adjusted based on the processing results of different weather factors by the TRA module. By associating the scale factor ratio with the feature parameters, the scale factor ratio could be adjusted to a smaller value when the density of raindrops or snowflakes increased, ensuring that the bounding boxes more accurately captured the target object. Based on the characteristics of rainy and snowy days, the calculation of intersection and union was adjusted by a blur factor. Due to the influence of weather, the object boundaries could be more blurred, so relaxing the definition of intersection enabled better adaptation to these situations. We could reflect the importance of different weather conditions by adjusting the weight parameters in the loss function. When it was rainy or snowy, λ_region and λ_token could be adjusted to larger values to pay more attention to the generation and adjustment of bounding boxes.
\mathrm{inter} = \left( \min(b_r^{gt}, b_r) - \max(b_l^{gt}, b_l) \right) \left( \min(b_b^{gt}, b_b) - \max(b_t^{gt}, b_t) \right) \cdot \mathrm{fuzz}
\mathrm{union} = w^{gt} h^{gt} r^2 + w h r^2 - \mathrm{inter}
\mathrm{IoU}_{inner} = \frac{\mathrm{inter}}{\mathrm{union}}
L_{IoU} = 1 - \mathrm{IoU}_{inner}
In these formulas, b_r^gt, b_l^gt, b_b^gt, and b_t^gt represent the right, left, bottom, and top boundaries of the ground truth bounding box, while b_r, b_l, b_b, and b_t represent the right, left, bottom, and top boundaries of the predicted bounding box. w^gt and h^gt denote the width and height of the ground truth bounding box, and w and h denote the width and height of the predicted bounding box. r denotes the scale factor ratio and fuzz represents the blur factor. The final total loss function can be expressed as follows:
L_{total} = L_{det} + \lambda_{region} L_{region} + \lambda_{token} L_{token} + \lambda_{global} L_{global} + \lambda_{IoU} L_{IoU}
Here, L_det represents the YOLOV9 detection loss, and L_region, L_token, and L_global represent the losses of the region, token, and global routing layers, respectively. The hyperparameters λ_region, λ_token, λ_global, and λ_IoU were used to adjust the weights of their respective losses and balance their effects.
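The weather-adaptive IoU term and the total loss can be sketched as follows. How the scale factor r and the blur factor fuzz are derived from the TRA weather features is not specified in detail, so the values passed in here (and the loss weights in the comment) are illustrative.

```python
import torch

def inner_iou_loss(pred, gt, r=1.0, fuzz=1.0, eps=1e-7):
    """Scale- and blur-adjusted IoU loss. pred/gt: (N, 4) boxes as
    (left, top, right, bottom). r is the scale factor ratio and fuzz the
    blur factor; both would be set from the TRA weather features, e.g.,
    smaller r and larger fuzz for dense rain or snow (an assumption)."""
    inter_w = (torch.min(pred[:, 2], gt[:, 2]) - torch.max(pred[:, 0], gt[:, 0])).clamp(min=0)
    inter_h = (torch.min(pred[:, 3], gt[:, 3]) - torch.max(pred[:, 1], gt[:, 1])).clamp(min=0)
    inter = inter_w * inter_h * fuzz
    w, h = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_gt, h_gt = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    union = w_gt * h_gt * r ** 2 + w * h * r ** 2 - inter
    return (1.0 - inter / (union + eps)).mean()

# Total loss with illustrative attention-layer weights, raised for rain/snow:
# L_total = L_det + 0.5 * L_region + 0.5 * L_token + 0.25 * L_global + 1.0 * L_IoU
```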
During the training process of the IDP-YOLOV9 network, various data augmentation techniques were utilized, including image flipping, cropping, and transformations, to extend the training dataset. Additionally, random resizing of images to 128n × 128n, where n ∈ [9, 19], was performed to enhance the model's adaptability to different input sizes. The RAdam (Rectified Adam) optimizer was employed for better convergence performance during training. Algorithm 1 summarizes the training process of our proposed method, as shown below.
Algorithm 1 Training Procedure Pseudocode of IDP-YOLOV9
Initialize the improved YOLOv9 model and the RAdam optimizer with lr = 0.0001. Initialize the TRA parameter estimator P_θ and the DLIE parameter estimator P_φ with random weights θ and φ.
Training parameters: epochs = 150, batch size = 8.
Prepare the training dataset.
for epoch in range(num_epochs):
  for batch_idx, (image_batch, objects) in enumerate(train_loader):
    optimizer.zero_grad()                              # reset optimizer gradients
    P_N = P_θ(image_batch)                             # estimate TRA restoration parameters
    image_batch = TRA(image_batch, P_N)                # apply defogging/deraining/desnowing filters
    P_M = P_φ(image_batch)                             # estimate DLIE enhancement parameters
    image_batch = DLIE(image_batch, P_M)               # apply enhancement filters
    outputs = model(image_batch)
    L_det = criterion_detection(outputs, objects)
    L_region = criterion_region(model.region_routing_output, objects)
    L_token = criterion_token(model.token_routing_output, objects)
    L_global = criterion_global(model.global_routing_output, objects)
    L_IoU = criterion_iou(outputs, objects)
    L_total = L_det + λ_region · L_region + λ_token · L_token + λ_global · L_global + λ_IoU · L_IoU
    L_total.backward()
    optimizer.step()                                   # update P_θ, P_φ, and the YOLOv9 network based on L_total
  end for
end for
During the training phase, a comprehensive loss function was employed, providing a holistic optimization approach to enhance detection performance under various weather conditions.

4. Experimental Results

This section provides a systematic analysis and evaluation of both the detection function and the image processing capability of the enhanced YOLOV9 across different conditions. The experimental outcomes of the proposed algorithm in varied adverse weather environments are consolidated and summarized. To validate the IDP-YOLOV9 structure in fog, rain, and snow scenarios, comparisons were made with existing defogging methods (AOD-NET [58], GridDehazeNet [41]), deraining methods (EfficientDeRain [59], ADMM-ResNet [60]), and desnowing methods (U-DenseNet [61], ALL In One [62]). Subsequently, comparisons were conducted with existing detection methods, including the Faster R-CNN, SSD, RetinaNet, YOLOV8, and YOLOV9 methods. Finally, a comprehensive comparison was made among image processing algorithms such as AOD-NET, GridDehazeNet, EfficientDeRain, and ADMM-ResNet. The above-mentioned experiments were performed using a system featuring an NVIDIA GeForce RTX 4090 GPU. The training procedure integrated the YOLOV9 detection loss with the losses from the three routing attention layers, providing a holistic optimization approach for improved detection performance under various weather conditions.

4.1. Implementation Details

This study employed a collaborative approach to train the IDP-YOLOV9 network architecture. In the initial phase, the YOLOV9 detection network was trained without prior knowledge and underwent transfer learning alongside the dataset proposed in this study. The DLIE parameters were reconfigured by utilizing the convolutional block of YOLOV9 up to the fifth layer, and joint training was conducted using a mixed data method. This collaborative training strategy aimed to facilitate maximum information exchange between the image processing module and the object detection network, thereby enhancing overall performance.
To further enhance the generalization capability for various adverse weather conditions, IDP-YOLOV9 dynamically adjusted the scale during training. Initially, a range of scales was established, and the selection of image size was dynamically adapted based on the content and image complexity. This enabled the model to accommodate various input sizes in each iteration, enhancing its robustness. The experiments were carried out utilizing the PyTorch framework and implemented on GPUs.

4.2. Performance Evaluation

The accuracy of detection under different conditions largely depended on the quality of image defogging, deraining, and desnowing and image enhancement. In this study, we compared our image processing module and image enhancement module (TRA + DLIE) with existing dehazing methods (AOD-NET, GridDehazeNet), deraining methods (EfficientDeRain, ADMM-ResNet), and desnowing methods (U-DenseNet, ALL In One). To ensure a fair comparison, we retrained and evaluated these methods on the same training and testing datasets (VOC [63], HAZE [64], FTOD). We used the MSE (Mean Square Error), PSNR (Peak Signal to Noise Ratio), and SSIM (structural similarity) metrics to measure image quality and similarity. Improvements in PSNR and SSIM metrics implied that the processed images were closer to the original images, exhibiting higher quality and better preservation of image details and structures.
\mathrm{PSNR} = 20 \log_{10} \left( \frac{\mathrm{MAX}_I}{\sqrt{\mathrm{MSE}}} \right)
In the given expression, MAX_I represents the maximum achievable pixel value within the image. SSIM is another metric for comprehensive image quality assessment, evaluating image similarity. To bolster the consideration of structural similarity in the image, we introduced a correction term, denoted as φ. The enhanced SSIM formula is expressed as follows:
\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + c_1)(2 \sigma_{xy} + c_2 + \varphi)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2 + \varphi)}
where μ_x and μ_y denote the means of the two images, σ_x² and σ_y² their variances, and σ_xy their covariance. The addition of the correction term φ made the structural similarity evaluation more elastic, adapting to different types and qualities of images and enhancing the algorithm's performance under various adverse weather conditions. The objective evaluation results for weather-removed images on the different datasets are shown in Table 2. The proposed (TRA + DLIE) method outperformed the aforementioned algorithms on the objective evaluation metrics for dehazing, deraining, and desnowing. Figure 4 illustrates the results of the aforementioned desnowing methods. While other deep learning-based techniques exhibit varying levels of image artifacts and incomplete dehazing, our proposed approach to dehazing, deraining, and desnowing excelled at removing haze, raindrop textures, and snowflakes while simultaneously enhancing contrast and saturation. This comprehensive enhancement significantly improved image quality, laying a solid foundation for subsequent object detection tasks.
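For reference, a minimal NumPy sketch of the two metrics above, assuming 8-bit images, a single global SSIM window, and the usual c₁, c₂ constants:

```python
import numpy as np

def psnr(ref, test, max_i=255.0):
    """PSNR = 20 * log10(MAX_I / sqrt(MSE))."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return 20.0 * np.log10(max_i / np.sqrt(mse))

def ssim_corrected(x, y, phi=1e-3, max_i=255.0):
    """Global (single-window) SSIM with the correction term phi added to the
    structure terms, as in the modified formula above. Computing SSIM over
    local windows is omitted for brevity."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    c1, c2 = (0.01 * max_i) ** 2, (0.03 * max_i) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2 + phi)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2 + phi)
    return num / den
```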
Figure 5 visually demonstrates how our small CNN network predicted the image enhancement parameters (WB, gamma, contrast, tone) for the DLIE module in three distinct examples. Leveraging specific image information such as brightness, color, hue, and weather features, the small CNN network learned tailored parameter sets for each image. By filtering out weather-related factors, the images underwent processing through the DLIE module, resulting in enhanced visual effects, improved image clarity, and enriched details, ultimately facilitating more accurate detection.

4.3. Evaluating the Detection Results of the Model

To assess the object detection performance of the proposed model, IDP-YOLOV9, across diverse adverse weather conditions, comprehensive comparative experiments were conducted on the same test dataset, employing both cross-sectional and longitudinal analyses. Initially, the IDP-YOLOV9 algorithm was pitted against leading CNN object detectors such as Faster R-CNN, YOLOV8, YOLOV9, and so on. Table 3 showcases the detection outcomes of these different detectors across varying fog concentrations. Notably, the results indicate that, across multiple adverse weather conditions, the object detection accuracy achieved by IDP-YOLOV9 outperforms the aforementioned algorithms.
To validate the impact of image restoration on subsequent object detection algorithms, comparative experiments were conducted on object detection using different algorithms under various adverse weather conditions, including AOD-NET (defogging), GridDehazeNet (defogging), Efficient-DeRain (deraining), ADMMResNet (deraining), UDenseNet (desnowing), and All In One (desnowing). Figure 6 showcases the outcomes of different snow removal models. The figure illustrates that in snowy weather conditions, the IDP-YOLOV9 algorithm not only effectively detects and processes the weather but also achieves notably higher accuracy and lower miss detection rates compared to other algorithms. The integration of image processing and enhancement within a parallel architecture, along with the jointly optimized YOLOV9 algorithm proposed in this research, significantly enhances performance across diverse conditions. Furthermore, the jointly optimized model demonstrates superior performance compared to the methods mentioned previously, achieving higher average precision across diverse adverse weather conditions (fog, rain, and snow).
The method proposed in this paper exhibits certain advantages in the image processing and detection of power construction site images under various adverse weather conditions. Due to the difficulty in obtaining paired adverse weather images of the same scene, the method proposed in this paper was trained using conventional datasets, including synthetic datasets. Nevertheless, the proposed method is applicable to real-world power construction site environments under diverse adverse weather conditions.

4.4. Ablation Study

Comprehensive ablation experiments were conducted to meticulously verify the efficacy of the proposed image processing and enhancement modules in facilitating subsequent object detection, especially in various adverse weather conditions. The IDP-Improved YOLOV9 method was comprehensively compared with several baseline methods, including Improved YOLOV9, Enhancement + Improved YOLOV9, and MultiRemoval + Improved YOLOV9, across three independent test datasets. Table 4 presents the mAP evaluation results for the three adverse weather conditions: fog, rain, and snow.
In diverse adverse weather environments, the combination of Enhancement + Improved YOLOV9 and MultiRemoval + Improved YOLOV9 demonstrated a significant advantage in improving detection performance compared to independently using Improved YOLOV9. The IDP-Improved YOLOV9 algorithm notably enhanced visibility in various adverse weather images, effectively improving the effects of defogging, deraining, desnowing, and enhancement while significantly increasing the accuracy of object detection. Specifically, in rainy conditions, IDP-Improved YOLOV9 achieved a 6.2% increase in detection accuracy (mAP) compared to Enhancement + Improved YOLOV9 and a 5.5% increase compared to MultiRemoval + Improved YOLOV9.
This series of ablation studies fully validated the synergistic effect of the parallel architecture for the image defogging, deraining, and desnowing modules, as well as the image enhancement module, on object detection algorithms. The experiments demonstrated that the IDP-Improved YOLOV9 method not only enhances visibility but also significantly improves the efficiency of image defogging, deraining, and desnowing and other image processing tasks, leading to an enhancement in object detection performance.

4.5. Discussion

The proposed IDP-YOLOV9 technique demonstrates significant improvements in object detection under fog, rain, and snow conditions. However, this method still has some limitations. It may not generalize well to all adverse weather conditions, requiring more data and adjustments to ensure model performance. Moreover, given the intricate nature of deep learning models and parallel processing architecture, the approach may necessitate considerable computational resources and time. Future work should focus on further expanding and enriching object detection datasets under adverse weather to cover more scenarios and situations.

5. Conclusions

The effectiveness of the IDP-YOLOV9 method is attributable to its parallel optimization architecture, which allows for the flexible adjustment of feature relationships in different scenes, the dynamic tuning of image enhancement parameters, and the utilization of the advanced YOLOV9 network, tailored to power construction sites under various adverse weather conditions. Additionally, the joint learning approach employed by IDP-YOLOV9 enables better feature capturing from restored clear images, while the introduction of a three-layer routing attention mechanism effectively enhances the accuracy of object detection. Through comprehensive comparisons with other algorithms, we demonstrated its performance in addressing complex environments. Specifically, objective evaluation metrics combined with subjective assessment methods were employed to thoroughly evaluate the performance of the Three-Weather Removal Algorithm (TRA) module and the image enhancement (DLIE) module on real image datasets. Compared to existing advanced detection algorithms and non-joint methods, the IDP-YOLOV9 algorithm exhibits superior performance. Overall, the results indicate that our method accurately identifies and effectively removes weather factors, while the improved detection module demonstrates outstanding performance on processed images, providing strong support for visual perception and object recognition at power construction sites by drones.

Author Contributions

Conceptualization, Methodology, Supervision, and Writing—Review and Editing, J.L.; Resources, Investigation, Software, Writing—Original Draft, Validation, and Methodology, Y.F.; Data Curation and Writing—Original Draft, Y.S.; Conceptualization, Methodology, Supervision, Writing—Review and Editing, and Project Administration, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Basic Research Project of the National Defence Science and Industry Bureau (project no. JCKY2022405C010). We would like to express our deepest gratitude to these organizations for their generous funding and support.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this study are available at https://github.com/fyq412/IDP-YOLOV9 (accessed on 12 June 2024).

Acknowledgments

We would like to express our gratitude to Yanhua Shao (The Sixth Research Institute of China Electronics and Information Industry Corporation) for providing computational resources and support.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Diaz Linares, I.; Pardo, A.; Patch, E. IoT Privacy, Security and Forensics Challenges: An Unmanned Aerial Vehicle (UAV) Case Study. In Handbook of Big Data Analytics and Forensics; Springer: Cham, Switzerland, 2022; pp. 7–39. [Google Scholar]
  2. Introducing the Discrete Path Transform (DPT) and its applications in signal analysis, artefact removal, and spoken word recognition. Digit. Signal Process. 2021, 117, 103158. [CrossRef]
  3. Guariglia, E.; Silvestrov, S. Fractional-Wavelet Analysis of Positive definite Distributions and Wavelets on D’(C). In Engineering Mathematics II; Silvestrov, S., Rančić, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 337–353. [Google Scholar]
  4. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  5. Wu, T.H.; Wang, T.W.; Liu, Y.Q. Real-time vehicle and distance detection based on improved yolo v5 network. In Proceedings of the 2021 3rd World Symposium on Artificial Intelligence (WSAI), Guangzhou, China, 18–20 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 24–28. [Google Scholar]
  6. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  7. Farhadi, A.; Redmon, J. Yolov3: An incremental improvement. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 1–6. [Google Scholar]
  8. Engin, D.; Genç, A.; Kemal Ekenel, H. Cycle-dehaze: Enhanced cyclegan for single image dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 825–833. [Google Scholar]
  9. Abbasi, H.; Amini, M.; Yu, F.R. Fog-Aware Adaptive YOLO for Object Detection in Adverse Weather. In Proceedings of the 2023 IEEE Sensors Applications Symposium (SAS), Ottawa, ON, Canada, 18–20 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
  10. Luo, K.; Luo, R.; Zhou, Y. UAV detection based on rainy environment. In Proceedings of the 2021 IEEE 4th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 18–20 June 2021; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar]
  11. Li, B.; Ren, W.; Fu, D.; Tao, D.; Feng, D.; Zeng, W.; Wang, Z. Benchmarking single-image dehazing and beyond. IEEE Trans. Image Process. 2018, 28, 492–505. [Google Scholar] [CrossRef] [PubMed]
  12. Yang, H.H.; Yang, C.H.H.; Tsai, Y.-C.J. Y-net: Multi-scale feature aggregation network with wavelet structure similarity loss function for single image dehazing. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 2628–2632. [Google Scholar]
  13. Yang, A.P.; Liu, J.; Xing, J.N. Content feature and style feature fusion network for single image dehazing. Acta Autom. Sin. 2020, 46, 1–9. [Google Scholar]
  14. Zhang, K.; Yan, X.; Wang, Y. Adaptive Dehazing YOLO for Object Detection. In Proceedings of the International Conference on Artificial Neural Networks, Heraklion, Greece, 26–29 September 2023; Springer Nature: Cham, Switzerland, 2023; pp. 14–27. [Google Scholar]
  15. Shao, M.; Li, L.; Wang, H. Selective generative adversarial network for raindrop removal from a single image. Neurocomputing 2021, 426, 265–273. [Google Scholar] [CrossRef]
  16. Qin, Q.; Chang, K.; Huang, M.; Li, G. DENet: Detection-driven Enhancement Network for Object Detection Under Adverse Weather Conditions. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 2813–2829. [Google Scholar]
  17. Li, W.; Wang, M.; Wang, H. Object detection based on semi-supervised domain adaptation for imbalanced domain resources. Mach. Vis. Appl. 2020, 31, 18. [Google Scholar] [CrossRef]
  18. Yin, X.; Yu, Z.; Fei, Z.; Lv, W.; Gao, X. PE-YOLO: Pyramid Enhancement Network for Dark Object Detection. In Proceedings of the International Conference on Artificial Neural Networks, Heraklion, Greece, 26–29 September 2023; Springer Nature: Cham, Switzerland, 2023; pp. 163–174. [Google Scholar]
  19. Sindagi, V.A.; Oza, P.; Yasarla, R. Prior-based domain adaptive object detection for hazy and rainy conditions. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIV 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 763–780. [Google Scholar]
  20. Hassaballah, M.; Kenk, M.A.; Muhammad, K. Vehicle detection and tracking in adverse weather using a deep learning framework. IEEE Trans. Intell. Transp. Syst. 2020, 22, 4230–4242. [Google Scholar] [CrossRef]
  21. Yu, Z.; Bajaj, C. A fast and adaptive method for image contrast enhancement. In Proceedings of the 2004 International Conference on Image Processing, Singapore, 24–27 October 2004; ICIP’04; IEEE: Piscataway, NJ, USA, 2004; Volume 2, pp. 1001–1004. [Google Scholar]
  22. Wang, W.; Chen, Z.; Yuan, X. Adaptive image enhancement method for correcting low-illumination images. Inf. Sci. 2019, 496, 25–41. [Google Scholar] [CrossRef]
  23. Hu, Y.; He, H.; Xu, C. Exposure: A white-box photo post-processing framework. ACM Trans. Graph. (TOG) 2018, 37, 1–17. [Google Scholar] [CrossRef]
  24. Yu, R.; Liu, W.; Zhang, Y.; Qu, Z.; Zhao, D.; Zhang, B. Deepexposure: Learning to expose photos with asynchronously reinforced adversarial learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 2–8 December 2018; pp. 2153–2163. [Google Scholar]
  25. Yang, L.; Su, H.; Zhong, C.; Meng, Z.; Luo, H.; Li, X.; Tang, Y.Y.; Lu, Y. Hyperspectral image classification using wavelet transform-based smooth ordering. Int. J. Wavelets Multiresolut. Inf. Process 2019, 17, 1950050. [Google Scholar] [CrossRef]
  26. Zeng, H.; Cai, J.; Li, L. Learning image-adaptive 3d lookup tables for high performance photo enhancement in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2058–2073. [Google Scholar] [CrossRef] [PubMed]
  27. Wang, Y.; Yan, X.; Zhang, K. TogetherNet: Bridging Image Restoration and Object Detection Together via Dynamic Enhancement Learning. Comput. Graph. Forum 2022, 41, 465–476. [Google Scholar] [CrossRef]
  28. Zhang, K.; Zuo, W.; Chen, Y. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Trans. Image Process. 2017, 26, 3142–3155. [Google Scholar] [CrossRef] [PubMed]
  29. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  30. Guariglia, E. Harmonic Sierpinski Gasket and Applications. Entropy 2018, 20, 714. [Google Scholar] [CrossRef] [PubMed]
  31. Jing, Y.; Yang, Y.; Feng, Z. Neural style transfer: A review. IEEE Trans. Vis. Comput. Graph. 2019, 26, 3365–3385. [Google Scholar] [CrossRef] [PubMed]
  32. Kang, L.W.; Lin, C.W.; Fu, Y.H. Automatic single-image-based rain streaks removal via image decomposition. IEEE Trans. Image Process. 2011, 21, 1742–1755. [Google Scholar] [CrossRef] [PubMed]
  33. Fu, X.; Huang, J.; Zeng, D. Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3855–3863. [Google Scholar]
  34. Yang, W.; Tan, R.T.; Feng, J.; Liu, J.; Yan, S.; Guo, Z. Joint rain detection and removal from a single image with contextualized deep networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 1377–1393. [Google Scholar] [CrossRef] [PubMed]
  35. Li, R.; Cheong, L.F.; Tan, R.T. Heavy rain image restoration: Integrating physics model and conditional adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1633–1642. [Google Scholar]
  36. Eigen, D.; Krishnan, D.; Fergus, R. Restoring an image taken through a window covered with dirt or rain. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 633–640. [Google Scholar]
  37. Quan, Y.; Deng, S.; Chen, Y.; Ji, H. Deep learning for seeing through window with raindrops. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2463–2471. [Google Scholar]
  38. Zhang, Z.; Wu, S.; Wang, S. Single-image snow removal algorithm based on generative adversarial networks. IET Image Process. 2023, 17, 3580–3588. [Google Scholar] [CrossRef]
  39. Guo, C.G.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference Computer Vision Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1780–1789. [Google Scholar]
  40. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2341–2353. [Google Scholar]
  41. Liu, X.; Ma, Y.; Shi, Z.; Chen, J. GridDehazeNet: Attention-based multi-scale network for image dehazing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7314–7323. [Google Scholar]
  42. Dong, H.; Pan, J.; Xiang, L. Multi-scale boosted dehazing network with dense feature fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2157–2167. [Google Scholar]
  43. Zheng, X.; Tang, Y.Y.; Zhou, J. A Framework of Adaptive Multiscale Wavelet Decomposition for Signals on Undirected Graphs. IEEE Trans. Signal Process. 2019, 67, 1696–1711. [Google Scholar] [CrossRef]
  44. Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11908–11915. [Google Scholar]
  45. Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. AOD-Net: All-in-one dehazing network. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4770–4778. [Google Scholar]
  46. Ju, M.; Ding, C.; Guo, C.A. IDRLP: Image dehazing using region line prior. IEEE Trans. Image Process. 2021, 30, 9043–9057. [Google Scholar] [CrossRef] [PubMed]
  47. Yeh, C.H.; Huang, C.H.; Kang, L.W. Multi-scale deep residual learning-based single image haze removal via image decomposition. IEEE Trans. Image Process. 2019, 29, 3153–3167. [Google Scholar] [CrossRef] [PubMed]
  48. Nalla, B.T.; Sharma, T.; Verma, N.K.; Sahoo, S.R. Image dehazing for object recognition using faster RCNN. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–7. [Google Scholar]
  49. Likhitaa, P.S.; Anand, R. A Comparative Analysis of Image Dehazing using Image Processing and Deep Learning Techniques. In Proceedings of the 2021 6th International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 8–10 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1611–1616. [Google Scholar]
  50. Hnewa, M.; Radha, H. Multiscale domain adaptive yolo for cross-domain object detection. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3323–3327. [Google Scholar]
  51. Liu, W.; Ren, G.; Yu, R. Image-adaptive YOLO for object detection in adverse weather conditions. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 1792–1800. [Google Scholar]
  52. Berry, M.V.; Lewis, Z.V.; Nye, J.F. On the Weierstrass-Mandelbrot fractal function. Proc. R. Soc. Lond. Ser. A 1980, 370, 459–484. [Google Scholar]
  53. Dai, G.; Hu, L.; Fan, J. A Deep Learning-Based Object Detection Scheme by Improving YOLOv5 for Sprouted Potatoes Datasets. IEEE Access 2022, 10, 85416–85428. [Google Scholar] [CrossRef]
  54. Ang, G.; Xing, L.; Chen, X. A dense pedestrian detection algorithm with improved YOLOv8. J. Graph. 2023, 44, 890–898. [Google Scholar]
  55. Jinghan, Y.; Shaojun, Q.; Zekai, Y. Traffic sign recognition model in haze weather based on YOLOv5. Comput. Appl. 2022, 42, 2876–2884. [Google Scholar]
  56. Li, Y.; Zeng, J.; Shan, S. Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Trans. Image Process. 2018, 28, 2439–2450. [Google Scholar] [CrossRef] [PubMed]
  57. Li, Y.; Liu, Y.; Zhang, H. Occlusion-Aware Transformer With Second-Order Attention for Person Re-Identification. IEEE Trans. Image Process. 2024, 33, 3200–3211. [Google Scholar] [CrossRef] [PubMed]
  58. Mosleh, A.; Sharma, A.; Onzon, E.; Mannan, F.; Robidoux, N.; Heide, F. Hardware-in-the-Loop End-to-End Optimization of Camera Image Processing Pipelines. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7526–7535. [Google Scholar]
  59. Guo, Q.; Sun, J.; Xu, F.J. EfficientDeRain: Learning pixel-wise dilation filtering for high-efficiency single-image deraining. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 1487–1495. [Google Scholar]
  60. Ding, Y.; Xue, X.; Wang, Z. Domain knowledge driven deep unrolling for rain removal from single image. In Proceedings of the 2018 7th International Conference on Digital Home (ICDH), Guilin, China, 30 November–1 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 14–19. [Google Scholar]
  61. Li, P.; Tian, J.; Wang, G. Image snow removal methods for robotic environment fusion. J. Mech. Eng. 2019, 55, 98–104. [Google Scholar] [CrossRef]
  62. Li, R.; Tan, R.T.; Cheong, L.F. All in one bad weather removal using architectural search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3175–3185. [Google Scholar]
  63. Everingham, M.; Van Gool, L.; Williams, C.K.I. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  64. Ancuti, C.O.; Ancuti, C.; Sbert, M. Dense-haze: A benchmark for image dehazing with dense-haze and haze-free images. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1014–1018. [Google Scholar]
Figure 1. (a) Network architecture diagram of IDP-YOLOV9; (b) parallel framework diagram of TRA (Three-Weather Removal Algorithm) module and DLIE (Deep Learning-based Image Enhancement) module; (c) TRA module for Image DeFogging, DeRaining, and DeSnowing; (d) DLIE module for Image Enhancement.
Figure 2. Improved YOLOV9 structure diagram. The red part represents the proposed three-layer routing attention mechanism module.
Figure 3. Tri-Level Routing Attention Module. The mm and softmax blocks denote matrix transpose multiplication and softmax normalization of the attention weights, respectively. O1 and O2 are obtained by applying token-level and global attention to the gathered key–value pairs and then adding the context-enhancement terms.
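To make the data flow of Figure 3 concrete, the following is a minimal PyTorch sketch of one routing-attention step: coarse region-to-region scores select the top-k key–value regions for each query region, and the gathered tokens then pass through the matrix-transpose multiplication and softmax normalization named in the caption. The function name, tensor shapes, and the single routing level are illustrative assumptions; this simplifies the tri-level module rather than reproducing it.

```python
import torch

def routed_attention(q, k, v, region_scores, topk=4):
    """Simplified sketch of one routing-attention step.

    q, k, v:        (num_regions, tokens_per_region, dim) query/key/value tokens
    region_scores:  (num_regions, num_regions) coarse region-to-region affinity
    topk:           number of key/value regions gathered per query region

    The 'mm' + softmax pair of Figure 3 appears here at the token level;
    the region level only routes which key/value tokens are gathered.
    """
    dim = q.shape[-1]

    # Region-level routing: keep the top-k most relevant regions per query region.
    _, idx = region_scores.topk(topk, dim=-1)         # (num_regions, topk)

    # Gather the key/value tokens of the selected regions.
    k_sel = k[idx].flatten(1, 2)                      # (num_regions, topk*tokens, dim)
    v_sel = v[idx].flatten(1, 2)

    # Token-level attention: transpose multiplication, softmax, weighted sum.
    attn = torch.softmax(q @ k_sel.transpose(-1, -2) / dim ** 0.5, dim=-1)
    return attn @ v_sel                               # (num_regions, tokens_per_region, dim)

# Toy usage: 16 regions of 7x7 = 49 tokens with 64-dimensional embeddings.
q = k = v = torch.randn(16, 49, 64)
scores = torch.randn(16, 16)
out = routed_attention(q, k, v, scores)               # shape (16, 49, 64)
```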
Figure 4. Comparison of different image processing algorithms for severe weather conditions in power scenes captured from drones’ perspective: (a) snowy input image; (b) AOD-Net algorithm; (c) GridDehazeNet algorithm; (d) IDP-YOLOV9.
Figure 5. The entire process of the IDP + YOLOV9 network, including (a) defogging processes, (b) deraining processes, and (c) desnowing processes.
Figure 6. The processing and detection effects of different algorithms on power scene images from drones’ perspective in snowy weather: (a) U-DenseNet algorithm; (b) All In One algorithm; (c) IDP-YOLOV9.
Table 1. Four filters and parameter table.

| Filter   | Mapping Function                                   | Parameters      |
|----------|----------------------------------------------------|-----------------|
| WB       | X_o = (W_r · r_i, W_g · g_i, W_b · b_i)            | W_r, W_g, W_b   |
| Gamma    | X_o = X_i^G                                        | G               |
| Contrast | X_o = α · Ma(X_i) + (1 − α) · X_i                  | α               |
| Tone     | X_o = (L_{t_r}(r_i), L_{t_g}(g_i), L_{t_b}(b_i))   | {t_i}           |
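As a reading aid for Table 1, the sketch below applies the four mapping functions to a normalized RGB image in NumPy. Treating Ma(·) as a full-range contrast stretch and building the tone curve L_t from normalized increments t_i are assumptions made for illustration; this is not the exact DLIE filter implementation.

```python
import numpy as np

def white_balance(x, wr, wg, wb):
    """WB filter: X_o = (W_r * r_i, W_g * g_i, W_b * b_i)."""
    return x * np.array([wr, wg, wb], dtype=x.dtype)

def gamma(x, g):
    """Gamma filter: X_o = X_i ** G, with X_i clipped to (0, 1]."""
    return np.clip(x, 1e-6, 1.0) ** g

def contrast(x, alpha):
    """Contrast filter: X_o = alpha * Ma(X_i) + (1 - alpha) * X_i.
    Ma(.) is assumed here to be a full-range stretch of the input."""
    stretched = (x - x.min()) / (x.max() - x.min() + 1e-6)
    return alpha * stretched + (1.0 - alpha) * x

def tone(x, increments):
    """Tone filter: monotone piecewise-linear curve L_t built from the
    per-segment increments t_i and applied to every channel."""
    t = np.asarray(increments, dtype=np.float64)
    knots = np.concatenate([[0.0], np.cumsum(t / t.sum())])  # curve values
    xs = np.linspace(0.0, 1.0, len(knots))                   # curve inputs
    return np.interp(x, xs, knots)

# Example: chain the four filters on a random image in [0, 1].
img = np.random.rand(256, 256, 3)
out = tone(contrast(gamma(white_balance(img, 1.1, 1.0, 0.9), 0.8), 0.5),
           increments=[1.0, 1.2, 0.8, 1.0])
```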
Table 2. Evaluation of defogging, deraining, and desnowing methods on different datasets against benchmarks by IDP-YOLOV9. ↑ indicates that our method achieves higher scores for this metric.

| Dataset | Metric | AOD-NET (DeFog) | GridDehazeNet (DeFog) | EfficientDeRain (DeRain) | ADMM-ResNet (DeRain) | UDenseNet (DeSnow) | All in One (DeSnow) | IDP-YOLOV9 (Ours, Ave Value) |
|---------|--------|-----------------|-----------------------|--------------------------|----------------------|--------------------|---------------------|------------------------------|
| VOC     | PSNR   | 15.06 | 18.80 | 16.36 | 17.58 | 17.14 | 17.96 | 19.80 |
| VOC     | SSIM   | 0.805 | 0.814 | 0.809 | 0.832 | 0.816 | 0.841 | 0.847 |
| HAZE    | PSNR   | 14.76 | 16.13 | 17.56 | 18.02 | 15.52 | 16.96 | 18.24 |
| HAZE    | SSIM   | 0.763 | 0.799 | 0.784 | 0.773 | 0.688 | 0.756 | 0.736 |
| FTOD    | PSNR   | 20.15 | 22.36 | 18.86 | 19.36 | 19.52 | 21.08 | 21.97 |
| FTOD    | SSIM   | 0.727 | 0.805 | 0.655 | 0.714 | 0.739 | 0.814 | 0.809 |
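The PSNR and SSIM scores in Table 2 can be reproduced for any restored/clean image pair with standard library routines. The following minimal sketch assumes scikit-image 0.19 or later (for the channel_axis argument) and images normalized to [0, 1]; the helper names are illustrative.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(restored, reference):
    """PSNR/SSIM of one restored image against its clean reference.
    Both are float arrays in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(reference, restored, data_range=1.0)
    ssim = structural_similarity(reference, restored,
                                 data_range=1.0, channel_axis=-1)
    return psnr, ssim

def evaluate_dataset(pairs):
    """Average PSNR and SSIM over (restored, reference) pairs, as in Table 2."""
    scores = np.array([evaluate_pair(r, g) for r, g in pairs])
    return scores.mean(axis=0)   # (mean PSNR, mean SSIM)
```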
Table 3. Average mAP for different object detection methods under various adverse weather conditions.

| Test | Faster R-CNN | SSD   | RetinaNet | YOLOV8 | YOLOV9 | IDP-YOLOV9 (Ours) |
|------|--------------|-------|-----------|--------|--------|-------------------|
| fog  | 0.517        | 0.525 | 0.535     | 0.587  | 0.608  | 0.674             |
| rain | 0.524        | 0.535 | 0.545     | 0.596  | 0.624  | 0.691             |
| snow | 0.514        | 0.529 | 0.538     | 0.588  | 0.638  | 0.704             |
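For reference, the mAP values in Table 3 rest on an IoU-based matching rule: a detection counts as a true positive at mAP50 when its IoU with an unmatched ground-truth box of the same class is at least 0.5. The snippet below is a minimal sketch of that matching step only; the full metric additionally sorts detections by confidence and integrates the per-class precision–recall curve. The helper names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_detection(pred_box, gt_boxes, iou_threshold=0.5):
    """Return the index of the best-matching ground-truth box, or None.
    A detection is a true positive for mAP50 when its best IoU with an
    unmatched ground-truth box of the same class reaches the threshold."""
    best_idx, best_iou = None, iou_threshold
    for i, gt in enumerate(gt_boxes):
        overlap = iou(pred_box, gt)
        if overlap >= best_iou:
            best_idx, best_iou = i, overlap
    return best_idx
```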
Table 4. Comparison of ablation experiments. √ indicates that the corresponding module (multi-weather removal, image enhancement, object detection) is used in that configuration.

| Model                            | MultiRemoval | Enhance | Detection | Fog (%)     | Rain (%)    | Snow (%)    |
|----------------------------------|--------------|---------|-----------|-------------|-------------|-------------|
| Improved YOLOV9                  |              |         | √         | 58.1 (−7.7) | 61.0 (−6.6) | 60.3 (−6.1) |
| MultiRemoval + Improved YOLOV9   | √            |         | √         | 60.5 (−5.3) | 62.1 (−5.5) | 62.6 (−3.8) |
| Enhancement + Improved YOLOV9    |              | √       | √         | 59.2 (−6.6) | 61.4 (−6.2) | 60.9 (−5.5) |
| IDP-YOLOV9                       | √            | √       | √         | 65.8        | 67.6        | 66.4        |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
