The experiments were conducted on an AMD Ryzen 5 1600 CPU with 8 GB RAM and an NVIDIA GeForce GTX 1650 (GDDR6, 4 GB video memory), running a 64-bit Ubuntu 18.04 software environment with the PyTorch 1.2.0 deep-learning framework and the CUDA 11 parallel computing framework.
4.1. Defogging Experiments
Generally, peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and information entropy (S) are adopted to evaluate the performance of the defogging algorithm.
PSNR assesses the model by calculating the mean squared error (MSE) between the fogged and unfogged images. The smaller the MSE, the larger the PSNR, indicating greater similarity between the images. For an H × W image, the MSE and PSNR are obtained using Equation (12) and Equation (13), respectively.
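Equations (12) and (13) are assumed here to follow the standard definitions of these metrics; for reference, with $X$ the reference image, $Y$ the image under evaluation, and $\mathrm{MAX}$ the maximum pixel value (255 for 8-bit images), they take the form:

$$\mathrm{MSE} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\bigl(X(i,j) - Y(i,j)\bigr)^{2},\qquad \mathrm{PSNR} = 10\log_{10}\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}$$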
SSIM indicates the similarity of two images, evaluating it in three aspects: brightness, contrast, and structure. The closer the result is to 1, the more similar and the less distorted the two images are. It is expressed as follows:

$$\mathrm{SSIM}(x,y) = l(x,y)\,c(x,y)\,s(x,y)$$

$$l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1},\qquad c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2},\qquad s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}$$

where $\sigma_x^2$ and $\sigma_y^2$ denote the variances of image $x$ and image $y$; $\mu_x$ and $\mu_y$ denote the means of image $x$ and image $y$; $C_1$, $C_2$, and $C_3$ are constant terms; and $\sigma_{xy}$ denotes the covariance of image $x$ and image $y$.
The information entropy evaluation value is obtained by finding the total expectation over the image grayscale values, and its result characterizes the amount of detailed information in an image: the more information the image contains, the larger the information entropy. Its calculation formula is expressed as follows:

$$S = -\sum_{i=0}^{255}\sum_{j=0}^{255} p_{ij}\log_2 p_{ij},\qquad p_{ij} = \frac{f(i,j)}{MN}$$

where $i$ is the gray value of a pixel, $j$ is the average gray value of a small region centered on that pixel, $f(i,j)$ is the number of occurrences of the binary pair $(i,j)$, and $M$ and $N$ denote the length and width of the image, respectively.
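For concreteness, a minimal NumPy sketch of the three metrics is given below, assuming 8-bit grayscale inputs. The single-window SSIM shown here folds $C_3 = C_2/2$ into the contrast/structure terms, whereas practical implementations typically slide a local window over the image; the neighborhood size `k` is illustrative.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def psnr(x, y, max_val=255.0):
    # PSNR via the MSE of the two images (Equations (12) and (13))
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    # Single-window SSIM; choosing C3 = C2/2 merges s(x,y) into c(x,y)
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def entropy_2d(img, k=3):
    # Two-dimensional entropy over (pixel gray value i, k x k neighborhood mean j) pairs
    i_vals = img.astype(np.uint8).ravel()
    j_vals = uniform_filter(img.astype(np.float64), size=k).astype(np.uint8).ravel()
    hist = np.zeros((256, 256))
    np.add.at(hist, (i_vals, j_vals), 1.0)     # f(i, j): occurrence counts
    p = hist / img.size                        # p_ij = f(i, j) / (M * N)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```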
To verify the effectiveness of the improved dark channel prior algorithm, this paper compares the effects of the three improvement strategies on model performance, with experimental results derived from the CPILD test set, as shown in Table 1. As the table shows, none of the improvement strategies changes the SSIM value greatly. However, the PSNR is significantly improved over the original model, reaching 16.27 dB after fitting the transmittance estimation using the dual dark channel; optimizing the atmospheric light value estimation improves the PSNR by 0.36 dB, and using CLAHE as the restoration module improves the PSNR metric slightly. Combining these three improvements, the PSNR improves by 0.55 dB compared to the original model. These experiments show that the improved model achieves both a better defogging effect and higher image quality.
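As context for these improvements, a minimal sketch of the baseline single-dark-channel pipeline that the dual-dark-channel and atmospheric-light refinements build on is given below. The function and parameter names are illustrative, the image is assumed to be RGB scaled to [0, 1], and the paper's dual-dark-channel fitting and CLAHE restoration module are not reproduced here.

```python
import numpy as np
import cv2

def dark_channel(img, patch=15):
    # Per-pixel minimum over the RGB channels, then a minimum filter over a patch
    min_rgb = img.min(axis=2)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    return cv2.erode(min_rgb, kernel)

def atmospheric_light(img, dark, top_fraction=0.001):
    # Average the image colors at the brightest dark-channel pixels
    n = max(1, int(dark.size * top_fraction))
    idx = np.argsort(dark.ravel())[-n:]
    return img.reshape(-1, 3)[idx].mean(axis=0)

def transmission(img, A, omega=0.95, patch=15):
    # t(x) = 1 - omega * dark_channel(I(x) / A)
    return 1.0 - omega * dark_channel(img / A, patch)

def recover(img, t, A, t_min=0.1):
    # J(x) = (I(x) - A) / max(t(x), t_min) + A
    t = np.clip(t, t_min, 1.0)[..., None]
    return np.clip((img - A) / t + A, 0.0, 1.0)
```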
Defogging methods based on physical and non-physical models come in several variants, and most evaluations use the full-reference method on synthetic datasets containing pairs of clear and hazy images, so results are not necessarily consistent across haze concentrations or method variants. The best defogging method on thin-haze images does not necessarily have the best defogging effect on thick-haze images, and the best method under full-reference image quality evaluation does not necessarily perform consistently under no-reference evaluation. Therefore, the five most representative image-defogging methods, namely the dark channel prior, automatic color equalization, single-scale Retinex, the DehazeNet network, and multi-scale Retinex, were selected for comparison tests, and the experimental results are shown in Figure 10.
As seen in Figure 10, the dark-channel-prior-based method (DCP) makes the image visually clearer after processing, but the image becomes darker overall, as shown in Figure 10c. The automatic-color-equalization-based defogging algorithm (ACE) enhances the contrast of the image but produces more noise, which strongly affects the subsequent image recognition work, as is evident in Figure 10d. The single-scale (SSR) and multi-scale (MSR) Retinex-based defogging algorithms are less effective on foggy images, as shown in Figure 10e,f, and the deep-learning-based DehazeNet defogging algorithm does not obtain good results for this scene either, as shown in Figure 10g. The improved defogging algorithm (Dark) performs better than the previous five algorithms: the image quality is significantly improved, and the image saturation and color are enhanced as well.
Table 2 shows the performances of the six defogging algorithms on the CPILD test set.
As seen in Table 2, the proposed defogging algorithm achieves the best PSNR and SSIM metrics, with PSNR values 0.55, 0.24, 4.73, 3.82, and 0.12 dB higher than those of the dark channel prior (DCP), automatic color equalization (ACE), single-scale Retinex (SSR), multi-scale Retinex (MSR), and DehazeNet, respectively. As for SSIM, the proposed algorithm achieves 0.8, the value closest to 1 among the six algorithms. The best PSNR and SSIM values indicate that the proposed algorithm processes foggy images with the least image distortion, the best defogging effect, and the highest image quality. In the comparison of information entropy, the SSR algorithm recovers the least information on the test set, only 6.41 bits; ACE obtains the second-best result, with 7.23 bits; and the proposed algorithm recovers the most, with an information entropy of 7.55 bits. This indicates that the proposed algorithm has a better image-defogging effect than the other algorithms, yielding the best image quality, rich information content, and less image distortion.
4.2. Target Detection Experiments
4.2.1. VOC Dataset
The VOC dataset is one of the most commonly used standard datasets in the field of target detection. Therefore, the VOC2012 dataset is selected to validate the reliability of the model. This dataset has 20 categories with a total of 17,125 images: 13,870 in the training set, 1,542 in the validation set, and 1,713 in the test set, where the ratio of the training set to the validation set is 9:1.
4.2.2. Evaluation Indicators
In this paper, the mean average precision (mAP), the number of frames processed per second (Speed), and the total number of parameters (Params) are used as the evaluation indexes of detection accuracy, detection speed, and model size, respectively. In the target detection task, mAP is the mean of the average precision (AP) over all target categories, as shown in Equation (19):

$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{AP}_i \tag{19}$$
where N is the number of categories, and i denotes a particular category. The AP for a particular category i is calculated as follows:

$$\mathrm{AP}_i = \int_0^1 P_i(R)\,\mathrm{d}R$$

where $P_i(R)$ is the mapping relationship between precision (P) and recall (R), often represented by a P-R curve; the area of the region below the curve is the AP value for that category. Precision and recall are calculated as follows:

$$P = \frac{TP}{TP + FP},\qquad R = \frac{TP}{TP + FN}$$
where TP denotes the number of samples for which both the detection category and the true label are i; FP denotes the number of samples for which the detection category is i but the true label is not i; and FN denotes the number of samples for which the detection category is not i but the true label is i.
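As an illustration, the sketch below computes AP for one category as the area under the P-R curve using all-point interpolation, assuming detections carry confidence scores and have already been matched to ground-truth boxes (the matching step itself, e.g., by an IoU threshold, is omitted):

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    # scores: confidence of each detection for category i
    # is_tp:  1 if the detection matches a ground-truth box of category i (TP), else 0 (FP)
    order = np.argsort(-np.asarray(scores, dtype=np.float64))
    tp = np.cumsum(np.asarray(is_tp, dtype=np.float64)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=np.float64)[order])
    recall = tp / num_gt                  # R = TP / (TP + FN)
    precision = tp / (tp + fp)            # P = TP / (TP + FP)
    # Area under the monotone (upper-envelope) precision curve
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    changed = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[changed + 1] - r[changed]) * p[changed + 1]))

# mAP is then the mean of AP over the N categories.
```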
4.2.3. Model Training Strategy
To perform transfer learning when training the model, all input images to the network are resized to 512 × 512 pixels. At the same time, to speed up training and prevent the pretrained initial weights from being destroyed, the backbone was frozen for the first 50 epochs with a batch size of 8 and thawed once these 50 epochs were completed; training then continued with a batch size of 4, with the weights updated and saved after each completed epoch. The initial learning rate is set to 0.0005, the adaptive optimizer Adam is used with a momentum of 0.9 and a weight decay of 0, and cosine annealing is chosen as the learning-rate decay schedule.
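A minimal PyTorch sketch of this freeze-then-thaw strategy is shown below. The `model.backbone` attribute, the total epoch count, and the `train_one_epoch`/`save_weights` callables are illustrative assumptions, and Adam's beta1 plays the role of the momentum of 0.9 mentioned above.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

FREEZE_EPOCHS, TOTAL_EPOCHS = 50, 100   # 100 total epochs is an assumption

def set_backbone_frozen(model, frozen):
    # Freeze/thaw the feature-extraction backbone (assumes a `backbone` attribute)
    for p in model.backbone.parameters():
        p.requires_grad = not frozen

def train(model, loader_bs8, loader_bs4, train_one_epoch, save_weights):
    optimizer = Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999), weight_decay=0)
    scheduler = CosineAnnealingLR(optimizer, T_max=TOTAL_EPOCHS)
    set_backbone_frozen(model, True)            # frozen stage, batch size 8
    for epoch in range(TOTAL_EPOCHS):
        if epoch == FREEZE_EPOCHS:
            set_backbone_frozen(model, False)   # thawed stage, batch size 4
        loader = loader_bs8 if epoch < FREEZE_EPOCHS else loader_bs4
        train_one_epoch(model, loader, optimizer)
        scheduler.step()                        # cosine learning-rate decay
        save_weights(model, epoch)              # weights saved after every epoch
```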
4.2.4. Experimental Results
To verify the superiority of the method in this paper, five representative and well-performing feature pyramid methods were selected for comparison in the experiments: SPP (He, 2015), ASPP (Chen, 2017), PPM (Zhao, 2017), RFB and RFB-s (Liu, 2018), and Strip Pooling (Hou, 2020). The number of input and output channels was fixed at 2048 in the experiments, and the input channels were first reduced to 128 via a 1 × 1 convolution to ensure that the number of parameters of each model was kept in the range of (33, 35). The experimental results are shown in Table 3.
As seen in Table 3, compared with the existing SPP-like models, the M-SPP model achieves the best detection accuracy, with an mAP value of 82.28%, while its number of parameters and detection speed remain comparable. This shows that it can obtain local feature information of different sizes as well as global information, expand the receptive field, enrich the network's representation, and effectively improve the network accuracy.
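For reference, a sketch of a standard SPP-style block in the configuration described above (2048 input/output channels, 1 × 1 reduction to 128) is given below. The pooling kernel sizes are illustrative, and the proposed M-SPP modifies this baseline in ways not reproduced here.

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    # Parallel max-pooling at several kernel sizes over a channel-reduced map,
    # concatenated and projected back to the original channel count.
    def __init__(self, in_ch=2048, mid_ch=128, out_ch=2048, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes
        )
        self.expand = nn.Conv2d(mid_ch * (len(pool_sizes) + 1), out_ch, kernel_size=1)

    def forward(self, x):
        x = self.reduce(x)
        feats = [x] + [pool(x) for pool in self.pools]
        return self.expand(torch.cat(feats, dim=1))
```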
A feature extraction network with dilated (atrous) convolution can solve the problem of reduced feature resolution caused by enlarging the receptive field. The different dilation rates in the ASPP structure process the input feature maps in parallel and extract multi-scale target information. However, since the dilation rates (6, 12, and 18) do not fully utilize the receptive field to extract effective information, the E-ASPP module is proposed in this paper to enhance the model, and the results are shown in Table 4. Compared with the original ASPP, the ASPPs variant improves the detection accuracy to 82.06%, which means it performs feature extraction more effectively. Moreover, after the ECA attention mechanism is added to re-integrate the feature information, the mAP of the proposed E-ASPP is further improved to 82.21%, which is 0.15% higher than that of ASPPs.
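A sketch of the baseline ASPP block with dilation rates (6, 12, 18), together with an ECA-style channel attention module of the kind added in E-ASPP, is shown below. The proposed E-ASPP's modified dilation rates and exact wiring are not reproduced, so this is a reference implementation of the underlying techniques only.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    # Parallel 3x3 dilated convolutions (plus a 1x1 branch) capture multi-scale context.
    def __init__(self, in_ch=2048, mid_ch=128, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, mid_ch, kernel_size=1)]
            + [nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=r, dilation=r) for r in rates]
        )
        self.project = nn.Conv2d(mid_ch * (len(rates) + 1), in_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([branch(x) for branch in self.branches], dim=1))

class ECA(nn.Module):
    # Efficient Channel Attention: a 1-D convolution over the channel descriptor
    # produced by global average pooling, followed by sigmoid gating.
    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        w = x.mean(dim=(2, 3))                        # (B, C) channel descriptor
        w = self.conv(w.unsqueeze(1)).squeeze(1)      # (B, C) cross-channel interaction
        return x * torch.sigmoid(w)[:, :, None, None]
```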
To compare the effects of the three improvement strategies, M-SPP, E-ASPP, and CA, on the Center model, experiments were carried out, and the results are shown in Table 5. The mAP value of the original model is 79.64%, and each strategy improves the mAP over the original model. When the three improvement strategies are combined, the mAP value rises to 83.42%, which is 3.78% higher than that of the original model.
To verify the detection effect of the Center model, the YOLOv5, YOLOv7, Faster R-CNN, SSD, and CenterNet models are adopted for comparison; all models were trained to convergence with the same training parameters. The mAP values were calculated over all categories of the VOC2012 dataset after detection, and the results are shown in Table 6.

Compared with the YOLO family of algorithms, the Center model outperforms YOLOv5 and YOLOv7 in terms of mAP, Speed, and Params. Faster R-CNN has the slowest detection speed, 10.50 FPS, due to the shortcomings of two-stage detection. SSD, in contrast, has the fastest detection speed but the lowest detection accuracy, 78.59%.
4.3. Dark-Center Target Detection
From Section 3.1, it can be seen that images after the defogging operation have more prominent insulator contours, richer information, and higher recognizability and contrast. Therefore, the defogging algorithm is jointly trained with the target detection algorithm, a combination we call Dark-Center, to achieve insulator detection in foggy environments. The comparison experiments were carried out on the CPILD dataset, and the experimental results are shown in Table 7.
With the CenterNet target detection algorithm, the mAP for detecting insulators and defects is only 77.77%, while the improved Center model brings a substantial improvement in the recognition of insulator defects, raising the mAP by 16.56% to 94.33%. Adding the defogging algorithm to the preprocessing module further improves the mAP to 96.76%. Moreover, the Dark-Center model improves the detection accuracy of both insulators and defects, increasing the insulator AP from 95.89% to 98.85% and the defect AP from 59.64% to 94.67%.
To visualize the differences between the original CenterNet algorithm, the proposed Center model, and the Dark-Center model with the defogging process, a detection image is selected for comparison and analysis; the detection results are shown in Figure 11.
As seen in Figure 11, the CenterNet detection algorithm detects only the overall insulator, with a confidence of 0.78, and fails to identify the defective part (Figure 11b). The improved Center algorithm not only detects the insulator with a confidence of 0.96 but also identifies the defect with a confidence of 0.38 (Figure 11c). For the Dark-Center model with the additional defogging algorithm, the defogging effect is obvious: the insulator is identified with a confidence of 0.96 and the defect with a confidence of 0.48 (Figure 11d).