Article

Three-Dimensional Convolutional Vehicle Black Smoke Detection Model with Fused Temporal Features

1
School of Software, Shenyang University of Technology, Shenyang 110870, China
2
School of Information and Engineering, Shenyang University of Technology, Shenyang 110870, China
3
Shenyang Key Laboratory of Intelligent Technology of Advanced Industrial Equipment Manufacturing, Shenyang 110870, China
4
College of Software, Northeastern University, Shenyang 110819, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8173; https://doi.org/10.3390/app14188173
Submission received: 28 July 2024 / Revised: 29 August 2024 / Accepted: 2 September 2024 / Published: 11 September 2024

Abstract

The growing concern over pollution from vehicle exhaust has underscored the need for effective detection of black smoke emissions from motor vehicles. We believe that the optimal approach to black smoke detection is to leverage existing roadway CCTV cameras. To facilitate this, we have collected and publicly released a black smoke detection dataset sourced from roadway CCTV cameras in China. After analyzing existing detection methods on this dataset, we found that their performance is subpar. As a result, we developed a novel detection model that focuses on temporal information. This model exploits the continuous nature of CCTV video feeds rather than treating footage as isolated images. Specifically, our model incorporates a 3D convolution module to capture short-term dynamic and semantic features in consecutive black smoke video frames. Additionally, a cross-scale feature fusion module is employed to integrate features across different scales, and a self-attention mechanism is used to enhance the detection of black smoke while minimizing the impact of noise such as occlusions and shadows. Validation on our dataset demonstrates that our model achieves a detection accuracy of 89.42%, an improvement of around 3% over existing methods. This offers a novel and effective solution for black smoke detection in real-world applications.

1. Introduction

Black smoke emitted by vehicles contains large amounts of nitrogen oxides and particulate matter, which are harmful to the environment. The main task of vehicle black smoke detection is to identify medium- and large-sized vehicles that illegally emit black exhaust in real-world scenarios.
The existing mainstream black smoke detection methods generally use single-frame video images as input and rely on two-dimensional convolutional neural networks to extract features and identify black smoke. Among them, Guo et al. [1] introduced a CenterNet-based detection model that employs a dual backbone network and incorporates an attention mechanism to enhance detection accuracy. Chen et al. [2] proposed a diesel vehicle black smoke detection method based on motion amplification enhancement and color feature localization, and also proposed a cross-camera black smoke diesel vehicle relocation method, achieving higher accuracy. Wang et al. [3] used an object detection network to extract multiple regions containing diesel vehicles and black smoke, and then used a two-dimensional convolutional classification network to perform fine-grained classification of the candidate black smoke regions, realizing diesel vehicle black smoke detection. Zhou et al. [4] proposed the efficient spatial attention model ESA-Net to concentrate on relevant regions and filter out invalid information; however, this model requires further refinement to improve its accuracy because it does not consider black smoke motion. Xu et al. [5] reconstructed a multi-scale network model based on YOLOv5 [6], replacing the path aggregation network with a bidirectional weighted feature pyramid network; this enhanced feature fusion and adjusted feature contributions, resulting in reduced missed-detection and false-detection rates.
However, the methods above only extract black smoke feature information from a single image and rely on improved model structures and attention mechanisms to identify black smoke. In actual video data, a single frame captures only a portion of the black smoke information present in the entire video sequence. Black smoke varies in morphology, concentration, and diffusion across frames and can easily be confused with vehicle shadows and road stains, causing significant interference in subsequent processing. Additionally, the black smoke in road surveillance video occupies only a small area of the image, making it difficult to determine its existence from a single image; this leads to the low detection accuracy of the above methods.
We advocate that the optimal approach for detecting black smoke is to leverage the existing roadway CCTV cameras. These cameras are typically integrated into the city’s infrastructure and are currently extensively employed for monitoring speeding violations and red-light runners, among other functions. They typically offer a broad field of view and are designed to capture continuous video footage rather than just intermittent snapshots. By analyzing the same instance across different frames, it is possible to tackle many difficult cases, such as those with a blurry appearance and background occlusion [7]. Note that this approach is particularly advantageous for black smoke detection because black smoke itself is very blurry and the motion shadows produced by motorized vehicles are visually very similar to emitted black smoke [8,9].
With the above perspectives, we realize that one of the reasons that the prevailing existing methodologies do not employ the aforementioned optimal detection paradigm is due to the constraints inherent in the available datasets. To the best of our knowledge, a dataset for black smoke detection that utilizes roadway CCTV cameras, incorporating temporal attributes, does not yet exist. To this end, we have assembled and publicly released a dataset for black smoke detection, which has been sourced from roadway CCTV cameras within China.
In addition to developing this novel dataset, in this paper we also present a model for black smoke detection. Unlike existing approaches that rely solely on static features from single images, our method leverages the temporal continuity in video data to enhance detection accuracy. By considering the motion and evolution of black smoke over multiple frames, we can more effectively distinguish it from other visual artifacts. Specifically, our model is based on the ResNet50 network and adopts a dual-branch structure to extract black smoke feature information at different resolutions. A 3D convolution module is introduced into the backbone network to construct a dual-branch structure based on the 3D ResNet50 network, which preserves short-term dynamic and semantic features between consecutive black smoke video frames. Additionally, a cross-scale feature fusion module is proposed to concatenate the features of the two sub-segments and fuse black smoke feature information at different scales. Finally, a self-attention mechanism module is introduced to enrich the black smoke feature information while mitigating the influence of noise and other factors; the resulting mixed features are used for black smoke classification and recognition.
We summarize our contributions as follows:
  • Dataset: We assembled and publicly released a dataset for black smoke detection, which has been sourced from roadway CCTV cameras within China. We hope our dataset will facilitate the advancement of black smoke detection models and support their practical applications.
  • Model: Our model features a dual-branch architecture and 3D convolutions for the concurrent analysis of temporal features, capturing black smoke movement, alongside a feature fusion network that integrates cross-scale and self-attention mechanisms to refine detection by focusing on key features.
  • Experiments: The proposed method is trained and validated on our novel dataset. The results demonstrate a substantial improvement in vehicle black smoke detection accuracy for video streaming data compared to most current methods, validating the effectiveness of our approach in real-world scenarios.
The rest of this paper is organized as follows: In Section 2 we briefly analyze the existing object detection methods and the black smoke detection methods. Section 3 presents the dataset we collected. The techniques of our models are given in Section 4. In Section 5 we discuss the experiments, including the basic settings, the compared methods, and the experimental results. Finally, we conclude the paper in Section 6.

2. Related Works

In this section, we revisit existing works on object detection models in Section 2.1 and black smoke detection methods in Section 2.2.

2.1. Object Detection

In recent years, object detection tasks have witnessed substantial progress, particularly with deep learning-based methods such as the YOLO series models [10,11] and Faster R-CNN [12]. Typically, such models consist of two main components, namely backbone and head. The backbone component is usually pre-trained on a common dataset, e.g., ImageNet [13] or COCO [14], for the feature extraction of the input image. On the other hand, the head component is used for predicting object classes and bounding boxes.
Object detection models can be categorized into one-stage and two-stage detectors based on their head design. Notably, YOLOv4 [11] and SSD [15] are representative one-stage detectors, while RCNN [16] and Faster R-CNN [12] are prominent two-stage detectors. One-stage detectors directly regress bounding boxes and predict class probabilities from the input image, whereas two-stage detectors first generate region proposals before classifying them. In the post-processing stage, both detector types remove low-confidence bounding boxes and apply non-maximum suppression (NMS) to obtain the final detection results.
According to the usage of temporal information, object detection models can be categorized into temporal context frame-oriented models (e.g., Faster-RCNN [12], R-FCN [17] and FPN [18]) and single frame-oriented models (e.g., vanilla YOLO models). Compared to the single-frame model, temporal context frame-oriented models incorporate temporal interactions between adjacent frames into the procedure [19]. This paradigm has been shown to mitigate the effects of image deterioration problems such as motion blur, background occlusion, and deformation [7]. We believe this is uniquely advantageous for the detection of black smoke, which is often indistinguishable from roads or shadows and blurs as the vehicle moves.

2.2. Black Smoke Detection

Before the popularity of deep learning-based black smoke detection, previous methods were mainly based on traditional image processing to accomplish the black smoke detection task, such as gradient features, texture features, and frequency domain features. For example, Tao et al. [20] employed color histogram, local binary pattern, and orientation gradient histogram to characterize smoke in the HSV color space. They further extracted frequency domain information of vehicle black smoke using discrete cosine transform for the final classification.
The advent and progression of deep learning have significantly advanced vehicle black smoke recognition, and deep learning has become the mainstream solution for black smoke detection. Among these solutions, representative approaches are the following:
  • Zhang et al. [21] enhanced the network structure with MobileNet and spatial pyramid pooling and adopted a transfer learning training strategy. The recognition of black smoke was achieved by analyzing multiple frames, yielding improved results, albeit with challenges such as a high false alarm rate, a high rate of missed detections, and significant computational expense. Starting from this work, the model architectures used for black smoke detection began to converge towards mainstream target detection models.
  • Wang et al. [3] proposed a two-stage convolutional neural network that combines the strengths of YOLOv3 and a multi-region convolutional network, effectively achieving fine-grained identification of black smoke. This is the first time a YOLO-series model was used for a black smoke detection task, demonstrating the power of the YOLO line of models.
  • Li et al. [22] utilized YOLOv3 for vehicle target detection and subsequently employed the Vision Transformer [23] to identify black smoke emanating from the rear of vehicles. This is a landmark attempt to apply the attention mechanism, widely used in other applications, to black smoke detection tasks.
  • Zhou et al. [4] introduced a black smoke recognition method based on ResNet, which incorporates a reinforced spatial attention mechanism to effectively reduce the false alarm rate.
In addition to the representative efforts mentioned above, many other works have contributed to the progress of black smoke detection. Cao et al. [24] introduced a method that leverages spatial information for black smoke vehicle detection, employing the Inception V3 model [25] to capture spatial information from suspect frames. Chen et al. [9] proposed a one-stage diesel vehicle black smoke detection network, CFL-Net, based on color feature location for environmental inspection stations. Wang et al. [8] proposed a detection method based on a two-stage segmentation-detection paradigm, which integrates a semantic segmentation model and a detection model. This approach allows the segmentation model to filter out image regions unrelated to black smoke, resulting in more refined inputs for the detection model.
However, the above approach still suffers from two weaknesses:
  • Impractical detection scenarios: Most of the work mentioned above focuses primarily on model design and does not discuss practical black smoke detection scenarios. Conversely, although, for example, [9] considers the detection scenario, their model is designed for a specialized annual inspection station, which increases the cost of detection.
  • Less usage of the temporal information: As we mentioned, temporal information has unique advantages for the black smoke detection task, yet previous methods have not discussed these advantages in detail.
We discuss these two disadvantages in the next two sections.

3. Proposed Dataset

We advocate that the ideal scenario for detecting black smoke is to use existing roadway CCTV cameras, which have already been widely deployed as part of a city's infrastructure. The advantages of this scenario are three-fold:
  • These cameras capture the real and dynamic running state of vehicles, rather than the static and manually inspected log of vehicles just at an annual inspection station.
  • These cameras are usually mounted in the same location as the traffic lights and therefore have an ideal and unified viewpoint.
  • These cameras capture continuous videos, thus allowing for the utilization of temporal information.
We find that one of the reasons constraining existing methods from considering this ideal scenario is the lack of suitable datasets. To the best of our knowledge, a dataset for black smoke detection that utilizes roadway CCTV cameras, incorporating temporal attributes, does not yet exist.
Regarding this, we collect and publicly release a new black smoke dataset, collected entirely from roadway CCTV cameras in China (our dataset is available at https://aistudio.baidu.com/aistudio/datasetdetail/78275 (accessed on 27 July 2024)).
Our dataset is divided into 250 groups, with four video sequences in each group. Each video is about 1 min long and contains six consecutive black smoke video frames. To suit the training and inference process of the model, we exported these video frames as images, resulting in 7806 well-labeled black smoke images. (As an additional part of this dataset, we also provide 3255 images that do not contain black smoke, also captured from roadway CCTV cameras in China.) All annotations are provided in two formats: (1) an .xml file in VOC format and (2) a mask image, as shown in Figure 1 and Figure 2. The resolution of our images is 1080 × 1920, as the source CCTV cameras record at 1080p. The training set consists of 6000 images, the testing set of 1560 images, and the validation set of 246 images.
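As a rough illustration of how the released frames can be consumed by a temporal model, the following sketch groups consecutive frame images into six-frame clips. The folder layout and file naming used here are assumptions for illustration, not part of the dataset specification.

```python
import os
from typing import List

def group_clips(frame_dir: str, clip_len: int = 6) -> List[List[str]]:
    """Group sorted frame images into non-overlapping clips of clip_len frames."""
    frames = sorted(
        os.path.join(frame_dir, name)
        for name in os.listdir(frame_dir)
        if name.lower().endswith((".jpg", ".png"))
    )
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, clip_len)]

# Example (hypothetical path):
# clips = group_clips("black_smoke_dataset/group_001/video_01")
```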
The goal of this release is to boost the state of the art in black smoke detection by placing the question of black smoke detection in the context of a more practical and ideal question of scene understanding.

4. Proposed Model

In this section, we give a general overview of our model in Section 4.1, and then in the next three subsections (i.e., Section 4.2, Section 4.3 and Section 4.4) we detail the three core components of our model—3D convolutional layers, cross-scale feature fusion module and self-attention module—corresponding to the forward propagation order within our model.

4.1. Model Structure

Figure 3 shows the structure of our model. The input is a $T$-frame video sequence, where $R_t$ denotes the $t$-th original video frame. The model first processes the input video frame sequence at different resolutions, obtaining two video frame segments, $R_S$ and $R_D$, after a down-sampling operation. These segments are given by Equations (1) and (2), respectively.
$R_S = \{R_t\}_{t=1}^{T/2}$,    (1)
$R_D = \{R_t\}_{t=T/2+1}^{T}$.    (2)
Then, the processed video frame segments are passed through the ResNet50 network with 3D convolution to extract the features of each segment. The short-term motion features and semantic features between adjacent frames of the video frame segments are input to the feature fusion module. The cross-scale feature fusion module joins the features of the two sub-segments to obtain features containing both temporal and semantic information. These features are then input to the self-attention mechanism module to explore the contextual relationship between discontinuous frames and capture global cue information. Finally, the obtained black smoke feature information is input to the classifier for classification and recognition.
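The following schematic (not the authors' released code) illustrates one plausible reading of the forward pass described above, assuming a clip tensor of shape (B, C, T, H, W) and treating the branch, fusion, attention, and classifier modules as placeholders; the second sub-segment is taken to be the one processed at the lower resolution (cf. Table 5).

```python
import torch
import torch.nn.functional as F

def forward_pipeline(frames, branch_s, branch_d, csff, sam, classifier):
    """frames: (B, C, T, H, W); the remaining arguments are placeholder modules."""
    T = frames.shape[2]
    r_s = frames[:, :, : T // 2]      # first sub-segment R_S, Equation (1)
    r_d = frames[:, :, T // 2 :]      # second sub-segment R_D, Equation (2)
    # Down-sample the second sub-segment spatially (temporal length unchanged).
    r_d = F.interpolate(r_d, scale_factor=(1.0, 0.5, 0.5),
                        mode="trilinear", align_corners=False)
    f_s, f_d = branch_s(r_s), branch_d(r_d)   # 3D-ResNet50-style feature extraction
    fused = csff(f_s, f_d)                    # cross-scale feature fusion
    attended = sam(fused)                     # self-attention over the fused features
    return classifier(attended)               # black smoke vs. no smoke
```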

4.2. Black Smoke Feature Extraction with 3D Convolution

To simultaneously obtain the spatial features and the motion features in the time dimension of the black smoke video frame sequence, this paper enhances the expressive ability of the black smoke features by using ResNet50 as the backbone network and replacing some of its residual blocks with 3D convolution. This constitutes a 3D ResNet50 network capable of extracting spatial features and temporal sequence information of continuous black smoke video frames.
Specifically, the ResNet50 network can be divided into five parts. Except for the first part, the remaining parts contain 3, 4, 6, and 3 residual blocks, each with three convolutional layers. The residual blocks are connected by skip connections, enabling the network to handle deeper features effectively. The proposed 3D ResNet50 replaces residual blocks in the 1st, 2nd, and 3rd layers of the ResNet50 network with residual blocks embedded with 3D convolution (the choice of layers is examined in Section 5.5). The extracted features of an individual video frame sequence are denoted as $F = \{F_1, F_2, \ldots, F_T\}$ and are fed into the feature fusion module for further processing. The structure of the residual block embedded with 3D convolution is shown in Figure 4.
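As a concrete illustration, the sketch below shows one way to write a bottleneck residual block embedded with 3D convolution in PyTorch, in the spirit of Figure 4. The exact kernel sizes, strides, and normalization layers are assumptions rather than the authors' exact configuration.

```python
import torch.nn as nn

class Bottleneck3D(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            # The 3x3x3 convolution captures short-term motion across adjacent frames.
            nn.Conv3d(mid_ch, mid_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm3d(out_ch),
        )
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.relu(self.body(x) + self.shortcut(x))
```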

4.3. Feature Fusion Module

As the layers of the convolutional neural network deepen, the semantic information represented by the acquired feature maps changes. Shallow feature maps contain more location information, which is useful for target localization, while deep feature maps contain more semantic information, which is beneficial for classification tasks. In this paper, we use the feature fusion module (FF) to combine shallow and deep features, resulting in a new feature vector. The structure of FF is shown in Figure 5.
The cross-scale feature fusion module (CSFF) is embedded within the FF module, as illustrated in Figure 6. The feature maps $F_t \in \mathbb{R}^{H \times W \times C}$ and $F_{t+1} \in \mathbb{R}^{H/2 \times W/2 \times C}$ are extracted by the two branches within the same time period, where $H$, $W$, and $C$ denote the height, width, and number of channels of the feature maps, respectively.
The CSFF first up-samples $F_{t+1}$ to match the size of $F_t$. It then adds $F_t$ and $F_{t+1}$ element-wise to obtain fused features. In the semantic dimension, all features are combined to derive the final video sequence features. The CSFF module effectively reduces the information loss that occurs during single-stage fusion by merging features from different time periods.
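A minimal sketch of this fusion step follows, assuming bilinear up-sampling and a 1 × 1 convolution for smoothing; both choices are assumptions, since the paper does not fix these details.

```python
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.smooth = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_t, f_t1):
        # f_t:  (B, C, H, W)      high-resolution feature map
        # f_t1: (B, C, H/2, W/2)  low-resolution feature map
        f_t1_up = F.interpolate(f_t1, size=f_t.shape[-2:],
                                mode="bilinear", align_corners=False)
        return self.smooth(f_t + f_t1_up)   # element-wise fusion
```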

4.4. Self-Attention Module

While video short-time features can describe detailed information about black smoke, they lack attention to the global information in the time dimension. Continuous frames of the video contain valuable auxiliary information about black smoke diffusion, which can aid in recognition. Therefore, the self-attention mechanism (SAM) is introduced to capture global cues by leveraging frames with a relatively large time span. This helps to complement features and mitigate the influence of noise factors.
The self-attention module employs the QKV (query-key-value) model. Given an input sequence $X = [x_1, x_2, \ldots, x_n]$, the output sequence is denoted as $H = [h_1, h_2, \ldots, h_n]$. Each input is first mapped to three different spaces to obtain the query vector $q_i$, key vector $k_i$, and value vector $v_i$. The query matrix $Q$, key matrix $K$, and value matrix $V$ are extracted as follows:
$Q = W_q X, \quad K = W_k X, \quad V = W_v X,$
where $W_q$, $W_k$, and $W_v$ are learnable weights. For each $q_n \in Q$, the output vector $h_n$ is computed as follows:
$h_n = \mathrm{att}((K, V), q_n) = \sum_{j=1}^{N} a_{nj} v_j = \sum_{j=1}^{N} \mathrm{softmax}(s(k_j, q_n)) v_j,$
where $n, j \in \{1, 2, \ldots, N\}$ denote positions in the input sequence, and $a_{nj}$ denotes the attention value of the $n$-th output with respect to the $j$-th input.
Using the scaled dot product as the scoring function for the attention mechanism, the output vector sequence H is computed via:
$H = V \, \mathrm{softmax}\!\left(\frac{K^{\top} Q}{\sqrt{D_k}}\right).$
For the given features from different time periods, the importance of each time period’s output is obtained by softmax, then averaged to obtain the attention weight for each branch. Finally, multiple semantic features are fused into video-level features according to these attention weights.
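A minimal sketch of the scaled dot-product self-attention described above is given below, written in the standard row-wise form (equivalent to the column form of the equation up to transposition). The feature dimension and the use of a single head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):             # x: (B, N, D), N feature vectors per clip
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)   # (B, N, N)
        attn = torch.softmax(scores, dim=-1)                      # attention weights
        return attn @ v                # weighted combination of value vectors
```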
As in a typical object detection task, the loss function we use in the training phase of our model has two terms, an IoU loss and a cross-entropy loss, the same as in existing object detection projects [26,27,28,29].
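For illustration, a minimal sketch of such a two-term loss follows, with the IoU and cross-entropy weights taken from Table 1 (0.2 and 1). This mirrors common detection practice rather than reproducing the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_boxes, gt_boxes, pred_logits, gt_labels,
                   iou_weight: float = 0.2, ce_weight: float = 1.0):
    # Boxes are (N, 4) tensors in (x1, y1, x2, y2) format.
    x1 = torch.max(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = torch.max(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = torch.min(pred_boxes[:, 2], gt_boxes[:, 2])
    y2 = torch.min(pred_boxes[:, 3], gt_boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_p + area_g - inter + 1e-7)
    iou_loss = (1.0 - iou).mean()                       # IoU term
    ce_loss = F.cross_entropy(pred_logits, gt_labels)   # classification term
    return iou_weight * iou_loss + ce_weight * ce_loss
```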

5. Experiments

In this section, we evaluate the performance of our model on our released dataset. This section is organized as follows: we introduce the experimental setup in Section 5.1 and the evaluation metrics in Section 5.2; the comparative analysis is presented in Section 5.3, and the ablation study in Section 5.4; additional experiments, such as the effect of the position of the 3D convolutional layers in our model and the effect of image resolution, are presented in Section 5.5 and Section 5.6; and a case study is given in Section 5.7.

5.1. Setup

5.1.1. Dataset

All experiments are conducted on the dataset described in Section 3. The resolution of each image is uniformly down-sampled to 256 × 128.

5.1.2. Environment

The equipment used for the experiments includes an Intel(R) Core(TM) i5-1135G7 @ 2.40 GHz CPU (Intel, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3080 GPU (NVIDIA, Santa Clara, CA, USA). The deep learning framework used is PyTorch 2.0 with CUDA 11.8.

5.1.3. Model-Training

During the training process, the model is saved every 10 iterations. The optimizer for training is Stochastic Gradient Descent (SGD) with a batch size of 8, a learning rate of 0.001, momentum of 0.9, and a frame size of 6. A total of 200 epochs are trained. The overall hyper-parameters used in the experiments are summarized in Table 1. Note that we use the data augmentation scheme provided by the YOLOv5 project [6] to enhance training, and the parameters for the data augmentation process are also listed in Table 1.
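A minimal training-loop sketch under the settings reported above (SGD, batch size 8, learning rate 0.001, momentum 0.9, 200 epochs, checkpoint every 10 epochs) is given below; the model and data are dummy placeholders rather than the actual network and dataset loader.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(512, 2)                        # placeholder for the full detection model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Dummy data standing in for 6-frame clip features and labels.
dummy = TensorDataset(torch.randn(64, 512), torch.randint(0, 2, (64,)))
train_loader = DataLoader(dummy, batch_size=8, shuffle=True)

for epoch in range(200):
    for features, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
    if (epoch + 1) % 10 == 0:                    # save a checkpoint every 10 epochs
        torch.save(model.state_dict(), f"checkpoint_{epoch + 1}.pt")
```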
The initial experiments show that the model obtained at the 180th iteration has the best performance for black smoke detection. Therefore, the model obtained from this iteration is used as the vehicle black smoke detection model for the subsequent experimental analysis in this section.

5.2. Evaluation Metrics

The metrics for measuring black smoke recognition algorithms include the typical evaluation metrics: accuracy rate (AR), recall rate (RR), and F1-score. These metrics provide a comprehensive picture of the model's effectiveness in detecting black smoke. The accuracy rate used here measures the proportion of true positive detections among all positive detections, the recall rate measures the proportion of true positive detections among all actual positives, and the F1-score is the harmonic mean of the two, providing a balanced measure of the model's performance.
$AR = \frac{TP}{TP + FP} \times 100\%, \quad RR = \frac{TP}{TP + FN} \times 100\%, \quad F1\text{-}score = \frac{2 \times AR \times RR}{AR + RR} \times 100\%,$
where TP indicates the number of frames judged to be black smoke and judged correctly, FP indicates the number of frames in which black smoke does not exist but is judged to be black smoke, and FN indicates the number of frames in which black smoke does exist but is judged incorrectly.
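For reference, a minimal sketch of how these metrics can be computed from per-frame predictions and ground-truth labels (1 = black smoke, 0 = no smoke):

```python
def evaluate(preds, labels):
    """Compute AR, RR, and F1-score (in %) from binary per-frame predictions."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    ar = tp / (tp + fp) if tp + fp else 0.0          # accuracy rate
    rr = tp / (tp + fn) if tp + fn else 0.0          # recall rate
    f1 = 2 * ar * rr / (ar + rr) if ar + rr else 0.0 # harmonic mean
    return 100 * ar, 100 * rr, 100 * f1

# Example: evaluate([1, 0, 1, 1], [1, 0, 0, 1]) -> (66.7, 100.0, 80.0)
```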
In this section, all models are trained and evaluated three times with unfixed random seeds, and the average results are reported. The training set consists of 6000 images, the testing set of 1560 images, and the validation set of 246 images.

5.3. Comparative Analysis

The model proposed in this paper is evaluated against several prominent vehicle black smoke detection and identification models, including MobileNet v3 [30], YOLOv7 [31], YOLOv8 [32], as well as methods presented in the literature [1,4,5], and the C3D [33] model. We also implemented [8] as our baseline.
Among these baselines, MobileNet, YOLOv7, and YOLOv8 are all classical detection architectures; here we use their models pre-trained on the COCO dataset, because we find that COCO contains a large number of road scenes [14] similar to what is captured by roadway CCTV. Note that MobileNet, YOLOv7, and YOLOv8 are then fine-tuned on our proposed dataset, since the original COCO dataset does not have a class named "black smoke". This pre-training and fine-tuning paradigm is widely applied in many tasks [34,35]. For the remaining baselines, since no pre-trained versions are provided, we train them directly from scratch.
The results are presented in Table 2. We can make the following observations:
  • MobileNetv3, YOLOv7, and YOLOv8 are all two-dimensional convolutional network detection models that do not account for the dynamic characteristics of black smoke influenced by noise, leading to lower accuracy.
  • Refs. [1,4] enhance the network’s feature extraction capability by incorporating an attention mechanism, which can mitigate background noise to some extent.
  • Ref. [5] employs the YOLOv5 network to extract black smoke features but only captures the static features, failing to distinguish clearly between black smoke and other disturbances such as vehicle shadows, road stains, and tree shades.
  • The C3D model uses fully connected 3D convolution for vehicle black smoke detection, improving accuracy but suffering from a large number of parameters and low inference efficiency, making it unsuitable for real-time detection scenarios.
  • Ref. [8] is an integrated solution that contains a semantic segmentation model (i.e., YOLOv5-s) and a detection model (i.e., MobileNetv3). Although this approach uses a semantic segmentation mechanism, the features that guide the segmentation are still obtained from a YOLOv5 model. Thus, from a feature extraction perspective, it is effectively equivalent to the integration of YOLOv5 and MobileNetv3, and is still limited in its use of temporal features.
Unlike these baselines, our model integrates 3D convolution to fully capture the temporal features between adjacent frames of continuous black smoke images at different resolutions. It enhances the extraction of black smoke features by fusing the temporal and semantic dimensions, effectively reducing the impact of occlusion and noise in video sequences and thereby improving the accuracy of vehicle black smoke detection. To provide a view of the training process, we plotted our model's loss and accuracy over time during training; the results are displayed in Figure 7.

5.4. Ablation Study

Recall that our model is, in general, based on ResNet50. We first introduce a two-branch structure to obtain spatial and temporal features at different resolutions. We then design a cross-scale feature fusion module to mix the features of the target image at different depths in the model. In addition, an attention module is designed to find the frames that are important for black smoke detection among the multiple input frames. These modules collectively compose our model. To further validate the efficacy of each design, the ResNet50 baseline and four ablated models were trained and evaluated.
  • 3DR denotes the two-branch structural model with branch 1 and branch 2
  • 3DRC denotes the introduction of the CSFF module on top of 3DR
  • 3DRCS denotes the addition of the SAM module on top of 3DR
  • 3DRCSF is the FULL version of our model, in which the FF module is added to 3DR; the FF module includes both the CSFF module and the SAM module.
These five models were evaluated individually under identical experimental conditions on the test set. The comparative results in terms of accuracy, recall, F1-score, and inference time are presented in Table 3.
The experimental results demonstrate that the two-branch structure with 3D convolution (3DR) enhances the network's capability to extract the temporal motion features of black smoke, resulting in a 3.31% increase in accuracy and a 2.52% improvement in recall compared to the original model. The inclusion of the CSFF module in 3DRC facilitates comprehensive integration of multi-segment features of black smoke across the temporal and semantic dimensions, yielding a slight improvement in both accuracy and recall over 3DR. The 3DRCS model, which incorporates the SAM module, addresses the limitation of relying solely on short-term temporal cues by enhancing attention to global information and mitigating background noise, showing a marginal improvement in performance over 3DR. The fully integrated 3DRCSF model proposed in this section achieves significant improvements, with increases of 7.4% in accuracy and 6.6% in recall relative to the original model.

5.5. Introducing 3D Convolutional Kernel to Different Layers

In this paper, the ResNet50 network is used as the basic detection network, and 3D convolution is introduced to extract black smoke temporal features from continuous video frames. Since introducing 3D convolution in different layers yields different black smoke detection accuracy, this section conducts experiments on the above dataset in which the residual modules of different layers are replaced with 3D convolution, in order to obtain a detection model with better black smoke detection accuracy. The results are shown in Table 4.
The experimental results show that models obtained by introducing 3D convolution in different layers of ResNet50 achieve different accuracies in vehicle black smoke detection. When 3D convolution is introduced in only one layer, introducing it in layer 1 yields the best detection results. Without considering the amount of computation and inference time, the model obtained by introducing 3D convolution in layers 1, 2, and 3 simultaneously reaches an accuracy of 83.32% in black smoke detection. Therefore, to ensure black smoke detection accuracy, this section adopts this last configuration as the initial vehicle black smoke detection model.

5.6. Experiments on Different Branches and Resolutions

To verify the effect of different numbers of branches and different resolutions on the experimental results, this paper divides the video sequence into several groups of clips, each processed at a different resolution. The effects of different resolutions and different branch configurations on black smoke detection accuracy are compared, and the results are shown in Table 5.
Both branch 1 and branch 2 are the initial vehicle black smoke detection model with the introduction of 3D convolution.
The experimental results show that, when only branch 1 is employed, training the network with a video frame sequence at a resolution of 256 × 128 yields the highest detection accuracy of 83.32%, while training at a resolution of 128 × 64 yields slightly lower accuracy. However, when the input resolution is 64 × 32, detection performance drops significantly because the resolution is too low and the dynamic features of black smoke cannot be fully extracted. When the dual-branch structure of branch 1 + branch 2 is used with video frame sequence resolutions of 256 × 128 and 128 × 64, respectively, the detection accuracy reaches 85.33%, an improvement of 2.01% over the single-branch structure with an input resolution of 256 × 128. However, when the adopted resolutions are 128 × 64 and 64 × 32, respectively, the resulting black smoke detection accuracy decreases significantly. Therefore, this paper adopts the two-branch structure of branch 1 + branch 2 as the basic vehicle black smoke detection model.

5.7. Case Study

The effectiveness of the proposed model in detecting vehicle black smoke in real-world scenarios is illustrated in Figure 8, based on our collected dataset. Here we show both failed and successfully detected instances, corresponding to the images in the first and second rows, respectively.
For the three failures shown in Figure 8, the first and third examples may be due to the simultaneous presence of shadows or water stains in the area of the black smoke. The second example suggests that we need to consider not only the black smoke itself but also its relationship to the vehicle, as an e-bike cannot produce any smoke. These three failures point to potential directions for enhancing black smoke detection, which we leave for future work.

6. Conclusions

In this paper, we first advocated the importance of temporal information in black smoke detection and argued that existing roadway CCTV cameras are an ideal data source. We then observed that one of the reasons the existing prevailing methodologies do not employ this ideal detection paradigm is the constraints inherent in the available datasets. Therefore, we publicly released a black smoke detection dataset in which all data are collected from roadway CCTV cameras in China. The goal of this release is to boost the state of the art in black smoke detection by placing the question of black smoke detection in the context of a more practical and ideal question of scene understanding.
Based on this dataset, we proposed a temporal-information-aware black smoke detection model. Extensive experimental results demonstrate the important role of temporal features in black smoke detection. In particular, some of the failure cases suggest potential optimization directions for the black smoke detection task, including the overlap between black smoke and water stains or shadows, and the semantic correlations between black smoke and other entities.

Author Contributions

Conceptualization, J.L.; methodology, L.N.; software, H.C.; validation, J.L. and H.C.; writing—original draft preparation, J.L.; writing—review and editing, J.X.; supervision, L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Our dataset is available at https://aistudio.baidu.com/aistudio/datasetdetail/78275 (accessed on 28 July 2024).

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Guo, D.; Ren, M. Attention mechanism based two-branch black smoke vehicle detection network. Comput. Digit. Eng. China 2022, 50, 147. [Google Scholar]
  2. Chen, J. Research on the Visual Detection Method of Smoky Diesel Vehicles in Complex Scenes. Master’s Thesis, University of Science and Technology of China, Hefei, China, 2023. [Google Scholar]
  3. Wang, X.; Kang, Y.; Cao, Y. A two-stage Convolutional neural network for smoky diesel vehicle detection. In Proceedings of the 2019 Chinese Control Conference (CCC), Guangzhou, China, 27–30 July 2019. [Google Scholar]
  4. Zhou, J.; Qian, S.; Yan, Z.; Zhao, J.; Wen, H. ESA-Net: A Network with Efficient Spatial Attention for Smoky Vehicle Detection. In Proceedings of the IEEE International Instrumentation and Measurement Technology Conference, I2MTC 2021, Glasgow, UK, 17–20 May 2021; pp. 1–6. [Google Scholar] [CrossRef]
  5. Hao, X. Deep Learning Based Motor Vehicle Black Smoke Detection. Master’s Thesis, China University of Mining and Technology, Xuzhou, China, 2023. [Google Scholar]
  6. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Wong, C.; Yifu, Z.; Montes, D.; et al. ultralytics/yolov5: v6.2 - YOLOv5 Classification Models, Apple M1, Reproducibility, ClearML and Deci.ai Integrations. Zenodo 2022. [Google Scholar]
  7. Han, W.; Jun, T.; Xiaodong, L.; Shanyan, G.; Rong, X.; Li, S. PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022. [Google Scholar]
  8. Wang, H.; Chen, K.; Li, Y. Automatic Detection Method for Black Smoke Vehicles Considering Motion Shadows. Sensors 2023, 23, 8281. [Google Scholar] [CrossRef] [PubMed]
  9. Chen, J.; Cao, Y.; Kang, Y.; Xu, Z.; Xia, X. CFL-Net: An Environmental Inspection Stations Diesel Vehicle Black Smoke Detection Network Based on Color Features Location. In Proceedings of the 2022 41st Chinese Control Conference (CCC), Hefei, China, 25–27 July 2022. [Google Scholar]
  10. Tripathi, A.; Gupta, M.K.; Srivastava, C.; Dixit, P.; Pandey, S.K. Object Detection using YOLO: A Survey. In Proceedings of the 5th International Conference on Contemporary Computing and Informatics, IC3I 2022, Uttar Pradesh, India, 14–16 December 2022; pp. 747–752. [Google Scholar] [CrossRef]
  11. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  12. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  13. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  14. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, L. Microsoft COCO: Common Objects in Context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V 13; Springer: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
  15. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar] [CrossRef]
  16. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
  17. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv 2016, arXiv:1605.06409. [Google Scholar]
  18. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Online, 31 October 2017. [Google Scholar]
  19. Wei, X.; Liang, S.; Chen, N.; Cao, X. Transferable adversarial attacks for image and video object detection. arXiv 2018, arXiv:1811.12641. [Google Scholar]
  20. Tao, H.; Lu, X. Smoky vehicle detection based on multi-feature fusion and ensemble neural networks. Multim. Tools Appl. 2018, 77, 32153–32177. [Google Scholar] [CrossRef]
  21. Zhang, G.; Zhang, D.; Lu, X.; Cao, Y. Smoky Vehicle Detection Algorithm Based On Improved Transfer Learning. In Proceedings of the 6th International Conference on Systems and Informatics, ICSAI 2019, Shanghai, China, 2–4 November 2019; pp. 155–159. [Google Scholar] [CrossRef]
  22. Yuan, L.; Tong, S.; Lu, X. Smoky Vehicle Detection Based on Improved Vision Transformer. In Proceedings of the CSAE 2021: The 5th International Conference on Computer Science and Application Engineering, Sanya, China, 19–21 October 2021; Emrouznejad, A., Chou, J.R., Eds.; ACM: New York, NY, USA, 2021; pp. 97:1–97:5. [Google Scholar] [CrossRef]
  23. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  24. Cao, Y.; Lu, X. Learning spatial-temporal representation for smoke vehicle detection. Multim. Tools Appl. 2019, 78, 27871–27889. [Google Scholar] [CrossRef]
  25. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Online, 5 July 2016; pp. 2818–2826. [Google Scholar]
  26. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 24 June 2020. [Google Scholar]
  27. Joseph, K.J.; Khan, S.; Khan, F.S.; Balasubramanian, V.N. Towards Open World Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021; pp. 5830–5840. [Google Scholar]
  28. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Online, 18 October 2021; pp. 3520–3529. [Google Scholar]
  29. Diwan, T.; Anirudh, G.; Tembhurne, J.V. Object detection using YOLO: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 2023, 82, 9243–9275. [Google Scholar] [CrossRef] [PubMed]
  30. Howard, A.; Sandler, M.; Chu, G.; Chen, L.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. arXiv 2019, arXiv:1905.02244. [Google Scholar]
  31. Wang, C.; Bochkovskiy, A.; Liao, H.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  32. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. CoRR 2023. [Google Scholar]
  33. Tran, D.; Bourdev, L.D.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar] [CrossRef]
  34. Han, X.; Wang, Y.; Zhai, B.; You, Q.; Yang, H. COCO is “ALL” You Need for Visual Instruction Fine-tuning. arXiv 2024, arXiv:2401.08968. [Google Scholar]
  35. Li, M.; Wu, J.; Wang, X.; Chen, C.; Qin, J.; Xiao, X.; Wang, R.; Zheng, M.; Pan, X. Aligndet: Aligning pre-training and fine-tuning in object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Online, 27 June 2023; pp. 6866–6876. [Google Scholar]
Figure 1. Some instances of our dataset. All samples are collected from roadway CCTV cameras in China and have an ideal and unified perspective with vivid temporal information.
Figure 2. Some instances of the existing dataset. They are not captured in the ideal detection scenarios, the perspectives are not uniform, and these samples are isolated so that they do not contain temporal information.
Figure 3. The proposed model that integrates temporal features.
Figure 4. The structure of the proposed residual block embedded with 3D convolution.
Figure 5. Structure of the feature fusion module.
Figure 6. Structure of the CSFF module.
Figure 7. Loss function and accuracy over the training time of our model.
Figure 8. Vehicle black smoke detection instances. Failed and successful detection examples are shown in the first and second rows.
Table 1. The hyper-parameters we used consist of three parts: training parameters, gradient descent parameters and data enhancement parameters.
Parameter Name | Value
Batch Size | 8
Frame Size | 6
Epoch | 200
Initial learning rate | 0.01
Recurring learning rate | 0.01
Weight Decay | 0.9
Loss factor for IoU | 0.2
Loss factor for Cross-entropy | 1
Threshold for IoU training | 0.5
Hue (fraction) | 0.015
Saturation (fraction) | 0.7
Luminance (fraction) | 0.4
Rotation angle (+/- deg) | 10.0
Translation (+/- fraction) | 0.1
Image scaling (+/- gain) | 0.5
Probability of performing an up-down flip | 0
Probability of performing a left-right flip | 0.5
Probability of performing Mosaic | 1.0
Table 2. Comparison of experimental results of several mainstream methods.
Model | AR (%) | RR (%) | F1-Score (%) | Inference Time
MobileNetv3 | 78.46 | 81.32 | 79.86 | 76.96
YOLOv7 | 83.78 | 84.32 | 84.04 | 123.34
YOLOv8 | 84.62 | 86.73 | 85.67 | 139.04
[1] | 85.44 | 88.54 | 86.95 | 185.62
[4] | 85.96 | 89.36 | 87.64 | 176.32
[5] | 85.64 | 87.46 | 86.55 | 132.46
C3D | 87.76 | 89.66 | 88.74 | 348.23
[8] | 84.37 | 87.90 | 86.56 | 317.25
ours | 89.42 | 91.32 | 90.36 | 196.68
Table 3. Ablation study results.
Model | AR (%) | RR (%) | F1-Score (%) | Inference Time
ResNet50 | 82.02 | 84.72 | 83.35 | 82.72
3DR | 85.33 | 87.24 | 86.27 | 144.46
3DRC | 87.46 | 89.83 | 88.63 | 153.47
3DRCS | 86.75 | 89.47 | 88.10 | 151.28
3DRCSF | 89.42 | 91.32 | 90.36 | 186.43
Table 4. Experimental results of 3D convolution replacing residual modules of different layers.
Model | Image Size | Position of 3D Convolution | AR (%)
ResNet50 | 256 × 128 | NULL | 82.02
ResNet50 | 256 × 128 | 1st layer | 82.63
ResNet50 | 256 × 128 | 2nd layer | 82.37
ResNet50 | 256 × 128 | 3rd layer | 82.50
ResNet50 | 256 × 128 | 1st + 2nd | 82.82
ResNet50 | 256 × 128 | 1st + 3rd | 83.01
ResNet50 | 256 × 128 | 1st + 2nd + 3rd | 83.32
Table 5. Comparison of results with different branches and resolutions.
The Resolution of Branch 1 | The Resolution of Branch 2 | Usage of Both Branch 1 and Branch 2 | AR (%)
256 × 128 | - | N | 83.32
128 × 64 | - | N | 81.67
64 × 32 | - | N | 66.54
256 × 128 | 128 × 64 | Y | 85.33
256 × 128 | 64 × 32 | Y | 84.23
128 × 64 | 64 × 32 | Y | 78.01
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
