Article

A New Approach to Classify Drones Using a Deep Convolutional Neural Network

by Hrishi Rakshit and Pooneh Bagheri Zadeh *

School of Built Environment, Engineering and Computing, Headingley Campus, Leeds Beckett University, Leeds LS6 3QL, UK

* Author to whom correspondence should be addressed.
Drones 2024, 8(7), 319; https://doi.org/10.3390/drones8070319
Submission received: 9 May 2024 / Revised: 27 June 2024 / Accepted: 28 June 2024 / Published: 12 July 2024
(This article belongs to the Special Issue Advances in Detection, Security, and Communication for UAV)

Abstract

In recent years, the widespread adoption of Unmanned Aerial Vehicles (UAVs), commonly known as drones, among the public has led to significant security concerns, prompting intense research into drone classification methodologies. The swift and accurate classification of drones poses a considerable challenge due to their diminutive size and rapid movements. To address this challenge, this paper introduces (i) a novel drone classification approach utilizing deep convolution and deep transfer learning techniques. The model incorporates bypass connections and Leaky ReLU activation functions to mitigate the ‘vanishing gradient problem’ and the ‘dying ReLU problem’, respectively, associated with deep networks and is trained on a diverse dataset. This study employs (ii) a custom dataset comprising both audio and visual data of drones as well as analogous objects such as airplanes, birds, and helicopters to enhance classification accuracy. The integration of audio–visual information facilitates more precise drone classification. Furthermore, (iii) a new Finite Impulse Response (FIR) low-pass filter is proposed to convert audio signals into spectrogram images, reducing susceptibility to noise and interference. The proposed model signifies a transformative advancement in convolutional neural network design, illustrating that efficacy and efficiency need not be sacrificed in favor of complexity and learnable parameters. The proposed model demonstrated notable performance, achieving 100% accuracy on the test images with only four million learnable parameters. In contrast, the Resnet50 and Inception-V3 models each exhibit 90% accuracy on the same test set, despite employing 23.50 million and 21.80 million learnable parameters, respectively.

1. Introduction

The classification of Unmanned Aerial Vehicles (UAVs), commonly referred to as drones, is gaining prominence in tandem with the expanding utilization of drones across commercial and recreational domains. Drones serve a myriad of functions, including the delivery of goods [1], aerial photography [2], and surveillance operations. Within contemporary agricultural practices [3], industrial settings [4], and urban development initiatives [5], drones fulfill multifaceted roles, contributing to tasks such as precision farming, infrastructure inspections, and urban planning. Additionally, their deployment aids in monitoring meteorological conditions to enhance road safety [6], exemplifying the diverse array of applications facilitated by drones in a modern context.
Despite the diverse range of applications for drones, their misuse or deliberate exploitation can pose significant challenges to public safety and security. Of particular concern is the potential for drones to be utilized in acts of terrorism or other forms of violence, owing to their adaptability and capacity to carry various payloads, including explosive and chemical agents [7]. Apart from security concerns, if a drone were to collide with an aircraft during flight, the results could be devastating. Drones have also been used to deliver illegal drugs, communication devices such as mobile handsets, and other contraband to prisons [8]. Further, the deployment of spy drones for covert surveillance raises significant privacy and security implications [7]. Therefore, it is necessary to classify drones in restricted areas as early as possible to prevent any lethal or disastrous outcome.
Ensuring public safety and privacy, alongside the mitigation of potential threats posed by drones, necessitates the timely and accurate identification of these UAVs within restricted areas. Various systems have been devised for the classification of drones in such a context, including RADAR-based methodologies [9], Radio Frequency (RF) analysis techniques [10], Acoustic Characteristic Tracing mechanisms [11], and Visualization-Based Classification approaches [12]. Deep learning-based approaches are very popular for object classification. However, deep learning confronts significant challenges rooted in its reliance on extensive labeled data and the associated high training costs. In response, transfer learning within the deep learning domain, often referred to as deep transfer learning, seeks to alleviate these burdens by capitalizing on knowledge acquired from a source task or large dataset to enhance training on a target task or small dataset [13]. Moreover, a prevalent issue observed in deep learning-based classification models pertains to the challenge of striking a balance between precision, model complexity, and the risk of overfitting. The pursuit of heightened precision often prompts the addition of numerous layers to the model architecture. While this augmentation enriches the model’s capacity to learn intricate patterns, it concurrently amplifies the complexity of the system. Consequently, this heightened complexity exacerbates the model’s susceptibility to overfitting, wherein the model excessively tailors itself to the training data, compromising its ability to generalize effectively to unseen data instances [13]. Models such as Googlenet [14], Resnet50 [15], Darknet53 [16], Inception-V3 [17], and Inception-Resnet-V2 [18] exemplify this issue.
The objective of this paper is to undertake a rigorous assessment of the merits and drawbacks inherent in current methodologies for drone classification and to introduce a novel algorithm designed to address the limitations identified in state-of-the-art approaches. Central to this research methodology is the utilization of a bespoke dataset combining publicly available drone datasets with a newly curated collection developed at Leeds Beckett University. This comprehensive dataset encompasses both visual imagery and accompanying audio signals of drones and similar objects, with particular emphasis on the conversion of audio signals into spectrogram images. A pioneering approach employing a new Finite Impulse Response (FIR) low-pass filter is adopted for the transformation of audio sounds into spectrogram images, thereby enhancing the efficiency of drone classification algorithms.

2. Literature Review

For numerous decades, RADAR (Radio Detection and Ranging) technology has served as a fundamental tool in the classification of aerial vehicles. Nonetheless, conventional RADAR systems encounter limitations when tasked with the classification of tiny UAVs. This challenge arises from the swift mobility and diminutive proportions of such UAVs, which lead to weak reflections of electromagnetic waves, making conventional RADAR systems more prone to false positive classifications [9].
Another widely used methodology for drone classification involves RF (Radio Frequency)-based techniques. This method requires capturing the communication signals exchanged between the drones and the ground control devices. However, it is important to note that in numerous instances, drones operate autonomously through on-board software rather than being directed by ground control devices [10].
The classification of drones based on their acoustic characteristics is a promising approach, as drones have very distinguishable acoustic characteristics compared to similar objects such as airplanes, birds, and helicopters. However, it is important to acknowledge that this method may face challenges in environments characterized by a high level of noise, such as airports, where its feasibility may be compromised [11].
Recently, vision-based object classification methods have received significant attention from researchers, due to advances in smart vision technologies. There are two types of visualization-based object classification systems: traditional computer vision-based classification systems and deep learning-based classification systems.
Traditional computer vision-based systems, such as HOG [19] and SIFT [19], rely on manually designed features, which may not be robust or discriminative enough to accurately classify the object of interest in all cases [19]. Traditional classifiers are also sensitive to changes in the appearance of the object of interest, such as changes in lighting or perspective [19].
On the other hand, neural network approaches are faster and more efficient, and they also overcome the limitations of the non-neural-based approaches. Deep learning [20]-based neural networks, such as artificial neural networks and convolutional neural networks, are very popular for object classification. Here, the word ‘deep’ refers to the number of layers through which the input is transformed. Most modern deep learning models for object classification are mainly based on convolutional neural networks (CNNs), in which convolutional layers are cascaded one after another to extract features from the raw input. Each convolutional filter is a small matrix of weights that is used to trace out a specific pattern or feature in the input data. For example, a convolutional filter might be designed to find horizontal edges in an image. As the filter slides over the input data, it looks for correlations between the values in the filter and the values in the input data. When it finds a strong correlation, it produces a high value in the corresponding position of the output feature map. This process is repeated for every position in the input data, producing a feature map that encodes the presence of the pattern or feature that the filter is designed to detect or classify [21]. Multiple filters can be applied to the input data, each designed to search for a different pattern or feature. For example, a CNN might have one set of filters for acquiring edges, another set for finding corners, and another set for searching textures. These filters are learned from data during the training process, so the CNN can learn which patterns and features are important for the task at hand [22].
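As an illustrative sketch only (not part of the proposed model, whose filters are learned during training), the following MATLAB snippet shows how a single hand-crafted convolutional filter responds to horizontal edges in an image; the test image is one shipped with MATLAB.

```matlab
% Illustrative sketch only: how a single hand-crafted convolutional filter
% traces a pattern. In a CNN these filter weights are learned from data.
img = im2double(rgb2gray(imread('peppers.png')));   % sample image shipped with MATLAB

kHorizontal = [-1 -1 -1;                            % 3x3 horizontal-edge filter
                0  0  0;
                1  1  1];

featureMap = conv2(img, kHorizontal, 'same');       % slide the filter over the image

imshowpair(img, abs(featureMap), 'montage');        % bright pixels mark horizontal edges
```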
A predominant challenge in deep learning-based existing classification models is achieving an optimal balance between precision, model complexity, and the risk of overfitting. Often, the drive for higher precision leads to the addition of multiple layers within the model architecture. Although this increase allows the model to learn more intricate patterns, it also makes the system more complex. This added complexity can lead to overfitting.

3. Dataset Preparation

In practice, publicly available drone datasets are very limited, owing to privacy concerns, cost, and safety issues. Drones are used in a limited number of sectors, such as the military, search and rescue, and delivery, which makes data collection specific and limited.

3.1. Image Dataset

A novel drone dataset has been created at Leeds Beckett University (LBU), focusing on three prominent drone models, namely the DJI Phantom, Yuneec, and DJI Mavic Mini. Figure 1a presents sample frames from the newly created drone dataset with various loads. The video footage was acquired using two distinct camera systems: the Canon LEGRIA HF R806, boasting an impressive 32X optical zoom and a focal length ranging from 2.8 mm to 89.6 mm, and the Panasonic Ultra HD 4K camera, equipped with a 20X optical zoom and a focal length spanning from 4.08 mm to 81.6 mm. This newly created drone dataset is a diverse collection containing three distinct drones with various loads. After recording the videos, the files were segmented by specifying time intervals. The new dataset contains 30 video clips of 25 s each.
The Multi-sensor Drone dataset stands as a monumental repository within the research landscape, representing the most expansive compilation of aerial entities, inclusive of drones and analogous objects such as airplanes, birds, and helicopters. Offering unfettered access to researchers, this dataset serves as an invaluable resource for the exploration and analysis of various aerial phenomena [23]. The Multi-sensor Drone dataset features three distinct drone models: a small variant represented by the Hubsan H107D+, a mid-sized counterpart exemplified by the DJI Flame Wheel configured in a quadcopter arrangement, and a high-performance model epitomized by the DJI Phantom 4 Pro [16]. Moreover, this dataset encompasses various flying objects that may be mistakenly identified as drones, such as birds, airplanes, and helicopters. The Multi-sensor Drone dataset comprises a total of 650 video recordings, differentiated into 365 infrared (IR) and 285 visible (RGB) segments, each lasting 10 s [18]. This collective repository yields a corpus of 203,328 meticulously annotated frames, facilitating the comprehensive analysis and evaluation of drone classification algorithms [23].
The USC Drone dataset comprises 30 video recordings captured within the confines of the University of Southern California campus, all filmed using a single drone model. These recordings were meticulously curated to encompass a diverse array of background scenes, varying camera angles, distinct drone configurations, and diverse weather conditions [24]. Intentionally crafted to encapsulate real-world scenarios, the videos aim to portray the nuances of drone behavior amidst dynamic environmental factors, including rapid motion, challenging lighting conditions, and occlusion phenomena [24]. Each video clip spans approximately one minute, with a frame resolution of 1920 × 1080 and a frame rate of 15 frames per second [24].
For the training of the CNN models, the new dataset is used along with the Multi-sensor Drone dataset [23] and the USC Drone dataset [24]. The inclusion of multiple object classes in the training dataset serves to enhance the model’s capacity to discriminate between drones and other similar objects such as birds and small aircraft. This increases the robustness of the proposed algorithm, making it better able to handle real-world scenarios. With more classes, the model can learn more features that are useful for classifying different types of objects. The custom dataset consists of 3000 images of drones, 2000 images of airplanes, 2000 images of helicopters, and 1000 images of birds.
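As an illustrative sketch of how such a combined image dataset could be organized for training (the folder layout and names are assumptions, not the authors’ released structure), the following MATLAB snippet builds an image datastore with class-named subfolders and resizes the images to the network input size discussed in Section 4.

```matlab
% Sketch of assembling the combined image dataset, assuming one subfolder
% per class (folder names are hypothetical, not the released structure).
imds = imageDatastore('customDataset', ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');          % classes: drone, airplane, helicopter, bird

countEachLabel(imds)                        % verify the per-class image counts

% 80/20 split into training and validation sets, stratified by label
[imdsTrain, imdsVal] = splitEachLabel(imds, 0.8, 'randomized');

% Resize every image to the 224 x 224 input size expected by the network
augTrain = augmentedImageDatastore([224 224], imdsTrain);
augVal   = augmentedImageDatastore([224 224], imdsVal);
```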

3.2. Audio Dataset

The audio sounds of the three drones, namely the DJI Phantom, Yuneec, and DJI Mavic Mini, have been recorded with various loads. Initially, the drones were affixed with sample drug packets and small electronic devices, such as mobile handsets. Subsequently, the laden drones were launched into the sky, allowing their audio emissions to be captured using the “RODE Wireless GO” system. These recorded audio sounds were segmented into equal-length audio files. The new audio dataset contains 100 drone audio clips of 10 s each. These audio files are then converted into spectrogram images.
An open-source drone audio dataset [25], along with the newly created audio dataset, is utilized for the training and testing of convolutional neural network (CNN) models. This open-source drone audio dataset contains 1300 audio clips of drone sounds. Moreover, publicly available noise datasets [26,27] are used to artificially augment the drone audio clips with noise, aiming to mimic real-world scenarios. Furthermore, audio clips featuring airplanes, helicopters, and background noises are included to enhance the training process of the models [28]. All the audio clips were converted into spectrogram images, a preparatory step essential for the training and testing of the CNN models.
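A minimal sketch of the noise-augmentation step is given below, assuming mono recordings, hypothetical file names, and an illustrative 10 dB target Signal-to-Noise Ratio; the actual augmentation settings used for the dataset are not specified in this paper.

```matlab
% Sketch of the noise-augmentation step: a noise recording is mixed into a
% drone clip at a chosen SNR. File names, mono audio, a noise clip at least
% as long as the drone clip, and the 10 dB target are illustrative.
[x, fs]  = audioread('drone_clip.wav');          % drone recording (hypothetical file)
[n, fsN] = audioread('background_noise.wav');    % noise recording (hypothetical file)

x = x(:, 1);                                     % force mono
n = resample(n(:, 1), fs, fsN);                  % match sample rates
n = n(1:length(x));                              % match lengths

targetSNRdB = 10;
gain = (rms(x) / rms(n)) * 10^(-targetSNRdB/20); % scale the noise for the target SNR
y = x + gain * n;                                % augmented (noisy) clip

audiowrite('drone_clip_noisy.wav', y, fs);
```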

3.3. Spectrogram Conversion

The spectrogram stands out as a highly accurate and valuable two-dimensional representation of audio signals, offering insights into the intensity of a signal across different frequencies [29]. It serves as a fundamental tool in the extraction of high-level discriminative features from audio signals [29]. To extract salient features and make the system learn about those features, the conversion of an audio signal into a spectrogram is, in many cases, an obvious step. A spectrogram with less noise and interference helps to extract very detailed features, and the models can learn very efficiently. The basic block diagram of audio-to-spectrogram conversion is shown in Figure 2a.
The input to this stage is an audio file, which contains the time-domain representation of sound as the variation in air pressure over time. This stage takes an audio file, typically in .wav format, and serves as the source of sound data for further processing.
Before converting the audio into a spectrogram, it is necessary to conduct some pre-processing on the audio data to make it suitable for analysis. Firstly, the audio data are sampled at 16 kHz, which strikes a balance between capturing audio information, conserving storage space, and maintaining compatibility across different systems. Then, the audio is normalized to a standard level to prevent issues related to volume variation.
In the filter section, the audio signal is divided into small overlapping windows, and then a window function (such as the Hamming, Kaiser, or Gaussian window) is applied. The window function is used here to emphasize the characteristics of each segment while minimizing artifacts at the edges. The windowed segments are usually overlapped to ensure smooth transitions between consecutive segments. Here, a new window function is proposed for converting an audio signal into a spectrogram. Mathematically, the proposed window function is given as Equation (1).
w(n) = 0.5 − 0.5 cosh( σ sin( π (n − (N − 1)/2) / (N − 1) ) ),  0 ≤ n ≤ N − 1    (1)
where N is the window length, n is the sample index, and σ is the tuning parameter.
The final step is to plot the spectrogram as shown in Figure 2b. The X-axis represents time, the Y-axis represents frequency, and the color of each pixel represents the magnitude or the energy at a specific time-frequency point so that one can analyze and interpret the signal’s frequency content over time.
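A minimal MATLAB sketch of this audio-to-spectrogram pipeline is shown below; the file name is hypothetical, and a built-in Hamming window is used as a stand-in where the paper applies the proposed window of Equation (1).

```matlab
% Sketch of the audio-to-spectrogram pipeline of Figure 2a. The file name is
% hypothetical, and a Hamming window stands in for the proposed window of
% Equation (1); segment length, overlap, and FFT size are illustrative.
[x, fsIn] = audioread('drone_clip.wav');
x = x(:, 1);                                % use a single channel

fs = 16e3;
x  = resample(x, fs, fsIn);                 % resample to 16 kHz
x  = x / max(abs(x));                       % normalize to a standard level

winLen  = 512;                              % analysis window length (samples)
overlap = 256;                              % 50% overlap between segments
nfft    = 1024;
w       = hamming(winLen);                  % stand-in analysis window

spectrogram(x, w, overlap, nfft, fs, 'yaxis');   % time on X, frequency on Y, energy as color
```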
The custom dataset comprises a total of 8000 diverse images depicting drones and analogous objects such as airplanes, birds, and helicopters. Additionally, the dataset incorporates 2000 spectrogram images representing drones, airplanes, helicopters, and background noises. This substantial quantity of images (8000 visual images and 2000 spectrogram images) proves to be advantageous for training various deep learning models [30].

4. Proposed Algorithm

The proposed algorithm’s flow chart, as depicted in Figure 3a, is structured into five distinct sections, namely the input block, Convolutional Block 2, Convolutional Block 3, Convolutional Block 4, and the classification block. The proposed model has 28 convolutional layers and uses the Leaky ReLU activation function throughout.
Initially, any RGB image is fed into the input block. This block is designed to accept input dimensions of 224 × 224 × 3. Here, the first two dimensions (224 × 224) represent the width and height of the input image, while the last dimension (3) signifies the number of color channels in the image (Red, Green, and Blue). The choice of the input size 224 × 224 × 3 is deliberate. It strikes a balance between being small enough to fit into memory, which is crucial for training large models, and containing sufficient information for tasks such as image classification and object recognition. These dimensions are also standard across many well-known datasets, such as the ImageNet dataset. By adhering to this standard input size, researchers can facilitate comparisons among various models trained on the same dimensions, fostering a more robust evaluation framework.
The input block consists of one convolutional layer (Conv), one Batch Normalization layer (Batch Norm or BN), one Leaky ReLU (LR) activation layer, and one Max Pooling layer. The Conv layer has 64 filters of dimensions 7 × 7. State-of-the-art feature extraction models usually use a first Conv layer with filter sizes of 3 × 3, 5 × 5, 7 × 7, or 11 × 11. The proposed model works better using the 7 × 7 filter at the very initial stage of the network. Using a 3 × 3 or 5 × 5 filter at the very first layer causes some contextual information loss. On the other hand, an 11 × 11 filter captures extra contextual information, which may not be particularly useful and disrupts the progressive reduction of the feature maps. The output of the first 7 × 7 Conv layer, with a stride of 2 and padding of 3, is 112 × 112 × 64. Thus, the feature map is down-sampled to half of its original input dimensions. Batch Normalization normalizes the output, and the Leaky ReLU activation function makes the training smooth. The Max Pooling layer, with a window size of 3 × 3 and a stride of 2, is used to further down-sample the feature maps and helps to extract information from the most intense pixels. The output of the max pool layer is 56 × 56 × 64.
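A sketch of this input block in MATLAB Deep Learning Toolbox syntax (the environment stated in Section 7.2) is shown below; layer names are illustrative, and a pooling padding of 1 is assumed so that the stated 56 × 56 × 64 output size is reached.

```matlab
% Sketch of the input block (sizes taken from the text; layer names are
% illustrative, and a pooling padding of 1 is assumed to reach 56 x 56 x 64).
inputBlock = [
    imageInputLayer([224 224 3], 'Name', 'input')
    convolution2dLayer(7, 64, 'Stride', 2, 'Padding', 3, 'Name', 'conv1')   % 224x224x3 -> 112x112x64
    batchNormalizationLayer('Name', 'bn1')
    leakyReluLayer(0.01, 'Name', 'lrelu1')
    maxPooling2dLayer(3, 'Stride', 2, 'Padding', 1, 'Name', 'maxpool1')     % 112x112x64 -> 56x56x64
];
```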
Convolutional Block 2, which follows the input block, contains three sub-blocks, namely 1. Proposed Convolutional Sub-Block 2a (Pro2a), 2. Proposed Convolutional Sub-Block 2b (Pro2b), and 3. Proposed Convolutional Sub-Block 2c (Pro2c). Pro2a has three Conv layers, which are denoted as Pro2a_21, Pro2a_22, and Pro2a_23. After each Conv layer, there are one Batch Norm layer and one Leaky ReLU activation layer. Pro2a_21 uses 64 filters of dimensions 1 × 1 with a stride of 1 and padding of 0, Pro2a_22 utilizes 64 filters of size 3 × 3, and Pro2a_23 has 256 filters of size 1 × 1 with the same stride and padding. The first 1 × 1 Conv layer allows the network to learn a more compact representation of the input data, which can be used by the subsequent layers in the network. Following this, a convolutional layer is implemented, featuring 64 filters with dimensions of 3 × 3. The rationale behind opting for a 3 × 3 filter size lies in its ability to efficiently capture intricate details present within the image, while also providing adequate coverage of its overall structural elements. Using a 3 × 3 filter allows the network to learn more complex features than it would with a larger or smaller filter dimension. Additionally, using a 3 × 3 filter helps to reduce the number of parameters in the model, making it more efficient and easier to train. Finally, 256 filters of size 1 × 1 are used to increase the channel dimension from 64 to 256 and to trace out more details of the feature maps. The output of the input block is directly connected to the output of Pro2a_23 using a “bypass connection”.
In typical convolutional neural network architectures, each layer processes the input data sequentially, gradually extracting hierarchical features from the input. However, as the network becomes deeper, it encounters challenges in learning and propagating gradients efficiently through all the layers. This phenomenon, known as the vanishing gradient problem, can hinder the optimization process during training and impede the model’s ability to learn meaningful representations. Bypass connections address this issue by providing shortcuts for the flow of information across layers. Specifically, instead of feeding the output of one layer directly into the subsequent layer, bypass connections allow the output to bypass one or more intermediate layers and be added directly to the output of deeper layers. This way, the network can learn a residual mapping, capturing the difference between the desired output and the current output at each layer.
Here, the bypass connections allow the model to learn more complex and diverse features, improve its ability to relate new features in test data to the most likely features learnt from training, and reduce the risk of overfitting. When the dimensions of the feature maps are not the same, 256 filters of size 1 × 1 are used along the bypass connection so that the dimensions of the feature maps and the number of channels become equal. This helps the model to learn spatial hierarchies, which enables it to better understand the relationships between different parts of the feature maps. Pro2b_21, Pro2b_22, Pro2b_23, Pro2c_21, Pro2c_22, and Pro2c_23 have the same configuration as the Pro2a series, apart from having bypass connections without any 1 × 1 Conv layer.
For the Pro2b series, the 1 × 1 convolutional layers in this configuration are used to reduce the dimensionality of the input, allowing the network to process it more efficiently. The 3 × 3 convolutional layers are used to learn the residual function, and the final 1 × 1 convolutional layer is used to restore the dimensionality of the output. The same applies to the Pro2c series. Convolutional Block 3 and Convolutional Block 4 have similar connections and configurations. The number of filters is increased or decreased following powers of 2 to maintain consistency. This makes the network more scalable, since it is easy to double the number of filters if more capacity is needed, without having to change the overall architecture of the network.
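To make the sub-block structure concrete, the following MATLAB sketch assembles one bottleneck sub-block (Pro2a) with its bypass connection on top of the input-block sketch above; layer names and the placement of the final activation are illustrative assumptions.

```matlab
% Sketch of one bottleneck sub-block (Pro2a) and its bypass connection,
% assembled as a layer graph on top of the input-block sketch above. Layer
% names and the placement of the final activation are illustrative.
lgraph = layerGraph(inputBlock);

mainBranch = [
    convolution2dLayer(1, 64,  'Stride', 1, 'Padding', 0, 'Name', 'pro2a_21')
    batchNormalizationLayer('Name', 'bn2a_21')
    leakyReluLayer(0.01, 'Name', 'lrelu2a_21')
    convolution2dLayer(3, 64,  'Stride', 1, 'Padding', 1, 'Name', 'pro2a_22')
    batchNormalizationLayer('Name', 'bn2a_22')
    leakyReluLayer(0.01, 'Name', 'lrelu2a_22')
    convolution2dLayer(1, 256, 'Stride', 1, 'Padding', 0, 'Name', 'pro2a_23')
    batchNormalizationLayer('Name', 'bn2a_23')
    additionLayer(2, 'Name', 'add2a')        % merges the main branch and the bypass
    leakyReluLayer(0.01, 'Name', 'lrelu2a_out')
];
lgraph = addLayers(lgraph, mainBranch);
lgraph = connectLayers(lgraph, 'maxpool1', 'pro2a_21');

% Bypass: a 1x1 convolution lifts the shortcut from 64 to 256 channels so
% that both inputs of the addition layer have matching dimensions.
bypass = [
    convolution2dLayer(1, 256, 'Stride', 1, 'Name', 'bypass2a')
    batchNormalizationLayer('Name', 'bn_bypass2a')
];
lgraph = addLayers(lgraph, bypass);
lgraph = connectLayers(lgraph, 'maxpool1', 'bypass2a');
lgraph = connectLayers(lgraph, 'bn_bypass2a', 'add2a/in2');
```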
The mathematical expression of ReLU and Leaky ReLU is given in Equations (2) and (3).
F(x) = max(0, x)    (2)
F(x) = ax, for x < 0; F(x) = x, for x ≥ 0, where a = 0.01    (3)
The Rectified Linear Unit (ReLU) activation function is susceptible to the phenomenon known as the “dying ReLU” problem [31]. This issue arises when a neuron possesses a negative bias, causing it to remain inactive or “dead” indefinitely, thereby hindering the flow of information throughout the network. To address this problem, the Leaky ReLU function is used as the activation function throughout the whole network instead of ReLU. The Leaky ReLU is a variant of the ReLU activation function that allows a small, non-zero gradient when the input is negative. This helps to prevent the dying ReLU problem, where many neurons in the network end up outputting a constant zero value and therefore stop learning [31]. Moreover, the Leaky ReLU activation function offers a notable advantage over ReLU by providing a non-zero output for negative input values. This characteristic enables Leaky ReLU to avoid discarding potentially relevant information, thereby enhancing its performance, particularly in situations characterized by noisy or outlier-laden data distributions [31]. Thus, this improves the model’s performance.
The classification block consists of average pooling, fully connected, softmax, and output layers. The average pooling is used to reduce the spatial dimensions, i.e., the width and height, of the feature maps while retaining the important information. This helps to reduce the computational complexity of the network and provides some degree of translation invariance. Fully connected layers are used for high-level feature aggregation and for mapping the extracted features to the final class scores. The softmax section converts the raw output scores into class probabilities using the softmax function, and the image is assigned to the class with the highest probability. The output layer applies the argmax operation, which finds the index of the class with the highest probability.
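A sketch of this classification block follows (global average pooling is assumed as the concrete form of the average pooling layer; layer names are illustrative). The block would be appended to the layer graph with addLayers and connectLayers, as in the sub-block sketch above.

```matlab
% Sketch of the classification block; global average pooling is assumed as
% the concrete form of the average pooling layer, and names are illustrative.
numClasses = 4;                                  % drone, airplane, helicopter, bird
classificationBlock = [
    globalAveragePooling2dLayer('Name', 'gap')   % collapses each feature map to one value
    fullyConnectedLayer(numClasses, 'Name', 'fc')
    softmaxLayer('Name', 'softmax')              % raw scores -> class probabilities
    classificationLayer('Name', 'output')        % assigns the class with the highest probability
];
```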
In the proposed model, a distinctive approach is taken wherein the weights within Convolutional Blocks 2 and 3 are initialized with pre-trained values instead of being randomly initialized. Additionally, a novel block, Convolutional Block 4, is introduced, positioned atop the preceding two blocks. The weights of this newly incorporated block are trained contingent upon those of the preceding two blocks, emphasizing its role in extracting high-level features. Subsequently, a classification section is appended, with weights updated based on the collective influence of the preceding three convolutional blocks. This structured approach underscores the hierarchical integration of knowledge transfer within the model architecture, strategically leveraging pre-existing information to enhance feature extraction and classification performance.
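One way this weight-reuse step could be realized in MATLAB is sketched below: convolutional layers of Blocks 2 and 3 are seeded with weights from a previously trained network and given a reduced learning-rate factor, while Block 4 and the classification block keep their random initialization. The variable pretrainedNet, the name prefixes, and the 0.1 factor are assumptions for illustration, not the authors’ exact procedure.

```matlab
% Hedged sketch of the weight-reuse step: convolutional layers of Blocks 2
% and 3 are seeded from a previously trained network ('pretrainedNet', an
% assumption) and fine-tuned slowly, while Block 4 and the classification
% block keep their random initialization. Prefixes and the 0.1 factor are
% illustrative, not the authors' exact procedure.
layers = lgraph.Layers;
for i = 1:numel(layers)
    name = layers(i).Name;
    if (startsWith(name, 'pro2') || startsWith(name, 'pro3')) && isprop(layers(i), 'Weights')
        idx = find(strcmp({pretrainedNet.Layers.Name}, name), 1);   % matching pre-trained layer
        if ~isempty(idx)
            newLayer = layers(i);
            newLayer.Weights = pretrainedNet.Layers(idx).Weights;   % copy pre-trained weights
            newLayer.Bias    = pretrainedNet.Layers(idx).Bias;
            newLayer.WeightLearnRateFactor = 0.1;                   % fine-tune slowly
            newLayer.BiasLearnRateFactor   = 0.1;
            lgraph = replaceLayer(lgraph, name, newLayer);
        end
    end
end
```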

5. Proposed Classification Network with Different Optimizers

The proposed classification network is trained with different optimizers while keeping the other hyperparameters constant, namely a learning rate of 0.01, a maximum of 10 epochs, a mini-batch size of 20, and a validation frequency of 50. Figure 4a,b illustrate that the maximum validation accuracy is achieved with the minimum elapsed time when utilizing SGDM as the optimizer of the proposed model, contrasting with the performance of ADAM and RMSPROP. This is because ADAM maintains adaptive learning rates for each parameter individually, which can introduce noise. The custom datasets used for training contain many small-object images, which means that they have many zero entries; SGDM handles this type of data better. Moreover, SGDM can smooth the update process and help prevent overfitting through its use of momentum [31]. Furthermore, the custom dataset used here is full of various objects, such as airplanes, birds, drones, and helicopters, of different sizes and aspect ratios. Sometimes, ADAM’s adaptive learning rates might not adapt optimally across all dimensions, which can result in suboptimal convergence. SGDM, on the other hand, tends to work better in such scenarios due to the momentum term, which helps steer the optimization process along the dominant directions and overcome issues related to high dimensionality [31]. ADAM’s adaptive learning rate might also struggle to adapt efficiently if the gradients computed during training have high variance; SGDM, with its momentum term, helps to mitigate this issue and provides more stable convergence [31]. RMSPROP, which also utilizes adaptive learning rates but differs from ADAM in its computation of the second moment, tends to strike a balance between the extremes of SGDM and ADAM.
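The corresponding training configuration, expressed as a MATLAB sketch (the momentum value of 0.9 is an assumption, MATLAB’s default, as the paper does not state one; augTrain and augVal refer to the datastores sketched in Section 3.1):

```matlab
% Sketch of the stated training configuration with SGDM; the momentum value
% of 0.9 is an assumption (MATLAB's default), as the paper does not state it.
opts = trainingOptions('sgdm', ...
    'InitialLearnRate',    0.01, ...
    'Momentum',            0.9, ...
    'MaxEpochs',           10, ...
    'MiniBatchSize',       20, ...
    'ValidationData',      augVal, ...
    'ValidationFrequency', 50, ...
    'Shuffle',             'every-epoch', ...
    'Plots',               'training-progress', ...
    'Verbose',             false);

net = trainNetwork(augTrain, lgraph, opts);
```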

6. Performance Evaluations

Precision–recall (PR) curves, the area under the precision–recall curve (AUC-PR), and F1 score–threshold curves are used to evaluate the performance of the models. Precision is the number of true positive classifications divided by the total number of positive classifications [32]. On the other hand, recall is the number of true positive classifications divided by the total number of actual positive instances [32]. The F1 score is the trade-off between precision and recall [32].
Precision = True Positive / (True Positive + False Positive)
Recall = True Positive / (True Positive + False Negative)
F1 Score = (2 × Precision × Recall) / (Precision + Recall)
A model with high precision and low recall is very selective about what it considers to be a positive classification; it will only output a positive classification if it is very confident about the positive class. The problem is that, due to its high selectiveness, it might not classify all the object classes in the images. On the other hand, a model with high recall and low precision is less selective in classifying positive classes, so it will output many objects as positive-class objects with a low confidence score. It might classify all the objects in the images but will produce many false object-class classifications. The performance of a model is better if it can maintain a high precision value as the recall increases from a lower to a higher value [32]. The area under the precision–recall curve (AUC-PR) and the F1 score curve are other important measurements of how well the model works. The larger the area under the PR and F1 score curves, the better the model [32].
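For reference, these metrics can be computed for the drone class from a trained network’s predictions, as in the MATLAB sketch below; the class label name and the validation datastores are carried over from the earlier sketches and are assumptions.

```matlab
% Sketch of computing precision, recall, and F1 for the drone class from the
% trained network's predictions; the label name 'drone' and the validation
% datastores from the earlier sketches are assumptions.
predLabels = classify(net, augVal);                   % predicted class labels
trueLabels = imdsVal.Labels;

positiveClass = 'drone';
tp = sum(predLabels == positiveClass & trueLabels == positiveClass);
fp = sum(predLabels == positiveClass & trueLabels ~= positiveClass);
fn = sum(predLabels ~= positiveClass & trueLabels == positiveClass);

precision = tp / (tp + fp);
recall    = tp / (tp + fn);
f1        = 2 * precision * recall / (precision + recall);
```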

7. Experimental Results

Section 7.1 examines the comparative performance of the proposed window-based filter in contrast to state-of-the-art window-based filters, while Section 7.2 illustrates the superior performance of the proposed classification algorithm over existing classification algorithms, specifically concerning drone classification.

7.1. Proposed Window and Proposed Window-Based Low-Pass Filter

Section 7.1.1 elucidates the temporal characteristics and normalized frequency-domain depiction of both the proposed and the existing state-of-the-art windows. Furthermore, Section 7.1.2 presents a comparative analysis of the efficacy of the proposed window-based low-pass filter against that of the state-of-the-art window-based low-pass filters.

7.1.1. Proposed Window vs. State-of-the-Art Windows

In this section, the shape and the spectral properties of the proposed window are compared with the commonly used state-of-the-art window functions. Figure 5 depicts the shapes of both the proposed window and state-of-the-art windows, namely the Hamming window, Gaussian window, and Kaiser window [33], whereas Figure 6 describes their respective normalized frequency responses.
From Figure 5, it is clear that the proposed window is narrower in the time domain than the Hamming window. The shape of the proposed window is very similar to that of the Gaussian window, with the exception that all the coefficients of the Gaussian window are greater than zero, while the proposed window touches the X-axis in a symmetric manner. The Kaiser window, with its tunable parameter β = 0.5, is almost like a rectangular window at a window length of 128. By contrast, the proposed window, with a controlling parameter of σ = 0.5, is a bell-shaped window when the window length is set to 128.
Figure 6 suggests that, with the same main-lobe widths, the proposed window and the Hamming window have ripple ratios of −32.0980 dB and −45.3748 dB, respectively, and the corresponding side-lobe roll-off ratios are 121.433 dB and 13.2968 dB, respectively. With respect to the side-lobe roll-off ratio, the proposed window achieves 108.1362 dB better performance than the Hamming window. Hence, the proposed window function is more selective regarding the bands to pass, surpassing the performance of the Hamming window in this fundamental aspect of signal processing. In contrast to the Gaussian window, the proposed window exhibits distinct spectral characteristics at the same window length of 128. The main-lobe width of the Gaussian window is measured at 2π × 0.0509 rad/sample, accompanied by a ripple ratio of −44.0964 dB and a side-lobe roll-off ratio of 19.4604 dB. In comparison, the proposed window demonstrates a narrower main-lobe width of 2π × 0.0313 rad/sample, a ripple ratio of −32.0908 dB, and a substantially higher side-lobe roll-off ratio of 121.44 dB. With a narrower main-lobe width and a 101.9796 dB better side-lobe roll-off ratio, the proposed window outperforms the Gaussian window in terms of frequency selectivity and noise reduction. The main-lobe widths of the Kaiser and the proposed windows are 2π × 0.0157 rad/sample and 2π × 0.0313 rad/sample, respectively. The ripple ratios for the Kaiser and the proposed windows are −13.8109 dB and −33.9778 dB, respectively. The Kaiser window achieves a side-lobe roll-off ratio of 31.707 dB, while the proposed window results in a 121.2232 dB side-lobe roll-off ratio. Therefore, in terms of the side-lobe roll-off ratio, the proposed window yields 89.5162 dB better performance compared to the Kaiser window.
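As a hedged aid to reproducing these comparisons, the MATLAB sketch below visualizes the reference windows used in Figures 5 and 6 and computes a normalized magnitude response in dB from which the main-lobe width, ripple ratio, and side-lobe roll-off ratio can be read; the coefficients of the proposed window of Equation (1) would be passed in exactly the same way.

```matlab
% Sketch of reproducing the comparisons of Figures 5 and 6 for the reference
% windows (N = 128, Kaiser beta = 0.5, as stated above); the coefficients of
% the proposed window of Equation (1) would be passed in exactly the same way.
N = 128;
wHamming  = hamming(N);
wGaussian = gausswin(N);
wKaiser   = kaiser(N, 0.5);

wvtool(wHamming, wGaussian, wKaiser);    % overlays window shapes and magnitude responses

% Normalized magnitude response in dB via a zero-padded FFT, from which the
% main-lobe width, ripple ratio, and side-lobe roll-off ratio can be read.
nfft = 4096;
H    = fft(wHamming, nfft);
HdB  = 20*log10(abs(H) / max(abs(H)));
plot(linspace(0, 1, nfft/2 + 1), HdB(1:nfft/2 + 1));
xlabel('Normalized frequency (\times\pi rad/sample)');
ylabel('Magnitude (dB)');
```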

7.1.2. Proposed Window-Based Filter vs. State-of-the-Art Window-Based Filters

In this section, the performance of the proposed window-based filter is compared to that of the state-of-the-art window-based filters.
The normalized frequency responses of the FIR low-pass filters designed using the proposed and the Hamming window functions are depicted in Figure 7. For the ripple ratio measurement, the proposed window-based FIR low-pass filter achieves a 4.638 dB better performance than the Hamming window-based FIR low-pass filter with the same cut-off frequency and the same filter order. Furthermore, the evaluation of the side-lobe roll-off ratio reveals a noteworthy advantage for the proposed window-based low-pass filter, showing a remarkable improvement of 90.7482 dB compared to the Hamming window-based low-pass filter with the same cut-off frequency and filter order.
Figure 8 shows the normalized frequency response of the FIR low-pass filter designed by the proposed and the Kaiser window functions. With the same cut-off frequency and same filter order, the proposed low-pass filter achieves a 39.57 dB and 75.2134 dB better ripple ratio and side-lobe roll-off ratio, respectively, compared to the Kaiser window-based low-pass filter.
Figure 9 exhibits the normalized frequency response of the FIR low-pass filter designed by the proposed and the Gaussian window functions. Under equivalent conditions of cut-off frequency and filter order, the proposed low-pass filter attains a 39.5 dB and 74.8793 dB better ripple ratio and side-lobe roll-off ratio, respectively, compared to the Gaussian window-based low-pass filter.
The proposed filter presents exceptional performance in terms of the side-lobe roll-off ratio and ripple ratio, demonstrating its efficacy in noise reduction for audio signals. However, its transition band lacks the sharpness required for processing very weak bio-signals effectively. Consequently, the proposed filter may not be the optimal choice for applications involving such minute bio-signals.
Figure 10 shows the denoised audio spectrograms of an audio file using various state-of-the-art window-based filters and the proposed window-based filter with a cut-off frequency of 1 kHz and a filter order of 256. Here, the Signal-to-Noise Ratio (SNR) and Signal-to-Interference Ratio (SIR) are measured to evaluate the performance of the filters. The SNRs of the Hamming, Gaussian, and Kaiser filters are 2.5503 dB, 2.5652 dB, and 2.6678 dB, respectively. On the other hand, the SNR of the proposed window-based filter is 2.9072 dB, which is 0.2394 dB better than the best performing Kaiser filter. The Hamming, Gaussian, and Kaiser filters achieve SIRs of −2.3725 dB, −2.3675 dB, and −2.3327 dB, respectively. The proposed window-based filter has an SIR of −0.208316 dB, surpassing the performance of the state-of-the-art filters.
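A sketch of this filter-design-and-denoising comparison is given below, using the stated cut-off frequency and order; the reference windows are shown explicitly, and the proposed window’s coefficients would be supplied to fir1 in the same way. The noisy clip y is assumed to come from the augmentation sketch in Section 3.2.

```matlab
% Sketch of the filter comparison of Figures 7-10: 256th-order FIR low-pass
% filters with a 1 kHz cut-off at fs = 16 kHz, designed with fir1 using the
% reference windows; the proposed window's coefficients would be supplied in
% the same way. 'y' is a noisy drone clip, e.g., from the augmentation sketch
% in Section 3.2.
fs    = 16e3;
order = 256;
fc    = 1e3 / (fs/2);                               % normalized cut-off frequency

bHamming  = fir1(order, fc, hamming(order + 1));    % fir1 expects a window of length order + 1
bGaussian = fir1(order, fc, gausswin(order + 1));
bKaiser   = fir1(order, fc, kaiser(order + 1, 0.5));

yDenoised = filter(bHamming, 1, y);                 % denoised audio for the spectrogram comparison

freqz(bHamming, 1, 1024, fs);                       % normalized frequency response, as in Figure 7
```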

7.2. Proposed Classification Algorithm vs. State-of-the-Art Classification Algorithms

In this paper, the performance of the proposed drone classification algorithm is compared with the state-of-the-art object classification algorithms, namely Resnet50, Inception-V3, and Inception-Resnet-V2. The proposed model was implemented on a system with the following specifications: an NVIDIA GeForce RTX GPU, MATLAB software (version R2023b), and a deep neural network library for GPU learning.
Figure 11 shows that Resnet50 achieves an 85.81% validation accuracy on the custom dataset with a Max Epoch of 10, a mini-batch size of 20, a learning rate of 0.01, and a validation frequency of 50. The elapsed time is 20 min and 26 s, and the model has 23.5 million learnable properties.
Figure 12 depicts the training outcomes of the pre-trained network of Inception-V3 with a Max Epoch of 10, mini-batch size of 20, frequency of 50, and learning rate of 0.01 on the custom dataset. It achieves 83.03% accuracy with an elapsed time of 23 min and 20 s, having 21.8 million learnable properties.
Figure 13 illustrates the Inception-Resnet-V2 classification network yielding 82.26% validation accuracy on the custom dataset with the Max Epoch of 10, mini-batch size of 20, learning rate of 0.01, and frequency of 50. The elapsed time needed is 40 min and 01 s. The total learnable properties are 54.3 million.
Figure 14 portrays that the proposed backbone network for feature extraction yields 90.70% validation accuracy on the custom dataset with the Max Epoch of 10, mini-batch size of 20, learning rate of 0.01, and frequency of 50. The elapsed time needed is 25 min and 37 s. The total learnable properties are 4 million.
Figure 15 presents a comparative analysis of the validation accuracy of various neural network architectures, namely Resnet50, Inception-V3, Inception-Resnet-V2, and the proposed method, across different learning rates. The results indicate notable fluctuation in validation accuracy as the learning rate varies. When the learning rate is set to 0.01, the proposed method achieves the highest validation accuracy of 90.70%, outperforming the other architectures. However, with this learning rate, Inception-Resnet-V2 has the lowest validation accuracy of 82.26%. With a learning rate of 0.001, Resnet50, Inception-V3, Inception-Resnet-V2, and the proposed method achieve validation accuracies of 82.72%, 80.64%, 79.45%, and 88.50%, respectively. On the other hand, Resnet50, Inception-V3, Inception-Resnet-V2, and the proposed method have 81.53%, 82.60%, 77.70%, and 86.90% validation accuracy, respectively, with a learning rate of 0.0001. This indicates that the proposed method is better suited to fast-learning (higher learning rate) settings than the state-of-the-art methods.
Figure 16a,b illustrate the relationship among validation accuracy, elapsed time, and the number of epochs across different neural network architectures, namely Resnet50, Inception-V3, Inception-Resnet-V2, and the proposed method. Initially, when the number of epochs is set to 10, the proposed method demonstrates the highest validation accuracy of 90.70% with an elapsed time of 25 min 37 s, beating Resnet50, Inception-V3, and Inception-Resnet-V2. However, as the number of epochs increases to 20, the proposed method maintains a relatively high accuracy of 86.80% with an elapsed time of 28 min 50 s, Resnet50 experiences a decline to 83.50% in validation accuracy with an elapsed time of 29 min 50 s, and both Inception-V3 and Inception-Resnet-V2 yield further decreases in accuracy, to 81.89% and 80.13%, respectively, accompanied by significant increases in elapsed time to 30 min 30 s and 50 min 25 s, respectively.
Figure 17a,b provide an integrated exploration of the dynamics among validation accuracy, elapsed time, and mini-batch size across distinct neural network architectures, namely Resnet50, Inception-V3, Inception-Resnet-V2, and the proposed method. It is observed that enlarging the mini-batch size leads to a reduction in accuracy alongside a concurrent increase in elapsed time. The proposed method has the highest validation accuracy of 90.70% with an elapsed time of 25 min 37 s when the mini-batch size is 20, outperforming Resnet50, Inception-V3, and Inception-Resnet-V2. With the same configuration, Inception-Resnet-V2 has the lowest accuracy of 83.03% with the longest elapsed time of 40 min 01 s. For a mini-batch size of 40, Resnet50, Inception-V3, Inception-Resnet-V2, and the proposed method achieve accuracies of 83.82%, 81.64%, 79.45%, and 89.50%, respectively, accompanied by significant increases in elapsed time to 28 min 30 s, 35 min 50 s, 50 min, and 29 min 30 s, respectively. With a mini-batch size of 80, ResNet50 achieves 81.53% accuracy in 34 min, Inception-V3 attains 80.60% accuracy in 38 min 50 s, Inception-Resnet-V2 yields 76.70% accuracy in 70 min 40 s, and the proposed method obtains 88.50% accuracy in 35 min 50 s, reflecting varied performance and corresponding training durations. In all cases, the proposed method achieves higher validation accuracy while spending a moderate amount of elapsed time.
Figure 18a,b present a comparative analysis of validation accuracies and associated elapsed times for Resnet50, Inception-V3, Inception-Resnet-V2, and the proposed method at validation frequencies of 50 and 100. At a frequency of 50, the proposed method achieves the highest validation accuracy of 90.70% with an elapsed time of 25 min 37 s, outperforming Resnet50, Inception-V3, and Inception-Resnet-V2.
On the other hand, at a validation frequency of 100, the proposed method maintains a relatively high accuracy of 85.85% with an elapsed time of 35 min 50 s, while Resnet50, Inception-V3, and Inception-Resnet-V2 exhibit lower accuracies and longer elapsed times. Overall, the proposed method demonstrates the superior performance across both frequency settings, achieving higher accuracies with moderate training time compared to the other methods.
Figure 19 illustrates the classification outcomes obtained from utilizing the Resnet50 classification neural network on a set of test images. The results indicate that out of 10 test images examined, the Resnet50 model successfully classifies 9 images accurately. This observation highlights the proficiency of the Resnet50 model in distinguishing features within the dataset, as evidenced by its high classification accuracy rate of 90%.
Figure 20 depicts the classification performance of the Inception-V3 classification network on a series of test images. Among the 10 images assessed, Inception-V3 successfully classifies 9 images with precision. This result accentuates the model’s proficiency in categorizing features within the dataset, as corroborated by its classification accuracy rate of 90%.
In Figure 21, the classification efficacy of the Inception-Resnet-V2 classification network is portrayed through its performance on a set of test images. Out of 10 images evaluated, the Inception-Resnet-V2 accurately identifies 8 images. This outcome underscores the effectiveness of the Inception-Resnet-V2 model in categorizing features within the dataset, as demonstrated by its classification accuracy of 80%.
Figure 22 displays an analysis of the classification performance exhibited by the proposed classification network on a designated set of test images. Upon evaluating 10 images, the proposed classification method demonstrates exceptional precision by accurately classifying all 10 images. This remarkable outcome highlights the model’s adeptness in classifying intricate features within the dataset, as evidenced by its impeccable classification accuracy rate of 100%.
Greater complexity in a model, as indicated by the number of learnable properties, is often associated with an increased risk of overfitting, wherein the model may memorize the training data rather than generalize well to unseen data. A superior model is typically characterized by fewer learnable parameters coupled with higher test classification accuracy. Upon the examination of Table 1, it is evident that the Inception-Resnet-V2 architecture exhibits the highest number of learnable parameters, totaling 54.30 million, while achieving the lowest test classification accuracy of 80%. In contrast, both Resnet50 and Inception-V3 models show significantly fewer learnable properties, with 23.50 million and 21.80 million, respectively, yet attain test classification accuracies of 90% each. In comparison, the proposed model stands out with a mere 4 million learnable properties, accompanied by an impressive test classification accuracy of 100%, reflecting remarkable performance in terms of both model simplicity and predictive accuracy.
Based on the findings derived from the experimental results, it can be summarized that the proposed classification algorithm reveals superior performance compared to the existing state-of-the-art classification algorithms. Furthermore, the effectiveness of the proposed algorithm is notably pronounced under specific configurations of its hyperparameters. Specifically, optimal performance is observed when the algorithm is trained with a learning rate of 0.01, a maximum epoch of 10, a mini-batch size of 20, and a validation frequency of 50, and when the SGDM optimization algorithm is employed, surpassing the performance achieved under alternative combinations of hyperparameters and optimizers.
Figure 23 illustrates precision–recall (PR) curves comparing the proposed model with leading state-of-the-art models for drone classification. An analysis of Figure 23 reveals that the proposed model consistently achieves the highest precision across all recall values in comparison to state-of-the-art models, namely Resnet50, Inception-V3, and Inception-Resnet-V2. Moreover, the PR curve of the proposed model exhibits the largest area under the curve when contrasted with the three state-of-the-art models. Notably, Inception-Resnet-V2 demonstrates the lowest precision among the models examined. PR curves for Resnet50 and Inception-V3 lie intermediate to the proposed model and Inception-Resnet-V2. Consequently, it can be summarized that the proposed model surpasses state-of-the-art models in terms of PR curve evaluation.
In Figure 24, the F1 score–threshold curves for drone classes are depicted, comparing the proposed model with state-of-the-art models. Initially, the proposed model and the state-of-the-art models exhibit a similar F1 score across threshold values ranging from 0 to 0.3. However, beyond this threshold, the proposed model shows superior performance with higher F1 scores compared to the state-of-the-art models, namely Resnet50, Inception-V3, and Inception-Resnet-V2. Furthermore, the proposed model consistently boasts the largest area under the F1 score curve when contrasted with state-of-the-art models, highlighting its superior overall performance in this metric. Resnet50 and Inception-V3 present almost similar performance as their F1 score–threshold curves are very close to each other. Particularly, Inception-Resnet-V2 demonstrates the lowest F1 score among the models examined. In summary, the proposed model tops state-of-the-art models in terms of F1 score–threshold curve assessment.

8. Discussion

As presented in the dataset preparation section, a novel dataset featuring drones equipped with various payloads has been curated to facilitate the classification of drones operating within constrained environments such as airports and prisons, with a particular focus on identifying drones transporting illicit substances and explosive materials. The newly assembled drone dataset has been combined with two publicly accessible datasets, namely the “Multi-sensor Drone dataset” and “USC Drone dataset”. Additionally, to enhance the accuracy of drone classification, a comprehensive collection of audio recordings capturing diverse drone configurations and payload types has been generated and subsequently transformed into spectrogram images. This supplementary audio dataset has been merged with an open-source repository containing drone audio recordings, encompassing sounds emanating from drones, airplanes, helicopters, and ambient background noises, all of which have been converted into spectrogram representations. The mix of these datasets culminates in the creation of a custom dataset comprising 8000 visual images and 2000 spectrogram images depicting drones and analogous objects. Given that this custom dataset integrates various publicly available drone datasets alongside newly acquired data showcasing drones with different payloads, its extensive coverage ensures robust performance not only in the targeted classification task but also in broader applications involving drone and analogous object identification across diverse environments.
The proposed classification model comprises a modest architecture consisting of 28 convolutional layers, strategically integrated with “bypass connections” to address the prevalent vanishing gradient problem encountered in deep neural networks. By facilitating an alternative pathway for gradient flow during backpropagation, these bypass connections enable earlier layers to directly access activations from deeper layers. Consequently, this architectural design promotes the efficient reuse of features learned at varying depths of the network, enhancing the model’s capacity to learn diverse and robust representations of the input data. Additionally, the incorporation of the Leaky ReLU activation function throughout the network effectively mitigates the “dying ReLU” problem, ensuring the consistent propagation of information across layers.
The effectiveness of the proposed model is further underscored by its judicious management of learnable properties, which play an important role in determining model efficiency, complexity, and susceptibility to overfitting. With a meticulous balance between efficiency and complexity, the proposed model demonstrates remarkable efficacy while mitigating the risk of overfitting. Notably, despite comprising a modest 28 convolutional layers and a mere 4 million learnable properties, the proposed model achieves an impressive 100% accuracy on the test data, underscoring its efficacy in capturing intricate patterns and generalizing well to unseen instances.
The proposed model, configured with specific hyperparameters and optimizer settings, demonstrates utility when applied to a dataset characterized by noisy information and diverse aspect ratios of objects. However, for lower levels of noise, adjustments to the configuration may be necessary for optimal performance.
This architecture is primarily designed for the classification of small, rapidly moving objects captured from a significant distance. Consequently, it is particularly well suited for applications involving drones and similar objects. In contrast, this architecture is less effective for biomedical image analysis.
The novelty of the proposed algorithm lies in its ability to achieve exceptional performance metrics with a comparatively compact architecture. By adopting a streamlined design and carefully leveraging “bypass connections” and Leaky ReLU activation functions, the proposed model exemplifies a paradigm shift in convolutional neural network design, exhibiting that efficacy and efficiency need not be sacrificed in favor of complexity and learnable properties.

9. Conclusions

This paper utilizes a bespoke dataset comprising both audio and visual data of drones, airplanes, birds, and helicopters to train various convolutional neural network (CNN) object classification models. Incorporating audio–visual information in the training and testing process enhances classification accuracy. A new Finite Impulse Response (FIR) low-pass filter is introduced to convert audio signals into spectrogram images, demonstrating superior performance in Signal-to-Noise Ratio (SNR) and Signal-to-Interference Ratio (SIR) calculations compared to existing state-of-the-art filters. The proposed window-based filter achieves an SNR of 2.9072 dB, which is 0.2394 dB better than the best performing Kaiser filter, and an SIR of −0.208316 dB, surpassing the Kaiser filter by 2.124348 dB. Additionally, a novel CNN-based drone classification model is proposed. Although the proposed model features a compact architecture, it demonstrates superior accuracy in precisely classifying drones. The proposed model, with just 4 million learnable parameters, achieves 100% accuracy on the test dataset, surpassing the performance of Resnet50, which has 23.50 million learnable parameters and achieves 90% accuracy. The proposed model adeptly negotiates a trade-off between its learnable attributes and the risk of overfitting. Evaluation through precision–recall curves, F1 score curves, and comprehensive testing validates the efficacy of the proposed classification algorithm, beating existing CNN-based object classification methods in drone classification accuracy. In the future, RF-based classification will be integrated with audio–visual data.

Author Contributions

Conceptualization, H.R. and P.B.Z.; methodology, H.R. and P.B.Z.; validation, H.R. and P.B.Z.; formal analysis, H.R.; data curation, H.R.; software, H.R.; writing—draft preparation, H.R.; writing—review and editing, P.B.Z.; supervision, P.B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

There is no conflict of interest.

References

  1. Benarbia, T.; Kyamakya, K. A Literature Review of Drone-Based Package Delivery Logistics Systems and Their Implementation Feasibility. Toward the New Era of Sustainable Design, Manufacturing and Management. Sustainability 2022, 14, 360. [Google Scholar] [CrossRef]
  2. Kshirsagar, S.P.; Jagyasi, N. Evolution and Technological Advancements in Drone Photography. Int. J. Creat. Res. Thoughts-IJCRT 2020, 8, 2224–2227. [Google Scholar]
  3. Touil, S.; Richa, A.; Fizir, M.; Argente García, J.E.; Skarmeta Gomez, A.F. A review on smart irrigation management strategies and their effect on water savings and crop yield. Irrig. Drain. 2022, 71, 1396–1416. [Google Scholar] [CrossRef]
  4. Ismaeel, A.G.; Janardhanan, K.; Sankar, M.; Natarajan, Y.; Mahmood, S.N.; Alani, S.; Shather, A.H. Traffic Pattern Classification in Smart Cities Using Deep Recurrent Neural Network. Sustainability 2023, 15, 14522. [Google Scholar] [CrossRef]
  5. Al Shamsi, M.; Al Shamsi, M.; Al Dhaheri, R.; Al Shamsi, R.; Al Kaabi, S.; Al Younes, Y. Foggy Drone: Application to a Hexarotor UAV. In Proceedings of the International Conferences on Advances in Science and Engineering Technology, Dubai, United Arab Emirates, 6 February–5 April 2018. [Google Scholar] [CrossRef]
  6. Mohammed, F.; Idries, A.; Mohamed, N.; Al-Jaroodi, J.; Jawhar, I. UAVs for Smart Cities: Opportunities and Challenges. In Proceedings of the International Conference on Unmanned Aircraft Systems (ICUAS), Orlando, FL, USA, 27–30 May 2014. [Google Scholar]
  7. Chamola, V.; Kotesh, P.; Agarwal, A.; Naren; Gupta, N.; Guizani, M. A Comprehensive Review of Unmanned Aerial Vehicle Attacks and Neutralization Techniques. Ad Hoc Netw. 2021, 111, 102324. [Google Scholar] [CrossRef] [PubMed]
  8. Turkmen, Z.; Kuloglu, M. A New Era for Drug Trafficking: Drones. Forensic Sci. Addict. Res. 2018, 2, 114–118. [Google Scholar] [CrossRef]
  9. Samadzadegan, F.; Javan, F.D.; Mahini, F.A.; Gholamshahi, M. Detection and Recognition of Drones Based on a Deep Convolutional Neural Network Using Visible Imagery. Aerospace 2022, 9, 31. [Google Scholar] [CrossRef]
  10. Basak, S.; Rajendran, S.; Pollin, S.; Scheers, B. Combined RF-based drone detection and classification. IEEE Trans. Cogn. Commun. Netw. 2021, 8, 111–120. [Google Scholar] [CrossRef]
  11. Mezei, J.; Fiaska, V.; Molnár, A. Drone sound detection. In Proceedings of the 2015 16th IEEE International Symposium on Computational Intelligence and Informatics (CINTI), Budapest, Hungary, 19–21 November 2015; pp. 333–338. [Google Scholar]
  12. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005; IEEE Computer Society: Washington, DC, USA, 2005; Volume 1, pp. 886–893. [Google Scholar]
  13. Iman, M.; Arabnia, H.R.; Rasheed, K. A Review of Deep Transfer Learning and Recent Advancements. Technologies 2023, 11, 40. [Google Scholar] [CrossRef]
  14. Salavati, P.; Mohammadi, H.M. Obstacle Detection Using GoogleNet. In Proceedings of the 8th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 25–26 October 2018. [Google Scholar] [CrossRef]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  16. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767v1. [Google Scholar]
  17. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. Computer Vision and Pattern Recognition. arXiv 2015, arXiv:1512.00567v3. [Google Scholar] [CrossRef]
  18. Wang, J.; He, X.; Faming, S.; Lu, G.; Cong, H.; Jiang, Q. A Real-Time Bridge Crack Detection Method Based on an Improved Inception-Resnet-v2 Structure. IEEE Access 2021, 9, 93209–93223. [Google Scholar] [CrossRef]
  19. Arora, A.; Kumar, A. HOG and SIFT Transformation Algorithms for the Underwater Image Fusion. In Proceedings of the 2021 IEEE International Conference on Technology, Research, and Innovation for Betterment of Society (TRIBES), Raipur, India, 17–19 December 2021. [Google Scholar]
  20. Erabati, G.K.; Gonçalves, N.; Araújo, H. Object Detection in Traffic Scenarios—A Comparison Of Traditional and Deep Learning Approaches; Institute of Systems and Robotics, University of Coimbra: Coimbra, Portugal, 2020; pp. 225–237, CS&IT-CSCP 2020. [Google Scholar] [CrossRef]
  21. Deng, L.; Yu, D. Deep Learning: Methods and Applications. Found. Trends Signal Process. 2016, 7, 197–387. [Google Scholar] [CrossRef]
  22. Sahu, M.; Dash, R. A Survey on Deep Learning: Convolution Neural Network (CNN). In Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies; Springer: Singapore, 2019; Volume 153. [Google Scholar] [CrossRef]
  23. Svanström, F.; Alonso-Fernandez, F.; Englund, C. A dataset for multi-sensor drone detection. Data Brief 2021, 39, 107521. [Google Scholar] [CrossRef] [PubMed]
  24. USC Drone Dataset. Available online: https://github.com/chelicynly/A-Deep-Learning-Approach-to-Drone-Monitoring (accessed on 4 December 2017).
  25. Al-Emadi, S. Saraalemadi/Droneaudiodataset. 2018. Available online: https://github.com/saraalemadi/DroneAudioDataset (accessed on 28 June 2019).
  26. Piczak, K.J. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; ACM Press: Pune, India, 2015; pp. 1015–1018, ISBN 978-1-4503-3459-4. [Google Scholar] [CrossRef]
  27. Warden, P. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv 2018, arXiv:1804.03209. [Google Scholar]
  28. Available online: https://research.google.com/audioset/ontology/aircraft_1.html (accessed on 6 March 2017).
  29. Khan, H.; Ullah, M.; Al-Machot, F.; Cheikh, F.A.; Sajjad, M. Deep Learning Based Speech Emotion Recognition For Parkinson Patient. In Proceedings of the IS&T International Symposium on Electronic Imaging 2023 Image Processing: Algorithms and Systems XXI, Online, 16–19 January 2023. [Google Scholar]
  30. Dawson, H.L.; Dubrule, O.; John, C.M. Impact of dataset size and convolutional neural network architecture on transfer learning for carbonate rock classification. Comput. Geosci. 2023, 171, 105284. [Google Scholar] [CrossRef]
  31. Xiao, Y.; Tian, Z.; Yu, J.; Zhang, Y.; Liu, S.; Du, S.; Lan, X. A review of object detection based on deep learning. Multimed. Tools Appl. 2020, 79, 23729–23791. [Google Scholar] [CrossRef]
  32. Powers, D.M.W. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63. [Google Scholar]
  33. Rakshit, H.; Ullah, M.A. A New Efficient Approach for Designing FIR Low-pass Filter and Its Application on ECG Signal for Removal of AWGN Noise. IAENG Int. J. Comput. Sci. 2016, 43, 176–183. [Google Scholar]
Figure 1. (a) Sample frames from the newly created drone dataset with various loads. (b) Sample frames from the Multi-sensor Drone dataset. (c) Sample frames from the USC Drone dataset.
Figure 2. (a) Basic block diagram of spectrogram conversion. (b) Spectrogram image of a drone audio clip.
Figure 3. (a) Flow chart of the proposed drone classification algorithm. (b) Concept of the bypass connection. (c) Leaky ReLU vs. ReLU function.
Figure 4. (a) Validation accuracy of the proposed classification model with different optimizers. (b) Elapsed time of the proposed classification model with different optimizers.
Figure 5. Time domain representation of the proposed and state-of-the-art windows with a window length of 128.
Figure 6. Normalized frequency response comparison of the proposed window and the state-of-the-art windows.
Figure 7. Normalized frequency response of the FIR low-pass filter designed with the proposed and Hamming window functions.
Figure 8. Normalized frequency response of the FIR low-pass filter designed with the proposed and Kaiser window functions.
Figure 9. Normalized frequency response of the FIR low-pass filter designed with the proposed and Gaussian window functions.
Figure 10. Performance comparison among the state-of-the-art window-based filters and the proposed window-based filter for denoising spectrograms in terms of SNR and SIR measurements.
Figure 11. Training and validation results of Resnet50 on the custom dataset.
Figure 12. Training and validation results of Inception-V3 on the custom dataset.
Figure 13. Training and validation results of Inception-Resnet-V2 on the custom dataset.
Figure 14. Training and validation results of the proposed classification algorithm on the custom dataset.
Figure 15. Relationship between validation accuracy and learning rate for various CNN models.
Figure 16. (a) Relationship between validation accuracy and number of epochs for various CNN models. (b) Relationship between elapsed time and number of epochs for various CNN models.
Figure 17. (a) Relationship between validation accuracy and mini-batch size for various CNN models. (b) Relationship between elapsed time and mini-batch size for various CNN models.
Figure 18. (a) Relationship between validation accuracy and validation frequency for various CNN models. (b) Relationship between elapsed time and mini-batch size for various CNN models.
Figure 19. Test image classification results for Resnet50.
Figure 20. Test image classification results for Inception-V3.
Figure 21. Test image classification results for Inception-Resnet-V2.
Figure 22. Test image classification results for the proposed method.
Figure 23. Precision–recall curves of the proposed model and state-of-the-art models for drone classes.
Figure 24. F1 score–threshold curves of the proposed model and state-of-the-art models for drone classes.
Table 1. Learnable Properties vs. Test Accuracy of different models.

Model                 Learnable Properties (in Millions)   Test Accuracy
Resnet50              23.50                                90%
Inception-V3          21.80                                90%
Inception-Resnet-V2   54.30                                80%
Proposed              4.00                                 100%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
