Article

RTAD: A Real-Time Animal Object Detection Model Based on a Large Selective Kernel and Channel Pruning

1 College of Information Engineering, Northwest A & F University, 3 Taicheng Road, Yangling, Xianyang 712100, China
2 Research Center of Information Technology, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Information 2023, 14(10), 535; https://doi.org/10.3390/info14100535
Submission received: 4 August 2023 / Revised: 7 September 2023 / Accepted: 28 September 2023 / Published: 30 September 2023

Abstract

Animal resources are significant to human survival and development and to ecosystem balance. Automated multi-animal object detection is critical in animal research, conservation, and ecosystem monitoring. Our objective is to design a model that mitigates the challenges posed by the large numbers of parameters and computations in existing animal object detection methods. To pursue this goal, we developed a backbone network with enhanced representational capability. This network combines the foundational structure of the Transformer model with the Large Selective Kernel (LSK) module, known for its wide receptive field. To further reduce the number of parameters and computations, we incorporated a channel pruning technique based on Fisher information to eliminate channels of lower importance. Combining these designs, we built RTAD, a real-time animal object detection model based on a Large Selective Kernel and channel pruning. The model was evaluated on a public animal dataset, AP-10K, which contains 50 annotated categories. The results demonstrate that our model has almost half the parameters of YOLOv8-s yet surpasses it by 6.2 AP. Our model provides a new solution for real-time animal object detection.

1. Introduction

Animal resources are the foundation of human survival and development, playing an irreplaceable role in maintaining ecological balance and stability [1]. Researching and protecting animals are therefore essential. Reliable automated multi-class animal object detection is vital in animal studies, conservation, and ecosystem monitoring [2]. Digital cameras and other devices can capture large amounts of animal image data. This non-contact data collection method interferes minimally with animals and is less constrained by environmental conditions; thus, it has been widely adopted.
Animal researchers can obtain valuable information, such as species and population size, by analyzing image data, which benefits animal studies [3]. However, as the number of deployed cameras increases, the amount of collected image data has grown exponentially, putting tremendous pressure on data analysis. With the development of computer vision and deep learning, animal object detection methods based on RGB images continue to emerge [2] and have been applied in practical scenarios to solve the thorny issues described above. In both research and real-world applications, researchers have developed different methods for different animals and scenarios.
As a classic method in object detection, YOLO series models are widely used in various fields due to their real-time analysis abilities, high efficiency, and high accuracy. A multi-object detection method (YOLO-BYTE) [4] based on the YOLOv7 [5] model and a self-attention and convolution mixed module (ACmix [6]) was proposed, successfully solving the missed and false detection problems caused by complex environments in individual cow detection. This method provides technical support for non-contact automatic monitoring of cows. Another study proposed the YOLOv5-ASFF object detection model for detecting different body parts of cows (such as individuals, heads, and legs) in complex scenes [7]. This model introduced ASFF [8] to learn the weights of different scale feature maps to capture the features of cow samples, achieving detection of cow features and improving the model’s generalization ability.
Pig object detection methods mainly fall into two categories. The first is the two-stage method. Yang et al. [9] used Faster R-CNN [10] to build a pig detector that captures the body, head, and tail of pigs from images; Riekert et al. [11] combined Faster R-CNN and Neural Architecture Search (NAS) to obtain higher precision on this task. One drawback of this approach is its relatively slow inference speed. The second is the one-stage method, which overcomes this issue. Sha et al. [12] utilized Removal Net as the backbone of YOLOv3 to improve the representation ability of the network and added a new branch to improve detection accuracy for small objects. Ocepek et al. [13] combined YOLOv4 [14] with feature pyramids to use features from each stage effectively. Shao et al. [15] utilized YOLOv5 in aquaculture environments to assist breeders in their work. These approaches achieve real-time running speeds but rely on a priori anchors.
Some researchers have focused on developing object detection methods for multiple animals. For example, a multi-animal object detection model was constructed by combining convolutional neural networks, achieving higher accuracy [16]. Another study utilized a two-stage model and an optimized multi-scale attention mechanism to make the model more sensitive to small objects, achieving high accuracy on the Animal-80 dataset [17]. A WilDect-YOLO model was built by incorporating residual blocks into the CSPDarknet53 backbone network, spatial pyramid pooling (SPP), and modified path aggregation network (PANet), which performed well with complex backgrounds [18], providing support for automated animal observation.
The above methods have achieved good results in animal object detection. However, they involve large numbers of parameters and high computational costs; given the limited computational resources and real-time requirements of practical applications, they need to be more efficient if researchers are to use them to analyze species and population sizes. To construct a lightweight animal object detection model, we combine the basic structure of the Vision Transformer [19] with the Large Selective Kernel (LSK) module [20], which replaces the multi-head attention mechanism in view of the latter's huge computational cost. The LSK module expands the receptive field and extracts spatial information over a larger range, which is very helpful for enhancing the model's representation capability. In the neck section of the network, we utilize a feature pyramid structure, which is simple and cost-effective, to merge the output feature maps from different stages.
To further reduce the model’s parameters, we apply channel pruning based on Fisher information [21] to the constructed animal object detection model, removing low-importance channels. This method can consistently handle coupled layers from a global perspective during the pruning process. As a result, we obtain a real-time animal object detection model based on the LSK module and channel pruning (RTAD) that has significantly reduced parameters but exhibits an improved model accuracy. To verify the model’s effectiveness, we conduct a series of experiments on the publicly available AP-10K dataset [22], and the results demonstrate the advantages of our model.
In this research, our contributions are as follows:
  • Using the basic structure of the Transformer model and the LSK module with a larger receptive field, we created a powerful backbone network with a strong representation ability.
  • To further reduce parameters and computational cost, we introduced channel pruning based on Fisher information to remove low-importance channels.
  • Our RTAD has less than half the parameters of YOLOv8-s and surpasses it by 6.2 AP.

2. Methods

2.1. Backbone

The backbone network consists of four stages (as shown in Figure 1), each containing a Patch Embedding module and a basic block. In the first stage, the Patch Embedding module [23] divides the input image into multiple overlapping small patches and maps each patch to a feature vector. It uses a convolution layer to perform this mapping and adds a normalization layer after the convolution layer. The input image X ∈ R^{3×H×W} is divided by a convolution operation with a stride of 4 and a kernel size of 7 × 7, yielding (H/4) × (W/4) patches. In the subsequent three stages, the stride of the Patch Embedding module is adjusted to 2 and the kernel size is reduced to 3; this also performs downsampling.
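For reference, the following is a minimal PyTorch sketch of such an overlapping patch embedding; the class name and the use of batch normalization are our assumptions, since the text only specifies the convolution kernel sizes and strides.

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Overlapping patch embedding: a strided convolution followed by normalization.
    Sketch based on the description above; the normalization type is assumed."""
    def __init__(self, in_channels, embed_dim, kernel_size=7, stride=4):
        super().__init__()
        # Overlapping patches: the kernel is larger than the stride.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size,
                              stride=stride, padding=kernel_size // 2)
        self.norm = nn.BatchNorm2d(embed_dim)

    def forward(self, x):                    # x: (B, C, H, W)
        return self.norm(self.proj(x))       # (B, embed_dim, H/stride, W/stride)

# Stage 1: 3 -> 32 channels, kernel 7, stride 4; later stages use kernel 3, stride 2.
stage1_embed = PatchEmbed(3, 32, kernel_size=7, stride=4)
stage2_embed = PatchEmbed(32, 64, kernel_size=3, stride=2)
```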
Each basic block consists of an attention module and an MLP module. Each stage of the feature extraction network has a different depth and feature dimension: the depth is the number of basic blocks in the stage, and the feature dimension is the length of each feature vector. In the four stages, the channel numbers C1, C2, C3, C4 are 32, 64, 128, and 256, and the numbers of basic blocks B1, B2, B3, B4 are 3, 3, 5, and 2. The channel and basic block numbers follow models like ViT and CSWin Transformer [24].
A feature pyramid module [25] is placed after the backbone to fuse the feature maps output by its last three stages, fully integrating feature maps with rich texture and semantic information. It is worth noting that the feature maps output by the first stage are not fused: the input size in object detection tasks is usually large, and despite the downsampling applied in the initial stage, a significant amount of low-level texture information remains, so including this stage in the fusion process would introduce more computational overhead. This design is consistent with YOLO series models [14,26,27,28], where only the outputs of the later stages are fused in the neck network rather than all stages.
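For illustration, a minimal top-down feature pyramid in PyTorch is sketched below; the class name, unified output channel count, and nearest-neighbor upsampling are assumptions rather than the exact neck configuration used in RTAD.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal sketch of a feature pyramid fusing the last three backbone stages
    (channel counts 64, 128, 256 follow the backbone description above)."""
    def __init__(self, in_channels=(64, 128, 256), out_channels=64):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):                # feats: [stage2, stage3, stage4], high to low resolution
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample the coarser map and add it to the finer lateral.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]
```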

2.2. Basic Block

The basic block is the fundamental unit of the feature extraction network and is critical to the model's representational capacity and actual task performance. In animal detection, the environment is complex and occlusion is ubiquitous, so the basic block should have a large spatial receptive field that captures long-range information to adapt to such complicated scenes. Previous YOLO series models typically used 3 × 3 convolutions for feature extraction, which limited their receptive field. The LSK module is introduced in our basic block to overcome the limitations of typical convolution models while adding minimal computational cost and parameters.
The overall structure of the basic block used in this paper is similar to the basic structure of Transformers. Each block consists of an attention module and an MLP module. The attention module extracts relationships between features and adds them to the input through residual connections. The MLP module processes features through two convolution layers and a GELU activation function to refine and fuse features in the channel dimension.
The basic block's input feature map X ∈ R^{C×H×W} undergoes batch normalization and enters the attention module. The core of the attention module is the LSK block; pointwise convolutions are placed before and after the LSK block to integrate information along the channel dimension.
LSK module. Traditional convolution operations use fixed-size kernels, which can only capture features of specific scales and directions. However, this module introduces convolutional kernels of different scales and directions, allowing it to perceive features at multiple scales and directions simultaneously. It explicitly generates multiple features with various receptive fields, making subsequent regression and classification tasks easier. Additionally, combining depth-wise convolutions with dilated convolutions helps achieve a larger receptive field with a smaller computational cost. These advantages all contribute to the superior performance of LSK modules.
The specific process of the LSK block is shown in Algorithm 1. Firstly, a 5 × 5 depthwise convolution is used to extract features at a small computational cost, forming a feature map M1. To further enlarge the receptive field and model the sample information more effectively, a 7 × 7 depthwise convolution with a stride of 1, padding of 9, and dilation rate of 3 is used to extract features over a larger range, obtaining M2. Then, two pointwise convolutions compress the channels of M1 and M2 to obtain M3 and M4, respectively, integrating information along the channel direction; the compressed feature maps each have a size of (C/2) × H × W.
The two feature maps of size (C/2) × H × W are concatenated along the channel direction to form a new feature map M5 of size C × H × W. The channel-wise average and maximum values of M5 are computed and concatenated along the channel dimension to form Agg ∈ R^{2×H×W}. Then, Agg is fed into a 7 × 7 convolution with two input and two output channels, followed by a sigmoid activation function, to obtain Agg′ ∈ R^{2×H×W}. M3 and M4 are multiplied by the first and second channels of Agg′, respectively, and added together to obtain Atten ∈ R^{(C/2)×H×W}. Atten is expanded back to C channels using a pointwise convolution to obtain the final Atten′ ∈ R^{C×H×W}. The input X is multiplied by Atten′, and the result is returned.
Algorithm 1 LSK module
Require: channels, X ∈ R^{C×H×W}
1: Initialize Conv0 with Conv2d(channels, channels, 5, padding = 2, groups = channels)
2: Initialize Conv1 with Conv2d(channels, channels, 7, stride = 1, padding = 9, groups = channels, dilation = 3)
3: Initialize Conv2 with Conv2d(channels, channels//2, 1)
4: Initialize Conv3 with Conv2d(channels, channels//2, 1)
5: Initialize Conv4 with Conv2d(2, 2, 7, padding = 3)
6: Initialize Conv5 with Conv2d(channels//2, channels, 1)
7: M1 ← Conv0(X)
8: M2 ← Conv1(M1)
9: M3 ← Conv2(M1)
10: M4 ← Conv3(M2)
11: M5 ← Concatenate([M3, M4], axis = 1)
12: AttnAvg ← Mean(M5, axis = 1, keepdims = True)
13: AttnMax ← Max(M5, axis = 1, keepdims = True)
14: Agg ← Concatenate([AttnAvg, AttnMax], axis = 1)
15: Agg′ ← Sigmoid(Conv4(Agg))
16: Atten ← M3 ∗ Agg′[:,0,:,:].unsqueeze(1) + M4 ∗ Agg′[:,1,:,:].unsqueeze(1)
17: Atten′ ← Conv5(Atten)
18: return X ∗ Atten′
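For reference, a runnable PyTorch rendering of Algorithm 1 is sketched below; the class name and the example tensor sizes are illustrative, and the code follows the pseudocode above rather than any official implementation.

```python
import torch
import torch.nn as nn

class LSKModule(nn.Module):
    """PyTorch sketch of Algorithm 1 (Large Selective Kernel attention)."""
    def __init__(self, channels):
        super().__init__()
        # 5x5 depthwise convolution: local features at low cost (M1).
        self.conv0 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # 7x7 depthwise convolution with dilation 3: larger receptive field (M2).
        self.conv1 = nn.Conv2d(channels, channels, 7, stride=1, padding=9,
                               groups=channels, dilation=3)
        # Pointwise convolutions compress each branch to C/2 channels (M3, M4).
        self.conv2 = nn.Conv2d(channels, channels // 2, 1)
        self.conv3 = nn.Conv2d(channels, channels // 2, 1)
        # Spatial gate over the (average, maximum) channel statistics.
        self.conv4 = nn.Conv2d(2, 2, 7, padding=3)
        # Pointwise convolution restores the channel count.
        self.conv5 = nn.Conv2d(channels // 2, channels, 1)

    def forward(self, x):
        m1 = self.conv0(x)
        m2 = self.conv1(m1)
        m3 = self.conv2(m1)
        m4 = self.conv3(m2)
        m5 = torch.cat([m3, m4], dim=1)                          # C x H x W
        agg = torch.cat([m5.mean(dim=1, keepdim=True),
                         m5.amax(dim=1, keepdim=True)], dim=1)   # 2 x H x W
        gate = torch.sigmoid(self.conv4(agg))                    # Agg'
        atten = m3 * gate[:, 0:1] + m4 * gate[:, 1:2]            # C/2 x H x W
        atten = self.conv5(atten)                                # back to C channels
        return x * atten

# Quick shape check with an arbitrary feature map size.
out = LSKModule(64)(torch.randn(1, 64, 40, 40))   # -> torch.Size([1, 64, 40, 40])
```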

2.3. Shared Parameter Object Detector

Early YOLO models used a coupled object detector [14], which caused interference between the classification and regression tasks. YOLOX [27] decouples the two tasks into separate classification and regression branches. The convention in YOLO series models is to use an independent detector on each output feature map to detect objects at different scales. However, this gives each scale its own set of parameters, which underutilizes them, since objects detected on feature maps of different scales usually share some similarity once scale is accounted for. On the other hand, features at different scales have natural statistical differences, so normalization operations are indispensable [28]. A fully shared parameter detector cannot handle these differences between scales well; we therefore share the detector parameters across scales while introducing independent normalization operations for each scale.
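The sketch below shows one way such a head could be wired in PyTorch: convolution weights shared across scales with a separate normalization layer per scale. The class name, layer count, and the choice of BatchNorm are assumptions for illustration, not the authors' exact detector.

```python
import torch
import torch.nn as nn

class SharedParamHead(nn.Module):
    """Detection head whose convolution weights are shared across all scales,
    while each scale keeps its own normalization statistics."""
    def __init__(self, channels, num_scales=3, num_classes=50):
        super().__init__()
        self.shared_conv = nn.Conv2d(channels, channels, 3, padding=1)   # shared weights
        self.norms = nn.ModuleList([nn.BatchNorm2d(channels) for _ in range(num_scales)])
        self.cls_out = nn.Conv2d(channels, num_classes, 1)               # classification branch
        self.reg_out = nn.Conv2d(channels, 4, 1)                         # box regression branch

    def forward(self, feats):                 # feats: list of per-scale feature maps
        outputs = []
        for i, f in enumerate(feats):
            f = torch.relu(self.norms[i](self.shared_conv(f)))  # scale-specific normalization
            outputs.append((self.cls_out(f), self.reg_out(f)))
        return outputs
```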

2.4. Pruning Algorithm Based on Fisher Information

Our model uses a lightweight backbone network and a simple feature pyramid module. Channel pruning is introduced to further reduce the model's parameters and computation, with the goal of cutting off unimportant channels during the pruning process. Previous pruning methods relied on batch normalization modules and could not handle coupled channels well; moreover, they focused only on reducing channels, whereas reducing memory access can bring greater performance improvements to the model.
A pruning method based on Fisher information [21] is used to fine-tune the animal object detection model. This method does not require layer-by-layer pruning sensitivity analyses; instead, it controls the pruning proportion from a global perspective. It first sets a group of binary masks marking the status of each channel: a value of 1 means the channel is kept, and a value of 0 means the channel is cut off. The animal object detection model constructed in this paper uses many depthwise convolutions, residual connections, and other structures that contain many coupled channels. This should be taken into account during pruning, and coupled channels should be cut off together.
Layers with coupled channels are found through a depth-first search and classified into groups; channels in the same group share the same mask so that they are reduced at the same time. Channel pruning only affects the channel dimension, so only convolutional layers and fully connected layers are considered, and only the parent layer P_i of these two components is searched during the search process. Layers with the same parent node are assigned to the same group; for example, when a residual connection gives one layer two sub-layers, those sub-layers are classified into the same group. If the parent layer is a grouped convolutional layer, it is grouped with the current layer.
After the grouping is completed, the pruning operation begins. Pruning aims to cut off channels with lower importance to ensure that the model’s parameter volume and computation are reduced without affecting the model’s performance. The importance of each channel is measured based on Fisher information [21]:
$$ s_i = \Gamma(m - e_i) - \Gamma(m) \approx -e_i^{T} g + \frac{1}{2} e_i^{T} H e_i = -g_i + \frac{1}{2} H_{ii} \qquad (1) $$
In Equation (1), m is an all-one vector, e_i is a one-hot vector indicating the i-th channel, g is the gradient, and H is the Hessian matrix whose dimensions equal the number of channels. We use a Taylor expansion of the loss function Γ to approximate its change when a channel is removed. Because the Taylor expansion only approximates the original function, an "approximately equal to" sign appears in the derivation; by the fundamental properties of the Taylor expansion, the higher-order error can be ignored.
Each diagonal element of the Hessian corresponds to the variance of the sample gradients. During computation, the Fisher information converts second-order derivatives into the square of first-order derivatives, thereby turning the variance calculation into the expectation of the squared sample gradients, which can be obtained during backpropagation.
The algorithm also takes memory access into account and quantifies memory overhead in the importance measurement. ΔM_i = n × h × w describes the reduction in memory if the channel is pruned, and the final channel importance is s_i / ΔM_i. The least important channels that occupy the most memory are pruned first.
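As an illustration of this measurement, the sketch below computes a per-channel importance from one backward pass using the approximation described above (the second-order term replaced by the expected square of the per-sample first-order term) and divides it by the memory reduction; the function name and tensor shapes are assumptions.

```python
import torch

def fisher_channel_importance(activation, grad, memory_per_channel):
    """activation, grad: (N, C, H, W) feature map and its gradient for one batch.
    memory_per_channel: Delta M_i = n * h * w for the layer being scored."""
    # Per-sample, per-channel first-order term: sum over space of a * dL/da.
    g = (activation * grad).sum(dim=(2, 3))          # (N, C)
    # Fisher approximation: expectation of the squared sample gradients.
    fisher = 0.5 * (g ** 2).mean(dim=0)              # (C,)
    # Normalize by the memory freed if the channel is pruned (s_i / Delta M_i).
    return fisher / memory_per_channel
```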

2.5. Implementation Details

2.5.1. Label Assignment

To create an end-to-end animal object detector, the prediction results must be matched with ground truth boxes through a label assignment strategy at each scale. A dynamic assignment strategy is employed because the previous approach, which uses a cost function consistent with the training loss as the matching criterion, has limitations. Hence, this paper utilizes a dynamic soft label assignment strategy based on SimOTA [27].
$$ \begin{aligned} C &= \lambda_1 C_{classification} + \lambda_2 C_{regression} + \lambda_3 C_{center} \\ C_{classification} &= CE(P, Y_{soft}) \times (Y_{soft} - P)^2 \\ C_{regression} &= -\log(IoU) \\ C_{center} &= \alpha^{|x_{pred} - x_{groundtruth}| - \beta} \end{aligned} \qquad (2) $$
Three main parts are involved in Equation (2): the classification cost, the regression cost, and the region cost. λ1, λ2, and λ3 are the weights of the three costs and are set to 1, 3, and 1, respectively. The classification cost C_classification is essential to the object detection task: the soft label Y_soft is calculated from the Intersection over Union (IoU) between the predictions P and the ground truth boxes and is used to train the classification branch, which allows the classification loss to be reweighted according to regression quality [28]. Using the Generalized IoU as the regression cost C_regression may not effectively capture the difference between the best and worst matches, so we employ the negative logarithm of the IoU instead, which amplifies the cost when the matching value is low. For the region cost, we use a soft center region cost C_center instead of a fixed anchor box to ensure stable matching; α and β are two hyperparameters set to 10 and 3, respectively. This helps to stabilize the matching process and overcomes the limitations of using prior information.
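A simplified sketch of this matching cost is given below; the tensor names, shapes, and the helper itself are illustrative assumptions, with the weights and hyperparameters taken from the text.

```python
import torch
import torch.nn.functional as F

def assignment_cost(pred_scores, pred_ious, center_dist,
                    lambda1=1.0, lambda2=3.0, lambda3=1.0, alpha=10.0, beta=3.0):
    """pred_scores: predicted probabilities for each GT's class, shape (num_preds, num_gts).
    pred_ious:   IoU between predictions and GT boxes, same shape (used as Y_soft).
    center_dist: normalized distance between prediction points and GT centers, same shape."""
    y_soft = pred_ious
    eps = 1e-7
    # Soft classification cost: cross-entropy against the IoU-based soft label,
    # reweighted by the gap between the soft label and the prediction.
    cls_cost = F.binary_cross_entropy(pred_scores, y_soft, reduction="none") \
        * (y_soft - pred_scores).pow(2)
    reg_cost = -torch.log(pred_ious + eps)            # -log(IoU)
    center_cost = alpha ** (center_dist - beta)       # soft center region cost
    return lambda1 * cls_cost + lambda2 * reg_cost + lambda3 * center_cost
```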

2.5.2. Data Augmentation

Cached Mosaic: This technique enhances image diversity by combining multiple images of different animal categories into a new image. It utilizes pre-processed pixels from a cache to improve computational efficiency and accelerate image processing [14,28].
MixUp: This technique blends the pixels of two images to create a new sample. Specifically, MixUp generates a new image by interpolating between two input images: it mixes their pixels in proportion and assigns the new image the weighted average of the corresponding labels [29].
These two methods combined enhance the diversity of samples, helping to improve the detection performance of the model and reduce the risk of overfitting.
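For MixUp in a detection setting, a minimal sketch is shown below; it assumes both images already share one size, and the box loss-weight handling is an illustrative choice rather than the exact pipeline used here.

```python
import torch

def detection_mixup(img_a, boxes_a, labels_a, img_b, boxes_b, labels_b, ratio=0.5):
    """Blend two images pixel-wise and keep the boxes of both, with per-box
    weights proportional to the mixing ratio."""
    mixed = ratio * img_a.float() + (1.0 - ratio) * img_b.float()
    boxes = torch.cat([boxes_a, boxes_b], dim=0)
    labels = torch.cat([labels_a, labels_b], dim=0)
    weights = torch.cat([torch.full((len(boxes_a),), ratio),
                         torch.full((len(boxes_b),), 1.0 - ratio)])
    return mixed, boxes, labels, weights
```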

2.5.3. Two-Stage Training Approach

Using Cached Mosaic and MixUp involves many rotation and cropping operations, which can cause misalignment between the annotated boxes and the transformed ones [28]. This phenomenon leads to the model learning some noisy information, which is detrimental to the improvement of model performance [27].
Therefore, this paper adopts a two-stage training approach. In the initial stage, Cached Mosaic and MixUp are used for data augmentation to enhance the robustness of the model. In the second stage, relatively weaker data augmentation techniques, such as random resizing and flipping, are employed.
In the final 20 epochs, we utilize Large Scale Jittering (LSJ) [30] to fine-tune the model's parameters within a range closer to the real data distribution. Considering the characteristics of the model and the need for stable training, we use AdamW as the optimizer, which has been widely used in training Transformer models.

2.6. Experiment

This section validates the proposed animal object detection model on the public dataset AP-10K [22]. The configuration of the experimental environment, dataset composition, evaluation metrics, and hyperparameter settings are described. Then, our model is compared with other state-of-the-art animal object detection models. The effectiveness of pruning operations is also examined in the experiment.

2.7. Dataset

The AP-10K dataset is a large-scale mammal dataset proposed jointly by the Jingdong Exploration Research Institute, Xidian University, and the University of Sydney. It contains 50 species of mammals and a total of 10,015 images. The photos were taken with hardly any additional human interference, so the animals appear in an almost natural state. The dataset includes enough species and sufficiently complex environments, and its distribution resembles the real world, allowing for practical usage analysis [22].
The annotation format of this dataset is consistent with that of the COCO dataset [31]. The dataset was divided into training and testing sets at a 7:3 ratio. Figure 2 shows the animal species and the number of images for each species in the dataset. The distribution of the height-to-width ratios of the bounding boxes for each category is shown in Figure 3; the aspect ratios of most samples are relatively similar, with only a few individual cases having large aspect ratios.

2.8. Experimental Setup

To further demonstrate the advantages of our model, we compared it with other well-performing animal object detection models. The training and testing datasets had the same distribution to ensure a fair evaluation. All experiments used the PyTorch 1.10 and mmdetection 3.0 deep learning frameworks with an NVIDIA GeForce RTX 3090 GPU (24 GB).
The AdamW optimizer was used with a base learning rate of 0.004 and a weight decay of 0.05 (biases and normalization parameters were excluded, with their weight decay set to 0). The momentum was 0.9, and the batch size was 24. The learning rate followed a Flat-Cosine schedule over 300 training epochs, with a warmup of 1000 batches. The model input size was 640 × 640 pixels. The data augmentation pipeline used Cached Mosaic and MixUp for the first 280 epochs and LSJ for the last 20 epochs. The detailed configuration is shown in Table 1.
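The optimizer settings above could be configured in plain PyTorch roughly as follows; the parameter-group split and function name are our assumptions, and the warmup and Flat-Cosine schedule are omitted for brevity.

```python
import torch

def build_optimizer(model, base_lr=0.004, weight_decay=0.05):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Biases and normalization parameters are excluded from weight decay.
        if p.ndim == 1 or name.endswith(".bias"):
            no_decay.append(p)
        else:
            decay.append(p)
    # AdamW with beta1 = 0.9 corresponding to the stated momentum.
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=base_lr, betas=(0.9, 0.999))
```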
After model training was completed, a pruning stage was introduced. To explore the impact of the pruning ratio on the results, we compared different ratios, focusing on two indicators: average precision and inference speed. After pruning, we fine-tuned the pruned model for 300 epochs using the same hyperparameters as in training.

2.9. Evaluation Metrics

The following metrics were used to evaluate the performance of the model: AP, AR, FLOPs, and FPS (frames per second). TP, FP, and FN denote true positives, false positives, and false negatives, respectively. The formulas for recall and precision are given below:
$$ Recall = \frac{TP}{TP + FN} $$
$$ Precision = \frac{TP}{TP + FP} $$
$$ AP = \int_{0}^{1} P(r)\,dr $$
$$ AR = 2 \int_{0.5}^{1} R(o)\,do $$
The precision–recall curve defines the function P(r), and AP is calculated by integrating P(r). AR is calculated by averaging recall over IoU thresholds between 0.5 and 1 for each ground truth box and its best-matching detection. AR_{max=i} denotes the average recall when at most i detection results are kept per image [31].
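As a rough illustration of the AP integral (the actual evaluation follows the COCO protocol [31]), a NumPy sketch is shown below; the function name and the all-points interpolation are assumptions.

```python
import numpy as np

def average_precision(precision, recall):
    """precision, recall: 1-D arrays from a precision-recall curve,
    ordered by increasing recall."""
    # Monotonic envelope: precision at recall r is the best precision at recall >= r.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Piecewise-constant integral of P(r) over recall.
    widths = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(widths * precision))
```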

3. Experimental Results

Table 2 shows the impact of different pruning ratios on the experimental results. The Retention Ratio column lists the ratios used, ranging from 1 to 0.7; the AP column gives the average precision obtained at each ratio, and the FPS column gives the inference speed in frames per second. The best AP is obtained when the Retention Ratio is set to 0.9. When the ratio is reduced further, fewer channels are retained and the AP decreases. It is worth noting that FPS shows no clear trend as the ratio changes.
As shown in Table 3, our model achieves an average precision (AP) of 71.7, surpassing other well-performing object detection models such as YOLOX and YOLOv8. Regarding inference speed, the YOLOv8-s model achieves the highest FPS value of 97.9, indicating its ability to process many frames per second; our model ranks second, ahead of the other models, while Tood-Res101 and Vfnet-Res101 are much slower, with FPS values of 28 and 31.4, respectively. Combined with its leading accuracy, this indicates the superior performance of our model in detecting animals in the AP-10K dataset.
Our model also excels in average recall and has the smallest computational and parameter counts of the models in the table. One significant advantage of our model is its efficient use of resources: it requires only 10.54 GFLOPs and 4.73 M parameters, which reduce to 9.57 GFLOPs and 4.53 M parameters after pruning. A comparison of channel numbers before and after pruning is shown in Figure 4; the pruned channels are primarily concentrated in the front and rear parts of the model, with fewer in the middle section.
Our model has only about half the parameters of YOLOX-s, RTMDET-s, and YOLOv8-s, together with a lower computational cost, which fully demonstrates the advantages of our model structure. These metrics indicate that our model achieves excellent accuracy while minimizing computational requirements. At the same time, models with shared-parameter detection heads generally have an advantage: RTMDET-s, RTMDET-m, and our model all adopt this design and surpass YOLOX on multiple evaluation metrics, which shows the superiority of this component.
After pruning, our model has lower parameter and computational costs and shows improvements in almost all evaluation metrics, which fully demonstrates the effectiveness of channel pruning; the pruning operation indeed removes channels of low importance. The smaller size also improves inference speed, and the compact model is easier to deploy, especially in resource-constrained environments or applications with limited storage capacity. The model thus provides a more efficient and accurate solution for various object detection scenarios.
In Figure 5a, the cows' poses vary and their coat patterns differ considerably, with some obstruction by the fence, but our model still accurately boxes the cows in the image. In Figure 5b, two dogs occlude each other, with one dog's head obstructed, but our model still obtains the bounding boxes of both dogs. In Figure 5c, the background of the jaguar is quite complex, with reflective water waves and bushes intersecting the edges of its body, but the detection box still accurately encloses the jaguar. In Figure 5d, two polar bears are playing in the water, revealing only part of their bodies and occluding each other, but our model still boxes the two bears separately. In Figure 5e, one raccoon is climbing on the back of another, and their fur colors are very similar, which makes obtaining the detection boxes challenging, but our model still boxes both raccoons. These visualization results show that our model can obtain animal bounding boxes in complex backgrounds, even with occlusion and interference from the surrounding scenery.

4. Discussion

We used Transformer-based basic blocks to build the model but no longer used the computationally complex multi-head attention mechanism; instead, we used attention based on the LSK module to expand the model's receptive field. Feature fusion in the neck network was achieved with basic feature pyramid components at minimal computational cost. The experimental data show that our RTAD model structure outperforms the other models in terms of accuracy. A two-stage training approach was utilized, combining the benefits of multiple data augmentation techniques.
We simplified the model's channels with the channel pruning algorithm based on Fisher information. The model's performance improved on multiple indicators without any decrease, and the pruned model has higher accuracy with a reduced computational cost and a satisfactory processing speed. These observations indicate that the pruning algorithm successfully removed channels of lower importance, as intended. During the experiments, the impact of the pruning hyperparameter was analyzed: pruning too many channels led to a decline in performance, implying that some important channels were pruned along with the unimportant ones in order to meet the preset pruning ratio. These advantages make our model a promising solution for scenarios where accuracy, efficiency, and compactness are prioritized in object detection.

5. Conclusions

A real-time animal object detection model based on the LSK module and channel pruning is proposed in this paper. Its AP reaches 71.7 with only 4.53 M parameters, exceeding YOLOX-s, YOLOv8-s, and RTMDET-s with approximately half the parameters. The model achieves this performance thanks to the Transformer-based basic block and the LSK module, which expand its representational capability, while channel pruning yields faster inference. Validation on a public dataset fully demonstrated its superiority, and the visualized results showed that our model can accurately detect animal bounding boxes against complex backgrounds, further demonstrating its reliability.
We have made progress in the basic task of animal object detection, which is beneficial for improving the accuracy of downstream tasks such as animal pose estimation and behavior detection. This study provides solid foundational support for animal population surveys in zoology. In the future, we will design more efficient model structures to further improve inference speed and detection accuracy and will deploy the model in practical scenarios.

Author Contributions

Conceptualization, S.L. (Sicong Liu) and Q.F.; methodology, S.L. (Sicong Liu) and Q.F.; software, Q.F. and S.L. (Sicong Liu); validation, Q.F. and S.L. (Sicong Liu); data curation, Q.F. and S.L. (Sicong Liu); writing—original draft preparation, Q.F. and S.L. (Sicong Liu); writing—review and editing, Q.F. and S.L. (Sicong Liu); visualization, S.L. (Sicong Liu); supervision, S.L. (Shuqin Li) and C.Z.; project administration, S.L. (Shuqin Li). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

Thanks to Hang Yu of Xidian University, China, for making the AP-10K dataset publicly available; this provided great help for our research.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ViT	Vision Transformer
LSK	Large Selective Kernel
SPP	Spatial Pyramid Pooling
ACmix	Self-Attention and Convolution Mixed Module

References

  1. Díaz, S.; Fargione, J.; Chapin, F.S.; Tilman, D. Biodiversity Loss Threatens Human Well-Being. PLoS Biol. 2006, 4, e277.
  2. Ukwuoma, C.C.; Qin, Z.; Yussif, S.B.; Happy, M.N.; Nneji, G.U.; Urama, G.C.; Ukwuoma, C.D.; Darkwa, N.B.; Agobah, H. Animal species detection and classification framework based on modified multi-scale attention mechanism and feature pyramid network. Sci. Afr. 2022, 16, e01151.
  3. Neethirajan, S. Recent advances in wearable sensors for animal health management. Sens. Bio-Sens. Res. 2017, 12, 15–29.
  4. Zheng, Z.; Li, J.; Qin, L. YOLO-BYTE: An efficient multi-object tracking algorithm for automatic monitoring of dairy cows. Comput. Electron. Agric. 2023, 209, 107857.
  5. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
  6. Pan, X.; Ge, C.; Lu, R.; Song, S.; Chen, G.; Huang, Z.; Huang, G. On the Integration of Self-Attention and Convolution. arXiv 2022, arXiv:2111.14556.
  7. Qiao, Y.; Guo, Y.; He, D. Cattle body detection based on YOLOv5-ASFF for precision livestock farming. Comput. Electron. Agric. 2023, 204, 107579.
  8. Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516.
  9. Yang, Q.; Xiao, D.; Cai, J. Pig mounting behaviour recognition based on video spatial–temporal features. Biosyst. Eng. 2021, 206, 55–66.
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28.
  11. Riekert, M.; Klein, A.; Adrion, F.; Hoffmann, C.; Gallmann, E. Automatically detecting pig position and posture by 2D camera imaging and deep learning. Comput. Electron. Agric. 2020, 174, 105391.
  12. Sha, J.; Zeng, G.L.; Xu, Z.F.; Yang, Y. A light-weight and accurate pig detection method based on complex scenes. Multimed. Tools Appl. 2023, 82, 13649–13665.
  13. Ocepek, M.; Žnidar, A.; Lavrič, M.; Škorjanc, D.; Andersen, I.L. DigiPig: First Developments of an Automated Monitoring System for Body, Head and Tail Detection in Intensive Pig Farming. Agriculture 2021, 12, 2.
  14. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
  15. Shao, H.; Pu, J.; Mu, J. Pig-Posture Recognition Based on Computer Vision: Dataset and Exploration. Animals 2021, 11, 1295.
  16. Maheswari, M.; Josephine, M.; Jeyabalaraja, V. Customized deep neural network model for autonomous and efficient surveillance of wildlife in national parks. Comput. Electr. Eng. 2022, 100, 107913.
  17. Ulhaq, A.; Adams, P.; Cox, T.E.; Khan, A.; Low, T.; Paul, M. Automated Detection of Animals in Low-Resolution Airborne Thermal Imagery. Remote Sens. 2021, 13, 3276.
  18. Roy, A.M.; Bhaduri, J.; Kumar, T.; Raj, K. WilDect-YOLO: An efficient and robust computer vision-based accurate object localization model for automated endangered wildlife detection. Ecol. Inform. 2023, 75, 101919.
  19. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022.
  20. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large Selective Kernel Network for Remote Sensing Object Detection. arXiv 2023, arXiv:2303.09030.
  21. Liu, L.; Zhang, S.; Kuang, Z.; Zhou, A.; Xue, J.H.; Wang, X.; Chen, Y.; Yang, W.; Liao, Q.; Zhang, W. Group Fisher Pruning for Practical Network Compression. arXiv 2021, arXiv:2108.00708.
  22. Yu, H.; Xu, Y.; Zhang, J.; Zhao, W.; Guan, Z.; Tao, D. AP-10K: A Benchmark for Animal Pose Estimation in the Wild. arXiv 2021, arXiv:2108.12617.
  23. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203.
  24. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. arXiv 2022, arXiv:2107.00652.
  25. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. arXiv 2017, arXiv:1612.03144.
  26. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2016, arXiv:1506.02640.
  27. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
  28. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An Empirical Study of Designing Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784.
  29. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. arXiv 2018, arXiv:1710.09412.
  30. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation. arXiv 2021, arXiv:2012.07177.
  31. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V; Springer International Publishing: Berlin/Heidelberg, Germany, 2014.
  32. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 11 September 2022).
  33. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned One-stage Object Detection. arXiv 2021, arXiv:2108.07755.
  34. Zhang, H.; Wang, Y.; Dayoub, F.; Sünderhauf, N. VarifocalNet: An IoU-aware Dense Object Detector. arXiv 2021, arXiv:2008.13367.
Figure 1. Architecture diagram showing the parameter settings of the backbone network, as well as the detailed configurations of the basic blocks and attention mechanisms.
Figure 2. The number of samples for each category in the dataset.
Figure 3. The height/width ratio of the bounding box for each category in the dataset.
Figure 4. Comparison of channel numbers before and after pruning.
Figure 5. (a–e) Object detection results for different animals. In each panel, the left side is the original image, and the right side shows the detection result.
Table 1. Configuration during model training.

Configuration      | Item
Framework          | PyTorch 1.12
GPU                | NVIDIA GeForce RTX 3090
Optimizer          | AdamW
Data augmentation  | Mosaic, MixUp (first 280 epochs), LSJ (last 20 epochs)
Operating system   | Ubuntu 18.04
Table 2. The impact of different Retention Ratios on the experimental results during the pruning process.

Retention Ratio | AP   | FPS
1.0             | 70.8 | 75.1
0.9             | 71.7 | 75.2
0.8             | 71.3 | 75.3
0.7             | 70.5 | 75.1
Table 3. The performance of each model on the AP-10K dataset. The model inference speed in FPS is measured with batch size = 1 and a single thread. GFLOPs encompass the overall process of the model.

Model               | AP   | AP_0.5 | AP_0.75 | AP_M | AP_L | AR_max=1 | AR_max=10 | AR_max=100 | FPS  | GFLOPs | Param# (M)
RTMDET-m [28]       | 69.3 | 89.3   | 78.0    | 33.6 | 70.3 | 64.5     | 79.8      | 81.1       | 74.1 | 39.16  | 24.69
RTMDET-s [28]       | 68.1 | 89.0   | 76.9    | 30.0 | 69.1 | 63.5     | 78.5      | 79.8       | 74.6 | 14.81  | 8.88
YOLOv8-s [32]       | 65.5 | 86.2   | 73.6    | 27.4 | 66.6 | 62.9     | 76.8      | 77.3       | 97.9 | 14.33  | 11.16
YOLOX-m [27]        | 67.5 | 88.0   | 76.2    | 34.9 | 68.5 | 62.2     | 74.4      | 74.6       | 55.2 | 36.84  | 25.31
YOLOX-s [27]        | 61.2 | 84.6   | 69.5    | 32.6 | 62.1 | 58.6     | 70.2      | 70.4       | 68.4 | 13.38  | 8.96
Tood-Res101 [33]    | 66.5 | 87.0   | 74.7    | 25.2 | 67.6 | 64.4     | 74.8      | 74.8       | 28.0 | 103.65 | 50.91
Vfnet-Res101 [34]   | 63.0 | 83.6   | 70.3    | 19.6 | 64.2 | 62.0     | 72.6      | 72.7       | 31.4 | 107.03 | 51.6
Ours (w/o pruning)  | 70.8 | 91.3   | 79.9    | 35.5 | 71.7 | 64.8     | 79.8      | 81.1       | 75.1 | 10.54  | 4.73
Ours (with pruning) | 71.7 | 92.1   | 80.5    | 35.4 | 72.5 | 64.9     | 80.0      | 81.3       | 75.3 | 9.57   | 4.53
