Article

A Weakly Supervised Hybrid Lightweight Network for Efficient Crowd Counting

School of Electrical and Electronic Engineering, Shanghai Institute of Technology, Shanghai 201400, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(4), 723; https://doi.org/10.3390/electronics13040723
Submission received: 24 December 2023 / Revised: 24 January 2024 / Accepted: 24 January 2024 / Published: 10 February 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

Lightweight crowd-counting networks have become the mainstream way to deploy crowd-counting techniques on resource-constrained devices. Significant progress has been made in this field, with many outstanding lightweight models proposed in succession. However, challenges such as scale variation, global feature extraction, and fine-grained head annotation requirements still exist in relevant tasks, necessitating further improvement. In this article, we propose a weakly supervised hybrid lightweight crowd-counting network that integrates the initial layers of GhostNet as the backbone to efficiently extract local features and enrich intermediate features. The incorporation of a modified Swin-Transformer block addresses the need for effective global context information. A Pyramid Pooling Aggregation Module handles the scale variation problem inherent in crowd-counting tasks in a more computation-efficient way. This module, along with the cross-attention module, serves as a bridge to promote the flow of feature information between local features and global context information. Finally, a simplified regressor module is designed so that the proposed model can be trained with weakly supervised guidance, avoiding precise location-level annotations; the omission of density map generation makes the proposed network more lightweight. Our results on the UCF-QNRF dataset indicate that our model is 8.73% and 12.17% more accurate on the MAE and MSE metrics, respectively, than the second-best ARNet, while the parameter count decreases by 4.52%. On the ShanghaiTech A dataset, MAE and MSE drop by 1.5% and 3.2%, respectively, compared to the second-best PDDNet. The experimental results for accuracy and inference speed evaluation on several mainstream datasets validate the effective design principle of our model.

1. Introduction

As a subdomain within the field of object detection and counting, crowd counting is a technique to count or estimate the total number of people present in an image or video stream. It can be applied in a variety of circumstances, such as video surveillance, traffic monitoring, and public safety. However, challenges persist in terms of fine-grained head annotation requirements, global context feature extraction, handling scale variations, etc. Furthermore, to make this technology more practical in real-life scenarios, lightweight crowd-counting network design has become a trend in this domain. It aims at producing a more compact model architecture that can run on resource-constrained devices while maintaining counting accuracy, which places higher demands on model compactness and resource consumption.
Lightweight crowd-counting models often implement techniques such as model pruning [1], parameter sharing [2,3], model quantization [4], and knowledge distillation [5] to reduce parameters and computation cost. Son et al. [4] utilized model quantization in their model for a microcontroller unit (MCU), achieving a 2.2× speedup compared to the original float model, which indicates its effectiveness on resource-constrained devices. Liu et al. [5] proposed a Structured Knowledge Transfer (SKT) framework to allow a student network to learn feature modeling ability from a teacher network; the model size was reduced to one-quarter of the teacher network with only a small drop in accuracy. However, challenges persist in lightweight crowd-counting network design. First, most models [1,4,6,7,8,9,10,11] with fully supervised guidance require location-level annotation information in datasets to maintain accurate performance. Generating such annotations is tedious and time-consuming, and processing and learning accurate information from fine-grained annotations often requires a more complicated architecture. For instance, ref. [12] proposed an iterative crowd-counting network to handle high-resolution and low-resolution density maps for more accurate results, and ref. [13] combined deep and shallow networks to generate a density map that is more adaptable to large scale variations. Secondly, lightweight models often lack effective modules for global context feature extraction [14,15] and intermediate feature enrichment [14]. Finally, it is also of vital importance to implement effective modules to handle the scale variation problem inherent in object counting tasks.
To enhance the adaptability of models to datasets with coarse annotations, weakly supervised methods have been proposed, and such techniques can also be implemented in the crowd-counting field. Compared to fully supervised object detection methods, including detection with bounding boxes [16,17], which requires box anchors to be annotated before training, and detection with image segmentation [18], which often requires the generation of a segmentation map from a point map or a density map, weakly supervised object counting can rely only on the ground-truth count to estimate the number of people in the image. In scenarios where only a real-time estimate of the number of people is needed and detailed location information is not crucial, the model structure is correspondingly simplified, making it more compatible with lightweight network design. However, a weakly supervised model needs to handle incomplete label information and learn useful patterns from it, placing increased demands on the model's training methods and optimization techniques. There have been some studies concentrating on weakly supervised network design, and some of them have achieved good results in this regard. Yang et al. [19] proposed a soft-label ranking network to facilitate counting, wherein the network ranks images based on the number of people they contain. It gets rid of expensive semantic labels and location annotations, and the ranking network drives the shared backbone CNN to explicitly acquire density-sensitive capability; the regression network then utilizes information about the number of individuals to enhance counting accuracy. Liang et al. [20] re-articulated the problem of weakly supervised crowd counting from a Transformer-based sequential counting perspective, achieving a weakly supervised paradigm that relies on count-level annotations. However, it is not suitable for direct application in lightweight crowd-counting tasks due to its Transformer backbone.
Following the strategy proposed in [20,21], training a network in a weakly supervised setting can straightforwardly extend the traditional density map estimation process by constraining the integral of the estimated density map to be close to the ground-truth object count for count-level annotated images. The ground-truth count can be easily obtained either directly from the dataset, from the number of location coordinates, or by integrating the ground-truth density map, depending on the format of the ground-truth labels provided by the dataset. Since our model does not perform density map estimation, there are some slight differences from [21]: the predicted count is produced by learning the mapping between the final fused feature and the count with fully connected layers, instead of integrating the estimated density map with a global average pooling operation. Thus, our approach to weakly supervised guidance is more in line with the idea defined in [20], which refers to methods relying on count-level annotations as the weakly supervised paradigm. In this way, the weakly supervised objective is defined as minimizing the difference between the predicted count and the ground-truth count.
In our work, we propose a novel weakly supervised lightweight crowd-counting network to address the aforementioned issues. The main design principle of our work is to find a more effective way to combine a Transformer block with a convolutional neural network, in line with the popular trend of designing hybrid compact models to achieve a trade-off between accuracy and model size. An effective weakly supervised architecture is then considered, following the recent trend in crowd-counting network design of getting rid of fine-grained annotations and keeping the model more lightweight. With such design guidance, the modules in our network, their functions, and the connections among them have been carefully considered, and each module is specially designed for a specific challenge in lightweight crowd counting. An overview of our model is as follows: Initially, an input image batch undergoes local feature extraction using a GhostNet [22] backbone in the stem. Concurrently, a branch incorporates the Pyramid Pooling Aggregation Module (PPAM) [23] to capture scale variation information. As the intermediate features in the backbone inherently possess rich scale-aware information due to the varying depth of the Ghost Bottlenecks, the PPAM downscales them through adaptive average pooling operations, followed by concatenation along the channel direction to learn scale variation representations. Then, a modified Swin-Transformer block [24] is used to extract global context information from the scale-aware features output by the PPAM. We prevent the sharp increase in parameters caused by introducing the Transformer by clipping the input image into sub-images and rearranging them along the channel direction at the beginning of the network. Subsequently, the global context features and the local deep features from the backbone output are fused using a cross-attention bridge, yielding integrated features. Finally, a simple regressor module, comprising point-wise convolutions and fully connected layers, is employed to learn the relationship between these integrated features and the total count. To overcome the dependency on location-level annotations and facilitate easier training and deployment, our model adopts weakly supervised guidance with count-level annotations for training.
To verify the performance of our network, we tested it on several mainstream datasets, namely the ShanghaiTech A/B datasets [25], the JHU-Crowd++ dataset [26], the NWPU dataset [27], and the UCF-QNRF dataset [28]. In terms of accuracy, our network achieved state-of-the-art (SOTA) results on the ShanghaiTech A, UCF-QNRF, and JHU-Crowd++ datasets. Furthermore, its inference speed and memory occupation are also competitive with convolutional models. Our main contributions are as follows:
  • We developed a novel hybrid lightweight crowd-counting network that combines GhostNet, which generates abundant intermediate ghost features, with the Swin-Transformer, whose large receptive field enhances long-range context modeling ability for crowd-counting tasks.
  • We incorporate attention mechanisms into a lightweight network model while keeping the parameters and computational cost at the same level.
  • Our lightweight model implements weakly supervised guidance to get rid of laborious location-level annotation and of the density map generation that comes with feature up-sampling, which is unnecessary in the inference phase.
  • Our network obtains state-of-the-art or comparable performance on several mainstream datasets, including the ShanghaiTech A, UCF-QNRF, and JHU-Crowd++ datasets.

2. Related Work

2.1. Lightweight Network Design for General Tasks

Designing lightweight crowd-counting networks has become a mainstream method to make networks more applicable to resource-constrained devices. Compared to heavy-structure models, lightweight models have low computation cost and fast processing time to meet real-time demands. There have been some works that design lightweight models targeting general computer vision tasks such as object detection, semantic segmentation, and object classification. Their thought-provoking modules and innovative design principles inspired us to optimize the modules in our model for better performance. Howard et al. [29,30,31] proposed the MobileNet family, which consists of three generations of MobileNet. The main contributions of MobileNet are the idea of depth-wise separable convolution, which replaces a normal convolution with channel-wise convolution and point-wise convolution, and the design of the inverted residual bottleneck to reduce computation. Based on MobileNet's structure, Han et al. [22,32] came up with a more efficient convolution architecture called GhostNet, which takes advantage of MobileNet's inverted bottleneck architecture and depth-wise separable convolution and designs a Ghost Bottleneck that can generate ghost features with cheap linear operations. In the Transformer domain, Zhang et al. [33] proposed MiniViT to compress the Vision Transformer by weight multiplexing, which consists of weight transformation and weight distillation. Bolya et al. [34] introduced Hydra Attention, a form of linear attention that allows adding more heads while keeping the amount of computation the same. However, relying solely on the Transformer for compact model design is not competitive. In most cases, mixing a Transformer with a convolutional network is a better choice to take advantage of both the CNN's small model size and the Transformer's high accuracy. In this direction, Chen et al. [35] presented a parallel design of MobileNet and Transformer called Mobile-Former, aiming to fuse local features and global information bidirectionally. Pan et al. [36] designed EdgeViTs, a lightweight vision Transformer family that achieves accuracy–latency and accuracy–energy trade-offs in object recognition tasks. Chen et al. [37] designed a Mixing Block that combines local-window self-attention and depth-wise convolution in a parallel design to integrate features across windows and dimensions. Mehta et al. [38] replaced local processing in convolutions with global processing using Transformers in their network to learn better representations with fewer parameters and simple training recipes. The hybrid networks mentioned above are designed for general purposes and are not specially tailored for crowd-counting tasks. Based on the concept of combining a Transformer and a CNN, we designed our hybrid lightweight network by combining GhostNet and a Swin-Transformer block specifically for crowd-counting tasks.

2.2. Lightweight Crowd-Counting Models

In the crowd-counting field, some lightweight models have been proposed and achieved promising performance at the time they were introduced. Looking back at their designs, improvements can be put forward to boost performance. Shi et al. [14] proposed a lightweight C-CNN model that uses three parallel layers with filters of different sizes to solve scale variation problems. Compared to multi-branch architectures, it has a simpler structure and fewer parameters; however, such a simple structure may not be able to handle complicated scenarios involving light variation, fake object representation, etc. Zhu et al. [3] implemented weight sharing in the scale feature extraction module of LSANet, sharply decreasing the parameters of a complicated network to a minimal level. However, it outputs density maps at three different scales for better guidance, which inevitably causes unnecessary computation cost; our weakly supervised network skips density map generation entirely and can achieve a better trade-off between accuracy and computation cost. Liang et al. [39] designed PDDNet, which is equipped with a multi-scale information extraction module built from lightweight pyramid dilated convolution (LPC) modules to extract global context information, and Dong et al. [40] designed a Multi-Scale Feature Extraction Network (MFENet) to model multi-scale information. Compared to these, the PPAM block in our model takes fewer computations to process scale-aware information, and the Swin-Transformer block is more effective in extracting global context information. Tian et al. [41] and Zhang et al. [42] added guidance branches to their lightweight models to learn localization information; such a technique needs precise head location coordinates to guide the localization task, which is not mandatory in lightweight crowd-counting tasks. As for hybrid network architectures, Sun et al. [43] introduced Transformer blocks after each downscaling convolution block to model scale-varied information stage by stage; introducing multiple Transformer blocks for the same purpose is not computationally efficient.

2.3. Weakly Supervised Crowd-Counting Models

There have been some previous works concentrating on weakly supervised lightweight crowd-counting network designs that are most relevant to our network. Yang et al. [19] made the first attempt to train a pure convolutional network without location-level annotations, but still relied on a sorting network with handcrafted soft labels. Wang et al. [44] proposed a weakly supervised network with a multi-granularity MLP that is based solely on count-level annotations; they introduced a ranking mechanism and designed auxiliary branches for self-supervision, causing excessive computation time, which should be avoided in lightweight models. Wang et al. [45] proposed a joint CNN and Transformer network, which also implements weakly supervised learning for efficient crowd counting; it uses a modified Swin-Transformer with the patching layers discarded for global feature modeling to complement the local features from VGGNet, but it ignores scale-aware information at both the local and the global scope, and the combination of VGGNet and a Swin-Transformer block with patch embedding operations causes a loss of background context information.
From the recent lightweight crowd-counting works mentioned above, we can identify some common problems. Firstly, lightweight convolutional models have often concentrated on designing various dilated convolution modules to expand the convolutional receptive field from a local area to a global horizon. However, the Transformer inherently has a global receptive field; if the computation cost can be reduced, replacing the dilated convolution module with a Transformer is a better choice. Secondly, some works retain crowd localization tasks in lightweight models, which require complicated and accurate annotations in datasets for training. Finally, weakly supervised lightweight crowd-counting models also encounter the issues that fully supervised ones face, and they require further improvement to fit coarse-grained annotations.

3. Method

3.1. Problem Formulation

The weakly supervised crowd-counting problem can be formulated as described in [21]. Given the input image I, the counting network learns useful patterns relevant to total numbers or density distribution from it and finally outputs feature maps; then, the obtained feature maps can be regressed to the prediction number that represents the total count of people in the input image with a global average pooling operation. Such a process can be formulated as follows:
$F_M = \mathcal{M}(I); \quad \hat{C} = \mathrm{GAP}(F_M).$  (1)
$\mathcal{M}$ represents the part of the crowd-counting model that produces the feature map, $F_M$ denotes the extracted feature map, $\mathrm{GAP}$ denotes the global average pooling operation, and $\hat{C}$ denotes the predicted count. The weakly supervised crowd-counting network can be guided by minimizing the deviation between the predicted value and the ground-truth count. Since the global average pooling operation has been replaced with fully connected layers in our network, the predicted count is produced as in Equation (2), where $\mathrm{FC}$ denotes the fully connected layers:
$\hat{C} = \mathrm{FC}(F_M).$  (2)
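To make the count-level objective of Equations (1) and (2) concrete, the following minimal PyTorch sketch regresses a scalar count and trains it against the ground-truth count only; `model_M` and `feat_dim` are generic placeholders rather than our exact modules.

```python
import torch
import torch.nn as nn

class WeaklySupervisedCounter(nn.Module):
    """Wraps a feature extractor M and regresses a scalar count per image."""
    def __init__(self, model_M: nn.Module, feat_dim: int):
        super().__init__()
        self.model_M = model_M              # produces the feature map F_M
        self.head = nn.Linear(feat_dim, 1)  # FC regressor replacing GAP

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        fm = self.model_M(image)            # F_M = M(I)
        fm = fm.flatten(1)                  # (B, feat_dim)
        return self.head(fm).squeeze(1)     # C_hat = FC(F_M), shape (B,)

def training_step(counter, images, gt_counts, optimizer):
    """One step of count-level weak supervision: minimize |C_hat - C_gt|."""
    optimizer.zero_grad()
    pred = counter(images)
    loss = torch.abs(pred - gt_counts).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```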

3.2. Architecture

Our network architecture is shown in Figure 1. “Q”, “K”, and “V” denote the query, key, and value tokens required by the cross-attention operation, respectively. Our network consists of five parts: first, thirteen layers of GhostNet serve as the backbone for extracting local features; the PPAM [23] fuses intermediate features of different scales and depths from the backbone; the Swin-Transformer with a bi-dimensional feed-forward network (SBFFN) is adopted for extracting long-range dependency information; the cross-attention module serves as a bridge for fusing global context information into local deep features; and the simple regressor module finally outputs the prediction by taking in the integrated features produced by the cross-attention module. The GhostNet backbone is mainly used to initially extract local features from image pixels, and its intermediate features at each downsampling stage contain scale-aware and depth information; they are downsampled to a uniform size and concatenated by the PPAM for scale information fusion. The PPAM consists of cascading average pooling layers followed by a concatenation connection, and the intermediate features from the GhostNet backbone with different resolutions are concatenated at the channel level. The concatenated features are sent to the Swin-Transformer block, which not only extracts global dependencies but also extracts and fuses global multi-scale features when working with the PPAM module. To fuse the global dependency information with the local features extracted by GhostNet, a cross-attention module serves as a bridge that diffuses global information with multi-scale features into the local features. Finally, a regressor module composed of point-wise convolutions and fully connected layers outputs the predicted crowd count. Next, we introduce each part of our network in detail.
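As a high-level summary before detailing each module, the forward pass can be sketched as follows; `backbone`, `ppam`, `swin_block`, `cross_attn`, and `regressor` are hypothetical placeholders for the components described in the subsections below.

```python
def forward(image):
    # GhostNet stem: deep local features plus the per-stage intermediate outputs
    local_feats, stage_feats = backbone(image)
    # PPAM: pool every stage to a common resolution and concatenate channel-wise
    scale_feats = ppam(stage_feats)
    # Modified Swin-Transformer block: long-range, scale-aware context
    global_feats = swin_block(scale_feats)
    # Cross-attention bridge: diffuse global context into the local features
    fused = cross_attn(query=local_feats, key_value=global_feats)
    # Simple regressor: point-wise convolutions + fully connected layers -> count
    return regressor(fused)
```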

3.2.1. GhostNet Backbone

GhostNet is built from Ghost Bottlenecks, which are in turn composed of Ghost Modules; it builds on the earlier MobileNet design and can generate abundant ghost features that have a linear correlation with the intrinsic features produced by ordinary convolution, through cheap linear operations:
$y_{i,j} = \Phi_{i,j}(y_i),$  (3)
where $y_i$ denotes the $i$-th intrinsic feature map in a feature $F$ produced by an ordinary convolution operation, and $\Phi_{i,j}$ represents the $j$-th cheap linear operation applied to generate the $j$-th ghost feature $y_{i,j}$ of the intrinsic feature $y_i$. This special ability meets our demand to enrich the intermediate features of our lightweight model, because the attention block in the Swin-Transformer needs more feature information to learn a better representation. The final features after the Ghost Module are the channel-level concatenation of the intrinsic features and the ghost features. Compared with other lightweight convolutional networks, GhostNet can provide more relevant features with nearly the same parameters and computation cost. Benefiting from its cascading convolutional architecture, it has fast inference speed and enough depth to uncover deep feature information. These traits make it ideal as a lightweight backbone for initial feature extraction and intermediate feature enrichment. Experiments in previous work [39] have shown that the first thirteen layers are the most efficient part to serve as a backbone, and we directly adopt this conclusion here. However, the backbone alone lacks the ability to extract multi-scale feature information; to compensate, we use the PPAM as a branch working in parallel with the GhostNet stem. Given an original input image batch of size $H \times W \times C$, where $H$ denotes height, $W$ denotes width, and $C$ denotes the number of channels ($H = W = 384$, $C = 3$), the feature size after the GhostNet operations is compressed to $\frac{H}{16} \times \frac{W}{16} \times 224C$.
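For illustration, a Ghost module in the spirit of GhostNet [22] can be sketched as below; the split ratio of 2 and the depth-wise kernel size are illustrative assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Primary conv -> intrinsic features; cheap depth-wise conv -> ghost features."""
    def __init__(self, in_ch: int, out_ch: int, dw_kernel: int = 3):
        super().__init__()
        intrinsic = out_ch // 2                  # assumes an even out_ch (ratio = 2)
        self.primary = nn.Sequential(            # ordinary convolution: intrinsic y_i
            nn.Conv2d(in_ch, intrinsic, kernel_size=1, bias=False),
            nn.BatchNorm2d(intrinsic), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(              # cheap linear operation Phi_{i,j}
            nn.Conv2d(intrinsic, intrinsic, dw_kernel, padding=dw_kernel // 2,
                      groups=intrinsic, bias=False),
            nn.BatchNorm2d(intrinsic), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        # channel-level concatenation of intrinsic and ghost features
        return torch.cat([y, self.cheap(y)], dim=1)
```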

3.2.2. Pyramid Pooling Aggregation Module

Intermediate features in the GhostNet backbone contain rich scale information. However, due to the different sizes of the output features after each Ghost Bottleneck, we cannot fuse them and pass them through the Swin-Transformer block directly. After passing through each Ghost Bottleneck, the resolution of the input features is reduced to half of its original size. To ensure that the output features of all stages have the same resolution for concatenation, the downsampling ratio of each pooling layer in the PPAM halves layer by layer; consequently, the resolution of the concatenated feature matches GhostNet's final output. The PPAM downsamples the features to the size of the smallest one among them using an adaptive average pooling operation, and the size of the token passed to the Swin-Transformer becomes $\frac{H}{32} \times \frac{W}{32} \times 336C$ after downsampling and channel concatenation. This simple operation consumes very little computation and needs no additional parameters. Given features of size $h \times w \times c$ at each iteration of the PPAM, the computational FLOPs of an adaptive average pooling operation with kernel size $k$ and stride $s$ are $k^2 \times c \times \frac{h-k}{s} \times \frac{w-k}{s}$. The subsequent self-attention process in the Swin-Transformer does not increase the parameters and computation sharply because of the small size of the PPAM output features and the Swin-Transformer's initial window-partitioning operation. In this step, features of different convolutional depths and varying scales are combined at the channel level, and the output features of the last GhostNet stage, which serve as the local deep features, are also fused into them. The whole process can be formulated as follows:
$Y_c^i = \mathrm{Concat}(\mathrm{AvgPool}(Y^i),\ Y_c^{i-1}),$  (4)
where $Y_c^i$ denotes the concatenated features at the $i$-th iteration and $Y^i$ denotes the intermediate features of the $i$-th downsampling stage of the GhostNet backbone. The concatenated features are then passed through the Swin-Transformer block for long-range dependency feature aggregation.
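A minimal sketch of the pooling-and-concatenation step in Equation (4) is given below; the target resolution of 12 × 12 corresponds to H/32 for a 384-pixel input and is stated here as an assumption.

```python
import torch
import torch.nn.functional as F

def ppam(stage_feats, target_hw=(12, 12)):
    """stage_feats: list of (B, C_k, H_k, W_k) tensors from the GhostNet stages."""
    # adaptive average pooling brings every stage to the same spatial size
    pooled = [F.adaptive_avg_pool2d(feat, target_hw) for feat in stage_feats]
    # channel-level concatenation fuses scale and depth information
    return torch.cat(pooled, dim=1)
```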

3.2.3. Swin-Transformer Block with Bid-FFN

To take advantage of the Transformer's large receptive field, our network adopts the Swin-Transformer block for long-range context information extraction. Because Swin-Transformer's window-partitioned self-attention mechanism yields a computation cost that is linear in the input feature size, it is feasible to apply it for global feature extraction in a lightweight model. As the first thirteen layers of GhostNet have enough depth, the input fused features at different scales have already been compressed thoroughly, so we only use one Swin-Transformer block for long-range dependency information extraction. Even so, the Transformer block still consumes more computation, and its increased memory occupation cannot be neglected when compared to a convolutional structure. For this reason, the number of channels of the input features to the Transformer block usually remains at a minimal level in most lightweight networks with a Transformer structure. Here, we adopt a point-wise convolution to decrease the channel number from 1008 to 128, with batch normalization to avoid vanishing gradients. The channel downscaling ratio is close to 1/8, which alleviates the computation burden by roughly the same ratio since the computation cost of the Swin-Transformer is proportional to the channel size. Since the feed-forward network (FFN) in the Transformer block serves as the only non-linear component, it plays an important role in modeling non-linear relationships [46]. We must fully exert its modeling ability and compensate for its shortcomings. Although non-linear activation is usually conducted on the enlarged channel dimension produced by a linear layer, it is still insufficient in lightweight models due to parameter constraints. Furthermore, the plain FFN cannot model spatial dependency very effectively because of its drawbacks in modeling at the spatial level [46]. To overcome these drawbacks, we adopt the bi-dimensional attention feed-forward network originating from LightViT. The bi-dimensional attention module is shown in Figure 2, where “r” denotes the reduction ratio. It consists of a channel attention branch and a spatial attention branch: the channel attention branch aggregates the global representation with an average pooling layer, which compresses global features for computing channel attention, and the spatial attention branch is particularly designed to enhance spatial modeling ability. Just before the spatial attention, a concatenation operation is used to model the pixel-wise relation. Finally, the interaction between channel attention and spatial attention promotes the information flow between channels and global space.
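A simplified sketch of such a bi-dimensional-attention FFN is shown below; the hidden size, the reduction ratio r, and the omission of the concatenation step before spatial attention are simplifications of the LightViT design rather than its exact form.

```python
import torch
import torch.nn as nn

class BiDimFFN(nn.Module):
    """FFN with a channel-attention branch and a spatial-attention branch."""
    def __init__(self, dim: int = 128, hidden: int = 256, r: int = 4):
        super().__init__()
        self.fc1, self.act, self.fc2 = nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        self.ch_attn = nn.Sequential(            # squeeze-and-excite over channels
            nn.Linear(hidden, hidden // r), nn.ReLU(inplace=True),
            nn.Linear(hidden // r, hidden), nn.Sigmoid())
        self.sp_attn = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())  # per-token gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, dim) tokens
        h = self.act(self.fc1(x))                          # expanded hidden features
        ch = self.ch_attn(h.mean(dim=1)).unsqueeze(1)      # (B, 1, hidden) channel weights
        sp = self.sp_attn(h)                               # (B, N, 1) spatial weights
        return self.fc2(h * ch * sp)                       # interaction of both branches
```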

3.2.4. Multi-Head Cross Attention for Global Feature Diffusion

After the modified Swin-Transformer block, features containing global context information have been extracted. They should be fused with the local deep features from the GhostNet backbone in order to finally regress the total crowd number. To achieve this, we adopt a cross-attention module to diffuse global features into local features by projecting the local features into query tokens and the global features into key and value tokens. Cross-attention has been used in previous works to merge representative features into other feature streams, and its effectiveness has been proven there [35,47]. Unlike channel concatenation and point-wise sum operations, the attention mechanism needs more computation; however, when Transformer blocks are adopted in the network architecture, the accuracy results tend to be better. Due to the previous feature compression processes, the token sizes in this module have been significantly reduced after feature projection. After a detailed comparison, we finally chose the multi-head cross-attention module as the bridge to fuse the global features into the local representation. In order to integrate global context dependency information into local features, we project the GhostNet output features into the query tokens, and the output global features of the Swin-Transformer are projected into the key and value tokens; the fused output features are calculated as follows:
$\mathrm{MultiCrossAttn}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\, W_o,$  (5)
where $W_o$ denotes the output projection matrix and $\mathrm{head}_i$ denotes the $i$-th attention head, which can be represented as:
$\mathrm{head}_i = \mathrm{Attn}(Q W_Q^i, K W_K^i, V W_V^i).$  (6)
The cross-attention value of $Q$, $K$, and $V$ is generated as in Equation (7):
$\mathrm{CrossAttn}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{Q_i K_j^{T}}{\sqrt{d_k}}\right) V_s,$  (7)
where $Q_i$, $K_j$, and $V_s$ represent the $i$-th row of the query token $Q$, the $j$-th row of the key token $K$, and the $s$-th row of the value token $V$, respectively, and $d_k$ denotes the dimension of the key vectors. They are generated as:
$Q = \mathrm{LN}(F); \quad K, V = \mathrm{Linear}(T),$  (8)
where $F$ denotes the local features from the GhostNet backbone and $T$ denotes the global tokens output by the Swin-Transformer block.
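The cross-attention bridge can be sketched with PyTorch's built-in multi-head attention as below; note that nn.MultiheadAttention applies its own internal Q/K/V projections, so this is only an approximation of Equations (5)–(8), and the channel width and head count are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionBridge(nn.Module):
    """Local tokens query the global tokens produced by the Swin-Transformer block."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)             # Q = LN(F)
        self.kv_proj = nn.Linear(dim, 2 * dim)      # K, V = Linear(T)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, local_tokens: torch.Tensor, global_tokens: torch.Tensor):
        q = self.norm_q(local_tokens)               # (B, N_local, dim)
        k, v = self.kv_proj(global_tokens).chunk(2, dim=-1)
        fused, _ = self.attn(q, k, v)               # diffuse global context into local features
        return fused + local_tokens                 # residual connection keeps local detail
```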

3.3. Simple Convolution Regressor

The regressor module is utilized to learn the mapping between the final fused features output by the cross-attention bridge and the number of people in the input picture. After the various kinds of feature extraction and fusion in the previous modules, the final fused feature contains the rich information required by the final regression process; thus, the regressor module takes it as the input feature for predicting the total number of people. To simplify our network and adhere to the principle of weakly supervised network design, our regressor simply regresses the count number as the output of the whole network. The regression process is shown in Figure 3. Following this policy, our regressor module consists of two point-wise convolutions followed by two fully connected layers, with a drop-out layer (drop ratio 0.2) between the two fully connected layers. The first fully connected layer reduces the feature dimension significantly, thereby considerably reducing the parameters of the subsequent fully connected layer. The final layer outputs the predictive result. This process can be formulated as:
$G = \mathrm{FC}(\mathrm{DropOut}(\mathrm{FC}(\mathrm{PointConv}(\mathrm{PointConv}(Y_f))))),$  (9)
where $Y_f$ denotes the final fused feature after the cross-attention bridge, $G$ denotes the regressed count, and $\mathrm{PointConv}$ represents point-wise convolution.
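A sketch of this regressor, matching the operation order of Equation (9), is given below; the channel sizes and the flattened spatial dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CountRegressor(nn.Module):
    """Two point-wise convolutions, then FC -> Dropout(0.2) -> FC -> scalar count."""
    def __init__(self, in_ch: int = 128, mid_ch: int = 64, spatial: int = 12, hidden: int = 128):
        super().__init__()
        self.pw = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=1), nn.ReLU(inplace=True))
        self.fc1 = nn.Linear(mid_ch * spatial * spatial, hidden)  # large reduction first
        self.drop = nn.Dropout(p=0.2)
        self.fc2 = nn.Linear(hidden, 1)                           # predicted count

    def forward(self, y_f: torch.Tensor) -> torch.Tensor:         # y_f: (B, in_ch, H, W)
        x = self.pw(y_f).flatten(1)
        return self.fc2(self.drop(self.fc1(x))).squeeze(1)
```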

4. Experiments

To validate the advantages and effectiveness of our network, we conducted experiments on several benchmark datasets, including the ShanghaiTech A/B, NWPU, JHU-Crowd++, and UCF-QNRF datasets, and compared the experimental results with some state-of-the-art networks. In this section, we first introduce the datasets used for training and testing, then illustrate the settings and configurations used in the experimental process, and finally present the comparative results against several kinds of state-of-the-art networks.

4.1. Datasets

The ShanghaiTech datasets [25] contain 1198 annotated images with a total count of more than 300k people, with the center of each head annotated. These images cover a wide range of crowd densities and sizes. The full dataset is split into two parts by its providers, denoted as A and B, respectively. In part A, the crowd density varies significantly, ranging from 33 to 3193 people per image, while part B has relatively sparse crowd densities.
The JHU-Crowd++ dataset [26] is a large-scale, unconstrained crowd-counting dataset with 4372 images and 1.51 million annotations. The pictures in this dataset are collected under a large variety of scenes with different environmental conditions, including weather-based conditions and illumination variations.
The UCF-QNRF dataset [28] was released publicly in 2018, comprising a total of 1535 crowd images with annotations. The pictures in this dataset have a high resolution of 2013 × 2902, with annotated crowd counts ranging from 49 to 12,865.
The NWPU Crowd dataset [27] is a large dataset designed for both crowd counting and localization. It includes 351 negative samples, i.e., images that contain no people but a large number of other objects that might be mistaken for crowds. Training and testing a network on this dataset can be quite challenging.

4.2. Evaluation Metrics

Like previous work [15,39,41], we use the mean absolute error (MAE) and mean square error (MSE) to evaluate our model; they are defined, respectively, as:
$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|,$  (10)
$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|^2,$  (11)
where $N$ denotes the number of images in the dataset, and $C_i^{GT}$ and $C_i$ represent the ground-truth value and the estimated value for the $i$-th sample, respectively. The ground-truth density map is generated with a fixed-size Gaussian kernel, and the density function can be formulated as:
$F(x) = \sum_{i=1}^{N}\delta(x - x_i) * G_\sigma(x),$  (12)
where $\delta(\cdot)$ represents the Dirac delta function and $\sigma$ represents the standard deviation, which we set to 2 for all experimental datasets. The $L_1$ loss function is adopted to optimize our network, and it can be represented as Equation (13):
$L_1(\Theta) = \frac{1}{N}\sum_{i=1}^{N} \left| \hat{M_i}(\Theta) - M_i \right|,$  (13)
where $N$ denotes the total number of images in an input batch, $i$ indexes the $i$-th image, $M_i$ represents the ground-truth value, $\hat{M_i}$ denotes the predicted value, and $\Theta$ denotes the parameters of the network.
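As a minimal illustration, the metrics of Equations (10) and (11) and the $L_1$ objective of Equation (13) can be computed on per-image counts as follows; tensors of predicted and ground-truth counts are assumed as inputs.

```python
import torch

def mae(pred: torch.Tensor, gt: torch.Tensor) -> float:
    # Equation (10): mean absolute counting error over the dataset
    return torch.abs(pred - gt).mean().item()

def mse(pred: torch.Tensor, gt: torch.Tensor) -> float:
    # Equation (11): mean squared counting error over the dataset
    return ((pred - gt) ** 2).mean().item()

def l1_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # Equation (13): L1 training objective on count-level annotations
    return torch.abs(pred - gt).mean()
```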

4.3. Network Settings

The implementation of our network is inherited from the TransCrowd framework [20], which provides approaches to pre-process the image crops before sending them to a weakly supervised network. We trained and tested our network on an NVIDIA GeForce RTX 2080Ti GPU (Colorful, Shenzhen, China). To facilitate the training of the entire network, the GhostNet in the encoder part was pre-trained on the ImageNet-1K dataset, and we used this result as a baseline to evaluate each part of our network. We used the L1 norm and L2 norm to evaluate the counting error and adopted inference time, as well as GFLOPs, for speed assessment. During the training process, we employed data augmentation techniques, including random image crops and flips.
As previously mentioned in the model architecture, each image is cropped into six sub-images that are arranged along the channel direction. For the training sets, the resize and clip operations are carried out with a script using bi-linear interpolation and fixed-size cropping, applying resizing and data augmentation first and fixed-size cropping afterwards. For the testing sets, the dataloader handles this task for each input image. If no ground-truth count is provided, we calculate it by summing the location annotations or integrating the ground-truth density map. For the training experiments, the network parameters are randomly initialized from a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. During training, we used the Adam optimizer with a learning rate decay of $1.2 \times 10^{-5}$. The initial learning rate was set to $1.5 \times 10^{-5}$, and the number of epochs ranged from 300 to 500, depending on the size of the current dataset.
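A minimal sketch of this training configuration is shown below; interpreting the learning rate decay of $1.2 \times 10^{-5}$ as the optimizer's weight_decay argument is an assumption, and `model` stands in for the full network.

```python
import torch
import torch.nn as nn

def init_weights(m: nn.Module):
    # Gaussian initialization: mean 0, standard deviation 0.01
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def configure_training(model: nn.Module):
    model.apply(init_weights)
    optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-5, weight_decay=1.2e-5)
    return optimizer
```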
The inference speed and lightweighting evaluation experiments were also conducted on an Ubuntu server with an NVIDIA GeForce RTX 2080Ti GPU, except for the power evaluation, which was conducted on a personal laptop with an NVIDIA GeForce RTX 3060 GPU. For the inference speed experiments, we warmed up each compared network for 50 iterations and then recorded the inference speed and image processing speed. We conducted the speed evaluation experiments on the ShanghaiTech A dataset with an input resolution of $384 \times 384 \times 3 \times 1$. Regarding the power experiments, we started recording each model after 5 min of stable running and took the difference between the average CPU and GPU power over 1 min and the corresponding CPU and GPU power before running the model as the model's power consumption.
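The latency measurement protocol can be sketched as follows; the number of timed runs (200) is an assumption, while the warm-up count, input resolution, and batch size follow the settings above.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, warmup: int = 50, runs: int = 200, device: str = "cuda"):
    model = model.to(device).eval()
    x = torch.randn(1, 3, 384, 384, device=device)   # 384 x 384 x 3, batch size 1
    for _ in range(warmup):                           # warm-up passes (not timed)
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):                             # timed forward passes
        model(x)
    torch.cuda.synchronize()
    return (time.time() - start) / runs               # average seconds per image
```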

5. Results

5.1. Comparative Results with Different Networks

To verify the effectiveness of our network design, we first conducted prediction accuracy experiments and applied MAE and MSE evaluation metrics. We list comparative results with some other state-of-the-art networks in Table 1. Then, we calculate the parameters of the networks in the experiments to validate that our network is lightweight enough. Finally, we tested inference time to verify that our network meets real-time demands. The experimental results demonstrate that our network is quite comparable even with some SOTA models.
The prediction accuracy results comparing our network with some SOTA networks and lightweight networks are shown in Table 1. As we can see, even with a Transformer block and some attention modules in our network, our network remains lightweight. Additionally, our network demonstrates its competitiveness when compared to other lightweight networks. In the table, we especially list the results of MobileCount and Switching-CNN, which are heavier than others in the table. Parameters of these two networks exceed 3M; even so, our results are competitive on large datasets.
Although PDDNet outperforms our model on the ShanghaiTech A/B datasets, the results on the UCF-QNRF dataset indicate a significant performance boost for our model. A possible reason is that the attention modules and the Transformer block require more data to learn the connections between neurons effectively; the experimental results on the JHU-Crowd++ dataset support this hypothesis. On the NWPU dataset, the results of our network are slightly behind other networks, indicating that our network does not perform well when there are negative samples in the dataset. However, in most cases, our network is still competitive.
Among the lightweight networks, those ranging from MCNN to SKT are too lightweight to achieve promising results. It is reasonable that our network has 0.6 million more parameters than PDDNet due to the introduction of the Transformer module; however, this increase in parameters leads to significant accuracy improvements on large, complex datasets, indicating that the attention mechanism works effectively in our model. As for LightMSANet, it implements a variety of convolutional operations to extract multi-scale features and uses different methods to fuse them. However, it only focuses on features at different scales, ignoring feature information at different spatial levels. The remaining networks all lack attention mechanisms, which would be beneficial when training on large-scale datasets. Compared to MobileCount, our network uses the more advanced GhostNet backbone, which generates ghost features with cheap linear operations while keeping the parameter amount nearly the same; its MobileNet-V2 backbone cannot compete with GhostNet. Although ARNet outperforms our method on the ShanghaiTech Part B, JHU-Crowd++, and NWPU datasets, the performance gap between our method and ARNet is not substantial. On the UCF-QNRF dataset, however, our method surpasses ARNet with clear improvements of 9.6 and 25.3 in the MAE and MSE metrics, respectively, demonstrating a noticeable performance improvement. Additionally, our method has fewer parameters than ARNet, further demonstrating its effectiveness.
The performance of our model on these datasets demonstrates good adaptability and robustness to various scenario changes, different lighting conditions, large crowd densities, and scale changes. We believe this can mainly be attributed to each module in our model effectively performing its corresponding function and complementing the other modules well. Notably, the PPAM and cross-attention modules act as bridges to facilitate the fusion of local and global feature information; otherwise, any loss of feature information during feature transmission would have a large negative impact on the model's predictions. The introduction of the Swin-Transformer not only effectively expands the receptive field of the model, which better serves global feature modeling, but also enhances the model's ability to extract complex feature information. This benefit comes from the Transformer's more complex structure and its attention mechanism.
The inference speed and lightweighting results, compared with some SOTA networks, are listed in Table 2. Among the compared models, PDDNet is a lightweight pure convolutional network, which has inherent advantages in computation cost and inference speed. Although our network includes vision Transformer modules, its inference speed remains competitive. Furthermore, the computation cost and total memory usage of our model are slightly lower, even though our network has more parameters, indicating that eliminating the feature upsampling process and density map generation saves considerable computation resources. The two types of TransCrowd networks represent Transformer-based works, whereas CAN and CSRNet are heavy convolutional representatives. It can be observed from Table 2 that there is a clear performance gap between them and our network in terms of speed and lightweighting. This experiment indicates that our lightweight hybrid network is well suited for deployment on edge computing devices. The lightweight GhostNet encoder helps save a significant number of parameters with its cost-effective linear operations while maintaining the network's depth to encode high-level features. Additionally, the Swin-Transformer module performs well in extracting long-range dependency context information. Finally, the cross-attention feature aggregation module works effectively in merging local features and long-range dependency features. Each part of our network contributes to boosting performance; to fully support this point, we present the ablation study in the following section. From the comparison between pure convolutional models and hybrid models, we can also see that the Transformer blocks in a model consume more device power.
To further evaluate the adaptability of our model, we conducted other object-counting tasks on part A of the EoCo dataset [54], which contains different categories of objects and thus provides suitable experimental conditions to validate generalization ability. The comparative results are shown in Table 3. Achieving the smallest average error across all categories validates the broad generalization ability of our model.
To further evaluate the robustness of our model to different scenarios, we display the prediction results in Figure 4. The first row validates the adaptability to different densities: the images exhibit a range of density variations, from relatively sparse to extremely dense. The images in the second row demonstrate a wide variation in illumination, with the last image exhibiting strong light pollution. Furthermore, all the images in Figure 4 represent scenarios with different perspectives and varied conditions. The small discrepancy between the ground-truth and predicted values indicates that our model can achieve good performance under various challenging circumstances.
The training loss and validation loss recordings of our model on the UCF-QNRF dataset are displayed in Figure 5. Following the implementation of the TransCrowd framework, the validation loss is evaluated from the 10th epoch onward at intervals of five epochs, while the training loss is recorded at every epoch. Both the training loss curve and the validation loss curve show a trend of gradual descent, remaining relatively steady after 450 epochs. This behavior, along with the good performance on the validation set, indicates that there is no underfitting in the training process. Furthermore, the validation loss curve shows no upward trend within the maximum number of epochs (500) set by the hyper-parameters, indicating that there is no overfitting (in the case of overfitting, the performance on the validation set would gradually deteriorate).

5.2. Ablation Study

5.2.1. The Performance with Different Layers of GhostNet

In this section, we demonstrate the effectiveness of each part of our network through ablation experiments. As we have illustrated, the inspiration for using GhostNet stems from its advanced convolutional architecture compared to others [1,35,55]. Based on previous work [39] in our lab, we directly take the first thirteen layers of GhostNet to strike the best balance between prediction accuracy and parameter count. The experimental results are shown in Table 4. We use this exact number of layers to ensure the completion of each down-sampling stage.

5.2.2. The Performance of Swin-Transformer Block

To verify the effectiveness of the Swin-Transformer block for long-range context feature extraction, we varied the number of Swin-Transformer blocks in our model and evaluated the prediction results on the UCF-QNRF dataset. The experimental results are displayed in Table 5. As the table shows, we first compared the influence of different numbers of Swin-Transformer blocks. In the table, the baseline model directly sends the output features of the GhostNet backbone to the decoder module, and “Swin * 0” indicates that we did not use the Transformer block in our model; in that case, we concatenated the output features of the PPAM and sent them to the cross-attention module directly. The symbol “Bid-FFN” in parentheses indicates that we replaced the FFN in the Swin-Transformer with the bi-dimensional FFN. It is evident that this approach leads to a significant improvement in accuracy. Additionally, we also tested replacing the Swin-Transformer block with other Transformer blocks; however, they did not yield satisfactory results.

5.3. The Effectiveness of Different Fusion Approach

To validate the effectiveness of the cross-attention module for feature fusion, we also tested several other approaches, such as the Proposal Attention Module (PAM) [56], fusion by element-wise sum, and element-wise multiplication. The results are presented in Table 6. Since cross-attention yields the best accuracy as the fusion method, we ultimately selected the cross-attention module as the network's fusion module. Although the element-wise fusion approaches have fewer parameters, we could not ignore the significant boost in accuracy that cross-attention provides. Among these fusion approaches, element-wise sum and element-wise multiplication are commonly used to fuse features in the channel space; they are computationally efficient and operate quickly. PAM has been utilized to establish complementary relationships and has been shown to work well in prior research [56]. In terms of its structure, PAM can incorporate both element-wise sum and element-wise multiplication operations. We tested PAM in our model, and it did show some improvement compared to element-wise operations, although its performance still lags behind that of the cross-attention module. A possible reason for the difference is that the PAM module excels in extracting background context information and using it to compensate foreground regions, whereas our goal here is to blend local features with long-range dependency features, which is not exactly PAM's strength. With cross-attention, these two types of complementary features can be closely integrated. It is worth noting that when we implement the element-wise operations, we reduce the channel size of the global features from the Swin-Transformer block using point-wise convolution. To ensure that the local feature and the global feature have the same size, we set the stride of the PPAM module to 1; the new width w′ and height h′ are then twice the original size.

5.4. The Performance of PPAM

The PPAM is utilized for the initial fusion of features of different scales at the channel level. It collects the output from each stage of the GhostNet backbone and utilizes it to enrich information at the channel level. If we directly took the final output feature of the GhostNet layers and sent it to the Swin-Transformer block, there would be potential risks. Firstly, the Swin-Transformer block requires a large amount of input feature information to effectively learn neural connections due to its larger number of parameters compared to convolutional operations; concatenating the output features from each scale significantly increases the number of input feature channels while reducing the feature resolution. Secondly, without the PPAM, features at different scales lack an appropriate way to intertwine. The statistical results in Table 7 support this view. When we remove the PPAM from our model for the ablation study, we take a copy of the GhostNet output and send it to the Swin-Transformer block directly.

6. Conclusions and Future Work

In this paper, we proposed a novel weakly supervised hybrid lightweight crowd-counting network that absorbs both GhostNet's ghost-feature generation and the Transformer's global receptive field to improve counting accuracy; these address, respectively, the inadequate intermediate features and the weak long-range global context extraction inherent in lightweight crowd-counting models. In our model, each module is carefully designed to maximize its utility in either improving accuracy or reducing computation. We followed a series of important design principles to keep our hybrid model as lightweight as possible, including the PPAM block for multi-scale feature extraction and fusion, weakly supervised guidance, and the omission of both output feature up-sampling and density map generation. Our results show competitive performance compared to previously proposed lightweight crowd-counting networks, including both fully supervised and weakly supervised methods, and some results achieve state-of-the-art performance at the same parameter level. The ablation experiments indicate the effective design of each part of our work: each module in our network leverages its advantages, and every aspect of the feature information flow is suitably considered. Compared to pure convolutional models, our network achieves better accuracy, and its inference speed is competitive with pure convolutional models at the same parameter level.
However, the adaptability of our network to datasets with inadequate training samples is not sufficient compared to pure convolutional lightweight models. Although some data augmentation techniques were adopted in our experiments, the performance does not meet our expectations. In future work, we will reconsider the attention block design in our network to compensate for this drawback, and we may also pre-train the Swin-Transformer block of our model on large datasets. Furthermore, the power consumption of our model is slightly higher than that of the other compared models, so more research can be conducted to optimize this balance. Finally, to strengthen robustness to negative samples, noise-robust training strategies, including transfer learning and outlier detection and removal techniques, can be implemented in future work.

Author Contributions

Conceptualization, Y.C. and H.Z.; methodology, Y.C., H.Z. and M.G.; software, Y.C. and M.G.; validation, H.Z., Y.C., M.G. and M.D.; formal analysis, H.Z.; investigation, M.D.; resources, H.Z.; data curation, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, H.Z., M.G. and M.D.; visualization, M.D. and M.G.; supervision, H.Z.; project administration, H.Z.; funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lv, H.; Yan, H.; Liu, K.; Zhou, Z.; Jing, J. Yolov5-ac: Attention mechanism-based lightweight yolov5 for track pedestrian detection. Sensors 2022, 22, 5903. [Google Scholar] [CrossRef] [PubMed]
  2. Lin, J.; Hu, J.; Xie, Z.; Zhang, Y.; Huang, G.; Chen, Z. A Multitask Network for People Counting, Motion Recognition, and Localization Using Through-Wall Radar. Sensors 2023, 23, 8147. [Google Scholar] [CrossRef] [PubMed]
  3. Zhu, F.; Yan, H.; Chen, X.; Li, T. Real-time crowd counting via lightweight scale-aware network. Neurocomputing 2022, 472, 54–67. [Google Scholar] [CrossRef]
  4. Son, S.; Seo, A.; Eo, G.; Gill, K.; Gong, T.; Kim, H.S. MiCrowd: Vision-Based Deep Crowd Counting on MCU. Sensors 2023, 23, 3586. [Google Scholar] [CrossRef] [PubMed]
  5. Liu, L.; Chen, J.; Wu, H.; Chen, T.; Li, G.; Lin, L. Efficient Crowd Counting via Structured Knowledge Transfer. arXiv 2020, arXiv:2003.10120. [Google Scholar]
  6. Khan, K.; Khan, R.U.; Albattah, W.; Nayab, D.; Qamar, A.M.; Habib, S.; Islam, M. Crowd counting using end-to-end semantic image segmentation. Electronics 2021, 10, 1293. [Google Scholar] [CrossRef]
  7. Khan, S.D.; Salih, Y.; Zafar, B.; Noorwali, A. A deep-fusion network for crowd counting in high-density crowded scenes. Int. J. Comput. Intell. Syst. 2021, 14, 168. [Google Scholar] [CrossRef]
  8. Chen, X.; Yu, X.; Di, H.; Wang, S. Sa-internet: Scale-aware interaction network for joint crowd counting and localization. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Beijing, China, 29 October–1 November 2021; pp. 203–215. [Google Scholar]
  9. Duan, Z.; Wang, S.; Di, H.; Deng, J. Distillation remote sensing object counting via multi-scale context feature aggregation. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5613012. [Google Scholar] [CrossRef]
  10. Xie, Y.; Lu, Y.; Wang, S. Rsanet: Deep recurrent scale-aware network for crowd counting. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Virtual, 25–28 October 2020; pp. 1531–1535. [Google Scholar]
  11. Wang, S.; Lu, Y.; Zhou, T.; Di, H.; Lu, L.; Zhang, L. SCLNet: Spatial context learning network for congested crowd counting. Neurocomputing 2020, 404, 227–239. [Google Scholar] [CrossRef]
  12. Ranjan, V.; Le, H.; Hoai, M. Iterative crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 270–285. [Google Scholar]
  13. Boominathan, L.; Kruthiventi, S.S.; Babu, R.V. Crowdnet: A deep convolutional network for dense crowd counting. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 640–644. [Google Scholar]
  14. Shi, X.; Li, X.; Wu, C.; Kong, S.; Yang, J.; He, L. A real-time deep network for crowd counting. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2328–2332. [Google Scholar]
  15. Jiang, G.; Wu, R.; Huo, Z.; Zhao, C.; Luo, J. LigMSANet: Lightweight multi-scale adaptive convolutional neural network for dense crowd counting. Expert Syst. Appl. 2022, 197, 116662. [Google Scholar] [CrossRef]
  16. Goh, G.L.; Goh, G.D.; Pan, J.W.; Teng, P.S.P.; Kong, P.W. Automated Service Height Fault Detection Using Computer Vision and Machine Learning for Badminton Matches. Sensors 2023, 23, 9759. [Google Scholar] [CrossRef]
  17. Yu, R.; Wang, S.; Lu, Y.; Di, H.; Zhang, L.; Lu, L. SAF: Semantic Attention Fusion Mechanism for Pedestrian Detection. In Proceedings of the Pacific Rim International Conference on Artificial Intelligence, Cuvu, Fiji, 26–30 August 2019; pp. 523–533. [Google Scholar]
  18. Wang, Q.; Breckon, T.P. Crowd Counting via Segmentation Guided Attention Networks and Curriculum Loss. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15233–15243. [Google Scholar] [CrossRef]
  19. Yang, Y.; Li, G.; Wu, Z.; Su, L.; Huang, Q.; Sebe, N. Weakly supervised crowd counting learns from sorting rather than locations. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 1–17. [Google Scholar]
  20. Liang, D.; Chen, X.; Xu, W.; Zhou, Y.; Bai, X. Transcrowd: Weakly supervised crowd counting with Transformers. Sci. China Inf. Sci. 2022, 65, 160104. [Google Scholar] [CrossRef]
  21. Lei, Y.; Liu, Y.; Zhang, P.; Liu, L. Towards using count-level weak supervision for crowd counting. Pattern Recognit. 2021, 109, 107616. [Google Scholar] [CrossRef]
  22. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  23. Zhang, W.; Huang, Z.; Luo, G.; Chen, T.; Wang, X.; Liu, W.; Yu, G.; Shen, C. TopFormer: Token pyramid Transformer for mobile semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12083–12093. [Google Scholar]
  24. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision Transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  25. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597. [Google Scholar]
  26. Sindagi, V.A.; Yasarla, R.; Patel, V.M. JHU-CROWD++: Large-Scale Crowd Counting Dataset and A Benchmark Method. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2594–2609. [Google Scholar] [CrossRef]
  27. Cao, C.; Lu, Y.; Wang, P.; Zhang, Y. A New Comprehensive Benchmark for Semi-Supervised Video Anomaly Detection and Anticipation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 20392–20401. [Google Scholar]
  28. Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; Shah, M. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 532–546. [Google Scholar]
  29. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  30. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  31. Koonce, B. MobileNetV3. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Springer: Berlin/Heidelberg, Germany, 2021; pp. 125–144. [Google Scholar]
  32. Han, K.; Wang, Y.; Xu, C.; Guo, J.; Xu, C.; Wu, E.; Tian, Q. GhostNets on heterogeneous devices via cheap operations. Int. J. Comput. Vis. 2022, 130, 1050–1069. [Google Scholar] [CrossRef]
  33. Zhang, J.; Peng, H.; Wu, K.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. Minivit: Compressing vision Transformers with weight multiplexing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12145–12154. [Google Scholar]
  34. Bolya, D.; Fu, C.Y.; Dai, X.; Zhang, P.; Hoffman, J. Hydra attention: Efficient attention with many heads. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2022; pp. 35–49. [Google Scholar]
  35. Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobile-former: Bridging mobilenet and Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5270–5279. [Google Scholar]
  36. Pan, J.; Bulat, A.; Tan, F.; Zhu, X.; Dudziak, L.; Li, H.; Tzimiropoulos, G.; Martinez, B. Edgevits: Competing light-weight cnns on mobile devices with vision Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2022; pp. 294–311. [Google Scholar]
  37. Chen, Q.; Wu, Q.; Wang, J.; Hu, Q.; Hu, T.; Ding, E.; Cheng, J.; Wang, J. Mixformer: Mixing features across windows and dimensions. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2022; pp. 5249–5259. [Google Scholar]
  38. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision Transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  39. Liang, L.; Zhao, H.; Zhou, F.; Ma, M.; Yao, F.; Ji, X. PDDNet: Lightweight congested crowd counting via pyramid depth-wise dilated convolution. Appl. Intell. 2023, 53, 10472–10484. [Google Scholar] [CrossRef]
  40. Dong, J.; Zhao, Z.; Wang, T. Crowd Counting by Multi-Scale Dilated Convolution Networks. Electronics 2023, 12, 2624. [Google Scholar] [CrossRef]
  41. Tian, Y.; Duan, C.; Zhang, R.; Wei, Z.; Wang, H. Lightweight Dual-Task Networks For Crowd Counting In Aerial Images. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1975–1979. [Google Scholar]
  42. Zhang, Y.; Zhao, H.; Duan, Z.; Huang, L.; Deng, J.; Zhang, Q. Congested crowd counting via adaptive multi-scale context learning. Sensors 2021, 21, 3777. [Google Scholar] [CrossRef] [PubMed]
  43. Sun, Y.; Li, M.; Guo, H.; Zhang, L. MSGSA: Multi-Scale Guided Self-Attention Network for Crowd Counting. Electronics 2023, 12, 2631. [Google Scholar] [CrossRef]
  44. Wang, M.; Zhou, J.; Cai, H.; Gong, M. Crowdmlp: Weakly supervised crowd counting via multi-granularity mlp. Pattern Recognit. 2023, 144, 109830. [Google Scholar] [CrossRef]
  45. Wang, F.; Liu, K.; Long, F.; Sang, N.; Xia, X.; Sang, J. Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting. arXiv 2022, arXiv:2203.06388. [Google Scholar]
  46. Huang, T.; Huang, L.; You, S.; Wang, F.; Qian, C.; Xu, C. Lightvit: Towards light-weight convolution-free vision Transformers. arXiv 2022, arXiv:2207.05557. [Google Scholar]
  47. Chen, C.F.R.; Fan, Q.; Panda, R. Crossvit: Cross-attention multi-scale vision Transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 357–366. [Google Scholar]
  48. Gao, J.; Wang, Q.; Li, X. Pcc net: Perspective crowd counting via spatial convolutional network. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3486–3498. [Google Scholar] [CrossRef]
  49. Xia, Y.; He, Q.; Wei, W.; Yin, B. ARNet: Accurate and Real-Time Network for Crowd Counting. In PRICAI 2021: Trends in Artificial Intelligence; Pham, D.N., Theeramunkong, T., Governatori, G., Liu, F., Eds.; Springer: Cham, Switzerland, 2021; pp. 376–389. [Google Scholar]
  50. Wang, P.; Gao, C.; Wang, Y.; Li, H.; Gao, Y. MobileCount: An efficient encoder–decoder framework for real-time crowd counting. Neurocomputing 2020, 407, 292–299. [Google Scholar] [CrossRef]
  51. Babu Sam, D.; Surya, S.; Venkatesh Babu, R. Switching convolutional neural network for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5744–5752. [Google Scholar]
  52. Li, Y.; Zhang, X.; Chen, D. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1091–1100. [Google Scholar]
  53. Liu, W.; Salzmann, M.; Fua, P. Context-aware crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5099–5108. [Google Scholar]
  54. Jiang, S.; Wang, Q.; Cheng, F.; Qi, Y.; Liu, Q. A Unified Object Counting Network with Object Occupation Prior. arXiv 2022, arXiv:2212.14193. [Google Scholar] [CrossRef]
  55. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  56. Li, Y.; Chen, X.; Zhu, Z.; Xie, L.; Huang, G.; Du, D.; Wang, X. Attention-guided unified network for panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7026–7035. [Google Scholar]
Figure 1. The overview of our proposed method. The input image undergoes initial resizing and cropping to obtain an image patch, which is then fed into the GhostNet backbone for hierarchical feature extraction. Simultaneously, features from different levels of GhostNet are fed into the Pyramid Pooling Aggregation Module (PPAM) to obtain fused multi-scale features. These features are further processed through the Modified Swin-Transformer module for global feature enhancement. The enhanced multi-scale features are subjected to cross-attention operations to further augment the output features of GhostNet and are then fed into the regression module to obtain the final counting number.
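To make the data flow in Figure 1 concrete, the following PyTorch sketch wires up placeholder modules in the same order: truncated local backbone, PPAM-style fusion, global Transformer block, cross-attention bridge, and count regressor. All layers, dimensions, and the class name HybridCounterSketch are stand-ins for illustration, not the paper's exact GhostNet, PPAM, or modified Swin-Transformer implementations.

```python
import torch
import torch.nn as nn

# Structural sketch of the Figure 1 pipeline with placeholder modules.
class HybridCounterSketch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.backbone = nn.Sequential(                  # stand-in for the GhostNet stem
            nn.Conv2d(3, dim, 3, stride=4, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.ppam = nn.Conv2d(dim, dim, 1)              # placeholder for multi-scale PPAM fusion
        self.global_block = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                       batch_first=True)  # stand-in "Swin" block
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.regressor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(inplace=True),
                                       nn.Linear(dim, 1))

    def forward(self, x):
        local = self.backbone(x)                        # B x C x H x W local features
        fused = self.ppam(local)                        # fused multi-scale features
        tokens = fused.flatten(2).transpose(1, 2)       # B x N x C tokens
        glob = self.global_block(tokens)                # global context enhancement
        loc_tokens = local.flatten(2).transpose(1, 2)
        bridged, _ = self.cross_attn(loc_tokens, glob, glob)    # cross-attention bridge
        return self.regressor(bridged.mean(dim=1)).squeeze(-1)  # scalar count per image

count = HybridCounterSketch()(torch.randn(2, 3, 224, 224))
print(count.shape)  # torch.Size([2])
```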
Figure 2. Illustration of bi-dimensional attention module in modified FFN [46].
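The bi-dimensional attention in the modified FFN can be pictured as two lightweight gates on the hidden features, one along the channel dimension and one along the token (spatial) dimension. The sketch below is only a generic illustration in that spirit; the exact gating in [46] and in our modified block may differ.

```python
import torch
import torch.nn as nn

# Illustrative FFN with bi-dimensional (channel + token) re-weighting of hidden features.
class BiDimFFN(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(dim, hidden), nn.Linear(hidden, dim)
        self.act = nn.GELU()
        self.channel_gate = nn.Sequential(nn.Linear(hidden, hidden), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, x):                                        # x: B x N x C
        h = self.act(self.fc1(x))                                # B x N x H
        c_attn = self.channel_gate(h.mean(dim=1, keepdim=True))  # B x 1 x H channel weights
        s_attn = self.spatial_gate(h)                            # B x N x 1 token weights
        h = h * c_attn * s_attn                                  # bi-dimensional re-weighting
        return self.fc2(h)

out = BiDimFFN(64, 128)(torch.randn(2, 49, 64))
print(out.shape)  # torch.Size([2, 49, 64])
```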
Figure 3. Illustration of the simple regressor module.
Figure 4. Adaptability to different scenarios. “GT” denotes ground truth value, and “Pred” denotes prediction value of our model.
Figure 5. Training loss (left) and validation loss (right) records in the training process on UCF-QNRF dataset.
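The loss curves in Figure 5 come from count-level (weakly supervised) training only, with no density map targets. A minimal training-loop sketch under that setting is given below; the model, data loader, optimizer, and the use of an L1 loss on counts are assumptions for illustration rather than the released training code.

```python
import torch
import torch.nn as nn

# Sketch of one epoch of count-level training: each image is paired with a single
# scalar count, and the regressor's prediction is supervised directly.
def train_one_epoch(model, loader, optimizer, device="cpu"):
    criterion = nn.L1Loss()                      # absolute error between predicted and annotated counts
    model.train()
    running = 0.0
    for images, counts in loader:                # counts: one scalar per image
        images, counts = images.to(device), counts.float().to(device)
        optimizer.zero_grad()
        pred = model(images)                     # predicted scalar counts, shape (B,)
        loss = criterion(pred, counts)
        loss.backward()
        optimizer.step()
        running += loss.item() * images.size(0)
    return running / len(loader.dataset)
```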
Table 1. Comparative results with other SOTA networks. The best results are indicated in bold font, and the second-best results are indicated with underlines.
| Method | Backbone | Params/M | Part A MAE | Part A MSE | Part B MAE | Part B MSE | UCF-QNRF MAE | UCF-QNRF MSE | JHU-Crowd++ MAE | JHU-Crowd++ MSE | NWPU MAE | NWPU MSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MCNN [25] | - | 0.13 | 110.2 | 173.2 | 26.4 | 41.3 | 277 | 426 | - | - | 218.5 | 700.6 |
| C-CNN [14] | - | 0.07 | 88.1 | 141.7 | 14.9 | 22.1 | - | - | - | - | - | - |
| SKT [5] | - | 0.06 | 78.0 | 126.6 | 11.9 | 19.8 | - | - | - | - | - | - |
| LightMSANet [15] | - | 0.63 | 76.6 | 121.4 | 10.9 | 17.5 | - | - | - | - | - | - |
| PCCNet [48] | - | 0.55 | 73.5 | 124.0 | 11 | 19 | - | - | - | - | 91.5 | 381.5 |
| PDDNet [39] | GhostNet | 1.10 | 72.6 | 112.2 | 10.3 | 17.0 | 130.2 | 246.6 | - | - | 91.5 | 381.0 |
| Ours | GhostNet | 1.69 | 71.5 | 108.6 | 11.3 | 20 | 100.4 | 182.6 | 73.0 | 279.7 | 97.2 | 394.1 |
| ARNet [49] | SqueezeNet | 1.77 | - | - | 7.5 | 12.6 | 110.0 | 207.9 | 78.2 | 276.8 | 89.3 | 332.8 |
| MobileCount [50] | MobileNet-V2 | 3.40 | 81.4 | 133.3 | 8.1 | 12.7 | 117.9 | 207.5 | - | - | - | - |
| Switching-CNN [51] | VGG16 | 15.3 | 90.4 | 135.0 | 21.6 | 33.4 | 228 | 445 | - | - | - | - |
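For reference, the MAE and MSE columns follow the usual crowd-counting convention: MAE is the mean absolute count error and the reported "MSE" is typically the root of the mean squared count error. That reading of "MSE" is an assumption based on field convention; the sketch below simply computes both quantities from predicted and ground-truth counts.

```python
import numpy as np

# Compute MAE and (root) MSE over per-image counts, as conventionally reported
# in crowd-counting benchmarks.
def counting_metrics(pred_counts, gt_counts):
    pred = np.asarray(pred_counts, dtype=np.float64)
    gt = np.asarray(gt_counts, dtype=np.float64)
    mae = np.mean(np.abs(pred - gt))
    mse = np.sqrt(np.mean((pred - gt) ** 2))
    return mae, mse

print(counting_metrics([105, 298, 57], [100, 310, 60]))  # toy example
```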
Table 2. Inference speed and model lightweighting comparison with some SOTA methods on the ShanghaiTech A dataset. The best results are indicated in bold font, and the second-best results are indicated with underlines.
| Methods | Inference Time (ms) | Params (M) | GFLOPs | Memory (MB) | CPU Power (W) | GPU Power (W) |
|---|---|---|---|---|---|---|
| PDDNet [39] | 6.28 | 1.10 | 4.56 | 329.88 | 16.21 | 87.55 |
| Ours | 7.07 | 1.69 | 2.51 | 306.89 | 17.66 | 88.05 |
| CSRNet [52] | 10.78 | 16.26 | 365.65 | 759.00 | 16.39 | 87.59 |
| CAN [53] | 13.01 | 18.10 | 387.59 | 867.29 | 15.98 | 87.73 |
| TransCrowd (token) [20] | 15.18 | 86.38 | 296.09 | 1768.85 | 18.65 | 87.95 |
| TransCrowd (gap) [20] | 15.21 | 89.15 | 296.11 | 1779.6 | 15.28 | 87.95 |
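The latency and parameter figures in Table 2 can be reproduced in spirit with a warm-up-then-average timing loop. The sketch below is a generic protocol on a stand-in model and default input size; it is not the exact measurement script, hardware, or power-monitoring setup used for the table.

```python
import time
import torch

# Generic benchmarking sketch: average forward latency and parameter count.
@torch.no_grad()
def benchmark(model, input_size=(1, 3, 224, 224), warmup=10, iters=50, device="cpu"):
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):                 # warm-up runs stabilize caches/clocks
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / iters * 1000.0
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    return latency_ms, params_m

# Example with a tiny stand-in model:
print(benchmark(torch.nn.Conv2d(3, 8, 3)))
```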
Table 3. Generalization of comparative results on Eoco part A datasets.
| Classes | Samples | GT | MCNN Pred | MCNN Error | CSRNet Pred | CSRNet Error | Ours Pred | Ours Error |
|---|---|---|---|---|---|---|---|---|
| Cherry | cherry_1.png | 38.00 | 6.06 | 31.94 | 35.83 | 2.17 | 27.00 | 11.00 |
| | cherry_2.png | 57.00 | 3.47 | 53.53 | 36.67 | 20.33 | 41.35 | 15.65 |
| | cherry_3.png | 18.00 | 23.64 | 5.64 | 17.07 | 0.93 | 22.73 | 4.73 |
| Chicken | chicken_1.png | 23.00 | 135.16 | 112.16 | 35.35 | 12.35 | 21.47 | 1.53 |
| | chicken_2.png | 23.00 | 57.24 | 34.24 | 39.49 | 16.49 | 40.84 | 17.84 |
| | chicken_3.png | 19.00 | 20.15 | 1.15 | 18.10 | 0.90 | 29.48 | 10.48 |
| Tulip | tulip_1.png | 33.00 | 59.60 | 26.60 | 43.70 | 10.70 | 40.60 | 7.60 |
| | tulip_2.png | 64.00 | 93.82 | 29.82 | 83.28 | 19.28 | 70.00 | 6.00 |
| | tulip_3.png | 21.00 | 32.33 | 11.33 | 28.31 | 7.31 | 16.09 | 4.91 |
| Vehicle | vehicle_1.png | 45.00 | 18.96 | 26.04 | 15.17 | 29.83 | 18.78 | 26.22 |
| | vehicle_2.png | 11.00 | 13.56 | 2.56 | 10.17 | 0.83 | 16.15 | 5.15 |
| | vehicle_3.png | 65.00 | 29.11 | 35.89 | 54.11 | 10.89 | 53.17 | 11.83 |
| Jujube | jujube_1.png | 68.00 | 29.81 | 38.19 | 44.16 | 23.84 | 66.73 | 1.27 |
| | jujube_2.png | 47.00 | 17.72 | 29.28 | 49.88 | 2.88 | 53.62 | 6.62 |
| | jujube_3.png | 114.00 | 15.03 | 98.97 | 85.82 | 28.18 | 106.78 | 7.22 |
| Average Error | / | / | - | 29.75 | - | 10.90 | - | 7.11 |
Table 4. Performance with different depths of GhostNet. The numbers in bold indicate the depth at which GhostNet achieves the best trade-off between accuracy and parameter count.
| Layers | MAE | MSE | Params (M) |
|---|---|---|---|
| 7 | 117.0 | 174.5 | 0.085 |
| 13 | 81.8 | 123.3 | 0.921 |
| 15 | 79.1 | 118.8 | 2.949 |
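The depth ablation above amounts to keeping only the first k blocks of the backbone. A generic sketch is shown below; an arbitrary nn.Sequential stands in for GhostNet, and the layer counts are illustrative rather than the paper's exact configuration.

```python
import torch.nn as nn

# Keep only the first k child modules of a sequential backbone.
def truncate_backbone(backbone: nn.Sequential, k: int) -> nn.Sequential:
    return nn.Sequential(*list(backbone.children())[:k])

full = nn.Sequential(*[nn.Conv2d(3 if i == 0 else 8, 8, 3, padding=1) for i in range(16)])
shallow = truncate_backbone(full, 13)       # e.g., keep the first 13 layers
print(sum(p.numel() for p in shallow.parameters()))   # parameter count of the truncated backbone
```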
Table 5. Performance comparison with different Transformer blocks. Experiments were conducted on the UCF-QNRF dataset. The best accuracy is marked in bold font.
| Transformer | MAE | MSE | Params (M) | GFLOPs |
|---|---|---|---|---|
| Baseline | 109.8 | 231.6 | 1.32 | 2.32 |
| Swin * 0 | 108.2 | 204.4 | 1.50 | 2.48 |
| Swin * 1 | 103.1 | 187.1 | 1.67 | 2.63 |
| Swin * 1 (Bid-FFN) | 100.4 | 182.6 | 1.69 | 2.64 |
| Swin * 2 (Bid-FFN) | 109.1 | 202.3 | 1.67 | 3.47 |
| Hydra * 1 (Bid-FFN) | 102.3 | 190.6 | 1.53 | 3.04 |
Table 6. Performance of different feature fusion methods. ⨁ denotes element-wise sum and ⨂ denotes element-wise multiplication. Experiments were conducted on the UCF-QNRF dataset. The best accuracy is marked in bold font.
| Method | MAE | MSE | Params (M) | GFLOPs |
|---|---|---|---|---|
| ⨁ (element-wise sum) | 109.1 | 202.3 | 1.67 | 3.47 |
| ⨂ (element-wise multiplication) | 112.7 | 217.6 | 1.67 | 3.47 |
| PAM | 108.2 | 207.2 | 1.86 | 4.20 |
| Cross-attention | 100.4 | 182.6 | 1.69 | 2.64 |
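The fusion variants compared above differ only in how local and global tokens are combined. The sketch below contrasts cross-attention fusion with an element-wise sum on toy tensors; the dimensions, head count, and the interpolation used to match token lengths are illustrative assumptions, not the paper's exact layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, heads = 64, 4
local_tokens = torch.randn(2, 196, dim)        # e.g., flattened local (CNN-branch) features
global_tokens = torch.randn(2, 49, dim)        # e.g., globally enhanced (Transformer-branch) tokens

# Cross-attention fusion: local tokens query the global tokens.
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
fused, _ = cross_attn(query=local_tokens, key=global_tokens, value=global_tokens)

# Element-wise sum requires matching token lengths, e.g. via 1-D interpolation.
summed = local_tokens + F.interpolate(global_tokens.transpose(1, 2), size=196,
                                      mode="linear", align_corners=False).transpose(1, 2)
print(fused.shape, summed.shape)               # both torch.Size([2, 196, 64])
```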
Table 7. Influence of PPAM on network performance; "w/o PPAM" denotes the model without PPAM. Experiments were conducted on the UCF-QNRF dataset.
| Model | MAE | MSE | Params (M) | GFLOPs |
|---|---|---|---|---|
| PPAM | 100.4 | 182.6 | 1.69 | 2.64 |
| w/o PPAM | 101.7 | 185.4 | 1.63 | 3.43 |
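For intuition, a pyramid-pooling-style aggregation can be sketched as multi-scale average pooling followed by projection, upsampling, and fusion, as below. This is an assumption-laden illustration in the spirit of PPAM; the pooling scales and projections may differ from the module actually used in our network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative pyramid pooling aggregation: pool at several scales, project with 1x1
# convolutions, upsample back, and fuse with the input features.
class PyramidPoolingAggregation(nn.Module):
    def __init__(self, in_ch, out_ch, scales=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s), nn.Conv2d(in_ch, out_ch, 1))
            for s in scales)
        self.project = nn.Conv2d(in_ch + out_ch * len(scales), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [x] + [F.interpolate(b(x), size=(h, w), mode="bilinear",
                                     align_corners=False) for b in self.branches]
        return self.project(torch.cat(feats, dim=1))

y = PyramidPoolingAggregation(64, 32)(torch.randn(1, 64, 28, 28))
print(y.shape)  # torch.Size([1, 32, 28, 28])
```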
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
