Video-Based Plastic Bag Grabbing Action Recognition: A New Video Dataset and a Comparative Study of Baseline Models
Abstract
1. Introduction
- A benchmark bag-grabbing video dataset is established. We collected 989 video clips covering two categories: positive clips contain the action of taking the provided plastic bag, while negative clips contain other actions at the counter that do not include it. To the best of our knowledge, this is the first dataset for this problem, as reviewed in Table 1.
- Plastic bag grabbing is a niche action recognition problem. For it, we designed three baseline approaches: (i) hand-crafted feature extraction followed by a sequential classification model, (ii) a multiple-frame convolutional neural network (CNN)-based action recognition model, and (iii) a 3D CNN-based action recognition model. We provide a comparative study by evaluating these baseline approaches on our benchmark dataset.
2. Related Works
- Recognition objective: Action video understanding commonly centers on two related but distinct tasks: action recognition and action detection. Action recognition focuses on classifying an entire video clip based on the action it contains [30]. This can be further divided into trimmed and untrimmed scenarios. In trimmed action recognition, the action extends across the entire duration of the video. In contrast, untrimmed action recognition deals with videos that include additional irrelevant segments before or after the target action. On the other hand, temporal action detection aims to identify not only the type of action but also its precise start and end times within an untrimmed video [19].
- Backbone modeling: The second perspective focuses on the choice of backbone models [20]. Popular approaches have leveraged CNNs to extract spatiotemporal features. Such methods typically stack 2D or 3D convolutions to capture both spatial patterns within individual frames and temporal patterns across frames. Recently, transformer-based architectures have emerged as a compelling alternative. These models first tokenize video frames into a sequence of embeddings and then use multi-head self-attention layers to capture long-range dependencies. The final classification layers map these embeddings to action categories. Representative transformer-based models include a pure Transformer architecture that models video data as a sequence of spatiotemporal tokens [21], a factorized self-attention mechanism that processes the temporal and spatial dimensions independently [22], a multiscale vision transformer [23], and a Video Swin Transformer that adapts the Swin Transformer's shifted windowing strategy to the video domain [24].
- Deployment strategies: Action understanding models are deployed in two settings: supervised classification (the focus of this paper) and zero-shot classification. In the supervised setting, models are trained on labeled datasets and then applied to testing examples. Zero-shot classification, by contrast, addresses scenarios where no training examples are available for a given action class. Leveraging auxiliary modalities (e.g., text embeddings), zero-shot models can classify new, unseen action categories without additional supervision [25,26].
3. Dataset
3.1. Data Collection and Organization
- Each recording began with an empty counter, followed by placing items on the counter and packing them into plastic bags. The video captured the customer either taking the provided plastic bags from the counter or introducing their own bag. Lastly, both the items and the bags were removed from the counter.
- A variety of “plastic bag taking” actions are represented, including pinching and grabbing the bag, performed with either the left or the right hand, as illustrated in Figure 1.
- Only red-coloured bags are used to represent the plastic bags at the self-checkout counter. This is consistent with the observation that a store typically provides a standard design of plastic bags.
- The customer may touch the bag at the self-checkout counter without taking the plastic bag in the end.
3.2. Limitation of This Dataset
4. Three Baseline Approaches and a Comparative Study
4.1. Selection of Three Baseline Approaches
4.2. Approach 1: Hand-Crafted Features + LSTM
- Hand landmark detection. A pre-trained model, MediaPipe Hands [36], is used for hand landmark detection. The model localises 21 landmarks per hand, each consisting of image coordinates and a relative depth value. This gives rise to 126 location values when both hands are detected in the frame.
- Relational feature extraction. The relational feature is then extracted by calculating the distances from the index fingertip and thumb tip of each hand to the four corners of the bounding box of the detected plastic bag, as illustrated in Figure 4. This relational information is extracted from every frame, and the sequence of such features serves as input to the subsequent classification model. This gives rise to 16 values in each frame (a feature-extraction sketch covering this step and the previous one is given after this list).
- Action classification. Two variations of the hand-crafted features + LSTM technique are explored, namely early fusion and late fusion, as presented in Figure 5; the key difference between them lies in the inputs to the LSTM model. In early fusion, the hand keypoints and the relational features are combined into a single input vector per frame before being fed into one LSTM model, so the model learns an integrated representation of the hand pose and its spatial relationship to the plastic bag from the very beginning. In contrast, late fusion maintains two independent LSTM models, one for the hand keypoints and another for the relational features, and their outputs are merged only after each model has processed its own feature stream. This approach allows each LSTM to specialize in a particular feature domain and reduces the complexity of each data representation (both fusion variants are sketched after this list).
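To make the two feature streams concrete, the following Python sketch extracts the 126 hand-keypoint values and the 16 relational values for a single frame. It is a minimal illustration rather than the exact implementation used in the experiments: the MediaPipe Hands solution API provides the landmarks, while the plastic-bag bounding box is assumed to come from a separate detector such as YOLOv5 [37]; the helper name `frame_features` and the normalised-coordinate convention are assumptions for this sketch.

```python
import numpy as np
import mediapipe as mp

INDEX_TIP, THUMB_TIP = 8, 4  # MediaPipe Hands landmark indices for the two fingertips

def frame_features(rgb_frame, bag_box, hands):
    """Per-frame features: a 126-d hand-keypoint vector (2 hands x 21 landmarks
    x (x, y, z)) and a 16-d relational vector (2 hands x {index tip, thumb tip}
    x 4 bag-box corners).  `bag_box` is (x1, y1, x2, y2) in normalised image
    coordinates, assumed to be supplied by a separate plastic-bag detector."""
    keypoints = np.zeros(126, dtype=np.float32)
    relational = np.zeros(16, dtype=np.float32)

    result = hands.process(rgb_frame)          # rgb_frame: H x W x 3, RGB, uint8
    if not result.multi_hand_landmarks:
        return keypoints, relational           # no hands detected in this frame

    x1, y1, x2, y2 = bag_box
    corners = np.array([(x1, y1), (x2, y1), (x1, y2), (x2, y2)], dtype=np.float32)

    for h, hand in enumerate(result.multi_hand_landmarks[:2]):
        coords = np.array([(lm.x, lm.y, lm.z) for lm in hand.landmark],
                          dtype=np.float32)    # (21, 3)
        keypoints[h * 63:(h + 1) * 63] = coords.ravel()
        for f, tip in enumerate((INDEX_TIP, THUMB_TIP)):
            dists = np.linalg.norm(corners - coords[tip, :2], axis=1)  # 4 distances
            relational[h * 8 + f * 4:h * 8 + f * 4 + 4] = dists
    return keypoints, relational

# Example usage (the bag box here is a placeholder value):
# hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)
# kp, rel = frame_features(frame_rgb, (0.4, 0.5, 0.7, 0.9), hands)
```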
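The two fusion variants can then be sketched as follows, operating on frame-level sequences of the 126-d keypoint and 16-d relational features. The use of PyTorch, the hidden size of 128, and the two-class output head are illustrative assumptions, not the reported configuration.

```python
import torch
import torch.nn as nn

class EarlyFusionLSTM(nn.Module):
    """Early fusion: one LSTM over the concatenated 142-d per-frame feature."""
    def __init__(self, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=126 + 16, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)                     # grab vs. no-grab

    def forward(self, keypoints, relational):                # (B, T, 126), (B, T, 16)
        x = torch.cat([keypoints, relational], dim=-1)       # integrated representation
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])

class LateFusionLSTM(nn.Module):
    """Late fusion: two independent LSTMs, merged only before the classifier."""
    def __init__(self, hidden=128):
        super().__init__()
        self.kp_lstm = nn.LSTM(126, hidden, batch_first=True)
        self.rel_lstm = nn.LSTM(16, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, keypoints, relational):
        _, (hk, _) = self.kp_lstm(keypoints)                 # hand-pose stream
        _, (hr, _) = self.rel_lstm(relational)               # hand-bag relation stream
        return self.head(torch.cat([hk[-1], hr[-1]], dim=-1))
```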
4.3. Approach 2: Energy Motion Image + Image Classification
- Motion energy map. Hand skeletal structures, if any, are rendered using hand landmark detection for every frame in the input video clip. The skeleton frames are then combined into a single energy motion image (EMI): each EMI overlays up to 150 frames with frame-dependent weights, so that the latest frame in the clip carries the greatest weight in the final EMI. These EMIs are used as inputs to an image classification model that determines whether the clip reflects the plastic-bag-taking action (a construction sketch is given after this list).
- Action classification. Since the energy motion image is grayscale, a single-channel image classification model (i.e., a ResNet model) is explored. In addition, the EMI is downsized to a lower resolution to reduce the number of parameters (a sketch of the single-channel model follows the EMI sketch below).
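A minimal sketch of the EMI construction is given below, under the stated cap of 150 frames and later-frame-heavier weighting. The linear and quadratic weighting options mirror the ablation study in Section 5.4, but the max-style overlay rule and the 112 x 112 output size are assumptions for illustration; the exact combination rule and target resolution are not restated here.

```python
import numpy as np
import cv2

def energy_motion_image(skeleton_frames, out_size=(112, 112), weighting="linear"):
    """Collapse a sequence of grayscale hand-skeleton frames (e.g., MediaPipe
    landmarks drawn on a black canvas) into one energy motion image.  Later
    frames receive larger weights so the most recent motion dominates."""
    frames = skeleton_frames[-150:]                      # keep at most 150 frames
    n = len(frames)
    t = np.arange(1, n + 1, dtype=np.float32)
    weights = t if weighting == "linear" else t ** 2     # quadratic alternative
    weights /= weights.max()

    emi = np.zeros_like(frames[0], dtype=np.float32)
    for w, frame in zip(weights, frames):
        # One plausible overlay rule: keep the strongest weighted response per pixel.
        emi = np.maximum(emi, w * frame.astype(np.float32))

    emi = cv2.resize(emi, out_size)                      # downsized grayscale EMI
    return (255 * emi / max(emi.max(), 1e-6)).astype(np.uint8)
```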
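For the classification step, one way to obtain a single-channel model is to swap the stem of a stock torchvision ResNet for a one-channel convolution and resize the classifier head. ResNet-18 and the 112 x 112 input are assumptions, since the paper specifies only a ResNet variant applied to the downsized grayscale EMI.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def single_channel_resnet(num_classes=2):
    """ResNet-18 adapted to grayscale EMI input (1 channel instead of 3)."""
    model = resnet18(weights=None)                        # train from scratch on EMIs
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

# Example: logits for a batch of 8 grayscale EMIs at an assumed 112 x 112 resolution.
# logits = single_channel_resnet()(torch.randn(8, 1, 112, 112))
```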
4.4. Approach 3: 3D CNN Model
- EfficientNet model [38]. It is re-trained on our benchmark video dataset.
- 3D CNN model [39]. It generally performs better than 2D networks because it models temporal information on top of the spatial information that 2D networks capture.
- (2 + 1)D ResNet model [40]. The use of (2 + 1)D convolutions instead of regular 3D convolutions decomposes each convolution into separate spatial and temporal steps, which reduces the number of parameters and the computational complexity. It also helps mitigate overfitting and introduces an additional non-linearity, allowing a richer functional relationship to be modeled (see the sketch below).
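To illustrate the decomposition, a minimal (2 + 1)D block in PyTorch is given below, following the factorisation described in [40]. The channel widths and kernel size are illustrative and do not reproduce the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """Factorises a k x k x k 3D convolution into a 1 x k x k spatial convolution
    followed by a k x 1 x 1 temporal convolution, with an extra non-linearity
    in between, as in the (2 + 1)D design of [40]."""
    def __init__(self, in_ch, out_ch, mid_ch=None, k=3):
        super().__init__()
        mid_ch = mid_ch or out_ch
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.bn = nn.BatchNorm3d(mid_ch)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0), bias=False)

    def forward(self, x):                    # x: (B, C, T, H, W)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

# Plain 3D counterpart for comparison (more parameters, one fewer non-linearity):
# nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
```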
5. Experimental Results
5.1. Performance Metrics
5.2. Implementation Details
5.3. Experimental Results
5.4. Ablation Study
5.5. Overall Evaluation
6. Limitations and Potential Impacts
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Single-Use Plastic Bags and Their Alternatives: Recommendations from Life Cycle Assessments; United Nations Environment Programme: New York, NY, USA, 2020.
- Lekavičius, V.; Bobinaitė, V.; Balsiūnaitė, R.; Kliaugaitė, D.; Rimkūnaitė, K.; Vasauskaitė, J. Socioeconomic Impacts of Sustainability Practices in the Production and Use of Carrier Bags. Sustainability 2023, 15, 12060. [Google Scholar] [CrossRef]
- Geetha, R.; Padmavathy, C. The Effect of Bring Your Own Bag on Pro-environmental Behaviour: Towards a Comprehensive Conceptual Framework. Vision J. Bus. Perspect. 2023. [Google Scholar] [CrossRef]
- Nielsen, T.D.; Holmberg, K.; Stripple, J. Need a bag? A review of public policies on plastic carrier bags—Where, how and to what effect? Waste Manag. 2019, 87, 428–440. [Google Scholar] [CrossRef] [PubMed]
- Kua, I. Singapore Supermarkets Start Charging for Plastic Bags. 2023. Available online: https://www.bloomberg.com/news/articles/2023-07-03/singapore-supermarkets-start-charging-for-plastic-bags (accessed on 1 January 2025).
- Hong, L. What Happens If You Take a Plastic Bag Without Paying from July 3? 2023. Available online: https://www.straitstimes.com/singapore/environment/pay-for-plastic-bags-at-supermarkets-from-july-3-or-you-might-be-committing-theft-legal-experts (accessed on 1 January 2025).
- Ahn, Y. Plastic Bag Charge: Some Customers Say They Will Pay or Switch to Reusables, but Scepticism Abounds over “Honour System”. 2023. Available online: https://www.todayonline.com/singapore/supermarket-plastic-bag-honour-system-sceptical-2197591 (accessed on 1 January 2025).
- Ting, K.W. Barcodes and Dispensers: How Supermarkets in Singapore Are Gearing Up for the Plastic Bag Charge. 2023. Available online: https://www.channelnewsasia.com/singapore/plastic-bag-charges-singapore-supermarkets-dispensers-barcodes-3573671 (accessed on 1 January 2025).
- Chua, N. Shop Theft Cases Jump 25 Percent in First Half of 2023 as Overall Physical Crime Rises. 2023. Available online: https://www.straitstimes.com/singapore/courts-crime/physical-crime-increases-in-first-half-of-2023-as-shop-theft-cases-jump-25 (accessed on 1 January 2025).
- Reid, S.; Coleman, S.; Vance, P.; Kerr, D.; O’Neill, S. Using Social Signals to Predict Shoplifting: A Transparent Approach to a Sensitive Activity Analysis Problem. Sensors 2021, 21, 6812. [Google Scholar] [CrossRef] [PubMed]
- Koh, W.T. Some Customers Take Plastic Bags Without Paying at Supermarkets Based on Honour System—CNA. 2023. Available online: https://www.channelnewsasia.com/singapore/some-customers-not-paying-plastic-bag-ntuc-fairprice-honour-system-3745016 (accessed on 1 January 2025).
- Dataset. Plastic Bags Dataset. 2022. Available online: https://universe.roboflow.com/dataset-t7hz7/plastic-bags-0qzjp (accessed on 1 August 2024).
- Marionette. Plastic Paper Garbage Bag Synthetic Images. 2022. Available online: https://www.kaggle.com/datasets/vencerlanz09/plastic-paper-garbage-bag-synthetic-images (accessed on 1 January 2025).
- Nazarbayev University. Plastic and Paper Bag Dataset. 2023. Available online: https://universe.roboflow.com/nazarbayev-university-dbpei/plastic-and-paper-bag (accessed on 1 January 2025).
- Molchanov, P.; Yang, X.; Gupta, S.; Kim, K.; Tyree, S.; Kautz, J. Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4207–4215. [Google Scholar]
- Zhang, Y.; Cao, C.; Cheng, J.; Lu, H. EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition. IEEE Trans. Multimed. 2018, 20, 1038–1050. [Google Scholar] [CrossRef]
- Avola, D.; Bernardi, M.; Cinque, L.; Foresti, G.L.; Massaroni, C. Exploiting Recurrent Neural Networks and Leap Motion Controller for the Recognition of Sign Language and Semaphoric Hand Gestures. IEEE Trans. Multimed. 2019, 21, 234–245. [Google Scholar] [CrossRef]
- Gao, C.; Li, Z.; Gao, H.; Chen, F. Iterative Interactive Modeling for Knotting Plastic Bags. In Proceedings of the International Conference on Robot Learning; Liu, K., Kulic, D., Ichnowski, J., Eds.; PMLR: Cambridge, MA, USA, 2023; Volume 205, pp. 571–582. [Google Scholar]
- Hu, K.; Shen, C.; Wang, T.; Xu, K.; Xia, Q.; Xia, M.; Cai, C. Overview of temporal action detection based on deep learning. Artif. Intell. Rev. 2024, 57, 26. [Google Scholar] [CrossRef]
- Selva, J.; Johansen, A.S.; Escalera, S.; Nasrollahi, K.; Moeslund, T.B.; Clapes, A. Video Transformers: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12922–12943. [Google Scholar] [CrossRef] [PubMed]
- Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lucic, M.; Schmid, C. ViViT: A Video Vision Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Los Alamitos, CA, USA, 27 October–2 November 2021; pp. 6816–6826. [Google Scholar]
- Bertasius, G.; Wang, H.; Torresani, L. Is Space-Time Attention All You Need for Video Understanding? In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021.
- Fan, H.; Xiong, B.; Mangalam, K.; Li, Y.; Yan, Z.; Malik, J.; Feichtenhofer, C. Multiscale Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6804–6815. [Google Scholar]
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3192–3201. [Google Scholar]
- Madan, N.; Moegelmose, A.; Modi, R.; Rawat, Y.S.; Moeslund, T.B. Foundation Models for Video Understanding: A Survey. arXiv 2024, arXiv:2405.03770. [Google Scholar]
- Liu, X.; Zhou, T.; Wang, C.; Wang, Y.; Wang, Y.; Cao, Q.; Du, W.; Yang, Y.; He, J.; Qiao, Y.; et al. Toward the unification of generative and discriminative visual foundation model: A survey. Vis. Comput. 2024. [Google Scholar] [CrossRef]
- Pareek, P.; Thakkar, A. A survey on video-based Human Action Recognition: Recent updates, datasets, challenges, and applications. Artif. Intell. Rev. 2021, 54, 2259–2322. [Google Scholar] [CrossRef]
- Dang, L.; Min, K.; Wang, H.; Piran, M.; Lee, C.; Moon, H. Sensor-based and vision-based human activity recognition: A comprehensive survey. Pattern Recognit. 2020, 108, 107561. [Google Scholar] [CrossRef]
- Bandini, A.; Zariffa, J. Analysis of the Hands in Egocentric Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6846–6866. [Google Scholar] [CrossRef] [PubMed]
- Hutchinson, M.S.; Gadepally, V.N. Video Action Understanding. IEEE Access 2021, 9, 134611–134637. [Google Scholar] [CrossRef]
- Satyamurthi, S.; Tian, J.; Chua, M.C.H. Action recognition using multi-directional projected depth motion maps. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 14767–14773. [Google Scholar] [CrossRef]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
- Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2556–2563. [Google Scholar]
- Xie, T.; Tian, J.; Ma, L. A vision-based hand hygiene monitoring approach using self-attention convolutional neural network. Biomed. Signal Process. Control 2022, 76, 103651. [Google Scholar] [CrossRef]
- Wu, Y.; Lin, Q.; Yang, M.; Liu, J.; Tian, J.; Kapil, D.; Vanderbloemen, L. A Computer Vision-Based Yoga Pose Grading Approach Using Contrastive Skeleton Feature Representations. Healthcare 2022, 10, 36. [Google Scholar] [CrossRef] [PubMed]
- Mediapipe. Hand Landmark Model. 2023. Available online: https://github.com/google/mediapipe/blob/master/docs/solutions/hands.md (accessed on 1 January 2025).
- Ultralytics. Ultralytics/yolov5: V7.0—YOLOv5 SOTA Realtime Instance Segmentation. 2022. Available online: https://zenodo.org/records/7347926 (accessed on 1 January 2025).
- Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 6105–6114. [Google Scholar]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
- Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6450–6459. [Google Scholar]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
- Ye, N.; Zeng, Z.; Zhou, J.; Zhu, L.; Duan, Y.; Wu, Y.; Wu, J.; Zeng, H.; Gu, Q.; Wang, X.; et al. OoD-Control: Generalizing Control in Unseen Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7421–7433. [Google Scholar] [CrossRef] [PubMed]
| Category | Ref. | Year | Bag Object | Hand Gesture | Hand Bag Interaction | Available to Public | Remarks |
|---|---|---|---|---|---|---|---|
| Plastic bag object detection | [12] | 2022 | √ | - | - | √ | 1 class (plastic bag or not) and 283 images |
| | [13] | 2022 | √ | - | - | √ | 3 classes (Plastic, Paper, and Garbage Bag) and 5000 instances |
| | [14] | 2023 | √ | - | - | √ | 6 classes (anorganic, paper bag, shopping bag, smoke, stretch, trashbag) and 400 images |
| Hand bag interaction | [15] | 2016 | - | √ | - | √ | 25 types of gestures and 1532 samples |
| | [16] | 2018 | - | √ | - | √ | 83 types of gestures and 24,161 samples |
| | [17] | 2019 | - | √ | - | √ | 30 types of gestures and 1200 samples |
| | [18] | 2023 | √ | √ | √ | - | 4 types of bags and 43,000 images |
| Our dataset | - | - | √ | √ | √ | √ | 2 classes and 989 clips |
| | Approach 1 (Early Fusion) | Approach 1 (Late Fusion) | Approach 2 | Approach 3 (EfficientNet) | Approach 3 (3D CNN) | Approach 3 ((2 + 1)D CNN) |
|---|---|---|---|---|---|---|
| Optimizer | Adam | Adam | Adam | Adam | Adam | Adam |
| Learning rate | 0.001 | 0.001 | 0.005–0.00005 | − | 0.0001 | 0.0001 |
| Batch size | 32 | 32 | 32 | 8 | 8 | 8 |
| Epochs | 30 | 30 | 200 | 20 | 20 | 20 |
| Approach | Option | Precision | Recall | F1 Score | Accuracy |
|---|---|---|---|---|---|
| Approach 1 | Early Fusion | 0.62 | 0.52 | 0.57 | 0.59 |
| Approach 1 | Late Fusion | 0.85 | 0.33 | 0.47 | 0.62 |
| Approach 2 | - | 0.89 | 0.76 | 0.82 | 0.88 |
| Approach 3 | EfficientNet | 0.82 | 0.69 | 0.76 | 0.86 |
| Approach 3 | 3D CNN | 0.78 | 0.88 | 0.82 | 0.88 |
| Approach 3 | (2 + 1)D ResNet | 0.92 | 0.91 | 0.91 | 0.94 |
| Approach | Option | Frame Resolution | Inference Frame Rate | # Parameters | FLOPs |
|---|---|---|---|---|---|
| Approach 1 | Early Fusion | | 15 | 8,968,594 | 17,955,137 |
| Approach 1 | Late Fusion | | 15 | 16,056,098 | 31,927,660 |
| Approach 2 | - | | 30 | 279,778 | 4,003,459,372 |
| Approach 3 | EfficientNet | | 2 | 4,052,133 | 8,011,611,411 |
| Approach 3 | 3D CNN | | 2 | 876,898 | 59,520,912,012 |
| Approach 3 | (2 + 1)D CNN | | 2 | 586,226 | 9,160,758,184 |
Hand Keypoints | Relational Feature | Precision | Recall | F1 Score | Accuracy |
---|---|---|---|---|---|
√ | - | 0.81 | 0.25 | 0.38 | 0.58 |
√ | √ | 0.62 | 0.52 | 0.57 | 0.59 |
Skeleton Thickness | Frame Combination | Precision | Recall | F1 Score | Accuracy | ||
---|---|---|---|---|---|---|---|
Small Value | Large Value | Linear Weighting | Quadratic Weighting | ||||
√ | √ | 0.79 | 0.75 | 0.77 | 0.82 | ||
√ | √ | 0.89 | 0.76 | 0.82 | 0.88 | ||
√ | √ | 0.81 | 0.73 | 0.77 | 0.82 | ||
√ | √ | 0.83 | 0.81 | 0.82 | 0.77 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).