Article

Effective Video Summarization Using Channel Attention-Assisted Encoder–Decoder Framework

1 Quantum Technologies and Advanced Computing Institute, King Abdulaziz City for Science and Technology, Riyadh 11442, Saudi Arabia
2 Department of Information Technology, College of Computer, Qassim University, Buraydah 51452, Saudi Arabia
3 Department of Computer Science, Islamia College Peshawar, Peshawar 25000, Pakistan
4 Department of Electrical Engineering, College of Engineering, Jouf University, Sakaka 72388, Saudi Arabia
5 Department of Electrical Engineering, College of Engineering, Qassim University, Buraydah 52571, Saudi Arabia
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(6), 680; https://doi.org/10.3390/sym16060680
Submission received: 10 March 2024 / Revised: 30 April 2024 / Accepted: 4 May 2024 / Published: 1 June 2024

Abstract

A significant number of cameras regularly generate massive amounts of data, demanding hardware, time, and labor resources to acquire, process, and monitor. Asymmetric frames within videos pose a challenge to automatic video summarization, making it difficult to capture the key content. Developments in computer vision have accelerated the seamless capture and analysis of high-resolution video content. Video summarization (VS) has garnered considerable interest due to its ability to provide concise summaries of lengthy videos. The current literature mainly relies on a reduced set of representative features implemented using shallow sequential networks. Therefore, this work utilizes an optimal feature-assisted visual intelligence framework for representative feature selection and summarization. Initially, an empirical analysis of several features is performed, and ultimately, we adopt a fine-tuned InceptionV3 backbone for feature extraction, deviating from conventional approaches. Secondly, our strategic encoder–decoder module captures complex relationships with five convolutional blocks and two convolution transpose blocks. Thirdly, we introduce a channel attention mechanism that illuminates the interrelations between channels and prioritizes essential patterns, refining the features used for final summary generation. Additionally, comprehensive experiments and ablation studies validate our framework’s exceptional performance, which consistently surpasses state-of-the-art networks on two benchmark datasets (TVSum and SumMe).

1. Introduction

In recent years, the widespread integration of surveillance cameras and smart mobile devices, such as smartphones and GoPro cameras, has triggered an exponential surge in video volume. This surge in video data, particularly from surveillance cameras, presents formidable challenges in processing and analyzing expansive datasets. Moreover, the lack of efficient management has led to the accumulation of massive amounts of redundant information within these videos. This makes it arduous and tiresome for viewers to obtain meaningful information conveniently and causes significant storage wastage. The inherent asymmetry of video frames, where some frames hold more significance than others, complicates the task of effectively summarizing a video. Consequently, the effective analysis and storage of these extensive video datasets have emerged as a recent research hotspot [1]. These challenges are particularly prominent in applications such as video browsing [2], retrieval [2], and anomaly detection [3], among others [4,5,6]. The manual extraction of informative segments from video data is laborious, underscoring the pressing need for automated techniques to eliminate redundancy and extract valuable information. Various approaches have emerged in response to these challenges, including video condensation, video skimming, and video summarization (VS). Video skimming [7] involves crafting a concise segment from the original video, providing a holistic representation. On the other hand, VS [8] is a technique tailored to extract salient frames or sequences from an input video. Its primary objective is to facilitate rapid video browsing by condensing the input video into a synopsis while preserving crucial information. VS has evolved into a pivotal research focus with the exponential growth in video data, stemming mainly from surveillance and smartphone recordings. The leading goal of VS is to generate a condensed version of a video by identifying and selecting the keyframes or shots that encapsulate the key theme of the input video. This streamlined representation dramatically enhances the efficiency of video browsing and retrieval.
VS can be broadly categorized into four approaches, as depicted in Figure 1. Grouping VS approaches into categories is essential for understanding the range of strategies used in the area. The rationale for our chosen categories arises from the need to incorporate the crucial elements that shape the summary-generation process. The “Number of Views” category addresses the differences between methods designed for single-view and multi-view VS, highlighting the significance of having diverse perspectives, as outlined in [9]. The “Data Dimension” division classifies the technical aspect into two main groups: 2D and 3D [10]. Within the 2D category, there is further specificity to include spatial, temporal, and spatiotemporal components, reflecting different ways of processing visual information [11]. The third broad umbrella is “Multi-Modal Data”, which acknowledges the significance of combining modalities such as audio and text with visual information. This emphasizes the wide range of multi-modal data and their use in generating summaries. The “Context-Based” category highlights the importance of context when determining the importance of video segments; it includes input-driven and saliency-based ways of comprehending context [12]. Together, these categories provide an in-depth structure for evaluating and understanding VS approaches.
These methods cover various aspects, including the number of perspectives, dimensions of data, multi-modal data, and context-oriented VS. When viewed from a given standpoint, this classification branches into single-view VS and multi-view VS (MVS). In a single view, the emphasis lies in generating a summary for a single view, where the resulting summary must uphold three fundamental properties: minimal repetition, representativeness, and diversity. Commonly used VS techniques predominantly target single-perspective videos, aiming to produce a summary that adequately represents the input video by focusing solely on the intra-perspective correlations. On the other hand, MVS entails producing summaries of many videos from multiple views. This category plays a significant role in many fields, such as sports, surveillance, educational training, and news, where considerable volumes of video data need to be reduced to short summaries to save time [9,13]. In the field of sports analysis, MVS can gather clips from numerous viewpoints or matches to provide thorough coverage. MVS applications are extended to surveillance systems by reducing long recordings from several cameras into concise summaries, enabling swift inspection. The same holds true in educational contexts, where MVS systems can help learners better understand multiple educational videos by extracting pertinent information. The domain of VS has witnessed a transformative evolution, particularly in the dichotomy between unsupervised and supervised methods. The challenge inherent in unsupervised VS lies in the precision of its outcomes. The results, obtained without the guidance of labeled data, often exhibit a degree of ambiguity. This ambiguity stems from the absence of labels that would otherwise serve as benchmarks to gauge and verify the utility of the VS results.
Supervised video summarization (SVS) strategies have become highly effective approaches. These techniques use human-assigned labels to identify the keyframes in videos, producing a more focused and accurate learning process. Close adherence to the patterns of annotated keyframes accelerates improvements in model performance. A critical turning point in the advancement of SVS came when datasets such as SumMe and TVSum became available to the research community. These datasets provide a structured assessment protocol, enabling an in-depth evaluation of summaries against the ground truth (GT). Considerable research has been conducted in the field of VS using state-of-the-art (SOTA) methods that explore various approaches for gathering essential information from videos. With benchmark datasets, researchers can analyze the efficacy of various VS methods and improve upon them. These established standards, serving as benchmarks for comparison, have catalyzed a significant shift in how VS strategies are evaluated and have played a crucial role in improving the accuracy and reliability of evaluations in this field.
In unsupervised VS methods, heuristic evaluation criteria play a pivotal role, encompassing metrics such as diversity [14]; memorability and entropy, as in [15]; representativeness [16]; and reconstruction error [17]. Techniques within this category include adversarial learning approaches [18], clustering-based methods [19], and reinforcement learning-based methods [20]. These methods excel at summarizing videos without supervision, leveraging various heuristic measures. However, their drawback lies in their inability to learn from manually created annotations, leading to suboptimal performance compared to supervised methods. Recurrent neural network (RNN)-based approaches take center stage within the supervised domain. Notable methods include user ranking with multi-stage spatiotemporal representation [21] and VS with long short-term memory [22]. While these approaches have demonstrated remarkable performance, they grapple with challenges associated with long video sequences, namely the exploding and vanishing gradient problems, hence the need for recent innovations such as convolutional sequence networks [23] and attention models. For instance, because sequence convolutional layers focus on capturing temporally local information, the fully convolutional sequence network faces limitations in modeling correlations among non-local, asymmetric video frames. Attention-based methods, including video summarization with attention [24] and attention-based encoder–decoder networks, show promise in capturing long-range information. However, they encounter challenges in exploiting the hierarchical structure inherent in videos. Furthermore, the complexity of attention-based methods [25,26] grows quadratically with the sequence length, presenting scalability concerns. These challenges underscore the ongoing efforts to refine unsupervised and supervised VS methods, each grappling with distinctive limitations in the quest for optimal performance across diverse video scenarios.
Previous research [24,27] has empirically demonstrated that videos exhibit a sequential structure with components of varying significance. VS methods, aiming to identify the most representative and crucial asymmetric video frames, often necessitate a comprehensive understanding of long-range video sequence dynamics. Conversely, the prediction of frame-level importance scores heavily depends on short-range data. Motivated by these insights, we propose a channel attention method for SVS that integrates multiscale feature fusion. The overview of our framework is shown in Figure 2, comprising four modules: the deep feature extraction module, the encoder–decoder module, the progressive feature fusion, and the attention module that enhances representations before the final predictions. Initially, we select an optimal deep feature representation through an empirical study. Subsequently, we employ a fully convolutional encoder–decoder network capable of processing all frames simultaneously; in contrast to LSTM models [22] that operate sequentially, our model incorporates temporal modules such as temporal convolution, temporal pooling, and temporal deconvolution, reminiscent of the modules in semantic segmentation models. Integrating these multiscale temporal features facilitates the extraction of multilevel features at the frame level. Subsequently, a channel attention block refines these features. Applying channel attention to video frame features in the context of VS entails selectively emphasizing specific channel-wise regions within each frame. This attention mechanism prioritizes pertinent visual information, potentially elevating feature extraction quality. The incorporation of channel attention aims to enhance the model’s capability to capture crucial details, thereby contributing to more effective video summarization by accentuating the most salient content in each frame.

Contributions

The key contributions of this study are briefly highlighted in the following points:
  • Our primary contribution lies in breaking away from previous approaches that depended on less representative features and shallow layered networks. We introduce the use of an InceptionV3 backbone for extracting deep features in the area of VS. This decision is informed by a thorough empirical analysis of the network at an intermediate stage, allowing us to extract optimal, domain-specific representations that delve deeper into the visual contents of a video. In stark contrast to prevailing trends favoring conventional GoogleNet-based backbone models, our approach proposes a distinctive method for extracting the most representative features, setting our work apart in the field.
  • We employ an encoder block consisting of five convolutional sub-blocks, followed by a decoder incorporating two convolution transpose sub-blocks. This strategic framework significantly enhances the model’s proficiency in capturing and distilling complex temporal relationships among video frames. The synergy between the InceptionV3 backbone and our tailored encoder–decoder structure sets our approach apart, enriching the features before they are passed to the attention module.
  • As a pivotal advancement in our framework, we integrate a channel attention module to further refine the extracted features. This mechanism illuminates the interrelations between channels within the feature map. By deriving 1-D weights for each channel, the channel attention mechanism intensifies the focus on pivotal information within asymmetric video frames. This strategic integration significantly enhances the model’s proficiency in capturing and prioritizing essential features, contributing to a more nuanced understanding of complex relationships.
  • Our framework has shown exceptional performance in comprehensive, rigorous experiments and an extensive ablation study, exceeding the current state-of-the-art networks in VS. Through thorough experimentation and methodical examination, our proposed framework consistently showed superior performance compared to established benchmarks, confirming its effectiveness on two datasets.
The following sections of this article will progress, with each section unveiling a distinct aspect. Section 2 thoroughly examines the extensive body of literature connected to the topic. It explores both supervised and unsupervised approaches and provides a detailed analysis of attention-based strategies. Proceeding to Section 3, the focus transitions to the core of our contribution—the suggested framework. Here, we carefully establish the foundation, clearly defining the issue and elaborating on the theoretical principles that support our solution. In Section 4, we provide the experimental findings, which offer empirical evidence to substantiate our arguments and exhibit the efficiency of our framework. In the last part of Section 5, we bring together the many aspects of our investigation, providing valuable insights that connect the current situation to what lies ahead. As we conclude this introduction, we also offer insights into the unexplored possibilities of future undertakings in this sector.

2. Related Literature

Video summarization has emerged as a focal point of significant interest, attributed to its substantial application value across diverse contexts. Acknowledging the constraints of brevity, our focus centers on closely aligned approaches, with a comprehensive review available in [12,22,28]. Furthermore, our exploration extends to a detailed examination of VS’s supervised, unsupervised, and attention-based mechanisms. The references provided serve as valuable resources for an exhaustive understanding of the broader landscape, particularly regarding systematic reviews.

2.1. Supervised and Unsupervised Video Summarization

In the initial phase, VS addressed concerns related to low-level visual coherence, the grouping of frames, and the incorporation of submodular functions [29,30]. Hindered by manually crafted features and shallow layered networks, the performance of these early VS methods falls short of satisfaction. Capitalizing on the immense success of deep learning in image classification [31,32], recent years have witnessed a proliferation of deep learning-based VS methods. Most of these methods center on refining models to capture the crucial temporal relationships integral to VS. The work in [22] proposed an LSTM-based network, further enriched by the determinantal point process. Since this groundbreaking work, LSTM has emerged as a cornerstone for VS, with increasingly advanced techniques being developed [25,33,34]. For instance, the work in [33] introduced a novel loss to gauge the fidelity with which predicted summaries preserve the original semantic information. The study in [35] formulates VS as a temporal interest detection challenge addressed by the proposed DSNet. In [14], a reinforcement learning model is employed, with rewards considering diversity and representativeness in generated summaries. Departing from the egalitarian treatment of all inputs, ref. [25] presents an attentive encoder–decoder network that assigns distinct weights to the discriminative features.
The study in [19] offered VSUMM, a technique to generate static video summaries via visual feature extraction and k-means clustering. The method also involves manually producing user summaries for assessment purposes, which are of higher quality than those of other existing approaches. Another study [36] introduced a model called seqDPP, designed for VS tasks. This model leverages the core sequential patterns in the data to improve efficiency, beating the standard DPP model. To boost the performance of VS, ref. [37] proposed a regularization loss term and a CSNet to tackle the problem of ineffective feature learning. These enhancements are particularly beneficial for long videos. The article [38] provides a method that combines shot segmentation and summarization using an HSA-RNN. In the work [39], a new approach is presented for automated keyframe-based VS that leverages human-created summaries for choosing a subset of keyframes. The article [40] proposed a framework for summarizing edited and raw videos. This framework incorporates models for relevance, representativeness, variety, and storyness, along with thorough scoring functions and a mixed training set to enhance its performance. A study conducted in [41] introduced a probabilistic model that uses a reinforcement learning algorithm to modify the duration of video segments dynamically. This model aims to enhance local variety and surpass maximum likelihood estimation methods.
In response to the scarcity of data, several GAN-based VS approaches have been suggested in reference works [18,42,43]. The study referenced in [42] used the discriminator to differentiate between the input video and the predicted summary, while the study in [18] aims to determine the origin of the summarized video. Drawing inspiration from the efficacy of graph structures in video synopsis [44], applying graph structures enhances temporal modeling capabilities [45,46]. The sequential nature of LSTM-based approaches poses challenges for parallelization, resulting in substantial training time, particularly on large-scale datasets. Addressing this concern, [24] discards LSTM in favor of a self-attention-based framework capable of achieving the sequence transformation task in a single forward–backward pass throughout the training process. Conversely, its single-layer design falls short of fully modeling complex frame relationships, thereby limiting performance. Moreover, more intricate attention-augmented methods are proposed in [47,48]. In [49], a block sparsity-based approach is introduced to investigate correlations among frames. Compared to these approaches, our approach utilizes attention to capture channel relationships in the frames, yielding superior performance.

2.2. Attention-Based Approaches

The attention process in the human visual and cognitive system is responsible for prioritizing crucial information for the task at hand, particularly when faced with a surplus of information [49]. Drawing inspiration from this mechanism, numerous attention models have been proposed and have demonstrated enhanced performance in diverse domains, including VS. In the early stages, attention was computed from low-level features as a cue for VS, but it failed to consider the complex dependencies among frames [50]. Recently, novel attention models have emerged, and we delve into two pivotal attention modules with significant implications for video summarization. The first of these modules, the self-attention module, was initially proposed for sequence modeling tasks. It computes the response at a given position by attending to every position in the sequence [51]. Unlike existing sequential models such as LSTM, this approach allows each position to attend to all other positions regardless of distance. The study [52] proposed a method for SVS that utilizes a multiscale hierarchical attention approach. This approach combines intra-block and inter-block attention mechanisms and extends to a two-stream system for integrating visual and motion information. Another study [53] presented two models based on the vsLSTM and dppLSTM deep networks. These models use attention techniques to enhance performance and reduce computational complexity compared to the original approaches.
Self-attention not only facilitates increased computational parallelization but also reduces model complexity [51]. Matrix multiplication streamlines its calculations, allowing the entire sequence to be processed efficiently in a single pass. Consequently, it has succeeded in numerous tasks, such as object detection, energy analytics [54], and video analytics, including anomaly detection [55,56]. For example, the authors in [57] framed object identification as a problem of directly predicting sets using the self-attention paradigm. Although self-attention is mainly used for sequence tasks, CBAM operates on multidimensional feature maps [58]. This module is specifically created to determine what and where to emphasize or suppress, successfully improving intermediate features using spatial and channel-wise attention. Comparable notions are also used in other works [59,60,61]. Thanks to their simplified structure, these attention blocks may be smoothly included in any CNN architecture at little additional cost, allowing for end-to-end training. As a result, they have been extensively utilized to improve the results of numerous tasks [36,62]. The existing body of attention-based research on VS struggles to properly handle the complex relationships between frames when computing attention. Moreover, these methods' performances are degraded by the lack of optimal intermediate features. Although some studies employ self-attention and hierarchical attention mechanisms to capture dependencies within and between frames, further investigation into propagating optimal features at the intermediate level is still required.
Furthermore, more advanced attention mechanisms, paired with optimal features and temporal learning models, are capable of handling extended temporal dependencies and variations in video content. In contrast to existing works, our work involves empirically extracting an optimal deep feature fusion and introducing a fully convolutional encoder–decoder network for parallel processing of all frames. We integrate temporal modules for multilevel feature extraction at the frame level. To refine these features, we incorporate a channel attention block, selectively emphasizing specific channel regions within each frame. This integration of channel attention enhances feature extraction quality, contributing to more effective video summarization by accentuating the most salient content in each frame.

3. The Proposed Methodology

In this section, we embark on a comprehensive exploration, commencing with a detailed exposition of our problem formulation (Section 3.1). Subsequently, we delve into the intricacies of our approach to deep feature extraction, outlined in Section 3.2. Following this, we unveil the architecture and functionality of the fully convolutional encoder–decoder network in Section 3.3, shedding light on its pivotal role in our methodology. Section 3.4 then takes a closer look at the integration of the channel attention module, a crucial component contributing to the refinement of features before the ultimate prediction. Finally, the theoretical concepts underpinning progressive feature fusion take center stage in Section 3.5, offering a nuanced understanding of our approach.

3.1. Problem Formulation

Prior research has explored two distinct output formats within video summarization: (1) binary labels and (2) importance scores at the frame level. The outputs in binary labels are either keyframes [19,22,63,64] or keyshots [65,66,67,68]. Keyframes constitute a selection of non-consecutive frames chosen for summarization, while keyshots correspond to time intervals within a video, with each interval encompassing a continuous sequence of frames. Frame-level importance scores [65,68] signify the likelihood of selecting a frame for summarization. Existing datasets provide ground-truth annotations in at least one of these two formats. Although frame-level scores offer more detailed information, obtaining annotations for binary labels is more straightforward. This study focuses on acquiring VS training exclusively from binary label-based annotations, specifically those based on keyframes. Imagine a video with T frames, where each frame has undergone preprocessing, such as by a pre-trained convolutional neural network (CNN), resulting in a representation as a feature vector. The frames within the video are designated as outlined in Equation (1):
V_F = \{A_1, A_2, A_3, \ldots, A_T\}
where $A_i$ is the feature descriptor of the $i$-th frame in the video. We aim to assign a label of zero or one (a binary label) to each of the T frames. The corresponding binary importance labels can be formulated as shown in Equation (2):
V_L = \{S_{a_1}, S_{a_2}, S_{a_3}, \ldots, S_{a_T}\}
where $V_L$ represents the ground-truth labels corresponding to the frames, $S_{a_1}, \ldots, S_{a_T}$ denote the labels of the individual frames, and T is the total number of labels. It is assumed that there is access to a training dataset of videos in which each frame is accompanied by a GT binary label signifying whether it should be incorporated into the summary video. Following model training, our primary objective is to input unseen videos into the model, and the network is designed to produce an output indicating whether the nth frame should be included in the summary.
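As a concrete illustration of this formulation, the following minimal sketch (in Python/NumPy, with illustrative sizes and keyframe indices that are not taken from the paper) shows how a video is represented as a frame-level feature matrix $V_F$ together with its binary keyframe labels $V_L$:

```python
import numpy as np

# Minimal sketch of the problem setup; T, d, and the keyframe indices are illustrative.
T, d = 320, 2048                      # number of sampled frames and feature dimension
V_F = np.random.randn(T, d)           # frame descriptors A_1 ... A_T (e.g., CNN features)
V_L = np.zeros(T, dtype=int)          # ground-truth binary labels S_a1 ... S_aT
V_L[[12, 40, 97, 210]] = 1            # frames annotated as keyframes

# A trained summarizer maps V_F to per-frame labels/scores for unseen videos.
```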

3.2. Deep Feature Extraction

The prowess of CNNs in autonomously discerning crucial features from raw frames positions them as versatile tools across various computer vision applications. However, selecting a domain-specific CNN architecture is challenging, demanding a delicate balance between achieving accurate predictions and managing computational complexity in real-world applications. Researchers commonly turn to pre-trained models as foundational feature extractors in the SVS domain.
As noted in the comparative analysis, previous approaches often relied on 1024-dimensional GoogleNet features rather than investigating other CNN-based models. Pre-trained networks offer robust and diverse feature extraction pipelines, making them valuable initializations for various vision-based classification tasks. In alignment with the successes of existing feature extraction methods across diverse computer vision domains [69], we evaluate a range of backbone feature extractors. These include architectures such as Xception, EfficientNetB0, GoogleNet, MobileNet, ResNet50, NASNetMobile, and InceptionV3. The objective is to ascertain the most effective mechanism for feature selection in our specific domain, with a primary focus on recognizing the importance of frames in highly challenging scenarios. This strategic approach aims to enhance the network’s adaptability and efficacy in addressing the complexities of our target problem. The practical validation presented in Section 4 underscores the efficacy of leveraging InceptionV3 features, a choice grounded in both empirical results and theoretical considerations. InceptionV3, an evolution of the Inception architecture, demonstrates its superiority through its larger number of Inception modules [55] and refined adaptations that contribute to superior outcomes compared to its predecessors. One of the standout features of Inception modules is their ability to perform multiscale processing, a crucial factor in achieving enhanced performance across diverse computer vision tasks. This design aligns with our overarching principles, emphasizing efficient feature extraction and dimensionality reduction to improve performance in the specific context of our problem domain.
This selection of InceptionV3 features is empirically grounded and theoretically sound, focusing on harnessing the Inception architecture’s strengths for effective and efficient video summarization. Within the InceptionV3 framework, three fundamental Inception modules play pivotal roles. Each module integrates parallel convolutional and pooling layers, creating a versatile structure depicted in Figure 3. Notably, the inclusion of smaller convolutional layers with dimensions such as 1 × 1, 1 × 3, 3 × 1, or 3 × 3 serves to minimize trainable parameters, optimizing computational efficiency. The input size for InceptionV3 is set at 224 × 224 RGB. The initial image processing involves five convolutional layers, each applying multiple 3 × 3 kernels. A strategic departure in our proposed network architecture involves excluding the final dense layers from InceptionV3. Instead, we extract an 8 × 8 feature map with 2048 channels, denoted as μ:
\mu = \alpha(\alpha(\alpha(\beta(\pi))))
In our network architecture, described by Equation (3), each α signifies an Inception module (of which three are employed), β denotes the initial convolutional operations, and π is the input frame. The output, represented by the feature vector μ, encapsulates information encompassing the object’s structural elements, edge characteristics, color attributes, shapes, and more. However, these features are inherently coarse, lacking the granularity required for precise frame-level importance prediction. In complex scenes, direct utilization of these coarse features often leads to inaccurate predictions and a notable absence of essential localization information.
The μ features undergo further processing within a convolution-based encoder–decoder network to address these limitations. This network serves the crucial role of refining the features by leveraging channel attention mechanisms. Channel attention is critical in this refinement process, discerningly extracting the most pertinent channel-wise details from the μ features.
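The snippet below is a minimal sketch of this kind of intermediate feature extraction with a pre-trained InceptionV3 in PyTorch/torchvision; it is not the authors' exact pipeline. It assumes torchvision's pretrained weights, 299 × 299 inputs (the resolution at which the last Inception block yields an 8 × 8 × 2048 map), and a feature tap at the Mixed_7c block:

```python
import torch
from torchvision import models

# Sketch: tap the last Inception block of a pre-trained InceptionV3 to obtain
# a spatial 2048-channel feature map instead of classification logits.
backbone = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
backbone.eval()

features = {}
def save_features(module, inputs, output):
    features["mixed_7c"] = output            # deep feature map mu

backbone.Mixed_7c.register_forward_hook(save_features)

with torch.no_grad():
    frames = torch.randn(4, 3, 299, 299)     # a small batch of video frames
    backbone(frames)                         # classification output is ignored

mu = features["mixed_7c"]
print(mu.shape)                              # torch.Size([4, 2048, 8, 8])
```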

3.3. Encoder–Decoder Module

The encoder plays a crucial role in processing frames, extracting high-level semantic features, and capturing long-term structural relationship information among frames. Simultaneously, the decoder serves the purpose of generating a sequence of 0/1 labels. Our models predominantly incorporate temporal modules, encompassing temporal convolution, pooling, and deconvolution. These modules resemble those commonly employed in semantic segmentation networks. The architecture of existing semantic segmentation models, specifically FCN-16, is seamlessly adapted to the design of our encoder–decoder module for video summarization. In this module, the RGB frame has the dimensions η × λ × 3, where η and λ denote the height and width of the image, respectively. The resulting output or prediction takes the form of η × λ × Ω, where Ω represents the number of classes in the channel dimension. Furthermore, the input is characterized by a dimension of 1 × n × d. Here, n represents the number of frames in a video, and d denotes the dimension of the feature vector associated with each frame.
As illustrated in Figure 4, the one-dimensional convolutional blocks contain a 1D convolution followed by batch normalization and ReLU activation. At the end of each block, a max-pooling operation captures the most prominent features or patterns from the feature maps. We convert all the spatial convolutions in FCN to temporal convolutions. Similarly, spatial max-pooling and deconvolution layers are converted into their corresponding temporal counterparts. We organized our network similarly to FCN but reduced the first five convolutional layers to three (conv1 to conv3), each consisting of multiple temporal convolution layers, with batch normalization and ReLU activation following each temporal convolution. We introduce temporal max-pooling adjacent to every convolutional layer. conv4 and conv5 are composed of a temporal convolution, succeeded by ReLU activation and dropout. We also have conv6, consisting of a 1 × 1 convolution (to produce the desired output channels), batch normalization, and a deconvolution operation along the time axis. In this module, we use the backbone features without classification layers, where the feature vectors are 8 × 8 with 2048 channels (Q), and integrate the CA mechanism to further strengthen model performance, as discussed in Section 3.4.
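A minimal PyTorch sketch of such a temporal encoder–decoder is given below. The channel widths, kernel sizes, strides, and the final interpolation back to the input length are illustrative assumptions rather than the exact configuration used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalEncoderDecoder(nn.Module):
    """Sketch of a fully convolutional temporal encoder-decoder for frame scoring."""
    def __init__(self, in_dim=2048, hidden=512):
        super().__init__()
        def conv_block(cin, cout):  # conv1-conv3: temporal conv + BN + ReLU + temporal max-pool
            return nn.Sequential(
                nn.Conv1d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm1d(cout), nn.ReLU(inplace=True),
                nn.MaxPool1d(kernel_size=2, stride=2))
        self.conv1 = conv_block(in_dim, hidden)
        self.conv2 = conv_block(hidden, hidden)
        self.conv3 = conv_block(hidden, hidden)
        # conv4-conv5: temporal conv + ReLU + dropout
        self.conv4 = nn.Sequential(nn.Conv1d(hidden, hidden, 3, padding=1),
                                   nn.ReLU(inplace=True), nn.Dropout(0.5))
        self.conv5 = nn.Sequential(nn.Conv1d(hidden, hidden, 3, padding=1),
                                   nn.ReLU(inplace=True), nn.Dropout(0.5))
        # two transpose-convolution blocks upsample along the time axis
        self.deconv1 = nn.Sequential(nn.ConvTranspose1d(hidden, hidden, 4, stride=2, padding=1),
                                     nn.BatchNorm1d(hidden), nn.ReLU(inplace=True))
        self.deconv2 = nn.Sequential(nn.ConvTranspose1d(hidden, hidden, 4, stride=2, padding=1),
                                     nn.BatchNorm1d(hidden), nn.ReLU(inplace=True))
        self.head = nn.Conv1d(hidden, 1, kernel_size=1)   # per-frame importance score

    def forward(self, x):                      # x: (B, d, T) frame features along time
        e1 = self.conv1(x)                     # length T/2
        e2 = self.conv2(e1)                    # length T/4
        e3 = self.conv3(e2)                    # length T/8
        e5 = self.conv5(self.conv4(e3))        # length T/8
        d1 = self.deconv1(e5)                  # length T/4
        d2 = self.deconv2(d1 + e2)             # skip-style fusion with an encoder map, length T/2
        out = torch.sigmoid(self.head(d2))     # (B, 1, T/2)
        return F.interpolate(out, size=x.shape[-1])   # restore full temporal length T

# Usage: scores = TemporalEncoderDecoder()(torch.randn(2, 2048, 320))  # -> (2, 1, 320)
```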

3.4. Channel Attention

To proficiently identify the task-relevant features essential for task accomplishment, we conducted experiments utilizing the CA module between two foundational layers to extract relevant features. As delineated in this study, this module encompasses a global average pooling layer, a max-pooling layer, three fully connected layers, and a multiplication operation. Despite the inclusion of diverse components, the primary function of channel attention is to elucidate the interrelation between the channels within the feature map, deriving a 1-D weight that is subsequently multiplied with the corresponding channel. This process intensifies the focus on pivotal information within frames, optimizing performance for the targeted task. The acquisition of optimal weights is facilitated through two parallel pooling operations applied to the input features (Q). These operations encompass average and max-pooling, generating two descriptors for each channel.
The concatenation of these descriptors is then input into a shared multilayer perceptron featuring three fully connected layers, thereby fostering the creation of more potent feature vectors. The final step involves obtaining the CA by employing the SoftMax (ρ) function, as elucidated in Figure 5. The formula for this process is presented below:
\hat{Q} = \rho\big(M(\mathrm{AvgP}(Q)) + M(\mathrm{MaxP}(Q))\big)
where AvgP denotes the average pooling operation, MaxP the max-pooling operation, M the shared multilayer perceptron, and $\hat{Q}$ signifies the features refined by channel attention.
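The following is a minimal PyTorch sketch of a channel attention block of this kind, assuming a reduction ratio of 16 inside the shared three-layer MLP and 1D (temporal) feature maps; these details are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch: per-channel weights from pooled descriptors, a shared MLP, and Softmax."""
    def __init__(self, channels=2048, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool1d(1)
        self.max_pool = nn.AdaptiveMaxPool1d(1)
        # shared three-layer MLP producing one weight per channel
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.softmax = nn.Softmax(dim=1)       # rho: normalize the 1-D channel weights

    def forward(self, q):                       # q: (B, C, T) features
        avg = self.mlp(self.avg_pool(q).squeeze(-1))   # (B, C) descriptor from AvgP
        mx = self.mlp(self.max_pool(q).squeeze(-1))    # (B, C) descriptor from MaxP
        w = self.softmax(avg + mx).unsqueeze(-1)       # (B, C, 1) channel weights
        return q * w                                    # channel-refined features

# Usage: refined = ChannelAttention(channels=512)(torch.randn(2, 512, 160))
```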

3.5. Features Fusion

After processing the output from pool 2, we apply a 1 × 1 convolution and batch normalization. The ensuing step involves merging this processed output with the deconv1 feature map via element-wise addition. The initial-level features from Con1 are less mature but contain specific information; thus, we merge them by connecting Con1 to Decon2. This merging operation draws inspiration from the concept of a skip connection [70], a technique well established in semantic segmentation. Skip connections play a crucial role by combining feature maps from coarse layers with those from finer layers, fostering the generation of more nuanced visual features. The second connection is taken from the output of Decon2 and fused with the CA-refined features to strengthen the features and maintain the optimal flow. The progressive feature fusions are highlighted visually in Figure 2. In our domain, we empirically validated that integrating such skip connections is beneficial, particularly for recovering essential temporal information crucial for the summarization process. This strategic fusion allows the model to harness high-level and low-level features, contributing to a richer video content representation.
Moreover, as the journey through the encoder-decoder module progresses, the ultimate features generated are harmoniously merged with features that have undergone refinement through channel attention. This meticulous integration process consolidates diverse information captured throughout the video summarization pipeline. It enhances the model’s capability to discern and incorporate a spectrum of features, elevating its efficacy in the intricate task of VS.
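The two fusion steps described above can be sketched as follows; tensor sizes, channel widths, and the placeholder for the channel-attention-refined features are illustrative assumptions:

```python
import torch
import torch.nn as nn

B, C, T = 1, 512, 160
proj = nn.Sequential(nn.Conv1d(C, C, kernel_size=1), nn.BatchNorm1d(C))  # 1x1 conv + BN

pool2_out   = torch.randn(B, C, T // 2)      # encoder feature map after pool 2
deconv1_out = torch.randn(B, C, T // 2)      # first transpose-convolution output
fused_mid   = proj(pool2_out) + deconv1_out  # element-wise skip fusion

deconv2_out = torch.randn(B, C, T)           # second transpose-convolution output
ca_refined  = torch.randn(B, C, T)           # stand-in for channel-attention-refined features
fused_final = deconv2_out + ca_refined       # final fusion before prediction
```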

4. Experimental Results

This section presents a comprehensive analysis of our experimental results, beginning with a description of the datasets utilized (Section 4.1) and the process of creating supervised annotations (Section 4.2). We then detail the experimental setup, including hardware, software, and configurations (Section 4.3). Subsequently, we conduct an ablation study (Section 4.4), investigating the impact of deep feature extractors, analyzing the encoder-decoder architecture, assessing the effectiveness of attention mechanisms, and exploring the implications of progressive feature fusion. Qualitative results are discussed in Section 4.5, offering visual insights into our model’s performance. Finally, we provide a comprehensive comparison with state-of-the-art approaches in Section 4.6, highlighting the strengths and contributions of our proposed methodology.

4.1. Datasets

To evaluate the effectiveness of our proposed framework, we utilized two benchmark datasets, TVSum and SumMe, as shown in Table 1. These datasets cover a broad spectrum of video content, including documentaries, sports, and everyday activities, ensuring a comprehensive assessment. Figure 6 displays samples from these two datasets. We deliberately chose the TVSum and SumMe datasets primarily because they provide a wide range of activities in various environments. TVSum is particularly significant for its selection of event categories, covering ten kinds of events across many sectors. Although these datasets have been available since before 2016, they still present significant challenges, as seen in the consistently modest performance levels reported even after nearly a decade of research efforts; this persistently limited accuracy indicates their enduring complexity. In addition, the recently published work [8] used the TVSum and SumMe datasets, demonstrating the continuing significance and applicability of these datasets in ongoing video summarization research. Using these well-recognized standards ensures a thorough assessment and comparison of our suggested methods with the most advanced technologies. Hence, our choice of the TVSum and SumMe datasets is based on their diverse content, persistent challenges, and alignment with our research goals, enabling rigorous testing and validation of our methods.

4.1.1. TVSum

The TVSum dataset is a commonly used benchmark in the field of SVS. Its purpose is to aid in assessing and advancing algorithms that aim to generate concise and useful video summaries automatically. The dataset encompasses 50 videos from multiple genres, including documentaries, sports, and daily activities, providing an in-depth examination of the challenges of summarizing visual information across several fields. A total of twenty individuals labeled the dataset. The TVSum database provides multiple user-generated summaries for each video, which constitute essential human-annotated references for evaluating the effectiveness of summarization methods. The annotations include a diverse array of views on the crucial events and moments presented in the video frames, consequently encouraging an in-depth assessment of summarization accuracy.

4.1.2. SumMe

The SumMe dataset is an important resource in the domain of SVS, particularly developed to facilitate the development of techniques for generating brief and pertinent video summaries. SumMe features a broad range of videos covering various topics such as daily scenarios, sports, and other hobbies. SumMe distinguishes itself by emphasizing user-generated summaries and providing various annotations for each video to capture a wide range of opinions. Including these annotations enhances the depth and breadth of the assessment of summarization algorithms, since they capture the many approaches humans use to extract essential information from video. Consequently, SumMe is a significant reference point for scholars and practitioners, facilitating the advancement and enhancement of video summarization within the discipline.

4.2. Supervised Annotations

VS benefits significantly from the supervised method since the model is able to learn from the ratings humans have assigned to individual video frames. Using supervised methods, the model improves its accuracy by learning the patterns seen in human-labeled frames; this helps it to better conform to the requirements for including keyframes in the summary video. Unsupervised VS results, on the other hand, are less reliable [37,67].
Instead of depending on human-labeled significance ratings, unsupervised methods analyze frames using variables such as entropy and complexity to decipher underlying patterns. It is difficult to assess the accuracy of summaries produced by unsupervised algorithms since no GT is available for comparison. When building SVS datasets, anywhere from 15 to 20 human annotators use representativeness, variety, and context criteria to assign importance values to video frames [68]. The likelihood of a particular frame being included in the video’s summary is determined by its assigned score.

4.3. Experimental Setup

The video dataset was subjected to a standardized downsampling technique, reducing the frame rate to two frames per second, following the procedure described in [35]. The extracted features were then input into the model, with separate GPU resources assigned to process each frame. It is important to emphasize that our suggested network can adapt to diverse feature representations. The experiments were carried out on a server equipped with an NVIDIA GeForce RTX 2080 Ti graphics processing unit with 11 GB of memory. PyTorch version 1.10.0 was used for the implementation of the model, and the parameters were optimized using the ADAM optimizer. During the training process, a batch size of four yielded ideal results. Additionally, a momentum value of 0.85 and a fixed learning rate of 5 × 10⁻⁴ were found to be effective.
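A minimal training-loop sketch with this configuration is shown below. Treating the reported momentum of 0.85 as Adam's first moment coefficient (beta1) is an assumption, and the linear layer, tensor shapes, and binary cross-entropy loss are placeholders for the full network and data pipeline:

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(2048, 1).to(device)      # stand-in for the summarization network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.85, 0.999))
criterion = nn.BCEWithLogitsLoss()         # per-frame binary keyframe labels

# one illustrative step on a dummy batch of 4 videos x 320 frames x 2048-d features
features = torch.randn(4, 320, 2048, device=device)
labels = torch.randint(0, 2, (4, 320), device=device).float()

scores = model(features).squeeze(-1)       # frame-level importance logits
loss = criterion(scores, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```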

4.4. Evaluation Metrics, Training, and Testing

Based on the methodology outlined in [23], our study used keyshot-based assessment metrics. The F1 score is used in this study to evaluate the efficacy of the suggested network. This measure has been used in prior research to assess the similarity between the summaries produced by the model, determined by their importance scores, and the summaries generated by humans, based on the temporal overlap between them. A model performs better as its F1 score approaches 100%. In this context, the created summary is represented as $l_z$, whereas the ground-truth summary for a video is designated as $V_S$. The precision (Pr) and recall (Re) may be computed using Equations (5) and (6), as shown below:
Pr = \frac{|l_z \cap V_S|}{|l_z|}
Re = \frac{|l_z \cap V_S|}{|V_S|}
Finally, the F1 score is computed using Equation (7) and serves as our assessment measure in this paper, consistent with other studies. Its purpose is to measure how well the automatic and GT summaries match one another. The Equation for this is as follows:
F1 = \frac{2 \times Pr \times Re}{Pr + Re} \times 100
The metric computation for videos with several ground-truth summaries was performed using the conventional approach described in the state-of-the-art research conducted by [8]. A random sampling technique was used to select 20% of the datasets as test samples, while the remaining 80% were allocated for training and validation purposes. The data were split randomly, and the trials were repeated multiple times. The performance was evaluated by calculating the average F1 score.
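As a minimal illustration of Equations (5)–(7), the sketch below computes the keyshot-overlap F1 score when summaries are represented as sets of selected frame indices (the sets themselves are illustrative):

```python
def f1_score(pred_frames, gt_frames):
    """Keyshot-overlap F1 (Equations (5)-(7)) for summaries given as frame-index sets."""
    overlap = len(pred_frames & gt_frames)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_frames)   # Pr = |l_z intersect V_S| / |l_z|
    recall = overlap / len(gt_frames)        # Re = |l_z intersect V_S| / |V_S|
    return 2 * precision * recall / (precision + recall) * 100

machine_summary = {3, 4, 5, 10, 11, 40}        # l_z: frames picked by the model
user_summary = {4, 5, 6, 10, 11, 12, 40}       # V_S: ground-truth summary
print(round(f1_score(machine_summary, user_summary), 1))   # 76.9
```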

4.5. Ablation Study

Within our ablation study, we thoroughly analyze and deconstruct the essential elements of our suggested framework to ascertain their respective impacts on the overall performance. We examine the impact of deep feature extractors (Section 4.5.1) and analyze how various feature extraction techniques affect the model’s capacity to gather pertinent information effectively. In Section 4.5.2, our encoder–decoder analysis comprehensively examines the complexities inherent in the different layers of the proposed module. This analysis aims to elucidate the encoder–decoder framework’s significance in shaping the summarization output. Section 4.5.3 evaluates the effectiveness of the attention mechanism in directing the model’s attention toward essential areas inside the video frames, followed by an investigation of the progressive feature fusion approach and an assessment of its impact on the overall summarization quality. This examination focuses on the fusion of features at various stages and its potential to improve the summarization process. The primary objective of this extensive ablation study is to provide a detailed analysis and evaluation of the individual components, enhancing our comprehension of the underlying mechanisms of the proposed framework.

4.5.1. Impact of Deep Features

The feature extraction process plays a crucial role in determining the performance of a model. Selecting an ideal feature descriptor is of utmost importance in improving the model’s effectiveness.
A comprehensive examination of several commonly used models for VS was undertaken to determine the optimal feature extractor. The characteristics extracted from these models were then subjected to an assessment procedure to evaluate their performance. The tests conducted in our study included the analysis of several backbone descriptors on both the TVSum and SumMe datasets. The results of these trials are shown in Table 2. Implementing our framework with the InceptionV3 backbone produced the most encouraging outcomes. The exceptional performance of InceptionV3 may be due to its ability to use a wide range of convolutional filters effectively, which allows the model to capture complex spatial and temporal correlations across various areas within frames. Notably, InceptionV3 exhibited outstanding performance, achieving a score of 59.5 on the TVSum dataset and an impressive 49.4 on the challenging SumMe dataset. These results underscore the efficacy and superiority of the InceptionV3 backbone in our domain. The features chosen for their superior performance were subjected to further improvement using a refinement module. Likewise, NASNetMobile scored 58.0 on the TVSum dataset and 48.7 on the more demanding SumMe dataset, securing the runner-up position in terms of performance. The following sections provide the findings derived from evaluating several backbone models, highlighting the effectiveness of the selected InceptionV3 backbone for extracting features in the proposed VS framework.

4.5.2. Empirical Analysis of Encoder Decoder Block

In the ablation analysis of our module, a thorough examination was conducted to assess the effects of gradually including temporal blocks in the network design. The model’s capacity was gradually expanded by including additional blocks, starting with integrating the first block. This initial block included essential components such as 1D convolution, batch normalization, and ReLU activation. A progressive improvement in the network’s capacity to record complex temporal aspects and connections among video frames was found with the addition of each subsequent block, from the second to the fifth. As indicated in Table 3, the network exhibited strong performance with the inclusion of five blocks. Specifically, the F1 score demonstrated notable improvement, reaching 60.2 on TVSum and 50.9 on SumMe datasets. The validation outcomes at each stage yielded valuable insights into the complex impacts of particular blocks, aiding us in comprehending the best configuration for our encoder–decoder module. This methodical technique enabled the achievement of a harmonious equilibrium between the intricacy of the model and its performance, guaranteeing the efficacy of the ultimate architecture in extracting significant information for VS.

4.5.3. Effectiveness of Attention Mechanism with Progressive Features Fusion

The CA module is integrated to grasp the essential features effectively. This module contains global average pooling, max-pooling layers, three interconnected layers, and a multiplication operation. It allows for investigating the relationships between channels within the feature map. The main objective of the CA module is to allocate attention by giving a one-dimensional weight to each channel. This added weight strengthens the focus on essential information inside frames, enhancing overall performance. Ideal weights are achieved via parallel average and max-pooling methods, providing descriptors for each channel. Pooling methods in CNNs, such as global average pooling and max-pooling, aid in reducing the size of feature maps by downsampling them. This step minimizes the spatial dimensions while retaining crucial information. Pooling operations combine spatial information from many frames and channels.
The multiplication operation inside the CA module applies the weights to the channels, thereby determining their significance within the feature map. Although these methods do not directly identify features, they improve the representations obtained by CNNs, which may include colors, textures, shapes, and motion patterns that are important for video summarization. The combination of these descriptors is subjected to further processing in a shared multilayer perceptron, facilitating the generation of robust feature vectors. In CA-guided progressive feature fusion, we have incorporated skip connections by taking the connection from Con2 to Decon2. Subsequently, the features from Decon2 are merged with the CA-refined features. This strategic fusion is a crucial step in our encoder–decoder module, ensuring the harmonious integration of information from different layers. The skip connection from Con2 to Decon2 facilitates combining features from coarse layers with those from finer layers, contributing to the recovery of essential temporal information crucial for the VS process.
The final fusion involves blending the refined features from the CA module with the features from Decon2, resulting in a comprehensive representation that optimally captures both high-level and low-level features for enhanced model efficacy. This process is visualized in Figure 2 and is empirically validated to be beneficial for the intricate task of video summarization. The model is trained over 50 epochs after incorporating channel attention and progressive features fusion, as shown in Figure 7.

4.6. Comparison with State-of-the-Art Approaches

The suggested model underwent a comprehensive evaluation against state-of-the-art approaches on two datasets. The comparisons were carried out using the canonical, enhanced, and transfer configurations. The studies in question were first conducted using vsLSTM techniques [71]. Subsequently, the scope of the research was expanded to include DL approaches such as DR-DSN [14], FCSN [23], VsAR, and VsAR-FT [38]. Furthermore, a total of seven attention-based mechanisms were utilized in these experimental configurations. These methods include A-AVS, M-AVS [18], diverse global attention for VS [72], LMHA [73], TVS, and SHTVS [74]. These mechanisms encompass a range of techniques that use LSTM, fully convolutional sequential networks, reinforcement learning, spatiotemporal information, adversarial learning, and attention-based networks. Our model repeatedly showed superior performance compared to state-of-the-art methods. It is worth noting that performance decreases in the transfer setting compared to the other configurations, emphasizing the inherent difficulties connected with transfer learning. Furthermore, Table 4 shows the performance of the benchmark methods on the Mr. HiSum dataset.
However, it encounters difficulties when applied to other datasets, indicating the need for additional improvement and optimization in future iterations. As evident from Table 5, our approach exhibits superior performance on both datasets, securing the top-ranking position. The closest contender, LMHA [73], trails as the runner-up on both datasets.

4.7. Statistical Analysis

To determine if the performance on both datasets is statistically significant, we executed a two-sample t-test to evaluate the differences in mean F1 scores between our proposed method and the LMHA method [73]. Given the F1 score for each video in the dataset, we can calculate the t-values.
For TVSum dataset:
t_{TVSum} = \frac{61.5 - 61.0}{0.608 \times \sqrt{2/50}}
After calculation:
t_{TVSum} \approx \frac{0.5}{0.1216}
t_{TVSum} \approx 4.112
For SumMe dataset:
t_{SumMe} = \frac{51.8 - 51.1}{0.608 \times \sqrt{2/25}}
After calculation:
t_{SumMe} \approx \frac{0.7}{0.1719}
t_{SumMe} \approx 4.069
For TVSum, the mean F1 score of SAVS-Net is 61.5, while the mean F1 score of LMHA is 61.0. Using a two-sample t-test with a significance level of α = 0.05, we obtained a t-value of $t_{TVSum}$ ≈ 4.112 with 49 degrees of freedom. The critical t-value at α = 0.05 is approximately 2.009 for a two-tailed test. Since $t_{TVSum}$ exceeds this critical value, we reject the null hypothesis and infer that SAVS-Net and LMHA have significantly different mean F1 scores on the TVSum dataset. Similarly, for SumMe, the mean F1 score of SAVS-Net is 51.8, while that of LMHA is 51.1. Conducting the two-sample t-test yielded a t-value of $t_{SumMe}$ ≈ 4.069. The critical t-value at α = 0.05 is around 2.064 for a two-tailed test. Since $t_{SumMe}$ > 2.064, we reject the null hypothesis. The findings demonstrate that SAVS-Net surpasses LMHA with statistical significance on both datasets, highlighting the effectiveness of our proposed technique in VS tasks.
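For reproducibility of these numbers, the sketch below recomputes the two t-statistics from the reported means, assuming the pooled standard deviation of 0.608 used in the text and equal group sizes:

```python
import math

def two_sample_t(mean_a, mean_b, pooled_std, n):
    """t = (mean_a - mean_b) / (pooled_std * sqrt(2 / n)) for equal group sizes."""
    return (mean_a - mean_b) / (pooled_std * math.sqrt(2 / n))

print(round(two_sample_t(61.5, 61.0, 0.608, 50), 3))   # TVSum: ~4.112
print(round(two_sample_t(51.8, 51.1, 0.608, 25), 3))   # SumMe: ~4.07 (reported as 4.069)
```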

5. Conclusions and Future Work

The increasing prevalence of cameras has resulted in an unprecedented rise in video data output, requiring substantial resources for gathering, processing, and monitoring. Computer vision has made significant progress in improving the efficiency of acquiring and examining high-resolution video footage. In light of this context, video summarization has emerged as a crucial research area, providing a means to extract succinct summaries from extensive video recordings. Current approaches often depend on limited sets of representative characteristics implemented via shallow sequential networks, accompanied by a restricted range of attention mechanisms. This research presents a novel approach that involves a paradigm shift in visual intelligence. The proposed framework leverages optimal characteristics to aid in summarizing long videos.
In contrast to traditional methods, we use an InceptionV3 backbone for feature extraction after a thorough empirical examination of several candidate features, with the objective of obtaining an optimal representation. The strategic encoder–decoder component, consisting of five convolutional blocks and two transpose convolution blocks, captures complex interactions between frames. Furthermore, the integrated channel attention mechanism exposes the interconnections among channels, focusing on crucial elements to obtain a thorough understanding before the final prediction. The strong performance of our framework has been validated by rigorous experiments and ablation studies, consistently surpassing state-of-the-art networks on two distinct datasets. Our model's consistent superiority over established benchmarks confirms its efficacy as an innovative addition to the dynamic field of video summarization. Although video transformers have shown promising results, our method offers distinct improvements: it integrates several modules that effectively gather and prioritize important features within video frames. Moreover, the combined use of the InceptionV3 backbone and our encoder–decoder structure enables us to significantly enrich the features before passing them to the attention module. In addition, we include a channel attention module as an essential component of our framework. This module emphasizes the relationships between channels in the feature map, improving the model's capacity to prioritize key characteristics. By calculating 1-D weights for each channel, the channel attention mechanism concentrates the model on crucial information within video frames. Our design has consistently surpassed established benchmarks, as shown by stringent evaluations and extensive ablation studies, confirming the effectiveness of the proposed strategy and its contribution to video summarization.
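To make the channel attention step concrete, the sketch below shows a squeeze-and-excitation-style module that computes one weight per channel of a 1-D (frame-level) feature map and rescales the features accordingly; the class name, layer sizes, and reduction ratio are illustrative assumptions rather than the exact configuration used in our framework.

```python
import torch
import torch.nn as nn

class ChannelAttention1D(nn.Module):
    """Channel attention over 1-D feature maps: one learned weight per channel."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)          # squeeze: (B, C, T) -> (B, C, 1)
        self.fc = nn.Sequential(                     # excitation: bottleneck MLP over channels
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # one weight in (0, 1) per channel
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1)
        return x * weights                           # re-weight channels of the feature map

# Toy usage: a batch of 4 videos, 256 feature channels, 120 temporal positions.
features = torch.randn(4, 256, 120)
attended = ChannelAttention1D(256)(features)
print(attended.shape)  # torch.Size([4, 256, 120])
```

The sigmoid gate keeps the weights in (0, 1), so informative channels are emphasized while less useful ones are attenuated rather than discarded.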
In future research, our primary objective will be to investigate uncertainty-aware models to further improve the effectiveness of video summarization. We also aim to develop lightweight networks specifically designed for deployment on edge devices, enabling real-time analysis. These directions can enhance video summarization approaches, promoting flexibility and effectiveness across a wide range of applications and scenarios.

Author Contributions

Conceptualization, writing—original draft preparation, methodology, supervision, F.A.; validation, software, formal analysis, S.H.; investigation, writing—review and editing, W.A.; formal analysis, data curation, project administration, supervision, Z.J.; visualization, investigation, writing—review and editing, M.D.A.; resources, methodology, software, M.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data available in a publicly accessible repository that are cited in the article.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable contributions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jin, Y.; Li, X. Visualizing the hotspots and emerging trends of multimedia big data through scientometrics. Multimed. Tools Appl. 2019, 78, 1289–1313. [Google Scholar] [CrossRef]
  2. Li, J.; Zhang, C.; Liu, Z.; Hong, R.; Hu, H. Optimal volumetric video streaming with hybrid saliency based tiling. IEEE Trans. Multimed. 2022, 25, 2939–2953. [Google Scholar] [CrossRef]
  3. Workie, A.; Sharma, R.; Chung, Y.K. Digital video summarization techniques: A survey. Int. J. Eng. Technol. 2020, 9, 81–85. [Google Scholar]
  4. Khan, H.; Huy, B.Q.; Abidin, Z.U.; Yoo, J.; Lee, M.; Seo, K.W.; Hwang, D.Y.; Lee, M.Y.; Suhr, J.K. A modified yolov4 network with medium-scale challenging benchmark for efficient animal detection. In Proceedings of the 9th International Conference on Next Generation Computing, Danang, Vietnam, 20–23 December 2023. [Google Scholar]
  5. Khan, H.; Haq, I.U.; Munsif, M.; Mustaqeem; Khan, S.U.; Lee, M.Y. Automated wheat diseases classification framework using advanced machine learning technique. Agriculture 2022, 12, 1226. [Google Scholar] [CrossRef]
  6. Tiwari, V.; Bhatnagar, C. A survey of recent work on video summarization: Approaches and techniques. Multimed. Tools Appl. 2021, 80, 27187–27221. [Google Scholar] [CrossRef]
  7. Kumar, K. EVS-DK: Event video skimming using deep keyframe. J. Vis. Commun. Image Represent. 2019, 58, 345–352. [Google Scholar] [CrossRef]
  8. Khan, H.; Hussain, T.; Khan, S.U.; Khan, Z.A.; Baik, S.W. Deep multi-scale pyramidal features network for supervised video summarization. Expert Syst. Appl. 2024, 237, 121288. [Google Scholar] [CrossRef]
  9. Hussain, T.; Muhammad, K.; Ding, W.; Lloret, J.; Baik, S.W.; de Albuquerque, V.H.C. A comprehensive survey of multi-view video summarization. Pattern Recognit. 2021, 109, 107567. [Google Scholar] [CrossRef]
  10. Mujtaba, G.; Malik, A.; Ryu, E.-S. LTC-SUM: Lightweight client-driven personalized video summarization framework using 2D CNN. IEEE Access 2022, 10, 103041–103055. [Google Scholar] [CrossRef]
  11. Hussain, T.; Muhammad, K.; Ullah, A.; Cao, Z.; Baik, S.W.; de Albuquerque, V.H.C. Cloud-assisted multiview video summarization using CNN and bidirectional LSTM. IEEE Trans. Ind. Inform. 2019, 16, 77–86. [Google Scholar] [CrossRef]
  12. Apostolidis, E.; Adamantidou, E.; Metsai, A.I.; Mezaris, V.; Patras, I. Video summarization using deep neural networks: A survey. Proc. IEEE 2021, 109, 1838–1863. [Google Scholar] [CrossRef]
  13. Habib, S.; Khan, I.; Aladhadh, S.; Islam, M.; Khan, S. External features-based approach to date grading and analysis with image processing. Emerg. Sci. J. 2022, 6, 694–704. [Google Scholar] [CrossRef]
  14. Zhou, K.; Qiao, Y.; Xiang, T. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  15. Fei, M.; Jiang, W.; Mao, W. Memorable and rich video summarization. J. Vis. Commun. Image Represent. 2017, 42, 207–217. [Google Scholar] [CrossRef]
  16. Elhamifar, E.; Sapiro, G.; Sastry, S.S. Dissimilarity-based sparse subset selection. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 2182–2197. [Google Scholar] [CrossRef]
  17. Yuan, L.; Tay, F.E.H.; Li, P.; Feng, J. Unsupervised video summarization with cycle-consistent adversarial LSTM networks. IEEE Trans. Multimed. 2019, 22, 2711–2722. [Google Scholar] [CrossRef]
  18. Fu, T.-J.; Tai, S.-H.; Chen, H.-T. Attentive and adversarial learning for video summarization. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 7–11 January 2019; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar]
  19. De Avila, S.E.F.; Lopes, A.P.B.; da Luz, A., Jr.; de Albuquerque Araújo, A. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognit. Lett. 2011, 32, 56–68. [Google Scholar] [CrossRef]
  20. Lei, J.; Luan, Q.; Song, X.; Liu, X.; Tao, D.; Song, M. Action parsing-driven video summarization based on reinforcement learning. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2126–2137. [Google Scholar] [CrossRef]
  21. Huang, S.; Li, X.; Zhang, Z.; Wu, F.; Han, J. User-ranking video summarization with multi-stage spatio–temporal representation. IEEE Trans. Image Process. 2018, 28, 2654–2664. [Google Scholar] [CrossRef]
  22. Zhang, K.; Chao, W.-L.; Sha, F.; Grauman, K. Video summarization with long short-term memory. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VII 14; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  23. Rochan, M.; Ye, L.; Wang, Y. Video summarization using fully convolutional sequence networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  24. Fajtl, J.; Sokeh, H.S.; Argyriou, V.; Monekosso, D.; Remagnino, P. Summarizing videos with attention. In Computer Vision–ACCV 2018 Workshops: 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Revised Selected Papers 14; Springer: Berlin/Heidelberg, Germany, 2019. [Google Scholar]
  25. Ji, Z.; Xiong, K.; Pang, Y.; Li, X. Video summarization with attention-based encoder–decoder networks. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1709–1717. [Google Scholar] [CrossRef]
  26. Liang, G.; Lv, Y.; Li, S.; Wang, X.; Zhang, Y. Video summarization with a dual-path attentive network. Neurocomputing 2022, 467, 1–9. [Google Scholar] [CrossRef]
  27. Zhao, B.; Li, X.; Lu, X. Hierarchical recurrent neural network for video summarization. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017. [Google Scholar]
  28. Meena, P.; Kumar, H.; Yadav, S.K. A review on video summarization techniques. Eng. Appl. Artif. Intell. 2023, 118, 105667. [Google Scholar] [CrossRef]
  29. Ngo, C.-W.; Ma, Y.-F.; Zhang, H.-J. Video summarization and scene detection by graph modeling. IEEE Trans. Circuits Syst. Video Technol. 2005, 15, 296–305. [Google Scholar]
  30. Zhou, H.; Sadka, A.H.; Swash, M.R.; Azizi, J.; Sadiq, U.A. Feature extraction and clustering for dynamic video summarisation. Neurocomputing 2010, 73, 1718–1729. [Google Scholar] [CrossRef]
  31. Khan, H.; Ullah, M.; Al-Machot, F.; Cheikh, F.A.; Sajjad, M. Deep learning based speech emotion recognition for Parkinson patient. Electron. Imaging 2023, 35, 298-1–298-6. [Google Scholar] [CrossRef]
  32. Amin, S.U.; Hussain, A.; Kim, B.; Seo, S. Deep learning based active learning technique for data annotation and improve the overall performance of classification models. Expert Syst. Appl. 2023, 228, 120391. [Google Scholar] [CrossRef]
  33. Islam, M.; Aloraini, M.; Aladhadh, S.; Habib, S.; Khan, A.; Alabdulatif, A.; Alanazi, T.M. Toward a Vision-Based Intelligent System: A Stacked Encoded Deep Learning Framework for Sign Language Recognition. Sensors 2023, 23, 9068. [Google Scholar] [CrossRef]
  34. Ji, Z.; Zhao, Y.; Pang, Y.; Li, X.; Han, J. Deep attentive video summarization with distribution consistency learning. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 1765–1775. [Google Scholar] [CrossRef]
  35. Zhu, W.; Lu, J.; Li, J.; Zhou, J. Dsnet: A flexible detect-to-summarize network for video summarization. IEEE Trans. Image Process. 2020, 30, 948–962. [Google Scholar] [CrossRef]
  36. Gao, P.; Zhang, Q.; Wang, F.; Xiao, L.; Fujita, H.; Zhang, Y. Learning reinforced attentional representation for end-to-end visual tracking. Inf. Sci. 2020, 517, 52–67. [Google Scholar] [CrossRef]
  37. Jung, Y.; Cho, D.; Kim, D.; Woo, S.; Kweon, I.S. Discriminative feature learning for unsupervised video summarization. Proc. AAAI Conf. Artif. Intell. 2019, 33, 8537–8544. [Google Scholar] [CrossRef]
  38. Zhao, B.; Li, X.; Lu, X. Hsa-rnn: Hierarchical structure-adaptive rnn for video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  39. Habib, S.; Albattah, W.; Alsharekh, M.F.; Islam, M.; Shees, M.M.; Sherazi, H.I. Computer Network Redundancy Reduction Using Video Compression. Symmetry 2023, 15, 1280. [Google Scholar] [CrossRef]
  40. Li, X.; Zhao, B.; Lu, X. A general framework for edited video and raw video summarization. IEEE Trans. Image Process. 2017, 26, 3652–3664. [Google Scholar] [CrossRef]
  41. Li, Y.; Wang, L.; Yang, T.; Gong, B. How local is the local diversity? reinforcing sequential determinantal point processes with dynamic ground sets for supervised video summarization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  42. Mahasseni, B.; Lam, M.; Todorovic, S. Unsupervised video summarization with adversarial lstm networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  43. He, X.; Hua, Y.; Song, T.; Zhang, Z.; Xue, Z.; Ma, R.; Robertson, N.M.; Guan, H. Unsupervised video summarization with attentive conditional generative adversarial networks. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019. [Google Scholar]
  44. He, Y.; Gao, C.; Sang, N.; Qu, Z.; Han, J. Graph coloring based surveillance video synopsis. Neurocomputing 2017, 225, 64–79. [Google Scholar] [CrossRef]
  45. Zhao, B.; Li, H.; Lu, X.; Li, X. Reconstructive sequence-graph network for video summarization. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2793–2801. [Google Scholar] [CrossRef]
  46. Park, J.; Lee, J.; Kim, I.-J.; Soh, K. Sumgraph: Video summarization via recursive graph modeling. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXV 16; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  47. Wang, J.; Bai, Y.; Long, Y.; Hu, B.; Chai, Z.; Guan, Y.; Wei, X. Query twice: Dual mixture attention meta learning for video summarization. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020. [Google Scholar]
  48. Liu, Y.-T.; Li, Y.-J.; Wang, Y.-C.F. Transforming multi-concept attention into video summarization. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  49. Ma, M.; Mei, S.; Wan, S.; Hou, J.; Wang, Z.; Feng, D.D. Video summarization via block sparse dictionary selection. Neurocomputing 2020, 378, 197–209. [Google Scholar] [CrossRef]
  50. Mei, S.; Guan, G.; Wang, Z.; Wan, S.; He, M.; Feng, D.D. Video summarization via minimum sparse reconstruction. Pattern Recognit. 2015, 48, 522–533. [Google Scholar] [CrossRef]
  51. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the NIPS 2017, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  52. Gygli, M.; Grabner, H.; Riemenschneider, H.; Van Gool, L. Creating summaries from user videos. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  53. Khan, K.; Khan, R.U.; Albattah, W.; Nayab, D.; Qamar, A.M.; Habib, S.; Islam, M. Crowd Counting Using End-to-End Semantic Image Segmentation. Electronics 2021, 10, 1293. [Google Scholar] [CrossRef]
  54. Munsif, M.; Khan, H.; Khan, Z.A.; Hussain, A.; Ullah, F.U.M.; Lee, M.Y.; Baik, S.W. Pv-anet: Attention-based network for short-term photovoltaic power forecasting. In Proceedings of the The 8th International Conference on Next Generation Computing, Jeju, Republic of Korea, 6–8 October 2022; pp. 133–135. [Google Scholar]
  55. Ul Amin, S.; Ullah, M.; Sajjad, M.; Cheikh, F.A.; Hijji, M.; Hijji, A.; Muhammad, K. EADN: An efficient deep learning model for anomaly detection in videos. Mathematics 2022, 10, 1555. [Google Scholar] [CrossRef]
  56. Ul Amin, S.; Kim, Y.; Sami, I.; Park, S.; Seo, S. An Efficient Attention-Based Strategy for Anomaly Detection in Surveillance Video. Comput. Syst. Sci. Eng. 2023, 46, 3939–3958. [Google Scholar] [CrossRef]
  57. Husman, M.A.; Albattah, W.; Abidin, Z.Z.; Mustafah, Y.M.; Kadir, K.; Habib, S.; Islam, M.; Khan, S. Unmanned Aerial Vehicles for Crowd Monitoring and Analysis. Electronics 2021, 10, 2974. [Google Scholar] [CrossRef]
  58. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  59. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Vedaldi, A. Gather-excite: Exploiting feature context in convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
  60. Hwang, B.-Y.; Lee, S.-H.; Lee, S.-H. Modified YOLOv4S based on Deep learning with Feature Fusion and Spatial Attention. J. Korea Converg. Soc. 2021, 12, 31–37. [Google Scholar]
  61. Li, G.; Lv, J.; Wang, C. A modified generative adversarial network using spatial and channel-wise attention for CS-MRI reconstruction. IEEE Access 2021, 9, 83185–83198. [Google Scholar] [CrossRef]
  62. Li, J.; Liu, X.; Zhang, W.; Zhang, M.; Song, J.; Sebe, N. Spatio-temporal attention networks for action recognition and detection. IEEE Trans. Multimed. 2020, 22, 2990–3001. [Google Scholar] [CrossRef]
  63. Habib, S.; Khan, I.; Islam, M.; Albattah, W.; Alyahya, S.M.; Khan, S.; Hassan, M.K. Wavelet Frequency Transformation for Specific Weeds Recognition. In Proceedings of the 1st International Conference on Artificial Intelligence and Data Analytics (CAIDA), Riyadh, Saudi Arabia, 6–7 April 2021; pp. 97–100. [Google Scholar]
  64. Mundur, P.; Rao, Y.; Yesha, Y. Keyframe-based video summarization using delaunay clustering. Int. J. Digit. Libr. 2006, 6, 219–232. [Google Scholar] [CrossRef]
  65. Gygli, M.; Chao, W.-L.; Grauman, K.; Sha, F. Creating summaries from user videos. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VII 13; Springer: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
  66. Gygli, M.; Grabner, H.; Van Gool, L. Video summarization by learning submodular mixtures of objectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  67. Potapov, D.; Douze, M.; Harchaoui, Z.; Schmid, C. Category-specific video summarization. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VI 13; Springer: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
  68. Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  69. Li, S.; Yan, Q.; Liu, P. An efficient fire detection method based on multiscale feature extraction, implicit deep supervision and channel attention mechanism. IEEE Trans. Image Process. 2020, 29, 8467–8475. [Google Scholar] [CrossRef]
  70. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2015, 39, 640–651. [Google Scholar]
  71. Habib, S.; Hussain, A.; Islam, M.; Khan, S.; Albattah, W. Towards Efficient Detection and Crowd Management for Law Enforcing Agencies. In Proceedings of the 1st International Conference on Artificial Intelligence and Data Analytics (CAIDA), Riyadh, Saudi Arabia, 6–7 April 2021; pp. 62–68. [Google Scholar]
  72. Li, P.; Ye, Q.; Zhang, L.; Yuan, L.; Xu, X.; Shao, L. Exploring global diverse attention via pairwise temporal relation for video summarization. Pattern Recognit. 2021, 111, 107677. [Google Scholar] [CrossRef]
  73. Zhu, W.; Lu, J.; Han, Y.; Zhou, J. Learning multiscale hierarchical attention for video summarization. Pattern Recognit. 2022, 122, 108312. [Google Scholar] [CrossRef]
  74. An, Y.; Zhao, S. SHTVS: Shot-level based Hierarchical Transformer for Video Summarization. In Proceedings of the 2022 the 5th International Conference on Image and Graphics Processing (ICIGP), Beijing, China, 7–9 January 2022. [Google Scholar]
  75. Jiang, H.; Mu, Y. Joint video summarization and moment localization by cross-task sample transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  76. Habib, S.; Alsanea, M.; Aloraini, M.; Al-Rawashdeh, H.S.; Islam, M.; Khan, S. An Efficient and Effective Deep Learning-Based Model for Real-Time Face Mask Detection. Sensors 2022, 22, 2602. [Google Scholar] [CrossRef]
  77. Apostolidis, E.; Balaouras, G.; Mezaris, V.; Patras, I. Summarizing videos using concentrated attention and considering the uniqueness and diversity of the video frames. In Proceedings of the 2022 International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022. [Google Scholar]
  78. Elfeki, M.; Borji, A. Video summarization via actionness ranking. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 7–11 January 2019; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar]
  79. Huang, C.; Wang, H. A novel key-frames selection framework for comprehensive video summarization. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 577–589. [Google Scholar] [CrossRef]
  80. Puthige, I.; Hussain, T.; Gupta, S.; Agarwal, M. Attention over attention: An enhanced supervised video summarization approach. Procedia Comput. Sci. 2023, 218, 2359–2368. [Google Scholar] [CrossRef]
  81. Zhao, B.; Li, X.; Lu, X. TTH-RNN: Tensor-train hierarchical recurrent neural network for video summarization. IEEE Trans. Ind. Electron. 2020, 68, 3629–3637. [Google Scholar] [CrossRef]
  82. Fu, H.; Wang, H. Self-attention binary neural tree for video summarization. Pattern Recognit. Lett. 2021, 143, 19–26. [Google Scholar] [CrossRef]
Figure 1. Categorization of different video summarization approaches.
Figure 2. Visual overview of the proposed framework.
Figure 3. Visual representation of the feature flow inside the InceptionV3 architecture tailored for video summarization.
Figure 4. Visualization depicting the 1D Convolutional Blocks within the Encoder Block for a comprehensive understanding.
Figure 5. Overview of the utilized attention mechanism.
Figure 6. Illustration depicting the dataset utilized, providing a visual overview. The first two rows showcase diverse samples from the TVSum dataset, while the subsequent two rows present samples from the SumMe dataset.
Figure 7. Results on the testing splits of the TVSum and SumMe datasets. The model is trained over 50 epochs after incorporating channel attention and progressive feature fusion.
Table 1. Description of the utilized datasets.
Dataset | No. of Videos | Mean Length | Number of Annotators
TVSum [68] | 50 | 4 min 18 s | 20 users
SumMe [52] | 25 | 2 min 40 s | 15–18 users
Table 2. Empirical analysis of the intermediate features using different models. The results are reported in the F1 Score.
Dataset | GoogleNet | Xception | EfficientNetB0 | ResNet-50 | NASNetM | MobileNet | InceptionV3
TVSum | 55.9 | 56.5 | 54.2 | 57.5 | 58.0 | 53.0 | 59.5
SumMe | 48.3 | 47.8 | 46.4 | 48.3 | 48.7 | 44.8 | 49.4
Table 3. The impact of the TCN blocks on ultimate performance. ✔ denotes inclusion of a block, while × denotes exclusion.
Dataset | B1 | B2 | B3 | B4 | B5 | F1 score | Precision | Recall | Kappa | Correlation
TVSum | ✔ | × | × | × | × | 57.8 | 0.573 | 0.584 | 0.56 | 0.58
TVSum | ✔ | ✔ | × | × | × | 58.3 | 0.578 | 0.589 | 0.58 | 0.59
TVSum | ✔ | ✔ | ✔ | × | × | 58.9 | 0.585 | 0.597 | 0.59 | 0.58
TVSum | ✔ | ✔ | ✔ | ✔ | × | 59.5 | 0.591 | 0.605 | 0.58 | 0.57
TVSum | ✔ | ✔ | ✔ | ✔ | ✔ | 60.2 | 0.598 | 0.613 | 0.60 | 0.58
SumMe | ✔ | × | × | × | × | 47.8 | 0.471 | 0.487 | 0.46 | 0.48
SumMe | ✔ | ✔ | × | × | × | 48.4 | 0.478 | 0.494 | 0.48 | 0.47
SumMe | ✔ | ✔ | ✔ | × | × | 49.8 | 0.492 | 0.509 | 0.49 | 0.51
SumMe | ✔ | ✔ | ✔ | ✔ | × | 50.2 | 0.496 | 0.514 | 0.48 | 0.50
SumMe | ✔ | ✔ | ✔ | ✔ | ✔ | 50.9 | 0.503 | 0.521 | 0.50 | 0.50
Table 4. Performance comparison of different approaches on the Mr. HiSum dataset.
Network | F1 Score | Precision | Recall | Kappa | Correlation
iPTNet [75] | 50.53 | 49.25 | 52.82 | 0.52 | 0.50
DSNet [35] | 50.78 | 51.97 | 49.61 | 0.50 | 0.51
SL-module [76] | 55.31 | 54.92 | 55.71 | 0.53 | 0.55
VASNet [24] | 55.26 | 54.72 | 55.81 | 0.56 | 0.57
PGL-SUM [77] | 55.89 | 55.96 | 55.82 | 0.54 | 0.55
SAVS-Net (ours) | 56.73 | 57.12 | 56.34 | 0.57 | 0.56
Table 5. Comparative analysis with state-of-the-art methods on both TVSum and SumMe datasets.
Technique | Features | SumMe F-1 | SumMe Rank | TVSum F-1 | TVSum Rank
SWVT TVSum [68] | HOG + GIST + SIFT | - | - | 50.0 |
ESSV [39] | AlexNet | 40.9 | 14 | - | -
VsLSTM (M1) [71] | GoogleNet | 37.6 | 17 | 54.2 | 12
VsLSTM (M2) [71] | GoogleNet | 38.6 | 16 | 54.7 | 11
GSF [40] | VGGnet-16 | 43.1 | 12 | 52.7 | 14
SeqDPP [41] | GoogleNet | 44.3 | 8 | 58.4 | 6
FCSN [23] | GoogleNet | 48.8 | 4 | 56.8 | 9
HSA-RNN [38] | SIFT + Optical flow | 44.1 | 9 | 59.8 | 4
vsLSTM + Att (M1) [53] | InceptionV1 | 42.2 | 13 | 57.8 | 8
dLSTM + Att (M2) [53] | InceptionV1 | 43.8 | 11 | 53.9 | 13
VsAR [78] | GoogleNet | 40.1 | 15 | 56.3 | 10
A-AVS (M1) [25] | GoogleNet | 43.9 | 10 | 59.4 | 5
M-AVS (M2) [25] | GoogleNet | 44.4 | 6 | 61.0 | 2
KS-CVS [79] | CapsNet | 46.0 | 5 | 58.0 | 7
AoA [80] | GoogleNet | 45.0 | 6 | 58.5 | 6
TTH-RNN [81] | GoogleNet | 44.3 | 7 | 60.2 | 3
SBNT [82] | GoogleNet | 50.7 | 3 | 61.0 | 2
LMHA [73] | GoogleNet | 51.1 | 2 | 61.0 | 2
SAVS-Net (ours) | InceptionV3 | 51.8 | 1 | 61.5 | 1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
