Article

Multimodal Video Summarization Using Machine Learning: A Comprehensive Benchmark of Feature Selection and Classifier Performance

1 Polytechnic Faculty, University of Zenica, 72000 Zenica, Bosnia and Herzegovina
2 Faculty of Educational Sciences, University of Sarajevo, 71000 Sarajevo, Bosnia and Herzegovina
3 Faculty of Digital Transformation, Leipzig University of Applied Sciences, 04277 Leipzig, Germany
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(9), 572; https://doi.org/10.3390/a18090572
Submission received: 30 July 2025 / Revised: 3 September 2025 / Accepted: 5 September 2025 / Published: 10 September 2025
(This article belongs to the Special Issue Visual Attributes in Computer Vision Applications)

Abstract

The exponential growth of user-generated video content necessitates efficient summarization systems for improved accessibility, retrieval, and analysis. This study presents and benchmarks a multimodal video summarization framework that classifies segments as informative or non-informative using audio, visual, and fused features. Sixty hours of annotated video across ten diverse categories were analyzed. Audio features were extracted with pyAudioAnalysis, while visual features (colour histograms, optical flow, object detection, facial recognition) were derived using OpenCV. Six supervised classifiers—Naive Bayes, K-Nearest Neighbors, Logistic Regression, Decision Tree, Random Forest, and XGBoost—were evaluated, with hyperparameters optimized via grid search. Temporal coherence was enhanced using median filtering. Random Forest achieved the best performance, with 74% AUC on fused features and a 3% F1-score gain after post-processing. Spectral flux, grayscale histograms, and optical flow emerged as key discriminative features. The best model was deployed as a practical web service using TensorFlow and Flask, integrating informative segment detection with subtitle generation via beam search to ensure coherence and coverage. System-level evaluation demonstrated low latency and efficient resource utilization under load. Overall, the results confirm the strength of multimodal fusion and ensemble learning for video summarization and highlight their potential for real-world applications in surveillance, digital archiving, and online education.

1. Introduction

The continuous progress in digital photography and electronics has enabled the integration of high-definition optical sensors into affordable mobile phones and action cameras. As a result, we are witnessing an exponential increase in user-generated content in the form of video recordings. Furthermore, an increasing number of users frequently record daily activities such as sports or travel and share this content through social media platforms like Facebook (http://www.facebook.com), Instagram (http://www.instagram.com), Twitter (http://www.twitter.com), or video-sharing platforms such as YouTube (http://www.youtube.com) and Vimeo (http://www.vimeo.com). According to official statistics from YouTube [1], users watch more than a billion hours of video content daily, with over 500 h of video recordings uploaded every minute. From this data, it can be inferred that further growth in these figures is inevitable in the coming years.
With the increasing availability of information, there is a growing need for users to efficiently navigate large collections of video content [2] and extract meaningful insights from extensive video recordings. To address these continuously evolving requirements, numerous research efforts have focused on video summarization techniques. Briefly, this task aims to create concise summaries of given video recordings [3]; when a user views such summaries, they should immediately grasp the most significant parts of the content. Beyond enabling efficient navigation and processing of video content for entertainment purposes [4], video summarization applications include video recordings from surveillance cameras [5], medical procedures [6], or large-scale footage captured by unmanned aerial vehicles (UAV) [7], among others.
Despite the significant advances in video summarization, several challenges continue to hinder the generation of high-quality and practically useful summaries. First, videos often contain heterogeneous information across visual, auditory, and textual modalities, and effectively combining these diverse signals to identify the most relevant content remains a difficult task. Differences in modality characteristics, asynchronous events, and varying levels of informativeness make it challenging for existing methods to consistently select the segments that best represent the core narrative of a video. Second, preserving temporal coherence is critical for producing summaries that are not only informative but also easy to follow; however, abrupt transitions or fragmented segment selections can reduce both the interpretability and the perceived quality of generated summaries. Third, while many approaches focus primarily on algorithmic performance, practical deployment aspects—such as computational efficiency, scalability, and real-time responsiveness—are often overlooked, limiting the applicability of these methods in real-world scenarios where videos vary widely in length and complexity. The method proposed in this work directly addresses these limitations through a systematic and multimodal approach. By analyzing audio, visual, and fused feature representations at the segment level, the method is able to identify content that is genuinely informative while filtering out redundant or irrelevant segments. Temporal smoothing mechanisms are applied to maintain continuity across the selected segments, enhancing the overall coherence of the summaries. Furthermore, the approach is implemented as a fully operational pipeline within a web service, allowing for efficient, scalable, and real-time generation of descriptive captions for the most relevant segments. In addition, the study establishes a benchmark by systematically evaluating segment-level classification performance across multiple modalities and machine learning models, while also analyzing system performance using key operational metrics, including step-wise latency for each processing stage, total request time for end-to-end responsiveness, GPU and CPU utilization to monitor computational load, memory usage, and requests per second (RPS) to assess throughput and real-time scalability. Through this integration of multimodal analysis, temporal refinement, practical deployment considerations, and benchmarking, the method provides summaries that are not only semantically rich but also operationally robust, effectively bridging the gap between research-oriented algorithms and their application in real-world multimedia environments.
Approaches to video summarization can be categorized into four primary types, depending on the type of audiovisual features generated and presented to users [3], namely:
(a) keyframes/images [8], which represent extracted moments displayed sequentially and are often referred to as “static” summaries;
(b) sets of video segments [9], frequently termed “dynamic” summaries and serving as an extension of the first type by retaining audio and visual motion elements;
(c) graphical symbols [10], which complement other features with a form of graphical syntax to enhance user interpretation of summaries; and
(d) automatically generated textual descriptions [11], designed to provide efficient content summaries for video materials.
This paper presents a unified two-stage pipeline for video summarization:
  • Stage 1 involves multimodal feature extraction and classification across video, audio, and fused modalities, providing a benchmark for algorithms designed to identify informative segments at the segment level; while
  • Stage 2 focuses on the development and evaluation of a web service that leverages this classification capability. The service generates descriptive subtitles exclusively for the informative segments detected in Stage 1, employing a Beam Search strategy to ensure conciseness, coherence, and contextual consistency.
Crucially, the optimized classifier from Stage 1 is deployed directly within the web service to select the informative segments, creating a seamless pipeline from raw video to summarized captions. The innovation of this study lies in its comprehensive and systematic approach to multimodal video summarization. By evaluating and integrating audio, visual, and hybrid features at the segment level, the method effectively identifies the most informative content while filtering out irrelevant or redundant segments. Temporal smoothing techniques are applied to ensure continuity, producing summaries that are both semantically rich and easy to follow. Beyond improving summary quality, the study advances practical deployment by incorporating detailed benchmarking of classification performance alongside operational system metrics such as processing latency, resource utilization, and throughput. This unified framework provides not only a deeper understanding of feature and classifier contributions but also actionable insights for developing scalable, real-time video summarization systems capable of handling diverse multimedia content. In addition, this framework combines features of both extractive and generative summarization. Stage 1 performs extractive summarization by selecting informative segments, while Stage 2 adds a generative layer by producing new textual descriptions for those segments. This hybrid approach improves both informativeness and interpretability of the resulting summaries.
The remainder of this paper is organized as follows. Section 2 reviews related work, covering audio-visual integration in video summarization, existing datasets, and problem definition and motivation. Section 3 describes the datasets used, the feature extraction methods applied to obtain audio, visual, and combined segment-level features, and the supervised classifiers evaluated, including details on parameter tuning and preprocessing techniques. Section 4 outlines the experimental setup and evaluation protocols, specifying the metrics employed to assess both summarization accuracy and operational performance indicators such as latency and resource utilization. Section 5 details the system implementation, focusing on the integration of the summarization models into a full-stack application developed with React for the frontend and Flask for the backend, describing the architecture, communication between components, and deployment considerations. Section 6 presents the comprehensive benchmarking results, including classifier performances across different modality combinations, feature selection outcomes, the impact of temporal smoothing through median filtering, as well as an in-depth analysis of computational resource usage and latency measurements. Finally, Section 7 discusses the implications of these findings, synthesizes the key contributions, and offers concluding remarks along with directions for future research in scalable, real-time multimodal video summarization systems.

2. Related Work

2.1. Audio-Visual Integration in Video Summarization

In recent years, a significant number of prominent studies have presented diverse techniques for video content summarization, resulting in notable outcomes. In reference [12], the authors define video summarization as a process involving sequential decision-making within a deep network trained through an end-to-end learning framework. This network is capable of predicting the probability that each frame (image) in a video belongs to the summary, requiring the use of an encoder structured as a convolutional neural network (CNN) responsible for feature extraction and a long short-term memory (LSTM) network tasked with calculating frame probabilities for the decoder. Unlike this purely deep learning-based approach, our work explores classifier performance across multimodal features, providing a systematic benchmarking rather than relying on a single end-to-end neural model.
Reference [13] proposes a novel supported technique for video summarization based on LSTM architecture, which autonomously selects keyframes to generate compact and meaningful video summaries. A notable characteristic of this research is the demonstration that domain adaptation techniques can enhance the overall summarization process. Our work differs by focusing not only on keyframe selection but also on full segment-level classification, integrating both audio and video features and emphasizing deployment feasibility.
In reference [14], a generic algorithm for video summarization is introduced through the fusion of features from diverse multimodal sources. This approach integrates low-level feature concatenation across visual, audio, and textual inputs to intuitively analyze the structure and content of input videos, aiming to create a final result grounded in informative segments derived from all specified sources. In contrast, our method also adopts feature fusion but complements it with a broad evaluation of different classifiers and post-processing strategies, thereby extending the scope of multimodal benchmarking.
Reference [15] highlights that the primary objective of video summarization methodology is to generate a more compact version of the original raw video while retaining substantial semantic information and ensuring comprehensive content for viewers. An innovative solution named “SASUM” is presented, which differs from conventional techniques focused solely on summary diversity by extracting the most descriptive segments within the summary itself. Specifically, “SASUM” comprises modules for segment selection and video descriptors, collectively generating a final video that minimizes deviation between its description and human-generated summaries used as ground truth. Compared to SASUM, our approach does not optimize for summary diversity alone but also evaluates classifier-driven informativeness, coupled with temporal smoothing for continuity and practical deployment.
Reference [16] emphasizes that the vast number of videos produced daily necessitates summarization techniques to create concise formats devoid of redundant information. The described approach, termed “SalSum”, employs a generative adversarial network (GAN) previously trained using fixation data from human eye movements. An unsupervised model combines colour and visual elements, while the summary represents a fusion of colour and edge information recorded by traversing video content. Unlike this generative approach, our study focuses on supervised classification of multimodal features and practical benchmarking of system-level performance, addressing real-world deployment challenges.
All methods and techniques mentioned here are highly significant for video summarization, with several currently representing state-of-the-art solutions. However, most do not simultaneously consider visual and auditory information, making them unsuitable for user-generated videos such as those from surveillance systems or smartphones. The integration of these data types presents substantial potential for classifying video segments and summarizing them with high robustness and reliability. Our contribution directly addresses this gap by systematically integrating and benchmarking audio-visual fusion, demonstrating both improved accuracy and deployability.

2.2. Video Summarization Datasets Overview

One of the most significant challenges in developing models for video classification and summarization lies in selecting an appropriate dataset encompassing diverse categories and annotations varying in quality and length. Additionally, it is crucial to consider the specific purpose of model creation since this determines the choice of indicators and context for input data. Publicly available general-purpose datasets are typically reserved exclusively for evaluating model accuracy rather than training. Consequently, some of the most recent and widely used publicly accessible sources related to video summarization research are outlined below.
The “MED Summaries” dataset [17] is a relatively new collection designed for evaluating dynamic summaries of video recordings. It includes annotations for 160 videos categorized into ten groups within its test set. These categories include events such as birthday celebrations, tire changes, flash mobs, vehicle extrication, and animal care among others. While useful for event-driven scenarios, our work instead constructs a multimodal dataset focused on segment-level informativeness across diverse user-generated categories, enabling fine-grained classification and benchmarking.
The “TVSum” dataset (Title-based Video Summarization) [18] aims to address limitations in prior knowledge of the main theme of video recordings. The entire dataset comprises 50 videos spanning various genres such as news, tutorials, documentaries, vlogs, and first-person perspectives. It contains 1000 frame-level annotations ranked by users (20 per video), with video durations ranging between two to ten minutes. This dataset enables automated evaluation of summarization techniques without requiring costly user studies or surveys. Compared to TVSum, our dataset covers longer and more diverse videos and emphasizes multimodal (audio and visual) features, allowing us to assess classifier robustness under real-world conditions.
The “SumMe” dataset [19] consists of 25 videos covering holidays, events, and sports, sourced from the YouTube platform. Each video is annotated with at least fifteen manually created summaries (totalling 390), with durations varying between one to six minutes. Unlike SumMe, which is relatively small in scale, our dataset aggregates around 60 h of video content, providing broader coverage for systematic benchmarking of feature and classifier combinations.
The “UT Ego” dataset (University of Texas at Austin Egocentric) [20] includes 101 videos recorded using head-mounted action cameras during various activities such as eating, shopping, attending lectures, driving, and cooking. Each video lasts approximately three to five hours, captured at 15 frames per second with a resolution of 320 × 480 in uncontrolled conditions, resulting in segments with rapid movements. Although egocentric data provides valuable insights into continuous activities, our focus shifts to varied user-generated categories, balancing controlled annotations with multimodal integration.
Finally, the “VSUMM” dataset [21] was initially developed for generating static summaries through a novel evaluation method that enables objective comparison across different methodologies while eliminating subjectivity in quality assessments. It is also known as the “YouTube Dataset” and contains 50 videos from the Open Video2 project. These videos range in duration from one to four minutes, with an aggregate length of approximately 75 min. The collection spans genres such as documentaries, educational content, ephemeral footage, historical recordings, and lectures. It includes 250 manually created summaries generated by fifty individuals, with each contributor annotating five video recordings, ensuring that every video contains five distinct summaries developed by different users. In contrast, our dataset and experiments are designed around segment-level classification and multimodal fusion, extending beyond static summaries to enable benchmarking of end-to-end summarization pipelines.

2.3. Problem Definition and Motivation

The rapid proliferation of online video content across platforms such as social media, educational repositories, and entertainment services has underscored the urgent need for efficient automatic video summarization techniques. Users frequently struggle with lengthy, unstructured videos that hinder the swift extraction of relevant information. Traditional unimodal summarization methods, typically based solely on visual cues, often fall short in capturing the full semantic complexity of video content, which inherently integrates visual, auditory, and textual modalities. This limitation has prompted growing interest in multimodal video summarization, which seeks to exploit the complementary strengths of diverse feature modalities through advanced machine learning techniques. Despite this progress, a systematic benchmarking of feature selection strategies and classifier performance in multimodal settings remains a critical yet insufficiently explored area of research.
Previous studies, such as that of Zhang et al. [22], underscore the importance of multimodal fusion by proposing MF2Summ, which adopts a sophisticated temporal alignment mechanism employing dual Transformer architectures to fuse visual and auditory signals effectively. Their technique highlights that integrating cross-modal attention with temporal correspondence modelling significantly improves summary quality over unimodal baselines on popular datasets like SumMe and TVSum. However, Zhang et al. also reveal limitations inherent in models that do not incorporate higher-level semantic knowledge, such as textual transcripts or external domain information, restricting their ability to fully contextualize video content. In contrast, our study focuses on audio-visual fusion at the segment level combined with systematic classifier benchmarking, filling the gap left by deep Transformer-based methods that emphasize fusion but overlook classifier-level comparisons and deployment feasibility.
Similarly, He et al. [23] introduce a method called “Align and Attend”, which innovates by employing dual contrastive losses to better align and attend to visual and textual modalities in a unified Transformer framework. This approach showcases the critical role of temporal and cross-modal correspondence in generating coherent video summaries. By enforcing both intra- and inter-modal contrastive learning, their method avoids redundancy and improves the informativeness of extracted summaries. Yet, He et al. acknowledge that their modality scope remains limited, and the exclusion of audio or knowledge-based features leaves room for extending the semantic breadth of multimodal approaches. Our approach differs by explicitly incorporating audio features alongside visual descriptors, demonstrating that such multimodal integration improves classifier accuracy while remaining computationally efficient and suitable for real-world deployment.
The work of Xu et al. [24] further refines feature selection by focusing on learning summary-worthy visual representations through a bidirectional visual-language attention mechanism. Their use of self-distillation with pseudo summaries guides their model toward emphasizing features that correspond closely with human-generated summaries. Such an approach leads to improved abstractive summarization performance, especially valuable in data-scarce scenarios. Nevertheless, their focus is mainly on vision-text fusion, which omits valuable audio cues and external knowledge that could enrich summary semantics. Unlike Xu et al., our research systematically evaluates audio, video, and fused features and ranks them through recursive feature elimination (RFE), providing a transparent benchmark of which modalities contribute most to informativeness.
Incorporating external semantic knowledge is another pivotal direction. The study by Lu et al. [25] presents knowledge-aware multimodal deep networks that embed high-level semantic encoders to enhance feature representation and selection. Their Knowledge-Aware Multimodal Network (KAMN) demonstrates that injecting structured knowledge allows for capturing complex event dependencies and improves generalization beyond raw audiovisual signals. However, the reliance on external knowledge bases introduces challenges such as computational complexity and domain adaptation, which could limit scalability and applicability across diverse video types. In contrast, our framework avoids dependency on external knowledge and instead emphasizes operational benchmarking, ensuring scalability for real-world applications. Additionally, spatiotemporal feature fusion techniques discussed by Kashid et al. [26] highlight the necessity of modelling both spatial and temporal dependencies jointly for robust video summarization. By fusing frame-level and sequence-level features, their approach better captures subtle dynamic events and multimodal cues. Nonetheless, such fusion approaches may falter when modalities are asynchronous or when salient events are sparsely distributed, leading to potential dilution of important information. Our method addresses this by introducing temporal smoothing via median filtering, which improves continuity of selected segments and mitigates fragmentation caused by asynchronous or noisy inputs.
Recent advances further reinforce the importance of multimodal and hierarchical fusion. Yang et al. [27] propose a method that integrates audio, visual, and ASR-generated textual information using a two-stage fusion strategy, leveraging video Transformers for visual features, BART for text encoding, and Whisper for audio representation. By incorporating ASR text, they overcome the limitation of relying solely on ground truth transcripts. Unlike their reliance on large pre-trained models, our work systematically benchmarks handcrafted audio-visual features and multiple classifiers, demonstrating competitive results without end-to-end Transformer architectures. Similarly, Yu et al. [28] introduce a hierarchical multimodal summarization framework with dynamic sampling that adaptively selects frames based on motion intensity, improving semantic fidelity. While their approach captures motion-aware temporal details, our median filtering provides a lightweight alternative to preserve segment continuity, emphasizing classifier-driven summarization.
Several survey and applied studies contextualize multimodal video summarization trends. Alaa et al. [29] provide a comprehensive review of both extractive and abstractive methods, covering supervised, unsupervised, weakly supervised, and reinforcement learning paradigms. They highlight key trends such as the rise of attention-based deep architectures, multimodal fusion techniques, and increasing incorporation of user interactivity and personalization. Their survey also identifies critical challenges, including dataset scarcity, real-time processing constraints, and the balance between extractive and abstractive summarization, which are directly addressed in our study through systematic feature evaluation and benchmarking.
Chen et al. [30] introduce Video Summarization with Language (VSL), a personalized framework that leverages pre-trained visual-language models to generate summaries aligned with individual user preferences. The pipeline converts video frames and closed captions into textual representations, enabling semantic scene selection without requiring expensive supervised training. VSL also introduces the “UserPrefSum” dataset, a genre-labelled movie dataset automatically annotated via CLIP zero-shot capabilities, facilitating realistic evaluation of personalized summaries. Their method preserves scene and dialogue integrity, demonstrating the potential of multimodal semantic representations for user-centric summarization. Lee et al. [31] employ large language models (LLMs) to generate detailed textual descriptions for each video frame, which are then evaluated for importance using in-context learning. A global self-attention mechanism aggregates these frame-level scores to maintain narrative coherence, and experiments on SumMe and TVSum show improved semantic fidelity compared to traditional visual-based methods. Pang et al.’s MeCo framework [32] similarly uses LLMs for timestamp-free temporal localization, generating semantically rich event descriptions by partitioning videos into holistic event and transition segments. While these works emphasize semantic reasoning and event-level understanding, our study focuses on systematic benchmarking of audio, visual, and fused features using multiple classifiers, combined with temporal smoothing for practical deployment.
Guo et al. [33] propose CFSum, a two-stage transformer fusion framework that combines video, audio, and textual features. The approach first performs coarse-grained fusion via self-attention across modalities, followed by fine-grained cross-modal attention between video/audio and text features. Modal autoencoders augment intra-modal context to generate robust segment representations, and a saliency prediction head selects the most informative segments. Experiments on TVSum, YouTube Highlights, and QVHighlights demonstrate that CFSum captures deep inter- and intra-modal interactions effectively. While their method relies on end-to-end deep learning, our work focuses on systematically evaluating handcrafted and fused features with multiple classifiers, providing insights into feature importance and deployment efficiency. Psallidas et al. [34] present a supervised approach to dynamic video summarization that selects informative segments from unedited user-generated videos while preserving temporal order. Their method fuses handcrafted audio-visual descriptors with deep visual features extracted from pretrained networks such as VGG19 to construct rich multimodal representations. Video summarization is formulated as a binary classification task on one-second segments labelled as “informative” or “non-informative” based on aggregated human annotations. Classifiers such as Random Forest, XGBoost, and k-NN are applied, showing that combining handcrafted and deep features improves performance and generalization across multiple datasets, including custom YouTube collections, SumMe, and TVSum. Our implementation of feature extraction, analysis, and gathering is inspired by their approach, particularly in constructing the fused 224-dimensional feature vectors that integrate audio and video descriptors. While Psallidas & Spyrou demonstrated the value of combining handcrafted and deep features, our study extends this by systematically benchmarking classifier families, applying recursive feature elimination, and integrating the optimal classifier into a deployable end-to-end summarization pipeline.
Despite notable advances, several persistent challenges remain at the forefront of multimodal video summarization:
  • Feature Selection: Determining which features across each modality truly indicate summary-worthy content is nontrivial. Inadequate feature selection may propagate irrelevant or redundant information, reducing summary conciseness and informativeness.
  • Cross-Modal Fusion and Alignment: Effective fusion strategies, especially those leveraging Transformers and attention mechanisms, are crucial to reconcile asynchronous or weakly correlated modalities. Yet, aligning diverse signals while maintaining computational tractability remains an open area of research.
  • Utilization of External Knowledge: The integration of semantic, user, or task-specific knowledge has the potential to significantly enhance summary quality, but methods for knowledge injection and utilization are still in their infancy, often sacrificing efficiency or adaptability.
  • Dataset Limitations: Many benchmark datasets (e.g., SumMe, TVSum) are relatively small-scale and may not fully reflect the diversity or complexity of real-world video data, hampering robust evaluation and generalization studies.
  • Classifier Performance: The ultimate success of any machine-learning-driven summarization system depends not only on input representations and fusion strategies but also on the choice and training of classifiers to distinguish important segments—an area where comprehensive benchmarking is critically lacking [24].
The main contributions of this research paper are as follows:
  • Comprehensive Evaluation of Multimodal Features: This study benchmarks and compares a diverse set of audio, visual, and combined (hybrid) segment-level feature representations for video summarization, quantifying their individual and joint contributions to summary informativeness and reliability.
  • Systematic Classifier Benchmarking: A range of machine learning classifiers, including Naïve Bayes, K-nearest neighbours (KNN), logistic regression, decision trees, random forests, and XGBoost, are systematically evaluated on the task of multimodal segment classification, highlighting how classifier selection impacts summary accuracy, recall, and macro F1 performance metrics.
  • Feature Selection and Ranking: The research applies recursive feature elimination (RFE) on the full feature set to identify and rank the most informative audio and video attributes, providing insights into which modalities and features most influence classifier performance and summary potential.
  • Temporal Smoothing for Segment Continuity: Median filtering is implemented post-classification to improve the temporal smoothness of summary outputs by mitigating irregular transitions or overly fragmented segmentations that detract from the viewing experience.
  • Operational Metrics Benchmarking: The study introduces fine-grained tracking and evaluation of key operational metrics including step-wise latency (the execution time of each processing stage such as saving, feature extraction, clustering, dropping segments, and caption generation), total request time (overall end-to-end processing duration for each captioning request), simulated GPU and CPU utilization (reflecting computational load during processing), memory utilization to estimate runtime memory pressure, requests per second as a throughput measure, caption generation time comparisons between greedy and beam search decoding, caption length statistics as proxies for verbosity, and an analysis of beam width trade-offs that balances decoding complexity and caption richness.
  • Unified Benchmarking Framework: The research integrates classification performance with operational system metrics, equipping practitioners with a holistic understanding of both model effectiveness and practical deployment considerations, ranging from computational efficiency to system scalability and real-world responsiveness.
  • Generalizability and Guidance for Deployment: The results and framework established here are positioned to inform practical decisions for real-world video summarization deployment, enabling adaptive, resource-efficient, and user-oriented video content services in complex and high-demand multimedia environments.
Together, these studies illustrate diverse strategies for multimodal video summarization, ranging from LLM-based semantic reasoning and personalized summaries to transformer-based fusion and classifier-driven approaches. Our work bridges these strategies by systematically evaluating fused audio-visual descriptors, ranking feature importance, and applying temporal smoothing, resulting in an efficient, deployment-ready framework for real-world video summarization.

3. Materials and Methods

3.1. Dataset and Annotations

The dataset employed for training, testing, and evaluating models consists of a variable number of video recordings organized into distinct categories within separate folders. The primary source of these videos is the YouTube platform, with data collection criteria emphasizing single-angle shots or footage devoid of abrupt frame transitions, special effects, transitions, additional editing, or background music. This approach ensures that the summarization and processing of video recordings remain consistent with their original versions while also facilitating adaptability to more structured and higher-quality recordings. Activities depicted in the videos may vary; however, outdoor scenes are typically preferred due to natural lighting and sound conditions. These include action or extreme sports scenarios captured from open environments. Specifically, ten categories were considered: “Automobiles and Vehicles”, “Racing (Street/Road/Drag Racing)”, “Kayaking (Outdoor Recreation)”, “Rock Climbing”, “Hunting and Fishing”, “Scuba Diving”, “Food Reviews (Food & Drink)”, “Amusement/Water Parks”, “Mountain Biking”, and “Survival in the Wild”.
The objective of this procedure was to define video segments suitable for training and testing a machine learning model. As reference values, time intervals identified by users as interesting, informative, or frequently viewed were considered. YouTube provides functionality to track the most repeated portions of videos, while numerous repositories and datasets containing pre-annotated data are available. One such dataset widely utilized in research related to this platform is the YouTube-8M dataset, which comprises two components:
  • Frame-level training data, where columns “id” and “labels” identify video recordings and their categories, respectively, while “segment_start_times”, “segment_end_times”, “segments_labels”, and “segment_scores” define annotated segments as temporal intervals, category labels, and indicators of belonging to specific categories; and
  • Video-level training data, characterized by columns “mean_rgb” and “mean_audio”, which represent average values of audio and video features in the form of decimal arrays.
The primary focus of this study is on utilizing frame-level data, while video-level sources such as dictionaries of labels can be employed for supplementary exploratory analysis. To reduce the initial dataset size (estimated at 1.53 terabytes), only specified columns were selected, excluding “segment_end_times” because all segments in this dataset are precisely five seconds long.
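For illustration, the sketch below shows how the segment-level annotation columns listed above could be read from a TFRecord file with TensorFlow. The field names mirror the columns mentioned earlier, while the dtypes and the file name are assumptions and may need adjusting for the actual YouTube-8M files.

```python
# Hedged sketch: reading segment-level annotations from a YouTube-8M-style TFRecord file.
# Field names follow the columns listed above; dtypes and file name are assumptions.
import tensorflow as tf

context_spec = {
    "id": tf.io.FixedLenFeature([], tf.string),
    "labels": tf.io.VarLenFeature(tf.int64),
    "segment_start_times": tf.io.VarLenFeature(tf.int64),
    "segment_labels": tf.io.VarLenFeature(tf.int64),
    "segment_scores": tf.io.VarLenFeature(tf.float32),
}

def parse_context(serialized):
    # Each record is a SequenceExample; only the video-level context part is needed here.
    context, _ = tf.io.parse_single_sequence_example(serialized, context_features=context_spec)
    return {k: (tf.sparse.to_dense(v) if isinstance(v, tf.SparseTensor) else v)
            for k, v in context.items()}

# "train0000.tfrecord" is a placeholder file name.
dataset = tf.data.TFRecordDataset(["train0000.tfrecord"]).map(parse_context)
for rec in dataset.take(1):
    print(rec["id"].numpy(), rec["segment_start_times"].numpy())
```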
Following the retrieval of annotation data, additional cleaning and aggregation procedures were performed to establish a reliable reference set that varied in terms of video dynamism. Consequently, the dataset was divided into three equally sized subsets based on the number of annotations:
  • Videos with fewer than five annotations (with a minimum threshold of three annotations for this study);
  • Videos with medium annotation counts (ranging between five and seven annotations); and
  • Videos with higher annotation counts (exceeding seven annotations).
This process reduced the initial dataset of approximately 1.1 million videos to a final set of 450 video recordings, totalling around 60 h of content. The individual video durations ranged from 15 s to 15 min, with an average length of 8 min.
Frame-level data can be accessed directly from the YouTube-8M website, Google Cloud Storage, or via a Python 2 script available at “data.yt8m.org”. Video metadata are organized into TFRecord files, with each file containing 287 video entries. For illustration purposes, an example of available data was utilized from the “The 3rd YouTube-8M Video Understanding Challenge” dataset hosted on Kaggle [35]. Since features will be extracted independently without relying on pre-existing datasets, it became necessary to retrieve specific videos by their unique identifiers within the dataset. To accomplish this task, a script from the “youtube-8m-video-frames” repository was employed [36]. This tool utilizes “youtube-dl” to download a specified number of videos based on predefined categories, regardless of anonymized video identifiers. An alternative approach would involve direct retrieval using IDs obtained by querying the page “data.yt8m.org/2/j/i/prefix/id.js”, where “prefix” denotes the first two characters of an identifier. However, this method exhibits limited accessibility for other TFRecord files due to its dependency on specific URL structures.
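As a rough illustration of the ID-based retrieval mentioned above, the following sketch queries the “data.yt8m.org” endpoint for a hypothetical anonymized identifier. The response format (a short JavaScript snippet containing the anonymized and real identifiers as quoted strings) is an assumption, so the parsing may need adjusting.

```python
# Hedged sketch of resolving an anonymized YouTube-8M identifier to a real YouTube ID
# via the data.yt8m.org lookup page described above. Response parsing is an assumption.
import re
import requests

def resolve_yt8m_id(anon_id: str) -> str:
    url = f"https://data.yt8m.org/2/j/i/{anon_id[:2]}/{anon_id}.js"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    quoted = re.findall(r'"([^"]+)"', resp.text)
    return quoted[-1] if quoted else ""   # last quoted token assumed to be the real video ID

real_id = resolve_yt8m_id("nXaB")          # hypothetical anonymized identifier
print(f"https://www.youtube.com/watch?v={real_id}")
```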

3.2. Feature Engineering

The objective of this section is to extract informative audio and video features commonly used in sound and image classification as well as grouping processes for analyzing and searching information within scenes. These are considered as two distinct modalities whose application is illustrated through a conceptual framework presented in Figure 1, with detailed descriptions provided in subsequent sections.
Through several of the aforementioned studies, it has been observed that manually extracted low-level features can capture both perceptual and harmonic information from audio signals. These features have broad applicability and directly contribute to the creation of feature vectors using statistical parameters such as mean values and standard deviations, which are commonly applied in event recognition tasks. For this reason, a set of segment-based audio features is generated for each audio segment extracted via “ffmpeg” using the “pyAudioAnalysis” library. According to the outlined procedure, feature extraction for each segment initially occurs on a short-term basis, with each short-term frame described by 34 base features (68 once their first-order deltas are appended) spanning the temporal, frequency, and cepstral domains. The duration of short-term frames may vary between 10 and 200 milliseconds depending on specific requirements. The selected features implemented within the “pyAudioAnalysis” library are detailed in Table 1.
According to the described procedure, for each one-second audio segment, a sequence of ten 68-dimensional feature vectors is extracted from short-term frames. These vectors are used to compute segment-level statistics as the final representation: for each segment (comprising multiple short-term frames with corresponding short-term feature vectors), two statistical parameters, mean and standard deviation, are derived. Consequently, each audio segment utilizes a total of 2 × 68 = 136 audio statistics.
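A minimal sketch of this aggregation step is given below, assuming the standard pyAudioAnalysis short-term extraction interface with delta features enabled; the 100 ms window and step are illustrative choices within the 10–200 ms range stated above.

```python
# Illustrative sketch (not the authors' exact script): extracting short-term audio features
# with pyAudioAnalysis and summarizing a one-second segment by per-feature mean and standard
# deviation, yielding the 2 x 68 = 136-dimensional representation described above.
import numpy as np
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

def segment_audio_vector(wav_path, win=0.1, step=0.1):
    fs, signal = audioBasicIO.read_audio_file(wav_path)
    signal = audioBasicIO.stereo_to_mono(signal)
    # With deltas enabled, each 100 ms frame yields a 68-dimensional feature vector.
    feats, names = ShortTermFeatures.feature_extraction(
        signal, fs, int(win * fs), int(step * fs), deltas=True)
    # feats has shape (68, n_frames); aggregate over the frames of the segment.
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])  # shape (136,)

vec = segment_audio_vector("segment_0001.wav")   # placeholder file name
print(vec.shape)  # (136,)
```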
Beyond extracting audio features from the audio signal of each video recording, a substantial amount of visual information has been extracted due to the significance of this modality during the summarization process. The “multimodal_movie_analysis” library was employed to extract 88 visual features from individual frames sampled at a rate of five frames per second, including the following:
  • 45 colour-related features, including: eight-class histogram of red, green and blue colour values, respectively, eight-class histogram of grayscale intensity values, five-class histogram of the ratio between maximum and mean values for each RGB channel, and eight-class histogram of saturation/intensity values
  • Average absolute difference between two consecutive frames in grayscale (one feature)
  • Two features for face recognition using the OpenCV-based implementation of the Viola-Jones algorithm: count of detected faces and average ratio of boundaries created around all faces within a frame relative to the total size of the frame
  • Three features for optical flow estimation, calculated using the Lucas-Kanade method: mean magnitude of flow vectors, standard deviation of flow vector angles, and the ratio of these two features, which can indicate camera movement or tilt probability
  • One feature for frame duration: Predefined within the library to return the number of frames composing a single second
  • 36 object recognition-related features, derived using the Single Shot Multibox Detector: total number of detected objects, average confidence level of detection, mean ratio of object area relative to the frame size, and recognition across twelve object categories (“person”, “vehicle”, “nature”, “animal”, “tool/equipment”, “sport”, “kitchen”, “food”, “furniture”, “electronics”, “devices/appliances”, and “enclosed space”).
These features provide a comprehensive range of scene representation at low (simple colour nuances), medium (optical flow dynamics), and high (presence of objects and faces) levels. They also enable flexible identification of “informative” segments of video, allowing users to prioritize specific features based on application requirements while uncovering which visual cues correlate most strongly with the informativeness or uninteresting nature of a segment. It is important to note that during feature extraction for each video segment in the dataset, five distinct feature vectors are generated and aggregated through averaging. This ensures that every one-second segment is directly synchronized with its corresponding audio feature vector.
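The sketch below illustrates, with plain OpenCV, a handful of the descriptors listed above (grayscale histogram, frame difference, sparse Lucas-Kanade flow statistics, and Viola-Jones face counts). The actual extraction relies on the “multimodal_movie_analysis” library, so bin counts and function choices here are illustrative only.

```python
# Minimal OpenCV sketch of a few of the visual descriptors described above.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def frame_features(prev_gray, gray, frame_area):
    # 8-bin grayscale histogram (normalized).
    hist = cv2.calcHist([gray], [0], None, [8], [0, 256]).flatten()
    hist /= hist.sum() + 1e-8

    # Mean absolute difference between consecutive grayscale frames.
    frame_diff = float(np.abs(gray.astype(np.float32) - prev_gray.astype(np.float32)).mean())

    # Sparse Lucas-Kanade optical flow on corner points: mean magnitude and angle std.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100, qualityLevel=0.01, minDistance=7)
    flow_mag_mean, flow_ang_std = 0.0, 0.0
    if pts is not None:
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        good = status.ravel() == 1
        if good.any():
            d = (nxt[good] - pts[good]).reshape(-1, 2)
            mags = np.linalg.norm(d, axis=1)
            angs = np.arctan2(d[:, 1], d[:, 0])
            flow_mag_mean, flow_ang_std = float(mags.mean()), float(angs.std())

    # Viola-Jones face detection: face count and mean face-to-frame area ratio.
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    n_faces = len(faces)
    face_ratio = float(np.mean([(w * h) / frame_area for (_, _, w, h) in faces])) if n_faces else 0.0

    return np.concatenate([hist, [frame_diff, flow_mag_mean, flow_ang_std, n_faces, face_ratio]])

# Toy usage on synthetic frames.
prev = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
curr = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
print(frame_features(prev, curr, frame_area=240 * 320).shape)
```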
The integration of extracted audio and video features from segments can significantly enhance data comprehension and the identification of feature correlations aimed at optimizing model performance. Two widely recognized approaches for combining information are early fusion and late fusion. In early fusion, diverse sources of information are merged prior to being fed into the model, resulting in an augmented dataset that is processed differently by the model due to its increased size, thereby enabling the detection of more patterns. Conversely, late fusion involves combining information after model processing at higher levels within the methodology framework, aiming to create a more robust model or integrate outputs from distinct models. This study employs early fusion, wherein audio and video feature data are synchronized and integrated before undergoing model processing, leading to the formation of a matrix comprising 224 features.
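A minimal sketch of this early-fusion step is shown below; the per-video matrices are placeholders, and only the concatenation into a 224-dimensional representation reflects the procedure described above.

```python
# Early fusion as used here: segment-level audio statistics (136-dim) and averaged visual
# descriptors (88-dim) are concatenated before training, giving 224 features per segment.
import numpy as np

audio_feats = np.random.rand(480, 136)   # placeholder: one row per one-second segment
video_feats = np.random.rand(480, 88)    # placeholder: synchronized visual descriptors
fused = np.hstack([audio_feats, video_feats])
print(fused.shape)  # (480, 224)
```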

3.3. Machine Learning Models

Machine learning and artificial intelligence have emerged as preeminent fields within computer science, serving as central themes in predictive modelling and data mining. Machine learning constitutes a subset of artificial intelligence that enables computational systems to “learn” from data through specific algorithms and methods, allowing them to identify patterns or common combinations upon which conclusions are drawn for classification, prediction, summarization, and similar tasks. Applied models rely exclusively on the experience accumulated through training algorithms, aiming to enhance generated outcomes. The proliferation of data crawling processes and Big Data sources has provided machine learning algorithms with substantial volumes of data, thereby improving model accuracy in applications such as user or account classification, motion detection, facial recognition, email filtering, and similar domains. An overview of the types of machine learning, categories, and commonly used algorithms is presented in Figure 2.
The process of training and validating models involves intricate mathematical frameworks that process input data and enhance decision-making capabilities as well as system performance. While there exists no universally accepted definition of machine learning, it can be most appropriately described as a methodical approach to creating and continuously refining mathematical models within a dataset in order to extract meaningful results and identify patterns [37].
The categorization of video segments into informative or non-informative constitutes a binary classification problem. Due to its demonstrated effectiveness and widespread application in prior research as well as established best practices, the following algorithms will be implemented and compared:
  • Naive Bayes Classifier
  • K-Nearest Neighbours (KNN) algorithm
  • Logistic Regression
  • Decision Tree Algorithm
  • Random Forest Algorithm
  • XGBoost (eXtreme Gradient Boosting) Classifier.
The application of any algorithm or model in machine learning also requires careful selection of hyperparameters to enhance performance and robustness. Each algorithm contains a set of parameters that are not determined during training but instead adjusted manually by the user, typically based on input data characteristics or hardware resource constraints. Following model creation, a fine-tuning process is conducted to identify the optimal combination of hyperparameters for the selected classifier, which may lead to improved results. The search for an optimal combination is commonly automated through iterative testing or exploration of predefined values, reducing the likelihood of human error, algorithmic bias, and enabling heuristic approaches [38].
Hyperparameter optimization can be categorized into distinct methods depending on the chosen algorithm. The most common approaches include random search and exhaustive (grid) search, as illustrated in Figure 3. Random search evaluates various randomly generated combinations within user-defined value ranges for each hyperparameter across a specified number of iterations. In contrast, exhaustive or grid search systematically tests all possible parameter combinations. While grid search also relies on user-defined values for each hyperparameter, it lacks iteration limits, which may result in longer execution times for complex models or large-scale applications. Assigning hyperparameters as distributions has been identified as an effective practice during the fine-tuning of selected algorithms, ensuring adaptability to diverse data scenarios [38].
In this study, grid search was employed for certain classifiers to develop a more robust model with enhanced performance, which directly influences the final outcome of video summarization. This approach ensures systematic exploration of hyperparameter combinations, contributing to improved accuracy and reliability in identifying key segments within multimedia content.
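A hedged sketch of this tuning procedure with scikit-learn's GridSearchCV is given below. The classifier and parameter grid are illustrative placeholders rather than the exact values documented in Table 3, and the synthetic data only serves to make the example self-contained.

```python
# Illustrative grid search over a Random Forest classifier; grids and data are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the fused 224-dimensional segment features and binary labels.
X_train, y_train = make_classification(n_samples=2000, n_features=224, random_state=42)

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "class_weight": ["balanced", None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",   # macro F1 is the headline metric reported in this study
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```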

4. Experimental Setup and Evaluation

The following section outlines a multimodal approach for supervised video summarization, which belongs to the category of summarization referred to as “video skimming” (or rapid video browsing). This method involves generating a temporally compressed version of longer videos by identifying their significant segments. The analysis of input data is conducted through audio and video feature matrices extracted from one-second segments. These segments are classified as “informative” (i.e., sufficiently interesting to be included in the final summary) or “non-informative” (i.e., lacking content relevant to the summary). This classification is achieved using a supervised binary classifier, whose training can be based on audio features, video features, or combined modalities.

4.1. Feature and Classifier Selection

Each video is represented by audio and video feature vectors corresponding to one-second segments. As previously described, each segment can be classified as “informative” (i.e., sufficiently relevant to contribute to the final summary) or “non-informative” (i.e., lacking content suitable for inclusion in the summary). This distinction clearly defines a binary classification task.
To accurately classify video segments based on their informativeness, multiple classifiers were trained using three distinct categories of features:
  • Audio features: 136-dimensional sound feature vector
  • Video features: 88-dimensional video feature vector
  • Audio-visual features: combined representation comprising 224 dimensions (based on early fusion methodology)
The dataset was divided into training and testing subsets for all three modal configurations using an 80/20 ratio at the video-level granularity. While segment-based partitioning is theoretically possible, such an approach could result in overlapping frames between training and test sets, thereby introducing bias into model evaluation. The final sample sizes for training and testing are detailed in Table 2.
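A minimal sketch of such a video-level split is shown below, assuming scikit-learn's GroupShuffleSplit with one video identifier per segment row; the feature and label arrays are placeholders.

```python
# Video-level 80/20 split: segments from the same video never appear in both subsets.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.rand(1000, 224)                 # placeholder fused segment features
y = np.random.randint(0, 2, size=1000)        # placeholder informativeness labels
video_ids = np.repeat(np.arange(50), 20)      # placeholder: 50 videos x 20 segments each

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=video_ids))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(len(train_idx), len(test_idx))
```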
For the Naive Bayes classifier, K-nearest neighbours (KNN), logistic regression, and decision tree classifiers, implementations from [39] were utilized with appropriate parameter adjustments. Specifically, the KNN classifier’s “k” parameter, representing the number of neighbours, was optimized through grid search. In logistic regression, the inverse regularization strength parameter “C” was calibrated. The decision tree classifier was optimized by adjusting the criterion for evaluating data splits (e.g., Gini impurity or entropy) and the maximum depth of trees. The random forest classifier was based on a balanced implementation from [40], while the XGBoost classifier utilized an adaptation from [41]. All classifiers were subjected to parameter optimization using grid search, with the modified parameters documented in Table 3.
Following the extraction of informative segments, one prominent challenge in generating a final summarized version of video content is the lack of continuity or irregularities during frame transitions caused by excessive segmentation or interruptions. To mitigate this issue, a median filter was applied to ensure smoothness in the sequence of predicted classifications. The entire training and processing workflow comprises three primary stages:
  • Determination of audio, video, and combined features for each segment of the video
  • Classification of each segment using one of the specified classifiers
  • Post-processing of sequential classifier predictions to eliminate obvious errors or disruptions.
The post-processing stage employs a two-step filtering pipeline. The first step involves applying a median filter with length N1 to smooth the initial sequence of predictions through localized temporal windows. Subsequently, the filtered results are passed to a hard filter that retains only sequences of consecutive positive predictions (i.e., informative segments) with a minimum duration of N2. In practical terms, N2 defines the threshold for the minimum frame length considered as informative. A graphical representation of this process is presented in Figure 4.
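A minimal sketch of this two-step filtering pipeline is given below, assuming SciPy's median filter for the first step; the values of N1 and N2 are illustrative only.

```python
# Sketch of the post-processing: median smoothing of the binary prediction sequence (N1),
# then removal of "informative" runs shorter than N2 segments.
import numpy as np
from scipy.signal import medfilt

def smooth_predictions(preds, n1=5, n2=3):
    preds = np.asarray(preds, dtype=int)
    smoothed = medfilt(preds, kernel_size=n1)        # n1 must be odd
    out = smoothed.copy()
    start = None
    for i, v in enumerate(np.append(smoothed, 0)):   # sentinel 0 closes a trailing run
        if v == 1 and start is None:
            start = i
        elif v == 0 and start is not None:
            if i - start < n2:                       # run shorter than N2: discard it
                out[start:i] = 0
            start = None
    return out

print(smooth_predictions([0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0], n1=3, n2=3))
```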

4.2. Evaluation Metrics

The selected evaluation metrics for the classifiers include:
  • Precision for the positive class: measures the proportion of segments classified as “informative” that correspond to actual informative segments according to ground truth labels
  • Recall for the positive class: quantifies the proportion of informative segments identified as such by the classifier relative to all truly informative segments in the dataset
  • F1 score: represents the macro average of individual F1 scores for each class, computed as the harmonic mean of precision and recall values per class. This metric serves as a normalized measure of overall classification performance
  • Overall accuracy: calculates the total percentage of segments (both negative and positive) that are correctly classified by the system
  • Area under the receiver operating characteristic curve (AUC): functions as a general indicator of classifier performance, illustrating its ability to distinguish between classes across varying decision thresholds for the positive class output
Among the aforementioned metrics, the F1 score and overall accuracy provide general measures of classification performance. However, the F1 score is more appropriate when addressing class imbalance, as it accounts for both precision and recall values across all classes. In contrast, precision and recall specific to the positive class serve as indicative measures that reflect the classifier’s behaviour at a particular decision threshold. For example, a precision of 50% indicates that half of the detected segments are genuinely informative, while a recall of 60% means that six out of every ten truly informative segments in the video are successfully retrieved. The area under the ROC curve (AUC) offers additional utility by quantifying the classifier’s ability to distinguish between two classes irrespective of the chosen probability threshold, thereby providing insights into its overall discriminative capacity across varying operational conditions.
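For reference, the sketch below shows how these metrics could be computed with scikit-learn on placeholder predictions; y_score stands for the positive-class probabilities required by the AUC.

```python
# Placeholder computation of the segment-level evaluation metrics listed above.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)       # ground-truth informativeness labels
y_score = rng.random(500)                   # positive-class probabilities
y_pred = (y_score >= 0.5).astype(int)       # hard predictions at a 0.5 threshold

print("precision (positive class):", precision_score(y_true, y_pred))
print("recall (positive class):   ", recall_score(y_true, y_pred))
print("macro F1:                  ", f1_score(y_true, y_pred, average="macro"))
print("overall accuracy:          ", accuracy_score(y_true, y_pred))
print("AUC:                       ", roc_auc_score(y_true, y_score))
```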

4.3. Integration into an End-to-End Summarisation Pipeline

The classification benchmark acquires its practical significance when integrated into a complete summarization pipeline that transforms raw video recordings into coherent summaries. In this design, the classifier is not an isolated analytical tool but rather the decisive element that governs which segments of the video are preserved and which are discarded, thereby shaping the narrative structure of the final output. Figure 5 illustrates the proposed multimodal video summarization and captioning pipeline, highlighting the sequential processing stages and supporting analysis modules that operationalize this process. The pipeline proceeds through a sequence of well-defined stages. The video is first segmented into fixed one-second intervals, providing a balance between temporal resolution and computational efficiency. A preprocessing module extracts frames and prepares audio signals for subsequent analysis, establishing a uniform temporal granularity. For each segment, multimodal descriptors are extracted, combining auditory and visual features that jointly characterize its informational content. These descriptors are then passed to the trained classification model, which assigns each segment a binary label denoting whether it is informative or non-informative. Because raw predictions often exhibit sporadic fluctuations, a temporal smoothing step based on median filtering is introduced to impose local consistency and prevent abrupt transitions. Finally, consecutive informative segments are merged into key clips, which form the structural backbone of the generated summary and serve as input for downstream modules responsible for textual or subtitle generation.
This integration illustrates several strengths of the approach. The reliance on multimodal descriptors enables resilience across a wide spectrum of video types, from static scenes with limited motion to dynamic, audio-rich recordings. The use of temporal refinement ensures that the resulting sequence of clips maintains continuity and avoids the disjointedness typical of purely frame-level methods. The summary and captioning stage then generates descriptive textual subtitles for the curated clips using a sequence-to-sequence encoder–decoder architecture with attention, supporting both greedy and beam search decoding strategies. Furthermore, the modular nature of the pipeline makes it adaptable: alternative classifiers, feature sets, or smoothing strategies can be introduced without altering the overall workflow, allowing the system to evolve with advances in methodology.
Nevertheless, several limitations remain inherent to the proposed design:
  • Segment granularity: the use of fixed one-second intervals does not always coincide with semantic event boundaries, occasionally resulting in over-segmentation or the omission of short but meaningful actions.
  • Computational overhead: while multimodal feature extraction substantially enhances classification accuracy, it also increases processing demands, which may constrain scalability in real-time or resource-limited environments.
  • Sensitivity to variability: rapid scene transitions, background noise, or abrupt changes in visual dynamics can still introduce inconsistencies, making additional refinement or adaptive mechanisms necessary.
Supporting modules provide additional rigour. Feature importance analysis via Recursive Feature Elimination (RFE) ranks the 224 features to identify the most discriminative attributes, while comprehensive benchmarking evaluates both model performance (accuracy, F1-score, AUC) and operational metrics (latency, GPU/CPU utilization, memory footprint), offering a holistic view of system efficiency under practical conditions. Taken together, these considerations illustrate a system that balances methodological rigour with practical feasibility. By uniting multimodal classification, temporal refinement, key-clip assembly, and caption generation within a single architecture, the pipeline is able to transform raw classifier outputs into summaries that remain semantically coherent while maintaining computational efficiency.
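The feature-ranking module can be reproduced with scikit-learn’s RFE implementation. The following sketch assumes a feature matrix X with 224 columns, binary segment labels y, and a list of feature names; the estimator settings are illustrative rather than the exact configuration used in the study.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

def rank_features(X, y, feature_names, n_top=10):
    """Iteratively eliminate features and return the n_top most discriminative ones."""
    estimator = RandomForestClassifier(n_estimators=100, random_state=42)
    selector = RFE(estimator, n_features_to_select=n_top, step=1).fit(X, y)
    return [name for name, keep in zip(feature_names, selector.support_) if keep]
```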

5. System Implementation

As the second stage of the summarization pipeline, a simple web application was developed to generate subtitles for the informative video segments selected by the classifier described in Section 3 and Section 4. The application takes the key clips identified by the classifier and produces a sequence of subtitles using greedy and beam search decoding. The project comprises three distinct components:
  • Training and Testing Scripts/Jupyter Notebooks implemented in Python 3.7.9: These scripts are designed to train and evaluate models, ultimately saving the most recent model state after successful training for subsequent reuse or integration into server-side operations
  • Server-Side Components (Backend): Implemented using Flask, this backend handles file uploads through a POST endpoint. After storing the video, it segments the input into fixed intervals, extracts frames, and computes the full set of multimodal features. These features are passed to a pre-trained Random Forest classifier for segment classification, followed by median filtering to refine predictions. Representative keyframes are then selected from informative segments and forwarded to the captioning model, which generates textual descriptions using greedy or beam search decoding. Both the classifier and captioning model are loaded from pre-trained checkpoints to ensure efficiency and consistency across deployments, and
  • Client-Side Interface (Frontend): Built with ReactJS, this frontend consists of a single page where users can upload videos to the server and view the generated subtitles
The primary purpose of each component is briefly outlined in the following subsections.

5.1. Script for Model Training and Evaluation

Within the Jupyter Notebook designed for training and testing the two approaches outlined in this work, a sequence of steps is executed:
  • Library Integration: The script begins by importing essential libraries, including TensorFlow, matplotlib, scikit-learn, and other relevant dependencies required for data manipulation, visualization, and model development
  • Data Acquisition and Preparation: The COCO dataset, which contains images and corresponding annotations/descriptions, is downloaded and preprocessed. The ‘tf.keras.utils.get_file’ function is utilized to retrieve the dataset files, which are subsequently stored in the “train2014” directory. Annotations are organized separately within the “annotations” folder
  • Data Loading and Preprocessing: The script reads image annotations and file paths from the downloaded datasets. Initial and final labels are appended to each description, and the data is shuffled randomly. A maximum number of examples (“num_examples”) is set, followed by selection of a subset of data for training purposes
  • Image Processing: A function named “load_image” is defined to read and process images using TensorFlow utilities. The InceptionV3 model is employed to extract image features, which are then saved as NumPy arrays for subsequent use in the model pipeline
  • Description Tokenization: The ‘Tokenizer’ class from the ‘tf.keras.preprocessing.text’ library is employed to tokenize the textual descriptions and generate a vocabulary dictionary. The size of the vocabulary is restricted to the top k most frequent words, while sequences are padded to ensure uniform length using the ‘pad_sequences’ function
  • Dataset Splitting: The dataset is partitioned into training and test subsets using the ‘train_test_split’ function from the scikit-learn library. This step facilitates model evaluation by isolating a distinct portion of data for validation purposes
  • TensorFlow Dataset Creation: A TensorFlow dataset is constructed from the training data, with the ‘map’ function applied to each image-path and annotation pair to execute the ‘load_image’ function. The resulting dataset undergoes shuffling, batching, and prefetching to optimize computational efficiency during model training
  • Model Definition: Two architectures are defined: ‘CNN_Encoder’ and ‘RNN_Decoder’. The ‘CNN_Encoder’ employs a dense layer for encoding image features extracted from the InceptionV3 model, whereas the ‘RNN_Decoder’ incorporates an embedding layer and a gated recurrent unit (GRU) to generate descriptive text. Additionally, a custom class named ‘BahdanauAttention’ is implemented to handle attention mechanisms within the decoding process (a condensed sketch of these components is provided after this list)
  • Loss Function and Optimization Definition: The sparse categorical cross-entropy loss function is selected for model training, complemented by the Adam optimizer to adjust parameters during the learning process
  • Model Training: A training step function is defined using the ‘@tf.function’ decorator. This function iterates over data batches, performs forward and backward passes through the models, computes the loss, and updates weights accordingly. The training loop executes for a predefined number of epochs, with periodic checkpoints to save the model’s state at critical intervals. The evolution of the average training loss over time is illustrated in Figure 6.
  • Evaluation: Two evaluation functions are implemented: ‘evaluate2’ generates descriptions using greedy search, while ‘evaluate’ employs beam search to produce multiple candidate descriptions and selects the most probable one. These functions leverage trained models to generate textual descriptions for input images
  • Attention Visualization: The attention weights of the model are visualized for a given image and description through the ‘plot_attention’ function. This aids in interpreting how the model focuses on specific regions of the input during prediction, with a visual example provided in Figure 7.
  • Testing and Final Evaluation: A training loop is executed to print loss values for each epoch, followed by a demonstration of description generation using sample images. This step validates the model’s performance and confirms its ability to produce coherent outputs consistent with the input data
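A condensed sketch of the model components referenced in the list above is given below. It follows the standard TensorFlow encoder–decoder captioning pattern with Bahdanau attention; layer sizes and exact tensor shapes are illustrative and may differ from the trained configuration.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 64, embedding_dim) from the encoder; hidden: (batch, units)
        hidden_with_time = tf.expand_dims(hidden, 1)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time))
        attention_weights = tf.nn.softmax(self.V(score), axis=1)
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights

class CNN_Encoder(tf.keras.Model):
    def __init__(self, embedding_dim):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        # x: pre-extracted InceptionV3 features, shape (batch, 64, 2048)
        return tf.nn.relu(self.fc(x))

class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer="glorot_uniform")
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, x, features, hidden):
        context_vector, attention_weights = self.attention(features, hidden)
        x = self.embedding(x)                                    # (batch, 1, embedding_dim)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        return self.fc2(x), state, attention_weights             # logits over the vocabulary

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
```

During training, the decoder consumes the previous target token at each step while the sparse categorical cross-entropy loss is minimized with the Adam optimizer, as outlined above.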

5.2. Server-Side Application

The implementation of the Flask application includes a REST API with a single primary endpoint “/process” through which users can submit videos for analysis. The output is returned as a JSON object comprising four fields:
  • “beam_search_X” (where X denotes the number of results or word indices used as predictions) or “greedy”—a structure where keys correspond to static resources for accessing frames described by their values, and;
  • “beam_search_caption” or “greedy_caption”—strings that consolidate all values within the respective “beam_search_X” or “greedy” objects into a single textual summary.
Input data is transmitted via a form containing two fields:
  • “file”—the video requiring processing;
  • “beam_index”—the number of results and word indices selected for prediction.
Upon receiving a request, the system first handles file upload and storage, after which the video is segmented into one-second intervals. For each segment, frames are extracted and the full set of 224 multimodal features is computed. These features are then passed to the pre-trained Random Forest classifier, the same model identified as optimal in the benchmarking study, which predicts whether a segment is informative or non-informative. A median filtering step is subsequently applied to smooth the prediction sequence and ensure temporal consistency. From the resulting continuous sequences of informative segments, representative keyframes are selected and forwarded to the captioning model. This model, employing either greedy or beam search decoding, generates descriptive textual output for each selected frame. This ensures that the generative captioning model is applied exclusively to segments preselected by the classifier, reinforcing the dependency between the two stages and ensuring the captions describe only the most relevant content. The final response is returned as a JSON object that contains both the generated subtitles and references to the corresponding keyframes. To improve efficiency, both the classifier and the captioning model are loaded from pre-trained checkpoints stored in the “ckpt” directory, avoiding the need for retraining. This design is particularly advantageous when computational resources are limited, when training and inference are performed on separate systems, or when deploying a pre-validated model. Additional auxiliary functions were implemented to adapt the image-based captioning pipeline to video input, ensuring that frame extraction, filtering, and clustering remain compatible with the temporal characteristics of the data. An example of the application’s output, along with the selected frames used for captioning, is shown in Figure 8.
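The request-handling logic can be summarized with the following sketch. It is not a verbatim copy of the deployed backend: the helper functions (segment_video, extract_features, select_keyframes, generate_caption) stand in for the corresponding pipeline stages, and the checkpoint filename is hypothetical.

```python
import os
from flask import Flask, request, jsonify
from joblib import load
from scipy.signal import medfilt

app = Flask(__name__)
# Hypothetical checkpoint path; the classifier is loaded once at start-up
clf = load(os.path.join("ckpt", "random_forest.joblib"))

@app.route("/process", methods=["POST"])
def process():
    video = request.files["file"]
    beam_index = int(request.form.get("beam_index", 3))
    path = os.path.join("uploads", video.filename)
    video.save(path)                                        # 1. store the uploaded file

    segments = segment_video(path, seconds=1)               # 2. one-second segmentation
    features = [extract_features(s) for s in segments]      # 3. 224 multimodal features per segment
    preds = medfilt(clf.predict(features), kernel_size=3)   # 4. classification + median smoothing

    keyframes = select_keyframes(segments, preds)           # 5. frame URL -> representative frame
    captions = {url: generate_caption(frame, beam_index)    # 6. beam-search captioning
                for url, frame in keyframes.items()}

    return jsonify({
        f"beam_search_{beam_index}": captions,
        "beam_search_caption": " ".join(captions.values()),
    })
```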

5.3. Client-Side Application

The client-side implementation utilizes a web interface developed with ReactJS, enabling users to submit specific videos, review the generated descriptions, and examine subtitles for each distinct informative segment. This interface is structured around a single component termed “VidCap”, which integrates two core functionalities:
  • A function named “handleBeamChange” that updates the index value associated with beam search parameters, and;
  • An event handler “onChangeFile” responsible for retrieving the selected file and corresponding index, constructing a FormData object, initiating a POST request to the “/process” endpoint, and dynamically updating component content based on the server’s response containing the final description and frame-specific subtitles.
The layout and behaviour of this interface are shown in Figure 9, which depicts the home page of the application. Figure 10 presents an example of the generated subtitles for a shorter video clip, illustrating the mapping between selected keyframes and the textual summary produced by the backend pipeline. The component architecture ensures seamless interaction between user input and backend processing by maintaining state synchronization through controlled updates triggered by the server’s output. This design emphasizes real-time feedback and data consistency during video analysis workflows.

6. Results

6.1. Classification Performance Evaluation and Feature Importance Analysis

Table 4 presents the calculated values of the area under the ROC curve (AUC) and F1 scores for six distinct classification methods and three modalities (audio, video, and combined). The Random Forest classifier emerges as the most effective approach, achieving the highest AUC value and one of the top-performing F1 results, as illustrated in Figure 11. The significance of AUC in this context lies in its ability to capture the classifier’s discriminative capacity across varying decision thresholds for the positive class. Notably, classifiers incorporating visual modalities demonstrate an average improvement of 6% compared to those relying exclusively on audio-based features.
For the optimal classification model (Random Forest), combinations of the post-processing parameters N1 and N2 were evaluated. The parameter pair (3, 5) yielded a relative performance enhancement of approximately 3%. Analysis of the final metric values presented in Table 5 reveals an inverse relationship between precision and recall when this procedure is applied. This trade-off is further illustrated in Figure 12, which compares macro-average precision and recall scores for the combined classifier using only video data. The behaviour is anticipated, particularly due to the influence of parameter N2, which increases confidence in detecting extended informative sequences despite potential noise, at the expense of overlooking shorter segments with too few consecutive positive predictions.
The identification of the features most influential in the performance of the selected classifier, and hence in video content summarization, constitutes an essential component of this research. Application of the Recursive Feature Elimination (RFE) algorithm enabled systematic ranking of features through iterative removal, resulting in the extraction of the ten most significant audio and video attributes from a total of 224 available features, as detailed in Table 6. The analysis of these results reveals several key insights. Among the ten most influential features, three pertain to audio characteristics, specifically spectral and delta-based metrics. These features align with the nature of audio signals, as they quantify variations in sound intensity or pitch—elements that naturally capture user attention and enhance the informativeness or engagement of video segments. The remaining seven features are derived from video processing. Of these, three relate to motion dynamics at the frame level, including optical flow and frame duration, indicating that abrupt movements, camera repositioning, or scene transitions contribute significantly to segment informativeness. The other four features are associated with histograms of saturation values and grayscale intensity distributions, effectively categorizing visual outputs into distinct states such as extremely dark imagery, mildly illuminated scenes, low saturation (nearly monochrome visuals), and high saturation (vivid colours). This suggests that frames characterized by either exceptionally vibrant or unusually dim and desaturated visuals may be perceived as more engaging or curiosity-inducing by users. Overall, these findings emphasize the interplay between perceptual relevance and feature selection in video summarization tasks [34].

6.2. System Analysis During Load Testing

To evaluate the runtime performance of the proposed video summarization system under realistic operating conditions, we instrumented the full processing pipeline and web request handling infrastructure to capture detailed metrics during execution. The system was deployed on a workstation equipped with 32 GB of RAM, an NVIDIA RTX 4070 GPU, and an AMD Ryzen 5 5600X CPU, and was subjected to a load of 1000 sequential web-based captioning requests. Each request processed a distinct HD-resolution video between 30 s and 10 min in length and passed through the entire inference pipeline. The following metrics were recorded and analyzed (a minimal instrumentation sketch is provided after the list):
  • Step-wise Latency: Execution time was measured independently for each processing stage: saving the uploaded video (save), extracting multimodal features for segment classification (extract_features), performing segment classification (classify), clustering similar frames from the informative segments (cluster), selecting representative keyframes (drop), and generating captions for these keyframes (caption). This breakdown enables identification of dominant computational bottlenecks across the entire two-stage pipeline.
  • Total Request Time: The cumulative time from receiving a request to completing its classification and captioning response was recorded for each video. This end-to-end latency metric serves as the primary indicator of user-perceived responsiveness and system throughput under production-like conditions.
  • GPU Utilization: The percentage of GPU usage was logged for each request throughout the process. This metric captures the intensity of model inference and feature extraction workloads offloaded to the GPU, and provides insight into accelerator saturation and resource efficiency.
  • CPU Utilization: The corresponding CPU load was measured concurrently, reflecting the computational effort expended on orchestration, preprocessing, clustering, and token-level decoding steps that remain CPU-bound in the current implementation.
  • Memory Utilization: The percentage of system memory usage was tracked per request to assess runtime memory consumption, particularly during the handling of high-resolution video frames and intermediate feature representations.
  • Requests Per Second (RPS): For each individual request, system throughput was calculated as the reciprocal of the total request time. This metric characterizes the processing rate of the deployed service and is critical for estimating real-time scalability under concurrent access.
  • Caption Generation Time: The duration of the captioning phase was measured separately for different decoding configurations, including greedy decoding and beam search with varying widths. This enables quantitative comparison of decoding strategies in terms of speed-performance trade-offs.
  • Caption Length: The number of words in the generated caption was recorded for each request and decoding strategy. This metric serves as an indirect proxy for linguistic diversity and descriptiveness, and facilitates analysis of how decoding strategies influence output verbosity.
  • Beam Width Trade-off: For each beam width setting, the average caption generation time and average caption length were aggregated. This joint metric illustrates the trade-off between computational cost and potential gains in caption quality or expressiveness when increasing decoding depth.
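The per-stage timings and hardware counters can be gathered with standard Python tooling; the sketch below illustrates one possible instrumentation, assuming psutil for CPU/memory and pynvml for GPU utilization (these library choices, like the helper name, are assumptions rather than a description of the exact code used).

```python
import time
from contextlib import contextmanager

import psutil
import pynvml

pynvml.nvmlInit()
_GPU = pynvml.nvmlDeviceGetHandleByIndex(0)

@contextmanager
def timed(stage, record):
    """Record wall-clock latency and resource usage for one pipeline stage."""
    start = time.perf_counter()
    yield
    record[stage] = {
        "latency_s": time.perf_counter() - start,
        "cpu_pct": psutil.cpu_percent(interval=None),
        "mem_pct": psutil.virtual_memory().percent,
        "gpu_pct": pynvml.nvmlDeviceGetUtilizationRates(_GPU).gpu,
    }

# Usage inside the request handler (stage names as in Figure 14):
# metrics = {}
# with timed("extract_features", metrics):
#     features = extract_features(segments)
# total_time = sum(m["latency_s"] for m in metrics.values())
# rps = 1.0 / total_time   # per-request throughput
```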
The request throughput, measured in requests per second (RPS), is visualized in Figure 13a. The average RPS was recorded at 0.02 with a peak value of 0.14, corresponding to approximately one complete captioning operation every 50 s. The standard deviation of ±0.02 indicates moderate variance, with the graph showing pronounced fluctuations and intermittent spikes in throughput. These peaks likely reflect occasional reductions in request complexity or faster-than-average execution paths. However, the overall low throughput suggests that the system is currently optimized for offline or low-concurrency scenarios, rather than real-time captioning or multi-user environments. Given the average video length of 315.53 s (±165.95 s), the latency introduced by the current RPS could significantly impact user experience in time-sensitive applications. Enhancing model efficiency through batch processing, asynchronous I/O, or GPU acceleration could substantially improve throughput and reduce latency under load, making the system more viable for interactive deployments.
The latency characteristics of the system are further detailed in Figure 13b, which plots the total end-to-end processing time per request. The mean total request time was 98.55 s, with a standard deviation of ±55.55 s and a maximum of 272.79 s. The graph reveals substantial variability across requests, with frequent deviations from the mean and several pronounced latency spikes. A red horizontal line marks the global mean, providing a visual anchor for identifying outliers and performance bottlenecks. The most significant contributor to overall latency is the feature extraction stage, which alone averages 83.37 s (±47.08 s) and peaks at 233.00 s. This step dominates the total request time and represents the most critical target for optimization.
Figure 13c provides insights into the system’s internal hardware behaviour during execution. GPU utilization averaged 50.96% (±24.13%), with intermittent peaks reaching full saturation at 100%. This suggests that while the captioning and feature extraction stages do leverage GPU acceleration, there are notable idle periods or inefficiencies between pipeline stages that prevent sustained utilization. CPU usage averaged 34.19% (±17.33%), with bursts up to 86.71%, indicating moderate load across requests. The variability and sub-saturation levels imply that certain stages—such as clustering or classification—may be CPU-bound but not consistently demanding. Memory utilization remained stable, averaging 55.62% (±11.55%) and peaking at 93.80%, which reflects adequate provisioning under the current workload. The graph shows frequent fluctuations across all three metrics, reinforcing the need for pipeline-level optimizations to improve resource consistency and throughput. Techniques such as stage parallelism, asynchronous execution, and GPU memory reuse could help reduce idle time and enhance overall system efficiency.
Figure 14 presents a comparative breakdown of processing times across the captioning pipeline stages: save, extract_features, classify, cluster, drop, and caption. The extract_features stage is the most time-consuming, averaging 83.37 s with a high variance (±47.08 s) and peaking at 233.00 s, indicating substantial GPU load and variability in video complexity. This step dominates the overall latency profile and is the primary bottleneck for throughput. The classify and caption stages follow, with mean times of 6.24 s and 3.33 s, respectively, both showing moderate variance and occasional spikes. These stages involve model inference and sequence generation, and while GPU-accelerated, they still contribute significantly to end-to-end latency. The save stage, averaging 2.92 s ± 1.64 s, reveals unexpected I/O overhead, possibly due to serialization or disk contention. The cluster stage, at 2.08 s ± 1.18 s, reflects CPU-bound frame grouping logic and remains a non-negligible contributor. The drop stage is lightweight and consistent (0.62 s ± 0.35 s), serving as a deterministic filtering step.
Together, these results reveal a system architecture that performs reliably under load, with consistent resource utilization and predictable behaviour across requests. The pipeline’s primary bottleneck is the extract_features stage, which dominates total latency and exhibits high variance. Secondary contributors include classify, caption, and save, each showing measurable impact on end-to-end performance. Resource usage remains within acceptable bounds, with GPU, CPU, and memory utilization fluctuating but never consistently saturated. While the system is well-suited for offline and batch processing, its current latency profile limits applicability in real-time or high-concurrency environments. The breakdown of stage-wise timings and resource patterns provides a clear roadmap for architectural refinement, depending on deployment goals.
On the other hand, Figure 15 provides a detailed comparison of captioning strategies using greedy and beam search decoding, highlighting the interplay between output length, variability, and computational cost. Figure 15a illustrates the distribution of caption lengths across decoding configurations. Greedy decoding, which deterministically selects the most probable token at each step, produces the shortest captions, with an average length of 11.48 tokens and limited variance (±1.81). This reflects a constrained linguistic output, as the model prioritizes high-probability sequences over diversity. Beam search, in contrast, considers multiple candidate sequences simultaneously, enabling more complex and expressive captions. As beam width increases, mean caption length grows steadily—from 14.90 tokens for beam width 3, to 15.52 for beam width 5, and 16.16 for beam width 10. The broadening of the distributions with wider beams indicates greater variability in output lengths, suggesting that beam search not only produces longer captions but also accommodates a wider range of syntactic structures and descriptive content.
Figure 15b,c illustrate the computational trade-offs associated with increasing beam width. Figure 15b shows that caption generation time rises steadily with beam width, from 0.51 s for greedy decoding to 1.11 s for beam = 10, reflecting the additional computation required to explore multiple hypotheses. Figure 15c presents a combined view of latency and caption length, revealing a nonlinear relationship: although longer captions are produced with wider beams, the rate of increase diminishes beyond beam width 5. This saturation effect highlights a point of diminishing returns, where additional computational effort contributes minimally to descriptive richness. The results suggest that, while extremely wide beams may occasionally yield the longest captions, they incur disproportionately high inference costs, which may be impractical in real-time applications.
Taken together, these findings underscore the importance of carefully selecting decoding strategies to balance linguistic expressiveness and computational efficiency. Beam widths in the range of 3–5 appear to represent an optimal compromise, achieving significant improvements in caption length and variability while maintaining moderate latency (0.67–0.80 s). Moreover, the increase in distribution variance with wider beams indicates that beam search supports richer semantic coverage, enabling captions that capture more nuanced aspects of the image content. By analyzing both output characteristics and decoding performance, these results provide practical guidance for configuring sequence generation systems to achieve a desirable trade-off between quality and efficiency.
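To make the decoding trade-off concrete, the sketch below shows a minimal beam search over a decoder’s per-step log-probabilities. It is generic rather than the exact implementation used in the service: it assumes a step(tokens) callable that returns log-probabilities for the next token, together with start and end token identifiers.

```python
import numpy as np

def beam_search(step, start_id, end_id, beam_width=5, max_len=20):
    """Return the highest-scoring token sequence under a log-probability model."""
    beams = [([start_id], 0.0)]                       # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:                     # finished hypotheses are carried over
                candidates.append((seq, score))
                continue
            log_probs = step(seq)                     # shape: (vocab_size,)
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]

# Greedy decoding corresponds to beam_width = 1.
```

Wider beams expand more candidate continuations per step, which is consistent with the roughly linear growth in decoding time observed in Figure 15b.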
In conclusion, the runtime evaluation of the video captioning system demonstrates robust and stable performance under realistic sequential load, effectively leveraging GPU acceleration for inference and feature extraction. However, the feature extraction stage emerges as the primary bottleneck, consuming the majority of processing time and limiting overall scalability. This highlights a critical area for optimization through parallelization techniques, hardware offloading, or algorithmic redesign, potentially incorporating more efficient GPU kernels or deep learning–based feature encoders to reduce latency and improve throughput. Memory management remains stable, though there is room to reduce the runtime memory footprint through more efficient video frame handling and feature representation compression. Smarter frame sampling strategies, such as content-adaptive keyframe selection or motion-aware filtering, could reduce the number of frames subjected to expensive processing without compromising caption quality. Additionally, memory-mapped I/O and zero-copy data transfers between CPU and GPU could minimize data movement overhead, contributing to overall system efficiency [25].
The analysis of decoding strategies reveals a trade-off between caption richness and computational cost. Wider beam searches yield longer and more diverse captions, but the benefits diminish beyond moderate beam widths due to increasing decoding time. Adaptive decoding approaches, such as dynamic beam width adjustment, early stopping, or hybrid beam-sampling methods, could more efficiently balance linguistic expressiveness with latency constraints, enabling flexible tuning based on application requirements. Beyond these targeted bottlenecks, the system could benefit from architectural innovations such as asynchronous, pipeline-parallel processing and batch-based GPU workloads to enhance concurrency and hardware utilization. Complementary improvements in model compression, memory management, and content-adaptive frame selection would collectively bolster scalability and responsiveness. Altogether, these strategic advancements offer a clear roadmap for evolving the system from a stable, interactive prototype into a performant, real-time captioning service capable of supporting diverse deployment scenarios.

7. Concluding Remarks and Future Work

In contemporary contexts characterized by constant and ubiquitous usage of multimedia storage and transmission services, maintaining competitive advantage in a market marked by increasing competition demands a thorough understanding of factors influencing user satisfaction. Challenges faced by such systems include developing robust algorithms for video suggestion, summarization, analysis, filtering, and classification tasks. These challenges involve precise processing of multimodal data (audio, visual, and combined), mitigating inherent subjectivity in task execution, and addressing scalability issues when analyzing large datasets. Recent advancements in machine learning, computer vision, and natural language processing offer promising pathways to resolve these challenges.
This study investigated the concept of multimodal video summarization through various machine learning models and algorithms, examining their key characteristics, the significance of employed attributes and components, and potential outcomes with or without additional preprocessing steps. Integrating multiple modalities enables a comprehensive understanding of video content, facilitating efficient summarization techniques that are readily adaptable to web service frameworks. The work also demonstrated the integration of an informative segment selection model into a practical, end-to-end web-deployed pipeline. The system identifies relevant video segments using fused audio-visual features and generates descriptive subtitles for the selected segments. Evaluation of the pipeline reveals a solid foundation in operational stability and resource utilization, while highlighting specific technical challenges. Notably, the feature extraction and segment clustering steps represent key bottlenecks, limiting latency reduction and throughput scaling. Addressing these challenges will require parallelized algorithms, hardware-accelerated implementations, and asynchronous or batch processing to improve concurrency and efficiency. Adaptive decoding mechanisms offer additional potential, enabling dynamic trade-offs between video processing efficiency, caption quality, and computational resources.
Developing such systems necessitates trade-offs in temporal and hardware resources, alongside careful attention to data collection and preprocessing. Nonetheless, the efficiency and adaptability of the integrated approach, combined with potential extensions, can significantly enhance user experience across platforms. Future work will further strengthen the integrated system along several dimensions. Expanding the dataset to include more diverse videos and incorporating textual modalities, such as video descriptions, comments, and subtitles, can improve summarization quality and semantic richness. Preprocessing and feature extraction may be replaced or augmented by deep neural networks, and pre-trained language models like BERT can support both initial text processing and final subtitle generation. End-to-end training of the pipeline represents a particularly promising direction, aligning segment selection and description generation objectives, potentially through reinforcement learning strategies where the reward is based on the quality of the final summary text.
Beyond model-level improvements, enhancing robustness and generalization will be critical, including handling noisy audio, low-resolution video, or domain shifts. Methods such as domain adaptation, data augmentation, and adversarial training can mitigate these challenges. Incorporating explainable AI techniques will provide insight into which features influenced segment selection, supporting transparency and user trust. User-centric evaluation, including A/B testing and subjective assessment of caption relevance, will ensure that system improvements translate into meaningful enhancements in real-world settings. Scalability can be further advanced through cloud-native, distributed, or serverless architectures, enabling efficient processing of large-scale video libraries. Expanding capabilities to multilingual and cross-cultural content will increase accessibility and relevance, while integration with personalized recommendation systems can tailor summaries to individual user preferences. Finally, ethical and privacy considerations must guide future development, ensuring that content summarization respects user data and mitigates bias.
Future work may also explore user-centred evaluation, ranging from small-scale user studies to larger-scale assessments, including subjective evaluations of subtitle accuracy, coherence, and usefulness. Such evaluations would provide critical insights into how well the generated summaries meet user expectations at different levels of deployment, complementing the current objective metrics and guiding further refinements toward real-world usability. Collectively, these enhancements aim to create a scalable, efficient system optimized not only for speed but also for semantically rich, coherent, and user-aligned captioning. By combining robust multimodal analysis, flexible pipeline design, and continuous improvements in model, data, and evaluation strategies, the prototype can evolve into a production-ready platform capable of meeting emerging demands in automated multimedia understanding.

Author Contributions

Conceptualization, E.M. and E.K.; methodology, E.M., E.K., N.B. and N.Ž.; software, E.M., E.K. and E.T.; validation, E.M., E.K., N.Ž., N.B. and E.T.; formal analysis, E.M., E.K., N.Ž., N.B., E.T. and S.V.; investigation, E.M., E.K. and E.T.; resources, E.M., E.K., N.Ž., E.T. and S.V.; data curation, E.M., E.K., N.Ž., E.T. and S.V.; writing—original draft preparation, E.M., E.K. and E.T.; writing—review and editing, E.M., E.K., N.Ž., E.T. and S.V.; visualization, E.M., E.K., N.B. and E.T.; supervision, E.M., E.K., N.Ž., N.B., E.T. and S.V.; project administration, E.M., E.K. and N.Ž.; funding acquisition, E.K., N.Ž., N.B. and S.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to further ongoing closed research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. The Social Shepherd. 23 Essential YouTube Statistics You Need to Know in 2023. Available online: https://thesocialshepherd.com/blog/youtube-statistics (accessed on 23 June 2023).
  2. Furini, M.; Ghini, V. An audio-video summarization scheme based on audio and video analysis. In Proceedings of the 3rd IEEE Consumer Communications and Networking Conference, Las Vegas, NV, USA, 8–10 January 2006. [Google Scholar] [CrossRef]
  3. Money, A.G.; Agius, H. Video summarisation: A conceptual framework and survey of the state of the art. J. Vis. Commun. Image Represent. 2008, 19, 121–143. [Google Scholar] [CrossRef]
  4. Xiong, Z.; Radhakrishnan, R.; Divakaran, A.; Rui, Y.; Huang, T.S. A Unified Framework for Video Summarization, Browsing, and Retrieval. In A Unified Framework for Video Summarization, Browsing and Retrieval; Academic Press: Cambridge, MA, USA, 2006; pp. 221–235. [Google Scholar] [CrossRef]
  5. Lai, P.K.; Decombas, M.; Moutet, K.; Laganiere, R. Video summarization of surveillance cameras. In Proceedings of the 2016 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Colorado Springs, CO, USA, 23–26 August 2016. [Google Scholar] [CrossRef]
  6. Priya, G.G.L.; Domnic, S. Medical video summarization using central tendency-based shot boundary detection. Int. J. Comput. Vis. Image Process. 2013, 3, 55–65. [Google Scholar] [CrossRef]
  7. Trinh, H.; Li, J.; Miyazawa, S.; Moreno, J.; Pankanti, S. Efficient UAV video event summarization. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012) IEEE, Tsukuba, Japan, 11–15 November 2012. [Google Scholar]
  8. Spyrou, E.; Tolias, G.; Mylonas, P.; Avrithis, Y. Concept detection and keyframe extraction using a visual thesaurus. Multimed. Tools Appl. 2009, 41, 337–373. [Google Scholar] [CrossRef]
  9. Li, Y.; Merialdo, B.; Rouvier, M.; Linares, G. Static and dynamic video summaries. In Proceedings of the 19th ACM International Conference on Multimedia–MM ’11, ACM Press, Scottsdale, AZ, USA, 28 November–1 December 2011. [Google Scholar] [CrossRef]
  10. Lienhart, R.; Pfeiffer, S.; Effelsberg, W. The MoCA workbench: Support for creativity in movie content analysis. In Proceedings of the 3rd IEEE International Conference on Multimedia Computing and Systems, IEEE Computer Society Press, Hiroshima, Japan, 17–23 June 1996. [Google Scholar] [CrossRef]
  11. Chen, B.-C.; Chen, Y.-Y.; Chen, F. Video to text summary: Joint video summarization and captioning with recurrent neural networks. In Proceedings of the British Machine Vision Conference 2017, British Machine Vision Association, London, UK, 4–7 September 2017. [Google Scholar] [CrossRef]
  12. Zhou, K.; Qiao, Y.; Xiang, T. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar] [CrossRef]
  13. Zhang, K.; Chao, W.-L.; Sha, F.; Grauman, K. Video summarization with long short-term memory. In Computer Vision–ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Cham, Switzerland, 2016. [Google Scholar] [CrossRef]
  14. Evangelopoulos, G.; Zlatintsi, A.; Potamianos, A.; Maragos, P.; Rapantzikos, K.; Skoumas, G.; Avrithis, Y. Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Trans. Multimed. 2013, 15, 1553–1568. [Google Scholar] [CrossRef]
  15. Wei, H.; Ni, B.; Yan, Y.; Yu, H.; Yang, X.; Yao, C. Video summarization via semantic attended networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar] [CrossRef]
  16. Pantazis, G.; Dimas, G.; Iakovidis, D.K. Salsum: Saliency-based video summarization using generative adversarial networks. arXiv 2020, arXiv:2011.10432. [Google Scholar] [CrossRef]
  17. Potapov, D.; Douze, M.; Harchaoui, Z.; Schmid, C. Category-Specific Video Summarization. In Computer Vision–ECCV 2014; Springer: Cham, Switzerland, 2014; Volume 8694, pp. 540–555. [Google Scholar] [CrossRef]
  18. Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. TVSum: Summarizing web videos using titles. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Boston, MA, USA, 7–12 June 2015. [Google Scholar] [CrossRef]
  19. Gygli, M.; Grabner, H.; Riemenschneider, H.; Van Gool, L. Creating Summaries from User Videos. In Computer Vision–ECCV 2014; Springer: Cham, Switzerland, 2014; Volume 8695, pp. 505–520. [Google Scholar] [CrossRef]
  20. Lee, Y.J.; Ghosh, J.; Grauman, K. Discovering important people and objects for egocentric video summarization. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  21. de Avila, S.E.F.; Lopes, A.P.B.; da Luz, A., Jr.; de Albuquerque Araujo, A. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recogn. Lett. 2011, 32, 56–68. [Google Scholar] [CrossRef]
  22. Wang, S.; Zhang, J. MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment. arXiv 2025, arXiv:2506.10430. [Google Scholar] [CrossRef]
  23. He, B.; Wang, J.; Qiu, J.; Bui, T.; Shrivastava, A.; Wang, Z. Align and Attend: Multimodal Summarization with Dual Contrastive Losses. arXiv 2023, arXiv:2303.07284. [Google Scholar] [CrossRef]
  24. Xu, Z.; Meng, X.; Wang, Y.; Su, Q.; Qiu, Z.; Jiang, X.; Liu, Q. Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video. arXiv 2023, arXiv:2305.04824. [Google Scholar] [CrossRef]
  25. Xie, J.; Chen, X.; Zhao, S.; Lu, S.-P. Video summarization via knowledge-aware multimodal deep networks. Knowl. Based Syst. 2024, 293, 111670. [Google Scholar] [CrossRef]
  26. Kashid, S.; Awasthi, L.K.; Berwal, K.; Saini, P. Spatiotemporal Feature Fusion for Video Summarization. IEEE MultiMedia 2024, 31, 88–97. [Google Scholar] [CrossRef]
  27. Yang, Z.; He, J.; Toda, T. Multi-Modal Video Summarization Based on Two-Stage Fusion of Audio, Visual, and Recognized Text Information. In Proceedings of the 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE, Macau, China, 3–6 December 2024; pp. 1–6. [Google Scholar] [CrossRef]
  28. Yu, L.; Zhao, X.; Xie, L.; Liang, H.; Liang, R. Hierarchical Multi-Modal Video Summarization with Dynamic Sampling. IET Image Process. 2024, 18, 4577–4588. [Google Scholar] [CrossRef]
  29. Alaa, T.; Mongy, A.; Bakr, A.; Diab, M.; Gomaa, W. Video Summarization Techniques: A Comprehensive Review. arXiv 2024, arXiv:2410.04449. [Google Scholar] [CrossRef]
  30. Chen, B.; Zhao, X.; Zhu, Y. Personalized Video Summarization by Multimodal Video Understanding. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM ’24), ACM, Boise, ID, USA, 21–25 October 2024; pp. 4382–4389. [Google Scholar] [CrossRef]
  31. Lee, M.J.; Gong, D.; Cho, M. Video Summarization with Large Language Models. arXiv 2025, arXiv:2504.11199. [Google Scholar] [CrossRef]
  32. Pang, Z.; Otani, M.; Nakashima, Y. Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization. arXiv 2025, arXiv:2503.09027. [Google Scholar] [CrossRef]
  33. Guo, Y.; Xing, J.; Hou, X.; Xin, S.; Jiang, J.; Terzopoulos, D.; Jiang, C.; Liu, Y. CFSum: A Transformer-Based Multi-Modal Video Summarization Framework with Coarse-Fine Fusion. arXiv 2025, arXiv:2503.00364. [Google Scholar] [CrossRef]
  34. Psallidas, T.; Spyrou, E. Video Summarization Based on Feature Fusion and Data Augmentation. Computers 2023, 12, 186. [Google Scholar] [CrossRef]
  35. Kaggle. YouTube-8M Video Understanding Challenge 2019. Available online: https://www.kaggle.com/c/youtube8m-2019 (accessed on 27 July 2025).
  36. gsssrao. youtube-8m-videos-frames: YouTube-8M Videos, Frames, and IDs Generator. Available online: https://github.com/gsssrao/youtube-8m-videos-frames (accessed on 27 July 2025).
  37. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014; pp. 73–86. [Google Scholar] [CrossRef]
  38. Obradović, S.; Milošević, B.; Štrbac, P.S. MATLAB i mašinsko učenje [MATLAB and Machine Learning]. In Proceedings of the XIII Međunarodni Naučno-Stručni Simpozijum INFOTEH—JAHORINA, Istočno Sarajevo, Bosnia and Herzegovina, 19–21 March 2014. [Google Scholar]
  39. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar] [CrossRef]
  40. Lemaître, G.; Nogueira, F.; Aridas, C.K. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 2017, 18, 559–563. [Google Scholar] [CrossRef]
  41. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
Figure 1. Conceptual framework of the feature extraction process for audio and video modalities.
Figure 2. Types of machine learning, categories, and algorithms.
Figure 3. Random and grid search: circles represent all selected hyperparameter values, while green circles indicate the best results.
Figure 4. Example of post-processing a noisy signal using a median filter (N1 = 3) to smooth local predictions and a smoothing filter (N2 = 5) that enforces a minimum duration for informative segments, compared with the original sine wave.
Figure 5. Multimodal video summarization pipeline showing stages from segmentation and feature extraction to classification, temporal post-processing, summary assembly, and caption generation, with supporting analysis modules for feature importance and benchmarking.
Figure 6. Change in the average loss value of the model across different training epochs.
Figure 7. Example of attention weight visualization for each token in the description.
Figure 8. Example of the end-to-end summarization pipeline. The system first identified informative segments using the benchmarked Random Forest classifier. Three representative frames from these segments were then passed to the captioning model to generate the descriptive subtitles shown.
Figure 9. Home page of the application.
Figure 10. Example of generated subtitles for a shorter video clip.
Figure 11. ROC curve comparing Random Forest and XGBoost classifiers, with the diagonal dotted line indicating random-chance performance.
Figure 12. Macro-average precision (a) and recall (b) for the combined Random Forest classifier using only video data, with the orange line showing the kernel density estimate.
Figure 13. Runtime performance metrics of the captioning system over 1000 sequential web-based requests. (a) Requests per second (RPS), indicating system throughput; (b) Total end-to-end request times, including the mean latency trendline; (c) System resource utilization measured per request, showing GPU, CPU, and memory load percentages. These metrics collectively illustrate the runtime stability, latency characteristics, and hardware usage patterns under operational load.
Figure 14. Latency distribution of individual processing steps within the captioning pipeline. Box plots represent the per-request execution times for the save, extract_features, classify, cluster, drop, and caption stages across 1000 sequential requests. The extract_features stage exhibits the highest variability and dominates total processing time, indicating the key area for optimization in the end-to-end inference workflow.
Figure 15. Comparative analysis of captioning strategies using greedy and beam search decoding. (a) Caption length distributions across decoding configurations, illustrating the tendency of wider beam widths to produce longer and more variable captions; (b) Caption generation times per image, showing increased latency as a function of beam width; (c) Trade-off between decoding time and caption length as beam width increases, highlighting diminishing returns in linguistic richness beyond beam width 5.
Table 1. Selected short-term audio features.
No. | Feature Name | Description
1 | Zero-crossing rate | Measure of sign changes in signal values during the frame duration
2 | Signal energy | Sum of squared signal values, normalized relative to the frame length
3 | Energy entropy | Measure of abrupt changes in normalized subframe energy entropies
4 | Spectral centroid | Centre of mass of the frequency distribution
5 | Spectral spread | Second-order central moment of the spectrum
6 | Spectral entropy | Entropy of normalized spectral energies for a given set of subframes
7 | Spectral flux | Square difference between normalized magnitudes of two consecutive frames
8 | Spectral roll-off | Frequency below which a specified percentage (typically 90%) of the magnitude spectrum is concentrated
9–21 | Mel-frequency cepstral coefficients (MFCCs) | Cepstral representation of sound using the Mel frequency scale
22–33 | Chromagram (Pitch Class Profile—PCP) | Twelve-dimensional Gaussian distribution of spectral energy, described by a vector of means and covariance matrix
34 | PCP deviation | Standard deviation of the aforementioned twelve chromagram coefficients
Table 2. Training and test subsets.
Subset | Video Count | Segment Count | Average Video Duration
Training Set | 360 | 164,232 | 07:35
Test Set | 90 | 51,768 | 09:37
Table 3. Modified hyperparameters for classifiers.
Classifier | Hyperparameter | Best Value
Logistic Regression | C | 0.1
KNN | k | 5
Decision Tree | criterion | entropy
Decision Tree | max_depth | 6
Random Forest | criterion | gini
Table 4. Classification results across audio, video, and combined modalities.
Classifier | Area Under ROC Curve (AUC): Audio / Video / Combined | F1 Macro Average: Audio / Video / Combined
Random Selection | 52.3% | 50.2%
Naïve Bayes | 62.2% / 65.3% / 64.7% | 53.2% / 50.8% / 54.1%
KNN | 61.0% / 64.7% / 65.3% | 57.7% / 59.9% / 60.4%
Logistic Regression | 66.0% / 71.8% / 70.4% | 43.9% / 47.6% / 52.5%
Decision Tree | 63.2% / 70.6% / 69.8% | 44.1% / 48.1% / 48.6%
Random Forest | 69.0% / 74.4% / 74.2% | 60.5% / 63.7% / 64.2%
XGBoost | 68.7% / 69.5% / 72.7% | 63.0% / 63.1% / 66.1%
Table 5. Final results of Random Forest model application with post-processing.
Parameter Combination (N1–N2) | Precision for Positive Class | Recall | F1 Macro Average | Overall Precision
Without Parameters | 44.6% | 74.5% | 64.2% | 64.3%
3–3 | 46.3% | 73.4% | 65.2% | 66.2%
3–5 | 47.6% | 70.8% | 67.5% | 69.1%
5–3 | 45.2% | 73.3% | 66.1% | 66.3%
5–5 | 45.9% | 73.8% | 66.5% | 66.5%
Table 6. Ten most significant features for segment classification.
Feature Name | Description | Modality
spectral_flux_mean | Mean value of spectral flux | Audio
delta spectral_spread_std | Delta standard deviation of spectral spread | Audio
delta mfcc_5_std | Delta standard deviation of the 5th Mel-Frequency Cepstral Coefficient (MFCC 5) | Audio
hist_v0 | First class of grey value histogram | Video
hist_v3 | Fourth class of grey value histogram | Video
hist_s1 | Second class of saturation value histogram | Video
hist_s5 | Sixth class of saturation value histogram | Video
frame_value_diff | Difference in frame values | Video
mag_std | Standard deviation of magnitude flow | Video
shot_durations | Duration of current frame | Video
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

