1. Introduction
During the last few years and mainly due to the rise of social media and video sharing websites, there has been an exponential increase in the amount of user-generated audio visual content. The average user captures and shares several aspects of her/his daily life moments, such as (a) personal videos, e.g., time spent with friends and/or family and hobbies; (b) activity videos, e.g., sports and other similar activities; (c) reviews, i.e., sharing opinions regarding products, services, movies etc.; and (d) how-to videos, i.e., videos created by users in order to teach other users to fulfill a task. Of course, apart from those that are creating content, a plethora of users daily consume massive amounts of such content. Notably, YouTube users daily watch more than 1 billion hours of visual content, while creating and uploading more than 500 h [
1]. Of course, these numbers are expected to further increase within the next few years, resulting in overexposure to massive amounts of data, which in turn may prohibit users from capturing relevant information. The latter becomes more difficult in case of lengthy content, thus necessitating aid to this task [
1].
Video summarization is a promising solution to the aforementioned problem, aiming to extract the most relevant segments from a given video, in order to create a shorter, more informative version of the original video, which is engaging, while preserving its main content and context [
2]. Specifically, a generated summary typically consists of a set of representative frames (i.e., the “keyframes”) or a set of video fragments. These parts should be kept in their original order, while the summary should be of a much shorter duration than the original video, while including its most relevant elements. Video summarizing applications include efficient browsing and retrieval of visual art movies (such as films and documentaries), TV shows, medical videos, surveillance videos, and so forth.
Video summarization techniques may be categorized into four main categories [
1], which differ based on their output, i.e., the actual summary that is delivered to its end user [
2]. Specifically, these categories are [
3,
4,
5,
6] (a) a collection of video frames (keyframes), (b) a collection of video segments, (c) graphical cues, and (d) textual annotations. Summaries belonging to (a) and (b) are frequently referred to as “static” and “dynamic”, respectively. Note that a dynamic summary preserves the audio and the motion of videos, whereas a static summary consists of a collection of still video frames. In addition, graphical cues are rarely employed in conjunction with other methodologies. As expected, users tend to prefer dynamic summaries over static ones [
7]. Video summarization techniques may be also categorized as (a) unimodal approaches, i.e., those that are based only on the visual content of the video, and (b) multimodal approaches, i.e., those that are using more than one of the available modalities, such as audio, textual, semantic (i.e., depicted objects, scenes, people, etc.) video content [
8]. Depending on the training approach that is used, summarization techniques may be categorized as (a) supervised, i.e., those that are based on datasets that have been annotated by human annotators, in either a per-frame or a per-fragment basis; (b) unsupervised, i.e., those that do not rely on some kind of ground truth data, but instead use a large corpus of available data so as to “learn” the important parts; and (c) weakly supervised, i.e., those that do not need exact, full annotations, but instead are based on weak labels, which are imperfect yet able to create powerful predictive models.
In this work, a supervised methodology for the creation of brief dynamic summaries of user-generated audiovisual content is proposed, which falls into a subcategory known as “video skimming” [
9,
10]; i.e., the goal is to produce an output video consisting of bits of the input video, structured so that the original temporal order is preserved. This sort of summary is critical since it allows for a better comprehension [
9] of the original video by its users. The herein presented video summarization approach faces the problem as a classification task. Specifically, the herein presented work relies on annotated ground truth on a per-segment basis. For this, a custom dataset that has been collected and annotated in the context of prior work [
1] is mainly used. A set of user-generated activity videos from YouTube was manually collected and was annotated by a collaborative annotation process, involving several users, which were asked to provide binary annotations on a per-fragment basis. Specifically, video segments of 1 s duration were annotated as being “informative” or “noninformative”; i.e., according the opinion of the human annotators, the former should be part of the summary, while the latter should be omitted. Each video was annotated by more than three users, while the “informativeness” of a given video was decided upon a voting procedure.
The approach for selecting the most informative parts of videos is as follows: It begins with the extraction of (a) handcrafted audio and visual features and (b) deep visual features. The former are extracted using well-known and widely used features, which have been proven reliable and effective in several classification tasks. For the extraction of the latter, the well-defined pretrained model VGG19 [
11] is used, which also has been successfully applied into a plethora of computer vision applications. Note that the proposed approach is multimodal, since the handcrafted features, apart from the visual properties of videos, also capture the aural properties, as well as semantic features (e.g., faces and objects). To produce the visual summaries, both handcrafted and deep feature representations are then fused, and experiments with a set of well-known supervised classification approaches are conducted. To verify the proposed methodology and to assess whether the deep features are able to provide a boost of performance over the use of solely handcrafted features, experiments on three datasets are presented. Specifically, apart from the aforementioned custom dataset that has been created, experiments using two well-known, publicly available datasets, namely, TVSum [
12] and SumMe [
13], are also presented, while the proposed approach is also compared with a plethora of state-of-the-art techniques. To further improve the outcome of the proposed approach, experiments with augmentation of the training dataset are conducted. Specifically, when experimenting with a given dataset, the remaining two are used as training data. The experimental results indicate that (a) deep features, when fused with handcrafted ones, are able to provide an increase of performance, and (b) augmentation of the training dataset in some cases also provides a significant performance boost.
The remainder of this work is organized as follows:
Section 2 includes recent relevant research works from the broader field of video summarization, focusing on those that use deep features. Then, in
Section 3, detailed descriptions of the video summarization datasets used in this study are presented. In
Section 4, the proposed classification methodology is presented, which comprises handcrafted audio and video extraction, deep feature extraction, and augmentation of a training dataset. Model training, experimental results, and comparisons with the state of the art are presented and discussed in
Section 5. Finally, conclusions and future work plans are presented in
Section 6.
2. Related Work
In our previous work [
1], a user-generated video summary method that was based on the fusion of audio and visual modalities was proposed. Specifically, the video summarization task was addressed as a binary, supervised classification problem, relying on audio and visual features. The proposed model was trained to recognize the “important” parts of audiovisual content. A key component of this approach was its dataset, which consists of user-generated, single camera videos and a set of extracted attributes. Each video included a per-second annotation indicating its “importance”.
The fundamental purpose of a video summarization approach, according to [
14], is to create a more compact version of the original video, without sacrificing much semantic information, while making it relatively complete for the viewer. In this work, the authors introduced SASUM, a unique approach that, in contrast with previous algorithms that focused just on the variety of the summary, extracted the most descriptive elements of the video while summarizing it. SASUM, in particular, comprised a frame selector and video descriptors to assemble the final video so that the difference between the produced description and the human-created description was minimized. In [
15], a user-attention-model-based strategy for keyframe extraction and video skimming was developed. Audio, visual, and linguistic features are extracted, and an attention model is created based on the motion vector field, resulting in the creation of a motion model. Three types of maps based on intensity, spatial coherence, and temporal coherence are created and are then combined to create a saliency map. A static model was also used to pick important backdrop regions and extract faces and camera attention elements. Finally, audio, voice, and music models were developed. To construct an “attention” curve, the aforementioned attention components were linearly fused. Keyframes were extracted from local maxima of this curve within pictures, whereas skim segments were chosen based on a variety of factors.
Based on deep features extracted by a convolutional neural network (CNN), the authors of [
16] trained a deep adversarial long short-term memory (LSTM) network consisting of a “summarizer” and a “discriminator” to reduce the distance between ground truth movies and their summaries. The former, in particular, was made up of a selector and an encoder that picked out relevant frames from the input video and converted them to a deep feature vector. The latter was a decoder that distinguished between “original” and “summary” frames. The proposed deep neural network aimed to deceive the discriminator by presenting the video summary as the original input video, thinking that both representations are identical. Otani et al. [
17] proposed a deep video feature extraction approach with the goal of locating the most interesting areas of the movie that are necessary for video content analysis. In [
18], the authors focused primarily on building a computational model based on visual attention for summarizing videos from television archives. In order to create a static video summary, their computer model employed a number of approaches, including face identification, motion estimation, and saliency map calculation. The above computational model’s final video summary was a collection of important frames or saliency pictures taken from the initial video.
The methodology proposed by [
19] used as input sequences original video frames and produced their projected significance scores. Specifically, they adopted a framework for sequence-to-sequence learning so as to formulate the task of summarization, addressing the problems of short-term attention deficiency and distribution inconsistency. Extensive tests on benchmark datasets indicated that the suggested ADSum technique is superior to other existing approaches. A supervised methodology for the automatic selection of keyframes of important subshots of videos is proposed in [
20]. These keyframes serve as a summary, while the core concept of this approach is the description of the variable-range temporal dependence between video frames using long short-term memory networks (LSTM), taking into consideration the sequential structure that is essential for producing insightful video summaries. In the work of [
21], the specific goal of video summarization was to make it easier for users to acquire movies by creating brief and informative summaries that are diverse and authentic to the original videos. To summarize movies, they used a deep summarization network (DSN), which selected the video frames to be included in summaries, based on probability distributions. Specifically, it forecast a probability per video frame, indicating how likely it is to be selected. Note that, within this process, labels were not necessary; thus, the DNS approach may operate completely unsupervised. In [
22], a unique video-summarizing technique called VISCOM was introduced, which is based on the color occurrence matrices from the video, which were then utilized to characterize each video frame. Then, from the most informative frames of the original video, a summary was created. In order to make the aforementioned video-summarizing model robust, VISCOM is tested on a large number of videos from a range of genres.
The authors of [
23] present a new approach to supervised video summarization using keyshots. They introduce a soft, self-attention mechanism that is both conceptually simple and computationally efficient. Existing methods in the field utilize complex bidirectional recurrent networks such as BiLSTM combined with attention, which are difficult to implement and computationally intensive compared with fully connected networks. In contrast, the proposed method employs a self-attention-based network that performs the entire sequence-to-sequence transformation in a single forward pass and backward pass during training. The results demonstrate that this approach achieves state-of-the-art performance on two widely used benchmarks in the field, namely, TVSum and SumMe.
Finally, the authors in [
24] introduce a novel method for supervised video summarization that addresses the limitations of existing recurrent neural network (RNN)–based approaches, particularly in terms of modeling long-range dependencies between frames and parallelizing the training process. The proposed model employs self-attention mechanisms to determine the significance of video frames. Unlike previous attention-based summarization techniques that model frame dependencies by examining the entire frame sequence, this method integrates global and local multihead attention mechanisms to capture different levels of granularity in frame dependencies. The attention mechanisms also incorporate a component that encodes the temporal position of video frames, which is crucial for video summarization. The results show that the model outperforms existing attention-based methods and is competitive with other top-performing supervised summarization approaches.
3. Datasets
In this section, the datasets that have been used for the experimental evaluation of this work are presented, specifically, (a) a custom dataset comprising user-generated videos obtained from YouTube [
1]; the SumMe dataset, which comprises videos of various sources, such as movies, documentaries, and sports [
13]; and (c) the TVSum dataset, which comprises videos of various genres, such as documentaries, vlogs, and egocentric videos [
12].
3.1. Custom Dataset
The first dataset that was utilized in this work was initially proposed in our previous work [
1] and consists of 409 user-generated videos that were collected from YouTube. In order to collect these videos, the following criteria were adopted: (a) absence of video edits and music scores over the original audio source of the videos and (b) single-camera configuration, e.g., an action camera or a smartphone’s camera. Specifically, it comprises videos from the following 14 video categories: car review, motorcycle racing, kayaking, climbing, fishing, spearfishing, snack review, skydiving, roller coasters, theme park review, downhill, mountain bike, survival in the wild, and primitive building.
This dataset was constructed motivated by the intention to propose a summarization approach that could be effectively applied to unedited, “raw” videos. Therefore, the majority of videos in this dataset are based on outdoor activities, such as action and extreme sports. It consists of 409 videos of a total duration of 56.3 h. Their duration ranges between 15 s and ~15 min, while their average duration is ~8.25 min.
The videos were annotated by 22 human annotators in order to construct the ground truth video summaries. The annotation was binary; users were asked to note the time intervals that they found informative/interesting. Note that within each video, a maximum number of informative intervals was not imposed, while annotators were able to freely label the timestamps of each informative time interval, making the whole process completely subjective. Each video was annotated by more than 1 annotator (typically by 3–4 annotators), while their opinions were aggregated upon using a simple majority-based aggregation rule. In particular, each segment of 1 s was considered to be included in the summary (i.e., characterized as “informative”) if at least 60% of the annotators agreed on that decision. Note that videos that either were annotated by less than 3 annotators or did not contain any informative segments were discarded. Upon this process, the dataset ended up comprising 336 videos, while it was split into two sets, i.e., a set of 268 training and a set of 68 test videos, comprising 127,972 and 31,113 samples, respectively.
3.2. Public Datasets
The SumMe dataset is a video summarization dataset introduced by Gygli et al. [
13]. Specifically, it comprises 25 videos from various sources, such as movies, documentaries, sports events, and online videos. Each video is encoded in MP4 format and has a varying duration. The dataset includes 390 human summaries, while it has been annotated by at least 15 users. Note that annotation is not binary, i.e., ground truth values per frame within the range 0–1. However, since the proposed methodology requires a binary annotation for each second of the video, the ground truth values required a conversion by applying a value of 0.6 as the agreed-upon threshold for aggregation; i.e., any values greater than 0.6 were transformed to 1, indicating informative parts, while values below the threshold were transformed to 0, indicating noninformative parts. Upon this process, SumMe comprises 2687 and 793 training and test samples, respectively.
The TVSum dataset was proposed by Sum et al. [
12] and comprises 50 videos of various genres, including news, “how-to”, documentaries, vlogs, and egocentric videos. Each video in the dataset is accompanied by 1000 annotations of shot-level importance scores obtained using a crowdsourcing approach, ending up with 20 annotators per video. Each video is divided into a set of 2 s shots, and scores for each shot are provided for each annotator. Specifically, the annotation is not binary, while users rate the informativeness of each shot compared with other shots from the same video, by providing a score ranging from 1 (low) to 5 (high). In this work, both 1 s halves of each 2 s segment were considered being of equal informativeness; i.e., the same score for both was used, and upon imposing a threshold of 3 (i.e., the median score value) and binarizing, each 1 s segment was ultimately marked as either “informative” or “noninformative”. Upon this process, TVSum comprises 10,454 and 2093 training and test samples, respectively.
4. Methodology
The goal of this work is to provide a solution to the problem of video summarization by constructing dynamic visual summaries from unedited, raw videos and extend the methodology that has been presented and evaluated in the context of our previous work [
1], wherein video summarization was first tackled as a supervised classification task. The proposed approach falls under the broad area of “video skimming”, aiming to select “interesting” fragments of the original video and merge them in order to create a temporally shortened version of it. Specifically, a given video stream is analyzed at a segment level; from each segment of 1 s duration, audio and visual features are extracted. These segments are then classified by supervised classification algorithms as being either “informative” (i.e., adequately interesting so as to be included in the produced video summary) or “noninformative” (i.e., not containing any important information, and thus should not be part of the produced video summary) [
1].
For this, supervised binary classifiers that are trained on feature representations of audio, visual, or combined modalities, which include handcrafted features fused with deep visual features, are used. Specifically, deep convolutional neural networks [
25] that have been trained on massive datasets and are currently considered as the most advanced technique for the problem of deep visual feature extraction have been applied to the problem at hand. Thus, the final feature vector is formed upon concatenation of both handcrafted audio and visual feature vectors and also the deep visual feature vector. Moreover, in order to train models so as to be more robust, a data augmentation approach, by incorporating features extracted from other datasets within the training process, is also proposed. In
Figure 1, a visual overview of the feature extraction and classification pipeline is presented.
4.1. Feature Extraction
As it has already been mentioned, in this work, two modalities of information extracted from audiovisual content, i.e., the audio and the visual modality, are used. Specifically, handcrafted features are extracted. These have been frequently used in audio and/or visual classification and clustering tasks, such as music information retrieval, auditory scene analysis, video classification, and picture retrieval, to accomplish a rich feature representation in both modalities, accompanied with deep visual features with the use of deep, pretrained models. Our intention was to add as many aspects of audio and visual information as possible. In the following, the proposed feature extraction methodology is presented in depth.
4.2. Audiovisual Handcrafted Feature Extraction
As it has already been mentioned, the goal of this work is to extract both aural and visual features and use them for summarization. Regarding the first modality, the goal is to extract both perceptual and harmonic information of audio signals using a set of handcrafted audio characteristics. Specifically, statistics such as mean and standard deviation of several features are used. These have been proved to be helpful in event recognition tasks, so as to generate feature vectors that capture a low-level representation of an entire audio segment. For this, segment-level audio features per audio clip are extracted using the
pyAudioAnalysis Python library [
26] and
ffmpeg (note:
https://ffmpeg.org, accessed on 15 July 2023).
Initially, short-term audio feature extraction is performed, and then segment-level feature statistics are calculated, so as to create the audio segment representation. Short-term processing is carried out for each segment of the audio signal, with each short-term window having 68 short-term features. The audio signal is divided into segment-level windows (either overlapping or nonoverlapping) (i.e., 34 features and 34 deltas). Segment windows may range from 0.5 s to several seconds, depending on what defines a homogeneous segment in each application area, whereas short-term windows commonly range from 10 to 200 ms. Time domain, frequency domain, and cepstral domain are the three categories of short-term features that are extracted; they are presented in
Table 1. Note that each short-term feature sequence’s mean and standard deviation comprise 136 features in total, which form the audio feature vector.
Lastly, for the visual modality, a wide range of visual features are extracted in order to capture the visual properties of video sequences. Since this modality is considered to be more important as it has been experimentally demonstrated in previous works [
1,
27], the goal here was to extract a richer representation compared with the audio modality. Specifically, every 0.2 s, 88 visual features from the corresponding frame are extracted; these are presented in
Table 2 and may be grouped depending on their feature level representation, such as, low (i.e., simple color aggregates), mid (i.e., optical flow features), and high (i.e., objects and faces).
4.3. Deep Visual Feature Extraction
In order to enhance the visual feature representation, deep visual features, by using already-trained deep neural networks, are extracted. Specifically, knowledge from transfer learning [
31] is a technique for utilizing previously trained models for some task into a new task. Tasks such as classification, regression, and clustering may benefit from transfer learning. Thus, since the proposed approach tackles video summarization as a classification task, a visual domain adaptation approach [
32] is adopted. The deep visual features are extracted upon analyzing the first frame of each 1 s segment of the video, depending on the frames per second (fps) retrieved from each video’s metadata.
The frame vector of each second of the video is given as input in the VGG19 pretrained model, which is a VGG model [
11] version that consists of 19 layers in total (16 convolution layers, 3 fully connected layers, 5 max pooling layers, and 1 SoftMax layer). In this work, the last 4 layers from VGG19 were removed in order to reduce the domain specification of the model [
33] and using the corresponding intermediate output of the network as the feature representation for the visual modality, resulting in a total of 4096 features.
4.4. Dataset Augmentation
The final step, in order to enhance the predictive power and the robustness of the proposed methodology, is the augmentation of the training dataset. This was achieved by incorporating additional data from the SumMe and the TVSum datasets. This augmentation step resulted in broadening the diversity of training data, contributing to the overall generalization ability of the model. The training datasets used in case of SumMe and TVSum summarization tasks were also augmented. Specifically, both datasets were partitioned into distinct training and testing subsets by allocating 80% and 20% of the available videos, respectively. When testing with a given dataset, its training data were augmented with the remaining two datasets. Note that prior to the augmentation step, the transformation approaches that have been previously described in
Section 3.2 in case of SumMe and TVSum, were followed so as to obtain binary annotations for each 1 s video segment.
5. Results and Experiments
This section presents experimental results of the proposed video summarization approach using the datasets that have been previously presented in
Section 3. As in our prior work [
1], the following classifiers for the classification of the fused features resulting upon the combination of handcrafted and deep visual features are presented: (a) naive Bayes (NB), (b) k-nearest neighbors (kNN), (c) logistic regression (LR), (d) decision trees (DT), (e) random forests (RF), and (f) XGBoost (XGB). The classifiers of cases of (a)–(d) have been implemented using the scikit-learn library [
34], with the appropriate parameter adjustments. Specifically, in the case of kNN adjusted, the
k parameter, which stands for the number of nearest neighbors, was adjusted. Moreover, in the case of LR experiments involved the
C inverse regularization parameter. In the case of DT, the split quality measure criterion for DT (i.e., Gini impurity, entropy, etc.) was varied. It should be noted that RF is a collection of DT models that may be seen of as an improvement of bagging. For their construction, the balanced classifier from [
35] was utilized and experiments involved the maximum tree depth. Finally, for the XGB, the classifier from [
36] was used. In terms of the split quality measure criterion and the quantity of tree estimators, both tree classifiers have been optimized. In the case of kNN,
k was set equal to 5, while in the case of LR, the value of
C was set equal to 10. The DT classifier adopts the “entropy” criterion for its information gain strategy and also limits the maximum tree depth to 6 in order to prevent overfitting. Accordingly, the RF classifier adopts the Gini impuriry criterion, while the maximum number of trees is 400. Similarly, also in the case of the XGBoost classifier, the maximum number of trees is set to 400.
To validate the effectiveness of the proposed approach, the macro averaged F1-score, i.e., the macro average of the individual class-specific F1-scores, is computed. Note that F1-score is the harmonic mean of recall and precision, per class; therefore, the F1 macro average provides an overall normalized metric for the general classification performance. Additionally, note that the F1 macro average provides a general evaluation metric of video summarization, formulated as a classification task, taking into account the class imbalance of the task and aggregating precision and recall.
In
Table 3, the F1-scores for the six different classification methods applied to the herein used custom dataset are depicted, and for the cases of (a) using only handcrafted features, (b) fusing handcrafted and deep features, (c) fusing handcrafted and deep features and also augmenting the training dataset with SumMe and TVSum. Accordingly, in
Table 4 and
Table 5, results for SumMe and TVSum are presented. Moreover, in
Table 6, comparisons of the proposed approach with several approaches that are based on supervised learning techniques are presented.
In the case of the custom dataset, it may be observed that deep features provided a boost of performance in most cases, while the overall best F1-score in that case was increased, compared with the use of handcrafted features only. Specifically, 5 out of 6 classifiers demonstrated a higher F1-score, while kNN showed equivalent performance. The best performance using handcrafted features was observed in the case of using XGB and was 62.3, while in the case of handcrafted features and using the same classifier, it was 59.7. Additionally, in the case of the fusion of handcrafted and deep features, XGB demonstrated the best overall performance, which in that case was 63.7. However, by augmenting the training dataset with all available samples of SumMe and TVSum, performance was further improved, reaching a highest F1-score equal to 65.8, when using the RF classifier. Note that in this case, 5 out 6 classifiers exhibited improved performance, with the only exception being the DT, which showed almost equivalent performance.
Continuing with the SumMe dataset, it may be observed that the fusion of handcrafted with deep features provided a boost of performance in 4 out of 6 cases, while the overall best score was also increased in this dataset. A slight decrease in performance was observed in the cases of NB and RF. Specifically, the overall best F1-score was 50.6 in the case of using only handcrafted features and the NB classifier and was increased to 51.2 when adding the deep features and the DT. Notably, by using only deep features, slightly increased performance was achieved, i.e., 51.7 by also using the DT. By augmenting the dataset with the custom one and TVSum, a further performance boost was observed. The F1-score was increased in 4 out of 6 classifiers, while best F1-score was equal to 58.4. Notably, the NB exhibited a high drop of performance. A smaller drop of performance was also observed in the case of DT.
Finally, in the case of TVSum, the addition of deep features led to an increase in F1-score. The best overall F1-score in both cases was observed when using the RF classifier. In the case of using only handcrafted features, the best F1-score was 56.4, which was increased to 57.6 using only deep features and then to 60.3 by fusing both types of features. All 6 classifiers exhibited an increase in performance due to the addition of the deep features. Notably, in the case of TVSum, the augmentation of the training dataset was unable to provide a further increase in the best overall performance. Specifically, 5 out of 6 classifiers showed a drop of performance, while the best overall F1-score decreased to 56.9, again when using the RF.
To compare the proposed approach with state-of-the-art research works that are based on supervised learning, TVSum and SumMe were used, which, as already mentioned, are popular video summarization datasets. These comparisons are depicted in
Table 6. As it could be observed, the proposed approach achieved the best performance in SumMe; it ranked eighth overall. These results clearly indicate its potential. Note that almost all approaches presented in
Table 6 are based on deep architectures, while the herein proposed one is based on traditional machine learning algorithms, while it uses deep architectures only for feature extraction and not for classification.
6. Conclusions and Future Work
In this paper, previous research work on video summarization has been extended by (a) introducing pretrained models as transfer learning feature extractors in order to represent the audiovisual content of videos in a more intuitive form, disengaged from the exclusive use of handcrafted features, and (b) augmenting the training dataset by using other datasets in the training procedure. The performance was evaluated using a custom dataset comprising YouTube user-generated video clips that had been collected and annotated on a 1 s basis as “informative” and “noninformative”. This process was performed by human annotators, who annotated as “informative” video segments which to their opinion were suitable to be part the final video summary. Two publicly available datasets were also used, namely, TVSum and SumMe. Video summarization was formulated as a binary classification task, and feature vector representations were extracted from 1 s video segments of the audiovisual streams of videos. Both handcrafted and deep features were fused into a single vector, and experiments with a set of six well-known classifiers were performed, with and without augmentation of training dataset. It has been experimentally shown that the herein proposed approach achieved better performance than most contemporary research works, and it has been verified that, in most cases, deep features, when fused with the handcrafted ones, are able to provide a boost of performance. The latter may be further improved by augmenting the training dataset with examples from other datasets.
Plans for future work include the enrichment of the custom dataset with several other datasets comprising more videos from various and heterogeneous domains. We believe that such an augmentation may lead to a further increase in performance as it will allow training of a more robust video summarization model. Moreover the proposed approach could be extended from a fully to a weakly supervised methodology. Additionally, as several discrepancies among annotators were observed, it would be interesting to further experiment with the voting scheme and the aggregation approach, and also try to detect and exclude “mischievous” annotators, in an overall effort to enhance the objectivity of our models. It will be also interesting to exploit speech and textual modalities that may be present within the aural and the visual part of the video sequences. In addition to the deep feature extraction, deep sequence classification approaches with attention mechanisms have been extremely promising, leading to better results; therefore, it is within our immediate plans to investigate such methodologies.