1. Introduction
News video summarization aims to extract a sequence of key frames from a complete, long news video so that users can quickly browse and understand its content [1]. Key frame extraction is a feasible and effective approach to video summarization: the video is first split into individual shots, and key frames are then extracted from each shot.
In shot segmentation, mainstream methods extract features from video frames and decide whether a frame lies on a shot boundary by comparing the differences between two frames. Wu et al. [2] introduced a method that directly compares pixel differences between two frames, but it is easily affected by the motion and rotation of the shots. Yang et al. [3] used Hu invariant moments to extract features for an initial inspection and then re-inspected with color features to determine shot boundaries. Bae et al. [4] realized shot boundary detection through the discrete cosine transform and pattern matching of color histograms after extracting the color features of frames. Zheng et al. [5] improved the efficiency of color feature extraction with the MapReduce platform. However, a disadvantage of these color-feature methods is that two completely different frames may exhibit similar color features, causing false detections. Tang [6] compared inter-frame differences by extracting ORB (oriented FAST and rotated BRIEF; BRIEF: binary robust independent elementary features) features to detect shot boundaries. Although ORB features are fast to compute, they easily produce false detections when the shot is zooming. Rachida et al. [7] extracted SIFT feature points as local features of frames to construct a SIFT–PDH histogram, defined a SIFT–PDH distance between frames, and selected an adaptive double threshold to detect shot boundaries. SIFT features remain highly stable under shot rotation and motion, varying object sizes, and changes in light intensity and brightness; however, the heavy computation of the SIFT algorithm limits real-time performance. Bendraou et al. [8] used a low-rank matrix approximation under the Frobenius norm to extract intra-frame features. In recent years, deep learning has been used to learn the inherent characteristics of data from large numbers of samples and is particularly suitable for visual computing tasks [9,10,11,12], including video detection, classification and segmentation. Zeng et al. [13] extracted frame features and computed inter-frame similarity with a recurrent neural network (RNN) to detect abrupt and gradual transition frames. However, this method neglects the applicable scope of the features and cannot express frame information well.
In research on key frame extraction, Zhong et al. [14] realized key frame extraction by building a fully connected graph. Qu et al. [15] proposed an extraction algorithm based on SIFT feature points, but extraction algorithms based on local features show high redundancy and low real-time performance. Qu et al. [16] selected the frame with the maximum image entropy as the key frame of each shot. Liang et al. [17] extracted deep features of frames with a convolutional neural network (CNN) and then selected the frame containing the most significant features as the key frame. The above two methods can select only one key frame per shot. Clustering algorithms are also widely employed in key frame extraction. Sun [18] and Sandra [19] used the K-means clustering algorithm to determine the locations of key frames. The major disadvantage of K-means is that the number of clusters must be set in advance, so it cannot adapt well to changes in shot complexity.
For video summarization, much research has been carried out in the past and is still ongoing. Souček et al. [20] presented TransNet V2, an effective deep network architecture for shot detection that provides promising detection accuracy and enables efficient processing of larger datasets. Liu et al. [21] proposed a hierarchical visual model that hypothesizes a number of windows possibly containing the object of interest. Zhang et al. [22] proposed a context-aware video summarization (CAVS) framework, in which sparse coding with a generalized sparse group lasso is used to learn a dictionary of video features as well as a dictionary of spatio-temporal feature correlation graphs. Jurandy et al. [23] exploited visual features extracted from the video stream and presented a simple and fast algorithm to summarize the video content. Recently, Workie et al. [24] provided a comprehensive survey of digital video summarization techniques, from classical computer vision to recent deep learning approaches; these techniques fall into supervised, unsupervised and deep reinforcement learning categories. Cahuina et al. [25] presented a method for static video summarization using local descriptors and video temporal segmentation; it uses X-means clustering and requires two parameters to be specified in advance to determine the number of clusters, which is inconvenient in practice. Video summarization therefore still faces various challenges, including computing equipment, complexity and a lack of datasets.
To overcome the shortcomings of the above algorithms, this paper proposes a shot boundary detection algorithm based on SURF features [26] and a key frame extraction algorithm with an improved clustering algorithm for summarizing news videos. First, taking SURF features as local features of frames, the algorithm extracts SURF feature points from each frame and matches them between adjacent frames. The similarity between adjacent frames is calculated from the matching results, yielding an inter-frame similarity curve. Based on this curve, abrupt transitions (sudden switches) and gradual transitions (gradual switches) of shots are detected with a double threshold. Second, for each shot, the HSV color histogram of every frame in the shot is extracted, and the number of clusters is determined dynamically by clustering the histograms with the improved clustering algorithm. After clustering, the frame closest to each cluster center is selected as a key frame. SURF features are not only invariant to shot rotation and scale variation but also stable under shot motion, noise and brightness changes. Because the proposed clustering algorithm does not require the number of clusters to be set in advance, the number of extracted key frames can match the complexity of the shots.
In brief, the major contributions of this paper are highlighted as follows.
(1) To facilitate SURF-based shot detection, we propose an inter-frame similarity metric based on the SURF feature points extracted from two frames and the feature points matched between them. Furthermore, to improve the detection accuracy of abrupt and gradual transitions, we propose a double-threshold method to locate shot boundaries on the similarity curve.
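As a rough illustration only (not the paper's exact formulation), the idea can be sketched as follows. The similarity function below is a hypothetical stand-in (matched points normalized by the average number of points detected in the two frames), and the two thresholds are illustrative:

```python
def frame_similarity(n_kp_a: int, n_kp_b: int, n_matched: int) -> float:
    """Hypothetical inter-frame similarity: matched SURF points
    normalized by the average number of points in the two frames."""
    if n_kp_a + n_kp_b == 0:
        return 0.0
    return 2.0 * n_matched / (n_kp_a + n_kp_b)

def detect_boundaries(sims, t_high, t_low):
    """Double-threshold sketch: similarity below t_low marks an abrupt
    cut; a run of frames between t_low and t_high is treated as a
    candidate gradual transition."""
    cuts, graduals, start = [], [], None
    for i, s in enumerate(sims):
        if s < t_low:                      # sharp drop: abrupt boundary
            cuts.append(i)
            start = None
        elif s < t_high:                   # moderate drop: inside a gradual change
            if start is None:
                start = i
        else:                              # similarity recovered: close the run
            if start is not None:
                graduals.append((start, i - 1))
                start = None
    if start is not None:
        graduals.append((start, len(sims) - 1))
    return cuts, graduals
```

A similarity curve such as `[0.9, 0.85, 0.1, 0.9, 0.6, 0.55, 0.9]` with thresholds `t_high=0.7`, `t_low=0.3` would then yield one abrupt cut at index 2 and one gradual transition spanning indices 4 to 5.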
(2) We propose an improved HSV color histogram clustering algorithm to extract the key frames within a shot. The algorithm dynamically determines the number of cluster centers without presetting, so it adapts to variations in shot complexity.
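To illustrate how a cluster count can emerge from the data rather than being preset, the minimal leader-style sketch below (not the paper's exact algorithm) assigns each frame's histogram to the nearest existing cluster when it is close enough, and otherwise seeds a new cluster:

```python
def leader_cluster(histograms, dist_threshold):
    """Illustrative threshold-driven clustering: a frame joins the
    nearest existing cluster if its histogram distance is below
    dist_threshold; otherwise it seeds a new cluster."""
    def l1(h1, h2):
        # L1 distance between two (normalized) color histograms
        return sum(abs(a - b) for a, b in zip(h1, h2))

    centers, members = [], []
    for idx, h in enumerate(histograms):
        if centers:
            d, best = min((l1(h, c), k) for k, c in enumerate(centers))
        else:
            d, best = float("inf"), -1
        if d < dist_threshold:
            members[best].append(idx)
            n = len(members[best])
            # running mean keeps the center representative of its cluster
            centers[best] = [(c * (n - 1) + v) / n
                             for c, v in zip(centers[best], h)]
        else:
            centers.append(list(h))
            members.append([idx])
    return centers, members
```

With such a scheme, a visually uniform shot collapses into one cluster (one key frame), while a complex shot naturally produces several.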
The remainder of this paper is organized as follows. In Section 2, we introduce the SURF description algorithm. In Section 3, the proposed video summarization method is detailed. Experimental results are given and analyzed in Section 4. Section 5 concludes the paper.
2. Preliminaries
The speeded-up robust features (SURF) [26] local feature description algorithm maintains high stability under shot rotation, shot motion, object size change, and variations in light intensity and brightness. SURF improves on several steps of the SIFT algorithm, making it faster. The SURF algorithm comprises three main steps: feature point detection, principal direction determination and feature descriptor generation.
- (1) Feature point detection
The SURF algorithm detects feature points by calculating the determinant of the Hessian matrix of the image at different scales. The Hessian matrix is a matrix of second-order partial derivatives. For a point $x = (x, y)$ in a given image, the Hessian matrix $H(x, \sigma)$ defined at the point $x$ with scale $\sigma$ is shown in Equation (1):

$$H(x,\sigma)=\begin{bmatrix} L_{xx}(x,\sigma) & L_{xy}(x,\sigma)\\ L_{xy}(x,\sigma) & L_{yy}(x,\sigma)\end{bmatrix} \tag{1}$$

where $L_{xx}(x,\sigma)$, $L_{xy}(x,\sigma)$ and $L_{yy}(x,\sigma)$ are the convolutions of the Gaussian second-order partial derivatives with the image at point $x$. The Hessian matrix of the image reflects the local curvature of the image after Gaussian filtering.
The SURF algorithm uses box filters to approximate the Gaussian second-order partial derivatives. Figure 1 compares the box filters with the Gaussian second-derivative templates. As seen, the box filters $D_{xx}$, $D_{xy}$ and $D_{yy}$ approximately replace the Gaussian second-order partial derivatives $L_{xx}$, $L_{xy}$ and $L_{yy}$, improving the efficiency of convolving the image.
The Hessian matrix constructed by the SURF algorithm and the determinant of the Hessian matrix [26] are defined as follows:

$$H_{\mathrm{approx}}=\begin{bmatrix} D_{xx} & D_{xy}\\ D_{xy} & D_{yy}\end{bmatrix},\qquad \det\left(H_{\mathrm{approx}}\right)=D_{xx}D_{yy}-\left(0.9\,D_{xy}\right)^{2}$$

where the weight $0.9$ compensates for the box-filter approximation. According to the determinant of the Hessian matrix, it can be determined whether a pixel is an extreme point: when the determinant is positive, the pixel is an extreme point at scale $\sigma$.
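The efficiency of the box filters comes from evaluating them over an integral image (summed-area table), so each rectangular sum costs four lookups regardless of filter size. A minimal sketch of that mechanism and of the approximated determinant (the weight 0.9 follows the original SURF formulation [26]):

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * w for _ in range(h)]
    for y in range(h):
        row = 0.0
        for x in range(w):
            row += img[y][x]
            ii[y][x] = row + (ii[y - 1][x] if y > 0 else 0.0)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum over the inclusive rectangle [y0..y1] x [x0..x1]
    in four lookups -- the key to fast box filtering."""
    s = ii[y1][x1]
    if y0 > 0:
        s -= ii[y0 - 1][x1]
    if x0 > 0:
        s -= ii[y1][x0 - 1]
    if y0 > 0 and x0 > 0:
        s += ii[y0 - 1][x0 - 1]
    return s

def det_hessian_approx(dxx, dyy, dxy, w=0.9):
    """det(H_approx) = Dxx*Dyy - (w*Dxy)^2 with weight w ~ 0.9."""
    return dxx * dyy - (w * dxy) ** 2
```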
- (2) Determination of the principal direction
To make SURF feature points rotation invariant, SURF determines a principal direction for each feature point. First, a circular region centered at the feature point with radius 6s (where s is the scale at which the feature point was detected) is established. Then, Haar wavelet templates of size 4s are used to scan the region, and the horizontal and vertical Haar wavelet responses are calculated. Finally, a 60-degree sector window is rotated around the region, and the direction in which the sum of the horizontal and vertical Haar responses is largest is selected as the principal direction of the feature point.
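The sliding-window vote can be sketched as follows, assuming for simplicity that the Haar responses inside the 6s circle are already given as (angle, dx, dy) triples; a real implementation would also Gaussian-weight the responses:

```python
import math

def principal_direction(responses, window=math.pi / 3):
    """60-degree sliding-window sketch: sum the response vectors whose
    angle falls inside each window position; the window with the
    largest summed vector defines the orientation."""
    best_norm, best_angle = -1.0, 0.0
    for angle, _, _ in responses:          # anchor the window at each response
        sx = sum(dx for a, dx, dy in responses
                 if (a - angle) % (2 * math.pi) < window)
        sy = sum(dy for a, dx, dy in responses
                 if (a - angle) % (2 * math.pi) < window)
        norm = sx * sx + sy * sy           # squared length of the summed vector
        if norm > best_norm:
            best_norm, best_angle = norm, math.atan2(sy, sx)
    return best_angle
```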
- (3) Generation of feature descriptors
A square region with side length 20s, oriented along the principal direction of the feature point, is constructed around the feature point. This region is divided into 4 × 4 sub-regions, each with side length 5s. In each sub-region, the sum of the horizontal Haar wavelet responses, $\sum d_x$, the sum of the vertical Haar wavelet responses, $\sum d_y$, the sum of the absolute values of the horizontal responses, $\sum |d_x|$, and the sum of the absolute values of the vertical responses, $\sum |d_y|$, are computed. Thereby, a four-dimensional vector is formed in each sub-region. The four-dimensional vector formed in each sub-region is expressed as follows:

$$v=\left(\sum d_x,\ \sum d_y,\ \sum |d_x|,\ \sum |d_y|\right)$$
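Concatenating the 4-vectors of the 16 sub-regions yields the 64-dimensional SURF descriptor. Assuming the per-sub-region Haar responses are already computed, the assembly can be sketched as:

```python
def surf_descriptor(subregion_responses):
    """Build the 64-D SURF descriptor: subregion_responses[i] is the
    list of (dx, dy) Haar responses of the i-th of the 4x4 sub-regions;
    each sub-region contributes (sum dx, sum dy, sum |dx|, sum |dy|)."""
    assert len(subregion_responses) == 16
    desc = []
    for resp in subregion_responses:
        desc.append(sum(dx for dx, _ in resp))
        desc.append(sum(dy for _, dy in resp))
        desc.append(sum(abs(dx) for dx, _ in resp))
        desc.append(sum(abs(dy) for _, dy in resp))
    return desc  # 16 sub-regions x 4 values = 64 dimensions
```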