1. Introduction
Measuring the quality of digital videos is an active and important research topic. Digital videos undergo a series of processes, e.g., compression and transmission, before they are displayed [
1]. Each process affects the video in a certain way and, in most cases, introduces some type of artifact or noise. These artifacts, such as blur, geometric distortion, or blockiness from compression, degrade the perceptual quality of the digital video. In the literature, video quality assessment (VQA) is divided into two broad classes: subjective and objective. Subjective VQA deals with collecting quality ratings from a group of human observers for a set of videos. The experiments can be carried out either in a laboratory environment [
2] or a crowd-sourcing process [
3] conducted online. The quality ratings obtained from human observers are averaged into one number, the mean opinion score (MOS), to characterize the perceptual quality of each considered video sequence. In addition, subjective VQA deals with many aspects of video quality measurement, such as the selection of test video sequences, grading scale, time interval of video presentation to human subjects, viewing conditions, and selection of human participants [
4,
5]. As a result, subjective VQA provides benchmark databases [
5,
6,
7] which contain video sequences with their corresponding MOS values. These databases are widely used as training and testing data by objective VQA methods, which aim to construct mathematical models that accurately estimate the perceptual quality of video sequences.
Objective VQA can be classified with respect to different factors. The most common way of classification in the literature [
8,
9,
10,
11] is based on the availability of the pristine reference video, whose visual quality is considered perfect by the objective VQA algorithm. Specifically, objective VQA is categorized into three groups: full-reference (FR), reduced-reference (RR), and no-reference (NR). FR-VQA algorithms have access to the entire reference video, RR-VQA algorithms only to some representative features of it, while NR-VQA algorithms have no access to the reference video at all. In the literature, the construction of an NR-VQA algorithm is considered the most challenging [
12,
13] due to the complete lack of information about the reference video, and also the most useful, as reference videos are unavailable in many practical, everyday applications, such as video streaming [
14].
Recently, the deep learning paradigm has dominated the field of computer vision, image, and video processing [
15,
16,
17,
18,
19]. Moreover, the field of NR-VQA was also heavily influenced by this trend [
20,
21,
22,
23,
24,
25]. The specific contribution of the present paper is a novel deep learning based approach for NR-VQA that relies on a set of pre-trained convolutional neural networks (CNNs), applied in parallel, to characterize a wide range of potential image and video distortions. More specifically, temporally pooled and saliency-weighted video-level deep feature vectors are compiled from a set of pre-trained CNNs and mapped onto perceptual quality scores independently of each other using trained regressors. Finally, the quality scores coming from the different regressors are fused together to obtain the perceptual quality of the input video sequence. We empirically corroborate that the decision fusion of multiple deep architectures significantly improves the performance of NR-VQA. Extensive experiments were carried out on two large benchmark VQA databases (KoNViD-1k [
7] and LIVE VQC [
26]) with authentic distortions.
The remainder of this paper is structured as follows. In
Section 2, we introduce the status of research in NR-VQA. In
Section 3, we describe the overall architecture of the proposed method. In
Section 4, we describe the applied benchmark databases that were used to train and test the proposed architecture. Moreover, the applied evaluation metrics and environment are also described. In
Section 5, we introduce the experiments designed to evaluate the performance of the proposed method and present the experimental results. In
Section 6, we draw conclusions and outline future work.
2. Literature Review
Due to the complexity of the human visual system (HVS), NR-VQA is a very challenging task; consequently, a large number of studies dealing with NR-VQA can be found in the literature. These methods can be classified into three large groups: bitstream-based, pixel-based, and hybrid models. Specifically, bitstream-based methods analyze the video frame headers and the decoded packets to estimate a digital video's perceptual quality. A typical example of this group is the QANV-PA (Quality Assessment for Network Video via Primary Analysis) method [
27]. The authors first extracted five frame-level parameters, i.e., quantization parameter, frame display duration, number of lost packets, frame type, and bitrate. A pooling procedure over the frame-level parameters was then introduced to characterize perceptual video quality. In contrast, Lin et al. [
28] built their model on three factors, i.e., quantization parameter, bit location, and motion. Yamagishi and Hayashi [
29] used a packet-layer model for estimating the perceptual quality of internet protocol television (IPTV) videos. Specifically, the authors analyzed the packet-headers of videos and extracted quality-aware features, such as bit rate and packet-loss frequency. Bitstream-based methods perform well in network video applications, such as video conferencing or IPTV, but cannot be exploited for general applications [
30].
Pixel-based NR-VQA methods take the raw video signal as input for quality prediction. Different natural scene statistics (NSS) approaches are very popular in the literature [
31,
32,
33]. The main idea behind NSS is that natural images and videos possess certain statistical regularities that are corrupted in the presence of noise. The discrete cosine transform (DCT) [
34] domain is a popular choice for quantifying the deviation from “natural” statistics. For instance, Brandao and Queluz [
35] fitted different probability density functions (PDFs) to DCT coefficients. Specifically, the parameters of these PDFs were estimated by maximum likelihood and were applied for local error estimation. This was followed by a perceptual spatio-temporal weighting model to quantify overall perceptual quality. In contrast, Saad et al. [
36] first took the difference of consecutive video frames and applied a local block-based DCT to these difference images. Next, the DCT coefficients were modeled by a generalized Gaussian distribution (GGD), and the parameters of the GGD were considered as quality-aware features. These quality-aware features were combined with motion coherency vectors and mapped onto quality scores with the help of a support vector regressor (SVR). In contrast, Li et al. [
37] utilized 3D-DCT for feature extraction instead of frame level features but similarly to [
36] the feature vectors were mapped onto quality scores with an SVR. Similarly to the work in [
37], Cemiloglu and Yilmaz [
38] utilized 3D-DCT for feature extraction, but the video content was first segmented into cubes of various sizes based on spatial and motion activity measurements. In contrast, Zhu et al. [
39] extracted frame-level features from each video frame. Specifically, six feature maps were generated for every video frame using the DCT. Subsequently, five quality-aware features were extracted from the feature maps and temporally pooled to form video-level feature vectors, which were mapped onto quality scores with a neural network. In [
40], the authors further improved this method by introducing new frame-level features. Besides the DCT, other transform domains are also popular in the literature, such as the shearlet [
41], wavelet [
42], or complex wavelet [
43] transform domains. Another line of works extracted different optical flow statistics to compile quality-aware feature vectors. For example, Manasa et al. [
44] characterized the inconsistencies of the optical flow at both the image patch and the video frame level. Specifically, intra-patch and inter-patch irregularities were measured and combined with the correlation between successive frames. At the frame level, the magnitude difference of the optical flow between two consecutive frames was measured. Similarly to the previously mentioned methods, the extracted features were mapped onto quality scores with a trained SVR. In contrast, Men et al. [
45] combined spatial features, such as contrast or colorfulness, with temporal features derived from optical flow to compile feature vectors.
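To make the NSS idea behind these DCT-domain methods concrete, the sketch below fits a generalized Gaussian distribution to the block-DCT coefficients of a frame difference, in the spirit of the approach in [36]. The function name, block size, and the use of SciPy's `gennorm` are illustrative assumptions, not the exact implementation of any cited method.

```python
import numpy as np
from scipy.fft import dctn
from scipy.stats import gennorm

def ggd_dct_features(prev_frame: np.ndarray, curr_frame: np.ndarray, block: int = 8):
    """Fit a generalized Gaussian distribution (GGD) to the block-DCT
    coefficients of a frame difference; its shape and scale parameters
    serve as quality-aware features."""
    diff = curr_frame.astype(np.float64) - prev_frame.astype(np.float64)
    h, w = diff.shape
    coeffs = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            c = dctn(diff[y:y + block, x:x + block], norm="ortho")
            coeffs.append(c.ravel()[1:])   # drop the DC coefficient
    coeffs = np.concatenate(coeffs)
    # gennorm's beta is the GGD shape parameter; the location is fixed at zero
    beta, loc, scale = gennorm.fit(coeffs, floc=0.0)
    return beta, scale
```

In a full pipeline, such per-frame (or per-difference-image) parameters would be pooled over the video and regressed onto quality scores, e.g., with an SVR.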
Recently, deep learning techniques have become very popular in pixel-based algorithms. Moreover, deep learning has also gained significant attention in related fields, such as stereoscopic [
46] and omnidirectional [
47] image quality assessment, image super-resolution [
48], or stereoscopic VQA [
49]. For instance, Li et al. [
41] trained a CNN from scratch on 3D shearlet transform coefficients extracted from video blocks for perceptual video quality estimation. In contrast, Ahn and Lee [
20] fused hand-crafted and deep features to compile quality-aware feature vectors for video frames. Next, a frame-to-video feature aggregation procedure was applied, and the resulting vector was regressed onto quality scores. Agarla et al. [
50] applied deep features extracted from pretrained CNNs for predicting image quality attributes, such as sharpness, graininess, lightness, and color saturation. Based on these quality attributes, frame-level quality scores were generated and used for perceptual video quality estimation using a recurrent neural network. In [
51], the authors further improved the previously mentioned method by introducing a sampling algorithm that eliminates temporal redundancy in video sequences by selecting representative video frames.
Hybrid methods combine the principles of bitstream-based and pixel-based algorithms. For instance, Konuk et al. [
52] combined a spatiotemporal feature vector with average bit rate and packet loss ratio. In [
53], the authors predicted the perceptual quality of videos transmitted over the Universal Mobile Telecommunications System (UMTS) by combining sender bitrate, block error rate, and mean burst length in a nonlinear regression analysis. Similarly, Tao et al. [
54] investigated video quality over IP networks.
For comprehensive surveys about NR-VQA, we refer readers to the works in [
55,
56,
57].
3. Proposed Method
The high-level workflow of the proposed NR-VQA algorithm is depicted in
Figure 1. As the figure shows, multiple temporally pooled video-level feature vectors are compiled from deep frame-level feature vectors extracted from each video frame using a diverse set of pre-trained CNNs. Next, these video-level feature vectors are mapped onto perceptual quality scores independently of each other. Finally, these scores are fused together to obtain an estimate of the perceptual quality of the input video sequence.
The main properties of the applied pre-trained CNNs are summarized in
Table 1. Specifically, seven different architectures were utilized, six of which were trained on ImageNet [
58] and one CNN was trained on Places-365 [
59] dataset. The main idea behind this layout is that deep features from multiple sources can capture possible image distortions better than features from a single source [
60]. Namely, the computer vision research community has pointed out that the internal activations of pre-trained CNNs, used as deep features, provide powerful representations [
61,
62,
63]. Moreover, CNNs can capture spatial and temporal dependencies in an image with the help of relevant convolutional filters [
64]. Further, the first layers of a CNN capture low-level image features, i.e., edges, colors, or blobs, while the deeper layers capture high-level features that are important for understanding image semantics [
65,
66]. The previously mentioned dependencies and features are obviously degraded in the presence of image noise and distortion. Therefore, they can be utilized as quality-aware features.
As already mentioned, the temporally pooled frame-level features are mapped onto perceptual quality scores using machine learning regression techniques. In this paper, we present experimental results using SVRs with Gaussian kernel functions and Gaussian process regressors (GPRs) with rational quadratic kernel functions. Finally, the quality scores provided by the regressors, each trained on deep features extracted with a different CNN architecture, are fused together to obtain the perceptual quality of a given video sequence.
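The train-one-regressor-per-CNN and fuse strategy described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' exact implementation: the function name, data shapes, and the median-fusion default are assumptions; the SVR with RBF (Gaussian) kernel and the GPR with rational quadratic kernel mirror the regressors named in the text.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RationalQuadratic

def fuse_quality_scores(feature_sets_train, mos_train, feature_sets_test,
                        regressor="svr", fusion="median"):
    """Train one regressor per CNN's video-level features and fuse
    the per-video predictions into a single quality score."""
    predictions = []
    for X_train, X_test in zip(feature_sets_train, feature_sets_test):
        if regressor == "svr":
            model = SVR(kernel="rbf")          # Gaussian (RBF) kernel
        else:
            model = GaussianProcessRegressor(kernel=RationalQuadratic())
        model.fit(X_train, mos_train)
        predictions.append(model.predict(X_test))
    predictions = np.vstack(predictions)        # shape: (n_cnns, n_test_videos)
    if fusion == "median":
        return np.median(predictions, axis=0)
    return np.mean(predictions, axis=0)
```

Each CNN may produce feature vectors of a different length, which is why the per-architecture feature sets are kept separate and only the predicted scores are combined.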
3.1. Frame-Level Feature Extraction
The workflow of the frame-level feature extraction is illustrated in
Figure 2. As previously mentioned, a diverse set of pre-trained CNNs was applied to extract frame-level feature vectors independently of each other. Specifically, AlexNet [
67], VGG16 [
68], ResNet18 [
69], ResNet50 [
69], GoogLeNet [
70], GoogLeNet-Places365 [
70], and InceptionV3 [
71] were considered for this purpose. Excluding GoogLeNet-Places365 [
70], these architectures were pretrained on ImageNet [
58] which contains more than one million images and 1000 semantic categories. On the other hand, GoogLeNet-Places365 [
70] was trained on the Places-365 [
59] database, which consists of 1.8 million training images from 365 scene categories (e.g., art studio, beauty salon, biology laboratory). To extract frame-level features, saliency-weighted global average pooling (SWGAP) layers, which are a contribution of this study, are attached to certain modules of the base models. As pointed out in previous works [
72,
73,
74,
75], considering multiple levels of deep features improves perceptual quality estimation, as CNNs capture image features at multiple levels.
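To illustrate this multi-level scheme, the sketch below pools the activations of several hypothetical network modules and concatenates the results into one frame-level vector. All shapes are illustrative, and plain global average pooling stands in for the saliency-weighted variant introduced later in this section.

```python
import numpy as np

def gap(feature_map: np.ndarray) -> np.ndarray:
    """Global average pooling: reduce a (C, H, W) feature map
    to a length-C vector by averaging over spatial positions."""
    return feature_map.mean(axis=(1, 2))

def frame_feature_vector(module_outputs) -> np.ndarray:
    """Concatenate pooled activations from several network modules
    (e.g., successive convolutional or Inception blocks) into one
    frame-level feature vector."""
    return np.concatenate([gap(fm) for fm in module_outputs])

# Two hypothetical modules with 64 and 128 channels at different resolutions
maps = [np.random.rand(64, 28, 28), np.random.rand(128, 14, 14)]
vec = frame_feature_vector(maps)   # length 64 + 128 = 192
```

Because pooling removes the spatial dimensions, the resulting vector length depends only on the channel counts of the tapped modules, matching the per-architecture feature lengths listed in Table 2.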
Table 2 summarizes the considered modules of the applied pre-trained CNNs and the lengths of the extracted feature vectors. Specifically, the features of the convolutional modules were used in the case of AlexNet [
67], VGG16 [
68], while the features of the residual and Inception modules were utilized in the case of ResNet18 [
69], ResNet50 [
69] and GoogLeNet [
70], GoogLeNet-Places365 [
70], InceptionV3 [
71], respectively.
Global average pooling (GAP) layers are commonly used in CNNs to enforce a correspondence between feature maps and semantic categories, thereby enabling the training of networks on images of various resolutions [
76]. Another common application of GAP is extracting resolution-independent visual features from images with the help of a CNN. In this paper, we extend GAP to SWGAP for feature extraction using visual saliency. Visual saliency algorithms identify the perceptually most prominent regions of a digital image [
77]. From the perspective of perceptual quality estimation, it is also essential that human observers tend to fixate on particular regions of an image during the first three seconds of observation [
78]. Motivated by the above observation, SWGAP is proposed for feature extraction to emphasize those regions which are salient to the human visual system. Namely, SWGAP performs a weighted arithmetic operation between an $M \times N$ feature map of a CNN and the saliency map of the input image, resized (by bilinear interpolation) to the same $M \times N$ resolution. Formally, it can be written as

$$F = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} S(i,j)\, A(i,j),$$

where $F$ denotes the output value of SWGAP for one feature map, $A(i,j)$ and $S(i,j)$ denote the entries of the feature map and the resized saliency map, respectively, $M$ and $N$ stand for the height and the width of the feature map, and $i$ and $j$ denote the coordinates of the feature map and the resized saliency map. In this study, the method of Li et al. [79] was applied to determine the saliency map of a video frame due to its low computational cost.
Figure 3 depicts several video frames and their saliency maps.
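A minimal NumPy sketch of SWGAP as described above, assuming the output of the layer is the saliency-weighted mean of the feature-map entries; the function name and shapes are illustrative.

```python
import numpy as np

def swgap(feature_map: np.ndarray, saliency: np.ndarray) -> float:
    """Saliency-weighted global average pooling of a single M x N
    feature map. `saliency` is the saliency map of the input image,
    already resized (e.g., by bilinear interpolation) to M x N."""
    M, N = feature_map.shape
    return float((saliency * feature_map).sum() / (M * N))

# With a uniform (all-ones) saliency map, SWGAP reduces to plain GAP;
# a non-uniform map down-weights activations in non-salient regions.
```

Applying `swgap` to every channel of a module's output yields one pooled vector per module, analogous to standard GAP but biased toward the regions a human observer is likely to fixate on.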
3.2. Video-Level Feature Extraction
As previously mentioned, the frame-level feature vectors obtained with the help of a CNN architecture are temporally pooled together to compile one feature vector that characterizes the whole video sequence. In this study, the average pooling of the frame-level feature vectors was utilized to this end. Formally, the following can be written:

$$v_k^{(i)} = \frac{1}{N} \sum_{j=1}^{N} f_j^{(i)},$$

where $N$ is the number of frames in the given video, $f_j^{(i)}$ stands for the $i$th entry of the $j$th frame-level feature vector, while $v_k^{(i)}$ denotes the $i$th entry of the feature vector that characterizes the whole video sequence obtained by the $k$th CNN architecture. The $v_k$ feature vectors are mapped onto perceptual quality scores independently of each other by machine learning techniques. Specifically, we made experiments with two different regression techniques: SVRs with Gaussian kernel functions and GPRs with rational quadratic kernel functions. To obtain the estimated perceptual quality of a video sequence, the arithmetic mean or the median of the regressors' outputs is taken.
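The temporal pooling described above reduces to a per-dimension average over frames. A minimal sketch, with an illustrative function name and shapes:

```python
import numpy as np

def temporal_average_pool(frame_features: np.ndarray) -> np.ndarray:
    """Average-pool N frame-level feature vectors, given as an (N, D)
    array, into a single video-level feature vector of length D."""
    return frame_features.mean(axis=0)

# Three frames with two-dimensional feature vectors
frames = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])
video_vec = temporal_average_pool(frames)   # [3.0, 4.0]
```

One such video-level vector is produced per CNN architecture and then regressed onto a quality score, as described above.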