1. Introduction
Digital pictures may suffer from various distortions during acquisition, transmission, and compression, leading to unsatisfactory perceived visual quality or a certain level of annoyance. Thus, it is crucial to predict the quality of digital pictures in many applications, such as compression, communication, printing, display, analysis, registration, restoration, and enhancement [1,2,3]. Generally, image quality assessment approaches can be classified into three kinds according to the additional information they require. Specifically, full-reference image quality assessment (FR-IQA) [4,5,6,7] and reduced-reference image quality assessment (RR-IQA) [8,9,10] need full and partial information of the reference image, respectively, while blind image quality assessment (BIQA) [11,12,13,14] measures quality without any information from the reference image. Thus, BIQA methods are more attractive in many practical applications, where the reference image is usually unavailable or hard to obtain.
Early studies mainly focused on one or more specific distortion types, such as Gaussian blur [15], blockiness from JPEG compression [16], or ringing arising from JPEG2000 compression [17]. However, images may be affected by unknown distortions in many practical scenarios. In contrast, general BIQA methods aim to work well for arbitrary distortions and can be classified into two categories according to the features extracted, i.e., Natural Scene Statistics (NSS)-based methods and training-based methods.
NSS-based methods [18] assume that undistorted natural images obey certain perceptually relevant statistical laws that are violated by the presence of common image distortions, and they attempt to describe an image by its scene statistics in different domains. For example, BRISQUE [19] derives features from the locally normalized luminance coefficients in the spatial domain. M3 [20] utilizes joint local contrast features from the gradient magnitude (GM) map and the Laplacian of Gaussian (LOG) response. Later, a perceptually motivated and feature-driven model was deployed in FRIQUEE [21], in which a large collection of features defined in various complementary, perceptually relevant color and transform-domain spaces is drawn from among the most successful BIQA models produced to date.
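To make the NSS idea concrete, the following is a minimal sketch of the locally normalized luminance (MSCN) coefficients that BRISQUE builds on; the Gaussian window width and the stabilizing constant are common choices from the literature, not values taken from this paper.

```python
# Minimal sketch of BRISQUE-style mean-subtracted contrast-normalized (MSCN)
# coefficients; assumes a grayscale image scaled to [0, 1]. The window width
# sigma and the constant c are illustrative choices.
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(image: np.ndarray, sigma: float = 7 / 6, c: float = 1e-3) -> np.ndarray:
    mu = gaussian_filter(image, sigma)                    # local mean
    var = gaussian_filter(image ** 2, sigma) - mu ** 2    # local variance
    return (image - mu) / (np.sqrt(np.abs(var)) + c)      # normalize; c avoids division by zero
```

For pristine natural images these coefficients are close to unit-Gaussian; distortions change the shape of their distribution, which is what NSS features quantify.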
However, knowledge-driven feature extraction and data-driven quality prediction are separated in the above methods. It has been demonstrated that training-based methods outperform NSS-based methods by a large margin because they make a fully data-driven BIQA solution possible. For example, CORNIA [22] constructs a codebook in an unsupervised manner, using raw image patches as local descriptors and soft-assignment for encoding. Considering that the feature sets adopted in previous methods derive from zero-order statistics and are insufficient for BIQA, HOSA [23] constructs a much smaller codebook using K-means clustering [24] and introduces higher-order statistics. In contrast to the above methods, which capture spatially normalized coefficients and codebook-based features, CNN-based methods learn features automatically from end to end. For example, TCSN [25] aims to learn the complicated relationship between visual appearance and perceived quality via a two-stream convolutional neural network. DIQA [26] defines two separate CNN branches to learn objective distortion and human visual sensitivity, respectively.
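As a rough illustration of the codebook-based encoding used by CORNIA, the sketch below soft-assigns patch descriptors to codewords and max-pools the responses; the sign-splitting detail is our reading of that approach, and codebook learning (e.g., K-means over random patches) is assumed to have been done beforehand.

```python
# Hedged sketch of CORNIA-style soft-assignment encoding.
import numpy as np

def soft_assignment_encode(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """descriptors: (N, d) patch features; codebook: (K, d) codewords."""
    similarity = descriptors @ codebook.T                      # (N, K) dot-product similarity
    pos = np.maximum(similarity, 0.0)                          # positive responses
    neg = np.maximum(-similarity, 0.0)                         # negative responses
    return np.concatenate([pos.max(axis=0), neg.max(axis=0)])  # max pooling -> (2K,) global feature
```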
In this work, we propose an end-to-end BIQA method based on classification guidance and feature aggregation, which is accomplished by two sub-networks with shared features in the early layers. Due to the lack of training data, we construct a large-scale dataset by synthesizing distortions and pre-train Sub-Network I to classify an image into a specific distortion type from a set of pre-defined categories. We find that the proposed method would struggle to achieve high accuracy on authentically distorted images if it were exposed only to synthetic distortions during training. Therefore, we extract hierarchical features from the shared layers of the two sub-networks and from another CNN (VGG-16 [27]) pre-trained on ImageNet [28], whose pictures contain distortions that occur as a natural consequence of photography, and form a unified feature group.
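For illustration only (this is not the authors' released code), the sketch below shows one way to pull hierarchical features from several stages of an ImageNet-pretrained VGG-16; the layer indices are an assumption, chosen at the end of each convolutional stage.

```python
# Hedged sketch: hierarchical feature extraction from a pretrained VGG-16.
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
STAGES = {4: "stage1", 9: "stage2", 16: "stage3", 23: "stage4", 30: "stage5"}  # pooling-layer indices

def hierarchical_features(x: torch.Tensor) -> dict:
    feats = {}
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in STAGES:
                feats[STAGES[i]] = x  # keep the feature map at the end of this stage
    return feats
```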
Sub-Network II takes the hierarchical features and the classification information as inputs to predict the perceptual quality. The combination of the two sub-networks enables the learning framework to achieve favorable quality perception and proper parameter initialization in an end-to-end training manner. We design a feature aggregation layer that converts inputs of arbitrary size to a fixed-length representation. Then, a fully connected layer is exploited as a linear regression model to map the high-dimensional features to quality scores. This allows the proposed CGFA-CNN to accept an image of any size as input, so there is no need to perform any transformation of the image (such as cropping or scaling) that would affect its perceptual quality score.
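The following sketch shows why this design removes the size constraint: once the aggregation step maps a variable-sized feature map to a fixed-length vector, a single fully connected layer suffices for regression. The mean/std aggregation here is only a stand-in for the Fisher-vector layer described later.

```python
# Hedged sketch: fixed-length aggregation followed by linear regression.
import torch
import torch.nn as nn

class QualityHead(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(2 * channels, 1)  # linear regression to one quality score

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: (B, C, H, W) with arbitrary H and W
        mean = feature_map.mean(dim=(2, 3))
        std = feature_map.std(dim=(2, 3))
        return self.fc(torch.cat([mean, std], dim=1))  # (B, 1)

# QualityHead(512)(torch.randn(1, 512, 24, 31)) works for any spatial size.
```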
The paper is structured as follows. In Section 2, previous work on CNN-based BIQA related to our work is briefly reviewed. In Section 3, details of the proposed method are described. In Section 4, experimental results on the public IQA databases and the corresponding analysis are presented. In Section 5, the work of this paper is concluded.
2. Related Work
In this section, we provide a brief survey of the major solutions to the lack of training data in BIQA and a review of recent studies related to our work.
Because the number of parameters to be trained in a CNN is usually very large, the training set needs to contain sufficient data to avoid over-fitting. However, the number of samples and the variety of image content in public quality-annotated image databases are rather limited, which cannot meet the needs of end-to-end training of a deep network. Currently, there are two main methods to tackle this challenge.
The first method is to train the model on image patches. For example, deepIQA [29] randomly samples patches from the entire image as inputs and predicts the quality of local regions by assigning the subjective mean opinion score (MOS) of the whole image to all patches within it. Although taking small patches as inputs for data augmentation is superior to using the whole image of a given dataset, this method still suffers from limitations because local image quality varies with content across spatial locations even when the distortion is homogeneous. To resolve this problem, BIECON [30] makes use of existing FR-IQA algorithms to assign quality labels to sampled image patches, but the performance of such a network depends highly on that of the FR-IQA models. Other methods, such as dipIQ [31], which generate discriminable image pairs with the help of FR-IQA models, may suffer from similar problems.
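A minimal sketch of the patch-based labeling strategy criticized above, assuming a NumPy image and its global MOS; the patch size and count are arbitrary illustrative values.

```python
# Hedged sketch: every patch inherits the MOS of the whole image, which is
# exactly the approximation that makes patch-based training problematic.
import numpy as np

def sample_labeled_patches(image: np.ndarray, mos: float, patch: int = 32, n: int = 32):
    h, w = image.shape[:2]
    pairs = []
    for _ in range(n):
        y = np.random.randint(0, h - patch + 1)
        x = np.random.randint(0, w - patch + 1)
        pairs.append((image[y:y + patch, x:x + patch], mos))  # global MOS as the local label
    return pairs
```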
The second method is to pre-train a network on large-scale datasets from other fields. For each pre-trained architecture, two types of back-end training strategies are available: replacing the last layer of the pre-trained CNN with a regression layer and fine-tuning it on an IQA database to conduct quality prediction, or using SVR to regress the features extracted by the pre-trained network onto subjective scores. For instance, DeepBIQ [32] reports on the use of different features, extracted from CNNs pre-trained for image classification on ImageNet [28] and Places365 [33], as a generic image description. Kim et al. [34] selected the well-known deep CNN models AlexNet [35] and ResNet50 [36], pre-trained for image classification on ImageNet [28], as the architectures of their baseline models. These methods, which directly inherit the weights of models pre-trained for general image classification, suffer from low relevance to BIQA and unnecessary complexity.
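As an illustration of the first back-end strategy (a sketch under our own assumptions, not any specific paper's code), one can swap the classification layer of an ImageNet-pretrained ResNet-50 for a one-output regression layer:

```python
# Hedged sketch: fine-tuning a pretrained classifier as a quality regressor.
import torch.nn as nn
import torchvision.models as models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 1)  # regress a single quality score
# ... then fine-tune on an IQA database with an L1/L2 loss against MOS.
```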
To better address the training data shortage, MEON [37] proposes a cascaded multi-task framework that first trains a distortion type identification network on large-scale pre-defined samples. A quality prediction network is then trained, taking advantage of the distortion information obtained in the first stage. Furthermore, DB-CNN [38] not only constructs a pre-training set based on the Waterloo Exploration Database [39] and PASCAL VOC [40] for synthetic distortions, but also uses ImageNet [28] to pre-train another CNN for authentic distortions. Motivated by these studies on MEON [37] and DB-CNN [38], we construct a pre-training set based on the Waterloo Exploration Database [39] and PASCAL VOC [40] for synthetic distortions. Besides, both the distortion type and the distortion level are considered at the same time, which results in better quality-aware initializations and richer distortion information.
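To make the construction concrete, the sketch below synthesizes (image, distortion type, distortion level) triples; the specific types, the number of levels, and the parameter values are illustrative assumptions, not the paper's exact specification.

```python
# Hedged sketch: building a pre-training set labeled by distortion type and level.
import io
from PIL import Image, ImageFilter

BLUR_RADII = [1, 2, 4, 8]          # four example levels of Gaussian blur
JPEG_QUALITIES = [80, 50, 25, 10]  # four example levels of JPEG compression

def synthesize(img: Image.Image):
    samples = []
    for level, r in enumerate(BLUR_RADII):
        samples.append((img.filter(ImageFilter.GaussianBlur(r)), "gblur", level))
    for level, q in enumerate(JPEG_QUALITIES):
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=q)
        samples.append((Image.open(io.BytesIO(buf.getvalue())), "jpeg", level))
    return samples  # each entry carries both a type label and a level label
```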
Although previous DNN-based BIQA methods have achieved significant performance, they usually comprise convolutional and pooling layers for feature extraction and fully connected layers for regression, which suffers from three limitations. First, techniques such as average or maximum pooling are too simple to summarize long feature sequences accurately. Second, a fully connected layer is destructive to the high-dimensional structure and spatial invariance of local features. Third, such CNNs typically require a fixed input size, so images have to be resized or cropped before being fed into the network, and either scaling or cropping causes a perceptual difference with respect to the assigned quality labels. To tackle these challenges, we explore more sophisticated pooling techniques based on clustering approaches such as Bag-of-Visual-Words (BOW) [41], Vectors of Locally Aggregated Descriptors (VLAD) [42], and Fisher vectors [43]. Studies have shown that integrating VLAD as a differentiable module in a neural network can significantly improve the aggregated representation for place recognition [44] and video classification [45]. Our proposed feature aggregation layer acts as a pooling layer on top of the convolutional layers and converts inputs of arbitrary size to a fixed-length representation. Afterward, using a fully connected layer for regression does not require any preprocessing of the input image.
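For reference, the sketch below shows a NetVLAD-style differentiable aggregation layer in the spirit of [44]; it is a stand-in for the Fisher-vector layer used in CGFA-CNN, with the cluster count chosen arbitrarily.

```python
# Hedged sketch: VLAD as a differentiable module (after NetVLAD). Each local
# descriptor is soft-assigned to K learned centers and the residuals are
# accumulated, yielding a fixed-length vector for any input resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    def __init__(self, channels: int, clusters: int = 16):
        super().__init__()
        self.assign = nn.Conv2d(channels, clusters, kernel_size=1)  # soft-assignment logits
        self.centers = nn.Parameter(torch.randn(clusters, channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = F.softmax(self.assign(x), dim=1).flatten(2)   # (B, K, HW) soft assignments
        d = x.flatten(2)                                  # (B, C, HW) local descriptors
        vlad = torch.einsum("bkn,bcn->bkc", a, d) \
             - a.sum(-1).unsqueeze(-1) * self.centers     # accumulate residuals to centers
        vlad = F.normalize(vlad, dim=2).flatten(1)        # intra-normalize, then flatten
        return F.normalize(vlad, dim=1)                   # (B, K*C) fixed-length vector
```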
5. Conclusions
In this work, we propose an end-to-end learning framework for BIQA based on classification guidance and feature aggregation, named CGFA-CNN. In the fine-tuning phase, except for the shared convolutional layers, the rest of Sub-Network I only participates in forward propagation, and its parameters are fixed. The fused feature group is aggregated and encoded by the FV layer to obtain a Fisher vector. Then, the Fisher vector is corrected by the CGU to obtain a quality-aware feature, which is mapped to a quality score by the regression model. In the test phase, only forward propagation is required to obtain the quality score. The results on the four public IQA databases demonstrate that the proposed method indeed benefits image quality assessment. However, CGFA-CNN is not a unified learning framework because it takes two steps to pre-train and fine-tune. A promising future direction is to optimize CGFA-CNN for both distortion identification and quality prediction at the same time. For example, an autoencoder could be designed to perform k-means clustering, and a VAE framework could be introduced for decoding; this approach could replace the two-stage procedure. We also look forward to designing an objective function that could in principle reduce the reliance on external procedures.
CGFA-CNN is versatile and extensible. For example, more distortion types and levels can be added to the pre-training dataset, and the framework can be fused with other approaches to form a new backbone network.