1. Introduction
With the rapid development of digital multimedia technology and the popularity of various photography devices, image information has become an important source of human visual information. However, between the acquisition of a digital image and its arrival at the human visual system, some degradation in image quality is inevitable. Therefore, it is meaningful to research image quality assessment (IQA) methods that are highly consistent with human visual perception [
1].
According to the degree of participation of the original image information, objective IQA methods can be classified into the following categories: full-reference IQA, reduced-reference IQA, and no-reference IQA [
2]. No-reference IQA is also called blind IQA (BIQA). Because BIQA methods do not require the use of reference image information and are more closely related to actual application scenarios, they have become a focus of research in recent years [
3].
Traditional BIQA methods (e.g., NIQE [
4], BRISQUE [
5], DIIVINE [
6], and BIQI [
7]) typically extract low-level features from images and then use regression models to map them to image quality scores. The extracted features are manually designed and often inadequate to fully characterize image quality. With the development of deep learning, many deep-learning-based BIQA methods (e.g., IQA-CNN [
8], DIQaM-NR [
9], DIQA [
10], HyperIQA [
11], DB-CNN [
12], and TS-CNN [
13]) have been proposed. With their powerful learning abilities, these methods can extract high-level features of distorted images, and their performance is greatly improved compared to traditional methods. However, although most existing deep-learning-based IQA methods improve performance by proposing new network structures with stronger feature-extraction abilities, they overlook the important influence of human visual system (HVS) characteristics and the guiding role these characteristics may play.
The goal of BIQA is to judge the degree of image distortion in a way that is highly consistent with human visual perception. It is therefore natural to combine the characteristics of the HVS with powerful deep learning methods. Moreover, HVS-based research on BIQA can provide new perspectives for the study of IQA. It can help to develop evaluation metrics that are more in line with HVS characteristics and provide useful references for understanding the mechanisms by which the HVS perceives image degradation, making it a valuable scientific problem.
The HVS has many characteristics, such as the dual-pathway feature [
14,
15], in which visual information is transmitted through the ventral pathway and the dorsal pathway in the visual cortex. The former is involved in image-content recognition and long-term memory and is also known as the “what” pathway. The latter is involved in processing the spatial-location information of objects and is also known as the “where” pathway. Inspired by the ventral and dorsal pathways of the HVS, Simonyan and Zisserman [
16] proposed a dual-stream convolutional neural network (CNN) structure and successfully applied it to video action recognition. They used a spatial stream that takes video frames as input to learn scene information and a temporal stream that takes optical-flow images as input to learn object-motion information. Optical-flow images explicitly describe the motion between video frames, eliminating the need for the CNN to implicitly infer object motion, simplifying the learning process, and significantly improving model accuracy. The contrast sensitivity characteristic of the HVS reflects the varying sensitivity of the human eye to different spatial frequencies [
17]. This characteristic is similar to the widely used spatial attention mechanism [
18] and image saliency [
19]. Campbell et al. [
20] proposed a contrast sensitivity function to explicitly calculate the sensitivity of the HVS to different spatial frequencies. Some traditional IQA methods [
21,
22] use the contrast sensitivity function to weight the extracted features to achieve better results. In addition, when perceiving images, the HVS simultaneously pays attention to both global and local features [
23]. This characteristic is particularly important for IQA because the degree of distortion of authentically distorted images is often not uniformly distributed [
24]. Some IQA methods [
25,
26] are designed to extract multi-scale features based on this characteristic, and the results show that using multi-scale features can effectively improve an algorithm’s performance. The aforementioned HVS characteristics have been directly or indirectly applied to computer-vision-related tasks and have been experimentally proven to be effective.
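The contrast sensitivity function mentioned above can be made concrete with the classic Mannos–Sakrison fit, which captures the band-pass shape of HVS sensitivity; this is an illustrative formulation, not necessarily the exact CSF used in the methods cited above:

```python
import numpy as np

def csf_mannos_sakrison(f):
    """Contrast sensitivity at spatial frequency f (cycles/degree),
    using the classic Mannos-Sakrison fit. Sensitivity peaks at mid
    frequencies and falls off at very low and very high frequencies."""
    f = np.asarray(f, dtype=float)
    return 2.6 * (0.0192 + 0.114 * f) * np.exp(-(0.114 * f) ** 1.1)
```

Evaluating the function at, e.g., 0.5, 8, and 40 cycles/degree shows the band-pass behavior: sensitivity is highest near the middle of the range, which is why CSF-based weighting emphasizes mid-frequency distortions.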
The main contribution of this article is a new dual-pathway and contrast sensitivity (DPCS) model for BIQA. The HVS’s dual-pathway characteristic is used to guide the construction of a dual-pathway BIQA deep learning model, which can simultaneously learn the content and spatial-location information of distorted images. The multi-scale and contrast sensitivity characteristics of the HVS are also introduced to enable the model to extract distortion features that are highly consistent with human perception. Specifically, our contributions are as follows:
First, inspired by the ventral and dorsal pathways of the HVS, a dual-stream convolutional neural network is proposed, with the two streams named the “what” pathway and the “where” pathway, respectively. The “what” pathway extracts the content features of distorted images, while the “where” pathway extracts the global shape features. The features of the two streams are fused and mapped into an image quality score.
Second, by using the contrast-sensitivity-weighted gradient image as the input of the “where” pathway, global shape features to which the human eye is sensitive can be extracted.
Third, a dual-stream multi-scale feature fusion module is designed to fuse the multi-scale features of the two pathways, enabling the model to focus on both global and local features of distorted images.
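As an illustration of the second contribution, a contrast-sensitivity-weighted gradient image can be produced by re-weighting the gradient magnitude’s spectrum with a CSF. The CSF form and the mapping from normalized digital frequency to cycles/degree below are assumptions made for this sketch, not the paper’s exact preprocessing:

```python
import numpy as np

def csf(f):
    # Mannos-Sakrison-style CSF (illustrative choice).
    return 2.6 * (0.0192 + 0.114 * f) * np.exp(-(0.114 * f) ** 1.1)

def csf_weighted_gradient(img, cycles_per_degree=60.0):
    """Gradient magnitude of img, re-weighted in the frequency domain
    by a contrast sensitivity function. cycles_per_degree maps the
    normalized digital frequency to visual frequency and depends on
    the assumed viewing geometry."""
    gy, gx = np.gradient(img.astype(np.float64))
    grad = np.hypot(gx, gy)
    h, w = grad.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    f = np.hypot(fx, fy) * cycles_per_degree
    weighted = np.fft.ifft2(np.fft.fft2(grad) * csf(f))
    return np.real(weighted)
```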
The rest of this paper is organized as follows.
Section 2 introduces related works for BIQA and analyzes their limitations.
Section 3 provides a detailed description of the proposed HVS-based dual-stream model, image-preprocessing method, and dual-stream multi-scale feature fusion module.
Section 4 reports the experiment results.
Section 5 discusses some related issues and concludes this paper.
2. Related Works
According to the method for feature extraction, BIQA methods can be generally divided into two categories: handcrafted feature-extraction methods and learning-based methods. Handcrafted feature-extraction methods typically extract the natural scene statistics (NSS) features of distorted images. Researchers have found that the NSS features vary with the degree of distortion. Therefore, NSS features can be mapped to image quality scores through regression models.
Early NSS methods extracted features in the transform domain of the image. For example, the BIQI method proposed by Moorthy and Bovik [
7] performs a wavelet transform on the distorted image and fits the wavelet decomposition coefficients using the generalized Gaussian distribution (GGD). The method first determines the type of distortion and then predicts the quality score of the image based on the specific distortion type. The authors later extended the features of BIQI to obtain DIIVINE [
6], which more comprehensively describes scene statistics by considering the correlation of sub-bands, scales, and directions. The BLIINDS method proposed by Saad et al. [
27] performs a discrete cosine transform (DCT) on distorted images to extract DCT-based contrast and structural features, which are then mapped to quality scores through a probabilistic prediction model. All of these methods are computationally expensive because they extract features in the transform domain of the image. To avoid transforming the image, many researchers have proposed methods that directly extract NSS features in the spatial domain. The BRISQUE method proposed by Mittal et al. [
5] extracts the local normalized luminance coefficients of distorted images in the spatial domain and quantifies the loss of “naturalness” of distorted images. This method has very low computational complexity. Building on BRISQUE, Mittal et al. proposed NIQE [
4], which uses multivariate Gaussian models (MVGs) to fit the NSS features of distorted and natural images and defines the distance between the two models as the quality of the distorted image. The handcrafted feature-extraction methods achieve good performance on small databases (such as LIVE [
28]), but the designed features can only extract low-level features of images, and their expressive power is limited. Therefore, their performance on large-scale synthetically distorted databases (such as TID2013 [
29] and KADID-10k [
30]) and authentically distorted databases (such as LIVE Challenge [
31]) is relatively poor.
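The local normalized luminance coefficients that BRISQUE builds on (often called MSCN coefficients) are straightforward to compute; a minimal sketch, with an illustrative Gaussian window width:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn(img, sigma=7.0 / 6.0, c=1.0):
    """Mean-subtracted contrast-normalized (MSCN) coefficients:
    each pixel is normalized by a local Gaussian-weighted mean and
    standard deviation, as in BRISQUE-style NSS features."""
    img = img.astype(np.float64)
    mu = gaussian_filter(img, sigma)
    var = gaussian_filter(img * img, sigma) - mu * mu
    local_std = np.sqrt(np.clip(var, 0.0, None))
    return (img - mu) / (local_std + c)
```

For natural images the MSCN coefficients are approximately zero-mean; distortion changes their empirical distribution, and it is this change that the GGD-style fits quantify.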
With the successful applications of deep learning methods to other visual tasks [
32,
33], more and more researchers have applied deep learning to BIQA. Kang et al. [
8] first used CNNs for no-reference image quality assessment. To solve the problem of insufficient data, they segmented the distorted images into non-overlapping 32 × 32 patches and assigned each patch the quality score of its source image. Bosse et al. [
9] proposed DIQaM-NR and WaDIQaM-NR based on the VGG [
32]. This method uses a deeper CNN to simultaneously predict the quality scores and weights of image patches; a weighted summation then yields the quality score of the whole image. Kim et al. [
33] proposed BIECON. It uses an FR-IQA method to predict the quality scores of distorted image patches, utilizes these scores as intermediate targets to train the model, and subsequently fine-tunes the model using the ground-truth scores of the images. Kim et al. [
10] subsequently proposed DIQA. The framework is similar to BIECON but uses error maps as intermediate training targets to avoid overfitting. Su et al. [
11] proposed HyperIQA for authentically distorted images. This method predicts the image quality score based on the perceived image content and also incorporates multi-scale features so that the model can capture local distortions. Some researchers have introduced multitask learning into BIQA, which integrates multiple tasks into one model for training so that the tasks promote each other based on their correlation. Kang et al. [
34] proposed IQA-CNN++, which integrates image quality assessment and image distortion type classification tasks and improves the model’s distortion type classification performance through multitask training. Ma et al. [
35] proposed MEON, which simultaneously performs distortion-type classification and quality score prediction. Unlike other multitask models, the authors first pre-train the distortion-type classification sub-network and then perform joint training of the quality score prediction network. The experimental results show that this pre-training mechanism is effective. Sun et al. [
36] proposed a Distortion Graph Representation (DGR) learning framework called GraphIQA. GraphIQA enables the distinction of distortion types by learning the contrast relationship between different DGRs and inferring the ranking distribution of samples from various levels within a DGR. Experimental results show that GraphIQA achieves state-of-the-art performance on both synthetic and authentic distortions. Zhu et al. [
37] proposed a meta-learning-based NR-IQA method named MetaIQA. The method collects a diverse set of NR-IQA tasks for different distortions and employs meta-learning to capture prior knowledge. The quality prior-knowledge model is then fine-tuned for a target NR-IQA task, achieving superior performance compared to state-of-the-art methods. Wang and Ma [
38] proposed an active learning method to improve the NR-IQA methods by leveraging group maximum differentiation (gMAD) examples. The method involves pre-training a DNN-based BIQA model, identifying weaknesses through gMAD comparisons, and fine-tuning the model using human-rated images. Li et al. [
39] proposed a normalization-based loss function, called “Norm-in-Norm” for NR-IQA. The loss function utilizes the normalization of predicted and subjective quality scores and is defined based on the norm of the differences between these normalized values. Theoretical analysis and experimental results show that the embedded normalization enhances the stability and predictability of gradients, leading to faster convergence. Zhang et al. [
40] conducted the first study on the perceptual robustness of NR-IQA models. The study identifies that conventional, knowledge-driven NR-IQA models and modern DNN-based methods lack inherent robustness against imperceptible perturbations. Furthermore, the counter-examples generated by one NR-IQA model do not efficiently transfer to falsify other models, highlighting valuable insights into the design flaws of individual models.
In recent years, continual learning has achieved significant success in the field of image classification, and some researchers have also applied it to IQA. Zhang et al. [
41] formulated continual learning for NR-IQA to handle novel distortions. The method allows the model to learn from a stream of IQA datasets, preventing catastrophic forgetting and adapting to new data. Experimental results show the effectiveness of the proposed method compared to standard training techniques for BIQA. Liu et al. [
42] proposed a lifelong IQA (LIQA) method to address the challenge of adapting to unseen distortion types by mitigating catastrophic forgetting and learning new knowledge without accessing previous training data. It utilizes the Split-and-Merge distillation strategy to train a single-head network for task-agnostic predictions. To enhance the model’s feature extraction ability, some researchers have proposed a dual-stream CNN structure. Zhang et al. [
12] proposed DB-CNN, which uses a VGG-16 pre-trained on ImageNet [
43], to extract authentic distortion features, and a CNN pre-trained on the Waterloo Exploration Database [
44] and PASCAL VOC 2012 [
45], to extract synthetic distortion features. Yan et al. [
13] also proposed a dual-stream method. The two streams take the distorted image and its gradient image as input, respectively, so that the gradient stream focuses more on the details of the distorted image.
Although the aforementioned deep-learning-based BIQA methods have achieved good results, there is still room for further improvement. For example, the relevant characteristics of the HVS can be combined with deep learning to make the model consistent with the perceptual approach of the HVS. Inspired by the dual-pathway characteristics of the HVS, our work also adopts a dual-pathway structure. However, our two pathways extract the content features and location features of the distorted image, which are functionally consistent with the ventral and dorsal pathways of the HVS. In addition, our dual-pathway model adds contrast-sensitivity-weighted gradient images as an input. This provides different perspectives of the distorted image for the model and explicitly learns the contrast sensitivity characteristics of the HVS. The dual-pathway multi-scale feature fusion module designed in our work enables the model to focus on the global and local features of the image simultaneously. It is also highly consistent with the process of HVS perception.
In comparison to DB-CNN and TS-CNN, particularly TS-CNN, our method shares similarities in using gradient images as the input for one stream of the network. However, there are key differences between our proposed method and these two works. First, our method explicitly models both the ventral (“what” pathway) and dorsal (“where” pathway) streams of the human visual system, providing a more comprehensive representation of the human perception mechanism. Second, we introduce a contrast-sensitivity weighting scheme for the gradient images in the “where” pathway, which enhances the sensitivity of the network to important contrast information in the input images. Third, our dual-pathway multi-scale feature fusion module allows for the effective integration of features at different levels, enabling the network to capture both local and global image characteristics. These differences distinguish our proposed method from DB-CNN and TS-CNN and enhance the ability of the proposed deep network to capture and evaluate image quality from the perspective of human visual perception.
4. Experiments
4.1. Image Quality Databases
To evaluate the performance of the proposed method, experiments are conducted on both synthetically distorted databases and authentically distorted databases, and the proposed approach is compared with the state-of-the-art methods. The synthetically distorted databases are LIVE [
28], CSIQ [
24], TID2013 [
29], KADID-10k [
30], and the Waterloo Exploration Database [
44], with detailed information summarized in
Table 2. The authentically distorted databases are LIVE Challenge (LIVEC) [
31] and KonIQ-10k [
55]. The LIVEC database contains 1162 images captured by different photographers using different equipment in natural environments, which include complex authentic distortion types. The KonIQ-10k dataset contains 10,073 images selected from the YFCC100M database [
56], ensuring diversity in image content and quality, and an even distribution in brightness, color, contrast, and sharpness.
4.2. Experimental Protocols and Evaluation Metrics
To avoid content overlap between the training and testing images, we split the synthetically distorted databases according to their reference images, using the distorted versions of 80% of the reference images for training and those of the remaining 20% for testing. For the authentically distorted databases, we directly use 80% of all images for training and 20% for testing. Each database is randomly split 10 times according to the aforementioned rule, and the average of the 10 experimental results is taken as the final result.
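A minimal sketch of the reference-based split described above (the data layout and helper name are illustrative, not the paper’s code):

```python
import random

def split_by_reference(ref_to_images, train_ratio=0.8, seed=0):
    """Split a synthetic IQA database so that all distorted versions
    of a given reference image land entirely in either the training
    or the testing set, avoiding content overlap."""
    refs = sorted(ref_to_images)
    random.Random(seed).shuffle(refs)
    k = int(len(refs) * train_ratio)
    train = [img for r in refs[:k] for img in ref_to_images[r]]
    test = [img for r in refs[k:] for img in ref_to_images[r]]
    return train, test
```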
We use the Spearman rank-order correlation coefficient (SROCC) and Pearson linear correlation coefficient (PLCC) to evaluate the performance of the IQA methods. These coefficients measure, respectively, the monotonicity and the linear correlation between the predicted scores and the ground-truth scores. Their range is [−1, 1], and the larger the absolute value, the better the model’s performance. In addition, on the Waterloo Exploration Database, the D-Test metric is used to evaluate the model’s ability to distinguish between reference images and distorted images; the L-Test metric is used to evaluate, for images with the same content and distortion type, the consistency between the predicted rank order across distortion levels and the true rank order; and the P-Test metric is used to evaluate the consistency between the predicted quality order of image pairs and their true order.
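Both correlation coefficients are available in scipy.stats; a minimal helper (the function name is ours):

```python
from scipy.stats import pearsonr, spearmanr

def iqa_metrics(predicted, ground_truth):
    """Return (SROCC, PLCC) between predicted and ground-truth scores.
    SROCC measures prediction monotonicity; PLCC measures linearity."""
    srocc = spearmanr(predicted, ground_truth).correlation
    plcc = pearsonr(predicted, ground_truth)[0]
    return srocc, plcc
```

A strictly monotonic but nonlinear prediction yields SROCC = 1 while PLCC < 1, which is why both metrics are usually reported together.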
4.3. Performance on Individual Database
The experimental results on the individual databases are summarized in
Table 3 and
Table 4. The proposed method is compared with three traditional methods (PSNR, SSIM [
57], and BRISQUE [
5]) and nine deep-learning-based methods (IQA-CNN [
8], BIECON [
33], MEON [
35], DIQaM-NR [
9], HyperIQA [
11], MMMNet [
58], AIGQA [
59], DB-CNN [
12], and TS-CNN [
13]) in terms of SROCC and PLCC results on six databases. Here, DB-CNN and TS-CNN are similar to our proposed method in that both adopt a dual-stream structure.
From
Table 3 and
Table 4, it can be observed that all methods exhibit good performance on the LIVE and CSIQ databases, which contain fewer distortion types. However, varying degrees of performance degradation are evident on the more complex distortion types of the TID2013 and KADID-10k databases, as well as on the authentically distorted databases of LIVEC and KonIQ-10k.
On the synthetically distorted databases of LIVE, TID2013, and KADID-10k, the proposed method ranks among the top two in both SROCC and PLCC. On the authentically distorted databases of LIVEC and KonIQ-10k, it also ranks among the top two methods, partly because it adopts a pre-trained ResNet-50 as the backbone, which enables the model to learn the authentic distortions in the images more easily. Additionally, since the distortion in authentically distorted images is often unevenly distributed, the proposed method introduces a multi-scale feature fusion module. This allows the model to focus on local details and better align with human visual perception.
Overall, based on the SROCC and PLCC results, the proposed method demonstrates excellent performance on six commonly used databases. Compared with other dual-pathway structures such as DB-CNN and TS-CNN, the proposed method maintains a leading position on most databases. In particular, compared with TS-CNN, the proposed method shows a significant performance difference on authentically distorted databases. This is mainly due to the incorporation of the dual-path characteristics of the HVS in the proposed approach, which can extract the content and location features of distorted images simultaneously. The contrast-sensitivity-weighted gradient image can explicitly extract the frequency information that is of interest to human vision. Additionally, the proposed multi-scale feature fusion module allows the model to focus on both global content and local details.
4.4. Performance on Individual Distortion Types
To compare the performance of the proposed method with the state-of-the-art methods on individual distortion types, experiments are conducted on three synthetically distorted databases, LIVE, CSIQ, and TID2013. All the distortion types are used for training on each database, and testing is performed on specific distortion types. The experimental results are summarized in
Table 5,
Table 6, and
Table 7 for each database, respectively.
From
Table 5, it can be observed that the proposed method achieves the best performance on four distortion types, JPEG, WN, GB, and FF, in the LIVE database. In particular, the proposed method outperforms other methods by a large margin on the FF distortion type. From
Table 6, it can be seen that the proposed method achieves the best performance on four distortion types, JPEG, WN, PN, and CC, in the CSIQ database, and obtains the second- and third-best performance on the JP2K and GB distortion types, respectively, with only a small gap between it and the top methods. For the more complex distortion types of PN and CC, the proposed method still maintains a high SROCC.
It can be observed from
Table 7 that the proposed method achieves top-two performance on 17 out of 24 distortion types, second only to HyperIQA’s 19 out of 24. Moreover, for complex distortion types such as NPN, BW, MS, and CC, most methods fail to achieve satisfactory results, while the proposed method still achieves relatively good performance. It can be seen that our method maintains stable and excellent performance across the distortion types in TID2013. Overall, the experimental results on the individual distortion types of the three databases demonstrate that our method also performs well for specific distortion types.
4.5. Performance across Different Databases
Cross-database testing is a common method to test model generalizability. We conduct cross-database tests on four databases: LIVE, CSIQ, TID2013, and LIVEC. Specifically, we train the model on one database and test it on the others, such as training the model on the LIVE database and testing on the CSIQ, TID2013, and LIVEC databases, and so on. The SROCC results of the tests are summarized in
Table 8.
From
Table 8, it can be seen that the proposed method achieves the best performance in a total of eight cases, surpassing DB-CNN’s three. When cross-database testing is conducted among the three synthetically distorted databases (LIVE, CSIQ, and TID2013), most methods achieve relatively good results. However, because synthetically distorted databases cannot fully simulate authentic distortion, many methods perform poorly on authentically distorted databases. Nevertheless, the proposed method still maintains good performance in such scenarios: when trained on LIVE, CSIQ, or TID2013 and tested on LIVEC, it achieves the best performance. Similarly, when trained on LIVEC and tested on LIVE, CSIQ, and TID2013, our method also maintains good performance and achieves better results than the other methods on TID2013. Although its performance on LIVE and CSIQ is slightly lower than that of DB-CNN, the proposed method still outperforms the other methods by a clear margin.
To further evaluate the generalization performance of the proposed method on large-scale databases, we train the model on the entire LIVE database and test it on the Waterloo Exploration Database, calculating the D-Test, P-Test, and L-Test metrics. The experimental results are presented in
Table 9. It can be observed that the proposed method achieves top-two performance in both D-Test and L-Test metrics. It also demonstrates competitive performance in the P-Test metric, which further validates its superior generalization capability.
4.6. Ablation Experiments
To validate the effectiveness of the modules in the proposed method, ablation experiments are conducted on the LIVE, CSIQ, TID2013, and LIVEC databases. The “what” pathway, which only takes distorted images as the input, is used as the baseline model. Then, the “where” pathway, which takes gradient images as the input, is added; next, the contrast-sensitivity-weighted gradient image is used as its input for comparison; and finally, the multi-scale module is added. The experimental results are summarized in
Table 10. To further validate the significance of module contributions to model performance improvement, paired
t-tests were conducted on various models in the ablation experiments. The experimental results are shown in
Table 11, where 1, 0, and −1 indicate that the model in the corresponding row is significantly better than, statistically indistinguishable from, or significantly worse than the model in the corresponding column. The confidence level is set at 95%.
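The row/column coding described above can be reproduced with scipy.stats.ttest_rel; a sketch under the assumption that each model contributes one score per random split (the paper’s exact test configuration may differ):

```python
from scipy.stats import ttest_rel

def significance_code(scores_a, scores_b, alpha=0.05):
    """Paired t-test over per-split scores: return 1 if model A is
    significantly better than model B, -1 if significantly worse,
    and 0 if the two are statistically indistinguishable."""
    t, p = ttest_rel(scores_a, scores_b)
    if p >= alpha:
        return 0
    return 1 if t > 0 else -1
```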
From
Table 10 and
Table 11, it can be observed that when there is only one pathway in the model, the performance is poor, especially when it only contains the “where” pathway. This is because the model can then only extract high-frequency information from the gradient image and lacks detail information. When the model contains both the “what” pathway and the “where” pathway, it can extract rich structural information from the gradient domain of the distorted image, and its performance improves significantly, by 0.011, 0.019, 0.017, and 0.009 on the four databases, respectively. When the contrast-sensitivity-weighted gradient image is used as the input for the “where” pathway, the improvement in model performance is even more significant, with increases of 0.015, 0.028, 0.028, and 0.019 on the four databases, respectively. This demonstrates that using the contrast-sensitivity-weighted gradient map as input explicitly makes the model focus more on the parts to which the HVS is sensitive, making the model highly consistent with HVS perception.
Then, when the multi-scale module without a channel attention mechanism is added to the dual-pathway model, a slight improvement in model performance can be observed. However, this improvement is not significant, as only a simple concatenation of feature maps from the two pathways is performed in this case. This may result in redundant or irrelevant information being combined, limiting the model’s ability to effectively leverage the complementary strengths of the two pathways. Finally, adding the multi-scale module with a channel attention mechanism to the model shows that the performance of both the “where” pathway and the “what” pathway, which take the gradient image and contrast-sensitivity-weighted gradient map as inputs, are improved, with the largest improvement seen on the authentically distorted database LIVEC, with increases of 0.011 and 0.014, respectively. This is because, by incorporating a channel attention mechanism, the model gains the ability to selectively attend to informative channels from both pathways, effectively enhancing the fusion process. This allows the model to capture more fine-grained relationships between different channels, leading to improved performance.
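The channel attention in the fusion module is, in spirit, similar to a squeeze-and-excitation block; a minimal numpy sketch with illustrative weight shapes (not the paper’s implementation):

```python
import numpy as np

def channel_attention(features, w1, w2):
    """SE-style channel attention on a (C, H, W) feature map:
    squeeze by global average pooling, excite through a two-layer
    bottleneck (ReLU then sigmoid), and rescale each channel."""
    squeeze = features.mean(axis=(1, 2))            # (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)          # bottleneck + ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid gates, (C,)
    return features * gates[:, None, None]
```

Applied to the concatenated multi-scale features of the two pathways, such gating lets the model down-weight redundant channels instead of fusing them uniformly.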