1. Introduction
Face super-resolution (FSR), also referred to as face hallucination [
1], is a technology that enhances the quality of low-resolution (LR) face images by transforming them into high-resolution (HR) ones. Face images typically suffer from low spatial resolution due to limited imaging conditions and low-cost imaging equipment. This degradation hurts the performance of most practical downstream applications, such as face recognition and face analysis. As a result, FSR has become a popular and essential tool in the fields of computer vision and image processing [
2].
Different from general image super-resolution, FSR is a technique that focuses on recovering crucial facial structures. Although these structures only occupy a small portion of the face, they are essential in distinguishing different faces and improving image quality. Baker and Kanade [
1] proposed the first FSR method, which triggered a surge of traditional FSR approaches. Since then, various traditional FSR techniques have been developed, mainly relying on interpolation [
3], Principal Component Analysis (PCA) [
4], convex optimization [
5], Bayesian approach [
6], kernel regression [
7], and manifold learning [
8]. Nevertheless, traditional methods struggle to produce plausible facial images due to their shallow structures and limited representation abilities. Recently, FSR has made significant progress with the advent of deep learning techniques [
2]. Relying on the powerful deep convolution structures, various convolution neural networks (CNNs)-based FSR methods [
9,
10,
11,
12,
13] have been developed to predict fine-grained facial details. However, due to the vanishing gradient problem, the effective receptive field of most CNN-based models is limited. This makes it challenging to model global dependencies, resulting in blurry reconstructed face images. Aiming to capture both local and global dependencies, transformer-based methods [
14,
15] have recently gained significant attention.
The efficacy of transformer-based methods in improving FSR performance is noteworthy. However, they still exhibit certain limitations that require attention. Transformer-based encoder-decoder networks comprise two major parts: an up/downsample module that connects adjacent-scale feature information and a transformer module that explores and enhances the corresponding-level features. We will discuss the limitations of these components separately below:
(1) Due to its relatively small network size, the up/downsample module has not received sufficient attention in FSR methods. However, it plays a more important role here than in general image super-resolution methods, because face images are highly structured, with the eyes, nose, and mouth in specific locations; this is also why some FSR methods require additional landmark annotations on the dataset. Nonetheless, as illustrated in
Figure 1b, the inner feature maps generated by pixel deconvolutional or shuffle layers have no direct relationship since they are produced by independent convolutional kernels, which can result in significant differences between the values of adjacent pixels. Therefore, an up/downsample module that can build direct relationships among adjacent pixels is in high demand.
(2) The transformer module has significantly improved FSR performance. However, raw feature maps are fed to the transformer blocks directly without examining their potential feature information, limiting their performance. As shown in
Figure 1g, raw features processed by the transformer block without guiding are not always detail-rich and may even be buried in gray, which restricts the following transformer blocks to selecting only a limited number of feature maps based on the self-attention heatmap (
Figure 1e). On the contrary, applying a guiding block that steers the transformer block toward essential facial components results in more correlated feature maps (
Figure 1f). Such an approach is particularly beneficial for tackling the “one-to-many” FSR problem [
16] and ultimately yields superior outcomes (
Figure 1h).
(3) Most previous research [
15,
17,
18] favors improving the transformer module in transformer-based encoder-decoder FSR approaches. However, pairing a strong transformer module with an equally capable up/downsample module is crucial; otherwise, part of the system's potential is wasted on one side or the other.
In this work, we aim to address all the limitations mentioned above and propose a novel attention-guided transformer with pixel-related deconvolution network for face super-resolution. The proposed method utilizes a multi-scale connected encoder-decoder architecture as the backbone. In encoder-decoder branches, we carefully design an Attention-Guided Transformer Module (AGTM), which is composed of an Attention Guiding Block (AGB) and a Channel-wise Multi-head Transformer Block (CMTB). AGB aims to guide the transformer block in learning about essential facial components. Different from previous transformer-based methods [
15,
17], which use the same transformer structure for all feature layers, the AGB is further divided into two variants to adapt to different feature levels: the AGTM at the top of the encoder-decoder network (AGTM-T) promotes both local facial details and global facial structures, while the AGTM at the bottleneck (AGTM-B) optimizes the encoded low-level features. Noting that the usual spatial-wise transformers are limited to position-specific windows and that their partition strategy may alter the structure of the facial image [
19], the Channel-wise Multi-head Transformer Block (CMTB) is introduced to achieve an image-size receptive field by utilizing feature map channels. The AGB and CMTB are complementary and can simultaneously enhance local facial details and global facial structures. Furthermore, considering that face images are highly structured, we design a Pixel-Related Deconvolution (PRD) layer to establish direct relationships among adjacent pixels in the upsampling process for better face structure preservation. Moreover, different from the pyramid network [
13,
20] that progressively reconstructs high-resolution face images, we also develop a Multi-scale Feature Fusion Module (MFFM) to effectively fuse multi-scale features for better network flexibility and reconstruction results.
Figure 1.
Visual analysis of the pixel-related deconvolution and the guiding block for transformer-based FSR methods: (
a) is the input face image; (
b) is the conventional pixel deconvolutional and pixel shuffle upsampling layer that neglects the relationship between adjacent pixels; (
c) is the proposed pixel-related deconvolution that establishes direct relationships among adjacent pixels; (
d) is the inner feature map outputs with corresponding upsampling methods; (
e–
g) are self-attention heatmaps, correlation maps [
21] between input and output feature maps, and inner feature maps without and with guiding blocks, respectively (please note that five different images are tested in the correlation map instead of one for fair comparison); (
h) is the output images (the top one is trained without pixel-related deconvolution and guiding blocks, while the bottom one is trained with them). Moreover, subfigure (
f) represents the correlations between input and output feature maps: the more tightly the output feature maps cluster together, the better the input and output feature maps are matched and correlated, and the more this benefits the “one-to-many” FSR problem in obtaining fine-grained FSR results. Please refer to
Section 2.2 for more detailed information and related works.
To sum up, this work has four main contributions:
We devise an attention-guided transformer with a pixel-related deconvolution network for face super-resolution. To the best of our knowledge, neither the guiding block that mines potential inner feature map information nor the pixel-related deconvolution that establishes direct relationships among adjacent pixels has been discussed before in the transformer-based FSR field. Experiments conducted on two frequently used benchmark datasets (i.e., CelebA [
22] and Helen [
23]) demonstrate that the proposed method surpasses other state-of-the-art methods both quantitatively and qualitatively.
We carefully design an Attention-Guided Transformer Module (AGTM) to extract fine-grained features by enhancing the inner feature map relationship. Thanks to its powerful modeling ability, the proposed method can proficiently explore and utilize both local facial details and global facial structures.
We develop a Pixel-Related Deconvolution (PRD) layer to establish direct relationships among adjacent pixels in the upsampling process for better face structure preservation and further strengthen the overall transformer-based FSR performance.
We propose an elaborately designed Multi-scale Feature Fusion Module (MFFM) to fuse multi-scale features for better network flexibility and reconstruction results. The module is essential for the proposed method to acquire a wide range of features, which in turn improves the quality of the restoration performance.
3. Proposed Method
Considering the vital role of the guiding blocks in identifying the essential facial components, and aiming to establish direct relationships among adjacent pixels for better face structure preservation, we develop a novel attention-guided transformer with a pixel-related deconvolution network for face image super-resolution. This is the first study in the transformer-based FSR field to not only mine potential inner feature map information but also establish direct relationships among adjacent pixels when reconstructing highly structured face images.
To better elaborate on the proposed method, we divide it into four subsections. In the first subsection, we provide an overview of the architecture of the proposed method. Then, we delve into the main component of the proposed method, the Attention-Guided Transformer Module (AGTM), which consists of an Attention Guiding Block (AGB) and a Channel-wise Multi-head Transformer Block (CMTB). The AGB and CMTB are complementary and can simultaneously enhance local facial details and global facial structures. Afterward, we introduce the Multi-scale Feature Fusion Module (MFFM), which integrates features from all layers to improve network flexibility and restoration performance. Finally, we describe the Pixel-Related Up/Downsample Module (PRUM/PRDM), which establishes direct relationships among adjacent pixels for better face structure preservation and further strengthens the overall image reconstruction performance.
3.1. Overview
The proposed method, illustrated in
Figure 2, is a symmetrical hierarchical network consisting of three stages: encoding, bottleneck, and decoding. The encoding stage aims to extract and enhance both local facial details and global facial structures. Meanwhile, the bottleneck stage is intended to optimize the encoded low-level features. Finally, the decoding stage is introduced to facilitate multi-scale feature fusion and image reconstruction. To simplify the description, we use the notations $I_{LR}$, $I_{SR}$, and $I_{HR}$ to represent the low-resolution (LR) images, the super-resolved (SR) images, and the ground-truth high-resolution (HR) images, respectively.
(1) Encoding Stage: The goal of the encoding stage is to extract and enhance both local facial details and global facial structures. To begin with, the face images traverse a
convolution layer to extract their low-level features. Since the number of output channels should exceed that of the input, while an excessive number of output channels would significantly increase the computational complexity [
53], we suggest using 32 output channels for optimal performance. Afterward, the extracted shallow features are passed through three encoding stages. Each stage comprises an Attention-Guided Transformer Module—Top (AGTM-T) and a Pixel-Related Downsample Module (PRDM). The AGTM-T comprises an Attention Guiding Block—Top (AGB-T) and a Channel-wise Multi-head Transformer Block (CMTB). All the blocks and modules mentioned above will be discussed in the following subsections. Moreover, it is worth noting that the channel number of the feature maps doubles and their spatial size halves after each encoding stage.
(2) Bottleneck Stage: In the bottleneck stage, there are a large number of encoded feature maps, but each one is relatively small in size compared to those in the encoding stage. To better use these features in the decoding stage, we introduce the Attention-Guided Transformer Module—Bottleneck (AGTM-B). Unlike AGTM-T in the encoding stage, the guiding blocks in AGTM-B aim to further enhance the low-level encoded features. By applying AGTM-Bs, the model can continuously strengthen different facial features and focus on a broader range of facial structures.
(3) Decoding Stage: The decoding stage of the proposed method aims to reconstruct high-quality face images by utilizing the previously extracted and refined multi-scale features. In this stage, low-level features are initially fed into the Pixel-Related Upsample Module (PRUM). The module halves the feature map channel number and doubles the feature map size, which is the opposite of the PRDM in the encoding stage. After this, the upsampled features are combined with features from other scales using the Multi-scale Feature Fusion Module (MFFM) to enhance network flexibility and achieve better restoration performance. The combined features are then fed to the AGTM-T for further refinement of image details. Finally, a convolutional layer is utilized to convert the learned feature maps into the output face image. The final SR face image $I_{SR}$ is obtained by adding the LR face image $I_{LR}$, which has been upsampled to the same size as the HR image through bicubic interpolation, to this output.
Moreover, to optimize the performance of FSR, the proposed model is supervised by minimizing the following pixel-level loss function:
$$\mathcal{L}_{\mathrm{pix}} = \frac{1}{N}\sum_{i=1}^{N}\left\| I_{SR}^{i} - I_{HR}^{i} \right\|_{1},$$
where N denotes the number of training images, and $I_{SR}^{i}$ and $I_{HR}^{i}$ are the i-th SR and ground-truth HR face images in the training dataset, respectively.
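For concreteness, the following is a minimal PyTorch sketch of this supervision under two assumptions: the network output is added to the bicubic-upsampled LR image as a global residual (as described in the decoding stage above), and the pixel-level loss is an L1 distance averaged over the images. The network `fsr_net`, the ×8 scale factor, and the tensor sizes are illustrative placeholders, not the released implementation.

```python
import torch
import torch.nn.functional as F

def sr_forward(fsr_net, lr, scale=8):
    """Global residual reconstruction: network output + bicubic-upsampled LR image."""
    base = F.interpolate(lr, scale_factor=scale, mode="bicubic", align_corners=False)
    return base + fsr_net(lr)

def pixel_loss(sr, hr):
    """Pixel-level L1 loss averaged over the batch (a stand-in for the sum over N images)."""
    return F.l1_loss(sr, hr)

# Usage sketch with a dummy residual branch that predicts zeros.
dummy_net = lambda x: torch.zeros(x.size(0), 3, x.size(2) * 8, x.size(3) * 8)
lr_batch = torch.rand(2, 3, 16, 16)      # illustrative LR size for x8 SR
hr_batch = torch.rand(2, 3, 128, 128)
print(pixel_loss(sr_forward(dummy_net, lr_batch), hr_batch).item())
```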
3.2. Attention-Guided Transformer Module (AGTM)
As the pivotal component of the proposed method, AGTM comprises two blocks: the Attention Guiding Block (AGB) and the Channel-wise Multi-head Transformer Block (CMTB). To ensure the feature extraction and enhancement quality on multi-scale features, the AGB has been bifurcated into two separate types: the Attention Guiding Block—Top (AGB-T) in the encoding/decoding stage and the Attention Guiding Block—Bottleneck (AGB-B) in the bottleneck stage. AGTM at the top of the encoder-decoder network (AGTM-T) promotes both local facial details and global facial structures, while AGTM at the bottleneck side (AGTM-B) optimizes the encoded low-level features. Furthermore, noticing that the usual spatial-wise transformers are limited to position-specific windows and their partition strategy may potentially alter the structure of facial images, the Channel-wise Multi-head Transformer Block (CMTB) is introduced here to achieve an image-size receptive field by utilizing feature map channels. The AGB and CMTB are complementary and can facilitate the simultaneous promotion of both local facial details and global facial structures.
3.2.1. Attention Guiding Block—Top (AGB-T)
AGB-T, which aims to locate and guide both local and global facial structures for the following transformer module, is illustrated in
Figure 3a. It can be roughly divided into three parts: the Feature Distillation Network (FDN), the Hourglass Block, and the Channel Attention (CA) network. The FDN aims to distill feature information from multiple levels of receptive fields within the input feature maps. Firstly, a
convolutional layer is applied to the input feature maps to halve the channel number and select the internal principal components; then another
convolutional layer is used to restore the major information of the input feature maps by doubling the channel number. Then, the original and the processed feature maps are concatenated and sent to a full connection layer followed by a
convolutional layer to fully utilize the hierarchical features. After that, a CA network is applied to highlight the critical feature map channels, followed by a
convolutional layer to refine the distilled feature maps. Finally, a residual learning mechanism is applied to avoid the gradient vanishing problem. Following the distillation process of the FDN, the Hourglass Block [
54], which has demonstrated its efficacy in generating spatial attention maps [
13], is employed to capture landmark features of the human face, such as the eyes, nose, and mouth. Once the feature information is appropriately processed, the CA network [
38] is utilized to select and emphasize the feature map channels that contain richer features. Thanks to this well-designed structure, which distills internal principal features and combines spatial and channel attention, the proposed AGB can successfully guide the following transformer block toward the essential parts of the face images for better reconstruction results.
After all the above, the final attention map for CMTB is generated by applying a convolutional layer followed by a sigmoid function. Then, input feature maps are element-wise multiplied with the attention map and fed to the following transformer block with better extracted spatial features and promoted channel information. Moreover, a residual connection with a full connection layer is also applied between the input and the CA network output to stabilize the training process.
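To make the data flow of the AGB-T tangible, below is a hedged PyTorch sketch following the description above. The kernel sizes, the channel-reduction ratio, and the tiny encoder-decoder standing in for the hourglass block are assumptions made for illustration only; the actual layer configurations may differ.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: global pooling, two 1x1 convs, sigmoid gating."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)

class FDN(nn.Module):
    """Feature Distillation Network sketch: squeeze/restore channels, fuse with the
    input, re-weight with channel attention, and add a residual connection."""
    def __init__(self, channels):
        super().__init__()
        self.squeeze = nn.Conv2d(channels, channels // 2, 1)   # halve the channel number
        self.restore = nn.Conv2d(channels // 2, channels, 1)   # restore the channel number
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1),              # "full connection" over channels
            nn.Conv2d(channels, channels, 3, padding=1))
        self.ca = ChannelAttention(channels)
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        distilled = self.restore(self.squeeze(x))
        fused = self.fuse(torch.cat([x, distilled], dim=1))
        return x + self.refine(self.ca(fused))                 # residual learning

class AGBTop(nn.Module):
    """AGB-T sketch: FDN -> hourglass-style spatial attention -> channel attention ->
    sigmoid attention map multiplied onto the block input."""
    def __init__(self, channels):
        super().__init__()
        self.fdn = FDN(channels)
        # Placeholder for the hourglass block (assumes even spatial sizes).
        self.hourglass = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1))
        self.ca = ChannelAttention(channels)
        self.skip = nn.Conv2d(channels, channels, 1)           # residual "full connection" path
        self.to_map = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        feats = self.ca(self.hourglass(self.fdn(x))) + self.skip(x)
        return x * self.to_map(feats)                          # guided input for the CMTB
```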
3.2.2. Attention Guiding Block—Bottleneck (AGB-B)
Different from the above AGB-T, AGB-B in the bottleneck stage is designed to target and guide the low-level encoded features. The channel number of feature maps in the bottleneck stage is relatively large, but the size of each feature is relatively small compared to those in the encoding stage. Therefore, it is crucial to implement a dynamic selection mechanism that adaptively enables each neuron to adjust its receptive field size. Here, we introduce the selective kernel (SK) network [
55] to the AGB-B, which is shown in
Figure 3b. In the SK network, the input feature maps first pass through two convolution layers with different receptive fields, each followed by a batch normalization layer and a ReLU layer. The upper and lower outputs are denoted as $\widetilde{U}$ and $\widehat{U}$, respectively. Then, these output feature maps are elementwise summed and traverse a global average pooling (GAP) layer to generate channel-wise statistics $z$ that cover the different receptive fields. After that, the inner feature maps are sent through two full connection layers to enable guidance for the adaptive selection. Lastly, a soft attention layer is applied across different channels to selectively extract information from the different receptive fields. Here, we use the notations $Az$ and $Bz$ to represent the upper and lower inputs of the Select layer in
Figure 3b, where $A$ and $B$ denote the weights of the previous inner feature map processing and C denotes the number of channels of the inner feature maps; the output weights are:
$$a_c = \frac{e^{A_c z}}{e^{A_c z} + e^{B_c z}}, \qquad b_c = \frac{e^{B_c z}}{e^{A_c z} + e^{B_c z}},$$
where $a_c$ denotes the c-th element of $a$, and likewise for $b_c$, $A_c$, and $B_c$. The final attention maps of the SK network are obtained by applying these attention weights to the inner feature maps from the various receptive fields:
$$V_c = a_c \cdot \widetilde{U}_c + b_c \cdot \widehat{U}_c, \qquad a_c + b_c = 1,$$
where $V = [V_1, V_2, \ldots, V_C]$ denotes the output attention maps, with $V_c \in \mathbb{R}^{H \times W}$. H and W denote the height and width of the feature maps, respectively.
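The selection mechanism described above can be sketched in PyTorch as follows. The branch kernels (a plain and a dilated 3×3 convolution), the reduction ratio of the two fully connected layers, and the two-branch restriction are illustrative assumptions based on the cited SK design, not the exact configuration of the AGB-B.

```python
import torch
import torch.nn as nn

class SKGuidingBlock(nn.Module):
    """Selective-kernel style guiding block: two branches with different receptive
    fields are fused by channel-wise soft attention (a_c + b_c = 1 per channel)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        def branch(kernel, dilation=1):
            pad = dilation * (kernel - 1) // 2
            return nn.Sequential(
                nn.Conv2d(channels, channels, kernel, padding=pad, dilation=dilation),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.upper = branch(3)               # smaller receptive field
        self.lower = branch(3, dilation=2)   # larger receptive field (dilated stand-in)
        hidden = max(channels // reduction, 8)
        self.fc = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True),
                                nn.Linear(hidden, 2 * channels))  # two FC layers -> A_c z, B_c z

    def forward(self, x):
        u_tilde, u_hat = self.upper(x), self.lower(x)
        z = (u_tilde + u_hat).mean(dim=(2, 3))          # global average pooling statistics
        logits = self.fc(z).view(x.size(0), 2, -1)      # [B, 2, C]
        weights = torch.softmax(logits, dim=1)          # soft "Select": a_c and b_c
        a = weights[:, 0].unsqueeze(-1).unsqueeze(-1)
        b = weights[:, 1].unsqueeze(-1).unsqueeze(-1)
        return a * u_tilde + b * u_hat                  # V_c = a_c * U~_c + b_c * U^_c
```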
3.2.3. Channel-Wise Multi-Head Transformer Block (CMTB)
Following the pre-processing of the inner feature maps with guiding blocks, there still remains a demand for effectively aggregating previous feature data across various channels to facilitate high-quality face image restoration. However, the usual spatial-wise transformers are limited to position-specific windows, and their partition strategy may potentially alter the structure of the facial image [
19]. To address this limitation, we introduce the Channel-wise Multi-head Transformer Block (CMTB)—a novel approach capable of achieving image-size receptive fields based on channels rather than position-specific windows. Furthermore, CMTB is more computation-friendly, rendering it a suitable match for the previous guiding blocks. As depicted in
Figure 4, CMTB comprises two key components: the Channel-wise Multi-head Self-attention Network (CMSN) and the Gated-Dconv Feed-Forward Network (GDFN). While CMSN serves as the primary component, GDFN aims to encode information from spatially neighboring pixel positions to enable effective learning of local image structures.
CMTB has been proposed to achieve image-size receptive fields based on different channels of feature maps rather than position-specific windows. Suppose feature maps $X \in \mathbb{R}^{H \times W \times C}$ are the input of the CMSN, which are reshaped into tokens $X' \in \mathbb{R}^{HW \times C}$ based on channels. Here, H, W, and C denote the height, width, and channel number of the feature maps, respectively. Then $X'$ is linearly projected to obtain three different matrices: query $Q$, key $K$, and value $V$:
$$Q = X' W^{Q}, \qquad K = X' W^{K}, \qquad V = X' W^{V},$$
where $W^{Q}$, $W^{K}$, and $W^{V}$ are learnable parameters; biases are omitted here for simplification. Afterwards, $Q$, $K$, and $V$ are split into N heads along the channel dimension: $Q = [Q_1, \ldots, Q_N]$, $K = [K_1, \ldots, K_N]$, $V = [V_1, \ldots, V_N]$, where the dimension of each head is $d = C/N$. Therefore, the self-attention matrix for the i-th head is:
$$\mathrm{Attention}(Q_i, K_i, V_i) = V_i \, \mathrm{Softmax}\!\left(\alpha_i \, Q_i^{\top} K_i\right),$$
where $Q_i^{\top}$ denotes the transposed matrix of $Q_i$. By implementing this reshape strategy, the size of the generated attention maps becomes $\frac{C}{N} \times \frac{C}{N}$ instead of $HW \times HW$, which greatly reduces the computational complexity. Moreover, a learnable parameter $\alpha_i$ is introduced to further improve the flexibility of the network. Subsequently, the N head outputs are concatenated and fed to a full connection layer. The resulting attention matrix is then added with the embedding values from the position embedding generator:
$$Y' = \mathrm{Concat}\left(\mathrm{Attention}(Q_1, K_1, V_1), \ldots, \mathrm{Attention}(Q_N, K_N, V_N)\right) W^{O} + \mathrm{PEG}(V),$$
where $W^{O}$ are learnable parameters. $\mathrm{PEG}(\cdot)$ represents the position embedding generator, which is designed to encode the position information from various channel dimensions. It contains a depth-wise convolution layer with a stride of 1 followed by a GELU layer [
56] and another depth-wise convolution layer with a stride of 1. Finally, the output feature maps $Y \in \mathbb{R}^{H \times W \times C}$ can be calculated by reshaping the result of Equation (6).
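Below is a hedged PyTorch sketch of the channel-wise multi-head self-attention described above. The exact order of the matrix products, the point at which the position embedding generator is attached, and the 3×3 depth-wise kernel size are assumptions made to keep the example runnable; the key point it illustrates is that the attention map has size (C/N)×(C/N) rather than HW×HW.

```python
import torch
import torch.nn as nn

class CMSN(nn.Module):
    """Channel-wise multi-head self-attention sketch: attention maps are (C/N)x(C/N),
    built from channel tokens instead of position-specific spatial windows."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        assert channels % num_heads == 0
        self.heads = num_heads
        self.qkv = nn.Linear(channels, 3 * channels, bias=False)  # biases omitted, as in the text
        self.alpha = nn.Parameter(torch.ones(num_heads, 1, 1))    # learnable scale per head
        self.proj = nn.Linear(channels, channels)                 # full connection after concat
        # Position embedding generator: depth-wise conv -> GELU -> depth-wise conv (stride 1).
        self.peg = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels))

    def forward(self, x):                        # x: [B, C, H, W]
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # [B, HW, C]
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        # Split channels into heads: [B, heads, HW, C/heads]
        split = lambda t: t.reshape(b, h * w, self.heads, c // self.heads).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q.transpose(-2, -1) @ k * self.alpha, dim=-1)  # [B, heads, C/N, C/N]
        out = (v @ attn).transpose(1, 2).reshape(b, h * w, c)               # back to [B, HW, C]
        out = self.proj(out).transpose(1, 2).reshape(b, c, h, w)
        return out + self.peg(x)                 # add channel-position embedding values
```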
Additionally, we introduce GDFN [
57] to effectively learn local image structures by encoding information from spatially neighboring pixel positions. Given feature maps $X$ as the input of the GDFN, the output $Y$ can be obtained by:
$$Y = \mathrm{FC}\!\left(\phi\!\left(\mathrm{DWConv}_1(\mathrm{FC}_1(X))\right) \odot \mathrm{DWConv}_2(\mathrm{FC}_2(X))\right),$$
where $\mathrm{DWConv}(\cdot)$ and $\mathrm{FC}(\cdot)$ denote the depth-wise convolution layers and the full connection layers, respectively, and $\phi$ represents the GELU non-linearity.
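The GDFN can be sketched in PyTorch as follows; the channel expansion factor and the 3×3 depth-wise kernel are assumptions, and the gating structure follows the cited GDFN design [57].

```python
import torch
import torch.nn as nn

class GDFN(nn.Module):
    """Gated-Dconv feed-forward network sketch: two parallel 1x1-conv + depth-wise 3x3
    branches; one branch is passed through GELU and gates the other element-wise."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.fc_in = nn.Conv2d(channels, 2 * hidden, 1)               # "full connection" over channels
        self.dwconv = nn.Conv2d(2 * hidden, 2 * hidden, 3, padding=1, groups=2 * hidden)
        self.act = nn.GELU()
        self.fc_out = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):
        gate, value = self.dwconv(self.fc_in(x)).chunk(2, dim=1)
        return self.fc_out(self.act(gate) * value)                    # gated local mixing
```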
The AGB and CMTB are two components that work in tandem to enhance facial features and strengthen the inner feature map relationship. The AGB is responsible for extracting and guiding the key features from the inner feature maps, while the CMTB aggregates and refines the previously extracted feature information. By leveraging both these blocks, the AGTM can concurrently enhance local facial details and global facial structures, making it a promising solution for face image reconstruction tasks.
3.3. Multi-Scale Feature Fusion Module (MFFM)
The importance of multi-scale feature information in the image reconstruction process has been proven by the successive pyramid super-resolution networks [
13,
20]. However, the pyramid methods mentioned above reconstruct high-resolution images only from adjacent layers, limiting the FSR performance. To further utilize the multi-scale feature information and enable the network with better feature representation capabilities, we introduce the Multi-scale Feature Fusion Module (MFFM), which is shown on the bottom side of
Figure 1.
The first step of the MFFM is to unify the sizes of the multi-scale feature maps. Noticing that the magnification scale between adjacent layers in the proposed method is always 2, we introduce a convolution layer with a stride of 2 and a transposed convolution layer with a stride and padding of 2 for the down-scale and up-scale processes, respectively. Please note that the down-scale and up-scale mentioned here represent M-times downsampling and N-times upsampling, respectively. Furthermore, for larger magnification scales, two or more convolution or transposed convolution layers are stacked accordingly. Considering that the MFFM is trained for residual compensation of the multi-scale feature maps, a single convolution or transposed convolution layer is applied here to modify the feature map size instead of the PRUM/PRDM for simplicity. After resizing the multi-scale feature maps to a uniform size, they are concatenated, passed through a full connection layer, and then fed to the CA network to highlight the essential channels. Finally, the fused multi-scale features are integrated with the target feature map layer from the encoding stage.
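A hedged sketch of this fusion procedure is given below. The transposed-convolution kernel size of 6 (chosen so that a stride and padding of 2 exactly double the spatial size), the 3×3 stride-2 downsampling kernel, and the SE-style channel attention are assumptions for illustration; only the overall rescale, concatenate, fuse, attend, and add flow is taken from the description above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention used to highlight the essential fused channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
                                  nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)

def rescale_block(channels, log2_scale):
    """Stack of stride-2 (transposed) convolutions changing the spatial size by
    2**log2_scale (negative = downsample), one layer per factor of two."""
    layers = []
    for _ in range(abs(log2_scale)):
        if log2_scale > 0:
            layers.append(nn.ConvTranspose2d(channels, channels, 6, stride=2, padding=2))
        else:
            layers.append(nn.Conv2d(channels, channels, 3, stride=2, padding=1))
    return nn.Sequential(*layers) if layers else nn.Identity()

class MFFM(nn.Module):
    """Multi-scale feature fusion sketch: rescale, concatenate, fuse, channel-attend,
    then add to the target-level features as residual compensation."""
    def __init__(self, source_channels, source_log2_scales, target_channels):
        super().__init__()
        self.rescalers = nn.ModuleList(
            [rescale_block(c, s) for c, s in zip(source_channels, source_log2_scales)])
        self.fuse = nn.Conv2d(sum(source_channels), target_channels, 1)  # "full connection" layer
        self.ca = ChannelAttention(target_channels)

    def forward(self, sources, target):
        resized = [r(f) for r, f in zip(self.rescalers, sources)]
        return target + self.ca(self.fuse(torch.cat(resized, dim=1)))
```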
4. Experiments
4.1. Dataset and Metrics
The proposed model is trained on the CelebA dataset [
22], and its performance is evaluated on both the CelebA and Helen datasets, as well as on real face images. During the data preprocessing phase, we crop the images around their center point and treat them as the ground truth. After that, we obtain the LR face images from the ground truth using an 8× down-scale bicubic operation. It is worth noting that no additional facial landmarking is required on the datasets to train the model. We train the model on 18,000 face images from the CelebA dataset and evaluate its performance on 1000 faces from the same dataset, along with 50 faces from the Helen dataset. Additionally, we directly apply the same model trained on CelebA to the Helen dataset and real face images to evaluate the flexibility of the model.
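The preprocessing can be sketched as follows. The HR crop size is a placeholder (the exact value used in the paper is not reproduced here), while the 8× bicubic down-scale matches the ×8 SR setting used throughout the experiments.

```python
from PIL import Image

def make_lr_hr_pair(image_path, hr_size=128, scale=8):
    """Center-crop to an hr_size square (placeholder crop size) and bicubic-downsample
    by `scale` to obtain the LR input paired with the ground-truth HR image."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    left, top = (w - hr_size) // 2, (h - hr_size) // 2
    hr = img.crop((left, top, left + hr_size, top + hr_size))            # ground-truth HR
    lr = hr.resize((hr_size // scale, hr_size // scale), Image.BICUBIC)  # 8x down-scale
    return lr, hr
```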
To evaluate the quality of the SR results, three image quality assessment metrics are introduced: Peak Signal-to-Noise Ratio (PSNR) [
58], Structural Similarity (SSIM) [
59], and Learned Perceptual Image Patch Similarity (LPIPS) [
60].
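As a reference, the three metrics can be computed with off-the-shelf packages roughly as follows (scikit-image for PSNR/SSIM and the lpips package for LPIPS); the AlexNet backbone for LPIPS and the uint8 value range are assumptions, not necessarily the settings used in the paper.

```python
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
import lpips

lpips_fn = lpips.LPIPS(net="alex")   # learned perceptual metric (lower is better)

def evaluate(sr, hr):
    """sr, hr: uint8 RGB arrays of shape [H, W, 3] in [0, 255]."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, channel_axis=2, data_range=255)
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).float()[None] / 127.5 - 1.0
    lp = lpips_fn(to_tensor(sr), to_tensor(hr)).item()   # LPIPS expects inputs in [-1, 1]
    return psnr, ssim, lp
```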
4.2. Implementation Details
All experiments are conducted using PyTorch [
61] on an NVIDIA GeForce RTX 4090 24 GB graphics card. The proposed model is optimized using Adam with momentum parameters $\beta_1$ and $\beta_2$ and a fixed learning rate.
4.3. Ablation Studies
To assess the effectiveness of the individual model modules, we conduct a series of ablation studies on the CelebA test set for ×8 SR.
(1) Study on AGTM-T: AGTM-T is a module that combines an AGB-T and a CMTB to extract and promote both local facial details and global facial structures. This module marks the first attempt in the transformer-based FSR area to explore the potential of inner feature map information with a guiding block to reconstruct plausible face images. To test its effectiveness, we design three test models by removing different parts of the module, the results of which are shown in
Table 1.
From the table, we can observe that:
(a) The performance decreases dramatically when the AGTM-T module is completely removed. Without the AGTM-T, the proposed model structure becomes considerably shallower, which makes it difficult to refine the input features. Moreover, the Multi-scale Feature Fusion Module (MFFM), which is complementary to the AGTM-T, is also greatly affected.
(b) The AGTM-T with a single component performs better than the variant with no components mentioned above. This demonstrates that both the AGB-T and CMTB benefit the learning ability of the proposed model. However, the AGTM-T with only the AGB-T loses its guiding target, while the AGTM-T with only the CMTB cannot focus on the crucial feature parts, limiting its performance.
(c) The carefully designed components AGB-T and CMTB ensure that the AGTM-T achieves the best performance on all evaluation metrics. This proves that the AGB-T and CMTB are complementary and can simultaneously enhance local facial details and global facial structures.
(2) Study on AGTM-B: AGTM-B, which contains an AGB-B and a CMTB, aims to enhance the low-level encoded features. Experiments similar to those in the previous section are conducted, and the results are shown in
Table 2. We arrive at observations and conclusions comparable to those in the preceding AGTM-T study. However, we notice that the model without AGTM-B performs better than the one without AGTM-T. This is because AGTM-T and MFFM are more closely complementary than AGTM-B and MFFM, so removing AGTM-T results in a further decline in the performance of the proposed method.
Furthermore, we also evaluate the model with different numbers of AGTM-Bs, and the results are shown in
Table 3. It can be observed that the performance of the model is poor without any AGTM-B, suggesting that AGTM-B plays a crucial role in the model. Meanwhile, we also notice that the performance improves as the number of AGTM-Bs increases within a specific range. However, when the number of AGTM-Bs exceeds 4, the evaluation metrics improve more slowly and the performance even decreases slightly. Therefore, to maintain a good balance between model size and performance, we set the number of AGTM-Bs to 4.
(3) Study on MFFM: MFFM is specially designed to integrate features from all layers to improve network flexibility and restoration performance. In this part, we create three different multi-scale feature fusion models to demonstrate the effectiveness of the MFFM, the results of which are shown in
Table 4.
It can be observed from the table that: (a) The experiment demonstrates the importance of incorporating multi-scale features in the image reconstruction process, since the model without multi-scale feature fusion performs the worst. (b) Using an addition or concatenation layer to fuse multi-scale features proves beneficial. However, these simple techniques are inadequate for the complex multi-scale feature fusion process. (c) The model with the carefully designed MFFM achieves the best performance regarding PSNR, SSIM, and LPIPS. This proves that a suitable feature fusion strategy like MFFM benefits the image reconstruction process.
(4) Study on PRUM/PRDM: The PRUM/PRDM aims to establish direct relationships among adjacent pixels for better face structure preservation and to further strengthen the overall image reconstruction performance. This is the first attempt to establish direct relationships among adjacent pixels when reconstructing highly structured face images in the transformer-based FSR area. In this part, we compare its performance with the usual deconvolution layers, the results of which are shown in
Table 5.
It can be observed that the pixel deconvolutional and pixel shuffle layers obtain only barely satisfactory reconstruction results. This is because the inner feature maps produced by these layers have no direct pixel relationship. In contrast, our proposed PRUM/PRDM achieves better reconstruction results due to its ability to preserve face structure based on relative pixel positions.
4.4. Comparison with the State-of-the-Arts
To demonstrate the effectiveness of our proposed method, we conduct a comparison with several state-of-the-art methods. These include two general CNN-based super-resolution methods (SRResNet [
29] and RCAN [
38]), three attention-based methods (SPARNet [
10], SISN [
11], and IGAN [
39]), and two transformer-based methods (SwinIR [
17] and Uformer [
15]). We evaluate these methods on the CelebA and Helen datasets, along with the real face images. In addition, we apply bicubic interpolation as the baseline for comparison. All models are trained on the same CelebA dataset to ensure a fair comparison. The quantitative results are tabulated in
Table 6.
(1) Comparison on CelebA dataset: Quantitative comparisons of the proposed method with other existing methods on the CelebA dataset are presented in
Table 6. As per the table, our proposed method outperforms other competitive methods in terms of PSNR, SSIM, and LPIPS, which implies that our method has the advantage of recovering realistic face details. We have also provided some test images from the CelebA dataset for visual comparisons, which are shown in
Figure 6. Benefiting from the guiding blocks pointing to the key features and the pixel-related upsample layer preserving face structures, the proposed method can generate more precise nose contours and eye details while avoiding creating unpleasant artifacts compared with other state-of-the-art methods.
(2) Comparison on Helen dataset: Aiming to prove the flexibility of the proposed model, we assess its performance on the Helen dataset using the same model trained on CelebA. We present a quantitative and visual comparison of the proposed method with others on the Helen dataset in
Table 6 and
Figure 7, respectively. According to the results, the proposed method still shows superiority in restoring facial images both quantitatively and qualitatively. This proves the robustness and stability of the proposed method. However, it is worth noting that all methods experience a decrease in performance when the training and testing images are not from the same dataset. Therefore, investigating the style differences among various datasets will be a promising way to enhance the generality of FSR methods in the future.
(3) Comparison on real face images: Restoring face images from real-world environments is a challenging task due to the complexity of the captured images. Although the CelebA dataset is a good source for simulating face images, it cannot replicate all the complexities of real-life scenarios. In order to test the effectiveness of the proposed method in restoring real-world face images, we conduct experiments on low-quality face images collected from the classic TV series “Friends”. It was shot in the 1990s with lower-quality imaging equipment and suffers from severe low-resolution issues, making it well suited for testing. The experimental results are illustrated in
Figure 8. Benefiting from the guiding blocks pointing to the key features and the pixel-related upsample layer preserving face structures, our method reconstructs more detailed facial images with appealing facial structures compared with other state-of-the-art methods.
4.5. Noise Stress Test
Because noise from image sensors is randomly valued and randomly located, Gaussian noise best fits the image degradation model. However, other types of noise arising in specific situations could also challenge the performance of the proposed FSR method. Therefore, in this subsection we stress-test our model on seven different noise types: Gaussian, Poisson, Rayleigh, gamma, exponential, uniform, and salt-pepper.
Experiments in this subsection are conducted on the same 1000 test face images from the CelebA dataset as above. The noise images are multiplied by 0.3 and then added to the original HR images to simulate the noise degradation process, except for the salt-pepper noise, which operates directly on the original HR images. All the noise generation models can be obtained from NumPy [
62]. To make a fair comparison, we manually tune the parameters of these noise models so that the PSNR of their HR outputs falls within 26.0–26.5 dB. The detailed parameters of the noise models are as follows: the Gaussian noise has a mean value of 0 and a standard deviation of 70; the Poisson noise has a lambda value of 50, while the Rayleigh noise has a scale of 40; the gamma noise has a shape of 7 and a scale of 7; the exponential noise has a scale value of 43; the uniform noise has a low value of 20 and a high value of 80; lastly, the probability of the salt-pepper noise is set to 0.01. The noise images and their HR outputs are shown in
Figure 9.
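The degradation described above can be reproduced roughly with NumPy as in the sketch below, using the parameters listed in the text; splitting the 0.01 salt-pepper probability evenly between salt and pepper is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(hr, kind):
    """hr: float array in [0, 255], shape [H, W, 3]. Noise images are scaled by 0.3 and
    added to the HR image; salt-pepper noise operates directly on the pixels."""
    h, w, c = hr.shape
    if kind == "gaussian":
        noise = rng.normal(0, 70, (h, w, c))
    elif kind == "poisson":
        noise = rng.poisson(50, (h, w, c))
    elif kind == "rayleigh":
        noise = rng.rayleigh(40, (h, w, c))
    elif kind == "gamma":
        noise = rng.gamma(7, 7, (h, w, c))
    elif kind == "exponential":
        noise = rng.exponential(43, (h, w, c))
    elif kind == "uniform":
        noise = rng.uniform(20, 80, (h, w, c))
    elif kind == "salt_pepper":
        out = hr.copy()
        mask = rng.random((h, w))
        out[mask < 0.005] = 0          # pepper
        out[mask > 0.995] = 255        # salt (total corruption probability 0.01)
        return np.clip(out, 0, 255)
    else:
        raise ValueError(kind)
    return np.clip(hr + 0.3 * noise, 0, 255)
```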
These noise-impaired HR images are 8× downscaled and subsequently passed through the proposed method. Additionally, we include the bicubic interpolation results as the baseline and choose the best-performing comparative method, Uformer, for a closer comparison.
Table 7 and
Figure 10 show the quality evaluation metrics and visual comparisons, respectively.
(a) All methods suffer varying degrees of performance reduction when reconstructing noisy face images. Uformer, which has proven effective for denoising, requires a dedicated denoising dataset for training; otherwise, it is unable to reconstruct reasonable images. In contrast, the proposed method, which introduces the guiding blocks and the PRUM to mine and preserve face structures, successfully reconstructs face images under noise.
(b) Noise types such as Gaussian and salt-pepper influence the face structures (i.e., SSIM) much more than the others. However, they can easily be handled by the proposed method. This is because natural face images captured by image sensors always contain Gaussian noise, which makes the proposed method familiar with this kind of degradation. The salt-pepper noise influences only a few pixels, and the 8× degradation further weakens its impact; therefore, it can be overcome by the face-structure-preserving modules of the proposed method.
(c) The Poisson, Rayleigh, gamma, exponential, and uniform noises greatly affect the performance of all FSR methods, which demonstrates their strong blurring effect on images. More attention needs to be paid to overcoming the influence of these types of noise.
4.6. Face Recognition Results
To further prove that the proposed method can recover crucial facial structures that are essential in distinguishing different faces, we also perform face recognition as a measurement. Specifically, we choose the commonly used LFW [
63] dataset as the face recognition database. Then, several images are randomly picked, downsampled, and super-resolved with different FSR methods to serve as the reference images. After that, for every reference we select face images with the same identity and with other identities as test images. Finally, we adopt a pre-trained face recognition model, Deepface [
64], to perform face recognition. Moreover, we also evaluate Uformer along with the proposed method for better comparison and use bicubic interpolation as the baseline. The Receiver Operating Characteristic (ROC) curves can be seen in
Figure 11.
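For reference, a single identity check with the deepface package looks roughly like the snippet below; the file paths are placeholders, and sweeping a threshold over the returned distance scores across many reference/probe pairs is an assumed way of tracing an ROC curve like the one in Figure 11.

```python
from deepface import DeepFace

# Verify whether a super-resolved reference face and a probe image share an identity.
# The paths are placeholders; the returned dict contains a boolean "verified" flag
# and a "distance" score that can be thresholded when building an ROC curve.
result = DeepFace.verify(img1_path="sr_reference.jpg", img2_path="probe.jpg")
print(result["verified"], result["distance"])
```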
(a) The performance of Deepface [
64] on the original HR images is excellent, which proves the significant improvement in the face recognition field based on the deep-learning network.
(b) SR images with bicubic interpolation are difficult for Deepface [
64] to verify. This is reasonable due to the poor SR performance of the bicubic interpolation, which can be seen from the sections mentioned above.
(c) The SR images reconstructed by both Uformer and the proposed method obtain satisfactory face recognition performance. Moreover, the proposed method achieves a larger AUC, demonstrating its better performance in face recognition tasks and further proving its ability to recover crucial facial structures that are essential in distinguishing different faces.
4.7. Model Complexity Analysis
In previous experiments, the proposed method has demonstrated its superior ability in both quantitative and qualitative FSR performance. In this section, we compare its model performance, size, and execution time with other state-of-the-art methods, whose results are shown in
Figure 12. According to the figure, our method achieves the best quantitative results while maintaining comparable model size and execution time. Hence, the proposed approach strikes a better balance between model performance, size, and execution time compared to other state-of-the-art methods.