Most CNN-based SR methods suffer from several problems: the structural information of the image is not effectively exploited, only a single-scale convolution kernel is used to extract features, there is no effective fusion mechanism between adjacent feature extraction modules, and all channel features are treated equally. To tackle these issues, we propose a gradient-guided and multi-scale feature network for image super-resolution (GFSR) to reconstruct as much detailed information as possible.
In this section, we first present the overall framework of the network, and then describe in detail the gradient branch, the multi-scale convolution unit, the residual feature fusion block, and the adaptive channel attention block.
3.1. Network Structure
As can be seen from
Figure 1, our GFSR consists of two branches: the trunk branch and the gradient branch. On the trunk branch,
M Residual of Residual Inception Blocks (RRIB) and
N Residual of Residual Receptive Field Blocks (RRRFB) form a deep feature extraction module to acquire the deep features of the input image, and a feature fusion block (FFB) is designed to integrate the gradient features produced on the gradient branch into the trunk branch to generate the reconstructed image. On the gradient branch,
P RRIBs are used to extract features from the LR gradient map and receive the intermediate features at different levels produced on the trunk branch. Finally, the extracted gradient features from all RRIBs are used as an essential structural prior to guide the super-resolution process.
On the trunk branch, the shallow feature F_0 of the input image I_LR can be obtained by two cascaded 3 × 3 convolutional (Conv) layers,
F_0 = H_SF(I_LR),
where H_SF(·) stands for the shallow feature extraction function.
Then, F_0 is sent to the deep feature extraction module consisting of M RRIBs and N RRRFBs to further acquire the corresponding deep features. Let H_RRIB(·) and H_RRRFB(·) respectively denote the functions of RRIB and RRRFB, and let F_up denote the features obtained by the upsampling module; then we can get
F_M = H_RRIB,M(··· H_RRIB,1(F_0) ···),
F_N = H_RRRFB,N(··· H_RRRFB,1(F_M) ···),
F_up = H_UP(F_N ⊕ F_0),
where F_M represents the output features obtained by the M RRIBs, F_N represents the output features obtained by the N RRRFBs, ⊕ stands for the element-wise addition operation, and H_UP(·) represents the upsampling operation.
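To make the trunk-branch data flow concrete, a minimal PyTorch-style sketch is given below; the placeholder block implementations and hyper-parameters such as M, N, and the sub-pixel upsampling design are illustrative assumptions rather than the exact GFSR implementation.

```python
import torch
import torch.nn as nn

class TrunkBranchSketch(nn.Module):
    """Sketch of the trunk branch: shallow convs -> M RRIBs -> N RRRFBs -> upsampling."""

    def __init__(self, in_ch=3, n_feats=64, M=10, N=10, scale=4):
        super().__init__()
        # Two cascaded 3x3 convolutions extract the shallow feature F_0.
        self.shallow = nn.Sequential(
            nn.Conv2d(in_ch, n_feats, 3, padding=1),
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
        )
        # Plain 3x3 convolutions stand in for the actual RRIB / RRRFB blocks.
        self.rribs = nn.Sequential(*[nn.Conv2d(n_feats, n_feats, 3, padding=1) for _ in range(M)])
        self.rrrfbs = nn.Sequential(*[nn.Conv2d(n_feats, n_feats, 3, padding=1) for _ in range(N)])
        # Upsampling module (sub-pixel convolution is one common choice).
        self.upsample = nn.Sequential(
            nn.Conv2d(n_feats, n_feats * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr):
        f0 = self.shallow(lr)           # F_0 = H_SF(I_LR)
        f_m = self.rribs(f0)            # F_M after the M RRIBs
        f_n = self.rrrfbs(f_m)          # F_N after the N RRRFBs
        return self.upsample(f_n + f0)  # F_up = H_UP(F_N ⊕ F_0)
```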
On the gradient branch, the gradient map I_G of the input image is first calculated through the gradient function ∇(·), and is then passed through a series of modules (i.e., two Convs, P RRIBs, and an upsampling module) to obtain the corresponding gradient features F_G,
I_G = ∇(I_LR),
G_0 = H_GSF(I_G),
G_P = H_RRIB,P(··· H_RRIB,1(G_0) ···),
F_G = H_UP(G_P),
where G_0 denotes the shallow feature of the gradient map I_G obtained by two cascaded 3 × 3 Convs, and G_P denotes the deep features obtained by the P RRIBs.
Finally, the gradient features F_G produced on the gradient branch are fused into the trunk branch to generate the reconstructed image I_SR,
I_SR = H_FFB(F_up, F_G),
where H_FFB(·) denotes the feature fusion block.
3.2. Gradient Branch
To effectively exploit the gradient features of LR images, we design a gradient branch structure. On the gradient branch, the gradient map of the LR image is used to estimate that of the HR image, which serves as structural prior information to guide the super-resolution process.
The gradient map is generally acquired by computing the differences between adjacent pixels in the horizontal and vertical directions,
I_x(x, y) = I(x + 1, y) − I(x − 1, y),
I_y(x, y) = I(x, y + 1) − I(x, y − 1),
∇I(x, y) = (I_x(x, y), I_y(x, y)),
I_G(x, y) = ‖∇I(x, y)‖_2,
where (x, y) denotes the coordinate of a pixel, ∇ represents the gradient function, and ‖·‖_2 represents the second norm. Considering that the gradient map is close to zero in most regions of an image, it encourages the CNN to concentrate on the spatial relationships of the image structure, thereby estimating an approximate gradient map for the SR image.
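For illustration, the gradient map defined above can be computed as follows; this is a minimal PyTorch sketch operating on a (B, C, H, W) tensor, not necessarily the exact operator used in GFSR.

```python
import torch
import torch.nn.functional as F

def gradient_map(img: torch.Tensor) -> torch.Tensor:
    """Approximate gradient map of an image tensor of shape (B, C, H, W).

    Horizontal and vertical differences between neighboring pixels are
    combined with a per-pixel L2 norm, following the formulation above.
    """
    # Replicate-pad so the output keeps the spatial size of the input.
    padded = F.pad(img, (1, 1, 1, 1), mode="replicate")
    # Differences in the horizontal (x) and vertical (y) directions.
    grad_x = padded[:, :, 1:-1, 2:] - padded[:, :, 1:-1, :-2]
    grad_y = padded[:, :, 2:, 1:-1] - padded[:, :, :-2, 1:-1]
    # Per-pixel L2 norm of the two directional differences.
    return torch.sqrt(grad_x ** 2 + grad_y ** 2 + 1e-12)
```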
As shown in
Figure 1, several intermediate-level features on the trunk branch are sequentially propagated to the gradient branch, so the output of the gradient branch not only contains the image structure features, but also contains abundant detail and texture features. Such a design method is not only effectively supervised to recover the gradient map, but also significantly reduces the number of parameters of the gradient branch. At the same time, since the gradient map is able to directly reveal whether a particular region of the image is sharp or smooth, we fuse the gradient features obtained from the gradient branch to the features on the trunk branch to guide the super-resolution process. Specifically, in our network, the features from the 5th, 10th, 15th, and 20th modules in the trunk branch are propagated to the gradient branch, and then the gradient features obtained by the gradient branch are integrated into the trunk branch.
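A rough PyTorch sketch of the gradient branch, including the injection of intermediate trunk features, is given below. The number of taps, the 1 × 1 merge convolutions, and the placeholder RRIBs are assumptions for illustration; the gradient features it returns would then be passed to the feature fusion block, as described in Section 3.1.

```python
import torch
import torch.nn as nn

class GradientBranchSketch(nn.Module):
    """Gradient branch: shallow convs -> P RRIBs (with trunk-feature injection) -> upsampling."""

    def __init__(self, in_ch=3, n_feats=64, P=4, scale=4):
        super().__init__()
        self.shallow = nn.Sequential(
            nn.Conv2d(in_ch, n_feats, 3, padding=1),
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
        )
        # Plain 3x3 convolutions stand in for the P RRIBs of the gradient branch.
        self.rribs = nn.ModuleList([nn.Conv2d(n_feats, n_feats, 3, padding=1) for _ in range(P)])
        # 1x1 convs merge each injected trunk feature with the current gradient feature.
        self.merges = nn.ModuleList([nn.Conv2d(n_feats * 2, n_feats, 1) for _ in range(P)])
        self.upsample = nn.Sequential(
            nn.Conv2d(n_feats, n_feats * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, grad_map, trunk_feats):
        # grad_map: gradient map I_G of the LR input (see the sketch above).
        # trunk_feats: list of P intermediate features tapped from the trunk branch
        # (e.g., after the 5th, 10th, 15th, and 20th modules).
        g = self.shallow(grad_map)
        for rrib, merge, t in zip(self.rribs, self.merges, trunk_feats):
            g = rrib(merge(torch.cat([g, t], dim=1)))
        return self.upsample(g)  # gradient features F_G, later fused into the trunk branch
```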
3.3. Multi-Scale Convolution Unit
To improve the feature representation ability of the network, a common approach is to design a very deep network with a tremendous number of convolutional layers to explore richer features. Although this approach can improve network performance to a certain extent, it also brings other problems. On the one hand, extending the network depth is accompanied by a continuous expansion in the number of parameters, which may lead to overfitting when training data are inadequate. On the other hand, a network with a very large number of layers implies higher computational complexity and is prone to gradient explosion or vanishing, making network optimization difficult. Fortunately, the GoogLeNet [41] model solves these problems well. The key idea of the Inception module is to build a dense block structure: it utilizes multi-scale convolution kernels to acquire feature maps, and then combines the outputs of several branches before passing them to the following layer. This structure improves both the quality of the extracted features and the robustness to scale without increasing the network depth, thus improving the capability of feature extraction.
For the RIB structure, different from the original Inception module, we remove the pooling layer. Since the max-pooling operation keeps only the maximum value of the feature map within a certain area to generate the representative salient features of that area, it may cause the loss of a large number of detail features; the pooling operation is therefore often fatal for SR tasks. For example, a max-pooling operation with a stride of 2 retains only one value in each pooling window and discards the rest of the information in the feature map. For low-level tasks such as image super-resolution, the feature information of each pixel is very important. Consequently, we directly use strided convolutional layers to replace the pooling layer.
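As a simple illustration of this design choice (not code from GFSR), the snippet below contrasts a 2 × 2 max-pooling layer, which keeps only one value per window, with a stride-2 convolution, which downsamples while letting every input value contribute through learnable weights.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 48, 48)

# Max-pooling: only the largest value in each 2x2 window survives.
pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)                        # (1, 64, 24, 24)

# Strided convolution: a learnable downsampling that uses all input values.
downsampled = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)(x)   # (1, 64, 24, 24)
```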
For the SR task, we propose a multi-scale convolution unit with a parallel structure to extract receptive field features at different scales and recover as many image details as possible. Specifically, we first assemble convolution filters of various scales into the network, such as 1 × 1, 3 × 3, and 5 × 5. However, a larger-scale convolution kernel incurs higher computational costs and increases the time complexity. Consequently, in our network, we use small-scale filters instead of large-scale ones to reduce the number of parameters [42,43]. In fact, in our network, only 1 × 1 and 3 × 3 convolution filters are used, and the 5 × 5 filter is replaced by two cascaded 3 × 3 filters. Ref. [42] has shown that the features extracted by filters with a scale larger than 3 × 3 are typically weak, as such a filter can generally be simplified to a set of 3 × 3 filters, and the parameters of two cascaded 3 × 3 filters amount to only 18/25 of those of a single 5 × 5 filter (2 × 3 × 3 = 18 weights versus 5 × 5 = 25 per input–output channel pair). As shown in
Figure 2a, our RIB (Residual Inception Block) employs a multi-branch structure in which different convolution kernel scales correspond to different sizes of receptive fields; the kernel scales on the three branches are 1 × 1, 1 × 1 and 3 × 3, and a set of cascaded 3 × 3 filters, respectively. In this way, more features can be learned, and the features of the three branches are then fused to obtain the representative feature.
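A minimal PyTorch sketch of such a three-branch unit is given below; the per-branch channel width, the number of cascaded 3 × 3 layers, and the learnable weight on the identity path are illustrative assumptions rather than the exact RIB configuration.

```python
import torch
import torch.nn as nn

class RIBSketch(nn.Module):
    """Illustrative multi-scale (Inception-like) unit with a weighted residual connection."""

    def __init__(self, n_feats=64):
        super().__init__()
        branch_ch = n_feats // 2  # hypothetical per-branch width
        # Branch 1: 1x1 convolution.
        self.branch1 = nn.Conv2d(n_feats, branch_ch, 1)
        # Branch 2: 1x1 followed by 3x3.
        self.branch2 = nn.Sequential(
            nn.Conv2d(n_feats, branch_ch, 1),
            nn.Conv2d(branch_ch, branch_ch, 3, padding=1),
        )
        # Branch 3: a set of cascaded 3x3 convolutions (stand-in for a 5x5 receptive field).
        self.branch3 = nn.Sequential(
            nn.Conv2d(n_feats, branch_ch, 1),
            nn.Conv2d(branch_ch, branch_ch, 3, padding=1),
            nn.Conv2d(branch_ch, branch_ch, 3, padding=1),
        )
        # 1x1 fusion of the concatenated branch outputs back to n_feats channels.
        self.fuse = nn.Conv2d(branch_ch * 3, n_feats, 1)
        # Learnable weight on the identity path (initialized to one).
        self.res_weight = nn.Parameter(torch.ones(1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
        return self.res_weight * x + self.act(self.fuse(out))
```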
In addition, Ref. [42] also proved that asymmetric convolution kernels [44] can be used to replace conventional symmetric convolution kernels. In certain situations, asymmetric convolution kernels can learn more representative features than symmetric ones and reduce computational complexity without loss of accuracy. Furthermore, the receptive field of the human eye is related to the eccentricity of the retinal center [45], and the size of the receptive field varies with the eccentricity; this behavior can be conveniently mimicked by dilated convolution, which allows the size of the receptive field to be adjusted. Therefore, the asymmetric convolution kernel and the dilated convolution are separately introduced into the proposed multi-scale convolution unit to enhance its feature extraction ability.
Importantly, Ref. [42] observed that the asymmetric convolution unit performs poorly when placed at the front of the architecture and is better suited to the middle and later layers of the network, where it strengthens the expression ability. Following this suggestion, we place it in the middle and later layers of our architecture. In our proposed model, the RRFB (Residual Receptive Field Block) is utilized for posterior-layer feature extraction and can extract abundant detail features, especially texture and edge features. As shown in
Figure 2b, the RRFB we use is derived from the structure in
Figure 2a. Specifically, we utilize a set of 1 × 3 and 3 × 1 asymmetric convolution kernels rather than the 3 × 3 convolution kernels referenced previously. Meanwhile, we also introduce dilated convolution at the end of each branch structure to adjust the size of the receptive field for feature extraction and obtain better reconstruction performance. Finally, the features extracted from each branch are merged to obtain the final feature representation.
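In the same spirit, the following sketch adapts the previous unit by replacing the 3 × 3 kernels with 1 × 3/3 × 1 pairs and appending a dilated convolution to each branch; the dilation rate and channel widths are again assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RRFBSketch(nn.Module):
    """Illustrative receptive-field unit with asymmetric and dilated convolutions."""

    def __init__(self, n_feats=64, dilation=2):
        super().__init__()
        branch_ch = n_feats // 2  # hypothetical per-branch width

        def asym_pair(ch):
            # 1x3 followed by 3x1: an asymmetric replacement for a 3x3 kernel.
            return nn.Sequential(
                nn.Conv2d(ch, ch, (1, 3), padding=(0, 1)),
                nn.Conv2d(ch, ch, (3, 1), padding=(1, 0)),
            )

        def dilated(ch):
            # Dilated 3x3 convolution at the end of the branch enlarges the receptive field.
            return nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation)

        self.branch1 = nn.Sequential(nn.Conv2d(n_feats, branch_ch, 1), dilated(branch_ch))
        self.branch2 = nn.Sequential(nn.Conv2d(n_feats, branch_ch, 1), asym_pair(branch_ch),
                                     dilated(branch_ch))
        self.branch3 = nn.Sequential(nn.Conv2d(n_feats, branch_ch, 1), asym_pair(branch_ch),
                                     asym_pair(branch_ch), dilated(branch_ch))
        self.fuse = nn.Conv2d(branch_ch * 3, n_feats, 1)
        self.res_weight = nn.Parameter(torch.ones(1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
        return self.res_weight * x + self.act(self.fuse(out))
```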
To make the extracted features more expressive, we incorporate the residual structure into the RIB and RRFB, respectively, and assign weights to the input features.
3.4. Residual Feature Fusion Block
Inspired by the residual network, we design an adaptive weighted residual feature fusion block (RFFB) to address the weak relevance of contextual information. As shown in
Figure 3, our RFFB consists of multiple basic blocks for feature extraction and a local residual feature fusion block. To encourage exploration of the dependencies between the residual channel features, we add a local adaptive channel attention block (ACAB) after the RFFB module. The ACAB will be presented in detail in
Section 3.5.
A traditional residual module stacks a series of feature modules to construct a deep network, so the identity features must pass through a long connection path to merge with the residual features before the merged features are propagated to the subsequent feature extraction block. Thus, such a design only yields mixed features and does not take full advantage of the clean features extracted by each module, resulting in very localized feature utilization during network learning. More importantly, the contextual information of the image may be lost, and its relevance cannot be well expressed.
The residual feature fusion block is designed to fuse the feature information learned by all the feature extraction modules as much as possible. However, simply stacking all the feature information together accumulates an excessive number of features, introduces a considerable number of parameters, and dramatically increases the difficulty of training. To solve this problem, we first adaptively fuse the local features obtained by each feature extraction module and then propagate them to the subsequent layer of the structure. Inspired by MemNet [46], we introduce a 1 × 1 convolutional layer after the feature fusion block (FFB) to adaptively adjust the number of output features. As shown in
Figure 3, the residual features of the
B Basic Blocks (such as the RIB and RRFB introduced in
Section 3.1) are transferred directly to the FFB, then a convolutional layer is used to reduce the dimension of the fused features to the same dimension as the input features, and finally, the output of the RFFB is obtained by accumulating the identity feature and the fusion feature through element-wise addition. Assuming that F_{m−1} and F_m are the input and output of the m-th RFFB module, and F_{m,B} represents the output of the B-th Basic Block of the m-th RFFB module, we can obtain
F_m = λ_1 · F_{m−1} ⊕ λ_2 · δ(H_RFF([F_{m,1}, F_{m,2}, . . . , F_{m,B}])),
where H_RFF(·) denotes the residual feature fusion function, δ denotes the ReLU activation function, and [·] denotes concatenation.
To fully exploit the expressive power of residual features, as shown in
Figure 4, Jung et al. [47] proposed a weighted residual unit (wRU), which generates the weights of different residual units through a wSE (weighted Squeeze-and-Excitation) module. Although this method improves the representation ability of the residual unit, it introduces additional parameters and computational overhead for generating the weights. Inspired by wRU, we develop an adaptive weighted residual unit to adaptively learn the weights of the residual features. As shown in
Figure 3, λ_1 and λ_2 are learnable parameters. In our network, their initial values are both set to one, and their values are updated through continuous iterative learning.
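The PyTorch sketch below shows one way such an adaptively weighted residual feature fusion block could be organized: B stand-in Basic Blocks, a 1 × 1 fusion convolution, and two learnable scalars initialized to one, following the description above but not reproducing the released implementation.

```python
import torch
import torch.nn as nn

class RFFBSketch(nn.Module):
    """Residual feature fusion block: B basic blocks, feature fusion, weighted identity path."""

    def __init__(self, n_feats=64, num_blocks=4):
        super().__init__()
        # Plain convolutions stand in for the Basic Blocks (RIB / RRFB in the paper).
        self.blocks = nn.ModuleList([nn.Conv2d(n_feats, n_feats, 3, padding=1)
                                     for _ in range(num_blocks)])
        # 1x1 convolution reduces the concatenated residual features back to n_feats channels.
        self.fuse = nn.Conv2d(n_feats * num_blocks, n_feats, 1)
        self.act = nn.ReLU(inplace=True)
        # Learnable weights for the identity and fusion features, both initialized to one.
        self.lambda1 = nn.Parameter(torch.ones(1))
        self.lambda2 = nn.Parameter(torch.ones(1))

    def forward(self, x):
        feats, out = [], x
        for block in self.blocks:
            out = block(out)   # output of each Basic Block
            feats.append(out)  # collect the residual features for fusion
        fused = self.act(self.fuse(torch.cat(feats, dim=1)))
        return self.lambda1 * x + self.lambda2 * fused  # weighted identity + fusion features
```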
3.5. Adaptive Channel Attention Block
Previous studies [
15,
33] have revealed that the super-resolution performance can be further improved by incorporating the channel attention mechanism into the SR model. Specifically, the feature representation ability of the CNN model is maximized by assigning different attention degrees to the different channel features.
Inspired by SENet [48], RCAN [15] incorporates the channel attention mechanism into its model and performs a dimensionality reduction operation on the feature channels, which effectively improves the visual quality of the reconstructed image and reduces the complexity of the model. Unlike SENet, which performs the fully connected operation on all channels, RCAN only performs it on the channels obtained through dimensionality reduction. As pointed out in Ref. [20], the dimensionality reduction operation destroys the direct correlation between a channel and its weight, and the captured channel attention dependencies are inefficient and unnecessary. To address the adverse impact of dimensionality reduction on channel attention, we integrate the adaptive channel attention block (ACAB) into the proposed network to explore the correlation dependencies between channel features without destroying the original relationships between channels. It is well known that the relationship between pixels in an image depends on the distance between them; we infer that the relationship between channel features behaves similarly: adjacent channels exhibit strong dependence, and the dependence gradually weakens as the distance between channels increases. Moreover, Ref. [20] also revealed that the frequency trends of different images are fundamentally similar in the same convolutional layer and exhibit robust local periodicity. Therefore, correlation calculations are only performed on the k adjacent channels in the ACAB.
As shown in
Figure 5, ACAB is composed of a global average pooling layer, the nearest neighbor fully connected module, and a channel feature representation layer. Here, the nearest neighbor fully connected module only connects the k nearest channels to investigate the relationship between channels.
Based on the above analysis, we can approximately obtain the channel attention weights of the whole image by exploring the relationships between k adjacent channels. The weight w_i of the i-th channel feature y_i can be acquired as follows:
w_i = σ( ∑_{j=1}^{k} α_i^j y_i^j ),  y_i^j ∈ Ω_i^k,
where σ represents the Sigmoid activation function, α_i^j represents the parameter of the convolution kernel corresponding to y_i, and Ω_i^k represents the set of k adjacent channels of y_i.
To further boost the super-resolution performance, all channels share the same weight information, and the number of parameters can then be reduced from k × C to k, as:
w_i = σ( ∑_{j=1}^{k} α^j y_i^j ),  y_i^j ∈ Ω_i^k,
where C is the channel number.
Since the adaptive channel attention aims to appropriately capture cross-channel information, it is necessary to determine the range of channel interaction k. From the previous analysis, we can see that k should have a certain mapping relationship to C,
C = φ(k) = γ · k − b,
where γ and b are the parameters of the mapping function. As the linear mapping relationship has certain limitations, and the number of channels in SR models is normally a power of 2, to further improve the flexibility of the model, the mapping relationship in Equation (14) can be represented as an exponential function with base 2:
C = φ(k) = 2^(γ · k − b).
Consequently, according to the given C, k can be calculated,
k = ψ(C) = | log_2(C) / γ + b / γ |_odd,
where |x|_odd denotes the nearest odd number of x. Furthermore, the information interaction of cross-channel features can be implemented by a one-dimensional convolution, that is, a convolutional layer with a kernel size of k. It is known that a convolution kernel with an odd size is more effective than one with an even size in extracting features, so the value of k is generally taken to be an odd number. Considering that the number C of feature channels is generally set to 64 in most SR models, we also set the feature channel number to 64. Then, k can be calculated according to Equation (16); with γ = 2 and b = 1, the setting used in Ref. [20], this gives k = 3.
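A compact PyTorch sketch of such an adaptive channel attention block is given below. It follows the efficient channel attention formulation of Ref. [20] (global average pooling followed by a shared 1D convolution over k neighboring channels), and assumes γ = 2 and b = 1 when deriving k from C.

```python
import math
import torch
import torch.nn as nn

class ACABSketch(nn.Module):
    """Adaptive channel attention: 1D convolution over k neighboring channel descriptors."""

    def __init__(self, channels=64, gamma=2, b=1):
        super().__init__()
        # Adaptively choose the interaction range k from the channel number C.
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1  # k must be odd (k = 3 for C = 64)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # 1D convolution connects each channel with its k nearest neighbors, weights shared.
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        n, c, _, _ = x.shape
        y = self.avg_pool(x).view(n, 1, c)                # (B, 1, C) channel descriptors
        w = self.sigmoid(self.conv(y)).view(n, c, 1, 1)   # per-channel attention weights
        return x * w                                      # re-weight the channel features
```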