In this section, the networks proposed in this article are described in detail. As mentioned in Section 2.1 above, an LF uses a two-plane model to parameterize the rays. Therefore, a 4D LR LF image can be represented as $\mathcal{L}_{LR} \in \mathbb{R}^{U \times V \times H \times W}$, where $U \times V$ represents the angular resolution and $H \times W$ represents the spatial resolution. In this paper, we use a square SAI array (i.e., $U = V = A$), so a 4D LF image contains an $A \times A$ array of SAIs. Our goal is to obtain an HR LF image $\mathcal{L}_{HR} \in \mathbb{R}^{U \times V \times sH \times sW}$ from its LR counterpart $\mathcal{L}_{LR}$, where $s$ denotes the spatial magnification factor. To decrease computational complexity, the proposed approach is executed only on the Y channel of the LF image. Bicubic interpolation is used to super-resolve the Cr and Cb channels, after which the Y, Cr and Cb channels are converted into an RGB image.
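To make this per-channel strategy concrete, the following minimal PyTorch sketch applies a learned network to the Y channel only and bicubic interpolation to the chrominance channels. The function name, the `net_y` stand-in and the tensor layout are illustrative assumptions, and the final YCbCr-to-RGB conversion is omitted:

```python
import torch
import torch.nn.functional as F

def super_resolve_ycbcr(ycbcr_lr, net_y, scale):
    """Run learned SR on Y; bicubically upsample the chrominance channels.

    ycbcr_lr: (B, 3, H, W) tensor holding the Y, Cb and Cr channels.
    net_y:    any module mapping (B, 1, H, W) -> (B, 1, sH, sW).
    scale:    spatial magnification factor s.
    """
    y, cb, cr = ycbcr_lr.split(1, dim=1)
    y_sr = net_y(y)  # learned network on the luminance channel only
    cb_sr = F.interpolate(cb, scale_factor=scale, mode='bicubic',
                          align_corners=False)
    cr_sr = F.interpolate(cr, scale_factor=scale, mode='bicubic',
                          align_corners=False)
    # the three channels are converted back to RGB afterwards (not shown)
    return torch.cat([y_sr, cb_sr, cr_sr], dim=1)
```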
3.1. Overall Network Framework
The overall network framework of our method is illustrated in Figure 1 and is composed of three major components: a feature extraction block (FEB), view interaction blocks and a reconstruction block. Firstly, the LR SAIs $\mathcal{L}_{LR}$ are fed into the reshaping block of the upper branch, where they are concatenated along the channel dimension to obtain $\mathcal{L}_{R}$; then, the FEB is used to extract the global-view feature $F_G$ from $\mathcal{L}_{R}$. Different from the upper branch, the lower branch does not perform a reshaping operation but directly extracts features, mapping each SAI to a deep feature representation to obtain the hierarchical feature $F_H$. The weights in the FEB of the lower branch are shared across views. To thoroughly investigate the correlation between all views, the extracted features are subsequently passed through four double-branched view interaction blocks, which enhance and connect the features in a sequential manner and effectively maintain the parallax structure. Each view interaction block is composed of four parts: an inter-view unit (InterU), a global-view feature, an intra-view unit (IntraU) and a multi-view aggregation block (MAB); these are collectively referred to as the IGIM. Finally, the HR SAIs $\mathcal{L}_{HR}$ are obtained through the reconstruction block, which is composed of a feature fusion block and an upsampling block.
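The data flow of this double-branch design can be summarized in the hedged PyTorch skeleton below. Every sub-module is a simplified stand-in (single convolutions and identity blocks) for the real FEB, IGIM and reconstruction block, and all names and hyper-parameter values are illustrative:

```python
import torch
import torch.nn as nn

class LFSSRSkeleton(nn.Module):
    """Structural sketch of the double-branch pipeline.

    Input: LR SAIs (B, A*A, 1, H, W); output: HR SAIs (B, A*A, 1, sH, sW).
    """
    def __init__(self, angular=5, channels=32, scale=2, n_blocks=4):
        super().__init__()
        self.n = angular * angular
        # upper branch: SAIs reshaped onto the channel axis -> global-view feature
        self.feb_global = nn.Conv2d(self.n, channels, 3, padding=1)
        # lower branch: weight-shared per-view extractor -> hierarchical features
        self.feb_view = nn.Conv2d(1, channels, 3, padding=1)
        # stand-ins for the four double-branched view interaction blocks (IGIM)
        self.interaction = nn.ModuleList(nn.Identity() for _ in range(n_blocks))
        # stand-in for the reconstruction block (feature fusion + upsampling)
        self.reconstruct = nn.Sequential(
            nn.Conv2d(channels, scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, lr):
        b, n, _, h, w = lr.shape
        f_global = self.feb_global(lr.view(b, n, h, w))   # (B, C, H, W)
        f_views = self.feb_view(lr.view(b * n, 1, h, w))  # (B*N, C, H, W)
        for block in self.interaction:
            f_views = block(f_views)   # the real IGIM also updates f_global
        sr = self.reconstruct(f_views)  # (B*N, 1, sH, sW)
        return sr.view(b, n, 1, sr.shape[-2], sr.shape[-1])
```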
3.2. Feature Extraction Block (FEB)
Our FEB processes each input SAI separately, and its weights are shared between SAIs. Firstly, a $3 \times 3$ convolution in the FEB is introduced to extract the initial features of a single SAI, which are then fed into a cascaded ResASPP block and a residual channel-reconstruction block (RCB) for further feature extraction. The channel depth $C$ is set to 32. The architecture of ResASPP, shown in Figure 1, extends the receptive field to gain rich contextual information. Multi-scale information is extracted by combining three parallel $3 \times 3$ dilated convolutions with different dilation rates of 1, 2 and 4. The Leaky ReLU (LReLU) activation function is used with a leaky factor of 0.1. Finally, the features obtained by the three parallel branches are concatenated and then fused through a $1 \times 1$ convolution. In addition, we design an RCB that reduces the channel redundancy between features by introducing a channel reconstruction unit (CRU) [33]. As shown in Figure 1, our RCB consists of two $3 \times 3$ convolutions, a CRU and an LReLU activation function.
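A minimal PyTorch sketch of the ResASPP block as described above follows; the residual connection is inferred from the "Res" prefix, and the module name and channel count are illustrative:

```python
import torch
import torch.nn as nn

class ResASPP(nn.Module):
    """Sketch of ResASPP: three parallel 3x3 dilated convolutions
    (dilation 1, 2, 4) with LReLU(0.1), 1x1 fusion, residual add."""
    def __init__(self, channels=32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                nn.LeakyReLU(0.1, inplace=True))
            for d in (1, 2, 4)])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        multi_scale = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.fuse(multi_scale)  # residual connection
```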
The CRU is constructed by using a split-transform-fuse approach to reduce channel redundancy, as shown in Figure 2. Firstly, the aim of the splitting part is to divide the input features $X$ with $C$ channels into the upper-branch features $X_{up}$ with $\alpha C$ channels and the lower-branch features $X_{low}$ with $(1-\alpha)C$ channels, where $\alpha \in (0, 1)$ is the split ratio and $r$ is the squeeze ratio. The channels of these feature maps are then compressed by a $1 \times 1$ convolution with the squeeze ratio $r$ to improve the calculation speed.
Secondly, the aim of the transforming part is to extract the rich representative feature maps $Y_{up}$ from the upper-branch features $X_{up}$ through efficient group-wise convolution (GWC) and point-wise convolution (PWC). GWC, which was first introduced in AlexNet, can be regarded as a sparse convolution connection method in which each output channel is linked to only a certain group of input channels; with $g$ groups, each output channel sees only $C/g$ of the input channels, reducing the parameter count of a $k \times k$ convolution by a factor of $g$. In addition, PWC is used to keep the information flowing across channels and enables dimensionality reduction by reducing the number of filters. The process can be represented as
$$Y_{up} = W_{G} X_{up} + W_{P1} X_{up},$$
where $W_{G}$ and $W_{P1}$ are the learnable weight matrices of the GWC and PWC, $k$ is the size of the convolution kernel and $g$ is the group size used in the experiments. At the same time, only PWC is used to acquire the feature maps $Y_{low}$ with shallow hidden details from the lower-branch features $X_{low}$ as a supplement to the upper-branch feature maps, which can be represented as
$$Y_{low} = \left[ W_{P2} X_{low}, \; X_{low} \right],$$
where $W_{P2}$ is also a weight matrix of a PWC, and $[\cdot, \cdot]$ means concatenation.
Finally, the aim of the fusing part is to obtain the global spatial information through a global average pooling operation, yielding the channel statistics $S_{up}$ and $S_{low}$. The process is as follows:
$$S_{m} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} Y_{m}(i, j), \quad m \in \{up, low\}.$$
Then, the upper-branch and lower-branch global channel-weight descriptors are stacked together, and a channel-wise soft attention operation is designed to create the importance feature vectors $\beta_{up}$ and $\beta_{low}$ as follows:
$$\beta_{up} = \frac{e^{S_{up}}}{e^{S_{up}} + e^{S_{low}}}, \quad \beta_{low} = \frac{e^{S_{low}}}{e^{S_{up}} + e^{S_{low}}}, \quad \beta_{up} + \beta_{low} = 1.$$
Under the guidance of these two vectors, the upper-branch feature maps and the lower-branch feature maps are combined in a channel-wise manner to obtain the output features $Y$ with reduced redundancy in the channel dimension, and the procedure is as follows:
$$Y = \beta_{up} Y_{up} + \beta_{low} Y_{low}.$$
From the detailed process above, it can be seen that integrating the channel reconstruction unit (CRU) into the residual channel-reconstruction block (RCB) not only reduces the redundant computation of the feature extraction block in the channel dimension of the feature maps, but also promotes the learning of representative features during feature extraction. This makes it an effective way to efficiently acquire important features.
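The split-transform-fuse procedure can be summarized in the following hedged PyTorch sketch of a CRU. The hyper-parameter values (split ratio, squeeze ratio, kernel and group sizes) are illustrative defaults, not the values used in the experiments:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRU(nn.Module):
    """Sketch of the split-transform-fuse CRU described above.

    alpha: split ratio, r: squeeze ratio, k: GWC kernel size, g: group size.
    """
    def __init__(self, channels, alpha=0.5, r=2, k=3, g=2):
        super().__init__()
        self.c_up = int(alpha * channels)
        self.c_low = channels - self.c_up
        up_sq, low_sq = self.c_up // r, self.c_low // r
        # split: 1x1 convolutions squeeze both branches by the ratio r
        self.squeeze_up = nn.Conv2d(self.c_up, up_sq, 1)
        self.squeeze_low = nn.Conv2d(self.c_low, low_sq, 1)
        # transform: GWC + PWC on the upper branch
        self.gwc = nn.Conv2d(up_sq, channels, k, padding=k // 2, groups=g)
        self.pwc_up = nn.Conv2d(up_sq, channels, 1)
        # lower branch: PWC output concatenated with its own input
        self.pwc_low = nn.Conv2d(low_sq, channels - low_sq, 1)

    def forward(self, x):
        x_up, x_low = torch.split(x, [self.c_up, self.c_low], dim=1)
        x_up, x_low = self.squeeze_up(x_up), self.squeeze_low(x_low)
        y_up = self.gwc(x_up) + self.pwc_up(x_up)           # rich features
        y_low = torch.cat([self.pwc_low(x_low), x_low], 1)  # shallow details
        # fuse: global average pooling + channel-wise soft attention
        s = torch.stack([y_up.mean((2, 3)), y_low.mean((2, 3))], 0)  # (2,B,C)
        beta = F.softmax(s, dim=0)   # beta_up + beta_low = 1 per channel
        return (beta[0, :, :, None, None] * y_up
                + beta[1, :, :, None, None] * y_low)
```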
3.3. View Interaction Block
The view interaction block is composed of an inter-view unit (InterU), a global-view feature, an intra-view unit (IntraU) and a multi-view aggregation block (MAB), and is an improvement of the intra–inter-view feature interaction of [34]. Specifically, integrating a spatial reconstruction unit (SRU) into the InterU and IntraU of the view interaction block reduces the redundant computation of the feature maps in the spatial dimension, thereby promoting the efficient use of the complementary information between all views during the interaction. Furthermore, our MAB uses a small number of 3D convolutions to efficiently model the correlation between all SAIs and further achieve SSR. The architecture of InterU is shown in Figure 3a and that of IntraU in Figure 3b; the former uses the hierarchical feature $F_H$ obtained by the lower branch to update the global inter-view feature $F_G$, and the latter uses the global inter-view feature $F_G$ obtained by the upper branch to update the hierarchical feature $F_H$.
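As a rough illustration of this bidirectional update pattern, the sketch below pools the hierarchical features across views to refresh the global-view feature (InterU) and broadcasts the global-view feature back to every SAI (IntraU). The fusion layers are reduced to single 1x1 convolutions, so the exact structures in Figure 3 may differ:

```python
import torch
import torch.nn as nn

class InterU(nn.Module):
    """Hedged sketch: per-view features update the global-view feature."""
    def __init__(self, channels=32):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_global, f_views, n_views):
        # f_global: (B, C, H, W); f_views: (B*N, C, H, W)
        b = f_global.shape[0]
        c, h, w = f_views.shape[1:]
        pooled = f_views.view(b, n_views, c, h, w).mean(dim=1)  # across views
        return self.fuse(torch.cat([f_global, pooled], dim=1))

class IntraU(nn.Module):
    """Hedged sketch: the global-view feature updates every SAI feature."""
    def __init__(self, channels=32):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_global, f_views, n_views):
        g = f_global.repeat_interleave(n_views, dim=0)  # one copy per SAI
        return self.fuse(torch.cat([f_views, g], dim=1))
```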
The SRU in this structure is constructed by using a separate-reconstruct method to reduce spatial redundancy, as depicted in Figure 4. Firstly, the aim of the separating part is to distinguish the feature maps carrying a large amount of information from those carrying little information about the spatial content of the input feature $X$. It first uses group normalization (GN) to standardize $X$, which can be portrayed as follows:
$$X_{GN} = GN(X) = \gamma \, \frac{X - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta,$$
where $\mu$ and $\sigma$ are the mean and standard deviation of $X$; $\gamma$ and $\beta$ are trainable affine transformation parameters; and $\varepsilon$ is a small positive constant. The normalized correlation weights $W_\gamma$ are determined to represent the importance of each feature map, and can be obtained from the following formula:
$$W_{\gamma} = \{ w_{i} \} = \frac{\gamma_{i}}{\sum_{j=1}^{C} \gamma_{j}}, \quad i, j = 1, 2, \ldots, C.$$
Then, the reweighted feature is mapped to the range 0 to 1 through the sigmoid function, and a threshold is set for gating: a weight exceeding the threshold is reset to 1 to obtain the informative weight $W_1$; otherwise, it is set to 0 to obtain the non-informative weight $W_2$. The entire process of obtaining $W$ can be depicted as
$$W = \mathrm{Gate}\big( \mathrm{Sigmoid}\big( W_{\gamma} \cdot GN(X) \big) \big).$$
Finally, $W_1$ and $W_2$ are multiplied by the input features to create the weighted features $X_1^W$ with a large amount of information and the weighted features $X_2^W$ with a small amount of information.
The aim of the reconstructing part is to first cross-add the informative features $X_1^W$ and the less informative features $X_2^W$, fully combining the two kinds of information with different weights, and then connect them to obtain the output features $X^W$ with reduced redundancy in the spatial dimension. The specific process is as follows:
$$X_1^W = W_1 \otimes X, \quad X_2^W = W_2 \otimes X,$$
$$X^{W1} = X_{11}^W \oplus X_{22}^W, \quad X^{W2} = X_{21}^W \oplus X_{12}^W,$$
$$X^W = \left[ X^{W1}, \; X^{W2} \right],$$
where $\oplus$ is element-wise summation, $\otimes$ is element-wise multiplication, $[\cdot, \cdot]$ is concatenation, and $X_{11}^W, X_{12}^W$ and $X_{21}^W, X_{22}^W$ are the two channel-wise halves of $X_1^W$ and $X_2^W$, respectively.
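A hedged PyTorch sketch of the SRU then looks as follows; the gating threshold and the number of normalization groups are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SRU(nn.Module):
    """Sketch of the separate-reconstruct SRU described above."""
    def __init__(self, channels, groups=4, threshold=0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.threshold = threshold

    def forward(self, x):
        # separate: per-map importance from the learned GN scale gamma
        w_gamma = self.gn.weight / self.gn.weight.sum()
        scores = torch.sigmoid(self.gn(x) * w_gamma[None, :, None, None])
        w1 = (scores >= self.threshold).float()  # informative weight (hard gate)
        w2 = 1.0 - w1                            # non-informative weight
        x1, x2 = w1 * x, w2 * x                  # weighted features
        # reconstruct: split each half along channels, cross-add, concatenate
        x11, x12 = torch.chunk(x1, 2, dim=1)
        x21, x22 = torch.chunk(x2, 2, dim=1)
        return torch.cat([x11 + x22, x21 + x12], dim=1)
```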
In addition, the multi-view aggregation block (MAB) in the view interaction block is based on 3D convolution to model the correlation between all SAIs, as shown in Figure 1. In the MAB, the hierarchical features $F_H$ obtained by the lower branch are first stacked along the angular dimension to obtain the feature $F_S$, which is processed in parallel by three $3 \times 3 \times 3$ convolutions. The features obtained by the three parallel branches are then concatenated and fused by a $3 \times 3 \times 3$ convolution with a dilation rate of (1, 1, 1). A $1 \times 1 \times 1$ convolution is used to process the fused features, which are added to the input features to obtain the output hierarchical features. The activation function in the MAB is PReLU, an improvement on LReLU, with its leaky factor initialized to 0.02.
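The following sketch mirrors this MAB layout in PyTorch; the dilation rates (1, 2, 4) of the three parallel branches are assumptions, since the text only fixes the dilation of the fusion convolution:

```python
import torch
import torch.nn as nn

class MAB(nn.Module):
    """Sketch of the multi-view aggregation block: SAI features stacked along
    the angular dimension, three parallel 3D convolutions, concatenation,
    3D fusion, 1x1x1 projection and a residual connection."""
    def __init__(self, channels=32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv3d(channels, channels, 3, padding=d, dilation=d),
                nn.PReLU(init=0.02))
            for d in (1, 2, 4)])  # assumed dilation rates
        self.fuse = nn.Conv3d(3 * channels, channels, 3, padding=1, dilation=1)
        self.proj = nn.Conv3d(channels, channels, 1)

    def forward(self, f_views, n_views):
        # f_views: (B*N, C, H, W) -> stack views along an angular depth axis
        b = f_views.shape[0] // n_views
        c, h, w = f_views.shape[1:]
        x = f_views.view(b, n_views, c, h, w).permute(0, 2, 1, 3, 4)  # (B,C,N,H,W)
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        y = self.proj(self.fuse(y)) + x  # residual connection
        return y.permute(0, 2, 1, 3, 4).reshape(b * n_views, c, h, w)
```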
3.4. Reconstruction Block
As shown in Figure 1, the reconstruction block includes a feature fusion block (FFB) and an upsampling block. Firstly, the FFB is composed of four cascaded residual dense blocks (RDBs), each of which is composed of six convolutional layers and six activation functions. Our FFB fuses the shallow features and the deep features generated by the view interaction blocks to obtain high reconstruction accuracy. In each RDB, the input features are first processed by a $1 \times 1$ convolution; then, the following densely connected layers are processed by $3 \times 3$ convolutions. The output $F_k$ of the $k$-th convolution of each RDB can be portrayed as
$$F_k = \delta\big( H_k\big( [F_0, F_1, \ldots, F_{k-1}] \big) \big),$$
where $\delta$ denotes the activation function LReLU, $H_k$ represents the $k$-th convolution and $[\cdot]$ denotes concatenation. Then, the features are fused by the last $1 \times 1$ convolution, and a local residual connection is added. In the RDB, the numbers of filters are $4C$, $C$, $C$, $C$, $C$ and $4C$ from left to right. The fused features are then fed to the upsampling block, which consists of a $1 \times 1$ convolution, a $3 \times 3$ convolution and a PixelShuffle layer between them. The LR feature maps are converted into HR feature maps through the upsampling block. Finally, the features generated by the upsampling block are added to the initial features to achieve global residual learning and produce the HR LF image.
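A hedged PyTorch sketch of one RDB with the stated $4C$, $C$, $C$, $C$, $C$, $4C$ filter layout and of the upsampling block follows; the 1x1/3x3 kernel assignment and the $4C$ input width are assumptions consistent with the description above:

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """Sketch of one residual dense block: six convolutions with filter
    numbers 4C, C, C, C, C, 4C, each followed by LReLU(0.1)."""
    def __init__(self, c=32):
        super().__init__()
        self.entry = nn.Sequential(nn.Conv2d(4 * c, 4 * c, 1),
                                   nn.LeakyReLU(0.1, inplace=True))
        self.dense = nn.ModuleList()
        in_ch = 4 * c
        for _ in range(4):  # densely connected 3x3 layers, C filters each
            self.dense.append(nn.Sequential(
                nn.Conv2d(in_ch, c, 3, padding=1),
                nn.LeakyReLU(0.1, inplace=True)))
            in_ch += c
        self.exit = nn.Sequential(nn.Conv2d(in_ch, 4 * c, 1),
                                  nn.LeakyReLU(0.1, inplace=True))

    def forward(self, x):
        feats = [self.entry(x)]
        for layer in self.dense:
            feats.append(layer(torch.cat(feats, dim=1)))
        return x + self.exit(torch.cat(feats, dim=1))  # local residual

class Upsampler(nn.Module):
    """Sketch of the upsampling block: 1x1 convolution, PixelShuffle, then a
    3x3 convolution; its output is added back in the full model to achieve
    global residual learning."""
    def __init__(self, c=32, scale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4 * c, c * scale * scale, 1),
            nn.PixelShuffle(scale),
            nn.Conv2d(c, 1, 3, padding=1))

    def forward(self, x):
        return self.body(x)
```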