Article

Cross-View Attention Interaction Fusion Algorithm for Stereo Super-Resolution

Yaru Zhang, Jiantao Liu, Tong Zhang and Zhibiao Zhao

1 School of Artificial Intelligence, Anhui University of Science and Technology, Huainan 232001, China
2 Institute of Energy, Hefei Comprehensive National Science Center, Hefei 230031, China
3 School of Automation and Electrical Engineering, Tianjin University of Technology and Education, Tianjin 300222, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(12), 7265; https://doi.org/10.3390/app13127265
Submission received: 9 April 2023 / Revised: 13 June 2023 / Accepted: 15 June 2023 / Published: 18 June 2023

Abstract

In stereo super-resolution reconstruction, the richness of the extracted feature information directly affects the texture detail of the reconstructed image, and the correspondence between pixels in a stereo image pair likewise has an important impact on reconstruction accuracy during network learning. Aiming at the information interaction and stereo consistency of stereo image pairs, a cross-view attention interaction fusion stereo super-resolution algorithm is therefore proposed. First, an attention stereo fusion module is constructed from a parallax attention mechanism and a triple attention mechanism. The module is inserted at different levels between two single image super-resolution network branches, and its attention weights are computed through the cross-dimensional interaction of three branches, making full use of the single image super-resolution network's ability to extract single-view information while maintaining the stereo consistency between stereo image pairs. Then, an enhanced cross-view interaction strategy comprising three fusion methods is proposed: vertical sparse fusion integrates intra-view information at different levels of the two single image super-resolution sub-branches, horizontal dense fusion connects adjacent attention stereo fusion modules, and feature fusion further strengthens the stereo consistency constraint between image pairs. Finally, experimental results on the Flickr 1024, Middlebury and KITTI benchmark datasets show that the proposed algorithm outperforms existing stereo image super-resolution methods in quantitative metrics and qualitative visual quality while maintaining the stereo consistency of image pairs.

1. Introduction

In the field of computer vision, image quality and resolution are often uneven. The goal of super-resolution reconstruction is to generate high-quality, high-resolution images through learned reconstruction; in other words, the super-resolution task can be viewed as restoring image details from coarse to fine. Image super-resolution reconstruction has broad applications and development prospects in computer vision, public security, the military and medicine [1]. The most commonly used classical methods for improving image resolution are interpolation algorithms [2], Markov random field models [3] and principal component analysis models [4]. These algorithms still fall short in visual quality, and many details (such as sharp edges) cannot be recovered. Since researchers began studying super-resolution convolutional neural networks, deep learning-based methods have dominated both single image super-resolution [5] and stereo image super-resolution [6].
For deep learning-based single image super-resolution, Dong et al. [7] first proposed SRCNN, which realizes end-to-end mapping by using a deep convolutional neural network to automatically learn and optimize all network parameters. Kim et al. [8] designed VDSR, a deeper super-resolution network with 20 convolutional layers, in which a large range of global pixel information is captured by increasing network depth. Following the dense network, Tong et al. [9] proposed SRDenseNet, which applies skip connections to dense blocks to learn both low-level and high-level features. Zhang et al. [10] proposed a residual dense network that promotes effective feature learning through a contiguous memory mechanism. Guo et al. [11] proposed a dual regression strategy that improves reconstruction accuracy by learning both the forward and reverse mappings between low and high resolution. He et al. [12] exploited the complementary relationship between external and internal instance-based super-resolution methods to achieve image-adaptive super-resolution.
For deep learning-based stereo super-resolution, compared with single image super-resolution, there are fewer algorithms in the field, the research is less mature, and performance still has considerable room for improvement. Xu et al. [13] designed a view- and time-based attention module and a space-time fusion module, used respectively to integrate information across time and across views and to aggregate information across time within a view to build high-resolution stereo video. Since the disparity of a stereo image pair is much larger than that between video frames and optical flow images, and the spatial resolution of the left and right views is very low, algorithms for video super-resolution [13] and optical flow super-resolution [14] are not suitable for the stereo image super-resolution task. To address this, Jeon et al. [15] proposed the first, pioneering stereo super-resolution algorithm (StereoSR). By stacking the left image with copies of the right image shifted by different pixel intervals, the feature maps of the two views are integrated, and a direct mapping between the disparity shift and the pixels of the high-resolution image is obtained. However, because the maximum disparity that StereoSR can handle is fixed, it places high demands on both the sensor and the scene image. For this reason, Wang et al. [16] proposed PASSRnet, which includes a residual atrous spatial pyramid pooling (ASPP) module and a parallax attention module. The residual ASPP module is constructed by alternately cascading residual ASPP blocks, which both enlarges the receptive field and enriches the diversity of convolutions. The parallax attention module obtains stereo pixel correspondences with a global receptive field along the horizontal epipolar direction of the binocular images, without being limited by the disparity range. When processing low-resolution stereo image pairs, the above stereo super-resolution algorithms cannot capture long-distance dependencies between image pairs. Drawing on existing attention and parallax attention mechanisms, Duan et al. [17] retained PASSRnet's network structure and added a channel attention mechanism and global residual connections to integrate long-range semantic information at high and low levels, achieving better reconstruction accuracy than PASSRnet. Song et al. [18] proposed an algorithm combining self-attention and parallax attention, simultaneously aggregating intra-view features and cross-view information to reconstruct high-quality stereo super-resolution image pairs. In addition, the stereo attention module [19] was proposed in view of the excellent performance of existing single image super-resolution methods in exploiting intra-view information: using the prior knowledge of the single image super-resolution network, multiple stereo attention modules are inserted between the branches for the left and right images, realizing repeated interaction between the two views and improving the reconstruction of image details. Wang et al. [20] designed a Siamese network based on a bidirectional parallax attention module that exploits the symmetry cues in stereo image pairs.
The two views are processed in a highly symmetric manner to improve stereo super-resolution performance. However, the pixel offset between stereo image pairs and the sub-pixel upsampling used during reconstruction require discriminative features to identify corresponding pixels. Since the disparity between left and right images is variable, Dan et al. [21] proposed a disparity feature alignment module that uses disparity information for feature alignment and fusion. Zhu et al. [22] proposed a cross-view capture network that achieves stereo image super-resolution through cross-view blocks and cascaded spatial-awareness modules. Wang et al. [23] proposed a generic parallax attention mechanism that captures stereo correspondence by computing feature similarity along the epipolar line, regardless of disparity variation. Jin et al. [24] proposed taking the Swin Transformer as the framework and combining it with a bidirectional parallax attention module to maximize the auxiliary information provided by the binocular structure. Chu et al. [25] proposed NAFSSR, which uses the powerful yet simple image restoration model NAFNet for single-view feature extraction and fuses features between views through added cross-attention modules suited to binocular scenes.
Although convolutional neural networks have achieved some success in super-resolution learning, the following problems remain: (1) the prior knowledge learned by single image super-resolution algorithms is not fully utilized in the binocular image super-resolution task; (2) there is a lack of mechanisms for the efficient use of information from the left and right views, and information within and between views is not communicated adequately.
Therefore, we use parallax attention to compute the pixel similarity of the two images along the epipolar line and use the triple attention mechanism to integrate cross-dimensional information within the corresponding view, thereby building a stereo fusion block. We then propose an enhanced cross-view interaction strategy that makes the stereo fusion blocks work cooperatively between two single image super-resolution networks to integrate cross-view and intra-view information.

2. Network Architecture

Low-resolution left and right images are input into the branches of an existing single image super-resolution network for training, yielding a pre-trained network model. The whole stereo super-resolution network model comprises three parts: the single image super-resolution branches, the attention stereo fusion module and the enhanced cross-view interaction.
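To make this three-part structure concrete, the sketch below shows one plausible way to interleave fusion modules with weight-shared single image super-resolution stages in PyTorch. It is a minimal sketch under our own naming; StereoSRNet, ASFM and sisr_blocks are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ASFM(nn.Module):
    """Stand-in for the attention stereo fusion module of Section 2.1."""
    def forward(self, f_l, f_r):
        return f_l, f_r  # real fusion is detailed in Section 2.1

class StereoSRNet(nn.Module):
    """Two weight-shared single image SR branches with fusion modules
    inserted at a sparse set of levels (the vertical connections)."""
    def __init__(self, sisr_blocks: nn.ModuleList, fusion_points: list):
        super().__init__()
        self.blocks = sisr_blocks               # shared by both views
        self.fusion_points = set(fusion_points)
        self.asfms = nn.ModuleDict({str(i): ASFM() for i in fusion_points})

    def forward(self, left, right):
        f_l, f_r = left, right
        for i, block in enumerate(self.blocks):
            f_l, f_r = block(f_l), block(f_r)   # parameter sharing
            if i in self.fusion_points:         # vertical sparse connection
                f_l, f_r = self.asfms[str(i)](f_l, f_r)
        return f_l, f_r

# e.g. a pre-trained 20-stage branch with fusion after stages 5, 10, 15:
# net = StereoSRNet(nn.ModuleList(stages), fusion_points=[5, 10, 15])
```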

2.1. Attention Stereo Fusion Module

To integrate the effective information of the original intra-view features, we construct an attention stereo fusion module based on the parallax attention mechanism. The module consists of three blocks: a parallax attention block, a triple attention block and a stereo fusion block. The details are as follows:
Parallax attention block. Consistent with the parallax attention mechanism first proposed for stereo super-resolution in [16], the parallax attention maps of both views are calculated, and the valid masks of the left and right images are then obtained by checking the consistency of the disparity between them. The parallax attention block measures the similarity of pixel values between the two views' feature maps along the horizontal epipolar direction. The details of the parallax attention block are shown in Figure 1.
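As a rough illustration of this idea, in the spirit of PASSRnet [16] rather than the authors' exact implementation, the sketch below matches features only along the horizontal epipolar line; the layer names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class ParallaxAttention(nn.Module):
    """Similarity along the horizontal epipolar line only: each left
    pixel attends to all right pixels in the same row, giving a
    B x H x W x W attention map (simplified sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_left, f_right):
        q = self.query(f_left).permute(0, 2, 3, 1)   # B, H, W, C
        k = self.key(f_right).permute(0, 2, 1, 3)    # B, H, C, W
        att_r2l = torch.softmax(q @ k, dim=-1)       # B, H, W, W
        # warp right features into the left view along the epipolar line
        v = f_right.permute(0, 2, 3, 1)              # B, H, W, C
        warped = (att_r2l @ v).permute(0, 3, 1, 2)   # B, C, H, W
        # the valid masks V_{L->R}, V_{R->L} would follow from checking
        # cycle consistency between the two directional attention maps
        return warped, att_r2l
```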
Triple attention block. The triple attention block consists of three parallel branches, as shown in Figure 2. Through rotation operations and residual transformations, it adjusts the attention weights of the information in the corresponding view and encodes information between the channel and spatial dimensions, thereby establishing dependencies across dimensions. Specifically, a given input feature $X \in \mathbb{R}^{C \times H \times W}$ is passed to the three branches of the triple attention block in parallel: an attention computing branch for the channel dimension $C$, an interaction branch between channel $C$ and spatial dimension $H$, and an interaction branch between channel $C$ and spatial dimension $W$. The outputs of the three branches are then averaged and aggregated to generate the final output tensors of size $C \times H \times W$, namely the left attention map $F_{att\_L}$ and the right attention map $F_{att\_R}$.
Taking the attention computing branch for the channel dimension $C$ as an example, maximum pooling and average pooling are applied along the channel dimension, and the tensor $X_1 \in \mathbb{R}^{2 \times H \times W}$ is obtained by concatenation:

$X_1 = [H_{\max}(X), H_{\mathrm{avg}}(X)]$   (1)
The resulting tensor is then passed through a standard convolution with a $3 \times 3$ kernel and batch normalization, and the channel attention weight $X_2 \in \mathbb{R}^{1 \times H \times W}$ is generated through the sigmoid activation function:

$X_2 = \sigma(H_{\mathrm{BN}}(H_{3 \times 3}(X_1)))$   (2)
Finally, the output tensor $X_{O1} \in \mathbb{R}^{C \times H \times W}$ of the first branch is obtained by multiplying $X_2$ from Formula (2) with the input feature $X$:

$X_{O1} = X_2 \times X$   (3)
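A minimal sketch of Formulas (1)-(3) for this branch is given below; the other two branches would first permute the tensor so that the channel dimension interacts with $H$ or $W$, and the three branch outputs are averaged. The 3 × 3 kernel matches Formula (2).

```python
import torch
import torch.nn as nn

class ChannelGateBranch(nn.Module):
    """First branch of the triple attention block, Formulas (1)-(3)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):                                     # x: B, C, H, W
        # Formula (1): channel-wise max and average pooling, concatenated
        x1 = torch.cat([x.max(dim=1, keepdim=True).values,
                        x.mean(dim=1, keepdim=True)], dim=1)  # B, 2, H, W
        # Formula (2): convolution + batch norm + sigmoid gate
        x2 = torch.sigmoid(self.bn(self.conv(x1)))            # B, 1, H, W
        # Formula (3): weight the input feature by the attention map
        return x * x2                                         # B, C, H, W
```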
Stereo fusion block. The stereo fusion block is constructed to combine the similarity information provided by the parallax attention mechanism along the horizontal epipolar direction with the cross-dimensional interaction information obtained by triple attention. The fusion process is divided into two stages; taking the right-to-left conversion as an example, the steps are as follows:
(1) The first stage: the left and right input features $F_{left}^{in}, F_{right}^{in} \in \mathbb{R}^{H \times W \times C}$ are transformed by residual blocks to reduce training conflicts. The transformed features are multiplied by $F_{att\_L}$ (respectively $F_{att\_R}$) and passed to a convolution layer with a kernel size of $1 \times 1$ for preliminary fusion, generating $F_0$ (respectively $F_1$) $\in \mathbb{R}^{H \times W \times C}$:

$F_0 = H_{1 \times 1}[H_{\mathrm{res}}(F_{left}^{in}) \times F_{att\_L}], \quad F_1 = H_{1 \times 1}[H_{\mathrm{res}}(F_{right}^{in}) \times F_{att\_R}]$   (4)
(2) The second stage: the fused feature $F_0$ obtained in the first stage is concatenated with the valid mask $V_{L \rightarrow R}$ of the left image and passed to a convolution layer with a kernel size of $1 \times 1$, generating the final left-view feature $F_{left}^{out} \in \mathbb{R}^{H \times W \times C}$ (and symmetrically for the right view):

$F_{left}^{out} = H_{1 \times 1}([F_0, V_{L \rightarrow R}]), \quad F_{right}^{out} = H_{1 \times 1}([F_1, V_{R \rightarrow L}])$   (5)
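A compact sketch of the two-stage fusion in Formulas (4)-(5) follows (channel-first layout, as is usual in PyTorch). ResBlock is a generic residual block of our own choosing, and the one-channel valid masks v_l2r/v_r2l are assumed to come from the parallax attention block.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Generic residual transform used before fusion (an assumption)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class StereoFusionBlock(nn.Module):
    """Two-stage fusion of Formulas (4)-(5) for both views."""
    def __init__(self, channels: int):
        super().__init__()
        self.res_l, self.res_r = ResBlock(channels), ResBlock(channels)
        self.fuse1_l = nn.Conv2d(channels, channels, kernel_size=1)
        self.fuse1_r = nn.Conv2d(channels, channels, kernel_size=1)
        self.fuse2_l = nn.Conv2d(channels + 1, channels, kernel_size=1)
        self.fuse2_r = nn.Conv2d(channels + 1, channels, kernel_size=1)

    def forward(self, f_left, f_right, f_att_l, f_att_r, v_l2r, v_r2l):
        # Stage 1, Formula (4): residual transform, attention weighting,
        # then a 1x1 convolution for preliminary fusion
        f0 = self.fuse1_l(self.res_l(f_left) * f_att_l)
        f1 = self.fuse1_r(self.res_r(f_right) * f_att_r)
        # Stage 2, Formula (5): concatenate the one-channel valid mask
        # and project back to the per-view output features
        out_l = self.fuse2_l(torch.cat([f0, v_l2r], dim=1))
        out_r = self.fuse2_r(torch.cat([f1, v_r2l], dim=1))
        return out_l, out_r
```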

2.2. Enhanced Cross-View Interaction Strategy

According to the placement of the attention stereo fusion module, the enhanced cross-view interaction strategy is proposed, as shown in Figure 3, including three parts: vertical sparse connection, horizontal dense connection and feature fusion.
Vertical sparse fusion. The parameters of the pre-trained single image super-resolution network model are known, and for the left and right low-resolution images, the parameters can be shared between the two single image super-resolution branches. However, if the features at every level of the two branches were fed into the attention stereo fusion module, both the computational complexity and the computational redundancy would increase.
Therefore, as shown in Figure 3, we set the number of vertical connections to be far smaller than the number of network levels and adopt a sparse connection scheme, feeding the left and right view features of the upper and lower branches into the attention stereo fusion module at fixed intervals to compute left-right consistency; a sketch of this placement rule follows.
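One simple way to realize such evenly spaced placement (an assumption on our part, not a rule stated by the authors) is:

```python
def sparse_fusion_points(num_layers: int, num_fusions: int) -> list:
    """Evenly spaced insertion points, so the number of vertical
    connections stays far below the number of network layers."""
    stride = num_layers // (num_fusions + 1)
    return [stride * (i + 1) for i in range(num_fusions)]

# a 20-layer VDSR-style branch with three fusion modules -> [5, 10, 15]
print(sparse_fusion_points(20, 3))
```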
Horizontal dense fusion. To allow multiple interactions among attention stereo fusion modules at different depths, we propose an adaptive horizontal dense connection scheme acting on the attention stereo fusion modules, built on the single image super-resolution structures of VDSR [8] and SRResNet [26]. This scheme resembles DenseNet [27] in its feature extraction style and can perform multiple effective fusions even when the number of attention stereo fusion modules is small. With the horizontal connections added, the parallax attention map computed by one module is passed as input to the next module, and the pixel values of the valid mask region guide the next module in learning its parallax attention map, further strengthening the stereo consistency constraint.
In addition, in existing single image super-resolution networks, the number of layers is not always the same, so the number of attention stereo fusion modules, and hence the number of horizontal connections, changes accordingly. If there is one attention stereo fusion module, no horizontal connection is needed; if there are two, the modules are connected to each other; if there are three, the modules are combined in pairs, and so on (see the sketch below).
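This pairing rule can be written down directly; the helper below is purely illustrative:

```python
from itertools import combinations

def horizontal_connections(num_modules: int) -> list:
    """All pairwise horizontal links between attention stereo fusion
    modules: none for one module, one link for two, pairs for three."""
    return list(combinations(range(num_modules), 2))

print(horizontal_connections(1))  # []
print(horizontal_connections(2))  # [(0, 1)]
print(horizontal_connections(3))  # [(0, 1), (0, 2), (1, 2)]
```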
Feature fusion. A conventional feature fusion operation fuses the features from the vertical sparse connections and the horizontal dense connections to obtain more discriminative and robust features. It mainly comprises a dense layer and a transition layer: (1) the dense layer concatenates, in the channel dimension, the parallax attention map computed by each stereo fusion module with those computed by the other stereo fusion modules; the corresponding growth in the number of feature channels increases the number of network parameters. (2) The transition layer controls the number of parameters by controlling the number of feature channels: a convolution layer with a kernel size of $1 \times 1$ reduces the channel count, and an average pooling layer with a stride of two halves the height and width.
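A minimal transition layer consistent with this description might look as follows; the channel counts are assumptions:

```python
import torch.nn as nn

class TransitionLayer(nn.Module):
    """Transition layer sketch: a 1x1 convolution reduces the channel
    count inflated by dense concatenation, and stride-2 average
    pooling halves the height and width."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.reduce(x))
```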

3. Experimental Results

In this section, the experimental setup is introduced first, and then ablation experiments and an accuracy analysis of the model are carried out to verify the reconstruction ability of the proposed network.

3.1. Experimental Setup

To test the effectiveness of the proposed network, the attention stereo fusion module is applied to five single image super-resolution networks: SRCNN [7], VDSR [8], LapSRN [28], SRDenseNet [9] and SRResNet [26]. We use the Flickr 1024 dataset [29] as the training set. To generate the training data, all images are first down-sampled with factors of 2 and 4 to produce low-resolution images. These low-resolution images are then cropped into $30 \times 90$ patches at an interval of 20 pixels, and the corresponding high-resolution images are cropped into patches of the corresponding size. In addition, the training data are augmented by random horizontal and vertical flipping to ensure diversity.
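The pipeline just described might be sketched as follows for a single view; bicubic down-sampling is an assumption, as the paper does not state the kernel:

```python
import random
import torch
import torch.nn.functional as F

def make_training_pairs(hr: torch.Tensor, scale: int = 4,
                        lr_patch=(30, 90), stride: int = 20):
    """hr: B x C x H x W high-resolution view. Down-sample by `scale`,
    crop the LR image into 30 x 90 patches at a 20-pixel interval, and
    cut the matching HR patch at `scale` times the size and position."""
    lr = F.interpolate(hr, scale_factor=1 / scale, mode='bicubic',
                       align_corners=False)
    ph, pw = lr_patch
    pairs = []
    for top in range(0, lr.shape[-2] - ph + 1, stride):
        for left in range(0, lr.shape[-1] - pw + 1, stride):
            lr_p = lr[..., top:top + ph, left:left + pw]
            hr_p = hr[..., top * scale:(top + ph) * scale,
                      left * scale:(left + pw) * scale]
            if random.random() < 0.5:            # random horizontal flip
                lr_p, hr_p = lr_p.flip(-1), hr_p.flip(-1)
            if random.random() < 0.5:            # random vertical flip
                lr_p, hr_p = lr_p.flip(-2), hr_p.flip(-2)
            pairs.append((lr_p, hr_p))
    return pairs
```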
All networks were implemented in PyTorch and run on an NVIDIA RTX 2080 Ti GPU. Since our method builds on reference [19], the parameter settings of the other networks are kept consistent with those of [19] to allow a fair comparison. After loading the pre-trained single image super-resolution model, the Adam optimizer is used to fine-tune the proposed network with a learning rate of $1 \times 10^{-4}$; training stops when the PSNR on the validation set converges. The Flickr 1024 training set consists of 1024 stereo image pairs of varying resolution, whose large scale and diverse scene types support model learning for stereo vision tasks. The KITTI dataset is a stereo dataset of urban and rural street scenes; unlike Flickr 1024, it contains richer foreground and close-range content. For performance evaluation, we use 20 images from the KITTI dataset [30,31] and five images from the Middlebury dataset as test sets. The larger the sampling factor, the harder the reconstruction, so the comparative experiments in this paper use only images with a sampling factor of four. PSNR and SSIM are used to evaluate the performance of the stereo super-resolution networks; both metrics are obtained by averaging over multiple images, which makes the evaluation persuasive.
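For reference, the PSNR used here is the standard definition; a one-line version, assuming images normalized to [0, 1], is:

```python
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> float:
    """PSNR in dB for one image; reported scores average this over all
    test images (SSIM would be computed and averaged analogously)."""
    mse = torch.mean((sr - hr) ** 2)
    return float(10 * torch.log10(max_val ** 2 / mse))
```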

3.2. Ablation Analysis

This section studies the impact of different network parameters and structures on stereo super-resolution performance through ablation experiments, covering two parts: the attention stereo fusion module and the enhanced cross-view interaction strategy.

3.2.1. Ablation Analysis of Attention Stereo Fusion Module

On the KITTI 2015 and Middlebury datasets, the pre-trained single image super-resolution network VDSR [8] was used as the upper and lower branches for the ablation experiments. For a fair comparison, only the vertical sparse connection from the enhanced cross-view interaction strategy was used. The results are shown in Table 1, where PA denotes the parallax attention block, PA + TA the attention stereo fusion module, PA + CA the combination of parallax and channel attention, and l the number of times the module is inserted into the single image super-resolution branch, with l set to one, two and three.
As can be seen from Table 1, on both the KITTI 2015 and Middlebury datasets the results of (PA + CA)_l are higher than those of PA_l; the PSNR/SSIM of (PA + CA)_3 are 25.19 dB/0.854 and 28.15 dB/0.887, respectively. After embedding the attention stereo fusion module (PA + TA)_3 in VDSR, performance improves further over (PA + CA)_3: the PSNR gains are 0.10 dB and 0.13 dB, and the SSIM gains are 0.007 and 0.003, respectively. This shows that the proposed attention stereo fusion module facilitates multi-faceted learning between stereo images and enhances the consistency of stereo image pairs by strengthening the multi-branch communication between the channel dimension and the two image dimensions (height and width). In addition, the number of times the attention stereo fusion module is inserted into the single image super-resolution branch also determines the performance gain. As Table 1 shows, performance improves steadily as the number of insertions increases, but the increment shrinks step by step; therefore, to obtain a significant performance gain at reasonable computational overhead, the number of insertions is set to three. In other words, compared with the parallax attention module alone, the proposed attention stereo fusion module realizes bidirectional information transfer between the left and right images more compactly, providing a basis for multiple information interactions between the two views.

3.2.2. Ablation Analysis of Enhanced Cross-View Interaction Strategy

Building on the ablation of the attention stereo fusion module, ablation experiments are carried out on how the attention stereo fusion modules are combined with the two single image super-resolution branches and on how the modules interact with each other, covering three parts: vertical connection, horizontal connection and feature fusion. The results are shown in Table 2, where VSC, VDC, TSC, TDC and FS denote vertical sparse connection, vertical dense connection, transverse (horizontal) sparse connection, transverse (horizontal) dense connection and feature summation, respectively. The symbol ✔ indicates that the corresponding block is selected.
The ablation of the attention stereo fusion module showed that, within a certain range, performance improves modestly as the number of modules increases. However, as Table 2 shows, inserting the modules between the single image super-resolution branches through vertical dense connection adds many connections without a significant performance improvement over sparse connection, because adjacent levels carry overlapping and redundant information during network learning. This confirms that the proposed vertical sparse connection is beneficial for enhancing the information interaction between views.
Table 2 also shows that horizontal dense connection between attention stereo fusion modules improves performance significantly on both datasets compared with sparse connection. It can be concluded that the horizontal dense connections increase the multi-level exchange of cross-view information, which improves the detail reconstruction of the stereo super-resolution task. As the number of modules increases further, the performance eventually saturates: in the single image super-resolution network, the feature differences between adjacent levels are small and the view information has already been fully exploited, so adding modules yields only small gains. The feature fusion method also plays an important role in a network with many connections; its accuracy improvement exceeds those of vertical sparse connection and horizontal dense connection alone, proving that the proposed feature fusion method is effective and can further improve network performance.

3.3. Quantitative Analysis

The cross-view attention interaction fusion algorithm is a binocular interaction module built on a single image super-resolution network and can be applied to various single image super-resolution branches. Its drawback is that it is based on a pre-trained monocular super-resolution model and depends on that model's structure: the more complex the model, the more levels it has, hence the more attention stereo fusion modules and horizontal connections, and ultimately the higher the algorithm complexity. We therefore focus on analyzing super-resolution reconstruction accuracy for static stereo images.
To evaluate our algorithm more accurately, we use challenging low-resolution stereo images with a down-sampling factor of four as input, following the evaluation protocol of the stereo attention module in [19]. On the Middlebury, KITTI 2012 and KITTI 2015 datasets, the proposed attention stereo fusion module and enhanced cross-view interaction strategy are applied to the single image super-resolution networks SRCNN [7], VDSR [8], SRDenseNet [9], LapSRN [28] and SRResNet [26]. These combined networks are compared with the stereo super-resolution methods of [19], PASSRnet [16], SPAM [18] and iPASSR [20]. The results are shown in Table 3, where each entry reports PSNR (dB)/SSIM.
Our method improves super-resolution performance on Middlebury, KITTI 2012 and KITTI 2015 accordingly. On the KITTI 2015 dataset, the algorithm built on SRResNet [26] achieves PSNR and SSIM values of 25.61 dB and 0.872, respectively, the highest among all algorithms in Table 3; compared with the next best SRResNet-based result (SRResNet + SAM [19]), this is a PSNR gain of 0.08 dB and an SSIM gain of 0.009. Compared with the stereo super-resolution networks PASSRnet, SPAM and iPASSR, the combination of our algorithm with SRResNet achieves the best super-resolution results on the KITTI 2012 and KITTI 2015 datasets. The results show that, after vertical sparse connection, the proposed attention stereo fusion module effectively combines the information of the left and right views, and the horizontal dense connection proposed in the enhanced cross-view interaction strategy further combines information within views, between views and across attention stereo fusion modules.

3.4. Qualitative Analysis

Outdoor detail reconstruction is more challenging than indoor reconstruction, so we select several datasets of different outdoor scenes and choose distant and close-range objects of varying sharpness to verify the generalization ability of the model. In addition, since low-quality images with a sampling factor of four are harder to reconstruct than those with a factor of two, the qualitative experiments use images with a sampling factor of four.
On the Flickr 1024 dataset, the proposed algorithm is applied to the single image super-resolution networks SRCNN [7], SRDenseNet [9], VDSR [8], LapSRN [28] and SRResNet [26], and the qualitative results of these networks and PASSRnet [16] are analyzed. A visual comparison at a sampling factor of four is shown in Figure 4: two regions of one image from the dataset are enlarged, and the results of the different algorithms are displayed at the same magnification.
As Figure 4 shows, detail restoration can be assessed intuitively against the original high-resolution image. For example, in the result produced by combining SRResNet with our algorithm, the contours around the letters V, I and A are more complete and the details are richer and clearer; at the eaves, the stripes and holes recovered by our algorithm are more complete and clearer than those of the other algorithms. This demonstrates that our attention stereo fusion algorithm achieves better detail restoration and that the enhanced cross-view interaction strategy helps to further improve the reconstruction performance of the stereo super-resolution network. Since the accuracy of stereo super-resolution reconstruction depends on the similarity of corresponding pixels along the horizontal epipolar direction of the left and right views, the stereo consistency between the views is reflected in the quality of detail restoration. The experimental results show that our method maintains the stereo consistency of corresponding pixels between views well.
In addition to distant scenes, close-range images from the KITTI 2012 dataset [30] are also tested. A visual comparison at a sampling factor of four is shown in Figure 5. Our algorithm produces more distinct edges at the window eaves of the house, and at the fence the result is clearer and more regular than those of the other algorithms, with much less blur and ghosting. For these two enlarged regions, the PSNR and SSIM values of our algorithm are higher than those of the other algorithms. Hence, the attention stereo fusion module strengthens feature discrimination and the network's learning and inference, thereby maintaining good stereo consistency of corresponding pixels between views; the enhanced cross-view interaction strategy also greatly improves the information exchange between branches and between modules, yielding stronger information transmission.
Furthermore, using the same comparison algorithms as on KITTI 2012, the qualitative results at a sampling factor of four are also compared on the KITTI 2015 dataset [31]. As shown in Figure 6, super-resolution results are compared at two positions, the engine intake grille and a road sign, including the original high-resolution image and the reconstructions of our algorithm and the other algorithms, all displayed at the same magnification for clearer observation and analysis.
As Figure 6 shows, in the results recovered by SRCNN [7], LapSRN [28] and SRResNet [26] combined with our method, the details and contours of the arrow and the disc are clearer and more complete, and the two stripes at the engine air inlet are smoother and do not cross. For SRResNet + our algorithm, the PSNR and SSIM values at the sign position are the highest, at 25.61 dB and 0.868, respectively; at the engine air inlet they are also the highest, at 25.71 dB and 0.869. The visual and numerical analyses show that cross-view attention information interaction can further improve stereo image reconstruction built on single image super-resolution networks and has an important impact on stereo super-resolution.

4. Conclusions

In this paper, we propose a cross-view attention interaction fusion algorithm for stereo super-resolution, comprising an attention stereo fusion module and an enhanced cross-view interaction strategy. The attention stereo fusion module is inserted between the left and right single image super-resolution branches via the enhanced cross-view interaction strategy; it combines the advantages of parallax attention and triple attention to improve the consistency of disparity information between stereo image pairs and the cross-dimensional interaction of image information. The enhanced cross-view interaction strategy consists of horizontal dense connection, vertical sparse connection and feature fusion; the three connection methods make full use of the single image super-resolution branches and further strengthen the stereo consistency constraint between image pairs. Experimental results show that the proposed algorithm effectively captures the correspondence between stereo images and improves the recovery of stereo image details. Since the proposed algorithm builds on existing pre-trained models, this work focused on how different single image super-resolution models affect the reconstruction accuracy of the proposed algorithm; in future research, a model that does not rely on existing single image super-resolution branches will be designed to achieve an integrated stereo super-resolution network.

Author Contributions

Conceptualization, Y.Z.; methodology, Y.Z.; software, Y.Z. and Z.Z.; validation, Y.Z., J.L. and T.Z.; formal analysis, Y.Z., J.L. and T.Z.; investigation, Y.Z. and J.L.; resources, T.Z.; data curation, Y.Z. and Z.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, J.L. and Z.Z.; visualization, Y.Z.; supervision, T.Z. and Z.Z.; project administration, T.Z. and Z.Z.; funding acquisition, T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Research Projects of Anhui Educational Committee (grant No. 2022AH050830), the Scientific Research Foundation for High-level Talents of Anhui University of Science and Technology (grant No. 2021yjrc37), the Institute of Energy, Hefei Comprehensive National Science Center (grant No. 21KZS216) and the Tianjin Municipal Education Commission Research Program (grant No. 2020KJ118).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Shi, S.; Xiangli, B.; Yin, Z. Multiframe super-resolution of color images based on cross channel prior. Symmetry 2021, 13, 901.
2. Mastylo, M.; Silva, E. Interpolation of the measure of noncompactness of bilinear operators. Trans. Am. Math. Soc. 2018, 370, 8979–8997.
3. Katsuki, T.; Torii, A.; Inoue, M. Posterior-mean super-resolution with a causal Gaussian Markov random field prior. IEEE Trans. Image Process. 2012, 21, 3182–3193.
4. Chakrabarti, A.; Rajagopalan, A.; Chellappa, R. Super-resolution of face images using kernel PCA-based prior. IEEE Trans. Multimed. 2007, 9, 888–892.
5. Esmaeilzehi, A.; Ahmad, M.; Swamy, M. FPNet: A deep light-weight interpretable neural network using forward prediction filtering for efficient single image super resolution. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 1937–1941.
6. Zhang, Q.; Feng, L.; Liang, H.; Yang, Y. Hybrid domain attention network for efficient super-resolution. Symmetry 2022, 14, 697.
7. Dong, C.; Loy, C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
8. Kim, J.; Lee, J.; Lee, K. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654.
9. Tong, T.; Li, G.; Liu, X.; Gao, Q. Image super-resolution using dense skip connections. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4809–4817.
10. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 13, 1–16.
11. Guo, Y.; Chen, J.; Wang, J.; Chen, Q.; Cao, J.; Deng, Z.; Xu, Y.; Tan, M. Closed-loop matters: Dual regression networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5406–5414.
12. He, Y.; Cao, W.; Du, X.; Chen, C. Internal learning for image super-resolution by adaptive feature transform. Symmetry 2020, 12, 1686.
13. Xu, R.; Xiao, Z.; Yao, M.; Zhang, Y.; Xiong, Z. Stereo video super-resolution via exploiting view-temporal correlations. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 460–468.
14. Ahn, H.; Jeong, J.; Kim, J.; Kwon, S.; Yoo, J. A fast 4K video frame interpolation using a multi-scale optical flow reconstruction network. Symmetry 2019, 11, 1251.
15. Jeon, D.; Beak, S.; Choi, I.; Kim, M. Enhancing the spatial resolution of stereo images using a parallax prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1721–1730.
16. Wang, L.; Wang, Y.; Liang, Z.; Lin, Z.; Yang, J.; An, W.; Guo, Y. Learning parallax attention for stereo image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12242–12251.
17. Duan, C.; Xiao, N. Parallax-based spatial and channel attention for stereo image super-resolution. IEEE Access 2019, 7, 183672–183679.
18. Song, W.; Choi, S.; Jeong, S. Stereoscopic image super-resolution with stereo consistent feature. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12031–12038.
19. Ying, X.; Wang, Y.; Wang, L.; Sheng, W.; An, W.; Guo, Y. A stereo attention module for stereo image super-resolution. IEEE Signal Process. Lett. 2020, 27, 496–500.
20. Wang, Y.; Ying, X.; Wang, L.; Yang, J.; An, W.; Guo, Y. Symmetric parallax attention for stereo image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 19–25 June 2020; pp. 1–13.
21. Dan, J.; Qu, Z.; Wang, X. A disparity feature alignment module for stereo image super-resolution. IEEE Signal Process. Lett. 2021, 28, 1285–1289.
22. Zhu, X.; Guo, K.; Fang, H.; Chen, L.; Ren, S.; Hu, B. Cross view capture for stereo image super-resolution. IEEE Trans. Multimed. 2022, 24, 3074–3086.
23. Wang, L.; Guo, Y.; Wang, Y.; Liang, Z.; Lin, Z.; Yang, J.; An, W. Parallax attention for unsupervised stereo correspondence learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2108–2125.
24. Jin, K.; Wei, Z.; Yang, A.; Guo, S.; Gao, M.; Zhou, X.; Guo, G. SwiniPASSR: Swin transformer based parallax attention network for stereo image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 919–928.
25. Chu, X.; Chen, L.; Yu, W. NAFSSR: Stereo image super-resolution using NAFNet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1238–1247.
26. Shi, W.; Caballero, J.; Theis, L.; Huszar, F.; Aitken, A.; Ledig, C.; Wang, Z. Is the deconvolution layer the same as a convolutional layer? arXiv 2016, arXiv:1609.07009.
27. Huang, G.; Liu, Z.; Weinberger, K. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
28. Lai, W.; Huang, J.; Ahuja, N.; Yang, M. Deep Laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5835–5843.
29. Wang, Y.; Wang, L.; Yang, J.; An, W.; Guo, Y. Flickr1024: A large-scale dataset for stereo image super-resolution. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 3852–3857.
30. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
31. Menze, M.; Geiger, A. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3061–3070.
Figure 1. Parallax attention block.
Figure 2. Triple attention block.
Figure 3. Enhanced cross-view interaction.
Figure 4. Super-resolution visualization comparison on the Flickr 1024 dataset.
Figure 5. Super-resolution visualization comparison on the KITTI 2012 dataset.
Figure 6. Super-resolution visualization comparison on the KITTI 2015 dataset.
Table 1. Ablation experiment on attention stereo fusion module.

| Setting of Attention Module | KITTI 2015 PSNR (dB) | KITTI 2015 SSIM | Middlebury PSNR (dB) | Middlebury SSIM |
|---|---|---|---|---|
| PA_1 | 24.98 | 0.849 | 27.94 | 0.886 |
| PA_2 | 25.02 | 0.849 | 27.95 | 0.886 |
| PA_3 | 25.10 | 0.850 | 28.00 | 0.886 |
| (PA + CA)_1 | 25.12 | 0.851 | 28.04 | 0.886 |
| (PA + CA)_2 | 25.16 | 0.851 | 28.10 | 0.886 |
| (PA + CA)_3 | 25.19 | 0.854 | 28.15 | 0.887 |
| (PA + TA)_1 | 25.14 | 0.851 | 28.11 | 0.886 |
| (PA + TA)_2 | 25.23 | 0.856 | 28.21 | 0.886 |
| (PA + TA)_3 | 25.29 | 0.861 | 28.28 | 0.890 |
Table 2. Ablation experiment on enhanced cross-view interaction strategy.

| VSC | VDC | TSC | TDC | FS | Ours | Middlebury PSNR (dB) | Middlebury SSIM | KITTI 2015 PSNR (dB) | KITTI 2015 SSIM |
|---|---|---|---|---|---|---|---|---|---|
| | | | | | | 28.21 | 0.880 | 25.23 | 0.851 |
| | | | | | | 28.23 | 0.883 | 25.26 | 0.854 |
| | | | | | | 28.28 | 0.890 | 25.29 | 0.861 |
| | | | | | | 28.29 | 0.889 | 25.29 | 0.860 |
| | | | | | | 28.24 | 0.883 | 25.24 | 0.855 |
Table 3. Quantitative comparison of stereo super-resolution models (PSNR (dB)/SSIM, ×4).

| Models | Middlebury | KITTI 2012 | KITTI 2015 | Average Value |
|---|---|---|---|---|
| PASSRnet [16] | 28.62/0.893 | 26.26/0.826 | 25.42/0.860 | 26.77/0.860 |
| SPAM [18] | 29.36/0.912 | 26.31/0.869 | 24.81/0.860 | 26.83/0.880 |
| iPASSR [20] | 29.11/0.835 | 26.35/0.803 | 25.25/0.807 | 26.90/0.815 |
| SRCNN [7] + SAM [19] | 27.70/0.875 | 25.64/0.857 | 24.77/0.843 | 26.04/0.858 |
| SRCNN [7] + Ours | 27.75/0.881 | 25.70/0.860 | 24.79/0.848 | 26.08/0.863 |
| VDSR [8] + SAM [19] | 28.25/0.887 | 26.15/0.868 | 25.22/0.855 | 26.54/0.870 |
| VDSR [8] + Ours | 28.28/0.890 | 26.21/0.872 | 25.29/0.861 | 26.59/0.874 |
| SRDenseNet [9] + SAM [19] | 28.14/0.885 | 26.10/0.866 | 25.17/0.853 | 26.47/0.868 |
| SRDenseNet [9] + Ours | 28.19/0.892 | 26.17/0.872 | 25.22/0.859 | 26.53/0.874 |
| LapSRN [28] + SAM [19] | 28.25/0.888 | 26.15/0.868 | 25.20/0.855 | 26.53/0.870 |
| LapSRN [28] + Ours | 28.30/0.894 | 26.19/0.872 | 25.26/0.861 | 26.58/0.876 |
| SRResNet [26] + SAM [19] | 28.81/0.897 | 26.35/0.873 | 25.53/0.863 | 26.90/0.878 |
| SRResNet [26] + Ours | 28.92/0.905 | 26.42/0.881 | 25.61/0.872 | 26.99/0.886 |