*Article* **Lightweight Image Super-Resolution Based on Local Interaction of Multi-Scale Features and Global Fusion**

**Zhiqing Meng <sup>1,\*</sup>, Jing Zhang <sup>1</sup>, Xiangjun Li <sup>2,\*</sup> and Lingyin Zhang <sup>1</sup>**


**Abstract:** In recent years, computer vision technology has been widely applied in various fields, making super-resolution (SR), a low-level vision task, a research hotspot. Although deep convolutional neural networks have made good progress in single-image super-resolution (SISR), they adapt poorly to real-time interactive devices that require fast response, owing to their large number of parameters, long inference time, and complex training. To solve this problem, we propose a lightweight image reconstruction network (MSFN) for multi-scale feature local interaction based on global connection of the local feature channel. We develop a multi-scale feature interaction block (FIB) in MSFN to fully extract spatial information of different regions of the original image by using convolution layers of different scales. On this basis, we use a channel stripping operation to compress the model and reduce the number of model parameters as much as possible while preserving the quality of the reconstructed image. Finally, we test the proposed MSFN model on benchmark datasets. The experimental results show that the MSFN model outperforms other state-of-the-art SR methods in reconstruction quality, computational complexity, and inference time.

**Keywords:** multi-scale; local interaction; lightweight image reconstruction network; global fusion

**MSC:** 68T01; 68T07

#### **1. Introduction**

Single-image super-resolution (SISR) refers to the process of recovering a natural and clear high-resolution (HR) image from a low-resolution (LR) image. SISR has a wide range of real-world applications, where it is often used to improve the visual quality of images [1] and the performance of other high-level vision tasks [2], especially in satellite and aerial imaging [3–5], medical imaging [6–8], ultrasound imaging [9], and face recognition [10]. However, since different HR images can be downsampled to the same LR image, the mapping from LR to HR is not unique, and this ambiguity makes SISR still a challenging task.

In recent years, with the continuous improvement of computer learning capabilities, deep neural networks, especially methods based on convolutional neural networks, have been widely used in SISR, which has greatly promoted the development of image reconstruction. Dong et al. [11] first introduced a convolutional neural network (CNN) into image SR and proposed the super-resolution convolutional neural network (SRCNN). However, because the input LR image must be preprocessed by bicubic interpolation, the computational complexity increases and high-frequency details in the original image are lost, which limits the efficiency of image reconstruction. Shi et al. [12] proposed an efficient sub-pixel convolutional neural network (ESPCN), which replaces the bicubic interpolation preprocessing with sub-pixel convolution for upsampling, thereby reducing the overall computational complexity and avoiding the checkerboard effect caused by the deconvolution layer. In pursuit of better model performance, Zhang et al. [13] proposed the very deep residual channel attention network (RCAN) based on ESPCN, which stacks a large number of residual blocks and local connections to obtain better reconstruction quality.

**Citation:** Meng, Z.; Zhang, J.; Li, X.; Zhang, L. Lightweight Image Super-Resolution Based on Local Interaction of Multi-Scale Features and Global Fusion. *Mathematics* **2022**, *10*, 1096. https://doi.org/10.3390/math10071096

Academic Editor: Jakub Nalepa

Received: 21 February 2022 Accepted: 25 March 2022 Published: 29 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Increasing the network depth can improve the quality of image reconstruction, but it also substantially increases the number of model parameters and makes training more complicated. To solve this problem, Tai et al. [14] added a recursive block to the neural network to reduce model parameters, constructed a deep recursive residual network (DRRN), and transmitted the residual information through a combination of global and local learning to reduce the difficulty of training. DRRN uses a parameter-sharing strategy to reduce the number of parameters, but it still requires a huge amount of computation to reconstruct an image. Hui et al. [15] proposed an information distillation network (IDN), which divides the features into two parts, retaining one part and passing the other on for further information extraction; the model parameters are thus reduced while reconstruction quality is preserved. Liu et al. [16] proposed a residual feature distillation network (RFDN) based on residual learning. The network retains the original features of the image through residual connections without introducing additional parameters, but the obtained feature maps lack information relating local features. Based on RFDN, this paper strips the channels with rich information features in the model, and pays more attention to the multi-scale channel information of the original image and the associated information of local areas. The main work of this paper is as follows:


**Figure 1.** Trade-off between reconstruction performance and parameters on Urban100 with scaling factor ×4.

#### **2. Related Work**

In recent years, single-image super-resolution has been studied extensively [17–19]. We present an overview of deep CNNs for image super-resolution in Section 2.1. To reduce model parameters and speed up inference, lightweight image super-resolution models have also been widely studied; we elaborate on them in Section 2.2.

#### *2.1. Deep CNN for Image Super-Resolution*

Dong et al. [11] were the first to use an end-to-end convolutional neural network (SRCNN) to extract, map, and reconstruct image features, and found that the reconstruction quality exceeded that of traditional image super-resolution (SR) methods. However, the network structure is simple, and the correlation between the low-resolution (LR) image and the original image is not considered. Some researchers then focused on network depth, hoping to fully extract the relevant information between images through deeper models. Kim et al. [20] proposed a very deep super-resolution (VDSR) convolutional network based on global residual learning, which not only improves the reconstruction quality but also accelerates network convergence. Haris et al. [21] proposed a deep back-projection network (DBPN) for super-resolution with iterative up–down sampling, which provides timely feedback of the error mapping at each stage and performs better, especially for large scaling factors. Yang et al. [22] used skip connections to increase the number of network layers, which enhanced the feature expression ability of the network and made the reconstructed image closer to the real image. Lim et al. [23] removed the batch normalization layers that degraded the reconstruction quality in an enhanced deep super-resolution network (EDSR), and stacked more convolutional layers to achieve better performance. To improve the visual quality of reconstructed images, Yang et al. [24] constructed a multi-level feature extraction module using dense connections, which can obtain richer hierarchical feature maps. As network structures deepen, the number of parameters and the computational complexity of this type of model increase greatly, limiting its application in the real world.

#### *2.2. Lightweight CNN for Image Super-Resolution*

To reduce the number of model parameters, the computational complexity, and the training difficulty, researchers began to improve deep networks, compressing the model through parameter sharing, residual learning, attention mechanisms, and information distillation, and proposed lightweight image reconstruction networks based on CNNs. Kim et al. [25] used a deep recursive structure in the deep recursive convolutional network (DRCN) to share parameters, but the model performance was degraded compared with VDSR on some test sets, and the actual amount of computation did not decrease accordingly. Tai et al. [14] proposed a deep recursive residual network (DRRN) based on DRCN, which reduces storage cost and computational complexity by global connection of multipath residual information. Li et al. [26] added an adaptive weighted block in residual learning to fully extract image features and effectively limit the number of model parameters. Hu et al. [27] introduced a channel attention mechanism into a deep neural network, adding weights to the features of each output channel in the convolution operation to allocate limited computing resources reasonably; this mechanism has since been widely applied in lightweight network architectures. Hui et al. [28] proposed an information distillation network, which uses a combination of embedding loss and information distillation to solve the problem of image recognition. They used a small-size convolution kernel to compress network parameters and reduce the computational cost and complexity of the training model. Tian et al. [29] proposed a heterogeneous structure in information extraction and enhancement blocks, which greatly reduced the computational cost and memory consumption. Hui et al. [15] used convolution kernels with sizes of 1 × 1 and 3 × 3 to enhance the extracted features, which gave the model better image reconstruction performance and inference speed. Jiang et al. [30] constructed a sparse perceptive attention module based on pruning, which can reduce the model size without a noticeable drop in performance. However, these methods cannot make full use of the associated information between the original image and the low-resolution (LR) image, and the interaction of information between different regions has not received enough attention. Based on this, we adopt a fusion block based on multi-scale feature local interaction to fully extract the feature information in the original image. In addition, we strip and compress the channels, and make a trade-off between performance and inference speed, which effectively improves the comprehensive performance of the model.

#### **3. Proposed Method**

#### *3.1. Network Architecture*

In this paper, we propose a lightweight image reconstruction network based on local interaction of multi-scale features, and use local interaction of multi-scale features and the global connection of comparative residuals to learn second-order feature statistics in order to obtain more representative features. The network structure we propose mainly includes five parts: shallow feature extraction block, deep feature extraction block based on multi-scale interaction mechanism, global feature fusion block, upsampling block, and image reconstruction block, as shown in Figure 2.

As shown in Equation (1), $I\_{LR}$ represents the input image, and the network uses a convolution layer to extract the shallow features of $I\_{LR}$. The shallow feature extraction block can be expressed as follows:

$$X\_{SF} = F\_{SF}(I\_{LR}) \tag{1}$$

where $F\_{SF}(\cdot)$ represents a simple single-layer convolution mapping, which aims to achieve shallow feature extraction. The shallow feature $X\_{SF}$ is extracted through single-layer convolution, and then $X\_{SF}$ is input into the deep feature extraction block based on the multiscale interaction mechanism to obtain the high-dimensional feature $X\_{PF}$ after feature mapping, which is expressed as Equation (2):

$$X\_{PF} = F\_{DPAM}(X\_{SF}) \tag{2}$$

**Figure 2.** The architecture of our proposed lightweight image reconstruction network (MSFN).

The deep feature extraction block is composed of M feature interaction blocks (FIBs) and M skip connections, where $F\_{DPAM}(\cdot)$ represents the mapping function corresponding to the deep feature extraction block, and $X\_{PF}$ represents the extracted feature maps with deep receptive fields. The features extracted from each FIB are first concatenated along the channel dimension to form new high-dimensional features, and then a single convolution layer is used to reduce their dimensionality. Compared with existing SISR methods, the proposed feature extraction block based on the multiscale feature interaction mechanism allows the network to use the extracted features more effectively and suppress invalid features. Moreover, this block can compare and fuse the receptive field information of different scales in the original image, fully retain the texture information of the low-resolution (LR) image, and effectively improve the quality of the reconstructed image. The features extracted from the FIBs are input into the global feature fusion block, and the feature information extracted at different stages is retained to the maximum extent by means of global connection. The fused feature $X\_{GF}$ is shown in Equation (3):

$$X\_{GF} = F\_{GF}([X\_{PF\_1}, X\_{PF\_2}, \cdots, X\_{PF\_M}])\tag{3}$$

where $X\_{PF\_m}$ ($m = 1, \ldots, M$) represents the high-dimensional feature extracted by the $m$-th FIB. The feature information extracted by the M FIBs is input into the mapping function $F\_{GF}(\cdot)$ corresponding to the feature fusion block, and the feature information is spliced in the channel dimension to obtain the global feature $X\_{GF}$ based on the entire network.
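The fusion step in Equation (3) can be sketched in NumPy as a channel-wise concatenation followed by a dimension-reducing convolution. This is only an illustrative sketch: the 1 × 1 kernel size, the function name `global_fusion`, and the random weights are assumptions, since the text only states that a single convolution layer reduces the dimensionality.

```python
import numpy as np

def global_fusion(fib_outputs, weight, bias):
    """Concatenate FIB outputs on the channel axis, then apply a 1x1 convolution.

    fib_outputs: list of M arrays, each of shape (C, H, W)
    weight:      array of shape (C_out, M * C); a 1x1 kernel (assumed size)
    bias:        array of shape (C_out,)
    """
    stacked = np.concatenate(fib_outputs, axis=0)       # (M*C, H, W)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    fused = np.einsum("oc,chw->ohw", weight, stacked)   # (C_out, H, W)
    return fused + bias[:, None, None]

rng = np.random.default_rng(0)
M, C, H, W = 6, 48, 8, 8                                # six FIBs, 48 channels
feats = [rng.standard_normal((C, H, W)) for _ in range(M)]
w = rng.standard_normal((C, M * C)) * 0.01              # placeholder weights
b = np.zeros(C)
x_gf = global_fusion(feats, w, b)
print(x_gf.shape)
```

Concatenation keeps every stage's features available to the fusion layer, and the convolution brings the channel count back down to the working width of the network.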

Then, the features after fusion and transformation are used as the input of upsampling block, and the input is upsampled by using the method of sub-pixel convolution [12] to obtain a high-resolution (HR) feature mapping. The features after upsampling are shown in Equation (4):

$$X\_{SR} = F\_{UP}^{L}(X\_{GF}) = PS(X\_{GF}) \tag{4}$$

$$PS(T\_{x,y,c \cdot r^2}) = T\_{rx,ry,c} \tag{5}$$

In Equation (4), $F\_{UP}^{L}(\cdot)$ represents the upsampling operation based on sub-pixel convolution, and $X\_{SR}$ represents the high-resolution (HR) feature map output after upsampling. At present, the commonly used upsampling methods in super-resolution (SR) reconstruction include interpolation, transposed convolution, and sub-pixel convolution. Sub-pixel convolution achieves upsampling by rearranging pixels, which reduces the number of model parameters. Therefore, to achieve better reconstruction speed and accuracy, we implement upsampling through sub-pixel convolution. In Equation (4), $PS(\cdot)$ represents a periodic shuffling operator, which rearranges a feature map of size $H \times W \times C \cdot r^2$ into a feature map of shape $rH \times rW \times C$. Equation (5) describes the sub-pixel upsampling operation mathematically, and its effect is shown in Figure 3.

**Figure 3.** Sub-pixel sample operation.
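The rearrangement from $H \times W \times C \cdot r^2$ to $rH \times rW \times C$ can be illustrated with a small NumPy sketch. The channel-last layout follows the shapes given in the text; the ordering of the $r^2$ sub-pixel offsets inside the channel axis is an assumption (frameworks differ on this), and `pixel_shuffle` is a name chosen here for illustration.

```python
import numpy as np

def pixel_shuffle(t, r):
    """Rearrange an (H, W, C*r*r) array into an (r*H, r*W, C) array."""
    h, w, crr = t.shape
    if crr % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = crr // (r * r)
    # Split the channel axis into (row offset, column offset, channel).
    t = t.reshape(h, w, r, r, c)
    # Interleave the offsets with the spatial axes: (H, r, W, r, C).
    t = t.transpose(0, 2, 1, 3, 4)
    return t.reshape(h * r, w * r, c)

x = np.arange(2 * 2 * 4).reshape(2, 2, 4)   # H=2, W=2, C=1, r=2
y = pixel_shuffle(x, 2)
print(y.shape)
print(y[..., 0])
```

Each group of $r^2$ channels at one LR position fills one $r \times r$ block of the HR map, so no interpolation weights or extra parameters are involved in the rearrangement itself.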

#### *3.2. Multi-Scale Feature Interaction Block*

In this section, we provide more details on the multi-scale FIB. The FIB is the main structure for feature mapping and local fusion in the network, which constructs N multiscale feature interaction components (MSCs) and N channel attention blocks (CABs) for pixel information of different scales. The FIB structure is shown in Figure 4.

**Figure 4.** Multi-scale feature interaction block (FIB).

The input of each FIB first passes through an MSC to extract feature information under multiple receptive fields. As shown in Equation (6), the output feature $X\_{out}^{i-1}$ of the $(i-1)$-th MSC is the input feature $X\_{in}^{i}$ of the $i$-th MSC. $F\_{MSC}^{i}(\cdot)$ is the mapping corresponding to the $i$-th MSC, through which we extract the interaction and spatial information of the regional features of $X\_{in}^{i}$ at different scales, so that the high-frequency information and edge texture details of the input image are preserved relatively completely by the feature $X\_{out}^{i}$.

$$X\_{out}^{i} = F\_{MSC}^{i}(X\_{in}^{i}) \quad (X\_{in}^{i} = X\_{out}^{i-1}) \tag{6}$$

The specific architecture of the MSC is shown in Figure 5. The MSC is mainly composed of three filters of different scales, with convolution kernel sizes of 1 × 1, 3 × 3, and 5 × 5, respectively. MSCs enrich spatial information by expanding receptive fields: the large-scale filters are mainly used to extract feature attention information in different regions, and the small-scale filters are used to enhance the degree of correlation between local regions. We zero-pad the edges of the feature map so that its size remains unchanged after the convolution operation. For kernel sizes of 3 × 3 and 5 × 5, the corresponding padding is 1 and 2, respectively. For a kernel size of 1 × 1, no padding is applied.

**Figure 5.** Multi-scale feature extraction component.
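The padding values quoted above follow from the standard output-size relation for a stride-1 convolution. A minimal check, assuming stride 1 and odd kernel sizes (the helper names are illustrative, and the 64-pixel input matches the LR patch size used in training):

```python
def same_padding(kernel_size):
    """Zero padding per side that preserves spatial size at stride 1 (odd kernels)."""
    if kernel_size % 2 == 0:
        raise ValueError("expects an odd kernel size")
    return (kernel_size - 1) // 2

def conv_output_size(size, kernel_size, padding, stride=1):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel_size) // stride + 1

for k in (1, 3, 5):
    p = same_padding(k)
    print(k, p, conv_output_size(64, k, p))
# prints:
# 1 0 64
# 3 1 64
# 5 2 64
```

Because all three branches keep the same spatial size, their outputs can be summed element by element in the fusion step.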

The MSC uses filters of different scales to extract and enhance feature information, and the enhanced features are added pixel by pixel according to their weights to obtain a new feature map with rich spatial elements, as shown in Equations (7) [31], (8), and (9):

$$X\_i = b\_i + \sum\_{j=0}^{C\_{in}-1} W\_i \times X\_{pre} \quad (i = 1, 2, 3) \tag{7}$$

$$X\_{MF} = \sum\_{i=1}^{3} k\_i \times C\_i \times X\_i \quad (k\_i = 1, i = 1, 2, 3) \tag{8}$$

$$C\_{i} = \frac{1}{\|X\_{i}\|\_2} \quad (i = 1, 2, 3) \tag{9}$$

In Equation (7), $W\_1$, $W\_2$, and $W\_3$ represent the weight coefficients corresponding to filters with convolution kernel sizes 1 × 1, 3 × 3, and 5 × 5, respectively. As shown in Figure 6, convolution kernels of different scales focus on the correlation information between different regions of the same object, and the extracted feature information is then combined by weighted summation. In Equation (9), $C\_i$ is the reciprocal of the two-norm of each feature map $X\_i$, and each feature map is normalized by multiplying by this value. In Figure 6, $k\_1$, $k\_2$, and $k\_3$ represent the weight coefficients of the feature information extracted by each convolution kernel. In this paper, $k\_1 = k\_2 = k\_3 = 1$, which gives the extracted feature map $X\_{MF}$ rich spatial information and regional interactions and helps restore key features and edge information in subsequent image reconstruction.

**Figure 6.** Multi-scale convolution operation.
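The normalized summation in Equations (8) and (9) can be sketched as follows. This is an illustration under assumptions: the two-norm is taken over each whole branch output (the text does not say whether it is per map, per channel, or per pixel), and the small `eps` term is added here only to avoid division by zero.

```python
import numpy as np

def fuse_multiscale(branches, k=(1.0, 1.0, 1.0), eps=1e-12):
    """Sum branch outputs after scaling each by the reciprocal of its two-norm."""
    fused = np.zeros_like(branches[0], dtype=float)
    for k_i, x_i in zip(k, branches):
        c_i = 1.0 / (np.linalg.norm(x_i) + eps)  # Equation (9), norm over the whole array (assumed)
        fused += k_i * c_i * x_i                 # Equation (8)
    return fused

rng = np.random.default_rng(0)
# Three hypothetical branch outputs (1x1, 3x3, 5x5) with very different magnitudes.
x1, x2, x3 = (rng.standard_normal((48, 16, 16)) * s for s in (1.0, 10.0, 100.0))
x_mf = fuse_multiscale([x1, x2, x3])
print(x_mf.shape)
```

With the normalization, each branch contributes a unit-norm term regardless of its raw magnitude, so no single kernel size dominates the sum when $k\_1 = k\_2 = k\_3 = 1$.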

The features extracted by the feature interaction component are first input into the channel attention component, and the output is then passed through the remaining $N - 1$ MSCs and CABs for iterative optimization. Finally, the features obtained at each stage are spliced in the channel dimension, and the final feature $X\_t$ is obtained by residual connection with the initial input feature $X\_{pre}$, as shown in Equations (10) and (11):

$$X^{i} = F\_{CAB}^{i}(F\_{MSC}^{i}(\cdots F\_{CAB}^{1}(F\_{MSC}^{1}(X\_{pre})))) \quad (i = 1, 2, \cdots, N) \tag{10}$$

$$X\_t = F\_{conv1}(\text{Concat}[F\_{conv1}^{1}(X^1), \dots, F\_{conv1}^{N-1}(X^{N-1})] + X^N) + X\_{pre} \tag{11}$$

where $F\_{MSC}^{i}(\cdot)$ and $F\_{CAB}^{i}(\cdot)$ represent the mappings corresponding to the $i$-th MSC and CAB, respectively. We use a single convolutional layer to reduce the dimensionality of the feature maps $X^{i}$ obtained at each stage, then splice the dimension-reduced features in the channel dimension, and finally add the original feature $X\_{pre}$ element-wise to obtain the final feature map $X\_t$.

#### **4. Experiments**

In this section, we firstly test the influence of the number of FIBs and channels on the quality of the reconstructed image; secondly, we perform test experiments on SR benchmark datasets such as Set5 [32], Set14 [33], Urban100 [34], BSD100 [35], and Manga109 [36]; and then we use the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of the Y-channel in YCbCr as quantitative indicators to compare the experimental data with other excellent super-resolution (SR) methods. Finally, we visualize the reconstruction results and analyze the reconstruction effects from a subjective visual perspective.

#### *4.1. Training Settings*

To compare with existing networks such as DRRN [14], CARN [37], MemNet [38], and IMDN [28], we use the same training dataset, DIV2K [39]. The dataset includes a total of 800 training images, 100 validation images, and 100 test images, and contains varied scenes with rich edge and texture details. We also perform data augmentation on the training images [40] by using random rotation, horizontal flipping, and small-window cropping, which expands the training data to eight times its original size so that the model can adapt to image reconstruction problems with different tilt angles.
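One common way to obtain an eightfold expansion from rotations and flips is to combine the four 90° rotations with a horizontal flip, and to cut aligned LR/HR patches for the small-window step. The sketch below assumes this recipe; the text does not spell out the exact procedure, and the function names and array sizes are hypothetical.

```python
import numpy as np

def eightfold_variants(img):
    """Return the 8 rotation/flip variants of an (H, W, C) image."""
    variants = []
    for k in range(4):
        rotated = np.rot90(img, k=k, axes=(0, 1))   # rotate by k * 90 degrees
        variants.append(rotated)
        variants.append(rotated[:, ::-1])           # horizontal flip of the rotated image
    return variants

def random_crop_pair(lr, hr, patch, scale, rng):
    """Crop aligned LR/HR patches; the HR patch is `scale` times larger per side."""
    h, w = lr.shape[:2]
    y = int(rng.integers(0, h - patch + 1))
    x = int(rng.integers(0, w - patch + 1))
    lr_patch = lr[y:y + patch, x:x + patch]
    hr_patch = hr[y * scale:(y + patch) * scale, x * scale:(x + patch) * scale]
    return lr_patch, hr_patch

rng = np.random.default_rng(0)
lr = rng.random((100, 120, 3))          # placeholder LR image
hr = rng.random((400, 480, 3))          # placeholder HR image at scale x4
lr_patch, hr_patch = random_crop_pair(lr, hr, patch=64, scale=4, rng=rng)
print(len(eightfold_variants(lr_patch)), lr_patch.shape, hr_patch.shape)
```

The same rotation and flip must be applied to the LR patch and its HR counterpart so that the pair stays aligned.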

In the training phase, we set the batch size to 16, the LR input size to 64 × 64, and the number of channels in the convolution layers to 48. The deep feature extraction block based on the multiscale feature interaction mechanism contains six FIBs, and each FIB contains four MSCs and four CABs. The selection of the number of channels and the number of FIBs is explained in detail in Section 4.2. The model parameters are optimized using the Adam [41] algorithm with $\beta\_1 = 0.9$, $\beta\_2 = 0.999$, and $\varepsilon = 10^{-8}$. Using weight normalization, the learning rate is initially set to $10^{-3}$ and then halved every 200 epochs. All experiments were completed on a computer with the following specifications: Intel i7-9700, 32 GB RAM, and NVIDIA GeForce RTX2080Ti 12 GB GPU.
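The learning-rate schedule described above is a step decay. A minimal sketch, assuming epochs are counted from zero and the halving takes effect at epochs 200, 400, and so on (`learning_rate` is an illustrative helper, not part of any library):

```python
def learning_rate(epoch, base_lr=1e-3, step=200, gamma=0.5):
    """Step decay: multiply the rate by `gamma` once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

for epoch in (0, 199, 200, 400, 999):
    print(epoch, learning_rate(epoch))
```

Under these assumptions the rate stays at $10^{-3}$ for the first 200 epochs, drops to $5 \times 10^{-4}$ for the next 200, and so on.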

#### *4.2. Ablation Experiment*

We first study the influence of the number of multi-scale feature interaction blocks (FIBs) in the model on the final experimental results. We train on the DIV2K dataset and then test the quantitative indicators of the model on the Set14 dataset. The experimental results are shown in Table 1 and Figure 7.

**Figure 7.** The influence of the number of FIBs on the model reconstruction effect. (**a**) LOSS vs. number of epochs; (**b**) PSNR vs. number of epochs; (**c**) SSIM vs. number of epochs. The influence of the number of channels in the FIB on the model reconstruction effect. (**d**) LOSS vs. number of epochs; (**e**) PSNR vs. number of epochs; (**f**) SSIM vs. number of epochs.


**Table 1.** The influence of the number of FIBs on the model reconstruction effect.

To better understand the relationship between the number of FIBs and the quality of image reconstruction, we fix the number of channels at 48, keep the parameters of the other components in the model unchanged, and change only the number of FIBs. Table 1 shows that the image reconstruction quality is positively correlated with the number of FIBs. With the scaling factor set to four, when the number of FIBs increases from four to six, the number of model parameters increases by 176 K and the PSNR of the reconstructed images increases by 0.14 dB, which indicates that the reconstruction quality has been significantly improved. When the number of FIBs increases from six to eight, the SSIM value of the reconstructed image improves to 0.7824. The influence of the number of FIBs on the LOSS, PSNR, and SSIM values of the reconstruction results is shown in Figure 7a–c. As the number of FIBs increases, the LOSS value of the reconstructed image relative to the original image decreases, while quantitative indicators such as PSNR and SSIM increase.

To verify the influence of the number of channels in the FIB on the reconstruction quality of the model, we perform comparative experiments on models with different numbers of channels. Table 2 shows that as the number of channels increases, the reconstruction quality of the model on the Urban100 dataset increases, but the number of model parameters also increases sharply. When the number of channels is adjusted from 48 to 64, the total number of model parameters increases from 571 K to 1004 K, while the SSIM value increases by only 0.0009. Figure 7d–f compare the LOSS, PSNR, and SSIM values of models with different numbers of channels on the Set5 dataset. Therefore, from the perspective of a lightweight model, a larger number of channels is not necessarily better, and the choice must be weighed against the number of model parameters. As can be seen from Tables 1 and 2 and Figure 7, setting the number of FIBs to six and the number of channels to 48 gives the model the best overall balance between parameter count and reconstruction quality.


**Table 2.** The influence of the number of channels in the FIB on the model reconstruction effect.
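The sharp growth in parameters with channel width is what one would expect from convolution layers, whose parameter count scales with the product of input and output channels. The sketch below counts the parameters of a single layer with equal input and output width; it is a back-of-the-envelope illustration, not a reconstruction of the MSFN parameter budget, and `conv_params` is a helper defined here.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Learnable parameters of one k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

for k in (1, 3, 5):
    p48 = conv_params(48, 48, k)
    p64 = conv_params(64, 64, k)
    print(k, p48, p64, round(p64 / p48, 3))
```

Per layer, widening from 48 to 64 channels multiplies the parameter count by roughly $(64/48)^2 \approx 1.78$, which is close to the reported growth from 571 K to 1004 K (about 1.76×).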

To further explore how convolution kernels of different sizes extract features and how they influence the reconstructed images, we separated the feature maps extracted by the convolution layers of different scales in the first MSC of the second FIB and performed visual analysis on them. Figure 8b shows that the small-scale convolution kernel pays more attention to the pixel information of the shallow layer, focusing on extracting the small-resolution features in the original image. By analyzing Figure 8c,d, we find that the larger the convolution kernel, the more global the extracted information, and the more attention is given to the relevance of local information. Therefore, the visualized results support using convolution kernels of different scales to extract and attend to spatial information at different levels.

#### *4.3. Quantitative Analysis*

We compared the proposed MSFN with commonly used baseline SR models at ×2, ×3, and ×4 scales, including SRCNN [11], FSRCNN [42], VDSR [20], LapSRN [43], DRRN [14], MemNet [38], LESRCNN [29], SRMDNF [44], SRDenseNet [45], CARN [37], and IMDN [28], using PSNR and SSIM [46] as quantitative evaluation metrics. PSNR evaluates the distortion level between the reconstructed image and the target image based on the error between corresponding pixels, and it is the most common and widely used objective image quality metric. To compare reconstruction performance with mainstream super-resolution algorithms, PSNR is selected as one of the quantitative evaluation metrics. However, since PSNR does not take into account the visual characteristics of the human eye, its results are often inconsistent with subjective perception. Therefore, we also compare the reconstruction results of each algorithm on the SSIM metric. SSIM is a full-reference image quality metric that measures image similarity in terms of brightness, contrast, and structure, and it is more consistent with how the human eye perceives images.
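PSNR on the Y channel can be computed with a few lines of NumPy. The sketch below assumes 8-bit RGB inputs and the ITU-R BT.601 luma conversion commonly used in SR benchmarks; the exact conversion and any border cropping used for the reported numbers are not stated in the text, so this is illustrative only.

```python
import numpy as np

def rgb_to_y(img):
    """Luma (Y of YCbCr, ITU-R BT.601 studio range) from RGB values in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
hr = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)   # placeholder ground truth
sr = np.clip(hr + rng.normal(0.0, 2.0, size=hr.shape), 0, 255)   # placeholder reconstruction
print(round(psnr(rgb_to_y(hr), rgb_to_y(sr)), 2))
```

Because PSNR depends only on the mean squared error, two reconstructions with the same score can still look quite different, which is why SSIM is reported alongside it.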

**Figure 8.** Feature map visualization: (**a**) input image; (**b**) feature map output by the convolution kernel with a size of 1 × 1; (**c**) feature map output by the convolution kernel with a size of 3 × 3; (**d**) feature map output by the convolution kernel with a size of 5 × 5.

The specific results are shown in Table 3 (the red text font represents the optimal results, the number of FIBs in the MSFN model and the MSFN-S model is set to six, the number of channels is set to 48, and the convolution kernel of MSC in the MSFN-S model is set to 1 × 1, 3 × 3, and 1 × 1, respectively).

Table 3 shows that when the scaling factor is 2, the PSNR value of the proposed MSFN model is higher than that of the CARN model of the same parameter scale by 0.25 dB, 0.25 dB, 0.15 dB, 0.32 dB, and 0.61 dB on the five datasets, respectively, so the MSFN model is superior to the CARN model in reconstruction quality. When the scaling factor is 3, the number of parameters of the small-scale MSFN-S model is similar to that of the LESRCNN model, but its image reconstruction quality is much higher. The test result on the BSD100 dataset is higher by 0.2 dB, so that the reconstructed image retains more of the original information and texture details. When the scaling factor is 4, we selected the models whose reconstruction quality exceeds 32.10 dB on Set5; among them, the MSFN-S model has the smallest number of parameters, and it also obtains better results in the reconstruction tests on the other datasets. With Manga109 as the test dataset, the SSIM value of the MSFN reconstructed image is the best among the larger models with more than 1000 K parameters, reaching 0.9089, an improvement of 0.0065 over the SRMDNF model of the same scale.


**Table 3.** Average PSNR/SSIM value for scale factor ×2, ×3, and ×4 on datasets Set5, Set14, BSD100, Urban100, and Manga109.

Red color indicates the best performance.

To understand the comprehensive performance of each model, we compare the computational complexity required in the image reconstruction process, the inference time, and the PSNR value of the reconstruction result for LESRCNN [29], CARN [37], IMDN [28], and MSFN-S.

FLOPs stands for floating-point operations and can be used to measure the complexity of algorithms and models. Equation (12) gives the FLOPs of a convolution layer [47]:

$$FLOPs = \left(2 \times C\_i \times K^2 - 1\right) \times H \times W \times C\_o \tag{12}$$

$C\_i$ and $C\_o$ represent the numbers of input and output channels, respectively, $K$ represents the size of the convolution kernel, and $H$ and $W$ represent the height and width of the output feature map. We randomly select an image from the Set14 dataset with a resolution of 528 × 656 as the test image. We input the image into each reconstruction model and calculate the computational complexity required by the convolution layers in each model according to Equation (12). We also record the inference time and reconstruction quality in Table 4. In terms of inference time, MSFN-S takes only 31 ms to process the test image, while the IMDN and CARN models need 37 ms and 62 ms, respectively. In terms of image reconstruction quality, MSFN-S achieves the highest image quality, with a PSNR of 26.67 dB. Therefore, the MSFN-S model is more efficient than the other three models in terms of inference speed and reconstruction capability.

**Table 4.** Complexity of five networks for SISR.


#### *4.4. Qualitative Visual Analysis*

Since quantitative indicators such as PSNR and SSIM do not capture the continuity of local details and cannot fully reflect image quality, we also analyze the reconstructed images of each model visually. We use img005 from the Set14 dataset, img019 from the BSD100 dataset, img026 from the Urban100 dataset, and img093 from the Manga109 dataset, and show the results in Figure 9. The SRCNN, DRRN, MemNet, and LESRCNN models reconstruct edge information poorly and lack clear lines. For example, in their reconstructions of img005 the edge lines of the headwear are blurry and the contours of small objects are not restored well, whereas the MSFN reconstruction has clearer lines. In the reconstruction of img019, MSFN restores the details of the bifurcation in the upper left corner of the original image better, while models such as CARN do not. Compared with IMDN and other models, MSFN is better at recovering key information from the original image. The original img093 contains a black spot at the lower left corner of the eye. Only the MSFN model preserves the continuity of both the global information and the local details of the original image, so it reconstructs this black spot more faithfully. Overall, the visual comparison indicates that the MSFN model improves on the existing models in image reconstruction quality.

To check these subjective judgments of the reconstruction methods more rigorously, we designed an image sharpness questionnaire. Respondents scored the sharpness of each model's reconstruction according to their subjective impression and, given the original image, selected the best restored image. A total of 108 valid questionnaires were collected, and the final results are shown in Figure 10. The *y*-axis of the line graph in Figure 10 is the score that respondents gave to the sharpness of the reconstructed image. Scores range from 0 to 10, with higher scores indicating that respondents perceived the image as clearer. The *y*-axis of the bar chart is the frequency, that is, the number of times respondents selected the reconstructed image as the best restored image. Figure 10c shows the subjective analysis of the reconstructions of img093. The sharpness scores of the MSFN and MSFN-S reconstructions are higher than those of the other algorithms. Since MSFN better restores the eye details of img093, such as the outline of the eye and the black spot, the largest number of respondents judged the MSFN reconstruction to be closest to the original image. Figure 10b,d show that respondents also found the MSFN reconstructions more realistic and sharper. Therefore, from the perspective of subjective visual assessment, we conclude that the MSFN model reconstructs images better and that its reconstructions contain more local details.
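The two quantities plotted in Figure 10 could be aggregated from raw questionnaire responses roughly as follows. This is only a sketch: the response records are hypothetical toy data, not the survey results, and it assumes that the plotted score is the mean over respondents, which the text does not state explicitly. The function name `summarize` and the record layout are illustrative.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy responses (NOT the survey data): each respondent scores
# every model from 0 to 10 and names the reconstruction they consider best.
responses = [
    {"scores": {"CARN": 6, "IMDN": 7, "MSFN": 8}, "best": "MSFN"},
    {"scores": {"CARN": 5, "IMDN": 7, "MSFN": 9}, "best": "MSFN"},
    {"scores": {"CARN": 7, "IMDN": 6, "MSFN": 7}, "best": "CARN"},
]


def summarize(responses):
    """Return (mean score per model, best-image selection count per model)."""
    models = sorted(responses[0]["scores"])
    # Line graph: assumed to be the average sharpness score per model.
    mean_scores = {m: mean(r["scores"][m] for r in responses) for m in models}
    # Bar chart: how often each model was picked as the best restored image.
    best_counts = Counter(r["best"] for r in responses)
    frequency = {m: best_counts.get(m, 0) for m in models}
    return mean_scores, frequency


mean_scores, frequency = summarize(responses)
print(mean_scores)
print(frequency)
```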

**Figure 9.** Comparison of reconstructed HR images of img005, img019, img026, and img093 by different SR algorithms with the scale factor ×4.

**Figure 10.** Subjective analysis of different reconstructed images. (**a**) Subjective analysis of reconstruction results of img005; (**b**) subjective analysis of reconstruction results of img019; (**c**) subjective analysis of reconstruction results of img093; (**d**) subjective analysis of reconstruction results of img026.

#### **5. Conclusions**

We propose a lightweight image reconstruction network based on multi-scale local interaction and a global fusion mechanism. The network uses filters of different sizes to capture the interactions and correlations among different regions surrounding the same pixel, so that convolution kernels at the same level have receptive fields of different sizes and retain the rich spatial information of the original image with fewer parameters. As a result, our proposed model is superior to other image super-resolution (SR) models of the same level in both subjective visual quality and quantitative indicators. Although this paper has verified the effectiveness of the proposed method, we will study other applications, such as image denoising and blur reduction, in future work. In addition, the proposed method has only been applied to models with magnification factors of 2, 3, and 4, whereas a customizable magnification factor is very important in practical application scenarios. Support for arbitrary magnification factors therefore needs further study.

**Author Contributions:** Conceptualization, Z.M. and J.Z.; methodology, Z.M. and X.L.; investigation, L.Z.; writing—original draft preparation, J.Z. and L.Z.; writing—review and editing, X.L. and L.Z.; supervision, Z.M.; funding acquisition, Z.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (No. 11871434).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

