**3. Proposed Method**

Although more complicated deep neural network models have demonstrated better SR performance than conventional methods, they are difficult to implement on low-complexity, low-power, and low-memory devices because of the massive number of network parameters and convolution operations of deeper and denser networks. In the case of SR-DenseNet, it is difficult to apply the model to real-time processing applications even though its SR performance is superior to that of other neural network models. To address this issue, we considered two lightweight network structures at the cost of an unnoticeable quality degradation compared to SR-DenseNet. Like SR-DenseNet, the two proposed lightweight neural networks for SISR are designed to upscale the input image by a factor of four. First, SR-ILLNN learns feature maps derived from both low-resolution and interpolated low-resolution images. Second, SR-SLNN uses only the low-resolution feature layers of SR-ILLNN to further reduce the network complexity.

#### *3.1. Architecture of SR-ILLNN*

Figure 2 shows the proposed SR-ILLNN, which consists of two inputs, 15 convolution layers, and two deconvolution layers. The two inputs are a low-resolution (LR) image *XLR* and a bi-cubic interpolated low-resolution (ILR) image *XILR*, where N and M denote the width and height of the input image *XLR*, respectively. We deployed two inputs in order to complement the dense LR features of SR-DenseNet with high-resolution (HR) features of *XILR*, while reducing the number of convolution layers as much as possible. As depicted in Figure 2, the network consists of three parts: the LR feature layers from convolution layer 1 (Conv1) to Conv8, the HR feature layers from Conv9 to Conv12, and the shared feature layers from Conv13 to Conv15.

Each convolution operation is performed as in (1), where *Wi*, *Bi*, and '⊗' represent the kernels, biases, and convolution operation of the *i*th layer, respectively. In this paper, we denote a kernel as [*Fw* × *Fh* × *Fc*], where *Fw* × *Fh* and *Fc* are the spatial size of the filter and the number of channels, respectively:

$$F\_i(X\_{LR}) = \max(0, W\_i \otimes F\_{i-1}(X\_{LR}) + B\_i),\tag{1}$$

**Figure 2.** The framework of the proposed SR-ILLNN.

In the process of the convolution operation, we applied the rectified linear unit (ReLU, *max*(0, *x*)) to the filter responses and used a partial convolution-based padding scheme [23] to avoid the loss of boundary information. The padding size is defined so that the feature maps of different convolution layers have the same spatial resolution, as follows:

$$\text{Padding Size} = \text{Floor}(F\_w/2),\tag{2}$$

where Floor(x) denotes the rounding-down operation. Note that the convolution layers Conv1–4 and Conv9–12 in Figure 2 generate output feature maps with dense connections for more flexible and richer feature representations; that is, the feature maps of the previous layers are concatenated with those of the current layer. Thus, the convolution operations with dense connections are calculated as in (3):

$$\begin{aligned} F\_i(X\_{LR}) &= \max(0, W\_i \otimes [F\_1(X\_{LR}), \dots, F\_{i-1}(X\_{LR})] + B\_i), \\ F\_j(X\_{ILR}) &= \max(0, W\_j \otimes [F\_0(X\_{ILR}), \dots, F\_{j-1}(X\_{ILR})] + B\_j), \end{aligned}\tag{3}$$
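For illustration, a minimal PyTorch-style sketch of a densely connected convolution block following (1)–(3) is given below; the layer count and channel widths are placeholders rather than the values of Table 1, and the partial convolution-based padding of [23] is approximated here by ordinary zero padding of size Floor(*Fw*/2).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseConvBlock(nn.Module):
    """Densely connected convolutions as in (3): each layer receives the
    concatenation of all preceding feature maps (illustrative widths)."""
    def __init__(self, in_channels=16, growth=16, num_layers=4, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2                 # Padding Size = Floor(Fw / 2), Eq. (2)
        self.convs = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.convs.append(nn.Conv2d(channels, growth, kernel_size, padding=pad))
            channels += growth                 # inputs grow through concatenation

    def forward(self, x):
        features = [x]
        for conv in self.convs:
            # Eq. (1)/(3): ReLU(W ⊗ [F_1, ..., F_{i-1}] + B)
            out = F.relu(conv(torch.cat(features, dim=1)))
            features.append(out)
        return torch.cat(features, dim=1)
```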

A ResNet scheme [11] with skip connections can provide high-speed training and prevent the vanishing gradient problem, so we deployed a local and a global skip connection to train the residuals at the output feature maps of Conv4 and Conv15. Because the output feature maps *F*4 and *XLR* have different numbers of channels in the local skip connection, *XLR* is copied up to the number of channels of *F*4 before Conv5 is operated.
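A minimal sketch of this local skip connection, assuming a single-channel input; the tensor shapes are illustrative only.

```python
import torch

def local_skip(x_lr: torch.Tensor, f4: torch.Tensor) -> torch.Tensor:
    """Add X_LR to F_4 by repeating it along the channel axis (illustrative).

    x_lr: (B, 1, H, W) single-channel LR input
    f4:   (B, C, H, W) output feature maps of Conv4
    """
    x_rep = x_lr.repeat(1, f4.size(1), 1, 1)   # copy X_LR up to C channels
    return f4 + x_rep                          # residual fed into Conv5
```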

It should be noted that the number of feature maps has a strong effect on the inference speed. Therefore, the proposed SR-LNNs are designed to reduce the number of feature maps from 192 to 32 before the deconvolution operations. Then, Deconv1 and Deconv2 are operated for image up-sampling as follows:

$$F\_{deconv}(X\_{LR}) = \max(0, W\_{deconv} \circledast F\_{i-1}(X\_{LR}) + B\_{deconv}),\tag{4}$$

where *Wdeconv* and *Bdeconv* are the kernels and biases of the deconvolution layer, respectively, and the symbol '⊛' denotes the deconvolution operation. As each deconvolution layer learns its own kernel weights and biases, it is superior to conventional SR methods such as pixel-wise interpolation.
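As a sketch, the learned up-sampling in (4) can be realized with a transposed convolution; the kernel size, stride, and channel width below are assumptions chosen so that each layer doubles the spatial resolution (two such layers give the ×4 factor), not the exact values of Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One learned 2x up-sampling step as in (4); two of them yield the x4 factor.
# kernel_size=4, stride=2, padding=1, and 32 channels are illustrative only.
deconv = nn.ConvTranspose2d(in_channels=32, out_channels=32,
                            kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 32, 24, 24)   # e.g., 32 feature maps of a 24x24 LR patch
y = F.relu(deconv(x))            # ReLU(W_deconv ⊛ F_{i-1} + B_deconv)
print(y.shape)                   # torch.Size([1, 32, 48, 48])
```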

In the stage of the shared feature layers, the output feature maps of the LR feature layers *F*8(*XLR*) are concatenated with those of the HR feature layers *F*12(*XILR*). Then, the concatenated feature maps [*F*8(*XLR*), *F*12(*XILR*)] are input to the shared feature layers as in (5):

$$F\_{13}(X) = \max(0, W\_{13} \otimes [F\_8(X\_{LR}), F\_{12}(X\_{ILR})] + B\_{13}),\tag{5}$$
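A minimal sketch of (5), assuming illustrative channel widths and that the two branch outputs have already been brought to the same spatial resolution, as in Figure 2:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative widths; the actual configuration is given in Table 1.
conv13 = nn.Conv2d(in_channels=32 + 32, out_channels=32, kernel_size=3, padding=1)

def shared_entry(f8_lr: torch.Tensor, f12_ilr: torch.Tensor) -> torch.Tensor:
    """Eq. (5): F_13 = ReLU(W_13 ⊗ [F_8(X_LR), F_12(X_ILR)] + B_13)."""
    return F.relu(conv13(torch.cat([f8_lr, f12_ilr], dim=1)))
```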

Note that the activation function (ReLU) is not applied to the output feature map of the last convolution layer, Conv15. Table 1 presents the structural analysis of the network parameters in SR-ILLNN.


**Table 1.** Analysis of network parameters in SR-ILLNN.

#### *3.2. Architecture of SR-SLNN*

Because SR-ILLNN has a hierarchical network structure due to its two inputs, we propose the SR-SLNN to reduce the network complexity of SR-ILLNN. As depicted in Figure 3, SR-SLNN removes the HR feature layers and the shared feature layers of SR-ILLNN. In addition, it has seven convolution layers and two deconvolution layers, and the two convolution layers between the deconvolution layers are also removed. Table 2 presents the structural analysis of the network parameters in SR-SLNN.
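For reference, a compact structural skeleton of this single-input design (seven convolution layers followed by two deconvolution layers) might look as follows; the channel widths and kernel sizes are placeholders, the dense and skip connections of Figure 3 are omitted for brevity, and the actual configuration is listed in Table 2.

```python
import torch.nn as nn

def build_sr_slnn_skeleton(width=32):
    """Structural sketch only: 7 convolutions + 2 transposed convolutions (x4 total)."""
    layers, in_ch = [], 1
    for _ in range(7):                                          # Conv1-Conv7
        layers += [nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU()]
        in_ch = width
    layers += [nn.ConvTranspose2d(width, width, 4, 2, 1), nn.ReLU()]  # Deconv1, x2
    layers += [nn.ConvTranspose2d(width, 1, 4, 2, 1)]                 # Deconv2, x2 -> x4
    return nn.Sequential(*layers)
```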

**Figure 3.** The framework of the proposed SR-SLNN.


**Table 2.** Analysis of network parameters in SR-SLNN.

#### *3.3. Loss Function and Hyper-Parameters*

We set the hyper-parameters as presented in Table 3. In order to find the optimal parameters *θ* = {filter weights, biases}, we defined the mean square error (MSE) as the loss function (6), where *XHR*, *Y*, and *N* are the final output image of SR-LNN, the corresponding label image, and the batch size, respectively. Here, *L*(*θ*) is minimized by the Adam optimizer using back-propagation. In particular, the number of epochs was set to 50 according to the Peak Signal-to-Noise Ratio (PSNR) variations of the validation set (Set5), and the learning rates were empirically assigned to intervals of epochs.

Since it is important to set the optimal number of network parameters in the design of a lightweight neural network, we investigated the relation between the number of parameters and PSNR for various filter sizes. As measured in Table 4, we implemented most of the convolution layers with a 3 × 3 filter size, except for the deconvolution layers, which generate interpolated feature maps that accurately correspond to the scaling factor:

$$L(\theta) = \frac{1}{N} \sum\_{i=0}^{N-1} \|X\_{HR}(X^i) - Y^i\|\_2^2,\tag{6}$$
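A minimal training-step sketch for minimizing (6) with the Adam optimizer; the model, data loader, and learning rate are placeholders, and the actual hyper-parameter values follow Table 3.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, lr=1e-4, device="cpu"):
    """One epoch of minimizing L(theta) in (6) by Adam with back-propagation."""
    criterion = nn.MSELoss()                              # mean square error, Eq. (6)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for batch in loader:
        *inputs, y_hr = [t.to(device) for t in batch]     # (X_LR, X_ILR, Y) for SR-ILLNN,
        pred = model(*inputs)                             # (X_LR, Y) for SR-SLNN
        loss = criterion(pred, y_hr)
        optimizer.zero_grad()
        loss.backward()                                   # back-propagation
        optimizer.step()
```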

**Table 3.** Hyper-parameters of the proposed methods.


**Table 4.** Relation between the number of parameters and PSNR according to the various filter sizes.
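To make the filter-size trade-off concrete, the parameter count of a single convolution layer is *Fw* × *Fh* × *Fc* × (number of output feature maps) plus one bias per output feature map; the small helper below evaluates this for a few candidate filter sizes, with the channel counts chosen only for illustration rather than taken from Table 1.

```python
def conv_params(fw: int, fh: int, c_in: int, c_out: int) -> int:
    """Weights plus biases of one convolution layer."""
    return fw * fh * c_in * c_out + c_out

# Illustrative comparison (channel counts are placeholders):
for k in (3, 5, 7):
    print(k, conv_params(k, k, 32, 32))   # 3 -> 9248, 5 -> 25632, 7 -> 50208
```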

