Article

Image Upscaling with Deep Machine Learning for Energy-Efficient Data Communications

by Nathaniel Tovar 1,†, Sean (Seok-Chul) Kwon 1,† and Jinseong Jeong 2,*
1 Department of Electrical Engineering, California State University, Long Beach, CA 90840, USA
2 School of Electrical and Computer Engineering, University of Seoul, Seoul 02504, Republic of Korea
* Author to whom correspondence should be addressed.
† These authors are co-first authors contributing equally to this work.
Electronics 2023, 12(3), 689; https://doi.org/10.3390/electronics12030689
Submission received: 20 December 2022 / Revised: 18 January 2023 / Accepted: 21 January 2023 / Published: 30 January 2023
(This article belongs to the Special Issue Energy-Aware and Efficient Computing and Communications, Volume II)

Abstract:
Advanced image quality enhancement algorithms have recently attracted substantial attention due to the successful business model of video streaming services. Extremely high image quality in video streaming demands a significant increase in the transmit data rate. In turn, the required ultrahigh data rate saturates the video streaming service network if no remedy is applied. Compression algorithms have contributed to the energy-efficient transmission of data; however, they have almost reached their upper bound. Meanwhile, user demand for ultrahigh image quality keeps increasing, and minimizing the amount of transmitted data is essential for energy-efficient communications. Therefore, to improve energy efficiency, we propose to decrease the image resolution at the transmitter (Tx) and upscale the image at the receiver (Rx). However, standard upscaling does not yield ultrahigh-quality images. Deep machine learning enables image super-resolution, but at the cost of enormous time and resources at the user end, which makes it inappropriate for real-time applications. With this motivation, this paper proposes deep machine learning-based real-time image super-resolution with a residual neural network that runs on the resources prevalent at the user end. The proposed scheme provides better quality than conventional image upscaling such as interpolation. Comprehensive simulations verify that our scheme, utilizing a seven-layer residual neural network, substantially outperforms the conventional methods.

1. Introduction

Seamless high-rate data transmission is indispensable to support ultrahigh-quality real-time video streaming. However, seamless ultrahigh-quality video streaming is challenging when the internet service is unstable. Even in a well-organized communication network, severe user saturation degrades the achievable data rate across the network and, thus, destabilizes ultrahigh-quality video streaming. In particular, video streaming on smartphones suffers serious limitations in data rate and the concomitant quality of service. Users with insufficient bandwidth have no option but to receive low-quality video and upscale it to the native viewing resolution. In turn, the upscaling at the user end must be accomplished promptly with limited resources, without significantly affecting the overall performance of the user equipment (UE).
The image upscaling algorithm must be carefully selected to reach the best performance in a given scenario; hence, the selection needs to be based on the comparison of reasonable metrics. In particular, utilizing deep machine learning requires training a deep machine learning network, and the training process must be optimized considering the trade-off between complexity and performance in terms of image quality. On the other hand, it is challenging to standardize objective metrics for evaluating image quality, since the final evaluation of image quality is, inherently, a subjective process.
It is crucial to establish objective metrics that are closely correlated with the subjective perception of image quality, since human-perception-based evaluation requires significant effort and time. Several objective metrics have been proposed and utilized with a strong correlation to subjective human cognition. The objective metrics are categorized into three types: full-reference, no-reference, and reduced-reference metrics. Full-reference metrics are feasible in the scenario where the original image is provided. In contrast, no-reference metrics are utilized in the opposite scenario, i.e., in the case of no original image. Lastly, reduced-reference metrics are useful in the situation where only part of the original image is available [1].
The most representative full-reference metrics are mean squared error (MSE) and peak signal-to-noise ratio (PSNR), and they are widely used to evaluate the performance of image processing. It is noteworthy that MSE and PSNR accurately convey how close the original and predicted images are to each other. However, they are not strongly correlated with the subjective judgment of individuals, because human perception is affected more by some visual components, such as intensity and structure, than by others, such as color changes [1]. With this recognition, sensitivity-based image quality estimation was introduced [2].
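For concreteness, MSE and PSNR can be computed directly from pixel values. The minimal sketch below (our illustration, not code from the paper) assumes 8-bit grey-scale images stored as NumPy arrays:

```python
import numpy as np

def mse(reference: np.ndarray, test: np.ndarray) -> float:
    """Mean squared error between two images of the same shape."""
    diff = reference.astype(np.float64) - test.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(reference: np.ndarray, test: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    error = mse(reference, test)
    return float("inf") if error == 0 else 10.0 * np.log10(max_value ** 2 / error)
```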
The sensitivity-based evaluation of image quality primarily depends on the structural similarity between a reference image and the processed image. Structural features are extracted from both images and compared to each other in a fashion analogous to MSE-based evaluation. It is worth emphasizing that the most representative metrics indicating image quality in this category are the structural similarity (SSIM) and multi-scale structural similarity (MSSIM) [3]. The disadvantages of those metrics are that a reference image is required, that only a single type of distortion can be effectively reflected in the metric, and that computing structural similarity is limited to grey-scale images. Nonetheless, SSIM and MSSIM are regarded as the most appropriate metrics for sensitivity-based evaluation, and these disadvantages can be overcome by adopting new algorithms for evaluating image quality [4,5].
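In practice, SSIM can be computed with an off-the-shelf implementation such as the one in scikit-image; the sketch below uses random placeholder arrays in place of the actual reference and processed images:

```python
import numpy as np
from skimage.metrics import structural_similarity

# Placeholder grey-scale images; in the actual pipeline these would be the
# reference segment and the upscaled segment.
reference = np.random.randint(0, 256, (120, 120), dtype=np.uint8)
processed = np.random.randint(0, 256, (120, 120), dtype=np.uint8)

score = structural_similarity(reference, processed, data_range=255)
print(f"SSIM: {score:.4f}")  # 1.0 indicates structurally identical images
```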
Simple upscaling methods, such as nearest-neighbor, bilinear, and bicubic interpolation, can upscale low-resolution images in an efficient and timely manner. However, they do not achieve reliable high-quality outputs. Downscaling an image causes the loss of information, and thus, simple upscaling cannot restore it. More complex interpolation methods have been proposed to generate higher-quality images than conventional interpolation methods. One of the methods is the directional interpolation of images based on visual properties and rank order filtering [6]. In this method, a gradient filter is applied to detect edges, corners, and streaks such that the interpolation does not cause incorrect restoration. Those components are separately interpolated utilizing one-dimensional FIR filters in the direction of those structures. The rest of the image is recovered by adopting simple linear interpolation [6].
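As a point of reference, the basic methods named at the start of this paragraph are one-line calls in common image libraries; the sketch below uses OpenCV with a hypothetical input file name:

```python
import cv2

low_res = cv2.imread("low_res.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
scale = 6  # e.g., 20 x 20 -> 120 x 120, as in the experiments later in this paper
size = (low_res.shape[1] * scale, low_res.shape[0] * scale)

nearest = cv2.resize(low_res, size, interpolation=cv2.INTER_NEAREST)
bilinear = cv2.resize(low_res, size, interpolation=cv2.INTER_LINEAR)
bicubic = cv2.resize(low_res, size, interpolation=cv2.INTER_CUBIC)
```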
The proposed method in this paper utilizes deep machine learning and has the following benefits over other conventional methods of real-time upscaling.
  • Wide-scale availability to the image transmission and processing system with a prevalent CPU at the UE;
  • Optimized performance for the individual UE;
  • Significant improvement over standard interpolation methods while maintaining multitasking at the UE.
Several research works introduce different complex interpolation methods to compensate for the shortcomings of conventional interpolation, in particular in high-frequency regions [7,8,9,10,11]. In general, all those methods apply relatively simple interpolation to low-frequency regions and complex interpolation to high-frequency regions. It is noteworthy that crucial errors tend to occur in regions where pixel values change rapidly, and it is still challenging to satisfactorily interpolate high-frequency regions although those methods yield benefits. A detailed explanation of the differences among interpolation methods is omitted here since it is beyond the scope of this paper.
Advanced post-processing methods attract substantial attention since interpolation methods alone, no matter how complex, cannot reliably recover accurate high-resolution pixel values. One of the most representative and least complex post-processing methods is iterative back-projection (IBP). IBP exploits multiple low-resolution images of a given object with slight shifts and rotations; a high-resolution image is created from those multiple low-resolution images based on the shift values. That is, IBP inserts real pixel values derived from shifted versions of the same image rather than employing interpolation, which assumes a mathematical relationship between adjacent pixels [12]. The IBP process has seen various improvements over the years [13,14], whereas the core concept of using multiple variations of low-resolution images to create a high-resolution image remains the same.
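To make the back-projection idea concrete, the following simplified single-frame sketch (our illustration; the original IBP in [12] uses multiple shifted and rotated low-resolution observations) iteratively corrects a high-resolution estimate by the upscaled residual between the observed low-resolution image and the downsampled estimate:

```python
import cv2
import numpy as np

def back_project(low_res: np.ndarray, scale: int, iterations: int = 10) -> np.ndarray:
    """Simplified single-frame back-projection; the multi-frame IBP generalizes this."""
    hr_size = (low_res.shape[1] * scale, low_res.shape[0] * scale)
    lr_size = (low_res.shape[1], low_res.shape[0])
    observed = low_res.astype(np.float32)
    estimate = cv2.resize(observed, hr_size, interpolation=cv2.INTER_CUBIC)
    for _ in range(iterations):
        # Simulate the low-resolution observation from the current estimate.
        simulated = cv2.resize(estimate, lr_size, interpolation=cv2.INTER_AREA)
        # Back-project the low-resolution error into the high-resolution estimate.
        estimate += cv2.resize(observed - simulated, hr_size, interpolation=cv2.INTER_CUBIC)
    return np.clip(estimate, 0, 255)
```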
Another advanced post-processing method first upscales a low-resolution image via a simple interpolation method and then corrects artifacts along the edges of the image. One notable implementation of this method uses total variation regularization along edges to flatten out artifacts caused by bicubic interpolation. Total variation regularization minimizes the total variation over pixels while still allowing small changes to be made; that is, the method smooths changes in pixel values while retaining edges [15].

2. Deep Machine Learning Method and Neural Network Architecture

2.1. Deep Machine Learning Neural Network for Image Upscaling

With the advent of enormous image databases and powerful graphics processors, artificial intelligence (AI)-based super-resolution solutions have become more popular and have yielded better results than before. An essential advantage of a deep machine learning approach is that the complexity of the algorithm can be adjusted primarily by the amount of time given to the system for training the model and the waiting time the system endures. Deep machine learning approaches in image upscaling can be divided into two categories: multiple image super-resolution (MISR) and single image super-resolution (SISR). As the names imply, MISR uses multiple different variations of the same image to achieve super-resolution (similar to IBP); meanwhile, SISR only uses a single image.
Various machine learning architectures have been proposed, and they can be applied to the problem of image super-resolution [16,17,18,19,20,21]. Among them, the convolutional neural network (CNN) is regarded as the most widely applicable and best-suited architecture. CNNs are naturally suited to computer vision tasks since they interact directly with two-dimensional data and thus preserve spatial relationships.
With this motivation, a variety of CNN implementations have been developed for SISR. Residual neural networks, network in network (NiN), skip connections, and 1 × 1 CNNs are of particular interest [22,23,24,25]. A residual neural network applies ordinary bicubic interpolation to the input image and adds the output of its neural network to that image, as illustrated in Figure 1. Therefore, the network learns to correct errors in bicubic upscaling instead of being trained to perform upscaling directly. It is worth mentioning that the neural network does not need to be very complex since the performance of bicubic upscaling is already satisfactory.
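A minimal sketch of this idea in TensorFlow/Keras is shown below; the layer widths are illustrative assumptions, not the configuration selected later in this paper. The network only learns the correction that is added to the fixed bicubic branch:

```python
import tensorflow as tf

def residual_upscaler(scale: int = 6) -> tf.keras.Model:
    low_res = tf.keras.Input(shape=(None, None, 1))
    # Fixed (non-learned) bicubic upscaling branch.
    bicubic = tf.keras.layers.Lambda(
        lambda x: tf.image.resize(x, tf.shape(x)[1:3] * scale, method="bicubic"))(low_res)
    # Learned correction branch operating on the low-resolution input.
    features = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(low_res)
    correction = tf.keras.layers.Conv2D(scale * scale, 3, padding="same")(features)
    correction = tf.keras.layers.Lambda(lambda x: tf.nn.depth_to_space(x, scale))(correction)
    # The output is the bicubic image plus the learned correction.
    return tf.keras.Model(low_res, tf.keras.layers.Add()([bicubic, correction]))
```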
As the name implies, NiN is an architecture that embeds multiple independent neural networks in a single algorithm, as illustrated by the example in Figure 2. A single output layer in Figure 2 is fed into another network in a consecutive manner. The advantage of the NiN architecture is that multiple specialized sections of the neural network can be realized. In the context of the CNN, NiN is typically used to extract more abstract information from feature maps [25].
Figure 3 depicts an example of the skip connections architecture, which uses concatenation to combine the outputs of multiple layers into the input of the output layer. As in other neural network architectures, the layers are connected in a consecutive manner; however, the architecture also contains skipped connections, e.g., a direct connection from the first layer to the concatenation layer that bypasses the second and third layers. Several types of skip connection architectures have been proposed, but the fundamental types are short and long skip connections. Both serve to convey different levels of information to deeper layers so that those layers have diverse information to work with. Skip connections also provide the neural network with alternate paths for backpropagation, which helps avoid the vanishing-gradient scenario. In more technical terms, skip connections can reduce elimination, overlap, and linear independence singularities [26].
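The following fragment sketches a long skip connection of the kind shown in Figure 3, with illustrative layer sizes; the first layer's output bypasses the intermediate layers and is concatenated before the output layer:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(20, 20, 1))
layer1 = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
layer2 = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(layer1)
layer3 = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(layer2)
# Skip connection: layer1's output is carried directly to the concatenation.
merged = tf.keras.layers.Concatenate()([layer1, layer3])
outputs = tf.keras.layers.Conv2D(1, 3, padding="same")(merged)
model = tf.keras.Model(inputs, outputs)
```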
Lastly, 1 × 1 CNNs are used to take multiple inputs and produce a single output. This is also called a cross-channel parametric pooling layer. The output is essentially a weighted sum of all of the inputs. This is useful in reducing the complexity of a network and facilitating NiN implementation.
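As a small illustration, a 1 × 1 convolution that reduces 36 feature maps to one map is simply a learned per-pixel weighted sum across channels:

```python
import tensorflow as tf

feature_maps = tf.keras.Input(shape=(20, 20, 36))
# Cross-channel parametric pooling: one learned weight per input channel plus a bias.
pooled = tf.keras.layers.Conv2D(filters=1, kernel_size=1)(feature_maps)
model = tf.keras.Model(feature_maps, pooled)
model.summary()  # 37 trainable parameters: 36 weights + 1 bias
```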

2.2. Real-Time Image Super-Resolution

The disadvantage of deep machine learning neural networks is their high complexity, i.e., the long computation time needed to generate an output even once the network has been trained. In real-time image super-resolution (RTISR), this is the primary issue to be resolved. For instance, RTISR for video at 24 frames per second (FPS) demands the generation of an image roughly every 41 ms; meanwhile, a CNN can take more than 200 ms to upscale an image even when it is carefully designed to be as efficient and lightweight as possible [27]. This long upscaling time is due to the substantial difference in size and quality between the input and output images: RTISR output images are large and of high quality, while the inputs are small, low-resolution images.
The complexity of RTISR depends not only on the input image size but also on the filter size. In particular, the input image size significantly affects the overall computation load. Applying a single 3 × 3 filter to a 1080p image without skipping any pixels requires 18,662,400 multiplications and additions [27], and applying a 5 × 5 filter demands 51,840,000 multiplications and additions. On the other hand, a 3 × 3 filter can be applied to a 540p image (the resolution used in [27]) without skipping any pixels via 4,665,600 multiplications and additions, and a 180p image would only require 518,400 multiplications and additions for a 3 × 3 filter.
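These per-frame counts follow from one multiplication (and addition) per filter tap per output pixel with no pixels skipped, as the short check below reproduces:

```python
# One multiply-add per filter tap per output pixel, stride 1, no pixels skipped.
resolutions = {"1080p": (1920, 1080), "540p": (960, 540), "180p": (320, 180)}
filter_taps = {"3x3": 9, "5x5": 25}

for res_name, (width, height) in resolutions.items():
    for filt_name, taps in filter_taps.items():
        print(f"{res_name}, {filt_name}: {width * height * taps:,}")
# e.g., 1080p/3x3: 18,662,400; 1080p/5x5: 51,840,000; 540p/3x3: 4,665,600; 180p/3x3: 518,400
```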
In this paper, we carefully investigate the trade-off between image quality and complexity, where the latter is also crucial to the performance of RTISR. Low-resolution input images and a simple 3 × 3 filter can reduce the complexity, but then the neural network cannot create high-quality images; otherwise, the neural network requires enormous processing time to reach a high-quality output, and this increase in processing time negates the benefit of the low-resolution input image and simple filter. A small filter size such as 3 × 3 may not be sufficient to analyze and enhance spatial relationships. On the other hand, applying a small filter to a low-resolution input image can estimate spatial relations over relatively long distances in the image.
Our objective is to provide high-resolution images from low-resolution input images by utilizing deep machine learning-based upscaling. The prevalent methods for performing the upscaling step are deconvolution and pixel shuffling. Deconvolution produces the best results if the process is performed gradually over several layers; in contrast, pixel shuffling requires only one layer and thus has lower computational complexity. Pixel shuffling takes multiple low-resolution input images and rearranges their pixels into a single high-resolution image.
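In TensorFlow, pixel shuffling corresponds to the depth-to-space operation; a minimal sketch for the 6× factor used later in this paper:

```python
import tensorflow as tf

r = 6                                                  # upscaling factor (20 x 20 -> 120 x 120)
feature_maps = tf.random.normal([1, 20, 20, r * r])    # 36 low-resolution feature maps
high_res = tf.nn.depth_to_space(feature_maps, r)       # rearranged to shape [1, 120, 120, 1]
print(high_res.shape)
```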
In turn, real-time super-resolution for video streaming naturally leads to video super-resolution, which can be regarded as a special type of multiple-image super-resolution where a sequence of images are temporally related to each other. Thus, video super-resolution algorithms take advantage of the additional information available to generate high-quality images. Nonetheless, it is challenging to exploit this information when a video includes significant changes across consecutive input images.
A previous high-resolution output can be utilized to improve the current frame’s high-resolution output by detecting the motion occurring between frames. In addition to this method, an edge-directed interpolation of the low-resolution image can enhance the performance of video super-resolution upscaling [28]. The aforementioned algorithm is regarded as one of the fastest video super-resolution algorithms since it takes into account only the current and previous frames [28]. Several other methods are more akin to recurrent neural networks than the algorithm in [28]. They utilize multiple previous frames to generate the current high-resolution frame [29,30].
Video super-resolution methods inherently demand intensive computation, although they can generate higher-quality images than SISR. Moreover, video super-resolution methods must take into account multiple images rather than a single image. Hence, they are not appropriate for the system model in this paper, i.e., RTISR.
Researchers in the area of image processing have recently paid substantial attention to RTISR; however, the efforts to develop diverse RTISR schemes are not yet satisfactory and need to be fostered. For instance, several efficient neural network models have been proposed, and hardware-specific RTISR implementations have been reported [31,32,33,34]. On the other hand, further efforts are required to adjust model parameters so that complexity and efficiency match the given resource constraints. Leading companies in this area such as Nvidia and AMD conduct research to implement RTISR [35,36]; however, they generally consider the scenario of their own high-end graphics cards.

3. System Resource

The system resources prevalent in current UEs are a dominant factor affecting the performance of deep machine learning-based RTISR. This section estimates and defines the available resources of general UEs in a reasonable fashion. The most representative use cases of video streaming are real-time downloading on a computer and on a smartphone, and the latter is more prevalent than the former. Smartphones dictate the upper bound of the resources available for deep machine learning-based RTISR since they have tighter resource constraints than computers. Recent popular smartphones have at least a 2 GHz CPU with multiple cores [37,38]. Nonetheless, it is reasonable to assume that a single core is the resource available for deep machine learning-based RTISR since smartphones, in general, perform several distinct tasks in parallel across their cores.
A current smartphone simultaneously accesses cellular networks, WiFi networks, and Bluetooth connections, along with a considerable number of other devices for machine-to-machine communications. Furthermore, the smartphone performs several independent tasks and supports the complicated processes of the operating system. Considering that situation, we first assume a single CPU core fully available for deep machine learning-based image upscaling; the impact of multiple CPUs on the proposed scheme will be part of our future work.
A 2 GHz CPU can execute 2 billion cycles per second, which must cover all of the processes the CPU has to accomplish in that second, including, but not limited to, retrieving/storing data and multiple calculations. In characterizing resource limitations, a more sophisticated measure than the CPU clock speed alone is the number of floating-point operations per second (FLOPS) that the CPU can compute. Modern consumer-grade 64-bit CPUs can perform four floating-point operations per cycle, while 32-bit CPUs can perform eight per cycle [39,40]. In the scenario of a 64-bit CPU, which is the worse case in terms of FLOPS, a modern computer or smartphone can therefore be regarded as capable of 8 billion floating-point operations per second on a single 2 GHz CPU core.
We consider a video streaming system supporting 24 images per second, i.e., 24 FPS; this leaves, in general, a computational budget of roughly 333 million operations per frame [41]. Consequently, the neural network architecture and the associated deep machine learning process should not exceed this computational budget when accomplishing RTISR with a single core of a 64-bit 2 GHz CPU. Furthermore, the practical process includes complicated instructions besides pure computations; for instance, conditional statements in the simulation code, such as the "if" command, result in significantly slower execution than regular statements [41]. Thus, some of the budget must be reserved as overhead for complicated operations.
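The per-frame budget follows directly from the assumptions above, as the short calculation below reproduces:

```python
clock_hz = 2_000_000_000      # single 2 GHz core
flops_per_cycle = 4           # consumer-grade 64-bit CPU assumption [39,40]
fps = 24

peak_flops = clock_hz * flops_per_cycle   # 8 billion floating-point operations per second
per_frame_budget = peak_flops // fps      # ~333 million operations per frame
print(f"{per_frame_budget:,} FLOPs per frame")
```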

4. Methodology and Results

MSE and PSNR are not strongly correlated with the subjective judgment of individuals, as described in Section 1, since human perception is strongly affected by components such as visual intensity and structure. Figure 4 demonstrates that the subjective quality of an image is crucially affected by the type of distortion, even though the MSE, one of the representative image quality metrics, is the same in every scenario in Figure 4. On the other hand, the subjective image quality is estimated and indicated more accurately by MSSIM than by MSE or PSNR, as shown in the figure. MSSIM is regarded as a sensitivity-based image quality measurement.
The research in this paper is based on deep machine learning models in TensorFlow and the associated image upscaling simulations. The deep machine learning-based image upscaling model in TensorFlow performs image segmentation on a 180p image and then takes the pieces of the input image as the input to the network. The input images are 20 × 20 and the output images are 120 × 120; hence, the model must be run 144 times to yield a complete output image. Correspondingly, the number of calculations the proposed model can perform per segment is 2,312,500. Table 1 describes the computation cost, in terms of the number of FLOPs, for each type of layer.
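The per-segment figure follows from tiling the 320 × 180 (180p) input into 20 × 20 segments and splitting the rounded per-frame budget from Section 3 across them:

```python
frame_budget = 333_000_000              # rounded per-frame budget from Section 3
segments = (320 // 20) * (180 // 20)    # 16 x 9 = 144 segments per 180p frame
per_segment_budget = frame_budget // segments
print(segments, per_segment_budget)     # 144 segments, 2,312,500 FLOPs each
```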
Consideration of memory bandwidth will also be part of our future work, since each smartphone has a different amount of memory depending on the user's choice and each device utilizes its memory for parallel processing of multiple tasks; it is therefore difficult to determine a prevalent memory bandwidth, as it depends on the situation of each UE. Thus, our primary focus in this paper is the computation load, and the load is expressed in terms of FLOPs rather than computation time. It is noteworthy, however, that the computation time is, in general, regarded as proportional to the number of FLOPs.
The residual neural network architecture described with Figure 1 in Section 2.1 is adopted for RTISR in this paper. The reason for selecting the residual neural network architecture is that it is appropriate for a single 64-bit 2 GHz CPU core, the resource that current smartphones can practically provide while multitasking for several other processes. That is, the residual neural network applies ordinary bicubic interpolation to the input image and corrects the error in the bicubic output by adding the output of its neural network. The bicubic interpolation already provides a satisfactory level of upscaling performance; consequently, the network need not be significantly complex to be trained for the whole RTISR process.
The mandatory layers to support the considered residual network architecture are Layers 1, 6, 7, and 8 in Table 1; their combined cost of 696,400 FLOPs out of the total per-segment budget of 2,312,500 FLOPs is dedicated to those tasks. Consequently, only 1,616,100 FLOPs remain for feature extraction. Within the remaining budget, the residual neural network model can afford 41 convolutions with a 7 × 7 filter, 80 convolutions with a 5 × 5 filter, or 224 convolutions with a 3 × 3 filter. It is desirable that some layers use a large filter to accommodate broadly correlated spatial information, whereas not every layer needs a large filter. Small filters cannot accommodate large-scale spatial information; i.e., 3 × 3 filters cannot account for any spatial features larger than 3 × 3.
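The convolution counts above follow from dividing the remaining budget by the per-layer costs in Table 1:

```python
# Mandatory layers 1, 6, 7, and 8 from Table 1.
mandatory = 322_000 + 244_800 + 115_200 + 14_400      # 696,400 FLOPs
remaining = 2_312_500 - mandatory                      # 1,616,100 FLOPs for feature extraction

per_conv_cost = {"7x7": 39_200, "5x5": 20_000, "3x3": 7_200}  # from Table 1
for name, cost in per_conv_cost.items():
    print(f"{name}: up to {remaining // cost} convolutions")
# 7x7: up to 41, 5x5: up to 80, 3x3: up to 224
```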
Furthermore, it is necessary to determine the optimal depth of the neural network in deep machine learning-based RTISR to maximize the performance, as illustrated in Figure 5. A neural network with 4 layers and one with 138 layers were created, representing a wide neural network and a deep neural network, respectively. The former corresponds to the image in Figure 5e, and the latter to the image in Figure 5f. The wide neural network consists of 2 convolutions with a 7 × 7 filter and 2 convolutions with a 5 × 5 filter in the feature-extraction stage. In contrast, the deep neural network is composed of 3 convolutions with a 7 × 7 filter, 15 convolutions with a 5 × 5 filter, and 120 convolutions with a 3 × 3 filter in feature extraction. The wide network requires 1,302,400 FLOPs, while the deep network demands 1,392,000 FLOPs due to the increased number of feature maps. After training for 50 epochs, the loss function converges satisfactorily and remains stable.
The Tx downscales the original image before the transmission; the concomitant received image is Figure 5c. Comparing Figure 5c to Figure 5d–f, we can recognize that the background texture and the direction of slopes in Figure 5d–f are portrayed far better than those in Figure 5c. The SSIM of Figure 5c is substantially smaller than that of Figure 5d and can be regarded as zero.
Further, the wide neural network adopted for the image in Figure 5e outperforms the deep neural network used for the image in Figure 5f. The wide neural network model exhibits higher performance than the deep neural network model in both SSIM and MSE, even though the former has lower complexity than the latter. Moreover, the deep machine learning-based RTISR in Figure 5e,f outperforms the image upscaling by bicubic interpolation in Figure 5d. In particular, if the segment of an image is part of the background, the proposed scheme provides satisfactory performance. Consequently, the wide neural network is adopted as the system model in the simulations.
Within the category of the wide neural network, Table 2 compares the performance in terms of average MSSIM for different numbers of layers, i.e., 1 to 11 layers. In this comparison, the performance shows only insignificant differences. We perform feature extraction with 7 layers since it shows the highest average MSSIM among the wide neural networks in Table 2. Further, feature extraction with 7 layers has an average MSSIM equal to or higher than that with 8–11 layers while having lower complexity; that is, the former yields a faster training period than the latter.
On the other hand, the impact of embedding a NiN model in the overall RTISR process is also estimated, to determine whether to include a NiN model in the process. This is done by taking 120 convolutions with a 3 × 3 filter and dividing them between the feature-extraction stage and the NiN stage. Table 3 shows how the overall RTISR system performs with decreasing resources dedicated to the NiN stage. The difference in average MSSIM between different numbers of 3 × 3 filter convolutions is negligible. Moreover, even with zero convolutions in NiN, i.e., no NiN stage in effect, the average MSSIM is slightly higher than in the other cases in which a NiN stage is involved. Based on those results, a NiN stage is not included in the proposed deep machine learning-based RTISR system.
Several combinations of convolutions with 7 × 7, 5 × 5, and 3 × 3 filters in the neural network are also evaluated; some of them are listed in Table 4. For different combinations of convolutions, the average MSSIM shows a negligible difference as long as the number of layers is equal to or greater than 7. In contrast, the number of FLOPs in feature extraction substantially increases as the number of layers increases. That is, a configuration with fewer feature maps requires fewer FLOPs for feature extraction, while the average MSSIM remains almost the same for neural networks with 7 or more layers.
Based on the comprehensive simulations described in this section, we propose the deep machine learning-based RTISR system with the residual neural network model for the realization of RTISR with the resources prevalent in current UEs such as smartphones. The proposed scheme utilizes seven convolution layers composed of three 7 × 7 filters and four 5 × 5 filters; NiN is not adopted for the reason described with Table 3. The proposed scheme is applied to the input image in Figure 6c, and the image resulting from the proposed deep machine learning-based RTISR in Figure 6e shows better performance than the image in Figure 5e. It is worth mentioning that the number of layers needs to be adjusted even within the category of the wide neural network based on performance and complexity, and the proposed deep machine learning model is the 7-layer wide neural network with three 7 × 7 filters and four 5 × 5 filters.
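For illustration, a Keras sketch of the proposed 7-layer wide residual model is given below, following Figure 1 and Table 1; the number of feature maps per feature-extraction layer is an assumption here (Table 1 costs each of these layers with a single input and output map), so the exact widths used in the simulations may differ:

```python
import tensorflow as tf

def proposed_rtisr(scale: int = 6, maps: int = 1) -> tf.keras.Model:
    low_res = tf.keras.Input(shape=(20, 20, 1))
    # Table 1, Layer 1: fixed bicubic upscaling branch (20 x 20 -> 120 x 120).
    bicubic = tf.keras.layers.Lambda(
        lambda x: tf.image.resize(x, [20 * scale, 20 * scale], method="bicubic"))(low_res)
    # Feature extraction: three 7 x 7 and four 5 x 5 convolutions with PReLU.
    x = low_res
    for _ in range(3):
        x = tf.keras.layers.Conv2D(maps, 7, padding="same")(x)
        x = tf.keras.layers.PReLU(shared_axes=[1, 2])(x)
    for _ in range(4):
        x = tf.keras.layers.Conv2D(maps, 5, padding="same")(x)
        x = tf.keras.layers.PReLU(shared_axes=[1, 2])(x)
    # Table 1, Layers 6-8: 3 x 3 convolution to 36 maps, pixel shuffle, residual add.
    x = tf.keras.layers.Conv2D(scale * scale, 3, padding="same")(x)
    x = tf.keras.layers.Lambda(lambda t: tf.nn.depth_to_space(t, scale))(x)
    return tf.keras.Model(low_res, tf.keras.layers.Add()([bicubic, x]))
```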
The proposed scheme is also applied to the textured images in Figure 7a,d; our scheme remarkably outperforms conventional bicubic interpolation for image upscaling in terms of SSIM. The SSIMs of the images resulting from bicubic interpolation in Figure 7b,e are 0.0105 and 0.0077, respectively. They are significantly improved by the proposed deep machine learning-based RTISR, with SSIMs of 0.9815 and 0.9175 in Figure 7c and Figure 7f, respectively. It is noteworthy that the texture of the images is partially restored to a satisfactory level after applying the proposed scheme in Figure 7c,f.

5. Conclusions

In modern communications, video streaming is one of the prevalent application platforms that demand a significantly high data rate. Further, the data rate required for high-quality video streaming is ever-increasing, whereas energy-efficient communication technology is an essential component of future wireless communications. Consequently, seamless low- or medium-quality video can be preferable to buffered high-quality video in the scenario of data saturation on the platform. The proposed deep machine learning-based RTISR remarkably outperforms the popular bicubic interpolation-based upscaling under the constraints of prevalent modern UEs. The comprehensive simulations verify the improvement in terms of SSIM and MSE, along with visual perception.
Further, the proposed scheme provides UEs with energy-efficient computation at relatively low complexity compared to conventional deep machine learning. The NiN architecture is not required for high-performance image upscaling; thus, the proposed scheme significantly reduces the complexity of deep machine learning-based RTISR. With a residual neural network having seven layers composed of three 7 × 7 convolutional filters and four 5 × 5 convolutional filters, the RTISR process significantly improves the image quality over standard bicubic interpolation-based image upscaling.
Future work will pursue further enhancement of the performance. Several leading companies have developed high-quality real-time image upscaling algorithms for their high-end graphics cards. In the development of upscaling algorithms for general UEs, the hardware resource constraints, including multiple CPUs and memory, must be taken into account, and the complexity–performance trade-off needs to be investigated in detail.

Author Contributions

Conceptualization, N.T. and S.K.; methodology, N.T.; software, N.T.; validation, S.K. and J.J.; formal analysis, N.T. and S.K.; investigation, N.T., S.K. and J.J.; resources, N.T., S.K. and J.J.; data curation, N.T.; writing—original draft preparation, N.T., S.K. and J.J.; writing—review and editing, S.K. and J.J.; visualization, N.T. and S.K.; supervision, S.K. and J.J.; project administration, S.K. and J.J.; funding acquisition, J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 2019 Research Fund of the University of Seoul for Jinseong Jeong. In addition, this work was supported by the CSULB Foundation Fund (RS261-00181-10185) for Nathaniel Tovar and Sean Kwon.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Winkler, S. Issues in Vision Modeling for Perceptual Video Quality Assessment. Signal Process. 1999, 78, 231–252.
2. Wang, Z.; Bovik, A.; Lu, L. Why is image quality assessment so difficult? In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, USA, 13–17 May 2002; Volume 4, pp. 4–3313.
3. Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
4. Panetta, K.; Samani, A.; Agaian, S. A Robust No-Reference, No-Parameter, Transform Domain Image Quality Metric for Evaluating the Quality of Color Images. IEEE Access 2018, 6, 10979–10985.
5. Chen, B.; Li, H.; Fan, H.; Wang, S. No-reference Quality Assessment with Unsupervised Domain Adaptation. arXiv 2020, arXiv:2008.08561.
6. Algazi, V.; Ford, G.; Potharlanka, R. Directional interpolation of images based on visual properties and rank order filtering. IEEE Comput. Soc. 1991, 4, 3005–3008.
7. Morse, B.; Schwartzwald, D. Isophote-based interpolation. In Proceedings of the 1998 International Conference on Image Processing, ICIP98 (Cat. No.98CB36269), Chicago, IL, USA, 7 October 1998; Volume 3, pp. 227–231.
8. Carrato, S.; Ramponi, G.; Marsi, S. A simple edge-sensitive image interpolation filter. In Proceedings of the 3rd IEEE International Conference on Image Processing, Lausanne, Switzerland, 19 September 1996; Volume 3, pp. 711–714.
9. Lee, S.W.; Paik, J. Image Interpolation Using Adaptive Fast B-Spline Filtering. In Proceedings of the 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, MN, USA, 27–30 April 1993; Volume 5, pp. 177–180.
10. Allebach, J.; Wong, P.W. Edge-directed interpolation. In Proceedings of the 3rd IEEE International Conference on Image Processing, Lausanne, Switzerland, 16–19 September 1996; Volume 3, pp. 707–710.
11. Li, X.; Orchard, M. New edge-directed interpolation. IEEE Trans. Image Process. 2001, 10, 1521–1527.
12. Irani, M.; Peleg, S. Improving resolution by image registration. CVGIP Graph. Model. Image Process. 1991, 53, 231–239.
13. Marquina, A.; Osher, S. Image Super-Resolution by TV-Regularization and Bregman Iteration. J. Sci. Comput. 2008, 37, 367–382.
14. Numnonda, T.; Andrews, M. High resolution image reconstruction using mean field annealing. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, Ermioni, Greece, 6–8 September 1994; pp. 441–450.
15. Xu, J.; Li, M.; Fan, J.; Xie, W. Discarding jagged artefacts in image upscaling with total variation regularisation. IET Image Process. 2019, 13, 2495–2506.
16. Deng, X.; Dragotti, P.L. Deep Coupled ISTA Network for Multi-Modal Image Super-Resolution. IEEE Trans. Image Process. 2020, 29, 1683–1698.
17. Kasem, H.M.; Hung, K.W.; Jiang, J. Spatial Transformer Generative Adversarial Network for Robust Image Super-Resolution. IEEE Access 2019, 7, 182993–183009.
18. Pal, S.; Jana, S.; Parekh, R. Super-Resolution of Textual Images using Autoencoders for Text Identification. In Proceedings of the 2018 IEEE Applied Signal Processing Conference (ASPCON), Kolkata, India, 7–9 December 2018; pp. 153–157.
19. Huang, Y.; Wang, W.; Wang, L. Video Super-Resolution via Bidirectional Recurrent Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1015–1028.
20. Xue, X.; Zhang, X.; Li, H.; Wang, W. Research on GAN-based Image Super-Resolution Method. In Proceedings of the 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 27–29 June 2020; pp. 602–605.
21. Yamanaka, J.; Kuwashima, S.; Kurita, T. Fast and Accurate Image Super Resolution by Deep CNN with Skip Connection and Network in Network. arXiv 2017, arXiv:1707.05425.
22. Ghosh, V.C.A.; Thulasidharan, P.P. A Deep Neural Architecture for Image Super Resolution. In Proceedings of the 2018 International Conference on Data Science and Engineering (ICDSE), Patna, India, 26–28 September 2019; pp. 1–5.
23. Mao, X.; Shen, C.; Yang, Y. Image Denoising Using Very Deep Fully Convolutional Encoder-Decoder Networks with Symmetric Skip Connections. arXiv 2016, arXiv:1603.09056.
24. Romano, Y.; Isidoro, J.; Milanfar, P. RAISR: Rapid and Accurate Image Super Resolution. arXiv 2016, arXiv:1606.01299.
25. Lin, M.; Chen, Q.; Yan, S. Network In Network. arXiv 2013, arXiv:1312.4400.
26. Orhan, A.E. Skip Connections as Effective Symmetry-Breaking. arXiv 2017, arXiv:1701.09175.
27. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
28. Simonyan, K.; Grishin, S.; Vatolin, D.; Popov, D. Fast video super-resolution via classification. In Proceedings of the 2008 15th IEEE International Conference on Image Processing, San Diego, CA, USA, 12–15 October 2008; pp. 349–352.
29. Yu, W.; Zhang, M. Super Resolution Reconstruction of Video Images Based on Improved Glowworm Swarm Optimization Algorithm. In Proceedings of the 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC), Chongqing, China, 27–29 June 2018; pp. 331–335.
30. Su, H.; Wu, Y.; Zhou, J. Adaptive incremental video super-resolution with temporal consistency. In Proceedings of the 2011 18th IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September 2011; pp. 1149–1152.
31. Shi, B.; Tang, Z.; Luo, G.; Jiang, M. Winograd-Based Real-Time Super-Resolution System on FPGA. In Proceedings of the 2019 International Conference on Field-Programmable Technology (ICFPT), Tianjin, China, 9–13 December 2019; pp. 423–426.
32. Szydzik, T.; Callico, G.M.; Nunez, A. Efficient FPGA implementation of a high-quality super-resolution algorithm with real-time performance. IEEE Trans. Consum. Electron. 2011, 57, 664–672.
33. Chen, Q.; Sun, H.; Zhang, X.; Tao, H.; Yang, J.; Zhao, J.; Zheng, N. Algorithm and VLSI Architecture of Edge-Directed Image Upscaling for 4k Display System. IEEE Trans. Circuits Syst. Video Technol. 2016, 26, 1758–1771.
34. Rubio-Ibáñez, P.; Martínez-Álvarez, J.J.; Doménech-Asensi, G. Efficient VHDL Implementation of an Upscaling Function for Real Time Video Applications. In Proceedings of the 2019 IEEE International Symposium on Circuits and Systems (ISCAS), Sapporo, Japan, 26–29 May 2019.
35. Burnes, A. NVIDIA DLSS 2.0: A Big Leap in AI Rendering. 23 March 2020. Available online: https://www.nvidia.com/en-us/geforce/news/nvidia-dlss-2-0-a-big-leap-in-ai-rendering (accessed on 23 March 2020).
36. AMD FidelityFX Super Resolution. 2021. Available online: https://www.amd.com/en/technologies/fidelityfx-super-resolution (accessed on 1 January 2022).
37. Yordan. Top 20 Most Popular Phones in 2020. 27 December 2020. Available online: https://www.gsmarena.com/top_20_most_popular_phones_in_2020-news-46737.php (accessed on 27 December 2020).
38. Smartphone Processors Ranking. NanoReview.Net. January 2022. Available online: https://nanoreview.net/en/soc-list/rating (accessed on 1 January 2022).
39. Svets, G. CPU World. 2021. Available online: https://www.cpu-world.com/CPUs/Core_i7 (accessed on 1 January 2022).
40. Dolbeau, R. Theoretical Peak FLOPS per Instruction Set: A Tutorial. J. Supercomput. 2018, 74, 1341–1377.
41. Ostrovsky, I. Fast and Slow if-Statements: Branch Prediction in Modern Processors. 15 May 2010. Available online: http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors (accessed on 15 May 2010).
Figure 1. Residual network architecture example.
Figure 2. Network in network architecture example.
Figure 3. Illustration of skip connections neural network architecture.
Figure 4. Comparison of the same image with the same MSE = 210 but different sources of degradation: (a) original image (8 bits/pixel, cropped from 512 × 512 to 256 × 256 for visibility); (b) contrast-stretched image, MSSIM = 0.9168; (c) mean-shifted image, MSSIM = 0.9900; (d) JPEG compressed image, MSSIM = 0.6949; (e) blurred image, MSSIM = 0.7052; (f) salt-pepper impulsive noise-contaminated image, MSSIM = 0.7748.
Figure 5. Comparison between wide network and deep network: (a) original 1080p image; (b) 120 × 120 high-resolution segment; (c) 20 × 20 downsampled segment as a system input; (d) bicubic upsampled image [SSIM: 0.00816, MSE: 1514.09]; (e) wide network output [SSIM: 0.523, MSE: 57.67]; (f) deep network output [SSIM: 0.516, MSE: 58.3].
Figure 6. Comparison of deep machine learning-based RTISR with the residual neural network to conventional bicubic upscaling: (a) original 1080p image; (b) 120 × 120 high-resolution segment; (c) 20 × 20 downsampled segment as a system input; (d) bicubic upsampled image [SSIM: 0.00816, MSE: 1514.09]; (e) deep machine learning-based RTISR [SSIM: 0.5292, MSE: 56.77].
Figure 7. Simple textured images: deep machine learning-based RTISR with the residual neural network and conventional bicubic upscaling: (a) original image; (b) bicubic upsampled image of (a) [SSIM: 0.0105]; (c) deep machine learning-based RTISR of (a) [SSIM: 0.9815]; (d) original image; (e) bicubic upsampled image of (d) [SSIM: 0.0077]; (f) deep machine learning-based RTISR of (d) [SSIM: 0.9175].
Table 1. Computational cost of CNN layers.
# | Layer | Estimated Number of FLOPs
1 | Bicubic upscaling 20 × 20 to 120 × 120 | 322,000
2 | 7 × 7 convolutional layer with padding: one 20 × 20 input, one 20 × 20 output, PReLU activation function | 39,200
3 | 5 × 5 convolutional layer with padding: one 20 × 20 input, one 20 × 20 output, PReLU activation function | 20,000
4 | 3 × 3 convolutional layer with padding: one 20 × 20 input, one 20 × 20 output, PReLU activation function | 7,200
5 | 1 × 1 convolutional layer with padding: one 20 × 20 input, one 20 × 20 output, PReLU activation function | 800
6 | 3 × 3 convolutional layer with padding: one 20 × 20 input, 36 20 × 20 outputs, no activation function | 244,800
7 | Pixel shuffler 20 × 20 to 120 × 120 | 115,200
8 | Add final layer output to bicubic image | 14,400
Table 2. Average MSSIM for different numbers of layers in wide residual neural networks.
Number of Layers in Neural Network | Average MSSIM
1 | 0.8943
2 | 0.8952
3 | 0.8953
4 | 0.8951
5 | 0.8956
6 | 0.8956
7 | 0.8958
8 | 0.8958
9 | 0.8956
10 | 0.8957
11 | 0.8955
Table 3. Average MSSIM for varying numbers of convolutions with a 3 × 3 filter in the NiN stage.
Number of 3 × 3 Convolutions in NiN | Average MSSIM
60 | 0.8952
30 | 0.8956
15 | 0.8956
3 | 0.8956
0 | 0.8957
Table 4. Performance for combinations of convolutions with different numbers of 7 × 7, 5 × 5, and 3 × 3 filters and layers in the neural network.
Number of 7 × 7 Convolutions | Number of 5 × 5 Convolutions | Number of 3 × 3 Convolutions | Number of FLOPs | Average MSSIM
10 | 30 | 60 | 1,448,800 | 0.8953
3 | 45 | 60 | 1,474,400 | 0.8958
3 | 9 | 160 | 1,479,200 | 0.8959
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
