Article

Stereoscopic Image Super-Resolution Method with View Incorporation and Convolutional Neural Networks

1
Faculty of Information Science and Engineering, Ningbo University, Ningbo 315211, China
2
Intelligent Household Appliances Engineering Center, Zhejiang Business Technology Institute, Ningbo 315012, China
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2017, 7(6), 526; https://doi.org/10.3390/app7060526
Submission received: 6 March 2017 / Revised: 10 May 2017 / Accepted: 12 May 2017 / Published: 26 May 2017
(This article belongs to the Special Issue Holography and 3D Imaging: Tomorrow's Ultimate Experience)

Abstract

Super-resolution (SR) plays an important role in the processing and display of mixed-resolution (MR) stereoscopic images. Therefore, a stereoscopic image SR method based on view incorporation and convolutional neural networks (CNN) is proposed. For a given MR stereoscopic image, in which the left view is observed at full resolution while the right view is observed only at low resolution, the SR method is implemented in two stages. In the first stage, a view difference image is defined to represent the correlation between views. It is estimated by feeding the full-resolution left view and the interpolated right view into the modified CNN, which yields a high-precision view difference image. In the second stage, to incorporate the right view estimated in the first stage, a global reconstruction constraint is applied to make the estimated right view consistent with the low-resolution right view under the MR stereoscopic image observation model. Experimental results demonstrated that, compared with the SR convolutional neural network (SRCNN) method and a depth-map-based SR method, the proposed method improved the reconstructed right view quality by 0.54 dB and 1.14 dB, respectively, in Peak Signal to Noise Ratio (PSNR), and subjective evaluation also indicated that the proposed method produced better reconstructed stereoscopic images.

1. Introduction

With advancements in imaging, processing, and display technologies in recent years, stereoscopic video entertainment and communication have emerged as promising services offering novel visual user experiences such as three-dimensional (3D) television [1], free-viewpoint video [2], and video conferencing [3]. Compared with monocular images, stereoscopic images provide depth perception and engender an immersive user experience [4]. Meanwhile, the immense amount of data generated by stereoscopic imaging requires large storage and transmission capabilities and thus must be efficiently encoded and processed. According to binocular suppression theory [5], the perceived quality of stereo vision in the human visual system (HVS) is dominated by the higher-quality view. Mixed resolution (MR) stereoscopic image processing techniques are thus motivated by this binocular perception theory. Specifically, one view of the MR stereoscopic image is provided at full resolution (FR), whereas the other view is degraded by the MR stereoscopic image observation model. To decrease the amount of data while preserving the high-definition and stereo vision experience, the low-resolution (LR) view must be super-resolved to a high resolution (HR) at the decoder and display side. In recent years, MR stereoscopic imaging and processing techniques have proven to be effective approaches for stereoscopic imaging and compression [6].
Existing super-resolution (SR) methods are used to reconstruct the FR image from its LR version. These methods are mainly divided into three types: interpolation-based [7,8], reconstruction-based [9,10], and learning-based [11,12,13,14,15]. Among them, the learning-based approach has become widely used owing to its outstanding performance. Its basic idea is to establish a mapping relation between LR and HR image patches and then to find the optimal HR solution for a given LR image. Thus, to learn the common prior knowledge shared by image patches, most well-known methods adopt the learning-based strategy [16]. Chang et al. [11], for example, adopted the concept of locally linear embedding to propose an SR reconstruction method based on neighborhood embedding. Furthermore, this neighborhood embedding was generalized to a more sophisticated sparse coding formulation in Yang et al.'s work [12,13]. They determined that a linear combination of atoms from an over-complete dictionary can represent natural image patches well. Therefore, after the training process on HR and LR image patches, HR and LR dictionaries are jointly obtained. Then, according to the observed LR patch and the LR dictionary, the sparse coefficients are computed and applied to the HR dictionary to produce the final HR patches.
To improve the SR speed while maintaining SR accuracy, Timofte et al. [14] used several smaller complete dictionaries to replace the single large over-complete dictionary, thereby greatly reducing the computational cost. In recent years, with the development of deep learning, an increasing number of researchers have employed deep learning for image processing such as image classification [17], object detection [18], and image denoising [19]. Additionally, some researchers have begun establishing a depth model for SR reconstruction. Cui et al. [20] adopted stacked auto-encoders which combined the internal example-based approach to gradually upsample LR images layer by layer. Moreover, Dong et al. [15] combined dictionary learning and neural networks to establish a model of the SR convolutional neural network (SRCNN). This model showed better performance than traditional methods, such as dictionary learning and sparse coding. Furthermore, Liu et al. [21] emphasized the importance of traditional sparse representation. They integrated it into deep learning to further improve the SR results. Although the LR view of the MR stereoscopic image can be directly upsampled by these single-view SR methods, these methods do not take advantage of the correspondence between views for stereoscopic image SR.
The observed FR image in the neighboring view of a stereoscopic image can provide rich detailed information about the scene. Thus, the correlation between views has been utilized to enhance particular LR views. Garcia et al. [22] proposed an SR method that exploits depth information for MR multi-view video. On the basis of the available depth maps, their approach enhanced the observed LR image by extracting the high-frequency content from the neighboring FR view. However, the acquisition of the depth maps, which are not easy to estimate accurately, was not discussed in their work. In addition, Brust et al. [23] employed depth maps estimated in advance from the original FR stereoscopic pairs to render the LR view from other HR views. Again, the original FR stereoscopic pairs cannot be obtained at the decoder side. Therefore, none of the above methods are fully consistent with MR stereoscopic imaging and processing techniques.
Unlike existing SR methods, which require depth maps or depth estimation, we exploit the correlation between the views of the MR stereoscopic image without estimating the depth map. We propose a stereoscopic image SR method based on view incorporation and a convolutional neural network (CNN). The proposed stereoscopic image SR method is implemented in two stages. In the first stage, to establish links between views, a view difference image is defined, and a modified CNN is created to estimate a high-precision view difference image. Then, the estimated right view image is obtained by subtracting the estimated view difference image from the observed FR left view image. In the second stage, we consider that the estimated right view image should remain consistent with the LR right view image under the MR stereoscopic image observation model. Accordingly, we model the global reconstruction constraint for incorporating the right view by projecting the estimated right view image obtained in the first stage onto the solution space of the image observation model. The solution can be computed by iterative back projection [24]. The SR results demonstrate that the proposed SR method retains more details and effectively reduces ringing artifacts.
In short, the contributions of this paper are outlined as follows:
  • We combine the correlation between the views of the MR stereoscopic image without estimating the depth map;
  • We use the view difference image containing the image texture information, as well as the depth information of the stereoscopic pairs, as the input to the modified CNN, whose pooling layers and fully connected layers are removed according to the SR task;
  • We combine the high-precision view difference image estimated by the modified CNN with the global reconstruction constraint to further improve the performance of the MR stereoscopic image SR.
The remainder of this paper is organized as follows. Section 2 describes an MR stereoscopic image observation model. Then, the proposed stereoscopic image SR method is illustrated in Section 3. Experiments are given and discussed in Section 4. Section 5 presents the conclusions.

2. MR Stereoscopic Image Observation Model

As shown in Figure 1, the observed FR left view and the downsampled right view constitute the MR stereoscopic video sequence. The MR stereoscopic video coding model thus allows a large amount of data to be compressed for storage and transmission, because directly encoding FR stereoscopic videos would double the required storage and bandwidth. In practice, owing to restricted storage space and network bandwidth, this MR stereoscopic video coding model is crucial for providing a clear bitrate reduction [6]. Furthermore, to ensure stereo vision comfort, SR of the MR stereoscopic video is needed at the decoder. Therefore, to support the MR stereoscopic video coding model, we adopt an MR stereoscopic image observation model for stereoscopic image SR.
As shown in Figure 2, a stereoscopic imaging system obtains the original FR stereoscopic image pairs (with a size of N1 × N2 for each view). We assume that the left view is observed at full resolution, while the observed LR right view (with a size of M1 × M2) is degraded by blurring and downsampling operations. The degradation model is expressed by
$Y = DBX$  (1)
where Y and X denote the observed LR right view image and the original FR right view image (i.e., the unknown FR right view image), respectively. Moreover, Y and X are both in vector form with sizes of M1M2 × 1 and N1N2 × 1, respectively. D is the downsampling matrix of size M1M2 × N1N2, and B is the blurring matrix of size N1N2 × N1N2. In addition, let Z represent the observed FR left view image. The purpose of the proposed method is to recover the FR right view image X by making full use of the abundant information in the observed LR right view image Y and the observed FR left view image Z.
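To make the observation model concrete, the sketch below applies a blur followed by decimation to an FR right view. This is a minimal NumPy/SciPy illustration, not the authors' code; the Gaussian kernel, the decimation pattern, and the function name are assumptions, and any low-pass kernel could play the role of B.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def observe_lr_right_view(x_fr, scale=2, blur_sigma=1.0):
    """Apply the observation model Y = DBX of Equation (1) to an FR right view.

    x_fr       : 2-D array (N1 x N2), full-resolution right view (luma).
    scale      : downsampling factor (2 in the experiments of Section 4).
    blur_sigma : width of the assumed Gaussian blur kernel standing in for B.
    """
    blurred = gaussian_filter(x_fr, sigma=blur_sigma)   # B X
    return blurred[::scale, ::scale]                    # D (B X), size M1 x M2

# Example: a 960 x 1280 FR right view becomes a 480 x 640 LR observation.
y_lr = observe_lr_right_view(np.random.rand(960, 1280), scale=2)
```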

3. Proposed Stereoscopic Image SR Method with View Incorporation and CNN

In this paper, we focus on the estimation of the view difference image and global reconstruction constraint to solve the SR task of MR stereoscopic images, as depicted in Figure 3. The proposed method includes two main stages. The first stage is the estimation of a view difference image for employing the correlation between the left and right views by using a modified CNN. Firstly, the observed LR right view image is interpolated into FR by using a bicubic interpolation filter. Secondly, the view difference image between the observed FR left view image and interpolated right view image is defined and employed as the input to the modified CNN. Hence, the high-precision view difference image is provided as the output. Furthermore, the estimated right view image is obtained by combining the observed FR left view image with the high-precision view difference image. The second stage is the global reconstruction constraint process for incorporating the right view. By making the estimated right view image obtained in the first stage align with the LR right view according to the MR stereoscopic image observation model, we model the global reconstruction constraint by using the iterative back projection method. Finally, after the above two stages, the SR of the LR right view is obtained.

3.1. Estimation of View Difference Image with Modified CNN

The key aspect of the SR process for MR stereoscopic images is to use the correlation between views as far as possible to enhance the resolution of the LR right view. As mentioned above, the view difference image is very important for representing the correlation between views because both the image texture information and the depth information of the stereoscopic pairs [25] are included in the view difference image. Therefore, we establish a modified CNN, the pooling layers and fully connected layers of which are removed according to the SR task, to construct the high-precision view difference image, as shown in Figure 4. In addition to the input layer, the modified CNN training framework consists of three layers, in which the hidden layers [26] are the first two convolution layers, and the output layer is the third convolution layer. Given an initial view difference sub-image obtained from an FR left view training sub-image and an interpolated right view training sub-image, the first convolution layer of the modified CNN extracts a number of feature maps. Then, the second convolution layer maps these feature maps to high-precision feature vectors. Finally, the third convolution layer produces the high-precision view difference sub-image from these high-precision feature vectors.
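The structure can be summarised by the following sketch, a minimal PyTorch rendering rather than the authors' MatConvNet implementation; the class name is ours. The filter sizes and channel counts follow the 9-1-5 / 64-32 configuration reported in Section 4.1, and no padding is used, matching the training setup described there.

```python
import torch
import torch.nn as nn

class ViewDifferenceCNN(nn.Module):
    """Three convolution layers only (no pooling, no fully connected layers):
    patch extraction, non-linear mapping, and reconstruction of the
    high-precision view difference image."""

    def __init__(self, n1=64, n2=32, f1=9, f2=1, f3=5):
        super().__init__()
        self.conv1 = nn.Conv2d(1, n1, kernel_size=f1)   # feature extraction
        self.conv2 = nn.Conv2d(n1, n2, kernel_size=f2)  # non-linear mapping
        self.conv3 = nn.Conv2d(n2, 1, kernel_size=f3)   # reconstruction
        self.relu = nn.ReLU(inplace=True)

    def forward(self, i_d):
        # i_d: initial view difference image, shape (batch, 1, H, W).
        h = self.relu(self.conv1(i_d))
        h = self.relu(self.conv2(h))
        return self.conv3(h)   # high-precision view difference image
```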

3.1.1. Image Patch Extraction and Representation

The image patch extraction and representation operation extracts image patches from the view difference image and represents them as high-dimensional vectors by a number of bases. In our formulation, this is the same process as convolving the view difference image by a number of filters. These vectors obtained by convolution contain a number of feature maps that can be further mapped to finer feature vectors.
Generally, the view difference image is defined by:
$I_d(x, y) = Z(x, y) - Y_B\big(x + d_x(x, y),\, y + d_y(x, y)\big)$  (2)
where Z = {Z(x, y)} and YB = {YB(x, y)} denote the observed FR left view image and the interpolated right view image (which is at full resolution), respectively. In addition, dx(x, y) and dy(x, y) are the horizontal and vertical components of the disparity at position (x, y), and Id = {Id(x, y)} is the view difference image defined on the basis of the disparity information. In practice, for SR of the MR stereoscopic image, we cannot obtain the original FR stereoscopic image pair at the decoder side. Thus, the disparity map cannot be estimated accurately. Similar to [25], the view difference image is therefore directly calculated from the stereoscopic image pair as:
$I_d(x, y) = Z(x, y) - Y_B(x, y)$  (3)
We take the initial view difference image, Id, as the input of the modified CNN, and the convolutional operation in the first layer of the CNN is represented as:
$I_{d1} = W_1 * I_d + B_1$  (4)
where * denotes the convolutional operation. Then, we apply the rectified linear unit (ReLU) [27] after the convolutional operation to alleviate the overfitting problem:
$I_{d1} = \max(0, W_1 * I_d + B_1)$  (5)
where W1 and B1 represent n1 filters of the support f1 × f1 × c1 and biases, which comprise an n1-dimensional vector, respectively. Here, f1 represents the filter size, and c1 denotes the number of channels in the input image. Additionally, output Id1 is composed of n1 feature maps of the input image.
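As a small illustration of the input preparation, the sketch below forms the initial view difference image of Equation (3) from the FR left view and the interpolated LR right view. It is an assumption-laden sketch: SciPy's cubic spline zoom stands in for the bicubic interpolation filter, and the crop merely guards against size rounding.

```python
import numpy as np
from scipy.ndimage import zoom

def initial_view_difference(z_left_fr, y_right_lr, scale=2):
    """Form the initial view difference image I_d = Z - Y_B (Equation (3))."""
    # Interpolate the LR right view up to FR size (order-3 spline ~ bicubic).
    y_b = zoom(y_right_lr, zoom=scale, order=3)
    # Guard against off-by-one sizes caused by rounding.
    y_b = y_b[:z_left_fr.shape[0], :z_left_fr.shape[1]]
    return z_left_fr - y_b
```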

3.1.2. Mapping and Estimation of View Difference Image and Right View Image

After extracting an n1-dimensional feature map for each image patch in the first CNN layer, these n1-dimensional vectors are mapped into n2-dimensional vectors in the second CNN layer. The high-precision view difference image is estimated in the third CNN layer. Finally, we estimate the right view image after the three-layer operation.
Similar to the first layer, the second layer is built through convolution and the ReLU, as follows:
$I_{d2} = \max(0, W_2 * I_{d1} + B_2)$  (6)
where W2 and B2 represent n2 filters of the support f2 × f2 × c2 and the biases, which comprise an n2-dimensional vector, respectively. Output Id2 is composed of n2 feature maps, which can be conceptually used as representations of high-precision view difference image patches that will constitute a full high precision view difference image.
According to the representations of high-precision view difference image patches in the second layer, the convolutional operation in the third CNN layer is defined to produce the high-precision view difference image, Id3, as:
$I_{d3} = W_3 * I_{d2} + B_3$  (7)
where W3 and B3 represent n3 filters of the support f3 × f3 × c3 and the biases, which comprise an n3-dimensional vector.
After the three-layer operation, the right view image is estimated as:
$X_1 = Z - I_{d3}$  (8)
where X1 is the estimated right view image which is the output of the first stage.
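Putting the first stage together, a usage sketch (continuing the hypothetical helpers above) estimates the right view by Equation (8). The border cropping reflects the fact that the unpadded convolutions shrink the output; it is an assumption of this sketch rather than a detail stated in the paper.

```python
import numpy as np
import torch

def estimate_right_view(model, z_left_fr, y_b):
    """First stage: X1 = Z - I_d3 (Equation (8)) for a single luma image."""
    i_d = torch.from_numpy((z_left_fr - y_b).astype(np.float32))[None, None]
    with torch.no_grad():
        i_d3 = model(i_d)[0, 0].numpy()            # high-precision difference
    # Crop Z to the valid (unpadded) output region before subtracting.
    top = (z_left_fr.shape[0] - i_d3.shape[0]) // 2
    left = (z_left_fr.shape[1] - i_d3.shape[1]) // 2
    z_valid = z_left_fr[top:top + i_d3.shape[0], left:left + i_d3.shape[1]]
    return z_valid - i_d3                          # estimated right view X1
```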

3.1.3. Modified CNN and Training

In this paper, the modified CNN is created to estimate the high-precision view difference image. Our objective is to train a three-layer network f given a training dataset $\{Z^{(i)}, X^{(i)}, Y_B^{(i)}\}_{i=1}^{N}$, so that the high-precision view difference image $I_{d3}^{(i)} = f(Z^{(i)} - Y_B^{(i)})$ is estimated, where $Z^{(i)}$ and $X^{(i)}$ denote the ground-truth left view image and ground-truth right view image, respectively. Furthermore, $Y_B^{(i)}$ is the interpolated right view image, and N is the number of training samples. The labeled ground-truth difference image used by the CNN is then defined as $I_d^{\mathrm{label}(i)} = Z^{(i)} - X^{(i)}$. To produce the high-precision view difference image, we use the sum of absolute differences (SAD) as the distortion function:
$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left| f\big(Z^{(i)} - Y_B^{(i)}\big) - I_d^{\mathrm{label}(i)} \right|$  (9)
According to Equation (9), the distortion between the estimated high-precision view difference image $I_{d3}$ and the ground-truth difference image $I_d^{\mathrm{label}}$ is minimized. The parameters $\theta = \{W_1, W_2, W_3, B_1, B_2, B_3\}$ are hence obtained.
Stochastic gradient descent [28] is used to minimize the distortion by standard back-propagation. Then, the parameters’ matrices are updated as
$\Delta_{i+1} = M \cdot \Delta_i + r \cdot \frac{\partial L}{\partial \theta_i^{l}}, \qquad \theta_{i+1}^{l} = \theta_i^{l} + \Delta_{i+1}$  (10)
where l ∈ {1, 2, 3} indexes the layers, i denotes the iteration, M is the momentum parameter with a value of 0.9, r is the learning rate, and $\partial L / \partial \theta_i^{l}$ is the derivative of the loss with respect to the network parameters. Initially, all filters are initialized from a random Gaussian distribution with zero mean and a standard deviation of 0.001, while the biases of each layer are initialized to zero. The learning rate is 2.5 × 10−4 for the first two layers and 2.5 × 10−5 for the last layer.
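A training sketch under these settings is given below (PyTorch, not the authors' MatConvNet code). The L1 loss realises the SAD criterion of Equation (9) up to a constant factor, the momentum of 0.9 and per-layer learning rates follow the values above, and the data loader is a hypothetical source of 33 × 33 sub-image pairs whose labels are cropped to the network's valid output size.

```python
import torch
import torch.nn as nn

def init_weights(model):
    """Gaussian initialisation (std 0.001) for filters, zeros for biases."""
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.normal_(m.weight, mean=0.0, std=1e-3)
            nn.init.zeros_(m.bias)

def train(model, loader, epochs=100):
    """Minimise the SAD distortion of Equation (9) with momentum SGD (Eq. (10))."""
    criterion = nn.L1Loss()            # mean absolute difference over sub-images
    optimizer = torch.optim.SGD(
        [{"params": model.conv1.parameters(), "lr": 2.5e-4},
         {"params": model.conv2.parameters(), "lr": 2.5e-4},
         {"params": model.conv3.parameters(), "lr": 2.5e-5}],
        momentum=0.9)
    for _ in range(epochs):
        for i_d, i_d_label in loader:  # shapes: (batch, 1, 33, 33) and cropped labels
            optimizer.zero_grad()
            loss = criterion(model(i_d), i_d_label)
            loss.backward()
            optimizer.step()
```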

3.2. Global Reconstruction Constraint to Incorporate the Right View

For a given LR right view, Y, there may be many HR right views X owing to the ill-posed nature of SR. We consider that the reconstructed HR right view X should remain consistent with the LR right view Y under the MR stereoscopic image observation model. However, the estimated right view image X1 obtained in the first stage may not satisfy this condition. Therefore, the global reconstruction constraint is enforced for incorporating the right view by projecting X1 onto the solution space of Y = DBX. It is computed as:
$X^* = \arg\min_{X} \left\| DBX - Y \right\|_2^2 + \left\| X - X_1 \right\|_2^2$  (11)
Thus, the solution can be computed by iterative back projection. The updated equation is:
$X_{t+1} = X_t + \nu \left[ B^{T} D^{T} \left( Y - DBX_t \right) + \left( X_1 - X_t \right) \right]$  (12)
where Xt denotes the reconstructed right view image after the (t − 1)-th iteration and ν denotes the step size with the value of one.
We use X* from the aforementioned optimization as the ultimately reconstructed right view image. On one hand, image X* is as close as possible to the estimated right view image X1, obtained by estimation of the view difference image in the first stage. On the other hand, it satisfies the global reconstruction constraint in the second stage.
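A minimal sketch of this second stage is given below, reusing the blur/decimation operators from the observation-model sketch in Section 2. The zero-insertion upsampling followed by a symmetric Gaussian blur approximates B^T D^T, and the regularisation term is taken in the gradient-descent direction (X1 − Xt); both choices are assumptions of the sketch, not details fixed by the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def back_project(x1, y_lr, scale=2, blur_sigma=1.0, step=1.0, iters=20):
    """Second stage: enforce Y = DBX by iterative back projection (Equation (12)).

    x1   : right view estimated in the first stage (FR size).
    y_lr : observed LR right view.
    """
    x = x1.copy()
    for _ in range(iters):
        y_sim = gaussian_filter(x, blur_sigma)[::scale, ::scale]   # D B x_t
        residual = np.zeros_like(x)
        residual[::scale, ::scale] = y_lr - y_sim                  # D^T (Y - D B x_t)
        x = x + step * (gaussian_filter(residual, blur_sigma) + (x1 - x))
    return x
```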

4. Experimental Results and Discussion

The training and testing data used in our experiments are described in this section. We then explore our modified CNN and provide a convergence analysis to verify that the method is efficient. Next, the SR results of the proposed method are compared with those of recent state-of-the-art methods, and the running times of all these methods are compared to evaluate the computational complexity. Finally, a subjective evaluation is conducted to analyze the perceived quality of the stereoscopic images.

4.1. Training and Testing Data

Considering that an appropriate amount of training data can increase CNN performance [29], we used 20 stereoscopic images from the Middlebury Stereo Dataset [30] as the training dataset. Each image pair comprised two views; Figure 5 shows the right view of each pair. To provide sufficient information for training the modified CNN while limiting its complexity, we extracted the view difference image between each of the 20 pairs of the FR left view image and the interpolated right view image, and we randomly cropped sub-images with a size of 33 × 33 from the training images. A total of 552,729 training sub-images (that is, N, the number of training samples) were generated. To avoid boundary effects in the training process, none of the convolutional layers use padding. Moreover, although we used a fixed sub-image size during training, the modified CNN can be applied to images of arbitrary size in the testing process.
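A sketch of the sub-image generation is shown below; the helper and the number of crops per image pair are hypothetical, and the label crops would additionally be trimmed to the network's valid output size before training, as noted in the training sketch of Section 3.1.3.

```python
import numpy as np

def sample_subimages(i_d, i_d_label, n_crops, size=33, rng=None):
    """Randomly crop paired 33 x 33 sub-images from an initial view difference
    image and its ground-truth label I_d^label."""
    rng = rng or np.random.default_rng()
    h, w = i_d.shape
    pairs = []
    for _ in range(n_crops):
        r = int(rng.integers(0, h - size + 1))
        c = int(rng.integers(0, w - size + 1))
        pairs.append((i_d[r:r + size, c:c + size],
                      i_d_label[r:r + size, c:c + size]))
    return pairs
```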
To test the performance and robustness of the modified CNN, we employed the first frame of various multi-view video sequences, including Champagne_tower, Dog, Pantomime, Newspaper, Poznan Street [31], Ballet, and Breakdancers [32], as well as the remaining image pairs in the Middlebury Stereo Dataset: Sword2, Umbrella, and Vintage. Figure 6 shows the right view of each pair in the testing dataset. For Champagne_tower, the FR left view is view 38 and view 39 is selected as the LR right view. For Dog and Pantomime, the FR left view is view 40 and view 39 is selected as the LR right view. For Newspaper, the FR left view is view 2 and view 3 is selected as the LR right view. For Poznan Street, the FR left view is view 5 and view 4 is selected as the LR right view. For Ballet and Breakdancers, the FR left view is view 2 and view 1 is selected as the LR right view.
In this paper, the specific CNN parameters are set according to practical experience, as shown in Figure 4. The MATLAB toolbox MatConvNet [33] is used to implement our CNN. The parameters of the convolutional filters are set to $[f_l, f_l, c_l, n_l]_{l=1}^{3} = [9, 9, 1, 64]_{l=1}, [1, 1, 64, 32]_{l=2}, [5, 5, 32, 1]_{l=3}$. To strengthen the correlation between image patches, the convolution stride is set to one for all layers. For three-channel color images, we followed the methods of [12,13,14,15] in the experiments. Additionally, our stereoscopic image SR is intended to support the MR stereoscopic video coding model, whose video format is YCbCr (a color space in which Y is the brightness (luma) component, while Cb and Cr are the blue-difference and red-difference chroma components). Thus, each color image is converted to the YCbCr color space, and only the Y component is reconstructed by our SR method, while the Cb and Cr components are upsampled by bicubic interpolation [34]. Furthermore, although our network can be easily extended to multi-channel image processing, Dong's experiments on the YCbCr color space [15] demonstrated that the Cb and Cr channels scarcely improve performance. Therefore, the objective evaluation indices in Section 4.2 and Section 4.3 are calculated only on the Y channel.
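The colour handling described above can be sketched as follows; `super_resolve_y` is a placeholder for the two-stage method applied to the luma channel, the scikit-image conversions are one possible choice of YCbCr transform, and the cubic spline zoom again stands in for bicubic interpolation.

```python
import numpy as np
from scipy.ndimage import zoom
from skimage.color import rgb2ycbcr, ycbcr2rgb

def sr_color_image(rgb_lr, super_resolve_y, scale=2):
    """Reconstruct only the Y channel with the SR method; upsample Cb and Cr
    by cubic interpolation, as described in Section 4.1."""
    ycbcr = rgb2ycbcr(rgb_lr)
    y_sr = super_resolve_y(ycbcr[..., 0])        # proposed two-stage SR on luma
    cb = zoom(ycbcr[..., 1], scale, order=3)     # chroma: interpolation only
    cr = zoom(ycbcr[..., 2], scale, order=3)
    return ycbcr2rgb(np.stack([y_sr, cb, cr], axis=-1))
```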

4.2. Convergence Analysis

To verify the efficiency of the proposed method, a convergence analysis was conducted. Figure 7 shows the training performance on the Middlebury Stereo Dataset [30] and the testing performance on the Dog image. To evaluate the convergence of the modified CNN, the above-mentioned SAD was computed as the training error and the Peak Signal to Noise Ratio (PSNR) was used as the testing error in each epoch. Here, an epoch denotes one complete pass over the training data.
It is evident from the figure that, as the number of epochs increases, the training SAD gradually decreases, whereas the testing PSNR progressively increases, and both curves show a gradual convergence tendency. These results show that the reconstruction quality improves with more training epochs until the modified CNN reaches stability. Furthermore, the performance of the proposed method surpasses that of the sparse coding (SC) method [13] baseline after only a few training epochs, and it outperforms SRCNN [15] with a sufficient number of training epochs. Finally, it converges to a PSNR value of 46.66 dB on the Dog image.
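For reference, the PSNR used as the testing error can be computed on 8-bit luma as in the following minimal sketch.

```python
import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak Signal to Noise Ratio in dB between two Y-channel images."""
    diff = reference.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```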

4.3. Stereoscopic Image SR Results

4.3.1. Contrast with Single-View Methods

To demonstrate that the proposed method is more effective than single-view methods, comparative experiments were carried out between the SC method [13], the SRCNN method [15], and the proposed method. In addition to the common PSNR index, we used two other objective evaluation indices, namely the structural similarity index (SSIM) [35] and the blind/referenceless image spatial quality evaluator (BRISQUE) [36]. Since the PSNR and SSIM indices are full-reference image quality assessments and the BRISQUE index is a no-reference image quality assessment, we considered PSNR and SSIM together to facilitate the analysis. Figure 8 depicts the reconstruction results of these three methods for the first frame of the Poznan Street sequence. According to the details in the locally magnified region, SC [13] produced a relatively blurred reconstruction. Furthermore, as shown by the edge of the letter P and the number 0, the proposed method was more effective in reducing ringing than SRCNN [15]. Figure 9 shows the reconstruction results of the three methods for the Umbrella image. From the magnified details of the umbrella stick and umbrella ribs, the proposed method produced sharper edges that more closely approximate the real HR images. Figure 10 additionally shows that the details of the pantomimist's clothes in the first frame of the Pantomime sequence obtained by the proposed method are clearer. Table 1 presents the PSNR and SSIM values in the Y channel obtained by SC [13], SRCNN [15], and the proposed method. As shown in the table, compared with SC [13] and SRCNN [15], the average PSNR value of the proposed method is increased by 2.39 dB and 0.54 dB, respectively, and the average SSIM value is slightly increased by 0.01 and 0.0002, respectively. Furthermore, the BRISQUE values in Table 2, which typically range from 0 to 100 (0 represents the best quality, 100 the worst), demonstrate the significant performance of the proposed method: the average BRISQUE value of the proposed method is decreased by 21.66 and 2.03 compared with SC [13] and SRCNN [15], respectively. Although the CNN architecture of the proposed method is similar to that of SRCNN [15], the proposed method achieves better performance by combining the high-precision view difference image estimated by the modified CNN with the global reconstruction constraint. Overall, according to the tables and the reconstructed images, the proposed method achieves better SR results than SC [13] and SRCNN [15].

4.3.2. Contrast with Depth-Based Methods

To show the advantage of the proposed method, which fully leverages the correspondence between views without estimating the depth map, comparative experiments were carried out between the depth-based method presented by Garcia and the proposed method. As in Garcia's work [22], to fully consider the high-frequency information of the original images, the test sequences, including Dog, Pantomime, Poznan Street, Ballet, and Breakdancers, were resized to 640 × 480, 640 × 480, 960 × 544, 512 × 384, and 256 × 192, respectively, by using a six-tap Lanczos interpolation filter. To ensure consistency, we used these downsampled images as the original images for testing under the same conditions.
As shown in Table 3, the average PSNR value of the proposed method increased by 1.14 dB compared with Garcia's method [22]. Two significant characteristics of the proposed method contributed to this improvement: (1) the estimation of the high-precision view difference image between views by the modified CNN in the first stage; and (2) the gradual improvement of the reconstructed right view by enforcing the global reconstruction constraint to incorporate the right view in the second stage.

4.4. Running Time

To evaluate the computational complexity, the running times of all methods were compared for the ten testing stereoscopic images listed in Table 1, as shown in Table 4. All results were obtained with the corresponding authors' MATLAB code, and our results were likewise obtained in MATLAB. All the algorithms were run on the same machine with an Intel 2.30-GHz CPU and 16 GB of RAM. The proposed method is much faster than the SC method [13] because it does not need to solve a complex optimization problem as the SC method does. Moreover, the proposed method's speed is close to that of SRCNN [15]. Once the training of our modified CNN is complete, the SR results are quickly obtained because the method is purely feed-forward.

4.5. Subjective Evaluation

4.5.1. Generation of Testing Stereoscopic Images

To evaluate the quality of the stereoscopic images generated by the different SR methods, a subjective experiment was conducted following International Telecommunication Union Radiocommunication Sector (ITU-R) Recommendation BT.500 [37], so that the subjective quality of the reconstructed stereoscopic images relative to the original stereoscopic images was obtained. Specifically, the experiment used the ten aforementioned testing stereoscopic images and adopted the Double-Stimulus Continuous Quality-Scale method [37], in which two stimuli, the reconstructed stereoscopic image and the original stereoscopic image, are scored simultaneously. Thus, for ten testing stereoscopic images and three different SR methods, a total of 10 × 3 = 30 comparison clips were produced.

4.5.2. Experimental Environment and Participants

In the experiment, the stereoscopic projection system was adopted, and the participants needed to wear polarized glasses, which separate the left and right views to the appropriate eyes. The system consisted of two projectors (BenQ PB8250 DLP), DELL real-time 3D graphics workstations, a polarized light bracket, and a metal screen (150 inches). The experiment was conducted in a specific laboratory, in which the illumination, temperature, and other experimental conditions followed ITU-R Recommendation BT.500.
A total of 20 participants with an average age of 23 years were involved in the study. All of the participants passed a color vision test, a 20/30 visual acuity test, and a stereoscopic visual acuity test at 40 arc-seconds. They were non-experts whose professional backgrounds are not directly related to image quality. Each participant needed to score 30 pairs of stereoscopic images. Each pair was displayed for 40 s, followed by 10 s for scoring and 10 s for resting. According to ITU-R Recommendation BT.500, the 30 pairs were displayed in random order. The distance between the participants and the screen was 3 m, that is, three times the height of the screen. In addition, to familiarize the participants with the scoring process, four other pairs of stereoscopic images were shown to them before the official scoring.

4.5.3. Ranking and Raw Data Processing

After the stereoscopic images were scored by the participants, the Difference Mean Opinion Scores (DMOS) between each pair of stereoscopic images, which include the original stereoscopic image and the reconstructed stereoscopic image, were calculated. According to ITU-R BT.500, the value range of DMOS is from 0 to 100. The formula is as follows:
$DMOS_j = \frac{1}{N_j} \sum_{i=1}^{N_j} d_{ij}$  (13)
$d_{ij} = r_{i\,\mathrm{ref}(j)} - r_{ij}$  (14)
where rij denotes the raw quality score of the j-th reconstructed stereoscopic image evaluated by the i-th participant, riref(j) is the raw quality score assigned by the i-th participant to the j-th original stereoscopic image, and Nj denotes the number of participants involved in assessment of the j-th stereoscopic image. Then DMOSj is the mean of the raw difference scores dij.
Before calculating the final DMOS of each reconstructed stereoscopic image, the data of participants with poor scoring stability should be removed. Specifically, if the value of dij falls outside the 95% confidence interval of DMOSj, then dij is treated as an outlier.
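A sketch of this score processing is given below; the normal approximation used for the 95% confidence interval and the array layout are assumptions of the sketch.

```python
import numpy as np

def dmos_with_screening(raw_ref, raw_test):
    """Compute DMOS per Equations (13)-(14), dropping d_ij values outside the
    95% confidence interval of the mean (the outlier rule described above).

    raw_ref, raw_test : arrays of shape (participants, images) holding the raw
    scores of the original and reconstructed stereoscopic images, respectively.
    """
    d = raw_ref - raw_test                                   # d_ij
    dmos = d.mean(axis=0)                                    # DMOS_j
    half_width = 1.96 * d.std(axis=0, ddof=1) / np.sqrt(d.shape[0])
    screened = np.where(np.abs(d - dmos) <= half_width, d, np.nan)
    return np.nanmean(screened, axis=0)                      # screened DMOS_j
```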

4.5.4. Results and Analysis

Figure 11 shows the DMOS of the ten testing stereoscopic images reconstructed with the three different SR methods. It is evident that the SC method [13] performed worse than the other two SR methods; most of the participants found the stereoscopic images reconstructed by the SC method to be blurrier. Benefiting from the correlation between views, the proposed method also outperformed SRCNN [15], demonstrating its better stereo visual quality.

5. Conclusions

To ensure the perceived quality of stereo vision, we proposed in this paper a method that uses CNNs and view incorporation to reconstruct an FR stereoscopic image. Compared with single-view and depth-based methods, the proposed method exploits the correlation between the left and right views without estimating the depth map. Firstly, a deep learning tool is used to estimate the view difference image, which comprises the image texture information and the depth information of the stereoscopic pairs. Then, the estimated right view image is projected onto the solution space of the MR stereoscopic image observation model. Finally, the HR reconstructed right view image is obtained. The experimental results indicated that the performance of the proposed method is superior to that of existing methods in terms of both reconstruction quality and speed. In the future, we will focus on accelerating CNN convergence and further exploit the temporal correlation of video for MR stereoscopic video SR.

Acknowledgments

This work was supported by the Natural Science Foundation of China under Grant Nos. U1301257, 61671258, and 61620106012, the National High-tech R&D Program of China under Grant No. 2015AA015901, and the Natural Science Foundation of Zhejiang Province under Grant Nos. LY15F010005 and LY16F010002. It was also sponsored by the K.C. Wong Magna Fund of Ningbo University.

Author Contributions

Z.P. and G.J. designed the algorithm and wrote the source code. They together wrote the manuscript. H.J. and M.Y. provided suggestions on the algorithm and revised the entire manuscript. F.C. and Q.Z. provided suggestions on the experiments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, J.; Chen, C.; Liu, Y.; Chen, X. Small-world brain functional network altered by watching 2D/3DTV. J. Vis. Commun. Image Represent. 2016, 38, 433–439. [Google Scholar] [CrossRef]
  2. Domański, M.; Bartkowiak, M.; Dziembowski, A.; Grajek, T.; Grzelka, A.; Łuczak, A.; Mieloch, D.; Samelak, J.; Stankiewicz, O.; Stankowski, J.; et al. New results in free-viewpoint television systems for horizontal virtual navigation. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; pp. 1–6. [Google Scholar]
  3. Nikolic, S.; Lee, M.J.W. Special session: Exploring learning opportunities in engineering education using 2D, 3D and immersive video augmented online technologies. In Proceedings of the IEEE Frontiers in Education Conference (FIE), Erie, PA, USA, 12–15 October 2016; pp. 1–2. [Google Scholar]
  4. Passig, D.; Tzuriel, D.; Eshel-Kedmi, G. Improving children’s cognitive modifiability by dynamic assessment in 3D Immersive Virtual Reality environments. Comput. Educ. 2016, 95, 296–308. [Google Scholar] [CrossRef]
  5. Julesz, B. Foundations of Cyclopean Perception; The University of Chicago Press: Oxford, UK, 1971; p. 406. [Google Scholar]
  6. Chung, K.-L.; Huang, Y.-H. Efficient multiple-example based super-resolution for symmetric mixed resolution stereoscopic video coding. J. Vis. Commun. Image Represent. 2016, 39, 65–81. [Google Scholar] [CrossRef]
  7. Zhang, X.; Wu, X. Image Interpolation by Adaptive 2-D Autoregressive Modeling and Soft-Decision Estimation. IEEE Trans. Image Process. 2008, 17, 887–896. [Google Scholar] [CrossRef] [PubMed]
  8. Zhu, S.; Zeng, B.; Zeng, L.; Gabbouj, M. Image Interpolation Based on Non-local Geometric Similarities and Directional Gradients. IEEE Trans. Multimed. 2016, 18, 1707–1719. [Google Scholar] [CrossRef]
  9. Kim, K.I.; Kwon, Y. Single-Image Super-Resolution Using Sparse Regression and Natural Image Prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1127–1133. [Google Scholar] [PubMed]
  10. Mourabit, I.E.; Rhabi, M.E.; Hakim, A.; Laghrib, A.; Moreau, E. A new denoising model for multi-frame super-resolution image reconstruction. Signal Process. 2017, 132, 51–65. [Google Scholar] [CrossRef]
  11. Chang, H.; Yeung, D.-Y.; Xiong, Y. Super-resolution through neighbor embedding. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 27 June–2 July 2004. [Google Scholar]
  12. Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image super-resolution as sparse representation of raw image patches. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  13. Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image Super-Resolution Via Sparse Representation. IEEE Trans. Image Process. 2010, 19, 2861–2873. [Google Scholar] [CrossRef] [PubMed]
  14. Timofte, R.; De Smet, V.; Van Gool, L. A+: Adjusted Anchored Neighborhood Regression for Fast Super-Resolution. In Proceedings of the Asian Conference on Computer Vision, Singapore, 1–5 November 2014; pp. 111–126. [Google Scholar]
  15. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  16. Yang, C.Y.; Ma, C.; Yang, M.H. Single-Image Super-Resolution: A Benchmark. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 372–386. [Google Scholar]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  18. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–10 December 2015; pp. 91–99. [Google Scholar]
  19. Agostinelli, F.; Anderson, M.R.; Lee, H. Adaptive multi-column deep neural networks with application to robust image denoising. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 1493–1501. [Google Scholar]
  20. Cui, Z.; Chang, H.; Shan, S.; Zhong, B.; Chen, X. Deep Network Cascade for Image Super-resolution. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 49–64. [Google Scholar]
  21. Liu, D.; Wang, Z.; Wen, B.; Yang, J.; Han, W.; Huang, T.S. Robust Single Image Super-Resolution via Deep Networks With Sparse Prior. IEEE Trans. Image Process. 2016, 25, 3194–3207. [Google Scholar] [CrossRef] [PubMed]
  22. Garcia, D.C.; Dorea, C.; de Queiroz, R.L. Super Resolution for Multiview Images Using Depth Information. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1249–1256. [Google Scholar] [CrossRef]
  23. Brust, H.; Tech, G.; Müller, K. Report on Generation of Mixed Spatial Resolution Stereo Data Base; Technical Report Project No. 216503; MOBILE 3DTV: Tampere, Finland, 2009. [Google Scholar]
  24. Irani, M.; Peleg, S. Improving resolution by image registration. CVGIP Graph. Models Image Process. 1991, 53, 231–239. [Google Scholar] [CrossRef]
  25. Ma, L.; Wang, X.; Liu, Q.; Ngan, K.N. Reorganized DCT-based image representation for reduced reference stereoscopic image quality assessment. Neurocomputing 2016, 215, 21–31. [Google Scholar] [CrossRef]
  26. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [Google Scholar] [CrossRef] [PubMed]
  27. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  28. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  29. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in neural information processing systems, Lake Tahoe, Nevada, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
  30. Scharstein, D.; Hirschmüller, H.; Kitajima, Y.; Krathwohl, G.; Nešić, N.; Wang, X.; Westling, P. High-Resolution Stereo Datasets with Subpixel-Accurate Ground Truth. In Proceedings of the 36th German Conference on Pattern Recognition, Münster, Germany, 2–5 September 2014; pp. 31–42. [Google Scholar]
  31. Mobile 3DTV. Available online: http://sp.cs.tut.fi/mobile3dtv/stereo-video/ (accessed on 4 February 2017).
  32. Zitnick, C.L.; Kang, S.B.; Uyttendaele, M.; Winder, S.; Szeliski, R. High-quality video view interpolation using a layered representation. In Proceedings of the 31st international conference on computer graphics and interactive techniques, Los Angeles, California, USA, 8–12 August 2004; pp. 600–608. [Google Scholar]
  33. Vedaldi, A.; Lenc, K. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the 23rd ACM international conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 689–692. [Google Scholar]
  34. Bicubic Interpolation. Available online: https://en.wikipedia.org/wiki/Bicubic_interpolation (accessed on 11 February 2017).
  35. Zhou, W.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar]
  36. Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-Reference Image Quality Assessment in the Spatial Domain. IEEE Trans. Image Process. 2012, 21, 4695–4708. [Google Scholar] [CrossRef] [PubMed]
  37. Methodology for the Subjective Assessment of the Quality of Television Pictures. Available online: https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.500-11-200206-S!!PDF-E.pdf (accessed on 4 February 2017).
Figure 1. Mixed resolution (MR) stereoscopic video coding model.
Figure 2. MR stereoscopic image observation model.
Figure 3. Overall procedure of the proposed method.
Figure 4. Modified convolutional neural networks (CNN) training framework.
Figure 5. Right view of each stereoscopic pair in the training dataset (from the Middlebury Stereo Dataset [30]).
Figure 6. Right view of each stereoscopic pair in the testing dataset.
Figure 7. Convergence curve. (a) Training performance on the Middlebury Stereo Dataset [30]; (b) testing performance on the Dog image.
Figure 8. Super-resolution (SR) results and Peak Signal to Noise Ratio (PSNR) values in the Y channel for the first frame of the Poznan Street sequence. (a) Ground truth; (b) local amplification region of the ground truth; (c) local amplification region of sparse coding (SC) [13]; (d) local amplification region of SRCNN [15]; and (e) local amplification region of the proposed method.
Figure 9. SR results and PSNR values in the Y channel for the Umbrella image. (a) Ground truth; (b) local amplification region of the ground truth; (c) local amplification region of SC [13]; (d) local amplification region of SRCNN [15]; and (e) local amplification region of the proposed method.
Figure 10. SR results and PSNR values in the Y channel for the first frame of the Pantomime sequence. (a) Ground truth; (b) local amplification region of the ground truth; (c) local amplification region of SC [13]; (d) local amplification region of SRCNN [15]; and (e) local amplification region of the proposed method.
Figure 11. Difference Mean Opinion Scores (DMOS) of ten testing stereoscopic images reconstructed with three different SR methods.
Table 1. Comparison with SC [13] and SRCNN [15] for PSNR (dB)/structural similarity index (SSIM).
Images | Resolution | Scale | SC [13] | SRCNN [15] | Proposed
Sword2 | 2856 × 2000 | 2 | 46.76/0.9857 | 51.07/0.9951 | 51.35/0.9951
Umbrella | 2960 × 2016 | 2 | 48.11/0.9897 | 51.08/0.9954 | 51.93/0.9955
Vintage | 2912 × 1924 | 2 | 40.84/0.9855 | 42.11/0.9903 | 42.23/0.9894
Champagne_tower | 1280 × 960 | 2 | 43.76/0.9856 | 44.83/0.9917 | 46.42/0.9921
Dog | 1280 × 960 | 2 | 43.88/0.9799 | 46.20/0.9901 | 46.66/0.9903
Pantomime | 1280 × 960 | 2 | 45.71/0.9906 | 46.41/0.9942 | 47.91/0.9948
Newspaper | 1024 × 768 | 2 | 41.21/0.9720 | 42.60/0.9834 | 42.54/0.9832
Poznan Street | 1920 × 1088 | 2 | 40.77/0.9600 | 42.49/0.9787 | 42.69/0.9788
Ballet | 1024 × 768 | 2 | 41.01/0.9590 | 42.40/0.9699 | 42.82/0.9714
Breakdancers | 1024 × 768 | 2 | 41.08/0.9402 | 42.44/0.9573 | 42.42/0.9573
Average | - | 2 | 43.31/0.9748 | 45.16/0.9846 | 45.70/0.9848
Table 2. Comparison with SC [13] and SRCNN [15] for the blind/referenceless image spatial quality evaluator (BRISQUE).
Images | Resolution | Scale | SC [13] | SRCNN [15] | Proposed
Sword2 | 2856 × 2000 | 2 | 67.94 | 48.05 | 46.34
Umbrella | 2960 × 2016 | 2 | 77.60 | 41.75 | 40.20
Vintage | 2912 × 1924 | 2 | 62.65 | 42.70 | 37.65
Champagne_tower | 1280 × 960 | 2 | 62.23 | 39.89 | 36.81
Dog | 1280 × 960 | 2 | 42.29 | 40.31 | 38.39
Pantomime | 1280 × 960 | 2 | 79.83 | 40.77 | 40.39
Newspaper | 1024 × 768 | 2 | 46.70 | 36.86 | 34.21
Poznan Street | 1920 × 1088 | 2 | 47.65 | 41.25 | 38.79
Ballet | 1024 × 768 | 2 | 63.22 | 41.77 | 39.26
Breakdancers | 1024 × 768 | 2 | 52.09 | 32.56 | 30.53
Average | - | 2 | 60.22 | 40.59 | 38.56
Table 3. Comparison with the depth-based method of Garcia [22] for PSNR (dB). The data in the fourth column are taken from Garcia [22].
Images | Resolution | Scale | Garcia [22] | Proposed
Dog | 640 × 480 | 2 | 36.54 | 38.33
Pantomime | 640 × 480 | 2 | 38.59 | 39.63
Poznan Street | 960 × 544 | 2 | 35.78 | 35.90
Ballet | 512 × 384 | 2 | 36.34 | 36.83
Breakdancers | 256 × 192 | 2 | 39.09 | 41.38
Average | - | 2 | 37.27 | 38.41
Table 4. Running time of the ten testing stereoscopic images listed in Table 1 (unit: s).
Methods | SC [13] | SRCNN [15] | Proposed
Average | 2064.80 | 95.05 | 97.54
