1. Introduction
Image super resolution (SR) [1,2] refers to the reconstruction of a high-resolution (HR) image from its counterpart in the low-resolution (LR) space. Recently, SR image reconstruction has been a prolific area of research in the fields of digital image processing and computer vision because of its ability to overcome the inherent resolution limitations of low-cost image sensors. Real-world applications for SR range from the enhancement of blurred and noisy images/videos into high-definition (HD) images/videos [3] to robust pattern recognition [4] and microscopic object detection [5]. In addition, a higher-quality image obtained from SR leads to a higher degree of accuracy in medical imaging analysis, where proper and accurate localization of tumors is required. Ultrasound images cannot always be acquired with ideal image quality due to phenomena in the transmission medium and the intrinsic properties of ultrasound, especially at low spatial resolution. Although the use of high-quality medical imaging equipment is widely practiced, ultrasound images might contain various artifacts, such as blurring caused by the propagation of ultrasonic waves and noise due to the ultrasound beam characteristics [5]. These kinds of artifacts in medical images might degrade the visual quality of textural and spatial components. Moreover, this could impose a limitation on the posterior analysis of medical conditions, possibly leading to false suppositions. In this regard, the demand for better spatial resolution of ultrasound images with rich textural information of lesions and blood vessels has been one of the most researched topics in medical imaging in recent years. Within the medical imaging domain, ultrasound is a versatile and widely practiced diagnostic tool. Compared to other medical imaging modalities, ultrasound usually produces poor spatial resolution of deep tissues due to the wavelength-dependent inverse relation between penetration depth and resolution [6]. Ultrasound imaging is also constrained by diffraction at the scale of its wavelength and relies on the echo of deep tissue, making it difficult to reconstruct high-quality images due to tissue variability and compressibility [7]. In ultrasound medical imaging, the presence or absence of a lesion is determined by observing the shape of the region of interest, the degree of blood flow, and the corresponding smoothness. In such a scenario, the textural information of such lesions in the spatial component of the ultrasound image is vital for proper diagnosis. Ultrasound images with visible, vivid textural patterns and high spatial resolution are therefore essential for accurate medical diagnosis. The exploitation of the SR task to effectively increase the resolution of poor-quality ultrasound images has consequently prompted extensive research interest [8,9].
In previous research, multiple SR studies based on interpolation techniques [10,11] and sparse representation [12,13] have been explored for enhancing the quality of images to the desired resolution. Moreover, in [14], to improve the temporal resolution of ultrasound imaging, an imaging sequence and signal processing approach was proposed that employs deconvolution and data acquisition based on spatio-temporal interframe correlation. Recently, owing to the paradigm shift in image processing technology, the deep learning (DL) framework has been considered for SR tasks [15,16,17,18] in various applications. These DL techniques [19,20] offer the significant prospect of providing better SR quality than conventional schemes because of their ability to learn non-linear feature representations from image data. Popular frameworks, such as ResNet [21], DenseNet [22], and recurrent neural networks (RNNs) [23], have been utilized in SR tasks with natural images. In the medical imaging domain, multiple studies [5,24,25,26] have investigated DL-based SR approaches. Especially for ultrasound imaging, DL technology has been incorporated to improve image quality in studies such as [27,28,29]. Liu et al. [27] proposed a perception-consistency ultrasound image SR technique based on a self-supervised cycle generative adversarial network (CycleGAN) framework. This study integrated multi-level feature loss and the adversarial characteristics of the generator during the SR task to balance the visual similarity between real data and reconstructed data. In [28], a multi-frame SR approach was proposed for ultrasound images from a set of LR images. To cope with motion estimation while performing multi-frame SR, this research proposed a DL network that obtains HR images by reducing the effect of existing noise in LR ultrasound images. Similarly, in [29], a fast medical image super-resolution (FMISR) method was proposed in which three hidden layers with a mini-network in between carry out feature extraction, followed by a sub-pixel convolution (SPC) layer for image up-sampling. Although these techniques yielded convincing results in terms of image quality, they do not exploit the low-level features of the input image throughout the network. The spatial features of an image are largely contained within the low-level features extracted by the initial layers of a convolutional neural network (CNN). Low-level features are therefore important in any SR task, as they can provide additional information for reconstructing the high-frequency details of the HR image [30]. In addition, when low-level features are passed to the latter layers through skip connections, the vanishing gradient problem can be alleviated. Furthermore, conventional techniques do not take feature redundancy into account, which introduces the risk of learning the same feature multiple times, resulting in limited reconstruction quality of SR imaging.
To overcome these shortcomings, we propose a novel SR model for ultrasound imaging, referred to as a symmetric series convolutional neural network (SS-CNN). The proposed model comprises two parts: a feature extraction network (FEN) and an up-sampling layer. To extract vital features that contain the fine structural details of ultrasound images, the FEN was designed as a symmetric series of two feed-forward convolutional stacks. Subsequent layers in the SS-CNN are concatenated through skip connections to utilize the low-level features of the LR input image and to mitigate the vanishing gradient problem. This leads to the reconstruction of rich high-frequency details from the input image during the SR task. In addition, it provides the benefits of a compact network structure with minimal feature redundancy by allowing feature reuse between the symmetric convolutional series. In doing so, a considerable number of features, including details of high-frequency components, are propagated to the final layer of the FEN. The up-sampling procedure is then conducted by employing the SPC layer as the up-sampling operator. In this layer, the feature map generated by the FEN is convolved with a multi-dimensional kernel and then subjected to a periodic shuffling operation to generate the output HR image. In tests using publicly available ultrasound image databases, we observed that the proposed scheme offers superior performance in terms of peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM), along with a fair SR reconstruction speed, compared to state-of-the-art SR approaches.
The rest of the paper is organized as follows. In Section 2, the proposed system is described. Experiments and results are discussed in Section 3. Finally, conclusions are reported in Section 4.
2. The Proposed System
The system architecture of the proposed scheme is depicted in Figure 1. The network comprises the FEN and an up-sampling layer. The FEN is responsible for extracting salient features from the input image and forwarding them to the up-sampling layer. The up-sampling layer then enhances image resolution by upscaling the output feature map from the preceding layer to produce the HR output.
2.1. The Feature Extraction Network
Feature extraction in SR tasks can be formulated as a sampling problem in which perfect reconstruction of the desired features is required. The main purpose of this network is to extract image blocks from LR images; the dimensions of the extracted blocks remain in the LR space. The network extracts local hierarchical features of an input LR image in stages, from shallow to deep. To acquire a more hierarchical representation of the input image, increasing the depth of the FEN is the general approach in the DL paradigm, since deep layers can extract a wide range of useful feature representations (low to high). However, the severe gradient loss arising during backpropagation must then be dealt with. Therefore, an alternative structure for building a FEN is to pass the same input to a symmetric series of two convolutional stacks, where the feature maps of corresponding layers are concatenated. With the symmetric series, abundant semantic information of the LR input can be explored, from which better perceptual quality of the SR output can be achieved. Note that the resolution of ultrasound images is impeded by the fundamental limits of diffraction, creating a long-standing trade-off between resolution and penetration. Accurate image reconstruction is therefore crucial for the SR task and can be achieved by using a robust feature extractor prior to up-sampling.
In the SS-CNN model, the FEN consists of two symmetric series of convolutional layers: Conv A and Conv B. Each series holds D convolutional layers (D being the depth of each series), each followed by a rectified linear unit (ReLU) as a non-linear activation function. The input for the network is the LR samples created by down-sampling the original HR samples. During forward propagation, the feature maps generated by corresponding Conv A and Conv B layers are concatenated. This allows the FEN to learn global features of the ultrasound images by combining the local features produced by each pair of layers. In this way, the network avoids learning redundant features and instead learns a unique representation of the input produced by each layer. The network also provides sufficient regularization, as the weights are shared between the symmetric convolutional layers, so external regularization techniques are not required. In addition, skip connections are introduced between the input image and the latter convolutional layers of the FEN. Good SR performance requires the neural network to have collective knowledge of multiple levels of feature information [30,31]. In particular, low-level features can provide additional information needed when reconstructing the high-frequency details of an image. Skip connections not only help preserve spatial information but also create short paths for gradients to flow through the network, allowing networks to be trained without gradient vanishing and overfitting problems. Therefore, skip connections preserve image features from the lower-level layers and bypass important textural details to the high-level layers of the CNN, resulting in high-quality HR images in the SR process.
The detailed structure of the proposed FEN with D = 6 is presented in Table 1. The convolutional layers of both series use 64 filters with 3 × 3 kernels to convolve over an image during a single convolution. Small filters are preferred because they facilitate easier weight sharing and reduce network complexity. The cumulative features extracted by concatenating simultaneous feature maps from both series are forwarded to two convolutional layers at the latter stages of feature extraction. These later layers use 32 and 3 filters, respectively, with a kernel size of 1 × 1. To achieve a potential improvement in SR reconstruction performance, both later layers are concatenated with the input image via skip connections. This ensures that the final layer of the feature extraction network contains sufficient textural information from the input image.
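To make this structure concrete, the following minimal Keras sketch shows one plausible reading of the FEN described above. The exact wiring of the concatenations, the single-channel input shape, and the name build_fen are illustrative assumptions rather than the authors' reference implementation; in particular, the weight sharing between the two series mentioned earlier is omitted here for clarity.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_fen(depth=6, input_shape=(300, 300, 1)):
    """One plausible sketch of the FEN (hypothetical reconstruction)."""
    inp = layers.Input(shape=input_shape)
    a, b = inp, inp
    pair_features = []
    for _ in range(depth):
        # Symmetric series Conv A / Conv B: 64 filters, 3 x 3 kernels, ReLU.
        a = layers.Conv2D(64, 3, padding="same", activation="relu")(a)
        b = layers.Conv2D(64, 3, padding="same", activation="relu")(b)
        # Concatenate the feature maps of corresponding layers.
        pair_features.append(layers.Concatenate()([a, b]))
    # Cumulative features from both series feed two 1 x 1 layers.
    x = layers.Concatenate()(pair_features)
    x = layers.Conv2D(32, 1, activation="relu")(x)
    x = layers.Concatenate()([x, inp])  # skip connection from the input
    x = layers.Conv2D(3, 1, activation="relu")(x)
    x = layers.Concatenate()([x, inp])  # second skip connection
    return tf.keras.Model(inp, x, name="fen")
```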
Note that the design of this network focuses on extracting global features from the input image. To achieve this, the convolutional layers of both series have identical receptive field sizes so that the concatenated feature maps from corresponding layers of Conv A and Conv B accommodate sufficient image information. In addition, to preserve important details of the produced feature maps, pooling and down-sampling layers are not used anywhere in the network.
2.2. Up-Sampling Layer
The up-sampling layer is responsible for reconstructing a high-quality HR image from the image features extracted by the FEN. Generally, in conventional DL-based SR tasks, a transposed convolution (deconvolution) layer is used for up-sampling to reconstruct high-quality images [15]. However, since the FEN of the proposed model does not utilize any down-sampling layers (e.g., max-pooling), transposed convolution layers are not suitable candidates for the up-sampling operation. Moreover, using a deconvolution layer for up-sampling has the disadvantage of significantly increasing the computational cost and model complexity due to the large number of feature maps being fed into it. Therefore, to up-sample the features of an LR image into the high-quality HR space, we used the SPC layer, a learnable layer that first convolves its input with a multi-dimensional kernel to generate multiple feature maps for up-sampling. The multi-dimensional kernel consists of an array of image upscaling filters, and a large number of same-sized feature maps are obtained by performing a convolution operation in the SPC layer. These feature maps are then rearranged, or shuffled, to generate an output with significantly higher resolution, such that the height and width of the output tensor are multiplied by the upscaling factor. With this, the network can learn to use multiple channels of the LR image features obtained from the FEN to represent a single HR image. Unlike the deconvolution layer, which explicitly enlarges the feature map to increase resolution, the SPC layer expands the number of feature maps and applies a specific region-mapping criterion to obtain the HR output. As a result, the resolution of the feature maps obtained from the FEN remains uniform, and the model becomes computationally less expensive.
In the SPC layer, the sets of weights for the convolutional kernel are independent from each other during convolution. For each feature map, this layer can generate r² channels in a single upscaling, where r is the upscale factor. Therefore, by choosing the desired value of r, the SPC layer can map the image to the HR space from its LR counterpart in a single upscaling. From Figure 1b, we can see that by convolving the input of dimensions H × W × C (height, width, and channels, respectively) with a multi-dimensional kernel, we obtain an output with the same spatial dimensions as the input but with C · r² channels. Subsequently, a periodic shuffling (PS) operation is used to reshape the output channels to the desired HR output. This operation rearranges a tensor with dimensions (H, W, C · r²) into an (rH, rW, C) tensor. In this way, the r² channels are distributed to the spatial dimensions of the image, where both height and width are multiplied by r to generate the HR output. Throughout this process, the up-sampling layer is able to generate HR output with a single upscaling of the feature maps obtained from the FEN, without using a complex and computationally expensive deconvolution operation.
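As a concrete illustration, the sub-pixel convolution step can be written in a few lines. The sketch below assumes TensorFlow/Keras, where tf.nn.depth_to_space implements exactly the periodic shuffling described above; the function name spc_upsample and the 3 × 3 kernel size are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def spc_upsample(x, r=2, out_channels=1):
    # Learnable multi-dimensional kernel: expands to out_channels * r^2 maps.
    x = layers.Conv2D(out_channels * r * r, 3, padding="same")(x)
    # Periodic shuffling (PS): (H, W, C * r^2) -> (rH, rW, C).
    return layers.Lambda(lambda t: tf.nn.depth_to_space(t, r))(x)
```

With r = 2, for example, a 150 × 150 feature map is mapped to a 300 × 300 output in a single upscaling step.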
2.3. Datasets
The proposed model was evaluated using two publicly available ultrasound image datasets, with examples shown in Figure 2. The breast ultrasound images (BUSI) dataset [32] (Dataset A) was collected in 2018 at Baheya Hospital, Cairo, Egypt, for the early detection and treatment of women’s cancer. A total of 780 images with an average resolution of 500 × 500 pixels were acquired from female candidates between the ages of 25 and 75. The dataset consists of 133 normal images (without cancer), 437 images with benign tumors, and 210 images with cancer. The whole dataset was divided into three parts: 521 images for training, 109 for validation, and 150 for testing. The other dataset [33] (Dataset B) was collected from the UDIAT Diagnostic Centre of the Parc Taulí Corporation in Sabadell, Spain. It was collected in 2012 using an 8.5 MHz Siemens ACUSON Sequoia C512 HD linear array transducer and contains a total of 163 images with a mean resolution of 760 × 570 pixels, of which 53 are cancer images and the remaining 110 are images of benign lesions. The main purpose for creating this dataset was lesion detection; in this research, however, we employed it for the SR task. Of the entire dataset, 110 images were selected for training and later subjected to data augmentation, while the remaining 30 and 23 images were used for validation and testing, respectively.
2.4. Data Augmentation
The images available in the datasets were not numerous enough to prevent overfitting and to properly train the weights of the SS-CNN structure. Therefore, we expanded the limited datasets by applying data augmentation. In the experiment, several augmentation techniques were utilized: rescaling the original pixel values to [0, 1], 10° image rotation, horizontal and vertical shifts of 2 pixels, horizontal flipping, and random zooming. During data augmentation, we increased the dataset sizes by adding synthetic instances to the existing training sets. These instances were created by applying domain-specific techniques, such as geometric transformations, to the original samples. Dataset A and Dataset B were augmented to obtain 880 and 862 training images, respectively.
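For reference, these transformations map naturally onto Keras’ ImageDataGenerator. The sketch below uses the parameters stated above; the zoom range is an illustrative assumption since the text does not specify it.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rescale=1.0 / 255,     # rescale original pixel values to [0, 1]
    rotation_range=10,     # rotations of up to 10 degrees
    width_shift_range=2,   # horizontal shift of 2 pixels (int = pixels)
    height_shift_range=2,  # vertical shift of 2 pixels
    horizontal_flip=True,  # random horizontal flipping
    zoom_range=0.1,        # random zoom (range assumed)
)
```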
2.5. Training
For the SR task, the training process follows a self-supervised learning strategy in which manual labelling of the training data is not required. The LR images were first generated by down-sampling their HR counterparts from the training sets. The output generated during forward propagation was then compared with the original HR images to compute the loss function. The parameters used while training the network are presented in Table 2. The whole network was optimized by minimizing the mean squared error (MSE) loss between the model’s prediction and the corresponding HR ground-truth images. For SR imaging, a previous study [34] showed robust image quality with the adaptive momentum (Adam) optimizer. Therefore, to handle sparse gradients arising from the noisy spatial content of LR images, we used the Adam optimizer with a learning rate of 0.001. During training, the model learns at the level of individual image pixels. To match the input dimensions of the designed network, the images were resized to 300 × 300 pixels. For training and validation, 80% and 20% of the dataset images were allocated, respectively. The whole model was trained for 100 epochs with a small batch size of 8 to improve the generalization of the proposed model.
Figure 3a,b show the learning curves of the SR model on Datasets A and B, respectively, for both training and validation. In both scenarios, the training and validation losses converge as the epochs progress. No sign of overfitting can be observed, which demonstrates effective training of the model up to the point where the training loss and accuracy saturate.
Furthermore, as the SR model learns to enhance the visual quality of the LR images during training, the PSNR increments relative to the training epochs are shown in Figure 4. This figure shows how the model learns to enhance the visual quality of the LR images as training continues. For Dataset A, the PSNR increases from 28 dB to finally reach 35 dB, with saturation observed after 70 epochs. In contrast, for Dataset B, the PSNR at the initial stage of training was 24 dB, finally reaching 29 dB in the final epochs. This indicates that the model’s loss converges over the training epochs and, therefore, that the model was trained properly on both datasets.
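For reproducibility, PSNR and SSIM values of this kind can be computed with TensorFlow’s built-in image metrics; sr and hr below are placeholders for batches of reconstructed and ground-truth images scaled to [0, 1].

```python
import tensorflow as tf

# sr, hr: batches of reconstructed and ground-truth images in [0, 1].
psnr = tf.image.psnr(sr, hr, max_val=1.0)  # peak signal-to-noise ratio (dB)
ssim = tf.image.ssim(sr, hr, max_val=1.0)  # structural similarity index
```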
Figure 5 shows the reconstructed images at the beginning and end of the training phase. As shown in Figure 5b–d, the reconstructed images present enhanced image quality and valuable details. A significant difference in visual quality can be observed between the first and final epochs. Thus, we can conclude that the proposed model learned the given datasets well.