1. Introduction
Ultrasound is among the most popular medical imaging modalities. Ultrasound devices are small and inexpensive compared to other modalities such as magnetic resonance imaging (MRI) or computed tomography (CT). Medical imaging allows the medical team to examine the inner parts of the patient's body, providing the physician with additional information about the patient's condition. Accurate and fast diagnosis is crucial in some scenarios, such as emergency rooms; since ultrasound imaging is the fastest, smallest, and most portable of the common modalities, it is a natural candidate for such settings. Furthermore, in contrast to CT and X-ray scans, ultrasonic waves are non-ionizing, making ultrasound a safe imaging modality.
Compared to other imaging modalities, such as MRI and CT, the main disadvantage of ultrasound is its inferior image quality. Due to the inherent non-homogeneity of the human body and its diverse composition of tissues, each possessing unique physical properties, the ultrasonic waves captured during imaging tend to exhibit higher noise levels. Furthermore, the relatively lower frequency (and hence longer wavelength) of ultrasonic waves, compared to other imaging modalities, reduces spatial resolution. Unclear images resulting from noise and reduced resolution can complicate the diagnosis process for the medical team, potentially leading to errors or incorrect diagnoses. Hence, generating high-quality ultrasound images is critical for fast and accurate diagnosis. In contemporary practice, it is commonplace to apply image denoising and enhancement algorithms after the ultrasound image has been generated.
Speckle noise reduction [1] is one such example. In the context of ultrasound imaging, speckle noise has been studied extensively, and different approaches exist for its reduction, for example, non-local filtering methods [2,3] and deep learning methods [4,5,6,7]. However, applying denoising as a post-processing step reduces the device's frame rate, which is suboptimal.
Besides image denoising and enhancement, there is a growing trend in the medical imaging community of integrating automatic image analysis algorithms such as classification and segmentation. As with many other computer vision problems, deep learning methods present state-of-the-art results on medical imaging tasks. However, neural networks are susceptible to out-of-distribution data samples, so noisy images can lead to wrong predictions by the neural network. To ensure the sustained and reliable performance of deep neural networks (DNNs) for image analysis tasks, a requisite level of quality in brightness-mode (B-Mode) ultrasound images is imperative.
Ultrasound imaging relies on ultrasonic waves: the transducer emits a wave, and the subsequently received echo signals are used to generate the resulting image. The reflected wave is recorded by the transducer, and each signal is then encoded to a pixel value. The pixel's grayscale value depends on the properties of the reflected signal: lower received signal power, relative to the transmitted energy, implies high ultrasonic wave absorption and is encoded to a lower grayscale value, while higher received signal power implies low absorption and is encoded to a higher pixel value. The ultrasound transducer is composed of N transmitters; in each transmit event, a subset of transducer elements is selected to transmit ultrasonic waves, and the echo is then received by the receiving elements. The imaging scheme dictates the different sets of receiving/transmitting elements. For example, in focused transmission, each transmit event captures a depth-wise line within the target tissue. Each transmit element is focused on the target line by applying an appropriate transmit time delay.
With focused transmission, also known as line scanning, reconstructing the entire image is time-consuming, since each line is acquired separately. In plane-wave imaging, by contrast, all the transducer elements are employed to transmit a plane ultrasonic wave, capturing the whole region with a single transmission. The generated plane wave is transmitted at a different angle in each transmission event: by applying specific time delays to each of the transmission elements, the combination of these time-delayed excitation signals forms an angled plane wave. When using a plane wave, the generated echoes represent multiple lines from a single transmission event. Consequently, assuming the same penetration depth of the ultrasonic waves, the frame rate is higher with unfocused transmission than with focused transmission. While plane-wave transmission is faster and more suitable for real-time imaging, it is associated with reduced resolution and contrast compared to focused transmission. Thus, the image reconstruction algorithm becomes critical for ensuring optimal overall performance.
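The per-element time delays that steer an angled plane wave can be sketched as follows. This is an illustrative computation only, assuming a linear array and a nominal speed of sound; the function and parameter names are our own, not a vendor API:

```python
import numpy as np

def plane_wave_delays(element_x, angle_rad, c=1540.0):
    """Per-element transmit delays (seconds) that steer a plane wave.

    element_x: lateral element positions in meters.
    angle_rad: desired steering angle of the plane wave.
    c: assumed speed of sound in tissue (m/s).
    Delays are shifted so the earliest-firing element has delay 0.
    """
    delays = element_x * np.sin(angle_rad) / c
    return delays - delays.min()

# Example: 64-element linear array with 0.3 mm pitch, steered to 10 degrees.
x = (np.arange(64) - 31.5) * 0.3e-3
tau = plane_wave_delays(x, np.deg2rad(10.0))
```

For a positive steering angle, elements farther along the array fire progressively later, so the wavefronts combine into a plane wave tilted toward that side.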
The process of forming an ultrasound image involves the following steps:
Receiving echo of the generated ultrasonic wave;
Applying time of flight correction to the received signal;
Beamforming the time-aligned signal’s array;
Applying log compression;
Image post-processing.
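The steps above can be sketched end-to-end for a single scan line. This is a minimal illustrative sketch under simplifying assumptions (integer delays, a crude rectified envelope instead of a proper Hilbert-transform envelope, and unit DAS weights); all names are our own:

```python
import numpy as np

def b_mode_line(rf, delay_samples):
    """Minimal sketch of B-mode formation for one scan line.

    rf: (channels, samples) received echo data.
    delay_samples: per-channel integer time-of-flight corrections.
    """
    # Steps 1-2: receive echoes and apply time-of-flight correction.
    aligned = np.stack([np.roll(rf[c], -delay_samples[c])
                        for c in range(rf.shape[0])])
    # Step 3: beamform with delay-and-sum (unit weights).
    beamformed = aligned.sum(axis=0)
    # Step 4: log compression of the envelope, normalized to 0 dB.
    envelope = np.abs(beamformed)
    image = 20 * np.log10(envelope / (envelope.max() + 1e-12) + 1e-12)
    # Step 5: post-processing (e.g., denoising) would follow here.
    return image

line = b_mode_line(np.random.default_rng(0).standard_normal((16, 256)),
                   np.zeros(16, dtype=int))
```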
A low-complexity delay and sum (DAS) beamforming algorithm is usually selected to maintain a high frame rate in commercial ultrasound devices. Predetermined static delays are usually used to perform time-of-flight correction on the received signals, after which a summation of the channel data is performed to generate the beamformed signal. The low computational complexity of DAS comes at the cost of a wider main lobe and higher side lobes in the beamformed signal. More advanced adaptive algorithms exist, such as the minimum variance distortionless response (MVDR) beamformer. With adaptive beamforming, the summation weights are not constant; they are computed from the data and produce better results [8]. Although adaptive beamformers such as MVDR offer superior performance compared to the DAS beamformer, their significant computational complexity makes them unsuitable for real-time applications.
Deep learning has demonstrated remarkable achievements across diverse tasks, including image processing, speech recognition, and more [9]. Particularly in medical imaging, deep learning has emerged as the leading approach, exhibiting state-of-the-art performance in tasks such as image classification and segmentation [10]. For example, Chen et al. [11] have shown great success with the task of cerebrovascular segmentation from time-of-flight MRI data. They proposed a solution to the problem in a semi-supervised learning setting, incorporating two identical neural networks with shared weights: one trained on labeled data and the second trained on unlabeled data. For labeled data, they used a cross-entropy loss; for unlabeled data, they proposed a consistency loss term between an input sample and a perturbation of it, thus ensuring the same segmentation map for a given sample and its perturbation. Their model has shown state-of-the-art results in terms of the Dice score [12].
Deep learning strategies have been applied to improve the performance of model-based and data-adaptive approaches like DAS and MVDR in terms of computational performance and image quality.
Incorporating a data-driven approach such as deep learning can reduce the computational cost. For example, estimating the MVDR output image with a neural network can substantially reduce the runtime. In [13], the authors showed that they were able to generate images on par with MVDR in terms of perceptual quality while maintaining a considerably lower computational complexity. Additionally, with deep learning one can combine multiple sequential steps of the image formation pipeline, such as beamforming and denoising, into a single, faster neural network.
2. Related Work
Goudarzi et al. [14] proposed a MobileNetV2 [15] neural network to estimate the reconstruction of a multi-angle DAS beamformer from a single-angle acquisition. The network input is a $C \times W$ tensor, where $C$ is the number of receiving channels and $W$ is the spatial window size, set to 32. The network output is a two-element vector representing the IQ components of the estimated multi-angle DAS. With a parameter count of 2.226 million, MobileNetV2 is considered a relatively lightweight neural network, enabling faster computation and inference times. However, since the reconstruction is performed pixel-by-pixel, the performance does not meet the requirements of real-time applications.
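To illustrate why per-pixel reconstruction becomes the bottleneck, the following sketch (a hypothetical helper of our own, not code from [14]) extracts one $C \times W$ input patch per output pixel; each patch requires a separate forward pass of the network:

```python
import numpy as np

def pixelwise_patches(channel_data, w=32):
    """Build one (C, w) input patch per output pixel.

    channel_data: (C, T) single-angle acquisition.
    Returns a (T - w + 1, C, w) array: each slice would be fed to a
    per-pixel beamforming network, hence T - w + 1 forward passes
    for a single scan line.
    """
    c, t = channel_data.shape
    return np.stack([channel_data[:, i:i + w] for i in range(t - w + 1)])

patches = pixelwise_patches(np.zeros((128, 512)), w=32)
```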
Rothlübbers et al. [16] adopted a distinct methodology wherein the direct estimation of the multi-angle in-phase and quadrature (IQ) components was replaced: instead, the output of the DNN was employed as the beamforming weights, which were subsequently multiplied with the single-angle input to form the multi-angle estimate. The training data consist of 107 samples of privately acquired raw ultrasound RF data and publicly available data [17]. The network is trained with a linear combination of a per-pixel mean squared error loss and a multi-scale structural similarity (MS-SSIM) loss [18].
Following the beamforming and log compression of the received echo signal, the subsequent step in the ultrasound image pipeline is post-processing. Post-processing steps are usually applied to improve the contrast and reduce the noise of the beamformed signal. Noise reduction is particularly crucial when the insonified area is larger, as in plane-wave ultrasound transmission, since plane-wave ultrasound tends to exhibit higher noise levels and lower spatial resolution compared to focused transmission. In medical imaging, post-processing operations such as noise reduction, automatic segmentation, and classification hold significant value in automating the diagnostic process and enhancing image quality. Denoised images offer a higher level of clarity, providing a more distinct visualization of anatomical structures and pathological features and thereby improving the accuracy and efficiency of medical diagnoses. A common approach nowadays is to apply a task-specific algorithm after the image has been formed. However, applying additional algorithms after image formation decreases the frame rate, which is not optimal. Another issue with that approach is that, for every new task, a separate neural network has to be trained or explicitly designed. Integrating a beamforming algorithm that reconstructs the post-processed beamformed image directly, without incorporating external algorithms or methods in addition to the beamforming process, therefore holds substantial significance: a single beamformer that also outputs a post-processed image offers notable benefits in terms of improved performance and enhanced end-to-end stability.
Bhatt et al. [19] proposed a UNet-based architecture [20] to predict both segmentation and image reconstruction. The proposed architecture is based on one encoder and two decoders, each producing a task-specific output: the first outputs an ultrasound image reconstruction, and the second outputs a segmentation map. One significant advantage of this approach is that the model outputs both a segmentation map and an ultrasound image simultaneously. Also, using a single shared encoder allows the model to learn features relevant to both tasks, which are then decoded into each task's output by a separate decoder. A disadvantage of this approach is that the computational resources required to run the model grow proportionally with the number of desired tasks, since each task requires its own decoder. Furthermore, a new decoder must be trained from scratch for each future task.
Khan et al. [21] proposed a different approach; they trained a U-Net variation. To control the task-specific output, they added adaptive instance normalization (AdaIN) layers [22] at the bottleneck block of the U-Net architecture. In parallel to the primary U-Net beamformer, they also trained a small, fully connected neural network that maps a style code to the AdaIN parameters: the normalization mean and variance. A normalization with the task-specific mean and variance is then applied to the output of the bottleneck block. The advantages of the approach proposed in [21] are:
Scalability: given enough task-specific data, one has to train only a small portion of their complete architecture—the fully connected AdaIN layer parameters;
Performance: during inference, the AdaIN parameters can be pre-computed, and hence only a single forward pass of the U-Net network is required to generate task-specific output.
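A minimal sketch of the AdaIN operation itself, assuming a NumPy array in (C, H, W) layout; in [21] the style mean and variance would be produced by the small fully connected network from a task code, whereas here they are passed in directly for illustration:

```python
import numpy as np

def adain(features, style_mean, style_std, eps=1e-5):
    """Adaptive instance normalization of a (C, H, W) feature map.

    Each channel is normalized to zero mean / unit variance over its
    spatial dimensions, then rescaled with task-specific statistics
    (style_mean, style_std), one value per channel.
    """
    mu = features.mean(axis=(1, 2), keepdims=True)
    sigma = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mu) / (sigma + eps)
    return (style_std.reshape(-1, 1, 1) * normalized
            + style_mean.reshape(-1, 1, 1))

out = adain(np.random.default_rng(1).standard_normal((8, 16, 16)),
            style_mean=np.full(8, 2.0), style_std=np.full(8, 0.5))
```

Because the task identity enters only through `style_mean` and `style_std`, switching tasks at inference amounts to swapping two small vectors, which is what makes the pre-computation in the second advantage possible.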
With the approach proposed in [21], there is an evident improvement in both scalability and performance. However, it is essential to note that, for each task, only the representation of the bottleneck layer is modified. As a result, the task-specific output is controlled solely by applying different normalizations to the output of the bottleneck layer. Hence, we opted for a per-layer task adaptation approach. Rather than applying task-specific normalization solely to the bottleneck representation, we introduce a layer-wise convolutional filter normalization technique. This approach enables us to modify the learned convolutional filters of each layer based on the requirements of the specific task. We benchmark our proposed normalization scheme and beamforming neural network on publicly available data from [23]. We also test our task-specific performance with speckle noise reduction and sub-sampling. The following section introduces our fully convolutional neural network architecture designed for ultrasound beamforming. We elucidate the architectural details, highlighting the key components and their functionality in the beamforming process. Following that, we present our approach to multitask learning in the context of beamforming. Specifically, we propose a per-layer normalization scheme wherein scale and bias parameters are learned independently for each task. Our adaptive normalization scheme allows for better task-specific adaptation while maintaining a consistent network architecture across all tasks, differentiating our approach from previous works such as [21], which lack comparable specificity. Moreover, our approach avoids introducing additional sub-networks, as observed in [19], simplifying the overall model architecture while achieving improved performance.
The rest of the paper is organized as follows. Section 3 describes the problem settings and the main existing approaches to ultrasound beamforming. Section 4 introduces our proposed beamformer and multitask learning approach. In Section 5, we describe the experimental setup, training data, and evaluation metrics. Finally, in Section 6, we present our results, with a discussion and conclusion in Section 7 and Section 8, respectively.
3. Existing Beamformers
For plane-wave imaging, each of the transducer elements is used to record the received echo signal. The resulting received plane-wave echo is a tensor $X \in \mathbb{R}^{E \times C \times T}$, where $C$ is the number of receiving channels, $E$ is the number of transmit events, and $T$ is the number of time samples recorded. Then, time-of-flight correction is applied to the received signal, ensuring that it is aligned correctly in terms of timing. The delays for the time-of-flight correction are calculated from the geometry of the transducer with respect to each pixel. The next step of the image formation is beamforming the time-aligned signals to generate the final image $Y$. Beamforming is a signal processing technique for sensor arrays that generates a unified signal from multiple sources. The ultrasound transducer is composed of multiple sensors; after an ultrasonic wave is generated by the transmit elements, the echo is received by a subset $C$ of receiving elements. These multiple signals are then aggregated to generate the final beamformed echo signal. The DAS algorithm is the most fundamental beamforming algorithm used in ultrasound imaging. In DAS, each received signal is delayed by a time quantity determined by the sensor array geometry. After applying the time delays, the signals are assumed to be time-aligned, and the beamformed signal is the sum of the time-aligned signals (constant weighting with a value of 1). Model-based beamforming algorithms can be described mathematically by

$$Y = \sum_{e=1}^{E} \sum_{c=1}^{C} W_{e,c} \, X_{e,c},$$

where $W$ is an apodization tensor, $X$ is the received signal, and $e$ and $c$ index the transmission events and channels, respectively.
DAS is widely used in ultrasound image formation because of its low computational cost, which allows for high frame rates. One drawback of DAS for ultrasound beamforming is that the resulting image usually suffers from low contrast.
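The model-based formulation above, a weighted sum over transmit events and channels, can be sketched as follows; the symbol layout (E, C, T) is our own convention, and unit weights recover plain DAS:

```python
import numpy as np

def beamform_weighted_sum(x_aligned, w=None):
    """Weighted sum over transmit events and channels.

    x_aligned: (E, C, T) time-of-flight-corrected channel data.
    w: (E, C) apodization weights; unit weights give plain DAS.
    Returns the (T,) beamformed signal.
    """
    e, c, _ = x_aligned.shape
    if w is None:
        w = np.ones((e, c))  # DAS: constant weighting with a value of 1
    return np.einsum('ec,ect->t', w, x_aligned)

data = np.random.default_rng(2).standard_normal((3, 64, 128))
das = beamform_weighted_sum(data)
```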
More advanced adaptive beamforming algorithms exist. For example, the MVDR beamformer [24], in contrast to the DAS algorithm, performs a weighted sum across the received time-aligned signals. MVDR is an adaptive beamforming algorithm whose summation weights are computed by solving the following optimization problem:

$$\min_{w} \; w^{H} R w \quad \text{s.t.} \quad w^{H} a = 1,$$

where $w$ is the vector of apodization weights, $R$ is the received-signal covariance matrix, and $a$ is the steering vector. By computing summation weights that are optimal in terms of variance and distortion, the MVDR beamformer typically yields a beamformed signal characterized by a narrow main lobe and suppressed side lobes. Consequently, this results in enhanced image quality, improved contrast, and reduced noise levels. A significant drawback of the MVDR beamformer is its high computational cost.
Solving (2) requires inverting the covariance matrix of the received signal, which takes $\mathcal{O}(C^{3})$ steps [25]. Compared to DAS, which has a computational complexity of $\mathcal{O}(C)$, MVDR is far more computationally demanding. Furthermore, the covariance matrix $R$ of the received signal is not known and has to be estimated, which is another challenge.
) is not known and has to be estimated, which is another challenge. There is also an adaptive variation of the DAS, the Filter Delay-Multiply and Sum (F-DMAS) [
26,
27]; in this algorithm, the weight is computed as a form of weighted summation across receiving signals from other elements.