*2.2. Reflection Artifacts*

Applying the conventional PAT reconstruction directly to the ultrasound signals generated by the forward model (1) can result in artifacts, because conventional reconstruction methods do not take acoustic reflections into account. In this section we provide a numerical example to illustrate the effect of the acoustic reflections.

We choose the Shepp–Logan phantom (see Figure 1a) as the initial pressure *p*0(*r*) to produce ultrasound signals based on the forward model (1) with *c*(*r*) = 1. The recorded acoustic signal is shown in Figure 1b. To demonstrate the reflection artifacts produced by conventional algorithms, we adopt one of the conventional time reversal (TR) methods proposed in [8]. The sound speed in the TR is again *c*(*r*) = 1; the TR method is then mathematically equivalent to the 2D universal back-projection formula. The stoppage time is again *T* = 4. The reconstructed image is illustrated in Figure 1c. It is clear that direct application of the TR method leads to numerous artifacts in the reconstructed image, especially near the boundaries of the ellipses in the Shepp–Logan phantom. The generation of these artifacts is well understood from a mathematical point of view; see the analysis in [12] for details.

The failure of the conventional TR method indicates that some correction procedure must be introduced to remove the reflection artifacts in the presence of reflections. Several iterative correction algorithms have been proposed, for instance [10–14,39], as well as methods based on deep learning [25]. In the next section, we propose a novel neural network to remove the reflection artifacts and compare its efficacy with two popular iterative correction algorithms.

**Figure 1.** Numerical example to illustrate the effect of the acoustic reflections. (**a**): a 128 × 128 Shepp–Logan phantom as the initial pressure. (**b**): the recorded acoustic signal in the presence of sound hard reflectors. (**c**): reconstruction by conventional TR.

#### **3. Artifacts Correction by Deep Learning**

This section presents the proposed neural network for the correction of reflection artifacts in PAT imaging. We use the network to accelerate conventional iterative algorithms in the following way: instead of running the iterations to convergence, we train the network to learn the map between the image reconstructed by the first iteration and the ground truth. Given the measured ultrasound signals, the reconstruction procedure then consists of two steps. The first step applies an iterative algorithm to the signals to get the output after the first iteration. The second step feeds that output into the neural network to obtain the reconstructed image. These two steps preserve the predictability of the model-based iterative algorithm, while replacing the remaining iterations with the neural network to improve computational efficiency. We refer to the proposed method as the deep learning (DL) algorithm, or DL reconstruction for short.
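The two-step procedure can be sketched as a simple composition of operators. This is a minimal sketch: the function names, placeholder bodies, and array sizes are hypothetical stand-ins, not the actual ATR iteration or trained network of this paper.

```python
import numpy as np

def first_atr_iteration(signals):
    """Stand-in for one iteration of the model-based algorithm (step 1)."""
    return signals.mean(axis=0)  # placeholder: collapse the sensor axis to an image

def trained_network(image):
    """Stand-in for the trained artifact-correction network (step 2)."""
    return np.clip(image, 0.0, None)  # placeholder correction

signals = np.random.rand(64, 128, 128)          # simulated sensor measurements
intermediate = first_atr_iteration(signals)     # step 1: a single iteration only
reconstruction = trained_network(intermediate)  # step 2: network correction
```

The key design point is that the expensive loop over iterations is replaced by one forward pass through the network, while the single model-based iteration keeps the input physically meaningful.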

#### *3.1. Reflection Artifacts Correction Model*

Let *I*IS be the simulated initial source without any artifacts and *I*RA a PAT image including reflection artifacts. The relationship between them can be expressed as follows:

$$I\_{\rm RA} = \mathcal{F}(I\_{\rm IS}),\tag{2}$$

where the function F denotes the acoustic-reflection-induced process due to hyperechoic structures or to special measurement settings such as reverberant-field PAT. The reflection artifacts correction model seeks an approximate inverse, G≈F −1, to reduce the reflection artifacts in *I*RA, i.e.,

$$I\_{\text{est}} = \mathcal{G}(I\_{\text{RA}}) \approx I\_{\text{IS}}.\tag{3}$$

Next, we introduce the proposed network to learn the approximate inverse function G.

## *3.2. The Proposed Network*

The proposed network architecture for reflection artifacts correction is shown in Figure 2. It has the following three parts:


with a stride of 2 to decrease the size of the feature-maps at the 2nd, 4th, and 5th convolutional layers, and deconvolution with a stride of 2 to increase the size of the feature-maps at the 1st, 3rd, and 5th deconvolutional layers. A stride of 1 is used at the remaining layers. All layers have 32 (de)convolutional filters of size 5 × 5 to capture the global structures. In addition, three conveying links [29,40] copy earlier feature-maps and reuse them as input to a later layer whose feature-maps have the same size, via a concatenation along the channel dimension; this preserves high-resolution features. Finally, a residual skip connection [41] enables the above U-net to infer the artifacts and noise from the input feature-maps. The sum of the input feature-maps and the outputs of the U-net gives the corrected feature-maps, which pass through an activation function and then serve as the input to the reconstruction part. Note that zero padding is used throughout this part.

3. **Reconstruction** recovers the final output from the corrected feature-maps. This part has 4 deconvolutional layers, each of which has 32 filters of size 3 × 3 with a stride of 1, except for the last layer, which has only 1 filter. Zero padding is not used in these four layers.
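Because the reconstruction part uses unpadded transposed convolutions, each 3 × 3 layer with stride 1 enlarges the feature-maps by *k* − 1 = 2 pixels per dimension. A quick sanity check of the size arithmetic (the incoming size of 120 is an assumption for illustration only):

```python
def deconv_out_size(n, k=3, stride=1, pad=0):
    # Output size of a transposed convolution: (n - 1) * stride - 2 * pad + k
    return (n - 1) * stride - 2 * pad + k

n = 120  # hypothetical feature-map size entering the reconstruction part
for _ in range(4):  # four unpadded 3x3 deconvolutional layers, stride 1
    n = deconv_out_size(n)
print(n)  # each layer adds 2 pixels per dimension: 120 -> 122 -> 124 -> 126 -> 128
```

With stride 1 and no padding, transposed convolution is the size-wise inverse of a valid convolution, which is why four such layers can restore the margin lost elsewhere in the network.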

The rectified linear unit (ReLU) is used throughout this network [42]; it is defined as *f*(*x*) = max(0, *x*). The input to the network, *I*RA, is the output of the first iteration of averaged time reversal (ATR), which is denoted as ATR-.

**Figure 2.** The proposed network structure for reflection artifacts correction. It comprises three parts: feature extraction, artifacts reduction, and reconstruction. In particular, we use a modified version of U-net as the backbone of the artifacts reduction part.

## *3.3. Loss Function*

The parameters of the network are optimized by minimizing an appropriate loss function. The loss function we choose is a combination of the mean-squared error (MSE) and the structural similarity (SSIM) index [43].

The MSE between the output of the network, *I*est, and the initial reference source, *I*IS, is formally defined as

$$\text{MSE}(I\_{\text{est}}, I\_{\text{IS}}) = \frac{1}{w \times h} \left\| I\_{\text{est}} - I\_{\text{IS}} \right\|\_{F}^{2} \tag{4}$$

where *w* and *h* are the width and height of the image, respectively, and ‖·‖*F* denotes the Frobenius norm. Although MSE is the most straightforward loss function for optimizing the network, the resulting images are usually over-smoothed and lose some details.
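Eq. (4) can be written directly in NumPy; the squared Frobenius norm of the difference is just the sum of squared pixel errors.

```python
import numpy as np

def mse(i_est, i_is):
    """MSE of Eq. (4): squared Frobenius norm of the difference, divided by w * h."""
    w, h = i_is.shape
    return np.sum((i_est - i_is) ** 2) / (w * h)
```

For example, the MSE between an all-zero and an all-one image is 1, since every pixel contributes a unit squared error.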

In contrast to MSE, SSIM measures the similarity between two images in terms of their structures and textures. The SSIM index is calculated on various windows of an image. The measure between a window *x* over *I*est and a window *y* over *I*IS, both of a common size *k* × *k*, is defined as

$$\text{SSIM}(x, y) = \frac{2\mu\_x \mu\_y + c\_1}{\mu\_x^2 + \mu\_y^2 + c\_1} \cdot \frac{2\sigma\_{xy} + c\_2}{\sigma\_x^2 + \sigma\_y^2 + c\_2} \tag{5}$$

where *μx* and *μy* are the averages of *x* and *y*, respectively, *σx*<sup>2</sup> and *σy*<sup>2</sup> are the variances of *x* and *y*, respectively, and *σxy* is the covariance of *x* and *y*. The constants *c*1 = 0.01<sup>2</sup> and *c*2 = 0.03<sup>2</sup> stabilize the division when the denominator is weak. The window size *k* is 11, as suggested in [43]. The SSIM between two images *I*est and *I*IS, SSIM(*I*est, *I*IS), is the average of the SSIM index over all windows.
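A straightforward rendering of Eq. (5) and the window average is sketched below. This is a simplified version with uniform (unweighted) dense windows of stride 1; widely used reference implementations additionally apply Gaussian weighting inside each window, which the sketch omits.

```python
import numpy as np

def ssim_window(x, y, c1=0.01**2, c2=0.03**2):
    """SSIM index of Eq. (5) for a single pair of k x k windows."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))

def ssim(a, b, k=11):
    """Average of the windowed SSIM index over all k x k windows."""
    h, w = a.shape
    vals = [ssim_window(a[i:i + k, j:j + k], b[i:i + k, j:j + k])
            for i in range(h - k + 1) for j in range(w - k + 1)]
    return float(np.mean(vals))
```

When the two images are identical, every window yields an index of 1, so the average SSIM is 1; dissimilar structures drive the index toward 0.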

The loss function is defined as

$$\min\_{\theta\_{\mathcal{G}}} \mathbb{E}\_{(I\_{\text{est}}, I\_{\text{IS}})} \sqrt{1 + \text{MSE}(I\_{\text{est}}, I\_{\text{IS}})} \times \left[1 - \text{SSIM}(I\_{\text{est}}, I\_{\text{IS}})\right],\tag{6}$$

where *θ*G denotes the parameters of the network G. This loss function not only reduces the noise in terms of MSE but also preserves the image structures as measured by SSIM [44]. Various algorithms can solve this minimization problem; in this work, we adopt the Adam algorithm to update the parameters [45]. The gradients of the parameters are computed by back-propagation [46].
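The per-sample term inside the expectation of Eq. (6) can be sketched as follows. For brevity this sketch makes one simplifying assumption: SSIM is computed over the whole image as a single window rather than averaged over 11 × 11 windows.

```python
import math
import numpy as np

def loss(i_est, i_is, c1=0.01**2, c2=0.03**2):
    """Per-sample loss of Eq. (6): sqrt(1 + MSE) * (1 - SSIM).
    Simplification: SSIM over the whole image as one window."""
    mse = np.mean((i_est - i_is) ** 2)
    mu_x, mu_y = i_est.mean(), i_is.mean()
    cov = ((i_est - mu_x) * (i_is - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x**2 + mu_y**2 + c1) * (i_est.var() + i_is.var() + c2))
    return math.sqrt(1.0 + mse) * (1.0 - ssim)
```

The multiplicative form couples the two terms: the loss vanishes only when the output matches the reference both pixel-wise (MSE = 0) and structurally (SSIM = 1), whereas a weighted sum could trade one off against the other.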
