### *2.3. Modified U-Net Neural Network*

U-Net is an efficient deep learning framework for image segmentation tasks. As Figure 4 shows, it can be divided into a front part and a latter part. The front part performs downsampling for feature extraction and is composed of convolution and pooling layers. The latter part performs upsampling for image reconstruction and is composed of up-convolution and convolution layers. Note that the network concatenates the front part and the latter part at each parallel layer, so that more of the extracted feature information is preserved.

For the classical U-Net neural network, suppose that *Zi* is the data obtained after upsampling at the *i*-th layer, *Yi* is the data before upsampling, *Xi* is the data mapped from the left side of the network at the *i*-th layer, and *C* represents the upsampling process. The output of the *i*-th layer can then be expressed as

$$Z_i = X_i + \mathcal{C}Y_i. \tag{1}$$

Our task is to invert seismic waveforms to generate velocity fields and anomaly locations. For this purpose, we propose a modified U-Net whose structure differs from the classical one in two respects. First, the convolution operation starts 1 pixel outside the edge of the seismic image, that is, the input is padded by 1 pixel, which guarantees that the generated velocity field keeps the size of the input while the features of the original data are acquired. Second, experiments with the classical U-Net showed that the generated velocity fields are contaminated by the contour of the seismic waveform, because the classical U-Net transmits part of the data directly to the output. We therefore modify the classical U-Net structure and omit this transmitted part. The structure of the modified U-Net is shown in Figure 4, where the dashed line indicates the transmitting process of the traditional U-Net. The corresponding formulation is as follows:

$$\begin{cases} Z_i = \mathcal{C}Y_i & \text{if} \quad i = 1; \\ Z_i = X_i + \mathcal{C}Y_i & \text{if} \quad i > 1. \end{cases} \tag{2}$$
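The difference between Equations (1) and (2) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the network itself: the upsampling operator is replaced by a fixed nearest-neighbour 2x upsampling, the combination with *Xi* is written as an element-wise sum exactly as in the equations (in a real U-Net this is usually a channel concatenation followed by learned convolutions), and the function names `upsample2x` and `decoder_step` are hypothetical.

```python
def upsample2x(y):
    """Nearest-neighbour 2x upsampling of a 2-D grid (stand-in for the operator C)."""
    out = []
    for row in y:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2-D grids."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def decoder_step(i, x_i, y_i, modified=True):
    """Return Z_i following Eq. (2) if `modified` is True, otherwise Eq. (1)."""
    if i < 1:
        raise ValueError("layer index i must be >= 1")
    up = upsample2x(y_i)
    if modified and i == 1:
        return up            # the transmitted part X_1 is omitted
    return add(x_i, up)      # skip connection kept


# Toy data: X_i is a 4x4 grid from the left side, Y_i a 2x2 grid before upsampling.
x = [[100] * 4 for _ in range(4)]
y = [[1, 2], [3, 4]]
z_classical = decoder_step(1, x, y, modified=False)
z_modified = decoder_step(1, x, y, modified=True)
```

In this toy setting `z_modified` contains no contribution from `x`, which mirrors the motivation for dropping the transmitted part: nothing from the input waveform reaches the output without passing through the deeper layers.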

In addition, the upsampling process is divided into two branches, which predict the velocity field and the anomaly mask, respectively. Under this setting, shown in Figure 5, the velocity field and the anomaly mask can be generated simultaneously from the corresponding seismic waveform.
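As a rough illustration of the two-branch output, the sketch below maps one shared feature image to a velocity image and an anomaly probability image. It is heavily simplified and makes assumptions of its own: each branch is reduced to a single per-pixel affine map instead of a full upsampling path, the mask branch ends in a sigmoid so that its output can be read as a probability, and the name `two_branch_heads` and all weights are hypothetical.

```python
import math


def sigmoid(t):
    """Logistic function, mapping any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def two_branch_heads(features, w_vel, b_vel, w_mask, b_mask):
    """Map shared per-pixel features to (velocity image, anomaly probability image)."""
    # Regression branch: unbounded real-valued output.
    velocity = [[w_vel * f + b_vel for f in row] for row in features]
    # Classification branch: per-pixel probability of belonging to an anomaly.
    mask_prob = [[sigmoid(w_mask * f + b_mask) for f in row] for row in features]
    return velocity, mask_prob


# Arbitrary illustrative numbers, not values from the experiments.
features = [[0.0, 1.0], [2.0, -1.0]]
velocity, mask_prob = two_branch_heads(features, 500.0, 2000.0, 4.0, -2.0)
```

Both outputs have the same spatial size as the shared features, so one forward pass yields a velocity value and an anomaly probability for every pixel.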

**Figure 4.** Architecture of the modified U-Net neural network.

Velocity field image generation is a regression problem, for which a mean squared error (MSE) loss is commonly used. Anomaly image generation, on the other hand, is a classification problem, for which a binary cross-entropy (BCE) loss should be employed. Because classification and regression problems cannot use the same loss function as the criterion, we define the ensemble loss function of the entire U-Net as a weighted sum of the BCE and MSE losses with a tunable parameter *α*. The accuracy requirement of the model is that the ensemble loss falls below a threshold, so that the accuracy of both branches can be guaranteed simultaneously. The ensemble loss function is defined as

$$\text{Loss} = \alpha\,\text{BCE} + (1 - \alpha)\,\text{MSE}. \tag{3}$$
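Equation (3) can be written out as a minimal, dependency-free sketch. The images are assumed to be flattened into plain lists, the mask predictions are assumed to be probabilities in (0, 1), and the function names and the clipping constant `eps` are illustrative choices, not part of the method described above.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy; `prob` holds probabilities, `label` holds 0 or 1."""
    total = 0.0
    for p, y in zip(prob, label):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(prob)


def ensemble_loss(mask_prob, mask_true, vel_pred, vel_true, alpha):
    """Weighted sum of Eq. (3): alpha * BCE + (1 - alpha) * MSE."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * bce(mask_prob, mask_true) + (1.0 - alpha) * mse(vel_pred, vel_true)


# Toy example: two mask pixels and two velocity pixels.
loss = ensemble_loss([0.5, 0.5], [1, 0], [1.0, 2.0], [1.0, 4.0], alpha=0.5)
```

Setting `alpha` to 1 or 0 recovers the pure BCE or pure MSE loss. Since the two terms can differ considerably in magnitude, the useful range of *α* is likely to depend on how the velocity values are scaled.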

### *2.4. Post-Processing*

In velocity model building practice, geophysicists usually apply a smoothing post-processing step to obtain a more reliable background velocity field. Following this traditional setting, we apply two convolutional layers to the generated velocity field image to achieve a smoothing effect. As shown in Figure 6, the velocity field after post-processing has a more reasonable background, while the anomalies are left untouched.

**Figure 6.** Post-processing step.
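The smoothing effect of two stacked convolutions can be sketched as follows. The kernel size and weights of the post-processing layers are not specified here, so the sketch assumes a fixed 3x3 averaging kernel applied twice, with 1-pixel edge replication so that the image size is preserved. The names `mean_filter3x3` and `smooth` are hypothetical, and this simple filter smooths the whole image, so it does not reproduce the anomaly-preserving behaviour shown in Figure 6.

```python
def mean_filter3x3(img):
    """One 3x3 averaging convolution with 1-pixel edge replication (same output size)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), h - 1)  # clamp row index at the border
                    cc = min(max(c + dc, 0), w - 1)  # clamp column index at the border
                    acc += img[rr][cc]
            out[r][c] = acc / 9.0
    return out


def smooth(velocity, passes=2):
    """Apply the averaging convolution `passes` times (two by default)."""
    for _ in range(passes):
        velocity = mean_filter3x3(velocity)
    return velocity


# Toy velocity image: a constant background with one isolated spike.
field = [[2000.0] * 5 for _ in range(5)]
field[2][2] = 2900.0
smoothed = smooth(field)
```

A constant background passes through unchanged, whereas the isolated spike is spread over its neighbourhood and its peak value is reduced.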
