#### *3.2. Loss Functions*

The loss function is central to deep learning. Our loss function consists of a clustering loss and a reconstruction loss.

1. Clustering loss $\mathcal{C}(\theta\_F, \theta\_G)$: this is the mean squared deviation of the learned basis from their cluster means, where $\theta$ denotes the parameters to be optimized. It should be noted that this loss is computed on the learned basis rather than the input of the network. It controls the clustering process, i.e., the smaller the loss, the better the clustering in the sense of clustering the multiscale basis. Denote by $\kappa\_{ij}$ the $j$th realization in the $i$th cluster; $G(F(\kappa\_{ij})) \in \mathbb{R}^d$ is then the $j$th learned basis in cluster $i$. Let $\theta\_G$ and $\theta\_F$ be the parameters associated with $G$ and $F$; the loss is then defined as follows,

$$\mathcal{C}(\theta\_F, \theta\_G) = \frac{1}{|A|} \sum\_{i=1}^{|A|} \frac{1}{A\_i} \sum\_{j=1}^{A\_i} \|G(F(\kappa\_{ij})) - \bar{\phi}\_i\|\_2^2, \tag{11}$$

where $|A|$ is the number of clusters, which is a hyperparameter, and $A\_i$ denotes the number of elements in cluster $i$; $\bar{\phi}\_i \in \mathbb{R}^d$ is the mean of cluster $i$. This loss clearly serves the purpose of clustering the solution instead of the input heterogeneous fields; however, to guarantee that the learned basis are close to the pre-computed multiscale basis, we also need to define a reconstruction loss, which measures the difference between the learned basis and the pre-computed basis.
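The per-cluster averaging in Equation (11) can be sketched as follows; this is a minimal NumPy illustration, and the function name and the list-of-arrays layout are assumptions for exposition, not part of the paper's implementation:

```python
import numpy as np

def clustering_loss(bases_by_cluster):
    """Clustering loss of Eq. (11): for each cluster, the mean squared
    l2 distance of its learned bases to the cluster mean, averaged over
    the |A| clusters.

    bases_by_cluster: list of arrays, one per cluster, each of shape
    (A_i, d) holding the A_i learned bases G(F(kappa_ij)) of cluster i.
    """
    total = 0.0
    for cluster in bases_by_cluster:
        mean = cluster.mean(axis=0)                       # \bar{phi}_i
        total += np.mean(np.sum((cluster - mean) ** 2, axis=1))
    return total / len(bases_by_cluster)                  # average over |A|
```

In a training loop, the cluster assignments would come from the clustering of the learned bases; here they are simply given as input.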

2. Reconstruction loss $\mathcal{R}(\theta\_F, \theta\_G)$: this is the mean squared error with respect to the pre-computed multiscale basis $Y \in \mathbb{R}^{m \times d}$, where $m$ is the number of samples. This loss controls the reconstruction process, i.e., if the loss is small, the learned basis are close to the real multiscale basis; it thereby supervises the learning of the clusters. It is defined as follows:

$$\mathcal{R}(\theta\_F, \theta\_G) = \frac{1}{m} \sum\_{i=1}^{m} \|G(F(\kappa\_i)) - \phi\_i\|\_2^2, \tag{12}$$

where $G(F(\kappa\_i)) \in \mathbb{R}^d$ and $\phi\_i \in \mathbb{R}^d$ are the learned and pre-computed multiscale basis of the $i$th sample $\kappa\_i$, respectively.
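Equation (12) can be sketched in a few lines of NumPy; the function name and array layout are illustrative assumptions:

```python
import numpy as np

def reconstruction_loss(learned, precomputed):
    """Reconstruction loss of Eq. (12): mean over the m samples of the
    squared l2 distance between each learned basis G(F(kappa_i)) and
    its pre-computed counterpart phi_i.

    learned, precomputed: arrays of shape (m, d), one basis per row.
    """
    return float(np.mean(np.sum((learned - precomputed) ** 2, axis=1)))
```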

The entire loss function is then defined as $L(\theta\_F, \theta\_G) = \lambda\_1 \mathcal{C} + \lambda\_2 \mathcal{R}$, where $\lambda\_1, \lambda\_2$ are predefined weights. We then solve the following optimization problem:

$$\min\_{\theta\_G, \theta\_F} L(\theta\_F, \theta\_G) \tag{13}$$

for the required training process.

#### *3.3. Adversary Network Serving as an Additional Loss*

In the previous section we introduced the reconstruction loss, which measures the similarity between the learned and pre-computed basis; it is their mean squared error (MSE). MSE is a smooth loss and easy to train, but it is well known to blur images. In image super-resolution and other low-level computer vision tasks, this loss handles high-contrast inputs poorly, and the generated images are usually over-smooth. Our problem has a multiscale nature and, like those low-level computer vision tasks, is generative; hence blurring and over-smoothing can be expected if the model is trained with MSE alone. Defining a good reconstruction loss is therefore important.

Motivated by successful applications of deep fully convolutional networks (FCN) in computer vision, we design a perceptual loss to measure the error. It is now well understood that the lower layers of an FCN extract general features shared by all objects, such as horizontal and vertical curves, while the higher layers are more object-oriented. This insight suggests training the network using features from different layers. Johnson et al. proposed the perceptual loss [29], a combination of the MSE of selected layers of the VGG model [51]. They report that reconstructions from early layers tend to be visually indistinguishable from the input, whereas reconstructions from higher layers preserve image content and overall spatial structure but not color, texture, and exact shape.

We will adopt the perceptual loss idea and design an adversary network to compute an additional reconstruction loss. The network structure can be seen in Figure 6.

The adversary net is fully convolutional, with both input and output being pre-computed multiscale basis. The network has an autoencoder structure and is pre-trained; i.e., we solve the following minimization problem:

$$\min\_{\theta\_A} \frac{1}{m} \sum\_{i=1}^{m} \|f(\phi\_i) - \phi\_i\|\_2^2, \tag{14}$$

where $\phi\_i$ is the multiscale basis and $f$ is the adversary net with trainable parameters $\theta\_A$. Denote by $f\_j(\cdot)$ the output of layer $j$ of the adversary network. The additional reconstruction loss is then defined as:

$$\mathcal{A}(\theta\_F, \theta\_G) = \frac{1}{m} \sum\_{i=1}^{m} \sum\_{j \in I} \|f\_j(G(F(\kappa\_i))) - f\_j(\phi\_i)\|\_2^2, \tag{15}$$

where $I$ is an index set containing selected layers of the adversary net. The complete optimization problem can now be formulated as:
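A toy version of Equation (15) can be sketched as follows. Two fixed random linear + ReLU layers stand in for the pre-trained fully convolutional adversary net $f$; the weights, shapes, and function names are all illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the pre-trained adversary net f: in the paper this is
# a fully convolutional autoencoder trained via Eq. (14); here we use two
# fixed random linear + ReLU layers purely to illustrate the loss.
W1 = rng.standard_normal((8, 6))
W2 = rng.standard_normal((6, 8))

def layer_outputs(phi):
    """Return the outputs f_j(phi) of each layer j of the toy adversary net."""
    h1 = np.maximum(phi @ W1, 0.0)   # f_1(phi)
    h2 = np.maximum(h1 @ W2, 0.0)    # f_2(phi)
    return [h1, h2]

def perceptual_loss(learned, precomputed, layers=(0, 1)):
    """Eq. (15): mean over samples of summed per-layer squared errors
    between features of the learned and pre-computed bases.

    learned, precomputed: arrays of shape (m, d); `layers` plays the
    role of the index set I.
    """
    m = learned.shape[0]
    total = 0.0
    for g, p in zip(learned, precomputed):
        fg, fp = layer_outputs(g), layer_outputs(p)
        total += sum(np.sum((fg[j] - fp[j]) ** 2) for j in layers)
    return total / m
```

In practice the adversary net is frozen after pre-training, so this loss only backpropagates into the parameters $\theta\_F$ and $\theta\_G$ of the main network.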

$$\min\_{\theta\_G, \theta\_F} \lambda\_1 \mathcal{C} + \lambda\_2 \mathcal{R} + \lambda\_3 \mathcal{A}. \tag{16}$$

**Figure 6.** The complete network.
