*4.2. Losses and Training*

All modules of the SSAE are trained in three phases:

1. **Reconstruction Phase**: The autoencoder updates the encoder and the decoder to minimize the reconstruction error:

$$\min\_{E,D} \mathbb{E}[||X - \widetilde{X}||^2] \tag{13}$$

where $\widetilde{X}$ is the reconstruction of *X*, and $||\cdot||$ denotes the Euclidean norm.
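As a concrete illustration, the sketch below implements this update in PyTorch; the layer sizes, the 336-point weekly load profile input, and the optimizer settings are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Illustrative architecture: a 336-point weekly load profile compressed to a
# 16-dimensional latent code. The paper's actual layer sizes are not given here.
encoder = nn.Sequential(nn.Linear(336, 64), nn.ReLU(), nn.Linear(64, 16))
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 336))
opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

def reconstruction_step(x):
    """One reconstruction-phase update, Eq. (13): min over E, D of the squared error."""
    x_tilde = decoder(encoder(x))
    loss_ae = nn.functional.mse_loss(x_tilde, x)  # mean squared reconstruction error
    opt_ae.zero_grad()
    loss_ae.backward()
    opt_ae.step()
    return loss_ae.item()
```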

2. **Regularization Phase**: First, the SSAE updates the discriminator to separate real samples drawn from the prior from the encoded samples; it then updates the encoder to confuse the discriminator. This phase can be represented by:

$$\min\_{E}\max\_{\mathrm{DISC}} \mathbb{E}[\log(\mathrm{DISC}(Z\_{\mathrm{real}}))] + \mathbb{E}[\log(1 - \mathrm{DISC}(E(X)))] \tag{14}$$
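A minimal sketch of this min–max game, continuing the illustrative encoder above; the discriminator architecture and the non-saturating form of the encoder objective (a standard GAN training trick, used here in place of literally minimizing $\log(1 - \mathrm{DISC}(E(X)))$) are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

# Continues the sketch above (`encoder` with a 16-dimensional latent space).
latent_dim = 16
discriminator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                              nn.Linear(32, 1), nn.Sigmoid())
opt_disc = torch.optim.Adam(discriminator.parameters())
opt_enc = torch.optim.Adam(encoder.parameters())
eps = 1e-8  # keeps the logs finite

def regularization_step(x):
    # Discriminator step: ascend Eq. (14) with the encoder frozen.
    z_real = torch.randn(x.size(0), latent_dim)   # samples from the prior N(0, I)
    z_fake = encoder(x).detach()
    loss_disc = -(torch.log(discriminator(z_real) + eps).mean()
                  + torch.log(1 - discriminator(z_fake) + eps).mean())
    opt_disc.zero_grad()
    loss_disc.backward()
    opt_disc.step()
    # Encoder step: make encoded samples look "real" to the discriminator.
    loss_enc = -torch.log(discriminator(encoder(x)) + eps).mean()
    opt_enc.zero_grad()
    loss_enc.backward()
    opt_enc.step()
```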

3. **Classification Phase**: The SSAE updates the classifier and the encoder simultaneously by minimizing the cross-entropy loss together with the distance between latent vectors of the same class:

$$\min\_{E,C} \mathbb{E}\left[\text{CrossEntropy}\left(C(E(X)), y\right) + \omega(t) \cdot l\_G\right] \tag{15}$$

where $l\_G$ is a supervised feature-clustering loss, defined as:

$$l\_G((Z\_i, y\_i), (Z\_j, y\_j)) = \begin{cases} ||Z\_i - Z\_j||^2, & y\_i = y\_j \\ \max(0, m - ||Z\_i - Z\_j||)^2, & y\_i \neq y\_j \end{cases} \tag{16}$$

where *m* is the margin between different classes. Because SM data differ across customers, $l\_G(\cdot)$ is designed to push the latent features of different categories apart by at least the margin *m* while pulling features of the same class together; it can also be regarded as a regularizer for the classifier. Following [28], the weight ramp-up function *ω*(*t*) is defined as:

$$\omega(t) = \exp[-5(1-T)^2] \tag{17}$$

where *T* increases linearly with the number of iterations from zero to one over the first 40% of the total iterations (following [28]), and stays at one thereafter.
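Eqs. (16) and (17) translate directly into code. A minimal sketch follows; the margin value $m = 1$ is an illustrative default (the text does not fix it here), while the 40% ramp fraction matches the description above.

```python
import math
import torch

def l_g(z_i, y_i, z_j, y_j, m=1.0):
    """Pairwise clustering loss, Eq. (16); the margin m = 1.0 is an illustrative default."""
    d = torch.norm(z_i - z_j, dim=-1)
    same = (y_i == y_j).float()
    return same * d.pow(2) + (1.0 - same) * torch.clamp(m - d, min=0.0).pow(2)

def omega(t, total_iters, ramp_frac=0.4):
    """Ramp-up weight, Eq. (17): T rises linearly from 0 to 1 over the first 40% of training."""
    T = min(t / (ramp_frac * total_iters), 1.0)
    return math.exp(-5.0 * (1.0 - T) ** 2)
```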

All three phases of the SSAE are trained with Adam [32]. The pseudocode of the mini-batch training algorithm is given in Algorithm 1.

### **Algorithm 1** Mini-batch training of SSAE

**Require:** $x$ = training inputs
**Require:** $y$ = labels for the labeled inputs in $L$
**Require:** $z\_{\mathrm{real}}$ = random samples from $N(0, I)$
**Require:** $E\_\theta(x)$ = encoder with trainable parameters $\theta$
**Require:** $D\_\gamma(x)$ = decoder with trainable parameters $\gamma$
**Require:** $\mathrm{DISC}\_\varphi(x)$ = discriminator with trainable parameters $\varphi$
**Require:** $C\_\phi(x)$ = classifier with trainable parameters $\phi$
**Require:** $N(x)$ = stochastic input augmentation function
**Require:** $\omega(t)$ = weight of the consistency loss

1: **for** $t = 1$ to *iterations* **do**
2: Draw a minibatch $B\_u$ from the unlabeled samples at random
3: $\widetilde{x}\_i \leftarrow D\_\gamma(E\_\theta(x\_i)),\; x\_i \in B\_u$
4: $z\_i \leftarrow E\_\theta(x\_i),\; x\_i \in B\_u$
5: $loss\_{AE} \leftarrow \frac{1}{|B\_u|}\sum\_{i \in B\_u} d(x\_i, \widetilde{x}\_i)$
6: update $\theta$ and $\gamma$ using Adam
7: $loss\_{Disc} \leftarrow \frac{1}{|B\_u|}\sum\_{i \in B\_u} \log(\mathrm{DISC}\_\varphi(z\_{\mathrm{real}})) + \log(1 - \mathrm{DISC}\_\varphi(z\_i))$
8: update $\varphi$ and $\theta$ using Adam
9: Draw a balanced minibatch $B\_l$ from the labeled samples at random
10: $\widetilde{y}\_i \leftarrow C\_\phi(E\_\theta(x\_i)),\; x\_i \in B\_l$
11: Construct $S$, the set of pairs $(x\_i, x\_j)$ with their labels, from $B\_l$
12: $loss\_C \leftarrow -\frac{1}{|B\_l|}\sum\_{i \in B\_l} \log \widetilde{y}\_i[y\_i] + \omega(t) \cdot \frac{1}{|S|}\sum\_{(i,j) \in S} l\_G((E\_\theta(x\_i), y\_i), (E\_\theta(x\_j), y\_j))$
13: update $\theta$ and $\phi$ using Adam

14: **end for**
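Putting the pieces together, the loop below mirrors Algorithm 1 using the step functions sketched earlier; the batch iterators, the classifier head, the number of classes, and the all-pairs construction of $S$ are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

# Continues the earlier sketches: encoder, reconstruction_step, regularization_step,
# l_g, and omega are assumed to be defined as above. Four output classes are assumed.
classifier = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
opt_cls = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()))

def train_ssae(unlabeled_iter, labeled_iter, iterations):
    for t in range(1, iterations + 1):
        x_u = next(unlabeled_iter)              # minibatch B_u (lines 2-8)
        reconstruction_step(x_u)                # phase 1, Eq. (13)
        regularization_step(x_u)                # phase 2, Eq. (14)

        x_l, y_l = next(labeled_iter)           # balanced minibatch B_l (line 9)
        z = encoder(x_l)
        ce = nn.functional.cross_entropy(classifier(z), y_l)
        # All within-batch pairs stand in for the pair set S (line 11).
        i, j = torch.triu_indices(len(y_l), len(y_l), offset=1)
        loss_c = ce + omega(t, iterations) * l_g(z[i], y_l[i], z[j], y_l[j]).mean()
        opt_cls.zero_grad()
        loss_c.backward()
        opt_cls.step()                          # updates theta and phi (line 13)
```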
