*3.5. Data Preprocessing*

Because the SM data contain missing and erroneous values caused by communication errors or smart meter failures, the raw data cannot be fed to the DNN directly. A missing value is interpolated as the average of the values recorded at the same hour of day during the two days before and the two days after it:

$$\hat{x}_i = \frac{\sum_{k=-2}^{2} x_{i+24k}}{\#(Existed)} \tag{9}$$

where $\hat{x}_i$ is the interpolated value, *i* denotes the hourly time stamp, and #(*Existed*) is the number of values that exist at the same hour of day in the two days before and the two days after. Missing terms, including the value at *i* itself, are omitted from the sum.
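
The interpolation in Equation (9) can be sketched as follows. This is a minimal illustration rather than the implementation used in this paper: the function name `interpolate_missing`, the parameters `period` and `span`, and the use of NaN to mark missing readings are all assumptions. Gaps with no usable neighbour are left as NaN, a case the equation does not cover.

```python
import math


def interpolate_missing(x, period=24, span=2):
    """Fill NaN gaps in an hourly series x following Equation (9)."""
    filled = list(x)
    for i, value in enumerate(x):
        if not math.isnan(value):
            continue
        # Collect x[i + 24k] for k = -2..2, skipping the gap itself (k = 0),
        # indices outside the series, and neighbours that are also missing.
        existing = []
        for k in range(-span, span + 1):
            j = i + period * k
            if k != 0 and 0 <= j < len(x) and not math.isnan(x[j]):
                existing.append(x[j])
        # len(existing) plays the role of #(Existed); a gap with no usable
        # neighbour stays NaN (an assumption, not stated in the text).
        if existing:
            filled[i] = sum(existing) / len(existing)
    return filled


# Five days of hourly readings whose value equals the day index.
series = [float(i // 24) for i in range(120)]
series[50] = float("nan")  # day 2, hour 2 is missing
print(interpolate_missing(series)[50])
```

In this toy series the gap on day 2 is filled with the mean of the readings from days 0, 1, 3 and 4 at the same hour, which is 2.0. The sketch reads neighbours from the original series only, so one interpolated value never feeds into another.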

Erroneous values fall into two cases: (1) negative values and (2) extreme values. They are corrected by the following equation, and values in neither case are left unchanged:

$$x_i = \begin{cases} 0, & x_i < 0 \\ Q_3(x) + [Q_3(x) - Q_1(x)] \times 3, & x_i > Q_3(x) + [Q_3(x) - Q_1(x)] \times 3 \end{cases} \tag{10}$$

where *x* is the sequence of one electrical quantity for one customer, and $Q_3(\cdot)$ and $Q_1(\cdot)$ are the upper and lower quartiles, respectively. $Q_3(\cdot)$ and $Q_1(\cdot)$ are computed separately for each electrical quantity and each customer.
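
A short sketch of Equation (10) is given below. It is illustrative only: the name `clip_errors` and the parameter `factor` are hypothetical, and the quartile convention (linear interpolation between order statistics, computed on the raw sequence before any correction) is an assumption, since the text does not specify one. The sequence is assumed to contain no missing values at this point.

```python
import statistics


def clip_errors(x, factor=3.0):
    """Apply Equation (10) to one sequence x of a single customer."""
    # Quartiles of the raw sequence; "inclusive" interpolates linearly
    # between order statistics (assumed convention).
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    fence = q3 + (q3 - q1) * factor
    cleaned = []
    for v in x:
        if v < 0:
            cleaned.append(0.0)    # negative reading is set to 0
        elif v > fence:
            cleaned.append(fence)  # extreme reading is set to the upper bound
        else:
            cleaned.append(v)      # any other reading is kept
    return cleaned


readings = [1.0, 2.0, 3.0, 4.0, 5.0, -1.0, 100.0]
print(clip_errors(readings))
```

For this toy input the quartiles are 1.5 and 4.5 under the assumed convention, so the upper bound is 13.5: the reading of 100.0 is replaced by 13.5 and the negative reading by 0.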

Furthermore, standardizing the samples is a common requirement for most machine learning algorithms, and for DNNs in particular. Excessively large sample values can cause large numerical errors in a DNN. In the SM data, different electrical quantities have different scales, and the scales differ further among customers. Such non-uniform scales degrade the predictive performance of the DNN. As described in Sections 3.1 and 3.3, the values of $f_{total}$, $U_{imbalance}$, $I_{imbalance}$, $\hat{f}$, and $LR$ already lie in the range [0, 1]. Hence, only the remaining voltages and currents are normalized, each according to the following equation:

$$\bar{x}_i = \frac{x_i}{Q_3(x) + [Q_3(x) - Q_1(x)] \times 3} \tag{11}$$

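where $\bar{x}_i$ is the normalized value and *x* is the sequence of one electrical quantity for one customer. Because Equation (10) has already limited each sequence to values between 0 and $Q_3(x) + [Q_3(x) - Q_1(x)] \times 3$, the normalized values lie in [0, 1]. After this step, all channels of a sample and all samples from different customers are normalized into the same range.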
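
The normalization in Equation (11) can be sketched in the same way. The function name `normalize` and the parameter `factor` are hypothetical, and the quartile convention is again an assumption. Whether the quartiles are taken before or after the error correction is not stated, so this sketch simply uses the sequence it is given; the guard against a non-positive denominator is also an addition of the sketch.

```python
import statistics


def normalize(x, factor=3.0):
    """Scale one sequence x by Q3 + 3 * (Q3 - Q1), as in Equation (11)."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    scale = q3 + (q3 - q1) * factor
    if scale <= 0:
        # Not covered by the equation: an all-zero channel cannot be scaled.
        raise ValueError("denominator must be positive to normalize")
    return [v / scale for v in x]


# Toy current readings for one customer (already corrected for errors).
currents = [0.0, 1.0, 2.0, 3.0, 4.0]
print(normalize(currents))
```

Each voltage or current channel of each customer would be passed through such a function separately, so that every channel is divided by its own denominator.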
