**2. Methodology**

In this section, we first introduce the characteristics of electricity consumption data and then compare several data structures in terms of their advantages and disadvantages. Finally, we introduce TextCNN and explain why it is suitable for two-dimensional time-series.

#### *2.1. Data Structure Analysis*

Smart meters can collect the electricity consumption data at a high frequency, such as once an hour. The datasets can be expressed as:

$$D\_n = \left\{ \mathbf{x}\_{h\_1}^{d\_1}, \mathbf{x}\_{h\_2}^{d\_1}, \dots, \mathbf{x}\_{h\_{24}}^{d\_1}, \mathbf{x}\_{h\_1}^{d\_2}, \dots, \mathbf{x}\_{h\_j}^{d\_i} \right\} \tag{1}$$

where $D\_n$ represents the data of user *n* and $\mathbf{x}\_{h\_j}^{d\_i}$ is the value recorded by the smart meter during time $h\_j$ on day $d\_i$.

Most studies focus on periodical features of daily or weekly consumption patterns to detect electricity theft. Therefore, they merge the data of one day into one value and utilize the one-dimensional data structure or its variant, as shown in Figure 1. The datasets can be further expressed as:

$$D\_n = \left\{ \mathbf{x}\_{d\_1}, \mathbf{x}\_{d\_2}, \dots, \mathbf{x}\_{d\_i} \right\} \tag{2}$$

where $\mathbf{x}\_{d\_i}$ represents the total amount of electricity consumed on day $d\_i$.

**Figure 1.** One-dimensional data structure of electricity consumption data.
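The daily aggregation of Equation (2) can be illustrated with a minimal sketch on hypothetical data (the variable names below are our own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
hourly = rng.uniform(0.0, 2.0, size=7 * 24)  # one week of hourly readings (kWh)

# One-dimensional structure: one total value x_{d_i} per day, as in Equation (2)
daily_totals = hourly.reshape(7, 24).sum(axis=1)  # shape (7,)
```

Note that this aggregation is lossy: the 24 intraday values of each day are collapsed into a single number, which is precisely the information loss discussed next.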

In this way, they neglect intraday electricity changes and fail to capture intraday features. In this paper, we construct the data into a two-dimensional grid, which supports feature extraction not only across different days but also across different time periods within a day. The two-dimensional grid can be expressed as:

$$D\_n = \begin{bmatrix} \mathbf{x}\_{h\_1}^{d\_1} & \mathbf{x}\_{h\_1}^{d\_2} & \cdots & \mathbf{x}\_{h\_1}^{d\_i} \\ \mathbf{x}\_{h\_2}^{d\_1} & \mathbf{x}\_{h\_2}^{d\_2} & \cdots & \mathbf{x}\_{h\_2}^{d\_i} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}\_{h\_{24}}^{d\_1} & \mathbf{x}\_{h\_{24}}^{d\_2} & \cdots & \mathbf{x}\_{h\_{24}}^{d\_i} \end{bmatrix} \tag{3}$$

Each column in Equation (3) holds the 24 hourly electricity consumption values of one day. In practice, smart meters may collect data more frequently and may record various quantities, such as three-phase voltages and currents, power factors and so on. Therefore, to simplify the expression of Equation (3), we use the following column vector to represent the data of day $d\_i$:

$$\mathbf{x}\_{d\_i} = \left\{ x\_1, x\_2, \dots, x\_j, \dots, x\_F \right\}^T \tag{4}$$

where *F* is the number of data points collected in one day. So far, the dataset of one user can be expressed as:

$$D\_n = \left\{ \mathbf{x}\_{d\_1}, \mathbf{x}\_{d\_2}, \dots, \mathbf{x}\_{d\_i} \right\} \tag{5}$$

Further, we use Figure 2 to explain the above two-dimensional structure. In the left figure, an individual curve represents the data $x\_j$ on different days, and the cluster of curves demonstrates the daily electricity consumption. The expansion of this cluster is a grid, as shown in the right figure, which is also formulated as Equation (5). The height of the grid represents the number of data points from one day, and the length of the grid represents the number of days. In other words, the grid of electricity consumption data is a two-dimensional time-series.

**Figure 2.** Two-dimensional data structure of electricity consumption data.
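Building this grid amounts to a simple reshape and transpose of the raw reading sequence. The following sketch uses hypothetical data and our own variable names:

```python
import numpy as np

rng = np.random.default_rng(1)
days, F = 30, 24                       # 30 days, F = 24 readings per day
hourly = rng.uniform(0.0, 2.0, size=days * F)

# Two-dimensional grid D_n as in Equation (3): height F (intraday positions),
# length = number of days, so each column is one day's vector x_{d_i}
grid = hourly.reshape(days, F).T       # shape (F, days) == (24, 30)
```

Unlike the daily totals of Equation (2), this grid retains every raw reading, so intraday patterns remain available to the classifier.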

To extract the consumption patterns of different users, we use the same interception window to consecutively cut each user's data, thereby obtaining a series of two-dimensional grids with the same length and height. For user *n*, we use $y\_n$ to label the intercepted window of the time-series to indicate whether it corresponds to electricity theft, as shown in Figure 2. We then build a nonlinear mapping function from an input time-series to a predicted class label $y\_n$:

$$y\_n = f\left( \mathbf{x}\_{d\_i}, T \right), \quad d\_i \in T \tag{6}$$

where *T* is the length of the intercepted window and *f*(·) is the key nonlinear function we aim to learn.

In order to express this data structure conveniently for CNN, we use $\mathbf{D}(N, F, T)$ to represent the intercepted segments, where *N* is the number of samples. For an individual sample $\mathbf{D}(F, T)$ in the dataset $\mathbf{D}(N, F, T)$, the classification function *f*(·) needs to be learned. So far, we have constructed a two-dimensional structure that retains the full information of the raw data and transforms the electricity theft detection problem into a time-series classification problem. Based on this two-dimensional time-series structure, we utilize TextCNN to learn the classification function.
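The windowing step that produces $\mathbf{D}(N, F, T)$ from a user's grid can be sketched as follows. This is a minimal illustration under our own assumptions (unit stride, a hypothetical `intercept_windows` helper), not the paper's exact implementation:

```python
import numpy as np

def intercept_windows(grid, T, stride=1):
    """Slide a window of length T along the day axis of a (F, days) grid,
    producing N consecutive segments of shape (F, T)."""
    F, days = grid.shape
    starts = range(0, days - T + 1, stride)
    return np.stack([grid[:, s:s + T] for s in starts])  # shape (N, F, T)

grid = np.arange(24 * 60, dtype=float).reshape(24, 60)   # F = 24, 60 days
D = intercept_windows(grid, T=7)                         # N = 60 - 7 + 1 = 54
```

Each sample `D[k]` is one intercepted window $\mathbf{D}(F, T)$, ready to be fed to the network together with its label $y\_n$.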

#### *2.2. CNN Structure Analysis*

CNN specializes in processing data with a grid-like structure [23]. For different input data types, the CNN structure should be chosen accordingly to achieve effectiveness; variants include TextCNN, RCNN, etc. [24–26]. Considering the above two-dimensional time-series, we focus on TextCNN in this research. TextCNN is widely used in natural language processing (NLP) tasks such as text classification, emotion analysis and sensitivity analysis for its simple structure and effectiveness [27,28].

#### 2.2.1. Basic Introduction to CNN

Normal multilayered neural networks, also called deep neural networks (DNN), consist of input layers, hidden layers and output layers. CNN has an additional convolutional layer, as shown in Figure 3a. The discrete convolution is the key operation in convolutional layers. As shown in Figure 3b, we use a 2 × 2 kernel as an example to illustrate the discrete convolution. The input **I** has a value in each grid cell. Then, a two-dimensional kernel function $\mathbf{K} \in \mathbb{R}^{2 \times 2}$ is used to extract features. The output **S** of the convolution is:

$$\mathbf{S}(i,j) = \sum\_{k\_i=0}^{1} \sum\_{k\_j=0}^{1} \mathbf{I}(i+k\_i, j+k\_j)\, \mathbf{K}(k\_i, k\_j) \tag{7}$$

Equation (7) and Figure 3b together illustrate that convolutional kernels map the neighboring information of the input into the output. Therefore, compared with DNN, CNN has the advantage of considering information in small neighborhoods, which is a crucial feature in the classification of two-dimensional data, as neighboring grid cells usually carry related information [29,30]. For example, if we regard **I** as a black and white picture, the kernels can efficiently extract features, such as edges, angles and shapes, from neighboring pixels.

**Figure 3.** Diagrams of the CNN structure and the discrete convolution: (**a**) CNN structure; (**b**) discrete convolutions.
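Equation (7) can be implemented directly as nested loops over the valid output positions. The sketch below uses a 3 × 3 input and the 2 × 2 all-ones kernel from Figure 3b as an assumed example:

```python
import numpy as np

def conv2d_valid(I, K):
    """Discrete 2-D convolution of Equation (7): each output cell is the
    sum of the elementwise product of K with the overlapping patch of I."""
    kh, kw = K.shape
    H, W = I.shape
    S = np.empty((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

I = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
K = np.ones((2, 2))          # 2 x 2 kernel, as in Figure 3b
S = conv2d_valid(I, K)       # each entry sums a 2 x 2 neighborhood of I
```

With the all-ones kernel, each output entry is simply the sum of a 2 × 2 neighborhood, which makes the "neighboring information" interpretation of Equation (7) explicit.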

#### 2.2.2. Differences between CNN and TextCNN

The kernel size is the main difference between CNN and TextCNN. As shown in Figure 4a, we use height and length to describe the size of a two-dimensional kernel. The commonly used kernel size in CNN is 3 × 3 [31,32], while in TextCNN the height of the kernels always equals that of the input data [27]. This is because, for text classification, the most significant thing is to efficiently capture the internal features of an individual word and the correlations between multiple words. As shown in Figure 4a, the convolutional kernels are sliding windows with the same height as a single word. The kernel moves only in the length direction, so each time the kernel slides over a complete word.

**Figure 4.** Characteristics of CNN: (**a**) feature schematic diagram of TextCNN's kernels; (**b**) differences between CNN and TextCNN.

The influences of different kernel sizes on the network are shown in Figure 4b. In order to capture the association between the green grids and the yellow grids, TextCNN requires only one convolutional layer, while CNN requires three convolutional layers. Therefore, TextCNN simplifies the structure of the neural network and reduces the parameters that require manual intervention. In this manner, the efficiency and effectiveness of capturing the internal features of a word and the correlations between multiple words are guaranteed.

In electricity theft detection, we aim to capture the features from the data correlations of weeks, days, hours and even more frequent time periods. Analogously, the intraday feature of electricity consumption is similar to the association between the green grids and the yellow grids in Figure 4b, and the multi-day correlations are extracted by different kernels, such as **K**1, **K**2 and **K**3 in Figure 4a. Therefore, to efficiently extract features of electricity consumption data, we propose a neural network based on TextCNN for the classification of two-dimensional time-series.
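The full-height kernel idea can be sketched in a few lines. The example below is our own minimal illustration (hypothetical helper name, random kernels), showing that a kernel of height *F* slides only along the day axis of a segment $\mathbf{D}(F, T)$ and yields a one-dimensional feature map per kernel:

```python
import numpy as np

rng = np.random.default_rng(2)
F, T = 24, 7                           # height = readings per day, length = days
segment = rng.standard_normal((F, T))  # one intercepted window D(F, T)

def textcnn_feature_map(x, k):
    """Full-height kernel of width k: its height equals the input height,
    so it slides only along the day (length) axis, covering whole days."""
    height, length = x.shape
    K = rng.standard_normal((height, k))
    return np.array([np.sum(x[:, t:t + k] * K) for t in range(length - k + 1)])

# Several kernel widths (cf. K1, K2, K3 in Figure 4a) capture correlations
# over different numbers of consecutive days
maps = [textcnn_feature_map(segment, k) for k in (2, 3, 4)]
```

Because each kernel spans all *F* intraday readings at once, a single convolutional layer already relates every hour of one day to every hour of neighboring days, which is the structural simplification over standard 3 × 3 kernels discussed above.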
