*4.3. Anomaly Detection*

Anomaly detection faces two difficulties. First, labeled data are hard to obtain, or only a very small portion of the data is labeled, which makes it difficult to draw a clear boundary between anomalous and normal behavior. Second, detection should be fast, yet the structure of a social network keeps changing and growing. To address these challenges, we choose the local outlier factor (LOF) [48] model, for three reasons: (1) Unlike supervised learning methods, it can be applied directly without training on a labeled dataset, and therefore has a wider range of application. (2) It quantifies anomalies as scores rather than labels, which lets us find users who merit in-depth exploration intuitively and efficiently. (3) It is a density-based detection method, so its detection results can be directly reflected by dimension reduction.

The anomaly score *LOF<sub>k</sub>*(*p*) of an ego *p*, based on its k-distance neighborhood, is defined as follows:

$$LOF\_k(p) = \frac{\sum\_{o \in N\_k(p)} \frac{\operatorname{lrd}\_k(o)}{\operatorname{lrd}\_k(p)}}{|N\_k(p)|} = \frac{\sum\_{o \in N\_k(p)} \operatorname{lrd}\_k(o)}{|N\_k(p)| \cdot \operatorname{lrd}\_k(p)}\tag{5}$$

$$\operatorname{reach-distance}\_k(p, o) = \max\{d\_k(o), d(p, o)\}\tag{6}$$

$$d\_k(p) = d(p, o) \tag{7}$$

$$\operatorname{lrd}\_k(p) = 1 / \frac{\sum\_{o \in N\_k(p)} \operatorname{reach-distance}\_k(p, o)}{|N\_k(p)|} \tag{8}$$

*d<sub>k</sub>*(*p*) is the k-distance of *p*, that is, the distance from *p* to its k-th nearest neighbor (the node *o* in Equation (7)). *reach-distance<sub>k</sub>*(*p*, *o*) is the k-reachability distance of *p* with respect to *o*: the maximum of the k-distance of *o* and the actual distance between *p* and *o*. *N<sub>k</sub>*(*p*) is the k-distance neighborhood of *p*, the set of all nodes whose distance from *p* is less than or equal to the k-distance of *p*. *lrd<sub>k</sub>*(*p*) is the local reachability density of *p* based on *N<sub>k</sub>*(*p*); it measures how dense the region around *p* is. From Equation (5), the higher *lrd<sub>k</sub>*(*p*) is relative to the densities of its neighbors, the more likely *p* is to be a normal node. Consequently, a higher *LOF<sub>k</sub>*(*p*) indicates that the node differs more from its local k-neighbors, while a value close to 1 indicates a density similar to theirs. The feature vector of an ego *i* is given below:

$$f\_{i} = \begin{bmatrix} k\_{in}^{i}, & k\_{out}^{i}, & \mathcal{W}\_{i}, & \pi\_{i}, & \delta\_{i}, & T\_{i} \end{bmatrix} \tag{9}$$

This vector describes the global properties of the ego, which are important in anomaly detection. We apply LOF to the collection *F* = { *f<sub>i</sub>*, *i* ∈ *G* } of feature vectors to obtain preliminary detection results, which then guide more in-depth investigation.
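
To make Equations (5)-(8) concrete, the following is a minimal pure-Python sketch of the LOF computation. It is an illustration under stated assumptions, not the implementation used in this work: the function name `lof_scores`, the use of Euclidean distance for *d*(*p*, *o*), and the toy two-dimensional vectors (hypothetical stand-ins for the feature vectors *f<sub>i</sub>*) are all assumptions introduced here.

```python
import math


def lof_scores(points, k):
    """Return LOF_k(p) for every point, following Eqs. (5)-(8).

    Assumes Euclidean distance and that no point has k or more exact
    duplicates (otherwise the mean reach-distance could be zero).
    """
    n = len(points)
    # Pairwise distances d(p, o).
    dist = [[math.dist(p, o) for o in points] for p in points]

    k_dist = []      # d_k(p): distance to the k-th nearest neighbor, Eq. (7)
    neighbors = []   # N_k(p): all other points within d_k(p) of p
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        dk = others[k - 1]
        k_dist.append(dk)
        neighbors.append([j for j in range(n) if j != i and dist[i][j] <= dk])

    def reach_dist(i, j):
        # Eq. (6): max{d_k(o), d(p, o)} with p = points[i], o = points[j].
        return max(k_dist[j], dist[i][j])

    # Eq. (8): local reachability density, the inverse of the mean
    # reach-distance from p to the members of N_k(p).
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbors[i]) / len(neighbors[i])
        lrd.append(1.0 / mean_reach)

    # Eq. (5): mean neighbor density divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical toy data: four points on a unit square plus one distant point.
features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(features, k=2)
for point, score in zip(features, scores):
    print(point, round(score, 3))
```

In this toy example the four clustered points receive a score of 1, since their density matches that of their neighbors, while the distant point receives a score well above 1. For real data, an established implementation such as scikit-learn's `LocalOutlierFactor` would likely be preferable to this quadratic-time sketch.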

### **5. System Design and Overview**
