Article

Combining Machine Learning and Social Network Analysis to Reveal the Organizational Structures

Department of Computational Intelligence, Faculty of Computer Science and Management, Wrocław University of Science and Technology, 50-370 Wrocław, Poland
*
Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(5), 1699; https://doi.org/10.3390/app10051699
Submission received: 7 January 2020 / Revised: 17 February 2020 / Accepted: 21 February 2020 / Published: 2 March 2020
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Formation of a hierarchy within an organization is a natural way of assigning duties, delegating responsibilities and optimizing the flow of information. Only the smallest companies can operate without one, that is, with a flat structure; as they grow, the introduction of a hierarchy becomes inevitable. Most often, its existence results in a different nature of the tasks and duties of members located at various organizational levels or in distant parts of the organization. At the same time, employees often send dozens of emails each day, and by doing so, and by engaging in other activities, they naturally form an informal social network in which nodes are individuals and edges are the actions linking them. At first glance, such a social network seems distinct from the organizational one. However, its analysis may make it possible to reproduce the organizational hierarchy of a company, because people holding similar positions in the hierarchy are likely to share similar ways of behaving and communicating attributed to their role. The key concept of this work is to evaluate how well social network measures, combined with other features obtained through feature engineering, support the classification of the members of an organizational social network. To answer this research question, a machine learning apparatus was employed: Decision Trees, Random Forest, Neural Networks and Support Vector Machines were evaluated for the classification task, alongside a collective classification algorithm, which is also proposed in this paper. This approach made it possible to compare how traditional machine learning classification methods, supported by social network analysis, perform against a typical graph algorithm. The results demonstrate that the social network built from communication metadata strongly exposes the organizational structure.

1. Introduction

People around the world send hundreds of emails to exchange information within organizations. As an implicit result, each of these interactions forms a link in a social network. This network can be a valuable source of knowledge about human behavior; moreover, analyzing it can reveal groups of employees with similar communication patterns. These groups usually coincide with different levels of the organization's hierarchy, and employees who work in the same position generally have a comparable scope of duties. It is common for organizations to maintain some hierarchy because a formally organized structure helps to manage employees better and to gain an advantage in the market. Therefore, the analysis of the network created from a set of emails could retrieve valuable data about internal corporate processes and recreate the organizational structure. Combining network measures with additional features extracted from messages therefore seems an interesting and promising idea for classification tasks. Social network analysis (SNA) has the potential to boost machine learning algorithms in the field of organizational structure detection, since capturing the relations between data points is especially important for this kind of dataset.
The reverse engineering of the corporate structure of an organization can be perceived in two ways. On the one hand, if successful, it could reveal a company structure from metadata alone, which poses a risk when the structure is intentionally kept secret, for example, to preserve a competitive edge or to protect employees from being poached by other companies. On the other hand, it could allow reconstructing the structure of malicious organizations when only partial information about them is available.
In the literature, there are several works describing the detection of organizational structures. However, most of them use the Enron dataset [1] or focus on a network approach and omit standard supervised classification algorithms. It should be noted that each organization is managed in a slightly different way, which means that communication patterns may differ between organizations. These differences imply that some solutions may give better or worse results depending on the specificity of the network; studies on organizational hierarchy should therefore not be limited to a single dataset.
The authors of References [2,3] introduced the concept of matching a formal organizational structure to a social network created from email communication. Experiments were carried out on messages from a manufacturing company located in Poland as well as on the well-known Enron dataset. The results showed that, in both cases, some network metrics were able to reveal the organizational hierarchy better than others. That work also touched on the problem that a formal structure may differ significantly from, and fail to converge with, daily reality.
The idea of combining network metrics with other features extracted from an email dataset to reveal corporate hierarchy was introduced in Reference [4]. The authors presented their own metric named "social score", which defines the importance of each employee in the network. This metric is defined as a weighted average of all features and is used in a grouping algorithm. The grouping method is a simple scale-division algorithm which assigns employees to predefined intervals according to their social score.
A study on the usage of network measures as input features for classification algorithms was presented in Reference [5]. The basic concept of that work was retrieving a company hierarchy from a network created from the social media accounts of its employees. The authors showed that centrality measures and clustering coefficients, in combination with other features extracted from social media, can detect leaders in a corporate structure. However, that research used individual features of a person, such as gender, hometown or number of friends, instead of features gained from job activities and interactions among employees. Other articles describing the combination of SNA and standard classification methods are References [6,7]. Both use the Enron dataset and features based on the number of sent/received messages. In Reference [8], the usage of some network metrics as input for classification and clustering algorithms has been described; furthermore, the results were compared to a novel measure called Human Rank (an improvement of PageRank). However, classification based on social network features is not limited to the corporate environment. For instance, following the ideas of studying the social networks of criminals [9], the authors of Reference [10] used features of a social network of co-arrestees for predicting the possibility of future violent crimes. A similar concept was also used in Reference [11] for analyzing co-offending networks; in that work, a co-offence prediction algorithm using supervised learning was developed. Yet, classification in social networks based on communication or behaviour in social media can relate to completely different areas, such as poverty detection [12], discovery of personality traits [13] or occupation [14]. All of this is possible because our digital traces differ depending on our role or status.
Many solutions in this problem area have also concentrated on classification from a graph perspective, for instance, identifying the key players of a social network based on entropy [15], applying graphical models [1], or using factor graph models [16].
This work focuses on organizational structure detection based on nine months of e-mail communication between employees of a manufacturing company located in Poland, as well as on the Enron dataset. Decision Tree, Random Forest, Neural Network and Support Vector Machine (SVM) algorithms were used for classification; moreover, the influence of minimum employee activity was examined. The obtained results were compared with a simple graph algorithm for collective classification, also proposed in this paper. A weakness of this approach is that the independent and identically distributed (IID) condition is difficult to meet, since the network measures were calculated once, before splitting the data into training and test sets. In social network analysis, full satisfaction of the IID condition is hard to achieve: had we built independent networks for the training and test data, we would have obtained totally different network measures, and the importance of each node could be biased. However, network measures can be valuable features for machine learning algorithms given their ability to capture connections between data points. The results showed that the combination of classification algorithms and social network analysis can reveal organizational structures; however, small changes in the network can change the efficiency of the algorithms. Furthermore, a graph approach such as collective classification is able to classify well even with limited knowledge about node labels.

2. Materials and Methods

In this section, after introductions to supervised learning and social network analysis, the proposed solution is described in detail, as well as the datasets used. The presented solution is implemented in Python, using the NetworkX library for social network creation and Scikit-learn for all machine learning tasks.

2.1. Supervised Learning

Machine learning can be considered an application of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed [17]. This set of tools, derived from the field of statistics, enables performing multiple tasks far exceeding human capabilities or simple algorithms in a variety of disciplines, ranging from text analysis and computer vision to medicine and others. In machine learning, classification is a supervised learning approach in which the algorithm learns from the data input given to it and then uses this learning to classify new observations. Here, by supervised we mean providing the algorithm with instances of objects that have been categorized as belonging to a certain class and requiring it to develop a method to adequately classify other objects of unknown class. To fulfill this goal, numerous algorithms have been developed and tuned over the last decades, such as logistic regression [18], the naive Bayes classifier [19], nearest neighbors [20], Support Vector Machines [21], decision trees [22], random forests [23] and neural networks [24]. Each of these methods takes a different perspective on the task. Regarding the methods used in this work, decision trees build a tree consisting of tests of features: each branch represents an outcome of a test, and each leaf node represents a class label. Random forest extends the concept of decision trees by building a multitude of them and outputting the class that is the mode of the classes of the individual trees. For these two methods, the rules of classification are transparent and highly interpretable. Regarding the two other methods evaluated in this work, a Support Vector Machine constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification. Neural networks try to mimic the human brain in terms of how information is passed and analysed. A neural network is based on a collection of connected units or nodes called neurons. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal processes it and can signal the neurons connected to it. Neurons are grouped in layers, and as a signal passes through the layers it is transformed by the neurons until, at the final (output) layer, the class membership is decided; this arrangement constitutes the architecture of the neural network. Contrary to decision trees and random forests, Support Vector Machines and neural networks are not easily interpretable in terms of the importance of features [25].
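To make the above concrete, the following is a minimal, illustrative sketch of fitting the four classifier families used in this work with Scikit-learn; the synthetic data produced by make_classification is a hypothetical stand-in for the employee feature matrix, and the hyperparameters shown are illustrative defaults, not the tuned values reported later.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the employee feature matrix and hierarchy labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
    "neural network": MLPClassifier(solver="lbfgs", max_iter=1000, random_state=42),
    "svm": SVC(kernel="poly", random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)               # learn from labeled examples
    print(name, model.score(X_test, y_test))  # accuracy on unseen data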
What links all of these methods is that they usually require the samples to follow the IID principle, namely to be independent and identically distributed. Unfortunately, this is not always the case, especially when we consider network-related data. For instance, in social networks, people tend to cluster in groups of similar interests [26] or change their opinion based on others' opinions [27]. In this case it is hard to consider the samples as IID, so another set of approaches has been developed: collective classification, which tries to make use of the networked structure of the data [28].
Another problem in classification is that instances are rarely distributed equally over all the classes to be classified. This problem is referred to as imbalanced data, and there are multiple techniques for tackling it, mainly belonging to one of two groups: under-sampling and over-sampling [29,30]. In under-sampling, the dominant classes are reduced until they are represented equally with the previously under-represented classes. Conversely, over-sampling generates synthetic instances of the under-represented classes, leading to more balanced data. One of the most prominent over-sampling techniques is SMOTE, which finds nearest neighbors by the Euclidean distance between data points in feature space and performs vector operations on them to generate new data points [31,32]. This technique provides more representative and less biased samples compared to random over-sampling.
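As an illustration, the sketch below over-samples an imbalanced toy dataset with SMOTE; it assumes the imbalanced-learn package, a common Python implementation of the technique that is not named in this paper.

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Deliberately imbalanced toy data: roughly 10% of samples in class 0.
X, y = make_classification(n_samples=300, weights=[0.1, 0.9], random_state=42)
print("before:", Counter(y))
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))  # both classes equally represented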
Section 2.6 specifies which classification algorithms have been used in this work, and Section 2.7 contains more information on how collective classification was used for discovering the organizational structure from the social network.

2.2. Social Network Analysis

The field of social network analysis can be understood as a set of techniques for deriving knowledge about humans based on the relations they form, usually by being members of social networks of different kinds. These networks can relate to family, friends, the companies or organizations they are members of, or the social media they participate in. More formally, a social network consists of a finite set or sets of actors and the relation or relations defined on them [33]. To help understand this definition, some other fundamental concepts should be explained. An actor is a discrete individual, corporate or collective social unit [33]. This can be a person in a group of people, a department within a company or a nation in the world system. Actors are linked to each other by social ties, and these relations are the core of the social network approach. Social networks are presented using graph structures, where nodes are actors and edges are the connections between them. Hence, all graph theory methods and measures can be applied. A graph may be undirected, which means that there is no distinction between the two vertices associated with each edge, or its edges may be directed from one vertex to another. A graph is usually defined as an ordered pair $G := (V, E)$, where $V$ is the set of vertices (nodes) and $E$ is the set of edges. On top of that, multiple methods are used to derive knowledge about network members or the network itself, such as centrality measures [34], community detection [35], modelling evolution [36], or detection of influential nodes [37]. As described in Section 1, social network analysis has also become present in organizations, where it is often referred to as organizational network analysis [38].

2.3. Datasets

In this work we evaluated two datasets, each containing both metadata on communication and the organizational structure of a company. This allowed the features extracted from the social network built on the communication data to be used as input for classifiers. These classifiers were then used to distinguish the level of an employee in the corporate hierarchy. A detailed description of the datasets is presented below.

2.3.1. Manufacturing Company

The analyzed dataset contains a nine-month exchange of messages among the clerical employees of a manufacturing company located in Poland [3]. The dataset consists of two files: the first contains the company hierarchy, the second stores the communication history. The company has a three-level hierarchy: the first management level, the second management level and regular employees. The file with emails contains senders and recipients, as well as the date and time of the sent messages. Emails from former employees and technician accounts are also included in this file; due to the lack of data about their supervisors, these accounts were removed from further research. The final organizational structure to be analyzed is shown in Figure 1 and in Table 1. It is important to note that the dataset does not contain any correspondence with anyone outside the company; moreover, the company structure remained stable within the considered period and did not undergo any changes.

2.3.2. Enron

The second analyzed dataset comes from the Enron company [1]. Enron was a large American energy corporation founded in 1985, which became infamous at the end of 2001 due to financial fraud. During the investigation, the dataset was made public; however, the organizational hierarchy has never been officially confirmed. Despite this limitation, the Enron email corpus has become the subject of many studies, which allowed the company's structure to be partially reconstructed. The authors of this paper decided to use a processed version of this dataset which already includes the positions assigned to the employees. There is a seven-level hierarchy in this dataset; however, to reduce the complexity of this structure, the authors propose the more generic three-level hierarchy shown in Table 2, the same as in the manufacturing company dataset. This approach allowed for a better distinction of managerial and executive positions from regular employees. The analyzed period covers over three years of messages; due to limited knowledge about internal company processes, the authors assumed that the organizational structure was stable throughout it.
Both datasets are available for evaluation or further research, see Supplementary Materials.

2.4. Network

The network was built using the email exchanges of its members, where the nodes were employees and the edges were the messages. It was decided to use a directed graph defined as follows: a social network is a tuple $SN = (V, E)$, where $V = \{v_1, \ldots, v_n\}$, $n \in \mathbb{N}^+$, is the set of vertices and $E = \{e_1, \ldots, e_{k_e}\}$, $k_e \in \mathbb{N}^+$, is the set of edges between them. Each vertex $v_i \in V$ represents an individual $v_i^e$ and each edge $e_{ij}$ corresponds to the directed social relationship from $v_i$ to $v_j$, such that $E = \{(v_i, v_j, w_{ij}) : v_i \in V, v_j \in V, v_i = v_i^e, v_j = v_j^e \text{ and } w_{ij} \in [0, 1]\}$. The edge weights are defined according to the following formula:
$$w_{ij} = \frac{e_{ij}}{e_i},$$
where $e_{ij}$ is the number of messages sent from node $i$ to node $j$ and $e_i$ is the total number of messages sent from node $i$. All self-loops were removed.
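The construction of such a network can be sketched as follows; this is an illustrative Python/NetworkX fragment operating on a hypothetical message list, not the authors' code.

from collections import Counter
import networkx as nx

# Hypothetical (sender, recipient) records; the last one is a self-loop.
messages = [("alice", "bob"), ("alice", "bob"), ("alice", "carol"),
            ("bob", "alice"), ("carol", "carol")]

counts = Counter((s, r) for s, r in messages if s != r)  # drop self-loops
sent_totals = Counter()
for (s, _), n in counts.items():
    sent_totals[s] += n

G = nx.DiGraph()
for (s, r), n in counts.items():
    G.add_edge(s, r, weight=n / sent_totals[s])  # w_ij = e_ij / e_i

print(G["alice"]["bob"]["weight"])  # 2/3: alice sent 2 of her 3 messages to bob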
Figure 2 depicts the weighted directed network built from the e-mail communication in the manufacturing company. What can be noticed is that the positions of the first-level and second-level management are not always central in this network. As a consequence, using centrality measures alone would not be enough for detecting positions in the organizational hierarchy. The reasons why there is no direct correlation between a position in the social network and in the organizational hierarchy can be of many kinds: for example, management positions may not require intense communication, managers may use different forms of communication, or they may have supporting personnel communicating on their behalf.

2.5. Features

In the created social network, the centrality measures presented below have been calculated as input features for classification algorithms. These measures are also briefly described in Table 3.
  • indegree centrality:
    $C_{IN}(v_i) = |\{e_{ji} \in E\}|, \; j \neq i$,
    where $e_{ji}$ is an edge going from any node $v_j$ to the evaluated node $v_i$.
  • outdegree centrality:
    $C_{OUT}(v_i) = |\{e_{ij} \in E\}|, \; i \neq j$,
    where $e_{ij}$ is an edge going from the evaluated node $v_i$ to any other node $v_j$ in the network.
  • betweenness centrality:
    $C_B(v_i) = \sum_{v_s \neq v_i \neq v_d} \frac{\sigma_{v_s v_d}(v_i)}{\sigma_{v_s v_d}}$,
    where $\sigma_{v_s v_d}(v_i)$ is the number of shortest paths between nodes $v_s$ and $v_d$ passing through node $v_i$ and $\sigma_{v_s v_d}$ is the number of all shortest paths between $v_s$ and $v_d$.
  • closeness centrality:
    $C_C(v_i) = \frac{N}{\sum_{v_y} d(v_y, v_i)}$,
    where $N$ is the number of vertices in the network and $d(v_y, v_i)$ is the distance between vertices $v_y$ and $v_i$.
  • eigenvector centrality:
    $C_E(v_i) = \frac{1}{\lambda} \sum_{k} a_{v_k, v_i} C_E(v_k)$,
    where $A = (a_{i,j})$ is the adjacency matrix of the graph and $\lambda \neq 0$ is a constant.
  • page rank:
    $C_{PR}(v_i) = \alpha \sum_{k} \frac{a_{v_k, v_i}}{d_k} C_{PR}(v_k) + \beta$,
    where $\alpha$ and $\beta$ are constants and $d_k$ is the out-degree of node $v_k$ if such degree is positive, or $d_k = 1$ if the out-degree of node $v_k$ is null. Again, $A = (a_{i,j})$ is the adjacency matrix of the graph.
  • hub centrality:
    $C_{HUB}(v_i) = \beta \sum_{k} a_{v_i, v_k} C_{AUT}(v_k)$,
    where $A = (a_{i,j})$ is the adjacency matrix of the graph, $C_{AUT}(v_k)$ is the authority centrality of a node (see Equation (8)), and $\beta$ is a constant.
  • authority centrality:
    $C_{AUT}(v_i) = \alpha \sum_{k} a_{v_k, v_i} C_{HUB}(v_k)$,
    where $A = (a_{i,j})$ is the adjacency matrix of the graph, $C_{HUB}(v_k)$ is the hub centrality of a node (see Equation (7)), and $\alpha$ is a constant.
Moreover, a local clustering coefficient was calculated for each node, which captures the density of connections between its neighbors, as well as two additional features related to cliques:
$$C_{CC}(v_i) = \frac{2 m_{v_i}}{k_{v_i}(k_{v_i} - 1)},$$
where $m_{v_i}$ is the number of pairs of neighbors of node $v_i$ that are connected. In the formula it is related to the number of possible pairs of neighbors of node $v_i$, which is $\frac{k_{v_i}(k_{v_i} - 1)}{2}$, where $k_{v_i}$ is the degree of node $v_i$.
A clique is defined as a fully connected subgraph, which means that each node has directed links to all other nodes in the clique. The first feature is the total number of cliques to which an employee belongs; the second is the size of the biggest clique containing the given node. Reference [40] contains more details on all the measures introduced above.
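For illustration, the sketch below computes the measures listed above with NetworkX on a toy directed network; as an assumption of this sketch, the clique features are computed on the undirected projection of the graph.

import networkx as nx

# Toy directed network standing in for the e-mail graph.
G = nx.DiGraph([("alice", "bob"), ("bob", "carol"),
                ("carol", "alice"), ("alice", "carol")])

features = {
    "indegree": nx.in_degree_centrality(G),
    "outdegree": nx.out_degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
    "pagerank": nx.pagerank(G),
    "clustering": nx.clustering(G),
}
features["hub"], features["authority"] = nx.hits(G)

# Clique features on the undirected projection.
cliques = list(nx.find_cliques(G.to_undirected()))
features["number of cliques"] = {v: sum(v in c for c in cliques) for v in G}
features["biggest clique"] = {
    v: max((len(c) for c in cliques if v in c), default=0) for v in G}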
The next features were based on neighborhood variability, which is determined in three ways: sent neighborhood variability, received neighborhood variability and general neighborhood variability. Overall, neighborhood variability is defined as the difference between the sets of neighbors with which a given node communicates in two consecutive months. Sent neighborhood variability considers the set of neighbors to which the given node was sending messages. Received neighborhood variability looks at the set of neighbors from which the given node had been receiving messages. General neighborhood variability uses the set of neighbors with which the node communicates, without distinguishing between sending and receiving messages. The Jaccard coefficient was used for calculating the difference between sets; the coefficient takes values between 0 and 1, where 0 means totally different sets and 1 means identical sets. The Jaccard coefficient was calculated for each pair of consecutive active months: if the employee had not been active in the directly following month, the nearest next month was considered. For example, if an employee was active in January, inactive in February, and active again in March, the coefficient would be calculated between the sets of neighbors in January and March. Finally, the neighborhood variability of each node was calculated as the average of these partial Jaccard coefficients. Formally, the sent variability measure can be defined as follows:
$$VAR_{SNT}(v_i) = \frac{|N_{snt}^{v_i, m-1} \cap N_{snt}^{v_i, m}|}{|N_{snt}^{v_i, m-1} \cup N_{snt}^{v_i, m}|},$$
where $N_{snt}^{v_i, m-1}$ is the set of neighbours that a certain node $v_i$ sent messages to in month $m-1$ and $N_{snt}^{v_i, m}$ is the set of neighbours that node $v_i$ sent messages to in month $m$ or, if no messages have been sent in $m$, then $m+1, m+2, \ldots, m_{max}$ are considered. Similarly, the received and general neighbourhood variability measures can be defined by substituting the sets of neighbours with the neighbours that either sent messages to node $v_i$ (received variability) or the set of neighbours with whom contact occurred in any direction and involved $v_i$ (general variability).
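A minimal sketch of the sent neighbourhood variability under the rules above (an inactive month is skipped in favour of the nearest following active month) could look as follows; monthly_recipients is a hypothetical input mapping month indices to the sets of recipients.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def sent_variability(monthly_recipients):
    # Keep only months with outgoing messages, so an inactive month is
    # skipped in favour of the nearest next active one.
    active = sorted(m for m, s in monthly_recipients.items() if s)
    scores = [jaccard(monthly_recipients[m1], monthly_recipients[m2])
              for m1, m2 in zip(active, active[1:])]
    return sum(scores) / len(scores) if scores else 0.0

# Active in January and March, inactive in February:
print(sent_variability({1: {"bob", "carol"}, 2: set(), 3: {"bob"}}))  # 0.5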
Furthermore, features such as the number of weekends worked and the amount of overtime were taken into account. For overtime, work between 4:00 PM and 6:00 AM was considered. It should be mentioned that overtime was calculated only for the manufacturing company dataset: the dates in the Enron dataset were given in the POSIX format, and since Enron had branches located in different timezones, it was impossible to know which timezone should be applied.
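These two activity features can be sketched as follows; the timestamp list is a hypothetical input, and the overtime window follows the 4:00 PM to 6:00 AM definition above.

from datetime import datetime

# Hypothetical send times of one employee's messages.
timestamps = [datetime(2020, 1, 4, 10, 0),   # Saturday, 10:00 AM
              datetime(2020, 1, 6, 17, 30)]  # Monday, 5:30 PM

weekend_messages = sum(t.weekday() >= 5 for t in timestamps)
overtime_messages = sum(t.hour >= 16 or t.hour < 6 for t in timestamps)
print(weekend_messages, overtime_messages)  # 1 1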
In summary, all the features used for classification are presented in Table 3.

2.6. Classification

The classification task was carried out using the Decision Tree [22], Random Forest, Neural Network (a multi-layer perceptron with the L-BFGS solver) and SVM (with the polynomial kernel) algorithms, for different setups of the following experimental parameters: the number of recognized employee groups, the minimum number of active months, and the percentage of used features.
The first parameter refers to the previously mentioned three-level hierarchy of employees, which can also be flattened to only two levels: the management level and regular employees. The experiment was run with two values of this parameter to see how the performance of the algorithms varies when recognizing two and three groups of employees.
The second parameter captures how the activity of a person may influence the result of the classification; it was examined whether a higher minimum number of months of employee activity corresponds with better results. The assumption is that some patterns of behavior require more time to be revealed, so the classification was run five times, starting with a one-month minimum activity and ending with a five-month minimum activity. For each value, the network had to be recreated and the features calculated again, as some nodes were eliminated from the network.
The third parameter examines the impact of eliminating the most significant features. For this parameter, the experiment was carried out nine times, starting from all features down to only ten percent of the features, with a continual decrease of ten percent. The importance of features for the Decision Tree and Random Forest algorithms was determined based on the Gini importance obtained from a previously trained model. The Neural Network and SVM algorithms are not so easily interpretable, and the importance of the features cannot be obtained from the model itself; in this case, the importance of the features must be determined before training. For this purpose, the univariate feature selection method based on the chi-squared test was used.
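For the chi-squared ranking, a minimal Scikit-learn sketch is shown below; the min-max scaling is an assumption of this sketch (chi2 requires non-negative inputs), as the paper does not state the scaling used.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_scaled = MinMaxScaler().fit_transform(X)  # chi2 needs non-negative values
selector = SelectKBest(chi2, k=5).fit(X_scaled, y)
print(selector.get_support(indices=True))   # indices of the top 5 features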
In the analyzed manufacturing company dataset, there was a problem with the unbalanced size of classes, which is common for a company structure in which the group of regular employees is the most numerous and the management levels have fewer members. The Enron dataset is much more balanced in each group, which may indicate a different management model in this company. To handle this problem, over-sampling was used: the SMOTE algorithm matched the size of all minority classes to the size of the majority class of regular employees. To prevent data leakage, over-sampling was performed only on the training set.
In general, for each combination of the above parameters, a model was trained using grid search with 5-fold cross-validation, so all possible combinations from the range of given values were tested, and the best one with respect to the macro-averaged f-score was returned. The hyperparameter search space is shown in Table A21, and the best hyperparameters for each model in Table A22, Table A23, Table A24 and Table A25.
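A minimal sketch of this model-selection step with Scikit-learn follows; the Random Forest parameter grid shown here is a hypothetical example, not the search space of Table A21.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      cv=5, scoring="f1_macro")  # 5-fold CV, macro f-score
search.fit(X, y)
print(search.best_params_, search.best_score_)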

2.7. Collective Classification

Collective classification is a different way of revealing a company hierarchy, from a graph perspective. This approach uses the connections between nodes to propagate labels within the whole network. Loopy belief propagation is an example of collective classification, described in detail in References [28,41]. In this paper, a simplified version of this kind of algorithm is introduced and compared with standard classification algorithms.
The proposed collective classification method is presented as Algorithm 1. The first step of this algorithm is choosing a utility score and sorting all nodes according to it (line 1). The utility score can be one of the calculated features from the previous section. The next step is to reveal the labels of the given percentage of nodes of each class $l_i \in L$ with the highest utility score (line 2). These nodes are marked as known (their labels are constant) and labeled $V_L$, whereas the other nodes are treated as unknown $V_{UK}$ and unlabeled. Then the propagation of labels proceeds in a loop until the stop condition is met or the number of iterations exceeds the given maximum. In one iteration, each labeled node sends a message to all of its neighbors, treating the edges as undirected (line 5); all labels received in a given iteration are saved for each node $v_i$ in a counter $c_{v_i}$ (line 6). The label update begins only after all nodes have sent a message to their neighbors, so the sending order does not affect the result. If node $v_i$ has received one label more often than the others (line 13), this label is assigned to it (line 14) and the node is subsequently treated as labeled, $v_i \in V_L$ (line 15); otherwise, the counter $u_{v_i}$ of this node is increased (line 18). If $u_{v_i}$ exceeds its maximum value (line 20), it is reset (line 26) and the node is assigned the label with the highest position in the company hierarchy among the labels with the highest count (lines 22 to 24). At the end of each iteration, the stop condition is checked: it is determined from the difference between the previous and current sets of labels, so if the Jaccard coefficient is greater than the given minimum Jaccard value and all nodes have an assigned label, the algorithm ends (line 29). Additionally, in the case of unbalanced classes, the algorithm allows defining a threshold: during the phase of counting how many times each label was received by a node, the count for the majority class is divided by this threshold to prevent domination of this class (line 11).
The collective classification algorithm was run with three parameters: the number of recognized employee groups, the minimum number of active months, and the percentage of known nodes. The first two are identical to the parameters from the previous section; the last one determines the percentage of known (labeled) nodes. Nine values of this parameter were used, from 90% down to 10% in steps of ten percent. The manufacturing company dataset required setting the threshold, in contrast to the Enron dataset, where it was not necessary. Additionally, to find the best utility score, the experiment was carried out for all calculated features, and the best Jaccard value and threshold were chosen from a range of different values. The hyperparameter search space is shown in Table A26, and the best hyperparameters for each model in Table A27, Table A28, Table A29, Table A30 and Table A31.
Algorithm 1 Collective Classification Algorithm.
1:  sort nodes descending by utility score
2:  assign given percentage of top v_i ∈ V to V_L for each label
3:  repeat
4:    // perform message passing
5:    for each edge (v_i, v_j) ∈ E, v_i ∈ V_L, v_j ∈ V_UK do
6:      c_{v_j}(l_{v_i}) ← c_{v_j}(l_{v_i}) + 1
7:    end for
8:    // perform label update
9:    for each node v_i ∈ V_UK do
10:     if l_{v_i} is a majority class then
11:       c_{v_i}(l_{v_i}) ← c_{v_i}(l_{v_i}) / threshold
12:     end if
13:     if there exists only one label with the highest count for node v_i then
14:       l_{v_i} ← l : max_{l ∈ L} c_{v_i}(l)
15:       assign v_i to V_L
16:       u_{v_i} ← 0
17:     else
18:       u_{v_i} ← u_{v_i} + 1
19:     end if
20:     if the maximum value of u_{v_i} has been reached then
21:       // get the set of labels with the highest count
22:       L_max ← l : max_{l ∈ L} c_{v_i}(l)
23:       // get the label with the highest position in the hierarchy (smaller is higher)
24:       l_{v_i} ← l : min L_max
25:       assign v_i to V_L
26:       u_{v_i} ← 0
27:     end if
28:   end for
29: until stop condition
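For readers who prefer code to pseudocode, the following is a condensed, illustrative Python sketch of Algorithm 1; it is not the authors' implementation, and details such as resetting the message counters every iteration and the exact stop threshold are assumptions of this sketch. Labels are integers in which a smaller value denotes a higher position in the hierarchy.

import networkx as nx

def collective_classification(G, labels, utility, known_frac=0.5,
                              majority=None, threshold=1.0,
                              max_unresolved=3, max_iter=100):
    H = G.to_undirected()
    # Lines 1-2: per class, reveal the top fraction of nodes by utility score.
    assigned, known = {}, set()
    for lbl in set(labels.values()):
        members = sorted((v for v in H if labels[v] == lbl),
                         key=lambda v: utility[v], reverse=True)
        for v in members[:max(1, int(known_frac * len(members)))]:
            assigned[v] = lbl
            known.add(v)
    unresolved = {v: 0 for v in H}
    for _ in range(max_iter):
        prev = set(assigned.items())
        # Lines 4-7: labeled nodes send their label to unlabeled neighbors.
        counts = {v: {} for v in H if v not in known}
        for v in known:
            for u in H.neighbors(v):
                if u not in known:
                    counts[u][assigned[v]] = counts[u].get(assigned[v], 0) + 1
        # Lines 8-28: label update after all messages have been sent.
        for v, c in counts.items():
            if not c:
                continue
            if majority in c:                        # line 11
                c[majority] /= threshold
            best = max(c.values())
            winners = [l for l, n in c.items() if n == best]
            if len(winners) == 1:                    # lines 13-16
                assigned[v] = winners[0]
                known.add(v)
                unresolved[v] = 0
            else:                                    # lines 17-19
                unresolved[v] += 1
                if unresolved[v] >= max_unresolved:  # lines 20-27
                    assigned[v] = min(winners)       # highest hierarchy position
                    known.add(v)
                    unresolved[v] = 0
        # Line 29: stop once labels stabilize and every node is labeled.
        cur = set(assigned.items())
        jac = len(prev & cur) / len(prev | cur) if prev | cur else 1.0
        if jac > 0.99 and len(assigned) == len(H):
            break
    return assigned

# Tiny usage example on a toy graph with two classes.
G = nx.path_graph(6)
labels = {v: 0 if v < 3 else 1 for v in G}
utility = {v: G.degree(v) for v in G}
print(collective_classification(G, labels, utility, majority=0))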

3. Results

The problem that is tackled can be considered a binary classification for two groups of employees and a multiclass classification for three groups. The macro-averaged f-score was therefore used to evaluate the solution, as a single metric was needed to compare both cases. This measure can handle both of the above settings; moreover, as noted in Reference [42], it copes well with the problem of unbalanced classes. Its biggest advantage is the equal treatment of all classes, which means that the result is not dominated by a majority class.
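As a brief illustration of the metric, Scikit-learn's macro-averaged f-score gives each class equal weight regardless of its size:

from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 2]   # class 0 dominates
y_pred = [0, 0, 0, 1, 1, 2]
# Per-class f-scores are averaged without weighting by class frequency.
print(f1_score(y_true, y_pred, average="macro"))  # approx. 0.84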
The results for the manufacturing company dataset are shown in Figure 3, Figure 4 and Figure 5. The f-score macro average for randomly assigned labels is around 0.42 for two levels of the hierarchy and 0.24 for three levels; in comparison, the best result for two levels was 0.7768, obtained by Random Forest, and for three levels 0.4737, achieved by Decision Tree. The much higher score obtained for the two-group classification can be explained by the unbalanced classes. The classification into three groups got worse results because of the small number of samples, which was insufficient for distinguishing between the two levels of management, even when over-sampling was used. Furthermore, Random Forest got slightly better results, especially for two groups of employees. A peculiar phenomenon can be observed in that removing the most important features occasionally leads to a better result, meaning that there could be some noise among the features which affects the decision boundary. A potential explanation of this phenomenon might be related to the problem described in Reference [2]: the observed alteration could stem from the fact that the hierarchy arising from daily duties does not converge with the company structure on paper. This inconsistency could be the source of some noise in the used features, which influences the obtained result; therefore, changing the network structure by eliminating some nodes, as well as removing the most important features, could move the decision boundary. It is also noticeable that the minimum employee activity parameter has an impact on the classification, but it is difficult to indicate its best value because no clear pattern is visible; however, most of the best results are obtained for a minimum activity greater than one month. The best results for both two and three groups of employees were obtained by the collective classification algorithm, which was able to classify nodes even when more than half of the labels were unknown.
Figure 6, Figure 7 and Figure 8 present the results for the Enron dataset, which are similar to the results for the previous dataset. The result obtained with random labels equaled 0.49 for two levels of the hierarchy and 0.33 for three levels. The best f-score for the supervised learning methods was achieved by the Random Forest algorithm: 0.8198 for two hierarchy levels and 0.6423 for three levels. The results of the collective classification algorithm were higher than those of the standard classification when the knowledge of node labels was over 70%. Below this value, the results were similar to the standard classification; moreover, for three groups, if the knowledge of nodes fell below 40%, the results deteriorated significantly. It is visible that an excessive reduction of features or known nodes leads to results close to randomness. Furthermore, the similarity of the results is important because it shows that the presented solution works well for various organizational management models: in the manufacturing company the majority of the employees are regular clerical workers, in contrast to the small management group, whereas in the Enron dataset the situation is the opposite, and the ratio between the first and second management levels and regular employees is balanced.
As a summary, the best results obtained by supervised learning algorithms are presented in Table 4. Moreover, all numerical results can be found in Appendix A in Table A1, Table A2, Table A3, Table A4, Table A5, Table A6, Table A7, Table A8, Table A9, Table A10, Table A11, Table A12, Table A13, Table A14, Table A15, Table A16, Table A17, Table A18, Table A19 and Table A20.
Interesting conclusions about the trained models can be drawn from Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, Figure A7, Figure A8, Figure A9, Figure A10, Figure A11 and Figure A12, which present the importance of the features for the models that used the full set of features. First of all, it should be noticed that the Decision Tree and Random Forest use many features; however, none of the features stands out significantly in terms of importance. Nevertheless, the clustering coefficient could be highlighted for the manufacturing company and indegree centrality for the Enron dataset, because in many cases these features have slightly higher importance than the others. For the Neural Network and SVM algorithms, the total number of cliques is visibly the best one for both datasets. Moreover, sent and received neighborhood variability are also significant in some cases, which shows that not only network centrality measures but also features created from employees' behavior can be important in the classification task. A common element for all algorithms is that, for all parameters of the experiment, the worst feature is always the eigenvector centrality. Furthermore, Table A27, Table A28, Table A29, Table A30 and Table A31 show the best utility score depending on different combinations of the experiment parameters. Unlike for the supervised methods, it is difficult to identify the most discriminating feature for collective classification, because a wide range of them is used as a utility score.

4. Discussion

The comparison of the results for the standard classification algorithms and collective classification shows that the former cope better with balanced data such as the Enron dataset. The conducted experiments also show that Decision Tree and Random Forest, as well as Neural Network and SVM, were able to obtain similar results. However, in an organizational environment, the interpretability of the results may be highly valued; therefore, the first two algorithms are a better solution if one wants to understand deeply the communication behavior at different levels of the organizational hierarchy. Furthermore, the presented collective classification algorithm obtained better results in the case of the unbalanced dataset. These results indicate that the graph algorithm is able to reduce the impact of majority classes and predict well, in contrast to the standard classification algorithms. Future research should examine whether this impact of unequal classes is also associated with some hidden network characteristics. Furthermore, it was shown that the minimum length of the period in which an employee was active influences the result; however, the ideal value of the minimum activity depends on the analyzed dataset, as well as on the other parameters.
A network created from email communication may vary from organization to organization, especially with regard to the size of the company and its management model: a big international company with many employees and a complicated hierarchy could create a social network with totally different properties than a small startup with a handful of employees and a simple structure. Moreover, in some companies email is only one of many ways of passing messages, so a dataset of emails does not have to contain full information about the connections between employees. These differences could mean that some patterns of behavior attributed to a specific level of the hierarchy do not appear in the constructed social network.
Future work should focus on the study of communication coming from different types of companies; moreover, further research should discover which organizational structures provide the best classification results. Better results could be an implication of some hidden graph properties corresponding to the way an organization is managed, so future studies should also focus on examining the network structure and revealing its characteristics. The biggest problem may be obtaining data for research, due to the fact that the internal communication of a company is confidential and has to be anonymized before being shared. Another interesting approach is the attempt to use graph embeddings instead of conventional features as the input to supervised learning algorithms: this way, the properties of nodes are encoded in the form of vectors, making them more suitable as direct input to the algorithms. Regarding collective classification, instead of using particular features as the utility score, latent Dirichlet allocation could be used to create a utility score combining features. Lastly, the reader may notice that the social network used in this study was an aggregated one. This was mainly due to the fact that the organizational structure of the manufacturing company did not undergo any changes in the period covered by the dataset, and in the case of Enron, the structure was inferred from e-mails and no other information was known. However, before applying the proposed approach in organizations, it would be advisable to verify the capabilities of a temporal approach, both in the area of measures and of networks.

Supplementary Materials

The manufacturing company email dataset, along with the corporate hierarchy, has been published by the authors of this manuscript at Harvard Dataverse, see https://doi.org/10.7910/DVN/6Z3CGX. The Enron dataset is available from Reference [1].

Author Contributions

Conceptualization, R.M. and M.N.; methodology, R.M. and M.N.; software, M.N.; validation, R.M.; investigation, R.M. and M.N.; data curation, R.M. and M.N.; writing—original draft preparation, M.N. and R.M.; writing—review and editing, R.M.; visualization, M.N.; supervision, R.M.; project administration, R.M.; funding acquisition, R.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Science Centre, Poland, project no. 2015/17/D/ST6/04046, as well as by the European Union's Horizon 2020 research and innovation programme under grant agreement no. 691152 (RENOIR) and the Polish Ministry of Science and Higher Education fund for supporting internationally co-financed projects in 2016–2019 (agreement no. 3628/H2020/2016/2).

Acknowledgments

The authors of this manuscript would like to thank Piotr Bródka for his valuable remarks. Moreover, we would like to express our gratitude to the Reviewers of the manuscript for their valuable feedback.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SNA     Social Network Analysis
IID     Independent and Identically Distributed
SVM     Support Vector Machine
L-BFGS  Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm

Appendix A

Table A1. F-score macro average for the Decision Tree classification of two levels of the hierarchy in the manufacturing company dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Used Features
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%      100%
1 month          0.5472   0.5270   0.5987   0.5368   0.5641   0.6130   0.5850   0.6422   0.6092   0.6568
2 months         0.5040   0.5360   0.5473   0.5938   0.6464   0.6417   0.6627   0.7039   0.6476   0.6642
3 months         0.5989   0.5336   0.5822   0.5312   0.5678   0.6391   0.5772   0.6230   0.6162   0.6642
4 months         0.6502   0.4818   0.5170   0.5315   0.5660   0.5848   0.5898   0.6234   0.6388   0.6398
5 months         0.6423   0.5622   0.6083   0.5826   0.6174   0.6387   0.6094   0.6231   0.6640   0.6388
Table A2. F-score macro average for the Random Forest classification of two levels of the hierarchy in the manufacturing company dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Used Features
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%      100%
1 month          0.4899   0.5789   0.5717   0.5786   0.5742   0.6426   0.5766   0.6290   0.6401   0.7146
2 months         0.6421   0.5311   0.5075   0.5857   0.6793   0.6542   0.6811   0.6464   0.6528   0.6963
3 months         0.4977   0.6228   0.5785   0.5849   0.5933   0.5718   0.6024   0.6054   0.6532   0.7111
4 months         0.5647   0.5104   0.5657   0.5928   0.6337   0.5848   0.6336   0.6164   0.6683   0.6895
5 months         0.6423   0.6000   0.5176   0.5923   0.6493   0.6533   0.6981   0.7425   0.7768   0.7057
Table A3. F-score macro average for the Neural Network classification of two levels of the hierarchy in the manufacturing company dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Used Features
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%      100%
1 month          0.4848   0.4635   0.5349   0.4842   0.5356   0.5595   0.5577   0.5760   0.5680   0.5627
2 months         0.5303   0.4910   0.5674   0.5532   0.5391   0.5814   0.5388   0.5986   0.6241   0.6187
3 months         0.5270   0.5210   0.5493   0.5544   0.5586   0.5798   0.5304   0.6144   0.6144   0.6247
4 months         0.4497   0.4725   0.4909   0.4845   0.5412   0.5906   0.5500   0.6013   0.5826   0.6037
5 months         0.4946   0.5275   0.5601   0.6112   0.5486   0.5431   0.5534   0.6042   0.5919   0.6000
Table A4. F-score macro average for the SVM classification of two levels of the hierarchy in the manufacturing company dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Used Features
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%      100%
1 month          0.4209   0.4703   0.5128   0.5236   0.4675   0.4998   0.4997   0.5711   0.5802   0.5977
2 months         0.4499   0.5461   0.5922   0.5609   0.5368   0.4896   0.5047   0.6087   0.6142   0.6245
3 months         0.4364   0.4896   0.5917   0.5186   0.5149   0.5027   0.4915   0.6018   0.6218   0.6517
4 months         0.4223   0.4338   0.5783   0.5416   0.5312   0.4873   0.5012   0.6001   0.6134   0.6250
5 months         0.4379   0.4493   0.5703   0.5722   0.5441   0.5363   0.5274   0.6053   0.6139   0.6501
Table A5. F-score macro average for the collective classification of two levels of the hierarchy in the manufacturing company dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Known Nodes
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%
1 month          0.7389   0.7347   0.7876   0.8402   0.8899   0.8937   0.9005   1.0000   1.0000
2 months         0.7067   0.8172   0.8023   0.8613   0.8895   0.9253   0.9466   1.0000   1.0000
3 months         0.6834   0.8168   0.8290   0.8610   0.8895   0.9253   0.9463   1.0000   1.0000
4 months         0.7340   0.8163   0.8019   0.8790   0.8891   0.9253   0.9463   1.0000   1.0000
5 months         0.7233   0.8031   0.8010   0.8786   0.8887   0.9250   0.9463   1.0000   1.0000
Table A6. F-score macro average for the Decision Tree classification of three levels of the hierarchy in the manufacturing company dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Used Features
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%      100%
1 month          0.2910   0.2924   0.3685   0.3319   0.3992   0.3564   0.3835   0.4737   0.4071   0.4585
2 months         0.2766   0.2950   0.3541   0.3416   0.3670   0.3550   0.3981   0.4058   0.4057   0.4390
3 months         0.3425   0.3378   0.3239   0.3484   0.3404   0.3928   0.3903   0.3993   0.4136   0.4100
4 months         0.3024   0.2904   0.3041   0.3413   0.4014   0.3582   0.3755   0.3923   0.4156   0.4208
5 months         0.3373   0.4064   0.4089   0.3723   0.3674   0.4091   0.3790   0.3651   0.3852   0.3999
Table A7. F-score macro average for the Random Forest classification of three levels of the hierarchy in the manufacturing company dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Used Features
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%      100%
1 month          0.2872   0.3867   0.4075   0.4065   0.3954   0.3886   0.3780   0.4312   0.4380   0.4374
2 months         0.3137   0.3742   0.3389   0.2981   0.3834   0.4575   0.3510   0.4149   0.4420   0.4363
3 months         0.3122   0.3367   0.3705   0.3667   0.3839   0.3899   0.3935   0.3990   0.4435   0.4064
4 months         0.3520   0.3599   0.3347   0.3822   0.3917   0.3360   0.3407   0.3810   0.3852   0.4163
5 months         0.3720   0.3160   0.3778   0.3362   0.3925   0.3850   0.3571   0.4133   0.4249   0.4048
Table A8. F-score macro average for the Neural Network classification of three levels of the hierarchy in the manufacturing company dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Used Features
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%      100%
1 month          0.2838   0.3090   0.3273   0.3090   0.3494   0.3465   0.3090   0.4138   0.3520   0.4149
2 months         0.2957   0.3886   0.3197   0.3383   0.2818   0.3306   0.3097   0.3950   0.3959   0.3567
3 months         0.2662   0.3848   0.2898   0.3365   0.2814   0.3084   0.3084   0.3785   0.3506   0.3241
4 months         0.2660   0.3081   0.2836   0.3417   0.2965   0.3081   0.3081   0.3576   0.3517   0.3505
5 months         0.3093   0.3077   0.3419   0.3600   0.3291   0.4014   0.3574   0.3736   0.3077   0.3513
Table A9. F-score macro average for the SVM classification of three levels of the hierarchy in the manufacturing company dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Used Features
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%      100%
1 month          0.2449   0.3055   0.4085   0.0962   0.2835   0.2667   0.2941   0.3752   0.3669   0.2960
2 months         0.2493   0.3147   0.3279   0.3637   0.4203   0.2789   0.2681   0.3474   0.3286   0.3389
3 months         0.2495   0.3218   0.3666   0.3597   0.3984   0.2504   0.2588   0.3497   0.3881   0.3265
4 months         0.2615   0.3177   0.3715   0.3530   0.4622   0.2558   0.2499   0.3393   0.3437   0.3385
5 months         0.2688   0.3018   0.3615   0.3856   0.4241   0.4084   0.2666   0.3514   0.2399   0.3060
Table A10. F-score macro average for the collective classification of three levels of the hierarchy in the manufacturing company dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Known Nodes
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%
1 month          0.3570   0.3735   0.3731   0.3095   0.4910   0.5328   0.5397   0.6000   0.8730
2 months         0.3564   0.4718   0.3640   0.5260   0.4969   0.5642   0.6344   0.6589   0.8730
3 months         0.3433   0.3665   0.3689   0.5258   0.3099   0.5642   0.5726   0.6589   0.8730
4 months         0.3472   0.3841   0.3757   0.5255   0.3095   0.5642   0.5846   0.6000   0.8713
5 months         0.3626   0.3972   0.3678   0.4795   0.5394   0.6496   0.5726   0.6000   0.8713
Table A11. F-score macro average for the Decision Tree classification of two levels of the hierarchy in the Enron dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Used Features
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%      100%
1 month          0.4880   0.5061   0.5874   0.6116   0.6369   0.6959   0.6602   0.7007   0.6672   0.7156
2 months         0.4997   0.5070   0.6003   0.6755   0.6297   0.6559   0.6814   0.6778   0.6912   0.7406
3 months         0.4332   0.5405   0.7040   0.7388   0.6710   0.6336   0.6979   0.6818   0.7328   0.7475
4 months         0.5138   0.4367   0.5903   0.6175   0.6702   0.7076   0.7168   0.7455   0.7421   0.7194
5 months         0.5538   0.5201   0.6829   0.7113   0.7275   0.7332   0.7294   0.7548   0.7506   0.7849
Table A12. F-score macro average for the Random Forest classification of two levels of the hierarchy in the Enron dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Used Features
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%      100%
1 month          0.5162   0.4907   0.6073   0.6496   0.6493   0.6841   0.6784   0.7464   0.7507   0.7286
2 months         0.6281   0.6015   0.6747   0.6560   0.6709   0.7005   0.6987   0.6894   0.7310   0.7189
3 months         0.4397   0.6658   0.6812   0.7218   0.7096   0.7180   0.7724   0.7608   0.7609   0.7922
4 months         0.4604   0.5224   0.6759   0.6678   0.6908   0.6797   0.7307   0.7387   0.7535   0.7713
5 months         0.4897   0.6057   0.6567   0.7142   0.6998   0.7078   0.7188   0.8047   0.7911   0.8198
Table A13. F-score macro average for the Neural Network classification of two levels of the hierarchy in the Enron dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Used Features
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%      100%
1 month          0.5032   0.5877   0.6263   0.6464   0.6100   0.7291   0.7260   0.7241   0.6985   0.6858
2 months         0.4938   0.6774   0.7053   0.7130   0.7089   0.6927   0.7443   0.7340   0.6892   0.6763
3 months         0.5135   0.6478   0.7019   0.6930   0.7165   0.7256   0.7402   0.7273   0.7318   0.6944
4 months         0.5184   0.6212   0.6831   0.6710   0.7299   0.7684   0.7456   0.7835   0.6937   0.6342
5 months         0.4684   0.6268   0.7305   0.6982   0.7697   0.7550   0.7663   0.7642   0.7790   0.6936
Table A14. F-score macro average for the SVM classification of two levels of the hierarchy in the Enron dataset. The values which are bolded are the best results for a given minimum communication activity.

                 Percentage of the Used Features
Min. activity    10%      20%      30%      40%      50%      60%      70%      80%      90%      100%
1 month          0.5580   0.5615   0.6673   0.6423   0.6354   0.7230   0.7054   0.7249   0.6802   0.6733
2 months         0.5687   0.6350   0.6958   0.6824   0.7138   0.7300   0.7356   0.7329   0.6696   0.6249
3 months         0.5731   0.6430   0.7221   0.7519   0.7139   0.7082   0.7087   0.7475   0.6833   0.6290
4 months         0.4883   0.6367   0.6698   0.6483   0.7186   0.7565   0.7456   0.7534   0.6731   0.5987
5 months         0.5248   0.6167   0.6975   0.7146   0.7855   0.7430   0.7718   0.7757   0.7077   0.6200
Table A15. F-score macro average for the collective classification of two levels of the hierarchy in the Enron dataset. Bold values are the best results for a given minimum communication activity. Columns give the percentage of the known nodes.

| Min. Activity | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% |
|---|---|---|---|---|---|---|---|---|---|
| 1 month | 0.5285 | 0.5554 | 0.7629 | 0.6976 | 0.7812 | 0.7896 | 0.7896 | 0.7684 | **0.8286** |
| 2 months | 0.5689 | 0.5920 | 0.6865 | 0.6727 | 0.6718 | 0.7350 | 0.7380 | 0.8120 | **0.8901** |
| 3 months | 0.5863 | 0.5385 | 0.6334 | 0.6627 | 0.6573 | 0.6821 | 0.6641 | 0.6761 | **0.8000** |
| 4 months | 0.5605 | 0.5359 | 0.5972 | 0.6117 | 0.6725 | 0.6824 | 0.7618 | **0.8750** | 0.8730 |
| 5 months | 0.6141 | 0.5163 | 0.6003 | 0.6260 | 0.6581 | 0.5826 | 0.7429 | 0.7500 | **1.0000** |
Table A16. F-score macro average for the Decision Tree classification of three levels of the hierarchy in the Enron dataset. Bold values are the best results for a given minimum communication activity. Columns give the percentage of the used features.

| Min. Activity | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 month | 0.3183 | 0.3853 | 0.4674 | 0.4857 | 0.4853 | 0.4830 | 0.4991 | 0.5236 | 0.5074 | **0.5613** |
| 2 months | 0.3995 | 0.4273 | 0.4514 | 0.4843 | 0.4596 | 0.4765 | **0.5637** | 0.5293 | 0.5078 | 0.5159 |
| 3 months | 0.3170 | 0.4570 | 0.4598 | 0.5245 | 0.5131 | 0.5164 | 0.5205 | 0.5380 | 0.5554 | **0.5568** |
| 4 months | 0.3471 | 0.4716 | 0.4435 | 0.4603 | 0.5287 | 0.5305 | **0.5683** | 0.5337 | 0.5123 | 0.5641 |
| 5 months | 0.3609 | 0.3393 | 0.5300 | 0.5099 | **0.5827** | 0.5434 | 0.4761 | 0.5199 | 0.5484 | 0.4946 |
Table A17. F-score macro average for the Random Forest classification of three levels of the hierarchy in the Enron dataset. Bold values are the best results for a given minimum communication activity. Columns give the percentage of the used features.

| Min. Activity | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 month | 0.3562 | 0.4244 | 0.5364 | 0.4846 | 0.4919 | 0.5150 | 0.5237 | **0.5835** | 0.5502 | 0.5451 |
| 2 months | 0.3535 | 0.4604 | 0.4828 | 0.5005 | 0.5248 | 0.5843 | 0.5477 | 0.5594 | 0.5633 | **0.5849** |
| 3 months | 0.4447 | 0.4745 | 0.5070 | 0.5076 | 0.5413 | 0.5961 | 0.5910 | 0.6258 | 0.5842 | **0.6423** |
| 4 months | 0.3877 | 0.4735 | 0.4991 | 0.5427 | 0.6094 | 0.6318 | 0.6204 | 0.6151 | 0.6224 | **0.6320** |
| 5 months | 0.4158 | 0.4758 | 0.5117 | 0.4970 | 0.5116 | 0.5235 | 0.5862 | **0.6004** | 0.5826 | 0.5963 |
Table A18. F-score macro average for the Neural Network classification of three levels of the hierarchy in the Enron dataset. Bold values are the best results for a given minimum communication activity. Columns give the percentage of the used features.

| Min. Activity | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 month | 0.3783 | 0.4316 | 0.4938 | 0.4925 | 0.4817 | 0.4857 | 0.4872 | 0.5439 | **0.5721** | 0.5104 |
| 2 months | 0.3566 | 0.5221 | 0.5439 | 0.5121 | 0.5310 | 0.4787 | 0.5504 | 0.5217 | **0.5639** | 0.5176 |
| 3 months | 0.3835 | 0.4990 | 0.5048 | 0.4696 | 0.5142 | 0.5514 | **0.5700** | 0.5535 | 0.5142 | 0.4754 |
| 4 months | 0.3688 | 0.4838 | 0.4875 | 0.4563 | 0.4983 | 0.5716 | **0.6107** | 0.5615 | 0.5795 | 0.5080 |
| 5 months | 0.3714 | 0.4998 | 0.4583 | 0.4443 | 0.4909 | 0.4904 | **0.5356** | 0.4687 | 0.4860 | 0.4393 |
Table A19. F-score macro average for the SVM classification of three levels of the hierarchy in the Enron dataset. Bold values are the best results for a given minimum communication activity. Columns give the percentage of the used features.

| Min. Activity | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 month | 0.3607 | 0.4308 | 0.4644 | 0.4448 | 0.5035 | 0.4934 | 0.5452 | **0.5682** | 0.5596 | 0.5171 |
| 2 months | 0.3825 | 0.5075 | 0.4891 | 0.5648 | 0.5310 | 0.4874 | 0.5284 | **0.5708** | 0.5551 | 0.4910 |
| 3 months | 0.3874 | 0.4907 | 0.4929 | 0.4492 | 0.4946 | 0.5256 | **0.6137** | 0.5915 | 0.5097 | 0.4622 |
| 4 months | 0.4075 | 0.5411 | 0.4626 | 0.4505 | 0.4511 | 0.5177 | 0.5626 | **0.5687** | 0.5407 | 0.4576 |
| 5 months | 0.3559 | 0.4604 | 0.4859 | 0.4337 | 0.4787 | **0.5259** | 0.4397 | 0.4962 | 0.4691 | 0.4046 |
Table A20. F-score macro average for the collective classification of three levels of the hierarchy in the Enron dataset. Bold values are the best results for a given minimum communication activity. Columns give the percentage of the known nodes.

| Min. Activity | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% |
|---|---|---|---|---|---|---|---|---|---|
| 1 month | 0.2826 | 0.3044 | 0.3739 | 0.4514 | 0.4730 | 0.5517 | 0.5160 | **0.5654** | 0.5596 |
| 2 months | 0.2874 | 0.3193 | 0.3708 | 0.3933 | 0.4393 | 0.5084 | 0.5426 | 0.5333 | **0.6746** |
| 3 months | 0.2593 | 0.3368 | 0.3681 | 0.4007 | 0.4284 | **0.5915** | 0.5753 | 0.5166 | 0.5000 |
| 4 months | 0.2687 | 0.3077 | 0.3628 | 0.4031 | 0.5094 | 0.5105 | 0.6427 | 0.6476 | **0.7302** |
| 5 months | 0.2725 | 0.2774 | 0.3542 | 0.3489 | 0.3954 | 0.4832 | 0.4593 | 0.4551 | **0.6111** |
Table A21. Hyperparameter search space for the supervised learning algorithms.

| Algorithm | Hyperparameter | Values |
|---|---|---|
| Decision Tree | max_depth | 1, 2, 3, ..., 20 |
| | max_features | 1, 2, 3, ..., 16 (manufacturing company); 1, 2, 3, ..., 15 (Enron) |
| Random Forest | max_depth | 1, 2, 3, ..., 20 |
| | max_features | 1, 2, 3, ..., 16 (manufacturing company); 1, 2, 3, ..., 15 (Enron) |
| | n_estimators | 1, 2, 4, 8, 16, 32, 64, 100, 200 |
| Neural Network | alpha | 0.0001, 0.001, 0.01, 1 |
| | hidden_layer_sizes | (13), (9), (4), (13, 9), (13, 4), (9, 4), (13, 9, 4), (4, 9, 4), (9, 13, 9), (9, 13, 4), (4, 9, 13, 9, 4) |
| SVM | degree | 3, 4, 5, 6, 7, 8, 9, 10, 15, 20 |
| | C | 1, 2, 3, 4, 5, 10, 15, 20, 25, 50, 100, 150, 200 |
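The hyperparameter names in Table A21 correspond to the scikit-learn estimators, so the search space can be expressed directly as parameter grids. The sketch below assumes an exhaustive grid search with five-fold cross-validation and F1-macro scoring, and a polynomial kernel for the SVM (implied by the degree parameter); the exact validation protocol is described in the main text.

```python
# Sketch of an exhaustive search over the spaces in Table A21.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

SEARCH_SPACE = [
    (DecisionTreeClassifier(), {
        "max_depth": list(range(1, 21)),
        "max_features": list(range(1, 17)),  # 1..15 for the Enron dataset
    }),
    (RandomForestClassifier(), {
        "max_depth": list(range(1, 21)),
        "max_features": list(range(1, 17)),
        "n_estimators": [1, 2, 4, 8, 16, 32, 64, 100, 200],
    }),
    (MLPClassifier(max_iter=2000), {
        "alpha": [0.0001, 0.001, 0.01, 1],
        # Single-layer entries are written as 1-tuples, as sklearn expects.
        "hidden_layer_sizes": [(13,), (9,), (4,), (13, 9), (13, 4), (9, 4),
                               (13, 9, 4), (4, 9, 4), (9, 13, 9), (9, 13, 4),
                               (4, 9, 13, 9, 4)],
    }),
    (SVC(kernel="poly"), {
        "degree": [3, 4, 5, 6, 7, 8, 9, 10, 15, 20],
        "C": [1, 2, 3, 4, 5, 10, 15, 20, 25, 50, 100, 150, 200],
    }),
]

def tune_all(X, y):
    """Return the best hyperparameters per model, as in Tables A22-A25."""
    best = {}
    for estimator, grid in SEARCH_SPACE:
        search = GridSearchCV(estimator, grid, scoring="f1_macro", cv=5)
        search.fit(X, y)
        best[type(estimator).__name__] = search.best_params_
    return best
```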
Table A22. The best values of the hyperparameters for the models that use the full set of features, for the manufacturing company dataset with two hierarchy levels.

| Algorithm | Hyperparameter | 1 Month | 2 Months | 3 Months | 4 Months | 5 Months |
|---|---|---|---|---|---|---|
| Decision Tree | max_depth | 4 | 8 | 8 | 10 | 5 |
| | max_features | 3 | 2 | 1 | 9 | 1 |
| Random Forest | max_depth | 8 | 6 | 6 | 10 | 6 |
| | max_features | 2 | 2 | 12 | 3 | 2 |
| | n_estimators | 32 | 16 | 16 | 16 | 8 |
| Neural Network | alpha | 0.0001 | 0.01 | 0.0001 | 0.001 | 0.01 |
| | hidden_layer_sizes | (13, 4) | (13) | (13, 9, 4) | (13, 9) | (13, 9) |
| SVM | degree | 20 | 20 | 20 | 15 | 20 |
| | C | 150 | 200 | 200 | 200 | 200 |
Table A23. The best values of the hyperparameters for the models that use the full set of features, for the manufacturing company dataset with three hierarchy levels.

| Algorithm | Hyperparameter | 1 Month | 2 Months | 3 Months | 4 Months | 5 Months |
|---|---|---|---|---|---|---|
| Decision Tree | max_depth | 10 | 7 | 12 | 13 | 8 |
| | max_features | 1 | 8 | 2 | 1 | 8 |
| Random Forest | max_depth | 7 | 5 | 7 | 10 | 10 |
| | max_features | 3 | 11 | 14 | 6 | 7 |
| | n_estimators | 2 | 4 | 16 | 32 | 8 |
| Neural Network | alpha | 0.001 | 0.0001 | 0.01 | 0.01 | 0.001 |
| | hidden_layer_sizes | (9, 13, 4) | (4, 9, 13, 9, 4) | (4, 9, 13, 9, 4) | (13, 9, 4) | (13, 9, 4) |
| SVM | degree | 15 | 20 | 20 | 20 | 20 |
| | C | 200 | 200 | 200 | 200 | 150 |
Table A24. The best values of the hyperparameters for the models that use the full set of features, for the Enron dataset with two hierarchy levels.

| Algorithm | Hyperparameter | 1 Month | 2 Months | 3 Months | 4 Months | 5 Months |
|---|---|---|---|---|---|---|
| Decision Tree | max_depth | 2 | 8 | 7 | 3 | 2 |
| | max_features | 1 | 9 | 3 | 4 | 12 |
| Random Forest | max_depth | 2 | 5 | 3 | 9 | 4 |
| | max_features | 3 | 7 | 5 | 2 | 2 |
| | n_estimators | 4 | 2 | 2 | 16 | 2 |
| Neural Network | alpha | 0.01 | 0.001 | 0.0001 | 0.001 | 0.0001 |
| | hidden_layer_sizes | (13, 9) | (4, 9, 4) | (4, 9, 4) | (4, 9, 13, 9, 4) | (9) |
| SVM | degree | 6 | 3 | 7 | 3 | 6 |
| | C | 20 | 10 | 20 | 100 | 50 |
Table A25. The best values of the hyperparameters for the models that use the full set of features, for the Enron dataset with three hierarchy levels.

| Algorithm | Hyperparameter | 1 Month | 2 Months | 3 Months | 4 Months | 5 Months |
|---|---|---|---|---|---|---|
| Decision Tree | max_depth | 5 | 4 | 5 | 8 | 4 |
| | max_features | 7 | 8 | 11 | 9 | 13 |
| Random Forest | max_depth | 2 | 4 | 7 | 2 | 3 |
| | max_features | 2 | 12 | 2 | 14 | 14 |
| | n_estimators | 2 | 4 | 8 | 2 | 16 |
| Neural Network | alpha | 0.001 | 0.01 | 0.01 | 0.0001 | 0.001 |
| | hidden_layer_sizes | (4, 3) | (13) | (13) | (9) | (13, 9, 4) |
| SVM | degree | 9 | 3 | 3 | 6 | 3 |
| | C | 5 | 20 | 200 | 150 | 4 |
Table A26. Hyperparameter search space for the collective classification.

| Parameter | Values |
|---|---|
| utility score | all features from Table 3 |
| threshold | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
| Jaccard value | 0.7, 0.8, 0.9, 0.99 |
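For intuition only, the sketch below illustrates one way the three hyperparameters of Table A26 can interact in a collective classifier: the utility score (one of the features from Table 3) orders the unlabelled nodes, the threshold sets the minimum number of already-labelled neighbours a node needs before it may adopt their majority label, and the Jaccard value acts as a stopping criterion comparing consecutive label assignments. These roles are assumptions made for illustration; the algorithm itself is defined in the main text and is not reproduced here.

```python
# Illustrative sketch only; not the paper's exact algorithm.
from collections import Counter

import networkx as nx


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0


def collective_classification(g: nx.Graph, known: dict, utility: dict,
                              threshold: int, jaccard_cutoff: float) -> dict:
    # known: node -> hierarchy level for the seeded fraction of nodes.
    # utility: node -> value of the chosen utility-score feature.
    labels = dict(known)
    while True:
        previous = set(labels.items())
        # Visit unlabelled nodes in decreasing order of utility score.
        for node in sorted(set(g) - set(labels), key=lambda n: -utility[n]):
            neighbour_labels = [labels[m] for m in g[node] if m in labels]
            # Only classify once enough neighbours carry a label.
            if len(neighbour_labels) >= threshold:
                labels[node] = Counter(neighbour_labels).most_common(1)[0][0]
        current = set(labels.items())
        # Stop once consecutive assignments are similar enough.
        if jaccard(previous, current) >= jaccard_cutoff:
            return labels
```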
Table A27. The best values of the hyperparameters for the collective classification algorithm for the manufacturing company dataset with two hierarchy levels (1 to 3 months of minimum activity).

| Min. Activity | % of the Known Nodes | Utility Score | Threshold | Jaccard Value |
|---|---|---|---|---|
| 1 month | 90% | hubs centrality | 4 | 0.7 |
| | | clustering coefficient | 5 | 0.7 |
| | | overtime | 4 | 0.7 |
| | 80% | hubs centrality | 4 | 0.9 |
| | 70% | indegree centrality | 4 | 0.7 |
| | | the total numbers of cliques | 4 | 0.7 |
| | | the biggest clique | 4 | 0.7 |
| | 60% | indegree centrality | 4 | 0.7 |
| | | outdegree centrality | 4 | 0.7 |
| | | the total numbers of cliques | 4 | 0.7 |
| | | the biggest clique | 4 | 0.7 |
| | 50% | indegree centrality | 4 | 0.7 |
| | | outdegree centrality | 4 | 0.7 |
| | 40% | the total numbers of cliques | 4 | 0.7 |
| | 30% | page rank centrality | 4 | 0.7 |
| | 20% | betweenness centrality | 5 | 0.7 |
| | 10% | the number of weekends worked | 4 | 0.7 |
| 2 months | 90% | closeness centrality | 4 | 0.7 |
| | | hubs centrality | 4 | 0.7 |
| | | clustering coefficient | 5 | 0.7 |
| | 80% | hubs centrality | 4 | 0.9 |
| | | the biggest clique | 4 | 0.7 |
| | 70% | indegree centrality | 4 | 0.7 |
| | | the total numbers of cliques | 4 | 0.7 |
| | | the biggest clique | 4 | 0.7 |
| | 60% | indegree centrality | 4 | 0.7 |
| | | outdegree centrality | 4 | 0.7 |
| | | the total numbers of cliques | 4 | 0.7 |
| | 50% | indegree centrality | 4 | 0.9 |
| | 40% | the total numbers of cliques | 4 | 0.7 |
| | 30% | authorities centrality | 4 | 0.9 |
| | 20% | betweenness centrality | 5 | 0.7 |
| | 10% | the number of weekends worked | 4 | 0.7 |
| 3 months | 90% | closeness centrality | 4 | 0.7 |
| | | hubs centrality | 4 | 0.7 |
| | | clustering coefficient | 5 | 0.7 |
| | 80% | hubs centrality | 4 | 0.9 |
| | | the biggest clique | 4 | 0.7 |
| | 70% | indegree centrality | 4 | 0.7 |
| | | the total numbers of cliques | 4 | 0.7 |
| | | the biggest clique | 4 | 0.7 |
| | 60% | indegree centrality | 4 | 0.7 |
| | | outdegree centrality | 4 | 0.7 |
| | | the total numbers of cliques | 4 | 0.7 |
| | 50% | received neighborhood variability | 4 | 0.7 |
| | 40% | the total numbers of cliques | 4 | 0.7 |
| | 30% | eigenvector centrality | 4 | 0.7 |
| | 20% | betweenness centrality | 5 | 0.7 |
| | 10% | eigenvector centrality | 4 | 0.7 |
Table A28. The best values of the hyperparameters for the collective classification algorithm for the manufacturing company dataset with two hierarchy levels (4 to 5 months of minimum activity).

| Min. Activity | % of the Known Nodes | Utility Score | Threshold | Jaccard Value |
|---|---|---|---|---|
| 4 months | 90% | outdegree centrality | 3 | 0.7 |
| | | eigenvector centrality | 3 | 0.7 |
| | | hubs centrality | 4 | 0.7 |
| | | clustering coefficient | 5 | 0.7 |
| | 80% | hubs centrality | 4 | 0.9 |
| | | the biggest clique | 4 | 0.7 |
| | 70% | indegree centrality | 4 | 0.7 |
| | | the total numbers of cliques | 4 | 0.7 |
| | | the biggest clique | 4 | 0.7 |
| | 60% | outdegree centrality | 4 | 0.7 |
| | | the total numbers of cliques | 4 | 0.7 |
| | 50% | the biggest clique | 4 | 0.99 |
| | | received neighborhood variability | 4 | 0.7 |
| | 40% | the total numbers of cliques | 4 | 0.7 |
| | 30% | authorities centrality | 4 | 0.9 |
| | 20% | betweenness centrality | 5 | 0.7 |
| | 10% | authorities centrality | 4 | 0.7 |
| 5 months | 90% | outdegree centrality | 3 | 0.7 |
| | | betweenness centrality | 3 | 0.7 |
| | | closeness centrality | 4 | 0.7 |
| | | hubs centrality | 4 | 0.7 |
| | | clustering coefficient | 5 | 0.7 |
| | 80% | hubs centrality | 4 | 0.9 |
| | | the biggest clique | 4 | 0.7 |
| | 70% | indegree centrality | 4 | 0.7 |
| | | the total numbers of cliques | 4 | 0.7 |
| | | the biggest clique | 4 | 0.7 |
| | 60% | the total numbers of cliques | 4 | 0.7 |
| | 50% | the biggest clique | 4 | 0.99 |
| | | received neighborhood variability | 4 | 0.7 |
| | 40% | the total numbers of cliques | 4 | 0.7 |
| | 30% | authorities centrality | 4 | 0.7 |
| | 20% | betweenness centrality | 5 | 0.7 |
| | 10% | the number of weekends worked | 4 | 0.7 |
Table A29. The best values of the hyperparameters for the collective classification algorithm for the manufacturing company dataset with three hierarchy levels.

| Min. Activity | % of the Known Nodes | Utility Score | Threshold | Jaccard Value |
|---|---|---|---|---|
| 1 month | 90% | the number of weekends worked | 7 | 0.7 |
| | 80% | overtime | 4 | 0.7 |
| | 70% | overtime | 5 | 0.7 |
| | 60% | hubs centrality | 5 | 0.7 |
| | 50% | the number of weekends worked | 5 | 0.7 |
| | 40% | closeness centrality | 5 | 0.7 |
| | | clustering coefficient | 2 | 0.7 |
| | | the number of weekends worked | 4 | 0.7 |
| | 30% | closeness centrality | 2 | 0.7 |
| | 20% | received neighborhood variability | 3 | 0.7 |
| | 10% | the biggest clique | 2 | 0.7 |
| 2 months | 90% | the number of weekends worked | 7 | 0.7 |
| | 80% | overtime | 5 | 0.7 |
| | 70% | betweenness centrality | 5 | 0.7 |
| | 60% | hubs centrality | 5 | 0.7 |
| | 50% | the number of weekends worked | 5 | 0.7 |
| | 40% | general neighborhood variability | 5 | 0.7 |
| | 30% | closeness centrality | 2 | 0.7 |
| | 20% | clustering coefficient | 3 | 0.7 |
| | 10% | the biggest clique | 2 | 0.7 |
| 3 months | 90% | the number of weekends worked | 7 | 0.7 |
| | 80% | overtime | 5 | 0.7 |
| | 70% | hubs centrality | 5 | 0.7 |
| | 60% | hubs centrality | 5 | 0.7 |
| | 50% | hubs centrality | 4 | 0.7 |
| | | clustering coefficient | 2 | 0.7 |
| | | the number of weekends worked | 3 | 0.7 |
| | 40% | general neighborhood variability | 5 | 0.7 |
| | 30% | hubs centrality | 2 | 0.7 |
| | 20% | received neighborhood variability | 3 | 0.7 |
| | 10% | page rank centrality | 4 | 0.7 |
| | | the number of weekends worked | 5 | 0.7 |
| 4 months | 90% | the number of weekends worked | 7 | 0.7 |
| | 80% | betweenness centrality | 4 | 0.7 |
| | | overtime | 4 | 0.7 |
| | 70% | betweenness centrality | 4 | 0.7 |
| | 60% | hubs centrality | 5 | 0.7 |
| | 50% | closeness centrality | 5 | 0.7 |
| | | eigenvector centrality | 4 | 0.7 |
| | | hubs centrality | 4 | 0.7 |
| | | clustering coefficient | 2 | 0.7 |
| | | the number of weekends worked | 2 | 0.7 |
| | 40% | general neighborhood variability | 5 | 0.7 |
| | 30% | hubs centrality | 2 | 0.7 |
| | 20% | received neighborhood variability | 3 | 0.7 |
| | 10% | page rank centrality | 4 | 0.7 |
| | | the number of weekends worked | 5 | 0.7 |
| 5 months | 90% | the number of weekends worked | 7 | 0.7 |
| | 80% | betweenness centrality | 4 | 0.7 |
| | 70% | eigenvector centrality | 5 | 0.7 |
| | 60% | betweenness centrality | 6 | 0.7 |
| | 50% | sent neighborhood variability | 5 | 0.7 |
| | 40% | betweenness centrality | 6 | 0.7 |
| | 30% | hubs centrality | 2 | 0.7 |
| | 20% | received neighborhood variability | 3 | 0.7 |
| | 10% | the number of weekends worked | 5 | 0.7 |
Table A30. The best values of the hyperparameters for the collective classification algorithm for the Enron dataset with two hierarchy levels.

| Min. Activity | % of the Known Nodes | Utility Score | Threshold | Jaccard Value |
|---|---|---|---|---|
| 1 month | 90% | page rank centrality | 1 | 0.7 |
| | 80% | authorities centrality | 1 | 0.7 |
| | 70% | authorities centrality | 1 | 0.7 |
| | 60% | indegree centrality | 1 | 0.8 |
| | 50% | the biggest clique | 1 | 0.7 |
| | 40% | indegree centrality | 1 | 0.7 |
| | 30% | indegree centrality | 1 | 0.7 |
| | 20% | general neighborhood variability | 1 | 0.7 |
| | 10% | closeness centrality | 1 | 0.7 |
| 2 months | 90% | received neighborhood variability | 1 | 0.7 |
| | 80% | closeness centrality | 1 | 0.7 |
| | | hubs centrality | 1 | 0.7 |
| | 70% | closeness centrality | 1 | 0.7 |
| | 60% | the biggest clique | 1 | 0.7 |
| | 50% | the biggest clique | 1 | 0.7 |
| | 40% | indegree centrality | 1 | 0.7 |
| | 30% | indegree centrality | 1 | 0.7 |
| | 20% | general neighborhood variability | 1 | 0.99 |
| | 10% | general neighborhood variability | 1 | 0.7 |
| 3 months | 90% | hubs centrality | 1 | 0.7 |
| | 80% | authorities centrality | 1 | 0.7 |
| | 70% | authorities centrality | 1 | 0.7 |
| | 60% | the biggest clique | 1 | 0.99 |
| | 50% | the biggest clique | 1 | 0.7 |
| | 40% | indegree centrality | 1 | 0.7 |
| | 30% | indegree centrality | 1 | 0.7 |
| | 20% | received neighborhood variability | 1 | 0.7 |
| | 10% | general neighborhood variability | 1 | 0.7 |
| 4 months | 90% | closeness centrality | 1 | 0.7 |
| | | hubs centrality | 1 | 0.7 |
| | 80% | indegree centrality | 1 | 0.7 |
| | 70% | indegree centrality | 1 | 0.99 |
| | 60% | the biggest clique | 1 | 0.8 |
| | 50% | betweenness centrality | 1 | 0.7 |
| | 40% | betweenness centrality | 1 | 0.7 |
| | 30% | page rank centrality | 1 | 0.7 |
| | 20% | general neighborhood variability | 1 | 0.99 |
| | 10% | eigenvector centrality | 1 | 0.7 |
| 5 months | 90% | authorities centrality | 1 | 0.7 |
| | | hubs centrality | 1 | 0.7 |
| | 80% | indegree centrality | 1 | 0.7 |
| | 70% | outdegree centrality | 1 | 0.7 |
| | 60% | outdegree centrality | 1 | 0.7 |
| | 50% | betweenness centrality | 1 | 0.7 |
| | 40% | betweenness centrality | 1 | 0.7 |
| | 30% | betweenness centrality | 1 | 0.7 |
| | 20% | betweenness centrality | 1 | 0.7 |
| | 10% | the biggest clique | 1 | 0.7 |
Table A31. The best values of the hyperparameters for the collective classification algorithm for the Enron dataset with three hierarchy levels.

| Min. Activity | % of the Known Nodes | Utility Score | Threshold | Jaccard Value |
|---|---|---|---|---|
| 1 month | 90% | the biggest clique | 1 | 0.7 |
| | 80% | hubs centrality | 1 | 0.7 |
| | 70% | page rank centrality | 1 | 0.7 |
| | 60% | outdegree centrality | 1 | 0.7 |
| | 50% | eigenvector centrality | 1 | 0.7 |
| | 40% | the total numbers of cliques | 1 | 0.7 |
| | 30% | indegree centrality | 1 | 0.7 |
| | 20% | eigenvector centrality | 1 | 0.7 |
| | 10% | general neighborhood variability | 1 | 0.7 |
| 2 months | 90% | received neighborhood variability | 1 | 0.7 |
| | 80% | hubs centrality | 1 | 0.7 |
| | 70% | eigenvector centrality | 1 | 0.7 |
| | 60% | outdegree centrality | 1 | 0.7 |
| | 50% | outdegree centrality | 1 | 0.7 |
| | 40% | hubs centrality | 1 | 0.7 |
| | 30% | indegree centrality | 1 | 0.7 |
| | 20% | authorities centrality | 1 | 0.7 |
| | 10% | the total numbers of cliques | 1 | 0.99 |
| 3 months | 90% | the biggest clique | 1 | 0.7 |
| | 80% | hubs centrality | 1 | 0.7 |
| | 70% | closeness centrality | 1 | 0.7 |
| | 60% | outdegree centrality | 1 | 0.7 |
| | 50% | the biggest clique | 1 | 0.7 |
| | 40% | authorities centrality | 1 | 0.7 |
| | 30% | outdegree centrality | 1 | 0.7 |
| | 20% | sent neighborhood variability | 1 | 0.7 |
| | 10% | the total numbers of cliques | 1 | 0.9 |
| 4 months | 90% | closeness centrality | 1 | 0.7 |
| | 80% | authorities centrality | 1 | 0.7 |
| | 70% | outdegree centrality | 1 | 0.7 |
| | 60% | betweenness centrality | 1 | 0.7 |
| | 50% | outdegree centrality | 1 | 0.7 |
| | 40% | general neighborhood variability | 1 | 0.7 |
| | 30% | betweenness centrality | 1 | 0.7 |
| | 20% | general neighborhood variability | 1 | 0.7 |
| | 10% | received neighborhood variability | 1 | 0.7 |
| 5 months | 90% | the biggest clique | 1 | 0.7 |
| | 80% | the biggest clique | 1 | 0.8 |
| | 70% | outdegree centrality | 1 | 0.7 |
| | 60% | closeness centrality | 1 | 0.7 |
| | 50% | the biggest clique | 1 | 0.7 |
| | 40% | the number of weekends worked | 1 | 0.7 |
| | 30% | the number of weekends worked | 1 | 0.8 |
| | 20% | outdegree centrality | 1 | 0.7 |
| | 10% | clustering coefficient | 1 | 0.7 |
Figure A1. Feature importance (Gini importance) for the Decision Tree that uses the full set of features, for the manufacturing company dataset with two levels of the hierarchy.
Figure A2. Feature importance (Gini importance) for the Decision Tree that uses the full set of features, for the manufacturing company dataset with three levels of the hierarchy.
Figure A3. Feature importance (Gini importance) for the Random Forest that uses the full set of features, for the manufacturing company dataset with two levels of the hierarchy.
Figure A4. Feature importance (Gini importance) for the Random Forest that uses the full set of features, for the manufacturing company dataset with three levels of the hierarchy.
Figure A5. Feature importance (based on a univariate feature selection method using the chi-squared test) for the Neural Network and SVM that use the full set of features, for the manufacturing company dataset with two levels of the hierarchy.
Figure A6. Feature importance (based on a univariate feature selection method using the chi-squared test) for the Neural Network and SVM that use the full set of features, for the manufacturing company dataset with three levels of the hierarchy.
Figure A7. Feature importance (Gini importance) for the Decision Tree that uses the full set of features, for the Enron dataset with two levels of the hierarchy.
Figure A8. Feature importance (Gini importance) for the Decision Tree that uses the full set of features, for the Enron dataset with three levels of the hierarchy.
Figure A9. Feature importance (Gini importance) for the Random Forest that uses the full set of features, for the Enron dataset with two levels of the hierarchy.
Figure A10. Feature importance (Gini importance) for the Random Forest that uses the full set of features, for the Enron dataset with three levels of the hierarchy.
Figure A11. Feature importance (based on a univariate feature selection method using the chi-squared test) for the Neural Network and SVM that use the full set of features, for the Enron dataset with two levels of the hierarchy.
Figure A12. Feature importance (based on a univariate feature selection method using the chi-squared test) for the Neural Network and SVM that use the full set of features, for the Enron dataset with three levels of the hierarchy.
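Both kinds of rankings shown in Figures A1–A12 can be reproduced with scikit-learn: tree-based models expose Gini importance directly, while the chi-squared scores come from univariate feature selection. A sketch, assuming X holds the features of Table 3 and y the hierarchy levels:

```python
# Sketch of the two feature rankings used in Figures A1-A12.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler


def gini_importance(X, y, feature_names):
    # Fitted trees and forests expose Gini importance as
    # feature_importances_.
    model = RandomForestClassifier(n_estimators=100).fit(X, y)
    return sorted(zip(feature_names, model.feature_importances_),
                  key=lambda item: -item[1])


def chi2_importance(X, y, feature_names):
    # chi2 requires non-negative inputs, hence the scaling step.
    X_scaled = MinMaxScaler().fit_transform(X)
    selector = SelectKBest(chi2, k="all").fit(X_scaled, y)
    return sorted(zip(feature_names, selector.scores_),
                  key=lambda item: -item[1])
```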

References

1. McCallum, A.; Wang, X.; Corrada-Emmanuel, A. Topic and Role Discovery in Social Networks with Experiments on Enron and Academic Email. J. Artif. Intell. Res. 2007, 30, 249–272.
2. Palus, S.; Kazienko, P.; Michalski, R. Evaluation of Corporate Structure Based on Social Network Analysis. In Social Development and High Technology Industries; IGI Global: Hershey, PA, USA, 2012; pp. 58–69.
3. Michalski, R.; Palus, S.; Kazienko, P. Matching Organizational Structure and Social Network Extracted from Email Communication. In Business Information Systems; Springer: Berlin/Heidelberg, Germany, 2011; pp. 197–206.
4. Creamer, G.; Rowe, R.; Hershkop, S.; Stolfo, S.J. Segmentation and Automated Social Hierarchy Detection through Email Network Analysis. In Advances in Web Mining and Web Usage Analysis; Springer: Berlin/Heidelberg, Germany, 2009; pp. 40–58.
5. Fire, M.; Puzis, R. Organization Mining Using Online Social Networks. Netw. Spat. Econ. 2015, 16, 545–578.
6. Namata, G.; Getoor, L.; Diehl, C. Inferring formal titles in organizational email archives. In Proceedings of the ICML Workshop on Statistical Network Analysis, Pittsburgh, PA, USA, 29 June 2006.
7. Zhang, C.; Hurst, W.B.; Lenin, R.B.; Yuruk, N.; Ramaswamy, S. Analyzing Organizational Structures Using Social Network Analysis. In Lecture Notes in Business Information Processing; Springer: Berlin/Heidelberg, Germany, 2009; pp. 143–156.
8. Wang, Y.; Iliofotou, M.; Faloutsos, M.; Wu, B. Analyzing Communication Interaction Networks (CINs) in enterprises and inferring hierarchies. Comput. Netw. 2013, 57, 2147–2158.
9. Coles, N. It's not what you know—It's who you know that counts. Analysing serious crime groups as social networks. Br. J. Criminol. 2001, 41, 580–594.
10. Shaabani, E.; Aleali, A.; Shakarian, P.; Bertetto, J. Early identification of violent criminal gang members. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015; pp. 2079–2088.
11. Tayebi, M.A.; Ester, M.; Glässer, U.; Brantingham, P.L. Spatially embedded co-offence prediction using supervised learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 1789–1798.
12. Sundsøy, P.; Bjelland, J.; Reme, B.A.; Iqbal, A.M.; Jahani, E. Deep learning applied to mobile phone data for individual income classification. In Proceedings of the 2016 International Conference on Artificial Intelligence: Technologies and Applications, Bangkok, Thailand, 24–25 January 2016; Atlantis Press: Paris, France, 2016.
13. Kosinski, M.; Stillwell, D.; Graepel, T. Private traits and attributes are predictable from digital records of human behavior. Proc. Natl. Acad. Sci. USA 2013, 110, 5802–5805.
14. Huang, Y.; Yu, L.; Wang, X.; Cui, B. A multi-source integration framework for user occupation inference in social media systems. World Wide Web 2015, 18, 1247–1267.
15. Ortiz-Arroyo, D. Discovering Sets of Key Players in Social Networks. In Computer Communications and Networks; Springer: London, UK, 2009; pp. 27–47.
16. Dong, Y.; Tang, J.; Chawla, N.V.; Lou, T.; Yang, Y.; Wang, B. Inferring Social Status and Rich Club Effects in Enterprise Communication Networks. PLoS ONE 2015, 10, e0119446.
17. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260.
18. Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: Hoboken, NJ, USA, 2013; Volume 398.
19. Maron, M.E. Automatic indexing: An experimental inquiry. J. ACM 1961, 8, 404–417.
20. Altman, N.S. An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 1992, 46, 175–185.
21. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
22. Utgoff, P.E. Incremental induction of decision trees. Mach. Learn. 1989, 4, 161–186.
23. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
24. Ayodele, T.O. Types of machine learning algorithms. In New Advances in Machine Learning; IntechOpen: London, UK, 2010; pp. 19–48.
25. Molnar, C. Interpretable Machine Learning; Lulu Press, Inc.: Morrisville, NC, USA, 2019.
26. McPherson, M.; Smith-Lovin, L.; Cook, J.M. Birds of a feather: Homophily in social networks. Annu. Rev. Sociol. 2001, 27, 415–444.
27. Friedkin, N.E.; Johnsen, E.C. Social influence and opinions. J. Math. Sociol. 1990, 15, 193–206.
28. Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; Eliassi-Rad, T. Collective classification in network data. AI Mag. 2008, 29, 93.
29. Sun, Y.; Wong, A.K.; Kamel, M.S. Classification of imbalanced data: A review. Int. J. Pattern Recognit. Artif. Intell. 2009, 23, 687–719.
30. Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232.
31. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
32. Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 2018, 61, 863–905.
33. Wasserman, S.; Faust, K. Social Network Analysis: Methods and Applications; Cambridge University Press: Cambridge, UK, 1994; Volume 8.
34. Friedkin, N.E. Theoretical foundations for centrality measures. Am. J. Sociol. 1991, 96, 1478–1504.
35. Pujol, J.M.; Béjar, J.; Delgado, J. Clustering algorithm for determining community structure in large networks. Phys. Rev. E 2006, 74, 016107.
36. Michalski, R.; Palus, S.; Bródka, P.; Kazienko, P.; Juszczyszyn, K. Modelling social network evolution. In Proceedings of the International Conference on Social Informatics, Singapore, 6–8 October 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 283–286.
37. Michalski, R.; Kazienko, P. Maximizing social influence in real-world networks—the state of the art and current challenges. In Propagation Phenomena in Real World Networks; Springer: Berlin/Heidelberg, Germany, 2015; pp. 329–359.
38. Michalski, R.; Kazienko, P. Social Network Analysis in Organizational Structures Evaluation. In Encyclopedia of Social Network Analysis and Mining; Springer: New York, NY, USA, 2014; pp. 1832–1844.
39. Adai, A.T.; Date, S.V.; Wieland, S.; Marcotte, E.M. LGL: Creating a map of protein function with an algorithm for visualizing very large biological networks. J. Mol. Biol. 2004, 340, 179–190.
40. Barabási, A.L.; Posfai, M. Network Science; Cambridge University Press: Cambridge, UK, 2016.
41. Kajdanowicz, T.; Michalski, R.; Musial, K.; Kazienko, P. Learning in unlabeled networks—An active learning and inference approach. AI Commun. 2016, 29, 123–148.
42. Özgür, A.; Özgür, L.; Güngör, T. Text categorization with class-based and corpus-based keyword selection. In Proceedings of the International Symposium on Computer and Information Sciences, Istanbul, Turkey, 26–28 October 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 606–615.
Figure 1. Organizational hierarchy after removal of former employees and technical accounts. Red nodes: first-level management; orange nodes: second-level management; blue nodes: regular employees.
Figure 2. Weighted and directed social network in a manufacturing company built upon e-mail communication. The colouring scheme is the same as for Figure 1: red nodes, first-level management; orange nodes, second-level management; blue nodes, regular employees. The algorithm used for visualisation is a force-directed large graph layout [39] rooted at the node with the highest betweenness.
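A comparable visualisation can be produced with python-igraph, which implements the large graph layout used above. The sketch below assumes g is the communication network with a "level" vertex attribute (0 for regular employees, 1 and 2 for the management levels) and roots the layout at the highest-betweenness vertex, as in the caption; it is an approximation of the published figure, not the original plotting code.

```python
# Sketch of reproducing the Figure 2 visualisation with python-igraph.
from igraph import Graph, plot

def draw_company_network(g: Graph, path: str = "network.png"):
    # Root the force-directed large graph layout (LGL) at the vertex
    # with the highest betweenness.
    betweenness = g.betweenness()
    root = betweenness.index(max(betweenness))
    layout = g.layout_lgl(root=root)
    palette = {0: "blue", 1: "orange", 2: "red"}  # regular / 2nd / 1st level
    plot(g, path, layout=layout,
         vertex_color=[palette[lvl] for lvl in g.vs["level"]])
```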
Figure 3. The results of the classification of two groups for the manufacturing company dataset.
Figure 4. The results of the classification of three groups for the manufacturing company dataset.
Figure 5. The results of the collective classification of three groups for the manufacturing company dataset.
Figure 6. The results of the classification of two groups for the Enron dataset.
Figure 7. The results of the classification of three groups for the Enron dataset.
Figure 8. The results of the collective classification of three groups for the Enron dataset.
Table 1. Organizational structure after removal of former employees and technical accounts.

| Hierarchy Level | Number |
|---|---|
| The first management level | 12 |
| The second management level | 8 |
| Regular employees | 134 |
Table 2. Enron hierarchy.

| Flattened | Original | Number |
|---|---|---|
| The first management level | CEO, President, Vice President | 40 |
| The second management level | Director, Managing Director, Manager | 37 |
| Regular employee | Employee, In House Lawyer, Trader | 53 |
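The flattening in Table 2 amounts to a simple lookup from the original job titles to the three analysed levels; expressed in Python (title strings follow the table and may differ from the raw corpus):

```python
# Lookup table for flattening Enron job titles, per Table 2.
FLATTENED_LEVEL = {
    "CEO": "first management level",
    "President": "first management level",
    "Vice President": "first management level",
    "Director": "second management level",
    "Managing Director": "second management level",
    "Manager": "second management level",
    "Employee": "regular employee",
    "In House Lawyer": "regular employee",
    "Trader": "regular employee",
}
```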
Table 3. Features.

| Feature Name | Defined in | Brief Description |
|---|---|---|
| indegree centrality | Equation (1) | the number of incoming links to a given node |
| outdegree centrality | Equation (2) | the number of outgoing links from a given node |
| betweenness centrality | Equation (3) | the frequency with which a node appears on the shortest paths in the network |
| closeness centrality | Equation (4) | the length of the shortest paths between the node and all other nodes in the graph |
| eigenvector centrality | Equation (5) | a relative measure of importance that depends on the importance of neighbouring nodes in the network |
| page rank centrality | Equation (6) | a relative measure of importance, also based on the eigenvectors of an adjacency matrix, but more tunable |
| hubs centrality | Equation (7) | an indication of a node's position relative to important nodes (authorities) |
| authorities centrality | Equation (8) | the importance of a node based on how often hubs refer to it |
| clustering coefficient | Equation (9) | the degree to which nodes in a graph tend to cluster together |
| the total numbers of cliques | Section 2.5 | the total number of cliques to which an employee belongs |
| the biggest clique | Section 2.5 | the size of the biggest clique for the specific node |
| sent neighborhood variability | Equation (10) | the difference between the sets of neighbours a node sends emails to in consecutive months |
| received neighborhood variability | Equation (10) | the difference between the sets of neighbours a node receives emails from in consecutive months |
| general neighborhood variability | Equation (10) | the difference between the sets of neighbours a node communicates with in consecutive months |
| overtime | Section 2.5 | the number of days an employee worked overtime (only for the manufacturing company) |
| the number of weekends worked | Section 2.5 | how many times an employee worked over weekends |
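Most of the graph-based features in Table 3 map onto standard network-analysis routines; a sketch using networkx is given below. The temporal features (neighbourhood variability, overtime, weekends worked) require the raw e-mail timestamps and are omitted here, and the clique-based values are computed over the maximal cliques of the undirected graph, which is an assumption about the definitions in Section 2.5.

```python
# Sketch of computing the graph-based features of Table 3 with networkx.
import networkx as nx

def sna_features(g: nx.DiGraph) -> dict:
    hubs, authorities = nx.hits(g)
    undirected = g.to_undirected()
    cliques = list(nx.find_cliques(undirected))  # maximal cliques
    return {
        "indegree centrality": nx.in_degree_centrality(g),
        "outdegree centrality": nx.out_degree_centrality(g),
        "betweenness centrality": nx.betweenness_centrality(g),
        "closeness centrality": nx.closeness_centrality(g),
        "eigenvector centrality": nx.eigenvector_centrality(g, max_iter=1000),
        "page rank centrality": nx.pagerank(g),
        "hubs centrality": hubs,
        "authorities centrality": authorities,
        "clustering coefficient": nx.clustering(undirected),
        "the total numbers of cliques": {
            n: sum(n in c for c in cliques) for n in g},
        "the biggest clique": {
            n: max((len(c) for c in cliques if n in c), default=0)
            for n in g},
    }
```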
Table 4. The best results obtained by the supervised methods.

| Dataset | Number of Levels | Algorithm | F1-Score | Min. Activity | % of Features |
|---|---|---|---|---|---|
| manufacturing company | 2 | Decision Tree | 0.7039 | 2 months | 80% |
| | | Random Forest | 0.7768 | 5 months | 90% |
| | | Neural Network | 0.6247 | 3 months | 100% |
| | | SVM | 0.6517 | 3 months | 100% |
| | 3 | Decision Tree | 0.4737 | 1 month | 80% |
| | | Random Forest | 0.4575 | 2 months | 60% |
| | | Neural Network | 0.4149 | 1 month | 100% |
| | | SVM | 0.4622 | 4 months | 50% |
| Enron | 2 | Decision Tree | 0.7849 | 5 months | 100% |
| | | Random Forest | 0.8198 | 5 months | 100% |
| | | Neural Network | 0.7835 | 4 months | 80% |
| | | SVM | 0.7855 | 5 months | 50% |
| | 3 | Decision Tree | 0.5827 | 5 months | 50% |
| | | Random Forest | 0.6423 | 3 months | 100% |
| | | Neural Network | 0.6107 | 4 months | 70% |
| | | SVM | 0.6137 | 3 months | 70% |
