3. Email Network Data and Main Indicators
The data used here were extracted from the Stanford Network Analysis Project (SNAP) “email-Eu-core-temporal” network, a well-known reference dataset for Social Network Analysis (SNA) of email traffic [26,27]. Email activity networks are a representative form of social networks. A number of individuals can be linked according to their interaction in terms of messages sent and received, while the number of interactions, their frequency, and the time differences may reveal information about the strength of bilateral relationships. Seen from a wider perspective, the different patterns of all bilateral relationships across the network can provide information on the role of each individual in the organization. Studying email networks, therefore, can be useful for the analysis of the operation of an organization.
The network was generated using real email traffic data from a large European research institution. Anonymized information about all incoming and outgoing email of the research institution was collected over 18 months. The information retained consists of the (anonymized) sender, the (anonymized) receiver, and the timestamp of the message dispatch. To convert the set of email messages into a network, each email address is considered a node. A directed edge between nodes i and j is created if i sent at least one message to j. SNAP also provides an additional dataset, “email-Eu-core-department-labels”, which associates each individual email address with one of the 42 departments of the research organization. The resulting network consists of 986 nodes (unique email addresses). Since 21 email addresses had only outgoing messages within the institution, and 162 email addresses had only incoming messages from within the institution, there are only 824 transmitting nodes and 965 receiving nodes. Department membership ranges from 1 to 109 individuals, with a mean of 23.93 members and a median of 14.5.
The email activity dataset consists of 332,334 observations, each corresponding to an email sent by an anonymized user id to another anonymized user id, with the corresponding timestamp. A graph representation of the email network can be easily constructed by assuming that each member of the network is a node of the graph. These observations can be considered as dynamic edges in the graph describing the network. The number of bilateral links regardless of the number and direction of the interactions, i.e., the static edges, is 24,929. The timestamp of each message allows the analysis of the dynamics over time.
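The construction described above can be sketched in a few lines. This is a minimal illustration on hypothetical toy records, not the code used for the paper; the function name `build_network` and the dictionary keys are assumptions introduced here.

```python
from collections import Counter

# Hypothetical toy records in (sender, receiver, timestamp) form.
records = [(1, 2, 0), (1, 2, 50), (2, 1, 80), (1, 3, 120), (4, 1, 200)]

def build_network(records):
    """Derive nodes, dynamic edges, directed edges and static edges."""
    dynamic = Counter((s, r) for s, r, _ in records)   # one count per email
    directed = set(dynamic)                            # i -> j if i emailed j at least once
    static = {frozenset(pair) for pair in directed}    # bilateral links, direction ignored
    return {
        "nodes": {n for pair in directed for n in pair},
        "dynamic": dynamic,
        "directed": directed,
        "static": static,
        "senders": {s for s, _ in directed},           # transmitting nodes
        "receivers": {r for _, r in directed},         # receiving nodes
    }

net = build_network(records)
print(len(net["nodes"]), sum(net["dynamic"].values()), len(net["static"]))
```

In this toy case node 3 only receives and node 4 only sends, which mirrors the difference between transmitting and receiving nodes reported for the full dataset.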
The number of emails sent by each individual is highly correlated with the number of emails received (Pearson correlation = 0.747).
Figure 1 suggests that the linear relation between the two numbers and the absolute levels of the two values are independent of the department to which each individual belongs. Email activity appears to be an effect of the individual’s role within the department and the organization at large, rather than an attribute associated with the role that each department has inside the organization.
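As an illustration of how such a correlation can be obtained, the sketch below counts emails sent and received per individual and computes the Pearson coefficient by hand. The records are hypothetical and the helper `pearson` is introduced here only for the example.

```python
from collections import Counter
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# Hypothetical toy records in (sender, receiver, timestamp) form.
records = [(1, 2, 0), (1, 2, 50), (2, 1, 80), (1, 3, 120), (4, 1, 200)]

sent = Counter(s for s, _, _ in records)
received = Counter(r for _, r, _ in records)
users = sorted(set(sent) | set(received))

# Individuals with no emails in one direction count as zero.
r_individual = pearson([sent[u] for u in users], [received[u] for u in users])
print(round(r_individual, 3))
```

The same function could be applied to department totals by summing the counts of all members of each department before calling `pearson`.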
The correlation between the number of sent and received emails is even higher when summarized at department level (Pearson correlation = 0.967). The number of emails sent by a department’s members to members of other departments is proportional to the number of emails received from other departments. In addition, even though there is significant variance in the number of emails sent or received by each individual, the aggregate figures at department level are, to a large extent, proportional to the number of individuals in each department (Figure 2). Despite this variance among individuals, the email flows between departments are symmetrical.
Given that the email traffic network has a high number of nodes and connections, it is practically impossible to visualize its structure in a meaningful way. Even when summarizing at department level, mapping the connections between the 42 nodes in order to identify patterns in the flow of information is still a complicated task.
Figure 3 summarizes the top 10% of directional bilateral email flows between departments. While a few departments appear frequently in these bilateral flows, the pattern of connections implies that no department is dominant in terms of intra-institutional email traffic.
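A possible way to obtain such a summary is sketched below. It assumes that flows within a department are excluded and that "top 10%" refers to the share of distinct directed department pairs ranked by email count; both are assumptions, as are the toy data and the name `top_department_flows`.

```python
from collections import Counter
import math

def top_department_flows(records, dept_of, share=0.10):
    """Rank directed inter-department flows and keep the top `share` of pairs."""
    flows = Counter()
    for sender, receiver, _ in records:
        a, b = dept_of[sender], dept_of[receiver]
        if a != b:                          # assumption: skip intra-department email
            flows[(a, b)] += 1
    keep = max(1, math.ceil(share * len(flows)))
    return flows.most_common(keep)

# Hypothetical department labels and records.
dept_of = {1: "A", 2: "A", 3: "B", 4: "C"}
records = [(1, 3, 0), (1, 3, 10), (3, 1, 20), (4, 1, 30), (1, 2, 40)]

top = top_department_flows(records, dept_of)
print(top)
```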
The basic statistics of email activity during the period comprise the traffic data for the network at individual (Table 2) and department (Table 3) level. There is high variance in regard to the number of emails each member of the network sent or received during the period, as well as in the ratio between the two. There are a few members who dominate the institution in terms of the number of emails sent. Twenty members sent more than 2500 emails during the period, perhaps due to an information dissemination role that they may have. However, sending many emails does not necessarily result in receiving many (or vice versa). Several different profiles seem to be present in the dataset, again probably due to the different roles in the institution and the communication patterns each individual prefers.
4. Graph Theory Indicators
The basic traffic data across the network provide an overview of the activity but are not sufficient to explain the role of each individual in the organization. Graph theory indicators are commonly used in social network analysis in order to describe the topology of a network and explain the relationships among its members. The three most frequently used indicators address centrality, a measure of the importance of each individual (corresponding to a node in the network) within the social network: degree centrality, closeness centrality, and betweenness centrality. All three centrality indicators were introduced by Freeman in 1979 [28] and form the basis of current SNA methods.
Degree centrality is the simplest expression of centrality and corresponds to the total number of existing connections between an individual node and the other nodes of the network. If $a_{ij} = 1$ when a connection between nodes i and j exists, and $a_{ij} = 0$ in the opposite case, the basic definition of degree centrality is:
$$C_D(i) = \sum_{j \neq i} a_{ij}$$
Normalizing the indicator adjusts for the network size by expressing a node’s centrality as a share of its maximum possible level, when connections with all other nodes in the system are present. If N is the number of nodes in the network [29]:
$$C'_D(i) = \frac{\sum_{j \neq i} a_{ij}}{N - 1}$$
For non-directed networks, $a_{ij} = a_{ji}$ is assumed. In most social networks, however, and particularly in the email network used here, connections are asymmetric, and $a_{ij} \neq a_{ji}$ is quite frequent in the flow of information. In such a case, the two variants of degree centrality assuming a directed graph may differ:
$$C_D^{in}(i) = \frac{\sum_{j \neq i} a_{ji}}{N - 1}, \qquad C_D^{out}(i) = \frac{\sum_{j \neq i} a_{ij}}{N - 1}$$
In practice, the degree centrality in a network of email flows coincides with the share of unique senders of emails received by each individual (in-degree) and the share of unique recipients of emails sent by each individual (out-degree).
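The interpretation of in- and out-degree as shares of unique senders and recipients can be illustrated with a short sketch. The edge list is hypothetical and `degree_centralities` is a name introduced for this example; normalization by N − 1 follows the text.

```python
def degree_centralities(directed_edges, nodes):
    """Normalized in- and out-degree centrality for every node."""
    n = len(nodes)
    senders_to = {v: set() for v in nodes}       # unique senders of emails received by v
    recipients_of = {v: set() for v in nodes}    # unique recipients of emails sent by v
    for i, j in directed_edges:
        if i != j:                               # ignore self-addressed email
            recipients_of[i].add(j)
            senders_to[j].add(i)
    return {
        v: {"in": len(senders_to[v]) / (n - 1),
            "out": len(recipients_of[v]) / (n - 1)}
        for v in nodes
    }

# Hypothetical directed edges (i, j): i sent at least one email to j.
edges = {(1, 2), (2, 1), (1, 3), (4, 1)}
centrality = degree_centralities(edges, {1, 2, 3, 4})
print(centrality[1], centrality[3])
```

Node 1 reaches two of the three other nodes in each direction, while node 3 only receives, so its out-degree centrality is zero.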
Closeness centrality is an indicator of the centrality of each node based on the distance between each individual node and all other nodes in the network. It is calculated as the reciprocal of the sum of the lengths of the shortest paths between node i and all other nodes in the network [30]:
$$C_C(i) = \frac{1}{\sum_{j \neq i} d(i,j)}$$
where $d(i,j)$ is the distance (number of edges) between nodes i and j.
Normalization, in order to account for the network size, is performed by multiplying closeness by N − 1, where N is the number of nodes in the network. The two directional forms of closeness after this transformation correspond to the inverse of the average distance from each node [31]:
$$C_C^{out}(i) = \frac{N - 1}{\sum_{j \neq i} d(i,j)}, \qquad C_C^{in}(i) = \frac{N - 1}{\sum_{j \neq i} d(j,i)}$$
Betweenness centrality measures the number of shortest paths between all other nodes of the network that pass through an individual node. If $\sigma_{jk}$ is the number of all shortest paths between two other nodes j and k, and $\sigma_{jk}(i)$ is the number of those shortest paths that pass through i, betweenness centrality is calculated as:
$$C_B(i) = \sum_{j \neq i \neq k} \frac{\sigma_{jk}(i)}{\sigma_{jk}}$$
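As in the case of the other centrality indicators, betweenness centrality can also account for different weights that express distance and differentiate between the directions of the connection. The general case, thus, can be transformed into:
$$C_B^{w}(i) = \sum_{j \neq i \neq k} \frac{\sigma_{jk}^{w}(i)}{\sigma_{jk}^{w}}$$
where $\sigma_{jk}^{w}$ is the number of shortest directed paths from j to k when path length is measured as the sum of edge weights, and $\sigma_{jk}^{w}(i)$ is the number of those paths that pass through i.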
The calculations of these standard centrality indicators for the reference email network used here were performed with the igraph software package [32]. The results for the main indicators are summarized in Table 4, with their standard and, where applicable, weighted and/or directional expressions.
Degree centrality is a direct reflection of the number of unique individuals within the network with whom an email was exchanged. The basic expression, normalized but non-directional, does not distinguish between sending and receiving an email. On average, each individual has been in contact with (in the sense that an email was sent to or received from) 6.8% of the individuals in the network during the period covered (67 out of a total of 985). The reach of contacts ranges from a minimum of 0.3% (3 individuals) to close to 55% (540 individuals).
If the direction of the email flow is taken into account, the directional version of degree centrality allows more detail in the characterization of each node. The sum of the two directions of directed degree centrality equals the undirected degree centrality for each node. Nevertheless, the differences in the distribution of values reflect the asymmetry in the number of unique senders and recipients in the network and the varying patterns in email activity of individual members of the network. Degree centrality is positively skewed, with its distribution having a long tail towards values of high centrality. This is the result of a low number of nodes being highly central in terms of the number of individuals they exchanged emails with, either because they sent emails to a higher proportion of the network than the average (out-degree) or because they received proportionally more (in-degree). The skewness of out-degree centrality is significantly higher than that of in-degree centrality due to the dominant role of a few members as senders of emails.
The closeness centrality indicators reflect a certain degree of symmetry in regard to their distribution statistics. The average member of the network is equally close to the center, regardless of whether incoming or outgoing email flow is considered. However, closeness for outgoing emails has a lower standard deviation and higher skewness than for incoming emails. This probably signifies that the relatively few members who send emails to a large part of the network act as an efficient channel of information flow across the network. The mean values for closeness centrality in Table 4 correspond to an average distance of 2.654 edges for outgoing emails and 2.652 edges for incoming emails, confirming the observation that the email network analyzed here is dense and highly connected.
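The link between normalized closeness and average distance can be made concrete with a breadth-first search. The sketch below is an illustration on a hypothetical graph, not the igraph routine used for Table 4. As an assumption, it averages only over nodes that are actually reachable, which coincides with the N − 1 normalization when every other node can be reached.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from source along directed edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def out_closeness(adj, node):
    """Inverse of the average distance from node to the nodes it can reach."""
    dist = bfs_distances(adj, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def reverse(adj):
    """Flip every edge so that incoming paths become outgoing ones."""
    rev = {}
    for u, targets in adj.items():
        rev.setdefault(u, set())
        for v in targets:
            rev.setdefault(v, set()).add(u)
    return rev

def in_closeness(adj, node):
    return out_closeness(reverse(adj), node)

# Hypothetical chain of email contacts: 1 -> 2 -> 3.
adj = {1: {2}, 2: {3}}
print(out_closeness(adj, 1), in_closeness(adj, 1), in_closeness(adj, 3))
```

Node 1 reaches the other two nodes at distances 1 and 2, so its out-closeness is 2/3 (an average distance of 1.5 edges), while nobody can reach it and its in-closeness is zero.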
Betweenness centrality presents some small differences when the direction of the email flow is taken into account. While the two values are highly correlated at node level (Pearson correlation = 0.965), individuals with a large imbalance between the numbers of incoming and outgoing messages do have a marginal influence on the overall distribution.
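For readers who want to reproduce a directed betweenness calculation without a graph library, the following sketch uses Brandes' accumulation scheme on an unweighted directed graph. It is offered as an illustration under those assumptions (no weights, no normalization) and is not necessarily the variant used by igraph for the reported values.

```python
from collections import deque

def betweenness(adj):
    """Directed, unweighted betweenness centrality (Brandes' algorithm)."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Forward pass: BFS from s counting shortest paths and predecessors.
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Backward pass: accumulate each node's share of shortest paths.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

# Hypothetical examples: a chain and a diamond with two equally short routes.
chain = betweenness({1: {2}, 2: {3}})
diamond = betweenness({1: {2, 3}, 2: {4}, 3: {4}})
print(chain, diamond)
```

In the diamond, the two shortest paths from node 1 to node 4 are split evenly between nodes 2 and 3, so each receives a score of 0.5.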
6. Symmetrical and A-Symmetrical Models
Having presented the differences between symmetrical and directed indicators, the question becomes whether using one or the other type of indicator influences the quality of the analysis of patterns in a social network. An experiment can be made by using a variable that expresses an operational characteristic of the network (independent from its topology) and estimating how well the various centrality and clustering coefficients explain its variation. Given the data available in this dataset, a suitable variable that is independent from the individual network measures is the reaction time to emails. The dataset provides the timestamp for each email, information that was not used in the calculations of the indicators in the previous sections. While the reaction time between emails does not necessarily correspond to the time taken for a specific email to be answered, it still provides a quantifiable indicator of the temporal dimension of the email interaction between two members of a social network. Shinkuma et al. [34] suggest that the frequency of interaction can be used as an indicator to characterize interpersonal communication in the network graphs.
In the experiment used here, a new indicator is constructed based on the email timestamps included in the dataset. If only the email exchanges that were bilateral during the period covered by the dataset (i.e., $a_{ij} = a_{ji} = 1$, with at least one email sent in each direction) are taken into account, and on the premise that the timestamp difference in an email exchange between two individuals is a proxy of the response time, the exchange of emails between i and j would have the form of a series that can be ordered by time.
The series would have n elements, where n is the number of emails between i and j, regardless of direction. The number of emails from i to j would be equal to k, where k < n. Each email has a timestamp $t_n$, with $t_n > t_{n-1}$.
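The response times of i to the emails sent by j can be calculated as the difference between the timestamp of each email sent by i to j, $t^{ij}_m$, and that of the last unanswered email received from j, $t^{ji}_{\mathrm{last}(m)}$:
$$r^{ij}_m = t^{ij}_m - t^{ji}_{\mathrm{last}(m)}$$
The timestamp of the last unanswered email would correspond to the timestamp of the latest email received from j before $t^{ij}_m$:
$$t^{ji}_{\mathrm{last}(m)} = \max \{ t^{ji}_l : t^{ji}_l < t^{ij}_m \}$$
In a similar fashion, the response times by j to emails sent by i can be calculated as the difference between the timestamps of each sent email, $t^{ij}_m$, and the next received email, $t^{ji}_{\mathrm{next}(m)}$, if any:
$$q^{ij}_m = t^{ji}_{\mathrm{next}(m)} - t^{ij}_m$$
The timestamp of the next email $t^{ji}_{\mathrm{next}(m)}$ would be that of the first email from j received after each $t^{ij}_m$:
$$t^{ji}_{\mathrm{next}(m)} = \min \{ t^{ji}_l : t^{ji}_l > t^{ij}_m \}$$
An example of the calculation of response times is given in Table 6.
The average time of response by an individual i would be:
$$\bar{r}_i = \frac{1}{R_i} \sum_{j} \sum_{m} r^{ij}_m$$
while the average speed of responses received would be:
$$\bar{q}_i = \frac{1}{Q_i} \sum_{j} \sum_{m} q^{ij}_m$$
where $R_i$ and $Q_i$ are the numbers of response times of each type that are defined for i.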
The average speed of reply is, however, very sensitive to the period of analysis used and to the specific day or hour that a specific email was sent. A more suitable indicator of the relative importance of an individual in the network could be the share of outgoing emails that were answered within a certain time threshold.
The formulation that calculates the share of responses in the opposite direction within the threshold can be used as an indicator of an individual’s own average speed of response.
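The threshold indicator can be sketched as follows. The example pairs every sent email with the next email received from the same correspondent, restricts the calculation to bilateral exchanges, and assumes that timestamps are expressed in seconds; the function names and the toy records are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

DAY = 86_400  # seconds, assuming timestamps are in seconds

def reply_delays(sent_times, reply_times):
    """Delay from each sent email to the next email received in return (None if none)."""
    reply_times = sorted(reply_times)
    delays = []
    for t in sorted(sent_times):
        k = bisect_right(reply_times, t)          # first reply strictly after t
        delays.append(reply_times[k] - t if k < len(reply_times) else None)
    return delays

def response_share(records, person, threshold):
    """Share of a person's outgoing emails answered within `threshold` seconds."""
    sent, received = defaultdict(list), defaultdict(list)
    for s, r, t in records:
        if s == person:
            sent[r].append(t)
        elif r == person:
            received[s].append(t)
    delays = []
    for partner, times in sent.items():
        if partner in received:                   # bilateral exchanges only
            delays += reply_delays(times, received[partner])
    if not delays:
        return 0.0
    return sum(d is not None and d <= threshold for d in delays) / len(delays)

# Hypothetical exchange between individuals 1 and 2, plus one unanswered email to 3.
records = [(1, 2, 0), (1, 2, 100), (2, 1, 3_600),
           (1, 2, 90_000), (2, 1, 200_000), (1, 3, 5)]

share_24h = response_share(records, 1, DAY)
share_7d = response_share(records, 1, 7 * DAY)
print(share_24h, share_7d)
```

Two of the three emails from individual 1 to individual 2 are followed by an email in the opposite direction within 24 h, and all three within 7 d. Calling `response_share` from the correspondent's side gives the individual's own share of timely replies.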
Two different indicators are tested, with a 7 d and a 24 h threshold, respectively. For each indicator, three different models that explain the variation were developed:
Conventional model: independent variables include main statistics on individual email activity (number of emails sent and share of own replies within the threshold period) and standard symmetric centrality indicators;
Clustering model: independent variables include main statistics on individual email activity and directed clustering indicators; and
Extended Directional model: independent variables combine main statistics on individual email activity, directed centrality indicators, and directed clustering indicators.
6.1. Share of Outgoing Emails Responded Within 7 Days
The comparison of the three models that use a 7 d threshold is summarized in Table 7. The main statistics indicators that are significant in all three models are the number of emails sent and the individual’s own speed in replying. This suggests that there is a high degree of reciprocity in an individual’s email activity. The individuals in the network examined here who sent more emails, on average, received faster responses. At the same time, the individuals who reply fast also have their emails answered fast. Both variables suggest that the more active the role of the individual is in the system, the stronger the role is that the individual has in the network (at least as far as the dependent variable expresses such strength). Of course, it is possible that the causal relationship has the opposite direction, i.e., the faster an individual’s emails are answered, the more emails the individual sends and the faster the individual replies.
The conventional model uses the two variables above and the individual’s closeness centrality indicator. The relation is positive, meaning that the closer the individual is to the center of the network, the higher the share of the individual’s emails that are answered within 7 d. Neither degree nor betweenness appears as a significant variable, suggesting that the speed of replies is not a function of the number of individual connections nor of the number of shortest paths that an individual node forms part of.
The clustering model, which uses the directed clustering coefficients, suggests that three directed coefficients can be useful in interpreting an individual’s role in the network. There is a positive correlation with both the In- and Out-clustering coefficients and a negative correlation with the cycle clustering coefficient. This indicates that individuals who participate in triads in which all three nodes communicate with each other tend to have a more active role in the overall network. Conversely, if the individual acts simply as a middleman, i.e., is part of the weaker communication channel in a triad, the individual’s role in the network tends to be less active, at least measured in terms of the time taken for emails to be answered.
The Extended Directional model combines centrality and clustering indicators accounting, in both cases, for direction and weights. This model maintains the main independent variables of the other two approaches, rebalancing their respective estimates and resulting in a visible improvement in accuracy. The R² coefficient of the Extended Directional model is 0.8596, compared to 0.8468 and 0.8428 for the other two models, respectively. Closeness is considered in its directional version, which results in its weight in the model being split in two. The two directions are not symmetrical, though, with in-closeness centrality having a negative correlation that roughly counterbalances the positive impact of out-closeness centrality. The three clustering coefficients remain significant in the Extended Directional model, maintaining the direction of the influence, with small changes in the estimates. The estimates of the In- and Out-clustering coefficients converge to comparable levels, while the estimate for the Middleman coefficient decreases further.
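The comparison of R² values across models with different sets of explanatory variables can be illustrated with a small least-squares sketch. It assumes that the models are ordinary linear regressions, which the text does not state explicitly, and uses made-up numbers; `ols_fit` and `r_squared` are helper names introduced for the example.

```python
def ols_fit(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + [float(v) for v in x] for x in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
         + [sum(r[a] * yi for r, yi in zip(rows, y))] for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[c][p] / A[c][c] for c in range(p)]

def r_squared(X, y, beta):
    """Coefficient of determination of the fitted linear model."""
    pred = [beta[0] + sum(b * v for b, v in zip(beta[1:], x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Made-up data: one regressor versus the same regressor plus a second one.
y = [1.0, 2.9, 5.2, 6.8, 9.1]
X_small = [[0], [1], [2], [3], [4]]
X_large = [[0, 0], [1, 1], [2, 0], [3, 1], [4, 0]]

r2_small = r_squared(X_small, y, ols_fit(X_small, y))
r2_large = r_squared(X_large, y, ols_fit(X_large, y))
print(round(r2_small, 4), round(r2_large, 4))
```

Because adding regressors to a least-squares model cannot lower R², a gain of the size reported above is better read alongside the significance of the added variables than on its own.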
The difference when direction and weights are taken into account to explain variation is noticeable, and the accuracy of the model increases. The department that each individual belongs to does not appear as significant. The three standard graph theory indicators, Degree, Closeness, and Betweenness, appear to be inter-related, even in their directed version, and, consequently, only Closeness appears as significant.