Article

Estimation of Anonymous Email Network Characteristics through Statistical Disclosure Attacks

1
Group of Analysis, Security and Systems (GASS), Department of Software Engineering and Artificial Intelligence (DISIA), Faculty of Information Technology and Computer Science, Office 431, Universidad Complutense de Madrid (UCM), Calle Profesor José García Santesmases, 9, Ciudad Universitaria, Madrid 28040, Spain
2
Department of Convergence Security, Sungshin Women’s University, 249-1 Dongseon-dong 3-ga, Seoul 136-742, Korea
*
Author to whom correspondence should be addressed.
Sensors 2016, 16(11), 1832; https://doi.org/10.3390/s16111832
Submission received: 22 August 2016 / Revised: 24 October 2016 / Accepted: 26 October 2016 / Published: 1 November 2016
(This article belongs to the Special Issue Topology Control in Emerging Sensor Networks)

Abstract

Social network analysis aims to obtain relational data from social systems to identify leaders, roles, and communities in order to model profiles or predict specific behavior in a network of users. Preserving anonymity in social networks is a subject of major concern. Anonymity can be compromised by disclosing senders’ or receivers’ identity, message content, or sender-receiver relationships. Under strongly incomplete information, a statistical disclosure attack is used to estimate the network and node characteristics such as centrality and clustering measures, degree distribution, and small-world-ness. A database of email networks in 29 university faculties is used to study the method. The small-world-ness and power law characteristics of these email networks are also studied, which helps to understand the behavior of small email networks.

1. Introduction and Related Work

1.1. SNA and Email Networks

Social network analysis (SNA) has received growing attention in different areas. SNA aims to obtain relational data from social systems to identify leaders, roles, and communities in order to model profiles or predict specific behavior in a network of users. SNA models a network as nodes (individuals or organizations) and ties representing relationships among the nodes. Social relationships may take the form of real-world offline social networks (friendship, communications, transactions, etc.) or online social networks (Facebook, Twitter, etc.).
SNA has been applied in Information Science [1], Financial Crimes [2], Political Science [3], Sociological studies [4], Biology [5], Economics [6], and Intelligence Analysis [7].
With regard to the properties and structures on the network, SNA has been applied to study the structure of Internet graph topologies [8], telecommunication graphs [9], emails and social networks [10,11,12].
Analysis of incomplete social networks where some nodes or edges are missing and only a sample of them is available is a field studied in [13,14]. A particular issue in this field is link prediction where, given some information about present nodes and links, the challenge is to fill or predict missing links in the network. See [15,16] for two surveys on link prediction. The main contribution of this article is focused on email networks, as a subdomain of social networks. Studies of email networks cover spam analysis [14,17,18], virality studies [12], community characteristics [10], structure and properties analysis [8,12], and studies on the temporal evolution of email data [19].

1.2. Privacy in Communication Networks

Social networks, and in particular email networks, need to be anonymized in order to preserve the privacy of actors. Anonymity can be compromised by disclosing senders’ or receivers’ identity, message content, or sender-receiver relationships. Anonymity systems provide mechanisms to enhance user privacy and to protect computer systems. Research in this area focuses on developing, analyzing, and executing attacks on and defenses of anonymous communication networks.
Mixes [20] are considered the base for building high latency anonymous communication systems. A mix network aims to hide the correspondences between the items in its input and those in its output, changing the incoming packets’ appearance through cryptographic operations. Babel [21], Mixmaster [22], Mixminion [23], and Onion routing [24] are other anonymous communication designs.
The attacks against mix systems aim to reduce the anonymity by linking senders with the messages that they send, receivers with the messages that they receive or linking senders with receivers. There are intersection attacks based on traffic analysis. One of the first models to appear was called the Disclosure Attack [25]. In this first version the disclosure of links between users of a network was limited to relationships of one specific user with the rest of users in the domain, under an optimization approach.
Using the restrictive assumptions set in [25], the Statistical Disclosure Attack (SDA) was presented in [26]. In this attack, the information retrieved by the attacker is obtained through several rounds of communications, and statistical methods are employed to infer links between users. Other forms and derivations of the Statistical Disclosure Attack are presented in works [27,28,29,30]. In [31,32] a new statistical disclosure attack is presented that overcomes the restrictions usually considered in other methods and can be used under very general assumptions.
This article addresses the research problem of obtaining global and node characteristics of an anonymized email network through the application of a statistical disclosure attack. The techniques developed in [31,32] are used to infer email network characteristics under strongly incomplete information. It is shown that the attack yields moderately accurate estimators for measures such as average degree, power law coefficient, betweenness coefficient or small-world-ness coefficient. These estimations can be used to study the temporal evolution of the network, identify substantial changes over time, compare different networks, or identify important users or clusters of users. The database used, which comprises many networks, allows the scope and limitations of the method to be studied from a broad statistical perspective, observing patterns, error rates, and possible biases in the estimations. This is a novelty, since there is no previous statistical disclosure attack (SDA) research on estimating global network characteristics or node-based measures; existing SDA research generally focuses on a limited number of users and makes very restrictive assumptions about a priori knowledge of the network or user behavior.
Most analyses of email networks are restricted to a single email network; studies that treat many networks usually consider, besides email data, social network data, patent data, etc. When characteristics of email networks such as small-world-ness coefficient, density or degree distribution parameter are studied for only one network, it is difficult to obtain information about the real range and variability of these parameter values. In this context, statistical variability between different small email networks can only be estimated from an ‘article to article’ perspective. Parameter variability can then be due to the different sources. There could also be a bias because it is easier to publish papers that agree with the range of parameters reported in previous literature. In this case, variability could be underestimated. Also, with only one network, relationships between parameters cannot be studied.
In this article, there is not one, but a range of small email networks, each one collected at a different faculty of the same university. This makes it possible to study not only whether the parameter behavior agrees with previous research but also the range and variability of the parameters, as well as the relationships between them. This is an important novelty with respect to previous research on email network data.
The rest of the article is structured as follows: Section 2 presents structural properties and measures of social networks, focusing on the particular case of email networks and their scale free or small-world characteristics. A new email database of 29 email local university networks is introduced to explore the consistency of theory. Section 3 presents the problem of retrieving network characteristics under the situation of strongly incomplete information. A statistical disclosure attack method is used to estimate the network and node measures. The method is applied to the university email database in order to study its performance. Section 4 provides conclusions and future work.

2. Social Networks and Email Network Properties: Email Database

The most significant structural properties of social networks are first introduced. Social Networks can be directed or undirected, weighted or unweighted. An email network can be set as directed and weighted; for simplicity, some characteristics can be computed as for an undirected and/or unweighted network. Measures of interest of a network can be divided into node centered measures and global measures.
At node level the most important measures are:
  • Degree: The centrality degree of a node is the number of users or nodes that are directly related to it. Two nodes of a graph are adjacent or neighbors if there is a branch that connects them. In the case of directed graphs, there are two types of degrees: The input degree of a node is the number of arcs that end in it. The output degree of a node is the number of arcs that originate from it.
  • Betweenness centrality measure: It is equal to the number of shortest paths from all vertices to all others that pass through that node. The computation of shortest paths in a network can be carried out with algorithms such as Floyd-Warshall or Johnson’s. A node with high betweenness centrality has a large influence on the transfer of items or messages through the network.
  • Clustering coefficient: It is a metric that measures the extent to which the neighbors of a node are also interconnected. Here the Watts and Strogatz ([33]) local clustering coefficient is used. The clustering coefficient of a node $v$ is defined as
    $$C_v = \frac{2E_v}{k_v(k_v - 1)}$$
    where $k_v$ denotes the number of neighbors of $v$, $\frac{k_v(k_v - 1)}{2}$ the maximum number of edges that can exist between the neighbors, and $E_v$ the number of edges that actually exist.
  • Closeness: The degree of closeness is the ability of a node to reach all others in the network. A node is important if it is close to all others. The sum of the shortest path distances from the node $n_i$ to all the others is computed. The inverse of this sum is the closeness coefficient of node $n_i$:
    $$C_c(n_i) = \left[\sum_{j=1}^{g} d(n_i, n_j)\right]^{-1}$$
    where $d(n_i, n_j)$ is the shortest path distance between nodes $i$ and $j$, and $g$ is the number of nodes. While the betweenness coefficient measures the role of the node $n_i$ as a bridge between nodes, a normalized closeness coefficient (the coefficient above multiplied by $g - 1$, the number of nodes minus one) may be seen as the inverse of the average shortest path distance between the node $n_i$ and all the nodes connected to $n_i$.
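The node-level definitions above can be illustrated with a minimal Python sketch. It assumes an undirected, unweighted graph stored as a dictionary of neighbor sets; the function names and the toy graph are hypothetical and are not part of the original study. Clustering is reported as 0 for nodes with fewer than two neighbors, and closeness sums distances only over reachable nodes; both are conventions assumed here.

```python
from collections import deque


def local_clustering(adj, v):
    """Local clustering coefficient C_v = 2 E_v / (k_v (k_v - 1))."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: report 0 when the ratio is undefined
    # Count each edge between two neighbors of v exactly once.
    links = sum(1 for a in neighbors for b in neighbors if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))


def closeness(adj, v):
    """Inverse of the sum of shortest-path distances from v (BFS, unweighted)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())  # only nodes reachable from v contribute
    return 1.0 / total if total > 0 else 0.0


# Hypothetical toy graph: a triangle 0-1-2 plus a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
degree = {v: len(nbrs) for v, nbrs in adj.items()}
```

In this toy graph, node 2 has three neighbors of which only one pair is connected, so its clustering coefficient is 1/3, and its three distances of 1 give a closeness of 1/3.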
At network level, the most important measures are:
  • Degree distribution: The degree distribution $p(k)$ of a network is the fraction of nodes in the network with degree $k$. In a power law distribution, the fraction of nodes with degree $k$ is $p(k) \propto k^{-\alpha}$, where $\alpha$ is a constant exponent. Networks characterized by such a degree distribution are called scale-free networks. Many real networks such as the Internet topology [8], the Web [9] and on-line social networks [10] are often scale free.
  • Average path length: In small-world networks, any two nodes in the network are likely to be connected through a short sequence of intermediate nodes, and the network diameter shrinks as the network grows [19].
  • Average clustering coefficient of a network: The average clustering coefficient of a social network shows to what extent friends of a person are also friends with each other [33].
  • Density: The density $D$ of a network is defined as the ratio of the number of edges $E$ to the number of possible edges, given by
    $$D = \frac{2E}{N(N - 1)}$$
    where $N$ is the number of nodes, for undirected graphs, and by $D = \frac{E}{N(N - 1)}$ for directed graphs.
  • Small-world-ness coefficient: The concept of “small-world-ness” as a property is exposed in [33], and characterizes networks with a high clustering coefficient (meaning by “high” as much higher than its equivalent in a random Erdos-Renyi network) and mean shortest path length similar to its equivalent in a random Erdos-Renyi network. Email users tend to form groups and the average shortest distance is small, leading to the small-world property.
In [34], a coefficient for measuring the small-world-ness of a network is presented and used as reference in this work. This small-world-ness coefficient is expressed as the ratio between γg and λg, where γg is the ratio between the average clustering network coefficient and the clustering coefficient of the network under the equivalent random Erdos-Renyi network, and λg is the ratio between the average shortest path length of the network and the average shortest path length of the equivalent random Erdos-Renyi network.
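As a rough illustration of the two global measures just defined, the following Python sketch computes the density and the small-world-ness ratio γg/λg. The function names are hypothetical. The helper `random_graph_reference` uses the common analytic approximations for an Erdos-Renyi graph (clustering close to mean degree over number of nodes, path length close to ln n / ln of the mean degree); this is an assumption made for the example, and the reference values in the study may have been obtained differently.

```python
import math


def density(num_nodes, num_edges, directed=False):
    """D = 2E / (N (N - 1)) for undirected graphs, E / (N (N - 1)) for directed."""
    possible = num_nodes * (num_nodes - 1)
    if possible == 0:
        return 0.0
    return num_edges / possible if directed else 2.0 * num_edges / possible


def small_world_ness(c_g, l_g, c_r, l_r):
    """Ratio gamma / lambda, with gamma = C_g / C_r and lambda = L_g / L_r."""
    gamma = c_g / c_r
    lam = l_g / l_r
    return gamma / lam


def random_graph_reference(num_nodes, mean_degree):
    """Assumed approximations for the equivalent Erdos-Renyi graph (C_r, L_r)."""
    c_r = mean_degree / num_nodes
    l_r = math.log(num_nodes) / math.log(mean_degree)
    return c_r, l_r
```

For example, a hypothetical network with Cg = 0.5 and Lg = 3.0, compared against Cr = 0.05 and Lr = 2.5, gives γg = 10, λg = 1.2 and a coefficient of about 8.3.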
Email networks have generally been claimed to be scale free, that is, their degree distribution follows a power law distribution. See for example [12,19,35]. The scale free nature of email data implies that a few nodes have high degrees (many friends) while many nodes have small degrees.
In order to assess the scale free model for the degree distribution, log-log graphs or estimation of the power constant followed by a goodness of fit test are used. In this article, the goodness of fit method, through the use of a Kolmogorov Smirnov statistic, is applied.
There are several email network studies in the literature. In [35], scale-free and small-world properties of an email network at Kiel University are studied. The Enron email database structure is studied in [18]. In [17], an email network from National Taiwan University is analyzed to study the temporal evolution of the email network of an EU research institution. In [19], we also see a focus on the temporal evolution of a large US University email network.
The email database used in this study is a group of 29 independent email networks from Madrid Complutense University. This is the same database as used in [32]. Each network corresponds to a University Faculty and contains anonymized emails retrieved over one year between users of the department. Institutional emails, and emails leaving the Faculty network or arriving from outside the Faculty, are not considered. Only the time of sending and the senders’ and receivers’ anonymized IDs are kept. The textual content and headers of the emails are deleted.
Studies on email network data may restrict the problem to a closed domain (considering only messages sent and received within a domain) or be open in the sense that messages outgoing from the domain and received from outside are also considered.
Most studies (see for example [18,19,35,36]) belong to the first category; others also consider out-domain emails ([19]). Most studies come from university or research institution servers.
Our study follows the norm; each network is restricted to the closed domain and the data come from a University server. The contribution here is that there are many separate, similar networks (one per faculty), which allows patterns to be studied and previous research on this kind of data to be contrasted from a broader statistical view.
For simplicity and in order to avoid ambiguity, graph construction follows [16,19], creating an undirected and unweighted graph.
Table 1 presents the main measures of the 29 faculty networks. The smallest network has only 8 nodes and 23 edges, and the biggest has 622 nodes and 8839 edges. The average degree ranges from 2.88 to 14.5. In 96% of the networks the average degree is greater than 4, in agreement with previous research on small-world networks, which reports average degrees higher than 4. The average betweenness is in the range (4.71, 2865).
It is shown in [34,37] that as density increases, the clustering coefficient increases and the mean shortest path length approaches the mean shortest path length of the equivalent random Erdos-Renyi network. Thus high density networks would be trivially small-world under the WS concept. It is advisable to take density into consideration when comparing small-world behavior and coefficients between networks. As [37] remarks, density should be lower than 0.4 in order to consider small-world-ness properties without the confounding effect of high density. Here the density range is very low: 80% of the networks show a density coefficient lower than 0.10, lying in the range of the other email networks studied in the literature. In Figure 1, the relationship between nodes and edges is represented. When transforming to log scale the relation is approximately linear, almost proportional, showing that
$$\log(\text{edges}) \approx k \times \log(\text{nodes})$$
That is, $\text{edges} \approx \text{nodes}^{k}$. With the data presented here, $k$ can be estimated by regression without intercept, giving k = 11.14, with regression $R^2$ = 0.85.
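A regression without intercept on the log-log scale has the closed-form slope k = Σxy / Σx², with x = log(nodes) and y = log(edges). The sketch below shows that computation in Python. The node and edge counts are hypothetical values generated from an exponent of 1.4 purely for illustration; they are not the faculty data and the result is not meant to reproduce the estimate reported in the text.

```python
import math


def slope_through_origin(xs, ys):
    """Least-squares slope of y = k * x with no intercept: sum(xy) / sum(x^2)."""
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        raise ValueError("all x values are zero; slope is undefined")
    return sum(x * y for x, y in zip(xs, ys)) / sxx


# Hypothetical networks (not the paper's data): edges generated as nodes ** 1.4.
nodes = [10, 50, 200, 600]
edges = [round(n ** 1.4) for n in nodes]

log_nodes = [math.log(n) for n in nodes]
log_edges = [math.log(e) for e in edges]
k_hat = slope_through_origin(log_nodes, log_edges)
```

On these synthetic values the fitted slope recovers an exponent close to the 1.4 used to generate them.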
Scale free behavior of the networks is evident. A Kolmogorov-Smirnov test using bootstrap data, presented in [38], is applied in order to check whether the degree distribution in each network follows a power law distribution. The null hypothesis of a power law distribution is rejected only for networks 11, 15, 22 and 27. The other 25 networks fit a power law distribution well, that is, $P(k) \propto k^{-\alpha}$.
The estimation of $\alpha$ is achieved through maximum likelihood, following [38]. The estimates lie in the range (1.43, 2.1). These values are similar to those found in [19], where data is limited to internal nodes of the closed network. Figure 2 and Table 2 show the degree distribution in log scale for faculty 6, and how it fits to a line.
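To make the maximum likelihood step concrete, here is a minimal Python sketch of a widely used approximate estimator for discrete power law data, $\hat{\alpha} \approx 1 + n \left[\sum_i \ln \frac{k_i}{k_{\min} - 1/2}\right]^{-1}$. Whether the study used exactly this approximation, and which lower cutoff was chosen, is not stated in the text; the function name and the `k_min` parameter are assumptions made for illustration, and the goodness of fit test is not reproduced here.

```python
import math


def powerlaw_alpha_mle(degrees, k_min=1):
    """Approximate ML estimate of the power law exponent for integer degrees.

    Only degrees >= k_min are used; k_min must be at least 1.
    """
    if k_min < 1:
        raise ValueError("k_min must be >= 1")
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum
```

A degree sequence with a heavier tail (more high-degree nodes) produces a smaller estimated exponent, which matches the intuition that a smaller α means a slower decay of p(k).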

Small-World Behavior

Small-world networks are characterized by a high cluster coefficient Cg with respect to the equivalent random Erdos-Renyi network coefficient Cr, and a path length Lg similar to that of the equivalent random Erdos-Renyi network, Lr. Gamma and lambda values measure the quotients between each pair of coefficients. If the email networks considered here are small-world networks, lambda values near 1 and high gamma values are to be expected. In [34] a small-world-ness coefficient is presented, constructed as the quotient between gamma and lambda values. A network can be considered small-world if this coefficient is higher than 2, which is the case in 27 of the 29 faculties.
In [34] it is pointed out that there is a linear relationship between small-world-ness coefficient and number of nodes. Figure 3a illustrates this fact. Also, it can be seen in Table 3 and Figure 3a that the faculties with the smallest small-world-ness coefficient are those with fewer nodes, which suggests that their small coefficient is mainly due to their size and not to their structure in terms of shortest path and clustering.
Figure 3b also makes it possible to detect network 22, which shows unusual behavior (higher Lg than expected). This network is also the one with the highest shortest path, clustering coefficient, and small-world-ness coefficient. Figure 3c shows that mean shortest path and clustering coefficient have a particular relationship. It seems to be increasing up to a certain clustering coefficient and then decreasing when the number of nodes is small and the clustering coefficient is higher. It is possible that the clustering coefficient Cg behaves differently when the number of nodes is too small.
The regression slope (with case 22 deleted) for Cg < 0.35 is positive, b = 1.55, while the regression slope for Cg > 0.35 is negative, b = −4.29. Previous results on this relationship have not been found in the literature. Faculty 22 is a special case (outlier), as has been clear in other figures and tables.
Other characteristics are proportional to the number of nodes. Figure 3b shows that
$$L_g \approx k \times \log(\text{nodes})$$

3. Estimation of Email Network Characteristics through Statistical Disclosure Attacks

Privacy in communication networks can be compromised by statistical disclosure attacks. In this section it is shown how the method developed in [31,32] can be used to disclose user relationships (that is, existing and non-existing edges) in the network structure. Starting from very limited information, edges are inferred and users’ centrality measures and network global measures are estimated. This makes it possible to detect high centrality nodes and characteristics of the network, and establishes the basis for studying network evolution with respect to global measures such as density, average degree, or average betweenness, and also node-based measures, when the attack is repeated at different time points.
The framework is the usual one in statistical disclosure attacks on network communications: The information retrieved by the attacker is the number of messages sent and received by each user. This information is obtained in rounds that can be determined by equal length intervals of time, or alternatively by equal-sized batches of messages. The method is currently restricted to the simple mix, where messages are grouped in batches at each round and then anonymously relayed, but it can easily be extended to random threshold mixes, where the batch size can be random, or pool mixes, where some messages are randomly selected and not relayed in each round. No prior restriction is placed on the number of friends any user has, or on the distribution of messages sent. Both are considered unknown.
The attacker controls all users in the system. In our real data application, the attack targets all email users of each network domain.
In each round, the attacker obtains a contingency table that represents messages sent from each user (rows) to each receiver (columns). Marginal row and column totals are known, and they represent the total number of messages sent and received by each user. However, the attacker does not know the pair (sender-receiver) for each message.
Table 4 represents a simplified version of one of these tables, retrieved in one round. There are many ways of filling the table elements so that they sum up to the marginals. Optimization algorithms (branch and bound) are generally slow and result in a very limited range of solutions. A very fast algorithm based on iterative random generation is used in [31] in order to obtain a large number of solutions (if not all) for each round. This information is used to order pairs of users from highest to lowest probability of relationship and finally obtain a classification result that aims to detect whether a given pair of users has communicated. In [32], a refinement of the method is used, based on the EM (Expectation-Maximization) algorithm, which significantly improves the relationship predictions.
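The idea of generating many tables that are consistent with the observed marginals can be sketched in Python as follows. This is not the algorithm of [31] or the EM refinement of [32]; it is a simple stand-in that pairs senders and receivers at random, so tables are not sampled uniformly and self-messages are not excluded. The function names are hypothetical. The second function shows how the generated tables could be summarized as the share of tables in which each sender-receiver cell is non-zero, which gives one possible ordering of pairs.

```python
import random


def random_feasible_table(row_totals, col_totals, rng=random):
    """Return one table of message counts whose margins match the given totals."""
    if sum(row_totals) != sum(col_totals):
        raise ValueError("sent and received totals must agree")
    # One entry per message: who sent it, and (separately) who received one.
    senders = [i for i, r in enumerate(row_totals) for _ in range(r)]
    receivers = [j for j, c in enumerate(col_totals) for _ in range(c)]
    rng.shuffle(receivers)  # random assignment of receivers to sent messages
    table = [[0] * len(col_totals) for _ in row_totals]
    for i, j in zip(senders, receivers):
        table[i][j] += 1
    return table


def pair_link_frequency(row_totals, col_totals, num_tables=1000, seed=0):
    """Share of generated tables in which each (sender, receiver) cell is > 0."""
    rng = random.Random(seed)
    counts = [[0] * len(col_totals) for _ in row_totals]
    for _ in range(num_tables):
        table = random_feasible_table(row_totals, col_totals, rng)
        for i, row in enumerate(table):
            for j, cell in enumerate(row):
                if cell > 0:
                    counts[i][j] += 1
    return [[c / num_tables for c in row] for row in counts]
```

A user who sent nothing in a round always gets a zero row, so that round carries no evidence of outgoing links for that user; evidence accumulates only across many rounds.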
In the network paradigm, the objective is to reconstruct the global network in the horizon of study, where nodes represent users and edges represent existing communication between pairs of users. The information employed to estimate the whole network is the incomplete information obtained in each round that can be seen itself as corresponding to a partial network with incomplete information.
The method leads to a final estimated network with its own measures and characteristics, that can be used as estimates of the real network measures. Besides, each user’s centrality measures can also be estimated.
As explained in [31,32], the performance of the attack is affected by the following aspects:
(1) The number of nodes. As the number of nodes increases, the complexity of the round tables and the number of feasible tables increase, which negatively affects the performance of the attack.
(2) The percentage of existing edges over possible edges. As communication increases (more edges), the attack precision decreases.
(3) The mean frequency of messages per round (sum of weights in the weighted network associated with the round). This is directly related to the batch size, and when it increases, the performance is negatively affected.
(4) The number of rounds. As the number of rounds increases, the performance of the attack improves, since more information is available.
(5) The number of feasible tables generated per round. This affects computing time, and it is necessary to study to what extent generating many tables is useful. This number can be variable. Usually, once a high number of tables has been generated (about 300,000 tables per round in our tests), there is no gain in generating more tables.
The problem of estimating network characteristics from incomplete network data is often addressed in the scientific literature under the heading ‘link prediction’. Personal information about individuals is generally used to predict relationships. Studies where only structural information is used are rare. An exception is [13], where network characteristics are estimated in networks where only some nodes and links are known. This is a different context from the one addressed here, where all links are unknown and information is obtained in a multiple round framework.
The data presented in Section 2 allows us to study the precision and behavior of the attack when estimating user and network measures. Since there are 29 networks, homogeneous to some extent, the sensitivity of the estimators to scale (number of edges) and other factors can also be observed.
For each of the faculty domains, construction of the attack follows the pattern below:
(1) Structure data in rounds. Messages are ordered by time and grouped by batch size B, forming rounds (each group of B messages is a round, which leads to a table similar to Table 4). In a real situation, this is the information the attacker is able to obtain.
(2) Run the version of the attack algorithm presented in [32] and obtain an estimate of the adjacency matrix of the network; that is, an estimate of the whole network.
(3) Compute node centrality measures and network characteristics for the estimated network.
To apply this pattern, a batch size must be chosen, as must the number of tables generated for each round and the number of iterations of the EM algorithm. Here a batch size of 15 is used, 500,000 tables per round are generated, and 5 EM iterations are run.
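The first step of the pattern, structuring the data in rounds, can be sketched in Python as below. The record layout `(timestamp, sender_id, receiver_id)` and the function name are assumptions made for the example; the sketch only produces the per-round marginals (messages sent and received by each user), which is the information the attacker is assumed to observe. A final round with fewer than B messages is kept here, which is a choice of this sketch.

```python
def build_rounds(messages, batch_size):
    """Group time-ordered messages into rounds and return per-round marginals.

    messages: list of (timestamp, sender_id, receiver_id) tuples.
    Returns a list of (sent_counts, received_counts) dictionaries, one per round.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    ordered = sorted(messages, key=lambda m: m[0])
    rounds = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        sent, received = {}, {}
        for _, sender, receiver in batch:
            # The sender-receiver pairing is discarded; only totals are kept.
            sent[sender] = sent.get(sender, 0) + 1
            received[receiver] = received.get(receiver, 0) + 1
        rounds.append((sent, received))
    return rounds
```

With the batch size of 15 used in the study, the call would be `build_rounds(messages, 15)`; each returned pair corresponds to the row and column totals of one round table.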
Centrality measures are computed for each node of each faculty network. Estimation error increases with size: error in degree or betweenness estimation is higher for nodes with higher degree or betweenness. Figure 4 represents the relationship between the estimates of betweenness and node degree and their respective true values. In general, estimates are within the expected range. There is a slight but clear bias in both estimations: node degrees are slightly overestimated by the attack, whereas betweenness is slightly underestimated. Uncertainty in round tables leads to the overestimation of edges (relationships) in the network, which results in higher degree values.
Figure 5 represents graphically the relationships between estimates and the true global values, for number of edges, average degree, and average betweenness.
A line y = x is also represented to study possible biases. Some observations can be made:
  • Estimation error, as expected, generally increases with network size (number of nodes).
  • Number of edges for each network is slightly overestimated.
  • Average degree is slightly overestimated, in concordance with user degree estimations.
  • Average betweenness is slightly underestimated.
With respect to scale free behavior, the estimates of the power distribution parameter are slightly overestimated but within the true range, as is illustrated in Figure 6a.
Small-world characteristics are also estimated. Figure 6b–d present the relationship between the estimates and respective true values for Lg, Cg, and small-world coefficient. Mean shortest path Lg is slightly underestimated, whereas Cg and small-world-ness coefficient are slightly overestimated for large networks. With respect to the cutpoint of 2 for a small-world-ness coefficient, the estimator declares as small-world the same networks as the true value.
As has been observed, the estimations often have some bias. However, this bias is not large and the overall estimation is accurate to a certain extent. It is known that the Mean Squared Error (MSE) of an estimator can be decomposed into two quantities:
$$\mathrm{MSE} = \mathrm{bias}^2(\mathrm{est}) + \mathrm{Var}(\mathrm{est})$$
If both values are low in terms of the scale of the estimated quantity, the estimator is considered accurate. In Table 5 mean bias is computed for each network and parameter, as the mean of the differences between the estimate and the true value. Also, Mean Absolute Error (MAE) and CV are computed, CV representing MAE in percent over the mean value of the quantity estimated.
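The three error summaries described above can be written as a short Python sketch. The function name and the example numbers are hypothetical; the definitions follow the text: bias is the mean of estimate minus true value, MAE is the mean absolute difference, and CV is MAE expressed as a percentage of the mean true value.

```python
def error_summary(estimates, true_values):
    """Return (mean bias, MAE, CV in percent) for paired estimates and true values."""
    n = len(true_values)
    if n == 0 or n != len(estimates):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [e - t for e, t in zip(estimates, true_values)]
    bias = sum(diffs) / n                      # positive means overestimation
    mae = sum(abs(d) for d in diffs) / n
    mean_true = sum(true_values) / n
    cv = 100.0 * mae / mean_true               # MAE relative to the mean true value
    return bias, mae, cv


# Hypothetical values, for illustration only.
bias, mae, cv = error_summary([11, 19, 33], [10, 20, 30])
```

In this toy case the differences are 1, −1 and 3, so the bias is 1, the MAE is 5/3, and the CV is about 8.3%.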
Estimation has much higher error at node level. At network level, estimators are satisfactory in general, within a controlled range of the quantity of interest. The limitations of the information used should be kept in mind (all links are unknown a priori). At the network level, the small-world-ness coefficient has the highest error in percent.
As pointed out, batch size significantly affects the performance of the attack. The attacker is limited to the number of times he can access information in batches. Figure 7 shows how the error in estimating the number of edges increases with batch size for all the faculties. As the number of edges is linearly or log-linearly related to most of the network measures, this has a direct consequence on the estimators’ errors.

4. Conclusions

Email networks are a particular case of interest in the field of Social Networks. Although these networks can share some characteristics with other social network systems, they have specific behaviors. In this work, a group of email networks with small size and low density is treated. It is usual in the literature to mix networks of different types in the same analysis, and sometimes the relationships between the characteristics studied can be masked. With the data used in this article, homogeneity makes the findings about relationships and characteristics more reliable. While most of them are already known, it is interesting to observe them in a controlled context. Some observed findings are: a power relationship between edges and nodes, scale free behavior of almost all the email networks, small-world-ness behavior of almost all the email networks, a linear relationship between small-world-ness and number of nodes, and a linear relationship between shortest path and the logarithm of the number of nodes.
Compromising the privacy of network communications is the aim of statistical disclosure attacks. Attacks can be developed to address information at multiple levels. In previous work, the aim of attack was to obtain diagnostics for relationships between each pair of users, detect communications, and classify each pair of users as linked or not. This work aims to obtain global characteristics of each node (centrality measures) that can be defined as a second level of privacy, and global characteristics of the network, as a third level of privacy. The utility of obtaining these measures can be exploited in two aspects: static information about special groups of nodes or comparing networks in different but homogeneous domains, and dynamic successive estimations over time. Characteristic measures of nodes and global network measures can be estimated at different points of time and will serve to study evolutions of nodes or networks.
Accuracy of the estimations is moderate, with some controlled bias in some measures. Estimators of the user characteristics have a higher error. In general, the range of estimations is similar to the objective range and results are considered encouraging due to the very limited information used. The attack is very affected by the batch size used; as batch size increases, accuracy decreases.
The method used here in order to disclose and estimate network characteristics under very limited information may be improved if further information is available. If, for example, some links are already known, this information can be incorporated under the Bayesian paradigm to the basic algorithm results in order to refine estimations. The same approach could be used if one or more rounds are completely known in advance. This added information can also be used to apply bias correction to estimates that are slightly biased, such as average between and average degree estimates.
Even if there is no more information available, there is still space for improvement. There are other statistical disclosure methods such as the least squares method ([39]) that can be combined with the algorithm adopted here in order to refine results and correct biased estimates. Also, studying the network evolution over time may help to understand the estimation of network characteristics. Finally, the disclosure attack used here should be extended to more complex anonymous systems such as onion routing.

Acknowledgments

This work was funded by the European Commission Horizon 2020 Program under Grant Agreement number H2020-FCT-2015/700326-RAMSES (Internet Forensic Platform for Tracking the Money Flow of Financially Motivated Malware).

Author Contributions

J. Portela, L.J. García Villalba, A.G. Silva Trujillo and A.L. Sandoval Orozco are the authors who mainly contributed to this research, performing experiments, analysis of the data and writing the manuscript. T.-H. Kim analyzed the data and interpreted the results. All authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Otte, E.; Rousseau, R. Social network analysis: A powerful strategy, also for the information sciences. J. Inf. Sci. 2002, 28, 441–453.
  2. Rostami, A.; Mondani, H. The Complexity of Crime Network Data: A Case Study of Its Consequences for Crime Control and the Study of Networks. PLoS ONE 2015, 10, e0119309.
  3. Huckfeldt, R. Interdependence, Density Dependence, and Networks in Politics. Am. Politics Res. 2009, 37, 921–950.
  4. Freeman, L.C. The Development of Social Network Analysis: A Study in the Sociology of Science; Empirical Press: Vancouver, BC, Canada, 2014.
  5. Saragiotto, J.; Castro, N.; Teixeira, T.; Mariscal, L.M.; Saraiva, A. Social Network Analysis Metrics and Their Application in Microbiological Network Studies. Stud. Comput. Intell. 2014, 549, 251–260.
  6. Jackson, M. An Overview of Social Networks and Economic Applications; Handbook of Social Economics; Elsevier: Amsterdam, The Netherlands, 2011; Volume 1, pp. 511–585.
  7. Bright, D.; Hughes, C.; Chalmers, J. Illuminating dark networks: A social network analysis of an Australian drug trafficking syndicate. Crime Law Soc. Chang. 2012, 57, 151–176.
  8. Faloutsos, M.; Faloutsos, P.; Faloutsos, C. On power-law relationships of the internet topology. In ACM SIGCOMM Computer Communication Review; ACM: New York, NY, USA, 1999; Volume 29.
  9. Broder, A.; Kumar, R.; Maghoul, F.; Raghavan, P.; Rajagopalan, S.; Stata, R.; Tomkins, A.; Wiener, J. Graph structure in the web. Comput. Netw. 2000, 33, 1–6.
  10. Mislove, A.; Marcon, M.; Gummadi, K.P.; Druschel, P.; Bhattacharjee, B. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, San Diego, CA, USA, 23–26 October 2007; pp. 29–42.
  11. Kumar, R.; Novak, J.; Tomkins, A. Structure and evolution of online social networks. In Link Mining: Models, Algorithms, and Applications; Springer: New York, NY, USA, 2010; pp. 337–357.
  12. Weng, L.; Menczer, F.; Ahn, Y. Virality prediction and community structure in social networks. Sci. Rep. 2013, 3, 2522.
  13. Bliss, C.A.; Danforth, C.M.; Dodds, P.S. Estimation of global network statistics from incomplete data. PLoS ONE 2014, 9, e108471.
  14. Guimerà, R.; Sales-Pardo, M. Missing and spurious interactions and the reconstruction of complex networks. Proc. Natl. Acad. Sci. USA 2009, 106, 22073–22078.
  15. Lü, L.; Zhou, T. Link prediction in complex networks: A survey. Phys. A Stat. Mech. Appl. 2011, 390, 1150–1170.
  16. Wang, P.; Xu, B.W.; Wu, Y.R.; Zhou, X.Y. Link prediction in social networks: The state-of-the-art. Sci. China Inf. Sci. 2015, 58.
  17. Tseng, C.; Chen, M. Incremental SVM model for spam detection on dynamic email social networks. In Proceedings of the International Conference on Computational Science and Engineering, Vancouver, BC, Canada, 29–31 August 2009; pp. 128–135.
  18. Lam, H.; Yeung, D. A learning approach to spam detection based on social networks. In Proceedings of the 4th Conference on Email and Anti-Spam, Mountain View, CA, USA, 2–3 August 2007; p. 35.
  19. Leskovec, J.; Kleinberg, J.; Faloutsos, C. Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 2007, 1, 2.
  20. Chaum, D.L. Untraceable electronic mail, return addresses, and digital pseudonyms. Commun. ACM 1981, 24, 84–88.
  21. Gulcu, C.; Tsudik, G. Mixing E-mail with Babel. In Proceedings of the Symposium on Network and Distributed System Security, San Diego, CA, USA, 22–23 February 1996; pp. 2–16.
  22. Moller, U.; Cottrell, L.; Palfrader, P.; Sassaman, L. Mixmaster Protocol Version 2. Internet Draft draft-sassaman-mixmaster-03, Internet Engineering Task Force. 2005. Available online: http://tools.ietf.org/html/draft-sassaman-mixmaster-03 (accessed on 9 February 2015).
  23. Danezis, G.; Dingledine, R.; Mathewson, N. Mixminion: Design of a type III anonymous remailer protocol. In Proceedings of the Symposium on Security and Privacy, Oakland, CA, USA, 11–14 May 2003; pp. 2–5.
  24. Dingledine, R.; Mathewson, N.; Syverson, P. Tor: The second generation onion router. In Proceedings of the 13th USENIX Security Symposium, San Diego, CA, USA, 9–13 August 2004; pp. 303–320.
  25. Agrawal, D.; Kesdogan, D. Measuring anonymity: The disclosure attack. IEEE Secur. Priv. 2003, 1, 27–34.
  26. Danezis, G.; Serjantov, A. Statistical disclosure or intersection attacks on anonymity systems. In Proceedings of the 6th International Conference on Information Hiding, Toronto, ON, Canada, 23–25 May 2004; pp. 293–308.
  27. Mathewson, N.; Dingledine, R. Practical Traffic Analysis: Extending and Resisting Statistical Disclosure. In Proceedings of Privacy Enhancing Technologies Workshop, Toronto, ON, Canada, 26–28 May 2004; pp. 17–34.
  28. Danezis, G.; Diaz, C.; Troncoso, C. Two-sided statistical disclosure attack. In Proceedings of the 7th International Conference on Privacy Enhancing Technologies, Ottawa, ON, Canada, 20–22 June 2007; pp. 30–44.
  29. Troncoso, C.; Gierlichs, B.; Preneel, B.; Verbauwhede, I. Perfect Matching Disclosure Attacks. In Proceedings of the 8th International Symposium on Privacy Enhancing Technologies, Leuven, Belgium, 23–25 July 2008; pp. 2–23.
  30. Danezis, G.; Troncoso, C. Vida: How to Use Bayesian Inference to De-anonymize Persistent Communications. In Proceedings of the 9th International Symposium on Privacy Enhancing Technologies, Seattle, WA, USA, 5–7 August 2009; pp. 56–72.
  31. Portela, J.; Garcia Villalba, L.J.; Silva Trujillo, A.G.; Sandoval Orozco, A.L.; Kim, T.-H. Extracting association patterns in network communications. Sensors 2015, 15, 4052–4071.
  32. Portela, J.; Garcia Villalba, L.J.; Silva Trujillo, A.G.; Sandoval Orozco, A.L.; Kim, T.-H. Disclosing User Relationships in Email Networks. J. Supercomput. 2016, 72, 3787–3800.
  33. Watts, D.J.; Strogatz, S. Collective dynamics of ‘small-world’ networks. Nature 1998, 393, 440–442.
  34. Humphries, M.D.; Gurney, K. Network ‘Small-World-Ness’. A quantitative method for determining canonical network equivalence. PLoS ONE 2008, 3, e0002051.
  35. Ebel, H.; Mielsch, L.; Bornholdt, S. Scale-free topology of e-mail networks. Phys. Rev. E 2002, 66, 035103.
  36. Kossinets, G.; Watts, D.J. Empirical analysis of an evolving social network. Science 2006, 311, 88–90.
  37. Feldt, S.; Bridgeford, E.; Bassett, D.S. Small-World Propensity and Weighted Brain Networks. Sci. Rep. 2016, 6, 22057.
  38. Clauset, A.; Shalizi, C.R.; Newman, M.E.J. Power-law distributions in empirical data. SIAM Rev. 2009, 51, 661–703.
  39. Perez-Gonzalez, F.; Troncoso, C.; Oya, S. A Least Squares Approach to the Static Traffic Analysis of High-Latency Anonymous Communication Systems. IEEE Trans. Inf. Forensics Secur. 2014, 9, 1341.
Figure 1. (a) Relationship between edges and nodes; (b) Relationship between edges and nodes in logarithmic scale.
Figure 2. Log-log plot for Faculty 9 degree distribution.
Figure 3. (a) Relationship between small-world-ness coefficient and number of nodes; (b) Relationship between shortest path and number of nodes; (c) Relationship between Lg and Cg.
Figure 4. Relationship between estimates of (a) betweenness and (b) node degree, and their corresponding real values, for Faculty 16.
Figure 5. Estimation of (a) average degree; (b) average betweenness; (c) number of edges.
Figure 6. (a) Estimation of power distribution parameter; (b) Estimation of Lg; (c) Estimation of Cg; (d) Estimation of small-world-ness coefficient.
Figure 7. Absolute error in number of edges estimation versus batch size.
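Table 1. Faculty network centrality measures.
| Faculty | Nodes | Edges | Average degree | Average betweenness | Density |
|---|---|---|---|---|---|
| 1 | 8 | 23 | 2.88 | 4.71 | 0.41 |
| 2 | 37 | 149 | 4.03 | 80.36 | 0.11 |
| 3 | 50 | 348 | 6.96 | 51.92 | 0.14 |
| 4 | 53 | 221 | 4.17 | 100.00 | 0.08 |
| 5 | 75 | 743 | 9.91 | 94.47 | 0.13 |
| 6 | 76 | 784 | 10.32 | 80.76 | 0.14 |
| 7 | 79 | 687 | 8.70 | 101.06 | 0.11 |
| 8 | 84 | 637 | 7.58 | 109.71 | 0.09 |
| 9 | 135 | 872 | 6.46 | 223.64 | 0.05 |
| 10 | 140 | 852 | 6.09 | 286.93 | 0.04 |
| 12 | 171 | 918 | 5.37 | 394.37 | 0.03 |
| 13 | 343 | 4972 | 14.50 | 578.58 | 0.04 |
| 14 | 407 | 5024 | 12.34 | 728.92 | 0.03 |
| 15 | 429 | 5195 | 12.11 | 783.19 | 0.03 |
| 17 | 447 | 5938 | 13.28 | 916.15 | 0.03 |
| 11 | 159 | 822 | 5.17 | 318.34 | 0.03 |
| 16 | 438 | 5010 | 11.44 | 881.89 | 0.03 |
| 18 | 466 | 3345 | 7.18 | 998.50 | 0.02 |
| 19 | 475 | 4104 | 8.64 | 1034.46 | 0.02 |
| 20 | 477 | 5456 | 11.44 | 942.73 | 0.02 |
| 21 | 491 | 4249 | 8.65 | 1206.01 | 0.02 |
| 22 | 492 | 2143 | 4.36 | 2865.47 | 0.01 |
| 23 | 553 | 8100 | 14.65 | 1083.43 | 0.03 |
| 24 | 559 | 6688 | 11.96 | 1130.17 | 0.02 |
| 25 | 571 | 7606 | 13.32 | 1024.70 | 0.02 |
| 26 | 601 | 6657 | 11.08 | 1292.48 | 0.02 |
| 27 | 615 | 5858 | 9.53 | 1193.28 | 0.02 |
| 28 | 616 | 7435 | 12.07 | 1457.24 | 0.02 |
| 29 | 622 | 8839 | 14.21 | 1203.88 | 0.02 |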
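
The average degree and density columns of Table 1 appear to be consistent with edges divided by nodes and with the directed-graph density E / (N (N − 1)). This reading is inferred from the tabulated values and is an assumption, since the definitions are not restated here. A minimal Python sketch that recomputes both columns for three faculties (the function names are hypothetical):

```python
def average_degree(nodes: int, edges: int) -> float:
    """Edges per node (assumed definition of the average degree column)."""
    return edges / nodes


def directed_density(nodes: int, edges: int) -> float:
    """Fraction of the N * (N - 1) ordered node pairs that carry an edge (assumed)."""
    return edges / (nodes * (nodes - 1))


# (faculty, nodes, edges) copied from Table 1
rows = [(1, 8, 23), (9, 135, 872), (29, 622, 8839)]

for faculty, n, e in rows:
    print(f"Faculty {faculty}: average degree {average_degree(n, e):.3f}, "
          f"density {directed_density(n, e):.3f}")
```

Rounded to two decimals, the recomputed values can be compared directly with the corresponding rows of the table.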
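Table 2. Faculty scale free characteristics.
| Faculty | Nodes | Edges | Power estimate | p-Value |
|---|---|---|---|---|
| 1 | 8 | 23 | 2.1 | 0.85 |
| 2 | 37 | 149 | 1.8 | 0.56 |
| 3 | 50 | 348 | 1.54 | 0.96 |
| 4 | 53 | 221 | 1.97 | 0.39 |
| 5 | 75 | 743 | 1.54 | 0.54 |
| 6 | 76 | 784 | 1.48 | 0.78 |
| 7 | 79 | 687 | 1.49 | 0.63 |
| 8 | 84 | 637 | 1.60 | 0.22 |
| 9 | 135 | 872 | 1.64 | 0.09 |
| 10 | 140 | 852 | 1.63 | 0.12 |
| 11 | 159 | 822 | 1.77 | 0.01 |
| 12 | 171 | 918 | 1.69 | 0.26 |
| 13 | 343 | 4972 | 1.44 | 0.76 |
| 14 | 407 | 5024 | 1.49 | 0.16 |
| 15 | 429 | 5195 | 1.46 | 0.00 |
| 16 | 438 | 5010 | 1.48 | 0.37 |
| 17 | 447 | 5938 | 1.43 | 0.93 |
| 18 | 466 | 3345 | 1.65 | 0.37 |
| 19 | 475 | 4104 | 1.58 | 0.83 |
| 20 | 477 | 5456 | 1.52 | 0.21 |
| 21 | 491 | 4249 | 1.55 | 0.43 |
| 22 | 492 | 2143 | 1.86 | 0.00 |
| 23 | 553 | 8100 | 1.45 | 0.78 |
| 24 | 559 | 6688 | 1.48 | 0.73 |
| 25 | 571 | 7606 | 1.48 | 0.80 |
| 26 | 601 | 6657 | 1.53 | 0.27 |
| 27 | 615 | 5858 | 1.49 | 0.00 |
| 28 | 616 | 7435 | 1.47 | 0.14 |
| 29 | 622 | 8839 | 1.48 | 0.29 |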
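Table 3. Small-world characteristics. Faculties ordered by small-world coefficient.
| Faculty | Lg | Lr | Cg | Cr | λg | γg | Nodes | Small-World |
|---|---|---|---|---|---|---|---|---|
| 1 | 1.67 | 1.61 | 0.49 | 0.76 | 1.04 | 0.64 | 8 | 0.62 |
| 6 | 2.05 | 2.06 | 0.42 | 0.25 | 0.99 | 1.70 | 76 | 1.71 |
| 3 | 1.99 | 2.16 | 0.48 | 0.26 | 0.92 | 1.86 | 50 | 2.03 |
| 5 | 2.12 | 2.09 | 0.51 | 0.25 | 1.02 | 2.08 | 75 | 2.04 |
| 2 | 2.72 | 2.61 | 0.41 | 0.18 | 1.04 | 2.28 | 37 | 2.18 |
| 7 | 2.28 | 2.24 | 0.49 | 0.21 | 1.02 | 2.34 | 79 | 2.30 |
| 8 | 2.29 | 2.38 | 0.43 | 0.17 | 0.96 | 2.47 | 84 | 2.56 |
| 4 | 2.45 | 2.72 | 0.39 | 0.14 | 0.90 | 2.70 | 53 | 3.00 |
| 9 | 2.61 | 2.80 | 0.29 | 0.10 | 0.93 | 2.88 | 135 | 3.09 |
| 10 | 2.93 | 2.89 | 0.30 | 0.09 | 1.01 | 3.28 | 140 | 3.23 |
| 13 | 2.64 | 2.48 | 0.30 | 0.08 | 1.07 | 3.74 | 343 | 3.52 |
| 12 | 3.13 | 3.25 | 0.26 | 0.07 | 0.96 | 3.95 | 171 | 4.11 |
| 14 | 2.69 | 2.66 | 0.26 | 0.06 | 1.01 | 4.28 | 407 | 4.24 |
| 15 | 2.81 | 2.69 | 0.26 | 0.06 | 1.04 | 4.60 | 429 | 4.41 |
| 17 | 2.88 | 2.64 | 0.29 | 0.06 | 1.09 | 5.05 | 447 | 4.63 |
| 20 | 2.86 | 2.79 | 0.25 | 0.05 | 1.02 | 4.84 | 477 | 4.73 |
| 23 | 2.84 | 2.64 | 0.28 | 0.05 | 1.08 | 5.50 | 553 | 5.12 |
| 11 | 2.85 | 3.24 | 0.26 | 0.06 | 0.88 | 4.58 | 159 | 5.20 |
| 25 | 2.76 | 2.73 | 0.25 | 0.04 | 1.01 | 5.47 | 571 | 5.41 |
| 29 | 2.84 | 2.71 | 0.26 | 0.05 | 1.05 | 5.68 | 622 | 5.42 |
| 16 | 2.89 | 2.75 | 0.30 | 0.05 | 1.05 | 5.90 | 438 | 5.60 |
| 24 | 2.94 | 2.81 | 0.27 | 0.04 | 1.05 | 6.52 | 559 | 6.23 |
| 28 | 3.15 | 2.84 | 0.28 | 0.04 | 1.11 | 7.31 | 616 | 6.59 |
| 26 | 3.06 | 2.91 | 0.28 | 0.04 | 1.05 | 7.57 | 601 | 7.19 |
| 21 | 3.20 | 3.11 | 0.27 | 0.03 | 1.03 | 7.82 | 491 | 7.61 |
| 19 | 3.11 | 3.09 | 0.27 | 0.04 | 1.01 | 7.74 | 475 | 7.69 |
| 27 | 2.87 | 3.09 | 0.24 | 0.03 | 0.93 | 7.49 | 615 | 8.07 |
| 18 | 3.42 | 3.34 | 0.30 | 0.03 | 1.02 | 9.84 | 466 | 9.62 |
| 22 | 6.18 | 4.32 | 0.34 | 0.02 | 1.43 | 15.82 | 492 | 11.05 |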
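
The last three measure columns of Table 3 can be related through the small-world-ness coefficient of Humphries and Gurney [34]: λg = Lg / Lr, γg = Cg / Cr, and the coefficient is γg / λg. The following Python sketch recomputes these quantities for Faculty 1; the function name is hypothetical, and because the table entries are rounded to two decimals, values recomputed for other rows may differ slightly from those printed.

```python
def small_world_ness(l_g: float, l_r: float, c_g: float, c_r: float):
    """Return (lambda_g, gamma_g, S) following Humphries and Gurney [34].

    l_g, c_g: mean shortest path and clustering of the observed network.
    l_r, c_r: the same measures for an equivalent random network.
    """
    lambda_g = l_g / l_r      # path-length ratio
    gamma_g = c_g / c_r       # clustering ratio
    return lambda_g, gamma_g, gamma_g / lambda_g


# Faculty 1 row of Table 3: Lg = 1.67, Lr = 1.61, Cg = 0.49, Cr = 0.76
lam, gam, s = small_world_ness(1.67, 1.61, 0.49, 0.76)
print(f"lambda_g = {lam:.2f}, gamma_g = {gam:.2f}, S = {s:.2f}")
```

A coefficient above 1 is usually read as evidence of small-world structure, which is why Faculty 1, the smallest network, stands apart from the others in the table.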
Table 4. Example of contingency table obtained in one round.
|  | U2 | U4 | U5 | Total |
|---|---|---|---|---|
| U1 |  |  |  | 5 |
| U2 |  |  |  | 4 |
| U3 |  |  |  | 1 |
| Total | 3 | 5 | 2 | 10 |
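
Reading Table 4 as a round in which only the marginals are observed (row totals 5, 4, 1 and column totals 3, 5, 2, which is an interpretation of the flattened table), each hidden cell is still constrained by those marginals. The sketch below computes the classical Fréchet bounds for every cell. It is a generic illustration of how marginals restrict a contingency table, not the estimation algorithm of [31,32], and the function name is hypothetical.

```python
def cell_bounds(row_totals, col_totals):
    """Fréchet bounds (lower, upper) for each cell of a table with fixed marginals."""
    total = sum(row_totals)
    if total != sum(col_totals):
        raise ValueError("row and column totals must add up to the same number")
    return [
        [(max(0, r + c - total), min(r, c)) for c in col_totals]
        for r in row_totals
    ]


# Marginals as read from Table 4 (assumed: rows U1, U2, U3; columns U2, U4, U5)
bounds = cell_bounds([5, 4, 1], [3, 5, 2])
for label, row in zip(["U1", "U2", "U3"], bounds):
    print(label, row)
```

With these marginals every lower bound is zero, so a single round cannot confirm any individual link; information accumulates only when many rounds are combined.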
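Table 5. Mean Bias, Mean Absolute Error, and CV for main measures.
| Level | Measure | Bias | MAE | CV |
|---|---|---|---|---|
| Node level | Degree | 2.37 | 446 | 0.78 |
| Node level | Betweenness | −278 | 322 | 0.81 |
| Network level: general | Edges | 11.2 | 486 | 0.13 |
| Network level: general | Average degree | 1.68 | 2.28 | 0.24 |
| Network level: general | Average between | −175 | 181 | 0.25 |
| Network level: scale free | Degree distribution exponent | 0.05 | 0.11 | 0.06 |
| Network level: Small-world | Lg | −0.50 | 0.50 | 0.17 |
| Network level: Small-world | Cg | 0.076 | 0.11 | 0.33 |
| Network level: Small-world | Small-world-ness coefficient | 0.75 | 1.73 | 0.36 |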
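
The three error summaries of Table 5 can be illustrated with a short Python sketch. Mean bias and mean absolute error follow their usual definitions. The CV is taken here as MAE divided by the mean of the real values, which is an assumption since the definition is not restated in this section. The input numbers are invented for illustration and are not data from the study.

```python
from statistics import fmean


def error_summary(real_values, estimates):
    """Return (mean bias, mean absolute error, CV) for paired real/estimated values."""
    errors = [est - real for real, est in zip(real_values, estimates)]
    bias = fmean(errors)                       # signed: positive means overestimation
    mae = fmean([abs(err) for err in errors])  # magnitude of the error, sign ignored
    cv = mae / fmean(real_values)              # assumed normalisation (see lead-in)
    return bias, mae, cv


# Illustrative values only (hypothetical, not taken from the article)
real = [10.0, 20.0, 30.0]
estimated = [12.0, 19.0, 33.0]
bias, mae, cv = error_summary(real, estimated)
print(f"bias = {bias:.2f}, MAE = {mae:.2f}, CV = {cv:.2f}")
```

Reporting bias alongside MAE separates systematic over- or underestimation from overall error: a bias close to zero with a large MAE points to scattered rather than shifted estimates.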

Share and Cite

MDPI and ACS Style

Portela, J.; García Villalba, L.J.; Silva Trujillo, A.G.; Sandoval Orozco, A.L.; Kim, T.-H. Estimation of Anonymous Email Network Characteristics through Statistical Disclosure Attacks. Sensors 2016, 16, 1832. https://doi.org/10.3390/s16111832
