1. Introduction
A social network is usually understood as the social structure between actors, which are generally individuals or individual organizations. It captures relationships of various types, ranging from casual acquaintance to close ties, as well as flows of objects (e.g., information, goods, money, signals, intermediates in the production cycle) between members of a community [1].
Social network analysis (SNA) focuses on mapping and measuring relationships and information flows between people, groups, organizations, or other entities that transform information and/or knowledge. SNA attempts to make predictions on the basis of the characteristics of the network as a whole, the properties of individual nodes derived from the network structure, and so forth. The subject of the research can be a complete social network, or parts of it related to a specific node.
Nowadays, graphs are used to model various interesting real-world phenomena. Much research interest has been attracted by social networks in which one can distinguish two types of objects, such as users and items, and where only the relationships between a user and an item are of interest. They can be modeled via bipartite graphs, i.e., graphs in which edges exist only between two disjoint subsets of vertices. For example, in the case of customer data, there are two modalities: users and products. There are no edges within the user set, and there are no edges within the product set. An edge between a user and a product means that the user bought this product. Such graphs can be used to recommend products to users. Another example is an Internet forum. In this case, there are two modalities: forum users and the forum threads in which they write posts. There is an edge between a forum user and a thread if the user wrote a post in it. One can then recommend interesting forum threads to the user. One may also seek intermediate states of a dynamic network that have not been observed.
Both the actual graph structure and the dynamics of its development over time are essential. Growth models are vital in SNA for a number of purposes. The first is to test which micro-operations happening in the network may lead to the macro-structures that one observes. The second is that one wants to develop and test various social network algorithms, such as recommendation algorithms based on social networks, but the available social networks are not numerous, and the threat of overfitting is serious. Therefore, one needs synthetic networks that are similar to real ones. The third is that one may want to perform some kind of what-if analysis on social networks without experimenting on real people. Many more reasons can be found. For the above-mentioned purposes, one needs, on the one hand, growth models that are sufficiently similar to real-world phenomena, and on the other hand, a method of extracting model parameters from an actual real network in order to generate similar ones.
Over the last decade, a number of growth models for bipartite graphs have been proposed [2,3,4]. Unfortunately, these bipartite graph generators have had some limitations. The bipartite graphs were created with limited reproduction of real-life graph properties, and two graph structures had to be maintained, which considerably complicates the models.
In this paper, the graph generator proposed by Chojnacki [5] is considered, which can be viewed as a graph growth model with seven parameters. In [5], it was demonstrated that the model qualitatively reflects the properties of real-life bipartite graphs quite well. Therefore, it may be, and has been, used for qualitative studies of various phenomena. Chojnacki's model touches on the very important problem of the "cold start" in recommending products to users, and vice versa. The "cold start" problem concerns the recommendation of products to a new user about whom there is no information in the system. The same occurs when there is a new product about which nothing is known and which one wants to recommend to the users of the system. Thus, from here on, this model will be called the Cold Start User-Item Model (CSUIM).
Our long-term goal is to investigate CSUIM's usefulness for quantitative analysis. Assuming that a real-world graph follows the growth paradigm described by CSUIM, we want to identify the growth parameters of the graph so that one can, for example, investigate the growth of this graph in the past or in the future.
Regrettably, no results are known so far on computing or estimating CSUIM model parameters from real-world data. The current paper is intended to close this gap. It attempts to establish to what extent the model parameters can be properly recovered from the graph, in order to later answer the question of how applying recommendations to the participants of a social network may change the growth process of the social graph. Therefore, artificial graphs generated from Chojnacki's model are studied in this paper, and the model recovery method proposed here is applied to them.
As a first stage, methods of reconstructing generator parameters from a graph at some stage of its development are considered. A method to capture the parameters from an actual graph is proposed, and the similarity of metrics between the original graph and the one obtained from the model is verified.
Chojnacki used his model for other purposes: he created a benchmark framework for recommendation systems, in which the model was applied to generate different graphs and to examine how a recommendation system would behave on them.
The paper is structured as follows: Section 2 presents attempts to describe real-world phenomena of uni-modal and bi-modal social networks available in the literature. In Section 2.1, uni-modal graph models are described; ideas used in bipartite graph models are outlined in Section 2.2, and recommender systems for bimodal networks are reviewed in Section 2.3. In Section 3, the Chojnacki generator is described briefly, and Section 4 motivates the proposed approach. Section 5 presents theoretical node degree distribution models. The proposed approach to parameter estimation is described in Section 6, and some linear dependencies for parameter estimation are investigated in Section 7 and Section 8. In Section 9, a method for parameter computation from a graph is proposed. In Section 10, experimental results on parameter recovery and model quality are presented. Section 11 contains some concluding remarks.
2. Related Work
Much research effort has been devoted to the qualitative description of the real-world phenomena of uni-modal social networks. Barabási [6] studied the impact of the removal of a few super-connected nodes, or hubs. Albert and Barabási [7] present a statistical approach to modeling random graphs, small worlds, scale-free networks, and evolving networks, and to the interplay between topology and a network's robustness against failures and attacks.
Lin et al. [8] used network history to predict communities in the current state, exploiting node degrees, modularity (as defined by Newman et al. [9]), and their own soft modularity. Leskovec et al. [10] characterized the statistical and structural properties of communities as a function of network size and conductance.
Leskovec et al. [11,12] investigated the phenomenon of real graphs densifying over time while the average distance between nodes shrinks. They attempted to explain this phenomenon with the "Community Guided Attachment" (CGA) model and the more complex "Forest Fire Model" (FFM).
While there are many publications concerning uni-modal social network growth models, bimodal ones are investigated far more rarely, though, as [13] shows, they are important for product recommendation and rating prediction, and, as [14] (Section 6.4.4) reports, they may be used to investigate models of scientific paper co-authorship. Publications like [15] or [16] propose models for new edge prediction.
The paper of Lavia et al. [17] is the most similar in spirit to our work. In their paper, the Netflix competition database was considered, and an explanation for the hardness of prediction was proposed. The authors proposed a growth model of an item rating network based on a mixture of preferential and uniform attachment that not only reproduces the asymptotic degree distribution, but also agrees with the Netflix data in several time-dependent topological properties. Our research differs from this in that we consider a much more complex model, where both items and users can perform edge attachment, and a bouncing mechanism for modeling the impact of local recommendation is included.
2.1. Graph Models for Unimodal Networks
In this section, the most popular graph generators, also called graph models, are presented. First, let us recall an important graph measure, which is frequently used when evaluating the quality of various graph models of social networks. Many empirical graphs are well modeled by small-world networks (see [18]). For example, social networks, the connectivity of the Internet, wikis such as Wikipedia, and gene networks all exhibit small-world characteristics.
Therefore, a couple of measures have been proposed in the literature to determine whether a graph is a small-world network. The most popular of them is the so-called local clustering coefficient (LCC); for a vertex $i$, it is defined as:
$$LCC_i = \frac{2\,\left|\{e_{jk} \in E : v_j, v_k \in N_i\}\right|}{k_i (k_i - 1)},$$
where $E$ is the set of all edges, $V$ is the set of all vertices, $v_j, v_k$ are vertices, $N_i$ is the set of neighbors of vertex $i$, and $k_i$ is the degree of vertex $i$. The degree of vertex $i$ is the number of edges incident to this vertex. For the whole graph $G$, the clustering coefficient is just the average $LCC_G = \frac{1}{|V|}\sum_{i \in V} LCC_i$.
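For concreteness, the following minimal Python sketch computes the LCC from an adjacency list; the dict-of-sets graph representation is an illustrative choice of ours, not something prescribed by the models discussed here.

```python
# Minimal sketch: local clustering coefficient from an adjacency list.
# The dict-of-sets representation is an illustrative assumption.

def local_clustering(adj, i):
    """LCC of vertex i: share of pairs of i's neighbors that are linked."""
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0                         # by convention for degree < 2
    links = sum(1 for u in neighbors for v in neighbors
                if u < v and v in adj[u])  # edges among the neighbors
    return 2.0 * links / (k * (k - 1))

def graph_clustering(adj):
    """Clustering coefficient of the whole graph: the average of all LCCs."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)

# Example: a triangle with one pendant vertex.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(local_clustering(adj, 2))   # 1/3: one of three neighbor pairs is linked
```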
The Erdős–Rényi model (see [19]) is defined by two parameters: the number of nodes, $n$, and the probability $p$ that an edge exists between a pair of nodes. This mechanism of node connection is called uniform attachment (UA). In this model, the node degree distribution is binomial (approximately Poisson for large $n$), and the local clustering coefficient equals $p$ in expectation. The Barabási–Albert model (see [7]) uses a "preferential attachment" (PA) mechanism for creating connections between nodes. The graph is initialized with a small connected graph of $m_0$ nodes. In the following steps, each new node is connected to $m$ existing nodes in such a way that the probability of connection is proportional to the number of links that the existing nodes already have. This method of connecting nodes causes the node degree distribution to follow a power law, $P(k) \sim k^{-3}$, while the clustering coefficient decays as the network grows. Liu's model (see [20]) was one of the first attempts at combining the uniform attachment and preferential attachment mechanisms. The authors proposed an intensification parameter mediating between UA and PA.
The models mentioned previously have a serious drawback: the LCC value cannot be controlled by the graph parameters. More flexible models were proposed by Vázquez (see [21]), and independently by White (see [22]), where the LCC may be modified by changing graph parameters. The Vázquez model is based on random walks (also called surfing) and a recursive search for generating networks. In the random walk model, the walk starts at a random node and follows links, and for each visited node, with some probability, an edge is created between the visited node and the new node. It can be shown that such a model generates graphs with a power-law degree distribution with an exponent greater than or equal to 2 (see [23]).
2.2. Graph Models for Bimodal Networks
A bimodal network is understood as a network connecting two varieties of objects (like authors and their papers, employees and their firms, tourists and museums, etc.) [24]. Other names for such a network are a bipartite, 2-partite, or 2-mode network. These networks can be modeled by bipartite graphs. A bipartite graph has the form $G = (U \cup V, E)$, where $U \cap V = \emptyset$ and every edge in $E$ has one endpoint in $U$ and the other in $V$. That is, the vertices of a bipartite graph can be divided into two disjoint sets, $U$ and $V$, such that every edge connects a vertex in $U$ to one in $V$; in other words, $U$ and $V$ are independent sets. These sets represent, for example, customers and products. If a customer $u \in U$ buys a product $v \in V$, there is an edge between vertices $u$ and $v$. Thus, there are no edges between customers or between items—they cannot buy each other. In the case of an Internet forum, one also has two modalities: one for users and the other for the threads the users participate in. Many other kinds of bipartite networks occur in real life [25].
Although bipartite graphs form a specific subclass of graphs, the models mentioned in Section 2.1 are not appropriate when one wants to model bipartite graphs of bimodal social networks. In a bipartite graph, vertices of the same modality are never connected, so one always gets $LCC_i = 0$. This poses a severe problem when studying bipartite graphs: on the one hand, one has no means of examining the fundamental small-world phenomena, and on the other hand, it is an obstacle to adopting traditional graph generators for the bipartite case. Therefore, in [5], another suitable metric of clustering tendency was proposed—the bipartite local clustering coefficient (BLCC):
$$BLCC_i = 1 - \frac{|N_2(i)|}{\sum_{j \in N_1(i)} (k_j - 1)},$$
where $W$ is the set of all vertices and $N_n(i)$ is the set of neighbors of vertex $i$ that are $n$ steps away. In other words, $N_n(i) = \{j \in W : d(i,j) = n\}$, where $d(i,j)$ is the minimal distance (number of edges) between vertices $i$ and $j$. In [5], it is shown that the two metrics behave similarly on classical graphs.
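A sketch of the BLCC under this reading of the definition follows; the exact normalization should be verified against [5] before relying on it.

```python
# Sketch of the BLCC under the reading above: the ratio compares the number of
# distinct 2-step neighbors to its maximal possible value. Verify the exact
# normalization against [5].

def blcc(adj, i):
    n1 = adj[i]                                   # neighbors one step away
    n2 = set()
    for j in n1:
        n2 |= adj[j]                              # candidate 2-step neighbors
    n2 -= n1
    n2.discard(i)
    denom = sum(len(adj[j]) - 1 for j in n1)      # maximal possible |N2(i)|
    return 1.0 - len(n2) / denom if denom > 0 else 0.0
```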
Typically, in a social network, a small number of vertices have many direct neighbors, and a large number of vertices have few direct neighbors. Social networks contain clusters with a high density of connections. This property is called transitivity: if nodes a and b have a common neighbor, then this influences the probability of an edge existing between a and b. This critical difference between unimodal and bimodal networks led to the development of separate models for the bimodal case. Let us mention a few.
There are two main approaches to modeling bipartite graphs: the iterative growth method and the configuration method. The iterative growth method is better suited for the recommendation process, as shown in [5]; it simulates the growth of a network. In the configuration method, one gives an estimated general description of the network, and from this, the final state of the network is constructed. The description usually contains the number of nodes in each modality, the probability density function (PDF) of node degrees, and the number of edges. One first creates the nodes of each modality without edges, and then creates edges by sampling their endpoints from the node degree PDF, as sketched below.
In [4], Guillaume and Latapy presented a method of transforming a bipartite graph into a classical (uni-modal) network and of reversing this transformation. Unfortunately, the reverse transformation is not unique, and moreover, the retrieval of the bipartite structure is computationally hard. The authors pointed out that computing the largest clique containing a given link may be very expensive (it is NP-complete). Birmelé [2] builds a bipartite graph model from existing uni-modal graph models using retrieval of the bipartite structure from classical graphs. In [3], Zheleva et al. analyzed the evolution of groups in an affiliation network. The affiliation network has two modalities: users, and the groups to which users belong. In their co-evolution model, groups can disappear and merge. This model is not appropriate for recommending items to users—items do not merge. In [26], Lattanzi and Sivakumar proposed a different model of the affiliation network as a bipartite graph. Their model of the evolving affiliation network and the consequent social network incorporates elements of preferential attachment and edge copying. They analyze the most basic folding rule, namely, replacing each society node in the affiliation network by a complete graph on its actors in the folded graph. The drawback of their models is that, given a social network (or another large graph), it is not at all clear how one can test the hypothesis that it was formed by folding an affiliation network. The general problem of deciding, given a graph $G$ on a set $Q$ of vertices, whether it was obtained by folding an affiliation network on vertex sets $Q$ and $U$, where $U \cap Q = \emptyset$, is NP-complete.
All the previously mentioned generators have an iterative growth mechanism. Their common limitation is that they generate bipartite graphs with power-law or uniform distributions of vertex degrees. Additionally, none of these models contains a parameter that controls the transitivity property. The previous approaches also have a significant drawback: configuration methods and methods based on retrieval of a bipartite structure decrease the bipartite local clustering coefficient (BLCC) compared to iterative methods. This means that models derived by those methods fit real-world structures worse than those estimated by iterative methods. Moreover, in the pessimistic case, one deals with an NP-complete problem, so approximations are needed. As these models suffer from various drawbacks, another model was proposed in [5]; it is characterized in Section 3 and is the subject of our current investigation.
2.3. Recommender Systems for Bimodal Networks
Bimodal networks appear as a natural setting for a recommendation system, where objects of one modality are recommended to the objects of the other modality. We have already mentioned the works [5,26] addressing modeling for recommendations in these settings—however, there are many more. Ahmedi et al. (see [24]) derived recommendations from a network associating tourists with points of interest, based on vertex and edge labeling combined with a ranking or centrality function. He et al. (see [27]) proposed to predict item popularity and to recommend items to users based on a specific version of the PageRank technique (eigenvectors of a special connectivity matrix); user preferences are expressed as weights. Shi et al. (see [28]) used a combination of content-based and collaborative filtering for recommendation, where the method of combining both recommendations was developed by learning the weights of the two components from previous prediction accuracy. Cheng et al. (see [29]) exploited a matrix factorization model based on reviews and preferences. Ozsoy (see [30]) proposed to use the word2vec technique, originally developed for finding words occurring in similar contexts. This approach replaces words with users and items, creating a kind of word2vec representation of the item–user graph; recommendations are based on the similarity of objects in this representation. Vasile et al. (see [31]) proposed, based on the same word2vec technology, recommendations of items (products) based on their context (other products), as well as on some textual information (content-based support). Kang and Yu (see [32]) developed a soft-constraint-based online LDA algorithm for community recommendation. It adapts a technique used for document processing to the collaborative filtering setting: a user is represented as a "document", being a probability distribution over latent topics, and each topic is represented as a probability distribution over communities. The number of a user's posts within each community forms the basis for the estimation of latent topics, with an online LDA algorithm applied for this purpose. Communities are recommended based on the conditional distribution of a community given the user "document". Liu et al. (see [33]) developed a recommendation method enriching online LDA with probabilistic matrix factorization. Other application cases are reviewed in [34].
The current paper proposes a framework that differs from the approaches just mentioned. They attempt to make recommendations taking into account the current state of the network. In the approach presented in this paper, the history of the network is modeled—that is, the predictions are related to the evolution of the network, and not to a recommendation suggested to a particular user at a given snapshot.
3. CSUIM Bipartite Graph Generator
The bimodal graph generator presented in [5] is more flexible than the graph generators mentioned in Section 2.2, though it cannot generate a disconnected graph with desired properties. Its advantage is the capability to create graphs with a broader range of clustering behavior via the so-called bouncing mechanism. The bouncing mechanism is an adaptation of the surfing mechanism of classical graphs (see [21]); it is applied only to edges created by preferential attachment.
In the CSUIM, we consider a graph with the set of vertices $W = U \cup V$, $U \cap V = \emptyset$, where the set $U$ is called "users" and the set $V$ is called "items". We consider both uniform attachment, where incoming nodes form links to existing nodes selected uniformly at random, and preferential attachment, where probabilities are assigned proportionally to the degrees of the existing nodes (see [35]).
The generator has seven parameters:
$m$—the initial number of edges, where the initial number of vertices is $2m$;
$p$—the probability that a new vertex added to the graph in iteration $t$ is a user, so $1-p$ is the probability that the new vertex is an item;
$\lambda_u$—the number of edges added from a vertex of user type in one iteration (the number of items bought by a single new user);
$\lambda_v$—the number of edges added from a vertex of item type in one iteration (the number of users who bought the same new item);
$\alpha$—the probability of preferential attachment when a new user chooses items; $1-\alpha$—the probability of uniform attachment;
$\beta$—the probability of preferential attachment when a new item chooses users; $1-\beta$—the probability of uniform attachment;
$b$—the fraction of edges attached in a preferential way that are created using the bouncing mechanism.
The Cold Start User-Item Model (CSUIM) creates a node in the set of users with probability $p$ and in the set of items with probability $1-p$. The newly created node is connected with nodes of the opposite modality: if the node is of user type, it is connected with $\lambda_u$ items, and if it is of item type, it is connected with $\lambda_v$ nodes of user type. To find the nodes to which the newly added node will be connected, we use two mechanisms: "uniform attachment" (UA) and "preferential attachment" (PA), described briefly in Section 2.1. PA is drawn with probability $\alpha$ for items and $\beta$ for users; otherwise, nodes are selected by UA. When PA is selected, a fraction $b$ of the edges is attached by the bouncing mechanism. More details of the bouncing mechanism are given after the description of the CSUIM algorithm.
The procedure for generating synthetic bipartite graphs is outlined in Algorithm 1.
Algorithm 1: Cold Start User-Item Model.
Step 1. Initialize the graph with $m$ edges (so there are $2m$ vertices).
Step 2. Add a new vertex to the graph, of type $u$ with probability $p$ and otherwise of type $v$.
Step 3. Choose neighbors to join to the new vertex according to the following rules:
Step 3a. If the new node is of type $u$, add $\lambda_u$ edges from this node to type-$v$ vertices, using the preferential attachment mechanism with probability $\alpha$ and uniform attachment otherwise.
Step 3b. If the new node is of type $v$, add $\lambda_v$ edges from this node to type-$u$ vertices, using the preferential attachment mechanism with probability $\beta$ and uniform attachment otherwise.
Step 3c. Consider the newly added vertex $n$ and the edges $(n, w)$ added from it by preferential attachment (nodes $n$ and $w$ are from different modalities). Select a fraction $b$ of those end nodes $w$. For each node $w$ from this set, pick at random one of its neighbors, $x$. From the randomly selected node $x$, select its neighbor $y$ at random again. Connect the new node $n$ to the node $y$ selected in this way instead of the original node $w$ obtained by preferential attachment.
Step 4. Repeat Steps 2 and 3 $T$ times.
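To make the procedure concrete, here is a minimal Python sketch of Algorithm 1. It is a sketch under several assumptions of ours: the initial $m$ edges form a user–item matching, bouncing is applied to each preferentially attached edge independently with probability $b$ (instead of selecting an exact fraction), and duplicate edges are simply skipped.

```python
import random

def generate_csuim(m, p, lam_u, lam_v, alpha, beta, b, T, seed=None):
    """Minimal sketch of Algorithm 1; returns (adjacency, users, items)."""
    rng = random.Random(seed)
    users, items = list(range(m)), list(range(m, 2 * m))
    adj = {w: set() for w in users + items}
    deg_u, deg_v = [], []        # degree-weighted multisets for PA sampling

    def add_edge(user, item):
        if item not in adj[user]:
            adj[user].add(item); adj[item].add(user)
            deg_u.append(user); deg_v.append(item)

    for u, v in zip(users, items):      # Step 1: m initial edges (a matching)
        add_edge(u, v)
    for _ in range(T):                  # Step 4: T iterations
        is_user = rng.random() < p      # Step 2: modality of the new vertex
        new = len(adj); adj[new] = set()
        if is_user:                     # Step 3a: a new user buys lam_u items
            users.append(new); lam, pa, pool, degs = lam_u, alpha, items, deg_v
        else:                           # Step 3b: lam_v users buy the new item
            items.append(new); lam, pa, pool, degs = lam_v, beta, users, deg_u
        for _ in range(lam):
            if rng.random() < pa:       # preferential attachment
                w = rng.choice(degs)
                if rng.random() < b:    # Step 3c: bounce via two random hops
                    x = rng.choice(list(adj[w]))
                    w = rng.choice(list(adj[x]))
            else:                       # uniform attachment
                w = rng.choice(pool)
            if is_user:
                add_edge(new, w)
            else:
                add_edge(w, new)
    return adj, users, items
```

For instance, generate_csuim(5, 0.6, 2, 2, 0.5, 0.5, 0.2, 10000) produces a graph in which roughly 60% of the vertices are users, in line with the expected counts discussed below.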
Step 3c emulates a behavior called recommendation. One can imagine a customer who is going to buy a product and encounters another consumer who has already purchased it and who recommends another product instead. The first consumer changes his/her mind and follows this recommendation with probability $b$. By varying this parameter, one can observe what happens when people are more or less amenable to recommendation.
Selecting products by uniform attachment simulates consumers who do not bother about which product to choose. Preferential attachment simulates consumers who look for products on their own (e.g., dresses seen infrequently on the street). Note that this model of graph growth simulates a very special kind of purchase behavior—namely, the behavior of only new consumers and new products. Despite its limited applicability, the model is very important, because it concentrates on a very hard part of the recommendation process called the "cold start". Cold start concerns the issue that the system cannot draw any inferences for users or items about which it has not yet gathered sufficient information. Recommender systems form a specific type of information filtering (IF) technique that attempts to present information items (e.g., movies, music, books, news, images, web pages) that are likely to be of interest to the user. Typically, a recommender system compares the user's profile to some reference characteristics. These characteristics may come from the information item (the content-based approach) or from the user's social environment (the collaborative filtering approach). More detailed specifics of this hard problem and some solutions have been presented in [36,37].
It is easy to see that after $t$ iterations with the bouncing mechanism disabled ($b = 0$), we have on average $m + p t$ vertices of type $u$ and $m + (1-p)t$ vertices of type $v$. The average number of edges attached in one iteration is $\eta = p\lambda_u + (1-p)\lambda_v$. After a large number of iterations, we can neglect the $m$ initial edges in further calculations. Thus, the average numbers of vertices of type $u$ and of type $v$ depend only on the iteration $t$ and on $p$, and do not depend on $m$, $\lambda_u$, or $\lambda_v$. The total number of edges depends only on $p$, $\lambda_u$, and $\lambda_v$. This is not good news, because it means these counts cannot be used to estimate all parameters of the generator, especially $\alpha$, $\beta$, and $b$.
In the next section, an approach and a method of parameter extraction based not on the current state but rather on the dynamics of the network is presented.
4. Motivation for Proposed Approach
One can ask: how can the generator parameters be estimated? Any approach to network parameter estimation should be based on observable quantities that can be mapped to the model parameters.
In the model described above, we can essentially observe the nodes and their interconnection, as well as their statistics (like degree distributions and/or clustering coefficients) as a source of information for parameter estimation.
At least three types of approaches can be considered:
Analytical;
Machine-learning; and
Brute force.
An analytical approach would mean establishing a closed-form model for some of the observables, such as the node degree distribution in both modalities, and attempting to solve it analytically for the parameters. As we will see in the next section, the differential equation for the node degree distribution is not simple to solve, and only approximate solutions are known in the literature for particular settings of the variables, even in the simple case of no recommendation ($b = 0$).
A machine-learning approach was applied in [38], but there seems to be no simple relationship that can be extracted via machine learning. Finally, a brute-force approach would be to slice the space of parameters, generate a sample for each slice, compute the observables from the sample, and choose the parameter set for which the sample is closest to the real graph. Eventually, we follow the last path; however, we simplify the process by exploiting independencies between some effects of the parameters, as well as some simplifying implications of the theoretical models.
5. Theoretical Node Degree Distribution Models
As indicated in the previous section, in our approach, we measure some characteristics of a network to estimate the parameters of the model. One of the most important properties of a network is the node degree distribution of each modality. We consider the probability that a node has degree $k$ at some moment in time $t$ and denote it as $P(k, t)$. The variable $t$ can be interpreted as the number of iterations made while generating a graph from the model.
Let us concentrate on the CSUIM (see Algorithm 1 from Section 3) when there is no bouncing mechanism, i.e., $b = 0$. Let $\nu_g$ represent the rate at which new nodes are introduced in modality $g$ (items or users); that is, $\nu_g \Delta t$ nodes of modality $g$ are added in a time interval of duration $\Delta t$. In the current model, one node (on average) is added in a single time interval. Let $N_g(k, t)$ denote the expected number of nodes of modality $g$ whose degree is $k$ at time $t$. Let us consider multiple attachments. Each new node of modality $g$ that is introduced chooses $\lambda_g$ existing nodes ($\lambda_u$ for users, $\lambda_v$ for items) of the opposite modality $\bar g$. With $\lambda_g^{UA}$, let us denote the number of nodes attached to a new node of modality $g$ using the non-preferential (uniform) attachment, and with $\lambda_g^{PA}$, the number of nodes attached to a new node of modality $g$ using preferential attachment (so that $\lambda_g^{UA} + \lambda_g^{PA} = \lambda_g$; on average, $\lambda_u^{PA} = \alpha\lambda_u$ and $\lambda_v^{PA} = \beta\lambda_v$).
Following the argument from [35], one can see that the node degree distribution evolves over time according to a rate equation: the expected count $N_g(k, t)$ grows by inflow from degree $k-1$ (a node of degree $k-1$ gaining an edge) and shrinks by outflow from degree $k$, where the uniform part of the attachment selects existing nodes in proportion to their number and the preferential part in proportion to their degree; newly added nodes enter with degree $\lambda_g$, which gives the boundary condition of the equation.
In [35], solutions were found for the extreme cases of pure uniform attachment and pure preferential attachment. It turns out that for $t$ tending to infinity, the node degree distributions are governed approximately by an exponential distribution and a power-law distribution, respectively. For pure uniform attachment, the change of $N_g(k, t)$ is driven solely by the uniform selection of existing nodes, and the approximate solution tends to an exponential distribution as time goes to infinity. In the case of preferential attachment, each newly attached node adds one to the degree of the selected node at that instant, and $N_g(k, t)$ then evolves toward a power-law distribution.
However, no mixed case was considered in [35]. The mixed case was treated in [5], though only for large $k$. It turns out that the distribution in the mixed case is approximately a power law, though with a complex exponent. The formula in [5] is derived by relaxing the degree to a real positive number, which defines a probability density function over degrees. In our notation, it yields an expression (Equation (9)) for $P_g(\kappa < k)$, the probability that a vertex of modality $g$ has degree $\kappa$ less than a threshold value $k$, in terms of the attachment parameters and $\eta$, the average number of edges attached in one iteration.
In our paper [38], we tried to extract the $\alpha$ and $\beta$ coefficients from formula (9), but it did not match the experimental distribution well. Therefore, we sought an alternative, which is presented in the sections below.
6. Our Approach to Parameter Estimation
The model from the previous section, though difficult enough for an analytical solution, still represents a substantial simplification: $b$ is set to zero, time tends to infinity, and $k$ is assumed to be large.
So, first of all, why should we assume that $b = 0$? If there is no bouncing, then we can easily see that the node degree distributions of the two modalities are independent of one another, so that they can be considered separately. Also, as $\alpha$ and $\beta$ each apparently influence one or the other modality's degree distribution, we can expect that both can be estimated separately. This reduces the search space drastically, but what happens if $b > 0$? In this case, "under a stable distribution", two nodes of, say, user type will pick item nodes from approximately the same distribution. So if bouncing occurs, it is equally likely that a node of degree $k$ will increase its degree instead of a node of degree $l$ as the other way around. We can therefore expect that under "modest" values of $b$, the marginal degree distributions of both modalities remain unchanged, and a model with $b = 0$ is justifiable for them.
However, we can easily guess that $b$ will impact the clustering measures. So, after estimating $\alpha$ and $\beta$, we can estimate $b$ separately.
7. A Linear Relationship to Obtain $\alpha$ and $\beta$
We would like to demonstrate how the probability $P(k)$ of a node having degree $k$ changes in the CSUIM. With respect to the definition of probability from Equation (9), Figure 1a,b depict the dependency between $\ln P(k)$ and $\ln k$ for fixed values of $\alpha$ or $\beta$, depending on the modality. It turns out that for small $k$ (consuming most of the probability mass) and fixed $\lambda_u$ ($\lambda_v$), the value of $\ln P(k)$ decreases nearly linearly with $\ln k$. We can also see that when we add more edges per node (ten times more), the linear character of the relation between $\ln P(k)$ and $\ln k$ is preserved. This gives us an insight into setting up the values of $\lambda_u$ and $\lambda_v$.
The same dependency occurs when we consider simulations with the CSUIM (see Figure 2a,b). We can see that the linear relation between $\ln P(k)$ and $\ln k$ remains almost unchanged when we fix $\alpha$ and change the value of $b$ from 0 to 0.99. Note that Equation (11) does not contain $b$. This experiment shows that, not only in theory but also in practice, the computation of $\alpha$ does not depend on the value of $b$.
Therefore, we looked at the relationship between $\alpha$ (analogously $\beta$) and the slope of the straight line approximating the relationship between $\ln P(k)$ and $\ln k$, and drew it for various values of $\lambda_u$ ($\lambda_v$). We see that for a wide range of values of $\lambda_u$ ($\lambda_v$), this relationship is linear, both for the theoretical and the simulation models. This insight led us to the algorithms for the identification of $\alpha$ and $\beta$ described below.
How can it be explained that $\alpha$ and $\beta$ are linearly related to the degree distribution exponent? As already mentioned, when $\alpha$ or $\beta$ (for the respective modality) is set to 1, we are dealing with pure preferential attachment for that modality, and the degree distribution follows a power law, whereas when it is set to 0, an exponential distribution is followed. For values in between, we are dealing with a kind of mixture of both (which does not seem to be a simple one).
If we take the theoretical formula for $P(k)$ and plot it for various values of $k$ as a function of $\alpha$, we see a nearly linear relationship over a large range of values. This result is shown in Figure 3. Therefore, we exploited it for the estimation of $\alpha$.
In the CSUI model, when $\alpha$ grows, the probability of connecting a new link by preferential attachment grows as well. Thus, we can approximate the distribution of vertex degrees by a power-law distribution fitted to the experimental degree distribution and compute the exponent of this distribution. We have
$$P(k) = C k^{-\gamma}.$$
After applying the ln function to both sides, we get:
$$\ln P(k) = -\gamma \ln k + \ln C.$$
We see in Figure 4 that, theoretically, for different combinations of $\lambda_u$ and $\lambda_v$, the parameter $\alpha$ has a linear relationship with the exponent (the slope coefficient in Equation (11)) of the power-law distribution of vertex degrees. This observation provides us with an algorithm for estimating the $\alpha$ parameter. Analogously, we can estimate the $\beta$ parameter from the exponent of the degree distribution of vertices of the other modality. Moreover, when $\alpha$ grows up to 1 (pure preferential attachment), we get the desired power-law distribution of node degrees.
Preferential attachment yields a power-law distribution of node degrees with a "heavy tail", and uniform attachment yields an exponential distribution of node degrees with a "light tail". As demonstrated in [39], an empirical mixture of these two distributions can be approximated by a power-law distribution. Therefore, linear regression analysis has sometimes been used to evaluate the fit of the power-law distribution to data and to estimate the value of the exponent. This way, one can also obtain the mixture parameter, $\alpha$. The rationale behind this approach is that the heavy-tailed distribution dominates over the exponential distribution for nodes of higher degree. This technique produces biased estimates (see [39]). As we see in the experiments, it is unreliable for low values of $\alpha$ ($\beta$), below 0.1—see Figure 5a ($\ln P(k)$ vs. $\ln k$) and Figure 5b ($P(k)$ vs. $k$).
8. A Linear Relationship for $b$
The bouncing parameter $b$ of the graph model may be used to model the behavior of users susceptible to recommendations. We found that this parameter is linearly correlated with a graph metric called "optimal modularity" (see [9]).
Modularity is a measure of the quality of a clustering of the nodes of a graph (we describe it briefly below). Optimal modularity is the modularity of the node clustering whose modularity is the highest among all node clusterings of a given graph. Finding the optimal modularity is known to be an NP-hard task; therefore, various greedy algorithms without an approximation guarantee exist. So, in fact, this term should be called "the optimal modularity for algorithm X", and here we mean the optimal modularity computed by the algorithm described in [9].
Modularity is the fraction of the edges that fall within the given groups (clusters) minus the expected such fraction if edges were distributed at random. The value of the modularity lies in the range $[-\frac{1}{2}, 1]$. It is positive if the number of edges within groups exceeds the number expected on the basis of chance. Examples of graph clusterings with positive and negative modularity values are shown in Figure 6a,b, respectively. The upper boundary (modularity close to 1) is approached when the graph consists of a multitude of complete components. For a given division of the network's vertices into clusters (called groups, communities, or modules), modularity reflects the concentration of nodes within modules compared to a random distribution of links between all nodes regardless of modules.
There are many ways to express the modularity. In our approach, we compute Newman's modularity (see [9]) as follows:
$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j),$$
where $A$ represents the adjacency matrix, with $A_{ij} = 1$ when there is an edge between nodes $i$ and $j$ and 0 otherwise; $k_i$ is the sum of the weights of the edges attached to vertex $i$; $c_i$ is the community to which vertex $i$ is assigned; the $\delta$-function is the Kronecker delta, $\delta(c_i, c_j) = 1$ iff $c_i = c_j$ and 0 otherwise; and $m = \frac{1}{2}\sum_{ij} A_{ij}$. The above formula can also be expressed as the difference between the fraction of edges that fall inside communities and the sum of squares of the shares of edges that have at least one end in each community.
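As an illustration, the following sketch evaluates this formula directly for an unweighted graph and a fixed community assignment; the dict-based graph representation and the example partition are our own illustrative choices.

```python
# Sketch: Newman's modularity Q for a fixed node-to-community assignment,
# computed directly from the formula above (unweighted graph, A_ij in {0,1}).

def modularity(adj, community):
    two_m = sum(len(neigh) for neigh in adj.values())   # 2m = sum of degrees
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] != community[j]:
                continue                                # Kronecker delta is 0
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m

# Two triangles joined by a single edge, split into their natural communities.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community = {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'}
print(modularity(adj, community))   # ~0.357: denser inside the communities
```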
In our computations of the optimal modularity, communities are obtained based on Newman's modularity concept. The algorithm runs as follows: initially, each node constitutes its own community; then, nodes are moved between neighboring communities until a stopping criterion is reached. The obtained communities receive distinct identifiers called modularity classes. A node is moved to the community of one of its neighbors if this increases the modularity of the entire network. At each step, the move giving the maximum modularity gain is selected. The process terminates when no gain in modularity can be achieved.
Part of the efficiency of Newman's algorithm (see [9]) results from the fact that the gain in modularity $\Delta Q$ obtained by moving an isolated node $i$ into a community $C$ can easily be computed as:
$$\Delta Q = \left[\frac{\Sigma_{in} + k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^2\right] - \left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2\right],$$
where $\Sigma_{in}$ is the sum of the weights of the links inside $C$, $\Sigma_{tot}$ is the sum of the weights of the links incident to nodes in $C$, $k_i$ is the sum of the weights of the links incident to node $i$, $k_{i,in}$ is the sum of the weights of the links from $i$ to nodes in $C$, and $m$ is the sum of the weights of all the links in the network. A similar expression is used to evaluate the change of modularity when $i$ is removed from its community. Therefore, in practice, one evaluates the change of modularity by removing $i$ from its community and then moving it into a neighboring community.
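A direct transcription of this gain formula, with the arguments being the sums defined above:

```python
# Direct transcription of the modularity-gain formula above; the arguments are
# the sums defined in the text, and m is the total edge weight of the network.

def modularity_gain(sigma_in, sigma_tot, k_i, k_i_in, m):
    after = (sigma_in + k_i_in) / (2 * m) - ((sigma_tot + k_i) / (2 * m)) ** 2
    before = (sigma_in / (2 * m) - (sigma_tot / (2 * m)) ** 2
              - (k_i / (2 * m)) ** 2)
    return after - before
```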
To sum up, Newman's optimal modularity tells us a very important thing—how much our graph differs from a random one. In a fully random graph, edges are attached to nodes at random from some distributions. The bouncing parameter $b$ of the CSUI model introduces a kind of dependence in how nodes link to other nodes—it affects the selection of both ends of an edge. The value $b$ represents the fraction of edges attached in a preferential way that were created using the bouncing mechanism. The greater the value of $b$, the stronger the dependence in the creation of links in the graph, and the stronger this dependence, the greater the value of the modularity.
Let us return to the step of the CSUI model where a new node is added and the bouncing mechanism is active. Consider bouncing from a newly created user vertex $u$ (see Figure 7). First, the bouncing algorithm selects an item vertex, $i$. From this vertex, we can go further to the user modality through edges added in previous steps, either when a user node or when an item node was added. Since we deal with a power-law distribution of vertex degrees, we know that most of the distribution mass lies on vertices with the smallest degrees. Thus, it is more probable that we go through an edge added together with some user node $u'$, and from this node to an item node $i'$, the other end node of an edge $e$ of $u'$. Thus, we create a new edge, $(u, i')$. The probability of creating the edge $(u, i')$ is then the product of the probabilities of the individual random choices along this path (Equation (14)).
Equation (14) can be written in this form because, if we add a new vertex $u$ to the graph, the outgoing edges from this node are independent of each other. Because the node $i$ has a low degree, most of the outgoing links from $i$ are independent, and analogously, most of the user nodes are of low degree, so their outgoing links are independent. In general, after a "sufficient" time in the further evolution of the network, the bounced edges become distributed almost as if they had been drawn directly from the degree distribution.
Thus, the bouncing mechanism does not change the distribution over most degrees (the small ones), and it can be considered separately from the $\alpha$ and $\beta$ parameters.
On the other hand, modularity measures the change of the distribution of edge placement in a graph compared to random placement. The edges mentioned before are placed almost randomly, and they have no influence on the modularity value. However, there are other combinations of placements in bouncing—some edges are added when we add edges from $u'$ and also from $u$. In this case, the edges are not independent, because adding an edge from $u'$ to $i$ increases the probability of adding an edge from $u$ to $i$. Thus, the independence of the distribution is distorted, which implies a change of modularity. Therefore, we conclude that there may be a way to identify the bouncing parameter from the modularity, and we determine this relationship empirically. In Figure 8a, we see that this relationship seems to be linear, even for small values of $\lambda_u$ and $\lambda_v$. Unfortunately, when we add more edges in one step, the linear relation gets weaker—see Figure 8b.
9. Parameter Estimation
Here, we estimate the parameters of the model based on a couple of observable network properties. The estimations are based on the theoretical relationships between model parameters and metrics of the generated network from the previous sections. In this section, we propose algorithms for the computation of all CSUI model parameters. First, we describe the retrieval of the parameters $p$, $m$, $\lambda_u$, and $\lambda_v$. Then, we propose two algorithms. The first algorithm estimates the $\alpha$ and $\beta$ parameters using the distribution of node degrees in each modality and linear regression. The second one uses the modularity measure and linear regression for the computation of the $b$ parameter.
9.1. Parameter $p$
The theoretical equations from the previous sections, after some modification, are useful for estimating the parameters of a bipartite graph generator. The simplest one is $p$, the probability that a new vertex added to the graph in iteration $t$ is a user, so that $1-p$ is the probability that the new vertex is an item. It can be estimated as
$$\hat p = \frac{|U|}{|U| + |V|},$$
where $U$ is the set of nodes of user type and $V$ is the set of nodes of item type.
9.2. Parameters $\lambda_u$, $\lambda_v$, and m
There are two approaches to obtaining $\lambda_u$ and $\lambda_v$. The first and simplest one is to set $\lambda_u$ to the minimal degree in the user set, and analogously to set $\lambda_v$ to the minimal degree in the item set.
The second way is more complicated. The average number of edges attached in one iteration is $\eta = p\lambda_u + (1-p)\lambda_v$. $\eta$ is easy to estimate from the graph as $\hat\eta = |E| / (|U| + |V|)$, where $|E|$ is the total number of edges in the graph, and $p$ is computed as in the previous section. Most of the vertex degree distribution mass lies on the lower degrees $k$, so we can perform an integer minimization of $|\hat\eta - (p\lambda_u + (1-p)\lambda_v)|$ with the additional restriction $\lambda_u, \lambda_v \geq 1$. Another way is the brute-force approach, based on the vertex degree distribution in each modality: we fix some $\lambda_u$ and compute the $\lambda_v$ value from the equation $\hat\eta = p\lambda_u + (1-p)\lambda_v$.
$m$ is the number of initial edges. It must be at least $\max(\lambda_u, \lambda_v)$. It is better to set it somewhat higher for the computation, because this smooths the first few steps, when there are few nodes in the graph.
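A small sketch of these estimates follows; the search bound max_lam is an assumption for illustration.

```python
# Sketch: recover p, eta, and (lambda_u, lambda_v) from simple graph counts.

def estimate_basic_parameters(n_users, n_items, n_edges, max_lam=20):
    p = n_users / (n_users + n_items)           # Section 9.1
    eta = n_edges / (n_users + n_items)         # average edges per iteration
    # Integer minimization of |eta - (p*lu + (1-p)*lv)| with lu, lv >= 1.
    best = min(((lu, lv) for lu in range(1, max_lam + 1)
                for lv in range(1, max_lam + 1)),
               key=lambda lam: abs(eta - (p * lam[0] + (1 - p) * lam[1])))
    return p, eta, best
```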
9.3. Calculations of $\alpha$ and $\beta$
In Section 7, we showed theoretical linear relationships for obtaining $\alpha$ and $\beta$. Therefore, we can compute $\alpha$ from the linear model:
$$\hat\alpha = a_1 \gamma_V + a_0,$$
where $a_0, a_1$ are constants calculated from the linear regression model and $\gamma_V$ is the exponent of the power-law distribution of node degrees in the item modality. Analogously, for the $\beta$ parameter, we obtain
$$\hat\beta = c_1 \gamma_U + c_0,$$
where $c_0, c_1$ are constants calculated from the linear regression model and $\gamma_U$ is the exponent of the power-law distribution of node degrees in the user modality. Technical details of computing the exponent of the power-law distribution of node degrees are given in Section 9.4 and Section 9.5.
9.4. Calculations of the Arrays logProb and logDeg
In this section, we show how to compute the empirical power-law distribution. For each modality, we have a two-dimensional array $D$, whose number of rows is bounded by the maximal degree in the considered modality. Based on this array, we compute the arrays $logProb$, which contains the logarithm of the probability that a vertex has degree $k$, and $logDeg$, with the logarithm of the vertex degree $k$. The pseudocode is shown in Algorithm 2. It is important for Algorithm 3 from Section 9.5 that these arrays contain only degrees that actually occur in the graph.
Algorithm 2: Computation of the arrays logProb and logDeg.
Step 1. Count the vertices of each degree $k$ that exists in the graph and store them in the array $D$, where $D_{i,1}$ has the value of $k$ and $D_{i,2}$ contains the number of vertices with degree $k$.
Step 2. Get the count of vertices of the considered modality as $N = \sum_i D_{i,2}$.
Step 3. For each existing degree $k$ (index $i$), compute:
Step 3.1. $P_i = D_{i,2} / N$;
Step 3.2. $logProb_i = \ln P_i$;
Step 3.3. $logDeg_i = \ln D_{i,1}$;
Step 3.4. store the pair $(logDeg_i, logProb_i)$.
Step 4. Return the arrays $logProb$ and $logDeg$.
9.5. Calculations of the Power-Law Exponent
For each modality in the graph, we compute the exponent of the power-law distribution in the following manner. We have the two arrays computed in Section 9.4: $logProb$, which contains the logarithms of the probabilities that a vertex has degree $k$, and $logDeg$, with the logarithms of the vertex degrees $k$. From those arrays, we compute the power-law exponent $\gamma$ of the node degree distribution in Algorithm 3.
Algorithm 3: Computation of the power-law exponent $\gamma$.
Step 1. Fit a linear model to the data: $logProb \sim logDeg$.
Step 2. The returned model has two coefficients: $w_0$, the intercept, and $w_1$, the attribute coefficient.
Step 3. Take the attribute coefficient and save $\gamma = -w_1$.
Step 4. Return $\gamma$.
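Taken together, the two algorithms amount to an ordinary least-squares fit on the log-log degree histogram; a compact sketch:

```python
# Sketch of Algorithms 2 and 3: build the empirical degree distribution of one
# modality on a log-log scale, then fit a line by ordinary least squares;
# gamma is minus the slope (ln P(k) = -gamma ln k + c). Assumes all degrees
# are >= 1 and at least two distinct degrees occur.
import math
from collections import Counter

def powerlaw_exponent(degrees):
    counts = Counter(degrees)                # Algorithm 2: existing degrees only
    n = len(degrees)
    log_deg = [math.log(k) for k in counts]
    log_prob = [math.log(c / n) for c in counts.values()]
    mx = sum(log_deg) / len(log_deg)         # Algorithm 3: OLS fit
    my = sum(log_prob) / len(log_prob)
    slope = (sum((x - mx) * (y - my) for x, y in zip(log_deg, log_prob))
             / sum((x - mx) ** 2 for x in log_deg))
    return -slope

# Usage, e.g., for the user modality:
# gamma_u = powerlaw_exponent([len(adj[u]) for u in users])
```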
9.6. Calculations of Parameter $b$
As shown in Section 10.1, the relation between $b$ and the optimal modularity is well approximated by a linear model to some extent. If $\lambda_u$ and $\lambda_v$ are small, then the $b$ parameter can be predicted quite well from the simple linear model:
$$\hat b = d_1 Q_{opt} + d_0,$$
where $d_0, d_1$ are constants calculated from the linear regression model and $Q_{opt}$ is the optimal modularity. Thus, we constructed Algorithm 4.
Algorithm 4: Computation of $b$.
Input: Bipartite graph $G = (U \cup V, E)$, where $U \cap V = \emptyset$, $U$—user set and $V$—item set.
Step 1. Compute $\alpha$ and $\beta$ from Algorithm 5.
Step 2. Create a grid of $b$ values.
Step 3. For each $b$, generate the graph model and compute its modularity.
Step 4. Make a dataset containing the $b$ values and the corresponding modularity values.
Step 5. Fit a linear regression model with response vector $b$ and one variable, the modularity.
Step 6. Predict the $\hat b$ value from the model based on the modularity obtained from graph $G$.
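A sketch of Algorithm 4, assuming the generate_csuim sketch from Section 3 and some routine optimal_modularity(adj) (community detection followed by Newman's modularity, as outlined in Section 8); both helper names are our own assumptions.

```python
# Sketch of Algorithm 4: regress b on modularity over simulated graphs, then
# predict b for the real graph from its measured modularity.

def estimate_b(modularity_real, alpha, beta, p, lam_u, lam_v, m, T):
    grid = [i / 10 for i in range(0, 10)]               # Step 2: grid of b
    xs, ys = [], []
    for b in grid:                                       # Steps 3-4
        adj, users, items = generate_csuim(m, p, lam_u, lam_v,
                                           alpha, beta, b, T)
        xs.append(optimal_modularity(adj))
        ys.append(b)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)        # Step 5: regression
    d1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return (my - d1 * mx) + d1 * modularity_real         # Step 6: prediction
```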
Algorithm 5: Computation of $\alpha$ and $\beta$.
Input: Bipartite graph $G = (U \cup V, E)$, where $U \cap V = \emptyset$. We call $U$ the user set and $V$ the item set.
Step 1. Compute the exponent $\gamma_U$ of the degree distribution of the user node set $U$, and analogously $\gamma_V$ of the item node set $V$.
Step 2. Compute $p$ from Equation (16), and $\lambda_u$ and $\lambda_v$ as in Section 9.2.
Step 3. Define the set $A$ of $\alpha$ values and the set $B$ of $\beta$ values, called the grids of $\alpha$ and $\beta$ later on.
Step 4. For each pair $(\alpha_i, \beta_j) \in A \times B$, generate a bipartite graph with these parameters, setting $m$, $\lambda_u$, $\lambda_v$, and $p$ as computed in Section 9.1 and Section 9.2 and setting $b$ to zero. From the generated graph, compute the exponent of the degree distribution of the user set, and analogously of the item set, as shown in Section 9.5.
Step 5. For the data set consisting of pairs of $\alpha$ values and item-set exponents, perform linear regression with response vector $\alpha$ and one predictor variable, the item-set exponent.
Step 6. For the data set consisting of pairs of $\beta$ values and user-set exponents, perform linear regression with response vector $\beta$ and one predictor variable, the user-set exponent.
Step 7. Predict the $\hat\alpha$ value from the model based on $\gamma_V$ obtained from graph $G$.
Step 8. Predict the $\hat\beta$ value from the model based on $\gamma_U$ obtained from graph $G$.
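And the analogous sketch of Algorithm 5's core loop for $\alpha$ ($\beta$ is symmetric, using the user-modality exponent), reusing generate_csuim (Section 3 sketch) and powerlaw_exponent (Section 9.5 sketch). The grid resolution and the fixed $\beta$ used during simulation are assumptions; fixing $\beta$ is justified by the independence observed in Section 10.1.

```python
# Sketch of Algorithm 5 for alpha: regress alpha on the item-modality exponent
# over simulated graphs, then predict alpha from the real graph's exponent.

def estimate_alpha(gamma_items_real, p, lam_u, lam_v, m, T):
    grid = [i / 10 for i in range(1, 10)]              # Step 3: grid of alpha
    xs, ys = [], []
    for alpha in grid:                                  # Step 4: simulate
        adj, users, items = generate_csuim(m, p, lam_u, lam_v,
                                           alpha, 0.5, 0.0, T)
        xs.append(powerlaw_exponent([len(adj[v]) for v in items]))
        ys.append(alpha)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)       # Step 5: regression
    a1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    a0 = my - a1 * mx
    return a0 + a1 * gamma_items_real                   # Step 7: prediction
```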
10. Experimental Results
Here, we present the experimental results on parameter recovery and model quality. We performed several simulations to validate the theoretical relations involving the parameters $\alpha$ and $\beta$ described in Section 7 and the parameter $b$ described in Section 8. Those simulations are presented in Section 10.1. After verifying the theoretical properties, we carried out parameter estimation experiments: we tested how well the parameters of the CSUI model can be recovered from several real networks. The networks generated from the CSUI model and the real networks were compared based on a number of metrics described in Section 10.2.
10.1. Validity of Parameter Recovery Models
In this section, we run several simulations generating networks from the CSUI model to verify the theoretical properties. The experiments with $\alpha$ and $\beta$ were made based on Algorithm 5. The experimental results in Figure 9a,b show that the $\alpha$ and $\beta$ parameters do not depend on each other. We compare two regression models: a model with two variables (both $\alpha$ and $\beta$) and a model with one variable ($\alpha$ in Figure 9a, $\beta$ in Figure 9b). On top of each plot, the p-value of the ANOVA test of the difference between the two models is given. The adjusted R-squared values for the models in Figure 9a are 0.94, and the p-value of the ANOVA test is 0.82. The adjusted R-squared values for the models in Figure 9b are around 0.86, and the p-value of the ANOVA test is 0.63. The p-values of the ANOVA tests are greater than 0.05, so at this significance level, there is no statistically significant difference between the models. Thus, the $\alpha$ parameter does not depend on the $\beta$ parameter in Figure 9a, and the $\beta$ parameter does not depend on the $\alpha$ parameter in Figure 9b. Moreover, with more iterations (see Figure 10a,b), this independence gets stronger—there is a greater p-value of the ANOVA test.
A 2D plot of the data obtained from the experiment is given in Figure 11a,b for 5000 and 50,000 iterations, respectively. We can see that with more iterations, the spread of points for different values of the second parameter at the same value of the estimated one gets smaller, which gives a better prediction of the estimated parameter. Moreover, with more iterations, the independence gets stronger—a greater p-value of the ANOVA test. Thus, we can predict the two parameters separately.
Simulations with the $b$ parameter were made based on Algorithm 4. Plots of the data obtained from the experiment for 10,000 iterations and different values of $\lambda_u$ and $\lambda_v$ are given in Figure 12. We can see an almost ideal fit (adjusted R-squared value above 0.98).
10.2. Retrieval of Parameters
In our experiments, we used several topical fora from the StackExchange data dump from December 2011. This database is available online and licensed under the Creative Commons BY-SA 3.0 license. Stack Exchange is a fast-growing network of question-and-answer sites on diverse topics, from software programming to cooking, photography, and gaming. We analyzed the databases of the following fora: bicycles, database administrators, drupal answers, IT security, physics, tex-latex, theoretical computer science, unix-linux, web applications, webmasters, and wordpress. From these data, a bipartite graph was created for each dataset: one modality contained users, the other topics. An edge was created when a user participated in a topic by writing a post in it; the edge between a user and a topic was created only once. We interpreted the network structure as an undirected graph with no edge weights.
Due to the limitations of the CSUI model, we took into consideration only the giant component (GC), i.e., the biggest connected component of the graph. In a real-world graph, the GC contains 70% or more of the whole graph and dominates the growth of the network. From the created bipartite graphs, we calculated several graph and model properties and compared them to an artificial graph generated from the CSUI model. The metrics used in the experiments were:
Total Nodes—the total number of nodes in the GC;
Total Edges—the total number of edges in the GC;
Average Degree—the average node degree in the GC;
Diameter—the maximal distance over all pairs of nodes in the GC;
Radius—the radius of the GC; the radius of a graph is the minimum eccentricity over its vertices, where the eccentricity of a vertex v is the greatest geodesic distance between v and any other vertex;
Average Path Length—the average number of steps along the shortest paths over all possible pairs of network nodes; it is a measure of the efficiency of information or mass transport in a network;
Number Of Shortest Paths—the number of shortest paths in the GC;
Communities Number—the number of communities found by Newman's modularity algorithm in the GC (more details in Section 8);
Density—measures how close the network is to a complete graph; a complete graph has all possible edges and density equal to 1;
Modularity—Newman's modularity described in Section 8;
Avg Item Clustering—the average value of BLCC for the item modality, based on Equation (2);
Avg User Clustering—the average value of BLCC for the user modality, based on Equation (2);
UsersCount—the number of nodes in the user modality;
ItemsCount—the number of nodes in the item modality;
User Average Degree—the average degree of user nodes;
Item Average Degree—the average degree of item nodes;
gen alpha—the value of the parameter $\alpha$ of the CSUI model. The computation is based on Algorithm 5 from Section 9.6. In the column "Graph", it is computed from the real graph, and in the column "Model", from the network generated by the CSUI model. This value lies in the interval $[0, 1]$; we give the exact value returned by the linear model for demonstration purposes;
gen beta—the value of the parameter $\beta$ of the CSUI model. The computation is based on Algorithm 5 from Section 9.6. Interpretation as for the gen alpha metric;
gen p add user—the value of the parameter $p$ of the CSUI model. The computation is based on Section 9.1;
gen p bouncing—the value of the parameter $b$ of the CSUI model. The computation is based on Algorithm 4 from Section 9.6;
ExpUserCoeff—the exponent of the distribution of node degrees of the user modality; the computation is based on Section 9.5;
ExpItemCoeff—the exponent of the distribution of node degrees of the item modality; the computation is based on Section 9.5;
graph eta—the average number of edges per iteration, $\hat\eta = |E| / (|U| + |V|)$.
We extracted the graph parameters as shown in Section 9. It turned out (see Table 1, Table 2, Table 3, Table 4 and Table 5) that the most crucial parameters were $\lambda_u$ and $\lambda_v$. The values of these two parameters determine how similar the graph generated by the model will be to the real one. We used two methods for finding the optimal values of $\lambda_u$ and $\lambda_v$: discrete optimization and the brute-force approach described in Section 9.2. The brute-force approach gave the best results in half of the cases.
11. Conclusions
The Cold Start User-Item Model (CSUIM) of bipartite graphs is very flexible and can be applied to many different problems. In this article, we showed that the parameters of the CSUI model can be recovered from an unknown bipartite graph. We presented several algorithms to estimate the most important parameters:
$p$—the probability that a new vertex added to the graph in iteration $t$ is a user;
$\alpha$—the probability of preferential attachment for new user vertices; $1-\alpha$—the probability of uniform attachment;
$\beta$—the probability of preferential attachment for new item vertices; $1-\beta$—the probability of uniform attachment;
$b$—the fraction of edges attached in a preferential way that were created using the bouncing mechanism.
We also gave some advice about setting the remaining parameters: $m$, $\lambda_u$, and $\lambda_v$. The experimental results showed that the CSUI model can be applied, to some extent, to modeling the bipartite graph of users and user posts.
Moreover, we gave a theoretical basis for estimating the parameters $\alpha$ and $\beta$ from the degree distribution of each of the modalities. We showed that for small $k$ (consuming most of the probability mass) and fixed $\lambda_u$ (or $\lambda_v$), the value of $\ln P(k)$ decreases nearly linearly with $\ln k$. The experiments presented in this paper showed that, not only in theory but also in practice, the computation of $\alpha$ does not depend on the value of $\beta$, and vice versa. The sampling for the linear regression models can easily be parallelized for more efficient computations. We also found that the bouncing parameter $b$ is linearly correlated with Newman's optimal modularity. Experiments on real-world graphs showed that, from these theoretical relationships, the CSUI model parameters $\alpha$, $\beta$, and $b$ can be extracted quite well.
An in-depth analysis of the CSUI model provides an essential guide for future research concerning the generation of disconnected graphs. In general, this is a hard problem, and to simplify it, we modeled only the giant component of the analyzed graph. Although the CSUI model can produce a disconnected graph through its initialization step, it can only merge disconnected components; it does not produce (split off) new components, as happens in real-world networks.