6.1. Experiments on Synthetic Data
In Section 4, we generalized the Bass Model with a probabilistic approach into a model of diffusion over heterogeneous, connected social networks. This generalization enables us to estimate unobservable dynamic influence across heterogeneous social networks (meta-populations), given only the cumulative adopters of each homogeneous network (one meta-population) over time. The goal of this section is to recover the hidden diffusion processes from generated synthetic datasets. To test model performance, we evaluate model fitting errors and parameter errors in the experiments. The former describe how closely our model predicts the cumulative number of adopters for each meta-population, while the latter show how accurately our model infers the ground truth parameters.
Synthetic Data Generation: As we discussed in Section 3, the effects of interactions between different social networks on diffusion are not ignorable, and thus, we can think of all possible directions of influence flow between meta-populations. When it comes to news diffusion across heterogeneous social networks whose types are news, SNS and blog, we can build a 3 × 3 adjacency matrix for representing the existence of influence between two media types, and thus, there are
possible cases of relational structures among three media types in total. If we also vary the strength of influence, then the number of potential cases becomes intractable. For efficient and meaningful simulation, it is important to generate synthetic datasets reflecting representatives among such numerous possible cases.
Figure 6.
Unique structures of dynamic influence flows among three meta-populations, each of which reflects one of the three media types, such as news, SNS and blog, in our real data. All graphs include self-loops (influence between same media types), which are omitted for brevity. Empty links between two different nodes represent very weak connections compared to nonempty links, but they are not ignorable for a more accurate understanding of diffusion across heterogeneous social networks. Thus, they are all directed connections between nodes, but with different strengths. Our synthetic datasets are generated based on these structures.
To avoid redundant cases, we first consider the unique structures of the relations, which leaves us with 16 dynamic relations, as shown in Figure 6. The dynamic structures include 13 motifs (1–13) and three additional disconnected graphs (14–16). In this figure, each graph has three self-loops, indicating interactions within meta-populations, but they are all omitted for brevity. We assume that influence always exists between two media types, but with different strengths. Thus, empty links between two different nodes represent very weak influence compared with nonempty links. Note that the 16th graph has weak connections between nodes, which avoids the most trivial case, i.e., isolated social networks in Figure 1(a). In addition, applying a threshold to the strength of influence can simplify dynamic influence to the presence or absence of influence, and the appropriate threshold depends on the application domain. However, in this study, we retain even weak influence in order to consider real-world situations exhibiting dynamic relations between heterogeneous social networks.
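The count of 16 unique structures can be checked by brute force. The following Python sketch is an illustration only and is not part of the original study; the function names are hypothetical. It enumerates every directed graph on three labeled nodes without self-loops, merges graphs that differ only by a relabeling of the nodes, and separates weakly connected structures from disconnected ones.

```python
from itertools import permutations, product

NODES = (0, 1, 2)
# The six possible directed links between two different nodes.
PAIRS = [(i, j) for i in NODES for j in NODES if i != j]

def canonical(edges):
    """Return a canonical form of an edge set under node relabeling."""
    forms = []
    for perm in permutations(NODES):
        forms.append(tuple(sorted((perm[i], perm[j]) for i, j in edges)))
    return min(forms)

def weakly_connected(edges):
    """True if all three nodes are joined when link direction is ignored."""
    reached = {0}
    changed = True
    while changed:
        changed = False
        for i, j in edges:
            if (i in reached) != (j in reached):
                reached |= {i, j}
                changed = True
    return len(reached) == len(NODES)

def unique_structures():
    """Collect one representative per isomorphism class of link patterns."""
    classes = set()
    for mask in product((0, 1), repeat=len(PAIRS)):
        edges = [pair for pair, bit in zip(PAIRS, mask) if bit]
        classes.add(canonical(edges))
    return classes

structures = unique_structures()
connected = [s for s in structures if weakly_connected(s)]
disconnected = [s for s in structures if not weakly_connected(s)]
print(len(structures), len(connected), len(disconnected))  # prints: 16 13 3
```

The three disconnected classes are the graph with no strong link, the graph with a single strong link and the graph with one mutual pair of strong links, which agrees with graphs 14–16 being described as disconnected.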
Accordingly, three variants of link weights are considered as (non-empty-link-weight, empty-link-weight) =
in order to cover exemplary cases, such as (1) dominant influence between the same media types, (2) strong influence of one media type on the others and (3) balanced influence among three media types. We finally generated 48 (= 16 × 3 variants) datasets of cumulative adopters for the diffusion model as
, as shown in
Figure 7. The length of time step
T is chosen as one month (30 days) to reflect our real dataset period, and the subscripts 1, 2 and 3 of
indicate the three different types of meta-populations. Each link between nodes corresponds to the direction and strength of influence between meta-populations. In our model, they are denoted as
, which is the probability that an individual of type i adopts when its neighbor of type j has adopted, as discussed in Equation (10).
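To make the generation procedure concrete, the sketch below builds a weight matrix from one structure and simulates daily cumulative adopters for three meta-populations. It is a simplified stand-in under stated assumptions: the discrete-time, Bass-style update rule is not the paper's Equation (10), and all numeric values (link weights, external influence, population sizes) are hypothetical placeholders rather than the settings used in the experiments.

```python
def weight_matrix(strong_links, strong, weak, self_loop, n=3):
    """Build q, where q[i][j] is the influence of type j on type i.

    strong_links holds (source, target) pairs for nonempty links; every
    other off-diagonal entry receives the weak weight (empty link).
    """
    q = [[weak] * n for _ in range(n)]
    for i in range(n):
        q[i][i] = self_loop
    for source, target in strong_links:
        q[target][source] = strong
    return q

def simulate(p, q, m, days=30):
    """Simulate daily cumulative adopters for each meta-population.

    ASSUMPTION: illustrative update rule, not the paper's Equation (10).
    p[i] is the external influence on type i and m[i] its population size.
    """
    n = len(m)
    adopters = [0.0] * n
    history = []
    for _ in range(days):
        fractions = [adopters[j] / m[j] for j in range(n)]
        updated = []
        for i in range(n):
            # Adoption probability per remaining individual, capped at 1.
            hazard = p[i] + sum(q[i][j] * fractions[j] for j in range(n))
            hazard = min(hazard, 1.0)
            updated.append(adopters[i] + hazard * (m[i] - adopters[i]))
        adopters = updated
        history.append(list(adopters))
    return history

# Hypothetical example: news (0) strongly influences SNS (1) and blog (2).
q = weight_matrix([(0, 1), (0, 2)], strong=0.3, weak=0.01, self_loop=0.2)
history = simulate(p=[0.01, 0.005, 0.002], q=q, m=[1000, 5000, 2000])
```

Repeating this for each of the 16 structures and each of the three weight variants would give 48 synthetic histories, one per dataset.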
Figure 7.
Synthetic data generation reflecting dynamic influences among three different types of meta-populations. Forty-eight synthetic datasets are generated in total, and the different population sizes of the three meta-populations reflect real-world situations, such as different numbers of adopters in news, SNS and blog media. The generated datasets are illustrated with daily cumulative adopters (left) and the proportion of the corresponding cumulative adopters (right).
Evaluation Metrics: Let us denote the model parameters by
, where
denotes the population size of each media type
i, and the definitions of
are different in each diffusion model. For example,
in the Bass Model, while
in the Dynamic Influence Model. To fit each model to the generated synthetic datasets, we apply nonlinear least squares (NLS) [
38], which minimizes the normalized root mean squared errors (RMSE):
where
is the estimated adoption probability of each population at time
t. Note that due to the parameter identification problem, where the same results are produced with different settings of parameters, we fix the power law coefficient
α to be
, whose value typically lies between 2 and 3 [27,33].
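The fitting step can be illustrated for the baseline Bass Model. The sketch below is a minimal example under stated assumptions: "normalized" is taken to mean that cumulative adopters are divided by the population size before the RMSE is computed, and a coarse grid search stands in for a proper NLS solver. The grids, the population size and the function names are hypothetical.

```python
import math

def bass_cumulative(t, p, q):
    """Closed-form cumulative adoption fraction F(t) of the Bass Model."""
    decay = math.exp(-(p + q) * t)
    return (1.0 - decay) / (1.0 + (q / p) * decay)

def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    n = len(observed)
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, predicted)) / n)

def fit_bass(adopters, m, p_grid, q_grid):
    """Return (error, p, q) minimizing the RMSE over a parameter grid.

    ASSUMPTION: grid search replaces the NLS optimizer for illustration.
    """
    fractions = [a / m for a in adopters]  # normalize by population size
    days = range(1, len(fractions) + 1)
    best = None
    for p in p_grid:
        for q in q_grid:
            predicted = [bass_cumulative(t, p, q) for t in days]
            error = rmse(fractions, predicted)
            if best is None or error < best[0]:
                best = (error, p, q)
    return best

# Hypothetical check: recover known parameters from noise-free data.
m = 1000
adopters = [m * bass_cumulative(t, 0.03, 0.38) for t in range(1, 31)]
error, p_hat, q_hat = fit_bass(
    adopters, m, [0.01, 0.02, 0.03, 0.04, 0.05], [0.30, 0.34, 0.38, 0.42]
)
```

Because the synthetic curve is generated from a point on the grid, the search returns that point with an error close to zero; with real data, a continuous optimizer would be the natural replacement for the grid.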
Table 4 shows the averages and standard deviations of model fitting errors (RMSE) of two diffusion models, the Bass Model (BM) and our Dynamic Influence Model (DM), with the generated datasets. The DM outperforms the BM with a smaller standard deviation, but this is not surprising, since the DM has more degrees of freedom, due to having more parameters than the BM. Therefore, we also compared the prediction errors of the two models, as shown in Table 5. Over the one-month diffusion period, we used the first 60 and 80 percent of the cumulative adoption history in each dataset for training the model parameters and then estimated the remaining 40 and 20 percent, respectively, with the learned parameters. As the table shows, the DM still outperforms the BM by one order of magnitude. The estimated parameter errors (averages and standard deviations in
Table 6) are also acceptable when compared to typical values of parameters in the BM (
,
and
) [
23], showing the feasibility of our model to reproduce parameters from the datasets.
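The train and test partition described above is a chronological split of each 30-day history. A minimal sketch, with a hypothetical helper name, is:

```python
def split_history(series, train_fraction):
    """Split a daily series chronologically into training and test parts."""
    cut = round(len(series) * train_fraction)
    return series[:cut], series[cut:]

days = list(range(1, 31))  # day indices of a 30-day history
train_60, test_40 = split_history(days, 0.6)
train_80, test_20 = split_history(days, 0.8)
print(len(train_60), len(test_40), len(train_80), len(test_20))  # prints: 18 12 24 6
```

The test lengths of 12 and 6 days match the prediction horizons stated for the 60:40 and 80:20 settings.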
Table 4.
Averages and standard deviations of model fitting errors (root mean squared errors (RMSE)) with synthetic datasets (BM: Bass Model, DM: Dynamic Influence Model).
| | BM | DM |
|---|---|---|
| Mean | 2.19e-3 | 3.74e-4 |
| STD | 8.77e-4 | 1.29e-4 |
Table 5.
Averages and standard deviations of prediction errors (RMSE) with synthetic datasets. Given daily cumulative adopters for 30 days, the prior 60 and 80 percent of the adoption history in each dataset are used for training parameters to predict the remaining 40 (12 days) and 20 percent (6 days), respectively.
| | BM (Train:Test = 60:40) | DM (Train:Test = 60:40) | BM (Train:Test = 80:20) | DM (Train:Test = 80:20) |
|---|---|---|---|---|
| Mean | 2.41e-3 | 1.83e-4 | 5.99e-4 | 4.2e-5 |
| STD | 2.16e-3 | 1.72e-4 | 6.06e-4 | 4.2e-5 |
Table 6.
Averages and standard deviations of parameter errors of the proposed model with synthetic datasets (: external influence of individuals of type i; : population of individuals of type i; : internal influence of neighbors of type i on individuals of type j).
| Meta-population 1 | Meta-population 2 | Meta-population 3 |
---|
Par. | | | | | | | | | | | | | | | |
Avg. | 3.1e-4 | 1.6e-2 | 2.8e-2 | 1.4e-2 | 2.0e-1 | 2.7e-4 | 2.0e-2 | 3.6e-2 | 1.6e-2 | 2.2e-1 | 3.6e-4 | 1.3e-2 | 2.4e-2 | 1.4e-2 | 4.1e-1 |
Std. | 2.8e-4 | 1.9e-2 | 3.0e-2 | 1.4e-2 | 1.9e-1 | 2.6e-4 | 1.9e-2 | 3.4e-2 | 1.6e-2 | 2.3e-1 | 2.7e-4 | 1.2e-2 | 2.1e-2 | 1.4e-2 | 3.3e-1 |
6.2. Experiments on Real Data
In Section 5, we described the preparation and analysis of the Spinn3r dataset. Among the 60 million English documents, we selected documents that contain at least one hyperlink in their main text and were created by the 6.4 million identified users in Table 2. We labeled these documents with 284 identified real-world news topics by using Wikipedia Current Events. Finally, we selected the 63 news topics in Table 3, each of which drove adoptions by at least 150 identified users across social media. Thus, there are 63 real datasets, each of which consists of daily cumulative adopters for three media types (news, SNS and blog) during a one-month period as an input,
i.e.,
.
As we discussed in Section 4, our macro-level diffusion model does not require detailed network structures (see Equation (13) and Appendix A); all we need to know is the power-law exponent, α, under the assumption of a power-law degree distribution. The degree distributions of most real-world networks have power-law exponents between 2 and 3 [27,33]. In social media, the entire Twittersphere, comprising 41.7 million users, exhibited an exponent of about 2.3 [9], the blogosphere showed exponents of 2.5 and 2.6 [1,4], and the authorship networks in our real data also follow a power-law degree distribution with an exponent of 2.3. Based on the observations from both related work and our study, we set the power-law exponent, α, to 2.5. With the collected 63 real datasets, we fit the models and further examine how real-world news spreads across social media by comparing the diffusion patterns of the six categories in Table 3.
Table 7.
Averages and standard deviations of RMSE for both model fitting and prediction errors (train: test = 80:20, for each dataset) with real datasets.
| | BM (Model Fitting Error) | DM (Model Fitting Error) | BM (Prediction Error) | DM (Prediction Error) |
|---|---|---|---|---|
| Mean | 2.866e-2 | 2.259e-2 | 3.207e-2 | 2.481e-2 |
| STD | 1.902e-2 | 1.027e-2 | 3.698e-2 | 1.018e-2 |
There are no ground truths of parameter values in the real data, so we fit the proposed model (DM) and the baseline model (BM) using nonlinear least squares (NLS), as in the experiments on the synthetic datasets, and evaluate model fitting errors and prediction errors, as shown in Table 7. Overall, due to noise in the real datasets, the model fitting and prediction errors are at least one order of magnitude larger than those on the synthetic datasets in Table 4 and Table 5. However, our proposed model still performs better than the BM, with smaller standard deviations in all cases. This result suggests that news diffusion is influenced by different social networks in a directed way, and thus, the proposed model can improve on the accuracy of diffusion models that deal with single social networks.
Figure 8.
Example cases of model fitting results with real dataset from “arts and culture” and “politics” categories (BM: Bass Model; DM: Dynamic Influence Model; A: cumulative adopters up to time t). (a) Arts and culture: the film “Black Swan”; (b) arts and culture: multiculturalism failure; (c) politics: Yemen protests.
Concurrent Diffusion across Social Media: We examine how diffusion patterns differ with the context of information. Figure 8 shows three example cases of model fitting results from the arts and culture and the politics categories. As Figure 8(a) and Figure 8(b) demonstrate, topics in the same category do not necessarily exhibit similar diffusion patterns. In Figure 8(a), the news about the film “Black Swan” spreads rapidly in news media first and then continues to spread to other social media (more rapidly in SNS than in blogs). On the other hand, in the case of the “multiculturalism” issue in Figure 8(b), growth was slow at the beginning, but the diffusion begins to grow sharply and simultaneously across all media types after 23 days, when the UK Prime Minister, David Cameron, declared the failure of multiculturalism [39]. Similar concurrent behaviors are observed in the diffusion of political movements in the Middle East, such as those in Tunisia, Egypt, Sudan and Yemen. As shown in Figure 8(c), the “Yemen protests” demonstrate synchronous diffusion patterns after 15 days.
Without direct interactions across social media, such simultaneous growth is unlikely to occur. As the figure shows, the BM cannot follow these concurrent growth patterns, because it does not consider the effects of influence among heterogeneous social networks. Therefore, influences from different social networks are not ignorable for a better understanding of diffusion processes.
Dynamic Influence in Social Media by Context of Information: By categorizing news topics according to
Table 3, we attempt to distinguish different diffusion patterns in terms of the strength and directionality of influence.
Figure 9 shows the distributions of estimated parameter values, where
in the
x-axis indicates the influence of media type
i on the other media type,
j, and the
y-axis represents the probability that the influence of type
i on
j is equal to or greater than
.
Figure 9(a) shows overall trends of interactions among the three media types by aggregating the parameter values of all news content. In general, news media are influenced by all media types in a balanced way, while SNS and blog media, in that order, exhibit stronger internal interactions within the same media type. Given their role, news media seem to need to monitor and reflect the trends of other media types, narrowing the gaps with them. As discussed earlier, the arts and culture topics demonstrate different diffusion patterns, as shown in Figure 9(b) and Figure 9(c). News media are the most influential in the diffusion of arts topics, such as the Academy Awards, film releases and celebrities. However, in the culture category, blog media tend to show strong influence on news and SNS media. Controversial subjects, such as multiculturalism failure and female education in Afghanistan, seem to lead to longer discussions that express personal opinions, and thus, blog media can be a more suitable space than micro-blogs or unbiased news media. As with the arts topics, news media occupy influential positions in the economy topics. Exact statistics or facts about economic status can be described reliably in news media. Interestingly, for political topics, SNS media exhibit the highest influence on all media types, while the influence of blog media is negligible. Political news, such as the Middle East protests, the Tucson shooting and WikiLeaks, generally has great social repercussions. In this respect, the micro-blogging space can be a better medium for distributing urgent issues rapidly, and their prompt proliferation leads news media to focus on the issues. In the technology and science category, internal buzz in SNS media is predominant, in contrast to blog media.
In summary, news media are the most influential in the arts and the business and economy categories, while in the politics and the culture categories, SNS and blog media are influential, respectively. SNS media show strong internal interactions regarding the technology and science category in contrast to blog media. However, the characteristics of topics are more important than the categories, as we observed to be the case for the arts and culture category.
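The curves in Figure 9 are described as the probability that an inferred influence value is equal to or greater than a given value, which is an empirical complementary cumulative distribution. A minimal sketch of that computation follows; the function name and the sample values are hypothetical and are not taken from the fitted models.

```python
def ccdf(values):
    """Return sorted (x, P(value >= x)) pairs for an empirical sample."""
    n = len(values)
    points = []
    for x in sorted(set(values)):
        share = sum(1 for v in values if v >= x) / n
        points.append((x, share))
    return points

# Hypothetical influence estimates for one (i, j) pair across four topics.
print(ccdf([0.1, 0.2, 0.2, 0.4]))  # prints: [(0.1, 1.0), (0.2, 0.75), (0.4, 0.25)]
```

A curve that stays high for large x indicates that strong influence from type i on type j is common across the topics in a category.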
Figure 9.
Distributions of inferred parameter values with the real dataset by categories. The x-axis indicates value of parameter (the probability of the influence of media type i on j); the y-axis represents the probability of the parameter value more than . (a) All categories; (b) arts; (c) culture; (d) business and economy; (e) politics; (f) technology and science.
Figure 10.
Averages and standard deviations of cumulative adoption rates for all 63 pieces of news content by media types.
Diffusion Rate of Social Media: Figure 10 shows the averages and standard deviations of cumulative adoption rates for the 63 pieces of news content by media type. In general, news media spread information most rapidly, followed by SNS media. SNS media show a pattern very similar to that of news media, which suggests that they are highly responsive to the diffusion trends of news media. In contrast, the diffusion rate in blog media grows more slowly than in the other types.
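The per-day averages and standard deviations behind such a plot can be computed as sketched below for one media type. The helper name and the two short curves are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def daily_summary(curves):
    """Return (mean, standard deviation) per day across topics.

    curves is a list with one sequence of cumulative adoption rates per
    topic, all of the same length (one value per day).
    """
    return [(mean(day), stdev(day)) for day in zip(*curves)]

# Hypothetical cumulative adoption rates of two topics over three days.
summary = daily_summary([[0.1, 0.5, 1.0], [0.3, 0.7, 1.0]])
```

Applying the same helper to the news, SNS and blog curves separately would give the three bands that are compared in the figure.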
In this section, by conducting experiments on both synthetic and real datasets, we showed a way of interpreting diffusion in terms of the strength and directionality of influence between populations. As a result, we found that news diffusion in social media can be attributed to heterogeneous social networks that are not separate, but interconnected.