4.1. Real-Life Usage Example
In this section, a brief real-life usage example of the proposed approach will be presented, explaining every step of the proposed framework on a small real network. In further sections, a more in-depth analysis is performed on a larger synthetic network.
The empirical example in this section is performed on the Enron email network [53], which was selected due to its limited size (143 nodes and 623 edges); this allows the status of every single node of the network to be studied in detail. It is important to keep in mind that the proposed approach is intended for networks whose nodes are characterized by multiple attributes. Because publicly available network repositories mostly provide only edge lists, the attributes had to be overlaid on the network artificially. Therefore, artificial values for two attributes were generated for the network, based on [54]: gender (69 male and 74 female nodes) and age (0–29 years: 62 nodes, 30–59 years: 55 nodes, over 60 years: 26 nodes).
For this network, two complete scenarios with two different targets are presented for illustrative purposes. In both, a constant propagation probability (0.1) and a constant seeding fraction (0.05, i.e., 7 vertices) are assumed.
4.1.1. Target 1: Male Aged 0–29
In this scenario, the aim of the viral marketing campaign is to reach men aged 0–29, that is, the targets are described by specific values of two criteria: gender (C2) and age (C5). The target group, therefore, consists of 28 nodes (see Figure 2). Apart from the two target-describing attributes, six other criteria are also available: degree (C1), degree male (C3), degree female (C4), degree aged 0–29 (C6), degree aged 30–59 (C7), degree aged 60+ (C8). The decision maker (DM) or analyst, based on their expertise, provides the preference weights for all criteria: C1: 8.20, C2: 25.40, C3: 12.60, C4: 3.80, C5: 28.40, C6: 14, C7: 3.80, C8: 3.80. These weights are input data to the proposed approach: they are the ones which, according to the DM, allow the nodes to be ranked so as to find the seeds with the best potential for maximizing influence in the targeted group. To provide such weights, the analyst can refer to archival knowledge and use decision support systems or MCDA methods such as AHP [39].
Once the preference weights are known, the TOPSIS method is used to evaluate all vertices. The top seven (seeding fraction 0.05) are chosen as seeds and the campaign is started.
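The evaluation step described above can be sketched in a few lines of Python. This is only a minimal illustration of a common TOPSIS variant (vector normalization, Euclidean distances), not necessarily the exact configuration used in the study. The function names topsis and select_seeds, the toy criteria values and the assumption that C2 and C5 are encoded as mismatch (cost) criteria are all hypothetical; only the weights are taken from the text.

```python
import math

def topsis(matrix, weights, benefit):
    """Return one TOPSIS closeness score per row (0..1, higher is better)."""
    n_crit = len(weights)
    total = sum(weights)
    w = [x / total for x in weights]  # weights rescaled to sum to 1
    # Vector-normalise every column (an all-zero column keeps norm 1.0).
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    v = [[w[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Ideal / anti-ideal values depend on each criterion's direction.
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_pos, d_neg = math.dist(row, ideal), math.dist(row, anti)
        scores.append(d_neg / (d_pos + d_neg) if d_pos + d_neg else 0.0)
    return scores

def select_seeds(scores, fraction):
    """Indices of the top int(n * fraction) rows, best score first."""
    k = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]

# Target 1 weights from the text, in the order C1..C8.
WEIGHTS = [8.20, 25.40, 12.60, 3.80, 28.40, 14, 3.80, 3.80]
# Assumption: C2 and C5 are mismatch (cost) criteria, the rest are benefit.
BENEFIT = [True, False, True, True, False, True, True, True]

# Hypothetical criteria values for three vertices (not taken from the paper).
toy = [
    [5, 0, 3, 2, 0, 3, 1, 1],
    [1, 1, 0, 1, 2, 0, 0, 0],
    [3, 1, 1, 1, 1, 1, 1, 0],
]
scores = topsis(toy, WEIGHTS, BENEFIT)
```

With 143 vertices and a seeding fraction of 0.05, select_seeds returns seven indices, matching the seven seeds used in this example.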
For this scenario, the simulations (see Figure A1 in Appendix A) have shown that the campaign reached on average 9/28 targeted nodes (32.14%), with a global coverage of 0.2224. A traditional degree-based approach for the same network reaches on average 7.7/28 targeted nodes (27.5%), with a global coverage of 0.2881. The multi-criteria approach thus reached 4.64 percentage points more of the targeted nodes, with global coverage lower by 0.0657.
4.1.2. Target 2: Female Aged 30–59
In this scenario, the aim of the viral marketing campaign is to reach women aged 30–59. The target group consists of 24 nodes (see Figure 2). Again, apart from the two target-describing attributes, six other criteria are also available: degree (C1), degree male (C3), degree female (C4), degree aged 0–29 (C6), degree aged 30–59 (C7), degree aged 60+ (C8). It is important to note that, contrary to other approaches [4], in the proposed approach the criteria values are reused and only the preference weights are adjusted. This time, the decision maker, based on their expertise, provides the following preference weights for the criteria: C1: 4.4, C2: 30.4, C3: 4, C4: 10.4, C5: 30.4, C6: 5.4, C7: 10.4, C8: 4.4.
Once the preference weights are known, the TOPSIS method is used to evaluate all vertices. The top seven (seeding fraction 0.05) are chosen as seeds, and the campaign is started.
For this scenario, the simulations (see Figure A2 in Appendix A) have shown that the campaign reached on average 9.5/24 targeted nodes (39.58%), with a global coverage of 0.2552. A traditional degree-based approach for the same network reaches on average 6.8/24 targeted nodes (28.33%), with a global coverage of 0.2881. The multi-criteria approach thus reached 11.25 percentage points more of the targeted nodes, with global coverage lower by 0.0329.
4.1.3. Real-Life Example Discussion
In the real-life example, two complete scenarios with two different targets were presented. As expected, in both cases the proposed approach lowered the global coverage but increased the influence in the targeted set of nodes. In both cases, it was the decision-maker (DM) who first determined the values of the weights. This is a subjective assessment, based on the DM’s knowledge, skills and experience. Had the weights been estimated improperly, the nodes would have been ranked differently and, therefore, a different set of 7 nodes would have been selected as seeds (see Section 3.4). This, in turn, could result in reaching fewer targeted nodes in the network (see Section 4.8). The actual participation of the decision-maker in solving the task is very important in MCDA, and the performance of the obtained solution depends on both the quality of the attributes and the proper selection of the vector of relative importance of the decision model criteria. Maximizing the potential of the seeded nodes to reach the targeted nodes therefore requires searching for the most satisfactory values of this vector.
4.3. Criteria for Seed Selection
As was described in Section 3, in the proposed approach the initial seeds were selected from the network based on multiple criteria. In the case of the studied synthetic network, apart from the sex and age attributes, the general degree of each node was also taken into account, as well as the degree measurements based on each value of the two attributes. This resulted in a total of eight evaluation criteria, presented in Table 1.
Criterion C1 represents the number of neighbors of each evaluated vertex. Criterion C2 is based on the sex attribute and is equal to 0 if the actual sex of the vertex matches the targeted one, or 1 in the case of a mismatch. Criterion C3 represents the count of male neighbors of a vertex, whereas criterion C4 represents the count of its female neighbors. In turn, criterion C5 indicates the difference between the targeted and actual age group of a vertex. For example, if the targeted age group was young, vertices from the young, mid-aged and elderly age groups would obtain the values of 0, 1 and 2 respectively. Since the targeted group in this experiment is the middle one, that is, mid-aged, vertices from this group obtain the value 0 and vertices from the other groups obtain the value 1 for criterion C5. Last, but not least, criteria C6, C7 and C8 represent the counts of, respectively, young, mid-aged and elderly neighbors of a vertex. All criteria C1–C8 were then assembled into a single decision matrix for the TOPSIS method. At this stage, it is important to note that the authors decided to use degree-based criteria, as the degree is the most basic measure which can be used for benchmarking the approach. If another measure, such as closeness, betweenness, eigencentrality, and so forth, were used as criterion C1, the remaining criteria C3, C4, C6, C7 and C8 would also need to be modified to use the selected metric.
The last step required for the seed-selection setup was specifying the preference direction of all evaluation criteria C1–C8. Because criteria C2 and C5 represent the difference between the targeted and actual values, the lowest possible values were preferred (minimization). On the other hand, since the remaining criteria are based on the degree centrality measure, the highest possible values were preferred for them (maximization).
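As an illustration of how the eight criteria and their preference directions could be assembled into a decision matrix, consider the following minimal sketch. The adjacency-dictionary representation, the attribute labels ("M", "F", "young", "mid-aged", "elderly"), the function name decision_matrix and the three-vertex toy graph are assumptions made for this example only.

```python
AGE_GROUPS = ["young", "mid-aged", "elderly"]

def decision_matrix(adj, sex, age, target_sex, target_age):
    """Build the C1..C8 row for every vertex of an undirected graph."""
    t = AGE_GROUPS.index(target_age)
    rows = {}
    for v, nbrs in adj.items():
        rows[v] = [
            len(nbrs),                                # C1: degree
            0 if sex[v] == target_sex else 1,         # C2: sex mismatch
            sum(sex[u] == "M" for u in nbrs),         # C3: male neighbors
            sum(sex[u] == "F" for u in nbrs),         # C4: female neighbors
            abs(AGE_GROUPS.index(age[v]) - t),        # C5: age-group distance
            sum(age[u] == "young" for u in nbrs),     # C6: young neighbors
            sum(age[u] == "mid-aged" for u in nbrs),  # C7: mid-aged neighbors
            sum(age[u] == "elderly" for u in nbrs),   # C8: elderly neighbors
        ]
    return rows

# Preference directions: True = maximize, False = minimize (C2 and C5).
BENEFIT = [True, False, True, True, False, True, True, True]

# Hypothetical three-vertex graph: a is connected to b and c.
adj = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
sex = {"a": "F", "b": "M", "c": "F"}
age = {"a": "mid-aged", "b": "young", "c": "elderly"}
rows = decision_matrix(adj, sex, age, target_sex="F", target_age="mid-aged")
```

For a mid-aged target, the age-group distance in C5 can only be 0 or 1, which agrees with the description above; for a young target it would take the values 0, 1 and 2.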
After the experiment was set up, three scenarios based on various weights of individual criteria were studied. Their description and results are presented in the following sections.
4.5. Scenario 2: Two Criteria
In the second scenario, the preference weight of the degree measure was reduced in favor of the more accurate female degree (C4) and mid-aged degree (C7). Therefore, the weights of C4 and C7 were set to 100, while the weights of the remaining criteria were set to 1. All vertices were evaluated again under the new conditions and their ranking was built. The correlation coefficient between the rankings obtained in the first and second scenarios is equal to 0.9022 for the scores and 0.7510 for the ranks of the vertices. The results for the top 50 vertices, selected as seeds, are presented in Table 3.
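The comparison of two rankings reported above can be reproduced along the following lines. The text does not state which correlation coefficient was used, so this sketch assumes the Pearson coefficient, computed once on the raw scores and once on the ranks; the helper names pearson and ranks and the sample scores are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(scores):
    """Rank 1 = highest score; ties are broken by vertex position."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    result = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result

# Hypothetical scores of four vertices in two scenarios.
scenario_1 = [0.90, 0.52, 0.54, 0.40]
scenario_2 = [0.95, 0.58, 0.54, 0.47]
score_corr = pearson(scenario_1, scenario_2)
rank_corr = pearson(ranks(scenario_1), ranks(scenario_2))
```

Breaking ties by position is only one possible convention; when many vertices share the same score, the choice of tie-breaking rule affects the rank correlation.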
When Table 3 is analyzed, it is clearly visible that the scores obtained by the best vertices are much more diversified than in the first scenario. The three leading vertices are still the ones labelled 3, 4 and 2; however, the order of the subsequent two has changed. Vertex 5 is now ranked 4th with a score of 0.5836 (previously 0.5200), followed by vertex 12, now scored 0.5392 (previously 0.5400). Vertex 24 remained in position 6; however, it is now followed by vertex 6, scored 0.4741, which in the previous scenario was ranked 12th with a score of 0.4000. A detailed analysis of the differences between the ranks obtained by vertices in scenarios 1 and 2 is presented in Figure 3A. The horizontal axis presents the consecutive ranks of all 1000 vertices of the studied network in scenario 1, whereas the vertical axis shows how these vertices were then ranked in scenario 2. The closer the point representing a vertex is to the diagonal line on the chart, the smaller the change in its rank. It can be observed that, while only small changes in rank occur among the top-ranked vertices, as can be confirmed in Table 3, vertices further down the list can change rank by hundreds of positions.
After the seeds were selected, ten simulations were performed under the same conditions as in the first scenario. The visual representation of the outcomes of the simulations is presented in Figure A4 in Appendix A. In this scenario, the simulations lasted on average 9.1 iterations, that is, 0.5 iteration longer than in the first scenario, and resulted in 435.6 infected nodes (0.4356 coverage, 0.0020 more). Interestingly, the use of two criteria allowed us to increase the coverage in the target group. On average, 52 targeted nodes were infected, that is, 0.4 target coverage, which is 0.0115 more than in the first scenario.
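To make the reported quantities (iterations, global coverage, target coverage) concrete, the following sketch runs a single propagation and computes both coverage values. It assumes an independent-cascade-style process in which each newly infected vertex gets one chance to infect each neighbor with a fixed probability; the propagation model, the function names simulate and coverage and the four-vertex path graph are assumptions for illustration and are not taken from this section.

```python
import random

def simulate(adj, seeds, prob, rng):
    """Run one cascade; return the infected set and the number of rounds."""
    infected = set(seeds)
    frontier = list(seeds)
    iterations = 0
    while frontier:
        newly_infected = []
        for v in frontier:
            for u in adj[v]:
                # Each active vertex gets one attempt per susceptible neighbor.
                if u not in infected and rng.random() < prob:
                    infected.add(u)
                    newly_infected.append(u)
        frontier = newly_infected
        iterations += 1
    return infected, iterations

def coverage(infected, nodes):
    """Fraction of the given nodes that ended up infected."""
    nodes = set(nodes)
    return len(infected & nodes) / len(nodes)

# Hypothetical path graph 0 - 1 - 2 - 3 with targets {2, 3}.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
targets = [2, 3]
infected, rounds = simulate(adj, seeds=[0], prob=1.0, rng=random.Random(1))
global_cov = coverage(infected, adj)
target_cov = coverage(infected, targets)
```

Averaging these values over ten runs per seed set would give figures of the kind quoted above; for instance, 52 infected targets out of a target group of 130 corresponds to a target coverage of 0.4.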
4.7. Sensitivity Analysis
As was observed in Section 4.4, Section 4.5 and Section 4.6, the evaluation score of each vertex varied depending on the preference weights of the evaluation criteria, resulting in differences in the obtained rankings and in diverse sets of initial seeds for the information propagation campaign. The MCDA methodological foundations of the proposed approach allow a sensitivity analysis of the obtained rankings to be performed, and thus make it possible to recognize how changes in the criteria preferences affect the final rankings and, in turn, the selected seeds.
In this section, a sensitivity analysis for the seed selection problem for the studied network is presented. For clarity, the subset of analyzed vertices was limited to the ones which were selected as seeds in any of the scenarios 1–3. This resulted in a subset comprising a total of 63 vertices: 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 14, 16, 17, 18, 19, 20, 21, 24, 26, 29, 30, 33, 34, 36, 37, 40, 42, 45, 47, 48, 49, 53, 55, 56, 57, 59, 65, 69, 74, 82, 93, 97, 101, 103, 104, 113, 116, 122, 130, 135, 143, 151, 152, 153, 170, 172, 174, 185, 195, 238, 295, 464.
In order to perform the sensitivity analysis, the weights of all criteria were first set to 1. Then, the weight of each criterion in turn was set to 1, 25, 50, 75 and 100, while the weights of the remaining criteria were kept unchanged. Afterwards, the base level of all criteria was increased to 25, and each criterion was tested again with the weights of 1, 25, 50, 75 and 100, while the remaining criteria were kept at the base level. The same was then repeated for the base levels of 50 and 75. For each combination of weights, the TOPSIS method was used to compute a ranking. The scores and ranks of each of the 63 studied vertices were stored and plotted afterwards. The plots representing the changes in the score of each vertex are presented in Figure 4. The changes in ranks are presented in Figure 5.
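The weight schedule described above can be enumerated with a short generator. This is a sketch under the assumption that the four base levels correspond to the chart rows A to D and the eight criteria to the chart columns 1 to 8; the names LEVELS, STEPS and sensitivity_weight_sets are hypothetical.

```python
LEVELS = [1, 25, 50, 75]       # base weight shared by the non-varied criteria
STEPS = [1, 25, 50, 75, 100]   # weights tried for the varied criterion
N_CRITERIA = 8

def sensitivity_weight_sets():
    """Yield (base level, varied criterion index, weight vector) triples."""
    for base in LEVELS:
        for criterion in range(N_CRITERIA):
            for weight in STEPS:
                weights = [base] * N_CRITERIA
                weights[criterion] = weight
                yield base, criterion, weights

schedule = list(sensitivity_weight_sets())
```

Each weight vector would then be passed to the TOPSIS evaluation, and the resulting scores and ranks of the 63 studied vertices recorded. Note that some vectors repeat within a base level (for example, all weights equal to 1), so the number of distinct rankings is smaller than the 160 entries of the schedule.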
The analysis of Figure 4A shows how each of the criteria supports or conflicts with individual vertices. This is particularly clear because, while the weight of each criterion is increased in the range 1–100, the weights of the remaining criteria are locked at the level of 1. Chart A8 demonstrates that vertex 3, which was the leading one in all three exemplary scenarios, can in some cases be outranked by other vertices. If the weight of criterion C8 (elderly degree) were increased to 25, while the weights of the other criteria remained negligible at the value of 1, the score of vertex 3 would drop below 0.8 and it would be ranked 3rd. However, if the weights of the other criteria were levelled at 25, the vertex would be the leader again, unless the weight of criterion C8 was increased close to 100. In that case, vertex 3 would be ranked second.
Similarly, as can be observed in chart A5, if the weight of criterion C5 (age) was increased while the other weights remained at 1, the score of vertex 3 would fall very quickly, down to a level of approximately 0.2. However, with higher weights of the other criteria, the drop would be smaller, with the score falling only to about 0.8 (B5) or even 0.9 (C5, D5).
An interesting observation can be made looking at charts A1–A8. As was seen in Table 2 in Section 4.4, many vertices obtained the same score, and therefore their rank could vary. During the sensitivity analysis, this resulted in the plots for multiple vertices being superimposed on one another. For example, on chart A1, only vertices 3, 4, 2, 12 and 5 can be located easily, while the remaining vertices are stacked together on the chart.
Because criterion C1 is based on the degree centrality measure, the vertices’ plots cluster in multiple score groups, corresponding to the large but finite set of degree values occurring in the studied network. On the other hand, because criteria C7 and C8 are based on the degrees within less numerous social groups (mid-aged and elderly), the possible values of the degree measure are more limited in this case and, therefore, there are fewer possible score values, which can be observed on charts A7 and A8. In the case of chart A2, it can be observed that if the vertices are appraised based on criterion C2 (sex), where only two values are possible, the vertices cluster in two groups. Since both sexes are distributed roughly evenly in the studied network, both groups of vertices’ plots on the chart are similar in size. For criterion C5, only two values are possible as well, so the vertices are also plotted in two groups. However, because only about a quarter of the studied network is in the targeted mid-aged group, a clear disproportion between the groups of plots can be observed on chart A5.
Whilst in the case of Figure 4 the values on the vertical axis were limited to the range from 0 to 1 and multiple vertices were allowed to have the same value, in the case of Figure 5 each value can be assigned to only a single vertex at a time. As was mentioned earlier, the set of analyzed vertices is limited to 63 for readability. The charts in Figure 5 are scaled to show ranks from 1 (best) to the worst one obtained by any of the 63 studied vertices. It is important to reiterate that each of the 63 studied nodes was in the group of the 50 best vertices in at least one of the scenarios described above. It is therefore surprising to observe that chart C1 ends at about rank 120, obtained by the worst vertex 130, and that chart A6 ends around rank 600 for vertices 104 and 130. These observations emphasize the importance of the proper selection of seeds for information spreading campaigns in social networks.
4.8. Full Range Analysis
The empirical study was concluded by performing a comprehensive set of 65,610 simulations based on the full range of the seed selection preference weights. For each of the eight decision criteria, the weights of 1, 50 and 100 were assigned. That resulted in 3^8 = 6561 possible sets of criteria preference weights and, consequently, 6561 sets of seeds, for each of which ten simulations under invariable conditions were performed. The results of the 65,610 simulations were then stored and aggregated for further analysis.
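The size of this experiment follows directly from enumerating every combination of the three weight values over the eight criteria, as the short sketch below shows. The variable names are hypothetical, and the evaluation and simulation steps themselves are omitted.

```python
from itertools import product

WEIGHT_VALUES = [1, 50, 100]
N_CRITERIA = 8
RUNS_PER_SEED_SET = 10

# Every possible assignment of one of the three weights to each criterion.
weight_sets = list(product(WEIGHT_VALUES, repeat=N_CRITERIA))
total_simulations = len(weight_sets) * RUNS_PER_SEED_SET
```

Each tuple in weight_sets would be used once to rank the vertices and select the seeds, after which the ten simulations for that seed set are run.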
For the studied synthetic network, the highest number of infected vertices was reached for the seeds indicated by rankings based on high weights of the C5 (age) criterion and negligible weights of the other criteria. It was equal to 459.7 infected nodes, that is, 0.4597 coverage. For such scenarios, on average 61.3 targeted nodes were infected, that is, 0.4715 coverage of the targets.
On the other hand, the highest coverage within the targeted nodes was achieved in the simulations originating from the rankings produced by the scenarios in which high weight values were assigned to criteria C2 (sex) and C5 (age). On average, 75.8 targeted nodes were infected in these simulations, that is, 0.5831 coverage of the targets. For these scenarios, on average 458.6 vertices were infected, that is, 0.4586 coverage. This substantial increase in the count of the infected targets might be caused by the fact that, for this scenario, all seeds were part of the target group themselves (resulting in, on average, 25.8 non-seed targets infected, i.e., 0.1985 of the targets), whereas in the scenario described in Section 4.6, only 5 of the initial seeds were from the target group (resulting in, on average, 47.7 non-seed targets infected, i.e., 0.3669 of the targets).
All in all, the simulation results have shown that the multi-attribute seed selection approach proposed in this paper, at the cost of reducing the coverage of the studied network by 0.0011, allowed us to increase the coverage within the targeted nodes by 0.1116 compared to the approach oriented on maximizing the global network coverage.