Article
Peer-Review Record

Optimal Multiculture Network Design for Maximizing Resilience in the Face of Multiple Correlated Failures

Appl. Sci. 2019, 9(11), 2256; https://doi.org/10.3390/app9112256
by Yasmany Prieto 1, Nicolás Boettcher 1,2, Silvia Elena Restrepo 3 and Jorge E. Pezoa 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 18 March 2019 / Revised: 20 May 2019 / Accepted: 24 May 2019 / Published: 31 May 2019
(This article belongs to the Section Electrical, Electronics and Communications Engineering)

Round  1

Reviewer 1 Report

Research methodology section should be introduced in this paper to connect with the proposed algorithms that you have in section 4.  A case study to test the model based on real data can be added.


Author Response

First, we note that Reviewer 2 of the submitted manuscript suggested including a formal definition of “Reliability.” The lack of a precise definition for such a technical term unfortunately confused some readers. We followed his/her suggestion, and during the formalization process we decided to also formalize other terms and ideas associated with “Reliability.” As a result, we realized that the assessment of network infrastructure we have carried out is more precisely defined as “Resilience,” in the sense of Sterbenz et al., “Evaluation of network resilience, survivability, and disruption tolerance: analysis, topology generation, simulation, and experimentation,” Telecommunication Systems, 2013, vol. 52, no. 2, pp. 705-736 (Reference [14] in the revised document).

Consequently, we replaced the word "Reliability" with the word "Resilience" in the title of the revised paper and throughout the revised manuscript. In addition, in the revised manuscript we formally defined the following technical terms: Resilience, Shared Risk Node Groups (SRNGs), Correlated Failures, and monoculture and multiculture technologies. All these definitions, as well as all the changes carried out during the revision process, have been highlighted in blue in the new version of the manuscript.

Reviewer 1 comments

Comment 1: "Research methodology section should be introduced in this paper to connect with the proposed algorithms that you have in section 4."

Our answer: We thank the reviewer for his/her comment. In the revised manuscript we have introduced, at the beginning of Section 4, a summary of the methodology and materials used in our research work. We also formally defined our correlated failure model and what we mean by network resilience.

Comment 2: "A case study to test the model based on real data can be added."

Our answer: We note that our evaluations already use real data. First, the network topologies assessed in the paper correspond to actual networks taken from The Internet Topology Zoo website (Reference [54] in the revised manuscript). This website is "a store of network data created from the information that network operators make public" [54]. In addition, the risk matrices, the number of risks, and the number of technologies used here have been taken from Router Security (Reference [24] in the revised paper) and the Rapid7 Exploit Database (Reference [25] in the revised paper), which are websites that store information about shared risks and exploits affecting network nodes. Thus, we respectfully believe that we have used real data in our evaluations.

Summary of changes

We have fixed typos throughout the revised document.

We have replaced the word "reliability" with "resilience" in the paper title to better reflect the work we have carried out.

We have replaced the word "reliability" with "resilience" throughout the revised document to define more precisely the work we have carried out.

We have included in the revised abstract the concept of Shared Risk Node Groups (SRNGs), which is the basis of the failure model we have used.

We have also included some numerical results in the revised abstract.

In the revised manuscript we have rewritten some parts of the introduction to improve the clarity of the presentation. We have also included the concept of SRNGs in this section.

We have extensively revised Section 2, titled "Related work." We included new references to deliver a broader literature review.

Sections 3 and 4 were reorganized entirely in the revised manuscript to present the paper more precisely.

Section 3 now includes precise definitions for monoculture and multiculture technology networks. The section title was also changed to "Rationale."

The beginning of Section 4 in the revised paper includes a summary of the methodology we used to carry out our work. Also, for clarity of the presentation, we list the research questions guiding our work.

In the revised paper, Section 4.1 includes a formal definition of the failure model we employed. In particular, we included definitions for SRNGs, probabilistic SRNGs and their corresponding probabilities, correlated failures, and resilience.

Some parts of the text in Section 4.2 were revised to introduce the definition of SRNGs. In the same section, we also introduced a formula for computing the failure probabilities associated with the SRNGs.

Section 4.4 of the revised paper was thoroughly revised to introduce the concept of SRNGs, and their associated properties, in the resilience metrics. We also commented that the number of connected components metric is related to the All Terminal Reliability figure used in the literature.

Section 4.5 was also revised to better describe the network topologies used in our assessments.

Section 5, which presents our results, includes new computations, which appear in the last two columns of Table 1 and in the new Figure 13. In addition, Figure 14 of the revised manuscript now shows all the multiculture-oriented redesigned topologies instead of only the three topologies shown in the previous version of the paper. We also included more comments, not only on the new results but also on the previously reported findings.

In the Conclusion section we also included some numerical figures.

We updated the abbreviation list accordingly.


Reviewer 2 Report

This article investigated an optimal multiculture network design for maximizing reliability in the face of massive correlated failures. However, this article was quite descriptive, and its main contents did not reflect the title and keywords. I am concerned that this article is of limited value to readers because of the following points.

1. This article should give a precise definition or estimation function for Multiculture Network Reliability, and also demonstrate why such Multiculture Networks need to maximize reliability.
2. This article did not design any scenario of correlated failures, especially correlated attacks. It is important to show readers that a robust network can withstand different kinds of failures and attacks.
3. The networks used in the article are oversimplified, containing only six nodes and seven edges. The authors should compare several classic and large-scale networks, such as regular networks, random networks, and small-world networks.
4. The experimental results were also oversimplified, and there is no real implication that helps readers understand the results in a real case.

Other comments:
1. Figure 2: the authors did not explain the meaning of r1, r2, and p1, p2.
2. Figure 2: the authors did not explain why the red box presented the optimal technology set.
3. Page 8: the authors did not explain the abbreviation “ATTR” when it is first used in Section 3.6.1.
4. Page 9: I did not understand the transformation process of the compatibility matrix.
5. Page 10: equation (13) was the same as equation (1).
6. Page 10: the authors did not explain why they employed a GA to solve the problem in Section 4.2, or why they used an integer chromosome rather than the classic binary chromosome.
7. So, I wonder how a mutation is performed on the integer chromosome in this article.
8. Page 12: the last paragraph of Page 12 was almost the same as the contents of Page 10.
9. Figure 7: it was quite similar to Figure 5.


Author Response

Comment: "This article investigated an optimal multiculture network design for maximizing reliability in the face of massive correlated failures. However, this article was quite descriptive, and its main contents did not reflect the title and keywords. I am concerned that this article is of limited value to readers because of the following points.

1. This article should give a precise definition or estimation function for Multiculture Network Reliability, and also demonstrate why such Multiculture Networks need to maximize reliability."

Our answer: We truly thank the reviewer for his/her excellent comment. We noticed that we did not include a formal definition of “Reliability,” and, unfortunately, the lack of a formal definition for such a technical term does confuse readers. During the formalization of the term “Reliability” we realized that the assessment of network infrastructure we have carried out is more precisely defined as “Resilience,” in the sense of Sterbenz et al., “Evaluation of network resilience, survivability, and disruption tolerance: analysis, topology generation, simulation, and experimentation,” Telecommunication Systems, 2013, vol. 52, no. 2, pp. 705-736 (Reference [14] in the revised document). Consequently, we replaced the word "reliability" with the word "resilience" in the paper title and throughout the revised manuscript. Moreover, after formally defining the term “Resilience,” we considered that the presentation of the paper would benefit greatly if other associated terms and ideas were formalized as well. Thus, we introduced rigorous definitions for the technical terms Shared Risk Node Groups (SRNGs) and Correlated Failures, which are the basis of the failure model we employed in our work. In addition, we included definitions for the concepts of monoculture and multiculture technologies. All these changes led us to redefine some quantities, such as the Average Two Terminal Reliability, in terms of the above-mentioned formalizations.

Comment: "2. This article did not design any scenario of correlated failures, especially correlated attacks. It is important to show readers that a robust network can withstand different kinds of failures and attacks."

Our answer: The focus of our paper is on network resilience analysis. We modeled correlated failures abstractly, as different possible stressors, but we did not focus on studying different correlated failure scenarios. Regardless of this issue, we believe that our network resilience study is complete.

Comment: "3. The networks used in the article are oversimplified, containing only six nodes and seven edges. The authors should compare several classic and large-scale networks, such as regular networks, random networks, and small-world networks."

Our answer: We respectfully disagree with the reviewer. Only the topology in Figure 3 contains six nodes and seven edges, and that network was used only to explain our rationale. The network topologies assessed in the paper correspond to actual networks taken from The Internet Topology Zoo website (Reference [54] in the revised manuscript). This website is "a store of network data created from the information that network operators make public" [54]. These network topologies are commonly used as benchmarks in the data network research community. Moreover, networks with different node degrees were selected to study their effect on the multiculture network design, and an actual Chilean network, which connects all the universities in Chile, was also analyzed. As mentioned, our analysis focused on real networks (regular networks), but from Figure 8 in the revised paper it can be noted that the Abilene and Gridnet topologies are examples of random networks.

Comment: "4. The experimental results were also oversimplified, and there is no real implication that helps readers understand the results in a real case."

Our answer: We respectfully disagree with the reviewer. Our assessments used not only real-world topologies; the risk matrices, the number of risks, and the number of technologies used in our work were also taken from Router Security (Reference [24] in the revised paper) and the Rapid7 Exploit Database (Reference [25] in the revised paper). These websites focus on storing information about shared risks and exploits affecting real network nodes. Thus, we respectfully believe that our experimental evaluations are not oversimplified and that they do carry implications that help readers understand the results in real cases.

Comment 5: "Other comments: 1. Figure 2: the authors did not explain the meaning of r1, r2, and p1, p2."

Our answer: In the revised manuscript we explained these symbols. We thank the reviewer for pointing out this issue.

Comment 6: “2. Figure 2: the authors did not explain why the red box presented the optimal technology set.”

Our answer: In the revised manuscript we explained that the optimal set was found using exhaustive search.

Comment 6: “3. Page 8: the authors did not explain the abbreviation “ATTR” when it is first used in Section 3.6.1.”

Our answer: In the revised manuscript we fixed this problem.

Comment 7: “4. Page 9: I did not understand the transformation process of the compatibility matrix.”

Comment 8: “5. Page 10: equation (13) was the same as equation (1).”

Our answer to comments 7 and 8: Each element of the Compatibility Matrix indicates whether a pair of technologies simultaneously satisfies two constraints: the technologies do not share more than one risk, and they are able to communicate with each other. So, C_ij = 1 if technologies i and j meet both constraints jointly, and C_ij = 0 otherwise. Equations (1) and (13) are the same because the optimization problem was transformed into an equivalent formulation, which differs from the original one in that the constraints are now imposed by the Compatibility Matrix.
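
As an illustration only (this sketch is not taken from the manuscript), the compatibility rule described in this answer could be computed as follows in Python. The binary risk matrix `risks`, the binary interoperability matrix `interop`, and the function name `compatibility_matrix` are hypothetical names introduced here. The answer does not say how a technology paired with itself is treated, so the diagonal is simply left at 0.

```python
from typing import List


def compatibility_matrix(
    risks: List[List[int]],
    interop: List[List[int]],
    max_shared: int = 1,
) -> List[List[int]]:
    """Build C with C[i][j] = 1 when technologies i and j are compatible.

    risks[i][k] == 1 means technology i is exposed to risk k (assumed layout).
    interop[i][j] == 1 means technologies i and j can communicate (assumed layout).
    """
    n = len(risks)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # diagonal left at 0: self-pairs are not described in the text
            # Count the risks that both technologies are exposed to.
            shared = sum(a & b for a, b in zip(risks[i], risks[j]))
            # Both constraints must hold jointly.
            if shared <= max_shared and interop[i][j] == 1:
                C[i][j] = 1
    return C
```

For example, with three technologies whose risk rows are `[1, 1, 0]`, `[1, 0, 1]`, and `[1, 1, 1]`, only the first two share at most one risk, so only that pair can be marked compatible, and only if the two can also communicate.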

Comment 9: “6. Page 10: the authors did not explain why they employed a GA to solve the problem in Section 4.2, or why they used an integer chromosome rather than the classic binary chromosome.”

Our answer: Genetic Algorithms are a widely used technique for solving ill-structured problems, and consequently there is an extensive literature on using GAs to solve network and integer-variable optimization problems. We used the integer-valued chromosome representation to properly model several technologies in a chromosome. The reviewer is right, however, in pointing out that we could have used a binary representation in our work.

Comment 10: “7. So, I wonder how a mutation is performed on the integer chromosome in this article.”

Our answer: For the mutation process, we randomly select a position in the chromosome. Once the position is selected, we randomly choose one of the other technologies in the set, different from the current technology, and replace the integer value at that position with it. Once again, in the context of the previous comment, had we used a binary representation we would have had to change more than one bit to represent the change in the mutated technology.
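
The mutation step described in this answer can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the authors' implementation: technologies are assumed to be encoded as the integers `0 .. num_technologies - 1`, one gene per node, and the names `mutate`, `chromosome`, and `num_technologies` are hypothetical.

```python
import random
from typing import List, Optional


def mutate(
    chromosome: List[int],
    num_technologies: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return a copy of the chromosome with exactly one gene changed.

    Each gene is assumed to be an integer in [0, num_technologies).
    """
    if num_technologies < 2:
        raise ValueError("at least two technologies are needed to mutate a gene")
    rng = rng or random.Random()
    # Step 1: pick a random position in the chromosome.
    pos = rng.randrange(len(chromosome))
    # Step 2: pick a random technology different from the current one.
    alternatives = [t for t in range(num_technologies) if t != chromosome[pos]]
    mutated = list(chromosome)  # leave the parent untouched
    mutated[pos] = rng.choice(alternatives)
    return mutated
```

With the integer encoding a single gene replacement always yields a valid technology, whereas a binary encoding of the same gene may need several bit flips to move from one technology to another.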

Comment 11: “8. Page 12: the last paragraph of Page 12 was almost the same as the contents of Page 10.”

Comment 12: “9. Figure 7: it was quite similar to Figure 5.”

Our answer to comments 11 and 12: The reviewer is correct. The paragraphs are similar because they describe two GA methodologies that are closely related but necessarily differ, since the “Fair Technology Distribution Problem” and “The Reliable Node Placing Problem” are entirely different problems. The methodologies differ in the genetic operators chosen for each problem. Along the same line of argument, both figures depict the GA workflows, and since both implementations are similar, the resulting workflows also look similar.


Reviewer 3 Report

The paper proposes a series of three sequential optimization problems for maximizing network reliability by considering coordinated attacks as correlated failures.

The abstract is concise but lacks a numerical analysis altogether. The introduction is consistent and coherent. The literature review, however, is very "thin" and discusses the literature at a very surface level. One key suggestion is to improve the reference list and add more technical depth to the literature review section.

Section 3 is detailed enough but lacks a centralized presentation of the proposed optimization problems. A pseudocode or an algorithm would be a major improvement. Some subsections such as 3.6.2 are too short and do not warrant a separate section.

Section 4 has a title of algorithm, which is a stretch, and I recommend renaming this section.

Section 5 should be also expanded by adding more analysis of the findings.

The conclusion is a good summary of each of the proposed optimization problems but lacks any numerical analysis of the findings.

The English needs minor proofreading such as missing punctuation (e.g. In this paper we propose on page 2). 


Author Response

First, we note that Reviewer 2 of the submitted manuscript suggested including a formal definition of “Reliability.” The lack of a precise definition for such a technical term unfortunately confused some readers. We followed his/her suggestion, and during the formalization process we decided to also formalize other terms and ideas associated with “Reliability.” As a result, we realized that the assessment of network infrastructure we have carried out is more precisely defined as “Resilience,” in the sense of Sterbenz et al., “Evaluation of network resilience, survivability, and disruption tolerance: analysis, topology generation, simulation, and experimentation,” Telecommunication Systems, 2013, vol. 52, no. 2, pp. 705-736 (Reference [14] in the revised document).

Consequently, we replaced the word "Reliability" with the word "Resilience" in the title of the revised paper and throughout the revised manuscript. In addition, in the revised manuscript we formally defined the following technical terms: Resilience, Shared Risk Node Groups (SRNGs), Correlated Failures, and monoculture and multiculture technologies. All these definitions, as well as all the changes carried out during the revision process, have been highlighted in blue in the new version of the manuscript.

Comment 1: "The abstract is concise but lacks a numerical analysis altogether."

Our answer: We thank the reviewer for his/her comment. In the revised manuscript we presented a numerical result regarding the relationship between topology connectivity and the multiculture network design. This result implies that both characteristics are necessary for improving network resilience.

Comment 2: "The introduction is consistent and coherent. The literature review, however, is very "thin" and discusses the literature at a very surface level. One key suggestion is to improve the reference list and add more technical depth to the literature review section."

Our answer: We have extensively revised Section 2, titled "Related work." We included new references to deliver a broader literature review. In particular, we added three articles that present a framework to improve network resiliency. These articles include network diversity (multiculture) as an important requirement to avoid “fate sharing” in the presence of correlated failures. These articles validate our investigation and put it into context. Also, six new articles were added to show other applications of network diversity in the context of cyber-security and virus containment.

Comment 3: "Section 3 is detailed enough but lacks a centralized presentation of the proposed optimization problems. A pseudocode or an algorithm would be a major improvement. Some subsections such as 3.6.2 are too short and do not warrant a separate section."

Our answer: We thank the reviewer for his/her comment. Following the reviewer's suggestion, we included in Section 4.2 an algorithm relating the three sequential optimization problems. In addition, Sections 3 and 4 were reorganized entirely in the revised manuscript to present the paper more precisely. For instance, Subsection 3.6.2 no longer exists; its content is now included in Subsection 4.4, titled “Resilience metrics.”

Comment 4: "Section 4 has a title of algorithm, which is a stretch, and I recommend renaming this section."

Our answer: We thank the reviewer for his/her comment. In the revised paper we changed the section title to “Efficient search algorithms based on Transformations and metaheuristics.”

Comment 5: "Section 5 should be also expanded by adding more analysis of the findings."

Our answer: In the revised version of the paper we included additional analysis in the Results section. First, another column was added to Table 1 to support a more insightful discussion of the ATTR metric in the node failure scenario. In Figure 13 we depict the relationship between the average node degree of the topologies and the networks yielded by our design methodology. The conclusion was that both topology connectivity and the multiculture design are necessary to improve network resilience. Also, in Figure 14 we added the results of the designed multiculture networks for all eight topologies; in the previous manuscript, we presented only three topologies to depict the clustering effect.

Comment 6: "The conclusion is a good summary of each of the proposed optimization problems but lacks any numerical analysis of the findings."

Our answer: We thank the reviewer once more for his/her comment. As in the case of the Abstract, in the Conclusion of the revised manuscript we presented a numerical result regarding the relationship between topology connectivity and the multiculture network design.

Comment 7: "The English needs minor proofreading such as missing punctuation (e.g. In this paper we propose on page 2)."

Our answer: We have fixed typos throughout the revised document.


Round  2

Reviewer 2 Report

1. This article should give a precise definition or estimation function for Multiculture Network Reliability, and also demonstrate why such Multiculture Networks need to maximize reliability.
For this comment, I understood the meaning of “resilience”. However, the authors did not answer my question. The authors should explain how to assess, by means of a quantitative estimation, whether a given network is resilient or not, and how to maximize this quantitative estimation (e.g. through optimal selection of the technology set, fair technology distribution, or reliable node placing) so as to maximize the network resilience.

2. This article did not design any scenario of correlated failures, especially correlated attacks. It is important to show readers that a robust network can withstand different kinds of failures and attacks.
For this comment, I have no idea about the meaning of “Massive” and “Correlated”, which were presented in the title. If we demonstrate network resilience in the face of some failures, it depends not only on the network design but also on two conditions: one is the number of failed nodes and the other is the relationships among these failed nodes. Otherwise, this work just demonstrated the network resilience based on the network topology and connectivity, not in the face of massive correlated failures.

3. The networks used in the article are oversimplified, containing only six nodes and seven edges. The authors should compare several classic and large-scale networks, such as regular networks, random networks, and small-world networks.
For this comment, the authors use eight networks from the Internet Topology Zoo. The most complicated networks may be Sprint and Gridnet, which contain only 11 nodes and 9 nodes. Therefore, there are two limitations. First, neither of them is a large-scale network, and node coloring is not an NP-hard problem, so it is unnecessary to employ a GA to solve such problems. Second, all of them are Internet networks, which insufficiently support the results of this work. Because, if this work investigated kinds of multiculture networks, or Internet networks with different technologies.

4. The experimental results were also oversimplified, and there is no real implication that helps readers understand the results in a real case.
For this comment, the authors only demonstrated that when the average degree of a network is less than 2, its ATTR does not equal 1. Actually, network topology does not mean only the network's average degree. It also represents the different connection structures of networks, even when they contain the same number of nodes and edges. This work just investigated a node coloring problem on different Internet networks.
Other comments:
6. Page 10: the authors did not explain why they employed a GA to solve the problem in Section 4.2, or why they used an integer chromosome rather than the classic binary chromosome.
The authors should list all the parameters of the GA, e.g. population size, the kind of crossover operation, and the fitness function, and explain why they chose crossover probability = 0.8 and mutation probability = 0.01.

9. Figure 7: it was quite similar to Figure 5.
The authors should redraw the workflows of Figure 5 and Figure 7 in the proper form, many examples of which can be found on the Internet.

 

Author Response

Comment 1. This article should give a definite definition or estimation function for Multiculture Network Reliability, and also demonstrate why such Multiculture Networks need to maximize the reliability.
For this comment, I understood the meaning of “resilience”. However, the authors did not answer my question. The authors should reply that how to assess a given network if it is resilience or not by using a quantitative estimation, and how to maximize this quantitative estimation, e.g. optimal selection of technology set, fair technology distribution or reliable node placing, so as to maximize the network resilience.

Our answer: The reviewer has a point here because, after examining our revised paper and the answer we gave after the first round of reviews, we see that we did not reply to his/her request in a direct manner. In the previously revised version of the paper, we introduced a definition of resilience on page 7, Definition 6, but only in a qualitative manner. Next, we stated that “…we will assess the resilience of a communication network after the occurrence of a PSRNG event by means of two metrics, which are mathematically defined in Section 4.4.” In fact, such metrics are the quantitative estimations of resilience he/she is requesting. The All Terminal Reliability (ATR) and the Average Two Terminal Reliability (ATTR) are two well-known metrics used in the data networking literature for assessing resilience and reliability.

In the new revised version of the paper, we introduced Definition 7, where we define: (i) whether a network is resilient or not, in the face of correlated failures, in terms of the ATR metric; and (ii) the average degree of resiliency of a network, in the face of correlated failures, in terms of the average ATTR metric.

In addition, we now clearly state on page 10, in the third paragraph of Section 4.2.3, that to maximize the network resilience we introduced into the multiculture network design the number of connected components emerging after a correlated failure, and that minimizing this number is equivalent to maximizing the ATR metric.
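
To make the two quantities above concrete, the following minimal sketch counts the connected components that remain after a correlated failure and derives two scores from them. It is an illustration under stated assumptions, not the formal definitions given in Section 4.4 of the paper: it assumes that a correlated failure removes every node running the same technology, that the ATR-style indicator equals 1 only when the surviving nodes form a single component, and that the ATTR-style score is the fraction of surviving node pairs that can still reach each other. The toy ring, the technology labels, and all function names are hypothetical.

```python
def surviving_components(adjacency, technology, failed_tech):
    """Connected components left after removing every node that runs failed_tech."""
    alive = {n for n in adjacency if technology[n] != failed_tech}
    seen, components = set(), []
    for start in sorted(alive):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # iterative depth-first search restricted to surviving nodes
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(m for m in adjacency[node] if m in alive and m not in comp)
        seen |= comp
        components.append(comp)
    return components


def atr_like(components):
    """Assumed ATR-style indicator: 1 if all surviving nodes are mutually connected."""
    return 1 if len(components) == 1 else 0


def attr_like(components):
    """Assumed ATTR-style score: fraction of surviving node pairs still connected."""
    n = sum(len(c) for c in components)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return 0.0
    connected = sum(len(c) * (len(c) - 1) // 2 for c in components)
    return connected / total_pairs


# Hypothetical 6-node ring 0-1-2-3-4-5-0 with three technologies, two nodes each.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
clustered = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C", 5: "C"}
alternating = {0: "A", 1: "B", 2: "C", 3: "A", 4: "B", 5: "C"}

for name, placement in (("clustered", clustered), ("alternating", alternating)):
    comps = surviving_components(ring, placement, "A")
    print(name, len(comps), atr_like(comps), round(attr_like(comps), 3))
```

In this toy case, losing technology A leaves one component under the clustered placement and two components under the alternating placement, so the same set of technologies yields different scores depending only on where the nodes are placed.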

Comment 2. This article did not design any scenario of correlated failures, especially correlated attacks. It is important to show the readers that network robustness can avoid suffering different kinds of failures and attacks.
For this comment, I have no idea about the meaning of “Massive” and “Correlated”, which were presented in the title. If we demonstrate network resilience in the face of some failures, it depends not only on the network design but should also be related to two conditions: one is the number of failed nodes and the other is the relationships among these failed nodes. Otherwise, this work just demonstrated the network resilience based on the network topology and connectivity, not in the face of massive correlated failures.

Our answer: We thank the reviewer for pointing out the use of the word “massive”. After carefully checking the data networking literature for a precise definition of this word, we realized that the community has used it as a synonym for the word “multiple.” There are two major reasons for this. First, in the networking resiliency literature, the most common assumption is that only a single failure occurs, so “massive” has been used to mean multiple node failures. Second, works modelling multiple failures have followed the approach by Modiano’s group at MIT, where multiple failures are induced by massive events such as earthquakes and hurricanes. After realizing this issue, we replaced the word “massive” with “multiple” throughout the entire manuscript for clarity and precision. Regarding the definition of “correlated,” we refer the reviewer to Definitions 3, 4, and 5 in Subsection 4.1 of the previous version of the revised paper, where we formally defined the terms.

As the reviewer states, network resilience in the face of failures is related to the number of failed nodes and the relationships among such nodes. When we introduced the definition of correlated failures in Subsection 4.1 of the previous version of the revised paper, we formally defined that several nodes fail simultaneously because they share some common risk, and it is the occurrence of such a risk that triggers a multi-node failure. In our work, we have not stated that the network resilience depends on the network design, as the reviewer suggests; we have stated that we can smartly design a network to cope with some types of failures. On the other hand, it is a well-known result in Network Science that the network topology and its connectivity strongly impact resilience. What we have done in our work is to use this fundamental result and engineer the network design, modifying only the nodes but not the links, such that connectivity is preserved as much as possible when multiple nodes fail.

Comment 3. The networks used in article are oversimple, which only contained six nodes and seven edges. The authors should compare several classic and also large-scale networks, such as regular networks, random networks and small-world networks. 
For this comment, the authors use eight networks from the Internet Topology Zoo. The most complicated networks may be Sprint and Gridnet, which contain only 11 nodes and 9 nodes, respectively. Therefore, there are two limitations. First, either of them is the larger-scale network and nodes coloring is not a NP-hard problem, so that it is unnecessary to employ GA to solve such problems. Second, all of them are Internet network, which insufficiently support the results of this work. Because, if this work investigated kinds of multiculture networks, or Internet networks with different technologies.

Our answer: First, we would like to clarify that the scope of our paper spans only data communication networks such as IP routing networks (Internet networks as the reviewer states) or optical networks. As introduced in Definitions 1 and 2 in Section 3 of the previous version of the revised paper, monoculture and multiculture networks refer to data communication networks and not to any general type of network. In fact, these definitions are related later to the network constants (communication protocols) and shared risks (operating systems and hardware implementations). For clarity, in the new revised version of the paper we included the word “data” in “communication network” as part of the definition. Thus, we respectfully disagree with the reviewer's comment "Second, all of them are Internet network, which insufficiently support the results of this work" because our findings and results apply to data communication networks.

Second, the network topologies assessed in our paper correspond to actual networks, which the data network research community employs as benchmarks in research papers. Thus, the scale and number of nodes used in our work were selected in agreement with what data network researchers do.

Third, we respectfully yet strongly disagree with the reviewer's comment "either of them is the larger-scale network and nodes coloring is not a NP-hard problem, so that it is unnecessary to employ GA to solve such problems." The node coloring problem is in fact an NP-hard problem that is well documented in the literature. Therefore, when researchers tackle NP-hard problems, efficient solving algorithms must be provided regardless of the scale of the examples one uses. Of course, for small-scale examples we can always use an exhaustive search approach, but not providing an algorithm for solving the optimization problem would leave the work incomplete. Thus, using GAs and the reduction method to a clique problem is absolutely necessary.
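
The scale argument can be illustrated with a short sketch. It uses the classic proper-coloring constraint (adjacent nodes receive different colors) as a stand-in, which is an assumption and not the objective function optimized in the paper, and the function names and the 5-node cycle are hypothetical. With k candidate technologies and n nodes, exhaustive search may have to examine up to k^n assignments, which is manageable for an 11-node topology but grows exponentially with n.

```python
from itertools import product


def brute_force_coloring(edges, n_nodes, k):
    """Return the first proper k-coloring found by exhaustive enumeration, or None."""
    for assignment in product(range(k), repeat=n_nodes):
        if all(assignment[u] != assignment[v] for u, v in edges):
            return assignment
    return None


def search_space(n_nodes, k):
    """Number of candidate assignments that exhaustive search may have to visit."""
    return k ** n_nodes


# Hypothetical 5-node cycle: an odd cycle cannot be properly colored with 2 colors.
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
two_colors = brute_force_coloring(cycle5, 5, 2)
three_colors = brute_force_coloring(cycle5, 5, 3)

print(two_colors is None, three_colors is not None)
print(search_space(11, 3), search_space(50, 3))
```

For 3 technologies, an 11-node network has 177147 candidate assignments, whereas a 50-node network already has more than 10^23, which is why a heuristic such as a GA becomes the practical choice as the network grows.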

Comment 4. The experimental results were also oversimple, and there is also no real implication for readers to understand the results in real case.
For this comment, the authors only demonstrated when the average degree of network is less than 2, its ATTR not equals 1. Actually, network topology does not mean only the network average degree. It also represents the different connection structures of networks, even when they contain the same number of nodes and edges. This work just investigated a node coloring problem on different Internet networks.

Our answer: As we stated in the previous answer, the scope of our paper spans only data communication networks;  our findings and results apply to them. So, the reviewer is right when he/she states that "This work just investigated a node coloring problem on different Internet networks." This is what we have done and we have not claimed that our work may be applied to more general network scenarios. 

We respectfully disagree with the reviewer's comment "the authors only demonstrated when the average degree of network is less than 2, its ATTR not equals 1." This comment is very simplistic, the statement has no context to support this claim, and it takes into account neither our research work nor the ideas of monoculture or multiculture. In a monoculture, an average node degree close to n-1, with n the number of network nodes, may easily yield an ATTR less than one when multiple failures affect the homogeneous nodes.

In our work, we have done and found much more than what the reviewer claims in his/her comment. The key finding in our work is that we can engineer the network design to cope with multiple failures using the idea of a multiculture network that accounts for different types of shared risks. Besides, in Section 5, we have shown results, for different types of network topologies with different average node degrees, that support our resilient multiculture network design under our definition of correlated failures. In Section 5.1, we showed that when dealing with technologies that, on average, present more vulnerabilities, we can find a lower number of vulnerability-disjoint technologies. As a practical result, when network engineers can only acquire this kind of technology, it is of little use to acquire many different vendor technologies, since they will share risks. In other words, technology diversity does not directly imply shared risk diversity. In Section 5.2, we showed the relationship between the number of selected technologies and their risk index. Also, we assessed the impact of the maximum CAPEX available as a factor in the technology distribution. In practice, this implies that an undersized CAPEX will strongly impact the ability of the network to address resilience. However, the necessary CAPEX for fair distribution is bounded, and beyond that point, extra expenditures are of no use. In Section 5.3, we showed the results of the proposed node allocation method. We use the ATTR and the number of post-failure connected components to assess the results of the method. In Fig. 14, we depict technology clustering as a collateral result that improves connectivity under the failure model we are dealing with. Finally, we illustrate how more connected networks (based on the average degree) benefit more from multiculture design than less connected ones. Average degree is one number, among others, that describes network connectivity.
Other comments:
Comment 6. Page 10: the authors did not explain why they employ a GA to solve the problem in Section 4.2. Furthermore, why did they use an integer chromosome rather than the classic binary chromosome?
The authors should list all the parameters of the GA, e.g. population size, the kind of crossover operation, and the fitness function, and explain why they chose crossover probability=0.8 and mutation probability=0.01.

Our answer: We thank the reviewer for the comment. In the new revised version of the paper, we have included the following paragraph that lists all the GA parameters. 

For the Optimal Fair Technology Distribution Problem in Section 4.3.2: "Regarding the GA operators, we followed standard guidelines from GA theory to set the algorithm parameters at recommended values. Thus, population size is set to 500 chromosomes. We employ the single point crossover, with a probability of 0.8 for executing the operation, as the crossover operator. For mutations, one position of the chromosome is selected randomly, and its value is changed, with a probability of 0.01, to one of the other technologies available in the design. For selection, we used the fitness proportional selection, implemented by a roulette wheel."
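
As an illustration only, the operators listed in the quoted paragraph can be sketched as follows. The sketch assumes an integer chromosome in which each gene holds the index of a technology, and it takes the fitness function as a caller-supplied placeholder because the paper's fitness function is not reproduced in this record. All function names are hypothetical; only the population size (500), the crossover probability (0.8), and the mutation probability (0.01) come from the text.

```python
import random

POP_SIZE = 500      # population size stated in the response
P_CROSSOVER = 0.8   # probability of executing single-point crossover
P_MUTATION = 0.01   # probability of mutating one randomly chosen gene


def roulette_select(population, fitnesses, rng):
    """Fitness-proportional selection; assumes non-negative, not-all-zero fitnesses."""
    return rng.choices(population, weights=fitnesses, k=1)[0]


def single_point_crossover(a, b, rng):
    """Swap the tails of two parents at one random cut point, with P_CROSSOVER."""
    if len(a) < 2 or rng.random() >= P_CROSSOVER:
        return a[:], b[:]
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def reset_mutation(chromosome, technologies, rng):
    """With P_MUTATION, replace one random gene by one of the other technologies."""
    child = chromosome[:]
    if rng.random() < P_MUTATION:
        pos = rng.randrange(len(child))
        others = [t for t in technologies if t != child[pos]]
        if others:
            child[pos] = rng.choice(others)
    return child


def next_generation(population, fitness, technologies, rng):
    """Build one new generation of the same size using the three operators above."""
    fitnesses = [fitness(c) for c in population]
    offspring = []
    while len(offspring) < len(population):
        a = roulette_select(population, fitnesses, rng)
        b = roulette_select(population, fitnesses, rng)
        for child in single_point_crossover(a, b, rng):
            offspring.append(reset_mutation(child, technologies, rng))
    return offspring[:len(population)]


# Hypothetical usage: 3 technologies, 6 genes, and a placeholder fitness function.
rng = random.Random(0)
technologies = [0, 1, 2]
population = [[rng.randrange(3) for _ in range(6)] for _ in range(POP_SIZE)]
population = next_generation(population, lambda c: 1 + len(set(c)), technologies, rng)
```

The placeholder fitness merely rewards chromosomes that use more distinct technologies; it stands in for the paper's objective and should not be read as its definition.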

For the Optimal Reliable Node Distribution Problem in Section 4.3.3: "Regarding the GA operators, we followed standard guidelines from the GA theory to set the algorithm parameters at recommended values. Thus, population size is set to 500 chromosomes. We employ the first order crossover, with a probability of 0.8, as the crossover operator for chromosomes. For mutations, swap mutation, which exchanges the value of two randomly chosen positions in the chromosome, was selected with a probability of mutation of 0.01. For selection, the fitness proportional selection was chosen again as in the fair technology distribution problem. "
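
The two permutation operators named in the quoted paragraph can be sketched in the same spirit. The sketch interprets "first order crossover" as the classic order crossover (OX1) and assumes that a chromosome is a permutation of distinct node identifiers; both points are assumptions, since the record does not spell out the encoding. The function names are hypothetical, and the default probabilities are the values quoted above.

```python
import random


def order_crossover(p1, p2, rng, p_crossover=0.8):
    """Order crossover (OX1): keep a slice of p1, fill the rest in the order of p2."""
    n = len(p1)
    if n < 2 or rng.random() >= p_crossover:
        return p1[:]
    i, j = sorted(rng.sample(range(n), 2))
    child = [None] * n
    child[i:j + 1] = p1[i:j + 1]
    kept = set(child[i:j + 1])
    # Read p2 starting just after the slice, wrapping around, skipping kept genes.
    fill = [g for g in p2[j + 1:] + p2[:j + 1] if g not in kept]
    positions = list(range(j + 1, n)) + list(range(0, i))
    for pos, gene in zip(positions, fill):
        child[pos] = gene
    return child


def swap_mutation(chromosome, rng, p_mutation=0.01):
    """With p_mutation, exchange the values at two randomly chosen positions."""
    child = chromosome[:]
    if len(child) >= 2 and rng.random() < p_mutation:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child


# Hypothetical usage on two 8-node permutations.
rng = random.Random(1)
parent_a = list(range(8))
parent_b = list(reversed(range(8)))
child = swap_mutation(order_crossover(parent_a, parent_b, rng), rng)
```

Both operators return a rearrangement of the parent's genes, so every node identifier still appears exactly once in the child, which is the property that makes them suitable for placement-type encodings.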

Comment 9. Figure 7: it was quite similar to Figure 5.
The authors should rewrite the work-flow of Figure 5 and Figure 7 in the proper form; many examples can be found on the Internet.

 Our answer: We thank the reviewer for the comment. We followed his/her suggestion and we have written Algorithms 2 and 3 in a different fashion so that they do not look similar.

Summary of changes

We have fixed the typos found throughout the revised document.

We have replaced the word "massive" with "multiple" in the paper title to better reflect the work we have carried out.

We have replaced the word "massive" with "multiple" throughout the revised document to define more precisely the work we have carried out.

We introduced the adjective “data” in Definitions 1 to 7 to clarify that our work is related to data communication networks.

We introduced Definition 7 on page 7, Section 4.1, to quantitatively define network resilience.

We have written Algorithms 2 and 3 in a different fashion from the previous version of the paper so that they do not look similar.

Lastly, we comment that all the changes in revision 1 are marked in blue while changes after revision 2 are in purple.


Round  3

Reviewer 2 Report

This version of the paper has a great improvement, and I am satisfied with the authors' reaction to my comments.
