Capitalizing on the power of machine learning, we introduce AliasClassifier, a novel alias parser that recasts the router alias-resolution problem as a classification task. It treats aliased IP pairs as positive samples and non-aliased IP pairs as negative samples, and trains on the features detailed in the previous section to judge whether an IP pair is a potential alias. Beyond classification, AliasClassifier offers a set of aggregation rules for potentially aliased IP pairs. These rules form the basis of the alias triangle aggregation algorithm, which further reduces the number of misclassified alias IP pairs and thereby improves the accuracy of router alias resolution.
4.3. Alias Classification Module
The Alias Classification Module is the pivotal stage of the process, dedicated to discerning potential aliased IP pairs after rigorous filtering. These identified aliased IP pairs serve as the foundation for the subsequent step. However, different classifiers perform differently, leading to variations in efficiency and effectiveness when classifying the same set of samples. For router alias resolution in a network containing n IP addresses, the algorithm's overall time complexity is O(n^2), since IP pairs are used as input data. The selection of a classifier model for alias resolution is therefore a crucial decision.
During feature analysis we found that most of the feature data exhibit heavy-tailed characteristics and discrete values. These findings motivated the criteria for choosing a classifier model. Specifically, the selected model must be lightweight to ensure swift classification and thus efficient alias resolution, and priority is given to models adept at handling heavy-tailed and discrete data. Based on these considerations, four classifier models were chosen for experimental training: Naive Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), and Random Forest (RF).
During the training phase of the alias-resolution classifier, the first step is to acquire both the alias IP dataset and the non-alias IP dataset from the public data source ITDK, as described in Section 2.2. The data collection module is then deployed to gather the relevant feature data for the sample set, and the designated classification features are used to construct a feature vector for each sample. The assembled training set is fed into the chosen machine-learning model for training. The classifier's efficacy is then assessed on the test set, providing a pragmatic appraisal of its operational performance.
In our experimental setup, we employed bootstrap ("self-help") sampling to create training and test sets from the ground-truth collection, yielding a total of five bootstrap samples. Consequently, each of the four classifiers (NB, SVM, DT, and RF) was subjected to the alias-resolution task five times. To assess performance accurately, we evaluated each classifier with a comprehensive set of metrics: precision, recall, F0.5 score, F1 score, F2 score, and parsing time per 100,000 IP pairs.
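As a sketch of this protocol, bootstrap resampling can be implemented in a few lines. The function name and the out-of-bag test-set convention below are our illustrative assumptions, not the paper's implementation:

```python
import random

def bootstrap_split(n, seed):
    """Draw n sample indices with replacement for training;
    the out-of-bag indices (never drawn) form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    test = sorted(set(range(n)) - set(train))  # out-of-bag indices
    return train, test

# Five replicates, matching the five evaluations per classifier in the text.
splits = [bootstrap_split(1000, seed) for seed in range(5)]
```

On average roughly 36.8% of the samples end up out-of-bag in each replicate, which gives each classifier a disjoint test set per run.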
We assess the classification accuracy of the classifier models with the precision metric, evaluate their capability to detect potential alias IPs with the recall metric, and gauge the parsing efficiency of each model by measuring the parsing time per 100,000 IP pairs. We then ascertain the comprehensive efficacy of each classifier model using the F-beta index, which provides a holistic evaluation of classifier performance by combining precision and recall. Specifically, the F1 score weights precision and recall equally, the F2 score gives recall twice the weight of precision, and, conversely, the F0.5 score assigns recall half the importance of precision, reflecting a more precision-oriented assessment. The detailed results of this evaluation are presented in Table 4, with a description of the evaluation metrics provided in Appendix A for reference. The classifier model leading in each metric is shown in bold.
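For reference, the three scores are instances of a single formula; the snippet below is a minimal sketch of how they relate (the function name is ours):

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# beta = 1 weights precision and recall equally (F1); beta = 2 weights
# recall twice as heavily (F2); beta = 0.5 halves recall's weight (F0.5).
p, r = 0.8, 0.6
scores = {beta: f_beta(p, r, beta) for beta in (0.5, 1.0, 2.0)}
```

With precision 0.8 and recall 0.6, the F0.5 score (0.75) sits closer to precision while the F2 score (about 0.632) sits closer to recall, illustrating the weighting.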
The analysis of Table 4 reveals distinct strengths and weaknesses among the classifier models used in alias resolution. The Naive Bayes classifier, despite its strong performance in recall, F1 score, and F2 score, exhibits relatively low classification accuracy, averaging 74.4%. This deficiency undermines its reliability as a parser. The diminished accuracy can likely be attributed to the interdependence and continuous nature of the selected features: Bayesian classifiers conventionally operate under the assumption of feature independence and discreteness, yet some of our chosen features are correlated, notably those concerning routing paths, and others are continuous, such as the difference value of round-trip time. This departure from discrete, independent features may entail information loss within the Naive Bayes framework, undermining its classification efficacy.
The decision tree classifier demonstrates commendable classification speed and accuracy but suffers from low recall. This is linked to a known limitation of decision trees: their susceptibility to overfitting. Decision trees tend to tailor themselves excessively to the intricacies of the training data, particularly at high tree depth or with scant training samples, capturing noise and idiosyncrasies and compromising generalization. This limitation could pose a significant challenge in practice, as the low recall may reduce the discovery rate of alias IPs, restricting the coverage of the inferable router topology and negatively impacting its construction.
The SVM classifier is evidently burdened by slow parsing speed. While SVM boasts robust generalization capabilities and excels on high-dimensional datasets, its efficacy is tempered by computational demands, particularly evident on large-scale datasets: training entails substantial computational overhead and extensive storage resources, factors that impede expeditious alias resolution within network contexts. Moreover, SVM's sensitivity to missing data requires meticulous data preprocessing, introducing variability in classifier accuracy. Consequently, despite its theoretical strengths, SVM's practical utility is constrained by computational exigencies and susceptibility to data-quality fluctuations.
In contrast, the Random Forest classifier strikes a balance across parsing speed, accuracy, and recall. While it may not exhibit conspicuous advantages over the alternative models on individual metrics such as precision, recall, or parsing time per 100,000 IP pairs, its superior F0.5 score substantiates its capability to ensure parsing precision, which matters most in scenarios where parsing accuracy assumes primacy. We expect the selected classifier to simultaneously deliver high precision of parsing results, fast parsing speed, and a high discovery rate of parseable alias IPs (high recall). The Random Forest classifier strikes this balance, making it an ideal choice for researchers who require both high accuracy and robust discovery of potential alias IPs; its resistance to overfitting and immunity to noise further add to its appeal.
Given the favorable balanced performance and versatility of the Random Forest Classifier, as well as its ability to mitigate overfitting and handle noise in data, the decision to construct AliasClassifier based on the Random Forest model for comparison with other methods appears well-founded. Subsequent sections will delve into the details of the comparative experiments, shedding light on the effectiveness and advantages of this approach.
4.4. Alias Aggregation
Alias aggregation is the process of aggregating identified alias IP pairs into routers. Traditional machine-learning methods employ alias passing for router node aggregation after identifying aliased IP pairs. Alias passing implies that for mutually aliased pairs (IP A, IP B) and (IP B, IP C), the trio (IP A, IP B, IP C) is assigned to the same router node because both pairs share the common alias interface IP B, as shown in Figure 8.
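For concreteness, alias passing amounts to computing connected components over the aliased pairs. The sketch below uses union-find; all identifiers are our own illustrative choices, not the papers' code:

```python
def aggregate_by_passing(alias_pairs):
    """Group IPs into router nodes: any two pairs sharing an IP merge."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:           # path halving keeps trees shallow
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in alias_pairs:
        union(a, b)
    routers = {}
    for ip in parent:                   # collect members by root
        routers.setdefault(find(ip), set()).add(ip)
    return list(routers.values())

# The example from the text: (A, B) and (B, C) collapse into one router.
pairs = [("A", "B"), ("B", "C")]
```

Because any shared interface merges two groups, a single falsely reported pair can fuse two unrelated routers, which is exactly the failure mode discussed next.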
However, as discussed in Section 4.3, regardless of the classification model employed, the final classification results contain some errors. When alias passing is used for router node aggregation on the IP pairs judged by the classifier, falsely reported alias IPs link many unrelated routers together. This association significantly reduces the number of routers that are eventually aggregated. Consequently, some of the routers inferred through the alias-resolution process may be assigned more IP addresses than they actually possess.
To address the aforementioned “router bloat” problem, we propose a router aggregation algorithm based on alias triangles. An alias triangle comprises any three IP addresses that are pairwise aliases of each other. If any three IP addresses in a router form an alias triangle, we consider the router comprising these three IP addresses to be real. A simple schematic of the router aggregation algorithm based on alias triangles is depicted in Figure 9. The aggregation steps are as follows:
Step 1: We construct an alias set of IP addresses (Alias dataset) by incorporating all the IP addresses deemed aliased by the classifier. Specifically, for any IP address IP_i, we construct an alias set S_i by taking IP_i as the key and the IP addresses judged by the classifier to be aliases of IP_i as the value: S_i = {IP_i: [IP_a, IP_b, ...]}.
To reduce the data volume during router aggregation, we sort the Alias dataset in ascending order of key IP and remove from each value list the member IPs that are smaller than the key IP; i.e., for the alias set S_i, every member IP_j in its value satisfies Int(IP_j) > Int(IP_i), where Int(·) denotes the decimal integer value of an IP address.
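A minimal sketch of Step 1, with hypothetical names, building the sorted and pruned Alias dataset from classifier-positive pairs:

```python
import ipaddress

def build_alias_dataset(alias_pairs):
    """Key each pair on its numerically smaller IP, so every value IP is
    larger than its key IP, then sort keys in ascending numeric order."""
    dataset = {}
    for a, b in alias_pairs:
        lo, hi = sorted((a, b), key=lambda ip: int(ipaddress.ip_address(ip)))
        dataset.setdefault(lo, set()).add(hi)
    return dict(sorted(dataset.items(),
                       key=lambda kv: int(ipaddress.ip_address(kv[0]))))
```

Note the numeric (decimal integer) ordering: "10.0.0.9" sorts before "10.0.0.10", which a plain string sort would get wrong.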
Step 2: We select the entry with the smallest key IP from the Alias dataset, denoted IP_k with alias set S_k, to initiate router aggregation. Each router is represented by a set of member IP addresses, which is dynamically expanded as more IP addresses are discriminated; the initial IP set for Router R is therefore {IP_k}.
We then subject the alias set S_k to the alias triangle judgment. If any two IP addresses within S_k are also aliased to each other, they form an alias triangle with IP_k, and both belong to the constituent IP addresses of Router R. After the alias triangle judgment of all IP addresses in S_k, we obtain the new Router R.
Step 3: The judgment process from Step 2 is also applied to the newly added IP addresses in Router R until all member IP addresses have completed the alias triangle judgment. If no new members are added after all member IP addresses have completed the judgment, all members of Router R have been recognized.
Step 4: Remove all member IPs of Router R from the Alias dataset and repeat Steps 2 and 3 until no IP addresses in the Alias dataset form alias triangles. The IP addresses that do not form alias triangles undergo alias passing to construct potential routers. This subset of routers may contain incorrectly aliased IP addresses, so we designate it as the Candidate Router set.
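Putting Steps 2 through 4 together, the following condensed sketch illustrates the aggregation loop. All naming is our own; it assumes `alias` maps each IP to the symmetric set of IPs the classifier reported as its aliases, and the final alias passing over the leftover IPs is omitted:

```python
def triangle_aggregate(alias):
    """alias: dict mapping ip -> set of ips classified as its aliases."""
    remaining = set(alias)
    routers, leftovers = [], set()
    while remaining:
        key = min(remaining)                  # Step 2: smallest key seeds a router
        router, frontier = {key}, [key]
        while frontier:                       # Step 3: re-check newly added members
            ip = frontier.pop()
            peers = sorted(p for p in alias.get(ip, ())
                           if p in remaining and p not in router)
            for i, a in enumerate(peers):     # alias triangle test around ip
                for b in peers[i + 1:]:
                    if b in alias.get(a, ()): # a-b aliased: (ip, a, b) is a triangle
                        for new in (a, b):
                            if new not in router:
                                router.add(new)
                                frontier.append(new)
        if len(router) >= 3:                  # at least one triangle confirmed
            routers.append(router)
            remaining -= router               # Step 4: remove members, repeat
        else:
            leftovers.add(key)                # no triangle: alias passing later
            remaining.discard(key)
    return routers, leftovers
```

For example, with pairs (A, B), (A, C), (B, C), and (A, D), the triangle (A, B, C) is confirmed as a router while D, a misreported alias of A with no triangle, is deferred to the Candidate Router stage instead of bloating the router.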