1. Introduction
The need for digital technology has grown significantly since the COVID-19 pandemic. At the same time, the development of new technologies such as the Internet of Things (IoT), fifth-generation (5G) cellular networks, and the boom in Artificial Intelligence (AI) has created new demand for network connectivity. Attention is now focused on near-future Tactile Internet (TI) applications, which must support latency-sensitive human-to-machine/robot (H2M/R) applications such as Extended Reality (XR), tele-surgery, industrial automation, and intelligent transport systems [1]. Statista has predicted that the number of connected devices will exceed 30.9 billion units by 2025 [2]. This means that network operators will face significant challenges in providing robust and guaranteed services to users.
One of the future applications that will need ultra-low latency and robustness is the TI. The TI has both similarities with and distinctions from the IoT and 5G. The 5G cellular networks focus more on improving Human-to-Human (H2H) communications, whereas the IoT depends on Machine-to-Machine (M2M) communications to facilitate industrial automation systems and other machine-centric activities [3]. In contrast, the TI requires a human-centered design approach due to the inherent Human-in-the-Loop (HITL) nature of H2M/R interaction, as in tele-surgery types of applications [4].
The authors of [5] determined the QoS key performance indicators of TI use cases. For example, the tele-operation scenario should have a latency below 1–10 ms and a reliability of 99.999%. For M2M applications, such as self-driving cars and industrial automation, the required latency is 5–10 ms [6] and the required reliability is 99.999% (an average of less than 6 min of downtime per year) [7]. These indicators show that the underlying network must not only guarantee minimum latency but also be robust enough to meet the stringent reliability requirements.
Currently, wired and wireless communication networks are rapidly evolving in terms of their architecture and capabilities to meet the demands of latency-sensitive H2M/R applications. In wired networks, especially optical fiber, Passive Optical Networks (PONs) have continually evolved over the years. PONs now offer the bandwidth capacity and functionality to deliver low-latency, high-bandwidth, and cost-efficient services to large numbers of users. Moreover, optical fiber has now been deployed in most urban areas near residential and industrial premises [8].
Ethernet Passive Optical Network (EPON) technology is among the best PON technologies due to its lower cost, high bandwidth, and readiness to support efficient Quality-of-Service (QoS). The current EPON standard is IEEE 802.3ca, approved in 2020 as the next-generation EPON (NG-EPON), which boosts single-channel bandwidth to 25 Gbps [9]. Moreover, NG-EPON can achieve higher data rates by using channel bonding, which offers aggregated data rates of N×25 Gbps. Consequently, a fully operating NG-EPON may deliver up to 50 Gbps for both upstream and downstream transmission [10]. Nevertheless, managing an NG-EPON that can satisfy the strict QoS requirements of residential and industrial users is challenging. Industrial users usually have stringent QoS requirements, one of which is maintenance service [11]. This service includes ensuring that the network is fault-tolerant against any fiber fault. Any fiber cut or loss can significantly impact industrial systems, especially for TI or H2M/R applications, which can involve life-and-death situations.
In general, different types of anomalies can affect the performance of NG-EPON. Fiber failures can occur due to mechanical, optical, or electrical faults. Since a single fiber link can connect residential, industrial, or enterprise networks, carrying a mixture of data from personal to public traffic, or even 911 or TI data, any fiber failure can have an enormous impact and must be responded to immediately [12]. Moreover, failures in optical network communication can be categorized as soft (minor) failures and severe failures. Severe failures lead to immediate service loss due to fiber cuts, bends, and other problems. Minor failures degrade transmission quality due to signal overlap, laser deflection, filter switching, noise, and other problems [13]. Therefore, network operators must ensure reliable data communication for high-speed Internet. Failure to do so can lead to significant financial and data losses for both network operators and customers. At the same time, network operators also need to reduce operation and maintenance expenses (OPEX).
According to the Federal Communications Commission (FCC), more than one-third of fiber disruptions are caused by fiber-cable problems [14]. These issues include failures of connectors or power supplies, fiber breaks, macro bends, and even Optical Line Terminal (OLT) or Optical Network Unit (ONU) transceiver problems. Consequently, a remote and automatic mechanism for monitoring and diagnosing fiber links would be very beneficial for reducing the mean time to repair (MTTR), thereby increasing customer satisfaction.
The main contributions of this paper are as follows:
We propose a smart resilience architecture and its operations for the Next-Generation Ethernet Passive Optical Network (NG-EPON).
We introduce a novel Resilience Dynamic Bandwidth Allocation (RDBA) mechanism that ensures the Quality-of-Service (QoS) of real-time and tactile internet applications.
We build a supervised AI model using a Multi-Layer Perceptron (MLP) to detect anomalies and faults in the branches.
Extensive simulation results demonstrate that the proposed resilience mechanism with AI-enhanced anomaly and fault detection effectively manages delay for real-time and tactile internet applications.
The remainder of this paper is organized as follows. Related work is presented in Section 2. The SDN-Enabled Broadband Access (SEBA) architecture is discussed in Section 3. Section 4 introduces the proposed smart resilience architecture. Section 5 presents the performance evaluation. Finally, Section 6 concludes our work.
2. Related Work
The objective of fiber monitoring is to detect anomalies in the optical layer by analyzing monitoring data. Several techniques are commonly used by engineers to identify fiber faults in Optical Distribution Networks (ODNs). For instance, one study [14] places a Reference Reflector (RR) at the end of each fiber in the ODN and uses Optical Time-Domain Reflectometry (OTDR) to detect, locate, and estimate the reflectance of the connections and mechanical splices in the fiber links. Another approach uses binary-coded Fiber Bragg Gratings (FBGs) [11]. The FBG binary codes serve as indicators that distinguish one ONU from another by varying the wavelengths used by the FBGs, making it easy to identify faulty branches [11]. Some early studies have also proposed embedded, miniaturized OTDR modules integrated into the ONUs [15,16,17,18].
Furthermore, to consistently meet Service Level Agreements (SLAs), network operators need a mechanism to maintain service continuity even when there are fiber faults in the ODN. In EPON, network operators usually use protection mechanisms such as trunk protection or tree protection. Trunk protection primarily focuses on protecting the OLT and the feeder fiber. In contrast, tree protection covers the entire area but is very costly. Dedicated protection may deliver more reliability for service continuity but cannot provide efficient resource utilization [19]. Several studies have used ring topologies to minimize the cost of establishing redundant paths in traditional EPON while handling fiber cuts or failures within the network [20,21,22]. Apart from the various tree, trunk, star, ring, and bus protection mechanisms, some studies have also used hybrid topologies, which improve EPON network redundancy but increase network complexity [23]. Moreover, some studies use SDN capabilities and a bus protection line to enhance the resilience of existing EPON systems [24].
Recently, Artificial Intelligence (AI) entities have become able to perform operations analogous to human activities, such as learning and decision-making. AI-based techniques are already changing and improving industries, including telecommunications networks. These techniques range from performance monitoring and transmission guarantees to optical network control and management in both transport and access networks [25]. Current studies related to fiber monitoring already use Machine Learning (ML) approaches to detect anomalies in optical networks [12,14,26,27]. These studies have shown that ML can detect and localize fiber faults in the ODN. Although these studies have proposed AI monitoring mechanisms, to the best of our knowledge, no studies have focused on integrated resilience that not only intelligently localizes fiber faults in the ODN but also automatically recovers the network using AI mechanisms. Moreover, most studies only proposed an ML model without any simulation or experiment on working PON systems.
Table 1 summarizes the contributions of the related work.
To realize this, our proposed architecture uses an AI-enabled unified platform to automate and adapt to changing circumstances and business needs; as Cisco's 2024 Global Networking Trends Report states, network operators will adopt such platforms within the next two years [28].
3. SDN-Enabled Broadband Access (SEBA)
SDN-Enabled Broadband Access (SEBA) is a unified cloud-native platform providing scalable and flexible network management. SEBA is based on Software-Defined-Networking (SDN) principles, offering simpler, more flexible, and easily customizable networks. Moreover, SEBA promotes interoperability between OLTs and ONUs from different manufacturers. SEBA is open-source, giving operators unprecedented flexibility in customizing SEBA for their access network, integrating it with the rest of their backend systems, implementing only the features they require, adding application programming interfaces (APIs) themselves, and not being bound by the timelines and prices of a traditional vendor [29].
Network Fault Detection and Localization
Commonly, to detect anomalies in the ODN, engineers use OTDR, a technique based on Rayleigh backscattering [12]. The concept is similar to radar: the OTDR sends a series of optical pulses into the ODN, and the backscattered signals are recorded as a function of time, which can be translated into the positions of optical fiber components such as the splitter, ONUs, and end connectors. This information is used for event analysis. Figure 1 illustrates an example OTDR trace.
As shown in Figure 1, the initial drop at the beginning of the trace represents the launch condition level of around 25 dB. The downward-sloping line that follows indicates the attenuation of the feeder fiber. At the end of the linear attenuation, a small peak signifies the splitter, connectors, ONUs, or other reflective events. The dense scattering at the end marks the termination of the fiber.
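To make the time-to-position translation concrete, the short Python sketch below converts a recorded two-way travel time into a distance along the fiber; the group index is an assumed typical value for single-mode fiber, not a parameter taken from this paper.

```python
# Convert an OTDR event time to a position along the fiber.
# The pulse travels to the event and back, hence the factor of 2.
C_VACUUM = 299_792_458.0   # speed of light in vacuum (m/s)
GROUP_INDEX = 1.468        # typical group index of single-mode fiber (assumed)

def otdr_event_distance(two_way_time_s: float) -> float:
    """Return the event distance in meters for a given two-way travel time."""
    v_fiber = C_VACUUM / GROUP_INDEX          # propagation speed in the fiber
    return v_fiber * two_way_time_s / 2.0     # one-way distance

# Example: a reflection recorded 147 microseconds after the launch pulse
print(f"{otdr_event_distance(147e-6) / 1000:.2f} km")  # ≈ 15.01 km
```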
OTDR traces are usually difficult to interpret, even for experienced engineers, due to the noise that affects the signals. Analyzing these traces with conventional methods can be very challenging, especially when trying to distinguish subscribers unambiguously [30]. It can also be very time-consuming, since the engineer needs to remove the noise manually, which increases the MTTR and reduces detection and localization accuracy. One strategy for effectively managing and interpreting OTDR traces is for network operators to use baseline measurements, i.e., measurements saved while the network is functioning normally. In this way, network operators create reference points for future comparison when faults occur in the ODN. Moreover, maintaining an organized database of reference OTDR traces helps with quick retrieval and analysis during troubleshooting. Additionally, network operators must ensure that network engineers are well trained in interpreting OTDR traces and using the tools by conducting regular training sessions. Even combined, all these techniques still depend on the network engineers.
Furthermore, before a fault occurs in the ODN, anomalies can already appear in the network condition. Network operators can use various visualization tools such as a Bit Error Rate (BER) analyzer, an Optical Time Domain Visualizer, and an Optical Spectrum Analyzer. These tools can show the performance of optical signal delivery. An eye diagram is used to measure signal quality. Ideally, an eye diagram would consist of two parallel lines with instantaneous rise and fall times, making the transitions virtually invisible. The eye diagram can reveal vital parameters such as timing jitter and inter-symbol interference [31]. Combining OTDR trace analysis with eye diagram analysis can improve the early detection of faults in the ODN.
Consequently, in this paper, we propose automatic detection and localization using an ML algorithm that incorporates OTDR trace data and eye diagram analyzer data. By incorporating ML algorithms, we can improve the accuracy and efficiency of detecting and localizing fiber faults. ML can process vast amounts of data, identifying patterns and detecting anomalies much faster and with greater precision than network engineers. By leveraging ML, network operators can reduce their reliance on network engineers for fault detection and localization, leading to quicker resolutions and increased network reliability (as illustrated in the proposed smart resilience architecture in Figure 4).
4. Proposed Architecture
This section discusses the proposed smart resilience architecture, which not only detects and localizes fiber faults but also automatically establishes connections while waiting for an engineer to fix the fiber faults in the ODN. In this architecture, we use the SEBA Residential Central Office Re-architected as Datacenter (R-CORD) platform concept, which sits in the middle and provides management and abstraction solutions, enabling the use of white box hardware. White box hardware reduces both capital expenditures (CAPEX) and OPEX. In this way, we separate the software from the hardware, enhancing agility and bringing the best of cloud Network Function Virtualization (NFV) and SDN together. The OLT and ONUs used in the proposed architecture are white box hardware, providing a highly flexible and cost-effective solution. The white box devices feature hardware platforms that can run third-party software, such as VOLTHA, which offers open programmability and interoperability.
Figure 4 shows the smart resilience architecture in NG-EPON. In the north part, the OLT is connected to VOLTHA, an SDN controller such as ONOS, and the Network Edge Mediator (NEM). These components interact with one another using APIs and gRPC Remote Procedure Calls to provide seamless communication between VOLTHA, the SDN controllers, and the NEM. As already mentioned, VOLTHA activates the OLT and adds it to its logical switch; the ONUs are likewise added to the logical switch by VOLTHA. The SDN controllers provide centralized control and management for dynamic traffic steering, automatic failover, and real-time network adjustment. The OTDR, located at the central office, detects and localizes fiber faults, while the BER analyzer at the business users' side captures eye diagrams to detect anomalies. Furthermore, in the south part, the users are categorized into two groups: business users and residential users. Business users usually have very strict SLAs and requirements. Therefore, as shown in the figure, business users such as ONU1 and ONU2 have a resilience area (indicated by the red dashed circle) that is covered by Radio Frequency over Glass (RFoG). The RFoG serves as a critical backup mechanism for business users in the event of a fiber fault. RFoG allows RF signals to be transmitted over fiber optic cables, maintaining compatibility while providing the benefits of fiber optics, such as higher bandwidth and lower latency. In the proposed architecture, RFoG is activated as a secondary communication path when the primary fiber link experiences a fault or anomaly. The failover process is handled automatically by the ONU and the SDN controller, ensuring that the RFoG backup link is ready to carry traffic when needed. This mechanism maintains continuous service, minimizes downtime, and enhances overall network resilience.
In normal conditions, ONUs send and receive data using the primary optical path (λ1, λ2). The SDN controller monitors network performance indicators such as the Bit Error Rate (BER) and OTDR trace analysis. Network operators oversee the network using a centralized platform, i.e., the NEM, which provides dashboards, alerts, and reports. In our proposed architecture, edge computing is realized in the NEM. This edge computing integration receives incoming data in real time, identifies potential issues, and performs real-time analysis and alerting. Edge computing within the NEM can be implemented using high-performance servers equipped with GPUs for accelerated AI processing. Typically, Kafka is used to stream the collected telemetry data from the NEM to the edge computing device. One study [34] has shown that a Kafka-based framework is highly scalable, supporting around 4000 messages per second with low CPU load and achieving an end-to-end latency of about 50 ms. The AI model deployed at the edge can detect anomalies in the network, predicting a variety of faults such as fiber cuts, partial fiber degradation, fiber bending, and faulty splitters. When anomalies are detected, the NEM communicates with the SDN controller to take corrective actions based on the AI predictions, such as activating backup conditions.
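As a hedged illustration of this telemetry path, the sketch below (using the kafka-python package; the broker address, topic name, and message fields are hypothetical) shows how the NEM side might publish BER and OTDR samples to a Kafka topic consumed by the edge AI:

```python
import json
import time
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker address; adjust to the actual deployment.
producer = KafkaProducer(
    bootstrap_servers="nem-broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_telemetry(onu_id: int, ber: float, otdr_trace: list) -> None:
    """Publish one telemetry sample for an ONU to the edge-analytics topic."""
    sample = {
        "timestamp": time.time(),
        "onu_id": onu_id,
        "ber": ber,
        "otdr_trace": otdr_trace,   # downsampled reflection amplitudes (dB)
    }
    producer.send("pon-telemetry", sample)  # hypothetical topic name

publish_telemetry(onu_id=1, ber=2.1e-10, otdr_trace=[25.0, 24.7, 24.3])
producer.flush()  # ensure the sample is delivered before exiting
```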
When faults or degradations occur in the ODN, including fiber cuts, fiber bending, and faulty splitters, the AI model identifies these anomalies and initiates a backup-mode plan. The OLT and ONU are notified via the NEM, and the OTDR is used to localize the fault among the branches of the network. When the ONU activates backup mode, the RFoG becomes active and ready to send data to the nearest ONU (the backup ONU) within its coverage. Simultaneously, the SDN controller updates the network configuration to handle the failover scenario. For instance, if partial fiber degradation is detected, the SDN controller may initially attempt to reroute traffic within the primary path. In the event of a complete fiber cut to ONU1, ONU1 and the SDN controller trigger the RFoG backup mechanism, routing data through ONU2. This multi-layered approach ensures robustness against various types of failures.
Since there is no direct connection link between the affected ONU and OLT, a mechanism must be used so that the nearest ONU (backup ONU) can differentiate the incoming data from the OLT and send it to the affected ONU via RFoG. Similarly, the OLT needs to know that the data comes from the affected ONU. This can be achieved using data tagging such as a virtual local area network (VLAN).
In the proposed architecture, the VLAN tag table is established in the OLT and ONUs. This table can be changed over time and updated using the SDN controller, which dynamically updates the VLAN tag table and configurations based on network changes and faults. This ensures that the OLT and ONUs will map VLAN tags to their respective destinations.
Table 2 shows an example of the VLAN tag table.
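To make the tagging step concrete, the following minimal Python sketch (the tag values and ONU identifiers are illustrative, not taken from Table 2) shows how a backup ONU could use an SDN-updated VLAN table to forward frames that belong to the affected ONU over the RFoG link:

```python
# Illustrative VLAN tag table, rewritten by the SDN controller on failover.
# Maps a VLAN ID carried in the frame to its real destination ONU.
vlan_table = {
    100: {"onu": "ONU1", "path": "rfog_backup"},   # affected ONU, rerouted
    200: {"onu": "ONU2", "path": "primary"},       # backup ONU's own traffic
}

def route_downstream(vlan_id: int, frame: bytes) -> str:
    """Decide where the backup ONU forwards a tagged downstream frame."""
    entry = vlan_table.get(vlan_id)
    if entry is None:
        return "drop"                    # unknown tag: not for this ONU pair
    if entry["path"] == "rfog_backup":
        return f"forward {len(frame)} B to {entry['onu']} via RFoG"
    return f"deliver {len(frame)} B locally to {entry['onu']}"

print(route_downstream(100, b"\x00" * 64))  # traffic for the affected ONU1
```

Because the table lives in both the OLT and the ONUs and is updated by the SDN controller on each failover event, the same lookup works in the upstream direction for traffic originating at the affected ONU.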
4.1. Intelligent Fault Detection and Localization with Intelligent Diagnosis
As mentioned before, this paper focuses on fault detection and localization through OTDR trace analysis and eye diagram evaluation. Figure 5 illustrates the comparison between normal and fault conditions from these two perspectives. Figure 5a shows a clear eye opening, indicating minimal noise, jitter, and distortion. In contrast, Figure 5b depicts a situation with anomalies. When there are anomalies in noise, jitter, or distortion, the eye opening is reduced both vertically and horizontally, distorting the eye shape, which indicates a very high level of noise, higher jitter, and potential issues with the transmission channel. Figure 5c shows the power attenuation for ONUs located at different distances in a normal trace event, while Figure 5d highlights the scenario where ONU1 experiences a fiber fault. The OTDR trace for ONU1 shows a loss with no peak detected, indicating the presence of a fault. Typically, both the OTDR trace and the eye diagram are tested against predefined masks. Any violation of these masks can indicate potential fiber faults within the ODN.
Consequently, in our proposed fault detection and localization approach, we use eye diagrams to complement the OTDR in identifying subtle degradation in signal quality, since OTDR alone only detects severe faults such as fiber cuts. The proposed ML model uses this combination of eye diagram and OTDR data to enhance the accuracy of prediction and localization. This leads to improved accuracy and efficiency, especially in identifying minor or soft faults that would not be captured by OTDR alone.
The proposed framework for fault detection and localization with intelligent diagnosis is shown in Figure 6, following the study in [12]. There are five main stages: (1) Data collection: the deployed ODN infrastructure is periodically monitored using the OTDR and the BER analyzer, and the generated OTDR traces and eye diagram data are sent to the SDN controller; (2) Data processing: the collected data are pre-processed to normalize and standardize the features to a similar scale; (3) Anomaly detection: the processed data are compiled into a dataset, which is then used to train and evaluate a machine learning model designed to detect anomalies in the network; (4) Fiber fault diagnosis and localization using the ML model; (5) Mitigation and recovery planning: a plan is formulated to address and fix the detected faults, and alerts are generated to notify engineers and customers of the issues. The SDN controller facilitates dynamic management and control of the network based on the ML model outputs.
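A minimal sketch of the stage-(2) scaling step, assuming each sample reduces to the (time, amplitude) feature pair used by the model in Section 4.3 (the sample values are invented):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative stage-(2) preprocessing: each monitoring sample reduces to a
# (time, amplitude) pair taken from an OTDR trace or eye-diagram capture.
raw_samples = np.array([
    [0.0e-6,  25.0],
    [75.0e-6, 22.1],
    [147.0e-6, 3.4],   # weak reflection far down a branch
])

scaler = StandardScaler()                    # zero mean, unit variance per feature
features = scaler.fit_transform(raw_samples)
print(features.mean(axis=0), features.std(axis=0))  # ≈ [0, 0] and [1, 1]

# In deployment, the scaler would be fitted on training data only and then
# reused unchanged on live samples.
```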
4.2. Simulation-Based Evaluation
To validate the proposed approach, a simulation-based evaluation setup was built using OptiSystem 21.0 software. OptiSystem is an innovative and powerful software design tool that enables users to plan, test, and simulate almost every type of optical link in the transmission layer of a broad spectrum of optical networks, from LAN, SAN, and MAN to ultra-long-haul. It offers transmission-layer optical communication system design and planning from the component level to the system level and visually presents analyses and scenarios [35]. The setup comprises an OLT connected to 8 ONUs through a passive splitter. The distance between the OLT and the ONUs ranges from 15 to 20 km, with a feeder fiber length of 15 km and branch lengths varying from 2 km to 7 km. The optical transmitter wavelength is set to 1550 nm with a power of 7 dBm, using NRZ modulation. The attenuation loss is 0.2 dB/km, and the splitter loss varies from 4 dB to 8 dB. Two types of scenarios were simulated: normal and faulty. For the faulty scenarios, different anomalies were introduced, including macro-bending, micro-bending, a fiber cut, and a bad splitter. The simulation generated 709,054 samples. Following [12], the dataset of eye diagrams and OTDR traces for the fault and normal scenarios was constructed, normalized, and divided into training (60%), validation (20%), and test (20%) sets. It is worth mentioning that the eye diagrams were used for anomaly detection, while the OTDR traces were used to localize faults. A BER analyzer was placed at the end of each branch to capture the eye diagrams. The dataset is balanced, with approximately equal numbers of samples representing normal and faulty conditions. To mimic anomalies (such as fiber bending or a bad splitter) and fiber faults, attenuators were placed on the 2 km, 3 km, 5 km, and 7 km branches. The termination at the end of the 7 km branch was removed to simulate a fiber fault.
Figure 7 shows the simulation-based evaluation setup for generating faulty-branch data using OptiSystem in the passive optical network. The normal samples are derived from the same setup without any attenuators.
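A minimal sketch of the 60/20/20 stratified split described above (placeholder arrays stand in for the real OTDR/eye-diagram features and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(709_054, 2))       # placeholder (time, amplitude) features
y = rng.integers(0, 2, size=709_054)    # placeholder labels: 0 = normal, 1 = fault

# 60% training first, then split the remainder 50/50 into validation and
# test sets (20% each), preserving the normal/fault class balance.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```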
4.3. Neural Network Architecture and Model Evaluation
We started by preprocessing the data, applying a standard scaler to normalize the features, and guaranteeing that all features are on a similar scale to enhance the model’s performance. We then implemented a Multi-Layer Perceptron (MLP) neural network due to its simpler architecture, which requires less computational power compared with other machine learning algorithms, making it ideal for high-speed network environments.
As shown in Figure 8, our MLP model has an input layer followed by three hidden layers. The input layer has two neurons (for time and amplitude/reflection, indicated by blue), while the hidden layers have 8, 16, and 8 neurons (indicated by green, red, and green), respectively. All layers use the ReLU activation function except the output layer (indicated by blue), which has a single neuron and uses the sigmoid activation function for binary classification (fault or no fault). In total, the model has 313 trainable parameters.
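A minimal Keras sketch of this architecture reproduces the stated 313 trainable parameters; the optimizer and loss are not specified in the text, so the choices below are conventional assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# 2 -> 8 -> 16 -> 8 -> 1 MLP, matching the description above.
model = models.Sequential([
    layers.Input(shape=(2,)),              # time, amplitude/reflection
    layers.Dense(8, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"), # P(fault)
])

# (2*8+8) + (8*16+16) + (16*8+8) + (8*1+1) = 24 + 144 + 136 + 9 = 313 parameters
model.summary()

# Adam with binary cross-entropy is an assumed, conventional configuration.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy",
                       tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
```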
To assess the model's performance and robustness, we utilized stratified K-fold cross-validation, where each fold maintains the same class distribution as the original dataset. The training was conducted over 40 epochs with a batch size of 256, using 20% of the training data as a validation set to monitor for overfitting. Performance metrics such as accuracy, precision, recall, and F1-score were calculated for each fold. After completing all folds, we computed the average of these metrics to summarize the model's overall performance on unseen data. The model achieved an average accuracy of 0.8149, an average precision of 0.8433, an average recall of 0.7818, and an average F1-score of 0.8100. These results indicate that the model performs robustly in distinguishing between "Normal" and "Fault" classes. The high average precision suggests effective minimization of false positives, meaning the model reliably identifies true positives when making positive predictions. However, the slightly lower recall indicates that some fault instances may be missed, resulting in false negatives. The balanced average F1-score reflects a good trade-off between precision and recall, making the model suitable for applications where both types of errors are of concern.
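A hedged sketch of this evaluation loop (the number of folds is not stated in the text, so K = 5 is an assumption; build_model constructs the MLP from the previous sketch):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluate_with_kfold(build_model, X, y, n_splits=5):
    """Stratified K-fold evaluation; K = 5 is an assumed value."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    per_fold = []
    for train_idx, test_idx in skf.split(X, y):
        model = build_model()
        # 40 epochs, batch size 256, and 20% of the training fold held out
        # for validation, as stated in the text.
        model.fit(X[train_idx], y[train_idx], epochs=40, batch_size=256,
                  validation_split=0.2, verbose=0)
        y_prob = model.predict(X[test_idx], verbose=0).ravel()
        y_pred = (y_prob > 0.5).astype(int)
        per_fold.append([accuracy_score(y[test_idx], y_pred),
                         precision_score(y[test_idx], y_pred),
                         recall_score(y[test_idx], y_pred),
                         f1_score(y[test_idx], y_pred)])
    return np.mean(per_fold, axis=0)  # mean accuracy, precision, recall, F1
```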
4.4. Resilience Dynamic Bandwidth Allocation
The Resilience Dynamic Bandwidth Allocation (RDBA) uses an offline scheduler approach, where the OLT waits for report messages from all ONUs before performing dynamic bandwidth allocation (DBA). In this way, the OLT has a holistic view of all ONU demands, ensuring fairness [36]. In the normal condition, where no fault is detected, the OLT assigns the bandwidth allocation to the ONUs based on the available bandwidth given by Formula (1):

Bavailable = RN × (Tmax − N × G) − N × 512,  (1)

where RN is the EPON line rate (in bits per second), Tmax is the maximum cycle time (in milliseconds), N is the total number of ONUs, G is the guard time, and 512 bits is the control message length. The minimum guaranteed bandwidth (Bmin) of an ONU is calculated with Formula (2):

Bmin = Wmax − Wreport,  (2)

where Wmax is the maximum timeslot of an ONU and Wreport is the reserved window size of the report message (in bits). We limit each ONU timeslot to prevent upstream channel monopolization by heavily loaded ONUs. However, Wmax can also be set according to the SLA.
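To make the arithmetic of Formulas (1) and (2) concrete, the sketch below evaluates them under assumed example parameters (1 Gbps line rate, 2 ms cycle, 32 ONUs, 1 µs guard time, and an equal-share Wmax); none of these values are prescribed by the text.

```python
# Assumed example parameters, not values mandated by the text.
R_N   = 1e9       # EPON line rate RN (bits per second)
T_MAX = 2e-3      # maximum cycle time Tmax (seconds)
N     = 32        # total number of ONUs
G     = 1e-6      # guard time between timeslots (seconds)
CTRL  = 512       # control message length (bits)

# Formula (1): upstream bits usable for data in one polling cycle.
B_available = R_N * (T_MAX - N * G) - N * CTRL

# Formula (2): guaranteed bandwidth, with an equal share of Bavailable used
# as the maximum timeslot Wmax here (an assumption; the SLA may set it).
W_max    = B_available / N
W_report = 512    # reserved REPORT window Wreport (bits)
B_min    = W_max - W_report

print(f"B_available = {B_available:,.0f} bits/cycle, B_min = {B_min:,.0f} bits")
```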
When the proposed ML model identifies faults or anomalies in the ODN by analyzing data from the OTDR traces and the BER analyzer, the NEM informs the OLT via the SDN controller to switch from the normal DBA to the RDBA. Once the RDBA is activated, the OLT dynamically adjusts the bandwidth allocation to prioritize the backup ONU, ensuring it can handle both its own traffic and that of the affected ONU (i.e., the faulty ONU). The backup ONU receives additional bandwidth, scaled based on predefined factors, to maintain service continuity for both ONUs. This process ensures minimal service disruption even during fault conditions, as the RFoG link facilitates the rerouting of traffic from the affected ONU to the backup ONU.
Figure 9 shows the pseudocode of the proposed RDBA. In the normal condition, the OLT calculates the available bandwidth (Bavailable) and the guaranteed bandwidth (Bmin) in each cycle, and each ONU receives its guaranteed bandwidth. If the guaranteed bandwidth (Bmin) is greater than the bandwidth reported from the queue, the granted bandwidth (GRANT_ONUi) is set to the queue's requested bandwidth; otherwise, the granted bandwidth is set to the remaining Bmin. The remaining Bmin is then updated by subtracting the granted bandwidth. In the restoration plan, when a fault occurs, the OLT adjusts the allocation for the backup ONU. If the current ONU is a backup ONU, the OLT sets the protection VLAN tag for the affected ONU. Bmin is then calculated for the backup ONU, but it is multiplied by alpha (α). Here, α represents the additional bandwidth allocated to the backup ONU to ensure that it can handle the increased traffic, as the affected ONU now routes all of its data through the backup ONU via RFoG. If the current ONU is not a backup ONU, the normal condition function is applied. Moreover, to verify that the total requested bandwidth from the ONUs does not exceed Bavailable after the addition of the variable α, the total requested bandwidth is calculated as follows (3):

total_requested_bandwidth = ∑(i=1 to N) GRANT_ONUi ≤ Bavailable.  (3)
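The following Python sketch mirrors our reading of the Figure 9 pseudocode (the data structures, the α value of 1.5, and the helper names are illustrative; the figure remains the authoritative definition):

```python
ALPHA = 1.5  # illustrative over-allocation factor for backup ONUs

def rdba_grants(reports, b_min, b_available, backup_of=None):
    """Compute per-ONU grants for one polling cycle.

    reports     : dict onu_id -> requested bandwidth (bits) from REPORT messages
    b_min       : guaranteed bandwidth per ONU, Formula (2)
    b_available : available upstream bandwidth this cycle, Formula (1)
    backup_of   : dict backup_onu_id -> affected_onu_id while a fault is active
    """
    backup_of = backup_of or {}
    grants, vlan_tags = {}, {}
    for onu, requested in reports.items():
        if onu in backup_of:
            vlan_tags[onu] = backup_of[onu]  # protection VLAN tag -> affected ONU
            limit = ALPHA * b_min            # scaled guarantee (restoration plan)
        else:
            limit = b_min                    # normal condition
        grants[onu] = min(requested, limit)
    # Formula (3): the granted total must not exceed the available bandwidth.
    assert sum(grants.values()) <= b_available
    return grants, vlan_tags

grants, tags = rdba_grants({1: 40_000, 2: 90_000, 3: 55_000},
                           b_min=60_476, b_available=1_951_616,
                           backup_of={2: 1})
print(grants)  # ONU2, backing up ONU1, may receive up to ALPHA * Bmin
```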
5. Performance Evaluation
To validate the proposed model, we implemented the NG-EPON architecture in the OPNET simulator. All key components and protocols of NG-EPON, such as dynamic bandwidth allocation, cycle time, transmission capacity, and guard time, are fully modeled. The system model consists of 32 ONUs and one OLT. The downstream and upstream channels between the OLT and the ONUs are configured at 1 Gbps. The distance from the OLT to the ONUs is uniformly distributed between 10 and 20 km. To generate Assured Forwarding (AF), Best Effort (BE), and Tactile Internet (TI) traffic, we employ self-similar, long-range-dependent sources, generating highly bursty traffic with a Hurst parameter of 0.7 [17]. The packet size is uniformly distributed between 512 and 12,144 bits. The Expedited Forwarding (EF) traffic is modeled as a T1 circuit-emulated line with a constant frame rate (1 frame/125 μs) and a fixed packet size of 560 bits, which occupies approximately 14% of the total upstream bandwidth. The remaining traffic is distributed as 50% AF, 20% BE, and 30% TI in scenario I, and 40% AF, 20% BE, and 40% TI in scenario II. To evaluate the proposed mechanism, we construct three scenarios: (1) no fault, (2) one fault, and (3) three faults.
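Such self-similar traffic is commonly synthesized by aggregating ON/OFF sources with heavy-tailed Pareto periods (shape a = 3 − 2H gives H ≈ 0.7); the sketch below shows this standard construction, though it is not necessarily the exact generator used in our OPNET model:

```python
import numpy as np

rng = np.random.default_rng(7)
HURST = 0.7
SHAPE = 3 - 2 * HURST          # Pareto shape a = 1.6 yields H ≈ 0.7 in aggregate

def pareto_period(mean):
    """One Pareto-distributed period length with the requested mean (a > 1)."""
    xm = mean * (SHAPE - 1) / SHAPE       # scale giving E[X] = mean
    return xm * (1 + rng.pareto(SHAPE))   # numpy's pareto is the shifted form

def onoff_source(slots, mean_on=10, mean_off=30):
    """0/1 activity per time slot for a single heavy-tailed ON/OFF source."""
    out, on = [], True
    while len(out) < slots:
        length = max(1, int(pareto_period(mean_on if on else mean_off)))
        out.extend([int(on)] * length)
        on = not on
    return np.array(out[:slots])

# Aggregating many such sources produces bursty, long-range-dependent load.
load = sum(onoff_source(100_000) for _ in range(16))
print(load.mean(), load.max())
```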
The focus of the simulation is to evaluate the system's performance after faults are detected. Fault scenarios with one fault and three faults were introduced, and the system's performance was measured in terms of key metrics, such as mean packet delay, system throughput, packet drop rate, and bandwidth waste. These measurements help validate the resilience of the architecture in ensuring performance guarantees, particularly in terms of low-latency requirements for real-time traffic such as Tactile Internet (TI). While the optical network's physical characteristics (e.g., power levels, impairments) were not the focus of this simulation, the system response to fault scenarios was crucial in demonstrating the architecture's ability to maintain service continuity and minimize disruption. To further validate the system, we compared the performance of the proposed RDBA mechanism against a traditional DBA approach, which does not incorporate fault-tolerant features. In the baseline DBA approach, bandwidth is allocated without any resilience mechanisms to manage fault scenarios. The simulation parameters are summarized in Table 3.
5.1. Mean Packet Delay
Figure 10 shows the mean packet delay of Expedited Forwarding (EF), Assured Forwarding (AF), and Tactile Internet (TI) traffic with different traffic proportions. Five scenarios are depicted: Normal: delay with no faults in the network (blue line); 1Fault_Average: delay with one fault in the network, representing a single fault occurring in one branch of the ODN; 3Fault_Average: delay with three faults distributed across different branches of the ODN; 1Fault_BackupNode: delay at the specific backup node handling the affected ONU with one fault; and 3Fault_BackupNode: delay at the specific backup nodes handling the affected ONUs with three faults.
As seen in Figure 10a, the EF delay under normal conditions increases gradually with the traffic load, showing the expected behavior where higher traffic leads to higher delay. In the 1Fault_Average and 3Fault_Average scenarios, when the traffic load is below 70%, the delay remains close to normal operation but increases more significantly as the traffic load exceeds 70%. This highlights the compounded effect of multiple faults on network performance. The green lines (1Fault_BackupNode and 3Fault_BackupNode) show that the EF delay at the specific backup nodes handling the affected ONUs is slightly higher than in normal operation but much lower than the 1Fault_Average line, demonstrating the effectiveness of the backup node in mitigating the impact of faults on the affected ONUs.
In terms of TI delay, shown in Figure 10b, when there is one fault in the network, the 1Fault_BackupNode scenario manages to stay close to normal operation levels even at higher traffic loads. This again demonstrates the effectiveness of the backup node in mitigating the impact of the fault, ensuring that the TI delay remains well below 2 ms up to a 90% load and only slightly exceeds 2 ms at a 100% load. In the 3Fault_BackupNode scenario, the delay remains relatively low at moderate traffic loads but spikes dramatically beyond an 80% load, reaching up to 5 ms at a 100% load. This indicates that while backup nodes help manage the delay better than having none, multiple faults still pose a significant challenge, especially under high-traffic conditions.
Figure 10c,d illustrate the AF and BE delay, respectively. AF delay, much like EF, shows a minimal increase with rising traffic loads in the normal scenario. BE traffic, typically given the lowest priority, remains uncongested under normal conditions. However, as the traffic load increases, the limited available resources are allocated preferentially to higher-priority traffic; therefore, once the traffic load surpasses 70%, the resources available for AF and especially BE packets become increasingly constrained. When faults are present, resources are redistributed to maintain service levels for critical applications, exacerbating the delays for AF and BE traffic.
Consequently, the proposed RDBA mechanism successfully ensures that the delays for EF and TI packets remain below the critical threshold of 2 ms [4,37], maintaining high QoS for real-time and tactile internet applications. The results show that the RDBA keeps delays well managed under both normal and fault conditions. The RDBA prioritizes higher-priority traffic, which can lead to increased delays for AF and BE packets under fault conditions. The simulation results highlight the importance of a robust DBA mechanism that incorporates resilient AI-enhanced fault detection and recovery to effectively manage delay, particularly for high-priority traffic such as EF and TI packets.
5.2. System Throughput
Figure 11 depicts the system throughput under normal and fault conditions. The system throughput of the network demonstrates a consistent increase as the traffic load rises, indicating the network’s robust capacity to handle escalating demands. This pattern shows an efficient RDBA that successfully adapts to increasing traffic demands. Moreover, in fault conditions (1Fault and 3Fault Averages), there is an observed increase in throughput efficiency compared with normal conditions. This is because the overhead communication required for inactive or faulty ONUs decreases, allowing more bandwidth to be allocated to active connections, thus improving the overall efficiency of the NG-EPON systems.
5.3. Packet Drop Rate
The packet drop rates shown in Figure 12 remain minimal up to a 70% traffic load across all scenarios, indicating healthy network functionality under moderate loads. However, as the load exceeds 80%, packet drop rates begin to rise, especially under the three-fault condition. The packet losses occur predominantly in the AF and BE traffic categories, while EF and TI packets, which are given the highest priority in the network, experience no drops. This differentiation in packet treatment highlights the network's strategic prioritization, ensuring that critical real-time applications dependent on EF and TI traffic maintain uninterrupted service even as the system approaches or reaches full capacity.
5.4. Bandwidth Waste
Figure 13 shows the trend of decreasing bandwidth waste as the traffic load increases across the various scenarios, including normal conditions and faults. At lower traffic loads, there tends to be a surplus of allocated but unused bandwidth, leading to higher waste. As the traffic load increases, the demand for bandwidth rises, and the RDBA allocates nearly all available bandwidth to meet this demand, thereby minimizing waste. Thus, the RDBA demonstrates a robust capability to optimize resource management, which is particularly crucial when the network load reaches full capacity.
The results from the comparison show that the RDBA mechanism outperforms the baseline DBA, particularly under fault conditions. While the baseline DBA experiences significant delays in high-priority traffic (EF and TI) during fault scenarios, the RDBA mechanism mitigates these delays using backup nodes, ensuring that critical traffic maintains low-latency performance even when multiple faults are present. While the RDBA performs better in handling faults and maintaining service continuity, it introduces some complexity in terms of system management and leads to higher delays for low-priority traffic (AF and BE), particularly under high-load conditions. This trade-off highlights the need for balancing fault tolerance and resource management in heavily loaded networks.