 
 
Article

The Diagnosis-Effective Sampling of Application Traces

1 Institute of Mathematics NAS Armenia, Yerevan 0019, Armenia
2 College of Science and Engineering, American University of Armenia, Yerevan 0019, Armenia
3 ML Laboratory, Yerevan State University, Yerevan 0025, Armenia
4 Picsart, Miami, FL 33009, USA
5 Department of Computer Science, University of Chile, Santiago 8330111, Chile
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5779; https://doi.org/10.3390/app14135779
Submission received: 27 April 2024 / Revised: 24 June 2024 / Accepted: 26 June 2024 / Published: 2 July 2024
(This article belongs to the Special Issue Trustworthy Artificial Intelligence (AI) and Robotics)

Abstract:
Distributed tracing is a cutting-edge technology for monitoring, managing, and troubleshooting native cloud applications. It offers more comprehensive and continuous observability than traditional logging methods and is indispensable for navigating modern complex software architectures. However, distributed applications generate a staggering volume of traces, and the direct storage and utilization of every trace is impractical due to the associated operational costs. This necessitates a sampling strategy to select which traces warrant storage and analysis. Historically, sampling methods have relied on rate-based approaches with largely manual configuration. There is a need for a more intelligent approach, and we propose a hierarchical sampling methodology that addresses multiple requirements concurrently. Initial rate-based sampling mitigates the overwhelming volume of traces, as no deeper analysis is feasible at this level. The next stage builds on this foundation with a more nuanced analysis, incorporating information regarding trace properties and ensuring the preservation of vital process details even under extreme conditions. This comprehensive approach not only aids in the visualization and conceptualization of applications but also enables more targeted analysis in later stages. As we delve deeper into the sampling hierarchy, the technique becomes tailored to specific purposes, such as simplifying application troubleshooting. In this context, the sampling strategy prioritizes the retention of erroneous traces from dominant processes, facilitating the identification and resolution of underlying issues. The focus of this paper is to reveal the impact of sampling on troubleshooting efficiency. Leveraging intelligent and explainable artificial intelligence solutions enables the detection of malfunctioning microservices and provides transparent insights into root causes. 
We advocate for using rule-induction systems, which offer explainability and efficacy in decision-making processes. By integrating advanced sampling techniques with machine-learning-driven intelligence, we empower organizations to navigate the complexities of large-scale distributed cloud environments effectively.

1. Introduction

Distributed tracing (DT) has emerged as a crucial tool for effectively monitoring and troubleshooting native cloud applications [1,2,3,4,5,6,7]. It plays a pivotal role in understanding the behavior and performance of applications across distributed environments. Native cloud applications often adopt microservices architecture, decomposing the application into smaller, loosely coupled services. Each service performs a specific function and communicates with others via Application Programming Interfaces (APIs). DT enables the inspection of the flow of requests across the microservices, providing insights into how requests are processed and identifying any bottlenecks or issues within the system [8].
Native cloud applications are designed to be highly scalable and dynamically allocate resources based on demand. DT accommodates this dynamic nature by providing visibility into the performance of services as they scale up or down in response to changes in workload. This ensures that performance issues are detected and addressed in real time, maintaining the application’s reliability and responsiveness. Such applications often utilize containerization and container orchestration via Docker and Kubernetes, respectively. These technologies enable applications to be deployed and managed efficiently in a distributed environment. DT integrates with container platforms to track requests as they traverse containers and pods, providing insights into resource utilization and communication patterns.
DT is typically part of a broader observability stack, including metrics and logging [2,9]. Integration with metrics allows for correlation between trace data and performance metrics, enabling a more profound analysis of application behavior. Similarly, integration with logging platforms provides context around specific events or errors captured in logs, enhancing the troubleshooting process [10,11,12,13,14,15,16,17,18].
Several critical points arise regarding the complexities and challenges associated with DT. One is the sheer volume of traces, many of which correspond to routine and unremarkable requests, often called normal traces. Sampling in DT involves capturing a subset of traces for analysis, instead of storing every trace, to manage costs. This subset includes “interesting” traces encompassing various events in a distributed architecture. Interest can be linked to trace latency, where traces surpassing a latency threshold are selectively sampled. This allows for the identification of performance bottlenecks and areas needing improvement. Interest can also be linked to errors. Erroneous traces or exceptions are sampled to investigate and address system reliability and stability issues. Requests or services can also be prioritized to help determine which traces to sample. High-priority components, such as critical services or specific user interactions, are given preference for sampling. Companies typically adopt one of two strategies for capturing these events.
The common sampling strategy is head-based sampling [2], in which traces are randomly sampled based on a predefined rate or probability, such as 0.1–1% of all traces. This approach relies on the principle that a sufficiently large dataset will capture the most interesting traces. However, it is worth noting that sampling meaningfully diminishes the value of DT. While sampling is necessary to manage costs, it can limit the effectiveness of DT for troubleshooting and debugging purposes. Developers may encounter situations where important traces are missed due to sampling, leading to a loss of trust in the tracing data. Consequently, developers may revert to traditional debugging methods, such as logs, undermining the value of DT [8]. A more intelligent strategy is tail-based sampling, where the system waits until all spans within a request have been completed before determining whether to retain the trace based on its entirety. What truly matters here are the roughly 5% of traces that carry anomalies: errors, exceptions, instances of high latency, or other forms of soft errors. The ideal scenario would involve analyzing the entire set of traces, identifying anomalies, and retaining them for thorough examination. In this paper, we advocate for a multi-purpose and hierarchical strategy combining head-based and tail-based scenarios, focusing on the efficiency of application troubleshooting and root cause analysis (RCA). We consider a multi-layered approach to efficiently capture and preserve different types of information within a system. Overall, this hierarchical sampling approach efficiently manages trace data by prioritizing information based on its relevance and importance for system analysis and troubleshooting. It balances the need to reduce data volume with the requirement to retain critical details for comprehensive understanding and problem resolution.
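The two decision points can be contrasted with a minimal Python sketch. The span fields `error` and `duration_ms` and the latency threshold are hypothetical illustrations, not any specific tracer's API:

```python
import random

def head_based_decision(sample_rate=0.01):
    """Decide at request entry; nothing about the trace content is known yet."""
    return random.random() < sample_rate

def tail_based_decision(spans, latency_threshold_ms=500.0):
    """Decide after the request completes, using the whole trace."""
    has_error = any(s.get("error", False) for s in spans)
    total_ms = sum(s.get("duration_ms", 0.0) for s in spans)
    return has_error or total_ms > latency_threshold_ms

# A trace with one failed span is always retained by the tail-based policy.
trace = [{"duration_ms": 120.0, "error": False},
         {"duration_ms": 45.0, "error": True}]
```

The head-based policy is cheap but blind to trace content; the tail-based policy sees everything but requires buffering all spans until the request ends, which motivates the hierarchical combination advocated here.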
One of the main benefits of DT is improved application performance, resulting in reduced mean time to detect (MTTD) and mean time to repair (MTTR) IT issues. DT enables teams to get to the bottom of application performance issues faster, often before users notice anything wrong. Upon discovering an issue, tracing can rapidly identify the root cause and address it. It also provides early warning capabilities when microservices are in poor health and highlights performance bottlenecks anywhere. However, at the scale of modern distributed environments, this task exceeds what a handful of system administrators or site reliability engineers (SREs) can handle. The solution is an artificial intelligence (AI) for IT Operations (AIOps) strategy that leverages AI-powered software intelligence to automate development, service delivery, and troubleshooting. AIOps is a real game-changer for managing complex IT systems.
In AIOps, where machine learning (ML) algorithms analyze vast amounts of data to automate IT operations and decision-making processes, transparency and interpretability are paramount. Trust in AIOps hinges on implementing explainable AI (XAI) solutions [19]. These techniques enable stakeholders to understand and interpret the decisions made by AI models, build confidence in their reliability, and facilitate human–AI collaboration. By providing insights into how AI algorithms arrive at their conclusions, XAI fosters trust among users, reduces the risk of biased or erroneous outcomes, and enhances the adoption and effectiveness of AIOps solutions. We focus on rule-learning systems and show their efficiency in combination with trace sampling. Rule induction systems, such as RIPPER (Repeated Incremental Pruning to Produce Error Reduction) [20] and C5.0 rules [21], are common examples of XAI solutions.
The paper is organized as follows. Section 2 describes the main trends of trace sampling and rule-induction. Section 3 describes the main idea of our approach. Section 4 discusses the main motivation and contribution of this research. Section 5 explains the methods of trace sampling based on properties when the goal is to preserve all available types, latencies, and errors uniformly. Section 6 presents an application troubleshooting strategy and the impact of noise reduction on a rule induction process. Section 7 summarizes the results and discusses future opportunities. Section 8 provides the list of patents related to this research.

2. Related Work

2.1. Distributed Tracing Vendors

Many companies/vendors are developing DT technologies with some sampling capabilities built in. We will mention a few of them.
  • Jaeger is an open-source DT system developed by Uber Technologies. It is widely used for monitoring and troubleshooting microservices-based applications [22].
  • Zipkin is another open-source DT system originally developed by Twitter. It helps developers gather data for various components of their applications and troubleshoot latency issues [23].
  • LightStep is a commercial company that offers tracing solutions for monitoring and troubleshooting distributed systems [24].
  • Datadog is a monitoring and analytics platform that offers features for collecting and analyzing traces, logs, and metrics from applications and infrastructure [25].
  • Dynatrace is a software intelligence platform that provides features for monitoring and analyzing the performance of applications and infrastructure, including DT capabilities [26].
  • New Relic offers a monitoring and observability platform that includes features for collecting and analyzing traces, logs, and metrics from applications and infrastructure [27].
  • AppDynamics is an application-performance-monitoring solution that provides features for monitoring and troubleshooting applications, including DT capabilities [28].
  • Honeycomb is an observability platform that offers features for collecting and analyzing high-cardinality data, including traces, to debug and optimize applications [29].
  • Grafana is an open-source analytics and visualization platform that can visualize traces, logs, and metrics collected from various sources, including DT systems [30].
  • Elastic is the company behind Elasticsearch, Kibana, and other products in the Elastic Stack. Their solutions offer features for collecting, analyzing, and visualizing traces, logs, and metrics [31].

2.2. Industrial Sampling Examples

Various companies and researchers are developing end-to-end DT technologies in which sampling is one of the core components as the prevailing approach to reducing tracing overheads. Instead of tracing every request, the sampling only captures and persists traces for a subset of requests to the system. To ensure that the captured data are useful, sampling decisions are coherent per request—a trace is either sampled in its entirety, capturing the full end-to-end execution, or not at all. Sampling effectively reduces computational overheads; these overheads are only paid if a trace is sampled, so they can be easily reduced by reducing the sampling probability ([32,33,34,35] with references therein).
Large technology companies such as Google, Microsoft, Amazon, and Facebook often develop methods to monitor and analyze the performance of their own systems and applications. Early tracing systems such as Google’s Dapper [36] and later Facebook’s Canopy [37] make sampling decisions immediately when a request enters the system. This approach, known as head-based sampling, makes decisions uniformly at random, avoiding the runtime cost of generating trace data for unsampled requests; the resulting data are simply a random subset of requests. In practice, sampling rates can be as low as 0.1% [32].
Tail-based sampling is an alternative to head-based sampling. It captures traces for all requests and only decides whether to keep a trace after it has been generated. While OpenTelemetry [38] offers a tail-based sampling collector, its implementation presents challenges, as storing all spans of a trace until the request concludes demands a sophisticated data architecture. One such end-to-end tracing method, which enhances distributed system dependability by dynamically verifying and diagnosing correctness and performance issues, is proposed in [39]. The idea is based on clustering execution graphs to bias sampling towards diverse and representative traces, even when anomalies are rare.
Google has developed DT technologies, such as Google Cloud Trace and OpenCensus, which allow users to collect and analyze trace data from applications running on Google Cloud Platform and other environments. Each Google Cloud service makes its own sampling decisions. When sampling is supported, a service typically implements a default sample rate, a mechanism to use the parent’s sampling decision as a hint as to whether to sample the span, or a maximum sampling rate [40]. OpenCensus provides “Always”, “Never”, “Probabilistic”, and “RateLimiting” samplers. The last one samples at a rate per time window, which by default is 0.1 traces/s [41].
Azure Application Insights sampling [42] aims to reduce telemetry traffic, data, and storage costs while preserving a statistically correct analysis of application data. It enables three different types of sampling: adaptive sampling, fixed-rate sampling, and ingestion sampling. Adaptive filtering automatically adjusts the sampling to stay within the given rate limit. If the application generates low telemetry, like during debugging or low usage, it does not drop items as long as the volume stays under the limits. The sampling rate is adjusted to hit the target volume as the telemetry volume rises. Fixed-rate sampling reduces the traffic sent from web servers and web browsers. Unlike adaptive sampling, it reduces telemetry at a fixed rate that a user decides. Ingestion sampling operates where web servers’, browsers’, and devices’ telemetry reaches the Application Insights service endpoint. Although it does not reduce the telemetry traffic sent from an application, it does reduce the amount processed and retained (and charged for) by Application Insights. Microsoft researchers introduced an observability-preserving trace sampling method, denoted as STEAM, based on Graph Neural Networks, which aims to retain as much information as possible in the sampled traces [43].
In 2023, AWS announced the general availability of the tail sampling processor and the group-by-trace processor in the AWS Distro for OpenTelemetry collector [44]. Advanced sampling refers to a strategy in which the Group By Trace processor and Tail Sampling processor operate together to make sampling decisions based on set policies regarding trace spans. The Group By Trace processor gathers all of the spans of a trace and waits for a pre-defined time before moving them to the next processor. This component is usually used before the tail sampling processor to guarantee that all the spans belonging to the same trace are processed together. Then, the Tail Sampling processor samples traces based on user-defined policies.

2.3. XAI for Application Troubleshooting

One of the main purposes of DT is to ensure application availability and fast troubleshooting in case of performance degradation. However, system administrators can no longer perform real-time decision-making due to the growth of large-scale distributed cloud environments with complicated, invisible underlying processes. Those systems require more advanced and ML/AI-empowered intelligent RCA with explainable and actionable recommendations ([18,45,46,47,48,49,50,51] with references therein).
XAI [19] builds user and AI trust, increases solutions’ satisfaction, and leads to more actionable and robust prediction and RCA models. Many users think it is risky to trust and follow AI recommendations and predictions blindly, and they need to understand the foundation of those insights. Many ML approaches, like decision trees and rule-induction systems, have sufficient explainability capabilities for RCA. They can detect and predict performance degradations and identify the most critical features (processes) potentially responsible for malfunctioning.
In many applications, explainable outcomes can be more valuable than conclusions based on more powerful approaches that act like black boxes. Rule learners are the best choice when the simplicity and human interpretability of the outcomes outweigh raw predictive power. Many rule learners are known; we refer to [52] for a detailed description of the available algorithms, their comparisons, and a historical analysis. It contains relatively rich references and describes several applications [20,21,53,54,55,56]. Rule learning algorithms have a long history in industrial applications [7,13,57,58,59,60,61,62,63,64]. Many such applications utilize classical classifiers like C5.0Rules [21] and JRip, the latter being the Weka implementation of RIPPER.
The recommendations derived from rule-learning systems can be additionally verified regarding uncertainty based on the Dempster–Shafer theory (DST) of evidence [65]. This is an inference framework with uncertainty modeling, where independent sources of knowledge (expert opinions) can be combined for reasoning or decision-making. In [66], it is leveraged to build a classification method that enables a user-defined rule validation mechanism. These rules “encode” expert hypotheses as conditions on the dataset features that might be associated with certain classes while making predictions for observations. The theory provides a “what-if” analysis framework for comprehending the underlying application, as applied in [67]. This means that in the current study context, we can apply the same framework to estimate the effectiveness of the trace sampling approaches while comparing the rule validation results in the pre- and post-sampling stages. DST is overlooked in terms of learning algorithms that could be utilized in predictive system diagnostics ([45], which emphasizes that gap). Therefore, its application in our use-case of RCA-effective trace sampling is an additional novelty that we introduce.

3. The Main Idea

This section describes our multi-layer strategy with the final goal of application troubleshooting. One common approach to collecting traces is to use a tracer based on OpenTelemetry, Zipkin, Jaeger, etc. The tracer is configured to forward traces, metrics, and logs from an application to a proxy. The proxy securely, quickly, and reliably sends data to a Trace Manager. A cloud-scale application generates a large number of traces. The proxy should be configured to sample data to reduce the volume. A common strategy is to apply head-based sampling, also known as a rate-based strategy, which randomly samples without going deep into the context (see Figure 1).
In our case, regardless of the sampling strategy, the tracer always forwards error spans. Based on the traces, it collects and reports request, error, and duration (RED) metrics to provide full application observability. RED metrics measure requests (the number of requests being served per second), errors (the number of failed requests per second), and durations (per-minute histogram distributions of the amount of time that each request takes). These metrics can be used for application troubleshooting.
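For concreteness, RED metrics can be aggregated from a window of spans roughly as follows. This is a sketch under an assumed span schema (`operation`, `error`, `duration_ms`); real tracers emit these metrics as time series rather than dictionaries:

```python
from collections import defaultdict

def red_metrics(spans, window_s=60.0):
    """Aggregate request, error, and duration (RED) metrics per operation."""
    out = defaultdict(lambda: {"requests_per_s": 0.0,
                               "errors_per_s": 0.0,
                               "durations_ms": []})
    for s in spans:
        m = out[s["operation"]]
        m["requests_per_s"] += 1.0 / window_s       # request rate
        if s["error"]:
            m["errors_per_s"] += 1.0 / window_s     # error rate
        m["durations_ms"].append(s["duration_ms"])  # raw data for a histogram
    return dict(out)

# One successful and one failed 'checkout' request in a 60 s window.
spans = [{"operation": "checkout", "error": False, "duration_ms": 35.0},
         {"operation": "checkout", "error": True, "duration_ms": 210.0}]
metrics = red_metrics(spans)["checkout"]
```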
The Trace Manager collects and samples traces based on their properties, such as type, duration, and error, although other properties can be considered. Then, the Trace Manager visualizes and analyzes the traces. At this stage, the main purpose of the property-based sampling is to preserve all available properties for further analysis. Trace-type sampling involves categorizing traces based on their characteristics or origins within the system. On the other hand, duration-based sampling prioritizes the collection of traces associated with significant delays or performance bottlenecks.
Error-based sampling identifies and retains traces associated with errors, exceptions, or other abnormal conditions within the application. Combining these approaches allows the system to store more diverse traces, ensuring that valuable information across different categories and performance metrics is captured. By isolating and storing these traces, developers and system administrators have the necessary data to diagnose and address issues effectively, improving overall system reliability and performance.
However, the basic purpose of the Trace Manager is application troubleshooting in the case of performance degradation. It has diverse tools for different management tasks. One of them is the Traces Browser, which can reveal the context and details of traces. In the Traces Browser, one can search for traces that include spans for a particular operation or examine the spans that belong to a selected trace. It is also possible to view the corresponding RED metrics for troubleshooting purposes.
Another sophisticated troubleshooting engine is an applications map that overviews how the applications and services are linked. It allows us to focus on a specific service, view RED metrics for each service, and filter out trace traffic for the malfunctioning microservices. Here, noise reduction is performed based on the rate-based sampling of erroneous traces. It preserves all dominant normal and erroneous traces and removes the rare ones. This approach is acceptable when the problem has already been detected, and the goal is to explain it.
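This noise-reduction step can be sketched as a frequency filter. The representation of a trace by its type and error status, and the minimum-share threshold, are illustrative assumptions:

```python
from collections import Counter

def reduce_noise(traces, min_share=0.01):
    """Keep traces from dominant (type, error-status) groups; drop rare ones."""
    groups = Counter((t["type"], t["error"]) for t in traces)
    total = len(traces)
    return [t for t in traces
            if groups[(t["type"], t["error"])] / total >= min_share]
```

A group carrying, say, 2% of the traffic is dropped when the threshold is 5%, which matches the intent of preserving all dominant normal and erroneous traces when explaining an already detected problem.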
The final layer applies XAI to the sampled data for RCA, bottleneck detection, and optimization. In particular, rule-learning systems, when applied to the reduced tracing traffic, can generate recommendations in the form of rules that system administrators can inspect to accelerate the remediation of IT issues.

4. Motivations and the Main Contributions

A common method for troubleshooting applications involves utilizing Traces Browser or Application Maps, which enable the examination of malfunctioning microservices. These tools allow for the isolation of trace traffic, facilitating the feeding of trace data into the browser for an in-depth analysis of traces and spans. Following this manual investigative process, the metadata associated with the spans can uncover underlying issues and suggest potential remediation procedures. However, those approaches have serious drawbacks. The volume of traces without sampling is so large that it is not feasible to draw the Application Maps within a realistic period. Moreover, manual inspection requires expertise and a thorough understanding of the application’s internal flows. Additionally, it does not allow us to properly automate the procedure in case of many applications, traces, and spans. Our primary objective was to automate and optimize this intricate process using ML methodologies, focusing on achieving explanations through XAI solutions such as RIPPER or Dempster–Shafer theory. However, we encountered a significant challenge due to the overwhelming volume of traces, even after the initial stage of head-based sampling. Hence, the critical task was the identification of effective sampling strategies to tackle this issue.
Deliberating between ML approaches and simpler statistical methods, we initially opted for the latter for several reasons, including leveraging expert knowledge, maintaining transparency, and retaining control over the sampling process. This decision led us to develop a dynamic streaming solution that proved highly practical. Our next goal was to assess the impact of this sampling technique on XAI methods.
This paper’s contribution is multi-fold. We discuss a trace sampling approach that has been successfully implemented and demonstrate its capability for diverse applications. We developed application trace sampling technology that leverages the statistical properties of traces, such as type, duration, and errors. It is easy to implement and has serious benefits compared to more sophisticated ML approaches. Statistical methods provide more interpretable results; this transparency can be crucial for understanding the underlying processes or relationships in the data. The corresponding parameters are easy to fine-tune and control. Statistical methods are built on clear assumptions about the data distribution, which can help to guide the analysis and ensure the validity of the results. In contrast, ML models are more agnostic to data assumptions, which can sometimes lead to unexpected or inaccurate outcomes. Our approach is also suitable for limited data availability; ML models, on the other hand, often require larger datasets to achieve reliable performance. Finally, statistical methods often allow for the easier incorporation of domain knowledge or expert insights into the analysis process, enhancing the interpretability and relevance of the results.
Then, we explore the impact of sampling on application troubleshooting. We demonstrated that sampling positively impacted rule-learning systems by reducing the runtime and generating clearer rules. Hence, sample size reduction does not diminish the efficiency of the application troubleshooting approach. We also experimented with the Dempster–Shafer classifier, which was successfully implemented and proved applicable to RCA for different IT issues.
However, the efficiency of ML strategies is an open problem. There is a compelling need to explore ML solutions for sampling that offer enhanced scalability, speed, and accuracy, particularly in application domains where such capabilities are crucial [68,69,70,71]. One of the important problems is related to trace type identification. ML methods can categorize traces based on specific characteristics or outcomes as a supervised learning problem. Alternatively, clustering approaches can be used to group similar traces together based on patterns or similarities. Another important problem is anomaly trace detection. We assume that anomalies are known and traces arrive as either normal or abnormal. This is an interesting problem, as anomalies have different sources, and ML approaches can help to identify unusual or abnormal traces that deviate from the “norm”. The next important problem is detecting the change in the dynamic flow of traces. Appropriate actions or decisions in dynamic trace analysis scenarios may be required. In some scenarios, when modeling and forecasting trace data over time, the capture of temporal dependencies can be an important problem. In general, diversifying our focus to include other ML methods could provide valuable insights into their effectiveness within our sampling strategy.
While our primary emphasis has been on rule induction methods, driven by their interpretability features, there is potential value in exploring alternative ML approaches and assessing how sampling strategies influence their performance. Furthermore, investigating how different sampling techniques impact these methods can offer a deeper understanding of their behavior and potential for optimization. In instances where interpretability may be compromised, supplementary techniques such as SHAP [72] or LIME [73] could be employed to mitigate this limitation and enhance the overall interpretability of the results.

5. Trace Sampling Based on Type, Duration, and Errors

Modern distributed applications trigger an enormous number of traces. The direct storage of hundreds of thousands of traces heavily impacts users’ budgets. A sampling strategy is a procedure that decides which traces to store for further utilization. We describe a procedure that samples traces based on their types, duration, and/or errors. The final goal is to preserve sufficient information for application troubleshooting without possible distortion.
A trace type can be approximately identified by the root span or the span that arrived the earliest. Stricter identification is based on the analysis of a trace’s structure. However, the latter method is rather resource-intensive and requires additional grouping/clustering approaches. The durations of traces (in milliseconds) are available in the corresponding metadata. It is possible to extract the durations for different trace types and estimate the average/typical durations of the corresponding processes. The information on errors is also available in trace metadata. Traces are composed of spans, and we have detailed information regarding the spans in the corresponding tags that describe the micro-processes. One of the tags indicates the health of a span; we can assume that the entire trace is erroneous if at least one of its spans has a “True” error label. We suggest a parametric approach for all sampling scenarios, allowing us to modify the required compression rate accordingly.
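The property-extraction logic described above can be sketched as follows. The span schema with `name`, `start_ms`, `duration_ms`, and an `error` tag is an assumption for illustration:

```python
def trace_properties(trace):
    """Extract (type, duration_ms, erroneous) from a list of span dicts."""
    root = min(trace, key=lambda s: s["start_ms"])     # earliest span approximates the root
    trace_type = root["name"]                          # type taken from the root span's name
    duration_ms = root["duration_ms"]                  # trace duration = root span duration
    erroneous = any(s.get("error") is True for s in trace)  # one bad span flags the trace
    return trace_type, duration_ms, erroneous

trace = [{"name": "checkout", "start_ms": 0.0, "duration_ms": 300.0, "error": False},
         {"name": "db.query", "start_ms": 20.0, "duration_ms": 150.0, "error": True}]
```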

5.1. Sampling by Trace Type

Assume that we know how to determine a trace type. Let $M$ be the number of trace types for a specific period and $N_k$, $k = 1, \ldots, M$, be the number of traces of each type. This information should be collected/updated for a recent period, as the number of trace types and the trace-flow velocity are subject to rapid changes in modern applications. Let $N$ be the number of all traces before sampling:
$$N = \sum_{k=1}^{M} N_k,$$
and let $p_k^{(type)} = N_k / N$, $k = 1, \ldots, M$, be the probability of the occurrence of the $k$-th type. Let $r_k^{(type)}$ and $N_k^{*}$, $k = 1, \ldots, M$, be the sampling rate (the percentage/fraction of stored traces) and the number of traces after the sampling of the $k$-th type, respectively:
$$N_k^{*} = r_k^{(type)} N_k, \quad k = 1, \ldots, M,$$
where $N_k^{*}$ is rounded to the closest integer. Also, we denote by $N^{*}$ the total number of all traces after the sampling:
$$N^{*} = \sum_{k=1}^{M} N_k^{*}.$$
The ratio r = N * / N is the final sampling rate. We propose to compute the sampling rate r k as inversely proportional to the probability p k ( t y p e ) :
r k ( t y p e ) = 1 ( p k ( t y p e ) ) α , α > 0 ,
where α is a parameter that a user should determine to satisfy the needed final sampling rate r. Generally, the more frequent a trace type, the lower the corresponding sampling rate r k .
Let us see how the parameter α should be determined if the final user requirement is known:
N * = k = 1 M N k * = k = 1 M r k N k = k = 1 M 1 ( p k ) α N k = N k = 1 M p k 1 ( p k ) α = N G t y p e ( α ) ,
where
r = N * / N = G t y p e ( α ) = k = 1 M p k ( t y p e ) 1 ( p k ( t y p e ) ) α .
This can be considered an analog of the Gini index, showing the total sampling rate. For a specific dataset of traces, we can compute the values of G ( α ) across different α and determine its value to satisfy a user’s requirement. We can try to estimate the proper value of α based on the recent collection of traces (say, for the last 2 h) and dynamically update those values to meet the requirement.
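As an illustrative sketch (the function names and the dictionary-of-counts input are ours, not part of any tracing API), the tuning of $\alpha$ from recent per-type trace counts can exploit the fact that $G_{type}(\alpha)$ grows monotonically in $\alpha$ and bisect to the target rate:

```python
def type_sampling_rates(counts, alpha):
    """Per-type rates r_k = 1 - p_k**alpha, where p_k = N_k / N."""
    total = sum(counts.values())
    return {t: 1.0 - (n / total) ** alpha for t, n in counts.items()}

def g_type(counts, alpha):
    """Total sampling rate G(alpha) = sum_k p_k * (1 - p_k**alpha)."""
    total = sum(counts.values())
    return sum((n / total) * (1.0 - (n / total) ** alpha)
               for n in counts.values())

def solve_alpha(counts, target_rate, tol=1e-6):
    """G(alpha) grows monotonically from 0 toward 1, so bisect to the target."""
    lo, hi = 0.0, 1.0
    while g_type(counts, hi) < target_rate:  # enlarge the bracket if needed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g_type(counts, mid) < target_rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For the two-type example discussed in Section 5.4 (300 traces of type “A”, 130 of type “B”, target rate 0.3), `solve_alpha` returns a value close to the $\alpha = 0.619$ reported there.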
Let us illustrate how this procedure works for a specific application. Figure 2 shows the distribution of traces across different trace types.
There are types with almost 4000 traces and very rare ones with as few as 3 representatives. Overall, we detected 20,425 traces distributed across 28 different trace types. Figure 3 and Table 1 reveal how different rates compress traces across types. We experimented with $\alpha = 1, 0.5, 0.25$, and $0.1$. The sampling rate is always smaller for common trace types and larger for rare ones (see Table 1).

Smaller values of $\alpha$ correspond to smaller final sampling rates. When $\alpha = 1$, we store 17,903 traces across all types: 87.7% of all traces. When $\alpha = 0.5$, we store 13,629 traces (66.7%). For $\alpha = 0.25$, we store 8882 traces (43.5%). When $\alpha = 0.1$, we store 4273 traces (20.9%).

Let us inspect some of the sampling rates for specific types. The most frequent type contains 3996 traces with probability $p = 0.2$. The corresponding sampling rate (for $\alpha = 1$) is 80.5%, so 3215 traces from this class were stored. One of the rarest types contains 3 traces with almost zero probability; its sampling rate is almost 100%, and we store two or three of its traces for the different values of $\alpha$.

The value of $\alpha$ that satisfies a user requirement can be estimated by computing the $G_{type}(\alpha)$ values. Figure 4 explains the procedure. Assume that the required sampling rate is 10%, meaning that we want to store around that percentage of traces. The value $\alpha = 0.044$ provides a sampling rate $r = 0.099$.

This algorithm should be applied when trace types carry important information for an application and both rare and frequent types should be preserved for further analysis. In practice, this means no more than, say, 50 trace types with diverse frequencies. Diversity can be inspected via the Gini index or entropy: if those measures are close to zero, the sampling will effectively try to maximize them, storing nearly equal portions of all types.

5.2. Sampling by a Trace Duration

In this subsection, we consider sampling traces based on their durations. This setup is reasonable if we have a few trace types whose traces have almost the same average durations. We would like to keep a portion of the traces with typical durations together with those having extraordinarily short or long ones.

Assume $N$ traces with durations $\{ d_k \}_{k=1}^{N}$. Let $D(m)$ be the corresponding histogram of durations with $m$ bins:

$$D(m) = \{ n_1, \ldots, n_m \},$$

where $n_s$, $s = 1, \ldots, m-1$, is the number of traces with durations within the interval $[t_{s-1}, t_s)$, and $n_m$ is the number of traces with durations within $[t_{m-1}, t_m]$. Let $p_s^{(dur)}$, $s = 1, \ldots, m$, be the probability that a trace has a duration from the $s$-th bin:

$$p_s^{(dur)} = \frac{n_s}{N}, \quad s = 1, \ldots, m.$$

We determine the sampling rate $r_s^{(dur)}$ of a trace with a duration within the $s$-th bin so that it decreases with the corresponding probability $p_s^{(dur)}$:

$$r_s^{(dur)} = 1 - \left( p_s^{(dur)} \right)^{\beta}, \quad s = 1, \ldots, m, \quad \beta > 0,$$

where $\beta$ is the parameter tuned to meet the user’s requirement.

Let $n_s^*$ be the number of traces after sampling in the $s$-th bin:

$$n_s^* = r_s^{(dur)} n_s.$$

Let $N^*$ be the total number of traces after the sampling:

$$N^* = \sum_{s=1}^{m} n_s^* = \sum_{s=1}^{m} r_s^{(dur)} n_s = N \sum_{s=1}^{m} p_s \left( 1 - (p_s)^{\beta} \right) = N \, G_{dur}(\beta),$$

where

$$r = N^*/N = G_{dur}(\beta) = \sum_{s=1}^{m} p_s^{(dur)} \left( 1 - \left( p_s^{(dur)} \right)^{\beta} \right)$$

is the total sampling rate for different values of the parameter $\beta$.
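Applying these per-bin rates to a precomputed histogram takes only a few lines. The sketch below follows the formulas above; the helper name and the example bin counts are hypothetical:

```python
def duration_sampling(bin_counts, beta):
    """Apply r_s = 1 - p_s**beta to each histogram bin; return the sampled
    per-bin counts (rounded to integers) and the achieved total rate."""
    total = sum(bin_counts)
    sampled = [round(n * (1.0 - (n / total) ** beta)) for n in bin_counts]
    return sampled, sum(sampled) / total
```

Dense bins are compressed aggressively, while sparse bins (unusually short or long durations) retain a much larger fraction of their traces.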
Let us illustrate how this procedure works for the same dataset of traces presented earlier. We will disregard the trace types and look only at the trace durations.

Figure 5 shows the distribution of traces with durations in different time intervals. There are three dominant time intervals in which most traces are concentrated. We can aggressively sample those frequent traces and moderately sample the rare ones outside the dense areas. Table 2 shows how the sampling procedure works for different values of the parameter $\beta$ (we use seven bins).

The first column of Table 2, “Time Intervals”, shows the seven intervals of duration. The intervals have equal lengths, although the bins could be nonuniform. The second column shows the initial number of traces with durations within the corresponding time intervals. The remaining columns show the numbers after sampling with the corresponding $\beta$. The last row of Table 2 reveals the final sampling rates. The smaller the value of $\beta$, the more severe the reduction in traces. The value $\beta = 0.1$ provides a sampling rate of around 13%. Figure 6 visualizes the results of Table 2.

If the requirement is to sample exactly 10% of traces, then Figure 7 shows the estimation of the parameter $\beta$ based on the values of $G_{dur}(\beta)$. The red cross indicates that $\beta = 0.083$ provides a sampling rate of 0.0996.
Proper histogram construction is a crucial milestone for frequency-based sampling. The general problem is the impact of outliers on the bin-construction process. Standard procedures use an equidistant split of the data range, and outliers can enlarge that range, resulting in large bins with distorted resolution. A more accurate procedure should involve outlier detection, preserving the outliers in separate bins and applying the classical procedure to the main part of the data. We perform outlier detection via the MAD (median absolute deviation) algorithm [74] with a slight modification. We define upper and lower baselines as the 0.9 and 0.1 quantiles of the data:

$$M^{(up)} = q_{0.9}(data), \quad M^{(low)} = q_{0.1}(data),$$

where $q_s(data)$ is the $s$-th quantile of the data, $0 \le s \le 1$. We calculate the upper and lower distances:

$$dist^{(up)} = \left| data^{(up)} - M^{(up)} \right|, \quad dist^{(low)} = \left| data^{(low)} - M^{(low)} \right|,$$

where $data^{(up)}$ denotes the data points greater than or equal to $M^{(up)}$ and $data^{(low)}$ the data points smaller than or equal to $M^{(low)}$. We set the upper and lower MADs as

$$MAD^{(up)} = q_{0.8}\left( dist^{(up)} \right), \quad MAD^{(low)} = q_{0.8}\left( dist^{(low)} \right).$$

Based on the MADs, the upper and lower thresholds are defined as

$$upper = \min\left( M^{(up)} + 2.5 \, MAD^{(up)}, \max(data) \right)$$

and

$$lower = \max\left( M^{(low)} - 2.5 \, MAD^{(low)}, \min(data) \right).$$

All data points lying above or below the corresponding thresholds are treated as upper and lower outliers, respectively. We cut the outliers from the data and construct the classical histogram for the remaining data with some predefined number of bins (say, bins = 5). Then, we count the lower and upper outliers and append them to the main histogram from the left and right, respectively.

Hence, if the data have small-value outliers, the first bin of the histogram contains the number of data points within the interval $[\min(data), lower)$. If the data have big-value outliers, the last bin contains the number of data points within the interval $(upper, \max(data)]$. The main part of the histogram covers the data points within the interval $[lower, upper]$. This procedure allows the proper construction of the corresponding histograms with a small number of bins. It is worth noting that the main part of the histogram consists of uniform intervals, while the widths of the first and last bins may differ from the rest. Finally, if the data have outliers on both sides, the corresponding histogram has $bins + 2$ final bins; if the data have outliers on only one side, the final number of bins is $bins + 1$; and no extra bins are added to the classical histogram if the data come without outliers.
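A minimal sketch of this outlier-aware histogram, following the modified-MAD thresholds above (the function name is ours, and the quantile interpolation is NumPy’s default rather than anything prescribed by [74]):

```python
import numpy as np

def robust_histogram(data, bins=5):
    """Outlier-aware histogram: modified-MAD thresholds isolate outliers into
    dedicated edge bins; the main range [lower, upper] gets `bins` uniform bins."""
    data = np.asarray(data, dtype=float)
    m_up, m_low = np.quantile(data, 0.9), np.quantile(data, 0.1)
    mad_up = np.quantile(np.abs(data[data >= m_up] - m_up), 0.8)
    mad_low = np.quantile(np.abs(data[data <= m_low] - m_low), 0.8)
    upper = min(m_up + 2.5 * mad_up, data.max())
    lower = max(m_low - 2.5 * mad_low, data.min())
    n_low = int((data < lower).sum())   # lower outliers -> extra first bin
    n_high = int((data > upper).sum())  # upper outliers -> extra last bin
    main = data[(data >= lower) & (data <= upper)]
    counts = list(np.histogram(main, bins=bins, range=(lower, upper))[0])
    if n_low:
        counts = [n_low] + counts
    if n_high:
        counts = counts + [n_high]
    return counts
```

For data without outliers, the result coincides with the classical equidistant histogram; a single extreme duration ends up alone in an appended edge bin instead of stretching the whole range.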

5.3. Hybrid Approach Based on Types and Durations

The hybrid approach takes into account both trace type and duration information. This algorithm performs accurate sampling if both durations and types are important. Assume that traces are already grouped into trace types. We generate the histogram of durations for each group and apply the procedure described in the previous subsection. We also consider the probability of a trace type. Common types will be sampled more aggressively.
Let
$$H^{(k)} = \left\{ n_1^{(k)}, \ldots, n_m^{(k)} \right\}, \quad k = 1, \ldots, M,$$

be the histogram of durations of the $k$-th trace type and $n_s^{(k)}$ be the number of traces in the $s$-th bin of the $k$-th type. As above, $M$ is the number of different types, and $m$ is the number of bins in the histogram. Let $N_k$ be the number of traces of the $k$-th trace type:

$$N_k = \sum_{s=1}^{m} n_s^{(k)}.$$

Let $N$ be the total number of traces:

$$N = \sum_{k=1}^{M} N_k.$$

Let

$$P^{(k)} = \left\{ p_1^{(k)}, \ldots, p_m^{(k)} \right\}, \quad p_s^{(k)} = \frac{n_s^{(k)}}{N_k}, \quad k = 1, \ldots, M,$$

and

$$P = \{ p_1, \ldots, p_M \}, \quad p_k = \frac{N_k}{N}.$$

We denote the sampling rate of a trace from the $s$-th bin of the $k$-th trace type by $r_s^{(k)}$; it shows the fraction of such traces that should be stored. We compute it from the corresponding probabilities as

$$r_s^{(k)} = 1 - (p_k)^{\alpha} \left( p_s^{(k)} \right)^{\beta},$$

where $\alpha, \beta \ge 0$ are parameters to be tuned to meet the requirement on the final sampling rate $r$.

Let us show how this can be carried out. Let $N_k^*$ be the number of traces of the $k$-th trace type after sampling and $n_s^{(k)*}$ be the number of traces in the $s$-th bin of the $k$-th trace type after sampling:

$$N_k^* = \sum_{s=1}^{m} n_s^{(k)*} = \sum_{s=1}^{m} n_s^{(k)} r_s^{(k)} = N_k \sum_{s=1}^{m} p_s^{(k)} \left( 1 - (p_k)^{\alpha} \left( p_s^{(k)} \right)^{\beta} \right).$$

Let $N^*$ be the total number of traces after sampling across all types and durations:

$$N^* = \sum_{k=1}^{M} N_k^* = \sum_{k=1}^{M} N_k \sum_{s=1}^{m} p_s^{(k)} \left( 1 - (p_k)^{\alpha} \left( p_s^{(k)} \right)^{\beta} \right) = N \, G(\alpha, \beta),$$

where

$$r = \frac{N^*}{N} = G(\alpha, \beta) = \sum_{k=1}^{M} \sum_{s=1}^{m} p_k \, p_s^{(k)} \left( 1 - (p_k)^{\alpha} \left( p_s^{(k)} \right)^{\beta} \right)$$

is the required final sampling rate, which can be accomplished by appropriately selecting the parameters $\alpha$ and $\beta$.
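The double sum $G(\alpha, \beta)$ is cheap to evaluate, so the $(\alpha, \beta)$ surface can be scanned directly. A sketch under our own conventions (a dictionary mapping each trace type to its histogram of duration counts):

```python
def g_hybrid(type_hists, alpha, beta):
    """G(alpha, beta) = sum_k sum_s p_k * p_s(k) * (1 - p_k**alpha * p_s(k)**beta),
    where type_hists maps each trace type to its histogram of duration counts."""
    totals = {k: sum(h) for k, h in type_hists.items()}
    n = sum(totals.values())
    g = 0.0
    for k, hist in type_hists.items():
        p_k = totals[k] / n
        for n_s in hist:
            if n_s == 0:
                continue  # empty bins contribute nothing
            p_s = n_s / totals[k]
            g += p_k * p_s * (1.0 - p_k ** alpha * p_s ** beta)
    return g
```

Since $G$ increases in both parameters, any one-dimensional slice (e.g., $\alpha = \beta$) can be bisected to a target rate, exactly as in the type-only case.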
First, we illustrate this procedure when $\alpha = \beta$. Figure 8 reveals the procedure for a specific trace type containing 3996 traces. The left figure shows the scatter plot of durations; the right figure shows the histogram of durations with 7 bins before and after the samplings. The value $\beta = 0.1$ corresponds to sampling that stores 1104 traces; $\beta = 0.05$ stores 599 traces; $\beta = 0.025$ stores 315 traces; and $\beta = 0.01$ stores 131 traces, corresponding to a 3.3% sampling rate. In addition, the probability of this trace type is 0.2, which also impacts the sampling rates in the $\alpha = \beta$ setting.

Figure 9 shows the hybrid procedure for another trace type with probability 0.0048. It contains only 98 traces and is a rare type compared to the previous example. It also corresponds to the setting $\alpha = \beta$. The right figure shows the sampling results for different values of $\beta$: the value $\beta = 0.01$ stores only 7 traces of this specific type (a 7.1% sampling rate); $\beta = 0.025$ corresponds to an 18.4% sampling rate; $\beta = 0.05$ stores 31 traces (31.6%); and $\beta = 0.1$ corresponds to a 52% sampling rate.
Figure 8. The hybrid approach for a specific trace type (N2 in Figure 10). The left figure shows the plot of durations. The right figure shows the counts of traces in different bins before and after sampling corresponding to different values of α = β .
Figure 10 shows the result of the hybrid approach across different trace types when $\alpha = \beta$. The total sampling rate is 4% for $\alpha = \beta = 0.01$, while $\alpha = \beta = 0.1$ corresponds to a sampling rate of 31%. This means that the values of $\alpha$ and $\beta$ can control the sampling rate over a wide range. If we need to meet a strict requirement, we can turn to the $G(\alpha, \beta)$ values corresponding to the sampling rates. Figure 11 shows the values of $G(\beta, \beta)$, where the red cross corresponds to a sampling rate of 10% (the exact value of $G$ is 0.099) with $\beta = 0.029$. We can also tune the values of $\alpha$ and $\beta$ independently. Figure 12 shows the surface of sampling rates corresponding to different values. For example, a total sampling rate of 10% can be accomplished by $\alpha = 0.03, \beta = 0.022$; $\alpha = 0.026, \beta = 0.029$; $\alpha = 0.011, \beta = 0.056$; $\alpha = 0.023, \beta = 0.035$; $\alpha = 0.01, \beta = 0.058$; etc.

Parameter optimization can be performed without long historical data. We can start with a random value of the parameters and, after each hour, verify the actual compression ratio; then, by increasing or decreasing the values, we achieve the required sampling rate. This works especially well for dynamic applications, where relying on available historical information is impossible.
How, in practice, do we sample a trace? When a specific trace (with known type and duration) arrives and its sampling rate $r$ is known, a random variable from the $Bernoulli(r)$ distribution is generated. This variable has a Boolean outcome: value 1 with probability $r$ and value 0 with probability $1 - r$. If the outcome is 1, we store the trace; otherwise, we discard it. In the long run, this stores traces at the required sampling rates.
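This per-trace decision is a one-liner in Python (the helper name is ours):

```python
import random

def keep_trace(rate, rng=random):
    """Bernoulli(rate) decision: keep the trace with probability `rate`."""
    return rng.random() < rate
```

Over a long stream of traces, the fraction kept converges to the configured rate, with no coordination needed between sampling decisions.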
Returning to the problem of histogram construction, we refer to a powerful approach known as the t-digest algorithm [75], which addresses several problems at once. The first concern is the storage of time series of trace durations: for each trace type, we need to store the corresponding series of durations with sufficient statistics for histogram construction; for the 28 trace types in our experiments, that means 28 time series. The second concern is the construction process itself: each time we need the histograms, the duration data must be sorted, and the procedure must be repeated whenever new traces arrive. The third concern is the impact of outliers. The t-digest is an efficient solution to all of these problems. Instead of storing the entire time series, it stores only cluster centroids and the data counts in each cluster. Its efficient merging approach allows different t-digest summaries to be combined, streamlining the entire process. The t-digest is also very precise when estimating extreme quantiles (close to 0 and 1), making the procedure robust to outliers.

5.4. The Sampling of Erroneous Traces

In specific frameworks, sampling should preserve other important properties besides duration and type. One such property is the normality/abnormality of a trace, which characterizes microservices’ performance and could be used for troubleshooting and root-cause analysis. In the described approach, we cannot guarantee a sufficient number of important erroneous traces in the sampled dataset, as there was no requirement to preserve them. Now, we will try to address that requirement by discussing various approaches.
The first approach is the natural modification of the previous one, ensuring the existence of enough erroneous traces in each trace type after the sampling. More precisely, we verify the existence of erroneous traces in each trace type. If they exist, we divide the corresponding type into two types containing only normal or erroneous traces. Then, we can apply the approach that has already been considered (sampling by type, duration, or hybrid) to the same dataset but with renewed trace types.
Dividing erroneous traces into additional types according to the corresponding error codes is also possible. Assume that trace type “A” has erroneous traces containing the error codes “400-Bad Request” and “401-Unauthorized”. Then, we can divide it into three new trace types: “A-we” (without errors), “A-400” (with the 400 code), and “A-401” (with the 401 code). It is also possible to divide erroneous traces into types by their error codes independently of the original types; for example, all traces with “400” error codes can be collected in a single type, regardless of their original types. Let us show how this approach works.
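The regrouping step can be sketched as follows; the `"type"`/`"error_code"` field names are our own illustration rather than an actual trace schema:

```python
from collections import Counter

def refine_types(traces):
    """Split each trace type into error-code subtypes: "A-we" for error-free
    traces of type "A", "A-400" for its traces with a 400 code, and so on.
    The "type"/"error_code" field names are illustrative."""
    refined = Counter()
    for t in traces:
        code = t.get("error_code")
        refined[f'{t["type"]}-we' if code is None else f'{t["type"]}-{code}'] += 1
    return refined
```

The resulting counts feed directly into the type-based (or hybrid) sampling described earlier, now with erroneous traces guaranteed their own types.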
Assume two trace types: for example, 300 traces of type “A” and 130 of type “B”, and we must store only 30% of the traces. Sampling based only on the trace types provides the value $\alpha = 0.619$, which means storing 60 traces of type “A” and 69 of type “B”. Figure 13 illustrates those numbers.
Now, let us also consider erroneous traces. Assume type “A” has 200 normal and 100 erroneous traces; thus, types “A” and “A*” contain 200 and 100 traces, respectively. Then, assume type “B” has 60 normal and 70 erroneous traces; thus, “B” and “B*” contain 60 and 70 traces, respectively. Now, instead of two types, there are four. The type-based sampling provides the value $\alpha = 0.283$, which leads to the sampling rates 0.2, 0.34, 0.43, and 0.4 for “A”, “A*”, “B”, and “B*”, respectively.
Figure 14 shows the distributions before and after sampling. As a result, from 260 normal and 170 erroneous traces, the approach sampled 66 and 63, respectively (see the right figure of Figure 14). Now, we can guarantee that the final sampled set also contains erroneous traces across different types. It is possible to apply the hybrid approach that will also consider the duration of the erroneous traces.
The second approach tries to control the percentage of erroneous traces in the sampled dataset more tightly. Let $0 < h < 1$ be the final required sampling rate, and let $h_e$ and $h_n$ be the sampling rates of erroneous and normal traces, respectively:

$$N_n^* = h_n N_n$$

and

$$N_e^* = h_e N_e,$$

where $N_n$ and $N_e$ are the numbers of normal and erroneous traces before sampling, respectively, and $N_n^*$ and $N_e^*$ the numbers after sampling.

Can we also impose a requirement on $h_e$? We have

$$\frac{N_n^* + N_e^*}{N} = h_n \frac{N_n}{N} + h_e \frac{N_e}{N} = h,$$

and hence

$$h_n = \frac{N}{N_n} \left( h - h_e \frac{N_e}{N} \right) = \frac{h N - h_e N_e}{N_n}.$$

If the requirements on $h$ and $h_e$ lead to a value $0 < h_n < 1$, they can be accomplished. Otherwise, if $h_n$ turns out to be negative, the requirements are infeasible, and we must ask to change them.
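The feasibility check reduces to one formula; a minimal sketch (the function name is ours):

```python
def normal_rate(n_normal, n_error, h, h_e):
    """Solve h_n from h = (h_n * N_n + h_e * N_e) / N.

    Returns None when the pair (h, h_e) is infeasible, i.e., h_n falls
    outside (0, 1)."""
    n = n_normal + n_error
    h_n = (h * n - h_e * n_error) / n_normal
    return h_n if 0.0 < h_n < 1.0 else None
```

With 260 normal and 170 erroneous traces, $h = 0.3$ and $h_e = 0.6$ yield $h_n \approx 0.104$, matching the example below; demanding $h_e = 0.9$ under the same total rate would be rejected as infeasible.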
Let us return to the previous example with two trace types. Can we sample 30% of the traces while preserving 60% of the erroneous ones? We input $h = 0.3$ and $h_e = 0.6$ and find the appropriate value of $h_n$. Our calculations show that $h_n = 0.104$ works: we preserve 60% of erroneous traces and roughly 10% of normal traces while sampling 30% of all traces. Figure 15 illustrates the choices.

6. Troubleshooting of Applications

The ultimate goal of DT is to monitor application performance, detect malfunctioning microservices, and explain the root causes of problems so that they can be resolved quickly. This is feasible by inspecting the tracing traffic passing through specific microservices, detecting erroneous traces/spans, and trying to explain their origin. Thus, the explainability of ML models is the most crucial property. We consider two highly explainable approaches.
One is RIPPER [20], a state-of-the-art method in inductive rule learning [53]. It has important technical characteristics, such as support for missing values, numerical and categorical variables, and multiple classes. We experimented with Weka’s RIPPER implementation, JRip [76]. The application of RIPPER to the inspection of erroneous traces is discussed in [7]. We will not go into such details here but will show the impact of noise reduction on the rules.
The other is the Dempster–Shafer classifier considered in [66]. It is a far more complex approach but has interesting implications. One important characteristic is its ability to measure the uncertainty of specific rules. It is interesting to compare the uncertainties of the same rules before and after sampling.
The analysis of tracing traffic passing through a specific microservice starts with data preprocessing. We transform the trace traffic into tabular data to apply ML algorithms. A single trace shows an individual request passage through the microservices. It contains a series of tagged time intervals known as spans. A span contains metadata known as tags and application tags for better process resolution. The traffic can contain hundreds or thousands of traces with different numbers of spans and tags, which are application-specific.
In our examples [7], each trace is identified by its trace-ID, followed by the list of spans with process names and, for each span, the corresponding tags. During the preprocessing stage, we remove fields containing redundant information, such as “traceID”, “spanID”, “startMs”, and many others. It is straightforward to denoise trace traffic based on expert knowledge or user feedback. First, we preprocess span names: we make a list of all distinct span names in the trace traffic and use them as the column names of a dataframe. Then, for each trace (a row), we check the names of its spans and enter the value 1 in the corresponding columns. The other columns contain missing values, as that specific trace does not contain those spans, so the entire dataframe consists of ones and missing values. Fortunately, the RIPPER and Dempster–Shafer classifiers ignore missing values and explain the output based on the spans present in a trace. Eventually, the number of rows in the dataframe equals the number of traces in the traffic, and the number of columns equals the number of distinct spans (processes).
We label traces via the tag “error”, where the value “true” indicates that the corresponding trace is erroneous; a missing “error” tag indicates that the trace is normal. It can also be useful to add a column indicating the trace type. Second, we construct columns revealing metadata information: we put the metadata value in the corresponding cell, where the row corresponds to the trace and the column to the distinct metadata name.
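The span-presence encoding described above can be sketched without any dataframe library; the `"spans"`/`"error"` field names below are our own illustration of the input, not the actual trace schema:

```python
def traces_to_table(traces):
    """One row per trace: span name -> 1 for spans present in the trace, plus
    an "error" label; absent spans are simply missing keys, playing the role
    of missing values. The "spans"/"error" field names are illustrative."""
    columns = sorted({s for t in traces for s in t["spans"]})
    rows = [{**{s: 1 for s in t["spans"]},
             "error": bool(t.get("error", False))} for t in traces]
    return columns, rows
```

The (columns, rows) pair maps directly onto a ones-and-missing-values dataframe of the shape described above, ready for a rule-induction classifier.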
Tracing traffic passing through a malfunctioning microservice can still contain many traces, even after a series of samplings. In this final stage, we apply the sampling of erroneous traces to reduce the volume and simplify the application of rule induction methods. We aim to understand how the sampling impacts the rule-generation process in terms of precision, recall, or uncertainty. It is worth noting that in this stage, the goal of the sampling is not to preserve all possible types and durations of erroneous traces. Exactly the opposite: the goal is to remove rare erroneous traces and preserve dominant/common ones, as the latter most probably explain the problems of a microservice. We separate normal and erroneous traces by trace type and then sample all groups at the same rate; as a result, some (rare) groups may vanish. Error-based sampling may downplay the influence of rare events by sampling them less frequently or excluding them altogether. We show how this reduction in outliers helps stabilize the analyses and makes recommendations more robust to extreme cases.
Then, we apply rule-induction classifiers to explain the origin of erroneous traces. These learning systems generate understandable rules based on input data. Humans can easily comprehend these rules, which provide clear insights into the decision-making process of the AI model. Unlike black-box ML models such as deep neural networks, where it is often challenging to understand how the model arrives at its predictions, rule induction systems offer transparency. Users can inspect the rules to understand the underlying logic and reasoning behind the rules. This enhances the trustworthiness of the AI system.
By interpreting the induced rules, developers and system administrators can gain insights regarding the possible remediation of issues. The rules provide transparent and interpretable explanations of certain errors occurring under specific conditions or scenarios. Armed with insights from the RCA, appropriate measures can be taken to optimize the system, fix bugs, or implement preventive measures to reduce the likelihood of similar errors occurring in the future.
Let us apply this procedure to a real customer cloud environment. The visualization of tracing traffic is known as an application map. It shows which microservices malfunction and which traces must be collected for troubleshooting. We collected 5899 traces with some 4917 features that should explain the origin of errors. We had 3145 normal and 2754 erroneous traces. JRip (RIPPER) ran for 11.09 s, outputting 13 rules (see Figure 16).
The first 12 rules describe the erroneous traces. They are mostly connected with the trace type; only one rule involves the tag “-annotations-matched”. The fraction at the end of each rule shows how many traces the rule fired on (numerator) and how many it misfired on (denominator). The first two rules have large coverage and are 100% precise; the others are likely connected with noise in the dataset, which we hope the sampling will remove. Overall, the accuracy of this classifier is 99.4%, and the precision and recall of both classes exceed 99%.
Figure 17 shows the distribution of normal traces across different types before (the top figure) and after (the bottom figure) the sampling. There are many types containing just a single representative. The sampling preserves only the common groups and removes all rare types. Here, we applied the 10 % sampling rate. Similarly, Figure 18 shows the distribution of erroneous traces. In both figures, the labels on the horizontal axes show the names of trace types. We identified the trace type by the root span.
Figure 19 reveals the JRip rules after the sampling, which preserved the first two important rules and removed the others connected with the noise. The sampled dataset contained 300 normal and 268 erroneous traces. The classifier applied to the sampled dataset showed 99.1% accuracy, misclassifying 5 traces, and the execution time was 0.1 s. As a result, the sampling reduced the execution time, especially for noisy datasets. Moreover, it shortened the list of top recommendations so that users can focus on the most important ones.
As we mentioned before, sampling degrades the statistical evidence, yet the classifiers did not register this; quite the opposite, the accuracy of the classifiers increased or remained almost the same. We can instead refer to the Dempster–Shafer theory and compare the uncertainties of the rules. The theory of belief functions, also referred to as evidence theory or Dempster–Shafer theory, is a general framework for reasoning with uncertainty; it offers an alternative to traditional probability theory for the mathematical representation of uncertainty. Figure 20 shows how the sampling affects the uncertainties of the first two rules before and after the sampling.

7. Conclusions and Future Work

DT is essential for increasing the visibility of complex interactions and dependencies within native cloud applications. By tracing the requests across microservices, containers, and dynamic environments, users can effectively monitor, manage, and troubleshoot their native cloud applications to ensure optimal performance and reliability. Despite the efforts, the adoption of this technology encounters several challenges, and one of the most crucial is the volume of data and the corresponding resource consumption needed to handle it.
Sampling is a technique that alleviates the overhead in collecting, storing, and processing vast amounts of trace data. It reduces the volume of data by selectively capturing only a fraction of traces. This approach offers benefits such as decreased latency, reduced resource consumption, and improved scalability. However, sampling introduces challenges, particularly in maintaining representative samples and preserving the accuracy of analysis results. Striking a balance between sampling rate and data fidelity is crucial to ensure effective troubleshooting and performance analysis in distributed systems. Overall, DT sampling plays a vital role in managing the complexity of distributed environments while optimizing resource utilization and maintaining analytical efficacy.
We explored several approaches for sampling that preserved traces with specific properties. One such property was the trace type, which described transaction similarity. Traces of the same type should normally have almost the same structure. The goal of such sampling was to preserve traces across all available types. Another property was the trace duration, which may or may not be combined with the information regarding the type. Trace durations were important as they described the transaction duration: similar transactions had typical/average durations, while atypical durations indicated a transaction/microservice malfunction. The next important trace characteristic was its normality. Erroneous traces carried important information regarding problems; hence, it was natural to try to keep all their representatives while monitoring the performance of microservices. The flow of erroneous traces would show which microservices had degraded performance. Further, those traces were critical sources of information for troubleshooting application issues.
We sampled only dominant/common errors at the troubleshooting stage, trying to remove the rare ones. This removed the noise and made explanations more confident. RCA could be performed by tracing traffic passing through a malfunctioning microservice. Rule-learning ML methods could help generate explicit rules that explain the problems and clarify the remediation process. We showed how rule-induction systems like RIPPER solved this problem and provided recommendations that system administrators could follow to accelerate the resolution process. We also showed that sampling could dramatically decrease the program execution time and provide more clear recommendations. However, as mentioned before, sampling degraded the statistics, and sometimes rare but important evidence could escape the analysis.
Several limitations of our approach are crucial to consider. Acknowledging and addressing these limitations is vital for enhancing the robustness and applicability of the sampling strategy. Firstly, a significant limitation is the dependence of trace types on the root span. This approach overlooks the fact that a substantial volume of traces lack a root span, necessitating the utilization of the first span instead. Overall, this introduces potential inaccuracies in type definitions, as many traces may exhibit distinct structures not adequately captured by this method. Secondly, the procedure for calculating trace durations may be prone to misinterpretation due to the challenges associated with defining trace types, as previously discussed. Thirdly, reliance on the system-defined definition of erroneous traces may not always align directly with the underlying issues that rule induction systems aim to address. This mismatch could hinder the effectiveness of the rule induction process by focusing on inaccurately identified erroneous traces. Fourthly, the current sampling strategy is primarily based on trace types and durations, overlooking potentially crucial properties for diverse applications. Lastly, the approach’s scalability is constrained by its lack of full streaming capability, which could pose challenges in efficiently handling large volumes of trace data.
In future work, we plan to explore the scalability of our strategy to larger datasets and more complex applications: we must investigate how it performs as the volume of data increases and consider further adjustments to maintain efficiency. Our sampling strategy is based on the statistical properties of application traces, and we intend to refine it by exploring different ML solutions and examining how they influence the efficiency of the process; this can involve experimenting with various algorithms and parameters to optimize the sampling approach. It is also important to develop specific evaluation metrics that quantify the impact of the sampling strategy on rule-induction methods, such as accuracy, efficiency, the interpretability of the generated rules, or other relevant performance indicators. Based on those indicators, we should conduct comparative studies against existing sampling methods and other rule-induction techniques to highlight the peculiarities of our approach. Finally, examining related research areas, such as predictive modeling, anomaly detection, and decision support systems, will help identify the impact of sampling strategies more broadly and uncover new avenues for exploration.

8. Patents

Poghosyan, A.; Harutyunyan, A.; Grigoryan, N.; Pang, C.; Oganesyan, G.; Baghdasaryan, D. Automated methods and systems that facilitate root-cause analysis of distributed-application operational problems and failures by generating noise-subtracted call-trace-classification rules. Filed: 1 October 2021. Application No.: US 17/492,099. Patent No.: US 11880272 B2. Granted: 23 January 2024.
Poghosyan, A.; Harutyunyan, A.; Grigoryan, N.; Pang, C.; Oganesyan, G.; Baghdasaryan, D. Automated methods and systems that facilitate root cause analysis of distributed-application operational problems and failures. Filed: 1 October 2021. Application No.: US 17/491,967. Patent No.: US 11880271 B2. Granted: 23 January 2024.
Poghosyan, A.; Harutyunyan, A.; Grigoryan, N.; Pang, C.; Oganesyan, G.; Avagyan, K. Methods and systems for intelligent sampling of application traces. Application filed by VMware LLC in 2021. Application No.: US 17/367,490. Patent No.: US 11940895 B2. Granted: 26 March 2024.
Poghosyan, A.V.; Harutyunyan, A.N.; Grigoryan, N.M.; Pang, C.; Oganesyan, G.; Avagyan, K. Methods and systems for intelligent sampling of normal and erroneous application traces. Application filed by VMware LLC in 2021. Application No.: US 17/374,682. Published as US 20220291982 A1 in 2022.
Poghosyan, A.V.; Harutyunyan, A.N.; Grigoryan, N.M.; Pang, C.; Oganesyan, G.; Baghdasaryan, D. Automated methods and systems that facilitate root cause analysis of distributed-application operational problems and failures. Application filed by VMware LLC in 2021. Application Nos.: US 17/491,967 and US 17/492,099.
Grigoryan, N.M.; Poghosyan, A.; Harutyunyan, A.N.; Pang, C.; Nag, D.A. Methods and Systems that Identify Dimensions Related to Anomalies in System Components of Distributed Computer Systems using Clustered Traces, Metrics, and Component-Associated Attribute Values. Filed: 12 December 2020. Application No.: US 17/119,462. Patent No.: US 11,416,364 B2. Granted: 16 August 2022.

Author Contributions

Conceptualization, A.P., A.H. and N.B.; Data curation, E.D. and K.P.; Formal analysis, A.P. and A.H.; Investigation, A.P. and A.H.; Methodology, A.P. and A.H.; Project administration, N.B.; Resources, N.B.; Software, E.D. and K.P.; Supervision, N.B.; Validation, A.P. and A.H.; Visualization, E.D. and K.P.; Writing—original draft, A.P. and A.H.; Writing—review and editing, A.P., A.H., E.D., K.P. and N.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by ADVANCE Research Grants from the Foundation for Armenian Science and Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Acknowledgments

We sincerely appreciate the reviewers’ commitment, exceptional dedication, and invaluable contributions. Their efforts and insightful feedback have greatly enriched the quality and depth of this paper.

Conflicts of Interest

Author Edgar Davtyan was employed by the company Picsart. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
AIOps: AI for IT Operations
API: Application Programming Interface
AWS: Amazon Web Services
DST: Dempster–Shafer Theory
DT: Distributed Tracing
IT: Information Technologies
MAD: Median Absolute Deviation
ML: Machine Learning
MTTD: Mean Time to Detect
MTTR: Mean Time to Repair
SRE: Site Reliability Engineer
XAI: Explainable Artificial Intelligence

References

1. Parker, A.; Spoonhower, D.; Mace, J.; Sigelman, B.; Isaacs, R. Distributed Tracing in Practice: Instrumenting, Analyzing, and Debugging Microservices; O'Reilly Media: Sebastopol, CA, USA, 2020.
2. Shkuro, Y. Mastering Distributed Tracing: Analyzing Performance in Microservices and Complex Systems; Packt Publishing: Birmingham, UK, 2019.
3. OpenTracing. What Is Distributed Tracing? 2019. Available online: https://opentracing.io/docs/overview/what-is-tracing/ (accessed on 26 January 2021).
4. Cai, Z.; Li, W.; Zhu, W.; Liu, L.; Yang, B. A real-time trace-level root-cause diagnosis system in Alibaba datacenters. IEEE Access 2019, 7, 142692–142702.
5. Liu, D.; He, C.; Peng, X.; Lin, F.; Zhang, C.; Gong, S.; Li, Z.; Ou, J.; Wu, Z. MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Madrid, Spain, 25–28 May 2021; pp. 338–347.
6. Poghosyan, A.V.; Harutyunyan, A.N.; Grigoryan, N.M.; Pang, C. Root Cause Analysis of Application Performance Degradations via Distributed Tracing. In Proceedings of the Third CODASSCA Workshop on Collaborative Technologies and Data Science in Artificial Intelligence Applications, Yerevan, Armenia, 23–26 August 2022; pp. 27–31.
7. Poghosyan, A.; Harutyunyan, A.; Grigoryan, N.; Pang, C. Distributed Tracing for Troubleshooting of Native Cloud Applications via Rule-Induction Systems. JUCS J. Univers. Comput. Sci. 2023, 29, 1274–1297.
8. Distributed Tracing—Past, Present and Future. 2023. Available online: https://www.zerok.ai/post/distributed-tracing-past-present-future (accessed on 25 June 2024).
9. Young, T.; Parker, A. Learning OpenTelemetry; O'Reilly Media: Sebastopol, CA, USA, 2024.
10. Cotroneo, D.; De Simone, L.; Liguori, P.; Natella, R. Run-time failure detection via non-intrusive event analysis in a large-scale cloud computing platform. J. Syst. Softw. 2023, 198, 111611.
11. Zhang, X.; Lin, Q.; Xu, Y.; Qin, S.; Zhang, H.; Qiao, B.; Dang, Y.; Yang, X.; Cheng, Q.; Chintalapati, M.; et al. Cross-dataset Time Series Anomaly Detection for Cloud Systems. In Proceedings of the 2019 USENIX Annual Technical Conference (USENIX ATC 19), Renton, WA, USA, 10–12 July 2019; pp. 1063–1076.
12. Abad, C.; Taylor, J.; Sengul, C.; Yurcik, W.; Zhou, Y.; Rowe, K. Log correlation for intrusion detection: A proof of concept. In Proceedings of the 19th Annual Computer Security Applications Conference, Las Vegas, NV, USA, 8–12 December 2003; pp. 255–264.
13. Suriadi, S.; Ouyang, C.; van der Aalst, W.; ter Hofstede, A. Root cause analysis with enriched process logs. In Proceedings of the Business Process Management Workshops, International Workshop on Business Process Intelligence (BPI 2012), Tallinn, Estonia, 3–6 September 2012; pp. 174–186.
14. BigPanda. Incident Management. 2020. Available online: https://docs.bigpanda.io/docs/incident-management (accessed on 26 January 2021).
15. Josefsson, T. Root-Cause Analysis through Machine Learning in the Cloud. Master's Thesis, Uppsala Universitet, Uppsala, Sweden, 2017.
16. Tak, B.; Tao, S.; Yang, L.; Zhu, C.; Ruan, Y. LOGAN: Problem diagnosis in the cloud using log-based reference models. In Proceedings of the 2016 IEEE International Conference on Cloud Engineering (IC2E), Berlin, Germany, 4–8 April 2016; pp. 62–67.
17. Mi, H.; Wang, H.; Zhou, Y.; Lyu, M.R.; Cai, H. Localizing root causes of performance anomalies in cloud computing systems by analyzing request trace logs. Sci. China Inf. Sci. 2012, 55, 2757–2773.
18. Poghosyan, A.; Harutyunyan, A.; Grigoryan, N.; Pang, C.; Oganesyan, G.; Ghazaryan, S.; Hovhannisyan, N. An Enterprise Time Series Forecasting System for Cloud Applications Using Transfer Learning. Sensors 2021, 21, 1590.
19. Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115.
20. Cohen, W.W. Fast Effective Rule Induction. In Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA, USA, 9–12 July 1995; Morgan Kaufmann: San Francisco, CA, USA, 1995; pp. 115–123.
21. Quinlan, J.R. C4.5: Programs for Machine Learning; Elsevier: Amsterdam, The Netherlands, 2014.
22. Jaeger: Sampling. 2024. Available online: https://www.jaegertracing.io/docs/1.55/sampling/ (accessed on 25 June 2024).
23. Anomaly Detection in Zipkin Trace Data. 2024. Available online: https://engineering.salesforce.com/anomaly-detection-in-zipkin-trace-data-87c8a2ded8a1/ (accessed on 25 June 2024).
24. LightStep: Sampling, Verbosity, and the Case for (Much) Broader Applications of Distributed Tracing. 2024. Available online: https://medium.com/lightstephq/sampling-verbosity-and-the-case-for-much-broader-applications-of-distributed-tracing-f3500a174c17 (accessed on 25 June 2024).
25. Datadog: Trace Sampling Use Cases. 2024. Available online: https://docs.datadoghq.com/tracing/guide/ingestion_sampling_use_cases/ (accessed on 25 June 2024).
26. Partial Trace Sampling: A New Approach to Distributed Trace Sampling. 2024. Available online: https://engineering.dynatrace.com/blog/partial-trace-sampling-a-new-approach-to-distributed-trace-sampling/ (accessed on 25 June 2024).
27. New Relic: Technical Distributed Tracing Details. 2024. Available online: https://docs.newrelic.com/docs/distributed-tracing/concepts/how-new-relic-distributed-tracing-works/#sampling (accessed on 25 June 2024).
28. OpenTelemetry Trace Sampling. 2024. Available online: https://docs.appdynamics.com/observability/cisco-cloud-observability/en/application-performance-monitoring/opentelemetry-trace-sampling (accessed on 25 June 2024).
29. When to Sample. 2024. Available online: https://docs.honeycomb.io/manage-data-volume/sample/guidelines/ (accessed on 25 June 2024).
30. An Introduction to Trace Sampling with Grafana Tempo and Grafana Agent. 2024. Available online: https://grafana.com/blog/2022/05/11/an-introduction-to-trace-sampling-with-grafana-tempo-and-grafana-agent/ (accessed on 25 June 2024).
31. Application Performance Monitoring: Transaction Sampling. 2024. Available online: https://www.elastic.co/guide/en/observability/current/apm-sampling.html (accessed on 25 June 2024).
32. Las-Casas, P.; Papakerashvili, G.; Anand, V.; Mace, J. Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering. In Proceedings of the ACM Symposium on Cloud Computing, New York, NY, USA, 20–23 November 2019; pp. 312–324.
33. Thereska, E.; Salmon, B.; Strunk, J.; Wachs, M.; Abd-El-Malek, M.; Lopez, J.; Ganger, G.R. Stardust: Tracking activity in a distributed storage system. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’06), Saint-Malo, France, 26–30 June 2006.
34. Sambasivan, R.R.; Zheng, A.X.; Rosa, M.D.; Krevat, E.; Whitman, S.; Stroucken, M.; Wang, W.; Xu, L.; Ganger, G.R. Diagnosing performance changes by comparing request flows. In Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI), Boston, MA, USA, 30 March–1 April 2011.
35. Fonseca, R.; Porter, G.; Katz, R.H.; Shenker, S.; Stoica, I. X-trace: A pervasive network tracing framework. In Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation, Cambridge, MA, USA, 11–13 April 2007; p. 20.
36. Sigelman, B.H.; Barroso, L.A.; Burrows, M.; Stephenson, P.; Plakal, M.; Beaver, D.; Jaspan, S.; Shanbhag, C. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure; Technical Report; Google, Inc.: Menlo Park, CA, USA, 2010.
37. Kaldor, J.; Mace, J.; Bejda, M.; Gao, E.; Kuropatwa, W.; O’Neill, J.; Ong, K.W.; Schaller, B.; Shan, P.; Viscomi, B.; et al. Canopy: An End-to-End Performance Tracing and Analysis System. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, 28–31 October 2017; pp. 34–50.
38. OpenTelemetry. 2024. Available online: https://opentelemetry.io/ (accessed on 25 June 2024).
39. Las-Casas, P.; Mace, J.; Guedes, D.; Fonseca, R. Weighted Sampling of Execution Traces: Capturing More Needles and Less Hay. In Proceedings of the ACM Symposium on Cloud Computing, Carlsbad, CA, USA, 11–13 October 2018; pp. 326–332.
40. Google Cloud Observability: Trace Sampling. 2024. Available online: https://cloud.google.com/trace/docs/trace-sampling (accessed on 25 June 2024).
41. OpenCensus: Sampling. 2024. Available online: https://opencensus.io/tracing/sampling/ (accessed on 25 June 2024).
42. Azure Monitor: Sampling in Application Insights. 2024. Available online: https://learn.microsoft.com/en-us/azure/azure-monitor/app/sampling-classic-api (accessed on 25 June 2024).
43. He, S.; Feng, B.; Li, L.; Zhang, X.; Kang, Y.; Lin, Q.; Rajmohan, S.; Zhang, D. STEAM: Observability-Preserving Trace Sampling. In Proceedings of the FSE’23 Industry Track, San Francisco, CA, USA, 3–9 December 2023.
44. AWS: Advanced Sampling Using ADOT. 2024. Available online: https://aws-otel.github.io/docs/getting-started/advanced-sampling#best-practices-for-advanced-sampling (accessed on 25 June 2024).
45. Solé, M.; Muntés-Mulero, V.; Rana, A.I.; Estrada, G. Survey on models and techniques for root-cause analysis. arXiv 2017, arXiv:1701.08546.
46. Harutyunyan, A.N.; Poghosyan, A.V.; Grigoryan, N.M.; Hovhannisyan, N.A.; Kushmerick, N. On machine learning approaches for automated log management. JUCS J. Univers. Comput. Sci. 2019, 25, 925–945.
47. Poghosyan, A.; Harutyunyan, A.; Grigoryan, N.; Kushmerick, N. Incident Management for Explainable and Automated Root Cause Analysis in Cloud Data Centers. JUCS J. Univers. Comput. Sci. 2021, 27, 1152–1173.
48. Poghosyan, A.V.; Harutyunyan, A.N.; Grigoryan, N.M. Managing cloud infrastructures by a multi-layer data analytics. In Proceedings of the 2016 IEEE International Conference on Autonomic Computing (ICAC 2016), Wuerzburg, Germany, 17–22 July 2016; Kounev, S., Giese, H., Liu, J., Eds.; IEEE Computer Society: Piscataway, NJ, USA, 2016; pp. 351–356.
49. Marvasti, M.A.; Poghosyan, A.V.; Harutyunyan, A.N.; Grigoryan, N.M. Pattern detection in unstructured data: An experience for a virtualized IT infrastructure. In Proceedings of the 2013 IFIP/IEEE International Symposium on Integrated Network Management (IM 2013), Ghent, Belgium, 27–31 May 2013; Turck, F.D., Diao, Y., Hong, C.S., Medhi, D., Sadre, R., Eds.; IEEE: Piscataway, NJ, USA, 2013; pp. 1048–1053.
50. Reynolds, P.; Killian, C.E.; Wiener, J.L.; Mogul, J.C.; Shah, M.A.; Vahdat, A. Pip: Detecting the Unexpected in Distributed Systems. In Proceedings of the Symposium on Networked Systems Design and Implementation, San Jose, CA, USA, 8–10 May 2006.
51. Harutyunyan, A.; Poghosyan, A.; Harutyunyan, L.; Aghajanyan, N.; Bunarjyan, T.; Vinck, A.H. Challenges and Experiences in Designing Interpretable KPI-diagnostics for Cloud Applications. JUCS J. Univers. Comput. Sci. 2023, 29, 1298–1318.
52. Fürnkranz, J.; Gamberger, D.; Lavrač, N. Foundations of Rule Learning; Cognitive Technologies; Springer: Berlin/Heidelberg, Germany, 2012; pp. xviii+334.
53. Fürnkranz, J.; Kliegr, T. A brief overview of rule learning. In Proceedings of Rule Technologies: Foundations, Tools, and Applications, Berlin, Germany, 2–5 August 2015; Bassiliades, N., Gottlob, G., Sadri, F., Paschke, A., Roman, D., Eds.; Springer: Cham, Switzerland, 2015; pp. 54–69.
54. Fürnkranz, J. Pruning Algorithms for Rule Learning. Mach. Learn. 1997, 27, 139–172.
55. Fürnkranz, J.; Widmer, G. Incremental reduced error pruning. In Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ, USA, 10–13 July 1994; Morgan Kaufmann: San Francisco, CA, USA, 1994; pp. 70–77.
56. Hühn, J.; Hüllermeier, E. FURIA: An algorithm for unordered fuzzy rule induction. Data Min. Knowl. Discov. 2009, 19, 293–319.
57. Lin, F.; Muzumdar, K.; Laptev, N.P.; Curelea, M.V.; Lee, S.; Sankar, S. Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment. Proc. ACM Meas. Anal. Comput. Syst. 2020, 4, 31.
58. Lee, W.; Stolfo, S.J. Data Mining Approaches for Intrusion Detection. In Proceedings of the 7th USENIX Security Symposium, San Antonio, TX, USA, 26–29 January 1998; Volume 7, p. 6.
59. Helmer, G.; Wong, J.; Honavar, V.; Miller, L. Intelligent agents for intrusion detection. In Proceedings of the IEEE Information Technology Conference, Syracuse, NY, USA, 3 June 1998; Springer: Berlin/Heidelberg, Germany, 1998; pp. 121–124.
60. Helmer, G.; Wong, J.S.; Honavar, V.; Miller, L. Automated discovery of concise predictive rules for intrusion detection. J. Syst. Softw. 2002, 60, 165–175.
61. Mannila, H.; Toivonen, H.; Verkamo, A.I. Discovering Frequent Episodes in Sequences (Extended Abstract). In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montreal, QC, Canada, 20–21 August 1995; AAAI Press: Washington, DC, USA, 1995; pp. 210–215.
62. Liu, H.; Motoda, H. Perspectives of Feature Selection. In Feature Selection for Knowledge Discovery and Data Mining; Springer: Boston, MA, USA, 1998; pp. 17–41.
63. John, G.H.; Kohavi, R.; Pfleger, K. Irrelevant features and the subset selection problem. In Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ, USA, 10–13 July 1994; Morgan Kaufmann: Burlington, MA, USA, 1994; pp. 121–129.
64. Agrawal, R.; Imieliński, T.; Swami, A. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, 26–28 May 1993; pp. 207–216.
65. Shafer, G. A Mathematical Theory of Evidence; Princeton University Press: Princeton, NJ, USA, 1976.
66. Peñafiel, S.; Baloian, N.; Sanson, H.; Pino, J.A. Applying Dempster–Shafer theory for developing a flexible, accurate and interpretable classifier. Expert Syst. Appl. 2020, 148, 113262.
67. Poghosyan, A.; Harutyunyan, A.; Davtyan, E.; Petrosyan, K.; Baloian, N. A Study on Automated Problem Troubleshooting in Cloud Environments with Rule Induction and Verification. Appl. Sci. 2024, 14, 1047.
68. Chen, Z.; Jiang, Z.; Su, Y.; Lyu, M.R.; Zheng, Z. TraceMesh: Scalable and Streaming Sampling for Distributed Traces. arXiv 2024, arXiv:2406.06975.
69. Gias, A.U.; Gao, Y.; Sheldon, M.; Perusquía, J.A.; O’Brien, O.; Casale, G. SampleHST: Efficient On-the-Fly Selection of Distributed Traces. arXiv 2022, arXiv:2210.04595.
70. Huang, Z.; Chen, P.; Yu, G.; Chen, H.; Zheng, Z. Sieve: Attention-based Sampling of End-to-End Trace Data in Distributed Microservice Systems. In Proceedings of the 2021 IEEE International Conference on Web Services (ICWS), Virtual, 5–11 September 2021; pp. 436–446.
71. Zhou, T.; Zhang, C.; Peng, X.; Yan, Z.; Li, P.; Liang, J.; Zheng, H.; Zheng, W.; Deng, Y. TraceStream: Anomalous Service Localization based on Trace Stream Clustering with Online Feedback. In Proceedings of the 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), Florence, Italy, 9–12 October 2023; pp. 601–611.
72. Lundberg, S.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. arXiv 2017, arXiv:1705.07874.
73. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. arXiv 2016, arXiv:1602.04938.
74. Leys, C.; Ley, C.; Klein, O.; Bernard, P.; Licata, L. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. J. Exp. Soc. Psychol. 2013, 49, 764–766.
75. Dunning, T. The t-digest: Efficient estimates of distributions. Softw. Impacts 2021, 7, 100049.
76. Witten, I.H.; Frank, E.; Hall, M.A.; Pal, C.J. Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann: Burlington, MA, USA, 2005.
Figure 1. Trace sampling multi-layer design.
Figure 2. The distribution of traces across different types for a specific application.
Figure 3. The sampling of traces of Figure 2 for different values of the parameter α.
Figure 4. The values of G_type(α) show the final sampling rates for different α. The red cross corresponds to α = 0.044 with the final sampling rate r = 0.099 (around 10%).
Figure 5. The distribution of traces with different durations (in milliseconds).
Figure 6. The sampling of traces of Figure 5 for different values of the parameter β.
Figure 7. The values of G_dur(β) show the sampling rates for the different parameter values β. The red cross corresponds to β = 0.083 with a total sampling rate of 0.0996 (around 10%).
Figure 9. The hybrid approach for a specific trace type (N17 in Figure 10). The left figure shows the plot of durations. The right figure shows the counts of traces for α = β.
Figure 10. The hybrid sampling approach for α = β.
Figure 11. The sampling rates corresponding to different values of α = β. The red cross corresponds to 10% with β = 0.03.
Figure 12. The sampling rates that correspond to different values of α and β. The colors correspond to different ranges of sampling rates for better visualization.
Figure 13. The sampling of two trace types without counting the errors.
Figure 14. The sampling of two trace types, also counting the errors.
Figure 15. The sampling of two trace types with stricter requirements on the percentage of erroneous traces. Now, we preserve 10% of normal traces and 60% of erroneous ones. The final sampling rate is 30%.
Figure 16. JRip rules before the sampling.
Figure 17. The distribution of normal traces across the types before and after the sampling.
Figure 18. The distribution of erroneous traces across the types before and after the sampling.
Figure 19. JRip rules after the sampling.
Figure 20. The uncertainties of rules before and after the sampling.
Table 1. The number of traces before and after the sampling.
| Type Index | Original Numbers | α = 1 | α = 0.5 | α = 0.25 | α = 0.1 |
|---|---|---|---|---|---|
| 1 | 788 | 758 | 634 | 439 | 219 |
| 2 | 3996 | 3215 | 2229 | 1339 | 602 |
| 3 | 3778 | 3080 | 2154 | 1301 | 587 |
| 4 | 2082 | 1870 | 1418 | 906 | 426 |
| 5 | 249 | 246 | 222 | 167 | 89 |
| 6 | 2043 | 1839 | 1397 | 895 | 421 |
| 7 | 142 | 142 | 131 | 101 | 56 |
| 8 | 104 | 104 | 97 | 77 | 43 |
| 9 | 64 | 64 | 61 | 49 | 29 |
| 10 | 621 | 603 | 513 | 362 | 184 |
| 11 | 216 | 214 | 194 | 147 | 79 |
| 12 | 62 | 62 | 59 | 48 | 28 |
| 13 | 241 | 239 | 215 | 162 | 87 |
| 14 | 93 | 93 | 87 | 69 | 39 |
| 15 | 49 | 49 | 47 | 39 | 23 |
| 16 | 44 | 44 | 42 | 35 | 21 |
| 17 | 98 | 98 | 92 | 73 | 41 |
| 18 | 41 | 41 | 40 | 33 | 19 |
| 19 | 3 | 3 | 3 | 3 | 2 |
| 20 | 63 | 63 | 60 | 49 | 28 |
| 21 | 3 | 3 | 3 | 3 | 2 |
| 22 | 3028 | 2580 | 1863 | 1150 | 527 |
| 23 | 1413 | 1316 | 1042 | 689 | 332 |
| 24 | 23 | 23 | 23 | 19 | 12 |
| 25 | 546 | 532 | 457 | 326 | 166 |
| 26 | 58 | 58 | 55 | 45 | 26 |
| 27 | 51 | 51 | 49 | 40 | 23 |
| 28 | 526 | 513 | 442 | 316 | 162 |
| Total | 20,425 | 17,903 | 13,629 | 8882 | 4273 |
| Rate | - | 87.7% | 66.7% | 43.5% | 20.9% |
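The Rate row of Table 1 follows directly from the column totals: each rate is the number of traces kept for that α divided by the original total of 20,425. A quick consistency check, using only values copied from the table:

```python
# Column totals from Table 1: traces kept for each value of alpha.
totals = {1.0: 17903, 0.5: 13629, 0.25: 8882, 0.1: 4273}
original = 20425  # traces before sampling

# Overall sampling rate per alpha, as a percentage rounded to one decimal.
rates = {a: round(100 * kept / original, 1) for a, kept in totals.items()}
# rates == {1.0: 87.7, 0.5: 66.7, 0.25: 43.5, 0.1: 20.9}
```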
Table 2. The number of traces before and after the sampling.
| Time Intervals (ms) | Original Numbers | β = 1 | β = 0.5 | β = 0.25 | β = 0.1 |
|---|---|---|---|---|---|
| [9, 1828) | 6685 | 4498 | 2861 | 1629 | 707 |
| [1828, 3647) | 5061 | 3808 | 2542 | 1491 | 660 |
| [3647, 5466) | 6328 | 4368 | 2806 | 1607 | 700 |
| [5466, 7284) | 81 | 81 | 76 | 61 | 35 |
| [7284, 9103) | 235 | 233 | 210 | 159 | 85 |
| [9103, 10922) | 1111 | 1051 | 852 | 575 | 281 |
| [10922, 12741] | 925 | 884 | 729 | 499 | 247 |
| Total | 20,425 | 14,923 | 10,076 | 6021 | 2715 |
| Rate | - | 73.1% | 49.3% | 29.5% | 13.3% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
