Article

Log-Based Fault Localization with Unsupervised Log Segmentation

by Wojciech Dobrowolski 1,2,*, Kamil Iwach-Kowalski 1, Maciej Nikodem 2 and Olgierd Unold 2

1 Nokia, Rodziny Hiszpanskich 8, 02-685 Warszawa, Poland
2 Department of Computer Engineering, Faculty of Information and Communication Technology, Politechnika Wroclawska, Wybrzeże Stanisława Wyspianskiego 27, 50-370 Wrocław, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8421; https://doi.org/10.3390/app14188421
Submission received: 13 August 2024 / Revised: 9 September 2024 / Accepted: 17 September 2024 / Published: 19 September 2024
(This article belongs to the Special Issue Software Engineering: Computer Science and System—Second Edition)

Abstract

Localizing faults in software is a tedious process. The manual approach is becoming impractical because of the large size and complexity of contemporary computer systems as well as their logs, which are often the primary source of information about the fault. Log-based Fault Localization (LBFL) is a popular method applied for this purpose. However, in real-world scenarios, this method is vulnerable to a large number of previously unseen log lines. In this paper, we propose a novel method that can guide programmers to the location of a fault by creating a hierarchy of the log lines with the highest rank selected by the traditional LBFL method. We use the intuition that the symptoms of faults appear in the context of normal behavior, whereas suspicious log lines grouped together usually come from new or additional functionalities turned on during the faulty execution. To obtain this context, we used unsupervised log sequence segmentation, which has previously been used to split log sequences into meaningful segments. Experiments on real-life examples show that our method reduces the effort to find the most crucial logs by up to 64% compared with the traditional timestamp approach. We demonstrate that context is highly useful in advancing fault localization, showing the possibility of further speeding up the process.

1. Introduction

Software systems have become indispensable to the infrastructure of modern society, underpinning essential services in industries such as aviation, finance, healthcare, and government. The demand for reliability in these systems has increased dramatically as their role in daily operations has expanded. For example, in July 2024, a software glitch in a patch from CrowdStrike, a leading cybersecurity provider, triggered a widespread outage for Microsoft system users. This failure resulted in significant disruptions: flights at major airports were grounded, financial transactions were halted, and medical centers faced operational paralysis. Such incidents emphasize the critical impact software failures can have on businesses and the urgent need for robust fault localization techniques. Effective fault localization is crucial to minimizing downtime, reducing debugging costs, and ensuring the continuous operation of complex systems.
Fault localization works by narrowing the scope of a developer’s interest from the entire codebase to the specific areas where the fault is likely to be located. This process typically involves two main stages: detection of anomalies and correlation with the potential place of the fault or root cause. Anomalies are identified using various data sources, such as logs, execution traces, and metrics. Once detected, these anomalies are analyzed to determine the patterns and correlations that point to the probable fault location. By effectively pinpointing these locations, fault localization techniques help developers focus their debugging efforts more precisely, thereby accelerating the resolution of software issues. Fault localization often relies on multimodal data, including logs, execution traces, metrics [1], test results, and textual information [2]. To obtain execution traces and metrics, the system must be instrumented in advance, that is, equipped with code that generates such data. A popular framework for this purpose is OpenTelemetry [3], which implements the concept of observability, that is, understanding the internal state of a system through its outputs. An observable system is a good foundation for applying automatic diagnosis and localization approaches such as AIOps [4]. However, obtaining such a rich source of data from a system is uncommon and sometimes not feasible. Thus, many fault-localization methods rely on unimodal data, that is, data from only one source, such as logs.
Spectrum-Based Fault Localization (SBFL) [2] is a rapidly growing field that uses unimodal data. Methods based on code coverage (the spectrum) are lightweight, competitive, fast, and scalable. They use code coverage generated by module and unit tests, which is often collected during execution. The anomaly measure is calculated based on the number of times a line appears in the passed and failed tests; the justification is that lines of code that appear more frequently in failed tests are more likely to be the source of the problem. Statistical methods [5,6,7] are frequently used to calculate the suspiciousness scores. Based on this metric, a ranking of program lines can be produced and suggested to the software developer, indicating where to look first. Ideally, the line with the actual error should be at the top of the suggestions.
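To make the spectrum intuition concrete, here is a minimal sketch of Tarantula-style scoring [5] over a toy coverage matrix; the coverage data and test outcomes are invented for illustration only.

```python
# Toy spectrum-based scoring in the Tarantula style: the coverage matrix and
# test outcomes below are invented purely for illustration.
coverage = [
    [1, 1, 0, 1],  # statements covered by test 0
    [1, 0, 1, 1],  # statements covered by test 1
    [0, 1, 1, 1],  # statements covered by test 2
]
passed = [True, False, True]  # outcome of each test

total_fail = sum(1 for p in passed if not p)
total_pass = sum(1 for p in passed if p)

def suspiciousness(stmt: int) -> float:
    fail = sum(row[stmt] for row, p in zip(coverage, passed) if not p)
    ok = sum(row[stmt] for row, p in zip(coverage, passed) if p)
    fail_ratio = fail / total_fail if total_fail else 0.0
    pass_ratio = ok / total_pass if total_pass else 0.0
    denom = fail_ratio + pass_ratio
    return fail_ratio / denom if denom else 0.0

# Rank statements from most to least suspicious.
ranking = sorted(range(len(coverage[0])), key=suspiciousness, reverse=True)
print([(s, round(suspiciousness(s), 2)) for s in ranking])
```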
Machine learning [8] and deep learning [9,10] models also use code coverage to predict faulty lines. Training involves teaching the network to predict the test results based on code coverage. Then, during inference, virtual tests consisting of a single line of code are inputted, and the output indicates whether that line is a potential source of error. Deep learning methods represent code coverage as a Coverage Matrix [9,11], which allows the network to detect which changes in test execution cause an error.
However, code coverage information alone does not always correlate with the actual cause of an error. This is because the execution frequency data of a given line affects the localization result. Non-faulty code fragments can be executed more frequently than actual faulty fragments, which skews the SBFL result. Therefore, more costly and powerful Slicing-Based Fault Localization methods [12,13,14] are used to solve this problem.
Slicing methods identify exact instructions for a given system output. The reference point is a specific instruction or variable that we want to trace. The flow is observed throughout all places, from its creation and modification to its return. Slicing can be performed statically [12,15] or dynamically [13,16,17], or by combining both approaches [14] to limit the size of static slicing while improving the quality of dynamic slicing.
Obtaining code coverage or instrumenting code to obtain traces is not always feasible because code coverage is not collected during field execution, and instrumentation of the code often produces an unacceptable level of overhead and may itself introduce faults. Logs are a source of information that introduces minimal overhead and intervention in the system; therefore, they are the most common source of information after a failure. As part of the execution path, logs can be used as a source of fault localization information. Logs are generated by lines in the source code and are placed by programmers to reflect a specific intention. The structure and consistency of logs can vary significantly depending on the discipline within a company [18]. This variation necessitates high resilience in log-based methods. Logs can be considered in terms of their static and dynamic components: the static part corresponds to the text placed in the code by the programmer, whereas the dynamic part corresponds to the text generated during the program’s execution.
Several fault localization methods based on logs have been proposed. Some of these methods use logs in conjunction with other data, such as the content of a build configuration file [19] or key performance indicators (KPIs) [20]. However, it is also possible to use logs alone. Fault localization in such a scenario can be achieved by restricting the number of logs to analyze (Log-Based Fault Localization, LBFL) [21] or by substituting logs with the components behind them and then applying traditional Spectrum-Based Fault Localization (SBFL) to rank the components based on their suspiciousness scores [22]. Although the LBFL method successfully reduces the logs to a small percentage, this may still amount to a large number of lines in complex software systems.
Log-Based Fault Localization (LBFL) applies a technique similar to code coverage methods but tailored specifically to log analysis. This approach adjusts the calculation of failure and pass occurrences by normalizing repetitions within a single file. Afterward, the suspiciousness score is computed using the same methodology as in Spectrum-Based Fault Localization (SBFL). The result is a ranking of log templates, ordered from the most suspicious to the least. However, in large systems, the number of logs with high suspiciousness scores is still overwhelming. To address this, we propose further refining the order of the log lines by incorporating contextual information. This is achieved through the use of an unsupervised log sequence segmentation method, Voting Experts [23]. We have previously demonstrated [24] that this method is effective in segmentation and matches human gold-standard segmentation of open-source logs. The method is currently being piloted at Nokia, Wrocław, Poland.
The main contributions of this paper are as follows:
  • Introduction of a new measure to calculate the suspiciousness of a log segment,
  • A method combining existing Log-based Fault Localization techniques with a new metric to further refine the localization results,
  • Utilization of the output from unsupervised log sequence segmentation for automated log analysis,
  • Provision of an anonymized dataset from Nokia covering three real-world faults.
Section 2 describes the proposed method and datasets used for its evaluation. This is followed by the research questions and results presented in Section 3 as well as a discussion in Section 4. The last section outlines directions for further research.

2. Materials and Methods

The proposed fault localization model is designed to identify the log messages related to failures in complex systems through the expansion of existing fault localization techniques. This model is useful when logs from normal execution of the system are available, and the traditional log-based fault localization method produces a large number of rank 1 lines when applied to logs from a failed execution of the system. Logs, along with other parts of the code, constantly change, making it difficult for methods based on a long history to operate well. We propose a method that requires only one current set of normal logs and corresponding failed logs, generating a hierarchy of anomalous log lines, which is crucial from a fault localization point of view. As a result, our method overcomes this limitation and can be easily used for new software releases.
Traditional LBFL methods may return a substantial number of rank 1 log templates. When this happens, the user has to manually examine the log lines related to rank 1 templates, sorted by timestamp. We propose to use context so that the most important log lines can be analyzed first and can be seen in a meaningful context. The timestamp-based approach simply orders log lines by when they occurred, which can be ineffective in large logs with many failures and when a crucial log line is not the first suspicious line. We propose a context-based approach, which analyzes the surroundings of suspicious lines and considers how they relate to their context (by averaging the suspiciousness scores of the lines in a segment). This is particularly useful when anomalies are buried within sequences of normal-looking events.
Using unsupervised log sequence segmentation, our method focuses more precisely on related log lines and distinguishes between log anomalies that are genuinely related to faults and those that are incidental (e.g., logs from auxiliary processes or logging artifacts). This reduces false positives and makes it easier to localize the real issue.
Overall, the context-based ranking approach enhances fault localization by prioritizing log lines that are both suspicious and contextually significant, reducing the manual effort needed to identify the root cause of failure.
Our approach (Figure 1) consists of two stages:
  • A log-based fault localization framework that identifies the most suspicious log lines in a failed log file,
  • Context-based ranking of rank 1 log lines, based on unsupervised log sequence segmentation.

2.1. Normalization

Before analysis, the logs were normalized by removing timestamps and lines where objects were described (Figure 2). The timestamps follow a consistent format throughout the log file; therefore, they do not contribute to template distinction and only add overhead to the template extraction process. We used timestamps to sort the lines, but after that, they were removed. The inclusion of object descriptions in the log makes them challenging to capture with regular expressions because of variations in spaces and special punctuation marks. These objects do not contain execution-related information and are essentially one line of information spread across multiple lines for readability. The structure and values of the printed fields of the objects can only be verified against specifications; therefore, we decided to exclude these lines from consideration.
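A rough illustration of this normalization step is sketched below; the timestamp pattern and the object-description marker are assumptions made for the example, since the exact Nokia log format is not reproduced here.

```python
import re

# Assumed timestamp prefix, e.g. "2024-08-13 12:00:01.123 "; the real format may differ.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+\s+")
# Assumed shape of object-description lines: indented "field = value" continuations.
OBJECT_FIELD = re.compile(r"^\s+\w+\s*=")

def normalize(lines):
    """Strip timestamp prefixes and drop object-description lines."""
    result = []
    for line in lines:
        if OBJECT_FIELD.match(line):
            continue  # object fields carry no execution-related information
        result.append(TIMESTAMP.sub("", line.rstrip("\n")))
    return result

print(normalize([
    "2024-08-13 12:00:01.123 INF, State change notif received [CONNECTED]",
    "    remoteAddress = 192.168.11.1",  # object-description line, removed
]))
```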

2.2. Log-Based Fault Localization

The normalized log contains lines of interest; however, the lines are composed of constant and variable parts. The variable parts contain information inserted in log lines during execution, and documentation is required to examine their correctness. Therefore, the next step after normalization is to remove variable elements from the lines and create so-called log templates. A well-established method for log template extraction is Drain [25], which performs this task with the highest accuracy. For example, in the messages “Status of connection to IP:192.168.11.1 is SUCCESS” and “Status of connection to IP:192.168.11.2 is FAILURE”, the constant part is “Status of connection to IP:* is *”, and the variable parts are [‘192.168.11.1’, ‘SUCCESS’] and [‘192.168.11.2’, ‘FAILURE’]. This method relies heavily on the quality of the provided regular expressions, as it must correctly distinguish between the constant and variable parts of the log lines. This distinction is not trivial, and it has been shown that improvements at this stage can lead to improvements in downstream tasks [26].
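The following sketch mimics the effect of template extraction on the connection-status example above using simple masking rules; it is a simplified stand-in for Drain [25], not the algorithm itself, and the masking regular expressions are assumptions.

```python
import re

# Simplified stand-in for Drain: mask obvious variable parts so that both messages
# collapse to the same template. The masking rules are illustrative assumptions.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<*>"),  # IPv4 addresses
    (re.compile(r"\b(SUCCESS|FAILURE)\b"), "<*>"),         # status values
]

def extract_template(message: str):
    params, template = [], message
    for pattern, placeholder in MASKS:
        params.extend(pattern.findall(template))
        template = pattern.sub(placeholder, template)
    return template, params

print(extract_template("Status of connection to IP:192.168.11.1 is SUCCESS"))
print(extract_template("Status of connection to IP:192.168.11.2 is FAILURE"))
# Both calls yield the template "Status of connection to IP:<*> is <*>".
```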
Let L_n be the set of all log templates in the normal logs, and L_f be the set of all log templates in the failed logs. The suspiciousness of a log template l is defined as follows:
$$ Susp(l) = \frac{Fail(l)/FT}{Fail(l)/FT + Pass(l)/PT}, \qquad (1) $$
where FT and PT are the numbers of all failed and passed logs, respectively; Fail(l) is the number of times log template l was seen in the failed logs; and Pass(l) is the number of times it was seen in the passed logs. These definitions differ from those of the original LBFL, where occurrences are normalized. We decided not to normalize occurrences because we want to operate in situations where there is only one failed and one normal log, and because the frequency of log templates is an important source of information for our context-based approach. Figure 3 presents two histograms of suspiciousness scores: one for normalized and one for non-normalized occurrences. Normalized occurrences are much less differentiated, whereas in our approach, frequency information is utilized to make the context more useful.
The Susp metric has several important properties. It is equal to 1 for any log template that is exclusively present in failed log files and 0 for log templates that are exclusively present in the passed log files. Ranking is determined based on these values. However, this metric cannot further grade log lines that have a rank of 1. In the case of large systems, this may result in a substantial number of log lines having the same highest rank.
Table 1, Table 2 and Table 3 illustrate Log-based Fault Localization with the suspiciousness score and ranking proposed in [21] on a simple OK/NOK case. Table 1 and Table 2 contain log examples: the first column contains the log line numbers, the second the content of the log file, and the third the ID of the extracted log template. The last column is the number of occurrences of each template ID (“-” is used when the template ID repeats).
Table 3 shows the Fail(l) and Pass(l) values together with the Susp score and ranking based on occurrences. The suspiciousness score was calculated using Formula (1) and then used to rank the lines. As we can see, even in this simple example, four templates are already marked with the highest suspiciousness score (rank equal to 1). To find the root cause, the user has to analyze all lines related to rank 1 templates, sorted by timestamp.
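A minimal sketch of the computation behind Formula (1), using the template occurrence counts from Table 1 and Table 2 (with FT = PT = 1), is shown below.

```python
from collections import Counter

def susp(fail: int, passed: int, ft: int, pt: int) -> float:
    """Suspiciousness of a log template, Formula (1), without occurrence normalization."""
    f, p = fail / ft, passed / pt
    return f / (f + p)

# Template occurrences from Table 1 (passed log) and Table 2 (failed log); FT = PT = 1.
pass_counts = Counter({1: 7, 2: 1, 3: 1})
fail_counts = Counter({1: 4, 2: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1})

scores = {t: susp(fail_counts[t], pass_counts.get(t, 0), 1, 1) for t in fail_counts}
for template_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(template_id, round(score, 2))
# Templates 4-7 obtain Susp = 1.0 (rank 1); template 1 obtains 4/11, approx. 0.36 (Table 3).
```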

2.3. Unsupervised Log Sequence Segmentation

A log sequence, as part of the execution path, conveys information about the software system and its execution. It contains events that are visible to humans and makes it possible to distinguish patterns and segments of lines, as they reflect the architecture of the system. Following this observation, a few approaches have been proposed to extract this structure to ease manual examination of the log file (for example, [23,27]). However, the application of these methods in automated log analysis has not yet been demonstrated. One of our contributions is showing that they can be used beyond improving the manual inspection of logs. We used VotingExperts [23] to extract meaningful segments from long log sequences and to calculate the rankings of localized suspicious lines. Let S represent the set of all sequences of log templates:
$$ S = \{\hat{e}_0, \ldots, \hat{e}_m\}, $$
where ê_i is the i-th sequence and m + 1 is the total number of sequences. The sequence is obtained by extracting all log lines belonging to one thread, block ID, or node ID. A single sequence ê_j from S contains a sequence of log templates l_i:
$$ \hat{e}_j = \langle l_0, \ldots, l_n \rangle. $$
Each l_i belongs to the set of log templates L contained in the log file. Segment Segm_k is the sequence of n_k log templates from ê_j:
$$ Segm_k = \langle l_{i_k}, \ldots, l_{i_{k+1}} \rangle, $$
where i_k ≥ 0 and i_k < n. For further details, please refer to the original work [23] or our previous work [24], where we showed how unsupervised word segmentation methods can be transferred to the log sequence segmentation domain. For our experiments, we used VotingExperts with a window size of seven and a threshold of four.
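The sketch below illustrates the surrounding data handling: log lines are grouped into per-thread sequences of template IDs and then cut at boundary positions. The boundaries are assumed to be provided by a Voting Experts implementation (window size 7, threshold 4), which is not reproduced here; the thread name is hypothetical.

```python
from collections import defaultdict

def build_sequences(parsed_lines):
    """Group (thread_id, template_id) pairs into per-thread sequences of template IDs."""
    sequences = defaultdict(list)
    for thread_id, template_id in parsed_lines:
        sequences[thread_id].append(template_id)
    return dict(sequences)

def split_into_segments(sequence, boundaries):
    """Cut a sequence at the given boundary indices (assumed to come from Voting Experts)."""
    cuts = [0] + sorted(boundaries) + [len(sequence)]
    return [sequence[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]

# Hypothetical thread "t1" with the template IDs of the failed log in Table 2,
# segmented after positions 3 and 8.
seqs = build_sequences([("t1", t) for t in [4, 5, 5, 1, 1, 6, 1, 1, 7, 2, 3]])
print(split_into_segments(seqs["t1"], boundaries=[3, 8]))
# [[4, 5, 5], [1, 1, 6, 1, 1], [7, 2, 3]] -- the three segments used in Section 2.4
```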

2.4. Context-Based Ranking

Existing Log-based Fault Localization methods struggle when there are many previously unseen logs. This situation is common in large software systems where a single issue can cause a cascade of failures and generate thousands of related logs. Simply returning all of them may not be helpful.
To address this issue, we propose a ranking based on context. Intuitively, unseen log lines appearing within the context of well-known log lines are more suspicious than unseen log lines grouped together. The reason is that grouped unseen log lines are often stack traces, crash dumps, or new functionalities, whereas a single anomaly amidst well-known behavior can be the first symptom of a program going off track. To calculate this metric, we first segment the log file using unsupervised log sequence segmentation and then calculate the mean of the suspiciousness scores of the log templates in the segments where the most suspicious log templates are present. The context-based ranking is calculated using the following equation:
$$ Context\_ranking(Segm_k) = 1 - \frac{\sum_{l_i \in Segm_k} Susp(l_i)}{|Segm_k|}, $$
where Segm_k denotes the k-th segment of the log file. The greater the mean suspiciousness of a segment, the less suspicious its lines are considered.
In the example in Table 2, the fault is located in line 6. It appears in the context of successful connection responses and explains the subsequent node failure. Let us assume that the first segment of the failed log spans lines 1 to 3. The sequence of template IDs is ⟨4, 5, 5⟩, and all lines were ranked as suspicious; the context ranking of this segment is Context_ranking(S_1) = 1 - 1 = 0. The second segment spans lines 4 to 8 with a sequence of template IDs ⟨1, 1, 6, 1, 1⟩, and its context ranking is Context_ranking(S_2) = 1 - (4 × 0.36 + 1)/5 ≈ 0.5. The last segment spans lines 9 to 11, and its context ranking is Context_ranking(S_3) = 1 - (0.5 + 0.5 + 1)/3 ≈ 0.33.
The segment with the highest context ranking is S_2, as expected. The third segment is lower in the hierarchy, and the first segment is correctly the lowest in the ranking, as it is not related to the actual fault but represents an additional logger turned on in anticipation of the expected failure.
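A minimal sketch of this ranking, reproducing the three segment values above from their per-line suspiciousness scores, could look as follows.

```python
def context_ranking(segment_susp):
    """Context ranking of a segment: 1 minus the mean suspiciousness of its lines."""
    return 1.0 - sum(segment_susp) / len(segment_susp)

# Per-line suspiciousness scores (from Table 3) for the three segments of Table 2.
segments = {
    "S1": [1.0, 1.0, 1.0],                 # templates 4, 5, 5
    "S2": [0.36, 0.36, 1.0, 0.36, 0.36],   # templates 1, 1, 6, 1, 1
    "S3": [1.0, 0.5, 0.5],                 # templates 7, 2, 3
}
for name, scores in sorted(segments.items(), key=lambda kv: context_ranking(kv[1]),
                           reverse=True):
    print(name, round(context_ranking(scores), 2))
# S2 approx. 0.51 (0.5 in the text after rounding), S3 approx. 0.33, S1 = 0.0;
# the fault-carrying segment S2 is ranked first.
```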

2.5. Context-Based HDFS Synthetic Example

In this section, we describe the behavior of our method on a synthetic example prepared on the basis of the real HDFS-10453 issue [28]. The issue is caused by deleting a file right after changing the replication value, which causes the ReplicationMonitor to fail in providing the requested number of replicas. We extracted only the ReplicationMonitor thread as the NOK example. As the OK log file, we simulated the behavior of the system without changing the replication value and deleting the file. We also added synthetic debug logs at the beginning to simulate the frequent situation in which testers add debug flags to collect more information during a failed run. The logs are available online in our repository [29]. The most important log is “Failed to place enough replicas, still in need of 2 to reach 5”, which appears amid otherwise normal execution. When examined with the traditional LBFL + timestamp approach, there are 5 most suspicious templates, and the crucial line related to them is at position 10 when only the related lines are extracted and sorted by timestamp (Figure 4). We then segmented the logs with VotingExperts (window size 7, threshold 4) and performed context-based ranking. In this situation, the most important line is visible in the first segment, requiring only one suspicious line to be analyzed (Figure 5).

2.6. Context-Based Pitfall: Inversion of Fault Prioritization

Although the context-based ranking approach improves fault localization by considering the surrounding context of log lines, it may lead to incorrect prioritization in certain cases. One potential pitfall arises when a critical error is followed by a series of dependent errors, normal operations are resumed, and then an isolated error unrelated to the main failure appears at the end. In such a case, the context-based method may invert the fault priority, resulting in a misinterpretation of the root cause. In some cases, a critical error early in the log may indirectly cause a failure much later in the execution flow. One example is when a failure to read from the disk leads to missing credentials, which subsequently causes a database connection failure. The context-based ranking may incorrectly prioritize the database error, although the root cause lies in the disk read failure.
Consider a system where the following sequence of events occurs (Table 4):
  • Disk Read Error: The system attempts to read necessary configuration data, including credentials, from the disk, but the read operation fails.
  • Related Errors: The system encounters a series of related issues, such as retry attempts or warnings about missing files.
  • Normal Operations: Despite the disk read failure, the system continues with other tasks.
  • Database Connection Failure: When the system tries to connect to the database, it fails because the credentials, which should have been read from the disk, are missing.
In this case, the original error (disk read failure) is the root cause of the subsequent database connection failure. However, a context-based approach might misinterpret the situation, assigning higher priority to the final database connection error, given its proximity to otherwise normal operations. This can lead to an inversion of fault prioritization, where the final error is treated as the most suspicious, despite being a consequence of the earlier disk read failure.
The table below illustrates this scenario, with log lines that show the sequence of events leading to the database connection error:
This example illustrates a limitation of the context-based approach: it can fail to recognize causal relationships between distant errors. In large systems, where the effects of early failures manifest much later in the execution, this can lead to incorrect fault prioritization. Without proper attention to these causal links, context-based ranking might mislead the fault localization process, focusing on the symptoms rather than the true cause.
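Applying the same context ranking to the scenario of Table 4, under the assumption that the log splits into a segment of disk-related errors (lines 1 to 3) and a segment ending in the database error (lines 4 to 6), reproduces the inversion numerically.

```python
def context_ranking(segment_susp):
    return 1.0 - sum(segment_susp) / len(segment_susp)

# Hypothetical segmentation of Table 4: lines 1 to 3 (disk read error and retries)
# versus lines 4 to 6 (normal operations followed by the database connection error).
disk_segment = [1.0, 0.9, 0.9]
database_segment = [0.1, 0.1, 1.0]

print(round(context_ranking(disk_segment), 2))      # approx. 0.07: root cause, ranked low
print(round(context_ranking(database_segment), 2))  # 0.6: mere symptom, ranked high
```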
To address this issue, a more nuanced understanding of dependencies between log lines could help improve fault ranking and ensure that root causes are given higher priority, even when separated by normal execution.

2.7. Dataset

We performed our study on real industrial logs from Nokia, which delivers wireless connectivity solutions to many different enterprises. Reliable communication is expected nowadays, putting pressure on the software and hardware of the Base Transceiver Station (BTS). The software of the BTS consists of many components that communicate over interfaces. Failure of any component often leads to an unacceptable drop in performance. Detecting and fixing faults can often only be performed by comparison with previous normal behavior.
However, simple manual comparisons of log files consisting of hundreds of thousands of log lines are not feasible. The dataset used in this study, which is publicly available along with the code [29], comprises anonymized logs from Nokia containing three faults. Anonymization was performed by removing the content of the log templates, leaving only template IDs. Thread names and level info details were also anonymized by substituting them with the strings “thread<num>” and “level<num>”. For the first two fault scenarios, we collected 10 logs from normal execution prior to the fault date. For the third one, there was only one normal log and one failed log. The logs were normalized by removing timestamps and lines with interface object descriptions, segmented by thread ID, and processed using Drain. VotingExperts was then used to segment the log template sequences from each thread.

2.7.1. Example 1

In this scenario, the software failed because of the lack of communication with one of the nodes. The failed communication was logged in the middle of normal communication with other nodes. At the beginning of the logs, there were many additional entries from the logging module, which had been turned on by the tester to ensure that all possible logs were collected. These logs, which are not usually enabled, contaminated the logs with false positives ranked as 1.
The failed log contained 250,578 lines with 5557 templates, while 10 normal logs contained 1,172,986 lines with 6124 templates.

2.7.2. Example 2

In this scenario, the software failed because of a timeout on one of the mutexes (Figure 6). The timeout was logged during normal behavior. The preceding logs with a suspiciousness rank of 1 were due to newly introduced functionality, which was not the source of the problem. This issue was occasional and occurred by chance when the new functionality was turned on.
The failed log contained 90,463 lines with 4624 templates, whereas the normal logs contained 676,942 lines with 5475 templates.

2.7.3. Example 3

In this scenario, we had a limited number of logs: the normal log contained only 8188 lines and the failed log contained 6980 lines. The software failed because of the incorrect setup of some interfaces. The real difference was that some functionality was not executed; however, the fault can be localized by examining the recovery actions taken. Therefore, the most important logs were all those containing the string “RecoveryService”.
The failed log contained 6980 lines with 1268 templates, whereas the normal logs contained 8188 lines with 1282 templates.

3. Results

The purpose of this work was to localize logs related to faults by reducing the number of logs an engineer has to analyze when traditional LBFL methods return many log lines with the highest rank of 1. The data used were labeled by experts to identify the crucial log in each scenario that revealed the real root cause. We compared the number of segments to examine when sorted by context ranking and by the standard timestamp. We consider our method better if it ranks the most important log template with a lower rank than the timestamp approach and if, going from the top of the ranking to the obtained segment, the developer has to examine a smaller number of log templates. Suspicious log lines are presented to the developer in their original form, not as log templates, as this is easier to analyze and understand. It is thus preferable to group all instances of a log template together so that, after seeing one instance from the group, the developer may skip the rest. To measure the gain, we used the percentage reduction of the data to analyze, calculated as follows:
$$ Red = \frac{X_1 - X_2}{X_1} \times 100, $$
where X_1 represents the baseline value and X_2 represents the obtained value.
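As a trivial check, the template reduction of Example 1 in Table 5 follows directly from this formula.

```python
def reduction(baseline: float, obtained: float) -> float:
    """Percentage reduction of the data to analyze (Red)."""
    return (baseline - obtained) / baseline * 100

print(round(reduction(33, 18), 2))  # 45.45, the template reduction of Example 1 in Table 5
```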
During our experiments, we aimed to answer two research questions:
  • RQ1: Does ranking by context suspicious score reduce the number of distinct log templates to check?
  • RQ2: Does ranking by context suspicious score reduce the number of log lines to check?
RQ1: Does ranking by context suspicious score reduce the number of distinct log templates to check?
In Example 1, after the application of traditional LBFL, there were 73 log templates with a suspiciousness score of 1, and 105 segments obtained by VotingExperts contained these log templates. The most crucial log template was located at line 445 when the lines related to suspicious log templates were sorted by timestamp; those lines were related to 33 log templates. Our method ranked the most important line in segment 29, which corresponded to 18 templates to analyze (Table 5). The number of log templates to be analyzed was reduced by 45.45% (Figure 7b). In Example 2, LBFL marked 668 log templates with a suspiciousness score of 1, and 1871 VotingExperts segments in the failed log contained these log templates. The most crucial log template was at line 3576 when the lines related to suspicious log templates were sorted by timestamp; those lines were related to 454 log templates (Table 5). The log snippet with the failure from this example is shown in Figure 6. Context-based ranking placed the most important line in segment 285, with 93 templates to check. The number of log templates to be analyzed was reduced by 79.51% (Figure 7b).
In Example 3, LBFL returned 86 log templates with a suspiciousness score of 1, and 319 VotingExperts segments contained those log templates. The most crucial log template was present at line 47 when the lines related to suspicious log templates were sorted by timestamp; those lines corresponded to 22 log templates. Context-based ranking placed the most important segment at position 32, with 15 log templates to check (Table 5). The number of log templates to be analyzed was reduced by 27.65% (Figure 7).
RQ2: Does ranking by context suspicious score reduce the number of log lines to check?
In Example 1, the context-based ranking segments from 1 to 29 contained 35 different log lines to verify, whereas with the timestamp ranking, 455 different log lines had to be checked (Table 6). The reduction in the number of log lines to analyze was 92.31% (Figure 8). In Example 2, with context-based ranking, the first 285 segments contained 320 log lines to verify, whereas with timestamp ranking, 3576 log lines had to be checked (Table 6). The reduction in the number of log lines to analyze was 91.05% (Figure 8b). In Example 3, with context-based ranking, 34 log lines had to be verified in the first 32 segments, whereas with timestamp ranking, 47 log lines had to be checked (Figure 8a). The reduction in the number of log lines to analyze was 27.65% (Figure 8b).
In all cases, our solution outperformed the standard timestamp approach, providing the possibility of identifying the most important log lines more quickly, as presented in Table 5.

4. Discussion

The proposed Log-based Fault Localization with context-based ranking for rank 1 lines proved to be more effective than timestamp ranking in the provided fault localization scenarios. Context-based ranking reduced the number of log lines to be analyzed as well as the number of distinct log templates. The former implies that developers must analyze fewer logs to encounter the most important one. The latter implies that log lines with the same log template are grouped together; therefore, after seeing one such example, a developer can skip similar ones. In contrast, the timestamp approach forces the developer to analyze many more log templates and their contexts before reaching the crucial log line. This demonstrates that log-based methods can be applied in situations with large differences in log templates between normal and failed logs, where traditional LBFL returns a substantial portion of the same rank 1 log templates for analysis, and context can be used to refine the analysis.
Our approach uses the context of log templates obtained from unsupervised log sequence segmentation, showing that the segmentation of log sequences can be useful for downstream tasks, such as fault localization.
For further research, it would be interesting to investigate how the quality of Drain templates influences the results. Because Drain is the main source of information regarding log templates and is highly dependent on the quality and accuracy of regular expressions, it would also be valuable to experiment with substituting this method with neural-based semantic approaches that are less reliant on human effort. Further experiments could also be conducted using sophisticated neural-based methods to compare segments, including Transformer-based methods.

Author Contributions

Conceptualization, W.D.; methodology, W.D. and O.U.; software, W.D. and K.I.-K.; validation, W.D. and K.I.-K.; formal analysis, O.U.; investigation, W.D.; resources, O.U.; data curation, W.D.; writing—original draft preparation, W.D.; writing—review and editing, M.N. and W.D.; visualization, W.D.; supervision, O.U.; project administration, W.D.; funding acquisition, O.U. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financed by the Polish Ministry of Education and Science with funds allocated from the “Implementation Doctorate” program, funding number SD1/DWD/4/13/2020/STYP.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available in a publicly accessible repository https://github.com/dobrowol/log_based_fault_localization (accessed on 16 September 2024).

Conflicts of Interest

Authors Wojciech Dobrowolski and Kamil Iwach-Kowalski were employed by the company Nokia. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SBFL   Spectrum-Based Fault Localization
LBFL   Log-Based Fault Localization
KPI    Key Performance Indicators

References

1. Zhang, S.; Xia, S.; Fan, W.; Shi, B.; Xiong, X.; Zhong, Z.; Ma, M.; Sun, Y.; Pei, D. Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis. arXiv 2024, arXiv:2407.01710.
2. Wong, W.E.; Gao, R.; Li, Y.; Abreu, R.; Wotawa, F.; Li, D. Software fault localization: An overview of research, techniques, and tools. In Handbook of Software Fault Localization: Foundations and Advances; Wiley: Hoboken, NJ, USA, 2023; pp. 1–117.
3. OpenTelemetry Contributors. OpenTelemetry. 2024. Available online: https://opentelemetry.io (accessed on 12 July 2024).
4. Remil, Y.; Bendimerad, A.; Mathonat, R.; Kaytoue, M. AIOps solutions for incident management: Technical guidelines and a comprehensive literature review. arXiv 2024, arXiv:2404.01363.
5. Jones, J.A.; Harrold, M.J.; Stasko, J. Visualization of test information to assist fault localization. In Proceedings of the 24th International Conference on Software Engineering, Orlando, FL, USA, 25 May 2002; pp. 467–477.
6. Wong, W.E.; Debroy, V.; Gao, R.; Li, Y. The DStar method for effective software fault localization. IEEE Trans. Reliab. 2013, 63, 290–308.
7. Abreu, R.; Zoeteweij, P.; Van Gemund, A.J. On the accuracy of spectrum-based fault localization. In Proceedings of the Testing: Academic and Industrial Conference Practice and Research Techniques-MUTATION (TAICPART-MUTATION 2007), Windsor, UK, 10–14 September 2007; pp. 89–98.
8. Wong, W.E.; Qi, Y. BP neural network-based effective fault localization. Int. J. Softw. Eng. Knowl. Eng. 2009, 19, 573–597.
9. Li, Y.; Wang, S.; Nguyen, T. Fault localization with code coverage representation learning. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, Spain, 22–30 May 2021; pp. 661–673.
10. Wardat, M.; Le, W.; Rajan, H. Deeplocalize: Fault localization for deep neural networks. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, Spain, 22–30 May 2021; pp. 251–262.
11. Zhang, Z.; Lei, Y.; Mao, X.; Li, P. CNN-FL: An effective approach for localizing faults using convolutional neural networks. In Proceedings of the 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), Hangzhou, China, 24–27 February 2019; pp. 445–455.
12. Weiser, M.D. Program Slices: Formal, Psychological, and Practical Investigations of an Automatic Program Abstraction Method; University of Michigan: Ann Arbor, MI, USA, 1979.
13. Zhang, X.; Gupta, N.; Gupta, R. A study of effectiveness of dynamic slicing in locating real faults. Empir. Softw. Eng. 2007, 12, 143–160.
14. Mao, X.; Lei, Y.; Dai, Z.; Qi, Y.; Wang, C. Slice-based statistical fault localization. J. Syst. Softw. 2014, 89, 51–62.
15. Binkley, D.; Gold, N.; Harman, M. An empirical study of static program slice size. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2007, 16, 8-es.
16. Agrawal, H.; Horgan, J.R. Dynamic program slicing. ACM SIGPlan Not. 1990, 25, 246–256.
17. Alves, E.; Gligoric, M.; Jagannath, V.; d’Amorim, M. Fault-localization using dynamic slicing and change impact analysis. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), Lawrence, KS, USA, 6–10 November 2011; pp. 520–523.
18. Zhu, J.; He, P.; Fu, Q.; Zhang, H.; Lyu, M.R.; Zhang, D. Learning to log: Helping developers make informed logging decisions. In Proceedings of the 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Florence, Italy, 16–24 May 2015; Volume 1, pp. 415–425.
19. Xu, J.; Chen, P.; Yang, L.; Meng, F.; Wang, P. Logdc: Problem diagnosis for declartively-deployed cloud applications with log. In Proceedings of the 2017 IEEE 14th International Conference on e-Business Engineering (ICEBE), Shanghai, China, 4–6 November 2017; pp. 282–287.
20. Zhang, Q.; Jia, T.; Wu, Z.; Wu, Q.; Jia, L.; Li, D.; Tao, Y.; Xiao, Y. Fault localization for microservice applications with system logs and monitoring metrics. In Proceedings of the 2022 7th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), Chengdu, China, 22–24 April 2022; pp. 149–154.
21. Sha, Y.; Nagura, M.; Takada, S. Fault localization in server-side applications using spectrum-based fault localization. In Proceedings of the 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Honolulu, HI, USA, 15–18 March 2022; pp. 1139–1146.
22. de Castro Silva, J.M.R. Spectrum-Based Fault Localization for Microservices via Log Analysis. 2022. Available online: https://repositorio-aberto.up.pt/bitstream/10216/143858/2/577947.pdf (accessed on 16 September 2024).
23. Cohen, P.; Heeringa, B.; Adams, N.M. An unsupervised algorithm for segmenting categorical timeseries into episodes. In Pattern Detection and Discovery; Springer: Berlin/Heidelberg, Germany, 2002; pp. 49–62.
24. Dobrowolski, W.; Libura, M.; Nikodem, M.; Unold, O. Unsupervised Log Sequence Segmentation. IEEE Access 2024, 12, 79003–79013.
25. He, P.; Zhu, J.; Zheng, Z.; Lyu, M.R. Drain: An online log parsing approach with fixed depth tree. In Proceedings of the 2017 IEEE International Conference on Web Services (ICWS), Honolulu, HI, USA, 25–30 June 2017; pp. 33–40.
26. Wu, X.; Li, H.; Khomh, F. On the effectiveness of log representation for log-based anomaly detection. Empir. Softw. Eng. 2023, 28, 137.
27. Shani, G.; Meek, C.; Gunawardana, A. Hierarchical probabilistic segmentation of discrete events. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, Miami Beach, FL, USA, 6–9 December 2009; pp. 974–979.
28. He, X. ReplicationMonitor Thread Could Get Stuck for a Long Time Due to the Race between Replication and Delete of the Same File in a Large Cluster. Apache Hadoop HDFS, Issue Resolved in 2018. 2016. Available online: https://issues.apache.org/jira/browse/HDFS-10453 (accessed on 16 September 2024).
29. Dobrowolski, W. Context Based Fault Localization. 2024. Available online: https://github.com/dobrowol/log_based_fault_localization (accessed on 12 August 2024).
Figure 1. Overview of the proposed context-based approach.
Figure 2. Description of objects in log lines. Such descriptions are removed during normalization as they do not provide useful information for error localization.
Figure 3. Occurrence metrics used in (a) the standard LBFL and (b) the context-based approach.
Figure 4. HDFS-10453 lines related to the templates with the most suspicious scores, sorted by timestamp.
Figure 5. HDFS-10453 lines related to the template with the most suspicious score, sorted by context. Lines from the context are grayed out.
Figure 6. Log snippet from Example 2 with the most important line, number 11.
Figure 7. Comparison of (a) the number and (b) the reduction of log templates to analyze in the context-based compared with the timestamp method.
Figure 8. Comparison of (a) the number and (b) the reduction of log lines to analyze in the context-based compared with the timestamp method.
Table 1. Normal log example.

#  | Log Content                                   | Template ID | Pass(l)
1  | INF, State change notif received [CONNECTED]  | 1           | 7
2  | INF, State change notif received [CONNECTED]  | 1           | -
3  | INF, State change notif received [CONNECTED]  | 1           | -
4  | INF, State change notif received [CONNECTED]  | 1           | -
5  | INF, State change notif received [CONNECTED]  | 1           | -
6  | INF, State change notif received [CONNECTED]  | 1           | -
7  | INF, State change notif received [CONNECTED]  | 1           | -
8  | INF, SW Version: xxxxx                        | 2           | 1
9  | INF, currentImageType = package.01            | 3           | 1
Length: 9 lines, 3 templates
Table 2. Failed log example.

#  | Fail Log                                                | Template ID | Fail(l)
1  | WRN, BlackBoxLoggingService MemoryItem::update          | 4           | 1
2  | WRN, BlackBoxLoggingService get entry value             | 5           | 2
3  | WRN, BlackBoxLoggingService get entry value             | 5           | -
4  | INF, State change notif received [FAILED]               | 1           | 4
5  | INF, State change notif received [CONNECTED]            | 1           | -
6  | INF, Recovery action requested                          | 6           | 1
7  | INF, State change notif received [CONNECTED]            | 1           | -
8  | INF, State change notif received [CONNECTED]            | 1           | -
9  | INF, Received reset request because of node2 failure.   | 7           | 1
10 | INF, SW Version: xxxxx                                  | 2           | 1
11 | INF, currentImageType = package.01                      | 3           | 1
Length: 11 lines, 7 templates
Table 3. Not normalized Log-based Fault Localization.

# | Template ID | Fail(l) | Pass(l) | Susp | Rank
1 | 1           | 4       | 7       | 0.36 | 3
2 | 2           | 1       | 1       | 0.5  | 2
3 | 3           | 1       | 1       | 0.5  | 2
4 | 4           | 1       | 0       | 1.0  | 1
5 | 5           | 2       | 0       | 1.0  | 1
6 | 6           | 1       | 0       | 1.0  | 1
7 | 7           | 1       | 0       | 1.0  | 1
Table 4. Example of disk read failure leading to database connection error.

# | Log Line                                             | Template ID | Suspiciousness Score
1 | ERR, Failed to read from disk                         | 15          | 1.0
2 | WRN, Retrying disk read operation                     | 16          | 0.9
3 | WRN, Retrying disk read operation                     | 16          | 0.9
4 | INF, Continuing execution                             | 5           | 0.1
5 | INF, Processing other tasks                           | 6           | 0.1
6 | ERR, Failed to connect to database (no credentials)   | 17          | 1.0
Table 5. Log template reduction results.

Example   | Initial Templates (Timestamp) | Final Templates (Context-Based) | Reduction Percentage (%)
Example 1 | 33                            | 18                              | 45.45
Example 2 | 454                           | 93                              | 79.51
Example 3 | 22                            | 15                              | 27.65
Table 6. Log line reduction results.

Example   | Initial Log Lines (Timestamp) | Final Log Lines (Context-Based) | Reduction Percentage (%)
Example 1 | 455                           | 35                              | 92.31
Example 2 | 3576                          | 320                             | 91.05
Example 3 | 47                            | 34                              | 27.65