After assessing the identified literature and categorizing both the approaches and goals, we can answer the questions raised at the beginning of our study. The subsections below summarize and interpret the knowledge from the previous section.
5.1. RQ1: Methods and Techniques Used
Several common trends emerge among the types of methods. Of the two main methods of analysis, dynamic analysis is used more often than static analysis. Dynamic techniques make several approaches feasible that cannot be realized via static techniques, for example, performance analysis and optimization techniques as well as other metrics-based analyses [
12,
32,
36,
39,
40]. A specific subset of dynamic techniques is commonly applied to fault analysis and root cause analysis: log analysis and execution trace analysis are well suited to this task, as they examine traces directly related to program execution [
35,
40,
42,
43,
44].
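To make this family of techniques concrete, the sketch below (our own toy illustration with hypothetical service names and log layout, not a tool from the surveyed papers) groups log records by trace ID and walks each trace to the first service that reported an error, a simple form of log-based root cause analysis:

```python
from collections import defaultdict

# Toy log records: (trace_id, service, level, message).
# The field layout is hypothetical; real systems would parse
# structured logs emitted by a tracing or logging pipeline.
LOG = [
    ("t1", "gateway",  "INFO",  "request received"),
    ("t1", "orders",   "INFO",  "creating order"),
    ("t1", "payments", "ERROR", "card declined"),
    ("t1", "orders",   "ERROR", "order failed"),
    ("t2", "gateway",  "INFO",  "request received"),
]

def first_failures(log):
    """For each trace, report the first service that logged an ERROR.

    Log order is assumed to follow causal order within a trace, so the
    earliest ERROR is a plausible root-cause candidate.
    """
    by_trace = defaultdict(list)
    for trace_id, service, level, message in log:
        by_trace[trace_id].append((service, level, message))
    causes = {}
    for trace_id, records in by_trace.items():
        for service, level, message in records:
            if level == "ERROR":
                causes[trace_id] = (service, message)
                break
    return causes

print(first_failures(LOG))  # {'t1': ('payments', 'card declined')}
```

Here the later `orders` error is correctly skipped in favor of the upstream `payments` failure, which is the essence of tracing a fault back along the execution path.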
Most papers that used static analysis as their main method did so to gather information specifically about the architecture. These techniques analyze statically defined artifacts to reconstruct an architectural view of a system, mainly source code [
16,
26,
28,
30], but also other artifacts, such as OpenAPI specifications [
15]. Static analysis is applicable to other goals as well, such as anti-pattern or code smell detection [
26,
28,
29].
An approach used less often is that of model-based analysis, in which a specific model is built to represent the microservice system. This can range from modeling dependencies and architecture of microservices [
56,
57,
62] to developing security models [
59] or performance [
60] and resilience models [
58,
61].
Like model-based techniques, graph-based analysis depends on representing the microservice system as a graph and then analyzing that structure, exploiting microservices’ natural graph-like connections. Graph-based techniques are commonly used for detecting faults or performing root cause analysis [
47,
52,
54,
55], as well as performing monolith-to-microservice migration by representing an existing monolith as a graph that can be segmented into microservices [
23,
24,
48,
49]. However, it can also be used in tracing patterns [
27,
28]. Finally, graph-based methods are also used in monitoring and visualization systems [
2,
26,
53].
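As a minimal illustration of graph-based analysis (our own sketch with hypothetical services, not taken from any cited tool), the snippet below builds a directed call graph from observed service-to-service calls and checks for cyclic dependencies, a structural smell that graph-based detectors commonly look for:

```python
# Hypothetical observed calls (caller, callee); real tools would derive
# these edges from execution traces or statically from source code.
CALLS = [
    ("gateway", "orders"),
    ("orders", "payments"),
    ("payments", "orders"),   # cycle: orders <-> payments
    ("orders", "inventory"),
]

def build_graph(calls):
    """Adjacency-set representation of the service dependency graph."""
    graph = {}
    for caller, callee in calls:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())
    return graph

def has_cycle(graph):
    """Depth-first search with node coloring: detect any directed cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:        # back edge -> cycle found
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

g = build_graph(CALLS)
print(has_cycle(g))  # True: orders and payments call each other
```

The same graph structure supports the other uses mentioned above, such as segmenting a monolith's dependency graph or feeding a visualization layer.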
Another approach is pattern-based analysis, which mostly employs static analysis to recognize specific patterns in the source code [
27,
28,
29,
31] and architectural patterns [
30]. However, it can use graphs as well [
27,
28]. This approach also involves mathematical model checking techniques to evaluate microservice systems [
30,
61,
62].
The summary of our findings relating to RQ1 is shown in
Table 5.
5.2. RQ2: Goals Addressed by the Papers
The main goal we found was that of fault analysis, either by detecting and preventing faults [
2,
30,
41,
42,
44,
45,
46,
53], or determining their origin through root cause analysis [
43,
47,
52,
54,
55]. Identifying the cause is clearly important, especially given that microservices run large enterprise systems. Failures in such systems have major economic impacts, whether users cannot access the system or cloud resource demands peak due to an error. Still, performing detection in real time is difficult, and many challenges remain, as we discuss later when answering RQ4.
Another common topic addressed was migration to microservice-based architectures either by providing tools to decompose monolithic architectures into microservices [
15,
16,
19,
32,
50,
56] or to assist in deciding if migration is beneficial and feasible [
21,
48,
49,
64]. Although one might expect a silver bullet for the design of microservice systems by now, there is no perfect guidance for engineers managing legacy systems.
Identifying or resolving technical debt in a microservice architecture [
29,
39,
63] is an interesting research direction of significant impact. We expect this specific research to grow significantly in the next few years as consequences of system evolution impact the operational budget.
Other goals involve analysis of the software architecture [
14,
28]. Several works discuss the inherent challenges in analyzing microservices versus other architectures such as monolithic systems or SOA [
1,
63]. Several works propose methods of software architecture reconstruction on microservice systems [
26,
34,
57], which is especially important to help reason about the system. Software architecture can be represented through various views that represent the system slightly differently, meeting the needs of distinct stakeholders.
Overlapping with technical debt and architecture analysis is the goal of streamlining and protecting the process of microservice evolution. Works performing SAR [
26,
34,
57] combat evolutionary problems by providing the user with an up-to-date view of system architecture, giving insights that can prevent architectural degradation. Antipatterns and code smells that threaten sustainability have also been addressed [
27,
28,
29,
31], as has the issue of accumulating technical debt [
63,
65]. In addition, some works improve the evolutionary outlook of a system by addressing maintainability metrics [
39] or performance degradation [
40].
It is quite common to see research addressing various quality concerns of microservices such as security [
30,
59] or performance [
12,
35,
36,
39,
40,
58]. We must recall that software architecture is the frame within which various software qualities are realized. For some, performance may be the primary reason to adopt cloud-native systems; at the same time, security cannot be omitted. Many other qualities are not addressed directly, such as maintainability, which drives the whole category of evolution.
Finally, many works [
1,
14,
63,
64,
65] have surveyed the state of the art on related topics.
Table 6 summarizes specific goals assessed in this study.
5.3. RQ3: Relationship between Microservices and Other Architectures
We found that microservices differ from other architectures more than they resemble them. Works considering related architectures are referenced in
Table 7. The most closely related architectural style is Service-Oriented Architecture (SOA), because microservice architectures borrow many characteristics from SOA, such as the emphasis on scalability and the concept of service availability and responsiveness. Unlike in SOA, however, individual services in microservice architectures are brought to production independently and are more granular [
63]. It is also worth noting that while graph-based methods of performing root cause analysis can be used in both microservice architectures and SOA [
47], many traditional methods of code analysis are not sufficient for microservice architectures [
14].
The greatest concern raised against static-code analysis is that microservices are distributed. Static-code analysis is common for monolith-like systems, which are homogeneous and often come with a single codebase; microservices, by contrast, are heavily distributed and heterogeneous, so the argument might sound valid. Still, static-code analysis can be applied to individual system modules, and with sufficient cross-platform support it is possible to derive a holistic view of a microservice system's architecture solely through static-code analysis [
26].
However, what seems lacking in nearly all works we assessed is consideration of other recent architectural advancements such as serverless or micro-frontends [
64]. Of course, one can object that this study searched for microservices, but the argument by Auer et al. [
64] remains valid: if researchers and practitioners do not understand the benefits of these new architectural advancements, they may decrease productivity or increase technical debt. This could be seen as similar to using SOA to develop a new system.
5.4. RQ4: Future Research Directions
We have assessed the identified literature for future work and open challenges. This section mentions our observations on static analysis, dynamic analysis, anomaly detection, prediction, migration from monoliths to microservices, and other topics we found open for research, such as visualization, architecture evolution, and benchmarks.
The first significant conclusion we draw is that all forms of static analysis seem under-represented in the current literature. The current focus lies on forms of dynamic analysis, leaving a gap for approaches targeting statically defined artifacts. Static analysis can be performed earlier in the development pipeline, as the system does not need to be deployed for the analysis to take place, and it is less prone to the false positives that may plague dynamic sources of information. However, the greatest challenge is to cope with the distributed nature of these systems and the heterogeneity of their modules, which may involve distinct platforms, different versions and dependencies, or different development styles. Thus, future tools cannot naively consider only Java, as is the case for many works; rather, a broad spectrum of languages such as Python, NodeJS, Go, C++, etc., must be considered as well.
The large amount of research involving graph-based architectural analysis has opened up many new avenues for future research. One such avenue could be using graph-based static-code analysis to identify the potential for architectural degradation early on. Representing the system as a graph provides a needed level of abstraction. Such an approach could be beneficial when resolving degradation, which is difficult to prevent, or at least it could identify degradation early on.
With respect to dynamic analysis, we found multiple obstacles. Collecting metrics introduces considerable overhead. A huge number of traces is produced at runtime, which makes it challenging to capture the required information in real time. In particular, the trace data need to be efficiently processed into aggregated trace representations at different levels of detail, while detailed information on specific traces might need to remain available on demand. Even then, the data must be stored and analyzed. Thus, research must consider tracing microservices at a massive scale; this is rarely the case because of the lack of benchmarks, which we mention later. Researchers questioned which metrics should be monitored and whether the metrics' accuracy or their impact on performance has been considered at large scales.
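As a rough sketch of the aggregation idea discussed above (our own simplification with hypothetical span data, not a tool from the surveyed papers), raw spans can be collapsed into compact per-service summaries while only the slow traces are retained for on-demand inspection:

```python
from collections import defaultdict

# Hypothetical raw spans: (trace_id, service, duration_ms).
SPANS = [
    ("t1", "gateway", 12.0),
    ("t1", "orders", 48.0),
    ("t2", "gateway", 9.0),
    ("t2", "orders", 310.0),   # slow outlier worth keeping on demand
]

def aggregate(spans, slow_ms=100.0):
    """Collapse spans into per-service stats; retain only slow trace IDs."""
    stats = defaultdict(lambda: {"count": 0, "total_ms": 0.0, "max_ms": 0.0})
    slow_traces = set()
    for trace_id, service, duration in spans:
        entry = stats[service]
        entry["count"] += 1
        entry["total_ms"] += duration
        entry["max_ms"] = max(entry["max_ms"], duration)
        if duration > slow_ms:
            slow_traces.add(trace_id)
    return dict(stats), slow_traces

stats, slow = aggregate(SPANS)
print(stats["orders"]["count"], sorted(slow))  # 2 ['t2']
```

The trade-off sketched here, cheap aggregates for monitoring plus selective retention of interesting traces, is one way to keep real-time processing feasible at scale.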
Dynamic analysis is often involved in anomaly detection and especially root cause analysis. In this context, Brandon et al. [
47] highlighted the need to take a time dimension into account, where the evolution of the system during a time window can be compared instead of single snapshots. What remains a challenge, however, is the comparison between the generated graphs representing the snapshots. This can be parallelized, but the search space and the response time need to be reduced for real-time processing. They proposed one way to address this, transforming graphs into vectors to feed machine learning models for anomaly detection, but this was just their vision for future work. Multiple works suggested the use of machine learning in this context, while statistical analysis can also be used for root cause analysis. Yet we found little research using machine learning as a primary method for microservice analysis. The only notable work that used machine learning as a primary method of analysis was written by Jin et al. [
43], where a Robust Principal Component Analysis algorithm was used alongside dynamic analysis to detect anomalies. Zhou et al. [
45] briefly mention the potential of machine learning-based algorithms for improving fault localization. Beyond these articles, there were no other notable instances of machine learning used as a primary method.
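For intuition only, the toy detector below flags metric values that deviate strongly from a service's historical mean. This z-score heuristic is far simpler than the Robust Principal Component Analysis of Jin et al., and the latency figures are invented, but it illustrates the general statistical idea behind metric-based anomaly detection:

```python
import statistics

def anomalies(history, recent, threshold=3.0):
    """Flag recent samples more than `threshold` standard deviations
    away from the historical mean (a simple z-score test)."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return [x for x in recent if abs(x - mean) > threshold * stdev]

# Hypothetical per-minute latencies (ms) for one service.
history = [20, 22, 19, 21, 20, 23, 18, 20]
print(anomalies(history, [21, 24, 95]))  # [95]
```

Real approaches must additionally cope with seasonality, correlated metrics, and concept drift, which is precisely where the more robust learning-based methods come in.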
One outstanding challenge is prediction. For instance, since tuning configuration parameters can improve latency [
36], researchers could look at the prediction of calls. Predicting future system calls could anticipate possible faults, and comparing predicted calls with actual ones could reveal specific patterns. Predictive models could help detect potential system bottlenecks and system capacity saturation, enabling timely reactions to better handle such situations [
12].
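A minimal illustration of call prediction (our own toy with hypothetical service names, not a surveyed approach) is a first-order Markov model over observed call sequences; a deviation from the most likely next call could then be inspected as a potential fault pattern:

```python
from collections import Counter, defaultdict

# Hypothetical observed call sequences, one per request.
SEQUENCES = [
    ["gateway", "auth", "orders", "payments"],
    ["gateway", "auth", "orders", "inventory"],
    ["gateway", "auth", "orders", "payments"],
]

def train(sequences):
    """Count observed transitions between consecutive calls."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            transitions[current][nxt] += 1
    return transitions

def predict_next(transitions, current):
    """Most frequently observed successor of `current`, or None."""
    counts = transitions.get(current)
    return counts.most_common(1)[0][0] if counts else None

model = train(SEQUENCES)
print(predict_next(model, "orders"))  # 'payments' (seen 2 of 3 times)
```

More capable predictors (e.g., sequence models) would be needed in practice, but the comparison of predicted versus actual calls works the same way.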
A trendy research topic is migration from monolithic applications to microservices. Since microservice-based applications have many desirable characteristics, a wide array of tools and methods that aid the migration process would be extremely desirable. Further research into accurately automating the decomposition process with minimal impact, by identifying potential candidate microservices, would be of extreme significance for industry. Current automated decomposition methods are either inaccurate or can have significant repercussions on the system, while human analysis is extremely time-consuming and sometimes leads nowhere. A better automated tool would allow companies to focus resources on further developing and improving their architecture.
One specific problem in microservice migration has many names and deals with accurate system decomposition, service boundaries, or proper service cuts. Some suggest that there are too many characteristics (e.g., non-functional requirements [
50]) to take into account and propose delegating the task to artificial intelligence. However, it can also be seen as a dynamic problem of continuous microservice system re-modularization. The situation is complicated by new advancements such as serverless or micro-frontends: should they be considered, or does migration from monoliths ultimately lead to technical debt? Perhaps the core challenge is understanding how to develop microservice systems. Alternatively, perhaps we need better evaluation metrics to answer this challenge.
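As a crude sketch of graph-based decomposition (illustrative only, with invented class names and weights; real approaches use far more sophisticated clustering and consider non-functional requirements), one can drop weakly weighted dependency edges in a monolith's class graph and treat the remaining connected components as candidate service cuts:

```python
from collections import defaultdict

# Hypothetical weighted dependencies between monolith classes:
# (class_a, class_b, call_count). Higher weight = tighter coupling.
EDGES = [
    ("Order", "OrderItem", 12),
    ("Order", "Invoice", 10),
    ("User", "Profile", 8),
    ("Order", "User", 1),     # weak link across candidate boundaries
]

def candidate_services(edges, min_weight=2):
    """Drop edges below `min_weight`; return connected components
    of the remaining undirected graph as candidate microservices."""
    graph = defaultdict(set)
    nodes = set()
    for a, b, weight in edges:
        nodes.update((a, b))
        if weight >= min_weight:
            graph[a].add(b)
            graph[b].add(a)
    seen, components = set(), []
    for node in sorted(nodes):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        components.append(component)
    return components

print(candidate_services(EDGES))
# Two candidates: {Order, OrderItem, Invoice} and {User, Profile}
```

The choice of threshold here stands in for the hard part: deciding which couplings genuinely mark a service boundary, which is exactly where better evaluation metrics are needed.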
One of the topics with great potential is software architecture reconstruction, especially its automation. Current approaches use static or dynamic analysis, but combining the two seems inevitable, since one copes well with decentralization and the other provides a white-box view. The reconstructed architecture can help with consistency checking and serve as the core artifact to refer to when dealing with evolution and technical debt. One challenge is proper visualization of the architecture or its perspectives.
In the context of visualization, researchers called for proper execution trace visualization and improved fault localization. Others called for visualization of data flow across the system, recognizing reads and writes. Furthermore, visualization could support modeling and simulation of an actual microservice-based application.
With respect to technical debt, metrics for measuring debt are needed to quantify costs and benefits and to support prioritization and decision-making. More investigation is needed into the relationship and composition between microservice availability tactics and microservice patterns. Some even take on the ambitious future goal of defining an exhaustive and uniform catalog of microservice antipatterns.
Finally, one of the greatest deficiencies in related research is the lack of benchmarks [
45]. We need more microservice data sets to test the systems. These specifically need to represent industrial settings [
45]. Furthermore, to support advancement, researchers should unite and develop theme-specific benchmarks, such as a unified benchmark for fault injection and anomaly detection against which approaches can be compared easily, similar to what is common in other disciplines.