**Citation:** Avdeenko, T.; Serdyukov, K. Modified Evolutionary Test Data Generation Algorithm Based on Dynamic Change in Fitness Function Weights. *Eng. Proc.* **2023**, *33*, 23. https://doi.org/10.3390/engproc2023033023

Academic Editors: Askhat Diveev, Ivan Zelinka, Arutun Avetisyan and Alexander Ilin

Published: 13 June 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Applied Mathematics and Computer Science Department, Novosibirsk State Technical University, 20 Karla Marksa Ave., 630073 Novosibirsk, Russia; avdeenko@corp.nstu.ru

**1. Introduction**

Testing is one of the most complex and time-consuming processes in software development, and it is necessary to ensure the high quality of the developed product [1]. Automating the testing process, or at least its subprocesses, is therefore an important research task. One important subprocess of testing is test data generation. An analysis of existing studies, methods and approaches to automatic test data generation [2–4] has shown that software development mainly relies on a blind strategy of random data generation. At the same time, the literature describes approaches whose development and application can significantly improve the quality of generated test cases, expressed as the degree of coverage of the code under test [5,6].

Among such advanced approaches to test data generation, static methods of symbolic analysis of program code were historically the first [7,8]. In such analysis, test data generation reduces to automatically constructing and solving, in symbolic form, a system of equations and inequalities obtained by the logical union and intersection of all conditions of the software under test (SUT). The undoubted advantage of the static approach is that it obtains results in symbolic form, which makes it possible to analytically determine subareas of test-case values that guarantee that computation passes through the given parts of the code.

However, the applicability of the static approach is significantly limited by the computational complexity of symbolic computations, even for problems of relatively small dimensionality. Therefore, a dynamic approach, based on actually executing the SUT with specially generated values of the input variables and then analyzing the data flows, is currently more realistic and efficient for practical use in software development companies.

The most promising methods for implementing a dynamic approach to test data generation are evolutionary optimization methods [9–11]. The evolutionary paradigm underlying the genetic algorithm (GA) starts from a set of random test data generated at the initial stage, after which the data "evolve" step by step to improve the coverage of the code under test. This suggests that the GA can be adapted to the evolutionary improvement of test data with the goal of maximizing the coverage of the code under test. However, existing research focuses on solving local problems, such as finding a specific set of test data that covers some given statements. At the same time, for the practical task of comprehensive software testing, it is important to develop methods for generating test sets that provide maximum coverage of the entire SUT, taking into account its complex multi-connected structure.

The paper is organized as follows. Section 1 introduces the problem. Section 2 describes the use of the GA for test data generation. In Section 3, we formulate the fitness function for achieving maximum coverage. In Section 4, we propose a modification of the algorithm with dynamic weight assignment. Section 5 presents the results, and Section 6 concludes the paper.

### **2. Theoretical Background**

The genetic algorithm [12–14] works iteratively, performing several consecutive steps at each iteration until the completion conditions are reached. With each new iteration, the GA creates a new generation of the population based on the previous (parent) one. The main GA cycle for test data generation consists of an initial stage, in which a random population is generated, followed by stages that are performed iteratively until a given coverage value or number of generations is reached. In the crossover stage, each variable of the offspring is formed as a weighted combination of the corresponding variables of the two parents, with blending coefficient $\beta_i$:

$$var_i^{offspring} = \beta_i \times var_i^{mother} + (1 - \beta_i) \times var_i^{father}, \quad i = \overline{1, N}.$$
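The crossover formula can be illustrated with a minimal Python sketch. The function name `blend_crossover` and the choice to draw each $\beta_i$ uniformly from [0, 1) are assumptions made for illustration; the text does not state how $\beta_i$ is chosen.

```python
import random


def blend_crossover(mother, father, rng=random):
    """Return one offspring whose variables are convex blends of the parents' variables."""
    if len(mother) != len(father):
        raise ValueError("parents must have the same number of variables")
    offspring = []
    for var_m, var_f in zip(mother, father):
        # Assumption: beta_i is drawn uniformly from [0, 1) for every variable.
        beta = rng.random()
        offspring.append(beta * var_m + (1.0 - beta) * var_f)
    return offspring


mother = [10.0, -3.0, 0.5]
father = [2.0, 7.0, 0.5]
child = blend_crossover(mother, father)
```

With $\beta_i$ in [0, 1], every offspring variable lies between the two parental values, so two parents that already agree on a variable always pass that value on unchanged.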


After all GA stages have been executed, the completion conditions are checked; if they are not met, the process proceeds to the next iteration. The iterative nature of the GA is what allows new solutions to be obtained: each new generation is formed from the previous one, i.e., the test sets of the previous generation participate in the formation of new sets, thus providing the "evolution" of previously obtained solutions.

#### **3. Multi-Path Algorithm for Maximum Code Coverage**

Input variables of the code under test are the variables $var_i$, $i = \overline{1, N}$, which are either part of an input statement or input parameters of procedures and functions, and which initiate computation along a certain code path. We can therefore describe the vector of input variables as $(var_1, var_2, \dots, var_N)$ and the entire definition area as $D = D_1 \times D_2 \times \dots \times D_N$, where $D_i$ is the definition area of the input variable $var_i$. A chromosome $\mathbf{x}_i \in D$ is then represented by an *N*-dimensional vector $\mathbf{x}_i = [var_1^i, var_2^i, \dots, var_N^i]$.

The purpose of automatic test data generation is to find a set of test cases $\{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_m\}$ that initiate execution along a given set of reachable paths, i.e., the paths that can be covered by the test sets. The main coverage criterion is statement coverage [15]. We introduce the notation $g(\mathbf{x}_i)$ for a vector that indicates the statement coverage initiated by a certain test set $\mathbf{x}_i$:

$$g(\mathbf{x}_i) = (g_1(\mathbf{x}_i), g_2(\mathbf{x}_i), \dots, g_n(\mathbf{x}_i)),$$

where *n* is the number of statements of the SUT, and

$$g_j(\mathbf{x}_i) = \begin{cases} 1 & \text{if the path initiated by the set } \mathbf{x}_i \text{ passes through statement } j, \\ 0 & \text{otherwise.} \end{cases}$$

If we denote the vector of statement weights of the SUT as $(w_1, w_2, \dots, w_n)$, then we can define the fitness function for a single chromosome $\mathbf{x}_i$ as follows:

$$F(\mathbf{x}_i) = \sum_{j=1}^{n} w_j g_j(\mathbf{x}_i), \tag{1}$$

where $w_j$ is the weight of statement *j*, $g_j$ is the value of the coverage indicator, and *n* is the number of statements.
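As a small worked example of Formula (1), the sketch below builds the coverage vector and sums the weights of the covered statements. The helper names and the sample weights are hypothetical, and the set of covered statement indices is assumed to come from some instrumentation of the SUT that is not shown here.

```python
def coverage_vector(covered, n):
    """g(x_i): 1 for every statement index (0-based) on the executed path, 0 otherwise."""
    return [1 if j in covered else 0 for j in range(n)]


def fitness_f1(weights, g):
    """Formula (1): sum of the weights of the statements covered by the test case."""
    return sum(w * g_j for w, g_j in zip(weights, g))


# Illustrative values only: five statements, the path executes statements 0, 1 and 3.
weights = [1, 1, 2, 4, 2]
g = coverage_vector({0, 1, 3}, n=5)
f1 = fitness_f1(weights, g)  # 1 + 1 + 4
```

Here the test case earns the weights of statements 0, 1 and 3 only, so a path through heavily weighted statements scores higher than a longer path through lightly weighted ones.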

The greater the sum of the weights of the statements executed on the path initiated by the test case $\mathbf{x}_i$, the greater the value of the fitness function $F(\mathbf{x}_i)$. To ensure greater population diversity, a component is added to Formula (1) that takes into account the remoteness of paths from each other. The remoteness of paths is defined through a similarity operation. To calculate the *j*-th similarity coefficient $\text{sim}_j(\mathbf{x}_{i_1}, \mathbf{x}_{i_2})$ of two chromosomes $\mathbf{x}_{i_1}$ and $\mathbf{x}_{i_2}$, we check whether the *j*-th statement of the SUT, whose coverage is marked by the indicator $g_j$, lies at the intersection of both paths initiated by these test cases:

$$\text{sim}_j(\mathbf{x}_{i_1}, \mathbf{x}_{i_2}) = \overline{g_j(\mathbf{x}_{i_1}) \oplus g_j(\mathbf{x}_{i_2})}, \quad j = \overline{1, n}, \tag{2}$$

where the logical operations "negation" (NOT, denoted by the overline) and "exclusive OR" (XOR, ⊕) are used.

The more covered statements the two paths have in common, the greater the similarity value between the chromosomes. The following formula defines the similarity between two chromosomes as the weighted sum of the similarity coefficients over all code statements:

$$\text{sim}(\mathbf{x}_{i_1}, \mathbf{x}_{i_2}) = \sum_{j=1}^{n} w_j \times \text{sim}_j(\mathbf{x}_{i_1}, \mathbf{x}_{i_2}). \tag{3}$$
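Formulas (2) and (3) can be sketched in a few lines of Python. The function names and sample vectors are illustrative assumptions; the coverage vectors are taken as lists of 0/1 integers.

```python
def sim_j(g1_j, g2_j):
    """Formula (2): NOT (g1_j XOR g2_j), i.e. 1 when both indicators agree."""
    return 1 - (g1_j ^ g2_j)


def sim(weights, g1, g2):
    """Formula (3): weighted sum of the per-statement similarity coefficients."""
    return sum(w * sim_j(a, b) for w, a, b in zip(weights, g1, g2))


# Illustrative values only.
weights = [1, 1, 2, 4, 2]
g_a = [1, 1, 0, 1, 0]
g_b = [1, 0, 0, 1, 1]
similarity = sim(weights, g_a, g_b)  # statements 0, 2 and 3 agree: 1 + 2 + 4
```

Note that, as written, the negated XOR in Formula (2) also returns 1 when neither path covers statement *j* (statement 2 in this example), so two paths count as similar on every statement where their indicators agree.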

The similarity value between chromosome $\mathbf{x}_i$ and the other chromosomes of the population is calculated as

$$f_{sim}(\mathbf{x}_i) = \frac{1}{m-1} \sum_{s=1;\, s \neq i}^{m} \text{sim}(\mathbf{x}_s, \mathbf{x}_i), \tag{4}$$

where *m* is the number of chromosomes in the population.

Now, we can determine the average similarity value of paths in the entire population,

$$\overline{f_{sim}} = \frac{1}{m} \sum_{i=1}^{m} f_{sim}(\mathbf{x}_i), \tag{5}$$

and further formulate the additive component of the fitness function responsible for the diversity of paths in the population as the modulus of the difference between the average similarity of the population and the similarity of a particular chromosome:

$$F_2(\mathbf{x}_i) = \left| \overline{f_{sim}} - f_{sim}(\mathbf{x}_i) \right|. \tag{6}$$

The resulting fitness function for chromosome $\mathbf{x}_i$, taking into account the diversity of paths, is then calculated by the formula

$$F(\mathbf{x}_i) = F_1(\mathbf{x}_i) + k \times F_2(\mathbf{x}_i), \tag{7}$$

where $F_1(\mathbf{x}_i)$ and $F_2(\mathbf{x}_i)$ are defined by Formulas (1) and (6), respectively. Accordingly, the first component $F_1(\mathbf{x}_i)$ determines the complexity of the path initiated by the chromosome $\mathbf{x}_i$, and the second component $F_2(\mathbf{x}_i)$ determines the remoteness of this path from all other paths in the population. The parameter *k* sets the relative weight of the two components.
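The chain of Formulas (1) to (7) can be traced on a tiny population. The following Python sketch is only an illustration: the function name `population_fitness`, the three-statement SUT and the value *k* = 2 are assumptions chosen so that the arithmetic is easy to follow.

```python
def population_fitness(weights, coverage, k):
    """Return F(x_i) = F1(x_i) + k * F2(x_i) for every chromosome, per Formulas (1)-(7)."""
    m = len(coverage)
    if m < 2:
        raise ValueError("Formula (4) needs at least two chromosomes")

    def sim(g1, g2):
        # Formulas (2)-(3): weights of the statements on which both indicators agree.
        return sum(w * (1 - (a ^ b)) for w, a, b in zip(weights, g1, g2))

    # Formula (4): mean similarity of each chromosome to all the others.
    f_sim = [
        sum(sim(coverage[s], coverage[i]) for s in range(m) if s != i) / (m - 1)
        for i in range(m)
    ]
    mean_sim = sum(f_sim) / m                                            # Formula (5)
    f1 = [sum(w * g_j for w, g_j in zip(weights, g)) for g in coverage]  # Formula (1)
    f2 = [abs(mean_sim - fs) for fs in f_sim]                            # Formula (6)
    return [a + k * b for a, b in zip(f1, f2)]                           # Formula (7)


# Illustrative population: two chromosomes share a path, the third takes another one.
coverage = [[1, 1, 0], [1, 1, 0], [1, 0, 1]]
fitness = population_fitness([1, 1, 2], coverage, k=2)
```

In this example the similarity values are 2.5, 2.5 and 1.0 with a population mean of 2.0, so the chromosome on the less populated path receives the largest $F_2$ (1.0 versus 0.5) and the highest total fitness.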

Using Formula (7) as a fitness function leads to more diverse populations as a result of a single GA run. However, due to the use of a continuous version of the genetic algorithm in this research, the resulting diversity is not sufficient to fully cover the code within a single GA run.

The latter circumstance is related to the observed "swing effect", which arises from the presence of indistinguishable chromosomes in the population. If indistinguishable chromosomes have a high fitness value in one generation, they will be selected for crossover, and their offspring will, with high probability, also be indistinguishable from their parents. The new generation will then contain a greater number of indistinguishable chromosomes, and similarity in the population will depend more on them. For all these chromosomes, the value of the additive component $F_2$ of the fitness function will be reduced, and for chromosomes passing through other paths, it will be increased. Other chromosomes may now be selected for crossover and will in turn form a multitude of indistinguishable chromosomes in the next generation. Thus, the population is cyclically filled with indistinguishable chromosomes, which in the next generation leads to a decrease in their similarity value and, accordingly, to a decrease of $F_2$, thus reducing the priority of these chromosomes in the following iteration. A similar cycle repeats for different sets, and as a result, both path complexity ($F_1$) and the similarity-based component ($F_2$) cease to play an important role in the formation of a new population, and different sets are constantly shuffled without any real exploration of the solution space.

To exclude the "swing effect", we propose to use the indicator $\operatorname{ind}(\mathbf{x}_1, \dots, \mathbf{x}_i)$, which is the number of chromosomes from the set $\{\mathbf{x}_1, \dots, \mathbf{x}_{i-1}\}$ that are indistinguishable from the chromosome $\mathbf{x}_i$:

$$\overline{F}(\mathbf{x}_i) = F_1(\mathbf{x}_i) + \frac{1}{1 + \operatorname{ind}(\mathbf{x}_1, \dots, \mathbf{x}_i)} \times F_2(\mathbf{x}_i). \tag{8}$$

Indeed, the initial value *ind*(*x*1) = 0, because the set in which the indistinguishable chromosomes are identified is empty at the first step. At each subsequent step, the value *ind*(*x*1, ... , *xi*) can either increase by 1 if the next chromosome is indistinguishable from one of the previous ones, or keep the same value if the next chromosome is unique. This will allow chromosomes passing through different paths to be more evenly distributed throughout the population as a whole.
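A possible way to compute the indicator and apply Formula (8) is sketched below. Two points are assumptions rather than statements from the text: chromosomes are treated as indistinguishable when their coverage vectors are identical, and $\operatorname{ind}$ is taken literally as the count of earlier identical chromosomes. The $F_1$ and $F_2$ values are made up for illustration.

```python
def ind_counts(coverage):
    """For each chromosome, count the earlier chromosomes with an identical coverage vector."""
    seen = {}
    counts = []
    for g in coverage:
        key = tuple(g)  # assumption: identical coverage vector means "indistinguishable"
        counts.append(seen.get(key, 0))
        seen[key] = seen.get(key, 0) + 1
    return counts


def damped_fitness(f1, f2, counts):
    """Formula (8): F1 + F2 / (1 + ind), so repeated paths gain less from F2."""
    return [a + b / (1 + c) for a, b, c in zip(f1, f2, counts)]


# Illustrative values only.
coverage = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 1, 0]]
counts = ind_counts(coverage)
fitness = damped_fitness([2, 2, 3, 2], [3.0, 3.0, 3.0, 3.0], counts)
```

The first chromosome on each path keeps its full $F_2$ contribution, while every further copy of the same path has its diversity bonus divided by 2, 3, and so on.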

#### **4. Modification of the Fitness Function Based on Dynamic Changes in Statements Weights**

The studies in [16–18] showed a relatively strong influence of the value of *k* on the coverage of the SUT. At *k* = 0, the coverage was minimal; it reached its maximum at *k* = 10, after which it began to decline. Choosing the right *k* can therefore significantly affect the final results. The value *k* = 10 obtained in those studies was optimal only for the tested programs; for others, it may not be optimal. Therefore, to make the algorithm more universal, it would be preferable to reduce the influence of *k*.

For this purpose, we propose a modification of the $F_1$ component of the fitness function (8), so that greater population diversity is achieved by this component alone. The idea of modifying the $F_1$ component is inspired by other evolutionary methods. Studies [19–21] use some of these methods, in particular Particle Swarm Optimization (PSO), which is one of the Swarm Intelligence algorithms. However, existing studies have applied PSO mainly to compare PSO and GA implementations rather than to hybridize the approaches [22]. Other representatives of this family are Ant Colony Optimization (ACO), the Artificial Bee Colony algorithm (ABC), Cuckoo Search (CS), and many other algorithms based on the collective interaction of different particles or agents.

The ACO [23] is one of the methods for solving pathfinding problems on graphs. It is based on simulating the behavior of an ant colony. Ants passing along certain paths leave a trail of pheromones behind them. The better the solution found, the more pheromones remain on the corresponding path. In the next generation, ants form their paths based on the amount of pheromones: the more pheromones on a path, the more ants are directed to that path and continue exploring it. In this way, the colony gradually explores the entire solution space, reaching better and better paths.

It is not possible to apply the ACO directly to the test data generation problem, because execution along a given path is initiated by particular test data, and the only way to change the path is to change the values of the test sets themselves. Nevertheless, the idea of using a "pheromone" model to prioritize paths could have a positive effect by providing more diversity in the population. Applied to the problem of increasing the diversity of test sets, the pheromone idea suggests dynamically (from generation to generation) increasing or decreasing the statement weights $w_j$, $j = \overline{1, n}$, depending on the number of chromosomes that passed through these statements in previous generations. The dynamic change in the statement weights can be represented as

$$\bar{w}_j^{(q)} = Ph_j^{(q)} w_j, \quad j = \overline{1, n}; \; q = \overline{1, Q}, \tag{9}$$

where $\bar{w}_j^{(q)}$ is the weight assigned to statement *j* in generation *q*, $Ph_j^{(q)}$ is the weight multiplier of statement *j* in generation *q* ($0 \le Ph_j^{(q)} \le 1$), and *Q* is the number of generations (iterations of the GA). Taking into account dependence (9), the dynamic variant of the $F_1$ component of the fitness function takes the form

$$F_1^{(q)}(\mathbf{x}_i) = \sum_{j=1}^{n} \bar{w}_j^{(q)} g_j(\mathbf{x}_i) = \sum_{j=1}^{n} Ph_j^{(q)} w_j g_j(\mathbf{x}_i); \quad q = \overline{1, Q}. \tag{10}$$

In expression (10), it is very important to determine how the multiplier $Ph_j^{(q)}$ ($0 \le Ph_j^{(q)} \le 1$) depends on its arguments, so that the statement weights in the fitness function respond promptly to statement coverage in previous generations. The resulting diversity of the population of test datasets, and hence the degree to which they cover the SUT, depends on the method chosen for varying $Ph_j^{(q)}$.
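Formula (10) differs from Formula (1) only in the per-statement multiplier. A minimal Python sketch with illustrative values follows; the function name `dynamic_f1` and the sample multipliers are assumptions.

```python
def dynamic_f1(weights, ph, g):
    """Formula (10): statement weights scaled by the current multipliers Ph_j in [0, 1]."""
    if not all(0.0 <= p <= 1.0 for p in ph):
        raise ValueError("every multiplier must lie in [0, 1]")
    return sum(p * w * g_j for p, w, g_j in zip(ph, weights, g))


# Illustrative values only: statements 0 and 3 have been covered often and are damped.
weights = [1, 1, 2, 4, 2]
g = [1, 1, 0, 1, 0]
ph = [0.5, 1.0, 1.0, 0.25, 1.0]
f1_dynamic = dynamic_f1(weights, ph, g)          # 0.5*1 + 1.0*1 + 0.25*4
f1_static = dynamic_f1(weights, [1.0] * 5, g)    # all multipliers 1 reproduces Formula (1)
```

With all multipliers equal to 1 the static value of Formula (1) is recovered, so the dynamic variant can only lower the contribution of a statement, never raise it above its base weight.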

Two basic strategies were proposed for the initial behavior of the multiplier $Ph_j^{(q)}$ as a function of the generation number *q*: the direct and the reverse strategy. In the direct strategy, we set $Ph_j^{(1)} = 0$ in the first generation, and the value then increases (or remains the same) depending on whether statement *j* was covered (or not covered) in the previous generation. In the reverse strategy, on the contrary, we set $Ph_j^{(1)} = 1$ in the first generation, and the value then decreases (or remains the same), depending on the coverage (or non-coverage) of statement *j*.

In both strategies, the multiplier can reach the boundaries of the interval [0, 1]. Thus, in the direct strategy, the value of $Ph_j^{(q)}$ increases monotonically, but after reaching the limit value $Ph_j^{(q)} = 1$ (which corresponds to the maximum priority of statement *j* in the fitness function), it must begin to decrease in order to redirect the algorithm toward other, still uncovered, statements. Then, after reaching the minimum possible value $Ph_j^{(q)} = 0$, which corresponds to excluding statement *j* from the fitness function, it starts increasing monotonically again, and so on. In the reverse strategy, the changes occur in the opposite order: first decreasing, then increasing, and so on.

This fluctuation of the multiplier $Ph_j^{(q)}$ between 0 and 1 can occur at different rates, given by the parameter $\Delta Ph$, which determines the total number of fluctuations within the interval [0, 1] during test data generation. Let $Trans_j^{(q)}$ be the number of complete passes from 0 to 1 or from 1 to 0 made by the multiplier $Ph_j^{(q)}$ up to the current generation *q*. Then, the behavior of the multiplier for the direct strategy can be written as

$$Ph_j^{(q)} = \begin{cases} 0 & \text{if } q = 1, \\ Ph_j^{(q-1)} + \Delta Ph \times (-1)^{Trans_j^{(q)}} & \text{if } \tilde{m}_j^{(q-1)} \neq 0, \\ Ph_j^{(q-1)} & \text{if } \tilde{m}_j^{(q-1)} = 0, \end{cases} \tag{11}$$

and the behavior of the multiplier for the reverse strategy is in the form

$$Ph_j^{(q)} = \begin{cases} 1 & \text{if } q = 1, \\ Ph_j^{(q-1)} - \Delta Ph \times (-1)^{Trans_j^{(q)}} & \text{if } \tilde{m}_j^{(q-1)} \neq 0, \\ Ph_j^{(q-1)} & \text{if } \tilde{m}_j^{(q-1)} = 0, \end{cases} \tag{12}$$

where $\tilde{m}_j^{(q-1)}$ is the number of chromosomes, in a population of *m* individuals, that covered statement *j* in generation (*q* − 1).
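The update rules (11) and (12) can be sketched for a single statement as follows. The function name `update_multiplier` is hypothetical, and two details are assumptions not spelled out in the text: the pass counter is incremented at the moment the multiplier reaches a boundary, and the value is clipped to [0, 1] to absorb floating-point drift. The rate 0.25 is chosen only because it is exact in binary arithmetic.

```python
def update_multiplier(ph_prev, trans, covered_count, delta, direct=True):
    """One generation step of Formula (11) (direct) or (12) (reverse) for one statement.

    Returns the new multiplier and the updated number of complete passes.
    """
    if covered_count == 0:
        return ph_prev, trans  # statement not covered: multiplier unchanged
    sign = 1 if direct else -1
    step = sign * delta * (-1) ** trans
    ph = min(1.0, max(0.0, ph_prev + step))  # assumption: clip to the interval [0, 1]
    # Assumption: a pass is complete as soon as the boundary in the step direction is hit.
    if (step > 0 and ph >= 1.0) or (step < 0 and ph <= 0.0):
        trans += 1
    return ph, trans


# Direct strategy, statement covered in every generation, illustrative rate 0.25.
ph, trans = 0.0, 0  # Ph = 0 in the first generation
history = [ph]
for _ in range(9):
    ph, trans = update_multiplier(ph, trans, covered_count=3, delta=0.25)
    history.append(ph)
```

In this trace the multiplier climbs from 0 to 1 in four covered generations, falls back to 0 in the next four, and then starts rising again, which is the fluctuation described above.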

This paper considers several methods for determining the rate parameter $\Delta Ph$. The *Half* method assumes that one full pass of the multiplier (from 0 to 1 or from 1 to 0) at the rate $\Delta Ph$ is completed when statement *j* is covered in half of the initially given number of generations *Q* (*Q*/2); the *Quarter* method, in a quarter of the generations (*Q*/4); and the *Tenth* method, in one tenth of all generations (*Q*/10). The fewer generations (iterations) needed for one complete pass, the greater the rate parameter $\Delta Ph$, and the more often the multiplier fluctuates between the limit values 0 and 1. Table 1 presents the main indicators used to implement the proposed methods for varying the multiplier $Ph_j^{(q)}$ using a constant rate of change $\Delta Ph$. Direct strategy methods are marked with a plus sign (+), and methods with a reverse strategy are marked with a minus sign (−).


**Table 1.** Methods for determining the rate parameter $\Delta Ph$ of the multiplier $Ph_j^{(q)}$.
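One reading of the *Half*, *Quarter* and *Tenth* methods is that a full pass of length 1 is spread over *Q*/2, *Q*/4 or *Q*/10 covered generations, which gives $\Delta Ph$ = 2/*Q*, 4/*Q* and 10/*Q*. The sketch below computes these rates under that assumption; the function name `rate_parameter` is hypothetical, and the values are derived from the description above rather than taken from Table 1.

```python
def rate_parameter(total_generations, divisor):
    """Rate such that one full pass (0 to 1) takes total_generations / divisor covered steps."""
    if total_generations <= 0 or divisor <= 0:
        raise ValueError("both arguments must be positive")
    return divisor / total_generations


Q = 50  # number of generations, the value used in the experiments
rates = {
    "Half": rate_parameter(Q, 2),      # full pass in Q/2 = 25 covered generations
    "Quarter": rate_parameter(Q, 4),   # full pass in Q/4 = 12.5 covered generations
    "Tenth": rate_parameter(Q, 10),    # full pass in Q/10 = 5 covered generations
}
```

Under this reading, *Tenth* changes the multiplier five times faster than *Half*, so a statement covered in every generation swings between the limits five times as often.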

Another method of determining the multiplier $Ph_j^{(q)}$, which we call *Count*− (the method is based on the reverse strategy), changes it not by a constant value, but by a value that depends on the coverage intensity of statement *j* in the previous generation:

$$Ph_j^{(q)} = \begin{cases} 1 & \text{if } q = 1, \\ Ph_j^{(q-1)} \left(1 - \tilde{m}_j^{(q-1)} / m\right) & \text{if } \tilde{m}_j^{(q-1)} \neq 0, \\ 1 & \text{if } \tilde{m}_j^{(q-1)} = 0. \end{cases} \tag{13}$$

In contrast to the previously proposed methods, in *Count*− the value of $Ph_j^{(q)}$ decreases more strongly the more chromosomes in the previous generation covered statement *j*. There is no gradual increase in the multiplier in this case; instead, if the statement was not covered ($\tilde{m}_j^{(q-1)} = 0$), the maximum value $Ph_j^{(q)} = 1$ is set. Thus, frequently covered statements cease to play a significant role in the search for test sets, and the algorithm will mostly try to generate sets for as yet uncovered paths.
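The *Count*− rule of Formula (13) is short enough to sketch directly. The function name `count_minus` and the population size of 4 are illustrative assumptions.

```python
def count_minus(ph_prev, covered_count, population_size):
    """Formula (13): shrink the multiplier in proportion to coverage, reset to 1 if uncovered."""
    if not 0 <= covered_count <= population_size:
        raise ValueError("covered_count must lie between 0 and population_size")
    if covered_count == 0:
        return 1.0
    return ph_prev * (1.0 - covered_count / population_size)


# Illustrative trace for one statement in a population of m = 4 chromosomes.
m = 4
ph = 1.0  # Ph = 1 in the first generation
trace = [ph]
for covered in [1, 2, 0, 4, 4]:
    ph = count_minus(ph, covered, m)
    trace.append(ph)
```

In the trace, a statement covered by every chromosome drops to a multiplier of 0 and stays there for as long as it remains covered; it returns to 1 only after a generation in which no chromosome reaches it.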

#### **5. Results**

Let us compare the methods proposed above for determining the multiplier $Ph_j^{(q)}$ on the test program SUT2 described in [24]. Figure 1 compares the average coverage for different values of the parameter *k* of the fitness function (8), in which $F_1$ is calculated either by Formula (1), i.e., without modification, or by Formula (10) with the modification methods *Half*+, *Quarter*+, *Tenth*+ and *Count*−. The average coverage is calculated over 1500 runs. The GA parameters are set to *Q* = 50 and *m* = 25, at which full coverage is achieved relatively rarely.

In Figure 1, the average coverage obtained with Formula (8) (the static method) is shown in red, the methods of monotonic change in *Ph* based on the direct strategy are shown in shades of blue, and the *Count*− method is shown in black. The methods for determining the parameter $\Delta Ph$ based on the reverse strategy are not presented in the figure, but, in general, they have approximately similar values of average coverage.

**Figure 1.** Comparison of different modifications (*Q* = 50, *m* = 25).

Each of the proposed methods for determining the multiplier $Ph_j^{(q)}$ showed a higher average coverage than the static method (without modification) for each of the *k* values. Figure 1 shows that for the static method, the average coverage gradually increases with increasing *k*, whereas with the modifications, the maximum average coverage is already reached at *k* = 2 and does not decrease thereafter. Among all the proposed methods, *Count*− showed the best result, which is why this method will be used in further research on this modification of the fitness function.

The results presented in Figure 1 allow us to conclude that the modification significantly increases the coverage even without using the previously determined optimal value of *k* = 10. Moreover, even at *k* = 0, i.e., without the additive component $F_2$ of the fitness function, a higher coverage is achieved than without modification. Figure 2 compares the average coverage without modification and with the best modification method, *Count*−, for the algorithm parameters *Q* = 50, *m* = 25.

**Figure 2.** Comparison of coverage with and without modification (*Q* = 50, *m* = 25).

Thus, the use of the modification makes it possible to increase the average coverage, which is especially noticeable at *k* = 0. More importantly, the maximum coverage is achieved when using any non-zero value of *k*, i.e., the ratio parameter of the fitness function components *k* ceases to play a significant role in achieving the maximum coverage. As a result, the proposed modification based on the dynamic change in statement weights makes it possible to increase code coverage when generating test sets, as well as eliminate the need to determine the *k* value for each individual SUT.

#### **6. Conclusions**

The paper proposes a modification of the method for generating test data for multiple paths in a single GA run. The initial problem, namely the need to determine the value of the ratio parameter of the fitness function components, is solved by dynamically changing the weights of statements between generations. The methods proposed in the paper eliminate the need to define this parameter for each individual program, and one of the methods, *Count*−, achieves greater coverage even if the ratio parameter value is zero. Therefore, not only is the original goal of the modification achieved, but the diversity of the generated test cases also increases, so that overall coverage improves as well.

**Author Contributions:** Conceptualization, T.A.; methodology, T.A.; software, K.S.; investigation, K.S.; validation, T.A.; writing—original draft preparation, T.A. and K.S.; writing—review and editing, T.A.; visualization, K.S.; supervision, T.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Ministry of Science and Higher Education of Russian Federation (project No. FSUN-2020-0009).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


