*Article* **Gene-Similarity Normalization in a Genetic Algorithm for the Maximum *k*-Coverage Problem**

**Yourim Yoon <sup>1</sup> and Yong-Hyuk Kim <sup>2,</sup>\***


Received: 28 February 2020; Accepted: 30 March 2020; Published: 2 April 2020

**Abstract:** The maximum *k*-coverage problem (MKCP) is a generalized covering problem which can be solved by genetic algorithms, but their operation is impeded by redundancy in the representation of MKCP solutions. We introduce a normalization step for candidate solutions, based on the distance between genes, which ensures that standard crossovers such as uniform and *n*-point crossover produce feasible solutions and which improves solution quality. We present results from experiments in which this normalization was applied to a single crossover operation, as well as results for benchmark MKCP instances.

**Keywords:** maximum *k*-coverage; redundant representation; normalization; genetic algorithm

#### **1. Introduction**

The maximum *k*-coverage problem (MKCP) is regarded as a generalization of several covering problems. The problem has a range of applications in combinatorial optimization such as scheduling, circuit layout design, packing, facility location, and covering graphs by subgraphs [1,2]. Recently, besides some theoretical approaches [3,4], it has been extended to many real-world applications such as blog-watch [5], seed selection in a Web crawler [6], map-reduce [7], influence maximization in social networks [8], recommendation in e-commerce [9], sensor deployment in wireless sensor networks [10], multi-depot train driver scheduling [11], cloud computing [12], and location problems [13].

Let *A* = (*aij*) be an *m* × *n* 0-1 matrix, and let *wi* be a weight applied to each row of *A*. The objective of MKCP is to choose *k* columns so as to maximize the sum of the weights of the rows that contain a '1' in at least one of the chosen columns.

This problem can be represented formally as follows:

$$\begin{aligned} \text{maximize } & \quad \sum\_{i=1}^{m} w\_i \cdot I\left(\sum\_{j=1}^{n} a\_{ij} x\_j \ge 1\right) \\ \text{subject to } & \quad \sum\_{j=1}^{n} x\_j = k \\ & \quad x\_j \in \{0, 1\}, \quad j = 1, 2, \dots, n, \end{aligned}$$

where *I*(·) is an indicator function (*I*(*false*) = 0 and *I*(*true*) = 1).

We are concerned with the case in which *wi* = 1 for all *i*; we say that a row is covered by the selected *k* columns if it contains a '1' in one of these columns. In this case, solving MKCP means finding *k* columns that cover as many rows as possible.
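The objective function above is straightforward to evaluate directly from the 0-1 matrix. The following sketch is our own illustration (the function name and the toy instance are not from the paper); the matrix is constructed so that, with unit weights, the column set {1, 3} covers four rows while {1, 4} covers only three.

```python
def coverage(A, w, cols):
    """MKCP objective: sum of w[i] over rows i that contain a '1'
    in at least one of the chosen columns (1-based column indices)."""
    return sum(w[i] for i, row in enumerate(A)
               if any(row[j - 1] == 1 for j in cols))

# Toy 5x4 instance with unit weights (our own example).
A = [[1, 0, 0, 0],
     [1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 1, 1, 0],
     [0, 0, 1, 0]]
w = [1] * len(A)

print(coverage(A, w, [1, 3]))   # columns 1 and 3 cover rows 1, 2, 4, 5 -> 4
print(coverage(A, w, [1, 4]))   # columns 1 and 4 cover rows 1, 2, 3   -> 3
```

With non-unit weights, the same function evaluates the weighted objective directly.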

MKCP has many real-world applications. For example, it can be applied to the following practical application in museum touring [14]: suppose that a museum can operate only *k* guided tour programs, each visiting a set of exhibits, among *n* possible programs, due to constraints such as operating costs. If there are *m* visitors and it is possible to predict whether or not each visitor *i* would be satisfied with the experience of each program *j* (the meaning of *aij*), then the problem of choosing exactly *k* programs to satisfy as many visitors as possible is formulated exactly as an MKCP. If it is preferable to satisfy specific visitors of greater importance, such as VIPs, the problem can be modeled more elaborately by giving a different weight (*wi*) to each visitor *i*. Similarly, MKCP can be applied to other practical applications such as public safety networks and systems [15,16]. For example, consider a disaster management system in which *m* agencies are involved and *n* candidate positions for Unmanned Aerial Vehicles (UAVs) are available to enable communication between the agencies. In this situation, it can easily be determined whether or not each agency *i* is covered when a UAV is placed at each position *j* (the meaning of *aij*). If only *k* UAVs are available for resource management, the problem of choosing exactly *k* UAV positions to cover as many agencies in the disaster area as possible can also be formulated as an MKCP.

The NP-hardness of MKCP can easily be deduced from the NP-hardness of the minimum set covering problem (MSCP) [17]. Many meta-heuristics, such as tabu search [18], genetic algorithms [19,20], particle swarm optimization [21], and ant colony optimization [22], have been applied to the MSCP, but MKCP has scarcely been addressed. Some naïve greedy heuristics [1,7,23–25] and an improved local search [26], which do not scale well to large datasets, have been studied. To the best of our knowledge, only one meta-heuristic has been applied to MKCP, based on particle swarm optimization [27]; genetic algorithms have not been applied. The present authors previously conducted an initial investigation of MKCP [28], which we now extend. Because MKCP selects a fixed number of columns, the representation of solutions is simpler than in other covering problems, which favors the adoption of a genetic algorithm (GA). When we apply a GA to MKCP, the problem also has an interesting inherent property: each gene of a chromosome is not just a number but a column of a matrix. We analyze this property and relate it to the characteristics of the solution space. We go on to present a problem-specific normalization for use in a GA, which takes account of the feasibility of solutions and improves solution quality.

The remainder of this paper is organized as follows. In Section 2, we analyze the solution space of MKCP and the representation of solutions. Based on this analysis, in Section 3, we propose a normalization method for producing feasible and improved solutions. In Section 4, we present experimental results assessing the effectiveness of our method. Finally, we draw conclusions in Section 5.

#### **2. Representation and Space of Solution to MKCP**

When we solve a problem with GAs, the representation of chromosomes is an important issue. Representation depends on the problem and reflects the properties of the problem. One representation of a solution to MKCP is a vector of *k* integers representing column indices; another is a binary vector of length *n*, in which each element indicates whether or not the corresponding column is selected. Thus, when *k* = 2 and *n* = 4, the integer representation (1, 4) is equivalent to the binary vector representation (1, 0, 0, 1). We focus on the integer vector representation.
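The two representations are interchangeable; a minimal sketch of the conversion (the helper names are ours, not the paper's):

```python
def int_to_bin(cols, n):
    """Integer-vector encoding (1-based column indices) -> binary vector of length n."""
    b = [0] * n
    for j in cols:
        b[j - 1] = 1
    return b

def bin_to_int(b):
    """Binary vector -> sorted integer-vector encoding."""
    return [j + 1 for j, bit in enumerate(b) if bit == 1]

print(int_to_bin([1, 4], 4))      # -> [1, 0, 0, 1], the example from the text
print(bin_to_int([1, 0, 0, 1]))   # -> [1, 4]
```

Note that the binary vector loses the gene ordering of the integer vector, which is exactly the redundancy discussed next.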

Using the integer vector representation, the encodings (1, 4) and (4, 1) both represent the same solution. The coverage, which is the value of the objective function, is also independent of this ordering. Since the same solution may have more than one encoding, the encoding space (genotypes) is different from the solution space (phenotypes). We can consider permutations and combinations as genotypes and phenotypes, respectively. In the integer vector representation, it is natural to encode combinations using permutation encoding. However, in this case, many encodings correspond to the same combination; that is, the representation is redundant. Let *G* be the space of encodings in the integer vector representation, and let two encodings *x* and *y* in *G* be (*x*1, *x*2, ... , *xk*) and (*y*1, *y*2, ... , *yk*), respectively. Now, the relation ∼ is defined on *G*:

**Definition 1.** *x* ∼ *y if and only if there is a permutation σ* ∈ Σ*k such that σ*(*x*1, *x*2, ... , *xk*) = (*y*1, *y*2, ... , *yk*)*, where* Σ*k is the set of permutations of size k, and σ*(*x*) *represents x permuted by σ.*

**Proposition 1.** *The relation* ∼ *is an equivalence relation.*

**Proof.** The pair (Σ*k*, ◦), where ◦ is the function composition operator, forms the symmetric group *Sk* [29]. The group *Sk* has an identity, which means that the relation ∼ is reflexive. Each *σ* ∈ *Sk* has an inverse *σ*<sup>−1</sup> ∈ *Sk*, i.e., the relation ∼ is symmetric. The group *Sk* is closed under the operator ◦: if *σ*1, *σ*2 ∈ *Sk*, then *σ*1 ◦ *σ*2 ∈ *Sk*, so the relation ∼ is also transitive. Taken together, the relation ∼ is an equivalence relation. □

For example, the encodings (1, 3, 5, 7) and (3, 7, 5, 1) are equivalent. The number of vectors equivalent to a vector *g* ∈ *G* (including *g* itself) is *k*!.

The equivalence relation of Definition 1 allows us to consider the real solution space (phenotype space) as the set of equivalence classes of elements in *G*, i.e., the quotient space *G*/∼.

We can measure the similarity of two vectors *x* and *y* in *G*, the space of MKCP solution encodings, using a distance metric *D*, obtained by summing the values of a subsidiary metric *d* that measures the difference between two columns of the matrix *A* defining the MKCP:

$$D(x, y) := \sum\_{i=1}^{k} d(x\_i, y\_i). \tag{1}$$

Here, the genes of two chromosomes, *xi* and *yi*, represent chosen column indices. If we regard the indices as just labels, not column vectors, we can use the discrete metric as *d*: it is zero if the two indices are the same and one otherwise. This satisfies all the conditions for a distance metric, including the triangular inequality.

An alternative metric is Hamming distance, which is a measure of the dissimilarity of two binary vectors. If we consider each column of the matrix *A* as a binary vector (the column vector), not just an integer, we can find the Hamming distance between any two column vectors, and hence between the corresponding genes *xi* and *yi* in two chromosomes from *G*. These distances can then be summed as shown in Equation (1).
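Both choices of the subsidiary metric *d*, and the summed distance of Equation (1), can be sketched as follows. The toy matrix and names are our own illustration, not the paper's Figure 1:

```python
A = [[1, 0, 0, 0],   # toy 0-1 matrix; columns are indexed 1..4
     [1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 1, 1, 0],
     [0, 0, 1, 0]]

def column(j):
    """Binary column vector of A for 1-based column index j."""
    return [row[j - 1] for row in A]

def discrete(i, j):
    """Discrete metric on indices: 0 if the labels are equal, 1 otherwise."""
    return 0 if i == j else 1

def hamming(i, j):
    """Hamming distance between the i-th and j-th column vectors of A."""
    return sum(a != b for a, b in zip(column(i), column(j)))

def D(x, y, d):
    """Equation (1): gene-wise sum of the subsidiary metric d."""
    return sum(d(xi, yi) for xi, yi in zip(x, y))

print(D((1, 3), (2, 3), discrete))   # labels only: first genes differ -> 1
print(D((1, 3), (2, 3), hamming))    # column contents: hamming(1, 2) + 0 -> 4
```

Any column-wise metric can be plugged in for *d*; only the Hamming variant looks at the contents of *A*.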

Figure 1 shows two distances in *G* calculated using the discrete metric and the Hamming distance. As shown in Figure 1c, the discrete metric simply compares the column indices of *xi* and *yi* and ignores the contents of the corresponding columns of *A*. In Figure 1d, we see that the distance between *x*1 and *y*1 is the Hamming distance between the first and second column vectors of *A*.


**Figure 1.** Discrete metric vs. Hamming distance, for the chromosomes (1,3) and (2,3).

Now, we establish a metric in the quotient space *G*/∼ by the following proposition:

**Proposition 2.** *Let x* = (*x*1, *x*2, ... , *xk*) *and y* = (*y*1, *y*2, ... , *yk*) *be in G, and let a metric D on G be defined by Equation (1). Then,*

$$\bar{D}(\bar{x}, \bar{y}) := \min\_{\sigma \in \Sigma\_k} \sum\_{i=1}^k d(x\_i, \sigma\_i(y)) \tag{2}$$

*is a metric on G*/∼*, where σi*(*y*) *represents the ith element of y permuted by σ.*

**Proof.** Let *σ* be in Σ*k*. The computation of *D*(*x*, *y*) in Equation (1) is unaffected by summation order, and thus *D*(*x*, *y*) = *D*(*σ*(*x*), *σ*(*y*)). Hence, *σ* is an isometry on *G*, and Σ*k* forms a subgroup of isometries of *G*. The relation ∼ is the equivalence relation obtained from this subgroup. Hence, from [30,31], *D*¯(*x*¯, *y*¯) is a metric on *G*/∼. □

#### **3. Normalization in MKCP**

As shown above, redundancy in the integer representation of solutions to MKCP means that the encoding (genotype) space *G* is unnecessarily larger than the true solution (phenotype) space, which is the quotient space *G*/∼. Redundant representations can be expected to reduce the performance of genetic algorithms significantly; in particular, they undermine the effectiveness of standard crossovers defined using masks [32]. The problem of redundant representations has been addressed by a number of methods, such as adaptive crossover [33–36], among which the normalization technique [37] is representative. Normalization changes one parent's genotype into a different genotype with the same phenotype that is more similar to the genotype of the other parent before a standard crossover is performed. It is based on adaptive crossovers [34,35], and many variants have appeared [31,38,39].

#### *3.1. Preserving Feasibility*

A representation is infeasible if it does not meet the requirements of a solution. Figure 2a shows an integer representation which is infeasible because it contains duplicate column indices, and Figure 2b shows a binary representation which is infeasible because it contains the wrong number of '1's. Figure 2 also shows how these infeasible solutions are likely to be created by standard crossover operators such as uniform and *n*-point crossovers.

**Figure 2.** How infeasible solutions can be created by crossover operations.

A repairing step can be performed to restore feasibility, but this has the effect of a mutation and may garble gene sequences inherited from the parents. With the integer encoding, the problem can be avoided by rearranging the parents so that any shared column indices are in the same positions. Such rearrangement naturally makes offspring preserve feasibility, and no special repairing step is required. This form of normalization is easily implemented, as shown in the pseudocode in Figure 3, and takes *O*(*k*<sup>2</sup>) time.

```
FP_normalization(x, y)
{
    for i ← 1 to k
        for j ← 1 to k
            if xi = yj then
                Swap the values of yi and yj;
}
```
**Figure 3.** Pseudocode of normalization to preserve feasibility.

In Figure 4, this normalization to preserve feasibility for the example in Figure 2a is shown.

**Figure 4.** Normalization to preserve feasibility in the example from Figure 2a.
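The pseudocode of Figure 3 can be transcribed almost directly into Python; the sketch below is our own rendering and assumes duplicate-free (feasible) chromosomes:

```python
def fp_normalize(x, y):
    """Feasibility-preserving normalization (cf. Figure 3): rearrange y so that
    every column index shared with x occupies the same position as in x.
    Runs in O(k^2) time; returns a rearranged copy of y."""
    y = list(y)
    k = len(x)
    for i in range(k):
        for j in range(k):
            if x[i] == y[j]:
                y[i], y[j] = y[j], y[i]
    return y

# Shared indices 2 and 3 move to the positions they occupy in x, so a
# mask-based crossover of x and the result cannot duplicate a gene.
print(fp_normalize([2, 3, 6], [1, 3, 2]))   # -> [2, 3, 1]
```

After this rearrangement, every position holds either the same index in both parents or two indices that appear in only one parent each, so any crossover mask yields a feasible offspring.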

#### *3.2. Normalization for Improving Solution Quality*

A good solution to MKCP will consist of dissimilar columns. For example, in Figure 1, (1, 3) is a better solution than (1, 4) because the '1's in Columns 1 and 3 all occur in different rows, while Columns 1 and 4 share a '1' in Row 2.

In the context of a genetic algorithm, we would expect chromosomes with genes corresponding to dissimilar columns to be most effective. Looking again at Figure 1, suppose that we apply a standard one-point crossover to the parents (1, 2) and (3, 4). If the cutting line lies between the first gene and the second one, then the offspring will be (1, 4), and this offspring covers the rows {1, 2, 3} of *A*. However, if the second parent is rearranged to (4, 3), the offspring becomes (1, 3), which covers {1, 2, 4, 5}. This example is illustrated in Figure 5. In general, we want the offspring to have genes that correspond to columns that are as dissimilar as possible. Thus, it is helpful to rearrange chromosomes so that genes corresponding to similar columns are in the same positions. The goal of this rearrangement can be formulated in terms of distance in the phenotype space *G*/∼. Consider two parents *x* and *y* in the genotype space *G*. Rearranging genes so that those corresponding to the most similar columns are located in the same positions is equivalent to the search for a permutation *σ*∗ such that the distance between the equivalence class of *x* and that of *y* is equal to the distance between *x* and *σ*∗(*y*). This is also equivalent to finding the *σ* which minimizes the distance between *x* and the permuted *y* in Equation (2).

**Figure 5.** Effect of rearranging genes to match similar columns in the example from Figure 1.

The optimal rearrangement is achieved by considering all the permutations of the genes in the second parent and choosing the permutation that minimizes the sum of distances between the column vectors corresponding to gene pairs at the same locations. If the Hamming distance *H* is used to measure the dissimilarity between the column vectors corresponding to two genes, we choose the permutation *σ*∗ such that

$$
\sigma^\* = \underset{\sigma \in \Sigma\_k}{\text{argmin}} \sum\_{i=1}^k H(\mathbf{x}\_i, \sigma\_i(y)),
\tag{3}
$$

where Σ*k* is the set of all permutations of length *k*, and *σi*(*y*) denotes the *i*th element of *y* permuted by *σ*.

We give an example in Figure 6, in which the chromosomes (1, 2) and (3, 4) are the two parents and we normalize (3, 4). Because *k* is 2, there are only two permutations to consider, and we compute the sum of *H*(*xi*, *σi*(*y*)) for each. Under the identity permutation, *y* remains (3, 4); then *H*(*x*1, *y*1) = *H*(1, 3) = 4 and *H*(*x*2, *y*2) = *H*(2, 4) = 2, and their sum is 6. Under the transposition, *y* becomes (4, 3); then *H*(1, 4) = 2 and *H*(2, 3) = 2, and their sum is 4. Since 4 < 6, the transposition is the optimal permutation *σ*∗.

**Figure 6.** Normalization with Hamming distance in the example from Figure 1.

Enumerating all *k*! permutations is intractable for large *k*. However, the problem can be solved using the Hungarian method [40], a network-flow-based technique, which provides an optimal result and runs in *O*(*k*<sup>3</sup>) time [41]. Alternatively, we can use a fast heuristic [42], which runs in *O*(*k*<sup>2</sup>) time and produces results very close to the optimum. Either of these methods can be treated as a function that accepts a 2D array, in which each element is the distance between two genes, and returns the permutation with the minimum total distance between the chromosomes. This normalization is shown in the pseudocode of Figure 7.

```
OPT_normalization(x, y)
{
    for i ← 1 to k
        for j ← 1 to k
            D[xi, yj] ← H(xi, yj);
    σ* ← argmin_{σ∈Σk} Σ_{i=1..k} H(xi, σi(y));   // using the Hungarian method or its fast variant
    y ← σ*(y);
}
```

**Figure 7.** Pseudocode of normalization by the optimal rearrangement.
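For small *k*, the argmin in Figure 7 can also be computed by direct enumeration. The sketch below is our own; the pairwise distances are the ones quoted in the worked example of Figure 6 (*H*(1,3) = 4, *H*(1,4) = 2, *H*(2,3) = 2, *H*(2,4) = 2), and a production implementation would replace the enumeration with the Hungarian method or the *O*(*k*<sup>2</sup>) heuristic:

```python
from itertools import permutations

def opt_normalize(x, y, dist):
    """Optimal rearrangement of y (Equation (3)) by brute force over all k!
    permutations of y; only practical for small k."""
    return list(min(permutations(y),
                    key=lambda p: sum(dist(xi, pi) for xi, pi in zip(x, p))))

# Pairwise column distances taken from the worked example in the text.
H = {(1, 3): 4, (1, 4): 2, (2, 3): 2, (2, 4): 2}
dist = lambda i, j: H[(i, j)]

print(opt_normalize([1, 2], [3, 4], dist))   # -> [4, 3], total distance 4 instead of 6
```

Because `dist` is passed in as a function, the same sketch works for the Hamming distance of Equation (3) or the discrete metric of Equation (4).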

After we rearrange the second parent according to the permutation *σ*∗, a standard crossover is applied.

Now, we investigate the relation between this optimal rearrangement and feasibility. There may be more than one optimal rearrangement satisfying Equation (3), but we can show that one of them is always feasible. Feasibility is preserved by an optimal permutation *σ*∗ that locates all common indices in the same positions, as we now prove:

**Proposition 3.** *If x* = (*x*1, *x*2, ... , *xk*) *and y* = (*y*1, *y*2, ... , *yk*) *are two chromosomes and xp* = *yq, then there exists σ*∗ ∈ Σ*k such that σ*∗*p*(*y*) = *yq and* ∑*i d*(*xi*, *σ*∗*i*(*y*)) ≤ ∑*i d*(*xi*, *σi*(*y*)) *for all σ* ∈ Σ*k, where d is a distance metric and the sums run over i* = 1, ... , *k.*

**Proof.** Let *σ*′ = argmin*σ*∈Σ*k* ∑*i d*(*xi*, *σi*(*y*)). There is an index *r* satisfying *σ*′*r*(*y*) = *yq*. Let *σ*″ be the same permutation as *σ*′, except that *σ*′*p*(*y*) and *σ*′*r*(*y*) are exchanged:

$$
\sigma\_i''(y) = \begin{cases}
\sigma\_r'(y) & \text{if } i = p, \\
\sigma\_p'(y) & \text{if } i = r, \\
\sigma\_i'(y) & \text{otherwise}.
\end{cases}
$$

Then,

$$\begin{aligned} \sum\_{i=1}^{k} d(x\_i, \sigma\_i''(y)) &= d(x\_p, \sigma\_p''(y)) + d(x\_r, \sigma\_r''(y)) + \sum\_{i \neq p, i \neq r} d(x\_i, \sigma\_i''(y)) \\ &= d(x\_p, \sigma\_r'(y)) + d(x\_r, \sigma\_p'(y)) + \sum\_{i \neq p, i \neq r} d(x\_i, \sigma\_i'(y)) \\ &= d(x\_p, y\_q) + d(x\_r, \sigma\_p'(y)) + \sum\_{i \neq p, i \neq r} d(x\_i, \sigma\_i'(y)) \\ &= d(x\_r, \sigma\_p'(y)) + \sum\_{i \neq p, i \neq r} d(x\_i, \sigma\_i'(y)) \quad (\because d(x\_p, y\_q) = 0 \text{ by assumption}) \\ &\le d(x\_r, x\_p) + d(x\_p, \sigma\_p'(y)) + \sum\_{i \neq p, i \neq r} d(x\_i, \sigma\_i'(y)) \quad (\because \text{triangular inequality}) \\ &= d(x\_r, y\_q) + d(x\_p, \sigma\_p'(y)) + \sum\_{i \neq p, i \neq r} d(x\_i, \sigma\_i'(y)) \quad (\because x\_p = y\_q) \\ &= d(x\_r, \sigma\_r'(y)) + d(x\_p, \sigma\_p'(y)) + \sum\_{i \neq p, i \neq r} d(x\_i, \sigma\_i'(y)) \\ &= \sum\_{i=1}^{k} d(x\_i, \sigma\_i'(y)) \\ &\le \sum\_{i=1}^{k} d(x\_i, \sigma\_i(y)) \quad \text{for all } \sigma \in \Sigma\_k. \end{aligned}$$

$$\text{Hence, } \sum\_{i=1}^{k} d(x\_i, \sigma\_i''(y)) \le \sum\_{i=1}^{k} d(x\_i, \sigma\_i(y)) \text{ for all } \sigma \in \Sigma\_k \text{ and } \sigma\_p''(y) = y\_q. \quad \Box$$

This proof relies on the triangular inequality, which is a key property of any valid distance metric. Thus, Proposition 3 holds for distances other than the Hamming distance. We could use the discrete metric introduced in Section 2. This distance provides only a very rough comparison of two solutions, but Proposition 3 still holds. Equation (3) can be rewritten using the discrete metric *ρ* instead of the Hamming distance *H*:

$$
\sigma^\* = \underset{\sigma \in \Sigma\_k}{\text{argmin}} \sum\_{i=1}^k \rho(\mathbf{x}\_i, \sigma\_i(y)). \tag{4}
$$

In this case, in particular, feasibility is preserved by *any* optimal permutation *σ*∗ in Equation (4). The proof is quite similar to that of Proposition 3.

**Proposition 4.** *If x* = (*x*1, *x*2, ... , *xk*) *and y* = (*y*1, *y*2, ... , *yk*) *are two chromosomes and xp* = *yq, then σ*∗*p*(*y*) = *yq, where σ*∗ *is a permutation such that* ∑*i ρ*(*xi*, *σ*∗*i*(*y*)) ≤ ∑*i ρ*(*xi*, *σi*(*y*)) *for all σ* ∈ Σ*k.*

**Proof.** Suppose, to the contrary, that *σ*∗*p*(*y*) ≠ *yq*. There is an index *r* ≠ *p* satisfying *σ*∗*r*(*y*) = *yq*. Let *σ*′ be the same permutation as *σ*∗, except that *σ*∗*p*(*y*) and *σ*∗*r*(*y*) are exchanged:

$$
\sigma\_i'(y) = \begin{cases}
\sigma^\*\_r(y) & \text{if } i = p, \\
\sigma^\*\_p(y) & \text{if } i = r, \\
\sigma^\*\_i(y) & \text{otherwise}.
\end{cases}
$$

Then,

$$\begin{aligned} \sum\_{i=1}^{k} \rho(x\_i, \sigma\_i'(y)) &= \rho(x\_p, \sigma\_p'(y)) + \rho(x\_r, \sigma\_r'(y)) + \sum\_{i \neq p, i \neq r} \rho(x\_i, \sigma\_i'(y)) \\ &= \rho(x\_p, \sigma^\*\_r(y)) + \rho(x\_r, \sigma^\*\_p(y)) + \sum\_{i \neq p, i \neq r} \rho(x\_i, \sigma^\*\_i(y)) \\ &= \rho(x\_p, y\_q) + \rho(x\_r, \sigma^\*\_p(y)) + \sum\_{i \neq p, i \neq r} \rho(x\_i, \sigma^\*\_i(y)) \\ &= \rho(x\_r, \sigma^\*\_p(y)) + \sum\_{i \neq p, i \neq r} \rho(x\_i, \sigma^\*\_i(y)) \quad (\because x\_p = y\_q \text{ by assumption}) \\ &< 1 + 1 + \sum\_{i \neq p, i \neq r} \rho(x\_i, \sigma^\*\_i(y)) \\ &= \rho(x\_p, \sigma^\*\_p(y)) + \rho(x\_r, x\_p) + \sum\_{i \neq p, i \neq r} \rho(x\_i, \sigma^\*\_i(y)) \quad (\because x\_p \neq \sigma^\*\_p(y) \text{ and } x\_r \neq x\_p) \\ &= \rho(x\_p, \sigma^\*\_p(y)) + \rho(x\_r, y\_q) + \sum\_{i \neq p, i \neq r} \rho(x\_i, \sigma^\*\_i(y)) \\ &= \rho(x\_p, \sigma^\*\_p(y)) + \rho(x\_r, \sigma^\*\_r(y)) + \sum\_{i \neq p, i \neq r} \rho(x\_i, \sigma^\*\_i(y)) \\ &= \sum\_{i=1}^{k} \rho(x\_i, \sigma^\*\_i(y)). \end{aligned}$$

This contradicts the assumption that ∑*i ρ*(*xi*, *σ*∗*i*(*y*)) ≤ ∑*i ρ*(*xi*, *σi*(*y*)) for all *σ* ∈ Σ*k*. □

Using the discrete metric will only cause identical indices to be rearranged into the same positions. In fact, normalization by the discrete metric is exactly the same as the feasibility-preserving rearrangement introduced in Section 3.1.

#### **4. Experiments**

#### *4.1. Test Sets and Test Environments*

Our experiments were conducted on 65 instances of 11 set cover problems with various sizes and densities from the OR-library [43]. Although these benchmark data were designed as set cover problems, they can also be considered as maximum *k*-coverage problems, as in [27]. Some details of these problems are presented in Table 1, where *m* and *n* are the numbers of rows and columns, respectively, and density is the percentage of '1's in the MKCP matrix *A*. The present authors previously experimented with fixed values of *k* of 10 and 20 [28]. In this study, however, we varied *k* through the tightness ratio *α*, which is the product of *k* and the density of a problem. The higher the tightness ratio, the larger the objective value (the coverage) we are likely to achieve; if the tightness ratio is 1, the optimum coverage is likely to be very close to the number of rows *m*. We used tightness ratios of 0.8, 0.6, and 0.4.
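Under this definition, the value of *k* used for an instance follows directly from the tightness ratio. A small helper of our own (the paper gives only the relation *α* = *k* × density, not a formula):

```python
def k_from_tightness(alpha, density):
    """k such that k * density ~= alpha, where density is the fraction of '1's
    in the matrix (e.g., 0.02 for a 2% density)."""
    return round(alpha / density)

print(k_from_tightness(0.4, 0.02))   # 2%-dense instance, alpha = 0.4 -> k = 20
```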

We implemented all the tested algorithms in the C language using *gcc* version 5.4.0 and ran them on Ubuntu 16.04.6.


**Table 1.** Problem set.

#### *4.2. Effect of Normalization on a Crossover*

To see whether or not normalization is effective at crossover, we performed experiments with three methods of rearranging the second of two parents before a crossover: REPAIR, FP, and OPT.

REPAIR produces feasible offspring by replacing each duplicate column index with a randomly chosen index that is not yet contained in the offspring. FP rearranges the second parent using the feasibility-preserving normalization of Figure 3. OPT was implemented using the Hungarian method [40], which runs in *O*(*k*<sup>3</sup>) time.

To determine the most effective method, we performed the following steps:


This procedure was applied to a single instance of each problem listed in Table 1 using REPAIR, FP, and OPT. Table 2 shows the results for each of these 11 instances. We see that OPT normalization outperforms the others and can therefore be expected to improve the performance of a GA. Moreover, the results of the one-tailed *t*-test show that the objective values of offspring produced using OPT normalization were better than those of their parents, whereas the values produced using REPAIR and FP were similar to those of their parents. This suggests that, compared to the other methods, OPT normalization strongly supports the crossover operator in searching the solution space in a promising direction, even without a replacement strategy.


**Table 2.** Effect of normalization on a single crossover, after the REPAIR, FP, and OPT procedures.

Ave and SD are the average and standard deviation of the fitness of 100 parents (in the "Parents" column) or 50 offspring (in the REPAIR, FP, and OPT columns), respectively. REPAIR produces feasible offspring by random repair. FP rearranges the second parent to produce feasible offspring using the normalization in Figure 3. OPT is the optimal normalization of the second parent. \* The one-tailed *t*-test of the null hypothesis that the result of the given method is equal to the fitness of the parents.

#### *4.3. Performance of GAs with Normalization Methods*

Our underlying evolutionary model is similar to that of CHC [44], which has been applied to many problems [45–50]. We randomly paired a population of *N* chromosomes and then applied crossover to each pair, generating a total of *N*/2 offspring. We ranked all parents and offspring, and the fittest *N* individuals among them became the population of the next generation. We used a population size of 100 in our experiments. We reinitialized the population, except for the best individual, if there were no changes over *kr*(1 − *r*) generations, where *r* is a divergence ratio that was set to 0.25. The GA stopped after 500 generations and returned the best solution it had found. The pseudocode of our GA is given in Figure 8.

```
GA()
{
    Initialize a population P of N individuals;
    for i ← 1 to maximum generations
    {
        Randomly pair the N individuals of P;        {N/2 pairs}
        for each pair (p1, p2) ∈ P                   {N/2 iterations}
        {
            Normalize p2 to make it close to p1;     {optionally applied}
            o ← crossover(p1, p2);                   {make offspring from parents}
        }
        P ← the best N individuals among N parents and N/2 offspring;
        if there are no changes in P over kr(1 − r) generations, then
            Reinitialize the population P, except for the best individual;
    }
    return the best individual;
}
```

**Figure 8.** Pseudocode of our genetic algorithm.
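The loop above can be sketched in Python as follows. This is our own condensed illustration, not the paper's C implementation: uniform crossover with a simple random repair stands in for the REPAIR method of Section 4.2, an optional `normalize` hook takes the place of the normalization step, and the restart trigger is simplified.

```python
import random

def run_ga(A, k, N=100, max_gen=500, r=0.25, normalize=None, seed=0):
    """CHC-like GA sketch (cf. Figure 8).  A chromosome is a list of k distinct
    0-based column indices; `normalize`, if given, rearranges the second
    parent before uniform crossover."""
    rng = random.Random(seed)
    n = len(A[0])

    def fitness(x):
        # unit-weight coverage: rows with a '1' in at least one chosen column
        return sum(1 for row in A if any(row[j] for j in x))

    def crossover(p1, p2):
        if normalize:
            p2 = normalize(p1, p2)
        child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
        # random repair of duplicates (simplified REPAIR)
        pool = [j for j in range(n) if j not in child]
        seen, out = set(), []
        for g in child:
            if g in seen:
                g = pool.pop(rng.randrange(len(pool)))
            seen.add(g)
            out.append(g)
        return out

    pop = [rng.sample(range(n), k) for _ in range(N)]
    best, stall = max(pop, key=fitness), 0
    for _ in range(max_gen):
        rng.shuffle(pop)
        offspring = [crossover(pop[2 * i], pop[2 * i + 1]) for i in range(N // 2)]
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:N]
        if fitness(pop[0]) > fitness(best):
            best, stall = pop[0], 0
        else:
            stall += 1
        if stall > k * r * (1 - r):   # restart around the best individual
            pop = [best] + [rng.sample(range(n), k) for _ in range(N - 1)]
            stall = 0
    return best

# Toy instance: the best pair of columns covers 4 of the 5 rows.
A = [[1, 0, 0, 0],
     [1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 1, 1, 0],
     [0, 0, 1, 0]]
best = run_ga(A, k=2, N=8, max_gen=30, seed=1)
```

Passing an OPT- or FP-style function as `normalize` reproduces the variants compared in the experiments.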

In the following experiments, we changed the normalization method in a single GA. We compared the output of our GA with the best result that we found in this study, using the metric %-gap, which is 100 × |*best* − *output*|/*best*.
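The comparison metric is a simple relative error:

```python
def pct_gap(best, output):
    """%-gap = 100 * |best - output| / best; 0 means the run matched the best result found."""
    return 100.0 * abs(best - output) / best

print(pct_gap(200, 190))   # -> 5.0
```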

Thirty trials were performed for each method, and the averaged results are shown in Table 3. The results labeled RR-GA were produced without normalization: infeasible offspring produced were repaired randomly using the same method as REPAIR in Section 4.2. We also compared the results from the GA with a multi-start method using randomly generated solutions. In each run of this method, called Multi-Start, we sampled 10<sup>6</sup> random solutions and chose the best one. Even RR-GA performed significantly better than Multi-Start, suggesting that GAs are an appropriate mechanism for solving the MKCP.

The results labeled FP-GA in Table 3 were produced by rearranging the genes of the second parent to produce feasible offspring without the need for repair. FP-GA outperformed RR-GA for large values of *k* but not for small values. It seems that the effect of mutation by repair is rather effective when the solution space is small.

The results labeled OPT-GA in Table 3 were produced by the GA with the proposed normalization. Using the same GA as FP-GA, OPT-GA rearranges the genes of the second parent to minimize the sum of distances between genes (column vectors) before applying recombination. OPT-GA clearly outperforms Multi-Start and RR-GA, and the results of one-tailed *t*-tests show that OPT-GA also significantly outperforms FP-GA.

The results of the *t*-tests show that Multi-Start is clearly the worst technique, even though it is allowed 10<sup>6</sup> evaluations, whereas the GA-based methods evaluate only 2.5 × 10<sup>4</sup> chromosomes. RR-GA and FP-GA have similar performance, and OPT-GA clearly performs best.



**Table 3.** Performance comparison of Multi-Start, RR-GA, FP-GA, and OPT-GA. \* The one-tailed *t*-test of the null hypothesis of OPT-GA = FP-GA.

#### **5. Conclusions**

We present the maximum *k*-coverage problem (MKCP) and analyze its representation and solution space. If we apply a GA to the MKCP, then we immediately encounter the issue of redundancy in the genotype space, which is larger than the phenotype space that we characterize as a quotient space. We introduce a method of normalizing chromosomes that ensures a crossover produces feasible offspring with genes that are column vectors of the MKCP matrix and are as dissimilar as possible. This normalization was implemented using the Hungarian method [40]. We performed experiments which showed the effectiveness of this approach.

In this study, we adopted two locus-based metrics: the discrete metric and an extension of it derived from the Hamming distance between genes (column vectors). However, other metrics, such as variants of the Cayley metric on permutations [51,52], may also be applied within the proposed theoretical framework. In the case of such a non-locus-based metric, we would need to design a new crossover tailored to the metric. This is a promising direction, which we leave for future study.

As mentioned in the Introduction, the proposed theoretical framework can be applied to real-world applications such as cyber-physical social systems and public safety networks. We leave this applied work for future study. We also expect that this approach can be applied to other problems that use the same representation for their solutions. By extending this technique to solution representations of variable length, as in [53,54], we believe it could also be applied to the set cover problem.

**Author Contributions:** Conceptualization, Y.Y. and Y.-H.K.; methodology, Y.-H.K.; software, Y.Y.; validation, Y.Y.; formal analysis, Y.Y.; investigation, Y.Y.; resources, Y.Y.; data curation, Y.Y.; writing—original draft preparation, Y.Y.; writing—review and editing, Y.-H.K.; visualization, Y.Y.; supervision, Y.-H.K.; project administration, Y.-H.K.; and funding acquisition, Y.-H.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** The present research was conducted by the Research Grant of Kwangwoon University in 2020. This research was a part of the project titled 'Marine Oil Spill Risk Assessment and Development of Response Support System through Big Data Analysis' funded by the Korea Coast Guard. This work was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science, ICT & Future Planning) (No. 2017R1C1B1010768).

**Acknowledgments:** The authors thank Byung-Ro Moon for his valuable suggestions, which improved this paper.

**Conflicts of Interest:** The authors declare that they have no conflict of interest.

**Disclosure:** A preliminary version of this paper appeared in the Proceedings of the Genetic and Evolutionary Computation Conference, pp. 593–598, 2008. In comparison with the conference paper, this paper was newly rewritten with the following new materials: (i) new work: complete literature survey (Section 1) and complete theoretical work of the proposed normalization (Sections 2 and 3.2); and (ii) improved work: an improved genetic algorithm (through the improvement of normalization technique), and its largely-extended experiments together with their statistical verification (Section 4).

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
