1. Introduction
1.1. General Provisions
The penetration of the Internet of Things (IoT) into various areas of human activity has led to the emergence of many forms of wireless sensor networks (WSNs). These networks may differ in scale (geometric dimensions of the service area), number of nodes, traffic intensity, the location of nodes in space, and the nature of their movement. The choice of radio channel technology and of the protocols that control the logical structure of the network depends on these parameters [1,2,3].
Various WSN applications require networking with fixed or mobile nodes. The logical structure of the WSN is determined by the routing rules used. In the simplest case, a data collection network has a star-shaped structure, and all network nodes are located in the service area of the cluster head (CH), which often also acts as a gateway. In such a case, the service area of the network is limited to the communication range of the CH, which, as a rule, does not exceed hundreds of meters.
To overcome the limitations of this structure, structures with transit (relay) nodes and a mobile cluster head (MCH) that performs the functions of data collection are used [4]. The use of an MCH makes it possible to significantly expand the WSN service area, as well as to increase network connectivity [5]. If the trajectory of the MCH is chosen so that it passes through the communication zone of each network node for a certain time, then, with the right choice of movement speed [6], it is possible to collect data from a given share of the total number of nodes during this time interval.
In this case, the duration of data collection depends on the speed of the cluster head, the number of network nodes, and the size of the service area. The requirements for the duration of data collection depend on the specific application tasks solved by the network. For example, in a network for monitoring ambient temperature and soil moisture, it can be measured in minutes or tens of minutes, whereas in a network for monitoring the atmosphere for the presence of hazardous substances, the duration of data collection should be measured in seconds.
Based on previous studies, we note that reducing the data collection time requires increasing the number of MCHs. This raises the tasks of choosing both the number of such nodes and the nature of their movement in the network service area, since the efficiency of using MCHs depends on these parameters. Having too many MCHs increases the cost of the network, i.e., reduces its effectiveness. An MCH is, as a rule, significantly more expensive than a stationary node, since it requires a vehicle, energy for movement, and an interface with the global network. In the extreme case, it is possible to abandon stationary nodes completely, leaving only MCHs equipped with gateways; this reduces the data collection time to a minimum, but the network then has the maximum cost. In an intermediate variant, the number of MCHs should be sufficient to provide the required duration of data collection; however, this duration is determined not only by the number of MCHs, but also by the nature of their movement.
Therefore, as a new contribution in this paper, we develop a method for constructing a network with an MCH that increases the efficiency of using WSN resources. The new method makes it possible to select clusters with different densities of nodes. Identifying such clusters makes it possible to organize data collection with mobile cluster heads, choosing the optimal speed of movement in each of these clusters. The analysis and the simulation show the efficacy of the proposed model. Additionally, it is possible to increase the efficiency of using mobile cluster heads; the achievable gain depends on the heterogeneity of the IoT network.
1.2. Gap Analysis
In the presence of mobile head nodes, the problem of selecting clusters arises, i.e., of deciding which WSN nodes connect to each head node to deliver their data. Both the speed (time) of data collection and the required number of mobile head nodes depend on the solution to this problem.
The task of selecting a cluster can be solved by various methods, from the simplest, for example, including in the cluster all nodes with which a connection can be established, to methods that ensure the necessary distribution of nodes among clusters.
In practice, for example, in an urban environment, one has to deal with networks for various purposes whose elements are placed on various elements of the urban infrastructure. As a rule, such networks differ in geometric dimensions and in the number of nodes, i.e., they tend to have different node densities. This becomes clear if we compare the density of pedestrians on a sidewalk, of cars on a carriageway, or of passengers in a bus or a train car. Thus, it can be assumed that one of the main criteria for selecting a cluster can be the density of devices.
When infrastructure elements with different node densities are considered, they must be described by some model. A point process model can serve this purpose, in this case a process with a multimodal rather than a uniform distribution of elements in the service area.
It would be reasonable to note that an element with the same density of nodes may turn out to be too large, and several head nodes may be required to service it. This is already a slightly different task, i.e., the task of ensuring the required performance. It, in turn, may also require the clustering of such an element, but already with the aim of highlighting “similar” clusters. To solve this problem, well-known clustering algorithms, for example, FOREL or k-means, can be used. Since the solution to this problem is quite well known, we do not discuss it in this paper.
The main objective of this work is the selection of clusters or areas with a similar structure for their service by mobile head nodes. The purpose of this work is to increase the efficiency of the process of collecting data from sensor network nodes using mobile head nodes. To this end, a method is needed for selecting clusters with a similar structure that can be served by mobile head nodes.
2. Network Model with Mobile Nodes
It was proved in [7] that the speed of movement of an MCH that collects data depends on the density of nodes and on the requirements for the quality of data collection (the probability of missing a node). In the same work, it was proved that an optimal speed of movement of such a node can be determined, at which the number of polled network nodes is at a maximum. Such a model for organizing the movement of a node is effective with a uniform distribution of users in the service area, when they are described by a Poisson field [8]. In most practical cases, the Poisson field model is applicable only in limited spaces, within which the distribution of network nodes can be considered uniform. Over the entire service area, nodes are in most cases distributed unevenly. Such a distribution is usually described by multimodal laws [9].
Examples include the distribution of people in a city, of cars on a roadway, or of aquatic plants or fish in a water area. In such cases, the density of objects is uneven in the service area: the density of people is high in residential, office, and public buildings, the density of cars is greater at intersections than in the central part of the road, etc. Most WSNs therefore also have an uneven density of nodes in the service area, and this feature must be taken into account when choosing the number of MCHs and the speed of their movement. Otherwise, the use of an MCH may be ineffective.
When modeling the WSN, we assume that the network nodes are located on a plane, and the service area is a rectangle. This assumption simplifies the reasoning, but it should be noted that the same reasoning can also be applied to a three-dimensional model.
We assume that the network nodes are randomly distributed in the service area. The distribution of node coordinates can be represented by a two-dimensional multimodal distribution, described as a mixed (composite) distribution [10,11,12] with a probability density and a distribution function given by expressions (1) and (2), respectively
$$f(x,y)=\sum_{i=1}^{K}\eta_i f_i(x,y) \tag{1}$$
$$F(x,y)=\sum_{i=1}^{K}\eta_i F_i(x,y) \tag{2}$$
where
$K$ is the number of modes;
$\eta_i$ is a numerical coefficient such that $\sum_{i=1}^{K}\eta_i=1$;
$f_i(x,y)$ is the probability density of the $i$-th component;
$F_i(x,y)$ is the probability distribution function of the $i$-th component.
In the general case, various distribution functions can be used in (1). Their choice is determined by the degree of closeness of the model to the simulated system. In this model, it is proposed to use continuous random variables and the corresponding distribution functions, since we assume that the x- and y-coordinates can take any value. Both limited and unlimited random variables can be used for modeling. In the latter case, it is necessary to specify the probability of a random variable falling into the simulated service area and to choose the distribution parameters in such a way that this probability is sufficiently large.
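As an illustration of how node coordinates could be generated from a mixed distribution of this kind, the following minimal sketch first selects a component with probability equal to its coefficient and then draws one point from it. All component parameters, the helper names (`sample_mixture`, `normal_component`, `uniform_component`), and the choice of uncorrelated normal components are assumptions made only for this example.

```python
import random

def sample_mixture(components, weights, n, seed=0):
    """Draw n points from a mixture: choose component i with probability
    weights[i], then draw one (x, y) point from that component."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    rng = random.Random(seed)
    chosen = rng.choices(components, weights=weights, k=n)
    return [draw(rng) for draw in chosen]

def normal_component(mu_x, mu_y, sigma_x, sigma_y):
    # Uncorrelated coordinates (assumption) keep the sketch short.
    return lambda rng: (rng.gauss(mu_x, sigma_x), rng.gauss(mu_y, sigma_y))

def uniform_component(x0, x1, y0, y1):
    return lambda rng: (rng.uniform(x0, x1), rng.uniform(y0, y1))

# Hypothetical service area: two road sections and one parking lot.
components = [
    normal_component(20.0, 50.0, 8.0, 2.0),
    normal_component(80.0, 50.0, 8.0, 2.0),
    uniform_component(40.0, 60.0, 0.0, 20.0),
]
weights = [0.4, 0.4, 0.2]
nodes = sample_mixture(components, weights, n=1000)
```

With unbounded components such as the normal ones, some generated points may fall outside the rectangular service area; in a fuller model they would have to be rejected or the parameters chosen so that this is rare, as noted in the text.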
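Figure 1 shows a possible intersection model obtained using a mixed distribution of the form (1) at $K = 4$, the components of which are two-dimensional normal distributions [13] with a probability density
$$f_i(x,y)=\frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-c^2}}\exp\left\{-\frac{1}{2(1-c^2)}\left[\frac{(x-\mu_x)^2}{\sigma_x^2}-\frac{2c(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}+\frac{(y-\mu_y)^2}{\sigma_y^2}\right]\right\}$$
where
$\mu_x$ and $\mu_y$ are the mathematical expectations of random variables X and Y, respectively;
$\sigma_x$ and $\sigma_y$ are standard deviations for random variables X and Y, respectively;
$c$ is the correlation coefficient between X and Y.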
In this case, the simulation considers four road sections adjacent to the intersection, and vehicles located on these sections are considered as objects.
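The density of such a four-component model can be evaluated directly. The sketch below implements the standard two-dimensional normal density with correlation coefficient c and the weighted sum of the form (1); the four sets of parameters describing the road sections are hypothetical and are not taken from Figure 1.

```python
import math

def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, c=0.0):
    """Density of a two-dimensional normal distribution with correlation c."""
    if not -1.0 < c < 1.0:
        raise ValueError("correlation coefficient must lie in (-1, 1)")
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_minus_c2 = 1.0 - c * c
    norm = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_minus_c2)
    quad = (zx * zx - 2.0 * c * zx * zy + zy * zy) / (2.0 * one_minus_c2)
    return math.exp(-quad) / norm

def mixture_pdf(x, y, weights, params):
    """Mixed density of the form (1): weighted sum of component densities."""
    return sum(w * bivariate_normal_pdf(x, y, *p)
               for w, p in zip(weights, params))

# Hypothetical intersection: four road sections around the origin, K = 4.
weights = [0.25, 0.25, 0.25, 0.25]
params = [
    (-30.0, 0.0, 10.0, 2.0, 0.0),   # western section
    (30.0, 0.0, 10.0, 2.0, 0.0),    # eastern section
    (0.0, -30.0, 2.0, 10.0, 0.0),   # southern section
    (0.0, 30.0, 2.0, 10.0, 0.0),    # northern section
]
density_on_road = mixture_pdf(-30.0, 0.0, weights, params)
density_off_road = mixture_pdf(-30.0, 30.0, weights, params)
```

In this example the density on a road section is many orders of magnitude higher than between the sections, which is the property that density-based cluster selection relies on.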
The random variables X and Y are the coordinates of network nodes (for example, cars). The mixed distribution shown in the figure makes it possible to judge the probability $p$ of finding a node in a given area of the considered service area
$$p=\iint_{\Omega} f(x,y)\,dx\,dy \tag{3}$$
where
$\Omega$ is the selected area.
Equations (1) and (2) are very useful for modeling and simulation because they can be easily calculated in various simulation systems and other software. Although these distributions may differ from the real ones, they make it easy to check the performance of the model.
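The probability of finding a node in a selected area can be approximated numerically when the area is a rectangle. The following sketch uses the midpoint rule, which is only one of many possible integration schemes; the function name `region_probability` and the single-component test density are assumptions introduced for illustration.

```python
import math

def region_probability(density, x0, x1, y0, y1, steps=200):
    """Approximate the integral of density(x, y) over the rectangle
    [x0, x1] x [y0, y1] with the midpoint rule on a steps x steps grid."""
    hx = (x1 - x0) / steps
    hy = (y1 - y0) / steps
    total = 0.0
    for i in range(steps):
        x = x0 + (i + 0.5) * hx
        for j in range(steps):
            y = y0 + (j + 0.5) * hy
            total += density(x, y)
    return total * hx * hy

def standard_normal_2d(x, y):
    # Hypothetical one-component density, used only to exercise the sketch.
    return math.exp(-(x * x + y * y) / 2.0) / (2.0 * math.pi)

# Probability of falling within one standard deviation on both axes.
p = region_probability(standard_normal_2d, -1.0, 1.0, -1.0, 1.0)
```

For this test density the result can be checked against the closed form erf(1/sqrt(2)) squared, which is a convenient sanity check before applying the routine to a mixed density.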
Of course, the construction of distributions should be based on statistical data on the number of cars on different sections of the road. The photograph shown in the example is a special case and serves to explain this approach.
The shape (“sharpness” and “height”) of the peaks depends on the standard deviations and the coefficients $\eta_i$ corresponding to each of the distribution components. The probability value (3) is proportional to the proportion (number) of nodes in a given area. The density of nodes in this model is unevenly distributed in the service area and follows the probability distribution. The average density of nodes in a given area $\Omega$ can be estimated as
$$\rho_{\Omega}=\frac{n\,p}{A_{\Omega}} \tag{4}$$
where
$A_{\Omega}$ is the area of the region $\Omega$;
$n$ is the total number of nodes in the service area.
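As a small numerical illustration, the average density can be computed either from the model (expected number of nodes in the region divided by its area) or empirically from a set of generated coordinates. Both helper functions below and the numbers used are hypothetical.

```python
def model_density(n, p, area):
    """Average density from the model: expected number of nodes in the
    region (n * p) divided by the area of the region."""
    if area <= 0:
        raise ValueError("area must be positive")
    return n * p / area

def empirical_density(nodes, x0, x1, y0, y1):
    """Empirical counterpart: nodes counted inside a rectangle per unit area."""
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        raise ValueError("rectangle must have positive area")
    inside = sum(1 for x, y in nodes if x0 <= x <= x1 and y0 <= y <= y1)
    return inside / area

# 1000 nodes, 20 % of them expected in a 20 m x 20 m region.
rho_model = model_density(n=1000, p=0.2, area=400.0)

# Two of three sample nodes lie in a 2 x 1 rectangle.
rho_sample = empirical_density([(0.5, 0.5), (1.5, 0.5), (3.0, 3.0)],
                               0.0, 2.0, 0.0, 1.0)
```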
As a rule, the environment in which the network is built has a number of structural elements, such as buildings, roads, sidewalks, bridges, tunnels, overpasses, parking lots, etc. These elements determine the features, and possibly the nature of the location of the network nodes. Therefore, to describe arbitrary cases, it may be convenient to use different distributions in various combinations within the framework of model (1). At the same time, the density of users can be different in the areas described by different structural elements.
Comparing the proposed modeling method with practical applications, we can make the following assumptions. For example, when describing a network of nodes placed on vehicles, a car parking area can probably be described by a uniform distribution, an intersection area by a normal distribution, squares and roundabouts by a uniform distribution, and so on. Of course, the choice of distribution should be based on statistical observations.
Considering Figure 1, we can assume that in this example, four clusters can be distinguished, which correspond to the four sections of roads adjacent to the intersection. This assumption suggests itself because there are more network nodes (cars) in these sections, while between the sections there is a space with relatively few network nodes. Such an intuitive solution is based on the apparently different density of network nodes, which is easy to see from the illustration. It is likely that such a network can be served by mobile head nodes, either by allocating a separate head node to each cluster or by using a smaller number of head nodes that move between clusters. In any case, there is a problem of cluster selection, in which the main criterion is the density of network nodes.
3. Formulation of the Problem
Figure 2 shows the results of generating random network nodes based on a model built using a mixed distribution, the components of which are two-dimensional normal and uniform distributions [14].
This model demonstrates several typical elements that may be present in the network service area. These are areas of various shapes and with different distributions of users. If we consider the road network, then the areas designated as 1 and 2 can correspond to squares, areas 3 and 4 to intersections, and areas 5, 6, and 7 to sections of roads and parking lots. We call such areas structural elements.
To build an efficient data collection system using a mobile cluster head, it is necessary to choose the motion parameters in each of these areas appropriately. To do this, it is first necessary to be able to select these areas in the initial data set, i.e., to select the groups of users and determine the geometric (or geographical) coordinates needed to choose the parameters of the movement of nodes.
As mentioned above, to select the motion parameters, the primary task is to select areas that are homogeneous in terms of the location of network nodes. Analyzing Figure 2, it is quite easy to identify seven areas (structural elements) intuitively. However, firstly, the situation is far from always so obvious, and secondly, a formal method is needed on the basis of which this problem can be solved.
Formally, this problem can be described as a classification problem or a clustering problem [15,16]. When it comes to matching network nodes to given structural elements, this task is a classification task. When several groups of nodes can be allocated within one structural element, and the goal is to select groups of nodes and not just to match them with structural elements, this task is a clustering task. The application of clustering methods to communication problems is described, for example, in [17,18].
Let the n nodes of the network form the set
$$S=\{s_1,s_2,\ldots,s_n\} \tag{5}$$
in which it is necessary to select subsets $M_i\subset S$, $i=1,\ldots,k$, such that their pairwise intersections are empty, while the cardinality of the complement in $S$ of the union of the sets $M_i$ is much less than the cardinality of the set $S$:
$$M_i\cap M_j=\varnothing,\quad i\neq j \tag{6}$$
$$B=\bigcup_{i=1}^{k}M_i \tag{7}$$
$$A=S\setminus B \tag{8}$$
$$|A|\ll|S| \tag{9}$$
In practice, expressions (5)–(9) can be interpreted as the selection of k disjoint subsets in the initial set of nodes, such that they include almost all nodes of the network, with the exception of the nodes included in set A. These expressions should be interpreted as a formal mathematical description of this clustering problem. Expression (6) shows that the result should be disjoint sets. Expression (7) shows that all clusters obtained as a result of clustering together form the set B (all elements from the found clusters). Expression (8) shows that set B does not contain some elements of the original set, namely those of set A. Expression (9) says that the overwhelming majority of the elements of the original set are included in the found clusters; only the elements of set A, which are considered as noise, are not included.
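The conditions described above can be checked mechanically for any candidate clustering. In the sketch below, the requirement that the noise set be "much smaller" than the original set is replaced by an explicit threshold `max_noise_share`; this threshold and the function name are assumptions, since the text does not fix a numerical value.

```python
def check_partition(all_nodes, clusters, max_noise_share=0.05):
    """Check a candidate clustering against the conditions of the problem.

    Returns (ok, noise): ok is True when the clusters are pairwise disjoint
    subsets of all_nodes and the unassigned nodes (noise) make up at most
    max_noise_share of all nodes."""
    s = set(all_nodes)
    covered = set()
    valid = True
    for cluster in clusters:
        members = set(cluster)
        # Each cluster must lie inside S and must not overlap earlier ones.
        if not members <= s or covered & members:
            valid = False
        covered |= members
    noise = s - covered
    ok = valid and len(noise) <= max_noise_share * len(s)
    return ok, noise

# Ten nodes, two clusters, one node left as noise (10 % of the set).
ok, noise = check_partition(range(10), [{0, 1, 2, 3}, {4, 5, 6, 7, 8}],
                            max_noise_share=0.1)
```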
We will assume that the movement can be described by the speed and trajectory (route) of the node. In [7], a model was obtained that relates the speed of the movement of nodes to the quality of service and the density of network nodes.
Under the conditions of the task set, clustering should satisfy not only conditions (6)–(9), but the cluster should also be, if possible, homogeneous in terms of the distribution of nodes. The most appropriate criterion for this problem is the density of nodes, since it is this parameter that determines the choice of motion parameters for mobile nodes.
In a real situation, it is impossible to create “absolutely impenetrable boundaries” for network nodes (cars, pedestrians, animals, etc.), as well as absolutely reliable devices. Therefore, it should be expected that a certain number of nodes may be located outside the structural elements (roads, buildings, water bodies, etc.) and not perform the functions assigned to them. The coordinates of such nodes can be present in the source data, which creates additional difficulties in solving the problem. To be able to exclude such nodes from the problem, condition (9) is introduced, which admits that not all n nodes should be included in the allocated k clusters. By analogy with other random processes, we call the presence of such nodes noise.
As an additional condition, we choose the difference between the node densities in the selected clusters.
where
ρi and
ρj are the density of network nodes in the corresponding clusters;
ε is the allowable change in the density of nodes.
The solution to the clustering problem makes it possible to select the regions and the speeds of movement of the cluster head within them, for example, using the method described in [19]. The next task is to select the trajectories of the cluster heads within the boundaries of the selected clusters.
4. Method for Selecting the Motion Parameters of the Cluster Head
We assume that the movement can be described by the speed and trajectory (route) of the node. In [19], a model was obtained that makes it possible to select the trajectory of motion by using the FOREL algorithm to cluster network nodes according to the node connection radius criterion and then routing through the centers of the found clusters by approximately solving the traveling salesman problem.
In this case, this method does not lead to the desired results, because the criterion for selecting clusters does not take into account the influence of the distribution of nodes on the features of motion.
Figure 3 shows the result of clustering an example (Figure 2) using the k-means algorithm; in the example, k = 7. Objects assigned to different clusters are highlighted in color. The example clearly shows that the areas identified by clustering do not coincide with the structural elements. One structural element can be divided into several clusters, and one cluster can include parts of different elements. This result of the k-means algorithm is expected, because this algorithm does not take into account the peculiarities of node placement.
This algorithm was chosen only as an example of clustering that is unsuccessful for this problem; many other clustering algorithms yield similar results.
To select a suitable clustering algorithm, experiments were carried out with 11 known algorithms: MiniBatch k-means, FOREL, Affinity Propagation, Mean Shift, Spectral Clustering, Ward, Agglomerative Clustering, DBSCAN, OPTICS, BIRCH, and Gaussian Mixture. The algorithms were compared on a number of examples in terms of their ability to select clusters according to the task, i.e., within structural elements, as well as in terms of execution time. The execution time was obtained experimentally on a computer with an 8-core 4 GHz processor and 16 GB of memory. As examples with different parameters, the initial data sets used to test and diagnose such algorithms were generated from the library [20]: “nested circles”, “two moons”, “drops”, and uniform distribution. Such a set of initial data presents certain difficulties for clustering problems and allows us to describe the capabilities of the algorithms under study.
Table 1 shows the results of the study of clustering algorithms.
The estimate of the average execution time $t_e$ was obtained by averaging over a set of six test cases, the same for all algorithms.
The runtime variation $K$ was calculated as
$$K=\frac{1}{t_e}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(t_i-t_e\right)^2}$$
where
$t_i$ is the processing time for the $i$-th experiment;
$n$ is the number of experiments.
K characterizes the variability of processing time during a series of n experiments. This variability is expressed relative to the average value of this time, which gives an idea of the relative deviations of the processing time.
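A quantity of this kind is commonly computed as a coefficient of variation, i.e., the standard deviation of the measured times divided by their mean. The sketch below takes that reading and uses the population standard deviation (division by n); both choices, as well as the sample timings, are assumptions, and `statistics.stdev` could be substituted if division by n - 1 is intended.

```python
import statistics

def runtime_variation(times):
    """Coefficient of variation of execution times: standard deviation
    (population form, an assumption) divided by the mean."""
    mean = statistics.mean(times)
    if mean <= 0:
        raise ValueError("mean execution time must be positive")
    return statistics.pstdev(times) / mean

# Hypothetical timings in seconds for six test cases of one algorithm.
times = [0.8, 1.0, 1.1, 0.9, 1.2, 1.0]
k = runtime_variation(times)
```

Because the value is dimensionless, it allows the spread of execution times to be compared between algorithms whose average times differ by orders of magnitude.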
As the results show, four algorithms yielded results close to expectations, namely Spectral Clustering, Agglomerative Clustering, OPTICS, and DBSCAN. They make it possible to select structural elements, but only two of them, OPTICS and DBSCAN, make it possible to take noise into account.
The analysis of the considered algorithms showed that the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm met the requirements of the task to the greatest extent. The name of the algorithm reflects its functionality quite fully: it clusters spatial data by density and explicitly accounts for noise, which matches the requirements of the problem being solved.
The execution time of the DBSCAN algorithm was almost six times less than the time required for the OPTICS algorithm. In addition, it had a relatively small coefficient of variation in execution time compared to other algorithms, which can also be regarded as a positive property (a small spread in execution time for various tasks).
The DBSCAN algorithm was developed in 1996 by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, and was published in [21]. In 2014, the algorithm received the Test of Time Award at the KDD data mining conference [22]. Today it is one of the most frequently used algorithms in clustering problems.
Figure 4 shows the result of initial data clustering by the DBSCAN algorithm.
The DBSCAN algorithm allows you to select clusters based on the density of nodes. This allows you to select groups of nodes within the boundaries of structural elements. In the above example, groups of nodes were identified in all given structural elements (seven clusters). Additionally, a group of nodes located at a distance from the areas of structural elements, which were defined as “noise”, was identified.
A brief description of this algorithm is as follows. All initial objects (within the framework of the problem, these are nodes; in the description of the algorithm, we call them points) are divided into three types: core points, density-reachable points, and discarded points, or noise (outliers).
A point p is called a core point if its neighborhood of radius r contains at least m points (including the point p itself). All these points are said to be directly reachable from p. The ratio $m/(\pi r^2)$ expresses the density of objects in the vicinity of a core point.
A point q is density-reachable from a point p if a route can be laid between them on a graph built from the core points. An edge in the graph exists when the distance between two points does not exceed r.
All points that are not reachable from any other point are called outliers or noise points. The algorithm selects clusters in the areas where the initial objects are concentrated, while the conditional boundaries of a cluster are determined by the unreachability of neighboring points, i.e., by a low density of objects between clusters.
If a point is found to be a dense part of a cluster, then its ε-neighborhood (the neighborhood of radius r in the notation above, eps in Algorithm 1) is also part of that cluster. Hence, all points that are found within the ε-neighborhood are added, as are their own ε-neighborhoods when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.
DBSCAN can be used with any distance function [1,4] (as well as similarity functions or other predicates) [7]. The distance function (dist) can therefore be seen as an additional parameter. Algorithm 1 may be described in pseudocode as shown below.
Algorithm 1. Clustering by DBSCAN
DBSCAN(DB, distFunc, eps, minPts) {
    C := 0                                              /* Cluster counter */
    for each point P in database DB {
        if label(P) ≠ undefined then continue           /* Previously processed in inner loop */
        Neighbors N := RangeQuery(DB, distFunc, P, eps) /* Find neighbors */
        if |N| < minPts then {                          /* Density check */
            label(P) := Noise                           /* Label as Noise */
            continue
        }
        C := C + 1                                      /* Next cluster label */
        label(P) := C                                   /* Label initial point */
        SeedSet S := N \ {P}                            /* Neighbors to expand */
        for each point Q in S {                         /* Process every seed point Q */
            if label(Q) = Noise then label(Q) := C      /* Change Noise to border point */
            if label(Q) ≠ undefined then continue       /* Previously processed (e.g., border point) */
            label(Q) := C                               /* Label neighbor */
            Neighbors N := RangeQuery(DB, distFunc, Q, eps) /* Find neighbors */
            if |N| ≥ minPts then {                      /* Density check (if Q is a core point) */
                S := S ∪ N                              /* Add new neighbors to seed set */
            }
        }
    }
}
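For readers who wish to experiment, Algorithm 1 can be transcribed into plain Python almost line by line. The sketch below is such a transcription for points on a plane with the Euclidean distance; it is not the implementation used in the experiments, it performs a brute-force neighbor search, and the label value chosen for noise (-1) is an assumption.

```python
import math

NOISE = -1  # label used for outliers (assumption; any sentinel would do)

def range_query(points, p_idx, eps):
    """Indices of all points within distance eps of points[p_idx]."""
    px, py = points[p_idx]
    return [i for i, (x, y) in enumerate(points)
            if math.hypot(x - px, y - py) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 1, 2, ... for clusters, NOISE otherwise."""
    labels = [None] * len(points)        # None plays the role of "undefined"
    cluster = 0
    for p in range(len(points)):
        if labels[p] is not None:        # already processed
            continue
        neighbors = range_query(points, p, eps)
        if len(neighbors) < min_pts:     # density check failed
            labels[p] = NOISE
            continue
        cluster += 1
        labels[p] = cluster
        seeds = [q for q in neighbors if q != p]
        queued = set(seeds)
        i = 0
        while i < len(seeds):            # the seed list grows while we scan it
            q = seeds[i]
            i += 1
            if labels[q] == NOISE:
                labels[q] = cluster      # noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = range_query(points, q, eps)
            if len(q_neighbors) >= min_pts:   # q is a core point
                for r in q_neighbors:
                    if r not in queued:
                        queued.add(r)
                        seeds.append(r)
    return labels

# Two compact groups of nodes and one isolated node.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The brute-force `range_query` makes the sketch quadratic in the number of points; for large networks a spatial index would normally be used instead.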
To solve the problem, it is necessary to choose a value of density ρ at which the selected clusters satisfy the conditions of the problem, i.e., the identified clusters are homogeneous and are located within the boundaries of the structural elements.
Based on the logic of the algorithm, the result is degenerate at the boundary values of the density: at ρ → 0, one cluster including all elements is selected and there is no noise; at ρ → ∞, no clusters are formed, i.e., all elements are treated as noise. At intermediate values, the number of clusters increases as the density increases. To explore the operation of the algorithm, we set the density value through the radius r and keep the minimum number of nodes m constant. For the analysis, we count the number of discarded elements (noise) as r changes.
Figure 5a shows the results obtained with multiple clustering of a random sample with different radius values.
Figure 5b shows the dependence of the proportion of discarded elements on the radius value, and the statistical data are marked with circles. To approximate the obtained dependence, a Gaussoid was used (blue curve).
The curve parameters c and r0 (13) were chosen by the least-squares method.
To choose the value of r, we used the elbow (cubital) point method [23] (Figure 5b), i.e., we took the inflection point of the curve, at which the angle of inclination of the tangent to it is equal to π/4. This method is often used in clustering problems, for example, to select the number of clusters for the k-means method. For a Gaussoid, this point is equal to the parameter c.
In this case, several trial clustering operations are performed, the number of discarded elements is calculated for them, and then the position of the cubital point is determined.
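Locating such a point on a sampled curve can be sketched as follows. The example does not fit a Gaussoid; instead it uses a common substitute, namely the sample farthest from the straight line joining the end points of the curve after both axes are rescaled to [0, 1]. For a smooth monotone curve, the tangent at that point is parallel to the chord, whose slope in the rescaled coordinates has magnitude 1, so the criterion is close in spirit to the π/4 rule. The function name and the sample data are hypothetical.

```python
def elbow_index(r_values, noise_share):
    """Index of the sample farthest from the chord joining the first and
    last samples, with both axes rescaled to the interval [0, 1]."""
    r_lo, r_hi = r_values[0], r_values[-1]
    y_lo, y_hi = min(noise_share), max(noise_share)
    if r_hi == r_lo or y_hi == y_lo:
        raise ValueError("curve is degenerate")
    xs = [(r - r_lo) / (r_hi - r_lo) for r in r_values]
    ys = [(y - y_lo) / (y_hi - y_lo) for y in noise_share]
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    # The distance to the chord is proportional to this cross product.
    dist = [abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
            for x, y in zip(xs, ys)]
    return max(range(len(dist)), key=dist.__getitem__)

# Hypothetical results of six trial clusterings.
radii = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
share_of_noise = [1.0, 0.5, 0.2, 0.1, 0.05, 0.0]
r0 = radii[elbow_index(radii, share_of_noise)]
```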
The next subtask when using DBSCAN is the choice of the density value, for which it is necessary to set the minimum number of nodes m in the vicinity of the core points. In a single run, the algorithm cannot select clusters with different element densities; areas with a density significantly lower than the specified one will be defined as noise. Therefore, if different densities of users are expected in the service area, it is necessary to organize a series of clustering stages for different values of this parameter.
The algorithm of the proposed method is shown in Figure 6.
We describe this algorithm step by step.
The following notation is used in the algorithm: S is the initial set of objects; rmin and rmax are the boundary values of the radius r, which correspond to the maximum and minimum node density, respectively; nmin is the minimum number of nodes in the cluster; delta (Δ) is the step of changing r; r0 is the value of r found by the cubital point method; and i is the cycle counter.
Step 1. Input of initial data, i.e., set of coordinates of network nodes S and initialization of the cycle counter i, as well as an array of boundary values r and nmin.
Step 2. Choice of the i-th set of boundary values r and nmin.
Step 3. The initial value of r is set (minimum, which corresponds to the maximum density).
Step 4. The DBSCAN algorithm is executed.
Step 5. Verification of whether calculations have been performed for the entire range of r values. If not, then go to step 6; if yes, then go to step 7.
Step 6. Increase r by Δ and return to step 4.
Step 7. This step is performed after executing the DBSCAN algorithm for a number of r values. Based on the clustering results, the percentage of dropped nodes is estimated, and r0 is estimated using the cubital point method.
Step 8. This step is performed after obtaining the value r0, for which the DBSCAN algorithm is performed. It should be noted that the clustering result for the value r0 could be obtained in steps 4–6, in which case this step is a formality.
Step 9. Checking whether all sets of parameters r and nmin have been considered. If not, then go to step 10, if yes, then go to step 11.
Step 10. Increment counter i by one, i.e., select the next set of parameters, and return to step 2.
Step 11. At this step, the trajectory of the cluster head is selected.
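The control flow of steps 2 to 10 can be sketched as a pair of small functions that take the clustering routine and the rule for locating r0 as arguments. To keep the example self-contained, two deliberately simplified stand-ins are used: `core_point_labels` replaces DBSCAN by a plain core point test, and `largest_drop_index` replaces the cubital point rule by the largest single-step decrease of the noise share. Both stand-ins, all names, and the test data are assumptions; how the results of different stages are merged is not specified in the steps above and is left open here.

```python
import math

def core_point_labels(points, r, n_min):
    """Hypothetical stand-in for DBSCAN: label 1 if the r-neighborhood of a
    point holds at least n_min points (itself included), else -1 (noise)."""
    return [1 if sum(math.dist(p, q) <= r for q in points) >= n_min else -1
            for p in points]

def largest_drop_index(radii, noise_share):
    """Hypothetical stand-in for the cubital point rule: index reached by
    the largest single-step decrease of the noise share."""
    drops = [noise_share[k] - noise_share[k + 1]
             for k in range(len(radii) - 1)]
    return drops.index(max(drops)) + 1

def sweep_stage(points, r_min, r_max, delta, n_min, cluster_fn, pick_index):
    """Steps 3 to 8 for one parameter set: sweep r, find r0, cluster at r0."""
    n_steps = int(round((r_max - r_min) / delta))
    radii = [r_min + k * delta for k in range(n_steps + 1)]
    noise_share = [cluster_fn(points, r, n_min).count(-1) / len(points)
                   for r in radii]
    r0 = radii[pick_index(radii, noise_share)]
    return r0, cluster_fn(points, r0, n_min)

def multi_stage(points, stages, cluster_fn, pick_index):
    """Steps 2, 9, and 10: repeat the sweep for every parameter set
    (r_min, r_max, delta, n_min) and collect the per-stage results."""
    return [sweep_stage(points, *stage, cluster_fn, pick_index)
            for stage in stages]

# A dense group, a sparse group, and one isolated node.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 8), (8, 5), (8, 8),
          (30, 30)]
stages = [(1.0, 5.0, 1.0, 3)]
results = multi_stage(points, stages, core_point_labels, largest_drop_index)
```

In this toy example the noise share stays at 5/9 for r = 1 and r = 2 and drops to 1/9 at r = 3, so the stage reports r0 = 3 and marks only the isolated node as noise.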
The last step includes the task of choosing the trajectory of the node. Note that within the framework of this article, tasks were considered whose application area was urban infrastructure and such structural elements as elements of roads, rears, and structures. The geometric dimensions of such objects, as a rule, do not exceed the communication zone of the cluster head (hundreds of meters). Therefore, the solution to this problem is obvious in most cases, for example, movement along the center line of the road, along the sidewalk border, along the wall of a building, along a corridor, etc.
A dedicated solution to this problem is expedient when the sizes of the obtained clusters and of the structural elements of the environment significantly exceed the communication zone of the cluster head, for example, for agricultural land, water areas, forests, etc. These problems are the subject of further research.
We only note that a possible method for solving it can be, for example, the method proposed in [19]. In such a case, this method must be applied for each of the resulting clusters.
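To give an idea of what routing through the centers of sub-clusters might look like, the sketch below builds an approximate tour with the nearest-neighbor heuristic. This is only one simple way of approximately solving the traveling salesman problem and is not necessarily the procedure used in [19]; the coordinates of the centers are hypothetical.

```python
import math

def nearest_neighbour_route(centres, start=0):
    """Greedy approximate tour: from the current centre, always move to the
    closest centre that has not been visited yet."""
    unvisited = set(range(len(centres)))
    unvisited.remove(start)
    route = [start]
    while unvisited:
        last = centres[route[-1]]
        # Ties are broken by the smaller index to keep the result repeatable.
        nxt = min(unvisited, key=lambda i: (math.dist(last, centres[i]), i))
        route.append(nxt)
        unvisited.remove(nxt)
    return route

def route_length(centres, route):
    """Total length of the open path that visits centres in route order."""
    return sum(math.dist(centres[a], centres[b])
               for a, b in zip(route, route[1:]))

# Hypothetical centres of four sub-clusters inside one large cluster.
centres = [(0.0, 0.0), (10.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
route = nearest_neighbour_route(centres)
length = route_length(centres, route)
```

The greedy tour is not guaranteed to be the shortest one, but it is cheap to compute and can serve as a starting point for further improvement.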
The algorithm described above was implemented in Python and tested on several network models, in particular, on the model described at the beginning of the article. The testing of the algorithm showed its efficiency in the selection of clusters, including those with different node densities.