We have conducted the majority of the evaluation tests on the w-iLab.t wireless sensor testbed in Zwijnaarde [32]. This testbed provides a controlled test environment in a large (66 m × 20 m) open room with 60 fixed nodes and 15 mobile nodes. Each node includes sensors (Rmoni RM090 and Zolertia Z1) and Wi-Fi (IEEE 802.11a/b/g/n). In our experiments, we have used up to 40 fixed nodes as sensor nodes, two fixed nodes as sensor network GWs and two other fixed nodes to generate interfering Wi-Fi traffic when needed.
Figure 6 shows the part of the testbed that we used during our experiments. The other nodes were kept idle during the experiments so that they would not cause extra interference; they are not shown in Figure 6 for clarity. The two GWs are connected via an Ethernet link. When both GWs were used, they operated on different IEEE 802.15.4 Radio Frequency (RF) channels. However, in most experiments, we used only GW1 and kept GW2 idle so that it would not interfere with the running experiments. In all of the performance evaluation experiments, we used the Rmoni RM090 boards with Contiki. We used RPL as the routing protocol and enabled the SMRF multicast engine. We did not use any RDC. The GWs run the rpl-border-router example provided by Contiki; therefore, they are the RPL DODAG roots for their subnets, delegate the global IPv6 prefix and route traffic to and from the constrained networks. All other nodes run the Erbium server and also have RPL enabled. In Table 6, we summarize the most important settings for Contiki and the protocols used.
In the following subsections, we present the results of our evaluation experiments on the testbed and compare them with simulated results when appropriate.
5.2.1. Congestion Control Optimizations
Congestion control is an important aspect of group communication, especially in LLNs, where resources are limited and network congestion can lead to extended response times and significant energy consumption due to frequent retransmissions of packets. CoAP provides basic congestion control by using the exponential back-off mechanism (Section 2.2.1) and by limiting the number of open requests from a client to any server to one request by default. Furthermore, CoAP specifies that, when using multicasts, a random delay should be inserted before replying to multicast requests. In CoAP terms, this delay is called leisure. The server can either use a default value for leisure or compute a value for it. If the server has a group size estimate G, a target data transfer rate R and an estimated response size S, a rough lower bound for leisure can then be computed as:

$$\mathrm{lb\_Leisure} = \frac{S \cdot G}{R} \qquad (1)$$

In our experiments, G was between five and 40, S was approximately 80 bytes and the target rate R can be set to a conservative 8 kbit/s = 1 kB/s. The resulting lower bound for the leisure is then between 0.4 s and 3.2 s. However, since CoAP servers will often not be able to compute the leisure, we elected to use the default leisure value of 5 s, as recommended by [6], in all of our multicast experiments. For a more complete discussion of the leisure period and its estimation, we refer to Section 8.2 of [6].
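As a worked check of the numbers above, the following minimal Python sketch evaluates the rough lower bound for leisure, S · G / R, for the smallest and largest group sizes used in our experiments. The function name `leisure_lower_bound` is hypothetical and chosen only for illustration; it is not part of any CoAP implementation.

```python
def leisure_lower_bound(group_size: int, response_size_bytes: float,
                        rate_bytes_per_s: float) -> float:
    """Rough lower bound for the CoAP leisure period in seconds (S * G / R)."""
    if group_size < 1:
        raise ValueError("group size must be at least 1")
    if rate_bytes_per_s <= 0:
        raise ValueError("target rate must be positive")
    return response_size_bytes * group_size / rate_bytes_per_s


if __name__ == "__main__":
    # Values from the text: S = 80 bytes, R = 8 kbit/s = 1000 bytes/s.
    for g in (5, 40):
        print(f"G = {g}: leisure >= {leisure_lower_bound(g, 80, 1000):.1f} s")
```

For G = 5 and G = 40, this gives the 0.4 s and 3.2 s bounds quoted above, both well below the 5 s default.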
CoAP does not specify a congestion control mechanism for the case in which a single client communicates with many servers using unicasts, as is the case in our group communication solution. However, our experience shows that this can quickly lead to congestion. A simple solution for avoiding network congestion when using unicasts is to limit the rate at which requests are sent. This way, the group members receive the requests spread over a period of time, and thus, their replies are also spread over a period of time, in a similar way to leisure. In order to spread the replies over a given period, the EM should insert a delay D between requests that equals this period divided by the group size G (Equation (2)). For our experiments, Equation (2) yields delays in the range of 82 ms to 100 ms. In order to verify the effect of the delay length, we conducted a series of experiments on the testbed in which we queried an entity of five members and measured the response time, defined as the time between the moment the client issues the request to the EM and the moment it receives the response. We repeated the same experiment for different delays between the requests sent from the EM to the members. We repeated the experiment 50 times for each setting and computed the averages. During these experiments, all Wi-Fi devices were turned off, and thus, no noticeable external interference was present. In [9], we performed the same experiments using the Cooja network simulator.
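The pacing rule can be sketched as follows. This is a minimal Python illustration and not the EM implementation: it assumes that the delay is the spreading period divided by the group size G, the names `request_delay` and `send_offsets` are hypothetical, and the 0.5 s spreading period in the usage example is an assumed value chosen only because it produces a 100 ms delay for five members.

```python
from typing import List


def request_delay(spread_period_s: float, group_size: int) -> float:
    """Delay D between consecutive unicast requests (period divided by G)."""
    if group_size < 1:
        raise ValueError("group size must be at least 1")
    return spread_period_s / group_size


def send_offsets(delay_s: float, group_size: int) -> List[float]:
    """Time offsets (in seconds) at which the request to each member is sent."""
    return [i * delay_s for i in range(group_size)]


if __name__ == "__main__":
    # Assumed example: spread the requests to five members over 0.5 s.
    d = request_delay(0.5, 5)
    print(f"D = {d * 1000:.0f} ms, offsets = {send_offsets(d, 5)}")
```

Because the requests are sent sequentially, the last request leaves the EM (G − 1) · D after the first one, which is why large delays eventually dominate the total response time.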
Figure 7 shows the results of the experiments on the testbed and in Cooja. In general, there is a very good match between the simulation results and the testbed results. The figure clearly shows the need for the delay between the requests. Without the delay, the response time of the entity was about 3 s. When using a delay of 0.1 s, as calculated from Equation (2), the response time drops to 550 ms, which is close to the minimum value of 390 ms that was achieved for a delay of 50 ms.
In order to verify whether the same relationship exists between the delay and the response time for other group sizes, we repeated the same set of experiments on the testbed using additional group sizes of up to 40 members. At the time of the experiment with 30 members, one testbed node did not start properly, leaving only 29 group members in the experiment.
Figure 8 shows the results of those experiments. Since the EM sends the requests to the members sequentially, the response time for the complete entity is expected to grow with the group size. This relationship is clearly visible in the figure. Nevertheless, the graphs for all of the group sizes follow the same pattern. To further analyze the relationship between the delay and the group size, consider Figure 9, which shows the same results as Figure 8, but normalized over the group size G. In a star topology, such as in our case, where all members need to communicate with the root of the star (the GW), one expects the average response time to increase with the number of neighbors, since more neighbors cause more collisions on the shared medium. This is indeed the case when we compare any larger group size with the group of five members. However, when comparing the larger groups with each other, this relationship cannot be observed. The reason is that, as the group size increased, and given the way nodes are distributed over the testbed, some members were no longer directly reachable from the GW, and their traffic was routed via other members. Consequently, the average hop count (h) also increased from just one hop for the group of five members to 2.4 hops for the group of 40 members, while the maximum hop count increased from one to five (see Table 7). A higher hop count implies that a lower percentage of members can communicate directly with the GW. It also means that a lower percentage of nodes is in the collision domain of the GW. As a result, more communication can take place in parallel inside the Wireless Sensor Network (WSN) before the packets reach the collision domain of the GW, where the bottleneck is located.
Regardless of the small changes in the values for the various groups, we observe that the shape of the relationship function is very similar among all of them. The response times were always improved significantly when the delay was around the recommended range of 82 ms to 100 ms. However, as the delay between requests grows larger, it becomes the dominating factor for the total response time with a linear relationship between the two.
Another indicator of the performance of any communication solution is its reliability. During the tests we conducted in this section, the reliability of the communication was always 100% for all group sizes lower than 40. This is not surprising, since we had no external interference, and the only cause for errors was internal collisions. For the group size of 40, the reliability of member replies was never 100%. It was always between 99.8% and 99.9%, regardless of the delay between requests. This is also not surprising, as with the larger group size, the chance of collisions increases, and the CoAP retransmission mechanism is occasionally insufficient. We will discuss reliability in more detail in the next subsection.
As a result of the observations we made in these experiments, we have used an entity delay between requests of 100 ms in all of the following experiments, which is also in line with the results of Equation (2).
5.2.2. Reliability
Reliability is a key performance indicator. In this subsection, we experimentally evaluate the reliability of both unicast and multicast CoAP group communication in the presence of Wi-Fi interference. To generate this interference, we send UDP traffic from one Wi-Fi node to the other at a constant bandwidth by using the iperf tool. We have set up the Wi-Fi communication to use Wi-Fi Channel 13, which completely overlaps with IEEE 802.15.4 Channels 25 and 26 that we use inside the WSN. Since we are using CSMA as the Media Access Control (MAC), the sensor nodes will back off when Wi-Fi is sending. However, this is not true for the other direction: typically, the Wi-Fi MAC will not detect that wireless sensors are sending and will not back off.
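The channel overlap can be checked with a short calculation. The sketch below is illustrative only: it uses the standard 2.4 GHz channel plans (Wi-Fi channel n centered at 2407 + 5n MHz for n = 1 to 13, IEEE 802.15.4 channel k centered at 2405 + 5(k − 11) MHz for k = 11 to 26) and assumes nominal channel widths of 20 MHz and 2 MHz, respectively. The function names are hypothetical.

```python
def wifi_center_mhz(channel: int) -> int:
    """Center frequency of 2.4 GHz Wi-Fi channels 1 to 13."""
    if not 1 <= channel <= 13:
        raise ValueError("Wi-Fi channel must be in 1..13")
    return 2407 + 5 * channel


def ieee802154_center_mhz(channel: int) -> int:
    """Center frequency of IEEE 802.15.4 2.4 GHz channels 11 to 26."""
    if not 11 <= channel <= 26:
        raise ValueError("IEEE 802.15.4 channel must be in 11..26")
    return 2405 + 5 * (channel - 11)


def fully_overlapped(wifi_channel: int, sensor_channel: int,
                     wifi_width_mhz: float = 20.0,
                     sensor_width_mhz: float = 2.0) -> bool:
    """True if the sensor channel lies entirely inside the Wi-Fi channel.

    The default widths are nominal values assumed for this illustration.
    """
    w = wifi_center_mhz(wifi_channel)
    s = ieee802154_center_mhz(sensor_channel)
    return (w - wifi_width_mhz / 2 <= s - sensor_width_mhz / 2
            and s + sensor_width_mhz / 2 <= w + wifi_width_mhz / 2)


if __name__ == "__main__":
    for ch in (25, 26):
        print(ch, ieee802154_center_mhz(ch), fully_overlapped(13, ch))
```

Under these assumptions, Wi-Fi Channel 13 spans 2462 MHz to 2482 MHz, which contains both IEEE 802.15.4 Channel 25 (centered at 2475 MHz) and Channel 26 (centered at 2480 MHz).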
To measure the reliability, we used the same experiment setup shown in Figure 6 to communicate with groups of 10, 20 and 30 members. We gradually increased the Wi-Fi interference in the network in steps of 5 Mb/s and measured the reliability of getting responses to the respective requests. We repeated the same experiment for our group communication solution and for multicasts. We ran each experiment 50 times and show the averages of the member reliability (i.e., the reliability of the communication with individual group members) in Figure 10. Multicasts are not transported reliably, and thus, the reliability of the network decreases as soon as there is packet loss due to the Wi-Fi interference in the network. When using our unicast group communication solution, CoAP confirmable messages are used.
For the group of 10 members, the reliability of individual resources always remained 100%, even when the Wi-Fi nodes were transmitting as fast as they could (28.5 Mb/s). The reliability of individual resources for the group of 20 nodes dropped slightly to 99.9% under maximum Wi-Fi interference. For the 30-member group, the reliability was further reduced to 99.5% (compared to 94.6% in the case of multicasts).
Figure 10 also shows that the reliability of individual members decreases with an increasing group size, both for unicast and multicast communication. This is due to two reasons. First, larger groups are denser and, thus, have a higher chance of collision between the group members. Second, and maybe with a higher impact on the reliability, bigger groups have a larger average hop count. This means that every message (both request and reply) between a client and a server has an additional chance of getting dropped at each hop on the way to its destination. Nevertheless, in our 20- and 30-member groups, 100% reliability was maintained for unicast communication until a Wi-Fi transmission rate of 25 Mb/s, with one single exception for the 30-member group at 5 Mb/s, where one message was lost and resulted in a reliability of 99.9%.
In many group communication use cases, it is desirable to get answers from all members of the group. A complete group communication is considered successful when communication to all members in the group is successful.
Figure 11 shows the effect of packet loss on the reliability of the complete group for our 10-, 20- and 30-member groups. Certainly, the reliability of a complete group is less than the reliability of its individual members, since the loss of a message to or from a single member renders the complete group request unsuccessful. In these cases, the use of multicasts does not provide good results. Already at 15 Mb/s Wi-Fi traffic, the reliability of 20- and 30-member groups drops to about 80%. In contrast, our unicast-based group communication maintains 100% reliability for the 10- and 20-member groups, even with the maximum transmission speed of the Wi-Fi nodes, and only drops to 98% in the case of the 30-member group.
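The two reliability metrics can be made concrete with a small sketch. The code below is an illustration of the definitions given above and not our evaluation scripts: it assumes that each experiment run is recorded as a list of booleans, one per group member, indicating whether a reply was received. The function names and the sample data are hypothetical.

```python
from typing import Sequence


def member_reliability(runs: Sequence[Sequence[bool]]) -> float:
    """Fraction of individual member requests that were answered."""
    total = sum(len(run) for run in runs)
    if total == 0:
        raise ValueError("no member results recorded")
    answered = sum(1 for run in runs for ok in run if ok)
    return answered / total


def group_reliability(runs: Sequence[Sequence[bool]]) -> float:
    """Fraction of runs in which every member of the group answered."""
    if not runs:
        raise ValueError("no runs recorded")
    return sum(1 for run in runs if all(run)) / len(runs)


if __name__ == "__main__":
    # Hypothetical data: four runs with five members, one reply lost in run 3.
    runs = [
        [True, True, True, True, True],
        [True, True, True, True, True],
        [True, True, False, True, True],
        [True, True, True, True, True],
    ]
    print(member_reliability(runs), group_reliability(runs))
```

In this made-up example, a single lost reply lowers the member reliability only to 95%, while the group reliability falls to 75%, which illustrates why the group metric is the stricter of the two.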
These results are generally in line with the simulations that we performed previously in [9]. However, a direct comparison is not possible, since the simulations used a more controlled topology, in which five nodes were one hop away, another five nodes were two hops away, and so on. On the testbed, the location of the nodes is fixed, and it was up to RPL to construct the topology for the routing. Furthermore, the simulations randomly dropped packets at a configurable percentage to simulate external interference, while on the testbed, real Wi-Fi traffic at one point of the network was used.
The drawbacks of the improved reliability of our unicast-based approach are the increased network overhead and response time. These are expected results, since the reliability is achieved by transmitting acknowledged messages, which results in more messages and longer delays in the case of errors. We have discussed these issues in detail in our previous work [9].
5.2.4. Group Size
As shown in the previous subsections, using large groups can have a negative impact on the reliability of the group. In our tests, unicast groups started to become unreliable after a group size of 30 members. Multicast groups are generally unreliable, but the reliability also becomes worse with an increasing group size. The reason for this is that with the increase in group size, the density of the nodes typically also increases, and as a result, more collisions occur in the network. Furthermore, for the unicast-based solution, the group size directly affects the response time since the EM adds a delay between the requests it sends to the members. One simple solution is to split the groups. However, splitting the groups does not bring much benefit, when both groups are still using the same RF channel. When using our group communication solution, one can use more than one GW and create different WSNs that use different IEEE 802.15.4 RF channels. The groups are split accordingly.
In order to test this approach and to demonstrate the use of more than one GW to create two WSNs that overlap in physical space but use different RF channels, we have created a new experiment. In this experiment, each GW communicates with a network of 10 sensor nodes using its own RF channel (IEEE 802.15.4 Channels 25 and 26). The two GWs are connected via an Ethernet cable, and routing is enabled between them. We have repeated the same test as in Section 5.2.3, but now using a group that consists of two smaller groups.
Figure 13 shows the response time vs. the speed of interfering Wi-Fi traffic for the new experiment, along with the results for groups of 10 and 20 members from the previous section for comparison. As expected, the response time for the group of two smaller groups is better than that of the one big group, although the total number of nodes was the same in both cases (20 nodes). However, the response time is larger than in the case of a single group with 10 members. The reason is that we use nested groups, i.e., a group that contains two groups. This results in some additional processing overhead and also inserts a delay of 100 ms between the requests issued to the different subgroups. Additionally, since we used two neighboring IEEE 802.15.4 channels, a small amount of interference between the channels is present. We selected two neighboring channels so that both would be equally affected by Wi-Fi Channel 13, which overlaps both IEEE 802.15.4 Channels 25 and 26. In a production setting, one should avoid neighboring channels in order to eliminate this limited amount of interference as well. The selection of channels should also take into consideration which Wi-Fi channels are in use.
5.2.5. CoAP Retransmission Timeout
As described in Section 2.2.1, CoAP has its own basic reliability mechanism that can be used for unicast communication. When reliability is needed, the sender of the CoAP message should use a Confirmable Message (CON). The receiver has to acknowledge this type of message by sending an ACK. If the sender does not receive a reply within a back-off time, it retransmits the confirmable message at exponentially increasing intervals, until it receives an ACK or runs out of attempts. The initial back-off is set to a random time between ACK_TIMEOUT and ACK_TIMEOUT * ACK_RANDOM_FACTOR. By default, ACK_TIMEOUT = 2 s and ACK_RANDOM_FACTOR = 1.5, and thus, the default initial back-off is between 2 and 3 s. If a reply to the first transmission attempt of a CON is not received within the initial back-off time, the CoAP sender doubles the back-off time and retransmits the packet. If a reply to the first retransmission is not received, CoAP again doubles the back-off time and retries the transmission until MAX_RETRANSMIT (by default, four) is reached. If no reply is received after expiration of the back-off time of the last retransmission, the client is notified about the error condition. When using the default values, the timeout occurs in the best case after 2 + 4 + 8 + 16 + 32 = 62 s and in the worst case after 3 + 6 + 12 + 24 + 48 = 93 s.
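The back-off schedule described above can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic only, not a CoAP implementation: the parameter names follow the CoAP specification, while the helper functions `backoff_intervals` and `give_up_window` are hypothetical names introduced for illustration.

```python
from typing import List, Tuple

# Default CoAP transmission parameters as stated in the text.
ACK_TIMEOUT = 2.0          # seconds
ACK_RANDOM_FACTOR = 1.5
MAX_RETRANSMIT = 4


def backoff_intervals(initial_s: float,
                      max_retransmit: int = MAX_RETRANSMIT) -> List[float]:
    """Waiting times after the first transmission and after each retransmission.

    The back-off doubles after every unanswered attempt.
    """
    return [initial_s * 2 ** i for i in range(max_retransmit + 1)]


def give_up_window(ack_timeout: float = ACK_TIMEOUT,
                   random_factor: float = ACK_RANDOM_FACTOR,
                   max_retransmit: int = MAX_RETRANSMIT) -> Tuple[float, float]:
    """Best-case and worst-case time (s) until the client is notified of failure."""
    best = sum(backoff_intervals(ack_timeout, max_retransmit))
    worst = sum(backoff_intervals(ack_timeout * random_factor, max_retransmit))
    return best, worst


if __name__ == "__main__":
    print(backoff_intervals(ACK_TIMEOUT))   # shortest possible schedule
    print(give_up_window())                 # defaults
    print(give_up_window(ack_timeout=1.0))  # a shorter ACK_TIMEOUT
```

With the default values, the sketch yields a window of 62 s to 93 s before the client gives up; halving ACK_TIMEOUT halves both bounds, since every interval scales linearly with the initial back-off.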
The CoAP protocol allows the client to change the default parameters according to its needs. Changing those parameters will affect both the reliability and the response time. Changing MAX_RETRANSMIT affects the reliability directly, since it changes the number of attempts to achieve a successful communication. In our tests, the reliability was 100% most of the time, with the exception of large groups under heavy interference. As such, the default value of four retransmissions is sufficient for our use case. On the other hand, changing ACK_TIMEOUT, and thus the initial back-off time, has a direct impact on the response time, since it determines the time between the retransmission attempts. In order to investigate the effect of changing the initial back-off time on our solution, we conducted a series of tests similar to those described in Section 5.2.2 for different values of the initial back-off time.
Figure 14 shows the effect of Wi-Fi interference on the response time for three different values of the initial back-off time (ACK_TIMEOUT = 0.5, 1 and 2 s) for a group of 10 members. When there is no Wi-Fi interference, there is no need for retransmissions, and thus, the initial back-off has no effect. When Wi-Fi traffic was interfering with our WSN, reducing ACK_TIMEOUT from 2 s to 1 s helped to improve the response time. However, reducing ACK_TIMEOUT further to 0.5 s had a negative effect. This is because, in this case, CoAP did not wait long enough for the replies to arrive and retransmitted the requests prematurely, causing more collisions in the network.