1. Introduction
The Internet of Things (IoT) is an enabling paradigm of Industry 4.0 that uses sensors to extract environment-aware data in diverse applications, such as domotics [1], smart energy [2], and precision agriculture [3], among others. The collected data are then stored and analyzed to perform classification or regression, helping organizations make decisions about their processes [4]. Although there are applications where the end nodes (ENs) transmit information over short distances and have unlimited energy resources (e.g., domotics), there are also cases where the ENs must be deployed in hard-to-access places, where the sensors' information has to be transmitted over long distances and where changing batteries is difficult or impossible, e.g., in forest fire monitoring [5], water level regulation in dams [6], and landslide detection [7]. Thus, when the application has low-energy and long-distance constraints, low-power wide-area networks (LPWANs) are used because they offer a good compromise between range and power consumption [8]. One of the most widespread protocols for LPWANs is LoRaWAN [9], which has gained popularity for IoT deployments because it operates in unlicensed bands, consumes little energy, and covers wide ranges compared to competitors such as narrowband IoT (NB-IoT) or Sigfox [10].
Because LoRaWAN is a wireless sensor network (WSN) protocol [11], deploying ENs in the field requires prior network planning and link budget analyses. These analyses help establish the network parameters that achieve reliable connectivity at low energy consumption, so designers can choose among different radio elements, including antenna geometries, antenna gains, allowed attenuations caused by cables and connectors, expected path loss (PL) and shadowing features posed by the channel, and transmission powers [12]. Consequently, the link budget calculation guarantees that the received signal strength indicator (RSSI) on the gateway (GW) side is sufficiently large for the signal to be demodulated correctly. The PL effects can be estimated by using theoretical models (e.g., Friis [13] and two-ray [12]) or empirical models (e.g., Okumura–Hata [14] and log-distance models [12]) to accomplish the link margin goals. More specifically, the Friis approach considers that the PL depends only on distance and frequency and does not account for the multipath phenomena that cause shadowing [15]. In turn, the two-ray model considers a theoretical approximation of the line-of-sight (LoS) ray and the ray reflected off the ground, so it only partially accounts for shadow fading effects [12].
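As an illustration of the Friis approach described above, the following minimal sketch computes the free-space path loss from distance and frequency only. The function name and the example values (1 km, 915 MHz) are illustrative assumptions, and isotropic antennas are assumed, so antenna gains must be handled separately in the link budget.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def friis_path_loss_db(distance_m: float, frequency_hz: float) -> float:
    """Free-space path loss in dB between isotropic antennas (Friis).

    Depends only on distance and frequency; multipath and shadowing
    are not modeled.
    """
    if distance_m <= 0 or frequency_hz <= 0:
        raise ValueError("distance and frequency must be positive")
    wavelength_m = SPEED_OF_LIGHT / frequency_hz
    return 20.0 * math.log10(4.0 * math.pi * distance_m / wavelength_m)


# Illustrative values only (not taken from the dataset): 1 km at 915 MHz.
example_pl_db = friis_path_loss_db(1_000.0, 915e6)  # about 91.7 dB
```

Doubling the distance adds about 6 dB under this model, which is why it tends to underestimate the loss in cluttered urban links.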
However, because the multipath phenomenon in real applications is very diverse and complex, empirical models based on measurement campaigns have also been proposed. For instance, the Okumura–Hata approach [14] provides a closed-form expression derived from data collected in Tokyo, Japan. It depends on the distance between the EN and the GW, the antenna heights, and the frequency. This model was fitted for large cells where the antenna heights range from 30 to 100 m; however, in WSN/IoT networks, the ENs' antennas are close to the ground (for example, in precision agriculture [16]), causing shadow fading effects of up to 14 dB [12]. The model may therefore not be suitable for IoT applications, considering that the maximum transmitter power is about 20 dBm (e.g., LoRaWAN [9]). Because of these limitations, a log-distance path loss model (LDPLM) is often fitted from field data instead, including a shadow fading term whose probability density function (PDF) is assumed to follow a lognormal distribution. According to [17], the statistical validity of the LDPLM requires the following conditions: (i) the model must pass an analysis of variance (ANOVA) test, by which the significance of the log-distance weight (also known as the path-loss exponent) is checked, and (ii) the residual error/shadow fading term must be log-normally distributed, homoscedastic, and uncorrelated. However, in the proposed models, these tests are rarely addressed, and in the cases where they are, normality is not always met [18,19].
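The LDPLM fit described above can be sketched as an ordinary least-squares regression of the path loss (in dB) on 10·log10(d/d0). This is a minimal sketch under stated assumptions: the function name, the reference distance d0 = 1 m, and the return layout are hypothetical choices, not the procedure of [17]. The returned residuals are the shadow fading samples on which the normality, homoscedasticity, and correlation checks would then be run.

```python
import numpy as np


def fit_ldplm(distance_m, path_loss_db, d0_m=1.0):
    """Least-squares fit of PL(d) = PL0 + 10 * n * log10(d / d0) + X_sigma.

    Returns (PL0 in dB, path-loss exponent n, shadowing std in dB, residuals).
    d0_m = 1 m is an assumed reference distance.
    """
    x = 10.0 * np.log10(np.asarray(distance_m, dtype=float) / d0_m)
    y = np.asarray(path_loss_db, dtype=float)
    design = np.column_stack([np.ones_like(x), x])  # columns: intercept, slope
    (pl0_db, exponent), *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - (pl0_db + exponent * x)
    # Two fitted parameters, hence ddof=2 for an unbiased variance estimate.
    sigma_db = residuals.std(ddof=2)
    return pl0_db, exponent, sigma_db, residuals
```

With real measurements, the residuals would be passed to a normality test and inspected against distance before the fitted exponent is trusted.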
Due to the limitations mentioned above, several previous works have collected datasets to model the radio frequency features of LoRaWAN networks and improve PL predictions. For instance, in [20], the authors collected 665 samples in the city of Beirut, Lebanon, logging timestamp, distance, frequency (868 MHz), RSSI, signal-to-noise ratio (SNR), GW coordinates, and spreading factor (SF), with a fixed bandwidth (BW) of 125 kHz and a fixed payload of 37 bytes. They enhanced an LDPLM by adding a new feature based on the EN antenna height. After fitting the corresponding model, they found a PL standard deviation of 7.2 dB.
Another approach can be found in [21], wherein the authors collected some operational aspects of LoRaWAN in Brno, Czech Republic. Regarding PL modelling, the authors collected data for two months, logging timestamp, RSSI, SNR, GW coordinates, EN coordinates, time on air (ToA), frequency (868 MHz), SF, payload size, and frame count. They found that the RSSI fluctuated by up to 50 dB and concluded that conventional propagation models may lead to significant inaccuracies in PL prediction.
Furthermore, in [22], the authors deployed nine GWs in central London, UK, and collected timestamp, frequency (868 MHz), RSSI, SNR, SF, and payload size. However, this dataset does not include the distance between the GW and the ENs, so it is mainly oriented to optimizing network parameters and cannot be used for PL modelling.
Moreover, in [23], the authors provide a dataset for localization and tracking purposes based on fingerprinting techniques, for which the base station positions are not needed. This dataset provides the RSSI of 68 base stations, timestamp, SF, and EN coordinates, collected over three months in the city center of Antwerp, Belgium. As an application, the authors demonstrate fingerprint-based localization using the k-nearest neighbors (KNN) algorithm [24], achieving a mean error of
. In that way, this dataset is unsuitable for PL modelling purposes.
In addition, because our dataset is mainly intended for path loss modeling, we reviewed the most recent path loss modeling approaches for LoRaWAN to further highlight our dataset's contribution. For instance, Anzum et al. [25] proposed a LoRaWAN path loss model to characterize the attenuation in oil palm crops by using an LDPLM mainly based on the distance between the ENs and the GW and the number of canopies and trunks along the communication path. Alobaidy et al. [26] fitted a semiempirical machine-learning-based path loss model for LoRaWAN links, combining the Friis model with a stepwise multiple linear regression that depends on the frequency, bandwidth, antenna heights, spreading factor, and distance. Batalha et al. [27] performed a LoRaWAN measurement campaign in a suburban environment and fitted close-in and floating-intercept LDPLMs that depend on the distance between the ENs and the GW; then, they compared their performance against the Okumura–Hata model. Bianco et al. collected path loss measurements in a mountain environment, fitted an LDPLM by using the distance as a predictor variable, and used it in tracking and rescue applications. Callebaut et al. [28] evaluated coverage and path loss in urban, forest, and coastal environments and fitted a two-slope LDPLM to assess the protocol's reliability in each scenario. Finally, El Chall et al. [20] proposed different LDPLMs for indoor, campus, and city environments. The contributions regarding datasets and path loss models are summarized in Table 1.
As presented in Table 1, the measurement campaigns and path loss models for LoRaWAN are mainly based on distance, frequency, and antenna heights. However, previous studies have shown that PL variability is also accentuated by changes in environmental variables such as temperature [30], relative humidity [31], barometric pressure [32], and particulate matter [33]. These effects have not been measured in the available LoRaWAN datasets. To fill this gap, this paper provides a comprehensive LoRaWAN measurement campaign carried out in an urban environment in Medellín, Colombia, for four months. Our measurement setup includes one GW and four fixed ENs from 2
to 8
. The dataset has up to 930,000 observations, including geometric conditions (distance and antenna heights), link budget features (transmitter powers, antenna gains, cable and connector attenuations, carrier frequency, SF, and frame length), propagation variables (RSSI, SNR, ToA, effective signal power (ESP), noise power (P_n), and consumed energy), and environmental variables (temperature, relative humidity, barometric pressure, and particulate matter). The main contribution of this dataset is the inclusion of the environmental variables, which allow designers to fit more accurate path loss and shadowing models that depend on weather variations.
The rest of this paper is organized as follows. Section 2 briefly introduces the main features of the LoRaWAN protocol. Section 3 specifies the logged fields in the dataset. Section 4 shows the experimental setup from the ENs' construction to the database logging. Section 5 shows a possible application of PL modelling using the dataset, including a lognormal combined path loss and shadowing (CPLS) model and an environment-based CPLS model that improves the prediction errors and increases the correlation factor. Finally, Section 6 presents the conclusions.
3. Data Description
The given dataset contains a comma-separated values file with the measurements of four ENs and one GW in Medellín, Colombia. The database includes 930,753 observations from October 2021 to March 2022, with a mean sample time of 60 s. According to the regulations of the ISM bands for US915, the maximum transmission time is 400 ms [
38]. In that way, we transmitted up to 242 bytes with SF = 7, 125 bytes with SF = 8, 53 bytes with SF = 9, and 11 bytes with SF = 10. These frame sizes and SFs guarantee that the transmission time is less than 400 ms (
https://www.thethingsnetwork.org/airtime-calculator, accessed on 14 December 2020). Furthermore, because each node transmitted data every 60 s and the maximum transmission time was 400 ms, we obtained a duty cycle of approximately 0.67% (400 ms/60 s), a low value that is recommended for fair use of the spectrum (obtained with SF = 7 and a frame size of 242 bytes).
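The airtime and duty-cycle reasoning above can be checked with the time-on-air expression published in the Semtech SX127x datasheets. The sketch below is hedged: it assumes that the quoted frame sizes are application payloads, that 13 bytes of LoRaWAN MAC overhead are added (MHDR, FHDR without FOpts, FPort, and MIC), a coding rate of 4/5, an 8-symbol preamble, an explicit header, and CRC enabled. The function and constant names are hypothetical.

```python
import math

LORAWAN_OVERHEAD_BYTES = 13  # assumption: MHDR + FHDR (no FOpts) + FPort + MIC


def lora_time_on_air_s(phy_payload_bytes, sf, bw_hz=125_000, coding_rate=1,
                       preamble_symbols=8, explicit_header=True, crc=True,
                       low_data_rate_opt=False):
    """LoRa time on air in seconds (Semtech SX127x datasheet expression).

    coding_rate=1 means 4/5; low_data_rate_opt is normally off for SF 7 to 10
    at 125 kHz.
    """
    t_sym = (2 ** sf) / bw_hz
    ih = 0 if explicit_header else 1
    de = 1 if low_data_rate_opt else 0
    numerator = 8 * phy_payload_bytes - 4 * sf + 28 + 16 * int(crc) - 20 * ih
    n_payload = 8 + max(
        math.ceil(numerator / (4 * (sf - 2 * de))) * (coding_rate + 4), 0)
    return (preamble_symbols + 4.25 + n_payload) * t_sym


def duty_cycle(time_on_air_s, period_s):
    """Fraction of time the channel is occupied by one node."""
    return time_on_air_s / period_s


# Frame sizes per SF quoted in the text (application payload bytes).
for sf, app_bytes in {7: 242, 8: 125, 9: 53, 10: 11}.items():
    toa = lora_time_on_air_s(app_bytes + LORAWAN_OVERHEAD_BYTES, sf)
    print(f"SF{sf}: ToA = {toa * 1e3:.1f} ms, "
          f"duty cycle = {duty_cycle(toa, 60.0):.3%}")
```

Under these assumptions, every SF and frame-size pair stays just below the 400 ms limit, which is consistent with the frame sizes chosen in the text.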
The fields in the dataset are described as follows.
index: Sequential number that identifies the corresponding observation.
timestamp: Date and time mark of the current observation. It is in format yyyy-mm-dd hh:mm:ss.
device_id: String that identifies the EN's name of the current measurement. The corresponding names can be EN1, EN2, EN3, and EN4.
distance: Distance between the GW and the corresponding EN, in meters.
ht: Antenna height of the corresponding EN, in meters.
hr: Antenna height of the GW, in meters. Because the GW was installed in a static position, this height is fixed.
ptx: Transmitter (EN) radiated power in dBm. It was fixed to 20 dBm.
ltx: Transmitter (EN) losses associated with cables and connectors, in dB.
gtx: Transmitter (EN) antenna gain (characterized with a vector network analyzer), in dBi.
lrx: Receiver (GW) losses associated with cables and connectors, in dB. The measured attenuation was 4.25 dB.
grx: Receiver (GW) antenna gain (characterized with a vector network analyzer), in dBi. The measured gain was 4.161 dBi.
frequency: Carrier frequency, in . The experiments were performed in the US902-928 ISM band.
frame_length: Number of bytes of the current transmission’s payload.
temperature: Temperature of the environment, in °C.
rh: Relative humidity of the environment, in %.
bp: Barometric pressure of the environment, in hPa.
pm2_5: Particulate matter PM2.5 of the environment, in µg/m³.
rssi: Received signal strength indicator at the GW, in dBm.
snr: Signal-to-noise ratio in dB.
toa: Time on air, in seconds.
experimental_pl: Experimental path loss (in dB) calculated by .
energy: Consumed energy of the current transmission, in Joules.
esp: Effective signal power of the current transmission, in dBm.
pn: Noise power, in dBm.
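To show how the link budget fields relate to experimental_pl, the following sketch applies a standard link budget balance to one record. This is an assumption, not necessarily the exact expression used to build the dataset: the path loss is taken as the transmitted power plus antenna gains, minus cable and connector losses, minus the RSSI. The helper names are hypothetical; the dictionary keys follow the field names listed above.

```python
def link_budget_path_loss_db(ptx_dbm, ltx_db, gtx_dbi, grx_dbi, lrx_db, rssi_dbm):
    """Path loss in dB from a standard link budget balance (assumed form)."""
    return ptx_dbm - ltx_db + gtx_dbi + grx_dbi - lrx_db - rssi_dbm


def path_loss_from_row(row):
    """Compute the path loss for one record, e.g., a csv.DictReader row."""
    return link_budget_path_loss_db(
        float(row["ptx"]), float(row["ltx"]), float(row["gtx"]),
        float(row["grx"]), float(row["lrx"]), float(row["rssi"]))


# ptx, lrx, and grx use values stated in the field descriptions;
# ltx, gtx, and rssi are made-up illustrative values.
example_row = {"ptx": "20", "ltx": "0.5", "gtx": "2.0",
               "grx": "4.161", "lrx": "4.25", "rssi": "-100"}
example_pl_db = path_loss_from_row(example_row)  # about 121.4 dB
```

Comparing this value with the experimental_pl column is a quick way to confirm which power reference (RSSI or ESP) a given record uses.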
A statistical description of the numerical dataset fields is shown in Table 3. In addition, the empirical distributions of the most representative variables are depicted in Figure 2. These descriptions help us understand how the data are distributed. For instance, Figure 2a shows that the SFs are uniformly distributed from 7 to 10 for ENs 1, 2, and 4; however, EN3 used only SF = 10 because its SNR distribution exhibited a mean of −15 dB (Figure 2j), which means that SFs of 7 to 9 are not large enough to demodulate the received signals (Table 2). To guarantee a uniform distribution of the SF, we disabled the ADR scheme and controlled the SF manually. The carrier frequencies used are also uniformly distributed overall (Figure 2c). Regarding the environmental variables, the weather conditions correspond to a tropical climate. For instance, temperatures ranged from 13.9 °C to 35.1 °C and were concentrated between 20 and 30 °C (Figure 2d). Furthermore, relative humidity was concentrated at high values, which is common behavior in a tropical environment (Figure 2e). Moreover, particulate matter was concentrated at low values for EN4 because it is located inside a campus surrounded by a forest; however, there are two peaks at 28 and 50 µg/m³, which are caused by a rock mine near the campus (Figure 2g). In addition, the distribution of the experimental path loss is Gaussian-bell-shaped, as expected [12] (Figure 2h). Finally, the distributions of consumed energy for ENs 1, 2, and 4 are similar; nevertheless, EN3 has its energy concentrated around 0.1 J, which was caused by the fixed SF = 10 that guaranteed that the received signal could be demodulated.
Regarding the packet delivery rate (PDR) of each EN, we obtained 95.1%, 85.2%, 81.6%, and 86.35% for EN1, EN2, EN3, and EN4, respectively. These PDRs can be explained by the SNRs obtained for each EN, as depicted in Figure 2j. According to
Table 2, varying the SF allows the signal power level to fall below the noise power level up to
dB. Furthermore, as we will see in
Section 4.4, we distributed the SF uniformly for each EN from 7 to 10. In that way, we obtained PDRs according to the SNR of each EN. For instance, EN1 achieved SNRs over 0 dB, so a PDR of 95.1% is expected because most packets were delivered successfully. On the other hand, EN3 achieved the lowest PDR (81.6%) because its mean SNR is approximately −15 dB, so using a low SF can cause packet loss.
The dataset also includes the effective signal power (ESP) metric, which is defined as the signal power at the receiver without including the noise power (Equation (3)), and the noise power P_n (Equation (4)) [39]:
The ESP and P_n are relevant metrics by which to evaluate the quality of LoRaWAN radio links instead of the RSSI and receiver sensitivity (traditionally,
) because successful demodulation is achieved when
. Thus, the
ESP and P_n empirical distributions are depicted in Figure 3, where it can be noticed that the ESP is always below P_n, which shows that LoRaWAN can withstand very adverse channel conditions.
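The relationship between RSSI, SNR, ESP, and P_n can be sketched as follows. This is a minimal sketch that uses a formulation common in the LoRa literature, under the assumption that the RSSI measures the total received power (signal plus noise) and that the SNR is the ratio of signal power to noise power; it may differ in form from Equations (3) and (4). The function names are hypothetical.

```python
import math


def esp_dbm(rssi_dbm, snr_db):
    """Effective signal power: received power with the noise share removed."""
    return rssi_dbm + snr_db - 10.0 * math.log10(1.0 + 10.0 ** (snr_db / 10.0))


def noise_power_dbm(rssi_dbm, snr_db):
    """Noise power under the same assumptions, so that ESP - P_n = SNR."""
    return rssi_dbm - 10.0 * math.log10(1.0 + 10.0 ** (snr_db / 10.0))


# Illustrative values only: a weak link with RSSI = -110 dBm and SNR = -15 dB.
esp = esp_dbm(-110.0, -15.0)         # about -125.1 dBm, below the noise power
pn = noise_power_dbm(-110.0, -15.0)  # about -110.1 dBm
```

For negative SNRs the RSSI is dominated by noise, so the ESP falls well below it, which is why the ESP is a more informative link-quality metric than the RSSI in this regime.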