**1. Introduction**

The Internet of Things (IoT) interconnects and embeds objects, machines and devices, forming a highly distributed network of devices communicating with humans and other devices [1]. Recent application areas progressing within the IoT sector include smart cities, agriculture, buildings, healthcare, and shopping [2–4]. This paper proposes an open-source [5] and low-cost Long-Range Wide-Area Network (LoRaWAN) solution for strawberry-plant monitoring.

Conducting data-mining on raw IoT data will help to reduce the cost of powering sensors, the amount of packet transmission, latency, and response delay [6]. Moreover, the discovery of information from raw data improves system performance. Data-mining generates knowledge models from data received to support decision-making. Common methods include data compression, data-mining, and data reduction [6].

Recent research introduces a range of methods, including compression [7] and reconstruction [8], aggregation [9], redundancy removal, and reduction of the number of sensors [6] using time-discounted histogram encoding. To replace multiple sensors that send appliance energy usage in households, smartphone data is used as the only data source to estimate user activities [10]. A lightweight monitoring framework has been developed to cope with limited processing capabilities; it adapts the amount of data disseminated through the network over time [11]. Another framework transmits updates when sensor readings are detected to be unusual and trigger dissemination [12], adapting the sensing intensity and dynamically adjusting the volume of data sent.

Reducing the amount of data to be analyzed in IoT systems can be done either offline or online. Offline analysis collects as much data as possible during trials, analyzes it offline, and discovers patterns. During real-time operation, data is then checked against the learned patterns by lightweight data analysis programs, for example signature-based network-intrusion-detection systems [13]. Online analysis performs data reduction in real time by calculating the difference from normal behavior, for example anomaly-based network-intrusion-detection systems [11,12,14]. The offline approach might outperform the online approach in finding previously trained events or situations, but the online approach would be better if unknown situations happen during real-time operation. This paper focuses on the offline approach. Here, feature selection serves not only as a pre-processing technique for data-mining but also determines the sources of data to be collected in deployment and operation.

The rest of this paper is structured as follows: Section 2 showcases Related Work, Section 3 shows the motivation and analysis behind the Data reduction in IoT monitoring, Section 4 presents the Problem and System Architecture, including the design and implementation of the proposed LoRaWAN-based IoT system for strawberry-plant monitoring, Section 5 shows the experimental results and analysis including experimental set-up and sensors calibration, traffic analysis, data visualization, feature selection and evaluation, and example in decision support, and Section 6 provides conclusions with any future research directions.

### **2. Related Work**

### *2.1. Usage of Sensors in Agriculture*

The usage of sensors and actuators has been replacing the traditional human-intensive ways of monitoring in agriculture [15]. Sensors can measure environmental parameters and convert them into meaningful signals [16], for example, water resource monitoring for irrigation [17]. It is reported that in 2000 there were approximately 525 million farms on record across the globe, none of which was connected to the IoT. By 2025, around 600 million sensors are expected to be installed, connected and in use across the same base of 525 million farms [18]. Technological advancement and the shrinking size of devices make the deployment of sensors feasible for agricultural applications [16].

### *2.2. IoT LPWAN Communication Protocols: LoRa and LoRaWAN*

Low-power wide-area (LPWAN) communication protocols are designed for low-power consumption, suitable for applications which demand limited efforts for maintenance. One of the protocols, LoRaWAN, has been introduced by the LoRa Alliance organization as the protocol for low-power and wide-area coverage [19]. LoRaWAN, which stands for long-range wide-area network, defines the communication protocol and the system architecture for the network [20].

By definition, LoRa is the physical layer or the wireless modulation used to create long communication links. In terms of the LoRa functionality, an end-device communicates with a gateway which is employing LoRa with LoRaWAN. To be more specific, a LoRa gateway passes raw LoRaWAN packets from the end-devices to a network server [19]. Major advantages of LoRa are its low-power consumption, long-range capability, security and relatively easily expandable network. However, these advantages have their trade-offs: for example, the delay between acquiring the data, storing it in the cloud, and its final use or display [21]. Therefore, LoRa might not be the ideal choice for applications requiring immediate responses or high-resolution data.

However, for low-cost and low-power IoT systems, data transmission is constrained. The challenge is therefore how to reduce the volume of data sent from a LoRa node to a network server while still enabling data-driven decisions.

### *2.3. Feature Reduction*

To reduce the number of features to be used, the main data-mining methods include feature selection, which selects a subset of the original feature set, and feature extraction, which creates a set of new features by combining original features. The choice of features is problem-dependent, but the resulting subset should remain a faithful, perhaps simplified, representation of the original data set and preserve its intrinsic knowledge accurately. This paper focuses on feature selection.

Feature selection methods have been used to identify the set of features that yields high accuracy in detecting cyber-attacks [22]. Features were found to contribute unequally to classification accuracy in identifying attacks: some are redundant, irrelevant, or only partially relevant to the learning target, and some, such as noise, even reduce accuracy.

In addition, feature construction or feature transformation can create new features or transform existing features into a new set of features, smaller than the original set [23]. This method requires decent domain related knowledge, for example the understanding of energy usage patterns as shown in [23]. Principal component analysis (PCA) also summarizes data into fewer dimensions by projecting it onto an orthogonal basis.

Deep learning has demonstrated high performance in terms of accuracy [24]. However, in real-time IoT operation, response time is one of the key requirements, and edge devices or even gateways have limited computational resources for computationally demanding deep learning, especially in large-scale IoT systems. In addition, the results of deep learning are difficult to interpret. This makes the method especially impractical when humans are involved in analysis, monitoring, decision-making and control.

### **3. Data Reduction in IoT Monitoring**

To illustrate feature reduction, we provide a sample scenario in a plant-monitoring context. Suppose four sensors are used: temperature *w*(*t*), humidity *h*(*t*), lighting *b*(*t*) and soil moisture *s*(*t*). The decision-making for a specific action can be represented as a function *f* : ℝ<sup>4*k*</sup> → ℝ as follows:

$$d(t) = f(\mathbf{w}(t), \mathbf{h}(t), \mathbf{b}(t), \mathbf{s}(t), \theta) \tag{1}$$

where *d*(*t*) ∈ ℝ is the decision variable representing the action to be taken. For example, *d*(*t*) = 1 means watering and *d*(*t*) = 0 means no watering. **w**(*t*) ∈ ℝ<sup>*k*</sup>, **h**(*t*) ∈ ℝ<sup>*k*</sup>, **b**(*t*) ∈ ℝ<sup>*k*</sup> and **s**(*t*) ∈ ℝ<sup>*k*</sup> are the data vectors of temperature, humidity, lighting and soil moisture for the last *k* samples up to time *t*. For example, **w**(*t*) = [*w*(*t* − *k* + 1), *w*(*t* − *k* + 2), ···, *w*(*t* − 1), *w*(*t*)]<sup>*T*</sup> holds the last *k* temperature samples at time *t*. *k* is referred to as the sampling window.
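As a minimal sketch of Equation (1), the following Python code builds the four windowed vectors and stacks them into the 4*k*-dimensional input of *f*. The function names, the sample readings, and the threshold rule used for *f* are hypothetical and chosen only for illustration; the paper does not specify the form of *f* or the meaning of *θ*, which is treated here as a soil-moisture threshold.

```python
import numpy as np

def window(series, t, k):
    """Return the last k samples up to and including index t, oldest first."""
    start = t - k + 1
    if start < 0 or t >= len(series):
        raise ValueError("window [t-k+1, t] falls outside the recorded series")
    return np.asarray(series[start:t + 1], dtype=float)

def stack_inputs(temp, hum, light, soil, t, k):
    """Concatenate w(t), h(t), b(t), s(t) into one vector of length 4k."""
    return np.concatenate([window(x, t, k) for x in (temp, hum, light, soil)])

def decide(x, k, theta):
    """Hypothetical f: water (1) if mean soil moisture in the window < theta.

    This rule is an assumption for illustration, not the paper's method.
    """
    soil_window = x[3 * k:4 * k]  # s(t) is the fourth block of the stacked vector
    return 1 if soil_window.mean() < theta else 0

# Made-up readings, five samples per sensor
temp = [20.1, 20.4, 20.9, 21.3, 21.0]
hum = [61.0, 60.5, 59.8, 59.1, 58.7]
light = [310.0, 325.0, 340.0, 338.0, 330.0]
soil = [0.42, 0.40, 0.35, 0.31, 0.27]

x = stack_inputs(temp, hum, light, soil, t=4, k=3)
d = decide(x, k=3, theta=0.33)
print(x.shape, d)
```

Dropping one sensor type in this layout shrinks the input from 4*k* to 3*k* values per decision, which is the reduction the research question below asks about.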

The research question is how to make the correct decision with less data. More specifically, the data reduction problem can be stated as follows: Are all four types of data needed to make the decision? Would it be possible to use just three types of data, and if so, which three should be selected?

### *3.1. Feature Selection Using Laplacian Score*

Carrying out data analysis on many features is computationally expensive, and the computational complexity increases with the number of dimensions or features. Selecting the most important features therefore becomes necessary, especially in resource-limited situations.

There is a rich range of dimensionality reduction methods. Some are suitable for classification, for example, ranking features using neighborhood component analysis, ranking features using the minimum redundancy maximum relevance algorithm, or estimating predictor importance for a classification tree. Others are suitable for regression and select the independent variables that have the strongest relation to the dependent variable, for example, ranking features using an F-test. Such methods are useful if the dependent variable is known and its data is collected. Our IoT system has a set of sensors for monitoring, but its target variable is unknown. Therefore, we need to consider feature selection for unsupervised learning, for which Laplacian scores have been used to rank features.

The Laplacian score was designed to select features in unsupervised learning [25]. Feature selection in unsupervised learning is more difficult than in supervised learning because there are no class labels to guide the search. The Laplacian score was introduced as a filter method that evaluates a feature by "its power of locality preserving", using local neighborhood relationships between data points [25].

For feature selection in supervised learning, the Laplacian score has been used in multi-label classification to measure feature relevance [26], together with manifold learning, a form of non-linear dimensionality reduction [27]. For feature selection in unsupervised learning, the Laplacian score concept has been used to produce pseudo class labels [28], in clustering [29], and to rank multi-cluster structure [30].

### *3.2. Laplacian Scores to Rank Features for Unsupervised Learning*

When the volume of data is reduced for a specific task, class labels are normally available and supervised learning applies. In many applications, however, feature reduction is needed for general use rather than for one specific task, which calls for unsupervised learning. Laplacian scores rank features, and users can select the important features from the resulting ranking [25] in situations where no class label is available.

The similarity *S*<sub>*i*,*j*</sub> is defined as:

$$S_{i,j} = \exp\left(-\left(\frac{D_{i,j}}{\delta}\right)\right) \tag{2}$$

where *δ* is a scale factor and *D*<sub>*i*,*j*</sub> is the distance between two data points *i* and *j* in a local neighborhood. The *i*th diagonal element of the degree matrix *D*<sub>*g*</sub> is defined as

$$D_{g}(i, i) = \sum_{j=1}^{n} S_{i,j} \tag{3}$$

The Laplacian matrix is defined as the difference between the degree matrix *D*<sub>*g*</sub> and the similarity matrix *S*:

$$L = D_{g} - S \tag{4}$$

The feature selection results agree with minimizing the value:

$$\frac{\sum_{i,j}(x_{ir} - x_{jr})^2 \times S_{i,j}}{Var(\mathbf{x}_r)}\tag{5}$$

where *r* indexes the *r*th feature, *x*<sub>*ir*</sub> is the *i*th observation of the *r*th feature, and **x**<sub>*r*</sub> is the vector of all observations of that feature. This means that features with large variance are preferred.
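The following NumPy sketch shows how Equations (2) to (5) might be computed. It rests on assumptions not stated in the text: the similarity is evaluated over all pairs of points rather than restricted to a local neighborhood, *D*<sub>*i*,*j*</sub> is taken as the Euclidean distance, and the variance is the population variance. The function names are illustrative. The numerator of Equation (5) is evaluated through the identity Σ<sub>*i*,*j*</sub>(*x*<sub>*ir*</sub> − *x*<sub>*jr*</sub>)² *S*<sub>*i*,*j*</sub> = 2 **x**<sub>*r*</sub><sup>*T*</sup> *L* **x**<sub>*r*</sub>, which holds because *S* is symmetric.

```python
import numpy as np

def similarity_matrix(X, delta=1.0):
    """Eq. (2): S_ij = exp(-(D_ij / delta)), with D_ij assumed Euclidean."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return np.exp(-(dist / delta))

def laplacian_matrix(S):
    """Eqs. (3)-(4): degree matrix D_g from row sums of S, then L = D_g - S."""
    Dg = np.diag(S.sum(axis=1))
    return Dg - S

def locality_ratio(X, delta=1.0):
    """Eq. (5) for every feature (column) of X; smaller values rank higher."""
    S = similarity_matrix(X, delta)
    L = laplacian_matrix(S)
    ratios = []
    for r in range(X.shape[1]):
        x = X[:, r]
        numerator = 2.0 * (x @ L @ x)  # equals sum_ij (x_ir - x_jr)^2 * S_ij
        ratios.append(numerator / np.var(x))  # population variance (assumption)
    return np.array(ratios)

# Usage with synthetic data: 50 observations of 4 sensor features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
ratios = locality_ratio(X, delta=2.0)
ranking = np.argsort(ratios)  # feature indices, most preferred first
print(ranking)
```

In practice the features would typically be rescaled to comparable ranges before the distances are computed, since otherwise the sensor with the largest numeric range dominates *D*<sub>*i*,*j*</sub>.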

In the next section, a simple IoT system is designed with four sensor measurements (temperature, humidity, lighting, soil moisture) to monitor an environment. Our planned feature reduction will then be tested on this IoT system. We will select the more important features from the aforementioned four and evaluate whether the reduced dataset can achieve performance comparable to the full dataset.

### **4. Problem Definition and System Architecture**

This section starts with the design of the IoT system architecture, followed by five building blocks and their choices of hardware/software for implementation. The gathered data of the real-world plant-monitoring IoT system is then used to test the proposed data-mining method.

### *4.1. System Architecture*

The proposed system will be able to (1) collect data from sensors to monitor agriculture-related variables; (2) transmit such data to the gateway; (3) enable the gateway to send data to the cloud server; (4) enable data to be displayed in a mobile app or a client service.

The overall system design is illustrated in Figure 1. Starting from the left, sensors/actuators for monitoring, such as temperature, humidity, light intensity and soil moisture are attached to a low-cost development platform. This platform consists of both a FRDM-K64F ARM mbed evaluation board (as the base) and a SX1272MB2xAS LoRa radio shield, to be explained later in this section. The main function of this platform is to transmit sensor data to a gateway. This cluster of physically connected devices is named "LoRa Node" in this paper. The LoRa Node sits next to the test site, for example, a plant.

The LoRa Node transmits data to a Gateway using LoRa wireless communication, which will be explained in Section 4.4. The Gateway is responsible for establishing an IP connection with an IoT Cloud Server and sending data to it. The Cloud Server sends the data and its visualization to the end-user(s) through web and mobile dashboards. The following sections will explain the main building blocks in detail.

**Figure 1.** Overview of IoT system for strawberry-plant monitoring using LoRaWAN.

### *4.2. IoT Platform Development*

### 4.2.1. Sensors

There is a rich range of sensors available in the market. The sensors chosen here are examples.

### Soil Moisture Sensor

A soil moisture sensor detects the moisture of soil based on a soil resistance measurement: the sensor output value decreases as soil moisture decreases. The output signal from the sensor is an analog value [31]. Its measurements can be converted to a specific unit (e.g., voltage) using the 16-bit ADC of the FRDM-K64F ARM mbed board. The soil resistance measurement corresponds to a soil moisture excitation in the range of 0 to 5 volts. For instance, the voltage can be calculated from the analog value as:

$$\text{moistureVoltage} = \text{moistureAnalog} \times (5.0/65536.0) \tag{6}$$

### Temperature and Humidity Sensor

The chosen temperature and humidity sensor provides both measurements as a pre-calibrated digital output, using a negative temperature coefficient thermistor for temperature and a capacitive sensor element for humidity [32]. Its detailed characteristics are listed in Table 1. The sensor switches from low-power mode to active mode once the MCU sends a trigger signal. The MCU then reads back 40 bits of data, consisting of 16-bit humidity data, 16-bit temperature data and an 8-bit checksum.
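To make the 40-bit frame layout concrete, the sketch below decodes five bytes into humidity and temperature. Only the field widths and their order come from the text. The scaling (raw value divided by 10), the sign bit in the temperature field, and the checksum rule (low 8 bits of the sum of the first four bytes) are assumptions that follow the convention of DHT22-style sensors; they should be checked against the datasheet of the sensor actually used. The function name is illustrative.

```python
def decode_frame(frame: bytes):
    """Decode a 40-bit frame: 16-bit humidity, 16-bit temperature, 8-bit checksum.

    Scaling, sign handling and checksum rule are DHT22-style assumptions.
    Returns (temperature_celsius, relative_humidity_percent).
    """
    if len(frame) != 5:
        raise ValueError("expected 40 bits (5 bytes)")

    hum_raw = (frame[0] << 8) | frame[1]
    temp_raw = (frame[2] << 8) | frame[3]
    checksum = frame[4]

    # Assumed checksum: low byte of the sum of the four data bytes
    if (frame[0] + frame[1] + frame[2] + frame[3]) & 0xFF != checksum:
        raise ValueError("checksum mismatch")

    humidity = hum_raw / 10.0
    magnitude = (temp_raw & 0x7FFF) / 10.0
    # Assumed sign convention: top bit set means a negative temperature
    temperature = -magnitude if temp_raw & 0x8000 else magnitude
    return temperature, humidity

# Usage with a hand-built frame: 0x028C = 652, 0x015F = 351, checksum 0xEE
print(decode_frame(bytes([0x02, 0x8C, 0x01, 0x5F, 0xEE])))
```

Rejecting frames with a failed checksum at the node avoids transmitting corrupted readings over the constrained LoRa link.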


**Table 1.** Temperature and Humidity Sensor Main Characteristics.

### Light-Intensity Sensor

A light-intensity sensor measures the intensity of light based on the resistance value of a photo-resistor (for the device chosen, a GL5528 photo-resistor (Seeed, Shenzhen, China)). In particular, the resistance of a photo-resistor increases when the intensity of light decreases. The output signal is an analog value [33]. The measurements can be converted to a specific unit (e.g., voltage) using the 16-bit ADC of the FRDM-K64F ARM mbed board. For example, the following calculation:

$$\text{lightVoltage} = \text{lightAnalog} \times (5.0/65536.0) \tag{7}$$

can be considered to be a 0 to 5 volts light-intensity excitation.
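Equations (6) and (7) apply the same scaling to the soil moisture and light readings. A minimal Python sketch of that conversion is given below; the function and constant names are illustrative, and the raw input values are made up. The 5.0 V span and the 65536 divisor are taken from the equations as written.

```python
V_SPAN = 5.0          # voltage span used in Eqs. (6) and (7)
ADC_STEPS = 65536.0   # 2**16 steps for a 16-bit ADC reading

def adc_to_voltage(raw):
    """Convert a raw 16-bit ADC reading to volts: raw * (5.0 / 65536.0)."""
    if not 0 <= raw < 65536:
        raise ValueError("raw reading must fit in 16 bits")
    return raw * (V_SPAN / ADC_STEPS)

# Made-up raw readings for illustration
moisture_voltage = adc_to_voltage(32768)  # Eq. (6)
light_voltage = adc_to_voltage(16384)     # Eq. (7)
print(moisture_voltage, light_voltage)
```

Performing this conversion on the server side rather than on the node would let the node transmit the 2-byte raw integer instead of a floating-point value, at the cost of the server needing to know the scaling.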

### 4.2.2. LoRa Node Platform

As shown in Figure 1, a development platform hosts the sensors, and a transceiver sends their data to a gateway.
