Automatic Cleaning of Time Series Data in Rural Internet of Things Ecosystems That Use Nomadic Gateways
Abstract
1. Introduction
2. Identification of Anomalies in Soil Data
2.1. Descriptive Approach
2.2. Comprehensive Approach
- Soil temperature (T): Solar heat is gradually accumulated in the soil from sunrise and radiated out after sunset. In consequence, the signal slowly increases until sunset and decreases afterwards, thus exhibiting a strong daily trend and seasonality.
- Soil moisture (M): Soil moisture measured at a depth of about 0.5 m exhibits significantly extended seasonality intervals, whose analysis may even require decades [16]. In consequence, both the trend and seasonality of the signal are hard to capture. Moreover, heavy rainfall combined with soil/terrain conditions and the location of the sensor may result in temporary flooding of the measurement probe. Hence, over a period of several days, when a single sensor makes its daily measurements, one can only expect slow changes in the signal with varying random trends.
- Soil acidity/alkalinity (pH): The signal does not change much from month to month or even year to year [17]. This is because soil solids dissolve very slowly in the soil solution and gradually supplement the microelements that are crucial for vegetation. Consequently, pH should not be considered seasonal, as it may show at most some barely noticeable changes, possibly correlated with changes in soil moisture M.
- Solar irradiance (PV): The cell produces a nominal maximum voltage, which rises/drops logarithmically from/to a near-zero value for solar irradiance above/below a certain minimal threshold, while remaining high and almost unchanged at higher irradiance levels [18]. In consequence, the signal should exhibit no trend and strong daily seasonality (sunrise/sunset cycles), but possibly with a significant residual component depending on the actual charge level of the power supply battery.
2.2.1. Missing or Misplaced Samples
2.2.2. Erratic Samples
2.2.3. Change Points
2.2.4. Temporary Deviations
2.2.5. Irregular Fluctuations
3. Time Series Data Cleaning
3.1. Anomaly Model
3.2. Anomaly Detection
3.2.1. Power Gaps
3.2.2. Absolute Errors
3.2.3. Peaks
3.2.4. Jumps
3.2.5. Bumps
- A sufficiently small difference between the values of the first and last samples of the fragment containing the bump:
- A sufficiently small difference between the value of the first sample in the fragment containing the bump and the signal mean value in the fragment on the left side of the bump:
- A sufficiently small difference between the right boundary value of the bump and the signal mean value in the fragment on the right side of the bump:
- A sufficiently large difference between the signal mean value in the fragment on the left side of the bump and the signal mean value in the fragment containing the bump:
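The four conditions above can be illustrated with a minimal sketch. The formulas and threshold symbols are not available in this text, so the function name, the fixed-width side fragments, and the parameters `eps_edge`, `eps_side`, and `min_height` are assumptions made for illustration only and are not the authors' definitions.

```python
from statistics import mean


def looks_like_bump(signal, start, end, side, eps_edge, eps_side, min_height):
    """Check the four bump conditions for the fragment signal[start:end].

    `side` is the (assumed) number of samples taken on each side of the
    fragment; all three thresholds are hypothetical tuning parameters.
    """
    bump = signal[start:end]
    left = signal[max(0, start - side):start]
    right = signal[end:end + side]
    if len(bump) < 2 or not left or not right:
        return False  # not enough context to evaluate the conditions
    left_mean, right_mean, bump_mean = mean(left), mean(right), mean(bump)
    return (
        # 1. first and last samples of the fragment are close to each other
        abs(bump[0] - bump[-1]) <= eps_edge
        # 2. first sample is close to the mean on the left side
        and abs(bump[0] - left_mean) <= eps_side
        # 3. right boundary value is close to the mean on the right side
        and abs(bump[-1] - right_mean) <= eps_side
        # 4. the fragment mean differs enough from the left-side mean
        and abs(left_mean - bump_mean) >= min_height
    )
```

With `signal = [10]*5 + [10, 12, 13, 12, 10] + [10]*5`, the call `looks_like_bump(signal, 5, 10, 5, 0.5, 0.5, 1.0)` accepts the middle fragment, while a flat signal of the same length is rejected by the fourth condition.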
3.2.6. Instabilities
3.3. Cleaning Operators
1. Samples misplaced by power outages may have correct values, so they have to be marked as “shifted”. This may be implemented, for example, by inverting the sign bit of each marked sample value. Finding them requires evaluating Formula (1) for each pair of consecutive front slopes of the PV signal and, if needed, adding the missing number of “empty” samples to each of the signals T, M, and pH. The end device could try to determine the locations of the missing samples by examining disturbances in the trends of other predictable signals, e.g., signal T with its stable daily periodicity discussed in Section 2, but for some implementations of the end device this may still be too power-costly. In our current implementation of rural IoT, we skipped that step and found that the fusion of series from multiple sensors performed in the cloud gave better results. “Empty” and “shifted” values of minute samples may be considered “misleading” data when the sensor merges them into hourly (median or average) samples; they are left for further resolution in the cloud, where they can be handled more accurately than on a local end device, without adding any extra bandwidth load.
2. After detecting power gaps and marking samples of each of the four signals as “shifted” or “empty”, the end device continues detecting anomalies only in signals T, M, and pH. This is because variability in the PV signal, as argued in Section 2, is caused by charging of the device’s battery; it shows no anomalies worth analyzing and correcting, except for proper handling of power gaps that affect the other three. The next step in Figure 9 is therefore the detection of “absolute errors”, i.e., minute samples whose values are outside the allowed ranges specified in Table 1. Out-of-range values of minute samples are not taken into account when merging minute samples into hourly ones; therefore, they are labeled as “error” samples. Similarly, previously inserted “empty” samples are disregarded in the merging process. Note that marking “error” samples does not affect “shifted” samples with correct values. During fusion later in the cloud, the “shifted and error” samples may eventually be properly time-stamped and assigned a correct value.
3. The next step is the detection of abrupt changes, i.e., “peaks” and “jumps”. This order follows from the physics analyzed in Section 2, according to which all changes in our soil signals should be smooth and gentle. Detection of abrupt changes indicates the occurrence of anomalies in the measurement process itself; thus, signal values in any fragment identified as anomalous are in error. The respective samples are replaced by values interpolated from their neighbors not marked as “empty”, “shifted”, or “error”.
4. After “peaks” and “jumps”, less abrupt signal changes such as “bumps” are handled. As discussed earlier, these anomalies are related to the occurrence of a local maximum in a relatively larger portion of samples. If needed, the “bump” fragment is flattened by recalculating its values from the average values of samples on both its left and right sides. As before, neighbor samples marked as “empty”, “shifted”, or “error” are not taken into account.
5. Finally, “instabilities” are detected and samples from their anomalous fragments are replaced with the signal trend samples calculated as a daily moving average.
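Two of the steps above, labeling of absolute errors and interpolation over abrupt changes, can be sketched as follows. This is only an illustration under stated assumptions: labels are kept as Python sets rather than encoded in the sign bit, the allowed ranges are copied from Table 1, linear interpolation is assumed, and the function names are hypothetical.

```python
# Allowed ranges for minute samples, copied from Table 1.
RANGES = {"T": (0.0, 40.0), "M": (10.0, 80.0), "pH": (3.0, 9.0)}


def mark_absolute_errors(values, labels, signal):
    """Return new label sets with 'error' added to out-of-range samples.

    `labels[i]` is a set drawn from {'empty', 'shifted', 'error'};
    'empty' samples carry no value and are skipped.
    """
    lo, hi = RANGES[signal]
    marked = []
    for value, label in zip(values, labels):
        label = set(label)  # copy, so the caller's labels stay untouched
        if "empty" not in label and not (lo <= value <= hi):
            label.add("error")  # 'shifted' is kept alongside 'error'
        marked.append(label)
    return marked


def interpolate_fragment(values, labels, anomalous):
    """Replace samples at the indices in `anomalous` (peak or jump fragment)
    by linear interpolation between the nearest unlabeled neighbors."""
    cleaned = list(values)
    usable = [i for i, label in enumerate(labels)
              if not label and i not in anomalous]
    for i in sorted(anomalous):
        left = max((j for j in usable if j < i), default=None)
        right = min((j for j in usable if j > i), default=None)
        if left is None and right is None:
            continue  # nothing to interpolate from
        if left is None:
            cleaned[i] = values[right]
        elif right is None:
            cleaned[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            cleaned[i] = values[left] + weight * (values[right] - values[left])
    return cleaned
```

For example, `interpolate_fragment([20.0, 21.0, 30.0, 23.0, 24.0], [set()] * 5, {2})` replaces the one-sample peak with 22.0, the midpoint of its two unlabeled neighbors.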
1. The portion may contain correct (unlabeled) and “error” samples. If at least half of them are correct, the aggregated hourly sample is calculated as their average or median; otherwise, it is labeled as “error”. Note that any other combination of unlabeled and labeled samples in the portion is not possible.
2. The portion may contain “shifted” samples, of which some may be marked additionally as “error”. If at least half of the samples are “shifted” but correct, the aggregated hourly sample is calculated as their average or median; otherwise, it is labeled as “shifted and error”. Note that the absolute values of “shifted” samples are considered correct and are needed later on for data fusion in the cloud.
3. If the portion contains at least half of samples marked as “empty”, the aggregated hourly sample is also marked as “empty”; otherwise, the aggregated hourly sample is either calculated as the average or median of the complement samples or marked as “error”, depending on whether the rest of the portion is marked only as “shifted” or as “shifted and error”.
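The three aggregation cases above can be sketched in a single function. This is one possible reading of the rules, not the authors' implementation: the median is used as the aggregate, labels are represented as sets, and keeping the “shifted” label on a successfully aggregated hourly sample is an assumption of this sketch.

```python
from statistics import median


def aggregate_hour(samples):
    """Merge one hour of minute samples into a (value, label) pair.

    `samples` is a list of (value, labels) tuples, where `labels` is a set
    drawn from {'empty', 'shifted', 'error'}. The returned label is None,
    'shifted', 'error', 'shifted and error', or 'empty'.
    """
    n = len(samples)
    n_empty = sum(1 for _, labels in samples if "empty" in labels)
    if 2 * n_empty >= n:
        return None, "empty"  # case 3: at least half of the portion is empty

    rest = [(v, labels) for v, labels in samples if "empty" not in labels]
    good = [v for v, labels in rest if "error" not in labels]
    shifted = any("shifted" in labels for _, labels in rest)
    kept_label = "shifted" if shifted else None  # assumption of this sketch

    if n_empty:
        # case 3, fewer than half empty: aggregate only if the complement
        # carries no 'error' labels, otherwise give up on this hour
        if len(good) == len(rest):
            return median(good), kept_label
        return None, "error"

    # cases 1 and 2: at least half of the portion must be usable
    if 2 * len(good) >= n:
        return median(good), kept_label
    return None, ("shifted and error" if shifted else "error")
```

A portion such as `[(20, set()), (21, set()), (22, set()), (99, {"error"})]` aggregates to the median of its three correct samples, while a portion in which half of the samples are “empty” is returned as “empty” without a value.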
4. Improvement of the Anomaly Model
4.1. Synthetic Data Generation
- For a given device, gaps are inserted at the same positions for all signals T, M, pH, and PV;
- The positions of anomalies other than gaps are not synchronized among signals;
- A maximum of one jump per day is inserted; its edge is selected at random;
- A jump edge is placed randomly within any fragment and the durable change point samples are continued until the sunset sample or inserted before the edge starting from the sunrise sample;
- Bumps and jumps do not overlap;
- There is no significant difference between the average value of samples before and after a bump; i.e., bumps are not injected on the steep slopes of time series;
- Some minimum distance between an instability and a jump or bump is preserved;
- Peaks do not overlap with other anomalies;
- There is no significant difference between the average value of samples before and after a peak; i.e., peaks are not injected on the steep slopes of time series;
- Peaks and instabilities are not adjacent to gaps; i.e., there are some samples before and after a peak or instability.
Algorithm 1. Injecting anomalies into a reference time series.
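The listing of Algorithm 1 is not reproduced in this text. The following minimal sketch only illustrates two of the constraints listed above (gaps synchronized across all signals of a device, and one jump per day with a randomly chosen edge); it is not the authors' algorithm. The function names, the use of `None` for missing samples, and the equal probability of modifying the segment before or after the edge are assumptions.

```python
import random


def inject_gaps(signals, positions, width):
    """Blank the same sample positions in every signal of one device.

    `signals` maps a signal name (T, M, pH, PV) to a list of samples;
    missing samples are represented by None (an assumption of this sketch).
    """
    result = {}
    for name, series in signals.items():
        series = list(series)
        for start in positions:
            for i in range(start, min(start + width, len(series))):
                series[i] = None
        result[name] = series
    return result


def inject_jump(series, sunrise, sunset, height, rng):
    """Insert a single jump into the day fragment series[sunrise:sunset + 1].

    The edge is drawn at random inside the fragment. The shifted segment
    either continues from the edge until the sunset sample or runs from the
    sunrise sample up to the edge. A negative `height` gives a jump down.
    """
    series = list(series)
    edge = rng.randrange(sunrise + 1, sunset)
    after = rng.random() < 0.5
    span = range(edge, sunset + 1) if after else range(sunrise, edge)
    for i in span:
        if series[i] is not None:
            series[i] += height
    return series, edge, after
```

Passing an explicit generator such as `random.Random(0)` as `rng` keeps the injected positions reproducible, which is convenient when the same synthetic series has to be regenerated during parameter optimization.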
4.2. Parameter Optimization
5. Experimental Results
5.1. Time Series Distance Metrics
5.2. Quality Assessment of Cleaned Data
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
FDR | False Discovery Rate |
LoRaWAN | Long-Range Wide-Area Network |
LiPo | Lithium Polymer |
M | Moisture |
MCU | Microcontroller Unit |
pH | Potential of Hydrogen |
PV | Photovoltaic |
RAT | Radio Access Technology |
SA | Simulated Annealing |
T | Temperature |
UAV | Unmanned Aerial Vehicle |
References
1. Han, H.; Liu, Z.; Li, J.; Zeng, Z. Challenges in remote sensing based climate and crop monitoring: Navigating the complexities using AI. J. Cloud Comput. 2024, 13, 34.
2. Ghosh, A.M.; Grolinger, K. Edge-Cloud Computing for Internet of Things Data Analytics: Embedding Intelligence in the Edge With Deep Learning. IEEE Trans. Ind. Inform. 2021, 17, 2191–2200.
3. Gkonis, P.; Giannopoulos, A.; Trakadas, P.; Masip-Bruin, X.; D’Andria, F. A Survey on IoT-Edge-Cloud Continuum Systems: Status, Challenges, Use Cases, and Open Issues. Future Internet 2023, 15, 383.
4. Zinnari, F.; Coral, G.; Tanelli, M.; Cazzulani, G.; Baldi, A.; Mariani, U.; Mezzanzanica, D. A Multivariate Time-Series Segmentation Framework for Flight Condition Recognition. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 2451–2463.
5. Karniadakis, G.E.; Kevrekidis, I.G.; Lu, L.; Perdikaris, P.; Wang, S.; Yang, L. Physics-informed machine learning. Nat. Rev. Phys. 2021, 3, 422–440.
6. Influxdata. The Leading Platform for Time Series Data. Available online: https://www.influxdata.com (accessed on 9 November 2024).
7. Dembski, J.; Kołakowska, A.; Wiszniewski, B. Rural IoT Soil Data. IEEE DataPort. 2024. Available online: https://ieee-dataport.org/documents/rural-iot-soil-data (accessed on 30 December 2024).
8. Tsay, R.S. Analysis of Financial Time Series, 3rd ed.; Wiley: Hoboken, NJ, USA, 2010.
9. Cook, A.A.; Misirli, G.; Fan, Z. Anomaly Detection for IoT Time-Series Data: A Survey. IEEE Internet Things J. 2020, 7, 6481–6494.
10. Salles, R.; Belloze, K.; Porto, F.; Gonzalez, P.H.; Ogasawara, E. Nonstationary time series transformation methods: An experimental review. Knowl.-Based Syst. 2019, 164, 274–291.
11. Zhang, L.; Zhu, Y.; Gao, Y.; Lin, J. Robust Time Series Chain Discovery with Incremental Nearest Neighbors. In Proceedings of the 2022 IEEE International Conference on Data Mining (ICDM), Orlando, FL, USA, 28 November–1 December 2022; pp. 1311–1316.
12. van den Burg, G.J.J.; Williams, C.K.I. An Evaluation of Change Point Detection Algorithms. arXiv 2022, arXiv:2003.06222.
13. Lazar, A.; Jin, L.; Spurlock, C.A.; Wu, K.; Sim, A. Data quality challenges with missing values and mixed types in joint sequence analysis. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017; pp. 2620–2627.
14. Kołakowska, A.; Godlewska, M. Analysis of Factors Influencing the Prices of Tourist Offers. Appl. Sci. 2022, 12, 12938.
15. Yin, C.; Zhang, S.; Wang, J.; Xiong, N.N. Anomaly Detection Based on Convolutional Recurrent Autoencoder for IoT Time Series. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 112–122.
16. Qin, T.; Feng, J.; Zhang, X.; Li, C.; Fan, J.; Zhang, C.; Dong, B.; Wang, H.; Yan, D. Continued decline of global soil moisture content, with obvious soil stratification and regional difference. Sci. Total Environ. 2023, 864, 160982.
17. Zhang, H. Soil pH and Buffer Index. Oklahoma Cooperative Extension Service PSS-2229. 2017. Available online: https://extension.okstate.edu/fact-sheets/soil-ph-and-buffer-index.html (accessed on 30 December 2024).
18. Chegaar, M.; Hamzaoui, A.; Namoda, A.; Petit, P.; Aillerie, M.; Herguth, A. Effect of Illumination Intensity on Solar Cells Parameters. Energy Procedia 2013, 36, 722–729.
19. GreenCast. Soil Moisture/Temperature Maps. Available online: https://www.greencastonline.com (accessed on 9 November 2024).
20. Ditzler, C.; Scheffe, K.; Monger, H. Soil Survey Manual. Handbook 18, Soil Science Division, United States Department of Agriculture. 2017. Available online: https://www.nrcs.usda.gov/sites/default/files/2022-09/The-Soil-Survey-Manual.pdf (accessed on 30 December 2024).
21. Yang, X.; Song, Z.; King, I.; Xu, Z. A Survey on Deep Semi-Supervised Learning. IEEE Trans. Knowl. Data Eng. 2023, 35, 8934–8954.
22. Gou, J.; Yu, B.; Maybank, S.; Tao, D. Knowledge Distillation: A Survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
23. Górecki, T.; Piasecki, P. A Comprehensive Comparison of Distance Measures for Time Series Classification. In Stochastic Models, Statistics and Their Applications; Springer International Publishing: Cham, Switzerland, 2019; pp. 409–428.
24. Wang, X.; Ding, H.; Trajcevski, G.; Scheuermann, P.; Keogh, E.J. Experimental Comparison of Representation Methods and Distance Measures for Time Series Data. Data Min. Knowl. Discov. 2013, 26, 275–309.
25. Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B (Methodol.) 1995, 57, 289–300.
26. Christ, M.; Kempa-Liehr, A.W.; Feindt, M. Distributed and parallel time series feature extraction for industrial big data applications. arXiv 2017, arXiv:1610.07717.
27. Christ, M.; Braun, N.; Neuffer, J.; Kempa-Liehr, A.W. Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh—A Python package). Neurocomputing 2018, 307, 72–77.
28. Krawczyk, H.; Wiszniewski, B. Collaborative Learning as a Service—A blueprint for a cloud based rural IoTs deployment facility. In Proceedings of the 15th International Conference on Parallel Processing & Applied Mathematics (PPAM 2024), Ostrava, Czech Republic, 8–11 September 2024; in print.
Signal | Unit | Range | Physical Quantity Measured | Change | Seasonality |
---|---|---|---|---|---|
Temperature (T) | °C | [0, 40] | Resistance of a thermistor placed in the ground (approx. 0.5 m) | mild trend | daily |
Moisture (M) | % | [10, 80] | Capacity of the capacitor in the form of a printed circuit board placed in the ground (approx. 0.2 m) | slow trend | non-daily |
Acidity–alkalinity (pH) | — | [3.0, 9.0] | Electromotive force of a cell composed of a glass indicator electrode and a reference electrode placed in the ground | almost constant | no periodic fluctuations |
Solar irradiation (PV) | V | [0.0, 6.6] | Open-circuit voltage of the PV cell | rapid | drop (rise) at sunset (sunrise) |
Anomaly | Parameter | Description |
---|---|---|
Power gap | | gap width |
Jump | | slope width |
Jump | | whether the values jump up or down |
Jump | | jump height |
Jump | | whether the modified segment is before or after the jump |
Bump | | bump width |
Bump | | bump height |
Bump | | noise vector of length |
Instability | | instability width |
Instability | | noise vector of length |
Peak | | peak width |
Peak | | maximum value of the peak |
Peak | | location of the peak maximum |
Peak | | remaining (other than the maximum) values of the peak |
Parameters | Error | Peaks (T) | Peaks (M) | Peaks (pH) | Bumps (T) | Bumps (M) | Bumps (pH) | Jumps (T) | Jumps (M) | Jumps (pH) | Instabilities (T) | Instabilities (M) | Instabilities (pH) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
initial | | 0.75 | 0.69 | 0.71 | 1.01 | 0.8 | 1.0 | 1.75 | 0.47 | 0.99 | 0.76 | 0.82 | 0.88 |
initial | | 0.69 | 0.6 | 0.61 | 1.03 | 0.79 | 0.99 | 2.52 | 1.46 | 0.97 | 0.82 | 0.86 | 0.91 |
initial | E | 0.72 | 0.65 | 0.66 | 1.02 | 0.8 | 1.0 | 2.14 | 0.97 | 0.98 | 0.79 | 0.84 | 0.9 |
optimized | | 0.62 | 0.56 | 0.47 | 0.96 | 0.91 | 1.0 | 0.92 | 0.39 | 1.0 | 0.26 | 0.38 | 0.28 |
optimized | | 0.47 | 0.4 | 0.31 | 0.97 | 0.83 | 0.99 | 0.83 | 0.42 | 1.0 | 0.16 | 0.29 | 0.25 |
optimized | E | 0.55 | 0.48 | 0.39 | 0.97 | 0.87 | 1.0 | 0.88 | 0.41 | 1.0 | 0.21 | 0.34 | 0.27 |
Parameters | Peaks | Bumps | Jumps | Instabilities | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
initial (T, M, pH) | 30 | 15 | 0.75 | 0.90 | 75 | 75 | 0.54 | 0.21 | 0.15 | 105 | 15 | 0.60 | 0.45 | 0.03 | 22.5 |
optimized (T) | 12 | 11 | 0.46 | 0.78 | 46 | 71 | 0.58 | 0.20 | 0.14 | 71 | 6 | 0.077 | 0.21 | 0.038 | 14 |
optimized (M) | 16 | 12 | 0.21 | 1.26 | 37 | 115 | 0.93 | 0.18 | 0.25 | 81 | 15 | 0.65 | 0.12 | 0.051 | 11 |
optimized (pH) | 19 | 27 | 0.18 | 1.21 | 97 | 76 | 0.50 | 0.22 | 0.15 | 90 | 14 | 0.71 | 0.36 | 0.034 | 12 |
MULTI-DAY FEATURES | |
---|---|
Feature | Signal |
series length divided by the maximum possible samples per segment | T |
standard deviation (SD) | T, M, pH |
maximum value | T, M, pH |
kurtosis | T, M, pH |
percentage of values greater than the mean value | T |
percentage of values greater than SD from the mean value | T, M, pH |
mean, SD, and max of the absolute differences between subsequent values | T, M, pH |
variation coefficient | T, pH |
relative number of changes in slope direction | M |
mean of local maxima | M |
SD of local maxima | T, M, pH |
SD of local minima | T, pH |
mean and SD of the distance between consecutive local maxima | T, M |
mean of the distance between consecutive local minima | T, M |
SD of the distance between consecutive local minima | M |
mean of the distance between local minima and the nearest subsequent maxima | M |
SD of the distance between local minima and the nearest subsequent maxima | T |
mean and SD of the distance between local maxima and the nearest subsequent minima | T |
ONE-DAY FEATURES | |
Feature | Signal [aggregation] |
SD | T, M, pH[m, max] |
maximum value | T[m], M[m], pH[m] |
relative position of the first maximum | pH[min], T[max] |
relative position of the last maximum | M[m, max], pH[max] |
relative position of the first minimum | T[max], M[m] |
relative index of time series where of the mass lies on the left | pH[min, max, m], T[min, m] |
relative number of changes in slope direction | M[min, max] |
variation coefficient | T, pH[max, m] |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Dembski, J.; Kołakowska, A.; Wiszniewski, B. Automatic Cleaning of Time Series Data in Rural Internet of Things Ecosystems That Use Nomadic Gateways. Sensors 2025, 25, 189. https://doi.org/10.3390/s25010189