#### 4.2.1. Loads

In this subset of LIT-Dataset, the electrical circuits are: (a) resistor; (b) resistor and inductor; (c) diode rectifier with a resistor; (d) diode full-wave bridge rectifier with resistor and capacitor; (e) thyristor rectifier with resistor; (f) thyristor rectifier with resistor and inductor; and (g) universal motor. The load templates were chosen according to the load profile of electrical appliances commonly found in consumer units [40], such as drill (universal motor), mobile phone charger (different types of rectifiers), fan (universal motor), hairdryer (universal motor), LED lamp (different types of rectifiers), incandescent lamp (resistor), router (different types of rectifiers), and vacuum cleaner (simplified by resistor and inductor).

The diagram of the simulated subset is shown in Figure 5, in which each block represents a different load. The switching control of each load is automated, and the trigger time can be set in advance.

**Figure 5.** General diagram of the simulated subset.

To implement each set of loads with different electrical power from circuits (a)–(f), the Power Electronics Library from MATLAB/Simulink was used, as shown in Figure 6: (a) resistor, (b) resistor and inductor, (c) diode rectifier with resistor, (d) diode bridge with resistor and capacitor, (e) thyristor rectifier with resistor, (f) thyristor rectifier with resistor and inductor.

**Figure 6.** Diagrams of simulated loads, being (**a**) resistor, (**b**) resistor and inductor, (**c**) diode rectifier with resistor, (**d**) diode bridge with resistor and capacitor, (**e**) thyristor rectifier with resistor, (**f**) thyristor rectifier with resistor and inductor.

For the implementation of the universal motor (g), a mathematical model based on [41] was used. Figure 7 shows the diagram that represents this model, in which the following parameters are included: rated power, rated terminal voltage, rated speed, armature winding inductance (*Laq*), series field winding inductance (*Lse*), rated frequency of supply voltage, armature winding resistance (*Ra*), series field winding resistance (*Rse*), rotor inertia (*J*), and the speed at which the magnetization curve data was taken (*ω<sub>mo</sub>*). To connect the mathematical model with other circuits, the generated signal was connected to a current source generator and sent to other blocks, i.e., an electrical-mathematical interface in MATLAB-Simulink.

**Figure 7.** MATLAB-Simulink block diagram of Universal Motor Model.

## 4.2.2. Waveform Generation

To automate the waveform generation, an automatic parameter variation method was implemented. With this method, it is possible to vary: up to seven different load combinations for each waveform; up to four different values of the electrical components for each circuit; the total time of the simulation; the load combination; and the trigger time of the circuits, with three different options of switching (turn-ON and turn-OFF) angles: 0, 45, and 90 degrees. The four parameter variations, which result in four different rated power levels for each set of loads, are detailed in Table 8, where: (a) resistor; (b) resistor and inductor; (c) diode rectifier with a resistor; (d) diode full-wave bridge rectifier with resistor and capacitor; (e) thyristor rectifier with resistor; (f) thyristor rectifier with resistor and inductor; (g) universal motor.
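As an illustration of how a switching angle maps to a trigger time, the following minimal Python sketch converts the three angles into a delay after a zero crossing. It is not the Simulink automation itself: the 60 Hz mains frequency and the function names `trigger_delay_s` and `trigger_time_s` are assumptions introduced here, as this section does not state them.

```python
# Hypothetical helper names; the 60 Hz mains frequency is an assumption.
MAINS_HZ = 60.0


def trigger_delay_s(angle_deg, mains_hz=MAINS_HZ):
    """Delay after a positive-going zero crossing for a switching angle."""
    return (angle_deg / 360.0) / mains_hz


def trigger_time_s(cycle_index, angle_deg, mains_hz=MAINS_HZ):
    """Absolute trigger time when switching in a given mains cycle."""
    return cycle_index / mains_hz + trigger_delay_s(angle_deg, mains_hz)


# The three switching-angle options named in the text.
for angle in (0, 45, 90):
    print(f"{angle:>2} deg -> {trigger_delay_s(angle) * 1e3:.3f} ms after zero crossing")
```

Under the 60 Hz assumption, a 90 degree angle corresponds to a quarter of a cycle, i.e., switching at the voltage peak.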


**Table 8.** Load parameters.

#### 4.2.3. Configuration of Simulation Scenarios

To create different simulation scenarios, six configuration settings can be used, as shown in Table 9: ideal (DB-1); with stray inductance, representing the equivalent of the electrical network (DB-2); with stray inductance and harmonics (DB-3); and with stray inductance, harmonics, and additive white Gaussian noise (AWGN) at an SNR of 60 dB, 30 dB, and 10 dB (DB-4, DB-5, and DB-6). Each of the six configurations is applied to all the loads, resulting in 4824 waveforms, 804 for each configuration. It is therefore possible to evaluate the impact of harmonics and noise (at different intensities) on the performance of detection, feature extraction, and classification methods, and to compare it with the ideal scenario. This type of analysis can support the proposal of more robust and applicable methods in different NILM scenarios.


**Table 9.** Simulated subset settings.

The first scenario (DB-1) is an ideal setting, without stray inductance or harmonic content in the voltage waveform. The second (DB-2) includes stray inductance. The characteristics of the electrical network in our laboratory were used to select the values of the inductor and resistor, resulting in L = 1 μH and R = 2 mΩ. The third scenario (DB-3) includes stray inductance and a voltage source with harmonics, based on the voltage acquisition in our laboratory. The last three (DB-4 to DB-6) include a voltage source with harmonics, stray inductance, and different levels of AWGN.
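The noise levels of DB-4 to DB-6 can be illustrated with a short Python sketch that adds white Gaussian noise to a waveform at a target SNR. This is a generic illustration, not the Simulink configuration used for the dataset: the function names, the unit-amplitude 60 Hz test sinusoid, and the choice of referring the SNR to the mean power of the clean signal are assumptions made here.

```python
import math
import random


def add_awgn(signal, snr_db, rng=None):
    """Return signal plus white Gaussian noise at the requested SNR (dB).

    Assumption: the SNR is referred to the mean power of the clean signal.
    """
    rng = rng or random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR actually obtained, from the clean and noisy signals."""
    p_signal = sum(c * c for c in clean) / len(clean)
    p_noise = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(p_signal / p_noise)


FS_HZ = 15360      # minimum sampling frequency specified for the LIT-Dataset
MAINS_HZ = 60.0    # assumed mains frequency for the test sinusoid

# One second of a unit-amplitude sinusoid as a stand-in waveform.
clean = [math.sin(2.0 * math.pi * MAINS_HZ * n / FS_HZ) for n in range(FS_HZ)]

for target in (60, 30, 10):  # the SNR levels of DB-4, DB-5, and DB-6
    noisy = add_awgn(clean, target, random.Random(0))
    print(f"target {target} dB, measured {measured_snr_db(clean, noisy):.2f} dB")
```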

## *4.3. Natural Subset*

The Natural subset of the LIT-Dataset consists of recordings in which natural load shaping occurs, in the sense that waveforms are registered in a real-world environment (residential, research lab, commercial, industrial) over longer periods of time. To precisely detect and record the load events, sensors that detect power-ON, power-OFF, and power-level changes are attached to each load. Therefore, while the aggregated current and voltage are recorded, so are the individual load events.

#### 4.3.1. Natural Subset—Data Collection Architecture and Implementation

Accurate time synchronization is an important requirement in this scenario, in which time-stamped data should be provided by distributed nodes and then correlated with a limited jitter among them. Concerning specifically the development of the Natural subset of the LIT-Dataset, an infrastructure composed of a centralized acquisition device and a large number (50+ units) of networked wireless sensors is required. These nodes are attached to each load to detect load events, such as ON-OFF transient, change of state and power variations, and send the event data to the centralized acquisition element so that they can be later consolidated and correlated with the acquired voltage and current data.

This infrastructure, from this point on referred to as Natural Subset Acquisition System (NSAS), depends on time synchronization with accuracy and precision of at least 1 ms, to facilitate the correlation between the events obtained by the distributed event detection modules and the voltage and current samples obtained by the centralized acquisition element. Additionally, considering the large number of modules to be installed and their distributed characteristic, they are required to be built with low-cost components. In this sense, even though there are several techniques and protocols that address the precise time synchronization issue, most of them rely on specialized hardware and/or software solutions, thus incurring a relatively high cost to deploy the synchronization network [42,43].

An overview of the architecture used to collect a dataset of traces with natural load shaping is presented in Figure 8. It is important to notice that the voltage and current traces for the aggregate of the loads are collected at a single point, namely at the sensors next to the fuse box. The distributed nodes only detect power events (ON, OFF, and power changes) and record the occurrence of such events locally. It is this recording that requires a millisecond timing accuracy, achieved through the synchronization mechanism implemented by the NSAS.

**Figure 8.** Overview of the architecture for collecting a natural load shaping subset.

The principle of operation of this low-cost synchronization network is to have a time base master, with a GPS-based real-time clock, periodically broadcast a two-byte synchronization packet to all nodes in the synchronization network.

To avoid delays imposed by complex packet-based protocols, an approach that implements the synchronization task right before the PHY is used. This is performed using a low-cost, byte-based RF 433 MHz transmitter-receiver pair [44], similar to the one used in [45] for an application with similar requirements. The typical reception delay for this solution is about 300 μs, which meets NSAS timing requirements of 1 ms.

Furthermore, the main contribution of this proposed architecture is its low cost (about one dollar for the receiver), in a way that its impact on the cost of the whole NSAS is minimized. The block diagram of the Natural Subset Acquisition System is presented in Figure 9.

The Synchronization Master and Acquisition Node (SMAN), on the top of the block diagram, is implemented by using a National Instruments MyRIO module [39] attached to a GPS module and the 433 MHz RF transmitter [44]. The MyRIO module is connected to the other NSAS modules via a WLAN and is programmed, via LabView, to perform the SMAN main tasks. The RF transmitter receives a digital timing synchronization signal as input and broadcasts it in the 433 MHz band at a rate of up to 2400 bps. The EDNs (Event Detection Nodes) consist of ESP32 Heltec WiFi modules, as well as 433 MHz RF receivers. The ESP32 Heltec kit is a low-cost development board, which is programmable using the Arduino IDE and corresponding libraries to perform the EDN tasks. It connects to the other NSAS modules via a WiFi-based WLAN. The RF receivers are responsible for receiving the signal that is broadcast by the RF transmitter of the SMAN. Each EDN is physically connected to a power circuit connection element (switch, outlet, etc.), so it can sample the current to detect variations that indicate a load switch event (ON, OFF, or other state change such as changing from standby to active mode).

**Figure 9.** Block diagram of the Natural Subset Acquisition System (NSAS).

The SMAN is responsible for acquiring the voltage and current samples at a frequency of 15,384 Hz, which is slightly above the minimum 15,360 Hz frequency specified for the LIT-Dataset due to the MyRIO timer configuration options. The SMAN is also responsible for collecting and storing the event data sent by the EDNs via the WLAN. The GPS module provides the SMAN with an absolute time reference every second by means of the PPS (pulse per second) signal, whose typical jitter is in the hundreds of nanoseconds. This time reference ensures that the millisecond data used by the SMAN to synchronize the EDNs is tied to an absolute reference, regardless of potential clock drift exhibited by the SMAN itself (typically 10 ppm).

Upon detection of an event on a load connected to a monitored power circuit connection element, the corresponding EDN sends the event data to the SMAN via the WLAN and waits for the event acknowledgment. If the acknowledgment times out, the event is sent again. The EDNs also communicate with the SMAN by means of "abs time req" messages, which are sent during EDN initialization. The SMAN responds with an "abs time resp" message containing the absolute time and date, with a resolution of one second, obtained from the GPS receiver. This transaction performs a relatively coarse synchronization (i.e., with an accuracy of one second) between the SMAN and the EDNs. The synchronization between EDNs and SMAN is improved to millisecond accuracy upon reception of an RF message, broadcast by the SMAN, which consists of a 16-bit synchronization code. The SMAN sends the code at a rate of 1000 bps (i.e., one bit per millisecond) on every second boundary (1 Hz). Hence, upon completion of the reception and validation of the code, every EDN shall (re)adjust the milliseconds field of its current time to 16, corresponding to the 16 ms that have passed from the latest second boundary to the end of reception of the last bit of the synchronization code.
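The two-stage synchronization described above can be sketched in Python as follows. This is only a schematic model of the timekeeping on an EDN, not the firmware running on the ESP32: the class `EdnClock` and its method names are hypothetical, and the handling of the seconds field around a second boundary is deliberately left out.

```python
SYNC_BITS = 16        # length of the synchronization code
BIT_RATE_BPS = 1000   # one bit per millisecond


def ms_at_end_of_sync(bits=SYNC_BITS, bit_rate_bps=BIT_RATE_BPS):
    """Milliseconds elapsed since the second boundary when the last bit arrives."""
    return bits * 1000 // bit_rate_bps


class EdnClock:
    """Hypothetical model of an EDN time stamp: whole seconds plus milliseconds."""

    def __init__(self):
        self.epoch_s = None  # unknown until the coarse synchronization
        self.ms = 0

    def on_abs_time_resp(self, epoch_s):
        # Coarse synchronization: absolute time with one-second resolution.
        self.epoch_s = epoch_s

    def on_tick(self):
        # Called once per (nominal) millisecond by the local timer.
        self.ms += 1
        if self.ms >= 1000:
            self.ms = 0
            if self.epoch_s is not None:
                self.epoch_s += 1

    def on_sync_code_received(self):
        # Fine synchronization: the code started at the second boundary,
        # so its last bit ends 16 ms later.
        self.ms = ms_at_end_of_sync()


clock = EdnClock()
clock.on_abs_time_resp(1_700_000_000)  # arbitrary example time stamp
for _ in range(19):                    # local timer has counted 19 ms
    clock.on_tick()
clock.on_sync_code_received()          # milliseconds field is reset to 16
print(clock.epoch_s, clock.ms)
```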

As the typical clock drift for the EDN hardware is 10 ppm, a drift of 0.5 ms would occur every 50 s; therefore the resynchronization rate of 1 Hz is, theoretically, more than sufficient to ensure that the EDNs remain synchronized with the SMAN even if 98% (49 of 50) of the RF synch messages are lost. Additionally, the typical jitter of the RF link (300 μs) is small enough not to introduce indeterminism in the millisecond value to be adjusted in the EDNs.

However, some EDNs were observed to present much higher drift rates than the typical case; in some cases, a drift of more than 1000 ppm has been observed under operating conditions, which would compromise the millisecond precision required by the system. Therefore, it is necessary to implement an extra strategy to prevent desynchronization between the several EDNs and the SMAN that compose the NSAS.
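The drift figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. The function names are introduced here for illustration only.

```python
def drift_ms(ppm, elapsed_s):
    """Accumulated clock error, in milliseconds, after elapsed_s seconds."""
    return ppm * 1e-6 * elapsed_s * 1e3


def seconds_until_drift(ppm, limit_ms):
    """Time needed for the clock error to reach limit_ms."""
    return limit_ms / (ppm * 1e-6 * 1e3)


# Typical case: 10 ppm needs about 50 s to accumulate 0.5 ms.
print(f"10 ppm reaches 0.5 ms after {seconds_until_drift(10, 0.5):.1f} s")

# Observed worst case: 1000 ppm accumulates about 1 ms within a single
# 1 s resynchronization interval, i.e., the whole timing budget.
print(f"1000 ppm drifts {drift_ms(1000, 1):.3f} ms per second")
```

At 1000 ppm the error reaches the 1 ms requirement within one resynchronization interval, which is why a single lost synch message is already enough to exceed the budget in that case.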

The drift correction strategy consists of the algorithm shown in Figure 10a. Initially, the timer tick is set to 1000 μs (1 ms), which is the default period for time-stamp updates. Upon reception of a synch word (i.e., on every second), the EDN compares the millisecond on which the synch word has been effectively received with the millisecond on which it should have been completely received (16, because of the 16-bit synch word sent at 1000 bps starting from 0 ms at the SMAN) (line 6). The more positive the difference between the former and the latter, the more this EDN's specific tick is being advanced in relation to the nominal tick frequency (1 kHz) because of its clock drift; conversely, a negative difference means that the clock drift is causing the EDN tick to be delayed. Next, the EDN timer period is proportionally adjusted (lines 9 and 10), so the next ticks can compensate for the clock drift by an increase (or decrease) of the programmed tick frequency.
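Since Figure 10a is not reproduced here, the following Python sketch shows one possible reading of the drift correction described above. The proportional rule used (scaling the tick period by the measured error over a nominal 1000 ms interval), the wrap-around handling of the milliseconds field, and all names are assumptions, not the published algorithm.

```python
NOMINAL_TICK_US = 1000.0   # default timer period (1 ms)
EXPECTED_SYNC_MS = 16      # millisecond at which the synch word should end


def signed_drift_ms(received_ms, expected_ms=EXPECTED_SYNC_MS):
    """Difference between actual and expected reception millisecond.

    Assumption: the result is wrapped into [-500, 500) so that a reading
    such as 999 is interpreted as 17 ms late ticks, not 983 ms early ones.
    """
    return (received_ms - expected_ms + 500) % 1000 - 500


def corrected_tick_us(tick_us, drift_ms):
    """Scale the timer period in proportion to the measured drift.

    A positive drift means the local ticks ran fast over the last second,
    so the period is lengthened; a negative drift shortens it.
    """
    return tick_us * (1.0 + drift_ms / 1000.0)


tick_us = NOMINAL_TICK_US
for received_ms in (18, 16, 15):  # example readings at three synch words
    drift = signed_drift_ms(received_ms)
    tick_us = corrected_tick_us(tick_us, drift)
    print(f"received at {received_ms} ms, drift {drift:+d} ms, tick {tick_us:.3f} us")
```

With this rule, a measured drift of 1 ms per second corresponds to a 1 μs change of the 1000 μs tick, i.e., 1000 ppm.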

**Figure 10.** Algorithms used in Event Detection Nodes (EDNs). (**a**) Drift correction algorithm. (**b**) Spurious synch management algorithm.

Another algorithm, shown in Figure 10b, is implemented to take into account possible spurious synchronization words that can be received due to noise at the RF link. This is a real concern, as the 433 MHz radios used for the NSAS are very susceptible to such noise, and the implemented synchronization algorithm, which is supposed to be simple and deterministic, does not make use of any software checking mechanisms to improve data reception reliability.

The spurious sync management algorithm analyzes the calculated drift obtained from the algorithm of Figure 10a. If this is the first synchronization, the calculated drift is probably correct, as there is no previous synchronization between the SMAN and this EDN. If this is not the first synchronization and the absolute calculated drift value is greater than a specified limit of 5, corresponding to 5000 ppm, a spurious sync word has likely been received at a random time, leading to a drift miscalculation, since 5000 ppm is significantly larger than the typical 10 ppm drift, or even the 1000 ppm drift occasionally detected. In this case, the EDN ignores the sync unless such an out-of-limit sync has already been received more than three times in sequence (as tested in line 6). If that happens, the first received sync was probably spurious, and thus the new sync is assumed to be the correct one.
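The decision logic described above can be sketched in Python as follows. Figure 10b is not reproduced here, so this is an interpretation rather than the published algorithm: the class `SyncFilter`, its attribute names, and the simplification of counting consecutive out-of-limit syncs without checking that they agree with each other are assumptions.

```python
DRIFT_LIMIT_MS = 5            # limit of 5, corresponding to 5000 ppm
MAX_CONSECUTIVE_REJECTS = 3   # accept once exceeded "more than three times"


class SyncFilter:
    """Hypothetical filter deciding whether a received sync word is trusted."""

    def __init__(self):
        self.synced_once = False
        self.consecutive_outliers = 0

    def accept(self, drift_ms):
        """Return True if the sync should be used to adjust the EDN clock."""
        if not self.synced_once:
            # First synchronization: there is nothing to compare against.
            self.synced_once = True
            return True
        if abs(drift_ms) <= DRIFT_LIMIT_MS:
            self.consecutive_outliers = 0
            return True
        # Out-of-limit drift: probably a spurious sync word caused by noise.
        self.consecutive_outliers += 1
        if self.consecutive_outliers > MAX_CONSECUTIVE_REJECTS:
            # The earlier reference was probably the spurious one.
            self.consecutive_outliers = 0
            return True
        return False


sync_filter = SyncFilter()
for drift in (0, 1, 40, 40, 40, 40):
    print(drift, sync_filter.accept(drift))
```

In this example the first three out-of-limit syncs are ignored and the fourth is accepted as the new reference.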

#### 4.3.2. Natural Subset—Collected Data

For the natural subset, 14 different load configurations, divided into 11 load classes (Table 10), were used, as well as their combinations. The 3-load combination lasts 30 s and contains 6 events. The 7-load combinations last 2 h and contain 20 events or more. A load configuration means either that one load has more than one state or that more than one device of the same class was used.


**Table 10.** Characteristics of the natural subset of the LIT-dataset.

#### *4.4. LIT-Dataset Integration to NILMTK*

As NILMTK uses an internal data format (NILMTK-DF), a data format conversion function must be implemented, like those already available for REDD, Smart, and UK-Dale [36]. Such a function was implemented for the LIT-Dataset; hence, its waveforms can be processed in NILMTK. Figure 11a,b present one of the LIT-Dataset waveforms, that of an incandescent light bulb, which is also presented in Section 5.

**Figure 11.** A LIT-Dataset waveform of an incandescent lamp presented in NILMTK. (**a**) Complete waveform. (**b**) Image zoomed into power-ON event.

#### **5. Results and Analysis**

The results of data collection for each subset and the corresponding analysis are detailed as follows.
