## **1. Introduction**

Non-Intrusive Load Monitoring (NILM) techniques are under development worldwide as part of the effort to improve Electrical Energy Efficiency. To support this development, dedicated datasets have been created, particularly during the last decade. A NILM dataset consists of a collection of samples taken over time; these may include voltage, current, active power, and reactive power. Because NILM techniques concern the disaggregation of loads, the samples in a NILM dataset typically comprise aggregated current and power.

According to the International Energy Agency [1], the worldwide electricity demand is currently 29,000 TWh and will increase to 42,000 TWh in 2040, growing at about 2.1% per year. Under current Stated Policies, less than 50% of this energy will come from renewable sources, mainly solar, wind, and hydro. This scenario may be improved by reducing waste and improper use, as well as by reducing electrical energy needs, mainly through more efficient electrical devices. Hence the importance of Electrical Energy Efficiency, which aims to reduce the power and energy demands of electrical systems without affecting their functionality. Most importantly, reducing energy needs has a significant effect on the environment by reducing the world's carbon emissions.

NILM techniques are based on a centralized measurement of electrical energy consumption and, through a disaggregation process, determine the individual consumption of each electrical load. Typically, NILM uses a database of known power signatures of devices to analyze the aggregated power consumption and identify the contribution of each load. Because it requires only a centralized measurement, NILM is a low-cost, easily deployed, and flexible solution, and therefore a viable way of providing consumers with detailed information about their energy consumption [2]. NILM provides essential information for use in Smart Grids, in Energy Management Systems, and for Energy Efficiency initiatives.
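The idea of matching an aggregate measurement against a database of known signatures can be illustrated with a deliberately simplified sketch. It assumes that each load is described only by a constant steady-state active power and that the best explanation of the aggregate is the subset of loads whose powers sum closest to it. The appliance names, power values, and the `disaggregate` function are hypothetical and serve only as illustration; practical NILM methods use much richer signatures and do not rely on exhaustive search.

```python
from itertools import combinations

# Hypothetical signature database: steady-state active power per load, in watts.
SIGNATURES = {"kettle": 1800.0, "fridge": 120.0, "lamp": 60.0, "laptop": 45.0}


def disaggregate(aggregate_w, signatures=SIGNATURES):
    """Return (loads, error): the subset of loads whose summed power is
    closest to the aggregate, and the absolute mismatch in watts."""
    names = sorted(signatures)
    # Start from the empty set, i.e. "no load is on".
    best_combo, best_error = (), abs(aggregate_w)
    # Exhaustive search over all non-empty subsets (fine for a handful of loads).
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            error = abs(aggregate_w - sum(signatures[n] for n in combo))
            if error < best_error:
                best_combo, best_error = combo, error
    return set(best_combo), best_error


loads, residual = disaggregate(1925.0)
print(sorted(loads), residual)
```

The exhaustive search grows exponentially with the number of loads, which is one reason practical approaches rely on event detection and feature-based classification instead.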

The rationale for a NILM dataset is fivefold:


To support our ongoing research on a NILM solution [3], a dataset with particular requirements was needed. Since the available NILM datasets did not match these requirements, we decided to develop a new dataset [4], named after our laboratory, following an engineering development process that started with requirements elicitation. During this development, a testing jig was constructed in which up to eight loads can be individually controlled (turned on or off) while their waveforms (samples of voltage and current) are recorded in a controlled load shaping scenario; this subset was named Synthetic load shaping. Power detection devices were also built and connected to each load, in a residential or research lab environment, to provide precise event records when recording in a real (thus, uncontrolled) environment; this subset was named Natural load shaping. To these two subsets, a third one was added, consisting of Simulated loads, which makes it possible to include scenarios that are hard to obtain in the real world, such as short circuits.

NILM datasets may be classified by (1) sampling frequency, with low-frequency datasets sampling at up to 1 Hz and high-frequency datasets sampling above that [5]; (2) event awareness, with event-aware datasets registering the occurrence of each load event and event-free datasets not doing so; and (3) the presence or absence of ground-truth information, given either by indicating which loads caused each event or by registering the individual consumption of each load over time. The LIT-Dataset samples voltage and aggregated current at about 15 kHz (256 samples per 60 Hz mains cycle, i.e., 15.36 kHz), records single and multiple concurrent loads, and registers each load event to provide ground truth.
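The three classification axes, and the arithmetic behind the sampling rate, can be sketched as a small data structure. Taking 256 samples in each cycle of a 60 Hz mains gives 256 × 60 = 15,360 samples per second, which the text rounds to 15 kHz. The class name `DatasetDescriptor` and its field names are assumptions made for this illustration and are not part of the LIT-Dataset itself.

```python
from dataclasses import dataclass

LOW_FREQ_LIMIT_HZ = 1.0  # low-frequency datasets sample at up to 1 Hz [5]


@dataclass(frozen=True)
class DatasetDescriptor:
    """Hypothetical record describing a NILM dataset along the three axes."""
    name: str
    sample_rate_hz: float
    event_aware: bool
    has_ground_truth: bool

    @property
    def frequency_class(self):
        if self.sample_rate_hz <= LOW_FREQ_LIMIT_HZ:
            return "low-frequency"
        return "high-frequency"


# 256 samples per mains cycle at 60 Hz.
lit = DatasetDescriptor(
    name="LIT-Dataset",
    sample_rate_hz=256 * 60,
    event_aware=True,
    has_ground_truth=True,
)
print(lit.sample_rate_hz, lit.frequency_class)
```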

The remainder of this paper is organized as follows: Section 2 describes the publicly available datasets; Section 3 lists the requirements for the proposed dataset; Section 4 describes the three subsets that compose the LIT-Dataset (synthetic load shaping, simulated loads, and natural load shaping); Section 5 presents and analyzes the results obtained; and Section 6 presents the conclusions.

#### *Previous Research Contributions*

The LIT-Dataset has been developed as part of an ongoing research project funded by COPEL and ANEEL. Previous publications and patent applications resulting from this research project are listed in Table 1.


**Table 1.** Related publications from the same research project.

Three of the publications listed in Table 1 concern the LIT-Dataset and present preliminary results. In [9], the dataset proposal, the jig design, and initial results for the Synthetic Subset were presented, emphasizing the control mechanism for load switching and the acquisition circuit with its instrumentation. Subsequently, in [10], the initial results for the Simulated Subset were discussed, demonstrating the validation of the load models and the automated procedure for generating waveforms. Finally, in [13], the architecture for the Natural Subset was presented, with a focus on the proposed low-cost mechanism for time synchronization among nodes.

The other publications [3,6–8,11,12] detail the power signature analysis methods proposed in the same research project, using the LIT-Dataset and other recent datasets from the literature. In particular, [3] presented and validated a multi-agent architecture for event detection, feature extraction, and load classification, using different publicly available datasets. Some of the results were only possible due to the original features of the proposed LIT-Dataset. For instance, training agents in a single-load scenario and testing them in a scenario with multiple concurrent loads was only possible because the LIT-Dataset includes waveforms with single loads and with different load combinations. A sample-level comparison for event detection was likewise only feasible due to the accurate labeling of the LIT-Dataset. This precise annotation of the occurrence of each event is also essential for extracting transient features from the waveforms during the training stage and, consequently, for using the different feature extraction agents proposed in that work.

## **2. Related Work**

NILM datasets belong to the broader area of energy-related datasets and the associated means of data sensing and recording. Concerning data acquisition technologies, according to [14], there are five classes of technology employed to gather data, each with associated modeling methodologies: (1) energy consumption quantification, based on electricity meters; (2) indoor environmental measurements, based on ambient sensors, e.g., temperature, humidity, and CO2 concentration, among others; (3) occupant behavior statistics, estimated using cameras, Passive InfraRed (PIR) sensing, and similar sensors; (4) status sensors, including door and window status readers; and (5) others, combining different elements, such as Radio Frequency IDentification (RFID) or Ultra Wide Band (UWB) sensors.

Concerning NILM datasets and NILM systems, electrical energy data is usually collected directly by low-cost voltage and current sensors. In [15], AC voltage sensors, Hall-effect current sensors, and analog-to-digital converters were employed for load monitoring purposes. With respect to the communications infrastructure, [15] employed Ethernet, while [16] used a 433 MHz wireless sensor network to gather AC voltages and currents from individual devices.

A taxonomy for datasets of power consumption in buildings is presented in [17]. At the first level, datasets are classified as Appliance Level or Aggregated Level. An Appliance Level dataset contains individualized energy consumption information for every appliance, while an Aggregated Level dataset contains the aggregated power consumption data of a whole residence or building. At the second level, seven application purposes are listed: energy savings, appliance recognition, occupancy detection, preference detection, energy disaggregation, demand prediction, and anomaly detection. A survey of 32 datasets is presented, comparing their characteristics and application purposes.
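The two-level taxonomy described above maps naturally onto a pair of enumerations, as in the sketch below. The level and purpose names follow the text; the `DatasetEntry` class, the `select` helper, and the example entries (`dataset_a`, `dataset_b`) are hypothetical and do not correspond to entries of the survey in [17].

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    APPLIANCE = "appliance level"
    AGGREGATED = "aggregated level"


class Purpose(Enum):
    ENERGY_SAVINGS = "energy savings"
    APPLIANCE_RECOGNITION = "appliance recognition"
    OCCUPANCY_DETECTION = "occupancy detection"
    PREFERENCE_DETECTION = "preference detection"
    ENERGY_DISAGGREGATION = "energy disaggregation"
    DEMAND_PREDICTION = "demand prediction"
    ANOMALY_DETECTION = "anomaly detection"


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical catalogue entry: one first-level class, many purposes."""
    name: str
    level: Level
    purposes: frozenset


def select(entries, level, purpose):
    """Names of the entries that match a level and serve a given purpose."""
    return [e.name for e in entries if e.level is level and purpose in e.purposes]


# Made-up entries, used only to exercise the structure.
catalogue = [
    DatasetEntry("dataset_a", Level.AGGREGATED,
                 frozenset({Purpose.ENERGY_DISAGGREGATION, Purpose.DEMAND_PREDICTION})),
    DatasetEntry("dataset_b", Level.APPLIANCE,
                 frozenset({Purpose.APPLIANCE_RECOGNITION})),
]
matches = select(catalogue, Level.AGGREGATED, Purpose.ENERGY_DISAGGREGATION)
print(matches)
```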

In the following sections, NILM datasets described in the literature are presented in two classes: (1) low-frequency datasets (sampling frequency up to 1 Hz); (2) high-frequency datasets.
